[Docker] Allocating GPU Resources to Containers with the Docker Client and the Docker Go SDK

May 25, 2023, 9:49 PM • Go

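Setting up a deep learning environment is usually tedious, especially on a server shared by multiple users. Although conda provides virtualenv-style tooling for isolating dependency environments, that approach still cannot allocate compute resources in a unified way. With container technology, we can create a dedicated container for each user and assign compute resources to it. There are already many container-based deep learning platforms on the market, such as AiMax from 超益集伦. That product bundles a great many features, but if all you need is to use a GPU inside a container, the steps below should be enough.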

Installing Dependencies

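The `docker run --gpus` command depends on the NVIDIA Linux driver and the NVIDIA Container Toolkit. For the full details, see the official NVIDIA Container Toolkit installation guide; the rest of this section is only a translation of that guide with a few notes.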

Before installing the NVIDIA Container Toolkit, the server must meet the following prerequisites:

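GNU/Linux x86_64 with kernel version > 3.10
Docker >= 19.03 (note: this means Docker Engine, not Docker Desktop. If you want to use the toolkit on your own desktop machine, install Docker Engine, because Docker Desktop always runs inside a virtual machine)
NVIDIA GPU architecture >= Kepler (for reference, RTX 20 series cards use the Turing architecture and RTX 30 series cards use Ampere)
NVIDIA Linux drivers >= 418.81.07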
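The version requirements above can also be checked in code. The following is a minimal sketch, not part of the original article: `atLeast` and `component` are hypothetical helper names, and the version strings in `main` are illustrative inputs rather than values probed from a real system.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// atLeast reports whether the dotted version string got is greater than or
// equal to minimum, comparing numeric components left to right. Missing
// components count as 0, and a non-numeric suffix of a component (for
// example "-91-generic") is ignored.
func atLeast(got, minimum string) bool {
	g := strings.Split(got, ".")
	m := strings.Split(minimum, ".")
	for i := 0; i < len(g) || i < len(m); i++ {
		a, b := component(g, i), component(m, i)
		if a != b {
			return a > b
		}
	}
	return true
}

// component returns the leading integer of parts[i], or 0 if it is absent.
func component(parts []string, i int) int {
	if i >= len(parts) {
		return 0
	}
	s := parts[i]
	end := 0
	for end < len(s) && s[end] >= '0' && s[end] <= '9' {
		end++
	}
	n, _ := strconv.Atoi(s[:end]) // an empty prefix yields 0
	return n
}

func main() {
	// Illustrative inputs only; a real check would read these from the host.
	fmt.Println(atLeast("450.51.06", "418.81.07")) // driver requirement: true
	fmt.Println(atLeast("18.09.9", "19.03"))       // Docker requirement: false
}
```

Note that the kernel requirement is a strict "greater than", so a real check for it would need an extra inequality test on top of this helper.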

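With the prerequisites met, you can install the NVIDIA Container Toolkit on Ubuntu or Debian. For CentOS or other Linux distributions, refer to the official installation documentation. First, install Docker and enable its service: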

$ curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

Set up the package repository and the GPG key

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
      && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
      && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
            sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
            sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

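Note: if you want to install a version of the NVIDIA Container Toolkit earlier than 1.6.0, use the nvidia-docker repository instead of the libnvidia-container repositories above.
If you run into problems, refer directly to the installation guide.
Installing nvidia-docker2 should automatically pull in dependencies such as libnvidia-container-tools and libnvidia-container1; if it does not, install them manually.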

After completing the steps above, install nvidia-docker2

$ sudo apt update

$ sudo apt install -y nvidia-docker2

Restart the Docker daemon

$ sudo systemctl restart docker

You can now verify the installation by running a CUDA container.

docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi

The shell should show output similar to the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   34C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
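If you would rather check the result in a program than read the table by eye, `nvidia-smi` can also print machine-readable output with `--query-gpu=index,uuid,name --format=csv,noheader`. The sketch below only parses such text; `gpu` and `parseGPUList` are hypothetical names, and the sample string in `main` is made up for illustration rather than captured from a real machine. In practice the text could be obtained by running the command through `os/exec`.

```go
package main

import (
	"fmt"
	"strings"
)

// gpu holds the three fields requested from nvidia-smi.
type gpu struct {
	Index string
	UUID  string
	Name  string
}

// parseGPUList parses the output of
// "nvidia-smi --query-gpu=index,uuid,name --format=csv,noheader",
// which prints one "index, uuid, name" line per GPU.
func parseGPUList(out string) ([]gpu, error) {
	var gpus []gpu
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		fields := strings.SplitN(line, ",", 3)
		if len(fields) != 3 {
			return nil, fmt.Errorf("unexpected line %q", line)
		}
		gpus = append(gpus, gpu{
			Index: strings.TrimSpace(fields[0]),
			UUID:  strings.TrimSpace(fields[1]),
			Name:  strings.TrimSpace(fields[2]),
		})
	}
	return gpus, nil
}

func main() {
	// Made-up sample text standing in for real command output.
	sample := "0, GPU-00000000-0000-0000-0000-000000000000, Tesla T4\n"
	gpus, err := parseGPUList(sample)
	if err != nil {
		panic(err)
	}
	for _, g := range gpus {
		fmt.Printf("index=%s uuid=%s name=%s\n", g.Index, g.UUID, g.Name)
	}
}
```

The index or UUID recovered this way is the same identifier that the device selection options described in the next section accept.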

`--gpus` Usage

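Note: if you installed nvidia-docker2, the NVIDIA runtime was already registered with Docker during installation. If you installed nvidia-docker instead, register the runtime with Docker by following the official documentation.
If you have any questions, see the documentation this section is based on.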

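GPUs can be specified to the Docker CLI either with the `--gpus` option (available starting with Docker 19.03) or with the `NVIDIA_VISIBLE_DEVICES` environment variable. This variable controls which GPUs are accessible inside the container.

--gpus
NVIDIA_VISIBLE_DEVICES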

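Possible values            Description
0,1,2 or GPU-fef8089b      A comma-separated list of GPU UUID(s) or GPU index(es).
all                        All GPUs are accessible to the container; this is the default value in the base CUDA container images.
none                       No GPUs are accessible, but driver capabilities are enabled.
void, empty, or unset      nvidia-container-runtime will have the same behavior as runc (i.e. neither GPUs nor capabilities are exposed).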

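When specifying GPUs with the `--gpus` option, use the `device` parameter. The parameter must be wrapped in single quotes, with double quotes inside them around the devices you want enumerated to the container. For example, --gpus '"device=2,3"' enumerates GPUs 2 and 3 to the container.
When using the NVIDIA_VISIBLE_DEVICES variable, you may need to set --runtime to nvidia, unless it is already configured as the default runtime.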
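The quoting rule above is easy to get wrong when the command line is assembled by a program. The sketch below shows one way to build the value; `gpusFlagValue` is a hypothetical helper, and falling back to `all` for an empty list is a choice made for this example, not something the original article prescribes. The outer single quotes in the shell form are consumed by the shell, so only the inner double quotes belong to the value itself.

```go
package main

import (
	"fmt"
	"strings"
)

// gpusFlagValue builds the value for `docker run --gpus` from a list of GPU
// indexes or UUIDs. The embedded double quotes keep the comma-separated list
// together as a single "device" value. An empty list maps to "all" (an
// assumption of this sketch).
func gpusFlagValue(devices []string) string {
	if len(devices) == 0 {
		return "all"
	}
	return `"device=` + strings.Join(devices, ",") + `"`
}

func main() {
	args := []string{
		"run", "--rm",
		"--gpus", gpusFlagValue([]string{"2", "3"}),
		"nvidia/cuda:11.0.3-base-ubuntu20.04", "nvidia-smi",
	}
	// Printed for inspection only. Passing args to exec.Command("docker", args...)
	// bypasses the shell, so no single quotes are needed there.
	fmt.Println(strings.Join(args, " "))
}
```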

Getting GPU Information with `NVIDIA/go-nvml`

The demo code below retrieves various pieces of GPU information. For other features, refer to the official NVML and go-nvml documentation.

package main

import (
    "fmt"
    "log"

    "github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
    ret := nvml.Init()
    if ret != nvml.SUCCESS {
        log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
    }
    defer func() {
        ret := nvml.Shutdown()
        if ret != nvml.SUCCESS {
            log.Fatalf("Unable to shutdown NVML: %v", nvml.ErrorString(ret))
        }
    }()

    count, ret := nvml.DeviceGetCount()
    if ret != nvml.SUCCESS {
        log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret))
    }

    for i := 0; i < count; i++ {
        device, ret := nvml.DeviceGetHandleByIndex(i)
        if ret != nvml.SUCCESS {
            log.Fatalf("Unable to get device at index %d: %v", i, nvml.ErrorString(ret))
        }

        // Get the UUID
        uuid, ret := device.GetUUID()
        if ret != nvml.SUCCESS {
            log.Fatalf("Unable to get uuid of device at index %d: %v", i, nvml.ErrorString(ret))
        }
        fmt.Printf("GPU UUID: %v\n", uuid)

        name, ret := device.GetName()
        if ret != nvml.SUCCESS {
            log.Fatalf("Unable to get name of device at index %d: %v", i, nvml.ErrorString(ret))
        }
        fmt.Printf("GPU Name: %+v\n", name)

        memoryInfo, _ := device.GetMemoryInfo()
        fmt.Printf("Memory Info: %+v\n", memoryInfo)

        powerUsage, _ := device.GetPowerUsage()
        fmt.Printf("Power Usage: %+v\n", powerUsage)

        powerState, _ := device.GetPowerState()
        fmt.Printf("Power State: %+v\n", powerState)

        managementDefaultLimit, _ := device.GetPowerManagementDefaultLimit()
        fmt.Printf("Power Management Default Limit: %+v\n", managementDefaultLimit)

        version, _ := device.GetInforomImageVersion()
        fmt.Printf("Info Image Version: %+v\n", version)

        driverVersion, _ := nvml.SystemGetDriverVersion()
        fmt.Printf("Driver Version: %+v\n", driverVersion)

        cudaDriverVersion, _ := nvml.SystemGetCudaDriverVersion()
        fmt.Printf("CUDA Driver Version: %+v\n", cudaDriverVersion)

        // List compute (CUDA) processes; use GetGraphicsRunningProcesses for graphics contexts instead.
        computeRunningProcesses, _ := device.GetComputeRunningProcesses()
        for _, proc := range computeRunningProcesses {
            fmt.Printf("Proc: %+v\n", proc)
        }
    }

    fmt.Println()
}
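Several of the values printed by the program above are raw numbers in NVML's own units: memory sizes are in bytes, power usage is in milliwatts, and the CUDA driver version is an integer encoded as major*1000 + minor*10. The helpers below are a small sketch for turning them into readable output; the function names are hypothetical and the inputs in `main` are literals chosen to match the sample `nvidia-smi` table, not values read from a device.

```go
package main

import "fmt"

// cudaVersionString decodes the integer returned by
// nvml.SystemGetCudaDriverVersion, which NVML encodes as
// major*1000 + minor*10 (so 11000 means CUDA 11.0).
func cudaVersionString(v int) string {
	return fmt.Sprintf("%d.%d", v/1000, (v%1000)/10)
}

// milliwattsToWatts converts the value returned by device.GetPowerUsage,
// which NVML reports in milliwatts.
func milliwattsToWatts(mw uint32) float64 {
	return float64(mw) / 1000.0
}

// bytesToMiB converts the byte counts in nvml.Memory (Total, Free, Used)
// to mebibytes, the unit nvidia-smi prints.
func bytesToMiB(b uint64) uint64 {
	return b / (1024 * 1024)
}

func main() {
	// Literal inputs for illustration; in the program above they would come
	// from the corresponding NVML calls.
	fmt.Println("CUDA driver version:", cudaVersionString(11000)) // 11.0
	fmt.Printf("Power usage: %.1f W\n", milliwattsToWatts(9000))  // 9.0 W
	fmt.Println("Total memory (MiB):", bytesToMiB(15109*1024*1024)) // 15109
}
```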

Allocating GPUs to Containers with the Docker Go SDK

The first thing we need is the ContainerCreate API

// ContainerCreate creates a new container based in the given configuration.
// It can be associated with a name, but it's not mandatory.
func (cli *Client) ContainerCreate(
    ctx context.Context,
    config *container.Config,
    hostConfig *container.HostConfig,
    networkingConfig *network.NetworkingConfig,
    platform *specs.Platform,
    containerName string) (container.ContainerCreateCreatedBody, error)

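This API takes several structs that carry configuration. The one used to request GPU devices is the Resources field of the container.HostConfig struct. Its type is container.Resources, and it holds a slice of container.DeviceRequest structs, which is what the GPU device driver consumes.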

// cli.ContainerCreate expects a *container.HostConfig; the GPU request goes here:
&container.HostConfig{
    Resources: container.Resources{
        DeviceRequests: []container.DeviceRequest{
            {
                Driver:       "nvidia",
                Count:        0,
                DeviceIDs:    []string{"0"},
                Capabilities: [][]string{{"gpu"}},
                Options:      nil,
            },
        },
    },
}

Below is the definition of the container.DeviceRequest struct

// DeviceRequest represents a request for devices from a device driver.
// Used by GPU device drivers.
type DeviceRequest struct {
    Driver       string            // Name of the device driver; use "nvidia" here
    Count        int               // Number of devices to request (-1 = All)
    DeviceIDs    []string          // List of device IDs recognized by the driver; either indexes or UUIDs
    Capabilities [][]string        // An OR list of AND lists of device capabilities (e.g. "gpu")
    Options      map[string]string // Options to pass onto the device driver
}

Note: if the Count field is set, GPUs cannot also be selected through DeviceIDs. The two fields are mutually exclusive.
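One way to keep the two fields from being combined by accident is to build requests through small constructors. The sketch below uses a local `deviceRequest` type that only mirrors the fields of container.DeviceRequest, so that it compiles without the Docker module; `allGPUs`, `gpusByID`, and `validate` are hypothetical helpers, not part of the Docker SDK. In real code you would return container.DeviceRequest values instead.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// deviceRequest is a local stand-in with the same fields as
// container.DeviceRequest, so this sketch needs no Docker dependency.
type deviceRequest struct {
	Driver       string
	Count        int
	DeviceIDs    []string
	Capabilities [][]string
	Options      map[string]string
}

// allGPUs requests every GPU on the host (Count = -1, no DeviceIDs).
func allGPUs() deviceRequest {
	return deviceRequest{Driver: "nvidia", Count: -1, Capabilities: [][]string{{"gpu"}}}
}

// gpusByID requests specific GPUs by index or UUID (Count stays 0).
func gpusByID(ids ...string) deviceRequest {
	return deviceRequest{Driver: "nvidia", DeviceIDs: ids, Capabilities: [][]string{{"gpu"}}}
}

// validate rejects requests that set both Count and DeviceIDs.
func validate(r deviceRequest) error {
	if r.Count != 0 && len(r.DeviceIDs) > 0 {
		return errors.New("Count and DeviceIDs are mutually exclusive")
	}
	return nil
}

func main() {
	for _, r := range []deviceRequest{allGPUs(), gpusByID("0", "1")} {
		if err := validate(r); err != nil {
			panic(err)
		}
		b, _ := json.Marshal(r) // print the request as JSON for inspection
		fmt.Println(string(b))
	}
}
```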

Next, let's try starting a PyTorch container with the Docker Go SDK.

First, we write a test.py file that will run inside the container and check whether CUDA is available.

test.py
import torch

print("cuda.is_available:", torch.cuda.is_available())

Below is the experiment code. It starts a container named torch_test_1, runs the command python3 /workspace/test.py inside it, and then reads the output from stdout and stderr.

package main

import (
    "context"
    "fmt"
    "os"

    "github.com/docker/docker/api/types"
    "github.com/docker/docker/api/types/container"
    "github.com/docker/docker/client"
    "github.com/docker/docker/pkg/stdcopy"
)

var (
    defaultHost = "unix:///var/run/docker.sock"
)

func main() {
    ctx := context.Background()
    cli, err := client.NewClientWithOpts(client.WithHost(defaultHost), client.WithAPIVersionNegotiation())
    if err != nil {
        panic(err)
    }

    resp, err := cli.ContainerCreate(ctx,
        &container.Config{
            Image:     "pytorch/pytorch",
            Cmd:       []string{},
            OpenStdin: true,
            Volumes:   map[string]struct{}{},
            Tty:       true,
        }, &container.HostConfig{
            Binds: []string{"/home/joseph/workspace:/workspace"},
            Resources: container.Resources{DeviceRequests: []container.DeviceRequest{{
                Driver:       "nvidia",
                Count:        0,
                DeviceIDs:    []string{"0"},  // either a GPU index or a GPU UUID works here
                Capabilities: [][]string{{"gpu"}},
                Options:      nil,
            }}},
        }, nil, nil, "torch_test_1")
    if err != nil {
        panic(err)
    }

    if err := cli.ContainerStart(ctx, resp.ID, types.ContainerStartOptions{}); err != nil {
        panic(err)
    }

    fmt.Println(resp.ID)

    execConf := types.ExecConfig{
        User:         "",
        Privileged:   false,
        Tty:          false,
        AttachStdin:  false,
        AttachStderr: true,
        AttachStdout: true,
        Detach:       true,
        DetachKeys:   "ctrl-p,q",
        Env:          nil,
        WorkingDir:   "/",
        Cmd:          []string{"python3", "/workspace/test.py"},
    }
    execCreate, err := cli.ContainerExecCreate(ctx, resp.ID, execConf)
    if err != nil {
        panic(err)
    }

    response, err := cli.ContainerExecAttach(ctx, execCreate.ID, types.ExecStartCheck{})
    if err != nil {
        panic(err)
    }
    // Defer Close only after the error check, so it never runs on an invalid response.
    defer response.Close()

    // read the output
    _, _ = stdcopy.StdCopy(os.Stdout, os.Stderr, response.Reader)
}

As you can see, the program prints the container ID of the newly created container, followed by the output of the executed command.

$ go build main.go
$ sudo ./main
264535c7086391eab1d74ea48094f149ecda6d25709ac0c6c55c7693c349967b
cuda.is_available: True

Next, check the container's status with docker ps.

$ docker ps
CONTAINER ID   IMAGE             COMMAND   CREATED         STATUS             PORTS     NAMES
264535c70863   pytorch/pytorch   "bash"    2 minutes ago   Up 2 minutes                 torch_test_1

All good: the container ID matches.

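The Multi-Instance GPU (MIG) feature allows GPUs based on the NVIDIA Ampere architecture (such as the NVIDIA A100) to be securely partitioned into up to seven separate GPU instances for CUDA applications, giving multiple users their own GPU resources for optimal GPU utilization. This feature is particularly useful for workloads that do not fully saturate the GPU's compute capacity, where users may want to run different workloads in parallel to maximize utilization.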

Original: https://www.cnblogs.com/joexu01/p/16539619.html
Author: joexu01
Title: [Docker] Allocating GPU Resources to Containers with the Docker Client and the Docker Go SDK
