Using GPUs in Kubernetes (k8s)

May 23, 2023, 5:13 PM • Artificial Intelligence • 192 views

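Introduction

Kubernetes can manage AMD and NVIDIA GPUs (graphics processing units) on its nodes. At the time of writing (the cluster below runs v1.22.2), this support was described as experimental.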

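Modify the Docker configuration file

Set nvidia as the default runtime so that containers started by Docker go through nvidia-container-runtime.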

root@hello:~# cat /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "data-root": "/var/lib/docker",
  "exec-opts": ["native.cgroupdriver=systemd"],
  "registry-mirrors": [
    "https://docker.mirrors.ustc.edu.cn",
    "http://hub-mirror.c.163.com"
  ],
  "insecure-registries": ["127.0.0.1/8"],
  "max-concurrent-downloads": 10,
  "live-restore": true,
  "log-driver": "json-file",
  "log-level": "warn",
  "log-opts": {
    "max-size": "50m",
    "max-file": "1"
  },
  "storage-driver": "overlay2"
}
root@hello:~#

root@hello:~# systemctl daemon-reload

root@hello:~# systemctl restart docker
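
Before moving on, it can help to confirm that the configuration really makes nvidia the default runtime. The following is a minimal sketch, not part of the original setup: check_nvidia_runtime is a hypothetical helper name, and the inline sample stands in for the contents of /etc/docker/daemon.json.

```python
import json


def check_nvidia_runtime(config):
    """Return a list of problems found in a parsed daemon.json (empty means OK)."""
    problems = []
    if config.get("default-runtime") != "nvidia":
        problems.append('"default-runtime" is not "nvidia"')
    runtime = config.get("runtimes", {}).get("nvidia")
    if runtime is None:
        problems.append('no "nvidia" entry under "runtimes"')
    elif not runtime.get("path"):
        problems.append('"nvidia" runtime has no "path"')
    return problems


# Inline fragment for illustration; in practice, load /etc/docker/daemon.json instead.
sample = json.loads("""
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {"path": "/usr/bin/nvidia-container-runtime", "runtimeArgs": []}
  }
}
""")
print(check_nvidia_runtime(sample))  # prints []
```

An empty list means both settings the GPU setup relies on are present. The check only inspects the file contents; it does not verify that the runtime binary exists or that Docker has picked up the change.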

Add a label to the GPU node

root@hello:~# kubectl label nodes 192.168.1.56 nvidia.com/gpu.present=true

root@hello:~# kubectl get nodes -L nvidia.com/gpu.present
NAME           STATUS                     ROLES    AGE    VERSION   GPU.PRESENT
192.168.1.55   Ready,SchedulingDisabled   master   128m   v1.22.2
192.168.1.56   Ready                      node     127m   v1.22.2   true
root@hello:~#

Install Helm and the NVIDIA device plugin

root@hello:~# curl https://baltocdn.com/helm/signing.asc | sudo apt-key add -
root@hello:~# sudo apt-get install apt-transport-https --yes
root@hello:~# echo "deb https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
root@hello:~# sudo apt-get update
root@hello:~# sudo apt-get install helm

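root@hello:~# helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
root@hello:~# helm repo update
root@hello:~# helm install \
    --version=0.10.0 \
    --generate-name \
    nvdp/nvidia-device-plugin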

Check that the node exposes the NVIDIA GPU resource

root@hello:~# kubectl describe node 192.168.1.56 | grep nv
                    nvidia.com/gpu.present=true
  nvidia.com/gpu:     1
  nvidia.com/gpu:     1
  kube-system                 nvidia-device-plugin-1637728448-fgg2d         0 (0%)        0 (0%)      0 (0%)           0 (0%)         50s
  nvidia.com/gpu     0           0
root@hello:~#
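
In the output above, the two "nvidia.com/gpu: 1" lines are the node's Capacity and Allocatable entries, and the last line shows the GPUs currently requested by Pods. The same check can be scripted against the JSON form of the node object (for example from kubectl get node 192.168.1.56 -o json). This is a rough sketch: gpu_counts is a hypothetical helper, and sample_node is a heavily trimmed stand-in for the real node object.

```python
def gpu_counts(node, resource="nvidia.com/gpu"):
    """Return (capacity, allocatable) for an extended resource on a node object."""
    status = node.get("status", {})
    # The API reports resource quantities as strings, e.g. "1".
    capacity = int(status.get("capacity", {}).get(resource, "0"))
    allocatable = int(status.get("allocatable", {}).get(resource, "0"))
    return capacity, allocatable


# Trimmed stand-in for the output of `kubectl get node <name> -o json`.
sample_node = {
    "status": {
        "capacity": {"cpu": "8", "nvidia.com/gpu": "1"},
        "allocatable": {"cpu": "8", "nvidia.com/gpu": "1"},
    }
}
print(gpu_counts(sample_node))  # prints (1, 1)
```

If the device plugin is not running on the node, the resource is absent and the function returns (0, 0).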

Download the image

root@hello:~# docker pull registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
root@hello:~# docker save -o tensorflow-gpu.tar  registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
root@hello:~# docker load -i tensorflow-gpu.tar

Create a TensorFlow test Pod

root@hello:~# vim gpu-test.yaml
root@hello:~# cat gpu-test.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-gpu
  labels:
    test-gpu: "true"
spec:
  containers:
  - name: training
    image: registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu
    command:
    - python
    - tensorflow-sample-code/tfjob/docker/mnist/main.py
    - --max_steps=300
    - --data_dir=tensorflow-sample-code/data
    resources:
      limits:
        nvidia.com/gpu: 1
  tolerations:
  - effect: NoSchedule
    operator: Exists
root@hello:~#

root@hello:~# kubectl  apply -f gpu-test.yaml
pod/test-gpu created
root@hello:~#
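
Two details of the manifest are worth spelling out. The GPU is requested only under limits, which is enough because Kubernetes uses the limit as the request for extended resources when no request is given. The toleration has no key and uses operator Exists, so it tolerates every NoSchedule taint. As an illustration, the same manifest can be generated from a small script; gpu_pod_manifest is a hypothetical helper, and its label scheme simply mirrors the test-gpu example above.

```python
import json


def gpu_pod_manifest(name, image, command, gpus=1):
    """Build a Pod manifest (as a dict) that asks for `gpus` NVIDIA GPUs."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {name: "true"}},
        "spec": {
            "containers": [{
                "name": "training",
                "image": image,
                "command": command,
                # Extended resources are set under limits; the limit is
                # also used as the request when no request is given.
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
            # No key plus operator Exists: tolerate every NoSchedule taint.
            "tolerations": [{"effect": "NoSchedule", "operator": "Exists"}],
        },
    }


manifest = gpu_pod_manifest(
    "test-gpu",
    "registry.cn-beijing.aliyuncs.com/ai-samples/tensorflow:1.5.0-devel-gpu",
    ["python", "tensorflow-sample-code/tfjob/docker/mnist/main.py",
     "--max_steps=300", "--data_dir=tensorflow-sample-code/data"],
)
print(json.dumps(manifest, indent=2))
```

kubectl apply -f accepts JSON as well as YAML, so the printed output can be saved to a file and applied in the same way as gpu-test.yaml.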

View the logs

root@hello:~# kubectl logs test-gpu
WARNING:tensorflow:From tensorflow-sample-code/tfjob/docker/mnist/main.py:120: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.

Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow
into the labels input on backprop by default.

See tf.nn.softmax_cross_entropy_with_logits_v2.

2021-11-24 04:38:50.846973: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:895] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-11-24 04:38:50.847698: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1105] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:10.0
totalMemory: 14.75GiB freeMemory: 14.66GiB
2021-11-24 04:38:50.847759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4, pci bus id: 0000:00:10.0, compute capability: 7.5)
root@hello:~#
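
The "Creating TensorFlow device" line is the one that confirms the container can see the GPU. A rough sketch of pulling that information out of the log text follows. The regular expression is written against the log format shown above (TensorFlow 1.5) and is an assumption for any other version; found_gpus is a hypothetical helper name.

```python
import re

# Matches lines such as:
#   Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla T4,
#   pci bus id: 0000:00:10.0, compute capability: 7.5)
DEVICE_LINE = re.compile(
    r"Creating TensorFlow device \((/device:GPU:\d+)\) -> "
    r"\(device: \d+, name: ([^,]+), pci bus id: [^,]+, "
    r"compute capability: ([\d.]+)\)"
)


def found_gpus(log_text):
    """Return (device, name, compute_capability) tuples found in the log text."""
    return DEVICE_LINE.findall(log_text)


# One line copied from the `kubectl logs test-gpu` output above.
sample_log = (
    "2021-11-24 04:38:50.847759: I tensorflow/core/common_runtime/gpu/"
    "gpu_device.cc:1195] Creating TensorFlow device (/device:GPU:0) -> "
    "(device: 0, name: Tesla T4, pci bus id: 0000:00:10.0, "
    "compute capability: 7.5)"
)
print(found_gpus(sample_log))  # prints [('/device:GPU:0', 'Tesla T4', '7.5')]
```

An empty result means no GPU device line was logged, which usually points back to the runtime configuration or the device plugin rather than to the training script.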


Original: https://blog.csdn.net/qq_33921750/article/details/121867991
Author: 小陈运维
Title: Using GPUs in Kubernetes (k8s)
