Training with a GPU in TensorFlow

Inspecting GPU information with the nvidia-smi command:

cmd: nvidia-smi

https://www.jianshu.com/p/ceb3c020e06b

(Figure: a sample nvidia-smi output screenshot; the field descriptions below refer to it.)
  • GPU: index of the GPU on this machine (numbered from 0 when there are multiple cards); in the figure the index is 0
  • Fan: fan speed (0%-100%); N/A means there is no fan
  • Name: GPU model; in the figure it is a Tesla T4
  • Temp: GPU temperature (an overly hot GPU will lower its clock frequency)
  • Perf: performance state, from P0 (maximum performance) to P12 (minimum performance); in the figure it is P0
  • Persistence-M: persistence mode status; persistence mode consumes more power, but new GPU applications start faster; in the figure it is Off
  • Pwr: Usage/Cap: power draw, where Usage is the current consumption and Cap is the maximum
  • Bus-Id: GPU bus identifier, in the form domain:bus:device.function
  • Disp.A: Display Active, whether the GPU's display output is initialized
  • Memory-Usage: GPU memory usage
  • Volatile GPU-Util: GPU utilization
  • Uncorr. ECC: ECC-related field, whether error checking and correction is enabled (0/disabled, 1/enabled)
  • Compute M.: compute mode, 0/DEFAULT, 1/EXCLUSIVE_PROCESS, 2/PROHIBITED
  • Processes: shows each process's GPU memory usage, its process ID, and which GPU it occupies

Refresh the GPU memory status every few seconds: nvidia-smi -l <seconds>

Refresh the GPU status every two seconds: nvidia-smi -l 2
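
Besides nvidia-smi, you can also confirm from Python which GPUs TensorFlow itself can see. The following is a minimal sketch (it assumes TensorFlow 2.x); it only lists the visible devices and prints whether the installed build has CUDA support.

import tensorflow as tf

# List the physical GPUs that TensorFlow can see on this machine
gpus = tf.config.list_physical_devices('GPU')
print("Visible GPUs:", gpus)

# Check whether this TensorFlow build was compiled with CUDA support
print("Built with CUDA:", tf.test.is_built_with_cuda())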

Ways TensorFlow can use the GPU

1. Direct use

By default this approach grabs essentially all of the free GPU memory on every card in the machine, not just on the card it actually uses. Even if the program only needs a single GPU, it still reserves the memory of all the other cards and simply sits on it without using them.

import tensorflow as tf
from tensorflow import keras

with tf.compat.v1.Session() as sess:
    # Input images are 224x224x3, 20 classes
    shape, classes = (224, 224, 3), 20
    # Build Keras's ResNet50 model
    model = keras.applications.resnet50.ResNet50(input_shape=shape, weights=None, classes=classes)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    # Train the model (train_x/train_y/test_x/test_y are prepared elsewhere)
    model.fit(train_x, train_y, validation_data=(test_x, test_y), epochs=20, batch_size=6, verbose=2)
    # Save the trained model to a file
    model.save('resnet_model_dog_n_face.h5')

2. Allocating a fixed fraction

The difference from the direct approach above is that this one does not occupy all of the GPU memory. For example, written as below, the process takes 60% of the memory of each GPU.

import tensorflow as tf
from tensorflow import keras
from tensorflow.compat.v1 import ConfigProto  # TF 2.x way to import the v1 ConfigProto

config = ConfigProto()
# Let this process take at most 60% of the memory of each GPU
config.gpu_options.per_process_gpu_memory_fraction = 0.6
with tf.compat.v1.Session(config=config) as sess:
    model = keras.applications.resnet50.ResNet50(input_shape=shape, weights=None, classes=classes)  # shape/classes as above
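
If you are writing native TF 2.x code without the v1 compat Session, a similar hard cap can be expressed as a logical device configuration. The following is only a sketch under stated assumptions: a single visible GPU and an arbitrary 4096 MB limit, set before the GPU is first used.

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Cap this process at 4096 MB on the first GPU (must run before the GPU is initialized)
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])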

3. Dynamic allocation

This method requests GPU memory dynamically as it is needed; memory is only ever allocated and is not released back while the process runs. If other programs have already taken all of the remaining GPU memory, this program will fail with an allocation error.

Which of the three approaches above to use should be decided by the actual deployment scenario.

The first approach reserves all of the memory up front, so as long as the model does not exceed the GPU memory there is no memory fragmentation to hurt computing performance. It is a suitable configuration for deploying an application.

The second and third approaches suit servers shared by several people. The second wastes GPU memory, because the reserved fraction may go unused; the third avoids that waste, but makes the program much more likely to crash when a later memory request cannot be satisfied.

import tensorflow as tf

config = tf.compat.v1.ConfigProto()
# Allocate GPU memory on demand instead of reserving it all up front
config.gpu_options.allow_growth = True
session = tf.compat.v1.InteractiveSession(config=config)
with tf.compat.v1.Session(config=config) as sess:
    pass  # build, compile, and train the model here, as in the full example below
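
In native TF 2.x code the same on-demand behaviour is usually enabled per device with memory growth. A minimal sketch, assuming it runs before any tensors are placed on the GPU:

import tensorflow as tf

for gpu in tf.config.list_physical_devices('GPU'):
    # Grow memory usage as needed rather than reserving the whole card
    tf.config.experimental.set_memory_growth(gpu, True)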

4. Specifying a GPU

When running TensorFlow on a server with multiple GPUs, you can choose which GPU to use from Python, for example:

import os
# Expose only GPU 2 to TensorFlow; set this before TensorFlow initializes the GPUs
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
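
An alternative that keeps the choice inside TensorFlow itself is to restrict the visible devices. This sketch assumes the machine has at least three GPUs and that it runs before the GPUs are initialized:

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if len(gpus) > 2:
    # Make only the third physical GPU visible to this process
    tf.config.set_visible_devices(gpus[2], 'GPU')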

A complete example to round things off: ResNet50 image classification:

import tensorflow as tf
from tensorflow import keras

config = tf.compat.v1.ConfigProto()
# Allocate GPU memory on demand
config.gpu_options.allow_growth = True
session = tf.compat.v1.InteractiveSession(config=config)
with tf.compat.v1.Session(config=config) as sess:
    # Input images are 224x224x3, 20 classes
    shape, classes = (224, 224, 3), 20
    # Build Keras's ResNet50 model
    model = keras.applications.resnet50.ResNet50(input_shape=shape, weights=None, classes=classes)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

    # Train the model (train_x/train_y/test_x/test_y are prepared elsewhere)
    model.fit(train_x, train_y, validation_data=(test_x, test_y), epochs=20, batch_size=6, verbose=2)
    # Save the trained model to a file
    model.save('resnet_model_dog_n_face.h5')

Original: https://blog.csdn.net/qq_38735017/article/details/119991239
Author: 甜辣uu
Title: tensorflow使用gpu进行训练

