Out-of-memory (OOM) errors when training TensorFlow models in a for loop

May 23, 2023, 6:06 PM • Artificial Intelligence • 97 views

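I have recently been learning knowledge distillation with TensorFlow. I trained a teacher model on CIFAR-10 myself, then distilled it into three student models of different parameter counts using the same alpha and temperature. In every case the results were the opposite of those in the paper (Distilling the Knowledge in a Neural Network).
So I decided to sweep several alpha and temperature combinations in a for loop to compare the distillation results, and to plot the training curves and comparisons with matplotlib.pyplot. (I did not want to keep tuning by hand; re-running a notebook after every parameter change and saving the data each time was painful.)
That is when the bugs started: TensorFlow out-of-memory errors, problems with Process and Thread, and problems with matplotlib.pyplot. I did not expect to hit all of them in one script.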

Local environment:
OS: Windows 10
GPU: GTX 1650, 4 GB (discrete)
Python: 3.8.3
TensorFlow: 2.7.0
GPUs on the server used in the end: 4 × 2080 Ti, 12 GB each (only GPU:0 was actually used for testing)

The process and the solutions are recorded below.

The following modules, functions, and classes are needed:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
from multiprocessing import Process
from threading import Thread
import time

class Distiller(keras.Model):
    def __init__(self, student: keras.Sequential, teacher: keras.Sequential):
        super().__init__()
        self.teacher = teacher
        self.student = student

    def compile(self,
                optimizer: keras.optimizers.Optimizer,
                metrics,
                student_loss_fn,
                distillation_loss_fn,
                alpha=0.1,
                temperature=3):
        ......

    def train_step(self, data):
        x, y = data

        ......

        return results

    def test_step(self, data):
        x, y = data
        ......

        return results

def build_model(name, conv1_size, conv2_size, conv3_size, dense_size):
    model = keras.Sequential([
        ......

    ],
        name=name)

    return model
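The `alpha` and `temperature` arguments of `compile` control how the two loss terms are mixed. Because the bodies of `compile` and `train_step` are elided here, the following is only a minimal pure-Python sketch of one common formulation. It assumes the weighting used in the Keras knowledge distillation example listed in the references (`alpha` on the hard-label loss, `1 - alpha` on the softened KL term). All helper names are hypothetical, and the code actually used in this post may differ.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by the given temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_entropy(label, probs, eps=1e-12):
    """Cross-entropy of an integer class label against predicted probabilities."""
    return -math.log(probs[label] + eps)


def distillation_loss(student_logits, teacher_logits, label,
                      alpha=0.1, temperature=3):
    """Assumed form: alpha * hard loss + (1 - alpha) * soft loss."""
    hard = cross_entropy(label, softmax(student_logits))
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1 - alpha) * soft


if __name__ == "__main__":
    # Illustrative logits only, not values from the experiment.
    print(distillation_loss([1.0, 2.0, 0.5], [0.5, 3.0, 0.2], label=1))
```

A higher temperature flattens both softened distributions, so the student sees more of the teacher's relative preferences among the wrong classes. The paper additionally recommends scaling the soft term by the square of the temperature when both terms are used; that factor is left out of this sketch.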

The main function run in each loop iteration:
def main_loop(alpha, T):

    global loop_param
    loop_param = str(alpha) + '_' + str(T) + '_'

    print('\n\n' + loop_param)

    student = build_model('student', 8, 16, 16, 16)

    student_scratch = keras.models.clone_model(student)
    ......

def draw_distill():
......

def draw_scratch():
......

At first I trained and evaluated distiller and student_scratch directly inside the for loop. After a few iterations an OOM error appeared. Part of the output follows:

......

Epoch 2/20
2022-03-23 21:34:09.303394: W tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 242.0KiB (rounded to 247808)requested by op student/conv2d_2/Relu
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.

Current allocation summary follows.

......

2022-03-23 21:34:09.332826: I tensorflow/core/common_runtime/bfc_allocator.cc:1078] Sum Total of in-use chunks: 2.10GiB
2022-03-23 21:34:09.332899: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] total_region_allocated_bytes_: 2255958784 memory_limit_: 2255958836 available bytes: 52 curr_region_allocation_bytes_: 4511918080
2022-03-23 21:34:09.332977: I tensorflow/core/common_runtime/bfc_allocator.cc:1086] Stats:
Limit:                      2255958836
InUse:                      2255956224
MaxInUse:                   2255956224
NumAllocs:                    55811090
MaxAllocSize:                614400000
Reserved:                            0
PeakReserved:                        0
LargestFreeBlock:                    0
......

tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
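The allocator statistics above can be read with a little arithmetic. The sketch below only reuses numbers copied from the log (the variable names are made up for illustration) and shows that the allocator had essentially nothing left when a 242 KiB request arrived.

```python
GIB = 1024 ** 3
KIB = 1024

# Values copied from the bfc_allocator log lines above.
memory_limit = 2_255_958_836      # memory_limit_
region_allocated = 2_255_958_784  # total_region_allocated_bytes_
in_use = 2_255_956_224            # InUse
requested = 247_808               # "rounded to 247808"

available = memory_limit - region_allocated    # matches "available bytes" in the log
free_inside_region = region_allocated - in_use  # allocated to the region but not in use

print(f"limit:     {memory_limit / GIB:.2f} GiB")
print(f"requested: {requested / KIB:.1f} KiB")
print(f"available: {available} bytes")
print(f"free inside the region: {free_inside_region} bytes")
print("request fits:", requested <= available + free_inside_region)
```

The limit here is the amount TensorFlow was allowed to use on GPU:0, which is smaller than the card's 4 GB. Once earlier iterations have filled it, even a very small allocation fails.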

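So I thought of isolating each iteration by running it in its own Thread or Process. The idea was that the thread or process created for an iteration terminates when the iteration ends, so the memory held by TensorFlow would not keep growing until it hits OOM.

I then tried creating a separate thread or process for each iteration. With Thread: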

if __name__ == '__main__':

    (train_images, train_labels), (test_images, test_labels) = keras.datasets.cifar10.load_data()

    train_images, test_images = train_images / 255.0, test_images / 255.0

    teacher = build_model('teacher', 32, 64, 64, 64)

    teacher = keras.models.load_model('teacher_model')

    for alpha in (0.1, 0.2, 0.3):
        for T in range(5, 21, 5):
            print(time.strftime('%H-%M-%S: '))
            t = Thread(target=main_loop, args=(alpha, T))
            t.start()
            t.join()
            draw_distill()
            draw_scratch()

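Using Process
The code is essentially the same as above: replace Thread with Process and pass everything the function needs through the args of Process. With this approach, the training, the evaluation, and the plotting of results with plt can all go inside main_loop.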

Only the changed code is shown:
if __name__ == '__main__':

    ......

    for alpha in (0.1, 0.2, 0.3):
        for T in range(5, 21, 5):
            print(time.strftime('%H-%M-%S: '))
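            # A child process is isolated from the main process. Unlike a child thread,
            # which can use the main thread's variables directly, it must receive every
            # variable it needs as an argument to Process.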
            p = Process(target=main_loop, args=(alpha, T, teacher, train_images, train_labels, test_images, test_labels))
            p.start()

            p.join()

I also tried training on the GPU of my local machine, but it was slow and printed a great deal of per-op information. Part of the output:

 376/1563 [======>.......................] - ETA: 34s - loss: 2.1089 - accuracy: 0.20432022-03-23 11:18:02.870737: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
2022-03-23 11:18:02.871020: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:0
......

2022-03-23 11:18:02.871797: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op Identity in device /job:localhost/replica:0/task:0/device:GPU:0
2022-03-23 11:18:02.925364: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op __inference_train_function_914 in device /job:localhost/replica:0/task:0/device:GPU:0
 383/1563 [======>.......................] - ETA: 34s - loss: 2.1056 - accuracy: 0.20602022-03-23 11:18:02.931965: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
......

2022-03-23 11:18:02.971101: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op __inference_train_function_914 in device /job:localhost/replica:0/task:0/device:GPU:0
 391/1563 [======>.......................] - ETA: 33s - loss: 2.1030 - accuracy: 0.20772022-03-23 11:18:02.977031: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
......

2022-03-23 11:18:03.020212: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op __inference_train_function_914 in device /job:localhost/replica:0/task:0/device:GPU:0
 400/1563 [======>.......................] - ETA: 32s - loss: 2.0986 - accuracy: 0.20922022-03-23 11:18:03.025223: I tensorflow/core/common_runtime/eager/execute.cc:1224] Executing op ReadVariableOp in device /job:localhost/replica:0/task:0/device:GPU:0
 ......

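Just as I was finishing this post, I tried GPU training again using a method from another blog (linked in the references). Before import tensorflow as tf, set (this also needs import os):
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
It worked: most of the useless messages were no longer printed.
I also tried training on the server's GPU (I had not tested on the server earlier precisely because of all that redundant output). It was much faster than the local GPU: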

Server: GPU total train time: 62.842313051223755 s
Local: GPU total train time: 104.92916321754456 s
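The ordering matters for the log-level setting: the variable has to be in the environment before TensorFlow is imported. A small sketch of that ordering follows; the helper name `set_tf_log_level` is hypothetical, and the TensorFlow import is left as a comment so the snippet runs without it.

```python
import os
import sys
import warnings


def set_tf_log_level(level="3"):
    """Set TF_CPP_MIN_LOG_LEVEL; call this before tensorflow is imported.

    "0" shows everything, "1" hides INFO, "2" also hides WARNING,
    and "3" also hides ERROR messages from the C++ backend.
    """
    if level not in {"0", "1", "2", "3"}:
        raise ValueError(f"unsupported log level: {level!r}")
    if "tensorflow" in sys.modules:
        warnings.warn("tensorflow is already imported; the setting may be ignored")
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = level


set_tf_log_level("3")
# import tensorflow as tf  # import only after the variable has been set
```

Level "3" also hides error messages from the backend, so "1" or "2" may be a safer choice while a script is still being debugged.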

References:
https://keras.io/examples/vision/knowledge_distillation
https://www.tensorflow.org/guide/gpu?hl=zh-cn
https://blog.csdn.net/dcrmg/article/details/80029741

Original: https://blog.csdn.net/weixin_43698781/article/details/123681608
Author: alphanoblaker
Title: Out-of-memory (OOM) errors when training TensorFlow models in a for loop
