If there are any mistakes, please kindly point them out.
This post records how to use multiple GPUs (multi-GPU) to speed up training, which PyTorch calls distributed training. In PyTorch this is mainly done through DistributedDataParallel and its related functions.
Below are my notes on GPU setup; a large part is compiled from other write-ups, see the references for details.
Table of Contents
- 1. Setting visible GPUs for multi-GPU deep learning training
- 2. Setting some flags
- 3. Multi-GPU training with data-splitting DataParallel
- 4. Distributed training with DistributedDataParallel
Setting visible GPUs for multi-GPU deep learning training
import os
# Order devices by PCI bus ID so the numbering matches nvidia-smi
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0"    # expose only GPU 0
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose GPUs 0 and 1
os.environ["CUDA_VISIBLE_DEVICES"] = "1,0"  # GPU 1 becomes cuda:0, GPU 0 becomes cuda:1
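To check that the setting took effect, the visible device count can be queried after importing torch; a small sketch (the printed number depends on the machine it runs on):

```python
import os
# These must be set before torch initializes CUDA.
os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import torch
# On a machine with at least two GPUs this prints 2;
# with the "0" setting above it would print 1 instead.
print(torch.cuda.device_count())
```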
Setting some flags
torch.backends.cudnn.enabled = True
Official docs: "A bool that controls whether cuDNN is enabled."
cuDNN uses non-deterministic algorithms, and it can be disabled with cudnn.enabled = False. If set to True, cuDNN automatically searches among its non-deterministic algorithms for the one most efficient for the current configuration, improving runtime performance.
torch.backends.cudnn.benchmark = True
Official docs: "A bool that, if True, causes cuDNN to benchmark multiple convolution algorithms and select the fastest."
This pre-optimizes the model's convolution layers: for each convolution layer, cuDNN benchmarks all of its available convolution implementations and then uses the fastest one. At the cost of a little extra time at model startup, training time can drop noticeably. However, if the input shape keeps changing, the search reruns and efficiency becomes poor. cudnn.benchmark defaults to False.
So the two flags above can be used together, as follows:
torch.backends.cudnn.enabled = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = True
Official docs: "A bool that, if True, causes cuDNN to only use deterministic convolution algorithms."
With cudnn.benchmark == False and cudnn.deterministic = True, the convolution algorithm used each run is deterministic, i.e. the default algorithm. If PyTorch's random seed is additionally fixed to a constant value, the output for the same input should be identical across runs.
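Putting the seed-fixing idea above into code, a minimal reproducibility sketch (the helper name set_reproducible and the seed value 42 are just for illustration):

```python
import random
import numpy as np
import torch

def set_reproducible(seed: int = 42) -> None:
    # Fix every relevant random source so repeated runs match.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # harmless no-op on CPU-only machines
    # Deterministic cuDNN convolutions, at some cost in speed.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

set_reproducible(42)
a = torch.rand(3)
set_reproducible(42)
b = torch.rand(3)
print(torch.equal(a, b))  # True: same seed, same numbers
```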
The following sections introduce how to use distributed training:
- DataParallel: generally used for single-machine multi-GPU setups
- DistributedDataParallel: generally used for multi-machine multi-GPU setups (and on a single machine it also performs better than DataParallel)
Multi-GPU training with data-splitting DataParallel
DataParallel used to be the common choice, and it is simple to use:
if torch.cuda.is_available():
    device = torch.device('cuda')
    net = net.to(device)
    net = torch.nn.DataParallel(net)  # replicate the model across all visible GPUs
else:
    device = torch.device('cpu')
    net = net.to(device)
With these lines, the memory usage on the two GPUs ends up roughly similar.
If these lines are removed, only one GPU is used: GPU 1 is only touched once GPU 0's memory is exhausted, i.e. device 0 is filled first, then device 1.
Even so, the computation itself still runs mainly on GPU 0.
That is why true multi-GPU parallel training is needed, i.e. the distributed training introduced next.
Distributed training with DistributedDataParallel
This section mainly covers the use of PyTorch's DistributedDataParallel.
How to use pytorch-encoding:
This is an open-source GPU-balancing tool, used as follows:
from utils.encoding import DataParallelModel, DataParallelCriterion
model = DataParallelModel(model)
criterion = DataParallelCriterion(criterion)
GitHub link: https://link.zhihu.com/?target=https%3A//github.com/zhanghang1989/PyTorch-Encoding
How to use DistributedDataParallel:
- Test code from online resources:
Based on the material available online, the typical DistributedDataParallel workflow looks like this:
import torch
import os
import argparse
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from torch.nn.parallel import DistributedDataParallel
parser = argparse.ArgumentParser()
parser.add_argument('--syncBN', type=bool, default=True)
parser.add_argument('--world-size', default=2, type=int, help='number of distributed processes')
parser.add_argument('--dist-url', default='tcp://172.16.1.186:2222', type=str, help='url used to set up distributed training')
parser.add_argument('--dist-backend', default='gloo', type=str, help='distributed backend')
parser.add_argument('--dist-rank', default=0, type=int, help='rank of distributed processes')
args = parser.parse_args()
torch.distributed.init_process_group(backend=args.dist_backend,
                                     init_method=args.dist_url,
                                     world_size=args.world_size,
                                     rank=args.dist_rank)
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
device = torch.device('cuda:%d' % local_rank)
model = YourModel()
model = model.to(device)
model = DistributedDataParallel(model,
                                device_ids=[local_rank],
                                output_device=local_rank)
train_data = Dataset(root=args.root, resize=args.resize, mode='train')
train_loader = DataLoader(train_data, args.batch_size,
                          pin_memory=True, drop_last=True,
                          sampler=DistributedSampler(train_data))
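To see what the DistributedSampler in the snippet above actually does, here is a small CPU-only sketch; no process group is needed when num_replicas and rank are passed explicitly (the variable names are just for illustration):

```python
import torch
from torch.utils.data import TensorDataset
from torch.utils.data.distributed import DistributedSampler

dataset = TensorDataset(torch.arange(8))

# Each simulated process receives a disjoint, interleaved slice of the indices.
sampler_rank0 = DistributedSampler(dataset, num_replicas=2, rank=0, shuffle=False)
sampler_rank1 = DistributedSampler(dataset, num_replicas=2, rank=1, shuffle=False)

print(list(sampler_rank0))  # [0, 2, 4, 6]
print(list(sampler_rank1))  # [1, 3, 5, 7]
```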
- yolov3spp's multi-GPU code:
First, the distributed-training code from yolov3spp (this code is provided by the Bilibili content creator 劈里啪啦):
from train_utils import init_distributed_mode, torch_distributed_zero_first
from train_utils import get_coco_api_from_dataset
parser.add_argument('--world-size', default=2, type=int, help='number of distributed processes')
parser.add_argument('--dist-url', default='env://', help='url used to set up distributed training')
def init_distributed_mode(args):
    if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
        args.rank = int(os.environ["RANK"])
        args.world_size = int(os.environ['WORLD_SIZE'])
        args.gpu = int(os.environ['LOCAL_RANK'])
    elif 'SLURM_PROCID' in os.environ:
        args.rank = int(os.environ['SLURM_PROCID'])
        args.gpu = args.rank % torch.cuda.device_count()
    else:
        print('Not using distributed mode')
        args.distributed = False
        return

    args.distributed = True
    torch.cuda.set_device(args.gpu)
    args.dist_backend = 'nccl'
    print('| distributed init (rank {}): {}'.format(
        args.rank, args.dist_url), flush=True)
    torch.distributed.init_process_group(backend=args.dist_backend,
                                         init_method=args.dist_url,
                                         world_size=args.world_size,
                                         rank=args.rank)
    torch.distributed.barrier()
    setup_for_distributed(args.rank == 0)
def main(opt, hyp):
    init_distributed_mode(opt)
    ...
    device = torch.device(opt.device)
    model = Darknet(cfg).to(device)
    model.load_state_dict(ckpt["model"], strict=False)
    model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
    model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[opt.gpu])
    ...
    with torch_distributed_zero_first(opt.rank):
        train_dataset = LoadImagesAndLabels(train_path, imgsz_train, batch_size,
                                            augment=True,
                                            hyp=hyp,
                                            rect=opt.rect,
                                            cache_images=opt.cache_images,
                                            single_cls=opt.single_cls,
                                            rank=opt.rank)
        val_dataset = LoadImagesAndLabels(test_path, imgsz_test, batch_size,
                                          hyp=hyp,
                                          cache_images=opt.cache_images,
                                          single_cls=opt.single_cls,
                                          rank=opt.rank)
    train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
    val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)
    train_batch_sampler = torch.utils.data.BatchSampler(
        train_sampler, batch_size, drop_last=True)
    ...
    train_data_loader = torch.utils.data.DataLoader(
        train_dataset, batch_sampler=train_batch_sampler, num_workers=nw,
        pin_memory=True, collate_fn=train_dataset.collate_fn)
    val_data_loader = torch.utils.data.DataLoader(
        val_dataset, batch_size=batch_size,
        sampler=val_sampler, num_workers=nw,
        pin_memory=True, collate_fn=val_dataset.collate_fn)
    ...
    with torch_distributed_zero_first(opt.rank):
        if os.path.exists("tmp.pk") is False:
            coco = get_coco_api_from_dataset(val_dataset)
            with open("tmp.pk", "wb") as f:
                pickle.dump(coco, f)
        else:
            with open("tmp.pk", "rb") as f:
                coco = pickle.load(f)

    for epoch in range(start_epoch, epochs):
        train_sampler.set_epoch(epoch)  # reshuffle the sampler differently each epoch
        mloss, lr = train_util.train_one_epoch(model, optimizer, train_data_loader, ...)
        ...
        result_info = train_util.evaluate(model, val_data_loader, coco=coco, device=device)
        ...
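The torch_distributed_zero_first helper used above is imported from train_utils and not shown in the excerpt; in the YOLO codebases it is typically implemented roughly like this (a sketch, not the verbatim source):

```python
from contextlib import contextmanager
import torch.distributed as dist

@contextmanager
def torch_distributed_zero_first(rank: int):
    # All ranks except 0 wait here, so rank 0 runs the body first
    # (e.g. downloads or caches a dataset exactly once).
    if rank not in (-1, 0):
        dist.barrier()
    yield
    # Once rank 0 is done, it releases the others.
    if rank == 0:
        dist.barrier()

# With rank == -1 (non-distributed run) the context is a plain no-op:
with torch_distributed_zero_first(-1):
    data = "loaded once"
```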
- DistributedDataParallel template code:
Here I simplified the yolov3spp code into the following template:
import os
import argparse
import torch
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data.distributed import DistributedSampler
parser = argparse.ArgumentParser()
parser.add_argument('--device', default='cuda', help='device id (i.e. 0 or 0,1 or cpu)')
parser.add_argument('--world-size', default=2, type=int, help='number of distributed processes')
parser.add_argument('--dist-url', default='env://', help='url used to set up distributed training')
args = parser.parse_args()
args.gpu = int(os.environ['LOCAL_RANK'])
args.dist_backend = 'nccl'
args.world_size = int(os.environ['WORLD_SIZE'])
args.rank = int(os.environ["RANK"])
args.distributed = True
torch.cuda.set_device(args.gpu)
device = torch.device(args.device)
torch.distributed.init_process_group(backend=args.dist_backend,
                                     init_method=args.dist_url,
                                     world_size=args.world_size,
                                     rank=args.rank)
torch.distributed.barrier()  # wait until every process reaches this point
...
model = Darknet(cfg).to(device)
model.load_state_dict(ckpt["model"], strict=False)
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model).to(device)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
...
train_dataset = LoadImagesAndLabels(train_path, imgsz_train, batch_size,
                                    augment=True,
                                    hyp=hyp,
                                    rect=args.rect,
                                    cache_images=args.cache_images,
                                    single_cls=args.single_cls,
                                    rank=args.rank)
val_dataset = LoadImagesAndLabels(test_path, imgsz_test, batch_size,
                                  hyp=hyp,
                                  cache_images=args.cache_images,
                                  single_cls=args.single_cls,
                                  rank=args.rank)
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)
val_sampler = torch.utils.data.distributed.DistributedSampler(val_dataset)
train_batch_sampler = torch.utils.data.BatchSampler(
    train_sampler, batch_size, drop_last=True)
...
train_data_loader = torch.utils.data.DataLoader(
    train_dataset, batch_sampler=train_batch_sampler, num_workers=nw,
    pin_memory=True, collate_fn=train_dataset.collate_fn)
val_data_loader = torch.utils.data.DataLoader(
    val_dataset, batch_size=batch_size,
    sampler=val_sampler, num_workers=nw,
    pin_memory=True, collate_fn=val_dataset.collate_fn)
...
for epoch in range(start_epoch, epochs):
    train_sampler.set_epoch(epoch)
    mloss, lr = train_util.train_one_epoch(model, optimizer, train_data_loader, ...)
    ...
    result_info = train_util.evaluate(model, val_data_loader, coco=coco, device=device)
    ...
Note that this cannot simply be run from PyCharm; distributed training has to be launched from the command line:
python -m torch.distributed.launch --nproc_per_node=2 --use_env train_multi_GPU.py
python -m torch.distributed.launch --nproc_per_node=2 train_multi_GPU.py
The first form (--use_env) passes RANK/LOCAL_RANK/WORLD_SIZE to each process as environment variables; without it, the launcher instead passes a --local_rank argument that the script must parse itself.
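With --use_env, the launcher communicates through environment variables; inside the training script they can be read like this (the defaults shown are what a plain, non-distributed run sees):

```python
import os

# torch.distributed.launch --use_env sets these for every spawned process.
rank = int(os.environ.get("RANK", -1))              # global process index
local_rank = int(os.environ.get("LOCAL_RANK", -1))  # GPU index on this machine
world_size = int(os.environ.get("WORLD_SIZE", 1))   # total number of processes

print(rank, local_rank, world_size)
```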
Ways to increase GPU utilization
- 1) At the top of the main function (trades a little extra GPU memory for training speed):
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.enabled = True
- 2) During training, before each epoch (periodically frees the CUDA cache; the effect did not feel significant to me):
torch.cuda.empty_cache()
- 3) The dataset's __len__ method (the dataloader stalls intermittently; defining __len__ like this avoids much of it):
def __len__(self):
    return self.images.shape[0]
- 4) The dataloader's preloading setup (loads the next data while the model is training, raising GPU utilization a little):
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    ...
    pin_memory=True,
)
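A self-contained sketch of this pattern (the dataset here is random toy data; pin_memory is enabled only when CUDA is present, and non_blocking=True lets the host-to-device copy overlap with compute):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 3), torch.randint(0, 2, (64,)))
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = DataLoader(dataset,
                    batch_size=16,
                    num_workers=2,  # workers prepare upcoming batches in parallel
                    pin_memory=torch.cuda.is_available())

n_batches = 0
for x, y in loader:
    # non_blocking only helps with pinned memory + CUDA, but is harmless on CPU.
    x = x.to(device, non_blocking=True)
    y = y.to(device, non_blocking=True)
    n_batches += 1

print(n_batches)  # 4 batches of 16
```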
PS: more tips will be added here as I find them…
References:
1) Goodbye, nn.DataParallel: https://zhuanlan.zhihu.com/p/95700549
2) A summary of common PyTorch pitfalls: https://cloud.tencent.com/developer/article/1512508
3) A first look at PyTorch distributed training: https://zhuanlan.zhihu.com/p/43424629
4) A guide to multi-GPU training in PyTorch: https://www.cnblogs.com/jfdwd/p/11196439.html
5) torch.backends.cudnn.benchmark, True or False: https://zhuanlan.zhihu.com/p/333632424
Original: https://blog.csdn.net/weixin_44751294/article/details/124369828
Author: Clichong
Title: Object Detection Tricks | [Trick12] Notes on Distributed Training (Multi-GPU) and DistributedParallel
Reposted with attribution from: https://www.johngo689.com/687964/