Wenet Multi-Machine Multi-GPU Distributed Training

PyTorch Distributed Training Demo

The Wenet framework is built on PyTorch, so Wenet's multi-machine multi-GPU training relies on PyTorch's distributed training support.

The following code shows how to do distributed training with PyTorch:

import contextlib
import random
import time

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
from torch.nn.parallel import DistributedDataParallel
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset


def transform(batch):
    # Collate function used by the DataLoader below; a minimal stand-in
    # for the one defined in the full gist.
    return torch.stack([item[0] for item in batch])


def ddp_demo(rank, world_size, accum_grad=4):
    assert dist.is_gloo_available(), "Gloo is not available!"
    print(f"world_size: {world_size}, rank: {rank}, is_gloo_available: {dist.is_gloo_available()}")

    # 1. Initialize the process group
    dist.init_process_group("gloo", world_size=world_size, rank=rank)
    model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Linear(100, 20))

    # 2. Wrap the model with DistributedDataParallel
    ddp_model = DistributedDataParallel(model)

    criterion = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=1e-3)

    dataset = TensorDataset(torch.randn(1000, 10))
    # 3. Distributed data sampling (the sampler internally selects samples by rank)
    sampler = DistributedSampler(dataset=dataset, num_replicas=world_size, shuffle=True)
    dataloader = DataLoader(dataset=dataset, batch_size=24, sampler=sampler, collate_fn=transform)

    for epoch in range(1):
        for step, batch in enumerate(dataloader):
            output = ddp_model(batch)
            label = torch.rand_like(output)

            if step % accum_grad == 0:
                # Synchronize gradients across processes
                context = contextlib.nullcontext
            else:
                # 4. Gradient accumulation: do not synchronize gradients
                context = ddp_model.no_sync

            with context():
                time.sleep(random.random())
                loss = criterion(output, label)
                loss.backward()

            if step % accum_grad == 0:
                optimizer.step()
                optimizer.zero_grad()
                print(f"epoch: {epoch}, step: {step}, rank: {rank} update parameters.")

    # 5. Destroy the process group context (some global variables)
    dist.destroy_process_group()

The local environment has no NVIDIA GPU, so the gloo backend is used here in place of nccl.

Full source code: https://gist.github.com/hotbaby/15950bbb43d052cd835b0f18c997f67c

Steps to convert a model to distributed training:

  1. Initialize the process group: dist.init_process_group
  2. Wrap the model for distributed data parallelism: DistributedDataParallel(model)
  3. Distribute the data: split it into world_size shares and sample by rank with DistributedSampler(dataset=dataset, num_replicas=world_size, shuffle=True)
  4. Accumulate gradients during training to reduce how often the training processes synchronize parameters and improve communication efficiency [optional]
  5. Destroy the process group: dist.destroy_process_group()
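To actually run ddp_demo, each rank has to be started as its own process and all ranks must agree on a rendezvous address. Below is a minimal single-machine launcher sketch, assuming the default env:// rendezvous, which reads MASTER_ADDR and MASTER_PORT from the environment:

import os

import torch.multiprocessing as mp


def main():
    world_size = 4
    # init_process_group() without an explicit init_method reads these variables.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "23456"
    # Spawn one process per rank; each process calls ddp_demo(rank, world_size).
    mp.spawn(ddp_demo, args=(world_size,), nprocs=world_size)


if __name__ == "__main__":
    main()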

Wenet Distributed Training in Practice

How do you configure multi-machine multi-GPU distributed training in Wenet?

GPU machine list:

| Node | IP address | Number of GPUs |
| --- | --- | --- |
| node1 | 10.10.23.9 | 8 |
| node2 | 10.10.23.10 | 8 |

Taking the aishell dataset as an example, the training procedure for a Chinese ASR model with the Wenet framework on these GPU machines is:

  1. Environment initialization and data preparation. For environment setup, refer to the official Wenet documentation: https://github.com/wenet-e2e/wenet#installationtraining-and-developing. After extracting the aishell dataset, copy it to the /data/aishell/ directory on both node1 and node2.
  2. Configure the training scripts. node1 training script configuration: wenet/examples/aishell/s0/run.sh
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
data=/data/aishell/
num_nodes=2
node_rank=0
init_method="tcp://${node1_ip}:23456"
dist_backend="nccl"

node2 training script configuration:

wenet/examples/aishell/s0/run.sh

export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"
num_nodes=2
node_rank=1
init_method="tcp://${node1_ip}:23456"
dist_backend="nccl"
  3. Run the training scripts. Launch run.sh in the background on both node1 and node2:
export NCCL_SOCKET_IFNAME=ens1f0
nohup bash run.sh > train.log 2>&1 &

ens1f0 is the name of the network interface used for inter-node traffic; if NCCL_SOCKET_IFNAME is not set, multi-machine communication problems may occur.
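If you are not sure which interface name to use, one way to list the interfaces available on a node from Python (interface names, and therefore the NCCL_SOCKET_IFNAME value, are machine-specific):

import socket

# Print every network interface on this node; pick the one that carries
# inter-node traffic and export it as NCCL_SOCKET_IFNAME before training.
for index, name in socket.if_nameindex():
    print(index, name)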

Wenet Distributed Training Experiment Results

| GPU configuration | Training time per epoch (s) | Speedup |
| --- | --- | --- |
| Single machine, 4 GPUs | 407.17 | |
| Single machine, 8 GPUs | 204.36 | 99.24% faster than single machine with 4 GPUs |
| Multi-machine, 8 GPUs | 221.75 | 7.84% slower than single machine with 8 GPUs |
| Multi-machine, 16 GPUs | 121.7 | 67.92% faster than single machine with 8 GPUs |
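The speedup percentages follow directly from the per-epoch times; a quick sanity check:

# Relative speedup = old_time / new_time - 1, expressed as a percentage.
print(f"{407.17 / 204.36 - 1:.2%}")  # single machine, 8 vs. 4 GPUs: ~99.24%
print(f"{204.36 / 221.75 - 1:.2%}")  # multi-machine 8 vs. single-machine 8: ~ -7.84%
print(f"{204.36 / 121.70 - 1:.2%}")  # multi-machine 16 vs. single-machine 8: ~67.92%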

How is distributed training implemented in Wenet?

Similar to the DDP demo above, Wenet calls the corresponding PyTorch APIs to implement distributed training.

  1. Initialize the process group

wenet/bin/train.py

def main():
    ...

    if distributed:
        logging.info('training on multiple gpus, this gpu {}'.format(args.gpu))
        dist.init_process_group(args.dist_backend,
                                init_method=args.init_method,
                                world_size=args.world_size,
                                rank=args.rank)
    ...

Wenet source code: https://github.com/wenet-e2e/wenet/blob/main/wenet/bin/train.py#L141
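The args.world_size and args.rank values come from the run.sh configuration shown earlier: every GPU runs its own training process, and the script derives a global rank from the node rank and the local GPU index. A hedged sketch of that mapping (global_rank is a hypothetical helper for illustration, not Wenet code):

def global_rank(node_rank: int, gpu_index: int, num_gpus_per_node: int) -> int:
    # Each GPU runs one training process with a globally unique rank.
    return node_rank * num_gpus_per_node + gpu_index


num_nodes = 2
num_gpus_per_node = 8
world_size = num_nodes * num_gpus_per_node  # 16 processes in total

# e.g. GPU 3 on node2 (node_rank=1) gets global rank 11.
print(world_size, global_rank(node_rank=1, gpu_index=3, num_gpus_per_node=8))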

  2. Wrap the model with DistributedDataParallel
def main():
    ...
    if distributed:
        assert (torch.cuda.is_available())
        # cuda model is required for nn.parallel.DistributedDataParallel
        model.cuda()
        model = torch.nn.parallel.DistributedDataParallel(
            model, find_unused_parameters=True)
    ...

Wenet source code: https://github.com/wenet-e2e/wenet/blob/main/wenet/bin/train.py#L232

  3. Distributed data sampling

wenet/dataset/dataset.py

class DistributedSampler:
    ...

    def sample(self, data):
        """ Sample data according to rank/world_size/num_workers
            Args:
                data(List): input data list
            Returns:
                List: data list after sample
"""
        data = list(range(len(data)))
        # TODO(Binbin Zhang): fix this
        # We can not handle uneven data for CV on DDP, so we don't
        # sample data by rank, that means every GPU gets the same
        # and all the CV data
        if self.partition:
            if self.shuffle:
                random.Random(self.epoch).shuffle(data)
            data = data[self.rank::self.world_size]
        # Besides the rank-based split above, slice the data among DataLoader workers by num_workers.
        data = data[self.worker_id::self.num_workers]
        return data
    ...

Wenet source code: https://github.com/wenet-e2e/wenet/blob/main/wenet/dataset/dataset.py#L79
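To see what the two slicing steps do, here is a small standalone illustration with toy numbers (independent of the Wenet class):

data = list(range(12))
world_size, num_workers = 2, 2

for rank in range(world_size):
    by_rank = data[rank::world_size]  # split across training processes (GPUs)
    for worker_id in range(num_workers):
        by_worker = by_rank[worker_id::num_workers]  # split across DataLoader workers
        print(f"rank={rank} worker={worker_id}: {by_worker}")

# rank=0 worker=0: [0, 4, 8]
# rank=0 worker=1: [2, 6, 10]
# rank=1 worker=0: [1, 5, 9]
# rank=1 worker=1: [3, 7, 11]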

  4. Gradient accumulation to reduce how often the training processes synchronize gradients

wenet/utils/executor.py

class Executor:
    def train(...):
        with model_context():
            for batch_idx, batch in enumerate(data_loader):
                if is_distributed and batch_idx % accum_grad != 0:
                    # Gradient accumulation: do not synchronize gradients
                    context = model.no_sync
                else:
                    # Used for single gpu training and DDP gradient synchronization
                    # processes.
                    context = nullcontext
                with context():
                    # autocast context
                    # The more details about amp can be found in
                    # https://pytorch.org/docs/stable/notes/amp_examples.html
                    with torch.cuda.amp.autocast(scaler is not None):
                        loss_dict = model(feats, feats_lengths, target,
                                          target_lengths)
                        loss = loss_dict['loss'] / accum_grad
                    if use_amp:
                        scaler.scale(loss).backward()
                    else:
                        loss.backward()

Wenet source code: https://github.com/wenet-e2e/wenet/blob/main/wenet/utils/executor.py#L67
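Note that the loss is divided by accum_grad, so the gradients accumulated over accum_grad micro-batches match the gradient of one large batch. A minimal sketch of that equivalence (plain PyTorch, not Wenet code):

import torch

torch.manual_seed(0)
x = torch.randn(8, 3)
w = torch.randn(3, requires_grad=True)

# Gradient of one big batch.
(x @ w).pow(2).mean().backward()
big_grad = w.grad.clone()

# The same data as 4 micro-batches, with the loss scaled by 1 / accum_grad.
w.grad = None
accum_grad = 4
for chunk in x.chunk(accum_grad):
    ((chunk @ w).pow(2).mean() / accum_grad).backward()

print(torch.allclose(big_grad, w.grad))  # True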

  5. Destroy the process group. The Wenet source code does not call PyTorch's destroy_process_group(): the process-group globals and context are released anyway when the training process exits, so skipping it does not affect training.

How does distributed training affect some Wenet hyperparameters?

Why does the dev-set loss converge more slowly with multi-machine multi-GPU training (16 GPUs) than with single-machine multi-GPU training (4 GPUs)?


Adjusting the warmup_steps parameter in wenet/examples/aishell/s0/conf/train_conformer.yaml solves this problem.

optim_conf:
    lr: 0.002
scheduler: warmuplr     # pytorch v1.1.0+ required
scheduler_conf:
    warmup_steps: 1562
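Wenet's warmuplr scheduler is essentially a Noam-style schedule: the learning rate ramps up for warmup_steps optimizer steps and then decays as step**-0.5. With 4x more GPUs there are roughly 4x fewer optimizer steps per epoch, so a large warmup_steps keeps the learning rate small for many more epochs; lowering it (here to 1562) lets the schedule peak after a comparable amount of data. A sketch of the schedule shape under that assumption:

def warmup_lr(step: int, base_lr: float = 0.002, warmup_steps: int = 1562) -> float:
    # Noam-style warmup: ramp up until warmup_steps, then decay as step**-0.5.
    step = max(step, 1)
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The schedule reaches its peak (base_lr) at step == warmup_steps.
for s in (100, 1562, 5000, 25000):
    print(s, round(warmup_lr(s), 6))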

How do you adjust the gradient accumulation interval?

Adjust the accum_grad parameter in wenet/examples/aishell/s0/conf/train_conformer.yaml.

Original: https://www.cnblogs.com/bytehandler/p/17038186.html
Author: ByteHandler
Title: Wenet Multi-Machine Multi-GPU Distributed Training
