Natural Language Processing (25): Building a Language Model with Transformer and torchtext

Introduction to the Transformer

This example follows the PyTorch tutorial LANGUAGE MODELING WITH NN.TRANSFORMER AND TORCHTEXT.

First, import the required packages:

import math
import torch
from torch import nn
from torch.nn import TransformerEncoder, TransformerEncoderLayer

Step 1: Define the model

nn.TransformerEncoder consists of a stack of nn.TransformerEncoderLayer layers. It also needs a square attention mask so that a position cannot attend to later positions, which prevents future information from leaking. The output of nn.TransformerEncoder is fed straight into a final Linear layer, which acts as the decoder for this task; a softmax would normally follow, but here it is folded into nn.CrossEntropyLoss.
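
As a quick aside (this check is not part of the original tutorial), the short snippet below confirms that nn.CrossEntropyLoss applies log-softmax internally, so no explicit softmax layer is needed after the decoder:

logits = torch.randn(4, 10)           # 4 positions, vocabulary of 10 tokens
targets = torch.randint(0, 10, (4,))  # gold token indices
ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=-1), targets)
print(torch.allclose(ce, nll))        # True: CrossEntropyLoss = log_softmax + NLLLoss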

class TransformerModel(nn.Module):
    def __init__(self, ntoken, d_model, nhead, d_hid, nlayers, dropout=0.5):
"""
        :param ntoken: 词表大小
        :param d_model: 词嵌入维度
        :param nhead: 头数
        :param d_hid: 隐藏层维度
        :param nlayers: 编码器层数
        :param dropout: dropout
        :return:
"""
        super(TransformerModel, self).__init__()
        self.model_type = 'Transformer'
        self.pos_encoder = PositionalEncoding(d_model, dropout)
        encoder_layers = TransformerEncoderLayer(d_model, nhead, d_hid, dropout)
        self.transformer_encoder = TransformerEncoder(encoder_layers, nlayers)
        self.encoder = nn.Embedding(ntoken, d_model)
        self.d_model = d_model
        self.decoder = nn.Linear(d_model, ntoken)

        self.init_weights()

    def init_weights(self):
        initrange = 0.1
        self.encoder.weight.data.uniform_(-initrange, initrange)
        self.decoder.bias.data.zero_()
        self.decoder.weight.data.uniform_(-initrange, initrange)

    def forward(self, src, src_mask):
"""
        :param src: Tensor, shape [seq_len, batch_size]
        :param src_mask: Tensor, shape [seq_len, seq_len]
        :return: output Tensor of shape [seq_len, batch_size, ntoken]
"""
        src = self.encoder(src) * math.sqrt(self.d_model)
        src = self.pos_encoder(src)
        output = self.transformer_encoder(src, src_mask)
        output = self.decoder(output)
        return output

def generate_square_subsequent_mask(sz):
    """生成一个上三角矩阵,主对角线及以下为0,主对角线之上为-inf"""
    return torch.triu(torch.ones(sz, sz) * float('-inf'), diagonal=1)
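
For example, with sz = 3 the mask looks like this: each position can attend to itself and to earlier positions, while future positions are blocked by -inf.

print(generate_square_subsequent_mask(3))
# tensor([[0., -inf, -inf],
#         [0., 0., -inf],
#         [0., 0., 0.]])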

Positional encoding:


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout=0.1, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)

        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x):
"""
        :param x: Tensor, shape [seq_len, batch_size, embedding_dim]
"""
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)
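
A small shape check (sizes made up for illustration) shows that the positional encoding is simply added to the embeddings and leaves the input shape unchanged:

pos_enc = PositionalEncoding(d_model=200, dropout=0.1)
dummy = torch.zeros(35, 20, 200)  # [seq_len, batch_size, embedding_dim]
print(pos_enc(dummy).shape)       # torch.Size([35, 20, 200])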

Step 2: Load and batch data

We will use torchtext to generate the Wikitext-2 dataset. The vocab object is built from the training split. The batchify() function arranges the data into columns, trimming off any leftover tokens after the data has been divided into batches of size batch_size, as shown in the figure below.

[Figure: how batchify arranges the flat token stream into batch_size columns]
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

train_iter = WikiText2(split='train')
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

def data_process(raw_text_iter):
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))

train_iter, val_iter, test_iter = WikiText2()
train_data = data_process(train_iter)
val_data = data_process(val_iter)
test_data = data_process(test_iter)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

def batchify(data, bsz):
"""
    分割数据,并且移除多余数据
    :param data: Tensor, shape [N] 文本数据 train_data、val_data、test_data
    :param bsz: int, batch_size,每次模型更新参数的数据量
    :return: Tensor of shape [N // bsz, bsz]
"""
    seq_len = data.size(0) // bsz
    data = data[:seq_len * bsz]
    data = data.view(bsz, seq_len).t().contiguous()
    return data.to(device)

batch_size = 20
eval_batch_size = 10
train_data = batchify(train_data, batch_size)
val_data = batchify(val_data, eval_batch_size)
test_data = batchify(test_data, eval_batch_size)
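
As a quick sanity check, each tensor now has shape [N // bsz, bsz]; the exact row count depends on the torchtext release, though the 2928 batches per epoch reported in the training log below imply roughly 102,500 rows for the training split:

print(train_data.shape)  # [num_train_tokens // 20, 20]
print(val_data.shape)    # [num_val_tokens // 10, 10]
print(test_data.shape)   # [num_test_tokens // 10, 10]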

filter(): filter(function, iterable) keeps only the elements of the iterable for which the function returns True

numel(): returns the number of elements in the tensor
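
A tiny illustration of both, using made-up tensors:

tensors = [torch.tensor([1, 2, 3]), torch.tensor([], dtype=torch.long), torch.tensor([4])]
non_empty = tuple(filter(lambda t: t.numel() > 0, tensors))  # drops the empty tensor
print([t.numel() for t in non_empty])  # [3, 1]
print(torch.cat(non_empty))            # tensor([1, 2, 3, 4])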

Step 3: Functions to generate input and target sequence

The get_batch() function generates the input and target sequences for the transformer model. It subdivides the source data into chunks of length bptt. For the language modeling task, the model needs the following word as the target. For example, with a bptt value of 2 and i = 0, we would get the following two variables:

[Figure: the data and target chunks produced by get_batch with bptt = 2 and i = 0]

bptt = 35

def get_batch(source, i):
"""
    :param source: Tensor, shape [full_seq_len, batch_size]
    :param i: 批次数
    :return: tuple (data, target), where data has shape [seq_len, batch_size] and
             target has shape [seq_len * batch_size]
"""

    seq_len = min(bptt, len(source) - 1 - i)
    data = source[i:i + seq_len]
    target = source[i + 1:i + 1 + seq_len].reshape(-1)
    return data, target

source = test_data
i = 1
data, target = get_batch(source, i)
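
To make the bptt = 2, i = 0 example above concrete, here is the same slicing logic applied to a made-up single column of token ids 0 to 5:

toy = torch.arange(6).unsqueeze(1)              # [6, 1]: one batch column of token ids 0..5
seq_len = 2                                     # pretend bptt = 2 and i = 0
toy_data, toy_target = toy[0:seq_len], toy[1:1 + seq_len].reshape(-1)
print(toy_data.squeeze(1), toy_target)          # tensor([0, 1]) tensor([1, 2])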

Step 4: Initiate an instance


ntokens = len(vocab)  # size of the vocabulary
emsize = 200          # embedding dimension (d_model)
d_hid = 200           # dimension of the feedforward network in nn.TransformerEncoderLayer
nlayers = 2           # number of nn.TransformerEncoderLayer layers
nhead = 2             # number of attention heads
dropout = 0.2         # dropout probability
model = TransformerModel(ntokens, emsize, nhead, d_hid, nlayers, dropout).to(device)
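
Before training, a quick forward pass (illustrative only) confirms the expected output shape [seq_len, batch_size, ntokens]:

check_data, _ = get_batch(train_data, 0)
check_mask = generate_square_subsequent_mask(check_data.size(0)).to(device)
with torch.no_grad():
    print(model(check_data, check_mask).shape)  # torch.Size([35, 20, ntokens])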

Step 5: Run the model

import copy
import time

criterion = nn.CrossEntropyLoss()
lr = 5.0
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1, gamma=0.95)
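
StepLR multiplies the learning rate by gamma = 0.95 after every scheduler.step() call (one per epoch here), which matches the lr column in the training log below. A quick check of the first five values:

print([round(5.0 * 0.95 ** e, 2) for e in range(5)])  # [5.0, 4.75, 4.51, 4.29, 4.07]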

def train(model):
    model.train()
    total_loss = 0.

    log_interval = 200
    start_time = time.time()
    src_mask = generate_square_subsequent_mask(bptt).to(device)

    num_batches = len(train_data) // bptt
    for batch, i in enumerate(range(0, train_data.size(0) - 1, bptt)):
        data, targets = get_batch(train_data, i)
        batch_size = data.size(0)  # actual sequence length of this chunk; shorter than bptt for the last one
        if batch_size != bptt:
            src_mask = src_mask[:batch_size, :batch_size]
        output = model(data, src_mask)
        loss = criterion(output.view(-1, ntokens), targets)

        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)
        optimizer.step()

        total_loss += loss.item()
        if batch % log_interval == 0 and batch > 0:
            lr = scheduler.get_last_lr()[0]
            ms_per_batch = (time.time() - start_time) * 1000 / log_interval
            cur_loss = total_loss / log_interval
            ppl = math.exp(cur_loss)
            print(f'| epoch {epoch:3d} | {batch:5d}/{num_batches:5d} batches | '
                  f'lr {lr:02.2f} | ms/batch {ms_per_batch:5.2f} | '
                  f'loss {cur_loss:5.2f} | ppl {ppl:8.2f}')
            total_loss = 0
            start_time = time.time()

def evaluate(model, eval_data):
    model.eval()
    total_loss = 0.

    src_mask = generate_square_subsequent_mask(bptt).to(device)
    with torch.no_grad():
        for i in range(0, eval_data.size(0) - 1, bptt):
            data, targets = get_batch(eval_data, i)
            batch_size = data.size(0)
            if batch_size != bptt:
                src_mask = src_mask[:batch_size, :batch_size]
            output = model(data, src_mask)
            output_flat = output.view(-1, ntokens)
            total_loss += batch_size * criterion(output_flat, targets).item()
    return total_loss / (len(eval_data) - 1)

Loop over the epochs. Save the model whenever the validation loss is the best seen so far, and adjust the learning rate after every epoch.

best_val_loss = float('inf')
epochs = 5
best_model = None

for epoch in range(1, epochs + 1):
    epoch_start_time = time.time()
    train(model)
    val_loss = evaluate(model, val_data)
    val_ppl = math.exp(val_loss)
    elapsed = time.time() - epoch_start_time

    print('-' * 89)
    print(f'| end of epoch {epoch:3d} | time: {elapsed:5.2f}s | '
          f'valid loss {val_loss:5.2f} | valid ppl {val_ppl:8.2f}')
    print('-' * 89)

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_model = copy.deepcopy(model)

    scheduler.step()

Out:

| epoch   1 |   200/ 2928 batches | lr 5.00 | ms/batch 36.25 | loss  8.22 | ppl  3721.31
| epoch   1 |   400/ 2928 batches | lr 5.00 | ms/batch 18.66 | loss  6.95 | ppl  1048.14
| epoch   1 |   600/ 2928 batches | lr 5.00 | ms/batch 18.73 | loss  6.48 | ppl   648.89
| epoch   1 |   800/ 2928 batches | lr 5.00 | ms/batch 18.64 | loss  6.32 | ppl   554.08
| epoch   1 |  1000/ 2928 batches | lr 5.00 | ms/batch 18.68 | loss  6.20 | ppl   492.78
| epoch   1 |  1200/ 2928 batches | lr 5.00 | ms/batch 18.77 | loss  6.17 | ppl   475.86
| epoch   1 |  1400/ 2928 batches | lr 5.00 | ms/batch 18.80 | loss  6.12 | ppl   453.58
| epoch   1 |  1600/ 2928 batches | lr 5.00 | ms/batch 18.88 | loss  6.11 | ppl   452.17
| epoch   1 |  1800/ 2928 batches | lr 5.00 | ms/batch 18.74 | loss  6.03 | ppl   416.26
| epoch   1 |  2000/ 2928 batches | lr 5.00 | ms/batch 18.69 | loss  6.02 | ppl   413.09
| epoch   1 |  2200/ 2928 batches | lr 5.00 | ms/batch 18.68 | loss  5.90 | ppl   366.05
| epoch   1 |  2400/ 2928 batches | lr 5.00 | ms/batch 18.76 | loss  5.97 | ppl   392.08
| epoch   1 |  2600/ 2928 batches | lr 5.00 | ms/batch 18.69 | loss  5.96 | ppl   388.64
| epoch   1 |  2800/ 2928 batches | lr 5.00 | ms/batch 18.82 | loss  5.89 | ppl   362.42
| epoch   2 |   200/ 2928 batches | lr 4.75 | ms/batch 18.87 | loss  5.88 | ppl   356.48
| epoch   2 |   400/ 2928 batches | lr 4.75 | ms/batch 18.73 | loss  5.86 | ppl   350.79
| epoch   2 |   600/ 2928 batches | lr 4.75 | ms/batch 18.84 | loss  5.67 | ppl   289.74
| epoch   2 |   800/ 2928 batches | lr 4.75 | ms/batch 18.73 | loss  5.70 | ppl   298.65
| epoch   2 |  1000/ 2928 batches | lr 4.75 | ms/batch 18.71 | loss  5.65 | ppl   285.12
| epoch   2 |  1200/ 2928 batches | lr 4.75 | ms/batch 18.76 | loss  5.68 | ppl   291.53
| epoch   2 |  1400/ 2928 batches | lr 4.75 | ms/batch 18.76 | loss  5.69 | ppl   296.89
| epoch   2 |  1600/ 2928 batches | lr 4.75 | ms/batch 18.74 | loss  5.71 | ppl   301.69
| epoch   2 |  1800/ 2928 batches | lr 4.75 | ms/batch 18.74 | loss  5.65 | ppl   283.23
| epoch   2 |  2000/ 2928 batches | lr 4.75 | ms/batch 18.86 | loss  5.66 | ppl   287.13
| epoch   2 |  2200/ 2928 batches | lr 4.75 | ms/batch 18.77 | loss  5.55 | ppl   256.27
| epoch   2 |  2400/ 2928 batches | lr 4.75 | ms/batch 18.72 | loss  5.64 | ppl   280.60
| epoch   2 |  2600/ 2928 batches | lr 4.75 | ms/batch 18.71 | loss  5.64 | ppl   281.32
| epoch   2 |  2800/ 2928 batches | lr 4.75 | ms/batch 18.72 | loss  5.57 | ppl   263.44
| epoch   3 |   200/ 2928 batches | lr 4.51 | ms/batch 18.99 | loss  5.60 | ppl   270.62
| epoch   3 |   400/ 2928 batches | lr 4.51 | ms/batch 18.79 | loss  5.63 | ppl   277.79
| epoch   3 |   600/ 2928 batches | lr 4.51 | ms/batch 18.80 | loss  5.42 | ppl   226.29
| epoch   3 |   800/ 2928 batches | lr 4.51 | ms/batch 18.78 | loss  5.48 | ppl   239.64
| epoch   3 |  1000/ 2928 batches | lr 4.51 | ms/batch 18.75 | loss  5.43 | ppl   228.49
| epoch   3 |  1200/ 2928 batches | lr 4.51 | ms/batch 18.70 | loss  5.48 | ppl   239.29
| epoch   3 |  1400/ 2928 batches | lr 4.51 | ms/batch 18.75 | loss  5.48 | ppl   240.92
| epoch   3 |  1600/ 2928 batches | lr 4.51 | ms/batch 18.76 | loss  5.51 | ppl   246.19
| epoch   3 |  1800/ 2928 batches | lr 4.51 | ms/batch 18.77 | loss  5.47 | ppl   236.52
| epoch   3 |  2000/ 2928 batches | lr 4.51 | ms/batch 18.72 | loss  5.47 | ppl   238.51
| epoch   3 |  2200/ 2928 batches | lr 4.51 | ms/batch 18.78 | loss  5.35 | ppl   210.82
| epoch   3 |  2400/ 2928 batches | lr 4.51 | ms/batch 19.11 | loss  5.46 | ppl   235.03
| epoch   3 |  2600/ 2928 batches | lr 4.51 | ms/batch 18.75 | loss  5.47 | ppl   236.98
| epoch   3 |  2800/ 2928 batches | lr 4.51 | ms/batch 18.73 | loss  5.39 | ppl   220.22
| epoch   4 |   200/ 2928 batches | lr 4.29 | ms/batch 18.89 | loss  5.43 | ppl   229.22
| epoch   4 |   400/ 2928 batches | lr 4.29 | ms/batch 18.75 | loss  5.46 | ppl   235.84
| epoch   4 |   600/ 2928 batches | lr 4.29 | ms/batch 18.77 | loss  5.27 | ppl   194.07
| epoch   4 |   800/ 2928 batches | lr 4.29 | ms/batch 18.75 | loss  5.33 | ppl   205.91
| epoch   4 |  1000/ 2928 batches | lr 4.29 | ms/batch 18.71 | loss  5.28 | ppl   197.32
| epoch   4 |  1200/ 2928 batches | lr 4.29 | ms/batch 18.60 | loss  5.32 | ppl   205.38
| epoch   4 |  1400/ 2928 batches | lr 4.29 | ms/batch 18.51 | loss  5.35 | ppl   210.13
| epoch   4 |  1600/ 2928 batches | lr 4.29 | ms/batch 18.49 | loss  5.38 | ppl   217.76
| epoch   4 |  1800/ 2928 batches | lr 4.29 | ms/batch 18.47 | loss  5.33 | ppl   206.47
| epoch   4 |  2000/ 2928 batches | lr 4.29 | ms/batch 18.52 | loss  5.33 | ppl   207.32
| epoch   4 |  2200/ 2928 batches | lr 4.29 | ms/batch 18.47 | loss  5.20 | ppl   182.16
| epoch   4 |  2400/ 2928 batches | lr 4.29 | ms/batch 18.48 | loss  5.32 | ppl   203.72
| epoch   4 |  2600/ 2928 batches | lr 4.29 | ms/batch 18.48 | loss  5.33 | ppl   205.98
| epoch   4 |  2800/ 2928 batches | lr 4.29 | ms/batch 18.68 | loss  5.26 | ppl   193.05
| epoch   5 |   200/ 2928 batches | lr 4.07 | ms/batch 18.59 | loss  5.30 | ppl   201.26
| epoch   5 |   400/ 2928 batches | lr 4.07 | ms/batch 18.47 | loss  5.33 | ppl   207.39
| epoch   5 |   600/ 2928 batches | lr 4.07 | ms/batch 18.55 | loss  5.14 | ppl   170.04
| epoch   5 |   800/ 2928 batches | lr 4.07 | ms/batch 18.49 | loss  5.20 | ppl   180.89
| epoch   5 |  1000/ 2928 batches | lr 4.07 | ms/batch 18.50 | loss  5.17 | ppl   175.22
| epoch   5 |  1200/ 2928 batches | lr 4.07 | ms/batch 18.49 | loss  5.21 | ppl   183.48
| epoch   5 |  1400/ 2928 batches | lr 4.07 | ms/batch 18.51 | loss  5.23 | ppl   186.35
| epoch   5 |  1600/ 2928 batches | lr 4.07 | ms/batch 18.54 | loss  5.27 | ppl   194.44
| epoch   5 |  1800/ 2928 batches | lr 4.07 | ms/batch 18.50 | loss  5.22 | ppl   184.42
| epoch   5 |  2000/ 2928 batches | lr 4.07 | ms/batch 18.51 | loss  5.23 | ppl   186.21
| epoch   5 |  2200/ 2928 batches | lr 4.07 | ms/batch 18.53 | loss  5.09 | ppl   161.85
| epoch   5 |  2400/ 2928 batches | lr 4.07 | ms/batch 18.55 | loss  5.21 | ppl   182.69
| epoch   5 |  2600/ 2928 batches | lr 4.07 | ms/batch 18.47 | loss  5.23 | ppl   186.25
| epoch   5 |  2800/ 2928 batches | lr 4.07 | ms/batch 18.48 | loss  5.16 | ppl   173.32

Step 6: Evaluate the best model on the test dataset


test_loss = evaluate(best_model, test_data)
test_ppl = math.exp(test_loss)
print('=' * 89)
print(f'| End of training | test loss {test_loss:5.2f} | '
      f'test ppl {test_ppl:8.2f}')
print('=' * 89)

Out:

=========================================================================================
| End of training | test loss  5.46 | test ppl   234.79
=========================================================================================

Original: https://blog.csdn.net/weixin_45707277/article/details/122727044
Author: GeniusAng
Title: Natural Language Processing (25): Building a Language Model with Transformer and torchtext
