Deep Learning: Transformer Architecture Explained

September 27, 2023, 11:29 PM • Python

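Contents

I. Transformer Background
1.1 The Birth of the Transformer
1.2 Advantages of the Transformer
1.3 The Transformer's Market
II. Transformer Architecture
2.1 Getting to Know the Transformer Architecture
- 2.1.1 What the Transformer Model Does
- 2.1.2 Overall Architecture of the Transformer
2.2 Implementing the Input Part
- 2.2.1 What the Text Embedding Layer Does
- 2.2.2 What the Positional Encoder Does
2.3 Implementing the Encoder Part
2.4 Implementing the Decoder Part
- 2.4.1 Decoder Layer
- 2.4.2 Decoder
2.5 Implementing the Output Part
2.6 Building the Model
III. Building a Language Model with the Transformer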

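I. Transformer Background

1.1 The Birth of the Transformer

In October 2018, Google published the paper "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". The BERT model arrived with a bang and set new best results on 11 NLP tasks.

Paper: https://arxiv.org/pdf/1810.04805.pdf

The structure that plays the key role inside BERT is the Transformer, which was itself introduced in the 2017 paper "Attention Is All You Need". Models such as XLNet and RoBERTa later surpassed BERT, but their core did not change: it is still the Transformer.

1.2 Advantages of the Transformer

Compared with the LSTM and GRU models that previously dominated, the Transformer has two notable advantages:

It can be trained in parallel on distributed GPUs, which improves training efficiency.
When analyzing and predicting longer texts, it captures semantic relationships across long distances more effectively.

[Figure: evaluation comparison chart]

1.3 The Transformer's Market

On well-known SOTA machine translation leaderboards, almost all top-ranked models use the Transformer.

These leaderboards can largely be seen as a bellwether for industry, so the market potential speaks for itself.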

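II. Transformer Architecture

2.1 Getting to Know the Transformer Architecture

2.1.1 What the Transformer Model Does

A Transformer built on the seq2seq architecture can handle typical NLP tasks such as machine translation and text generation. It can also be used to build pre-trained language models for transfer learning on different tasks.

Note:
In the architecture analysis that follows, we assume the Transformer is used to translate text from one language into another, so many names follow NLP conventions. For example, the Embedding layer is called the text embedding layer, the tensor it produces is called the word-embedding tensor, and its last dimension is called the word vector.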

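2.1.2 Overall Architecture of the Transformer

The overall Transformer architecture can be divided into four parts:

Input part
Output part
Encoder part
Decoder part

The input part consists of:

The source text embedding layer and its positional encoder
The target text embedding layer and its positional encoder

The output part consists of:

A linear layer
A softmax layer

The encoder part:

Is a stack of N encoder layers
Each encoder layer consists of two sublayer connection structures
The first sublayer connection structure contains a multi-head self-attention sublayer, a normalization layer, and a residual connection
The second sublayer connection structure contains a feed-forward fully connected sublayer, a normalization layer, and a residual connection

The decoder part:

Is a stack of N decoder layers
Each decoder layer consists of three sublayer connection structures
The first sublayer connection structure contains a multi-head self-attention sublayer, a normalization layer, and a residual connection
The second sublayer connection structure contains a multi-head attention sublayer (attending over the encoder output), a normalization layer, and a residual connection
The third sublayer connection structure contains a feed-forward fully connected sublayer, a normalization layer, and a residual connection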

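2.2 Implementing the Input Part

The input part consists of:

The source text embedding layer and its positional encoder
The target text embedding layer and its positional encoder

2.2.1 What the Text Embedding Layer Does

Both the source and the target text embedding convert the numeric (index) representation of words into vector representations, in the hope that relationships between words can be captured in this high-dimensional space.

Code for the text embedding layer: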


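import math

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
# Legacy wrapper: since PyTorch 0.4, Variable(tensor) simply returns a Tensor.
from torch.autograd import Variable

class Embeddings(nn.Module):
    def __init__(self, d_model, vocab):
        """d_model: dimension of the word embeddings; vocab: size of the vocabulary."""
        super(Embeddings, self).__init__()
        # Lookup table that maps each token index to a d_model-dimensional vector.
        self.lut = nn.Embedding(vocab, d_model)
        self.d_model = d_model

    def forward(self, x):
        """Forward-pass logic of this layer; every layer has such a function.
           It is called automatically when arguments are passed to an instance of the class.
           x: because this is the first layer, x is the input text mapped to vocabulary indices."""
        # Scale the embeddings by sqrt(d_model).
        return self.lut(x) * math.sqrt(self.d_model)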

nn.Embedding demo:

>>> embedding = nn.Embedding(10, 3)
>>> input = torch.LongTensor([[1,2,4,5],[4,3,2,9]])
>>> embedding(input)
tensor([[[-0.0251, -1.6902,  0.7172],
         [-0.6431,  0.0748,  0.6969],
         [ 1.4970,  1.3448, -0.9685],
         [-0.3677, -2.7265, -0.1685]],

        [[ 1.4970,  1.3448, -0.9685],
         [ 0.4362, -0.4004,  0.9400],
         [-0.6431,  0.0748,  0.6969],
         [ 0.9124, -2.3616,  1.1151]]])

>>> embedding = nn.Embedding(10, 3, padding_idx=0)
>>> input = torch.LongTensor([[0,2,0,5]])
>>> embedding(input)
tensor([[[ 0.0000,  0.0000,  0.0000],
         [ 0.1535, -2.0309,  0.9315],
         [ 0.0000,  0.0000,  0.0000],
         [-0.1655,  0.9897,  0.0635]]])

Instantiation parameters:

d_model = 512
vocab = 1000

Input:

x = Variable(torch.LongTensor([[100,2,421,508],[491,998,1,221]]))

Call:

emb = Embeddings(d_model, vocab)
embr = emb(x)
print("embr:", embr)

Output:

embr: Variable containing:
( 0 ,.,.) =
  35.9321   3.2582 -17.7301  ...    3.4109  13.8832  39.0272
   8.5410  -3.5790 -12.0460  ...   40.1880  36.6009  34.7141
 -17.0650  -1.8705 -20.1807  ...  -12.5556 -34.0739  35.6536
  20.6105   4.4314  14.9912  ...   -0.1342  -9.9270  28.6771

( 1 ,.,.) =
  27.7016  16.7183  46.6900  ...   17.9840  17.2525  -3.9709
   3.0645  -5.5105  10.8802  ...  -13.0069  30.8834 -38.3209
  33.1378 -32.1435  -3.9369  ...   15.6094 -29.7063  40.1361
 -31.5056   3.3648   1.4726  ...    2.8047  -9.6514 -23.4909
[torch.FloatTensor of size 2x4x512]
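
To make the lookup-and-scale step concrete, here is a minimal pure-Python sketch of what Embeddings.forward computes. It does not use PyTorch; the helper names make_embedding_table and embed and the random table are assumptions made for illustration, standing in for the learned nn.Embedding weights.

```python
import math
import random

def make_embedding_table(vocab, d_model, seed=0):
    """Build a vocab x d_model table of random values (a stand-in for learned weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(vocab)]

def embed(table, token_ids):
    """Look up each token id and scale its row by sqrt(d_model)."""
    d_model = len(table[0])
    scale = math.sqrt(d_model)
    return [[[v * scale for v in table[t]] for t in sentence] for sentence in token_ids]

table = make_embedding_table(vocab=10, d_model=4)
batch = [[1, 2, 4, 5], [4, 3, 2, 9]]
out = embed(table, batch)
print(len(out), len(out[0]), len(out[0][0]))  # batch size, sequence length, d_model
```

The same token id always maps to the same vector regardless of where it appears, which is why position information has to be added separately.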

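2.2.2 What the Positional Encoder Does

The Transformer's encoder structure has no built-in handling of word order, so a positional encoder is added after the Embedding layer. It adds position information, which can change what a word means, to the word-embedding tensor to make up for the missing order information.

Code for the positional encoder: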


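class PositionalEncoding(nn.Module):
    def __init__(self, d_model, dropout, max_len=5000):
        """d_model: word-embedding dimension; dropout: probability of zeroing a value;
           max_len: maximum length of a sentence."""
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)
        # Positional encoding matrix of shape (max_len, d_model).
        pe = torch.zeros(max_len, d_model)
        # Column vector of absolute positions, shape (max_len, 1).
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        # Frequencies 1 / 10000^(2i / d_model) for the even feature indices 2i.
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) *
                             -(math.log(10000.0) / d_model))
        # Sine on the even columns, cosine on the odd columns.
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Add a batch dimension: (1, max_len, d_model).
        pe = pe.unsqueeze(0)
        # Register as a buffer: saved with the model but not updated by the optimizer.
        self.register_buffer('pe', pe)

    def forward(self, x):
        """x is the word-embedding representation of the text sequence."""
        # Add the encodings for the first x.size(1) positions; they carry no gradient.
        x = x + Variable(self.pe[:, :x.size(1)],
                         requires_grad=False)
        return self.dropout(x)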

nn.Dropout demo:

>>> m = nn.Dropout(p=0.2)
>>> input = torch.randn(4, 5)
>>> output = m(input)
>>> output
Variable containing:
 0.0000 -0.5856 -1.4094  0.0000 -1.0290
 2.0591 -1.3400 -1.7247 -0.9885  0.1286
 0.5099  1.3715  0.0000  2.2079 -0.5497
-0.0000 -0.7839 -1.2434 -0.1222  1.2815
[torch.FloatTensor of size 4x5]

torch.unsqueeze demo:

>>> x = torch.tensor([1, 2, 3, 4])
>>> torch.unsqueeze(x, 0)
tensor([[ 1,  2,  3,  4]])
>>> torch.unsqueeze(x, 1)
tensor([[ 1],
        [ 2],
        [ 3],
        [ 4]])

Instantiation parameters:

d_model = 512
dropout = 0.1
max_len = 60

Input:

x = embr
Variable containing:
( 0 ,.,.) =
  35.9321   3.2582 -17.7301  ...    3.4109  13.8832  39.0272
   8.5410  -3.5790 -12.0460  ...   40.1880  36.6009  34.7141
 -17.0650  -1.8705 -20.1807  ...  -12.5556 -34.0739  35.6536
  20.6105   4.4314  14.9912  ...   -0.1342  -9.9270  28.6771

( 1 ,.,.) =
  27.7016  16.7183  46.6900  ...   17.9840  17.2525  -3.9709
   3.0645  -5.5105  10.8802  ...  -13.0069  30.8834 -38.3209
  33.1378 -32.1435  -3.9369  ...   15.6094 -29.7063  40.1361
 -31.5056   3.3648   1.4726  ...    2.8047  -9.6514 -23.4909
[torch.FloatTensor of size 2x4x512]

Call:

pe = PositionalEncoding(d_model, dropout, max_len)
pe_result = pe(x)
print("pe_result:", pe_result)

Output:

pe_result: Variable containing:
( 0 ,.,.) =
 -19.7050   0.0000   0.0000  ...  -11.7557  -0.0000  23.4553
  -1.4668 -62.2510  -2.4012  ...   66.5860 -24.4578 -37.7469
   9.8642 -41.6497 -11.4968  ...  -21.1293 -42.0945  50.7943
   0.0000  34.1785 -33.0712  ...   48.5520   3.2540  54.1348

( 1 ,.,.) =
   7.7598 -21.0359  15.0595  ...  -35.6061  -0.0000   4.1772
 -38.7230   8.6578  34.2935  ...  -43.3556  26.6052   4.3084
  24.6962  37.3626 -26.9271  ...   49.8989   0.0000  44.9158
 -28.8435 -48.5963  -0.9892  ...  -52.5447  -4.1475  -3.0450
[torch.FloatTensor of size 2x4x512]

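Plotting how individual features of the positional encoding vary with position:

import matplotlib.pyplot as plt

plt.figure(figsize=(15, 5))
# d_model = 20 and dropout = 0, so the plot shows the raw encodings.
pe = PositionalEncoding(20, 0)
# Feed an all-zero embedding so the output is the positional encoding itself.
y = pe(Variable(torch.zeros(1, 100, 20)))
# Plot dimensions 4 to 7 across the 100 positions.
plt.plot(np.arange(100), y[0, :, 4:8].data.numpy())
plt.legend(["dim %d"%p for p in [4,5,6,7]])

Output:

[Figure: sine and cosine curves for dimensions 4 to 7 over positions 0 to 99]

Analysis:

Each colored curve shows how one feature (dimension) of the encoding varies with position.
This guarantees that the same word receives a different position-dependent vector when it appears at a different position.
Sine and cosine values lie between -1 and 1, which keeps the magnitude of the added values under control and helps gradients stay well-behaved.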
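
The encoding can also be computed value by value. The following is a minimal pure-Python sketch of the same sine/cosine formula that PositionalEncoding builds with tensors; the function name positional_encoding is an assumption made for illustration.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table: sin on even columns, cos on odd columns."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same frequency term as div_term: exp(-ln(10000) * i / d_model).
            angle = pos * math.exp(-math.log(10000.0) * i / d_model)
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe_table = positional_encoding(max_len=100, d_model=20)
print(pe_table[0][:4])  # position 0: sin(0) = 0 and cos(0) = 1 alternate
```

Low dimension indices oscillate quickly and high ones slowly, so every position gets a distinct combination of values.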

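2.3 Implementing the Encoder Part

The encoder part:

Is a stack of N encoder layers
Each encoder layer consists of two sublayer connection structures
The first sublayer connection structure contains a multi-head self-attention sublayer, a normalization layer, and a residual connection
The second sublayer connection structure contains a feed-forward fully connected sublayer, a normalization layer, and a residual connection

2.3.1 Mask Tensor

What is a mask tensor:
To mask is to cover: a mask tensor marks which values of another tensor should be hidden. Its size varies, and it usually contains only 1s and 0s that say whether a position is masked or not. Whether 0 or 1 means "masked" is a matter of convention (in the code in this article, 0 means masked). Its effect is to hide, or more precisely replace, some values in another tensor.
What the mask tensor is for:
In the Transformer, the mask is mainly used when applying attention (covered in the next subsection). Some values in the attention tensor could otherwise be computed with knowledge of future information. The future is visible because the whole target sequence is embedded at once during training, whereas in principle the decoder does not produce its output in one shot but step by step, each step building on the previous results. Future information could therefore be used prematurely, so we mask it. The decoder is covered in a later section.

Code for generating the mask tensor: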

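def subsequent_mask(size):
    """Build a mask that hides subsequent positions. size is the length of the last two
       dimensions of the mask, which form a square matrix."""
    attn_shape = (1, size, size)
    # Ones strictly above the diagonal mark the future positions.
    subsequent_mask = np.triu(np.ones(attn_shape), k=1).astype('uint8')
    # Invert: 1 on and below the diagonal (visible), 0 above it (masked).
    return torch.from_numpy(1 - subsequent_mask)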

np.triu demo:

>>> np.triu([[1,2,3],[4,5,6],[7,8,9],[10,11,12]], k=-1)
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 0,  8,  9],
       [ 0,  0, 12]])

>>> np.triu([[1,2,3],[4,5,6],[7,8,9],[10,11,12]], k=0)
array([[ 1,  2,  3],
       [ 0,  5,  6],
       [ 0,  0,  9],
       [ 0,  0, 0]])

>>> np.triu([[1,2,3],[4,5,6],[7,8,9],[10,11,12]], k=1)
array([[ 0,  2,  3],
       [ 0,  0,  6],
       [ 0,  0,  0],
       [ 0,  0, 0]])

Input:

size = 5

Call:

sm = subsequent_mask(size)
print("sm:", sm)

Output:


sm: (0 ,.,.) =
  1  0  0  0  0
  1  1  0  0  0
  1  1  1  0  0
  1  1  1  1  0
  1  1  1  1  1
[torch.ByteTensor of size 1x5x5]

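Visualizing the mask tensor:

plt.figure(figsize=(5,5))
plt.imshow(subsequent_mask(20)[0])

Output:

[Figure: 20x20 heat map of the mask, yellow on and below the diagonal, purple above it]

Analysis:

In the heat map, yellow cells are 1 and purple cells are 0. With the mask returned by subsequent_mask and the masked_fill(mask == 0, ...) convention used in the attention function, 1 means visible and 0 means masked. Each row (vertical axis) is a target position, and each column (horizontal axis) is a position it may look at.
Row 0 has a single yellow cell at column 0, so the first position can see only itself. Row 1 can see positions 0 and 1, and so on. No position can see anything that comes after it.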
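
The row-by-row reading of the mask can be checked with a small pure-Python sketch. It builds the same lower-triangular pattern without NumPy; causal_mask and visible_positions are hypothetical helper names used only for this illustration.

```python
def causal_mask(size):
    """Row i holds 1 for columns 0..i (visible) and 0 for later columns (masked)."""
    return [[1 if col <= row else 0 for col in range(size)] for row in range(size)]

def visible_positions(mask, row):
    """List the column indices that the given target position may attend to."""
    return [col for col, flag in enumerate(mask[row]) if flag == 1]

mask = causal_mask(5)
for row in range(5):
    print(row, visible_positions(mask, row))
```

Each printed line lists a target position followed by the positions it is allowed to see, which grows by one entry per row.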

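2.3.2 Attention Mechanism

What is attention:
When we look at something, we can judge it quickly (the judgment may of course be wrong) because the brain focuses on its most distinctive parts instead of inspecting it from start to finish. The attention mechanism is built on this idea.
What is an attention calculation rule:
It takes three inputs, Q (query), K (key), and V (value), and computes the attention result through a formula. The result is a representation of the query under the influence of the keys and values. There are many such rules; only the one used here is described.

The attention calculation rule used here (scaled dot-product attention, matching the code that follows):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$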

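An analogy for Q, K, and V:

Suppose we are given a task: read a passage of text and describe it with a few keywords.

To make the answers comparable, the task may come with some keywords as hints. These hints can be seen as the key, and the whole passage is the query. The value is more abstract: it is like the answer that comes to mind after reading the passage. Assume that at first we are not very clever, and after the first reading the only thing in mind is the hints themselves, so key and value are essentially the same. As we understand the problem better, more and more comes to mind, and we begin to extract the key information of the query, that is, the passage, to represent it. This is attention at work: through this process the value in our mind changes, and from the hints (key) we produce a keyword representation of the query, which is another feature representation.

As noted, key and value are usually the same and differ from the query; this is the general form of attention input. There is a special case in which query, key, and value are all the same, which is called self-attention. In the example above, general attention describes the passage with keywords that come from outside it, whereas self-attention has to express the passage using the passage itself: the keywords are drawn from the text, which amounts to one round of feature extraction on the text itself.

What is an attention mechanism:

The attention mechanism is the network structure that carries the attention calculation rule. Besides the rule itself, it includes the necessary fully connected layers and tensor manipulations so that it integrates with the rest of the network. An attention mechanism that uses the self-attention rule is called a self-attention mechanism.

[Figure: the attention mechanism as implemented in the network]

Code for the attention calculation rule: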

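def attention(query, key, value, mask=None, dropout=None):
    """Scaled dot-product attention. Inputs are query, key, and value; mask is an optional
       mask tensor; dropout is an optional nn.Dropout instance (default None)."""
    # Size of the last dimension of query, i.e. the feature size per position.
    d_k = query.size(-1)
    # Q K^T scaled by sqrt(d_k).
    scores = torch.matmul(query, key.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        # Positions where mask == 0 get a large negative score,
        # so softmax assigns them a weight close to 0.
        scores = scores.masked_fill(mask == 0, -1e9)
    # Normalize the scores over the key dimension.
    p_attn = F.softmax(scores, dim = -1)
    if dropout is not None:
        p_attn = dropout(p_attn)
    # Return the weighted sum of the values together with the attention weights.
    return torch.matmul(p_attn, value), p_attn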

tensor.masked_fill demo:

>>> input = Variable(torch.randn(5, 5))
>>> input
Variable containing:
 2.0344 -0.5450  0.3365 -0.1888 -2.1803
 1.5221 -0.3823  0.8414  0.7836 -0.8481
-0.0345 -0.8643  0.6476 -0.2713  1.5645
 0.8788 -2.2142  0.4022  0.1997  0.1474
 2.9109  0.6006 -0.6745 -1.7262  0.6977
[torch.FloatTensor of size 5x5]

>>> mask = Variable(torch.zeros(5, 5))
>>> mask
Variable containing:
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
 0  0  0  0  0
[torch.FloatTensor of size 5x5]

>>> input.masked_fill(mask == 0, -1e9)
Variable containing:
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
-1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09 -1.0000e+09
[torch.FloatTensor of size 5x5]

Input:

query = key = value = pe_result
Variable containing:
( 0 ,.,.) =
  46.5196  16.2057 -41.5581  ...  -16.0242 -17.8929 -43.0405
 -32.6040  16.1096 -29.5228  ...    4.2721  20.6034  -1.2747
 -18.6235  14.5076  -2.0105  ...   15.6462 -24.6081 -30.3391
   0.0000 -66.1486 -11.5123  ...   20.1519  -4.6823   0.4916

( 1 ,.,.) =
 -24.8681   7.5495  -5.0765  ...   -7.5992 -26.6630  40.9517
  13.1581  -3.1918 -30.9001  ...   25.1187 -26.4621   2.9542
 -49.7690 -42.5019   8.0198  ...   -5.4809  25.9403 -27.4931
 -52.2775  10.4006   0.0000  ...   -1.9985   7.0106  -0.5189
[torch.FloatTensor of size 2x4x512]

Call:

attn, p_attn = attention(query, key, value)
print("attn:", attn)
print("p_attn:", p_attn)

Output:


attn: Variable containing:
( 0 ,.,.) =
   12.8269    7.7403   41.2225  ...     1.4603   27.8559  -12.2600
   12.4904    0.0000   24.1575  ...     0.0000    2.5838   18.0647
  -32.5959   -4.6252  -29.1050  ...     0.0000  -22.6409  -11.8341
    8.9921  -33.0114   -0.7393  ...     4.7871   -5.7735    8.3374

( 1 ,.,.) =
  -25.6705   -4.0860  -36.8226  ...    37.2346  -27.3576    2.5497
  -16.6674   73.9788  -33.3296  ...    28.5028   -5.5488  -13.7564
    0.0000  -29.9039   -3.0405  ...     0.0000   14.4408   14.8579
   30.7819    0.0000   21.3908  ...   -29.0746    0.0000   -5.8475
[torch.FloatTensor of size 2x4x512]

p_attn: Variable containing:
(0 ,.,.) =
  1  0  0  0
  0  1  0  0
  0  0  1  0
  0  0  0  1

(1 ,.,.) =
  1  0  0  0
  0  1  0  0
  0  0  1  0
  0  0  0  1
[torch.FloatTensor of size 2x4x4]

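Input with a mask:

query = key = value = pe_result

# An all-zero mask hides every position, so softmax falls back to uniform weights.
mask = Variable(torch.zeros(2, 4, 4))

Call:

attn, p_attn = attention(query, key, value, mask=mask)
print("attn:", attn)
print("p_attn:", p_attn)

Output with the mask: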


attn: Variable containing:
( 0 ,.,.) =
   0.4284  -7.4741   8.8839  ...    1.5618   0.5063   0.5770
   0.4284  -7.4741   8.8839  ...    1.5618   0.5063   0.5770
   0.4284  -7.4741   8.8839  ...    1.5618   0.5063   0.5770
   0.4284  -7.4741   8.8839  ...    1.5618   0.5063   0.5770

( 1 ,.,.) =
  -2.8890   9.9972 -12.9505  ...    9.1657  -4.6164  -0.5491
  -2.8890   9.9972 -12.9505  ...    9.1657  -4.6164  -0.5491
  -2.8890   9.9972 -12.9505  ...    9.1657  -4.6164  -0.5491
  -2.8890   9.9972 -12.9505  ...    9.1657  -4.6164  -0.5491
[torch.FloatTensor of size 2x4x512]

p_attn: Variable containing:
(0 ,.,.) =
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500

(1 ,.,.) =
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500
  0.2500  0.2500  0.2500  0.2500
[torch.FloatTensor of size 2x4x4]
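
The uniform 0.2500 weights in the masked output can be reproduced by hand. The following pure-Python sketch implements the same rule (scores, masked fill with -1e9, softmax, weighted sum) on a tiny made-up input; the helper names and the 4x2 matrix x are assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, key, mask=None):
    """softmax(Q K^T / sqrt(d_k)); scores where mask == 0 are set to -1e9 first."""
    d_k = len(query[0])
    weights = []
    for i, q in enumerate(query):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in key]
        if mask is not None:
            scores = [-1e9 if mask[i][j] == 0 else s for j, s in enumerate(scores)]
        weights.append(softmax(scores))
    return weights

def attend(weights, value):
    """Weighted sum of the value rows for each query position."""
    width = len(value[0])
    return [[sum(w * v[c] for w, v in zip(row, value)) for c in range(width)]
            for row in weights]

x = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
all_zero_mask = [[0] * 4 for _ in range(4)]
w = attention_weights(x, x, all_zero_mask)
print(w[0])             # every score became -1e9, so the weights are uniform
print(attend(w, x)[0])  # each output row is therefore the mean of the value rows
```

This also explains why every row of attn is identical in the masked output: with uniform weights, each position receives the same average of the value vectors.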

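2.3.3 Multi-Head Attention

What is multi-head attention:
Looking at the structure diagram, the "multiple heads" might seem to mean multiple sets of linear layers, but that is not the case here. Only one set of linear layers is used: three transformation matrices applied to Q, K, and V respectively. These transformations do not change the tensor size, so each weight matrix is square. The heads come into play after that: the transformed tensor is split along its last dimension, the word-embedding dimension, so that each head gets its own Q, K, and V but sees only a slice of every word's representation. Each head's slice is then fed to the attention computation, which gives multi-head attention.

[Figure: multi-head attention structure diagram]

What multi-head attention is for:
This design lets each head specialize in a different part of each word's features, which balances out the bias that a single attention computation may have and gives word meaning a more diverse representation. Experiments show that this improves model quality.

Code for multi-head attention: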


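import copy

def clones(module, N):
    """Produce N identical network layers. module is the layer to copy and N is the number
       of copies; deep copies are used so the copies do not share parameters."""
    return nn.ModuleList([copy.deepcopy(module) for _ in range(N)])

class MultiHeadedAttention(nn.Module):
    def __init__(self, head, embedding_dim, dropout=0.1):
        """head: number of heads; embedding_dim: word-embedding dimension;
           dropout: probability of zeroing a value, default 0.1."""
        super(MultiHeadedAttention, self).__init__()
        # Each head receives an equal slice of the embedding, so the division must be exact.
        assert embedding_dim % head == 0
        # Feature size per head.
        self.d_k = embedding_dim // head
        self.head = head
        # Four linear layers: one each for Q, K, and V, plus one for the final output.
        self.linears = clones(nn.Linear(embedding_dim, embedding_dim), 4)
        # Holds the most recent attention weights.
        self.attn = None
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, query, key, value, mask=None):
        """The first three arguments are the Q, K, and V required by attention;
           the last is an optional mask tensor, default None."""
        if mask is not None:
            # Add a leading dimension so the mask broadcasts over the batch.
            # The examples in this article pass masks shaped (head, seq_len, seq_len).
            mask = mask.unsqueeze(0)
        batch_size = query.size(0)
        # Project Q, K, V, then reshape each to (batch, head, seq_len, d_k).
        query, key, value = \
           [model(x).view(batch_size, -1, self.head, self.d_k).transpose(1, 2)
            for model, x in zip(self.linears, (query, key, value))]
        x, self.attn = attention(query, key, value, mask=mask, dropout=self.dropout)
        # Merge the heads back to (batch, seq_len, head * d_k).
        # contiguous() is required before view() because transpose() changes the memory layout.
        x = x.transpose(1, 2).contiguous().view(batch_size, -1, self.head * self.d_k)
        # Final output projection.
        return self.linears[-1](x)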

tensor.view demo:

>>> x = torch.randn(4, 4)
>>> x.size()
torch.Size([4, 4])
>>> y = x.view(16)
>>> y.size()
torch.Size([16])
>>> z = x.view(-1, 8)
>>> z.size()
torch.Size([2, 8])

>>> a = torch.randn(1, 2, 3, 4)
>>> a.size()
torch.Size([1, 2, 3, 4])
>>> b = a.transpose(1, 2)
>>> b.size()
torch.Size([1, 3, 2, 4])
>>> c = a.view(1, 3, 2, 4)
>>> c.size()
torch.Size([1, 3, 2, 4])
>>> torch.equal(b, c)
False

torch.transpose demo:

>>> x = torch.randn(2, 3)
>>> x
tensor([[ 1.0028, -0.9893,  0.5809],
        [-0.1669,  0.7299,  0.4942]])
>>> torch.transpose(x, 0, 1)
tensor([[ 1.0028, -0.1669],
        [-0.9893,  0.7299],
        [ 0.5809,  0.4942]])

Instantiation parameters:

head = 8
embedding_dim = 512
dropout = 0.2

Input:

query = value = key = pe_result

# Shape (head, seq_len, seq_len); forward() adds the batch dimension.
mask = Variable(torch.zeros(8, 4, 4))

Call:

mha = MultiHeadedAttention(head, embedding_dim, dropout)
mha_result = mha(query, key, value, mask)
print(mha_result)
print(mha_result.shape)

Output:

tensor([[[-0.3075,  1.5687, -2.5693,  ..., -1.1098,  0.0878, -3.3609],
         [ 3.8065, -2.4538, -0.3708,  ..., -1.5205, -1.1488, -1.3984],
         [ 2.4190,  0.5376, -2.8475,  ...,  1.4218, -0.4488, -0.2984],
         [ 2.9356,  0.3620, -3.8722,  ..., -0.7996,  0.1468,  1.0345]],

        [[ 1.1423,  0.6038,  0.0954,  ...,  2.2679, -5.7749,  1.4132],
         [ 2.4066, -0.2777,  2.8102,  ...,  0.1137, -3.9517, -2.9246],
         [ 5.8201,  1.1534, -1.9191,  ...,  0.1410, -7.6110,  1.0046],
         [ 3.1209,  1.0008, -0.5317,  ...,  2.8619, -6.3204, -1.3435]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
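
How the heads share one embedding can be seen on a toy example. The sketch below mirrors, in pure Python and for a single sentence, what view(batch_size, -1, head, d_k).transpose(1, 2) does to the last dimension; split_heads and merge_heads are hypothetical names introduced for this illustration.

```python
def split_heads(sequence, head):
    """Turn seq_len x d_model into head x seq_len x d_k by slicing each token vector."""
    d_model = len(sequence[0])
    assert d_model % head == 0, "d_model must be divisible by the number of heads"
    d_k = d_model // head
    return [[token[h * d_k:(h + 1) * d_k] for token in sequence] for h in range(head)]

def merge_heads(heads):
    """Inverse of split_heads: concatenate the per-head slices of every token."""
    seq_len = len(heads[0])
    return [[v for h in heads for v in h[pos]] for pos in range(seq_len)]

tokens = [[1, 2, 3, 4, 5, 6, 7, 8], [9, 10, 11, 12, 13, 14, 15, 16]]
heads = split_heads(tokens, head=2)
print(heads[0])  # first half of every token vector
print(heads[1])  # second half of every token vector
```

Every head sees all positions of the sentence but only its own slice of the features, and merging the heads restores the original layout.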

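2.3.4 Feed-Forward Fully Connected Layer

What is the feed-forward fully connected layer:
In the Transformer, it is a fully connected network with two linear layers.
What it is for:
The attention mechanism alone may not fit complex processes well enough, so two more layers are added to increase the model's capacity.

Code for the feed-forward fully connected layer: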


class PositionwiseFeedForward(nn.Module):
    def __init__(self, d_model, d_ff, dropout=0.1):
        """d_model: input dimension of the first linear layer and output dimension of the second,
           so that the layer leaves the input size unchanged.
           d_ff: output dimension of the first linear layer and input dimension of the second.
           dropout: probability of zeroing a value, default 0.1."""
        super(PositionwiseFeedForward, self).__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """x is the output of the previous layer."""
        # First linear layer, ReLU, dropout, then the second linear layer.
        return self.w2(self.dropout(F.relu(self.w1(x))))

ReLU formula: ReLU(x) = max(0, x)

[Figure: graph of the ReLU function]

Instantiation parameters:

d_model = 512
d_ff = 64
dropout = 0.2

Input:

x = mha_result
tensor([[[-0.3075,  1.5687, -2.5693,  ..., -1.1098,  0.0878, -3.3609],
         [ 3.8065, -2.4538, -0.3708,  ..., -1.5205, -1.1488, -1.3984],
         [ 2.4190,  0.5376, -2.8475,  ...,  1.4218, -0.4488, -0.2984],
         [ 2.9356,  0.3620, -3.8722,  ..., -0.7996,  0.1468,  1.0345]],

        [[ 1.1423,  0.6038,  0.0954,  ...,  2.2679, -5.7749,  1.4132],
         [ 2.4066, -0.2777,  2.8102,  ...,  0.1137, -3.9517, -2.9246],
         [ 5.8201,  1.1534, -1.9191,  ...,  0.1410, -7.6110,  1.0046],
         [ 3.1209,  1.0008, -0.5317,  ...,  2.8619, -6.3204, -1.3435]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])

Call:

ff = PositionwiseFeedForward(d_model, d_ff, dropout)
ff_result = ff(x)
print(ff_result)
print(ff_result.shape)

Output:

tensor([[[-1.9488e+00, -3.4060e-01, -1.1216e+00,  ...,  1.8203e-01,
          -2.6336e+00,  2.0917e-03],
         [-2.5875e-02,  1.1523e-01, -9.5437e-01,  ..., -2.6257e-01,
          -5.7620e-01, -1.9225e-01],
         [-8.7508e-01,  1.0092e+00, -1.6515e+00,  ...,  3.4446e-02,
          -1.5933e+00, -3.1760e-01],
         [-2.7507e-01,  4.7225e-01, -2.0318e-01,  ...,  1.0530e+00,
          -3.7910e-01, -9.7730e-01]],

        [[-2.2575e+00, -2.0904e+00,  2.9427e+00,  ...,  9.6574e-01,
          -1.9754e+00,  1.2797e+00],
         [-1.5114e+00, -4.7963e-01,  1.2881e+00,  ..., -2.4882e-02,
          -1.5896e+00, -1.0350e+00],
         [ 1.7416e-01, -4.0688e-01,  1.9289e+00,  ..., -4.9754e-01,
          -1.6320e+00, -1.5217e+00],
         [-1.0874e-01, -3.3842e-01,  2.9379e-01,  ..., -5.1276e-01,
          -1.6150e+00, -1.1295e+00]]], grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
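
"Position-wise" means the same two linear layers are applied to every position independently. The pure-Python sketch below illustrates this with made-up weights (d_model = 2, d_ff = 3) and without dropout; all names and numbers are assumptions chosen for the example.

```python
def linear(vector, weight, bias):
    """Compute weight @ vector + bias for one vector; weight is out_dim x in_dim."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weight, bias)]

def relu(vector):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in vector]

def feed_forward(sequence, w1, b1, w2, b2):
    """Apply w2(relu(w1(x))) to every position independently (dropout omitted)."""
    return [linear(relu(linear(token, w1, b1)), w2, b2) for token in sequence]

# Hypothetical tiny weights: d_model = 2, d_ff = 3.
w1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]
b2 = [0.0, 0.0]
sequence = [[1.0, 2.0], [3.0, 1.0], [1.0, 2.0]]
result = feed_forward(sequence, w1, b1, w2, b2)
print(result)
```

Positions 0 and 2 hold the same vector and therefore produce the same output: the layer mixes features within a position but never across positions.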

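2.3.5 Normalization Layer

What the normalization layer is for:
It is a standard layer in deep network models. As depth grows, the values that pass through many layers may become too large or too small, which can destabilize learning and make convergence very slow. A normalization layer is therefore inserted after a certain number of layers to keep the feature values within a reasonable range.

Code for the normalization layer: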


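class LayerNorm(nn.Module):
    def __init__(self, features, eps=1e-6):
        """features: the word-embedding dimension.
           eps: a small number added to the denominator of the normalization formula
           to avoid division by zero, default 1e-6."""
        super(LayerNorm, self).__init__()
        # Learnable per-feature gain (initialized to 1) and bias (initialized to 0).
        self.a2 = nn.Parameter(torch.ones(features))
        self.b2 = nn.Parameter(torch.zeros(features))
        self.eps = eps

    def forward(self, x):
        """x is the output of the previous layer."""
        # Mean and standard deviation over the last (feature) dimension.
        mean = x.mean(-1, keepdim=True)
        std = x.std(-1, keepdim=True)
        return self.a2 * (x - mean) / (std + self.eps) + self.b2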

Instantiation parameters:

features = d_model = 512
eps = 1e-6

Input:

x = ff_result
tensor([[[-1.9488e+00, -3.4060e-01, -1.1216e+00,  ...,  1.8203e-01,
          -2.6336e+00,  2.0917e-03],
         [-2.5875e-02,  1.1523e-01, -9.5437e-01,  ..., -2.6257e-01,
          -5.7620e-01, -1.9225e-01],
         [-8.7508e-01,  1.0092e+00, -1.6515e+00,  ...,  3.4446e-02,
          -1.5933e+00, -3.1760e-01],
         [-2.7507e-01,  4.7225e-01, -2.0318e-01,  ...,  1.0530e+00,
          -3.7910e-01, -9.7730e-01]],

        [[-2.2575e+00, -2.0904e+00,  2.9427e+00,  ...,  9.6574e-01,
          -1.9754e+00,  1.2797e+00],
         [-1.5114e+00, -4.7963e-01,  1.2881e+00,  ..., -2.4882e-02,
          -1.5896e+00, -1.0350e+00],
         [ 1.7416e-01, -4.0688e-01,  1.9289e+00,  ..., -4.9754e-01,
          -1.6320e+00, -1.5217e+00],
         [-1.0874e-01, -3.3842e-01,  2.9379e-01,  ..., -5.1276e-01,
          -1.6150e+00, -1.1295e+00]]], grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])

Call:

ln = LayerNorm(features, eps)
ln_result = ln(x)
print(ln_result)
print(ln_result.shape)

Output:

tensor([[[ 2.2697,  1.3911, -0.4417,  ...,  0.9937,  0.6589, -1.1902],
         [ 1.5876,  0.5182,  0.6220,  ...,  0.9836,  0.0338, -1.3393],
         [ 1.8261,  2.0161,  0.2272,  ...,  0.3004,  0.5660, -0.9044],
         [ 1.5429,  1.3221, -0.2933,  ...,  0.0406,  1.0603,  1.4666]],

        [[ 0.2378,  0.9952,  1.2621,  ..., -0.4334, -1.1644,  1.2082],
         [-1.0209,  0.6435,  0.4235,  ..., -0.3448, -1.0560,  1.2347],
         [-0.8158,  0.7118,  0.4110,  ...,  0.0990, -1.4833,  1.9434],
         [ 0.9857,  2.3924,  0.3819,  ...,  0.0157, -1.6300,  1.2251]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
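
For a single vector, the normalization can be worked out by hand. The sketch below is a pure-Python version of the forward pass above with gain 1 and bias 0. It divides by n - 1 because Tensor.std uses the sample standard deviation by default; note that this differs slightly from the built-in nn.LayerNorm, which uses the biased variance and places eps inside the square root. The function name layer_norm is an assumption for illustration.

```python
import math

def layer_norm(vector, eps=1e-6):
    """(x - mean) / (std + eps) with the sample (n - 1) standard deviation."""
    n = len(vector)
    mean = sum(vector) / n
    var = sum((v - mean) ** 2 for v in vector) / (n - 1)
    std = math.sqrt(var)
    return [(v - mean) / (std + eps) for v in vector]

normed = layer_norm([2.0, 4.0, 6.0, 8.0])
print(normed)
```

The result is centered on zero and has roughly unit spread, whatever the scale of the input was.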

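2.3.6 Sublayer Connection Structure

What is the sublayer connection structure:
A residual (skip) connection is used around each sublayer together with its normalization layer. We call this whole unit a sublayer connection (the sublayer plus its surrounding connections). Each encoder layer has two sublayers, and each of them, together with its surrounding connections, forms one sublayer connection structure.

[Figure: sublayer connection structure diagram]

Code for the sublayer connection structure: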


class SublayerConnection(nn.Module):
    def __init__(self, size, dropout=0.1):
        """size: usually the word-embedding dimension.
           dropout: the rate at which nodes are randomly suppressed. A suppressed node outputs 0,
           so dropout can also be seen as the rate at which output values are randomly zeroed."""
        super(SublayerConnection, self).__init__()
        self.norm = LayerNorm(size)
        self.dropout = nn.Dropout(p=dropout)

    def forward(self, x, sublayer):
        """x: the output of the previous layer or sublayer.
           sublayer: the sublayer function wrapped by this connection."""
        # Normalize the input, run the sublayer, apply dropout, then add the residual x.
        return x + self.dropout(sublayer(self.norm(x)))

Instantiation parameters:

size = 512
dropout = 0.2
head = 8
d_model = 512

Input:

x = pe_result
mask = Variable(torch.zeros(8, 4, 4))

# The sublayer here is multi-head self-attention.
self_attn =  MultiHeadedAttention(head, d_model)

# Wrap it so that it takes a single argument, as SublayerConnection expects.
sublayer = lambda x: self_attn(x, x, x, mask)

Call:

sc = SublayerConnection(size, dropout)
sc_result = sc(x, sublayer)
print(sc_result)
print(sc_result.shape)

Output:

tensor([[[ 14.8830,  22.4106, -31.4739,  ...,  21.0882, -10.0338,  -0.2588],
         [-25.1435,   2.9246, -16.1235,  ...,  10.5069,  -7.1007,  -3.7396],
         [  0.1374,  32.6438,  12.3680,  ..., -12.0251, -40.5829,   2.2297],
         [-13.3123,  55.4689,   9.5420,  ..., -12.6622,  23.4496,  21.1531]],

        [[ 13.3533,  17.5674, -13.3354,  ...,  29.1366,  -6.4898,  35.8614],
         [-35.2286,  18.7378, -31.4337,  ...,  11.1726,  20.6372,  29.8689],
         [-30.7627,   0.0000, -57.0587,  ...,  15.0724, -10.7196, -18.6290],
         [ -2.7757, -19.6408,   0.0000,  ...,  12.7660,  21.6843, -35.4784]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
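
The order of operations in forward (normalize, apply the sublayer, add the input back) can be illustrated on a plain list. This is a simplified pure-Python sketch without dropout or learnable gain and bias; the two toy sublayers are assumptions made for illustration.

```python
import math

def layer_norm(vector, eps=1e-6):
    """(x - mean) / (std + eps) with the sample standard deviation."""
    n = len(vector)
    mean = sum(vector) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in vector) / (n - 1))
    return [(v - mean) / (std + eps) for v in vector]

def sublayer_connection(x, sublayer):
    """Normalize, run the sublayer, then add the result to the original input."""
    return [xi + si for xi, si in zip(x, sublayer(layer_norm(x)))]

x = [10.0, 20.0, 30.0]
zero_sublayer = lambda v: [0.0 for _ in v]        # a sublayer that contributes nothing
double_sublayer = lambda v: [2.0 * e for e in v]  # a sublayer that doubles its input

print(sublayer_connection(x, zero_sublayer))    # the residual path returns x unchanged
print(sublayer_connection(x, double_sublayer))
```

Because the input is always added back, a sublayer only has to learn a correction to x. This is also why the values in sc_result stay on the same scale as pe_result.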

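2.3.7 Encoder Layer

What the encoder layer is for:
As the building block of the encoder, each encoder layer performs one round of feature extraction on its input, that is, one encoding step.

[Figure: structure of an encoder layer]

Code for the encoder layer: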


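class EncoderLayer(nn.Module):
    def __init__(self, size, self_attn, feed_forward, dropout):
        """size: the word-embedding dimension, which is also the size of the encoder layer.
           self_attn: an instance of the multi-head attention sublayer, used as self-attention.
           feed_forward: an instance of the feed-forward fully connected layer.
           dropout: probability of zeroing a value."""
        super(EncoderLayer, self).__init__()
        self.self_attn = self_attn
        self.feed_forward = feed_forward
        # An encoder layer has two sublayer connection structures.
        self.sublayer = clones(SublayerConnection(size, dropout), 2)
        self.size = size

    def forward(self, x, mask):
        """x: the output of the previous layer; mask: the mask tensor."""
        # First sublayer connection: multi-head self-attention (Q = K = V = x).
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, mask))
        # Second sublayer connection: feed-forward network.
        return self.sublayer[1](x, self.feed_forward)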

Instantiation parameters:

size = 512
head = 8
d_model = 512
d_ff = 64
x = pe_result
dropout = 0.2
self_attn = MultiHeadedAttention(head, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
mask = Variable(torch.zeros(8, 4, 4))

Call:

el = EncoderLayer(size, self_attn, ff, dropout)
el_result = el(x, mask)
print(el_result)
print(el_result.shape)

Output:

tensor([[[ 33.6988, -30.7224,  20.9575,  ...,   5.2968, -48.5658,  20.0734],
         [-18.1999,  34.2358,  40.3094,  ...,  10.1102,  58.3381,  58.4962],
         [ 32.1243,  16.7921,  -6.8024,  ...,  23.0022, -18.1463, -17.1263],
         [ -9.3475,  -3.3605, -55.3494,  ...,  43.6333,  -0.1900,   0.1625]],

        [[ 32.8937, -46.2808,   8.5047,  ...,  29.1837,  22.5962, -14.4349],
         [ 21.3379,  20.0657, -31.7256,  ..., -13.4079, -44.0706,  -9.9504],
         [ 19.7478,  -1.0848,  11.8884,  ...,  -9.5794,   0.0675,  -4.7123],
         [ -6.8023, -16.1176,  20.9476,  ...,  -6.5469,  34.8391, -14.9798]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])

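2.3.8 Encoder

What the encoder is for:
The encoder performs the feature extraction on the input, also called encoding. It is a stack of N encoder layers.

[Figure: structure of the encoder]

Code for the encoder: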


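class Encoder(nn.Module):
    def __init__(self, layer, N):
        """layer: an encoder layer; N: the number of encoder layers."""
        super(Encoder, self).__init__()
        # N independent copies of the encoder layer.
        self.layers = clones(layer, N)
        # Final normalization layer applied after the stack.
        self.norm = LayerNorm(layer.size)

    def forward(self, x, mask):
        """Same inputs as an encoder layer: x is the output of the previous layer,
           mask is the mask tensor."""
        # Pass x through every layer in turn, then normalize.
        for layer in self.layers:
            x = layer(x, mask)
        return self.norm(x)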

Instantiation parameters:

size = 512
head = 8
d_model = 512
d_ff = 64
dropout = 0.2
c = copy.deepcopy
attn = MultiHeadedAttention(head, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
layer = EncoderLayer(size, c(attn), c(ff), dropout)

N = 8
x = pe_result
mask = Variable(torch.zeros(8, 4, 4))

Call:

en = Encoder(layer, N)
en_result = en(x, mask)
print(en_result)
print(en_result.shape)

Output:

tensor([[[-0.2081, -0.3586, -0.2353,  ...,  2.5646, -0.2851,  0.0238],
         [ 0.7957, -0.5481,  1.2443,  ...,  0.7927,  0.6404, -0.0484],
         [-0.1212,  0.4320, -0.5644,  ...,  1.3287, -0.0935, -0.6861],
         [-0.3937, -0.6150,  2.2394,  ..., -1.5354,  0.7981,  1.7907]],

        [[-2.3005,  0.3757,  1.0360,  ...,  1.4019,  0.6493, -0.1467],
         [ 0.5653,  0.1569,  0.4075,  ..., -0.3205,  1.4774, -0.5856],
         [-1.0555,  0.0061, -1.8165,  ..., -0.4339, -1.8780,  0.2467],
         [-2.1617, -1.5532, -1.4330,  ..., -0.9433, -0.5304, -1.7022]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
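
The encoder relies on clones to give each of the N layers its own parameters. The following pure-Python sketch shows the idea with copy.deepcopy on a toy object; TinyLayer and run_stack are hypothetical stand-ins for EncoderLayer and Encoder.forward, not part of the article's code.

```python
import copy

class TinyLayer:
    """Hypothetical stand-in for an encoder layer: adds an offset to every value."""
    def __init__(self, offset):
        self.params = {"offset": offset}

    def __call__(self, x):
        return [v + self.params["offset"] for v in x]

def clones(layer, n):
    """Return n deep copies so that each stacked layer owns its parameters."""
    return [copy.deepcopy(layer) for _ in range(n)]

def run_stack(layers, x):
    """Feed the output of each layer into the next one."""
    for layer in layers:
        x = layer(x)
    return x

stack = clones(TinyLayer(1.0), 3)
stack[0].params["offset"] = 5.0  # changing one copy leaves the others untouched
print([layer.params["offset"] for layer in stack])
print(run_stack(stack, [0.0, 10.0]))
```

Had the list held three references to the same object instead of deep copies, changing one layer would have changed all of them, and the stack would effectively contain a single layer applied N times.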

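2.4 Implementing the Decoder Part

The decoder part:

Is a stack of N decoder layers
Each decoder layer consists of three sublayer connection structures
The first sublayer connection structure contains a multi-head self-attention sublayer, a normalization layer, and a residual connection
The second sublayer connection structure contains a multi-head attention sublayer (attending over the encoder output), a normalization layer, and a residual connection
The third sublayer connection structure contains a feed-forward fully connected sublayer, a normalization layer, and a residual connection

Note:
The parts of a decoder layer, namely multi-head attention, the normalization layer, the feed-forward network, and the sublayer connection structure, are implemented exactly as in the encoder, so they can be reused directly to build the decoder layer.

2.4.1 Decoder Layer

What the decoder layer is for:
As the building block of the decoder, each decoder layer extracts features from its input in the direction of the target, that is, it performs one decoding step.

Code for the decoder layer: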


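class DecoderLayer(nn.Module):
    def __init__(self, size, self_attn, src_attn, feed_forward, dropout):
        """size: the word-embedding dimension, which is also the size of the decoder layer.
           self_attn: multi-head self-attention object, used with Q = K = V.
           src_attn: multi-head attention object, used with Q != K = V.
           feed_forward: feed-forward fully connected layer object.
           dropout: probability of zeroing a value."""
        super(DecoderLayer, self).__init__()
        self.size = size
        self.self_attn = self_attn
        self.src_attn = src_attn
        self.feed_forward = feed_forward
        # A decoder layer has three sublayer connection structures.
        self.sublayer = clones(SublayerConnection(size, dropout), 3)

    def forward(self, x, memory, source_mask, target_mask):
        """x: the input from the previous layer.
           memory: the semantic representation produced by the encoder.
           source_mask, target_mask: mask tensors for the source and target data."""
        m = memory
        # Self-attention over the target; target_mask hides future positions.
        x = self.sublayer[0](x, lambda x: self.self_attn(x, x, x, target_mask))
        # Attention over the encoder output: queries from x, keys and values from memory.
        x = self.sublayer[1](x, lambda x: self.src_attn(x, m, m, source_mask))
        # Feed-forward sublayer.
        return self.sublayer[2](x, self.feed_forward)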

Instantiation parameters:

head = 8
size = 512
d_model = 512
d_ff = 64
dropout = 0.2
# Two separate instances, so the two attention sublayers do not share weights.
self_attn = MultiHeadedAttention(head, d_model, dropout)
src_attn = MultiHeadedAttention(head, d_model, dropout)

ff = PositionwiseFeedForward(d_model, d_ff, dropout)

Input:

x = pe_result

memory = en_result

mask = Variable(torch.zeros(8, 4, 4))
source_mask = target_mask = mask

Call:

dl = DecoderLayer(size, self_attn, src_attn, ff, dropout)
dl_result = dl(x, memory, source_mask, target_mask)
print(dl_result)
print(dl_result.shape)

Output:

tensor([[[ 1.9604e+00,  3.9288e+01, -5.2422e+01,  ...,  2.1041e-01,
          -5.5063e+01,  1.5233e-01],
         [ 1.0135e-01, -3.7779e-01,  6.5491e+01,  ...,  2.8062e+01,
          -3.7780e+01, -3.9577e+01],
         [ 1.9526e+01, -2.5741e+01,  2.6926e-01,  ..., -1.5316e+01,
           1.4543e+00,  2.7714e+00],
         [-2.1528e+01,  2.0141e+01,  2.1999e+01,  ...,  2.2099e+00,
          -1.7267e+01, -1.6687e+01]],

        [[ 6.7259e+00, -2.6918e+01,  1.1807e+01,  ..., -3.6453e+01,
          -2.9231e+01,  1.1288e+01],
         [ 7.7484e+01, -5.0572e-01, -1.3096e+01,  ...,  3.6302e-01,
           1.9907e+01, -1.2160e+00],
         [ 2.6703e+01,  4.4737e+01, -3.1590e+01,  ...,  4.1540e-03,
           5.2587e+00,  5.2382e+00],
         [ 4.7435e+01, -3.7599e-01,  5.0898e+01,  ...,  5.6361e+00,
           3.5891e+01,  1.5697e+01]]], grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])
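
The second sublayer differs from self-attention only in where its inputs come from: queries from the decoder, keys and values from the encoder memory. The pure-Python sketch below shows this for a target of length 2 and a source of length 3, without masks, heads, or linear projections; the function name and the numbers are assumptions made for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(x, memory):
    """Queries come from the decoder input x; keys and values both come from memory."""
    d_k = len(x[0])
    out = []
    for q in x:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * mem[c] for w, mem in zip(weights, memory)) for c in range(d_k)])
    return out

decoder_x = [[1.0, 0.0], [0.0, 1.0]]                   # 2 target positions
encoder_memory = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]  # 3 source positions
result = cross_attention(decoder_x, encoder_memory)
print(len(result), len(result[0]))  # one output row per target position
```

The output has as many rows as there are target positions, while each row is a weighted mix of the source positions. The source and target sequences therefore do not need to have the same length.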

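2.4.2 Decoder

What the decoder is for:
Based on the encoder's result and the previous prediction, it produces a feature representation of the value that may come next.

Code for the decoder: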


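class Decoder(nn.Module):
    def __init__(self, layer, N):
        """layer: a decoder layer; N: the number of decoder layers."""
        super(Decoder, self).__init__()
        # N independent copies of the decoder layer, followed by a final normalization layer.
        self.layers = clones(layer, N)
        self.norm = LayerNorm(layer.size)

    def forward(self, x, memory, source_mask, target_mask):
        """x: the embedded representation of the target data; memory: the encoder output;
           source_mask, target_mask: mask tensors for the source and target data."""
        # Pass x through every layer in turn, then normalize.
        for layer in self.layers:
            x = layer(x, memory, source_mask, target_mask)
        return self.norm(x)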

Instantiation parameters:

size = 512
d_model = 512
head = 8
d_ff = 64
dropout = 0.2
c = copy.deepcopy
attn = MultiHeadedAttention(head, d_model)
ff = PositionwiseFeedForward(d_model, d_ff, dropout)
layer = DecoderLayer(d_model, c(attn), c(attn), c(ff), dropout)
N = 8

Input:

x = pe_result
memory = en_result
mask = Variable(torch.zeros(8, 4, 4))
source_mask = target_mask = mask

Call:

de = Decoder(layer, N)
de_result = de(x, memory, source_mask, target_mask)
print(de_result)
print(de_result.shape)

Output:

tensor([[[ 0.9898, -0.3216, -1.2439,  ...,  0.7427, -0.0717, -0.0814],
         [-0.7432,  0.6985,  1.5551,  ...,  0.5232, -0.5685,  1.3387],
         [ 0.2149,  0.5274, -1.6414,  ...,  0.7476,  0.5082, -3.0132],
         [ 0.4408,  0.9416,  0.4522,  ..., -0.1506,  1.5591, -0.6453]],

        [[-0.9027,  0.5874,  0.6981,  ...,  2.2899,  0.2933, -0.7508],
         [ 1.2246, -1.0856, -0.2497,  ..., -1.2377,  0.0847, -0.0221],
         [ 3.4012, -0.4181, -2.0968,  ..., -1.5427,  0.1090, -0.3882],
         [-0.1050, -0.5140, -0.6494,  ..., -0.4358, -1.2173,  0.4161]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 512])

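2.5 Implementing the Output Part

The output part consists of:

A linear layer
A softmax layer

What the linear layer does:
It applies a linear transformation to the previous step's result to produce an output of the specified dimension, that is, it changes the dimension (here from d_model to the vocabulary size).
What the softmax layer does:
It rescales the numbers in the last dimension into probabilities between 0 and 1 that sum to 1. The code that follows uses F.log_softmax, which returns the logarithm of these probabilities.

Code for the linear and softmax layers: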


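import torch.nn.functional as F

class Generator(nn.Module):
    def __init__(self, d_model, vocab_size):
        """d_model: the word-embedding dimension; vocab_size: the size of the vocabulary."""
        super(Generator, self).__init__()
        # Projects each position from d_model features to one score per vocabulary entry.
        self.project = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        """x is the output tensor of the previous layer."""
        # Log-probabilities over the vocabulary (last dimension).
        return F.log_softmax(self.project(x), dim=-1)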

nn.Linear demo:

>>> m = nn.Linear(20, 30)
>>> input = torch.randn(128, 20)
>>> output = m(input)
>>> print(output.size())
torch.Size([128, 30])

实例化参数:


d_model = 512

vocab_size = 1000

输入参数:


x = de_result

调用:

gen = Generator(d_model, vocab_size)
gen_result = gen(x)
print(gen_result)
print(gen_result.shape)

Output:

tensor([[[-7.8098, -7.5260, -6.9244,  ..., -7.6340, -6.9026, -7.5232],
         [-6.9093, -7.3295, -7.2972,  ..., -6.6221, -7.2268, -7.0772],
         [-7.0263, -7.2229, -7.8533,  ..., -6.7307, -6.9294, -7.3042],
         [-6.5045, -6.0504, -6.6241,  ..., -5.9063, -6.5361, -7.1484]],

        [[-7.1651, -6.0224, -7.4931,  ..., -7.9565, -8.0460, -6.6490],
         [-6.3779, -7.6133, -8.3572,  ..., -6.6565, -7.1867, -6.5112],
         [-6.4914, -6.9289, -6.2634,  ..., -6.2471, -7.5348, -6.8541],
         [-6.8651, -7.0460, -7.6239,  ..., -7.1411, -6.5496, -7.3749]]],
       grad_fn=<LogSoftmaxBackward>)
torch.Size([2, 4, 1000])
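A tensor of shape [batch, sequence length, vocabulary size] such as the one above holds one log-probability per vocabulary entry at every position. One common way to turn it into token ids is to take the index of the largest value at each position (greedy selection). The sketch below does this in plain Python on a tiny made-up example with a vocabulary of 3; the function name `greedy_tokens` and the numbers are assumptions for illustration, not part of the original code.

```python
def greedy_tokens(log_probs):
    """For every position, return the index of the largest log-probability."""
    return [
        [max(range(len(position)), key=position.__getitem__) for position in sequence]
        for sequence in log_probs
    ]

# Hypothetical log-probabilities: batch of 1, sequence length 2, vocabulary size 3.
example = [[[-2.3, -0.2, -1.9],
            [-0.1, -3.0, -2.5]]]

token_ids = greedy_tokens(example)
print(token_ids)   # [[1, 0]]
```

Because the logarithm is monotonic, picking the largest log-probability selects the same token as picking the largest probability.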

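2.6 Building the model

In the sections above we implemented every component; next we implement the complete encoder-decoder structure.

[Figure: overall Transformer architecture]

Code implementation of the encoder-decoder structure: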


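class EncoderDecoder(nn.Module):
    def __init__(self, encoder, decoder, source_embed, target_embed, generator):
        """Takes five components: the encoder, the decoder, the source embedding function,
           the target embedding function, and the output generator."""
        super(EncoderDecoder, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.src_embed = source_embed
        self.tgt_embed = target_embed
        self.generator = generator

    def forward(self, source, target, source_mask, target_mask):
        """source and target are the source and target token-id tensors;
           source_mask and target_mask are the corresponding mask tensors."""
        # Encode the source, decode against the resulting memory,
        # then project the decoder output onto the vocabulary.
        return self.generator(self.decode(self.encode(source, source_mask), source_mask,
                                          target, target_mask))

    def encode(self, source, source_mask):
        """Embed the source and pass it through the encoder."""
        return self.encoder(self.src_embed(source), source_mask)

    def decode(self, memory, source_mask, target, target_mask):
        """Embed the target and pass it through the decoder; memory is the encoder output."""
        return self.decoder(self.tgt_embed(target), memory, source_mask, target_mask)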
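The order in which the five components are invoked can be traced without any tensors. The sketch below is a toy stand-in, not the real model: each component is replaced by a hypothetical `Trace` object that records its name and returns a string describing its call, and `forward` repeats the same nesting as `EncoderDecoder.forward`.

```python
class Trace:
    """Toy component: records that it was called and returns a readable description."""
    def __init__(self, name, log):
        self.name, self.log = name, log

    def __call__(self, *args):
        self.log.append(self.name)
        return f"{self.name}(" + ", ".join(map(str, args)) + ")"

def forward(src_embed, encoder, tgt_embed, decoder, generator,
            source, target, source_mask, target_mask):
    # Same data flow as EncoderDecoder.forward: encode, then decode, then generate.
    memory = encoder(src_embed(source), source_mask)
    hidden = decoder(tgt_embed(target), memory, source_mask, target_mask)
    return generator(hidden)

log = []
parts = [Trace(n, log) for n in ("src_embed", "encoder", "tgt_embed", "decoder", "generator")]
result = forward(*parts, "src", "tgt", "src_mask", "tgt_mask")

print(log)
print(result)
```

The trace shows that the source mask is used twice: once inside the encoder and once inside the decoder, where it masks the attention over the encoder memory.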

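Instantiation parameters:

vocab_size = 1000
d_model = 512
# en, de and gen are the encoder, decoder and generator objects created in the earlier sections
encoder = en
decoder = de
# plain nn.Embedding layers stand in for the embedding functions in this demo
source_embed = nn.Embedding(vocab_size, d_model)
target_embed = nn.Embedding(vocab_size, d_model)
generator = gen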

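Input parameters:

# token ids for a batch of 2 sequences of length 4; source and target share one tensor purely for demonstration
source = target = Variable(torch.LongTensor([[100, 2, 421, 508], [491, 998, 1, 221]]))

# the same mask tensor is used for source and target in this demo
source_mask = target_mask = Variable(torch.zeros(8, 4, 4))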
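It helps to see what an all-zero mask does to attention weights. The sketch below assumes the common convention that positions where the mask equals 0 have their scores replaced by a very large negative number (-1e9) before the softmax; that convention, the helper name `masked_softmax`, and the score values are assumptions of this sketch. Under that assumption an all-zero mask does not crash anything: every score becomes equal, so the weights become uniform.

```python
import math

def masked_softmax(scores, mask, fill=-1e9):
    """Softmax over scores after replacing positions where mask == 0 with `fill`."""
    filled = [s if m != 0 else fill for s, m in zip(scores, mask)]
    mx = max(filled)
    exps = [math.exp(v - mx) for v in filled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.5, 2.0, -1.0, 0.3]   # hypothetical attention scores for one query

all_zero = masked_softmax(scores, [0, 0, 0, 0])   # everything masked
partial = masked_softmax(scores, [1, 1, 0, 0])    # only the first two positions visible

print(all_zero)   # [0.25, 0.25, 0.25, 0.25]
print(partial)    # weight only on the first two positions
```

This is why the zero masks used in these demos are only suitable for checking shapes; a real run would pass masks that mark padding and future positions.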

Call:

ed = EncoderDecoder(encoder, decoder, source_embed, target_embed, generator)
ed_result = ed(source, target, source_mask, target_mask)
print(ed_result)
print(ed_result.shape)

Output:

tensor([[[ 0.2102, -0.0826, -0.0550,  ...,  1.5555,  1.3025, -0.6296],
         [ 0.8270, -0.5372, -0.9559,  ...,  0.3665,  0.4338, -0.7505],
         [ 0.4956, -0.5133, -0.9323,  ...,  1.0773,  1.1913, -0.6240],
         [ 0.5770, -0.6258, -0.4833,  ...,  0.1171,  1.0069, -1.9030]],

        [[-0.4355, -1.7115, -1.5685,  ..., -0.6941, -0.1878, -0.1137],
         [-0.8867, -1.2207, -1.4151,  ..., -0.9618,  0.1722, -0.9562],
         [-0.0946, -0.9012, -1.6388,  ..., -0.2604, -0.3357, -0.6436],
         [-1.1204, -1.4481, -1.5888,  ..., -0.8816, -0.6497,  0.0606]]],
       grad_fn=<AddBackward0>)
torch.Size([2, 4, 1000])

Next, we build the model used for training on top of this structure.

Code walkthrough of the Transformer model construction:

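def make_model(source_vocab, target_vocab, N=6,
               d_model=512, d_ff=2048, head=8, dropout=0.1):
    """Build the model. The seven parameters are: the number of source features (vocabulary size),
       the number of target features (vocabulary size), the number of stacked encoder and decoder layers,
       the word-embedding dimension, the inner dimension of the feed-forward network,
       the number of heads in multi-head attention, and the dropout rate."""

    # Shorthand for deep copy, so that every place a module is used gets its own parameters.
    c = copy.deepcopy

    # Instantiate the multi-head attention module.
    attn = MultiHeadedAttention(head, d_model)

    # Instantiate the position-wise feed-forward module.
    ff = PositionwiseFeedForward(d_model, d_ff, dropout)

    # Instantiate the positional encoder.
    position = PositionalEncoding(d_model, dropout)

    # Assemble the encoder, decoder, source embedding, target embedding and generator.
    model = EncoderDecoder(
        Encoder(EncoderLayer(d_model, c(attn), c(ff), dropout), N),
        Decoder(DecoderLayer(d_model, c(attn), c(attn),
                             c(ff), dropout), N),
        nn.Sequential(Embeddings(d_model, source_vocab), c(position)),
        nn.Sequential(Embeddings(d_model, target_vocab), c(position)),
        Generator(d_model, target_vocab))

    # Initialize every parameter with more than one dimension (the weight matrices)
    # with Xavier uniform initialization. xavier_uniform_ is the in-place form that
    # replaces the deprecated xavier_uniform.
    for p in model.parameters():
        if p.dim() > 1:
            nn.init.xavier_uniform_(p)
    return model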
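make_model passes every shared module through copy.deepcopy before using it. The small sketch below illustrates why, using a hypothetical `TinyLayer` class with a plain list as its "weight": reusing the same object twice means both slots change together, while deep copies stay independent.

```python
import copy

class TinyLayer:
    """Hypothetical stand-in for a module with trainable state."""
    def __init__(self):
        self.weight = [0.0, 0.0]

shared = TinyLayer()

# Reusing one object: both list entries point at the same layer.
aliased = [shared, shared]
# Deep-copying: each entry is a separate layer with its own weight list.
copied = [copy.deepcopy(shared), copy.deepcopy(shared)]

aliased[0].weight[0] = 1.0
copied[0].weight[0] = 1.0

print(aliased[1].weight)   # [1.0, 0.0]  the change leaked into the other slot
print(copied[1].weight)    # [0.0, 0.0]  the copy is unaffected
```

Without the copies, the self-attention and source-attention sublayers of a decoder layer would train one shared set of weights instead of two separate ones.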

nn.init.xavier_uniform_ demo:


>>> w = torch.empty(3, 5)
>>> w = nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('relu'))
>>> w
tensor([[-0.7742,  0.5413,  0.5478, -0.4806, -0.2555],
        [-0.8358,  0.4673,  0.3012,  0.3882, -0.6375],
        [ 0.4622, -0.0794,  0.1851,  0.8462, -0.3591]])
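Xavier uniform initialization draws values from a uniform distribution on [-a, a], where a = gain * sqrt(6 / (fan_in + fan_out)). The sketch below computes that bound in plain Python for the 3 x 5 tensor from the demo, using sqrt(2) as the ReLU gain, and checks that the values printed above fall inside it. The helper name `xavier_uniform_bound` is local to this sketch.

```python
import math

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    """Half-width a of the uniform range [-a, a] used by Xavier uniform initialization."""
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

# For a 2-D tensor of shape (3, 5): fan_out = 3 rows, fan_in = 5 columns.
relu_gain = math.sqrt(2.0)   # the recommended gain for ReLU
bound = xavier_uniform_bound(fan_in=5, fan_out=3, gain=relu_gain)

# Values copied from the demo output above.
demo_values = [-0.7742, 0.5413, 0.5478, -0.4806, -0.2555,
               -0.8358, 0.4673, 0.3012, 0.3882, -0.6375,
               0.4622, -0.0794, 0.1851, 0.8462, -0.3591]

print(round(bound, 4))                             # 1.2247
print(all(abs(v) <= bound for v in demo_values))   # True
```

The bound shrinks as the layer gets wider, which keeps the variance of activations roughly constant from layer to layer.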

Input parameters:

source_vocab = 11
target_vocab = 11
N = 6

Call:

if __name__ == '__main__':
    res = make_model(source_vocab, target_vocab, N)
    print(res)

Output:


EncoderDecoder(
  (encoder): Encoder(
    (layers): ModuleList(
      (0): EncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048)
          (w_2): Linear(in_features=2048, out_features=512)
          (dropout): Dropout(p=0.1)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (1): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
        )
      )
      (1): EncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048)
          (w_2): Linear(in_features=2048, out_features=512)
          (dropout): Dropout(p=0.1)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (1): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
        )
      )
    )
    (norm): LayerNorm(
    )
  )
  (decoder): Decoder(
    (layers): ModuleList(
      (0): DecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (src_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048)
          (w_2): Linear(in_features=2048, out_features=512)
          (dropout): Dropout(p=0.1)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (1): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (2): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
        )
      )
      (1): DecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (src_attn): MultiHeadedAttention(
          (linears): ModuleList(
            (0): Linear(in_features=512, out_features=512)
            (1): Linear(in_features=512, out_features=512)
            (2): Linear(in_features=512, out_features=512)
            (3): Linear(in_features=512, out_features=512)
          )
          (dropout): Dropout(p=0.1)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048)
          (w_2): Linear(in_features=2048, out_features=512)
          (dropout): Dropout(p=0.1)
        )
        (sublayer): ModuleList(
          (0): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (1): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
          (2): SublayerConnection(
            (norm): LayerNorm(
            )
            (dropout): Dropout(p=0.1)
          )
        )
      )
    )
    (norm): LayerNorm(
    )
  )
  (src_embed): Sequential(
    (0): Embeddings(
      (lut): Embedding(11, 512)
    )
    (1): PositionalEncoding(
      (dropout): Dropout(p=0.1)
    )
  )
  (tgt_embed): Sequential(
    (0): Embeddings(
      (lut): Embedding(11, 512)
    )
    (1): PositionalEncoding(
      (dropout): Dropout(p=0.1)
    )
  )
  (generator): Generator(
    (proj): Linear(in_features=512, out_features=11)
  )
)

3. Building a language model with Transformer

https://blog.csdn.net/sinat_28015305/article/details/109410129

Original: https://blog.csdn.net/mengxianglong123/article/details/126261479
Author: 落花雨时
Title: Deep Learning: Transformer Architecture Analysis (深度学习 Transformer架构解析)
