ViT Architecture Explained (with PyTorch Code)

This post follows another article, with some annotations of my own added.

From the paper: AN IMAGE IS WORTH 16X16 WORDS: TRANSFORMERS FOR IMAGE RECOGNITION AT SCALE

ViT applies the Transformer to images. The Transformer paper: Attention Is All You Need.

The ViT architecture is shown below:

[Figure: overall ViT architecture]
As the figure shows, the image is split into small patches, which enter the Transformer in order, like the words of an NLP sentence; after an MLP head, the class probabilities come out.
Each patch is 16×16 and goes through the Linear Projection of Flattened Patches; a cls token is prepended to the sequence, and position information, the position embedding, is added to every token.
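As a quick sanity check on those numbers, here is a standalone sketch, assuming the standard 224×224 RGB input and 16×16 patches used throughout this post:

```python
# Patch arithmetic for ViT's input: how many 16x16 patches a
# 224x224x3 image yields, and how long each flattened patch vector is.
img_size, patch_size, channels = 224, 16, 3

grid = img_size // patch_size                    # patches per side
num_patches = grid * grid                        # sequence length (before cls)
patch_dim = patch_size * patch_size * channels   # flattened patch vector length

print(grid, num_patches, patch_dim)  # 14 196 768
```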

We implement it bottom-up, in this order: position embedding (inside the patch embedding), the Transformer encoder, the classification head, and finally ViT.
First, the imports:

import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt

from torch import nn
from torch import Tensor
from PIL import Image
from torchvision.transforms import Compose, Resize, ToTensor
from einops import rearrange, reduce, repeat
from einops.layers.torch import Rearrange, Reduce
from torchsummary import summary

The input image must be 224x224x3, so resize it first:


transform = Compose([Resize((224, 224)), ToTensor()])
x = transform(img)   # img is a PIL image opened earlier (e.g. with Image.open)
x = x.unsqueeze(0)   # add the batch dimension
x.shape

The shape is now [1, 3, 224, 224].

Splitting the image into patches

[Figure: the image cut into 16×16 patches]

patch_size = 16
patches = rearrange(x, 'b c (h s1) (w s2) -> b (h w) (s1 s2 c)', s1=patch_size, s2=patch_size)

In the rearrange pattern, (h s1) means h×s1, with s1 = patch_size = 16; from h×16 = 224 we can work out h, the number of patches along the height, and likewise w, the number along the width.
The output pattern b (h w) (s1 s2 c) flattens each 16×16×3 patch into a single vector; each batch element then holds h×w such vectors, which is exactly the h×w patches laid out in a row as in the figure above.
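The same rearrange can be spelled out with plain NumPy reshape/transpose calls, which makes the axis bookkeeping explicit (a sketch; the einops one-liner above is what the post actually uses):

```python
import numpy as np

b, c, H, W, p = 1, 3, 224, 224, 16
x = np.arange(b * c * H * W, dtype=np.float32).reshape(b, c, H, W)

# Equivalent of rearrange(x, 'b c (h s1) (w s2) -> b (h w) (s1 s2 c)', s1=p, s2=p):
h, w = H // p, W // p
patches = (x.reshape(b, c, h, p, w, p)      # split H -> (h, s1), W -> (w, s2)
            .transpose(0, 2, 4, 3, 5, 1)    # reorder axes to b h w s1 s2 c
            .reshape(b, h * w, p * p * c))  # merge (h w) and (s1 s2 c)
print(patches.shape)  # (1, 196, 768)
```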

Then these patches go through a linear layer that changes each vector's dimension.

[Figure: Linear Projection of Flattened Patches]

The steps above can be wrapped in a class (later a Conv2d will replace the linear layer for efficiency); each flattened vector is projected to dimension emb_size:
class PatchEmbedding(nn.Module):
    def __init__(self, in_channels: int = 3, patch_size: int = 16, emb_size: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.projection = nn.Sequential(
            # flatten each 16x16x3 patch into one vector, then project to emb_size
            Rearrange('b c (h s1) (w s2) -> b (h w) (s1 s2 c)', s1=patch_size, s2=patch_size),
            nn.Linear(patch_size * patch_size * in_channels, emb_size)
        )

    def forward(self, x: Tensor) -> Tensor:
        x = self.projection(x)
        return x

PatchEmbedding()(x).shape

torch.Size([1, 196, 768])

CLS token

Now we add the cls token and each patch's position information, the position embedding, to the patch vectors just produced.
The cls token is a learnable vector prepended at the start of every sequence.
One image's string of patches forms one sequence, so the cls token goes in front of them; this single emb_size vector is copied batch_size times.
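The copying step can be seen with a NumPy broadcast, the counterpart of repeat(self.cls_token, '() n e -> b n e', b=b) in the code below (a standalone sketch):

```python
import numpy as np

emb_size, b = 768, 4
cls_token = np.zeros((1, 1, emb_size))                     # one learnable token, (1, 1, e)
cls_tokens = np.broadcast_to(cls_token, (b, 1, emb_size))  # one copy per batch element
print(cls_tokens.shape)  # (4, 1, 768)
```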

class PatchEmbedding(nn.Module):
    def __init__(self, in_channels: int = 3, patch_size: int = 16, emb_size: int = 768):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Sequential(
            # a Conv2d with kernel = stride = patch_size plays the role of
            # the rearrange + linear projection above
            nn.Conv2d(in_channels, emb_size, kernel_size=patch_size, stride=patch_size),
            Rearrange('b e (h) (w) -> b (h w) e'),
        )
        self.cls_token = nn.Parameter(torch.randn(1, 1, emb_size))

    def forward(self, x: Tensor) -> Tensor:
        b, _, _, _ = x.shape
        x = self.proj(x)
        # copy the single cls token once per batch element
        cls_tokens = repeat(self.cls_token, '() n e -> b n e', b=b)
        x = torch.cat([cls_tokens, x], dim=1)
        return x

PatchEmbedding()(x).shape

The shape is now torch.Size([1, 197, 768]); before adding the cls token it was torch.Size([1, 196, 768]). See the figure below.

Position embedding

Position information must be added for every patch vector, but how exactly? In ViT the position embedding is learned.
In the figure below, the * is the cls token; every token, cls included, gets a position.
So the number of positions is: number of patches + 1 (the cls token takes position 0).
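For the defaults used in this post, that count works out as:

```python
img_size, patch_size = 224, 16
num_positions = (img_size // patch_size) ** 2 + 1  # 196 patches + 1 cls token
print(num_positions)  # 197
```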

[Figure: learned position embeddings, with * marking the cls token]

So we add a few lines to the PatchEmbedding class; the position embedding is simply added element-wise:
class PatchEmbedding(nn.Module):
    def __init__(self, in_channels: int = 3, patch_size: int = 16, emb_size: int = 768, img_size: int = 224):
        super().__init__()
        self.patch_size = patch_size
        self.projection = nn.Sequential(
            nn.Conv2d(in_channels, emb_size, kernel_size=patch_size, stride=patch_size),
            Rearrange('b e (h) (w) -> b (h w) e'),
        )
        self.cls_token = nn.Parameter(torch.randn(1, 1, emb_size))
        # one learned position vector per token: 14*14 patches + 1 cls = 197
        self.positions = nn.Parameter(torch.randn((img_size // patch_size) ** 2 + 1, emb_size))

    def forward(self, x: Tensor) -> Tensor:
        b, _, _, _ = x.shape
        x = self.projection(x)
        cls_tokens = repeat(self.cls_token, '() n e -> b n e', b=b)
        x = torch.cat([cls_tokens, x], dim=1)
        x += self.positions   # broadcasts over the batch dimension
        return x

PatchEmbedding()(x).shape

The shape is still torch.Size([1, 197, 768]).

The next step is to implement the Transformer, but only its encoder part is needed. Its structure is as follows:

[Figure: Transformer encoder structure]

Let's start with attention.

Attention

[Figure: scaled dot-product and multi-head attention]

Attention has three inputs: query, key, value.
The query and key are used to compute the attention matrix, which then weights the value.
Multi-head attention runs n heads in parallel.

For the implementation we could use PyTorch's built-in nn.MultiheadAttention, or write our own.
To understand the details, let's implement it ourselves, referring back to the Transformer architecture.

We need four fully-connected layers: three for query, key, and value, and one for the output projection after the attention dropout.
The overall flow:

class MultiHeadAttention(nn.Module):
    def __init__(self, emb_size: int = 768, num_heads: int = 8, dropout: float = 0):
        super().__init__()
        self.emb_size = emb_size
        self.num_heads = num_heads
        self.keys = nn.Linear(emb_size, emb_size)
        self.queries = nn.Linear(emb_size, emb_size)
        self.values = nn.Linear(emb_size, emb_size)
        self.att_drop = nn.Dropout(dropout)
        self.projection = nn.Linear(emb_size, emb_size)

    def forward(self, x: Tensor, mask: Tensor = None) -> Tensor:
        # split the embedding dimension across the heads
        queries = rearrange(self.queries(x), "b n (h d) -> b h n d", h=self.num_heads)
        keys = rearrange(self.keys(x), "b n (h d) -> b h n d", h=self.num_heads)
        values = rearrange(self.values(x), "b n (h d) -> b h n d", h=self.num_heads)

        energy = torch.einsum('bhqd, bhkd -> bhqk', queries, keys)
        if mask is not None:
            fill_value = torch.finfo(torch.float32).min
            energy = energy.masked_fill(~mask, fill_value)

        # scale before the softmax, then normalize
        scaling = self.emb_size ** (1 / 2)
        att = F.softmax(energy / scaling, dim=-1)
        att = self.att_drop(att)

        out = torch.einsum('bhal, bhlv -> bhav', att, values)
        out = rearrange(out, "b h n d -> b n (h d)")
        out = self.projection(out)
        return out

Let's walk through this code.

Because this is multi-head attention, the query, key, and value have to be reshaped into a per-head layout, which is done with einops.rearrange.
Query, key, and value usually have the same shape, and here they all come from the single input x.
That is what these lines do:

queries = rearrange(self.queries(x), "b n (h d) -> b h n d", h=self.num_heads)
keys = rearrange(self.keys(x), "b n (h d) -> b h n d", h=self.num_heads)
values = rearrange(self.values(x), "b n (h d) -> b h n d", h=self.num_heads)

The resulting layout (b h n d) is (batch, heads, sequence_len, head_dim), where head_dim = emb_size / num_heads.
Recall how the attention matrix is computed:

[Formula: Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V]

First multiply query and key, divide by a scaling factor, apply softmax, then multiply by value.
Read 'bhqd, bhkd -> bhqk' as shapes: a (b, h, q, d) matrix times a (b, h, k, d) matrix, i.e. q×d times the transpose of k×d, giving q×k.

energy = torch.einsum('bhqd, bhkd -> bhqk', queries, keys)
att = F.softmax(energy / scaling, dim=-1)
out = torch.einsum('bhal, bhlv -> bhav', att, values)

The einsum output has shape (batch, heads, query_len, head_dim); the rearrange then merges the heads back, giving (batch, sequence_len, emb_size).
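To confirm that the einsum really is a batched matmul against the transposed key, here is a NumPy check with random data (a standalone sketch; 96 is the per-head dimension 768 / 8):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8, 197, 96))  # (b, h, query_len, d)
k = rng.standard_normal((1, 8, 197, 96))  # (b, h, key_len, d)

energy = np.einsum('bhqd,bhkd->bhqk', q, k)
energy_matmul = q @ k.transpose(0, 1, 3, 2)  # q x d times (k x d) transposed

print(energy.shape)                        # (1, 8, 197, 197)
print(np.allclose(energy, energy_matmul))  # True
```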

Alternatively, query, key, and value can be fused into a single matrix qkv, as follows:

class MultiHeadAttention(nn.Module):
    def __init__(self, emb_size: int = 768, num_heads: int = 8, dropout: float = 0):
        super().__init__()
        self.emb_size = emb_size
        self.num_heads = num_heads
        # one linear layer produces queries, keys, and values in one go
        self.qkv = nn.Linear(emb_size, emb_size * 3)
        self.att_drop = nn.Dropout(dropout)
        self.projection = nn.Linear(emb_size, emb_size)

    def forward(self, x: Tensor, mask: Tensor = None) -> Tensor:
        qkv = rearrange(self.qkv(x), "b n (h d qkv) -> (qkv) b h n d", h=self.num_heads, qkv=3)
        queries, keys, values = qkv[0], qkv[1], qkv[2]

        energy = torch.einsum('bhqd, bhkd -> bhqk', queries, keys)
        if mask is not None:
            fill_value = torch.finfo(torch.float32).min
            energy = energy.masked_fill(~mask, fill_value)

        # scale before the softmax, then normalize
        scaling = self.emb_size ** (1 / 2)
        att = F.softmax(energy / scaling, dim=-1)
        att = self.att_drop(att)

        out = torch.einsum('bhal, bhlv -> bhav', att, values)
        out = rearrange(out, "b h n d -> b n (h d)")
        out = self.projection(out)
        return out

patches_embedded = PatchEmbedding()(x)
MultiHeadAttention()(patches_embedded).shape

Residuals

This corresponds to the block below.

[Figure: residual (skip) connections in the encoder block]

Since the residual connection is used again later, we write it as a wrapper that takes a function; this makes the later code more convenient.
class ResidualAdd(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn

    def forward(self, x, **kwargs):
        res = x
        x = self.fn(x, **kwargs)
        x += res
        return x
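The wrapper's behaviour reduces to "apply fn, then add the input back", which can be checked without torch (a toy sketch with a plain function standing in for an nn.Module):

```python
# fn(x) + x with a toy fn; any shape-preserving fn behaves the same way.
def residual_add(fn, x):
    return fn(x) + x

out = residual_add(lambda v: 2 * v, 10.0)
print(out)  # 2*10 + 10 = 30.0
```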

The attention output feeds into LayerNorm and then the MLP.

[Figure: LayerNorm and MLP inside the encoder block]

The MLP is a multi-layer perceptron; it is simply two linear layers (with a GELU in between) that expand the dimension and then restore it:

class FeedForwardBlock(nn.Sequential):
    def __init__(self, emb_size: int, expansion: int = 4, drop_p: float = 0.):
        super().__init__(
            nn.Linear(emb_size, expansion * emb_size),
            nn.GELU(),
            nn.Dropout(drop_p),
            nn.Linear(expansion * emb_size, emb_size),
        )
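The shape flow of the block, expand by 4×, GELU, project back, can be traced in NumPy (a sketch with random placeholder weights, using the tanh approximation of GELU):

```python
import numpy as np

emb_size, expansion, n = 768, 4, 197
rng = np.random.default_rng(0)
W1 = rng.standard_normal((emb_size, expansion * emb_size)) * 0.02  # 768 -> 3072
W2 = rng.standard_normal((expansion * emb_size, emb_size)) * 0.02  # 3072 -> 768

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

x = rng.standard_normal((1, n, emb_size))
out = gelu(x @ W1) @ W2
print(out.shape)  # (1, 197, 768): same shape in and out
```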

Now let's assemble the Transformer encoder block:

[Figure: one Transformer encoder block]
class TransformerEncoderBlock(nn.Sequential):
    def __init__(self,
                 emb_size: int = 768,
                 drop_p: float = 0.,
                 forward_expansion: int = 4,
                 forward_drop_p: float = 0.,
                 **kwargs):
        super().__init__(
            ResidualAdd(nn.Sequential(
                nn.LayerNorm(emb_size),
                MultiHeadAttention(emb_size, **kwargs),
                nn.Dropout(drop_p)
            )),
            ResidualAdd(nn.Sequential(
                nn.LayerNorm(emb_size),
                FeedForwardBlock(
                    emb_size, expansion=forward_expansion, drop_p=forward_drop_p),
                nn.Dropout(drop_p)
            ))
        )

Let's test it:

patches_embedded = PatchEmbedding()(x)
TransformerEncoderBlock()(patches_embedded).shape

The output is torch.Size([1, 197, 768]).

The encoder is L stacked copies (the L× in the figure) of TransformerEncoderBlock:

class TransformerEncoder(nn.Sequential):
    def __init__(self, depth: int = 12, **kwargs):
        super().__init__(*[TransformerEncoderBlock(**kwargs) for _ in range(depth)])

The final stage predicts the probability of each class.
The whole sequence first passes through a module that takes the mean over the tokens:

[Figure: classification head]
class ClassificationHead(nn.Sequential):
    def __init__(self, emb_size: int = 768, n_classes: int = 1000):
        super().__init__(
            Reduce('b n e -> b e', reduction='mean'),
            nn.LayerNorm(emb_size),
            nn.Linear(emb_size, n_classes))
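The Reduce('b n e -> b e', reduction='mean') layer is just a mean over the token dimension; its NumPy equivalent (a sketch with random data):

```python
import numpy as np

x = np.random.default_rng(0).standard_normal((1, 197, 768))  # encoder output
pooled = x.mean(axis=1)  # average over the 197 tokens: 'b n e -> b e'
print(pooled.shape)  # (1, 768)
```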
ViT

Combining all the modules above gives the full ViT:

class ViT(nn.Sequential):
    def __init__(self,
                in_channels: int = 3,
                patch_size: int = 16,
                emb_size: int = 768,
                img_size: int = 224,
                depth: int = 12,
                n_classes: int = 1000,
                **kwargs):
        super().__init__(
            PatchEmbedding(in_channels, patch_size, emb_size, img_size),
            TransformerEncoder(depth, emb_size=emb_size, **kwargs),
            ClassificationHead(emb_size, n_classes)
        )
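End to end, the shapes with the defaults above work out as follows (pure bookkeeping, no torch needed):

```python
img_size, patch_size, emb_size, n_classes = 224, 16, 768, 1000
seq_len = (img_size // patch_size) ** 2 + 1        # 196 patches + cls = 197

pipeline = [
    ('input image',        (1, 3, img_size, img_size)),
    ('PatchEmbedding',     (1, seq_len, emb_size)),
    ('TransformerEncoder', (1, seq_len, emb_size)),  # shape-preserving
    ('ClassificationHead', (1, n_classes)),
]
for name, shape in pipeline:
    print(name, shape)
```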

Original: https://blog.csdn.net/level_code/article/details/126173408
Author: 蓝羽飞鸟
Title: ViT结构详解(附pytorch代码)

Reposted from: https://www.johngo689.com/624227/ (please credit the original author when reposting)
