ViT Source Code Walkthrough

July 28, 2023, 6:06 PM • Artificial Intelligence • 66 reads

1. Project configuration

Parameters:

Dataset:

--name cifar10-100_500

--dataset cifar10

Model variant:

--model_type ViT-B_16

Pretrained weights:

--pretrained_dir checkpoint/ViT-B_16.npz
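
As a rough illustration of how the flags listed above might be consumed, the sketch below declares them with `argparse`. Only the flag names and example values come from the article; the `build_parser` helper, the `required` setting, the defaults, and the comments on each flag's purpose are assumptions made for illustration and are not taken from the project.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser for the four flags listed above."""
    parser = argparse.ArgumentParser(description="ViT fine-tuning (sketch)")
    # Name of this run (assumed to be used for logs and checkpoints).
    parser.add_argument("--name", required=True)
    # Dataset to train on; only cifar10 appears in the article.
    parser.add_argument("--dataset", default="cifar10")
    # Model variant; ViT-B_16 is the Base model with 16x16 patches.
    parser.add_argument("--model_type", default="ViT-B_16")
    # Path to the pretrained weights stored as an .npz file.
    parser.add_argument("--pretrained_dir", default="checkpoint/ViT-B_16.npz")
    return parser


# Parse the exact values shown in the article instead of reading sys.argv.
args = build_parser().parse_args([
    "--name", "cifar10-100_500",
    "--dataset", "cifar10",
    "--model_type", "ViT-B_16",
    "--pretrained_dir", "checkpoint/ViT-B_16.npz",
])
print(args.name, args.dataset, args.model_type, args.pretrained_dir)
```

Passing the argument list explicitly keeps the sketch runnable without a command line; in a real script `parse_args()` would be called with no arguments.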

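2. Patch embedding and position embedding

For the image encoding, take ViT-B/16 as an example. A convolution with a 16 × 16 kernel and a stride of 16 is applied to the image, so a batch of 16 images of size 224 × 224 becomes a tensor of shape [16, 768, 14, 14]. This is then reshaped to [16, 196, 768], and the 0th patch (the classification token) of shape [16, 1, 768] is concatenated in front of it, giving [16, 197, 768].

For the position encoding, a learnable tensor of shape [1, 197, 768] is constructed.

Finally, the image encoding and the position encoding are added together (the position encoding is broadcast over the batch), which completes the embedding.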
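
The shape arithmetic in this section can be checked with a few lines of plain Python. This is only a sketch: the helper names `conv_output_size` and `embedding_shapes` are made up for illustration, and the batch size of 16 and image size of 224 are the values implied by the shapes quoted in the text.

```python
def conv_output_size(size: int, kernel: int, stride: int) -> int:
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def embedding_shapes(batch: int, img_size: int, patch: int, hidden: int) -> dict:
    """Trace the tensor shapes produced by the patch and position embedding."""
    grid = conv_output_size(img_size, patch, patch)  # patches per side
    n_patches = grid * grid                          # patches per image
    return {
        "conv_out": (batch, hidden, grid, grid),     # after the strided conv
        "flattened": (batch, n_patches, hidden),     # after flatten + transpose
        "with_cls": (batch, n_patches + 1, hidden),  # after prepending the cls token
        "pos_embed": (1, n_patches + 1, hidden),     # broadcast over the batch
    }


shapes = embedding_shapes(batch=16, img_size=224, patch=16, hidden=768)
for name, shape in shapes.items():
    print(name, shape)
```

Because the stride equals the kernel size, the convolution cuts the image into non-overlapping patches, which is why the token count is simply (224 / 16)² plus one classification token.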

The code is as follows:

import torch
import torch.nn as nn
from torch.nn import Conv2d, Dropout
from torch.nn.modules.utils import _pair

# ResNetV2 (the hybrid backbone) is defined elsewhere in the project.


class Embeddings(nn.Module):
    """Construct the embeddings from patch, position embeddings."""

    def __init__(self, config, img_size, in_channels=3):
        super(Embeddings, self).__init__()
        self.hybrid = None
        img_size = _pair(img_size)

        # Patch size and number of patches (n_patches)
        if config.patches.get("grid") is not None:
            grid_size = config.patches["grid"]
            patch_size = (img_size[0] // 16 // grid_size[0], img_size[1] // 16 // grid_size[1])
            n_patches = (img_size[0] // 16) * (img_size[1] // 16)
            self.hybrid = True
        else:
            patch_size = _pair(config.patches["size"])
            n_patches = (img_size[0] // patch_size[0]) * (img_size[1] // patch_size[1])
            self.hybrid = False

        # Use the hybrid model (ResNet backbone in front of the transformer)
        if self.hybrid:
            self.hybrid_model = ResNetV2(block_units=config.resnet.num_layers,
                                         width_factor=config.resnet.width_factor)
            in_channels = self.hybrid_model.width * 16
        # Patch embedding: output is [16, 768, 14, 14] for ViT-B/16
        self.patch_embeddings = Conv2d(in_channels=in_channels,
                                       out_channels=config.hidden_size,
                                       kernel_size=patch_size,
                                       stride=patch_size)
        # Initialize position_embeddings: [1, 197, 768]
        self.position_embeddings = nn.Parameter(torch.zeros(1, n_patches+1, config.hidden_size))
        # Initialize the 0th patch, which represents the classification feature: [1, 1, 768]
        self.cls_token = nn.Parameter(torch.zeros(1, 1, config.hidden_size))
        # Dropout layer
        self.dropout = Dropout(config.transformer["dropout_rate"])

    def forward(self, x):
        print(x.shape)
        B = x.shape[0]
        # Expand cls_token over the batch: [16, 1, 768]
        cls_tokens = self.cls_token.expand(B, -1, -1)
        print(cls_tokens.shape)
        # Hybrid model
        if self.hybrid:
            x = self.hybrid_model(x)
        # Encode: [16, 768, 14, 14]
        x = self.patch_embeddings(x)
        print(x.shape)
        # Flatten the spatial dimensions: [16, 768, 14, 14] --> [16, 768, 196]
        x = x.flatten(2)
        print(x.shape)
        # [16, 768, 196] --> [16, 196, 768]
        x = x.transpose(-1, -2)
        print(x.shape)
        # Prepend the classification patch: [16, 197, 768]
        x = torch.cat((cls_tokens, x), dim=1)
        print(x.shape)

        # Add the position encoding
        embeddings = x + self.position_embeddings
        print(embeddings.shape)
        # Dropout layer
        embeddings = self.dropout(embeddings)
        print(embeddings.shape)
        return embeddings

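3. Encoder

Multi-head attention module:

First, three linear projections produce q, k and v. Because multi-head attention with 12 heads is used, q, k and v are reshaped from [16, 197, 768] to [16, 12, 197, 64]. The similarity qk between q and k is then computed; since it describes the pairwise relationship between all tokens, its shape is [16, 12, 197, 197]. The scores are scaled by dividing by the square root of the head size (64) and passed through softmax. The resulting weights are multiplied with v to give the extracted features of shape [16, 12, 197, 64], which are finally restored to [16, 197, 768].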
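
To make the computation concrete, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It is not the project's implementation: it omits the learned q/k/v projections, the split into 12 heads, and dropout, and the function names `softmax` and `attention` are chosen only for this illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention on lists of vectors."""
    d = len(q[0])
    # Similarity of every query with every key, scaled by sqrt(d).
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
              for qi in q]
    # Each row of scores becomes a probability distribution over the tokens.
    probs = [softmax(row) for row in scores]
    # Each output vector is a weighted average of the value vectors.
    out = [[sum(p * vj[c] for p, vj in zip(row, v)) for c in range(len(v[0]))]
           for row in probs]
    return out, probs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, head size 2
out, probs = attention(tokens, tokens, tokens)
print(len(probs), len(probs[0]))  # attention map: tokens x tokens
print(len(out), len(out[0]))      # output: tokens x head size
```

With 197 tokens and a head size of 768 / 12 = 64, the same computation yields a 197 × 197 attention map and a 197 × 64 output per head, matching the shapes quoted above.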

import math

import torch
import torch.nn as nn
from torch.nn import Dropout, Linear, Softmax


class Attention(nn.Module):
    def __init__(self, config, vis):
        super(Attention, self).__init__()
        self.vis = vis
        # Number of heads
        self.num_attention_heads = config.transformer["num_heads"]
        # Vector dimension of each head
        self.attention_head_size = int(config.hidden_size / self.num_attention_heads)
        # Total size over all heads
        self.all_head_size = self.num_attention_heads * self.attention_head_size
        # Query projection
        self.query = Linear(config.hidden_size, self.all_head_size)
        # Key projection
        self.key = Linear(config.hidden_size, self.all_head_size)
        # Value projection
        self.value = Linear(config.hidden_size, self.all_head_size)
        # Output fully connected layer
        self.out = Linear(config.hidden_size, config.hidden_size)
        # Dropout layers
        self.attn_dropout = Dropout(config.transformer["attention_dropout_rate"])
        self.proj_dropout = Dropout(config.transformer["attention_dropout_rate"])

        self.softmax = Softmax(dim=-1)

    def transpose_for_scores(self, x):
        # Shape: [16, 197, 768] --> [16, 197, 12, 64]
        new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size)
        x = x.view(*new_x_shape)
        # [16, 197, 12, 64] --> [16, 12, 197, 64]
        return x.permute(0, 2, 1, 3)

    def forward(self, hidden_states):
        # q, k, v: [16, 197, 768]
        mixed_query_layer = self.query(hidden_states)
        mixed_key_layer = self.key(hidden_states)
        mixed_value_layer = self.value(hidden_states)
        # q, k, v: [16, 197, 768] --> [16, 12, 197, 64]
        query_layer = self.transpose_for_scores(mixed_query_layer)
        key_layer = self.transpose_for_scores(mixed_key_layer)
        value_layer = self.transpose_for_scores(mixed_value_layer)
        # Similarity of q and k: [16, 12, 197, 197]
        attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
        # Scale by the square root of the head size
        attention_scores = attention_scores / math.sqrt(self.attention_head_size)
        attention_probs = self.softmax(attention_scores)
        # Keep the attention weights only when visualization is requested
        weights = attention_probs if self.vis else None
        attention_probs = self.attn_dropout(attention_probs)
        # Weighted sum of the values: [16, 12, 197, 64]
        context_layer = torch.matmul(attention_probs, value_layer)
        # [16, 12, 197, 64] --> [16, 197, 12, 64]
        context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
        # [16, 197, 12, 64] --> [16, 197, 768]
        new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
        context_layer = context_layer.view(*new_context_layer_shape)
        # Output fully connected layer: [16, 197, 768]
        attention_output = self.out(context_layer)
        # Dropout layer
        attention_output = self.proj_dropout(attention_output)
        return attention_output, weights

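Transformer encoder

The input x first passes through layer normalization and then multi-head attention, and the result is added back to the input through a residual connection. It then passes through a second layer normalization and a two-layer fully connected network (the MLP), followed by another residual connection. This gives the output of one block; stacking L such blocks produces the final output.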
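
The order of operations in one block (normalize, transform, add the residual) can be sketched without any framework. The code below is an illustration under simplifying assumptions: `layer_norm` has no learned scale or shift, and the attention and MLP are passed in as arbitrary functions (`attn_fn`, `mlp_fn`), with `mean_mix` as a made-up stand-in for attention. None of these names come from the project.

```python
import math


def layer_norm(vec, eps=1e-6):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]


def add(a, b):
    """Element-wise sum of two token sequences (the residual connection)."""
    return [[x + y for x, y in zip(t, u)] for t, u in zip(a, b)]


def encoder_block(tokens, attn_fn, mlp_fn):
    """Pre-norm block: x = x + attn(LN(x)), then x = x + mlp(LN(x))."""
    tokens = add(tokens, attn_fn([layer_norm(t) for t in tokens]))
    tokens = add(tokens, [mlp_fn(layer_norm(t)) for t in tokens])
    return tokens


def encoder(tokens, num_layers, attn_fn, mlp_fn):
    """Stack num_layers blocks; the shape of tokens never changes."""
    for _ in range(num_layers):
        tokens = encoder_block(tokens, attn_fn, mlp_fn)
    return tokens


def mean_mix(tokens):
    """Stand-in for attention: every token receives the average token."""
    avg = [sum(col) / len(tokens) for col in zip(*tokens)]
    return [avg[:] for _ in tokens]


x = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
y = encoder(x, num_layers=2, attn_fn=mean_mix,
            mlp_fn=lambda t: [0.5 * v for v in t])
print(len(y), len(y[0]))
```

Because each sub-layer only adds to its input, a block whose attention and MLP output zeros leaves the input unchanged, which is one reason residual connections make deep stacks easier to train.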

import torch.nn as nn
from torch.nn import LayerNorm

# Mlp is defined elsewhere in the project; Attention is the class shown above.


class Block(nn.Module):
    def __init__(self, config, vis):
        super(Block, self).__init__()
        # Hidden size of each token: 768
        self.hidden_size = config.hidden_size
        # Layer normalization
        self.attention_norm = LayerNorm(config.hidden_size, eps=1e-6)
        self.ffn_norm = LayerNorm(config.hidden_size, eps=1e-6)
        # MLP layer
        self.ffn = Mlp(config)
        # Multi-head attention
        self.attn = Attention(config, vis)

    def forward(self, x):
        # x: [16, 197, 768]
        h = x
        # Layer normalization
        x = self.attention_norm(x)
        # Multi-head attention
        x, weights = self.attn(x)
        # Residual connection
        x = x + h

        h = x
        # Layer normalization
        x = self.ffn_norm(x)
        # MLP layer
        x = self.ffn(x)
        # Residual connection
        x = x + h
        return x, weights

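Overall architecture

The input x goes through patch embedding and position embedding, after which its shape is [16, 197, 768]. It is then fed into the encoder and passes through L encoder blocks. The encoding of the 0th patch (the classification feature) is taken out and fed into the classification layer to obtain the prediction.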
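
The last step, taking the 0th token and turning it into class scores and a loss, can be illustrated in plain Python. This is a toy sketch with 3 tokens, a hidden size of 2 and 3 classes; the `classify` and `cross_entropy` helpers and all numbers are invented for the example, and the loss is computed for a single sample rather than averaged over a batch.

```python
import math


def classify(encoded, weight, bias):
    """Apply a linear head to the class token (index 0) of one sequence."""
    cls = encoded[0]
    return [sum(w * c for w, c in zip(row, cls)) + b
            for row, b in zip(weight, bias)]


def cross_entropy(logits, label):
    """Cross-entropy of one sample: -log softmax(logits)[label]."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum - logits[label]


# Encoder output for one image: only the first row reaches the classifier.
encoded = [[1.0, 0.0], [5.0, 5.0], [7.0, 7.0]]
weight = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # one row per class
bias = [0.0, 0.0, 0.0]

logits = classify(encoded, weight, bias)
prediction = logits.index(max(logits))
loss = cross_entropy(logits, label=0)
print(logits, prediction, loss)
```

Only the classification token is read out, so the patch tokens influence the prediction solely through the attention layers that mixed their information into token 0.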

import torch.nn as nn
from torch.nn import CrossEntropyLoss, Linear

# Transformer (embeddings + stacked encoder blocks) is defined elsewhere in the project.


class VisionTransformer(nn.Module):
    def __init__(self, config, img_size=224, num_classes=21843, zero_head=False, vis=False):
        super(VisionTransformer, self).__init__()
        self.num_classes = num_classes
        self.zero_head = zero_head
        self.classifier = config.classifier

        self.transformer = Transformer(config, img_size, vis)
        self.head = Linear(config.hidden_size, num_classes)

    def forward(self, x, labels=None):
        x, attn_weights = self.transformer(x)
        print(x.shape)
        # x.shape: [16, 197, 768]; logits.shape: [16, 10] when num_classes is 10 (CIFAR-10)
        logits = self.head(x[:, 0])
        print(logits.shape)
        # Cross-entropy loss when labels are given
        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_classes), labels.view(-1))
            return loss
        else:
            return logits, attn_weights

Original: https://blog.csdn.net/qq_52053775/article/details/126261070
Author: 樱花的浪漫
Title: ViT Source Code Walkthrough (VIT 源码详解)

