[Advanced] A Start-to-Finish Study Log of "20 Days to Master PyTorch in Practice" | Day03 | Text Data Modeling Workflow Example

July 22, 2023, 9:45 PM • Artificial Intelligence • 95 reads

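💖About the author: Hi everyone, I'm 车神哥, the 车神 of 府学路18号 🥇
⚡About -> 车神: from the dorm to the lab takes 3 minutes at the fastest and 3.5 minutes at the slowest (the extra half minute is spent waiting at the traffic light)
📝Homepage: 车手只需要车和手，压力来自论文_府学路18号车神_CSDN博客
🥇Official certification: high-quality creator in the artificial intelligence field
🎉Like ➕ comment ➕ favorite == build the habit (one-click triple) 😋
⚡Thanks for all your support 🤗~ let's keep going together 😁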

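Column
"20 Days to Master PyTorch in Practice"

I work through "20 Days to Master PyTorch in Practice" at irregular intervals; if you're interested, follow along with the column~

Open source is free, and knowledge is priceless~

The source code, book, and dataset used here have already been downloaded for you and are placed at the end of the post; help yourself~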


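1. 🎉 Preparing the Data (with unresolved issues)

The workflow is the same as in the previous posts.

The goal of the imdb dataset is to predict the sentiment label of a movie review from its text.

The training set contains 20,000 movie reviews and the test set contains 5,000; in each, positive and negative reviews make up half.

Text preprocessing is fairly tedious. It includes Chinese word segmentation (not needed in this example), building a vocabulary, converting tokens to ids, padding sequences, building the data pipeline, and so on.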
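The steps listed above (building a vocabulary, converting tokens to ids, padding sequences) can be sketched in plain Python without torchtext. This is only an illustration of the idea, not what torchtext does internally: the helper names build_vocab and encode are made up for this sketch, and placing the unknown-word token at id 0 and the padding token at id 1 is an assumption chosen to mirror the vocabulary output shown later in this post.

```python
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # special tokens (assumed ids: <unk>=0, <pad>=1)

def build_vocab(tokenized_docs):
    """Assign an integer id to every token, most frequent tokens first."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    itos = [UNK, PAD] + [tok for tok, _ in counts.most_common()]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(doc, stoi, max_len):
    """Convert tokens to ids, then truncate or right-pad to max_len."""
    ids = [stoi.get(tok, stoi[UNK]) for tok in doc][:max_len]
    return ids + [stoi[PAD]] * (max_len - len(ids))

docs = [["a", "good", "movie"], ["a", "bad", "one"]]
stoi, itos = build_vocab(docs)

# "great" never appeared in docs, so it is encoded as the <unk> id
encoded = encode(["a", "great", "movie"], stoi, max_len=5)
print(encoded)
```

Every encoded review ends up with the same length, which is what allows reviews to be stacked into a batch tensor.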

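In torch, text data is usually preprocessed with torchtext or a custom Dataset. torchtext is very powerful and can build datasets for NLP tasks such as text classification, sequence labeling, question answering, and machine translation.

Below it is only used to build a text classification dataset.

For a more complete tutorial, see the following Zhihu article: 《pytorch学习笔记—Torchtext》

https://zhuanlan.zhihu.com/p/65833208

That link seems to be dead, so here is another explanation of Torchtext that I found on CSDN: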

https://blog.csdn.net/nlpuser/article/details/88067167

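An overview of common torchtext APIs:

torchtext.data.Example: represents one sample (data and label)
torchtext.vocab.Vocab: the vocabulary; can load pretrained word vectors
torchtext.data.Dataset: the dataset class; __getitem__ returns an Example instance. torchtext.data.TabularDataset is a subclass.
torchtext.data.Field: defines how a field (text field, label field) is processed: the preprocessing applied when an Example is created and some operations applied at batching time.
torchtext.data.Iterator: the iterator that produces batches
torchtext.datasets: contains common datasets.

Now start preprocessing the data (this is a standardized build process, so it is worth working through and typing out yourself):

Note that this uses a new library, which can be installed directly with pip: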

pip install torchtext

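import torch
import string, re
import torchtext

MAX_WORDS = 10000  # only keep the MAX_WORDS most frequent words
MAX_LEN = 200      # every sample is padded or truncated to MAX_LEN tokens
BATCH_SIZE = 20

# Tokenizer: strip punctuation, then split on spaces
tokenizer = lambda x: re.sub('[%s]' % string.punctuation, "", x).split(" ")

# Filter low-frequency words: ids >= MAX_WORDS are mapped to 0 (the <unk> id)
def filterLowFreqWords(arr, vocab):
    arr = [[x if x < MAX_WORDS else 0 for x in example]
           for example in arr]
    return arr

# 1. Define how each field is preprocessed
# (torchtext.data.Field only exists in older torchtext releases)
TEXT = torchtext.data.Field(sequential=True, tokenize=tokenizer, lower=True,
                  fix_length=MAX_LEN, postprocessing=filterLowFreqWords)

LABEL = torchtext.data.Field(sequential=False, use_vocab=False)

# 2. Build tabular datasets; each row is: label, text
ds_train, ds_valid = torchtext.data.TabularDataset.splits(
        path='./data/imdb', train='train.tsv', test='test.tsv', format='tsv',
        fields=[('label', LABEL), ('text', TEXT)], skip_header=False)

# 3. Build the vocabulary from the training set
TEXT.build_vocab(ds_train)

# 4. Build the data pipeline iterators
train_iter, valid_iter = torchtext.data.Iterator.splits(
        (ds_train, ds_valid), sort_within_batch=True, sort_key=lambda x: len(x.text),
        batch_sizes=(BATCH_SIZE, BATCH_SIZE))

# Inspect the first training sample
print(ds_train[0].text)
print(ds_train[0].label)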

Note: this raises an error. It looks like the library has been updated and the old usage no longer works:

AttributeError: module 'torchtext.data' has no attribute 'Field'

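Solution:

Replace from torchtext.data import Field with from torchtext.legacy.data import Field (this does not work for torchtext 0.12.0, though). From torchtext 0.9 through 0.11, Field lives under torchtext.legacy, which is why the comment sections of other blogs report that this replacement works. In torchtext 0.12.0, both the legacy directory and Field are gone, so the code above can no longer be used and raises an error.
The only option is to install an older version, sigh~
Pick the torchtext version that matches your installed PyTorch version.

To install an older torchtext:

conda install -c pytorch torchtext==<version>

Installing 0.6 or 0.8 directly both work, and then it can be used~

Installing directly with pip also works: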

pip install torchtext==0.4
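To summarize the version situation described above, here is a small sketch that maps a torchtext version string to the module where Field can be imported from. The function name field_import_path is hypothetical, and the version boundaries (Field in torchtext.data before 0.9, in torchtext.legacy.data from 0.9 through 0.11, removed from 0.12 on) reflect my reading of the torchtext release history, so double-check them against your installed version.

```python
def field_import_path(version):
    """Return the module that exposes Field for a torchtext version string,
    or None if that release no longer ships Field at all."""
    major, minor = (int(part) for part in version.split(".")[:2])
    if (major, minor) < (0, 9):
        return "torchtext.data"          # old API, as used in this post
    if (major, minor) < (0, 12):
        return "torchtext.legacy.data"   # transitional legacy namespace
    return None                          # legacy code removed

for v in ["0.4", "0.6.0", "0.11.0", "0.12.0"]:
    print(v, "->", field_import_path(v))
```

In practice the version string would come from torchtext.__version__ after importing torchtext.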

With that problem solved, the next one shows up:

OverflowError: Python int too large to convert to C long

I tried all sorts of fixes and could not resolve it, sigh~
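One possible cause, stated here as an assumption rather than a confirmed diagnosis: this error message is what Python raises when csv.field_size_limit is called with sys.maxsize on a platform where a C long is 32 bits wide (for example Windows), and tsv loading code sometimes makes that call to allow very long fields. The sketch below shows the usual safe pattern, which shrinks the limit until the platform accepts it. The function name set_max_csv_field_size is made up for this example, and whether applying it resolves the error in this post is untested.

```python
import csv
import sys

def set_max_csv_field_size():
    """Raise the csv field size limit as far as the platform allows."""
    limit = sys.maxsize
    while True:
        try:
            csv.field_size_limit(limit)
            return limit
        except OverflowError:
            # The value does not fit in a C long; retry with a smaller one.
            limit //= 10

applied = set_max_csv_field_size()
print(applied)
```

If the failing call sits inside the installed torchtext itself, setting the limit from user code will not help; trying a different torchtext release (such as the 0.6 mentioned above) may be the more practical route.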

The printed sample looks like this:

['it', 'really', 'boggles', 'my', 'mind', 'when', 'someone', 'comes', 'across', 'a', 'movie', 'like', 'this', 'and', 'claims', 'it', 'to', 'be', 'one', 'of', 'the', 'worst', 'slasher', 'films', 'out', 'there', 'this', 'is', 'by', 'far', 'not', 'one', 'of', 'the', 'worst', 'out', 'there', 'still', 'not', 'a', 'good', 'movie', 'but', 'not', 'the', 'worst', 'nonetheless', 'go', 'see', 'something', 'like', 'death', 'nurse', 'or', 'blood', 'lake', 'and', 'then', 'come', 'back', 'to', 'me', 'and', 'tell', 'me', 'if', 'you', 'think', 'the', 'night', 'brings', 'charlie', 'is', 'the', 'worst', 'the', 'film', 'has', 'decent', 'camera', 'work', 'and', 'editing', 'which', 'is', 'way', 'more', 'than', 'i', 'can', 'say', 'for', 'many', 'more', 'extremely', 'obscure', 'slasher', 'filmsbr', 'br', 'the', 'film', 'doesnt', 'deliver', 'on', 'the', 'onscreen', 'deaths', 'theres', 'one', 'death', 'where', 'you', 'see', 'his', 'pruning', 'saw', 'rip', 'into', 'a', 'neck', 'but', 'all', 'other', 'deaths', 'are', 'hardly', 'interesting', 'but', 'the', 'lack', 'of', 'onscreen', 'graphic', 'violence', 'doesnt', 'mean', 'this', 'isnt', 'a', 'slasher', 'film', 'just', 'a', 'bad', 'onebr', 'br', 'the', 'film', 'was', 'obviously', 'intended', 'not', 'to', 'be', 'taken', 'too', 'seriously', 'the', 'film', 'came', 'in', 'at', 'the', 'end', 'of', 'the', 'second', 'slasher', 'cycle', 'so', 'it', 'certainly', 'was', 'a', 'reflection', 'on', 'traditional', 'slasher', 'elements', 'done', 'in', 'a', 'tongue', 'in', 'cheek', 'way', 'for', 'example', 'after', 'a', 'kill', 'charlie', 'goes', 'to', 'the', 'towns', 'welcome', 'sign', 'and', 'marks', 'the', 'population', 'down', 'one', 'less', 'this', 'is', 'something', 'that', 'can', 'only', 'get', 'a', 'laughbr', 'br', 'if', 'youre', 'into', 'slasher', 'films', 'definitely', 'give', 'this', 'film', 'a', 'watch', 'it', 'is', 'slightly', 'different', 'than', 'your', 'usual', 'slasher', 'film', 'with', 'possibility', 'of', 'two', 'killers', 'but', 'not', 
'by', 'much', 'the', 'comedy', 'of', 'the', 'movie', 'is', 'pretty', 'much', 'telling', 'the', 'audience', 'to', 'relax', 'and', 'not', 'take', 'the', 'movie', 'so', 'god', 'darn', 'serious', 'you', 'may', 'forget', 'the', 'movie', 'you', 'may', 'remember', 'it', 'ill', 'remember', 'it', 'because', 'i', 'love', 'the', 'name']
0
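The tokens 'filmsbr', 'onebr', and 'br' in the output above most likely come from HTML line-break tags (<br />) in the raw reviews: the tokenizer deletes the punctuation characters <, / and > but keeps the letters br. A minimal sketch of a tokenizer that replaces those tags with spaces first is shown below. The names BR_TAG, PUNCT, and tokenize are made up for this example, and the sample sentence is invented; it is not part of the original tutorial code.

```python
import re
import string

BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)
PUNCT = re.compile("[%s]" % re.escape(string.punctuation))

def tokenize(text):
    """Replace <br> tags with spaces, strip punctuation, split on whitespace."""
    text = BR_TAG.sub(" ", text)
    text = PUNCT.sub("", text)
    return text.lower().split()

# The tokenizer used in this post, for comparison
original_tokenizer = lambda x: re.sub('[%s]' % string.punctuation, "", x).split(" ")

sample = "Obscure slasher films.<br /><br />The film doesn't deliver."
print(original_tokenizer(sample.lower()))  # contains 'filmsbr' and 'br'
print(tokenize(sample))                    # tags are gone
```

Using split() without an argument also avoids empty tokens when several spaces appear in a row.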

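Next, inspect the vocabulary:

# Vocabulary size
print(len(TEXT.vocab))

# itos: index to string
print(TEXT.vocab.itos[0])
print(TEXT.vocab.itos[1])

# stoi: string to index
print(TEXT.vocab.stoi['<unk>'])
print(TEXT.vocab.stoi['<pad>'])

# freqs: word frequencies
print(TEXT.vocab.freqs['<unk>'])
print(TEXT.vocab.freqs['a'])
print(TEXT.vocab.freqs['good'])

The output is: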

108197
<unk>
<pad>
0
1
0
129453
11457
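The vocabulary above has 108197 entries, while MAX_WORDS is 10000 and the embedding layer defined later in this post has only MAX_WORDS rows. Since the vocabulary is ordered by descending frequency after the special tokens, ids of 10000 and above belong to rarer words, and filterLowFreqWords maps them to 0, the <unk> id. The sketch below replays that mapping on a tiny made-up batch; the function name filter_low_freq_ids and the sample ids are illustrative only.

```python
MAX_WORDS = 10000  # same value as in the post
UNK_ID = 0         # id of <unk>, as shown by the vocabulary output

def filter_low_freq_ids(batch, max_words=MAX_WORDS, unk_id=UNK_ID):
    """Replace every id that falls outside the embedding table with unk_id."""
    return [[i if i < max_words else unk_id for i in example]
            for example in batch]

batch = [[17, 31, 10000, 108196], [2, 9999, 1, 1]]
filtered = filter_low_freq_ids(batch)
print(filtered)
```

Without this step, an id such as 108196 would index past the end of an embedding table with 10000 rows.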

Then look at one batch from the data pipeline:


for batch in train_iter:
    features = batch.text
    labels = batch.label
    print(features)
    print(features.shape)
    print(labels)
    break

Printed result:

tensor([[  17,   31,  148,  ...,   54,   11,  201],
        [   2,    2,  904,  ...,  335,    7,  109],
        [1371, 1737,   44,  ...,  806,    2,   11],
        ...,
        [   6,    5,   62,  ...,    1,    1,    1],
        [ 170,    0,   27,  ...,    1,    1,    1],
        [  15,    0,   45,  ...,    1,    1,    1]])
torch.Size([200, 20])
tensor([0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0])


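# Wrap the iterator so that it yields (features, label) pairs,
# similar to torch.utils.data.DataLoader
class DataLoader:
    def __init__(self, data_iter):
        self.data_iter = data_iter
        self.length = len(data_iter)

    def __len__(self):
        return self.length

    def __iter__(self):
        # batch.text has shape (seq_len, batch); the model expects (batch, seq_len)
        for batch in self.data_iter:
            yield (torch.transpose(batch.text, 0, 1),
                   torch.unsqueeze(batch.label.float(), dim=1))

dl_train = DataLoader(train_iter)
dl_valid = DataLoader(valid_iter)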

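Since the problem above is still unresolved, I won't explain much from here on. The basic idea is much the same. If anyone has solved the problem above, please leave a comment~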

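2. 🎉 Defining the Model

There are usually three ways to build a model in PyTorch: build it layer by layer with nn.Sequential, subclass nn.Module to build a custom model, or subclass nn.Module and wrap parts of the model in model containers (nn.Sequential, nn.ModuleList, nn.ModuleDict).

The third approach is used here.

This matches the image example and the Day01 structured-data example. Models will presumably all be defined this way from now on, so I won't explain it again. Only the modeling differs depending on the kind of data; by and large it is a standardized model (it's just calling libraries, nothing to be embarrassed about).

Import the libraries: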

import torch
from torch import nn
from torchkeras import LightModel,summary

Define the model:

import torch
from torch import nn

torch.random.seed()  # reseed the RNG with a non-deterministic value

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()

        # padding_idx=1 keeps the <pad> embedding fixed at zero
        self.embedding = nn.Embedding(num_embeddings=MAX_WORDS, embedding_dim=3, padding_idx=1)
        self.conv = nn.Sequential()
        self.conv.add_module("conv_1", nn.Conv1d(in_channels=3, out_channels=16, kernel_size=5))
        self.conv.add_module("pool_1", nn.MaxPool1d(kernel_size=2))
        self.conv.add_module("relu_1", nn.ReLU())
        self.conv.add_module("conv_2", nn.Conv1d(in_channels=16, out_channels=128, kernel_size=2))
        self.conv.add_module("pool_2", nn.MaxPool1d(kernel_size=2))
        self.conv.add_module("relu_2", nn.ReLU())

        self.dense = nn.Sequential()
        self.dense.add_module("flatten", nn.Flatten())
        # 128 channels * 48 remaining positions = 6144 features
        self.dense.add_module("linear", nn.Linear(6144, 1))
        self.dense.add_module("sigmoid", nn.Sigmoid())

    def forward(self, x):
        # (batch, seq_len) -> (batch, seq_len, 3) -> (batch, 3, seq_len) for Conv1d
        x = self.embedding(x).transpose(1, 2)
        x = self.conv(x)
        y = self.dense(x)
        return y

net = Net()
print(net)

# summary is the torchkeras function imported above
summary(net, input_shape=(200,), input_dtype=torch.LongTensor)

The printed network:

Net(
  (embedding): Embedding(10000, 3, padding_idx=1)
  (conv): Sequential(
    (conv_1): Conv1d(3, 16, kernel_size=(5,), stride=(1,))
    (pool_1): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (relu_1): ReLU()
    (conv_2): Conv1d(16, 128, kernel_size=(2,), stride=(1,))
    (pool_2): MaxPool1d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (relu_2): ReLU()
  )
  (dense): Sequential(
    (flatten): Flatten()
    (linear): Linear(in_features=6144, out_features=1, bias=True)
    (sigmoid): Sigmoid()
  )
)
Input size (MB): 0.000763
Forward/backward pass size (MB): 0.287796
Params size (MB): 0.154972
Estimated Total Size (MB): 0.443531
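The in_features=6144 of the linear layer and the reported parameter size can be checked by hand. The sketch below redoes the arithmetic in plain Python, assuming stride 1 and no padding for the convolutions, pooling stride equal to the kernel size (as in the printed network), and 4 bytes per float32 parameter. The helper names conv1d_out_len and maxpool1d_out_len are made up for this example.

```python
def conv1d_out_len(length, kernel_size, stride=1):
    """Output length of a 1-D convolution without padding or dilation."""
    return (length - kernel_size) // stride + 1

def maxpool1d_out_len(length, kernel_size):
    """Output length of 1-D max pooling with stride equal to kernel_size."""
    return (length - kernel_size) // kernel_size + 1

MAX_WORDS, MAX_LEN, EMB_DIM = 10000, 200, 3

length = conv1d_out_len(MAX_LEN, kernel_size=5)     # conv_1: 200 -> 196
length = maxpool1d_out_len(length, kernel_size=2)   # pool_1: 196 -> 98
length = conv1d_out_len(length, kernel_size=2)      # conv_2: 98 -> 97
length = maxpool1d_out_len(length, kernel_size=2)   # pool_2: 97 -> 48
flat_features = 128 * length                        # 128 output channels

params = (
    MAX_WORDS * EMB_DIM        # embedding table
    + 3 * 16 * 5 + 16          # conv_1 weights + biases
    + 16 * 128 * 2 + 128       # conv_2 weights + biases
    + flat_features + 1        # linear weights + bias
)
params_mb = params * 4 / 1024 ** 2

print(flat_features, params, round(params_mb, 6))
```

The result agrees with the Linear(in_features=6144, ...) line and with the Params size (MB) value printed by summary. If MAX_LEN or any kernel size changes, the 6144 in nn.Linear has to be recomputed the same way.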

Original: https://blog.csdn.net/weixin_44333889/article/details/124186466
Author: 府学路18号车神
Title: [Advanced] A Start-to-Finish Study Log of "20 Days to Master PyTorch in Practice" | Day03 | Text Data Modeling Workflow Example

