Deep Learning Notes (3): Sentiment Classification with PyTorch + TextCNN (Food Delivery Dataset)


0 Preface

Dataset: a food-delivery review dataset with 11,987 samples and 2 label classes.
Environment: RTX 3060 Laptop GPU

1 Data Preparation

1.1 Constants

These include the batch size, number of epochs, the TextCNN sliding-window sizes, the embedding and feature sizes, the number of label classes, and so on.


BATCH_SIZE = 64
EPOCHS = 50
WINDOWS_SIZE = [2, 4, 3]
MAX_LEN = 200
EMBEDDING_DIM = 600
FEATURE_SIZE = 200
N_CLASS = 2

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
loss_func = nn.CrossEntropyLoss()

loss_list, accuracy_list = [], []

1.2 Loading the Dataset

Read the data with pandas and encode the labels with LabelEncoder.


def data_prepare():

    dataset = pd.read_csv('../data/waimai10k.txt', delimiter=',')
    labels = np.array(dataset['label'])
    labels = LabelEncoder().fit_transform(labels)
    return dataset, labels
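For reference, the input file is expected to be a two-column CSV with a header row; a hypothetical illustration (not actual rows from the dataset):

label,review
1,"很快,好吃,味道足,量大"
0,送餐太慢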

2 Data Preprocessing

Steps:

  1. Remove useless characters
  2. Tokenize with jieba
  3. Remove low-frequency words
  4. Serialize the token lists

Note that stop words are not removed here.


def clear_character(sentence):
    # letters and digits
    pattern1 = re.compile('[a-zA-Z0-9]')
    # anything that is not whitespace, a digit, a colon, or a Chinese character
    pattern2 = re.compile(r'[^\s1234567890::' + '\u4e00-\u9fa5]+')
    # Chinese and ASCII punctuation
    pattern3 = re.compile('[%s]+' % re.escape(punctuation + string.punctuation))
    line1 = re.sub(pattern1, '', sentence)
    line2 = re.sub(pattern2, '', line1)
    line3 = re.sub(pattern3, '', line2)
    new_sentence = ''.join(line3.split())  # also drop any remaining whitespace
    return new_sentence
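A quick check of what clear_character strips (the example string is made up): letters, digits, and both Chinese and ASCII punctuation are removed, while the Chinese text is kept.

print(clear_character('味道不错ok123,量也大!'))  # -> 味道不错量也大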

def preprocessing(df, col_name):
    t1 = time.time()
    print('Removing useless characters')
    df[col_name + '_processed'] = df[col_name].apply(clear_character)

    print('Tokenizing with jieba')
    cut_words = []
    for content in df[col_name + '_processed'].values:
        seg_list = jieba.lcut(content)
        cut_words.append(seg_list)

    print('Removing low-frequency words')
    min_threshold = 20
    word_list = []
    for seg_list in cut_words:
        word_list.extend(seg_list)
    counter = Counter(word_list)
    delete_set = {k for k, v in counter.items() if v < min_threshold}
    print(f'Number of low-frequency words to remove: {len(delete_set)}')
    # rebuild each token list instead of calling remove() while iterating,
    # which would skip adjacent low-frequency tokens
    cut_words = [[seg for seg in seg_list if seg not in delete_set]
                 for seg_list in tqdm(cut_words)]

    print('Serializing the token lists')
    with open('../data/cut_words_waimai.pkl', 'wb') as f:
        pickle.dump(cut_words, f)

    t2 = time.time()
    print(f'Total time: {t2 - t1:.1f}s')

3 Text Representation

Three steps: build word2index, map each sentence to an index list (sent2index), and collect all of them into sent2indexs; words are represented directly by their vocabulary indices.


def compute_word2index(sentences, word2index):
    for sentence in sentences:  # don't shadow the outer 'sentences'
        for word in sentence:
            if word not in word2index:
                word2index[word] = len(word2index)
    return word2index

def compute_sent2index(sentence, max_len, word2index):
    # unknown words also map to 0 (PAD)
    sent2index = [word2index.get(word, 0) for word in sentence]
    if len(sentence) < max_len:
        sent2index += (max_len - len(sentence)) * [0]   # pad with 0
    else:
        sent2index = sent2index[:max_len]               # truncate the index list, not the word list
    return sent2index
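With a toy vocabulary, a short sentence is padded with 0 (PAD) and a long one is truncated:

w2i = {'PAD': 0, '好吃': 1, '快': 2}
print(compute_sent2index(['好吃', '快'], 5, w2i))           # -> [1, 2, 0, 0, 0]
print(compute_sent2index(['好吃', '快', '好吃'], 2, w2i))   # -> [1, 2]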

def text_embedding():

    with open('../data/cut_words_waimai.pkl', 'rb') as f:
        sentences = pickle.load(f)

    word2index = {"PAD": 0}
    word2index = compute_word2index(sentences, word2index)
    sent2indexs = []
    for sent in sentences:
        sentence = compute_sent2index(sent, MAX_LEN, word2index)
        sent2indexs.append(sentence)
    return word2index, sent2indexs

4 The TextCNN Model

Model diagram (from the original paper):

[Figure: TextCNN architecture diagram]
The model has four main layers:

Layer 1: embedding layer
input dim: len(word2index), output dim: embedding_dim

Layer 2: 1-D convolution + LeakyReLU + 1-D max pooling
Convolution: in_channels = embedding_dim, out_channels = feature_size, kernel size h
Activation: LeakyReLU()
Pooling kernel: max_len - h + 1 (a kernel of size h sliding over a length-max_len sequence produces max_len - h + 1 outputs, so this pooling reduces each feature map to a single value)

Layer 3: dropout layer

Layer 4: fully connected layer
input dim: feature_size * len(windows_size), output dim: n_class

Here h ranges over windows_size = (2, 4, 3).
import numpy as np
import torch
from torch import nn
from torch.utils.data import Dataset

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, feature_size, windows_size, max_len, n_class):
        super(TextCNN, self).__init__()

        self.embed = nn.Embedding(num_embeddings=vocab_size, embedding_dim=embedding_dim)

        self.conv1 = nn.ModuleList([
            nn.Sequential(nn.Conv1d(in_channels=embedding_dim, out_channels=feature_size, kernel_size=h),
                          nn.LeakyReLU(),
                          nn.MaxPool1d(kernel_size=max_len - h + 1),
                          )
            for h in windows_size]
        )

        self.dropout = nn.Dropout(p=0.25)

        self.fc1 = nn.Linear(in_features=feature_size * len(windows_size), out_features=n_class)

    def forward(self, x):
        x = self.embed(x)                     # (batch, max_len, embedding_dim)
        x = x.permute(0, 2, 1)                # (batch, embedding_dim, max_len) for Conv1d
        x = [conv(x) for conv in self.conv1]  # each: (batch, feature_size, 1)
        x = torch.cat(x, 1)                   # (batch, feature_size * len(windows_size), 1)
        x = x.view(-1, x.size(1))             # flatten to (batch, feature_size * len(windows_size))
        x = self.dropout(x)
        x = self.fc1(x)                       # (batch, n_class)
        return x

class MyDataSet(Dataset):
    def __init__(self, vectors, labels):
        self.vectors = torch.LongTensor(np.array(vectors))
        self.labels = torch.LongTensor(np.array(labels))

    def __getitem__(self, index):
        vector, label = self.vectors[index], self.labels[index]
        return vector, label

    def __len__(self):
        return len(self.vectors)
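As a sanity check (a minimal sketch; 4074 is the vocabulary size shown in the printout below), push a dummy batch through the model to confirm the output shape:

model = TextCNN(vocab_size=4074, embedding_dim=600, feature_size=200,
                windows_size=[2, 4, 3], max_len=200, n_class=2)
dummy = torch.randint(0, 4074, (8, 200))  # a batch of 8 index sequences
print(model(dummy).shape)                 # -> torch.Size([8, 2])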

Printing the model gives:

TextCNN(
  (embed): Embedding(4074, 600)
  (conv1): ModuleList(
    (0): Sequential(
      (0): Conv1d(600, 200, kernel_size=(2,), stride=(1,))
      (1): LeakyReLU(negative_slope=0.01)
      (2): MaxPool1d(kernel_size=199, stride=199, padding=0, dilation=1, ceil_mode=False)
    )
    (1): Sequential(
      (0): Conv1d(600, 200, kernel_size=(4,), stride=(1,))
      (1): LeakyReLU(negative_slope=0.01)
      (2): MaxPool1d(kernel_size=197, stride=197, padding=0, dilation=1, ceil_mode=False)
    )
    (2): Sequential(
      (0): Conv1d(600, 200, kernel_size=(3,), stride=(1,))
      (1): LeakyReLU(negative_slope=0.01)
      (2): MaxPool1d(kernel_size=198, stride=198, padding=0, dilation=1, ceil_mode=False)
    )
  )
  (dropout): Dropout(p=0.25, inplace=False)
  (fc1): Linear(in_features=600, out_features=2, bias=True)
)

5 Model Training

Training is wrapped in two functions:


def get_accuracy(model, datas, labels):
    with torch.no_grad():  # no gradients needed for evaluation
        out = torch.softmax(model(datas), dim=1, dtype=torch.float32)
        predictions = torch.max(input=out, dim=1)[1]
    y_predict = predictions.to('cpu').data.numpy()
    y_true = labels.to('cpu').data.numpy()
    accuracy = accuracy_score(y_true, y_predict)
    return accuracy
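Since softmax is monotonic, the softmax call is optional here: taking the argmax of the raw logits yields identical predictions.

predictions = model(datas).argmax(dim=1)  # same result, no softmax needed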

def train(model, dataloader, optimizer, epoch):
    model.train()
    for i, (datas, labels) in enumerate(dataloader):

        datas = datas.to(DEVICE)
        labels = labels.to(DEVICE)

        out = model(datas)

        loss = loss_func(out, labels)

        optimizer.zero_grad()

        loss.backward()

        optimizer.step()

        if i % 30 == 0:
            loss_list.append(loss.item())
            accuracy = get_accuracy(model, datas, labels)
            accuracy_list.append(accuracy)
            print('Train Epoch:%d Loss:%0.6f Accuracy:%0.6f' % (epoch, loss.item(), accuracy))

6 Model Evaluation
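plot_curve averages the logged loss and accuracy values per epoch and draws the two curves side by side. The reshape(EPOCHS, -1) call assumes the number of logged values is a multiple of EPOCHS, which holds here because train() logs at the same fixed batch indices in every epoch.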


def plot_curve(accuracy_list, loss_list, model_name):

    accuracy_array = np.array(accuracy_list).reshape(EPOCHS, -1)
    accuracy_array = np.mean(accuracy_array, axis=1)
    loss_array = np.array(loss_list).reshape(EPOCHS, -1)
    loss_array = np.mean(loss_array, axis=1)

    plt.rcParams['figure.figsize'] = (16, 8)
    plt.subplots(1, 2)
    plt.subplot(1, 2, 1)
    plt.plot(range(EPOCHS), loss_array)
    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.title('Loss Curve')
    plt.subplot(1, 2, 2)
    plt.plot(range(EPOCHS), accuracy_array)
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.title('Accuracy Curve')
    plt.savefig(f'../figure/waimai10k_{model_name}.png')

The final evaluation curves:

[Figure: training loss and accuracy curves over 50 epochs]

After 50 epochs of training, accuracy on the training set reaches about 98%.

7 Overview

Project structure in PyCharm:

[Figure: project directory structure in PyCharm]
Code:

def execute():

    dataset, labels = data_prepare()

    preprocessing(dataset, 'review')

    word2index, sent2indexs = text_embedding()

    train_dataset = MyDataSet(sent2indexs, labels)
    dataloader_train = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)

    vocab_size = len(word2index)
    model = TextCNN(vocab_size=vocab_size, embedding_dim=EMBEDDING_DIM, windows_size=WINDOWS_SIZE,
                    max_len=MAX_LEN, feature_size=FEATURE_SIZE, n_class=N_CLASS).to(DEVICE)
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for i in range(EPOCHS):
        print(f'{i+1}/{EPOCHS}')
        train(model, dataloader_train, optimizer, i+1)

    torch.save(model.state_dict(), '../model/textcnn_waimai.pkl')

    plot_curve(accuracy_list, loss_list, 'TextCNN')

if __name__ == '__main__':
    execute()

8 Complete Code

import pickle
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import time
import jieba
import re
import torch
import string
from zhon.hanzi import punctuation
from tqdm import tqdm
from torch import nn, optim
from torch.utils.data import DataLoader
from collections import Counter
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from textcnn import TextCNN, MyDataSet

BATCH_SIZE = 64
EPOCHS = 50
WINDOWS_SIZE = [2, 4, 3]
MAX_LEN = 200
EMBEDDING_DIM = 600
FEATURE_SIZE = 200
N_CLASS = 2

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
loss_func = nn.CrossEntropyLoss()

loss_list, accuracy_list = [], []

def clear_character(sentence):
    # letters and digits
    pattern1 = re.compile('[a-zA-Z0-9]')
    # anything that is not whitespace, a digit, a colon, or a Chinese character
    pattern2 = re.compile(r'[^\s1234567890::' + '\u4e00-\u9fa5]+')
    # Chinese and ASCII punctuation
    pattern3 = re.compile('[%s]+' % re.escape(punctuation + string.punctuation))
    line1 = re.sub(pattern1, '', sentence)
    line2 = re.sub(pattern2, '', line1)
    line3 = re.sub(pattern3, '', line2)
    new_sentence = ''.join(line3.split())  # also drop any remaining whitespace
    return new_sentence

def preprocessing(df, col_name):
    t1 = time.time()
    print('Removing useless characters')
    df[col_name + '_processed'] = df[col_name].apply(clear_character)

    print('Tokenizing with jieba')
    cut_words = []
    for content in df[col_name + '_processed'].values:
        seg_list = jieba.lcut(content)
        cut_words.append(seg_list)

    print('Removing low-frequency words')
    min_threshold = 20
    word_list = []
    for seg_list in cut_words:
        word_list.extend(seg_list)
    counter = Counter(word_list)
    delete_set = {k for k, v in counter.items() if v < min_threshold}
    print(f'Number of low-frequency words to remove: {len(delete_set)}')
    # rebuild each token list instead of calling remove() while iterating,
    # which would skip adjacent low-frequency tokens
    cut_words = [[seg for seg in seg_list if seg not in delete_set]
                 for seg_list in tqdm(cut_words)]

    print('Serializing the token lists')
    with open('../data/cut_words_waimai.pkl', 'wb') as f:
        pickle.dump(cut_words, f)

    t2 = time.time()
    print(f'Total time: {t2 - t1:.1f}s')

def compute_word2index(sentences, word2index):
    for sentence in sentences:  # don't shadow the outer 'sentences'
        for word in sentence:
            if word not in word2index:
                word2index[word] = len(word2index)
    return word2index

def compute_sent2index(sentence, max_len, word2index):
    # unknown words also map to 0 (PAD)
    sent2index = [word2index.get(word, 0) for word in sentence]
    if len(sentence) < max_len:
        sent2index += (max_len - len(sentence)) * [0]   # pad with 0
    else:
        sent2index = sent2index[:max_len]               # truncate the index list, not the word list
    return sent2index

def data_prepare():

    dataset = pd.read_csv('../data/waimai10k.txt', delimiter=',')
    labels = np.array(dataset['label'])
    labels = LabelEncoder().fit_transform(labels)
    return dataset, labels

def text_embedding():

    with open('../data/cut_words_waimai.pkl', 'rb') as f:
        sentences = pickle.load(f)

    word2index = {"PAD": 0}
    word2index = compute_word2index(sentences, word2index)
    sent2indexs = []
    for sent in sentences:
        sentence = compute_sent2index(sent, MAX_LEN, word2index)
        sent2indexs.append(sentence)
    return word2index, sent2indexs

def get_accuracy(model, datas, labels):
    with torch.no_grad():  # no gradients needed for evaluation
        out = torch.softmax(model(datas), dim=1, dtype=torch.float32)
        predictions = torch.max(input=out, dim=1)[1]
    y_predict = predictions.to('cpu').data.numpy()
    y_true = labels.to('cpu').data.numpy()
    accuracy = accuracy_score(y_true, y_predict)
    return accuracy

def train(model, dataloader, optimizer, epoch):
    model.train()
    for i, (datas, labels) in enumerate(dataloader):

        datas = datas.to(DEVICE)
        labels = labels.to(DEVICE)

        out = model(datas)

        loss = loss_func(out, labels)

        optimizer.zero_grad()

        loss.backward()

        optimizer.step()

        if i % 30 == 0:
            loss_list.append(loss.item())
            accuracy = get_accuracy(model, datas, labels)
            accuracy_list.append(accuracy)
            print('Train Epoch:%d Loss:%0.6f Accuracy:%0.6f' % (epoch, loss.item(), accuracy))

def plot_curve(accuracy_list, loss_list, model_name):

    accuracy_array = np.array(accuracy_list).reshape(EPOCHS, -1)
    accuracy_array = np.mean(accuracy_array, axis=1)
    loss_array = np.array(loss_list).reshape(EPOCHS, -1)
    loss_array = np.mean(loss_array, axis=1)

    plt.rcParams['figure.figsize'] = (16, 8)
    plt.subplots(1, 2)
    plt.subplot(1, 2, 1)
    plt.plot(range(EPOCHS), loss_array)
    plt.xlabel('epoch')
    plt.ylabel('loss')
    plt.title('Loss Curve')
    plt.subplot(1, 2, 2)
    plt.plot(range(EPOCHS), accuracy_array)
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.title('Accuracy Curve')
    plt.savefig(f'../figure/waimai10k_{model_name}.png')

def execute():

    dataset, labels = data_prepare()

    preprocessing(dataset, 'review')

    word2index, sent2indexs = text_embedding()

    train_dataset = MyDataSet(sent2indexs, labels)
    dataloader_train = DataLoader(dataset=train_dataset, batch_size=BATCH_SIZE, shuffle=True)

    vocab_size = len(word2index)
    model = TextCNN(vocab_size=vocab_size, embedding_dim=EMBEDDING_DIM, windows_size=WINDOWS_SIZE,
                    max_len=MAX_LEN, feature_size=FEATURE_SIZE, n_class=N_CLASS).to(DEVICE)
    optimizer = optim.Adam(model.parameters(), lr=0.001)

    for i in range(EPOCHS):
        print(f'{i+1}/{EPOCHS}')
        train(model, dataloader_train, optimizer, i+1)

    torch.save(model.state_dict(), '../model/textcnn_waimai.pkl')

    plot_curve(accuracy_list, loss_list, 'TextCNN')

if __name__ == '__main__':
    execute()

Original: https://blog.csdn.net/m0_46275020/article/details/126433633
Author: 热爱旅行的小李同学
Title: 深度学习笔记(3)——pytorch+TextCNN实现情感分类(外卖数据集)
