Implementing a Simple Sentiment Classification Task with BERT


Project link:

https://github.com/yyxx1997/pytorch/tree/master/bert-sst2

Task overview

Sentiment classification assigns a text to two or more classes, such as positive or negative, according to the meaning and sentiment it expresses. Because it characterizes the author's inclination, opinion, and attitude, it is also called opinion analysis.

Using a simple binary sentiment classification task as the example, this article shows how to fine-tune the pre-trained BERT model.

Data preparation

Since this task mainly demonstrates how to use BERT, the dataset is a subset of SST-2, sampled from the original dataset, 10,000 examples in total.

The SST-2 dataset

The SST dataset is a sentiment analysis dataset released by Stanford University. It targets movie reviews, so SST is a single-sentence text classification task (SST-2 is the binary version; SST-5 has five classes and distinguishes sentiment polarity at a finer granularity).

SST dataset homepage: https://nlp.stanford.edu/sentiment/index.html

The preprocessing of the SST data is not covered here; the sampled result is given directly: sst2_shuffled.tsv

Example

0 —— positive
1 —— negative

sentiment polarity | sentence
1 | this is the case of a pregnant premise being wasted by a…
0 | is office work really as alienating as ‘bartleby’ so effectively…
0 | horns and halos benefits from serendipity but also reminds…
1 | heavy-handed exercise in time-vaulting literary pretension.
0 | easily one of the best and most exciting movies of the year.
1 | you . . . get a sense of good intentions derailed by a failure…
1 | johnson has , in his first film , set himself a task he is not nearly up to.

Data loading

Since hyperparameter tuning is not demonstrated here, the data is split into a training set and a test set only, with no validation set.

def load_sentence_polarity(data_path, train_ratio=0.8):
    all_data = []
    categories = set()
    with open(data_path, 'r', encoding="utf8") as file:
        for sample in file.readlines():
            # Each line: "<polarity>\t<sentence>"
            polar, sent = sample.strip().split("\t")
            categories.add(polar)
            all_data.append((polar, sent))
    length = len(all_data)
    train_len = int(length * train_ratio)
    train_data = all_data[:train_len]
    test_data = all_data[train_len:]
    return train_data, test_data, categories
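
For reference, a quick usage sketch (assuming the 10,000-line sst2_shuffled.tsv is in the working directory):

train_data, test_data, categories = load_sentence_polarity("sst2_shuffled.tsv")
print(len(train_data), len(test_data))  # 8000 2000 with the default train_ratio=0.8
print(categories)                       # {'0', '1'} — the polarity labels as strings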

Define a Dataset and a DataLoader to feed batches to the model:

class BertDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset
        self.data_size = len(dataset)

    def __len__(self):
        return self.data_size

    def __getitem__(self, index):
        return self.dataset[index]

def coffate_fn(examples):
    inputs, targets = [], []
    for polar, sent in examples:
        inputs.append(sent)
        targets.append(int(polar))
    inputs = tokenizer(inputs,
                       padding=True,
                       truncation=True,
                       return_tensors="pt",
                       max_length=512)
    targets = torch.tensor(targets)
    return inputs, targets

data_path = "sst2_shuffled.tsv"

train_data, test_data, categories = load_sentence_polarity(
    data_path=data_path, train_ratio=train_ratio)

train_dataset = BertDataset(train_data)
test_dataset = BertDataset(test_data)

train_dataloader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              collate_fn=coffate_fn,
                              shuffle=True)
test_dataloader = DataLoader(test_dataset,
                             batch_size=1,
                             collate_fn=coffate_fn)

DataLoader takes the following main arguments:
Args:

  • dataset (Dataset): dataset from which to load the data.

  • batch_size (int, optional): how many samples per batch to load (default: 1).

  • shuffle (bool, optional): set to True to have the data reshuffled at every epoch (default: False).

  • collate_fn: a callback applied to each batch to post-process the samples.

How DataLoader works:

  1. Take batch_size samples from the dataset.
  2. Apply the collate_fn callback to each batch to turn the samples into model-ready inputs.
  3. Before the next epoch, shuffle the dataset (if shuffle=True) so the model cannot memorize the data order and overfit.

For more on Dataset and DataLoader, see the article: Pytorch入门:DataLoader 和 Dataset
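
As a quick sanity check (a minimal sketch, assuming the tokenizer and the dataloaders above are already built), pull one batch and inspect it:

inputs, targets = next(iter(train_dataloader))
print(inputs['input_ids'].shape)       # (batch_size, longest_sequence_in_batch)
print(inputs['attention_mask'].shape)  # same shape: 1 for real tokens, 0 for padding
print(targets)                         # tensor of 0/1 polarity labels, shape (batch_size,)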

Model

This article uses the plain BertModel, loading the bert-base-uncased pre-trained weights, and adds a Linear layer on top to map the pooled representation to the two classes:

from transformers import BertModel

class BertSST2Model(nn.Module):

    def __init__(self, class_size, pretrained_name='bert-base-uncased'):
        """
        Args:
            class_size: number of target classes, i.e. the output
                dimension of the linear classifier
            pretrained_name: name of the pre-trained BERT model to load
        """
        super(BertSST2Model, self).__init__()
        self.bert = BertModel.from_pretrained(pretrained_name,
                                              return_dict=True)
        self.classifier = nn.Linear(768, class_size)

The overall model architecture is shown below (image source: the web):

[Figure: the input passes through 12 Transformer layers, and the final [CLS] representation feeds the classification head]
As the figure shows, after the input passes through the 12 layers, the [CLS] token is used for the final classification. Two points deserve attention:
  • BertModel passes the final hidden state at the [CLS] position through a pooler layer, so the linear mapping is not applied directly to the raw last-layer hidden state.
  • Feeding the pooler output into a Linear layer is the standard recipe for BERT classification tasks.

For details on the pooler layer, see the transformers source code.
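
To make this concrete, here is a minimal sketch of what BertModel returns (standard HuggingFace API; `encoded` stands for a tokenizer output, as produced in coffate_fn above):

bert = BertModel.from_pretrained('bert-base-uncased')
outputs = bert(**encoded)                      # encoded: the tokenizer's output dict
cls_hidden = outputs.last_hidden_state[:, 0]   # raw last-layer hidden state at [CLS], shape (batch, 768)
pooled = outputs.pooler_output                 # tanh(Linear(cls_hidden)) — this is what the classifier consumes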

Fine-tuning

Hyperparameters

Before training, set the hyperparameters and global variables:

batch_size = 16       # number of samples trained per batch
num_epoch = 10        # number of training epochs
check_step = 2        # run a test pass and save the model every check_step epochs
data_path = "sst2_shuffled.tsv"  # path to the data file
train_ratio = 0.8     # fraction of the data used for training
learning_rate = 1e-5  # learning rate for the optimizer

Optimizer and loss function

optimizer = Adam(model.parameters(), learning_rate)  # Adam optimizer
CE_loss = nn.CrossEntropyLoss()  # cross-entropy loss for the binary classification task
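
Note that nn.CrossEntropyLoss applies log-softmax internally, so the model must output raw logits (which the Linear layer above produces), not probabilities. A small worked example:

logits = torch.tensor([[3.2, 1.1]])  # one sample, two classes
target = torch.tensor([0])           # ground-truth class index
loss = CE_loss(logits, target)
# loss = -log(softmax(logits)[0]) = -log(e^3.2 / (e^3.2 + e^1.1)) ≈ 0.12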

Training

model.train()
for epoch in range(1, num_epoch + 1):
    total_loss = 0
    for batch in tqdm(train_dataloader, desc=f"Training Epoch {epoch}"):
        inputs, targets = [x.to(device) for x in batch]
        optimizer.zero_grad()
        bert_output = model(inputs)          # logits, shape (batch_size, class_size)
        loss = CE_loss(bert_output, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

Testing


acc = 0
for batch in tqdm(test_dataloader, desc="Testing"):
    inputs, targets = [x.to(device) for x in batch]
    with torch.no_grad():
        bert_output = model(inputs)
        # .argmax(dim=1) returns the index of the largest value along dim 1.
        # If bert_output is the 3x2 tensor
        #     [[ 3.2, 1.1],
        #      [ 0.4, 0.6],
        #      [-0.1, 0.2]]
        # then bert_output.argmax(dim=1) is tensor([0, 1, 1]).
        acc += (bert_output.argmax(dim=1) == targets).sum().item()

print(f"Acc: {acc / len(test_dataloader):.2f}")

Results

On this dataset the model's accuracy rises from below 50% to around 85%, a clear improvement.

Complete code


import torch
import torch.nn as nn
from torch.optim import Adam
from torch.utils.data import Dataset, DataLoader
from transformers import BertModel
from tqdm import tqdm
import os
import time
from transformers import BertTokenizer
from transformers import logging

logging.set_verbosity_error()

os.environ["CUDA_VISIBLE_DEVICES"] = "0"

class BertSST2Model(nn.Module):

    def __init__(self, class_size, pretrained_name='bert-base-uncased'):
        """
        Args:
            class_size: number of target classes, i.e. the output
                dimension of the linear classifier
            pretrained_name: name of the pre-trained BERT model to load
        """
        super(BertSST2Model, self).__init__()
        self.bert = BertModel.from_pretrained(pretrained_name,
                                              return_dict=True)
        self.classifier = nn.Linear(768, class_size)

    def forward(self, inputs):
        input_ids, input_tyi, input_attn_mask = inputs['input_ids'], inputs[
            'token_type_ids'], inputs['attention_mask']
        # Pass the masks by keyword: BertModel.forward takes
        # (input_ids, attention_mask, token_type_ids), so positional
        # arguments in a different order would silently swap them.
        output = self.bert(input_ids,
                           token_type_ids=input_tyi,
                           attention_mask=input_attn_mask)
        # Classify from the pooler output (pooled [CLS] representation).
        categories_numberic = self.classifier(output.pooler_output)
        return categories_numberic

def save_pretrained(model, path):
    # Save the full model object (architecture + weights) to path/model.pth.
    os.makedirs(path, exist_ok=True)
    torch.save(model, os.path.join(path, 'model.pth'))

def load_sentence_polarity(data_path, train_ratio=0.8):
    all_data = []
    categories = set()
    with open(data_path, 'r', encoding="utf8") as file:
        for sample in file.readlines():
            # Each line: "<polarity>\t<sentence>"
            polar, sent = sample.strip().split("\t")
            categories.add(polar)
            all_data.append((polar, sent))
    length = len(all_data)
    train_len = int(length * train_ratio)
    train_data = all_data[:train_len]
    test_data = all_data[train_len:]
    return train_data, test_data, categories

"""
torch提供了优秀的数据加载类Dataloader,可以自动加载数据。
1. 想要使用torch的DataLoader作为训练数据的自动加载模块,就必须使用torch提供的Dataset类
2. 一定要具有__len__和__getitem__的方法,不然DataLoader不知道如何如何加载数据
这里是固定写法,是官方要求,不懂可以不做深究,一般的任务这里都通用
"""

class BertDataset(Dataset):
    def __init__(self, dataset):
        self.dataset = dataset
        self.data_size = len(dataset)

    def __len__(self):
        return self.data_size

    def __getitem__(self, index):
        return self.dataset[index]

def coffate_fn(examples):
    inputs, targets = [], []
    for polar, sent in examples:
        inputs.append(sent)
        targets.append(int(polar))
    inputs = tokenizer(inputs,
                       padding=True,
                       truncation=True,
                       return_tensors="pt",
                       max_length=512)
    targets = torch.tensor(targets)
    return inputs, targets

batch_size = 32
num_epoch = 5
check_step = 1
data_path = "./sst2_shuffled.tsv"
train_ratio = 0.8
learning_rate = 1e-5

train_data, test_data, categories = load_sentence_polarity(
    data_path=data_path, train_ratio=train_ratio)

train_dataset = BertDataset(train_data)
test_dataset = BertDataset(test_data)
"""
DataLoader主要有以下几个参数:
Args:
    dataset (Dataset): dataset from which to load the data.

    batch_size (int, optional): how many samples per batch to load(default: 1).

    shuffle (bool, optional): set to  to have the data reshuffled at every epoch (default: ).

    collate_fn : 传入一个处理数据的回调函数
DataLoader工作流程:
1. 先从dataset中取出batch_size个数据
2. 对每个batch,执行collate_fn传入的函数以改变成为适合模型的输入
3. 下个epoch取数据前先对当前的数据集进行shuffle,以防模型学会数据的顺序而导致过拟合
"""
train_dataloader = DataLoader(train_dataset,
                              batch_size=batch_size,
                              collate_fn=coffate_fn,
                              shuffle=True)
test_dataloader = DataLoader(test_dataset,
                             batch_size=1,
                             collate_fn=coffate_fn)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pretrained_model_name = 'bert-base-uncased'

model = BertSST2Model(len(categories), pretrained_model_name)

model.to(device)

tokenizer = BertTokenizer.from_pretrained(pretrained_model_name)

optimizer = Adam(model.parameters(), learning_rate)
CE_loss = nn.CrossEntropyLoss()

timestamp = time.strftime("%m_%d_%H_%M", time.localtime())

model.train()
for epoch in range(1, num_epoch + 1):
    total_loss = 0
    for batch in tqdm(train_dataloader, desc=f"Training Epoch {epoch}"):
        inputs, targets = [x.to(device) for x in batch]
        optimizer.zero_grad()
        bert_output = model(inputs)
        loss = CE_loss(bert_output, targets)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    acc = 0
    for batch in tqdm(test_dataloader, desc="Testing"):
        inputs, targets = [x.to(device) for x in batch]
        with torch.no_grad():
            bert_output = model(inputs)
            # .argmax(dim=1) returns the index of the largest value along dim 1.
            # If bert_output is the 3x2 tensor
            #     [[ 3.2, 1.1],
            #      [ 0.4, 0.6],
            #      [-0.1, 0.2]]
            # then bert_output.argmax(dim=1) is tensor([0, 1, 1]).
            acc += (bert_output.argmax(dim=1) == targets).sum().item()

    print(f"Acc: {acc / len(test_dataloader):.2f}")

    if epoch % check_step == 0:

        checkpoints_dirname = "bert_sst2_" + timestamp
        os.makedirs(checkpoints_dirname, exist_ok=True)
        save_pretrained(model,
                        checkpoints_dirname + '/checkpoints-{}/'.format(epoch))
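
Once training has finished, a saved checkpoint can be reloaded for inference. A minimal sketch (the checkpoint path is hypothetical; substitute the timestamped directory a run actually produces, and note that torch.load needs the BertSST2Model class in scope because the whole module was saved):

model = torch.load("bert_sst2_<timestamp>/checkpoints-5/model.pth")  # hypothetical path
model.to(device)
model.eval()
sentence = "easily one of the best and most exciting movies of the year."
encoded = tokenizer([sentence], padding=True, truncation=True,
                    return_tensors="pt", max_length=512).to(device)
with torch.no_grad():
    pred = model(encoded).argmax(dim=1).item()
print("positive" if pred == 0 else "negative")  # in this dataset, 0 = positive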

Original: https://blog.csdn.net/weixin_45101959/article/details/122971674
Author: 墨菲是一只喵
Title: 基于BERT实现简单的情感分类任务 (Implementing a Simple Sentiment Classification Task with BERT)
