Fine-tuning a pretrained model on a text classification task with huggingface.transformers.AutoModelForSequenceClassification

June 26, 2023, 1:09 AM • Artificial Intelligence • 77 views

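This post is part of my series of study notes on the full huggingface.transformers documentation.
Index post: huggingface transformers documentation study notes (continuously updated)

URL of the part covered here: https://huggingface.co/docs/transformers/main/en/training
Using text classification as the example task, this part shows how to fine-tune a pretrained model with transformers.
Since I mainly work with PyTorch, this post only covers fine-tuning with transformers.Trainer (docs: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer) and with native PyTorch.
The code in the tutorial is scattered, so the last subsection of each of these two parts gives the complete script.
In addition, ① I need to use my own datasets, and ② my server cannot easily go through a proxy, which makes datasets inconvenient to use. So this post spends some space on getting the same functionality without datasets (while still covering the datasets usage mentioned in this part of the documentation).
(For a workaround when datasets and metrics cannot be loaded from mainland China, see my earlier post: Solutions when huggingface.datasets cannot load a dataset (诸神缄默不语's blog on CSDN))
Also note: I ran part of the code in a Jupyter notebook and part of it as scripts, and my environment changed along the way, so the outputs shown may not be consistent with each other.

A Python environment in which the code in this post works: Python 3.8, PyTorch 1.8.1, cudatoolkit 10.2, transformers 4.18.0, datasets 2, scikit-learn 1.0.2
(From what I have seen, other versions should also work without much difference.)

Training a pretrained model on the dataset of a specific task is called fine-tuning.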

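Contents

1. The datasets package
1.1 Installing datasets
1.2 A quick introduction to datasets
1.3 Loading and preprocessing the Yelp Reviews dataset
1.4 Converting a custom dataset to the datasets format
2. Fine-tuning with Trainer (PyTorch backend)
2.1 Defining the classification model
2.2 Training hyperparameters
2.3 Metrics
2.4 Trainer
2.5 Complete script
3. Fine-tuning in native PyTorch
3.1 Dataset
3.2 Neural network model
3.3 Optimizer and learning rate scheduler
3.4 Device
3.5 Training loop
3.6 Metrics
3.7 Complete script
4. Other learning resources given in the tutorial

1. The datasets package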

Official GitHub project of the datasets package: huggingface/datasets: 🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools

The datasets package can load many public datasets and preprocess them.
Its design draws on the TFDS project: tensorflow/datasets: TFDS is a collection of datasets ready to use with TensorFlow, Jax, …

1.1 Installing datasets

If you manage packages with anaconda and have already installed transformers with pip, you can install datasets with pip directly:

pip install datasets

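For other installation methods, see the datasets documentation: https://huggingface.co/docs/datasets/installation

1.2 A quick introduction to datasets

The functions used in this subsection come from the README of the official GitHub project.

All available datasets: all_available_datasets=datasets.list_datasets()
First 5 entries: all_available_datasets[:5]
Output: ['assin', 'ar_res_reviews', 'ambig_qa', 'bianet', 'ag_news']
Load a dataset: datasets.load_dataset(dataset_name, **kwargs)
For the "yelp_review_full" dataset used in this tutorial: dataset=datasets.load_dataset("yelp_review_full")
All available metrics: datasets.list_metrics()
Load a metric: datasets.load_metric(metric_name, **kwargs)

Other functions are introduced later, alongside the examples that need them.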

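1.3 Loading and preprocessing the Yelp Reviews dataset

Official page of the dataset on Hugging Face: yelp_review_full · Datasets at Hugging Face

This is a dataset for English short-text classification (sentiment classification): reviews from Yelp, a US review site (text), with their star ratings (label; 1 to 5 stars, stored as class ids 0 to 4).
It is extracted from the Yelp Dataset Challenge 2015 data and comes from this paper: Xiang Zhang, Junbo Zhao, Yann LeCun. Character-level Convolutional Networks for Text Classification. Advances in Neural Information Processing Systems 28 (NIPS 2015)

Load and inspect the dataset: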

from datasets import load_dataset

dataset = load_dataset("yelp_review_full")
dataset["train"][100]

Output:

{'label': 0,
 'text': 'My expectations for McDonalds are t rarely high. But for one to still fail so spectacularly...that takes something special!\\nThe cashier took my friends\'s order, then promptly ignored me. I had to force myself in front of a cashier who opened his register to wait on the person BEHIND me. I waited over five minutes for a gigantic order that included precisely one kid\'s meal. After watching two people who ordered after me be handed their food, I asked where mine was. The manager started yelling at the cashiers for \\"serving off their orders\\" when they didn\'t have their food. But neither cashier was anywhere near those controls, and the manager was the one serving food to customers and clearing the boards.\\nThe manager was rude when giving me my order. She didn\'t make sure that I had everything ON MY RECEIPT, and never even had the decency to apologize that I felt I was getting poor service.\\nI\'ve eaten at various McDonalds restaurants for over 30 years. I\'ve worked at more than one location. I expect bad days, bad moods, and the occasional mistake. But I have yet to have a decent experience at this store. It will remain a place I avoid unless someone in my party needs to avoid illness from low blood sugar. Perhaps I should go back to the racially biased service of Steak n Shake instead!'}

Use the map function of datasets (docs: https://huggingface.co/docs/datasets/process.html#map) to tokenize the raw text column and turn it into the format the model can read (the tokenizer output):

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("mypath/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"],padding="max_length",truncation=True,max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

The converted dataset: [screenshot]

Note that the tokenizer call in the original tutorial has no max_length argument. Without it, this step prints the following warnings:

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

and training later fails with:

Traceback (most recent call last):
  File "mypath/huggingfacedatasets1.py", line 47, in <module>
    trainer.train()
  File "myenv/lib/python3.8/site-packages/transformers/trainer.py", line 1396, in train
    for step, inputs in enumerate(epoch_iterator):
  File "myenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "myenv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 557, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "myenv/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "myenv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 66, in default_data_collator
    return torch_default_data_collator(features)
  File "myenv/lib/python3.8/site-packages/transformers/data/data_collator.py", line 130, in torch_default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 72 at dim 1 (got 118)
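
This bug occurs because the sequences within one batch have different lengths, so the max_length argument has to be added.
Strangely, when I tried the same code on Colab it ran as is... yet I could not find where the model's max_length is actually defined.
I used 512 because on Colab len(tokenized_datasets['train'][0]['input_ids']) is 512, so the default max_length there is evidently 512. As for why I have to set it by hand on my own machine, who knows.
My first guess was that it is because my pretrained_model_name_or_path argument is a local path rather than one of the keys in this mapping:

[screenshot]

Testing suggests that this is not the reason:
With max_length passed manually:

[screenshot]

Without passing max_length:

[screenshot]

With max_model_input_sizes modified and without passing max_length:

[screenshot]

Which leaves me honestly wondering what this attribute is for...

Using dir(tokenizer) I dug up another attribute whose name also looks right, model_max_length, and testing shows that this one is the setting that actually takes effect:

[screenshot]

I did not rerun the whole experiment with this setting, because I do not think it is necessary: the two cases are the same (see section 2.2 of my earlier post, the huggingface.transformers glossary (诸神缄默不语's blog on CSDN): tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=512) fixes every sequence to length 512, while tokenizer(batch_sentences, padding='max_length', truncation=True) fixes every sequence to the model's max_length; when the model's max_length is 512, the two are equivalent). In fact I think passing max_length explicitly is probably better, because it makes the code easier to control.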
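
To make the effect of padding='max_length' together with truncation=True concrete, here is a minimal pure-Python sketch on lists of token ids. `pad_and_truncate` and `PAD_ID` are hypothetical names used only for illustration; they are not part of transformers, and the real tokenizer additionally handles special tokens and attention masks.

```python
PAD_ID = 0  # assumed padding id, for illustration only

def pad_and_truncate(batch_ids, max_length, pad_id=PAD_ID):
    """Cut every sequence to max_length, then right-pad the shorter ones."""
    fixed = []
    for ids in batch_ids:
        ids = ids[:max_length]                                   # truncation
        fixed.append(ids + [pad_id] * (max_length - len(ids)))   # padding
    return fixed

ragged = [[101, 7, 8, 102], [101, 7, 8, 9, 10, 11, 102]]
# Rows of different lengths cannot be stacked into one rectangular tensor,
# which is what the ValueError in the traceback above is about.
print(sorted({len(ids) for ids in ragged}))  # [4, 7]

fixed = pad_and_truncate(ragged, max_length=6)
print(fixed)  # [[101, 7, 8, 102, 0, 0], [101, 7, 8, 9, 10, 11]]
```

Once every row has the same length, the default collator can turn the batch into a tensor without complaint.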

To speed up training in the example, we sample a smaller dataset:

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

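1.4 Converting a custom dataset to the datasets format

For more details, see the official datasets documentation: https://huggingface.co/docs/datasets/v2.0.0/en/loading

(I wrote this subsection mainly so that, when using Trainer later, a custom dataset can conveniently be converted to a datasets.Dataset. Trainer's *_dataset arguments accept either a datasets.Dataset or a torch dataset. With a datasets.Dataset, the columns appear to be fed to the model's arguments automatically by column name, and judging from the example in this tutorial the values can simply be lists. Trainer automatically drops the other columns, as the training output later on shows. With a torch dataset it is less obvious what each item has to provide, which is a bit of a hassle, so converting to a datasets.Dataset avoids the trouble.)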
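
As a rough illustration of how columns could be matched to model arguments by name, the sketch below filters column names against the signature of a forward function. `toy_forward` and `keep_model_columns` are hypothetical stand-ins, and keeping `label`/`label_ids` is my assumption about the usual target column names; Trainer's actual implementation may differ in its details.

```python
import inspect

def toy_forward(input_ids=None, attention_mask=None, token_type_ids=None, labels=None):
    """Stand-in for a model's forward method (hypothetical)."""
    return None

def keep_model_columns(column_names, forward_fn):
    """Split columns into those forward_fn accepts and those it would ignore."""
    accepted = set(inspect.signature(forward_fn).parameters)
    accepted |= {"label", "label_ids"}  # assumption: common label column names are kept too
    kept = [c for c in column_names if c in accepted]
    ignored = [c for c in column_names if c not in accepted]
    return kept, ignored

columns = ["label", "text", "input_ids", "token_type_ids", "attention_mask"]
kept, ignored = keep_model_columns(columns, toy_forward)
print(kept)     # ['label', 'input_ids', 'token_type_ids', 'attention_mask']
print(ignored)  # ['text']
```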

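This post uses the yelp_review_full dataset in the form of an in-memory dict object as the example (other in-memory data works similarly; I have not yet considered data too large to load into memory at once and will deal with that when it comes up). Dataset.from_dict() docs: https://huggingface.co/docs/datasets/v2.0.0/en/package_reference/main_classes#datasets.Dataset.from_dict

① Convert the small_train_dataset obtained above into a dict (keys are column names, values are the column values as lists) to serve as example data: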

example_dict={'label':small_train_dataset['label'],'text':small_train_dataset['text']}

② Convert this dict to a Dataset:

import datasets

example_dataset=datasets.Dataset.from_dict(example_dict)
example_dataset

Dataset({
    features: ['label', 'text'],
    num_rows: 1000
})

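I have not yet found a dedicated method for combining Datasets into a DatasetDict, but it does not seem necessary: a DatasetDict is essentially a dict of Datasets, and an operation on a DatasetDict should amount to applying that operation to each of its values. So there is no need to use this class specifically; whenever a DatasetDict-like operation is needed, just apply it to every Dataset.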

2. Fine-tuning with Trainer (PyTorch backend)

2.1 Defining the classification model

The labels of this dataset fall into 5 classes.

from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("mypath/bert-base-cased", num_labels=5)

Output:

Some weights of the model checkpoint at mypath/bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).

- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at mypath/bert-base-cased and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

(For an explanation of this output, see my earlier post: Some weights of the model checkpoint at mypath/bert-base-chinese were not used when initializing Ber (诸神缄默不语's blog on CSDN))

The same code can also be written like this:

from transformers import AutoConfig,AutoModelForSequenceClassification

model_path="mypath/bert-base-cased"
config=AutoConfig.from_pretrained(model_path,num_labels=5)
model=AutoModelForSequenceClassification.from_pretrained(model_path,config=config)

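2.2 Training hyperparameters

The TrainingArguments class (docs: https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) holds all tunable hyperparameters and training settings. This tutorial uses the default hyperparameters.

Specify where checkpoints are saved: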

from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer")

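2.3 Metrics

Trainer does not evaluate the model automatically, so it has to be given a function that computes and reports the metrics.
More on metrics: https://huggingface.co/docs/datasets/metrics.html

Official Hugging Face page of the accuracy metric: Hugging Face – The AI community building the future.

Load the accuracy metric:

import datasets
import numpy as np
metric=datasets.load_metric("accuracy")

Calling compute() on the metric gives the accuracy of the predictions (obtained from the logits in the model output).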

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
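
If the metric script cannot be downloaded (the proxy problem mentioned at the top of this post), accuracy is simple enough to compute with NumPy alone. The following is a minimal sketch; `accuracy_from_logits` is a hypothetical helper and the logits are made-up numbers, not model output.

```python
import numpy as np

def accuracy_from_logits(logits, labels):
    """Plain-NumPy stand-in for the accuracy metric (hypothetical helper)."""
    predictions = np.argmax(logits, axis=-1)  # index of the largest logit in each row
    return {"accuracy": float((predictions == labels).mean())}

logits = np.array([[0.1, 2.0, 0.3, 0.0, 0.0],   # predicts class 1
                   [1.5, 0.2, 0.1, 0.0, 0.0],   # predicts class 0
                   [0.0, 0.1, 0.2, 0.3, 3.0],   # predicts class 4
                   [0.0, 0.9, 0.8, 0.1, 0.2]])  # predicts class 1
labels = np.array([1, 0, 4, 2])
print(accuracy_from_logits(logits, labels))  # {'accuracy': 0.75}
```

A function with this return shape could be called from compute_metrics in place of metric.compute, since both return a dict that maps the metric name to a number.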

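To monitor how the metric changes during fine-tuning, set the evaluation_strategy hyperparameter in TrainingArguments so that the metric on the evaluation set is reported at the end of every epoch: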

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

2.4 Trainer

Define the Trainer object:

from transformers import Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

Start training:

trainer.train()

Script output (here you can see that the text column is not passed to the model):

The following columns in the training set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

myenv/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 96
  0%|                                                                                  | 0/96 [00:00<?, ?it/s]myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

  warnings.warn('Was asked to gather along dimension 0, but all '
 33%|████████████████████████▎                                                | 32/96 [00:19<00:23,  2.73it/s]The following columns in the evaluation set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 1.219325304031372, 'eval_accuracy': 0.487, 'eval_runtime': 5.219, 'eval_samples_per_second': 191.609, 'eval_steps_per_second': 6.131, 'epoch': 1.0}
 33%|████████████████████████▎                                                | 32/96 [00:24<00:23,  2.73it/smyenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

  warnings.warn('Was asked to gather along dimension 0, but all '
 67%|████████████████████████████████████████████████▋                        | 64/96 [00:37<00:11,  2.87it/s]The following columns in the evaluation set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 1.0443027019500732, 'eval_accuracy': 0.57, 'eval_runtime': 5.1937, 'eval_samples_per_second': 192.539, 'eval_steps_per_second': 6.161, 'epoch': 2.0}
 67%|████████████████████████████████████████████████▋                        | 64/96 [00:42<00:11,  2.87it/smyenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

  warnings.warn('Was asked to gather along dimension 0, but all '
100%|█████████████████████████████████████████████████████████████████████████| 96/96 [00:55<00:00,  2.87it/s]The following columns in the evaluation set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
{'eval_loss': 0.9776290655136108, 'eval_accuracy': 0.598, 'eval_runtime': 5.2137, 'eval_samples_per_second': 191.803, 'eval_steps_per_second': 6.138, 'epoch': 3.0}
100%|█████████████████████████████████████████████████████████████████████████| 96/96 [01:00<00:00,  2.87it/s]

Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 60.8009, 'train_samples_per_second': 49.341, 'train_steps_per_second': 1.579, 'train_loss': 1.0931960741678874, 'epoch': 3.0}
100%|█████████████████████████████████████████████████████████████████████████| 96/96 [01:00<00:00,  1.58it/s]

Output in a Jupyter notebook, which looks a bit cleaner than the script output:

The following columns in the training set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

myenv/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  warnings.warn(
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 96
myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

  warnings.warn('Was asked to gather along dimension 0, but all '

The following columns in the evaluation set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

  warnings.warn('Was asked to gather along dimension 0, but all '
The following columns in the evaluation set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
myenv/lib/python3.8/site-packages/torch/nn/parallel/_functions.py:65: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.

  warnings.warn('Was asked to gather along dimension 0, but all '
The following columns in the evaluation set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8

Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=96, training_loss=1.1009167830149333, metrics={'train_runtime': 60.9212, 'train_samples_per_second': 49.244, 'train_steps_per_second': 1.576, 'total_flos': 789354427392000.0, 'train_loss': 1.1009167830149333, 'epoch': 3.0})

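Because I also ran the code on Colab for debugging, here is the Colab output as well (I used a GPU there too, but it was still far slower than my local machine, and I do not know why. I have 4 GPUs locally, but this is clearly more than 4 times slower):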

The following columns in the training set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

/usr/local/lib/python3.7/dist-packages/transformers/optimization.py:309: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning
  FutureWarning,
***** Running training *****
  Num examples = 1000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 375

The following columns in the evaluation set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8
The following columns in the evaluation set  don't have a corresponding argument in BertForSequenceClassification.forward and have been ignored: text. If text are not expected by BertForSequenceClassification.forward,  you can safely ignore this message.

***** Running Evaluation *****
  Num examples = 1000
  Batch size = 8

Training completed. Do not forget to share your model on huggingface.co/models =)

TrainOutput(global_step=375, training_loss=1.2140440266927084, metrics={'train_runtime': 780.671, 'train_samples_per_second': 3.843, 'train_steps_per_second': 0.48, 'total_flos': 789354427392000.0, 'train_loss': 1.2140440266927084, 'epoch': 3.0})

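(Another point to note is the torch.nn.parallel warning, which did not appear on Colab. I suspect this is either because Colab has only one GPU or because of the torch version (PyTorch 1.8.1 locally, PyTorch 1.10 on Colab). This is hard to verify, so it is only a guess.)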
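
The "Total train batch size" and "Total optimization steps" lines in the logs above can be reproduced with a little arithmetic. This is a sketch under the assumption that the total batch size is the per-device batch size times the number of GPUs and that there is no gradient accumulation, which is what the logs suggest; `total_optimization_steps` is a hypothetical helper.

```python
import math

def total_optimization_steps(num_examples, per_device_batch_size, num_devices, num_epochs):
    """Return (total batch size, total steps) assuming no gradient accumulation."""
    total_batch_size = per_device_batch_size * num_devices
    steps_per_epoch = math.ceil(num_examples / total_batch_size)
    return total_batch_size, steps_per_epoch * num_epochs

# Local run: 4 GPUs, as stated above
print(total_optimization_steps(1000, 8, 4, 3))  # (32, 96)
# Colab run: a single GPU
print(total_optimization_steps(1000, 8, 1, 3))  # (8, 375)

# Wall-clock ratio of the two train_runtime values reported in the logs
print(round(780.671 / 60.8009, 1))  # 12.8
```

So the Colab run takes roughly 12.8 times as long in total, which indeed cannot be explained by the 4x difference in GPU count alone.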

2.5 Complete script

import datasets
import numpy as np
from transformers import AutoTokenizer,AutoModelForSequenceClassification,TrainingArguments,Trainer

dataset=datasets.load_from_disk("datasets/yelp_full_review_disk")

tokenizer = AutoTokenizer.from_pretrained("pretrained_models/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"],padding="max_length",truncation=True,max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

model = AutoModelForSequenceClassification.from_pretrained("pretrained_models/bert-base-cased",
                                                            num_labels=5)

training_args = TrainingArguments(output_dir="pt_save_pretrained",evaluation_strategy="epoch")

metric=datasets.load_metric('datasets/accuracy.py')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

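3. Fine-tuning in native PyTorch

Trainer is nice, but it brings a lot of hassle and is hard to debug, so it can be easier to write everything directly in native PyTorch.

For background on this part, see my earlier post: study notes on Deep Learning with PyTorch: A 60 Minute Blitz (诸神缄默不语's blog on CSDN)

One training loop:
feed the training data to the model and get predictions → compute the loss → compute the gradients → update the parameters → feed training data to the model again and get predictions

If you are continuing in a notebook after running the code above, it is best to first delete the earlier model, Trainer and so on and clear the CUDA cache to free memory, or simply restart the notebook: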

import torch

del model
del trainer
torch.cuda.empty_cache()

3.1 Dataset

Preprocess the dataset (how to build the required dataset from native Python objects is covered further below):

from torch.utils.data import DataLoader

tokenized_datasets = tokenized_datasets.remove_columns(["text"])

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

Using your own dataset:

The example dataset is obtained like this:

example_dict={'labels':dataset['train']['label'],'text':dataset['train']['text']}

Show the data:

print(type(example_dict['labels']))
print(example_dict['labels'][12345])
print(type(example_dict['text']))
print(example_dict['text'][12345])

Output:

<class 'list'>
2
<class 'list'>
I went here in search of a crepe with Nutella and I got a really good crepe. I wouldn't exactly say this place is authentic French because you've got Americans cooking the food,  but my crepe was still good. \n\nIt doesn't taste like the ones I had in France, Carmon's puts a twist on (or maybe it was just overcooked) theirs by making the crepe more firm. \n\nThe whipped cream was also made fresh and delightful. The prices were horrid though.\n\nCrepes don't cost that much to make, so they're clearly overpricing here. Price is the only reason I won't come back so often.


① Using torch's Dataset and DataLoader classes (the result is essentially equivalent to what the datasets.Dataset pipeline above ends up with):

import torch
from torch.utils.data import Dataset,DataLoader

class YelpDataset(Dataset):
    def __init__(self,dict_data) -> None:
        """
        dict_data: data as a dict; key 'labels' maps to the list of labels (numeric),
        key 'text' maps to the list of texts
        """
        super(YelpDataset,self).__init__()

        self.data=dict_data

    def __getitem__(self, index):
        return [self.data['text'][index],self.data['labels'][index]]

    def __len__(self):
        return len(self.data['text'])

def collate_fn(batch):
    pt_batch=tokenizer([b[0] for b in batch],padding=True,truncation=True,max_length=512,
                        return_tensors='pt')
    labels=torch.tensor([b[1] for b in batch])
    return {'labels':labels,'input_ids':pt_batch['input_ids'],'token_type_ids':pt_batch['token_type_ids'],
            'attention_mask':pt_batch['attention_mask']}

train_data=YelpDataset(example_dict)

train_dataloader=DataLoader(train_data,batch_size=8,shuffle=True,collate_fn=collate_fn)
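
Unlike padding='max_length', the padding=True used in collate_fn pads only up to the longest sequence in the current batch. A minimal pure-Python sketch of that behaviour follows; `pad_to_longest` is a hypothetical helper working on lists of token ids, not a transformers function.

```python
def pad_to_longest(batch_ids, pad_id=0):
    """Right-pad every sequence to the longest one in this batch and build the mask."""
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_to_longest([[101, 7, 102], [101, 7, 8, 9, 102]])
print(batch["input_ids"])       # [[101, 7, 102, 0, 0], [101, 7, 8, 9, 102]]
print(batch["attention_mask"])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

Padding per batch means short batches stay short, which usually saves computation compared with always padding to 512.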

② Writing the DataLoader by hand:
In each epoch of the training loop, iterate like this (most variable names should be self-explanatory, so I will not explain them in detail):

batch_size=8  # same batch size as in the DataLoader examples above

train_data_length=len(example_dict['labels'])

# number of batches; a final partial batch counts as one more
if train_data_length%batch_size==0:
    batch_num=int(train_data_length/batch_size)
else:
    batch_num=int(train_data_length/batch_size)+1

for b in range(batch_num):
    index_begin=b*batch_size
    index_end=min(train_data_length,index_begin+batch_size)

    this_batch_text=example_dict['text'][index_begin:index_end]
    this_batch_labels=example_dict['labels'][index_begin:index_end]

    pt_batch=tokenizer(this_batch_text,padding=True,truncation=True,max_length=512,return_tensors='pt')
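
The same batching idea can be wrapped in a reusable generator that also shuffles, similar in spirit to DataLoader(shuffle=True). This is only a sketch: `iterate_batches` and the toy data are hypothetical, and tokenization is left out so that the example runs without a model.

```python
import math
import random

def iterate_batches(data, batch_size, shuffle=False, seed=None):
    """Yield (text_batch, label_batch) pairs from a dict with 'text' and 'labels' lists."""
    indices = list(range(len(data["labels"])))
    if shuffle:
        random.Random(seed).shuffle(indices)
    num_batches = math.ceil(len(indices) / batch_size)
    for b in range(num_batches):
        chunk = indices[b * batch_size:(b + 1) * batch_size]
        yield [data["text"][i] for i in chunk], [data["labels"][i] for i in chunk]

toy = {"text": [f"review {i}" for i in range(10)], "labels": [i % 5 for i in range(10)]}
sizes = [len(texts) for texts, _ in iterate_batches(toy, batch_size=4, shuffle=True, seed=42)]
print(sizes)  # [4, 4, 2]
```

Each yielded text batch can then be passed to the tokenizer exactly as in the loop above.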

3.2 Neural network model

Define the classification model:

from transformers import AutoModelForSequenceClassification

model=AutoModelForSequenceClassification.from_pretrained("mypath/bert-base-cased",
                                                        num_labels=5)

3.3 Optimizer and learning rate scheduler

As seen above, the transformers Trainer uses the transformers implementation of AdamW by default and prints this warning:
FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set no_deprecation_warning=True to disable this warning

So stop using the old AdamW and use the official PyTorch AdamW optimizer instead:

from torch.optim import AdamW
optimizer = AdamW(model.parameters(), lr=5e-5)

Create the learning rate scheduler that Trainer uses by default:

from transformers import get_scheduler

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)
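
To see what a linear schedule does to the learning rate, the sketch below computes it by hand, as I understand the schedule: linear warmup (none here) followed by linear decay to 0 at the last step. `linear_lr` is a hypothetical helper, not the transformers implementation.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at a given step: linear warmup, then linear decay to 0."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))

num_training_steps = 3 * 125  # 3 epochs x 125 batches (1000 examples, batch size 8)
for step in (0, 125, 250, 375):
    print(step, linear_lr(step, 5e-5, num_training_steps))
```

With num_warmup_steps=0 the learning rate starts at the full 5e-5 and shrinks by the same amount at every step, reaching 0 at step 375.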

3.4 Device

Specify the device (single-GPU case) and move the model to it:

import torch

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

3.5 Training loop

tqdm website: https://tqdm.github.io/

from tqdm.auto import tqdm

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

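Output: [progress bar screenshot]

In real code you can also add features such as early stopping and saving the model that scores best on the validation set.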
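
One possible shape for such logic is sketched below: a small tracker that remembers the best validation score and signals when to stop. `EarlyStopping` is a hypothetical helper, and the accuracy values are made up purely to exercise it.

```python
class EarlyStopping:
    """Track the best validation score and signal when to stop (hypothetical helper)."""

    def __init__(self, patience=2):
        self.patience = patience    # how many epochs without improvement to tolerate
        self.best_score = None
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, score):
        """Return (is_best, should_stop) for this epoch's validation score."""
        if self.best_score is None or score > self.best_score:
            self.best_score, self.best_epoch = score, epoch
            self.bad_epochs = 0
            return True, False      # new best: this is where the model would be saved
        self.bad_epochs += 1
        return False, self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=2)
made_up_accuracies = [0.49, 0.57, 0.56, 0.55, 0.60]  # illustrative numbers only
for epoch, acc in enumerate(made_up_accuracies):
    is_best, should_stop = stopper.step(epoch, acc)
    if should_stop:
        break
print(stopper.best_epoch, stopper.best_score, epoch)  # 1 0.57 3
```

In a real loop, the score would come from evaluating on the validation set after each epoch, and the is_best branch is where the model weights would be saved.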

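3.6 Metrics

As with Trainer, the Metric class from the datasets package is used to compute the metric.
Here evaluation runs after training has finished, and Metric's add_batch() function (docs: https://huggingface.co/docs/datasets/package_reference/main_classes.html?highlight=add_batch#datasets.Metric.add_batch) accumulates all batches.

import datasets

metric = datasets.load_metric("accuracy")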
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

metric.compute()

Output: {'accuracy': 0.588}
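
The accumulate-then-compute pattern of add_batch() and compute() can be mimicked in a few lines of plain Python, which may help when the metric script is not available. `RunningAccuracy` is a hypothetical stand-in with made-up inputs; with torch tensors, call .tolist() on the predictions and labels first.

```python
class RunningAccuracy:
    """Accumulate predictions batch by batch, then compute accuracy once (hypothetical stand-in)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def add_batch(self, predictions, references):
        for pred, ref in zip(predictions, references):
            self.correct += int(pred == ref)
            self.total += 1

    def compute(self):
        return {"accuracy": self.correct / self.total}

running = RunningAccuracy()
running.add_batch(predictions=[0, 4, 2], references=[0, 4, 1])  # 2 of 3 correct
running.add_batch(predictions=[3, 3], references=[3, 2])        # 1 of 2 correct
print(running.compute())  # {'accuracy': 0.6}
```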

3.7 Complete script

from tqdm.auto import tqdm

import torch
from torch.utils.data import DataLoader
from torch.optim import AdamW

import datasets
from transformers import AutoTokenizer,AutoModelForSequenceClassification,get_scheduler

dataset=datasets.load_from_disk("datasets/yelp_full_review_disk")

tokenizer = AutoTokenizer.from_pretrained("pretrained_models/bert-base-cased")

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length",truncation=True,max_length=512)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

tokenized_datasets = tokenized_datasets.remove_columns(["text"])

tokenized_datasets = tokenized_datasets.rename_column("label", "labels")

tokenized_datasets.set_format("torch")

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(1000))

train_dataloader = DataLoader(small_train_dataset, shuffle=True, batch_size=8)
eval_dataloader = DataLoader(small_eval_dataset, batch_size=8)

model=AutoModelForSequenceClassification.from_pretrained("pretrained_models/bert-base-cased",
                                                        num_labels=5)

optimizer = AdamW(model.parameters(), lr=5e-5)

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    name="linear", optimizer=optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
)

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
model.to(device)

progress_bar = tqdm(range(num_training_steps))

model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

metric=datasets.load_metric('datasets/accuracy.py')
model.eval()
for batch in eval_dataloader:
    batch = {k: v.to(device) for k, v in batch.items()}
    with torch.no_grad():
        outputs = model(**batch)

    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)
    metric.add_batch(predictions=predictions, references=batch["labels"])

print(metric.compute())

4. Other learning resources given in the tutorial
🤗 Transformers Examples: I plan to write study notes on these.
🤗 Transformers Notebooks: I may write study notes on these as well.

Original: https://blog.csdn.net/PolarisRisingWar/article/details/123939061
Author: 诸神缄默不语
Title: Fine-tuning a pretrained model on a text classification task with huggingface.transformers.AutoModelForSequenceClassification
