Chinese Automatic Summarization with the BART Model in PyTorch

July 22, 2023, 1:24 PM • Artificial Intelligence

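Series articles

Abstract

Fine-tune a BART model for Chinese automatic summarization.
For how to fine-tune a BART model, see article 1 of this series.
This post provides the dataset and the trained model. The results show that the model has learned to summarize, but it is poor at choosing where to stop.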

Implementation

Data preparation

First, install the required packages:

! python --version
! pip install datasets transformers rouge-score jieba
! pip install --upgrade ipywidgets tqdm
! jupyter nbextension enable --py widgetsnbextension
! pip install fire

Import the packages and load the dataset.
The dataset is the one I uploaded:

from ipywidgets import IntProgress
import tqdm
from datasets import load_dataset
dataset = load_dataset('json', data_files='nlpcc_data.json', field='data')

Next, process the data:

dataset["train"]

def flatten(example):
    return {
        "document": example["content"],
        "summary": example["title"],
        "id":"0"
    }
dataset = dataset["train"].map(flatten, remove_columns=["title", "content"])
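
The `field='data'` argument and the keys read by `flatten` suggest that `nlpcc_data.json` stores its records under a top-level `data` key, each record having a `title` and a `content`. The sketch below shows that layout with the standard library only. The sample record is invented for illustration, and the real file has not been inspected here.

```python
import json

# Hypothetical sample record. The layout is inferred from field='data'
# and from the keys that flatten() reads ("title" and "content").
sample = {
    "data": [
        {"title": "A short headline", "content": "The full text of the news article."},
    ]
}


def flatten(example):
    # Same mapping as in the post: content -> document, title -> summary.
    return {
        "document": example["content"],
        "summary": example["title"],
        "id": "0",
    }


# Round-trip through JSON text to mimic reading the file from disk.
records = json.loads(json.dumps(sample, ensure_ascii=False))["data"]
flattened = [flatten(record) for record in records]
print(flattened[0]["summary"])
```

In the post itself, `dataset.map(flatten, remove_columns=[...])` applies the same mapping to every record and drops the original two columns.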

Next, set up the tokenizer and the full BART checkpoint in preparation for fine-tuning:

TokenModel = "bert-base-chinese"

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(TokenModel)

model_checkpoint = "facebook/bart-large-cnn"
if model_checkpoint in ["t5-small", "t5-base", "t5-large", "t5-3b", "t5-11b"]:
    prefix = "summarize: "
else:
    prefix = ""

max_input_length = 1024
max_target_length = 256

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["document"]]
    model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)

    with tokenizer.as_target_tokenizer():
        labels = tokenizer(examples["summary"], max_length=max_target_length, truncation=True)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

dataset

raw_datasets = dataset
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

Loading the data

from datasets import dataset_dict
import datasets

train_data_txt, validation_data_txt = dataset.train_test_split(test_size=0.1).values()
train_data_txt, test_data_tex = train_data_txt.train_test_split(test_size=0.1).values()

dd = datasets.DatasetDict({"train":train_data_txt,"validation": validation_data_txt,"test":test_data_tex })

raw_datasets = dd
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
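
Because the second `train_test_split` is applied to what remains after the first, the two 10% splits do not give an 80/10/10 partition. A rough sketch of the arithmetic follows. The function name and the 1000-example corpus are hypothetical, and the library's own rounding may differ by an example or so.

```python
def split_sizes(n_examples, test_fraction=0.1):
    """Approximate sizes after two consecutive hold-out splits."""
    # First split: hold out the validation set from the full corpus.
    n_validation = round(n_examples * test_fraction)
    remaining = n_examples - n_validation
    # Second split: hold out the test set from what is left.
    n_test = round(remaining * test_fraction)
    n_train = remaining - n_test
    return n_train, n_validation, n_test


# Hypothetical corpus of 1000 examples.
sizes = split_sizes(1000)
print(sizes)
```

So the result is roughly 81% train, 10% validation, and 9% test.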

The processed dataset

raw_datasets

import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_elements(dataset, num_examples=5):
    assert num_examples <= len(dataset), "Can't pick more elements than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset)-1)
        while pick in picks:
            pick = random.randint(0, len(dataset)-1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

Preview the data

show_random_elements(raw_datasets["train"])

Install transformers

! pip install transformers

from transformers import AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq, Seq2SeqTrainingArguments, Seq2SeqTrainer

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)


import warnings
from pathlib import Path
from typing import List, Tuple, Union

import fire
from torch import nn

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, PreTrainedModel
from transformers.utils import logging

logger = logging.get_logger(__name__)

Extract part of the model to fine-tune

def copy_layers(src_layers: nn.ModuleList, dest_layers: nn.ModuleList, layers_to_copy: List[int]) -> None:
    layers_to_copy = nn.ModuleList([src_layers[i] for i in layers_to_copy])
    assert len(dest_layers) == len(layers_to_copy), f"{len(dest_layers)} != {len(layers_to_copy)}"
    dest_layers.load_state_dict(layers_to_copy.state_dict())

LAYERS_TO_COPY = {

    12: {
        1: [0],
        2: [0, 6],
        3: [0, 6, 11],
        4: [0, 4, 8, 11],
        6: [0, 2, 4, 7, 9, 11],
        9: [0, 1, 2, 4, 5, 7, 9, 10, 11],
        12: list(range(12)),
    },
    16: {
        1: [0],
        2: [0, 15],
        3: [0, 8, 15],
        4: [0, 5, 10, 15],
        6: [0, 3, 6, 9, 12, 15],
        8: [0, 2, 4, 6, 8, 10, 12, 15],
        9: [0, 1, 3, 5, 7, 9, 11, 13, 15],
        12: [0, 1, 2, 3, 4, 5, 6, 7, 9, 11, 13, 15],
        16: list(range(16)),
    },
    6: {1: [0], 2: [0, 5], 3: [0, 2, 5], 4: [0, 1, 3, 5], 6: list(range(6))},
}
LAYERS_TO_SUPERVISE = {

    6: {1: [5], 2: [3, 5], 3: [1, 4, 5], 4: [1, 2, 4, 5]},
    12: {1: [11], 2: [5, 11], 3: [3, 7, 11], 6: [1, 3, 5, 8, 10, 11]},
    16: {1: [15], 4: [4, 9, 12, 15], 8: [1, 3, 5, 7, 9, 11, 13, 15]},
}

def create_student_by_copying_alternating_layers(
    teacher: Union[str, PreTrainedModel],
    save_path: Union[str, Path] = "student",
    e: Union[int, None] = None,
    d: Union[int, None] = None,
    copy_first_teacher_layers=False,
    e_layers_to_copy=None,
    d_layers_to_copy=None,
    **extra_config_kwargs
) -> Tuple[PreTrainedModel, List[int], List[int]]:

    _msg = "encoder_layers and decoder_layers cannot be both None-- you would just have an identical teacher."
    assert (e is not None) or (d is not None), _msg
    if isinstance(teacher, str):
        AutoTokenizer.from_pretrained(teacher).save_pretrained(save_path)
        teacher = AutoModelForSeq2SeqLM.from_pretrained(teacher).eval()
    else:

        assert isinstance(teacher, PreTrainedModel), f"teacher must be a model or string got type {type(teacher)}"
    init_kwargs = teacher.config.to_diff_dict()

    try:
        teacher_e, teacher_d = teacher.config.encoder_layers, teacher.config.decoder_layers
        if e is None:
            e = teacher_e
        if d is None:
            d = teacher_d
        init_kwargs.update({"encoder_layers": e, "decoder_layers": d})
    except AttributeError:
        teacher_e, teacher_d = teacher.config.num_layers, teacher.config.num_decoder_layers
        if e is None:
            e = teacher_e
        if d is None:
            d = teacher_d
        init_kwargs.update({"num_layers": e, "num_decoder_layers": d})

    init_kwargs.update(extra_config_kwargs)

    student_cfg = teacher.config_class(**init_kwargs)
    student = AutoModelForSeq2SeqLM.from_config(student_cfg)

    info = student.load_state_dict(teacher.state_dict(), strict=False)
    assert info.missing_keys == [], info.missing_keys

    if copy_first_teacher_layers:
        e_layers_to_copy, d_layers_to_copy = list(range(e)), list(range(d))
        logger.info(
            f"Copied encoder layers {e_layers_to_copy} and decoder layers {d_layers_to_copy}. Saving them to {save_path}"
        )
        student.save_pretrained(save_path)
        return student, e_layers_to_copy, d_layers_to_copy

    if e_layers_to_copy is None:
        e_layers_to_copy: List[int] = pick_layers_to_copy(e, teacher_e)
    if d_layers_to_copy is None:
        d_layers_to_copy: List[int] = pick_layers_to_copy(d, teacher_d)

    try:
        copy_layers(teacher.model.encoder.layers, student.model.encoder.layers, e_layers_to_copy)
        copy_layers(teacher.model.decoder.layers, student.model.decoder.layers, d_layers_to_copy)
    except AttributeError:
        copy_layers(teacher.encoder.block, student.encoder.block, e_layers_to_copy)
        copy_layers(teacher.decoder.block, student.decoder.block, d_layers_to_copy)
    logger.info(
        f"Copied encoder layers {e_layers_to_copy} and decoder layers {d_layers_to_copy}. Saving them to {save_path}"
    )
    student.config.init_metadata = dict(
        teacher_type=teacher.config.model_type,
        copied_encoder_layers=e_layers_to_copy,
        copied_decoder_layers=d_layers_to_copy,
    )
    student.save_pretrained(save_path)

    return student, e_layers_to_copy, d_layers_to_copy

def pick_layers_to_copy(n_student, n_teacher):
    try:
        val = LAYERS_TO_COPY[n_teacher][n_student]
        return val
    except KeyError:
        if n_student != n_teacher:
            warnings.warn(
                f"no hardcoded layers to copy for teacher {n_teacher} -> student {n_student}, defaulting to first {n_student}"
            )
        return list(range(n_student))

model, list_en, list_de = create_student_by_copying_alternating_layers(model, 'trian.pth', 12, 3)
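
The call above passes `e=12` and `d=3`. Assuming the teacher checkpoint has 12 encoder and 12 decoder layers (worth confirming against `model.config`), the student keeps every encoder layer and three decoder layers. The sketch below reproduces only the table lookup. `LAYERS_TO_COPY_12` and `pick_from_12` are illustrative names, not part of the original script.

```python
# Copy of the 12-layer row of LAYERS_TO_COPY from the post.
LAYERS_TO_COPY_12 = {
    1: [0],
    2: [0, 6],
    3: [0, 6, 11],
    4: [0, 4, 8, 11],
    6: [0, 2, 4, 7, 9, 11],
    9: [0, 1, 2, 4, 5, 7, 9, 10, 11],
    12: list(range(12)),
}


def pick_from_12(n_student):
    # When the table has no entry, fall back to the first n_student layers,
    # which is what pick_layers_to_copy does after emitting a warning.
    return LAYERS_TO_COPY_12.get(n_student, list(range(n_student)))


encoder_layers = pick_from_12(12)  # all twelve encoder layers
decoder_layers = pick_from_12(3)   # first, middle, and last decoder layer
print(encoder_layers, decoder_layers)
```

Only the decoder is shrunk here, so the student's encoder is identical to the teacher's.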

batch_size = 2
args = Seq2SeqTrainingArguments(
    output_dir="results",
    num_train_epochs=1,
    do_train=True,
    do_eval=True,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,

    warmup_steps=500,
    weight_decay=0.1,
    label_smoothing_factor=0.1,
    predict_with_generate=True,
    logging_dir="logs",
    logging_steps=50,
    save_total_limit=3,
)
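
With `batch_size = 2` and one epoch, the number of optimizer steps is about half the number of training examples, and `warmup_steps=500` covers the first 1000 examples. The sketch below does that arithmetic under the assumption of a single device and no gradient accumulation. The training-set size is a made-up figure, and `steps_per_epoch` is an illustrative helper, not a Trainer API.

```python
import math


def steps_per_epoch(n_train_examples, per_device_batch_size, n_devices=1, grad_accum=1):
    """Rough count of optimizer steps in one epoch."""
    examples_per_step = per_device_batch_size * n_devices * grad_accum
    return math.ceil(n_train_examples / examples_per_step)


n_train = 40_500  # hypothetical training-set size
total_steps = steps_per_epoch(n_train, per_device_batch_size=2)
warmup_fraction = 500 / total_steps
print(total_steps, round(warmup_fraction, 4))
```

If the real training set is much smaller than this, 500 warmup steps would take up a large share of the single epoch.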

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

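import jieba
import numpy as np
from datasets import load_metric

# compute_metrics below uses `metric`, which the post never defines.
# Assumption: it is the ROUGE metric, loaded through an older `datasets`
# release that still ships load_metric (its results expose .mid.fmeasure).
metric = load_metric("rouge")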

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)

    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    decoded_preds = ["\n".join(jieba.cut(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(jieba.cut(label.strip())) for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)

    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}

    prediction_lens = [np.count_nonzero(pred != tokenizer.pad_token_id) for pred in predictions]
    result["gen_len"] = np.mean(prediction_lens)

    return {k: round(v, 4) for k, v in result.items()}
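
`compute_metrics` reports each ROUGE F-measure multiplied by 100. To make that number concrete, here is a simplified unigram-overlap (ROUGE-1 style) F-measure over word lists. It is a sketch, not the rouge-score library's code, which applies its own tokenization. The two word lists are segmented by hand and only stand in for `jieba.cut` output.

```python
from collections import Counter


def rouge1_f(pred_tokens, ref_tokens):
    """Unigram-overlap F-measure between two token lists."""
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hand-segmented examples standing in for jieba.cut output.
pred = ["今天", "天气", "很", "好"]
ref = ["今天", "天气", "不错"]
score = rouge1_f(pred, ref)
print(round(score * 100, 4))
```

Two of the four predicted words appear in the three-word reference, giving precision 1/2, recall 2/3, and an F-measure of 4/7.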

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

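Training and training platform

Start training. I used Huawei Cloud, because the GPU time available on Colab is not enough to finish training.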

trainer.train()

Saving and loading the model

import torch

torch.save(model.state_dict(), "BART.pth")

model.load_state_dict(torch.load('BART.pth'))

def generate_summary(test_samples, model):
    inputs = tokenizer(
        test_samples,
        padding="max_length",
        truncation=True,
        max_length=max_input_length,
        return_tensors="pt",
    )
    input_ids = inputs.input_ids.to(model.device)
    attention_mask = inputs.attention_mask.to(model.device)
    outputs = model.generate(input_ids, attention_mask=attention_mask)
    output_str = tokenizer.batch_decode(outputs, skip_special_tokens=True)
    return outputs, output_str

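Testing

A test can be run:

(The result is a news item that contains leaders' names, so I can't post it here; see the GitHub repository.)

The model learns some summarization ability, but it does not stop automatically where the summary should end, so there is still room for improvement.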
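
One possible contributor, offered as an assumption rather than a verified diagnosis: the tokenizer comes from `bert-base-chinese`, which closes sequences with `[SEP]`, while the model configuration comes from `facebook/bart-large-cnn` and carries its own end-of-sequence id. If the two ids differ, generation would not stop at `[SEP]`. A simple workaround to try is cutting each generated sequence at the first `[SEP]` before decoding. The helper name and the token ids below are illustrative. In practice the end id would be read from `tokenizer.sep_token_id`.

```python
def truncate_at_end_token(token_ids, end_id):
    """Return the ids before the first end_id, or all ids if it is absent."""
    token_ids = list(token_ids)
    if end_id in token_ids:
        return token_ids[: token_ids.index(end_id)]
    return token_ids


SEP_ID = 102  # assumed id of [SEP]; read tokenizer.sep_token_id in practice
generated = [101, 2769, 4263, 102, 872, 1962, 102]  # made-up ids for illustration
trimmed = truncate_at_end_token(generated, SEP_ID)
print(trimmed)
```

Each row of `outputs` from `generate_summary` could be passed through such a helper before `tokenizer.decode`. Whether this resolves the behavior described above would need to be checked against the trained model.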

Downloadable dataset and model

Dataset: nlpcc_data.json
Trained model: BART model, including network parameters

GitHub

https://github.com/downw/summrization

Original: https://blog.csdn.net/weixin_43718786/article/details/119741580
Author: keep-hungry
Title: Chinese Automatic Summarization with the BART Model in PyTorch
