BERT+LSTM+CRF Named Entity Recognition: PyTorch Code Walkthrough

May 27, 2023, 6:48 PM • Artificial Intelligence

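BERT+LSTM+CRF Named Entity Recognition

A walkthrough of the source code from scratch.

Understand the logic of the original code: why a pre-trained BERT is used, what BERT contributes, how the network is built, how training proceeds, and what the output is.
Debug and run the source code.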

The goal of NER

NER is short for named entity recognition: identifying person names, place names, organization names, dates and times, proper nouns, and so on.

Output tagging scheme

Fine-grained tagging assigns a label to every character, and a run of consecutive characters may together form one entity. This differs from the structure of the original dataset, so the data has to be processed and converted into the corresponding fine-grained tagging form.

Converting the dataset format

Original format:

{
    "text": "浙商银行企业信贷部叶老桂博士则从另一个角度对五道门槛进行了解读。叶老桂认为，对目前国内商业银行而言，",
    "label": {
        "name": {
            "叶老桂": [
                [9, 11],
                [32, 34]
            ]
        },
        "company": {
            "浙商银行": [
                [0, 3]
            ]
        }
    }
}

Format after conversion:

sentence: ['温', '格', '的', '球', '队', '终', '于', '又', '踢', '了', '一', '场', '经', '典', '的', '比', '赛', '，', '2', '比', '1', '战', '胜', '曼', '联', '之', '后', '枪', '手', '仍', '然', '留', '在', '了', '夺', '冠', '集', '团', '之', '内', '，']
label: ['B-name', 'I-name', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-organization', 'I-organization', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
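The conversion from the span-based JSON format to per-character BIO tags can be sketched as follows. This is a minimal illustration rather than the repository's own code; the function name spans_to_bio is hypothetical, and it assumes the end index of each span is inclusive, as in the JSON sample above.

```python
def spans_to_bio(text, label):
    """Convert {"type": {"entity": [[start, end], ...]}} spans to per-character BIO tags."""
    tags = ["O"] * len(text)
    for entity_type, entities in label.items():
        for spans in entities.values():
            for start, end in spans:  # end index is inclusive
                tags[start] = "B-" + entity_type
                for i in range(start + 1, end + 1):
                    tags[i] = "I-" + entity_type
    return list(text), tags


text = "浙商银行企业信贷部叶老桂博士"
label = {"name": {"叶老桂": [[9, 11]]}, "company": {"浙商银行": [[0, 3]]}}
words, tags = spans_to_bio(text, label)
for word, tag in zip(words, tags):
    print(word, tag)
```

Every character that is not covered by a span keeps the tag O, which matches the sentence/label pair shown above.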

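Data preprocessing

Sentences are not word-segmented. NER is a sequence labeling task that has to determine entity boundaries, and a wrong segmentation would hurt the result (the B-x, I-x continuity would no longer match the original meaning).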

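def preprocess(self, mode):
    """
    params:
        words: the text of each line of the json file, split out and stored as the words list
        labels: the tag for each character of the text, stored as labels
    examples:
        words: ['生', '生', '不', '息', 'C', 'S', 'O', 'L']
        labels: ['O', 'O', 'O', 'O', 'B-game', 'I-game', 'I-game', 'I-game']
    """
    # ... (the code that builds word_list and label_list is not shown here)
    np.savez_compressed(output_dir, words=word_list, labels=label_list)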

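The saved file still holds one sentence per entry, so later processing only needs to prepend [CLS]; no terminator token is required.

Splitting the dataset and batching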

def dev_split(dataset_dir):
    """split dev set"""
    data = np.load(dataset_dir, allow_pickle=True)
    words = data["words"]
    labels = data["labels"]
    x_train, x_dev, y_train, y_dev = train_test_split(words, labels, test_size=config.dev_split_size, random_state=0)
    return x_train, x_dev, y_train, y_dev

train_test_split is called to split the data into train and dev sets.

Converting the data to indices and building the NERDataset class

    def __init__(self, words, labels, config, word_pad_idx=0, label_pad_idx=-1):
        self.tokenizer = BertTokenizer.from_pretrained(config.bert_model, do_lower_case=True)
        self.label2id = config.label2id
        self.id2label = {_id: _label for _label, _id in list(config.label2id.items())}
        self.dataset = self.preprocess(words, labels)
        self.word_pad_idx = word_pad_idx
        self.label_pad_idx = label_pad_idx
        self.device = config.device

    def preprocess(self, origin_sentences, origin_labels):
        """
        Maps tokens and tags to their indices and stores them in the dict data.

        examples:
            word:['[CLS]', '浙', '商', '银', '行', '企', '业', '信', '贷', '部']
            sentence:([101, 3851, 1555, 7213, 6121, 821, 689, 928, 6587, 6956],
                        array([ 1,  2,  3,  4,  5,  6,  7,  8,  9]))
            label:[3, 13, 13, 13, 0, 0, 0, 0, 0]
        """
        data = []
        sentences = []
        labels = []

        for line in origin_sentences:

            words = []
            word_lens = []
            for token in line:
                words.append(self.tokenizer.tokenize(token))
                word_lens.append(len(token))

            words = ['[CLS]'] + [item for token in words for item in token]
            token_start_idxs = 1 + np.cumsum([0] + word_lens[:-1])

            sentences.append((self.tokenizer.convert_tokens_to_ids(words), token_start_idxs))

        for tag in origin_labels:
            label_id = [self.label2id.get(t) for t in tag]
            labels.append(label_id)
        for sentence, label in zip(sentences, labels):
            data.append((sentence, label))
        return data

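preprocess handles tokens and word pieces. It records the start position of each original token within the word-piece sequence so that the two can be aligned later. Each token is passed through tokenize (Chinese characters are unchanged; English words could be split, but because preprocessing already splits words into single letters this has no effect). [CLS] is then prepended to the sentence; the sentence start cannot be omitted, because producing the first word also needs a probability. Finally the characters are converted to ids and stored, and the tags are converted to ids as well.

Utility methods of the class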

def __getitem__(self, idx):
    """sample data to get batch"""
    word = self.dataset[idx][0]
    label = self.dataset[idx][1]
    return [word, label]
def __len__(self):
    """get dataset size"""
    return len(self.dataset)

These two methods make the dataset indexable and let len() report its size.
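As a small illustration of what these two methods provide, here is a toy stand-in for the dataset class. ToyDataset and its contents are hypothetical; only the __getitem__ and __len__ pattern mirrors the code above.

```python
class ToyDataset:
    """Minimal stand-in for NERDataset: holds (sentence, label) pairs."""

    def __init__(self, pairs):
        self.dataset = list(pairs)

    def __getitem__(self, idx):
        word = self.dataset[idx][0]
        label = self.dataset[idx][1]
        return [word, label]

    def __len__(self):
        return len(self.dataset)


toy = ToyDataset([(["生", "生"], ["O", "O"]), (["C", "S"], ["B-game", "I-game"])])
print(len(toy))  # number of samples
print(toy[1])    # the second (sentence, label) pair
```

A DataLoader relies on exactly this protocol: it asks for the length and then fetches samples by index to assemble batches.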

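encode_plus can encode text directly, but it cannot be used here: the alignment constraint

Each word has to correspond to its label. If the text is tokenized and encoded directly, the correspondence between word pieces and labels can no longer be determined.

tokenize()

For English, tokenize may turn one token into several word pieces, e.g. cutting -> cut + ##ing.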
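To make the sub-word splitting concrete, the sketch below implements a greedy longest-match-first split in the style of WordPiece. It is a simplified illustration, not the BertTokenizer implementation, and toy_vocab is a hypothetical vocabulary chosen for the example.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # nothing matches: treat the whole word as unknown
        pieces.append(piece)
        start = end
    return pieces


toy_vocab = {"play", "##ing", "银", "行"}  # hypothetical vocabulary
print(wordpiece("playing", toy_vocab))
print(wordpiece("银", toy_vocab))
```

A single Chinese character maps to one piece, which is why character-level input keeps tokens and labels aligned one to one.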

np.cumsum(a): cumulative sum

[1,1,1]--->[1,2,3]
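The way preprocess uses np.cumsum to compute token start positions can be reproduced in isolation. The helper name start_indices is hypothetical, and the second call uses a made-up case where one token splits into two pieces.

```python
import numpy as np


def start_indices(word_lens):
    """Start index of each original token in the word-piece sequence.

    The leading 1 skips the [CLS] token at position 0.
    """
    return 1 + np.cumsum([0] + word_lens[:-1])


# one word piece per Chinese character
print(start_indices([1, 1, 1, 1]))  # [1 2 3 4]

# hypothetical: the second token is split into 2 pieces, so later starts shift
print(start_indices([1, 2, 1]))     # [1 2 4]
```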

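Model architecture

First, note that the model inherits from the BERT base class and defines its own forward function; that is all it takes to build the network. The basic structure looks like this:

import torch.nn as nn

class Module(nn.Module):
    def __init__(self):
        super(Module, self).__init__()

    def forward(self, x):
        return x

data = ...  # some input (left unspecified)

module = Module()

module(data)  # calling the module invokes forward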

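About forward

nn.Module defines __call__ so that it invokes forward; passing arguments to a module instance therefore calls forward automatically.

A class that defines the __call__ method can be called like a function (see Python's object-oriented programming for details). In other words, when the network model is called like a function, its forward method is invoked automatically. Part of the source of nn.Module's __call__ method is shown below:

def __call__(self, *input, **kwargs):
    result = self.forward(*input, **kwargs)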
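The dispatch from __call__ to forward can be demonstrated without PyTorch. TinyModule and Doubler below are hypothetical toy classes; the real nn.Module does more work around the call (hooks, for instance), which this sketch leaves out.

```python
class TinyModule:
    """Toy stand-in for nn.Module: __call__ simply forwards to forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class Doubler(TinyModule):
    def forward(self, x):
        return 2 * x


model = Doubler()
print(model(21))          # calling the instance runs forward
print(model.forward(21))  # same result
```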

BERT modes: pick the matching mode; the code switches between them in different places (model.eval(); model.train())

train
eval
predict

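The nonzero() function

import numpy as np

a = np.array([[1, 1, 0], [1, 1, 0], [1, 0, 3]])
print(a.nonzero())  # row indices and column indices of the nonzero entries

The squeeze() function

squeeze removes dimensions of size 1. For example, [[[0,1,2],[1,2,3]]] with shape (1, 2, 3) becomes [[0,1,2],[1,2,3]] with shape (2, 3) after squeeze(0). In the forward function of this model, starts.nonzero() has shape (n, 1), and squeeze(1) turns it into shape (n,).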
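The combination of nonzero() and squeeze(1) is what the model later uses to pick one hidden vector per original token. The sketch below is a NumPy analogue under stated assumptions: np.argwhere stands in for torch's Tensor.nonzero(), and the mask values and hidden states are made up.

```python
import numpy as np

# hidden states for 6 word pieces ([CLS] + 5 pieces), hidden size 2
sequence_output = np.arange(12, dtype=float).reshape(6, 2)
# 0/1 mask marking the first piece of each original token (hypothetical values)
starts = np.array([0, 1, 1, 0, 1, 1])

idx = np.argwhere(starts)  # shape (4, 1), like starts.nonzero() in torch
idx = idx.squeeze(1)       # shape (4,), like .squeeze(1)
origin_sequence_output = sequence_output[idx]  # one row per original token

print(idx)
print(origin_sequence_output.shape)
```

Positions 0 and 3 are dropped: the first is [CLS], the other a continuation piece, so neither carries a label.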

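Training the CRF layer

Training objective: the score of a tag sequence is the LSTM output scores plus the transition scores accumulated along the sequence, i.e. the emission score and the transition score (ref). Usage: constructing the layer only needs the number of tags, the later forward call takes a batch of emissions and tags, and the decode function returns the predicted tag sequences. The printed values depend on the random emissions.

>>> import torch
>>> from torchcrf import CRF
>>> num_tags = 5
>>> model = CRF(num_tags)
>>> seq_length = 3  # maximum sequence length in a batch
>>> batch_size = 2  # number of samples in the batch
>>> emissions = torch.randn(seq_length, batch_size, num_tags)
>>> tags = torch.tensor([[0, 1], [2, 4], [3, 1]], dtype=torch.long)  # (seq_length, batch_size)
>>> mask = torch.tensor([[1, 1], [1, 1], [1, 0]], dtype=torch.uint8)
>>> model(emissions, tags, mask=mask)
tensor(-10.8390, grad_fn=<SumBackward0>)

>>> model.decode(emissions)
[[3, 1, 3], [0, 1, 0]]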
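To show what the emission and transition scores add up to, here is a minimal NumPy sketch of path scoring, Viterbi decoding, and the negative log-likelihood for a single sequence. It is an illustration under simplifying assumptions, not the torchcrf implementation: start and end transition scores are omitted, there is no batching or masking, and the partition function is computed by brute force.

```python
import itertools
import numpy as np


def path_score(emissions, transitions, tags):
    """Score of one tag path: emission scores plus transition scores along the path."""
    score = emissions[0, tags[0]]
    for t in range(1, len(tags)):
        score += transitions[tags[t - 1], tags[t]] + emissions[t, tags[t]]
    return score


def viterbi(emissions, transitions):
    """Return the highest-scoring tag path by dynamic programming."""
    seq_len, _ = emissions.shape
    score = emissions[0].copy()
    backpointers = []
    for t in range(1, seq_len):
        # total[i, j]: best score ending in tag i at t-1, then moving to tag j at t
        total = score[:, None] + transitions + emissions[t][None, :]
        backpointers.append(total.argmax(axis=0))
        score = total.max(axis=0)
    best = [int(score.argmax())]
    for bp in reversed(backpointers):
        best.append(int(bp[best[-1]]))
    return best[::-1]


rng = np.random.default_rng(0)
emissions = rng.normal(size=(4, 3))    # (seq_len, num_tags), e.g. the linear layer's output
transitions = rng.normal(size=(3, 3))  # transitions[i, j]: score of moving from tag i to tag j

best_path = viterbi(emissions, transitions)

# brute force over all 3**4 paths as a sanity check
all_paths = list(itertools.product(range(3), repeat=4))
all_scores = np.array([path_score(emissions, transitions, p) for p in all_paths])
brute = list(all_paths[int(all_scores.argmax())])
print(best_path, brute)

# negative log-likelihood of a path: log-sum-exp over all paths minus the path's score
log_z = np.log(np.exp(all_scores).sum())
nll = log_z - path_score(emissions, transitions, best_path)
print(nll)
```

The CRF layer returns the log-likelihood, which is why the model code multiplies it by -1 to obtain a loss to minimize.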

Refer to this figure:

Model construction:

class BertNER(BertPreTrainedModel):
    def __init__(self, config):
        super(BertNER, self).__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.bilstm = nn.LSTM(
            input_size=config.lstm_embedding_size,
            hidden_size=config.hidden_size // 2,
            batch_first=True,
            num_layers=2,
            dropout=config.lstm_dropout_prob,
            bidirectional=True
        )
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.crf = CRF(config.num_labels, batch_first=True)

        self.init_weights()

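The model is assembled from modules that PyTorch already provides. The BERT layer comes first, followed by a dropout layer that randomly zeroes activations, and then a bidirectional LSTM. Note that a bidirectional LSTM concatenates the hidden states of the two directions, so the hidden size of each direction is set to half of the total output size. A linear layer follows and produces, for every position, a score for each of the n tags; these scores are the input to the CRF layer, which comes last.

Weight initialization: for the pre-trained model the existing parameters are loaded directly, and parameters that are not available are initialized randomly.

The forward pass: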

def forward(self, input_data, token_type_ids=None, attention_mask=None, labels=None,
            position_ids=None, inputs_embeds=None, head_mask=None):
    input_ids, input_token_starts = input_data
    outputs = self.bert(input_ids,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids,
                        position_ids=position_ids,
                        head_mask=head_mask,
                        inputs_embeds=inputs_embeds)
    sequence_output = outputs[0]

    origin_sequence_output = [layer[starts.nonzero().squeeze(1)]
                              for layer, starts in zip(sequence_output, input_token_starts)]

    padded_sequence_output = pad_sequence(origin_sequence_output, batch_first=True)

    padded_sequence_output = self.dropout(padded_sequence_output)
    lstm_output, _ = self.bilstm(padded_sequence_output)

    logits = self.classifier(lstm_output)
    outputs = (logits,)
    if labels is not None:
        loss_mask = labels.gt(-1)
        loss = self.crf(logits, labels, loss_mask) * (-1)
        outputs = (loss,) + outputs

    # contains: (loss), scores
    return outputs

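If labels are provided, the loss is computed; otherwise only the output of the linear layer is returned, which makes it easy to decode predictions afterwards with the CRF's decode function. This is used in train.py/evaluate():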

            batch_output = model((batch_data, batch_token_starts),
                                 token_type_ids=None, attention_mask=batch_masks)[0]

            batch_output = model.crf.decode(batch_output, mask=label_masks)
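The mask passed to the CRF comes from the padded labels: every position whose label is greater than -1 is real, everything else is padding. The sketch below is a NumPy analogue of labels.gt(-1); the label ids are made up, and the padding value -1 follows the label_pad_idx default shown earlier.

```python
import numpy as np

label_pad_idx = -1  # same default as label_pad_idx in the dataset class
batch_labels = [[3, 13, 13, 0], [0, 0]]  # hypothetical label ids for two sentences

max_len = max(len(seq) for seq in batch_labels)
padded = np.full((len(batch_labels), max_len), label_pad_idx)
for row, seq in enumerate(batch_labels):
    padded[row, :len(seq)] = seq

loss_mask = padded > -1  # NumPy analogue of labels.gt(-1)
print(padded)
print(loss_mask)
```

With this mask, padded positions contribute neither to the loss nor to the decoded tag sequence.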

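The role of each layer:

bert

BERT provides the embedding representation of each token. Because it was trained on large-scale data, its representations generalize better, so a pre-trained model is used and the parameters start from a better initialization.

lstm

The task-specific part of the model starts here. The bidirectional LSTM learns the context of the sentence in order to produce a tag for every character.

crf

The LSTM does not learn the structural constraints on tag sequences (for example, that I-x must follow B-x or I-x), so a conditional random field (CRF) layer is added to enforce them and improve the result. The loss combines the emission scores and the transition scores; the smaller, the better.

Validation

The F1 score is used. It balances the precision and recall of the classifier, ranges from 0 to 1, and the larger, the better.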
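As a worked example of the metric, the sketch below computes an F1 score over BIO tag sequences. It assumes entity-level matching (an entity counts as correct only if both its type and its boundaries match); the helper names get_entities and f1_score are hypothetical, and the project's own metric code may differ in detail.

```python
def get_entities(tags):
    """Collect (type, start, end) spans from a BIO tag sequence."""
    entities, start, current = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and current == tag[2:]
        if not inside:
            if current is not None:
                entities.append((current, start, i - 1))
            if tag.startswith("B-"):
                start, current = i, tag[2:]
            else:
                start, current = None, None
    return set(entities)


def f1_score(true_tags, pred_tags):
    """Entity-level F1: harmonic mean of precision and recall."""
    true_entities = get_entities(true_tags)
    pred_entities = get_entities(pred_tags)
    correct = len(true_entities & pred_entities)
    precision = correct / len(pred_entities) if pred_entities else 0.0
    recall = correct / len(true_entities) if true_entities else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = ["B-name", "I-name", "O", "B-organization", "I-organization"]
y_pred = ["B-name", "I-name", "O", "B-company", "I-company"]
print(f1_score(y_true, y_pred))  # 0.5: one of two entities is right
```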

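Model training

Training uses a patience_counter strategy: if the F1 score fails to improve sufficiently for config.patience_num consecutive evaluations and the minimum number of epochs has been reached, training stops. The implementation: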

def train(train_loader, dev_loader, model, optimizer, scheduler, model_dir):
    """train the model and test model performance"""

    if model_dir is not None and config.load_before:
        model = BertNER.from_pretrained(model_dir)
        model.to(config.device)
        logging.info("--------Load model from {}--------".format(model_dir))
    best_val_f1 = 0.0
    patience_counter = 0

    for epoch in range(1, config.epoch_num + 1):
        train_epoch(train_loader, model, optimizer, scheduler, epoch)
        val_metrics = evaluate(dev_loader, model, mode='dev')
        val_f1 = val_metrics['f1']
        logging.info("Epoch: {}, dev loss: {}, f1 score: {}".format(epoch, val_metrics['loss'], val_f1))
        improve_f1 = val_f1 - best_val_f1
        if improve_f1 > 1e-5:
            best_val_f1 = val_f1
            model.save_pretrained(model_dir)
            logging.info("--------Save best model!--------")
            if improve_f1 < config.patience:
                patience_counter += 1
            else:
                patience_counter = 0
        else:
            patience_counter += 1

        if (patience_counter >= config.patience_num and epoch > config.min_epoch_num) or epoch == config.epoch_num:
            logging.info("Best val f1: {}".format(best_val_f1))
            break
    logging.info("Training Finished!")
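The stopping rule in train() can be replayed on a plain list of dev F1 scores. The sketch below mirrors the counter logic above; the function name run_early_stopping and the default thresholds are hypothetical stand-ins for the values in config.

```python
def run_early_stopping(f1_per_epoch, patience=0.0002, patience_num=3, min_epoch_num=2):
    """Replay the stopping rule of train() on a list of dev F1 scores.

    Returns (epoch at which training stops, best F1 seen).
    """
    best_f1, patience_counter = 0.0, 0
    epoch_num = len(f1_per_epoch)
    for epoch, val_f1 in enumerate(f1_per_epoch, start=1):
        improve_f1 = val_f1 - best_f1
        if improve_f1 > 1e-5:
            best_f1 = val_f1              # new best: the model would be saved here
            if improve_f1 < patience:
                patience_counter += 1     # improved, but by too little to reset the counter
            else:
                patience_counter = 0
        else:
            patience_counter += 1
        if (patience_counter >= patience_num and epoch > min_epoch_num) or epoch == epoch_num:
            return epoch, best_f1


stop_epoch, best = run_early_stopping([0.60, 0.70, 0.75, 0.75, 0.7501, 0.74, 0.78])
print(stop_epoch, best)  # stops at epoch 6, before the 0.78 epoch is reached
```

The example also shows the cost of the rule: a late improvement (0.78 here) is never seen once the counter has run out.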

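Parameter updates and learning-rate decay

The strategy combines separate learning rates per module, the AdamW optimizer for the parameters, and a dynamically adjusted learning rate.

First the terms that should not receive weight decay are listed; optimizer_grouped_parameters must then contain every parameter. Note the differences in how the groups are written: the CRF layer's parameters (like those of the LSTM and the classifier) use a higher learning rate, and the CRF group is written differently, passing parameters() directly. See the code below: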

    if config.full_fine_tuning:

        bert_optimizer = list(model.bert.named_parameters())
        lstm_optimizer = list(model.bilstm.named_parameters())
        classifier_optimizer = list(model.classifier.named_parameters())
        no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
        optimizer_grouped_parameters = [
            {'params': [p for n, p in bert_optimizer if not any(nd in n for nd in no_decay)],
             'weight_decay': config.weight_decay},
            {'params': [p for n, p in bert_optimizer if any(nd in n for nd in no_decay)],
             'weight_decay': 0.0},
            {'params': [p for n, p in lstm_optimizer if not any(nd in n for nd in no_decay)],
             'lr': config.learning_rate * 5, 'weight_decay': config.weight_decay},
            {'params': [p for n, p in lstm_optimizer if any(nd in n for nd in no_decay)],
             'lr': config.learning_rate * 5, 'weight_decay': 0.0},
            {'params': [p for n, p in classifier_optimizer if not any(nd in n for nd in no_decay)],
             'lr': config.learning_rate * 5, 'weight_decay': config.weight_decay},
            {'params': [p for n, p in classifier_optimizer if any(nd in n for nd in no_decay)],
             'lr': config.learning_rate * 5, 'weight_decay': 0.0},
            {'params': model.crf.parameters(), 'lr': config.learning_rate * 5}
        ]

    else:
        param_optimizer = list(model.classifier.named_parameters())
        optimizer_grouped_parameters = [{'params': [p for n, p in param_optimizer]}]
    optimizer = AdamW(optimizer_grouped_parameters, lr=config.learning_rate, correct_bias=False)
    train_steps_per_epoch = train_size // config.batch_size
    scheduler = get_cosine_schedule_with_warmup(optimizer,
                                                num_warmup_steps=(config.epoch_num // 10) * train_steps_per_epoch,
                                                num_training_steps=config.epoch_num * train_steps_per_epoch)

    logging.info("--------Start Training!--------")
    train(train_loader, dev_loader, model, optimizer, scheduler, config.model_dir)
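The name-based split into decay and no-decay groups used in the full fine-tuning branch can be tried on toy data. In this sketch the helper group_parameters, the parameter names, and the learning-rate value are hypothetical, and plain strings stand in for the tensors.

```python
def group_parameters(named_params, no_decay, weight_decay, lr=None):
    """Split (name, param) pairs into a decay group and a no-decay group."""
    decay = [p for n, p in named_params if not any(nd in n for nd in no_decay)]
    skip = [p for n, p in named_params if any(nd in n for nd in no_decay)]
    groups = [{"params": decay, "weight_decay": weight_decay},
              {"params": skip, "weight_decay": 0.0}]
    if lr is not None:
        for group in groups:
            group["lr"] = lr  # a per-group learning rate overrides the optimizer default
    return groups


# hypothetical parameter names; strings stand in for the tensors
toy_params = [("encoder.layer.0.attention.output.dense.weight", "W0"),
              ("encoder.layer.0.attention.output.dense.bias", "b0"),
              ("encoder.layer.0.attention.output.LayerNorm.weight", "g0")]
no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]

groups = group_parameters(toy_params, no_decay, weight_decay=0.01, lr=5 * 3e-5)
for group in groups:
    print(group)
```

Biases and LayerNorm parameters land in the second group with weight_decay 0.0, while the dense weight keeps the configured decay.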

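The logic of the non-fine-tuning branch (the else branch) in the source code is problematic. An issue has been filed on the original GitHub repository, with no response so far (this branch is not used here).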
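The shape of a cosine schedule with warmup can be written out in a few lines. This sketch assumes the usual form (linear warmup followed by half a cosine wave down to zero) and is meant as an illustration of the learning-rate multiplier, not as the library's code; the step counts are made up.

```python
import math


def cosine_with_warmup(step, num_warmup_steps, num_training_steps):
    """Learning-rate multiplier: linear warmup, then half a cosine wave down to 0."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))


warmup, total = 10, 100  # hypothetical step counts
for step in (0, 5, 10, 55, 100):
    print(step, round(cosine_with_warmup(step, warmup, total), 4))
```

The multiplier rises from 0 to 1 during warmup, then decays smoothly; the actual learning rate of each parameter group is its base rate times this multiplier.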

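Result analysis

The final F1 score is 0.79.

The F1 scores for book, company, game, government, and name are all above 0.8, which is a good result.

Original results: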

Model          BiLSTM+CRF   Roberta+Softmax   Roberta+CRF   Roberta+BiLSTM+CRF
address        47.37        57.50             64.11         63.15
book           65.71        75.32             80.94         81.45
company        71.06        76.71             80.10         80.62
game           76.28        82.90             83.74         85.57
government     71.29        79.02             83.14         81.31
movie          67.53        83.23             83.11         85.61
name           71.49        88.12             87.44         88.22
organization   73.29        74.30             80.32         80.53
position       72.33        77.39             78.95         78.82
scene          51.16        62.56             71.36         72.86
overall        67.47        75.90             79.34         79.64

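This experiment uses the pre-trained BERT model. Compared with RoBERTa as the pre-trained model, it is slightly worse in each category, but the final result is close to that of the original experiment.

Bad-case analysis on the test set

枪手 ("Gunners"): the system wrongly tagged it as organization.

教委 (education commission) was wrongly tagged as government.

彩票监管部门 (lottery regulator) was tagged as government but is actually organization.

中国木业中心 (China Wood Industry Center) was tagged as company but is actually organization.

Thanks to the constraints imposed by the CRF, there are no obvious structural errors such as a B-person tag followed by I-name. Most of the remaining errors are semantic, and even a human might not tell these cases apart, which shows how strong the model is.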

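References

This post is a further analysis of the parts the original write-up does not cover, suitable for beginners to use out of the box.

Source code: link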

Original: https://blog.csdn.net/qq_48034566/article/details/123794375
Author: Remember00000
Title: Bert+LSTM+CRF命名实体识别pytorch代码详解 (BERT+LSTM+CRF Named Entity Recognition: PyTorch Code Walkthrough)

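Original articles are protected by copyright. When reposting, please credit the source: https://www.johngo689.com/527289/

Reposted articles are protected by the original author's copyright. When reposting, please credit the original author.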
