NLP in Practice: Named Entity Recognition for Chinese Electronic Medical Records

May 27, 2023, 7:27 PM • Artificial Intelligence

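1. Introduction

This article walks through a hands-on project on Chinese named entity recognition (NER) in NLP. The project combines the large pre-trained language model BERT with a BiLSTM network to perform NER, and the article covers the concept of NER, dataset preprocessing, and the design and implementation of the model. Without further ado, let's get to it.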

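2. Named Entity Recognition Basics

2.1 What Is Named Entity Recognition?

Named entity recognition aims to extract named entities (spans of text that carry a specific meaning) from unstructured text, such as names of people, places, and organizations. NER underpins many NLP applications, including knowledge graphs, information retrieval, and text understanding.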

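More concretely, NER identifies both the boundaries and the type of each entity in a text, where the set of entity types must be defined in advance. Formally, given a text sequence $s = \langle w_1, w_2, \ldots, w_N \rangle$, NER outputs a set of tuples $\langle I_s, I_e, t \rangle$, each of which is one named entity mention in $s$. In each tuple, $I_s$ and $I_e$ are the start and end indices of the entity (its positions in the text), and $t$ is the entity type, drawn from the predefined type set. In the accompanying example, three named entities are extracted from a single sentence: Michael Jeffrey Jordan, Brooklyn, and New York, with types Person, Location, and Location respectively.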
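To make the tuple notation concrete, here is a minimal sketch of how the three mentions above could be represented. The example sentence, the whitespace tokenization, and the 0-based inclusive indexing are assumptions made for illustration; the helper `mention_text` is hypothetical.

```python
# Toy sentence containing the three entities, split into tokens (an assumption).
tokens = ["Michael", "Jeffrey", "Jordan", "was", "born", "in",
          "Brooklyn", ",", "New", "York", "."]

# Each mention is a tuple (I_s, I_e, t); indices are 0-based and inclusive here.
mentions = [(0, 2, "Person"), (6, 6, "Location"), (8, 9, "Location")]


def mention_text(tokens, mention):
    """Return the surface string covered by one (I_s, I_e, t) tuple."""
    start, end, _ = mention
    return " ".join(tokens[start:end + 1])


for m in mentions:
    print(f"{mention_text(tokens, m)} -> {m[2]}")
```

An NER system is then judged on whether it recovers exactly these boundaries and types.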

2.2 Evaluation Metrics

NER performance can be evaluated at two levels:

* a single entity category
* all entity categories

For a single entity category, precision, recall, and F1-score are typically used.

Across all entity categories, two sets of metrics are commonly defined:

Metric set 1: macro-precision (macro-P), macro-recall (macro-R), and macro-F1
Metric set 2: micro-precision (micro-P), micro-recall (micro-R), and micro-F1

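Macro-F1 averages the per-category scores with equal weight, so it is strongly affected by rare categories, whereas micro-F1 pools the counts over all entities and is therefore dominated by the large categories.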
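The difference between the two averages is easiest to see on a small example. The sketch below scores predicted (start, end, type) tuples against gold ones under exact matching and computes macro-F1 as the unweighted mean of the per-category F1 values. The function names and the toy data are made up for illustration, and this is not the implementation used by any particular library.

```python
from collections import Counter


def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def evaluate(gold, pred):
    """Score predicted (start, end, type) tuples against gold ones.

    A prediction is correct only if start, end and type all match.
    """
    gold, pred = set(gold), set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for m in pred:
        (tp if m in gold else fp)[m[2]] += 1
    for m in gold - pred:
        fn[m[2]] += 1
    types = sorted({m[2] for m in gold | pred})
    per_type = {t: prf(tp[t], fp[t], fn[t]) for t in types}
    # Macro: unweighted mean of the per-category scores.
    n = len(types) or 1
    macro = tuple(sum(s[i] for s in per_type.values()) / n for i in range(3))
    # Micro: pool the counts over all categories, then score once.
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return per_type, macro, micro


# Four frequent entities predicted correctly, one rare entity with a wrong boundary.
gold = [(0, 2, "Medicine"), (5, 7, "Medicine"), (10, 12, "Medicine"),
        (15, 17, "Medicine"), (20, 25, "Surgery")]
pred = [(0, 2, "Medicine"), (5, 7, "Medicine"), (10, 12, "Medicine"),
        (15, 17, "Medicine"), (20, 24, "Surgery")]

per_type, macro, micro = evaluate(gold, pred)
print(f"macro-F1 = {macro[2]:.2f}, micro-F1 = {micro[2]:.2f}")
```

On this toy data the single missed Surgery entity pulls macro-F1 down to 0.50, while micro-F1 stays at 0.80 because the four correct Medicine entities dominate the pooled counts.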

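For brevity, the exact-match and relaxed-match evaluation schemes are not covered here; interested readers can look them up.

To score predictions against the gold labels, the seqeval library can compute these metrics. Install it with: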

pip install seqeval

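3. Dataset Preprocessing

3.1 About the Dataset

The dataset for this project comes from the CCKS2019 evaluation task on medical entity recognition and attribute extraction for Chinese electronic medical records (the original post links to the official dataset release). The task contains two subtasks: medical named entity recognition, and medical entity and attribute extraction. This article covers only medical named entity recognition, i.e. subtask1, whose dataset has two parts:

Training set: subtask1_training_part1.txt and subtask1_training_part2.txt;
Test set: subtask1_test_set_with_answer.json.

In each file, every line is one JSON object. Each object contains the original Chinese text (under the key originalText) and the corresponding entity set (under the key entities). Here is one entity from an entity set: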

{"label_type": "药物", "overlap": 0, "start_pos": 470, "end_pos": 473}

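where

label_type is the entity category;
start_pos is the start index of the entity in originalText;
end_pos is the end index of the entity (treated as exclusive, consistent with the half-open interval used in Section 3.2.2).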
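As a quick illustration of this record layout, the sketch below parses one line and slices out the entity text. The record itself is invented for the example (it is not taken from the dataset), and treating end_pos as exclusive is an assumption based on how the indices are used later in the article.

```python
import json

# An invented record in the dataset's one-JSON-object-per-line format.
# The text roughly means "The patient improved after taking aspirin orally."
line = ('{"originalText": "患者口服阿司匹林后好转。", '
        '"entities": [{"label_type": "药物", "overlap": 0, '
        '"start_pos": 4, "end_pos": 8}]}')

record = json.loads(line)
text = record["originalText"]

spans = []
for entity in record["entities"]:
    # end_pos is treated as exclusive, so a plain Python slice recovers the entity.
    surface = text[entity["start_pos"]:entity["end_pos"]]
    spans.append((entity["label_type"], surface))

print(spans)
```

Printing a few such slices for real records is a cheap way to confirm the index convention before converting the whole dataset.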

The entity categories in the dataset are listed below; a dictionary maps each Chinese category name to an English one:

entity_types = {
    '疾病和诊断': 'Disease&Dianonsis',
    '影像检查': 'Check',
    '实验室检验': 'Inspection',
    '手术': 'Surgery',
    '药物': 'Medicine',
    '解剖部位': 'AnatomicalSite'
}

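3.2 Preprocessing

3.2.1 Re-splitting the Dataset

The released data contains only a training set and a test set. Following standard machine learning practice, it is best to also have a validation set (called a development set in NLP). I therefore keep the test set fixed and split a validation set off the training set, using 30% of the training samples (JSON objects) for validation.

3.2.2 Converting to BIO Sequences

After the split, the raw data is converted into BIO sequences. Put simply, every character in the text is tagged according to entity type. In NER, each entity type yields two tags: for an entity type Ntype, the first character of an entity is tagged B-Ntype and the remaining characters are tagged I-Ntype, while every character outside any entity is tagged O. Accordingly, this dataset uses the following tags: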

LABEL = ("O", "B-Disease&Dianonsis", "I-Disease&Dianonsis", "B-Check",
         "I-Check", "B-Inspection", "I-Inspection", "B-Surgery", "I-Surgery",
         "B-Medicine", "I-Medicine", "B-AnatomicalSite", "I-AnatomicalSite")

The conversion works as follows. For each sample, first create a tag sequence with the same length as the text, filled with 'O' tags. Then iterate over the sample's entity set, read each entity's start index start_pos and end index end_pos, set the tag at the start index to B-<entity type>, and set the tags in the range [start_pos + 1, end_pos) to I-<entity type>.
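The steps above can be traced on a tiny example. This is a minimal sketch with an invented six-character text; the function name `bio_tags` and the (start, end, type) tuple layout are choices made for the illustration.

```python
def bio_tags(text, entities):
    """Tag every character of text; entities are (start, end_exclusive, type)."""
    tags = ["O"] * len(text)
    for start, end, etype in entities:
        tags[start] = "B-" + etype           # first character of the entity
        for i in range(start + 1, end):      # remaining characters
            tags[i] = "I-" + etype
    return tags


text = "口服阿司匹林"  # "oral aspirin", an invented example
tags = bio_tags(text, [(2, 6, "Medicine")])
for ch, tag in zip(text, tags):
    print(ch, tag)
```

The two leading characters stay O, the first character of the drug name becomes B-Medicine, and the other three become I-Medicine.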

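While converting to BIO sequences, I also counted the entities of each category in the training, validation, and test sets and plotted the result.

[Figure: entity counts per category for the training, validation, and test sets. Horizontal axis: 0 = disease and diagnosis, 1 = imaging examination, 2 = laboratory test, 3 = surgery, 4 = medication, 5 = anatomical site]

3.2.3 Preprocessing Source Code

The source code for the preprocessing steps above is as follows: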

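import json
import os

# Map the Chinese entity categories in the dataset to English names.
# The spelling 'Disease&Dianonsis' is kept because the label set below depends on it.
entity_types = {
    '疾病和诊断': 'Disease&Dianonsis',
    '影像检查': 'Check',
    '实验室检验': 'Inspection',
    '手术': 'Surgery',
    '药物': 'Medicine',
    '解剖部位': 'AnatomicalSite'
}

def load_org_dataset(data_path):
    """Load the raw dataset: one JSON object per line."""
    datasets = []
    # utf-8-sig strips a leading BOM if the file has one
    with open(data_path, "r", encoding="utf-8-sig") as fp:
        for sample in fp:
            if not sample.strip():
                continue  # skip blank lines
            datasets.append(json.loads(sample))
    return datasets

def to_bio_sequence(datasets, save_path):
    """Convert the samples to BIO sequences and write them to save_path."""
    entity_nums = {k: 0 for k in entity_types}
    bios = []
    for data in datasets:
        cur_text = data.get('originalText')
        entities = data.get('entities')
        cur_label = ['O'] * len(cur_text)
        for entity in entities:
            # Entity category (Chinese) and its English name
            label_type = entity.get('label_type')
            entity_nums[label_type] += 1
            entity_type = entity_types.get(label_type)
            # Entity span in the text; end_pos is exclusive
            start_pos = entity.get('start_pos')
            end_pos = entity.get('end_pos')
            # The first character gets B-, the remaining ones get I-
            cur_label[start_pos] = 'B-' + entity_type
            for idx in range(start_pos + 1, end_pos):
                cur_label[idx] = 'I-' + entity_type

        for c, l in zip(cur_text, cur_label):
            # Skip whitespace (the original skipped only " "; tabs or newlines
            # would break the tab-separated output format)
            if c.isspace():
                continue
            bios.append(f"{c}\t{l}\n")
            # A blank line after sentence-ending punctuation separates sentences
            if c in ["。", "；"]:
                bios.append('\n')
        # Separate records so the last sentence of one does not merge with the next
        bios.append('\n')

    print(entity_nums)
    with open(save_path, "w", encoding="utf-8") as fp:
        fp.writelines(bios)

if __name__ == "__main__":
    # Raw data files
    train_org_path = "yidu-s4k/subtask1_training_part1.txt"
    train_org_path1 = "yidu-s4k/subtask1_training_part2.txt"
    test_org_path = "yidu-s4k/subtask1_test_set_with_answer.json"
    train_dataset = load_org_dataset(train_org_path)
    train_dataset.extend(load_org_dataset(train_org_path1))
    test_dataset = load_org_dataset(test_org_path)

    # The first 70% of the training samples stay in training, the rest form the dev set
    train_size = int(len(train_dataset) * 0.7)

    # Make sure the output directory exists before writing
    os.makedirs("processed", exist_ok=True)
    train_path = "processed/train.txt"
    dev_path = "processed/dev.txt"
    test_path = "processed/test.txt"
    to_bio_sequence(train_dataset[:train_size], train_path)
    to_bio_sequence(train_dataset[train_size:], dev_path)
    to_bio_sequence(test_dataset, test_path)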

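4. Model Implementation

4.1 Custom Dataset Class

In PyTorch, you can define your own dataset by subclassing torch.utils.data.Dataset and then load it in batches with the DataLoader class; the custom dataset for the BIO sequences is shown below. Unlike my previous articles, this experiment does not use word embeddings such as Word2Vec. Instead it uses the bert-base-chinese-ws model hosted on Hugging Face (published by ckiplab). Since BERT appeared, large pre-trained language models have topped the leaderboards on many NLP tasks. For brevity I won't go into detail here; interested readers can study it on their own. It is currently one of the most popular ways of obtaining word vectors in NLP.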

import torch
import torch.utils.data as data
from torch.utils.data import DataLoader
from transformers import BertTokenizerFast

LABEL = ("O", "B-Disease&Dianonsis", "I-Disease&Dianonsis", "B-Check",
         "I-Check", "B-Inspection", "I-Inspection", "B-Surgery", "I-Surgery",
         "B-Medicine", "I-Medicine", "B-AnatomicalSite", "I-AnatomicalSite")

# tag -> index
tag2idx = {tag: idx for idx, tag in enumerate(LABEL)}
# index -> tag
idx2tag = {idx: tag for idx, tag in enumerate(LABEL)}

class YiduS4K(data.Dataset):
    def __init__(self, data_path, pretrained_path, maxlen=256):
        super().__init__()
        self.maxlen = maxlen
        self.buildingDataset(data_path)
        self.tokenizer = BertTokenizerFast.from_pretrained(pretrained_path)
        # Ids of the special tokens placed around every sentence
        self.cls_id = self.tokenizer.convert_tokens_to_ids("[CLS]")
        self.sep_id = self.tokenizer.convert_tokens_to_ids("[SEP]")

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        words, tags = self.dataset[index]
        x, y = [], []
        for w, t in zip(words, tags):
            # Encode one character at a time. This assumes every character maps
            # to exactly one token id; otherwise x and y would get out of step.
            xx = self.tokenizer.encode(w, add_special_tokens=False)
            yy = tag2idx[t]
            x.extend(xx)
            y.append(yy)
        # Truncate to maxlen and add [CLS]/[SEP]; both get the "O" label (index 0)
        x = [self.cls_id] + x[:self.maxlen - 2] + [self.sep_id]
        y = [0] + y[:self.maxlen - 2] + [0]

        return x, y

    def buildingDataset(self, data_path):
        """Build the dataset: one [characters, tags] pair per sentence."""
        self.dataset = []
        with open(data_path, "r", encoding="utf-8") as fp:
            content = fp.read()
        # Sentences are separated by blank lines
        sentences = content.split("\n\n")
        for sentence in sentences:
            cur_text, cur_label = [], []
            for pair in sentence.split("\n"):
                if not pair:
                    continue
                c, l = pair.split("\t")
                cur_text.append(c)
                cur_label.append(l)
            if not cur_text:
                continue
            self.dataset.append([cur_text, cur_label])
        # Sort by length, which keeps padding small when batches are taken in order
        self.dataset.sort(key=lambda x: len(x[0]))

def padding(batch):
    """Pad every sample to the length of the longest sample in the batch."""
    maxlen = max(len(sample[0]) for sample in batch)
    x, y = [], []
    for sample in batch:
        # 0 pads both inputs and labels; for labels, 0 is the "O" tag
        x.append(sample[0] + [0] * (maxlen - len(sample[0])))
        y.append(sample[1] + [0] * (maxlen - len(sample[1])))
    return torch.tensor(x, dtype=torch.long), torch.tensor(y, dtype=torch.long)

if __name__ == "__main__":
    pass
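To see what the padding step does to a batch, here is a dependency-free sketch that mirrors it on plain Python lists. The token ids and labels are made up, and the extra mask output is not part of the article's code; it is included only to show how padded positions could be told apart from real tokens, since padded labels are otherwise indistinguishable from genuine "O" tags.

```python
def pad_batch(batch, pad_id=0):
    """Right-pad every (x, y) pair in the batch to the length of the longest x."""
    maxlen = max(len(x) for x, _ in batch)
    xs = [x + [pad_id] * (maxlen - len(x)) for x, _ in batch]
    ys = [y + [pad_id] * (maxlen - len(y)) for _, y in batch]
    # 1 marks a real token, 0 marks padding (an addition for illustration)
    mask = [[1] * len(x) + [0] * (maxlen - len(x)) for x, _ in batch]
    return xs, ys, mask


# Two toy samples: (token ids, label ids). Label 9 is "B-Medicine" in LABEL.
batch = [([101, 1366, 3302, 102], [0, 0, 0, 0]),
         ([101, 7350, 102], [0, 9, 0])]

xs, ys, mask = pad_batch(batch)
print(xs)
print(ys)
print(mask)
```

In the article's setup, the tensor version of this function would presumably be handed to DataLoader through its collate_fn argument, although the original post does not show that call.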

Note: bert-base-chinese-ws can be downloaded with the following commands:

git lfs install
git clone https://huggingface.co/ckiplab/bert-base-chinese-ws

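4.2 Model Design

The model in this experiment is Bert-BiLSTM. The input first passes through a BERT layer to obtain word vectors, a BiLSTM then extracts bidirectional semantic information, and finally each character is classified on that basis. The model diagram in the original post (not reproduced here) was adapted from the model in my undergraduate thesis, which was slightly more complex than this one.

The model is implemented as follows: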

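import torch
import torch.nn as nn
from transformers import AutoModel

class BertBilstm(nn.Module):
    def __init__(self,
                 output_size,
                 embed_size,
                 num_layers,
                 hidden_size,
                 drop_prob,
                 pretrained_path="bert-base-chinese-ws"):
        super().__init__()
        self.output_size = output_size
        # Pre-trained BERT encoder that produces contextual character vectors
        self.bert = AutoModel.from_pretrained(pretrained_path)
        # BiLSTM over the BERT outputs. nn.LSTM applies dropout only between
        # stacked layers, so it is switched off when num_layers == 1.
        self.bilstm = nn.LSTM(bidirectional=True,
                              num_layers=num_layers,
                              input_size=embed_size,
                              hidden_size=hidden_size,
                              batch_first=True,
                              dropout=drop_prob if num_layers > 1 else 0.0)
        # Per-character classifier over the concatenated forward/backward states
        self.fc = nn.Linear(2 * hidden_size, output_size)

    def forward(self, x):
        # x: (batch, seq_len) token ids; no attention mask is passed, so padded
        # positions are encoded like ordinary tokens
        # BERT acts as a frozen feature extractor: no gradients flow into it
        with torch.no_grad():
            embed_x = self.bert(x)[0]        # (batch, seq_len, embed_size)
        lstmout, _ = self.bilstm(embed_x)    # (batch, seq_len, 2 * hidden_size)
        return self.fc(lstmout)              # (batch, seq_len, output_size)

if __name__ == "__main__":
    pass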
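As a sanity check on the architecture, the arithmetic below traces the tensor shapes through the three layers and counts the parameters that actually receive gradients (BERT is called under torch.no_grad(), so only the BiLSTM and the linear layer are updated). It is a back-of-the-envelope sketch using the sizes from Section 5.1 (768-dimensional BERT output, hidden size 256, one LSTM layer, 13 labels); the sequence length is an arbitrary illustrative value and the helper names are hypothetical.

```python
def lstm_param_count(input_size, hidden_size, bidirectional=True):
    """Parameter count of a single-layer nn.LSTM (PyTorch keeps two bias vectors)."""
    per_direction = 4 * (input_size * hidden_size      # input-to-hidden weights
                         + hidden_size * hidden_size   # hidden-to-hidden weights
                         + 2 * hidden_size)            # bias_ih and bias_hh
    return per_direction * (2 if bidirectional else 1)


def linear_param_count(in_features, out_features):
    """Weights plus bias of an nn.Linear layer."""
    return in_features * out_features + out_features


embed_size, hidden_size, num_labels = 768, 256, 13
batch, seq_len = 64, 128  # seq_len is illustrative only

shapes = {
    "input ids": (batch, seq_len),
    "BERT output": (batch, seq_len, embed_size),
    "BiLSTM output": (batch, seq_len, 2 * hidden_size),
    "logits": (batch, seq_len, num_labels),
}
for name, shape in shapes.items():
    print(f"{name}: {shape}")

trainable = (lstm_param_count(embed_size, hidden_size)
             + linear_param_count(2 * hidden_size, num_labels))
print(f"parameters updated during training: {trainable:,}")
```

With these sizes, roughly 2.1 million parameters are trained on top of the frozen encoder, which is small compared with BERT itself.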

5. Experiments and Results

5.1 Experimental Setup

The experiment environment was:

Operating system: Win10
Python version:
Pytorch version: 1.8
Main dependencies: seqeval-1.2.2, transformers-4.17.0

The hyperparameters are set as follows (no tuning was done, for lack of time):

params = {
    'pretrained_path': "bert-base-chinese-ws",
    "lr": 0.001,
    "batch_size": 64,
    "epochs": 10,
    "output_size": len(LABEL),
    "embed_size": 768,
    "hidden_size": 256,
    "num_layers": 1,
    "drop_prob": 0.5
}

Training updates the model parameters on the training set, selects the best model on the validation set, and finally evaluates that best model on the test set.
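The selection procedure can be sketched independently of any deep learning framework. In the sketch below, `train_one_epoch` and `evaluate_f1` are hypothetical callables standing in for the real training step and the validation-set evaluation, and the validation scores are made-up numbers, not results from this article.

```python
import copy


def fit_and_select(model_state, train_one_epoch, evaluate_f1, epochs):
    """Train for `epochs` epochs and keep the state with the best validation F1."""
    best_f1, best_state, best_epoch = -1.0, None, None
    state = model_state
    for epoch in range(1, epochs + 1):
        state = train_one_epoch(state)
        f1 = evaluate_f1(state)
        if f1 > best_f1:
            best_f1, best_epoch = f1, epoch
            # Snapshot the state so later epochs cannot modify it
            best_state = copy.deepcopy(state)
    return best_state, best_epoch, best_f1


# Stand-in "training": the state is just an epoch counter,
# and the validation F1 peaks at epoch 3.
dev_curve = {1: 0.61, 2: 0.70, 3: 0.74, 4: 0.73, 5: 0.72}
best_state, best_epoch, best_f1 = fit_and_select(
    0, lambda s: s + 1, lambda s: dev_curve[s], epochs=5)
print(best_epoch, best_f1)
```

Only the snapshot returned here would then be scored on the test set, so the test data never influences which epoch is chosen.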

5.2 Results

First, the training loss and the validation F1-score over the training epochs:

[Figure: training loss and validation F1-score per epoch]

Next, the test-set results for each category and for all categories combined:

[Figure: per-category and overall results on the test set]

From these results alone, it is clear that recognition accuracy varies considerably across entity categories.

6. Conclusion

Full project source code: linked from the original post
References:

A Survey on Deep Learning for Named Entity Recognition (basis for Sections 2.1 and 2.2)
ckiplab/bert-base-chinese-ws

That's all for this article. If you found it useful, feel free to like or follow; your support keeps me going. And if you spot any problems, corrections are welcome!

Original: https://blog.csdn.net/qq_42103091/article/details/124695544
Author: 斯曦巍峨
Title: NLP实战：面向中文电子病历的命名实体识别

