NLP: A HuggingFace Named Entity Recognition (NER) Example
HuggingFace on GitHub: https://github.com/huggingface/
NER annotation tool: https://github.com/doccano/doccano
Table of Contents
- NLP: A HuggingFace Named Entity Recognition (NER) Example
- Preface
- 1. Training Data
- 2. Reading the Data
- 3. Loading the BERT Tokenizer
- 4. Label Alignment
- 5. Building the Datasets
- 6. Loading a Pretrained BERT Model for Fine-Tuning
- 7. Custom Evaluation Metrics
- 8. Training
- 9. Demo
- Summary
Preface
HuggingFace is a very popular framework for natural language processing that covers a wide range of NLP tasks. This post applies it to named entity recognition (NER), a task that is itself a building block of many others. NER used to be tackled mostly with a bidirectional LSTM plus a CRF; here we fine-tune BERT directly.
1. Training Data
2 B-year
0 I-year
1 I-year
9 I-year
年 I-year
成 O
人 O
高 B-exam
考 I-exam
招 O
生 O
统 O
一 O
考 B-exam
试 I-exam
时 O
间 O
表 O
The data looks like the sample above. B-year marks the first character of a year entity and I-year marks each following character of it (in BIO tagging, I- means "inside" the entity, not just its last character); B-exam and I-exam do the same for exam names, and O marks everything else.
PS: there are many annotation tools; I use the open-source doccano, whose link is shared at the top of this post.
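For reference, here is a minimal sketch (not from the original post) of turning a doccano JSONL export into the two-column format above. It assumes each exported line looks like {"text": "...", "label": [[start, end, "year"], ...]}; the field names vary between doccano versions, so adjust accordingly.

import json

def doccano_to_bio(jsonl_path, out_path):
    # Assumed doccano JSONL export: {"text": ..., "label": [[start, end, tag], ...]}
    with open(jsonl_path, encoding='utf-8') as f_in, \
         open(out_path, 'w', encoding='utf-8') as f_out:
        for line in f_in:
            record = json.loads(line)
            text = record['text']
            tags = ['O'] * len(text)
            for start, end, tag in record['label']:
                tags[start] = f'B-{tag}'
                for i in range(start + 1, end):
                    tags[i] = f'I-{tag}'
            for char, bio in zip(text, tags):
                f_out.write(f'{char} {bio}\n')
            f_out.write('\n')  # blank line between documents, as read_data below expects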
2. Reading the Data
Code (example):
# Imports needed for this snippet (omitted in the original).
import re
from pathlib import Path

# A helper to read the two-column data; set data_dir to your own path.
data_dir = ''

def read_data(file_path):
    file_path = Path(file_path)
    raw_text = file_path.read_text(encoding='UTF-8').strip()
    # Documents are separated by blank lines.
    raw_docs = re.split(r'\n\t?\n', raw_text)
    token_docs = []
    tag_docs = []
    for doc in raw_docs:
        tokens = []
        tags = []
        for line in doc.split('\n'):
            token, tag = line.split(' ')
            tokens.append(token)
            tags.append(tag)
        token_docs.append(tokens)
        tag_docs.append(tags)
    return token_docs, tag_docs

train_texts, train_tags = read_data(data_dir + '/train.txt')
val_texts, val_tags = read_data(data_dir + '/val.txt')
test_texts, test_tags = read_data(data_dir + '/test.txt')

# unique_tags is the set of distinct labels; tag2id maps each label to an id
# and id2tag maps each id back to its label. Both are needed later.
unique_tags = set(tag for doc in train_tags for tag in doc)
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}
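One caveat worth flagging (my note, not in the original post): iterating over a Python set is not deterministic across interpreter runs, so tag2id can come out differently each time the script is executed. Sorting first gives a stable mapping, which matters when you later reload a saved checkpoint:

# Stable variant: sort the tags so the ids are reproducible across runs.
unique_tags = sorted(set(tag for doc in train_tags for tag in doc))
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
id2tag = {id: tag for tag, id in tag2id.items()}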
3. Loading the BERT Tokenizer
Code (example):
from transformers import BertTokenizerFast

# is_split_into_words=True tells the tokenizer the input is already split
# into tokens (here, one Chinese character per token).
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
train_encodings = tokenizer(train_texts, is_split_into_words=True, return_offsets_mapping=True,
                            padding=True, truncation=True, max_length=512)
val_encodings = tokenizer(val_texts, is_split_into_words=True, return_offsets_mapping=True,
                          padding=True, truncation=True, max_length=512)
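To see what return_offsets_mapping produces (a small illustration of my own), run the tokenizer on a short pre-split input: special tokens such as [CLS] and [SEP] get the offset (0, 0), while every real token gets its character span within its word. The label-alignment step below relies on exactly this distinction.

demo = tokenizer(['2', '0', '1', '9', '年'], is_split_into_words=True, return_offsets_mapping=True)
print(demo.tokens())           # e.g. ['[CLS]', '2', '0', '1', '9', '年', '[SEP]']
print(demo['offset_mapping'])  # e.g. [(0, 0), (0, 1), (0, 1), (0, 1), (0, 1), (0, 1), (0, 0)]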
4. Label Alignment
Because the tokenizer adds [CLS]/[SEP] and padding, the labels must be realigned with the tokens; the extra positions are filled with -100, which the loss function ignores. Code below:
import numpy as np

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # Start with every position set to -100.
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # Guard against documents longer than the 512-token limit
        # (512 minus [CLS] and [SEP] leaves 510 label slots).
        if len(doc_labels) >= 510:
            doc_labels = doc_labels[:510]
        # Assign labels to tokens whose first offset is 0 and second is not 0,
        # i.e. the first sub-token of each original word.
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels

train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)
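A quick way to convince yourself the alignment worked (my addition, not in the original post): the label at the [CLS] position should be -100 and real tokens should carry tag ids.

print(train_encodings.tokens(0)[:8])  # tokens of the first document
print(train_labels[0][:8])            # -100 for [CLS], tag ids for real tokens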
5. Building the Datasets
import torch

class NerDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

train_encodings.pop("offset_mapping")  # not needed for training
val_encodings.pop("offset_mapping")
train_dataset = NerDataset(train_encodings, train_labels)
val_dataset = NerDataset(val_encodings, val_labels)
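As a sanity check (my addition), a single item should contain input_ids, token_type_ids, attention_mask and labels, all of the same length:

sample = train_dataset[0]
print({key: value.shape for key, value in sample.items()})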
6. Loading a Pretrained BERT Model for Fine-Tuning
The pretrained weights are downloaded automatically the first time this runs.
Set num_labels to match your own task; here it is 5 (B-year, I-year, B-exam, I-exam and O).
from transformers import AutoModelForTokenClassification, TrainingArguments, Trainer

model = AutoModelForTokenClassification.from_pretrained(
    'ckiplab/albert-base-chinese-ner',
    num_labels=5,
    ignore_mismatched_sizes=True,  # the checkpoint's NER head has a different label count
    id2label=id2tag,
    label2id=tag2id,
)
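Note that the tokenizer above is bert-base-chinese while this checkpoint is an ALBERT NER model, which is why ignore_mismatched_sizes is needed. If you prefer the backbone to match the tokenizer, this variant (my suggestion, not from the original post) also works; the token-classification head is freshly initialized either way:

model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-chinese',          # same checkpoint as the tokenizer
    num_labels=len(unique_tags),  # derive the label count instead of hard-coding 5
    id2label=id2tag,
    label2id=tag2id,
)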
7. Custom Evaluation Metrics
from datasets import load_metric

metric = load_metric("seqeval")
# label_list maps ids back to tag names (ids were assigned by enumerating unique_tags).
label_list = [id2tag[i] for i in range(len(unique_tags))]

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)
    # Drop the positions labelled -100 (special tokens and padding).
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
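To illustrate the input format seqeval expects (a small example of my own): it takes lists of tag-name sequences rather than ids, and returns the overall_* keys used above.

example_preds = [['B-year', 'I-year', 'O']]
example_refs = [['B-year', 'I-year', 'O']]
print(metric.compute(predictions=example_preds, references=example_refs))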
8. Training
checkpoint = 'bert-base-chinese'
num_train_epochs = 1000
per_device_train_batch_size = 8
per_device_eval_batch_size = 8

training_args = TrainingArguments(
    output_dir='./output',                  # output path
    num_train_epochs=num_train_epochs,      # number of training epochs
    per_device_train_batch_size=per_device_train_batch_size,  # batch size per GPU
    per_device_eval_batch_size=per_device_eval_batch_size,
    warmup_steps=500,                       # learning-rate warmup steps
    weight_decay=0.01,                      # weight-decay regularization
    logging_dir='./logs',
    logging_steps=10,
    save_strategy='steps',
    save_steps=1000,
    save_total_limit=1,
    evaluation_strategy='steps',
    eval_steps=1000,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
)
trainer.train()
trainer.evaluate()
model.save_pretrained("./checkpoint/model/%s-%sepoch" % (checkpoint, num_train_epochs))
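Once a checkpoint exists, a quick way to try the model (a sketch of mine, not from the original post) is the transformers pipeline; aggregation_strategy='simple' merges B-/I- pieces into entity spans, which is essentially what the manual demo in the next section does by hand.

from transformers import pipeline

ner = pipeline(
    'token-classification',
    model='./output/checkpoint-2000',  # pick whichever checkpoint the Trainer saved
    tokenizer=tokenizer,
    aggregation_strategy='simple',
)
print(ner('2009年高考在北京的报名费是2009元'))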
9. Demo
Here is a small demo with input_str = '2009年高考在北京的报名费是2009元'.
import torch
import numpy as np
from transformers import AutoModelForTokenClassification, BertTokenizerFast

def get_token(input):
    # Keep runs of Latin letters together; split everything else character by character.
    english = 'abcdefghijklmnopqrstuvwxyz'
    output = []
    buffer = ''
    for s in input:
        if s in english or s in english.upper():
            buffer += s
        else:
            if buffer:
                output.append(buffer)
            buffer = ''
            output.append(s)
    if buffer:
        output.append(buffer)
    return output
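For example (my addition), get_token keeps consecutive Latin letters together and splits everything else per character, so English words survive as single units:

print(get_token('abc2009年NLP'))  # ['abc', '2', '0', '0', '9', '年', 'NLP']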
model = AutoModelForTokenClassification.from_pretrained('./output/checkpoint-2000')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
if __name__ == '__main__':
    input_str = '2009年高考在北京的报名费是2009元'
    input_char = get_token(input_str)
    input_tensor = tokenizer(input_char, is_split_into_words=True, padding=True, truncation=True,
                             return_offsets_mapping=True, max_length=512, return_tensors="pt")
    input_tokens = input_tensor.tokens()
    offsets = input_tensor["offset_mapping"]
    # Special tokens ([CLS], [SEP], padding) have the offset (0, 0).
    ignore_mask = offsets[0, :, 1] == 0
    input_tensor.pop("offset_mapping")  # the model's forward() does not accept this key
    outputs = model(**input_tensor)
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
    predictions = outputs.logits.argmax(dim=-1)[0].tolist()
    print(predictions)
    results = []
    idx = 0
    while idx < len(predictions):
        if ignore_mask[idx]:
            idx += 1
            continue
        pred = predictions[idx]
        label = model.config.id2label[pred]
        if label != "O":
            # Strip the B- or I- prefix.
            label = label[2:]
            start = idx
            end = start + 1
            # Collect every following token predicted as I-<label>.
            all_scores = [probabilities[start][predictions[start]]]
            while (
                end < len(predictions)
                and model.config.id2label[predictions[end]] == f"I-{label}"
            ):
                all_scores.append(probabilities[end][predictions[end]])
                end += 1
                idx += 1
            # The entity score is the mean over its tokens.
            score = np.mean(all_scores).item()
            word = "".join(input_tokens[start:end])
            results.append(
                {
                    "entity_group": label,
                    "score": score,
                    "word": word,
                    "start": start,
                    "end": end,
                }
            )
        idx += 1
    for result in results:
        print(result)
Summary
HuggingFace makes tasks like this remarkably simple to set up. This post walked through a BERT-based NER example; future posts will cover GPT and other HuggingFace examples, so stay tuned if you are interested.
Original: https://blog.csdn.net/weixin_53280379/article/details/125355146
Author: 陈万君Allen