transformers in practice: training your own NER model with BERT


The principle of transferring a pretrained BERT to NER is as follows:

[Figure: architecture of BERT-based transfer learning for NER]

The official example is already well put together: just run run_ner.py. The notes below cover a few of the steps (data preprocessing, run arguments, and model usage) in more detail.

Dataset preprocessing

An example of the train_file format that run_ner.py requires, as in
https://github.com/huggingface/transformers/blob/master/tests/fixtures/tests_samples/conll/sample.json

{"words": ["痛", "1", "天", "。"], "ner": ["B-SIGNS", "O", "O", "O"]}
{"words": ["痛", "5", "天", "。"], "ner": ["B-SIGNS", "O", "O", "O"]}

Dataset 1: CCKS 2020, "Medical entity and event extraction from Chinese electronic medical records (1): medical named entity recognition". The raw dataset format is:

{
  "train": [
    [
      [
        "女",
        "O"
      ],
      [
        "性",
        "O"
      ],
      [
        ",",
        "O"
      ],
      [
        "8",
        "O"
      ],
      [
        "8",
        "O"
      ]
    ]
  ]
}

Dataset 2: converting the CBLUE dataset from the Tianchi competition, i.e. the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark for Chinese medical NLP. The raw format is:


    "text": "患者缘于1小时前不慎伤及腰部,伤后疼痛,腰部活动受限,未"
    "entities": [
      {
        "start_idx": 12,
        "end_idx": 13,
        "type": "BODY",
        "entity": "腰部"
      },

These two formats are handled by the two conversion functions below, respectively.

import json

def convert_array_to_conll(fname, out_file=None):
    """Convert the CCKS nested-array format (each sentence is a list of
    [char, tag] pairs) into one {"words": ..., "ner": ...} object per line."""
    with open(fname, 'rt') as f:
        c = json.load(f)

    for k, v in c.items():
        out = []
        for sentence in v:
            words = [x[0] for x in sentence]
            ner = [x[1] for x in sentence]
            out.append({"words": words, "ner": ner})
        # One output file per top-level key (e.g. "train"),
        # unless an explicit out_file was given.
        target = out_file or fname.replace('.json', '_' + k + '_t.json')
        with open(target, 'a') as f:
            for x in out:
                f.write(json.dumps(x, ensure_ascii=False) + '\n')

def convert_simple_to_conll(fname, out_file=None):
    """Convert the span-annotated format to the CoNLL-style format, e.g.

    "text": "患者缘于1小时前不慎伤及腰部,伤后疼痛,腰部活动受限,未"
    "entities": [
      {
        "start_idx": 12,
        "end_idx": 13,
        "type": "BODY",
        "entity": "腰部"
      },

    becomes:

    {"words": ["痛", "1", "天", "。"], "ner": ["B-SIGNS", "O", "O", "O"]}
    {"words": ["痛", "5", "天", "。"], "ner": ["B-SIGNS", "O", "O", "O"]}
    """
    with open(fname, 'rt') as f:
        c = json.load(f)
    out = []
    for x in c:
        text = list(x['text'])
        ner = ['O'] * len(text)  # default: outside any entity
        for e in x["entities"]:
            _type = e['type'].upper()
            # end_idx is inclusive: tag the first char B-, the rest I-.
            ner[e['start_idx']] = 'B-' + _type
            ner[e['start_idx'] + 1:e['end_idx'] + 1] = \
                ['I-' + _type] * (e['end_idx'] - e['start_idx'])
        out.append({"words": text, "ner": ner})
    if out_file is None:
        out_file = fname.replace('.json', '_t.json')
    with open(out_file, 'a') as f:
        for x in out:
            f.write(json.dumps(x, ensure_ascii=False) + '\n')
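
A minimal usage sketch (the input file names are assumptions, chosen to match the output file names used in the training commands below):

convert_array_to_conll('train_data.json')    # CCKS format -> train_data_train_t.json
convert_simple_to_conll('CMeEE_train.json')  # CBLUE format -> CMeEE_train_t.json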

Training

The dataset produced by the preprocessing step can be used for training directly, e.g.:

python run_ner.py \
  --model_name_or_path bert-base-chinese \
  --train_file train_data_train_t.json \
  --validation_file train_data_test_t.json \
  --output_dir ./tmp/ner1 \
  --do_train \
  --do_eval

The default number of epochs is 3. The model is written to the directory given by --output_dir (./tmp/ner1 here) and can then be loaded directly with from_pretrained.

A no-Trainer variant is also supported:

export TASK_NAME=ner
python run_ner_no_trainer.py \
  --model_name_or_path bert-base-chinese \
  --train_file train_data_train_t.json \
  --validation_file train_data_test_t.json \
  --task_name $TASK_NAME \
  --per_device_train_batch_size 8 \
  --learning_rate 2e-5 \
  --num_train_epochs 3 \
  --output_dir ./tmp/$TASK_NAME/

However, this variant only writes two files to the output directory, config.json and pytorch_model.bin. The model can still be loaded with from_pretrained, but the tokenizer files have to be filled in by hand.
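
One way to fill in the missing files is to save the matching tokenizer into the checkpoint directory; a minimal sketch, assuming the output path from the command above:

from transformers import AutoTokenizer

# Save the base model's tokenizer next to the no-Trainer checkpoint so that
# from_pretrained finds config, weights and tokenizer in one place.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
tokenizer.save_pretrained("./tmp/ner/")  # writes vocab.txt, tokenizer_config.json, ...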

More customized training arguments can also be supplied:

python run_ner.py \
  --model_name_or_path bert-base-chinese \
  --train_file CMeEE_dev_t.json \
  --validation_file CMeEE_train_t.json \
  --output_dir ./tmp/CMeEE \
  --save_steps 2000 \
  --logging_steps 200  \
  --eval_steps 500  \
  --evaluation_strategy steps  \
  --num_train_epochs 21  \
  --do_train \
  --do_eval

The run proceeds as follows:

  • The training/evaluation parameters are parsed and logged as TrainingArguments
  • datasets.builder downloads/generates the JSON data and caches it under /root/.cache/huggingface/datasets/
  • configuration_utils.py loads the model's config.json and adjusts it to the dataset's label set (it is unclear why this log line is written twice)
  • tokenization_utils_base.py loads vocab.txt, tokenizer.json, added_tokens.json (may be absent), special_tokens_map.json (may be absent), and tokenizer_config.json
  • modeling_utils.py loads pytorch_model.bin
  • The processed dataset is cached
  • trainer.py starts training

Loading and using the model

First set the environment variable USE_TORCH to indicate the PyTorch backend (USE_TF selects TensorFlow):

import os
os.environ['USE_TORCH'] = "1"  # must be set before transformers is imported
import torch

Load the trained model:

from transformers import AutoModelForTokenClassification, AutoTokenizer

pretrained_model = './transformers/examples/pytorch/token-classification/tmp/CMeEE/'
tokenizer = AutoTokenizer.from_pretrained(pretrained_model)
model = AutoModelForTokenClassification.from_pretrained(pretrained_model)
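
As a side note, the same tokenizer and model can also be wrapped in the token-classification pipeline, which handles tokenization and label mapping internally; a sketch, not part of the original workflow:

from transformers import pipeline

ner = pipeline("ner", model=model, tokenizer=tokenizer)  # "ner" is the token-classification task
print(ner("遇有发热、咽峡炎和淋巴结肿大三联症"))

The manual loop below performs the same steps explicitly so the per-token labels can be inspected.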

Run inference:

sequence = '遇有发热、咽峡炎和淋巴结肿大三联症,血中淋巴细胞增多并出现异淋时,称传染性单核细胞增多症(infectiousmononucleosis,IM;简称传单)。'
# Round-trip through encode/decode so the printed tokens line up with the
# model input, including [CLS]/[SEP] and WordPiece sub-tokens.
print(tokenizer.decode(tokenizer.encode(sequence)))
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs).logits
# Pick the highest-scoring label id at every token position.
predictions = torch.argmax(logits, dim=2)
for token, prediction in zip(tokens, predictions[0].numpy()):
    print((token, model.config.id2label[prediction]))

The output is:

[CLS] 遇 有 发 热 、 咽 峡 炎 和 淋 巴 结 肿 大 三 联 症 , 血 中 淋 巴 细 胞 增 多 并 出 现 异 淋 时 , 称 传 染 性 单 核 细 胞 增 多 症 ( infectiousmononucleosis , [UNK] ; 简 称 传 单 ) 。 [SEP]
('[CLS]', 'O')
('遇', 'O')
('有', 'O')
('发', 'B-DIS')
('热', 'I-DIS')
('、', 'O')
('咽', 'B-DIS')
('峡', 'I-DIS')
('炎', 'I-DIS')
('和', 'O')
('淋', 'B-DIS')
('巴', 'I-DIS')
('结', 'I-DIS')
('肿', 'I-DIS')
('大', 'I-DIS')
('三', 'O')
('联', 'O')
('症', 'O')
(',', 'O')
('血', 'O')
('中', 'O')
('淋', 'B-BOD')
('巴', 'I-BOD')
('细', 'I-BOD')
('胞', 'I-BOD')
('增', 'I-SYM')
('多', 'I-SYM')
('并', 'O')
('出', 'O')
('现', 'O')
('异', 'I-SYM')
('淋', 'I-SYM')
('时', 'O')
(',', 'O')
('称', 'O')
('传', 'B-DIS')
('染', 'I-DIS')
('性', 'I-DIS')
('单', 'I-DIS')
('核', 'I-DIS')
('细', 'I-DIS')
('胞', 'I-DIS')
('增', 'I-DIS')
('多', 'I-DIS')
('症', 'I-DIS')
('(', 'O')
('in', 'B-DIS')
('##fe', 'I-DIS')
('##ct', 'I-DIS')
('##ious', 'I-DIS')
('##mon', 'I-DIS')
('##on', 'I-DIS')
('##uc', 'I-DIS')
('##le', 'I-DIS')
('##os', 'I-DIS')
('##is', 'I-DIS')
(',', 'O')
('[UNK]', 'B-DIS')
(';', 'O')
('简', 'O')
('称', 'O')
('传', 'B-DIS')
('单', 'I-DIS')
(')', 'O')
('。', 'O')
('[SEP]', 'I-DIS')
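
The raw per-token labels above (including stray predictions on special tokens such as the final [SEP]) are easier to consume once merged into entity spans. Below is a minimal sketch; bio_to_spans is a hypothetical helper, not part of transformers. It skips special tokens, strips WordPiece '##' prefixes, and groups consecutive B-/I- tags:

def bio_to_spans(tokens, labels):
    """Merge per-token BIO labels into (entity_text, type) pairs."""
    spans, cur, cur_type = [], [], None
    for tok, lab in zip(tokens, labels):
        if tok in ('[CLS]', '[SEP]'):
            continue  # ignore special tokens and their (spurious) labels
        tok = tok.replace('##', '')  # undo WordPiece sub-token markers
        if lab.startswith('B-'):
            if cur:
                spans.append((''.join(cur), cur_type))
            cur, cur_type = [tok], lab[2:]
        elif lab.startswith('I-') and cur and lab[2:] == cur_type:
            cur.append(tok)
        else:
            if cur:
                spans.append((''.join(cur), cur_type))
            cur, cur_type = [], None
    if cur:
        spans.append((''.join(cur), cur_type))
    return spans

labels = [model.config.id2label[p] for p in predictions[0].numpy()]
print(bio_to_spans(tokens, labels))
# e.g. [('发热', 'DIS'), ('咽峡炎', 'DIS'), ('淋巴结肿大', 'DIS'), ...]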

Appendix: notes on the two datasets

With the same model, the test results on the two datasets are as follows.

CCKS 2020 "Medical entity and event extraction from Chinese electronic medical records (1): medical named entity recognition":

***** train metrics *****
  epoch                    =        3.0
***** eval metrics *****
  eval_accuracy           =     0.9776
  eval_f1                 =     0.9275
  eval_loss               =     0.0872
  eval_precision          =     0.9175
  eval_recall             =     0.9377
  eval_runtime            = 0:00:02.22
  eval_samples            =         57
  eval_samples_per_second =     25.644
  eval_steps_per_second   =        1.8

The CBLUE (Chinese Biomedical Language Understanding Evaluation) benchmark dataset:

***** train metrics *****
  epoch                    =       20.0
  train_loss               =     0.0231
  train_runtime            = 0:25:01.89
  train_samples            =      10000
  train_samples_per_second =    133.165
  train_steps_per_second   =      8.323
***** eval metrics *****
  epoch                   =       20.0
  eval_accuracy           =     0.8211
  eval_f1                 =     0.5732
  eval_loss               =     1.7064
  eval_precision          =     0.5383
  eval_recall             =     0.6129
  eval_runtime            = 0:01:08.33
  eval_samples            =      15000
  eval_samples_per_second =    219.501
  eval_steps_per_second   =     13.726

Judging by these results, the data quality of CCKS is clearly higher than that of CBLUE.


Original: https://blog.csdn.net/lichangzhen2008/article/details/119946739
Author: 钢铁峡
Title: transformers in practice: training your own NER model with BERT
