BERT Text Classification in Practice

Preface:

My research project required natural language processing (NLP), so I studied articles and code found online and am recording my notes here. The project code itself cannot be shown, so I share the publicly available code instead. Most of the ideas and code follow Su Jianlin (苏神), to whom I am grateful.

Task objective:

The problem we want the BERT model to solve: input: a piece of text; output: the category this text belongs to.

How the task is implemented:

This is a supervised learning setup: BERT is fine-tuned on a text dataset that already has labels, and the trained model is then used to predict a sentence's category. In essence this is a multi-class classification problem.

The rough pipeline: preprocess the dataset; split it into training and validation sets; process the data (encode the text into the vector format BERT expects); build the model (load and adapt the pre-trained BERT); feed the data into the model for training and prediction.

Overall model structure:

(Figure: overall model structure diagram, omitted here.)

Code walkthrough:

Dataset preprocessing:

Read in the dataset and shuffle the row order.

mainPath = 'bert多文本分类//'
rc = pd.read_csv(mainPath + 'data/tnews/toutiao_news_dataset.txt', delimiter="_!_", names=['labels', 'text'], header=None, encoding='utf-8')
rc = shuffle(rc)  # shuffle the row order
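
For reference, each line of toutiao_news_dataset.txt is assumed to contain a label and a piece of text separated by the "_!_" marker, which is why delimiter="_!_" is passed to read_csv above. An illustrative line (not copied from the actual file) would look like:

体育_!_上港主场1-2负于国安,遭遇联赛两连败,上港到底输在哪?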


Splitting the dataset

Split the dataset into a training set and a test (validation) set.

# Build the full list of (text, label) pairs
data_list = []
for d in rc.iloc[:].itertuples():   # itertuples(): iterate over the DataFrame rows as namedtuples
    data_list.append((d.text, d.labels))

# Take part of the data for training and validation
train_data = data_list[0:20000]
valid_data = data_list[20000:22000]
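
The slice above simply takes the first 20,000 shuffled samples for training and the next 2,000 for validation. As a rough alternative sketch (not part of the original code), sklearn's train_test_split could produce a split that is stratified by label:

from sklearn.model_selection import train_test_split

texts = [t for t, _ in data_list]
label_ids = [l for _, l in data_list]
# keep the label distribution identical in both splits
tr_x, va_x, tr_y, va_y = train_test_split(texts, label_ids, test_size=0.1,
                                          stratify=label_ids, random_state=42)
train_data = list(zip(tr_x, tr_y))
valid_data = list(zip(va_x, va_y))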


Data processing:

Modifying the original vocabulary dictionary:

Why the change (per Su Jianlin's explanation): Tokenizer already has its own _tokenize method, but it is overridden here to guarantee that the tokenized result has the same length as the original string (or that length plus 2 once the two special markers are counted). The built-in _tokenize silently drops spaces and can glue some characters together in its output, so the token list no longer matches the length of the original string, which makes sequence labeling tasks very awkward. The trick is to use [unused1] to represent space-like characters and [UNK] for any other character not in the vocabulary. The [unused*] markers are untrained (randomly initialized) tokens that BERT reserves for incrementally adding new vocabulary, so we can use them to stand for any new character.

# vocabPath stores the vocabulary: one token per line, each mapped to an index, e.g. 10640 posts
# Convert the vocabulary file into a token-to-index dictionary;
# an entry looks like  '仑': 796
# Build the raw dictionary
tokenDict = {}
with codecs.open(vocabPath, 'r', encoding='utf-8') as reader:
    for line in reader:
        token = line.strip()                      # strip leading/trailing whitespace
        tokenDict[token] = len(tokenDict)

# The raw dictionary has flaws, so we build our own tokenizer on top of it to suit our own dataset
# Override the tokenizer
class OurTokenizer(Tokenizer):
    def _tokenize(self, content):
        reList = []
        for t in content:
            if t in self._token_dict:
                reList.append(t)
            elif self._is_space(t):

                # use [unused1] to represent space-like characters
                reList.append('[unused1]')
            else:
                # characters not in the vocabulary map to [UNK]
                reList.append('[UNK]')
        return reList

# Instantiate the tokenizer with the new dictionary
tokenizer = OurTokenizer(tokenDict)
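
A quick sanity check (a sketch, assuming the vocabulary has been loaded as above) shows the point of the override: the token list stays aligned with the original string, with space-like characters mapped to [unused1] and out-of-vocabulary characters to [UNK]:

sample = "今天 天气真好☀"
tokens = tokenizer.tokenize(sample)
# tokenize() adds [CLS] and [SEP], so the result is the string length plus 2
print(tokens)   # e.g. ['[CLS]', '今', '天', '[unused1]', '天', '气', '真', '好', '[UNK]', '[SEP]']
print(len(tokens) == len(sample) + 2)   # True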


Using the dictionary, the texts are encoded into vectors that match BERT's input format, and batches of the form ([X1, X2], Y) are generated so they can be fed to the model for training (maxlen, set to 100 in the complete code below, caps the text length). A quick check of what the generator yields follows the code.

def seqPadding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X])

class data_generator:
    def __init__(self, data, batch_size=32, shuffle=True):  # constructor, runs when the generator is created
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))  # indices into the data list

            if self.shuffle:
                np.random.shuffle(idxs)        # optionally shuffle the index order

            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)  # encode() produces both model inputs in one step

                y = d[1]
                X1.append(x1)                           # x1: token indices; x2: segment indices
                X2.append(x2)
                Y.append([y])
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seqPadding(X1)             # pad with 0 up to the longest sequence in the batch
                    X2 = seqPadding(X2)
                    Y = seqPadding(Y)
                    yield [X1, X2], Y
                    [X1, X2, Y] = [], [], []
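
To see what the generator yields, one can peek at a single batch (a small sketch; tokenizer, maxlen and valid_data are assumed to be defined as in this article). Each batch holds a padded token-id matrix, a segment-id matrix of the same shape, and a label column:

probe = data_generator(valid_data, batch_size=4, shuffle=False)
[x1, x2], y = next(iter(probe))
print(x1.shape, x2.shape, y.shape)   # e.g. (4, 34), (4, 34), (4, 1); the second dim is the longest text in the batch plus 2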

Building and training the model:

Load the pre-trained BERT model and adjust its output so that it can perform our classification task.

# Set the paths of the pre-trained BERT model
configPath = mainPath + 'chinese_roberta_wwm_ext_L-12_H-768_A-12/bert_config.json'
ckpPath = mainPath + 'chinese_roberta_wwm_ext_L-12_H-768_A-12/bert_model.ckpt'
vocabPath = mainPath + 'chinese_roberta_wwm_ext_L-12_H-768_A-12/vocab.txt'

# BERT model setup
bert_model = load_trained_model_from_checkpoint(configPath, ckpPath, seq_len=None)  # load the pre-trained model
for l in bert_model.layers:
    l.trainable = True

x1_in = Input(shape=(None,))
x2_in = Input(shape=(None,))

x = bert_model([x1_in, x2_in])

# Take the vector at the [CLS] position for classification
x = Lambda(lambda x: x[:, 0])(x)
p = Dense(15, activation='softmax')(x)

model = Model([x1_in, x2_in], p)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(1e-5), metrics=['accuracy'])
model.summary()

train_D = data_generator(train_data)
valid_D = data_generator(valid_data)

model.fit_generator(train_D.__iter__(), steps_per_epoch=len(train_D), epochs=5, validation_data=valid_D.__iter__(),
                    validation_steps=len(valid_D))
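
If you want to keep only the best weights seen during training (a sketch, not in the original code; the checkpoint path is an assumption), a ModelCheckpoint callback can be passed to the same fit_generator call:

from keras.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint(mainPath + 'model/keras_class/tnews_best.h5',
                             monitor='val_loss', save_best_only=True, verbose=1)
model.fit_generator(train_D.__iter__(), steps_per_epoch=len(train_D), epochs=5,
                    validation_data=valid_D.__iter__(), validation_steps=len(valid_D),
                    callbacks=[checkpoint])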

Model prediction:

# Test samples
str1 = "上港主场1-2负于国安,遭遇联赛两连败,上港到底输在哪?"
str2 = "普京总统会见了拜登总统"
str3 = "这3辆10万出头小钢炮,随便改改轻松秒奔驰,第一辆还是限量款"
predict_D = data_generator([(str1, 0), (str2, 3), (str3, 10)], shuffle=False)
# All label categories:
# array(['体育', '军事', '农业', '国际', '娱乐', '房产', '教育', '文化', '旅游', '民生故事', '汽车', '电竞游戏', '科技', '证券股票', '财经'], dtype=object)
output_label2id_file = os.path.join(mainPath, "model/keras_class/label2id.pkl")
if os.path.exists(output_label2id_file):
    with open(output_label2id_file, 'rb') as w:
        labels = pickle.load(w)

# Load the saved model
from keras.models import load_model
from keras_bert import get_custom_objects
custom_objects = get_custom_objects()
model = load_model(mainPath + 'model/keras_class/tnews.h5', custom_objects=custom_objects)
# Use the generator to produce the test data
tmpData = predict_D.__iter__()
# Predict
preds = model.predict_generator(tmpData, steps=len(predict_D), verbose=1)
# Take the index of the max value in each row; axis=1 means row-wise
index_maxs = np.argmax(preds, axis=1)
result = [(x, labels[x]) for x in index_maxs]
print(result)

Printing preds, index_maxs, and result gives the class probabilities, the predicted class indices, and the corresponding (index, label) pairs.
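
For a single new sentence the generator can be bypassed entirely: encode the text with the tokenizer, wrap it as a batch of one, and call model.predict. This is a sketch built from the objects defined above; predict_one is an illustrative helper, not part of the original code:

def predict_one(text):
    x1, x2 = tokenizer.encode(first=text[:maxlen])
    x1, x2 = np.array([x1]), np.array([x2])   # batch of size 1, no padding needed
    probs = model.predict([x1, x2])[0]
    idx = int(np.argmax(probs))
    return idx, labels[idx]

print(predict_one("普京总统会见了拜登总统"))   # expected to land in the 国际 class, as in the example above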


Complete code

import pickle
from keras_bert import load_trained_model_from_checkpoint, Tokenizer
from keras.layers import *
from keras.models import Model
from keras.optimizers import Adam
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import shuffle
from keras.utils.vis_utils import plot_model
import codecs, gc
import keras.backend as K
import os
import pandas as pd
import numpy as np

# Define the main file path
mainPath = '你的目录/keras_bert文本分类实例/'  # 'your directory/keras_bert text classification example/'

# Read the data from file to build the training and validation sets
rc = pd.read_csv(mainPath + 'data/tnews/toutiao_news_dataset.txt', delimiter="_!_", names=['labels', 'text'],
                 header=None, encoding='utf-8')   # the file uses '_!_' as the field delimiter

rc = shuffle(rc)  # shuffle the data

# Convert the category labels to integers
# 15 categories in total: "教育", "科技", "军事", "旅游", "国际", "证券股票", "农业", "电竞游戏",
# "民生故事", "文化", "娱乐", "体育", "财经", "房产", "汽车"
class_le = LabelEncoder()
rc.iloc[:, 0] = class_le.fit_transform(rc.iloc[:, 0].values)

# Save the label file
output_label2id_file = os.path.join(mainPath, "model/keras_class/label2id.pkl")
if not os.path.exists(output_label2id_file):
    with open(output_label2id_file, 'wb') as w:
        pickle.dump(class_le.classes_, w)

# Build the full list of (text, label) pairs
data_list = []
for d in rc.iloc[:].itertuples():
    data_list.append((d.text, d.labels))

# Take part of the data for training and validation
train_data = data_list[0:20000]
valid_data = data_list[20000:22000]

maxlen = 100  # maximum sequence length; must not exceed 512

# Set the paths of the pre-trained model
configPath = mainPath + 'chinese_roberta_wwm_ext_L-12_H-768_A-12/bert_config.json'
ckpPath = mainPath + 'chinese_roberta_wwm_ext_L-12_H-768_A-12/bert_model.ckpt'
vocabPath = mainPath + 'chinese_roberta_wwm_ext_L-12_H-768_A-12/vocab.txt'

# Convert the vocabulary file into a token-to-index dictionary
tokenDict = {}
with codecs.open(vocabPath, 'r', encoding='utf-8') as reader:
    for line in reader:
        token = line.strip()
        tokenDict[token] = len(tokenDict)

# Override the tokenizer
class OurTokenizer(Tokenizer):
    def _tokenize(self, content):
        reList = []
        for t in content:
            if t in self._token_dict:
                reList.append(t)
            elif self._is_space(t):

                # use [unused1] to represent space-like characters
                reList.append('[unused1]')
            else:
                # characters not in the vocabulary map to [UNK]
                reList.append('[UNK]')
        return reList

tokenizer = OurTokenizer(tokenDict)

def seqPadding(X, padding=0):
    L = [len(x) for x in X]
    ML = max(L)
    return np.array([np.concatenate([x, [padding] * (ML - len(x))]) if len(x) < ML else x for x in X])

class data_generator:  # data is fed to the generator as a list of (text, label) tuples
    def __init__(self, data, batch_size=32, shuffle=True):
        self.data = data
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.steps = len(self.data) // self.batch_size
        if len(self.data) % self.batch_size != 0:
            self.steps += 1

    def __len__(self):
        return self.steps

    def __iter__(self):
        while True:
            idxs = list(range(len(self.data)))

            if self.shuffle:
                np.random.shuffle(idxs)

            X1, X2, Y = [], [], []
            for i in idxs:
                d = self.data[i]
                text = d[0][:maxlen]
                x1, x2 = tokenizer.encode(first=text)
                y = d[1]
                X1.append(x1)
                X2.append(x2)
                Y.append([y])
                if len(X1) == self.batch_size or i == idxs[-1]:
                    X1 = seqPadding(X1)
                    X2 = seqPadding(X2)
                    Y = seqPadding(Y)
                    yield [X1, X2], Y
                    [X1, X2, Y] = [], [], []

# BERT model setup
bert_model = load_trained_model_from_checkpoint(configPath, ckpPath, seq_len=None)  # load the pre-trained model

for l in bert_model.layers:
    l.trainable = True

x1_in = Input(shape=(None,))
x2_in = Input(shape=(None,))

x = bert_model([x1_in, x2_in])

# Take the vector at the [CLS] position for classification
x = Lambda(lambda x: x[:, 0])(x)
p = Dense(15, activation='softmax')(x)

model = Model([x1_in, x2_in], p)
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam(1e-5), metrics=['accuracy'])
model.summary()

train_D = data_generator(train_data)
valid_D = data_generator(valid_data)

model.fit_generator(train_D.__iter__(), steps_per_epoch=len(train_D), epochs=5, validation_data=valid_D.__iter__(),
                    validation_steps=len(valid_D))

model.save(mainPath + 'model/keras_class/tnews.h5', True, True)

# Save a diagram of the model structure
plot_model(model, to_file='model/keras_class/tnews.png', show_shapes=True)

Reference:
https://blog.csdn.net/qq_39290990/article/details/121672141

