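A course project forced me to learn a bit about BERT, and then it occurred to me: isn't this exactly the tool for doing my English cloze homework? Hence the code below.
First we import the libraries we need. This was my first time using pytorch_pretrained_bert; it has to be installed with pip install or conda install: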
import numpy as np
import torch
from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
import re
from random import randint
Next, load the BERT vocabulary and model:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # load the vocabulary
bert = BertForMaskedLM.from_pretrained('bert-base-uncased')  # load the model
bert.eval()
bert.to('cuda:0')  # move to the GPU; drop this line if you have no GPU
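Then I load the question (options) text, which lives in a txt file:
# extract the options
choices = ques_proce('question.txt')  # parse the question (options) file
choices_idx = []
for choice in choices:  # map each option word to its vocabulary id
    choice_idx = tokenizer.convert_tokens_to_ids(choice)
    choices_idx.append(choice_idx)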
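Because every option is looked up as one vocabulary entry, it may be worth checking beforehand that each option really is a single known token. The following is only a sketch under the assumption that the vocabulary is available as a plain token-to-id mapping; `toy_vocab`, `toy_choices` and `find_unknown_choices` are hypothetical names made up for illustration, not part of the library:

```python
def find_unknown_choices(choices, vocab):
    """Return (question_index, option) pairs whose option is not a single vocab token."""
    missing = []
    for q_idx, options in enumerate(choices):
        for option in options:
            # The uncased model expects lower-case tokens.
            if option.lower() not in vocab:
                missing.append((q_idx, option))
    return missing


# Toy vocabulary standing in for the real one (an assumption for illustration).
toy_vocab = {"cried": 0, "talked": 1, "shouted": 2, "laughed": 3, "other": 4}
toy_choices = [["cried", "talked", "shouted", "laughed"],
               ["other", "the other", "another", "others"]]
print(find_unknown_choices(toy_choices, toy_vocab))
```

Options flagged this way (multi-word options in particular) would need separate handling before they can be scored.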
The function that parses the questions is:
def ques_proce(file):
    choices = []
    with open(file, 'r', encoding='gb18030', errors='ignore') as f:
        for line in f:
            parts = line.split()
            if not parts:  # skip blank lines
                continue
            # tokens 2, 4, 6, 8 are the four option words (token 0 is the question number)
            one_que = [parts[idx] for idx in [2, 4, 6, 8]]
            choices.append(one_que)
    return choices
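question.txt looks roughly like this. Because I am not very good at string processing, the file has to follow a fixed format: the question number, the option labels and the options must be separated by spaces: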
- A. cried B. talked C. shouted D. laughed
- A. spoke B. told C. shouted D. asked
- A. her B. him C. it D. them
- A. brought B. took C. put D. got
- A. only B. ago C. later D. before
- A. hurt B. well C. healthy D. bad
- A. on B. in C. out D. off
- A. other B. one C. another D. others
- A. much B. very C. still D. also
- A. kept B. pulled C. done D. thrown
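If the strict spacing requirement is a nuisance, a regular expression could pull out the options regardless of what precedes the labels. This is a sketch of an alternative, not the parser used in this article; `OPTION_RE` and `parse_choice_line` are hypothetical names, and it assumes every option is a single word:

```python
import re

# Matches an option label (A-D followed by a dot) and the single word after it.
OPTION_RE = re.compile(r"\b([A-D])\.\s*([A-Za-z']+)")


def parse_choice_line(line):
    """Extract the four options of one question, whatever precedes the labels."""
    found = dict(OPTION_RE.findall(line))
    if sorted(found) != ["A", "B", "C", "D"]:
        raise ValueError("expected options A-D in: " + line.strip())
    return [found[label] for label in "ABCD"]


print(parse_choice_line("1. A. cried B. talked C. shouted D. laughed"))
print(parse_choice_line("- A.her  B.him C. it D. them"))
```

Raising an error on a malformed line makes a formatting slip visible immediately instead of shifting the wrong words into the option list.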
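Build a prediction matrix that records the probability of each of the 4 options for every question:
# build the prediction probability matrix
ans_prob = []
for i in range(len(choices)):
    ans_prob.append([0.0, 0.0, 0.0, 0.0])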
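Now for the important part. First, the passage-processing function (it may look complicated). The main idea: for every sentence that contains a blank (a mask, i.e. a question), randomly pick another sentence from before or after it and combine the two as the input to BERT. The code is complicated because the randomly chosen sentence may itself contain blanks, so we first call sen2maskIdx to record which question or questions each sentence corresponds to (one sentence can contain several blanks).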
def sen2maskIdx(sen_list):
    now_mask = 0
    result = []
    for idx in range(len(sen_list)):
        sen = sen_list[idx]
        mask_num = sen.count('[MASK]')
        maskIdx = [i + now_mask for i in range(mask_num)]
        now_mask += mask_num
        result.append(maskIdx)
    return result
def pass_proce(file, per_times):  # per_times: how many times each sentence is paired with another one
    with open(file, 'r', encoding='gb18030', errors='ignore') as f:  # read and clean the text
        buffer = f.read()
    buffer = re.sub(r'\n', ' ', buffer)
    # pad [MASK] with spaces so the tokenizer keeps it as a single token
    buffer = re.sub(r'\(\d{1,2}\)_{1,9}', ' [MASK] ', buffer)
    sen_list = buffer.split('.')
    for_all_mask = []
    sen2maskidx = sen2maskIdx(sen_list)  # map each sentence to its questions
    for sen_idx, sen in enumerate(sen_list):
        if '[MASK]' in sen:  # only process sentences that contain a blank
            for_this_mask = []
            for i in range(per_times):
                ano_idx = randint(0, len(sen_list) - 1)
                while sen_list[ano_idx] == sen:
                    ano_idx = randint(0, len(sen_list) - 1)  # randint includes both ends
                # keep the two sentences in passage order; the special tokens are
                # separated by spaces so the tokenizer does not split them
                if sen_idx > ano_idx:
                    temp_sen = '[CLS] ' + sen_list[ano_idx] + ' [SEP] ' + sen + ' [SEP]'
                    mask_idx = sen2maskidx[ano_idx] + sen2maskidx[sen_idx]
                else:
                    temp_sen = '[CLS] ' + sen + ' [SEP] ' + sen_list[ano_idx] + ' [SEP]'
                    mask_idx = sen2maskidx[sen_idx] + sen2maskidx[ano_idx]
                for_this_mask.append((temp_sen, mask_idx))
            for_all_mask.append(for_this_mask)
    return for_all_mask
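The sentence-to-question bookkeeping is easier to follow on a toy input. The sketch below re-implements the same counting idea under a hypothetical name, `blanks_per_sentence`, on three made-up sentences, and then shows how the question indices of a sentence pair are concatenated in passage order:

```python
def blanks_per_sentence(sentences, mask="[MASK]"):
    """For each sentence, list the global question indices of its blanks."""
    mapping, next_question = [], 0
    for sentence in sentences:
        count = sentence.count(mask)
        mapping.append(list(range(next_question, next_question + count)))
        next_question += count
    return mapping


toy = ["She [MASK] in class, and her teacher [MASK] kindly",
       " Helen was seven years old",
       " It will be as nice as [MASK] next year"]
mapping = blanks_per_sentence(toy)
print(mapping)  # [[0, 1], [], [2]]

# Pairing sentence 2 with the earlier sentence 0: the earlier sentence comes
# first in the input, so its question indices come first as well.
pair_questions = mapping[0] + mapping[2]
print(pair_questions)  # [0, 1, 2]
```

The i-th mask found in the tokenized pair then belongs to question `pair_questions[i]`.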
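The passage used in the experiment is below; it is a real high-school entrance exam question found online. Because of my limited skills, the text itself has to follow a certain format, for example a space between the question number with its underscores and the surrounding English letters, and likewise for punctuation: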
Helen was seven years old. One day one of her teeth began to hurt. She (1)_ in class at school , and her teacher (2)___ kindly, “What’s the matter, Helen?”
“One of my teeth hurts, “answered Helen.
“Tell your mother about (3)_____ , ” said the teacher, “and then go to see the dentist.”
That afternoon Helen told her mother about her tooth, and her mother (4)_ her to the dentist’s a few hours (5)_. The dentist looked at the tooth and then said to Helen. “It’s very (6)_. I’m going to pull it (7)__ , and then you’re going to get a new tooth. It will be as nice as (8)______ next year.” Then he did it with no trouble.
The next day Helen’s teacher asked her about the tooth. She said to her, “Does it (9)______ hurt, Helen?”
“I don’t know. You’d better ask the dentist, “Helen answered.
“Why?” the teacher asked.
“Because the dentist has (10)______ it, ” Helen answered.
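To see how the blanks in such a passage turn into masks, here is a small sketch that applies the same pattern as the article's regular expression to sample strings. `BLANK_RE` is a hypothetical name, and padding the replacement with spaces is a choice made here for illustration:

```python
import re

# Same pattern as in the article: "(n)" followed by one to nine underscores.
BLANK_RE = re.compile(r"\(\d{1,2}\)_{1,9}")

sample = "She (1)_ in class at school , and her teacher (2)___ kindly"
masked = BLANK_RE.sub(" [MASK] ", sample)
print(masked.split())

# A blank written without any underscore is not matched by this pattern.
print(BLANK_RE.findall("It's very (6). I'm going to pull it (7)__ ,"))
```

The second call finds only `(7)__`, so a blank that lost its underscores would silently shift every later question index; comparing the number of masks with the number of questions is a cheap sanity check.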
The prediction code is below. It goes through the sentence pairs returned by pass_proce one by one; for each pair it predicts how likely each of the 4 options is at every mask position and adds the result to the matrix built above. (Why add? Each masked sentence appears in many pairs, so it is predicted many times with different context; summing amounts to combining all of those predictions into one.)
# process the passage
text = pass_proce("passage.txt", 10)  # 10 pairings per masked sentence
device = next(bert.parameters()).device  # run on whatever device the model is on
for mask_sen in text:
    for per_sen in mask_sen:
        tokenized_text = tokenizer.tokenize(per_sen[0])
        broke_point = tokenized_text.index('[SEP]')
        segments_ids = [0] * (broke_point + 1) + [1] * (len(tokenized_text) - broke_point - 1)
        que_idxs = per_sen[1]
        ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokenized_text)]).to(device)
        segments_tensors = torch.tensor([segments_ids]).to(device)
        # locate the mask positions
        mask_idxs = [idx for idx in range(len(tokenized_text)) if tokenized_text[idx] == '[MASK]']
        mask_num = len(mask_idxs)
        # predict the answers; the model output is a raw score (logit) per vocabulary entry
        with torch.no_grad():
            result = bert(ids, segments_tensors)
        for i in range(mask_num):
            mask_idx = mask_idxs[i]
            this_ans_prob = [result[0][mask_idx][choice_idx].item() for choice_idx in choices_idx[que_idxs[i]]]
            ans_prob[que_idxs[i]] = [ans_prob[que_idxs[i]][j] + this_ans_prob[j] for j in range(4)]
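This gives the prediction result (the probability matrix). We can normalize it, although in this task normalization makes no difference, since dividing by a constant does not change which option scores highest. If I have the energy later, other ways of combining the predictions, such as different weights, might be worth considering:
# normalization
for i in range(len(choices)):
    for j in range(4):
        ans_prob[i][j] /= 10  # 10 is the per_times value passed to pass_proce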
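One possible alternative way of combining the passes (not what the code above does) would be to turn the four scores of a question into a distribution with a softmax on each pass and then average those distributions. A minimal sketch in plain Python with made-up scores; `softmax` and `average_distributions` are hypothetical helper names:

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def average_distributions(passes):
    """Average the per-pass distributions over the four options."""
    dists = [softmax(scores) for scores in passes]
    return [sum(column) / len(dists) for column in zip(*dists)]


# Made-up scores for one question over three passes (illustration only).
passes = [[2.0, 0.5, 0.1, -1.0], [1.5, 1.0, 0.0, -0.5], [0.2, 0.1, 0.0, -0.3]]
probs = average_distributions(passes)
print("ABCD"[probs.index(max(probs))])
```

With this scheme every pass contributes a distribution that sums to 1, so a single pass with unusually large scores cannot dominate the total; whether that helps on real cloze questions is untested here.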
Then simply predict the option with the highest probability:
# compute the predicted answers
print(ans_prob)
ans_pred = []
for per_que in ans_prob:
    ans = ['A', 'B', 'C', 'D'][np.array(per_que).argmax(axis=0)]
    ans_pred.append(ans)
print(ans_pred)
To compute the accuracy, we load the text file holding the correct answers and compare:
def ans_proce(file):
    answers = []
    with open(file, 'r', encoding='gb18030', errors='ignore') as f:
        for line in f:
            answers.append(line.strip())
    return answers
# load the correct answers
ans_correct = ans_proce('answer.txt')
# compute the accuracy
correct = 0.0
for i in range(len(choices)):
    if ans_pred[i] == ans_correct[i]:
        correct += 1
print("the correct rate is :" + str(correct / len(choices) * 100.0) + "%")
The correct answers for the passage above are:
A
D
C
B
C
D
C
D
C
A
The answers and accuracy BERT gave:
['A', 'D', 'C', 'B', 'C', 'D', 'C', 'B', 'C', 'C']
the correct rate is :80.0%
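To see which questions were missed, the comparison can also report the mismatching question numbers. A small sketch using the answer key and predictions listed in this article; `score` is a hypothetical helper name:

```python
def score(predicted, correct):
    """Return accuracy in percent and the 1-based numbers of the wrong questions."""
    wrong = [i + 1 for i, (p, c) in enumerate(zip(predicted, correct)) if p != c]
    accuracy = 100.0 * (len(correct) - len(wrong)) / len(correct)
    return accuracy, wrong


correct = ["A", "D", "C", "B", "C", "D", "C", "D", "C", "A"]
predicted = ["A", "D", "C", "B", "C", "D", "C", "B", "C", "C"]
print(score(predicted, correct))  # (80.0, [8, 10])
```

The two misses are questions 8 and 10.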
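Here are some takeaways and thoughts (rambling):
(1) Without modification or transfer learning, BERT probably cannot handle questions in which an option consists of two words, for example choosing the correct answer among 'others', 'the other', 'another' and 'other', because for 'the other' BERT cannot predict a single blank that actually needs two words. You might think: why not just use two blanks? I thought about it: with two blanks for 'the other', the probability BERT gives would be the product of the probabilities of the two blanks being 'the' and 'other' respectively, which is bound to be lower than the probability of a single word (p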
Original: https://blog.csdn.net/weixin_53563701/article/details/121070457
Author: 野生的野蛮人
Title: Using the "BERT model" to predict answers to English "cloze" questions