Language Models (LM): Introduction and Hands-On Practice

May 28, 2023, 10:26 AM • Artificial Intelligence

Original article: https://medium.com/analytics-vidhya/a-comprehensive-guide-to-build-your-own-language-model-in-python-5141b3917d6d

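The article opens with a quote: "We tend to look through language and not realize how much power language has." Tasks such as text summarization, text generation, and autocomplete all rely on language models (LMs); in fact, LMs underpin most NLP tasks. This article walks through LMs step by step, from the basic ideas to hands-on practice, to convey their breadth and depth.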

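What Is a Language Model?

A language model learns to predict the probability of a sequence of words. What does the probability of a word sequence mean? Take machine translation as an example: given a word sequence, the task is to convert it into another word sequence. To do that, you estimate a probability distribution over candidate output sequences, and the candidate with the highest probability is the preferred translation. Compare these two sequences: "the cat is small" and "small the is cat". The first is clearly far more likely to occur. Once a model has learned the regularities of a language (the probability distribution over word sequences), it can be applied to many NLP tasks.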

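Types of Language Models

There are two types of LM:

Statistical language models. These use traditional statistical techniques such as N-grams and hidden Markov models (HMMs), or specific statistical rules, to learn the probability distribution of words.
Neural language models. These use neural networks to model language.

The two types are introduced in turn below.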

Building an N-gram Model

An N-gram is a sequence of N words. The following example illustrates the idea:

“I love reading blogs about data science on Analytics Vidhya.”

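From this sentence, the 1-grams (unigrams) are single words such as "I", "love", and "reading", and the 2-grams (bigrams) are sequences of two consecutive words such as "I love", "love reading", and "reading blogs". An N-gram model predicts the probability that a given word sequence of length N occurs in natural language.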
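To make the definition concrete, here is a minimal sketch that extracts unigrams and bigrams from the example sentence. The helper name `ngrams` and the whitespace tokenization are assumptions made for illustration; a real pipeline would use a proper tokenizer.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "I love reading blogs about data science on Analytics Vidhya."
# Naive tokenization (assumption): drop the final period and split on spaces.
tokens = sentence.rstrip(".").split()

unigram_list = ngrams(tokens, 1)
bigram_list = ngrams(tokens, 2)

print(unigram_list[:3])
print(bigram_list[:3])
```

A sentence with k tokens yields k - n + 1 N-grams, so the 10 tokens here give 10 unigrams and 9 bigrams.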

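To predict the probability of a word sequence of length n, that is, to build the N-gram model, we use the chain rule to write the joint probability of the n words:

p(w_1 … w_n) = p(w_1) · p(w_2 | w_1) · p(w_3 | w_1 w_2) · p(w_4 | w_1 w_2 w_3) · … · p(w_n | w_1 … w_{n-1})

Computing the joint probability therefore requires a number of conditional probabilities, each of which predicts the next word given its history.

[Figure: predicting the next word given its history]

When the history is long, conditioning on every previous word, p(w_n | w_1 … w_{n-1}), makes the model space too large and the model too complex, and it also runs into data sparsity. The usual remedy is the Markov assumption: the next word depends only on the current word and not on the rest of the history, so the conditional probability simplifies to p(w_n | w_{n-1}).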
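The chain rule combined with the first-order Markov assumption can be sketched on a tiny corpus. Everything below is illustrative: the three-sentence corpus is made up, `<s>` is an assumed start-of-sentence marker, and the probabilities are plain maximum-likelihood estimates from counts.

```python
from collections import Counter

# Toy corpus (made up for illustration); <s> marks the start of a sentence.
corpus = [
    ["<s>", "the", "cat", "is", "small"],
    ["<s>", "the", "dog", "is", "small"],
    ["<s>", "the", "cat", "is", "big"],
]

unigram_counts = Counter(w for sent in corpus for w in sent)
bigram_counts = Counter(
    (sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1)
)


def bigram_prob(prev, word):
    """Maximum-likelihood estimate of p(word | prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]


def sentence_prob(words):
    """Chain rule where each word is conditioned only on the previous word."""
    prob = 1.0
    for prev, word in zip(["<s>"] + words, words):
        prob *= bigram_prob(prev, word)
    return prob


print(sentence_prob("the cat is small".split()))
print(sentence_prob("small the is cat".split()))
```

The well-formed sentence gets a positive probability (1 · 2/3 · 1 · 2/3 = 4/9 on this corpus), while the scrambled one gets 0 because its first bigram was never observed.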

Now let's build an N-gram language model in practice. The Reuters corpus used here contains 10,788 news documents and about 1.3 million words. The following code builds the language model:

# code courtesy of https://nlpforhackers.io/language-models/
# Requires the NLTK Reuters corpus data, e.g. nltk.download('reuters')

from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict

# Create a placeholder for the model: (w1, w2) -> {w3: count}
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1

# Transform the counts into probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

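The logic is straightforward: the code counts every triple (w1, w2, w3) in the corpus and then, for each pair (w1, w2), normalizes the counts into the probability of w3 given that pair. With this model, we can keep predicting the next word. Note that this differs from the Markov assumption described earlier in that it conditions on one more history word: p(w_n | w_1 … w_{n-1}) ≈ p(w_n | w_{n-2} w_{n-1}).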

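In the original article, sentences generated from the two seed words "today the" with different random thresholds are reasonably readable. According to the original article, this N-gram model rests on the same basic principle that companies such as Google, Alexa, and Apple use for language modeling.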
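The generation step itself is not shown here, so the following is only a sketch of one way to do it, not the original article's code. It assumes a model shaped like the one above, a mapping from (w1, w2) to {w3: probability}, and uses weighted random sampling in place of an explicit threshold. The function name `generate` and the toy model are hypothetical; `None` stands for the end-of-sentence padding that `trigrams(..., pad_right=True)` produces by default.

```python
import random


def generate(model, seed, max_words=20, rng=None):
    """Sample words from a trigram model until the end-of-sentence pad (None)."""
    rng = rng or random.Random()
    text = list(seed)
    for _ in range(max_words):
        # .get avoids creating new entries if model is a defaultdict
        candidates = model.get(tuple(text[-2:]), {})
        if not candidates:
            break  # unseen history: nothing to sample from
        words = list(candidates.keys())
        weights = list(candidates.values())
        next_word = rng.choices(words, weights=weights, k=1)[0]
        if next_word is None:
            break  # reached the right-hand sentence padding
        text.append(next_word)
    return " ".join(text)


# Hypothetical toy model with the same shape as the Reuters model.
toy_model = {
    ("today", "the"): {"price": 0.5, "company": 0.5},
    ("the", "price"): {"rose": 1.0},
    ("the", "company"): {"said": 1.0},
    ("price", "rose"): {None: 1.0},
    ("company", "said"): {None: 1.0},
}

print(generate(toy_model, ["today", "the"], rng=random.Random(0)))
```

With the trained Reuters model, the call would look like `generate(model, ["today", "the"])`; each run can produce a different sentence because the next word is sampled rather than chosen greedily.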

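Limitations of N-gram Models

N-gram models usually perform better as N grows, but the computational cost also rises sharply, and memory consumption grows exponentially with N because the number of possible N-grams does.
N-gram models treat language as discrete symbols: any word combination that never co-occurs in the training corpus is assigned a joint probability of 0.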
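A common way to soften the zero-probability problem is smoothing. The sketch below uses add-one (Laplace) smoothing on a bigram model; it is one standard option rather than something taken from the source, and the function name and toy token list are made up for illustration.

```python
from collections import Counter


def laplace_prob(bigram_counts, context_counts, vocab_size, prev, word):
    """Add-one (Laplace) estimate of p(word | prev)."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + vocab_size)


tokens = "the cat is small the dog is small".split()
vocab = sorted(set(tokens))
bigram_counts = Counter(zip(tokens, tokens[1:]))
# Count each token as a context only where a next word follows it.
context_counts = Counter(tokens[:-1])

seen = laplace_prob(bigram_counts, context_counts, len(vocab), "the", "cat")
unseen = laplace_prob(bigram_counts, context_counts, len(vocab), "cat", "dog")
print(seen, unseen)
```

The unseen bigram ("cat", "dog") now gets a small positive probability instead of 0, and the estimates for a given context still sum to 1 over the vocabulary.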

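Building a Neural Language Model

Deep learning has performed very well on many NLP tasks, such as summarization and machine translation. Because these tasks all build on language models, much research has turned to modeling language with deep neural networks. A neural LM can work at the character level or at the word level; the example below is a character-level LM.

First, the problem statement: train an LM on a given corpus, then generate a continuation of a given text so that the output matches the style of the corpus and is grammatically plausible.

Let's build a neural LM. The corpus is the United States Declaration of Independence. First import the required packages and read the text: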

import numpy as np
import pandas as pd
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, GRU, Embedding
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

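file_name = "Declaration_of_Independence.txt"
# read the file and join its stripped lines into one string
# (no separator is added between lines)
with open(file_name) as f:
    data_text = "".join(line.strip() for line in f)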

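Next, apply some simple cleaning to the text: 1) convert uppercase letters to lowercase; 2) strip the possessive 's from words; 3) remove punctuation; 4) drop words shorter than 3 characters: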

import re

def text_cleaner(text):
    # lower case text
    newString = text.lower()
    newString = re.sub(r"'s\b","",newString)
    # remove punctuations
    newString = re.sub("[^a-zA-Z]", " ", newString)
    long_words=[]
    # remove short word
    for i in newString.split():
        if len(i)>=3:
            long_words.append(i)
    return (" ".join(long_words)).strip()

# preprocess the text
data_new = text_cleaner(data_text)

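With preprocessing done, we can build the training sequences. The task is defined as: given 30 characters, predict the next one. Each training sequence is therefore a substring of length 31 (30 input characters plus the target character), taken at every possible starting position in the text: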

def create_seq(text):
    length = 30
    sequences = list()
    for i in range(length, len(text)):
        # select sequence of tokens
        seq = text[i-length:i+1]
        # store
        sequences.append(seq)
    print('Total Sequences: %d' % len(sequences))
    return sequences

# create sequences
sequences = create_seq(data_new)
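To see what these sliding windows look like, here is a small sketch of the same slicing on a short string with a window of 3 instead of 30. The helper name `make_windows` and the toy text are assumptions for illustration only.

```python
def make_windows(text, length):
    """Return every substring of length + 1: `length` input chars plus one target."""
    return [text[i - length:i + 1] for i in range(length, len(text))]


toy_text = "language"
windows = make_windows(toy_text, length=3)
for window in windows:
    # The last character is the prediction target; the rest is the input.
    print(repr(window[:-1]), "->", repr(window[-1]))
```

A text of k characters yields k - length windows, so the 8-character string here gives 5 input/target pairs.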

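Next, convert the character sequences to integer IDs. The cleaned text contains only lowercase letters and spaces, so the mapping has at most 27 entries (26 letters plus the space character):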

# create a character mapping index
chars = sorted(list(set(data_new)))
mapping = dict((c, i) for i, c in enumerate(chars))

def encode_seq(seq):
    sequences = list()
    for line in seq:
        # integer encode line
        encoded_seq = [mapping[char] for char in line]
        # store
        sequences.append(encoded_seq)
    return sequences

# encode the sequences
sequences = encode_seq(sequences)

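Each input sequence is now a list of 31 integer IDs in the range 0 to len(mapping) - 1. Next, split the data into training and validation sets, with 10% held out for validation: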

from sklearn.model_selection import train_test_split

# vocabulary size
vocab = len(mapping)
sequences = np.array(sequences)
# create X (the first 30 ids) and y (the 31st id)
X, y = sequences[:,:-1], sequences[:,-1]
# one hot encode y
y = to_categorical(y, num_classes=vocab)
# create train and validation sets
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=42)

print('Train shape:', X_tr.shape, 'Val shape:', X_val.shape)

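The output is Train shape: (6345, 30) Val shape: (706, 30). Now build the model, which consists of three simple layers: an embedding layer of dimension 50, a GRU layer with 150 hidden units, and a fully connected layer with softmax activation: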

# define model
model = Sequential()
model.add(Embedding(vocab, 50, input_length=30, trainable=True))
model.add(GRU(150, recurrent_dropout=0.1, dropout=0.1))
model.add(Dense(vocab, activation='softmax'))
# summary() prints the table itself, so no print() is needed
model.summary()

# compile the model
model.compile(loss='categorical_crossentropy', metrics=['acc'], optimizer='adam')
# fit the model
model.fit(X_tr, y_tr, epochs=100, verbose=2, validation_data=(X_val, y_val))

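The loss and accuracy at the start of training and after they stabilize are shown below:

[Figure: training log with loss and accuracy at the start and at the end of training]

Once the model is trained, it can generate the following characters from a given seed text: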

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # reverse mapping from integer id to character
    id_to_char = {index: char for char, index in mapping.items()}
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # predict character; predict_classes is no longer available in recent
        # TensorFlow releases, so take the argmax of the predicted distribution
        yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])
        # reverse map integer to character and append to input
        in_text += id_to_char[yhat]
    return in_text

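Note one issue: before a sequence is fed to the model, padding values are added at its front when it is shorter than the expected length, which may not match what the model saw during training. In practice, generation with a GRU should handle variable-length sequences, stepping through the GRU one character at a time in a loop. The following inference examples show how the trained model behaves:

[Figure: inference examples for several seed texts]

The model turns out to be very sensitive: swapping the preposition "of" for "for", or adding a single space, produces a different continuation. In addition, the generated sentences do not appear in the training corpus, which suggests that the model picked up some of the regularities of English during training rather than memorizing the text.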
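On the padding issue mentioned above, it helps to see what front-padding and front-truncation do to a short seed. The sketch below imitates the default behavior of `pad_sequences` (pad and truncate at the front, fill value 0) in plain Python; `left_pad` and the tiny mapping are hypothetical stand-ins, not the Keras implementation.

```python
def left_pad(ids, maxlen, value=0):
    """Keep the last `maxlen` ids and fill missing positions on the left with `value`."""
    ids = list(ids)[-maxlen:]
    return [value] * (maxlen - len(ids)) + ids


# Hypothetical mapping in which the space character has id 0, which is what
# happens when the characters are sorted and the text contains spaces.
toy_mapping = {" ": 0, "a": 1, "b": 2, "c": 3}
encoded = [toy_mapping[ch] for ch in "abc"]

padded = left_pad(encoded, maxlen=6)
truncated = left_pad(encoded, maxlen=2)
print(padded)
print(truncated)
```

Under this assumed mapping, the padding value 0 is indistinguishable from a real space, so a short seed looks to the network like text preceded by a run of spaces, a pattern that never occurs in the cleaned training text.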

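Natural Language Generation with GPT-2

In February 2019, OpenAI released GPT-2, a Transformer-based language model trained on a large corpus. GPT-2 is a generative language model built on the Transformer decoder and trained on 40 GB of internet text. For an introduction to GPT-2, see: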

OpenAI’s GPT-2: A Simple Guide to Build the World’s Most Advanced Text Generator in Python

Below we use GPT-2 through PyTorch-Transformers, a library that provides many state-of-the-art pretrained models. The article recommends running the example code on Google Colab:

# Import required libraries
# (pytorch_transformers was later renamed to transformers; the two class
# names used here are unchanged in that package)
import torch
from pytorch_transformers import GPT2Tokenizer, GPT2LMHeadModel

# Load pre-trained model tokenizer (vocabulary)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Encode a text input
text = "What is the fastest car in the"
indexed_tokens = tokenizer.encode(text)

# Convert indexed tokens to a PyTorch tensor
tokens_tensor = torch.tensor([indexed_tokens])

# Load pre-trained model (weights)
model = GPT2LMHeadModel.from_pretrained('gpt2')

# Set the model in evaluation mode to deactivate the DropOut modules
model.eval()

# If a GPU is available, put everything on cuda; otherwise stay on the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokens_tensor = tokens_tensor.to(device)
model.to(device)

# Predict all tokens
with torch.no_grad():
    outputs = model(tokens_tensor)
    predictions = outputs[0]

# Get the predicted next sub-word
predicted_index = torch.argmax(predictions[0, -1, :]).item()
predicted_text = tokenizer.decode(indexed_tokens + [predicted_index])

# Print the predicted word
print(predicted_text)

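The code uses the pretrained GPT-2 model to predict the missing word in "What is the fastest car in the __". Google's suggested completion is "world". The model's prediction is:

[Figure: model output]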

The result matches Google's query suggestion, which shows how capable GPT-2 is.
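The indexing `predictions[0, -1, :]` in the code above picks the scores for the last position of the first (and only) sequence in the batch, and the argmax over them is the greedy next-token choice. The NumPy sketch below shows the same indexing on made-up numbers; the four-word vocabulary and the logits are purely illustrative and have nothing to do with GPT-2's real vocabulary or outputs.

```python
import numpy as np

# Hypothetical vocabulary and logits of shape (batch, sequence_length, vocab_size).
toy_vocab = ["the", "world", "car", "fast"]
logits = np.array([[
    [2.0, 0.1, 0.3, 0.2],  # scores for the token following position 1
    [0.1, 0.2, 3.0, 0.4],  # scores for the token following position 2
    [0.3, 4.0, 0.5, 0.1],  # scores for the token following the last position
]])

# Only the last position matters for predicting the next token.
last = logits[0, -1, :]

# Softmax turns the scores into a probability distribution.
probs = np.exp(last - last.max())
probs /= probs.sum()

# Greedy decoding: take the highest-scoring token.
predicted_index = int(np.argmax(last))
print(toy_vocab[predicted_index], float(probs[predicted_index]))
```

Greedy argmax always returns the single most likely token; generating longer text, as in the next example, typically repeats this step and often samples from the distribution instead.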

So far GPT-2 has been used to predict the next word given a context. Next, let's see how well it can generate a longer passage from a given prompt. The prompt is:

Two roads diverged in a yellow wood,
And sorry I could not travel both
And be one traveler, long I stood
And looked down one as far as I could
To where it bent in the undergrowth;

This is the first stanza of the poem "The Road Not Taken". Below, a ready-made script from PyTorch-Transformers generates the continuation. Run the following commands directly in Google Colab:

!git clone https://github.com/huggingface/pytorch-transformers.git

!python pytorch-transformers/examples/pytorch/text-generation/run_generation.py \
    --model_type=gpt2 \
    --length=100 \
    --model_name_or_path=gpt2

Note that this command differs from the one in the original article because the repository layout has changed. The final output is:

Two roads diverged in a yellow wood, And sorry I could not travel both And be one traveler, long I stood And looked down one as far as I could To where it bent in the undergrowth; He was no man who could resist, Nor had any one in his own reach To laugh at his vanity; The only thing which could lull him, for he could not remember, The complete unspoken confession of the suffering of Mr. Ford’s face, and his mournful agony. And glad I saw them yet. In a darkness beneath the fell moon I at last heard the sigh, Until the moon lifted from the earth’s right bethought. Every man rushes before death, When death brings about

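The result reads fairly smoothly and fits the mood of the first stanza, which shows how much capacity GPT-2 has (and the later GPT-3 is more capable still).

Conclusion

This concludes a comprehensive tour of LMs: we discussed what a language model is and how to use one with up-to-date NLP frameworks. The results are impressive!

My Summary

This article really does build an understanding of language models step by step, and it shows concretely how to build both a statistical LM and a neural LM by hand. I will follow up with more detailed notes on some common LMs.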

Original: https://blog.csdn.net/qq_36891953/article/details/121559025
Author: RUCblake
Title: Language Models (LM): Introduction and Hands-On Practice

