n-gram Language Models (LM)

Thanks for reading. The author's knowledge is limited, so please point out any mistakes!

Why do we need a language model?

For continuous speech recognition there may be tens of thousands of words, the decoding process is complex, and the hypotheses admit a huge number of word combinations. The acoustic model alone is not enough; a language model is needed to constrain the recognition results.

Statistical language models

A statistical language model consists of a finite set $V$ and a function $p(x_1, x_2, \ldots, x_n)$.

A statistical language model is a probability distribution over all word sequences. What is it good for?

  • It gives us the probability of any word sequence, which helps decide which word sequence is more likely.

  • Given a word sequence, it can predict the most likely next word.

Given a word sequence $S = (w_1, \ldots, w_n)$, its probability can be written as:
$$
\begin{aligned}
P(S) &= P(x_1 = w_1, \ldots, x_n = w_n) = P(w_1^n) \\
&= P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) \\
&= \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})
\end{aligned}
$$
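For example, for the three-word sentence "it is going", the chain rule gives:
$$P(\text{it is going}) = P(\text{it})\, P(\text{is} \mid \text{it})\, P(\text{going} \mid \text{it is})$$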

How do we obtain these probabilities? By counting in a corpus.

N-gram language models and their evaluation

An example: It is going to be a fine day
$$P(\text{day} \mid \text{It is going to be a fine}) = \frac{C(\text{It is going to be a fine day})}{C(\text{It is going to be a fine})}$$
Problem: the longer the history, the harder it is to find an exactly matching sequence in the corpus.

Introduce the Markov assumption: the probability of any word depends only on the limited $n-1$ words that precede it, so the full history can be approximated by the most recent few history words.

N-gram: use the previous $N-1$ words as the history when estimating the current ($N$-th) word.
$$P(w_i \mid w_1^{i-1}) = P(w_i \mid w_{i-N+1}^{i-1})$$
e.g. 2-gram (bigram): $P(w_i \mid w_1^{i-1}) = P(w_i \mid w_{i-1})$

How do we estimate the N-gram probabilities? With maximum likelihood estimation, which amounts to counting on the training corpus.
$$P(w_i \mid w_{i-N+1}^{i-1}) = \frac{C(w_{i-N+1}^{i-1} w_i)}{C(w_{i-N+1}^{i-1})}$$
bigram: $P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}$

How are sentence boundaries and unknown words handled?
Mark the beginning and end of every sentence with <s> and </s>, and map unknown words to <UNK>.
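A minimal sketch of bigram MLE estimation on a toy corpus (the corpus and the frequency cutoff for <UNK> are made up for illustration):

```python
from collections import Counter

# Toy corpus; each sentence is padded with <s> and </s>.
corpus = [
    "it is going to be a fine day",
    "it is a fine day",
    "it is going to rain",
]
sentences = [["<s>"] + s.split() + ["</s>"] for s in corpus]

# Words seen only once are mapped to <UNK> (an arbitrary cutoff for illustration).
word_freq = Counter(w for s in sentences for w in s)
sentences = [[w if word_freq[w] > 1 else "<UNK>" for w in s] for s in sentences]

unigram = Counter(w for s in sentences for w in s)
bigram = Counter((s[i], s[i + 1]) for s in sentences for i in range(len(s) - 1))

def p_mle(w, prev):
    """Maximum-likelihood bigram estimate P(w | prev) = C(prev w) / C(prev)."""
    return bigram[(prev, w)] / unigram[prev]

print(p_mle("is", "it"))     # 1.0  ("it" is always followed by "is")
print(p_mle("going", "is"))  # 0.666...  (2 of the 3 occurrences of "is" are followed by "going")
```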

Evaluation methods: 1. test in the target application; 2. perplexity.

On a test set $W = w_1 w_2 \ldots w_N$, perplexity is the test-set probability normalized by the number of words:
$$
\begin{aligned}
PP(W) &= P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} \\
&= \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}} \\
&= \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-N+1}^{i-1})}}
\end{aligned}
$$

The better the model fits the sentences, the higher the probability and the lower the perplexity, i.e. the less the model is "confused" by the sentences. Perplexity can be read as the size of a virtual vocabulary from which the next word is chosen: the larger the value, the more choices there are and the harder it is to pick the right word, indicating a worse language model; the smaller the value, the fewer the choices and the better the chance of picking correctly, indicating a better language model.
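Continuing the toy bigram sketch above, per-sentence perplexity could be computed roughly as follows (illustrative only; with an unsmoothed model any unseen bigram would make the probability zero):

```python
import math

def perplexity(sentence, p):
    """Perplexity of one <s>/</s>-padded sentence under a conditional model p(w, prev)."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    n = 0
    for prev, w in zip(words, words[1:]):
        log_prob += math.log(p(w, prev))   # fails if p(w, prev) == 0 -> smoothing is needed
        n += 1
    return math.exp(-log_prob / n)

print(perplexity("it is a fine day", p_mle))  # reuses p_mle from the sketch above
```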

Where perplexity comes from:

First, a few concepts:

Entropy: the expected self-information; it quantifies the uncertainty of a random variable. The larger the entropy, the greater the uncertainty and the smaller the chance of guessing its value correctly; a more uncertain random variable requires more information to determine its value.
$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$
where $p(x)$ is the probability of $x$.

Relative entropy: also known as the KL divergence (KL distance), it measures how far apart two probability distributions over the same event space are. When $p = q$ the relative entropy is 0; the more $p$ and $q$ differ, the larger it becomes.
$$D(p \| q) = \sum_{x} p(x) \log_2 \frac{p(x)}{q(x)}$$

Cross-entropy: measures the difference between an estimated model and the true probability distribution.
$$H(X, q) = H(X) + D(p \| q) = -\sum_{x} p(x) \log_2 q(x)$$

Cross-entropy is the entropy of the true distribution plus the KL divergence. The entropy of the true distribution is fixed and does not depend on the model parameters, so when taking gradients for gradient descent, $\nabla H(p, q) = \nabla D_{KL}(p \| q)$, i.e. minimizing the cross-entropy is the same as minimizing the KL divergence.

Perplexity: perplexity is the exponential form of the cross-entropy.
$$PPL = 2^{H(X, q)}$$
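A small numeric sketch of these definitions on made-up distributions (both `p` and `q` below are arbitrary toy values):

```python
import math

p = {"a": 0.5, "b": 0.25, "c": 0.25}   # "true" distribution (made up)
q = {"a": 0.4, "b": 0.4,  "c": 0.2}    # model distribution (made up)

entropy   = -sum(p[x] * math.log2(p[x]) for x in p)         # H(p)
cross_ent = -sum(p[x] * math.log2(q[x]) for x in p)         # H(p, q)
kl        =  sum(p[x] * math.log2(p[x] / q[x]) for x in p)  # D(p || q)

assert abs(cross_ent - (entropy + kl)) < 1e-12              # H(p, q) = H(p) + D(p || q)
ppl = 2 ** cross_ent                                        # PPL = 2^H(p, q)
print(entropy, cross_ent, kl, ppl)
```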

Some language-model toolkits report both PPL and PPL1:
PPL: normalizes by the number of words plus the number of sentences (i.e. the end-of-sentence tokens </s> are counted).
PPL1: normalizes by the number of words only.

Because the corpus is sparse, some word sequences are never observed, so their probability is 0 and they can never be recognized correctly at test time. Smoothing algorithms were proposed to solve this problem.

Smoothing algorithms

The idea: reassign part of the probability mass of seen events to unseen events.

Add-one (Laplace) smoothing. Intuition: add one to every count, so that every word sequence has a nonzero count.

This turns outcomes that originally had probability 0 into a small value. To keep the probabilities of all outcomes summing to 1, the denominator is increased by the number of event types $V$, i.e. the vocabulary size.

Token: a word occurrence in the corpus (obtained by counting).

e.g. 1-gram (unigram): $P(w_i) = \frac{C(w_i)}{N}$, where $N$ is the total number of tokens. With add-one smoothing: $P_{laplace}(w_i) = \frac{C(w_i)+1}{N+V}$

bigram: $P_{laplace}(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)+1}{C(w_{i-1})+V}$
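A minimal add-one sketch, reusing the hypothetical `unigram` and `bigram` counters from the toy MLE example above:

```python
# Add-one smoothed bigram probability, reusing the toy counters above.
V = len(unigram)  # vocabulary size: number of word types, incl. <s>, </s>, <UNK>

def p_laplace(w, prev):
    """P_laplace(w | prev) = (C(prev w) + 1) / (C(prev) + V)."""
    return (bigram[(prev, w)] + 1) / (unigram[prev] + V)

print(p_laplace("fine", "it"))  # an unseen bigram now gets a small nonzero probability
```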

Two more concepts (write the count of word $w_i$ as $c_i$):

  1. Adjusted count $c^*$: describes the effect of a smoothing algorithm on the numerator only.
  2. Discount ratio $d_c$: the ratio of the discounted count to the original count.

Good-Turing discounting. In practice it is rarely used on its own; it is usually combined with other smoothing algorithms.

Intuition: estimate the events never seen from the events seen once, and by analogy estimate the events seen once from the events seen twice, and so on.

Let $N_c$ denote the number of n-grams that occur exactly $c$ times: $N_c = \sum_{x:\, \mathrm{count}(x)=c} 1$

For any n-gram that occurs $c$ times, assume it actually occurred $c^*$ times: $c^* = (c+1) \frac{N_{c+1}}{N_c}$

The resulting probability estimates are:

$$P_{GT}^{*}(\text{unseen events}) = \frac{N_1}{N}$$

$$P_{GT}^{*}(\text{event seen } c \text{ times}) = \frac{c^*}{N}$$
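A rough sketch of the basic Good-Turing computation on made-up bigram counts (no smoothing of $N_c$, so it only works when $N_{c+1} > 0$):

```python
from collections import Counter

# Made-up bigram counts for illustration.
counts = {"a b": 3, "b c": 3, "c d": 2, "d e": 1, "e f": 1, "f g": 1}
N = sum(counts.values())       # total number of observed tokens
Nc = Counter(counts.values())  # Nc[c] = number of n-grams seen exactly c times

def adjusted_count(c):
    """Good-Turing adjusted count c* = (c + 1) * N_{c+1} / N_c."""
    return (c + 1) * Nc[c + 1] / Nc[c]

p_unseen = Nc[1] / N                 # total probability mass given to unseen events
p_seen_once = adjusted_count(1) / N  # probability of an n-gram seen exactly once
print(p_unseen, p_seen_once)
```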

Question: what if some n-gram occurs 233 times in the corpus, but no n-gram occurs 234 times (so $N_{234} = 0$)?

Method 1: for $c > k$ (e.g. $k = 5$), smooth the $N_c$ values themselves with $N_c \approx \alpha c^{\beta}$, where $\alpha$ and $\beta$ are parameters.

Method 2: for $c > k$, treat the count as reliable and do not discount it [Katz 1987]:
$$
c^* =
\begin{cases}
c & c > k \\
\dfrac{(c+1)\frac{N_{c+1}}{N_c} - c\,\frac{(k+1) N_{k+1}}{N_1}}{1 - \frac{(k+1) N_{k+1}}{N_1}} & 1 \le c \le k
\end{cases}
$$

Interpolation. Mix the estimates of all the N-gram orders. For example, to improve a trigram model, interpolate the trigram, bigram, and unigram estimates:
$$
\begin{aligned}
P_{ML}(w_i \mid w_{i-2} w_{i-1}) &= \frac{C(w_{i-2} w_{i-1} w_i)}{C(w_{i-2} w_{i-1})} \\
P_{ML}(w_i \mid w_{i-1}) &= \frac{C(w_{i-1} w_i)}{C(w_{i-1})} \\
P_{ML}(w_i) &= \frac{C(w_i)}{N} \\
P_{\text{interp}}(w_i \mid w_{i-2} w_{i-1}) &= \lambda_1 P_{ML}(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P_{ML}(w_i \mid w_{i-1}) + \lambda_3 P_{ML}(w_i) \\
\lambda_1 + \lambda_2 + \lambda_3 &= 1
\end{aligned}
$$
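A minimal sketch of linear interpolation (the lambda weights below are made up; in practice they are tuned on held-out data, e.g. with EM):

```python
# Interpolated trigram probability; the three MLE estimates and the lambda
# weights are hypothetical inputs here.
def p_interp(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """lambda1*P(w|w2,w1) + lambda2*P(w|w1) + lambda3*P(w), weights summing to 1."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9
    return l1 * p_tri + l2 * p_bi + l3 * p_uni

print(p_interp(0.0, 0.2, 0.05))  # an unseen trigram still gets nonzero probability
```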

Katz back-off. Idea: if the order-$N$ n-gram exists (its count is nonzero), use its discounted probability directly (the discount is usually computed with Good-Turing); if it does not exist, distribute the probability mass saved by discounting according to the order-$(N-1)$ model, and so on recursively.
$$
P_{katz}(w_i \mid w_{i-N+1}^{i-1}) =
\begin{cases}
P^{*}(w_i \mid w_{i-N+1}^{i-1}) & C(w_{i-N+1}^{i}) > 0 \\
\alpha(w_{i-N+1}^{i-1})\, P_{katz}(w_i \mid w_{i-N+2}^{i-1}) & \text{otherwise}
\end{cases}
$$
where $\alpha(w_{i-N+1}^{i-1})$ is a normalization coefficient.
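A bigram-to-unigram sketch of the back-off recursion; the discounted probabilities `p_star_bigram` and the back-off weights `alpha` are hypothetical inputs that would normally come from Good-Turing discounting:

```python
# Hypothetical discounted bigram probabilities P*(w | prev) and back-off weights alpha(prev).
p_star_bigram = {("it", "is"): 0.8, ("is", "going"): 0.5}
p_unigram = {"it": 0.2, "is": 0.2, "going": 0.1, "rain": 0.05}
alpha = {"it": 0.4, "is": 0.6}  # probability mass left over after discounting, per context

def p_katz(w, prev):
    """Back off to the lower-order estimate when the bigram was never observed."""
    if (prev, w) in p_star_bigram:
        return p_star_bigram[(prev, w)]
    return alpha.get(prev, 1.0) * p_unigram.get(w, 0.0)

print(p_katz("is", "it"))    # seen bigram: use the discounted probability
print(p_katz("rain", "it"))  # unseen bigram: alpha("it") * P(rain)
```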

Kneser-Ney smoothing. Idea: the more distinct contexts a word appears in in the corpus, the higher its probability should be.

Define the continuation probability:
$$P_{\text{continuation}}(w_i) = \frac{\left|\{w_{i-1} : C(w_{i-1} w_i) > 0\}\right|}{\sum_{w_i} \left|\{w_{i-1} : C(w_{i-1} w_i) > 0\}\right|}$$

Back-off form:
$$
P_{KN}(w_i \mid w_{i-1}) =
\begin{cases}
\dfrac{C(w_{i-1} w_i) - D}{C(w_{i-1})} & C(w_{i-1} w_i) > 0 \\
\alpha(w_i)\, P_{\text{continuation}}(w_i) & \text{otherwise}
\end{cases}
$$
Interpolated form:
$$
\begin{aligned}
P_{KN}(w_i \mid w_{i-1}) &= \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})} + \beta(w_i)\, P_{\text{continuation}}(w_i) \\
&= \frac{C(w_{i-1} w_i) - D}{C(w_{i-1})} + \beta(w_i)\, \frac{\left|\{w_{i-1} : C(w_{i-1} w_i) > 0\}\right|}{\sum_{w_i} \left|\{w_{i-1} : C(w_{i-1} w_i) > 0\}\right|}
\end{aligned}
$$
The interpolated form generally works better.
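A sketch of the continuation count, reusing the hypothetical `bigram` counter from the toy MLE example above:

```python
from collections import Counter

# Continuation counts: in how many distinct left contexts does each word appear?
# `bigram` is the toy bigram Counter from the MLE sketch above; iterating over it
# yields each distinct (prev, w) pair exactly once.
continuation = Counter(w for (prev, w) in bigram)  # |{prev : C(prev w) > 0}|
total_context_pairs = len(bigram)                  # sum of the above over all words

def p_continuation(w):
    return continuation[w] / total_context_pairs

print(p_continuation("day"))
```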

Storage format for n-gram language models: the ARPA format and toolkits

ARPA is the standard storage format for N-gram models. It is an ASCII file in which each section header is followed by a list of all n-grams with nonzero probability. Each n-gram entry contains, in order: the discounted log probability (base-10), the word sequence, and the back-off weight (base-10), e.g. $\log_{10} P(w_i \mid w_{i-1})$  $w_{i-1} w_i$  $\log_{10} \alpha(w_{i-1} w_i)$.
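An illustrative fragment of an ARPA file for a tiny bigram model (the probabilities below are made up; only the layout follows the convention described above):

```
\data\
ngram 1=4
ngram 2=3

\1-grams:
-0.8451  </s>
-99      <s>    -0.3010
-0.6021  fine   -0.2218
-0.5229  day    -0.1761

\2-grams:
-0.3010  <s> fine
-0.1249  fine day
-0.2218  day </s>

\end\
```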

The highest-order n-grams, and n-grams of any order that end with </s>, have no back-off weight.
Both interpolated and back-off models can be stored in this format.

Original: https://blog.csdn.net/weixin_39529413/article/details/117604864
Author: 栋次大次
Title: n-gram语言模型LM
