Speech Recognition System Architecture

Speech signal processing
Based on the auditory perception characteristics of the human ear, the most important features of the speech are extracted, and the speech signal is converted into a sequence of feature vectors.
Acoustic features (see the extraction sketch after this list)
Linear Predictive Coding (LPC)
Mel-Frequency Cepstral Coefficients (MFCC)
Mel-scale Filter Bank (FBank)
Perceptual Linear Prediction (PLP)
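As an illustration of this feature-extraction step, below is a minimal sketch that computes FBank and MFCC features with the librosa library; the library choice, the file name, and the framing parameters are assumptions for illustration, not part of the original text.

```python
# Minimal sketch: extracting FBank and MFCC features (librosa is assumed here;
# Kaldi, torchaudio, or python_speech_features follow the same idea).
import numpy as np
import librosa

signal, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file

n_fft = int(0.025 * sr)    # 25 ms analysis window, a common ASR choice
hop = int(0.010 * sr)      # 10 ms frame shift

# FBank: log energies of a mel-scale filter bank applied to each frame.
mel_spec = librosa.feature.melspectrogram(y=signal, sr=sr, n_fft=n_fft,
                                          hop_length=hop, n_mels=40)
fbank = np.log(mel_spec + 1e-10)                    # shape: (40, num_frames)

# MFCC: a DCT of the log mel energies, keeping the first 13 coefficients.
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                            n_fft=n_fft, hop_length=hop)
print(fbank.shape, mfcc.shape)
```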
Decoder
The decoder uses the lexicon, the acoustic model, and the language model to convert the input sequence of speech feature vectors into a character sequence.
Acoustic model
A knowledge representation of acoustics, phonetics, environmental variables, and speaker differences such as gender and accent.

Two difficulties are the variable length of the feature vector sequences and the rich variability of the audio signal.
The variable-length feature vector sequence problem is classically solved with Dynamic Time Warping (DTW) or Hidden Markov Model (HMM) methods.
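As a rough illustration of the DTW idea (aligning two feature sequences of different lengths), here is a minimal pure-NumPy sketch; the toy data and the Euclidean local distance are assumptions for illustration.

```python
# Minimal sketch of Dynamic Time Warping between two feature sequences.
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """a: (T1, D), b: (T2, D) feature matrices; returns the DTW alignment cost."""
    T1, T2 = len(a), len(b)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])      # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],          # insertion
                                 cost[i, j - 1],          # deletion
                                 cost[i - 1, j - 1])      # match
    return cost[T1, T2]

# Toy usage: two "utterances" of different lengths over 13-dim features.
x = np.random.randn(50, 13)
y = np.random.randn(70, 13)
print(dtw_distance(x, y))
```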
The richness and variability of the audio signal comes from the speaker's complex characteristics, speaking style, and speaking rate, as well as from environmental noise, channel interference, and dialect differences; the acoustic model needs to be robust enough to handle all of these conditions.
Traditional acoustic models
GMM
The HMM models the temporal structure; given a state of the HMM, a GMM models the probability distribution of the speech feature vectors belonging to that state.
The most distinctive property of a Gaussian mixture model is its multimodality, which lets it describe large amounts of physical data with multimodal behavior, such as speech data, for which a single Gaussian distribution is not suitable. The multimodality of the data may come from several underlying factors, each of which determines a particular component of the mixture distribution. If the factors behind the mixture distribution can be identified, the mixture can be decomposed into a set of independent distributions over those factors.
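Below is a minimal sketch of the GMM-per-state idea, using scikit-learn's GaussianMixture; the library, the 39-dimensional features, and the 8 mixture components are illustrative assumptions.

```python
# Minimal sketch: modelling the feature distribution of one HMM state with a GMM.
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical training data: feature frames aligned to one HMM state.
frames_for_state = np.random.randn(1000, 39)   # e.g. 39-dim MFCC + deltas

# Several Gaussian components capture a multimodal distribution.
gmm = GaussianMixture(n_components=8, covariance_type="diag")
gmm.fit(frames_for_state)

# Acoustic score of a new frame: log p(x | state), used inside HMM decoding.
new_frame = np.random.randn(1, 39)
log_likelihood = gmm.score_samples(new_frame)[0]
print(log_likelihood)
```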

HMM
CD-DNN-HMM
To obtain a better performance gain, the DNN model takes context information (i.e., the preceding and following feature frames) as input, so it is called the CD-DNN-HMM (Context-Dependent DNN-HMM) model.
Compared with the GMM, the DNN has some clear advantages. First, the DNN is a discriminative model and is inherently discriminative, so it can better separate the target labels. Second, the DNN performs very well on large data sets: as the amount of data keeps growing, the GMM saturates at around 2,000 hours, while the DNN still improves when the data grows beyond 10,000 hours.
The DNN is also more robust to environmental noise; with noise-added (multi-condition) training, its recognition performance in complex environments can even exceed that of a GMM combined with a speech enhancement front end.
The DNN also has some interesting properties. For example, up to a point, performance keeps improving as the network gets deeper, indicating that a deeper DNN extracts more expressive features that are better suited for classification. Exploiting this property, one can extract bottleneck features from a DNN and then train a GMM-HMM on them, achieving recognition accuracy comparable to the DNN model itself.
Applying DNNs to speech recognition brought very clear gains, and this success encouraged practitioners to keep bringing new deep learning tools into speech recognition, from CNNs to RNNs to RNNs combined with CTC; along the way, recognition performance has kept improving.
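For illustration, here is a minimal sketch of the hybrid DNN acoustic model described above: a feed-forward network that maps a spliced window of feature frames to senone posteriors. PyTorch and all layer sizes are assumptions, not a reference implementation.

```python
# Minimal sketch of a hybrid (CD-)DNN-HMM acoustic model.
import torch
import torch.nn as nn

class DnnAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, context=5, num_senones=3000):
        super().__init__()
        in_dim = feat_dim * (2 * context + 1)   # current frame + context frames
        self.net = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_senones),        # one output per tied HMM state
        )

    def forward(self, x):
        # x: (batch, in_dim) spliced frames; output: senone log-posteriors
        return torch.log_softmax(self.net(x), dim=-1)

model = DnnAcousticModel()
spliced = torch.randn(8, 40 * 11)               # batch of 8 spliced frames
log_post = model(spliced)                       # (8, 3000) log P(senone | frame)
print(log_post.shape)
```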
The task of the acoustic model is to compute P(X|W), i.e., the probability of producing this speech given the text.
The first question is: how do we know how each word should be pronounced? This requires another module, the lexicon, whose job is to convert a word string into a phoneme string. The lexicon is usually regarded as a module on a par with the acoustic model and the language model; in the lexicon you will also run into words with more than one pronunciation.
With the help of the lexicon, the acoustic model knows which phonemes should be pronounced, in order, for a given text string. However, to compute the degree of match between the speech and the phoneme string, it is also necessary to know where each phoneme begins and ends. This is done with a dynamic programming algorithm, which can efficiently find the phoneme boundary points such that the product of the match scores (expressed as probabilities) between each speech segment and its phoneme is maximized.

The algorithm actually used is the Viterbi algorithm: it considers not only the match between each speech segment and its phoneme but also the transition probabilities between phonemes; the latter are estimated with a Hidden Markov Model (HMM).
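Here is a minimal sketch of Viterbi decoding over HMM states in the log domain; the emission scores stand in for the frame/phoneme match scores mentioned above, and the toy dimensions are assumptions.

```python
# Minimal sketch of the Viterbi algorithm over HMM states; pure NumPy.
import numpy as np

def viterbi(log_emit, log_trans, log_init):
    """log_emit: (T, S) frame/state scores; log_trans: (S, S); log_init: (S,)."""
    T, S = log_emit.shape
    delta = np.full((T, S), -np.inf)     # best score ending in state s at time t
    back = np.zeros((T, S), dtype=int)   # backpointers for recovering the path
    delta[0] = log_init + log_emit[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans      # (prev state, cur state)
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_emit[t]
    # Trace back the most probable state sequence (i.e., the segment boundaries).
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], delta[-1].max()

# Toy usage with 3 states and 10 frames of random scores.
T, S = 10, 3
path, score = viterbi(np.log(np.random.rand(T, S)),
                      np.log(np.random.rand(S, S)),
                      np.log(np.ones(S) / S))
print(path, score)
```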
Hidden Markov Model (HMM)
It models the posterior probability of the speech feature vector sequence reaching a given state.
A probabilistic graphical model used to represent the dependencies within a sequence.
A weighted directed graph in which each node is called a state.

At each time step, the HMM jumps from one state to another with a certain probability and emits an observation symbol with a certain probability; the jump probability is represented by the weight on the edge.
The HMM assumes that each state transition depends only on the previous state and is independent of all earlier states.
That is, under the Markov assumption, the symbol emitted in each state depends only on the current state and is independent of other states and other symbols; this is the independent-output assumption.

A hidden Markov model produces two random sequences: one is the state sequence and the other is the sequence of observation symbols, so it is a doubly stochastic process; however, only the observation symbol sequence can be observed from the outside, while the state sequence cannot.

The Viterbi algorithm is used to find the state sequence with the highest probability given an observation symbol sequence.
The probability of a given observation symbol sequence can be computed efficiently with the Forward-Backward algorithm.
The transition probabilities and the observation symbol emission probabilities of each state can be estimated with the Baum-Welch algorithm.
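For illustration, here is a minimal sketch of the forward pass (the "forward" half of the Forward-Backward algorithm) that computes the probability of a discrete observation sequence under a toy HMM; the numbers are made up.

```python
# Minimal sketch of the forward algorithm for a discrete-observation HMM.
import numpy as np

def forward(obs, trans, emit, init):
    """obs: observation symbol indices; trans: (S, S); emit: (S, V); init: (S,)."""
    alpha = init * emit[:, obs[0]]                # P(state, first symbol)
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]      # propagate, then emit
    return alpha.sum()                            # P(observation sequence)

# Toy HMM: 2 states, 3 observation symbols.
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
init = np.array([0.6, 0.4])
print(forward([0, 1, 2, 2], trans, emit, init))
```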
The HMM models the relationship between acoustic units and the speech feature sequence.
If smaller acoustic units are chosen, the number of distinct units is smaller, but they are more sensitive to context.
Large-vocabulary continuous speech recognition systems generally use sub-word units as the acoustic units.
In English, phonemes are used.
In Chinese, initials and finals (shengmu and yunmu) are used.
Because of co-articulation in continuous speech, each phoneme has to be considered jointly with the phonemes before and after it; this is called the triphone model. Introducing triphones causes a sharp increase in the number of hidden Markov models, so the states are usually clustered; the clustered states are called senones.
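A small illustrative sketch of expanding a monophone sequence into context-dependent triphone labels of the common "left-center+right" form; the phone names are made up, and real systems then tie the triphone states into senones with decision trees.

```python
# Expand a monophone sequence into triphone labels (illustrative only).
def to_triphones(phones, boundary="sil"):
    padded = [boundary] + phones + [boundary]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["b", "a", "t"]))
# ['sil-b+a', 'b-a+t', 'a-t+sil']
```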
In speech recognition the acoustic feature vectors take continuous values; to avoid a quantization step, a continuous probability density function is used to model the probability of a feature vector given a state. Gaussian Mixture Models (GMMs) can approximate arbitrary probability density functions, so they became the first choice for this modeling.
Li Deng and others introduced deep learning into the acoustic modeling of speech recognition, using it to model the relationship between acoustic feature vectors and states, which greatly improved recognition accuracy. Since then, the application of deep learning to acoustic modeling for speech recognition has flourished, for example Recurrent Neural Networks (RNNs), which exploit the context of the acoustic feature vector sequence, and their special case, Long Short-Term Memory (LSTM) networks.
Real systems use units smaller than phonemes, but the principle is the same.
Both when finding the phoneme boundary points and when computing P(X|W) after the boundaries are known, the acoustic model needs to know how to compute the degree of match between a phoneme and the speech signal.
To do this, a suitable representation of the speech signal is needed. The speech signal is usually divided into many frames, and each frame is converted, through a series of operations such as the Fourier transform, into a feature vector.
The most commonly used features are MFCCs.
https://www.zhihu.com/question/2726…

From the training data we can extract a large number of feature vectors together with the phonemes they correspond to, and with this data we can train a classifier from features to phonemes.
In earlier years the most common classifier was the Gaussian Mixture Model (GMM). Roughly, it estimates the distribution of the feature vectors of each phoneme; then, in the recognition stage, it computes for each frame the probability P(Xt|Si) that the feature vector Xt is produced by the corresponding phoneme Si, and multiplying the probabilities of all frames gives P(X|W).
Nowadays neural networks have gradually taken over; they directly give P(Si|Xt), which can be converted into P(Xt|Si) with Bayes' rule and then multiplied to obtain P(X|W).
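As a sketch of this Bayes' rule step, the snippet below converts per-frame state posteriors P(Si|Xt) into scaled likelihoods by dividing by the state priors, then sums the log scores over a state path; all numbers are random placeholders.

```python
# Minimal sketch: DNN posteriors -> scaled likelihoods -> log P(X | W).
import numpy as np

posteriors = np.random.dirichlet(np.ones(5), size=100)  # (frames=100, states=5)
state_priors = posteriors.mean(axis=0)                   # priors, e.g. from alignments

# P(x_t | s_i) is proportional to P(s_i | x_t) / P(s_i);
# the constant P(x_t) cancels when comparing word hypotheses.
scaled_loglik = np.log(posteriors) - np.log(state_priors)

# Score of one hypothetical state path (one state index per frame):
state_path = np.random.randint(0, 5, size=100)
log_p_X_given_W = scaled_loglik[np.arange(100), state_path].sum()
print(log_p_X_given_W)
```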
Language model
A knowledge representation of a set of character sequences.
It represents the probability of a word sequence occurring.
For example, if your name is "福贵" and you type "fugui", the output may be "富贵" but will not be "抚跪"; that is the language model at work.
In speech recognition, the purpose of the language model is to give, based on the output of the acoustic model, the word sequence with the maximum probability.
Background of language models
A language model is a probability model built for a language; it describes the probability distribution over the sequences occurring in that language.
Compare "定义机器人时代的大脑引擎,让生活更便捷、更有趣、更安全" with the scrambled "代时人机器定义引擎的大脑,生活让更便捷,有趣更,安更全": the language model will tell you that the first sentence has a higher probability and looks more like a "human sentence".
Language model technology is widely used in products such as speech recognition, OCR, machine translation, and input methods.
In the language modeling process, the choices of lexicon, corpus, and model all have an important effect on the performance of the product.
Building a language model involves simulation and computation with fairly complex model formulas.

Technical difficulties of language models
The performance of a language model depends largely on the quality and quantity of the corpus. A large corpus that matches the specific task is always the most important thing, but in practice such a corpus is often hard to find.
Traditional n-gram modeling handles long-distance dependencies poorly. For example, in the 4-gram model commonly used in industry, the probability of the current word depends only on the three preceding words, so history words farther away have no effect on the probability of the current word.
In addition, the parameter space of n-gram models is too large. Again taking the 4-gram model as an example, if the vocabulary size is V, the parameter space is V^4. In practice V ranges from tens of thousands to millions (V = 10^5 already gives 10^20 parameters), so at such a parameter scale even very large data sets look somewhat sparse.
The neural network language models proposed in recent years solve, to some extent, the problems of the large parameter space and of long-distance dependency. For similar words, their probability estimates smooth one another to some degree, which alleviates the data sparsity problem. The drawback of neural network language models is that training takes too long and queries are slow in practical applications, so hardware acceleration is needed.

Language models generally use the chain rule to decompose the probability of a sentence into a product of the probabilities of the words in it.
Let W consist of w1, w2, w3, …, wn; then
P(W) = P(w1, w2, w3, …, wn) = P(w1)P(w2|w1)P(w3|w1,w2)…P(wn|w1,w2,w3,…,wn−1)
where each factor is the probability of the current word given all of the preceding words.
However, when the conditioning history is too long, the probability is hard to estimate, so the most common practice is to assume that the probability distribution of each word depends only on the last few words of the history. Such a language model is called an n-gram model.
N-gram
It models the probability of N consecutive words occurring together.
It assumes that the probability of a word depends only on the N−1 preceding words.
Suppose there is a word sequence W = (w1, w2, w3, ⋯, wn); its probability can be decomposed as follows:
P(W) = P(w1, w2, w3, …, wn) = P(w1)P(w2|w1)P(w3|w1,w2)…P(wn|w1,w2,w3,…,wn−1)
However, such probabilities cannot be estimated directly. Under the Markov assumption, only the probability conditioned on the previous N−1 words needs to be considered. For N = 2 this becomes
P(W) = P(w1)P(w2|w1)P(w3|w2)…P(wn|wn−1)
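A minimal sketch of this bigram factorization, with a sentence-start token <s> standing in for P(w1); the bigram probabilities are toy values, not estimated from any real corpus.

```python
# Bigram factorization of a sentence probability (toy probabilities only).
import math

p_bigram = {("<s>", "i"): 0.2, ("i", "like"): 0.3, ("like", "speech"): 0.1}

def sentence_logprob(words):
    logp, prev = 0.0, "<s>"
    for w in words:
        logp += math.log(p_bigram.get((prev, w), 1e-12))  # unseen pairs ~ 0
        prev = w
    return logp

print(sentence_logprob(["i", "like", "speech"]))
```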
Markov assumption: the occurrence of the next word depends only on the one or few words immediately before it.
Then, from the definition of conditional probability (Bayes' formula), the probability of a word given the previous word is
P(wn|wn−1) = P(wn, wn−1) / P(wn−1)
Thus it suffices to count, in a large corpus, the probability of adjacent word pairs occurring and then the probability of the individual words.
Because there will inevitably be rare phrases that never appear in the corpus but still have some probability of occurring, we need an algorithm to assign probabilities to these unseen phrases; this is called smoothing.
Commonly used smoothing methods include Good-Turing smoothing and Katz smoothing.
The larger n is in an n-gram model, the more training data is needed. Typical speech recognition systems can manage a trigram (n = 3); Google reportedly can go up to n = 7.
For n = 1, 2, 3, the n-gram model is called a unigram, bigram, and trigram language model, respectively.
Larger n: more constraint information on the next word, hence greater discriminative power.
Smaller n: each n-gram appears more often in the training corpus, hence more reliable statistics and higher reliability.
In theory, the larger n the better; in practice, trigrams are used most. Even so, as a rule of thumb, if a bigram can solve the problem, never use a trigram.
Constructing the language model
Usually the language model is constructed by computing the Maximum Likelihood Estimate, which is the best estimate with respect to the training data; the formula is as follows:
p(wi|wi−1) = count(wi−1, wi) / count(wi−1)
count(X) denotes the number of occurrences in the training corpus; the larger the training corpus, the more reliable the parameter estimates.
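A minimal sketch of this maximum likelihood estimate on a toy corpus (the sentences are made up); it simply counts bigrams and their history words.

```python
# MLE bigram estimation: p(wi | wi-1) = count(wi-1, wi) / count(wi-1).
from collections import Counter

corpus = [["i", "like", "speech"], ["i", "like", "music"], ["you", "like", "speech"]]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    for prev, cur in zip(sent[:-1], sent[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, cur)] += 1

def p_mle(cur, prev):
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

print(p_mle("speech", "like"))   # count(like, speech) / count(like) = 2/3
```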
However, even if the training data is large, say several GB, many linguistic phenomena will still never appear in the training corpus, which leaves many parameters (the probabilities of certain n-grams) at 0.
This problem is known as data sparseness; it can be addressed with data smoothing techniques.
1) Additive smoothing: the basic idea is to avoid zero probabilities by adding a constant δ (0 < δ ≤ 1) to the observed count of every n-gram.
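A minimal sketch of additive smoothing on a toy corpus (made-up data; δ = 0.5 chosen arbitrarily): every bigram count gets δ added, so unseen bigrams still receive a non-zero probability.

```python
# Additive (add-delta) smoothing for bigram probabilities.
from collections import Counter

corpus = [["i", "like", "speech"], ["you", "like", "music"]]   # toy data only
vocab = {w for sent in corpus for w in sent}

bigram_counts = Counter((p, c) for s in corpus for p, c in zip(s[:-1], s[1:]))
unigram_counts = Counter(p for s in corpus for p in s[:-1])

def p_smoothed(cur, prev, delta=0.5):
    return ((bigram_counts[(prev, cur)] + delta) /
            (unigram_counts[prev] + delta * len(vocab)))

print(p_smoothed("speech", "you"))   # unseen bigram, but probability > 0
```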

Original: https://blog.csdn.net/weixin_45606671/article/details/114522868
Author: Zemun
Title: 语音识别系统结构
