Reading notes on the classic single-microphone noise reduction book "Speech Enhancement: Theory and Practice" (Chapters 1 to 4)

1.1 Understanding noise

Noise falls into two kinds: stationary and non-stationary.

Suppressing non-stationary noise is much harder than suppressing stationary noise.

Different noises have different spectral shapes; in particular, their energy is distributed differently across frequency.

In practice, speech enhancement algorithms usually have to work at SNRs of 0 to 15 dB.
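To make the 0 to 15 dB range concrete, here is a minimal sketch of how a noisy test signal can be built at a chosen SNR. The function names `mix_at_snr` and `measure_snr_db`, and the sine wave used as a stand-in for speech, are assumptions made for illustration; they do not come from the book.

```python
import numpy as np

def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` so that speech + scaled noise has the requested SNR (dB)."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # SNR = 10*log10(Ps / (g^2 * Pn))  =>  g = sqrt(Ps / (Pn * 10^(SNR/10)))
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (target_snr_db / 10.0)))
    scaled_noise = gain * noise
    return speech + scaled_noise, scaled_noise

def measure_snr_db(speech, noise):
    """SNR in dB computed from the separate speech and noise components."""
    return 10.0 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 200 * t)      # stand-in for a speech signal
noise = rng.standard_normal(len(t))       # stationary white noise

for target in (0, 5, 10, 15):
    noisy, scaled = mix_at_snr(speech, noise, target)
    print(target, round(measure_snr_db(speech, scaled), 2))
```

The SNR here is a long-term average over the whole signal, so with non-stationary noise the local SNR in individual frames can be much higher or lower than this figure.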

1.2 Classes of speech enhancement algorithms

Speech enhancement algorithms fall into three classes:

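Spectral subtraction: the easiest enhancement algorithm to implement so far. Because the noise is additive, the noise spectrum can be estimated or updated while no speech is present and then subtracted from the noisy signal.
Statistical-model-based algorithms: given a set of measurements of the noisy signal, for example its Fourier transform coefficients, we look for a linear (or nonlinear) estimator of the parameter of interest, namely the transform coefficients of the clean signal. The Wiener algorithm and the minimum mean square error (MMSE) algorithm belong to this class.
Subspace algorithms: rooted in linear algebra, they decompose the vector space of the noisy signal into a "signal" subspace and a "noise" subspace using orthogonal matrix factorizations, in particular the singular value decomposition (SVD) and the eigenvector/eigenvalue decomposition (EVD).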
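The first class is simple enough to sketch in a few lines. The following is one possible magnitude spectral subtraction, not the book's reference implementation. It assumes that the first `noise_frames` frames contain noise only, and the frame length, hop size and spectral floor are illustrative values chosen here.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames=5, frame_len=256, hop=128, floor=0.01):
    """Magnitude spectral subtraction with overlap-add resynthesis.

    Assumes the first `noise_frames` frames contain noise only (no speech).
    """
    noisy = np.asarray(noisy, dtype=float)
    # Periodic Hann window: shifted copies sum to 1 at 50% overlap.
    window = np.hanning(frame_len + 1)[:-1]
    n_frames = 1 + (len(noisy) - frame_len) // hop
    frames = np.stack([noisy[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)

    # Noise magnitude spectrum estimated from the speech-free frames.
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract it, but never go below a small fraction of the noisy magnitude.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)

    # Resynthesize with the noisy phase and overlap-add the frames.
    clean_frames = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    out = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i * hop : i * hop + frame_len] += clean_frames[i]
    return out

rng = np.random.default_rng(0)
sample_rate = 8000
n = np.arange(8000)
# 0.25 s of noise only, followed by a 440 Hz tone in noise.
tone = np.where(n >= 2000, np.sin(2 * np.pi * 440 * n / sample_rate), 0.0)
noisy = tone + 0.3 * rng.standard_normal(len(n))
enhanced = spectral_subtraction(noisy, noise_frames=10)

before = np.mean(noisy[256:1500] ** 2)
after = np.mean(enhanced[256:1500] ** 2)
print("noise power change in the speech-free stretch (dB):", 10 * np.log10(after / before))
```

The spectral floor keeps the subtracted magnitude from going negative. In a real system the noise estimate would be updated whenever a voice activity detector reports silence, which this sketch leaves out.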

3.2 The speech production process

Speech production involves a number of organs and muscles, including the lungs, the larynx and the vocal tract.

… is the main excitation source for speech production.

The vocal folds can be in one of three states: breathing, voiced and unvoiced.

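Breathing state: air from the lungs flows freely through the open glottis without noticeable resistance.
Voiced state: rapid increases and decreases in the tension of the two vocal folds, together with rapid changes in the air pressure at the glottis, make the folds close periodically.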

Vowels are voiced sounds.

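Pitch period: the time the glottis takes to open and close once is called the pitch period. It depends mainly on the tension and mass of the vocal folds: the greater the mass, the slower the vibration and the longer the pitch period. The fundamental frequency of male voices ranges from about 60 Hz to 150 Hz, while that of female and children's voices is about 200 Hz to 400 Hz.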
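The pitch period can be read off a voiced frame as the lag of the strongest autocorrelation peak. The sketch below searches lags that correspond to the 60 Hz to 400 Hz range quoted above. The name `estimate_f0`, the 40 ms frame and the synthetic three-harmonic test signal are assumptions for illustration.

```python
import numpy as np

def estimate_f0(frame, sample_rate, f0_min=60.0, f0_max=400.0):
    """Return (F0 in Hz, pitch period in s) from the autocorrelation peak."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Keep only non-negative lags of the autocorrelation.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)   # shortest period considered
    lag_max = int(sample_rate / f0_min)   # longest period considered
    lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / lag, lag / sample_rate

sample_rate = 16000
t = np.arange(int(0.04 * sample_rate)) / sample_rate      # one 40 ms frame
true_f0 = 120.0                                           # hypothetical male voice
frame = sum(a * np.sin(2 * np.pi * k * true_f0 * t)
            for k, a in ((1, 1.0), (2, 0.5), (3, 0.25)))
f0, period = estimate_f0(frame, sample_rate)
print(f"F0 = {f0:.1f} Hz, pitch period = {period * 1000:.2f} ms")
```

Because the lag is an integer number of samples, the estimate is quantized. Practical pitch trackers interpolate around the peak and add checks against octave errors.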

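Unvoiced state: the vocal folds do not vibrate, but they are tenser and closer together, so the air flowing through the glottis becomes turbulent. The sound produced by this turbulence is called aspiration.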

Unvoiced sounds include most consonants.

Like a physical linear filter, the vocal tract shapes the spectrum of the glottal wave to produce different sounds, and the filter's characteristics change with the position of the articulators and the shape of the mouth.

The vocal tract is a time-varying resonant cavity. As its shape changes over time, its resonant frequencies change, which produces different sounds.

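The first formant, labeled F1, varies with how open the mouth is. A slightly open mouth gives a lower F1, and a wide-open mouth gives a higher F1.
The second formant, labeled F2, is related to the position of the tongue in the oral cavity and to lip activity.
The third formant, labeled F3, varies with front-to-back constriction in the oral cavity.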

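Measurements comparing the F1 and F2 formant frequencies of 11 vowels show that F1 is affected least by noise and F2 is affected most. This suggests that in noisy environments listeners can obtain relatively reliable F1 information but perhaps only vague F2 information.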

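The lowest resonance of a uniform acoustic tube has a wavelength four times the length of the tube. The tube also resonates at odd multiples of that lowest frequency.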
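The quarter-wavelength rule gives resonances at F_k = (2k - 1) * c / (4L) for a tube closed at one end (the glottis) and open at the other (the lips). The short calculation below assumes a tube length of 17.5 cm and a speed of sound of 350 m/s; both are common textbook figures used here for illustration, not values taken from these notes.

```python
def tube_resonances(length_m, n_resonances=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    lowest = speed_of_sound / (4.0 * length_m)   # wavelength = 4 * tube length
    # Higher resonances sit at odd multiples of the lowest one.
    return [(2 * k - 1) * lowest for k in range(1, n_resonances + 1)]

resonances = tube_resonances(0.175)
print([round(f) for f in resonances])   # roughly 500, 1500 and 2500 Hz
```

These values are close to the formants of a neutral vowel, which is why the uniform tube is a useful first approximation of the vocal tract.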

3.3 An engineering model of speech production

The vocal folds supply the excitation for the vocal tract. The excitation can be periodic or aperiodic, depending on the state of the vocal folds.
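This source-filter view can be sketched as an excitation signal passed through a cascade of resonators. In the sketch below an impulse train stands for the periodic (voiced) excitation and white noise for the aperiodic (unvoiced) one. The formant frequencies and bandwidths are hypothetical values, and the two-pole resonator is one simple way to model a formant, not necessarily the model used in the book.

```python
import numpy as np

def excitation(n_samples, sample_rate, f0=None, seed=0):
    """Impulse train at f0 (voiced) or white noise when f0 is None (unvoiced)."""
    if f0 is None:
        return np.random.default_rng(seed).standard_normal(n_samples)
    source = np.zeros(n_samples)
    period = int(round(sample_rate / f0))   # pitch period in samples
    source[::period] = 1.0
    return source

def resonator(x, freq, bandwidth, sample_rate):
    """Two-pole resonator: y[n] = x[n] + a1*y[n-1] + a2*y[n-2]."""
    r = np.exp(-np.pi * bandwidth / sample_rate)          # pole radius
    a1 = 2.0 * r * np.cos(2.0 * np.pi * freq / sample_rate)
    a2 = -r * r
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

def synthesize(source, formants, sample_rate):
    """Pass the excitation through one resonator per (frequency, bandwidth) pair."""
    out = source
    for freq, bandwidth in formants:
        out = resonator(out, freq, bandwidth, sample_rate)
    return out

sample_rate = 8000
n_samples = 2000
formants = [(500.0, 80.0), (1500.0, 100.0), (2500.0, 120.0)]   # hypothetical vowel
voiced = synthesize(excitation(n_samples, sample_rate, f0=100.0), formants, sample_rate)
unvoiced = synthesize(excitation(n_samples, sample_rate), formants, sample_rate)
print(voiced[:5], unvoiced[:5])
```

Changing `f0` alters the pitch without moving the formants, and changing `formants` alters the vowel quality without changing the pitch. That independence is the point of separating source and filter.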

3.5 Acoustic cues in speech perception

The steady-state formant frequencies, measured in the middle of the vowel, are the main cue to vowel perception, but not the only one. Other cues, such as duration and spectral change, are also often used to identify vowels.

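Although semivowels share many characteristics with vowels, they are classed as consonants because, like other consonants, they occur only at the beginning or end of a syllable. Unlike vowels, the formants of semivowels never reach a steady state.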

Formant transitions are an important cue to semivowel perception, especially the changes in the F2 and F3 frequencies.

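When a nasal is produced, the velum is lowered, the entrance to the nasal cavity opens, and the oral passage is blocked. Coupling the nasal cavity to the vocal tract creates a larger and longer resonant cavity, which has lower resonant frequencies.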

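Because the oral passage is completely closed during a nasal, antiresonances appear, and as a result the higher formants are strongly attenuated.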

Two features distinguish nasals from other sounds:

The first is the reduced intensity of the attenuated high-frequency formants.

The second is the presence or absence of a low-frequency resonance (…).

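Vowels, semivowels and nasals have one thing in common: the airflow passes relatively freely through the vocal tract. For stops and fricatives, by contrast, the airflow is constricted and in some cases even blocked.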

Stop consonants have two distinctive properties:

First, the vocal tract is completely closed while a stop is produced, so the airflow is blocked for a moment, which shows up as silence in the acoustic signal.

Second, once pressure has built up behind the closure, the air is released quickly. This sounds like a short noise burst, and the noise that follows the burst is called aspiration.

The two main acoustic cues to stop consonants therefore come from the closure interval and the release of the airflow.

Stops are further divided into voiceless stops (/p, t, k/) and voiced stops (/b, d, g/).

  • For voiceless stops, no air flows through the vocal tract during the closure, whereas voiced stops show a low-intensity periodic signal (at the pitch frequency) during all or part of the closure.

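  • Voiced stops are on average shorter than voiceless stops. The voice onset time (VOT), the interval between the release burst and the onset of vocal fold vibration, is about 10 ms to 20 ms for voiced stops and 40 ms to 100 ms for voiceless stops.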
  • The burst produced by the released airflow is stronger for voiceless stops than for voiced stops.

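As with stop consonants, fricatives are produced when the articulators form a narrow constriction, or even an obstruction, somewhere in the vocal tract. Air flowing through the constriction generates noise, which is then shaped by the vocal tract.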

The main cue to fricatives is a relatively long segment of aperiodic noise.

Three main cues signal the place of articulation of a fricative:

The first is the intensity of the frication noise.
The second is the spectral shape of the frication noise.
The third is the F2 and F3 formant transitions into and out of the neighboring voiced sounds.

4.1 Speech intelligibility with multiple talkers

To keep the two apart, the interfering signal is called the masker and the speech signal to be extracted is called the target.

Studies have shown that recognizing speech in steady-state noise is harder than recognizing it against a single competing talker. The gain in intelligibility obtained when a single competing talker (or a small number of them) replaces steady-state noise as the masker is known in the literature as release from masking.

All of these maskers were used to assess whether their differences in performance relative to a competing-speech masker can be explained by how similar or different their temporal fluctuations are.

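Signals from different locations reach each ear at different times and, because the head attenuates sound passing around it, with different amplitudes. This attenuation is commonly called "head shadow". The amplitude differences caused by head shadow are called interaural level differences (ILDs), and the differences in arrival time at the two ears are called interaural time differences (ITDs).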
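Both quantities can be estimated from a pair of ear signals: the ITD from the lag of the cross-correlation peak, and the ILD from the energy ratio in dB. The sketch below uses white noise as the source, and the 8-sample delay and factor-of-two attenuation are hypothetical values chosen for illustration. The function name `itd_ild` and its sign convention are also assumptions.

```python
import numpy as np

def itd_ild(left, right, sample_rate):
    """Return (ITD in seconds, ILD in dB).

    Positive ITD means the sound reaches the left ear first;
    positive ILD means the left-ear signal is stronger.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    xcorr = np.correlate(right, left, mode="full")
    # Index len(left) - 1 corresponds to zero lag.
    lag = int(np.argmax(xcorr)) - (len(left) - 1)   # delay of right re. left, in samples
    itd = lag / sample_rate
    ild = 10.0 * np.log10(np.sum(left ** 2) / np.sum(right ** 2))
    return itd, ild

rng = np.random.default_rng(0)
sample_rate = 16000
source = rng.standard_normal(2000)
delay = 8                                    # hypothetical: 0.5 ms at 16 kHz
left = source
# The right ear hears the source later and at half the amplitude (about 6 dB weaker).
right = 0.5 * np.concatenate([np.zeros(delay), source[:-delay]])

itd, ild = itd_ild(left, right, sample_rate)
print(f"ITD = {itd * 1e6:.0f} us, ILD = {ild:.1f} dB")
```

Real head shadow is frequency dependent (stronger at high frequencies), so a practical analysis would compute the ILD per frequency band instead of over the whole signal as done here.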

The spatial distribution and the number of noise sources both affect speech intelligibility in binaural listening.

The benefit of binaural hearing is commonly called spatial release from masking. It is attributed mainly to two factors: the head-shadow effect and interaural time differences.

4.2 Acoustic properties of speech that contribute to robustness

As the figure shows, most of the speech energy lies below 1 kHz, with a peak at 500 Hz. Above 500 Hz the energy falls off gradually.

This low-frequency emphasis reduces the impact of distortion and noise and protects the speech information, for several reasons:

First, as the SNR drops, the low-frequency region is the last to be masked by noise.

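Second, listeners have access to reliable F1 and F2 information, which is critical for vowel identification and for the perception of stop consonants.
Third, speech harmonics in the low frequencies are affected less than those in the high frequencies, so listeners can obtain a relatively reliable F0 cue, which is needed to separate the target speech from competing speech.
Fourth, the frequency selectivity of hearing is sharpest at low frequencies and decreases toward high frequencies, which gives listeners good resolution of F0.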

In noise, information about low-frequency spectral peaks is preserved better than information about spectral valleys.

As the figure shows, the frequency positions of the first three formants (F1 to F3), that is, the three formants lowest in frequency, change little after noise is added.
Noise distorts the speech spectrum, changes the spectral tilt and reduces the spectral dynamic range, but it preserves the frequency positions of the low-frequency formants to some extent.
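A toy calculation illustrates why peaks survive while the dynamic range shrinks. The spectral envelope below is synthetic (three hypothetical formant peaks), and the noise is idealized as white, so it adds the same expected power to every frequency bin. Real noise is seldom this flat, so treat this as a sketch of the principle only.

```python
import numpy as np

def formant_envelope(freqs, formants, floor=1e-4):
    """Synthetic power spectrum: one resonance-shaped peak per formant."""
    power = np.full(freqs.shape, floor)
    for center, peak_power, half_width in formants:
        power = power + peak_power / (1.0 + ((freqs - center) / half_width) ** 2)
    return power

def peak_frequencies(freqs, power):
    """Frequencies of the local maxima of the spectrum."""
    inner = power[1:-1]
    is_peak = (inner > power[:-2]) & (inner > power[2:])
    return freqs[1:-1][is_peak]

def dynamic_range_db(power):
    """Ratio of the largest to the smallest spectral value, in dB."""
    return 10.0 * np.log10(power.max() / power.min())

freqs = np.linspace(0.0, 4000.0, 401)                       # 10 Hz grid
formants = [(500.0, 1.0, 60.0), (1500.0, 0.3, 80.0), (2500.0, 0.1, 100.0)]
clean = formant_envelope(freqs, formants)
noisy = clean + 0.01        # uncorrelated white noise: powers add in every bin

print("peaks, clean:", peak_frequencies(freqs, clean))
print("peaks, noisy:", peak_frequencies(freqs, noisy))
print("dynamic range (dB):", dynamic_range_db(clean), "->", dynamic_range_db(noisy))
```

The valleys fill up with noise first, so the peak-to-valley contrast drops while the peak locations stay where they were.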

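Low-order harmonics, at least those below 1 kHz, are preserved fairly well in noise, which suggests that listeners can still obtain relatively accurate F0 information in noisy environments.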

Unlike vowels, consonants are short and low in intensity, so they are more vulnerable to noise and distortion than vowels.

Formant movements in vowels and diphthongs are slow and gradual, whereas consonants show rapid spectral changes and fast formant transitions. In noise, these rapid spectral changes remain perceptible to some extent and therefore signal whether a consonant is present.

4.3 Perceptual strategies for listening in noise

The acoustic components that belong to the same sound source are called an "auditory stream".

Two kinds of grouping are needed to separate speech sources:

First, events that follow one another and lie close together in time are grouped.

Second, acoustic components that occur at the same time in different regions of the spectrum are grouped.

Listeners identify the content of the target sentence by exploiting the silent gaps or the waveform "dips" in the competing speech. These gaps and low-amplitude dips give the listener a chance to "glimpse" whole syllables or words of the target speech.
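One rough way to quantify glimpsing is to count the frames in which the target is clearly stronger than the masker. The sketch below compares a steady masker with an on/off gated masker of the same long-term power. The 3 dB local SNR threshold, the 20 ms frames, the 4 Hz gating, and the use of white noise as a stand-in for speech are all assumptions made for illustration.

```python
import numpy as np

def frame_energy_db(x, frame_len, hop):
    """Energy of each frame in dB (a tiny constant avoids log of zero)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    energy = np.array([np.sum(x[i * hop : i * hop + frame_len] ** 2)
                       for i in range(n_frames)])
    return 10.0 * np.log10(energy + 1e-12)

def glimpse_fraction(target, masker, frame_len=320, hop=160, threshold_db=3.0):
    """Fraction of frames whose local SNR exceeds `threshold_db`."""
    local_snr = (frame_energy_db(target, frame_len, hop)
                 - frame_energy_db(masker, frame_len, hop))
    return float(np.mean(local_snr > threshold_db))

rng = np.random.default_rng(0)
sample_rate = 16000
n = 2 * sample_rate
target = rng.standard_normal(n)                     # stand-in for the target speech
steady = rng.standard_normal(n)                     # steady-state masker, 0 dB SNR
gate = (np.arange(n) % 4000) < 2000                 # on for 125 ms, off for 125 ms
fluctuating = np.sqrt(2.0) * rng.standard_normal(n) * gate   # same long-term power

print("steady masker:     ", glimpse_fraction(target, steady))
print("fluctuating masker:", glimpse_fraction(target, fluctuating))
```

Both maskers give the same long-term SNR, yet only the fluctuating one leaves frames in which the target can be glimpsed, which matches the account of release from masking in section 4.1.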

With competing speech, the difference in fundamental frequency between the masker and the target plays an important role in speech recognition.

Speech intelligibility in adverse conditions depends heavily on the listener's ability to use linguistic knowledge, especially when the acoustic cues in the signal are weak. The content of the utterance narrows the range of candidate words and so improves sentence intelligibility.

Original: https://blog.csdn.net/weixin_42601303/article/details/109667743
Author: 毛绒团叽
Title: 单麦降噪经典书籍《Speech enhancement: theory and practice》读书笔记(第1章~第4章)
