# FBK and MFCC Features

FBK: filter bank

MFCC: Mel-frequency cepstral coefficients

MFCC features are computed as follows:

1. Pre-emphasis:

Why pre-emphasize: as speech travels through the air, its high-frequency components attenuate more than its low-frequency ones, and pre-emphasis compensates for this loss.

pre_emphasis_coeff = 0.95
y(n) = x(n) - pre_emphasis_coeff * x(n-1) # x: input signal, y: pre-emphasized signal
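
The difference equation above can be sketched in NumPy as follows. This is a minimal illustration, not the author's code: the function name `pre_emphasize` and the choice to pass the first sample through unchanged are assumptions.

```python
import numpy as np

def pre_emphasize(signal, coeff=0.95):
    """Return y with y[0] = x[0] and y[n] = x[n] - coeff * x[n-1] for n >= 1."""
    signal = np.asarray(signal, dtype=float)
    # The first sample has no predecessor, so it is kept as-is (an assumption).
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])

x = np.array([1.0, 1.0, 1.0, 1.0])  # a constant (0 Hz) signal
y = pre_emphasize(x)                # the constant part is almost cancelled after the first sample
```

A constant input is the lowest possible frequency, so the filter suppresses it strongly; rapidly changing (high-frequency) input would pass through with much less attenuation.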


2. Framing:

fs = 8000 # sampling rate (Hz); implied by the sample counts below
frame_len = 25 # frame length (ms)
frame_shift = 10 # frame shift (ms)
frame_len_samples = frame_len*fs//1000 # frame length (samples) = 200
frame_shift_samples = frame_shift*fs//1000 # frame shift (samples) = 80
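
With those two sample counts, the signal can be cut into overlapping frames. The sketch below is one possible way to do it, assuming `fs = 8000` Hz (consistent with the 200 and 80 sample counts) and assuming the incomplete tail is simply dropped; `split_into_frames` is a hypothetical helper name.

```python
import numpy as np

fs = 8000                              # assumed sampling rate (Hz)
frame_len_samples = 25 * fs // 1000    # 200 samples per frame
frame_shift_samples = 10 * fs // 1000  # 80 samples between frame starts

def split_into_frames(signal, frame_len, frame_shift):
    """Slice a 1-D signal into overlapping frames, dropping the incomplete tail."""
    signal = np.asarray(signal)
    if len(signal) < frame_len:
        return np.empty((0, frame_len), dtype=signal.dtype)
    num_frames = 1 + (len(signal) - frame_len) // frame_shift
    # Row i of `indices` holds the sample positions that belong to frame i.
    indices = (np.arange(frame_len)[None, :]
               + frame_shift * np.arange(num_frames)[:, None])
    return signal[indices]

one_second = np.arange(fs)             # stand-in for 1 s of audio
frames = split_into_frames(one_second, frame_len_samples, frame_shift_samples)
```

Because the shift (80) is smaller than the frame length (200), consecutive frames overlap by 120 samples.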


3. Windowing

Why apply a window: the window function is smooth, so the samples at both ends of each frame taper smoothly toward zero. This reduces the sidelobe intensity after the Fourier transform and yields a higher-quality spectrum.

After windowing, both ends of the frame data are gradually attenuated toward zero.
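
A minimal sketch of the windowing step. The text does not name a specific window, so the Hamming window used here is an assumption; note that it tapers the frame ends to 0.08 rather than exactly zero.

```python
import numpy as np

frame_len_samples = 200
window = np.hamming(frame_len_samples)    # w[n] = 0.54 - 0.46*cos(2*pi*n/(N-1))

frames = np.ones((3, frame_len_samples))  # stand-in for 3 frames of audio
windowed = frames * window                # the window is broadcast across all frames
```

Swapping in `np.hanning` would taper the ends all the way to zero, at the cost of a slightly different sidelobe profile.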

1. (Discrete) Fourier transform

Physical meaning: convert the signal from the time domain to the frequency domain to obtain the amplitude and phase at each frequency (that is, at each frequency bin).

K=512 # length of DFT
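
To make the "frequency bin" idea concrete, the sketch below transforms one frame of a 1 kHz tone with `np.fft.rfft` and locates the peak. It assumes `fs = 8000` Hz and uses the real-input FFT, which returns only the K/2 + 1 non-negative frequency bins; whether the original code used `rfft` or `fft` is not stated.

```python
import numpy as np

fs = 8000                                        # assumed sampling rate (Hz)
K = 512                                          # DFT length
t = np.arange(200) / fs                          # time axis of one 25 ms frame
frame = np.sin(2 * np.pi * 1000 * t)[None, :]    # 1 kHz tone, shape (1, 200)

# The 200-sample frame is zero-padded to K points before the transform.
freq_domain_data = np.fft.rfft(frame, n=K, axis=1)  # shape (1, K // 2 + 1)
magnitude = np.abs(freq_domain_data)             # amplitude at each bin
phase = np.angle(freq_domain_data)               # phase at each bin

peak_bin = int(np.argmax(magnitude[0]))
peak_hz = peak_bin * fs / K                      # bin k corresponds to k * fs / K Hz
```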

4. Compute the energy spectrum

Get the energy at each frequency (at each frequency bin).

Calculation method: the sum of the squares of the real and imaginary parts of each complex DFT value.

power_spec = np.absolute(freq_domain_data) ** 2 * (1/K) # power spectrum
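
As a quick check that the two descriptions agree, the sketch below computes the power spectrum both ways on a few hand-picked complex values (illustrative numbers, not real DFT output).

```python
import numpy as np

K = 512
freq_domain_data = np.array([[3 + 4j, 0 + 2j, 1 + 0j]])  # made-up DFT values

# Sum of squares of the real and imaginary parts, scaled by 1/K ...
power_from_parts = (freq_domain_data.real ** 2 + freq_domain_data.imag ** 2) / K
# ... equals the squared magnitude scaled by 1/K.
power_spec = np.absolute(freq_domain_data) ** 2 * (1 / K)
```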


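5. Mel filtering

WHY: The human ear is not equally sensitive to the low- and high-frequency parts of speech; it is more sensitive to low frequencies and less sensitive to high frequencies.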

""" 3. Apply the mel filterbank to the power spectrum, sum the energy in each filter.

The Mel scale relates perceived frequency, or pitch, of a pure tone to its actual measured frequency.

Humans are much better at discerning small changes in pitch at low frequencies than they are at high frequencies.

Incorporating this scale makes our features match more closely what humans hear.

The formula for converting from frequency to Mel scale is:
M(f) = 2595*log10(1+f/700)
And the formula for converting from Mel scale to frequency is:
F(m) = 700*(10**(m/2595)-1)
"""
low_frequency = 20 # We don't start from 0 Hz because the human ear cannot perceive very low frequencies.

high_frequency = fs//2 # If the speech is sampled at fs Hz, the upper frequency is limited to fs/2 Hz (the Nyquist frequency). = 4000
low_frequency_mel = 2595 * np.log10(1 + low_frequency / 700) # ≈ 31.75
high_frequency_mel = 2595 * np.log10(1 + high_frequency / 700) # = 2146.06


Calculation method: construct a set of filters, multiply each filter with the energy spectrum, and sum the energy in each filter.
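
One common way to build such a filter set is sketched below. It is an illustration under stated assumptions rather than the author's implementation: the filters are triangular and equally spaced on the Mel scale, there are 40 of them, `fs = 8000` Hz and `K = 512`, and the helper names (`hz_to_mel`, `mel_to_hz`, `mel_filterbank`) are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(num_filters=40, K=512, fs=8000, low_hz=20):
    """Return a (num_filters, K // 2 + 1) matrix of triangular Mel filters."""
    high_hz = fs // 2
    # num_filters + 2 points equally spaced in Mel: the shared edges and centers.
    mel_points = np.linspace(hz_to_mel(low_hz), hz_to_mel(high_hz), num_filters + 2)
    bins = np.floor((K + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((num_filters, K // 2 + 1))
    for m in range(1, num_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):        # rising edge: 0 at left, up toward 1
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):       # falling edge: 1 at center, down toward 0
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

fbank = mel_filterbank()
power_spec = np.ones((2, 257))               # stand-in power spectrum for 2 frames
fbank_energy = power_spec @ fbank.T          # weighted sum per filter, shape (2, 40)
```

Because the points are equally spaced in Mel, the filters are narrow and dense at low frequencies and wide at high frequencies, which mirrors the ear's sensitivity described above.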

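6. Take the log

WHY: The log rescales the vertical axis, which amplifies energy differences in low-energy regions. Think of the shape of the log curve.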
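
A small numeric illustration with made-up energies: a tenfold change at low energy and a tenfold change at high energy differ enormously on a linear scale but become the same step after the log. The `eps` guard against `log(0)` is an assumption, not something the original mentions.

```python
import numpy as np

fbank_energy = np.array([[1e-4, 1e-3, 1.0, 10.0]])  # made-up filter bank energies
eps = np.finfo(float).eps                           # avoids log(0); an added safeguard
log_fbank = np.log(fbank_energy + eps)

linear_low_gap = fbank_energy[0, 1] - fbank_energy[0, 0]   # 0.0009
linear_high_gap = fbank_energy[0, 3] - fbank_energy[0, 2]  # 9.0
log_low_gap = log_fbank[0, 1] - log_fbank[0, 0]            # about ln(10)
log_high_gap = log_fbank[0, 3] - log_fbank[0, 2]           # about ln(10)
```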

7. Discrete cosine transform (DCT)

Generally, only the first 12 or 20 points after the discrete cosine transform are retained.

num_ceps = 12 # number of MFCC dims kept; typically cepstral coefficients 2-13 are retained.

Features from the other dimensions are dropped because they represent rapid changes in the filter bank coefficients and are not helpful for speech models.

mfcc = dct(log_fbank, type=2, axis=1, norm="ortho")[:, 1 : (num_ceps + 1)]
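
The slice `[:, 1 : (num_ceps + 1)]` skips coefficient 0 and keeps the next 12. The sketch below runs the same call on stand-in data; importing `dct` from `scipy.fft` is an assumption, since the original does not show where `dct` comes from.

```python
import numpy as np
from scipy.fft import dct

num_ceps = 12
rng = np.random.default_rng(0)
log_fbank = rng.normal(size=(5, 40))   # stand-in: 5 frames x 40 log filter bank energies

cepstrum = dct(log_fbank, type=2, axis=1, norm="ortho")  # shape (5, 40)
mfcc = cepstrum[:, 1 : (num_ceps + 1)]                   # drop c0, keep c1..c12

# A perfectly flat log spectrum has no variation across filters,
# so everything except c0 is (numerically) zero.
flat = np.full((1, 40), 3.0)
flat_cepstrum = dct(flat, type=2, axis=1, norm="ortho")
flat_mfcc = flat_cepstrum[:, 1 : (num_ceps + 1)]
```

Coefficient 0 only reflects the overall level of the log energies, which is why a flat input leaves nothing in the retained coefficients.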


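1. MFCC features are computed on top of FBK features.

2. MFCC features have lower dimensionality than FBK features. Typically, FBK features are 40-dimensional and MFCC features are 13-dimensional.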

Original: https://blog.csdn.net/huang_yx005/article/details/122474081
Author: huang_yx005
Title: FBK和MFCC特征


