




Speaker recognition determines the identity of a speaker from that speaker's speech signal. Speech is one of the natural attributes of human beings. Because of physiological differences in the vocal organs and acquired differences in speaking behavior, everyone's speech carries strong individual characteristics, which makes it possible to identify a speaker by analyzing the speech signal. Using the voice for identification has several distinct advantages: the voice is an inherent trait that cannot be lost or forgotten; speech signals are convenient to collect and the system equipment is inexpensive; and, over the telephone network, recognition can also be performed remotely, for example in customer service. For these reasons, speaker recognition has attracted increasing attention in recent years. Compared with other biometric technologies such as fingerprint recognition and hand recognition, speaker recognition is easy to use, contact-free, and readily accepted by users, and among existing biometric technologies it is particularly well suited to remote verification, because it needs only a telephone connection. The application prospects of speaker recognition are therefore broad. Speaker recognition is also an interdisciplinary technology: it draws on acoustics, linguistics, computer science, information processing, artificial intelligence, and other fields, and advances in each of them have contributed to its development.



The fundamental problem in designing a speaker recognition system is how to extract, from the speech signal, features that characterize the individual speaker. Extraction of the speech feature vector is the basis of the whole system and strongly influences its false rejection rate and false acceptance rate. Unlike speech recognition, speaker recognition uses the speaker information in the speech signal and disregards the meaning of the words; it emphasizes the individuality of the speaker. A single type of speech feature is therefore rarely enough to raise the recognition rate, so extracting the key components of the speech signal is particularly important. The quality of the feature parameters directly determines the accuracy of discrimination.
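The two error rates mentioned above can be made concrete with a small sketch. It assumes a verification setting in which a trial is accepted when its distortion score falls below a threshold. The scores, the threshold, and all variable names are hypothetical illustration values, not results from this system.

```matlab
% Hypothetical distortion scores (lower = better match); not measured data.
genuineScores  = [2.1 2.8 3.0 3.9 4.6];   % trials where the claimed speaker is the true one
impostorScores = [3.5 4.8 5.2 6.0 6.7];   % trials where the claim is false
threshold = 4.0;                          % assumed rule: accept if score < threshold

% False rejection rate: fraction of genuine trials that are rejected.
frr = sum(genuineScores >= threshold) / numel(genuineScores);

% False acceptance rate: fraction of impostor trials that are accepted.
far = sum(impostorScores < threshold) / numel(impostorScores);

fprintf('FRR = %.2f, FAR = %.2f\n', frr, far);
```

Raising the threshold in this sketch lowers the false rejection rate and raises the false acceptance rate, and lowering it does the opposite. Better features help because they separate the two score distributions, so both rates can fall together.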



2.1 Overall Framework of the Speaker Recognition System



The overall structure of the speaker recognition system is shown in Figure 1. The input is a recorded voice signal. This analog speech signal is first preprocessed, which includes pre-filtering, sampling and quantization, windowing, endpoint detection, pre-emphasis, and so on. Preprocessing is followed by an important stage: feature parameter extraction. The specific requirements are:

Figure 1. Overall block diagram of the speaker recognition system
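Of the preprocessing steps named in the text, pre-emphasis is the simplest to illustrate. The sketch below assumes the common first-order form y(k) = x(k) - a*x(k-1). The coefficient a = 0.95 and the toy signal are assumptions made for illustration; the source does not state which coefficient the system uses.

```matlab
% Pre-emphasis: y(k) = x(k) - a * x(k-1), a first-order high-pass filter.
a = 0.95;                       % assumed coefficient; values near 0.9 to 1.0 are typical
x = [1; 1; 1; 1; 2];            % toy signal standing in for sampled speech
y = filter([1 -a], 1, x);       % built-in FIR filtering, with x(0) taken as 0

disp(y');
```

The constant part of the toy signal is attenuated to 0.05, while the jump at the last sample passes through almost unchanged. This shows how pre-emphasis boosts rapid changes relative to slowly varying content.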


function M3 = blockFrames(s, fs, m, n)
% blockFrames: Split the signal into frames, apply a Hamming window to each
% frame, and take its FFT (framing function).
%
% Inputs:
%   s   the speech signal to analyze
%   fs  the sampling rate of the signal (not used in the body)
%   m   the distance between the beginnings of two frames, in samples
%   n   the number of samples per frame
%
% Output:
%   M3  a matrix containing all the frames, one frame per column

l = length(s);                        % length of the speech signal
nbFrame = floor((l - n) / m) + 1;     % number of frames

M = zeros(n, nbFrame);                % preallocate the frame matrix
for i = 1:n
    for j = 1:nbFrame
        M(i, j) = s(((j - 1) * m) + i);   % sample i of frame j
    end
end

h = hamming(n);
M2 = diag(h) * M;                     % apply the Hamming window to every frame

M3 = zeros(n, nbFrame);
for i = 1:nbFrame
    M3(:, i) = fft(M2(:, i));         % short-time Fourier transform
end
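The framing and windowing steps can also be written without the double loop. The sketch below uses a toy signal and assumed example values for the frame length and shift. It builds an index matrix with `bsxfun` and writes the Hamming window out explicitly so that no toolbox function is needed.

```matlab
s = (1:1000)';                  % toy signal whose sample values equal their indices
n = 256;                        % samples per frame (assumed example value)
m = 100;                        % frame shift in samples (assumed example value)

nbFrame = floor((length(s) - n) / m) + 1;        % floor(744/100) + 1 = 8 frames

% Row i, column j of idx holds the sample index (j-1)*m + i.
idx = bsxfun(@plus, (1:n)', (0:nbFrame-1) * m);
M = s(idx);                                      % n-by-nbFrame matrix of frames

% Symmetric Hamming window written out explicitly.
h = 0.54 - 0.46 * cos(2 * pi * (0:n-1)' / (n - 1));
M2 = bsxfun(@times, M, h);                       % same result as diag(h) * M
```

Because consecutive frames start m = 100 samples apart but are n = 256 samples long, neighbouring frames overlap by 156 samples. Samples after the last complete frame are dropped.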


function code = train(traindir, n)
% Speaker Recognition: Training Stage
% Computes the VQ codebook for each wav file.
%
% Input:
%   traindir : string name of the directory that contains all training sound files
%   n        : number of training files in traindir
%
% Output:
%   code     : trained VQ codebooks, code{i} for the i-th speaker
%
% Note:
%   Sound files in traindir are supposed to be:
%       s1.wav, s2.wav, ..., sn.wav
%
% Example:
%   >> code = train('C:\data\train\', 8);

k = 16;                                   % number of centroids required

code = cell(1, n);
for i = 1:n                               % train a VQ codebook for each speaker
    file = sprintf('%ss%d.wav', traindir, i);

    % audioread replaces wavread, which is no longer available in recent MATLAB releases
    [s, fs] = audioread(file);

    v = mfcc(s, fs);                      % compute the MFCCs
    code{i} = vqlbg(v, k);                % train the VQ codebook
end


function d = disteu(x, y)
% DISTEU Pairwise Euclidean distances between the columns of two matrices
%        (used to measure distortion)
%
% Input:
%   x, y: two matrices in which each column is a data vector
%
% Output:
%   d: element d(i,j) is the Euclidean distance between the column
%      vectors x(:,i) and y(:,j)
%
% Note:
%   The Euclidean distance D between two vectors X and Y is:
%       D = sum((x-y).^2).^0.5

[M, N] = size(x);
[M2, P] = size(y);

if (M ~= M2)
    error('Matrix dimensions do not match.');
end

d = zeros(N, P);

% Loop over the smaller of the two column counts.
if (N < P)
    copies = zeros(1, P);
    for n = 1:N
        d(n, :) = sum((x(:, n + copies) - y) .^ 2, 1);
    end
else
    copies = zeros(1, N);
    for p = 1:P
        d(:, p) = sum((x - y(:, p + copies)) .^ 2, 1)';
    end
end

d = d .^ 0.5;
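As a cross-check on the pairwise distance computation, the same matrix can be obtained without any loop by expanding the squared norm. This is a minimal sketch on toy data, offered as an alternative formulation rather than as the routine used in the listing.

```matlab
x = [0 3 1;
     0 4 1];        % three 2-D vectors as columns
y = [0 1;
     0 1];          % two 2-D vectors as columns

% Expand ||a - b||^2 = ||a||^2 + ||b||^2 - 2*a'*b for all column pairs.
xx = sum(x .^ 2, 1)';                       % N-by-1 squared norms
yy = sum(y .^ 2, 1);                        % 1-by-P squared norms
d2 = bsxfun(@plus, xx, yy) - 2 * (x' * y);  % N-by-P squared distances
d = sqrt(max(d2, 0));                       % clamp tiny negatives caused by rounding

disp(d);
```

Element d(2,1) is the distance between (3,4) and the origin, which is 5. The expanded form is usually faster for large matrices, but it can lose precision when two vectors are nearly identical, which is why the squared distances are clamped at zero.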

function m = melfb(p, n, fs)
% MELFB Determine matrix for a mel-spaced filterbank
%
% Inputs:   p   number of filters in the filterbank
%           n   length of the FFT (number of FFT points)
%           fs  sample rate in Hz
%
% Outputs:  m   a (sparse) matrix containing the filterbank amplitudes
%               size(m) = [p, 1+floor(n/2)]
%
% Usage: For example, to compute the mel-scale spectrum of a
%        column-vector signal s, with length n and sample rate fs:
%
%            f = fft(s);
%            m = melfb(p, n, fs);
%            n2 = 1 + floor(n/2);
%            z = m * abs(f(1:n2)).^2;
%
%        z would contain p samples of the desired mel-scale spectrum.
%
%        To plot the filterbank, e.g.:
%
%            plot(linspace(0, (12500/2), 129), melfb(20, 256, 12500)'),
%            title('Mel-spaced filterbank'), xlabel('Frequency (Hz)');

f0 = 700 / fs;
fn2 = floor(n/2);

lr = log(1 + 0.5/f0) / (p+1);

% convert to fft bin numbers with 0 for DC term
bl = n * (f0 * (exp([0 1 p p+1] * lr) - 1));

b1 = floor(bl(1)) + 1;
b2 = ceil(bl(2));
b3 = floor(bl(3));
b4 = min(fn2, ceil(bl(4))) - 1;

pf = log(1 + (b1:b4)/n/f0) / lr;
fp = floor(pf);
pm = pf - fp;

r = [fp(b2:b4) 1+fp(1:b3)];
c = [b2:b4 1:b3] + 1;
v = 2 * [1-pm(b2:b4) pm(1:b3)];

m = sparse(r, c, v, p, 1+fn2);
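The constants `f0` and `lr` in `melfb` imply that the filter positions are equally spaced on the warped scale log(1 + f/700) between 0 Hz and fs/2. The sketch below recovers those positions in Hz. The values of p and fs are taken from the plotting example in the header comment, and the names `edgesHz`, `warped`, and `spacing` are introduced here for illustration only.

```matlab
p  = 20;                            % number of filters (from the plotting example)
fs = 12500;                         % sample rate in Hz (from the plotting example)

f0 = 700 / fs;                      % 700 Hz expressed as a fraction of fs
lr = log(1 + 0.5 / f0) / (p + 1);   % step size on the log(1 + f/700) scale

k = 0:(p + 1);                      % k = 1..p are filter centres; 0 and p+1 are the outer edges
edgesHz = 700 * (exp(k * lr) - 1);  % corresponding frequencies in Hz

% Mapping the frequencies back to the warped scale gives equally spaced points.
warped = log(1 + edgesHz / 700);
spacing = diff(warped);

fprintf('lowest = %.1f Hz, highest = %.1f Hz\n', edgesHz(1), edgesHz(end));
```

In Hz the gaps between neighbouring positions grow with frequency, so the filters are narrow at low frequencies and wide at high frequencies. This is the behaviour a mel-spaced filterbank is meant to have.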

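function r = mfcc(s, fs)
% MFCC Compute the mel-frequency cepstral coefficients of a signal
%
% Inputs: s   vector containing the sound signal to analyze
%         fs  the sampling rate of the signal
%
% Output: r   contains the transformed signal, one column per frame

m = 100;                              % distance between the beginnings of two frames
n = 256;                              % number of samples per frame
l = length(s);

nbFrame = floor((l - n) / m) + 1;     % number of frames

M = zeros(n, nbFrame);
for i = 1:n
    for j = 1:nbFrame
        M(i, j) = s(((j - 1) * m) + i);
    end
end

h = hamming(n);
M2 = diag(h) * M;                     % apply the Hamming window to every frame

frame = zeros(n, nbFrame);
for i = 1:nbFrame
    frame(:, i) = fft(M2(:, i));      % spectrum of each frame
end

t = n / 2;                            % (not used below)
tmax = l / fs;                        % signal duration in seconds (not used below)

m = melfb(20, n, fs);                 % mel filterbank with 20 filters
n2 = 1 + floor(n / 2);
z = m * abs(frame(1:n2, :)).^2;       % mel-scale power spectrum

r = dct(log(z));                      % cepstral coefficients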

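function r = vqlbg(d, k)
% VQLBG Vector quantization using the Linde-Buzo-Gray (LBG) algorithm
%
% Inputs:
%   d  contains training data vectors (one per column)
%   k  is the number of centroids required
%
% Output:
%   r  contains the resulting VQ codebook (k columns, one for each centroid)

e = .01;                   % splitting parameter and relative convergence threshold
r = mean(d, 2);            % start from one centroid: the mean of all vectors
dpr = 10000;               % distortion of the previous iteration

for i = 1:log2(k)          % each pass doubles the codebook size
    r = [r*(1+e), r*(1-e)];                 % split every centroid in two

    while (1 == 1)
        z = disteu(d, r);                   % distances from vectors to centroids
        [m, ind] = min(z, [], 2);           % nearest centroid for each vector
        t = 0;                              % total distortion of this iteration
        for j = 1:2^i
            r(:, j) = mean(d(:, ind == j), 2);        % update centroid j
            x = disteu(d(:, ind == j), r(:, j));
            for q = 1:length(x)
                t = t + x(q);
            end
        end

        if (((dpr - t)/t) < e)              % distortion no longer improves enough
            break;
        else
            dpr = t;
        end
    end
end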
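The core of the LBG procedure is the step that splits a centroid into two slightly perturbed copies and then refines them. The sketch below shows one such split on toy 2-D data. The fixed number of refinement passes and the guard for empty cells are simplifications made for this illustration and are not taken from the listing.

```matlab
d = [1 1.2 0.8 5 5.2 4.8;
     1 0.9 1.1 5 5.1 4.9];          % toy training vectors, one per column
e = 0.01;                           % splitting parameter

r = mean(d, 2);                     % single starting centroid
r = [r * (1 + e), r * (1 - e)];     % split it into two perturbed copies

for iter = 1:10                     % fixed number of refinement passes (a simplification)
    % Squared distance from every vector (rows) to every centroid (columns).
    z = zeros(size(d, 2), size(r, 2));
    for j = 1:size(r, 2)
        z(:, j) = sum(bsxfun(@minus, d, r(:, j)) .^ 2, 1)';
    end
    [~, ind] = min(z, [], 2);       % index of the nearest centroid for each vector
    for j = 1:size(r, 2)
        if any(ind == j)            % leave a centroid unchanged if its cell is empty
            r(:, j) = mean(d(:, ind == j), 2);
        end
    end
end

disp(r);
```

The single centroid at (3, 3) is split into (3.03, 3.03) and (2.97, 2.97). After refinement the two copies settle on the two clusters of the toy data, near (5, 5) and (1, 1).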




function test(testdir, n, code)
% Speaker Recognition: Testing Stage
%
% Input:
%   testdir : string name of the directory that contains all test sound files
%   n       : number of test files in testdir
%   code    : codebooks of all trained speakers
%
% Note:
%   Sound files in testdir are supposed to be:
%       s1.wav, s2.wav, ..., sn.wav
%
% Example:
%   >> test('C:\data\test\', 8, code);

for k = 1:n                               % read each sound file
    file = sprintf('%ss%d.wav', testdir, k);

    % audioread replaces wavread, which is no longer available in recent MATLAB releases
    [s, fs] = audioread(file);

    v = mfcc(s, fs);                      % compute the MFCCs

    distmin = inf;
    k1 = 0;

    for l = 1:length(code)                % compute the distortion for each trained codebook
        d = disteu(v, code{l});
        dist = sum(min(d, [], 2)) / size(d, 1);

        if dist < distmin                 % keep the codebook with the lowest distortion
            distmin = dist;
            k1 = l;
        end
    end

    msg = sprintf('Speaker %d matches with speaker %d', k, k1);
    disp(msg);
end
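The decision rule used in the testing stage is to pick the codebook with the smallest average distance from each feature vector to its nearest codeword. It can be tried in isolation on toy data. The feature vectors and the two codebooks below are hypothetical values chosen for illustration, and the distances are computed inline so that the sketch does not depend on the other routines.

```matlab
% Toy 2-D "feature vectors" of an unknown utterance, one per column.
v = [1.0 1.2 0.9;
     1.0 0.8 1.1];

% Hypothetical codebooks of two enrolled speakers (columns are codewords).
code = {[5 6; 5 6], [1 0; 1 2]};

distmin = inf;
best = 0;
for l = 1:length(code)
    cb = code{l};
    d = zeros(size(v, 2), size(cb, 2));      % vector-to-codeword distances
    for j = 1:size(cb, 2)
        d(:, j) = sqrt(sum(bsxfun(@minus, v, cb(:, j)) .^ 2, 1))';
    end
    dist = mean(min(d, [], 2));              % average distance to the nearest codeword
    if dist < distmin
        distmin = dist;
        best = l;
    end
end

fprintf('Closest codebook: speaker %d\n', best);
```

The rule always returns the closest enrolled speaker, even when the utterance comes from someone who was never enrolled. Rejecting unknown speakers would need an additional threshold on `distmin`.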



