MATLAB Speaker Recognition System

1. Introduction to Speaker Recognition

Speaker recognition identifies who is speaking from the speech signal. Speech is a natural human attribute, and because speakers differ both in the physiology of their vocal organs and in their acquired speaking habits, every person's speech carries strongly individual characteristics. This is what makes it possible to identify a speaker by analyzing the speech signal. Using voice for identification has several distinctive advantages: voice is an inherent human faculty that cannot be lost or forgotten; speech signals are easy to capture and the required equipment is inexpensive; and the telephone network makes remote customer service possible. As a result, speaker recognition has attracted growing attention in recent years.

Compared with other biometric technologies such as fingerprint recognition and hand recognition, speaker recognition is convenient and contact-free, so users accept it readily, and among existing biometric technologies it is the only one that can be used for remote verification. Its application prospects are therefore broad. Speaker recognition is an interdisciplinary technology that draws on acoustics, linguistics, computer science, information processing, and artificial intelligence, and progress in each of these fields has advanced it.

The fundamental problem in designing a speaker recognition system is how to extract features that characterize the speaker from the speech signal. Feature vector extraction is the foundation of the whole system and strongly affects both its false rejection rate and its false acceptance rate. Unlike speech recognition, speaker recognition uses the speaker information carried by the signal and ignores the meaning of the words; it emphasizes the speaker's individuality. A single speech feature vector is therefore rarely enough to raise the recognition rate, which makes extracting the key components of the speech signal especially important. The quality of the feature parameters directly determines recognition accuracy.
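The two error rates mentioned above can be made concrete with a small sketch. It assumes a verification setting in which a trial is accepted when its distortion score falls below a threshold; the scores and the threshold are made-up numbers for illustration only, not results from this system.

```matlab
% Hypothetical distortion scores (lower = better match); illustration only.
genuineScores  = [0.8 1.1 0.9 1.6 1.0];   % the claimed speaker is the true speaker
impostorScores = [2.1 1.4 2.5 1.9 3.0];   % someone else claims the identity
threshold = 1.5;                          % assumed decision threshold

% A trial is accepted when its score is below the threshold.
falseRejections = sum(genuineScores  >= threshold);  % true speakers turned away
falseAccepts    = sum(impostorScores <  threshold);  % impostors let through

FRR = falseRejections / numel(genuineScores);   % false rejection rate
FAR = falseAccepts    / numel(impostorScores);  % false acceptance rate

fprintf('FRR = %.2f, FAR = %.2f\n', FRR, FAR);
```

Raising the threshold lowers the false rejection rate at the cost of a higher false acceptance rate, which is why better features, rather than threshold tuning alone, are needed to improve both.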

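The system recognizes speakers by template matching on Mel-frequency cepstral coefficients (MFCC). The project is implemented in MATLAB, which handles acquisition, analysis, feature extraction, matching, and recognition.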
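For readers unfamiliar with the Mel scale behind MFCC, here is a minimal sketch of one widely used Hz-to-mel mapping. The formula and the helper names hz2mel and mel2hz are assumptions introduced for illustration (other variants exist); the 700 Hz constant is the same one that appears in the melfb function of the source listing.

```matlab
% One widely used Hz <-> mel mapping (an assumption here; other variants exist).
hz2mel = @(f) 2595 * log10(1 + f / 700);
mel2hz = @(m) 700 * (10 .^ (m / 2595) - 1);

f = [100 500 1000 2000 4000];      % frequencies in Hz
m = hz2mel(f);                     % corresponding mel values

% Equal steps in mel correspond to increasingly wide steps in Hz,
% which is why mel filterbanks are dense at low frequencies.
edgesMel = linspace(hz2mel(0), hz2mel(4000), 6);
edgesHz  = mel2hz(edgesMel);

disp([f; m]);
disp(edgesHz);
```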

2. Principles of the Speaker Recognition Algorithm

2.1 Overall Framework of the Speaker Recognition System

The overall structure of the speaker recognition system is shown in Figure 1. A speech recording serves as the input signal, and the analog speech input is first preprocessed: pre-filtering, sampling and quantization, windowing, endpoint detection, pre-emphasis, and so on. Preprocessing is followed by an important stage, the extraction of feature parameters.

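Figure 1. Overall block diagram of the speaker recognition system (blocks: speech input, preprocessing, feature extraction, measure estimation, reference module, template library, recognition decision, expert knowledge)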
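Two of the preprocessing steps named above, pre-emphasis and endpoint detection, can be sketched as follows. This is only an illustration under stated assumptions: the input is a synthetic tone surrounded by silence, the pre-emphasis coefficient 0.97 is a typical choice rather than a value from this project, and endpoints are found with a simple energy threshold.

```matlab
fs = 8000;                               % assumed sampling rate in Hz
t  = (0:fs-1)' / fs;                     % one second of sample times
s  = [zeros(2000,1); sin(2*pi*440*t(1:4000)); zeros(2000,1)];  % silence, tone, silence

% Pre-emphasis: first-order high-pass y[n] = x[n] - a*x[n-1].
a = 0.97;                                % typical coefficient (assumed)
y = filter([1 -a], 1, s);

% Short-time energy on non-overlapping frames.
frameLen = 200;
nFrames  = floor(length(y) / frameLen);
frames   = reshape(y(1:nFrames*frameLen), frameLen, nFrames);
energy   = sum(frames .^ 2, 1);

% Endpoint detection: keep frames whose energy exceeds a fraction of the maximum.
active      = energy > 0.1 * max(energy);
firstFrame  = find(active, 1, 'first');
lastFrame   = find(active, 1, 'last');
startSample = (firstFrame - 1) * frameLen + 1;
endSample   = lastFrame * frameLen;

fprintf('Speech detected in samples %d to %d\n', startSample, endSample);
```

Real recordings contain background noise, so a fixed fraction of the maximum energy is a crude criterion; it is used here only to keep the sketch short.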

3. Source Code

function M3 = blockFrames(s, fs, m, n)
% blockFrames: split the signal into frames, window them, and take the FFT
% Inputs:
%   s  contains the signal to analyze
%   fs is the sampling rate of the signal (not used in the computation)
%   m  is the distance between the beginnings of two frames
%   n  is the number of samples per frame
% Output:
%   M3 is a matrix containing the spectrum of every frame, one per column

l = length(s);                          % length of the speech signal
nbFrame = floor((l - n) / m) + 1;       % number of frames

M = zeros(n, nbFrame);                  % preallocate the frame matrix
for i = 1:n
    for j = 1:nbFrame
        M(i, j) = s(((j - 1) * m) + i); % copy sample i of frame j
    end
end

h = hamming(n);                         % requires the Signal Processing Toolbox
M2 = diag(h) * M;                       % apply a Hamming window to each frame

M3 = zeros(n, nbFrame);
for i = 1:nbFrame
    M3(:, i) = fft(M2(:, i));           % short-time Fourier transform
end
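The framing and windowing in blockFrames can also be written without loops or toolbox functions. The following is a minimal sketch, assuming a toy ramp signal in place of real speech; the Hamming window is written out from its standard symmetric formula instead of calling hamming.

```matlab
s = (1:1000)';            % toy signal; replace with real speech samples
m = 100;                  % frame shift in samples
n = 256;                  % frame length in samples

nbFrame = floor((length(s) - n) / m) + 1;

% Index matrix: column j holds the sample indices of frame j.
idx = bsxfun(@plus, (1:n)', (0:nbFrame-1) * m);
M   = s(idx);

% Symmetric Hamming window written out explicitly.
h  = 0.54 - 0.46 * cos(2 * pi * (0:n-1)' / (n - 1));
M2 = bsxfun(@times, M, h);      % apply the window to every frame

M3 = fft(M2);                   % fft operates column-wise, so no loop is needed
```

Because the first sample of frame j is s((j-1)*m + 1), the second column of M starts at sample 101 for m = 100, matching the indexing in the loop version.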

function code = train(traindir, n)
% Speaker Recognition: Training Stage
% Computes a VQ codebook for each wav file
% Input:
%   traindir : string name of the directory that contains all training sound files
%   n        : number of training files in traindir
% Output:
%   code     : trained VQ codebooks, code{i} for the i-th speaker
% Note:
%   Sound files in traindir are supposed to be:
%   s1.wav, s2.wav, ..., sn.wav
% Example:
%   >> code = train('C:\data\train\', 8);

k = 16;                                 % number of centroids required
code = cell(1, n);

for i = 1:n                             % train a VQ codebook for each speaker
    file = sprintf('%ss%d.wav', traindir, i);
    disp(file);
    % wavread has been removed from recent MATLAB releases; audioread replaces it
    [s, fs] = audioread(file);
    v = mfcc(s, fs);                    % compute MFCCs
    code{i} = vqlbg(v, k);              % train VQ codebook
end

function d = disteu(x, y)
% DISTEU Pairwise Euclidean distances between columns of two matrices
%        (used to measure distortion)
% Input:
%   x, y: two matrices in which each column is a data vector
% Output:
%   d: element d(i,j) is the Euclidean distance between the column
%      vectors x(:,i) and y(:,j)
% Note:
%   The Euclidean distance D between two vectors x and y is:
%   D = sum((x-y).^2).^0.5

[M, N] = size(x);
[M2, P] = size(y);

if (M ~= M2)
    error('Matrix dimensions do not match.')
end

d = zeros(N, P);

if (N < P)
    copies = zeros(1, P);
    for n = 1:N
        % replicate column n of x P times and compare against all of y
        d(n, :) = sum((x(:, n + copies) - y) .^ 2, 1);
    end
else
    copies = zeros(1, N);
    for p = 1:P
        % replicate column p of y N times and compare against all of x
        d(:, p) = sum((x - y(:, p + copies)) .^ 2, 1)';
    end
end

d = d .^ 0.5;
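As a cross-check on disteu, the same pairwise distances can be computed without any loop from the identity ||x - y||^2 = ||x||^2 + ||y||^2 - 2*x'*y. This is an alternative sketch on small hand-picked vectors, not the listing's method.

```matlab
x = [0 3; 0 4];           % two column vectors: (0,0) and (3,4)
y = [0 6 3; 0 8 0];       % three column vectors: (0,0), (6,8), (3,0)

% Squared distances for all column pairs at once.
xx = sum(x .^ 2, 1)';                       % N-by-1 squared norms of x columns
yy = sum(y .^ 2, 1);                        % 1-by-P squared norms of y columns
d2 = bsxfun(@plus, xx, yy) - 2 * (x' * y);  % N-by-P squared distances
d  = sqrt(max(d2, 0));                      % clamp tiny negatives from round-off

disp(d);
```

Row i of d holds the distances from column i of x to every column of y, the same layout that disteu returns.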

function m = melfb(p, n, fs)
% MELFB Determine matrix for a mel-spaced filterbank
% Inputs:  p   number of filters in the filterbank
%          n   length of the FFT
%          fs  sample rate in Hz
% Outputs: m   a (sparse) matrix containing the filterbank amplitudes
%              size(m) = [p, 1+floor(n/2)]
% Usage: For example, to compute the mel-scale spectrum of a
%        column-vector signal s, with length n and sample rate fs:
%          f = fft(s);
%          m = melfb(p, n, fs);
%          n2 = 1 + floor(n/2);
%          z = m * abs(f(1:n2)).^2;
%        z would contain p samples of the desired mel-scale spectrum
%        To plot filterbanks, e.g.:
%          plot(linspace(0, (12500/2), 129), melfb(20, 256, 12500)'),
%          title('Mel-spaced filterbank'), xlabel('Frequency (Hz)');

f0 = 700 / fs;
fn2 = floor(n/2);
lr = log(1 + 0.5/f0) / (p+1);

% convert to fft bin numbers with 0 for DC term
bl = n * (f0 * (exp([0 1 p p+1] * lr) - 1));
b1 = floor(bl(1)) + 1;
b2 = ceil(bl(2));
b3 = floor(bl(3));
b4 = min(fn2, ceil(bl(4))) - 1;

pf = log(1 + (b1:b4)/n/f0) / lr;
fp = floor(pf);
pm = pf - fp;

r = [fp(b2:b4) 1+fp(1:b3)];
c = [b2:b4 1:b3] + 1;
v = 2 * [1-pm(b2:b4) pm(1:b3)];
m = sparse(r, c, v, p, 1+fn2);

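function r = mfcc(s, fs)
% MFCC Compute Mel-frequency cepstral coefficients
% Inputs: s  contains the signal to analyze (vector of sound samples)
%         fs is the sampling rate of the signal
% Output: r  contains the transformed signal (one column per frame)

m = 100;                            % frame shift in samples
n = 256;                            % frame length in samples
l = length(s);
nbFrame = floor((l - n) / m) + 1;   % number of frames

M = zeros(n, nbFrame);
for i = 1:n
    for j = 1:nbFrame
        M(i, j) = s(((j - 1) * m) + i);
    end
end

h = hamming(n);
M2 = diag(h) * M;                   % apply a Hamming window to each frame

frame = zeros(n, nbFrame);
for i = 1:nbFrame
    frame(:, i) = fft(M2(:, i));    % spectrum of each frame
end

m = melfb(20, n, fs);               % 20-filter mel filterbank (m is reused here)
n2 = 1 + floor(n / 2);
z = m * abs(frame(1:n2, :)).^2;     % mel-scale power spectrum
r = dct(log(z));                    % cepstral coefficients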

function r = vqlbg(d, k)
% VQLBG Vector quantization using the Linde-Buzo-Gray (LBG) algorithm
% Inputs:
%   d contains training data vectors (one per column)
%   k is the number of centroids required
% Output:
%   r contains the resulting VQ codebook (k columns, one per centroid)

e = .01;                               % splitting and convergence parameter
r = mean(d, 2);                        % start from a single centroid
dpr = 10000;                           % previous total distortion

for i = 1:log2(k)
    r = [r*(1+e), r*(1-e)];            % split every centroid into two
    while (1 == 1)
        z = disteu(d, r);
        [~, ind] = min(z, [], 2);      % nearest centroid for each vector
        t = 0;
        for j = 1:2^i
            r(:, j) = mean(d(:, ind == j), 2);     % update centroid j
            x = disteu(d(:, ind == j), r(:, j));
            t = t + sum(x);            % accumulate total distortion
        end
        if (((dpr - t)/t) < e)
            break;                     % distortion has stopped improving
        else
            dpr = t;
        end
    end
end
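To see what one split of the LBG procedure does, the sketch below runs a single split-and-refine step on hand-made 2-D data with two obvious clusters. It is a simplification under stated assumptions: a fixed number of refinement passes replaces the distortion-based stopping test of vqlbg, and the data are invented for illustration.

```matlab
d = [0 0.2 0.1 5 5.2 5.1;
     0 0.1 0.2 5 5.1 5.2];       % toy data: three points near 0, three near 5
e = 0.01;                        % same perturbation factor as in vqlbg

r = mean(d, 2);                  % single initial centroid
r = [r * (1 + e), r * (1 - e)];  % split into two slightly perturbed copies

for iter = 1:10                  % fixed pass count (a simplification)
    % Assign every vector to the nearer of the two centroids.
    d1  = sum(bsxfun(@minus, d, r(:, 1)) .^ 2, 1);
    d2  = sum(bsxfun(@minus, d, r(:, 2)) .^ 2, 1);
    ind = 1 + (d2 < d1);
    % Move each centroid to the mean of its assigned vectors.
    for j = 1:2
        r(:, j) = mean(d(:, ind == j), 2);
    end
end

disp(r);
```

After the split, the centroid perturbed upward captures the cluster near 5 and the other captures the cluster near 0; repeating the split log2(k) times yields a codebook of k centroids.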

function test(testdir, n, code)
% Speaker Recognition: Testing Stage
% Input:
%   testdir : string name of the directory that contains all test sound files
%   n       : number of test files in testdir
%   code    : codebooks of all trained speakers
% Note:
%   Sound files in testdir are supposed to be:
%   s1.wav, s2.wav, ..., sn.wav
% Example:
%   >> test('C:\data\test\', 8, code);

for k = 1:n                             % read each test file
    file = sprintf('%ss%d.wav', testdir, k);
    % wavread has been removed from recent MATLAB releases; audioread replaces it
    [s, fs] = audioread(file);
    v = mfcc(s, fs);                    % compute MFCCs

    distmin = inf;
    k1 = 0;

    for l = 1:length(code)              % distortion against each trained codebook
        d = disteu(v, code{l});
        dist = sum(min(d, [], 2)) / size(d, 1);
        if dist < distmin
            distmin = dist;
            k1 = l;
        end
    end

    msg = sprintf('Speaker %d matches with speaker %d', k, k1);
    disp(msg);
end
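The decision rule in test, choosing the codebook with the smallest average distance to the nearest centroid, can be tried without audio files. The sketch below assumes toy 2-D vectors in place of MFCC frames and two hypothetical two-centroid codebooks; all values are invented for illustration.

```matlab
% Toy 2-D "feature vectors" (one per column) standing in for MFCC frames.
v = [1.0 1.2 0.9 1.1;
     1.0 0.8 1.1 0.9];

% Two hypothetical codebooks, one per enrolled speaker (centroids as columns).
code = { [1 5; 1 5], [4 6; 4 6] };

distmin = inf;
best = 0;
for l = 1:numel(code)
    c = code{l};
    % Distance from every frame to every centroid of this codebook.
    d = zeros(size(v, 2), size(c, 2));
    for j = 1:size(c, 2)
        d(:, j) = sqrt(sum(bsxfun(@minus, v, c(:, j)) .^ 2, 1))';
    end
    dist = mean(min(d, [], 2));   % average distance to the nearest centroid
    if dist < distmin
        distmin = dist;
        best = l;
    end
end

fprintf('Closest codebook: %d (average distortion %.3f)\n', best, distmin);
```

The toy frames cluster around (1,1), which is a centroid of the first codebook, so the first codebook gives the lower average distortion and is selected.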

Original: https://blog.csdn.net/m0_60677550/article/details/120264305
Author: 技术狼灭
Title: MATLAB说话人识别系统 (MATLAB Speaker Recognition System)
