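1. What this post covers
- This post introduces an attention-based beamformer (multi-microphone beamforming).
- It reproduces the code for the method described in the paper.
- Only the beamformer itself is reproduced; the code that integrates it into ASR (WeNet) will be open-sourced on my GitHub later.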
2. Paper details
- Reference: Gong, R., et al. “Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition.” 2021.
- arXiv: https://arxiv.org/abs/2109.04783
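3. Reading the paper
- Network structure (the idea is quite simple)
(1) Bottom: features similar to a mel spectrogram.
(2) Right: compute cross-attention between the channels, then apply a softmax function to obtain one weight per channel.
(3) Top: take the weighted sum of the channels' mel spectrograms and use it as the input to the ASR model.
- Results
(1) The paper claims to be the most SOTA among beamformers, but its granularity is still fairly coarse. (There are ways to make it finer, which could push the results further.)
(2) The absolute improvement is 1-2%.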
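The three steps of the network structure can be sketched without any deep-learning framework. The following is a minimal NumPy illustration, not the paper's implementation: it assumes a single attention head, random untrained projection matrices, and hypothetical names (`combine_channels`, `w_q`, `w_k`, `w_v`) chosen only for this example.

```python
import numpy as np


def softmax(z, axis):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def combine_channels(feats, w_q, w_k, w_v):
    """Fuse multi-channel features into a single stream.

    feats: [frames, channels, n_mels] mel-like features (step 1).
    w_q, w_k: [n_mels, d_k] projections; w_v: [n_mels, 1] projection.
    Returns (fused [frames, n_mels], weights [frames, channels]).
    """
    q = feats @ w_q  # [frames, channels, d_k]
    k = feats @ w_k  # [frames, channels, d_k]
    v = feats @ w_v  # [frames, channels, 1]
    d_k = w_q.shape[1]
    # Step 2a: cross-channel attention, computed independently per frame.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_k)  # [frames, channels, channels]
    attn = softmax(scores, axis=-1)
    context = (attn @ v)[..., 0]  # [frames, channels]
    # Step 2b: softmax over the channel axis gives one weight per channel.
    weights = softmax(context, axis=1)
    # Step 3: weighted sum of the channel features.
    fused = (weights[..., None] * feats).sum(axis=1)  # [frames, n_mels]
    return fused, weights


rng = np.random.default_rng(0)
frames, channels, n_mels, d_k = 10, 8, 40, 16
feats = rng.standard_normal((frames, channels, n_mels))
w_q = rng.standard_normal((n_mels, d_k)) * 0.1
w_k = rng.standard_normal((n_mels, d_k)) * 0.1
w_v = rng.standard_normal((n_mels, 1)) * 0.1
fused, weights = combine_channels(feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Because the softmax runs over the channel axis, the weights of each frame are positive and sum to 1, so the fused feature is a convex combination of the per-channel features.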
4. Code (beamformer part only; for the integration into WeNet ASR, see my later GitHub update)
import torch
import torch.nn as nn
import numpy as np
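class beamformer_attention(nn.Module):
    """Attention-based channel combinator: fuses multi-channel fbank features into one stream."""

    def __init__(self, d_model, d_k, d_v, n_heads):
        super(beamformer_attention, self).__init__()
        self.muttiheadatt = MultiHeadAttention(d_model, d_k, d_v, n_heads)

    def forward(self, x):
        # x: [seq_len, channels, fbank]; seq_len acts as the batch axis,
        # so attention is computed across channels within each frame.
        attain = self.muttiheadatt(x, x, x)  # [seq_len, channels, d_v]
        # Normalize over the channel axis (dim=1) so the channel weights sum to 1.
        softmax_att = nn.Softmax(dim=1)(attain)
        # Weighted sum of the channel features.
        output = torch.matmul(x.transpose(-1, -2), softmax_att)  # [seq_len, fbank, d_v]
        length = output.size(0)
        output = output.view(length, -1)  # [seq_len, fbank * d_v]
        return output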
class ScaledDotProductAttentionMask(nn.Module):
    def __init__(self, d_k):
        super(ScaledDotProductAttentionMask, self).__init__()
        self.d_k = d_k

    def forward(self, Q, K, V):
        # scores: [batch_size x n_heads x len_q x len_k]
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(self.d_k)
        attn = nn.Softmax(dim=-1)(scores)
        context = torch.matmul(attn, V)  # [batch_size x n_heads x len_q x d_v]
        return context, attn
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v, n_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        self.n_heads = n_heads
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)
        self.layer_norm = nn.LayerNorm(d_model)
        self.concat = nn.Linear(n_heads * d_v, d_v)

    def forward(self, Q, K, V):
        batch_size = Q.size(0)
        # Project first, then split into heads; Q and K both use d_k.
        q_s = self.W_Q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)  # q_s: [batch_size x n_heads x len_q x d_k]
        k_s = self.W_K(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)  # k_s: [batch_size x n_heads x len_k x d_k]
        v_s = self.W_V(V).view(batch_size, -1, self.n_heads, self.d_v).transpose(1, 2)  # v_s: [batch_size x n_heads x len_k x d_v]
        # An attn_mask of shape [batch_size x len_q x len_k] would be repeated across the heads
        # to [batch_size x n_heads x len_q x len_k] by the line below; masking is not used here.
        # attn_mask = attn_mask.unsqueeze(1).repeat(1, self.n_heads, 1, 1)
        context, attn = ScaledDotProductAttentionMask(self.d_k)(q_s, k_s, v_s)
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_v)  # context: [batch_size x len_q x n_heads * d_v]
        # output = self.layer_norm(context)
        output = self.concat(context)
        return output  # output: [batch_size x len_q x d_v]
if __name__ == '__main__':
    beamformer = beamformer_attention(40, 64, 1, 6)  # d_model=40, d_k=64, d_v=1, n_heads=6
    x = torch.ones((10, 8, 40), dtype=torch.float32)  # [seq_len, channels, fbank]
    print(x)
    output = beamformer(x)
    print(output)
    print(x.size(), '[seq_length,channels,fbank]')
    print(output.size(), '[seq_length,fbank]')
    print('succeeded')
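For comparison, the same idea can be written more compactly with PyTorch's built-in `nn.MultiheadAttention`. This is only a sketch under assumptions, not the author's or the paper's code: the class name `ChannelCombinator` and the extra `score` layer are hypothetical, the `batch_first` argument assumes PyTorch 1.9 or later, and the feature size must be divisible by the number of heads.

```python
import torch
import torch.nn as nn


class ChannelCombinator(nn.Module):
    """Hypothetical compact variant built on nn.MultiheadAttention."""

    def __init__(self, n_feats, n_heads):
        super().__init__()
        # embed_dim (n_feats) must be divisible by n_heads.
        self.attn = nn.MultiheadAttention(n_feats, n_heads, batch_first=True)
        self.score = nn.Linear(n_feats, 1)  # one scalar score per channel

    def forward(self, x):
        # x: [seq_len, channels, n_feats]; frames act as the batch axis,
        # so attention is computed across channels within each frame.
        context, _ = self.attn(x, x, x)  # [seq_len, channels, n_feats]
        weights = torch.softmax(self.score(context), dim=1)  # [seq_len, channels, 1]
        fused = (weights * x).sum(dim=1)  # [seq_len, n_feats]
        return fused, weights.squeeze(-1)


torch.manual_seed(0)
model = ChannelCombinator(n_feats=40, n_heads=4)
x = torch.randn(10, 8, 40)  # [seq_len, channels, fbank]
fused, weights = model(x)
print(fused.shape, weights.shape)
```

Returning the weights alongside the fused features makes it easy to inspect which microphone channels the model relies on for each frame.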
Original: https://blog.csdn.net/qq_37258753/article/details/126427181
Author: 方付平
Title: Far-field multi-array speech recognition (远场多阵列语音识别)