Far-field Multi-array Speech Recognition

1. What this post covers
  1. This post introduces an attention-based beamformer (multi-microphone beamforming).
  2. It reproduces the code from the paper.
  3. Only the beamformer code is reproduced here; the code integrating it into ASR (WeNet) will be open-sourced on my GitHub later.
2. Paper details
  1. Citation: Gong, R., et al. "Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-field Speech Recognition." 2021.

  2. arXiv: https://arxiv.org/abs/2109.04783

3. Reading the paper
  1. Network structure (the idea is quite simple; a formula sketch follows this list)
    [Figure: network structure of the self-attention channel combinator, from the paper]
    (1) Bottom: mel-spectrogram-like features, one per channel.
    (2) Right: compute cross-attention across the channels, then apply a softmax to assign a weight to each channel.
    (3) Top: take the weighted sum of the multi-channel mel spectrograms as the input to the ASR.
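    In formula form (my own notation, not copied from the paper): let X_c(t, f) be the fbank feature of channel c at frame t and frequency bin f. The attention branch yields one weight per channel and frame, and the combinator outputs the weighted sum:

    w_c(t) = \mathrm{softmax}_c\left(\mathrm{MHA}(X, X, X)\right), \qquad Y(t, f) = \sum_{c=1}^{C} w_c(t)\, X_c(t, f)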
  2. Results
    [Figure: result tables from the paper]
    (1) The authors claim state-of-the-art performance among beamformer frontends, but the weighting granularity is still quite coarse (a single scalar weight per channel per frame); there are ways to reduce the granularity and improve further, see the sketch after this list.
    (2) The reported absolute improvement is 1-2%.
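    As a sketch of what "reducing the granularity" could mean (purely my own assumption, not the paper's method): predict a weight per channel and per frequency bin instead of a single scalar per channel. The class name PerBinChannelCombinator and its projection layer are hypothetical.

    import torch
    import torch.nn as nn

    class PerBinChannelCombinator(nn.Module):
        """Hypothetical finer-grained variant: one weight per channel AND per bin."""
        def __init__(self, n_fbank):
            super().__init__()
            # Maps each channel's fbank frame to per-bin weight logits.
            self.proj = nn.Linear(n_fbank, n_fbank)
        def forward(self, x):
            # x: [seq_len, channels, fbank]
            logits = self.proj(x)                   # [seq_len, channels, fbank]
            weights = torch.softmax(logits, dim=1)  # normalize over channels, per bin
            return (weights * x).sum(dim=1)         # [seq_len, fbank]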
4. Code (beamformer part only; for the integration into WeNet ASR, see my later GitHub update)
import torch
import torch.nn as nn
import numpy as np
class BeamformerAttention(nn.Module):
    """Self-attention channel combinator: derives one weight per channel from
    multi-head self-attention, then sums the channels with those weights."""
    def __init__(self, d_model, d_k, d_v, n_heads):
        super(BeamformerAttention, self).__init__()
        self.multihead_att = MultiHeadAttention(d_model, d_k, d_v, n_heads)
    def forward(self, x):
        # x: [seq_len, channels, fbank]
        att = self.multihead_att(x, x, x)  # [seq_len, channels, 1]
        # Softmax over the channel dimension, so the channel weights sum to 1 at each frame.
        softmax_att = nn.Softmax(dim=1)(att)
        # Weighted sum over channels: [seq_len, fbank, channels] @ [seq_len, channels, 1]
        output = torch.matmul(x.transpose(-1, -2), softmax_att)  # [seq_len, fbank, 1]
        return output.squeeze(-1)  # [seq_len, fbank]
class ScaledDotProductAttention(nn.Module):
    def __init__(self, d_k):
        super(ScaledDotProductAttention, self).__init__()
        self.d_k = d_k
    def forward(self, Q, K, V):
        # scores: [batch_size, n_heads, len_q, len_k]
        scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(self.d_k)
        attn = nn.Softmax(dim=-1)(scores)
        context = torch.matmul(attn, V)  # [batch_size, n_heads, len_q, d_v]
        return context, attn
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, d_k, d_v, n_heads):
        super(MultiHeadAttention, self).__init__()
        self.d_model = d_model
        self.d_k = d_k
        self.d_v = d_v
        self.n_heads = n_heads
        self.W_Q = nn.Linear(d_model, d_k * n_heads)
        self.W_K = nn.Linear(d_model, d_k * n_heads)
        self.W_V = nn.Linear(d_model, d_v * n_heads)
        self.attention = ScaledDotProductAttention(d_k)
        # Merge the heads back down to d_v outputs per position.
        self.concat = nn.Linear(n_heads * d_v, d_v)
    def forward(self, Q, K, V):
        batch_size = Q.size(0)
        # Project first, then split into heads (dimension d_k per head, d_v for V).
        q_s = self.W_Q(Q).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)  # [batch_size, n_heads, len_q, d_k]
        k_s = self.W_K(K).view(batch_size, -1, self.n_heads, self.d_k).transpose(1, 2)  # [batch_size, n_heads, len_k, d_k]
        v_s = self.W_V(V).view(batch_size, -1, self.n_heads, self.d_v).transpose(1, 2)  # [batch_size, n_heads, len_k, d_v]
        context, attn = self.attention(q_s, k_s, v_s)
        # Re-assemble the heads: [batch_size, len_q, n_heads * d_v]
        context = context.transpose(1, 2).contiguous().view(batch_size, -1, self.n_heads * self.d_v)
        output = self.concat(context)
        return output  # [batch_size, len_q, d_v]

if __name__ == '__main__':
    beamformer = BeamformerAttention(d_model=40, d_k=64, d_v=1, n_heads=6)
    feats = torch.ones((10, 8, 40), dtype=torch.float32)  # [seq_len, channels, fbank]
    output = beamformer(feats)
    print(feats.size(), '[seq_len, channels, fbank]')
    print(output.size(), '[seq_len, fbank]')
    print('succeed')
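
A note on shapes: the module treats the frame axis as the batch axis, so a batch of utterances has to be folded into it first. A minimal sketch (the batching convention here is my own assumption, not from the paper or WeNet):

    # Hypothetical batched usage: fold [batch, seq_len] into one axis, combine, unfold.
    beamformer = BeamformerAttention(d_model=40, d_k=64, d_v=1, n_heads=6)
    batch = torch.randn(4, 10, 8, 40)  # [batch, seq_len, channels, fbank]
    b, t, c, f = batch.shape
    combined = beamformer(batch.reshape(b * t, c, f))  # [batch * seq_len, fbank]
    combined = combined.view(b, t, f)  # [batch, seq_len, fbank]
    print(combined.size(), '[batch, seq_len, fbank]')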

Original: https://blog.csdn.net/qq_37258753/article/details/126427181
Author: 方付平
Title: 远场多阵列语音识别 (Far-field Multi-array Speech Recognition)
