1. Temporal Classification
Suppose the training set $S$ is drawn from a distribution $D_{\mathcal{X}\times\mathcal{Z}}$, where $\mathcal{X}=(\mathbb{R}^m)^*$ is the space of input sequences and $\mathcal{Z}=L^*$ is the space of target sequences. Each training example is a pair of sequences $(\mathbf{x},\mathbf{z})$, where the length $U$ of the target sequence $\mathbf{z}=(z_1,z_2,\ldots,z_U)$ is at most the length $T$ of the input sequence $\mathbf{x}=(x_1,x_2,\ldots,x_T)$.
Let $S'$ denote the test set and $h$ a temporal classifier. The label error rate $LER$ is defined as:

$$LER(h,S')=\frac{1}{Z}\sum_{(\mathbf{x},\mathbf{z})\in S'}ED(h(\mathbf{x}))\tag{1}$$

where $Z$ is the number of test samples and $ED(h(\mathbf{x}))$ is the edit distance between the classifier output $h(\mathbf{x})$ and the target $\mathbf{z}$.
Edit distance: the minimum number of edit operations needed to transform one string into the other, where the allowed operations are substitution, insertion, and deletion.
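As a concrete reference, here is a minimal Python sketch of the edit distance and of $LER$ as defined in equation $(1)$; the function names are illustrative, and $ED$ is taken as the distance between $h(\mathbf{x})$ and the target $\mathbf{z}$:

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions, and deletions
    needed to turn sequence a into sequence b (Levenshtein distance)."""
    dp = list(range(len(b) + 1))          # dp[j] = distance(a[:i], b[:j])
    for i in range(1, len(a) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(b) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (a[i - 1] != b[j - 1]))      # substitution
            prev = cur
    return dp[-1]

def label_error_rate(h, test_set):
    """LER(h, S') per eq. (1): mean edit distance over the test set."""
    return sum(edit_distance(h(x), z) for x, z in test_set) / len(test_set)
```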
2. Connectionist Temporal Classification (CTC)
The last layer of a CTC network is a softmax output layer with $L+1$ nodes, where the extra node outputs the probability of the blank label ($blank$). Consider a recurrent neural network $\mathcal{N}_w$ with weights $w$, input $\mathbf{x}\in(\mathbb{R}^m)^T$, and output $\mathbf{y}\in(\mathbb{R}^n)^T$, defining the mapping:

$$\mathbf{y}=\mathcal{N}_w(\mathbf{x})\tag{2}$$
$y_k^t$ denotes the output of the $k$-th node at time $t$, which can be interpreted as the probability of the $k$-th label of the alphabet $L'=L\cup\{blank\}$. Assuming the output layer has no feedback connections, the output distributions $y^t$ of $\mathcal{N}_w$ at different times $t$ are conditionally independent. The probability of any sequence $\pi\in(L')^T$ ($\pi$ is also called a path) given the input $\mathbf{x}$ is then:

$$p(\pi|\mathbf{x})=\prod_{t=1}^T y_{\pi_t}^t\tag{3}$$
Why are the $y_{\pi_t}^t$ conditionally independent? (This is precisely the assumption stated above: with no feedback connections from the output layer, the hidden activations are a deterministic function of $\mathbf{x}$, so given $\mathbf{x}$ nothing couples the output distributions at different timesteps.)
Define a many-to-one mapping $\mathcal{B}:(L')^T\rightarrow L^{\leq T}$ that removes the blanks and repeated labels from a sequence (e.g. $\mathcal{B}(a{-}ab{-})=\mathcal{B}({-}aa{-}{-}abb)=aab$). Via the preimage $\mathcal{B}^{-1}$ we can define the probability of a label sequence $\mathbf{l}\in L^{\leq T}$ given the input $\mathbf{x}$:

$$p(\mathbf{l}|\mathbf{x})=\sum_{\pi\in\mathcal{B}^{-1}(\mathbf{l})}p(\pi|\mathbf{x})\tag{4}$$
Does $\mathcal{B}$ first collapse repeated labels and then delete the blanks? (Yes; the order matters, since deleting blanks first would wrongly merge genuinely repeated labels, e.g. $\mathcal{B}(a{-}a)=aa$, not $a$.)
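The behaviour of $\mathcal{B}$, including the order of its two steps, can be checked with a small Python sketch (the function name and the `'-'` blank symbol are illustrative choices):

```python
def ctc_collapse(path, blank='-'):
    """B: collapse repeated labels first, then remove blanks.
    E.g. B(a-ab-) = B(-aa--abb) = aab."""
    out = []
    for i, s in enumerate(path):
        if i > 0 and s == path[i - 1]:   # step 1: collapse repeats
            continue
        if s == blank:                   # step 2: drop blanks
            continue
        out.append(s)
    return ''.join(out)
```

Because repeats are collapsed before blanks are removed, a blank between two identical labels keeps them distinct: `ctc_collapse("a-a")` yields `"aa"`, while `ctc_collapse("aa")` yields `"a"`.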
Based on equation $(4)$, the output of the temporal classifier is defined as:

$$h(\mathbf{x})=\arg\max_{\mathbf{l}\in L^{\leq T}}p(\mathbf{l}|\mathbf{x})\tag{5}$$
Borrowing a term from hidden Markov models, we call the classifier's mapping from input sequences to label sequences decoding. Since the number of paths $\pi$ in the preimage $\mathcal{B}^{-1}(\mathbf{l})$ of a label sequence $\mathbf{l}$ is exponential, the decoding rule in equation $(5)$ cannot be computed directly. In practice, it is approximated in one of the following two ways.
The first is best path decoding:

$$h(\mathbf{x})\approx\mathcal{B}(\pi^*),\quad \pi^*=\arg\max_{\pi\in (L')^T}p(\pi|\mathbf{x})\tag{6}$$
Best path decoding is computationally negligible, but it is not guaranteed to find the most probable label sequence.
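A minimal sketch of best path decoding (equation $(6)$), assuming $\mathbf{y}$ is a $T\times(L+1)$ matrix of output probabilities with the blank at index 0 (an arbitrary convention for this example):

```python
import numpy as np

def best_path_decode(y, blank=0):
    """Greedy decoding: take the most probable label at every frame,
    then apply B (collapse repeats, drop blanks)."""
    path = np.argmax(y, axis=1)        # argmax label per frame
    out, prev = [], None
    for k in path:
        if k != prev and k != blank:
            out.append(int(k))
        prev = k
    return out
```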
The second is prefix search decoding, a forward-backward style algorithm. Given enough time, prefix search decoding always finds the most probable label sequence.
3. Training the Network
Define the forward variable $\alpha_t(s)$ as:

$$\alpha_t(s)=\sum_{\substack{\pi\in (L')^T\\\mathcal{B}(\pi_{1:t})=\mathbf{l}_{1:s}}}\prod_{t'=1}^{t}y_{\pi_{t'}}^{t'}\tag{7}$$
To account for blanks in the output paths, a modified label sequence $\mathbf{l}'$ of length $2|\mathbf{l}|+1$ is formed by inserting a blank before and after every label of $\mathbf{l}$. The forward variable on $\mathbf{l}'$ is initialised as:

$$\begin{cases}\alpha_1(1)=y_b^1\\\alpha_1(2)=y_{\mathbf{l}_1}^1\\\alpha_1(s)=0,\ \forall s>2\end{cases}\tag{8}$$
The forward recursion is:

$$\alpha_t(s)=\begin{cases}\overline{\alpha}_t(s)\,y_{\mathbf{l}'_s}^t,&\text{if }\mathbf{l}'_s=b\ \text{or}\ \mathbf{l}'_{s-2}=\mathbf{l}'_s\\\left(\overline{\alpha}_t(s)+\alpha_{t-1}(s-2)\right)y_{\mathbf{l}'_s}^t,&\text{otherwise}\end{cases}\tag{9}$$

where $\overline{\alpha}_t(s)=\alpha_{t-1}(s)+\alpha_{t-1}(s-1)$.
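Equations $(8)$ and $(9)$ translate almost line for line into code. A Python/NumPy sketch, assuming the blank at index 0 and at least one label (`ctc_forward` is an illustrative name); the second return value anticipates equation $(10)$ below. No rescaling is applied, so it is only suitable for short sequences:

```python
import numpy as np

def ctc_forward(y, labels, blank=0):
    """Forward variables alpha_t(s) on the extended sequence l', eqs. (8)-(9)."""
    T = y.shape[0]
    ext = [blank]                         # l' = (b, l_1, b, l_2, ..., l_U, b)
    for k in labels:
        ext += [k, blank]
    S = len(ext)                          # 2|l| + 1
    alpha = np.zeros((T, S))
    alpha[0, 0] = y[0, blank]             # alpha_1(1) = y_b^1
    alpha[0, 1] = y[0, ext[1]]            # alpha_1(2) = y_{l_1}^1
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1, s] + (alpha[t - 1, s - 1] if s > 0 else 0.0)
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]  # the skip transition of eq. (9)
            alpha[t, s] = a * y[t, ext[s]]
    # p(l|x) = alpha_T(|l'|) + alpha_T(|l'|-1), eq. (10)
    return alpha, alpha[-1, -1] + alpha[-1, -2]
```

For a tiny example with $T=2$ and $\mathbf{l}=(1)$, the result can be checked by brute force over the three paths $(1,1),(b,1),(1,b)$ that $\mathcal{B}$ maps to $\mathbf{l}$.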
The conditional probability $p(\mathbf{l}|\mathbf{x})$ can be expressed with the forward variables as:

$$p(\mathbf{l}|\mathbf{x})=\alpha_T(|\mathbf{l}'|)+\alpha_T(|\mathbf{l}'|-1)\tag{10}$$
Define the backward variable $\beta_t(s)$ as:

$$\beta_t(s)=\sum_{\substack{\pi\in (L')^T\\\mathcal{B}(\pi_{t:T})=\mathbf{l}_{s:|\mathbf{l}|}}}\prod_{t'=t}^{T}y_{\pi_{t'}}^{t'}\tag{11}$$
The backward variable on $\mathbf{l}'$ is initialised as:

$$\begin{cases}\beta_T(|\mathbf{l}'|)=y_b^T\\\beta_T(|\mathbf{l}'|-1)=y_{\mathbf{l}_{|\mathbf{l}|}}^T\\\beta_T(s)=0,\ \forall s<|\mathbf{l}'|-1\end{cases}\tag{12}$$
The backward recursion is:

$$\beta_t(s)=\begin{cases}\overline{\beta}_t(s)\,y_{\mathbf{l}'_s}^t,&\text{if }\mathbf{l}'_s=b\ \text{or}\ \mathbf{l}'_{s+2}=\mathbf{l}'_s\\\left(\overline{\beta}_t(s)+\beta_{t+1}(s+2)\right)y_{\mathbf{l}'_s}^t,&\text{otherwise}\end{cases}\tag{13}$$

where $\overline{\beta}_t(s)=\beta_{t+1}(s)+\beta_{t+1}(s+1)$.
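The backward recursion mirrors the forward one. A self-contained sketch under the same assumptions (blank at index 0, at least one label, no rescaling); by the definition of $\beta$, $p(\mathbf{l}|\mathbf{x})=\beta_1(1)+\beta_1(2)$, which gives an easy consistency check against the forward pass:

```python
import numpy as np

def ctc_backward(y, labels, blank=0):
    """Backward variables beta_t(s) on the extended sequence l', eqs. (12)-(13)."""
    T = y.shape[0]
    ext = [blank]
    for k in labels:
        ext += [k, blank]
    S = len(ext)
    beta = np.zeros((T, S))
    beta[-1, -1] = y[-1, blank]           # beta_T(|l'|)   = y_b^T
    beta[-1, -2] = y[-1, ext[-2]]         # beta_T(|l'|-1) = y_{l_|l|}^T
    for t in range(T - 2, -1, -1):
        for s in range(S):
            b = beta[t + 1, s] + (beta[t + 1, s + 1] if s < S - 1 else 0.0)
            if s < S - 2 and ext[s] != blank and ext[s] != ext[s + 2]:
                b += beta[t + 1, s + 2]   # the skip transition of eq. (13)
            beta[t, s] = b * y[t, ext[s]]
    return beta
```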
In practice, the forward and backward recursions are likely to underflow numerically. To avoid this, the variables are rescaled:

$$\hat{\alpha}_t(s)=\frac{\alpha_t(s)}{C_t},\ C_t=\sum_s\alpha_t(s);\qquad \hat{\beta}_t(s)=\frac{\beta_t(s)}{D_t},\ D_t=\sum_s\beta_t(s)\tag{14}$$
The log conditional probability $p(\mathbf{l}|\mathbf{x})$ can then be expressed with the rescaling factors as:

$$\ln(p(\mathbf{l}|\mathbf{x}))=\sum_{t=1}^T\ln(C_t)\tag{15}$$
The maximum-likelihood training objective is:

$$O^{ML}(S,\mathcal{N}_w)=-\sum_{(\mathbf{x},\mathbf{z})\in S}\ln(p(\mathbf{z}|\mathbf{x}))\tag{16}$$
Since the training samples are independent, it suffices to consider the derivative for a single sample:

$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\},\mathcal{N}_w)}{\partial y_k^t}=-\frac{\partial\ln(p(\mathbf{z}|\mathbf{x}))}{\partial y_k^t}\tag{17}$$
From the definitions of the forward and backward variables:

$$\alpha_t(s)\beta_t(s)=\sum_{\substack{\pi\in\mathcal{B}^{-1}(\mathbf{l})\\\pi_t=\mathbf{l}'_s}}y_{\mathbf{l}'_s}^t\prod_{t'=1}^{T}y_{\pi_{t'}}^{t'}\tag{18}$$
Equation $(18)$ is best understood from the probabilistic meaning of the forward and backward variables rather than by formal manipulation: $\alpha_t(s)$ and $\beta_t(s)$ each contain one factor $y_{\mathbf{l}'_s}^t$, so their product counts it twice, which is why it appears explicitly on the right-hand side.
Combining with equation $(3)$:

$$\frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{l}'_s}^t}=\sum_{\substack{\pi\in\mathcal{B}^{-1}(\mathbf{l})\\\pi_t=\mathbf{l}'_s}}p(\pi|\mathbf{x})\tag{19}$$
From equation $(19)$, for any $t$ the conditional probability $p(\mathbf{l}|\mathbf{x})$ can be written as:

$$p(\mathbf{l}|\mathbf{x})=\sum_{s=1}^{|\mathbf{l}'|}\frac{\alpha_t(s)\beta_t(s)}{y_{\mathbf{l}'_s}^t}\tag{20}$$
Define the set of positions where label $k$ occurs as $lab(\mathbf{l},k)=\{s:\mathbf{l}'_s=k\}$. Differentiating equation $(20)$ gives:

$$\frac{\partial p(\mathbf{l}|\mathbf{x})}{\partial y_k^t}=\frac{1}{(y_k^t)^2}\sum_{s\in lab(\mathbf{l},k)}\alpha_t(s)\beta_t(s)\tag{21}$$
To see this, combine equations $(9)$ and $(13)$: for $s\in lab(\mathbf{l},k)$ we have $y_{\mathbf{l}'_s}^t=y_k^t$, and each of $\alpha_t(s)$ and $\beta_t(s)$ contains exactly one factor $y_k^t$, so every such term $\alpha_t(s)\beta_t(s)/y_k^t$ in $(20)$ is linear in $y_k^t$ with derivative $\alpha_t(s)\beta_t(s)/(y_k^t)^2$.
Differentiating the log conditional probability $\ln p(\mathbf{l}|\mathbf{x})$:

$$\frac{\partial\ln(p(\mathbf{l}|\mathbf{x}))}{\partial y_k^t}=\frac{1}{p(\mathbf{l}|\mathbf{x})}\frac{\partial p(\mathbf{l}|\mathbf{x})}{\partial y_k^t}\tag{22}$$
Substituting $\mathbf{l}=\mathbf{z}$ into equation $(22)$ gives the derivative of $\ln p(\mathbf{z}|\mathbf{x})$. Since $y_k^t$ is a softmax output, differentiating the objective $O^{ML}(\{(\mathbf{x},\mathbf{z})\},\mathcal{N}_w)$ with respect to the softmax input $u_k^t$ yields:

$$\frac{\partial O^{ML}(\{(\mathbf{x},\mathbf{z})\},\mathcal{N}_w)}{\partial u_k^t}=y_k^t-\frac{1}{y_k^t Z_t}\sum_{s\in lab(\mathbf{z},k)}\hat{\alpha}_t(s)\hat{\beta}_t(s)\tag{23}$$

where $Z_t=\sum_{s=1}^{|\mathbf{l}'|}\dfrac{\hat{\alpha}_t(s)\hat{\beta}_t(s)}{y_{\mathbf{l}'_s}^t}$.
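Putting the pieces together, here is a sketch of the per-sample gradient of $-\ln p(\mathbf{z}|\mathbf{x})$ with respect to the softmax inputs $u_k^t$. For readability it uses the unnormalised $\alpha,\beta$, so the $1/Z_t$ factor of equation $(23)$ appears as $1/p(\mathbf{z}|\mathbf{x})$ (the two forms are equivalent, since the rescaling factors cancel); function and variable names are illustrative, and the blank is assumed at index 0:

```python
import numpy as np

def ctc_grad(y, labels, blank=0):
    """Gradient of -ln p(z|x) w.r.t. the softmax inputs u_k^t (eq. 23),
    using unnormalised alpha/beta; assumes at least one label.
    Returns (gradient, p(z|x))."""
    T, K = y.shape
    ext = [blank]                          # extended sequence l'
    for k in labels:
        ext += [k, blank]
    S = len(ext)
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0, 0], alpha[0, 1] = y[0, blank], y[0, ext[1]]        # eq. (8)
    beta[-1, -1], beta[-1, -2] = y[-1, blank], y[-1, ext[-2]]   # eq. (12)
    for t in range(1, T):                  # forward recursion, eq. (9)
        for s in range(S):
            a = alpha[t - 1, s] + (alpha[t - 1, s - 1] if s > 0 else 0.0)
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1, s - 2]
            alpha[t, s] = a * y[t, ext[s]]
    for t in range(T - 2, -1, -1):         # backward recursion, eq. (13)
        for s in range(S):
            b = beta[t + 1, s] + (beta[t + 1, s + 1] if s < S - 1 else 0.0)
            if s < S - 2 and ext[s] != blank and ext[s] != ext[s + 2]:
                b += beta[t + 1, s + 2]
            beta[t, s] = b * y[t, ext[s]]
    p = alpha[-1, -1] + alpha[-1, -2]      # eq. (10)
    grad = y.copy()                        # the y_k^t term of eq. (23)
    for t in range(T):
        for s in range(S):                 # accumulate over s in lab(z, k)
            k = ext[s]
            grad[t, k] -= alpha[t, s] * beta[t, s] / (p * y[t, k])
    return grad, p
```

A quick sanity check: because the second term of $(23)$ sums to $1$ over $k$ at each $t$ (by equation $(20)$), each row of the gradient sums to zero whenever the rows of $\mathbf{y}$ are proper softmax distributions.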
Where does equation $(23)$ come from? (Substitute $(21)$ and $(22)$ into $(17)$ and apply the softmax Jacobian $\partial y_{k'}^t/\partial u_k^t=y_{k'}^t(\delta_{kk'}-y_k^t)$; with the rescaled variables of $(14)$, the factor $1/p(\mathbf{z}|\mathbf{x})$ becomes $1/Z_t$.)
References

Original: https://blog.csdn.net/m0_37142194/article/details/122786922
Author: 会飞的鱼chelmx
Title: CTC损失函数笔记 (Notes on the CTC Loss Function)