# Loss Functions

This article introduces several loss functions and regularization terms, and the effect of regularization on the model.

## Loss Function

The loss function measures how good a single prediction of the model is, while the risk function measures how good the model's predictions are in an average sense.

The input and output of the model are random variables $(X, Y)$ that follow a joint distribution $P(X, Y)$. The expectation of the loss function is:

$$R_{\mathrm{exp}}(f)=E_P[L(Y,f(X))]=\int_{\mathcal X\times \mathcal Y}L(y,f(x))P(x,y)\,dx\,dy$$

This is called the risk function or expected loss.

The goal of learning is to choose the model with the least expected risk. Since the joint distribution is unknown, the formula above cannot be computed directly.

By the law of large numbers, as the sample size $N$ tends to infinity the empirical risk tends to the expected risk, so in practice the empirical risk is used to estimate the expected risk.

## Empirical Risk Minimization & Structural Risk Minimization

Empirical risk minimization:

$$\min_{f\in \mathcal F}{1\over N}\sum_{i=1}^N L(y_i,f(x_i))$$

Structural risk minimization is a strategy proposed to prevent overfitting; it adds a regularization term:

$$\min_{f\in \mathcal F}{1\over N}\sum_{i=1}^N L(y_i,f(x_i))+\lambda J(f)$$

where $J(f)$ measures the complexity of the model.
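As a concrete sketch of the two objectives above (the helper names `empirical_risk` and `structural_risk` are mine, and the squared loss with an L2 penalty are just example choices, not something the original prescribes):

```python
# Illustrative sketch: empirical risk vs. structural risk for a linear model.

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def empirical_risk(params, data, predict, loss):
    # (1/N) * sum of per-sample losses
    return sum(loss(y, predict(params, x)) for x, y in data) / len(data)

def l2_penalty(params):
    # J(f): model complexity measured by the squared L2 norm of the parameters
    return sum(p * p for p in params)

def structural_risk(params, data, predict, loss, lam):
    # empirical risk + lambda * J(f)
    return empirical_risk(params, data, predict, loss) + lam * l2_penalty(params)

predict = lambda params, x: params[0] * x + params[1]  # y = w*x + b
data = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.1)]

emp = empirical_risk((1.0, 0.0), data, predict, squared_loss)          # ≈ 0.01
struct = structural_risk((1.0, 0.0), data, predict, squared_loss, 0.1) # ≈ 0.11
```

With $\lambda=0$ the structural risk reduces to the empirical risk; a larger $\lambda$ trades training fit for smaller (simpler) parameters.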

## Several Loss Functions

### 1. Softmax (cross-entropy loss)

The softmax classifier generalizes logistic regression (two classes) to the multi-class case.

The softmax function $f_j(z)=\frac{e^{z_j}}{\sum_k e^{z_k}}$ maps its inputs into the interval $[0,1]$, so the outputs can be used as class probabilities or confidence values. Softmax pushes the output for the largest $z_j$ closest to 1 and the others toward 0, and it also turns negative inputs into non-negative values.
Taking the negative logarithm of the softmax output gives the cross-entropy loss. When people say "softmax loss" they usually mean the cross-entropy loss, sometimes also called the negative log-likelihood loss.

$$L_i=-\log\left(\frac{e^{z_{y_i}}}{\sum_j e^{z_j}}\right)=-z_{y_i}+\log\sum_j e^{z_j}$$

$z_j$ is the score on class $j$ and $y_i$ is the true class. The loss ranges over $[0,+\infty)$: the higher the score on the true class, the lower the loss.

The total softmax loss over all samples is:

$$\mathcal L(X,Y) = -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^K \mathbb{1}\{j=y^{(i)}\}\log(p_{i,j})$$

Partial derivatives:

$$\begin{align} {\partial L\over\partial w_{j}^L} &=-y_j(1-p_j)\, a_j^{L-1} \\ {\partial L\over\partial b_j^L} &=p_j^L-y_j \end{align}$$

Like the logistic regression loss, it can be derived by maximum likelihood estimation, converting the maximization into minimizing the negative logarithm:

$$\begin{align} \max\ & \prod_i^{m}\prod_j^{K} p_j^{\mathbb{1}\{j=y^{(i)}\}} \\ \Rightarrow\ & \max\ \log\prod_i^m\prod_j^{K} p_j^{\mathbb{1}\{j=y^{(i)}\}} =\sum_i^m \sum_j^{K}\mathbb{1}\{j=y^{(i)}\}\log p_j \\ \Rightarrow\ & \min\left(-\sum_i^m\sum_j^{K}\mathbb{1}\{j=y^{(i)}\}\log p_j\right) \end{align}$$

The multi-class logistic loss `MultinomialLogisticLossLayer` in caffe directly takes the per-class probability predictions from the previous layer; its loss is $E=-{1\over N}\sum_{n=1}^N\log(\hat p_{n,l_n})$, where $\hat p_{n,l_n}$ is the prediction corresponding to the true class label $l_n$ of the $n$-th sample. The indicator for every other class is 0, so those terms vanish and the formula is simple.

A technique for improving numerical stability: the exponentials can be very large, and division of large numbers may be unstable, so the following transformation is used to shrink the values (a common choice is $\log C=-\max_j f_j$):

$$\frac{e^{f_{y_i}}}{\sum_j e^{f_j}} = \frac{Ce^{f_{y_i}}}{C\sum_j e^{f_j}} = \frac{e^{f_{y_i} + \log C}}{\sum_j e^{f_j + \log C}}$$
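A minimal pure-Python sketch of this shifted computation (function names are mine; NumPy users would typically use vectorized equivalents):

```python
import math

def softmax(z):
    # Shift by the max (i.e. log C = -max_j z_j) so the largest exponent is 0.
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    # L = -z_y + log(sum_j e^{z_j}), computed with the same shift
    m = max(z)
    return -(z[y] - m) + math.log(sum(math.exp(v - m) for v in z))

probs = softmax([1000.0, 1000.0])        # naive exp(1000) would overflow
loss = cross_entropy([2.0, 1.0, 0.1], y=0)
```

Without the shift, `math.exp(1000.0)` raises an overflow error; with it the same inputs give the exact answer `[0.5, 0.5]`.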

### 2. Logistic loss

The logistic loss is defined as:

$$C = -\frac{1}{n} \sum_x \left[y \ln a + (1-y) \ln (1-a) \right]$$

The loss function should satisfy two conditions:

* the function value is non-negative;
* when the output approaches the target value, the function value tends to 0 (ideally equal to 0 there).

Partial derivatives:

$$\begin{align} {\partial C\over\partial w_j} &={1\over n}\sum_x x_j(a-y) \\ {\partial C\over\partial b} &={1\over n}\sum_x (a-y) \end{align}$$

Extended to a multi-layer network with multiple output $y$ values, where $L$ denotes the last layer:

$$C = -\frac{1}{n} \sum_x \sum_j \left[y_j \ln a^L_j + (1-y_j) \ln (1-a^L_j) \right]$$

Conclusion: the logistic loss is almost always better than the mean-square loss. If the output layer uses a linear activation rather than a nonlinearity such as sigmoid, then the squared-difference loss is fine to use, and there is no problem of learning starting slowly when the initial parameters are set badly.

Aliases: **log loss** (logarithmic loss; the binary labels are usually $\{-1,+1\}$). It is sometimes also called **cross-entropy** loss (the name is a bit confusing, since the loss used with softmax is likewise derived from cross entropy).

Written in another form:

$$L(w) = -\frac{1}{N} \sum_{i=1}^N \left[y_i \log \hat y_i + (1-y_i ) \log (1-\hat y_i) \right]$$

The sigmoid likewise involves exponentials, so overflow must be avoided (the argument of the log approaching 0 yields negative infinity):

$$L(w)=-{1\over N}\sum_{i=1}^N\left[y_i x_i+\log\left({e^{-x_i}\over 1+ e^{-x_i}}\right)\right]$$
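One way to compute the logistic loss stably is via $\log(1+e^t)$ with the large-$t$ branch factored out. The sketch below (helper names are mine) assumes labels $y\in\{0,1\}$ and raw scores $x$:

```python
import math

def log1p_exp(t):
    # log(1 + e^t), overflow-safe: for t > 0 rewrite as t + log(1 + e^{-t})
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def logistic_loss(scores, labels):
    # per sample: -[y log sigma(x) + (1-y) log(1-sigma(x))]
    #           =   y log(1+e^{-x}) + (1-y) log(1+e^{x})
    n = len(scores)
    return sum(y * log1p_exp(-x) + (1 - y) * log1p_exp(x)
               for x, y in zip(scores, labels)) / n

# extreme scores that would overflow a naive sigmoid-then-log implementation
loss_extreme = logistic_loss([1000.0, -1000.0], [1, 0])
loss_mid = logistic_loss([0.0], [1])   # = log 2
```

Computing `math.log(sigmoid(-1000.0))` directly would underflow to `log(0)` and return `-inf`; the rewritten form stays finite.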

Differences between the L2 (mean-square) loss and the cross-entropy loss:

Mean-square loss is mainly used for regression but can also be used for classification, while cross-entropy loss is used only for classification. A notable difference is that the cross-entropy loss considers only the output for the true class, whereas the mean-square loss considers the output for every class. The mean-square loss corresponds to a Gaussian assumption: the Gaussian is a distribution over continuous variables, and the mean-square loss can be obtained from the negative logarithm of the Gaussian maximum-likelihood estimate.

$$\begin{align} p(x)&=N(\mu,\sigma)={1\over \sqrt{2\pi}\sigma}e^{-{(x-\mu)^2\over 2\sigma^2}} \\ \max\ \log P(D)&=\max\log\prod_i^N P(D_i) \\ &=\max \sum_i^N \log P(D_i) \\ &=\max\left[-{N\over 2}\log(2\pi\sigma^2)-{1\over 2\sigma^2}\sum_i^N(D_i-\mu)^2\right] \end{align}$$

Softmax regression vs. $k$ binary classifiers: if the classes are mutually exclusive, a softmax regression classifier is more appropriate; if classes overlap or have parent-child relationships, it is more appropriate to build multiple independent logistic regression classifiers.

### 3. Hinge loss (SVM)

Hinge loss is used in maximum-margin classifiers, with the SVM classifier as the representative example.

$$\ell(y)=\sum_{j\neq t}\max(0,\,s_j-s_t+\Delta)$$
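A minimal sketch of this sum (names are mine; $\Delta=1$ is a common choice):

```python
def hinge_loss(scores, t, delta=1.0):
    # sum over j != t of max(0, s_j - s_t + delta)
    return sum(max(0.0, s - scores[t] + delta)
               for j, s in enumerate(scores) if j != t)

loss = hinge_loss([3.0, 2.5, 0.5], t=0)   # only class 1 violates the margin: 0.5
safe = hinge_loss([5.0, 1.0, 1.0], t=0)   # all margins satisfied: 0.0
```

The loss is zero exactly when every wrong class scores at least $\Delta$ below the true class $t$.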

## Combining Multiple Loss Functions

A caffe network can define multiple loss layers, such as `EUCLIDEAN_LOSS` and `SOFTMAX_LOSS`, with $\lambda$ set via `loss_weight: 100.0`; the total loss is then:

$$\mathcal L={1\over 2N}\sum_{n=1}^N\|y_n-\hat y_n\|_2^2-\lambda{1\over N}\sum_{n=1}^N\log(\hat p_{n,l_n})$$

## Parameter Regularization

Simply minimizing the loss function may lead to overfitting, and overfitted models often have parameters of large magnitude (large fluctuations), so a term on the parameters can be added to the loss to limit this fluctuation. Overfitting is usually reduced by such a regularization penalty.

### L1 Regularization

The L1 regularization term is $\lambda\sum_j|\theta_j|$, where $\lambda$ is the regularization strength. Its subgradient is $\lambda\operatorname{sgn}(w)$, so the weight update is W += -lambda * sgn(W): the weights decrease by a constant amount each step. L1 has a "truncation" effect and can be used in feature selection to effectively reduce the number of features (Lasso performs automatic parameter shrinkage and variable selection).

$$w\to w-{\eta\lambda\over n}\operatorname{sgn}(w)-\eta{\partial C_0\over\partial w}$$

### L2 Regularization

The L2 regularization term is $\lambda\sum_{j=1}^n\theta_j^2$.

The effect of L2 on the derivatives: compared with the unregularized loss, the gradient with respect to a weight gains an extra term ${\lambda\over n}w$, while the bias gradient is unchanged. The gradient-descent update for $w$ becomes $w\to w-\eta\left({\partial C_0\over\partial w}+{\lambda\over n}w\right)=\left(1-{\eta\lambda\over n}\right)w-\eta{\partial C_0\over\partial w}$. The coefficient on $w$ makes the weights decay faster, so L2 regularization is also called weight decay (caffe's `weight_decay` parameter is related to this). For stochastic gradient descent (averaging the partial derivatives over the $m$ examples of a mini-batch):

$$\begin{align} w &\to \left(1-{\eta\lambda\over n}\right)w-{\eta\over m}\sum_x{\partial C_x\over\partial w} \\ b &\to b-{\eta\over m}\sum_x{\partial C_x\over\partial b} \end{align}$$
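The update rules above can be sketched in plain Python (function names are mine); the last two lines illustrate how L2's proportional shrinkage differs from L1's constant shrinkage on a small weight:

```python
def sgd_step_l2(w, grads, eta, lam, n):
    # w <- (1 - eta*lam/n) * w - eta * (mini-batch average gradient)
    return (1 - eta * lam / n) * w - eta * sum(grads) / len(grads)

def sgd_step_l1(w, grads, eta, lam, n):
    # w <- w - (eta*lam/n) * sgn(w) - eta * (mini-batch average gradient)
    sgn = (w > 0) - (w < 0)
    return w - (eta * lam / n) * sgn - eta * sum(grads) / len(grads)

# zero data gradient isolates the pure shrinkage effect (eta*lam/n = 0.005)
big_l2 = sgd_step_l2(1.0, [0.0], eta=0.1, lam=0.5, n=10)     # 0.995
big_l1 = sgd_step_l1(1.0, [0.0], eta=0.1, lam=0.5, n=10)     # 0.995
small_l2 = sgd_step_l2(0.01, [0.0], eta=0.1, lam=0.5, n=10)  # 0.00995
small_l1 = sgd_step_l1(0.01, [0.0], eta=0.1, lam=0.5, n=10)  # 0.005
```

For the small weight, L1 removes a constant 0.005 while L2 removes only 0.5% of the weight, which is why L1 drives small weights toward exactly 0.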

L1 vs. L2:

* L1 reduces the weight by a constant amount, while L2 reduces it by a fixed proportion of the weight.
* How quickly a weight shrinks therefore depends on its magnitude: L2 may be faster when the weight is large, while L1 is faster when the weight is small.
* Because the 1-norm is not differentiable, L1 has no analytical (closed-form) solution.

Why does L1 yield sparse solutions more easily than L2?

Near 0 the derivative of the L1 term is $\pm 1$, while the derivative of the L2 term near 0 is much smaller, so L1 more easily pushes parameters that are close to 0 all the way to 0. With L2, once the gradient near 0 is far smaller than 1 it barely updates the parameter any more, so many parameters end up close to, but not exactly, 0. In effect it is a tug-of-war between the original loss and the regularization term. [2]

How can L1 and ReLU be differentiated (at 0)?

* the subgradient method
* coordinate descent
* approximation: replace the region around the point 0 with a differentiable approximating function

The subgradient method can be used for non-differentiable objective functions. When the objective function is differentiable, the subgradient method has the same search direction as gradient descent for unconstrained problems.

The subderivative of ReLU ($=\max\{0,x\}$) at 0 is the interval $[0,1]$; the subderivative of the L1 norm at 0 is $[-1,1]$. How should a subderivative be used? We can pick any value in the interval as the derivative at 0, but if different choices led to different results, searching for the optimum with the subgradient method would be very time-consuming. Fortunately, when the forward result of ReLU/abs at 0 is 0, the gradient in backpropagation is multiplied by that forward result and is therefore always 0, so the gradient of ReLU at 0 can be set to any value; any choice in $[0,1]$ works in backpropagation. Usually 0 is chosen: for example, the gradient of the abs function at 0 in Caffe/TensorFlow computes to 0, consistent with the definition of the sign function. Another benefit of fixing the derivative to 0 is that it yields a sparser representation.
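A sketch of this convention (the `at_zero` parameter is my own illustration of choosing a subderivative at 0, not a framework API):

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x, at_zero=0.0):
    # At x == 0 any value in [0, 1] is a valid subderivative;
    # frameworks conventionally pick 0.
    if x > 0:
        return 1.0
    if x < 0:
        return 0.0
    return at_zero

# Backprop through ReLU multiplies the upstream gradient by the local one.
upstream = 2.0
grad_at_zero = upstream * relu_grad(0.0)  # 0.0 with the conventional choice
```

Passing `at_zero=1.0` is equally valid as a subderivative; the conventional 0 just gives sparser gradients.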

### Dropout

Dropout can be combined with max-norm regularization, a larger initial learning rate, and higher momentum to achieve better results than dropout alone.

## Addressing Class Imbalance via the Loss Function

To deal with class imbalance in pattern recognition, the usual approaches are to repeatedly sample (oversample) the minority class, or to generate artificial data based on the spatial distribution of the original samples. Sometimes, however, neither is easy to do.

The 2015 ICCV paper "Holistically-nested edge detection" proposed an edge-detection model named HED and tried to solve this problem by changing the definition of the loss function. Take the log loss of a binary classification problem: $l=-\sum_{k=0}^n[Q_k \log p_k+(1-Q_k)\log(1-p_k)]$

HED uses a weighted cross-entropy. For example, when samples with label 0 are very scarce, the weighted loss is defined as: $l=-\sum_{k=0}^n[Q_k \log p_k+W(1-Q_k)\log(1-p_k)]$

where $W$ must be greater than 1. Now consider the likelihood: $L=p_0\cdot p_1\cdot(1-p_2)^W\cdot p_3\cdot p_4\cdot(1-p_5)^W\cdot\dots$

The samples of class 0 are effectively repeated in the likelihood, so their proportion increases. Although we cannot actually multiply the number of minority-class samples, modifying the loss function achieves a basically equivalent effect.
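A small sketch of such a weighted binary log loss (function and variable names are mine, not from the HED paper):

```python
import math

def weighted_log_loss(ps, qs, w=1.0):
    # l = -sum_k [ q_k log p_k + w (1 - q_k) log(1 - p_k) ]
    # w > 1 up-weights the terms of the rare label-0 samples
    return -sum(q * math.log(p) + w * (1 - q) * math.log(1 - p)
                for p, q in zip(ps, qs))

ps = [0.9, 0.8, 0.3]  # predicted probability of label 1
qs = [1, 1, 0]        # true labels; label 0 is the rare class here
plain = weighted_log_loss(ps, qs, w=1.0)
boosted = weighted_log_loss(ps, qs, w=3.0)  # the single label-0 term counts 3x
```

With $W=3$ the one label-0 sample contributes as if it appeared three times, mimicking oversampling without adding data.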

Original: https://www.cnblogs.com/makefile/p/loss-function.html
Author: 康行天下
Title: 损失函数
