Activation Functions (ReLU, Swish, Maxout)

Continuously updated: the July 2022 update added GELU, GLU, and other activation functions.

ReLU (Rectified Linear Unit)

The form is as follows:

$$f(x)= \begin{cases} 0, & x\leq 0 \\ x, & x>0 \end{cases}$$

Approximate derivation of the ReLU formula:

$$\begin{align} f(x) &=\sum_{i=1}^{\infty}\sigma(x-i+0.5) &\text{(stepped sigmoid)} \\ &\approx\log(1+e^x) &\text{(softplus function)} \\ &\approx\max(0,x+N(0,1)) &\text{(ReL function)} \\ \text{where }\sigma(z) &={1\over 1+e^{-z}} &\text{(sigmoid)} \end{align}$$

The following explains softplus and Noisy ReLU.

The softplus function is close to ReLU but smoother. Like ReLU it provides one-sided suppression and has a wide acceptance range (0, +inf), but it is rarely used because the exponential and logarithm make it expensive to compute, and practical experience (Glorot et al. (2011a)) suggests it works no better than ReLU.

The derivative of softplus is exactly the sigmoid function. Softplus function plot:

Noisy ReLU [1]
ReLU can be extended to include Gaussian noise:
$f(x)=\max(0, x+Y),\quad Y\sim N(0,\sigma(x))$
Noisy ReLU has been used in restricted Boltzmann machines for computer-vision tasks.

Setting an upper bound for ReLU: compared with sigmoid and tanh, one drawback of ReLU is that its output has no upper bound. In practice an upper limit can be imposed, e.g. the empirical ReLU6: $f(x)=\min(6,\max(0,x))$. See the paper this bound comes from: Convolutional Deep Belief Networks on CIFAR-10, A. Krizhevsky.
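For concreteness, here is a minimal NumPy sketch of the functions above (ReLU, softplus, ReLU6, and a noisy ReLU). The function names and the fixed noise scale sigma are illustrative choices rather than anything from the original references:

import numpy as np

def relu(x):
    # max(0, x), applied elementwise
    return np.maximum(0.0, x)

def softplus(x):
    # smooth approximation of ReLU: log(1 + e^x); its derivative is the sigmoid
    return np.log1p(np.exp(x))

def relu6(x):
    # ReLU with the empirical upper bound 6: min(6, max(0, x))
    return np.minimum(6.0, np.maximum(0.0, x))

def noisy_relu(x, sigma=1.0, rng=None):
    # max(0, x + Y) with Gaussian noise Y; the text uses an input-dependent sigma(x),
    # using a fixed sigma here is a simplification
    rng = np.random.default_rng(0) if rng is None else rng
    return np.maximum(0.0, x + rng.normal(0.0, sigma, size=np.shape(x)))

x = np.linspace(-3.0, 8.0, 5)
print(relu(x), softplus(x), relu6(x), noisy_relu(x), sep="\n")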

Sparsity of ReLU (quoted from another source):

Compared with the roughly 95% sparsity at which the brain operates, there is still a large gap between today's computational neural networks and biological ones. Fortunately, ReLU is sparse only for negative inputs, i.e. the introduced sparsity can be trained and adjusted and is dynamic. As long as gradient training proceeds, the network automatically adjusts the sparsity ratio in the direction that reduces the error, ensuring a reasonable number of non-zero values on the activation path.

Drawbacks of ReLU

• Necrosis (dying ReLU): the sparsity forced by ReLU can reduce the effective capacity of the model (too many features are masked, so the model cannot learn effective features). Because the gradient of ReLU is 0 for x < 0, a neuron that ends up in this regime receives no gradient and may never be activated by any data again; this is called neuron "necrosis" (a dead neuron).

• No negative values: ReLU and sigmoid have in common that their output is never negative.

ReLU variants

Leaky ReLU

$$f(x)=\max(\alpha x, x)$$

• Does not saturate
• Simple and efficient to compute
• Converges faster than sigmoid/tanh

Exponential Linear Unit (ELU)

$$f(x)= \begin{cases} \alpha(e^x-1), & x\leq 0 \\ x, & x>0 \end{cases}$$

$$f'(x)= \begin{cases} f(x)+\alpha, & x\leq 0 \\ 1, & x>0 \end{cases}$$

The exponential linear unit was proposed by Djork-Arné Clevert et al. It has been shown to be more robust to noise and keeps the mean activation of neurons close to 0. Because it involves an exponential, it is more expensive to compute.

ReLU family:

In Leaky ReLU, $\alpha$ is fixed; in PReLU, $\alpha$ is not fixed but learned during training; in RReLU, $\alpha$ is randomly sampled from a uniform distribution during training and fixed at test time, similar in spirit to Noisy ReLU (but on the opposite interval).
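A rough NumPy sketch of these variants; alpha is simply passed in as an argument here (fixed for Leaky ReLU; for PReLU it would be a trained parameter), which is an illustrative simplification:

import numpy as np

def leaky_relu(x, alpha=0.01):
    # small fixed slope alpha for x < 0; PReLU learns alpha instead
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # exponential linear unit: alpha * (e^x - 1) for x <= 0, x for x > 0
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x), elu(x), sep="\n")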

Comparison of the ReLU family:

SELU

SELU multiplies ELU by a coefficient $\lambda$, i.e. $\mathrm{SELU}(x)=\lambda\cdot \mathrm{ELU}(x)$:

$$f(x)=\lambda \begin{cases} \alpha(e^x-1), & x \le 0 \\ x, & x>0 \end{cases}$$
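A minimal sketch of SELU, assuming the commonly quoted self-normalizing constants λ ≈ 1.0507 and α ≈ 1.6733 from the SELU paper:

import numpy as np

def selu(x, lam=1.0507, alpha=1.6733):
    # SELU(x) = lambda * ELU(x) with the self-normalizing constants
    return lam * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

print(selu(np.array([-1.0, 0.0, 1.0])))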

Swish

Paper: Searching for Activation Functions (Prajit Ramachandran et al., Google Brain, 2017)

$$f(x) = x\cdot \text{sigmoid}(\beta x)$$

β is a constant or a trainable parameter. Swish is unbounded above, bounded below, smooth, and non-monotonic.

Swish outperforms ReLU on deep models. For example, simply replacing ReLU units with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and by 0.6% for Inception-ResNet-v2.

Derivative:
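Differentiating $f(x)=x\,\sigma(\beta x)$ directly gives (this matches the form reported in the Swish paper):

$$f'(x) = \sigma(\beta x) + \beta x\,\sigma(\beta x)\bigl(1-\sigma(\beta x)\bigr) = \beta f(x) + \sigma(\beta x)\bigl(1-\beta f(x)\bigr)$$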

As β → ∞, $\sigma(\beta x) = (1 + \exp(-\beta x))^{-1}$ approaches 0 or 1 and Swish approaches ReLU: $f(x)=\max(0,x)$. As β → 0, $\sigma(0)=1/2$ and Swish reduces to the scaled linear function $f(x)=x/2$.

Therefore, Swish can be regarded as a smooth function that nonlinearly interpolates between the linear function and ReLU.
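A minimal NumPy sketch (the default β = 1 is simply the common choice, not something mandated by the paper):

import numpy as np

def swish(x, beta=1.0):
    # x * sigmoid(beta * x); beta can be fixed or treated as a trainable parameter
    return x / (1.0 + np.exp(-beta * x))

print(swish(np.array([-2.0, 0.0, 2.0])))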

GELU

GELU (Gaussian Error Linear Unit) is an activation function whose exact form is not an elementary function; it is a variant of ReLU. It was proposed in the 2016 paper Gaussian Error Linear Units (GELUs) and was subsequently adopted by NLP models such as GPT-2, BERT, RoBERTa, and ALBERT. The paper gives not only the exact form of GELU but also two approximations built from elementary functions. The function curve is shown below:

ReLU and its variants on the one hand, and Dropout on the other, determine the network's output from two independent angles; is there a middle-ground way of merging the two? For regularization, Dropout randomly sets a unit's output to 0 (multiplies by 0), while Zoneout randomly skips the update of an RNN unit (multiplies by 1). Both multiply the output by a random variable m ~ Bernoulli(p), where p is a fixed, pre-specified parameter giving the probability of drawing 1.

The GELU paper instead lets p vary with the input x, setting the output to 0 with higher probability when x is small. Since neuron inputs usually follow a normal distribution, especially in networks with Batch Normalization, this can be satisfied by taking p to be the cumulative distribution function of the standard normal distribution:

$$\begin{align} \Phi(x) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x \exp\left(-\frac{t^2}{2}\right) dt \\ &= {1\over 2} + \frac{1}{\sqrt{2\pi}} \int_0^x \exp\left(-\frac{t^2}{2}\right) dt \tag{the area under the normal curve is 1, so half is 0.5} \\ &= {1\over 2}\left(1 + \frac{2}{\sqrt{\pi}} \int_0^x \exp\left(-\left({t\over \sqrt 2}\right)^2\right) {dt\over \sqrt 2}\right) \\ &= {1\over 2}\left(1 + \frac{2}{\sqrt{\pi}} \int_0^{x\over\sqrt 2} \exp\left(-z^2\right) dz\right) \\ &= {1\over 2}\left(1+\operatorname{erf}\left({x\over \sqrt 2}\right)\right) \end{align}$$

$$\operatorname{erf}(x)={\frac{1}{\sqrt{\pi}}}\int_{-x}^{x}e^{-t^{2}}\,\mathrm{d}t={\frac{2}{\sqrt{\pi}}}\int_{0}^{x}e^{-t^{2}}\,\mathrm{d}t$$

erf(x) is fairly close to tanh(x), and its curve is also similar to $2\left(\sigma(x)-\frac{1}{2}\right)$, although the difference there is somewhat larger. In code, erf(x) can be fitted with approximate functions; the two approximations given in the paper are:

$$\begin{align} x\Phi(x) &\approx x\,\sigma(1.702 x) \\ x\Phi(x) &\approx \frac{1}{2} x \left[1 + \tanh\left(\sqrt{\frac{2}{\pi}}\left(x + 0.044715 x^3\right)\right)\right] \end{align}$$

However, many frameworks already provide an exact erf function, which can be used directly. Reference code:

# Old-style GELU used in BERT and GPT-2 (tanh approximation)
import numpy as np
import tensorflow as tf

def gelu(x):
    return x * 0.5 * (1.0 + tf.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3))))

# GELU using the exact erf function (renamed to avoid shadowing the definition above)
def gelu_erf(x):
    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.sqrt(2.0)))
    return x * cdf

GELU vs Swish

GELU and the Swish activation $x\cdot\sigma(\beta x)$ are very similar in form and properties: one uses the fixed coefficient 1.702, the other a variable coefficient β (either a trainable parameter or a constant found by search), and their performance in practice is also close.

GLU (Gated Linear Unit) and its variants

GLU (Gated Linear Unit) has the following form (bias terms omitted):

$$\begin{align} \text{GLU}(a, b) &= a\odot \sigma(b) \\ \text{GLU}(x,W,V) &= \sigma(xW) \odot xV \end{align}$$

GLU controls its output through a gating mechanism and, like attention, can be viewed as selecting important features. Its advantage is that, besides the nonlinearity of an ordinary activation function, it keeps a linear path when gradients are back-propagated, similar to the additive shortcut in ResNet, which alleviates the vanishing-gradient problem.

Why? Compare the gradients of sigmoid and of the gated tanh unit (GTU) used in LSTM:

$$\begin{align} \nabla[\tanh(\mathbf{X}) \odot \sigma(\mathbf{X})] &=\tanh'(\mathbf{X})\nabla\mathbf{X} \odot \sigma(\mathbf{X}) + \sigma'(\mathbf{X})\nabla\mathbf{X} \odot \tanh(\mathbf{X}) \tag{GTU (LSTM)} \\ \nabla[\sigma(\mathbf{X})] &= \nabla\mathbf{X} \odot \sigma'(\mathbf{X}) = \nabla\mathbf{X} \odot \sigma(\mathbf{X})(1-\sigma(\mathbf{X})) \tag{sigmoid} \\ \nabla[\mathbf{X} \odot \sigma(\mathbf{X})] &= \nabla\mathbf{X} \odot \sigma(\mathbf{X}) + \mathbf{X} \odot \sigma'(\mathbf{X})\nabla\mathbf{X} \tag{GLU} \end{align}$$

The derivatives of sigmoid and tanh shrink the gradient, which leads to vanishing gradients. GLU, by contrast, has an extra linear product term compared with sigmoid, and the $\nabla \mathbf{X} \odot \sigma(\mathbf{X})$ path in its gradient carries no downscaling factor, so it can speed up convergence.

GEGLU is a variant of the GLU (Gated Linear Unit) activation, from GLU Variants Improve Transformer (Google, 2020); it replaces the sigmoid in GLU with GELU. Its form is as follows (bias terms omitted):

$$\begin{align} \text{GLU}(x,W,V) &= \sigma(xW) \odot xV \tag{$\text{GLU}(a, b) = a\odot \sigma(b)$} \\ \text{GELU}(x) &= x\cdot \Phi(x)=x\cdot {1\over 2}\left(1+\operatorname{erf}\left({x\over \sqrt 2}\right)\right) \\ \text{GEGLU}(x, W, V) &= \text{GELU}(xW) \odot xV \end{align}$$
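A rough NumPy sketch of GLU and GEGLU as they might appear in a feed-forward block. The weight shapes, the SciPy erf dependency, and the use of the exact-erf GELU are illustrative assumptions, not details from the papers:

import numpy as np
from scipy.special import erf  # exact erf; deep-learning frameworks expose an equivalent

def gelu(x):
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def glu(x, W, V):
    # GLU(x, W, V) = sigmoid(xW) ⊙ (xV)   (bias terms omitted)
    return (1.0 / (1.0 + np.exp(-(x @ W)))) * (x @ V)

def geglu(x, W, V):
    # GEGLU(x, W, V) = GELU(xW) ⊙ (xV)    (bias terms omitted)
    return gelu(x @ W) * (x @ V)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # (batch, d_model)
W, V = rng.normal(size=(2, 8, 16))   # two projection matrices
print(glu(x, W, V).shape, geglu(x, W, V).shape)  # (4, 16) (4, 16)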

Maxout

Maxout can be seen as adding a layer of activation function to a deep network, with a parameter k. Compared with ReLU, sigmoid, etc., this layer is special in that it adds k neurons and then outputs the largest activation value among them.

The output of a common hidden-layer node is:

$$h_i(x)=\text{sigmoid}(x^TW_{\cdots i}+b_i)$$

In a Maxout network, the output of a hidden-layer node is:

$$h_i(x)=\max_{j\in[1,k]}z_{ij}, \qquad z_{ij}=x^TW_{\cdots ij}+b_{ij}$$

Take the following simplest multilayer perceptron (MLP) as an example:

Suppose layer i of the network has two neurons x1 and x2, and layer i+1 has one neuron. Originally there is only one layer of parameters; replacing an activation function such as ReLU or sigmoid with Maxout turns this into two layers of parameters, and the number of parameters is multiplied by k.

• Maxout has a very strong fitting capacity and can approximate any convex function.
• Maxout has all the advantages of ReLU: (piecewise) linearity and no saturation.
• It avoids some of ReLU's drawbacks, such as dying neurons.

The downside can also be seen from the activation formula above: if every neuron carries two (in general, k) sets of parameters, the total number of parameters doubles (grows k-fold), leading to a surge in parameters overall.
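A sketch of a Maxout layer in NumPy with k affine pieces per output unit; the tensor layout (W of shape d_in × d_out × k) is one reasonable choice, not the only one:

import numpy as np

def maxout(x, W, b):
    # x: (batch, d_in), W: (d_in, d_out, k), b: (d_out, k)
    # z_ij = x · W[:, i, j] + b[i, j];  output h_i = max_j z_ij
    z = np.einsum('nd,dok->nok', x, W) + b
    return z.max(axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 5, 2))   # k = 2 pieces: twice the parameters of a plain linear layer
b = rng.normal(size=(5, 2))
print(maxout(x, W, b).shape)     # (4, 5)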

The Maxout activation function

Unlike conventional activation functions, it is a learnable piecewise linear function.

Any convex function can be approximated by a piecewise linear function. In fact, the activation functions we saw earlier, ReLU and abs, can be regarded as piecewise linear functions with two segments, as shown in the diagram below:

Experiments show that combining Maxout with Dropout gives better results.

sigmoid & tanh

The sigmoid/logistic activation function:

$$\sigma(x) ={1\over 1+e^{-x}}$$

tanh is a variant of sigmoid centered at 0; its range is [-1, 1] instead of sigmoid's [0, 1].

$$\tanh(x) ={e^x-e^{-x}\over e^x+e^{-x}}$$

tanh is a shifted and rescaled sigmoid: $\tanh(x) = 2\cdot\sigma(2x) - 1$.

So what is the relationship between the hyperbolic tangent tanh and the trigonometric tangent tan?

$$\begin{align} e^{-ix} &= \cos x - i\sin x \\ \sin x &= {e^{ix} - e^{-ix}\over 2i} \\ \cos x &= {e^{ix} + e^{-ix}\over 2} \\ \tan x &= \tanh(ix)/i \\ \tanh(ix) &= i\tan x \end{align}$$

Hard tanh clipping: $g(z) = \max(-1, \min(1, z))$
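A small NumPy check of the functions above, including the identity $\tanh(x) = 2\sigma(2x) - 1$:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hard_tanh(x):
    # g(z) = max(-1, min(1, z))
    return np.clip(x, -1.0, 1.0)

x = np.linspace(-4.0, 4.0, 9)
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)  # tanh as a shifted, rescaled sigmoid
print(sigmoid(x), np.tanh(x), hard_tanh(x), sep="\n")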

The sigmoid and tanh curves are shown below:

Pros and cons of sigmoid as an activation function

Historically popular, since it has a nice interpretation as the saturating "firing rate" of a neuron, and its gradient is convenient to compute:

$$\nabla\sigma = {e^{-x}\over(1+e^{-x})^2}=\left({1+e^{-x}-1\over 1+e^{-x}}\right)\left({1\over 1+e^{-x}}\right)= \sigma(x)(1-\sigma(x))$$

Three problems exist:

1. Saturated neurons "kill" the gradient: for x far from the center the derivative is close to 0, and the back-propagated learning signal stops.

2. The output of sigmoid is not zero-centered but centered at 0.5, so the gradients with respect to the weights w are always all positive or all negative.

3. The exponential is expensive to compute.

Why tanh converges faster than sigmoid:

1. Severity of the vanishing gradient
$\tanh'(x) = 1-\tanh(x)^2 \in (0,1)$
$\text{sigmoid: } s'(x)=s(x)(1-s(x))\in(0,1/4]$
As can be seen, the vanishing-gradient problem is milder for tanh than for sigmoid; if the gradient vanishes too early, convergence is slow.

2. The effect of zero-centering
If the optimal update direction for the current parameters (w0, w1) is (+d0, -d1), then by the back-propagation formulas we would need x0 and x1 to have opposite signs. But if the previous layer uses sigmoid as its activation function, sigmoid is not zero-centered and its output is always positive, so the fastest update direction cannot be taken and the parameters instead zig-zag toward the optimum. [4]

What activation functions are for

1. Introduce nonlinearity.
2. Combine features more fully.

The following explains why it combines features. A general function can be approximated by its Taylor expansion; for example, the exponential term in the sigmoid activation can be expanded as follows. Since its argument z is a weighted sum of the input features, the $z^2$ and $z^3$ terms contain cross-products such as $x_1x_2$, which is where the feature combination comes from:

$$e^z=1+{1\over 1!}z+{1\over 2!}z^2+{1\over 3!}z^3+o(z^3)$$
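To make the feature-combination point concrete: expanding the square of a weighted sum already produces the cross term x1*x2, which a purely linear layer cannot generate. A small SymPy check (assuming SymPy is available):

import sympy as sp

w1, w2, x1, x2 = sp.symbols('w1 w2 x1 x2')
z = w1 * x1 + w2 * x2     # pre-activation: weighted sum of the input features
print(sp.expand(z**2))    # w1**2*x1**2 + 2*w1*w2*x1*x2 + w2**2*x2**2  -> contains x1*x2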

Vanishing and exploding gradients

The cause: the gradient of a shallow layer is a product of the weights of the later layers and the derivatives of their activation functions, so the effective learning rate of the earlier layers can be much lower (vanishing gradients) or much higher (exploding gradients) than that of the later layers, making training unstable. How can this be addressed?

There are several aspects to consider:

• Weight initialization

Initialize the weights with an appropriate method, e.g. MSRA (He) initialization for ReLU and Xavier initialization for tanh; see the initialization sketch at the end of this section.

• Choice of activation function

Choose an activation function whose gradient stays stable as it is accumulated (multiplied) across layers, such as ReLU.

• Learning rate

One way to make training easier to optimize is to whiten the input (normalize and decorrelate it) so that a higher learning rate can be used. Modern deep networks usually use Batch Normalization instead (it includes the normalization step but not decorrelation). (All you need is a good init. If you can't find a good init, use Batch Normalization.)

Since the gradient formula contains the product of each layer's derivative and weights, we would like the product across the intermediate layers to stay around 1. The derivative of sigmoid, however, depends on its input (its maximum is 1/4, decreasing symmetrically on both sides), so networks with sigmoid are hard to optimize and most activations near the output layer are saturated; sigmoid is therefore not recommended.

ReLU has derivative 1 for inputs greater than 0 and 0 for inputs less than 0, which avoids this problem.
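As a concrete illustration of the initialization advice above, here is a minimal sketch of He (MSRA) and Xavier (Glorot) initialization; the normal-distribution form and the variance choices follow the usual conventions:

import numpy as np

rng = np.random.default_rng(0)

def he_init(n_in, n_out):
    # He / MSRA initialization, suited to ReLU layers: Var(W) = 2 / n_in
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_in, n_out))

def xavier_init(n_in, n_out):
    # Xavier / Glorot initialization, suited to tanh layers: Var(W) = 2 / (n_in + n_out)
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_in, n_out))

print(he_init(256, 256).std(), xavier_init(256, 256).std())  # roughly 0.088 and 0.0625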

Choosing an activation function

1. Try ReLU first: it is fast, but keep an eye on the training state.

2. If ReLU performs poorly, try variants such as Leaky ReLU or Maxout.

3. Try tanh (zero-centered, with gradient 1 at the origin).

4. sigmoid/tanh are used inside RNN structures (LSTM, attention mechanisms, etc.) as gates or probability values.

5. In shallow networks (no more than about 4 layers), the choice of activation function makes little difference.

Original: https://www.cnblogs.com/makefile/p/activation-function.html
Author: 康行天下
Title: Activation Functions (ReLU, Swish, Maxout)
