# Activation Functions and Regularization in Deep Learning

Activation functions and regularization in deep learning, from Pandon’s Deep Learning Notes

In gradient descent, as the error is propagated back toward the first few layers, the gradient becomes smaller and smaller until the weights of those layers effectively stop changing, so training may never converge to a good solution. This is the vanishing gradient problem. More generally, deep networks suffer from unstable gradients: different layers learn at very different speeds.

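## Drawbacks of ReLU

A ReLU unit whose input is always negative outputs 0 and receives a zero gradient, so it can stop learning (a "dead" neuron). Leaky ReLU keeps a small slope $\alpha$ for negative inputs:

$$\max(\alpha z, z), \quad \alpha = 0.01$$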

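    import tensorflow as tf
    from tensorflow.contrib.layers import fully_connected  # TensorFlow 1.x

    def leaky_relu(z, name=None):
        # max(alpha * z, z) with alpha = 0.01
        return tf.maximum(0.01 * z, z, name=name)

    hidden1 = fully_connected(X, n_hidden1, activation_fn=leaky_relu)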


## Other ReLU variants

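• RReLU (Randomized): $\alpha$ is drawn at random from a given range during training and fixed to its average value at test time. Worth trying when the model overfits.
• PReLU (Parametric): $\alpha$ is a parameter learned during training and updated by backpropagation. Suited to large datasets.
• ELU (Exponential): the gradient is slower to compute, but because there are no dead neurons the network converges faster overall. These notes list its hyperparameter as 0.01; common library defaults for ELU use $\alpha = 1$.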

ELU can be used directly in TensorFlow:

    hidden1 = fully_connected(X, n_hidden1, activation_fn=tf.nn.elu)
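To make the differences between these activations concrete, here is a minimal plain-Python sketch of ReLU, leaky ReLU, and ELU for scalar inputs. The function names and the default `alpha` values are illustrative choices, not TensorFlow's API; ELU is written in its usual form $\alpha(e^z - 1)$ for $z \le 0$.

```python
import math

def relu(z):
    # Zero output (and zero gradient) for every negative input
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for negative inputs keeps the gradient non-zero
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    # Smooth exponential branch for negative inputs, saturating at -alpha
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

for z in (-2.0, -0.5, 0.0, 1.5):
    print(z, relu(z), leaky_relu(z), elu(z))
```

For positive inputs all three functions return the input unchanged; they differ only in how they treat negative inputs.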


## Random initialization

Random initialization is widely used, but it has a drawback: if the distribution is poorly chosen, the network becomes hard to optimize.

With weights that are too small, the variance of the activations decreases layer by layer. The gradients propagated backward (the gradients with respect to each layer's state) shrink in the same way.

As the number of layers increases, the gradient gets closer and closer to 0 and eventually vanishes.

### Code verification

    import tensorflow as tf
    import numpy as np
    import matplotlib.pyplot as plt

    # 2000 samples with 800 features, passed through 10 layers of shrinking width
    data = tf.constant(np.random.randn(2000, 800))
    layer_sizes = [800 - 50 * i for i in range(0, 11)]
    num_layers = len(layer_sizes)
    fcs = []  # activations of each layer
    fig, axs = plt.subplots(2, 5, figsize=(15, 6), sharey=True)
    for i in range(0, num_layers - 1):
        X = data if i == 0 else fcs[i - 1]
        node_in = layer_sizes[i]
        node_out = layer_sizes[i + 1]
        # Small random weights: mean 0, standard deviation 0.01
        W = tf.Variable(np.random.randn(node_in, node_out)) * 0.01
        fc = tf.matmul(X, W)
        fc = tf.nn.tanh(fc)
        fcs.append(fc)
        # Histogram of one unit's activations across the batch
        axs[i // 5, i % 5].hist(fc.numpy()[:, 1])

    plt.show()


Now increase the scale of the initial weights: draw them with mean 0 and standard deviation 1, that is, drop the 0.01 factor.
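The effect of the weight scale can also be checked numerically without plotting. The sketch below is a NumPy re-implementation of the experiment above, not the original TensorFlow code; the helper name `activation_stds` and the fixed seed are assumptions made for illustration. It records the standard deviation of the tanh activations at each layer for a given weight scale.

```python
import numpy as np

def activation_stds(scale, seed=0):
    """Return the std of the tanh activations at each of the 10 layers."""
    rng = np.random.default_rng(seed)
    layer_sizes = [800 - 50 * i for i in range(11)]
    x = rng.standard_normal((2000, layer_sizes[0]))
    stds = []
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        w = rng.standard_normal((n_in, n_out)) * scale
        x = np.tanh(x @ w)
        stds.append(float(x.std()))
    return stds

small = activation_stds(0.01)  # weights with standard deviation 0.01
large = activation_stds(1.0)   # weights with standard deviation 1
print("scale 0.01:", [f"{s:.1e}" for s in small])
print("scale 1.0: ", [f"{s:.2f}" for s in large])
```

Under these assumptions the small scale shrinks the activations toward 0 layer by layer, while a standard deviation of 1 pushes tanh into saturation near ±1, where its gradient is close to 0.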

## Xavier initialization

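The basic idea of Xavier initialization is to keep the variance of a layer's outputs equal to the variance of its inputs, which prevents all output values from drifting toward 0. Note that, for simplicity, the derivation of Xavier initialization assumes a linear function, but it also works well for some nonlinear neurons.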

For weights drawn uniformly from $[-r, r]$, or from a normal distribution with mean 0 and standard deviation $\sigma$:

$$r=\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}, \qquad \sigma=\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}$$

tanh

$$r=4\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}, \qquad \sigma=4\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}$$

ReLU

$$r=\sqrt{2}\sqrt{\frac{6}{n_{inputs}+n_{outputs}}}, \qquad \sigma=\sqrt{2}\sqrt{\frac{2}{n_{inputs}+n_{outputs}}}$$
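As a sketch of how these formulas translate into numbers, the helper below computes $r$ and $\sigma$ for a given layer shape. The function name `xavier_params` and the `gain` argument (1 for the base case, 4 for tanh, $\sqrt{2}$ for ReLU, following the formulas above) are naming choices made here, not part of any library.

```python
import math

def xavier_params(n_inputs, n_outputs, gain=1.0):
    """Return (r, sigma): the uniform bound and the normal std."""
    fan_sum = n_inputs + n_outputs
    r = gain * math.sqrt(6.0 / fan_sum)
    sigma = gain * math.sqrt(2.0 / fan_sum)
    return r, sigma

# A layer with 800 inputs and 750 outputs, as in the experiment in these notes
for name, gain in [("base", 1.0), ("tanh", 4.0), ("relu", math.sqrt(2.0))]:
    r, sigma = xavier_params(800, 750, gain)
    print(f"{name}: r={r:.4f}, sigma={sigma:.4f}")
```

A uniform distribution on $[-r, r]$ has standard deviation $r/\sqrt{3}$, which equals $\sigma$, so the uniform and normal forms describe the same variance.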

### Code verification

    import tensorflow as tf
    import numpy as np
    import matplotlib.pyplot as plt

    # Same experiment as before, now with Xavier-scaled weights
    data = tf.constant(np.random.randn(2000, 800))
    layer_sizes = [800 - 50 * i for i in range(0, 11)]
    num_layers = len(layer_sizes)
    fcs = []  # activations of each layer
    fig, axs = plt.subplots(2, 5, figsize=(15, 6), sharey=True)
    for i in range(0, num_layers - 1):
        X = data if i == 0 else fcs[i - 1]
        node_in = layer_sizes[i]
        node_out = layer_sizes[i + 1]
        # Xavier scale: standard deviation sqrt(2 / (n_in + n_out))
        W = tf.Variable(np.random.randn(node_in, node_out)) * np.sqrt(2 / (node_in + node_out))
        fc = tf.matmul(X, W)
        fc = tf.nn.tanh(fc)
        fcs.append(fc)
        # Histogram of one unit's activations across the batch
        axs[i // 5, i % 5].hist(fc.numpy()[:, 1])
        axs[i // 5, i % 5].set_xlim([-1, 1])

    plt.show()


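For ReLU, change the weight scale and the activation in the loop above:

    W = tf.Variable(np.random.randn(node_in, node_out)) * np.sqrt(4 / (node_in + node_out))
    fc = tf.nn.relu(fc)

The first few layers look good, but in the later layers the activations drift closer and closer to 0.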

## Summary of parameter initialization

Generally speaking, both parameter initialization and the choice of activation function can only alleviate vanishing gradients; they cannot eliminate the problem.

The opposite failure also occurs. During backpropagation, many gradient factors may be greater than 1. Under the chain rule these factors are multiplied together, so the product can grow without bound. This is the exploding gradient problem: gradient descent then takes a huge step and jumps out of the region around the optimal solution.
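A quick numeric illustration of the chain-rule argument: if each layer contributes a factor slightly below or above 1 to the gradient, the product over many layers shrinks or grows exponentially. The per-layer factors 0.9 and 1.1 and the depth of 50 are arbitrary values chosen for illustration.

```python
def gradient_scale(per_layer_factor, depth):
    # Product of `depth` identical chain-rule factors
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_factor
    return scale

vanishing = gradient_scale(0.9, 50)  # factors below 1: the gradient shrinks
exploding = gradient_scale(1.1, 50)  # factors above 1: the gradient grows
print(f"0.9^50 = {vanishing:.4f}")
print(f"1.1^50 = {exploding:.1f}")
```

Real networks have different factors at every layer, but the same multiplicative effect applies.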

## Solutions

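• Batch Normalization: a normalization technique that acts mainly on the activations.
• The activation function also matters: ReLU is preferable to tanh and sigmoid.
• Initialization of w: if the absolute values of w are large at the start, gradients are more likely to explode or vanish.
• Network topology design can mitigate vanishing or exploding gradients to some extent.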

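Normalization

1. BN works over the batch: it normalizes across N, H, W and keeps the channel dimension C. BN performs poorly with small batch sizes. It suits feed-forward networks of fixed depth such as CNNs, and is not suited to RNNs.
2. LN works along the channel direction: it normalizes across C, H, W, and is most effective for RNNs.
3. IN works on the pixels of each image: it normalizes across H, W, and is used in style transfer.
4. GN splits the channels into groups and then normalizes within each group.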
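The four methods differ only in which axes the mean and variance are computed over. The NumPy sketch below makes this explicit for a tensor of shape `(N, C, H, W)`; the helper `normalize`, the tensor shape, and the group count of 2 are assumptions chosen for illustration, and the learned scale and shift parameters are omitted.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    # Subtract the mean and divide by the std computed over `axes`
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4, 28, 28)) * 3.0 + 2.0  # (N, C, H, W)
N, C, H, W = x.shape

bn = normalize(x, (0, 2, 3))   # BN: over N, H, W; one statistic per channel
ln = normalize(x, (1, 2, 3))   # LN: over C, H, W; one statistic per sample
inn = normalize(x, (2, 3))     # IN: over H, W; one per sample and channel

groups = 2                     # GN: split C into groups, normalize each group
xg = x.reshape(N, groups, C // groups, H, W)
gn = normalize(xg, (2, 3, 4)).reshape(N, C, H, W)

print(np.abs(bn.mean(axis=(0, 2, 3))).max())  # per-channel mean after BN
```

Only BN mixes information across samples, which is why its statistics depend on the batch size while the other three do not.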

## Batch Normalization

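BN works as follows: take one batch and treat a given channel of every image in that batch as the unit to normalize. For example, for images of shape $[10, 3, 28, 28]$, normalizing the first channel means taking the first channel of all 10 images and normalizing those $10 \times 28 \times 28$ values together…

Batch Normalization is a clever, if blunt, way to weaken the effect of bad initialization. What we want is for the values entering a nonlinear activation to have a well-behaved distribution (for example, Gaussian).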

### Why use BN

1. A deep network is usually trained one mini-batch at a time. Each batch has a different distribution, which makes the model hard to train.
2. Internal Covariate Shift (ICS): during training, the activation functions change the distribution of the data at each layer. As the network gets deeper, this change grows larger and larger, which makes the model especially hard to train: convergence becomes very slow and gradients vanish. (This can also be seen as an ill-conditioned matrix problem, where a small perturbation causes a large change.)

### Where to apply BN

After a fully connected layer or a convolution, and before the activation function.

### The BN algorithm

Batch Normalization forcibly applies a Gaussian normalization to the output values, followed by a linear transformation.

Input: values of $x$ over a mini-batch $\mathcal{B} = \{x_{1,\ldots,m}\}$; parameters to be learned: $\gamma, \beta$

Output: $\{y_i = \mathrm{BN}_{\gamma,\beta}(x_i)\}$

$$
\begin{aligned}
\mu_{\mathcal{B}} &\leftarrow \frac{1}{m}\sum_{i=1}^m x_i \\
\sigma_{\mathcal{B}}^2 &\leftarrow \frac{1}{m}\sum_{i=1}^m (x_i-\mu_{\mathcal{B}})^2 \\
\hat{x}_i &\leftarrow \frac{x_i-\mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}} \\
y_i &\leftarrow \gamma\hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i)
\end{aligned}
$$

All operations in Batch Normalization are smooth and differentiable, so backpropagation works and can learn the parameters $\gamma, \beta$. Note that Batch Normalization behaves differently in training and testing. In training, $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are computed from the current batch. In testing, $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ should be the averages saved during training, or similarly processed values, rather than statistics computed from the current batch.
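The algorithm and the training/testing distinction can be sketched in plain Python for a mini-batch of scalars. This is a minimal illustration, not a library implementation: the class name `BatchNorm1D`, the exponential moving average used to save the training statistics, and the `momentum` value are assumptions (the notes only say that testing should use values saved during training).

```python
import math

class BatchNorm1D:
    def __init__(self, gamma=1.0, beta=0.0, eps=1e-5, momentum=0.9):
        self.gamma, self.beta, self.eps = gamma, beta, eps
        self.momentum = momentum
        self.running_mean, self.running_var = 0.0, 1.0

    def __call__(self, xs, training):
        if training:
            m = len(xs)
            mu = sum(xs) / m                          # mini-batch mean
            var = sum((x - mu) ** 2 for x in xs) / m  # mini-batch variance
            # Save smoothed statistics for use at test time
            k = self.momentum
            self.running_mean = k * self.running_mean + (1 - k) * mu
            self.running_var = k * self.running_var + (1 - k) * var
        else:
            mu, var = self.running_mean, self.running_var
        # Normalize, then scale and shift: y = gamma * x_hat + beta
        return [self.gamma * (x - mu) / math.sqrt(var + self.eps) + self.beta
                for x in xs]

layer = BatchNorm1D()
batch = [1.0, 2.0, 3.0, 4.0]
y_train = layer(batch, training=True)   # uses the batch statistics
y_test = layer(batch, training=False)   # uses the saved statistics
print(y_train)
print(y_test)
```

The same input gives different outputs in the two modes, because the test-time path ignores the statistics of the batch it is given.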

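### Benefits of BN

1. Allows a larger learning rate.
2. Reduces the strong dependence on initialization.
3. Keeps the mean and variance of the values in the hidden layers stable, which gives the later layers a solid foundation.
4. Has a slight regularization effect (it acts like adding noise to the hidden layers, similar to Dropout).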

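### Problems with BN

1. The mean and variance are computed on one batch at a time. If the batch size is too small, they do not represent the distribution of the whole dataset.
2. If the batch size is too large, it may exceed memory capacity; it requires more epochs, so total training time grows; and it effectively fixes the direction of gradient descent, which makes the parameters hard to update.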
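The first problem can be illustrated numerically: the smaller the batch, the noisier the batch mean is as an estimate of the true mean. The sketch below uses synthetic standard-normal data; the batch sizes, the number of trials, and the helper name `batch_mean_spread` are arbitrary choices for illustration.

```python
import random
import statistics

def batch_mean_spread(batch_size, trials=2000, seed=0):
    # Std of the batch mean across many random batches drawn from N(0, 1)
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(0.0, 1.0) for _ in range(batch_size)])
        for _ in range(trials)
    ]
    return statistics.pstdev(means)

spread_small = batch_mean_spread(2)
spread_large = batch_mean_spread(64)
print(f"batch size 2:  {spread_small:.3f}")
print(f"batch size 64: {spread_large:.3f}")
```

The spread of the batch mean falls roughly as one over the square root of the batch size, so very small batches feed BN much noisier statistics.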

## Group Normalization

