# Parameter Updates: Optimization Algorithms for Deep Learning

## Learning-rate schedules

The initial learning rate is usually set around 1e-3, and it is usually helpful to decrease it over the course of training. The decay schemes below are suitable for optimizers that use first-order momentum.

Common ways to decay the learning rate:

• Step decay: decrease the rate every few epochs, e.g. halve it every 5 epochs, or multiply it by 0.1 every 20 epochs, depending on the problem. A suitable rate can be found by trying several fixed learning rates and watching the error on the validation set.

• Exponential decay: $\alpha = \alpha_0 e^{-k t}$, where $t$ is the iteration number and the rest are hyperparameters.

• 1/t decay: $\alpha = \alpha_0 / (1 + k t)$

• Cosine decay: $\alpha = \alpha_0 \times (1 + \cos(\pi t / T)) / 2$, where $T$ is the total number of iterations. Similar to exponential decay, but with a different curve; it is the non-restart version of SGDR decay (restart means periodically resetting the learning rate).

In practice, step decay is the most common because its parameters are easy to interpret and intuitive: set a modest decay rate and train for a longer time.
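These schedules can be sketched as small Python functions (the default constants are illustrative, not values from the text):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=5):
    """Multiply the learning rate by `drop` every `every` epochs."""
    return lr0 * drop ** (epoch // every)

def exp_decay(lr0, t, k=0.01):
    """Exponential decay: lr0 * e^{-k t}."""
    return lr0 * math.exp(-k * t)

def inv_decay(lr0, t, k=0.01):
    """1/t decay: lr0 / (1 + k t)."""
    return lr0 / (1 + k * t)

def cosine_decay(lr0, t, T):
    """Cosine decay from lr0 down to 0 over T iterations."""
    return lr0 * (1 + math.cos(math.pi * t / T)) / 2
```

For example, `step_decay(1e-3, 10)` halves the rate twice (epochs 5 and 10), giving 2.5e-4.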

## A single framework for first-order optimizers

Sorting all the optimization algorithms into one framework makes for a more systematic comparison.

First, some definitions: the parameters to optimize are $w$, the objective function is $f(w)$, and the initial learning rate is $\alpha$.

Then iterative optimization begins. At each step $t$:

1. Compute the gradient of the objective with respect to the current parameters: $g_t=\nabla f(w_t)$
2. Compute first- and second-order momentum from the gradient history: $m_t = \phi(g_1, g_2, \cdots, g_t)$; $V_t = \psi(g_1, g_2, \cdots, g_t)$
3. Compute the descent step at the current time: $\eta_t = \alpha \cdot m_t / \sqrt{V_t}$
4. Update with the descent step: $w_{t+1} = w_t - \eta_t$

With this framework in hand, you can easily design optimization algorithms of your own.

For SGD, the first-order momentum is simply the current gradient ($m_t = g_t$, with no second-order momentum); its drawback is that it may get stuck at a local optimum.

To address this, momentum was proposed: the descent direction is determined not only by the gradient at the current point but also by the accumulated descent direction of the previous steps.

## Weight-update methods (first order)

### Vanilla update (SGD)

(As an adjective, "vanilla" means plain or ordinary.)

Update the parameters along the negative gradient direction to reduce the loss.

It descends slowly and easily gets stuck at local points.

# Vanilla update; learning_rate is a hyperparameter
x += - learning_rate * dx


### Momentum update

SGD with momentum introduces first-order momentum on top of SGD:

$$m_t = \beta_1 \cdot m_{t-1} + \alpha \cdot g_t$$

To keep the update direction from changing abruptly when every gradient is computed independently, Momentum adds a decayed copy of the previous round's update to the current gradient step.

An intuitive picture: like a heavy ball rolling downhill, inertia damps the oscillation and accelerates the descent.

# Momentum update
# v is initialized to 0; mu is called the momentum, commonly 0.9
v = mu * v - learning_rate * dx  # integrate velocity
x += v                           # integrate position


Momentum makes the updates of adjacent time steps more consistent in direction, reducing the chance of divergence when a large gradient is applied.

### Nesterov Momentum

It performs better than the momentum update above; the improvement is that the gradient is evaluated at the look-ahead position reached after the momentum step:

x_ahead = x + mu * v
# dx_ahead is the gradient evaluated at x_ahead instead of at x
v = mu * v - learning_rate * dx_ahead
x += v


### Nesterov’s Accelerated Momentum (NAG)

It performs better still. In the unified framework, the gradient is evaluated at the point the momentum step is about to reach:

$$g_t=\nabla f(w_t-\alpha \cdot m_{t-1} / \sqrt{V_{t-1}})$$

### AdaGrad

AdaGrad accumulates the squared gradients per dimension and scales the step accordingly:

$$\begin{aligned} s_t &\leftarrow s_{t-1} + g_t \odot g_t \\ x_t &\leftarrow x_{t-1}-\frac{\eta}{\sqrt{s_t+\epsilon}} \odot g_t \end{aligned}$$

Here $\odot$ denotes element-wise multiplication; the squaring, division, and multiplication are all element-wise. These element-wise operations give every element of the parameter vector its own learning rate.

For frequently updated parameters we have already accumulated a lot of knowledge, so we do not want any single sample to have too much influence and prefer a smaller learning rate. For rarely updated parameters we know very little, so we want to learn more from each occasional sample, i.e. a larger learning rate.

How do we measure the historical update frequency? With the second-order momentum: the sum of squares of all gradient values so far in that dimension:

$$V_t = \sum_{\tau=1}^{t} g_\tau^2$$

$$x_i=x_i-\frac{\alpha}{\sqrt{V_t}}\,g_i^t$$

# AdaGrad: v accumulates squared gradients per dimension;
# eps (e.g. 1e-8) avoids division by zero
v += dx**2
x += - learning_rate * dx / (np.sqrt(v) + eps)


The adaptive learning rate acts as a penalty on dimensions where a single step dx would be too large, adding resistance against oscillation.

Advantages: no decay schedule needs to be specified (only the initial learning rate).

Disadvantages: the accumulated squared gradients make the learning rate very small in later stages. As the number of iterations grows, the learning rate shrinks toward 0 and learning effectively stops.

### AdaDelta / RMSProp

Because AdaGrad's monotonically decreasing learning rate is too aggressive, we can change the way the second-order momentum is computed: instead of accumulating all historical gradients, focus only on the gradients in a recent window. This is the origin of the "Delta" in the name AdaDelta; RMSProp (Root Mean Square propagation) is a similar algorithm.

**RMSProp**

An exponential moving average is roughly an average over a recent period, so RMSProp uses it for the second-order momentum:

$$V_t = \beta_2 \cdot V_{t-1} + (1-\beta_2)\, g_t^2$$

This moving average has the same form as the momentum method, so RMSProp can be seen as combining the ideas of Momentum and AdaGrad:

# RMSProp: exponential moving average of the squared gradients
v = mu * v + (1 - mu) * dx**2
x += - learning_rate * dx / (np.sqrt(v) + eps)


AdaDelta additionally replaces the global learning rate with a moving average of past updates $\Delta x$:

$$\begin{aligned} s_t &\leftarrow \rho\, s_{t-1} + (1-\rho)\, g_t \odot g_t \\ g_t' &\leftarrow \sqrt{\frac{\Delta x_{t-1}+\epsilon}{s_t+\epsilon}} \odot g_t \\ x_t &\leftarrow x_{t-1}-g_t' \\ \Delta x_t &\leftarrow \rho\,\Delta x_{t-1}+(1-\rho)\, g_t'\odot g_t' \end{aligned}$$

### Adam

Adam combines the two ideas: first-order momentum from Momentum and second-order momentum from RMSProp:

$$\begin{aligned} m_t &= \beta_1 \cdot m_{t-1} + (1-\beta_1)\cdot g_t \\ V_t &= \beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2 \end{aligned}$$

# Adam (without bias correction)
m = mu1 * m + (1 - mu1) * dx
v = mu2 * v + (1 - mu2) * dx**2
x += - learning_rate * m / (np.sqrt(v) + eps)


Because $m$ and $V$ are initialized to zero, Adam corrects their bias toward zero during the early steps:

$$\begin{aligned} \hat m_t &=\frac{m_t}{1-\beta_1^t} \\ \hat V_t &=\frac{V_t}{1-\beta_2^t} \end{aligned}$$

$$w_{t+1} = w_t - \frac{\alpha}{\sqrt{\hat V_t}+\epsilon}\,\hat m_t$$
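Putting the pieces together, a minimal scalar Adam step with bias correction (the hyperparameter defaults are the commonly used ones, and the toy objective is illustrative):

```python
import math

def adam_step(w, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t counts steps from 1."""
    m = beta1 * m + (1 - beta1) * g          # first-order momentum
    v = beta2 * v + (1 - beta2) * g * g      # second-order momentum
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w^2 (gradient 2w) from w = 1
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, 2 * w, m, v, t, alpha=0.05)
```

Note that thanks to the bias correction, the very first step moves the parameter by almost exactly `alpha`, regardless of the gradient's scale.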

**Nadam** additionally applies Nesterov's look-ahead idea to Adam:

$$g_t=\nabla f(w_t-\alpha \cdot m_{t-1} / \sqrt{V_t})$$

(Further variants such as Eve exist as well.)

### AMSGrad

An ICLR 2018 paper, "On the Convergence of Adam and Beyond", proposed a fix for Adam's learning-rate oscillation problem.

The mitigation: clip the second-order momentum numerically so that it is non-decreasing (similar to isotonic regression):

$$V_t = \max(\beta_2 \cdot V_{t-1} + (1-\beta_2)\, g_t^2,\; V_{t-1})$$

In experiments, however, the results of this approach are not very satisfying.

### Weight decay and AdamW

Weight decay shrinks the weights directly at each update:

$$w_{t+1} = (1-d_w)\cdot w_t - \alpha \cdot g_t$$

The L2 regularization term, by contrast, is added to the loss function:

$$\begin{aligned} L &= \ell + \frac{1}{2} d_w \|w_t\|^2 \\ \nabla w_t &= g_t + d_w \cdot w_t \\ w_{t+1} &= w_t - \alpha \cdot \nabla w_t \\ &= (1-\alpha \cdot d_w)\cdot w_t - \alpha \cdot g_t \end{aligned}$$

So in SGD, weight decay is formally equivalent to an L2 regularization term, though the amount of decay is not equal (it differs by a factor of $\alpha$).

After adding momentum, the L2-regularized update becomes:

$$\begin{aligned} \text{L2:}\quad m_t &= \beta_1 \cdot m_{t-1} + \alpha \cdot (g_t+d_w \cdot w_t) \\ w_{t+1} &= w_t - m_t \end{aligned}$$

Whereas applying the original notion of weight decay gives:

$$\begin{aligned} m_t &= \beta_1 \cdot m_{t-1} + \alpha\cdot g_t \\ \text{weight decay:}\quad w_{t+1} &= w_t - m_t - d_w \cdot w_t \end{aligned}$$

With L2 regularization, the learning rate acts on the decay term, so a smaller learning rate means less weight decay. In an adaptive-learning-rate optimizer such as Adam, a larger weight gradient means a smaller adaptive learning rate and hence less decay for that weight, which partly cancels the intended effect of weight decay. Following the original notion of weight decay and applying it inside the optimizer decouples it completely from the learning rate, so it truly acts as weight decay (this is AdamW):

$$\begin{aligned} g_t &= \nabla f(w_t) \\ m_t &= \beta_1 \cdot m_{t-1} + (1-\beta_1)\cdot g_t \\ V_t &= \beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2 \\ \hat m_t &=\frac{m_t}{1-\beta_1^t} \\ \hat V_t &=\frac{V_t}{1-\beta_2^t} \\ w_{t+1} &= w_t - \eta_t\left[\frac{\alpha}{\sqrt{\hat V_t}+\epsilon}\,\hat m_t + d_w \cdot w_t\right] \end{aligned}$$
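A scalar sketch of the decoupled update above, with the schedule multiplier $\eta_t$ fixed to 1 (parameter names and defaults are illustrative):

```python
import math

def adamw_step(w, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step for a scalar parameter: the decay term d_w * w is
    applied directly to the weight and never enters m or v."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    w = w - (alpha * m_hat / (math.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```

With a zero gradient, one step simply shrinks the weight by the factor `1 - weight_decay`, independent of the adaptive learning rate.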


### LAMB/LARS

In data-parallel distributed training, with the per-device batch size fixed, more devices mean a larger global batch size. But an overly large batch size causes convergence problems (refer1, refer2):

• the model loses final accuracy

• convergence is slower, and more epochs are needed to converge

LARS and LAMB are two optimization strategies commonly used to solve these convergence problems in very-large-batch training.

$$x_{t+1}^{(i)}=x_{t}^{(i)}-\eta_{t} \frac{\phi\left(\left\|x_{t}^{(i)}\right\|\right)}{\left\|g_{t}^{(i)}\right\|} g_{t}^{(i)}$$

In particular, the L2 norm is taken separately for each layer $i$ of the multi-layer network.

Pros and cons of this normalization: it discards the magnitude of the gradient and keeps only its direction, which biases the update, though the problem is weakened at larger batch sizes. The clear benefit is robustness to gradient explosion and vanishing gradients.

In pseudocode, the LARS update (with momentum and weight decay) is:

$$\begin{aligned} \text{local learning rate:}\quad \alpha &= \eta_t \frac{\beta_{lars} \|x_t\|}{\|g_t\| + d_w \|x_t\|} \\ \text{velocity:}\quad m_t &= \beta_1 \cdot m_{t-1} + \alpha \cdot (g_t+d_w x_t+\epsilon) \\ x_t &= x_{t-1} - m_t \end{aligned}$$

This line of work, pushed by the Google Brain team in 2019 to accelerate BERT pre-training, increases the batch size without reducing accuracy and cuts overall wall-clock time by fully exploiting compute resources such as GPUs and TPUs (rather than by shrinking the model or the optimizer's parameter count).

It was found that **LARS is unstable across different models**, e.g. on attention models such as BERT; the authors analyze this from a theoretical standpoint (the appendix contains a pile of formulas).

The LAMB (Layer-wise Adaptive Moments optimizer for Batching training) optimizer improves on LARS by building on Adam, aiming to solve the convergence problems of very-large-batch training. It enlarges the training batch size without losing accuracy, and supports adaptive element-wise updating together with precise layer-wise correction.

The LAMB parameter update is:

$$\begin{aligned} m_t &=\beta_1 m_{t-1}+(1-\beta_1)g_t \\ v_t &=\beta_2 v_{t-1}+(1-\beta_2)g_t^2 \\ \hat{m_t} &=m_t/(1-\beta_1^t) \\ \hat{v_t} &=v_t/(1-\beta_2^t) \\ r_t &=\frac{\hat{m_t}}{\sqrt{\hat{v_t}}+\epsilon} \\ x_t &=x_{t-1}-\alpha\cdot\frac{\phi(\|x_{t-1}\|)}{\|r_t+ d_w x_{t-1}\|}(r_t + d_w x_{t-1}) \end{aligned}$$

The practical effect of this simple adjustment is very significant. With AdamW, model quality degrades sharply once the batch size exceeds 512, but under LAMB the batch size can be raised directly to 32k without loss of accuracy. LAMB can also be used as a general-purpose optimizer for small-batch training.


### Choosing a method

A tentative view: adaptive-learning-rate algorithms such as Adam have the advantage on sparse data and converge very fast, but a carefully tuned SGD (+ Momentum) can often reach a better final result.

Most NLP pre-training models no longer use SGD, AdaGrad, or RMSProp, but the later AdamW and LAMB.

We can expect more and better optimization algorithms in the future.

## Batch, stochastic, and mini-batch gradient descent

Take linear regression as an example:

hypothesis: $h_\theta(x)=\sum_{j=0}^n \theta_jx_j$

$m$ is the total number of samples and $n$ is the number of parameters $\theta_j$.

**Batch gradient descent.** Advantages: it reaches the global optimum (for a convex cost) and is easy to parallelize.

Disadvantages: training is slow, and it is hard to incorporate new samples dynamically.

One gradient-descent step, iterated many times, is:

$$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}\text{cost}(\theta) \qquad (\text{for } j=0,\ldots,n)$$

**Stochastic gradient descent.**

1. Randomly shuffle the training samples.
2. Repeat the following gradient-descent sweep:

for i = 1, …, m {

$\theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}\text{cost}(\theta,(x^{(i)},y^{(i)})) \qquad (\text{for } j=0,\ldots,n)$

}

Since only one sample is used at a time, there is an extra inner loop. In practice this is rarely used verbatim, because vectorized operations over a batch are faster than processing samples individually. People sometimes use "SGD" to refer to mini-batch gradient descent (since the mini-batches are usually chosen randomly).

Disadvantages: it does not reach the exact global optimum, and it is not easy to parallelize.

**Mini-batch gradient descent.** Each gradient computation uses only a small part of the training set; common mini-batch sizes are 32 / 64 / 128 (convenient for vectorized operations). ILSVRC ConvNets, for example, use 256 samples.

for i = 1, 1+b, 1+2b, …, ~m {

$\theta_j:=\theta_j-\alpha\frac{1}{b}\sum_{k=i}^{i+b-1}\frac{\partial}{\partial\theta_j}\text{cost}(\theta,(x^{(k)},y^{(k)})) \qquad (\text{for } j=0,\ldots,n)$

}

Thanks to vectorization, it can be faster than stochastic gradient descent.

Advantage: it reduces the variance of the parameter updates.

Disadvantage: it is strongly affected by the learning rate.

## Second-order methods

All the methods discussed above are first-order. A more powerful approach is based on Newton's method:

$$x \leftarrow x - [H_f(x)]^{-1} \nabla f(x)$$

Newton's method uses both first- and second-derivative information to jump directly toward a point with zero gradient. It guarantees neither that the function value decreases at every iteration nor that it converges to a minimum. In implementations a learning rate is still needed, for the same reason as in gradient descent: so that the higher-order terms of the Taylor expansion can be neglected. The learning rate is usually set by line search.

### Newton's method

$$\begin{aligned} f(x)&\approx f(x_n)+f'(x_n)(x-x_n)+\frac{1}{2}f''(x_n)(x-x_n)^2 \\ f(x_n+\Delta x)&\approx f(x_n)+f'(x_n)\Delta x+\frac{1}{2}f''(x_n)\Delta x^2 \end{aligned}$$

$$\begin{aligned} \frac{d f(x_n+\Delta x)}{d\Delta x}&=f'(x_n)+f''(x_n)\Delta x=0 \\ \Longrightarrow \Delta x&=-\frac{f'(x_n)}{f''(x_n)} \\ x_{n+1}&=x_n+\Delta x=x_n-[f''(x_n)]^{-1}f'(x_n) \end{aligned}$$
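A 1-D sketch of this iteration (the test function is illustrative):

```python
def newton_minimize(fp, fpp, x0, iters=20):
    """Minimize a 1-D function given its first (fp) and second (fpp)
    derivatives, via x_{n+1} = x_n - f'(x_n) / f''(x_n)."""
    x = x0
    for _ in range(iters):
        x = x - fp(x) / fpp(x)
    return x

# Minimize f(x) = x^4 - 3x^2 + 2, starting near x = 2;
# f'(x) = 4x^3 - 6x has a local-minimum root at x = sqrt(1.5)
fp = lambda x: 4 * x**3 - 6 * x
fpp = lambda x: 12 * x**2 - 6
x_star = newton_minimize(fp, fpp, 2.0)
```

Near the solution the convergence is quadratic, so 20 iterations are far more than enough here.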

To address Newton's method's shortcomings, improved algorithms exist; they are collectively called quasi-Newton methods, with BFGS and L-BFGS as the representatives.

BFGS replaces the computation of the inverse Hessian with a low-rank approximation, which greatly speeds up each step. However, the history of variables needed to build the approximate inverse at each iteration must still be stored, so the memory cost is high.

L-BFGS (limited-memory BFGS) improves on BFGS by simply keeping only the most recent m history entries. The algorithm does not store an approximate inverse Hessian at all; instead it obtains the current search direction directly by iteration, which greatly reduces the memory cost.

### Newton's method vs. gradient descent

Newton's method usually converges faster than gradient descent and can be viewed as its limit (the inverse Hessian playing the role of part of the learning rate). Newton's step size is computed from derivatives, so near a saddle point the steps shrink and the method easily gets trapped there; SGD's step size is a preset constant, which makes it relatively easier to step across some saddle points.

In practice, gradient-descent-based optimizers such as RMSProp and Adam are used far more widely, but the improved Newton-type methods such as BFGS and L-BFGS have their own strengths and remain very practical.


### Equilibrated SGD

It uses second-order gradient information to address SGD's trouble with saddle points:

$$\begin{aligned} x_i &=x_i-\frac{\alpha}{\sqrt{D_i^s}}\,g_i^s \\ D_i^s &=\rho D_i^{s-1}+(1-\rho)(H_d)^2 \end{aligned}$$

$H_d$ denotes the diagonal elements of the Hessian of $L(x)$, with $x \sim \mathcal N(0,1)$.

## Parallel and distributed SGD

• Hogwild
• Downpour

Downpour SGD contains two key parts: model replicas and a parameter server.

## Gradient accumulation

A trick for increasing the effective batch size under limited memory: run backpropagation for several consecutive batches, accumulating their gradients, and only then apply a single parameter update.

On one hand this reduces the number of parameter updates; on the other it effectively enlarges the batch size, and the accumulated gradient is more stable (smoothing out extreme values).

## References

Original: https://www.cnblogs.com/makefile/p/dl-optimizer.html
Author: 康行天下
Title: 参数更新-深度学习优化算法
