Parameter Updates: Deep Learning Optimization Algorithms

Learning-Rate Schedules

The initial learning rate is usually set to 1e-3, and it is usually helpful to decay it over time during training. The decay schedules below apply to optimization algorithms that use first-order momentum.

Common ways to decay the learning rate are listed below (a short code sketch of these schedules follows the list):

  • Step decay: reduce the learning rate every few epochs, for example halve it every 5 epochs or multiply it by 0.1 every 20 epochs, as the situation requires. A suitable rate can be found by trying several fixed learning rates and watching the validation error.

  • Exponential decay: $\alpha = \alpha_0 e^{-kt}$, where $t$ is the iteration count and the rest are hyperparameters.

  • 1/t decay: $\alpha = \alpha_0 / (1 + kt)$.

  • Cosine decay: $\alpha = \alpha_0 \times (1 + \cos(\pi t / T)) / 2$, where $T$ is the total number of iterations. Similar to exponential decay but with a different curve; it is the non-restart version of the SGDR schedule (a restart periodically resets the learning rate).
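As a concrete illustration, here is a minimal Python sketch of the closed-form schedules above; the function names and the hyperparameter values (including lr0 = 1e-3) are illustrative choices, not prescriptions from the text.

import math

lr0 = 1e-3  # initial learning rate

def step_decay(t, drop=0.5, every=5):
    # multiply the rate by `drop` every `every` epochs (here: halve every 5 epochs)
    return lr0 * (drop ** (t // every))

def exp_decay(t, k=0.05):
    # alpha = alpha_0 * exp(-k t)
    return lr0 * math.exp(-k * t)

def inv_t_decay(t, k=0.05):
    # alpha = alpha_0 / (1 + k t)
    return lr0 / (1 + k * t)

def cosine_decay(t, T=100):
    # alpha = alpha_0 * (1 + cos(pi * t / T)) / 2
    return lr0 * (1 + math.cos(math.pi * t / T)) / 2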

In practice, step decay is the most common choice because its parameters are easy to interpret and intuitive: set a modest decay and train for a longer time.

A Single Framework for First-Order Optimization Algorithms

Deep-learning optimizers have evolved along the line SGD -> SGDM -> NAG -> AdaGrad -> AdaDelta -> Adam -> Nadam.

Putting all of these algorithms into a single framework makes it easier to compare them systematically.

First define the parameters to be optimized, $w$, the objective function, $f(w)$, and the initial learning rate, $\alpha$.

Then iterative optimization begins. At each step $t$ (a code sketch of this loop follows the list):

  1. Compute the gradient of the objective with respect to the current parameters: $g_t = \nabla f(w_t)$.
  2. Compute the first- and second-order momentum from the gradient history: $m_t = \phi(g_1, g_2, \cdots, g_t)$; $V_t = \psi(g_1, g_2, \cdots, g_t)$.
  3. Compute the descent step for the current time step: $\eta_t = \alpha \cdot m_t / \sqrt{V_t}$.
  4. Update the parameters with that step: $w_{t+1} = w_t - \eta_t$.
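A minimal sketch of this four-step loop in Python, with $\phi$ and $\psi$ left as pluggable functions; the names (optimize, grad_fn, phi, psi) and the epsilon term are illustrative assumptions, not part of the original framework.

import numpy as np

def optimize(w, grad_fn, phi, psi, alpha=1e-3, steps=1000, eps=1e-8):
    # grad_fn(w): gradient g_t; phi(history): first-order momentum m_t;
    # psi(history): second-order momentum V_t
    history = []
    for t in range(steps):
        g = grad_fn(w)                          # 1. gradient at the current point
        history.append(g)
        m = phi(history)                        # 2. first-order momentum
        V = psi(history)                        # 2. second-order momentum
        step = alpha * m / (np.sqrt(V) + eps)   # 3. descent step
        w = w - step                            # 4. parameter update
    return w

# Plain SGD falls out of the framework with phi = "last gradient" and psi = 1:
# w = optimize(w, grad_fn, phi=lambda h: h[-1], psi=lambda h: 1.0)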

With this framework in hand, you can easily design your own optimization algorithm.

SGD uses only the first-order momentum to compute the descent step; its drawback is that it may get stuck at a local optimum.

To address this problem, momentum was proposed: the descent direction is determined not only by the gradient at the current point but also by the previously accumulated descent direction.

As the name suggests, AdaGrad (Adaptive Gradient) introduces second-order momentum and an adaptive learning rate; its second-order momentum is the sum of the squares of all gradients seen so far.
AdaDelta improves on AdaGrad by not accumulating the entire gradient history but only the descent gradients within a recent time window (hence the Delta).
Adam = Adaptive + Momentum: it combines first- and second-order momentum.

Weight-Update Methods (First Order)

Vanilla update (SGD)

(As an adjective, "vanilla" means plain or ordinary.)
Update the parameters along the negative gradient direction to minimize the loss.

It descends slowly and easily gets stuck at local points.

# Vanilla update; learning_rate is a hyperparameter
x += - learning_rate * dx

Momentum Update

SGD with momentum adds first-order momentum on top of SGD:

$$m_t = \beta_1 \cdot m_{t-1} + \alpha \cdot g_t$$

To avoid the constantly changing gradient direction that results from computing each update from the current gradient alone, momentum adds the accumulated value of the previous rounds to the current gradient computation.

That is, the descent direction at time $t$ is determined not only by the gradient at the current point but also by the previously accumulated descent direction, with $\beta_1 \in [0,1)$. An empirical value is $\beta_1 = 0.9$, which means the descent direction is mainly the previously accumulated direction, tilted slightly toward the current gradient. When tuning $\beta_1$, start from 0.5 and then try [0.9, 0.95, 0.99] in turn (a process similar to simulated annealing).

Intuitively, the accumulated "momentum" damps the oscillations and accelerates the descent, like a heavy ball rolling downhill.

# Momentum update
# v is initialized to 0; mu (the momentum coefficient) is typically 0.9
v = mu * v - learning_rate * dx # integrate velocity
x += v # integrate position

The momentum method uses an exponentially weighted moving average: it averages the gradients of past time steps with weights that decay exponentially over time, i.e. $m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1)\left(\frac{\alpha_t}{1-\beta_1} g_t\right)$.

Momentum makes the updates of adjacent time steps more consistent in direction, which reduces the risk of divergence when a large gradient value appears.

Nesterov Momentum

It performs better than the plain momentum update above; the improvement is that the gradient is evaluated at the position the momentum step is about to reach (x_ahead) instead of at the current x:

x_ahead = x + mu * v
# evaluate dx_ahead (the gradient at x_ahead instead of at x)
v = mu * v - learning_rate * dx_ahead
x += v
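In practice the look-ahead gradient is often rewritten so that everything is expressed in terms of the stored parameter x itself (a commonly used variable-transform form; shown here as a sketch):

v_prev = v                        # back up the old velocity
v = mu * v - learning_rate * dx   # velocity update, same as ordinary momentum
x += -mu * v_prev + (1 + mu) * v  # position update in the look-ahead form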

Nesterov's Accelerated Gradient (NAG)

It generally performs better still. In the framework notation, NAG evaluates the gradient after first taking the accumulated-momentum step:

$$g_t = \nabla f\left(w_t - \alpha \cdot m_{t-1} / \sqrt{V_{t-1}}\right)$$


AdaGrad (adaptive learning rates via second-order momentum)

$$\begin{aligned} s_t &\leftarrow s_{t-1} + g_t \odot g_t \\ x_t &\leftarrow x_{t-1} - \frac{\eta}{\sqrt{s_t + \varepsilon}} \odot g_t \end{aligned}$$

Here $\odot$ denotes element-wise multiplication; the square, division, and multiplication above are all element-wise. These element-wise operations give every element of the parameter vector its own learning rate.

AdaGrad is short for Adaptive Gradient. To fix the drawback of gradient descent that a single fixed learning rate makes different parameters converge at different speeds, AdaGrad gives each parameter its own learning rate: after the gradient is computed, parameters with larger accumulated gradients receive a smaller learning rate (controlled by the second-order momentum), and vice versa.

For frequently updated parameters we have already accumulated a lot of knowledge, we do not want a single sample to have too much influence, and we prefer a slower learning rate; for rarely updated parameters we know too little and want to learn more from each occasional sample, i.e. a higher learning rate.

How do we measure how frequently a parameter has been updated historically? With the second-order momentum: the sum of the squares of all gradient values seen so far in that dimension:

$$V_t = \sum_{\tau=1}^{t} g_\tau^2$$

The AdaGrad update rule is:

$$x_i = x_i - \frac{\alpha}{\sqrt{V_t}} g_i^t$$

# assumes numpy as np; eps (e.g. 1e-8) avoids division by zero
v += dx**2
x += - learning_rate * dx / (np.sqrt(v) + eps)

The adaptive learning-rate adjustment effectively adds a penalty where a single step dx would otherwise be too large, which improves robustness to oscillation.

Advantage: there is no need to specify how the learning rate should decay (only the initial learning rate).

Disadvantage: the accumulated squared gradients make the learning rate very small in later stages. As the number of iterations grows, the learning rate keeps shrinking toward 0 and learning effectively stops.

AdaDelta/RMSProp

Because AdaGrad's monotonically decreasing learning rate is too aggressive, a natural strategy is to change how the second-order momentum is computed: do not accumulate the entire gradient history, but only attend to the descent gradients within a recent window. This is the origin of the Delta in the name AdaDelta. RMSProp, short for Root Mean Square Propagation, is a similar algorithm.

RMSProp

The modification is simple: compute only the root of the sum of the squared gradients over the last w steps (RMS, root mean square). Summing over a fixed-size window $g_i^{-w},\cdots,g_i^{-1}$ is conceptually equivalent to the exponentially decaying running average $E[(g_i)^2]^s=\rho E[(g_i)^2]^{s-1}+(1-\rho)(g_i^s)^2$, and the latter is cheaper to compute; here $\rho$ is the decay (also called momentum) coefficient.

An exponential moving average is roughly an average over a recent period of time, so this is used to compute the second-order accumulated momentum:

$$V_t = \beta_2 \cdot V_{t-1} + (1-\beta_2) g_t^2$$

This moving-average scheme is similar to the momentum method, so RMSProp can be seen as a combination of Momentum and AdaGrad:

# mu here plays the role of the decay rate rho (e.g. 0.9); eps avoids division by zero
v = mu * v + (1 - mu) * dx**2
x += - learning_rate * dx / (np.sqrt(v) + eps)

AdaDelta

AdaDelta differs from RMSProp in that it uses $\sqrt{\Delta x_{t-1}}$ in place of the learning-rate hyperparameter.

$$\begin{aligned} s_t &\leftarrow \rho s_{t-1} + (1-\rho)\, g_t \odot g_t \\ g_t' &\leftarrow \sqrt{\frac{\Delta x_{t-1}+\varepsilon}{s_t+\varepsilon}} \odot g_t \\ x_t &\leftarrow x_{t-1} - g_t' \\ \Delta x_t &\leftarrow \rho\Delta x_{t-1}+(1-\rho)\, g_t'\odot g_t' \end{aligned}$$

AdaDelta has no learning-rate hyperparameter: it replaces the learning rate of RMSProp with a term based on the exponentially weighted moving average of the squared parameter updates.
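A minimal numpy-style sketch of the four AdaDelta equations above; the variable names (s, delta_acc, update) and the values rho ≈ 0.95, eps ≈ 1e-6 are illustrative.

# s: running average of squared gradients; delta_acc: running average of squared updates
# both initialized to zeros with the same shape as x
s = rho * s + (1 - rho) * dx**2
update = np.sqrt((delta_acc + eps) / (s + eps)) * dx
x -= update
delta_acc = rho * delta_acc + (1 - rho) * update**2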

Advantage of AdaDelta/RMSProp: it avoids the problem of the second-order momentum accumulating without bound, which would end training prematurely.

Adam

Adam (Adaptive Moment Estimation) combines first- and second-order momentum: it introduces first-order momentum on top of RMSProp (which does not make full use of Momentum):

$$\begin{aligned} m_t &= \beta_1 \cdot m_{t-1} + (1-\beta_1)\cdot g_t \\ V_t &= \beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2 \end{aligned}$$

m = mu1 * m + (1 - mu1) * dx           # first-order momentum (mu1 = beta_1)
v = mu2 * v + (1 - mu2) * dx**2        # second-order momentum (mu2 = beta_2)
x += - learning_rate * m / (np.sqrt(v) + eps)

Typical settings are $\beta_1=0.9$, $\beta_2=0.999$, but the accumulated momenta are very small at the start: for example, with $\beta_1 = 0.9$ we get $m_1 = 0.1\, g_1$.

With $m_0 = V_0 = 0$, at time step $t$ we have $m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} g_i$. Summing the weights applied to the past mini-batch gradients gives $(1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} = 1-\beta_1^t$. When $t$ is small, this sum of weights is small.

To remove this effect, a bias correction is applied: for any time step $t$, divide $m_t$ by $1-\beta_1^t$ so that the weights on the past mini-batch gradients sum to 1.

$$\begin{aligned} \hat m_t &= \frac{m_t}{1-\beta_1^t} \\ \hat V_t &= \frac{V_t}{1-\beta_2^t} \end{aligned}$$

Adam then uses these bias-corrected variables and rescales the learning rate of every element of the model parameters through element-wise operations. The parameter update becomes:

$$w_{t+1} = w_t - \frac{\alpha}{\sqrt{\hat V_t}+\epsilon}\hat m_t$$
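Putting the pieces together, a minimal sketch of one Adam step with bias correction; the variable names and the default hyperparameter values are shown only for illustration.

# one Adam step at time t (t starts at 1); m and v are initialized to zeros
beta1, beta2, eps = 0.9, 0.999, 1e-8
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * dx**2
m_hat = m / (1 - beta1**t)        # bias-corrected first moment
v_hat = v / (1 - beta2**t)        # bias-corrected second moment
x += - learning_rate * m_hat / (np.sqrt(v_hat) + eps)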

Adam adaptively sets a learning rate for every floating-point parameter while also taking the gradient history into account. Because of the extra momentum buffers, the optimizer state is roughly doubled compared with vanilla SGD, which increases memory consumption, but it works well and is widely used in academia.

Nadam, Eve, AMSGrad

Nadam = Nesterov + Adam.
Following the NAG recipe, the gradient is evaluated at the look-ahead point:

$$g_t = \nabla f\left(w_t - \alpha \cdot m_{t-1} / \sqrt{V_t}\right)$$

Eve

An extension of Adam that adjusts the learning rate according to how much the recent loss oscillates. Adam only adapts local (per-parameter) learning rates; Eve additionally adapts the global learning rate.

AMSGrad (the second-order-momentum oscillation problem)

The ICLR 2018 paper On the Convergence of Adam and Beyond proposes a fix for Adam's learning-rate oscillation problem.

The second-order momentum is accumulated over a sliding time window, and the data encountered within the window can change drastically, so $V_t$ may swing up and down rather than change monotonically. In the later stages of training this can make the learning rate oscillate and prevent the model from converging.

Mitigation: clip the value so that the second-order momentum never decreases (similar in spirit to isotonic regression):

$$V_t = \max\left(\beta_2 \cdot V_{t-1} + (1-\beta_2) g_t^2,\; V_{t-1}\right)$$

This modification guarantees $\|V_t\| \geq \|V_{t-1}\|$, so the learning rate is monotonically non-increasing. In practice, however, the experimental results of this method are not entirely satisfactory.

AdamW (weight decay)

The paper Fixing Weight Decay Regularization in Adam (published as Decoupled Weight Decay Regularization) points out that the Adam implementations in most deep-learning libraries of the time (Caffe, TensorFlow, etc.) apply L2 regularization rather than true weight decay. Its comparison experiments show that the weight-decay form is better than L2 regularization in some settings, so weight decay is recommended over L2 regularization.

Start with the weight-decay update for vanilla SGD without momentum, where $d_w$ denotes the weight-decay coefficient:

$$w_{t+1} = (1-d_w)\cdot w_t - \alpha \cdot g_t$$

The L2 regularization term, by contrast, is added to the loss function:

$$\begin{aligned} L &= \ell + \frac{1}{2} d_w \|w_t\|^2 \\ \nabla w_t &= g_t + d_w \cdot w_t \\ w_{t+1} &= w_t - \alpha \cdot \nabla w_t \\ &= (1-\alpha \cdot d_w)\cdot w_t - \alpha \cdot g_t \end{aligned}$$

So in SGD weight decay is formally equivalent to the L2 regularization term, but the actual amount of decay is not the same (in the L2 case it is scaled by the learning rate).

After momentum is added, the L2-regularized update becomes:

$$\begin{aligned} \text{L2:}\quad m_t &= \beta_1 \cdot m_{t-1} + \alpha \cdot (g_t+d_w \cdot w_t) \\ w_{t+1} &= w_t - m_t \end{aligned}$$

Applying the original concept of weight decay instead gives:

$$\begin{aligned} m_t &= \beta_1 \cdot m_{t-1} + \alpha\cdot g_t \\ \text{weight decay:}\quad w_{t+1} &= w_t - m_t - d_w \cdot w_t \end{aligned}$$

With L2 regularization the learning rate acts on the decay term, so the smaller the learning rate, the smaller the weight decay. In an adaptive-learning-rate optimizer such as Adam, parameters with larger gradients get smaller adaptive learning rates and therefore less weight decay, which partially cancels the intended effect of the decay. Following the original concept of weight decay, the decay is applied inside the optimizer step, fully decoupled from the learning rate, and truly acts as weight decay:

$$\begin{aligned} g_t &= \nabla f(w_t) \\ m_t &= \beta_1 \cdot m_{t-1} + (1-\beta_1)\cdot g_t \\ V_t &= \beta_2 \cdot V_{t-1} + (1-\beta_2) \cdot g_t^2 \\ \hat m_t &= \frac{m_t}{1-\beta_1^t} \\ \hat V_t &= \frac{V_t}{1-\beta_2^t} \\ w_{t+1} &= w_t - \eta_t\left[\frac{\alpha}{\sqrt{\hat V_t}+\epsilon}\hat m_t + d_w \cdot w_t\right] \end{aligned}$$

(Note that the gradient $g_t$ here does not include a $d_w \cdot w_t$ term; the decay appears only in the final update, which is what decouples it from the adaptive learning rate.)

AdamW (Adam with decoupled Weight Decay Regularization) is exactly this weight-decay form of Adam. Because Adam's adaptive learning rates are usually small, a relatively large weight-decay coefficient is needed. Like L2 regularization, AdamW pushes parameters toward 0. Is AdamW's weight decay always better than Adam's L2 regularization? Not necessarily, but it is usually no worse.
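A minimal sketch of one AdamW step following the equations above; the schedule multiplier eta_t is often just 1.0, and all names and values here are illustrative.

# one AdamW step at time t; m and v are initialized to zeros
beta1, beta2, eps, d_w, eta_t = 0.9, 0.999, 1e-8, 1e-2, 1.0
m = beta1 * m + (1 - beta1) * dx        # dx does not include the decay term
v = beta2 * v + (1 - beta2) * dx**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
x -= eta_t * (learning_rate * m_hat / (np.sqrt(v_hat) + eps) + d_w * x)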

See the fast.ai write-up on AdamW for an introduction and experiments.

For the Caffe implementation, see here.

LAMB/LARS

In data-parallel distributed training, the global batch size grows with the number of devices when the per-device batch size is fixed. An excessively large batch size, however, causes convergence problems (refer1, refer2):

  • the model ends up losing accuracy;

  • convergence is slower, and more epochs are needed to converge.

LARS and LAMB are two optimization strategies commonly used to address these convergence problems in very-large-batch training.

By applying a layerwise adaptive learning rate, LARS made it possible to train ResNet on ImageNet in only a few minutes.

Using SGD as an example, the idea of a layerwise adaptive learning rate is as follows:

Normalize the gradient (the update) by its l2 norm and then rescale it; this can be viewed as scaling the learning rate, with a scaling factor that is a function of the weights, $\phi\left(\left\|x_t\right\|\right)$:

$$x_{t+1}^{(i)}=x_{t}^{(i)}-\eta_{t} \frac{\phi\left(\left\|x_{t}^{(i)}\right\|\right)}{\left\|g_{t}^{(i)}\right\|} g_{t}^{(i)}$$

In particular, the l2 norm is computed separately for each layer of the multi-layer network.

The motivation: the norm of $g_t$ is sometimes larger than the norm of the parameters $x_t$, which can make the updates unstable, so a direct idea is to let the norm of $x_t$ control the update magnitude of each layer's parameters.

Pros and cons of the normalization: it discards the magnitude of the gradient and keeps only its direction, which biases the gradient updates, but the problem is weaker with large batches. The clear benefit is greater robustness to exploding and vanishing gradients.

Here $x_{t}^{(i)}$ denotes the weights of layer $i$ at step $t$, and $\phi$ is a chosen mapping that keeps the normalized gradient on the same order of magnitude as the weights. The paper gives two choices: (1) $\phi(z)=z$; (2) the min-max-style clipping $\min(\max(z,\gamma_l),\gamma_u)$, where $(\gamma_l,\gamma_u)$ are preset hyperparameters representing the lower and upper bounds of the adjustment.

The pseudocode of the LARS and LAMB algorithms is shown in the figure below:

[Figure: pseudocode of the LARS and LAMB algorithms]

Here the first-order momentum $m_t = \beta_1 m_{t-1} + (1-\beta_1)(g_t + \lambda x_t)$ is computed much like a momentum optimizer with weight decay, and LARS then normalizes and rescales it. The LARS update can also be written in the following form, which adds a local-learning-rate step on top of a momentum-with-weight-decay optimizer and rescales the learning rate of each layer:

$$\begin{aligned} \text{local learning rate:}\quad \alpha &= \eta_t \frac{\beta_{lars} \|x_t\|}{\|g_t\| + d_w \|x_t\|} \\ \text{velocity:}\quad m_t &= \beta_1 \cdot m_{t-1} + \alpha \cdot (g_t+d_w x_t+\epsilon) \\ x_t &= x_t - m_t \end{aligned}$$

LAMB comes from Large Batch Optimization for Deep Learning: Training BERT in 76 minutes (ICLR 2020).

Proposed by the Google Brain team in 2019 to speed up BERT pre-training, it increases the batch size without reducing accuracy and reduces overall wall-clock time by making full use of compute resources such as GPUs and TPUs (without reducing the model weights or the number of optimizer parameters).

The study found that LARS is unstable across different models, such as attention models like BERT, and the authors analyze this from a theoretical standpoint (the appendix contains a large number of formulas).

The LAMB (Layer-wise Adaptive Moments optimizer for Batch training) optimizer improves LARS by building on Adam, aiming to solve the convergence problem of very-large-batch training. It increases the training batch size without losing accuracy and supports adaptive element-wise updating and accurate layer-wise correction.

The LAMB parameter update is:

$$\begin{aligned} m_t &=\beta_1m_{t-1}+(1-\beta_1)g_t \\ v_t &=\beta_2v_{t-1}+(1-\beta_2)g_t^2 \\ \hat{m}_t &=m_t/(1-\beta_1^t) \\ \hat{v}_t &=v_t/(1-\beta_2^t) \\ r_t &=\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \\ x_t &=x_{t-1}-\alpha\cdot\frac{\phi(\|x_{t-1}\|)}{\|r_t+ d_w x_{t-1}\|}(r_t + d_w x_{t-1}) \end{aligned}$$

Compared with LARS, LAMB simply swaps the inner optimizer for Adam, with second-order momentum and bias correction; on top of that inner optimizer it applies the same local-learning-rate logic, rescaling the learning rate of each layer.
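A minimal numpy-style sketch of one LAMB step for a single layer, following the equations above with the simple choice phi(z) = z; the variable names and values are illustrative.

# one LAMB step for the parameters x of one layer; m and v start at zeros
beta1, beta2, eps, d_w = 0.9, 0.999, 1e-6, 1e-2
m = beta1 * m + (1 - beta1) * dx
v = beta2 * v + (1 - beta2) * dx**2
m_hat = m / (1 - beta1**t)
v_hat = v / (1 - beta2**t)
r = m_hat / (np.sqrt(v_hat) + eps)
update = r + d_w * x                    # Adam direction plus decoupled weight decay
trust_ratio = np.linalg.norm(x) / (np.linalg.norm(update) + eps)   # phi(z) = z
x -= learning_rate * trust_ratio * update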

The practical effect of this simple adjustment is significant. With AdamW, model quality degrades sharply once the batch size exceeds 512, but with LAMB the batch size can be pushed to 32K without loss of accuracy. LAMB can also be used as a general-purpose optimizer for training with small batches.


Choosing a Method

A tentative view: adaptive-learning-rate algorithms such as Adam have an advantage on sparse data and converge very fast, but SGD (+Momentum) with well-tuned hyperparameters can often reach a better final result.

The paper Improving Generalization Performance by Switching from Adam to SGD gives a hybrid scheme: train with Adam first, then switch to SGD at an appropriate moment.

Most NLP pre-trained models no longer use SGD, AdaGrad, or RMSProp, but the later AdamW and LAMB.

More and better optimization algorithms can be expected in the future.

Gradient Descent

Take linear regression as an example:

Hypothesis: $h_\theta(x)=\sum_{j=0}^n \theta_j x_j$

$m$ is the total number of samples and $n$ is the total number of parameters $\theta$.

Advantages: finds the global optimum (for a convex cost such as this one) and is easy to parallelize.

Disadvantages: training is slow, and it is hard to add new samples dynamically.

Batch gradient descent

Overall cost: $\text{cost}(\theta)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$

One pass of gradient descent looks like this, iterated many times:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j}\text{cost}(\theta) \qquad \text{for } j=1,\dots,n$$

Stochastic Gradient Descent (SGD)

Loss on a single example: $J(\theta,(x^{(i)},y^{(i)}))=\frac{1}{2}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$

  1. Randomly shuffle the training samples.
  2. Repeat the following gradient descent:

for i = 1, …, m {

  $\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J\left(\theta,(x^{(i)},y^{(i)})\right) \qquad \text{for } j=1,\dots,n$

}

Because only one sample is used at a time, there is an extra inner loop. Pure SGD is rarely used in practice, because optimized vectorized operations are faster than processing samples one by one. People sometimes use the name SGD for mini-batch gradient descent (since the mini-batches are usually chosen at random).

Compared with batch gradient descent, its convergence curve tends to oscillate more.

Advantages: fast training, and new samples can be added.

Disadvantages: it does not find the global optimum and is not easy to parallelize.

Mini‐batch Gradient Descent

Each gradient computation uses only a small portion of the training set; common mini-batch sizes are 32/64/128 (chosen to suit optimized vector operations). The ILSVRC ConvNet, for example, uses 256 samples.

Let $b$ be the number of samples in a mini-batch. Combining the ideas of batch gradient descent and stochastic gradient descent, repeat the following gradient descent (a numpy sketch follows the loop below):

for i = 1, 1+b, 1+2b, …, ~m {

  $\theta_j := \theta_j - \alpha \frac{1}{b}\sum_{k=i}^{i+b-1}\frac{\partial}{\partial\theta_j}\text{cost}\left(\theta,(x^{(k)},y^{(k)})\right) \qquad \text{for } j=1,\dots,n$

}
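A minimal numpy sketch of mini-batch gradient descent for the linear-regression cost above; the data X, y and all hyperparameter values are illustrative.

import numpy as np

def minibatch_gd(X, y, alpha=0.01, b=32, epochs=100):
    # X: (m, n) design matrix; y: (m,) targets; theta: (n,) parameters
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = np.random.permutation(m)      # shuffle once per epoch
        for i in range(0, m, b):
            batch = idx[i:i + b]
            pred = X[batch] @ theta         # h_theta(x) for the mini-batch
            grad = X[batch].T @ (pred - y[batch]) / len(batch)
            theta -= alpha * grad           # theta_j := theta_j - alpha * d cost / d theta_j
    return theta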

Thanks to vectorization, it can be faster than stochastic gradient descent.

Advantage: it reduces the variance of the parameter updates.

Disadvantage: it is strongly affected by the learning rate.

Second-Order Methods

The methods discussed above are all first-order methods; a more powerful approach is based on Newton's method:

$$x \leftarrow x - [H_f(x)]^{-1} \nabla f(x)$$

Newton's method uses the first- and second-derivative information of the function to head straight for a point where the gradient is 0. It guarantees neither that the function value decreases at every iteration nor that it converges to a minimum. In practice a learning rate is still needed, for the same reason as in gradient descent: so that the higher-order terms of the Taylor expansion can be ignored. The learning rate is usually set by a line search.

Newton's method converges faster and needs no learning-rate hyperparameter, but computing the inverse of the Hessian is prohibitively expensive in both time and space. Many quasi-Newton methods therefore compute an approximation of the inverse, such as L-BFGS (whose drawback is that it must be computed over the whole training set; making it work with mini-batches is an active research area).

Newton's Method

The principle of Newton's method: take a second-order Taylor expansion of the objective at the current point, then set the derivative to zero to obtain the next step.

Expanding $f(x)$ to second order around $x_n$:

$$\begin{aligned} f(x) &\approx f(x_n)+f'(x_n)(x-x_n)+\frac{1}{2}f''(x_n)(x-x_n)^2 \\ f(x_n+\Delta x) &\approx f(x_n)+f'(x_n)\Delta x+\frac{1}{2}f''(x_n)\Delta x^2 \end{aligned}$$

The goal is to choose a $\Delta x$ that minimizes the objective $f(x_n+\Delta x)$. Setting the derivative to zero gives:

$$\begin{aligned} \frac{d f(x_n+\Delta x)}{d\Delta x} &= f'(x_n)+f''(x_n)\Delta x = 0 \\ \Longrightarrow \Delta x &= -\frac{f'(x_n)}{f''(x_n)} \\ x_{n+1} &= x_n+\Delta x = x_n-[f''(x_n)]^{-1}f'(x_n) \end{aligned}$$

In the multivariate case, the second-order term $f''(x_n)$ becomes the Hessian matrix.
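A minimal sketch of the one-dimensional iteration derived above; the example function and starting point are illustrative.

def newton_minimize(f_prime, f_double_prime, x0, iters=20):
    # x_{n+1} = x_n - f'(x_n) / f''(x_n)
    x = x0
    for _ in range(iters):
        x = x - f_prime(x) / f_double_prime(x)
    return x

# Example: minimize f(x) = (x - 3)^2, so f'(x) = 2(x - 3) and f''(x) = 2
x_min = newton_minimize(lambda x: 2 * (x - 3), lambda x: 2.0, x0=0.0)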

Several improved algorithms address the shortcomings of Newton's method. They are collectively called quasi-Newton methods; representative examples are BFGS and L-BFGS.

BFGS replaces the inverse Hessian with a low-rank approximation, which greatly speeds up the computation, but it still has to store the history of variables needed to build the approximate inverse at every iteration, so the memory cost is high.

L-BFGS (limited-memory BFGS) improves on BFGS by keeping only the m most recent history entries. The algorithm never stores the approximate inverse Hessian; it obtains the current search direction directly through an iterative procedure, which greatly reduces the memory cost.

Newton's Method vs. Gradient Descent

Newton's method is usually faster than gradient descent and can be regarded as a limiting case of it (the inverse of the Hessian playing part of the role of the learning rate). Because Newton's step size is computed from derivatives, the step shrinks as a saddle point is approached, so Newton's method can easily get trapped at saddle points. SGD's step size is a preset fixed value, which makes it relatively easy to step across some saddle points.

In general, gradient-descent-based optimizers such as RMSProp and Adam are the most widely used in practice, but the improved Newton-type algorithms such as BFGS and L-BFGS have their own strengths and are quite practical.


Equilibrated SGD

It uses second-order gradient information to address the problem that SGD may get stuck at saddle points.

$$\begin{aligned} x_i &= x_i - \frac{\alpha}{\sqrt{D_i^s}} g_i^s \\ D_i^s &= \rho D_i^{s-1} + (1-\rho)(H_d)^2 \end{aligned}$$

$H_d$ denotes the diagonal elements of the Hessian of $L(x)$, with $x$ drawn from a standard normal distribution ($x \sim \mathcal N(0,1)$).

A recommended choice of activation function is to add a small linear term so as to shrink the saturated region, e.g. $f(x)=\tanh(x)+ax$.

Gradient noise can also be injected into the parameter update: $g_i = g_i + \mathcal N(0,\sigma)$.

Parallel and Distributed SGD

  • Hogwild
  • Downpour

The latter (Downpour SGD) contains two key parts: model replicas and a parameter server.

The Gradient-Accumulation Trick

A technique for increasing the effective batch size under limited memory: the gradients of several consecutive mini-batches are accumulated before a single parameter update is applied.

On the one hand this reduces the number of parameter-update steps; on the other it effectively enlarges the batch size, and the accumulated gradient is more stable (extreme values are smoothed).
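A minimal sketch of the idea in the numpy-style pseudocode used earlier; accum_steps, minibatch_gradients, and the zeroing logic are illustrative assumptions.

# accumulate gradients over accum_steps consecutive mini-batches, then update once
accum_steps = 4
grad_sum = np.zeros_like(x)
for step, dx in enumerate(minibatch_gradients, start=1):   # dx: gradient of one mini-batch
    grad_sum += dx
    if step % accum_steps == 0:
        x += - learning_rate * grad_sum / accum_steps       # one update per accum_steps batches
        grad_sum[:] = 0.0                                    # reset the accumulator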

Reference: 用时间换取效果:Keras梯度累积优化器 (Trading time for results: a gradient-accumulation optimizer for Keras).

Summary

[Figure: summary of the deep-learning optimization algorithms]

References

Original: https://www.cnblogs.com/makefile/p/dl-optimizer.html
Author: 康行天下
Title: 参数更新-深度学习优化算法 (Parameter Updates: Deep Learning Optimization Algorithms)
