This article gives a comprehensive and systematic analysis of gradient descent optimization algorithms, to help algorithm developers choose an appropriate one during model development. The material is split into several chapters. This chapter introduces the problems that arise in model training and the two directions in which advanced optimizers improve on plain gradient descent: momentum and adaptive learning rate.




Gradient descent updates the model parameters in the direction opposite to the gradient. Geometrically, it follows the slope of the surface defined by the objective function downhill, along the steepest direction, until it reaches a valley. A well-chosen step size speeds up and stabilizes convergence and helps train a model that generalizes better. The basic idea can also be pictured like this: starting from some point on a mountain, we find the steepest downhill direction (the direction opposite to the gradient), take a step, find the steepest direction again, take another step, and keep going until we reach the lowest point (the point where the cost function converges to its minimum). Researchers have studied this algorithm extensively and in depth.


  • Cost function of the model: $J(\theta)$
  • Parameters of the model: $\theta \in \mathbb{R}^d$
  • Gradient with respect to the parameters: $\nabla_{\theta} J(\theta)$
  • Learning rate: $\eta$
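
With this notation, the basic update rule is $\theta \leftarrow \theta - \eta \nabla_{\theta} J(\theta)$. The following is a minimal sketch of that rule in Python. The quadratic cost `J(theta) = (theta - 3)^2`, the starting point, and the function names are illustrative assumptions and do not come from the original article.

```python
def gradient_descent(grad, theta0, lr, steps):
    """Plain gradient descent: theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta


# Hypothetical example cost: J(theta) = (theta - 3)^2, minimized at theta = 3.
def example_grad(theta):
    return 2.0 * (theta - 3.0)


theta_final = gradient_descent(example_grad, theta0=0.0, lr=0.1, steps=100)
print(theta_final)  # close to 3.0
```

Each step moves against the gradient by an amount proportional to `lr`, which plays the role of $\eta$.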


2 Challenges in model training



For a concrete look at the problems that may arise during gradient updates in model training, see the following figure.




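The figure above is a schematic of gradient updates for a regression model. It shows several key points:

  • The Plateau: a stagnation point where the gradient is close to 0, so simple gradient updates are very slow.
  • Saddle Point: a point where the gradient equals 0, so simple gradient updates essentially cannot escape.
  • Local Minima: a locally optimal point, as opposed to the global optimum, so simple gradient updates can get stuck there.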



Let's analyze these three cases carefully to see whether there is a way to escape toward the global optimum while keeping convergence fast.

2.1 Local Minima



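First, whether global or local, we call either one an optimum (a valley floor), and it is defined with respect to all parameters: every parameter has to be at its optimum at the same time. Suppose the model has N parameters and each one is at its optimum (valley floor) with probability P. Treating the parameters as independent, the probability that all of them are at an optimum together is $P^N$. Deep learning models used in industry today typically have more than a million parameters (SOTA models easily reach hundreds of billions or trillions), so this probability is extremely small. When an optimum does appear, it is therefore judged to be, in all likelihood, not a Local Minima but the Global Minima.

As a result, almost nobody in industry goes out of their way to solve the so-called "Local Minima" problem; an optimum (valley floor) is assumed to be global by default. "I think, therefore I am"; if I do not think about it, it is not there! In practice, when a model does not converge, people change the hyperparameters or the network and retrain, relying on an advanced optimizer and its settings to change the gradient path. As noted above, the probability that all parameters are at an optimum together is very small, so hitting such a point twice in a row is very unlikely, and retraining will therefore largely improve convergence speed and the model's generalization ability.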
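
To get a feel for how small $P^N$ becomes, here is a short back-of-the-envelope calculation. The values `p = 0.5` and `n = 1_000_000` are hypothetical, and the calculation assumes the parameters are independent, which is a simplification made by the heuristic argument rather than an established property of real networks.

```python
import math


def log10_joint_probability(p, n):
    """Return log10(p ** n), computed in log space to avoid underflow."""
    return n * math.log10(p)


# Hypothetical numbers: each parameter is at its optimum with probability 0.5,
# and the model has one million parameters.
exponent = log10_joint_probability(0.5, 1_000_000)
print(f"P^N is about 10^{exponent:.0f}")
```

Working in log space matters here because `0.5 ** 1_000_000` underflows to `0.0` in floating point.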



However, dealing with local optima is exactly what advanced optimizers focus on: they introduce momentum and an adaptive learning rate. Many people are unclear on this, so read on.

2.2 The Plateau & Saddle Point




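I treat the plateau and the saddle point together, mainly because they have a lot in common.

The gradient descent figure above actually shows the case of a single parameter, whereas a Local Minima is the case where all parameters reach the valley floor together. The Plateau and the Saddle Point, by contrast, are properties of each individual parameter, so they are quite different.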



When the gradient updates of some parameters are very small or zero, ordinary gradient descent is powerless: because it relies only on the current gradient, it can hardly move away from such a point, so the model fails to converge and an accurate model cannot be trained.
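
The stall on a plateau is easy to reproduce. The sketch below assumes an illustrative one-parameter cost `J(theta) = 1 / (1 + exp(theta))`, which is almost flat for very negative `theta`; the cost, the starting point, and the step count are assumptions chosen for the demonstration.

```python
import math


def plateau_grad(theta):
    """Gradient of the illustrative cost J(theta) = 1 / (1 + exp(theta))."""
    s = 1.0 / (1.0 + math.exp(-theta))
    return -s * (1.0 - s)


theta = -10.0          # start far out on the flat part of the curve
for _ in range(100):   # 100 plain gradient descent steps with step size 0.1
    theta -= 0.1 * plateau_grad(theta)

distance_moved = theta - (-10.0)
print(distance_moved)  # tiny: the gradient here is on the order of 1e-5
```

After 100 updates the parameter has barely moved, because each step is proportional to a gradient that is nearly zero.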



Next, let’s see how to solve the problem.

2.3 Solutions




Given the situation above, is there a way to solve it? In fact, no discipline stands alone. We can borrow and integrate many principles and laws from nature, such as cycling down a steep slope.




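To sum up, compared with cycling down a steep slope, our simple gradient descent is missing two things:

  • Forward force (inertia): the speed of the bike going downhill depends not only on the current push but also on the acceleration accumulated earlier;
  • Restraining force (brake): going downhill, the brake is used to adjust how quickly the bike accelerates.

In gradient descent these two elements go by two new names: Momentum (also called impulse) and a dynamic, adaptive Learning Rate. Both are explained below.

3 Momentum (Impulse)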

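Definition of momentum: in classical mechanics, the impulse of the net external force on an object equals the change in its momentum (final momentum minus initial momentum); this is the momentum theorem. Unlike momentum, which is a state quantity, impulse is a process quantity. In DL it is defined as follows: in gradient descent, the inertia of past parameter gradient updates is accumulated and thereby influences the current gradient update.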



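  1. Point 1: the red gradient descent arrow points to the right. Assume this is the starting point, so there is no momentum (inertia) yet. The actual movement therefore follows the gradient descent direction: to the right.
  2. Point 2: the red gradient descent arrow points to the right (a very small gradient), and the purple momentum arrow also points to the right, so the movement is to the right.
  3. Point 3: this is the saddle point, where the gradient is 0, but the purple momentum arrow still points to the right, so the movement is to the right.
  4. Point 4: this is the local optimum. Here the red gradient descent arrow points to the left. If the momentum is larger than the gradient term, the update keeps moving uphill, and if it is large enough it can jump out of the local minima.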
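
The behavior at points 1 to 3 can be sketched in a few lines. The example assumes the common momentum form $v \leftarrow \gamma v + \eta \nabla_{\theta} J(\theta)$, $\theta \leftarrow \theta - v$, and an illustrative cost `J(theta) = theta**4 / 4 - theta**3 / 3` whose gradient vanishes at a flat point (`theta = 0`) before the minimum (`theta = 1`). The cost, the coefficient `gamma = 0.9`, and the function names are assumptions for the demonstration.

```python
def flat_point_grad(theta):
    """Gradient of the illustrative cost J(theta) = theta**4 / 4 - theta**3 / 3.

    It is zero at theta = 0 (a flat inflection point) and at theta = 1 (the minimum).
    """
    return theta ** 2 * (theta - 1.0)


def descend_with_momentum(theta, lr, gamma, steps):
    """Gradient descent with momentum; gamma = 0 gives plain gradient descent."""
    velocity = 0.0
    for _ in range(steps):
        # Accumulate a decaying sum of past updates, then apply it.
        velocity = gamma * velocity + lr * flat_point_grad(theta)
        theta = theta - velocity
    return theta


theta_plain = descend_with_momentum(-1.0, lr=0.1, gamma=0.0, steps=500)
theta_momentum = descend_with_momentum(-1.0, lr=0.1, gamma=0.9, steps=500)
print(theta_plain, theta_momentum)
```

In this setup the plain run is still short of the flat point at `theta = 0` after 500 steps, while the run with momentum carries past it and settles near the minimum at `theta = 1`.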

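From the description above, you can see that momentum plays an important role at many key points in gradient descent. Applied well, it can greatly speed up convergence and improve the model's generalization performance. Next, we turn to the other key topic, the Learning Rate.

4 Learning Rate

Definition of Learning Rate: it represents the magnitude of each parameter update. If the learning rate is too large, the parameters being optimized oscillate around the minimum.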




In the left panel, the black curve is the loss function; assume we start from the highest point on the left.

  • If the learning rate is tuned just right (red line), the lowest point is found successfully.
  • If the learning rate is too small (blue line), progress is too slow; given enough time it would reach the lowest point, but in practice you may not be able to wait for the result.
  • If the learning rate is somewhat too large (green line), the updates oscillate across the valley, make no progress, and never reach the lowest point.
  • If the learning rate is very large (yellow line), the updates fly straight out, and the loss grows larger with every update.



A practical remedy is the right-hand panel of the figure above, which visualizes how the loss changes as the parameters are updated. Plotting the loss against the parameters themselves is intuitive, but it only works when there are one or two parameters; higher-dimensional cases cannot be visualized that way. (You could visualize a subset of the parameters, but this is hard to do, and because the mix of samples varies during training, many such schemes remain theoretical.)

  • If the learning rate is too small (blue line), the loss decreases very slowly.
  • If the learning rate is too high (green line), the loss drops quickly but then gets stuck.
  • If the learning rate is very high (yellow line), the loss blows up.
  • The red line is just right and gives a good result.
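
The four regimes can be reproduced on a toy problem. The sketch below assumes the illustrative cost `J(theta) = theta**2` and four hypothetical learning rates picked to land in each regime; on other loss functions the thresholds will differ.

```python
def final_theta(lr, steps=20, theta=1.0):
    """Run gradient descent on the illustrative cost J(theta) = theta**2."""
    for _ in range(steps):
        theta = theta - lr * 2.0 * theta  # gradient of theta**2 is 2 * theta
    return theta


# Hypothetical learning rates chosen to reproduce the four regimes.
results = {lr: final_theta(lr) for lr in (0.01, 0.3, 1.0, 1.1)}
for lr, theta in results.items():
    print(f"lr={lr}: theta={theta:.4g}, loss={theta ** 2:.4g}")
```

With `lr=0.01` the parameter is still far from 0 after 20 steps, with `lr=0.3` it converges, with `lr=1.0` it bounces between 1 and -1 without improving, and with `lr=1.1` the loss grows at every step.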



Tuning the learning rate by hand like this every time is unrealistic when there are so many parameters, so industrial use calls for an adaptive learning rate, possibly at the granularity of individual parameters, and of the same parameter at different stages of training.

5 Advanced usage of gradient descent
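
As one illustration of what a per-parameter learning rate can look like, the sketch below uses an Adagrad-style rule, which divides each parameter's step by the square root of its accumulated squared gradients. It is meant only as an example of the idea, not as the method this article recommends; the two-parameter cost, `lr`, and `eps` are assumptions.

```python
import math


def adagrad_step(theta, grads, accum, lr=0.1, eps=1e-8):
    """One Adagrad-style update with a separate effective step per parameter."""
    new_theta, new_accum = [], []
    for t, g, a in zip(theta, grads, accum):
        a = a + g * g                      # running sum of squared gradients
        new_accum.append(a)
        new_theta.append(t - lr * g / (math.sqrt(a) + eps))
    return new_theta, new_accum


# Illustrative cost: J = 0.5 * (100 * theta0**2 + theta1**2),
# so the two gradients differ by a factor of 100 at theta = (1, 1).
theta = [1.0, 1.0]
grads = [100.0 * theta[0], theta[1]]
theta, accum = adagrad_step(theta, grads, accum=[0.0, 0.0])
print(theta)  # both parameters move by about lr on the first step
```

Although one gradient is 100 times larger than the other, both parameters take a step of similar size, because each is scaled by its own gradient history.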



From the description above and current mainstream industry practice, we can see that advanced optimizers adapt momentum and learning rate automatically (the result may not be optimal, but it is a great improvement over the primitive method), possibly at the granularity of individual parameters and of different training stages. These include:

  • NAG
  • Adagrad
  • Adadelta
  • RMSprop
  • Adam
  • AdaMax
  • Nadam, and others



These optimizers are described further in subsequent chapters. This is the end of this chapter.

6 Extras



Business level: it gave the business a good start, created a new growth point, and produced significant economic benefits.


