The optimization method and development of exponentially weighted moving average (EWMA) in deep learning

This article introduces the Momentum, RMSProp, and Adam algorithms used to optimize neural networks. The material comes from the fastai course and Andrew Ng's deep learning course.

1.Momentum algorithm

The Momentum algorithm, also known as the momentum method, directly applies the idea of the exponentially weighted moving average (EWMA): the sizes and directions of the gradients from previous updates strongly influence the size and direction of the current update vector. Because the recent updates reflect the current trend of change well, this both accelerates model fitting and smooths the optimization route.

Expressed as a formula:

v_t = β * v_{t-1} + (1 - β) * g_t

where v_t is the update vector (the "step") of this optimization, v_{t-1} is the step of the previous optimization, g_t is the gradient of this optimization, and β is the smoothing hyperparameter.

As shown in the figure, if the blue arrow is taken as the gradient g_t of this optimization and the green arrow as the resulting step v_t, then with the usual choice β = 0.9 we get v_t = 0.9 * v_{t-1} + 0.1 * g_t, so the accumulated history contributes far more to the step than the newest gradient.

If the update vector is called the "step", then: the step of this optimization = the step of the previous optimization * β + the gradient of this optimization * (1 - β). The weight being optimized is then moved by this step scaled with the learning rate, w ← w - α * v_t.
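To make this concrete, here is a minimal NumPy sketch of the momentum update described above. The quadratic loss, the gradient function grad_fn, the learning rate, and the step count are illustrative assumptions, not values taken from the cited courses.

```python
import numpy as np

def momentum_sgd(w, grad_fn, lr=0.01, beta=0.9, n_steps=100):
    """Gradient descent where the step is an EWMA of the gradients."""
    v = np.zeros_like(w)                  # accumulated "step", initialized to zero
    for t in range(1, n_steps + 1):
        g = grad_fn(w)                    # gradient of this optimization
        v = beta * v + (1 - beta) * g     # step = previous step * beta + gradient * (1 - beta)
        w = w - lr * v                    # move the weights by the smoothed step
    return w

# Illustrative example: minimize f(w) = w0^2 + 10 * w1^2, an elongated bowl
grad_fn = lambda w: np.array([2 * w[0], 20 * w[1]])
w_final = momentum_sgd(np.array([5.0, 5.0]), grad_fn)
```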

However, there are two problems here:

Question one:

From the above formula it can be seen that during the first few optimizations, because the accumulated step is initialized to zero, the step is much smaller than the recent gradients would suggest, so fitting starts off very slowly.

As shown in the figure, the ideal effect is the green curve, but the actual effect is the purple curve. As the number of iterations grows, the two curves gradually overlap, because the current step is affected less and less by the near-zero steps of the first few optimizations.

Therefore, the formula can be rewritten with a bias correction:

v_t ← v_t / (1 - β^t)

In this case, when the number of iterations t is small, the denominator 1 - β^t is much smaller than 1, so the original step is enlarged; when t is large, the denominator approaches 1 and the correction has almost no effect on the step.
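The short sketch below shows the size of this effect on a stream of constant gradients (the constant value 1.0 is just an illustrative assumption): the raw moving average starts far below the true value, while the corrected one matches it from the first step.

```python
beta = 0.9
g = 1.0            # pretend every gradient equals 1.0
v = 0.0            # moving average, initialized to zero
for t in range(1, 11):
    v = beta * v + (1 - beta) * g
    v_corrected = v / (1 - beta ** t)    # bias correction
    print(f"t={t:2d}  raw={v:.3f}  corrected={v_corrected:.3f}")
# raw starts at 0.100 and only slowly climbs toward 1.0,
# while the corrected value is exactly 1.0 at every step.
```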

Question two:

If the moving average keeps growing, the steps themselves blow up, causing the problem of exploding gradients, as shown in the figure below. Moreover, the oscillation of the fitted route has still not been addressed.

To solve this problem, the RMSProp algorithm was born.

2.RMSProp algorithm

RMSProp performs adaptive learning-rate adjustment. It is an improvement built on the same EWMA idea as the momentum algorithm: it keeps an exponentially weighted moving average of the squared gradients, controlled by the same kind of hyperparameter β, and uses it together with the learning rate to restrict the size of the current step.

The formula is:

s_t = β * s_{t-1} + (1 - β) * g_t^2
w ← w - α * g_t / (sqrt(s_t) + ε)

where s_t is the moving average of the squared gradients, α is the learning rate, and ε is a small constant that keeps the denominator away from zero.

In this way, the step of this optimization, which is driven by the gradient and the learning rate, is restrained by the previously accumulated moving average of squared gradients. This is equivalent to combining plain gradient descent with the moving-average idea of the momentum method, so the optimization route fluctuates less and the convergence rate is faster.

As shown in the figure, blue is the route optimized by the momentum algorithm and green is the route optimized by the RMSProp algorithm.
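A minimal NumPy sketch of the RMSProp rule above follows; grad_fn, the learning rate, and the small constant eps (which keeps the denominator away from zero) are illustrative assumptions.

```python
import numpy as np

def rmsprop(w, grad_fn, lr=0.01, beta=0.9, eps=1e-8, n_steps=100):
    """Gradient descent with each step rescaled by an EWMA of squared gradients."""
    s = np.zeros_like(w)                      # moving average of squared gradients
    for _ in range(n_steps):
        g = grad_fn(w)
        s = beta * s + (1 - beta) * g ** 2    # accumulate squared gradients
        w = w - lr * g / (np.sqrt(s) + eps)   # consistently large gradients get damped
    return w

w_final = rmsprop(np.array([5.0, 5.0]), lambda w: np.array([2 * w[0], 20 * w[1]]))
```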

3.Adam algorithm

The Adam algorithm essentially stacks the momentum method and the RMSProp method on top of each other in order to get better results. Compared with RMSProp alone, the influence of the previously accumulated steps on the current step is enlarged again, because the raw gradient in the update is replaced by its moving average.

The formulas are:

m_t = β1 * m_{t-1} + (1 - β1) * g_t
s_t = β2 * s_{t-1} + (1 - β2) * g_t^2
w ← w - α * m_t / (sqrt(s_t) + ε)

where m_t is the moving average of the gradients (the momentum part) and s_t is the moving average of the squared gradients (the RMSProp part), each controlled by its own hyperparameter β1 or β2.

The Adam algorithm is thus a fusion of the momentum method and RMSProp's adaptive rescaling of gradient descent. However, the zero-initialization bias described in Question One of the momentum method still exists, so the same bias correction can be applied to both moving averages: m_t ← m_t / (1 - β1^t) and s_t ← s_t / (1 - β2^t).
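Putting the pieces together, here is a minimal NumPy sketch of Adam with the bias correction from Question One applied to both moving averages. The defaults beta1 = 0.9 and beta2 = 0.999 are the commonly used values, and grad_fn is again an illustrative placeholder.

```python
import numpy as np

def adam(w, grad_fn, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=1000):
    """Momentum (EWMA of gradients) combined with RMSProp (EWMA of squared gradients)."""
    m = np.zeros_like(w)                        # momentum part
    s = np.zeros_like(w)                        # RMSProp part
    for t in range(1, n_steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        s = beta2 * s + (1 - beta2) * g ** 2
        m_hat = m / (1 - beta1 ** t)            # bias correction, as in Question One
        s_hat = s / (1 - beta2 ** t)
        w = w - lr * m_hat / (np.sqrt(s_hat) + eps)
    return w

w_final = adam(np.array([5.0, 5.0]), lambda w: np.array([2 * w[0], 20 * w[1]]))
```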