The Adam optimizer in deep learning: theory, and how the learning rate changes

In a recent experiment I used Adam as the optimizer and printed the learning rate during training, and found that it never changed. Doesn't this contradict the "adaptive learning rate" that Adam is supposed to provide?

Adam: theoretical background

Adam paper: https://arxiv.org/pdf/1412.6980.pdf
[Figure: the Adam algorithm pseudocode from the paper.] The figure gives the detailed procedure for applying Adam to gradient descent in deep learning. With gradient $g_t = \nabla_\theta f_t(\theta_{t-1})$, step size $\alpha$, decay rates $\beta_1, \beta_2 \in [0, 1)$ (defaults 0.9 and 0.999), and a small constant $\varepsilon$, each step $t$ computes:

$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$
$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$
$\hat{m}_t = m_t/(1-\beta_1^t)$, $\hat{v}_t = v_t/(1-\beta_2^t)$
$\theta_t = \theta_{t-1} - \alpha\, \hat{m}_t/(\sqrt{\hat{v}_t}+\varepsilon)$

More detail on Adam's principle can be found at https://blog.csdn.net/sinat_36618660/article/details/100026261.
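As a minimal sketch of the procedure above (my own toy NumPy implementation, not the library code; the real PyTorch source is attached at the end):

import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update, mirroring the formulas above."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)             # bias-corrected m_t
    v_hat = v / (1 - beta2 ** t)             # bias-corrected v_t
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v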

Question 1 What is an exponential moving average?

Exponential Moving Average (EMA): in an exponential moving average, the weighting coefficient of each value decays exponentially with its distance in time, so the closer a value is to the current moment, the larger its weight.
Expanding the update $m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t$ with $m_0 = 0$ gives:

$m_t = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{\,t-i}\, g_i$
Take $m_t$ as an example. From the expansion above, the farther a gradient lies from time $t$, the smaller its weight $(1-\beta_1)\beta_1^{\,t-i}$. So although $m_t$ keeps accumulating the whole gradient history, gradients at different moments contribute differently: the closer a gradient is to time $t$, the more it influences $m_t$; the farther away it is, the less it influences $m_t$.
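A quick numeric check of these weights (a toy sketch, with $\beta_1 = 0.9$ and $t = 5$):

beta1, t = 0.9, 5
# the weight of gradient g_i inside m_t is (1 - beta1) * beta1 ** (t - i)
weights = [(1 - beta1) * beta1 ** (t - i) for i in range(1, t + 1)]
print(weights)  # roughly [0.0656, 0.0729, 0.081, 0.09, 0.1]: the weight grows as i approaches t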

Question 2 Why does $m_t$ need to be corrected?

(1) Intuitive explanation:

When $m_t$ is not corrected (with $\beta_1 = 0.9$):

$m_0 = 0$
$m_1 = \beta_1 m_0 + (1-\beta_1)\, g_1 = 0.1\, g_1$
$m_2 = \beta_1 m_1 + (1-\beta_1)\, g_2 = 0.09\, g_1 + 0.1\, g_2$
$m_3 = \beta_1 m_2 + (1-\beta_1)\, g_3 = 0.081\, g_1 + 0.09\, g_2 + 0.1\, g_3$

and so on. We can see that because $m_0 = 0$, the early values of $m_t$ are biased toward 0 and badly underestimate the scale of the gradients $g_t$.
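This bias is easy to verify numerically; a minimal sketch with a constant gradient $g_t = 1$ (my own toy values):

beta1 = 0.9
m, g = 0.0, 1.0                     # m_0 = 0, constant gradient
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    print(t, m)                     # 0.1, 0.19, 0.271 (up to rounding): far below the true gradient 1.0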

(2) Formal explanation:

From the update formula above, $m_t$ is an estimate of the first moment of the gradient $g_t$, so we compute the expectation of $m_t$:
$\mathbb{E}[m_t] = \mathbb{E}\left[(1-\beta_1)\sum_{i=1}^{t}\beta_1^{\,t-i}\, g_i\right] \approx \mathbb{E}[g_t]\,(1-\beta_1)\sum_{i=1}^{t}\beta_1^{\,t-i} = \mathbb{E}[g_t]\,(1-\beta_1^t)$
It follows that $m_t$ needs to be corrected to $\hat{m}_t = m_t/(1-\beta_1^t)$; only then is it an unbiased estimate of the gradient $g_t$. The same idea explains the correction of $v_t$.
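Continuing the constant-gradient sketch from above, dividing by $1-\beta_1^t$ recovers the true gradient exactly at every step:

beta1 = 0.9
m, g = 0.0, 1.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g
    print(t, m / (1 - beta1 ** t))  # 1.0 at every step (up to rounding): the corrected estimate is unbiased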

Question 3 How does the learning rate change?

Adam's paper points out that the last three lines of the algorithm,

$\hat{m}_t = m_t/(1-\beta_1^t)$
$\hat{v}_t = v_t/(1-\beta_2^t)$
$\theta_t = \theta_{t-1} - \alpha\,\hat{m}_t/(\sqrt{\hat{v}_t}+\varepsilon)$

can equivalently be replaced by:

$\alpha_t = \alpha\,\sqrt{1-\beta_2^t}\,/\,(1-\beta_1^t)$
$\theta_t = \theta_{t-1} - \alpha_t\, m_t/(\sqrt{v_t}+\hat{\varepsilon})$
The PyTorch source code is also written in this way (attached at the end).

Written this way, does it mean the change of the learning rate is determined entirely by $\alpha\sqrt{1-\beta_2^t}/(1-\beta_1^t)$? Can this still be called an adaptive learning rate?
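Before answering, it helps to look at how this factor actually behaves over time (a small sketch with the default $\beta$ values):

import math

alpha, beta1, beta2 = 0.001, 0.9, 0.999
for t in [1, 10, 100, 1000, 10000]:
    step_size = alpha * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    print(t, step_size)
# roughly 3.2e-04, 1.5e-04, 3.1e-04, 8.0e-04, 1.0e-03:
# the factor sqrt(1-beta2^t)/(1-beta1^t) converges to 1, so step_size settles at alpha;
# this factor alone cannot be where the adaptivity comes from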

We know the defining formula of gradient descent:

$\theta_t = \theta_{t-1} - \eta\, g_t$, where $\eta$ is the learning rate and $g_t$ the gradient.
Matching Adam's update against this definition, we can write the parameter update formula as:

$\theta_t = \theta_{t-1} - \frac{\alpha}{\sqrt{\hat{v}_t}+\varepsilon}\,\hat{m}_t$
Here $\hat{m}_t$ is regarded as the first-moment estimate of the gradient $g_t$, so $\frac{\alpha}{\sqrt{\hat{v}_t}+\varepsilon}$ plays the role of the learning rate of parameter $\theta$ at time $t$. This formula also shows that every parameter gets its own learning rate at every step, which is why the adaptive rate is hard to visualize: the lr you print is only the constant $\alpha$, while the actual step size adapts through $\hat{v}_t$.
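A short sketch (my own; it reads the optimizer's internal state, whose layout may differ across PyTorch versions) that prints both the constant lr and the effective per-parameter rate $\alpha/(\sqrt{\hat{v}_t}+\varepsilon)$:

import torch

torch.manual_seed(0)
w = torch.randn(3, requires_grad=True)
opt = torch.optim.Adam([w], lr=1e-3)
for _ in range(3):
    loss = (w ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
    state = opt.state[w]
    v_hat = state['exp_avg_sq'] / (1 - 0.999 ** state['step'])  # bias-corrected v_t (0.999 is the default beta2)
    eff_lr = 1e-3 / (v_hat.sqrt() + 1e-8)                       # effective per-parameter step size
    print(opt.param_groups[0]['lr'], eff_lr)  # 'lr' never changes; eff_lr differs per weight and per step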

Finally, the Adam source code is attached

I found the adam.py file under pytorch1.2/lib/python3.7/site-packages/torch/optim/; here is the code:

def step(self, closure=None):
  loss = None
  if closure is not None:
      loss = closure()

  for group in self.param_groups:
      for p in group['params']:
          if p.grad is None:
              continue
          grad = p.grad.data
          if grad.is_sparse:
              raise RuntimeError('Adam does not support sparse gradients, please consider SparseAdam instead')
          amsgrad = group['amsgrad']

          state = self.state[p]

          # State initialization
          if len(state) == 0:
              state['step'] = 0
              # Exponential moving average of gradient values
              state['exp_avg'] = torch.zeros_like(p.data)
              # Exponential moving average of squared gradient values
              state['exp_avg_sq'] = torch.zeros_like(p.data)
              if amsgrad:
                  # Maintains max of all exp. moving avg. of sq. grad. values
                  state['max_exp_avg_sq'] = torch.zeros_like(p.data)

          exp_avg, exp_avg_sq = state['exp_avg'], state['exp_avg_sq']
          if amsgrad:
              max_exp_avg_sq = state['max_exp_avg_sq']
          beta1, beta2 = group['betas']

          state['step'] += 1

          if group['weight_decay'] != 0:
              grad.add_(group['weight_decay'], p.data)

          # Decay the first and second moment running average coefficient
          exp_avg.mul_(beta1).add_(1 - beta1, grad)
          exp_avg_sq.mul_(beta2).addcmul_(1 - beta2, grad, grad)
          if amsgrad:
              # Maintains the maximum of all 2nd moment running avg. till now
              torch.max(max_exp_avg_sq, exp_avg_sq, out=max_exp_avg_sq)
              # Use the max. for normalizing running avg. of gradient
              denom = max_exp_avg_sq.sqrt().add_(group['eps'])
          else:
              denom = exp_avg_sq.sqrt().add_(group['eps'])

          bias_correction1 = 1 - beta1 ** state['step']
          bias_correction2 = 1 - beta2 ** state['step']

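          # step_size is alpha * sqrt(1 - beta2^t) / (1 - beta1^t): the bias corrections
          # are folded into the step size instead of into exp_avg / exp_avg_sq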
          step_size = group['lr'] * math.sqrt(bias_correction2) / bias_correction1

          p.data.addcdiv_(-step_size, exp_avg, denom)

  return loss
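Note that step_size in the code is exactly $\alpha\sqrt{1-\beta_2^t}/(1-\beta_1^t)$ from the equivalent form above, and group['lr'] (the value printed during training) never changes; all the per-parameter adaptivity comes from the division by denom. One caveat: the add_(value, tensor) and addcmul_(value, t1, t2) call signatures here are from PyTorch 1.2; newer releases spell them add_(tensor, alpha=value) and addcmul_(t1, t2, value=value).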

The above is my personal understanding. If you have better ideas, please leave a message so we can learn and improve together!

Origin blog.csdn.net/qq_44846512/article/details/112466609