Nesterov Accelerated Gradient Descent (NAG)

1. Introduction

  • Nesterov Accelerated Gradient (NAG) is an optimization algorithm that improves on classical momentum-based gradient descent by incorporating a lookahead gradient computation.

  • Named after Yurii Nesterov, it achieves a provably faster convergence rate than plain gradient descent on smooth convex problems ($O(1/t^2)$ versus $O(1/t)$) and is widely used for training deep neural networks.


2. Motivation and Intuition

  • Classical momentum updates parameters based on accumulated past gradients, smoothing oscillations and accelerating descent.

  • However, traditional momentum evaluates the gradient at the current parameters.

  • NAG first makes a tentative step in the direction of the previous momentum, then calculates the gradient at this lookahead position, gaining more accurate information about the upcoming gradient landscape.

  • This leads to more informed and effective parameter updates.


3. Algorithm Description

Given:

  • Parameters $\theta$,

  • Learning rate $\alpha$,

  • Momentum coefficient $\mu \in [0,1)$,

  • Gradient of the loss w.r.t. the parameters, $\nabla J(\theta)$,

The update steps in NAG are:

  1. Compute the lookahead position:

$\theta_{\text{lookahead}} = \theta_t - \mu v_t$

  2. Calculate the gradient at the lookahead:

$g_t = \nabla J(\theta_{\text{lookahead}})$

  3. Update the velocity:

$v_{t+1} = \mu v_t + \alpha g_t$

  4. Update the parameters:

$\theta_{t+1} = \theta_t - v_{t+1}$

Where:

  • $v_t$ is the velocity (momentum term) at iteration $t$, typically initialized to $v_0 = 0$.
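The four steps map directly to a few lines of code. Below is a minimal NumPy sketch of a single NAG step under the formulation above; the function and variable names (`nag_step`, `grad_fn`) are illustrative, not taken from any particular library.

```python
import numpy as np

def nag_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """One Nesterov accelerated gradient step.

    theta   : current parameters (np.ndarray)
    v       : current velocity, same shape as theta
    grad_fn : callable returning the gradient of the loss at given parameters
    lr      : learning rate (alpha)
    mu      : momentum coefficient
    """
    # 1. Lookahead position: step ahead along the previous velocity.
    theta_lookahead = theta - mu * v
    # 2. Gradient is evaluated at the lookahead point, not at theta.
    g = grad_fn(theta_lookahead)
    # 3. Velocity update combines the old velocity and the lookahead gradient.
    v_new = mu * v + lr * g
    # 4. Parameter update moves by the new velocity.
    theta_new = theta - v_new
    return theta_new, v_new


# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, v = nag_step(theta, v, grad_fn=lambda th: th, lr=0.1, mu=0.9)
print(theta)  # approaches [0, 0]
```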


4. Comparison with Classical Momentum

| Feature | Classical Momentum | Nesterov Accelerated Gradient |
| --- | --- | --- |
| Gradient evaluation | Gradient at current parameters $\theta_t$ | Gradient at lookahead $\theta_t - \mu v_t$ |
| Update direction | Momentum term updated directly with the gradient at the current step | More informed update that anticipates the upcoming gradient slope |
| Empirical results | Good acceleration, but may overshoot or oscillate | Often faster convergence and better stability |
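For reference, classical (heavy-ball) momentum in the same notation evaluates the gradient at the current parameters $\theta_t$ rather than at the lookahead point:

$v_{t+1} = \mu v_t + \alpha \nabla J(\theta_t), \qquad \theta_{t+1} = \theta_t - v_{t+1}$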

5. Advantages

  • Faster Convergence: Incorporates future gradient information, improving the quality of each step (see the toy comparison after this list).

  • Less Overshooting: Reduces oscillation around minima by correcting velocity direction proactively.

  • Widely Used: Nesterov momentum is available as an option in standard deep learning frameworks, and variants such as Nadam combine it with Adam.
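As a small illustration of these claims, the sketch below runs classical momentum and NAG side by side on a toy one-dimensional quadratic. This is only a hedged demonstration on a contrived problem, not evidence about deep networks; the exact numbers depend on the chosen learning rate and momentum.

```python
def minimize(nesterov, steps=50, lr=0.1, mu=0.9):
    """Minimize f(x) = 0.5 * x**2 (gradient is x) with momentum."""
    x, v = 5.0, 0.0
    for _ in range(steps):
        # NAG evaluates the gradient at the lookahead point; classical
        # momentum evaluates it at the current iterate.
        g = (x - mu * v) if nesterov else x
        v = mu * v + lr * g
        x = x - v
    return x

print("classical momentum:", minimize(nesterov=False))
print("nesterov momentum: ", minimize(nesterov=True))
# For this quadratic with these settings, NAG's error envelope shrinks by
# about 0.90 per step versus roughly 0.95 for classical momentum.
```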


6. Practical Implementation Notes

  • The momentum coefficient $\mu$ is typically set between 0.9 and 0.99.

  • The learning rate $\alpha$ often starts relatively high and is decayed over the course of training.

  • NAG is compatible with batch and mini-batch gradient formulations; a framework-level usage sketch follows.
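As a concrete example, most deep learning frameworks expose Nesterov momentum as a flag on their SGD optimizer. The sketch below assumes PyTorch; the model, loss, and data are placeholders standing in for a real network and data loader, and PyTorch applies a commonly used reformulation of the Nesterov update rather than the literal lookahead step written above.

```python
import torch
import torch.nn as nn

# Placeholder model and loss; substitute your own network and data loader.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

# SGD with Nesterov momentum: momentum must be > 0 when nesterov=True.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, nesterov=True)
# Optional: decay the learning rate over training, as noted above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # Dummy mini-batch standing in for a real data loader.
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```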


7. Applications in Deep Learning

  • Effective for training deep networks like CNNs and RNNs.

  • Enhances learning dynamics, especially in deeper or more complex architectures.


8. Summary Table

| Step | Operation | Formula |
| --- | --- | --- |
| Lookahead position | Advance parameters using the previous velocity | $\theta_{\text{lookahead}} = \theta_t - \mu v_t$ |
| Gradient calculation | Compute the gradient at the lookahead | $g_t = \nabla J(\theta_{\text{lookahead}})$ |
| Velocity update | Update the momentum velocity | $v_{t+1} = \mu v_t + \alpha g_t$ |
| Parameter update | Move parameters by the new velocity | $\theta_{t+1} = \theta_t - v_{t+1}$ |