Nesterov Accelerated Gradient Descent (NAG)
1. Introduction
- Nesterov Accelerated Gradient (NAG) is an optimization algorithm that improves on classical momentum-based gradient descent by incorporating a lookahead gradient computation.
- Named after Yurii Nesterov, it offers theoretically faster convergence rates in convex optimization and is widely used in training deep neural networks for better performance.
2. Motivation and Intuition
- Classical momentum updates parameters based on accumulated past gradients, smoothing oscillations and accelerating descent.
- However, traditional momentum evaluates the gradient at the current parameters, before the momentum step has been applied.
- NAG first makes a tentative step in the direction of the previous momentum, then calculates the gradient at this lookahead position, gaining more accurate information about the upcoming gradient landscape.
- This leads to more informed and effective parameter updates.
3. Algorithm Description
Given:
- Parameters $\theta_t$,
- Learning rate $\eta$,
- Momentum coefficient $\mu$,
- Gradient of the loss w.r.t. the parameters, $\nabla L(\theta)$.
The update steps in NAG are:
- Compute lookahead position: $\tilde{\theta}_t = \theta_t + \mu v_t$
- Calculate gradient at lookahead: $g_t = \nabla L(\tilde{\theta}_t)$
- Update velocity: $v_{t+1} = \mu v_t - \eta g_t$
- Update parameters: $\theta_{t+1} = \theta_t + v_{t+1}$
Where:
- $v_t$ is the velocity (momentum term) at iteration $t$.
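To make the steps concrete, here is a minimal NumPy sketch of a single NAG update following the formulas above; the function name `nag_step` and its arguments are illustrative, not taken from any particular library.

```python
import numpy as np

def nag_step(theta, velocity, grad_fn, lr=0.01, momentum=0.9):
    """Perform one Nesterov accelerated gradient step.

    theta    : current parameters (np.ndarray)
    velocity : current velocity (np.ndarray)
    grad_fn  : callable returning the gradient of the loss at a given point
    lr       : learning rate (eta)
    momentum : momentum coefficient (mu)
    """
    lookahead = theta + momentum * velocity      # tentative step along the previous velocity
    grad = grad_fn(lookahead)                    # gradient evaluated at the lookahead point
    velocity = momentum * velocity - lr * grad   # v_{t+1} = mu * v_t - eta * g_t
    theta = theta + velocity                     # theta_{t+1} = theta_t + v_{t+1}
    return theta, velocity
```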
4. Comparison with Classical Momentum
| Feature | Classical Momentum | Nesterov Accelerated Gradient |
|---|---|---|
| Gradient Evaluation | Gradient at current parameters | Gradient at lookahead position $\theta_t + \mu v_t$ |
| Update Direction | Momentum term updated directly with gradient at current step | More informed update, anticipates future gradient slope |
| Empirical Results | Good acceleration, may overshoot or oscillate | Often faster convergence and better stability |
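The difference in the first row is easiest to see in code. The following sketch contrasts the two velocity updates side by side; both helper functions are hypothetical and use the notation from Section 3.

```python
def classical_momentum_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    g = grad_fn(theta)           # gradient at the CURRENT parameters
    v = mu * v - lr * g
    return theta + v, v

def nesterov_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    g = grad_fn(theta + mu * v)  # gradient at the LOOKAHEAD point
    v = mu * v - lr * g
    return theta + v, v
```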
5. Advantages
- Faster Convergence: Incorporates future gradient information, improving step quality.
- Less Overshooting: Reduces oscillation around minima by correcting the velocity direction proactively.
- Widely Used: Nesterov momentum is a standard option in deep learning frameworks, and optimizers such as Nadam combine Adam with NAG principles.
6. Practical Implementation Notes
- The momentum coefficient $\mu$ is typically set between 0.9 and 0.99.
- The learning rate $\eta$ often starts higher and is decayed over training.
- NAG is compatible with both full-batch and mini-batch gradient formulations.
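As a usage sketch under these settings, PyTorch's built-in SGD optimizer exposes Nesterov momentum through a flag; the model, data, and exact hyperparameter values below are placeholders chosen to match the notes above.

```python
import torch

# Placeholder model and synthetic mini-batch; in practice these come from your task.
model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)  # decay lr over training

for epoch in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # stand-in for a mini-batch loader
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```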
7. Applications in Deep Learning
- Effective for training deep networks such as CNNs and RNNs.
- Enhances learning dynamics, especially in deeper or more complex architectures.
8. Summary Table
| Step | Operation | Formula |
|---|---|---|
| Lookahead Position | Advance parameters using previous velocity | $\tilde{\theta}_t = \theta_t + \mu v_t$ |
| Gradient Calculation | Compute gradient at lookahead | $g_t = \nabla L(\tilde{\theta}_t)$ |
| Velocity Update | Update momentum velocity | $v_{t+1} = \mu v_t - \eta g_t$ |
| Parameter Update | Move parameters by new velocity | $\theta_{t+1} = \theta_t + v_{t+1}$ |
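As a closing illustration, the four steps in the table can be run end to end on a toy quadratic loss; the loss, starting point, and hyperparameters here are arbitrary choices for demonstration.

```python
import numpy as np

# Minimize f(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
grad_fn = lambda theta: theta
lr, mu = 0.1, 0.9

theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)
for t in range(100):
    lookahead = theta + mu * v   # lookahead position
    g = grad_fn(lookahead)       # gradient at lookahead
    v = mu * v - lr * g          # velocity update
    theta = theta + v            # parameter update

print(theta)  # converges toward the minimizer at the origin
```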