Nesterov Accelerated Gradient Descent (NAG)
1. Introduction
- Nesterov Accelerated Gradient (NAG) is an optimization algorithm that improves on classical momentum-based gradient descent by incorporating a lookahead gradient computation.
- Named after Yurii Nesterov, it offers theoretically faster convergence rates in convex optimization and is widely used when training deep neural networks.
2. Motivation and Intuition
- Classical momentum updates parameters based on accumulated past gradients, smoothing oscillations and accelerating descent.
- However, traditional momentum evaluates the gradient at the current parameters.
- NAG first makes a tentative step in the direction of the previous momentum, then calculates the gradient at this lookahead position, gaining more accurate information about the upcoming gradient landscape.
- This leads to more informed and effective parameter updates.
3. Algorithm Description
Given:
- Parameters $\theta$,
- Learning rate $\eta$,
- Momentum coefficient $\mu$,
- Gradient of the loss w.r.t. the parameters $\nabla_\theta L(\theta)$,
The update steps in NAG are:
- Compute the lookahead position: $\tilde{\theta}_t = \theta_t + \mu v_t$
- Calculate the gradient at the lookahead: $g_t = \nabla_\theta L(\tilde{\theta}_t)$
- Update the velocity: $v_{t+1} = \mu v_t - \eta g_t$
- Update the parameters: $\theta_{t+1} = \theta_t + v_{t+1}$
Where:
- $v_t$ is the velocity (momentum term) at iteration $t$ (see the code sketch below).
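A minimal NumPy sketch of these update steps is shown below; the quadratic example loss, its gradient, and the hyperparameter values are illustrative assumptions rather than part of the algorithm itself.

```python
import numpy as np

def nag_update(theta, v, grad_fn, lr=0.1, mu=0.9):
    """One NAG step: gradient at the lookahead point, then velocity and parameter updates."""
    theta_lookahead = theta + mu * v      # lookahead position
    g = grad_fn(theta_lookahead)          # gradient at lookahead
    v = mu * v - lr * g                   # velocity update
    theta = theta + v                     # parameter update
    return theta, v

# Illustrative example: quadratic loss L(theta) = 0.5 * ||theta||^2, so grad L(theta) = theta.
grad_fn = lambda theta: theta
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(50):
    theta, v = nag_update(theta, v, grad_fn)
print(theta)  # approaches the minimizer [0, 0]
```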
4. Comparison with Classical Momentum
| Feature | Classical Momentum | Nesterov Accelerated Gradient | 
|---|---|---|
| Gradient Evaluation | At the current parameters $\theta_t$ | At the lookahead position $\theta_t + \mu v_t$ | 
| Velocity Update | Velocity accumulates the gradient at the current position | Velocity anticipates the upcoming slope by using the lookahead gradient | 
| Empirical Results | Good acceleration, may overshoot or oscillate | Often faster convergence and better stability | 
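To make the first two rows of the table concrete, the sketch below contrasts the two update rules in the notation of Section 3; `grad_fn`, `lr`, and `mu` are placeholders for the gradient function and hyperparameters, and the only line that differs is where the gradient is evaluated.

```python
def classical_momentum_step(theta, v, grad_fn, lr, mu):
    g = grad_fn(theta)             # gradient at the current parameters
    v = mu * v - lr * g
    return theta + v, v

def nag_step(theta, v, grad_fn, lr, mu):
    g = grad_fn(theta + mu * v)    # gradient at the lookahead position
    v = mu * v - lr * g
    return theta + v, v
```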
5. Advantages
- Faster Convergence: Incorporates future gradient information, improving the quality of each step.
- Less Overshooting: Reduces oscillation around minima by correcting the velocity direction proactively.
- Widely Used: Nesterov-style updates appear in variants of popular optimizers, such as Nadam (Adam with Nesterov momentum).
6. Practical Implementation Notes
- The momentum coefficient $\mu$ is typically set between 0.9 and 0.99.
- The learning rate $\eta$ often starts higher and is decayed over training.
- Compatible with batch and mini-batch gradient formulations (see the usage sketch below).
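In most frameworks NAG is enabled through an existing optimizer rather than implemented by hand; for example, PyTorch's SGD accepts a `nesterov` flag. The model, data, and decay schedule below are illustrative assumptions.

```python
import torch

model = torch.nn.Linear(10, 1)  # placeholder model; any torch.nn.Module works

# SGD with Nesterov momentum (momentum must be > 0 when nesterov=True).
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, nesterov=True)

# Decay the learning rate over training (illustrative schedule).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(100):
    x, y = torch.randn(32, 10), torch.randn(32, 1)  # dummy mini-batch
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```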
7. Applications in Deep Learning
- Effective for training deep networks such as CNNs and RNNs.
- Enhances learning dynamics, especially in deeper or more complex architectures.
8. Summary Table
| Step | Operation | Formula | 
|---|---|---|
| Lookahead Position | Advance parameters using previous velocity | $\tilde{\theta}_t = \theta_t + \mu v_t$ | 
| Gradient Calculation | Compute gradient at lookahead | $g_t = \nabla_\theta L(\tilde{\theta}_t)$ | 
| Velocity Update | Update momentum velocity | $v_{t+1} = \mu v_t - \eta g_t$ | 
| Parameter Update | Move parameters by new velocity | $\theta_{t+1} = \theta_t + v_{t+1}$ | 