Nesterov Accelerated Gradient Descent (NAG)

1. Introduction

  • Nesterov Accelerated Gradient (NAG) is an optimization algorithm that improves on classical momentum-based gradient descent by incorporating a lookahead gradient computation.

  • Named after Yurii Nesterov, it achieves a provably faster convergence rate than plain gradient descent on smooth convex problems ($O(1/t^2)$ versus $O(1/t)$) and is widely used for training deep neural networks.


2. Motivation and Intuition

  • Classical momentum updates parameters based on accumulated past gradients, smoothing oscillations and accelerating descent.

  • However, traditional momentum evaluates the gradient at the current parameters.

  • NAG first makes a tentative step in the direction of the previous momentum, then calculates the gradient at this lookahead position, gaining more accurate information about the upcoming gradient landscape.

  • This leads to more informed and effective parameter updates.


3. Algorithm Description

Given:

  • Parameters $\theta$,

  • Learning rate $\alpha$,

  • Momentum coefficient $\mu \in [0,1)$,

  • Gradient of the loss w.r.t. the parameters, $\nabla J(\theta)$,

The update steps in NAG are:

  1. Compute the lookahead position:

$\theta_{\text{lookahead}} = \theta_t - \mu v_t$

  2. Calculate the gradient at the lookahead:

$g_t = \nabla J(\theta_{\text{lookahead}})$

  3. Update the velocity:

$v_{t+1} = \mu v_t + \alpha g_t$

  4. Update the parameters:

$\theta_{t+1} = \theta_t - v_{t+1}$

Where:

  • $v_t$ is the velocity (momentum term) at iteration $t$, typically initialized to $v_0 = 0$.
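The four steps map directly to a few lines of code. Below is a minimal NumPy sketch of a single NAG step under the formulation above; the function and variable names (`nag_step`, `grad_fn`) are illustrative, not taken from any particular library.

```python
import numpy as np

def nag_step(theta, v, grad_fn, lr=0.01, mu=0.9):
    """One Nesterov accelerated gradient step.

    theta   : current parameters (np.ndarray)
    v       : current velocity, same shape as theta
    grad_fn : callable returning the gradient of the loss at given parameters
    lr      : learning rate (alpha)
    mu      : momentum coefficient
    """
    # 1. Lookahead position: step ahead along the previous velocity.
    theta_lookahead = theta - mu * v
    # 2. Gradient is evaluated at the lookahead point, not at theta.
    g = grad_fn(theta_lookahead)
    # 3. Velocity update combines the old velocity and the lookahead gradient.
    v_new = mu * v + lr * g
    # 4. Parameter update moves by the new velocity.
    theta_new = theta - v_new
    return theta_new, v_new


# Toy usage: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta, v = np.array([5.0, -3.0]), np.zeros(2)
for _ in range(100):
    theta, v = nag_step(theta, v, grad_fn=lambda th: th, lr=0.1, mu=0.9)
print(theta)  # approaches [0, 0]
```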


4. Comparison with Classical Momentum

| Feature | Classical Momentum | Nesterov Accelerated Gradient |
| --- | --- | --- |
| Gradient evaluation | Gradient at current parameters $\theta_t$ | Gradient at lookahead $\theta_t - \mu v_t$ |
| Update direction | Momentum term updated directly with the gradient at the current step | More informed update that anticipates the upcoming gradient slope |
| Empirical results | Good acceleration, but may overshoot or oscillate | Often faster convergence and better stability |
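For reference, classical (heavy-ball) momentum in the same notation evaluates the gradient at the current parameters $\theta_t$ rather than at the lookahead point:

$v_{t+1} = \mu v_t + \alpha \nabla J(\theta_t), \qquad \theta_{t+1} = \theta_t - v_{t+1}$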

5. Advantages

  • Faster Convergence: Incorporates future gradient information, improving the quality of each step (see the toy comparison after this list).

  • Less Overshooting: Reduces oscillation around minima by correcting velocity direction proactively.

  • Widely Used: Nesterov momentum is available as an option in standard deep learning frameworks, and variants such as Nadam combine it with Adam.
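As a small illustration of these claims, the sketch below runs classical momentum and NAG side by side on a toy one-dimensional quadratic. This is only a hedged demonstration on a contrived problem, not evidence about deep networks; the exact numbers depend on the chosen learning rate and momentum.

```python
def minimize(nesterov, steps=50, lr=0.1, mu=0.9):
    """Minimize f(x) = 0.5 * x**2 (gradient is x) with momentum."""
    x, v = 5.0, 0.0
    for _ in range(steps):
        # NAG evaluates the gradient at the lookahead point; classical
        # momentum evaluates it at the current iterate.
        g = (x - mu * v) if nesterov else x
        v = mu * v + lr * g
        x = x - v
    return x

print("classical momentum:", minimize(nesterov=False))
print("nesterov momentum: ", minimize(nesterov=True))
# For this quadratic with these settings, NAG's error envelope shrinks by
# about 0.90 per step versus roughly 0.95 for classical momentum.
```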


6. Practical Implementation Notes

  • The momentum coefficient $\mu$ is typically set between 0.9 and 0.99.

  • The learning rate $\alpha$ often starts relatively high and is decayed over the course of training.

  • NAG is compatible with batch and mini-batch gradient formulations; a framework-level usage sketch follows.
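As a concrete example, most deep learning frameworks expose Nesterov momentum as a flag on their SGD optimizer. The sketch below assumes PyTorch; the model, loss, and data are placeholders standing in for a real network and data loader, and PyTorch applies a commonly used reformulation of the Nesterov update rather than the literal lookahead step written above.

```python
import torch
import torch.nn as nn

# Placeholder model and loss; substitute your own network and data loader.
model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()

# SGD with Nesterov momentum: momentum must be > 0 when nesterov=True.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, nesterov=True)
# Optional: decay the learning rate over training, as noted above.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # Dummy mini-batch standing in for a real data loader.
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()
```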


7. Applications in Deep Learning

  • Effective for training deep networks like CNNs and RNNs.

  • Enhances learning dynamics, especially in deeper or more complex architectures.


8. Summary Table

| Step | Operation | Formula |
| --- | --- | --- |
| Lookahead position | Advance parameters using the previous velocity | $\theta_{\text{lookahead}} = \theta_t - \mu v_t$ |
| Gradient calculation | Compute the gradient at the lookahead | $g_t = \nabla J(\theta_{\text{lookahead}})$ |
| Velocity update | Update the momentum velocity | $v_{t+1} = \mu v_t + \alpha g_t$ |
| Parameter update | Move parameters by the new velocity | $\theta_{t+1} = \theta_t - v_{t+1}$ |