Rectified Linear Unit (ReLU)

1. Introduction to ReLU

  • ReLU (Rectified Linear Unit) is a widely used activation function in deep neural networks.

  • It outputs the input directly if positive; otherwise, it outputs zero.

Mathematically:

f(x) = \max(0, x)

which means:

f(x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}

  • This simple non-linearity helps deep models learn complex patterns with efficient gradient flow.
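As a concrete reference, here is a minimal NumPy sketch of this definition (the function name `relu` and the sample inputs are chosen for illustration, not taken from any particular library):

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: returns x where x > 0, otherwise 0."""
    return np.maximum(0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))  # outputs: 0, 0, 0, 1.5, 3
```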


2. Advantages of ReLU

  • Computationally efficient: ReLU involves simple thresholding, making it fast to compute.

  • Sparse activation: Outputs zero for negative inputs, leading to sparse representations which improve model efficiency and reduce overfitting.

  • Mitigates vanishing gradient problem: Unlike sigmoid/tanh, the gradient of ReLU is 1 for positive inputs, preserving gradient strength during backpropagation in deep models.

  • Promotes faster convergence: Networks using ReLU typically reach a given training loss in fewer epochs than equivalent sigmoid or tanh networks.
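A quick sketch of the sparsity claim, under the assumption of roughly zero-mean, symmetric pre-activations (the simulated values below are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=10_000)      # simulated zero-mean pre-activations
activations = np.maximum(0, pre_activations)   # ReLU

# Roughly half of the outputs are exactly zero -> sparse representation
sparsity = np.mean(activations == 0)
print(f"fraction of zero activations: {sparsity:.2f}")  # ~0.50 for zero-mean inputs
```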


3. Derivative of ReLU

The derivative used in backpropagation is:

f'(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases}

  • This piecewise derivative makes gradient computations straightforward (the derivative is technically undefined at x = 0, but implementations conventionally use 0 there, as above).
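A minimal sketch of how this derivative is applied during backpropagation, assuming the upstream gradient is already available (`relu_forward` and `relu_backward` are illustrative names, not a specific framework API):

```python
import numpy as np

def relu_forward(x):
    """Forward pass: element-wise max(0, x)."""
    return np.maximum(0, x)

def relu_backward(grad_output, x):
    """Backward pass: pass the upstream gradient where x > 0, block it elsewhere."""
    return grad_output * (x > 0)

x = np.array([-1.0, 0.5, 2.0])
grad_out = np.array([0.1, 0.2, 0.3])   # gradient arriving from the next layer
print(relu_backward(grad_out, x))      # [0.  0.2 0.3]
```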


4. Drawbacks

  • Dying ReLU problem: Neurons whose pre-activation is negative for every input output zero and receive zero gradient, so their weights stop updating and they can stay permanently inactive.

  • Not zero-centered: Outputs are always zero or positive, which can bias gradient directions and slow weight updates in some cases.

  • Unbounded output: Large values can cause exploding activations if not controlled.
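A toy illustration of the dying ReLU problem; the weights and inputs below are contrived so the neuron is already "dead" (this is a sketch, not a training loop):

```python
import numpy as np

# A "dead" neuron: its pre-activation w*x + b is negative for every input,
# so its ReLU output and its gradient are both zero, and gradient descent
# can never move w or b again.
x = np.array([0.5, 1.0, 2.0])   # all inputs the neuron will ever see
w, b = -3.0, -1.0               # parameters already pushed into the dead region

pre_activation = w * x + b              # [-2.5, -4.0, -7.0]: always negative
output = np.maximum(0, pre_activation)  # always zero
grad_w = (pre_activation > 0) * x       # d(output)/dw is zero everywhere

print(output)   # [0. 0. 0.]
print(grad_w)   # [0. 0. 0.]
```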


5. Variants of ReLU

5.1 Leaky ReLU

f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}

  • Allows a small, non-zero gradient α (commonly 0.01) for negative inputs, reducing the dying ReLU problem.

5.2 Parametric ReLU (PReLU)

  • Similar to Leaky ReLU, but α is learned during training instead of fixed.

5.3 Exponential Linear Unit (ELU)

  • Smooths the negative part, mapping x ≤ 0 to α(e^x − 1), which reduces bias shift and can improve learning speed.
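A minimal NumPy sketch of Leaky ReLU and ELU, assuming the standard formulas above; the defaults α = 0.01 and α = 1.0 are common choices, not requirements:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: small slope alpha for negative inputs instead of a hard zero."""
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    """ELU: smooth exponential saturation toward -alpha for negative inputs."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

x = np.array([-2.0, -0.1, 0.0, 1.0])
print(leaky_relu(x))  # [-0.02  -0.001  0.     1.   ]
print(elu(x))         # approximately [-0.865 -0.095  0.     1.   ]
```

PReLU uses the same form as Leaky ReLU but stores α as a trainable parameter updated by the optimizer (PyTorch exposes this as torch.nn.PReLU).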


6. ReLU vs Other Activation Functions

| Activation Function | Output Range | Pros | Cons | Typical Use |
| --- | --- | --- | --- | --- |
| ReLU | [0, ∞) | Simple, sparse, mitigates vanishing gradient | Dying neurons, unbounded output | Hidden layers in deep nets |
| Sigmoid | (0, 1) | Smooth, probabilistic interpretation | Vanishing gradient, slow training | Output layers in binary classification |
| Tanh | (-1, 1) | Zero-centered | Vanishing gradient | Hidden layers when zero-centered output needed |
| Leaky ReLU | (-∞, ∞) | Fixes dying ReLU | α needs tuning | Hidden-layer alternative to ReLU |
| ELU | (-α, ∞) | Smooth negative values, faster convergence | More computation | Deeper nets for improved learning |

7. Applications

  • Used as the default activation function in convolutional neural networks (CNNs) for image classification, object detection, and segmentation.

  • Useful in deep feedforward networks and reinforcement learning.

  • Helps build deeper, more expressive models while retaining trainability.


8. Summary

  • ReLU's simplicity and effectiveness have made it the go-to activation function in deep learning.

  • While it has drawbacks, simple variants like Leaky ReLU and parametric forms address them.

  • Understanding ReLU is critical to grasping modern deep learning model design and training dynamics.