Unit 3: Visualization and Understanding CNNs


1. Introduction to CNNs

  • Convolutional Neural Networks (CNNs) are deep learning models particularly effective for image data, exploiting spatial hierarchies for feature learning.

  • Main components: Convolutional layers, Activation functions (e.g., ReLU), Pooling layers, and Fully-connected layers.
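The pipeline above (convolution → ReLU → pooling) can be sketched in plain numpy; this is a minimal illustration, not a production implementation, and the Sobel-like edge filter is chosen just for the example:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Non-overlapping max pooling; trims edges that do not divide evenly."""
    h, w = (x.shape[0] // size) * size, (x.shape[1] // size) * size
    x = x[:h, :w].reshape(h // size, size, w // size, size)
    return x.max(axis=(1, 3))

# Toy image with a dark-to-bright vertical edge, and a filter that responds to it.
image = np.zeros((8, 8))
image[:, 4:] = 1.0
kernel = np.array([[-1., 0., 1.]] * 3)   # hand-picked 3x3 edge filter
features = max_pool(relu(conv2d(image, kernel)))
print(features.shape)  # (3, 3)
```

The strong responses land exactly where the edge sits, which is the spatial-locality property the bullet points describe.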


2. Evolution of CNN Architectures

Classic CNN Architectures:

  • AlexNet: First deep CNN to win ImageNet (2012), used ReLU, dropout, and GPUs for training.

  • ZFNet (Zeiler & Fergus Net): Improved on AlexNet, visualized intermediate activations to better understand learned features.

  • VGGNet: Very deep nets built from small (3x3) convolutions, emphasizing architectural simplicity and improving performance.
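VGGNet's design rests on a simple piece of arithmetic: stacking stride-1 3x3 convolutions matches the receptive field of a single larger filter while using fewer parameters. A quick sketch of that arithmetic (for the stride-1 case only):

```python
def receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions:
    each layer grows the field by (kernel - 1) pixels."""
    return 1 + num_layers * (kernel - 1)

def conv_params(kernel, channels, layers=1):
    """Weights in `layers` conv layers with `channels` in and out (biases ignored)."""
    return layers * kernel * kernel * channels * channels

c = 64
print(receptive_field(2))            # 5: two 3x3 layers see a 5x5 region
print(receptive_field(3))            # 7: three 3x3 layers see a 7x7 region
print(conv_params(3, c, layers=2))   # 18*c*c weights ...
print(conv_params(5, c, layers=1))   # ... vs 25*c*c for one 5x5 layer
```

So two 3x3 layers cover the same 5x5 region with 18c² instead of 25c² weights, plus an extra ReLU nonlinearity in between.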


3. Visualization of Kernels/Filters

  • Kernel visualization helps interpret what features the network’s convolutional layers are learning.

  • Filters in early layers, when visualized, resemble edge or color detectors, while deeper layers capture more complex patterns (textures, object parts).
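The standard way to inspect first-layer filters is to normalize each kernel to [0, 1] independently and tile them into a single image. A minimal sketch (random weights stand in for a trained model's kernels):

```python
import numpy as np

def filters_to_grid(weights):
    """Tile single-channel conv kernels into one displayable image.
    weights: (num_filters, h, w) array, e.g. first-layer kernels."""
    n, h, w = weights.shape
    mins = weights.min(axis=(1, 2), keepdims=True)
    maxs = weights.max(axis=(1, 2), keepdims=True)
    normed = (weights - mins) / (maxs - mins + 1e-8)  # per-filter [0, 1] scaling
    cols = int(np.ceil(np.sqrt(n)))
    rows = int(np.ceil(n / cols))
    grid = np.zeros((rows * h, cols * w))
    for k in range(n):
        r, c = divmod(k, cols)
        grid[r*h:(r+1)*h, c*w:(c+1)*w] = normed[k]
    return grid

rng = np.random.default_rng(0)
grid = filters_to_grid(rng.normal(size=(16, 7, 7)))  # stand-in for real kernels
print(grid.shape)  # (28, 28)
```

With a trained network (e.g. AlexNet's 11x11 first-layer kernels) the resulting grid shows the familiar oriented edges and color blobs.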


4. Backprop-to-Image/Deconvolution Methods

  • Deconvolution/Backprop-to-image:

    • Techniques to project feature activations back to input space to show which image regions activate specific neurons.

    • Useful for understanding spatial sensitivity and layer-wise feature abstraction.

  • Guided Backpropagation: Refines the visualization by zeroing out negative gradients at each ReLU during the backward pass, producing sharper, less noisy images.
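The difference between standard backprop and guided backprop is entirely in the ReLU backward rule; a numpy sketch of the two rules side by side:

```python
import numpy as np

def relu_backward_plain(grad_out, pre_activation):
    """Standard ReLU backward: pass gradient where the forward input was positive."""
    return grad_out * (pre_activation > 0)

def relu_backward_guided(grad_out, pre_activation):
    """Guided backprop: additionally zero negative incoming gradients, keeping
    only paths that contribute positively to the visualized activation."""
    return grad_out * (pre_activation > 0) * (grad_out > 0)

pre = np.array([-1.0, 2.0, 3.0, 0.5])   # pre-ReLU activations from the forward pass
g   = np.array([ 0.7, -0.4, 0.9, 0.2])  # gradients arriving from the layer above
print(relu_backward_plain(g, pre))
print(relu_backward_guided(g, pre))
```

Applying the guided rule at every ReLU suppresses the "suppressive" gradient paths, which is why the resulting input-space images look sharper.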


5. Deep Dream and Hallucination

  • Deep Dream: Algorithm modifies input images to maximize neuron activations, creating "dream-like" visuals.

  • Hallucination: Iteratively amplifying the patterns a CNN detects in an image, a way of exploring the network's internal feature space.
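The core loop of Deep Dream is gradient ascent on the input image rather than on the weights. A toy sketch with a single linear "neuron" (a fixed filter w), where the gradient of the activation with respect to the input is analytic:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=64)          # fixed "learned" filter playing the neuron's role
x = rng.normal(size=64) * 0.01   # start from a near-blank input

def activation(inp):
    return float(w @ inp)        # a = w.x, so da/dx = w

lr = 0.1
history = [activation(x)]
for _ in range(20):
    x = x + lr * w               # ascend the activation's gradient w.r.t. the input
    history.append(activation(x))

print(history[0], history[-1])   # activation grows as the filter's pattern emerges
```

Real Deep Dream does the same thing with a deep layer's activations as the objective (gradients obtained by backprop), plus tricks like multi-scale "octaves" and gradient normalization.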


6. Neural Style Transfer

  • Combines content features from one image with style features from another (usually using a pre-trained CNN) to create art-like images.

  • Relies on separating and recombining content and style representations in feature maps.
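The usual style representation is the Gram matrix of a layer's feature maps: channel-by-channel correlations that discard spatial layout. A minimal sketch of the Gram matrix and a style loss between two feature maps (random arrays stand in for VGG activations):

```python
import numpy as np

def gram_matrix(features):
    """Style representation of one layer's activations.
    features: (C, H, W) feature maps -> (C, C) channel correlations."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return (f @ f.T) / (h * w)   # spatial positions are summed out

def style_loss(gen_feat, style_feat):
    """Mean squared difference between the two images' Gram matrices."""
    return float(np.mean((gram_matrix(gen_feat) - gram_matrix(style_feat)) ** 2))

rng = np.random.default_rng(0)
style_feat = rng.normal(size=(8, 16, 16))   # stand-in for a VGG layer's activations
print(gram_matrix(style_feat).shape)        # (8, 8)
print(style_loss(style_feat, style_feat))   # 0.0 for identical features
```

Content loss, by contrast, compares the raw feature maps directly, so it preserves spatial arrangement; the generated image is optimized to minimize a weighted sum of the two.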


7. Class Activation Mapping (CAM, Grad-CAM)

  • CAM: Generates heatmaps showing which image regions the CNN used for a given class prediction; requires a global-average-pooling layer before the classifier.

  • Grad-CAM: Uses gradients of class scores w.r.t. feature maps to create class-discriminative localization maps without architectural changes, useful for model interpretability and debugging.
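Given a layer's forward activations and the gradient of the class score with respect to them, the Grad-CAM computation itself is only a few lines; a numpy sketch (random arrays stand in for values a framework's autograd would provide):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap for one conv layer.
    activations: (C, H, W) forward feature maps
    gradients:   (C, H, W) d(class score)/d(activations)"""
    weights = gradients.mean(axis=(1, 2))             # global-average-pool the grads
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam

rng = np.random.default_rng(0)
acts = rng.random(size=(32, 7, 7))     # stand-in for a last-conv-layer forward pass
grads = rng.normal(size=(32, 7, 7))    # stand-in for backpropagated gradients
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # (7, 7)
```

In practice the (H, W) heatmap is upsampled to the input resolution and overlaid on the image; the ReLU is what makes the map class-discriminative, keeping only regions with positive influence on the score.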


8. Applications in Vision

  • CNN visualizations help in model interpretation, architecture debugging, medical imaging, and trustworthy AI deployment.

  • Used extensively for explaining decisions in safety-critical or regulated domains.