Visual Feature Extraction: Bag-of-Words and VLAD
Feature extraction is an essential process in computer vision, enabling effective image representation and retrieval by summarizing an image’s local descriptors into compact vectors. Two influential methods are the Bag-of-Words (BoW) and Vector of Locally Aggregated Descriptors (VLAD) approaches.
Bag-of-Words (BoW) Model
Concept
- Inspired by the bag-of-words model in text analysis, the BoW model for images represents an image as an unordered collection (bag) of visual features.
- Each image is characterized by the frequency of its local descriptors (called “visual words”), ignoring their spatial arrangement in the image.
Steps in BoW Feature Extraction
1. Feature Detection and Description
   - Detect local keypoints using algorithms such as SIFT, SURF, or ORB.
   - Extract feature descriptors (e.g., SIFT yields a 128-dimensional vector per keypoint).
2. Codebook Generation
   - Combine descriptors from a large set of images and cluster them using algorithms like k-means.
   - The cluster centers become the visual words.
3. Vector Quantization
   - Assign each local descriptor in an image to its nearest visual word (cluster center).
   - Represent the image as a histogram counting how many times each visual word appears.
4. Image Representation
   - The final image descriptor is this histogram: a fixed-length vector regardless of image size.
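The four steps above can be sketched end to end. This is a minimal illustration using synthetic descriptors and NumPy; the descriptor arrays, the codebook size `K = 16`, and the toy `kmeans` helper are assumptions for demonstration, not part of any specific library (production pipelines would use real SIFT/ORB descriptors and a tuned clustering implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for local descriptors (e.g., 128-D SIFT vectors).
train_descriptors = rng.normal(size=(1000, 128))   # pooled from many images
image_descriptors = rng.normal(size=(150, 128))    # descriptors of one image

K = 16  # codebook size (number of visual words); illustrative choice


def kmeans(X, k, iters=20, seed=0):
    """Plain k-means for illustration only."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers


# Steps 1-2: learn the codebook from pooled training descriptors.
codebook = kmeans(train_descriptors, K)

# Step 3: vector quantization -- nearest visual word per image descriptor.
dist = np.linalg.norm(image_descriptors[:, None, :] - codebook[None, :, :], axis=2)
words = dist.argmin(axis=1)

# Step 4: the image representation is a K-bin histogram of word occurrences.
bow = np.bincount(words, minlength=K).astype(float)
bow /= bow.sum()  # normalize so images with different descriptor counts compare fairly
```

Two images can then be compared simply by a vector distance between their K-dimensional histograms, regardless of how many keypoints each image produced.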
Advantages and Use Cases
- Compact: transforms a variable number of descriptors into a uniform-length representation.
- Efficient comparison: suitable for large-scale image classification, retrieval, and content-based indexing.
- Limitations: ignores spatial relationships, so it cannot distinguish between different arrangements of the same features.
VLAD (Vector of Locally Aggregated Descriptors)
Concept
- VLAD is an advancement over BoW, providing a richer representation by accumulating the differences between local descriptors and their nearest codebook centers (visual words), rather than just counting occurrences.
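In the standard formulation, with codebook centers $c_1, \dots, c_K$ of dimension $d$ and each descriptor $x$ assigned to its nearest center $\mathrm{NN}(x)$, the per-cluster blocks and the final concatenated vector are:

```latex
v_k = \sum_{x \,:\, \mathrm{NN}(x) = c_k} (x - c_k),
\qquad
V = \left[\, v_1^{\top}, v_2^{\top}, \dots, v_K^{\top} \,\right] \in \mathbb{R}^{K d}
```

A cluster with no assigned descriptors contributes a zero block, and $V$ is typically L2-normalized before comparison.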
Steps in VLAD Feature Extraction
1. Feature Detection and Description
   - As in BoW, extract local descriptors from keypoints throughout the image.
2. Codebook Construction
   - Run k-means clustering on a large set of descriptors to form cluster centers (visual words).
3. Residual Computation
   - For every descriptor in the image, find its closest cluster center.
   - Compute the residual (difference vector) between the descriptor and its assigned center.
4. Residual Aggregation
   - For each cluster, accumulate the sum of residuals over all descriptors assigned to it.
5. Vector Concatenation
   - Concatenate the aggregated residuals across clusters into a single, high-dimensional feature vector representing the image.
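Steps 3-5 can be sketched as follows. This is a minimal NumPy version assuming a codebook has already been learned (steps 1-2); the random `codebook` and `descriptors` arrays and the sizes `K = 8`, `d = 32` are illustrative assumptions, and the final L2 normalization reflects common practice rather than a mandatory part of the definition.

```python
import numpy as np

rng = np.random.default_rng(1)

K, d = 8, 32                             # illustrative codebook size / descriptor dim
codebook = rng.normal(size=(K, d))       # stand-in for k-means centers (steps 1-2)
descriptors = rng.normal(size=(200, d))  # local descriptors of one image


def vlad(descriptors, codebook):
    K, d = codebook.shape
    # Step 3: assign each descriptor to its nearest cluster center.
    dist = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dist.argmin(axis=1)
    # Step 4: accumulate residuals (descriptor - center) per cluster.
    v = np.zeros((K, d))
    for k in range(K):
        assigned = descriptors[assignments == k]
        if len(assigned):
            v[k] = (assigned - codebook[k]).sum(axis=0)
    # Step 5: concatenate; L2-normalize so images are comparable (common practice).
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


vlad_vec = vlad(descriptors, codebook)  # K * d = 256 dimensions here
```

Note that the output grows with both the number of clusters and the descriptor dimension, which is why VLAD vectors are often compressed (e.g., with PCA) before indexing.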
Properties and Benefits
- Encodes more information: each cluster summarizes how local features differ from the codebook, making the representation more discriminative than BoW.
- Global descriptor: the final vector is compact yet informative, supporting efficient image matching, retrieval, and place recognition.
- Applications: used widely in place recognition and object retrieval, and increasingly in deep learning pipelines as a pooling mechanism (e.g., NetVLAD).
Comparison Table

| Aspect | Bag-of-Words | VLAD |
|---|---|---|
| Descriptor | Histogram of visual word frequencies | Concatenated residuals (vector differences) |
| Codebook | Cluster centers (e.g., via k-means) | Same, via k-means |
| Output vector size | Number of visual words | Number of clusters × descriptor dimension |
| Information preserved | Occurrence counts only | First-order residual statistics per cluster |
| Use cases | Image retrieval, classification | Retrieval, place recognition, image matching |
| Strengths | Simplicity, efficiency | Richer, more discriminative representation |
| Weaknesses | Discards spatial/contextual detail | Higher dimensionality, more computation |