Visual Feature Extraction: Bag-of-Words and VLAD
Feature extraction is an essential process in computer vision, enabling effective image representation and retrieval by summarizing an image’s local descriptors into compact vectors. Two influential methods are the Bag-of-Words (BoW) and Vector of Locally Aggregated Descriptors (VLAD) approaches.
Bag-of-Words (BoW) Model
Concept
- Inspired by the bag-of-words model in text analysis, the BoW model for images represents an image as an unordered collection (bag) of visual features.
- Each image is characterized by the frequency of its local descriptors (called “visual words”), ignoring their spatial arrangement in the image.
Steps in BoW Feature Extraction
1. Feature Detection and Description
   - Detect local keypoints using algorithms such as SIFT, SURF, or ORB.
   - Extract feature descriptors (e.g., SIFT yields a 128-dimensional vector per keypoint).
2. Codebook Generation
   - Combine descriptors from a large set of images and cluster them using algorithms like k-means.
   - The cluster centers become the visual words.
3. Vector Quantization
   - Assign each local descriptor in an image to its nearest visual word (cluster center).
   - Represent the image as a histogram counting how many times each visual word appears.
4. Image Representation
   - The final image descriptor is this histogram: a fixed-length vector regardless of image size.
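The four steps above can be sketched end to end. This is a minimal illustration using synthetic descriptors and NumPy; the descriptor arrays, the codebook size `K = 16`, and the toy `kmeans` helper are assumptions for demonstration, not part of any specific library (production pipelines would use real SIFT/ORB descriptors and a tuned clustering implementation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for local descriptors (e.g., 128-D SIFT vectors).
train_descriptors = rng.normal(size=(1000, 128))   # pooled from many images
image_descriptors = rng.normal(size=(150, 128))    # descriptors of one image

K = 16  # codebook size (number of visual words); illustrative choice


def kmeans(X, k, iters=20, seed=0):
    """Plain k-means for illustration only."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers


# Steps 1-2: learn the codebook from pooled training descriptors.
codebook = kmeans(train_descriptors, K)

# Step 3: vector quantization -- nearest visual word per image descriptor.
dist = np.linalg.norm(image_descriptors[:, None, :] - codebook[None, :, :], axis=2)
words = dist.argmin(axis=1)

# Step 4: the image representation is a K-bin histogram of word occurrences.
bow = np.bincount(words, minlength=K).astype(float)
bow /= bow.sum()  # normalize so images with different descriptor counts compare fairly
```

Two images can then be compared simply by a vector distance between their K-dimensional histograms, regardless of how many keypoints each image produced.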
Advantages and Use Cases
- Compact: transforms a variable number of descriptors into a uniform-length representation.
- Efficient comparison: suitable for large-scale image classification, retrieval, and content-based indexing.
- Limitations: ignores spatial relationships, so it cannot distinguish between different arrangements of the same features.
VLAD (Vector of Locally Aggregated Descriptors)
Concept
- VLAD is an advancement over BoW, providing a richer representation by accumulating the differences between local descriptors and their nearest codebook centers (visual words), rather than just counting occurrences.
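In the standard formulation, with codebook centers $c_1, \dots, c_K$ of dimension $d$ and each descriptor $x$ assigned to its nearest center $\mathrm{NN}(x)$, the per-cluster blocks and the final concatenated vector are:

```latex
v_k = \sum_{x \,:\, \mathrm{NN}(x) = c_k} (x - c_k),
\qquad
V = \left[\, v_1^{\top}, v_2^{\top}, \dots, v_K^{\top} \,\right] \in \mathbb{R}^{K d}
```

A cluster with no assigned descriptors contributes a zero block, and $V$ is typically L2-normalized before comparison.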
Steps in VLAD Feature Extraction
1. Feature Detection and Description
   - As in BoW, extract local descriptors from keypoints throughout the image.
2. Codebook Construction
   - Run k-means clustering on a large set of descriptors to form cluster centers (visual words).
3. Residual Computation
   - For every descriptor in the image, find its closest cluster center.
   - Compute the residual (difference vector) between the descriptor and its assigned center.
4. Residual Aggregation
   - For each cluster, accumulate the sum of residuals over all descriptors assigned to it.
5. Vector Concatenation
   - Concatenate the aggregated residuals across clusters into a single, high-dimensional feature vector representing the image.
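Steps 3-5 can be sketched as follows. This is a minimal NumPy version assuming a codebook has already been learned (steps 1-2); the random `codebook` and `descriptors` arrays and the sizes `K = 8`, `d = 32` are illustrative assumptions, and the final L2 normalization reflects common practice rather than a mandatory part of the definition.

```python
import numpy as np

rng = np.random.default_rng(1)

K, d = 8, 32                             # illustrative codebook size / descriptor dim
codebook = rng.normal(size=(K, d))       # stand-in for k-means centers (steps 1-2)
descriptors = rng.normal(size=(200, d))  # local descriptors of one image


def vlad(descriptors, codebook):
    K, d = codebook.shape
    # Step 3: assign each descriptor to its nearest cluster center.
    dist = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assignments = dist.argmin(axis=1)
    # Step 4: accumulate residuals (descriptor - center) per cluster.
    v = np.zeros((K, d))
    for k in range(K):
        assigned = descriptors[assignments == k]
        if len(assigned):
            v[k] = (assigned - codebook[k]).sum(axis=0)
    # Step 5: concatenate; L2-normalize so images are comparable (common practice).
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


vlad_vec = vlad(descriptors, codebook)  # K * d = 256 dimensions here
```

Note that the output grows with both the number of clusters and the descriptor dimension, which is why VLAD vectors are often compressed (e.g., with PCA) before indexing.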
Properties and Benefits
- Encodes more information: each cluster summarizes how local features differ from the codebook, making the representation more discriminative than BoW.
- Global descriptor: the final vector is compact yet informative, supporting efficient image matching, retrieval, and place recognition.
- Applications: used widely in place recognition and object retrieval, and increasingly in deep learning pipelines as a pooling mechanism (e.g., NetVLAD).
Comparison Table

| Aspect | Bag-of-Words | VLAD |
|---|---|---|
| Descriptor | Histogram of visual word frequencies | Concatenated residuals (vector differences) |
| Codebook | Cluster centers (e.g., via k-means) | Same, via k-means |
| Output vector size | Number of visual words | Number of clusters × descriptor dimension |
| Information preserved | Occurrence counts only | First-order residual statistics per cluster |
| Use cases | Image retrieval, classification | Retrieval, place recognition, image matching |
| Strengths | Simplicity, efficiency | Richer, more discriminative representation |
| Weaknesses | Discards spatial/contextual detail | Higher dimensionality, more computation |