Friday, February 25, 2011

Object Recognition - Bag of Keypoints

Bag of Keypoints is an object recognition technique presented in the paper "Visual Categorization with Bags of Keypoints". The Bag-of-Keypoints idea is borrowed from the Bag-of-Words approach used in text data mining.

The paper differentiates this 'multi-class' object recognition technique from 'object recognition', 'content-based image retrieval' and 'object detection'.

Very Brief Summary
The goal is to recognize the 'class' of object present in an input image. Each object class is associated with a set of interest points (features). In Naive-Bayes terms, a classifier is trained with labeled data to predict the class of object present in an image from the set of detected interest points. A linear SVM is also trained as a classifier for comparison. Since this is a multi-class problem, a one-against-all method is used: the authors train m SVMs, where m is the number of classes, and each one outputs a confidence value for whether the input image belongs to its class.
The 'bag of words' refers to a vocabulary set. The vocabulary is built by k-means clustering of keypoint descriptors (such as SIFT), with each cluster centre acting as one vocabulary word. A BOW descriptor is a histogram of vocabulary-word occurrences, one BOW descriptor per image. Each keypoint detected in an image is looked up in the vocabulary to find its closest word, and the corresponding histogram bin is incremented. The BOW descriptors (histograms) from the training images are then used to train the SVM classifiers.
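
To make the histogram step concrete, here is a minimal sketch in plain Python (no OpenCV). The function names, the toy 2-D "descriptors" and the 3-word vocabulary are all made up for illustration; real SIFT descriptors have 128 dimensions and the vocabulary would come from k-means. Whether the histogram is normalized is left as an option here, since the text above does not say.

```python
import math

def nearest_word(descriptor, vocabulary):
    """Index of the vocabulary word closest to the descriptor (squared Euclidean)."""
    best_index, best_dist = 0, math.inf
    for index, word in enumerate(vocabulary):
        dist = sum((d - w) ** 2 for d, w in zip(descriptor, word))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def bow_histogram(descriptors, vocabulary, normalize=False):
    """Count how many descriptors of one image fall on each vocabulary word."""
    hist = [0.0] * len(vocabulary)
    for descriptor in descriptors:
        hist[nearest_word(descriptor, vocabulary)] += 1.0
    total = sum(hist)
    if normalize and total > 0:
        hist = [count / total for count in hist]
    return hist

# Hypothetical toy data: 3 vocabulary words, 4 keypoint descriptors from one image.
vocabulary = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]
descriptors = [(1.0, 0.5), (9.0, 9.5), (0.2, 0.1), (0.5, 9.0)]

histogram = bow_histogram(descriptors, vocabulary)
normalized = bow_histogram(descriptors, vocabulary, normalize=True)
```

Two descriptors land on the first word and one on each of the others, so the image is summarized by a vector whose length equals the vocabulary size, regardless of how many keypoints were detected.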

Harris-Affine-Detector -> SIFT Descriptor -> Assigned to Cluster -> 'Feature Vector' -- (+ label) -> Naive Bayes (or m SVMs).
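
The last stage of this pipeline, with m one-against-all classifiers, can be sketched as follows. This is only an illustration: the linear weights and class names are invented, and picking the class with the largest confidence is my assumption about how the m outputs are combined.

```python
def linear_confidence(weights, bias, feature_vector):
    """Signed score of one binary (class vs. rest) linear classifier."""
    return sum(w * x for w, x in zip(weights, feature_vector)) + bias

def predict_class(models, feature_vector):
    """Run every per-class classifier and return the most confident class."""
    confidences = {
        name: linear_confidence(weights, bias, feature_vector)
        for name, (weights, bias) in models.items()
    }
    best = max(confidences, key=confidences.get)
    return best, confidences

# Hypothetical models: one (weights, bias) pair per object class.
models = {
    "aeroplane": ([1.0, -1.0, 0.0], 0.0),
    "bicycle": ([-1.0, 1.0, 0.0], 0.0),
    "car": ([0.0, 0.0, 1.0], -0.5),
}

# A BOW histogram over a 3-word vocabulary, as the 'Feature Vector'.
label, confidences = predict_class(models, [0.5, 0.25, 0.25])
```

For training, the same feature vector is reused m times: it is a positive sample for the classifier of its own class and a negative sample for the other m - 1 classifiers.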

Sample (bagofwords_classification.cpp)

The bagofwords sample program performs the bag-of-keypoints training and classification as in the paper. The training and test data format supported is PASCAL 2007-2010.

The program defines a class VocData that understands the PASCAL VOC challenge data format. It is used to look up the list of training-set and test-set image files for a specific object class. The class also defines helper functions to load/save classifier results and gnuplot command files.

The program defines functions to load/save last run parameters, vocabulary-set and BOW descriptors.

Despite the big chunk of code, the functions are pretty well-defined. There are sufficient code-comments.

The user specifies the keypoint-detection method, keypoint descriptor and keypoint-matching method. The vocabulary set is built by BOWKMeansTrainer from the training images of one chosen object class. The code comment says building with one particular class is enough.

An SVM is used for object classification, via the CvSVM class. There is one CvSVM instance per object class. Each instance is trained with both positive and negative samples of its object class, and is then tested against test images of all classes.

See LIBSVM for more details on the CvSVM implementation in OpenCV.

The BOW image descriptor is the histogram of vocabulary occurrences in a single image. Collected over all images, the descriptors form a simple 2D array with one row per image and one column per vocabulary word. Each row is sent to the SVM for training.

DDMParams loads/stores the keypoint detector, descriptor and matcher types.

VocabTrainParams stores the name of the object class to be used for building the vocabulary. It also loads/saves the maximum vocabulary size, the memory to use, and the proportion of descriptors to use for building the vocabulary. Not all the detected image keypoints are used: the last parameter specifies the fraction of descriptors to be randomly picked from each input image.

SVMTrainParamsExt stores some parameters that control the input to the SVM training process. These do not overlap with CvSVMParams. There are 3 parameters:

  1. descPercent controls the fraction of the BOW image descriptors to be used for training.
  2. targetRatio is the preferred ratio of positive to negative samples. More precisely, this parameter is the fraction of positive samples among all samples. It also means that some of the samples will be thrown away to maintain this ratio.
  3. balanceClasses is a boolean. If it is true, the C-SVC weights given to the positive and negative samples follow the pos:neg ratio of the samples used for training. See CvSVMParams::class_weights for usage. If it is set to true, targetRatio is not used.
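
A rough sketch of how these three parameters could interact is given below. It is not the sample's actual code: the function names, the shuffling and the rounding are my own assumptions, and the weight computation is only one reading of "same as the pos:neg ratio".

```python
import random

def select_training_samples(labels, desc_percent, target_ratio, seed=0):
    """Pick sample indices: first keep desc_percent of all descriptors,
    then drop samples from the over-represented class until positives
    make up (at most) target_ratio of what is left. Labels are +1 / -1."""
    if not 0.0 < target_ratio < 1.0:
        raise ValueError("target_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    kept = indices[: int(len(indices) * desc_percent)]
    pos = [i for i in kept if labels[i] == 1]
    neg = [i for i in kept if labels[i] == -1]
    if pos and neg:
        if len(pos) / (len(pos) + len(neg)) < target_ratio:
            # Too few positives: throw away some negatives.
            max_neg = int(len(pos) * (1.0 - target_ratio) / target_ratio + 1e-9)
            neg = neg[:max_neg]
        else:
            # Too many positives: throw away some positives.
            max_pos = int(len(neg) * target_ratio / (1.0 - target_ratio) + 1e-9)
            pos = pos[:max_pos]
    return sorted(pos + neg)

def class_weights(labels, indices):
    """Assumed balanceClasses behaviour: weights proportional to class counts."""
    pos = sum(1 for i in indices if labels[i] == 1)
    total = len(indices)
    return pos / total, (total - pos) / total

# Hypothetical data set: 10 positive and 90 negative BOW descriptors.
labels = [1] * 10 + [-1] * 90

kept = select_training_samples(labels, desc_percent=1.0, target_ratio=0.25)
weights_all = class_weights(labels, range(len(labels)))
```

With targetRatio = 0.25 all 10 positives are kept but only 30 of the 90 negatives, which shows why this option throws data away. With balanceClasses every sample is kept and the imbalance is handled through the class weights instead.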

RBF is chosen as the kernel function for the SVMs. The related parameters are chosen automatically, presumably by searching over a parameter 'grid'. See the LIBSVM docs.
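
For readers unfamiliar with the 'grid' idea, here is a minimal sketch of a grid search over the two RBF parameters, C and gamma, on logarithmically spaced values. The function names are hypothetical, and the scoring function is a stand-in: in practice it would train an SVM with the given parameters and return a cross-validation score.

```python
import math

def log_grid(min_val, max_val, step):
    """Values min_val, min_val*step, min_val*step^2, ... below max_val."""
    values = []
    value = min_val
    while value < max_val:
        values.append(value)
        value *= step
    return values

def grid_search(c_grid, gamma_grid, score):
    """Try every (C, gamma) pair and keep the one with the best score."""
    best = None
    for c in c_grid:
        for gamma in gamma_grid:
            current = score(c, gamma)
            if best is None or current > best[0]:
                best = (current, c, gamma)
    return best

def toy_score(c, gamma):
    """Stand-in for cross-validation accuracy; peaks at C=10, gamma=0.01."""
    return -abs(math.log10(c) - 1.0) - abs(math.log10(gamma) + 2.0)

c_grid = log_grid(0.1, 1000.0, 10.0)
gamma_grid = log_grid(1e-4, 1.0, 10.0)
best_score, best_c, best_gamma = grid_search(c_grid, gamma_grid, toy_score)
```

Every grid point costs one full round of SVM training (several with k-fold cross-validation), which is one plausible reason the automatic training is slow.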


The run used the Harris-Affine detector, SIFT descriptor and BruteForce matcher for keypoint matching.
BOWKMeansTrainer was left at its default parameters.
As stated above, the demo application saves user preferences, BOW descriptors, SVM classifier parameters and test results to an output directory.
I stopped the run after the 'aeroplane' class because it took too long. The rest is saved for another time when there is a spare PC. On the other hand, 10103 BOW descriptors are already built, out of 11322 JPEG images. That means only about 1200 more image descriptors remain to be extracted. Most of the remaining time would be spent on training SVMs.

It took a very long time to build the vocabulary: k-means never seems to converge below the default error value, so it stops after 100 iterations, which is the default maximum.
Computing Feature Descriptors (Detect + Extract): 6823 secs ~ 2hrs
Vocabulary Training ( 3 attempts of (k=1000)-means ): 75174 secs ~ 21 hours
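
The two stopping conditions mentioned above (an error threshold and a maximum iteration count) can be illustrated with a small plain-Python k-means. This is a simplified sketch, not OpenCV's implementation: it starts from the first k points instead of a random initialization, runs a single attempt, and the default values are placeholders.

```python
def kmeans(points, k, max_iter=100, epsilon=1e-3):
    """Basic k-means. Stops when no centre moves more than epsilon,
    or after max_iter iterations, whichever comes first."""
    centers = [list(p) for p in points[:k]]
    for iteration in range(1, max_iter + 1):
        # Assignment step: attach every point to its closest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            clusters[closest].append(p)
        # Update step: move each centre to the mean of its members.
        shift = 0.0
        for j, members in enumerate(clusters):
            if not members:
                continue  # keep an empty cluster's centre where it is
            new_center = [sum(col) / len(members) for col in zip(*members)]
            moved = sum((a - b) ** 2 for a, b in zip(new_center, centers[j])) ** 0.5
            shift = max(shift, moved)
            centers[j] = new_center
        if shift < epsilon:
            return centers, iteration, True
    return centers, max_iter, False

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

centers, iterations, converged = kmeans(points, k=2)
_, capped_iterations, capped_converged = kmeans(points, k=2, max_iter=1)
```

On this toy data the centres settle after 3 iterations. With k=1000 and a large number of 128-dimensional SIFT descriptors, each iteration is expensive and the centres may keep moving slightly, so the iteration cap becomes the effective stopping rule, repeated once per attempt.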

SVM Classifier Training (for one classifier, aeroplane)
  • Took 5 hours to extract BOW descriptors from the 4998 Training Set images.
  • Took another 2.6 hours to train the SVM classifier with 2499 of the above descriptors, meaning only 50% is used for training. Of these, 143 are positive and 2356 are negative.
SVM Classifier Testing (for one classifier, aeroplane)
  • Took 5 hours to extract image descriptors from the 5105 Test Set images.
  • Took only 0.04 seconds to classify all the Test Set descriptors.
  • The output includes a gnuplot command file. Running it through Cygwin gnuplot produces a PNG file showing an Average Precision of 0.058 and a plot of Precision versus Recall.
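
For reference, here is a minimal sketch of how precision, recall and average precision can be computed from classifier confidences. It uses the plain (non-interpolated) definition of AP; the PASCAL VOC evaluation uses an interpolated variant, which this sketch does not reproduce. The scores and labels are invented.

```python
def precision_recall(scores, labels):
    """Precision and recall after each position of the confidence ranking."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(1 for _, is_positive in ranked if is_positive)
    curve = []
    true_positives = 0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            true_positives += 1
        recall = true_positives / total_positives if total_positives else 0.0
        curve.append((true_positives / rank, recall))
    return curve

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive sample."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

# Hypothetical test set: 4 images, 2 of which contain the object.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [True, False, True, False]

curve = precision_recall(scores, labels)
ap = average_precision(scores, labels)
```

Here the positives sit at ranks 1 and 3, so AP = (1/1 + 2/3) / 2, about 0.83. An AP close to the fraction of positive images in the test set means the ranking is doing little better than chance.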

  • Visual Categorization with Bags of Keypoints, Csurka, et al.
  • A Practical Guide to Support Vector Classification, see LIBSVM from Resources


  1. Yes - definitely. Not the chemical company or football team...

  2. Does CvSVM in OpenCV support multi-class ?

  3. AFAIK, CvSVM is derived from LibSVM 2.6. I guess the answer lies in whether LibSVM 2.6 supports multi-class.

  4. Hi,
    I tried to use OpenCV Bag of Words, but the problem is that I got 0 support vectors for my SVMs.
    Do you have any idea about this problem? I used the SIFT descriptor with BruteForce.

    Thank you for your help.

  5. very good information but how did you train the SVM? we are trying to train the svm with the histogram of BOW descriptors and we keep getting errors. Did you convert the Mat class object to a particular format before using it to train the SVM ?


  6. Hi Bhargav,

    I did not do anything extra. AFAIK, the SVM training is performed in trainSVMClassifier() as part of the sample program. I did not make any changes regarding the Mat class. On the other hand, I noticed that some functions in the sample app such as setSVMParams() accept a C-style CvMat argument instead of a C++ cv::Mat. I am not sure if you are referring to this. That should show up at compile time though.
    It has been a while since I tried the sample app, so I might not be 100% correct :-)

  7. Hi, your article is very good but please tell me one thing, how to label the feature vector. recently i have extracted the feature vector but i am confused how to proceed further. i.e., 'Feature Vector' -- (+ label)

  8. It is very interesting. Could you share the code?

  9. 0 positive training samples; 3325 negative training samples
    OpenCV Error: Bad argument (There is only a single class) in cvPreprocessCategoricalResponses, file /home/vikiboy/Documents/opencv-2.4.9/modules/ml/src/inner_functions.cpp, line 729
    terminate called after throwing an instance of 'cv::Exception'
    what(): /home/vikiboy/Documents/opencv-2.4.9/modules/ml/src/inner_functions.cpp:729: error: (-5) There is only a single class in function cvPreprocessCategoricalResponses

    Aborted (core dumped)

    This is the error I am getting after the bag of words vector are loaded ...

  10. Great!
    I wanted to do this. But failed. Can you give me some hint to use Harris detector with SIFT descriptor?