The sample program only demonstrates how to use the latent SVM for classification. The paper describes the training part in detail. Although I don't understand all of it, here is a summary:
Latent SVM is a system built to recognize objects by matching both
1. the HOG models, which consist of the 'whole' object and a few of its 'parts', and 2. the positions of the parts. The learned positions of the object parts and the 'exact' position of the whole object are the Latent Variables. The 'exact' position is relative to the annotated bounding box in the input image. As an example, a human figure could be modeled by its outline shape (whole body, head to toe) together with its parts (head, upper body, left arm, right arm, left lower limb, right lower limb, feet).
The HOG descriptor for the whole body is the Root Filter and those for the body parts are the Part Filters.
The target function is the best response found by scanning a window over an image. The response combines the outputs of all the filters. The search for the best match is done over a multi-scale image pyramid. The classifier is trained iteratively using a coordinate-descent method, holding some components constant while training the others. The components are the Model Parameters (filter positions and sizes), weight coefficients and error constants. The iteration process is a bit complicated - so much to learn! One important thing to note is that the positive samples are composed by moving the parts around within an allowable distance. There is a set of latent variables for this (size of the movable region, center of all the movable regions, quadratic loss function coefficients). Being able to account for the 'movable' parts is what I think 'deformable' means.
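To make the 'movable parts' idea concrete, here is a minimal Python sketch of how I understand the score of one window: the root filter response plus, for each part, the best part response inside its movable region minus a quadratic penalty for moving away from the anchor. The function names, the dictionary keys, the coefficient layout (a, b, c, d) and the bias term are my own illustrative assumptions, not code from the paper or from OpenCV, and the filter responses are assumed to be computed already.

```python
def part_score(response, anchor, deform, max_shift):
    """Return (best_score, (dx, dy)) for one part.

    response  -- 2-D list; response[y][x] is the part filter output at (x, y)
    anchor    -- (ax, ay), the part's ideal position (assumed layout)
    deform    -- (a, b, c, d), hypothetical coefficients of the cost
                 a*dx + b*dy + c*dx*dx + d*dy*dy
    max_shift -- half-size of the movable region, in cells
    """
    ax, ay = anchor
    a, b, c, d = deform
    height, width = len(response), len(response[0])
    best = None
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            x, y = ax + dx, ay + dy
            if not (0 <= x < width and 0 <= y < height):
                continue  # displacement falls outside the response map
            cost = a * dx + b * dy + c * dx * dx + d * dy * dy
            score = response[y][x] - cost
            if best is None or score > best[0]:
                best = (score, (dx, dy))
    if best is None:
        raise ValueError("movable region lies entirely outside the response map")
    return best


def window_score(root_response, parts, bias=0.0):
    """Root response + bias + the best deformed score of every part."""
    total = root_response + bias
    placements = []
    for part in parts:
        score, shift = part_score(part["response"], part["anchor"],
                                  part["deform"], part["max_shift"])
        total += score
        placements.append(shift)
    return total, placements


# Toy example: the part filter fires most strongly one cell right of its anchor.
response = [[0, 0, 0],
            [0, 1, 5],
            [0, 0, 0]]
part = {"response": response, "anchor": (1, 1),
        "deform": (0, 0, 1, 1), "max_shift": 1}
total, placements = window_score(2.0, [part], bias=-1.0)
print(total, placements)
```

With a small penalty the part 'moves' to the stronger response; with a large one (say `(0, 0, 10, 10)`) it stays at its anchor. The chosen displacements play the role of the latent variables described above.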
Detection Code
The latent SVM detector code is located at OpenCV/modules/objdetect/. It seems to be self-contained and has all the code needed to build HOG pyramids.
The detection code extracts HOG descriptors from the input image and builds multi-scale pyramids. It then scans the models (root and parts) over the pyramids for good matches. Non-max suppression is used, I think, to remove nearby overlapping matches. A threshold is applied to the score from the SVM equation to determine the classification.
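The last two steps (score threshold, then non-max suppression) can be sketched in a few lines of Python. This is only my guess at the general technique, not the OpenCV implementation: the greedy strategy, the use of intersection-over-union as the overlap measure, the (x, y, width, height) box layout and the names `iou` and `filter_detections` are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def filter_detections(candidates, score_threshold, overlap_threshold=0.5):
    """Drop weak candidates, then greedily suppress overlapping ones.

    candidates -- list of (box, score) pairs
    Returns the surviving (box, score) pairs, strongest first.
    """
    strong = [c for c in candidates if c[1] >= score_threshold]
    kept = []
    for box, score in sorted(strong, key=lambda c: c[1], reverse=True):
        # Keep a box only if it does not overlap a stronger kept box too much.
        if all(iou(box, kept_box) <= overlap_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Toy example: two near-identical hits, one separate hit, one weak hit.
candidates = [
    ((10, 10, 100, 100), 0.9),
    ((15, 12, 100, 100), 0.7),   # overlaps the first box heavily
    ((300, 50, 80, 80), 0.4),
    ((500, 500, 50, 50), -0.3),  # below the score threshold
]
detections = filter_detections(candidates, score_threshold=0.0)
print(detections)
```

In this toy run the second box is suppressed by the first and the last one is rejected by the threshold, leaving two detections.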
Datasets
Some trained models in MATLAB file format (voc-release4.tgz and older) are available for download at the Latent SVM website (see Resources). But how does one convert the available MATLAB files (such as cat_final.mat) to the XML format used by the sample? There is a VOCWriteXML function in the VOC devkit (in MATLAB). I wonder if that could help. http://fwd4.me/wSG
Sample (latentsvmdetector.cpp)
- Load a pre-built model and detect the object from an input image.
- There does not seem to be a detector builder in OpenCV.
- Looking at cat.xml, the cat model contains 2 models. They are probably bilaterally symmetric models. Each model has 6 parts. The root filter sizes are 7x11 and 8x10.
- [cat.jpg] Took 61 seconds to finish. It was able to detect the cat, with two false positives at the top-right corner.
- [lena.jpg] Took 77 seconds. It detected Lena's beautiful face (including the purple feather hat and shoulder)! Two other detected objects: her hat and some area at the top-left corner of the picture.
- [tennis-cats.jpg] Took 44 seconds. It detected all 3 cats, although the middle and left cats, which are closer together, were treated as one.
- [295087.jpg from GrabCut collection] Took 50 seconds. Somehow it classified the tree and the rock landscape as a cat!
- [260058.jpg from GrabCut collection] Took 76.5 seconds. Detected two false objects: 1) an area of the desert sand (small pyramid at the top edge), 2) part of the sky with clouds near the edges.
- Without knowing how the model was trained, it is hard to tell the quality of this detector. http://tech.dir.groups.yahoo.com/group/OpenCV/message/75507 It is possible that the model was taken from the 'trained' classifier parameters in the releases from the paper's authors (voc*-release.tgz).
Resources
Latent SVM: http://people.cs.uchicago.edu/~pff/latent/
Readings
A Discriminatively Trained, Multiscale, Deformable Part Model, P. Felzenszwalb, et al.