OpenCV Adventure: January 2011

Monday, January 31, 2011

cv::DescriptorMatcher

There are currently two implementations of DescriptorMatcher. One is 'brute-force' and the other one is 'flann'.

Earlier posts on FLANN here and here.

Following its name, the DescriptorMatcher focuses on matching the descriptors provided by the caller. The train() member will only be relevant FLANN matcher. It when the index trees are built.

One-to-One image registration Sample Program ( matcher_simple ):

SURFDescriptorExtractor is used to build the SURF descriptors from input key-points. Interestingly, the default parameters defining the scaling properties are different for SURF descriptor-extractor and SURF feature-detectors. Always thought that the octaves are used for building filter pyramids (LoG) during key-point detection. Does it actually get used in building descriptors?!

Using the provided image pair: box.png and box_in_scene.png

Turns out that filtering some correspondences is necessary. By default it generates tons of matches. Could focus on strong matches by first doing 2-NN search (K=2). Eliminate those matches whose NN are too similar. This idea is implemented in the find_obj C sample.
Use overloaded function DrawMatches() that supports K matches for each feature point with vector<vector<DMatch>>. DescriptorMatcher::knnMatch() provided exactly the result type. Only put a line between 'strong correspondences' by setting up a mask-of-matches to DrawMatches().

One-to-Many image registration Sample Program (matching_to_many_images):

The matching pipeline: FeatureDectector, DescriptorExtractor, MatcherDescriptor.
All 3 abstract classes supplies static member "::create( name )" to instantiating a specific implementation class.
Simply pass in a vector of Mat to supply all the (training) images to match from.

Results/Observations (using the provided train set (1,2,3.png) and query image(query.png)
Some points along the white table edges are extracted. They often get matched (false positives).

Results/Observations (using the set of Ground Truth images from Dundee University)

There are 4 query images and 2 sets of training images of 10 each.
There is a provided ground-truth file for verification.
It's here: http://www.computing.dundee.ac.uk/staff/jessehoey/teaching/vision/project1.html
With 10126 train-descriptors (over 10 training images), and 1148 matches, FLANN takes about 1 second to build tree(0.6) and match(0.4). BruteForce-L1 takes 44 seconds. The almost matching time taken by FLANN is so short that the upstream processes now becomes the bottleneck.
Overall quality of matching of 'book1' is not good. Plenty of false matches.
Ball matching is better, it has more distinctive lines.
The mobile phone keypad is matching away a lot of label characters on the juice-box.
The lack of distinctive features on 'kit' is making it very difficult to match. Switched to MSER detector does not help. Only tried with default parameters.

Friday, January 28, 2011

FLANN

OpenCV incorporates the FLANN library in cv::flann namespace.

FLANN provides a library of feature matching methods. It can provide automatic selection of index tree and parameter based on the user's optimization preference on a particular data-set. They are chosen according to the user's preferences on the importance between between build time, search time and memory footprint of the index trees. The type of index trees are KD, randomized KD and Hierachical K-Means.

FLANN uses the Hierachical K-means Tree for generic feature matching. Nearest neighbors are discovered by choosing to examine the branch-not-taken nodes along the way. FLANN uses a priority-queue (Best-Bin-First) to do ANN from Hierachical K-Means Tree.
See background here:OpenCV Adventure: Algorithms used in FLANN

Sample in C (find_obj)

The program attempts to locate an object from an image. Looking at the supplied pair (box, box_in_scene), the box is actually partly obscured. Besides using FLANN from OpenCV, the program also implements a primitive(naive) matching method. Using the output from either matching method, a perspective transform matrix will by obtained by cvHomography() with the corresponding key-points. The 4 corners from the source object will be mapped to the corresponding location on the image. A box will be drawn with the new corners. At the end, the straight-lines will be drawn between the corresponding key-point pairs.
The feature points are detected with SURF detectors and represented in SURF-128 descriptors. The program by default set up 4 Randomized KDTrees and search for 2-nearest-neighbors. The matched key-point is only counted when the second of the nearest-neighbors is doubly farther than the first.
The 'naive' NN search simply do the search linearly of each object key-point (~593) from the image key-points (782). It does however, keep track of the nearest 2 neighbors and made sure the nearest is indeed dominant before being accepted. The 'naive' distance is simply a sum of squared- differences between the SURF descriptor tuples.
Tried using 'auto-tuned' method. The choice is Linear! Perhaps the feature point are too few?
Tricky namespaces: ::cv:flann and ::cvflann
Perspective transform equation: see eqn(2.21) from the book by Szeliski 2010.

Code

http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
There is link to both User Manual of FLANN library and Academic Paper behind it.

Resource

Computer Vision: Algorithms and Applications, Szeliski 2010
OCR sample with KNN http://blog.damiles.com/2008/11/basic-ocr-in-opencv/
Another OCR sample with KNN http://www.aishack.in/2010/10/k-nearest-neighbors-in-opencv/

Thursday, January 27, 2011

Algorithms used in FLANN

Data-Structures

K-D Tree
K-D stands for K-Dimension. It is a type of binary tree for multi-dimension vectors. It is balanced when it is built by splitting the nodes at the median values. Choose the dimension that could divide the samples in half (largest variance) at each level. The tree is built with a training set of feature vectors. It is used to find query specific value or value ranges, nearest neighbors.
Randomized K-D Tree
Improve the approximation of nearest neighbors from the query point by searching simultaneously across a number of randomized trees. The tree are built from the same set of samples. The author argues that in practice the variances among dimensions do not vary much. Randomly picking from the highest few will be enough to make the different trees. (RKD) It could be further improved by performing PCA. Re-align the data to the highest principle components. (PKD)
Hierarchical K-Means Tree
It is a tree of which every inner node is split in K-ways. K-means clustering is used to classify the data subset at each node. A L level tree would have approximately K^L leaf nodes. They are all at level L.
The papers describe how to apply text-retrieval methodologies to object-retrieval from video. The feature descriptors are 'visual vocabulary'. The descriptors are clustered (quantized). Uses hierarchical tree to improve efficiency. A inverted-book (index) is built (offline) to keep track of where the vocabulary appears (on frame(s) of a video).

ANN Search

On K-D Tree
Start backtracking after reaching the terminal node from the initial depth first search. Examine, update-closest node list or eliminate neighboring sub-trees by checking the 'distances' from the query point. For approximation, only a fixed number of leaf nodes will be looked at during back-tracking. A priority queue of nodes to be examined is maintained. High priority is given to the closer 'cells'.
On RKD Tree
The priority queue will be shared across all trees. In other words, the next cell to examine will be based on the closest cell-to-query-point distance among ALL trees.
On Hierarchical K-Means Tree
The "Vocabulary Tree" paper presents methods in looking up image from an image database. The primary object in the image will have its features looked up from the tree. Assuming that similar images has the similar set of feature points, the path on which the features arrives at the leaf nodes should be 'close-enough'. Each leaf node (vocabulary) is characterized the list of inner nodes it passed through. It is a vector of weighted nodes. The weighting method is based on ID-ITF used by text retrieval. Simply put, a 'similarity' will rank highly if it happen frequently in this frame and the feature does not occur often in the video. The candidate images from database could be looked up from the inverted files. The images are ranked according to a score based on ID-ITF. I suppose each candidate image will have a score that is based on the differences from the query-image of how the feature traverse the tree. Other techniques are used to filter for more precise results: stop-lists, spatial-consistency.

Other note

The Google Video paper uses both MSER to detect blobs and shape detectors to find corners. They complement each other. Uses Mahalanobis distance to measure distance between SIFT descriptors.

Readings

[ FLANN ] http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
There is link to both User Manual of FLANN library and Academic Paper behind it.
[ K-D Tree ] Multidimensional Binary Search Trees Used for Associative Searching, Jon Louis Bentley
[ NN search on K-D Tree ] AN ALGORITHM FOR FINDING BEST MATCHES IN LOGARITHMIC EXPECTED TIME, Friedman et al.
[ ANN search on K-D Tree ] Shape Indexing Using Approximate Nearest-Neighbour Search in High-Dimensional Spaces, Beis and Lowe
[ Randomized K-D Trees ] Optimised KD-trees for fast image descriptor matching, Silpa-Anan and Hartley
[ Hierarchical K-means Trees ] Scalable Recognition with a Vocabulary Tree, Nister and Stewenius
[ K-means clustering on Video Vocabulary ] Video Google: A Text Retrieval Approach to Object Matching in Videos, Sivic and Zisserman.

Monday, January 24, 2011

Star Feature Detector

Star Feature Detector is derived from CenSurE (Center Surrounded Extrema) detector. While CenSurE uses polygons such as Square, Hexagon and Octagons as a more computable alternative to circle. As far as I could tell, Star mimics the circle with 2 overlapping squares: 1 upright and 1 45-degree rotated. These polygons are bi-level. The can be seen as polygons with thick borders. The borders and the enclosed area have weights of opposing signs.

* It works in gray scale only *

Detection

Build Integral Image - Square and/or Slanted-variants (trapezoids). The slanted ones are used to compute polygon shaped filters.
Approximate LoG (Laplacian of Gaussian) with choice of the bi-level polygons. The outer strip and inner strip has opposing weights. It is supposed to produce zero DC value. Meaning the ratio of the weights should somehow be related to the area under the 2 levels.
Let m, n be parameters that defines of the inner region and outer boundary (thickness, sort-of).
A set of filters will be defined, based on the pairs of (m, n) values. (By experiment?)
Repeat the convolution on the same input image with each of the scaled filter. There will be no-subsampling, therefore no need to localize points from higher-scales.
Non Maxima Suppression

Feature corners are filter-response maximas in a 3x3x3 neighbors.
Eliminate weak feature points - with values below a filter response threshold.

Edge Suppression

Remove unstable points on edges by examining each feature corner with Harris measure under a 9x9 window. Eliminate points having the high ratio of the two highest Principle Curvatures, above a predefined threshold.

Descriptor

The CenSurE paper suggests to use improved Upright SURF: MU-SURF. That reduces the 'boundary effect' by making oversized masks when calculating Haars wavelet responses.
It also weighted the wavelet responses twice with Gaussian.
Shares the characteristics of SURF descriptors in terms of quick matching with +/- signs.

Sample (detector_simple)

Parameters (by observation)

maxSize - choose the number of filters to be applied, the parameter value set the maximum size (last filter). Refer to icvStarDetectorComputeResponses():sizes0[] array.
responseThreshold - eliminate weak corners.
lineThresholds: maximum ratio between some factors that made up the Harris messure

lineThresholdProjected (first test) - Harris of responses
lineThresholdBinarized (second test) - Harris of sizes (refer to size1[]). Seems like size1 is simply a copy of size0 up to the max scale currently used.
Has to satisfy both tests to avoid elimination.

suppressNonMaxSize - window size (n-by-n) to apply the non-maximal suppression.

Observations

Detect fewer points on human faces with default parameters comparing to SURF/SIFT. Fewer on the hair. (line-suppression?)
maxSize @ default value (45) detects the most points on Kobe's Picture. Not apply to all images. But the values towards both ends detects much fewer points.
responseThreshold - when threshold set to relatively high, only circular objects or good corners like 30-45 degrees (such as necklace pendants, polo-shirt collars) remains.
suppressNonMaxSize - increasing the window size remove feature points that are close to each other.
increase/decrease the lineThreshold pairs with the same ratio, seeing fewer/more points along the hairlines.

Performance

572 points detected in 1 second (Obama portrait in original size)

Reading

CenSurE: Center Surround Extremas for Realtime Feature Detection and Matching. Agrawal, Konolige and Blas.
Computer Vision and Graphics: Second International Conference, Iccvg ..., Part 2 edited by Leonard Bolc, Ryszard Tadeusiewicz, Leszek J. Chmielewski, Konrad Wojciechowski

Friday, January 21, 2011

MSER, MSCR detectors

Basic idea - could be overly simplified
A method to detect blobs (Distinguished Regions) by grouping nearby pixels of similar intensity (gray) or color. They try to Maximize the Stable Regions. The region is stable when the area increase very slowly across changes in (color)intensities.

MSER is like Watershed, with focus on finding stable connected-components with the largest possible size. It does so by applying a threshold value from 0 to maximum, one step at a time. Similar to flooding the basins in Watershed-speak. Smaller regions would merge to become larger regions if area remains 'stable'. The region becomes a candidate when the area growth rate reaches a local stationary point.

Rate-of-change = Delta-area/Delta-threshold.

That's likely where the edge (blob boundary) lies.

MSCR is an color extension of MSER.

MSCR is similar to MSER. It detects blobs by aggregating clusters of pixels. Clusters are formed by grouping neighborhood pixels having similar color. Comparing to MSER, this 'similarity' is modeled by Chi-Squared Distribution. The 'expected difference in color' of neighboring pixels varies with intensity. That means, the 'distance's are non-uniform across evolutions (iterations). This varying distance at each evolution (iteration) maintains a constant area growth if the intensity increase linearly. Again, the goal is the find the stationary points of area-increase-rate.

Invariance

Achieve by using Hu invariant Moments as descriptor?

Code

Contributed by Liu Liu - support both single-channel and multi-channel image.
OpenCV implements "Linear Time MSER" for grayscale input image and MSCR for 3-channel images. According the abstract of the paper "Linear Time Maximum Stable Extrema Regions", Nister et al proposed a more efficient method in computing the DR than the union-find method.
See MSER section of VLFeat website (resources) for description of the parameters (delta, minArea, maxArea, maxVariation and minDiversity).
Parameters for MSCR (color)

Minimum 'color-difference' (minMargin) for a region - to be adjusted according to degree of edge-smoothing (edgeBlurSize).
The blobs are initially returned as contours. Approximated to ellipses. Only the major orientation and center point of the ellipse is saved to KeyPoint. Information about the shape is lost.

Sample 1 (detector_simple)

Comparing the color versus grayscale of the same picture (ABC Test-Pattern.jpg), using default parameters.

Total KeyPoints Detected on Gray is fewer than Color ( 344 versus 373 ).
Both taken less than 1 second.
Gray image has a more complete coverage of Key points on gray tiles pattern.
Color image has a more complete coverage of Key points on horizontal strip of black tiles.

MSCR is able to identify all 6 color blocks (Yellow, ..., Blue) while MSER sees leftmost 4 as a single one.
MSCR is able to identify 5 out of 6 gray scale boxes (above UniSA TV) while MSER identify only one.

MSER (falsely?) identify regions at the busy (high-frequency) side of the alternating black-and-white vertical strips.

Performance on large image (Obama Official Portrait), default parameters:
Color - 3312 Keypoints in 12 secs
Gray - 436 Keypoints in 2 secs

Parameters

Most of the parameters are being used in checking whether the current regions are 'stable'.

With Gray Image of "ABC Test-Pattern.jpg":

Increase delta to 50 - no more blobs on high-frequency region of B&W vertical strips.
Decrease the maxVariations to 0.005 - no more blobs on the high-frequency region of B&W vertical strips. Gray Tiles remains detected. They have clear cut high-contrast and narrow boundaries
More overlapping blobs appear after decreasing minDiversity to 0.05.
Increase minDiversity to 1 or above - zero blobs detected.
Unable to detect the all the separate blocks forming a horizontal gradient across the circle by changing parameters. It could be related to the surrounding textures/geometries because the same program is able to detect those blocks if they are cut out as a standalone image.

With original color image of "ABC Test-Pattern.jpg":

maxEvolution divided the Chi table. Fewer evolutions results in larger steps in 'distances' to cover the whole color intensity range, and vice versa.
More evolutions increases the number of detected blobs, as seen on more complete coverage of the gray tiles. Unfortunately, this also brings about more overlapping key-points.
minDiversity also works for color image.
Increase edgeBlurSize naturally remove the blobs from high-frequency area; Disable edgeBlurSize (0) also detect more some gray tiles as blobs.
minArea and maxArea also applies to color input
Setting minArea to 0 could cause run-time assertion, presumably it is due to the minimum pixels required in Contour functions.

Sample 2 (mser_sample.cpp) [ Added Jan-22 ]

Using C API - able to draw the Contours for more complete evaluation of results.
Convert input image to Y-Cb-Cr color-space for MSER (in fact MSCR) by default.
With puzzles.png, the MCSR detects 100 more contours (164 versus 60) but quite a few are probably owing to the uneven lighting on the big board. It is not trivial to get rid of without visible side-effects, like sacrificing the long board edges. Best parameter I tried is to increase the blurring to aperture size of 11.
The MSER is sensitive to shadows posted from the solids.

Reading

RobustWide Baseline Stereo from Maximally Stable Extremal Regions, by Matas et al.
Maximally Stable Colour Regions for Recognition and Matching, by Forseen
Linear Time Maximally Stable Extremal Regions, by Nister, Stewenius

Implementations

VLFeat http://www.vlfeat.org/overview/mser.html

Thursday, January 20, 2011

FAST corner detector

Segment-Test + Decision Tree

According to the paper, real-time video tracking (augmented reality) is the primary goal of devising FAST.
The method involves training a decision tree with attribute vector of 16 elements. Each one is a point around a circle. The center of the circle is the candidate feature-point (corner). The candidate is classified as a 'corner' or 'non-corner' with the resulting tree.

Segment-Test: Put a Bresenham circle, of 3-pixel-radius (r=3), around our study point. There will be 16 perimeter pixels. The study point is a corner if there are 12 (N=12) consecutive pixels from the perimeter that are consistently brighter than the center (study pixel) by a threshold 't'. This case is dark-in-center. It would also count if those N pixels are darker than the center (bright-in-center).
Classify all the training images by finding out all the corner points using Segment-Test.
Build the decision tree in two phases

Choose an attribute, let's say pixel-5 from the perimeter. Classify the current center as either 'Brighter', 'Similar' or 'Darker' with respect to this point. Repeat this for all pixels as center. At the end of this round, the pixels will be divided into 3 groups. Brighter, Similar or Darker.
The choice of pixel-5 is based on that fact that it would reduce the most entropy (highest information gain). That would like to introduce the most uneven split of classifications going into the next level. This is ID3 Decision Tree. Each node is split 3-ways. There are 2 classifications (corner - yes , no).
Repeat the above with other perimeter pixels to grow the decision tree until all the leave nodes have 100% yes-no corner (entropy of 0).

Apply non-maxima suppression to detected corners to eliminate overlapping features. Compare absolute sum of the intensity-difference from the circle to find the the local maxima.
The Rosten paper suggests that it takes on average 2.2 questions (tree-level?) to classified a point.

It is able to generalize the 'corner' with this supervised learning method.

Code

Probably need a DetectorParam struct to group together separate 'threshold' and 'nonMaxSuppression' argument.

Sample

The FAST implementation in OpenCV performs a 9-point segment test. It is not a learned method. Meaning that it does not build a decision tree. It uses Segment Test only.
It's very fast even with the large Obama picture, taking less than 1 second to detect 5000+ corners.
Using default threshold (intensity-difference), it detects 4 times as much corners as SURF.
Higher threshold, fewer detected points and vice-versa.

Readings

Machine learning for high-speed corner detection, Edward Rosten and Tom Drummond
(see resources)

Resources

Edward Rosten Page: http://mi.eng.cam.ac.uk/~er258/work/fast.html.
A few iPhone projects is using FAST.
Edward has submitted a newer and FAST-er method.
On ID3 Decision Tree http://www.cse.unsw.edu.au/~billw/cs9414/notes/ml/06prop/id3/id3.html

Wednesday, January 19, 2011

SURF Detector

FAST-Hessian Detector + SURF Descriptor

Key Point Detection

Build Integral Image
Uses Pyramid of Filter (not image) to approximate Laplace of Gaussian (LoG), supposedly run faster than SIFT, which uses DoG for approximation.
Filters will span several octaves and with fixed number of scales in each, similar to SIFT. This type of process is Scale-Space Analysis.
Integral image helps to keep the running speed constant as it is insensitive to increasing filter sizes.
Uses discrete Box Filter to calculate Hessian determinant. Box size in multiples of its current scale.
All filters run on the the original image instead of iterative like SIFT, allowing parallel execution.
Feature points are maxima of the determinants in the adjacent scale and points (3x3x3), similar to SIFT.

Orientation Assignment

The responses from 2 first-order Haar wavelet filters (1, -1), in dx and dy orientations, are collected on each feature point. The responses are put on a 2D plane as vectors [dx, dy].
Again, the Integral Image would help with this box filter.
An orientation-window of 60-degree-angle is slided around the origin at x-y plane. Each angle will have a corresponding sum of magnitudes for every vectors inside that window.
The dominant orientation would be the angle of the largest sum.

Descriptor Extraction

The descriptor is 64-element vector, half the size of SIFT. Each represents the intensity property of a 4x4 square within the "interest region". The intensity property is 4-element tuple [ Sum(dx), Sum(abs(dx)), Sum(dy), Sum(abs(dy)) ]. The Sum is calculated from the Haar wavelet response of a 5x5 sample window.
4x4x4 = 64
Implying that there are more than 1 4x4 square at each key-point location. I suppose the size of the region is related to the associated scale.

Matching

The sign (+ive/-ive) of the Trace of Laplacian (Hessian Matrix) is used in matching phase, speeding up the process.

Code

It is credited to Liu Liu with modifications by Ian Mahon. The authors state the current gotchas and gives a good overview of the code at the file (surf.cpp) header.
The code did try to exploit parallelism using OpenMP (cv::parallel_for).
Define some parameter structure and default values similar SIFT in the future (start with cvSURFParams?).
The 'extended' parameter changes the descriptor size from 64 to 128. It is default to '1' in the code, but it has default argument value of 'false' at the SURF::SURF(...) constructor.
Why not integrate OpenSURF by Chris Evans? (With hyperlink to CMU CV site with Image Databases and other CV resources).

Sample

With default parameters, SIFT takes 3-4 seconds and detect 872 points on Kobe Bryants picture. SURF takes less than 1 second to detect 413 points.
SURF is able to handle the Obama picture at its original resolution, unlike SIFT. Takes 11 seconds to detect 2690 points.
Decrease/Increase the number of detected points by increasing/lowering the threshold argument. Only the points with Hessian response higher than thresholds are considered.

Commercial Application (academic spin-off)

Kooaba - visual search

Readings

Speeded-Up Robust Features (SURF). Herbert Bay et al.
CPSC 643 Presentation of SURF, Herbert Bay et al.
Non-Maxima Suppression Demo
Wavelets in Multiresolution Analysis by Tom Germano
A Tutorial on Wavelets and their Applications by Martin J. Mohlenkamp at the University of Colorado, Boulder
WAVELETS FOR KIDS - A Tutorial Introduction by B. Vidakovic, P. Mueller at Duke University.
Wavelets: A Tutorial by Zhuming Lam (Apr-2002)

Repeatability

The term 'repeatability' (seen on SIFT and SURF papers) is a measure of the ability to detect the same set of key-point from various viewpoints. For example, the rotation of Van Gogh image (Fig 3 of the SURF paper by Bay et al.).
Origin of this term: See "Evaluation of Interest Point Detectors" paper by Shmid et al

Tuesday, January 18, 2011

Edge Detection introductions

Came across some decent introductory materials on edge detection.

SIFT

Attempt to summarize the feature point detection method:

Locate key-points

Find candidate key-points with a Pyramid of DoG from several octaves (1 octave = rho x 2).
Calculate precise locations of key-points
Remove candidates on edges or low-contrast area
Assign orientation to each key-point. Create new key-points at the same location, scale which have similar and significant orientations.
Now each key-point has location, scale, orientation information.

Define descriptor for each key-point

Each sample point now has a gradient magnitude and orientation from previous steps.
For each key-point, make orientation histogram for its neighborhood 4x4 regions.
The histogram bins (r) width is 360 / r. If r = 8, width = 40.
Value of each bin is the sum of gradient magnitudes in that orientation, weighted inversely to the distance from the key-point.
The orientation is specified relative to the feature point orientation.
The n-by-n neighborhood could be 4x4 as used [ Lowe2004].
Number of elements for each descriptor is now r times n times n. 8 x 4 x 4 = 128.

** Works for single color channel images only **

Code

Part of VLFEAT implementation by Andrea Vedaldi: http://www.vlfeat.org/api/index.html
KeyPoint::size is a function of Sigma where Sigma corresponds to the scale of that key-point. Given DrawMatchesFlags::DRAW_RICH_KEYPOINTS flag, drawKeyPoints() will draw a circle centering at the feature point location. The KeyPoint::size is used to determine that diameter.

Sample

Fewer results (keypoints) by increasing the contrast threshold and decrease the edge threshold (difference between biggest 2 gradients).
Background aside, feature points mostly located around the eyes, eyebrows, where the hair-lines meet the face, hair (brunette), lips, teeth, cheek, nose-tip from looking straight.
Fewer points when the face turns a little to one side, with half of the face is shaded. In this case, the default thresholds give better results.
Works fine with Obama portrait. Although, the original sized picture (1916x2608) is causing memory failure at Sift::prepareBuffers().

Readings

Local Features Tutorial Nov 8 2004, F. Estrada & A. Jepson & D. Fleet
CS 664 Lecture #21 "SIFT, object recognition, dynamic programming", Cornell University
Distinctive Image Feature from Scale-Invariant Keypoints, David Lowe 2004
Implementing the Scale Invariant Feature Transform(SIFT) Method, YU MENG, Dr. B Tidderman

Monday, January 17, 2011

Corner Detection

GoodFeatureToTrack

Default use Shi-Tomasi 94 method ("Texturedness" section): Store per-pixel 2 eigenvalues of a 2x2 matrix. It's a corner if both values above some threshold.
Option to use Harris method: Store Corner and Edge Response value for each pixel. It is a corner if the value is a local maxima of the surroundings (8-connected).

* Only works for single channel images *

GFTT sample (my own)

Shi-Tomasi method return more keyPoints than Harris. Decrease the 'k' parameter increase the number of keyPoints detected, but still no match for Shi-Tomasi.
Increase the 'k' parameter to 0.3 detects no keyPoints, using stanley.png and person2.bmp.
Pass 'HARRIS' to FeatureDetector::create() to get Harris corner method
Perhaps make a generic feature param class that FeatureDetector::create() would accept.
Cannot detect any points on human face except teeth and eye-glasses, using person1.jpg and person2.bmp.

Readings

Good Features to Track, Shi, Tomasi
A combined corner and edge detector, Chris Harris & Mike Stephens (1988)

Friday, January 14, 2011

GrabCut

GrabCut is an iterative process that in cutting out foreground object after user specify the approximate region(rectangle) surrounding it. It uses K components of Gaussian Mixture Models to model the color of foreground and background pixels separately. The Graph-Cut technique is used to classify pixels as background / foreground. The factors affecting the classification is the neighborhood color gradients and fitness into existing GMM foreground/background models. The iteration begins with updating the 2K GMMs from the newly classified pixels (Learning), followed by detecting the foreground pixels with Graph Cut.

Sample

OpenCV supports 4 types of classifications: BG, FG, Likely-BG and Likely-FG in user-edit mode.
FG pixels are classified by GrabCut as Likely-FG, not FG (by experiment).
Require manual iteration to observe convergence.
It looks like the iterations are trying to make better GMMs (scissors.jpg).
person1.jpg takes 5 seconds on each iteration; after 6 iterations there is no much improvements.
Hard to get the narrow 'stem' classified as FG after putting a rectangle around the plant (besides the pot) from bush.jpg. Is it because the relatively small area to accummulate a significant weight in GMMs?

Readings

Berkeley Image Database (CalPhoto)

Downloaded Ground truth photos for grabcut sample. Could be a source for test images, well, mostly about nature (classified with various plants, animal species). Permission must be granted from the owner even for storing it on a computer.

Wednesday, January 12, 2011

Delunay

Data Structures

CvSubDiv2Edge is a bit tricky.
CvSubDiv2DEdge: represents a index pair:
[31:2, 00b]: pointer address to a CvQuadEdge2D structure.
[1:0] : One of the 4 edges (delunay + reversed, veronoi + reversed).
CvQuadEdge2D: A CvSet 4 pairs of [ edge, associated-vertex ]; see cvSubdiv2DMakeEdge()
CvSubDiv2D: subdivision header and a graph of CvQuadEdge2Ds

Misc. Notes

Second argument to cvSubdiv2DRotateEdge() is relative, see how it is used to find the voronoi point in draw_subdiv_facet().
Some properties from OpenCV Book:
Each point on the convex hull is connected to at least 2 of the fictitious triangle.

Watershed, Inpainting and Mean-Shift Segmentation

Watershed

watershed() implements a marker-based method based on "Color Image Segmentation" by F. Meyer
Some basic idea of Watershed itself (grayscale, marker-based):
http://cmm.ensmp.fr/~beucher/wtshed.html
http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/KIM1/Seminars/Watershed/index.htm
http://en.wikipedia.org/wiki/Watershed_%28image_processing%29#Meyer.27s_flooding_algorithm
Flood (expand) the catchment-basin from each marker. Watersheds are boundaries at which flooding ends (local maxima). It will never spill over to another basin.
Implies putting markers at local minima (significant ones) for grayscale images.
Without markers: solve over-segmentation

Watershed Sample

The clever part (choosing the markers) is done by human.
Put markers at different color regions instead-of / in-additional-to local minima of intensity.
Single marker is meaningless.

Watershed Code

c_diff() macro hints that the color pixel comparison is based on max of the component differences.
4-connected neighbor flooding

Other Techniques: Mixing segment results with original(color) by adding them together with weights of 0.5 each.

Inpainting

Wikipedia: http://en.wikipedia.org/wiki/Inpainting
Usage:

Restore corrupted data or intentionally blur parts of the image (like faces on Google Street View?)
Remove caption from video/image.

Inpainting Code

Uses Fast Marching Method (FMM) for both Telea and NS options to inpaint points going inwards from boundary of the omega region.
Telea method inpaints by calculating the weighted sum of neighboring pixels with regard to direction, distance and level.

Inpainting Sample

(Observation) Telea introduces less artifacts when the omega region is big comparing to NS, using fruits.jpg as example.

Mean-shift Segmentation
Mean-shift Segmentation Sample

Spatial Radius slows down processing the most.
Increasing the Pyramid Level helps remove the white speckles (light reflections) off the orange (fruits.jpg). At the same time blurry strip appear at bottom and right edge. The strip thickness increases with pyramid levels.
Increasing color radius also helps remove the white speckles, without introducing the edge strips, at the expense of the boundaries between segments (fruits). Those are now more blurry.
Too big a spatial radius also distorts the segment boundaries.
Output looks similar to output from a median filter

Tuesday, January 11, 2011

Background Subtraction and Models

Looked at three background models:

Codebook

Learn and keep ranges of values for each pixel. They will be treated as background pixels. Clear stale pixels to remove moving foregrounds from the learning period.
Foreground objects are detected by subtracting pixels from the ranges (with some allowances).

Mixture of Gaussians

The intensity of each pixel is modeled by a few Gaussians. Background values are the most probable ones (highly weighted).
Model gets continuously updated (adaptive). No separate learning stage.
Foreground objects are detected by subtracting pixels from the most probable Gaussian.

Color Co-occurrence:

Learn and keep a table of color co-occurrences at each pixel location.
Input pixel is classified as foreground, background or moving background.
A reference background image is maintained to compute 'background difference' after binary thresholding.
Temporal difference from frame to frame will be updated to the co-occurrence tables.
Foreground/(Static, Moving) Background objects are classified using Bayes decision rule.

Running the Samples with street side video:

bgfg_codebook (CodeBook method)

Least CPU demanding.
Connected objects does not resemble original shape.
Good result in default parameters.

segment_objects (Mixture of Gaussians and connected-object refinement)

More CPU demanding ( 50% CPU on a dual core)
Lower background-ratio parameter gives a more solid shape of a moving car.
alpha (learning-rate parameter) is adjusted (decreasing) as time progressed.
NoiseSigma parameter seems to be used to set the max. variance of each Gaussian component.

bgfg_segm ( Co-occurrence and Mixture of Gaussians)

cvCreateGaussianBGModel() is basically using BackgroundSubtractorMOG as model. Detected object refinement left out. segment_objects did processing at the application level.
FGDModel is the color co-occurrence model.
Needs double the memory size of MoG. CPU intensive.
changeDetection() actually finds the best binary threshold on each color component by choosing one that would give the maximum variance. That's probably based on the reasoning that noise pixel color has small variance. Paul Rosin uses the term RelativeVariance in Thresholding for Change Detection . Here the code compare Standard Deviation. Wonder if simply comparing variance is enough, saving the final 'square root' operation.

Tested with more videos

Golden Gate Bridge: FGDModel shows no ship moving at all. The slowness is making the ship part of the moving background.
Sea side beach: FGDModel is able to absorb the cloud and wave (slow) movements and into background. MoG can't.
Single Moving Tree: FGDModel unable to put the center portion of the moving branches into background.
Relaxing Aquarium:

Default parameters of FGDModel takes a long time to adapt to the aquarium after fading-in
Slow moving fishes quickly becomes the background of FGDModel.
Increase 'alpha1' parameter of FGDModel makes the slow fishes appear as foreground. But at the same time the oscillating lights on coral turns up as foreground also.
Codebook seems to be able isolate the big fishes as foreground, eliminating the fluctuating lights on coral. At the same time the smaller fishes are filtered out. Making it looks like the isolation is due to keeping out small foreground objects all together.

Thursday, January 6, 2011

ConvexityDefects

Added a C++ wrapper for cvConvexityDefects() similar to how ConvexHull() does to cvConvexHull2().

Observations
The defects results makes sense when all the points are of the same contour. Otherwise, the depth_point would be found located at far end of the of hull, ignoring the closer ones.
Come to this conclusion by comparing 3 cases:

Random points (default example)
Picture of actual human palm in grayscale. Preprocessed by GaussianBlur and Canny Threshold, resulting in 61 contours. Simplified each contour with approxPolyDP(). Feed all points to ConvexHull() and ConvexityDefects()
Hand drawn palm in one continuous line, resulting in a single contour - ConvexityDefects() able to discover all the defects area in good accuracy.

Wednesday, January 5, 2011

Contours samples

Connected Component

Use of binary thresholding: Make dark objects the foreground (white) when threshold is less than 128 and vice versa.

Fit Ellipse (fitellipse)

The ellipse drawn by the second of the two consecutive calls to ellipse() mostly overlaps the first one.

Convex Hull - where is ConvexityDefects()?

ConvexHull() and isContourConvex() are now C++ wrapper function for cvConvexHull2() and cvCheckContourConvexity()respectively.
There is no C++ wrapper for cvConvexityDefects(). The function requires the hull points be indices to the contour points. This implicitly relates all the hull points to the corresponding contour points. Is this how 'defect' searching works?
In some cases the [sklansky82] algorithm have trouble finding convex hull from a polygon, how does that apply to general set of points?
http://cgm.cs.mcgill.ca/~athens/cs601/

findSquares

Basic Idea:
Approximate the contours with polygon for the straight edges. Criteria of being a 'square': polygon edge count is 4, area at least 1000 pixels, maximum angle between edges is close to 90.
Techniques

Use Law-of-Cosine to determine angle between two edges, with Cartesian origin at the edge joint.
The closer the angle is to 90 degree, the nearer the cosine value to 0.
Parameterize epsilon value to approxPolyDP() as a fraction of the contour length.
Reduce noise of source image with a sequence of downsampling-upsampling with Pyramid function.
Choose 0 as low-threshold for Canny to combine edges.
'Dilate' the post-Canny image to remove 'holes' between edges.
Discover as many squares in a picture by iterating all color planes, intensity threshold values and

Tuesday, January 4, 2011

Randomized Hough Transform

Picked this up while looking for fitellipse() from OpenCV discussion group:

http://tech.groups.yahoo.com/group/OpenCV/message/58635

In my current project I use cvFitEllipse to find ellipses. First, I use cvCanny to detect edges. From this egdePixels I randomly draw 6 Pixels and fit an ellipse. Afterwards, I confirm this ellipse by counting the number of edgePixels lying under this ellipse.
The main problem is to choose the right set of edgePixels to draw from, since you need to draw six Points on the Ellipse to have a Chance to find the right ellipse.
This is approach is calles "Randomized Hough Trafo" and works quite well.

This paper proposes this method?
http://www.cs.cuhk.hk/~lxu/papers/journal/xuoprl90.pdf

This article gives an overview of the advances of RHT.
http://www.cse.cuhk.edu.hk/~lxu/papers/RHT08a.pdf

cvAdaptiveskinDetector revisited

Exploring the purpose of histogramHueMotion in cvAdaptiveSkinDetector, noticing that the effect even with a weight value of 0.05 being accounted when being mergeWith() the skinHueHistogram. Tested with a video in 'Office' settings. The outputHueMask is able to cover the human faces more completely, and inevitably covers more areas of similar color tone, like the wall. Perhaps it's the surface of the human faces giving a range of hues that spreads the adaptive-ranges in findCurveThresholds(), and that in turns pick up more pixels as skin.