Thursday, February 17, 2011

K-means Clustering

Good description available from Learning OpenCV book.

Implementation

  • Support multiple attempts with a single call, each time starting with a different set of initial centers/labels. cv::kmeans() returns the result from the best one.
  • 'Best' results is the one with least error (best fit?). Error is the squared-sum of distances between each center and the points that are classified to which it belongs.. 
Sample (kmeans.cpp)

  • Generate 2-D data points, gaussian distributed around K center-points. Data and results visualized as an image.
  • Uses Kmeans++ center initialization (see Resources) by default.
  • Re-discover the K center-points with cv::kmeans()
  • Introduces a shuffle-function: randShuffle(), to randomly re-order the generated data points of a matrix. Allow setting the number of iterations to shuffle.

Kmeans++
The idea of kmeans++ is that Far-Apart centers makes good initialization. The first center is picked with random (uniform weights) from all data points. Of the second to the K-th center, we would pick from a weighted distribution. Weights are proportional to the distance of the candidate point from the last chosen center.

Observations

  • Picking from 3 'attempts' definitely improve results compared to '1', regardless of how the initial centers are chosen.
  • Error increases when two(or more) 'real' centers are close together such that part of their data points overlaps. In such cases, the centers could easily be misplaced. The function would predict one center for 2 nearby clusters. And one large cluster having two centers.
  • Kmeans++ shows significant advantage over pure random choice of initial points in cases where there are 5 center points and they are far apart (such as one of them near the corner of the range), provided that sample size is large enough.
  • In other cases, their results could be similar, it is not rare to see Kmeans++ actually results in higher error.

Resources
Kmeans++ center initialization:
http://seeing-with-computers.blogspot.com/2010/10/kmeans.html

Reading

  • Learning OpenCV, Gary Bradski & Adrian Kaehler (O'Reilly Press)
  • Introduction to Machine Learning, Ethern Alpaydin(MIT Press)

No comments:

Post a Comment