Clustering methods: Part 10
Very large data sets
Pasi Fränti, 5.5.2014
Speech and Image Processing Unit, School of Computing, University of Eastern Finland

Methods for large data sets
- BIRCH
- CLARANS
- On-line EM
- Scalable EM
- GMG (studied here; there is no material for the other methods)

Gradual model generator (GMG) [Kärkkäinen & Fränti, 2007: Pattern Recognition]
The problem is split into two parts: model generation, and later processing of the model.
1. Gather points into a buffer.
2. Select a subset of the buffered points to generate a new component into the model.
3. Points that fit the model are used to update the model directly.
4. Repeat until all points have been used in either component generation or direct update.
Post-processing can be done without the original data.
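
The control flow of the steps above can be sketched as a small single-pass driver. This is only an illustration under stated assumptions, not the published algorithm: the function names (`gradual_model`, `fits`, `update`, `new_component`), the 1-D toy components, the fixed distance limit of 1.0, and the choice to flush leftover buffered points into new components at the end of the stream are all hypothetical.

```python
def gradual_model(stream, buffer_size, fits, update, new_component):
    """Single-pass driver: every point is used once and then discarded."""
    model, buffer = [], []

    def absorb(points):
        # Points that fit an existing component update it; the rest are kept.
        kept = []
        for p in points:
            target = next((c for c in model if fits(c, p)), None)
            if target is None:
                kept.append(p)
            else:
                update(target, p)
        return kept

    def grow(points):
        # Turn a chosen subset into a new component, then re-test the rest.
        component, rest = new_component(points)
        model.append(component)
        return absorb(rest)

    for point in stream:
        buffer.extend(absorb([point]))
        if len(buffer) >= buffer_size:
            buffer = grow(buffer)
    while buffer:  # assumption: leftover points also become components
        buffer = grow(buffer)
    return model


# Toy 1-D instantiation (hypothetical): a component is a count and a mean.
def fits(c, p):
    return abs(p - c["mean"]) <= 1.0

def update(c, p):
    c["n"] += 1
    c["mean"] += (p - c["mean"]) / c["n"]  # running mean, no stored points

def new_component(points):
    seed = points[0]  # placeholder for a real seed-selection rule
    used = [p for p in points if abs(p - seed) <= 1.0]
    rest = [p for p in points if abs(p - seed) > 1.0]
    return {"n": len(used), "mean": sum(used) / len(used)}, rest


model = gradual_model([0.0, 0.2, 10.0, 0.1, 10.3, 9.9, 0.3], 3,
                      fits, update, new_component)
```

Passing the three operations in as functions keeps the driver independent of how a component is represented; the toy versions here only serve to make the sketch runnable.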

Goal of the GMG algorithm
Generate a model of the data in a single pass, without storing the entire data set. EM makes several passes over the data, GMG only one, yet the resulting probability density distributions have quite similar form.
[Figure panels: EM, GMG]

Contours of probability density distributions
[Figure panels: EM, GMG]

Model update
- New data points are mapped to the model immediately on input.
- Points too far from every component of the model remain in the buffer.
- Buffered points are re-tested when new components are created.
[Figure panels: before update, after update]
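
A minimal sketch of this update rule follows. It assumes details the slide does not give: each component stores only a count, a mean and per-axis sums of squared deviations (updated with Welford's running formulas), "too far" means a variance-scaled distance above a hypothetical limit `max_dist`, and the names `Component`, `map_point` and `retest_buffer` are made up for illustration.

```python
import math

class Component:
    """One model component summarized by count, mean and per-axis spread."""

    def __init__(self, points):
        self.n = 0
        self.mean = [0.0] * len(points[0])
        self.m2 = [0.0] * len(points[0])  # sums of squared deviations
        for p in points:
            self.update(p)

    def update(self, p):
        # Welford's running update: earlier points need not be kept.
        self.n += 1
        for d, x in enumerate(p):
            delta = x - self.mean[d]
            self.mean[d] += delta / self.n
            self.m2[d] += delta * (x - self.mean[d])

    def distance(self, p, min_var=1e-6):
        # Distance scaled by the per-axis variance (an assumption).
        return math.sqrt(sum(
            (x - m) ** 2 / max(s / self.n, min_var)
            for x, m, s in zip(p, self.mean, self.m2)
        ))


def map_point(p, components, buffer, max_dist=3.0):
    """Update the nearest component if p fits it; otherwise buffer p."""
    nearest = min(components, key=lambda c: c.distance(p), default=None)
    if nearest is not None and nearest.distance(p) <= max_dist:
        nearest.update(p)
        return True
    buffer.append(p)
    return False


def retest_buffer(components, buffer, max_dist=3.0):
    """Re-test buffered points, e.g. after a new component was created."""
    pending = list(buffer)
    buffer.clear()
    for p in pending:
        map_point(p, components, buffer, max_dist)


components = [Component([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])]
buffer = []
near = map_point((0.5, 0.6), components, buffer)    # fits: updates the model
far = map_point((10.0, 10.0), components, buffer)   # too far: stays buffered
buffered_before = list(buffer)

# A new component appears near the buffered point, so it is re-tested.
components.append(
    Component([(9.8, 10.0), (10.2, 10.0), (10.0, 9.8), (10.0, 10.2)]))
retest_buffer(components, buffer)
```

After the re-test the buffered point has been absorbed by the new component, so the buffer is empty and the point itself can be discarded.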

Generating new components
- When the buffer is full, selected points are used to generate new components.
- The most compact k-neighborhood is selected as the seed for a new component:
  1. Find the k nearest neighbors of every point in the buffer.
  2. Pick the point whose k-neighborhood has the smallest maximum distance.
[Figure panels: data in buffer; selected points and a new component]
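
The seed selection can be illustrated with a brute-force search. This is a sketch assuming Euclidean distance and a small buffer; the function name `most_compact_neighborhood` and the sample buffer are hypothetical, and how the selected points are then turned into a component is not shown.

```python
import math

def most_compact_neighborhood(points, k):
    """Return (seed_index, neighbor_indices) of the most compact k-neighborhood.

    For every point, find its k nearest neighbors and record the distance to
    the farthest of them; the point with the smallest such maximum wins.
    """
    if len(points) <= k:
        raise ValueError("need more than k points in the buffer")
    best = None
    for i, p in enumerate(points):
        # Distances from p to every other buffered point, nearest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        knn = dists[:k]
        radius = knn[-1][0]  # largest distance within the k-neighborhood
        if best is None or radius < best[0]:
            best = (radius, i, [j for _, j in knn])
    _, seed, neighbors = best
    return seed, neighbors


buffer = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (9.0, 1.0), (3.0, 8.0)]
seed, neighbors = most_compact_neighborhood(buffer, k=2)
```

The search compares every pair of buffered points, which is quadratic in the buffer size; that is acceptable here only because the buffer is assumed to be small compared with the full data set.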

Example
- Data arrives from left to right.
- Red pluses are objects that the algorithm has seen but not yet used.
- Ellipses represent clusters.
- Gray points are objects that have been used and discarded by the algorithm.
- Objects are used to generate new clusters and to update existing ones.
- All objects will eventually be used one way or another.

Post-processing
[Figure panels: model before processing; updated model; updated model + data]
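
The slides do not say which operations the post-processing performs. As one illustration of how a model can be changed without access to the original data, the sketch below merges two components using only their stored statistics. The representation (count, mean, per-axis population variance), the moment-matching rule and the name `merge` are assumptions made for this example.

```python
def merge(a, b):
    """Moment-matched merge of two components given as (n, mean, var)."""
    n_a, mean_a, var_a = a
    n_b, mean_b, var_b = b
    n = n_a + n_b
    # Weighted mean of the two component means.
    mean = [(n_a * x + n_b * y) / n for x, y in zip(mean_a, mean_b)]
    # Pooled variance: within-component spread plus the offset of each
    # component mean from the merged mean.
    var = [
        (n_a * (va + (x - m) ** 2) + n_b * (vb + (y - m) ** 2)) / n
        for va, vb, x, y, m in zip(var_a, var_b, mean_a, mean_b, mean)
    ]
    return n, mean, var


merged = merge((4, [0.0, 0.0], [1.0, 1.0]), (4, [2.0, 0.0], [1.0, 1.0]))
# merged == (8, [1.0, 0.0], [2.0, 1.0])
```

The variance grows along the first axis because the two means differ there, while the second axis is unchanged; no individual data point is needed to compute either.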

Literature
- I. Kärkkäinen, P. Fränti, "Gradual model generator for single-pass clustering", Pattern Recognition 40(3) (2007) 784-795.
- P. Bradley, U. Fayyad, C. Reina, "Clustering Very Large Databases Using EM Mixture Models", Proc. of the 15th Int. Conf. on Pattern Recognition, vol. 2, 2000, pp. 76-80.
- R. Ng, J. Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining", IEEE Trans. Knowledge & Data Engineering 14(5) (2002) 1003-1016.
- M. Sato, S. Ishii, "On-line EM Algorithm for the Normalized Gaussian Network", Neural Computation 12(2) (2000) 407-432.
- T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications", Data Mining and Knowledge Discovery 1(2) (1997) 141-182.