1
Clustering methods: Part 10
Very large data sets
Pasi Fränti
Speech and Image Processing Unit
School of Computing
University of Eastern Finland
2
Methods for large data sets
BIRCH
CLARANS
On-line EM
Scalable EM
GMG (let’s study this one; no material for the others)
3
Gradual model generator (GMG) [Kärkkäinen & Fränti, 2007: Pattern Recognition]
The problem is split into two parts: model generation, and later processing of the model.
Gather points into a buffer.
Select a subset of the buffered points to generate a new component into the model.
Points that fit the model are used to update the model directly.
Repeat until all points have been used either in component generation or in a direct update.
Postprocessing can be done without the original data.
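The gather / generate / update loop above can be sketched roughly as follows. This is only an illustration, not the published algorithm: the spherical component (count, mean, summed squared deviation), the first-fit rule, and the parameters `buffer_size`, `k` and `threshold` are all assumptions made for the example.

```python
import math

def new_component(points):
    # Hypothetical component: count, mean, and summed squared distance to the mean.
    n, dim = len(points), len(points[0])
    mean = tuple(sum(p[d] for p in points) / n for d in range(dim))
    m2 = sum(math.dist(p, mean) ** 2 for p in points)
    return {"n": n, "mean": mean, "m2": m2}

def fits(p, c, threshold, min_var=1e-6):
    # Assumed fit test: within `threshold` standard deviations of the mean.
    var = max(c["m2"] / c["n"], min_var)
    return math.dist(p, c["mean"]) <= threshold * math.sqrt(var)

def absorb(p, c):
    # Incremental (Welford-style) update, so the point itself can be discarded.
    c["n"] += 1
    old = c["mean"]
    new = tuple(m + (x - m) / c["n"] for x, m in zip(p, old))
    c["m2"] += sum((x - mo) * (x - mn) for x, mo, mn in zip(p, old, new))
    c["mean"] = new

def try_update(p, components, threshold):
    # First-fit: update the first component that accepts the point.
    for c in components:
        if fits(p, c, threshold):
            absorb(p, c)
            return True
    return False

def generate(buffer, components, k, threshold):
    # Build one new component from the most compact k-neighborhood in the buffer.
    k = min(k, len(buffer) - 1)
    def radius(i):
        d = sorted(math.dist(buffer[i], q) for j, q in enumerate(buffer) if j != i)
        return d[k - 1] if k > 0 else 0.0
    seed = min(range(len(buffer)), key=radius)
    order = sorted(range(len(buffer)), key=lambda j: math.dist(buffer[seed], buffer[j]))
    chosen = set(order[:k + 1])  # the seed plus its k nearest neighbors
    components.append(new_component([buffer[j] for j in chosen]))
    rest = [p for j, p in enumerate(buffer) if j not in chosen]
    # Re-test the remaining buffered points against the enlarged model.
    buffer[:] = [p for p in rest if not try_update(p, components, threshold)]

def gmg_single_pass(stream, buffer_size=20, k=5, threshold=3.0):
    components, buffer = [], []
    for p in stream:  # each point is read exactly once
        if not try_update(p, components, threshold):
            buffer.append(p)
            if len(buffer) >= buffer_size:
                generate(buffer, components, k, threshold)
    while buffer:  # use up whatever is left when the stream ends
        generate(buffer, components, k, threshold)
    return components
```

Calling `gmg_single_pass(points)` on an iterable of coordinate tuples returns the list of components. Every input point ends up counted in exactly one component, and only the buffer is held in memory at any time.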
4
Goal of the GMG algorithm
Generate a model of the data in a single pass, without storing the entire data set.
EM makes several passes over the data, GMG only one, yet the resulting probability density distributions have quite similar form.
Figure panels: EM, GMG
5
Contours of probability density distributions
Figure panels: EM, GMG
6
Model update
New data points are mapped to the model immediately when they are input.
Points too far from every component remain in the buffer.
Buffered points are re-tested when new components are created.
Figure panels: Before update, After update
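A minimal sketch of this update step, under assumptions not stated on the slide: each component keeps a running mean and per-dimension variance (a diagonal simplification), "too far" is taken to mean a variance-normalized distance above a hypothetical `threshold`, and the names `Component` and `map_point` are invented for the example.

```python
import math

class Component:
    """Hypothetical model component: running mean and per-dimension variance."""

    def __init__(self, points):
        dim = len(points[0])
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # summed squared deviations per dimension
        for p in points:
            self.update(p)

    def update(self, p):
        # Welford's incremental update: no stored points are needed.
        self.n += 1
        for d, x in enumerate(p):
            delta = x - self.mean[d]
            self.mean[d] += delta / self.n
            self.m2[d] += delta * (x - self.mean[d])

    def distance(self, p, min_var=1e-6):
        # Distance to the mean, scaled by the variance in each dimension.
        total = 0.0
        for d, x in enumerate(p):
            var = max(self.m2[d] / self.n, min_var)
            total += (x - self.mean[d]) ** 2 / var
        return math.sqrt(total)

def map_point(p, components, buffer, threshold=3.0):
    """Update the nearest component if p is close enough, else buffer p."""
    if components:
        nearest = min(components, key=lambda c: c.distance(p))
        if nearest.distance(p) <= threshold:
            nearest.update(p)
            return True
    buffer.append(p)  # too far from every component: keep for later re-testing
    return False
```

A point that lands inside an existing component changes that component's statistics and can then be discarded. A distant point stays in the buffer until a later component is created.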
7
Generating new components
When the buffer is full, selected points are used to generate new components.
The most compact k-neighborhood is selected as the seed for a new component:
Find the k nearest neighbors of every point in the buffer.
Pick the point whose neighborhood has the smallest maximum distance.
Figure panels: Data in buffer, Selected points and a new component
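The seed selection can be illustrated with a brute-force search. This is a sketch assuming Euclidean distance and a buffer small enough for an all-pairs comparison; the function name `most_compact_neighborhood` is hypothetical.

```python
import math

def most_compact_neighborhood(points, k):
    """Return (seed_index, neighbor_indices) of the most compact k-neighborhood.

    Compactness is measured as the distance from a point to its k-th nearest
    neighbor, i.e. the maximum distance within its k-neighborhood.
    Requires k < len(points).
    """
    best = None
    for i, p in enumerate(points):
        # Distances from p to every other buffered point, nearest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        radius = nearest[-1][0]  # largest distance inside the neighborhood
        if best is None or radius < best[0]:
            best = (radius, i, [j for _, j in nearest])
    _, seed, neighbors = best
    return seed, neighbors

buffer = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 1.0)]
seed, neighbors = most_compact_neighborhood(buffer, k=2)
```

In this small buffer the three points near the origin form the tightest group, so the seed is the first point and its neighbors are the second and third. The search costs O(n² log n) per call, which is tolerable only because the buffer is small compared with the full data set.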
8
Example
Red pluses are objects that have been seen by the algorithm but not used yet.
Ellipses represent clusters.
Gray points are objects that have been used and discarded by the algorithm.
Objects are used to generate new clusters and to update existing ones.
Data arrives from left to right.
All objects will eventually be used one way or another.
9
Example
10
Example
11
Example
12
Example
13
Example
14
Post-processing
Figure panel: Model before processing
15
Post-processing
Figure panels: Model before processing, Updated model
16
Post-processing
Figure panels: Model before processing, Updated model + data
17
Literature
I. Kärkkäinen and P. Fränti, "Gradual model generator for single-pass clustering", Pattern Recognition, 40 (3), March 2007.
P. Bradley, U. Fayyad, C. Reina, "Clustering Very Large Databases Using EM Mixture Models", Proc. of the 15th Int. Conf. on Pattern Recognition, vol. 2, 2000.
R. Ng, J. Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining", IEEE Trans. Knowledge & Data Engineering, 14 (5), 2002.
M. Sato, S. Ishii, "On-line EM Algorithm for the Normalized Gaussian Network", Neural Computation, 12 (2), 2000.
T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications", Data Mining and Knowledge Discovery, 1 (2), 1997.