Clustering methods: Part 10


1 Clustering methods: Part 10
Very large data sets
Pasi Fränti
Speech and Image Processing Unit, School of Computing, University of Eastern Finland

2 Methods for large data sets
BIRCH
CLARANS
On-line EM
Scalable EM
GMG: studied here (no material for the others)

3 Gradual model generator (GMG) [Kärkkäinen & Fränti, 2007: Pattern Recognition]
The problem is split into two parts: model generation, and later processing of the model.
Gather points into a buffer.
Select a subset of the buffered points to generate a new component into the model.
Points that fit the model are used to update the model directly.
Repeat until all points have been used in either component generation or direct update.
Post-processing can be done without the original data.
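The control flow of these steps can be sketched as follows. This is a minimal sketch for 1-D data, not the published algorithm: the `[count, mean, m2]` component layout, the `max_dev` fit test, the sorted-window seed selection and all parameter values are assumptions made for illustration. Points that cannot form a component at the end of the stream are simply returned.

```python
def gmg_single_pass(stream, buffer_size=6, k=3, max_dev=3.0):
    """One pass over a 1-D stream; returns (model, leftover buffer).

    Each component is a list [count, mean, m2] (hypothetical layout).
    """
    if buffer_size <= k:
        raise ValueError("buffer_size must exceed k")
    model, buffer = [], []

    def fits(c, x):
        # Assumed fit test: within max_dev standard deviations of the mean.
        std = max((c[2] / c[0]) ** 0.5, 1e-3)
        return abs(x - c[1]) <= max_dev * std

    def update(c, x):
        # Running mean/variance update; the point itself is not stored.
        c[0] += 1
        delta = x - c[1]
        c[1] += delta / c[0]
        c[2] += delta * (x - c[1])

    def absorb(x):
        # Direct update: refine the first component that x fits.
        for c in model:
            if fits(c, x):
                update(c, x)
                return True
        return False

    def new_component():
        # 1-D simplification: seed from the k+1 sorted values with least spread.
        buffer.sort()
        start = min(range(len(buffer) - k),
                    key=lambda i: buffer[i + k] - buffer[i])
        seed = [buffer.pop(start) for _ in range(k + 1)]
        c = [0, 0.0, 0.0]
        for x in seed:
            update(c, x)
        model.append(c)
        # Re-test the remaining buffered points against the enlarged model.
        buffer[:] = [x for x in buffer if not absorb(x)]

    for x in stream:
        if not absorb(x):
            buffer.append(x)
            if len(buffer) >= buffer_size:
                new_component()
    while len(buffer) > k:  # end of stream: use up what is left
        new_component()
    return model, buffer


stream = [0.0, 0.2, 100.0, 0.1, 100.3, 0.3, 100.1, 100.2, 0.15, 100.15]
model, leftover = gmg_single_pass(stream)
```

Only the components and the small buffer are kept in memory, so the returned model can be processed later without the original data.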

4 Goal of the GMG algorithm
Generate a model of the data in a single pass, without storing the entire data set.
EM makes several passes over the data, GMG only one; the resulting probability density distributions have quite similar form.
Figure panels: EM, GMG.

5 Contours of probability density distributions
Figure panels: EM, GMG.

6 Model update
New data points are mapped to the model immediately on input.
Points too far from every model remain in the buffer.
Buffered points are re-tested when new models are created.
Figure panels: Before update, After update.
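A possible shape for this update step is sketched below. The slides do not state the distance measure or the update formula, so the diagonal-covariance `Component` class, the variance-scaled squared distance, the `max_dist2` threshold and the helper `map_point` are all assumptions made for illustration.

```python
class Component:
    """Diagonal Gaussian kept as count, mean and summed squared deviations."""

    def __init__(self, points):
        dim = len(points[0])
        self.n = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim
        for p in points:
            self.update(p)

    def variance(self, floor=1e-6):
        # Per-dimension variance, floored so the distance stays finite.
        return [max(m / self.n, floor) for m in self.m2]

    def distance2(self, x):
        # Squared distance to the mean, scaled by per-dimension variance.
        return sum((xi - mi) ** 2 / vi
                   for xi, mi, vi in zip(x, self.mean, self.variance()))

    def update(self, x):
        # Welford-style running update: earlier points need not be stored.
        self.n += 1
        for d, xi in enumerate(x):
            delta = xi - self.mean[d]
            self.mean[d] += delta / self.n
            self.m2[d] += delta * (xi - self.mean[d])


def map_point(x, components, buffer, max_dist2=9.0):
    """Update the closest component if x fits it; otherwise buffer x."""
    if components:
        best = min(components, key=lambda c: c.distance2(x))
        if best.distance2(x) <= max_dist2:
            best.update(x)
            return True
    buffer.append(x)
    return False


comp = Component([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
buffer = []
near = map_point((0.5, 0.5), [comp], buffer)    # fits: comp is updated
far = map_point((10.0, 10.0), [comp], buffer)   # too far: stays in buffer
```

Re-testing buffered points after a new component is created amounts to calling `map_point` again on each buffered point with the enlarged component list.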

7 Generating new components
When the buffer is full, selected points are used to generate new components.
The most compact k-neighborhood is selected as the seed for a new component:
Find the k nearest neighbors of every point.
Pick the point with the smallest maximum distance to its neighbors.
Figure panels: Data in buffer; Selected points and a new component.
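The seed selection described here can be sketched with a brute-force neighbor search. The function name `most_compact_neighborhood`, the Euclidean distance and the example buffer are assumptions; the slides do not say how the neighbors are found or how ties are broken.

```python
import math


def most_compact_neighborhood(points, k):
    """Return (seed_index, neighbor_indices) of the point whose k nearest
    neighbors lie within the smallest radius."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    best = None
    for i, p in enumerate(points):
        # Distances from p to every other buffered point, nearest first.
        dists = sorted((math.dist(p, q), j)
                       for j, q in enumerate(points) if j != i)
        knn = dists[:k]
        radius = knn[-1][0]  # maximum distance within the k-neighborhood
        if best is None or radius < best[0]:
            best = (radius, i, [j for _, j in knn])
    _, seed, neighbors = best
    return seed, neighbors


buffered = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 1.0)]
seed, neighbors = most_compact_neighborhood(buffered, k=2)
```

The brute-force search compares every pair of buffered points, which is acceptable only because the buffer is small compared with the full data set.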

8 Example
Red pluses are objects that have not been used yet; they have been seen by the algorithm.
Ellipses represent clusters.
Gray points are objects that have been used and discarded by the algorithm.
Objects are used to generate new clusters and update existing ones.
Data arrives from left to right.
All objects will eventually be used one way or another.

9 Example

10 Example

11 Example

12 Example

13 Example

14 Post-processing
Figure panel: Model before processing

15 Post-processing
Figure panels: Model before processing; Updated model

16 Post-processing
Figure panels: Model before processing; Updated model + data

17 Literature
I. Kärkkäinen and P. Fränti, "Gradual model generator for single-pass clustering", Pattern Recognition, 40 (3), March 2007.
P. Bradley, U. Fayyad and C. Reina, "Clustering Very Large Databases Using EM Mixture Models", Proc. of the 15th Int. Conf. on Pattern Recognition, vol. 2, 2000.
R. Ng and J. Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining", IEEE Trans. Knowledge & Data Engineering, 14 (5), 2002.
M. Sato and S. Ishii, "On-line EM Algorithm for the Normalized Gaussian Network", Neural Computation, 12 (2), 2000.
T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications", Data Mining and Knowledge Discovery, 1 (2), 1997.

