Clustering methods: Part 10

Very large data sets
Pasi Fränti
Speech and Image Processing Unit, School of Computing, University of Eastern Finland

Methods for large data sets
BIRCH
CLARANS
On-line EM
Scalable EM
GMG (the only method studied here; there is no material for the others)

Gradual model generator (GMG) [Kärkkäinen & Fränti, 2007: Pattern Recognition]
The problem is split into two parts: model generation and later processing of the model.
Gather points into a buffer.
Select a subset of the buffered points to generate a new component into the model.
Points that fit the model are used to update the model directly.
Repeat until all points have been used in either component generation or direct update.
Postprocessing can be done without the original data.
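The loop described above can be sketched roughly as follows. This is only a skeleton under stated assumptions, not the implementation from the paper: `single_pass`, `fits`, `update`, `new_component` and `buffer_size` are hypothetical names, and the 1-D toy rules (a point fits if it lies within 1.0 of a component mean) are chosen purely for illustration.

```python
def single_pass(stream, buffer_size, fits, update, new_component):
    """Skeleton of the buffer / generate / update loop.

    fits(model, p)        -> index of a component that p fits, or None
    update(model, i, p)   -> folds p into component i
    new_component(buffer) -> (component, used_points), using >= 1 point
    All three callables are hypothetical placeholders.
    """
    model, buffer = [], []

    def retest():
        # Buffered points that fit some component now update it directly.
        for p in buffer[:]:
            i = fits(model, p)
            if i is not None:
                update(model, i, p)
                buffer.remove(p)

    def generate():
        component, used = new_component(buffer)
        model.append(component)
        for p in used:
            buffer.remove(p)
        retest()

    for p in stream:
        i = fits(model, p)
        if i is not None:
            update(model, i, p)      # direct update, point is discarded
        else:
            buffer.append(p)
            if len(buffer) >= buffer_size:
                generate()
    while buffer:                    # flush so that every point is used
        generate()
    return model


# Toy 1-D instance (assumption): a component is [count, sum].
def fits(model, x):
    for i, (n, s) in enumerate(model):
        if abs(x - s / n) <= 1.0:
            return i
    return None

def update(model, i, x):
    model[i][0] += 1
    model[i][1] += x

def new_component(buffer):
    seed = buffer[0]
    used = [x for x in buffer if abs(x - seed) <= 1.0]
    return [len(used), sum(used)], used

model = single_pass([0.0, 0.2, 10.0, 0.1, 10.3, 9.9], 3,
                    fits, update, new_component)
counts = [c[0] for c in model]
```

Only the model and the small buffer are kept in memory; each point is discarded once it has contributed to a component.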

Goal of the GMG algorithm
Generate a model of the data in a single pass, without storing the entire data set.
EM makes several passes over the data; GMG makes only one.

Contours of probability density distributions
[Figure: contours of the probability density distributions, panels EM and GMG.]
The probability density distributions have a quite similar form.

Model update
New data points are mapped to the model immediately when they are input.
Points that are too far from every component of the model remain in the buffer.
Buffered points are re-tested when new components are created.
[Figure: model before update and after update.]
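A minimal sketch of the update step, under several assumptions that are not in the slides: each component is a Gaussian with diagonal covariance, "too far" means a variance-scaled distance above a fixed `threshold`, and the running statistics use Welford's update. The names `Component`, `process_point` and `retest_buffer` are hypothetical.

```python
import math

class Component:
    """One Gaussian component with diagonal covariance (an assumption)."""
    def __init__(self, points):
        self.n = 0
        self.mean = [0.0] * len(points[0])
        self.m2 = [0.0] * len(points[0])   # sums of squared deviations
        for p in points:
            self.update(p)

    def update(self, p):
        # Welford's running update; the point need not be stored afterwards.
        self.n += 1
        for d, x in enumerate(p):
            delta = x - self.mean[d]
            self.mean[d] += delta / self.n
            self.m2[d] += delta * (x - self.mean[d])

    def distance(self, p, min_var=1e-6):
        # Distance scaled by the per-dimension variance.
        total = 0.0
        for d, x in enumerate(p):
            var = max(self.m2[d] / self.n, min_var)
            total += (x - self.mean[d]) ** 2 / var
        return math.sqrt(total)

def process_point(p, components, buffer, threshold=3.0):
    """Update the nearest component if p fits it, otherwise buffer p."""
    if components:
        nearest = min(components, key=lambda c: c.distance(p))
        if nearest.distance(p) <= threshold:
            nearest.update(p)
            return True
    buffer.append(p)
    return False

def retest_buffer(components, buffer, threshold=3.0):
    """Re-test buffered points, e.g. after a new component is created."""
    pending = buffer[:]
    buffer.clear()
    for p in pending:
        process_point(p, components, buffer, threshold)

components = [Component([(0, 0), (1, 0), (0, 1), (1, 1)])]
buffer = []
process_point((0.6, 0.4), components, buffer)    # fits: direct update
process_point((10.0, 10.0), components, buffer)  # too far: stays in buffer
buffered_before = list(buffer)
# Suppose a new component is later created near the buffered point.
components.append(Component([(10.2, 10.0), (10.0, 10.3), (9.9, 9.8)]))
retest_buffer(components, buffer)                # buffered point now fits
```

The threshold of 3.0 is an arbitrary illustrative value; the actual fit criterion of GMG is not given in the slides.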

Generating new components
When the buffer is full, selected points are used to generate new components.
The most compact k-neighborhood is selected as the seed for a new component:
Find the k nearest neighbors for all points.
Pick the point with the smallest maximum distance to its neighbors.
[Figure: data in buffer; selected points and a new component.]
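The seed selection can be illustrated with a brute-force search. This is a sketch assuming Euclidean distance and a small buffer; the function name `most_compact_neighborhood` is hypothetical, and the paper may use a different search strategy.

```python
import math

def most_compact_neighborhood(points, k):
    """Return (seed_index, neighbor_indices) for the point whose k-th
    nearest neighbor is closest, i.e. the smallest maximum distance."""
    if k >= len(points):
        raise ValueError("need more than k points in the buffer")
    best = None
    for i, p in enumerate(points):
        # Distances from p to every other buffered point, nearest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        radius = dists[k - 1][0]      # distance to the k-th neighbor
        if best is None or radius < best[0]:
            best = (radius, i, [j for _, j in dists[:k]])
    _, seed, neighbors = best
    return seed, neighbors

buffer = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (9.0, 1.0)]
seed, neighbors = most_compact_neighborhood(buffer, k=2)
```

The seed and its k neighbors would then be used to initialise the new component and removed from the buffer. The search is quadratic in the buffer size, which is acceptable only because the buffer is kept small.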

Example
Data arrives from left to right.
Red pluses are objects that the algorithm has seen but not yet used.
Gray points are objects that have been used and discarded by the algorithm.
Ellipses represent clusters.
Objects are used to generate new clusters and to update existing ones.
All objects will eventually be used one way or another.

Example

Post-processing
[Figures: model before processing; updated model; updated model with data.]

Literature
I. Kärkkäinen and P. Fränti, "Gradual model generator for single-pass clustering", Pattern Recognition, 40(3), March 2007.
P. Bradley, U. Fayyad and C. Reina, "Clustering Very Large Databases Using EM Mixture Models", Proc. of the 15th Int. Conf. on Pattern Recognition, vol. 2, 2000.
R. Ng and J. Han, "CLARANS: A Method for Clustering Objects for Spatial Data Mining", IEEE Trans. Knowledge & Data Engineering, 14(5), 2002.
M. Sato and S. Ishii, "On-line EM Algorithm for the Normalized Gaussian Network", Neural Computation, 12(2), 2000.
T. Zhang, R. Ramakrishnan and M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications", Data Mining and Knowledge Discovery, 1(2), 1997.
