Clustering Geometric Data Streams
Jiří Skála, Ivana Kolingerová
ZČU/FAV/KIV, 2007

Talk outline
  motivation
  background
  existing solution
  improvements
  experiments & observations
  conclusion

Motivation
  why data streams?
  geometric models growing larger
    – Stanford's Michelangelo project (David 28 million vertices, St. Matthew 187 million vertices)
    – 187·10^6 points · 3 coordinates · 8 bytes ≈ 4.5 GB
  must be processed out-of-core
  why clustering?
    – use hierarchical clustering to create a multiresolution model
    – various levels of detail (LOD) in different parts
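The size estimate on this slide can be checked with a few lines of arithmetic. The sketch below assumes that the 8 bytes per coordinate are double-precision floats and that nothing but raw coordinates is stored; the function name `raw_size_bytes` is made up for illustration.

```python
def raw_size_bytes(n_points, coords_per_point=3, bytes_per_coord=8):
    """Raw coordinate storage, assuming 8-byte values and no connectivity data."""
    return n_points * coords_per_point * bytes_per_coord


st_matthew = raw_size_bytes(187_000_000)
david = raw_size_bytes(28_000_000)

print(st_matthew)          # 4488000000 bytes
print(st_matthew / 1e9)    # size in decimal gigabytes
print(st_matthew / 2**30)  # size in binary gibibytes
print(david / 1e9)         # the smaller David model, for comparison
```

In decimal units the St. Matthew model comes out at about 4.5 GB, which matches the slide.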

Background – data stream
  ordered set of data
  data coming online or stored on HDD
  too large to fit in main memory
    – viewed only in order; random access extremely inefficient or even impossible
    – processed in one or very few linear scans

Background – clustering
  grouping similar elements together
    – vertices, DB entries, documents
    – similarity most often measured as Euclidean distance
  k-means, k-median clustering
  facility location
    – clients and facilities
    – facility cost
  [Figure panels: k-means, k-median]

Facility location
  no data streams yet
  introduced by Charikar and Guha, 1999
  initial solution iteratively refined by local improvements (local search algorithm)
  initial solution
    – points taken in random order
    – first point always a facility
    – others become a facility with probability p = d / fc; if d / fc > 1 then p := 1
    – otherwise connect to closest existing facility
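The construction of the initial solution can be sketched roughly as follows. This is an illustration, not the authors' code: it assumes that d is the distance from the current point to the closest facility opened so far and that fc is the facility cost; the name `initial_solution` and the `rng` parameter are hypothetical.

```python
import math
import random


def initial_solution(points, fc, rng=None):
    """Randomized initial solution (sketch).

    Assumption: d is the distance to the closest already open facility.
    Returns (facilities, assignment); assignment[i] is the index of the
    facility point that serves point i.
    """
    rng = rng or random.Random(0)
    order = list(range(len(points)))
    rng.shuffle(order)                    # points taken in random order

    facilities = []
    assignment = [None] * len(points)
    for i in order:
        if not facilities:                # first point is always a facility
            facilities.append(i)
            assignment[i] = i
            continue
        nearest = min(facilities, key=lambda f: math.dist(points[i], points[f]))
        d = math.dist(points[i], points[nearest])
        p = min(1.0, d / fc)              # p = d / fc, capped at 1
        if rng.random() < p:
            facilities.append(i)          # open a new facility here
            assignment[i] = i
        else:
            assignment[i] = nearest       # connect to the closest facility
    return facilities, assignment
```

With a very small facility cost every distinct point opens its own facility; with a large one most points connect to an existing facility.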

Facility location
  pick a point at random (new facility candidate)
  compute the gain function
    – pay for opening a facility
    – inspect all points and compare distance to facility
    – inspect facilities and determine whether they can be closed
  if gain > 0 then perform reassignments & closures
  repeated m log m times?
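One way to read the gain computation is sketched below for unweighted points. The bookkeeping (the `extra` and `members` dictionaries, the return value) is an assumption made for illustration; the slide only lists the three ingredients: the opening cost, the savings from reassignment, and the possible closures.

```python
import math


def gain(points, assignment, facilities, cand, fc):
    """Gain of opening point `cand` as a facility (sketch, unweighted).

    assignment[i] is the index of the facility serving point i.
    Returns (gain, points_to_reassign, facilities_to_close).
    """
    if cand in facilities:
        return 0.0, [], []
    total = -fc                                # pay for opening a facility
    reassign = []
    extra = {f: 0.0 for f in facilities}       # cost of moving f's remaining points
    members = {f: [] for f in facilities}
    for i, p in enumerate(points):
        f = assignment[i]
        d_old = math.dist(p, points[f])
        d_new = math.dist(p, points[cand])
        if d_new < d_old:                      # point is better off at the candidate
            total += d_old - d_new
            reassign.append(i)
        else:                                  # would only move if f gets closed
            extra[f] += d_new - d_old
            members[f].append(i)
    closed = []
    for f in facilities:
        saving = fc - extra[f]                 # refund of fc minus the moving cost
        if saving > 0:
            total += saving
            closed.append(f)
            reassign.extend(members[f])
    return total, reassign, closed


pts = [(0.0, 0.0), (10.0, 0.0), (10.0, 1.0), (10.0, -1.0)]
g, moved, closed = gain(pts, [0, 0, 0, 0], [0], cand=1, fc=5.0)
print(g > 0, sorted(moved), closed)
```

If the gain is positive, the step is applied by setting `assignment[i] = cand` for every returned point, removing the closed facilities, and adding `cand` to the facility list.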

Facility location
  [Figure panels: new facility candidate; after reassignments & closures]

Data stream clustering
  proposed by Guha et al., 2000
  data stream processed in blocks
    – clustering within each block
    – cluster centers given weight and passed to higher level
    – when higher level full, clustered again
    – distances multiplied by point weights
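The block hierarchy can be illustrated with the sketch below. Note that `cluster_block` is a deliberately simple greedy placeholder, not the local search from the talk; in a real implementation it would be replaced by the facility location algorithm. All names and the flush step at the end of the stream are assumptions.

```python
import math


def cluster_block(weighted_points, fc):
    """Placeholder block clusterer (NOT the local search from the talk).

    A point joins the nearest center if its weight times the distance is at
    most fc; otherwise it opens a new center. Returns (center, weight) pairs.
    """
    centers = []                               # mutable [point, weight] pairs
    for p, w in weighted_points:
        if centers:
            best = min(centers, key=lambda c: math.dist(p, c[0]))
            if w * math.dist(p, best[0]) <= fc:   # distance multiplied by weight
                best[1] += w
                continue
        centers.append([p, w])
    return [(c[0], c[1]) for c in centers]


def stream_cluster(stream, block_size, fc):
    """One pass over the stream; levels[0] holds raw points, higher levels centers."""
    levels = [[]]
    for p in stream:
        levels[0].append((p, 1.0))
        k = 0
        while len(levels[k]) >= block_size:    # level full: cluster and pass up
            before = len(levels[k])
            centers = cluster_block(levels[k], fc)
            levels[k] = []
            if k + 1 == len(levels):
                levels.append([])
            levels[k + 1].extend(centers)
            if len(centers) == before:         # no reduction: stop cascading
                break
            k += 1
    leftovers = [wp for level in levels for wp in level]
    return cluster_block(leftovers, fc)        # final pass over what remains


pts = [(0.01 * i, 0.0) for i in range(50)] + [(100 + 0.01 * i, 100.0) for i in range(50)]
result = stream_cluster(iter(pts), block_size=10, fc=5.0)
print(result)
```

Only one block per level is held in memory at a time, and the weights of the final centers add up to the number of points read from the stream.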

Data stream clustering
  [Video demonstration]

Improvements
  limiting the search space
    – inspect only points whose reassignment can improve the solution
    – i.e., those assigned to facilities within 2 fc radius
    – does not work for weighted points
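A possible reading of this restriction is sketched below: before computing the gain for a candidate, keep only the points served by facilities that lie within distance 2·fc of the candidate. The function name and the linear scan over facilities are assumptions; the slide does not say how the neighbouring facilities are found, and the restriction applies to unweighted points only.

```python
import math


def points_to_inspect(points, assignment, facilities, cand, fc):
    """Indices of points served by facilities within 2 * fc of the candidate."""
    near = {f for f in facilities
            if math.dist(points[f], points[cand]) <= 2 * fc}
    return [i for i in range(len(points)) if assignment[i] in near]


pts = [(0.0, 0.0), (1.0, 0.0), (100.0, 0.0), (101.0, 0.0), (2.0, 0.0)]
subset = points_to_inspect(pts, [0, 0, 2, 2, 0], [0, 2], cand=4, fc=5.0)
print(subset)  # [0, 1, 4]
```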

Improvements
  modification from k-median to facility location
  choosing the facility cost
    – equal to the diagonal of bounding box
  weight normalization
    – we need to keep weights around 1, i.e., average weight equal to 1
    – divide weights by their average
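Both choices are easy to state in code. The sketch below assumes an axis-aligned bounding box and points given as coordinate tuples; the function names are hypothetical.

```python
import math


def bounding_box_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box of the points."""
    dims = range(len(points[0]))
    lo = [min(p[k] for p in points) for k in dims]
    hi = [max(p[k] for p in points) for k in dims]
    return math.dist(lo, hi)


def normalize_weights(weights):
    """Divide the weights by their average so that the new average equals 1."""
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


fc = bounding_box_diagonal([(0, 0, 0), (3, 4, 0), (1, 1, 0)])
print(fc)                            # 5.0
print(normalize_weights([2, 4, 6]))  # [0.5, 1.0, 1.5]
```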

Experiments – setting the facility cost
  high setting
    – aggressive clustering
    – low number of large clusters
  low setting
    – moderate clustering
    – many small clusters
  set facility cost equal to diagonal of bounding box
  affects memory and running time

Experiments – setting the facility cost
  [Figure panels: diagonal; 2 · diagonal; 1/2 · diagonal; 1/4 · diagonal]

Experiments – input point distribution
  many authors rely on data being ordered
    – usually true
    – presented algorithm can handle unordered data as well
  there may be a problem with a few outliers

Experiments – input point distribution
  [Two figure slides, each with panels: 1st block; 2nd block; 3rd block; higher level]

Experiments – block size
  affects memory requirements
    – required memory
  somewhat affects clustering result
  affects running time
    – required iterations

Experiments – number of iterations
  m log m iterations necessary for a constant-factor approximation
  for large blocks running time grows unpleasantly
  0.1 m iterations seem to be enough; for data with clusters even fewer

Experiments – number of iterations
  [Figure panels: 6560 iterations, 1640 points; 164 iterations, 1640 points]

Conclusion
  modified data stream approach to facility location
  introduced facility weight normalization
  improvement to limit the number of points inspected
  experiments
    – discussion of parameter settings
    – description of algorithm behavior

References
  M. Charikar, S. Guha, Improved Combinatorial Algorithms for the Facility Location and k-Median Problems. Proc. 40th Sympos. on Foundations of Computer Science, 1999, pp. 378-388.
  S. Guha, N. Mishra, R. Motwani, L. O'Callaghan, Clustering Data Streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.
  L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, R. Motwani, Streaming-Data Algorithms for High-Quality Clustering. In Proceedings of IEEE International Conference on Data Engineering, March 2002.
  S. Guha, A. Meyerson, N. Mishra, R. Motwani, L. O'Callaghan, Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng. 15, 3 (2003), 515-528.
