Clustering Geometric Data Streams. Jiří Skála, Ivana Kolingerová. ZČU/FAV/KIV 2007.

1 Clustering Geometric Data Streams
Jiří Skála, Ivana Kolingerová
ZČU/FAV/KIV 2007

2 Talk outline
motivation
background
existing solution
improvements
experiments & observations
conclusion

3 Motivation
why data streams?
geometric models growing larger
  – Stanford's Michelangelo project (David: 28 million vertices, St. Matthew: 187 million vertices)
  – 187·10^6 points · 3 coordinates · 8 bytes ≈ 4.5 GB
must be processed out-of-core
why clustering?
  – use hierarchical clustering to create multiresolution model
  – various LOD in different parts
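
The size estimate on this slide can be reproduced with a few lines of arithmetic. This is only an illustrative sketch: the function name `raw_size_bytes` is made up here, and the 8 bytes per coordinate simply follow the slide's own assumption of double-precision values.

```python
def raw_size_bytes(num_points, coords_per_point=3, bytes_per_coord=8):
    """Raw storage for a point set: points x coordinates x bytes per coordinate."""
    return num_points * coords_per_point * bytes_per_coord


# Vertex counts quoted on the slide.
david = raw_size_bytes(28_000_000)
st_matthew = raw_size_bytes(187_000_000)

print(f"David:       {david / 1e9:.2f} GB")
print(f"St. Matthew: {st_matthew / 1e9:.2f} GB")  # 4.49 GB, i.e. roughly 4.5 GB
```

Connectivity and any per-vertex attributes would add to these figures, so they are lower bounds on the in-memory footprint.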

4 Background – data stream
ordered set of data
data coming online or stored on HDD
too large to fit in main memory
  – viewed only in order; random access extremely inefficient or even impossible
  – processed in one or very few linear scans

5 Background – clustering
grouping similar elements together
  – vertices, DB entries, documents
  – similarity most often measured as Euclidean distance
k-means, k-median clustering
facility location
  – clients and facilities
  – facility cost
[Figure panels: k-means; k-median]
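
For reference, the three objectives named on this slide are commonly written as below. The notation is introduced here for illustration and does not come from the slides: X is the input point set, C a set of k centers, F the set of open facilities, d the Euclidean distance, and fc a uniform cost per facility.

```latex
\begin{align*}
\text{k-median:} \quad
  & \min_{C,\ |C| = k} \ \sum_{x \in X} \min_{c \in C} d(x, c) \\
\text{k-means:} \quad
  & \min_{C,\ |C| = k} \ \sum_{x \in X} \min_{c \in C} d(x, c)^2 \\
\text{facility location:} \quad
  & \min_{F} \ \Big( |F| \cdot fc + \sum_{x \in X} \min_{f \in F} d(x, f) \Big)
\end{align*}
```

The practical difference is that k-means and k-median fix the number of clusters in advance, whereas facility location lets the facility cost decide how many clusters are worth opening.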

6 Facility location
no data streams yet
introduced by Charikar and Guha, 1999
initial solution iteratively refined by local improvements (local search algorithm)
initial solution
  – points taken in random order
  – first point always a facility
  – others become a facility with probability p = d / fc; if d / fc > 1 then p := 1
  – otherwise connect to closest existing facility
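
The initial-solution steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the slide does not define d, so the sketch assumes d is the distance from the point to its closest already-open facility, and the names `initial_solution`, `facility_cost` and `rng` are hypothetical.

```python
import math
import random


def initial_solution(points, facility_cost, rng=None):
    """Randomized initial solution; returns (facilities, assignment) as point indices."""
    rng = rng or random.Random()
    order = list(range(len(points)))
    rng.shuffle(order)                      # points taken in random order

    facilities = [order[0]]                 # first point is always a facility
    assignment = {order[0]: order[0]}

    for i in order[1:]:
        # Assumption: d is the distance to the closest existing facility.
        nearest = min(facilities, key=lambda f: math.dist(points[i], points[f]))
        d = math.dist(points[i], points[nearest])
        p = min(1.0, d / facility_cost)     # p = d / fc, capped at 1
        if rng.random() < p:
            facilities.append(i)            # point opens a new facility
            assignment[i] = i
        else:
            assignment[i] = nearest         # otherwise connect to the closest one
    return facilities, assignment
```

Under this reading, points far from every open facility almost always become facilities themselves, while points close to one are very likely to be connected to it.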

7 Facility location
pick a point at random (new facility candidate)
compute the gain function
  – pay for opening a facility
  – inspect all points and compare distance to facility
  – inspect facilities and determine whether they can be closed
if gain > 0 then perform reassignments & closures
repeated m log m times?
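
One possible reading of the gain computation is sketched below. It assumes unweighted points, a uniform facility cost, and that a facility is closed only when the saved facility cost outweighs the extra distance paid by its remaining points. The function name `gain` and its return format are illustrative and not taken from the slides.

```python
import math


def gain(points, assignment, facilities, candidate, facility_cost):
    """Return (gain, reassign, close) for opening a facility at index `candidate`."""
    # Pay for opening the candidate unless it is already open.
    total = 0.0 if candidate in facilities else -facility_cost
    reassign = set()
    # Extra distance paid if the remaining points of a facility move to the candidate.
    extra = {f: 0.0 for f in facilities if f != candidate}

    for j, f in assignment.items():
        d_old = math.dist(points[j], points[f])
        d_new = math.dist(points[j], points[candidate])
        if d_new < d_old:
            total += d_old - d_new          # point is closer to the candidate
            reassign.add(j)
        elif f != candidate:
            extra[f] += d_new - d_old       # cost of moving it if f is closed

    close = set()
    for f, cost in extra.items():
        if facility_cost - cost > 0:        # closing f saves more than it costs
            total += facility_cost - cost
            close.add(f)
    return total, reassign, close
```

A caller would apply the reassignments and closures only when the returned gain is positive, then repeat with another random candidate.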

8 Facility location
[Figure panels: New facility candidate; After reassignments & closures]

9 Data stream clustering
proposed by Guha et al., 2000
data stream processed in blocks
  – clustering within each block
  – cluster centers given weight and passed to higher level
  – when higher level full, clustered again
  – distances multiplied by point weights
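
The block-and-level scheme can be sketched as below. The per-block clustering step is passed in as a callback, `cluster_block`, which is a hypothetical name. The `weighted_centroid` helper is only a toy stand-in that collapses a block to a single weighted center; it is not the facility location step the talk describes. The callback must return fewer weighted centers than the block size, otherwise levels would keep overflowing.

```python
def stream_cluster(stream, block_size, cluster_block):
    """Process a stream in blocks; levels[k] holds (point, weight) pairs of level k."""
    levels = [[]]
    for point in stream:
        levels[0].append((point, 1.0))      # raw points enter with weight 1
        k = 0
        while len(levels[k]) >= block_size: # level full: cluster and pass upward
            centers = cluster_block(levels[k])
            levels[k] = []
            if k + 1 == len(levels):
                levels.append([])
            levels[k + 1].extend(centers)
            k += 1
    return levels


def weighted_centroid(block):
    """Toy placeholder: replace a whole block by one center carrying its total weight."""
    total = sum(w for _, w in block)
    dim = len(block[0][0])
    center = tuple(sum(p[i] * w for p, w in block) / total for i in range(dim))
    return [(center, total)]


levels = stream_cluster(
    [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0)], 2, weighted_centroid
)
```

Because each center carries the weight of the points it represents, a clustering step at a higher level can multiply distances by those weights, as the last bullet notes.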

10 Data stream clustering
time for video

11 Improvements
limiting the search space
  – inspect only points whose reassignment can improve the solution
  – i.e., those assigned to facilities within 2 fc radius
  – does not work for weighted points
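
A sketch of this filter follows, assuming "within 2 fc radius" means facilities whose distance to the new candidate is at most 2·fc. That interpretation and the name `points_to_inspect` are assumptions made here. As the slide notes, the shortcut applies to unweighted points only.

```python
import math


def points_to_inspect(points, assignment, facilities, candidate, facility_cost):
    """Indices of points assigned to facilities within 2 * fc of the candidate."""
    radius = 2 * facility_cost
    nearby = {
        f for f in facilities
        if math.dist(points[f], points[candidate]) <= radius
    }
    # Only points served by a nearby facility are worth comparing to the candidate.
    return [j for j, f in assignment.items() if f in nearby]
```

The gain computation would then iterate over this shorter list instead of over every point in the block.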

12 Improvements
modification from k-median to facility location
choosing the facility cost
  – equal to the diagonal of bounding box
weight normalization
  – we need to keep weights around 1, i.e., average weight equal to 1
  – divide weights by their average
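
Both rules are short enough to show directly. The helper names below are illustrative, and the bounding box is assumed to be axis-aligned, which the slide does not state explicitly.

```python
import math


def bounding_box_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box of `points`."""
    dim = len(points[0])
    extents = [
        max(p[i] for p in points) - min(p[i] for p in points)
        for i in range(dim)
    ]
    return math.sqrt(sum(e * e for e in extents))


def normalize_weights(weights):
    """Divide weights by their average so that the average weight becomes 1."""
    average = sum(weights) / len(weights)
    return [w / average for w in weights]


facility_cost = bounding_box_diagonal([(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)])
weights = normalize_weights([2.0, 4.0, 6.0])
```

Normalizing keeps weighted distances on the same scale as the facility cost at every level, even though raw weights grow as centers move up the hierarchy.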

13 Experiments – setting the facility cost
high setting
  – aggressive clustering
  – low number of large clusters
low setting
  – moderate clustering
  – many small clusters
set facility cost equal to diagonal of bounding box
affects memory and running time

14 Experiments – setting the facility cost
[Figure panels: diagonal; 2 diagonal; 1/2 diagonal; 1/4 diagonal]

15 Experiments – input point distribution
many authors rely on data being ordered
  – usually true
  – presented algorithm can handle unordered data as well
there may be a problem with a few outliers

16 Experiments – input point distribution
[Figure panels: 1st block; 2nd block; 3rd block; higher level]

17 Experiments – input point distribution
[Figure panels: 1st block; 2nd block; 3rd block; higher level]

18 Experiments – block size
affects memory requirements
  – required memory
somewhat affects clustering result
affects running time
  – required iterations

19 Experiments – number of iterations
m log m iterations necessary for a constant-factor approximation
for large blocks running time grows unpleasantly
0.1 m iterations seem to be enough; for data with clusters even less

20 Experiments – number of iterations
[Figure panels: 6560 iterations, 1640 points; 164 iterations, 1640 points]

21 Conclusion
modified data stream approach to facility location
introduced facility weight normalization
improvement to limit the number of points inspected
experiments
  – discussion of parameter settings
  – description of algorithm behavior

22 References
M. Charikar, S. Guha, Improved Combinatorial Algorithms for the Facility Location and k-Median Problems. Proc. 40th Sympos. on Foundations of Computer Science, 1999, pp. 378-388.
S. Guha, N. Mishra, R. Motwani, L. O'Callaghan, Clustering Data Streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.
L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, R. Motwani, Streaming-Data Algorithms for High-Quality Clustering. In Proceedings of IEEE International Conference on Data Engineering, March 2002.
S. Guha, A. Meyerson, N. Mishra, R. Motwani, L. O'Callaghan, Clustering data streams: Theory and practice. IEEE Trans. Knowl. Data Eng. 15, 3 (2003), 515-528.

