Published by Virgil Austin. Modified over 6 years ago.
1
Framework for real-time clustering over sliding windows
Sobhan Badiozamany, Kjell Orsborn, Tore Risch
Uppsala University, Sweden
2
Outline
- Why clustering over sliding windows is interesting
- State-of-the-art solutions
- Our contributions:
  - SBM
  - Generic state maintenance using contexts
  - Contextualized indexing
- Results
- Related work
3
Clustering over data streams
- Online data stream analysis
- Monitoring the distribution of moving objects, e.g. urban traffic monitoring
- Spatio-temporal event monitoring, e.g. detecting major events using social media
- Give a typical application example of clustering over a sliding window.
4
Sliding window characteristics
- Sliding windows capture the evolving behavior of data streams.
- W2,12 highly overlaps with W0,10 (gray portions).
- Building W2,12 from W0,10: W10,12 is new data and W0,2 is expired data.
5
GROUPBY queries over sliding windows
- First phase: summarize the small blocks. Produces partial summaries.
- Second phase: reuses the summary in W0,10 to produce W2,12. Green is merged (incremental), red is excluded (decremental).

Summary table:
Road   #cars
E4     10
E20    20
E18    30
E10    15

Partial summary:
Road   #cars
E10    3
E18    5

Partial summary:
Road   #cars
E4     2
E18    10

W2,12:
Road   #cars
E4     12
E20    20
E18    35
E10

The W2,12 values are consistent with the first partial summary being excluded and the second being merged (E4: 10 + 2 = 12; E18: 30 - 5 + 10 = 35).

- Group memberships are identified using distinct values of the Road attribute.
- Deterministic group membership.
- Only aggregates are updated: add #cars, subtract #cars.
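The incremental/decremental update on this slide can be sketched in a few lines of Python. This is only an illustration using the slide's numbers; the function name `slide_groupby` is made up here, and which partial summary is the expired one and which is the new one is inferred from the W2,12 values, not stated on the slide. The slide leaves the E10 count of W2,12 blank; the value below is simply what the subtraction yields.

```python
from collections import Counter

def slide_groupby(window, expired, new):
    """Return the next window summary from the previous one.

    Hypothetical helper: subtract the expired partial summary
    (decremental step), then add the new one (incremental step).
    """
    result = Counter(window)
    result.subtract(expired)   # exclude: subtract #cars
    result.update(new)         # merge: add #cars
    # Drop roads whose count fell to zero.
    return {road: n for road, n in result.items() if n > 0}

w_0_10 = {"E4": 10, "E20": 20, "E18": 30, "E10": 15}
w_0_2 = {"E10": 3, "E18": 5}      # assumed to be the expired block
w_10_12 = {"E4": 2, "E18": 10}    # assumed to be the new block

w_2_12 = slide_groupby(w_0_10, w_0_2, w_10_12)
print(w_2_12)
```

Group membership never changes here: a row is found by its Road value, so only the aggregate is touched. That is exactly what stops working once groups are clusters.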
6
Clustering queries over sliding windows
- Clustering is dynamic because grouping is based on similarity:
  - Groups might merge
  - Groups might split
- For many clustering algorithms the exclude function:
  - does not exist, e.g. BIRCH
  - exists, e.g. [Ester et al. 1998], but is shown to be very expensive [Yang et al. 2009]
- We therefore have to rely only on the merge function to maintain clusters over sliding windows.
7
Window Partition Ratio (PR)
- Partition Ratio (PR) = the number of partial summaries that comprise a window. Here PR = 5.
- Higher PR -> finer-grained slides -> real-time change tracking
- Scaling PR is desirable for many queries.
8
Repetitive Merge [Guha et al. 2000] [Babcock et al. 2003]
- Only uses merging; no exclude is needed.
- Maintains PR windows in parallel.
- Each arriving partial summary, e.g. W8,10, is merged into PR windows:
  W0,10 = merge(W0,2, W2,4, W4,6, W6,8, W8,10)
  W2,12 = merge(W2,4, W4,6, W6,8, W8,10, W10,12)
  W4,14 = merge(W4,6, W6,8, W8,10, W10,12, W12,14)
  W6,16 = merge(W6,8, W8,10, W10,12, W12,14, W14,16)
  W8,18 = merge(W8,10, W10,12, W12,14, W14,16, W16,18)
- [Figure: timeline t = 1 to 18 with partial summaries W0,2, W2,4, W4,6, W6,8, W8,10, W10,12, W12,14, W14,16, W16,18]
- The number of merges per slide: PR
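A small simulation makes the cost of Repetitive Merge concrete. The sketch below is not the authors' code: `repetitive_merge` is a hypothetical name, and partial summaries are stood in for by Python lists so that the merge is plain concatenation. It keeps up to PR window instances open and merges every arriving partial summary into each of them.

```python
def repetitive_merge(partials, pr, merge, empty):
    """Simulate Repetitive Merge over a sequence of partial summaries.

    Returns the completed window summaries and the merge count per slide.
    """
    open_windows = []          # each entry: [partials merged so far, summary]
    reports, merges_per_slide = [], []
    for p in partials:
        open_windows.append([0, empty()])   # a new window instance starts here
        merges = 0
        for w in open_windows:
            w[1] = merge(w[1], p)
            w[0] += 1
            merges += 1
        merges_per_slide.append(merges)
        if open_windows[0][0] == pr:        # the oldest window is complete
            reports.append(open_windows.pop(0)[1])
    return reports, merges_per_slide

# Nine partial summaries labelled 1..9, PR = 5 as on the slide.
reports, merges_per_slide = repetitive_merge(
    [[i] for i in range(1, 10)], pr=5, merge=lambda a, b: a + b, empty=list)
print(reports[0], merges_per_slide)
```

Once the stream has warmed up, every slide costs PR merges, which is the number the slide quotes.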
9
Sliding Binary merge (SBM)
- Uses a lattice to represent temporal relationships between window instances in terms of their time intervals.
- Here window range: 32, window stride: 4 -> PR = 8
- [Figure: SBM lattice over t = 4, 8, ..., 52, with one row of nodes per interval length:
  length 4: W0,4, W4,8, W8,12, ..., W48,52
  length 8: W0,8, W4,12, W8,16, ..., W44,52
  length 16: W0,16, W4,20, W8,24, ..., W36,52
  length 32: W0,32, W4,36, W8,40, ..., W20,52]
- The number of merges per slide: log2(PR)
- The old nodes should be removed: gray nodes have already been removed, red nodes are being removed right now (t = 52).
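One way to read the lattice is that, at every slide, each level gets one new node built by merging two nodes of half its length. The Python sketch below follows that reading; it is reconstructed from the node labels on the slide, not taken from the paper. It assumes PR is a power of two (the general case is not covered), and the function name `sbm_slide` and the eviction rule are assumptions of this sketch.

```python
def sbm_slide(nodes, t, stride, pr, partial, merge):
    """Insert the partial summary ending at time t, build one node per level.

    nodes maps (start, end) intervals to summaries.
    Returns the number of merges performed in this slide.
    """
    assert pr > 0 and pr & (pr - 1) == 0, "sketch assumes PR is a power of two"
    window = stride * pr
    nodes[(t - stride, t)] = partial
    merges = 0
    length = stride
    while length < window:
        left = nodes.get((t - 2 * length, t - length))
        if left is None:               # stream is younger than 2 * length
            break
        nodes[(t - 2 * length, t)] = merge(left, nodes[(t - length, t)])
        merges += 1
        length *= 2
    # Evict nodes that can no longer serve as a left operand.
    for (start, end) in list(nodes):
        span = end - start
        used_up = end + span <= t      # its parent was built at end + span
        stale_window = span == window and end < t
        if used_up or stale_window:
            del nodes[(start, end)]
    return merges

# Slide setting: range 32, stride 4, PR = 8; partials labelled by end time.
nodes, report, merges = {}, None, 0
for t in range(4, 56, 4):
    merges = sbm_slide(nodes, t, 4, 8, [t], lambda a, b: a + b)
    report = nodes.get((t - 32, t))
print(merges, report, sorted(nodes))
```

With PR = 8 the steady-state slide performs three merges, W[t-8,t], W[t-16,t] and W[t-32,t], which matches log2(PR). Shorter nodes are dropped as soon as the node that needs them has been built.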
10
Properties of SBM
- Reduces the number of merges per slide from PR to log2(PR)
- Only slightly higher memory footprint compared to repetitive merge
- Supports arbitrary window sizes
- Proof: … Or maybe read the paper :-)
11
How to use SBM in a framework?
- The Generic 2-phase Continuous Summarization (G2CS) framework generalizes the GROUPBY frameworks to support clustering.
- Each node in the lattice represents a window instance having a number of clusters.
- In G2CS a window instance is represented by its time interval, also called its context.
- Contexts are objects that are managed by G2CS.
12
Contextualizing the window state
- Each context contains a number of clusters, each having an arbitrarily complex structure.
- G2CS uses a uniform schema that represents all clusters in all window instances.

Contextualized Clustering Table (CCT): CCT(cid, cxtid, a1, …, an)
- cid is the cluster identifier
- cxtid is the context identifier
- a1, …, an are algorithm specific

For the BIRCH clustering algorithm:
- LS: linear sum
- SS: squared sum
- N: number of points
- CM: center of mass

cid   cxtid    LS   SS   N   CM
1     {0,8}    …
2
3
4     {8,12}
5

- A context identifies the partition in the CCT that contains its window instance data.
- A node in the SBM lattice corresponds to a partition in the CCT.
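To make the CCT schema concrete, here is a minimal Python sketch with BIRCH-style attributes. The names `CCTRow`, `make_row`, `merge_rows` and `partition` are hypothetical, the sample points are invented, and SS is taken to be the scalar sum of squared norms (an assumption; the slide only says "squared sum"). CM is derived from LS and N instead of being stored.

```python
from dataclasses import dataclass

@dataclass
class CCTRow:
    cid: int        # cluster identifier
    cxtid: tuple    # context: time interval of the window instance, e.g. (0, 8)
    LS: tuple       # linear sum of the points, per dimension
    SS: float       # sum of squared norms of the points (assumed scalar)
    N: int          # number of points

    @property
    def CM(self):
        """Center of mass, derived as LS / N."""
        return tuple(x / self.N for x in self.LS)

def make_row(cid, cxtid, points):
    """Summarize a non-empty list of points into one row."""
    ls = tuple(sum(axis) for axis in zip(*points))
    ss = float(sum(x * x for p in points for x in p))
    return CCTRow(cid, cxtid, ls, ss, len(points))

def merge_rows(cid, cxtid, a, b):
    """The BIRCH attributes are additive, so merging just adds them."""
    ls = tuple(x + y for x, y in zip(a.LS, b.LS))
    return CCTRow(cid, cxtid, ls, a.SS + b.SS, a.N + b.N)

def partition(cct, cxtid):
    """Rows of one context, i.e. the data of one window instance."""
    return [row for row in cct if row.cxtid == cxtid]

cct = [
    make_row(1, (0, 8), [(0.0, 0.0), (2.0, 0.0)]),
    make_row(2, (0, 8), [(4.0, 4.0)]),
    make_row(3, (8, 12), [(1.0, 1.0)]),
]
merged = merge_rows(4, (0, 8), cct[0], cct[1])
print(merged, merged.CM, len(partition(cct, (0, 8))))
```

Because every row carries its cxtid, one table can hold the clusters of all window instances, and a context is just a filter on that column.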
13
Generic 2-phase Continuous Summarization (G2CS) framework
- G2CS modularizes the solution.
- Clustering algorithm plug-ins (red) operate on system-managed contexts; the merger is the most expensive.
- G2CS provides transparent indexing per context, i.e. per partition in the CCT.
- SBM is implemented in the final summarizer.
- [Figure: G2CS architecture. Labels: Continuous Summarization Queries, Stream, Partial Summarizer, Final Summarizer, Continuous Summary, Main Memory Data Manager, Context Manager, Contextualized index manager. Plug-ins: adder, copier, merger, excluder, reporter.]
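The slide names five plug-ins but does not give their signatures. The sketch below is a guess at what such an interface could look like in Python, instantiated for the simple car-counting GROUPBY case. Only the five method names come from the slide; the class names, the arguments and the behaviour of each method are assumptions made for illustration.

```python
from abc import ABC, abstractmethod

class SummaryPlugins(ABC):
    """Hypothetical plug-in interface; signatures are assumed, not G2CS's."""
    @abstractmethod
    def adder(self, summary, item): ...       # add one stream item
    @abstractmethod
    def copier(self, summary): ...            # copy a summary into a new context
    @abstractmethod
    def merger(self, left, right): ...        # merge two summaries
    @abstractmethod
    def excluder(self, summary, expired): ... # remove an expired summary
    @abstractmethod
    def reporter(self, summary): ...          # produce the query result

class CarCount(SummaryPlugins):
    """GROUPBY Road, COUNT(*) expressed through the five plug-ins."""
    def adder(self, summary, item):
        summary[item] = summary.get(item, 0) + 1
        return summary
    def copier(self, summary):
        return dict(summary)
    def merger(self, left, right):
        out = dict(left)
        for road, n in right.items():
            out[road] = out.get(road, 0) + n
        return out
    def excluder(self, summary, expired):
        out = {road: n - expired.get(road, 0) for road, n in summary.items()}
        return {road: n for road, n in out.items() if n > 0}
    def reporter(self, summary):
        return sorted(summary.items())

def partial_summarizer(plugins, block):
    """First phase: fold one small block of the stream into a partial summary."""
    summary = {}
    for item in block:
        summary = plugins.adder(summary, item)
    return summary

plugins = CarCount()
a = partial_summarizer(plugins, ["E4", "E18", "E4"])
b = partial_summarizer(plugins, ["E18"])
window = plugins.merger(plugins.copier(a), b)
print(plugins.reporter(window))
```

A clustering algorithm would supply the same five operations over sets of micro-clusters, typically leaving the excluder unused.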
14
Why is indexing needed?
- The expensive merger plug-in receives two sets of clusters to merge, here black and green.
- It often performs a nearest neighbor search to form links between micro-clusters: for each green micro-cluster, we need to find the closest black micro-cluster.
- Multi-dimensional indexing on the set of black micro-clusters helps.
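Without an index, the linking step is a nested scan: every green micro-cluster is compared against every black one. The Python sketch below shows that baseline. The function name `link_nearest`, the cluster ids and the centers are invented for illustration, and Euclidean distance between centers is an assumption.

```python
from math import dist

def link_nearest(green, black):
    """For each green micro-cluster, find the cid of the closest black one.

    Linear scan over the black set, so the total work is
    len(green) * len(black) distance computations.
    """
    links = {}
    for g_cid, g_center in green.items():
        links[g_cid] = min(black, key=lambda b_cid: dist(g_center, black[b_cid]))
    return links

black = {1: (0.0, 0.0), 2: (5.0, 5.0), 3: (10.0, 0.0)}
green = {4: (1.0, 1.0), 5: (6.0, 6.0), 6: (9.0, 1.0)}
links = link_nearest(green, black)
print(links)
```

When both sets grow with the number of clusters per window, this product is what a spatial index on the black set is meant to avoid.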
15
Contextualized indexing
- The nearest neighbor search in the merger always has a bound context, e.g. for each green micro-cluster a search is done in the black context.
- Many contexts -> many X-trees -> hard to find “the one”
- Two-layered index:
  - Global hash index on the context id, cxtid
  - Local spatial index on each context's data
- [Figure: the CCT with columns cid, cxtid, …, ai and rows 1 to 6 holding ai1 to ai6; a hash index on cxtid pointing to one X-tree per context, one X-tree containing the black micro-clusters and one X-tree containing ai4, ai5, and ai6.]
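The two layers can be sketched as a dictionary of per-context spatial indexes. In the Python sketch below a tiny k-d tree stands in for the X-tree named on the slide (a deliberate substitution, since an X-tree is much more involved). The class names, context ids and cluster centers are all made up for illustration.

```python
from math import dist

class KDTree:
    """Tiny k-d tree over (point, cid) pairs; a stand-in for the X-tree."""
    def __init__(self, items, depth=0):
        self.node = None
        if not items:
            return
        self.axis = depth % len(items[0][0])
        items = sorted(items, key=lambda it: it[0][self.axis])
        mid = len(items) // 2
        self.node = items[mid]
        self.left = KDTree(items[:mid], depth + 1)
        self.right = KDTree(items[mid + 1:], depth + 1)

    def nearest(self, q, best=None):
        """Return (distance, cid) of the stored point closest to q."""
        if self.node is None:
            return best
        point, cid = self.node
        d = dist(q, point)
        if best is None or d < best[0]:
            best = (d, cid)
        diff = q[self.axis] - point[self.axis]
        near, far = (self.left, self.right) if diff < 0 else (self.right, self.left)
        best = near.nearest(q, best)
        if abs(diff) < best[0]:        # the other side may still hold a closer point
            best = far.nearest(q, best)
        return best

class ContextualizedIndex:
    """Global hash index on cxtid; one local spatial index per context."""
    def __init__(self):
        self.by_context = {}
    def build(self, cxtid, clusters):
        self.by_context[cxtid] = KDTree([(center, cid) for cid, center in clusters])
    def nearest(self, cxtid, q):
        return self.by_context[cxtid].nearest(q)[1]
    def drop(self, cxtid):
        self.by_context.pop(cxtid, None)

index = ContextualizedIndex()
index.build((0, 8), [(1, (0.0, 0.0)), (2, (5.0, 5.0)), (3, (10.0, 0.0))])
index.build((8, 12), [(4, (1.0, 1.0)), (5, (6.0, 6.0)), (6, (9.0, 1.0))])
print(index.nearest((0, 8), (0.9, 0.9)), index.nearest((8, 12), (0.9, 0.9)))
```

The hash lookup picks the right tree in constant time, so the search never touches clusters of other window instances. Dropping a context discards its tree together with its partition.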
16
Experimental results, GROUPBY
- No contextualized indexing
- Conventional GROUPBY with a very efficient exclude method
- Synthetic data
- Differential Maintenance (DM) takes constant time, SBM scales logarithmically, and RM scales linearly.
17
Experimental results, Indexing
- AC: the average number of clusters per window instance
- SBM with contextualized indexing scales logarithmically with AC; without an index it scales quadratically.
18
Experimental results, Real data
- BIRCH clustering on real data from a soccer game.
- As PR is scaled, AC is also scaled.
- SBM is significantly better than RM.
- The gain from indexing is limited in RM (15%) due to intensive copying, compared to a 60% gain from indexing in SBM.
19
Experimental results, Memory Utilization
Back up slide for memory utilization
20
Experimental results, Work breakdown
- Copier plug-in dominates in RM: in RM all window instances have the full extent of the window -> more data to copy -> indexing does not help
- Merger plug-in dominates in SBM: the copier is relatively cheap because most nodes in the lattice cover a short extent -> less data to copy -> indexing helps
- Low G2CS overhead
21
References
Repetitive Merge is used in the following papers:
[1] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams," in Proceedings of Foundations of Computer Science conference, Redondo Beach, CA, 2000, pp.
[2] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in SIGMOD conf., San Diego, 2003, pp.
Decremental DBSCAN:
[3] M. Ester, H-P. Kriegel, J. Sander, M. Wimmer, and X. Xu, "Incremental clustering for mining in a data warehousing environment," in VLDB conf., New York, 1998, pp.
Why decremental clustering algorithms are not suitable for streaming:
[4] D. Yang, E. A. Rundensteiner, and M. O. Ward, "Neighbor-based pattern detection for windows over streaming data," in EDBT conf., Saint Petersburg, 2009, pp.
BIRCH:
[5] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," in SIGMOD conf., Montreal, 1996, pp.
Inline citations in the slides use the form [lastname et al. year].