A Unified Framework Supporting Interactive Exploration of Density-Based Clusters in Streaming Windows
Di Yang, Zhengyu Guo, Elke A. Rundensteiner and Matthew O. Ward
Worcester Polytechnic Institute
EDBT 2010, Submitted
This work is supported under NSF grants CCF , IIS , IIS
What are Density-Based Clusters?
Clusters that are defined by individual data points (tuples) and their local “neighborhood”.
How are they different from K-median style clustering?
[Figure: two clusterings of a data set, one with two clusters (Cluster 1, Cluster 2) and one with four (Cluster 1 to Cluster 4)]
Formal Definition
Core Object: has more than θ^cnt neighbors within distance θ^range of it.
Edge Object: not a core object, but a neighbor of a core object.
Noise: not a core object and not a neighbor of any core object.
A Density-Based Cluster (DB-Cluster) is a maximal group of connected core objects and the edge objects attached to them.
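The three object types can be illustrated with a small brute-force sketch. This is not the algorithm from the talk; the function name `classify`, the Euclidean distance, and the sample points are assumptions made for illustration, and the parameters `theta_range` and `theta_cnt` stand in for θ^range and θ^cnt.

```python
from math import dist

def classify(points, theta_range, theta_cnt):
    """Label each point 'core', 'edge', or 'noise' (brute force, illustrative only)."""
    n = len(points)
    # neighbors[i]: indices of all other points within theta_range of point i
    neighbors = [
        [j for j in range(n) if j != i and dist(points[i], points[j]) <= theta_range]
        for i in range(n)
    ]
    # Core object: more than theta_cnt neighbors (following the slide's wording)
    is_core = [len(nb) > theta_cnt for nb in neighbors]
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("edge")   # not core, but adjacent to a core object
        else:
            labels.append("noise")  # not core and not adjacent to any core object
    return labels

# Hypothetical sample data
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
labels = classify(points, theta_range=1.0, theta_cnt=2)
print(labels)  # ['noise', 'edge', 'edge', 'core', 'edge', 'noise']
```

Only (1, 1) has more than two neighbors within distance 1, so it is the single core object; its three neighbors become edge objects and the remaining points are noise.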
Cluster Detection in Sliding Windows
[Figure: consecutive sliding windows over the stream]
Template Density-Based Clustering Query Over Sliding Windows
[Query template; its parameters are labeled as pattern-specific and window-specific]
Application Examples
Stock market: stock analysts look for clusters in transaction info. Are there intensive-transaction areas in the last hour's transactions?
Battlefield: a commander looks for clusters in position info. Where are the main clusters formed by enemy warcraft?
State-of-Art
Existing algorithms for density-based clustering queries over sliding windows include Incremental DBSCAN [Ester98], and Exact-N, Abstract-C and Extra-N [Yang09].
Extra-N becomes inefficient as the win/slide ratio increases.
No evolution semantics has been defined for how density-based clusters change over time.
No existing system allows interactive exploration of density-based clusters in streaming windows.
Goals
1. A more efficient density-based clustering algorithm over streams.
2. An evolution semantics that intuitively explains cluster changes.
3. A visualized pattern space allowing interactive exploration of clusters.
Review: Existing Algorithm (Extra-N)
Two options in highly dynamic streaming environments: re-computation, or incremental cluster maintenance.
Extra-N [Yang09] proposed a hybrid neighbor relationship (neighborship) mechanism to represent the cluster structure:
Maintain “Exact Neighborships” (neighbor lists) for non-core objects.
Maintain “Abstract Neighborships” (cluster memberships) for core objects.
A general concept of “Predicted View” is applied to update the cluster structure efficiently.
Key: a compact and easily maintainable cluster representation.
Concept of Predicted Views
[Figure: window size = 16, slide size = 4, time = 1. Panels: Current View of W0 and Predicted Views of W1, W2 and W3]
Update Predicted Views
[Figure: window size = 16, slide size = 4, time = 1. Panels: Expired View of W0, Current View of W1 and Predicted Views of W2, W3 and W4, with new data points added]
Inefficiency of Extra-N
When the win/slide ratio increases (for example, win = 10000, slide = 10), a large number of predicted views need to be maintained independently.
This places a heavy burden on both CPU and memory resources.
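A rough sketch of why the number of views grows with the window size divided by the slide size. It assumes time-based windows W_k = [k*slide, k*slide + win) numbered from 0; the function names are hypothetical and this is not the Extra-N implementation, only the arithmetic behind the view count.

```python
def windows_containing(t, win, slide):
    """Indices k of all windows W_k = [k*slide, k*slide + win) that contain timestamp t."""
    first = max(0, (t - win) // slide + 1)  # earliest window that has not yet expired t
    last = t // slide                       # latest window that has already started at t
    return list(range(first, last + 1))

def views_to_maintain(win, slide):
    """Number of overlapping views: the current window plus its predicted views."""
    return -(-win // slide)  # ceiling division

# Settings from the predicted-view figures: win = 16, slide = 4
print(windows_containing(15, 16, 4))  # [0, 1, 2, 3]
print(windows_containing(16, 16, 4))  # [1, 2, 3, 4]
print(views_to_maintain(16, 4))       # 4
# Settings from the inefficiency example: win = 10000, slide = 10
print(views_to_maintain(10000, 10))   # 1000
```

Under these assumptions a tuple arriving at time 16 no longer belongs to W0 but does belong to W1 through W4, and the win = 10000, slide = 10 setting requires 1000 views to be kept up to date.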
Proposed Solution: IWIN
Is there any relationship among the clusters identified?
“Growth Property” among DB-cluster Sets
If every cluster Ci in Clu_Set1 is “contained” in some cluster in Clu_Set2, then Clu_Set2 is a “Growth” of Clu_Set1.
[Figure: Independent Cluster Structure Storage versus Hierarchical Cluster Structure Storage, with clusters c4, c5 and c6]
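The growth property can be checked directly when clusters are represented as sets of object identifiers. In this minimal sketch, “contained” is read as the subset relation on cluster members, which is an assumption; the function name `is_growth` and the sample sets are hypothetical.

```python
def is_growth(clu_set1, clu_set2):
    """True if every cluster in clu_set1 is a subset of some cluster in clu_set2."""
    return all(any(c1 <= c2 for c2 in clu_set2) for c1 in clu_set1)

# Hypothetical cluster sets over integer object ids
clu_set1 = [frozenset({1, 2, 3}), frozenset({4, 5}), frozenset({7, 8})]
clu_set2 = [frozenset({1, 2, 3, 4, 5, 6}), frozenset({7, 8, 9})]

print(is_growth(clu_set1, clu_set2))  # True
print(is_growth(clu_set2, clu_set1))  # False
```

When one cluster set is a growth of another, the smaller clusters can be stored as children of the clusters that contain them, which is the intuition behind a hierarchical rather than independent storage of cluster sets.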
Integrated vs. Independent Maintenance of Predicted Views
IWIN: Integrated maintenance
Extra-N: Independent maintenance
Benefits of Integrated Maintenance
Benefits for Memory Resources: The memory space needed to store the cluster sets identified by multiple queries in QG is independent of |QG|.
Benefits for Computational Resources: Multiple cluster sets stored in the hierarchical cluster structure (which are usually similar) can be maintained incrementally, rather than independently.
IWIN outperforms Extra-N in both CPU and memory utilization.
Goals
1. A more efficient density-based clustering algorithm over streams.
2. An evolution semantics that intuitively explains cluster changes.
3. A visualized pattern space allowing interactive exploration of clusters.
Why do we need evolution semantics?
Analysts need to know how clusters change over time.
This is hard to observe by looking at the clusters only (even with visualization).
Questions a commander might ask:
History: Did any clusters merge?
Now: Are there any new clusters?
Future: Will any cluster break apart shortly?
Proposed Semantics
Single-Step Evolutions: birth, termination, split, merge, preserve/expand/shrink
Multi-Step Evolutions: split-expand, split-merge, shrink-split
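The single-step labels can be illustrated by comparing the cluster sets of two consecutive windows. The slides do not give formal definitions, so everything here is an assumption: clusters are related when they share members, and a one-to-one continuation is called expand, shrink or preserve by comparing sizes. This is a direct after-the-fact comparison, not the on-the-fly computation during cluster maintenance that the talk proposes.

```python
def single_step_evolutions(old_clusters, new_clusters):
    """Return (label, old_indices, new_indices) tuples for one window slide."""
    events = []
    # Which new clusters share members with each old cluster, and vice versa
    old_to_new = [[j for j, n in enumerate(new_clusters) if o & n] for o in old_clusters]
    new_to_old = [[i for i, o in enumerate(old_clusters) if o & n] for n in new_clusters]

    for i, successors in enumerate(old_to_new):
        if not successors:
            events.append(("termination", [i], []))
        elif len(successors) > 1:
            events.append(("split", [i], successors))

    for j, predecessors in enumerate(new_to_old):
        if not predecessors:
            events.append(("birth", [], [j]))
        elif len(predecessors) > 1:
            events.append(("merge", predecessors, [j]))
        elif len(old_to_new[predecessors[0]]) == 1:
            # One-to-one continuation: label by size change (assumed rule)
            i = predecessors[0]
            old_size, new_size = len(old_clusters[i]), len(new_clusters[j])
            if new_size > old_size:
                label = "expand"
            elif new_size < old_size:
                label = "shrink"
            else:
                label = "preserve"
            events.append((label, [i], [j]))
    return events

# Hypothetical clusters before and after a slide
old = [{1, 2, 3}, {4, 5, 6, 7}, {8, 9}]
new = [{1, 2, 3, 10}, {4, 5}, {6, 7}, {30, 31}]
events = single_step_evolutions(old, new)
for event in events:
    print(event)
```

In this example the first old cluster expands, the second splits in two, the third terminates, and the last new cluster is a birth.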
How to Compute
Extract the predicted evolution (before the window slides).
Update the evolution (after the window slides).
[Figure: example transitions labeled preserve, split, preserve, shrink]
Conclusion for Proposed Semantics
1. Intuitively describes cluster evolution over time.
2. Easily maintainable: can be computed on-the-fly during cluster maintenance.
Goals
1. A more efficient density-based clustering algorithm over streams.
2. An evolution semantics that intuitively explains cluster changes.
3. A visualized pattern space allowing interactive exploration of clusters.
Outline
1. What is Neighbor-Based Pattern Detection
2. State-of-Art
3. Potential Solutions & Their Inefficiency
4. Proposed Solution: Extra-N
5. Experimental Study
6. Conclusion
Why needed?
Analysts need to navigate along the time axis to learn the current state, review the history, and predict the near future.
Example: how are the two clusters in the current window related to those detected 30 minutes ago?
Analysts need to study the clusters and their evolution at different abstraction levels.
Example: for routine traffic monitoring, only the positions of major clusters need to be reported; when an accident happens, specific information about cluster members needs to be reported.
Proposed Pattern Space
Evaluation for IWIN
Alternative Methods:
1. Incremental DBSCAN [Ester98]
2. Extra-N [Yang09]
3. IWIN
Real Streaming Data:
1. GMTI data recording information about moving vehicles [Mitre08].
2. STT data recording stock transactions from NYSE [INETATS08].
Measurements:
1. Average processing time for each tuple.
2. Memory footprint.
Evaluation for IWIN
Case Study 1
Case Study 2
Conclusion
1. Presented the first unified framework supporting interactive exploration of density-based clusters in streaming windows.
2. Designed a more efficient density-based clustering algorithm, IWIN.
3. Defined the first evolution semantics for density-based clusters.
4. Our experimental study confirms both the efficiency and effectiveness of our proposed framework.
Future work
Support multiple queries.
Support other pattern types, such as outliers, association rules…
Support pattern storage and matching.
More?
The End
Thanks