Framework for real-time clustering over sliding windows

Presentation transcript:

Framework for real-time clustering over sliding windows
Sobhan Badiozamany, Kjell Orsborn, Tore Risch
Uppsala University, Sweden
Emails: firstname.lastname@it.uu.se

Outline
Why clustering over sliding windows is interesting
State-of-the-art solutions
Our contributions:
  SBM
  Generic state maintenance using contexts
  Contextualized indexing
Results
Related work

Clustering over data streams
Online data stream analysis
Monitoring the distribution of moving objects, e.g. urban traffic monitoring
Spatio-temporal event monitoring, e.g. detecting major events using social media
Give a typical application example of clustering over a sliding window.

Sliding window characteristics
Sliding windows capture the evolving behavior of data streams.
W2,12 overlaps heavily with W0,10 (gray portions).
Building W2,12 from W0,10: W10,12 is new data and W0,2 is expired data.

GROUPBY queries over sliding windows
First phase: summarize the small blocks. This produces partial summaries.
Second phase: reuse the summary of W0,10 to produce W2,12. The green partial summary is merged (incremental, add #cars) and the red one is excluded (decremental, subtract #cars).

Summary of W0,10:
Road  #cars
E4    10
E20   20
E18   30
E10   15

Excluded partial summary (red):
Road  #cars
E10   3
E18   5

Merged partial summary (green):
Road  #cars
E4    2
E18   10

Resulting W2,12:
Road  #cars
E4    12
E20   20
E18   35
E10

Group memberships are identified by the distinct values of the Road attribute, so group membership is deterministic and only the aggregates are updated.
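The second phase above can be sketched in a few lines of Python. This is only an illustration: the function name `slide_window` is hypothetical, and reading the (E10 3, E18 5) table as the expired block and the (E4 2, E18 10) table as the new block is an inference from the E4 and E18 counts shown for W2,12. The E10 count the code produces is just what that arithmetic yields.

```python
from collections import Counter


def slide_window(window, new_block, expired_block):
    """Differential maintenance: add the new block, subtract the expired one."""
    result = Counter(window)
    result.update(new_block)          # incremental step (merge)
    result.subtract(expired_block)    # decremental step (exclude)
    # Drop groups whose count has fallen to zero or below.
    return Counter({road: n for road, n in result.items() if n > 0})


# Counts taken from the slide; which block is "new" and which is "expired"
# is an assumption (see the paragraph above).
w_0_10 = Counter({"E4": 10, "E20": 20, "E18": 30, "E10": 15})
expired = Counter({"E10": 3, "E18": 5})
new = Counter({"E4": 2, "E18": 10})

w_2_12 = slide_window(w_0_10, new, expired)
# w_2_12 holds E4: 12, E20: 20, E18: 35, E10: 12
```

The sketch works only because group membership is fixed by the Road value, which is exactly what does not hold for clustering.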

Clustering queries over sliding windows
Clustering is dynamic because grouping is based on similarity: groups might merge, and groups might split.
For many clustering algorithms the exclude function either does not exist (e.g. BIRCH), or exists (e.g. [Ester et al. 1998]) but has been shown to be very expensive [Yang et al. 2009].
We therefore have to rely only on the merge function to maintain clusters over sliding windows.

Window Partition Ratio (PR)
Partition ratio PR = the number of partial summaries that comprise a window. Here PR = 5.
Higher PR -> finer-grained slides -> real-time change tracking.
Scaling PR is desirable for many queries.

Repetitive Merge [Guha et al. 2000] [Babcock et al. 2003]
Only uses merging; no exclude is needed.
Maintains PR windows in parallel. Each arriving partial summary, e.g. W8,10, is merged into PR windows.
W0,10 = merge(W0,2, W2,4, W4,6, W6,8, W8,10)
W2,12 = merge(W2,4, W4,6, W6,8, W8,10, W10,12)
W4,14 = merge(W4,6, W6,8, W8,10, W10,12, W12,14)
W6,16 = merge(W6,8, W8,10, W10,12, W12,14, W14,16)
W8,18 = merge(W8,10, W10,12, W12,14, W14,16, W16,18)
[Figure: partial summaries W0,2, W2,4, W4,6, W6,8, W8,10, W10,12, W12,14, W14,16, W16,18 along the time axis t = 1 to 18.]
The number of merges per slide: PR
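A minimal sketch of the repetitive-merge idea, assuming summaries can be combined with a plain `merge` function. The function name, the `empty` factory and the use of `Counter` objects as stand-ins for cluster summaries are all illustrative assumptions, not the implementation from the cited papers.

```python
from collections import Counter, deque


def repetitive_merge(partials, pr, merge, empty):
    """Keep up to PR windows open and merge every arriving partial into each."""
    open_windows = deque()        # entries: [summary, number of partials merged]
    results = []                  # completed windows, oldest first
    merges_per_slide = []
    for partial in partials:
        open_windows.append([empty(), 0])    # a new window starts at every slide
        merges = 0
        for window in open_windows:
            window[0] = merge(window[0], partial)
            window[1] += 1
            merges += 1
        merges_per_slide.append(merges)
        if open_windows[0][1] == pr:         # oldest window is now complete
            results.append(open_windows.popleft()[0])
    return results, merges_per_slide


# Nine partial summaries, as in the slide (W0,2 ... W16,18), with PR = 5.
blocks = [Counter({i: 1}) for i in range(9)]
windows, merges = repetitive_merge(blocks, 5, lambda a, b: a + b, Counter)
```

Once the first window has filled up, every slide costs PR merges, which is the cost the slide states.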

Sliding Binary Merge (SBM)
Uses a lattice to represent temporal relationships between window instances in terms of their time intervals.
Here window range: 32, window stride: 4 -> PR = 8.
[Figure: SBM lattice over the time axis t = 4 to 52. Bottom level: partial summaries W0,4, W4,8, ..., W48,52. Second level: W0,8, W4,12, ..., W44,52. Third level: W0,16, W4,20, ..., W36,52. Top level: full windows W0,32, W4,36, ..., W20,52.]
The number of merges per slide: log2 PR
Old nodes should be removed: gray nodes have already been removed, and red nodes are being removed right now (t = 52).
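The lattice can be sketched as follows, under simplifying assumptions that are not taken from the paper: PR is a power of two, blocks are identified by their index rather than by timestamps, and `Counter` objects stand in for cluster summaries. The class name `SlidingBinaryMerge` and its methods are hypothetical. At each slide, the node of length 2^k ending at the newest block is built from two nodes of length 2^(k-1), giving log2 PR merges.

```python
from collections import Counter


class SlidingBinaryMerge:
    """Sketch of a binary-merge lattice; assumes PR is a power of two."""

    def __init__(self, pr, merge):
        if pr < 1 or pr & (pr - 1):
            raise ValueError("this sketch only handles PR = 1, 2, 4, 8, ...")
        self.merge = merge
        self.levels = pr.bit_length() - 1       # log2(PR)
        # nodes[k][e]: summary of the 2**k partial blocks ending at block e
        self.nodes = [{} for _ in range(self.levels + 1)]
        self.t = -1                             # index of the latest block
        self.merge_count = 0

    def slide(self, partial):
        """Add one partial summary; return the full window, or None in warm-up."""
        self.t += 1
        e = self.t
        self.nodes[0][e] = partial
        for k in range(1, self.levels + 1):
            left = self.nodes[k - 1].get(e - 2 ** (k - 1))
            if left is None:                    # not enough history yet
                break
            self.nodes[k][e] = self.merge(left, self.nodes[k - 1][e])
            self.merge_count += 1
        # Remove nodes that can no longer be the left half of a future merge.
        for k in range(self.levels):
            self.nodes[k].pop(e - 2 ** k, None)
        self.nodes[self.levels].pop(e - 1, None)  # keep only the newest window
        return self.nodes[self.levels].get(e)


# Thirteen blocks with PR = 8, mirroring W0,4 ... W48,52 on the slide.
sbm = SlidingBinaryMerge(pr=8, merge=lambda a, b: a + b)
windows = [sbm.slide(Counter({i: 1})) for i in range(13)]
```

With PR = 8 the sketch performs three merges per slide in steady state and keeps eight nodes alive. Handling window sizes that are not a power of two is left out here.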

Properties of SBM
Reduces the number of merges per slide from PR to log2 PR.
Only slightly higher memory footprint compared to Repetitive Merge.
Supports arbitrary window sizes.
Proof: … Or maybe read the paper :-)

How to use SBM in a framework?
The Generic 2-phase Continuous Summarization (G2CS) framework generalizes the GROUPBY frameworks to support clustering.
Each node in the lattice represents a window instance that holds a number of clusters.
In G2CS a window instance is represented by its time interval, also called its context.
Contexts are objects that are managed by G2CS.

Contextualizing the window state
Each context contains a number of clusters, each having an arbitrarily complex structure.
G2CS uses a uniform schema that represents all clusters in all window instances: the Contextualized Clustering Table (CCT).
CCT(cid, cxtid, a1, …, an)
  cid is the cluster identifier
  cxtid is the context identifier
  a1, …, an are algorithm specific
For the BIRCH clustering algorithm the algorithm-specific attributes are:
  LS: linear sum
  SS: squared sum
  N: number of points
  CM: center of mass

cid  cxtid   LS  SS  N  CM
1    {0,8}   …
2
3
4    {8,12}
5

A context identifies the partition of the CCT that contains its window instance data.
A node in the SBM lattice corresponds to a partition in the CCT.
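A rough sketch of CCT rows for BIRCH, assuming the usual BIRCH cluster feature in which N, LS and SS are additive and SS is the scalar sum of squared norms. The names `CCTRow`, `row_from_points`, `merge_rows` and `partition` are hypothetical and only illustrate the schema above. They are not G2CS APIs.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CCTRow:
    cid: int                      # cluster identifier
    cxtid: Tuple[int, int]        # context identifier: the window's time interval
    n: int                        # N: number of points
    ls: Tuple[float, ...]         # LS: per-dimension linear sum
    ss: float                     # SS: sum of squared norms (assumed scalar)

    @property
    def cm(self):
        """CM: center of mass, derived from LS and N."""
        return tuple(x / self.n for x in self.ls)


def row_from_points(cid, cxtid, points):
    """Summarize raw points into one cluster-feature row."""
    ls = tuple(sum(dim) for dim in zip(*points))
    ss = sum(x * x for p in points for x in p)
    return CCTRow(cid, cxtid, len(points), ls, ss)


def merge_rows(a, b, cid, cxtid):
    """Merge two cluster features by adding N, LS and SS component-wise."""
    ls = tuple(x + y for x, y in zip(a.ls, b.ls))
    return CCTRow(cid, cxtid, a.n + b.n, ls, a.ss + b.ss)


def partition(cct, cxtid):
    """All rows of the table that belong to one context."""
    return [row for row in cct if row.cxtid == cxtid]


cct = [
    row_from_points(1, (0, 8), [(1.0, 2.0), (3.0, 4.0)]),
    row_from_points(4, (8, 12), [(5.0, 6.0)]),
]
merged = merge_rows(cct[0], cct[1], cid=6, cxtid=(0, 12))
```

Because these features are additive, merging is cheap per pair of clusters. No matching subtraction step is sketched, in line with the slides' point that BIRCH has no exclude function.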

Generic 2-phase Continuous Summarization (G2CS) framework
G2CS modularizes the solution.
The clustering algorithm plug-ins (red) operate on system-managed contexts; the merger is the most expensive.
G2CS provides transparent indexing per context, i.e. per partition in the CCT.
SBM is implemented in the final summarizer.
[Figure: G2CS architecture. Labels: Continuous Summarization Queries; Stream; Partial Summarizer; Final Summarizer; plug-ins adder, copier, merger, excluder, reporter; Continuous Summary; Main Memory Data Manager; Context Manager; Contextualized index manager.]

Why is indexing needed?
The expensive merger plug-in receives two sets of clusters to merge, here black and green.
It often performs a nearest neighbor search to form links between micro-clusters: for each green micro-cluster, we need to find the closest black micro-cluster.
Multi-dimensional indexing on the set of black micro-clusters helps.

Contextualized indexing
The nearest neighbor search in the merger always has a bound context, e.g. for each green micro-cluster a search in the black context is done.
Many contexts -> many X-trees -> hard to find “the one”.
Two-layered index:
  Global hash index on the context id, cxtid
  Local spatial index on each context's data
[Figure: the CCT with columns cid, cxtid, …, ai (rows 1 to 6 with values ai1 to ai6); a hash index on cxtid pointing to one X-tree per context, e.g. the X-tree containing the black micro-clusters and the X-tree containing ai4, ai5, and ai6.]
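The two-layer idea can be sketched like this. A tiny k-d tree is used as a stand-in for the X-tree named on the slide, since the point is only the layering: a hash lookup on cxtid selects one local spatial index, and the nearest neighbor search then runs inside that context alone. All class and method names, and the sample coordinates, are hypothetical.

```python
from math import dist


class KDTree:
    """Tiny static k-d tree over (cid, point) pairs; a stand-in for an X-tree."""

    def __init__(self, items, depth=0):
        self.axis = depth % len(items[0][1])
        items = sorted(items, key=lambda it: it[1][self.axis])
        mid = len(items) // 2
        self.item = items[mid]
        left, right = items[:mid], items[mid + 1:]
        self.left = KDTree(left, depth + 1) if left else None
        self.right = KDTree(right, depth + 1) if right else None

    def nearest(self, q, best=None):
        """Return (distance, cid) of the stored point closest to q."""
        cid, p = self.item
        d = dist(p, q)
        if best is None or d < best[0]:
            best = (d, cid)
        diff = q[self.axis] - p[self.axis]
        near, far = (self.left, self.right) if diff < 0 else (self.right, self.left)
        if near is not None:
            best = near.nearest(q, best)
        # Only cross the splitting plane if a closer point could lie beyond it.
        if far is not None and abs(diff) < best[0]:
            best = far.nearest(q, best)
        return best


class ContextualizedIndex:
    """Global hash index on cxtid, one local spatial index per context."""

    def __init__(self):
        self.by_context = {}

    def build(self, cxtid, clusters):
        self.by_context[cxtid] = KDTree(list(clusters))

    def drop(self, cxtid):
        self.by_context.pop(cxtid, None)

    def nearest(self, cxtid, point):
        return self.by_context[cxtid].nearest(point)[1]


index = ContextualizedIndex()
index.build((0, 8), [(1, (0.0, 0.0)), (2, (5.0, 5.0)), (3, (10.0, 0.0))])  # "black"
green = [(4, (1.0, 1.0)), (5, (6.0, 4.0)), (6, (9.0, 1.0))]
links = {cid: index.nearest((0, 8), centre) for cid, centre in green}
# links == {4: 1, 5: 2, 6: 3}
```

Dropping a context removes its whole local index in one step, which fits the way SBM discards old lattice nodes.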

Experimental results, GROUPBY
No contextualized indexing.
Conventional GROUPBY has a very efficient exclude method.
Synthetic data.
Differential Maintenance (DM) takes constant time, SBM scales logarithmically, and RM scales linearly.

Experimental results, Indexing
AC: the average number of clusters per window instance.
SBM with contextualized indexing scales logarithmically with AC; without an index it scales quadratically.

Experimental results, Real data
BIRCH clustering on real data from a soccer game.
As PR is scaled, AC is also scaled.
SBM is significantly better than RM.
The gain from indexing is limited in RM (15%) due to intensive copying, compared to a 60% gain from indexing in SBM.

Experimental results, Memory utilization
Backup slide for memory utilization.

Experimental results, Work breakdown
The copier plug-in dominates in RM: all window instances have the full extent of the window -> more data to copy -> indexing does not help.
The merger plug-in dominates in SBM: the copier is relatively cheap because most nodes in the lattice cover a short extent -> less data to copy -> indexing helps.
Low G2CS overhead.

References
Citations in the slides use the form [lastname et al. year].
Repetitive Merge is used in the following papers:
[1] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams," in Proceedings of Foundations of Computer Science conference, Redondo Beach, CA, 2000, pp. 359-366.
[2] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan, "Maintaining variance and k-medians over data stream windows," in SIGMOD conf., San Diego, 2003, pp. 234-243.
Decremental DBSCAN:
[3] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, and X. Xu, "Incremental clustering for mining in a data warehousing environment," in VLDB conf., New York, 1998, pp. 323-333.
Why decremental clustering algorithms are not suitable for streaming:
[4] D. Yang, E. A. Rundensteiner, and M. O. Ward, "Neighbor-based pattern detection for windows over streaming data," in EDBT conf., Saint Petersburg, 2009, pp. 529-540.
BIRCH:
[5] T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: an efficient data clustering method for very large databases," in SIGMOD conf., Montreal, 1996, pp. 103-114.
