A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad.

A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad

Outline Background of Clustering Motivation for Clustering over Streaming Data. Overall Solution Micro Clusters Pyramid Time Frame Macro Cluster Cluster Maintenance

Background of Clustering Definition of Clustering  For a given set of data points, partitioning them into one or more groups of similar objects.  “Similarity” is often defined with the use of some distance measure. Difference between “group by” queries and clustering.

Background of Clustering Some of the most popular clustering algorithms:  K- Means, BIRCH, CURE, Density Based Clustering. Clustering has many applications in data bases, information visualization, data mining. What are Oultiers?

Motivation Challenge in Streaming Environment:  Clustering is an expensive process.  Resource constraints.  Infinite streams. Can simply extending one pass algorithms for static databases to stream processing suffice?

Motivation Requirements of clustering for stream processing:  Statistical summary information storage.  Efficient update process.  Ability to cluster for a specific time horizon,

Overall Solution of the Paper Divide the clustering process to two phases Online Component: periodically stores detailed summary statistics Offline Component uses only the summary statistics to do clustering

Micro-Clusters What is a Micro-Cluster A Micro-Cluster is a set of individual data points that are close to each other and will be treated as a single unit in further offline Macro-clustering. View of Micro-ClusterView of Macro-Cluster

Micro-Clusters What to Store in a Micro-Cluster = Key idea: Additivity Property

Pyramidal Time Frame The snapshots follow a pyramidal pattern … … When should we make the snapshot? The micro-clusters are stored at snapshots. Snapshot

Pyramidal Time Frame Snapshots are classified into different orders which can vary from 1 to log α(T). For example, T is 55, α=2, then we have orders 0 with interval 2^0=1, order 1 with interval 2^1=2, order 2 with interval 2^2=4, order 3 with interval 2^3=8, order 4 with interval 2^4=16, order 5 with interval 2^5=32. For a data stream the maximum number of snapshots maintained at T time units since the beginning of the stream mining process is (α + 1) log α(T). (α + 1 for each order)

Why Pyramidal Pattern? For any user-specified time window of h, at least one stored snapshot can be found within 2 h units of the current time. Please Note: Only Approximate Answers!!!

Micro Cluster Creation It is assumed that a total of q micro- clusters are maintained at any moment by the algorithm. This is done using an offline process (k- means) at the very beginning of the data stream computation process.

Online Micro Cluster Maintenance How to deal with a new coming point? 1. Join one of the old cluster 2. Create a new cluster by its own How to deal with the old clusters 1. Delete them (based on relevance stamp) 2. Merge them (merge the closest two) A merged cluster will have all the IDs its components have

Macro-Cluster Creation Based on the Additivity Property of cluster feature vector

Macro-Cluster Creation Current Time T, the window size is h. That means the user want to find the clusters formed in (T-h, T). Approach: 1. 1st step: Find the snapshot for T, get the micro-cluster set S(T). 2. 2nd step: Find the snapshot for T-h, get the micro-cluster set S(T-h). 3. Use S(T)-S(T-h) Specifically, we have a merged cluster with Id list (C1, C2, C3) in S(T) and a cluster with Id C1 in S(T-h). Then the we use CFT(C1,C2,C3)-CFT(C1)=CFT(C2,C3), because C1 are formed before T-h, thus should not contribute to the micro-cluster formed in (T-h,T)

Example C_ID: [C1] Time: T-h C_ID: [C1, C2, C3] Time: T C_ID: [C2, C3] Result: T-h

Macro-Cluster Creation Run K-means on Micro-Clusters

How do you feel about this paper? My feeling: Quite Fuzzy Results: Approximation is every where. Nothing New: Micro-Clusters, K-means, Cluster Feature Vectors, Pyramidal Time Frame are all old stuffs.

Counter Example C_ID: [C2] C_ID: [C1, C2, C3] Time: T C_ID: [C1, C3] Time: T-h Result

Advertisement Di and Charu’s project deals with: 1. Deterministic Clusters 2. Clusters with Arbitrary Shapes 3. Real Expirations 4. Disk Version 5. Outlier Detection by Free

A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad.

Similar presentations

Presentation on theme: "A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad.

Similar presentations

Presentation on theme: "A Framework for Clustering Evolving Data Streams Charu C. Aggarwal, Jiawei Han, Jianyong Wang, Philip S. Yu Presented by: Di Yang Charudatta Wad."— Presentation transcript:

Similar presentations

About project

Feedback