A Framework for Clustering Evolving Data Streams

1 A Framework for Clustering Evolving Data Streams
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu. Proc. Int. Conf. on Very Large Data Bases (VLDB'03). 2018/12/9. Presenter: 吳建良

2 Outline
- Cluster analysis: a general overview
- Developed methodology
  - Micro-cluster analysis and maintenance
  - Macro-cluster analysis
  - Evolution analysis
- Empirical results

3 Cluster analysis: A general overview
What is cluster analysis? Grouping a set of data objects into a set of clusters such that the intra-cluster similarity is high and the inter-cluster similarity is low.
New requirements in stream clustering:
- Generate high-quality clusters in one scan
- High-quality, efficient incremental clustering
- Analysis should handle multi-dimensional space
- Provide flexibility to compute clusters over user-defined time periods

4 Developed methodology: Outline
Divide the clustering process into online and offline components.
Online: periodically stores summary statistics about the stream data
- Micro-clustering: better quality than k-means
- Online processing and maintenance
- Pyramidal time window: registers dynamic changes
Offline: answers various user queries based on the stored summary statistics

5 Clustering Feature Vector
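Originated from BIRCH.
Clustering Feature: CF = (N, LS, SS)
- N: number of data points
- LS: the linear sum of the points, $LS = \sum_{i=1}^{N} X_i$
- SS: the sum of squares of the points, $SS = \sum_{i=1}^{N} X_i^2$
Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))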
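The example on this slide can be checked with a few lines of Python. This is only a sketch: the function names `clustering_feature` and `add_cf` are made up for illustration and do not come from BIRCH or the paper.

```python
from typing import List, Sequence, Tuple

# (N, LS, SS): count, per-dimension linear sum, per-dimension sum of squares
CF = Tuple[int, List[float], List[float]]


def clustering_feature(points: Sequence[Sequence[float]]) -> CF:
    """Return (N, LS, SS) for a non-empty list of d-dimensional points."""
    d = len(points[0])
    ls = [sum(p[j] for p in points) for j in range(d)]
    ss = [sum(p[j] ** 2 for p in points) for j in range(d)]
    return len(points), ls, ss


def add_cf(a: CF, b: CF) -> CF:
    """Additivity: the CF of two disjoint point sets is the component-wise sum."""
    return (
        a[0] + b[0],
        [x + y for x, y in zip(a[1], b[1])],
        [x + y for x, y in zip(a[2], b[2])],
    )


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)  # (5, [16, 30], [54, 190])
```

Because every component is a plain sum, two clusters can be merged without revisiting their points, which is what makes the summary usable on a stream.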

6 Micro-Clusters: Design Methodology
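Data streams: multi-dimensional points $X_1, \ldots, X_k, \ldots$ with time stamps $T_1, \ldots, T_k, \ldots$
Each point contains d dimensions, i.e., $X_i = (x_i^1, \ldots, x_i^d)$
A micro-cluster for n points is defined as a (2*d + 3) tuple $(CF2^x, CF1^x, CF2^t, CF1^t, n)$:
- $CF2^x$: the sum of the squares of the data values (d entries, one per dimension)
- $CF1^x$: the sum of the data values (d entries, one per dimension)
- $CF2^t$: the sum of the squares of the time stamps
- $CF1^t$: the sum of the time stamps
- $n$: the number of data points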
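As a rough illustration of the (2*d + 3) layout, the sketch below keeps the five components in a small Python class. The class name `MicroCluster`, its field names and the methods `empty`, `insert` and `as_tuple` are assumptions chosen for this example, not identifiers from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class MicroCluster:
    cf2x: List[float]   # per-dimension sum of squared data values (d entries)
    cf1x: List[float]   # per-dimension sum of data values (d entries)
    cf2t: float = 0.0   # sum of squared time stamps
    cf1t: float = 0.0   # sum of time stamps
    n: int = 0          # number of data points

    @classmethod
    def empty(cls, d: int) -> "MicroCluster":
        return cls(cf2x=[0.0] * d, cf1x=[0.0] * d)

    def insert(self, x: Sequence[float], t: float) -> None:
        """Absorb one point x that arrived at time stamp t."""
        for j, v in enumerate(x):
            self.cf2x[j] += v * v
            self.cf1x[j] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def as_tuple(self) -> Tuple[float, ...]:
        """Flatten to the (2*d + 3) tuple."""
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n)


mc = MicroCluster.empty(d=2)
mc.insert((3, 4), t=1)
mc.insert((2, 6), t=2)
print(len(mc.as_tuple()))  # 7, i.e. 2*2 + 3
```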

7 Pyramidal Time Frame Snapshots
The micro-clusters are also stored as snapshots at particular moments in the stream.
Snapshots are classified into different frame numbers, which can vary from 0 to log(T), where T is the clock time elapsed since the beginning of the stream.
The frame number of a particular class of snapshots defines the level of granularity in time at which the snapshots are maintained.

8 Maintain Snapshot Frame Table
Rules for inserting a snapshot taken at time t into the frame table:
- If $(t \bmod \alpha^{i}) = 0$ but $(t \bmod \alpha^{i+1}) \neq 0$, t is inserted into frame number i
- Each slot has a max_capacity. If the slot has already reached its max_capacity, the oldest snapshot is removed and the new snapshot is inserted
Example: $\alpha = 2$, max_capacity = 3
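The insertion rule can be sketched as follows, using the slide's example values α = 2 and max_capacity = 3. The function names are hypothetical, and for brevity the table stores only the clock time of each snapshot, whereas a real snapshot would hold the micro-clusters themselves.

```python
from collections import deque
from typing import Deque, Dict


def frame_number(t: int, alpha: int = 2) -> int:
    """Largest i with t mod alpha^i == 0, i.e. t mod alpha^(i+1) != 0."""
    if t < 1 or alpha < 2:
        raise ValueError("expects t >= 1 and alpha >= 2")
    i = 0
    while t % (alpha ** (i + 1)) == 0:
        i += 1
    return i


def insert_snapshot(table: Dict[int, Deque[int]], t: int,
                    alpha: int = 2, max_capacity: int = 3) -> None:
    """Store snapshot time t in its frame; a full slot drops its oldest entry."""
    slot = table.setdefault(frame_number(t, alpha), deque(maxlen=max_capacity))
    slot.append(t)  # deque(maxlen=...) discards the oldest item automatically


table: Dict[int, Deque[int]] = {}
for t in range(1, 11):
    insert_snapshot(table, t)

for frame in sorted(table):
    print(frame, list(table[frame]))
# 0 [5, 7, 9]
# 1 [2, 6, 10]
# 2 [4]
# 3 [8]
```

Recent times are covered at fine granularity (frame 0) and older times only at coarse granularity (higher frames), while the total number of stored snapshots grows logarithmically with T.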

9 Micro-clusters Maintenance
The micro-clustering stage is the online, statistical data collection step; it does not depend on user input.
Initial creation of q micro-clusters M1 … Mq:
- Use the k-means clustering algorithm
- q is usually significantly larger than the number of natural clusters
- q is determined by the amount of available memory
Each micro-cluster is assigned a unique id when it is created.

10 Incremental Update of Micro-clusters
When a new data point Xik arrives, it is either added to an existing micro-cluster or a new micro-cluster is created.
- If Xik falls within the maximum boundary of its closest micro-cluster Mp, Xik is added to Mp
  - Maximum boundary: the RMS deviation of the data points in Mp from its centroid
  - RMS deviation: $\sqrt{\frac{1}{n}\sum_{j=1}^{n} \lVert X_j - c \rVert^2}$, where c is the centroid of Mp and n is the number of points in Mp
- Otherwise, a new micro-cluster is created for Xik
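A minimal sketch of this absorb-or-create decision is given below. The RMS deviation is computed directly from the cluster sums. The names `MicroCluster` and `update` are illustrative, and `singleton_boundary` is an assumption added here because a one-point cluster has an RMS deviation of zero; the slide does not say how that case is handled.

```python
import math
from typing import List, Sequence


class MicroCluster:
    def __init__(self, x: Sequence[float]) -> None:
        self.n = 1
        self.ls = [float(v) for v in x]        # linear sum per dimension
        self.ss = [float(v) ** 2 for v in x]   # sum of squares per dimension

    def add(self, x: Sequence[float]) -> None:
        self.n += 1
        for j, v in enumerate(x):
            self.ls[j] += v
            self.ss[j] += v * v

    def centroid(self) -> List[float]:
        return [s / self.n for s in self.ls]

    def rms_deviation(self) -> float:
        # mean squared distance to the centroid = sum_j (SS_j/n - (LS_j/n)^2)
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))  # guard against tiny negative round-off


def update(clusters: List[MicroCluster], x: Sequence[float],
           singleton_boundary: float = 1.0) -> bool:
    """Return True if x was absorbed, False if a new micro-cluster was created."""
    if clusters:
        closest = min(clusters, key=lambda c: math.dist(c.centroid(), x))
        # Assumption: fall back to a fixed boundary for one-point clusters.
        boundary = closest.rms_deviation() if closest.n > 1 else singleton_boundary
        if math.dist(closest.centroid(), x) <= boundary:
            closest.add(x)
            return True
    clusters.append(MicroCluster(x))
    return False


mc = MicroCluster((0, 0))
for p in [(2, 0), (0, 2), (2, 2)]:
    mc.add(p)
clusters = [mc]
rms_before = mc.rms_deviation()           # sqrt(2) for these four points
absorbed = update(clusters, (1.5, 1.5))   # inside the boundary
created = not update(clusters, (10, 10))  # far away, so a new cluster is made
```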

11 Incremental Update of Micro-clusters (Contd.)
When a new micro-cluster is created, either an old micro-cluster is deleted or the two closest micro-clusters are merged.
- A micro-cluster is deleted whenever the average time stamp of its last m points is less than a given threshold
- Otherwise, the two closest micro-clusters are merged by adding their corresponding cluster feature vectors
- An idlist recording the ids of the two merged micro-clusters is created

12 Macro-Cluster Creation
Macro-clusters are created over a user-specified time horizon h.
Let
- S(tc): the set of micro-clusters at time tc
- S(tc-h): the set of micro-clusters at time tc-h
The new set of micro-clusters N(tc-h) is created by subtracting S(tc-h) from S(tc).
Subtractive property: let C1 and C2 be two sets of points such that $C_1 \supseteq C_2$. Then $CF(C_1 - C_2) = CF(C_1) - CF(C_2)$.
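The subtraction of two snapshots can be sketched as below. Snapshots are modelled as dictionaries from micro-cluster id to (N, LS, SS). Matching clusters by a single id is a simplifying assumption for this sketch (it ignores merged clusters and their idlists), and `subtract_snapshots` is a hypothetical name.

```python
from typing import Dict, List, Tuple

CF = Tuple[int, List[float], List[float]]  # (N, LS, SS)


def subtract_snapshots(current: Dict[int, CF], past: Dict[int, CF]) -> Dict[int, CF]:
    """Approximate N(tc-h): statistics of the points that arrived within the horizon."""
    result: Dict[int, CF] = {}
    for cid, (n, ls, ss) in current.items():
        if cid in past:
            pn, pls, pss = past[cid]
            n = n - pn
            ls = [a - b for a, b in zip(ls, pls)]
            ss = [a - b for a, b in zip(ss, pss)]
        if n > 0:  # drop clusters that received no points in the horizon
            result[cid] = (n, ls, ss)
    return result


s_now = {
    1: (5, [16, 30], [54, 190]),  # (3,4), (2,6), (4,5), (4,7), (3,8)
    2: (2, [1, 1], [1, 1]),       # created inside the horizon
    3: (4, [8, 8], [16, 16]),     # unchanged since the older snapshot
}
s_past = {
    1: (2, [5, 10], [13, 52]),    # (3,4), (2,6)
    3: (4, [8, 8], [16, 16]),
}
n_h = subtract_snapshots(s_now, s_past)
print(n_h[1])  # (3, [11, 20], [41, 138]), the CF of (4,5), (4,7), (3,8)
```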

13 Macro-Cluster Creation (Contd.)
- Each micro-cluster in N(tc-h) is treated as a pseudo-point
- Each pseudo-point has a weight proportional to the number of points inside it
- A k-means clustering approach is applied to this set of pseudo-points in order to create higher-level macro-clusters
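One way to picture this step is a weighted k-means over the micro-cluster centroids, where each centroid counts as many times as the micro-cluster has points. The sketch below is an illustration under assumptions, not the paper's procedure: the name `weighted_kmeans` is hypothetical, and seeding with the k heaviest pseudo-points is a deterministic choice made only to keep the example reproducible.

```python
import math
from typing import List, Sequence


def weighted_kmeans(points: Sequence[Sequence[float]], weights: Sequence[float],
                    k: int, iters: int = 20) -> List[List[float]]:
    """Cluster pseudo-points (micro-cluster centroids) weighted by point counts."""
    d = len(points[0])
    # Assumption: seed with the k heaviest pseudo-points.
    order = sorted(range(len(points)), key=lambda i: -weights[i])
    centers = [[float(v) for v in points[i]] for i in order[:k]]
    for _ in range(iters):
        sums = [[0.0] * d for _ in range(k)]
        totals = [0.0] * k
        for p, w in zip(points, weights):
            j = min(range(k), key=lambda c: math.dist(p, centers[c]))
            totals[j] += w
            for a in range(d):
                sums[j][a] += w * p[a]
        # Weighted mean per cluster; an empty cluster keeps its old center.
        new_centers = [
            [s / totals[j] for s in sums[j]] if totals[j] > 0 else centers[j]
            for j in range(k)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers


pseudo_points = [(0, 0), (0, 2), (10, 10), (10, 12)]  # micro-cluster centroids LS/n
counts = [1, 3, 2, 2]                                 # points per micro-cluster
macro = weighted_kmeans(pseudo_points, counts, k=2)
print(macro)  # [[0.0, 1.5], [10.0, 11.0]]
```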

14 Evolution Analysis of Micro-Clusters
In many cases, it is desirable to find how the micro-clusters have changed over time.
Given a user-specified time horizon h and two clock times t1 and t2 (where t1 < t2), compare the data arriving in (t2-h, t2) with the data arriving in (t1-h, t1) to analyze how the clusters have evolved.

15 Evolution Analysis of Micro-Clusters (Contd.)
The following questions are addressed:
- Are there new clusters in the data at time t2 which were not present at time t1?
  - Find micro-clusters in N(t2-h) which are not present in N(t1-h)
- Have some of the original clusters been lost?
  - Find micro-clusters in N(t1-h) which are not present in N(t2-h)
- Have some of the original clusters at time t1 shifted in position and nature?
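Since every micro-cluster carries an id, the first two questions reduce to set differences over ids, and the ids found in both sets are the candidates for the third. The sketch below assumes each of N(t1-h) and N(t2-h) is available as a dictionary keyed by micro-cluster id; `compare_horizons` and the result keys are hypothetical names.

```python
from typing import Dict, List


def compare_horizons(n_t1: Dict[int, object], n_t2: Dict[int, object]) -> Dict[str, List[int]]:
    """Classify micro-cluster ids found in N(t1-h) and N(t2-h)."""
    ids1, ids2 = set(n_t1), set(n_t2)
    return {
        "added": sorted(ids2 - ids1),     # in N(t2-h) but not in N(t1-h)
        "removed": sorted(ids1 - ids2),   # in N(t1-h) but not in N(t2-h)
        "retained": sorted(ids1 & ids2),  # candidates to inspect for shifts
    }


n_t1 = {1: "cf-1", 2: "cf-2", 3: "cf-3"}  # placeholder values stand in for CFs
n_t2 = {2: "cf-2b", 3: "cf-3b", 7: "cf-7"}
changes = compare_horizons(n_t1, n_t2)
print(changes)  # {'added': [7], 'removed': [1], 'retained': [2, 3]}
```

For the retained ids, comparing the centroids at the two times would indicate how far each cluster has moved.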

16 Empirical Results: Data Sets
Real data sets: Network Intrusion and the KDD Cup 98 data set (Charitable Donation)
Synthetic data sets: Gaussian distribution
- Base size: 100k ~ 1000k points
- Number of clusters: 4 ~ 64
- Dimensionality: 10 ~ 100

17 Cluster Quality (Network Intrusion)
[Figure panels: Horizon H=1, Stream_speed=2000; Horizon H=256, Stream_speed=200]

18 Cluster Quality (Charitable Donation)
[Figure panels: Horizon H=4, Stream_speed=2000; Horizon H=16, Stream_speed=200]

19 Scalability
[Figure: Stream_speed=2000]

20 Sum of Square Distance (SSQ)
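Assume there are a total of N points in the past horizon H at current time Tc. Then $SSQ(T_c, H) = \sum_{i=1}^{N} d(p_i, c_i)^2$, where $c_i$ is the centroid of the macro-cluster closest to $p_i$ and $d$ is the distance between a point and a centroid.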
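A direct computation of this quality measure might look like the sketch below. Euclidean distance is assumed, and the function name `ssq` is chosen here for illustration. Lower values mean the points lie closer to their macro-cluster centroids.

```python
import math
from typing import Sequence


def ssq(points: Sequence[Sequence[float]],
        centroids: Sequence[Sequence[float]]) -> float:
    """Sum over all points of the squared distance to the closest centroid."""
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


horizon_points = [(0, 0), (0, 2), (10, 10)]
macro_centroids = [(0, 1), (10, 10)]
quality = ssq(horizon_points, macro_centroids)
print(quality)  # 2.0
```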

21 K-means clustering algorithm
[Figure: scatter plots illustrating k-means with K=2]
- Arbitrarily choose K points as the initial cluster centers
- Assign each point to the closest center
- Update the cluster means
- Reassign the points and update the cluster means again, repeating as needed

