Presented by Niwan Wattanakitrungroj


1 Presented by Niwan Wattanakitrungroj
A Framework for Clustering Evolving Data Streams
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu
Proceedings of the 29th VLDB Conference, 2003
Presented by Niwan Wattanakitrungroj, 22 June 2010

2 Outline
Introduction
The Stream Clustering Framework
Online Micro-cluster Maintenance
Macro-Cluster Creation
Experimental Results
Conclusions

3 Introduction
Clustering problem: to partition a set of data points into one or more groups of similar objects.
Traditional clustering algorithms are not efficient for clustering data streams.
A data stream may grow at an unlimited rate and may evolve over time.
A data stream cannot be revisited over the course of the computation.

4 Introduction (cont.)
Previous work: STREAM (O’Callaghan et al., 200)
It implements a continuous version of the k-means algorithm.
It is unsafe for evolving data streams, because k-means is highly sensitive to the order of arrival of the data points.
If two clusters are merged, there is no way to split them when the evolution of the stream later requires it.

6 The Stream Clustering Framework
CluStream (proposed):
Online component (micro-cluster maintenance): periodically stores summary statistics.
Offline component (macro-cluster creation): uses only these summary statistics and is used by the analyst.

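7 The Stream Clustering Framework (cont.)
Definition 1: A micro-cluster for a set of d-dimensional points $X_{i_1}, \ldots, X_{i_n}$ with time stamps $T_{i_1}, \ldots, T_{i_n}$ is defined as the $(2d + 3)$-tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$.
$\overline{CF2^x}$: a vector of d values, each the sum of the squares of the data values in the micro-cluster along one dimension, i.e., the p-th entry is $\sum_{j=1}^{n} (x_{i_j}^p)^2$.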

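8 The Stream Clustering Framework (cont.)
Definition 1: A micro-cluster for a set of d-dimensional points $X_{i_1}, \ldots, X_{i_n}$ with time stamps $T_{i_1}, \ldots, T_{i_n}$ is defined as the $(2d + 3)$-tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$.
$\overline{CF1^x}$: a vector of d values, each the sum of the data values in the micro-cluster along one dimension, i.e., the p-th entry is $\sum_{j=1}^{n} x_{i_j}^p$.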

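9 The Stream Clustering Framework (cont.)
Definition 1: A micro-cluster for a set of d-dimensional points $X_{i_1}, \ldots, X_{i_n}$ with time stamps $T_{i_1}, \ldots, T_{i_n}$ is defined as the $(2d + 3)$-tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$.
$CF2^t = \sum_{j=1}^{n} T_{i_j}^2$: sum of the squares of the time stamps
$CF1^t = \sum_{j=1}^{n} T_{i_j}$: sum of the time stamps
$n$: number of data points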

10 The Stream Clustering Framework (cont.)
$CFT(C)$: the cluster feature vector of the micro-cluster for a set of points C
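
As a minimal sketch of Definition 1, the cluster feature vector can be kept as five running sums. The class name `MicroCluster` and the methods `insert`, `merge`, and `centroid` are illustrative assumptions, not names from the slides. The sketch only shows that the tuple can be updated one point at a time and that two tuples combine by component-wise addition.

```python
class MicroCluster:
    """Hypothetical container for the tuple (CF2x, CF1x, CF2t, CF1t, n)."""

    def __init__(self, d):
        self.cf2x = [0.0] * d  # per-dimension sum of squared data values
        self.cf1x = [0.0] * d  # per-dimension sum of data values
        self.cf2t = 0.0        # sum of squared time stamps
        self.cf1t = 0.0        # sum of time stamps
        self.n = 0             # number of points

    def insert(self, x, t):
        """Add one d-dimensional point x that arrived at time t."""
        for p, value in enumerate(x):
            self.cf1x[p] += value
            self.cf2x[p] += value * value
        self.cf1t += t
        self.cf2t += t * t
        self.n += 1

    def merge(self, other):
        """Fold another micro-cluster into this one (additive property)."""
        self.cf2x = [a + b for a, b in zip(self.cf2x, other.cf2x)]
        self.cf1x = [a + b for a, b in zip(self.cf1x, other.cf1x)]
        self.cf2t += other.cf2t
        self.cf1t += other.cf1t
        self.n += other.n

    def centroid(self):
        """The centroid is recoverable from the sums alone."""
        return [s / self.n for s in self.cf1x]


mc = MicroCluster(d=2)
mc.insert([1.0, 2.0], t=1)
mc.insert([3.0, 4.0], t=2)
print(mc.cf1x, mc.cf2x, mc.cf1t, mc.cf2t, mc.n)
```

Because only sums are stored, the memory per micro-cluster stays fixed at 2d + 3 numbers regardless of how many points it has absorbed.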

11 The Stream Clustering Framework (cont.)
Find the clusters over a history of length h using the subtractive property of micro-clusters: subtract the snapshot taken at time $t_c - h$ from the snapshot taken at the current time $t_c$.
How many snapshots should be stored?
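
The subtraction can be sketched as follows, under simplifying assumptions: a snapshot is modeled as a dict from micro-cluster id to the tuple `(cf2x, cf1x, cf2t, cf1t, n)`, and a micro-cluster keeps the same id in both snapshots. The function names `subtract_cf` and `horizon_clusters` are hypothetical.

```python
def subtract_cf(a, b):
    """Component-wise difference of two cluster feature tuples."""
    cf2x_a, cf1x_a, cf2t_a, cf1t_a, n_a = a
    cf2x_b, cf1x_b, cf2t_b, cf1t_b, n_b = b
    return (
        [p - q for p, q in zip(cf2x_a, cf2x_b)],
        [p - q for p, q in zip(cf1x_a, cf1x_b)],
        cf2t_a - cf2t_b,
        cf1t_a - cf1t_b,
        n_a - n_b,
    )


def horizon_clusters(snap_now, snap_past):
    """Micro-clusters describing only the points that arrived in (tc - h, tc]."""
    result = {}
    for cid, cf in snap_now.items():
        if cid in snap_past:
            cf = subtract_cf(cf, snap_past[cid])
        if cf[4] > 0:  # keep only micro-clusters that received points in the horizon
            result[cid] = cf
    return result


# Snapshot at tc - h: cluster 1 holds the point x=1 that arrived at t=1.
past = {1: ([1.0], [1.0], 1.0, 1.0, 1)}
# Snapshot at tc: cluster 1 also absorbed x=3 at t=2; cluster 2 is new (x=2 at t=3).
now = {1: ([10.0], [4.0], 5.0, 3.0, 2), 2: ([4.0], [2.0], 9.0, 3.0, 1)}
print(horizon_clusters(now, past))
```

The result for cluster 1 is the feature tuple of the single point x=3, t=2, which is what remains once the older history is removed.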

12 The Stream Clustering Framework (cont.)
Pyramidal time frame
The order of a snapshot ranges from 0 to $\log_\alpha(T)$.
Snapshots of the i-th order occur at time intervals of $\alpha^i$, where $\alpha \geq 1$ is an integer; each is taken at a moment in time t when t is exactly divisible by $\alpha^i$.
Only the last $\alpha^l + 1$ snapshots of order i are stored.
The maximum number of snapshots at any moment is $(\alpha^l + 1) \cdot \log_\alpha(T)$.
All the snapshots of order i whose clock times are not divisible by $\alpha^{i+1}$ are non-redundant.
[Table: order of snapshots (0 to 5) against the clock times of the last 5 snapshots, for $\alpha = 2$ and $l = 2$]
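
A short sketch of the storage rule, assuming clock times start at 1 and that α is at least 2 so the orders actually grow. The function name `stored_snapshots` and the example clock time 55 are illustrative choices, not taken from the slides.

```python
def stored_snapshots(T, alpha=2, l=2):
    """Return {order: clock times kept at current time T}, newest first."""
    if alpha < 2:
        raise ValueError("this sketch assumes alpha >= 2")
    keep = alpha ** l + 1  # only the last alpha^l + 1 snapshots per order
    stored = {}
    order = 0
    while alpha ** order <= T:
        step = alpha ** order          # order-i snapshots occur every alpha^i ticks
        latest = (T // step) * step    # most recent time divisible by alpha^i
        stored[order] = [
            latest - j * step for j in range(keep) if latest - j * step > 0
        ]
        order += 1
    return stored


for order, times in stored_snapshots(55).items():
    print(order, times)
# 0 [55, 54, 53, 52, 51]
# 1 [54, 52, 50, 48, 46]
# 2 [52, 48, 44, 40, 36]
# 3 [48, 40, 32, 24, 16]
# 4 [48, 32, 16]
# 5 [32]
```

The same clock time can appear under several orders (48 appears under orders 1 to 4 here); only the copy at the highest such order needs to be kept, which is what the non-redundancy remark refers to.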

14 Online Micro-cluster Maintenance
Initialization: create the initial q micro-clusters by applying a standard k-means algorithm.
Online process for each new data point, one of two outcomes:
The point is absorbed by an existing micro-cluster.
A new micro-cluster is created for the point.

15 Online Micro-cluster Maintenance (cont.)
For each new data point:
Find the closest micro-cluster.
The maximum boundary of that micro-cluster is defined as a factor t of the RMS deviation of its points from the centroid.
Does the point fall within the maximum boundary?
Yes: the point is absorbed by the micro-cluster.
No: create a new micro-cluster.
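
The boundary test needs only the stored sums. The sketch below assumes the RMS deviation is computed from the per-dimension variances CF2x/n - (CF1x/n)^2, and the function names and the default factor `t=2.0` are illustrative assumptions rather than values from the slides.

```python
from math import sqrt


def rms_deviation(cf2x, cf1x, n):
    """RMS distance of a micro-cluster's points from its centroid."""
    # Per-dimension variance is E[x^2] - E[x]^2; clamp tiny negatives from rounding.
    return sqrt(sum(max(s2 / n - (s1 / n) ** 2, 0.0) for s2, s1 in zip(cf2x, cf1x)))


def falls_in_boundary(x, cf2x, cf1x, n, t=2.0):
    """True if point x lies within t times the RMS deviation of the centroid."""
    centroid = [s1 / n for s1 in cf1x]
    dist = sqrt(sum((a - c) ** 2 for a, c in zip(x, centroid)))
    return dist <= t * rms_deviation(cf2x, cf1x, n)


# Micro-cluster built from the points (0, 0) and (2, 0): centroid (1, 0), RMS deviation 1.
cf1x, cf2x, n = [2.0, 0.0], [4.0, 0.0], 2
print(falls_in_boundary([2.5, 0.0], cf2x, cf1x, n))  # True: distance 1.5 <= 2.0
print(falls_in_boundary([4.0, 0.0], cf2x, cf1x, n))  # False: distance 3.0 > 2.0
```

A micro-cluster holding a single point has an RMS deviation of 0, so it needs a separate rule for its boundary; the sketch does not cover that case.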

16 Online Micro-cluster Maintenance (cont.)
Create a new micro-cluster and assign it a new id.
Reduce the number of micro-clusters by one:
For each micro-cluster M, calculate the mean and standard deviation of the arrival times from CF2t and CF1t.
Find the "relevance stamp" of M: the time of arrival at the m/(2*n)-th percentile.
Is the least relevance stamp of M < δ?
Yes: delete that micro-cluster.
No: join the two closest micro-clusters and create an idlist, which is the union of the ids in each micro-cluster.
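
One way to read the relevance stamp is sketched below. It rests on assumptions not spelled out on the slide: arrival times within a micro-cluster are treated as normally distributed with the mean and standard deviation derived from CF1t and CF2t, the percentile is read from the recent end of that distribution, and the mean is used as a fallback when the micro-cluster holds fewer than 2*m points. The function name `relevance_stamp` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def relevance_stamp(cf1t, cf2t, n, m):
    """Approximate average arrival time of the last m points of a micro-cluster."""
    mean = cf1t / n
    variance = max(cf2t / n - mean ** 2, 0.0)
    if n < 2 * m or variance == 0.0:
        # Assumed fallback: too few points (or no spread) to take a percentile.
        return mean
    # Assumed reading: m/(2n) measured from the most recent end of the distribution.
    return NormalDist(mean, sqrt(variance)).inv_cdf(1 - m / (2 * n))


# Time stamps 1..100: CF1t = 5050, CF2t = 338350.
print(relevance_stamp(5050.0, 338350.0, n=100, m=10))
```

A micro-cluster whose stamp falls below the threshold δ has not received points recently and is a candidate for deletion; otherwise the two closest micro-clusters are merged instead.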

18 Macro-Cluster Creation
Macro-clusters are built from the compactly stored summary statistics of the micro-clusters.
Inputs from the analyst: the time horizon h and the number of higher-level clusters k.
Apply a modification of the k-means algorithm in which the micro-clusters are treated as pseudo-points.
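
A minimal sketch of k-means over pseudo-points, assuming each micro-cluster is represented by its centroid and weighted by its point count n when the macro-cluster centers are updated. The function name `macro_cluster` is hypothetical, and the initial centers are passed in explicitly to keep the example deterministic; seeding is left out of the sketch.

```python
def macro_cluster(pseudo_points, weights, centers, iters=10):
    """Weighted k-means: pseudo_points are micro-cluster centroids, weights their sizes."""
    centers = [list(c) for c in centers]
    for _ in range(iters):
        sums = [[0.0] * len(c) for c in centers]
        totals = [0.0] * len(centers)
        for point, w in zip(pseudo_points, weights):
            # Assign the pseudo-point to its nearest current center.
            nearest = min(
                range(len(centers)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(point, centers[j])),
            )
            totals[nearest] += w
            for p, value in enumerate(point):
                sums[nearest][p] += w * value
        for j, total in enumerate(totals):
            if total > 0:  # leave a center unchanged if nothing was assigned to it
                centers[j] = [s / total for s in sums[j]]
    return centers


centroids = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0), (12.0, 10.0)]
sizes = [1, 3, 2, 2]
print(macro_cluster(centroids, sizes, centers=[(0.0, 0.0), (10.0, 10.0)]))
# [[1.5, 0.0], [11.0, 10.0]]
```

Weighting by size pulls the first center toward the larger micro-cluster at (2, 0), which an unweighted mean over the centroids would not do.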

20 Experimental Results
Test environment and data sets
CluStream (proposed) vs. STREAM (O’Callaghan et al.)
Data sets:
KDD-CUP’99 Network Intrusion Detection (33 attributes)
KDD-CUP’98 Charitable Donation (56 attributes)
Quality of clustering: measured by the sum of squared distances (SSQ)
Parameter setting:

21 Experimental Results (cont.)
[Figures: four panels for the Network Intrusion and Charitable Donation datasets, titled "horizon=1, stream speed = 2000", "horizon=4, stream speed = 200", "horizon=256, stream speed = 200", and "horizon=16, stream speed = 200"]

22 Experimental Results (cont.)
[Figures: stream processing rate, two panels: Charitable Donation dataset, stream speed = 2000; Network Intrusion dataset, stream speed = 2000]

24 Conclusions
CluStream:
A clustering method for large, evolving data streams.
It views the stream as a process that changes over time.
It gives the analyst flexibility in a real-time, evolving environment.

25 Thank you Q & A

