Presented by Niwan Wattanakitrungroj

A Framework for Clustering Evolving Data Streams
Charu C. Aggarwal, Jiawei Han, Jianyong Wang and Philip S. Yu
Proceedings of the 29th VLDB Conference, 2003
Presented by Niwan Wattanakitrungroj, 22 June 2010

Outline
Introduction
The Stream Clustering Framework
Online Micro-cluster Maintenance
Macro-Cluster Creation
Experimental Results
Conclusions

Introduction
Clustering problem: partition a set of data points into one or more groups of similar objects.
Traditional clustering algorithms are not efficient for clustering data streams.
A data stream may grow at an unlimited rate and may evolve over time.
A data stream cannot be revisited over the course of the computation.

Introduction (cont.)
Previous work: STREAM (O’Callaghan et al., 200)
STREAM implements a continuous version of the k-means algorithm.
It is risky for evolving data streams, because k-means is highly sensitive to the order of arrival of the data points.
Once two clusters are merged, there is no way to split them when the evolution of the stream later requires it.

The Stream Clustering Framework
CluStream (proposed)
Online component (micro-cluster maintenance): periodically stores summary statistics.
Offline component (macro-cluster creation): uses only these summary statistics (utilized by the analyst).

The Stream Clustering Framework (cont.)
Definition 1: A micro-cluster for a set of d-dimensional points $X_{i_1}, \ldots, X_{i_n}$ with time stamps $T_{i_1}, \ldots, T_{i_n}$ is defined as the $(2 \cdot d + 3)$ tuple $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$, where:
$\overline{CF2^x}$: a vector of d values, each value is the sum of the squares of the data values in that dimension, i.e., the p-th entry is $\sum_{j=1}^{n} (x_{i_j}^p)^2$
$\overline{CF1^x}$: a vector of d values, each value is the sum of the data values in that dimension, i.e., the p-th entry is $\sum_{j=1}^{n} x_{i_j}^p$
$CF2^t$: sum of the squares of the time stamps, $\sum_{j=1}^{n} T_{i_j}^2$
$CF1^t$: sum of the time stamps, $\sum_{j=1}^{n} T_{i_j}$
n: number of data points

$CFT(C)$: cluster feature vector of the micro-cluster for a set of points C
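The sums in Definition 1 can be kept up to date one point at a time, and two micro-clusters can be combined by adding their sums. The following is a minimal sketch of that idea, not the authors' implementation; the class name `MicroCluster` and its method names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MicroCluster:
    """Hypothetical container for the (2d + 3) values of Definition 1."""
    cf2x: List[float]  # per-dimension sum of squared data values
    cf1x: List[float]  # per-dimension sum of data values
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of data points

    @classmethod
    def from_point(cls, x, t):
        # A micro-cluster that holds a single point x observed at time t.
        return cls([v * v for v in x], list(x), t * t, t, 1)

    def insert(self, x, t):
        # Adding a point only increments the stored sums.
        for p, v in enumerate(x):
            self.cf2x[p] += v * v
            self.cf1x[p] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def merge(self, other):
        # Additivity: the summary of a union is the sum of the summaries.
        return MicroCluster(
            [a + b for a, b in zip(self.cf2x, other.cf2x)],
            [a + b for a, b in zip(self.cf1x, other.cf1x)],
            self.cf2t + other.cf2t,
            self.cf1t + other.cf1t,
            self.n + other.n,
        )

    def centroid(self):
        return [s / self.n for s in self.cf1x]
```

Because only sums are stored, the memory used per micro-cluster does not grow with the number of points it has absorbed.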

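To find the clusters over a history of length h, use the subtractive property of micro-clusters on the snapshots taken at times $t_c$ and $t_c - h$, where $t_c$ is the current time.
[Figure: timeline marking $t_c - h$ and $t_c$, with a history of length h between them]
How many snapshots should be stored?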
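The subtractive property can be illustrated with a short sketch: the statistics of the points that arrived during the last h time units are the current sums minus the sums stored in the older snapshot. The dictionary layout (`ids`, `cf1x`, `cf2x`, `n`) and the function name `subtract_snapshots` are assumptions made for this example, and matching older micro-clusters through a set of ids is a simplification.

```python
def subtract_snapshots(current, older):
    """Return per-cluster statistics for points that arrived after `older`.

    Both arguments are lists of dicts with hypothetical keys:
    'ids' (set of micro-cluster ids), 'cf1x', 'cf2x' (lists) and 'n'.
    """
    result = []
    for mc in current:
        cf1x = list(mc["cf1x"])
        cf2x = list(mc["cf2x"])
        n = mc["n"]
        for old in older:
            # An older micro-cluster contributes to mc if all of its ids
            # are contained in mc's id set (mc may be a merge of several).
            if old["ids"] <= mc["ids"]:
                cf1x = [a - b for a, b in zip(cf1x, old["cf1x"])]
                cf2x = [a - b for a, b in zip(cf2x, old["cf2x"])]
                n -= old["n"]
        if n > 0:  # drop clusters that received no points in the horizon
            result.append({"ids": set(mc["ids"]), "cf1x": cf1x,
                           "cf2x": cf2x, "n": n})
    return result
```

Micro-clusters that exist only in the current snapshot pass through unchanged, since nothing is subtracted from them.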

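Pyramidal time frame
Snapshot orders range from 0 to $\log_\alpha(T)$, where T is the clock time elapsed since the beginning of the stream.
Snapshots of the i-th order occur at time intervals of $\alpha^i$, where $\alpha \geq 1$ is an integer; each is taken at a moment in time t when t is exactly divisible by $\alpha^i$.
Only the last $\alpha^l + 1$ snapshots of order i are stored.
The maximum number of snapshots at any moment is $(\alpha^l + 1) \cdot \log_\alpha(T)$.
All the snapshots of order i whose clock times are not divisible by $\alpha^{i+1}$ are non-redundant.
[Table: Order of Snapshots | Clock Times (Last 5 Snapshots), with α = 2 and l = 2]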
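As a rough illustration of the pyramidal time frame, the sketch below lists which clock times would be retained for each order at a given time T. The function name `pyramidal_snapshots` and its parameters are hypothetical; the retention count per order is taken as α^l + 1, which gives 5 for α = 2 and l = 2.

```python
def pyramidal_snapshots(T, alpha=2, l=2):
    """Map each snapshot order to the clock times retained at time T."""
    keep = alpha ** l + 1  # number of snapshots retained per order
    frames = {}
    order = 0
    while alpha ** order <= T:
        step = alpha ** order             # order-i snapshots occur every alpha^i
        latest = (T // step) * step       # most recent time divisible by step
        times = [latest - k * step for k in range(keep)]
        frames[order] = [t for t in times if t > 0]
        order += 1
    return frames


if __name__ == "__main__":
    for order, times in pyramidal_snapshots(55).items():
        print(order, times)
```

A clock time can appear under several orders (for example, a time divisible by 4 is also divisible by 2), so the number of distinct snapshots actually stored is smaller than the sum of the list lengths.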

Online Micro-cluster Maintenance
Initialization: create the initial q micro-clusters by applying a standard k-means algorithm.
Online process for each new data point: either the point is absorbed by an existing micro-cluster, or a new micro-cluster is created.

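Online Micro-cluster Maintenance (cont.)
For a new data point:
1. Find the closest micro-cluster.
2. Check whether the point falls within the maximum boundary of that micro-cluster. The maximum boundary is defined as a factor of t of the RMS deviation of the micro-cluster's points from its centroid.
3. Yes: the point is absorbed by that micro-cluster. No: create a new micro-cluster.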
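The absorb-or-create decision can be sketched from the stored sums alone, since both the centroid and the RMS deviation follow from the per-dimension sums and the point count. This is an illustrative sketch under assumptions: the dict keys, the function names, and the `singleton_boundary` fallback for micro-clusters with a single point (whose RMS deviation is zero) are hypothetical and not taken from the slides.

```python
import math


def centroid(mc):
    return [s / mc["n"] for s in mc["cf1x"]]


def rms_deviation(mc):
    # Sum over dimensions of E[x^2] - E[x]^2, then the square root.
    n = mc["n"]
    var = sum(c2 / n - (c1 / n) ** 2 for c1, c2 in zip(mc["cf1x"], mc["cf2x"]))
    return math.sqrt(max(var, 0.0))


def new_cluster(x, ts):
    return {"cf1x": list(x), "cf2x": [v * v for v in x],
            "cf1t": ts, "cf2t": ts * ts, "n": 1}


def process_point(clusters, x, ts, t_factor=2.0, singleton_boundary=1.0):
    """Absorb x into the closest micro-cluster or start a new one."""
    if clusters:
        nearest = min(clusters, key=lambda mc: math.dist(x, centroid(mc)))
        # Assumption: single-point clusters use a fixed fallback boundary.
        boundary = (t_factor * rms_deviation(nearest)
                    if nearest["n"] > 1 else singleton_boundary)
        if math.dist(x, centroid(nearest)) <= boundary:
            for p, v in enumerate(x):
                nearest["cf1x"][p] += v
                nearest["cf2x"][p] += v * v
            nearest["cf1t"] += ts
            nearest["cf2t"] += ts * ts
            nearest["n"] += 1
            return "absorbed"
    clusters.append(new_cluster(x, ts))
    return "created"
```

In a full system, creating a new micro-cluster would be followed by deleting or merging an existing one so that the total count stays fixed.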

Online Micro-cluster Maintenance (cont.)
Create a new micro-cluster: assign it a new id.
Reduce the number of micro-clusters by one to make room:
1. Calculate the mean and standard deviation of the time stamps of each micro-cluster M from $CF2^t$ and $CF1^t$.
2. Find the "relevance stamp" of each micro-cluster: the time of arrival at the m/(2·n)-th percentile of its points.
3. Is the least relevance stamp of any micro-cluster M < δ? Yes: delete that micro-cluster. No: join the two closest micro-clusters and create an idlist, which is the union of the ids in each micro-cluster.
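One way to approximate the relevance stamp from the stored time-stamp sums is to treat the time stamps as normally distributed and read off a quantile near the recent end. The sketch below does this with the standard library; the function names are hypothetical, and falling back to the mean time stamp when a micro-cluster has fewer than 2·m points (or zero spread) is an assumption of this sketch.

```python
import math
from statistics import NormalDist


def relevance_stamp(cf1t, cf2t, n, m):
    """Approximate arrival time of the m/(2n)-th percentile from the recent end."""
    mu = cf1t / n
    sigma = math.sqrt(max(cf2t / n - mu * mu, 0.0))
    if n < 2 * m or sigma == 0.0:
        return mu  # assumption: too few points or no spread, use the mean
    return NormalDist(mu, sigma).inv_cdf(1.0 - m / (2.0 * n))


def choose_action(stamps, delta):
    """stamps maps micro-cluster id -> relevance stamp."""
    oldest = min(stamps, key=stamps.get)
    if stamps[oldest] < delta:
        return ("delete", oldest)
    return ("merge", None)  # caller then merges the two closest micro-clusters
```

For example, a micro-cluster whose points arrived at times 1 through 10 has `cf1t = 55` and `cf2t = 385`; with `m = 2` its relevance stamp lands a little above 9, close to the most recent arrivals.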

Macro-Cluster Creation
Uses the compactly stored summary statistics of the micro-clusters.
Inputs from the analyst: the time horizon h and the number of higher-level clusters k.
Applies a modification of the k-means algorithm.
The micro-clusters are treated as pseudo-points.
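Treating micro-clusters as pseudo-points can be sketched as a k-means variant in which each pseudo-point is a centroid weighted by its point count. The slides do not say how seeds are chosen, so picking the k heaviest pseudo-points as seeds is a placeholder assumption of this sketch, as are the function name and the input layout.

```python
import math


def macro_cluster(pseudo_points, k, iterations=10):
    """pseudo_points: list of (centroid, n) pairs, one per micro-cluster."""
    # Placeholder seeding (assumption): start from the k heaviest pseudo-points.
    ranked = sorted(pseudo_points, key=lambda p: p[1], reverse=True)
    seeds = [list(c) for c, _ in ranked[:k]]

    def assign():
        return [min(range(len(seeds)), key=lambda j: math.dist(c, seeds[j]))
                for c, _ in pseudo_points]

    for _ in range(iterations):
        labels = assign()
        for j in range(len(seeds)):
            members = [(c, n) for (c, n), a in zip(pseudo_points, labels) if a == j]
            total = sum(n for _, n in members)
            if total > 0:
                # New seed: centroid of members, weighted by point counts.
                seeds[j] = [sum(c[p] * n for c, n in members) / total
                            for p in range(len(seeds[j]))]
    return seeds, assign()
```

Weighting by point count means a micro-cluster that summarizes many points pulls the macro-cluster centre more strongly than a sparse one.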

Experimental Results Test Environment and Data set
CluStream (proposed) vs. STREAM (O’Callaghan et al.)
Datasets: KDD-CUP’99 Network Intrusion Detection (33 attributes); KDD-CUP’98 Charitable Donation (56 attributes)
Quality of clustering: measured by the sum of squared distances (SSQ)
Parameter setting:
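The SSQ quality measure can be computed as follows: each point contributes the squared distance to its nearest cluster centre, and lower totals indicate tighter clusters. This is a minimal sketch; the function name `ssq` and the list-of-lists input format are assumptions made for illustration.

```python
def ssq(points, centroids):
    """Sum over points of the squared distance to the nearest centroid."""
    total = 0.0
    for x in points:
        total += min(sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids)
    return total


if __name__ == "__main__":
    pts = [[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]]
    print(ssq(pts, [[1.0, 0.0], [10.0, 0.0]]))
```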

Experimental Results (cont.)
[Figure panels: horizon=1, stream speed=2000; horizon=4, stream speed=200; horizon=256, stream speed=200; horizon=16, stream speed=200. Datasets: Network Intrusion and Charitable Donation]

Experimental Results (cont.)
[Figure panels: Stream Processing Rate for the Charitable Donation dataset (stream speed = 2000) and the Network Intrusion dataset (stream speed = 2000)]

Conclusions
CluStream:
a clustering method for large evolving data streams
views the stream as a process that changes over time
gives flexibility to an analyst in a real-time and evolving environment

Thank you Q & A
