On Demand Classification of Data Streams. Charu C. Aggarwal, Jiawei Han, Philip S. Yu. Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD'04)


1 On Demand Classification of Data Streams
Charu C. Aggarwal, Jiawei Han, Philip S. Yu
Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD'04), Seattle, WA, Aug. 2004
Speaker: Pei-Min Chou
Date: 2005/04/01

2 Outline
- Introduction
- Supervised Micro-cluster
- Snapshot
- Maintenance of Supervised Micro-clusters
- Training Data Stream
- Classification on Demand
- Empirical Results

3 Introduction
- With advances in data storage, data often grow without limit; such data are referred to as data streams
- A one-pass mining model does not recognize changes in the stream, and it is too expensive to keep track of the entire history
- The accuracy of a static classification model is likely to drop when there is a sudden burst
- Our model: simultaneous training and testing streams are used for dynamic classification of data sets

4 Supervised Micro-cluster: a Modified Micro-cluster
- Built only from training data; all points in a micro-cluster have the same class
- Data stream: multi-dimensional points $\overline{X_1}, \ldots, \overline{X_k}, \ldots$ with time stamps $T_1, \ldots, T_k, \ldots$
- Each point contains d dimensions, i.e., $\overline{X_i} = (x_i^1, \ldots, x_i^d)$
- A micro-cluster for n points is defined as a (2*d + 4) tuple:
  - the sum of the squares of the data values (one entry per dimension)
  - the sum of the data values (one entry per dimension)
  - the sum of the squares of the time stamps
  - the sum of the time stamps
  - the number of data points
  - the class id, which corresponds to the class label of that micro-cluster
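The tuple above can be sketched as a small data structure. This is only an illustrative sketch: the class name `SupervisedMicroCluster`, its field names, and the helper methods are assumptions, not names from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SupervisedMicroCluster:
    """Additive summary of n same-class points: the (2*d + 4) tuple."""
    cf2x: List[float]  # per-dimension sum of squared data values (d entries)
    cf1x: List[float]  # per-dimension sum of data values (d entries)
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of data points
    class_id: int      # class label shared by every point

    @classmethod
    def from_point(cls, x, t, class_id):
        """Start a micro-cluster from a single point x seen at time t."""
        return cls([v * v for v in x], list(x), t * t, t, 1, class_id)

    def add(self, x, t):
        """Absorb one more point of the same class."""
        for j, v in enumerate(x):
            self.cf2x[j] += v * v
            self.cf1x[j] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def centroid(self):
        """Mean of the absorbed points, recovered from the sums."""
        return [s / self.n for s in self.cf1x]

    def as_tuple(self):
        """Flatten into the (2*d + 4) tuple."""
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n, self.class_id)


mc = SupervisedMicroCluster.from_point([1.0, 2.0], t=1, class_id=0)
mc.add([3.0, 4.0], t=2)
```

Because every field except the class id is a plain sum, adding a point never requires revisiting earlier points.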

5 Snapshot
- Makes it affordable to keep track of history: the behavior of the micro-clusters is stored at different moments in time
- Geometric time frame: a snapshot taken at time t belongs to frame number i if (t mod 2^i) = 0 but (t mod 2^(i+1)) != 0
- When a frame reaches max capacity, the oldest snapshot in this frame is removed
- Frame numbers vary from 0 to a value no larger than log2(T), where T is the maximum length of the stream
- Maximum number of snapshots = (max capacity) * log2(T)
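A minimal sketch of the geometric time frame rule, assuming integer clock times starting at 1. The names `frame_of` and `GeometricTimeFrame` are hypothetical, and `None` stands in for the stored micro-cluster statistics.

```python
from collections import deque


def frame_of(t):
    """Frame number i with (t mod 2^i) == 0 but (t mod 2^(i+1)) != 0, for t >= 1."""
    i = 0
    while t % 2 ** (i + 1) == 0:
        i += 1
    return i


class GeometricTimeFrame:
    def __init__(self, max_capacity):
        self.max_capacity = max_capacity
        self.frames = {}  # frame number -> deque of (time, snapshot)

    def store(self, t, snapshot):
        frame = self.frames.setdefault(frame_of(t), deque(maxlen=self.max_capacity))
        frame.append((t, snapshot))  # a full deque drops its oldest entry

    def stored_times(self):
        return sorted(t for frame in self.frames.values() for t, _ in frame)


gtf = GeometricTimeFrame(max_capacity=2)
for t in range(1, 17):
    gtf.store(t, snapshot=None)  # None is a placeholder for the real statistics
```

After 16 time units with a capacity of 2, recent times are kept densely and older times sparsely, which is the point of the geometric layout.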

6 Maintenance of Supervised Micro-clusters
- Uses nearest neighbor and k-means algorithms
- The initial micro-clusters are created by an offline process
- Offline component: answers various user queries based on the stored summary statistics
- When a new data point $X_{i_k}$ arrives, it is either added to an existing micro-cluster or a new micro-cluster is created
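The add-or-create step can be sketched as follows. The slide does not say how the decision is made, so the fixed `max_radius` threshold here is purely an assumption for illustration, as are the dictionary layout and the function names.

```python
import math


def centroid(mc):
    return [s / mc["n"] for s in mc["cf1x"]]


def absorb(mc, x, t):
    """Update the additive statistics of mc with point x seen at time t."""
    for j, v in enumerate(x):
        mc["cf2x"][j] += v * v
        mc["cf1x"][j] += v
    mc["cf2t"] += t * t
    mc["cf1t"] += t
    mc["n"] += 1


def new_cluster(x, t, class_id):
    return {"cf2x": [v * v for v in x], "cf1x": list(x),
            "cf2t": t * t, "cf1t": t, "n": 1, "class_id": class_id}


def update(clusters, x, t, class_id, max_radius):
    """Add (x, t) to the nearest same-class micro-cluster, or start a new one.

    max_radius is a hypothetical fixed threshold, not taken from the slides.
    """
    same_class = [mc for mc in clusters if mc["class_id"] == class_id]
    if same_class:
        nearest = min(same_class, key=lambda mc: math.dist(centroid(mc), x))
        if math.dist(centroid(nearest), x) <= max_radius:
            absorb(nearest, x, t)
            return nearest
    created = new_cluster(x, t, class_id)
    clusters.append(created)
    return created


clusters = []
update(clusters, [0.0, 0.0], t=1, class_id=0, max_radius=1.0)
update(clusters, [0.5, 0.0], t=2, class_id=0, max_radius=1.0)  # absorbed
update(clusters, [5.0, 5.0], t=3, class_id=0, max_radius=1.0)  # too far: new cluster
update(clusters, [0.1, 0.1], t=4, class_id=1, max_radius=1.0)  # other class: new cluster
```

A point is only ever compared with micro-clusters of its own class, which keeps each micro-cluster class-pure.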

7 Classification on Demand
- Construct
  - Find the correct time horizon
  - The value of k_fit
  - Should a large or small horizon be chosen?
- Test

8 Find the correct time horizon
- Macro-clusters are created over a user-specified time horizon h
- Let S(t_c) be the snapshot of micro-clusters at time t_c, and S(t_c - h) the snapshot of micro-clusters at time t_c - h
- The new set of micro-clusters N(t_c, h) is created by subtracting S(t_c - h) from S(t_c)
- Subtractive property: let $C_1$ and $C_2$ be two sets of points such that $C_1 \supseteq C_2$. Then $CFT(C_1 - C_2) = CFT(C_1) - CFT(C_2)$, where $CFT(C)$ denotes the micro-cluster summary tuple of $C$
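The subtraction of snapshots can be sketched as a component-wise difference of the additive statistics. The dictionary layout, the micro-cluster id `"mc0"`, and the helper names are assumptions made for this example.

```python
def summarise(points, class_id):
    """Build the additive statistics from a list of (x, t) pairs."""
    d = len(points[0][0])
    return {
        "cf2x": [sum(x[j] ** 2 for x, _ in points) for j in range(d)],
        "cf1x": [sum(x[j] for x, _ in points) for j in range(d)],
        "cf2t": sum(t * t for _, t in points),
        "cf1t": sum(t for _, t in points),
        "n": len(points),
        "class_id": class_id,
    }


def subtract(current, older):
    """Statistics of the points in `current` that are not in `older`."""
    assert current["class_id"] == older["class_id"]
    return {
        "cf2x": [a - b for a, b in zip(current["cf2x"], older["cf2x"])],
        "cf1x": [a - b for a, b in zip(current["cf1x"], older["cf1x"])],
        "cf2t": current["cf2t"] - older["cf2t"],
        "cf1t": current["cf1t"] - older["cf1t"],
        "n": current["n"] - older["n"],
        "class_id": current["class_id"],
    }


def horizon_clusters(snap_now, snap_past):
    """N(t_c, h): micro-cluster statistics restricted to the last h time units."""
    result = {}
    for cid, mc in snap_now.items():
        diff = subtract(mc, snap_past[cid]) if cid in snap_past else mc
        if diff["n"] > 0:  # drop micro-clusters that received no recent points
            result[cid] = diff
    return result


old = [([1.0, 2.0], 1), ([2.0, 0.0], 2)]
new = [([4.0, 4.0], 3)]
snap_past = {"mc0": summarise(old, class_id=0)}        # S(t_c - h)
snap_now = {"mc0": summarise(old + new, class_id=0)}   # S(t_c)
recent = horizon_clusters(snap_now, snap_past)
```

The result for `"mc0"` equals the summary of the new points alone, which is what the subtractive property promises.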

9 Training Data Stream
- A small portion of the stream is used for the process of horizon fitting (the horizon-fitting stream segment)
- k_fit: the number of points in this segment; the value can be as small as 1% of the data
- The remaining portion of the training stream is used for the creation and maintenance of the class-specific micro-clusters

10 The value of k_fit
- The horizon determines classification accuracy
- The fitting process is executed periodically to account for changes
- k_fit should be small enough that the points in it reflect the immediate locality of t_c
- Q_fit: a part of the training stream covering a pre-specified number of time units, for which the class labels are known a priori
- Nearest neighbor procedure, for each X ∈ Q_fit:
  - Find the closest micro-cluster in N(t_c, h) to X
  - Compare its class label with the true label of X
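The nearest neighbor procedure over Q_fit amounts to measuring accuracy for one horizon. A minimal sketch, assuming each micro-cluster is reduced to a centroid and a class id; the function names and the toy points are hypothetical.

```python
import math


def nearest_label(micro_clusters, x):
    """Class label of the micro-cluster whose centroid is closest to x."""
    nearest = min(micro_clusters, key=lambda mc: math.dist(mc["centroid"], x))
    return nearest["class_id"]


def horizon_accuracy(micro_clusters, q_fit):
    """Fraction of labelled points (x, true_label) in q_fit classified correctly."""
    hits = sum(nearest_label(micro_clusters, x) == label for x, label in q_fit)
    return hits / len(q_fit)


# Toy stand-in for N(t_c, h) and for the labelled segment Q_fit.
n_tc_h = [{"centroid": [0.0, 0.0], "class_id": 0},
          {"centroid": [10.0, 10.0], "class_id": 1}]
q_fit = [([1.0, 1.0], 0), ([9.0, 9.0], 1), ([8.0, 8.0], 0), ([0.0, 2.0], 0)]
accuracy = horizon_accuracy(n_tc_h, q_fit)
```

Three of the four toy points land next to a micro-cluster with the matching label, so this horizon scores 0.75.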

11 Should a large or small horizon be chosen?
- The accuracy of every time horizon tracked by the geometric time frame is determined
- The p time horizons which provide the greatest dynamic classification accuracy are selected
- At first sight: the smallest horizon
- Stable stream: a large horizon
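Selecting the p best horizons is a simple ranking once each horizon has an accuracy. In this sketch the function name is hypothetical and the accuracy values are made-up illustration numbers, not results from the paper.

```python
def best_horizons(accuracy_by_horizon, p):
    """Return the p horizons with the highest accuracy, best first."""
    ranked = sorted(accuracy_by_horizon, key=accuracy_by_horizon.get, reverse=True)
    return ranked[:p]


# Illustrative accuracies per horizon length (invented for the example).
accuracy_by_horizon = {1: 0.62, 2: 0.71, 4: 0.80, 8: 0.77, 16: 0.65}
selected = best_horizons(accuracy_by_horizon, p=2)
```

Here neither the smallest nor the largest horizon wins, which is why the accuracy is measured rather than fixing the horizon in advance.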

12 Test
- Classification of the test stream is a separate process which is executed continuously throughout the algorithm
- For each test instance X_t, the nearest neighbor classification process is applied using each selected horizon h ∈ H
- Each horizon results in the determination of one class label
- The majority class among these p class labels is reported as the relevant class
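A sketch of the test step, assuming the p labels are combined by a majority vote and that each micro-cluster is reduced to a centroid and a class id. The function names, the dictionary keyed by horizon, and the toy values are hypothetical.

```python
import math
from collections import Counter


def nearest_label(micro_clusters, x):
    """Class label of the micro-cluster whose centroid is closest to x."""
    nearest = min(micro_clusters, key=lambda mc: math.dist(mc["centroid"], x))
    return nearest["class_id"]


def classify(x, clusters_by_horizon):
    """Majority vote over the labels predicted with each selected horizon."""
    votes = [nearest_label(mcs, x) for mcs in clusters_by_horizon.values()]
    return Counter(votes).most_common(1)[0][0]


# Toy micro-clusters for p = 3 selected horizons.
clusters_by_horizon = {
    1: [{"centroid": [0.0, 0.0], "class_id": 0}, {"centroid": [4.0, 4.0], "class_id": 1}],
    2: [{"centroid": [1.0, 1.0], "class_id": 0}, {"centroid": [6.0, 6.0], "class_id": 1}],
    4: [{"centroid": [5.0, 5.0], "class_id": 0}, {"centroid": [2.0, 2.0], "class_id": 1}],
}
label = classify([1.5, 1.5], clusters_by_horizon)
```

Two horizons vote for class 0 and one for class 1, so class 0 is reported.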

13 Empirical Results
- Platform: Pentium III, 512 MB, Windows XP
- Both real and synthetic data sets
- Advantages:
  - much higher classification accuracy
  - good scalability in terms of dimensionality and the number of class labels
  - stable processing rate
  - space-efficient

14 Experiment

15 Experiment

16 Experiment

