Mining Concept-Drifting Data Streams Using Ensemble Classifiers
Haixun Wang, Wei Fan, Philip S. Yu, Jiawei Han
Proc. 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003
Reporter: 侯佩廷
2011/06/08
Outline
– Introduction
– Concept Drift
– Data Expiration
– Ensemble Classifiers
– Instance Based Pruning
– Experiments
– Conclusion
Introduction
The problem of mining data streams:
– The tremendous amount of data is constantly evolving
– Concept drift
Proposal: use a weighted classifier ensemble to solve the problem.
Concept Drift
When the underlying concept is updated or changes over time, concept drift occurs.
Figure 1: Concept drift (data distribution shifting day by day, 5/11 through 5/20)
Data Expiration
The fundamental problem: how do we identify data that is no longer useful?
A straightforward solution: discard the old data after a fixed period of time T.
Data Expiration
Figure 2: data distributions and optimum boundaries for chunks S0, S1, S2 arriving at times t0, t1, t2, t3 (legend: optimum boundary; overfitting boundary; positive and negative examples)
Data Expiration
Figure 3: Which training dataset to use? Optimum boundaries for (a) S1 + S2, (b) S0 + S1 + S2, (c) S2 + S0
Data Expiration
Instead of discarding data using criteria based on their arrival time, we make decisions based on their class distribution.
Ensemble Classifiers
y: a test example
f_c^i(y): the probability output of the i-th classifier in the ensemble that y is an instance of class c
The probability output of the ensemble of K classifiers (via averaging):
f_c^E(y) = (1/K) Σ_{i=1}^{K} f_c^i(y)
Example: three classifiers output f_c(y) = 0.4, 0.6, and 0.8 for test example y; the ensemble output is (0.4 + 0.6 + 0.8)/3 = 0.6.
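The averaging step above can be sketched in a few lines of Python (the function name is illustrative, not from the paper):

```python
def ensemble_prob(prob_outputs):
    """Average the per-classifier probabilities f_c^i(y) for one class c."""
    return sum(prob_outputs) / len(prob_outputs)

# Three classifiers output 0.4, 0.6, and 0.8 for class c:
print(ensemble_prob([0.4, 0.6, 0.8]))  # approximately 0.6
```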
Ensemble Classifiers
The stream is divided into chunks S1, S2, S3, …, S10 arriving at times t1, t2, …, t_i, t_{i+1}; a classifier C_i is trained on each chunk. G_k denotes a single classifier trained on the k most recent chunks (e.g., G9), and E_k denotes an ensemble of the k classifiers C_i (e.g., E9).
S_n consists of records of the form (x, c), where c is the true label of the record.
C_i's classification error on example (x, c) is 1 − f_c^i(x).
Mean square error of classifier C_i:
MSE_i = (1/|S_n|) Σ_{(x,c)∈S_n} (1 − f_c^i(x))^2
Ensemble Classifiers
A classifier that predicts randomly will have mean square error:
MSE_r = Σ_c p(c) (1 − p(c))^2
Ex: two classes, each with p(c) = 0.5:
MSE_r = 0.5 × (0.5)^2 + 0.5 × (0.5)^2 = 0.25
Ensemble Classifiers
We discard classifiers whose error is equal to or larger than MSE_r.
Weight w_i for classifier C_i: w_i = MSE_r − MSE_i
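A minimal sketch of the weighting scheme above, assuming each classifier exposes a function that returns class probabilities (all names are illustrative):

```python
def mse_i(prob_fn, chunk):
    """MSE_i: mean of (1 - f_c^i(x))^2 over labeled examples (x, c) in S_n.
    prob_fn(x) returns a list of class probabilities for example x."""
    errors = [(1.0 - prob_fn(x)[c]) ** 2 for x, c in chunk]
    return sum(errors) / len(errors)

def mse_r(class_priors):
    """MSE of a classifier that predicts randomly: sum_c p(c) * (1 - p(c))^2."""
    return sum(p * (1.0 - p) ** 2 for p in class_priors)

def weight(msr, msi):
    """w_i = MSE_r - MSE_i; classifiers with non-positive weight are discarded."""
    return msr - msi
```

For two equally likely classes, `mse_r([0.5, 0.5])` gives 0.25, matching the example on the previous slide.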
Ensemble Classifiers
For cost-sensitive applications such as credit card fraud detection, weights are derived from benefits instead of accuracy. Benefit matrix (t(x): transaction amount; cost: investigation overhead):

                   Predict fraud   Predict not fraud
Actual fraud       t(x) − cost     0
Actual not fraud   −cost           0

Ex. with cost = $90: predicting fraud on a legitimate transaction yields −90; leaving it alone yields 0.
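Under this benefit matrix, flagging a transaction pays off only when the expected benefit p(fraud|x)·t(x) − cost is positive. A hedged sketch (the $90 default overhead is an illustrative value, not stated on the slide):

```python
def expected_benefit(p_fraud, t_x, cost):
    """Expected benefit of predicting fraud:
    p * (t(x) - cost) + (1 - p) * (-cost) = p * t(x) - cost."""
    return p_fraud * t_x - cost

def flag_as_fraud(p_fraud, t_x, cost=90.0):
    """Investigate the transaction only when the expected benefit is positive."""
    return expected_benefit(p_fraud, t_x, cost) > 0

print(flag_as_fraud(0.10, 1000.0))  # True  (expected benefit about +10)
print(flag_as_fraud(0.05, 1000.0))  # False (expected benefit about -40)
```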
Instance Based Pruning
Goal:
– Use the first k classifiers with the highest weights to reach the same decision as using all K classifiers.
Instance Based Pruning
The pipeline procedure stops when:
– a confident prediction can be made, or
– there are no more classifiers in the pipeline
(Classifiers C1, C2, …, Ck, …, CK are consulted in order from high weight to low weight.)
Instance Based Pruning
After consulting the first k classifiers, we derive the current weighted probability:
F_k(x) = Σ_{i=1}^{k} w_i f^i(x) / Σ_{i=1}^{k} w_i
Instance Based Pruning
Let ε_k(x) = F_k(x) − F_K(x) be the error at stage k. We compute the mean μ_k and the variance σ_k^2 of ε_k(x) on the training data.
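The pruning procedure can be sketched as follows, assuming classifiers are sorted by decreasing weight and mu[k], sigma[k] hold the mean and standard deviation of ε_k(x) estimated on training data (the names, the confidence multiplier t, and the 0.5 decision threshold are illustrative choices, not taken from the slide):

```python
def pruned_predict(classifiers, weights, x, mu, sigma, t=3.0, threshold=0.5):
    """Consult classifiers in decreasing-weight order; stop early once the
    bias-corrected estimate F_k(x) - mu_k is confidently on one side of the
    decision threshold."""
    num = den = 0.0
    for k, (clf, w) in enumerate(zip(classifiers, weights)):
        num += w * clf(x)          # running sum of w_i * f^i(x)
        den += w                   # running sum of w_i
        f_k = num / den            # current weighted probability F_k(x)
        low = f_k - mu[k] - t * sigma[k]
        high = f_k - mu[k] + t * sigma[k]
        if low > threshold or high < threshold:
            return f_k - mu[k]     # confident: stop after k + 1 classifiers
    return num / den               # consulted all K classifiers
```

With confident early exits, most test examples consult only the first few high-weight classifiers instead of all K.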
Experiments
Two kinds of data:
Synthetic Data
– Synthetic data with drifting concepts based on a moving hyperplane.
Credit Card Fraud Data
– One year of data, 5 million transactions
Experiments
Figure 4: Training Time, ChunkSize, and Error Rate
Experiments
Figure 5: Effects of Instance Based Pruning
Experiments
Figure 6: Average Error Rate of Single and Ensemble Decision Tree Classifiers
Experiments
Figure 7: Averaged Benefits using Single Classifiers and Classifier Ensembles
The benefits are averaged from multiple runs with different chunk sizes (3000 to … transactions per chunk).
The benefits of E_k and G_k (k = 2, …, 8) are averaged for each fixed chunk size.
Conclusion
The problem of mining data streams:
– The tremendous amount of data is constantly evolving
– Concept drift
Weighted ensemble classifiers are more efficient than single classifiers.
Q & A
THANKS FOR YOUR ATTENTION.