Published by Rodney Anthony. Modified over 9 years ago.
1
Classification and Novel Class Detection in Data Streams
Mehedy Masud 1, Latifur Khan 1, Jing Gao 2, Jiawei Han 2, and Bhavani Thuraisingham 1
1 Department of Computer Science, University of Texas at Dallas
2 Department of Computer Science, University of Illinois at Urbana-Champaign
This work was funded in part by
2
Presentation Overview
◦ Stream Mining Background
◦ Novel Class Detection: Concept Evolution
3
Data Streams
Data streams are:
◦ Continuous flows of data
◦ Examples: network traffic, sensor data, call center records
4
Data Stream Classification
◦ Uses past labeled data to build a classification model
◦ Predicts the labels of future instances using the model
◦ Helps decision making
[Figure labels: Network traffic, Firewall, Classification model, Attack traffic (block and quarantine), Benign traffic (server), Expert analysis and labeling, Model update]
5
Challenges (Introduction)
◦ Infinite length
◦ Concept-drift
◦ Concept-evolution (emergence of a novel class)
◦ Recurrence (seasonal) class
ICDM 2012, Brussels, Belgium, 12/11/2012
6
Infinite Length
Impractical to store and use all historical data:
◦ Requires infinite storage
◦ And infinite running time
[Figure: a stream of bits, 0 1 1 0 1 1 1 1 0 0 0 0]
7
Concept-Drift
[Figure: a data chunk with negative and positive instances, the previous hyperplane, the current hyperplane, and the instances that are victims of concept-drift]
8
Concept-Evolution
[Figure: two x-y scatter plots in which the thresholds x1, y1, and y2 split the feature space into regions A, B, C, and D holding + and - instances; in one plot a cluster of novel class instances (marked X) appears]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances.
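A minimal Python sketch of rules R1 and R2 shows why a fixed rule set cannot flag a novel class: every input is mapped to + or -. The threshold values (x1 = 5, y1 = 3, y2 = 7) and the sample points are hypothetical, chosen only for illustration; the slide gives no numbers.

```python
def classify(x, y, x1=5.0, y1=3.0, y2=7.0):
    """Apply rules R1 and R2. Threshold defaults are illustrative assumptions."""
    # R1: regions labeled '+'
    if (x > x1 and y < y2) or (x < x1 and y < y1):
        return "+"
    # R2: regions labeled '-'
    if (x > x1 and y > y2) or (x < x1 and y > y1):
        return "-"
    return None  # exactly on a threshold: neither rule fires


# Hypothetical instances of a novel class clustered inside a '+' region.
novel_instances = [(6.0, 6.5), (6.2, 6.4), (5.8, 6.6)]
predictions = [classify(x, y) for x, y in novel_instances]
```

Every entry of `predictions` is an existing label, so the novel cluster is silently absorbed into an existing class rather than reported.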
9
Background: Ensemble of Classifiers
[Figure: an unlabeled input (x, ?) is passed to classifiers C1, C2, and C3; their individual outputs are +, +, and -; voting yields the ensemble output +]
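The voting step in the figure can be sketched as a plain majority vote. The three classifiers below are hypothetical stand-ins that return fixed labels, matching the +, +, - outputs shown on the slide.

```python
from collections import Counter


def majority_vote(classifiers, x):
    """Return the label predicted by the most classifiers.

    On a tie, Counter.most_common keeps first-seen order, so the label
    voted by the earliest classifier wins.
    """
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


# Hypothetical classifiers C1, C2, C3 with the outputs from the figure.
c1 = lambda x: "+"
c2 = lambda x: "+"
c3 = lambda x: "-"
ensemble_output = majority_vote([c1, c2, c3], x=(1.0, 2.0))
```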
10
Background: Ensemble Classification of Data Streams
Divide the data stream into equal-sized chunks
◦ Train a classifier from each data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
[Figure: labeled chunks D1, D2, and D3 train classifiers C1, C2, and C3, which form the ensemble that predicts the unlabeled chunk D4; once D4 is labeled it trains C4, which enters the ensemble before D5 is predicted, and likewise C5 before D6]
Addresses infinite length and concept-drift
Note: Di may contain data points from different classes
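The chunk-based scheme above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the base learner is a nearest-centroid classifier, "best" is taken to mean highest accuracy on the latest labeled chunk, and all function names are hypothetical.

```python
from collections import Counter


def train_centroid_classifier(chunk):
    """chunk: list of (features, label). Returns a nearest-centroid predictor."""
    sums, counts = {}, Counter()
    for features, label in chunk:
        counts[label] += 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def classify(features):
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(features, centroids[lab])),
        )

    return classify


def accuracy(clf, chunk):
    return sum(clf(f) == y for f, y in chunk) / len(chunk)


def update_ensemble(ensemble, labeled_chunk, L=3):
    """Train one classifier on the new chunk and keep the best L overall.

    Assumption: models are ranked by accuracy on the latest labeled chunk.
    Scoring the new model on its own training chunk is optimistic.
    """
    candidates = ensemble + [train_centroid_classifier(labeled_chunk)]
    candidates.sort(key=lambda clf: accuracy(clf, labeled_chunk), reverse=True)
    return candidates[:L]


def ensemble_predict(ensemble, features):
    votes = Counter(clf(features) for clf in ensemble)
    return votes.most_common(1)[0][0]


def make_chunk(shift):
    """Hypothetical 1-D chunk whose class means drift by `shift`."""
    return [((0.0 + shift,), "-"), ((1.0 + shift,), "-"),
            ((9.0 + shift,), "+"), ((10.0 + shift,), "+")]


ensemble = []
for shift in [0.0, 0.5, 1.0, 1.5, 2.0]:
    ensemble = update_ensemble(ensemble, make_chunk(shift), L=3)
```

Because the ensemble never exceeds L models, memory stays bounded regardless of stream length, and retraining on each new chunk lets the ensemble follow gradual drift.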
11
Examples of Recurrence and Novel Classes (Introduction)
Twitter stream: a stream of messages
Each message may be given a category or “class”
◦ based on the topic
Examples
◦ “Election 2012”, “London Olympic”, “Halloween”, “Christmas”, “Hurricane Sandy”, etc.
Among these
◦ “Election 2012” and “Hurricane Sandy” are novel classes because they are new events.
Also
◦ “Halloween” is a recurrence class because it “recurs” every year.
12
Concept-Evolution and Feature Space (Introduction)
[Figure: two x-y scatter plots in which the thresholds x1, y1, and y2 split the feature space into regions A, B, C, and D holding + and - instances; in one plot a cluster of novel class instances (marked X) appears]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances.
13
Novel Class Detection: Prior Work
Three steps:
◦ Training and building the decision boundary
◦ Outlier detection and filtering
◦ Computing cohesion and separation
14
Training: Creating Decision Boundary (Prior work)
[Figure: raw training data (+ and - instances in regions A, B, C, and D of the x-y feature space), then clusters are created, then the clusters are summarized as pseudopoints]
◦ Addresses the infinite length problem
◦ Training is done chunk-by-chunk (one classifier per chunk)
◦ An ensemble of classifiers is used for classification
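A rough sketch of the cluster-then-summarize idea: cluster the raw points, keep only a centroid, radius, and weight per cluster (a pseudopoint), and treat the union of the resulting hyperspheres as the decision boundary. The plain k-means with deterministic initialization, the field names, and the sample data are assumptions made for illustration; the slide does not specify the clustering method.

```python
import math


def kmeans(points, k, iterations=10):
    """Plain k-means; the first k points seed the centroids (sketch only)."""
    centroids = [list(p) for p in points[:k]]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, clusters


def build_pseudopoints(points, k):
    """Summarize each non-empty cluster; the raw points can then be discarded."""
    centroids, clusters = kmeans(points, k)
    return [
        {
            "centroid": c,
            "radius": max(math.dist(c, p) for p in members),
            "weight": len(members),
        }
        for c, members in zip(centroids, clusters)
        if members
    ]


def inside_decision_boundary(x, pseudopoints):
    """True if x falls inside at least one pseudopoint hypersphere."""
    return any(math.dist(x, pp["centroid"]) <= pp["radius"] for pp in pseudopoints)


raw = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
pseudopoints = build_pseudopoints(raw, k=2)
```

Storing a few pseudopoints per chunk instead of every instance is what keeps the memory footprint constant as the stream grows.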
15
Outlier Detection and Filtering (Prior work)
◦ A test instance x inside the decision boundary is not an outlier.
◦ A test instance x outside the decision boundary is a raw outlier (Routlier).
◦ x is tested against each model M1, M2, ..., ML in the ensemble of L models, and the Routlier results are combined with AND:
  True: x is a filtered outlier (Foutlier), a potential novel class instance
  False: x is an existing class instance
Routliers may appear as a result of a novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.
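The AND filter can be sketched in a few lines. Each model is assumed here to be a list of (centroid, radius) hyperspheres forming its decision boundary; that representation and the sample values are illustrative.

```python
import math


def is_routlier(x, model):
    """Raw outlier: x lies outside every hypersphere of this one model."""
    return all(math.dist(x, centroid) > radius for centroid, radius in model)


def is_foutlier(x, ensemble):
    """Filtered outlier: x is a raw outlier for all L models (logical AND)."""
    return all(is_routlier(x, model) for model in ensemble)


# Hypothetical ensemble of L = 3 models.
ensemble = [
    [((0.0, 0.0), 1.0)],
    [((0.5, 0.0), 1.0)],
    [((3.0, 3.0), 1.0)],
]
```

A point such as (3.0, 3.5) is a raw outlier for the first two models but falls inside the third, so it is treated as an existing class instance; only points rejected by every model are passed on as potential novel class instances.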
16
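Computing Cohesion & Separation (Prior work)
◦ a(x) = mean distance from an Foutlier x to the instances in λ_{o,q}(x), its q nearest Foutlier neighbors
◦ b_c(x) = mean distance from x to the instances in λ_{c,q}(x), its q nearest neighbors in class c
◦ b_min(x) = minimum among all b_c(x) (e.g. b_+(x) in the figure)
◦ q-Neighborhood Silhouette Coefficient (q-NSC): q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x))
If q-NSC(x) is positive, x is closer to the Foutliers than to any existing class.
[Figure: for q = 5, the neighborhoods λ_{o,5}(x), λ_{+,5}(x), and λ_{-,5}(x) of x, with the distances a(x), b_+(x), and b_-(x)]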
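A small sketch of the q-NSC computation, assuming the silhouette-style form q-NSC(x) = (b_min(x) - a(x)) / max(b_min(x), a(x)), which matches the sign behavior the slide describes. Function names and sample data are hypothetical.

```python
import math


def mean_dist_to_q_nearest(x, instances, q):
    """Mean distance from x to its q nearest neighbors among `instances`."""
    dists = sorted(math.dist(x, p) for p in instances)[:q]
    return sum(dists) / len(dists)


def q_nsc(x, other_foutliers, instances_by_class, q):
    """q-Neighborhood Silhouette Coefficient of an Foutlier x.

    Assumes at least one distance is non-zero (otherwise division by zero).
    """
    a = mean_dist_to_q_nearest(x, other_foutliers, q)      # cohesion
    b_min = min(                                           # separation
        mean_dist_to_q_nearest(x, pts, q) for pts in instances_by_class.values()
    )
    return (b_min - a) / max(b_min, a)


foutliers = [(0.0, 1.0), (1.0, 0.0)]
existing = {"+": [(5.0, 5.0), (6.0, 5.0)], "-": [(-5.0, -5.0), (-6.0, -5.0)]}
near_outliers = q_nsc((0.0, 0.0), foutliers, existing, q=2)
near_plus = q_nsc((5.0, 5.5), foutliers, existing, q=2)
```

Here `near_outliers` is positive because the point sits among the other Foutliers, while `near_plus` is negative because the point is closer to the existing + class.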
17
Limitation: Recurrence Class (Prior work)
[Figure: a stream divided into chunks 0, 1, ..., 49, 50; chunks 51, 52, ..., 99, 100; and chunks 101, 102, ..., 149, 150]
18
Why Are Recurrence Classes Forgotten? (Prior work)
Divide the data stream into equal-sized chunks
◦ Train a classifier from the whole data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
◦ Therefore, old models are discarded
◦ Old classes are “forgotten” after a while
[Figure: labeled chunks D1, D2, and D3 train classifiers C1, C2, and C3, which form the ensemble that predicts the unlabeled chunk D4; once D4 is labeled it trains C4, which enters the ensemble before D5 is predicted, and likewise C5 before D6]
Addresses infinite length and concept-drift
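The forgetting effect can be reproduced with a toy simulation. As a simplifying assumption, the oldest model is the one discarded (the slide says the best L are kept, without giving the criterion), and each model is reduced to the set of classes present in its training chunk. The class names are illustrative.

```python
from collections import deque


def classes_known_over_time(chunk_classes, L=3):
    """Return, after each chunk, the set of classes the ensemble still covers."""
    ensemble = deque(maxlen=L)  # one entry per model: classes seen in its chunk
    history = []
    for classes in chunk_classes:
        ensemble.append(set(classes))  # deque drops the oldest model when full
        history.append(set().union(*ensemble))
    return history


# Hypothetical stream: "Halloween" appears in chunk 0 and recurs in chunk 4.
stream = [
    {"Halloween", "News"},
    {"News"},
    {"News"},
    {"News"},
    {"Halloween", "News"},
]
history = classes_known_over_time(stream, L=3)
```

After chunk 3 the only model trained on "Halloween" has been discarded, so when the class recurs in chunk 4 the ensemble has no memory of it and would treat it as novel.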
19
CLAM: The Proposed Approach (Proposed method)
CLAM: CLAss Based Micro-Classifier Ensemble
[Figure: processing flow on the stream]
◦ Latest labeled chunk → training → new model → update the ensemble (M), which keeps all classes
◦ Latest unlabeled instance → outlier detection
  Not outlier: classify using M (existing class)
  Outlier: buffering and novel class detection
20
Training and Updating (Proposed method)
◦ Each chunk is first separated into different classes
◦ A micro-classifier is trained from each class’s data
◦ Each micro-classifier replaces one existing micro-classifier
◦ A total of L micro-classifiers make a Micro-Classifier Ensemble (MCE)
◦ C such MCEs constitute the whole ensemble, E
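The class-based update can be sketched as below. Assumptions made for illustration: a micro-classifier is a single centroid-and-radius summary of one class's data in one chunk, the whole ensemble E is a dictionary from class label to its MCE, and the micro-classifier that gets replaced is the oldest one (the slide does not say which one is replaced).

```python
import math
from collections import defaultdict


def train_micro_classifier(points):
    """Summarize one class's data in one chunk (illustrative representation)."""
    centroid = [sum(c) / len(points) for c in zip(*points)]
    radius = max(math.dist(centroid, p) for p in points)
    return {"centroid": centroid, "radius": radius}


def update_ensemble(E, labeled_chunk, L=3):
    """E maps class label -> MCE (list of at most L micro-classifiers)."""
    by_class = defaultdict(list)
    for features, label in labeled_chunk:  # separate the chunk by class
        by_class[label].append(features)
    for label, points in by_class.items():
        mce = E.setdefault(label, [])
        mce.append(train_micro_classifier(points))
        if len(mce) > L:
            mce.pop(0)  # assumption: replace the oldest micro-classifier
    return E


E = {}
update_ensemble(E, [((0.0, 0.0), "a"), ((1.0, 0.0), "a"), ((9.0, 9.0), "b")])
for _ in range(4):  # class "b" is absent from the next four chunks
    update_ensemble(E, [((0.0, 0.5), "a"), ((1.0, 0.5), "a")])
```

Because replacement happens only within a class's own MCE, the micro-classifier for "b" survives even though "b" has not appeared for several chunks, which is how the ensemble keeps all classes.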
22
Outlier Detection and Classification (Proposed method)
◦ A test instance x is first classified with each micro-classifier ensemble
◦ Each micro-classifier ensemble gives a partial output (Yr) and an outlier flag (boolean)
◦ If all ensembles flag x as an outlier, then it is buffered and sent to the novel class detector
◦ Otherwise, the partial outputs are combined and a class label is predicted
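One way to sketch this decision step, with loudly stated assumptions: the slide does not define the partial output Yr or how outputs are combined, so here Yr is taken to be the distance from x to the nearest micro-classifier centroid in that ensemble, the outlier flag is set when x lies outside every micro-classifier's radius, and the predicted label is the class with the smallest Yr. All names are hypothetical.

```python
import math


def mce_output(x, mce):
    """Return (partial output, outlier flag) for one micro-classifier ensemble."""
    dists = [math.dist(x, mc["centroid"]) for mc in mce]
    partial = min(dists)  # assumed form of the partial output Yr
    outlier = all(d > mc["radius"] for d, mc in zip(dists, mce))
    return partial, outlier


def classify_or_buffer(x, E, buffer):
    """Predict a label, or buffer x for novel class detection and return None."""
    outputs = {label: mce_output(x, mce) for label, mce in E.items()}
    if all(flag for _, flag in outputs.values()):
        buffer.append(x)  # every ensemble flagged x as an outlier
        return None
    # Assumed combination rule: pick the class with the smallest partial output.
    return min(outputs, key=lambda label: outputs[label][0])


E = {
    "a": [{"centroid": [0.0, 0.0], "radius": 1.0}],
    "b": [{"centroid": [10.0, 10.0], "radius": 1.0}],
}
buffer = []
label_near_a = classify_or_buffer((0.5, 0.0), E, buffer)
label_far = classify_or_buffer((50.0, 50.0), E, buffer)
```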
23
Evaluation
Competitors:
◦ CLAM (CL): proposed work
◦ SCANR (SC) [1]: prior work
◦ ECSMiner (EM) [2]: prior work
◦ OLINDDA [3]-WCE [4] (OW): another baseline
Datasets: Synthetic, KDD Cup 1999, and Forest Covertype
References:
1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham, “Detecting recurring and novel classes in concept-drifting data streams,” in Proc. ICDM ’11, Dec. 2011, pp. 1176–1181.
2. M. M. Masud, J. Gao, L. Khan, J. Han, and B. M. Thuraisingham, “Classification and novel class detection in concept-drifting data streams under time constraints,” IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6): 859–874, 2011.
3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama, “Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks,” in Proc. 2008 ACM Symposium on Applied Computing, 2008, pp. 976–980.
4. H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining concept-drifting data streams using ensemble classifiers,” in Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, Aug. 2003, pp. 226–235.
24
Overall Error (Evaluation)
[Figure: error rates on (a) SynC20, (b) SynC40, (c) Forest, and (d) KDD]
25
Number of Recurring Classes vs. Error (Evaluation)
26
Error vs. Drift and Chunk Size (Evaluation)
27
Summary Table (Evaluation)
28
Conclusion
◦ Detect recurrence
◦ Improved accuracy
◦ Running time
◦ Reduced human interaction
◦ Future work: use other base learners