Data Stream Classification and Novel Class Detection
Mehedy Masud, Latifur Khan, Qing Chen and Bhavani Thuraisingham, Department of Computer Science, University of Texas at Dallas
Jing Gao, Jiawei Han, Department of Computer Science, University of Illinois at Urbana-Champaign
Charu Aggarwal, IBM T. J. Watson
Good morning everyone. I am …. And I am going to present my dissertation “Adaptive…..”. My supervisor is Dr. Latifur Khan. My research work was funded in part by NASA and the Air Force Research Lab. The PIs and Co-PIs of these research projects were Dr. Bhavani Thuraisingham and Dr. Latifur Khan.
Masud et al. Aug 10, 2011
Outline of The Presentation
Background
Data Stream Classification
Novel Class Detection
First, I will introduce data streams and data stream classification. Then I will discuss our approaches to data stream classification, namely ensemble classification, novel class detection, and classification with limited labeled data. Finally, I will discuss contributions and future work.
Introduction
Data streams are continuous flows of data. Examples include network traffic, sensor data, and call center records.
Data Stream Classification
Uses past labeled data to build a classification model
Predicts the labels of future instances using the model
Helps decision making
Figure labels: network traffic, firewall, classification model, attack traffic (block and quarantine), benign traffic (server), expert analysis and labeling, model update.
Data stream classification is an important branch of data stream mining. Here we use past labeled data to build classification models. By labeled data we mean data that has been labeled by human experts. The model is then used to classify future instances, which helps decision making. As an example, suppose we want to protect a network server from malicious traffic. We put a firewall between the server and the network traffic. A classification model sits inside the server and classifies traffic as either malicious or benign. If the traffic is benign, it is allowed to pass to the server; otherwise, it is blocked and quarantined. Since malicious traffic frequently changes its characteristics to avoid detection, we also update the model periodically. The main challenge in data stream classification is this updating process, which we will discuss in detail later.
Data Stream Classification (cont..) What are the applications? Security Monitoring Network monitoring and traffic engineering. Business : credit card transaction flows. Telecommunication calling records. Web logs and web page click streams. Masud et al. Aug 10, 2011
Challenges Infinite length Concept-drift Concept-evolution Feature Evolution Masud et al. Aug 10, 2011
Infinite Length
Data streams are naturally infinite, so it is impractical to store and use all the historical data: doing so would require infinite storage, and infinite running time to build the classification model.
Concept-Drift
Figure: successive data chunks with positive and negative instances, showing the previous hyperplane, the current hyperplane, and the instances that are victims of concept-drift.
Suppose we divide the data stream into equal-sized chunks. In the first data chunk, each data point is two-dimensional, and a hyperplane separates the data points into two classes: positive and negative. In the next data chunk, the hyperplane has shifted. As a result, some data points that were negative according to the previous hyperplane are now positive; this indicates concept-drift. In the following data chunk, the hyperplane has shifted again, and some data points have changed their label.
Concept-Evolution
Figure: a two-dimensional feature space split at x = x1, y = y1 and y = y2 into four segments A, B, C, D, shown with the existing + and - instances and again after a novel class (denoted by x) arrives.
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Concept-evolution occurs when a new class arrives in the stream. In this example, we again see a data chunk with two-dimensional data points. There are two classes, + and -. Suppose we train a rule-based classifier on this chunk. The classification rules are shown above; they divide the feature space into four segments A, B, C, D. Anything that falls into segment A is classified as +, and so on. Suppose a new class x arrives in the stream in the next chunk. If we use the same classification rules, all novel class instances will be misclassified as either + or -. Existing classification models misclassify novel class instances, so this is a problem we need to solve.
Dynamic Features
Why do new features evolve?
Infinite data stream: the global feature set is normally unknown, and new features may appear.
Concept-drift: as concepts drift, new features may appear.
Concept-evolution: a new class normally brings a new set of features.
As a result, different chunks may have different feature sets.
Dynamic Features
The ith chunk, the (i+1)st chunk, and the models have different feature sets.
Figure labels: ith chunk, (i+1)st chunk, feature extraction and selection, feature set, current model, feature space conversion, classification and novel class detection, training new model. Example feature sets shown: (runway, climb), (runway, clear, ramp), (runway, ground, ramp).
Existing classification models need a complete, fixed feature set that applies to all the chunks. Global features are difficult to predict. One solution is to use all English words to generate the feature vector, but the dimension of that vector would be too high.
In order to explain the limited labeled data problem, first let us see how existing data stream classification techniques work. Suppose we have an existing model. When a data chunk is completely labeled by human experts, it is used to update the current model. This model is then used to classify unlabeled data. However, in a data stream, it may not be possible to label all the instances in a data chunk, because manual labeling is costly and time consuming, so labeling cannot keep up with the stream speed. Therefore, in reality most of the data points in each chunk would remain unlabeled. We need a classification technique that can work with this limited labeled data.
Outline of The Presentation
Introduction
Data Stream Classification
Novel Class Detection
DataStream Classification (cont..) Single Model Incremental Classification Ensemble – model based classification Supervised Semi-supervised Active learning Masud et al. Aug 10, 2011
Overview Single Model Incremental Classification Ensemble – model based classification Data Selection Semi-supervised Skewed Data I Masud et al. Aug 10, 2011
Ensemble of Classifiers + C2 + x,? + C3 input - Individual outputs voting Ensemble output Classifier Masud et al. Aug 10, 2011
Ensemble Classification of Data Streams
Divide the data stream into equal-sized chunks.
Train a classifier from each data chunk.
Keep the best L such classifiers as the ensemble.
Note: Di may contain data points from different classes.
Figure (example for L = 3): data chunks D1 to D6 (labeled and unlabeled), the classifiers C1 to C5 trained from them, and the current ensemble used for prediction.
Before going into the details of our approach, I will illustrate the ensemble classification technique that we follow. An ensemble of classifiers is a collection of classifiers. Let L denote the number of classifiers in the ensemble, and let L = 3. In this example, the first three data chunks D1, D2 and D3 in the stream are used to train three classifiers C1, C2 and C3, respectively, so the initial ensemble is built with three classifiers. This ensemble is used to predict the class labels of the instances in the latest data chunk D4, using majority voting among the classifiers. Chunk D4 eventually becomes labeled, and classifier C4 is trained using D4. Now we have four classifiers, C1 to C4. Each classifier is evaluated on the latest labeled data chunk D4, and the worst of them is discarded. In this example, C2 is discarded and C4 takes its position. In this way, the ensemble is updated. By keeping the number of classifiers constant, we address the infinite length problem, and by removing older classifiers, we address the concept-drift problem.
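The update cycle described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: classifiers are modeled as plain callables, a labeled chunk is a list of (instance, label) pairs, and the names majority_vote, error_rate and update_ensemble are hypothetical, not taken from the authors' implementation.

```python
from collections import Counter


def majority_vote(ensemble, x):
    """Predict the label of x by majority voting among the classifiers."""
    votes = Counter(clf(x) for clf in ensemble)
    return votes.most_common(1)[0][0]


def error_rate(clf, labeled_chunk):
    """Fraction of (instance, label) pairs in the chunk that clf gets wrong."""
    return sum(clf(x) != y for x, y in labeled_chunk) / len(labeled_chunk)


def update_ensemble(ensemble, new_clf, labeled_chunk, max_size):
    """Add the newly trained classifier, evaluate every candidate on the
    latest labeled chunk, and keep only the max_size best ones."""
    candidates = ensemble + [new_clf]
    candidates.sort(key=lambda clf: error_rate(clf, labeled_chunk))
    return candidates[:max_size]
```

Because the ensemble never grows beyond max_size (L on the slide), memory use stays bounded no matter how long the stream runs.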
Concept-Evolution Problem (ECSMiner)
A completely new class of data arrives in the stream.
Figure: (a) a decision tree with tests x < x1, y < y1 and y < y2 and leaf nodes A, B, C, D; (b) the corresponding feature space partitioning; (c) a novel class (denoted by x) arrives in the stream.
I have already discussed the concept-evolution problem; just to refresh your memory, concept-evolution occurs when a completely new class arrives in the stream. In this example, the decision tree is built from this data chunk. Each leaf node of the decision tree corresponds to one segment in the feature space; for example, leaf node A corresponds to segment A. The existing classes are + and -. When a novel class x arrives, all instances of the class are misclassified by the decision tree.
ECSMiner: Overview
Figure: overview of the ECSMiner algorithm. In the data stream, older instances are labeled and newer instances, up to the just-arrived instance xnow, are unlabeled. The last labeled chunk trains a new model that updates the ensemble of L models M1, M2, ..., ML. Each arriving instance goes through outlier detection: if it is not an outlier it is classified, otherwise it is buffered for novel class detection.
Now I will explain the working principle of ECSMiner. Suppose we already have an ensemble of models. The latest labeled data chunk is used to train a new model, and this new model updates the existing ensemble. When a new data point arrives in the stream, we use the ensemble to detect whether it is an outlier. If it is not an outlier, we immediately classify it. Otherwise, we store it in a buffer for further inspection. The buffer is analyzed periodically to check for the arrival of novel classes.
Based on: Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. “Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams”. In Proceedings of 2009 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD’09), Bled, Slovenia, 7-11 Sept, 2009, pp 79-94 (extended version appeared in IEEE Transactions on Knowledge and Data Engineering (TKDE)).
Algorithm (ECSMiner)
Here is the algorithm. It has two main parts: classification with novel class detection, and training. Notice that the training part calls a function, train and save decision boundary. What does it do? Recall that traditional classification algorithms cannot detect a novel class, so we modify the existing classification techniques so that they can. Creating and saving a decision boundary during training is one component of this modification. We believe that any traditional classification algorithm, such as decision tree or k-nearest neighbor, can be modified in this way to detect novel classes. How this modification is done is covered next.
Novel Class Detection (ECSMiner)
Non-parametric: does not assume any underlying model of the existing classes.
Steps:
1. Creating and saving the decision boundary during training
2. Detecting and filtering outliers
3. Measuring cohesion and separation among test and training instances
Training: Creating Decision Boundary (ECSMiner)
Figure: raw training data in the feature space partitioned into segments A, B, C, D; clusters are created within each segment and summarized as pseudopoints.
Continuing our previous example, suppose we use this data chunk to build a decision tree. Each segment of the feature space represents a leaf node of the decision tree. Once the decision tree has been built, we cluster the data points belonging to each leaf node using K-means clustering. The total number of clusters in a data chunk is K, a constant. The number of clusters created in each leaf node is proportional to the number of data points belonging to the node. Once the clusters have been created, we discard the raw data and store only the cluster summaries, in order to save memory. Each cluster summary is called a pseudopoint. Each pseudopoint has a centroid and a radius, so each cluster corresponds to a hypersphere in the feature space. The union of all the hyperspheres in the chunk is the decision boundary. What this decision boundary is used for is covered next.
Addresses the infinite length problem.
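A minimal sketch of the pseudopoint idea follows. It assumes the clustering has already been done (the K-means step is not shown) and that the radius is the distance from the centroid to the farthest cluster member; the slide only says each pseudopoint has a centroid and a radius, so that definition, like the function names, is an assumption made for illustration.

```python
import math


def make_pseudopoint(points):
    """Summarize one cluster of raw points as a centroid plus a radius.

    Assumption: radius = distance from the centroid to the farthest member.
    """
    dim = len(points[0])
    centroid = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    radius = max(math.dist(centroid, p) for p in points)
    return {"centroid": centroid, "radius": radius, "count": len(points)}


def inside_boundary(pseudopoints, x):
    """True if x falls inside at least one hypersphere, i.e. inside the
    union of hyperspheres that forms the decision boundary."""
    return any(math.dist(pp["centroid"], x) <= pp["radius"] for pp in pseudopoints)
```

After make_pseudopoint has been called for every cluster, the raw points can be discarded, which is what keeps the memory footprint constant per chunk.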
Outlier Detection and Filtering (ECSMiner)
Figure: a test instance x is checked against the decision boundary of each model M1, M2, ..., ML in the ensemble. An instance inside a decision boundary is not an outlier; an instance outside it is a raw outlier (Routlier). The per-model results are combined with AND: if true, x is a filtered outlier (Foutlier), a potential novel class instance; if false, x is an existing class instance.
First let us see how outlier detection and filtering works. This is the decision boundary that was created during training. If a test instance x falls inside the boundary, it is not an outlier. If it falls outside the boundary, it is called a raw outlier, or Routlier. An Routlier does not imply a novel class, because outliers may appear as a result of concept-drift, noise, or insufficient training data. So, we filter the outliers to reduce noise as much as possible. This is done as follows: each test instance x is tested with each model in the ensemble. If all the models detect it as an Routlier, then it is a potential novel class instance, and we call it an Foutlier. If any model detects that it is not an Routlier, then it is assumed to be an existing class instance and is classified normally using the ensemble. But a single Foutlier does not imply a novel class; we need further evidence.
Routliers may appear as a result of a novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.
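The Routlier and Foutlier tests can be written down compactly. In this sketch each model is assumed to be a list of (centroid, radius) pairs describing its pseudopoints; that representation and the function names are illustrative choices, not the authors' data structures.

```python
import math


def is_routlier(model, x):
    """x is a raw outlier for one model if it lies outside every
    pseudopoint hypersphere of that model."""
    return all(math.dist(centroid, x) > radius for centroid, radius in model)


def is_foutlier(ensemble, x):
    """x is a filtered outlier only if ALL models flag it as an Routlier
    (the AND in the figure); otherwise it is treated as an existing class
    instance and classified normally."""
    return all(is_routlier(model, x) for model in ensemble)
```

Requiring agreement from every model is what filters out instances that are outliers for only some models, for example because of concept-drift or noise.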
Novel Class Detection (ECSMiner)
Figure: flowchart for a test instance x and the ensemble of L models M1, M2, ..., ML.
Step 1: each model tests whether x is an Routlier.
Step 2: the results are combined with AND. If false, x is an existing class instance; if true, x is a filtered outlier (Foutlier), a potential novel class instance.
Step 3: compute q-NSC with all models and other Foutliers.
Step 4: are there q’ > q Foutliers with q-NSC > 0 with all models? If no, treat as existing class; if yes, a novel class is found.
Computing Cohesion & Separation (ECSMiner)
Figure: an Foutlier x with its neighborhoods λo,5(x), λ+,5(x) and λ-,5(x), and the mean distances a(x), b+(x) and b-(x).
a(x) = mean distance from an Foutlier x to the instances in λo,q(x)
bmin(x) = minimum among all bc(x) (e.g. b+(x) in the figure)
q-Neighborhood Silhouette Coefficient (q-NSC): q-NSC(x) = (bmin(x) - a(x)) / max(bmin(x), a(x))
Suppose the black dots represent the Foutlier instances, and + and - are the existing class instances. We need to know whether the Foutliers really belong to a novel class. For each Foutlier x, we compute the distance from x to each of its λc,q neighborhoods. In this example, let q = 5, and let λo,5(x) be the Foutlier neighborhood of x. Let a(x) be the mean distance from x to its Foutlier neighborhood, b+(x) the mean distance from x to its λ+ neighborhood, b-(x) the mean distance from x to its λ- neighborhood, and bmin(x) the minimum of all bc(x). Using these values, we compute the q-neighborhood silhouette coefficient, or q-NSC. Its value is between -1 and +1. If q-NSC(x) is positive, x is closer to the Foutliers than to any existing class. If many Foutliers have positive q-NSC, the Foutliers have strong cohesion among themselves and separation from the existing classes.
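To make the cohesion and separation measure concrete, here is a small sketch. It assumes the silhouette-style form (bmin(x) - a(x)) / max(bmin(x), a(x)), which matches the stated range of -1 to +1 and the sign behavior described in the notes, and it works on raw points rather than pseudopoints. The function names and the dictionary layout are hypothetical.

```python
import math


def mean_dist_to_q_nearest(x, points, q):
    """Mean distance from x to its q nearest neighbors among points."""
    nearest = sorted(math.dist(x, p) for p in points)[:q]
    return sum(nearest) / len(nearest)


def q_nsc(x, other_foutliers, existing_by_class, q):
    """q-neighborhood silhouette coefficient of an Foutlier x.

    other_foutliers: the Foutliers other than x.
    existing_by_class: dict mapping class label -> list of instances.
    Assumes a(x) and bmin(x) are not both zero.
    """
    a = mean_dist_to_q_nearest(x, other_foutliers, q)
    b_min = min(
        mean_dist_to_q_nearest(x, pts, q) for pts in existing_by_class.values()
    )
    return (b_min - a) / max(b_min, a)
```

A value near +1 means x sits much closer to its fellow Foutliers than to any existing class, which is the evidence the detector accumulates before declaring a novel class.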
Speeding Up
Computing q-NSC for every Foutlier instance x takes quadratic time in the number of Foutliers. To make the computation faster, we create Ko pseudopoints (Fpseudopoints) from the Foutliers using K-means clustering, where Ko = (No/S) * K; here S is the chunk size and No is the number of Foutliers. We then perform the computations on the Fpseudopoints. The time complexity to compute the q-NSC of all the Fpseudopoints is O(Ko(Ko+K)), which is constant, since both Ko and K are independent of the input size. By gaining speed we lose some precision, although the loss is negligible (to be analyzed shortly).
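As a quick worked example of the formula Ko = (No/S) * K, the sketch below plugs in numbers. Rounding to the nearest integer and enforcing a minimum of one Fpseudopoint are assumptions added here, since the slide does not say how fractional values are handled; the function names are hypothetical.

```python
def num_fpseudopoints(n_foutliers, chunk_size, k):
    """Ko = (No / S) * K, rounded to an integer and at least 1 (assumption)."""
    return max(1, round(n_foutliers / chunk_size * k))


def pairwise_cost(k_o, k):
    """Number of distance computations implied by O(Ko * (Ko + K))."""
    return k_o * (k_o + k)


# With S = 2,000 and K = 50, a buffer of 200 Foutliers yields 5 Fpseudopoints,
# so q-NSC needs on the order of 5 * (5 + 50) = 275 distance computations.
k_o = num_fpseudopoints(200, 2000, 50)
cost = pairwise_cost(k_o, 50)
```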
Algorithm To Detect Novel Class ECSMiner Algorithm To Detect Novel Class Masud et al. Aug 10, 2011 26
“Speedup” Penalty
As discussed earlier, by speeding up the computation in Step 3 we lose some precision, since the result deviates from the exact result. This analysis shows that the deviation is negligible.
Figure 6. Illustrating the computation of deviation, with the squared distances (x-i)^2, (x-j)^2 and (i-j)^2 marked. i is an Fpseudopoint, i.e., a cluster of Foutliers, and j is an existing class pseudopoint, i.e., a cluster of existing class instances. In this particular example, all instances in i belong to a novel class.
“Speedup” Penalty Approximate: Exact: Deviation: Masud et al. Aug 10, 2011
Experiments - Datasets
We evaluated our approach on two synthetic and two real datasets:
SynC: synthetic data with only concept-drift, generated using a hyperplane equation. 2 classes, 10 attributes, 250K instances.
SynCN: synthetic data with concept-drift and novel class, generated using a Gaussian distribution. 20 classes, 40 attributes, 400K instances.
KDD Cup 1999 intrusion detection (10% version): real dataset. 23 classes, 34 attributes, 490K instances.
Forest Cover: real dataset. 7 classes, 54 attributes, 581K instances.
Experiments - Setup
Development language: Java
H/W: Intel P-IV with 2GB memory and 3GHz dual processor CPU
Parameter settings:
K (number of pseudopoints per chunk) = 50
N (minimum number of instances required to declare novel class) = 50
M (ensemble size) = 6
S (chunk size) = 2,000
Experiments - Baseline
Competing approaches:
i) MineClass (MC): our approach
ii) WCE-OLINDDA_Parallel (W-OP)
iii) WCE-OLINDDA_Single (W-OS)
WCE-OLINDDA is a combination of the Weighted Classifier Ensemble (WCE) and the novel class detector OLINDDA, with default parameter settings for both. We use this combination since, to the best of our knowledge, there is no approach that can classify and detect novel classes simultaneously. OLINDDA assumes there is only one normal class and that all other classes are novel. Therefore, we apply two variations: W-OP keeps parallel OLINDDA models, one for each class; W-OS keeps a single model that absorbs a novel class when encountered.
Experiments - Results
Evaluation metrics:
Mnew = % of novel class instances misclassified as existing class = Fn*100/Nc
Fnew = % of existing class instances falsely identified as novel class = Fp*100/(N-Nc)
ERR = total misclassification error (%) (including Mnew and Fnew) = (Fp+Fn+Fe)*100/N
where Fn = total novel class instances misclassified as existing class, Fp = total existing class instances misclassified as novel class, Fe = total existing class instances misclassified (other than Fp), Nc = total novel class instances in the stream, and N = total instances in the stream.
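The three metrics translate directly into code. The sketch below simply evaluates the formulas on this slide; the function name and the example counts are made up for illustration and are not experimental results.

```python
def stream_metrics(fn, fp, fe, nc, n):
    """Return (Mnew, Fnew, ERR) as percentages.

    fn: novel class instances misclassified as existing class
    fp: existing class instances misclassified as novel class
    fe: existing class instances misclassified (other than fp)
    nc: novel class instances in the stream
    n:  total instances in the stream
    """
    m_new = fn * 100 / nc
    f_new = fp * 100 / (n - nc)
    err = (fp + fn + fe) * 100 / n
    return m_new, f_new, err


# Hypothetical counts: 1,000 instances, 100 of them novel.
m_new, f_new, err = stream_metrics(fn=10, fp=18, fe=32, nc=100, n=1000)
```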
Experiments - Results Forest Cover KDD cup SynCN Masud et al. Aug 10, 2011
Experiments - Results Masud et al. Aug 10, 2011
Experiments – Parameter Sensitivity Masud et al. Aug 10, 2011
Experiments – Runtime Masud et al. Aug 10, 2011
Dynamic Features Solution: Global Features Local Features Union Mohammad Masud, Qing Chen, Latifur Khan, Jing Gao, Jiawei Han, and Bhavani Thuraisingham, “Classification and Novel Class Detection of Data Streams in A Dynamic Feature Space,” in Proc. of Machine Learning and Knowledge Discovery in Databases, European Conference, ECML PKDD 2010, Barcelona, Spain, Sept 2010, Springer, Page 337-352 Masud et al. Aug 10, 2011
Feature Mapping Across Models and Test Data Points
The feature set varies across chunks. In particular, when a new class appears, new features should be selected and added to the feature set.
Strategy 1: Lossy fixed (Lossy-F) conversion / Global. Use the same fixed feature set for the entire stream. We call this a lossy conversion because future models and instances may lose important features due to this mapping.
Strategy 2: Lossy local (Lossy-L) conversion / Local. We call this a lossy conversion because feature values may be lost during mapping.
Strategy 3: Dimension preserving (D-Preserving) mapping / Union.
Feature Space Conversion – Lossy-L Mapping (Local) Assume that each data chunk has different feature vectors When a classification model is trained, we save the feature vector with the model When an instance is tested, its feature vector is mapped (i.e., projected) to the model’s feature vector. Masud et al. Aug 10, 2011
Feature Space Conversion – Lossy-L Mapping
For example, suppose the model has two features (x,y) and the instance has two features (y,z). When testing, the instance is treated as having the model's two features (x,y), where x = 0 and the y value is kept as it is; the z value is dropped.
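The Lossy-L projection in this example is a one-liner once instances are stored as feature-to-value mappings. That dictionary representation and the function name are assumptions made for this sketch.

```python
def lossy_l_project(instance, model_features):
    """Project an instance onto the model's feature vector.

    Features the instance lacks are set to 0; features the model lacks
    are dropped, which is why the conversion is lossy.
    """
    return [instance.get(feature, 0.0) for feature in model_features]


# Model features (x, y); instance features (y, z).
projected = lossy_l_project({"y": 2.5, "z": 7.0}, ["x", "y"])
```

Here projected holds x = 0 and the original y value, and the z value no longer appears anywhere.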
Conversion Strategy II – Lossy-L Mapping Graphically: Masud et al. Aug 10, 2011
Conversion Strategy III – D-Preserving Mapping
When an instance is tested, both the model’s feature vector and the instance’s feature vector are mapped (i.e., projected) to the union of their feature vectors. The feature dimension increases, and the features of both the test instance and the model are preserved. The extra features are filled with 0s.
Conversion Strategy III – D-Preserving Mapping
For example, suppose the model has three features (a,b,c) and the instance has four features (b,c,d,e). When testing, we project both the model’s feature vector and the instance’s feature vector to (a,b,c,d,e). Therefore, in the model, d and e will be considered 0, and in the instance, a will be considered 0.
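The same example can be sketched in Python. Vectors are assumed to be stored as feature-to-value mappings, and the union is sorted alphabetically only to make the output order deterministic; both choices, like the function name and the sample values, are illustrative assumptions.

```python
def d_preserving_project(model_vector, instance):
    """Map both vectors to the union of their features, filling gaps with 0."""
    union = sorted(set(model_vector) | set(instance))
    model_out = [model_vector.get(f, 0.0) for f in union]
    instance_out = [instance.get(f, 0.0) for f in union]
    return union, model_out, instance_out


# Model features (a, b, c); instance features (b, c, d, e).
features, model_out, instance_out = d_preserving_project(
    {"a": 1.0, "b": 2.0, "c": 3.0},
    {"b": 4.0, "c": 5.0, "d": 6.0, "e": 7.0},
)
```

No value from either side is discarded, at the cost of a longer vector.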
Conversion Strategy III – D-Preserving Mapping Previous Example Masud et al. Aug 10, 2011
Discussion Local does not favor novel class, it favors existing classes. Local features will be enough to model existing classes. Union favors novel class. New features may be discriminating for novel class, hence Union works. Masud et al. Aug 10, 2011
Comparison
Which strategy is better?
Assumption: lossless conversion (union) preserves the properties of a novel class. In other words, if an instance belongs to a novel class, it remains outside the decision boundary of any model Mi of the ensemble M in the converted feature space.
Lemma: If a test point x belongs to a novel class, it will be misclassified by the ensemble M as an existing class instance under certain conditions when the Lossy-L conversion is used.
Comparison Proof: Let X1,…,XL,XL+1,…,XM be the dimensions of the model and Let X1,…,XL,XM+1,…,XN be the dimensions of the test point Suppose the radius of the closest cluster (in the higher dimension) is R Also, let the test point be a novel class instance. Combined feature space = X1,…,XL,XL+1,…,XM,XM+1,…,XN Masud et al. Aug 10, 2011
Comparison
Proof (continued):
Combined feature space = X1,…,XL,XL+1,…,XM,XM+1,…,XN
Centroid of the cluster (original space): X1=x1,…,XL=xL,XL+1=xL+1,…,XM=xM, i.e., x1,…,xL,xL+1,…,xM
Centroid of the cluster (combined space): x1,…,xL,xL+1,…,xM,0,…,0
Test point (original space): X1=x’1,…,XL=x’L,XM+1=x’M+1,…,XN=x’N, i.e., x’1,…,x’L,x’M+1,…,x’N
Test point (combined space): x’1,…,x’L,0,…,0,x’M+1,…,x’N
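Comparison
Proof (continued):
Centroid (combined space): x1,…,xL,xL+1,…,xM,0,…,0
Test point (combined space): x’1,…,x’L,0,…,0,x’M+1,…,x’N
Since the test point is a novel class instance, it lies outside the cluster in the combined space:
R^2 < ((x1 - x’1)^2 + … + (xL - x’L)^2 + x_{L+1}^2 + … + x_M^2) + (x’_{M+1}^2 + … + x’_N^2)
Writing the first group as a^2 and the second group as b^2:
R^2 < a^2 + b^2
R^2 = a^2 + b^2 - e^2 (e^2 > 0)
a^2 = R^2 + (e^2 - b^2)
a^2 < R^2 (provided that e^2 < b^2)
Under Lossy-L conversion the test point's features XM+1,…,XN are dropped, so its squared distance to the centroid is a^2. Therefore, in Lossy-L conversion, the test point will not be an outlier.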
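A small numeric check can make the lemma tangible. The values below are invented for illustration: a cluster centroid over features (x, y) with radius R = 1.5, and a test point over features (y, z). In the proof's notation this gives a^2 = 1 and b^2 = 9, so the point is outside the cluster in the union space but inside it after Lossy-L projection.

```python
import math


def dist_lossy_l(centroid, point):
    """Distance after projecting the point onto the centroid's features
    (missing features become 0, extra features are dropped)."""
    return math.sqrt(sum((v - point.get(f, 0.0)) ** 2 for f, v in centroid.items()))


def dist_union(centroid, point):
    """Distance in the union (D-Preserving) feature space."""
    features = set(centroid) | set(point)
    return math.sqrt(
        sum((centroid.get(f, 0.0) - point.get(f, 0.0)) ** 2 for f in features)
    )


centroid = {"x": 1.0, "y": 1.0}   # model feature space (x, y)
test_point = {"y": 1.0, "z": 3.0}  # test feature space (y, z)
radius = 1.5

outlier_in_union = dist_union(centroid, test_point) > radius      # a^2 + b^2 = 10
outlier_in_lossy_l = dist_lossy_l(centroid, test_point) > radius  # a^2 = 1
```

With these numbers the point is flagged as an outlier only under the union mapping, which is the situation the lemma describes.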
Baseline Approaches
WCE is the Weighted Classifier Ensemble [1], a multi-class ensemble classifier.
OLINDDA is a novel class detector [2] that works only for binary classes.
FAE is an ensemble classifier that addresses feature evolution [3] and concept drift.
ECSMiner is a multi-class ensemble classifier that addresses concept drift and concept evolution [4].
Approaches Comparison
Table: techniques (rows: OLINDDA, WCE, FAE, ECSMiner, DXMiner) against the challenges they address (columns: infinite length, concept-drift, concept-evolution, dynamic features).
We list in tabular form our proposed techniques and the challenges they meet. The first technique is MPC, or multi-partition multi-chunk ensemble, which addresses infinite length and concept-drift. ECSMiner is a novel class detector for data streams, which addresses infinite length, concept-drift and concept-evolution. Finally, ReaSC is another proposed approach, which addresses infinite length, concept-drift and limited labeled data.
Experiments: Datasets We evaluated our approach on different datasets: Data Set Concept Drift Concept Evolution Dynamic Feature # of Instance # of Class KDD 492K 7 Forest Cover 387K NASA 140K 21 Twitter 335K Masud et al. Aug 10, 2011 52
Experiments: Results
Evaluation metrics: let Fn = total novel class instances misclassified as existing class, Fp = total existing class instances misclassified as novel class, Fe = total existing class instances misclassified (other than Fp), Nc = total novel class instances in the stream, and N = total instances in the stream.
Experiments: Results
We use the following performance metrics to evaluate our technique:
Mnew = % of novel class instances misclassified as existing class, i.e., Fn*100/Nc
Fnew = % of existing class instances falsely identified as novel class, i.e., Fp*100/(N-Nc)
ERR = total misclassification error (%) (including Mnew and Fnew), i.e., (Fp+Fn+Fe)*100/N
Experiments: Setup
Development language: Java
H/W: Intel P-IV with 3GB memory and 3GHz dual processor CPU
Parameter settings:
K (number of pseudopoints per chunk) = 50
q (minimum number of instances required to declare novel class) = 50
L (ensemble size) = 6
S (chunk size) = 1,000
Experiments: Baseline
Competing approaches:
i) DXMiner (DXM): our approach - 4 variations: Lossy-F conversion, Lossy-L conversion, D-Preserving conversion
ii) FAE-WCE-OLINDDA_Parallel (W-OP): OLINDDA assumes there is only one normal class, and all other classes are novel, so W-OP keeps parallel OLINDDA models, one for each class. We use this combination since, to the best of our knowledge, there is no approach that can classify and detect novel classes simultaneously with feature evolution.
iii) FAE-ECSMiner
Twitter Results Masud et al. Aug 10, 2011 57
Twitter Results D-preserving Lossy - Local Lossy- Global O-F 0.88 0.83 AUC 0.88 0.83 0.76 0.56 Masud et al. Aug 10, 2011
NASA Dataset Deviation Info Gain O-F 0.996 0.967 0.876 AUC Masud et al. Aug 10, 2011
Forest Cover Results Masud et al. Aug 10, 2011
Forest Cover Results D-preserving O-F 0.97 0.74 AUC Masud et al. Aug 10, 2011
KDD Results Masud et al. Aug 10, 2011
KDD Results D-preserving FAE-Olindda 0.98 0.96 AUC Masud et al. Aug 10, 2011
Summary Results Masud et al. Aug 10, 2011
Improved Outlier Detection and Multiple Novel Class Detection (Proposed Methods)
Challenges:
High false positive (FP) rates (existing classes detected as novel) and false negative (FN) rates (missed novel classes)
Two or more novel classes arriving at a time
Solutions [1]:
Dynamic decision boundary, based on previous mistakes: inflate the decision boundary if FP is high, deflate it if FN is high
Build a statistical model to filter out noise data and concept drift from the outliers
Multiple novel classes are detected by constructing a graph where each outlier cluster is a vertex, merging the vertices based on the silhouette coefficient, and counting the number of connected components in the resultant (i.e., merged) graph
[1] Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Charu Aggarwal, Jiawei Han, and Bhavani Thuraisingham, Addressing Concept-Evolution in Concept-Drifting Data Streams, In Proc ICDM ’10, Sydney, Australia, Dec 14-17, 2010.
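The graph-based counting step can be sketched with a union-find structure. The slide says vertices are merged based on the silhouette coefficient but gives no details, so the merge rule is passed in as a caller-supplied predicate should_merge(i, j); that predicate, the function name, and the distance-based rule in the usage example are all hypothetical stand-ins.

```python
def count_novel_classes(n_clusters, should_merge):
    """Count connected components among outlier clusters.

    Each outlier cluster is a vertex; should_merge(i, j) decides whether
    clusters i and j belong to the same novel class (assumed rule).
    """
    parent = list(range(n_clusters))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n_clusters):
        for j in range(i + 1, n_clusters):
            if should_merge(i, j):
                parent[find(i)] = find(j)

    return len({find(i) for i in range(n_clusters)})


# Usage with a stand-in rule: merge 1-D cluster centroids closer than 1.0.
centroids = [0.0, 0.5, 10.0, 10.4, 30.0]
n_classes = count_novel_classes(
    len(centroids), lambda i, j: abs(centroids[i] - centroids[j]) < 1.0
)
```

Each connected component that remains after merging is reported as one novel class.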
Outlier Threshold (OUTTH) (Proposed Methods)
Using the cluster radius r alone is not enough to declare a test instance an outlier, because of noise in the data. So, beyond the radius r, a threshold (OUTTH) is set up, so that most noisy data around a model cluster will be classified immediately.
Figure labels: test instance x, a(x), b+(x), neighborhoods λo,5(x) and λ+,5(x), + instances.
Outlier Threshold (OUTTH)
  Every instance x outside the cluster range has a weight wt(x).
    If wt(x) >= OUTTH, the instance is considered to belong to an existing class.
    If wt(x) < OUTTH, the instance is an outlier.
  Pros:
    Noisy data is classified immediately
  Cons:
    OUTTH is hard to determine
    Noisy data and novel class instances may occur simultaneously
    Different datasets may need different OUTTH values
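The OUTTH rule on this slide can be sketched in a few lines. The slide does not give the weight formula, so the sketch assumes wt(x) = exp(r - d), where d is the distance from x to the cluster centroid; with that choice the weight is 1 on the cluster surface and decays toward 0 farther out. The function names and the OUTTH value 0.7 are hypothetical, chosen only for illustration.

```python
import math

def weight(distance: float, radius: float) -> float:
    """Assumed weight: 1.0 on the cluster surface, decaying toward 0 outside it."""
    return math.exp(radius - distance)

def is_outlier(distance: float, radius: float, outth: float) -> bool:
    """Slide rule: wt(x) >= OUTTH means existing class, wt(x) < OUTTH means outlier."""
    if distance <= radius:
        return False  # inside the cluster range, so never an outlier
    return weight(distance, radius) < outth

OUTTH = 0.7  # illustrative value only
near = is_outlier(distance=1.2, radius=1.0, outth=OUTTH)  # slightly outside r
far = is_outlier(distance=2.0, radius=1.0, outth=OUTTH)   # well outside r
print(near, far)
```

Under these assumptions, an instance just beyond the radius is absorbed as noise of an existing class, while a distant instance is kept as an outlier for later novel class analysis.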
Outlier Threshold (OUTTH)
  OUTTH = ?
    If the threshold is too high, noisy data may be declared outliers, and the FP rate goes up.
    If the threshold is too low, novel class instances are labeled as existing class, and the FN rate goes up.
    We need to balance these two.
  [Figure: a test instance x lying near a model cluster of "+" instances]
Data Stream Classification
  Introduction
  Data Stream Classification
  Clustering
  Novel Class Detection
  Finer Grain Novel Class Detection
  Dynamic Novel Class Detection
  Multiple Novel Class Detection
Dynamic threshold setting
  Defer approach
    After a test chunk has been labeled, update OUTTH based on the marginal FP and FN rates of that chunk, then apply the new OUTTH to the next test chunk.
  Eager approach
    As soon as a marginal FP or marginal FN instance is detected, update OUTTH with a step function and apply the updated OUTTH to the next test instance.
  [Figure: what a marginal FP and a marginal FN are, shown relative to a model cluster of "+" instances]
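A minimal sketch of the Eager approach follows. It rests on two assumptions that the slide does not spell out: a "marginal" mistake is taken to be a misclassified instance whose weight lies within a small `margin` of OUTTH, and the step function is a fixed increment `step`. Both parameter names and their default values are hypothetical. The direction of the update follows the rule that wt(x) >= OUTTH means existing class: lowering OUTTH inflates the decision boundary (fewer FPs), raising it deflates the boundary (fewer FNs).

```python
def eager_update(outth: float, wt: float, is_novel: bool,
                 step: float = 0.01, margin: float = 0.05) -> float:
    """Return the OUTTH to use for the next test instance.

    wt       -- weight of the instance that was just classified
    is_novel -- True if the instance actually belongs to a novel class
    """
    predicted_outlier = wt < outth
    marginal = abs(wt - outth) <= margin  # assumed meaning of "marginal"
    if marginal and predicted_outlier and not is_novel:
        outth -= step  # marginal FP: inflate the decision boundary
    elif marginal and not predicted_outlier and is_novel:
        outth += step  # marginal FN: deflate the decision boundary
    return min(1.0, max(0.0, outth))  # keep the threshold in [0, 1]

threshold = 0.7
threshold = eager_update(threshold, wt=0.68, is_novel=False)  # marginal FP
threshold = eager_update(threshold, wt=0.72, is_novel=True)   # marginal FN
print(round(threshold, 2))
```

Because the threshold moves after every marginal mistake, a run of errors in one direction is corrected within the same chunk instead of waiting for the chunk to be fully labeled.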
Dynamic threshold setting
  [Figure]
Defer approach and Eager approach comparison
  In the Defer approach, OUTTH is updated only after a data chunk is labeled.
    Too late: within the test chunk, many marginal FPs or FNs may occur because of an improper OUTTH.
    Overreaction: if the labeled test chunk contains many marginal FP or FN instances, the OUTTH update may overreact for the next test chunk.
  In the Eager approach, OUTTH is updated aggressively whenever a marginal FP or FN occurs.
    The model is more tolerant of noisy data and concept drift.
    The model is more sensitive to novel class instances.
Outlier Statistics
  For each outlier instance, we calculate the novelty probability Pnov.
    A large Pnov (close to 1) indicates that the outlier has a high probability of being a novel class instance.
  Pnov has two parts:
    The first part measures how far the outlier is from the model cluster.
    The second part, Psc, is the silhouette coefficient; it measures the cohesion and separation of the outlier's q-neighbors with respect to the model cluster.
Outlier Statistics
  Three scenarios may occur simultaneously:
    Noise data
    Concept drift
    Novel class
Outlier Statistics: Gini Analysis
  The Gini coefficient is a measure of statistical inequality.
  The discrete Gini coefficient is:
    G(s) = \frac{1}{n}\left(n + 1 - 2\,\frac{\sum_{i=1}^{n}(n + 1 - i)\,y_i}{\sum_{i=1}^{n} y_i}\right)
  If we divide [0, 1] into n equal-size bins and put each outlier's Pnov into the corresponding bin, we obtain the cdf values y_i.
    If all Pnov values are very low, then in the extreme case y_i = 1 for every i.
    If all Pnov values are very high, then in the extreme case y_i = 0 for every i except y_n = 1.
Outlier Statistics: Gini Analysis
  If the outlier Pnov values are distributed evenly, y_i = i/n.
  After obtaining the outlier Pnov distribution, calculate G(s):
    If G(s) > (n-1)/(3n), declare a novel class.
    If G(s) <= (n-1)/(3n), classify the outliers as existing class instances.
  When n → ∞, (n-1)/(3n) → 0.33.
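The Gini test can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it uses the standard discrete Gini formula G = (1/n)(n + 1 - 2 Σ(n+1-i) y_i / Σ y_i) over the binned cdf, and it takes as cut-off the value (n-1)/(3n) that G reaches when Pnov is spread evenly (y_i = i/n), which tends to 0.33 as n grows. The function names and the default n = 10 are hypothetical.

```python
def pnov_cdf(pnovs, n):
    """y_i = fraction of Pnov values that fall in bins 1..i of n equal bins on [0, 1]."""
    if not pnovs:
        raise ValueError("need at least one outlier Pnov value")
    counts = [0] * n
    for p in pnovs:
        counts[min(int(p * n), n - 1)] += 1  # Pnov == 1.0 goes to the last bin
    cdf, running = [], 0
    for c in counts:
        running += c
        cdf.append(running / len(pnovs))
    return cdf

def gini(cdf):
    """Standard discrete Gini coefficient of the sequence y_1..y_n."""
    n = len(cdf)
    weighted = sum((n + 1 - i) * y for i, y in enumerate(cdf, start=1))
    return (n + 1 - 2 * weighted / sum(cdf)) / n

def declares_novel_class(pnovs, n=10):
    """Assumed decision rule: novel class if G exceeds the even-spread value."""
    return gini(pnov_cdf(pnovs, n)) > (n - 1) / (3 * n)

print(gini(pnov_cdf([0.01] * 5, 10)))  # all Pnov very low
print(gini(pnov_cdf([0.99] * 5, 10)))  # all Pnov very high
```

With all Pnov values in the lowest bin every y_i equals 1 and G is 0; with all of them in the highest bin only y_n is non-zero and G reaches (n-1)/n, its maximum.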
Outlier Statistics: Gini Analysis Limitation
  In the extreme case, the Gini coefficient cannot differentiate concept drift from concept evolution when the concept drift simply “looks like” concept evolution.
Data Stream Classification
  Introduction
  Data Stream Classification
  Clustering
  Novel Class Detection
  Finer Grain Novel Class Detection
  Dynamic Novel Class Detection
  Multiple Novel Class Detection
Multi Novel Class Detection
  [Figure: a data stream with positive instances, negative instances, and novel instances from two novel classes, A and B]
  If we always assume that novel instances belong to a single novel class, one type of novel instance, either A or B, will be misclassified.
Multi Novel Class Detection
  The main idea in detecting multiple novel classes is to construct a graph and identify its connected components.
  The number of connected components determines the number of novel classes.
Multi Novel Class Detection
  Two phases:
  Building the connected graph
    Build a directed nearest-neighbor graph: from each vertex (outlier cluster), add an edge to its nearest neighbor.
    If the silhouette coefficient from the vertex to its nearest neighbor is larger than some threshold, the edge is removed.
    Problem: linkage circle
  Component merging phase
    Gaussian-distribution-centric decision
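The graph-building phase and the component count can be sketched as follows. The sketch makes several simplifying assumptions: each outlier cluster is represented only by its centroid, the silhouette test is replaced by a caller-supplied predicate `well_separated(i, j)` (a hypothetical stand-in, since the slide does not define how the coefficient is computed between clusters), and the component merging phase is not modelled at all.

```python
import math

def nearest_neighbor_edges(centroids):
    """One directed edge from every vertex (outlier cluster) to its nearest neighbor."""
    if len(centroids) < 2:
        return []
    edges = []
    for i, c in enumerate(centroids):
        others = (k for k in range(len(centroids)) if k != i)
        j = min(others, key=lambda k: math.dist(c, centroids[k]))
        edges.append((i, j))
    return edges

def count_components(n, edges):
    """Count connected components with union-find, ignoring edge direction."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(v) for v in range(n)})

def count_novel_classes(centroids, well_separated):
    """Drop edges between well-separated clusters, then count the components."""
    kept = [(i, j) for i, j in nearest_neighbor_edges(centroids)
            if not well_separated(i, j)]
    return count_components(len(centroids), kept)

clusters = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
far_apart = lambda i, j: math.dist(clusters[i], clusters[j]) > 5.0
print(count_novel_classes(clusters, far_apart))
```

In this toy input the four outlier clusters form two tight pairs, so two components remain, which corresponds to two novel classes.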
Multi Novel Class Detection
  Component merging phase
    In probability theory, “the normal (or Gaussian) distribution is a continuous probability distribution that is often used as a first approximation to describe real-valued random variables that tend to cluster around a single mean value” [1].
    If two Gaussian-distributed variables (g1, g2) can be separated, a separation condition on their parameters holds.
    Since μ is proportional to σ, the two variables (components) remain separated when the corresponding condition on μ holds; otherwise, the two components are merged.
  [1] Amari Shunichi, Nagaoka Hiroshi. Methods of Information Geometry. Oxford University Press, 2000. ISBN 0-8218-0531-2.
Experiment Results
Experiments: Datasets
  We evaluated our approach on different datasets.
  Table columns: Data Set, Concept Drift, Concept Evolution, Dynamic Feature, # of Instances, # of Classes
    KDD: 492K instances, 7 classes
    Forest Cover: 387K instances
    NASA: 140K instances, 21 classes
    Twitter: 335K instances
    SynED: 400K instances, 20 classes
Experiments: Setup
  Development:
    Language: Java
  H/W:
    Intel P-IV with 3GB memory and 3GHz dual processor CPU
  Parameter settings:
    K (number of pseudo points per chunk) = 50
    q (minimum number of instances required to declare a novel class) = 50
    L (ensemble size) = 6
    S (chunk size) = 1,000
Experiments: Baseline
  Competing approaches:
  i) DEMminer: our approach, 5 variations:
       Lossy-F conversion
       Lossy-L conversion
       Lossless conversion: DEMminer
       Dynamic OUTTH + Lossless conversion: DEMminer-Ex (without Gini)
       Dynamic OUTTH + Gini + Lossless conversion: DEMminer-Ex
  ii) WCE-OLINDDA (O-W)
  iii) FAE-WCE-OLINDDA_Parallel (O-F)
       We use this combination because, to the best of our knowledge, no other approach can classify and detect novel classes simultaneously under feature evolution.
Experiments: Results
  Evaluation metrics:
    Fn = total novel class instances misclassified as existing class
    Fp = total existing class instances misclassified as novel class
    Fe = total existing class instances misclassified (other than Fp)
    Nc = total novel class instances in the stream
    N = total instances in the stream
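The counts above feed the ERR, Mnew, and Fnew columns reported in the result tables, but the formulas themselves are not reproduced on this slide. The sketch below assumes the usual percentage definitions: Mnew = 100·Fn/Nc (novel instances missed), Fnew = 100·Fp/(N - Nc) (existing instances flagged as novel), and ERR = 100·(Fp + Fn + Fe)/N (total error). The function name and the sample counts are hypothetical.

```python
def stream_error_rates(fn: int, fp: int, fe: int, nc: int, n: int):
    """Return (Mnew, Fnew, ERR) as percentages under the assumed definitions."""
    if nc <= 0 or n <= nc:
        raise ValueError("need 0 < Nc < N")
    m_new = 100.0 * fn / nc            # % of novel class instances missed
    f_new = 100.0 * fp / (n - nc)      # % of existing class instances flagged novel
    err = 100.0 * (fp + fn + fe) / n   # % of all instances misclassified
    return m_new, f_new, err

# Made-up counts, only to show the arithmetic.
m_new, f_new, err = stream_error_rates(fn=10, fp=20, fe=30, nc=100, n=1000)
print(f"Mnew={m_new:.1f}%  Fnew={f_new:.2f}%  ERR={err:.1f}%")
```

Read this way, a low ERR can coexist with a high Mnew when novel instances are a small share of the stream, which is why the tables report the three values separately.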
Twitter Results
  Method      AUC
  DEMminer    0.88
  Lossy-L     0.83
  Lossy-F     0.76
  O-F         0.56
Twitter Results
  Method         AUC
  DEMminer-Ex    0.94
  DEMminer       0.88
  OW             0.56
Forest Cover Results
  Method                        AUC
  DEMminer                      0.97
  DEMminer-Ex (without Gini)    0.99
  OW                            0.74
NASA Dataset
  Method       AUC
  Deviation    0.996
  Info Gain    0.967
  FAE          0.876
KDD Results
  Method      AUC
  DEMminer    0.98
  O-F         0.96
Result Summary
  Dataset        Method                  ERR     Mnew    Fnew    AUC      FP      FN
  Twitter        DEMminer                4.2     30.5    0.8     0.877    -       -
                 Lossy-F                 32.5    0.0     32.6    0.834    -       -
                 Lossy-L                 1.6     82.0    0.0     0.764    -       -
                 O-F                     3.4     96.7    1.6     0.557    -       -
  ASRS           DEMminer (deviation)    0.02    -       -       0.996    0.00    0.1
                 DEMminer (info-gain)    1.4     -       -       0.967    0.04    10.3
                 O-F                     3.4     -       -       0.876    0.00    24.7
  Forest Cover   DEMminer                3.6     8.4     1.3     0.973
                 O-F                     5.9     20.6    1.1     0.743
  KDD            DEMminer                1.2     5.9     0.9     0.986
                 O-F                     4.7     9.6     4.4
Result Summary
  Dataset        Method                        ERR    Mnew    Fnew    AUC
  Twitter        DEMminer                      4.2    30.5    0.8     0.877
                 DEMminer-Ex                   1.8    0.7     0.6     0.944
                 OW                            3.4    96.7    1.6     0.557
  Forest Cover   DEMminer                      3.6    8.4     1.3     0.974
                 DEMminer-Ex (without Gini)    3.1    4.0     0.68    0.990
                 OW                            5.9    20.6    1.1     0.743
Running Time Comparison
                 Time (sec)/1K points          Points/sec                    Speed gain
  Dataset        DEMminer  Lossy-F  O-F        DEMminer  Lossy-F  O-F        DEMminer over O-F
  Twitter        23        3.5      66.7       43        289      15         2.9
  ASRS           21        4.3      38.5       47        233      26         1.8
  Rows with missing cells, values in their original order:
    Forest Cover: 1.0, 4.7, 967, 1003, 212
    KDD: 1.2, 3.3, 858, 812, 334, 2.5
Multi Novel Detection Results
  [Figures]
Conclusion
  Our data stream classification technique addresses:
    Infinite length
    Concept-drift
    Concept-evolution
    Feature-evolution
  Existing approaches address only the first two issues.
  Applicable to many domains, such as:
    Intrusion/malware detection
    Text categorization
    Fault detection, etc.
References
  J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Loh. “BOAT-Optimistic Decision Tree Construction.” In Proc. SIGMOD, 1999.
  P. Domingos and G. Hulten. “Mining high-speed data streams.” In Proc. SIGKDD, pages 71-80, 2000.
  B. Wenerstrom and C. Giraud-Carrier. “Temporal data mining in dynamic feature spaces.” In: Perner, P. (ed.) ICDM 2006. LNCS (LNAI), vol. 4065, pp. 1141-1145. Springer, Heidelberg, 2006.
  E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. “Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks.” In Proc. 2008 ACM Symposium on Applied Computing, pages 976–980, 2008.
  M. Scholz and R. Klinkenberg. “An ensemble classifier for drifting concepts.” In Proc. ICML/PKDD Workshop in Knowledge Discovery in Data Streams, 2005.
References (contd.)
  J. Brutlag. “Aberrant behavior detection in time series for network monitoring.” In Proc. Usenix Fourteenth System Admin. Conf. (LISA XIV), New Orleans, LA, Dec 2000.
  E. Eskin, A. Arnold, M. Prerau, L. Portnoy, and S. Stolfo. “A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data.” Applications of Data Mining in Computer Security, Kluwer, 2002.
  W. Fan. “Systematic data selection to mine concept-drifting data streams.” In Proc. KDD '04.
  J. Gao, W. Fan, and J. Han. “On Appropriate Assumptions to Mine Data Streams.” 2007.
  J. Gao, W. Fan, J. Han, and P. S. Yu. “A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions.” SDM 2007.
  J. Goebel and T. Holz. “Rishi: Identify bot contaminated hosts by IRC nickname evaluation.” In Usenix/Hotbots '07 Workshop, 2007.
  J. B. Grizzard, V. Sharma, C. Nunnery, B. B. Kang, and D. Dagon. “Peer-to-peer botnets: Overview and case study.” In Usenix/Hotbots '07 Workshop, 2007.
References (contd.)
  E. J. Keogh and M. J. Pazzani. “Scaling up dynamic time warping for data mining applications.” In: ACM SIGKDD, 2000.
  R. Lemos. “Bot software looks to improve peerage.” SecurityFocus, 2006. http://www.securityfocus.com/news/11390
  C. Livadas, B. Walsh, D. Lapsley, and T. Strayer. “Using machine learning techniques to identify botnet traffic.” In 2nd IEEE LCN Workshop on Network Security (WoNS'2006), November 2006.
  LURHQ Threat Intelligence Group. “Sinit p2p trojan analysis.” 2004. http://www.lurhq.com/sinit.html
  M. A. Rajab, J. Zarfoss, F. Monrose, and A. Terzis. “A multifaceted approach to understanding the botnet phenomenon.” In Proceedings of the 6th ACM SIGCOMM Internet Measurement Conference (IMC), 2006.
  Kagan Tumer and Joydeep Ghosh. “Error correlation and error reduction in ensemble classifiers.” Connection Science, 8(3-4):385-403, 1996.
References (contd.)
  Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. “A Multi-Partition Multi-Chunk Ensemble Technique to Classify Concept-Drifting Data Streams.” In Proc. of 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-09), pages 363-375, Bangkok, Thailand, April 2009.
  Mohammad Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. “A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data.” In Proc. of 2008 IEEE International Conference on Data Mining (ICDM 2008), pages 929-934, Pisa, Italy, December 2008.
  Clay Woolam, Mohammed Masud, and Latifur Khan. “Lacking Labels In The Stream: Classifying Evolving Stream Data With Few Labels.” In Proc. of 18th International Symposium on Methodologies for Intelligent Systems (ISMIS), pages 552-562, Prague, Czech Republic, September 2009.
References (contd.)
  Mohammad Masud, Qing Chen, Latifur Khan, Charu Aggarwal, Jing Gao, Jiawei Han, and Bhavani Thuraisingham. “Addressing Concept-Evolution in Concept-Drifting Data Streams.” In Proc. of 2010 10th IEEE International Conference on Data Mining (ICDM 2010), Sydney, Australia, Dec 2010.
  Mohammad M. Masud, Qing Chen, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. “Classification and Novel Class Detection of Data Streams in a Dynamic Feature Space.” In Proc. of European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2010), Barcelona, Spain, September 20-24, 2010, Springer 2010, ISBN 978-3-642-15882-7, pages 337-352.
  Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. “Classification and Novel Class Detection in Data Streams with Active Mining.” In Proc. of 14th Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 311-324, Hyderabad, India, 21-24 June 2010.
References (contd.)
  Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. “Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints.” IEEE Transactions on Knowledge & Data Engineering (TKDE), IEEE Computer Society, Vol. 23, No. 6, pages 859-874, June 2011.
  Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. “A Framework for Clustering Evolving Data Streams.” In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB '03), Volume 29.
  H. Wang, W. Fan, P. S. Yu, and J. Han. “Mining concept-drifting data streams using ensemble classifiers.” In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, Washington, DC, USA, Aug 2003. ACM.
  Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani Thuraisingham. “Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams.” In Proceedings of 2009 European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD'09), Bled, Slovenia, 7-11 Sept 2009.
Questions