Data Stream Classification: Training with Limited Amount of Labeled Data
Mohammad Mehedy Masud, Latifur Khan, Bhavani Thuraisingham (University of Texas at Dallas)
Jing Gao, Jiawei Han (University of Illinois at Urbana-Champaign)
To appear in IEEE International Conference on Data Mining (ICDM), Pisa, Italy, Dec 15-19, 2008
Funded by: Air Force
Data Stream Classification Techniques
Data stream classification is a challenging task for two reasons:
◦ Infinite length – can’t use all historical data for training
◦ Concept-drift – old models become outdated
Solutions:
◦ Single model classification with incremental learning
◦ Ensemble classification
◦ Ensemble techniques can be updated more efficiently and handle concept-drift more effectively
Our solution: ensemble approach
Ensemble Classification
Ensemble techniques build an ensemble of M classifiers.
◦ The data stream is divided into equal-sized chunks
◦ A new classifier is trained from each labeled chunk
◦ The new classifier replaces one old classifier (if required)
[Figure: data stream with the last labeled and last unlabeled data chunks; a new classifier is trained from the labeled chunk and used to update the ensemble L1, L2, ..., LM; the ensemble classifies the unlabeled chunk.]
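A minimal sketch of the chunk-based ensemble update described on this slide. The slide only says the new classifier replaces one old classifier "if required"; the replacement rule used here (drop the candidate with the lowest accuracy on the latest labeled chunk) and the majority-vote prediction are assumptions made for illustration, and the names `accuracy`, `update_ensemble`, and `ensemble_predict` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

# A classifier is modeled as any function from a feature vector to a class label.
Classifier = Callable[[Sequence[float]], int]
LabeledChunk = List[Tuple[Sequence[float], int]]


def accuracy(clf: Classifier, chunk: LabeledChunk) -> float:
    """Fraction of labeled instances in `chunk` that `clf` predicts correctly."""
    if not chunk:
        return 0.0
    return sum(clf(x) == y for x, y in chunk) / len(chunk)


def update_ensemble(ensemble: List[Classifier], new_clf: Classifier,
                    chunk: LabeledChunk, max_size: int) -> List[Classifier]:
    """Add `new_clf`; if the ensemble would exceed `max_size` (M), drop one member."""
    candidates = ensemble + [new_clf]
    if len(candidates) <= max_size:
        return candidates
    # Assumed policy: remove the candidate with the lowest accuracy on the latest chunk.
    scores = [accuracy(c, chunk) for c in candidates]
    worst = scores.index(min(scores))
    return candidates[:worst] + candidates[worst + 1:]


def ensemble_predict(ensemble: List[Classifier], x: Sequence[float]) -> int:
    """Assumed combination rule: simple majority vote over the ensemble members."""
    votes = Counter(clf(x) for clf in ensemble)
    return votes.most_common(1)[0][0]
```

Because the new classifier competes with the existing members, it is only kept when it is not the worst candidate on the latest chunk, which is one way to read "if required".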
Limited Labeled Data Problem
Ensemble techniques assume that the entire data chunk is labeled during training. This assumption is impractical because:
◦ Data labeling is both time-consuming and costly
◦ It may not be possible to label all the data
◦ Especially in a streaming environment, where data is produced at high speed
Our solution:
◦ Train classifiers with a limited amount of labeled data
◦ Assume only a fraction of each data chunk is labeled
◦ We obtain better results than other techniques that train classifiers with fully labeled data chunks
Training With Partially-Labeled Chunk
◦ Train a new classifier using semi-supervised clustering
◦ If a new class has arrived in the stream, refine the existing classifiers
◦ Update the ensemble
[Figure: data stream with the last partially-labeled chunk and the last unlabeled data chunk; a classifier is trained from the partially-labeled chunk using semi-supervised clustering and is used to refine and update the ensemble L1, L2, ..., LM; the ensemble classifies the unlabeled chunk.]
Semi-Supervised Clustering
Overview:
◦ Only a few instances in the training data are labeled
◦ We apply impurity-based clustering
◦ The goal is to minimize cluster impurity
◦ A cluster is completely pure if all the labeled data in that cluster are from the same class
Objective function: [equation not recovered from the slide], where
◦ K: number of clusters
◦ X_i: data points belonging to cluster i
◦ L_i: labeled data points belonging to cluster i
◦ Imp_i: impurity of cluster i = Entropy * dissimilarity count
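A small sketch of the impurity term Imp_i = Entropy * dissimilarity count, computed from the labels of the labeled points in one cluster. The slide does not define either factor precisely, so two assumptions are made here: entropy is the base-2 entropy of the class distribution among the labeled points, and the dissimilarity count sums, over each labeled point, the number of labeled points in the same cluster that carry a different class. The function name `cluster_impurity` is hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def cluster_impurity(labels: Iterable[Hashable]) -> float:
    """Impurity of one cluster, given the class labels of its labeled points only."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no labeled points: treated as pure (an assumption)
    # Entropy of the class distribution among labeled points (base 2 is an assumption).
    entropy = sum(-(c / total) * math.log2(c / total) for c in counts.values())
    # Assumed dissimilarity count: each labeled point of a class contributes the
    # number of labeled points in the cluster that belong to other classes.
    dissimilarity = sum(c * (total - c) for c in counts.values())
    return entropy * dissimilarity
```

Under these assumptions a cluster whose labeled points all share one class has impurity 0, matching the slide's definition of a completely pure cluster.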
Semi-Supervised Clustering Using E-M
Constrained initialization:
◦ Initialize K seeds
◦ For each class C_j, select k_j seeds from the labeled instances of C_j using the farthest-first traversal heuristic, where k_j = (N_j / N) * K
◦ N_j = number of labeled instances in class C_j, N = total number of labeled instances
Repeat E-step and M-step until convergence:
◦ E-step: assign each instance to a cluster so that the objective function is minimized; apply Iterative Conditional Mode (ICM) until convergence
◦ M-step: re-compute cluster centroids
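The constrained initialization can be sketched as follows: seeds are drawn per class in proportion to that class's share of the labeled instances, using farthest-first traversal within each class. This is only an illustration; the starting point of the traversal, the use of Euclidean distance, and the rounding of k_j to an integer are assumptions, and the names `farthest_first` and `init_seeds` are hypothetical. With rounding, the total number of seeds may differ slightly from K.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def farthest_first(points: List[Point], k: int) -> List[Point]:
    """Pick up to k points; each pick maximizes the distance to its nearest chosen seed."""
    if k <= 0 or not points:
        return []
    seeds = [points[0]]  # assumed: the traversal starts from the first point
    while len(seeds) < min(k, len(points)):
        nxt = max(points, key=lambda p: min(math.dist(p, s) for s in seeds))
        seeds.append(nxt)
    return seeds


def init_seeds(labeled: Dict[int, List[Point]], K: int) -> List[Point]:
    """Select k_j = (N_j / N) * K seeds from the labeled instances of each class C_j."""
    N = sum(len(pts) for pts in labeled.values())
    if N == 0:
        return []
    seeds: List[Point] = []
    for pts in labeled.values():
        k_j = round(len(pts) / N * K)  # rounding to an integer is an assumption
        seeds.extend(farthest_first(pts, k_j))
    return seeds
```

For example, with three labeled points of one class, one of another, and K = 4, the first class contributes three seeds and the second contributes one.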
Saving Cluster Summary as Micro-Cluster
For each of the K clusters created using semi-supervised clustering, save the following:
◦ Centroid
◦ n: total number of points
◦ L: total number of labeled points
◦ Lt[ ]: array containing the total number of labeled points belonging to each class, e.g. Lt[j]: total number of labeled data points belonging to class C_j
◦ Sum[ ]: array containing the sum of the attribute values of each dimension over all the data points, e.g. Sum[r]: sum of the r-th dimension of all data points in the cluster
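A possible in-memory layout for this summary, shown as a sketch. The field names follow the slide (centroid, n, L, Lt, Sum); the `MicroCluster` class, the `summarize` helper, and the use of a dictionary for Lt (instead of an array indexed by class) are assumptions for illustration. Unlabeled points are represented with a label of `None`.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass
class MicroCluster:
    centroid: List[float]   # mean of all points in the cluster
    n: int                  # total number of points
    L: int                  # total number of labeled points
    Lt: Dict[int, int]      # class label -> number of labeled points of that class
    Sum: List[float]        # per-dimension sum of attribute values over all points


def summarize(points: List[Tuple[Sequence[float], Optional[int]]]) -> MicroCluster:
    """Build the summary of one cluster from (features, label) pairs."""
    if not points:
        raise ValueError("cannot summarize an empty cluster")
    dim = len(points[0][0])
    sums = [0.0] * dim
    lt: Dict[int, int] = {}
    for x, y in points:
        for r in range(dim):
            sums[r] += x[r]
        if y is not None:  # only labeled points contribute to Lt
            lt[y] = lt.get(y, 0) + 1
    n = len(points)
    return MicroCluster(centroid=[s / n for s in sums], n=n,
                        L=sum(lt.values()), Lt=lt, Sum=sums)
```

Since the centroid equals Sum divided by n, storing both is redundant in this sketch; it is kept only to mirror the fields listed on the slide.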
Using the Micro-Clusters as Classification Model
◦ We remove all the raw data points after saving the micro-clusters
◦ The set of K such micro-clusters built from a data chunk serves as a classification model
To classify a test data point x using the model:
◦ Find the Q nearest micro-clusters (by computing the distance between x and their centroids)
◦ For each class C_j, compute the “cumulative normalized frequency” CNFrq[j], where CNFrq[j] = sum of Lt[j]/L over all the Q micro-clusters
◦ Output the class C_j with the highest CNFrq[j]
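The classification rule above can be sketched in a few lines. The sketch keeps only the micro-cluster fields the rule needs (centroid, L, Lt); the `MicroCluster` tuple and the `classify` function are hypothetical names, Euclidean distance is assumed, and micro-clusters with no labeled points (L = 0) are skipped, which the slide does not specify.

```python
import math
from typing import Dict, List, NamedTuple, Sequence


class MicroCluster(NamedTuple):
    centroid: Sequence[float]
    L: int                 # number of labeled points in the micro-cluster
    Lt: Dict[int, int]     # class label -> number of labeled points of that class


def classify(x: Sequence[float], model: List[MicroCluster], Q: int) -> int:
    """Return the class with the highest cumulative normalized frequency."""
    # Find the Q micro-clusters whose centroids are nearest to x.
    nearest = sorted(model, key=lambda mc: math.dist(x, mc.centroid))[:Q]
    cnfrq: Dict[int, float] = {}
    for mc in nearest:
        if mc.L == 0:
            continue  # assumed: a micro-cluster without labeled points casts no vote
        for cls, count in mc.Lt.items():
            cnfrq[cls] = cnfrq.get(cls, 0.0) + count / mc.L
    if not cnfrq:
        raise ValueError("no labeled micro-cluster among the Q nearest")
    return max(cnfrq, key=cnfrq.get)


model = [
    MicroCluster((0.0, 0.0), 2, {0: 2}),
    MicroCluster((1.0, 0.0), 4, {0: 1, 1: 3}),
    MicroCluster((10.0, 10.0), 3, {1: 3}),
]
prediction = classify((0.2, 0.0), model, Q=2)
```

In the usage example the two nearest micro-clusters give CNFrq[0] = 2/2 + 1/4 = 1.25 and CNFrq[1] = 3/4 = 0.75, so class 0 is returned.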
Experiments
Data sets:
◦ Synthetic data – simulates evolving data streams
◦ Real data – botnet data, collected from real botnet traffic generated in a controlled environment
Baseline: On-Demand Stream
◦ C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589,
For training, we use 5% labeled data in each chunk. So, if there are 100 instances in a chunk:
◦ Our technique (SmSCluster) uses only 5 labeled and 95 unlabeled instances for training
◦ On-Demand Stream uses 100 labeled instances for training
Environment:
◦ S/W: Java, OS: Windows XP, H/W: Pentium-IV, 3GHz dual core, 2GB RAM
Results Each time unit = 1,000 data points (botnet) and 1,600 data points (synthetic)
Conclusion
◦ We address a more practical setting for classifying evolving data streams – training with a limited amount of labeled data
◦ Our technique applies semi-supervised clustering to train classification models
◦ Our technique outperforms other state-of-the-art stream classification techniques, which use 20 times more labeled data for training than our technique