Data Stream Classification: Training with Limited Amount of Labeled Data

Mohammad Mehedy Masud, Latifur Khan, Bhavani Thuraisingham (University of Texas at Dallas)
Jing Gao, Jiawei Han (University of Illinois at Urbana-Champaign)

To appear in the IEEE International Conference on Data Mining (ICDM), Pisa, Italy, Dec 15-19, 2008. Funded by: Air Force

Data Stream Classification Techniques

Data stream classification is a challenging task for two reasons:
◦ Infinite length – we cannot use all historical data for training
◦ Concept-drift – old models become outdated

Solutions:
◦ Single-model classification with incremental learning
◦ Ensemble classification

Ensemble techniques can be updated more efficiently and handle concept-drift more effectively. Our solution: an ensemble approach.

Ensemble Classification

Ensemble techniques build an ensemble of M classifiers:
◦ The data stream is divided into equal-sized chunks
◦ A new classifier is trained from each labeled chunk
◦ The new classifier replaces one old classifier (if required)

[Figure: the last labeled chunk trains a new classifier; the ensemble {L1, ..., LM} (1) classifies the last unlabeled data chunk and (2) is updated with the new classifier.]
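A minimal sketch of this update loop, assuming a generic Classifier interface, an accuracy() validation score, and majority voting as the combiner (all hypothetical names; the paper does not prescribe this API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of chunk-based ensemble maintenance (hypothetical API). */
public class StreamEnsemble {

    interface Classifier {
        int predict(double[] x);
        /** Validation score on a labeled chunk (assumed helper). */
        double accuracy(List<double[]> chunk, List<Integer> labels);
    }

    private final int maxSize;                       // M: ensemble size
    private final List<Classifier> ensemble = new ArrayList<>();

    StreamEnsemble(int maxSize) { this.maxSize = maxSize; }

    /** Add a classifier trained on the newest labeled chunk; if the
     *  ensemble is full, drop the member that scores worst on that chunk. */
    void update(Classifier fresh, List<double[]> chunk, List<Integer> labels) {
        ensemble.add(fresh);
        if (ensemble.size() > maxSize) {
            Classifier worst = ensemble.get(0);
            for (Classifier c : ensemble)
                if (c.accuracy(chunk, labels) < worst.accuracy(chunk, labels))
                    worst = c;
            ensemble.remove(worst);
        }
    }

    /** Classify one instance of the unlabeled chunk by majority vote. */
    int classify(double[] x) {
        Map<Integer, Integer> votes = new HashMap<>();
        for (Classifier c : ensemble) votes.merge(c.predict(x), 1, Integer::sum);
        int best = -1, bestVotes = -1;
        for (Map.Entry<Integer, Integer> e : votes.entrySet())
            if (e.getValue() > bestVotes) { best = e.getKey(); bestVotes = e.getValue(); }
        return best;
    }
}
```

Replacing the member that scores worst on the newest chunk is what lets the ensemble track concept-drift: models trained on outdated concepts lose accuracy on recent data and are cycled out.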

Limited Labeled Data Problem

Ensemble techniques assume that the entire data chunk is labeled during training. This assumption is impractical because:
◦ Data labeling is both time-consuming and costly
◦ It may not be possible to label all the data, especially in a streaming environment where data is produced at high speed

Our solution:
◦ Train classifiers with a limited amount of labeled data, assuming only a fraction of each data chunk is labeled
◦ We obtain better results than other techniques that train classifiers with fully labeled data chunks

Training With a Partially-Labeled Chunk

◦ Train a new classifier using semi-supervised clustering
◦ If a new class has arrived in the stream, refine the existing classifiers
◦ Update the ensemble

[Figure: the last partially-labeled chunk trains a classifier via semi-supervised clustering; the ensemble {L1, ..., LM} classifies the last unlabeled chunk, is refined, and is updated with the new classifier.]

Semi-Supervised Clustering

Overview:
◦ Only a few instances in the training data are labeled
◦ We apply impurity-based clustering; the goal is to minimize cluster impurity
◦ A cluster is completely pure if all the labeled data in that cluster belong to the same class

Objective function (intra-cluster dispersion plus cluster impurity, minimized over cluster assignments):

O = Σ_{i=1..K} Σ_{x ∈ X_i} ||x − μ_i||² + Σ_{i=1..K} Imp_i

where K = number of clusters, X_i = data points belonging to cluster i (with centroid μ_i), L_i = labeled data points belonging to cluster i, and
Imp_i = impurity of cluster i = Entropy_i × DissimilarityCount_i
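For illustration, a small helper computing Imp_i for one cluster, reading "dissimilarity count" as the number of labeled points outside the cluster's majority class (an assumption of this sketch; the paper's exact definition may differ):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Impurity of one cluster: label entropy times the number of labeled
 *  points that disagree with the cluster's majority label (assumed reading). */
public class ClusterImpurity {

    static double impurity(List<Integer> labels) {  // labels of L_i only
        if (labels.isEmpty()) return 0.0;           // no labels -> treated as pure
        Map<Integer, Integer> counts = new HashMap<>();
        for (int y : labels) counts.merge(y, 1, Integer::sum);

        int n = labels.size();
        int majority = 0;
        double entropy = 0.0;
        for (int c : counts.values()) {
            double p = (double) c / n;
            entropy -= p * (Math.log(p) / Math.log(2));  // entropy in bits
            majority = Math.max(majority, c);
        }
        int dissimilarityCount = n - majority;      // labeled points outside majority
        return entropy * dissimilarityCount;        // Imp_i
    }

    public static void main(String[] args) {
        // A pure cluster scores 0; a mixed cluster scores positive.
        System.out.println(impurity(List.of(1, 1, 1)));       // 0.0
        System.out.println(impurity(List.of(1, 1, 2, 2, 2))); // approx. 0.971 * 2
    }
}
```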

Semi-Supervised Clustering Using E-M

Constrained initialization: initialize K seeds
◦ For each class C_j, select k_j seeds from the labeled instances of C_j using the farthest-first traversal heuristic, where k_j = (N_j / N) × K
◦ N_j = number of labeled instances of class C_j; N = total number of labeled instances

Repeat E-step and M-step until convergence:
◦ E-step: assign each instance to a cluster so that the objective function is minimized; apply Iterated Conditional Modes (ICM) until convergence
◦ M-step: re-compute the cluster centroids
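A condensed sketch of this procedure under stated simplifications: it keeps the class-proportional seed allocation and the E-M skeleton, but omits the impurity penalty in the E-step and the ICM refinement, and substitutes random seed picking for farthest-first traversal:

```java
import java.util.Random;

/** Simplified sketch of the semi-supervised E-M clustering.
 *  labels[i] = class id (>= 0), or -1 if the point is unlabeled. */
public class SemiSupervisedEM {

    /** Constrained initialization: k_j = (N_j / N) * K seeds per class C_j,
     *  drawn from that class's labeled points (random stand-in for
     *  farthest-first traversal). */
    static double[][] initSeeds(double[][] pts, int[] labels, int K, int numClasses) {
        Random rnd = new Random(42);
        double[][] seeds = new double[K][];
        int total = 0, s = 0;
        for (int y : labels) if (y >= 0) total++;
        for (int c = 0; c < numClasses && s < K; c++) {
            int nc = 0;
            for (int y : labels) if (y == c) nc++;
            if (nc == 0) continue;
            int kc = Math.max(1, Math.round((float) nc / total * K));
            for (int j = 0; j < kc && s < K; j++) {
                int pick;
                do { pick = rnd.nextInt(pts.length); } while (labels[pick] != c);
                seeds[s++] = pts[pick].clone();
            }
        }
        while (s < K) seeds[s++] = pts[rnd.nextInt(pts.length)].clone();
        return seeds;
    }

    static double dist2(double[] a, double[] b) {
        double d = 0;
        for (int r = 0; r < a.length; r++) d += (a[r] - b[r]) * (a[r] - b[r]);
        return d;
    }

    /** Repeat E-step (nearest-centroid assignment) and M-step (recompute
     *  centroids) until no assignment changes. Returns cluster ids. */
    static int[] cluster(double[][] pts, int[] labels, int K, int numClasses) {
        double[][] cen = initSeeds(pts, labels, K, numClasses);
        int[] assign = new int[pts.length];
        for (int iter = 0; iter < 100; iter++) {
            boolean changed = false;
            for (int i = 0; i < pts.length; i++) {              // E-step
                int best = 0;
                for (int k = 1; k < K; k++)
                    if (dist2(pts[i], cen[k]) < dist2(pts[i], cen[best])) best = k;
                if (assign[i] != best) { assign[i] = best; changed = true; }
            }
            if (!changed) break;                                // converged
            double[][] sum = new double[K][pts[0].length];      // M-step
            int[] cnt = new int[K];
            for (int i = 0; i < pts.length; i++) {
                cnt[assign[i]]++;
                for (int r = 0; r < pts[i].length; r++) sum[assign[i]][r] += pts[i][r];
            }
            for (int k = 0; k < K; k++)
                if (cnt[k] > 0)
                    for (int r = 0; r < cen[k].length; r++) cen[k][r] = sum[k][r] / cnt[k];
        }
        return assign;
    }
}
```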

Saving the Cluster Summary as a Micro-Cluster

For each of the K clusters created by the semi-supervised clustering, save the following:
◦ Centroid
◦ n: total number of points
◦ L: total number of labeled points
◦ Lt[ ]: array holding the number of labeled points of each class, e.g. Lt[j] = number of labeled points belonging to class C_j
◦ Sum[ ]: array holding, per dimension, the sum of the attribute values over all points, e.g. Sum[r] = sum of the r-th dimension of all data points in the cluster
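This summary maps directly onto a small record; a sketch with field names following the slide (the add() helper is an assumption, shown only to illustrate how the statistics accumulate):

```java
/** Summary statistics kept per cluster after the raw points are discarded. */
public class MicroCluster {
    double[] centroid;   // cluster centroid
    int n;               // total number of points
    int L;               // total number of labeled points
    int[] Lt;            // Lt[j]: labeled points of class C_j
    double[] sum;        // sum[r]: sum of dimension r over all points

    MicroCluster(int dims, int numClasses) {
        centroid = new double[dims];
        sum = new double[dims];
        Lt = new int[numClasses];
    }

    /** Absorb one point; label = class id, or -1 if unlabeled. */
    void add(double[] x, int label) {
        n++;
        if (label >= 0) { L++; Lt[label]++; }
        for (int r = 0; r < x.length; r++) sum[r] += x[r];
        for (int r = 0; r < x.length; r++) centroid[r] = sum[r] / n;
    }
}
```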

Using the Micro-Clusters as a Classification Model

◦ We remove all the raw data points after saving the micro-clusters
◦ The set of K micro-clusters built from a data chunk serves as a classification model
◦ To classify a test data point x using the model:
  1. Find the Q nearest micro-clusters (by computing the distance between x and their centroids)
  2. For each class C_j, compute the cumulative normalized frequency CNFrq[j] = sum of Lt[j] / L over the Q micro-clusters
  3. Output the class C_j with the highest CNFrq[j]
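The classification step then reduces to a Q-nearest-centroid vote over the saved summaries; a sketch reusing the MicroCluster record above (Q and the Euclidean metric are choices of the sketch, not prescribed by the slide):

```java
import java.util.Comparator;
import java.util.List;

/** Classifies a point with the saved micro-clusters: a Q-nearest-centroid
 *  vote weighted by each cluster's label distribution. */
public class MicroClusterModel {

    static int classify(List<MicroCluster> model, double[] x, int Q, int numClasses) {
        // Rank micro-clusters by distance from x to their centroids; keep Q.
        MicroCluster[] nearest = model.stream()
                .sorted(Comparator.comparingDouble(
                        (MicroCluster mc) -> dist2(x, mc.centroid)))
                .limit(Q)
                .toArray(MicroCluster[]::new);

        // CNFrq[j] = sum of Lt[j] / L over the Q nearest micro-clusters.
        double[] cnfrq = new double[numClasses];
        for (MicroCluster mc : nearest)
            if (mc.L > 0)
                for (int j = 0; j < numClasses; j++)
                    cnfrq[j] += (double) mc.Lt[j] / mc.L;

        // Output the class with the highest cumulative normalized frequency.
        int best = 0;
        for (int j = 1; j < numClasses; j++) if (cnfrq[j] > cnfrq[best]) best = j;
        return best;
    }

    static double dist2(double[] a, double[] b) {
        double d = 0;
        for (int r = 0; r < a.length; r++) d += (a[r] - b[r]) * (a[r] - b[r]);
        return d;
    }
}
```

Normalizing each cluster's vote by its labeled count L keeps large clusters with few labels from dominating the vote.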

Experiments

Data sets:
◦ Synthetic data – simulates evolving data streams
◦ Real data – botnet data, collected from real botnet traffic generated in a controlled environment

Baseline: On-Demand Stream
◦ C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for on-demand classification of evolving data streams. IEEE Transactions on Knowledge and Data Engineering, 18(5):577–589, 2006.

For training, we use 5% labeled data in each chunk. So, if there are 100 instances in a chunk:
◦ Our technique (SmSCluster) uses only the 5 labeled and 95 unlabeled instances for training
◦ On-Demand Stream uses all 100 labeled instances for training

Environment: S/W: Java; OS: Windows XP; H/W: Pentium-IV 3 GHz dual core, 2 GB RAM

Results

Each time unit = 1,000 data points (botnet) and 1,600 data points (synthetic)

Conclusion

◦ We address a more practical approach to classifying evolving data streams: training with a limited amount of labeled data
◦ Our technique applies semi-supervised clustering to train its classification models
◦ Our technique outperforms other state-of-the-art stream classification techniques, even though they use 20 times more labeled data for training