Classification and Novel Class Detection in Data Streams

Presentation transcript:

Classification and Novel Class Detection in Data Streams
Mehedy Masud (1), Latifur Khan (1), Jing Gao (2), Jiawei Han (2), and Bhavani Thuraisingham (1)
(1) Department of Computer Science, University of Texas at Dallas
(2) Department of Computer Science, University of Illinois at Urbana-Champaign
This work was funded in part by

Presentation Overview
◦ Stream Mining Background
◦ Novel Class Detection - Concept Evolution

Data Streams
Data streams are:
◦ Continuous flows of data
◦ Examples: network traffic, sensor data, call center records

Data Stream Classification
◦ Uses past labeled data to build a classification model
◦ Predicts the labels of future instances using the model
◦ Helps decision making
[Figure: network traffic passes through the classification model at the firewall; attack traffic is blocked and quarantined, benign traffic goes to the server; expert analysis and labeling feed the model update.]

Challenges
◦ Infinite length
◦ Concept-drift
◦ Concept-evolution (emergence of a novel class)
◦ Recurrence (seasonal) class
ICDM 2012, Brussels, Belgium, 12/11/2012

Infinite Length
Impractical to store and use all historical data
◦ Requires infinite storage
◦ Requires infinite running time

Concept-Drift
[Figure: a data chunk with positive and negative instances, the previous hyperplane and the current hyperplane; the instances lying between the two hyperplanes are victims of concept-drift.]

Concept-Evolution
[Figure: feature space (x, y) split by thresholds x1, y1, y2 into regions A, B, C, D; a cluster of novel class instances (marked X) appears inside the existing regions.]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances
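The effect of rules R1 and R2 on a novel class can be illustrated with a small sketch. The threshold values and the sample points below are hypothetical and chosen only for illustration; the point is that a rule-based model covering the whole feature space must assign every instance to an existing class.

```python
def classify(x, y, x1, y1, y2):
    """Apply rules R1 and R2 from the slide; x1, y1, y2 are the thresholds."""
    if (x > x1 and y < y2) or (x < x1 and y < y1):
        return "+"   # rule R1
    if (x > x1 and y > y2) or (x < x1 and y > y1):
        return "-"   # rule R2
    return None      # instance lies exactly on a boundary

# Hypothetical thresholds and a hypothetical cluster of novel class instances.
X1, Y1, Y2 = 5.0, 3.0, 7.0
novel_cluster = [(8.0, 8.5), (8.2, 9.0), (7.9, 8.8)]

# Every novel instance is forced into an existing class (here "-").
labels = [classify(x, y, X1, Y1, Y2) for x, y in novel_cluster]
print(labels)
```

Because the rules partition the plane, no region is left in which an instance could be reported as belonging to an unseen class.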

Background: Ensemble of Classifiers
[Figure: an input (x, ?) is given to classifiers C1, C2 and C3; their individual outputs are combined by voting to produce the ensemble output (+ in the example).]
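A minimal sketch of the voting step, assuming simple majority voting over three toy classifiers. The classifiers `c1`, `c2`, `c3` are hypothetical stand-ins and are not the base learners used in the talk.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Collect one vote per classifier and return the most common label."""
    votes = [clf(x) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three hypothetical classifiers that disagree on some inputs.
c1 = lambda x: "+" if x[0] > 0 else "-"
c2 = lambda x: "+" if x[1] > 0 else "-"
c3 = lambda x: "+" if x[0] + x[1] > 0 else "-"

# c1 and c3 vote "+", c2 votes "-", so the ensemble output is "+".
prediction = majority_vote([c1, c2, c3], (2.0, -1.0))
print(prediction)
```

With an odd number of binary classifiers there are no ties; with more classes, a tie-breaking rule would have to be chosen.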

Background: Ensemble Classification of Data Streams
Divide the data stream into equal-sized chunks
◦ Train a classifier from each data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
[Figure: labeled chunks D1, D2, D3 train classifiers C1, C2, C3, which form the ensemble that predicts the unlabeled chunk D4; as D4 and D5 become labeled, C4 and C5 are trained and enter the ensemble, which then predicts D5 and D6.]
Addresses infinite length and concept-drift
Note: D_i may contain data points from different classes
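The chunk-based update can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the base learner is a hypothetical nearest-centroid classifier, "best" is taken to mean lowest error on the latest labeled chunk, and the helper names (`train_centroid_classifier`, `error_rate`, `update_ensemble`, `make_chunk`) are invented for the example.

```python
def train_centroid_classifier(chunk):
    """chunk: list of (features, label). Returns a nearest-centroid predictor."""
    sums, counts = {}, {}
    for feats, label in chunk:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, v in enumerate(feats):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

    def predict(feats):
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(feats, centroids[c])),
        )
    return predict

def error_rate(clf, chunk):
    """Fraction of instances in the chunk that clf labels incorrectly."""
    return sum(clf(f) != y for f, y in chunk) / len(chunk)

def update_ensemble(ensemble, labeled_chunk, L=3):
    """Train on the new chunk, then keep the L classifiers with lowest error on it."""
    candidates = ensemble + [train_centroid_classifier(labeled_chunk)]
    candidates.sort(key=lambda clf: error_rate(clf, labeled_chunk))
    return candidates[:L]

def make_chunk(shift):
    """Hypothetical two-class chunk whose distribution drifts with `shift`."""
    return [((0.0 + shift, 0.0), "-"), ((1.0 + shift, 1.0), "-"),
            ((4.0 + shift, 4.0), "+"), ((5.0 + shift, 5.0), "+")]

ensemble = []
for shift in [0.0, 0.5, 1.0, 1.5, 2.0]:
    ensemble = update_ensemble(ensemble, make_chunk(shift), L=3)
print(len(ensemble))
```

Only L classifiers are ever stored, which is how the scheme copes with infinite length, and classifiers that no longer fit the latest chunk are the first to be dropped, which is how it follows concept-drift. Scoring the new classifier on its own training chunk is a simplification made to keep the sketch short.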

Examples of Recurrence and Novel Classes
Twitter stream: a stream of messages
Each message may be given a category or “class”
◦ based on the topic
Examples
◦ “Election 2012”, “London Olympic”, “Halloween”, “Christmas”, “Hurricane Sandy”, etc.
Among these
◦ “Election 2012” and “Hurricane Sandy” are novel classes because they are new events.
Also
◦ “Halloween” is a recurrence class because it “recurs” every year.

Concept-Evolution and Feature Space
[Figure: feature space (x, y) split by thresholds x1, y1, y2 into regions A, B, C, D; a cluster of novel class instances (marked X) appears inside the existing regions.]
Classification rules:
R1. if (x > x1 and y < y2) or (x < x1 and y < y1) then class = +
R2. if (x > x1 and y > y2) or (x < x1 and y > y1) then class = -
Existing classification models misclassify novel class instances

Novel Class Detection – Prior Work
Three steps:
◦ Training and building decision boundary
◦ Outlier detection and filtering
◦ Computing cohesion and separation

Training: Creating Decision Boundary
[Figure: left, raw training data in regions A, B, C, D of the (x, y) feature space; right, the clusters created from that data, each summarized as a pseudopoint.]
◦ Raw training data
◦ Clusters are created
◦ Pseudopoints
Addresses the infinite length problem
Training is done chunk-by-chunk (one classifier per chunk)
An ensemble of classifiers is used for classification
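A rough sketch of how clusters might be turned into pseudopoints. The slide does not say what a pseudopoint stores; the summary used here (centroid, radius, and number of members) and the tiny k-means routine with deterministic initialization are assumptions made for illustration.

```python
import math

def kmeans(points, k, iters=10):
    """Very small k-means; the first k points are the initial centroids (assumption)."""
    centroids = [list(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[j].append(p)
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return clusters

def to_pseudopoint(members):
    """Summarize a cluster so the raw points can be discarded."""
    centroid = [sum(col) / len(members) for col in zip(*members)]
    radius = max(math.dist(p, centroid) for p in members)
    return {"centroid": centroid, "radius": radius, "weight": len(members)}

# Hypothetical raw training data forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
pseudopoints = [to_pseudopoint(c) for c in kmeans(points, k=2) if c]
print(pseudopoints)
```

Each pseudopoint defines a hypersphere, and the union of these hyperspheres serves as the decision boundary of the model. Storing only the summaries instead of the raw data is what keeps memory bounded.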

Outlier Detection and Filtering
[Figure: a test instance x is checked against the decision boundary of each of the L models M1, M2, ..., ML in the ensemble, and the results are combined with AND.]
◦ Test instance inside the decision boundary: not an outlier
◦ Test instance outside the decision boundary: raw outlier (Routlier)
◦ AND is true: x is a filtered outlier (Foutlier), a potential novel class instance
◦ AND is false: x is an existing class instance
Routliers may appear as a result of a novel class, concept-drift, or noise. Therefore, they are filtered to reduce noise as much as possible.
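The filtering logic can be sketched as below, assuming each model is a list of pseudopoints stored as dictionaries with a `centroid` and a `radius` (a hypothetical representation). An instance is an Routlier for a model when it lies outside every hypersphere of that model, and an Foutlier when it is an Routlier for all L models.

```python
import math

def is_routlier(x, model):
    """True if x lies outside every pseudopoint hypersphere of this model."""
    return all(math.dist(x, pp["centroid"]) > pp["radius"] for pp in model)

def is_foutlier(x, ensemble):
    """AND over the ensemble: x must be a raw outlier for every model."""
    return all(is_routlier(x, model) for model in ensemble)

# Two hypothetical models whose boundaries differ slightly (e.g. due to drift).
model_1 = [{"centroid": [0.0, 0.0], "radius": 1.0}]
model_2 = [{"centroid": [0.5, 0.0], "radius": 1.0}]
ensemble = [model_1, model_2]

print(is_foutlier((0.2, 0.0), ensemble))  # inside both boundaries
print(is_foutlier((1.3, 0.0), ensemble))  # Routlier for model_1 only
print(is_foutlier((5.0, 5.0), ensemble))  # outside both boundaries
```

The second instance shows why the AND is useful: it falls outside one model's boundary, possibly because of drift or noise, but a single dissenting model is enough to treat it as an existing class instance.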

Computing Cohesion & Separation
◦ a(x) = mean distance from an Foutlier x to the instances in λ_o,q(x)
◦ b_min(x) = minimum among all b_c(x) (e.g. b_+(x) in the figure)
◦ q-Neighborhood Silhouette Coefficient (q-NSC): if q-NSC(x) is positive, x is closer to the Foutliers than to any other class.
[Figure: an Foutlier x with its neighborhoods λ_o,5(x), λ_+,5(x) and λ_-,5(x), and the distances a(x), b_+(x) and b_-(x).]
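A sketch of the cohesion and separation computation. The transcript does not preserve the q-NSC formula itself, so the silhouette-style form used here, (b_min(x) - a(x)) / max(b_min(x), a(x)), is an assumption, as is the reading of b_c(x) as the mean distance from x to its q nearest neighbors in class c. The data points are hypothetical.

```python
import math

def mean_dist_to_q_nearest(x, instances, q):
    """Mean distance from x to its q nearest neighbors among `instances`."""
    nearest = sorted(math.dist(x, p) for p in instances)[:q]
    return sum(nearest) / len(nearest)

def q_nsc(x, other_foutliers, instances_by_class, q):
    """Assumed form: (b_min - a) / max(b_min, a); undefined if both are zero."""
    a = mean_dist_to_q_nearest(x, other_foutliers, q)          # cohesion
    b_min = min(mean_dist_to_q_nearest(x, pts, q)              # separation
                for pts in instances_by_class.values())
    return (b_min - a) / max(b_min, a)

# Hypothetical Foutlier x, its fellow Foutliers, and two existing classes.
x = (10.0, 10.0)
other_foutliers = [(10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
instances_by_class = {
    "+": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "-": [(0.0, 5.0), (1.0, 5.0), (0.0, 6.0)],
}

score = q_nsc(x, other_foutliers, instances_by_class, q=2)
print(round(score, 3))
```

Under this form the score lies in [-1, 1], and a positive value means x is closer to the other Foutliers than to any existing class, which matches the interpretation given on the slide.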

Limitation: Recurrence Class
[Figure: the stream drawn as a sequence of chunks: chunk 0, chunk 1, ..., chunk 49, chunk 50, chunk 51, chunk 52, ..., chunk 99, chunk 100, chunk 101, chunk 102, ..., chunk 149, ...]

Why Are Recurrence Classes Forgotten?
Divide the data stream into equal-sized chunks
◦ Train a classifier from the whole data chunk
◦ Keep the best L such classifiers as the ensemble
◦ Example: for L = 3
◦ Therefore, old models are discarded
◦ Old classes are “forgotten” after a while
[Figure: labeled chunks D1, D2, D3 train classifiers C1, C2, C3, which form the ensemble that predicts the unlabeled chunk D4; as D4 and D5 become labeled, C4 and C5 are trained and replace older classifiers in the ensemble.]
Addresses infinite length and concept-drift

CLAM: The Proposed Approach
CLAss Based Micro-Classifier Ensemble
◦ Training: the latest labeled chunk from the stream trains a new model, which updates the ensemble (M) (keeps all classes)
◦ The latest unlabeled instance goes through outlier detection
◦ Not outlier: classify using M (existing class)
◦ Outlier: buffering and novel class detection

Training and Updating
◦ Each chunk is first separated into different classes
◦ A micro-classifier is trained from each class’s data
◦ Each micro-classifier replaces one existing micro-classifier
◦ A total of L micro-classifiers make a Micro-Classifier Ensemble (MCE)
◦ C such MCEs constitute the whole ensemble, E
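The class-based update can be sketched as follows. Several details are assumptions: the micro-classifier is a hypothetical centroid-plus-radius summary of one class's data, and the micro-classifier being replaced is the oldest one in that class's MCE (the slide only says that one existing micro-classifier is replaced).

```python
import math
from collections import defaultdict, deque

def train_micro_classifier(instances):
    """Hypothetical micro-classifier: centroid and radius of one class's data."""
    centroid = [sum(col) / len(instances) for col in zip(*instances)]
    radius = max(math.dist(p, centroid) for p in instances)
    return {"centroid": centroid, "radius": radius}

def update(ensemble, labeled_chunk, L=3):
    """ensemble maps class label -> MCE (at most L micro-classifiers)."""
    by_class = defaultdict(list)
    for feats, label in labeled_chunk:          # separate the chunk by class
        by_class[label].append(feats)
    for label, instances in by_class.items():   # one micro-classifier per class
        mce = ensemble.setdefault(label, deque(maxlen=L))
        mce.append(train_micro_classifier(instances))  # drops the oldest when full
    return ensemble

# Class "B" appears only in the first chunk; later chunks contain only "A".
E = {}
update(E, [((0.0, 0.0), "A"), ((1.0, 0.0), "A"),
           ((5.0, 5.0), "B"), ((6.0, 5.0), "B")])
for i in range(4):
    update(E, [((0.0 + i, 0.0), "A"), ((1.0 + i, 0.0), "A")])

print(len(E["A"]), len(E["B"]))
```

Because replacement happens only inside the MCE of a class that is present in the chunk, the MCE for "B" survives the chunks in which "B" is absent. This is the property that lets the ensemble keep all classes and recognize a class when it recurs.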


Outlier Detection and Classification
◦ A test instance x is first classified with each micro-classifier ensemble
◦ Each micro-classifier ensemble gives a partial output (Y_r) and an outlier flag (boolean)
◦ If all ensembles flag x as an outlier, then it is buffered and sent to the novel class detector
◦ Otherwise, the partial outputs are combined and a class label is predicted
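A sketch of this decision step. The slide does not define the partial output or how partial outputs are combined; here the partial output of an MCE is assumed to be the smallest signed distance from x to the boundary of any of its micro-classifiers, and the predicted label is the class with the smallest such value. The ensemble layout is the same hypothetical centroid-plus-radius representation.

```python
import math

def mce_output(x, mce):
    """Return (partial output, outlier flag) for one micro-classifier ensemble."""
    gaps = [math.dist(x, m["centroid"]) - m["radius"] for m in mce]
    outlier = all(g > 0 for g in gaps)   # outside every micro-classifier
    return min(gaps), outlier

def classify(x, ensemble):
    """Return a class label, or None if x should be buffered as a potential novel instance."""
    outputs = {label: mce_output(x, mce) for label, mce in ensemble.items()}
    if all(flag for _, flag in outputs.values()):
        return None                      # every MCE flags x as an outlier
    return min(outputs, key=lambda label: outputs[label][0])

# Hypothetical ensemble with one micro-classifier per class.
ensemble = {
    "A": [{"centroid": [0.0, 0.0], "radius": 1.0}],
    "B": [{"centroid": [5.0, 5.0], "radius": 1.0}],
}

print(classify((0.5, 0.0), ensemble))    # inside the boundary of class A
print(classify((4.6, 5.0), ensemble))    # inside the boundary of class B
print(classify((20.0, 20.0), ensemble))  # outlier for every MCE
```

Instances for which `classify` returns `None` would be collected in a buffer and examined by the novel class detector once enough of them have accumulated.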

Evaluation
Competitors:
◦ CLAM (CL): proposed work
◦ SCANR (SC) [1]: prior work
◦ ECSMiner (EM) [2]: prior work
◦ Olindda [3]-WCE [4] (OW): another baseline
Datasets: Synthetic, KDD Cup 1999 & Forest covertype
1. M. M. Masud, T. M. Al-Khateeb, L. Khan, C. C. Aggarwal, J. Gao, J. Han, and B. M. Thuraisingham. Detecting recurring and novel classes in concept-drifting data streams. In Proc. ICDM ’11, Dec. 2011, pp. 1176–
2. Mohammad M. Masud, Jing Gao, Latifur Khan, Jiawei Han, and Bhavani M. Thuraisingham. Classification and novel class detection in concept-drifting data streams under time constraints. In Preprints, IEEE Transactions on Knowledge and Data Engineering (TKDE), 23(6), 2011.
3. E. J. Spinosa, A. P. de Leon F. de Carvalho, and J. Gama. Cluster-based novel concept detection in data streams applied to intrusion detection in computer networks. In Proc. ACM Symposium on Applied Computing, pages 976–980.
4. H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In Proc. Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 226–235, Washington, DC, USA, Aug. ACM.

Overall Error
[Figure: error rates on (a) SynC20, (b) SynC40, (c) Forest and (d) KDD.]

Number of Recurring Classes vs Error

Error vs Drift and Chunk Size

Summary Table

Conclusion
◦ Detect Recurrence
◦ Improved Accuracy
◦ Running Time
◦ Reduced Human Interaction
Future work: use other base learners
