Anomaly Detection in Data Mining. Hybrid Approach between Filtering-and-refinement and DBSCAN. Eng. Ştefan-Iulian Handra, Prof. Dr. Eng. Horia Ciocârlie.

Presentation transcript:

Anomaly Detection in Data Mining. Hybrid Approach between Filtering-and-refinement and DBSCAN
Eng. Ştefan-Iulian Handra, Prof. Dr. Eng. Horia Ciocârlie
SACI, May 2011

Contents
1. Introduction
2. Anomaly detection classical approaches
3. Filtering-and-refinement
4. Hybrid method
5. Experimental results
6. Conclusions and Further Development
7. Bibliography

1. Introduction
Anomaly detection:
- the process of finding individual objects that differ from the normal objects
Applications:
- safety-critical systems, insurance, health, electronic and bank fraud detection, military surveillance of enemy activities, data mining

2. Classical techniques
The Nearest Neighbor approach:
- calculates the distance between each analyzed instance in the data set and its k-th nearest neighbor
- instances in sparse regions are considered anomalies; instances in dense regions are considered normal
The Density-based Local Outliers approach:
- assigns a local outlier factor that describes the degree to which the instance is an outlier with respect to its local neighborhood
- compares the average density of the instance with the average density of its nearest neighbors
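The two scores described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the presentation: the function names `kth_neighbor_distance` and `local_outlier_factors` are hypothetical, Euclidean distance is assumed, ties at the k-th distance are ignored, and the data is assumed to contain no duplicate points.

```python
import math

def kth_neighbor_distance(points, k):
    """Nearest Neighbor score: distance from each point to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def local_outlier_factors(points, k):
    """Simplified local outlier factor (assumes no duplicates, ignores distance ties)."""
    n = len(points)
    # neighbors[i]: the k nearest neighbors of point i as (distance, index) pairs
    neighbors = []
    for i in range(n):
        ranked = sorted((math.dist(points[i], points[j]), j) for j in range(n) if j != i)
        neighbors.append(ranked[:k])
    k_dist = [nb[-1][0] for nb in neighbors]
    # local reachability density: inverse of the mean reachability distance
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))
    # LOF: average neighbor density divided by the point's own density
    return [
        sum(lrd[j] for _, j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
knn_scores = kth_neighbor_distance(points, k=2)
lof_scores = local_outlier_factors(points, k=2)
```

In this toy data set the isolated point (10, 10) gets the largest k-th neighbor distance and a local outlier factor well above 1, while the four clustered points score about 1.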

2. Classical techniques
The DBSCAN algorithm:
- well-known clustering algorithm
- based on the concepts of density-reachability and density-connectivity
- does not assign all entries to a cluster
- weaknesses: limited scalability and slow response times
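To make the "does not assign all entries to a cluster" property concrete, here is a minimal DBSCAN sketch in plain Python. It is an illustration only (the presentation used WEKA for this step); the function name `dbscan`, the `NOISE` label and the parameter names `eps` and `min_pts` are choices made for this example, and the naive neighbor search is quadratic in the number of points.

```python
import math

NOISE = -1  # label for points that end up in no cluster

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    def region(i):
        # all points within eps of point i (including i itself)
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = region(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE          # not a core point; may become a border point later
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: density-reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_region = region(j)
            if len(j_region) >= min_pts:
                queue.extend(j_region)  # j is a core point, keep expanding
        cluster += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5), (5, 6), (6, 5), (6, 6),
          (20, 20)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

The points left with the `NOISE` label are the natural anomaly candidates when DBSCAN is used for anomaly detection.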

2. Classical techniques
The Random Forest approach:
- ensemble of individual tree predictors
- each tree depends on the values of a random vector sampled independently and with the same distribution for all trees
- advantage: discovers patterns that the Euclidean distance does not
- weaknesses: requires labeled data; slow computation

3. Filtering-and-refinement
- classical methods focus on normal instances for detecting anomalies
- F&R introduces a change of paradigm: it focuses on the anomalies and not on the normal instances
- two-stage approach

3. Filtering-and-refinement
Filtering stage:
- removes the majority of normal instances
Refinement stage:
- examines the remaining data with different density-based measures

3. Filtering-and-refinement
Advantages:
- saves the majority of the processing time by analyzing only the remaining data in the second stage
- flexible and combinable with different density-based algorithms
Disadvantage:
- not yet extensively tested in practice

4. Hybrid method
- combination of Filtering-and-refinement and DBSCAN
- filtering stage: using the average value
- refinement stage: using DBSCAN
- Java routines for the filtering stage
- WEKA processing for the refinement stage

4. Hybrid method
Two separate implementations:
- F&R1: in the filtering stage we removed the largest possible percentage of normal instances (~85%)
- F&R2: in the filtering stage we removed a consistent percentage of normal instances (~65%)
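The slides only say that the filtering stage uses the average value, so the following is a sketch under stated assumptions rather than the authors' Java routine: each instance is scored by its Euclidean distance to the per-feature mean, and the given fraction of instances closest to the mean is removed. The function name `filter_by_mean` and the parameter `remove_fraction` are hypothetical.

```python
import math

def filter_by_mean(rows, remove_fraction):
    """Return the indices of the rows kept for the refinement stage.

    Assumption: "normal" instances are the ones closest to the per-feature mean,
    so the `remove_fraction` of rows nearest to it are dropped.
    """
    n_features = len(rows[0])
    mean = [sum(r[f] for r in rows) / len(rows) for f in range(n_features)]
    # order row indices from closest to farthest from the mean
    by_distance = sorted(range(len(rows)), key=lambda i: math.dist(rows[i], mean))
    n_removed = int(len(rows) * remove_fraction)
    return sorted(by_distance[n_removed:])

# 20 points on a small grid plus one far-away point at index 20
rows = [(float(i % 5), float(i // 5)) for i in range(20)] + [(50.0, 50.0)]
kept_fr1 = filter_by_mean(rows, remove_fraction=0.85)  # F&R1-style setting
kept_fr2 = filter_by_mean(rows, remove_fraction=0.65)  # F&R2-style setting
```

Only the rows whose indices are returned would then be passed to the refinement stage (DBSCAN in the hybrid method).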

4. Hybrid method
- automatically generated anomalies
- we modeled the data set in Java to be able to differentiate the anomalies from the normal instances
- 3 separate runs to compare the results (F&R1, F&R2, normal)

5. Experimental results
Data sets used
- 24 variations of data sets, each containing over […] entries
- each data set consists of one letter column followed by 16 numeric feature columns describing the letter they belong to
- for each run, the generated anomalies are also stored in separate data sets to validate the anomaly detection
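Keeping the generated anomalies in a separate data set makes it straightforward to check a run afterwards. The slides do not name the metric that was used, so the sketch below assumes precision and recall over instance identifiers; the function name `detection_quality` and the example identifiers are hypothetical.

```python
def detection_quality(flagged, true_anomalies):
    """Compare flagged instance ids with the stored, generated anomaly ids.

    Returns (precision, recall); both are 0.0 when the respective set is empty.
    """
    flagged, true_anomalies = set(flagged), set(true_anomalies)
    hits = len(flagged & true_anomalies)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(true_anomalies) if true_anomalies else 0.0
    return precision, recall

# ids flagged by a run vs. ids of the anomalies stored for validation
precision, recall = detection_quality(flagged=[1, 2, 3, 4], true_anomalies=[2, 4, 9])
```

Here two of the four flagged instances are real anomalies, and two of the three stored anomalies were found.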


5. Experimental results
Results
- for F&R1 and F&R2, the most costly execution of the filtering stage took about 10 s

Approach   Best time (s)   Worst time (s)
FR1        3               29
FR2        8               156
Normal     […]             […]

6. Conclusions and Further Development
6.1. Conclusions
- both F&R approaches are more accurate than the classical approach
- the F&R approach can also be applied to clustering algorithms that leave instances with unusual properties unassigned to any cluster

6. Conclusions and Further Development
6.1. Conclusions
- large overall speed gain compared to classical methods
- saves disk space and processing resources
- the hybrid method spends the majority of its time processing anomalies and not normal instances
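A rough estimate shows why removing most normal instances can pay off so strongly. It assumes that the refinement stage compares every pair of remaining instances, as a naive DBSCAN without a spatial index does; this cost model is an assumption for illustration, not a measurement from the experiments. With n instances and a removed fraction r (about 0.85 for F&R1 and 0.65 for F&R2):

```latex
\[
\frac{C_{\text{filtered}}}{C_{\text{full}}}
  = \frac{\bigl((1-r)\,n\bigr)^{2}}{n^{2}}
  = (1-r)^{2},
\qquad
r = 0.85 \;\Rightarrow\; 0.15^{2} = 0.0225,
\qquad
r = 0.65 \;\Rightarrow\; 0.35^{2} = 0.1225 .
\]
```

Under that assumption the refinement stage performs roughly 44 times fewer pairwise comparisons for F&R1 and roughly 8 times fewer for F&R2, before adding back the cost of the filtering stage itself.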

6. Conclusions and Further Development
6.2. Further Development
- adaptation of the algorithm to different domains
- use the “filtered out” instances for training parallel neural networks
- experiment with a hybrid method between the RF predictor and the F&R approach

7. Bibliography
- Xiao Yu, Lu An Tang, Jiawei Han. “Filtering and Refinement: A Two-Stage Approach for Efficient and Effective Anomaly Detection.” In Ninth IEEE International Conference on Data Mining.
- Liu F.T., Ting K.M., and Zhou Z. “Isolation forest.” In ICDM’08.
- Shi T. and Horvath S. “Unsupervised learning with random forest predictors.” In J. Computational and Graphical Statistics.
- Wenke Lee, Salvatore J. Stolfo, Philip K. Chan, Eleazar Eskin, Wei Fan, Matthew Miller, Shlomo Hershkop and Junxin Zhang. “Real Time Data Mining-based Intrusion Detection.” Conference paper of the North Carolina State University at Raleigh Department of Computer Science, Jan

Thank you for your attention!
SACI, May 2011