Published by Roland Todd. Modified over 9 years ago.
1
Anomaly Detection in Data Mining. Hybrid Approach between Filtering-and-refinement and DBSCAN
Eng. Ştefan-Iulian Handra
Prof. Dr. Eng. Horia Ciocârlie
SACI, May 2011
2
Contents
1. Introduction
2. Anomaly detection classical approaches
3. Filtering-and-refinement
4. Hybrid method
5. Experimental results
6. Conclusions and Further Development
7. Bibliography
3
1. Introduction
Anomaly detection: the process of finding individual objects that differ from the normal objects in a data set
Applications: safety-critical systems, insurance, health care, electronic and bank fraud detection, military surveillance of enemy activities, data mining
4
2. Classical techniques
The Nearest Neighbor approach:
- calculates the distance between every analyzed instance in the data set and its k-th nearest neighbor
- instances in sparse regions (large distance) are considered anomalies; instances in dense regions are considered normal
The Density-based Local Outliers approach:
- assigns each instance a local outlier factor describing the degree to which the instance is an outlier with respect to its local neighborhood
- the density of the instance is compared with the average density of its nearest neighbors
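The nearest-neighbor score described above can be sketched in a few lines of Java. This is a minimal brute-force illustration, not the implementation used in the presentation; the class name `KnnOutlierScore`, the Euclidean metric, and the sample data are assumptions made for the example.

```java
import java.util.Arrays;

/** Brute-force k-th nearest neighbor distance as an outlier score (illustrative sketch). */
final class KnnOutlierScore {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** For every point, returns the distance to its k-th nearest neighbor (1 <= k <= n-1). */
    static double[] kthNeighborDistances(double[][] points, int k) {
        int n = points.length;
        if (k < 1 || k >= n) {
            throw new IllegalArgumentException("k must be in [1, n-1]");
        }
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            // Distances from point i to every other point.
            double[] dists = new double[n - 1];
            int idx = 0;
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    dists[idx++] = distance(points[i], points[j]);
                }
            }
            Arrays.sort(dists);
            scores[i] = dists[k - 1]; // a larger value means a sparser neighborhood
        }
        return scores;
    }

    public static void main(String[] args) {
        // Hypothetical data: four points on a unit square plus one distant point.
        double[][] data = { {0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10} };
        System.out.println(Arrays.toString(kthNeighborDistances(data, 2)));
    }
}
```

The distant point receives a much larger score than the four square corners, so thresholding or ranking the scores separates it. The quadratic cost of this loop is one reason such methods struggle on large data sets.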
5
2. Classical techniques
The DBSCAN algorithm:
- well-known clustering algorithm
- based on the concepts of density-reachability and density-connectivity
- does not assign every entry to a cluster; unassigned entries are treated as noise
- weaknesses: limited scalability and slow response on large data sets
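To make the density-reachability idea concrete, here is a minimal brute-force DBSCAN sketch in Java. It is not the WEKA implementation; the class name `SimpleDbscan`, the parameter names `eps` and `minPts`, and the sample points are assumptions for illustration. Points that are not density-reachable from any core point keep the label `NOISE`, which is the property that makes DBSCAN usable for anomaly detection.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Minimal brute-force DBSCAN (illustrative sketch, O(n^2) neighbor search). */
final class SimpleDbscan {
    static final int NOISE = -1;
    private static final int UNVISITED = -2;

    /** Indices of all points within eps of points[i], including i itself. */
    static List<Integer> regionQuery(double[][] points, int i, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            double sum = 0.0;
            for (int d = 0; d < points[i].length; d++) {
                double diff = points[i][d] - points[j][d];
                sum += diff * diff;
            }
            if (Math.sqrt(sum) <= eps) {
                result.add(j);
            }
        }
        return result;
    }

    /** Returns one label per point: a cluster id (0, 1, ...) or NOISE. */
    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) {
                continue;
            }
            List<Integer> neighbors = regionQuery(points, i, eps);
            if (neighbors.size() < minPts) {
                labels[i] = NOISE; // may be relabeled later as a border point
                continue;
            }
            labels[i] = clusterId; // i is a core point: start a new cluster
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (labels[q] == NOISE) {
                    labels[q] = clusterId; // border point reached from a core point
                }
                if (labels[q] != UNVISITED) {
                    continue;
                }
                labels[q] = clusterId;
                List<Integer> qNeighbors = regionQuery(points, q, eps);
                if (qNeighbors.size() >= minPts) {
                    queue.addAll(qNeighbors); // q is also a core point: keep expanding
                }
            }
            clusterId++;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical data: two dense groups and one isolated point.
        double[][] data = {
            {0, 0}, {0, 1}, {1, 0}, {1, 1},
            {10, 10}, {10, 11}, {11, 10}, {11, 11},
            {50, 50}
        };
        System.out.println(Arrays.toString(cluster(data, 1.5, 3)));
    }
}
```

Every point triggers a full scan of the data set in `regionQuery`, which illustrates the scalability weakness noted above when no spatial index is used.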
6
2. Classical techniques
The Random Forest approach:
- ensemble of individual tree predictors
- each tree depends on the values of a random vector sampled independently and with the same distribution for all trees
- advantage: discovers patterns that the Euclidean distance does not
- weaknesses: requires labeled data; calculation speed
7
3. Filtering-and-refinement
- classical methods detect anomalies by focusing on the normal instances
- F&R introduces a change of paradigm: it focuses on the anomalies rather than on the normal instances
- two-stage approach
8
3. Filtering-and-refinement
Filtering stage:
- removes the majority of normal instances
Refinement stage:
- examines the remaining data with different density-based measures
9
3. Filtering-and-refinement
Advantages:
- saves most of the processing time, because only the data remaining after filtering is analyzed in the second stage
- flexible; can be combined with different density-based algorithms
Disadvantage:
- has seen little testing in practice
10
4. Hybrid method
- combination of Filtering-and-refinement and DBSCAN
- filtering stage: uses the average value
- refinement stage: uses DBSCAN
- Java routines for the filtering stage
- WEKA processing for the refinement stage
11
4. Hybrid method
Two separate implementations:
- F&R1: the filtering stage removes the largest possible percentage of normal instances (~85%)
- F&R2: the filtering stage removes a substantial percentage of normal instances (~65%)
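The slides say only that the filtering stage uses "the average value", so the exact rule is not recoverable. The sketch below shows one plausible reading under stated assumptions: compute the per-feature mean, rank instances by Euclidean distance to that mean vector, and drop the closest fraction (0.85 for F&R1, 0.65 for F&R2). The class `MeanDistanceFilter`, the method `retainFarthest`, and the distance-to-mean criterion are all hypothetical.

```java
import java.util.Arrays;

/** Hypothetical mean-based filtering stage: keeps the instances farthest from the mean. */
final class MeanDistanceFilter {

    /** Per-feature average over all rows. */
    static double[] featureMeans(double[][] rows) {
        double[] mean = new double[rows[0].length];
        for (double[] row : rows) {
            for (int d = 0; d < mean.length; d++) {
                mean[d] += row[d];
            }
        }
        for (int d = 0; d < mean.length; d++) {
            mean[d] /= rows.length;
        }
        return mean;
    }

    /**
     * Removes the given fraction of instances closest to the mean vector and
     * returns the (ascending) original indices of the instances that remain.
     */
    static int[] retainFarthest(double[][] rows, double removeFraction) {
        if (removeFraction < 0.0 || removeFraction >= 1.0) {
            throw new IllegalArgumentException("removeFraction must be in [0, 1)");
        }
        int n = rows.length;
        double[] mean = featureMeans(rows);
        final double[] dist = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int d = 0; d < mean.length; d++) {
                double diff = rows[i][d] - mean[d];
                sum += diff * diff;
            }
            dist[i] = Math.sqrt(sum);
            order[i] = i;
        }
        // Sort indices by increasing distance to the mean.
        Arrays.sort(order, (Integer a, Integer b) -> Double.compare(dist[a], dist[b]));
        int removeCount = (int) Math.round(removeFraction * n);
        int[] kept = new int[n - removeCount];
        for (int r = removeCount; r < n; r++) {
            kept[r - removeCount] = order[r];
        }
        Arrays.sort(kept);
        return kept;
    }
}
```

Only the retained indices would then be handed to the refinement stage (DBSCAN in the hybrid method), which is where the time saving comes from. A more aggressive fraction leaves less data to cluster but risks discarding anomalies that happen to lie near the mean.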
12
4. Hybrid method
- automatically generated anomalies
- the data set was modeled in Java so that the anomalies can be differentiated from the normal instances
- 3 separate runs to compare the results (F&R1, F&R2, normal)
13
5. Experimental results
5.1. Data sets used
- 24 variations of data sets, each containing over 20,000 entries
- each data set consists of one letter column followed by 16 numeric feature columns describing the letter they belong to
- for each run, the generated anomalies are also stored in separate data sets to validate the anomaly detection
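Because the generated anomalies are stored separately, detection quality can be checked by comparing the set of injected instances with the set an algorithm flags. The following Java sketch shows one way to do this; the class `DetectionValidation`, the use of integer row identifiers, and the choice of recall and precision as measures are assumptions, not details from the presentation.

```java
import java.util.HashSet;
import java.util.Set;

/** Compares detected anomalies against the separately stored injected ones (sketch). */
final class DetectionValidation {

    /** Number of injected anomalies that were also detected. */
    static int truePositives(Set<Integer> injected, Set<Integer> detected) {
        Set<Integer> hits = new HashSet<>(detected);
        hits.retainAll(injected);
        return hits.size();
    }

    /** Fraction of injected anomalies that were found. */
    static double recall(Set<Integer> injected, Set<Integer> detected) {
        if (injected.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(injected, detected) / injected.size();
    }

    /** Fraction of flagged instances that really are injected anomalies. */
    static double precision(Set<Integer> injected, Set<Integer> detected) {
        if (detected.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(injected, detected) / detected.size();
    }
}
```

For example, if rows 1, 2, 3 and 4 were injected and an algorithm flags rows 2, 3 and 9, then two of the four injected anomalies were found and two of the three flagged rows were correct.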
14
5. Experimental results
5.2. Results
15
5. Experimental results
16
5. Experimental results
5.2. Results
- for F&R1 and F&R2, the most costly execution of the filtering stage took ~10 s

Approach   Best Time (s)   Worst Time (s)
FR1        3               29
FR2        8               156
Normal     908             1070
17
6. Conclusions and Further Development
6.1. Conclusions
- both F&R approaches are more accurate than the classical approach
- the F&R approach can also be applied to clustering algorithms that do not assign all instances with unusual properties to clusters
18
6. Conclusions and Further Development
6.1. Conclusions
- large overall speed gain compared to classical methods
- saves disk space and processing resources
- the hybrid method spends most of its time processing anomalies rather than normal instances
19
6. Conclusions and Further Development
6.2. Further Development
- adapt the algorithm to different domains
- use the “filtered out” instances for training parallel neural networks
- experiment with a hybrid method combining the RF predictor and the F&R approach
20
7. Bibliography
- Xiao Yu, Lu An Tang, Jiawei Han. “Filtering and Refinement: A Two-Stage Approach for Efficient and Effective Anomaly Detection.” In Ninth IEEE International Conference on Data Mining, 2009.
- Liu F.T., Ting K.M., and Zhou Z. “Isolation forest.” In ICDM’08, 2008.
- Shi T. and Horvath S. “Unsupervised learning with random forest predictors.” In J. Computational and Graphical Statistics, 2006.
- Wenke Lee, Salvatore J. Stolfo, Philip K. Chan, Eleazar Eskin, Wei Fan, Matthew Miller, Shlomo Hershkop and Junxin Zhang. “Real Time Data Mining-based Intrusion Detection.” Conference paper of the North Carolina State University at Raleigh, Department of Computer Science, Jan 2008.
21
Thank you for your attention!
SACI, May 2011