Anomaly Detection in Data Mining. Hybrid Approach between Filtering-and-refinement and DBSCAN. Eng. Ştefan-Iulian Handra, Prof. Dr. Eng. Horia Ciocârlie.

Presentation transcript:

1 Anomaly Detection in Data Mining. Hybrid Approach between Filtering-and-refinement and DBSCAN. Eng. Ştefan-Iulian Handra, Prof. Dr. Eng. Horia Ciocârlie. SACI, May 2011

2 Contents
1. Introduction
2. Anomaly detection classical approaches
3. Filtering-and-refinement
4. Hybrid method
5. Experimental results
6. Conclusions and Further Development
7. Bibliography

3 1. Introduction
Anomaly detection: the process of finding individual objects that differ from the normal objects.
Applications: safety-critical systems, insurance, health, electronic and bank fraud detection, military surveillance of enemy activities, data mining.

4 2. Classical techniques
The Nearest Neighbor approach:
- calculates the distance between each analyzed instance in the data set and its k-th nearest neighbor
- instances in sparse regions are considered anomalies; instances in dense regions are considered normal
The Density-based Local Outliers approach:
- assigns a local outlier factor that describes the degree to which an instance is an outlier with respect to its local neighborhood
- compares the average density of the instance with the average density of its nearest neighbors
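
As an illustration of the Nearest Neighbor approach on this slide, the following Java sketch scores each instance by the distance to its k-th nearest neighbor. The class name `KthNeighborScore`, the brute-force pairwise search, and the choice of Euclidean distance are assumptions made for this example; the presentation does not specify them.

```java
import java.util.Arrays;

/** Sketch: score each point by the distance to its k-th nearest neighbor. */
final class KthNeighborScore {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns, for every point, the distance to its k-th nearest neighbor (1 <= k < n). */
    static double[] kthNeighborDistances(double[][] points, int k) {
        int n = points.length;
        if (k < 1 || k >= n) {
            throw new IllegalArgumentException("need 1 <= k < number of points");
        }
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            // Collect the distances from point i to every other point.
            double[] dists = new double[n - 1];
            int idx = 0;
            for (int j = 0; j < n; j++) {
                if (j != i) {
                    dists[idx++] = distance(points[i], points[j]);
                }
            }
            Arrays.sort(dists);
            scores[i] = dists[k - 1]; // a larger score means a sparser neighborhood
        }
        return scores;
    }

    public static void main(String[] args) {
        // Four points close together and one far away (illustrative data only).
        double[][] points = { {0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10} };
        System.out.println(Arrays.toString(kthNeighborDistances(points, 2)));
    }
}
```

Instances whose score exceeds a chosen threshold would be reported as anomalies. The threshold and the value of k are tuning choices that the slides do not give.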

5 2. Classical techniques
The DBSCAN algorithm:
- well-known clustering algorithm
- based on the concepts of density-reachability and density-connectivity
- does not assign every entry to a cluster
- weaknesses: poor scalability and slow response times
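
To make the point that DBSCAN "does not assign every entry to a cluster" concrete, here is a minimal Java sketch of the textbook algorithm. It is not the WEKA implementation used in the presentation. The class name `SimpleDbscan`, the Euclidean distance, and the linear-scan neighborhood query are assumptions for illustration. Points that are not density-reachable from any core point keep the label -1 (noise), and these are the natural anomaly candidates.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

/** Minimal DBSCAN sketch: label -1 marks noise, clusters are numbered from 0. */
final class SimpleDbscan {
    static final int UNVISITED = -2;
    static final int NOISE = -1;

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Indices of all points within eps of points[i], including i itself. */
    static List<Integer> regionQuery(double[][] points, int i, double eps) {
        List<Integer> result = new ArrayList<>();
        for (int j = 0; j < points.length; j++) {
            if (distance(points[i], points[j]) <= eps) {
                result.add(j);
            }
        }
        return result;
    }

    static int[] cluster(double[][] points, double eps, int minPts) {
        int[] labels = new int[points.length];
        Arrays.fill(labels, UNVISITED);
        int clusterId = 0;
        for (int i = 0; i < points.length; i++) {
            if (labels[i] != UNVISITED) {
                continue;
            }
            List<Integer> neighbors = regionQuery(points, i, eps);
            if (neighbors.size() < minPts) {
                labels[i] = NOISE; // may still become a border point later
                continue;
            }
            // Point i is a core point: grow a new cluster from it.
            labels[i] = clusterId;
            Deque<Integer> queue = new ArrayDeque<>(neighbors);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (labels[q] == NOISE) {
                    labels[q] = clusterId; // border point, not expanded further
                }
                if (labels[q] != UNVISITED) {
                    continue;
                }
                labels[q] = clusterId;
                List<Integer> qNeighbors = regionQuery(points, q, eps);
                if (qNeighbors.size() >= minPts) {
                    queue.addAll(qNeighbors); // q is a core point as well
                }
            }
            clusterId++;
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = {
            {0, 0}, {0, 1}, {1, 0}, {1, 1},
            {10, 10}, {10, 11}, {11, 10}, {11, 11},
            {50, 50}
        };
        System.out.println(Arrays.toString(cluster(points, 1.5, 3)));
    }
}
```

Each neighborhood query scans the whole data set, so the running time grows quadratically with the number of instances. This is consistent with the scalability weakness listed on the slide.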

6 2. Classical techniques
The Random Forest approach:
- ensemble of individual tree predictors
- each tree depends on the values of a random vector sampled independently and with the same distribution for all trees
- advantage: discovers new patterns that the Euclidean distance does not
- weaknesses: requires labeled data; calculation speed

7 3. Filtering-and-refinement
- classical methods focus on normal instances when detecting anomalies
- F&R introduces a change of paradigm: it focuses on the anomalies rather than on the normal instances
- two-stage approach

8 3. Filtering-and-refinement
Filtering stage:
- removes the majority of normal instances
Refinement stage:
- examines the remaining data with different density-based measures

9 3. Filtering-and-refinement
Advantages:
- saves most of the processing time by analyzing only the remaining data in the second step
- flexible and combinable with different density-based algorithms
Disadvantage:
- has seen little testing in practice

10 4. Hybrid method
- combination of Filtering-and-refinement and DBSCAN
- filtering stage: using the average value
- refinement stage: using DBSCAN
- Java routines for the filtering stage
- WEKA processing for the refinement stage

11 4. Hybrid method
Two separate implementations:
- F&R1: in the filtering stage we removed the largest possible percentage of normal instances (~85%)
- F&R2: in the filtering stage we removed a consistent percentage of normal instances (~65%)
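
The slides say only that the filtering stage uses "the average value" and removes roughly 85% (F&R1) or 65% (F&R2) of the instances. The Java sketch below shows one possible reading of that, and it is an assumption rather than the authors' routine: compute the column-wise mean of the numeric features, rank the instances by Euclidean distance to that mean, discard the closest fraction as presumed normal, and pass the rest on to the refinement stage. The class name `AverageFilter` and the parameter `removeFraction` are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical filtering stage: drop the rows closest to the column-wise mean. */
final class AverageFilter {

    /** Mean of every feature column. */
    static double[] columnMeans(double[][] rows) {
        double[] mean = new double[rows[0].length];
        for (double[] row : rows) {
            for (int c = 0; c < mean.length; c++) {
                mean[c] += row[c];
            }
        }
        for (int c = 0; c < mean.length; c++) {
            mean[c] /= rows.length;
        }
        return mean;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Returns the indices (ascending) of the rows kept as anomaly candidates. */
    static int[] candidates(double[][] rows, double removeFraction) {
        if (removeFraction < 0.0 || removeFraction >= 1.0) {
            throw new IllegalArgumentException("removeFraction must be in [0, 1)");
        }
        int n = rows.length;
        double[] mean = columnMeans(rows);
        double[] dist = new double[n];
        Integer[] order = new Integer[n];
        for (int i = 0; i < n; i++) {
            dist[i] = distance(rows[i], mean);
            order[i] = i;
        }
        // Sort row indices from closest to farthest from the mean.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> dist[i]));
        int removed = (int) Math.floor(removeFraction * n);
        int[] kept = new int[n - removed];
        for (int i = removed; i < n; i++) {
            kept[i - removed] = order[i];
        }
        Arrays.sort(kept);
        return kept;
    }

    public static void main(String[] args) {
        double[][] rows = {
            {1, 1}, {2, 2}, {3, 3}, {2, 1}, {1, 2}, {3, 2}, {2, 3}, {20, 20}
        };
        System.out.println(Arrays.toString(candidates(rows, 0.75)));
    }
}
```

Under this reading, F&R1 and F&R2 would differ only in the fraction passed as `removeFraction` (about 0.85 versus 0.65). The surviving indices would then be clustered with DBSCAN in the refinement stage.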

12 4. Hybrid method
- automatically generated anomalies
- we modeled the data set in Java so that anomalies can be differentiated from normal instances
- 3 separate runs to compare the results (F&R1, F&R2, normal)

13 5. Experimental results
5.1. Data sets used
- 24 variations of data sets, each containing over 20,000 entries
- each data set consists of one letter column followed by 16 numeric feature columns describing that letter
- for each run, the generated anomalies are also stored in separate data sets to validate the anomaly detection
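
Since the generated anomalies are stored separately for validation, a detection run can be checked by comparing the set of flagged rows with the set of injected rows. The Java sketch below does this with precision and recall. The slides do not say which accuracy measure the authors used, so these metrics, the class name `DetectionValidation`, and the use of row indices as identifiers are all assumptions.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: compares detected row indices with the injected anomaly indices. */
final class DetectionValidation {

    /** Number of detected rows that really are injected anomalies. */
    static int truePositives(Set<Integer> detected, Set<Integer> injected) {
        Set<Integer> hits = new HashSet<>(detected);
        hits.retainAll(injected);
        return hits.size();
    }

    /** Share of detected rows that are injected anomalies. */
    static double precision(Set<Integer> detected, Set<Integer> injected) {
        if (detected.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(detected, injected) / detected.size();
    }

    /** Share of injected anomalies that were detected. */
    static double recall(Set<Integer> detected, Set<Integer> injected) {
        if (injected.isEmpty()) {
            return 0.0;
        }
        return (double) truePositives(detected, injected) / injected.size();
    }

    public static void main(String[] args) {
        // Illustrative indices only, not data from the presentation.
        Set<Integer> detected = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> injected = new HashSet<>(Arrays.asList(2, 3, 5));
        System.out.println("precision = " + precision(detected, injected));
        System.out.println("recall    = " + recall(detected, injected));
    }
}
```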

14 5. Experimental results
5.2. Results

15 5. Experimental results

16 5. Experimental results
5.2. Results
Approach   Best Time (s)   Worst Time (s)
FR1        3               29
FR2        8               156
Normal     908             1070
- for F&R1 and F&R2, the most costly execution of the filtering stage took ~10 s

17 6. Conclusions and Further Development
6.1. Conclusions
- both F&R approaches are more accurate than the classical approach
- the F&R approach can also be applied to clustering algorithms that do not assign all instances with strange properties to clusters

18 6. Conclusions and Further Development
6.1. Conclusions
- overall enormous speed gain compared to classical methods
- saves disk space and processing resources
- the hybrid method spends the majority of its time processing anomalies rather than normal instances

19 6. Conclusions and Further Development
6.2. Further Development
- adaptation of the algorithm to different domains
- use the “filtered out” instances for training parallel neural networks
- experiment with a hybrid method combining the RF predictor and the F&R approach

20 7. Bibliography
- Xiao Yu, Lu An Tang, Jiawei Han. “Filtering and Refinement: A Two-Stage Approach for Efficient and Effective Anomaly Detection.” In Ninth IEEE International Conference on Data Mining, 2009.
- Liu F.T., Ting K.M., and Zhou Z. “Isolation forest.” In ICDM’08, 2008.
- Shi T. and Horvath S. “Unsupervised learning with random forest predictors.” In J. Computational and Graphical Statistics, 2006.
- Wenke Lee, Salvatore J. Stolfo, Philip K. Chan, Eleazar Eskin, Wei Fan, Matthew Miller, Shlomo Hershkop, and Junxin Zhang. “Real Time Data Mining-based Intrusion Detection.” Conference paper of the North Carolina State University at Raleigh Department of Computer Science, Jan 2008.

21 Thank you for your attention! SACI, May 2011

