Anomaly Detection in Data Mining: A Hybrid Approach between Filtering-and-Refinement and DBSCAN
Eng. Ştefan-Iulian Handra, Prof. Dr. Eng. Horia Ciocârlie
SACI, May 2011
Contents
1. Introduction
2. Anomaly detection: classical approaches
3. Filtering-and-refinement
4. Hybrid method
5. Experimental results
6. Conclusions and Further Development
7. Bibliography
1. Introduction
Anomaly detection: the process of finding individual objects that differ from the normal objects.
Applications: safety-critical systems, insurance, health, electronic and bank fraud detection, military surveillance of enemy activity, data mining.
2. Classical techniques
The Nearest Neighbor approach:
- calculates the distance between each analyzed instance in the data set and its k-th nearest neighbor
- sparse instances are considered anomalies; dense instances are considered normal
The Density-based Local Outliers approach:
- assigns a local outlier factor describing the degree to which an instance is an outlier with respect to its local neighborhood
- the average density of the instance is compared with the average density of its nearest neighbors
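As an illustration of the Nearest Neighbor criterion above, here is a minimal JAVA sketch (JAVA being the implementation language mentioned later in the talk). The class name `KnnAnomaly`, the brute-force `kthNeighborDistance` routine, and the toy data are illustrative assumptions, not the authors' code:

```java
import java.util.Arrays;

public class KnnAnomaly {
    // Distance from each instance to its k-th nearest neighbor;
    // large values indicate sparse regions, i.e. anomaly candidates.
    static double[] kthNeighborDistance(double[][] data, int k) {
        int n = data.length;
        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            double[] dists = new double[n - 1];
            int idx = 0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                double d = 0;
                for (int f = 0; f < data[i].length; f++) {
                    double diff = data[i][f] - data[j][f];
                    d += diff * diff;
                }
                dists[idx++] = Math.sqrt(d);
            }
            Arrays.sort(dists);
            scores[i] = dists[k - 1];   // k-th smallest distance
        }
        return scores;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {1, 0}, {1, 1}, {10, 10}};
        System.out.println(Arrays.toString(kthNeighborDistance(data, 2)));
        // the isolated instance (10,10) gets by far the largest score
    }
}
```

Instances whose score exceeds a chosen threshold are reported as anomalies; the quadratic pairwise loop is what makes the classical approach expensive on large data sets.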
The DBSCAN algorithm:
- well-known clustering algorithm
- based on the density-reachability and density-connectivity concepts
- does not assign every entry to a cluster
- weaknesses: lacks scalability and fast-response capability
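A compact, textbook-style JAVA sketch of DBSCAN is shown below (an assumption for illustration, not the WEKA implementation used later in the talk). Points that are not density-reachable from any core point are labeled NOISE, and it is exactly these unassigned entries that serve as anomaly candidates:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class SimpleDbscan {
    static final int NOISE = -1, UNVISITED = 0;

    // Returns one cluster label per point; NOISE (-1) marks instances
    // not density-reachable from any core point (anomaly candidates).
    static int[] cluster(double[][] data, double eps, int minPts) {
        int n = data.length;
        int[] label = new int[n];       // 0 = unvisited
        int clusterId = 0;
        for (int i = 0; i < n; i++) {
            if (label[i] != UNVISITED) continue;
            List<Integer> seeds = neighbors(data, i, eps);
            if (seeds.size() < minPts) { label[i] = NOISE; continue; }
            clusterId++;
            label[i] = clusterId;
            ArrayDeque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (label[q] == NOISE) label[q] = clusterId; // border point
                if (label[q] != UNVISITED) continue;
                label[q] = clusterId;
                List<Integer> qn = neighbors(data, q, eps);
                if (qn.size() >= minPts) queue.addAll(qn);   // q is a core point
            }
        }
        return label;
    }

    // Epsilon-neighborhood under Euclidean distance (includes the point itself).
    static List<Integer> neighbors(double[][] data, int p, double eps) {
        List<Integer> out = new ArrayList<>();
        for (int j = 0; j < data.length; j++) {
            double d = 0;
            for (int f = 0; f < data[p].length; f++) {
                double diff = data[p][f] - data[j][f];
                d += diff * diff;
            }
            if (Math.sqrt(d) <= eps) out.add(j);
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0, 1}, {1, 0}, {10, 10}, {10, 11}, {11, 10}, {50, 50}};
        System.out.println(java.util.Arrays.toString(cluster(data, 2.0, 3)));
        // the isolated point (50,50) is labeled -1 (noise)
    }
}
```

The linear neighborhood scan inside the expansion loop is what gives plain DBSCAN its scalability problem on large data sets, which motivates the filtering stage introduced next.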
The Random Forest approach:
- ensemble of individual tree predictors
- each tree depends on the values of a random vector sampled independently, with the same distribution for all trees
- advantage: discovers patterns that Euclidean distance does not
- weaknesses: requires labeled data; slow computation
3. Filtering-and-refinement
- classical methods focus on normal instances when detecting anomalies
- F&R introduces a change of paradigm: it focuses on the anomalies rather than the normal instances
- two-stage approach
Filtering stage:
- removes the majority of normal instances
Refinement stage:
- examines the data with different density-based measures
Advantages:
- saves most of the processing time by analyzing only the remaining data in the second stage
- flexible; can be combined with different density-based algorithms
Disadvantage: little practical validation so far
4. Hybrid method
- combination of Filtering-and-refinement and DBSCAN
- filtering stage: based on the average value
- refinement stage: based on DBSCAN
- JAVA routines for the filtering stage
- WEKA processing for the refinement stage
Two separate implementations:
- F&R1: the filtering stage removes the largest possible percentage of normal instances (~85%)
- F&R2: the filtering stage removes a substantial percentage of normal instances (~65%)
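Under the assumption that "average value" means Euclidean distance from the mean feature vector (the slides do not give the exact formula), the filtering stage might be sketched in JAVA as follows; `AverageFilter` and `keepRatio` are illustrative names, not the authors' code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AverageFilter {
    // Filtering stage sketch: keep only the instances farthest from the
    // mean ("average value") vector; these are the anomaly candidates
    // handed to the DBSCAN refinement stage. keepRatio ~0.15 would match
    // F&R1 (removes ~85%), keepRatio ~0.35 would match F&R2 (removes ~65%).
    static double[][] filter(double[][] data, double keepRatio) {
        int n = data.length, m = data[0].length;

        // Mean feature vector over the whole data set.
        double[] mean = new double[m];
        for (double[] row : data)
            for (int f = 0; f < m; f++) mean[f] += row[f] / n;

        // Score = Euclidean distance from the mean.
        double[] score = new double[n];
        for (int i = 0; i < n; i++) {
            double d = 0;
            for (int f = 0; f < m; f++) {
                double diff = data[i][f] - mean[f];
                d += diff * diff;
            }
            score[i] = Math.sqrt(d);
        }

        // Threshold chosen so that roughly keepRatio of the instances survive.
        double[] sorted = score.clone();
        Arrays.sort(sorted);
        int keep = Math.max(1, (int) Math.round(keepRatio * n));
        double threshold = sorted[n - keep];

        List<double[]> kept = new ArrayList<>();
        for (int i = 0; i < n; i++)
            if (score[i] >= threshold) kept.add(data[i]);
        return kept.toArray(new double[0][]);
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {0.1, 0}, {0, 0.1}, {-0.1, 0}, {10, 10}};
        System.out.println(filter(data, 0.2).length + " candidate(s) kept");
        // only the far-away instance (10,10) survives the filter
    }
}
```

Because this single pass is linear in the data size, the expensive DBSCAN refinement then runs on only ~15-35% of the instances, which is where the speed gain of the hybrid method comes from.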
- automatically generated anomalies
- the data set was modeled in JAVA so that anomalies can be differentiated from the normal instances
- 3 separate runs to compare the results (F&R1, F&R2, normal)
5. Experimental results
Data sets used:
- 24 variations of data sets, each containing a large number of entries
- data sets consist of one letter column followed by 16 numeric feature columns describing the letter they belong to
- for each run, the generated anomalies are also stored in separate data sets to validate the anomaly detection
Results:
- for F&R1 and F&R2 the most costly execution of the filtering stage was ~10 s

Approach   Best Time (s)   Worst Time (s)
F&R1       3               29
F&R2       8               156
Normal     -               -
6. Conclusions and Further Development
6.1. Conclusions
- both F&R approaches are more accurate than the classical approach
- the F&R approach can also be applied to other clustering algorithms that do not assign instances with strange properties to clusters
- overall, an enormous speed gain compared to classical methods
- saves disk space and processing resources
- the hybrid method spends the majority of its time processing anomalies rather than normal instances
6.2. Further Development
- adaptation of the algorithm to different domains
- use the "filtered out" instances for training parallel neural networks
- experiment with a hybrid method combining the RF predictor and the F&R approach
Bibliography
- Xiao Yu, Lu An Tang, Jiawei Han. "Filtering and Refinement: A Two-Stage Approach for Efficient and Effective Anomaly Detection." In Ninth IEEE International Conference on Data Mining.
- Liu F.T., Ting K.M., Zhou Z. "Isolation Forest." In ICDM '08.
- Shi T., Horvath S. "Unsupervised Learning with Random Forest Predictors." In J. Computational and Graphical Statistics.
- Wenke Lee, Salvatore J. Stolfo, Philip K. Chan, Eleazar Eskin, Wei Fan, Matthew Miller, Shlomo Hershkop, Junxin Zhang. "Real Time Data Mining-based Intrusion Detection." Conference paper, North Carolina State University at Raleigh, Department of Computer Science.
Thank you for your attention!
SACI, May 2011