Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Stephen D. Bay 1 and Mark Schwabacher 2 1 Institute for the Study of Learning and Expertise 2 NASA Ames Research Center

Motivation
Detecting outliers or anomalies is an important KDD task with many practical applications, and fast algorithms are needed for large databases.
In this talk, I will
– Show that very simple modifications of a basic algorithm lead to extremely good performance
– Explain why this approach works well
– Discuss limitations of this approach

Distance-Based Outliers
The main idea is to find points in low-density regions of the feature space.
[Figure: a sphere of radius d centered on a point x]
– V is the total volume within radius d
– N is the total number of examples
– k is the number of examples in the sphere
The local density can then be estimated as k / (N · V). The distance measure determines proximity and scaling.
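The k / (N · V) density estimate on this slide can be sketched in a few lines of Python. This is an illustrative reconstruction, not code from the paper; the function names are mine.

```python
import math

def sphere_volume(d, dims):
    """Volume of a ball of radius d in `dims` dimensions."""
    return (math.pi ** (dims / 2) / math.gamma(dims / 2 + 1)) * d ** dims

def density_estimate(x, data, d):
    """Estimate local density at x as k / (N * V): the count k of the
    N examples falling within distance d of x, divided by N times the
    volume V of the radius-d sphere. Low values indicate outliers."""
    dims = len(x)
    k = sum(1 for y in data if math.dist(x, y) <= d)
    N = len(data)
    return k / (N * sphere_volume(d, dims))
```

A point in a dense cluster scores higher than an isolated point under this estimate, which is exactly why the distance-based definitions on the next slide can stand in for density.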

Outlier Definitions
– Outliers are the examples for which there are fewer than p other examples within distance d (Knorr & Ng)
– Outliers are the top n examples whose distance to the kth nearest neighbor is greatest (Ramaswamy, Rastogi, & Shim)
– Outliers are the top n examples whose average distance to the k nearest neighbors is greatest (Angiulli & Pizzuti; Eskin et al.)
These definitions all relate to the density of an example's local neighborhood.
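The three definitions can be compared side by side. A minimal sketch, with function names of my own choosing (in each case, x must be an element of data, since neighbors are excluded by identity):

```python
import math

def knn_dists(x, data, k):
    """Sorted distances from x to its k nearest neighbors (excluding x itself)."""
    ds = sorted(math.dist(x, y) for y in data if y is not x)
    return ds[:k]

def is_outlier_knorr_ng(x, data, d, p):
    """Knorr & Ng: x is an outlier if fewer than p other examples lie within d."""
    return sum(1 for y in data if y is not x and math.dist(x, y) <= d) < p

def score_kth_dist(x, data, k):
    """Ramaswamy, Rastogi, & Shim: distance to the kth nearest neighbor."""
    return knn_dists(x, data, k)[-1]

def score_avg_dist(x, data, k):
    """Angiulli & Pizzuti / Eskin et al.: average distance to the k nearest neighbors."""
    ds = knn_dists(x, data, k)
    return sum(ds) / len(ds)
```

All three scores grow as the neighborhood around x gets sparser, which is the sense in which they all track density.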

Existing Methods
Nested Loops
– For each example, find its nearest neighbors with a sequential scan: O(N^2)
Index Trees
– For each example, find its nearest neighbors with an index tree: potentially O(N log N), but in practice can be worse than nested loops
Partitioning Methods
– For each example, find its nearest neighbors given that the examples are stored in bins (e.g., cells, clusters)
– Cell-based methods are potentially O(N), but in practice worse than nested loops for more than 5 dimensions (Knorr & Ng)
– Cluster-based methods appear sub-quadratic

Our Algorithm
Based on nested loops: for each example, find its nearest neighbors with a sequential scan.
Two modifications:
– Randomize the order of examples (can be done with a disk-based algorithm in linear time)
– While performing the sequential scan, keep track of the closest neighbors found so far, and prune an example once the neighbors found so far show that it cannot be a top outlier; process examples in blocks
Worst case: O(N^2) distance computations, O(N^2 / B) disk accesses
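The randomized nested loop with pruning can be sketched as follows. This is a simplified in-memory reconstruction of the idea for illustration only: the paper's actual implementation is disk-based and block-oriented, and the names here are mine. The score is the average distance to the k nearest neighbors; an example is pruned as soon as the average distance to the k closest points seen so far (an upper bound on its final score) drops below the score of the weakest current top outlier.

```python
import heapq
import math
import random

def top_outliers(data, n, k):
    """Return the top-n outliers by average k-NN distance, using a
    randomized sequential scan with pruning (simplified sketch)."""
    data = list(data)
    random.shuffle(data)          # modification 1: randomize scan order
    top = []                      # min-heap of (score, index) for the current top n
    cutoff = 0.0                  # score of the weakest current top outlier
    for i, x in enumerate(data):
        neighbors = []            # negated distances: min-heap acts as a max-heap
        pruned = False
        for y in data:            # inner sequential scan
            if y is x:
                continue
            d = math.dist(x, y)
            if len(neighbors) < k:
                heapq.heappush(neighbors, -d)
            elif d < -neighbors[0]:
                heapq.heapreplace(neighbors, -d)
            # modification 2: prune once x provably cannot be a top outlier
            if len(neighbors) == k and len(top) == n:
                if -sum(neighbors) / k < cutoff:
                    pruned = True
                    break
        if pruned:
            continue
        score = -sum(neighbors) / k
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        cutoff = top[0][0]
    return sorted(((s, data[i]) for s, i in top), reverse=True)
```

The bound is valid because the k nearest neighbors seen so far can only get closer as the scan continues, so the running average distance only decreases.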

Pruning
Outliers are scored by distance to the 3rd nearest neighbor (k = 3).
[Figure: a point x with radius d during the sequential scan]
d is the distance to the 3rd nearest neighbor for the weakest top outlier; an example can be pruned once three neighbors are found within d.

Experimental Setup
– 6 data sets, varying from 68K to 5M examples
– Mixture of discrete and continuous features (23–55 features)
– Wall time reported (CPU + I/O); time does not include randomization
– No special caching of records
– Pentium 4, 1.5 GHz, 1 GB RAM; memory footprint ~3 MB
– Mined top 30 outliers, k = 5, block size = 1000, average-distance score

Scaling with N

Scaling Summary
Slope of a regression fit relating log time to log N, reported for each data set: Corel Histogram, Covertype, KDDCup 1999, Household 1990, Person 1990, Normal 30D.
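The "slope" here is the exponent b in a power-law fit time ≈ c · N^b, obtained by regressing log time on log N; a slope near 1 means near-linear scaling. A small sketch of that fit (the function name is mine):

```python
import math

def scaling_exponent(ns, times):
    """Least-squares slope of log(time) versus log(N): the exponent b
    in time ~ c * N**b. b close to 1 indicates near-linear scaling."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(t) for t in times]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den
```

For instance, timings growing quadratically in N give b = 2, while timings growing linearly give b = 1.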

Scaling with k 1 million records used for both Person and Normal 30D

Average Case Analysis
Consider the operation of the algorithm at a moment in time:
– Outliers are defined by distance to the kth neighbor
– The current cutoff distance is d
– Randomization + sequential scan = i.i.d. sampling from the pdf
[Figure: a point x with its cutoff radius d]
Let p(x) = probability that a randomly drawn example lies within distance d of x.
How many examples do we need to look at?

For non-outliers, the number of samples examined follows a negative binomial distribution. Let P(Y = y) be the probability of obtaining the kth success (a neighbor within distance d) on step y. The expected number of samples, in the limit of infinite data, is E[Y] = k / p(x).
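The E[Y] = k / p(x) expectation is easy to check by simulation: draw Bernoulli(p) trials until the kth success and average the trial counts. A stdlib-only sketch (names are mine):

```python
import random

def samples_until_k_hits(p, k, rng):
    """Bernoulli(p) trials until the kth success; returns the trial count.
    Models the number of scanned examples before a non-outlier accumulates
    k neighbors within the cutoff distance."""
    trials, hits = 0, 0
    while hits < k:
        trials += 1
        if rng.random() < p:
            hits += 1
    return trials

def mean_samples(p, k, runs=20000, seed=0):
    """Monte Carlo estimate of E[Y]; should be close to k / p."""
    rng = random.Random(seed)
    return sum(samples_until_k_hits(p, k, rng) for _ in range(runs)) / runs
```

With p = 0.2 and k = 5, the simulated mean is close to k / p = 25, matching the negative binomial expectation: pruning cost per non-outlier is a constant that does not grow with N.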

How does the cutoff change during program execution?
[Figure: cutoff versus percent of the data set processed, for the Person data set at N = 50K, 100K, 1M, and 5M]

Scaling Rate b Versus Cutoff Ratio
[Figure: polynomial scaling exponent b versus the relative change in cutoff (50K/5K) as N increases, for Uniform 3D, Household, Covertype, Person, Corel Histogram, Normal 30D, KDDCup, and Mixed 3D]

Limitations
Failure modes:
– examples not in random order
– examples not independent
– no outliers in the data

The method fails when there are no outliers: for examples drawn from a uniform distribution in 3 dimensions, b = 1.76.

However, the method is efficient if there are at least a few outliers: for examples drawn from a 99% uniform, 1% Gaussian mixture, b = 1.11.

Future Work
– Pruning eliminates examples when they cannot be a top outlier. Can we also prune examples when they are almost certain to be an outlier?
– How many examples is enough? Do we need to do the full N^2 comparisons?
– How do algorithm settings affect performance, and do they interact with data set characteristics?
– How do we deal with dependent data points?

Summary & Conclusions
– Presented a nested-loop approach to finding distance-based outliers
– Efficient: scales to data sets with millions of examples and many features
– Easy to implement, and should be the new strawman for research in speeding up distance-based outlier detection

Resources
– Executables available from
– Comparison with GritBot on Census data
– Datasets are public and are available by request

Scaling Summary
[Figure: observed polynomial scaling between b = 1.13 and b = 1.32, shown against N and N log N reference curves]

How big a sample do we need? It depends…