Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.

Slides:



Advertisements
Similar presentations
Applications of one-class classification
Advertisements

Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Hierarchical Clustering, DBSCAN The EM Algorithm
Data Mining Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Midterm topics Chapter 2 Data Data preprocessing Measures of similarity/dissimilarity Chapter.
Data Mining Anomaly Detection
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Minqi Zhou © Tan,Steinbach, Kumar Introduction to Data Mining.
1 BUS 297D: Data Mining Professor David Mease Lecture 8 Agenda: 1) Reminder about HW #4 (due Thursday, 10/15) 2) Lecture over Chapter 10 3) Discuss final.
Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule Stephen D. Bay 1 and Mark Schwabacher 2 1 Institute for.
What is Statistical Modeling
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ What is Cluster Analysis? l Finding groups of objects such that the objects in a group will.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 12 —
MACHINE LEARNING 9. Nonparametric Methods. Introduction Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2 
© University of Minnesota Data Mining for the Discovery of Ocean Climate Indices 1 CSci 8980: Data Mining (Fall 2002) Vipin Kumar Army High Performance.
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Cluster Analysis.  What is Cluster Analysis?  Types of Data in Cluster Analysis  A Categorization of Major Clustering Methods  Partitioning Methods.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Anomaly Detection brief review of my prospectus Ziba Rostamian CS590 – Winter 2008.
Anomaly Detection. Anomaly/Outlier Detection  What are anomalies/outliers? The set of data points that are considerably different than the remainder.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
Visual Recognition Tutorial
Methods in Medical Image Analysis Statistics of Pattern Recognition: Classification and Clustering Some content provided by Milos Hauskrecht, University.
Jeff Howbert Introduction to Machine Learning Winter Anomaly Detection Some slides taken or adapted from: “Anomaly Detection: A Tutorial” Arindam.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Unsupervised Learning Reading: Chapter 8 from Introduction to Data Mining by Tan, Steinbach, and Kumar, pp , , (
COMMON EVALUATION FINAL PROJECT Vira Oleksyuk ECE 8110: Introduction to machine Learning and Pattern Recognition.
1 CSE 980: Data Mining Lecture 17: Density-based and Other Clustering Algorithms.
N. GagunashviliRAVEN Workshop Heidelberg Nikolai Gagunashvili (University of Akureyri, Iceland) Data mining methods in RAVEN network.
MACHINE LEARNING 8. Clustering. Motivation Based on E ALPAYDIN 2004 Introduction to Machine Learning © The MIT Press (V1.1) 2  Classification problem:
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by.
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly Detection © Tan,Steinbach, Kumar Introduction to Data Mining.
CLUSTER ANALYSIS Introduction to Clustering Major Clustering Methods.
Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction.
Lecture 7: Outlier Detection Introduction to Data Mining Yunming Ye Department of Computer Science Shenzhen Graduate School Harbin Institute of Technology.
1 Data Mining: Concepts and Techniques (3 rd ed.) — Chapter 12 — Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign.
COMP5331 Outlier Prepared by Raymond Wong Presented by Raymond Wong
DATA MINING WITH CLUSTERING AND CLASSIFICATION Spring 2007, SJSU Benjamin Lam.
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Lecture Notes for Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to.
Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar.
Chapter 20 Classification and Estimation Classification – Feature selection Good feature have four characteristics: –Discrimination. Features.
Data Mining Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
CSE 185 Introduction to Computer Vision Feature Matching.
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Data Mining: Cluster Analysis This lecture node is modified based on Lecture Notes for Chapter.
DATA MINING: CLUSTER ANALYSIS (3) Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
Anomaly Detection Carolina Ruiz Department of Computer Science WPI Slides based on Chapter 10 of “Introduction to Data Mining” textbook by Tan, Steinbach,
1 CSE 881: Data Mining Lecture 22: Anomaly Detection.
Clustering (2) Center-based algorithms Fuzzy k-means Density-based algorithms ( DBSCAN as an example ) Evaluation of clustering results Figures and equations.
Data Mining: Basic Cluster Analysis
IMAGE PROCESSING RECOGNITION AND CLASSIFICATION
Ch8: Nonparametric Methods
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 12 —
Data Mining Cluster Analysis: Advanced Concepts and Algorithms
Lecture Notes for Chapter 9 Introduction to Data Mining, 2nd Edition
Data Mining Anomaly Detection
Outlier Discovery/Anomaly Detection
Critical Issues with Respect to Clustering
Data Mining Anomaly/Outlier Detection
Lecture 14: Anomaly Detection
Data Mining Classification: Alternative Techniques
Parametric Methods Berlin Chen, 2005 References:
CSE572: Data Mining by H. Liu
Data Mining Anomaly Detection
Data Mining Anomaly/Outlier Detection
Data Mining Anomaly Detection
Presentation transcript:

Data Mining Anomaly Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach, Kumar Introduction to Data Mining 05/02/2006 1

11/29/2007 Introduction to Data Mining 2 Anomaly/Outlier Detection l What are anomalies/outliers? –The set of data points that are considerably different than the remainder of the data l Natural implication is that anomalies are relatively rare –One in a thousand occurs often if you have lots of data –Context is important, e.g., freezing temps in July l Can be important or a nuisance –10 foot tall 2 year old –Unusually high blood pressure

11/29/2007 Introduction to Data Mining 3 Importance of Anomaly Detection Ozone Depletion History l In 1985 three researchers (Farman, Gardinar and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels l Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? l The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded! Sources:

11/29/2007 Introduction to Data Mining 4 Causes of Anomalies l Data from different classes –Measuring the weights of oranges, but a few grapefruit are mixed in l Natural variation –Unusually tall people l Data errors –200 pound 2 year old

11/29/2007 Introduction to Data Mining 5 Distinction Between Noise and Anomalies l Noise is erroneous, perhaps random, values or contaminating objects –Weight recorded incorrectly –Grapefruit mixed in with the oranges l Noise doesn’t necessarily produce unusual values or objects l Noise is not interesting l Anomalies may be interesting if they are not a result of noise l Noise and anomalies are related but distinct concepts

11/29/2007 Introduction to Data Mining 6 General Issues: Number of Attributes l Many anomalies are defined in terms of a single attribute –Height –Shape –Color l Can be hard to find an anomaly using all attributes –Noisy or irrelevant attributes –Object is only anomalous with respect to some attributes l However, an object may not be anomalous in any one attribute

11/29/2007 Introduction to Data Mining 7 General Issues: Anomaly Scoring l Many anomaly detection techniques provide only a binary categorization –An object is an anomaly or it isn’t –This is especially true of classification-based approaches l Other approaches assign a score to all points –This score measures the degree to which an object is an anomaly –This allows objects to be ranked l In the end, you often need a binary decision –Should this credit card transaction be flagged? –Still useful to have a score l How many anomalies are there?

11/29/2007 Introduction to Data Mining 8 Other Issues for Anomaly Detection l Find all anomalies at once or one at a time –Swamping –Masking l Evaluation –How do you measure performance? –Supervised vs. unsupervised situations l Efficiency l Context –Professional basketball team

11/29/2007 Introduction to Data Mining 9 Variants of Anomaly Detection Problems l Given a data set D, find all data points x  D with anomaly scores greater than some threshold t l Given a data set D, find all data points x  D having the top-n largest anomaly scores l Given a data set D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D

11/29/2007 Introduction to Data Mining 10 Model-Based Anomaly Detection l Build a model for the data and see –Unsupervised  Anomalies are those points that don’t fit well  Anomalies are those points that distort the model  Examples: –Statistical distribution –Clusters –Regression –Geometric –Graph –Supervised  Anomalies are regarded as a rare class  Need to have training data

11/29/2007 Introduction to Data Mining 11 Additional Anomaly Detection Techniques l Proximity-based –Anomalies are points far away from other points –Can detect this graphically in some cases l Density-based –Low density points are outliers l Pattern matching –Create profiles or templates of atypical but important events or objects –Algorithms to detect these patterns are usually simple and efficient

11/29/2007 Introduction to Data Mining 12 Graphical Approaches l Boxplots or scatter plots l Limitations –Not automatic –Subjective

11/29/2007 Introduction to Data Mining 13 Convex Hull Method l Extreme points are assumed to be outliers l Use convex hull method to detect extreme values l What if the outlier occurs in the middle of the data?

11/29/2007 Introduction to Data Mining 14 Statistical Approaches Probabilistic definition of an outlier: An outlier is an object that has a low probability with respect to a probability distribution model of the data. l Usually assume a parametric model describing the distribution of the data (e.g., normal distribution) l Apply a statistical test that depends on –Data distribution –Parameters of distribution (e.g., mean, variance) –Number of expected outliers (confidence limit) l Issues –Identifying the distribution of a data set  Heavy tailed distribution –Number of attributes –Is the data a mixture of distributions?

11/29/2007 Introduction to Data Mining 15 Normal Distributions One-dimensional Gaussian Two-dimensional Gaussian

11/29/2007 Introduction to Data Mining 16 Grubbs’ Test l Detect outliers in univariate data l Assume data comes from normal distribution l Detects one outlier at a time, remove the outlier, and repeat –H 0 : There is no outlier in data –H A : There is at least one outlier l Grubbs’ test statistic: l Reject H 0 if:

11/29/2007 Introduction to Data Mining 17 Statistical-based – Likelihood Approach l Assume the data set D contains samples from a mixture of two probability distributions: –M (majority distribution) –A (anomalous distribution) l General Approach: –Initially, assume all the data points belong to M –Let L t (D) be the log likelihood of D at time t –For each point x t that belongs to M, move it to A  Let L t+1 (D) be the new log likelihood.  Compute the difference,  = L t (D) – L t+1 (D)  If  > c (some threshold), then x t is declared as an anomaly and moved permanently from M to A

11/29/2007 Introduction to Data Mining 18 Statistical-based – Likelihood Approach l Data distribution, D = (1 – ) M + A l M is a probability distribution estimated from data –Can be based on any modeling method (naïve Bayes, maximum entropy, etc) l A is initially assumed to be uniform distribution l Likelihood at time t:

11/29/2007 Introduction to Data Mining 19 Strengths/Weaknesses of Statistical Approaches l Firm mathematical foundation l Can be very efficient l Good results if distribution is known l In many cases, data distribution may not be known l For high dimensional data, it may be difficult to estimate the true distribution l Anomalies can distort the parameters of the distribution

11/29/2007 Introduction to Data Mining 20 Distance-Based Approaches l Several different techniques l An object is an outlier if a specified fraction of the objects is more than a specified distance away (Knorr, Ng 1998) –Some statistical definitions are special cases of this l The outlier score of an object is the distance to its kth nearest neighbor

11/29/2007 Introduction to Data Mining 21 One Nearest Neighbor - One Outlier Outlier Score

11/29/2007 Introduction to Data Mining 22 One Nearest Neighbor - Two Outliers Outlier Score

11/29/2007 Introduction to Data Mining 23 Five Nearest Neighbors - Small Cluster Outlier Score

11/29/2007 Introduction to Data Mining 24 Five Nearest Neighbors - Differing Density Outlier Score

11/29/2007 Introduction to Data Mining 25 Strengths/Weaknesses of Distance-Based Approaches l Simple l Expensive – O(n 2 ) l Sensitive to parameters l Sensitive to variations in density l Distance becomes less meaningful in high- dimensional space

11/29/2007 Introduction to Data Mining 26 Density-Based Approaches l Density-based Outlier: The outlier score of an object is the inverse of the density around the object. –Can be defined in terms of the k nearest neighbors –One definition: Inverse of distance to kth neighbor –Another definition: Inverse of the average distance to k neighbors –DBSCAN definition l If there are regions of different density, this approach can have problems

11/29/2007 Introduction to Data Mining 27 Relative Density l Consider the density of a point relative to that of its k nearest neighbors

11/29/2007 Introduction to Data Mining 28 Relative Density Outlier Scores Outlier Score

11/29/2007 Introduction to Data Mining 29 Density-based: LOF approach l For each point, compute the density of its local neighborhood l Compute local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p and the density of its nearest neighbors l Outliers are points with largest LOF value p 2  p 1  In the NN approach, p 2 is not considered as outlier, while LOF approach find both p 1 and p 2 as outliers

11/29/2007 Introduction to Data Mining 30 Strengths/Weaknesses of Density-Based Approaches l Simple l Expensive – O(n 2 ) l Sensitive to parameters l Density becomes less meaningful in high- dimensional space

11/29/2007 Introduction to Data Mining 31 Clustering-Based Approaches l Density-based Outlier: An object is a cluster-based outlier if it does not strongly belong to any cluster –For prototype-based clusters, an object is an outlier if it is not close enough to a cluster center –For density-based clusters, an object is an outlier if its density is too low –For graph-based clusters, an object is an outlier if it is not well connected l Other issues include the impact of outliers on the clusters and the number of clusters

11/29/2007 Introduction to Data Mining 32 Distance of Points from Closest Centroids Outlier Score

11/29/2007 Introduction to Data Mining 33 Relative Distance of Points from Closest Centroid Outlier Score

11/29/2007 Introduction to Data Mining 34 Strengths/Weaknesses of Distance-Based Approaches l Simple l Many clustering techniques can be used l Can be difficult to decide on a clustering technique l Can be difficult to decide on number of clusters l Outliers can distort the clusters