Data Mining: Anomaly Detection
Lecture Notes for Chapter 10, Introduction to Data Mining
by Tan, Steinbach, Kumar
© Tan, Steinbach, Kumar, Introduction to Data Mining, 4/18/2004

Anomaly/Outlier Detection
• What are anomalies/outliers?
  – The set of data points that are considerably different from the remainder of the data
• Variants of anomaly/outlier detection problems
  – Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  – Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  – Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
• Applications
  – Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
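The first two problem variants differ only in how scores are turned into a result set. The sketch below illustrates both, assuming the anomaly scores f(x) have already been computed by some method; the function names are hypothetical and chosen only for this example.

```python
def above_threshold(scores, t):
    """Indices of points whose anomaly score is greater than the threshold t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Indices of the n points with the largest anomaly scores, highest first."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Made-up scores for five data points.
scores = [0.1, 0.9, 0.3, 2.5, 0.2]
print(above_threshold(scores, 0.5))  # [1, 3]
print(top_n(scores, 2))              # [3, 1]
```

The threshold variant returns a variable number of points, while the top-n variant fixes the result size regardless of how extreme the scores are.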

Importance of Anomaly Detection
Ozone Depletion History
• In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
• The ozone concentrations recorded by the satellite were so low that a computer program was treating them as noise and discarding them!
Sources:
http://exploringdata.cqu.edu.au/ozone.html
http://www.epa.gov/ozone/science/hole/size.html

Anomaly Detection
• Challenges
  – How many outliers are there in the data?
  – Method is unsupervised
    ▪ Validation can be quite challenging (just like for clustering)
  – Finding a needle in a haystack
• Working assumption
  – There are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data

Anomaly Detection Schemes
• General steps
  – Build a profile of the "normal" behavior
    ▪ Profile can be patterns or summary statistics for the overall population
  – Use the "normal" profile to detect anomalies
    ▪ Anomalies are observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
  – Graphical & statistical-based
  – Distance-based
  – Model-based

Convex Hull Method
• Extreme points are assumed to be outliers
• Use convex hull method to detect extreme values
• What if the outlier occurs in the middle of the data?
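As a rough illustration of the convex hull idea, the sketch below computes the hull of a 2-D point set with Andrew's monotone chain algorithm and treats the hull vertices as candidate extreme values. The choice of algorithm and the sample points are assumptions made for this example; the slides do not prescribe a particular hull algorithm.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other.
    return lower[:-1] + upper[:-1]


points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3)]
print(convex_hull(points))  # [(0, 0), (4, 0), (4, 4), (0, 4)]
```

The interior points (2, 2) and (1, 3) can never be flagged this way, which is the limitation the last bullet raises.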

Statistical Approaches
• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Anomaly: objects that do not fit the model well
• Apply a statistical test that depends on
  – Data distribution
  – Parameters of the distribution (e.g., mean, variance)
  – Number of expected outliers (confidence limit)

Grubbs' Test
• Detects outliers in univariate data
• Assumes the data come from a normal distribution
• Detects one outlier at a time: remove the outlier and repeat
  – H_0: There is no outlier in the data
  – H_A: There is at least one outlier

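Grubbs' Test (cont'd)
• Grubbs' test statistic:
  $$G = \frac{\max_i \left| X_i - \bar{X} \right|}{s}$$
  where $\bar{X}$ is the sample mean and $s$ is the sample standard deviation
• Reject H_0 if:
  $$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$$
  where $t_{\alpha/(2N),\,N-2}$ is the critical value of the t-distribution with (N − 2) degrees of freedom and a significance level of α/(2N) (for the two-sided test; a one-sided test uses α/N)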
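A minimal sketch of one round of Grubbs' test follows, using only the Python standard library. Because the standard library has no t-distribution quantile function, the critical value of the t-distribution is passed in as `t_crit`; the value 4.0 used below is a placeholder for illustration, not a looked-up table value, and the function names and sample data are likewise assumptions.

```python
import math
import statistics


def grubbs_statistic(xs):
    """Return (G, index): G = max_i |x_i - mean| / s and the index attaining it."""
    mean = statistics.mean(xs)
    s = statistics.stdev(xs)  # sample standard deviation (N - 1 denominator)
    idx = max(range(len(xs)), key=lambda i: abs(xs[i] - mean))
    return abs(xs[idx] - mean) / s, idx


def grubbs_threshold(n, t_crit):
    """Rejection threshold for G, given the t critical value with n - 2 d.o.f."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))


data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 14.5]
g, idx = grubbs_statistic(data)
# Placeholder critical value: replace with the tabulated t value for your N and alpha.
threshold = grubbs_threshold(len(data), t_crit=4.0)
if g > threshold:
    print("most extreme value", data[idx], "is flagged as an outlier")
```

To follow the "one outlier at a time" procedure, the flagged value would be removed and the test repeated on the remaining data with a critical value for the new N.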

Statistical-based: Likelihood Approach
• Assume the data set D contains samples from a mixture of two probability distributions:
  – M (majority distribution)
  – A (anomalous distribution)
• General approach:
  – Initially, assume all the data points belong to M
  – Let LL_t(D) be the log likelihood of D at time t
  – For each point x_t that belongs to M, move it to A
    ▪ Let LL_{t+1}(D) be the new log likelihood
    ▪ Compute the change Δ = LL_{t+1}(D) – LL_t(D), i.e., the gain in log likelihood from the move (a point that M explains poorly raises the likelihood when moved to A)
    ▪ If Δ > c (some threshold), then x_t is declared an anomaly and moved permanently from M to A; otherwise it stays in M

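Statistical-based: Likelihood Approach (cont'd)
• Data distribution: D = (1 – λ) M + λ A
• M is a probability distribution estimated from data
  – Can be based on any modeling method (naïve Bayes, maximum entropy, etc.)
• A is initially assumed to be a uniform distribution
• Likelihood L_t(D) and log likelihood LL_t(D) at time t:
  $$L_t(D) = \prod_{i=1}^{N} P_D(x_i) = \left( (1-\lambda)^{|M_t|} \prod_{x_i \in M_t} P_{M_t}(x_i) \right) \left( \lambda^{|A_t|} \prod_{x_i \in A_t} P_{A_t}(x_i) \right)$$
  $$LL_t(D) = |M_t| \log(1-\lambda) + \sum_{x_i \in M_t} \log P_{M_t}(x_i) + |A_t| \log \lambda + \sum_{x_i \in A_t} \log P_{A_t}(x_i)$$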
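The likelihood approach can be sketched for univariate data as below. Several choices here are assumptions made only for illustration: M is modeled as a Gaussian re-estimated from the points currently in M, A is a uniform distribution over the observed data range, and a point is moved when the gain in log likelihood (new minus old) exceeds c. The function names, λ = 0.05 and c = 1.0 are hypothetical.

```python
import math
import statistics


def log_likelihood(m_points, n_anomalous, lam, log_p_a):
    """|M| log(1 - lam) + sum log P_M(x) + |A| log(lam) + sum log P_A(x)."""
    mu = statistics.mean(m_points)
    var = statistics.pvariance(m_points, mu)  # assumes M is not constant
    ll_m = sum(-0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
               for x in m_points)
    # Under a uniform A, every anomalous point has the same log density.
    return (len(m_points) * math.log(1 - lam) + ll_m
            + n_anomalous * (math.log(lam) + log_p_a))


def likelihood_anomalies(data, lam=0.05, c=1.0):
    """Return the points moved from M to A."""
    log_p_a = -math.log(max(data) - min(data))  # uniform over the data range
    m, a = list(data), []
    for x in data:
        ll_before = log_likelihood(m, len(a), lam, log_p_a)
        m_trial = list(m)
        m_trial.remove(x)  # tentatively move x from M to A
        ll_after = log_likelihood(m_trial, len(a) + 1, lam, log_p_a)
        if ll_after - ll_before > c:  # large gain: keep x in A permanently
            m, a = m_trial, a + [x]
    return a


data = [10.0, 10.2, 9.9, 10.1, 9.8, 10.0, 10.3, 9.7, 25.0]
print(likelihood_anomalies(data))  # [25.0]
```

Removing 25.0 lets the Gaussian for M tighten around 10, so the log likelihood rises sharply; removing any of the other points changes the fit very little and costs the log λ penalty, so they stay in M.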

Limitations of Statistical Approaches
• Most of the tests are for a single attribute
• In many cases, the data distribution may not be known
• For high-dimensional data, it may be difficult to estimate the true distribution

Distance-based Approaches
• Data is represented as a vector of features
• Three major approaches
  – Nearest-neighbor based
  – Density based
  – Clustering based

Nearest-Neighbor Based Approach
• Approach:
  – Compute the distance between every pair of data points
  – There are various ways to define outliers:
    ▪ Data points for which there are fewer than p neighboring points within a distance D
    ▪ The top n data points whose distance to the kth nearest neighbor is greatest
    ▪ The top n data points whose average distance to the k nearest neighbors is greatest
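Two of the outlier definitions above can be sketched with brute-force pairwise distances. The example assumes Euclidean distance and small in-memory data; the function names and sample points are hypothetical, and a real data set would normally use a spatial index rather than this quadratic scan.

```python
import math


def kth_nn_distances(points, k):
    """For each point, the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_n_outliers(points, k, n):
    """Indices of the n points with the greatest k-th nearest neighbor distance."""
    scores = kth_nn_distances(points, k)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]


def sparse_neighborhood(points, p, d):
    """Indices of points with fewer than p other points within distance d."""
    result = []
    for i, a in enumerate(points):
        count = sum(1 for j, b in enumerate(points)
                    if j != i and math.dist(a, b) <= d)
        if count < p:
            result.append(i)
    return result


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(top_n_outliers(points, k=2, n=1))        # [4]
print(sparse_neighborhood(points, p=2, d=1.5))  # [4]
```

The average-distance variant would replace `dists[k - 1]` with the mean of `dists[:k]`.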

Outliers in Lower Dimensional Projection
• In high-dimensional space, data is sparse and the notion of proximity becomes meaningless
  – Every point is an almost equally good outlier from the perspective of proximity-based definitions
• Lower-dimensional projection methods
  – A point is an outlier if, in some lower-dimensional projection, it is present in a local region of abnormally low density

Outliers in Lower Dimensional Projection (cont'd)
• Divide each attribute into φ equal-depth intervals
  – Each interval contains a fraction f = 1/φ of the records
• Consider a k-dimensional cube created by picking grid ranges from k different dimensions
  – If attributes are independent, we expect the region to contain a fraction f^k of the records
  – If there are N points, we can measure the sparsity of a cube D as:
    $$S(D) = \frac{n(D) - N f^k}{\sqrt{N f^k \left(1 - f^k\right)}}$$
    where n(D) is the number of points in the cube
  – Negative sparsity indicates that the cube contains fewer points than expected
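The sparsity coefficient compares the observed count in a cube with the count expected under independence, scaled by the standard deviation of that count. A small worked sketch follows; the function name and the numbers in the example are made up for illustration.

```python
import math


def sparsity(n_in_cube, n_total, phi, k):
    """(n(D) - N f^k) / sqrt(N f^k (1 - f^k)) with f = 1 / phi."""
    fk = (1.0 / phi) ** k
    expected = n_total * fk
    return (n_in_cube - expected) / math.sqrt(n_total * fk * (1 - fk))


# N = 1000 records, phi = 10 intervals per attribute, a 2-dimensional cube:
# the expected count is 1000 * 0.01 = 10, but only 1 record falls in the cube.
print(round(sparsity(1, 1000, 10, 2), 2))  # -2.86
```

A cube holding exactly the expected 10 records would score 0, and strongly negative values mark the abnormally sparse regions in which outliers are sought.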

Density-based: LOF approach
• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the ratio of the average density of its nearest neighbors to the density of p
• Outliers are the points with the largest LOF values
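The sketch below is a simplified relative-density score in the spirit of LOF, not the full LOF definition, which uses reachability distances. It assumes density is the inverse of the average distance to the k nearest neighbors, and scores each point by the average density of its neighbors divided by its own density, so that values well above 1 indicate outliers. All names are hypothetical, and the points are assumed to be distinct (duplicates would give a zero distance).

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbors of points[i] (Euclidean distance)."""
    others = [j for j in range(len(points)) if j != i]
    return sorted(others, key=lambda j: math.dist(points[i], points[j]))[:k]


def density(points, i, k):
    """Inverse of the average distance from points[i] to its k nearest neighbors."""
    total = sum(math.dist(points[i], points[j]) for j in knn(points, i, k))
    return k / total


def relative_density_scores(points, k):
    """Average neighbor density divided by the point's own density."""
    dens = [density(points, i, k) for i in range(len(points))]
    scores = []
    for i in range(len(points)):
        neighbor_avg = sum(dens[j] for j in knn(points, i, k)) / k
        scores.append(neighbor_avg / dens[i])
    return scores


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = relative_density_scores(points, k=2)
print(scores.index(max(scores)))  # 4
```

Points inside the tight group score about 1 because their neighbors are as dense as they are, while the isolated point scores several times higher.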

Density-based: LOF approach (cont'd)
• Example (figure with labeled points p1 and p2 not reproduced):
  – In the k-NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers

Clustering-Based
• Basic idea:
  – Cluster the data into dense groups
  – Choose points in small clusters as candidate outliers
  – Compute the distance between candidate points and non-candidate clusters
    ▪ If candidate points are far from all other non-candidate points, they are outliers

Clustering-Based: Use Objective Function
• Use the objective function to assess how well an object belongs to a cluster
• If the elimination of an object results in a substantial improvement in the objective function (for example, SSE), the object is classified as an outlier
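The objective-function idea can be illustrated on a single cluster by measuring how much the SSE drops when each object is removed. This is a minimal sketch under simplifying assumptions: one cluster, the centroid recomputed after each removal, and hypothetical function names. What counts as a "substantial" improvement is left to the caller.

```python
def sse(points):
    """Sum of squared Euclidean distances from each point to the centroid."""
    centroid = [sum(coords) / len(points) for coords in zip(*points)]
    return sum(sum((x - m) ** 2 for x, m in zip(p, centroid)) for p in points)


def sse_improvements(cluster):
    """For each object, the reduction in SSE obtained by eliminating it."""
    base = sse(cluster)
    return [base - sse(cluster[:i] + cluster[i + 1:])
            for i in range(len(cluster))]


cluster = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
gains = sse_improvements(cluster)
print(gains.index(max(gains)))  # 4
```

Here removing (5, 5) lowers the SSE from 34.4 to 2.0, far more than removing any of the other four points.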

Clustering-Based: Strengths and Weaknesses
• Clusters and outliers are complementary, so this approach can find valid clusters and outliers at the same time
• The outliers and their scores depend heavily on the clustering parameters, e.g., the number of clusters, density, etc.

Summary
• Understanding and calculation
  – K-means
  – Hierarchical clustering
  – Their advantages and disadvantages
• Understanding
  – Density-based clustering
  – Clustering evaluation
  – Anomaly detection
