Lecture 7: Outlier Detection
Introduction to Data Mining
Yunming Ye
Department of Computer Science, Shenzhen Graduate School, Harbin Institute of Technology
Anomaly/Outlier Detection
• What are anomalies/outliers?
  – Data points that differ considerably from the remainder of the data
• Variants of anomaly/outlier detection problems
  – Given a database D, find all data points x ∈ D with anomaly scores greater than some threshold t
  – Given a database D, find the data points x ∈ D with the top-n largest anomaly scores f(x)
  – Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
• Applications
  – Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
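The first two problem variants differ only in how the anomaly scores are turned into a decision. A minimal sketch, assuming the scores f(x) have already been computed by some detector; the function names and the score values are hypothetical:

```python
def above_threshold(scores, t):
    """Variant 1: indices of the points whose anomaly score exceeds threshold t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


# Hypothetical anomaly scores f(x) for five data points.
scores = [0.1, 0.9, 0.2, 0.7, 0.05]
print(above_threshold(scores, 0.5))
print(top_n(scores, 2))
```

The threshold variant needs a meaningful score scale, while the top-n variant only needs a ranking.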
Anomaly Detection
• Challenges
  – How many outliers are there in the data?
  – The method is unsupervised, so validation can be quite challenging (just as for clustering)
  – Finding a needle in a haystack
• Working assumption
  – There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data
Anomaly Detection Schemes
• General steps
  – Build a profile of the “normal” behavior; the profile can be patterns or summary statistics for the overall population
  – Use the “normal” profile to detect anomalies: observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
  – Graphical and statistical-based
  – Distance-based
  – Model-based
Graphical Approaches
• Box plot (1-D), scatter plot (2-D), spin plot (3-D)
• Limitations
  – Time-consuming
  – Subjective
Convex Hull Method
• Extreme points are assumed to be outliers
• Use the convex hull method to detect extreme values
• What if the outlier occurs in the middle of the data?
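To make the limitation concrete, here is a small 2-D sketch that flags the convex hull vertices as candidate outliers. It uses Andrew's monotone chain algorithm, which is one common way to compute a hull and is an assumption here, not necessarily the method the slide has in mind. The toy points are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o); > 0 means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Vertices of the 2-D convex hull in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:                      # build the lower chain left to right
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):            # build the upper chain right to left
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]     # drop the duplicated chain endpoints


# Toy data: four corner points plus three interior points.
points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3), (3, 1)]
candidates = convex_hull(points)
print(candidates)
```

Only the four corners are reported. A point such as (2, 2) can never be flagged, however unusual it is, because it lies inside the hull. This is exactly the case raised by the last bullet.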
Statistical Approaches
• Assume a parametric model describing the distribution of the data (e.g., normal distribution)
• Apply a statistical test that depends on
  – Data distribution
  – Parameters of the distribution (e.g., mean, variance)
  – Number of expected outliers (confidence limit)
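The slide does not name a specific test. As one illustration, the sketch below assumes a single attribute that is roughly normally distributed and applies a simple z-score rule. The cutoff `z_max` and the sample values are assumptions chosen for the example.

```python
import statistics


def zscore_outliers(values, z_max=2.5):
    """Values lying more than z_max sample standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    if sigma == 0:
        return []                      # all values identical: nothing stands out
    return [x for x in values if abs(x - mu) / sigma > z_max]


# Hypothetical measurements of one attribute; 25.0 is far from the rest.
values = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 9.9, 25.0]
print(zscore_outliers(values))
```

Note that the outlier itself inflates the estimated mean and standard deviation. In a sample of n points no z-score can exceed (n − 1)/√n, so a fixed cutoff such as 3 may never fire on small samples.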
Limitations of Statistical Approaches
• Most of the tests are for a single attribute
• In many cases, the data distribution may not be known
• For high-dimensional data, it may be difficult to estimate the true distribution
Distance-based Approaches
• Data is represented as a vector of features
• Three major approaches
  – Nearest-neighbor based
  – Density based
  – Clustering based
Nearest-Neighbor Based Approach
• Approach
  – Compute the distance between every pair of data points
  – There are various ways to define outliers:
      · Data points that have fewer than p neighboring points within a distance D
      · The top n data points whose distance to the k-th nearest neighbor is greatest
      · The top n data points whose average distance to the k nearest neighbors is greatest
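The three definitions can be sketched directly from the pairwise distances. This is a brute-force illustration assuming Euclidean distance. The parameter names `p_min` and `radius` stand for the slide's p and D, and the toy points are hypothetical.

```python
import math


def sorted_neighbor_distances(points):
    """For each point, the sorted distances to all other points (all O(n^2) pairs)."""
    return [
        sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]


def few_neighbors(points, p_min, radius):
    """Definition 1: indices of points with fewer than p_min neighbors within radius."""
    return [
        i for i, dists in enumerate(sorted_neighbor_distances(points))
        if sum(d <= radius for d in dists) < p_min
    ]


def kth_nn_scores(points, k):
    """Definition 2 score: distance to the k-th nearest neighbor."""
    return [dists[k - 1] for dists in sorted_neighbor_distances(points)]


def avg_knn_scores(points, k):
    """Definition 3 score: average distance to the k nearest neighbors."""
    return [sum(dists[:k]) / k for dists in sorted_neighbor_distances(points)]


def top_n(scores, n):
    """Indices of the n largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


# Toy data: a tight group of four points and one isolated point (index 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(few_neighbors(points, p_min=2, radius=1.5))
print(top_n(kth_nn_scores(points, k=2), n=1))
print(top_n(avg_knn_scores(points, k=2), n=1))
```

On this data all three definitions single out the isolated point. They can disagree when clusters have different densities.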
Outliers in Lower-Dimensional Projection
• In high-dimensional space, data is sparse and the notion of proximity becomes meaningless
  – Every point is an almost equally good outlier from the perspective of proximity-based definitions
• Lower-dimensional projection methods
  – A point is an outlier if, in some lower-dimensional projection, it lies in a local region of abnormally low density
Outliers in Lower-Dimensional Projection
• Divide each attribute into φ equal-depth intervals
  – Each interval contains a fraction f = 1/φ of the records
• Consider a k-dimensional cube created by picking grid ranges from k different dimensions
  – If the attributes are independent, we expect the region to contain a fraction $f^k$ of the records
  – If there are N points, we can measure the sparsity of a cube D as
      $S(D) = \dfrac{n(D) - N f^k}{\sqrt{N f^k (1 - f^k)}}$
    where n(D) is the number of points that fall in the cube
  – Negative sparsity indicates that the cube contains fewer points than expected
Example
• N = 100, φ = 5, f = 1/5 = 0.2, $N f^2 = 4$: under independence, each 2-dimensional cube is expected to contain 4 points
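The example can be checked numerically. The sketch below assumes the standardized sparsity measure S(D) = (n(D) − N·f^k) / sqrt(N·f^k·(1 − f^k)), where n(D) is the observed number of points in the cube. The function name is hypothetical.

```python
import math


def sparsity(n_in_cube, n_points, f, k):
    """Standardized deviation of a cube's point count from its expected count."""
    p = f ** k                         # expected fraction of records in the cube
    expected = n_points * p            # expected number of points
    return (n_in_cube - expected) / math.sqrt(n_points * p * (1 - p))


# Values from the example: N = 100, f = 0.2, k = 2, so 4 points are expected.
for observed in (0, 4, 10):
    print(observed, round(sparsity(observed, 100, 0.2, 2), 3))
```

A cube holding exactly 4 points scores about 0, an empty cube scores negative (sparser than expected), and a crowded cube scores positive.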
Density-based: LOF Approach
• For each point, compute the density of its local neighborhood
• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of each of p's nearest neighbors to the density of p
• Outliers are the points with the largest LOF values
[Figure: example data set with marked points p1 and p2]
• In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers
Local Outlier Factor (LOF)*
• For each data point p, compute the distance to its k-th nearest neighbor (k-distance)
• Compute the reachability distance (reach-dist) of each data example p with respect to data example o as
    $\text{reach-dist}(p, o) = \max\{\text{k-distance}(o),\ d(p, o)\}$
• Compute the local reachability density (lrd) of data example p as the inverse of the average reachability distance over the MinPts nearest neighbors of p (here k = MinPts):
    $\mathrm{lrd}(p) = \dfrac{\mathrm{MinPts}}{\sum_{o \in N_{\mathrm{MinPts}}(p)} \text{reach-dist}(p, o)}$
• Compute LOF(p) as the ratio of the average local reachability density of p's k nearest neighbors to the local reachability density of p:
    $\mathrm{LOF}(p) = \dfrac{1}{\mathrm{MinPts}} \sum_{o \in N_{\mathrm{MinPts}}(p)} \dfrac{\mathrm{lrd}(o)}{\mathrm{lrd}(p)}$
* Breunig et al., LOF: Identifying Density-Based Local Outliers, SIGMOD 2000.
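The LOF steps can be followed in a few lines of brute-force code. This is a minimal sketch, assuming Euclidean distance, k = MinPts, and exactly k neighbors per point; points tied at the k-th distance are not added to the neighborhood, which is a simplification. It also assumes no point has k or more exact duplicates, since that would make the reachability sum zero.

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, using its k nearest neighbors."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Indices of the k nearest neighbors of each point (excluding the point itself).
    neighbors = [
        sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])[:k]
        for i in range(n)
    ]
    # k-distance: distance to the k-th nearest neighbor.
    k_distance = [dist[i][neighbors[i][-1]] for i in range(n)]

    def reach_dist(p, o):
        return max(k_distance[o], dist[p][o])

    # lrd(p): inverse of the average reachability distance to p's neighbors.
    lrd = [k / sum(reach_dist(i, o) for o in neighbors[i]) for i in range(n)]

    # LOF(p): average ratio lrd(o) / lrd(p) over p's neighbors o.
    return [sum(lrd[o] for o in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Toy data: a tight group of four points and one distant point (index 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
for p, score in zip(points, lof_scores(points, k=2)):
    print(p, round(score, 2))
```

Points inside a uniformly dense group get a LOF close to 1. A LOF well above 1 means the point is much less dense than its neighbors.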
Advantages of Density-based Techniques
• Local Outlier Factor (LOF) approach
  – Example:
    [Figure: example data set with marked points p1, p2, and p3; annotations show the distance from p3 to its nearest neighbor and the distance from p2 to its nearest neighbor]
  – In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers
  – The NN approach may consider p3 an outlier, but the LOF approach does not
Clustering-Based Techniques
• Basic idea
  – Cluster the data into groups of different density
  – Choose the points in small clusters as candidate outliers
  – Compute the distance between the candidate points and the non-candidate clusters; candidate points that are far from all non-candidate points are outliers
Clustering-Based Techniques
• Advantages
  – No supervision is needed
  – Easily adaptable to on-line / incremental mode, which suits anomaly detection from temporal data
• Drawbacks
  – Computationally expensive; using indexing structures (k-d tree, R* tree) may alleviate this problem
  – If normal points do not form any clusters, the techniques may fail
  – In high-dimensional spaces, data is sparse and the distances between any two data records may become quite similar, so clustering algorithms may not give any meaningful clusters
Simple Application of Clustering*
• A radius of proximity ω is specified
• Two points x1 and x2 are “near” if d(x1, x2) ≤ ω
• Define N(x) as the number of points that are within ω of x
• Computing N(x) exactly for every point takes O(n²) time, so the algorithm uses an approximation
• Fixed-width clustering is applied first
  – The first point is the center of a cluster
  – Each subsequent point is added to a cluster if it is “near” that cluster's center; otherwise it starts a new cluster
  – Approximate N(x) with N(c), the count for the cluster c that contains x
  – Time complexity: O(cn), where c here denotes the number of clusters
• Points in small clusters are treated as anomalies
* E. Eskin et al., A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data, 2002
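A single-pass version of fixed-width clustering might look as follows. This is a sketch under stated assumptions: Euclidean distance, each point is compared with its closest existing center, and a hypothetical `min_size` parameter decides what counts as a "small" cluster (the slide does not give a cutoff).

```python
import math


def fixed_width_clusters(points, width):
    """One pass over the data: join the closest center within `width`, else start a cluster."""
    centers, members = [], []
    for p in points:
        best = min(range(len(centers)),
                   key=lambda c: math.dist(p, centers[c]),
                   default=None)
        if best is not None and math.dist(p, centers[best]) <= width:
            members[best].append(p)
        else:
            centers.append(p)          # p becomes the center of a new cluster
            members.append([p])
    return centers, members


def small_cluster_points(points, width, min_size):
    """Points belonging to clusters with fewer than min_size members."""
    _, members = fixed_width_clusters(points, width)
    return [p for cluster in members if len(cluster) < min_size for p in cluster]


# Toy data: four nearby points and one isolated point.
points = [(0, 0), (0.5, 0), (0, 0.5), (0.4, 0.4), (9, 9)]
print(small_cluster_points(points, width=1.0, min_size=2))
```

Each point is compared only with the current centers, which gives the O(cn) cost mentioned above. The result depends on the order in which points arrive.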
Frequent Pattern Based Outlier Detection - Research Motivation
• First motivation
  – Existing methods are not suitable for high-dimensional space, where the concept of proximity may not be qualitatively meaningful (the curse of dimensionality)
  – They fail to find outliers in subsets of the dimensions
• Second motivation
  – Most studies on outlier detection focus only on identifying outliers; in real applications, the reasons why the identified outliers are abnormal also need to be given
FPOutlier - Proposed Method
• Definition of item-sets and support
  – Let I = {i1, i2, …, im} be a set of m literals called items, and let the database D = {t1, t2, …, tn} be a set of n transactions, each consisting of a set of items from I
  – An item-set X is a non-empty subset of I. The length of X is the number of items in X; an item-set of length k is called a k-itemset
  – A transaction t ∈ D is said to contain item-set X if X ⊆ t
  – The support of item-set X is the fraction of transactions in D that contain X, i.e., support(X) = ||{t ∈ D | X ⊆ t}|| / ||{t ∈ D}||
FPOutlier - Proposed Method
• Definition of frequent item-sets
  – Given a user-defined threshold mini-support, find all item-sets with support greater than or equal to mini-support. Frequent item-sets are also called frequent patterns
  – The set of all frequent patterns is denoted by FPS(D, mini-support), i.e., FPS(D, mini-support) = {X ⊆ I | support(X) ≥ mini-support}
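The definitions of support and FPS(D, mini-support) translate almost literally into code. The sketch below enumerates candidate item-sets by brute force, which is only practical for tiny examples; a real system would use a dedicated miner such as Apriori or FP-growth. The toy database and function names are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def frequent_patterns(transactions, minisupport):
    """All item-sets X with support(X) >= minisupport, mapped to their support."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        found = False
        for combo in combinations(items, k):
            x = frozenset(combo)
            s = support(x, transactions)
            if s >= minisupport:
                result[x] = s
                found = True
        if not found:
            break  # no frequent k-itemset implies no frequent (k+1)-itemset
    return result


# Toy database of five transactions.
D = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("abc"), frozenset("d")]
for x, s in frequent_patterns(D, 0.6).items():
    print(sorted(x), s)
```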
FPOutlier - Proposed Method
• Explanation
  – From the viewpoint of knowledge discovery, frequent patterns reflect the “common patterns” that apply to many objects, or to a large percentage of the objects, in the dataset. In contrast, outlier detection focuses on a very small percentage of the data objects
FPOutlier - Definition 1
• (FPOF, Frequent Pattern Outlier Factor) Let D = {t1, t2, …, tn} be a database containing a set of n transactions with items I. Given a threshold mini-support, FPS(D, mini-support) is the set of all frequent patterns. For each transaction t, the Frequent Pattern Outlier Factor of t is defined as
    $\mathrm{FPOF}(t) = \dfrac{\sum_{X \subseteq t,\ X \in FPS(D,\ \text{mini-support})} \mathrm{support}(X)}{\|FPS(D,\ \text{mini-support})\|}$
FPOutlier - Definition 1
• Interpretation of the formula
  – If a transaction t contains more frequent patterns, its FPOF value will be larger, which indicates that it is unlikely to be an outlier; transactions with small FPOF values are the likely outliers
  – The FPOF value lies between 0 and 1
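A small worked computation of FPOF, assuming the definition FPOF(t) = (sum of support(X) over frequent patterns X contained in t) / (number of frequent patterns). The toy database and its frequent patterns (mini-support 0.6) are hypothetical and were worked out by hand.

```python
def fpof(t, frequent):
    """Frequent Pattern Outlier Factor of transaction t.

    `frequent` maps each frequent item-set (a frozenset) to its support.
    """
    contained = sum(s for x, s in frequent.items() if x <= t)
    return contained / len(frequent)


# Toy database and its frequent patterns at mini-support 0.6 (computed by hand).
D = [frozenset("abc"), frozenset("ab"), frozenset("ac"), frozenset("abc"), frozenset("d")]
frequent = {
    frozenset("a"): 0.8,
    frozenset("b"): 0.6,
    frozenset("c"): 0.6,
    frozenset("ab"): 0.6,
    frozenset("ac"): 0.6,
}

# Rank transactions by ascending FPOF: the smallest values are the likely outliers.
for t in sorted(D, key=lambda t: fpof(t, frequent)):
    print(sorted(t), round(fpof(t, frequent), 2))
```

The transaction {d} contains no frequent pattern, so its FPOF is 0 and it is ranked first.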
Definition 2
• For each transaction t, an item-set X is said to be contradictive to t if X ⊄ t. The contradict-ness of X to t is defined as
    $\text{Contradict-ness}(X, t) = (\|X\| - \|t \cap X\|) \times \mathrm{support}(X)$
Interpretation of Definition 2
• The greater the support of the item-set X, the greater the contradict-ness of X to t, since a large support for X suggests a big deviation
• Longer item-sets give a better description than shorter ones
• With Definition 2, it is possible to identify the contribution of each item-set to the outlying-ness of a given transaction
• The frequent pattern outlier factor given in Definition 1 is used as the basic measure for identifying outliers
Definition 3
• Reason
  – Since it is not feasible to list all the contradictive item-sets, it is preferable to present only the top k contradictive frequent patterns to the end user
• (TKCFP, Top K Contradict Frequent Pattern)
  – Given D, I, mini-support, and FPS(D, mini-support) as defined in Definition 1, for each transaction t, an item-set X ∈ FPS(D, mini-support) is said to be a top k contradict frequent pattern if there exist no more than (k − 1) item-sets whose contradict-ness is higher than that of X
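The explanation step can be sketched as follows, assuming Definition 2 takes the form Contradict-ness(X, t) = (||X|| − ||t ∩ X||) × support(X) for frequent item-sets X not contained in t. The frequent patterns are hypothetical values for a toy database. Ties are broken alphabetically and the list is cut at k, a simplification of Definition 3, which would admit every tied item-set.

```python
def contradictness(x, t, support_x):
    """(number of items of X missing from t) times support(X)."""
    return (len(x) - len(x & t)) * support_x


def top_k_contradict(t, frequent, k):
    """The k frequent patterns not contained in t with the highest contradict-ness."""
    scored = [
        (contradictness(x, t, s), x)
        for x, s in frequent.items()
        if not x <= t                  # only item-sets contradictive to t
    ]
    scored.sort(key=lambda pair: (-pair[0], sorted(pair[1])))
    return scored[:k]


# Hypothetical frequent patterns (item-set -> support) of a toy database.
frequent = {
    frozenset("a"): 0.8,
    frozenset("b"): 0.6,
    frozenset("c"): 0.6,
    frozenset("ab"): 0.6,
    frozenset("ac"): 0.6,
}

outlier = frozenset("d")               # a transaction that contains no frequent pattern
for score, x in top_k_contradict(outlier, frequent, k=2):
    print(sorted(x), round(score, 2))
```

Here the longer patterns {a, b} and {a, c} outrank the single item {a} even though {a} has the higher support, because both of their items are missing from the transaction.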
Final Task
• Our task is to mine the top-n outliers with regard to the value of the frequent pattern outlier factor, i.e., the n transactions with the smallest FPOF values. For each identified outlier, its top k contradict frequent patterns are also discovered for the purpose of description