Data Mining Anomaly Detection

Presentation transcript:

Data Mining: Anomaly Detection. Master in Soft Computing and Intelligent Systems. Course: Advanced Models in Data Mining. Universidad de Granada. Juan Carlos Cubero, JC.Cubero@decsai.ugr.es. Slides prepared from those by Tan, Steinbach, Kumar: Introduction to Data Mining, http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item2, and Lazarevic et al.: http://videolectures.net/ecmlpkdd08_lazarevic_dmfa/. Notes: Extension: statistical regularities (EM). Mahalanobis. Software: http://www.vectoranomaly.com/index.php. More examples of abnormal regularities. Have the students compute some distance values in class. Missing: plot of repeated measurements across laboratories (Cochran).

Data Mining: Anomaly Detection. Outline: Motivation and Introduction; Supervised Methods; Unsupervised Methods: Graphical and Statistical Approaches, Distance-based Approaches (Nearest Neighbor, Density-based), Clustering-based; Abnormal Regularities. Notes: Extension: statistical regularities (EM). Mahalanobis. Software: http://www.vectoranomaly.com/index.php. Infrequent itemsets.

Anomaly Detection. Bacon, writing in Novum Organum about 400 years ago, said: "Errors of Nature, Sports and Monsters correct the understanding in regard to ordinary things, and reveal general forms. For whoever knows the ways of Nature will more easily notice her deviations; and, on the other hand, whoever knows her deviations will more accurately describe her ways." What are anomalies/outliers? The set of data points that are considerably different from the remainder of the data.

Anomaly Detection. Working assumption: there are considerably more "normal" observations than "abnormal" observations (outliers/anomalies) in the data. Challenges: How many outliers are there in the data? Finding them is like finding a needle in a haystack.

Anomaly/Outlier Detection: Applications. Credit card fraud: an abnormally high purchase made on a credit card. Cyber intrusions: a web server involved in ftp traffic.

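Anomaly/Outlier Detection: Categorization. Unsupervised methods: the data inputs carry no label stating whether they are anomalies. A data input is considered an outlier depending on its relation to the rest of the data. [Figure: scatter plot on axes X and Y showing normal regions N1 and N2 and outliers o1, o2 and O3]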

Anomaly/Outlier Detection: Categorization. Supervised methods: each data input includes a label stating whether it is an anomaly or not.

Anomaly/Outlier Detection: Supervised. Supervised methods: classification of a class attribute with very rare class values (the outliers). Key issue: unbalanced datasets (more with Paco Herrera). Consider an intrusion detection problem with two classes: normal (99.9%) and intrusion (0.1%). The default classifier, which always labels each new entry as normal, would have 99.9% accuracy!

Anomaly/Outlier Detection: Supervised. Managing the problem of classification with rare classes: We need other evaluation measures as alternatives to accuracy (Recall, Precision, F-measure, ROC curves). Some methods manipulate the data input, oversampling those tuples with the outlier label (the rare class value). Cost-sensitive methods (assigning a high cost to the rare class value). Variants on rule-based methods, neural networks, SVMs, etc.

Anomaly/Outlier Detection: Supervised. Anomaly class: C. Normal class: NC. Recall (R) = TP/(TP + FN). Precision (P) = TP/(TP + FP). F-measure = 2*R*P/(R + P).
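
As a small illustration of why these measures are preferred to accuracy on rare classes, the following sketch computes them for a made-up confusion matrix. The counts and the function name `precision_recall_f` are hypothetical, chosen only for the example.

```python
def precision_recall_f(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F-measure) for the anomaly class C."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure


# Hypothetical data: 10 real intrusions among 10,000 records.
# The detector finds 8 of them (TP=8, FN=2) and raises 32 false alarms (FP=32).
p, r, f = precision_recall_f(tp=8, fp=32, fn=2)
accuracy = (10_000 - 32 - 2) / 10_000

print(f"precision={p:.2f} recall={r:.2f} F={f:.2f} accuracy={accuracy:.4f}")
```

Accuracy stays very close to 1 even though only one alarm in five is a real intrusion, which is what precision exposes.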

Base Rate Fallacy (Axelsson, 1999). Suppose that your physician performs a test that is 99% accurate, i.e. when the test was administered to a test population all of which had the disease, 99% of the tests indicated disease, and likewise, when the test population was known to be 100% free of the disease, 99% of the test results were negative. Upon visiting your physician to learn of the results he tells you he has good news and bad news. The bad news is that indeed you tested positive for the disease. The good news however, is that out of the entire population the rate of incidence is only 1/10000, i.e. only 1 in 10000 people have this ailment. What, given the above information, is the probability of you having the disease?

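Base Rate Fallacy. Bayes' theorem: $P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(B)}$. More generally, for mutually exclusive and exhaustive events $A_1, \dots, A_n$: $P(A_i \mid B) = \frac{P(A_i)\,P(B \mid A_i)}{\sum_{j=1}^{n} P(A_j)\,P(B \mid A_j)}$.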

Base Rate Fallacy. Call S = sick, Pt = positive test. P(S) = 1/10000, P(Pt|S) = 0.99, P(Pt|¬S) = 1 − P(¬Pt|¬S) = 0.01. Compute P(S|Pt) = P(S) P(Pt|S) / (P(S) P(Pt|S) + P(¬S) P(Pt|¬S)) ≈ 0.0098. Even though the test is 99% accurate, your chance of having the disease is only about 1/100, because the population of healthy people is much larger than the population of sick people.
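
The computation for the physician example can be checked with a few lines of code. This is only a sketch of Bayes' theorem for two classes; the function name `posterior` and its parameter names are illustrative.

```python
def posterior(prior: float, sensitivity: float, false_positive_rate: float) -> float:
    """P(S | Pt) via Bayes' theorem with the two hypotheses S and not-S."""
    # Total probability of a positive test: P(Pt|S)P(S) + P(Pt|not S)P(not S).
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


# Values from the example: P(S) = 1/10000, P(Pt|S) = 0.99, P(Pt|not S) = 0.01.
p_sick_given_positive = posterior(prior=1 / 10_000,
                                  sensitivity=0.99,
                                  false_positive_rate=0.01)
print(f"P(S|Pt) = {p_sick_given_positive:.4f}")
```

The result is a little under 0.01, i.e. roughly 1 chance in 100.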

Base Rate Fallacy in Outlier Detection. Outlier detection as a classification system with two classes: outlier, not an outlier. A typical problem: intrusion detection. I: real intrusive behavior; ¬I: non-intrusive behavior; A: alarm (outlier detected); ¬A: no alarm. A good classification system will have a high detection rate (true positive rate), P(A|I), and a low false alarm rate, P(A|¬I). We should also obtain high values of the Bayesian detection rate, P(I|A) (if the alarm fires, it is an intrusion), and of P(¬I|¬A) (if the alarm does not fire, it is not an intrusion).

Base Rate Fallacy in Outlier Detection. By Bayes' theorem, P(I|A) = P(I) P(A|I) / (P(I) P(A|I) + P(¬I) P(A|¬I)). In intrusion (and, in general, outlier) detection systems, P(I) is very low (on the order of 10^-5), so P(¬I) is very high (0.99998 in this example). The final value of P(I|A) is therefore dominated by the false alarm rate P(A|¬I), which would need to be very low (around 10^-5) to compensate for the factor 0.99998. But even a very good classification system does not achieve such a false alarm rate.

Base Rate Fallacy in Outlier Detection. Conclusion: outlier classification systems must be carefully designed when applied to data with a very low rate of positives (outliers). Consider a classifier with the best possible detection rate, P(A|I) = 1, and an extremely good false alarm rate of 0.001. In this case, P(I|A) = 0.02 (in the accompanying plot, the scale is logarithmic). So, if the alarm fires 50 times, only one alarm corresponds to a real intrusion.
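
To see how strongly the false alarm rate drives P(I|A), the sketch below sweeps a few false alarm rates with a perfect detection rate. The base rate P(I) = 2·10^-5 is an assumption inferred from the value P(¬I) = 0.99998 quoted above; the function name is illustrative.

```python
def bayesian_detection_rate(p_intrusion: float,
                            detection_rate: float,
                            false_alarm_rate: float) -> float:
    """P(I | A): probability that an alarm corresponds to a real intrusion."""
    p_alarm = (detection_rate * p_intrusion
               + false_alarm_rate * (1 - p_intrusion))
    return detection_rate * p_intrusion / p_alarm


P_I = 2e-5  # assumed base rate, so that P(not I) = 0.99998

# Perfect detection rate P(A|I) = 1, increasingly good false alarm rates.
rates = {far: bayesian_detection_rate(P_I, 1.0, far)
         for far in (1e-2, 1e-3, 1e-4, 1e-5)}

for far, value in rates.items():
    print(f"P(A|not I) = {far:g}  ->  P(I|A) = {value:.4f}")
```

With a false alarm rate of 0.001 the result is close to 0.02, and only a false alarm rate comparable to the base rate itself brings P(I|A) above one half.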

Anomaly/Outlier Detection: Unsupervised. [Figure: scatter plot on axes X and Y showing normal regions N1 and N2 and outliers o1, o2 and O3]

Anomaly/Outlier Detection: Unsupervised. General steps: (1) Build a profile of the "normal" behavior. The profile can be patterns or summary statistics for the overall population. (2) Use the "normal" profile to detect anomalies. Anomalies are observations whose characteristics differ significantly from the normal profile. Types of anomaly detection schemes: point anomalies and non-point anomalies.

Anomaly/Outlier Detection: Unsupervised Point anomalies

Anomaly/Outlier Detection: Unsupervised. Variants of point anomaly detection problems: (1) Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t. (2) Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x). (3) Given a database D, containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D. Approaches for point anomalies: graphical and statistical-based, distance-based, clustering-based, others.
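
The first two problem variants differ only in how the anomaly scores f(x) are cut off. A minimal sketch, assuming the scores have already been computed by some method and are stored in a dictionary (the point names and score values are made up):

```python
def above_threshold(scores: dict[str, float], t: float) -> list[str]:
    """Variant 1: all points whose anomaly score exceeds the threshold t."""
    return sorted(name for name, s in scores.items() if s > t)


def top_n(scores: dict[str, float], n: int) -> list[str]:
    """Variant 2: the n points with the largest anomaly scores."""
    return sorted(scores, key=scores.get, reverse=True)[:n]


# Hypothetical anomaly scores f(x) for four points of a database D.
scores = {"a": 0.1, "b": 0.9, "c": 0.4, "d": 0.75}

print(above_threshold(scores, t=0.5))
print(top_n(scores, n=1))
```

The threshold variant can return any number of points, while the top-n variant always returns n points even if none of them is unusual.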

Anomaly/Outlier Detection: Unsupervised. Non-point anomalies: contextual. [Figure labels: Normal, Anomaly]

Anomaly/Outlier Detection: Unsupervised. Non-point anomalies: collective. [Figure label: Anomalous Subsequence]

Graphical Approaches. Limitations: time consuming; subjective.

Convex Hull Method. Extreme points are assumed to be outliers. Use the convex hull method to detect extreme values. What if the outlier occurs in the middle of the data?
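
The limitation raised by the last question can be illustrated with a small sketch. It uses Andrew's monotone chain algorithm to compute the hull in 2D (one of several standard hull algorithms, chosen here only for brevity); the data set is made up.

```python
Point = tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: list[Point]) -> list[Point]:
    """Vertices of the 2D convex hull (monotone chain, collinear points dropped)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower: list[Point] = []
    upper: list[Point] = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


# Hypothetical data: points on the border of a square, plus one isolated
# point in the empty middle of the square.
ring = [(0, 0), (2, 0), (4, 0), (4, 2), (4, 4), (2, 4), (0, 4), (0, 2)]
middle = (2, 2)
hull = convex_hull(ring + [middle])
print(hull)
```

Only the corners of the square are reported as extreme points; the isolated point in the middle is never on the hull, so this method cannot flag it.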

Statistical Approaches. Without assuming a parametric model describing the distribution of the data (and with only 1 variable): IQR = Q3 − Q1. P is an outlier if P > Q3 + 1.5 IQR or P < Q1 − 1.5 IQR. P is an extreme outlier if P > Q3 + 3 IQR or P < Q1 − 3 IQR.
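
A minimal sketch of the IQR rule follows. Quartiles can be estimated in several ways; this example assumes the "inclusive" method of Python's `statistics.quantiles`, and the data are made up.

```python
import statistics


def iqr_outliers(values: list[float]) -> dict[float, str]:
    """Label values outside the 1.5*IQR and 3*IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    labels: dict[float, str] = {}
    for v in values:
        if v > q3 + 3 * iqr or v < q1 - 3 * iqr:
            labels[v] = "extreme outlier"
        elif v > q3 + 1.5 * iqr or v < q1 - 1.5 * iqr:
            labels[v] = "outlier"
    return labels


# Hypothetical univariate sample.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 17, 100]
labels = iqr_outliers(data)
print(labels)
```

For this sample the quartiles are Q1 = 3.5 and Q3 = 8.5, so the fences are at 16 and 23.5 on the upper side: 17 is an outlier and 100 an extreme outlier.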

Statistical Approaches. Assume a parametric model describing the distribution of the data (e.g., normal distribution). Apply a statistical test that depends on: the data distribution; the parameters of the distribution (e.g., mean, variance); the number of expected outliers (confidence limit).

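Grubbs' Test. Detects outliers in univariate data. Assumes the data come from a normal distribution. Detects one outlier at a time: remove the outlier and repeat. H0: there is no outlier in the data. HA: there is at least one outlier. Grubbs' test statistic: $G = \frac{\max_i |X_i - \bar{X}|}{s}$, where $\bar{X}$ is the sample mean and $s$ the sample standard deviation. Reject H0 at significance level $\alpha$ (two-sided test) if: $G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$, where $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the t distribution with $N-2$ degrees of freedom. Online calculator: http://www.graphpad.com/quickcalcs/Grubbs1.cfm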
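
A minimal sketch of one step of the test, using only the standard library. It assumes the usual form of the statistic, G = max|X_i − mean| / s, and of the rejection threshold, (N−1)/√N · √(t² / (N−2+t²)). The critical t value is not computed here: it is taken as an input (`t_crit`) that would be looked up in a t table for the chosen significance level. The data are made up.

```python
import math
import statistics


def grubbs_statistic(data: list[float]) -> tuple[float, float]:
    """Return (G, suspect): the statistic and the value farthest from the mean."""
    mean = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation
    suspect = max(data, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / s, suspect


def grubbs_critical(n: int, t_crit: float) -> float:
    """Rejection threshold for G, given the critical t value (N-2 d.o.f.)."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit**2 / (n - 2 + t_crit**2))


# Hypothetical measurements with one suspicious value.
data = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 14.0]
g, suspect = grubbs_statistic(data)
print(f"G = {g:.3f} for suspect value {suspect}")
```

If G exceeds `grubbs_critical(len(data), t_crit)`, the suspect value is removed and the test is repeated on the remaining data.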

Multivariate Normal Distribution Working with several dimensions

Multivariate Normal Distribution

Limitations of Statistical Approaches. Most of the tests are for a single attribute. In many cases, the data distribution may not be known. For high-dimensional data, it may be difficult to estimate the true distribution.

Distance-based Approaches (DB). Data is represented as a vector of features. We have a distance measure to evaluate nearness between two points. Two major approaches: nearest-neighbor based and density based. The first two methods work directly with the data.

Nearest-Neighbor Based Approach. Compute the distance (proximity) between every pair of data points. Fix a magic number k, which selects the k-th nearest point to any given point. For a given point P, compute its outlier score as the distance from P to its k-th nearest neighbor. There are no clusters here: "neighbor" refers to a point. Consider as outliers those points with a high outlier score.
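
The steps above can be sketched with a brute-force computation over all pairs of points. Euclidean distance is assumed as the distance measure, and the data set is made up.

```python
import math

Point = tuple[float, ...]


def knn_outlier_scores(points: list[Point], k: int) -> list[float]:
    """Outlier score of each point: distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, in increasing order.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical data: a tight group of four points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = knn_outlier_scores(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 2))
```

The brute-force version costs O(n²) distance computations; for large data sets a spatial index would normally replace the inner loop.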

Nearest-Neighbor Based Approach. [Figure, k = 5: for each of the points C and P, the distance to its 5th nearest neighbor is marked as its outlier score]

Nearest-Neighbor Based Approach. [Figure, k = 4] All the points in the group are close to each other, and thus have a low outlier score. One point is far away from its 4 nearest neighbors, and thus has a high outlier score.

Nearest-Neighbor Based Approach. [Figure, k = 1: most points have low outlier scores; one annotated point has a greater outlier score than C] The choice of k is problematic.

Nearest-Neighbor Based Approach. [Figure, k = 5: annotations mark points with high outlier scores and a point with a medium-high outlier score (could be greater)] The choice of k is problematic: all the points in any isolated natural cluster with fewer than k points have a high outlier score. We could mitigate the problem by taking the average distance to the k nearest neighbors, but the result is still poor.

Nearest-Neighbor Based Approach. Density should be taken into account. [Figure annotations] C has a high outlier score for every k. D has a low outlier score for every k. A has a medium-high outlier score for every k.

Density-based Approach. Density should be taken into account. Let us define the density around a point in one of two ways. Alternative a) The k-density of a point is the inverse of the average distance to its k nearest neighbors. Alternative b) The d-density of a point P is the number of points Pi that are d-close to P (distance(Pi, P) ≤ d). This is the definition used in DBSCAN. The choice of d is problematic.

Density-based Approach. Density should be taken into account. Define the k-relative density of a point P as the ratio between the average k-density of its k nearest neighbors and its own k-density, so that a point lying in a region sparser than its neighborhood gets a value above 1. The outlier score of a point P (called LOF for this method) is its k-relative density. LOF is implemented in the R statistical suite.
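
A simplified sketch of this score follows. It is not the full LOF algorithm (which uses reachability distances); it only implements the ratio described here, oriented as average neighbor density divided by the point's own density so that sparse points get high scores. Euclidean distance and distinct points are assumed, and the data are made up.

```python
import math

Point = tuple[float, ...]


def k_nearest(points: list[Point], i: int, k: int) -> list[tuple[float, int]]:
    """(distance, index) pairs of the k nearest neighbors of points[i]."""
    others = sorted((math.dist(points[i], q), j)
                    for j, q in enumerate(points) if j != i)
    return others[:k]


def k_density(points: list[Point], i: int, k: int) -> float:
    """Inverse of the average distance to the k nearest neighbors."""
    avg_dist = sum(d for d, _ in k_nearest(points, i, k)) / k
    return 1.0 / avg_dist


def relative_density_score(points: list[Point], i: int, k: int) -> float:
    """Average k-density of the neighbors divided by the point's own k-density."""
    neighbor_ids = [j for _, j in k_nearest(points, i, k)]
    avg_neighbor_density = sum(k_density(points, j, k) for j in neighbor_ids) / k
    return avg_neighbor_density / k_density(points, i, k)


# Hypothetical data: a dense group and one point lying just outside it.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (3, 3)]
for i, p in enumerate(points):
    print(p, round(relative_density_score(points, i, k=2), 2))
```

Points inside the group score about 1 (their density matches that of their neighbors), while the outside point scores well above 1.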

Density-based Approach. Density should be taken into account. [Figure annotations] C has an extremely low k-density and a very high k-relative density for every k, and thus a very high LOF outlier score. A has a very low k-density but a medium-low k-relative density for every k, and thus a medium-low LOF outlier score. D has a medium-low k-density but a medium-high k-relative density for every k, and thus a medium-high LOF outlier score.

Distance Measure. [Figure: points A and B and the centroid C of the data] With respect to the shape of the data, B is closer to the centroid C than A, but its Euclidean distance to C is higher.

Distance Measure

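Distance Measure. Replace the Euclidean distance by the Mahalanobis distance: $d_M(x, y) = \sqrt{(x - y)^T V^{-1} (x - y)}$, where V is the covariance matrix of the data. Usually, V is unknown and is replaced by the sample covariance matrix.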
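
A small 2D sketch shows how the two distances can disagree. It assumes the standard definition of the Mahalanobis distance, sqrt((x − y)ᵀ V⁻¹ (x − y)), with a 2×2 covariance matrix inverted by hand; the covariance matrix and the points A and B are made up.

```python
import math

Vec2 = tuple[float, float]
Mat2 = tuple[Vec2, Vec2]


def mahalanobis_2d(x: Vec2, y: Vec2, cov: Mat2) -> float:
    """Mahalanobis distance between x and y for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = ((d / det, -b / det), (-c / det, a / det))  # closed-form 2x2 inverse
    dx, dy = x[0] - y[0], x[1] - y[1]
    q = (dx * (inv[0][0] * dx + inv[0][1] * dy)
         + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(q)


# Hypothetical data shape: variance 9 along the first axis, 1 along the second.
V = ((9.0, 0.0), (0.0, 1.0))
centroid = (0.0, 0.0)
A = (0.0, 2.0)  # close in Euclidean terms, but across the narrow direction
B = (3.0, 0.0)  # farther in Euclidean terms, but along the wide direction

for name, p in (("A", A), ("B", B)):
    print(name, math.dist(p, centroid), mahalanobis_2d(p, centroid, V))
```

B has the larger Euclidean distance (3 versus 2) but the smaller Mahalanobis distance (1 versus 2), because it lies along the direction in which the data spread widely.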

Outliers in High Dimensional Problems. In high-dimensional space, data is sparse and the notion of proximity becomes meaningless: every point is an almost equally good outlier from the perspective of proximity-based definitions. Lower-dimensional projection methods: a point is an outlier if, in some lower-dimensional projection, it is present in a local region of abnormally low density.

Outliers in High Dimensional Problems. Approach by Aggarwal and Yu. Divide each attribute into $\phi$ equal-depth intervals. Each interval contains a fraction $f = 1/\phi$ of the records. Consider a k-dimensional cube created by picking grid ranges from k different dimensions. If the attributes are independent, we expect the region to contain a fraction $f^k$ of the records. If there are N points, we can measure the sparsity of a cube D as: $S(D) = \frac{n(D) - N f^k}{\sqrt{N f^k (1 - f^k)}}$, where $n(D)$ is the number of points in D.
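
A minimal sketch of the sparsity computation, assuming the coefficient takes the form S(D) = (n(D) − N·f^k) / sqrt(N·f^k·(1 − f^k)), i.e. the observed count in the cube standardized by the count expected under independence. The function name and the example counts are illustrative.

```python
import math


def sparsity(n_in_cube: int, n_total: int, phi: int, k: int) -> float:
    """Sparsity coefficient of a k-dimensional cube holding n_in_cube points."""
    f = 1 / phi                      # fraction of records per interval
    expected = n_total * f**k        # expected count if attributes are independent
    std = math.sqrt(n_total * f**k * (1 - f**k))
    return (n_in_cube - expected) / std


# N = 100 points, phi = 5 intervals per attribute, 2-dimensional cubes:
# each cube is expected to hold 100 * 0.2**2 = 4 points.
for count in (0, 1, 4, 10):
    print(count, round(sparsity(count, n_total=100, phi=5, k=2), 3))
```

Cubes holding far fewer points than expected get strongly negative values; the points inside such cubes are the outlier candidates.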

Outliers in High Dimensional Problems. Example: k = 2, N = 100, $\phi$ = 5, f = 1/5 = 0.2, N × f² = 4.

Outliers in High Dimensional Problems. Algorithm: (1) Try every k-projection (k = 1, 2, ..., Dim). (2) Compute the sparsity of every cube in each k-projection. (3) Retain the cubes with the most negative sparsity. The authors use a genetic algorithm to compute it. This is still an open problem for future research.

Clustering-Based Approach. Basic idea: a set of clusters has already been constructed by any clustering method. An object is a cluster-based outlier if the object does not strongly belong to any cluster. How do we measure it?

Clustering-Based Approach. Alternative a) By measuring the distance to its closest centroid. [Figure annotation] D is near its centroid, and thus it has a low outlier score.

Clustering-Based Approach. Alternative b) By measuring the relative distance to its closest centroid. The relative distance is the ratio of the point's distance from the centroid to the median distance of all the points in the cluster from the centroid. [Figure annotations] D has a medium-high relative distance to its centroid, and thus a medium-high outlier score. A has a medium-low relative distance to its centroid, and thus a medium-low outlier score.
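
The relative distance of alternative b) can be sketched for a single cluster as follows. The cluster is assumed to have been produced beforehand by some clustering method; Euclidean distance is assumed and the points are made up.

```python
import math
import statistics

Point = tuple[float, ...]


def relative_distance_scores(cluster: list[Point]) -> list[float]:
    """Distance of each point to the centroid, divided by the median distance."""
    centroid = tuple(statistics.fmean(coord) for coord in zip(*cluster))
    dists = [math.dist(p, centroid) for p in cluster]
    median = statistics.median(dists)  # assumed non-zero
    return [d / median for d in dists]


# Hypothetical cluster: four nearby points and one point assigned to the
# same cluster although it lies far from the others.
cluster = [(0, 0), (0, 1), (1, 0), (1, 1), (4, 4)]
scores = relative_distance_scores(cluster)
for p, s in zip(cluster, scores):
    print(p, round(s, 2))
```

Dividing by the median makes scores comparable across clusters of different spread: the same absolute distance counts for more in a tight cluster than in a loose one.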

Clustering-Based Approach. The choice of k is problematic (k is now the number of clusters). Usually, it is better to work with a large number of small clusters: an object identified as an outlier when there is a large number of small clusters is likely to be a true outlier.

Abnormal Regularities. What are anomalies/outliers? The set of data points that are considerably different from the remainder of the data. It could be better to distinguish between two notions. Outlier: a point is an outlier if it is considerably different from the remainder of the data. Abnormal regularity: a small set of points, close to each other, which are considerably different from the remainder of the data.

Abnormal Regularities. Ozone depletion history: In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels. Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? The ozone concentrations recorded by the satellite were so low they were being treated as outliers by a computer program and discarded! Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html

Abnormal Regularities. Some definitions of abnormal regularities. Peculiarities: association rules between infrequent items (Zhong et al.). Exceptions: occur when a value interacts with another one in such a way that it changes the behavior of an association rule (Suzuki et al.). Anomalous association rules: occur when there are two behaviors, the typical one and the abnormal one.