Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar.

Presentation transcript:

Data Mining Anomaly/Outlier Detection Lecture Notes for Chapter 10 Introduction to Data Mining by Tan, Steinbach, Kumar

© Tan, Steinbach, Kumar Outlier Detection (edited by Ch. Eick) 4/8/

Anomaly/Outlier Detection
• What are anomalies/outliers?
  - The set of data points that are considerably different from the remainder of the data
• Variants of anomaly/outlier detection problems
  - Given a database D, find all the data points x ∈ D with anomaly scores greater than some threshold t
  - Given a database D, find all the data points x ∈ D having the top-n largest anomaly scores f(x)
  - Given a database D containing mostly normal (but unlabeled) data points, and a test point x, compute the anomaly score of x with respect to D
• Applications
  - Credit card fraud detection, telecommunication fraud detection, network intrusion detection, fault detection
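The first two problem variants differ only in how outliers are selected once an anomaly score f(x) is available. A minimal sketch, assuming the scores have already been computed by some method; the function names and the example scores are hypothetical and not part of the slides:

```python
def above_threshold(scores, t):
    """Variant 1: indices of all points whose anomaly score exceeds t."""
    return [i for i, s in enumerate(scores) if s > t]


def top_n(scores, n):
    """Variant 2: indices of the n points with the largest anomaly scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]


# Hypothetical anomaly scores f(x) for six points of a database D.
scores = [0.1, 0.9, 0.3, 0.05, 0.7, 0.2]
print(above_threshold(scores, 0.5))  # [1, 4]
print(top_n(scores, 2))              # [1, 4]
```

The third variant needs a scoring function that rates a new test point against D; it depends on the detection scheme chosen and is not shown here.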

Importance of Anomaly Detection: Ozone Depletion History
• In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels
• Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations?
• The ozone concentrations recorded by the satellite were so low that they were being treated as outliers by a computer program and discarded!
Sources:

Anomaly Detection
• Challenges
  - How many outliers are there in the data?
  - The method is unsupervised, so validation can be quite challenging (just like for clustering)
  - Finding a needle in a haystack
• Working assumption
  - There are considerably more “normal” observations than “abnormal” observations (outliers/anomalies) in the data

Anomaly Detection Schemes
• General steps
  - Build a profile of the “normal” behavior
    * Profile can be patterns or summary statistics for the overall population
  - Use the “normal” profile to detect anomalies
    * Anomalies are observations whose characteristics differ significantly from the normal profile
• Types of anomaly detection schemes
  - Graphical
  - Model-based
  - Distance-based
  - Clustering-based

Graphical Approaches
• Boxplot (1-D), scatter plot (2-D), spin plot (3-D)
• Limitations
  - Time consuming
  - Subjective
• Example of a boxplot rule: data points are outliers if they lie more than 1.5 IQR (or 3 IQR) away from the box

R Box Plots (different from textbook)
• R box plots
  - Invented by J. Tukey
  - Another way of displaying the distribution of data
  - The figure shows the basic parts of a box plot
[Figure: box plot with labels: outlier; minimum (or at most 1.5 IQR off the 25th percentile); 25th percentile; 50th percentile; 75th percentile; maximum (or at most 1.5 IQR off the 75th percentile); IQR]
Corrected on: 9/9/2013
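The 1.5 IQR rule behind the box plot can be sketched in a few lines. This is an illustration under assumptions: quartiles are computed with Python's `statistics.quantiles` using the "inclusive" method (other tools, including R, may use slightly different quartile conventions), and the function name and data are made up for the example.

```python
import statistics


def boxplot_outliers(data, k=1.5):
    """Return the (lower, upper) fences and the points lying outside them.

    k=1.5 is the usual box plot rule; k=3 gives the stricter variant.
    """
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return (lower, upper), outliers


# Hypothetical 1-D sample with one suspicious value.
data = [2, 3, 3, 4, 4, 5, 5, 6, 20]
fences, outliers = boxplot_outliers(data)
print(fences)    # (0.0, 8.0)
print(outliers)  # [20]
```

The whiskers of the plot then end at the smallest and largest data values that still lie inside the fences (here 2 and 6).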

Convex Hull Method
• Extreme points are assumed to be outliers
• Use the convex hull method to detect extreme values
• Approach: fit a polygon to a point set; outliers are determined by their distance to the polygon boundary
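As a rough illustration of the first step, the sketch below computes the convex hull of a 2-D point set with Andrew's monotone chain algorithm and reports the hull vertices as the extreme points. The algorithm choice, function names, and data are assumptions for this example; the slides do not say which hull algorithm is meant, and scoring points by their distance to the polygon boundary is not shown.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Hull vertices in counter-clockwise order (Andrew's monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it does not make a left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]


# Hypothetical point set: four corner points and three interior points.
points = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (1, 3), (3, 1)]
extreme = convex_hull(points)
print(extreme)  # [(0, 0), (4, 0), (4, 4), (0, 4)]
```

Note that every point set has hull vertices, so this only identifies candidates; whether they are really outliers needs a further criterion.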

Statistical Approaches: Model-based
• Fit a parametric model M describing the distribution of the data (e.g., normal distribution)
• Assess the probability of each point under M
• The lower a point's probability, the more likely the point is an outlier
• Outlier detection approach:
  - Sort points by their probability
  - Determine outliers
    * based on a probability threshold, or
    * take the bottom x percent as outliers
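For a single attribute, these steps can be sketched as follows, assuming a normal distribution as the model M. Since the data are continuous, the sketch ranks points by probability density rather than probability; the function name, the `bottom_fraction` parameter, and the data are hypothetical.

```python
import statistics


def gaussian_outliers(data, bottom_fraction=0.1):
    """Fit a normal model and return the lowest-density points.

    bottom_fraction plays the role of "the bottom x percent".
    """
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)
    model = statistics.NormalDist(mu, sigma)
    # Sort ascending by density: least likely points come first.
    ranked = sorted(data, key=model.pdf)
    n_out = max(1, round(bottom_fraction * len(data)))
    return ranked[:n_out]


# Hypothetical measurements around 10 with one deviating value.
data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 15.0]
print(gaussian_outliers(data, bottom_fraction=0.1))  # [15.0]
```

One caveat of this simple version: the outliers themselves influence the estimated mean and standard deviation, so strong outliers can partly mask each other.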

Limitations of Statistical Approaches
• Many methods have been designed for a single attribute and are not easy to generalize to multi-dimensional data
• In many cases, the data distribution/model may not be known, and might not match the assumptions of the employed density function (e.g. not symmetric, not Gaussian, …)
• For high-dimensional data, it may be difficult to estimate the true distribution, but there is new work based on EM and mixtures of Gaussians

Distance-based Approaches
• Data is represented as a vector of features
• Three major approaches
  - Nearest-neighbor based
  - Density based
  - Clustering based

Nearest-Neighbor Based Approach
• Approach:
  - Compute the distance between every pair of data points
  - There are various ways to define outliers:
    * Data points for which there are fewer than p neighboring points within a distance D
    * The top n data points whose distance to the kth nearest neighbor is greatest
    * The top n data points whose average distance to the k nearest neighbors is greatest
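The three outlier definitions can be sketched with a brute-force pairwise distance computation. This is a minimal illustration assuming Euclidean distance; the function names and the example points are hypothetical, and the quadratic cost makes it suitable for small data sets only.

```python
import math


def neighbor_distances(points, i):
    """Sorted distances from point i to every other point."""
    return sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)


def fewer_than_p_neighbors(points, p, D):
    """Definition 1: points with fewer than p neighbors within distance D."""
    return [i for i in range(len(points))
            if sum(d <= D for d in neighbor_distances(points, i)) < p]


def kth_nn_scores(points, k):
    """Definition 2: score = distance to the k-th nearest neighbor."""
    return [neighbor_distances(points, i)[k - 1] for i in range(len(points))]


def avg_knn_scores(points, k):
    """Definition 3: score = average distance to the k nearest neighbors."""
    return [sum(neighbor_distances(points, i)[:k]) / k for i in range(len(points))]


def top_n(scores, n):
    """Indices of the n largest scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n]


# Hypothetical data: a unit square plus one far-away point (index 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(fewer_than_p_neighbors(points, p=1, D=2.0))  # [4]
print(top_n(kth_nn_scores(points, k=2), 1))        # [4]
print(top_n(avg_knn_scores(points, k=2), 1))       # [4]
```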

Density-based: LOF Approach
• For each point, compute the density of its local neighborhood; e.g. use DBSCAN's approach
• Compute the local outlier factor (LOF) of a sample p as the average of the ratios of the density of sample p to the density of its nearest neighbors
• Outliers are points with the lowest LOF value (under this definition a low ratio means p lies in a sparser region than its neighbors; the original LOF of Breunig et al. uses the inverse ratio, so there outliers have the largest values)
[Figure: data set with two marked points, p1 and p2]
• In the NN approach, p2 is not considered an outlier, while the LOF approach finds both p1 and p2 as outliers
• Alternative approach: directly use a density function; e.g. DENCLUE's density function
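A simplified sketch of the ratio described on this slide: the density of a point is taken to be the inverse of its average distance to its k nearest neighbors (one common choice, assumed here; the slide only says "e.g. use DBSCAN's approach"), and the score of p is the mean of density(p) / density(q) over its neighbors q. This is not the full LOF algorithm with reachability distances; names and data are hypothetical, and the points are assumed to be distinct.

```python
import math


def knn(points, i, k):
    """Indices of the k nearest neighbors of point i."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return others[:k]


def density(points, i, k):
    """Inverse of the average distance to the k nearest neighbors."""
    avg = sum(math.dist(points[i], points[j]) for j in knn(points, i, k)) / k
    return 1.0 / avg  # assumes no duplicate points (avg > 0)


def relative_density_scores(points, k):
    """Mean ratio density(p) / density(q) over the k nearest neighbors q of p.

    With this orientation, LOW scores indicate outliers.
    """
    dens = [density(points, i, k) for i in range(len(points))]
    return [sum(dens[i] / dens[j] for j in knn(points, i, k)) / k
            for i in range(len(points))]


# Hypothetical data: a unit square plus one isolated point (index 4).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = relative_density_scores(points, k=2)
print(scores.index(min(scores)))  # 4
```

Points inside a uniformly dense region get a score close to 1, whatever the absolute density of that region is; this is what lets the approach flag points near a dense cluster that a global distance threshold would miss.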

Clustering-Based
• Idea: use a clustering algorithm that has some notion of outliers!
• Problem: which parameters should be chosen for the algorithm, e.g. for DBSCAN?
• Rule of thumb: less than x% of the data should be outliers (with x typically chosen between 0.1 and 10); x might be determined with other methods, e.g. statistical tests or domain knowledge
• Once you have an idea of how many outliers you want, the parameter selection problem becomes easier
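One way to read the rule of thumb: fix the outlier budget x first, then pick the parameter value that respects it. The sketch below does this for DBSCAN's radius, using a brute-force computation of DBSCAN's noise points (points that are neither core points nor within the radius of a core point). The function names, the candidate radii, and the data are assumptions for illustration; this is not a full DBSCAN implementation, since it does not assign cluster labels.

```python
import math


def dbscan_noise(points, eps, min_pts):
    """Indices of the points DBSCAN would label as noise."""
    n = len(points)
    # Neighborhoods include the point itself, as in the usual definition.
    nbrs = [[j for j in range(n) if math.dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nb) >= min_pts for nb in nbrs]
    return [i for i in range(n)
            if not core[i] and not any(core[j] for j in nbrs[i])]


def pick_eps(points, min_pts, max_outlier_fraction, candidates):
    """Smallest candidate eps whose noise fraction stays within the budget."""
    for eps in sorted(candidates):
        noise = dbscan_noise(points, eps, min_pts)
        if len(noise) / len(points) <= max_outlier_fraction:
            return eps, noise
    return None, None


# Hypothetical data: a 3x3 grid of points plus one isolated point (index 9).
points = [(x, y) for x in range(3) for y in range(3)] + [(10, 10)]
eps, noise = pick_eps(points, min_pts=3, max_outlier_fraction=0.10,
                      candidates=[0.5, 1.0, 1.5])
print(eps, noise)  # 1.0 [9]
```

With the radius 0.5 every point is noise, which violates the 10% budget, so the search moves on to the next candidate.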