Lecture 14: Anomaly Detection

Presentation transcript:

CSE 482: Big Data Analysis Lecture 14: Anomaly Detection

Problem Definition Given: a collection of data instances, each characterized by an attribute set x. Task: Find a subset of instances whose characteristics are considerably different from the remainder of the data. The problem is also known as outlier, deviation or novelty detection.

Importance of Detecting Anomalies: Ozone Depletion History. In 1985 three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels. Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? The ozone concentrations recorded by the satellite were so low they were being treated as noise by a computer program and discarded! Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html

Example Applications Applying anomaly detection to detect deforestation using remote sensing data Brazil accounts for almost 50% of all humid tropical forest clearing, nearly 4 times that of the next highest country, which accounts for 12.8% of the total. Amazon rainforest

Example Applications Intelligent transportation system Congestion detection Smart Home/Building Water theft detection Pipe burst detection

Other Example Applications
Instance | Attribute set, x | Anomaly detection task
Credit card transaction | Item purchased, amount, location, time, credit limit, balance, etc. | Finding fraudulent transactions
Network traffic flow | Source and destination IP, port numbers, # bytes, etc. | Identifying malware and other malicious activities
Component to be tested | Sensor measurements | Detecting failures in components

Challenges in Anomaly Detection Finding a needle in a haystack: anomalies are rare compared to other observations. The number of anomalies is usually unknown. The method is unsupervised. Validation is challenging (just like for clustering).

Output of Anomaly Detection Continuous-valued output Every data instance is assigned an anomaly score Given a database D, find all instances having the top-k largest anomaly/outlier scores, where k is a user-specified parameter Binary-valued output A threshold is needed to convert the anomaly score into a binary label - anomaly or normal
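The two output types above can be sketched in a few lines of NumPy. This is only an illustration: the function names `top_k_anomalies` and `binarize`, the example scores and the threshold of 2.0 are assumptions, not part of the lecture.

```python
import numpy as np

def top_k_anomalies(scores, k):
    """Return the indices of the k largest anomaly scores, highest first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]   # indices sorted by decreasing score
    return order[:k]

def binarize(scores, threshold):
    """Convert continuous scores into labels: True = anomaly, False = normal."""
    return np.asarray(scores, dtype=float) > threshold

# Hypothetical anomaly scores for five instances
scores = np.array([0.2, 3.1, 0.5, 2.4, 0.1])
top2 = top_k_anomalies(scores, k=2)        # continuous-valued output, top-k
labels = binarize(scores, threshold=2.0)   # binary-valued output
```

Choosing k or the threshold is the user's decision in both cases; the score itself does not say where "normal" ends.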

Basic Strategy in Anomaly Detection Assumption: there are more “normal” than “anomalous” instances in the given data General Approach Build a profile of the “normal” behavior A profile is a set of patterns or summary statistics characterizing the overall population Use the “normal” profile to flag the anomalies Anomalies are observations whose characteristics differ significantly from the normal profile

Graphical Approach (1-D Boxplot) Also known as a box-and-whisker plot. [Figure: boxplot labelled with the 10th, 25th, 50th, 75th and 90th percentiles and an outlier] The inter-quartile range is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile). Points lying beyond the whiskers are plotted individually as outliers.
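A boxplot-style outlier rule can also be applied numerically. The sketch below assumes the common convention of flagging points more than 1.5 × IQR outside the quartiles; that multiplier is an assumption here (the slide's figure labels the 10th and 90th percentiles instead), and the function name and sample data are made up for illustration.

```python
import numpy as np

def iqr_outliers(values, whisker=1.5):
    """Flag values lying more than whisker * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])   # 1st and 3rd quartiles
    iqr = q3 - q1                              # inter-quartile range
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return (values < lower) | (values > upper)

# Hypothetical 1-D sample with one obviously large value
data = np.array([10, 12, 11, 13, 12, 11, 10, 40])
mask = iqr_outliers(data)   # boolean mask, True where the rule flags a point
```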

Graphical Approach (2-D Scatter Plot)

Z-Score Approach Assume the data follows a Gaussian distribution. The outlier score for a data point x is computed from μ, the mean, and σ or Σ, a measure of dispersion (standard deviation or covariance matrix).
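The formula itself did not survive in this transcript, so the following sketch uses the standard definitions as an assumption: |x − μ| / σ for one-dimensional data, and the Mahalanobis distance sqrt((x − μ)ᵀ Σ⁻¹ (x − μ)) when the dispersion is a covariance matrix. Function names and data are illustrative only.

```python
import numpy as np

def zscore_outlier_scores(x):
    """Univariate score: distance from the mean in units of std deviation."""
    x = np.asarray(x, dtype=float)
    return np.abs(x - x.mean()) / x.std()

def mahalanobis_scores(X):
    """Multivariate score: Mahalanobis distance of each row from the mean."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))   # inverse of Sigma
    # Row-wise quadratic form diff_i^T * cov_inv * diff_i
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# Hypothetical data: the last entry / row is far from the rest
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
z = zscore_outlier_scores(x)

X = np.array([[1, 2], [2, 1], [3, 5], [4, 3], [5, 4], [20, -5]], dtype=float)
m = mahalanobis_scores(X)
```

Note that μ, σ and Σ are estimated from data that includes the anomalies, so a few extreme points can inflate the dispersion and mask themselves.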

Distance-Based Approach Input: data: the set of data points; k: number of nearest neighbors. Approach: Compute the distance between every pair of data points. The anomaly score of a data point is given by its distance to its k-th nearest neighbor. The larger the distance, the more anomalous the data point.

Python Example Synthetic control sample data: 55 time series of length 60 each. [Figure: time series #6, #11, #46 and #51, with an outlier marked]

Python Example

Python Example [Figure panels: the first 50 time series; the last 5 time series]

Python Example Input data (N × d) → Distance matrix (N × N) → Sorted distance for each row (N × N) → Outlier score: the k-th smallest distance in each row (N values)

Python Example Input data (55 × 60) → pdist → 55 × 54 / 2 pairwise distances → squareform → Distance matrix Y (55 × 55)

Python Example Distance matrix Y (N × N) → knnDist (N values) → Outliers

Python Example The sort and argsort functions order knnDist by increasing value (sort returns the sorted values, argsort the corresponding indices). flipud will "flip" the data upside down (i.e., the last row becomes the first row, the second-to-last row becomes the second row, etc.). column_stack will merge 2 columns.
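The code shown on the slides is not part of this transcript, so the following is a reconstruction sketch of the pipeline they describe (pdist, squareform, sort, argsort, flipud, column_stack). The synthetic control data is not available here; a random 55 × 60 array whose last 5 rows have a much larger spread stands in for it, and k = 5 is an arbitrary choice.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Stand-in data (assumption): 50 "normal" rows and 5 rows with larger spread
rng = np.random.default_rng(0)
data = rng.normal(size=(55, 60))
data[50:] = rng.normal(scale=5.0, size=(5, 60))

# Condensed vector of 55*54/2 pairwise distances, then the 55 x 55 matrix Y
condensed = pdist(data, metric="euclidean")
Y = squareform(condensed)

# Sort each row; column 0 is the distance of a point to itself (0),
# so column k holds the distance to the k-th nearest neighbour.
k = 5
sorted_dist = np.sort(Y, axis=1)
knn_dist = sorted_dist[:, k]

# argsort gives increasing order; flipud reverses it so that the most
# anomalous rows come first. column_stack pairs each row index with its score.
order = np.flipud(np.argsort(knn_dist))
ranking = np.column_stack((order, knn_dist[order]))
print(ranking[:5])   # the five highest-scoring rows
```

Computing all pairwise distances costs O(N²) time and memory, which is fine for 55 series but becomes the bottleneck for large data sets.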

Python Example (using scikit-learn)

Python Example (using scikit-learn)
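The scikit-learn code from these slides is missing from the transcript. One plausible way to get the same k-th nearest neighbour score with scikit-learn is `NearestNeighbors`; whether the lecture used this class is an assumption, as are the stand-in data and k = 5.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in data (assumption): 50 "normal" rows and 5 rows with larger spread
rng = np.random.default_rng(0)
data = rng.normal(size=(55, 60))
data[50:] = rng.normal(scale=5.0, size=(5, 60))

k = 5
# Ask for k + 1 neighbours because each query point is returned as its own
# nearest neighbour (distance 0) when the training data is queried.
nn = NearestNeighbors(n_neighbors=k + 1, metric="euclidean").fit(data)
dist, _ = nn.kneighbors(data)

knn_dist = dist[:, -1]                   # distance to the k-th neighbour
top5 = np.argsort(knn_dist)[::-1][:5]    # five most anomalous rows
print(top5)
```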

Model-based Approach Fit a model to the data; most models tend to fit the general characteristics of the data. Apply the model to each data instance. The more anomalous a data instance is, the easier it is for the model to isolate that instance.

Isolation Forest Model: Decision Tree. Question: If a data point is an anomaly, what will likely be its position in the tree? [Figure: decision tree whose root tests X < t1] For outliers, the path length from the root node tends to be small.

Isolation Forest Outliers are easier to isolate from the rest of the data; they tend to reside at shallower depths of the tree

Isolation Forest Approach: Repeat Randomly sample a subset of the data Build a tree from random sample Each tree is generated by randomly choosing a splitting attribute and the split point Tree is grown either until maxdepth is reached, only 1 point remains, or all attributes have the same values
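The effect of random splitting on path length can be seen in a small sketch. It is a simplification, not the full algorithm: it skips the subsampling step, only follows the branch containing one chosen point instead of building whole trees, and picks the split attribute among those that still vary. The function name, `max_depth=20` and the data are assumptions.

```python
import numpy as np

def isolation_depth(X, idx, rng, max_depth=20):
    """Depth at which row idx is isolated by random axis-parallel splits."""
    rows = np.arange(len(X))          # points sharing a node with row idx
    depth = 0
    while len(rows) > 1 and depth < max_depth:
        sub = X[rows]
        lo, hi = sub.min(axis=0), sub.max(axis=0)
        usable = np.flatnonzero(hi > lo)
        if usable.size == 0:          # all attributes have the same values
            break
        attr = rng.choice(usable)                  # random splitting attribute
        split = rng.uniform(lo[attr], hi[attr])    # random split point
        go_left = X[idx, attr] < split
        # Keep only the points that fall on the same side as row idx
        rows = rows[(sub[:, attr] < split) == go_left]
        depth += 1
    return depth

# Hypothetical data: a 2-D Gaussian cluster plus one far-away point (row 50)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[10.0, 10.0]]])
central = int(np.argmin(np.linalg.norm(X[:50], axis=1)))   # most central inlier

def mean_depth(idx, n_trees=200):
    """Average isolation depth of row idx over repeated random trees."""
    return float(np.mean([isolation_depth(X, idx, rng) for _ in range(n_trees)]))

outlier_depth = mean_depth(50)
inlier_depth = mean_depth(central)
print(outlier_depth, inlier_depth)
```

Averaging over many random trees is what makes the depth a usable score; a single tree can isolate any point early by chance.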

Python Example For the synthetic control data. n_estimators: number of trees to generate (default: 100). max_samples: maximum size of the random sample used to build each tree. contamination: proportion of samples to flag as outliers.
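A minimal sketch of these parameters with scikit-learn's `IsolationForest` follows. The slide's actual code and the synthetic control data are not in the transcript, so the stand-in data, the parameter values and `random_state=0` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Stand-in data (assumption): 50 "normal" rows and 5 rows with larger spread
rng = np.random.default_rng(0)
data = rng.normal(size=(55, 60))
data[50:] = rng.normal(scale=5.0, size=(5, 60))

forest = IsolationForest(
    n_estimators=100,        # number of trees
    max_samples=55,          # sample size drawn for each tree
    contamination=5 / 55,    # expected proportion of outliers
    random_state=0,
)
labels = forest.fit_predict(data)       # -1 = outlier, +1 = inlier
scores = -forest.score_samples(data)    # negated so larger = more anomalous

print(np.flatnonzero(labels == -1))     # rows flagged as outliers
```

`contamination` only sets the threshold on the scores; it does not change how the trees are built, so the ranking given by `scores` is the same for any value.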

Evaluation of Anomaly Detection Methods Need ground truth labels (anomaly or normal) Compare the prediction of anomaly detection methods against the ground truth Similar to external measures for cluster validation

Evaluation of Anomaly Detection Methods If output is binary-valued: anomaly (+), normal (-)
| PREDICTED Anomaly (+) | PREDICTED Normal (-)
ACTUAL Anomaly (+) | TP | FN
ACTUAL Normal (-) | FP | TN
TP: True Positive, FN: False Negative, FP: False Positive, TN: True Negative
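The four counts can be computed directly from ground-truth and predicted labels. In this sketch 1 stands for anomaly (+) and 0 for normal (-); the label vectors are invented, and precision and recall are standard derived measures added here for illustration rather than taken from the slide.

```python
import numpy as np

def confusion_counts(actual, predicted):
    """Return (TP, FN, FP, TN) treating anomaly as the positive class."""
    actual = np.asarray(actual, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    tp = int(np.sum(actual & predicted))     # anomalies correctly flagged
    fn = int(np.sum(actual & ~predicted))    # anomalies that were missed
    fp = int(np.sum(~actual & predicted))    # normal points wrongly flagged
    tn = int(np.sum(~actual & ~predicted))   # normal points left alone
    return tp, fn, fp, tn

# Hypothetical ground truth and detector output for eight instances
actual = [1, 0, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 0, 0, 1, 0]

tp, fn, fp, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)   # fraction of flagged points that are anomalies
recall = tp / (tp + fn)      # fraction of anomalies that were flagged
```

Because anomalies are rare, plain accuracy is dominated by TN and says little; measures built on TP, FP and FN are usually more informative.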

Summary This lecture: Anomaly detection problem; Techniques for anomaly detection; Python examples. Next lecture: Collaborative filtering