Lecture 14: Anomaly Detection


1 Lecture 14: Anomaly Detection
CSE 482: Big Data Analysis Lecture 14: Anomaly Detection

2 Problem Definition
Given a collection of data instances, each instance is characterized by an attribute set x. Task: Find a subset of instances whose characteristics are considerably different from the remainder of the data. The problem is also known as outlier, deviation, or novelty detection.

3 Importance of Detecting Anomalies
Ozone Depletion History. In 1985, three researchers (Farman, Gardiner, and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels. Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? The ozone concentrations recorded by the satellite were so low that they were being treated as noise by a computer program and discarded! Sources:

4 Example Applications
Applying anomaly detection to detect deforestation using remote sensing data. Brazil accounts for almost 50% of all humid tropical forest clearing, nearly 4 times that of the next highest country, which accounts for 12.8% of the total. [Figure: Amazon rainforest]

5 Example Applications
Intelligent transportation system: congestion detection. Smart home/building: water theft detection, pipe burst detection.

6 Other Example Applications
Instance | Attribute set, x | Anomaly detection task
Credit card transaction | Item purchased, amount, location, time, credit limit, balance, etc. | Finding fraudulent transactions
Network traffic flow | Source and destination IP, port numbers, # bytes, etc. | Identifying malware and other malicious activities
Component to be tested | Sensor measurements | Detecting failures in components

7 Challenges in Anomaly Detection
Finding a needle in a haystack: anomalies are rare compared to other observations. The number of anomalies is usually unknown. The method is unsupervised. Validation is challenging (just like for clustering).

8 Output of Anomaly Detection
Continuous-valued output: Every data instance is assigned an anomaly score. Given a database D, find all instances having the top-k largest anomaly/outlier scores, where k is a user-specified parameter. Binary-valued output: A threshold is needed to convert the anomaly score into a binary label (anomaly or normal).
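The two output styles can be sketched in a few lines of Python. The function names and the example scores are hypothetical, and the sketch assumes that a larger score means a more anomalous instance and that a score strictly above the threshold counts as an anomaly.

```python
import heapq

def top_k_anomalies(scores, k):
    """Return the indices of the k largest anomaly scores, highest first."""
    return heapq.nlargest(k, range(len(scores)), key=lambda i: scores[i])

def to_binary_labels(scores, threshold):
    """Label an instance 'anomaly' when its score exceeds the threshold."""
    return ["anomaly" if s > threshold else "normal" for s in scores]

# Made-up anomaly scores for five instances.
scores = [0.2, 3.1, 0.5, 2.7, 0.1]
top2 = top_k_anomalies(scores, k=2)
labels = to_binary_labels(scores, threshold=1.0)
print(top2)
print(labels)
```

The top-k form needs only a ranking, while the binary form depends on a threshold that usually has to be chosen by the user.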

9 Basic Strategy in Anomaly Detection
Assumption: there are more “normal” than “anomalous” instances in the given data. General approach: Build a profile of the “normal” behavior. A profile is a set of patterns or summary statistics characterizing the overall population. Use the “normal” profile to flag the anomalies. Anomalies are observations whose characteristics differ significantly from the normal profile.

10 Graphical Approach (1-D Boxplot)
[Figure: boxplot labeled with an outlier and the 10th, 25th, 50th, 75th, and 90th percentiles]
Also known as a box-and-whisker plot. The inter-quartile range is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile). This allows us to find outliers.
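One way to turn the inter-quartile range into an outlier test is sketched below. The 1.5 × IQR fences are a common convention and an assumption here; the slide itself does not state a cutoff. The function name and the data are made up for illustration.

```python
import statistics

def iqr_outliers(values, whisker=1.5):
    """Flag values outside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up 1-D sample with one value far from the rest.
data = [10, 12, 11, 13, 12, 11, 14, 12, 13, 40]
flagged = iqr_outliers(data)
print(flagged)
```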

11 Graphical Approach (2-D Scatter Plot)

12 Z-Score Approach
Assume the data follows a Gaussian distribution. Outlier score for a data point x: [Equation: outlier score formula], where μ is the mean and σ or Σ is a measure of dispersion (standard deviation or covariance matrix).
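For the one-dimensional case, the usual form of the score is z = |x − μ| / σ. The sketch below assumes that form, uses the population standard deviation, and flags points with z > 2; the cutoff, the function name, and the data are illustrative assumptions, not values from the lecture.

```python
import statistics

def z_scores(values):
    """Absolute z-score of each value: |x - mean| / standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)  # population standard deviation
    return [abs(v - mu) / sigma for v in values]

# Made-up measurements; the last one is far from the others.
data = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 9.0]
scores = z_scores(data)

# Assumed cutoff of 2 standard deviations.
flagged = [i for i, z in enumerate(scores) if z > 2.0]
print(flagged)
```

Note that the mean and standard deviation are themselves pulled toward the outlier, which limits how large a z-score can get in a small sample.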

13 Distance-Based Approach
Input: data (the set of data points), k (number of nearest neighbors). Approach: Compute the distance between every pair of data points. The anomaly score of a data point is given by its distance to its k-th nearest neighbor. The larger the distance, the more anomalous the data point.
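The approach above can be sketched with NumPy alone. Euclidean distance is an assumption (the slide does not name a metric), and the function name and sample points are hypothetical.

```python
import numpy as np

def knn_distance_scores(data, k):
    """Anomaly score = distance from each point to its k-th nearest neighbor."""
    data = np.asarray(data, dtype=float)
    # Pairwise Euclidean distances via broadcasting: result has shape (N, N).
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # Sort each row; column 0 is the distance of a point to itself (0),
    # so column k holds the distance to the k-th nearest neighbor.
    sorted_dist = np.sort(dist, axis=1)
    return sorted_dist[:, k]

# Four points in a tight cluster and one far away.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
scores = knn_distance_scores(points, k=2)
most_anomalous = int(np.argmax(scores))
print(most_anomalous)
```

Computing all pairwise distances costs memory and time quadratic in the number of points, which is fine for small data sets such as the one used in the following slides.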

14 Python Example
Synthetic control sample data: 55 time series of length 60 each. [Figure: plots of time series #6, #11, #46, and #51, with an outlier label]

15 Python Example

16 Python Example
[Figure: the first 50 time series; the last 5 time series]

17 Python Example
[Figure: input data (N × d) → distance matrix (N × N) → sorted distance for each row (N × N) → outlier score (the k-th smallest distance)]

18 Python Example
[Figure: input data (55 × 60) → pdist → condensed distances of length (55 × 54)/2 → squareform → distance matrix Y (55 × 55)]
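The shapes in this step can be checked with SciPy's pdist and squareform. The sketch below uses random placeholder data with the same 55 × 60 shape, since the synthetic control file itself is not part of this transcript; the Euclidean metric is also an assumption.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Placeholder for the 55 time series of length 60 (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(55, 60))

# pdist returns one distance per unordered pair: 55 * 54 / 2 values.
condensed = pdist(X, metric="euclidean")

# squareform expands the condensed vector into a symmetric 55 x 55 matrix.
Y = squareform(condensed)

print(condensed.shape, Y.shape)
```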

19 Python Example
[Figure: distance matrix Y (N × N) → knnDist (length N) → outliers]

20 Python Example
The sort and argsort functions order knnDist by increasing value (sort returns the sorted values, argsort the corresponding indices). flipud flips the array upside down (i.e., the last row becomes the first row, the second-to-last row becomes the second row, etc.). column_stack merges 2 columns.
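A small sketch of how these four NumPy functions combine into a ranking, most anomalous first. The knnDist values are made up; only the variable name comes from the slides.

```python
import numpy as np

# Made-up k-th nearest neighbor distances for four data points.
knnDist = np.array([0.4, 2.5, 0.3, 1.7])

order = np.argsort(knnDist)      # indices that sort knnDist ascending
sorted_dist = np.sort(knnDist)   # the distances themselves, ascending

# Pair each index with its distance, then flip so the largest comes first.
ranked = np.flipud(np.column_stack((order, sorted_dist)))
print(ranked)
```

Because column_stack builds a single array, the integer indices are stored as floats alongside the distances.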

21 Python Example (using scikit-learn)

22 Python Example (using scikit-learn)

23 Model-based Approach
A model-based approach for anomaly detection: Fit a model to the data. Most models tend to fit the general characteristics of the data. Apply the model to each data instance. The more anomalous the data instance, the easier it is to isolate the instance from the model.

24 Isolation Forest
Model: decision tree. Question: If a data point is an anomaly, what will likely be its position in the tree? [Figure: decision tree whose root node tests X < t1] For outliers, the path lengths from the root node tend to be short.

25 Isolation Forest
Outliers are easier to isolate from the rest of the data; they tend to reside at shallower depths of the tree.

26 Isolation Forest
Approach: Repeat the following steps. Randomly sample a subset of the data. Build a tree from the random sample. Each tree is generated by randomly choosing a splitting attribute and the split point. The tree is grown until maxdepth is reached, only 1 point remains, or all attributes have the same values.
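The procedure can be sketched in pure Python. This is a simplified illustration, not the scikit-learn implementation: it grows only the branch that contains the query point, picks the split attribute among those that still vary in the node, and reports the raw average path length rather than a normalized anomaly score. All names and the grid data are hypothetical.

```python
import random

def path_length(point, sample, rng, depth=0, max_depth=10):
    """Depth at which `point` is isolated in one random tree grown on `sample`."""
    # Stop when maxdepth is reached or at most one point remains.
    if depth >= max_depth or len(sample) <= 1:
        return depth
    # Attributes that still take more than one value in this node.
    splittable = [j for j in range(len(point))
                  if min(row[j] for row in sample) < max(row[j] for row in sample)]
    if not splittable:  # all attributes have the same values
        return depth
    j = rng.choice(splittable)                 # random splitting attribute
    lo = min(row[j] for row in sample)
    hi = max(row[j] for row in sample)
    t = rng.uniform(lo, hi)                    # random split point
    # Follow only the branch that the query point falls into.
    if point[j] < t:
        branch = [row for row in sample if row[j] < t]
    else:
        branch = [row for row in sample if row[j] >= t]
    return path_length(point, branch, rng, depth + 1, max_depth)

def average_path_length(point, data, n_trees=100, sample_size=64, seed=0):
    """Average isolation depth of `point` over trees built on random subsets."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, min(sample_size, len(data)))
        total += path_length(point, sample, rng)
    return total / n_trees

# A 10 x 10 grid of "normal" points plus one point far away.
data = [(x * 0.1, y * 0.1) for x in range(10) for y in range(10)] + [(10.0, 10.0)]
outlier_depth = average_path_length((10.0, 10.0), data)
inlier_depth = average_path_length((0.5, 0.5), data)
print(outlier_depth, inlier_depth)
```

The far-away point should come out with a clearly smaller average depth than the point in the middle of the grid, which matches the intuition on the previous slides.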

27 Python Example
For the synthetic control data. n_estimators: number of trees to generate (default: 100). max_samples: maximum size of the random sample used to build each tree. contamination: proportion of samples to flag as outliers.
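A minimal sketch of these parameters with scikit-learn's IsolationForest. The data is a random placeholder (50 series around 0 and 5 shifted series around 8) with the same 55 × 60 shape as the synthetic control sample, and the parameter values are illustrative assumptions rather than the ones used in the lecture.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Placeholder data: 50 "normal" rows and 5 shifted rows, each of length 60.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 60)),
               rng.normal(8.0, 1.0, size=(5, 60))])

model = IsolationForest(
    n_estimators=100,        # number of trees
    max_samples="auto",      # sample size per tree: min(256, number of rows)
    contamination=5 / 55,    # expected proportion of outliers
    random_state=0,
)
labels = model.fit_predict(X)        # -1 for outliers, +1 for inliers
scores = model.decision_function(X)  # lower values are more anomalous

print(np.where(labels == -1)[0])
```

The contamination value only sets the cutoff applied to the scores; the ranking of instances by score does not depend on it.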

28 Evaluation of Anomaly Detection Methods
Need ground truth labels (anomaly or normal). Compare the predictions of anomaly detection methods against the ground truth. Similar to external measures for cluster validation.

29 Evaluation of Anomaly Detection Methods
If output is binary-valued: anomaly (+), normal (-)

                        PREDICTED
                        Anomaly (+)   Normal (-)
ACTUAL   Anomaly (+)    TP            FN
         Normal (-)     FP            TN

TP: True Positive, FN: False Negative, FP: False Positive, TN: True Negative
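The four cells can be counted directly from paired label lists, as in this sketch. The function name and the example labels are made up for illustration.

```python
def confusion_counts(actual, predicted, positive="anomaly"):
    """Count TP, FN, FP, TN, treating `positive` as the anomaly class."""
    counts = {"TP": 0, "FN": 0, "FP": 0, "TN": 0}
    for a, p in zip(actual, predicted):
        if a == positive:
            # Actual anomaly: either caught (TP) or missed (FN).
            counts["TP" if p == positive else "FN"] += 1
        else:
            # Actual normal: either falsely flagged (FP) or correctly passed (TN).
            counts["FP" if p == positive else "TN"] += 1
    return counts

# Made-up ground truth and predictions for six instances.
actual = ["anomaly", "normal", "normal", "anomaly", "normal", "normal"]
predicted = ["anomaly", "normal", "anomaly", "normal", "normal", "normal"]
result = confusion_counts(actual, predicted)
print(result)
```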

30 Summary
This lecture: anomaly detection problem, techniques for anomaly detection, Python examples. Next lecture: collaborative filtering.

