CSE 482: Big Data Analysis Lecture 14: Anomaly Detection
Problem Definition Given: a collection of data instances, where each instance is characterized by an attribute set x. Task: Find a subset of instances whose characteristics are considerably different from the remainder of the data. The problem is also known as outlier, deviation, or novelty detection.
Importance of Detecting Anomalies: Ozone Depletion History. In 1985, three researchers (Farman, Gardiner and Shanklin) were puzzled by data gathered by the British Antarctic Survey showing that ozone levels for Antarctica had dropped 10% below normal levels. Why did the Nimbus 7 satellite, which had instruments aboard for recording ozone levels, not record similarly low ozone concentrations? The ozone concentrations recorded by the satellite were so low that they were being treated as noise by a computer program and discarded! Sources: http://exploringdata.cqu.edu.au/ozone.html http://www.epa.gov/ozone/science/hole/size.html
Example Applications Applying anomaly detection to detect deforestation using remote sensing data. Brazil accounts for almost 50% of all humid tropical forest clearing, nearly 4 times that of the next highest country, which accounts for 12.8% of the total. [Figure: Amazon rainforest]
Example Applications Intelligent transportation system: congestion detection. Smart home/building: water theft detection, pipe burst detection.
Other Example Applications
Instance | Attribute set, x | Anomaly detection task
Credit card transaction | Item purchased, amount, location, time, credit limit, balance, etc. | Finding fraudulent transactions
Network traffic flow | Source and destination IP, port numbers, # bytes, etc. | Identifying malware and other malicious activities
Component to be tested | Sensor measurements | Detecting failures in components
Challenges in Anomaly Detection Finding a needle in a haystack: anomalies are rare compared to other observations. The number of anomalies is usually unknown. The method is unsupervised. Validation is challenging (just like for clustering).
Output of Anomaly Detection Continuous-valued output Every data instance is assigned an anomaly score Given a database D, find all instances having the top-k largest anomaly/outlier scores, where k is a user-specified parameter Binary-valued output A threshold is needed to convert the anomaly score into a binary label - anomaly or normal
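The two output types can be illustrated with a minimal sketch. The scores and the threshold of 1.0 below are made-up values, and the helper names `top_k_anomalies` and `binarize` are hypothetical, chosen only for this illustration.

```python
import numpy as np

def top_k_anomalies(scores, k):
    """Return the indices of the k largest anomaly scores, highest first."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]  # indices sorted by decreasing score
    return order[:k]

def binarize(scores, threshold):
    """Convert continuous scores to binary labels: True means anomaly."""
    scores = np.asarray(scores, dtype=float)
    return scores > threshold

# Made-up anomaly scores for six instances
scores = np.array([0.2, 3.1, 0.4, 0.1, 2.5, 0.3])

top2 = top_k_anomalies(scores, k=2)      # continuous output, top-k query
labels = binarize(scores, threshold=1.0)  # binary output via a threshold
```

Here instances 1 and 4 have the two largest scores, and they are also the only ones above the threshold. In practice, choosing k or the threshold is the hard part, since the number of anomalies is usually unknown.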
Basic Strategy in Anomaly Detection Assumption: there are more “normal” than “anomalous” instances in the given data General Approach Build a profile of the “normal” behavior A profile is a set of patterns or summary statistics characterizing the overall population Use the “normal” profile to flag the anomalies Anomalies are observations whose characteristics differ significantly from the normal profile
Graphical Approach (1-D Boxplot) [Figure: boxplot marking the 10th, 25th, 50th, 75th and 90th percentiles, with one point labeled as an outlier] Also known as a box-and-whisker plot. The inter-quartile range is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile). Points that lie beyond the whiskers are flagged as outliers.
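A boxplot-style outlier check can also be done numerically. The sketch below assumes the common convention of flagging points more than 1.5 inter-quartile ranges beyond the quartiles; this differs from the 10th/90th percentile whiskers in the slide's figure, and the 1.5 multiplier, the function name `iqr_outliers`, and the data are all assumptions for illustration.

```python
import numpy as np

def iqr_outliers(values, whisker=1.5):
    """Flag values lying more than whisker * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # 1st and 3rd quartiles
    iqr = q3 - q1                             # inter-quartile range
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return (values < lower) | (values > upper)

# Made-up 1-D data with one obviously large value
data = np.array([10, 12, 11, 13, 12, 11, 10, 12, 40])
mask = iqr_outliers(data)
flagged = data[mask]
```

For this data the quartiles are 11 and 12, so the fences fall at 9.5 and 13.5 and only the value 40 is flagged.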
Graphical Approach (2-D Scatter Plot)
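Z-Score Approach Assume the data follows a Gaussian distribution. Outlier score for a data point x: $z(x) = |x - \mu| / \sigma$ for a single attribute, or $z(\mathbf{x}) = \sqrt{(\mathbf{x} - \boldsymbol{\mu})^{\top} \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$ for multiple attributes, where $\mu$ is the mean and $\sigma$ or $\Sigma$ is a measure of dispersion (standard deviation or covariance matrix).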
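A minimal sketch of both variants follows. It assumes the sample standard deviation and sample covariance are used as the dispersion estimates, and it uses the Mahalanobis distance as the multivariate score; the function names and the data are made up for illustration.

```python
import numpy as np

def z_scores(values):
    """Absolute z-score of each value in a 1-D sample."""
    values = np.asarray(values, dtype=float)
    mu = values.mean()
    sigma = values.std(ddof=1)  # sample standard deviation (an assumption)
    return np.abs(values - mu) / sigma

def mahalanobis_scores(X):
    """Distance of each row from the mean, scaled by the covariance matrix."""
    X = np.asarray(X, dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    # Quadratic form diff_i^T * inv_cov * diff_i for every row i
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, inv_cov, diff))

# 1-D example: the last value is far from the rest
data = np.array([5.0, 5.2, 4.9, 5.1, 5.0, 4.8, 5.1, 9.0])
scores = z_scores(data)
most_anomalous = int(np.argmax(scores))

# 2-D example: 50 Gaussian points plus one distant point appended last
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [[8.0, 8.0]]])
most_anomalous_2d = int(np.argmax(mahalanobis_scores(X)))
```

Note that the mean and dispersion are estimated from data that includes the anomalies, so extreme points inflate the dispersion and can partly mask themselves.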
Distance-Based Approach Input: data: the set of data points; k: number of nearest neighbors. Approach: Compute the distance between every pair of data points. The anomaly score of a data point is given by its distance to its k-th nearest neighbor. The larger the distance, the more anomalous the data point.
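The approach above can be sketched in a few lines of NumPy. Euclidean distance is an assumption here (the slide does not name a metric), and `knn_distance_scores` and the sample points are hypothetical.

```python
import numpy as np

def knn_distance_scores(X, k):
    """Anomaly score of each row: distance to its k-th nearest neighbor."""
    X = np.asarray(X, dtype=float)
    # Pairwise differences via broadcasting, shape (N, N, d)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # Euclidean distance matrix
    sorted_dist = np.sort(dist, axis=1)      # sort each row in increasing order
    # Column 0 is the distance of a point to itself (0),
    # so column k holds the distance to the k-th nearest neighbor.
    return sorted_dist[:, k]

# Four tightly packed points and one point far away
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]])
scores = knn_distance_scores(points, k=2)
most_anomalous = int(np.argmax(scores))
```

Computing all pairwise distances costs on the order of N squared in both time and memory, which is fine for small data but becomes the bottleneck for large collections.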
Python Example Synthetic control sample data: 55 time series of length 60 each. [Figure: example time series #6, #11, #46 and #51, with an outlier labeled]
Python Example
Python Example [Figure panels: the first 50 time series; the last 5 time series]
Python Example Input data (N × d) → distance matrix (N × N) → sorted distances for each row (N × N) → outlier score for each of the N instances, given by the k-th smallest distance in its row.
Python Example Input data (55 × 60) → pdist → condensed distance vector of length (55 × 54)/2 = 1485 → squareform → distance matrix Y (55 × 55)
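The shapes in this step can be checked with SciPy's `pdist` and `squareform`. The sketch below uses random numbers as a stand-in for the 55 × 60 synthetic control data, which is not reproduced here, and assumes the Euclidean metric.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Random stand-in for the data: 55 time series of length 60
rng = np.random.default_rng(0)
data = rng.normal(size=(55, 60))

# pdist returns one distance per unordered pair of rows (condensed form)
condensed = pdist(data, metric="euclidean")

# squareform expands the condensed vector into a symmetric 55 x 55 matrix
Y = squareform(condensed)
```

The condensed vector has 55 × 54 / 2 = 1485 entries because each pair is stored once; `squareform` mirrors them across a zero diagonal.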
Python Example Distance matrix Y (N × N) → knnDist (one value for each of the N instances) → outliers
Python Example The sort function sorts the knnDist values in increasing order, and argsort returns the indices that produce that order. flipud “flips” the array upside down (i.e., the last row becomes the first row, the second-last row becomes the second row, etc.). column_stack merges 2 columns into a single 2-column array.
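How these four functions combine into a ranking can be sketched as follows. The `knnDist` values are made up, since the slide's original code and data are not reproduced here.

```python
import numpy as np

# Made-up k-th nearest neighbor distances for five instances
knnDist = np.array([0.8, 2.9, 0.5, 3.4, 0.7])

sorted_scores = np.sort(knnDist)    # scores in increasing order
sorted_index = np.argsort(knnDist)  # instance indices in that same order

# Pair each index with its score, then flip so the largest score comes first
ranking = np.flipud(np.column_stack((sorted_index, sorted_scores)))

ranked_instances = ranking[:, 0].astype(int)
```

Because `column_stack` builds a single array, the integer indices are stored as floats alongside the scores, hence the cast back to `int`. The first row of `ranking` is the most anomalous instance (index 3 here).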
Python Example (using scikit-learn)
Python Example (using scikit-learn)
Model-based Approach A model-based approach for anomaly detection: Fit a model to the data; most models tend to fit the general characteristics of the data. Apply the model to each data instance. The more anomalous a data instance is, the easier it is to isolate that instance from the model.
Isolation Forest Model: Decision Tree. Question: If a data point is an anomaly, what will likely be its position in the tree? [Figure: decision tree whose root node tests X < t1] Outliers tend to have short path lengths from the root node.
Isolation Forest Outliers are easier to isolate from the rest of the data; they tend to reside at shallower depths of the tree
Isolation Forest Approach: Repeat: randomly sample a subset of the data, then build a tree from the random sample. Each tree is generated by randomly choosing a splitting attribute and a split point. A tree is grown until maxdepth is reached, only 1 point remains, or all remaining points have the same attribute values.
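The procedure can be sketched from scratch to show why outliers end up with short paths. This is a simplified illustration, not a full isolation forest: it follows only the branch that contains the query point instead of storing whole trees, it picks the splitting attribute only among attributes that still vary, and it reports the raw average depth rather than a normalized score. All names, defaults, and data are assumptions.

```python
import numpy as np

def path_length(x, X, rng, depth=0, max_depth=10):
    """Number of random splits needed to isolate x within the sample X."""
    # Stop when the depth limit is reached or at most one point remains
    if depth >= max_depth or len(X) <= 1:
        return depth
    lo, hi = X.min(axis=0), X.max(axis=0)
    splittable = np.flatnonzero(hi > lo)
    if splittable.size == 0:  # all remaining points are identical
        return depth
    attr = splittable[rng.integers(splittable.size)]  # random attribute
    split = rng.uniform(lo[attr], hi[attr])           # random split point
    on_left = X[:, attr] < split
    branch = X[on_left] if x[attr] < split else X[~on_left]
    return path_length(x, branch, rng, depth + 1, max_depth)

def average_path_length(x, X, n_trees=50, sample_size=64, seed=0):
    """Average isolation depth of x over trees grown on random subsamples."""
    rng = np.random.default_rng(seed)
    size = min(sample_size, len(X))
    lengths = []
    for _ in range(n_trees):
        idx = rng.choice(len(X), size=size, replace=False)
        lengths.append(path_length(x, X[idx], rng))
    return float(np.mean(lengths))

# 100 Gaussian points plus one distant point appended last
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

outlier_len = average_path_length(X[-1], X)
typical_len = float(np.mean([average_path_length(x, X) for x in X[:20]]))
```

The distant point is separated from the cluster by the first split or two in most trees, so its average path is noticeably shorter than that of typical points, which need many splits before they are alone.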
Python Example For the synthetic control data: n_estimators: number of trees to generate (default: 100); max_samples: maximum size of the random sample used to build each tree; contamination: proportion of samples to flag as outliers (a fraction, not a percentage)
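A minimal sketch of these parameters with scikit-learn's `IsolationForest` follows. The data are a random stand-in with the same shape as the synthetic control sample (50 ordinary rows followed by 5 shifted rows), not the actual data set, and the contamination value 5/55 assumes we want about 5 of the 55 rows flagged.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Random stand-in: 50 ordinary series followed by 5 shifted series
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(50, 60))
shifted = rng.normal(loc=6.0, scale=1.0, size=(5, 60))
data = np.vstack([normal, shifted])

model = IsolationForest(
    n_estimators=100,      # number of trees
    max_samples="auto",    # sample size per tree: min(256, n_samples)
    contamination=5 / 55,  # expected proportion of outliers
    random_state=0,
)
labels = model.fit_predict(data)     # -1 for outliers, +1 for inliers
scores = -model.score_samples(data)  # negated so that higher = more anomalous
```

`score_samples` returns values where lower means more abnormal, so the sign is flipped here to match the convention used earlier in the lecture (larger score, more anomalous).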
Evaluation of Anomaly Detection Methods Need ground truth labels (anomaly or normal) Compare the prediction of anomaly detection methods against the ground truth Similar to external measures for cluster validation
Evaluation of Anomaly Detection Methods If the output is binary-valued: anomaly (+), normal (-)
 | Predicted anomaly (+) | Predicted normal (-)
Actual anomaly (+) | TP | FN
Actual normal (-) | FP | TN
TP: True Positive; FN: False Negative; FP: False Positive; TN: True Negative
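The four counts can be computed directly from ground truth and predictions. The labels below are made up, with 1 for anomaly and 0 for normal; `confusion_counts` is a hypothetical helper. Precision and recall are not on the slide and are added here only as two common summaries of the counts.

```python
import numpy as np

def confusion_counts(actual, predicted):
    """Return (TP, FN, FP, TN), treating anomaly as the positive class."""
    actual = np.asarray(actual, dtype=bool)
    predicted = np.asarray(predicted, dtype=bool)
    tp = int(np.sum(actual & predicted))    # anomaly, flagged
    fn = int(np.sum(actual & ~predicted))   # anomaly, missed
    fp = int(np.sum(~actual & predicted))   # normal, wrongly flagged
    tn = int(np.sum(~actual & ~predicted))  # normal, left alone
    return tp, fn, fp, tn

# Made-up ground truth and predictions (1 = anomaly, 0 = normal)
actual = [1, 1, 0, 0, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 0, 1, 0]

tp, fn, fp, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)  # fraction of flagged instances that are anomalies
recall = tp / (tp + fn)     # fraction of anomalies that were flagged
```

Since anomalies are rare, plain accuracy is dominated by TN and says little; measures built from TP, FN and FP are usually more informative.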
Summary This lecture: Anomaly detection problem; Techniques for anomaly detection; Python examples. Next lecture: Collaborative filtering