Presentation transcript:

Anomaly Detection. Jia-Bin Huang, Virginia Tech. ECE-5424G / CS-5824, Spring 2019

Administrative

Anomaly Detection Motivation Developing an anomaly detection system Anomaly detection vs. supervised learning Choosing what features to use Multivariate Gaussian distribution

Anomaly detection example
Aircraft engine features: $x_1$ = heat generated, $x_2$ = vibration intensity.
Dataset: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$ of normal engines.
New engine: $x_{\text{test}}$.
(Scatter plot of the dataset over $x_1$ heat vs. $x_2$ vibration.)

Density estimation
Dataset: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$. Is $x_{\text{test}}$ anomalous?
Model $p(x)$ from the data: if $p(x_{\text{test}}) < \epsilon$, flag an anomaly; if $p(x_{\text{test}}) \ge \epsilon$, OK.
(Plot: density contours over $x_1$ heat vs. $x_2$ vibration.)

Anomaly detection examples
Fraud detection: $x^{(i)}$ = features of user $i$'s activities. Model $p(x)$ from the data; identify unusual users by checking which have $p(x) < \epsilon$.
Manufacturing.
Monitoring computers in a data center: $x^{(i)}$ = features of machine $i$, e.g. $x_1$ = memory use, $x_2$ = number of disk accesses/sec, $x_3$ = CPU load, $x_4$ = CPU load / network traffic.

Anomaly Detection Motivation Developing an anomaly detection system Anomaly detection vs. supervised learning Choosing what features to use Multivariate Gaussian distribution

Gaussian (normal) distribution
Say $x \in \mathbb{R}$. If $x$ is distributed Gaussian with mean $\mu$ and variance $\sigma^2$, we write $x \sim \mathcal{N}(\mu, \sigma^2)$; $\sigma$ is the standard deviation.
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
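As a quick illustration, here is a minimal NumPy sketch of this density (the function name is just for this example, not from the lecture):

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    """Univariate Gaussian density p(x; mu, sigma^2)."""
    return np.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

# Example: density of x = 1.5 under N(mu = 0, sigma^2 = 1)
print(gaussian_pdf(1.5, mu=0.0, sigma2=1.0))   # ~0.1295
```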

Gaussian distribution examples

Parameter estimation
Dataset: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, with each $x \sim \mathcal{N}(\mu, \sigma^2)$.
Maximum likelihood estimates:
$$\hat{\mu} = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \hat{\sigma}^2 = \frac{1}{m}\sum_{i=1}^{m} \big(x^{(i)} - \hat{\mu}\big)^2$$
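A one-line NumPy sketch of these maximum likelihood estimates on a toy dataset (note the 1/m normalization, matching the slide, rather than 1/(m-1)):

```python
import numpy as np

x = np.array([4.9, 5.1, 5.0, 4.8, 5.2])   # toy 1-D dataset

mu_hat = x.mean()                          # (1/m) * sum of x^(i)
sigma2_hat = np.mean((x - mu_hat) ** 2)    # (1/m) * sum of squared deviations

print(mu_hat, sigma2_hat)                  # approximately 5.0 and 0.02
```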

Density estimation
Dataset: $\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$, each example $x \in \mathbb{R}^n$.
$$p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$

Anomaly detection algorithm
1. Choose features $x_j$ that you think might be indicative of anomalous examples.
2. Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$:
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \big(x_j^{(i)} - \mu_j\big)^2$$
3. Given a new example $x$, compute $p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$. Flag an anomaly if $p(x) < \epsilon$.
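A minimal NumPy sketch of this algorithm (function and variable names are illustrative, not from the lecture); the product of many small densities is computed in log space to avoid underflow:

```python
import numpy as np

def fit(X):
    """Fit per-feature Gaussians. X has shape (m, n): m examples, n features."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)          # NumPy's var uses 1/m by default, as in the slides
    return mu, sigma2

def log_p(X, mu, sigma2):
    """log p(x) = sum_j log p(x_j; mu_j, sigma_j^2) for each row of X."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2)
                  - (X - mu) ** 2 / (2 * sigma2), axis=1)

# Usage sketch: flag examples whose density falls below epsilon
X_train = np.random.randn(1000, 2)           # stand-in for normal engine data
mu, sigma2 = fit(X_train)
X_new = np.array([[0.1, -0.2], [6.0, 6.0]])
epsilon = 1e-4
anomalies = np.exp(log_p(X_new, mu, sigma2)) < epsilon
print(anomalies)                              # expected: [False  True]
```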

Evaluation
Assume we have some labeled data of anomalous and non-anomalous examples ($y = 0$ if normal, $y = 1$ if anomalous).
Training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$ (assumed to be normal examples).
Cross-validation set: $(x_{cv}^{(1)}, y_{cv}^{(1)}), (x_{cv}^{(2)}, y_{cv}^{(2)}), \dots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$.
Test set: $(x_{test}^{(1)}, y_{test}^{(1)}), (x_{test}^{(2)}, y_{test}^{(2)}), \dots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$.

Aircraft engines motivating example
10,000 good (normal) engines; 20 flawed engines (anomalous).
Training set: 6,000 good engines.
CV set: 2,000 good engines ($y = 0$), 10 anomalous ($y = 1$).
Test set: 2,000 good engines ($y = 0$), 10 anomalous ($y = 1$).

Algorithm evaluation
Fit the model $p(x)$ on the training set $\{x^{(1)}, \dots, x^{(m)}\}$.
On a cross-validation/test example $x$, predict $y = 1$ if $p(x) < \epsilon$ (anomaly), $y = 0$ if $p(x) \ge \epsilon$ (normal).
Possible evaluation metrics: true positives, false positives, false negatives, true negatives; precision/recall; F1-score.
Use the cross-validation set to choose the parameter $\epsilon$.

Evaluation metric
How about accuracy? Suppose only 0.1% of the engines are anomalous (skewed classes). A classifier that declares every example normal achieves 99.9% accuracy, so accuracy is not an informative metric here.

Precision/Recall
Precision $P = \frac{TP}{TP + FP}$, recall $R = \frac{TP}{TP + FN}$.
F1 score: $F_1 = \frac{2PR}{P + R}$.
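To make the "choose $\epsilon$ on the cross-validation set" step concrete, here is a small sketch (assuming the `log_p` function from the earlier sketch and hypothetical arrays `X_cv`, `y_cv`); it scans candidate thresholds and keeps the one with the best F1 score:

```python
import numpy as np
from sklearn.metrics import f1_score   # or compute F1 = 2PR/(P+R) by hand

def select_epsilon(p_cv, y_cv):
    """Pick the threshold on p(x) that maximizes F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = (p_cv < eps).astype(int)        # y = 1 means "anomaly"
        f1 = f1_score(y_cv, preds, zero_division=0)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Usage sketch:
# p_cv = np.exp(log_p(X_cv, mu, sigma2))
# epsilon, f1 = select_epsilon(p_cv, y_cv)
```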

Anomaly Detection Motivation Developing an anomaly detection system Anomaly detection vs. supervised learning Choosing what features to use Multivariate Gaussian distribution

Anomaly detection vs. supervised learning
Anomaly detection: very small number of positive examples ($y = 1$; 0-20 is common); large number of negative ($y = 0$) examples; many different types of anomalies, so it is hard for any algorithm to learn from the positive examples what anomalies look like; future anomalies may look nothing like any of the anomalous examples seen so far.
Supervised learning: large number of both positive and negative examples; enough positive examples for the algorithm to get a sense of what positives are like, and future positive examples are likely to be similar to those in the training set.

Anomaly detection: fraud detection; manufacturing; monitoring machines in a data center.
Supervised learning: email spam classification; weather prediction; cancer classification.

Anomaly Detection Motivation Developing an anomaly detection system Anomaly detection vs. supervised learning Choosing what features to use Multivariate Gaussian distribution

Non-Gaussian features
If a feature's histogram does not look Gaussian, transform it (e.g. replace $x$ with $\log x$) so that the transformed feature is closer to Gaussian before fitting the model.
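A small sketch of this kind of transformation (the synthetic data and the choice of transforms are illustrative; `log1p` guards against zero values):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.random.exponential(scale=2.0, size=5000)   # heavily skewed, non-Gaussian feature

x_log = np.log1p(x)        # log(1 + x): compresses the long right tail
x_root = x ** 0.5          # another common choice; pick whichever looks most Gaussian

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, data, title in zip(axes, [x, x_log, x_root], ["raw", "log1p", "sqrt"]):
    ax.hist(data, bins=50)
    ax.set_title(title)
plt.show()
```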

Error analysis for anomaly detection
We want $p(x)$ large for normal examples $x$ and $p(x)$ small for anomalous examples $x$.
Most common problem: $p(x)$ is comparable (say, both large) for normal and anomalous examples. In that case, inspect the missed anomalies and try to come up with new features that distinguish them.

Monitoring computers in a data center
Choose features that might take on unusually large or small values in the event of an anomaly:
$x_1$ = memory use of computer
$x_2$ = number of disk accesses/sec
$x_3$ = CPU load
$x_4$ = network traffic
$x_5$ = CPU load / network traffic
$x_6$ = (CPU load)$^2$ / network traffic
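A small sketch of creating such ratio features from a table of machine metrics (the column names and values are made up for this example):

```python
import pandas as pd

# Hypothetical per-machine metrics
df = pd.DataFrame({
    "memory_use": [0.41, 0.39, 0.95],
    "disk_accesses_per_sec": [120, 110, 3000],
    "cpu_load": [0.30, 0.35, 0.99],
    "network_traffic": [0.50, 0.45, 0.02],
})

# Ratio features: unusually high CPU load with low network traffic stands out
df["cpu_over_network"] = df["cpu_load"] / df["network_traffic"]
df["cpu_sq_over_network"] = df["cpu_load"] ** 2 / df["network_traffic"]
print(df)
```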

Anomaly Detection Motivation Developing an anomaly detection system Anomaly detection vs. supervised learning Choosing what features to use Multivariate Gaussian distribution

Motivating example: monitoring machines in a data center
(Scatter plots of $x_1$ (CPU load) vs. $x_2$ (memory use).)

Multivariate Gaussian (normal) distribution
$x \in \mathbb{R}^n$. Don't model $p(x_1), p(x_2), \dots$ separately; model $p(x)$ all in one go.
Parameters: $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n \times n}$ (covariance matrix).
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$
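A minimal NumPy sketch of this density (the function name is illustrative; SciPy's `multivariate_normal` gives the same result and is shown as a check):

```python
import numpy as np
from scipy.stats import multivariate_normal

def multivariate_gaussian(x, mu, Sigma):
    """p(x; mu, Sigma) for x, mu in R^n and Sigma an n x n covariance matrix."""
    n = mu.shape[0]
    diff = x - mu
    norm_const = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.8], [0.8, 1.0]])
x = np.array([0.5, 0.4])

print(multivariate_gaussian(x, mu, Sigma))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```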

Multivariate Gaussian (normal) examples
$\Sigma = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}$, $\Sigma = \begin{bmatrix}0.6 & 0\\ 0 & 0.6\end{bmatrix}$, $\Sigma = \begin{bmatrix}2 & 0\\ 0 & 2\end{bmatrix}$
(Contour plots over $x_1$, $x_2$: equal diagonal entries give circular contours; smaller variance shrinks them, larger variance widens them.)

Multivariate Gaussian (normal) examples
$\Sigma = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}$, $\Sigma = \begin{bmatrix}0.6 & 0\\ 0 & 1\end{bmatrix}$, $\Sigma = \begin{bmatrix}2 & 0\\ 0 & 1\end{bmatrix}$
(Contour plots over $x_1$, $x_2$: changing only the variance of $x_1$ compresses or stretches the contours along the $x_1$ axis.)

Multivariate Gaussian (normal) examples
$\Sigma = \begin{bmatrix}1 & 0\\ 0 & 1\end{bmatrix}$, $\Sigma = \begin{bmatrix}1 & 0.5\\ 0.5 & 1\end{bmatrix}$, $\Sigma = \begin{bmatrix}1 & 0.8\\ 0.8 & 1\end{bmatrix}$
(Contour plots over $x_1$, $x_2$: positive off-diagonal entries tilt the contours along the $x_1 = x_2$ direction, more strongly as the correlation grows.)

Anomaly detection using the multivariate Gaussian distribution
1. Fit the model $p(x)$ by setting
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu)(x^{(i)} - \mu)^\top$$
2. Given a new example $x$, compute
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$$
3. Flag an anomaly if $p(x) < \epsilon$.
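Putting the two steps together, a minimal sketch (reusing the `multivariate_gaussian` function from the earlier sketch; the data, names, and epsilon are illustrative):

```python
import numpy as np

def fit_multivariate(X):
    """Estimate mu and Sigma from X of shape (m, n)."""
    mu = X.mean(axis=0)
    diff = X - mu
    Sigma = diff.T @ diff / X.shape[0]     # (1/m) * sum of outer products
    return mu, Sigma

# Usage sketch
rng = np.random.default_rng(0)
X_train = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], size=2000)
mu, Sigma = fit_multivariate(X_train)

epsilon = 1e-3
x_new = np.array([2.0, -2.0])              # unusual combination of the two features
p = multivariate_gaussian(x_new, mu, Sigma)
print(p, p < epsilon)                       # expected: flagged as an anomaly
```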

Original model vs. multivariate Gaussian
Original model: $p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$. Requires manually creating features to capture anomalies where $x_1, x_2$ take unusual combinations of values. Computationally cheaper (scales better to large $n$). OK even if the training set size is small.
Multivariate Gaussian: $p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\right)$. Automatically captures correlations between features. Computationally more expensive. Must have $m > n$, or else $\Sigma$ is non-invertible.

Things to remember Motivation Developing an anomaly detection system Anomaly detection vs. supervised learning Choosing what features to use Multivariate Gaussian distribution