
1 Anomaly Detection
Jia-Bin Huang, Virginia Tech
ECE-5424G / CS-5824, Spring 2019

2 Administrative

3 Anomaly Detection (outline)
- Motivation
- Developing an anomaly detection system
- Anomaly detection vs. supervised learning
- Choosing what features to use
- Multivariate Gaussian distribution

4 Anomaly Detection (outline): Motivation · Developing an anomaly detection system · Anomaly detection vs. supervised learning · Choosing what features to use · Multivariate Gaussian distribution

5 Anomaly detection example
Aircraft engine features: $x_1$ = heat generated, $x_2$ = vibration intensity.
Dataset: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$. New engine: $x_{\text{test}}$.
[Figure: engines plotted in the $x_1$ (heat) vs. $x_2$ (vibration) plane]

6 Density estimation
Dataset: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$. Is $x_{\text{test}}$ anomalous?
Model $p(x)$ from the data:
- $p(x_{\text{test}}) < \epsilon$ → flag as an anomaly
- $p(x_{\text{test}}) \ge \epsilon$ → OK
[Figure: contours of $p(x)$ over $x_1$ (heat) and $x_2$ (vibration)]

7 Anomaly detection examples
- Fraud detection: $x^{(i)}$ = features of user $i$'s activities. Model $p(x)$ from the data and identify unusual users by checking which have $p(x) < \epsilon$.
- Manufacturing.
- Monitoring computers in a data center: $x^{(i)}$ = features of machine $i$, e.g. $x_1$ = memory use, $x_2$ = number of disk accesses/sec, $x_3$ = CPU load, $x_4$ = CPU load / network traffic.

8 Anomaly Detection (outline): Motivation · Developing an anomaly detection system · Anomaly detection vs. supervised learning · Choosing what features to use · Multivariate Gaussian distribution

9 Gaussian (normal) distribution
Say $x \in \mathbb{R}$. If $x$ is Gaussian-distributed with mean $\mu$ and variance $\sigma^2$, we write $x \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma$ is the standard deviation.
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

10 Gaussian distribution examples

11 Parameter estimation
Dataset: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, with $x \sim \mathcal{N}(\mu, \sigma^2)$.
Maximum likelihood estimates:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$

12 Density estimation
Dataset: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, where each example $x \in \mathbb{R}^n$.
$$p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$
(The features are modeled as independent, each with its own Gaussian.)

13 Anomaly detection algorithm
1. Choose features $x_j$ that you think might be indicative of anomalous examples.
2. Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$:
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$$
3. Given a new example $x$, compute
$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$
Flag an anomaly if $p(x) < \epsilon$.
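Putting the three steps together, here is a minimal NumPy sketch of the algorithm (function names, the toy data, and the value of epsilon are mine; in practice epsilon is chosen on a cross-validation set, as the evaluation slides below describe):

```python
import numpy as np

def fit_gaussians(X):
    """X: (m, n) matrix of normal training examples. Returns per-feature mu, sigma^2."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)   # 1/m normalization, matching the slides
    return mu, sigma2

def p(x, mu, sigma2):
    """p(x) = prod_j p(x_j; mu_j, sigma_j^2) for a single example x of shape (n,)."""
    densities = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod()

# Toy usage
X_train = np.random.randn(500, 2) * [1.0, 0.5] + [20.0, 5.0]
mu, sigma2 = fit_gaussians(X_train)
epsilon = 1e-3
x_new = np.array([25.0, 5.0])
print("anomaly" if p(x_new, mu, sigma2) < epsilon else "OK")
```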

14 Evaluation
Assume we have some labeled data of anomalous and non-anomalous examples ($y = 0$ if normal, $y = 1$ if anomalous).
- Training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$ (assumed to be normal examples)
- Cross-validation set: $(x_{cv}^{(1)}, y_{cv}^{(1)}), (x_{cv}^{(2)}, y_{cv}^{(2)}), \dots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$
- Test set: $(x_{test}^{(1)}, y_{test}^{(1)}), (x_{test}^{(2)}, y_{test}^{(2)}), \dots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$

15 Aircraft engines: motivating example
10,000 good (normal) engines; 20 flawed (anomalous) engines.
- Training set: 6,000 good engines
- Cross-validation set: 2,000 good engines (y = 0), 10 anomalous (y = 1)
- Test set: 2,000 good engines (y = 0), 10 anomalous (y = 1)

16 Algorithm evaluation
Fit the model $p(x)$ on the training set $\{x^{(1)}, \dots, x^{(m)}\}$.
On a cross-validation/test example $x$, predict $y = 1$ if $p(x) < \epsilon$ (anomaly) and $y = 0$ if $p(x) \ge \epsilon$ (normal).
Possible evaluation metrics: true positives, false positives, false negatives, true negatives; precision/recall; F1 score.
The cross-validation set can be used to choose the parameter $\epsilon$.
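One way to implement the last point is to sweep candidate values of epsilon over the cross-validation densities and keep the one with the best F1 score. A sketch, assuming p_cv holds p(x) for each CV example and y_cv holds the 0/1 labels:

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan thresholds between min and max of p_cv; return the epsilon with the best F1."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)          # 1 = flagged as anomaly
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```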

17 Evaluation metric
What about accuracy? Assume only 0.1% of the engines are anomalous (skewed classes). Declaring every example normal then gives 99.9% accuracy!

18 Precision/Recall
Precision $P = \frac{TP}{TP + FP}$, recall $R = \frac{TP}{TP + FN}$.
$$F_1 = \frac{2PR}{P + R}$$

19 Anomaly Detection (outline): Motivation · Developing an anomaly detection system · Anomaly detection vs. supervised learning · Choosing what features to use · Multivariate Gaussian distribution

20 Anomaly detection vs. supervised learning
Anomaly detection:
- Very small number of positive examples (y = 1); 0-20 is common
- Large number of negative (y = 0) examples
- Many different types of anomalies; hard for any algorithm to learn from the positive examples what anomalies look like
- Future anomalies may look nothing like any of the anomalous examples seen so far
Supervised learning:
- Large numbers of both positive and negative examples
- Enough positive examples for the algorithm to get a sense of what positives are like; future positive examples are likely to be similar to those in the training set

21 Anomaly detection vs. supervised learning: typical applications
- Anomaly detection: fraud detection, manufacturing, monitoring machines in a data center
- Supervised learning: spam classification, weather prediction, cancer classification

22 Anomaly Detection (outline): Motivation · Developing an anomaly detection system · Anomaly detection vs. supervised learning · Choosing what features to use · Multivariate Gaussian distribution

23 Non-Gaussian features
If a feature is not Gaussian-shaped, transform it to make it more Gaussian before fitting, e.g. replace $x$ with $\log x$ (or a similar transform such as $\log(x + c)$ or $\sqrt{x}$).

24 Error analysis for anomaly detection
We want $p(x)$ to be large for normal examples $x$ and small for anomalous examples $x$.
Most common problem: $p(x)$ is comparable (say, both large) for normal and anomalous examples. In that case, examine the missed anomaly and look for a new feature that would distinguish it.

25 Monitoring computers in a data center
Choose features that might take on unusually large or small values in the event of an anomaly:
$x_1$ = memory use of the computer
$x_2$ = number of disk accesses/sec
$x_3$ = CPU load
$x_4$ = network traffic
$x_5$ = CPU load / network traffic (or, e.g., (CPU load)$^2$ / network traffic)
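A sketch of constructing these combination features from raw per-machine measurements (all values and array names below are hypothetical):

```python
import numpy as np

# Raw per-machine measurements (hypothetical values)
memory_use  = np.array([0.61, 0.55, 0.93])
disk_access = np.array([120.0, 98.0, 45.0])
cpu_load    = np.array([0.40, 0.35, 0.99])
net_traffic = np.array([0.50, 0.45, 0.02])

# Combination features that become unusually large when the CPU is busy
# but the machine is not serving traffic (e.g. stuck in an infinite loop)
x5 = cpu_load / net_traffic
x6 = cpu_load ** 2 / net_traffic

X = np.column_stack([memory_use, disk_access, cpu_load, net_traffic, x5, x6])
```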

26 Anomaly Detection (outline): Motivation · Developing an anomaly detection system · Anomaly detection vs. supervised learning · Choosing what features to use · Multivariate Gaussian distribution

27 Motivating example: Monitoring machines in a data center
π‘₯ 2 (Memory use) π‘₯ 1 (CPU load) π‘₯ 2 (Memory use) π‘₯ 1 (CPU load)

28 Multivariate Gaussian (normal) distribution
$x \in \mathbb{R}^n$. Don't model $p(x_1), p(x_2), \dots$ separately; model $p(x)$ all in one go.
Parameters: $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n \times n}$ (covariance matrix).
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$

29 Multivariate Gaussian (normal) examples
[Figure: contour plots of $p(x; \mu, \Sigma)$ over $(x_1, x_2)$ for three choices of $\Sigma$]

30 Multivariate Gaussian (normal) examples
[Figure: contour plots of $p(x; \mu, \Sigma)$ over $(x_1, x_2)$ for three choices of $\Sigma$]

31 Multivariate Gaussian (normal) examples
[Figure: contour plots of $p(x; \mu, \Sigma)$ over $(x_1, x_2)$ for three choices of $\Sigma$]

32 Anomaly detection using the multivariate Gaussian distribution
1. Fit the model $p(x)$ by setting
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu)(x^{(i)} - \mu)^\top$$
2. Given a new example $x$, compute
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
Flag an anomaly if $p(x) < \epsilon$.

33 Original model vs. multivariate Gaussian
Original model: $p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$
- Must manually create features to capture anomalies where $x_1, x_2$ take on unusual combinations of values
- Computationally cheaper (equivalently, scales better to large $n$)
- OK even if the training set size is small
Multivariate Gaussian model: $p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$
- Automatically captures correlations between features
- Computationally more expensive
- Must have $m > n$, or else $\Sigma$ is non-invertible

34 Things to remember
- Motivation
- Developing an anomaly detection system
- Anomaly detection vs. supervised learning
- Choosing what features to use
- Multivariate Gaussian distribution

