
1 Anomaly Detection
Jia-Bin Huang, Virginia Tech
ECE-5424G / CS-5824, Spring 2019

2 Administrative

3 Anomaly Detection (outline)
- Motivation
- Developing an anomaly detection system
- Anomaly detection vs. supervised learning
- Choosing what features to use
- Multivariate Gaussian distribution

4 Anomaly Detection (outline): Motivation · Developing an anomaly detection system · Anomaly detection vs. supervised learning · Choosing what features to use · Multivariate Gaussian distribution

5 Anomaly detection example
Aircraft engine features: $x_1$ = heat generated, $x_2$ = vibration intensity.
Dataset: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$. New engine: $x_{\text{test}}$.
[Figure: engines plotted in the $x_1$ (heat) vs. $x_2$ (vibration) plane]

6 Density estimation
Dataset: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$. Is $x_{\text{test}}$ anomalous?
Model $p(x)$ from the data:
- $p(x_{\text{test}}) < \epsilon$ → flag as an anomaly
- $p(x_{\text{test}}) \ge \epsilon$ → OK
[Figure: contours of $p(x)$ over $x_1$ (heat) and $x_2$ (vibration)]

7 Anomaly detection examples
- Fraud detection: $x^{(i)}$ = features of user $i$'s activities. Model $p(x)$ from the data and identify unusual users by checking which have $p(x) < \epsilon$.
- Manufacturing.
- Monitoring computers in a data center: $x^{(i)}$ = features of machine $i$, e.g. $x_1$ = memory use, $x_2$ = number of disk accesses/sec, $x_3$ = CPU load, $x_4$ = CPU load / network traffic.

8 Anomaly Detection (outline): Motivation · Developing an anomaly detection system · Anomaly detection vs. supervised learning · Choosing what features to use · Multivariate Gaussian distribution

9 Gaussian (normal) distribution
Say $x \in \mathbb{R}$. If $x$ is Gaussian-distributed with mean $\mu$ and variance $\sigma^2$, we write $x \sim \mathcal{N}(\mu, \sigma^2)$, where $\sigma$ is the standard deviation.
$$p(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

10 Gaussian distribution examples

11 Parameter estimation
Dataset: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, with $x \sim \mathcal{N}(\mu, \sigma^2)$.
Maximum likelihood estimates:
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \sigma^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x^{(i)} - \mu\right)^2$$

12 Density estimation
Dataset: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, where each example $x \in \mathbb{R}^n$.
$$p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$
(The features are modeled as independent, each with its own Gaussian.)

13 Anomaly detection algorithm
1. Choose features $x_j$ that you think might be indicative of anomalous examples.
2. Fit parameters $\mu_1, \dots, \mu_n, \sigma_1^2, \dots, \sigma_n^2$:
$$\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m} \left(x_j^{(i)} - \mu_j\right)^2$$
3. Given a new example $x$, compute
$$p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2)$$
Flag an anomaly if $p(x) < \epsilon$.
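Putting the three steps together, here is a minimal NumPy sketch of the algorithm (function names, the toy data, and the value of epsilon are mine; in practice epsilon is chosen on a cross-validation set, as the evaluation slides below describe):

```python
import numpy as np

def fit_gaussians(X):
    """X: (m, n) matrix of normal training examples. Returns per-feature mu, sigma^2."""
    mu = X.mean(axis=0)
    sigma2 = X.var(axis=0)   # 1/m normalization, matching the slides
    return mu, sigma2

def p(x, mu, sigma2):
    """p(x) = prod_j p(x_j; mu_j, sigma_j^2) for a single example x of shape (n,)."""
    densities = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    return densities.prod()

# Toy usage
X_train = np.random.randn(500, 2) * [1.0, 0.5] + [20.0, 5.0]
mu, sigma2 = fit_gaussians(X_train)
epsilon = 1e-3
x_new = np.array([25.0, 5.0])
print("anomaly" if p(x_new, mu, sigma2) < epsilon else "OK")
```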

14 Evaluation
Assume we have some labeled data of anomalous and non-anomalous examples ($y = 0$ if normal, $y = 1$ if anomalous).
- Training set: $x^{(1)}, x^{(2)}, \dots, x^{(m)}$ (assumed to be normal examples)
- Cross-validation set: $(x_{cv}^{(1)}, y_{cv}^{(1)}), (x_{cv}^{(2)}, y_{cv}^{(2)}), \dots, (x_{cv}^{(m_{cv})}, y_{cv}^{(m_{cv})})$
- Test set: $(x_{test}^{(1)}, y_{test}^{(1)}), (x_{test}^{(2)}, y_{test}^{(2)}), \dots, (x_{test}^{(m_{test})}, y_{test}^{(m_{test})})$

15 Aircraft engines: motivating example
10,000 good (normal) engines; 20 flawed (anomalous) engines.
- Training set: 6,000 good engines
- Cross-validation set: 2,000 good engines (y = 0), 10 anomalous (y = 1)
- Test set: 2,000 good engines (y = 0), 10 anomalous (y = 1)

16 Algorithm evaluation
Fit the model $p(x)$ on the training set $\{x^{(1)}, \dots, x^{(m)}\}$.
On a cross-validation/test example $x$, predict $y = 1$ if $p(x) < \epsilon$ (anomaly) and $y = 0$ if $p(x) \ge \epsilon$ (normal).
Possible evaluation metrics: true positives, false positives, false negatives, true negatives; precision/recall; F1 score.
The cross-validation set can be used to choose the parameter $\epsilon$.
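One way to implement the last point is to sweep candidate values of epsilon over the cross-validation densities and keep the one with the best F1 score. A sketch, assuming p_cv holds p(x) for each CV example and y_cv holds the 0/1 labels:

```python
import numpy as np

def select_epsilon(p_cv, y_cv):
    """Scan thresholds between min and max of p_cv; return the epsilon with the best F1."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        pred = (p_cv < eps).astype(int)          # 1 = flagged as anomaly
        tp = np.sum((pred == 1) & (y_cv == 1))
        fp = np.sum((pred == 1) & (y_cv == 0))
        fn = np.sum((pred == 0) & (y_cv == 1))
        if tp == 0:
            continue
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```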

17 Evaluation metric
What about accuracy? Assume only 0.1% of the engines are anomalous (skewed classes). Declaring every example normal then gives 99.9% accuracy!

18 Precision/Recall
Precision $P = \frac{TP}{TP + FP}$, recall $R = \frac{TP}{TP + FN}$.
$$F_1 = \frac{2PR}{P + R}$$

19 Anomaly Detection (outline): Motivation · Developing an anomaly detection system · Anomaly detection vs. supervised learning · Choosing what features to use · Multivariate Gaussian distribution

20 Anomaly detection vs. supervised learning
Anomaly detection:
- Very small number of positive examples (y = 1); 0-20 is common
- Large number of negative (y = 0) examples
- Many different types of anomalies; hard for any algorithm to learn from the positive examples what anomalies look like
- Future anomalies may look nothing like any of the anomalous examples seen so far
Supervised learning:
- Large numbers of both positive and negative examples
- Enough positive examples for the algorithm to get a sense of what positives are like; future positive examples are likely to be similar to those in the training set

21 Anomaly detection vs. supervised learning: typical applications
- Anomaly detection: fraud detection, manufacturing, monitoring machines in a data center
- Supervised learning: spam classification, weather prediction, cancer classification

22 Anomaly Detection (outline): Motivation · Developing an anomaly detection system · Anomaly detection vs. supervised learning · Choosing what features to use · Multivariate Gaussian distribution

23 Non-Gaussian features
If a feature is not Gaussian-shaped, transform it to make it more Gaussian before fitting, e.g. replace $x$ with $\log x$ (or a similar transform such as $\log(x + c)$ or $\sqrt{x}$).

24 Error analysis for anomaly detection
We want $p(x)$ to be large for normal examples $x$ and small for anomalous examples $x$.
Most common problem: $p(x)$ is comparable (say, both large) for normal and anomalous examples. In that case, examine the missed anomaly and look for a new feature that would distinguish it.

25 Monitoring computers in a data center
Choose features that might take on unusually large or small values in the event of an anomaly:
$x_1$ = memory use of the computer
$x_2$ = number of disk accesses/sec
$x_3$ = CPU load
$x_4$ = network traffic
$x_5$ = CPU load / network traffic (or, e.g., (CPU load)$^2$ / network traffic)
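A sketch of constructing these combination features from raw per-machine measurements (all values and array names below are hypothetical):

```python
import numpy as np

# Raw per-machine measurements (hypothetical values)
memory_use  = np.array([0.61, 0.55, 0.93])
disk_access = np.array([120.0, 98.0, 45.0])
cpu_load    = np.array([0.40, 0.35, 0.99])
net_traffic = np.array([0.50, 0.45, 0.02])

# Combination features that become unusually large when the CPU is busy
# but the machine is not serving traffic (e.g. stuck in an infinite loop)
x5 = cpu_load / net_traffic
x6 = cpu_load ** 2 / net_traffic

X = np.column_stack([memory_use, disk_access, cpu_load, net_traffic, x5, x6])
```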

26 Anomaly Detection (outline): Motivation · Developing an anomaly detection system · Anomaly detection vs. supervised learning · Choosing what features to use · Multivariate Gaussian distribution

27 Motivating example: Monitoring machines in a data center
π‘₯ 2 (Memory use) π‘₯ 1 (CPU load) π‘₯ 2 (Memory use) π‘₯ 1 (CPU load)

28 Multivariate Gaussian (normal) distribution
$x \in \mathbb{R}^n$. Don't model $p(x_1), p(x_2), \dots$ separately; model $p(x)$ all in one go.
Parameters: $\mu \in \mathbb{R}^n$, $\Sigma \in \mathbb{R}^{n \times n}$ (covariance matrix).
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$

29 Multivariate Gaussian (normal) examples
[Figure: contour plots of $p(x; \mu, \Sigma)$ over $(x_1, x_2)$ for three choices of $\Sigma$]

30 Multivariate Gaussian (normal) examples
[Figure: contour plots of $p(x; \mu, \Sigma)$ over $(x_1, x_2)$ for three choices of $\Sigma$]

31 Multivariate Gaussian (normal) examples
[Figure: contour plots of $p(x; \mu, \Sigma)$ over $(x_1, x_2)$ for three choices of $\Sigma$]

32 Anomaly detection using the multivariate Gaussian distribution
1. Fit the model $p(x)$ by setting
$$\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad \Sigma = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu)(x^{(i)} - \mu)^\top$$
2. Given a new example $x$, compute
$$p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$$
Flag an anomaly if $p(x) < \epsilon$.

33 Original model vs. multivariate Gaussian
Original model: $p(x) = p(x_1; \mu_1, \sigma_1^2)\, p(x_2; \mu_2, \sigma_2^2) \cdots p(x_n; \mu_n, \sigma_n^2)$
- Must manually create features to capture anomalies where $x_1, x_2$ take on unusual combinations of values
- Computationally cheaper (equivalently, scales better to large $n$)
- OK even if the training set size is small
Multivariate Gaussian model: $p(x; \mu, \Sigma) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right)$
- Automatically captures correlations between features
- Computationally more expensive
- Must have $m > n$, or else $\Sigma$ is non-invertible

34 Things to remember
- Motivation
- Developing an anomaly detection system
- Anomaly detection vs. supervised learning
- Choosing what features to use
- Multivariate Gaussian distribution

