Communication-Efficient Online Detection of Network-Wide Anomalies Ling Huang* XuanLong Nguyen* Minos Garofalakis § Joe Hellerstein* Michael Jordan* Anthony.

Presentation transcript:

Communication-Efficient Online Detection of Network-Wide Anomalies
Ling Huang*, XuanLong Nguyen*, Minos Garofalakis§, Joe Hellerstein*, Michael Jordan*, Anthony Joseph*, Nina Taft§
*UC Berkeley, §Intel Research
Presented by Orna Agmon Ben-Yehuda
Coming in Spring 2011 to a Seminar on Processing and Mining Distributed Data near you

Huang et al., presented by Agmon Ben-Yehuda
Network-Wide Anomalies
Are bad:
- Router misconfigurations
- Border Gateway Protocol (BGP) policy modifications
- Device failures
Or even malicious:
- DDoS attacks
- Viruses, spam sending
- Port scanning
But also just unpredictable:
- Flash crowds (mob), supercomputing
(Lakhina et al.)

Detection Problems in Enterprise Network
Do machines in my network participate in a botnet to attack other machines?
[Figure: a command-and-control host tells machines to send at rate 0.9 × allowed to victim V; an operation center performs coordinated detection.]
Data: byte rate on a network link.
Can't afford to do the detection continuously!
For efficient and scalable detection, push data processing to the edge of the network!
Approximate but accurate detection!

We shall talk about:
- Lakhina et al.'s centralized algorithm
- Decentralized anomaly detection
- Slack determination
- Evaluation
- Open discussion

Towards Decentralized Detection
Lakhina et al.: distributed monitoring and centralized computation
- Stream-based data collection
- Periodically evaluate the detection function over the collected data
- Doesn't scale well in network size, timescale, or detection delay
Huang et al.: decentralized detection
- Continuously evaluate the detection function in a decentralized way
- Low overhead, rapid response, accurate, and scalable
- Detection accuracy controllable by a "tuning knob": a provable guarantee on detection error (false alarm rate) and a flexible tradeoff between overhead and accuracy

Detection of Network-wide Anomalies
A volume anomaly is a sudden change in an origin-destination flow (i.e., point-to-point traffic).
Given link traffic measurements, detect the volume anomalies.
[Figure: hosts H1 and H2 in regional networks 1 and 2, connected through the backbone network.]

Flow vs. Link (Lakhina et al.)
- Observed network link data = aggregate of application-level flows
- Each link is a dimension
- Anomalies are in the (unobserved) flow data
- Finding anomalies in high-dimensional, noisy data is difficult!

Principal Component Analysis (PCA)
[Figure: traffic vector y plotted as traffic on link 1 vs. traffic on link 2, with the principal components and the minor components marked.]
- Anomalous traffic usually results in a large value of the projection onto the minor components.
- Principal components are the top eigenvectors of the covariance matrix. They are also the directions of maximal variance.
- They form the subspace projection matrices C_no and C_ab.

The Subspace Method (Lakhina'04)
An approach to separate normal from anomalous traffic based on Principal Component Analysis (PCA).
- Normal subspace: the space spanned by the top k principal components
- Anomalous subspace: the space spanned by the remaining components
Then decompose the traffic on all links by projecting onto the normal and anomalous subspaces to obtain:
traffic vector of all links at a particular point in time = normal traffic vector + residual traffic vector
[3] Lakhina et al. Diagnosing Network-Wide Traffic Anomalies. In ACM SIGCOMM (2004).
[4] Zhang et al. Network Anomography. In IMC (2005).
Speaker notes: a centralized approach based on PCA. Lakhina et al. (2004) project the high-dimensional data down to a low dimension and look at the residual traffic. For details, see their paper.
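
The decomposition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a two-link toy data set and a normal subspace of dimension k = 1, and it swaps in power iteration for a full eigendecomposition. The function names and the sample values are made up for the example.

```python
import math

def covariance(rows):
    """Column means and sample covariance of an m x n data matrix (list of rows)."""
    m, n = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / m for j in range(n)]
    centered = [[r[j] - mean[j] for j in range(n)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (m - 1) for j in range(n)]
           for i in range(n)]
    return mean, cov

def top_component(cov, iters=200):
    """Power iteration: dominant eigenvector of a symmetric matrix.
    Assumes the start vector is not orthogonal to that eigenvector."""
    n = len(cov)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def residual_energy(y, mean, v):
    """Squared norm of the part of (y - mean) outside the normal subspace span{v}."""
    c = [a - b for a, b in zip(y, mean)]
    proj = sum(a * b for a, b in zip(c, v))
    resid = [a - proj * b for a, b in zip(c, v)]
    return sum(r * r for r in resid)

# Hypothetical toy data: two links whose traffic rises and falls together.
Y = [[10.0, 20.1], [12.0, 23.9], [14.0, 28.2], [16.0, 31.8], [18.0, 36.0]]
mean, cov = covariance(Y)
v = top_component(cov)

normal_residual = residual_energy([15.0, 30.0], mean, v)     # follows the usual pattern
anomalous_residual = residual_energy([15.0, 45.0], mean, v)  # link 2 spikes alone
```

The point that breaks the usual correlation between the two links leaves a much larger residual than the point that follows it, which is the quantity the detector thresholds.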

Link Traffic Variance of Principal Components (Lakhina et al.)
Link matrices have low dimensionality.

Projections onto Principal Components: normal and abnormal traffic variation (Lakhina et al.)

Detection Illustration (Lakhina et al.)
[Figure: two time series, one for all traffic and one for the squared projection error. Blue curve: traffic data. Red dots: anomalies.]
No detail, high-level intuition.
- 1-α = [value lost in extraction]: that percentage of alarms are real, but more anomalies go undetected.
- 1-α = 99.5%: only 99.5% of alarms are real, but many anomalies are detected.
- This axis is the squared projection error, NOT the false alarm rate!
- This small spike is not an anomaly we wished to detect.

Detection Threshold (Lakhina et al.)
The detection threshold is a threshold on the Squared Projection Error (SPE). It guarantees a false alarm rate of less than α.
Jackson & Mudholkar computed the threshold based on the abnormal eigenvalues of the covariance matrix. It holds:
- No matter where the distinction is made (how many components are considered normal)
- No matter what the mean amount of traffic is
- For a multivariate Gaussian distribution only
Jensen & Solomon: in practice, it holds for different distributions.
Lakhina et al. believe traffic is multivariate Gaussian, but have not verified this.
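
The slide's own formula is not preserved in the transcript. As a hedged sketch, the commonly cited form of the Jackson & Mudholkar threshold (often called the Q-statistic) can be computed from the eigenvalues of the abnormal subspace as below. The function name and the example eigenvalues are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def q_threshold(residual_eigenvalues, alpha):
    """SPE threshold at confidence level 1 - alpha, computed from the
    eigenvalues that belong to the abnormal (residual) subspace."""
    phi1 = sum(lam for lam in residual_eigenvalues)
    phi2 = sum(lam ** 2 for lam in residual_eigenvalues)
    phi3 = sum(lam ** 3 for lam in residual_eigenvalues)
    h0 = 1.0 - 2.0 * phi1 * phi3 / (3.0 * phi2 ** 2)
    # (1 - alpha) percentile of the standard normal distribution
    c_alpha = NormalDist().inv_cdf(1.0 - alpha)
    term = (c_alpha * math.sqrt(2.0 * phi2 * h0 ** 2) / phi1
            + 1.0
            + phi2 * h0 * (h0 - 1.0) / phi1 ** 2)
    return phi1 * term ** (1.0 / h0)

# Hypothetical residual eigenvalues, for illustration only.
eigs = [4.0, 2.0, 1.0, 0.5]
strict = q_threshold(eigs, alpha=0.005)  # few false alarms, higher threshold
loose = q_threshold(eigs, alpha=0.05)    # more false alarms, lower threshold
```

A smaller α pushes the threshold up, which matches the tradeoff shown on the detection illustration slide: fewer false alarms, more missed anomalies.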

The Centralized Algorithm
[Figure: the network's links send Y_1(t), Y_2(t), Y_3(t), ..., Y_n(t) to the operation center periodically. The operation center holds the m × n data matrix Y (m: timestep, n: node ID) and computes its eigenvalues and eigenvectors.]
1) Each link produces a column of m data points over time.
2) The n links produce a row of data y at each time instance.
Detection by Squared Prediction Error (SPE).
Does not scale well to large networks or to small timescales.

Huang et al.: In-Network Detection Framework
[Figure: distributed monitors filter the original monitored time series data_1(t), data_2(t), ..., data_n(t) and send filtered_data_1(t), filtered_data_2(t), ..., filtered_data_n(t) to the coordinator. The coordinator runs PCA-based detection and raises alarms; perturbation analysis adjusts the filter parameters.]
User input: required detection error α+μ.

The Protocol At Monitors
- A prediction model is known to both the monitor and the coordinator. It can be any prediction model built on historical data.
- Monitor i updates the coordinator if its measurement deviates from the prediction by more than its filtering parameter (slack).
- Example of a model: the average of the last 5 communicated signal values.
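
A minimal sketch of the monitor side, using the example model from the slide (average of the last 5 communicated values). The class name, the symbol `delta` for the slack, the rule of always sending the first reading, and the sample readings are assumptions made for this illustration.

```python
from collections import deque

class Monitor:
    """Local filter: send a reading only when the shared prediction misses it
    by more than the slack delta."""

    def __init__(self, delta, window=5):
        self.delta = delta
        self.sent = deque(maxlen=window)  # last values communicated to the coordinator

    def predict(self):
        # Prediction model: average of the last `window` communicated values.
        return sum(self.sent) / len(self.sent) if self.sent else None

    def observe(self, value):
        """Return the value if an update must be sent, otherwise None."""
        prediction = self.predict()
        if prediction is None or abs(value - prediction) > self.delta:
            self.sent.append(value)
            return value
        return None

monitor = Monitor(delta=2.0)
readings = [10.0, 10.5, 11.0, 9.5, 15.0, 15.5, 10.0]
updates = [monitor.observe(r) for r in readings]
```

Because the model depends only on communicated values, the coordinator can reproduce every prediction without extra messages.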

The Protocol At Monitors
- Simple, but enough to achieve 10x data reduction

The Protocol at the Coordinator
- Create new time data from communication and predictions
- Update the (cyclic) matrix: add new data, lose the oldest
- Re-compute PCA (residual projection matrix, threshold)
- Detect anomalies, fire alarms
- Update slacks when needed (no details...)
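
The first two coordinator steps can be sketched as follows, again assuming the "average of the last 5 communicated values" model. The class and parameter names are hypothetical. Re-computing the PCA, testing the threshold, and updating slacks are left out; they would run on `rows` after each step.

```python
from collections import deque

class Coordinator:
    """Rebuilds one row of data per time step from updates and predictions,
    and keeps the most recent rows in a cyclic window."""

    def __init__(self, n_monitors, window_len, pred_len=5):
        self.rows = deque(maxlen=window_len)  # cyclic data matrix, oldest row dropped
        self.sent = [deque(maxlen=pred_len) for _ in range(n_monitors)]

    def step(self, updates):
        """updates[i] is monitor i's new value, or None if it stayed silent."""
        row = []
        for history, update in zip(self.sent, updates):
            if update is not None:
                history.append(update)
                row.append(update)
            elif history:
                # Silent monitor: fill in the shared prediction instead.
                row.append(sum(history) / len(history))
            else:
                raise ValueError("no value communicated yet for this monitor")
        self.rows.append(row)
        return row

coordinator = Coordinator(n_monitors=2, window_len=3)
coordinator.step([10.0, 20.0])
coordinator.step([None, 22.0])
coordinator.step([12.0, None])
coordinator.step([None, None])
window = [list(r) for r in coordinator.rows]
```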

Parameter Design and Error Control
Users specify an upper bound on the false alarm rate, but we determine the filtering parameters.
- Data vs. model
An algorithmic analysis of the tradeoff between detection accuracy and data communication cost:
- Monitor parameters are determined from the target accuracy
- The technical tool relies on stochastic matrix perturbation theory
Eigen-error: the L2 norm of the difference between the approximate eigenvalues and the actual ones.
Steps: an implicit solution by Monte Carlo and fast binary search, then stochastic matrix perturbation theory.

Parameter Design and Error Control (II)
Detection error → eigen-error
- Experience showed that the mapping from eigen-error to detection error is monotonically rising.
- Monte Carlo simulation finds the mapping from eigen-error to detection error.
- For the given detection error, a fast binary search finds the reverse mapping: a value for the eigen-error.
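
Because the mapping is monotone, the reverse lookup is an ordinary bisection. The sketch below assumes a callable `estimate_mu(eps)` that returns the detection error for a given eigen-error; in the talk that estimate comes from Monte Carlo simulation, while here a toy monotone function stands in for it. All names and numbers are illustrative.

```python
def largest_tolerable_eps(estimate_mu, target_mu, lo, hi, iters=40):
    """Bisection on a monotonically rising mapping eps -> mu.
    Returns (approximately) the largest eps in [lo, hi] whose estimated
    detection error stays at or below target_mu."""
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if estimate_mu(mid) <= target_mu:
            lo = mid  # mid is still tolerable, search higher
        else:
            hi = mid  # mid is too lax, search lower
    return lo

def toy_mu(eps):
    # Stand-in for the Monte Carlo estimate; any rising function works here.
    return eps ** 2

eps = largest_tolerable_eps(toy_mu, target_mu=0.01, lo=0.0, hi=1.0)
```

With a real Monte Carlo estimate the function values are noisy, so the number of samples per evaluation limits how fine the search can usefully go.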

From Eigen-Error to Detection Deviation
- Normalized form of the detection statistic (Jensen & Solomon), compared against the (1-α)-th percentile c_α
- Upper bound η_x on its perturbation, estimated using the max of the Monte Carlo results
[Figure: thresholds c_α and c_α + η_x; the regions mark centralized false alarms, distributed false alarms, and the detection error.]

Parameter Design and Error Control (III)
Eigen-error → filtering parameters (slacks)
Error matrix: the elements of each column vector are bounded by the slack of the corresponding monitor.
Assumptions:
- The column vectors are independent, radially symmetric random vectors.
- For each i, all elements of column vector i are i.i.d. random variables with mean 0 and a common variance.
The variance is a function of the slacks.

Parameter Design and Error Control (III)
Theorem: setting the slacks to satisfy a condition [formula not preserved in the transcript] that involves the tolerable eigen-error and the average of the perturbed eigenvalues guarantees that the actual eigen-error is smaller than the tolerable eigen-error with high probability.

Absent: a connection between local variances and local slacks

Slack Allocation Methods
- Homogeneous slack allocation, uniform distribution of errors: assume the errors are uniformly distributed over the range allowed by the slack, which yields a closed-form expression.
- Homogeneous slack allocation, local variance estimation: monitors approximate the variance locally by fitting a function (e.g., quadratic) to a recent window of data. The approximation is sent to the coordinator.
- Heterogeneous slack allocation: assume a uniform distribution of errors over the range allowed by the slack; minimize communication; solve using Lagrange multipliers.
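
Under the uniform-error assumption the local variance follows directly from the slack. As a worked step, assume (the range itself is not preserved in the transcript) that the filtering error of monitor i is uniform on the symmetric interval from -δ_i to δ_i, where δ_i is a symbol introduced here for that monitor's slack. The mean is 0, so the variance is the second moment:

```latex
\sigma_i^2
  = \int_{-\delta_i}^{\delta_i} x^2 \,\frac{1}{2\delta_i}\, dx
  = \frac{1}{2\delta_i}\cdot\frac{2\delta_i^3}{3}
  = \frac{\delta_i^2}{3}
```

This is the kind of link between local variance and local slack that the previous slide lists as absent from the general analysis.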

Evaluation: Accuracy and Cost
Given a user-specified false alarm rate, evaluate the actual detection accuracy and communication overhead.
Experiment setup, "matching Lakhina et al.":
- Abilene backbone network (Internet2) data of one week: 121 flows, 41 links, 10-minute periods (7 × 24 × 6 = 1008 samples)
- Traffic matrices of size 1008 × 41
- Uniform slack set for all monitors
- Injected: 60 small "bursts" + 60 large "anomalies"
- Threshold corresponding to a 0.5% false alarm rate
- Lakhina et al. tested every flow, at every time of day!

Evaluation Metrics
- False alarm rate = false alarms / bursts
- Missed detection rate = missed detections / anomalies
- Cost = num / (n*m) = messages per monitor per sampled time point
  - num = all exchanged messages
  - n = number of monitors
  - m = number of time series points
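
The three metrics are plain ratios. A small sketch follows, with made-up counts chosen only to exercise the formulas; the 41 monitors and 1008 time points match the experiment setup, everything else is hypothetical.

```python
def evaluation_metrics(false_alarms, bursts, missed, anomalies,
                       num_messages, n_monitors, m_points):
    """Compute the metrics exactly as defined on the slide."""
    return {
        "false_alarm_rate": false_alarms / bursts,
        "missed_detection_rate": missed / anomalies,
        # Messages per monitor per sampled time point.
        "cost": num_messages / (n_monitors * m_points),
    }

# Hypothetical counts, not results from the paper.
metrics = evaluation_metrics(false_alarms=1, bursts=60, missed=3, anomalies=60,
                             num_messages=4132, n_monitors=41, m_points=1008)
```

With 60 injected bursts, a single false alarm already moves the false alarm rate by 1/60, which is the quantization the presenter raises later under Experiments.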

Evaluation Results
[Figure; labels: monitor slack, communication cost (fraction of the completely updated centralized scheme), actual detection errors, deviation of false alarm rate μ.]

Observations
- Variance estimation outperforms the uniform distribution method, but not by much (5%-10%).
- The uniform distribution method is simple.
- The uniform distribution method might be "good enough".
- 80%-90% of the transferred data can be saved without hurting performance.

ROC: Receiver Operating Characteristic Curve
[Figure: ROC curves by update rate of PCA (data is always full).]

Evaluation of Scalability
- BRITE topology generator links
- Up to 500*500 origin-destination flows
- 4 weeks of realistic data, based on statistical characteristics of Abilene
- In each experiment on n nodes: 5 repetitions, on n randomly picked nodes

Graceful Scalability
[Figure: coordinator communication by number of monitors; fitted lines with slope = 1, slope = 0.09, and slope = 0.15.]

Summary: Huang et al.
A communication-efficient framework that
- detects anomalies at the desired accuracy level
- with minimal communication cost
A distributed protocol for data processing
- Local monitors decide when to send updated data to the coordinator
- The coordinator makes the global decision and sends feedback to the monitors
A method to guide the tradeoff between communication overhead and detection accuracy

References
[Huang'07] Communication-Efficient Online Detection of Network-Wide Anomalies. L. Huang, X. Nguyen, M. Garofalakis, J. Hellerstein, M. Jordan, A. Joseph and N. Taft. To appear in INFOCOM '07.
[Lakhina'04] Diagnosing Network-Wide Traffic Anomalies. A. Lakhina, M. Crovella and C. Diot. In SIGCOMM '04.
[Jensen & Solomon] A Gaussian approximation for the distribution of definite quadratic forms. D. R. Jensen and H. Solomon. In J. Amer. Stat. Assoc., 67 (1972).
[Jackson & Mudholkar] Control Procedures for Residuals Associated with Principal Component Analysis. J. E. Jackson and G. S. Mudholkar. Technometrics, pages 341–349.

Discussion

Weaknesses (My Opinion)
- Symmetry + independence
- Experiments

Symmetry + Independence
Is the symmetry + independence assumption valid? Correlation may result from simultaneous errors upon surprising data changes, or from (cyclic?) bursts induced by the updating algorithm.

Experiments
- A single experiment
- Quantization error: 1/60 ≈ 0.017 (one alarm)

Experiments: Lack of Trend
The experiments do not show a statistically significant trend (dependency) between the "tolerated deviation from false alarm rate" and the actual false alarm rate.
- Either the estimations are too loose, or
- the experiments are too synthetic
Between the lines: the user is expected to trust the experiment results.

My Summary
- The decentralized algorithm works well in practice, but the experiments supporting this are insufficient.
- The experiments did not prove that the tuning knob works (that it is connected to practical accuracy guarantees).
- Noisier experiments are needed.

Backup Slides

Traditional Distributed Monitoring
Large-scale network monitoring and detection systems
- Distributed and collaborative monitoring boxes
- Continuously generating time series data
Existing research focuses on data streaming
- Centrally collect, store, and aggregate network state
- Well suited to answering approximate queries and continuously recording system state
- Incurs high overhead!
[Figure: monitors 1, 2, and 3 in local networks 1, 2, and 3 stream data to the operation center. Bandwidth bottleneck! Overloaded!]

Our Distributed Processing Approach
A coordinator
- Is the aggregation, correlation, and detection center
A set of distributed monitors
- Each produces a time series signal
- Each processes data locally and sends only the needed information to the coordinator
- No communication among monitors
- The coordinator tells monitors the level of accuracy for signal updates
[Figure: hosts/monitors "push" data through filters to the coordinator; the coordinator adjusts the filters according to the detection accuracy.]

Performance
Data used: Abilene traffic matrix, 2 weeks, 41 links.
Error tolerance = upper bound on error.