Sequential analysis: balancing the tradeoff between detection accuracy and detection delay XuanLong Nguyen Radlab, 11/06/06.



Outline
Motivation in detection problems
–need to minimize detection delay time
Brief intro to sequential analysis
–sequential hypothesis testing
–sequential change-point detection
Applications
–detection of anomalies in network traffic (network attacks), faulty software, etc.

Three quantities of interest in detection problems
Detection accuracy
–False alarm rate
–Misdetection rate
Detection delay time

Network volume anomaly detection [Huang et al, 06]

So far, anomalies have been treated as isolated events
Spikes seem to appear out of nowhere
Hard to predict an early short burst
–unless we reduce the time granularity of the collected data
To achieve early detection
–have to look at medium- to long-term trends
–know when to stop deliberating

Early detection of anomalous trends
We want to
–distinguish a “bad” process from a “good” process (or among multiple processes)
–detect the point where a “good” process turns bad
Applicable when evidence accumulates over time (no matter how fast or slow)
–e.g., because a router or a server fails
–or a worm propagates its effect
Sequential analysis is well-suited
–minimize the detection time given fixed false alarm and misdetection rates
–balance the tradeoff between these three quantities (false alarm rate, misdetection rate, detection time) effectively

Example: Port scan detection
Detect whether a remote host is a port scanner or a benign host
Ground truth: based on the percentage of local hosts with which a remote host has a failed connection
We set:
–for a scanner, the probability of hitting an inactive local host is 0.8
–for a benign host, that probability is 0.1
Figure (Jung et al, 2004):
–X: percentage of inactive local hosts for a remote host
–Y: cumulative distribution function for X
–plot annotation: 80% bad hosts

Hypothesis testing formulation
A remote host R attempts to connect to a local host at time i
Let Y_i = 0 if the connection attempt is a success, 1 if it fails
As outcomes Y_1, Y_2, … are observed, we wish to determine whether R is a scanner or not
Two competing hypotheses:
–H_0: R is benign
–H_1: R is a scanner

An off-line approach
1. Collect a sequence of data Y for one day (wait for a day)
2. Compute the likelihood ratio accumulated over the day
–this is related to the proportion of inactive local hosts that R tries to connect to (resulting in failed connections)
3. Raise a flag if this statistic exceeds some threshold

A sequential (on-line) solution
1. Update the accumulated likelihood ratio statistic in an online fashion
2. Raise a flag if it exceeds some threshold
Figure: accumulated likelihood ratio over time (hours 0 to 24), with threshold a, threshold b, and the stopping time marked
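
The online procedure above can be sketched as a Wald-style sequential test on the stream of connection outcomes. This is a minimal illustration, not the exact procedure of Jung et al: the function name, the target error rates `alpha` and `beta`, and the use of Wald's approximate thresholds are assumptions; only the probabilities 0.8 and 0.1 come from the port scan example.

```python
import math

def sprt_port_scan(outcomes, p_fail_benign=0.1, p_fail_scanner=0.8,
                   alpha=0.01, beta=0.01):
    """Sequential test on outcomes Y_i (0 = success, 1 = failed connection).

    Returns (decision, n_used), where decision is "scanner", "benign",
    or "undecided" if the data run out before a threshold is hit.
    alpha and beta are assumed target false alarm / misdetection rates.
    """
    a = math.log(beta / (1 - alpha))   # lower threshold: decide H_0 (benign)
    b = math.log((1 - beta) / alpha)   # upper threshold: decide H_1 (scanner)
    llr = 0.0                          # accumulated log likelihood ratio
    n = 0
    for n, y in enumerate(outcomes, start=1):
        if y == 1:
            llr += math.log(p_fail_scanner / p_fail_benign)
        else:
            llr += math.log((1 - p_fail_scanner) / (1 - p_fail_benign))
        if llr >= b:
            return "scanner", n
        if llr <= a:
            return "benign", n
    return "undecided", n
```

With these assumed settings a host is flagged after a handful of consecutive failures, instead of after a full day of data.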

Comparison with other existing intrusion detection systems (Bro & Snort)
Efficiency: 1 - #false positives / #true positives
Effectiveness: #false negatives / #all samples
N: # of samples used (i.e., detection delay time)

Two sequential decision problems
Sequential hypothesis testing
–differentiating a “bad” process from a “good” process
–e.g., our previous port scan example
Sequential change-point detection
–detecting the point(s) where a “good” process starts to turn bad

Sequential hypothesis testing
H = 0 (null hypothesis): normal situation
H = 1 (alternative hypothesis): abnormal situation
Sequence of observed data
–X_1, X_2, X_3, …
A decision consists of
–a stopping time N (when to stop taking samples?)
–a choice of hypothesis: H = 0 or H = 1?

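Quantities of interest
False alarm rate
Misdetection rate
Expected stopping time E[N] (aka number of samples, or decision delay time)
Frequentist formulation: minimize E[N] subject to fixed bounds on the false alarm and misdetection rates
Bayesian formulation: minimize a weighted sum of the error probabilities and the expected sampling cost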

Key statistic: Posterior probability
As more data are observed, the posterior p_n edges closer to either 0 or 1
The optimal cost-to-go function G(p) is a function of the posterior p and can be computed by Bellman’s update
–G(p) = min { cost if we stop now, cost of taking one more sample }
–G(p) is concave
Stop when p_n hits threshold a or b
Figure: optimal G(p) plotted against p on [0, 1], with the posterior sequence p_1, p_2, …, p_n and thresholds a and b; the two hypotheses are illustrated by the densities N(m_0, v_0) and N(m_1, v_1)

Multiple hypothesis test
Suppose we have m hypotheses H = 1, 2, …, m
The relevant statistic is the posterior probability vector in the (m-1)-simplex
Stop when p_n reaches one of the corners (passing through the red boundary)
Figure: simplex with corners H=1, H=2, H=3

Thresholding posterior probability = thresholding sequential log likelihood ratio
Applying Bayes’ rule:
Log likelihood ratio:
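
The equivalence stated on this slide can be worked out as follows. This is a sketch under stated assumptions: the observations are taken to be i.i.d. with density f_0 under H = 0 and f_1 under H = 1, and the prior weights \pi_0, \pi_1 are notation introduced here, not taken from the slides.

```latex
% Posterior of H = 1 after n observations (Bayes' rule)
p_n = P(H = 1 \mid X_1, \dots, X_n)
    = \frac{\pi_1 \prod_{i=1}^{n} f_1(X_i)}
           {\pi_0 \prod_{i=1}^{n} f_0(X_i) + \pi_1 \prod_{i=1}^{n} f_1(X_i)}

% Sequential log likelihood ratio
S_n = \sum_{i=1}^{n} \log \frac{f_1(X_i)}{f_0(X_i)}

% Dividing numerator and denominator of p_n by \pi_1 \prod_i f_1(X_i):
p_n = \frac{1}{1 + \frac{\pi_0}{\pi_1} e^{-S_n}}
\quad\Longleftrightarrow\quad
\log \frac{p_n}{1 - p_n} = \log \frac{\pi_1}{\pi_0} + S_n
```

Since p_n is strictly increasing in S_n, stopping when p_n leaves an interval is the same as stopping when S_n leaves a correspondingly shifted interval.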

Thresholds vs. errors
Figure: accumulated log likelihood ratio S_n starting at 0, with threshold a, threshold b, and stopping time N
Exact if there’s no overshoot at the hitting time!
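
A small sketch of the relation between thresholds and error rates, assuming Wald's classical approximations (which hold with equality only when there is no overshoot). The function names are hypothetical, `a < 0 < b` are thresholds on the log likelihood ratio, and `alpha` and `beta` denote the false alarm and misdetection rates.

```python
import math

def thresholds_from_errors(alpha, beta):
    """Wald's approximate thresholds (a, b) on the log likelihood ratio."""
    a = math.log(beta / (1 - alpha))
    b = math.log((1 - beta) / alpha)
    return a, b

def errors_from_thresholds(a, b):
    """Invert the relation above: error rates implied by thresholds a < 0 < b.

    Obtained by solving exp(a) = beta / (1 - alpha) and
    exp(b) = (1 - beta) / alpha for alpha and beta; overshoot is ignored.
    """
    ea, eb = math.exp(a), math.exp(b)
    alpha = (1 - ea) / (eb - ea)
    beta = ea * (eb - 1) / (eb - ea)
    return alpha, beta
```

Widening the gap between a and b lowers both error rates, at the price of a longer expected stopping time.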

Expected stopping times vs errors
The stopping time N is the hitting time of a random walk
What is E[N]?
Wald’s equation
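
As an illustration of how Wald's equation, E[S_N] = E[N] · E[log likelihood ratio of one sample], yields E[N], the sketch below evaluates it for Bernoulli observations such as the port scan outcomes. It assumes Wald's approximate thresholds and ignores overshoot, so the results are approximations; the function name and parameters are hypothetical.

```python
import math

def expected_samples_bernoulli(p0, p1, alpha, beta):
    """Approximate (E[N] under H_0, E[N] under H_1) for a Bernoulli SPRT.

    p0, p1: probability of outcome 1 under H_0 and H_1.
    alpha, beta: target false alarm and misdetection rates.
    """
    a = math.log(beta / (1 - alpha))
    b = math.log((1 - beta) / alpha)
    step_one = math.log(p1 / p0)                # increment when Y_i = 1
    step_zero = math.log((1 - p1) / (1 - p0))   # increment when Y_i = 0
    # Mean increment of the random walk S_n under each hypothesis
    mu0 = p0 * step_one + (1 - p0) * step_zero
    mu1 = p1 * step_one + (1 - p1) * step_zero
    # E[S_N]: the walk ends near b or near a with the stated error probabilities
    e_n_h0 = (alpha * b + (1 - alpha) * a) / mu0
    e_n_h1 = ((1 - beta) * b + beta * a) / mu1
    return e_n_h0, e_n_h1
```

For example, `expected_samples_bernoulli(0.1, 0.8, 0.01, 0.01)` gives roughly four samples under either hypothesis.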

Outline
Sequential hypothesis testing
Change-point detection
–Off-line formulation: methods based on clustering / maximum likelihood
–On-line (sequential) formulation: minimax method, Bayesian method
–Application in detecting network traffic anomalies

Change-point detection problem
Identify where there is a change in the data sequence
–change in mean, dispersion, correlation function, spectral density, etc.
–generally, a change in distribution
Figure: time series X_t with change points t1 and t2

Off-line change-point detection
Viewed as a clustering problem across the time axis
–change points are the boundaries of clusters
Partition the time series data so as to respect
–homogeneity within a partition
–heterogeneity between partitions

A heuristic: clustering by minimizing intra-partition variance
Suppose that we look at a mean-changing process
Suppose also that there is only one change point
Define the running mean x[i..j]
Define the variation within a partition A_sq[i..j]
Seek a time point v that minimizes the sum of variations G (Fisher, 1958)
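
A minimal sketch of this heuristic for a single change point. The slide's exact definitions of A_sq and G are not preserved in the transcript, so the code assumes the variation of a segment is its sum of squared deviations from the segment mean and that G is the sum of the two segment variations; the function name is hypothetical.

```python
def best_single_split(x):
    """Return (v, cost) such that splitting x into x[:v] and x[v:]
    minimizes the total within-segment sum of squared deviations."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two observations")

    def variation(segment):
        mean = sum(segment) / len(segment)          # running mean of the segment
        return sum((xi - mean) ** 2 for xi in segment)

    best_v, best_cost = None, float("inf")
    for v in range(1, n):                           # both segments non-empty
        cost = variation(x[:v]) + variation(x[v:])
        if cost < best_cost:
            best_v, best_cost = v, cost
    return best_v, best_cost
```

The direct search costs O(n^2) operations, which is adequate for illustration; cumulative sums would bring it down to O(n).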

Statistical inference of change point
A change point is considered a latent variable
Statistical inference of the change-point location via
–frequentist methods, e.g., maximum likelihood estimation
–Bayesian methods, by inferring the posterior probability

Maximum-likelihood method
Hypothesis H_v: the sequence has density f_0 before v, and f_1 after
Hypothesis H_0: the sequence is stochastically homogeneous
This is the precursor for various sequential procedures (to come!)
Figure [Page, 1965]: statistic S_k plotted against k from 1 to n, with the change point v separating f_0 from f_1
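
A sketch of the off-line maximum-likelihood estimate, assuming f_0 and f_1 are known Gaussian densities (the slides do not fix a family here). With S_k the cumulative sum of log f_1/f_0 over the first k samples, the likelihood of "the first k samples come from f_0" equals a constant minus S_k, so the estimate is the k that minimizes S_k. Function names and parameters are hypothetical.

```python
import math

def gaussian_logpdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def ml_change_point(x, m0, v0, m1, v1):
    """Return k in 0..len(x): samples x[:k] are attributed to f_0 = N(m0, v0)
    and x[k:] to f_1 = N(m1, v1). k == len(x) means no change was found."""
    best_k, best_s = 0, 0.0   # S_0 = 0 corresponds to "all samples from f_1"
    s = 0.0
    for k, xi in enumerate(x, start=1):
        # S_k: cumulative log likelihood ratio of the first k samples
        s += gaussian_logpdf(xi, m1, v1) - gaussian_logpdf(xi, m0, v0)
        if s < best_s:
            best_k, best_s = k, s
    return best_k
```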

Maximum-likelihood method [Hinkley, 1970,1971]

Sequential change-point detection
Data are observed serially
There is a change from distribution f_0 to f_1 at time point v
Raise an alarm at time N if a change is detected
Need to
(a) minimize the false alarm rate
(b) minimize the average delay to detection
Figure: timeline with change point v (f_0 before, f_1 after); an alarm time N before v is a false alarm, an alarm after v is a delayed alarm

Minimax formulation
Among all procedures such that the time to false alarm is bounded from below by a constant T, find a procedure that minimizes the average delay to detection
Class of procedures with false alarm condition
Average delay to detection
–average-worst delay: Cusum, SRP tests
–worst-worst delay: Cusum test

Bayesian formulation
Assume a prior distribution on the change point
Among all procedures such that the false alarm probability is less than α, find a procedure that minimizes the average delay to detection
False alarm condition
Average delay to detection
Shiryaev’s test

All procedures involve running likelihood ratios
Hypothesis H_v: the sequence has density f_0 before v, and f_1 after
Hypothesis H_∞: no change point
Likelihood ratio for v = k vs. v = infinity
All procedures involve online thresholding: stop whenever the statistic exceeds a threshold b
–Cusum test:
–Shiryaev-Roberts-Pollak’s:
–Shiryaev’s Bayesian test:
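
The statistics named on this slide are not preserved in the transcript. Their standard textbook forms, for independent observations, are sketched below; the symbols \Lambda_n^k, R_n and the prior \pi_k = P(v = k) are notation assumed here. The Shiryaev-Roberts line shows the plain statistic; Pollak's variant differs in how it is initialized.

```latex
% Likelihood ratio for v = k versus v = \infty, using data up to time n
\Lambda_n^k = \prod_{i=k}^{n} \frac{f_1(X_i)}{f_0(X_i)}

% Cusum statistic: maximize over the unknown change point
g_n = \max_{1 \le k \le n} \log \Lambda_n^k

% Shiryaev-Roberts statistic: sum over the unknown change point
R_n = \sum_{k=1}^{n} \Lambda_n^k

% Shiryaev's Bayesian statistic: posterior odds that the change has occurred
\frac{P(v \le n \mid X_1, \dots, X_n)}{P(v > n \mid X_1, \dots, X_n)}
  = \frac{\sum_{k=1}^{n} \pi_k \, \Lambda_n^k}{\sum_{k > n} \pi_k}
```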

Cusum test (Page, 1966)
Figure: statistic g_n plotted over time, crossing threshold b at stopping time N
This test minimizes the worst-average detection delay (in an asymptotic sense)
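
A minimal sketch of the Cusum test in its usual recursive form, g_n = max(0, g_{n-1} + log f_1(X_n)/f_0(X_n)), assuming f_0 and f_1 are Gaussian with known means and a common known variance. The function name and parameters are hypothetical, and the threshold b must be chosen by the user to meet the false alarm constraint.

```python
def cusum_alarm(x, m0, m1, var, b):
    """Return the first 1-based time n at which the Cusum statistic
    exceeds the threshold b, or None if no alarm is raised."""
    g = 0.0
    for n, xi in enumerate(x, start=1):
        # log f_1(x)/f_0(x) for N(m1, var) versus N(m0, var)
        llr = (m1 - m0) * (xi - 0.5 * (m0 + m1)) / var
        g = max(0.0, g + llr)   # reset to zero whenever evidence turns negative
        if g > b:
            return n
    return None
```

The reset at zero is what lets the statistic forget the pre-change data, so the delay after a change does not grow with the length of the normal period.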

Generalized likelihood ratio
Unfortunately, we don’t know f_0 and f_1
Assume that they follow a parametric form
–f_0 is estimated from “normal” training data
–f_1 is estimated on the fly (on test data)
Sequential generalized likelihood ratio statistic (same as CUSUM):
Our testing rule: stop and declare the change point at the first n such that g_n exceeds a threshold b
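
A sketch of a sequential generalized likelihood ratio test for one simple case, assumed here for illustration: f_0 = N(m0, var) is already estimated from training data, and f_1 = N(m1, var) has an unknown mean m1 that is replaced by its maximum-likelihood estimate over each candidate post-change window. Under these assumptions the maximized log likelihood ratio over the window k..n is (sum of (x_i - m0))^2 / (2 · var · window length). The function name is hypothetical.

```python
def glr_alarm(x, m0, var, b):
    """Return the first 1-based time n at which the GLR statistic g_n
    exceeds the threshold b, or None if no alarm is raised."""
    for n in range(1, len(x) + 1):
        g = 0.0
        tail = 0.0
        # Scan candidate change points k = n, n-1, ..., 1
        for k in range(n, 0, -1):
            tail += x[k - 1] - m0                 # sum of deviations over k..n
            length = n - k + 1
            g = max(g, tail * tail / (2 * var * length))
        if g > b:
            return n
    return None
```

This direct version recomputes the maximum at every step; practical implementations usually limit the window of candidate change points.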

Change point detection in network traffic [Hajji, 2005]
Data features:
–number of good packets received that were directed to the broadcast address
–number of Ethernet packets with an unknown protocol type
–number of good address resolution protocol (ARP) packets on the segment
–number of incoming TCP connection requests (TCP packets with SYN flag set)
Each feature is modeled as a mixture of 3-4 Gaussians to adjust to the daily traffic patterns (night hours vs. day times, weekdays vs. weekends, …)
Figure labels: N(m,v), N(m0,v0), N(m1,v1) (changed behavior)

Subtle change in traffic (aggregated statistic vs. individual variables)
Caused by web robots

Adaptability to normal daily and weekly fluctuations
Figure labels: weekend, PM time

Anomalies detected
Broadcast storms, DoS attacks: injected 2 broadcast/sec, 16 mins delay
Sustained rate of TCP connection requests: injecting 10 packets/sec, 17 mins delay

Anomalies detected
ARP cache poisoning attacks
TCP SYN DoS attack, excessive traffic load
Detection delays shown: 16 mins and 50 seconds

Summary
Sequential hypothesis test
–distinguish a “good” process from a “bad” one
Sequential change-point detection
–detect where a process changes its behavior
Framework for optimal reduction of detection delay
Sequential tests are very easy to apply
–even though the analysis might look difficult

References
Wald, A. Sequential analysis, John Wiley and Sons, Inc.
Arrow, K., Blackwell, D., Girshick, Ann. Math. Stat.
Shiryaev, R. Optimal stopping rules, Springer-Verlag.
Siegmund, D. Sequential analysis, Springer-Verlag.
Brodsky, B. E. and Darkhovsky, B. S. Nonparametric methods in change-point problems. Kluwer Academic Pub.
Baum, C. W. & Veeravalli, V. V. A Sequential Procedure for Multihypothesis Testing. IEEE Trans on Info Thy, 40(6).
Lai, T. L. Sequential analysis: Some classical problems and new challenges (with discussion), Statistica Sinica, 11:303–408.
Mei, Y. Asymptotically optimal methods for sequential change-point detection, Caltech PhD thesis.
Hajji, H. Statistical analysis of network traffic for adaptive faults detection, IEEE Trans Neural Networks.
Tartakovsky, A. & Veeravalli, V. V. General asymptotic Bayesian theory of quickest change detection. Theory of Probability and Its Applications, 2005.
Nguyen, X., Wainwright, M. & Jordan, M. I. On optimal quantization rules in sequential decision problems. Proc. ISIT, Seattle, 2006.