Time Series Algorithm Tutorial

1 Time Series Algorithm Tutorial
Adapted from Andrew Moore’s slides. RODS: Auton Lab: Copyright © 2002, 2003, 2004 Andrew Moore

3 The Basic Task: Analyze a time series data stream to find outbreaks without sounding too many false alarms
[Figure: signal vs. time]

4 Many Methods!
Table columns: Method; Has Pitt/CMU tried it? (Tried but little used / Tried and used / Under development); Multivariate signal tracking?; Spatial?
Methods listed: Time-weighted averaging (Yes); Serfling; ARIMA; SARIMA + External Factors; Univariate HMM; Kalman Filter; Recursive Least Squares; Support Vector Machine; Neural Nets; Randomization; Spatial Scan Statistics (w/ Howard Burkom); Bayesian Networks; Contingency Tables; Scalar Outlier (SQC); Multivariate Anomalies; Change-point statistics; FDR Tests; WSARE (Recent patterns); PANDA (Causal Model); FLUMOD (space/Time HMM).
Details of these methods and a bibliography are available from “Summary of Biosurveillance-relevant statistical and data mining technologies” by Moore, Cooper, Tsui and Wagner. Downloadable (PDF format) from

8 What you’ll learn about
Noticing events in bio-event time series: Univariate Anomaly Detection
Tracking many series at once: Multivariate Anomaly Detection
These are all powerful statistical methods, which means they all have to have one thing in common… Boring Names.

10 Univariate Time Series
[Figure: signal vs. time]
Example signals:
Number of ED visits today
Number of ED visits this hour
Number of respiratory cases today
School absenteeism today
NyQuil sales today (NyQuil: an OTC medicine for cold and flu)

11 (When) is there an anomaly?

12 (When) is there an anomaly?
This is a time series of counts of primary-physician visits in data from Norfolk in December. I added a fake outbreak, starting at a certain date. Can you guess the start date?

13 (When) is there an anomaly?
Here (much too high for a Friday): the injected outbreak.
This is a time series of counts of primary-physician visits in data from Norfolk in December. I added a fake outbreak, starting at a certain date. Can you guess when?

14 An easy case
[Figure: signal vs. time]
Dealt with by Statistical Quality Control (SQC).
Record the mean (\mu) and standard deviation (\sigma) up to the current time.
Signal an alarm if we go outside 3 sigmas. (1.96\sigma is often used instead of 3\sigma.)

15 An easy case: Control Charts
[Figure: signal vs. time, with the mean and the upper safe range marked]
Dealt with by Statistical Quality Control.
Record the mean and standard deviation up to the current time.
Signal an alarm if we go outside 3 sigmas.
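
The control-chart rule can be sketched in a few lines of Python. This is a minimal illustration and not the original authors' code: the function name, the `min_history` parameter, the toy counts, and the choice to alarm only on the upper side (the "upper safe range") are all assumptions.

```python
from statistics import mean, stdev

def control_chart_alarms(signal, n_sigmas=3.0, min_history=2):
    """Return the indices whose value exceeds mean + n_sigmas * std of all earlier values."""
    alarms = []
    for t in range(min_history, len(signal)):
        history = signal[:t]      # everything strictly before time t
        mu = mean(history)        # mean up to the current time
        sigma = stdev(history)    # sample standard deviation up to the current time
        if sigma > 0 and signal[t] > mu + n_sigmas * sigma:
            alarms.append(t)
    return alarms

# Made-up daily counts with one obvious spike at index 7.
counts = [20, 22, 19, 21, 20, 23, 21, 40, 22]
print(control_chart_alarms(counts))
```

Passing `n_sigmas=1.96` gives the looser threshold mentioned on the slide, at the cost of more alarms.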

16 Control Charts on the Norfolk Data
[Figure: alarm level over time, with the injected outbreak marked]
The predicted value is the mean up to the current time.

17 Control Charts on the Norfolk Data
[Figure: alarm level over time, with the injected outbreak marked]
How well an anomaly detection algorithm performs depends largely on how well the alarm level (or severity) reflects the actual outbreak.

18 Control Charts on the Norfolk Data
[Figure: alarm level over time]
Previously we used two weeks of data; now let’s look at almost 3 months, with more than one data point per day. Challenge: the alarm level of the anomaly might not be very different from that of the normal data points if the mean and variation are both increasing.

19 Looking at changes from yesterday
The control chart’s problem: it is too insensitive to recent changes. Now, what if we go to the other extreme and use yesterday’s data to predict today’s value?
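
As a rough sketch of this "other extreme", the alarm level for each day can be taken as the jump from the previous day. The function name, the toy counts, and the threshold of 5 are hypothetical choices made only for illustration.

```python
def yesterday_alarm_levels(signal):
    """Alarm level for day t is signal[t] - signal[t-1]: yesterday is the prediction."""
    return [signal[t] - signal[t - 1] for t in range(1, len(signal))]

# Made-up counts: normal days, then an outbreak that ramps up from index 4.
counts = [20, 22, 19, 21, 30, 33, 35]
levels = yesterday_alarm_levels(counts)

threshold = 5  # hypothetical alarm threshold
# levels[i] belongs to day i + 1, because day 0 has no yesterday.
alarms = [i + 1 for i, level in enumerate(levels) if level > threshold]
print(levels, alarms)
```

Only the first day of the ramp is flagged here: once yesterday's value is already high, the following outbreak days look normal to this predictor.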

20 Looking at changes from yesterday
[Figure: alarm level over time]
It did not detect Friday’s outbreak because Thursday’s value was already too high: a false negative.

21 Looking at changes from yesterday
[Figure: alarm level over time]
Let’s look at the 3-month data. There will be many false alarms.

22 We need a happy medium
Control chart: too insensitive to recent changes. Either we miss the outbreak/ramp-up or there are too many false alarms in the last few weeks.
Change from yesterday: too sensitive to recent changes. Too many false spikes detected.

23 Moving Average

24 Moving Average
Adapts to recent changes, but not too fast. In this data, the alarm is still raised on Monday, not Friday.
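
A moving-average detector sits between the two extremes: the prediction is the mean of only the last few days. The sketch below is an assumption-laden illustration; in particular, scaling the alarm by the standard deviation of the same window, the window length of 7, and the toy counts are choices made here, not details taken from the slides.

```python
from statistics import mean, stdev

def moving_average_alarms(signal, window=7, n_sigmas=3.0):
    """Alarm when a value exceeds the mean of the previous `window` values by n_sigmas stds."""
    alarms = []
    for t in range(window, len(signal)):
        recent = signal[t - window:t]   # the `window` days before day t
        mu, sigma = mean(recent), stdev(recent)
        if sigma > 0 and signal[t] > mu + n_sigmas * sigma:
            alarms.append(t)
    return alarms

# Made-up daily counts with a spike at index 7.
counts = [20, 22, 19, 21, 20, 23, 21, 40, 22, 21]
print(moving_average_alarms(counts))
```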

25 Moving Average
However, for the 3-month data, the spike stands out.

26 Moving Average
Looks better. But how can we be quantitative about this? In other words, how much better is the moving average than the control chart and the change-from-yesterday method?

27 Algorithm Performance
[Table: algorithm performance, allowing one false alarm per two weeks and per six weeks. Columns: fraction of spikes detected; days to detect a ramp attack.]
But how do we calculate these metrics? It turns out not to be easy.

28 The evaluation data we used so far is called semi-synthetic data
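
One way such an evaluation could be set up is sketched below: take a baseline series, pick the alarm threshold that keeps baseline false alarms within a budget, then inject a spike on each day in turn and count how often it is caught. Everything here is an assumption for illustration: the alarm-level definition (value minus the mean of the previous 7 days), the function names, the artificial baseline, and the spike size.

```python
from statistics import mean

def alarm_levels(signal, window=7):
    """Alarm level per day: today's value minus the mean of the previous `window` days."""
    return {t: signal[t] - mean(signal[t - window:t]) for t in range(window, len(signal))}

def threshold_for_false_alarm_budget(baseline, allowed_false_alarms, window=7):
    """Threshold that at most `allowed_false_alarms` baseline days strictly exceed."""
    levels = sorted(alarm_levels(baseline, window).values(), reverse=True)
    return levels[allowed_false_alarms]   # the (k+1)-th largest baseline level

def fraction_of_spikes_detected(baseline, spike_size, threshold, window=7):
    """Inject one spike per candidate day and report the fraction that raise an alarm."""
    days = range(window, len(baseline))
    detected = 0
    for day in days:
        injected = list(baseline)     # copy, so every trial starts from the clean baseline
        injected[day] += spike_size   # the synthetic part of the semi-synthetic data
        if alarm_levels(injected, window)[day] > threshold:
            detected += 1
    return detected / len(days)

# Artificial six-week baseline with a weekly pattern (low counts at the weekend).
baseline = [20, 22, 21, 23, 24, 12, 10] * 6
threshold = threshold_for_false_alarm_budget(baseline, allowed_false_alarms=1)
fraction = fraction_of_spikes_detected(baseline, spike_size=8, threshold=threshold)
print(f"threshold={threshold:.2f}, fraction of spikes detected={fraction:.2f}")
```

In this toy run, spikes injected on weekend days are missed because the weekend baseline is so far below the weekly mean, which hints at why day-of-week effects matter for the performance numbers.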

42 Algorithm Performance
[Table: algorithm performance, allowing one false alarm per two weeks and per six weeks. Columns: fraction of spikes detected; days to detect a ramp attack.]
Again, these are the results we got using semi-synthetic data.

43 Algorithm Performance
[Table: algorithm performance, allowing one false alarm per two weeks and per six weeks. Columns: fraction of spikes detected; days to detect a ramp attack.]

44 Algorithm Performance
[Table: algorithm performance, allowing one false alarm per two weeks and per six weeks. Columns: fraction of spikes detected; days to detect a ramp attack.]
Why does seven days perform better than 3 and 56? Because of the weekly seasonal effects.

45 Seasonal Effects
[Figure: signal vs. time]
Fit a periodic function (e.g. a sine wave) to previous data.
Predict today’s signal and 3-sigma confidence intervals.
Signal an alarm if we’re off.
Reduces false alarms from natural outbreaks.
Different times of year deserve different thresholds.
The season can be hours, days of the week, weeks of the month, months of the year, etc.
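
The sine-wave idea can be sketched as a least-squares fit of a + b*sin(wt) + c*cos(wt). To keep the example dependency-free it uses the closed-form coefficients, which are valid only under an assumption made here: the history is evenly sampled, the period is an integer of at least 3 samples, and the history covers a whole number of periods. The function names and the synthetic history are also illustrative.

```python
import math
from statistics import mean

def fit_sine(history, period):
    """Least-squares fit of a + b*sin(w*t) + c*cos(w*t), w = 2*pi/period.

    Closed form; assumes len(history) is a whole number of periods.
    """
    n = len(history)
    w = 2 * math.pi / period
    a = mean(history)
    b = 2 / n * sum(y * math.sin(w * t) for t, y in enumerate(history))
    c = 2 / n * sum(y * math.cos(w * t) for t, y in enumerate(history))
    return a, b, c

def predict(params, t, period):
    a, b, c = params
    w = 2 * math.pi / period
    return a + b * math.sin(w * t) + c * math.cos(w * t)

def seasonal_alarm(history, today_value, period, n_sigmas=3.0):
    """Return (alarm?, expected value) for the observation that follows `history`."""
    params = fit_sine(history, period)
    residuals = [y - predict(params, t, period) for t, y in enumerate(history)]
    sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    expected = predict(params, len(history), period)
    return today_value > expected + n_sigmas * sigma, expected

# Synthetic two-year monthly history: a yearly sine wave plus a small alternating wiggle.
history = [50 + 10 * math.sin(2 * math.pi * t / 12) + (1 if t % 2 else -1) for t in range(24)]
print(seasonal_alarm(history, 60, period=12))
print(seasonal_alarm(history, 52, period=12))
```

For irregular sampling or partial periods, a general least-squares solver would be needed instead of the closed form.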

46 Algorithm Performance
[Table: algorithm performance, allowing one false alarm per two weeks and per six weeks. Columns: fraction of spikes detected; days to detect a ramp attack.]
If we account for hours_of_daylight, we get better performance. But we also know that the days of the week differ in our dataset.

47 Day-of-week effects
Fit a day-of-week component:
E[Signal] = a + \delta_{day}
E.g.: \delta_{mon} = +5.42, \delta_{tue} = +2.20, \delta_{wed} = +3.33, \delta_{thu} = +3.10, \delta_{fri} = +4.02, \delta_{sat} = -12.2, \delta_{sun} =
A simple form of ANOVA.
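
A simple way to estimate such a day-of-week component is to take a as the overall mean and each delta as that weekday's mean minus a. This is one plausible estimator, assumed here for illustration; the function name and the two weeks of made-up counts are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def fit_day_of_week(days, signal):
    """Estimate a (overall mean) and delta[day] (weekday mean minus overall mean)."""
    a = mean(signal)
    by_day = defaultdict(list)
    for day, value in zip(days, signal):
        by_day[day].append(value)
    deltas = {day: mean(values) - a for day, values in by_day.items()}
    return a, deltas

week = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
days = week * 2
signal = [25, 22, 23, 23, 24, 8, 15,    # made-up week 1
          27, 24, 25, 25, 26, 10, 17]   # made-up week 2

a, deltas = fit_day_of_week(days, signal)
expected_monday = a + deltas["mon"]     # E[Signal] on a Monday
print(a, deltas, expected_monday)
```

With the same number of observations for every weekday, the deltas estimated this way sum to zero.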

48 Regression using Hours-in-day & IsMonday
Predict = c + \beta * IsMonday. Outbreak detection is not satisfactory. Hours-of-day is not helpful since there is no hourly data in this data set.

49 Regression using Hours-in-day & IsMonday
Adding Is_Monday helps here, or at least it helps with the Monday data.

50 Algorithm Performance
[Table: algorithm performance, allowing one false alarm per two weeks and per six weeks. Columns: fraction of spikes detected; days to detect a ramp attack.]

51 Regression using Mon-Tue
Suppose we do this: predict = c + \alpha * is_Monday + \beta * is_Tuesday
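
For a regression whose only inputs are mutually exclusive indicator variables plus an intercept, the least-squares solution reduces to group means: c is the mean of the non-Monday, non-Tuesday days, and the two coefficients are the Monday and Tuesday means minus c. The sketch below uses that shortcut; the function names and the made-up counts are assumptions.

```python
from statistics import mean

def fit_mon_tue(days, signal):
    """Least-squares fit of predict = c + alpha * is_Monday + beta * is_Tuesday."""
    mon = [y for d, y in zip(days, signal) if d == "mon"]
    tue = [y for d, y in zip(days, signal) if d == "tue"]
    rest = [y for d, y in zip(days, signal) if d not in ("mon", "tue")]
    c = mean(rest)            # baseline for all other days
    alpha = mean(mon) - c     # Monday effect
    beta = mean(tue) - c      # Tuesday effect
    return c, alpha, beta

def predict(params, day):
    c, alpha, beta = params
    return c + alpha * (day == "mon") + beta * (day == "tue")

week = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]
days = week * 2
signal = [25, 22, 23, 23, 24, 8, 15,    # made-up week 1
          27, 24, 25, 25, 26, 10, 17]   # made-up week 2

params = fit_mon_tue(days, signal)
print(params, predict(params, "mon"), predict(params, "sat"))
```

Note how Saturday is still predicted at the pooled level of the other days, so weekend values keep producing large residuals until more day indicators are added.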

52 Algorithm Performance
[Table: algorithm performance, allowing one false alarm per two weeks and per six weeks. Columns: fraction of spikes detected; days to detect a ramp attack.]
Note: we get a higher detection rate when we take more days of the week into account.

53 CUSUM: CUmulative SUM Statistics
Keep a running sum of “surprises”: a sum of excesses each day over the prediction.
When this sum exceeds a threshold, signal an alarm and reset the sum.
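
The running-sum rule can be sketched as follows. Two details are assumptions borrowed from the common one-sided CUSUM convention rather than from the slide: the sum is clipped at zero so that quiet days cannot bank "negative surprise", and a small `slack` is subtracted each day. The predictions, threshold, and counts are made up.

```python
def cusum_alarms(signal, predictions, threshold, slack=0.0):
    """One-sided CUSUM: accumulate excesses over the prediction, alarm and reset on threshold."""
    alarms = []
    running = 0.0
    for t, (value, predicted) in enumerate(zip(signal, predictions)):
        # Add today's surprise; never let the sum drop below zero.
        running = max(0.0, running + (value - predicted - slack))
        if running > threshold:
            alarms.append(t)
            running = 0.0   # reset after signalling
    return alarms

# Made-up counts: a slow ramp in which no single day is dramatically high.
signal = [20, 21, 19, 22, 23, 24, 25, 20]
predictions = [20] * len(signal)   # hypothetical flat prediction
print(cusum_alarms(signal, predictions, threshold=5, slack=1))
```

Because the excesses add up across days, the ramp triggers an alarm even though each individual day is only a few counts above the prediction.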

54 CUSUM
In general it works well for outbreak/ramp-up detection. Here the outbreak is detected on Saturday.

55 CUSUM
But it might not work well for spike detection.

56 Algorithm Performance
[Table: algorithm performance, allowing one false alarm per two weeks and per six weeks. Columns: fraction of spikes detected; days to detect a ramp attack.]

57 The Sickness/Availability Model
Counts = sickness * availability, so sickness = counts / availability. Plot sickness rather than raw counts.
E.g. there are fewer counts during the weekend, but this does not mean less sickness. Sick people may seek care more often on certain days due to the availability of medical services or time in their schedules, so adjust for that phenomenon.
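
The adjustment itself is a single division once availability factors are known. The sketch below assumes the factors are simply given per day; how they would be estimated from data is not covered here, and the factor values and counts are hypothetical.

```python
def estimate_sickness(counts, days, availability):
    """Sickness = counts / availability, using a per-day availability factor."""
    return [count / availability[day] for count, day in zip(counts, days)]

# Hypothetical availability: clinics see half as many patients on weekend days.
availability = {"fri": 1.0, "sat": 0.5, "sun": 0.5, "mon": 1.0}

days = ["fri", "sat", "sun", "mon"]
counts = [20, 10, 11, 21]   # made-up raw counts

print(estimate_sickness(counts, days, availability))
```

After the adjustment the weekend no longer looks like a dip, so a detector applied to the sickness series is not distracted by the weekly pattern.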

58 The Sickness/Availability Model
Columbus Day is a Monday. Veterans Day is November 11th.

59 The Sickness/Availability Model

60 The Sickness/Availability Model
Sickness is different from count. The alarm level is based on sickness.

61 The Sickness/Availability Model
Sickness is different from count.

62 The Sickness/Availability Model
Sickness is different from count. Here the alarm level is based on sickness.

63 The Sickness/Availability Model
The outbreak is successfully detected: the underlying reason for the weekly pattern is the availability issue. The model can also be used to deal with holidays (which might not follow fixed cycles). First replace count with sickness, then handle the seasonal effects, etc.

64 The Sickness/Availability Model
The spike really stands out: see the green spikes.

65 Algorithm Performance
[Table: algorithm performance, allowing one false alarm per two weeks and per six weeks. Columns: fraction of spikes detected; days to detect a ramp attack.]
Apply the sickness/availability model first, and then the moving average. Again, 7 days works best, and it works much better than the simple 7-day moving average.

66 Other state-of-the-art methods
Wavelets
Change-point detection
Kalman filters
Hidden Markov Models
Many others

69 A generalized anomaly detector model based on time series algorithms
First, measure the severity of each data point, for example against a historical average. Then set a severity threshold (sThld) to decide which points are anomalies. This is a general detector model: different detectors basically work in these two steps, except that they use different techniques or algorithms to measure the severities.
2018/11/24
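
The two-step structure can be sketched as a detector that takes any severity function plus a threshold. The names `detect`, `severity_fn`, and `s_thld`, the absolute-deviation severity, and the toy series are all illustrative assumptions.

```python
from statistics import mean

def historical_average_severity(history, value):
    """Step 1 (one possible choice): severity is the distance from the historical average."""
    return abs(value - mean(history))

def detect(signal, severity_fn, s_thld, min_history=1):
    """Step 2: flag every point whose severity exceeds the severity threshold s_thld."""
    return [
        t for t in range(min_history, len(signal))
        if severity_fn(signal[:t], signal[t]) > s_thld
    ]

signal = [10, 11, 9, 10, 20, 10]   # made-up series with one anomaly at index 4
print(detect(signal, historical_average_severity, s_thld=5))
```

Swapping in a different `severity_fn` (a moving-average residual, a CUSUM sum, and so on) changes the detector without touching the thresholding step.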

70 Open-source Libraries for Time Series Algorithms
2017/02 Facebook Prophet (R/Python), Yahoo! EGADS (Java), Twitter anomaly detection (R), 2015 Netflix Surus (Pig, based on PCA), Etsy Skyline (Python), 2013 Numenta NuPIC (Python, based on HTM), 1997 RRDtool HWPREDICT (C, based on Holt-Winters)

71 What you’ll learn about
Noticing events in bio-event time series: Univariate Anomaly Detection
Tracking many series at once: Multivariate Anomaly Detection
Read the remaining slides by yourselves.

72 Multiple Signals

73 Multivariate Signals (relevant to inhalational diseases)

74 Multi Source Signals
[Figure: signals from multiple sources. Labels: Lab, Flu, WebMD, School, Cough & Cold, Throat, Resp, Viral, Death]
Influenza: contiguous cold weeks

75 What if you’ve got multiple signals?
[Figure: signal vs. time. Red: cough sales. Blue: ED respiratory visits.]
Idea One: Simply treat it as two separate alarm-from-signal problems.
Question: why might that not be the best we can do?

76 Another View
[Figure: signal vs. time (red: cough sales, blue: ED respiratory visits), and cough sales plotted against ED respiratory visits]
Question: why might that not be the best we can do?

77 Another View
[Figure: signal vs. time (red: cough sales, blue: ED respiratory visits), and cough sales plotted against ED respiratory visits, with one point marked “This should be an anomaly”]
Question: why might that not be the best we can do?

78 N-dimensional Gaussian
[Figure: signal vs. time (red: cough sales, blue: ED respiratory visits), and cough sales plotted against ED respiratory visits, with the One Sigma and 2 Sigma regions marked]
Good Practical Idea: Model the joint with a Gaussian.
Normal distribution: \mu (mean) and \sigma (standard deviation). Gaussian model: multiple dimensions.
Notes: SQC: statistical quality control. Sensible: wise.
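
A two-signal version of the joint Gaussian can be sketched with the Mahalanobis distance, which measures how many "sigmas" a point lies from the mean once the correlation between the signals is taken into account. The function names, the hand-written 2x2 inverse, and the toy cough-sales and ED-visit values are assumptions for illustration.

```python
import math
from statistics import mean

def fit_gaussian_2d(xs, ys):
    """Return the mean vector and the sample covariance entries (sxx, sxy, syy)."""
    mx, my = mean(xs), mean(ys)
    n = len(xs)
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    return (mx, my), (sxx, sxy, syy)

def mahalanobis(point, mu, cov):
    """Distance of `point` from `mu` in units of joint standard deviations."""
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy ** 2   # must be positive: the signals are not perfectly correlated
    # Quadratic form with the inverse of the 2x2 covariance matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Made-up history in which cough sales and ED respiratory visits rise and fall together.
cough_sales = [10, 12, 14, 16, 18, 20]
ed_visits = [11, 11, 15, 15, 19, 19]
mu, cov = fit_gaussian_2d(cough_sales, ed_visits)

on_trend = (20, 19)    # both signals high together
off_trend = (18, 12)   # each value is unremarkable alone, but the combination is unusual
print(mahalanobis(on_trend, mu, cov), mahalanobis(off_trend, mu, cov))
```

The off-trend point is within one standard deviation of the mean on each axis separately, so two independent univariate detectors would stay silent, yet its joint distance is large. That is the kind of point the joint model is meant to catch.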

