Download presentation
Presentation is loading. Please wait.
Published byFrancine Lawson Modified over 9 years ago
1
Biosurveillance Detection Algorithms: Slide 1 Copyright © 2002, 2003, 2004 Andrew Moore Detection Algorithms for Biosurveillance: A tutorial RODS: http://www.health.pitt.edu/rodshttp://www.health.pitt.edu/rods Auton Lab: http://www.autonlab.orghttp://www.autonlab.org
2
Biosurveillance Detection Algorithms: Slide 2 Copyright © 2002, 2003, Andrew Moore The Basic Task: Analyze a time series data stream to find outbreaks without sounding too many false alarms Time Signal
3
Biosurveillance Detection Algorithms: Slide 3 Copyright © 2002, 2003, Andrew Moore MethodHas Pitt/CMU tried it? Tried but little used Tried and used Under developmentMultivariate signal tracking? Spatial ? Time-weighted averagingYes SerflingYes ARIMAYes SARIMA + External FactorsYes Univariate HMMYes Kalman FilterYes Recursive Least SquaresYes Support Vector MachineYes Neural NetsYes RandomizationYes Spatial Scan StatisticsYes(w/ Howard Burkom)Yes Bayesian NetworksYes Contingency TablesYes Scalar Outlier (SQC)Yes Multivariate AnomaliesYes Change-point statisticsYes FDR TestsYes WSARE (Recent patterns)Yes PANDA (Causal Model)Yes FLUMOD (space/Time HMM)Yes Details of these methods and bibliography available from “Summary of Biosurveillance-relevant statistical and data mining technologies” by Moore, Cooper, Tsui and Wagner. Downloadable (PDF format) from www.cs.cmu.edu/~awm/biosurv-methods.pdf Many Methods!
4
Biosurveillance Detection Algorithms: Slide 4 Copyright © 2002, 2003, Andrew Moore What you’ll learn about Noticing events in bio- event time series Tracking many series at once Detecting geographic hotspots Finding emerging new patterns
5
Biosurveillance Detection Algorithms: Slide 5 Copyright © 2002, 2003, Andrew Moore What you’ll learn about Noticing events in bio- event time series Tracking many series at once Detecting geographic hotspots Finding emerging new patterns These are all powerful statistical methods, which means they all have to have one thing in common…
6
Biosurveillance Detection Algorithms: Slide 6 Copyright © 2002, 2003, Andrew Moore What you’ll learn about Noticing events in bio- event time series Tracking many series at once Detecting geographic hotspots Finding emerging new patterns These are all powerful statistical methods, which means they all have to have one thing in common… Boring Names.
7
Biosurveillance Detection Algorithms: Slide 7 Copyright © 2002, 2003, Andrew Moore What you’ll learn about Noticing events in bio- event time series Tracking many series at once Detecting geographic hotspots Finding emerging new patterns Univariate Anomaly Detection These are all powerful statistical methods, which means they all have to have one thing in common… Boring Names. Multivariate Anomaly Detection Spatial Scan Statistics WSARE
8
Biosurveillance Detection Algorithms: Slide 8 Copyright © 2002, 2003, Andrew Moore What you’ll learn about Noticing events in bio- event time series Tracking many series at once Detecting geographic hotspots Finding emerging new patterns Univariate Anomaly Detection Multivariate Anomaly Detection Spatial Scan Statistics WSARE
9
Biosurveillance Detection Algorithms: Slide 9 Copyright © 2002, 2003, Andrew Moore Univariate Time Series Time Signal Example Signals: Number of ED visits today Number of ED visits this hour Number of Respiratory Cases Today School absenteeism today Nyquil Sales today
10
Biosurveillance Detection Algorithms: Slide 10 Copyright © 2002, 2003, Andrew Moore (When) is there an anomaly?
11
Biosurveillance Detection Algorithms: Slide 11 Copyright © 2002, 2003, Andrew Moore (When) is there an anomaly? This is a time series of counts of primary-physician visits in data from Norfolk in December 2001. I added a fake outbreak, starting at a certain date. Can you guess the start date?
12
Biosurveillance Detection Algorithms: Slide 12 Copyright © 2002, 2003, Andrew Moore (When) is there an anomaly? This is a time series of counts of primary-physician visits in data from Norfolk in December 2001. I added a fake outbreak, starting at a certain date. Can you guess when? Here (much too high for a Friday) (injected outbreak)
13
Biosurveillance Detection Algorithms: Slide 13 Copyright © 2002, 2003, Andrew Moore An easy case Time Signal Dealt with by Statistical Quality Control Record the mean and standard deviation up to the current time. Signal an alarm if we go outside 3 sigmas
14
Biosurveillance Detection Algorithms: Slide 14 Copyright © 2002, 2003, Andrew Moore An easy case: Control Charts Time Signal Dealt with by Statistical Quality Control Record the mean and standard deviation up to the current time. Signal an alarm if we go outside 3 sigmas Mean Upper Safe Range
15
Biosurveillance Detection Algorithms: Slide 15 Copyright © 2002, 2003, Andrew Moore Control Charts on the Norfolk Data Alarm Level (injected outbreak)
16
Biosurveillance Detection Algorithms: Slide 16 Copyright © 2002, 2003, Andrew Moore Control Charts on the Norfolk Data Alarm Level (injected outbreak)
17
Biosurveillance Detection Algorithms: Slide 17 Copyright © 2002, 2003, Andrew Moore Control Charts on the Norfolk Data Alarm Level
18
Biosurveillance Detection Algorithms: Slide 18 Copyright © 2002, 2003, Andrew Moore Looking at changes from yesterday
19
Biosurveillance Detection Algorithms: Slide 19 Copyright © 2002, 2003, Andrew Moore Looking at changes from yesterday Alarm Level
20
Biosurveillance Detection Algorithms: Slide 20 Copyright © 2002, 2003, Andrew Moore Looking at changes from yesterday Alarm Level
21
Biosurveillance Detection Algorithms: Slide 21 Copyright © 2002, 2003, Andrew Moore We need a happy medium: Control Chart: Too insensitive to recent changes Change from yesterday: Too sensitive to recent changes
22
Biosurveillance Detection Algorithms: Slide 22 Copyright © 2002, 2003, Andrew Moore Moving Average
23
Biosurveillance Detection Algorithms: Slide 23 Copyright © 2002, 2003, Andrew Moore Moving Average
24
Biosurveillance Detection Algorithms: Slide 24 Copyright © 2002, 2003, Andrew Moore Moving Average
25
Biosurveillance Detection Algorithms: Slide 25 Copyright © 2002, 2003, Andrew Moore Moving Average Looks better. But how can we be quantitative about this?
26
Biosurveillance Detection Algorithms: Slide 26 Copyright © 2002, 2003, Andrew Moore Algorithm Performance Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per TWO weeks… Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per SIX weeks…
27
Biosurveillance Detection Algorithms: Slide 27 Copyright © 2002, 2003, Andrew Moore Algorithm Performance Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per TWO weeks… Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per SIX weeks…
28
Biosurveillance Detection Algorithms: Slide 28 Copyright © 2002, 2003, Andrew Moore Algorithm Performance Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per TWO weeks… Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per SIX weeks…
29
Biosurveillance Detection Algorithms: Slide 29 Copyright © 2002, 2003, Andrew Moore Seasonal Effects Time Signal Fit a periodic function (e.g. sine wave) to previous data. Predict today’s signal and 3-sigma confidence intervals. Signal an alarm if we’re off. Reduces False alarms from Natural outbreaks. Different times of year deserve different thresholds.
30
Biosurveillance Detection Algorithms: Slide 30 Copyright © 2002, 2003, Andrew Moore Algorithm Performance Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per TWO weeks… Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per SIX weeks…
31
Biosurveillance Detection Algorithms: Slide 31 Copyright © 2002, 2003, Andrew Moore Day-of-week effects Fit a day-of-week component E[Signal] = a + delta day E.G: delta mon = +5.42, delta tue = +2.20, delta wed = +3.33, delta thu = +3.10, delta fri = +4.02, delta sat = -12.2, delta sun = -23.42 A simple form of ANOVA
32
Biosurveillance Detection Algorithms: Slide 32 Copyright © 2002, 2003, Andrew Moore Regression using Hours-in-day & IsMonday
33
Biosurveillance Detection Algorithms: Slide 33 Copyright © 2002, 2003, Andrew Moore Regression using Hours-in-day & IsMonday
34
Biosurveillance Detection Algorithms: Slide 34 Copyright © 2002, 2003, Andrew Moore Algorithm Performance Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per TWO weeks… Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per SIX weeks…
35
Biosurveillance Detection Algorithms: Slide 35 Copyright © 2002, 2003, Andrew Moore Regression using Mon-Tue
36
Biosurveillance Detection Algorithms: Slide 36 Copyright © 2002, 2003, Andrew Moore Algorithm Performance Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per TWO weeks… Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per SIX weeks…
37
Biosurveillance Detection Algorithms: Slide 37 Copyright © 2002, 2003, Andrew Moore CUSUM CUmulative SUM Statistics Keep a running sum of “surprises”: a sum of excesses each day over the prediction When this sum exceeds threshold, signal alarm and reset sum
38
Biosurveillance Detection Algorithms: Slide 38 Copyright © 2002, 2003, Andrew Moore CUSUM
39
Biosurveillance Detection Algorithms: Slide 39 Copyright © 2002, 2003, Andrew Moore CUSUM
40
Biosurveillance Detection Algorithms: Slide 40 Copyright © 2002, 2003, Andrew Moore Algorithm Performance Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per TWO weeks… Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per SIX weeks…
41
Biosurveillance Detection Algorithms: Slide 41 Copyright © 2002, 2003, Andrew Moore The Sickness/Availability Model Counts = sickness * availability Sick people may seek care more often on certain days due to availability of medical services or time in their schedules, so adjust for that phenomenon Sickness = counts / availability Plot this
42
Biosurveillance Detection Algorithms: Slide 42 Copyright © 2002, 2003, Andrew Moore The Sickness/Availability Model
43
Biosurveillance Detection Algorithms: Slide 43 Copyright © 2002, 2003, Andrew Moore The Sickness/Availability Model
44
Biosurveillance Detection Algorithms: Slide 44 Copyright © 2002, 2003, Andrew Moore The Sickness/Availability Model
45
Biosurveillance Detection Algorithms: Slide 45 Copyright © 2002, 2003, Andrew Moore The Sickness/Availability Model
46
Biosurveillance Detection Algorithms: Slide 46 Copyright © 2002, 2003, Andrew Moore The Sickness/Availability Model
47
Biosurveillance Detection Algorithms: Slide 47 Copyright © 2002, 2003, Andrew Moore The Sickness/Availability Model
48
Biosurveillance Detection Algorithms: Slide 48 Copyright © 2002, 2003, Andrew Moore The Sickness/Availability Model
49
Biosurveillance Detection Algorithms: Slide 49 Copyright © 2002, 2003, Andrew Moore Algorithm Performance Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per TWO weeks… Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per SIX weeks…
50
Biosurveillance Detection Algorithms: Slide 50 Copyright © 2002, 2003, Andrew Moore Algorithm Performance Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per TWO weeks… Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per SIX weeks…
51
Biosurveillance Detection Algorithms: Slide 51 Copyright © 2002, 2003, Andrew Moore Exploiting Denominator Data Normalize (divide) by total visits
52
Biosurveillance Detection Algorithms: Slide 52 Copyright © 2002, 2003, Andrew Moore Exploiting Denominator Data
53
Biosurveillance Detection Algorithms: Slide 53 Copyright © 2002, 2003, Andrew Moore Exploiting Denominator Data
54
Biosurveillance Detection Algorithms: Slide 54 Copyright © 2002, 2003, Andrew Moore Exploiting Denominator Data and Smoothing
55
Biosurveillance Detection Algorithms: Slide 55 Copyright © 2002, 2003, Andrew Moore Algorithm Performance Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per TWO weeks… Fraction of spikes detected Days to detect a ramp attack Allowing one False Alarm per SIX weeks…
56
Biosurveillance Detection Algorithms: Slide 56 Copyright © 2002, 2003, Andrew Moore Other state-of-the-art methods Wavelets Change-point detection Kalman filters Hidden Markov Models
57
Biosurveillance Detection Algorithms: Slide 57 Copyright © 2002, 2003, Andrew Moore What you’ll learn about Noticing events in bio- event time series Tracking many series at once Detecting geographic hotspots Finding emerging new patterns Univariate Anomaly Detection Multivariate Anomaly Detection Spatial Scan Statistics WSARE
58
Biosurveillance Detection Algorithms: Slide 58 Copyright © 2002, 2003, Andrew Moore Multiple Signals
59
Biosurveillance Detection Algorithms: Slide 59 Copyright © 2002, 2003, Andrew Moore Multivariate Signals (relevant to inhalational diseases)
60
Biosurveillance Detection Algorithms: Slide 60 Copyright © 2002, 2003, Andrew Moore Multi Source Signals Lab Flu WebMD School Cough& Cold Throat Resp Viral Death weeks
61
Biosurveillance Detection Algorithms: Slide 61 Copyright © 2002, 2003, Andrew Moore What if you’ve got multiple signals? Time Signal Idea One: Simply treat it as two separate alarm-from- signal problems. …Question: why might that not be the best we can do? Red: Cough Sales Blue: ED Respiratory Visits
62
Biosurveillance Detection Algorithms: Slide 62 Copyright © 2002, 2003, Andrew Moore Another View Signal Question: why might that not be the best we can do? Red: Cough Sales Blue: ED Respiratory Visits Cough Sales ED Respiratory Visits
63
Biosurveillance Detection Algorithms: Slide 63 Copyright © 2002, 2003, Andrew Moore Another View Signal Red: Cough Sales Blue: ED Respiratory Visits Cough Sales ED Respiratory Visits This should be an anomaly Question: why might that not be the best we can do?
64
Biosurveillance Detection Algorithms: Slide 64 Copyright © 2002, 2003, Andrew Moore N-dimensional Gaussian Signal Good Practical Idea: Model the joint with a Gaussian This is a sensible N-dimensional SQC …But you can also do N- dimensional modeling of dynamics (leads to the idea of Kalman Filter model) Red: Cough Sales Blue: ED Respiratory Visits Cough Sales ED Respiratory Visits One Sigma 2 Sigma
65
Biosurveillance Detection Algorithms: Slide 65 Copyright © 2002, 2003, Andrew Moore What you’ll learn about Noticing events in bio- event time series Tracking many series at once Detecting geographic hotspots Finding emerging new patterns Univariate Anomaly Detection Multivariate Anomaly Detection Spatial Scan Statistics WSARE
66
Biosurveillance Detection Algorithms: Slide 66 Copyright © 2002, 2003, Andrew Moore One Step of Spatial Scan Entire area being scanned (Philadelphia Metro)
67
Biosurveillance Detection Algorithms: Slide 67 Copyright © 2002, 2003, Andrew Moore One Step of Spatial Scan Entire area being scanned Current region being considered
68
Biosurveillance Detection Algorithms: Slide 68 Copyright © 2002, 2003, Andrew Moore One Step of Spatial Scan Entire area being scanned Current region being considered I have a population of 5300 of whom 53 are sick (1%) Everywhere else has a population of 2,200,000 of whom 20,000 are sick (0.9%)
69
Biosurveillance Detection Algorithms: Slide 69 Copyright © 2002, 2003, Andrew Moore One Step of Spatial Scan Entire area being scanned Current region being considered I have a population of 5300 of whom 53 are sick (1%) Everywhere else has a population of 2,200,000 of whom 20,000 are sick (0.9%) So... is that a big deal? Evaluated with Score function (e.g. Kulldorf’s score)
70
Biosurveillance Detection Algorithms: Slide 70 Copyright © 2002, 2003, Andrew Moore One Step of Spatial Scan Entire area being scanned Current region being considered I have a population of 5300 of whom 53 are sick (1%) [Score = 1.4] Everywhere else has a population of 2,200,000 of whom 20,000 are sick (0.9%) So... is that a big deal? Evaluated with Score function (e.g. Kulldorf’s score)
71
Biosurveillance Detection Algorithms: Slide 71 Copyright © 2002, 2003, Andrew Moore Many Steps of Spatial Scan Entire area being scanned Current region being considered I have a population of 5300 of whom 53 are sick (1%) [Score = 1.4] Everywhere else has a population of 2,200,000 of whom 20,000 are sick (0.9%) So... is that a big deal? Evaluated with Score function (e.g. Kulldorf’s score) Highest scoring region in search so far [Score = 9.3]
72
Biosurveillance Detection Algorithms: Slide 72 Copyright © 2002, 2003, Andrew Moore Scan Statistics Standard scan statistic question: Given the geographical locations of occurrences of a phenomenon, is there a region with an unusually high (low) rate of these occurrences? Standard approach: 1.Compute the likelihood of the data given the hypothesis that the rate of occurrence is uniform everywhere, L 0 2.For some geographical region, W, compute the likelihood that the rate of occurrence is uniform at one level inside the region and uniform at another level outside the region, L(W). 3.Compute the likelihood ratio, L(W)/L 0 4.Repeat for all regions, and find the largest likelihood ratio. This is the scan statistic, S* W 5.Report the region, W, which yielded the max, S* W See [Glaz and Balakrishnan, 99] for details
73
Biosurveillance Detection Algorithms: Slide 73 Copyright © 2002, 2003, Andrew Moore Significance testing Given that region W is the most likely to be abnormal, is it significantly abnormal? Standard approach: 1.Generate many randomized versions of the data set by shuffling the labels (positive instance of the phenomenon or not). 2.Compute S* W for each randomized data set. This forms a baseline distribution for S* W if the null hypothesis holds. 3.Compare the observed value of S* W against the baseline distribution to determine a p-value.
74
Biosurveillance Detection Algorithms: Slide 74 Copyright © 2002, 2003, Andrew Moore Fast squares speedup Theoretical complexity of fast squares: O(N 2 ) (as opposed to naïve N 3 ), if maximum density region sufficiently dense. If not, we can use several other speedup tricks. In practice: 10-200x speedups on real and artificially generated datasets. Emergency Dept. dataset (600K records): 20 minutes, versus 66 hours with naïve approach. N N
75
Biosurveillance Detection Algorithms: Slide 75 Copyright © 2002, 2003, Andrew Moore Fast rectangles speedup Theoretical complexity of fast rectangles: O(N 2 log N) (as opposed to naïve N 4 ) N N
76
Biosurveillance Detection Algorithms: Slide 76 Copyright © 2002, 2003, Andrew Moore Fast oriented rectangles speedup Theoretical complexity of fast rectangles: 18N 2 log N (as opposed to naïve 18N 4 ) (Angles discretized to 5 degree buckets) N N
77
Biosurveillance Detection Algorithms: Slide 77 Copyright © 2002, 2003, Andrew Moore Why the Scan Statistic speed obsession? Traditional Scan Statistics very expensive, especially with Randomization tests
78
Biosurveillance Detection Algorithms: Slide 78 Copyright © 2002, 2003, Andrew Moore Rectangular SS on Electrolyte Sales
79
Biosurveillance Detection Algorithms: Slide 79 Copyright © 2002, 2003, Andrew Moore Rectangular SS on Cough/cold Sales
80
Biosurveillance Detection Algorithms: Slide 80 Copyright © 2002, 2003, Andrew Moore Proposed new WSARE/Scan Statistic hybrid This is the strangest region because the age distribution of respiratory cases has changed dramatically for no reason that can be explained by known background changes
81
Biosurveillance Detection Algorithms: Slide 81 Copyright © 2002, 2003, Andrew Moore What you’ll learn about Noticing events in bio- event time series Tracking many series at once Detecting geographic hotspots Finding emerging new patterns Univariate Anomaly Detection Multivariate Anomaly Detection Spatial Scan Statistics WSARE
82
Biosurveillance Detection Algorithms: Slide 82 Copyright © 2002, 2003, Andrew Moore A Limitation of Univariate Analysis DateTimeHospitalICD9ProdromeGenderAgeHome Location Many more… 6/1/039:121781FeverM20sNE… 6/1/039:451787DiarrheaF40sSE… : : : : : : : : : REPRESENTATIVE SURVEILLANCE DATA Standard Approach Select in advance which subpopulations to monitor (e.g., each county, zip) Do not pay close attention to effect of multiple testing WSARE Approach Monitor hundreds of thousands of subpopulations Pay close attention to effect of multiple testing
83
Biosurveillance Detection Algorithms: Slide 83 Copyright © 2002, 2003, Andrew Moore WSARE v2.0 What’s Strange About Recent Events? Designed to be easily applicable to any date/time-indexed biosurveillance-relevant data stream.
84
Biosurveillance Detection Algorithms: Slide 84 Copyright © 2002, 2003, Andrew Moore WSARE v2.0 Inputs: 1. Date/time-indexed biosurveillance- relevant data stream 2. Time Window Length 3. Which attributes to use?
85
Biosurveillance Detection Algorithms: Slide 85 Copyright © 2002, 2003, Andrew Moore WSARE v2.0 Input s: Primary Key DateTimeHospitalICD9ProdromeGenderAgeHomeWorkRecent Flu Levels Recent Weather (Many more…) Large Scale Medium Scale Fine Scale Large Scale Medium Scale Fine Scale h6r326/2/214:12Down- town 781FeverM20sNE15217A5NW15213B82%70R… t3q156/2/214:15River- side 717Respirat ory M60sNE15222J3NE15222J32%70R… t5hh56/2/214:15Smith- field 622Respirat ory F80sSE15210K9SE15210K92%70R… : : : : : : : : : : : : : : : : : 1. Date/time-indexed biosurveillance- relevant data stream 2. Time Window Length 3. Which attributes to use? Example “last 24 hours” “ignore key and weather”
86
Biosurveillance Detection Algorithms: Slide 86 Copyright © 2002, 2003, Andrew Moore WSARE v2.0 Inputs: Primary Key DateTimeHospitalICD9ProdromeGenderAgeHomeWorkRecent Flu Levels Recent Weather (Many more…) Large Scale Medium Scale Fine Scale Large Scale Medium Scale Fine Scale h6r326/2/214:12Down- town 781FeverM20sNE15217A5NW15213B82%70R… t3q156/2/214:15River- side 717Respirat ory M60sNE15222J3NE15222J32%70R… t5hh56/2/214:15Smith- field 622Respirat ory F80sSE15210K9SE15210K92%70R… : : : : : : : : : : : : : : : : : 1. Date/time-indexed biosurveillance- relevant data stream 2. Time Window Length 3. Which attributes to use? Outputs: 1. Here are the records that most surprise me 2. Here’s why 3. And here’s how seriously you should take it
87
Biosurveillance Detection Algorithms: Slide 87 Copyright © 2002, 2003, Andrew Moore WSARE v2.0 Given 500 day’s worth of ER cases at 15 hospitals… DateCases Thu 5/22/2000C1, C2, C3, C4 … Fri 5/23/2000C1, C2, C3, C4 … :: :: Sat 12/9/2000C1, C2, C3, C4 … Sun 12/10/2000C1, C2, C3, C4 … :: Sat 12/16/2000C1, C2, C3, C4 … :: Sat 12/23/2000C1, C2, C3, C4 … :: :: Fri 9/14/2001C1, C2, C3, C4 …
88
Biosurveillance Detection Algorithms: Slide 88 Copyright © 2002, 2003, Andrew Moore Given 500 day’s worth of ER cases at 15 hospitals… For each day… Take today’s cases DateCases Thu 5/22/2000C1, C2, C3, C4 … Fri 5/23/2000C1, C2, C3, C4 … :: :: Sat 12/9/2000C1, C2, C3, C4 … Sun 12/10/2000C1, C2, C3, C4 … :: Sat 12/16/2000C1, C2, C3, C4 … :: Sat 12/23/2000C1, C2, C3, C4 … :: :: Fri 9/14/2001C1, C2, C3, C4 … WSARE v2.0
89
Biosurveillance Detection Algorithms: Slide 89 Copyright © 2002, 2003, Andrew Moore Given 500 day’s worth of ER cases at 15 hospitals… For each day… Take today’s cases The cases one week ago The cases two weeks ago DateCases Thu 5/22/2000C1, C2, C3, C4 … Fri 5/23/2000C1, C2, C3, C4 … :: :: Sat 12/9/2000C1, C2, C3, C4 … Sun 12/10/2000C1, C2, C3, C4 … :: Sat 12/16/2000C1, C2, C3, C4 … :: Sat 12/23/2000C1, C2, C3, C4 … :: :: Fri 9/14/2001C1, C2, C3, C4 … WSARE v2.0
90
Biosurveillance Detection Algorithms: Slide 90 Copyright © 2002, 2003, Andrew Moore Given 500 day’s worth of ER cases at 15 hospitals… For each day… Take today’s cases The cases one week ago The cases two weeks ago Ask: “What’s different about today?” WSARE v2.0
91
Biosurveillance Detection Algorithms: Slide 91 Copyright © 2002, 2003, Andrew Moore Given 500 day’s worth of ER cases at 15 hospitals… For each day… Take today’s cases The cases one week ago The cases two weeks ago Ask: “What’s different about today?” Fields we use: Date, Time of Day, Prodrome, ICD9, Symptoms, Age, Gender, Coarse Location, Fine Location, ICD9 Derived Features, Census Block Derived Features, Work Details, Colocation Details WSARE v2.0
92
Biosurveillance Detection Algorithms: Slide 92 Copyright © 2002, 2003, Andrew Moore Example of Output Sat 12-23-2001 (daynum 36882, dayindex 239) 35.8% ( 48/134) of today's cases have 30 <= age < 40 17.0% ( 45/265) of other cases have 30 <= age < 40
93
Biosurveillance Detection Algorithms: Slide 93 Copyright © 2002, 2003, Andrew Moore Example of Output Sat 12-23-2001 (daynum 36882, dayindex 239) FISHER_PVALUE = 0.000051 35.8% ( 48/134) of today's cases have 30 <= age < 40 17.0% ( 45/265) of other cases have 30 <= age < 40
94
Biosurveillance Detection Algorithms: Slide 94 Copyright © 2002, 2003, Andrew Moore Searching for the best score… Try ICD9 = x for each value of x Try Gender=M, Gender=F Try CoarseRegion=NE, =NW, SE, SW.. Try FineRegion=AA,AB,AC, … DD (4x4 Grid) Try Hospital=x, TimeofDay=x, Prodrome=X, … [In future… features of census blocks] Overfitting Alert!
95
Biosurveillance Detection Algorithms: Slide 95 Copyright © 2002, 2003, Andrew Moore Corrected P value Sat 12-23-2001 (daynum 36882, dayindex 239) FISHER_PVALUE = 0.000051 RANDOMIZATION_PVALUE = 0.031 35.8% ( 48/134) of today's cases have 30 <= age < 40 17.0% ( 45/265) of other cases have 30 <= age < 40
96
Biosurveillance Detection Algorithms: Slide 96 Copyright © 2002, 2003, Andrew Moore WSARE v2.0 Inputs: Primary Key DateTimeHospitalICD9ProdromeGenderAgeHomeWorkRecent Flu Levels Recent Weather (Many more…) Large Scale Medium Scale Fine Scale Large Scale Medium Scale Fine Scale h6r326/2/214:12Down- town 781FeverM20sNE15217A5NW15213B82%70R… t3q156/2/214:15River- side 717Respirat ory M60sNE15222J3NE15222J32%70R… t5hh56/2/214:15Smith- field 622Respirat ory F80sSE15210K9SE15210K92%70R… : : : : : : : : : : : : : : : : : 1. Date/time-indexed biosurveillance- relevant data stream 2. Time Window Length 3. Which attributes to use? Outputs: 1. Here are the records that most surprise me 2. Here’s why 3. And here’s how seriously you should take it
97
Biosurveillance Detection Algorithms: Slide 97 Copyright © 2002, 2003, Andrew Moore WSARE v2.0 Input s: Primary Key DateTimeHospita l ICD 9 Prodrom e Gende r Ag e HomeWorkRecen t Flu Levels Recent Weathe r (Many more… ) Large Scale Mediu m Scale Fine Scale Large Scale Mediu m Scale Fine Scale h6r326/2/214:12Down- town 781FeverM20 s NE15217A5NW15213B82%70R… t3q156/2/214:15River- side 717Respira tory M60 s NE15222J3NE15222J32%70R… t5hh56/2/214:15Smith- field 622Respira tory F80 s SE15210K9SE15210K92%70R… : : : : : : : : : : : : : : : : : 1. Date/time-indexed biosurveillance- relevant data stream 2. Time Window Length 3. Which attributes to use? Output s: 1. Here are the records that most surprise me 2. Here’s why 3. And here’s how seriously you should take it Normally, 8% of cases in the East are over-50s with respiratory problems. But today it’s been 15% Don’t be too impressed! Taking into account all the patterns I’ve been searching over, there’s a 20% chance I’d have found a rule this dramatic just by chance
98
Biosurveillance Detection Algorithms: Slide 98 Copyright © 2002, 2003, Andrew Moore WSARE on recent Utah Data Saturday June 1st in Utah: The most surprising thing about recent records is: Normally: 0.8% of records (50/6205) have time before 2pm and prodrome = Hemorrhagic But recently: 2.1% of records (19/907) have time before 2pm and prodrome = Hemorrhagic Pvalue = 0.0484042 Which means that in a world where nothing changes we'd expect to have a result this significant about once every 20 times we ran the program
99
Biosurveillance Detection Algorithms: Slide 99 Copyright © 2002, 2003, Andrew Moore WSARE 3.0 “Taking into account recent flu levels…” “Taking into account that today is a public holday…” “Taking into account that this is Spring…” “Taking into account recent heatwave…” “Taking into account that there’s a known natural Food-borne outbreak in progress…” Bonus: More efficient use of historical data
100
Biosurveillance Detection Algorithms: Slide 100 Copyright © 2002, 2003, Andrew Moore Idea: Bayesian Networks “On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems” “Patients from West Park Hospital are less likely to be young” “The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic” “On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon”
101
Biosurveillance Detection Algorithms: Slide 101 Copyright © 2002, 2003, Andrew Moore WSARE 3.0 All historical data
102
Biosurveillance Detection Algorithms: Slide 102 Copyright © 2002, 2003, Andrew Moore WSARE 3.0 All historical data
103
Biosurveillance Detection Algorithms: Slide 103 Copyright © 2002, 2003, Andrew Moore WSARE 3.0 All historical data Today’s Environment What should be happening today?
104
Biosurveillance Detection Algorithms: Slide 104 Copyright © 2002, 2003, Andrew Moore WSARE 3.0 All historical data Today’s Environment What should be happening today? Today’s Cases What’s strange about today, considering its environment?
105
Biosurveillance Detection Algorithms: Slide 105 Copyright © 2002, 2003, Andrew Moore WSARE 3.0 All historical data Today’s Environment What should be happening today? Today’s Cases What’s strange about today, considering its environment? And how big a deal is this, considering how much search I’ve done?
106
Biosurveillance Detection Algorithms: Slide 106 Copyright © 2002, 2003, Andrew Moore WSARE 3.0 All historical data Today’s Environment What should be happening today? Today’s Cases What’s strange about today, considering its environment? And how big a deal is this, considering how much search I’ve done? Expensive Cheap
107
Biosurveillance Detection Algorithms: Slide 107 Copyright © 2002, 2003, Andrew Moore WSARE 3.0 All historical data Today’s Environment What should be happening today? Today’s Cases What’s strange about today, considering its environment? And how big a deal is this, considering how much search I’ve done? Expensive Cheap Racing Randomization Differential Randomization All-dimensions Trees RADSEARCH
108
Biosurveillance Detection Algorithms: Slide 108 Copyright © 2002, 2003, Andrew Moore Results on Simulation Standard WSARE2.0 WSARE2.5 WSARE3.0
109
Biosurveillance Detection Algorithms: Slide 109 Copyright © 2002, 2003, Andrew Moore BARD (Bayesian Aerosol Release Detector) Key Points Goal: detect aerosol release of B. anthracis spores Automates the analysis done by Meselson et al. Alarms when increase in disease activity spatially and temporally consistent with aerosol anthrax Makes use of inverted atmospheric dispersion model and meteorological data In preliminary evaluation, no false positives in 6.5 months More info : BARD Tech report Meselson et al, 1994 Science - By simply analyzing existing surveillance data more thoroughly (without additional data collection), BARD has the potential to improve the earliness and specificity of detection
110
Biosurveillance Detection Algorithms: Slide 110 Copyright © 2002, 2003, Andrew Moore For further info Papers on these and other anti-terror applications: www.cs.cmu.edu/~awm/antiterror Papers on scaling up many of these analysis methods: www.cs.cmu.edu/~awm/papers.html Software implementing the above: www.autonlab.org Copies of 18 lectures on 25 statistical data mining topics: www.cs.cmu.edu/~awm/781 CD-ROM, powerpoint-synchronized video/audio recordings of the above lectures: awm@cs.cmu.edu Information Gain, Decision Trees Probabilistic Reasoning, Bayes Classifiers, Density Estimation Probability Densities in Data Mining Gaussians in Data Mining Maximum Likelihood Estimation Gaussian Bayes Classifiers Regression, Neural Nets Overfitting: detection and avoidance The many approaches to cross-validation Locally Weighted Learning Bayes Net, Bayes Net Structure Learning, Anomaly Detection Andrew's Top 8 Favorite Regression Algorithms (Regression Trees, Cascade Correlation, Group Method Data Handling (GMDH), Multivariate Adaptive Regression Splines (MARS), Multilinear Interpolation, Radial Basis Functions, Robust Regression, Cascade Correlation + Projection Pursuit Clustering, Mixture Models, Model Selection K-means clustering and hierarchical clustering Vapnik-Chervonenkis (VC) Dimensionality and Structural Risk Minimization PAC Learning Support Vector Machines Time Series Analysis with Hidden Markov Models
111
Biosurveillance Detection Algorithms: Slide 111 Copyright © 2002, 2003, Andrew Moore References 1.WSARE 3.0 : Bayesian Network based Anomaly Pattern Detection Wong, Moore, Cooper and Wagner [ICML/KDD 2003] 2.Fast Grid Based Computation of Spatial Scan Statistics Neill and Moore [NIPS 2003] These and other Biosurveillance algorithms papers and free software available from http://www.autonlab.org/ See also: http://www.health.pitt.edu/rodshttp://www.health.pitt.edu/rods
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.