What's Strange About Recent Events (WSARE) Weng-Keen Wong (Carnegie Mellon University) Andrew Moore (Carnegie Mellon University) Gregory Cooper (University.

Presentation transcript:

What's Strange About Recent Events (WSARE)
Weng-Keen Wong (Carnegie Mellon University), Andrew Moore (Carnegie Mellon University), Gregory Cooper (University of Pittsburgh), Michael Wagner (University of Pittsburgh)
DIMACS Tutorial on Statistical and Other Analytic Health Surveillance Methods

Motivation Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved):

Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more…
100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              …
101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE                            …
102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              …
103          6/1/03  11:07  2         787   Diarrhea     M       60s  E              ?              …
104          6/1/03  12:15  1         717   Respiratory  M       60s  E              NE             …
105          6/1/03  13:01  3         780   Viral        F       50s  ?              NW             …
106          6/1/03  13:05  3         487   Respiratory  F       40s  SW                            …
107          6/1/03  13:57  2         786   Unmapped     M       50s  SE             SW             …
108          6/1/03  14:22  1         780   Viral        M       40s  ?              ?              …
:            :       :      :         :     :            :       :    :              :              :

The Problem From this data, can we detect if a disease outbreak is happening? We're talking about non-specific disease detection. How early can we detect it? The question we're really asking: in the last n hours, has anything strange happened?

Traditional Approaches What about using traditional anomaly detection?
– Typically assumes the data is generated by a model
– Finds individual data points that have low probability with respect to this model
– These outliers have rare attributes or combinations of attributes
We need to identify anomalous patterns, not isolated data points.

Traditional Approaches What about monitoring aggregate daily counts of certain attributes? We've now turned multivariate data into univariate data. Lots of algorithms have been developed for monitoring univariate data:
– Time series algorithms
– Regression techniques
– Statistical Quality Control methods
But we need to know a priori which attributes to form daily aggregates for!

Traditional Approaches What if we don't know what attributes to monitor? What if we want to exploit the spatial, temporal and/or demographic characteristics of the epidemic to detect the outbreak as early as possible?

Traditional Approaches We need to build a univariate detector to monitor each interesting combination of attributes:
– Diarrhea cases among children
– Respiratory syndrome cases among females
– Viral syndrome cases involving senior citizens from eastern part of city
– Number of children from downtown hospital
– Number of cases involving people working in southern part of the city
– Number of cases involving teenage girls living in the western part of the city
– Botulinic syndrome cases
– And so on…
You'll need hundreds of univariate detectors! We would like to identify the groups with the strangest behavior in recent events.

Our Approach We use Rule-Based Anomaly Pattern Detection. Association rules are used to characterize anomalous patterns. For example, a two-component rule would be: Gender = Male AND 40 ≤ Age < 50
Related work:
– Market basket analysis [Agrawal et al., Brin et al.]
– Contrast sets [Bay and Pazzani]
– Spatial Scan Statistic [Kulldorff]
– Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance [Brossette et al.]

WSARE v2.0
Inputs:
1. Multivariate date/time-indexed biosurveillance-relevant data stream
2. Time Window Length
3. Which attributes to use?

Example input (Emergency Department data; the primary key is ignored and the recent window is the last 24 hours):

Primary Key  Date    Time   Hospital  ICD9  Prodrome     Gender  Age  Home Location  Work Location  Many more…
100          6/1/03  9:12   1         781   Fever        M       20s  NE             ?              …
101          6/1/03  10:45  1         787   Diarrhea     F       40s  NE                            …
102          6/1/03  11:03  1         786   Respiratory  F       60s  NE             N              …
:            :       :      :         :     :            :       :    :              :              :

Outputs:
1. Here are the records that most surprise me
2. Here's why
3. And here's how seriously you should take it

WSARE v2.0 Overview
1. Obtain Recent and Baseline datasets (All Data is split into Recent Data and Baseline)
2. Search for rule with best score
3. Determine p-value of best scoring rule through randomization test
4. If p-value is less than threshold, signal alert

Step 1: Obtain Recent and Baseline Data
– Recent Data: data from the last 24 hours
– Baseline: assumed to capture non-outbreak behavior. We use data from 35, 42, 49 and 56 days prior to the current day
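
A minimal sketch of this split, assuming records arrive as hypothetical (date, payload) pairs and treating "the last 24 hours" as the current calendar day. The function name and record layout are illustrative, not part of WSARE.

```python
from datetime import date, timedelta

# Offsets (in days) named on the slide. All are multiples of 7, so every
# baseline day falls on the same weekday as the current day.
BASELINE_OFFSETS = (35, 42, 49, 56)

def split_recent_and_baseline(records, today, offsets=BASELINE_OFFSETS):
    """Split hypothetical (date, payload) pairs into recent and baseline lists."""
    baseline_days = {today - timedelta(days=k) for k in offsets}
    recent = [payload for day, payload in records if day == today]
    baseline = [payload for day, payload in records if day in baseline_days]
    return recent, baseline

today = date(2003, 6, 1)
records = [
    (today, "case A"),
    (today - timedelta(days=35), "case B"),
    (today - timedelta(days=36), "case C"),  # not one of the baseline days
    (today - timedelta(days=56), "case D"),
]
recent, baseline = split_recent_and_baseline(records, today)
```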

Step 2: Search for Best Scoring Rule
– For each rule, form a 2x2 contingency table, e.g. with columns Count Recent and Count Baseline, one row for records that match the rule (Age Decile = …) and one row for records that do not
– Perform Fisher's Exact Test to get a p-value for each rule => call this p-value the "score"
– Take the rule with the lowest score. Call this rule R_BEST
– This score is not the true p-value of R_BEST because we are performing multiple hypothesis tests on each day to find the rule with the best score
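
The scoring step can be sketched in plain Python. This assumes the two-sided form of Fisher's Exact Test (the slides do not say which form is used), and the counts in the example are illustrative only.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows: records matching / not matching the rule.
    Columns: count in recent data / count in baseline data.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)

# Illustrative counts only (not taken from the slides).
score = fisher_exact_two_sided(8, 2, 1, 5)
```

A lower score means the rule separates recent from baseline records more sharply.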

The Multiple Hypothesis Testing Problem Suppose we reject the null hypothesis when score < α, where α = 0.05. For a single hypothesis test, the probability of making a false discovery = α. Suppose we do 1000 tests, one for each possible rule. Probability(false discovery) could be as bad as: $1 - (1 - 0.05)^{1000} \gg 0.05$
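
The arithmetic can be checked directly. This small sketch assumes the tests are independent, which is the worst case the slide alludes to; the function name is illustrative.

```python
def prob_at_least_one_false_discovery(alpha, num_tests):
    """Chance of one or more false discoveries across independent tests."""
    return 1 - (1 - alpha) ** num_tests

single = prob_at_least_one_false_discovery(0.05, 1)
many = prob_at_least_one_false_discovery(0.05, 1000)
```

With 1000 tests the probability is indistinguishable from 1, which is why the best rule's score cannot be read as a p-value.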

Step 3: Randomization Test Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DB_Rand. Find the rule with the best score on DB_Rand.

Case  Original date   Shuffled date
C2    June 4, 2002    June 4, 2002
C3    June 5, 2002    June 12, 2002
C4    June 12, 2002   July 31, 2002
C5    June 19, 2002   June 26, 2002
C6    June 26, 2002   July 31, 2002
C7    June 26, 2002   June 5, 2002
C8    July 2, 2002    July 2, 2002
C9    July 3, 2002    July 3, 2002
C10   July 10, 2002   July 10, 2002
C11   July 17, 2002   July 17, 2002
C12   July 24, 2002   July 24, 2002
C13   July 30, 2002   July 30, 2002
C14   July 31, 2002   June 19, 2002
C15   July 31, 2002   June 26, 2002

Step 3: Randomization Test Repeat the procedure on the previous slide for 1000 iterations. Determine how many scores from the 1000 iterations are better than the original score. If, for example, the original score placed in the top 1% of the 1000 scores from the randomization test, we would be impressed and an alert should be raised. Estimated p-value of the rule = (# better scores) / (# iterations)
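
A rough sketch of the randomization test, under several assumptions: rules are restricted to single "attribute = value" components, shuffling the recent/baseline labels stands in for shuffling the date field, the score is a two-sided Fisher p-value, and "better" means strictly lower. All function names and the toy data are hypothetical.

```python
import random
from math import comb

def fisher_p(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    r1, r2, c1 = a + b, c + d, a + c
    denom = comb(r1 + r2, c1)
    probs = [comb(r1, x) * comb(r2, c1 - x) / denom
             for x in range(max(0, c1 - r2), min(r1, c1) + 1)]
    observed = comb(r1, a) * comb(r2, c1 - a) / denom
    return min(1.0, sum(p for p in probs if p <= observed * (1 + 1e-9)))

def best_rule_score(records, is_recent):
    """Lowest score over all one-component rules of the form attribute = value."""
    n_recent = sum(is_recent)
    n_base = len(records) - n_recent
    best = 1.0
    for key, value in {(k, v) for rec in records for k, v in rec.items()}:
        a = sum(1 for rec, r in zip(records, is_recent) if r and rec[key] == value)
        b = sum(1 for rec, r in zip(records, is_recent) if not r and rec[key] == value)
        best = min(best, fisher_p(a, b, n_recent - a, n_base - b))
    return best

def randomization_p_value(records, is_recent, iterations=1000, seed=0):
    """Estimated p-value = (# better scores) / (# iterations)."""
    rng = random.Random(seed)
    original = best_rule_score(records, is_recent)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)  # break any link between attributes and recency
        if best_rule_score(records, labels) < original:
            better += 1
    return better / iterations

# Toy data: every recent record shows a symptom never seen in the baseline.
records = [{"symptom": "cough"}] * 40 + [{"symptom": "rash"}] * 10
is_recent = [False] * 40 + [True] * 10
p_value = randomization_p_value(records, is_recent, iterations=200)
```

Because each shuffled dataset also gets to pick its own best rule, the comparison accounts for the rule search performed on the real data.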

Two Kinds of Analysis
– Day by Day: If we want to run WSARE just for the current day… then we end here.
– Historical Analysis: If we want to review all previous days and their p-values for several years and control for some percentage of false positives… then we'll once again run into overfitting problems… we need to compensate for multiple hypothesis testing because we perform a hypothesis test on each day in the history.

False Discovery Rate [Benjamini and Hochberg] We only need to do this for historical analysis!
– Can determine which of these p-values are significant
– Specifically, given an α_FDR, FDR guarantees that the expected fraction of false positives among the tests in which the null hypothesis was rejected is at most α_FDR
– Given an α_FDR, FDR produces a threshold below which any p-values in the history are considered significant
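
As an illustration, the standard Benjamini-Hochberg step-up procedure yields such a threshold. This is a sketch of that textbook procedure rather than the authors' code, and the daily p-values are made up.

```python
def benjamini_hochberg_threshold(p_values, alpha_fdr):
    """Largest p-value cutoff allowed by Benjamini-Hochberg (0.0 if none)."""
    m = len(p_values)
    threshold = 0.0
    for rank, p in enumerate(sorted(p_values), start=1):
        # Keep the largest p-value that sits under the line rank * alpha / m.
        if p <= rank * alpha_fdr / m:
            threshold = p
    return threshold

# Hypothetical p-values, one per day in the history.
daily_p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
cutoff = benjamini_hochberg_threshold(daily_p_values, alpha_fdr=0.05)
significant_days = [i for i, p in enumerate(daily_p_values) if p <= cutoff]
```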

WSARE v3.0

WSARE v2.0 Review
1. Obtain Recent and Baseline datasets (All Data is split into Recent Data and Baseline)
2. Search for rule with best score
3. Determine p-value of best scoring rule through randomization test
4. If p-value is less than threshold, signal alert

Obtaining the Baseline Recall that the baseline was assumed to be captured by data from 35, 42, 49, and 56 days prior to the current day. What if this assumption isn't true? What if data from 7, 14, 21 and 28 days prior is better? We would like to determine the baseline automatically!

Temporal Trends But health care data has many different trends due to
– Seasonal effects in temperature and weather
– Day of Week effects
– Holidays
– Etc.
Allowing the baseline to be affected by these trends may dramatically alter the detection time and false positives of the detection algorithm.

Temporal Trends From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp )

WSARE v3.0 Generate the baseline…
– Taking into account recent flu levels…
– Taking into account that today is a public holiday…
– Taking into account that this is Spring…
– Taking into account recent heatwave…
– Taking into account that there's a known natural food-borne outbreak in progress…
Bonus: More efficient use of historical data

Conditioning on observed environment: well understood for univariate time series [Plot: Signal against Time]
Example signals:
– Number of ED visits today
– Number of ED visits this hour
– Number of Respiratory Cases Today
– School absenteeism today
– Nyquil Sales today

An easy case [Plot: Signal against Time, with the Mean and the Upper Safe Range marked] Dealt with by Statistical Quality Control. Record the mean and standard deviation up to the current time. Signal an alarm if we go outside 3 sigmas.
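
A minimal sketch of this control chart, assuming a two-sided check and the sample standard deviation. The history values are made up.

```python
from statistics import mean, stdev

def control_chart_alarm(history, today_value, num_sigmas=3.0):
    """Return True when today's value lies outside mean +/- num_sigmas * sd."""
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation of all past values
    lower = mu - num_sigmas * sigma
    upper = mu + num_sigmas * sigma
    return not (lower <= today_value <= upper)

# Hypothetical daily counts recorded up to the current time.
history = [20, 22, 19, 21, 23, 20, 22]
alarm_on_spike = control_chart_alarm(history, 30)
alarm_on_normal_day = control_chart_alarm(history, 24)
```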

Conditioning on Seasonal Effects [Plot: Signal against Time] Fit a periodic function (e.g. sine wave) to previous data. Predict today's signal and 3-sigma confidence intervals. Signal an alarm if we're off. Reduces false alarms from natural outbreaks. Different times of year deserve different thresholds.

Example [Tsui et al.] Weekly counts of P&I from week 1/98 to 48/00. From: Value of ICD 9–Coded Chief Complaints for Detection of Epidemics, Fu-Chiang Tsui, Michael M. Wagner, Virginia Dato, Chung-Chou Ho Chang, AMIA 2000

Seasonal Effects with Long-Term Trend Weekly counts of IS from week 1/98 to 48/00. From: Value of ICD 9–Coded Chief Complaints for Detection of Epidemics, Fu-Chiang Tsui, Michael M. Wagner, Virginia Dato, Chung-Chou Ho Chang, AMIA 2000
Fit a periodic function (e.g. sine wave) plus a linear trend: $E[\text{Signal}] = a + bt + c \sin(d + t/365)$
Good if there's a long term trend in the disease or the population. Called the Serfling Method [Serfling, 1963].
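
One way to fit a model of this shape is ordinary least squares, using the identity c sin(d + ωt) = c1 sin(ωt) + c2 cos(ωt) to make the problem linear. The sketch below assumes t is measured in days and an annual angular frequency ω = 2π/365, which the slide's t/365 shorthand does not spell out. The solver and the synthetic data are illustrative only.

```python
import math

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def fit_trend_plus_sine(times, values, omega=2 * math.pi / 365):
    """Least-squares fit of a + b*t + c1*sin(omega*t) + c2*cos(omega*t)."""
    features = [[1.0, t, math.sin(omega * t), math.cos(omega * t)] for t in times]
    k = 4
    # Normal equations: (A^T A) coeffs = A^T y
    AtA = [[sum(f[i] * f[j] for f in features) for j in range(k)] for i in range(k)]
    Aty = [sum(f[i] * y for f, y in zip(features, values)) for i in range(k)]
    return solve_linear_system(AtA, Aty)

def predict(coeffs, t, omega=2 * math.pi / 365):
    a, b, c1, c2 = coeffs
    return a + b * t + c1 * math.sin(omega * t) + c2 * math.cos(omega * t)

# Synthetic, noise-free two-year series used only to exercise the fit.
true_coeffs = [50.0, 0.01, 10.0, -4.0]
times = list(range(730))
values = [predict(true_coeffs, t) for t in times]
fitted = fit_trend_plus_sine(times, values)
```

The residual spread around the fitted curve can then supply the sigma used for an alarm threshold.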

Day-of-week effects From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp )
Fit a day-of-week component: $E[\text{Signal}] = a + \delta_{\text{day}}$
E.g. $\delta_{\text{mon}} = +5.42$, $\delta_{\text{tue}} = +2.20$, $\delta_{\text{wed}} = +3.33$, $\delta_{\text{thu}} = +3.10$, $\delta_{\text{fri}} = +4.02$, $\delta_{\text{sat}} = …$, $\delta_{\text{sun}} = …$
Another simple form of ANOVA.
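
A small sketch of estimating such a component, assuming a is taken as the overall mean and each delta as that day's mean minus a. The observations are made up, and the estimator is one simple choice rather than the method used in the cited paper.

```python
from collections import defaultdict
from statistics import mean

def day_of_week_effects(observations):
    """observations: list of (day_name, count). Returns (a, {day: delta})."""
    a = mean(count for _, count in observations)
    by_day = defaultdict(list)
    for day, count in observations:
        by_day[day].append(count)
    deltas = {day: mean(counts) - a for day, counts in by_day.items()}
    return a, deltas

# Hypothetical counts for two weekdays.
a, deltas = day_of_week_effects([("mon", 12), ("mon", 14), ("sat", 6), ("sat", 8)])
expected_monday = a + deltas["mon"]
```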

Analysis of variance (ANOVA) Good news: if you're tracking a daily aggregate (univariate data)… then ANOVA can take care of many of these effects. But… what if you're tracking a whole joint distribution of events?

Idea: Bayesian Networks Bayesian Network: a graphical model representing the joint probability distribution of a set of random variables. Examples of what it can capture:
– "On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems"
– "Patients from West Park Hospital are less likely to be young"
– "On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon"
– "The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic"

WSARE Overview
1. Obtain Recent and Baseline datasets (All Data is split into Recent Data and Baseline)
2. Search for rule with best score
3. Determine p-value of best scoring rule through randomization test
4. If p-value is less than threshold, signal alert

Obtaining Baseline Data
1. Learn a Bayesian Network from All Historical Data
2. Generate the baseline given Today's Environment
The baseline represents what should be happening today given today's environment.

Step 1: Learning the Bayes Net Structure Involves searching over DAGs for the structure that maximizes a scoring function. Most common algorithm is hillclimbing. Starting from an initial structure, there are 3 possible operations:
– Add an arc
– Delete an arc
– Reverse an arc
But hillclimbing is too slow and single link modifications may not find the correct structure (Xiang, Wong and Cercone 1997). We use Optimal Reinsertion (Moore and Wong 2002).
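
For reference, conventional hillclimbing with the three single-arc operations can be sketched as follows. The scoring function is a hypothetical callable supplied by the caller; a toy score stands in for a real Bayesian network score in the example. This is the baseline method the slide describes, not Optimal Reinsertion.

```python
def has_cycle(nodes, arcs):
    """Return True if the directed graph contains a cycle (DFS colouring)."""
    children = {n: [] for n in nodes}
    for parent, child in arcs:
        children[parent].append(child)
    state = {n: 0 for n in nodes}  # 0 = unvisited, 1 = in progress, 2 = done

    def visit(n):
        state[n] = 1
        for c in children[n]:
            if state[c] == 1 or (state[c] == 0 and visit(c)):
                return True
        state[n] = 2
        return False

    return any(state[n] == 0 and visit(n) for n in nodes)

def neighbours(nodes, arcs):
    """All DAGs reachable by adding, deleting or reversing a single arc."""
    arcs = frozenset(arcs)
    candidates = set()
    for u in nodes:
        for v in nodes:
            if u == v:
                continue
            if (u, v) in arcs:
                candidates.add(arcs - {(u, v)})               # delete an arc
                candidates.add((arcs - {(u, v)}) | {(v, u)})  # reverse an arc
            elif (v, u) not in arcs:
                candidates.add(arcs | {(u, v)})               # add an arc
    return [g for g in candidates if not has_cycle(nodes, g)]

def hill_climb(nodes, arcs, score):
    """Greedy search: move to the best neighbour until none improves the score."""
    current = frozenset(arcs)
    while True:
        best = max(neighbours(nodes, current), key=score, default=current)
        if score(best) <= score(current):
            return current
        current = best

# Toy score: negative distance to a made-up target structure.
nodes = ["A", "B", "C"]
target = frozenset({("A", "B"), ("B", "C")})
learned = hill_climb(nodes, [], score=lambda g: -len(g ^ target))
```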

Optimal Reinsertion
1. Select target node T in current graph
2. Remove all arcs connected to T
3. Efficiently find new in/out arcs
4. Choose best new way to connect T

The Outer Loop
For NumJolts:
    Begin with randomly corrupted version of best DAG so far
    Until no change in current DAG:
        Generate random ordering of nodes
        For each node in the ordering, do Optimal Reinsertion
Conventional hill-climbing without maxParams restriction

How is Optimal Reinsertion done efficiently? Efficiency Tricks:
1. Create an efficient cache of NodeScore(PS->T) values using ADSearch [Moore and Schneider 2002]
2. Restrict PS->T combinations to those with CPTs with maxParams or fewer parameters
3. Additional Branch and Bound is used to restrict the space by an additional order of magnitude
Scoring functions can be decomposed into per-node terms [Diagram: parents P1, P2, P3 pointing to T]

Environmental Attributes Divide the data into two types of attributes:
– Environmental attributes: attributes that cause trends in the data, e.g. day of week, season, weather, flu levels
– Response attributes: all other non-environmental attributes

Environmental Attributes When learning the Bayesian network structure, do not allow environmental attributes (Season, Day of Week, Weather, Flu Level) to have parents. Why?
– We are not interested in predicting their distributions
– Instead, we use them to predict the distributions of the response attributes
Side Benefit: We can speed up the structure search by avoiding DAGs that assign parents to the environmental attributes.

Step 2: Generate Baseline Given Today's Environment Suppose we know the following for today:

        Season  Day of Week  Weather  Flu Level
Today   Winter  Monday       Snow     High

We fill in these values for the environmental attributes in the learned Bayesian network (Season = Winter, Day of Week = Monday, Weather = Snow, Flu Level = High). We sample records from the Bayesian network and make this data set the baseline. Sampling is easy because environmental attributes are at the top of the Bayes Net.
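
A toy sketch of this sampling step. The network below, its attributes and its probabilities are entirely hypothetical and serve only to show the mechanics: environmental attributes are clamped to today's values, and each response attribute is then drawn from its conditional table in topological order.

```python
import random

# Hypothetical toy network. Environmental attributes have no parents;
# response attributes are listed in topological order as
# (attribute, parents, CPT keyed by the tuple of parent values).
NETWORK = [
    ("Prodrome", ("Season", "Flu Level"), {
        ("Winter", "High"): {"Respiratory": 0.6, "Fever": 0.3, "Diarrhea": 0.1},
        ("Winter", "Low"):  {"Respiratory": 0.3, "Fever": 0.3, "Diarrhea": 0.4},
    }),
    ("Age", ("Prodrome",), {
        ("Respiratory",): {"Child": 0.3, "Adult": 0.3, "Senior": 0.4},
        ("Fever",):       {"Child": 0.5, "Adult": 0.3, "Senior": 0.2},
        ("Diarrhea",):    {"Child": 0.4, "Adult": 0.4, "Senior": 0.2},
    }),
]

def sample_baseline(network, environment, num_records, seed=0):
    """Draw records with the environmental attributes fixed to today's values."""
    rng = random.Random(seed)
    baseline = []
    for _ in range(num_records):
        record = dict(environment)  # environmental values are clamped
        for attribute, parents, cpt in network:
            dist = cpt[tuple(record[p] for p in parents)]
            values = list(dist)
            weights = [dist[v] for v in values]
            record[attribute] = rng.choices(values, weights=weights)[0]
        baseline.append(record)
    return baseline

today = {"Season": "Winter", "Day of Week": "Monday", "Weather": "Snow", "Flu Level": "High"}
baseline = sample_baseline(NETWORK, today, num_records=1000)
```

Because the clamped attributes have no parents, no inference is needed: a single top-down pass per record is enough.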

Why not use inference?
– With sampling, we create the baseline data and then use it to obtain the p-value of the rule for the randomization test
– If we used inference, we would not be able to perform the same randomization test and we would need to find some other way to correct for the multiple hypothesis testing
– Sampling was chosen for its simplicity
But there may be clever things to do with inference which may help us. File this under future work.

Simulation City with 9 regions and a different population in each region:

NW 100   N 400   NE 500
W 100    C 200   E 300
SW 200   S 200   SE 600

For each day, sample the city's environment from a Bayesian Network with the following nodes: Date, Day of Week, Season, Previous Weather, Weather, Previous Flu Level, Flu Level, Previous Region Food Condition, Region Food Condition, Previous Region Anthrax Concentration, Region Anthrax Concentration

Simulation For each person in a region, sample their profile from a Bayesian network with the following nodes: DATE, DAY OF WEEK, SEASON, FLU LEVEL, WEATHER, REGION, AGE, GENDER, Region Grassiness, Region Anthrax Concentration, Region Food Condition, Immune System, Outside Activity, Heart Health, Has Anthrax, Has Flu, Has Allergy, Has Heart Attack, Has Sunburn, Has Cold, Has Food Poisoning, Disease, ACTION, Actual Symptom, REPORTED SYMPTOM, DRUG
– Visible Environmental Attributes are highlighted in the network diagram
– Diseases: Allergy, cold, sunburn, flu, food poisoning, heart problems, anthrax (in order of precedence)
– Actions: None, Purchase Medication, ED visit, Absent. If Action is not None, output record to dataset.

Simulation Plot [Plot: the anthrax release is marked; it is not the highest peak]

Simulation
– 100 different data sets
– Each data set consisted of a two year period
– Anthrax release occurred at a random point during the second year
– Algorithms allowed to train on data from the current day back to the first day in the simulation
– Any alerts before the actual anthrax release are considered false positives
– Detection time calculated as first alert after anthrax release. If no alerts raised, cap detection time at 14 days
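
The evaluation rules above can be sketched as follows. Days are plain integers counted from the start of the simulation; treating an alert on the release day itself as a detection, and capping late detections at 14 days as well, are assumptions of this sketch.

```python
MAX_DETECTION_DAYS = 14

def evaluate_alerts(alert_days, release_day, cap=MAX_DETECTION_DAYS):
    """Return (false_positive_count, detection_time_in_days)."""
    false_positives = sum(1 for d in alert_days if d < release_day)
    after = [d for d in alert_days if d >= release_day]
    detection_time = min(after) - release_day if after else cap
    return false_positives, min(detection_time, cap)

# Hypothetical alert days for one simulated data set with release on day 400.
result = evaluate_alerts([100, 250, 403, 410], release_day=400)
no_detection = evaluate_alerts([100], release_day=400)
```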

Other Algorithms used in Simulation
1. Standard algorithm [Plot: Signal against Time, with the Mean and the Upper Safe Range marked]
2. WSARE 2.0
3. WSARE 2.5: Use all past data but condition on environmental attributes

Results on Simulation

Conclusion
– One approach to biosurveillance: one algorithm monitoring millions of signals derived from multivariate data, instead of hundreds of univariate detectors
– WSARE is best used as a general purpose safety net in combination with other detectors
– Modeling historical data with Bayesian Networks to allow conditioning on unique features of today
– Computationally intense unless we use clever algorithms

Conclusion
– WSARE 2.0 deployed during the past year
– WSARE 3.0 about to go online
– WSARE now being extended to additionally exploit over the counter medicine sales

For more information References:
– Wong, W. K., Moore, A. W., Cooper, G., and Wagner, M. (2002). Rule-based Anomaly Pattern Detection for Detecting Disease Outbreaks. Proceedings of AAAI-02 (pp ). MIT Press.
– Wong, W. K., Moore, A. W., Cooper, G., and Wagner, M. (2003). Bayesian Network Anomaly Pattern Detection for Disease Outbreaks. Proceedings of ICML
– Moore, A., and Wong, W. K. (2003). Optimal Reinsertion: A New Search Operator for Accelerated and More Accurate Bayesian Network Structure Learning. Proceedings of ICML
AUTON lab website: