What’s Strange About Recent Events (WSARE) v3.0: Adjusting for a Changing Baseline Weng-Keen Wong (Carnegie Mellon University) Andrew Moore (Carnegie Mellon.

Slides:



Advertisements
Similar presentations
Whats Strange About Recent Events (WSARE) Weng-Keen Wong (Carnegie Mellon University) Andrew Moore (Carnegie Mellon University) Gregory Cooper (University.
Advertisements

2005 Syndromic Surveillance1 Estimating the Expected Warning Time of Outbreak- Detection Algorithms Yanna Shen, Weng-Keen Wong, Gregory F. Cooper RODS.
 2005 Carnegie Mellon University A Bayesian Scan Statistic for Spatial Cluster Detection Daniel B. Neill 1 Andrew W. Moore 1 Gregory F. Cooper 2 1 Carnegie.
Early Statistical Detection of Bio-Terrorism Attacks by Tracking OTC Medication Sales Galit Shmueli Dept. of Statistics and CALD Carnegie Mellon University.
1 A Tutorial on Bayesian Networks Weng-Keen Wong School of Electrical Engineering and Computer Science Oregon State University.
Optimizing Disease Outbreak Detection Methods Using Reinforcement Learning Masoumeh Izadi Clinical & Health Informatics Research Group Faculty of Medicine,
Bayesian Biosurveillance Gregory F. Cooper Center for Biomedical Informatics University of Pittsburgh The research described in this.
Project Mimic: Simulation for Syndromic Surveillance Thomas Lotze Applied Mathematics and Scientific Computation University of Maryland Galit Shmueli and.
Surveillance of gastroenteritis using drug sales data in France Mathilde Pivette, PharmD, MPH Pr Avner Bar-Hen Dr Pascal Crépey.
Integrating Bayesian Networks and Simpson’s Paradox in Data Mining Alex Freitas University of Kent Ken McGarry University of Sunderland.
An introduction to time series approaches in biosurveillance Professor The Auton Lab School of Computer Science Carnegie Mellon University
 2004 University of Pittsburgh Bayesian Biosurveillance Using Multiple Data Streams Weng-Keen Wong, Greg Cooper, Denver Dash *, John Levander, John Dowling,
 2004 University of Pittsburgh Bayesian Biosurveillance Using Multiple Data Streams Greg Cooper, Weng-Keen Wong, Denver Dash*, John Levander, John Dowling,
UNIVERSITY OF SOUTH CAROLINA Department of Computer Science and Engineering On-line Alert Systems for Production Plants A Conflict Based Approach.
1/55 EF 507 QUANTITATIVE METHODS FOR ECONOMICS AND FINANCE FALL 2008 Chapter 10 Hypothesis Testing.
Multi-Scale Analysis for Network Traffic Prediction and Anomaly Detection Ling Huang Joint work with Anthony Joseph and Nina Taft January, 2005.
The Space-Time Scan Statistic for Multiple Data Streams
Summarization and Deviation Detection -- What is new?
Exploiting Correlated Attributes in Acquisitional Query Processing Amol Deshpande University of Maryland Joint work with Carlos Sam
Weng-Keen Wong, Oregon State University © Bayesian Networks: A Tutorial Weng-Keen Wong School of Electrical Engineering and Computer Science Oregon.
1 Graduate Statistics Student, 2 Undergraduate Computer Science Student, 3 Professor and Director of Statistical Consulting Collaboratory 4 Chief Technology.
Conclusions On our large scale anthrax attack simulations, being able to infer the work zip appears to improve detection time over just using the home.
Population-Wide Anomaly Detection Weng-Keen Wong 1, Gregory Cooper 2, Denver Dash 3, John Levander 2, John Dowling 2, Bill Hogan 2, Michael Wagner 2 1.
Value of Information for Complex Economic Models Jeremy Oakley Department of Probability and Statistics, University of Sheffield. Paper available from.
Bayesian Network Anomaly Pattern Detection for Disease Outbreaks Weng-Keen Wong (Carnegie Mellon University) Andrew Moore (Carnegie Mellon University)
1 Bayesian Network Anomaly Pattern Detection for Disease Outbreaks Weng-Keen Wong (Carnegie Mellon University) Andrew Moore (Carnegie Mellon University)
Statistical Process Control
Anomaly detection Problem motivation Machine Learning.
Chapter 10 Hypothesis Testing
SPONSOR JAMES C. BENNEYAN DEVELOPMENT OF A PRESCRIPTION DRUG SURVEILLANCE SYSTEM TEAM MEMBERS Jeffrey Mason Dan Mitus Jenna Eickhoff Benjamin Harris.
EVENT MANAGEMENT IN MULTIVARIATE STREAMING SENSOR DATA National and Kapodistrian University of Athens.
A Wavelet-based Anomaly Detector for Disease Outbreaks Thomas Lotze Galit Shmueli University of Maryland College Park Sean Murphy Howard Burkom Johns Hopkins.
Bayesian networks Classification, segmentation, time series prediction and more. Website: Twitter:
A* Lasso for Learning a Sparse Bayesian Network Structure for Continuous Variances Jing Xiang & Seyoung Kim Bayesian Network Structure Learning X 1...
What’s Strange About Recent Events (WSARE) Weng-Keen Wong (University of Pittsburgh) Andrew Moore (Carnegie Mellon University) Gregory Cooper (University.
Digital Statisticians INST 4200 David J Stucki Spring 2015.
Daniel Guetta (DRO)Transitional Care Units IEOR Final Project 9 th May 2012 Daniel Guetta Joint work with Carri Chan.
1 Novel Influenza A H1N1 Outbreak: The Florida Response Epidemiology Perspective: Situation Update.
Biosurveillance Detection Algorithms: Slide 1 Copyright © 2002, 2003, Andrew Moore Detection Algorithms for Biosurveillance: A tutorial Greg Cooper ProfessorComputer.
Acknowledgements Contact Information Anthony Wong, MTech 1, Senthil K. Nachimuthu, MD 1, Peter J. Haug, MD 1,2 Patterns and Rules  Vital signs medoids.
1 Auton Lab Walkerton Analysis. Proprietary Information. Early Analysis of Walkerton Data Version 11, June 18 th, 2005 Auton Lab:
Module networks Sushmita Roy BMI/CS 576 Nov 18 th & 20th, 2014.
Probability Course web page: vision.cis.udel.edu/cv March 19, 2003  Lecture 15.
METU Informatics Institute Min720 Pattern Classification with Bio-Medical Applications Lecture notes 9 Bayesian Belief Networks.
Biosurveillance Detection Algorithms: Slide 1 Copyright © 2002, 2003, 2004 Andrew Moore Detection Algorithms for Biosurveillance: A tutorial RODS:
WHAT IS DATA MINING?  The process of automatically extracting useful information from large amounts of data.  Uses traditional data analysis techniques.
~PPT Howard Burkom 1, PhD Yevgeniy Elbert 2, MSc LTC Julie Pavlin 2, MD MPH Christina Polyak 2, MPH 1 The Johns Hopkins University Applied Physics.
No More Black Box: Methods for visualizing and understanding your data for useful analysis Howard Burkom National Security Technology Department Johns.
WHAT IS DATA MINING?  The process of automatically extracting useful information from large amounts of data.  Uses traditional data analysis techniques.
Bayesian Biosurveillance of Disease Outbreaks RODS Laboratory Center for Biomedical Informatics University of Pittsburgh Gregory F. Cooper, Denver H.
Weng-Keen Wong, Oregon State University © Bayesian Networks: A Tutorial Weng-Keen Wong School of Electrical Engineering and Computer Science Oregon.
Towards Improved Sensitivity, Specificity, and Timeliness of Syndromic Surveillance Systems Anna L. Buczak, PhD, Linda J. Moniz, PhD, Joseph Lombardo,
CSE 4705 Artificial Intelligence
By Arijit Chatterjee Dr
Online Conditional Outlier Detection in Nonstationary Time Series
Dept of Biostatistics, Emory University
QianZhu, Liang Chen and Gagan Agrawal
From last time: on-policy vs off-policy Take an action Observe a reward Choose the next action Learn (using chosen action) Take the next action Off-policy.
Bayesian Networks: A Tutorial
Bayesian Biosurveillance of Disease Outbreaks
APHA, Washington, November, 2007
New Directions in Pre-Syndromic and Subpopulation Health Surveillance
Baselining PMU Data to Find Patterns and Anomalies
One Health Early Warning Alert
Estimating the Expected Warning Time of Outbreak-Detection Algorithms
Pattern Recognition and Image Analysis
What’s Strange About Recent Events (WSARE)
Improving Overlap Farrokh Alemi, Ph.D.
Using Bayesian Network in the Construction of a Bi-level Multi-classifier. A Case Study Using Intensive Care Unit Patients Data B. Sierra, N. Serrano,
Presentation transcript:

What’s Strange About Recent Events (WSARE) v3.0: Adjusting for a Changing Baseline Weng-Keen Wong (Carnegie Mellon University) Andrew Moore (Carnegie Mellon University) Gregory Cooper (University of Pittsburgh) Michael Wagner (University of Pittsburgh)

Motivation Primary Key DateTimeHospitalICD9ProdromeGenderAgeHome Location Work Location Many more… 1006/1/039:121781FeverM20sNE?… 1016/1/0310:451787DiarrheaF40sNE … 1026/1/0311:031786RespiratoryF60sNEN… 1036/1/0311:072787DiarrheaM60sE?… 1046/1/0312:151717RespiratoryM60sENE… 1056/1/0313:013780ViralF50s?NW… 1066/1/0313:053487RespiratoryF40sSW … 1076/1/0313:572786UnmappedM50sSESW… 1086/1/0314:221780ViralM40s??… : : : : : : : : : : : Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved)

The Problem From this data, can we detect if a disease outbreak is happening?

The Problem From this data, can we detect if a disease outbreak is happening? We’re talking about a non- specific disease detection

The Problem From this data, can we detect if a disease outbreak is happening? How early can we detect it?

The Problem From this data, can we detect if a disease outbreak is happening? How early can we detect it? The question we’re really asking: In the last n hours, has anything strange happened?

Traditional Approaches What about using traditional anomaly detection? Typically assume data is generated by a model Finds individual data points that have low probability with respect to this model These outliers have rare attributes or combinations of attributes Need to identify anomalous patterns not isolated data points

Traditional Approaches –Time series algorithms –Regression techniques –Statistical Quality Control methods Need to know apriori which attributes to form daily aggregates for! What about monitoring aggregate daily counts of certain attributes? We’ve now turned multivariate data into univariate data Lots of algorithms have been developed for monitoring univariate data:

Traditional Approaches What if we don’t know what attributes to monitor?

Traditional Approaches What if we don’t know what attributes to monitor? What if we want to exploit the spatial, temporal and/or demographic characteristics of the epidemic to detect the outbreak as early as possible?

Traditional Approaches We need to build a univariate detector to monitor each interesting combination of attributes: Diarrhea cases among children Respiratory syndrome cases among females Viral syndrome cases involving senior citizens from eastern part of city Number of children from downtown hospital Number of cases involving people working in southern part of the city Number of cases involving teenage girls living in the western part of the city Botulinic syndrome cases And so on…

Traditional Approaches We need to build a univariate detector to monitor each interesting combination of attributes: Diarrhea cases among children Respiratory syndrome cases among females Viral syndrome cases involving senior citizens from eastern part of city Number of children from downtown hospital Number of cases involving people working in southern part of the city Number of cases involving teenage girls living in the western part of the city Botulinic syndrome cases And so on… You’ll need hundreds of univariate detectors! We would like to identify the groups with the strangest behavior in recent events.

Our Approach We use Rule-Based Anomaly Pattern Detection Association rules used to characterize anomalous patterns. For example, a two-component rule would be: Gender = Male AND 40  Age < 50 Related work: –Market basket analysis [Agrawal et. al, Brin et. al.] –Contrast sets [Bay and Pazzani] –Spatial Scan Statistic [Kulldorff] –Association Rules and Data Mining in Hospital Infection Control and Public Health Surveillance [Brossette et. al.]

WSARE v2.0 “Last 24 hours” “Ignore key” Primary Key DateTimeHospitalICD9ProdromeGenderAgeHome Location Work Location Many more… 1006/1/039:121781FeverM20sNE?… 1016/1/0310:451787DiarrheaF40sNE … 1026/1/0311:031786RespiratoryF60sNEN… : : : : : : : : : : : Inputs: 1. Multivariate date/time-indexed biosurveillance- relevant data stream 2. Time Window Length 3. Which attributes to use? “Emergency Department Data”

WSARE v2.0 Outputs: 1. Here are the records that most surprise me 2. Here’s why 3. And here’s how seriously you should take it Primary Key DateTimeHospitalICD9ProdromeGenderAgeHome Location Work Location Many more… 1006/1/039:121781FeverM20sNE?… 1016/1/0310:451787DiarrheaF40sNE … 1026/1/0311:031786RespiratoryF60sNEN… : : : : : : : : : : : Inputs: 1. Multivariate date/time-indexed biosurveillance- relevant data stream 2. Time Window Length 3. Which attributes to use?

WSARE v2.0 Overview 2.Search for rule with best score 3.Determine p-value of best scoring rule through randomization test All Data 4.If p-value is less than threshold, signal alert Recent Data Baseline 1.Obtain Recent and Baseline datasets

Obtaining the Baseline (WSARE v2.0) Baseline Here we use data from 35,42, 49 and 56 days ago Assumed to capture non-epidemic behavior. We use raw historical data.

Obtaining the Baseline (WSARE v2.0) Baseline What if data from 7, 14, 21 and 28 days ago is better? Assumed to capture non-epidemic behavior. We use raw historical data.

Obtaining the Baseline (WSARE v2.0) Baseline What if we could automatically generate the baseline? Assumed to capture non-epidemic behavior. We use raw historical data.

Temporal Trends But health care data has many different trends due to –Seasonal effects in temperature and weather –Day of Week effects –Holidays –Etc. Allowing the baseline to be affected by these trends may dramatically alter the detection time and false positives of the detection algorithm

Temporal Trends From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences (pp )

WSARE v3.0 Generate the baseline… “Taking into account recent flu levels…” “Taking into account that today is a public holiday…” “Taking into account that this is Spring…” “Taking into account recent heatwave…” “Taking into account that there’s a known natural Food- borne outbreak in progress…” Bonus: More efficient use of historical data

Idea: Bayesian Networks “On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems” “Patients from West Park Hospital are less likely to be young” “On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon” “The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic”

What is a Bayesian Network? Has_Flu Age Weather The arrows say something about the conditional independence structure of the attributes. They do not necessarily say anything about causality.

What is a Bayesian Network? Has_Flu Age Weather Bayesian Network: A graphical model representing the joint probability distribution of a set of random variables From the Bayesian Network above, we can get: P(Age = Senior, Weather = Cold, Has_Flu = True) More importantly, we can get: P(Has_Flu = True | Age = Senior, Weather = Cold )

How do we come up with the Bayesian Network Structure? 1.By hand 2.By learning it from historical data Lots of different algorithms for doing this We use Optimal Reinsertion [Moore and Wong 2003]

Obtaining Baseline Data Baseline All Historical Data Today’s Environment 1.Learn Bayesian Network 2. Generate baseline given today’s environment

Obtaining Baseline Data Baseline All Historical Data Today’s Environment 1.Learn Bayesian Network 2. Generate baseline given today’s environment What should be happening today given today’s environment

Environmental Attributes Divide the data into two types of attributes: Environmental attributes: attributes that cause trends in the data eg. day of week, season, weather, flu levels Response attributes: all other non- environmental attributes eg. age, gender, symptom information

Environmental Attributes When learning the Bayesian network structure, do not allow environmental attributes to have parents. Why? We are not interested in predicting their distributions Instead, we use them to predict the distributions of the response attributes Side Benefit: We can speed up the structure search by avoiding DAGs that assign parents to the environmental attributes SeasonDay of WeekWeatherFlu Level

Step 2: Generate Baseline Given Today’s Environment SeasonDay of WeekWeatherFlu Level TodayWinterMondaySnowHigh Season = Winter Day of Week = Monday Weather = Snow Flu Level = High Suppose we know the following for today: We fill in these values for the environmental attributes in the learned Bayesian network Baseline We sample records from the Bayesian network and make this data set the baseline

Step 2: Generate Baseline Given Today’s Environment SeasonDay of WeekWeatherFlu Level TodayWinterMondaySnowHigh Season = Winter Day of Week = Monday Flu Level = High Suppose we know the following for today: We fill in these values for the environmental attributes in the learned Bayesian network Baseline We sample records from the Bayesian network and make this data set the baseline Sampling is easy because environmental attributes are at the top of the Bayes Net Weather = Snow

Simulator

Simulation 100 different data sets (available on web) Each data set consisted of a two year period Anthrax release occurred at a random point during the second year Algorithms allowed to train on data from the current day back to the first day in the simulation Any alerts before actual anthrax release are considered a false positive Detection time calculated as first alert after anthrax release. If no alerts raised, cap detection time at 14 days

Other Algorithms used in Simulation Signal Time Mean Upper Safe Range 1. Control Chart 2. WSARE 2.0 Create baseline using historical data from 7, 14, 21 and 28 days ago 3. WSARE 2.5 Use all past data but condition on environmental attributes

Results on Simulation

Results on Actual ED Data from Sat : SCORE = PVALUE = % ( 74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodome = False 7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False 2. Sat : SCORE = PVALUE = % ( 58/467) of today's cases have Respiratory Syndrome = True 6.53% (653/10000) of baseline have Respiratory Syndrome = True 3. Wed : SCORE = PVALUE = % ( 9/625) of today's cases have 100 <= Age < % ( 8/10000) of baseline have 100 <= Age < Sun : SCORE = PVALUE = % (481/574) of today's cases have Unknown Syndrome = False 74.29% (7430/10001) of baseline have Unknown Syndrome = False 5. Thu : SCORE = PVALUE = % ( 70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False 7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False 6. Thu : SCORE = PVALUE = % ( 38/443) of today's cases have Hospital ID = 1 and Viral Syndrome = True 2.40% (240/10000) of baseline have Hospital ID = 1 and Viral Syndrome = True

Conclusion One approach to biosurveillance: one algorithm monitoring millions of signals derived from multivariate data instead of Hundreds of univariate detectors WSARE is best used as a general purpose safety net in combination with other detectors Modeling historical data with Bayesian Networks to allow conditioning on unique features of today Computationally intense unless we use clever algorithms

Conclusion WSARE 2.0 deployed during the past year WSARE 3.0 to be deployed online WSARE now being extended to additionally exploit over the counter medicine sales

For more information References: Wong, W. K., Moore, A. W., Cooper, G., and Wagner, M. (2002). Rule-based Anomaly Pattern Detection for Detecting Disease Outbreaks. Proceedings of AAAI-02 (pp ). MIT Press. Wong, W. K., Moore, A. W., Cooper, G., and Wagner, M. (2003). Bayesian Network Anomaly Pattern Detection for Disease Outbreaks. Proceedings of ICML Moore, A., and Wong, W. K. (2003). Optimal Reinsertion: A New Search Operator for Accelerated and More Accurate Bayesian Network Structure Learning. Proceedings of ICML AUTON lab website: