Summarization and Deviation Detection -- What is new?



Outline
- Summarization
- KEFIR: Key Findings Reporter
- WSARE: What is Strange About Recent Events

What is New?
[Figure: old data compared with new data]

Summarization
- Concisely summarize what is new, different, and unexpected
  - with respect to previous values
  - with respect to expected values
  - …
- Focus on what is actionable!

Problem: Healthcare Costs
- Healthcare costs in the US: 1 out of every 7 GDP dollars, and rising
  - potential problems: fraud, misuse, …
  - understanding where the problems are is the first step to fixing them
- GTE: self-insured for medical costs
  - GTE healthcare costs: $X00,000,000
- Task: analyze employee healthcare data and generate a report that describes the major problems

GTE Key Findings Reporter: KEFIR
- KEFIR approach:
  - Analyze all possible deviations
  - Select interesting findings
  - Augment key findings with:
    - explanations of plausible causes
    - recommendations of appropriate actions
  - Convert findings to a user-friendly report with text and graphics

KEFIR Search Space

Drill-Down Example

What Change Is Important?

Deviation Detection
- Drill down through the search space
- Generate a finding for each measure:
  - deviation from the previous period
  - deviation from the norm
  - deviation projected for the next period, if no action is taken
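
The three deviation types above can be sketched for a single measure as follows. This is a minimal illustration, not KEFIR's actual code: the `Finding` record, the `make_finding` helper, the sample numbers, and the naive linear projection are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    measure: str
    change_from_previous: float  # relative change vs. the previous period
    deviation_from_norm: float   # relative gap vs. the norm
    projected_next: float        # projected value for the next period

def make_finding(measure, previous, current, norm):
    """Build one finding for a measure (hypothetical helper)."""
    change = (current - previous) / previous
    deviation = (current - norm) / norm
    # Assumption: if no action is taken, the last change simply repeats.
    projected = current + (current - previous)
    return Finding(measure, change, deviation, projected)

# Hypothetical values for illustration only.
finding = make_finding("admission rate per 1000",
                       previous=80.0, current=92.0, norm=85.0)
```

A real system would generate one such finding per measure at every node of the drill-down search space.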

Interestingness of Deviations
- Impact: how much the deviation affects the bottom line
- Savings Percentage: how much of the deviation from the norm can be expected to be saved by the action
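
One way to combine the two factors is to rank findings by the savings an action might recover. The sketch below assumes interestingness is the product of impact and savings percentage; the slide names the two factors but does not state the formula, and the findings listed are made-up values.

```python
def interestingness(impact_dollars, savings_pct):
    """Assumed scoring: expected savings = impact * savings percentage."""
    return impact_dollars * savings_pct

# (study area, impact in dollars, savings percentage): hypothetical values.
findings = [
    ("inpatient admissions", 1_000_000.0, 0.20),
    ("outpatient visits", 2_500_000.0, 0.05),
    ("prescriptions", 400_000.0, 0.30),
]

# Most interesting finding first.
ranked = sorted(findings,
                key=lambda f: interestingness(f[1], f[2]),
                reverse=True)
```

Under this assumption a large deviation with little recoverable savings can rank below a smaller deviation that an intervention could mostly eliminate.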

Recommendations
Hierarchical recommendation rules define appropriate intervention strategies for important measures and study areas.
Example:
If measure = admission rate per 1000 & study_area = Inpatient admissions & percent_change > 0.10
Then Utilization review is needed in the area of admission certification.
Expected Savings: 20%
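
The example rule can be written as a small predicate over a finding. This is only a sketch: the dictionary keys mirror the names on the slide, but the function name and the data layout are assumptions, not KEFIR's rule language.

```python
def admission_rule(finding):
    """Return a recommendation if the slide's example rule fires, else None."""
    if (finding["measure"] == "admission rate per 1000"
            and finding["study_area"] == "Inpatient admissions"
            and finding["percent_change"] > 0.10):
        return {
            "recommendation": ("Utilization review is needed in the area "
                               "of admission certification."),
            "expected_savings": 0.20,
        }
    return None

# Hypothetical findings: one that fires the rule and one that does not.
fired = admission_rule({"measure": "admission rate per 1000",
                        "study_area": "Inpatient admissions",
                        "percent_change": 0.15})
not_fired = admission_rule({"measure": "admission rate per 1000",
                            "study_area": "Inpatient admissions",
                            "percent_change": 0.05})
```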

Explanation
A measure is explained by finding the path of related measures with the highest impact.
Example: The large increase in m1 in group s1 was caused by an increase in m3, which was caused by a rise in m5, primarily in sector s13.
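
A simple reading of "path of related measures with the highest impact" is a greedy walk: from each measure, step to the related measure with the largest impact. The sketch below assumes that reading and an acyclic relation graph; the measure names and impact values are hypothetical.

```python
def explain(measure, related, impact):
    """Follow the highest-impact related measure until none remain."""
    path = [measure]
    while related.get(path[-1]):
        next_measure = max(related[path[-1]], key=lambda m: impact[m])
        path.append(next_measure)
    return path

# Hypothetical relation graph (must be acyclic) and impact values.
related = {"m1": ["m2", "m3"], "m3": ["m4", "m5"]}
impact = {"m2": 10.0, "m3": 40.0, "m4": 5.0, "m5": 30.0}

path = explain("m1", related, impact)  # ['m1', 'm3', 'm5']
```

The resulting path maps directly onto a sentence template such as the m1, m3, m5 example above.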

Report Generation
- Automatic generation of business-user-oriented reports
- Natural language generation with template matching
- Graphics
- Delivered via browser

Sample KEFIR pages
[Screenshots: Overview; Inpatient admissions]

Status
- Prototype implemented at GTE in 1995
- KEFIR received GTE’s highest award for technical achievement in 1995
- Key business user left GTE in 1996 and the system was no longer used
- Publication: C. Matheus, G. Piatetsky-Shapiro, and D. McNeill, "Selecting and Reporting What is Interesting: The KEFIR Application to Healthcare Data," in Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996

What’s Strange About Recent Events (WSARE)
Weng-Keen Wong (Carnegie Mellon University)
Andrew Moore (Carnegie Mellon University)
Gregory Cooper (University of Pittsburgh)
Michael Wagner (University of Pittsburgh)
Designed to be easily applicable to any date/time-indexed biosurveillance-relevant data stream

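Motivation
Suppose we have access to Emergency Department data from hospitals around a city (with patient confidentiality preserved):

Primary Key | Date   | Time  | Hospital | ICD9 | Prodrome    | Gender | Age | Home Location | Work Location | Many more…
100         | 6/1/03 | 9:12  | 1        | 781  | Fever       | M      | 20s | NE            | ?             | …
101         | 6/1/03 | 10:45 | 1        | 787  | Diarrhea    | F      | 40s | NE            |               | …
102         | 6/1/03 | 11:03 | 1        | 786  | Respiratory | F      | 60s | NE            | N             | …
103         | 6/1/03 | 11:07 | 2        | 787  | Diarrhea    | M      | 60s | E             | ?             | …
104         | 6/1/03 | 12:15 | 1        | 717  | Respiratory | M      | 60s | E             | NE            | …
105         | 6/1/03 | 13:01 | 3        | 780  | Viral       | F      | 50s | ?             | NW            | …
106         | 6/1/03 | 13:05 | 3        | 487  | Respiratory | F      | 40s | SW            |               | …
107         | 6/1/03 | 13:57 | 2        | 786  | Unmapped    | M      | 50s | SE            | SW            | …
108         | 6/1/03 | 14:22 | 1        | 780  | Viral       | M      | 40s | ?             | ?             | …
:           | :      | :     | :        | :    | :           | :      | :   | :             | :             | :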

Traditional Approaches
We need to build a univariate detector to monitor each interesting combination of attributes:
- Diarrhea cases among children
- Respiratory syndrome cases among females
- Viral syndrome cases involving senior citizens from the eastern part of the city
- Number of children from the downtown hospital
- Number of cases involving people working in the southern part of the city
- Number of cases involving teenage girls living in the western part of the city
- Botulinic syndrome cases
- And so on…
You’ll need hundreds of univariate detectors!
We would like to identify the groups with the strangest behavior in recent events.

WSARE Approach
- Rule-Based Anomaly Pattern Detection
- Association rules are used to characterize anomalous patterns. For example, a two-component rule would be: Gender = Male AND 40 <= Age < 50
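
A rule of this kind can be represented as a list of attribute tests, with a record matching when every component holds. The representation below is an assumption chosen for brevity, not the WSARE data structure, and the records are invented.

```python
def matches(record, rule):
    """True if the record satisfies every (attribute, predicate) component."""
    return all(predicate(record[attribute]) for attribute, predicate in rule)

# The slide's two-component rule: Gender = Male AND 40 <= Age < 50.
rule = [
    ("gender", lambda g: g == "M"),
    ("age", lambda a: 40 <= a < 50),
]

# Hypothetical records for illustration.
records = [
    {"gender": "M", "age": 45},
    {"gender": "F", "age": 42},
    {"gender": "M", "age": 61},
]

match_count = sum(matches(r, rule) for r in records)  # 1
```

Counting matches separately in the recent and baseline data gives the cell counts a rule needs for scoring.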

WSARE v2.0 Overview
1. Obtain Recent and Baseline datasets from All Data
2. Search for the rule with the best score
3. Determine the p-value of the best-scoring rule through a randomization test
4. If the p-value is less than the threshold, signal an alert

Step 1: Obtain Recent and Baseline Data
- Recent Data: data from the last 24 hours
- Baseline: assumed to capture non-outbreak behavior. We use data from 35, 42, 49 and 56 days prior to the current day.
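
The baseline days can be computed directly from the current date. A small sketch, with the function name and the example date chosen only for illustration:

```python
from datetime import date, timedelta

def baseline_dates(today, offsets=(35, 42, 49, 56)):
    """Dates whose records form the baseline for `today`."""
    return [today - timedelta(days=d) for d in offsets]

today = date(2003, 6, 1)
dates = baseline_dates(today)
# Every offset is a multiple of 7, so each baseline day falls on the
# same day of the week as today.
same_weekday = all(d.weekday() == today.weekday() for d in dates)
```

Because the offsets are whole weeks, day-of-week effects are matched between the recent and baseline data.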

Example
Sat …:
- 35.8% (48/134) of today's cases have 30 <= age < 40
- 17.0% (45/265) of other (baseline) cases have 30 <= age < 40

Step 2: Search for Best Rule
For each rule, form a 2x2 contingency table, e.g.:

               | Count Recent | Count Baseline
Age Decile = … | …            | …
Age Decile ≠ … | …            | …

- Perform Fisher’s Exact Test to get a p-value (score) for each rule (for this data: …)
- Find the rule R_BEST with the lowest score.
- Caution: this score is not the true p-value of R_BEST because of multiple tests.
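
To make the scoring step concrete, here is a sketch of a two-sided Fisher's exact test written with the standard library only. It is applied to the counts from the age example (48 of 134 recent cases, 45 of 265 baseline cases). The two-sided convention used here, which sums all tables no more likely than the observed one, is an assumption; the slides do not say which variant WSARE uses.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum every table that is no more likely than the observed one.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Rows: rule matches / does not match. Columns: recent / baseline.
score = fisher_exact_two_sided(48, 45, 134 - 48, 265 - 45)
```

The score is small because the rule matches a much larger share of recent cases than of baseline cases; as the slide cautions, it is not yet a valid p-value when many rules are searched.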

Step 3: Randomization Test
- Take the recent cases and the baseline cases. Shuffle the date field to produce a randomized dataset called DB_Rand.
- Find the rule with the best score on DB_Rand.

Case | Original date | Shuffled date
C2   | June 4, 2002  | June 4, 2002
C3   | June 5, 2002  | June 12, 2002
C4   | June 12, 2002 | July 31, 2002
C5   | June 19, 2002 | June 26, 2002
C6   | June 26, 2002 | July 31, 2002
C7   | June 26, 2002 | June 5, 2002
C8   | July 2, 2002  | July 2, 2002
C9   | July 3, 2002  | July 3, 2002
C10  | July 10, 2002 | July 10, 2002
C11  | July 17, 2002 | July 17, 2002
C12  | July 24, 2002 | July 24, 2002
C13  | July 30, 2002 | July 30, 2002
C14  | July 31, 2002 | June 19, 2002
C15  | July 31, 2002 | June 26, 2002

Step 3: Randomization Test (continued)
- Repeat the procedure on the previous slide for 1000 iterations.
- Determine how many scores from the 1000 iterations are better than the original score.
- [Figure: distribution of the 1000 randomized scores] If the original score were here, it would place in the top 1% of the 1000 scores from the randomization test. We would be impressed and an alert should be raised.
- Estimated p-value of the rule is: # better scores / # iterations
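
The randomization test can be sketched as below. Shuffling which cases carry the "recent" label is equivalent to shuffling the date field. To keep the block self-contained, `gap_score` is a simple stand-in score (lower is better) rather than Fisher's exact test, and the data are invented; only the shuffle-and-count logic follows the slides.

```python
import random

def gap_score(values, is_recent):
    """Stand-in score, lower is better: minus the largest gap between the
    recent and baseline share of any single value (NOT Fisher's test)."""
    n_recent = sum(is_recent)
    n_base = len(values) - n_recent
    best = 0.0
    for v in set(values):
        recent = sum(1 for x, r in zip(values, is_recent) if r and x == v)
        base = sum(1 for x, r in zip(values, is_recent) if not r and x == v)
        best = min(best, -abs(recent / n_recent - base / n_base))
    return best

def randomization_p_value(values, is_recent, score, iterations=1000, seed=0):
    """Estimated p-value = # better scores / # iterations."""
    rng = random.Random(seed)
    original = score(values, is_recent)
    labels = list(is_recent)
    better = 0
    for _ in range(iterations):
        rng.shuffle(labels)  # reassign which cases count as recent
        if score(values, labels) < original:
            better += 1
    return better / iterations

# Hypothetical data: all 10 recent cases are fever; the baseline is mixed.
values = ["rash"] * 30 + ["fever"] * 20
is_recent = [False] * 40 + [True] * 10
p_value = randomization_p_value(values, is_recent, gap_score)
```

Because the best rule is searched again on every shuffled dataset, the estimate accounts for the multiple tests that make the raw score unreliable.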

Results on Actual ED Data
1. Sat …: SCORE = …, PVALUE = …
   14.80% (74/500) of today's cases have Viral Syndrome = True and Encephalitic Prodrome = False
   7.42% (742/10000) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False
2. Sat …: SCORE = …, PVALUE = …
   12.42% (58/467) of today's cases have Respiratory Syndrome = True
   6.53% (653/10000) of baseline have Respiratory Syndrome = True
3. Wed …: SCORE = …, PVALUE = …
   1.44% (9/625) of today's cases have 100 <= Age < …
   0.08% (8/10000) of baseline have 100 <= Age < …
4. Sun …: SCORE = …, PVALUE = …
   83.80% (481/574) of today's cases have Unknown Syndrome = False
   74.29% (7430/10001) of baseline have Unknown Syndrome = False
5. Thu …: SCORE = …, PVALUE = …
   14.71% (70/476) of today's cases have Viral Syndrome = True and Encephalitic Syndrome = False
   7.89% (789/9999) of baseline have Viral Syndrome = True and Encephalitic Syndrome = False

WSARE v3.0: Improving the Baseline
Recall that the baseline was assumed to be captured by data from 35, 42, 49, and 56 days prior to the current day.
- What if this assumption isn’t true?
- What if data from 7, 14, 21 and 28 days prior is better?
We would like to determine the baseline automatically!

Temporal Trends
[Figure] From: Goldenberg, A., Shmueli, G., Caruana, R. A., and Fienberg, S. E. (2002). Early statistical detection of anthrax outbreaks by tracking over-the-counter medication sales. Proceedings of the National Academy of Sciences.

WSARE v3.0
Generate the baseline…
- “Taking into account recent flu levels…”
- “Taking into account that today is a public holiday…”
- “Taking into account that this is Spring…”
- “Taking into account recent heatwave…”
- “Taking into account that there’s a known natural food-borne outbreak in progress…”
Bonus: More efficient use of historical data

Idea: Bayesian Networks
Bayesian Network: a graphical model representing the joint probability distribution of a set of random variables
- “On Cold Tuesday Mornings the folks coming in from the North part of the city are more likely to have respiratory problems”
- “Patients from West Park Hospital are less likely to be young”
- “On the day after a major holiday, expect a boost in the morning followed by a lull in the afternoon”
- “The Viral prodrome is more likely to co-occur with a Rash prodrome than Botulinic”

Obtaining Baseline Data
1. Learn a Bayesian network from all historical data
2. Generate the baseline given today’s environment
The baseline represents what should be happening today given today’s environment.
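
The two steps can be illustrated with a deliberately tiny stand-in: a "network" with a single edge from an environment variable to a symptom, learned by counting and then sampled for today's environment. Real WSARE v3.0 learns a full Bayesian network over many attributes; the function names and data here are hypothetical.

```python
import random
from collections import Counter, defaultdict

def learn_conditional(history):
    """Step 1 (simplified): count symptoms for each environment value."""
    table = defaultdict(Counter)
    for environment, symptom in history:
        table[environment][symptom] += 1
    return table

def generate_baseline(table, todays_environment, n, seed=0):
    """Step 2 (simplified): sample n baseline cases given today's environment."""
    rng = random.Random(seed)
    counts = table[todays_environment]
    symptoms = sorted(counts)
    weights = [counts[s] for s in symptoms]
    return rng.choices(symptoms, weights=weights, k=n)

# Hypothetical history of (season, symptom) pairs.
history = ([("winter", "respiratory")] * 8
           + [("winter", "rash")] * 2
           + [("summer", "rash")] * 10)
table = learn_conditional(history)
baseline = generate_baseline(table, "summer", n=100)
```

Sampling conditioned on today's environment is what lets the baseline reflect season, flu level, or holidays instead of a fixed set of past days.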

Simulation
[Figure: Bayesian network used for the simulation] Nodes: DATE, DAY OF WEEK, SEASON, FLU LEVEL, WEATHER, REGION, AGE, GENDER, Region Grassiness, Region Anthrax Concentration, Region Food Condition, Immune System, Outside Activity, Has Anthrax, Has Flu, Has Allergy, Has Heart Attack, Has Sunburn, Has Cold, Heart Health, Has Food Poisoning, Disease, ACTION, Actual Symptom, REPORTED SYMPTOM, DRUG
Actions: None, Purchase Medication, ED visit, Absent. If Action is not None, output record to dataset.

Simulation
- 100 different data sets
- Each data set consisted of a two-year period
- Anthrax release occurred at a random point during the second year
- Algorithms allowed to train on data from the current day back to the first day in the simulation
- Any alert before the actual anthrax release is considered a false positive
- Detection time calculated as first alert after anthrax release. If no alerts raised, cap detection time at 14 days
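
The evaluation rules above can be sketched for a single data set as follows. Two details are assumptions not stated on the slide: an alert on the release day itself counts as a detection, and alerts later than 14 days are also capped at 14. The day numbers are hypothetical.

```python
def evaluate(alert_days, release_day, cap=14):
    """Return (false positive count, detection time in days) for one data set."""
    false_positives = sum(1 for day in alert_days if day < release_day)
    # Assumption: an alert on the release day itself counts as detection.
    delays = [day - release_day for day in alert_days if day >= release_day]
    # Assumption: the cap also applies to alerts raised after `cap` days.
    detection_time = min(min(delays), cap) if delays else cap
    return false_positives, detection_time

# Hypothetical alert days within a two-year (730-day) simulation.
result = evaluate(alert_days=[100, 400, 505, 520], release_day=500)  # (2, 5)
no_detection = evaluate(alert_days=[100], release_day=500)           # (1, 14)
```

Averaging both numbers over the 100 data sets gives the trade-off between false positives and detection time.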

Simulation Plot
[Figure: simulated daily counts with the anthrax release marked (not the highest peak)]

Results on Simulation

Summary
- Summarization of what is new and interesting
- Key ideas
  - search many possible findings
  - compare to past data and expected data
  - avoid overfitting
  - focus on actionable changes
- Example systems
  - KEFIR (GTE)
  - WSARE (CMU/Pitt)