Published by Jesse Dalton. Modified over 9 years ago.
1
No More Black Box: Methods for visualizing and understanding your data for useful analysis
Howard Burkom
National Security Technology Department
Johns Hopkins University Applied Physics Laboratory
7th Annual Conference of the International Society for Disease Surveillance
Workshop Session: Public Health Track Analysis of Data
Raleigh, North Carolina
December 2, 2008
2
Computer Preparation
Windows and Microsoft EXCEL
Windows screen resolution: at least 1024 x 768
  From Windows: Start…Settings…Control Panel…Display
  Set screen resolution to 1024 x 768 or higher, then Apply…OK, and close Control Panel
EXCEL preparation: open EXCEL, and then:
  Allow use of macros in EXCEL files: Tools…Options…Security…Macro Security, choose Medium security…OK…OK
  Choose Enable Macros when you open workshop files
3
Outline
Time series that occur in disease surveillance
What information do we want from these time series?
Statistical properties relevant to situational awareness
Sample detection method and properties
Examples using modified, aggregated Distribute data
Summary and conclusions
4
Rash Syndrome Grouping of Diagnosis Codes
www.bt.cdc.gov/surveillance/syndromedef/word/syndromedefinitions.doc
5
Example: Daily Count Data with Expected Values
6
Sample DiSTRIBuTE Data
DiSTRIBuTE: Distributed Surveillance Taskforce for Real-time Influenza Burden Tracking and Evaluation
7
Aggregated Distribute Data Sample
8
State Level Daily Counts
9
State Level Weekly & Total Counts
10
Utility of Surveillance Data
What information do we want from these time series?
Early indications of potential public health events
  Infectious disease outbreaks
  Environmental exposures
  Bioterrorist attack
More recent emphasis: enhancement of situational awareness
  What are the usual patterns?
  What is the disease burden of a natural disaster or accident?
  How is the outbreak progressing, and who is at risk?
11
Relevant Statistical Properties of Surveillance Data
Important because they affect recognition of anomalies and choice of algorithm:
  Seasonal, day-of-week effects dependent on data source & filtering of records
  Variance, autocorrelation
  Cross-correlation among data sources, among time series
  Degree of sparseness
12
Seasonal, DOW Effects: Daily Mean Record Counts
13
Elements of an Alerting Algorithm
– Values to be tested: raw data, or residuals from a model?
– Baseline period
    Historical data used to determine expected data behavior
    Fixed or a sliding window?
    Outlier removal: to avoid training on unrepresentative data
    What does the algorithm do when the baseline data are all zero or missing?
    Is a warmup period of data history required?
– Buffer period (or guardband)
    Separation between the baseline period and interval to be tested
– Test period
    Interval of current data to be tested
– Reset criterion to prevent flooding by persistent alerts caused by extreme values
– Test statistic: value computed to make alerting decisions
– Threshold: alert issued if test statistic exceeds this value
14
Spreadsheet Objectives
To present underlying concepts of alerting methods and the adaptations needed for daily health surveillance
  – Independent of any specific system, software environment, or corporate infrastructure
  – To understand what complex systems are offering
To enable direct visualization of algorithm performance in immediate data context
To furnish a spreadsheet toolset for
  – focused data analysis & experimentation
  – independent checking of anomalies
  – sharing without language, database barriers
15
Example with Detection Statistic Plot
[Figure: detection statistic plotted against the threshold, annotated "Statistic Exceeds Threshold"]
16
Example: covering 1402 data days
17
Aggregated Distribute Data Sample: Weekly
18
Aggregated Distribute Data Sample: Daily
19
Data issues affecting monitoring
– Statistical properties
    Scale and random dispersion
– Periodic effects
    Day-of-week effects, seasonality
– Delayed (often variably) availability in monitoring system
– Trends: long/short term: many causes, incl. changes in:
    Population distribution or demographic composition
    Data provider participation
    Consumer health care behavior
    Coding or billing practices
– Prolonged data drop-outs, sometimes with catch-ups
– Outliers unrelated to infectious disease levels
    Often due to problems in data chain
    Inclement weather
    Media reports (example: the “Clinton effect”)
Most suitable for modeling without data-specific information
20
Adjustment for Known Time Series Behavior
Regression modeling
  – Direct modeling of known effects
  – Predictors: day of week, day of year, linear trend, holiday, post-holiday, max. daily temperature, others
Stratification
  – Separate monitoring, according to purpose, by
      Weekday and weekend/holiday, or day of week
      Region
      Age group
“Differential Detection”
  – Use related “context” data with similar features
  – Example: total counts to modulate syndromic counts
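The day-of-week adjustment can be illustrated with a minimal Python sketch. It is not the workshop spreadsheet's method. A regression whose only predictors are day-of-week indicators reduces to per-weekday means, so the sketch uses those means as expected values and returns the residuals that an alerting method could then test. The function names and the `first_weekday` parameter are hypothetical, chosen for illustration.

```python
from collections import defaultdict
from statistics import mean


def dow_expected(counts, first_weekday=0):
    """Return the mean count for each weekday index 0..6.

    counts: daily counts in date order.
    first_weekday: weekday index of counts[0] (hypothetical convention).
    """
    by_dow = defaultdict(list)
    for i, c in enumerate(counts):
        by_dow[(first_weekday + i) % 7].append(c)
    return {d: mean(values) for d, values in by_dow.items()}


def dow_residuals(counts, first_weekday=0):
    """Subtract each day's weekday mean from its observed count."""
    expected = dow_expected(counts, first_weekday)
    return [c - expected[(first_weekday + i) % 7] for i, c in enumerate(counts)]
```

Other predictors listed on the slide (trend, holidays, temperature) would need a full regression fit, which this sketch deliberately leaves out.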
21
Sample Algorithm: “rate method”
Adaptive control chart derived from EARS methods
  – Implicit adjustment for known and unknown data behavior
Sliding daily baseline of both syndromic visit counts and total visit counts
Assumption: the ratio of syndromic visits to total visits has not changed
22
CDC EARS Methods C1-C3
Three adaptive methods chosen by National Center for Infectious Diseases after 9/1/2001 as most consistent
Look for aberrations representing increases, not decreases
Fixed mean, variance replaced by values from sliding baseline (usually 7 days)
Baseline for C1-MILD: Day -1 to Day -7
Baseline for C2-MEDIUM: Day -3 to Day -9
Baseline for C3-ULTRA: Day -3 to Day -9
[Diagram: timeline Day-9, Day-8, Day-7, Day-6, Day-5, Day-4, Day-3, Day-2, Day-1, Day 0 (current count)]
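The sliding-baseline windows for C1 and C2 can be sketched in Python as follows. This is an illustration under assumptions, not the EARS implementation: the statistic uses the Max{0, (x - mean - k*sd) / sd} form that these slides give for the Modified C2 method, the parameter names `guard` and `min_sd` are hypothetical, and the floor on the standard deviation is an assumed safeguard against division by zero on flat baselines. C3 is not sketched, and no alert threshold is applied because the slide does not state one.

```python
from statistics import mean, stdev


def sliding_baseline_statistic(counts, day, baseline_len=7, guard=0,
                               k=1.0, min_sd=0.5):
    """Control-chart statistic for counts[day] against a sliding baseline.

    guard=0 uses days -1..-7 (the C1 window on the slide);
    guard=2 uses days -3..-9 (the C2 window on the slide).
    Returns None while there is not enough history (warm-up period).
    """
    start = day - guard - baseline_len
    if start < 0:
        return None
    baseline = counts[start:day - guard]
    mu = mean(baseline)
    sd = max(stdev(baseline), min_sd)  # assumed floor, not from the slides
    # Only increases are flagged, so negative values are clipped to zero.
    return max(0.0, (counts[day] - mu - k * sd) / sd)
```

Calling the function with `guard=0` and `guard=2` on the same series shows how the two-day buffer keeps the first days of a gradual increase out of the C2 baseline.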
23
Sample Algorithm: “Modified C2”
If
  $T_n$ = the current day’s total visit count,
  $X_B$ = the sum of ILI visit counts in the baseline,
  $T_B$ = the sum of total visit counts in the baseline,
then the expected current day’s ILI visit count $E_n$ is:
  $E_n = T_n \, (X_B / T_B)$
If the number of ILI visits recorded for the current day is $X_n$, then by analogy to C2, the detection statistic is:
  $\max\{\, 0,\ (X_n - E_n - k S_n) / S_n \,\}$
for $k = 1$ and a standard error $S_n$.
If the total visit count is constant, this is exactly the C2 algorithm.
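A minimal Python sketch of the Modified C2 statistic follows, under stated assumptions. The slide does not define how the standard error S_n is computed, so the sketch assumes it is the sample standard deviation of the baseline residuals X_i - T_i (X_B / T_B), with a hypothetical floor `min_se`. That choice reduces to ordinary C2 when total counts are constant, in line with the slide's last statement. The baseline window (7 days, 2-day buffer) is borrowed from C2, and all function and parameter names are illustrative.

```python
from statistics import stdev


def modified_c2(ili, total, day, baseline_len=7, guard=2, k=1.0, min_se=1.0):
    """Rate-adjusted C2-style statistic for ili[day].

    ili, total: daily ILI and total visit counts in date order.
    Returns None during warm-up or when the baseline has no total visits.
    """
    start = day - guard - baseline_len
    if start < 0:
        return None
    window = range(start, day - guard)
    x_b = sum(ili[i] for i in window)    # X_B: baseline ILI visits
    t_b = sum(total[i] for i in window)  # T_B: baseline total visits
    if t_b == 0:
        return None
    ratio = x_b / t_b
    expected = total[day] * ratio        # E_n = T_n * (X_B / T_B)
    # Assumed definition of S_n: spread of baseline residuals, floored.
    residuals = [ili[i] - total[i] * ratio for i in window]
    s_n = max(stdev(residuals), min_se)
    return max(0.0, (ili[day] - expected - k * s_n) / s_n)
```

The floor `min_se` plays the role of a minimum standard error: it limits excess alerting when baseline counts are very small.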
24
Technical Points
Attempts to control bias:
  – Oversensitivity (undersensitivity) on high (low) volume days
Effect of temporary change in data participation
  – hospital drop-outs/catch-ups
Importance of minimum standard error
  – management of small counts to avoid excess alerting
Effect of baseline length
  – trade-offs in volatile vs recent representative behavior
Steady state vs unstable time intervals
  – Startup
  – Recovery after an alert or data problems
  – Adjustment to permanent change of scale
Comparative analysis of daily vs weekly time series
  – relation to emphasis on general PH vs bioterrorism concerns
25
Control for Day-of-Week Bias, 1
26
Ignoring Day-of-Week Effects
27
Control for Day-of-Week Bias Using Total Visit Counts
28
Control for Day-of-Week Bias: Variance Effects
29
Control for Day-of-Week Bias: Distribute Data
30
Analysis of Sparse Data Streams
31
Effect of Data Processing Issues
32
Effect of Baseline: Daily Data
[Figure panels: 28-day baseline; 7-day baseline]
33
Effect of Baseline: Weekly Data
[Figure panels: 7-week baseline; 4-week baseline]
34
Comparative Effect of Baseline Length
7-day baseline: no alert on 8 cases; 3 days later, alerts on 5 cases
Figure 2: Comparison of daily alerting thresholds as a function of baseline length. The multi-year overview in 2a illustrates the comparative stability of 7-day and 28-day baselines, and 2b shows the short-term effects of 1-, 2-, 4-, and 6-week baselines.
35
Summary Points
Understand Data Issues
  – Natural data behavior (trends, patterns)
  – Artifacts of data processing (temporary & permanent changes in coding, participation,…)
Algorithm Selection and Tuning
  – Do methods control for known, unknown data issues, relative to data source and syndrome filtering?
Algorithm Output Interpretation
  – What can be explained away by knowledge of data; what should be investigated?
36
Assessing a Data Source for Application of Cluster Detection
Are there case classification differences among subregions?
How stable is the spatial case distribution over time?
  – What is a good, efficient way to estimate an expected distribution?
  – Examples: census counts, (recent and/or stratified) baseline data, eligibility/enrollment lists, modeling of individual subregion data
How many total subregions are reasonably represented?
  – Do a few subregions dominate the data?
How sparse is the dataset overall?
  – May affect choice of spatial estimation, analysis methods
Is there a reason to expect noncircular disease clusters?
Do subregions drop in and out of the distribution over time?
  – Can you reduce false clustering by ignoring problem subregions?
Are there day-of-week and seasonal effects in subregion counts?
  – Are these effects common across subregions, or is there interaction between these effects and the spatial distribution?
37
Cluster Investigation by Record Inspection
Records Corresponding to a Respiratory Cluster