No More Black Box: Methods for visualizing and understanding your data for useful analysis Howard Burkom National Security Technology Department Johns Hopkins University Applied Physics Laboratory 7th Annual Conference of the International Society for Disease Surveillance Workshop Session: Public Health Track Analysis of Data Raleigh, North Carolina December 2, 2008

Computer Preparation: Windows and Microsoft EXCEL
– Windows screen resolution: at least 1024 x 768
    – From Windows: Start…Settings…Control Panel…Display
    – Set screen resolution to 1024 x 768 or higher, then Apply…OK, and close Control Panel
– EXCEL preparation: open EXCEL, then allow use of macros in EXCEL files
    – Tools…Options…Security…Macro Security, choose Medium security…OK…OK
    – Choose Enable Macros when you open workshop files

Outline
– Time series that occur in disease surveillance
– What information do we want from these time series?
– Statistical properties relevant to situational awareness
– Sample detection method and properties
– Examples using modified, aggregated Distribute data
– Summary and conclusions

Rash Syndrome Grouping of Diagnosis Codes www.bt.cdc.gov/surveillance/syndromedef/word/syndromedefinitions.doc

Example: Daily Count Data with Expected Values

Sample DiSTRIBuTE Data Distributed Surveillance Taskforce for Real-time Influenza Burden Tracking and Evaluation

Aggregated Distribute Data Sample

State Level Daily Counts

State Level Weekly & Total Counts

Utility of Surveillance Data
What information do we want from these time series?
– Early indications of potential public health events
    – Infectious disease outbreaks
    – Environmental exposures
    – Bioterrorist attack
– More recent emphasis: enhancement of situational awareness
    – What are the usual patterns?
    – What is the disease burden of a natural disaster or accident?
    – How is the outbreak progressing, and who is at risk?

Relevant Statistical Properties of Surveillance Data
Important because they affect recognition of anomalies and choice of algorithm:
– Seasonal, day-of-week effects dependent on data source & filtering of records
– Variance, autocorrelation
– Cross-correlation among data sources, among time series
– Degree of sparseness
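These properties can be checked directly on a daily count series before choosing an algorithm. The following is a minimal sketch, not part of the workshop spreadsheets: the function names and the made-up `weekly` series are illustrative assumptions. It reports the variance-to-mean ratio, lag-1 and lag-7 autocorrelation (a high lag-7 value suggests a day-of-week effect), and the fraction of zero-count days as a rough sparseness measure.

```python
from statistics import mean, pvariance


def lag_autocorrelation(counts, lag=1):
    """Sample autocorrelation of a count series at the given lag."""
    m = mean(counts)
    denom = sum((x - m) ** 2 for x in counts)
    if denom == 0:
        return 0.0  # constant series: no variation to correlate
    num = sum((counts[i] - m) * (counts[i + lag] - m)
              for i in range(len(counts) - lag))
    return num / denom


def summarize(counts):
    """Collect a few descriptive statistics relevant to alerting."""
    m = mean(counts)
    return {
        "mean": m,
        "variance_to_mean": pvariance(counts) / m if m > 0 else float("nan"),
        "lag1_autocorr": lag_autocorrelation(counts, 1),
        "lag7_autocorr": lag_autocorrelation(counts, 7),
        "fraction_zero_days": sum(1 for x in counts if x == 0) / len(counts),
    }


# Made-up series (assumption): eight identical weeks with low weekend counts.
weekly = [20, 18, 17, 16, 15, 8, 7] * 8
stats = summarize(weekly)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

In this artificial series the lag-7 autocorrelation is much larger than the lag-1 value, which is the signature of a weekly cycle rather than day-to-day persistence.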

Seasonal, DOW Effects: Daily Mean Record Counts

Elements of an Alerting Algorithm
– Values to be tested: raw data, or residuals from a model?
– Baseline period
    – Historical data used to determine expected data behavior
    – Fixed or a sliding window?
    – Outlier removal: to avoid training on unrepresentative data
    – What does algorithm do when there is all zero/no baseline data?
    – Is a warmup period of data history required?
– Buffer period (or guardband)
    – Separation between the baseline period and interval to be tested
– Test period
    – Interval of current data to be tested
– Reset criterion
    – To prevent flooding by persistent alerts caused by extreme values
– Test statistic: value computed to make alerting decisions
– Threshold: alert issued if test statistic exceeds this value
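To make the baseline, guardband, and test periods concrete, here is a minimal sketch of how a sliding window could be cut from a daily series. The function name `split_windows` and the default lengths are assumptions chosen for illustration; returning `None` stands in for the warmup case where too little history exists.

```python
def split_windows(series, baseline_len=28, guardband=2, test_len=1):
    """Return (baseline, test) slices ending at the most recent observation.

    The guardband days between the two slices are deliberately left out,
    so that the start of an outbreak does not contaminate the baseline.
    """
    needed = baseline_len + guardband + test_len
    if len(series) < needed:
        return None  # warmup period: not enough history yet
    test = series[-test_len:]
    end_of_baseline = len(series) - test_len - guardband
    baseline = series[end_of_baseline - baseline_len:end_of_baseline]
    return baseline, test


# Made-up example (assumption): days numbered 1..40 stand in for counts.
history = list(range(1, 41))
baseline, test = split_windows(history, baseline_len=7, guardband=2, test_len=1)
print(baseline)  # days 31..37
print(test)      # day 40; days 38 and 39 form the guardband
```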

Spreadsheet Objectives
– To present underlying concepts of alerting methods and the adaptations needed for daily health surveillance
    – Independent of any specific system, software environment, or corporate infrastructure
    – To understand what complex systems are offering
– To enable direct visualization of algorithm performance in immediate data context
– To furnish a spreadsheet toolset for
    – focused data analysis & experimentation
    – independent checking of anomalies
    – sharing without language, database barriers

Example with Detection Statistic Plot (plot annotations: Statistic Exceeds Threshold; Threshold)

Example: covering 1402 data days

Aggregated Distribute Data Sample: Weekly

Aggregated Distribute Data Sample: Daily

Data issues affecting monitoring
– Statistical properties
    – Scale and random dispersion
– Periodic effects
    – Day-of-week effects, seasonality
– Delayed (often variably) availability in monitoring system
– Trends: long/short term: many causes, incl. changes in:
    – Population distribution or demographic composition
    – Data provider participation
    – Consumer health care behavior
    – Coding or billing practices
– Prolonged data drop-outs, sometimes with catch-ups
– Outliers unrelated to infectious disease levels
    – Often due to problems in data chain
    – Inclement weather
    – Media reports (example: the “Clinton effect”)
Most suitable for modeling without data-specific information

Adjustment for Known Time Series Behavior
– Regression modeling
    – Direct modeling of known effects
    – Predictors: day of week, day of year, linear trend, holiday, post-holiday, max. daily temperature, others
– Stratification
    – Separate monitoring, according to purpose, by
        – Weekday and weekend/holiday, or day of week
        – Region
        – Age group
– “Differential Detection”
    – Use related “context” data with similar features
    – Example: total counts to modulate syndromic counts
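As a minimal sketch of adjusting for a known day-of-week effect, the code below computes a per-weekday expected count and the residuals that an alerting method could then test. When weekday indicators are the only predictors, the least-squares regression fit reduces to these per-weekday means; a fuller model with trend, holiday, or temperature terms would need a regression library. The function names and the made-up data are assumptions.

```python
from collections import defaultdict
from statistics import mean


def day_of_week_expected(counts, weekdays):
    """Mean count for each weekday label (0..6)."""
    groups = defaultdict(list)
    for count, day in zip(counts, weekdays):
        groups[day].append(count)
    return {day: mean(values) for day, values in groups.items()}


def residuals(counts, weekdays, expected):
    """Observed count minus the expected count for that weekday."""
    return [count - expected[day] for count, day in zip(counts, weekdays)]


# Made-up data (assumption): four identical weeks plus one unusual day.
weekdays = [i % 7 for i in range(28)]
counts = [20, 18, 17, 16, 15, 8, 7] * 4
counts[10] += 6  # an excess of 6 visits on a day with weekday label 3

expected = day_of_week_expected(counts, weekdays)
resid = residuals(counts, weekdays, expected)
print(resid.index(max(resid)))  # index of the day with the largest excess
```

Testing residuals instead of raw counts keeps routine weekday highs from looking like anomalies and weekend lows from masking them.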

Sample Algorithm: “rate method”
– Adaptive control chart derived from EARS methods
    – Implicit adjustment for known and unknown data behavior
– Sliding daily baseline of both syndromic visit counts and total visit counts
– Assumption: the ratio of syndromic visits to total visits has not changed

CDC EARS Methods C1-C3
– Three adaptive methods chosen by National Center for Infectious Diseases after 9/11/2001 as most consistent
– Look for aberrations representing increases, not decreases
– Fixed mean, variance replaced by values from sliding baseline (usually 7 days)
    – Baseline for C1-MILD: day -1 to day -7
    – Baseline for C2-MEDIUM: day -3 to day -9
    – Baseline for C3-ULTRA: day -3 to day -9
[Timeline diagram: Day-9 through Day-1, then Day 0 (Current Count)]
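The difference between the C1 and C2 baselines can be sketched as follows. This is an illustrative reading of the slide, not the EARS source code: the statistic is the current count's distance from the baseline mean in units of the baseline standard deviation, using a 7-day baseline that ends either 1 day (C1) or 3 days (C2) before the current day. Function names and example counts are assumptions, and C3 and the alerting thresholds are not covered.

```python
from statistics import mean, stdev


def ears_statistic(counts, lag):
    """(current - baseline mean) / baseline sd for a 7-day sliding baseline.

    `lag` is the number of days between the last baseline day and today:
    lag=1 gives days -7..-1, lag=3 gives days -9..-3.
    """
    if len(counts) < 7 + lag:
        return None  # not enough history
    current = counts[-1]
    end = len(counts) - lag          # exclusive end of the baseline slice
    baseline = counts[end - 7:end]
    sd = stdev(baseline)
    if sd == 0:
        return None  # undefined when the baseline has no variation
    return (current - mean(baseline)) / sd


def c1(counts):
    return ears_statistic(counts, lag=1)


def c2(counts):
    return ears_statistic(counts, lag=3)


# Made-up counts (assumption): an increase that starts one day before today.
counts = [10, 12, 11, 9, 10, 13, 11, 12, 18, 25]
print(round(c1(counts), 2), round(c2(counts), 2))
```

Because the C2 baseline stops two days earlier, the elevated count on day -1 does not inflate its mean and standard deviation, so C2 gives the larger statistic for a multi-day increase.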

Sample Algorithm: “Modified C2”
If
– $T_n$ = the current day’s total visit count,
– $X_B$ = the sum of ILI visit counts in the baseline,
– $T_B$ = the sum of total visit counts in the baseline,
then the expected current day’s ILI visit count $E_n$ is:
$$E_n = T_n \, (X_B / T_B)$$
If the number of ILI visits recorded for the current day is $X_n$, then by analogy to C2, the detection statistic is:
$$\max\{\, 0,\ (X_n - E_n - k S_n) / S_n \,\}$$
for $k = 1$ and a standard error $S_n$.
If the total visit count is constant, this is exactly the C2 algorithm.
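A minimal sketch of this statistic follows. The slide does not say how the standard error S_n is computed, so the choice here is an assumption: the standard deviation of the baseline residuals X_i - T_i (X_B / T_B), floored at a hypothetical `min_se`. The function name, the 7-day baseline with a 2-day guardband (days -9 to -3, as for C2), and the example data are also assumptions.

```python
from statistics import stdev


def modified_c2(ili, total, baseline_len=7, guardband=2, k=1.0, min_se=1.0):
    """Detection statistic max{0, (X_n - E_n - k*S_n) / S_n} for the last day."""
    n = len(ili)
    if n != len(total) or n < baseline_len + guardband + 1:
        return None  # mismatched inputs or not enough history
    end = n - 1 - guardband
    start = end - baseline_len
    x_base, t_base = ili[start:end], total[start:end]
    t_sum = sum(t_base)
    if t_sum == 0:
        return None  # no visits at all in the baseline
    ratio = sum(x_base) / t_sum            # X_B / T_B
    expected = total[-1] * ratio           # E_n = T_n * (X_B / T_B)
    # ASSUMPTION: S_n is the sd of baseline residuals, with a minimum value.
    resid = [x - t * ratio for x, t in zip(x_base, t_base)]
    s_n = max(stdev(resid), min_se)
    return max(0.0, (ili[-1] - expected - k * s_n) / s_n)


# Made-up data (assumption): ILI visits double on a day when total visits double.
ili = [10, 12, 11, 9, 10, 13, 11, 12, 11, 22]
busy_day = [100] * 9 + [200]
normal_day = [100] * 10
print(modified_c2(ili, busy_day))              # no excess relative to volume
print(round(modified_c2(ili, normal_day), 2))  # same count, usual volume
```

The same 22 ILI visits produce no signal when total volume has doubled but a large statistic when total volume is unchanged, which is the bias control the total-visit denominator is meant to provide.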

Technical Points
– Attempts to control bias:
    – Oversensitivity (undersensitivity) on high (low) volume days
– Effect of temporary change in data participation
    – hospital drop-outs/catch-ups
– Importance of minimum standard error
    – management of small counts to avoid excess alerting
– Effect of baseline length
    – trade-offs in volatile vs recent representative behavior
– Steady state vs unstable time intervals
    – Startup
    – Recovery after an alert or data problems
    – Adjustment to permanent change of scale
– Comparative analysis of daily vs weekly time series
    – relation to emphasis on general PH vs bioterrorism concerns

Control for Day-of-Week Bias, 1

Ignoring Day-of-Week Effects

Control for Day-of-Week Bias Using Total Visit Counts

Control for Day-of-Week Bias Variance Effects

Control for Day-of-Week Bias Distribute Data

Analysis of Sparse Data Streams

Effect of Data Processing Issues

Effect of Baseline: Daily Data (panels: 28-day baseline; 7-day baseline)

Effect of Baseline: Weekly Data (panels: 7-week baseline; 4-week baseline)

Comparative Effect of Baseline Length
7-day baseline: no alert on 8 cases; 3 days later, alerts on 5 cases.
Figure 2: Comparison of daily alerting thresholds as a function of baseline lengths. The multi-year overview in 2a illustrates the comparative stability of 7-day and 28-day baselines, and 2b shows the short-term effects of 1-, 2-, 4-, and 6-week baselines.
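The volatility of a short baseline can be reproduced with a small sketch. The threshold here is taken as baseline mean plus `z` baseline standard deviations, with `z = 3` and a 2-day guardband as assumptions; the function name and the made-up series with a single spike are also illustrative.

```python
from statistics import mean, stdev


def threshold_for_day(counts, t, baseline_len, guardband=2, z=3.0):
    """Alerting threshold for day index t: baseline mean + z * baseline sd."""
    end = t - guardband
    start = end - baseline_len
    if start < 0:
        return None  # not enough history before day t
    window = counts[start:end]
    return mean(window) + z * stdev(window)


# Made-up series (assumption): a stable weekly pattern with one spike.
counts = [2, 3, 2, 1, 3, 2, 2] * 6
counts[30] = 12

days = range(30, 42)
short = [threshold_for_day(counts, t, 7) for t in days]
long_ = [threshold_for_day(counts, t, 28) for t in days]
spread_short = max(short) - min(short)
spread_long = max(long_) - min(long_)
print(round(spread_short, 2), round(spread_long, 2))
```

While the spike sits inside the 7-day window the short-baseline threshold is far above a count of 8, and once the spike leaves the window the threshold drops below 5, which matches the behavior noted on the slide. The 28-day threshold moves much less over the same days.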

Summary Points
– Understand Data Issues
    – Natural data behavior (trends, patterns)
    – Artifacts of data processing (temporary & permanent changes in coding, participation,…)
– Algorithm Selection and Tuning
    – Do methods control for known, unknown data issues, relative to data source and syndrome filtering?
– Algorithm Output Interpretation
    – What can be explained away by knowledge of data; what should be investigated?

Assessing a Data Source for Application of Cluster Detection
– Are there case classification differences among subregions?
– How stable is the spatial case distribution over time?
    – What is a good, efficient way to estimate an expected distribution?
    – Examples: census counts, (recent and/or stratified) baseline data, eligibility/enrollment lists, modeling of individual subregion data
– How many total subregions are reasonably represented?
    – Do a few subregions dominate the data?
– How sparse is the dataset overall?
    – May affect choice of spatial estimation, analysis methods
– Is there a reason to expect noncircular disease clusters?
– Do subregions drop in and out of the distribution over time?
    – Can you reduce false clustering by ignoring problem subregions?
– Are there day-of-week and seasonal effects in subregion counts?
    – Are these effects common across subregions, or is there interaction between these effects and the spatial distribution?
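Two of these questions, whether a few subregions dominate and how sparse the data are, can be answered with a quick tabulation. The sketch below is illustrative only: the function name, the subregion labels, and the counts are assumptions, and sparseness is measured simply as the fraction of subregion-days with zero cases.

```python
def subregion_profile(counts_by_region, top_n=3):
    """Share of all cases in the top_n subregions, and fraction of zero cells."""
    totals = {region: sum(values) for region, values in counts_by_region.items()}
    grand_total = sum(totals.values())
    ranked = sorted(totals.values(), reverse=True)
    top_share = sum(ranked[:top_n]) / grand_total if grand_total else float("nan")
    cells = [c for values in counts_by_region.values() for c in values]
    zero_fraction = (sum(1 for c in cells if c == 0) / len(cells)
                     if cells else float("nan"))
    return {"top_share": top_share, "zero_fraction": zero_fraction}


# Made-up daily counts for four hypothetical subregions (assumption).
sample = {
    "A": [10, 12, 9, 11],
    "B": [0, 1, 0, 0],
    "C": [0, 0, 2, 0],
    "D": [1, 0, 0, 1],
}
profile = subregion_profile(sample, top_n=1)
print(profile)
```

A top share near 1 or a high zero fraction suggests that cluster results may be driven by one subregion or by noise in very small counts.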

Cluster Investigation by Record Inspection Records Corresponding to a Respiratory Cluster
