No More Black Box: Methods for visualizing and understanding your data for useful analysis Howard Burkom National Security Technology Department Johns.


1 No More Black Box: Methods for visualizing and understanding your data for useful analysis
Howard Burkom, National Security Technology Department, Johns Hopkins University Applied Physics Laboratory
7th Annual Conference of the International Society for Disease Surveillance
Workshop Session: Public Health Track, Analysis of Data
Raleigh, North Carolina, December 2, 2008

2 Computer Preparation: Windows and Microsoft EXCEL
Windows screen resolution: at least 1024 x 768
– From Windows: Start…Settings…Control Panel…Display
– Set screen resolution to 1024 x 768 or higher, then Apply…OK, and close Control Panel
EXCEL preparation: open EXCEL, and then allow use of macros in EXCEL files:
– Tools…Options…Security…Macro Security, choose Medium security…OK…OK
– Choose Enable Macros when you open workshop files

3 Outline
– Time series that occur in disease surveillance
– What information do we want from these time series?
– Statistical properties relevant to situational awareness
– Sample detection method and properties
– Examples using modified, aggregated Distribute data
– Summary and conclusions

4 Rash Syndrome Grouping of Diagnosis Codes www.bt.cdc.gov/surveillance/syndromedef/word/syndromedefinitions.doc

5 Example: Daily Count Data with Expected Values

6 Sample DiSTRIBuTE Data (Distributed Surveillance Taskforce for Real-time Influenza Burden Tracking and Evaluation)

7 Aggregated Distribute Data Sample

8 State Level Daily Counts

9 State Level Weekly & Total Counts

10 Utility of Surveillance Data
What information do we want from these time series?
Early indications of potential public health events:
– Infectious disease outbreaks
– Environmental exposures
– Bioterrorist attack
More recent emphasis: enhancement of situational awareness:
– What are usual patterns?
– What is disease burden of natural disaster or accident?
– How is outbreak progressing, who is at risk?

11 Relevant Statistical Properties of Surveillance Data
Important because they affect recognition of anomalies and choice of algorithm:
– Seasonal, day-of-week effects dependent on data source & filtering of records
– Variance, autocorrelation
– Cross-correlation among data sources, among time series
– Degree of sparseness
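Several of the properties listed on this slide can be checked directly on a daily count series. The following is a minimal Python sketch, not part of the workshop spreadsheets; the function names and the choice of lags 1 and 7 are illustrative assumptions. It reports the mean, the variance, the sample autocorrelation at one day and one week, and the fraction of zero-count days as a rough sparseness measure.

```python
from statistics import mean, pvariance


def autocorrelation(counts, lag=1):
    """Sample autocorrelation of a count series at the given lag."""
    n = len(counts)
    m = mean(counts)
    denom = sum((c - m) ** 2 for c in counts)
    if lag >= n or denom == 0:
        return 0.0  # too short or constant series: no usable estimate
    num = sum((counts[i] - m) * (counts[i + lag] - m) for i in range(n - lag))
    return num / denom


def summarize_series(counts):
    """Descriptive properties that influence the choice of alerting algorithm."""
    return {
        "mean": mean(counts),
        "variance": pvariance(counts),
        "lag1_autocorr": autocorrelation(counts, 1),
        "lag7_autocorr": autocorrelation(counts, 7),  # high values suggest a weekly pattern
        "fraction_zero_days": sum(1 for c in counts if c == 0) / len(counts),
    }
```

A high lag-7 autocorrelation relative to lag 1 hints at a day-of-week effect, and a large fraction of zero days suggests that methods built for sparse data may be a better fit.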

12 Seasonal, DOW Effects: Daily Mean Record Counts

13 Elements of an Alerting Algorithm
– Values to be tested: raw data, or residuals from a model?
– Baseline period
  Historical data used to determine expected data behavior
  Fixed or a sliding window?
  Outlier removal: to avoid training on unrepresentative data
  What does algorithm do when there is all zero/no baseline data?
  Is a warmup period of data history required?
– Buffer period (or guardband)
  Separation between the baseline period and interval to be tested
– Test period
  Interval of current data to be tested
– Reset criterion to prevent flooding by persistent alerts caused by extreme values
– Test statistic: value computed to make alerting decisions
– Threshold: alert issued if test statistic exceeds this value
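The baseline, buffer, and test periods can be pictured as consecutive slices of the daily series. The following is a minimal Python sketch under stated assumptions: the function name `split_windows` and the default lengths (7-day baseline, 2-day buffer, 1-day test) are hypothetical choices for illustration, not values prescribed by the slide.

```python
def split_windows(series, t, baseline_len=7, buffer_len=2, test_len=1):
    """Return (baseline, buffer, test) slices for a test interval ending at index t.

    Returns None during the warm-up period, when there is not yet
    enough history to fill the baseline.
    """
    test_start = t - test_len + 1
    buffer_start = test_start - buffer_len
    baseline_start = buffer_start - baseline_len
    if baseline_start < 0:
        return None  # warm-up: insufficient data history
    baseline = series[baseline_start:buffer_start]
    buffer = series[buffer_start:test_start]
    test = series[test_start:t + 1]
    return baseline, buffer, test
```

Because the window is recomputed for each `t`, this is the sliding-baseline variant; a fixed baseline would reuse one slice for every test day.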

14 Spreadsheet Objectives
To present underlying concepts of alerting methods and the adaptations needed for daily health surveillance
– Independent of any specific system, software environment, or corporate infrastructure
– To understand what complex systems are offering
To enable direct visualization of algorithm performance in immediate data context
To furnish a spreadsheet toolset for
– focused data analysis & experimentation
– independent checking of anomalies
– sharing without language, database barriers

15 Example with Detection Statistic Plot
[Figure annotations: "Statistic Exceeds Threshold"; "Threshold"]

16 Example: covering 1402 data days

17 Aggregated Distribute Data Sample: Weekly

18 Aggregated Distribute Data Sample: Daily

19 Data issues affecting monitoring
– Statistical properties
  Scale and random dispersion
– Periodic effects
  Day-of-week effects, seasonality
– Delayed (often variably) availability in monitoring system
– Trends: long/short term: many causes, incl. changes in:
  Population distribution or demographic composition
  Data provider participation
  Consumer health care behavior
  Coding or billing practices
– Prolonged data drop-outs, sometimes with catch-ups
– Outliers unrelated to infectious disease levels
  Often due to problems in data chain
  Inclement weather
  Media reports (example: the “Clinton effect”)
Most suitable for modeling without data-specific information

20 Adjustment for Known Time Series Behavior
Regression modeling
– Direct modeling of known effects
– Predictors: day of week, day of year, linear trend, holiday, post-holiday, max. daily temperature, others
Stratification
– Separate monitoring, according to purpose, by
  Weekday and weekend/holiday, or day of week
  Region
  Age group
“Differential Detection”
– Use related “context” data with similar features
– Example: total counts to modulate syndromic counts
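As a small illustration of regression adjustment, consider the simplest case in which day of week is the only predictor. With one indicator per weekday and no other terms, the least-squares fitted value for each day is the mean count for that weekday, so the residuals can be computed without a regression library. The Python sketch below makes that simplification explicit; the function name and the `first_weekday` parameter are assumptions for illustration, and a fuller model would add the other predictors listed above (trend, holidays, temperature).

```python
from statistics import mean


def day_of_week_residuals(counts, first_weekday=0):
    """Residuals after removing the mean count for each weekday.

    first_weekday is the weekday index (0..6) of counts[0].
    Equivalent to an ordinary least-squares fit whose only
    predictors are day-of-week indicators.
    """
    groups = {}
    for i, c in enumerate(counts):
        groups.setdefault((first_weekday + i) % 7, []).append(c)
    expected = {day: mean(values) for day, values in groups.items()}
    return [c - expected[(first_weekday + i) % 7] for i, c in enumerate(counts)]
```

The residuals, rather than the raw counts, would then be the values tested by the alerting algorithm.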

21 Sample Algorithm: “rate method”
Adaptive control chart derived from EARS methods
– Implicit adjustment for known and unknown data behavior
Sliding daily baseline of both syndromic visit counts and total visit counts
Assumption: the ratio of syndromic visits to total visits has not changed

22 CDC EARS Methods C1-C3
Three adaptive methods chosen by National Center for Infectious Diseases after 9/11/2001 as most consistent
Look for aberrations representing increases, not decreases
Fixed mean, variance replaced by values from sliding baseline (usually 7 days)
– Baseline for C1-MILD (-1 to -7 days)
– Baseline for C2-MEDIUM (-3 to -9 days)
– Baseline for C3-ULTRA (-3 to -9 days)
[Figure: timeline from Day-9 to Day 0 (current count) showing the baseline intervals]
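The difference between the C1 and C2 baselines is only where the 7-day window sits relative to the current day. The Python sketch below computes a standardized deviation from the sliding baseline for either placement. It is a hedged illustration, not the EARS reference implementation: the function name, the use of the sample standard deviation, and the `min_sd` floor (to avoid division by zero on flat baselines) are assumptions, and C3 is not shown.

```python
from statistics import mean, stdev


def ears_z(counts, t, lag_start, baseline_len=7, min_sd=1.0):
    """Standardized deviation of counts[t] from a sliding baseline.

    lag_start=1 gives the C1 baseline (days t-7 .. t-1);
    lag_start=3 gives the C2 baseline (days t-9 .. t-3).
    Returns None when there is not enough history.
    """
    start = t - lag_start - baseline_len + 1
    if start < 0:
        return None
    baseline = counts[start:t - lag_start + 1]
    sd = max(stdev(baseline), min_sd)  # assumed floor on the standard deviation
    return (counts[t] - mean(baseline)) / sd


def c1(counts, t):
    return ears_z(counts, t, lag_start=1)


def c2(counts, t):
    return ears_z(counts, t, lag_start=3)
```

Only increases are of interest, so an alert rule would compare this value against a positive threshold and ignore negative values.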

23 Sample Algorithm: “Modified C2”
If
$T_n$ = the current day’s total visit count,
$X_B$ = the sum of ILI visit counts in the baseline,
$T_B$ = the sum of total visit counts in the baseline,
then the expected current day’s ILI visit count $E_n$ is:
$E_n = T_n (X_B / T_B)$
If the number of ILI visits recorded for the current day is $X_n$, then by analogy to C2, the detection statistic is:
$\max\{0, (X_n - E_n - k S_n) / S_n\}$
for $k = 1$ and a standard error $S_n$.
If the total visit count is constant, this is exactly the C2 algorithm.
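The formulas on this slide translate into a few lines of Python. The sketch below follows the definitions of $E_n$ and the detection statistic given above, with a C2-style baseline (days -9 to -3). The slide does not define $S_n$, so the sketch assumes it is the sample standard deviation of the baseline residuals $X_i - T_i (X_B / T_B)$, floored at a hypothetical `min_se`; the function and parameter names are likewise illustrative.

```python
from statistics import stdev


def modified_c2(ili, total, t, baseline_len=7, buffer_len=2, k=1.0, min_se=1.0):
    """Rate-adjusted C2-style statistic for day t.

    ili and total are equal-length daily series of syndromic and
    total visit counts. Returns None if history is insufficient
    or the baseline has no total visits.
    """
    start = t - buffer_len - baseline_len
    if start < 0:
        return None
    x_base = ili[start:t - buffer_len]
    t_base = total[start:t - buffer_len]
    t_b = sum(t_base)
    if t_b == 0:
        return None  # no baseline visits: the ratio is undefined
    ratio = sum(x_base) / t_b          # X_B / T_B
    expected = total[t] * ratio        # E_n = T_n (X_B / T_B)
    # Assumption: S_n is estimated from baseline residuals, with a floor.
    residuals = [x - tt * ratio for x, tt in zip(x_base, t_base)]
    s_n = max(stdev(residuals), min_se)
    return max(0.0, (ili[t] - expected - k * s_n) / s_n)
```

With a constant total series the residuals reduce to deviations from the baseline mean of the ILI counts, which matches the slide's remark that the method then coincides with C2. On a day when total visits double, the expected ILI count doubles as well, so the same ILI count produces a smaller statistic.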

24 Technical Points
Attempts to control bias:
– Oversensitivity (undersensitivity) on high (low) volume days
Effect of temporary change in data participation
– hospital drop-outs/catch-ups
Importance of minimum standard error
– management of small counts to avoid excess alerting
Effect of baseline length
– trade-offs in volatile vs recent representative behavior
Steady state vs unstable time intervals
– Startup
– Recovery after an alert or data problems
– Adjustment to permanent change of scale
Comparative analysis of daily vs weekly time series
– relation to emphasis on general PH vs bioterrorism concerns

25 Control for Day-of-Week Bias, 1

26 Ignoring Day-of-Week Effects

27 Control for Day-of-Week Bias Using Total Visit Counts

28 Control for Day-of-Week Bias Variance Effects

29 Control for Day-of-Week Bias Distribute Data

30 Analysis of Sparse Data Streams

31 Effect of Data Processing Issues

32 Effect of Baseline: Daily Data
[Figure panels: 28-day baseline; 7-day baseline]

33 Effect of Baseline: Weekly Data
[Figure panels: 7-week baseline; 4-week baseline]

34 Comparative Effect of Baseline Length
7-day baseline: no alert on 8 cases; 3 days later, alerts on 5 cases
Figure 2: Comparison of daily alerting thresholds as a function of baseline lengths. The multi-year overview in 2a illustrates the comparative stability of 7-day and 28-day baselines, and 2b shows the short-term effects of 1-, 2-, 4-, and 6-week baselines.

35 Summary Points
Understand Data Issues
– Natural data behavior (trends, patterns)
– Artifacts of data processing (temporary & permanent changes in coding, participation,…)
Algorithm Selection and Tuning
– Do methods control for known, unknown data issues, relative to data source and syndrome filtering?
Algorithm Output Interpretation
– What can be explained away by knowledge of data; what should be investigated?

36 Assessing a Data Source for Application of Cluster Detection
Are there case classification differences among subregions?
How stable is the spatial case distribution over time?
– What is a good, efficient way to estimate an expected distribution?
– Examples: census counts, (recent and/or stratified) baseline data, eligibility/enrollment lists, modeling of individual subregion data
How many total subregions are reasonably represented?
– Do a few subregions dominate the data?
How sparse is the dataset overall?
– May affect choice of spatial estimation, analysis methods
Is there a reason to expect noncircular disease clusters?
Do subregions drop in and out of the distribution over time?
– Can you reduce false clustering by ignoring problem subregions?
Are there day-of-week and seasonal effects in subregion counts?
– Are these effects common across subregions, or is there interaction between these effects and the spatial distribution?

37 Cluster Investigation by Record Inspection: Records Corresponding to a Respiratory Cluster

