Early Statistical Detection of Bio-Terrorism Attacks by Tracking OTC Medication Sales
  Galit Shmueli, Dept. of Statistics and CALD, Carnegie Mellon University
  With Stephen Fienberg (Statistics), Anna Goldenberg & Rich Caruana (CS)
Overview
  Current bio-surveillance systems
    – Monitoring traditional data
    – Using simple SPC methods
  Early detection
    – Use of non-traditional data
    – Building a flexible, automated detection system
    – Evaluating the system
  Results and enhancements
Traditional Data Sources
  Public health sources
    – School absence records
    – Sentinel practices
    – Laboratory data
  Medical sources
    – Patient visits at urgent care, outpatient clinics, emergency rooms
  Speed of detection: weeks after the actual occurrence
    – Rate of data arrival
Why is detection slow?
  Data arrive late
    – Projects using electronic reporting systems:
        Influenza surveillance system (U of Utah)
        Tracking ICD9 codes (U of Pittsburgh)
    – Future: increasing availability of electronic means for gathering surveillance data
  Data are available on a weekly or monthly scale
  Data are nation-wide
  Signature of outbreak in data is late!
Non-Traditional Data
  Data that indirectly measure symptoms
    – Over-the-counter medication and grocery sales
    – Web browsing at medical websites
    – Automatic body tracking devices
  Different levels of availability
  Regional, localized data
  Confidentiality issues
Manifestation of Flu in Traditional and Non-Traditional Data
  [Figure: plot labels Lab, Flu, WebMD, School, Cough & Cold, Throat, Resp, Viral, Death; axis: weeks]
OTC Medication and Grocery Sales
  Benefits
    – Manifestation of outbreak is very early
    – Timeliness in collection and reporting (daily)
    – Extremely detailed (basket-level)
  Drawbacks
    – No info about epidemic manifestation in sales data
    – Requires knowledge about marketing efforts (sales, discounts)
    – If an outbreak replicates sales patterns, it is hard to detect (holidays are a big challenge)
    – Hard to model!
Prior Uses of Non-Traditional Data
  Diarrheal disease surveillance: data from 38 drug stores in NY (Mikol et al., 2000)
  Monitoring near-real-time satellite vegetation and climate data for predicting emerging Rift Valley Fever epidemics in East Africa (DoD and NASA, 2001)
Description of Our Data
  Daily sales of several OTC medication groups for 541 days between Aug 8, '99 and Jan 31, '01
  Concentrated on cough & cold medication (inhalational symptoms):
    – Cough medication
    – Tabs & Caps
    – Nasal medication
Hypothetical Scenario of an Inhalational Anthrax Attack
  Symptoms: almost all typical of flu!
    – fever
    – fatigue
    – cough
    – mild chest discomfort
    – but no runny nose (!)
  Death may occur within hours
Sales of Four Sub-Categories
Overview
  Current bio-surveillance systems
  Non-traditional data
  The detection system
  An evaluation method
  Results and Conclusions
  Future work
The Detection System
  Take into account special features of OTC and grocery sales data
    – Time series
    – Seasonality
    – Weekday/Weekend effect
    – Stores closed on certain days
    – Influence of total sales patterns
    – Very noisy, non-stationary
  Create automated system
Layers of the Detection System
  1) Preprocessing
  2) De-noising
  3) Forecasting next day sales
  4) Creating a threshold
  5) New day sales: real-time sales > threshold?
    – YES: WARNING! POSSIBLE BEGINNING OF AN EPIDEMIC/ATTACK
    – NO: no warning
Pre-Processing
De-Noising
  Target: obtain main features of data, reduce noise to improve predictability
  Selected method: Discrete Cosine Transform with horizontal filtering
  How much to de-noise?
    – Retain the minimal coefficient set that maximizes accuracy and optimizes predictability
    – Use cross-validation and MSE-based criteria
De-Noising: DCT with Horizontal Filtering
  [Figure: panels "de-noised set 1" and "de-noised set 2"]
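The de-noising step can be illustrated with a minimal sketch. The slides do not define "horizontal filtering" or say which coefficients are retained, so this example assumes the simplest variant: keep the first `keep` (lowest-frequency) DCT coefficients and zero the rest. The function names and the `keep` parameter are hypothetical, and the cross-validation step for choosing `keep` is not shown.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a list of floats (O(n^2), fine for short series)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def denoise(x, keep):
    """ASSUMPTION: 'filtering' means keeping the `keep` lowest-frequency
    coefficients and zeroing the others; the slides do not specify this."""
    c = dct(x)
    c = [v if k < keep else 0.0 for k, v in enumerate(c)]
    return idct(c)

daily_sales = [10.0, 12.0, 9.0, 30.0, 11.0, 13.0, 10.0, 12.0]
smoothed = denoise(daily_sales, keep=3)
```

Keeping every coefficient returns the original series, and keeping only the first returns its mean, so `keep` controls how aggressively the series is smoothed.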
Forecasting
  Target: predict next day sales
  Use pre-processed, de-noised data
  Problem: non-stationary (ARIMA doesn't work)
  Method:
    1) decompose with wavelets
    2) predict each wavelet resolution
    3) sum to obtain overall prediction
Prediction Using Wavelets
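The decompose, predict, and sum scheme can be sketched as follows. The slides do not name the wavelet or the per-resolution predictor, so this example assumes a causal additive Haar "à trous" decomposition and a simple AR(1) fit for each component. All function names are hypothetical.

```python
def haar_a_trous(x, levels):
    """Return [d_1, ..., d_J, c_J]; the components sum back to x pointwise.

    ASSUMPTION: redundant Haar transform using only past values, so the
    last coefficient of each component is available at forecast time.
    """
    comps = []
    smooth = list(x)
    for j in range(levels):
        lag = 2 ** j
        nxt = [
            0.5 * (smooth[t] + smooth[max(t - lag, 0)])
            for t in range(len(smooth))
        ]
        comps.append([a - b for a, b in zip(smooth, nxt)])  # detail d_{j+1}
        smooth = nxt
    comps.append(smooth)  # coarsest smooth c_J
    return comps

def ar1_forecast(y):
    """One-step AR(1) forecast around the component mean (an assumed predictor)."""
    m = sum(y) / len(y)
    num = sum((y[t] - m) * (y[t - 1] - m) for t in range(1, len(y)))
    den = sum((y[t - 1] - m) ** 2 for t in range(1, len(y)))
    phi = num / den if den > 0 else 0.0
    return m + phi * (y[-1] - m)

def forecast_next(x, levels=3):
    """Predict each resolution separately, then sum the predictions."""
    return sum(ar1_forecast(c) for c in haar_a_trous(x, levels))

history = [100.0 + (t % 7) for t in range(56)]
next_day = forecast_next(history)
```

Because the decomposition is additive, summing the per-resolution forecasts gives a forecast on the original sales scale without any inverse transform.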
Threshold Selection: SPC
  Based on the empirical distribution of residuals (real values minus predictions), we fit a "3σ" limit
Comparing Next-Day Sales to the Threshold
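A minimal sketch of the threshold and comparison steps, assuming the "3σ" limit is the mean of past residuals plus three sample standard deviations, and that an alarm fires when a new day's residual exceeds that limit. The slides do not give the exact rule, and the function names are hypothetical.

```python
import statistics

def upper_limit(actual, predicted, k=3.0):
    """ASSUMPTION: limit = mean(residuals) + k * sd(residuals), with k = 3."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return statistics.mean(residuals) + k * statistics.stdev(residuals)

def is_alarm(new_sales, forecast, limit):
    """True when the new day's sales exceed forecast + limit."""
    return (new_sales - forecast) > limit

past_actual = [10.0, 12.0, 11.0, 13.0, 9.0]
past_predicted = [10.0, 11.0, 12.0, 12.0, 10.0]
limit = upper_limit(past_actual, past_predicted)

print(is_alarm(new_sales=20.0, forecast=15.0, limit=limit))
print(is_alarm(new_sales=16.0, forecast=15.0, limit=limit))
```

Only the upper limit is checked here, since the warning in the system is triggered by sales above the threshold.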
Overview
  Current bio-surveillance systems
  Non-traditional data
  The detection system
  An evaluation method
  Results and Conclusions
  Ongoing work (basket-level data)
  Future work
Evaluating the System
  How fast does it detect an anthrax footprint?
  Problems:
    – Data does not include outbreak signature
    – We don't know what the signature looks like in such data
  Solution: simulated signature
  [Figure: inhalational anthrax signature; labels: day, spike, base]
Constructing the Signature
  Sverdlovsk outbreak, 1979
  Based on data from Meselson et al., Science (1994)
Anthrax Signature in OTC Sales
  Add signature at each data point sequentially, and look at rate of detection
  Try different slopes, heights
  Compare different configurations of system for different signatures
  slope = 1/3: detects 100% of spikes within 3 days for height = 1.3(data range)
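The evaluation loop above can be sketched as follows. The signature shape here is an assumption (a linear rise of `slope * height` per day up to a plateau at `height`), and `naive_detector` is a hypothetical stand-in for the full preprocessing, de-noising, forecasting, and threshold pipeline.

```python
import statistics

def ramp_signature(height, slope, length):
    """ASSUMED shape: rise by slope*height per day, then plateau at height."""
    return [min(height, height * slope * (d + 1)) for d in range(length)]

def naive_detector(history, new_value, k=3.0):
    """Stand-in detector: alarm if new_value > mean + k*sd of the history."""
    return new_value > statistics.mean(history) + k * statistics.stdev(history)

def detection_rate(sales, signature, detector, warmup=14, max_delay=3):
    """Fraction of injection days on which an alarm fires within max_delay days."""
    hits = trials = 0
    window = min(max_delay, len(signature))
    for start in range(warmup, len(sales) - len(signature) + 1):
        spiked = list(sales)
        for d, extra in enumerate(signature):
            spiked[start + d] += extra  # add the signature starting at `start`
        detected = any(
            detector(spiked[:start + d], spiked[start + d])
            for d in range(window)
        )
        hits += int(detected)
        trials += 1
    return hits / trials

sales = [100.0 + (t % 7) for t in range(60)]
data_range = max(sales) - min(sales)
signature = ramp_signature(height=1.3 * data_range, slope=1 / 3, length=5)
rate = detection_rate(sales, signature, naive_detector)
```

Swapping `naive_detector` for another callable with the same `(history, new_value)` interface makes it possible to compare system configurations across signature slopes and heights.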
Results and Conclusions
  The detection system
    – works with grocery data
    – detects simulated footprint quickly
    – has low false alarm rate
  The system is flexible (tools are interchangeable)
  Almost fully automated, efficient computation
  "Perfect bio-attack" is on a holiday
Future Work
  Combine with traditional medical and public health data sources
  Aggregated data: track several series simultaneously
  Basket data: utilize other features of grocery data such as spatial factor, customer information