Using the Repeated Two-Sample Rank Procedure for Detecting Anomalies in Space and Time Ronald D. Fricker, Jr. Interfaces Conference May 31, 2008.

Slides:



Advertisements
Similar presentations
Copyright 2004 Northrop Grumman Corporation 0 HIT Summit Leveraging HIT for Public Health Surveillance HIT Summit Leveraging HIT for Public Health Surveillance.
Advertisements

1 Health Warning! All may not be what it seems! These examples demonstrate both the importance of graphing data before analysing it and the effect of outliers.
Irwin/McGraw-Hill © Andrew F. Siegel, 1997 and l Chapter 16 l Nonparametrics: Testing with Ordinal Data or Nonnormal Distributions.
Sampling Distributions (§ )
Methodological Issues for Biosurveillance Ronald D. Fricker, Jr. 12 th Biennial CDC & ATSDR Symposium on Statistical Methods April 6, 2009.
Inferential Statistics & Hypothesis Testing
CHAPTER 16 MARKOV CHAIN MONTE CARLO
1 Vertically Integrated Seismic Analysis Stuart Russell Computer Science Division, UC Berkeley Nimar Arora, Erik Sudderth, Nick Hay.
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Elementary hypothesis testing
Simulation Modeling and Analysis
Statistical Methods Chichang Jou Tamkang University.
Elementary hypothesis testing
Elementary hypothesis testing Purpose of hypothesis testing Type of hypotheses Type of errors Critical regions Significant levels Hypothesis vs intervals.
Darlene Goldstein 29 January 2003 Receiver Operating Characteristic Methodology.
1 Graduate Statistics Student, 2 Undergraduate Computer Science Student, 3 Professor and Director of Statistical Consulting Collaboratory 4 Chief Technology.
Statistical Methods in Computer Science Hypothesis Testing I: Treatment experiment designs Ido Dagan.
Conclusions On our large scale anthrax attack simulations, being able to infer the work zip appears to improve detection time over just using the home.
Spatiotemporal Cluster Detection in ESSENCE Biosurveillance Systems Panelist: Howard Burkom National Security Technology Department, John Hopkins University.
Statistical Methods in Computer Science Hypothesis Testing I: Treatment experiment designs Ido Dagan.
Sequence comparison: Significance of similarity scores Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas.
Sample Size Determination Ziad Taib March 7, 2014.
SIMULATION MODELING AND ANALYSIS WITH ARENA
Hypothesis testing. Classical hypothesis testing is a statistical method that appeared in the first third of the 20 th Century, alongside the “modern”
Inference for the mean vector. Univariate Inference Let x 1, x 2, …, x n denote a sample of n from the normal distribution with mean  and variance 
SPONSOR JAMES C. BENNEYAN DEVELOPMENT OF A PRESCRIPTION DRUG SURVEILLANCE SYSTEM TEAM MEMBERS Jeffrey Mason Dan Mitus Jenna Eickhoff Benjamin Harris.
Boating Beyond Simple Shewhart Model 11. Destinations Purpose—To provide a quick long distance view. Content –Time or observations between events (g,
HSRP 734: Advanced Statistical Methods July 10, 2008.
On Model Validation Techniques Alex Karagrigoriou University of Cyprus "Quality - Theory and Practice”, ORT Braude College of Engineering, Karmiel, May.
CI - 1 Cure Rate Models and Adjuvant Trial Design for ECOG Melanoma Studies in the Past, Present, and Future Joseph Ibrahim, PhD Harvard School of Public.
1 Power and Sample Size in Testing One Mean. 2 Type I & Type II Error Type I Error: reject the null hypothesis when it is true. The probability of a Type.
Bootstrapping (And other statistical trickery). Reminder Of What We Do In Statistics Null Hypothesis Statistical Test Logic – Assume that the “no effect”
Introduction to Statistical Quality Control, 4th Edition
What are Nonparametric Statistics? In all of the preceding chapters we have focused on testing and estimating parameters associated with distributions.
Using the Repeated Two-sample Rank Procedure for Detecting Anomalies in Space and Time Ronald D. Fricker, Jr. University of California, Riverside November.
The Common Shock Model for Correlations Between Lines of Insurance
Random Numbers and Simulation  Generating truly random numbers is not possible Programs have been developed to generate pseudo-random numbers Programs.
Hypothesis Testing Hypothesis Testing Topic 11. Hypothesis Testing Another way of looking at statistical inference in which we want to ask a question.
1 SMU EMIS 7364 NTU TO-570-N Inferences About Process Quality Updated: 2/3/04 Statistical Quality Control Dr. Jerrell T. Stracener, SAE Fellow.
University of Ottawa - Bio 4118 – Applied Biostatistics © Antoine Morin and Scott Findlay 23/10/2015 9:22 PM 1 Two-sample comparisons Underlying principles.
Fall 2002Biostat Nonparametric Tests Nonparametric tests are useful when normality or the CLT can not be used. Nonparametric tests base inference.
Comp. Genomics Recitation 3 The statistics of database searching.
Relationship between two variables Two quantitative variables: correlation and regression methods Two qualitative variables: contingency table methods.
Chapter 4 Control Charts for Measurements with Subgrouping (for One Variable)
Estimating  0 Estimating the proportion of true null hypotheses with the method of moments By Jose M Muino.
Practical Aspects of Alerting Algorithms in Biosurveillance Howard S. Burkom The Johns Hopkins University Applied Physics Laboratory National Security.
Lecture 4: Statistics Review II Date: 9/5/02  Hypothesis tests: power  Estimation: likelihood, moment estimation, least square  Statistical properties.
Ch11: Comparing 2 Samples 11.1: INTRO: This chapter deals with analyzing continuous measurements. Later, some experimental design ideas will be introduced.
Fall 2002Biostat Statistical Inference - Proportions One sample Confidence intervals Hypothesis tests Two Sample Confidence intervals Hypothesis.
"Classical" Inference. Two simple inference scenarios Question 1: Are we in world A or world B?
: An alternative representation of level of significance. - normal distribution applies. - α level of significance (e.g. 5% in two tails) determines the.
Monitoring High-yield processes MONITORING HIGH-YIELD PROCESSES Cesar Acosta-Mejia June 2011.
Detecting Anomalies in Space and Time with Application to Biosurveillance Ronald D. Fricker, Jr. August 15, 2008.
Point Pattern Analysis
DTC Quantitative Methods Bivariate Analysis: t-tests and Analysis of Variance (ANOVA) Thursday 14 th February 2013.
1/20/2016ENGM 720: Statistical Process Control1 ENGM Lecture 05 Variation Comparisons, Process Capability.
1 SMU EMIS 7364 NTU TO-570-N Control Charts Basic Concepts and Mathematical Basis Updated: 3/2/04 Statistical Quality Control Dr. Jerrell T. Stracener,
Chapter 9: Introduction to the t statistic. The t Statistic The t statistic allows researchers to use sample data to test hypotheses about an unknown.
Hypothesis Tests. An Hypothesis is a guess about a situation that can be tested, and the test outcome can be either true or false. –The Null Hypothesis.
Review Design of experiments, histograms, average and standard deviation, normal approximation, measurement error, and probability.
Week 21 Statistical Model A statistical model for some data is a set of distributions, one of which corresponds to the true unknown distribution that produced.
Infectious Disease Surveillance & National/Health Security Michael A. Stoto CNSTAT Workshop on Vital Data for National Needs April 30, 2008, Washington.
Advanced Data Analytics
Hypothesis Testing and Confidence Intervals (Part 1): Using the Standard Normal Lecture 8 Justin Kern October 10 and 12, 2017.
Bayesian Biosurveillance of Disease Outbreaks
Epidemic Alerts EECS E6898: TOPICS – INFORMATION PROCESSING: From Data to Solutions Alexander Loh May 5, 2016.
Public Health Surveillance
Sampling Distributions (§ )
Presumptions Subgroups (samples) of data are formed.
Presentation transcript:

Using the Repeated Two-Sample Rank Procedure for Detecting Anomalies in Space and Time Ronald D. Fricker, Jr. Interfaces Conference May 31, 2008

“…surveillance using health-related data that precede diagnosis and signal a sufficient probability of a case or an outbreak to warrant further public health response.” [1] On-going discussion in public health community about use of biosurveillance for “early event detection” vs. “situational awareness” Motivating Problem: Biosurveillance 2 [1] CDC ( accessed 5/29/07)

Definitions Early event detection: gathering and analyzing data in advance of diagnostic case confirmation to give early warning of a possible outbreak Situational awareness: the real-time analysis and display of health data to monitor the location, magnitude, and spread of an outbreak 3

Illustrative Example ER patients come from surrounding area –On average, 30 per day More likely from closer distances –Outbreak occurs at (20,20) Number of patients increase linearly by day after outbreak 4 (Unobservable) distribution of ER patients’ home addresses Observed distribution of ER patients’ home addresses

A Couple of Major Assumptions Can geographically locate individuals in a medically meaningful way –Non-trivial problem –Data not currently available Data is reported in a consistent and timely way –Public health community working this problem, but not solved yet Assuming the above problems away… 5

Idea: Look at Differences in Kernel Density Estimates Construct kernel density estimate (KDE) of “normal” disease incidence using N historical observations Compare to KDE of most recent w observations 6 But how to know when to signal?

Solution: Repeated Two-Sample Rank (RTR) Procedure Sequential hypothesis test of estimated density heights Compare estimated density heights of recent data against heights of set of historical data –Single density estimated via KDE on combined data If no change, heights uniformly distributed –Use nonparametric test to assess 7

Data & Notation (1) Let be a sequence of bivariate observations –E.g., latitude and longitude of a case Assume ~ iid according to f 0 –I.e., natural state of disease incidence At time , ~ iid according to f 1 –Corresponds to an increase in disease incidence Densities f 0 and f 1 unknown 8

Data & Notation (2) Assume a historical sequence is available –Distributed iid according to f 0 Followed by which may change from f 0 to f 1 at any time For notational convenience, define for 9

Estimating the Density Consider the w+1 most recent data points At each time period estimate the density where k is a kernel function on R 2 with bandwidth set to 10

Calculating Density Heights The density estimate is evaluated at each historical and new point –For n < w+1 –For n > w+1 11

Under the Null, Estimated Density Heights are Exchangeable Theorem: The RTR procedure is asymptotically distribution free –I.e., the estimated density heights are exchangeable, so all rankings are equally likely –Proof: See Fricker and Chang (2008) Means can do a hypothesis test on the ranks each time an observation arrives –Signal change in distribution first time test rejects 12

Comparing Distributions of Heights Compute empirical distributions of the two sets of estimated heights: Use Kolmogorov-Smirnov test to assess: –Signal at time 13

Comparison Metrics How to find c ? –Use ARL approximation based on Poisson clumping heuristic: Example: c= with N=1,350 and w+1=250 gives A=900 –If 30 observations per day, gives average time between (false) signals of 30 days 14

Performance Comparison #1 F 0 ~ N(0,1) F 1 ~ N( ,1) 15 ARL     RTR EWMA CUSUM Shewhart

Performance Comparison #2 F 0 ~ N(0,1) F 1 ~ N(0,  2 ) 16 ARL      RTR EWMA CUSUM Shewhart

Performance Comparison #3 F 0 ~ N(0,1) F 1 ~ 17 ARL RTR EWMA CUSUM Shewhart

Performance Comparison #4 F 0 ~ N 2 ((0,0) T, I ) F 1 mean shift in F 0 of distance  18 ARL RTR MEWMA MCUSUM Hotelling’s T 2     

Performance Comparison #5 F 0 ~ N 2 ((0,0) T, I ) F 1 ~ N 2 ((0,0) T,  2 I ) 19 ARL       RTR MEWMA MCUSUM Hotelling’s T 2

Plotting the Outbreak At signal, calculate optimal kernel density estimates and plot pointwise differences where and or 20

Example Results Assess performance by simulating outbreak multiple times, record when RTR signals –Signaled middle of day 5 on average –By end of 5 th day, 15 outbreak and 150 non-outbreak observations –From previous example: 21 Distribution of Signal Day Estimate of Outbreak Location on Day 5 Underlying Surface of Density Height Differences

Normal disease incidence ~ N({0,0} t,  2 I ) with  =15 –Expected count of 30 per day Outbreak incidence ~ N({20,20} t, d 2 I ), where d is the day of outbreak –Expected count is 30+ d per day 22

Normal disease incidence ~ N({0,0} t,  2 I ) with  =15 –Expected count of 30 per day Outbreak incidence ~ N({20,20} t,2.2 d 2 I ), where d is the day of outbreak –Expected count is 30+ d 2 per day

Normal disease incidence ~ N({0,0} t,  2 I ) with  =15 –Expected count of 30 per day Outbreak incidence sweeps across region from left to right –Expected count is per day

Advantages and Disadvantages Advantages –Methodology supports both biosurveillance goals: early event detection and situational awareness –Incorporates observations sequentially (singly) Most other methods use aggregated data –Can be used for more than two dimensions Disadvantage? –Can’t distinguish increase distributed according to f 0 Unlikely for bioterrorism attack? Won’t detect an general increase in background disease incidence rate –E.g., Perhaps caused by an increase in population –In this case, advantage not to detect 25

Selected References Selected Research: Fricker, R.D., Jr., and J.T. Chang, The Repeated Two-Sample Rank Procedure: A Multivariate Nonparametric Individuals Control Chart (in draft). Fricker, R.D., Jr., and J.T. Chang, A Spatio-temporal Method for Real-time Biosurveillance, Quality Engineering (to appear). Fricker, R.D., Jr., and D. Banschbach, Optimizing a System of Threshold Detection Sensors, in submission to Operations Research. Fricker, R.D., Jr., Knitt, M.C., and C.X. Hu, Comparing Directionally Sensitive MCUSUM and MEWMA Procedures with Application to Biosurveillance, Quality Engineering (to appear). Joner, M.D., Jr., Woodall, W.H., Reynolds, M.R., Jr., and R.D. Fricker, Jr., A One-Sided MEWMA Chart for Health Surveillance, Quality and Reliability Engineering International (to appear). Fricker, R.D., Jr., Hegler, B.L., and D.A Dunfee, Assessing the Performance of the Early Aberration Reporting System (EARS) Syndromic Surveillance Algorithms, Statistics in Medicine, Fricker, R.D., Jr., Directionally Sensitive Multivariate Statistical Process Control Methods with Application to Syndromic Surveillance, Advances in Disease Surveillance, 3:1, Background Information: Fricker, R.D., Jr., and H. Rolka, Protecting Against Biological Terrorism: Statistical Issues in Electronic Biosurveillance, Chance, 91, pp. 4-13, 2006 Fricker, R.D., Jr., Syndromic Surveillance, Encyclopedia of Quantitative Risk Assessment (to appear). 26