Empirical/Asymptotic P-values for Monte Carlo-Based Hypothesis Testing: an Application to Cluster Detection Using the Scan Statistic Allyson Abrams, Martin.

Empirical/Asymptotic P-values for Monte Carlo-Based Hypothesis Testing: an Application to Cluster Detection Using the Scan Statistic
Allyson Abrams, Martin Kulldorff, Ken Kleinman
Department of Ambulatory Care and Prevention, Harvard Medical School and Harvard Pilgrim Health Care
Presented at EVA, August 15, 2005
This work was funded by the United States National Cancer Institute, grant number RO1-CA95979.

Background: Scan Statistics
Spatial scan statistic: used to identify geographic clusters
Use a moving circular window on the map
  – Any point on the map can be the center of a cluster
  – Each circle includes a different set of points
  – If the centroid of a region is included in the circle, the whole region is included

Background: Scan Statistics
For each distinct window, calculate the likelihood, which is proportional to a function of the following quantities:
n = number of cases inside the circle
N = total number of cases
μ = expected number of cases inside the circle
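
The expression the likelihood is proportional to does not appear in the slide text. For reference, the form usually given for the Poisson-model spatial scan statistic, written in the symbols defined on this slide, is sketched below. It assumes the expected counts are scaled to sum to N over the whole map and that only high-rate clusters are scanned for; it is the standard textbook form, not a copy of the slide.

```latex
L(\text{window}) \;\propto\;
\begin{cases}
\left(\dfrac{n}{\mu}\right)^{n}
\left(\dfrac{N-n}{N-\mu}\right)^{N-n} & \text{if } n > \mu,\\[2ex]
1 & \text{otherwise.}
\end{cases}
```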

Background: Scan Statistics
The scan statistic is the maximum likelihood over all possible circles
  – Identifies the most unusual cluster
To find the p-value, use Monte Carlo hypothesis testing
  – Redistribute cases randomly and recalculate the scan statistic many times
  – The p-value is the proportion of scan statistics from the Monte Carlo replicates that are greater than or equal to the scan statistic for the true cluster
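
The Monte Carlo procedure above can be sketched in a few lines of Python. This is a minimal illustration, not SaTScan's code: the function names and the `simulate_null_stat` callable (which stands in for "redistribute cases and recompute the scan statistic") are hypothetical. It also assumes the common rank convention in which the observed statistic is counted among the replicates, so that 999 replicates give p-values in steps of 0.001.

```python
import random


def monte_carlo_p_value(observed_stat, null_stats):
    """Rank-based Monte Carlo p-value.

    Counts replicates at least as large as the observed statistic.
    Assumption: the observed value is included in the ranking,
    hence the +1 in numerator and denominator.
    """
    n_at_least = sum(1 for s in null_stats if s >= observed_stat)
    return (n_at_least + 1) / (len(null_stats) + 1)


def monte_carlo_test(observed_stat, simulate_null_stat, n_replicates=999, seed=0):
    """Draw replicates under the null and return the Monte Carlo p-value.

    simulate_null_stat(rng) is a hypothetical callable that redistributes
    the cases at random and returns the recomputed scan statistic.
    """
    rng = random.Random(seed)
    null_stats = [simulate_null_stat(rng) for _ in range(n_replicates)]
    return monte_carlo_p_value(observed_stat, null_stats)
```

With this convention the smallest reportable p-value is 1 / (n_replicates + 1), which is the limitation the later slides address.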

The preceding discussion considered only spatial clustering
To extend to clustering in space and time, use cylinders instead of circles
  – The height of the cylinder represents time
The rest of the process is unchanged
SaTScan is freely available software that uses the scan statistic to detect clusters in space, time, or space-time

Background: SaTScan
Main drawback of Monte Carlo hypothesis testing: increased precision for p-values can only be obtained by greatly increasing the number of Monte Carlo replicates
  – A big problem for small p-values
SaTScan can take anywhere from seconds to hours to run, depending on the data, the type of analysis, and the number of Monte Carlo replicates
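
To make the precision problem concrete, the sketch below computes two quantities for a given number of replicates: the smallest p-value a rank-based Monte Carlo test can report, and an approximate standard error of the estimated p-value. The binomial standard error is an assumption introduced here for illustration (it treats each replicate as an independent exceed/not-exceed trial); the function names are hypothetical.

```python
import math


def smallest_p_value(n_replicates):
    """Smallest p-value reportable when the observed value is ranked with the replicates."""
    return 1.0 / (n_replicates + 1)


def p_value_standard_error(p, n_replicates):
    """Approximate binomial standard error of a Monte Carlo p-value estimate."""
    return math.sqrt(p * (1.0 - p) / n_replicates)


# Compare a few replicate counts for a true p-value of 0.001.
for n in (999, 9999, 99999):
    floor = smallest_p_value(n)
    se = p_value_standard_error(0.001, n)
    print(f"{n:>6} replicates: floor={floor:.5f}, se={se:.5f}, relative se={se / 0.001:.2f}")
```

Under this approximation the standard error shrinks only with the square root of the replicate count, so each extra decimal place of precision costs roughly a hundredfold more replicates.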

Background
We use SaTScan for 2 main reasons
1. Daily surveillance for disease outbreaks
2. Evaluating systems that use SaTScan for surveillance
In both cases, we need to limit the time it takes to generate each p-value while still retaining enough precision in the p-value to determine how unusual a cluster is

Goal
Estimate the distribution of the scan statistic using fewer Monte Carlo replicates
  – See how the p-values obtained from the distributional parameters compare with the true p-value

Methods
Sample map: 245 counties in the northeast United States with 600 cases
Ran SaTScan on the sample map using 100,000,000 Monte Carlo replicates to find the 'true' log-likelihood needed to obtain p-values of 0.01, 0.001, 0.0001, 0.00001
  – These correspond to the following order statistics from the 100,000,000 Monte Carlo replicates: 1,000,000; 100,000; 10,000; 1,000
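
The link between order statistics and p-values on this slide is simple arithmetic: the k-th largest of B replicates is the value exceeded or equalled by a fraction k / B of them. The sketch below, with hypothetical function names, checks the four ranks listed above and shows how a critical value would be read off a list of replicates.

```python
def implied_p_value(rank, n_replicates):
    """Tail probability associated with the rank-th largest replicate."""
    return rank / n_replicates


def critical_value(null_stats, rank):
    """Return the rank-th largest value (rank 1 is the maximum)."""
    return sorted(null_stats, reverse=True)[rank - 1]


# Ranks quoted on the slide, out of 100,000,000 replicates.
for rank in (1_000_000, 100_000, 10_000, 1_000):
    print(rank, implied_p_value(rank, 100_000_000))
```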

Methods
Ran SaTScan 1000 times on the same map, each time generating 999 Monte Carlo replicates
For each of the 1000 SaTScan runs:
  – Found maximum likelihood estimates of the parameters for each distribution based on the 999 Monte Carlo replicates
Distributions used: Normal, Lognormal, Gamma, Gumbel
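
As an illustration of the fitting step, the sketch below computes maximum likelihood estimates for the Gumbel distribution from a list of replicate statistics, using only the Python standard library. It is one possible implementation under stated assumptions, not the authors' code: the scale is found by bisection on the Gumbel likelihood equation, the function names are hypothetical, and the demonstration data are simulated Gumbel draws with invented parameters (6.0 and 1.2) rather than SaTScan output.

```python
import math
import random


def fit_gumbel_mle(values, tol=1e-10, max_iter=200):
    """Maximum likelihood fit of a Gumbel (maximum) distribution.

    Returns (location, scale). The scale solves
        scale = mean(x) - sum(x * w) / sum(w),  with w = exp(-x / scale),
    and is located here by bisection.
    """
    n = len(values)
    mean = sum(values) / n
    x_min = min(values)
    spread = max(values) - x_min
    if spread <= 0:
        raise ValueError("values must not all be equal")

    def score(scale):
        # Shift by x_min so every exponent is <= 0 and cannot overflow.
        weights = [math.exp(-(x - x_min) / scale) for x in values]
        weighted_mean = sum(x * w for x, w in zip(values, weights)) / sum(weights)
        return scale - mean + weighted_mean

    # score is negative for a tiny scale and positive at scale = spread.
    lo, hi = spread * 1e-6, spread
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    scale = 0.5 * (lo + hi)

    # location = -scale * log(mean(exp(-x / scale))), computed with the same shift.
    shifted_mean = sum(math.exp(-(x - x_min) / scale) for x in values) / n
    location = x_min - scale * math.log(shifted_mean)
    return location, scale


def sample_gumbel(rng, location, scale):
    """Draw one Gumbel variate by inverting the CDF (demo data only)."""
    u = max(rng.random(), 1e-12)  # keep u strictly inside (0, 1)
    return location - scale * math.log(-math.log(u))


# Demo: 999 simulated "replicates" with invented parameters.
rng = random.Random(0)
replicates = [sample_gumbel(rng, 6.0, 1.2) for _ in range(999)]
location_hat, scale_hat = fit_gumbel_mle(replicates)
print(location_hat, scale_hat)
```

In practice the input would be the 999 replicate statistics from a single SaTScan run; the Normal, Lognormal, and Gamma fits would follow the same pattern with their own likelihood equations.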

Methods
The empirical/asymptotic p-value for each distribution is the area to the right of the observed log-likelihood under the fitted distribution
For each distribution, we generated:
1. empirical/asymptotic p-values based on the 'true' log-likelihood value
2. the log-likelihoods that would have been required to generate p-values of 0.01, 0.001, 0.0001, 0.00001
The usual Monte Carlo-based p-value reported in SaTScan
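
For the Gumbel case, both quantities on this slide have closed forms: the right-tail area is 1 - exp(-exp(-(x - location) / scale)), and inverting it gives the statistic required for a target p-value. The sketch below shows both directions. The function names are hypothetical and the parameter values (6.0, 1.2) are invented for illustration; in practice they would come from a fit to one run's replicates.

```python
import math


def gumbel_p_value(observed, location, scale):
    """Area to the right of `observed` under a Gumbel(location, scale) distribution."""
    z = (observed - location) / scale
    # 1 - exp(-exp(-z)), written with expm1 to stay accurate for tiny p-values.
    return -math.expm1(-math.exp(-z))


def gumbel_critical_value(p, location, scale):
    """Statistic whose right-tail area under the fitted Gumbel equals p."""
    # Solve p = 1 - exp(-exp(-z)) for z, using log1p(-p) = log(1 - p).
    return location - scale * math.log(-math.log1p(-p))


# Invented parameters, for illustration only.
location, scale = 6.0, 1.2
for p in (0.01, 0.001, 0.0001, 0.00001):
    x = gumbel_critical_value(p, location, scale)
    print(p, x, gumbel_p_value(x, location, scale))
```

Unlike the rank-based p-value, this tail area is not bounded below by 1 / (replicates + 1), which is what lets a 999-replicate run report p-values such as 0.00001.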

Methods
Repeated the entire process using 60 and 6000 cases
  – Results were almost identical
Using 600 cases, repeated the entire process with 99 and 9999 Monte Carlo replicates in each of the 1000 simulations
  – Again, very similar results

Results (figure slides, plots not reproduced): two series of four slides, one slide per true p-value: 0.01, 0.001, 0.0001, 0.00001

Results
The empirical/asymptotic p-values from the Gumbel distribution appear only slightly conservatively biased
The other tested distributions all resulted in anti-conservatively biased p-values
The ordinary Monte Carlo p-values reported by SaTScan had greater variance than the Gumbel-based p-values

Conclusions
Empirical/asymptotic p-values based on the Gumbel distribution can be preferable to true Monte Carlo p-values
Empirical/asymptotic p-values can accurately estimate p-values smaller than is possible with Monte Carlo p-values for a given number of replicates
We suggest empirical/asymptotic p-values as a hybrid method to accurately obtain small p-values with a relatively small number of Monte Carlo replicates

Future work
Results shown today are based on purely spatial analyses; we will also look at space-time analyses
An option will be added in SaTScan to allow the user to request the Gumbel-based p-value