Summary of "A Spatial Scan Statistic" by M. Kulldorff. Presented by Gauri S. Datta, Mid-Year Meeting, February 3, 2006.



Background
Scan Statistic
– A tool to detect clusters in a point process
– Naus (1965, JASA) studied it in one dimension
– Tests whether a 1-dim point process is purely random
Point Process
– Consider a time interval [a,b] and a window A = [t,t+w] of fixed width w
– µ(A) = # of e-mails arrived in the time window A
– n(A) ≡ n_A = # of junk e-mails = number of points
– Arrival times of junk e-mails define a point process

Main Idea in Scan Statistic
– Move a window [t,t+w] of size w < b-a over a time interval [a,b]
– Over all possible values of t, record the maximum number of points in the window
– Compare this number with cutoff points under the hypothesis of a pure Poisson process
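The sliding-window idea can be sketched in a few lines of Python. This is only an illustration: the function name `scan_statistic_1d` is hypothetical, and the sketch relies on the observation that the maximum count is always attained by some window whose left edge sits on a data point, so only those windows need to be checked.

```python
from bisect import bisect_right


def scan_statistic_1d(points, w):
    """Maximum number of points falling in any closed window [t, t + w].

    Assumption (illustrative): sliding a window to the right until its
    left edge hits a point never loses a point, so it is enough to try
    windows that start at an observed point.
    """
    pts = sorted(points)
    best = 0
    for i, t in enumerate(pts):
        # Index just past the last point that is <= t + w.
        j = bisect_right(pts, t + w)
        best = max(best, j - i)
    return best
```

For example, `scan_statistic_1d([1, 2, 3, 10, 11], 2)` counts the three points in the window [1, 3]. The cutoff points for this maximum under the null hypothesis are a separate matter and are not computed here.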


Building Block of Scan Test
– Repeated use of tests for equality of two Binomial or Poisson populations
– The two populations are defined by the scanning window A and its complement A^c
– As in multiple comparisons, these tests are dependent as one moves the scanning window

Spatial Scan Statistic (SSS)
Kulldorff (1997) used SSS to detect clusters in a spatial process
SSS can be used
– In a multi-dim point process
– With variable window size
– With the baseline process an inhomogeneous Poisson process or a Bernoulli process

SSS (continued)
– Scanning window can be any predefined shape
– SSS is defined on a geographical space G with a measure µ
– In a traditional point process, G is a line and µ is a uniform measure
– In 2-dim, G is a plane and µ is Lebesgue measure


Examples
Forestry:
– Spatial clustering of trees
– Want to look for clusters of a specific kind of tree after adjusting for the uneven spatial distribution of all trees
– µ(A) = total # of trees in region A
– n_A = # of trees of the specific kind in A

Examples (continued)
Epidemiology:
– Interest in detecting geographical clusters of disease
– Need to adjust for uneven population density (rural vs. urban population)
– For data aggregated into census districts, the measure is concentrated at the central coordinates of the districts

Examples (continued)
If interest is in space-time clusters of a disease, the measure will still be concentrated in the geographical region as in the prior example
Adjusting for uneven population distribution is not always enough; confounding factors should be taken into account
E.g., in epidemiology the measure can reflect a standardized expected incidence rate

SS = LR Statistic
For a fixed-size window, the scan statistic is the maximum # of points in the window over all times/geographical locations
The test statistic is equivalent to the LR test statistic for testing H_0: rate inside the window = rate outside vs. H_a: rate inside > rate outside
Generalization to the LR test is important for variable window sizes

Generalized SS: Notation/Models
G = geographical area / study space
A = window ⊂ G
N(A) = random # of points in A
– A spatial point process
Goal: find the most prominent cluster
Two useful models for the point process
– (a) Bernoulli model
– (b) Poisson model

Standard Models for SS
For the Bernoulli model, the measure µ is such that µ(A) is an integer for all subsets A of G
– Two states (disease point or no disease) for each unit
Locations of the points define a point process

LR Test: Bernoulli Model
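As a hedged sketch of what the Bernoulli likelihood-ratio test computes for a single candidate zone Z, the following Python function assumes the form of the statistic given in Kulldorff (1997): the binomial likelihood with separate maximum-likelihood rates n_Z/µ(Z) inside and (n_G − n_Z)/(µ(G) − µ(Z)) outside, divided by the likelihood with the common rate n_G/µ(G). The function names and argument names are illustrative, not from the slides.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def bernoulli_log_lr(n_z, mu_z, n_g, mu_g):
    """Log likelihood ratio for one zone Z under the Bernoulli model.

    n_z, mu_z: cases and population inside Z (0 < mu_z < mu_g).
    n_g, mu_g: cases and population in the whole study area G.
    Returns 0 when the rate inside Z does not exceed the rate outside,
    because the alternative is one-sided (p > q).
    """
    n_out, mu_out = n_g - n_z, mu_g - mu_z
    if n_z / mu_z <= n_out / mu_out:
        return 0.0

    def max_loglik(n, mu):
        # Binomial log-likelihood evaluated at the MLE rate n / mu.
        rate = n / mu
        return _xlogy(n, rate) + _xlogy(mu - n, 1.0 - rate)

    return max_loglik(n_z, mu_z) + max_loglik(n_out, mu_out) - max_loglik(n_g, mu_g)
```

The scan statistic is then the maximum of this quantity over all candidate zones; the zone attaining the maximum is the most likely cluster.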

Poisson Model
Under the Poisson model, points are generated by an inhomogeneous Poisson process
There is exactly one zone Z ⊂ G such that N(A) ~ Po(pµ(A ∩ Z) + qµ(A ∩ Z^c)) for all A
Null hypothesis H_0: p = q
Alternative hypothesis H_1: p > q for some zone Z
Under H_0, N(A) ~ Po(pµ(A)) for all A
– The parameter Z disappears under H_0

Poisson Model (continued)
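A corresponding sketch for the Poisson model, again assuming the likelihood-ratio form from Kulldorff (1997): conditional on the total count n_G, the ratio for a zone Z is (n_Z/µ(Z))^{n_Z} ((n_G − n_Z)/(µ(G) − µ(Z)))^{n_G − n_Z} divided by (n_G/µ(G))^{n_G} when the rate inside Z exceeds the rate outside, and 1 otherwise. The function name is hypothetical.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def poisson_log_lr(n_z, mu_z, n_g, mu_g):
    """Log likelihood ratio for one zone Z under the Poisson model.

    n_z, mu_z: observed count and measure inside Z (0 < mu_z < mu_g).
    n_g, mu_g: observed count and measure of the whole study area G.
    Returns 0 unless the observed rate inside Z exceeds the rate outside.
    """
    n_out, mu_out = n_g - n_z, mu_g - mu_z
    if n_z / mu_z <= n_out / mu_out:
        return 0.0
    return (_xlogy(n_z, n_z / mu_z)
            + _xlogy(n_out, n_out / mu_out)
            - _xlogy(n_g, n_g / mu_g))
```

With 5 of 10 cases falling in a zone that holds 10% of the measure, `poisson_log_lr(5, 10, 10, 100)` equals 5 log 5 + 5 log(10/18), roughly 5.108.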

Choice of Zones
How is the collection of zones selected? Possibilities:
(1) All circular subsets
(2) All circles centered at any of several foci on a fixed grid, with a possible upper limit on size
(3) Same as (2) but with a fixed size
(4) All rectangles of fixed size and shape
(5) If looking for space-time clusters, use cylinders scanning circular geographical areas over variable time intervals
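Option (2) can be illustrated for data aggregated to a finite set of locations. The sketch below is an assumption-laden illustration, not the paper's algorithm: it uses the data locations themselves as foci, grows a circle around each focus one nearest neighbor at a time, and stops once the zone would exceed a fraction `max_frac` of the total measure. The name `circular_zones` and its parameters are hypothetical, and locations at exactly equal distance from a focus are added one at a time rather than together.

```python
import math


def circular_zones(coords, mu, max_frac=0.5):
    """Enumerate distinct circular zones as frozensets of location indices.

    coords: list of (x, y) locations, also used as the circle centers.
    mu:     measure (e.g. population) attached to each location.
    """
    total = sum(mu)
    zones = set()
    for cx, cy in coords:
        # Locations ordered by distance from the current focus.
        order = sorted(
            range(len(coords)),
            key=lambda i: math.hypot(coords[i][0] - cx, coords[i][1] - cy),
        )
        members, mass = [], 0.0
        for i in order:
            mass += mu[i]
            if mass > max_frac * total:
                break  # upper limit on zone size reached
            members.append(i)
            zones.add(frozenset(members))
    return zones
```

Storing zones as frozensets removes duplicates, since circles around different foci can contain the same set of locations.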

Bernoulli vs. Poisson Model
The choice between a Bernoulli and a Poisson model does not matter much if n(G) << µ(G)
In other cases, use the model most appropriate for the application

A Useful Result
An important result on the most likely cluster under these models is given in the paper (Theorem 1)
It states that as long as the points within the zone constituting the most likely cluster are located where they are, H_0 will be rejected irrespective of the locations of the other points in G
If a cluster is located in Seattle, the locations of the points on the East Coast of the U.S. do not matter

Computations and MC
To find the value of the test statistic λ, we need to calculate the LR maximized over the collection of zones in H_1
This seems like a daunting task, since the # of zones could be infinite
But the # of observed points is finite
For a fixed # of points, the likelihood decreases as µ(Z) increases

Computations (contd)
If the circle size increases for a fixed focus, the likelihood needs to be recalculated only when a new point enters the circle
For finitely many points, the # of likelihood recalculations for each focus is finite
The distribution of λ is difficult to obtain
MC simulation is used to generate a histogram of λ
Under H_0, replicate the data sets conditional on n_G
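The Monte Carlo step can be sketched as follows. This is a minimal illustration under stated assumptions: the data are aggregated to locations with measure `mu`, the Poisson model is used (so that, conditional on n_G, the counts under H_0 are multinomial with probabilities proportional to µ), and `statistic` is any user-supplied function mapping a list of counts to the scan statistic. All names are hypothetical. For the Bernoulli model the replicates would instead be drawn without replacement.

```python
import random


def monte_carlo_p_value(observed_stat, statistic, mu, n_g, n_sim=999, seed=0):
    """Monte Carlo p-value for a scan statistic, conditional on n_g.

    observed_stat: value of the statistic on the real data.
    statistic:     function taking a list of per-location counts.
    mu:            measure of each location (sampling weights under H_0).
    n_g:           total number of observed points, held fixed.
    """
    rng = random.Random(seed)
    locations = range(len(mu))
    exceed = 0
    for _ in range(n_sim):
        # Replicate data set under H_0: place n_g points at random,
        # each location chosen with probability proportional to mu.
        counts = [0] * len(mu)
        for i in rng.choices(locations, weights=mu, k=n_g):
            counts[i] += 1
        if statistic(counts) >= observed_stat:
            exceed += 1
    # Rank of the observed value among the n_sim + 1 data sets.
    return (exceed + 1) / (n_sim + 1)
```

With `n_sim=999` the smallest attainable p-value is 0.001, which is why the number of replicates is usually chosen as one less than a round number.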

Application of SSS to SIDS
Bernoulli and Poisson models are illustrated using the SIDS data from NC
For 100 counties in NC, total # of live births and # of SIDS cases for […]
Live births range from 567 to […]
Locations of county seats are the coordinates
The measure is the # of live births in a county

Application to SIDS (continued)
Zones for the scanning window are circles centered at a county coordinate point and including at most half of the total population
Zones are circular only with respect to the aggregated data
As circles around a county seat are drawn, each other county is either completely part of a zone or not at all, depending on whether its county seat is within the circle

Bernoulli Model for SIDS
The Bernoulli model is very natural: each birth can correspond to at most one SIDS case
Table 1 summarizes the results of the analysis
From Figure 1, the most likely cluster, A, consists of Bladen, Columbus, Hoke, Robeson, and Scotland counties
Using a conservative test, a secondary cluster, B, consists of Halifax, Hertford, and Northampton counties

Poisson Model for SIDS
For a rare disease such as SIDS, the Poisson model gives a close approximation to the Bernoulli model
Results are reported in Table 1
Both models detect the same clusters
P-values for the primary cluster are the same for both models; p-values for the secondary cluster are very close

Application to SIDS (continued)

Two significant clusters based on SSS

SSS Adjusted for Race
For SIDS, one useful covariate is race
Race is related to SIDS through unobserved covariates such as quality of housing and access to health care
Overall incidence of SIDS for white children is […] per 1000 and for black children is […] per 1000

SSS: Race-Adjusted (continued)
Racial distribution differs widely among the counties in NC
This analysis leads to the same primary cluster (see Figure 2)
The previous secondary cluster disappears, but a new secondary cluster, C, emerges
Cluster C consists of several counties in the western part of the state

Application to SIDS (continued)

SSS to SIDS adjusted for race

A Bayesian Alternative to SSS
Scott and Berger (2006): idea of Bayesian multiple testing
Observe X_j ~ N(µ_j, σ²), j = 1,…,M
To determine which µ_j are nonzero we have M (conditionally) independent tests, each testing H_0j: µ_j = 0 vs. H_1j: µ_j ≠ 0
p_0 = prior probability that µ_j is zero
Crucial point here: let the data estimate p_0
S&B use the hierarchical model
1. X_j | µ_j, σ², γ_j ~ N(γ_j µ_j, σ²), independently
2. µ_j | τ² ~ i.i.d. N(0, τ²), γ_j | p_0 ~ i.i.d. Bern(1 − p_0)
3. (τ², σ²) ~ π(τ², σ²) = (τ² + σ²)^(−2), p_0 ~ π(p_0)
Several choices for π(p_0): Uniform, Beta(a,1)
S&B computed the posterior probability that γ_j = 1
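To make the role of p_0 concrete, here is a simplified sketch that conditions on fixed values of σ², τ², and p_0. This is an assumption for illustration only: Scott and Berger place priors on these quantities and integrate them out, which this sketch does not attempt. With the hyperparameters held fixed, X_j is marginally N(0, σ² + τ²) when γ_j = 1 and N(0, σ²) when γ_j = 0, so Bayes' theorem gives the posterior probability of γ_j = 1 directly. The function names are hypothetical.

```python
import math


def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def posterior_nonzero_prob(x, sigma2, tau2, p0):
    """P(gamma_j = 1 | x_j) for FIXED sigma2, tau2, p0 (illustrative only).

    gamma_j = 1 (nonzero mean): x_j ~ N(0, sigma2 + tau2) marginally.
    gamma_j = 0 (zero mean):    x_j ~ N(0, sigma2).
    """
    signal = (1.0 - p0) * normal_pdf(x, sigma2 + tau2)
    null = p0 * normal_pdf(x, sigma2)
    return signal / (signal + null)
```

An observation at x = 0 favors the null (the probability is 1/3 for σ² = 1, τ² = 3, p_0 = 0.5), while an observation far from zero pushes the probability toward 1.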

Modification of S&B Model
Assume X_j ~ N(µ_j, σ²), j = 1,…,M
To determine which µ_j are positive we have M (conditionally) independent tests, each testing H_0j: µ_j = 0 vs. H_1j: µ_j > 0
As before
1. X_j | µ_j, σ², γ_j ~ N(γ_j µ_j, σ²), independently
2. µ_j | µ_(−j), ρ, τ² ~ N(ρ Σ_k q_jk µ_k, τ²) [CAR], γ_j | p_j ~ ind. Bern(1 − p_j)
3. (τ², σ², ρ) ~ π(τ², σ², ρ) = (τ² + σ²)
CAR model on logit(p_j)
Compute the posterior probability of µ_j > 0