Rapid Detection of Significant Spatial Clusters
Daniel B. Neill and Andrew W. Moore
The Auton Lab, Carnegie Mellon University School of Computer Science

Introduction

Goals of data mining:
– Discover patterns in data.
– Distinguish patterns that are significant from those that are likely to have occurred by chance.

For example:
– In epidemiology, a rise in the number of disease cases in a region may or may not be indicative of an emerging epidemic.
– In brain imaging, an increase in measured fMRI activation may or may not represent a real increase in brain activity.

This is why significance testing is important!

Problem overview

Assume data has been aggregated to an N x N grid. Each grid cell s_ij has a count c_ij and a population p_ij. Our goal is to find overdensities: spatial regions where the counts are significantly higher than expected, given the underlying population.

[Figure: a 5 x 5 grid showing each cell's underlying population (P) and count (C); the highlighted region has an overdensity of counts.]

Application domains

In epidemiology:
– Counts c_ij represent the number of disease cases in a region, or some related observable quantity (Emergency Department visits, sales of OTC medications).
– Populations p_ij can be obtained from census data or historical counts (e.g. past OTC sales).
– Goal: find clusters of disease cases, allowing early detection of epidemics.

In brain imaging:
– Counts c_ij represent fMRI activation in a given voxel.
– Populations p_ij represent baseline activation under the null condition.
– Goal: find clusters of brain activity corresponding to given cognitive tasks.

Also applicable to other domains, e.g. astrophysics, surveillance.

Problem overview

To detect overdensities:
– Find the most significant spatial regions.
– Calculate the statistical significance of these regions.

We focus here on finding the single most significant rectangular region S* (and its p-value).
– If p-value > α, no significant clusters exist at level α.
– If p-value < α, then S* is significant; we can then examine secondary clusters.

[Figure: the same 5 x 5 grid of populations and counts, with the most significant rectangular region highlighted.]

Why rectangular regions?

We typically expect clusters to be convex; thus inner/outer bounding boxes are reasonably close approximations to the cluster's shape.

We can find clusters with high aspect ratios:
– Important in epidemiology, since disease clusters are often elongated (e.g. from windborne pathogens).
– Important in brain imaging, because of the brain's "folded sheet" structure.

We can find non-axis-aligned rectangles by examining multiple rotations of the data.

Calculating significance

Define models:
– of the null hypothesis H0 (no clusters).
– of the alternative hypothesis H1 (at least one cluster).

Derive a score function D(S) = D(C, P):
– Likelihood ratio: D(S) = L(Data | H1(S)) / L(Data | H0).
– To find the most significant region: S* = arg max_S D(S).

Example: Kulldorff's statistic

Kulldorff's spatial scan statistic (1997) is individually most powerful for finding a single region of elevated disease rate: given a region with uniform disease rate q inside and rate q' < q outside, this test is the most likely to detect the cluster.

Assumption: c_ij ~ Po(q p_ij), i.e. each cell's count is Poisson with mean proportional to its population.

Find the region with the maximum value of the log-likelihood ratio statistic (writing C and P for the count and population of region S, and C_tot and P_tot for the totals over the whole grid):

D(S) = C log(C/P) + (C_tot − C) log((C_tot − C)/(P_tot − P)) − C_tot log(C_tot/P_tot)

when C/P > C_tot/P_tot, and D(S) = 0 otherwise.

[Figure: example map with rate q = .02 inside the cluster and q' = .01 outside.]
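Kulldorff's log-likelihood ratio statistic described on this slide can be computed directly from the region and grid totals. A minimal sketch (not from the slides; the closed form is the standard Poisson likelihood ratio, and the function name is illustrative):

```python
import math

def kulldorff_score(C, P, C_tot, P_tot):
    """Log-likelihood ratio of Kulldorff's statistic under the Poisson
    model c_ij ~ Po(q * p_ij), for a region with count C and population P."""
    if C / P <= C_tot / P_tot:
        return 0.0  # region is not over-dense: the null fits at least as well
    # Maximum-likelihood rates inside the region, outside it, and overall
    q_in = C / P
    q_out = (C_tot - C) / (P_tot - P)
    q_all = C_tot / P_tot
    return (C * math.log(q_in)
            + (C_tot - C) * math.log(q_out)
            - C_tot * math.log(q_all))
```

The score is zero whenever the region's rate does not exceed the grid-wide rate, matching the "elevated disease rate" alternative hypothesis.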

Properties of D(S)

D(S) is increasing with the total count of S, C(S) = Σ_S c_ij.

Properties of D(S)

D(S) is decreasing with the total population of S, P(S) = Σ_S p_ij.

Properties of D(S)

For a constant ratio C / P, D(S) is increasing with P.

Multiple hypothesis testing

Problem with testing significance: we are simultaneously testing a huge number of regions (>1 billion for a 256 x 256 grid), asking if any of them are significant.

If the null hypothesis is true (i.e. no clusters exist), regions' p-values will be uniformly distributed on [0, 1]:
– We expect each p-value to be less than .05, 5% of the time.
– So we expect 50 million false positives!*
– Moreover, the lowest of these p-values (i.e. the p-value of the most significant region) is almost certain to be less than .05.

* Give or take, depending on correlations between region scores… but at least 65,536 of the tests are independent.
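The last point is easy to reproduce with a small simulation (illustrative, not from the slides): even with far fewer than a billion independent null tests, the minimum p-value is essentially always below .05.

```python
import numpy as np

rng = np.random.default_rng(0)

# 200 experiments, each with 10,000 independent tests under the null:
# every p-value is Uniform[0, 1].
pvals = rng.uniform(size=(200, 10000))

# Fraction of experiments whose single smallest p-value falls below .05.
# Analytically this is 1 - 0.95**10000, i.e. indistinguishable from 1.
frac = (pvals.min(axis=1) < 0.05).mean()
```

This is why the naive "is the best region's p-value below .05?" question is meaningless without a multiple-testing correction.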

The solution: randomization testing

1. Randomly generate a replica grid G' under the null hypothesis of no clusters. For example, for Kulldorff's statistic, the replica has the same populations p_ij as grid G, but all counts are generated randomly with a uniform disease rate.
2. Compute the maximum value of D(S) for the replica G', and compare it to the maximum value of D(S) for the original grid G. If D_max(G') > D_max(G), the replica grid beats the original.
3. Repeat steps 1-2 for a large number R of replica grids (typically R = 1000).
4. The p-value is the proportion of replicas G' beating G. The result is statistically significant if p-value < α.
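The four steps above can be sketched end to end as follows (a minimal illustration, far too slow for real grids: the scan is the naive O(N^4) search, the score is Kulldorff's statistic re-derived here for self-containment, and all names are illustrative):

```python
import math
import numpy as np

def score(C, P, C_tot, P_tot):
    """Kulldorff's log-likelihood ratio (0 if the region is not over-dense)."""
    if P == 0 or C == 0 or C / P <= C_tot / P_tot:
        return 0.0
    out = (C_tot - C) * math.log((C_tot - C) / (P_tot - P)) if C_tot > C else 0.0
    return C * math.log(C / P) + out - C_tot * math.log(C_tot / P_tot)

def max_score(counts, pops):
    """Naive scan: best score over all axis-aligned rectangles."""
    N, M = counts.shape
    C_tot, P_tot = counts.sum(), pops.sum()
    best = 0.0
    for x1 in range(N):
        for x2 in range(x1 + 1, N + 1):
            for y1 in range(M):
                for y2 in range(y1 + 1, M + 1):
                    C = counts[x1:x2, y1:y2].sum()
                    P = pops[x1:x2, y1:y2].sum()
                    best = max(best, score(C, P, C_tot, P_tot))
    return best

def randomization_pvalue(counts, pops, R=1000, seed=0):
    """Monte Carlo p-value: the proportion of null replicas whose best
    score beats the original grid's best score (steps 1-4 above)."""
    rng = np.random.default_rng(seed)
    d_orig = max_score(counts, pops)
    q_all = counts.sum() / pops.sum()  # uniform disease rate under the null
    beats = sum(
        max_score(rng.poisson(q_all * pops), pops) > d_orig
        for _ in range(R)
    )
    return beats / R
```

Because the replica scans use the same maximization as the original, the resulting p-value is automatically adjusted for the multiple testing over all regions.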

Why spatial scan?

Previous approaches typically find individual high-density cells and aggregate them using some heuristic method; we instead search over regions to find the ones which are globally optimal (maximizing the score function corresponding to some model).
– Clusters found by previous approaches are typically not optimal in any well-defined sense; also, no conclusions can be drawn about the significance of the region as a whole.

Detecting regions rather than aggregating single cells allows us to be more sensitive to even small (but significant) changes in density, if they are sufficiently large in spatial extent.

Why spatial scan?

The spatial scan statistic framework is both general and powerful:
– Simply choose a model (the null and alternative hypotheses to test), derive the corresponding score function D(S), and apply the spatial scan to find the globally optimal cluster with respect to this score function.
– Assuming the score function has been chosen properly (i.e. as a likelihood ratio), we will (under certain conditions) have an individually most powerful test with respect to the model.

In addition to finding the most significant region, we also compute the significance (p-value) of that region, correctly adjusting for multiple hypothesis testing.
– Thus we have a guaranteed bound on the false positive rate under the null hypothesis.

The spatial scan adjusts for variable underlying populations p_ij, instead of simply searching for regions of high count.

The main disadvantage: computational intractability!
– It requires searching over all spatial regions, both for the original grid and for many replicas!

This is our motivation for finding a fast spatial scan algorithm!

A naïve spatial scan approach

Search all O(N^4) rectangular regions, and return the highest value of the scan statistic.
– We can use the old "cumulative counts" trick to find the score of any region in O(1), so we can search the grid in O(N^4).
– But in order to perform randomization testing, we must do the same for each replica grid, giving us total complexity O(RN^4).

This is much too slow for real-time detection! For a 256 x 256 grid, with 1000 replications: 1.03 trillion regions to search: 14 days!
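The "cumulative counts" trick mentioned above is the standard 2D prefix-sum (integral-image) construction; a minimal sketch, with illustrative function names:

```python
import numpy as np

def cumulative(grid):
    """cum[i, j] = sum of grid[:i, :j]; computed once in O(N^2)."""
    cum = np.zeros((grid.shape[0] + 1, grid.shape[1] + 1), dtype=grid.dtype)
    cum[1:, 1:] = grid.cumsum(axis=0).cumsum(axis=1)
    return cum

def region_sum(cum, x1, y1, x2, y2):
    """Sum of grid[x1:x2, y1:y2] in O(1), by inclusion-exclusion."""
    return cum[x2, y2] - cum[x1, y2] - cum[x2, y1] + cum[x1, y1]
```

With the counts and the populations each summarized this way, the count C and population P of any rectangle, and hence its score, are available in constant time.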

How to speed up our search?

Use a space-partitioning tree?
– Problem: many subregions of a region are not contained entirely in either "child," but instead overlap partially with each.

Option #1: search recursively, but at each node also search all of these "shared" regions.
– Problem: there are O(N^4) such regions even at the top level of the tree!

Option #2: find "pieces" of the region, and merge them bottom-up.
– Problem: the combined region may be more significant than either piece.

The solution: overlap-multiresolution partitioning

We propose a partitioning approach in which adjacent regions are allowed to partially overlap. The basic idea is to:
– Divide the grid into overlapping regions.
– Bound the maximum score of the subregions contained in each region.
– Prune regions which cannot contain the most significant region.
– Find the same region and p-value as the naïve approach… but hundreds or thousands of times faster!

Overlap-multires partitioning

A parent region S is divided into four overlapping children: "left child" S1, "right child" S2, "upper child" S3, and "lower child" S4.

Then for any rectangular subregion S' of S, exactly one of the following is true:
– S' is contained entirely in (at least) one of the children S1…S4.
– S' contains the center region S_C, which is common to all four children.

Starting with the entire grid G and repeating this partitioning recursively, we obtain the overlap-kd tree structure.
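One concrete way to realize this partition (an illustrative sketch: each child covers 3/4 of the parent along one axis, so the center S_C is the middle half in both dimensions; the transcript does not pin down the exact fractions, so treat these as an assumption):

```python
def children(x1, y1, x2, y2):
    """Four overlapping children of the half-open region [x1, x2) x [y1, y2),
    each spanning 3/4 of the parent along one axis (sides assumed divisible by 4)."""
    w, h = x2 - x1, y2 - y1
    return [
        (x1, y1, x2 - w // 4, y2),   # left child S1
        (x1 + w // 4, y1, x2, y2),   # right child S2
        (x1, y1, x2, y2 - h // 4),   # upper child S3
        (x1, y1 + h // 4, x2, y2),   # lower child S4
    ]

def center(x1, y1, x2, y2):
    """Center region S_C: the middle half in both dimensions,
    common to all four children."""
    w, h = x2 - x1, y2 - y1
    return (x1 + w // 4, y1 + h // 4, x2 - w // 4, y2 - h // 4)

def contains(outer, inner):
    """Is rectangle `inner` contained in rectangle `outer`?"""
    a1, b1, a2, b2 = outer
    c1, d1, c2, d2 = inner
    return a1 <= c1 and b1 <= d1 and c2 <= a2 and d2 <= b2
```

With these fractions the dichotomy on the slide holds: a subrectangle that escapes both the left and right children must span the middle half of the columns, and one that also escapes the upper and lower children must span the middle half of the rows, hence contain S_C.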

The overlap-kd tree (first two levels)

[Figure: each node represents a gridded region (thick square) of the entire dataset (thin square and dots).]

Properties of the overlap-kd tree

Every rectangular region S' in G is either:
– a gridded region (i.e. contained in the overlap-kd tree),
– or an outer region of a unique gridded region S (i.e. S' is contained in S and contains its center S_C).

The overlap-kd tree contains O((N log N)^2) nodes, not O(N^4).
– If we can avoid searching most outer regions, we can achieve a huge speedup.

Overlap-multires partitioning

The basic (exhaustive) algorithm: to search a region S, recursively search S1…S4, then search over all outer regions containing S_C.

We can improve the basic algorithm by pruning: since all the outer regions of S contain the (large) center region S_C, we can calculate tight bounds on their maximum score, often allowing us not to search any of them.

Thus our method is a top-down, branch-and-bound search.

Region pruning

In our top-down search, we keep track of the best region S* found so far, and its score D(S*). When we search a region S, we compute upper bounds on the scores:
– of all subregions S' of S;
– of all outer subregions S' (subregions of S containing S_C).

If the upper bounds for a region are worse than the best score so far, we can prune:
– If no subregion can be optimal, prune completely (don't search any subregions).
– If no outer subregion can be optimal, recursively search the child regions, but do not search the outer regions.
– If neither case applies, we must recursively search the children and also search over the outer regions.

Tighter score bounds by quartering

We precompute global bounds on the populations p_ij and ratios c_ij / p_ij, and use these for our initial pruning. If we cannot prune the outer regions of S using the global bounds, we make a second pass which is more expensive but allows much more pruning: we use quartering to give much tighter bounds on populations and ratios, and compute a better score bound using these.
– This requires time quadratic in the region size; in effect, we are computing bounds for all irregular but rectangle-like outer regions.
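The transcript does not give the bound formulas themselves; as an assumption-laden sketch, here is one deliberately loose but valid score bound, derived purely from the monotonicity properties of D(S) noted earlier (the actual algorithm's ratio- and quartering-based bounds are much tighter):

```python
import math
import numpy as np

def kulldorff(C, P, C_tot, P_tot):
    """Kulldorff's log-likelihood ratio score (0 if not over-dense)."""
    if P == 0 or C == 0 or C / P <= C_tot / P_tot:
        return 0.0
    out = (C_tot - C) * math.log((C_tot - C) / (P_tot - P)) if C_tot > C else 0.0
    return C * math.log(C / P) + out - C_tot * math.log(C_tot / P_tot)

def subregion_score_bound(counts, pops, C_tot, P_tot):
    """Loose upper bound on D(S') over all subregions S' of this region.
    D is increasing in C for fixed P, and decreasing in P for fixed C, so
    no subregion can beat a hypothetical region holding the whole region's
    count on its least-populated cell."""
    return kulldorff(counts.sum(), pops.min(), C_tot, P_tot)
```

In the branch-and-bound search, whenever such a bound falls below the score D(S*) of the best region found so far, the region's subregions (or its outer regions, for the analogous outer bound) need not be searched at all.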

Results: a fast spatial scan

Theoretical complexity: O((N log N)^2) (vs. the naïve O(N^4)), if the most significant region is sufficiently dense.
– If not, we can use several other tricks (racing, early stopping) to speed up the search.

In practice, speedups of hundreds to thousands of times:
– Western Pennsylvania Emergency Department data, N=256: 21 minutes.
– National sales of over-the-counter cough and cold medication, N=256: 47 minutes.
– Naïve approach: 14 days!
– Similar gains in performance on the other datasets tested; see the paper for more results.

[Figure: results on the ED dataset.]

Results: a fast spatial scan

Potential impact: facilitating fast disease surveillance by state and local health departments. Preliminary results indicate that we can detect elongated regions 10-20x faster than the current state of the art (Kulldorff's SaTScan software) can detect circular regions.
– CAVEAT #1: Inexact comparison!
– CAVEAT #2: The comparison to SaTScan is preliminary!

[Figure: results on the ED dataset.]

Concluding remarks

We are currently applying our fast spatial scan algorithm to national-level hospital and pharmacy data, monitoring daily for disease outbreaks (in collaboration with the RODS lab at the University of Pittsburgh).

We have also extended the algorithm to arbitrary dimension, and applied these techniques to various multidimensional datasets.
– For example, we are using the 3D fast spatial scan on fMRI data, in order to discover regions of brain activity corresponding to given cognitive tasks (in collaboration with Tom Mitchell and Francisco Pereira at CMU). See our upcoming NIPS paper for more details.