A genetic algorithm for irregularly shaped spatial clusters Luiz Duczmal André L. F. Cançado Lupércio F. Bessegato 2005 Syndromic Surveillance Conference.


A genetic algorithm for irregularly shaped spatial clusters Luiz Duczmal André L. F. Cançado Lupércio F. Bessegato 2005 Syndromic Surveillance Conference Statistics Department, Universidade Federal de Minas Gerais, Brazil

ABSTRACT: We propose a new approach to the detection and inference of irregularly shaped spatial clusters, using a genetic algorithm. We minimize the graph-related operations by means of fast offspring generation and fast evaluation of Kulldorff's scan likelihood ratio statistic. The algorithm is more than ten times faster and exhibits less variance than a similar approach using simulated annealing, and thus gives better confidence intervals for the Monte Carlo inference process that evaluates the significance of the most likely cluster found. An application to spatial disease cluster detection is discussed.

Spatial Scan Statistics, Kulldorff (1997). Consider a map with m regions, total population N, and C cases. Under the null hypothesis there is no cluster in the map, and the number of cases in each region is Poisson distributed.

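For each circle centered at each region's centroid, let z be the collection of regions that lie inside it. Let $c_z$ = number of cases inside z and $\mu_z$ = expected number of cases inside z. The likelihood ratio is
$$L(z) = \left(\frac{c_z}{\mu_z}\right)^{c_z} \left(\frac{C - c_z}{C - \mu_z}\right)^{C - c_z}$$
if $c_z > \mu_z$, and one otherwise. The scan statistic is defined as $\max_z L(z)$.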
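
As a minimal sketch of the likelihood ratio described above, the function below evaluates its logarithm for a single zone under the Poisson model. The function and argument names are illustrative assumptions, not taken from the slides, and the expected count is assumed to be strictly positive.

```python
import math


def poisson_llr(cases_in_zone, expected_in_zone, total_cases):
    """Log of the Poisson likelihood ratio for one zone.

    Assumes expected_in_zone > 0. Returns 0.0 (ratio of one) when the
    zone does not have more cases than expected.
    """
    if cases_in_zone <= expected_in_zone:
        return 0.0
    inside = cases_in_zone * math.log(cases_in_zone / expected_in_zone)
    rest = total_cases - cases_in_zone
    # The outside term vanishes when every case falls inside the zone.
    outside = 0.0
    if rest > 0:
        outside = rest * math.log(rest / (total_cases - expected_in_zone))
    return inside + outside
```

Working with the logarithm avoids overflow for large case counts and does not change which zone attains the maximum.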

The collection (or zone) z with the highest L(z) is the most likely cluster. We sweep through all the $m^2$ possible circular zones, looking for the highest L(z) value. We need to compare this value against the max L(z) for maps with cases distributed randomly under the null hypothesis. The whole procedure is therefore repeated thousands of times, once for each set of randomly distributed cases (Monte Carlo, Dwass (1957)).
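
The Monte Carlo comparison can be sketched as follows, assuming cases are scattered over regions in proportion to population and that the p-value is taken as the rank of the observed maximum among the simulated maxima. Both helper names are hypothetical, and the scan that produces each maximum is left out.

```python
import random


def simulate_null_cases(populations, total_cases, rng):
    """Scatter total_cases over regions with probability proportional to population."""
    counts = [0] * len(populations)
    regions = range(len(populations))
    for idx in rng.choices(regions, weights=populations, k=total_cases):
        counts[idx] += 1
    return counts


def monte_carlo_p_value(observed_max, null_maxima):
    """Rank-based p-value: share of maps (observed included) at least as extreme."""
    at_least = sum(1 for value in null_maxima if value >= observed_max)
    return (at_least + 1) / (len(null_maxima) + 1)
```

With 999 simulated maps and no simulated maximum reaching the observed one, this rule gives a p-value of 1/1000.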

Duczmal L, Kulldorff M, Huang L. (2006) Extreme example of an irregularly shaped cluster

Compactness: A(z) = area of the zone z. H(z) = perimeter of the convex hull of z. Intuitively, the convex hull of a planar object is the region inside a rubber band stretched around it. K(z) = the area of z divided by the area of the circle with perimeter H(z), that is, $K(z) = 4\pi A(z) / H(z)^2$.

Compactness for some common shapes. Circle: K(z) = 1. Square: K(z) = π/4.
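
The compactness measure can be checked numerically with a short sketch. It assumes the zone is described by a list of boundary points and a known area; the convex hull is computed with the monotone chain method, which is a choice made here for illustration and is not stated in the slides.

```python
import math


def convex_hull(points):
    """Convex hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(sequence):
        chain = []
        for p in sequence:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    return lower[:-1] + upper[:-1]


def hull_perimeter(points):
    """Perimeter H(z) of the convex hull of the given points."""
    hull = convex_hull(points)
    n = len(hull)
    return sum(math.dist(hull[i], hull[(i + 1) % n]) for i in range(n))


def compactness(area, perimeter):
    """K(z): area divided by the area of the circle with the given perimeter."""
    return 4.0 * math.pi * area / perimeter ** 2


# Unit square with one interior point: hull perimeter 4, area 1.
square = [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]
k_square = compactness(1.0, hull_perimeter(square))
```

For the unit square this reproduces the value π/4 quoted above.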

Penalty function for the log of the likelihood ratio (LLR(z)): $K(z) \cdot \mathrm{LLR}(z)$. Generalized compactness correction: $K(z)^a \cdot \mathrm{LLR}(z)$, where a = 1: full compactness correction; a = 0.5: medium compactness correction; a = 0.0: no compactness correction.
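
A minimal sketch of the compactness correction, assuming the penalty multiplies the log likelihood ratio by the compactness raised to an exponent a between 0 and 1. The function name is hypothetical.

```python
def penalized_llr(llr, compactness, a=1.0):
    """Compactness-corrected LLR: compactness**a times llr.

    a = 1.0 applies the full correction, a = 0.5 a medium one,
    and a = 0.0 leaves the LLR unchanged.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a is expected to lie in [0, 1]")
    return (compactness ** a) * llr
```

Because the compactness is at most 1, a larger exponent penalizes elongated zones more strongly, while a compact zone keeps most of its score.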

Genetic Algorithms. OBJECTIVE: Find a quasi-optimal solution for a maximization problem. (1) Build an initial population. (2) Randomly cross over parents to generate offspring. (3) Select children and parents for the next generation. (4) Apply random mutation. (5) Repeat the previous steps for a predefined number of generations or until there is no improvement in the functional.
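
The generic loop can be sketched as below. This is an illustration on a toy bit-string problem, not the authors' implementation: the selection rule (keep the fittest among parents and children), the `patience` stopping parameter, and the toy crossover and mutation operators are all assumptions introduced here.

```python
import random


def evolve(population, fitness, crossover, mutate, generations, patience, rng):
    """Generic genetic loop; stops early after `patience` stale generations."""
    population = list(population)
    best = max(population, key=fitness)
    stale = 0
    for _ in range(generations):
        children = []
        for _ in range(len(population)):
            parent_a, parent_b = rng.sample(population, 2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        # Selection: keep the fittest individuals among parents and children.
        pool = sorted(population + children, key=fitness, reverse=True)
        population = pool[:len(population)]
        if fitness(population[0]) > fitness(best):
            best, stale = population[0], 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best


def one_point_crossover(a, b, rng):
    """Toy operator: head of one parent joined to the tail of the other."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip_one_bit(bits, rng, rate=0.2):
    """Toy mutation: with probability `rate`, flip one randomly chosen bit."""
    if rng.random() >= rate:
        return bits
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]
```

In the cluster detection setting, individuals would be zones and the fitness would be the (possibly penalized) LLR; the toy operators above would be replaced by operators that keep zones connected.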

Initial population construction Start at a region of the map.

Initial population construction Add the neighbor which forms the highest LLR 2-cell zone.

Initial population construction Add the neighbor which forms the highest LLR 3-cell zone.

Initial population construction Add the neighbor which forms the highest LLR 4-cell zone.

Initial population construction Stop. (It is impossible to form a higher LLR 5-cell zone)

Initial population construction Start at another region of the map.

Initial population construction Add the neighbor which forms the highest LLR 2-cell zone.

Initial population construction etc. Repeat the previous steps for all the regions of the map.
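
The greedy construction walked through above can be sketched as follows. The data layout (dictionaries keyed by region name) and all function names are assumptions, and the stopping rule "stop when no neighbor gives a higher LLR" is applied at every step, including the first, which is one possible reading of the slides.

```python
import math


def llr(cases, expected, total):
    """Log of the Poisson likelihood ratio; 0.0 unless cases exceed expectation."""
    if cases <= expected:
        return 0.0
    rest = total - cases
    outside = rest * math.log(rest / (total - expected)) if rest > 0 else 0.0
    return cases * math.log(cases / expected) + outside


def zone_llr(zone, cases, expected, total):
    return llr(sum(cases[r] for r in zone), sum(expected[r] for r in zone), total)


def grow_zone(start, neighbors, cases, expected, total):
    """Grow a connected zone from `start`, adding the best neighbor each step."""
    zone = {start}
    current = zone_llr(zone, cases, expected, total)
    while True:
        frontier = {n for r in zone for n in neighbors[r]} - zone
        if not frontier:
            break
        # Sorting makes tie-breaking deterministic.
        best = max(sorted(frontier),
                   key=lambda n: zone_llr(zone | {n}, cases, expected, total))
        best_value = zone_llr(zone | {best}, cases, expected, total)
        if best_value <= current:
            break  # no larger zone has a higher LLR
        zone.add(best)
        current = best_value
    return frozenset(zone), current


def initial_population(neighbors, cases, expected, total):
    """One greedy zone per starting region of the map."""
    return [grow_zone(r, neighbors, cases, expected, total)
            for r in sorted(neighbors)]


# Toy map: four regions in a row, A - B - C - D.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
cases = {"A": 2, "B": 10, "C": 9, "D": 1}
expected = {"A": 5.5, "B": 5.5, "C": 5.5, "D": 5.5}
population = initial_population(neighbors, cases, expected, total=22)
```

Every zone built this way is connected by construction, since regions are only ever added from the current zone's neighbors.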

THE OFFSPRING GENERATION (a simple example)


THE OFFSPRING GENERATION (a simple example) Another possible numbering

THE OFFSPRING GENERATION (a more sophisticated example)

One instance of two parent trees

Advantages: The offspring generation is very inexpensive; All the children zones are automatically connected; Random mutations are easy to implement; The selection for the next generation is straightforward; Fast evolution convergence; The variance between different test runs is small.

Population Evolution Performance

Irregularly shaped clusters benchmark, Northeast US counties map. Duczmal L, Kulldorff M, Huang L. (2006) Evaluation of spatial scan statistics for irregularly shaped clusters. To appear in J. Comput. Graph. Stat.

Power evaluation of the genetic algorithm, compared to the simulated annealing algorithm.

Cluster of high incidence of breast cancer. São Paulo State, Brazil. Population adjusted for age and under-reporting. Scale bar: 0 to 100 km. Compactness correction: 1.0. Cluster cases: 2,924. Cluster population: 346,024. Incidence: LLR: p-value: 0.001. Data source: DATASUS, G.L.Souza.

Cluster of high incidence of breast cancer. São Paulo State, Brazil. Population adjusted for age and under-reporting. Scale bar: 0 to 100 km. Compactness correction: 0.5. Cluster cases: 3,078. Cluster population: 361,373. Incidence: LLR: p-value: 0.001. Data source: DATASUS, G.L.Souza.

Cluster of high incidence of breast cancer. São Paulo State, Brazil. Population adjusted for age and under-reporting. Scale bar: 0 to 100 km. Compactness correction: 0.0. Cluster cases: 3,324. Cluster population: 394,294. Incidence: LLR: p-value: 0.001. Data source: DATASUS, G.L.Souza.

Conclusions. The genetic algorithm for disease cluster detection is fast and exhibits less variance than similar approaches; its use in epidemiological studies and syndromic surveillance is encouraged; the need for penalty functions for the irregularity of a cluster's shape is clearly demonstrated by the power evaluation tests; the power of detection of clusters is similar to that of the simulated annealing algorithm; the flexibility of shape control gives the practitioner more insight into the geographic delineation of the cluster.

References
Duczmal L, Kulldorff M, Huang L. (2006) Evaluation of spatial scan statistics for irregularly shaped clusters. To appear in J. Comput. Graph. Stat.
Duczmal L, Assunção R. (2004) A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Comp. Stat. & Data Anal., 45,
Kulldorff M, Huang L, Pickle L, Duczmal L. (2005) An Elliptic Spatial Scan Statistic. Submitted.
Patil GP, Taillie C. (2004) Upper level set scan statistic for detecting arbitrarily shaped hotspots. Envir. Ecol. Stat., 11,
Tango T, Takahashi K. (2005) A flexibly shaped spatial scan statistic for detecting clusters. Int. J. Health Geogr., 4:11.
Kulldorff M. (1997) A Spatial Scan Statistic. Comm. Statist. Theory Meth., 26(6),
Kulldorff M, Tango T, Park PJ. (2003) Power comparisons for disease clustering tests. Comp. Stat. & Data Anal., 42,
Kulldorff M, Feuer EJ, Miller BA, Freedman LS. (1997) Breast cancer clusters in the Northeast United States: a geographic analysis. Amer. J. Epidem., 146:
de Souza Jr. GL. (2005) The Detection of Clusters of Breast Cancer in São Paulo State, Brazil. M.Sc. Dissertation, Univ. Fed. Minas Gerais.