What is the true shape of a disease cluster? The multi-objective genetic scan Luiz Duczmal Ricardo C.H. Takahashi André L.F. Cançado Univ. Federal Minas.


What is the true shape of a disease cluster? The multi-objective genetic scan. Luiz Duczmal, Ricardo C.H. Takahashi, André L.F. Cançado. Univ. Federal Minas Gerais, Brazil: Statistics Dept., Electrical Engineering Dept., Mathematics Dept. Geoinfo 2006.

Irregularly shaped spatial disease clusters occur commonly in epidemiological studies, but their geographic delineation is poorly defined. Most current spatial scan software displays only one of the many possible cluster solutions with different shapes, from the most compact round cluster to the most irregularly shaped one, corresponding to varying degrees of penalization imposed on the freedom of shape. Even when a fairly complete set of solutions is available, the choice of the most appropriate parameter setting is left to the practitioner, whose decision is often subjective.

We propose quantitative criteria for choosing the best cluster solution, through multi-objective optimization, by finding the Pareto-set in the solution space. Two competing objectives are involved in the search: regularity of shape, and scan statistic value. Instead of sequentially running a cluster-finding algorithm with varying degrees of penalization, the complete set of solutions is found in parallel, employing a genetic algorithm.

The cluster significance concept is extended to this set in a natural and unbiased way, and is employed as a decision criterion for choosing the optimal solution. The Gumbel distribution is used to approximate the empirical scan statistic distribution, speeding up the significance estimation. The method is fast, with good power of detection. An application to breast cancer clusters is discussed. Keywords: spatial scan statistic, disease clusters, geometric compactness penalty correction, Pareto-sets, multi-objective optimization, vector optimization, Gumbel distribution, genetic algorithm.

Spatial Scan Statistics, Kulldorff (1997). Map with m regions, total population N, and C cases. Under the null hypothesis there is no cluster in the map, and the number of cases in each region is Poisson distributed.

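For each circle centered at the centroid of each region, let z be the collection of regions that lie inside it. Let $c_z$ = number of cases inside z and $\mu_z$ = expected cases inside z. Then

$$L(z) = \left(\frac{c_z}{\mu_z}\right)^{c_z} \left(\frac{C - c_z}{C - \mu_z}\right)^{C - c_z}$$

if $c_z > \mu_z$, and one otherwise. The scan statistic is defined as $\max_z L(z)$.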
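As a minimal sketch of the quantity on this slide, the function below evaluates the logarithm of L(z) under the Poisson model. The function and argument names are illustrative assumptions, not taken from the presentation, and the expected count is assumed to be proportional to population (C times the zone population divided by N).

```python
import math

def expected_cases(zone_population, total_population, total_cases):
    # Assumption: under the null hypothesis, expected cases are
    # proportional to the population of the zone.
    return total_cases * zone_population / total_population

def poisson_llr(cases_in, expected_in, total_cases):
    """Log of L(z); returns 0.0 (that is, log 1) when the zone has no excess."""
    c, mu, C = cases_in, expected_in, total_cases
    if c <= mu:
        return 0.0
    llr = c * math.log(c / mu)
    if c < C:
        # The outside term vanishes when every case is inside the zone.
        llr += (C - c) * math.log((C - c) / (C - mu))
    return llr

mu = expected_cases(zone_population=1000, total_population=10000, total_cases=100)
print(poisson_llr(cases_in=20, expected_in=mu, total_cases=100))
```

Working on the log scale avoids overflow for large case counts and does not change which zone attains the maximum.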

The collection (or zone) z with the highest L(z) is the most likely cluster. We sweep through all the m² possible circular zones, looking for the highest L(z) value. We need to compare this value against the max L(z) for maps with cases distributed randomly under the null hypothesis. The whole procedure is therefore repeated thousands of times, once for each set of randomly distributed cases (Monte Carlo, Dwass (1957)).
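The ranking step of this Monte Carlo test can be sketched as follows. The function name and the sample values are hypothetical; the maxima under the null hypothesis are assumed to have been computed already, one per simulated map.

```python
def monte_carlo_p_value(observed_max, null_maxima):
    """Rank-based p-value: position of the observed maximum among
    the maxima obtained from maps simulated under the null hypothesis."""
    rank = 1 + sum(1 for value in null_maxima if value >= observed_max)
    return rank / (len(null_maxima) + 1)

# Hypothetical values: the observed maximum exceeds all three simulated maxima.
print(monte_carlo_p_value(10.0, [1.0, 2.0, 3.0]))  # 0.25
```

With this convention the smallest attainable p-value is 1 divided by the number of simulations plus one.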

Penalty function to control the freedom of shape (joint work with Kulldorff and Huang). [Figure: extreme example of an irregularly shaped cluster.]

A(z) = area of the zone z. H(z) = perimeter of the convex hull of z. Intuitively, the convex hull of a planar object is the region enclosed by a rubber band stretched around it. Compactness: K(z) = the area of z divided by the area of the circle with perimeter H(z), that is, $K(z) = 4\pi A(z) / H(z)^2$.

Compactness for some common shapes. Circle: K(z) = 1. Square: K(z) = π/4.
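The compactness values above can be checked numerically. The sketch below assumes, for illustration only, that a zone is a single simple polygon given by its vertices; a circle with perimeter H has area H²/(4π), so K = 4πA/H². The helper names are hypothetical.

```python
import math

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for a simple polygon."""
    n = len(vertices)
    twice_area = sum(
        vertices[i][0] * vertices[(i + 1) % n][1]
        - vertices[(i + 1) % n][0] * vertices[i][1]
        for i in range(n)
    )
    return abs(twice_area) / 2

def perimeter(vertices):
    n = len(vertices)
    return sum(math.dist(vertices[i], vertices[(i + 1) % n]) for i in range(n))

def compactness(vertices):
    """K = area of the polygon / area of the circle with the hull's perimeter."""
    hull_perimeter = perimeter(convex_hull(vertices))
    return 4 * math.pi * polygon_area(vertices) / hull_perimeter ** 2

square = [(0, 0), (1, 0), (1, 1), (0, 1)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
print(compactness(square), compactness(l_shape))
```

The square reproduces π/4, and the L-shaped polygon scores lower because its convex hull encloses area that the polygon itself does not cover.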

Penalty function for the log of the likelihood ratio, LLR(z): K(z)·LLR(z). Generalized compactness correction: $K(z)^a \cdot LLR(z)$, where a = 1: full compactness correction; a = 0.5: medium compactness correction; a = 0.0: no compactness correction.
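A small sketch of how the exponent a changes the ranking of two candidate zones. The LLR and compactness values below are made up for illustration and do not come from the presentation.

```python
def penalized_llr(llr, compactness, a=1.0):
    """Generalized compactness correction: K(z)**a * LLR(z)."""
    return (compactness ** a) * llr

# Hypothetical candidates: a compact zone and an irregular one with higher raw LLR.
compact = {"llr": 10.0, "k": 0.8}
irregular = {"llr": 14.0, "k": 0.3}

for a in (0.0, 0.5, 1.0):
    scores = {
        "compact": penalized_llr(compact["llr"], compact["k"], a),
        "irregular": penalized_llr(irregular["llr"], irregular["k"], a),
    }
    print(a, max(scores, key=scores.get))
```

With a = 0 the irregular zone wins on raw LLR; with a = 0.5 or a = 1 the compact zone is preferred.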

The Elliptic Scan Statistic (joint work with Kulldorff, Huang and Pickle) The scanning window has variable location, size, shape and angle. A penalty function may be used.

Breast Cancer Mortality Rates: most likely cluster. Source: Pickle et al., Atlas of United States Mortality, NCHS, 1996. [Map panels: Circular; Elliptical, axis ratio = 2; Elliptical, axis ratio = 5.]

[Figures: penalty correction scale from 1 to 0, shown for circular, elliptical, and irregular zones.]

Irregular zones with no penalty correction: disaster!

[Map: average homicide rate, Minas Gerais State, Brazil, in homicides per 100,000 inhabitants per year; 853 municipalities. Source: DATASUS. Map by Ricardo Tavares.]

Genetic Algorithms (joint work with Cançado, Takahashi and Bessegato). OBJECTIVE: Find a quasi-optimal solution for a maximization problem. Steps: (1) Initial population. (2) Random crossing-over of parents and offspring generation. (3) Selection of children and parents for the next generation. (4) Random mutation. (5) Repeat the previous steps for a predefined number of generations or until there is no improvement in the functional.
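The generic loop can be sketched on a toy problem. This is not the authors' cluster-specific algorithm: genomes here are plain bit lists, the crossover is a one-point cut, and mutation is applied to children before selection as a simplification. All names and parameter values are assumptions.

```python
import random

def genetic_maximize(fitness, genome_length, pop_size=20, generations=50,
                     mutation_rate=0.05, patience=10, seed=0):
    """Generic genetic algorithm over bit lists (requires genome_length >= 2)."""
    rng = random.Random(seed)
    # Initial population: random bit lists.
    population = [[rng.randint(0, 1) for _ in range(genome_length)]
                  for _ in range(pop_size)]
    best = max(fitness(g) for g in population)
    stale = 0
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            # Random crossing-over of two parents at a one-point cut.
            mother, father = rng.sample(population, 2)
            cut = rng.randrange(1, genome_length)
            child = mother[:cut] + father[cut:]
            # Random mutation: flip each bit with a small probability.
            child = [1 - bit if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        # Selection: keep the fittest among parents and children.
        population = sorted(population + children, key=fitness,
                            reverse=True)[:pop_size]
        current = fitness(population[0])
        if current > best:
            best, stale = current, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement in the functional
    return population[0]

# Toy objective: maximize the number of ones in a 12-bit genome.
solution = genetic_maximize(sum, 12)
print(solution, sum(solution))
```

Because parents compete with children during selection, the best fitness in the population never decreases from one generation to the next.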

We minimize the graph-related operations by means of fast offspring generation and fast evaluation of Kulldorff's scan likelihood ratio statistic. This algorithm is more than ten times faster and exhibits less variance than a similar approach using simulated annealing, and thus gives better confidence intervals for the Monte Carlo inference process of significance evaluation for the most likely cluster found.

Incidence of Malaria Deaths in the Brazilian Amazon ( )

Initial population construction Start at a region of the map.

Initial population construction Add the neighbor which forms the highest LLR 2-cell zone.

Initial population construction Add the neighbor which forms the highest LLR 3-cell zone.

Initial population construction Add the neighbor which forms the highest LLR 4-cell zone.

Initial population construction Stop. (It is impossible to form a higher LLR 5-cell zone)

Initial population construction Start at another region of the map.

Initial population construction Add the neighbor which forms the highest LLR 2-cell zone.

Initial population construction etc. Repeat the previous steps for all the regions of the map.
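The construction steps above can be sketched as a greedy growth over a region adjacency graph. The four-region map, its case counts, and all function names are hypothetical, and the LLR is the Poisson log likelihood ratio written out so the block is self-contained.

```python
import math

def llr(c, mu, total):
    """Poisson log likelihood ratio of a zone; 0.0 when there is no excess."""
    if c <= mu:
        return 0.0
    value = c * math.log(c / mu)
    if c < total:
        value += (total - c) * math.log((total - c) / (total - mu))
    return value

def grow_zone(start, neighbors, cases, expected):
    """Start at one region and keep adding the neighbor that gives the
    highest LLR, stopping when no neighbor yields a higher LLR."""
    total = sum(cases.values())
    zone = {start}
    c, mu = cases[start], expected[start]
    best = llr(c, mu, total)
    while True:
        frontier = {n for r in zone for n in neighbors[r]} - zone
        candidate = None
        for n in sorted(frontier):
            value = llr(c + cases[n], mu + expected[n], total)
            if value > best:
                best, candidate = value, n
        if candidate is None:
            return zone, best
        zone.add(candidate)
        c += cases[candidate]
        mu += expected[candidate]

# Hypothetical map: four regions in a row, A - B - C - D.
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
cases = {"A": 2, "B": 8, "C": 7, "D": 3}
expected = {"A": 5.0, "B": 5.0, "C": 5.0, "D": 5.0}

# One initial individual per starting region.
initial_population = [grow_zone(r, neighbors, cases, expected)[0] for r in neighbors]
print(initial_population)
```

Since each added region is adjacent to the current zone, every zone in the initial population is connected by construction.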

THE OFFSPRING GENERATION (a simple example)

THE OFFSPRING GENERATION (a simple example) Another possible numbering

THE OFFSPRING GENERATION (a more sophisticated example)

One instance of two parent trees

Advantages: The offspring generation is very inexpensive; All the children zones are automatically connected; Random mutations are easy to implement; The selection for the next generation is straightforward; Fast evolution convergence; The variance between different test runs is small.

Population Evolution Performance

Irregularly shaped clusters benchmark, Northeast US counties map. Duczmal L, Kulldorff M, Huang L. (2006) Evaluation of spatial scan statistics for irregularly shaped clusters. J. Comput. Graph. Stat.

Power evaluation of the genetic algorithm, compared to the simulated annealing algorithm.

Cluster of high incidence of breast cancer. São Paulo State, Brazil, Population adjusted for age and under-reporting.

Cluster of high incidence of breast cancer. São Paulo State, Brazil, Population adjusted for age and under-reporting. Compactness correction: 1.0. Cluster cases: 2,924. Cluster population: 346,024. Incidence: LLR: p-value: 0.001. Data source: DATASUS, G.L.Souza.

Compactness correction: 0.5. Cluster cases: 3,078. Cluster population: 361,373. Incidence: LLR: p-value: 0.001. Data source: DATASUS, G.L.Souza.

Compactness correction: 0.0. Cluster cases: 3,324. Cluster population: 394,294. Incidence: LLR: p-value: 0.001. Data source: DATASUS, G.L.Souza.

The genetic algorithm for disease cluster detection is fast and exhibits less variance compared to similar approaches; Its use in epidemiological studies and syndromic surveillance is encouraged; The need for penalty functions for the irregularity of the cluster's shape is clearly demonstrated by the power evaluation tests; The power of detection of clusters is similar to that of the simulated annealing algorithm; The flexibility of shape control gives the practitioner more insight into the geographic cluster delineation.

Northeast US counties map with observed cases: age-adjusted female breast cancer. Kulldorff M., Feuer E.J., Miller B.A., Freedman L.S. (1997) Breast cancer clusters in the Northeast United States: a geographic analysis. American Journal of Epidemiology, 146. Map legend, percent below/above expected: > 20%; 12% to 20%; 4% to 12%; -4% to +4%; -12% to -4%; -20% to -12%; < -20%.

The Gumbel parametric approximation to the log likelihood ratio scan. Joint work with Cançado and Takahashi. Based on the results of Abrams, Kulldorff and Kleinman.
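One simple way to obtain such an approximation is to fit a Gumbel distribution to the maxima from a modest number of simulated maps and read the p-value from its upper tail. The sketch below uses moment matching, which is an assumption here and may differ from the fitting procedure used by the authors; the sample values are made up.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(samples):
    """Method-of-moments fit: returns (location mu, scale beta)."""
    mean = statistics.mean(samples)
    std = statistics.stdev(samples)
    beta = std * math.sqrt(6) / math.pi
    mu = mean - EULER_GAMMA * beta
    return mu, beta

def gumbel_p_value(x, mu, beta):
    """Upper-tail probability P(X >= x) for a Gumbel(mu, beta) variable."""
    return 1.0 - math.exp(-math.exp(-(x - mu) / beta))

# Hypothetical maxima of the LLR from maps simulated under the null hypothesis.
null_maxima = [3.1, 4.0, 2.7, 5.2, 3.8, 4.4, 3.3, 6.0, 2.9, 4.1]
mu, beta = fit_gumbel(null_maxima)
print(gumbel_p_value(9.0, mu, beta))
```

Unlike the rank-based Monte Carlo p-value, the fitted tail can report values smaller than one over the number of simulations.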

Pareto Sets: the detection of irregularly shaped disease clusters through multi-objective optimization.

The genetic algorithm is used to maximize two objectives: the scan statistic, and the regularity of shape (compactness).

[Plots: candidate solutions in the plane of log likelihood ratio and compactness.] Elite (red dots): each red dot is not surpassed by any other point on all variables simultaneously. The Pareto surface is formed by joining the elite points.
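The elite selection can be sketched as a non-dominated filter over (log likelihood ratio, compactness) pairs, both to be maximized. The sketch uses the usual Pareto dominance rule (another point is at least as good on both objectives and strictly better on one), which is an assumption about the exact rule behind the plots; the points are made up.

```python
def pareto_front(points):
    """Return the non-dominated points when both coordinates are maximized."""
    front = []
    for p in points:
        dominated = any(
            q[0] >= p[0] and q[1] >= p[1] and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    # Sorting by the first objective gives the order in which to join the points.
    return sorted(set(front))

# Hypothetical (LLR, compactness) pairs.
candidates = [(1.0, 0.9), (3.0, 0.7), (2.0, 0.6), (5.0, 0.2), (4.0, 0.1)]
print(pareto_front(candidates))  # [(1.0, 0.9), (3.0, 0.7), (5.0, 0.2)]
```

Along the front, a higher log likelihood ratio can only be bought with lower compactness, which is the trade-off the two objectives express.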

Null Hypothesis Critical Value: Pareto surface, 95th percentile (circles). 100 elites (from 100 simulations under the null hypothesis). [Plot axes: log likelihood ratio, compactness.]

Power Test: Pareto surface, 95th percentile under the null hypothesis (red circles). 100 elites (from 100 simulations under the alternative hypothesis). [Plot axis: log likelihood ratio.]

References
• Duczmal L, Kulldorff M, Huang L. (2006) Evaluation of spatial scan statistics for irregularly shaped clusters. J. Comput. Graph. Stat., 15(2), 1-15.
• Duczmal L, Cançado ALF, Takahashi RHC, Bessegato LF. A genetic algorithm for irregularly shaped spatial scan statistics (submitted).
• Duczmal L, Cançado ALF, Takahashi RHC. Delineation of Irregularly Shaped Disease Clusters through Multi-Objective Optimization (submitted).
• Duczmal L, Assunção R. (2004) A simulated annealing strategy for the detection of arbitrarily shaped spatial clusters. Comp. Stat. & Data Anal., 45.
• Kulldorff M, Huang L, Pickle L, Duczmal L. (2005) An Elliptic Spatial Scan Statistic. Statistics in Medicine (to appear).
• Patil GP, Taillie C. (2004) Upper level set scan statistic for detecting arbitrarily shaped hotspots. Envir. Ecol. Stat., 11.
• Kulldorff M. (1997) A Spatial Scan Statistic. Comm. Statist. Theory Meth., 26(6).
• Kulldorff M, Tango T, Park PJ. (2003) Power comparisons for disease clustering tests. Comp. Stat. & Data Anal., 42.
• Kulldorff M, Feuer EJ, Miller BA, Freedman LS. (1997) Breast cancer clusters in the Northeast United States: a geographic analysis. Amer. J. Epidem., 146.
• de Souza Jr. GL. (2005) The Detection of Clusters of Breast Cancer in São Paulo State, Brazil. M.Sc. Dissertation, Univ. Fed. Minas Gerais.