A High-Throughput Computational Approach to Environmental Health Study Based on CyberGIS
Xun Shi (1), Anand Padmanabhan (2), and Shaowen Wang (2)
(1) Department of Geography, Dartmouth College
(2) Department of Geography and Geographic Information Science, National Center for Supercomputing Applications (NCSA), University of Illinois at Urbana-Champaign
September 2013
Basic functionality of CyberGIS
Accessibility: making GIS capabilities accessible to a large number of users for research and education through the online cyberGIS Gateway
Computational capability: embedding geospatial software capabilities into advanced cyberinfrastructure environments
Interoperability: managing heterogeneous and distributed resources and services through GISolve middleware
A computational approach to spatial epidemiology
Disaggregate polygon-level location data using restricted and controlled Monte Carlo (RCMC).
Calculate local statistics, e.g., the intensity of disease occurrence, using kernel ratio estimation (KRE).
Estimate the statistical significance of the intensity using unrestricted and controlled Monte Carlo (UCMC).
Disaggregate polygon-level location data
[Figure: example polygon labeled 23 births with defects and 1202 births; legend: birth with defect(s), normal birth; population, high to low]
Restricted and Controlled Monte Carlo (RCMC) for Disaggregation
Assign polygon-level addresses to random locations.
The randomization is restricted by the smallest polygon to which a polygon-level address belongs.
The randomization is controlled by the detailed background data.
The randomization is repeated many times (Monte Carlo).
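The RCMC procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the polygons and the background data are given as small aligned grids, and the function name `rcmc_disaggregate` and its parameters are hypothetical.

```python
import random

def rcmc_disaggregate(zone_counts, zone_grid, background, n_iter, seed=0):
    """Return n_iter realizations, each a list of (row, col) locations.

    zone_counts: {polygon id: number of polygon-level addresses}
    zone_grid:   2D list of polygon ids (None outside the study area)
    background:  2D list of non-negative weights, e.g. population
    """
    rng = random.Random(seed)
    # Group candidate cells and their background weights by polygon id.
    cells_by_zone = {}
    for r, row in enumerate(zone_grid):
        for c, zone in enumerate(row):
            weight = background[r][c]
            if zone is not None and weight > 0:
                cells, weights = cells_by_zone.setdefault(zone, ([], []))
                cells.append((r, c))
                weights.append(weight)

    realizations = []
    for _ in range(n_iter):  # Monte Carlo: repeat the randomization
        points = []
        for zone, count in zone_counts.items():
            cells, weights = cells_by_zone[zone]
            # Restricted: only cells inside this polygon are candidates.
            # Controlled: selection probability follows the background weights.
            points.extend(rng.choices(cells, weights=weights, k=count))
        realizations.append(points)
    return realizations
```

Each realization is one plausible set of point locations; downstream statistics are then computed once per realization so that their spread reflects the locational uncertainty.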
Advantages of RCMC
Allows analyses designed for individual/precise locations to be conducted.
Maximizes the utilization of available spatial information.
Explicitly evaluates the spatial uncertainty caused by the imprecision in the data.
Kernel ratio estimation (KRE) for Estimating Local Disease Intensity
Essentially, calculate the ratio between cases and cohort at every location.
[Figure legend: birth with defect(s), normal birth]
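A minimal sketch of the ratio at one location is given below. It assumes a fixed bandwidth centered on the evaluation site and a quartic kernel; the kernel choice and the names `kernel_weight` and `kre_fixed` are assumptions made for illustration, not details taken from the slides.

```python
import math

def kernel_weight(distance, bandwidth):
    """Quartic (biweight) kernel: 1 at the center, 0 at and beyond the bandwidth."""
    if distance >= bandwidth:
        return 0.0
    u = distance / bandwidth
    return (1.0 - u * u) ** 2

def kre_fixed(site, cases, cohort, bandwidth):
    """Kernel-weighted cases divided by kernel-weighted cohort at `site`.

    site is an (x, y) tuple; cases and cohort are lists of (x, y) tuples.
    Returns None when no cohort member falls within the bandwidth.
    """
    def weighted_sum(points):
        return sum(kernel_weight(math.dist(site, p), bandwidth) for p in points)

    denominator = weighted_sum(cohort)
    if denominator == 0:
        return None
    return weighted_sum(cases) / denominator
```

Evaluating `kre_fixed` at every raster pixel yields one intensity layer per set of case and cohort locations.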
Settings of KRE: fixed bandwidth vs. adaptive bandwidth; site-side kernel vs. case-side kernel
Types of KRE: site-side fixed bandwidth; case-side fixed bandwidth; site-side adaptive bandwidth; case-side adaptive bandwidth
Unrestricted and Controlled Monte Carlo (UCMC) for Estimating Statistical Significance
[Diagram: RCMC followed by KRE, and UCMC followed by KRE; the two results are compared to obtain a P-value]
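The comparison step can be illustrated with the usual Monte Carlo rank test. The sketch below assumes a one-sided, pixel-by-pixel comparison of an observed intensity layer against simulated layers; both function names are hypothetical and the exact test used in the study is not stated on the slide.

```python
def monte_carlo_p_value(observed, simulated):
    """Share of values (simulated plus the observed one) at least as large as observed."""
    exceed = sum(1 for s in simulated if s >= observed)
    return (exceed + 1) / (len(simulated) + 1)

def p_value_layer(observed_layer, simulated_layers):
    """Apply the rank test pixel by pixel; None marks nodata pixels."""
    n_rows, n_cols = len(observed_layer), len(observed_layer[0])
    result = [[None] * n_cols for _ in range(n_rows)]
    for r in range(n_rows):
        for c in range(n_cols):
            obs = observed_layer[r][c]
            if obs is None:
                continue
            sims = [layer[r][c] for layer in simulated_layers]
            result[r][c] = monte_carlo_p_value(obs, sims)
    return result
```

With 99 simulated layers, the smallest attainable P-value under this scheme is 1/100 = 0.01.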
Epidemiological confounding factors
[Table: case count and rate by age stratum, for males (total 3498) and females (total 2969); strata begin at ages >0, >29, >39, >49, >54, >59, >64, >69, and >74; the upper bounds, counts, and rates are not recoverable]
[Figure panels: mean P-value; standard deviation of P-value; hot spots]
RCMC-UCMC-based Simulated Case-Control Study for Detecting Disease-Environment Association
Case locations from RCMC; control locations from UCMC; environmental exposure
Spatial variation in disease-environment association: a map of P-values
[Figure: P-value map]
Computational Demand I: Number of local statistic computing (e.g., KRE) iterations in RCMC and UCMC
Scenario:
  – Stratification is needed for addressing confounding factors
  – Case data are at the polygon level
  – Cohort data are at the polygon level
  – Detailed background data are available
RCMC iterations: No. of strata × No. of iterations for cases × No. of iterations for cohort, e.g., 2 × 100 × 100 = 20,000
UCMC iterations: No. of strata × No. of iterations for simulation × No. of iterations for cohort, e.g., 2 × 99 × 100 = 19,800
Computational Demand II: Number of layer-on-layer comparisons for estimating P-value
No. of iterations for cases × No. of iterations for simulation × No. of iterations for cohort, e.g., 100 × 99 × 100 = 990,000
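The counts for Computational Demand I and II follow directly from the products given on the slides. A small helper makes the arithmetic reproducible for other settings; the function name `workload` and its dictionary keys are hypothetical.

```python
def workload(strata, case_iters, cohort_iters, sim_iters):
    """Number of KRE runs and layer-on-layer comparisons for one analysis."""
    return {
        "rcmc_kre_runs": strata * case_iters * cohort_iters,
        "ucmc_kre_runs": strata * sim_iters * cohort_iters,
        "p_value_comparisons": case_iters * sim_iters * cohort_iters,
    }

# Settings from the slides: 2 strata, 100 case iterations,
# 100 cohort iterations, 99 simulation iterations.
counts = workload(strata=2, case_iters=100, cohort_iters=100, sim_iters=99)
for name, value in counts.items():
    print(f"{name}: {value:,}")
```

For the slide settings this gives 20,000 RCMC runs, 19,800 UCMC runs, and 990,000 comparisons.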
Computational Demand III: Pixel-wise statistic computing
No. of pixels that are not "nodata" pixels, e.g., about 3 million in a 1652 × 2912 raster
Major operations, using case-side adaptive bandwidth KRE as an example:
  – Expand the kernel outward in a spiral
  – Accumulate the distance-decayed kernel value for each case encountered
  – Accumulate the cohort value
  – Check if the threshold is met
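The expansion loop listed above can be sketched for a single center cell as follows. This is an illustration under several assumptions that the slide does not confirm: the search grows in circular rings of one cell width rather than a true spiral, the bandwidth is taken as the final ring radius plus one cell, the distance decay is a quartic kernel applied once the bandwidth is known, and the name `adaptive_kernel_at` is hypothetical.

```python
import math

def adaptive_kernel_at(cell, cases, cohort, threshold, max_radius):
    """Grow the search around `cell` until the accumulated cohort meets `threshold`.

    cases and cohort are aligned 2D lists; None in cohort marks a nodata pixel.
    Returns (bandwidth_in_cells, kernel_weighted_cases, cohort_total),
    or None if the threshold is not met within max_radius.
    """
    r0, c0 = cell
    n_rows, n_cols = len(cohort), len(cohort[0])
    encountered = []    # (distance, case count) for each case cell visited
    cohort_total = 0.0
    for radius in range(max_radius + 1):
        for r in range(max(0, r0 - radius), min(n_rows, r0 + radius + 1)):
            for c in range(max(0, c0 - radius), min(n_cols, c0 + radius + 1)):
                d = math.hypot(r - r0, c - c0)
                # Visit only the cells that belong to the current ring.
                if not (radius - 1 < d <= radius):
                    continue
                if cohort[r][c] is None:
                    continue
                cohort_total += cohort[r][c]
                if cases[r][c]:
                    encountered.append((d, cases[r][c]))
        if cohort_total >= threshold:
            # Assumed convention: bandwidth one cell beyond the last ring,
            # so every visited cell keeps a positive weight.
            bandwidth = radius + 1
            weighted = sum(n * (1 - (d / bandwidth) ** 2) ** 2
                           for d, n in encountered)
            return bandwidth, weighted, cohort_total
    return None
```

Because the loop runs once per non-nodata pixel and its cost grows with the bandwidth, sparsely populated areas dominate the running time.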
Computational Demand IV: Memory
Number of raster layers generated during the process: No. of RCMC iterations + No. of UCMC iterations + No. of parallel comparisons, e.g., 20,000 + 19,800 + 10,000 = 49,800
Memory: size of data type × No. of columns × No. of rows × No. of raster layers, e.g., 4 bytes × 1652 × 2912 × 49,800 = 550 gigabytes
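As a quick check of the memory formula, the sketch below evaluates it for the 49,800 layers in two ways. Evaluated literally for the full 1652 × 2912 raster, the product comes to roughly 958 GB (about 892 GiB). If only the approximately 3 million non-nodata pixels are stored, it comes to about 598 GB (about 557 GiB), which is close to the 550 gigabytes quoted. Which convention the quoted figure uses is not stated, so the second variant is an assumption, and the helper name `raster_stack_bytes` is hypothetical.

```python
def raster_stack_bytes(bytes_per_value, n_pixels, n_layers):
    """Bytes needed to hold n_layers rasters of n_pixels values each."""
    return bytes_per_value * n_pixels * n_layers

N_LAYERS = 49_800  # layer count quoted on the slide

full_raster = raster_stack_bytes(4, 1652 * 2912, N_LAYERS)
# Assumption: only the roughly 3 million non-nodata pixels are stored.
valid_only = raster_stack_bytes(4, 3_000_000, N_LAYERS)

for label, n_bytes in [("full raster", full_raster),
                       ("non-nodata only", valid_only)]:
    print(f"{label}: {n_bytes / 1e9:.0f} GB, {n_bytes / 2**30:.0f} GiB")
```

Either way, the layer stack is far larger than the 32 GB of RAM on a single workstation.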
On an HP Z800 Workstation (2 Xeon CPUs at 3.07 GHz, 32 GB RAM)
Mapping birth defects for New Hampshire
1400 birth defect cases for ,000 births for age categories
220 town polygons
100-m resolution female population raster (1652 × 2912)
100 RCMC iterations for cases
100 RCMC iterations for cohort
99 UCMC iterations
Run time: 40 hours
Migrating to cyberGIS
Set up infrastructure
  – New repository created in CyberGIS SVN
  – Establish a development environment
Define the application interface using GISolve Open Service APIs
Build and deploy the code on cyberinfrastructure resources from SVN
Publish the application
Test application execution
Computation Management through GISolve Open Service APIs
Compress the input into a single zip file and make it available at a Web-accessible location
  – Inputs to the program include files for point cases, zone cases, cohort, background, the zone file, and associated settings needed by the application
  – The URL of the zip file is the single parameter to the Open Service APIs
Code execution and input/output data are put into a computation sandbox
Simply run php job-submit.php and the GISolve middleware will take care of the rest
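The packaging step can be done with any zip tool. As one possible sketch, the Python standard library's `zipfile` module bundles the input files into a single archive. The helper name `package_inputs` and the file names in the usage example are hypothetical; the slides do not specify the input file names or how the archive is published.

```python
import zipfile
from pathlib import Path

def package_inputs(input_paths, zip_path):
    """Bundle the given input files into one compressed zip archive."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, input_paths):
            # Store each file under its base name so the archive is flat.
            archive.write(path, arcname=path.name)
    return Path(zip_path)
```

A call such as `package_inputs(["cases.csv", "cohort.csv", "settings.txt"], "input.zip")` would produce the archive, which then has to be placed at a Web-accessible URL before the job is submitted.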
Parallel computing through CIGI local cluster and XSEDE
Original MFC (Windows) code was extracted and adapted to run in a Linux environment
Application code has been checked into the CyberGIS SVN for co-development and deployment on a CIGI local cluster and XSEDE
Developed a set of parallel and distributed computing strategies based on a spatial computational domain construct
Optimizing computational performance of these strategies
Ongoing …
Accessibility: making GIS capabilities accessible to a large number of users for research and education through the online cyberGIS Gateway
Computational capability: embedding geospatial software capabilities into advanced cyberinfrastructure environments
Interoperability: managing heterogeneous and distributed resources and services through GISolve middleware
Designing and constructing a secure data transport protocol and tunnel …
Acknowledgements National Science Foundation - OCI XSEDE SES NIH P20RO18787 NIH P20ES and EPA RD Dartmouth Neukom/IQBS CompX Faculty Grant
Thanks! Questions …