Running high-dimensional genomic models on COMPS

Albert Lee, Sharon Chen, Zhaowei Du, Jeff Steinkraus, Benoit Raybaud, Joshua L. Proctor
IDM Symposium, April 2019

Epidemiology + Computation
Understanding the driving factors for sustained transmission
Can we understand malaria transmission using genetic data?
We can use models to connect genetics to transmission.
Genomics lets us relate infections to one another based on the similarity of their genetic information. Previous work has shown that genetic features of the parasite population can be used to infer transmission dynamics with a stochastic model. Recent work extends this with spatial and temporal metadata. This matters because the data hold a lot of potential information, and the model needs to keep up with it. I will talk about the computational challenges of extending this model to explain more complex features of genomic data.
R. Daniels, W. Wong, E. Wenger, J. Proctor, et al. (2015)

Epidemiology + Computation
Spatio-temporal dynamic genomic model
Link parasite clones to ancestors using genome, time, and location
Iterative Markov model
  Must be sequential (long run times)
High-dimensional
  3 parameters → 10+
  40+ summary statistics (and growing)
  Genetic data → 100s of SNPs
Figure labels: 2009 CGATATG; 2010 CGATATG
The basic idea is that as strains are introduced to a population, either by evolution or by importation, they spread through space over some period of time, and we can characterize this process with a model. The model is stochastic and Markovian.

A dynamic spatial model with genetics
Nodes
  ID  HostIDs
  00  [01, 02, 04]
  01  [00, 03, 05]
Hosts (first table shown)
  ID  BarcodeIDs    Node
  00  [03, __, __]  01
  01  [00, __, __]  00
  02  [02, __, __]  00
  03  [__, __, __]  01
  04  [__, __, __]  00
  05  [01, __, __]  01
Hosts (second table shown)
  ID  BarcodeIDs    Node
  00  [03, __, __]  01
  01  [00, __, __]  00
  02  [02, __, __]  00
  03  [05, __, __]  01
  04  [__, __, __]  00
  05  [01, 04, __]  01
Reservoir
  ID  Barcode  Host
  00  CGATATG  01
  01  CTATACG  05
  02  CTAGATG  02
  03  CTAGACG  00
  04  CGATATG  05
  05  CTAGATG  03
Barcodes labeled in the figure: CGATATG, CTAGATG, CTATACG, CTAGACG
Speaker note: show the upgrade to the model; it means I need more params.

Model Mechanics: A detailed look at a single simulation step
Time step (current state → new state):
  1. deaths
  2. births
     a. clonal
     b. meiotic
  3. importation
Current state, Reservoir
  ID  Barcode
  0   CTANAAGAC
  1   GTAGATGAC
  2   CTAGATGAC
  3   CTCGAATAC
  4   GTCGATGAC
Current state, History
  ID  Barcode
  0   CTACAAGAC
New state, Reservoir
  ID  Barcode
  0   CTANAAGAC
  1   GTAGATGAC
  2   CTAGATGAC
  4   GTCGATGAC
  5   CTAGATGAC
  6   CTAGAAGAC
New state, History
  ID  Barcode
  0   CTACAAGAC
  1   CTAGATGAC
  2   CTAGAAGAC
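The three stages of the time step (deaths, births, importation) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the model's actual code: the function names, the per-infection probabilities, the rule that a meiotic child takes each position from one of two parents at random, and the rule that every new infection is appended to the history are all hypothetical.

```python
import random


def meiotic_child(parent_a, parent_b, rng):
    """Recombine two barcodes: each position comes from one parent at random."""
    return "".join(rng.choice(pair) for pair in zip(parent_a, parent_b))


def step(reservoir, history, rng, death_prob, birth_prob,
         outcross_prob, import_prob, import_pool):
    """Return (new_reservoir, new_history) after one time step.

    reservoir maps infection ID -> barcode; history lists barcodes seen so far.
    All probabilities and the import pool are hypothetical illustration values.
    """
    parents = list(reservoir.values())
    next_id = max(reservoir, default=-1) + 1
    new_history = list(history)

    # 1. Deaths: each current infection is removed with probability death_prob.
    new_reservoir = {i: b for i, b in reservoir.items()
                     if rng.random() >= death_prob}

    # 2. Births: each current infection may seed one new infection.
    for barcode in parents:
        if rng.random() < birth_prob:
            if rng.random() < outcross_prob:
                # b. meiotic: recombine with a randomly chosen second parent
                child = meiotic_child(barcode, rng.choice(parents), rng)
            else:
                # a. clonal: exact copy of the parent barcode
                child = barcode
            new_reservoir[next_id] = child
            new_history.append(child)
            next_id += 1

    # 3. Importation: occasionally add a barcode from outside the population.
    if rng.random() < import_prob:
        imported = rng.choice(import_pool)
        new_reservoir[next_id] = imported
        new_history.append(imported)

    return new_reservoir, new_history


# Current-state reservoir from the slide.
current = {0: "CTANAAGAC", 1: "GTAGATGAC", 2: "CTAGATGAC",
           3: "CTCGAATAC", 4: "GTCGATGAC"}
new_state, new_history = step(current, ["CTACAAGAC"], random.Random(1),
                              death_prob=0.2, birth_prob=0.4,
                              outcross_prob=0.1, import_prob=0.05,
                              import_pool=["CTAGAAGAC"])
```

Because each step reads the state left by the previous one, the steps within a single simulation cannot run in parallel; the parallelism has to come from running many simulations at once.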

Dynamic Model Overview
Parameters
  Floating: R0A, R0B, R0C
  Fixed: extinction rate, outcrossing rate, importation rate
Constants: allele frequency, SNP positions, chromosomes, rainfall, host population
Reservoir, hosts, nodes
Mechanics: outcrossing, importation, clonal propagation
Infection history
Sampling model
Sampled infections
Diversity metrics: COI, repeated clones, number of strains
Senegal data
Likelihood

Parameter sweep
Each parameter vector (R0A, R0B, R0C) is run through the model simulation and compared with the Senegal data to give a likelihood:
  (R0A, R0B, R0C) = (1.2, 1.2, 1.2) → model simulation → likelihood
  (R0A, R0B, R0C) = (1.2, 1.2, 1.8) → model simulation → likelihood
  (R0A, R0B, R0C) = (1.2, 1.2, 2.4) → model simulation → likelihood
Max out at around 5 parallel jobs locally
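A local grid sweep of this kind can be sketched as follows. The grid values 1.2, 1.8, 2.4 and the cap of 5 parallel jobs come from the slide; everything else is an assumption for illustration. In particular, `run_and_score` is a hypothetical stand-in for "run the simulation and compute its likelihood against the data", and `TOY_TARGET` is a made-up optimum with no connection to the real fit.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

GRID = (1.2, 1.8, 2.4)        # grid values shown on the slide
TOY_TARGET = (1.8, 1.2, 2.4)  # hypothetical optimum, for illustration only


def run_and_score(params):
    """Stand-in for 'model simulation -> likelihood' (toy score, not real)."""
    return -sum((p - t) ** 2 for p, t in zip(params, TOY_TARGET))


def sweep(max_workers=5):
    """Score every (R0A, R0B, R0C) combination on the grid."""
    vectors = list(product(GRID, repeat=3))
    # Threads keep the sketch simple; a CPU-bound simulation would need
    # separate processes or a cluster to gain real parallel speedup.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        scores = list(pool.map(run_and_score, vectors))
    return dict(zip(vectors, scores))


results = sweep()
best = max(results, key=results.get)
```

Every grid point is independent of the others, which is what makes the sweep easy to hand to a cluster.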

Parameter sweep results
[Figure: malaria incidence (simulation) and likelihood plotted against R0a and R0b; values labeled on the figure: R0a = 2.4, R0b = 1.8, R0c = 2.1]

Some geeky trivia
Python: NumPy, pandas
Make timesteps, random draws, array ops cheap
  Operate on arrays as much as possible
  Try to build dataframes only once (appends are costly)
  Replace searches with lookups
  Barcode comparisons are bitwise logic operations
Typical simulations: 3 to 300 seconds

Architecture on local
[Diagram labels: LOCAL; COMPS or cloud; batch script; parameter values 1.2, 1.8, 2.4; one output per parameter value; analysis script]

Architecture on COMPS
[Diagram labels: LOCAL (Python script, Analyzer); COMPS (Windows HPC) with a queue and one model launcher per parameter value (1.2, 1.8, 2.4); one output per model launcher]

Likelihood surfaces from 3 parameterizations (1 day per sweep)
Panels: pday = 0.0001, pday = 0.02, pday = 0.5
Summary statistics cannot distinguish between models on yearly timescales (it took an entire weekend just to find this out)

Parameter sweeps are expensive
Current model needs R0 for every quarter:
  4 parameters, 10 grid points each = 10^4 parameter vectors
  100 sims per 4-param vector = 10^6 sims
  ~1-5 minutes per sim = ~36 hours on 1000 cores
Future models will need 10+ parameters:
  10^10 grid points → already intractable
Need a better way to explore parameter space
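The arithmetic behind these figures can be checked directly. The inputs are the slide's numbers; the variable and function names are illustrative only.

```python
grid_points = 10       # grid points per parameter
n_params = 4           # one R0 per quarter
sims_per_vector = 100  # repeated stochastic runs per parameter vector
cores = 1000

vectors = grid_points ** n_params       # 10^4 parameter vectors
total_sims = vectors * sims_per_vector  # 10^6 simulations


def wall_hours(minutes_per_sim):
    """Wall-clock hours if the simulations spread evenly over all cores."""
    return total_sims * minutes_per_sim / cores / 60


low, high = wall_hours(1), wall_hours(5)  # bounds for 1 and 5 minutes per sim
future_vectors = grid_points ** 10        # 10 parameters on the same grid
```

At 1 to 5 minutes per simulation this gives roughly 17 to 83 hours on 1000 cores, so the slide's ~36 hours corresponds to an average of a little over 2 minutes per simulation. With 10 parameters the grid alone has 10^10 points before any repeats.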

Optimizing parameter space
Most of the grid evaluations are wasted
If we knew beforehand where the solution was, we could save time
Solution: adaptive sampling
  Metropolis-Hastings
  Importance sampling
  Gradient descent
All of these require talking to the cluster and updating the list of sims I want to run
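As an illustration of the first option, here is a minimal random-walk Metropolis-Hastings sketch. It is not the calibration code used with COMPS: `log_likelihood` is a hypothetical smooth toy surface standing in for a batch of simulations scored against data, and `TOY_PEAK`, `TOY_WIDTH`, the starting point, and the step size are made-up values.

```python
import math
import random

TOY_PEAK = (2.4, 1.8, 2.1)  # hypothetical optimum, for illustration only
TOY_WIDTH = 0.3             # hypothetical width of the toy surface


def log_likelihood(params):
    """Toy stand-in for 'run simulations at params and score them'."""
    return -0.5 * sum(((p - t) / TOY_WIDTH) ** 2
                      for p, t in zip(params, TOY_PEAK))


def metropolis_hastings(start, n_steps, step_sd, rng):
    """Random-walk Metropolis-Hastings with a symmetric Gaussian proposal."""
    current = tuple(start)
    current_ll = log_likelihood(current)
    chain = [current]
    for _ in range(n_steps):
        proposal = tuple(p + rng.gauss(0.0, step_sd) for p in current)
        proposal_ll = log_likelihood(proposal)
        # Accept with probability min(1, L(proposal) / L(current)).
        if rng.random() < math.exp(min(0.0, proposal_ll - current_ll)):
            current, current_ll = proposal, proposal_ll
        chain.append(current)
    return chain


chain = metropolis_hastings((1.2, 1.2, 1.2), 5000, 0.2, random.Random(0))
tail = chain[2500:]  # discard the first half as burn-in
means = [sum(sample[i] for sample in tail) / len(tail) for i in range(3)]
```

Each proposal depends on the result of the previous evaluation, so the list of simulations to run is not known up front. That is why an adaptive method needs a loop that submits jobs to the cluster, waits for the results, and then decides what to submit next.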

Architecture with IMIS (Calibtools)
[Diagram labels: LOCAL (Python script (init params), Python script (refine params), Analyzer, Analyzer); COMPS (Q1, Q2, ...)]

Conclusions and next steps
Conclusions
  COMPS gives >100x speedup via parallelization
  More flexible parameter fits
  Explanatory power of model greatly increased
  HPC architecture allows us to analyze simulations in aggregate and then iterate
Next steps
  Implementing IMIS
  Theoretically unlimited number of parameters
  Potentially more robust framework for model evaluation
  Dynamic job scheduler?
Speaker note: bring in more science; why do we need more params?

What about the science…
Sarah Volkman, Tuesday 9am: Genetic Signals of Plasmodium falciparum Reveal Transmission Dynamics and Track Infections
Albert Lee, Wednesday 11am: Molecular surveillance reveals spatio-temporal trends of malaria transmission in Senegal

End

Appendix

Barcode Data
[Figure: SNP positions (…TTACG… …CAT… …GAC… …G… …AGTC…) across 14 chromosomes are read into barcodes such as CTAGAATAC, CTAGAAGAC, and CTACAAGAC; some calls are ambiguous, as in CTANAANAC and CTAGA?GAC]
  ID  Barcode
  0   CTANAANAC
  1   GTAGATGAC
  2   CTAGAXGAC
  3   CTCGAATAC
  :   CTAGAXGAC

Dynamic genomic model
[Diagram labels: Reservoir; Infection history; barcodes shown: CGATATG, CTAGATG, CTAGACG]
Speaker note: ask Caitlin about mosquito