Integrated genome analysis using Makeflow + friends
Scott Emrich, Director of Bioinformatics
Computer Science & Engineering, University of Notre Dame
VectorBase is a Bioinformatics Resource Center
VectorBase is a genome resource for invertebrate vectors of human pathogens
Funded by NIH-NIAID as part of a wider group of NIAID BRCs for biodefense and emerging and re-emerging infectious diseases
3rd contract started Fall 2014 (for up to 2 more years)
VectorBase: relevant species of interest
Assembly required…
Current challenges genome informaticians are focusing on
Refactoring genome mapping tools to use HPC/HTC for speed-up, especially when new, faster algorithms are not yet available
Using "data-intensive" frameworks: MapReduce/Hadoop and Spark
Efficiently harnessing resources from heterogeneous systems
Scalable, elastic workflows with flexible granularity
Accelerating Genomics Workflows in Distributed Environments (Research Update, March 8, 2016)
Olivia Choudhury, Nicholas Hazekamp, Douglas Thain, Scott Emrich
Department of Computer Science and Engineering, University of Notre Dame, IN
Scaling Up Bioinformatic Workflows with Dynamic Job Expansion: A Case Study Using Galaxy and Makeflow Nicholas Hazekamp, Joseph Sarro, Olivia Choudhury, Sandra Gesing, Scott Emrich and Douglas Thain Cooperative Computing Lab: http://ccl.cse.nd.edu University of Notre Dame
Using Makeflow to express the genome variation workflow
Work Queue master-worker framework
Sun Grid Engine (SGE) batch system
Overview of CCL-based solution
We use Work Queue, a master-worker framework for submitting, monitoring, and retrieving tasks.
We support a number of different execution engines such as Condor, SLURM, TORQUE, etc.
TCP communication allows us to utilize systems and resources that are not part of the shared filesystem, opening up the opportunity for a larger number of machines and workers.
Workers can be scaled to better accommodate the structure of the DAG and the busyness of the overall system.
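As a rough illustration of that setup, a minimal Work Queue master in Python might look like the sketch below. It assumes the cctools work_queue Python bindings are installed; the port, file names, and the BWA command line are hypothetical, and binding details can differ between cctools versions.

```python
# Minimal Work Queue master sketch: submit, monitor, and retrieve tasks.
# File names and commands are hypothetical placeholders.
import work_queue as wq

q = wq.WorkQueue(port=9123)        # workers connect to this port over TCP
print("Work Queue master listening on port", q.port)

for i in range(4):                 # one alignment task per read chunk
    cmd = "bwa mem ref.fa chunk_{0}.fq > chunk_{0}.sam".format(i)
    t = wq.Task(cmd)
    t.specify_input_file("ref.fa", cache=True)      # reused input, cached on workers
    t.specify_input_file("chunk_{0}.fq".format(i))
    t.specify_output_file("chunk_{0}.sam".format(i))
    q.submit(t)

while not q.empty():               # retrieve completed tasks as they return
    t = q.wait(5)
    if t:
        print("task", t.id, "finished with status", t.return_status)
```

Because inputs and outputs are shipped over the same TCP connection, the workers do not need to share a filesystem with the master; they can be started on Condor, SLURM, or TORQUE nodes or on any machine that can reach the master's port.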
Realized concurrency (in practice)
Related Work (HPC/HTC; not exhaustive!)
Jarvis et al.: performance models efficiently manage workloads on clouds
Ibrahim et al., Grossman: balance the number of resources against the duration of their usage
Grandl et al., Ranganathan et al., Buyya et al.: scheduling techniques reduce resource utilization
Hadoop, Dryad, and CIEL support data-intensive workloads
Discussion points: how to write up and discuss related work? Why are we not doing scheduling?
Observations Multi-level concurrency is not high with current bioinformatics tools
Observations
Task-level parallelism can get worse as each task is given more cores
Balancing multi-level concurrency and task-level parallelism is easy with Work Queue
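Extending the sketch above, that balance can be expressed directly: each task declares how many cores it will use, and Work Queue packs tasks onto workers accordingly. The thread flag and the values of N and K below are illustrative only.

```python
# Balance multi-level concurrency (threads per task, N) against
# task-level parallelism (number of tasks, K). Values are illustrative.
import work_queue as wq

N_CORES = 4                         # threads each aligner task uses internally
K_TASKS = 90                        # number of data partitions / tasks

q = wq.WorkQueue(port=9123)
for i in range(K_TASKS):
    cmd = "bwa mem -t {n} ref.fa part_{i}.fq > part_{i}.sam".format(n=N_CORES, i=i)
    t = wq.Task(cmd)
    t.specify_cores(N_CORES)        # scheduler uses the declared cores when packing workers
    t.specify_input_file("ref.fa", cache=True)
    t.specify_input_file("part_{0}.fq".format(i))
    t.specify_output_file("part_{0}.sam".format(i))
    q.submit(t)
```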
Results – Predictive Capability for three tools Avg. MAPE = 3.1
Results – Cost Optimization

# Cores/Task | # Tasks | Predicted Time (min) | Speedup | Estimated EC2 Cost ($) | Estimated Azure Cost ($)
1            | 360     | 70                   | 6.6     | 50.4                   | 64.8
2            | 180     | 38                   | 12.3    | 25.2                   | 32.4
4            | 90      | 24                   | 19.5    | 18.9                   |
8            | 45      | 27                   | 17.3    |                        |
Galaxy
Popular with biologists and bioinformaticians
Emphasis on reproducibility
Varying levels of difficulty, but it mostly boils down to this: once a tool is installed, it has turn-key execution (if everything is defined properly, it runs)
Provides an interface for chaining tools into workflows, and for storing and sharing them
Workflows in Galaxy
Introduction to a short Galaxy workflow
To the user, each tool is a black box; they don't have to know what is happening in the back end
Turn-key execution: the user doesn't see any of this interaction, just tool execution success or failure
Definition of a Galaxy job
User-System Interaction
Workflow Dynamically Expanded behind Galaxy
The user needs to know nothing of the specific execution; the complexities and verification are hidden behind the Galaxy façade
As computational needs increase, so too do the resources needed and the ways we interact with them
A programmer with a better grasp of the workings of the software can determine a safe means of decomposition that can then be harnessed by many different scientists
New User-System Interaction
Results – Optimal Configuration: for the given dataset, K* = 90 tasks and N* = 4 cores per task
Best Data Partitioning Approaches
Granularity-based partitioning for parallelized BWA
Alignment-based partitioning for parallelized HaplotypeCaller
(Diagram stages: Split Ref, Split SAM, SAMBAM, ReadGroups)
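As a minimal sketch of the granularity-based partitioning step, assuming a single FASTQ input split into a fixed number of read chunks before alignment (the file names and chunk count are hypothetical):

```python
# Split a FASTQ file into K chunks so each chunk can be aligned independently.
# A FASTQ record is 4 lines; file names and K are hypothetical placeholders.
K = 90

def split_fastq(path, k):
    """Distribute reads from `path` round-robin across k chunk files."""
    outs = [open("part_{0}.fq".format(i), "w") for i in range(k)]
    with open(path) as fq:
        record, n = [], 0
        for line in fq:
            record.append(line)
            if len(record) == 4:            # one complete FASTQ record
                outs[n % k].writelines(record)
                record, n = [], n + 1
    for out in outs:
        out.close()

split_fastq("reads.fq", K)
```

Each resulting part file can then be aligned as an independent BWA task; the alignment-based partitioning for HaplotypeCaller instead operates on the aligned output (the Split Ref / Split SAM / ReadGroups stages in the diagram).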
Full Scale Run
61.5X speedup (Galaxy)
Test tools: BWA and GATK's HaplotypeCaller
Test data: 100-fold coverage Illumina HiSeq single-end genome data of 50 northern red oak individuals
(Plot: run time, HH:MM)
Comparison of Sequential and Parallelized Pipelines

           | BWA             | Intermediate Steps | HaplotypeCaller | Pipeline
Sequential | 4 hrs. 04 mins. | 5 hrs. 37 mins.    | 12 days         |
Parallel   | 0 hr. 56 mins.  | 2 hrs. 45 mins.    | 0 hr. 24 mins.  | 4 hrs. 05 mins.

Run time of parallelized BWA-HaplotypeCaller pipeline with optimized data partitioning
Performance in Real Life (summer 2016)
100+ different runs through the workflow
Utilizing 500+ cores under heavy load
Data sets ranging from >1 GB to 50 GB+
VectorBase production example
VB running BLAST (before)
The frontend talks directly to Condor
Custom Condor submit scripts per database
One Condor job is designated to wait on the rest (idle-wait)
(Diagram: frontend submitting BLAST Condor jobs)
VB running BLAST (now)
Makeflow manages the workflow and the connections to Condor
Makeflow files are created on the fly (PHP code, no custom scripts per database)
All Condor slots run computations
Jobs take 1/3 the time (saves about 18 s in response time)
Similar changes for HMMER and Clustal
(Diagram: frontend submitting BLAST jobs to Condor via Makeflow)
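As a rough Python analogue of that approach (VectorBase does this in PHP; the database, query, and file names below are hypothetical), a Makeflow file can be generated on the fly and handed to the Condor batch driver:

```python
# Generate a small Makeflow for a split BLAST search on the fly, then run it
# through Condor. Names are hypothetical; "db" stands in for a formatted
# BLAST database. The real VectorBase code does the equivalent in PHP.
import subprocess

CHUNKS = 3
rules = []
for i in range(CHUNKS):
    rules.append(
        "hits_{i}.txt: query_{i}.fa\n"
        "\tblastn -query query_{i}.fa -db db -out hits_{i}.txt\n".format(i=i)
    )

# Final rule concatenates the per-chunk results.
parts = " ".join("hits_{0}.txt".format(i) for i in range(CHUNKS))
rules.append("all_hits.txt: {0}\n\tcat {0} > all_hits.txt\n".format(parts))

with open("blast.mf", "w") as mf:
    mf.write("\n".join(rules))

# Makeflow dispatches each rule as a Condor job (-T selects the batch system).
subprocess.check_call(["makeflow", "-T", "condor", "blast.mf"])
```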
VB jobs - future?
Possible direction: run VB jobs through Work Queue, the master-worker framework described earlier (submitting, monitoring, and retrieving tasks over TCP; execution engines such as Condor, SLURM, and TORQUE; workers outside the shared filesystem, scaled to the DAG structure and the busyness of the overall system).
Acknowledgements
Notre Dame Bioinformatics Lab (http://www3.nd.edu/~semrich/) and The Cooperative Computing Lab (http://www3.nd.edu/~ccl/), University of Notre Dame
NIH/NIAID grant HHSN272200900039C and NSF grants SI2-SSE-1148330 and OCI-1148330
Questions?
Small Scale Run Query: 600MB Ref: 36MB
Data Transfer – A Hindrance

Workers | Data Transferred (MB) | Transfer Time (s)
2       | 64266                 | 594
5       | 65913                 | 593
10      | 67522                 | 598
20      | 70350                 | 623
50      | 74534                 | 754
100     | 80267                 | 765

Amount and time of data transfer with increasing workers
MinHash from 1,000 feet
Similarity: SIM(s1, s2) = |intersection| / |union| of the sets derived from sequences s1 and s2 (e.g., s1 = ACGTGCGAAATTTCTC, s2 = AAGTGCGAAATTACTT)
Signatures: SIG(s) = [h1(s), h2(s), ..., hk(s)]
Comparing two sequences then requires only k integer comparisons, where k is constant
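A minimal sketch of the idea, assuming sequences are reduced to k-mer sets and hashed with a simple salted hash (the k-mer length, number of hash functions, and hashing scheme are illustrative):

```python
# MinHash sketch: estimate the similarity of two sequences from fixed-size
# signatures. Parameters below are illustrative only.
import hashlib

KMER = 4          # k-mer length used to turn a sequence into a set
NUM_HASHES = 64   # signature length k (number of hash functions)

def kmers(seq, k=KMER):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h(x, salt):
    """One member of a family of hash functions, distinguished by `salt`."""
    return int(hashlib.sha1(("%d:%s" % (salt, x)).encode()).hexdigest(), 16)

def signature(seq):
    ks = kmers(seq)
    return [min(h(x, salt) for x in ks) for salt in range(NUM_HASHES)]

def estimated_similarity(sig1, sig2):
    # Fraction of positions where the minima agree estimates SIM(s1, s2).
    return sum(a == b for a, b in zip(sig1, sig2)) / float(len(sig1))

s1, s2 = "ACGTGCGAAATTTCTC", "AAGTGCGAAATTACTT"
print(estimated_similarity(signature(s1), signature(s2)))
```

Once signatures are precomputed, comparing any two sequences costs only NUM_HASHES integer comparisons, regardless of sequence length.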
Three stages of scaffolding
E. coli K12 50 rearrangements
E. coli K12 500 rearrangements
Application-level Model for Runtime
Application-level Model for Memory
System-level Model for Runtime
System-level Model for Memory
Distribution of Regression Coefficients