1
Integrated genome analysis using
Makeflow + friends
Scott Emrich, Director of Bioinformatics, Computer Science & Engineering, University of Notre Dame (UND)
2
VectorBase is a Bioinformatics Resource Center
VectorBase is a genome resource for invertebrate vectors of human pathogens.
Funded by NIH-NIAID as part of a wider group of NIAID BRCs (see above) for biodefense and emerging and re-emerging infectious diseases.
The 3rd contract started in the Fall (for up to 2 more years).
3
VectorBase: relevant species of interest
4
Assembly required…
5
Current challenges genome informaticians are focusing on
Refactoring genome mapping tools to use HPC/HTC for speed-up, especially when new, faster algorithms are not yet available
Using “data intensive” frameworks: MapReduce/Hadoop and Spark
Efficiently harnessing resources from heterogeneous systems
Scalable, elastic workflows with flexible granularity
6
Accelerating Genomics Workflows In Distributed Environments
Research Update: Accelerating Genomics Workflows in Distributed Environments
March 8, 2016
Olivia Choudhury, Nicholas Hazekamp, Douglas Thain, Scott Emrich
Department of Computer Science and Engineering, University of Notre Dame, IN
7
Scaling Up Bioinformatic Workflows with Dynamic Job Expansion: A Case Study Using Galaxy and Makeflow
Nicholas Hazekamp, Joseph Sarro, Olivia Choudhury, Sandra Gesing, Scott Emrich and Douglas Thain
Cooperative Computing Lab, University of Notre Dame
8
Using makeflow to express genome variation workflow
Work Queue master-worker framework
Sun Grid Engine (SGE) batch system
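A minimal sketch of how such a workflow can be expressed, assuming the Makeflow file is generated from a script rather than written by hand: each rule names its output files, its input files, and the command that produces the outputs, and Makeflow derives the task DAG from those dependencies. The chunk count, file names, and merge_and_call.sh helper are hypothetical; bwa and a variant-calling step stand in for the actual pipeline tools.

```python
# Sketch (not the production code): emit a small Makeflow file whose rules
# align read chunks with BWA and then merge/call variants in one final step.
# Chunk count, file names, and merge_and_call.sh are illustrative assumptions.

N_CHUNKS = 4  # hypothetical granularity

rules = []
for i in range(N_CHUNKS):
    # One alignment task per read chunk; each rule lists outputs, inputs, command.
    rules.append(
        f"chunk{i}.sam: ref.fa reads.{i}.fq\n"
        f"\tbwa mem ref.fa reads.{i}.fq > chunk{i}.sam\n"
    )

# Final rule depends on every per-chunk output (merge + variant calling,
# shown here as a single placeholder command/script).
inputs = " ".join(f"chunk{i}.sam" for i in range(N_CHUNKS))
rules.append(
    f"variants.vcf: merge_and_call.sh ref.fa {inputs}\n"
    f"\t./merge_and_call.sh ref.fa {inputs} > variants.vcf\n"
)

with open("variation.mf", "w") as mf:
    mf.write("\n".join(rules))
```

The same file can then be dispatched unchanged with `makeflow -T wq variation.mf` (Work Queue) or `makeflow -T sge variation.mf` (SGE).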
9
Overview of CCL-based solution
We use Work Queue, a master-worker framework for submitting, monitoring, and retrieving tasks.
We support a number of different execution engines such as Condor, SLURM, TORQUE, etc.
TCP communication allows us to utilize systems and resources that are not part of the shared filesystem, opening up the opportunity for a larger number of machines and workers.
Workers are scaled to better accommodate the structure of the DAG and the busyness of the overall system.
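A minimal Work Queue master sketch along these lines, using the cctools Python binding (the BWA command, file names, and chunk count are placeholders, not the production pipeline):

```python
import work_queue as wq

# Master: declares tasks plus their input/output files; Work Queue ships the
# files over TCP to workers that need not share a filesystem.
q = wq.WorkQueue(port=9123)

for i in range(4):  # hypothetical number of chunks
    t = wq.Task(f"bwa mem ref.fa reads.{i}.fq > chunk{i}.sam")
    t.specify_input_file("ref.fa")            # sent to the worker
    t.specify_input_file(f"reads.{i}.fq")
    t.specify_output_file(f"chunk{i}.sam")    # retrieved when the task finishes
    q.submit(t)

# Workers are started separately, e.g. "work_queue_worker <master-host> 9123",
# and can themselves be submitted as Condor/SGE/SLURM jobs, which is how the
# worker pool is scaled to match the DAG and the busyness of the system.
while not q.empty():
    t = q.wait(5)
    if t:
        print(f"task {t.id} finished with return code {t.return_status}")
```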
10
Realized concurrency (in practice)
11
Related Work (HPC/HTC; not extensive!)
Jarvis et al. – performance models efficiently manage workloads on clouds
Ibrahim et al., Grossman – balance the number of resources and the duration of usage
Grandl et al., Ranganathan et al., Buyya et al. – scheduling techniques reduce resource utilization
Hadoop, Dryad, and CIEL support data-intensive workloads
How to write + discuss related work? Why are we not doing scheduling?
12
Observations: Multi-level concurrency is not high with current bioinformatics tools
13
Observations: Task-level parallelism can get worse
Balancing multi-level concurrency and task-level parallelism is easy with Work Queue
14
Results – Predictive Capability for three tools
Avg. MAPE = 3.1
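For reference, MAPE here is the standard mean absolute percentage error over the n test runs, with y_i the measured value and ŷ_i the model's prediction; an average near 3 means the predictions are off by only a few percent on average:

```latex
\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
```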
15
Results – Cost Optimization

# Cores/Task   # Tasks   Predicted Time (min)   Speedup   Estimated EC2 Cost ($)   Estimated Azure Cost ($)
1              360       70                     6.6       50.4                     64.8
2              180       38                     12.3      25.2                     32.4
4              90        24                     19.5      18.9
8              45        27                     17.3
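A back-of-the-envelope way to produce numbers of this kind (the $0.12 per core-hour price below is a hypothetical rate, not the one behind the slide; real EC2/Azure billing also rounds to whole instance-hours and depends on instance type):

```python
# Estimated cloud cost for one configuration: total core-hours times an
# assumed per-core-hour price. PRICE_PER_CORE_HOUR is illustrative only.
PRICE_PER_CORE_HOUR = 0.12  # hypothetical $/core-hour

def estimated_cost(cores_per_task, num_tasks, predicted_minutes):
    total_cores = cores_per_task * num_tasks            # all tasks run concurrently
    core_hours = total_cores * predicted_minutes / 60   # machine time consumed
    return core_hours * PRICE_PER_CORE_HOUR

# The 1-core x 360-task row: 360 cores for 70 minutes -> about $50
print(round(estimated_cost(1, 360, 70), 2))
```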
16
Galaxy: popular with biologists and bioinformaticians
Emphasis on reproducibility.
Varying levels of difficulty, but mostly it boils down to: once a tool is installed it has turn-key execution (if everything is defined properly, it runs).
Provides an interface for chaining tools into workflows, storing them, and sharing them.
17
Workflows in Galaxy – intro to a short Galaxy workflow
To the user, each tool is a black box: they don't have to know what's happening in the back end.
Turn-key execution.
The user doesn't have to see any of this interaction, just tool execution success or failure.
Define GALAXY JOB.
18
User-System Interaction
19
Workflow Dynamically Expanded behind Galaxy
The user needs to know nothing about the specific execution; the complexities and verification are hidden behind the Galaxy façade.
As computational needs increase, so too do the resources needed and how we interact with them.
A programmer with a better grasp of the workings of the software can determine a safe means of decomposition that can then be harnessed by many different scientists.
20
New User-System Interaction
21
Results – Optimal Configuration
For the given dataset, K* = 90, N* = 4
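A minimal sketch of one way such an optimum falls out of the earlier table, assuming cost is the objective (shown here against the EC2 column; the actual selection may also weigh runtime, and rows with missing costs are omitted):

```python
# Pick the configuration (cores per task, number of tasks) with the lowest
# estimated cost among the predicted candidates from the cost table.
candidates = [
    # (cores_per_task, num_tasks, predicted_minutes, est_ec2_cost_usd)
    (1, 360, 70, 50.4),
    (2, 180, 38, 25.2),
    (4, 90, 24, 18.9),
]
best = min(candidates, key=lambda c: c[3])
print(f"N* = {best[0]} cores/task, K* = {best[1]} tasks")  # -> N* = 4, K* = 90
```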
22
Best Data Partitioning Approaches
Stages: Split Ref, Split SAM, SAMBAM, ReadGroups
Granularity-based partitioning for parallelized BWA
Alignment-based partitioning for parallelized HaplotypeCaller
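A toy illustration of the granularity-based side (splitting the reads so each BWA task gets an independent chunk); the real pipeline also splits the reference and SAM files as listed above, and would stream rather than load multi-gigabyte inputs into memory. File names and chunk count are assumptions.

```python
# Illustrative granularity-based partitioning (not the production code):
# split a single-end FASTQ file into num_chunks pieces so each piece can be
# aligned by an independent BWA task.
def split_fastq(path, num_chunks):
    with open(path) as f:
        lines = f.readlines()
    records = [lines[i:i + 4] for i in range(0, len(lines), 4)]  # 4 lines per read
    per_chunk = (len(records) + num_chunks - 1) // num_chunks
    out_paths = []
    for c in range(num_chunks):
        chunk = records[c * per_chunk:(c + 1) * per_chunk]
        if not chunk:
            break
        out = f"reads.{c}.fq"
        with open(out, "w") as o:
            o.writelines(line for rec in chunk for line in rec)
        out_paths.append(out)
    return out_paths
```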
23
Full Scale Run 61.5X speedup (Galaxy)
Chart: time (HH:MM), showing the 61.5X speedup (Galaxy)
Test tools – BWA and GATK's HaplotypeCaller
Test data – fold coverage ILLUMINA HiSeq single-end genome data of 50 northern red oak individuals
24
Comparison of Sequential and Parallelized Pipelines
              BWA               Intermediate Steps   HaplotypeCaller   Pipeline
Sequential    4 hrs. 04 mins.   5 hrs. 37 mins.      12 days
Parallel      0 hr. 56 mins.    2 hrs. 45 mins.      0 hr. 24 mins.    4 hrs. 05 mins.

Run time of the parallelized BWA–HaplotypeCaller pipeline with optimized data partitioning
25
Performance in Real-Life (summer 2016)
100+ different runs through the workflow
Utilizing 500+ cores with heavy load
Data sets ranging from >1 GB to 50 GB+
26
VectorBase production example
27
VB running blast (before)
Diagram: frontend submits BLAST jobs directly to Condor.
Talk directly to Condor.
Custom Condor submit scripts per database.
One Condor job designated to wait on the rest (idle-wait).
28
VB running blast (now)
Diagram: frontend hands BLAST jobs to Makeflow, which submits them to Condor.
Makeflow manages the workflow and the connections to Condor.
Makeflow files are created on the fly (PHP code, no custom scripts per database).
All Condor slots run computations.
Jobs take 1/3 the time (saves about 18 s in response time).
Similar changes for HMMER and Clustal.
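In effect, the per-database condor_submit scripts are replaced by one generated Makeflow handed to Condor. A minimal sketch of that hand-off (the file name blast.mf is assumed, and the generation is done in PHP in production rather than Python):

```python
import subprocess

# Hand the generated workflow to Makeflow and let it drive Condor;
# "blast.mf" is an assumed name for the on-the-fly Makeflow file.
subprocess.run(["makeflow", "-T", "condor", "blast.mf"], check=True)
```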
29
VB jobs – future?
Move VectorBase jobs to the Work Queue master-worker framework described earlier: support for multiple execution engines (Condor, SLURM, TORQUE, etc.), TCP communication to reach systems outside the shared filesystem, and workers scaled to match the structure of the DAG and the busyness of the overall system.
30
Acknowledgements
Notre Dame Bioinformatics Lab and The Cooperative Computing Lab, University of Notre Dame
NIH/NIAID grant HHSN C and NSF grants SI2-SSE and OCI
Questions?
31
Small Scale Run – Query: 600 MB, Ref: 36 MB
32
Data Transfer – A Hindrance
Workers   Data Transferred (MB)   Transfer Time (s)
2         64266                   594
5         65913                   593
10        67522                   598
20        70350                   623
50        74534                   754
100       80267                   765

Amount and time of data transfer with increasing workers
33
MinHash from 1,000 feet
Example sequences s1 and s2, e.g. ACGTGCGAAATTTCTC and AAGTGCGAAATTACTT
Similarity: SIM(s1, s2) = Intersection / Union
Signatures: SIG(s) = [h1(s), h2(s), ..., hk(s)]
Comparing two sequences then requires only k integer comparisons, where k is constant.
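A toy Python sketch of the idea (the k-mer length, number of hash functions, and hash choice are illustrative, not those used in the scaffolding work):

```python
import hashlib

# MinHash: keep the minimum of each of K_HASHES hash functions over a
# sequence's k-mers; the fraction of matching signature slots estimates the
# Jaccard similarity, so comparing two sequences costs only k integer compares.
K_HASHES = 64
KMER = 8  # k-mer length is an assumption

def kmers(seq, k=KMER):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h(x, seed):
    return int(hashlib.md5(f"{seed}:{x}".encode()).hexdigest(), 16)

def signature(seq):
    return [min(h(km, seed) for km in kmers(seq)) for seed in range(K_HASHES)]

def minhash_sim(sig1, sig2):
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

s1, s2 = "ACGTGCGAAATTTCTC", "AAGTGCGAAATTACTT"
print(minhash_sim(signature(s1), signature(s2)))  # approximates Jaccard(kmers(s1), kmers(s2))
```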
34
Three stages of scaffolding
35
E. coli K12 50 rearrangements
36
E. coli K12 500 rearrangements
37
Application-level Model for Runtime
38
Application-level Model for Memory
39
System-level Model for Runtime
40
System-level Model for Memory
41
Distribution of Regression Coefficients