Integrated genome analysis using Makeflow + friends
Scott Emrich, Director of Bioinformatics
Computer Science & Engineering, University of Notre Dame
VectorBase is a Bioinformatics Resource Center
VectorBase is a genome resource for invertebrate vectors of human pathogens
Funded by NIH-NIAID as part of a wider group of NIAID BRCs for biodefense and emerging and re-emerging infectious diseases
3rd contract started Fall 2014 (for up to 2 more years)
VectorBase: relevant species of interest
Assembly required…
Current challenges genome informaticians are focusing on
Refactoring genome mapping tools to use HPC/HTC for speed-up, especially when new, faster algorithms are not yet available
Using "data-intensive" frameworks: MapReduce/Hadoop and Spark
Efficiently harnessing resources from heterogeneous systems
Scalable, elastic workflows with flexible granularity
Accelerating Genomics Workflows in Distributed Environments (Research Update, March 8, 2016)
Olivia Choudhury, Nicholas Hazekamp, Douglas Thain, Scott Emrich
Department of Computer Science and Engineering, University of Notre Dame, IN
Scaling Up Bioinformatic Workflows with Dynamic Job Expansion: A Case Study Using Galaxy and Makeflow Nicholas Hazekamp, Joseph Sarro, Olivia Choudhury, Sandra Gesing, Scott Emrich and Douglas Thain Cooperative Computing Lab: http://ccl.cse.nd.edu University of Notre Dame
Using Makeflow to express the genome variation workflow
Work Queue master-worker framework
Sun Grid Engine (SGE) batch system
Overview of CCL-based solution
We use Work Queue, a master-worker framework for submitting, monitoring, and retrieving tasks.
We support a number of different execution engines such as Condor, SLURM, TORQUE, etc.
TCP communication allows us to utilize systems and resources that are not part of the shared filesystem, opening up the opportunity for a larger number of machines and workers.
Workers can be scaled to better accommodate the structure of the DAG and the busyness of the overall system.
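As a rough illustration of that setup, a minimal Work Queue master in Python might look like the sketch below. It assumes the cctools work_queue Python bindings are installed; the port, file names, and the BWA command line are hypothetical, and binding details can differ between cctools versions.

```python
# Minimal Work Queue master sketch: submit, monitor, and retrieve tasks.
# File names and commands are hypothetical placeholders.
import work_queue as wq

q = wq.WorkQueue(port=9123)        # workers connect to this port over TCP
print("Work Queue master listening on port", q.port)

for i in range(4):                 # one alignment task per read chunk
    cmd = "bwa mem ref.fa chunk_{0}.fq > chunk_{0}.sam".format(i)
    t = wq.Task(cmd)
    t.specify_input_file("ref.fa", cache=True)      # reused input, cached on workers
    t.specify_input_file("chunk_{0}.fq".format(i))
    t.specify_output_file("chunk_{0}.sam".format(i))
    q.submit(t)

while not q.empty():               # retrieve completed tasks as they return
    t = q.wait(5)
    if t:
        print("task", t.id, "finished with status", t.return_status)
```

Because inputs and outputs are shipped over the same TCP connection, the workers do not need to share a filesystem with the master; they can be started on Condor, SLURM, or TORQUE nodes or on any machine that can reach the master's port.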
Realized concurrency (in practice)
Related Work (HPC/HTC; not exhaustive!)
Jarvis et al.: performance models efficiently manage workloads on clouds
Ibrahim et al., Grossman: balance the number of resources against the duration of their usage
Grandl et al., Ranganathan et al., Buyya et al.: scheduling techniques reduce resource utilization
Hadoop, Dryad, and CIEL support data-intensive workloads
Discussion points: how to write up and discuss related work? Why are we not doing scheduling?
Observations Multi-level concurrency is not high with current bioinformatics tools
Observations
Task-level parallelism can get worse as each task is given more cores
Balancing multi-level concurrency and task-level parallelism is easy with Work Queue
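Extending the sketch above, that balance can be expressed directly: each task declares how many cores it will use, and Work Queue packs tasks onto workers accordingly. The thread flag and the values of N and K below are illustrative only.

```python
# Balance multi-level concurrency (threads per task, N) against
# task-level parallelism (number of tasks, K). Values are illustrative.
import work_queue as wq

N_CORES = 4                         # threads each aligner task uses internally
K_TASKS = 90                        # number of data partitions / tasks

q = wq.WorkQueue(port=9123)
for i in range(K_TASKS):
    cmd = "bwa mem -t {n} ref.fa part_{i}.fq > part_{i}.sam".format(n=N_CORES, i=i)
    t = wq.Task(cmd)
    t.specify_cores(N_CORES)        # scheduler uses the declared cores when packing workers
    t.specify_input_file("ref.fa", cache=True)
    t.specify_input_file("part_{0}.fq".format(i))
    t.specify_output_file("part_{0}.sam".format(i))
    q.submit(t)
```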
Results – Predictive Capability for three tools Avg. MAPE = 3.1
Results – Cost Optimization

# Cores/Task | # Tasks | Predicted Time (min) | Speedup | Estimated EC2 Cost ($) | Estimated Azure Cost ($)
1            | 360     | 70                   | 6.6     | 50.4                   | 64.8
2            | 180     | 38                   | 12.3    | 25.2                   | 32.4
4            | 90      | 24                   | 19.5    | 18.9                   |
8            | 45      | 27                   | 17.3    |                        |
Galaxy
Popular with biologists and bioinformaticians
Emphasis on reproducibility
Varying levels of difficulty, but it mostly boils down to this: once a tool is installed, it has turn-key execution (if everything is defined properly, it runs)
Provides an interface for chaining tools into workflows, and for storing and sharing them
Workflows in Galaxy
Introduction to a short Galaxy workflow
To the user, each tool is a black box; they don't have to know what is happening in the back end
Turn-key execution: the user doesn't see any of this interaction, just tool execution success or failure
Definition of a Galaxy job
User-System Interaction
Workflow Dynamically Expanded behind Galaxy
The user needs to know nothing of the specific execution; the complexities and verification are hidden behind the Galaxy façade
As computational needs increase, so too do the resources needed and the ways we interact with them
A programmer with a better grasp of the workings of the software can determine a safe means of decomposition that can then be harnessed by many different scientists
New User-System Interaction
Results – Optimal Configuration: for the given dataset, K* = 90 tasks and N* = 4 cores per task
Best Data Partitioning Approaches
Granularity-based partitioning for parallelized BWA
Alignment-based partitioning for parallelized HaplotypeCaller
(Diagram stages: Split Ref, Split SAM, SAMBAM, ReadGroups)
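As a minimal sketch of the granularity-based partitioning step, assuming a single FASTQ input split into a fixed number of read chunks before alignment (the file names and chunk count are hypothetical):

```python
# Split a FASTQ file into K chunks so each chunk can be aligned independently.
# A FASTQ record is 4 lines; file names and K are hypothetical placeholders.
K = 90

def split_fastq(path, k):
    """Distribute reads from `path` round-robin across k chunk files."""
    outs = [open("part_{0}.fq".format(i), "w") for i in range(k)]
    with open(path) as fq:
        record, n = [], 0
        for line in fq:
            record.append(line)
            if len(record) == 4:            # one complete FASTQ record
                outs[n % k].writelines(record)
                record, n = [], n + 1
    for out in outs:
        out.close()

split_fastq("reads.fq", K)
```

Each resulting part file can then be aligned as an independent BWA task; the alignment-based partitioning for HaplotypeCaller instead operates on the aligned output (the Split Ref / Split SAM / ReadGroups stages in the diagram).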
Full Scale Run
61.5X speedup (Galaxy)
Test tools: BWA and GATK's HaplotypeCaller
Test data: 100-fold coverage Illumina HiSeq single-end genome data of 50 northern red oak individuals
(Plot: run time, HH:MM)
Comparison of Sequential and Parallelized Pipelines

           | BWA             | Intermediate Steps | HaplotypeCaller | Pipeline
Sequential | 4 hrs. 04 mins. | 5 hrs. 37 mins.    | 12 days         |
Parallel   | 0 hr. 56 mins.  | 2 hrs. 45 mins.    | 0 hr. 24 mins.  | 4 hrs. 05 mins.

Run time of parallelized BWA-HaplotypeCaller pipeline with optimized data partitioning
Performance in Real Life (summer 2016)
100+ different runs through the workflow
Utilizing 500+ cores under heavy load
Data sets ranging from >1 GB to 50 GB+
VectorBase production example
VB running BLAST (before)
The frontend talks directly to Condor
Custom Condor submit scripts per database
One Condor job is designated to wait on the rest (idle-wait)
(Diagram: frontend submitting BLAST Condor jobs)
VB running BLAST (now)
Makeflow manages the workflow and the connections to Condor
Makeflow files are created on the fly (PHP code, no custom scripts per database)
All Condor slots run computations
Jobs take 1/3 the time (saves about 18 s in response time)
Similar changes for HMMER and Clustal
(Diagram: frontend submitting BLAST jobs to Condor via Makeflow)
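As a rough Python analogue of that approach (VectorBase does this in PHP; the database, query, and file names below are hypothetical), a Makeflow file can be generated on the fly and handed to the Condor batch driver:

```python
# Generate a small Makeflow for a split BLAST search on the fly, then run it
# through Condor. Names are hypothetical; "db" stands in for a formatted
# BLAST database. The real VectorBase code does the equivalent in PHP.
import subprocess

CHUNKS = 3
rules = []
for i in range(CHUNKS):
    rules.append(
        "hits_{i}.txt: query_{i}.fa\n"
        "\tblastn -query query_{i}.fa -db db -out hits_{i}.txt\n".format(i=i)
    )

# Final rule concatenates the per-chunk results.
parts = " ".join("hits_{0}.txt".format(i) for i in range(CHUNKS))
rules.append("all_hits.txt: {0}\n\tcat {0} > all_hits.txt\n".format(parts))

with open("blast.mf", "w") as mf:
    mf.write("\n".join(rules))

# Makeflow dispatches each rule as a Condor job (-T selects the batch system).
subprocess.check_call(["makeflow", "-T", "condor", "blast.mf"])
```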
VB jobs - future?
Possible direction: run VB jobs through Work Queue, the master-worker framework described earlier (submitting, monitoring, and retrieving tasks over TCP; execution engines such as Condor, SLURM, and TORQUE; workers outside the shared filesystem, scaled to the DAG structure and the busyness of the overall system).
Acknowledgements
Notre Dame Bioinformatics Lab (http://www3.nd.edu/~semrich/) and The Cooperative Computing Lab (http://www3.nd.edu/~ccl/), University of Notre Dame
NIH/NIAID grant HHSN272200900039C and NSF grants SI2-SSE-1148330 and OCI-1148330
Questions?
Small Scale Run Query: 600MB Ref: 36MB
Data Transfer – A Hindrance

Workers | Data Transferred (MB) | Transfer Time (s)
2       | 64266                 | 594
5       | 65913                 | 593
10      | 67522                 | 598
20      | 70350                 | 623
50      | 74534                 | 754
100     | 80267                 | 765

Amount and time of data transfer with increasing workers
MinHash from 1,000 feet
Similarity: SIM(s1, s2) = |intersection| / |union| of the sets derived from sequences s1 and s2 (e.g., s1 = ACGTGCGAAATTTCTC, s2 = AAGTGCGAAATTACTT)
Signatures: SIG(s) = [h1(s), h2(s), ..., hk(s)]
Comparing two sequences then requires only k integer comparisons, where k is constant
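A minimal sketch of the idea, assuming sequences are reduced to k-mer sets and hashed with a simple salted hash (the k-mer length, number of hash functions, and hashing scheme are illustrative):

```python
# MinHash sketch: estimate the similarity of two sequences from fixed-size
# signatures. Parameters below are illustrative only.
import hashlib

KMER = 4          # k-mer length used to turn a sequence into a set
NUM_HASHES = 64   # signature length k (number of hash functions)

def kmers(seq, k=KMER):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h(x, salt):
    """One member of a family of hash functions, distinguished by `salt`."""
    return int(hashlib.sha1(("%d:%s" % (salt, x)).encode()).hexdigest(), 16)

def signature(seq):
    ks = kmers(seq)
    return [min(h(x, salt) for x in ks) for salt in range(NUM_HASHES)]

def estimated_similarity(sig1, sig2):
    # Fraction of positions where the minima agree estimates SIM(s1, s2).
    return sum(a == b for a, b in zip(sig1, sig2)) / float(len(sig1))

s1, s2 = "ACGTGCGAAATTTCTC", "AAGTGCGAAATTACTT"
print(estimated_similarity(signature(s1), signature(s2)))
```

Once signatures are precomputed, comparing any two sequences costs only NUM_HASHES integer comparisons, regardless of sequence length.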
Three stages of scaffolding
E. coli K12 50 rearrangements
E. coli K12 500 rearrangements
Application-level Model for Runtime
Application-level Model for Memory
System-level Model for Runtime
System-level Model for Memory
Distribution of Regression Coefficients