Globus.org/genomics Optimizations in running large-scale Genomics workloads in Globus Genomics using HTCondor Ravi Madduri Joint work with.

Slides:



Advertisements
Similar presentations
Hello i am so and so, title/role and a little background on myself (i.e. former microsoft employee or anything interesting) set context for what going.
Advertisements

DTI Image Processing Pipeline and Cloud Computing Environment Kyle Chard Computation Institute University of Chicago.
SLA-Oriented Resource Provisioning for Cloud Computing
University of Notre Dame
System Center 2012 R2 Overview
ASCR Data Science Centers Infrastructure Demonstration S. Canon, N. Desai, M. Ernst, K. Kleese-Van Dam, G. Shipman, B. Tierney.
Locality-Aware Dynamic VM Reconfiguration on MapReduce Clouds Jongse Park, Daewoo Lee, Bokyeong Kim, Jaehyuk Huh, Seungryoul Maeng.
Cloud Computing Brandon Hixon Jonathan Moore. Cloud Computing Brandon Hixon What is Cloud Computing? How does it work? Jonathan Moore What are the key.
Infrastructure as a Service (IaaS) Amazon EC2
Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS Ravi K Madduri University of Chicago and ANL.
Steve Litster PhD Global Head of Scientific Computing.
HTCondor at Cycle Computing: Better Answers. Faster. Ben Cotton Senior Support Engineer.
Nikolay Tomitov Technical Trainer SoftAcad.bg.  What are Amazon Web services (AWS) ?  What’s cool when developing with AWS ?  Architecture of AWS 
Cloud Computing (101).
SaaS, PaaS & TaaS By: Raza Usmani
M.A.Doman Model for enabling the delivery of computing as a SERVICE.
SOFTWARE AS A SERVICE PLATFORM AS A SERVICE INFRASTRUCTURE AS A SERVICE.
EA and IT Infrastructure - 1© Minder Chen, Stages in IT Infrastructure Evolution Mainframe/Mini Computers Personal Computer Client/Sever Computing.
Building Data-intensive Pipelines Ravi K Madduri Argonne National Lab University of Chicago.
Cloud Computing Why is it called the cloud?.
DRAW+SneakPeek: Analysis Workflow and Quality Metric Management for DNA-Seq Experiments O. Valladares 1,2, C.-F. Lin 1,2, D. M. Childress 1,2, E. Klevak.
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
PhD course - Milan, March /09/ Some additional words about cloud computing Lionel Brunie National Institute of Applied Science (INSA) LIRIS.
CLOUD COMPUTING  IT is a service provider which provides information.  IT allows the employees to work remotely  IT is a on demand network access.
M.A.Doman Short video intro Model for enabling the delivery of computing as a SERVICE.
Globus Genomics – Science as a Service for large scale NGS analysis
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
Ian Alderman A Little History…
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Tekin Bicer Gagan Agrawal 1.
Large Scale Sky Computing Applications with Nimbus Pierre Riteau Université de Rennes 1, IRISA INRIA Rennes – Bretagne Atlantique Rennes, France
A Framework for Elastic Execution of Existing MPI Programs Aarthi Raveendran Graduate Student Department Of CSE 1.
CSIU Submission of BLAST jobs via the Galaxy Interface Rob Quick Open Science Grid – Operations Area Coordinator Indiana University.
NML Bioinformatics Service— Licensed Bioinformatics Tools High-throughput Data Analysis Literature Study Data Mining Functional Genomics Analysis Vector.
Grid and Cloud Computing Globus Provision Dr. Guy Tel-Zur.
The CRI compute cluster CRUK Cambridge Research Institute.
Campus grids: e-Infrastructure within a University Mike Mineter National e-Science Centre 14 February 2006.
VMware vSphere Configuration and Management v6
Chad Collins CEO Henry Chan CTO In Latin, nubifer means “bringing the clouds”
Timeshared Parallel Machines Need resource management Need resource management Shrink and expand individual jobs to available sets of processors Shrink.
Aneka Cloud ApplicationPlatform. Introduction Aneka consists of a scalable cloud middleware that can be deployed on top of heterogeneous computing resources.
Bio-IT World Conference and Expo ‘12, April 25, 2012 A Nation-Wide Area Networked File System for Very Large Scientific Data William K. Barnett, Ph.D.
CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware A Cloud Computing Methodology Study of.
Web Technologies Lecture 13 Introduction to cloud computing.
Galaxy Community Conference July 27, 2012 The National Center for Genome Analysis Support and Galaxy William K. Barnett, Ph.D. (Director) Richard LeDuc,
Globus.org/genomics Globus Galaxies Science Gateways as a Service Ravi K Madduri, University of Chicago and Argonne National Laboratory
Virtual Cluster Computing in IHEPCloud Haibo Li, Yaodong Cheng, Jingyan Shi, Tao Cui Computer Center, IHEP HEPIX Spring 2016.
INTRODUCTION TO GRID & CLOUD COMPUTING U. Jhashuva 1 Asst. Professor Dept. of CSE.
© 2012 Eucalyptus Systems, Inc. Cloud Computing Introduction Eucalyptus Education Services 2.
CSE 5810 Biomedical Informatics and Cloud Computing Zhitong Fei Computer Science & Engineering Department The University of Connecticut CSE5810: Introduction.
© 2015 MetricStream, Inc. All Rights Reserved. AWS server provisioning © 2015 MetricStream, Inc. All Rights Reserved. By, Srikanth K & Rohit.
The Future of Whole Human Genome Data Management and Analysis, Available on the Microsoft Azure Platform Today MICROSOFT AZURE APP BUILDER PROFILE: SPIRAL.
Welcome To We have registered over 5,000 domain names and host over 1,500 cloud servers for individuals and organizations, Our fast and reliable.
Cloud Computing Service Architectures V. Arun College of Computer Science University of Massachusetts Amherst 1.
ChinaGrid: National Education and Research Infrastructure Hai Jin Huazhong University of Science and Technology
Introduction To Cloud Computing By Diptee Chikmurge And Minakshi Vharkate Asst.Professor MIT AOE Alandi(D),Pune.
Agenda  What is Cloud Computing?  Milestone of Cloud Computing  Common Attributes of Cloud Computing  Cloud Service Layers  Cloud Implementation.
Univa Grid Engine Makes Work Management Automatic and Efficient, Accelerates Deployment of Cloud Services with Power of Microsoft Azure MICROSOFT AZURE.
By: Raza Usmani SaaS, PaaS & TaaS By: Raza Usmani
AWS Integration in Distributed Computing
Elastic Computing Resource Management Based on HTCondor
University of Chicago and ANL
Brad Sutton Assoc Prof, Bioengineering Department
AWS Batch Overview A highly-efficient, dynamically-scaled, batch computing service May 2017.
AWS COURSE DEMO BY PROFESSIONAL-GURU. Amazon History Ladder & Offering.
The Scheduling Strategy and Experience of IHEP HTCondor Cluster
Cloud Computing Dr. Sharad Saxena.
Basic Grid Projects – Condor (Part I)
AWS Cloud Computing Masaki.
Setting up PostgreSQL for Production in AWS
The Database World of Azure
Presentation transcript:

globus.org/genomics Optimizations in running large-scale Genomics workloads in Globus Genomics using HTCondor Ravi Madduri Joint work with Paul Davé, Lukasz Lacinski, Alex Rodriguez, Dinanath Sulakhe, Ryan Chard and Ian Foster

globus.org/genomics Globus Genomics is developed, operated, and supported by researchers, developers, and bioinformaticians at the Computation Institute – University of Chicago/Argonne National Lab We are a non-profit organization building solutions for non- profit researchers Our goal is to support the advancement of science by bringing together our strengths and capabilities to help meet the unique needs of researchers and research institutions Who We Are

globus.org/genomics 90% of cancer patients carry a mutation that may be responsive to a known drug Mark Rubin, Weill Cornell Medical College and NewYork-Presbyterian Hospital in New York in Nature, April, 2015

Trying to find a single causative gene for diseases with a complex genetic background is like looking for the proverbial needle in a haystack – Nancy Cox (Vanderbilt)

globus.org/genomics How do we accelerate discovery without requiring that every lab acquire a haystack-sorting machine? Clayton & Shuttleworth thresher, 1910: Museum Victoria, Australia

globus.org/genomics Our answer: Globus Genomics Sequencing Centers Public Data Public Data Storage Local Cluster/ Cloud Seq Center Research Lab Globus provides for High-performance Fault-tolerant Secure file transfer between all data-endpoints Data management Data analysis Picard GATK Fastq Ref Genome Alignment Variant Calling Galaxy Data Libraries Globus Genomics on Amazon EC2 Analytical tools are automatically run on the scalable compute resources when possible Globus integrated within Galaxy Web-based UI Drag-Drop workflow creations Easily modify workflows with new tools Galaxy-based workflow management FTP, SCP, others FTP, SCP SCP Globus Genomics FTP, SCP, HTTP

globus.org/genomics Our Science Stack Galaxy –Interactive execution –Creation, Execution, Sharing, Discovering Workflows Globus –Data management –Identity Management AWS –HTCondor, Chef, EC2, EBS, S3, SNS –Spot, Route 53, Cloud Formation SaaS PaaS IaaS

globus.org/genomics Key Technical Bits HTCondor Computational Profiles for various analysis tools Elastic Spot instance provisioner Chef Nagios + Munin Support

globus.org/genomics Challenges/Opportunities Optimize for cost, performance and both Deliver on SLAs AWS has an awesome spot market with low prices for compute AWS billing is in hourly increments It takes roughly 8mins on an average to ask for a spot instance and have it join Global view of workloads across multiple customers

globus.org/genomics Globus Galaxies Each GG gateway operates a persistent head node, exposing the Galaxy service The head node operates an HTCondor pool and periodically executes a resource provisioning application The provisioner monitors the HTCondor queue for idle jobs The provisioner processes idle job ClassAds and monitors current EC2 spot market prices to cost-effectively acquiring resources

globus.org/genomics Globus Galaxies and HTCondor We have created execution profiles for frequently run genomics applications, mapping CPU, memory, and IO requirements to EC2 resource types Resources are dynamically contextualized to meet the requirements of GG workloads Each worker resource is configured as a single HTCondor slot, utilizing all of the resources advertised cores Once configured, the worker resource joins the pool operated by the gateway’s head node

globus.org/genomics HTCondor Resource Migration To improve utilization and reduce cost we want to apply idle resources to another GG gateway’s jobs, moving the resources between HTCondor pools Centralize provisioning and resource monitoring Employ a hierarchy of HTCondor collectors, having each gateway’s HTCondor service report to additional collectors –Enables logical separation between Globus Galaxies services (Genomics, Materials science, etc.) –Global queue data can be processed Opt to move an idle worker resource between HTCondor masters

globus.org/genomics HTCondor Resource Migration Migrating resources: –Advertise and set attributes stating a resource is leaving a pool –Execute condor_off –peaceful –Reconfigure the resource Modify slot definition to meet requirements Modify host address to new gateway –Restart the condor service

globus.org/genomics Considerations Data transfer rates and limits across regions, and even between zones, are significant Facilitates the inclusion of gateway priority Enables a new billing methodology, distinct of AWS

globus.org/genomics 134 samples and 4 workflows 4 TB data 2200 core hours in 6 days Cox lab, UChicago

globus.org/genomics Olopade lab, UChicago A profile of inherited predisposition to breast cancer among Nigerian women Y. Zheng, T. Walsh, F. Yoshimatsu, M. Lee, S. Gulsuner, S. Casadei, A. Rodriguez, T. Ogundiran, C. Babalola, O. Ojengbede, D. Sighoko, R. Madduri, M.-C. King, O. Olopade 200 targeted exomes 200 GB data 76,920 core hours in 1.25 days

globus.org/genomics Innovation Center for Biomedical Informatics - Georgetown A case study for high throughput analysis of NGS data for translational research using Globus Genomics D. Sulakhe, A. Rodriguez, K. Bhuvaneshwar, Y. Gusev, R. Madduri, L. Lacinski, U. Dave, I. Foster, S. Madhavan 78 exomes from lung cancer study 2 TB data 125,936 core hours in 1.7 days

globus.org/genomics Other Globus Genomics users Dobyns Lab Cox Lab Volchenboum Lab Olopade Lab Nagarajan Lab

globus.org/genomics Pricing includes Estimated compute Storage (one month) Globus Genomics platform usage Support Costs are remarkably low

globus.org/genomics Globus Genomics – Making it routine to find needles in NGS haystacks

globus.org/genomics More information on Globus Genomics and to sign up for a free trial : More information on Globus:

globus.org/genomics Our work is supported by: U.S. DEPARTMENT OF ENERGY 22

globus.org/genomics Thank