HPC In The Cloud Case Study: Proteomics Workflow

HPC In The Cloud Case Study: Proteomics Workflow
Jason Stowe
jstowe@cyclecomputing.com
http://twitter.com/jasonastowe
http://twitter.com/cyclecomputing

Who am I?
- CEO of Cycle Computing, started in 2005: a leader in open HPC solutions and operational telemetry on desktops, servers, and in the cloud.
- We provide management and operational telemetry tools for 30,000-core environments running Condor, Hadoop, SGE, PBS, and others.
- Founded to write software for easy computation management that scales to large environments: user job submission, admin automation, reporting, usage visualization, audit, monitoring, and chargeback across multiple schedulers (Condor/PBS/SGE/HDFS) through one interface.
- In a prior life I worked in movies at a Disney production: ran 75+ million renders on #84 of the Top 100 using Condor to make "The Wild".
- Computer scientist by education (CMU/Cornell); worked at PSC and the Theory Center.

Running a Proteomics Workflow in the Cloud

Workflow Summary
- Two input files: control and data.
- A sequence of preprocessing steps: many loosely parallel short jobs.
- Main computation phase: many small jobs, some running up to three hours. Different tools (OMSSA, Tandem) have different advantages, so several run in parallel.
- Post-processing: many loosely parallel short jobs, plus a comparison between tools.

Proteomics Workflow [diagram]: custom Perl preprocessing (txtextract -> msmsfilter -> msmsfeatures -> partition -> makemgf), then parallel OMSSA and Tandem pilot jobs reading from a parallel file system, converging on pepid.

Workflow Characteristics
Challenges:
- 80+ Perl scripts, 40+ R scripts
- Complex dependencies between scripts
- Reliance on a shared file system
- Large databases
Advantages:
- Well-organized code
- Few entry points into SGE (qblastem, sweeper)
- High compute-to-I/O ratio
- Relatively static databases

Workflow Conversion Process
Phase I:
- Analyze the existing workflow structure with domain experts; for each job, obtain compute and I/O requirements.
- Find the entry points into the job scheduler.
- Make the code location-aware via environment variables, and have it report failure through exit status (see the sketch below).
- Generate the DAG structure.
- Test the pieces in isolation, then test the whole workflow in both the old and new environments.
Phase II: efficiency, robustness, maintainability.
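
A minimal sketch of what "location-aware via ENV" plus "make use of exit status" can look like inside one of the Perl entry points. The CYCLECLOUD environment variable, both database paths, and the run_omssa.sh wrapper are hypothetical names for illustration, not from the original workflow:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical: choose the database location based on where we run.
    # In the cloud the static databases sit on a local read-only mount;
    # on the internal cluster they sit on the shared filer.
    my $db_root = defined $ENV{CYCLECLOUD}
        ? '/mnt/s3backer/databases'     # assumed cloud mount point
        : '/shared/ibrix/databases';    # assumed internal filer path

    # Hypothetical wrapper around the actual OMSSA invocation.
    my $rc = system('./run_omssa.sh', "$db_root/proteins.fasta", $ARGV[0]);

    # Exit nonzero on failure so Condor/DAGMan knows to retry this node.
    exit($rc == 0 ? 0 : 1);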

Before [diagram]: custom Perl steps (txtextract -> msmsfilter -> msmsfeatures -> partition -> makemgf) and the OMSSA/Tandem pilot jobs all depend on the parallel file system, converging on pepid.

Phase I
- Implemented changes to work with Condor in the CycleCloud environment.
- Minor modifications to the code let Condor use exit status to know when to retry jobs.
- DAGMan replaces SGE job dependencies and improves robustness (a minimal example below).
- Minor changes permit the workflow to run in CycleCloud.
- Tested SGE and Condor on Brainiac; the code is location-aware and uses Condor when available.
- Sample run: ~900 core-hours, completed in under 3 hours.
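
The deck doesn't show the actual DAG, but a minimal sketch of how DAGMan expresses these dependencies and retries could look like this, assuming one submit file per preprocessing step (the .sub file names are hypothetical):

    # workflow.dag -- sketch of the preprocessing chain from the diagram.
    JOB txtextract    txtextract.sub
    JOB msmsfilter    msmsfilter.sub
    JOB msmsfeatures  msmsfeatures.sub
    JOB partition     partition.sub
    JOB makemgf       makemgf.sub

    PARENT txtextract   CHILD msmsfilter
    PARENT msmsfilter   CHILD msmsfeatures
    PARENT msmsfeatures CHILD partition
    PARENT partition    CHILD makemgf

    # Retry failed nodes; this relies on jobs exiting nonzero on failure.
    RETRY txtextract   3
    RETRY msmsfilter   3
    RETRY msmsfeatures 3
    RETRY partition    3
    RETRY makemgf      3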

Phase II
- Improved robustness by splitting scatter/gather steps.
- Developed a script for converting SGE job arrays into DAGMan jobs (sketched below).
- Improved scalability by submitting large sets of jobs through Condor DAGMan.
- Improved efficiency by using job runtime prediction for OMSSA and Tandem jobs.
- Moved retry handling to Condor.
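
Lilly's actual conversion script isn't shown in the deck, but one way such a converter can work, as a sketch: emit one DAGMan node per array task, passing the task index through VARS the way $SGE_TASK_ID is passed under qsub -t. The script and file names are hypothetical:

    #!/usr/bin/perl
    # sge_array_to_dag.pl -- sketch: turn "qsub -t 1-N task.sh" into a DAG.
    use strict;
    use warnings;

    my ($ntasks, $subfile) = @ARGV;   # e.g. 500 omssa.sub
    die "usage: $0 <ntasks> <submit-file>\n" unless $ntasks && $subfile;

    for my $i (1 .. $ntasks) {
        print "JOB task$i $subfile\n";
        # The submit file reads task_id the way SGE jobs read \$SGE_TASK_ID.
        print "VARS task$i task_id=\"$i\"\n";
        print "RETRY task$i 3\n";
    }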

After [diagram]: txtextract -> msmsfilter -> msmsfeatures -> partition -> makemgf, with I/O via a read-only file system, more concurrent OMSSA jobs, and multi-threaded Tandem pilot jobs, converging on pepid.
Changes:
- Moved retry handling to Condor.
- IBRIX -> s3backer read-only file system (the blue I/O bar on the left of the diagram is thinner).
- DAGMan wraps each level (the squares mark the split scatter/gather steps).
- Can run more OMSSA jobs concurrently in the cloud.
- Can run multi-threaded Tandem jobs on dedicated 8-core machines (faster); see the submit-file sketch below.
- Condor helps with job-level robustness (the colored nodes in the diagram).
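
A sketch of what requesting a dedicated 8-core machine for a multi-threaded Tandem job can look like in a Condor submit file. The executable and input names are illustrative, and request_cpus assumes partitionable slots; at the time this may instead have been done with whole-machine slot configuration:

    # tandem.sub -- sketch; file names are assumptions, not from the deck.
    universe     = vanilla
    executable   = tandem.exe
    arguments    = input.xml
    request_cpus = 8
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue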

Results
- Workflows that took 5 hours or ran overnight now run reliably in 2.5 hours.
- Workflows are more robust under DAGMan, with no intricate Perl scripting for dependency handling.
- The DAGMan tooling was adapted to other workflows at Lilly.
- The cost of an individual run is now quantifiable:
  - Internal: ~900 core-hours * internal costs
  - On EC2: first run $130, each additional run $75
  - Spot instances can reduce those costs to $45 / $25
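
Back-of-the-envelope on those numbers: $75 for a ~900 core-hour run implies roughly $0.08 per core-hour once the data is staged. A sketch, with the per-core-hour rates derived from the slide rather than quoted EC2 prices:

    #!/usr/bin/perl
    # Rough cost model implied by the slide's numbers (assumed, not quoted rates).
    use strict;
    use warnings;

    my $core_hours = 900;
    my $on_demand  = 75 / $core_hours;   # ~$0.083/core-hour after first-run staging
    my $spot       = 25 / $core_hours;   # ~$0.028/core-hour on spot instances

    printf "on-demand: \$%.3f/core-hour, spot: \$%.3f/core-hour\n",
           $on_demand, $spot;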

How does this apply to you?
- Memory? Amazon recently released instances with more RAM per core (8+ GB per core).
- Data size? Many people use tens of TB to petabytes of storage. You still have to provision filers, if those are required. AWS Import can help get data into S3 (a single 2 TB device costs about $154.70); you likely already do something similar with physically shipped ("boat") data.
- Interconnect? As with memory amounts, maybe they're working on this; oil/gas, engineering, finance, etc. need it.

Special Cases
- Cloud is uncommon: if you're large enough to have a well-planned refresh cycle without bursts.
- Cloud is good: if you're smaller and don't want to provision resources before you land a job.
- Remember: the benefits of cloud resources are that they're available on short notice, with low overhead, at large scale, on a pay-as-you-go model. That makes them great for bursts and bursty usage.