
1 HPC In The Cloud Case Study: Proteomics Workflow
Jason Stowe

2 Who am I? CEO of Cycle Computing, started in 2005. A leader in open HPC solutions and operational telemetry on desktops, servers, and in the cloud. We provide management and operational-telemetry tools for 30,000-core environments running Condor, Hadoop, SGE, PBS, and others. The company was founded to write software that makes computation management easy at large scale (user job submission, admin automation, reporting, usage visualization, audit, monitoring, and chargeback to multiple schedulers, including Condor/PBS/SGE/HDFS, through one interface). In a prior life I worked in movies at a Disney production: ran 75+ million renders on #84 of the Top 100 using Condor to make "The Wild". Computer scientist by education (CMU/Cornell); worked at PSC and the Theory Center.

3 Running Proteomics Workflow in the Cloud

4 Workflow Summary Two input files: control and data.
Sequence of preprocessing steps, many of them loosely parallel short jobs. Main computation phase: many small jobs, some running up to three hours. Different tools (OMSSA, Tandem) have different advantages, so several run in parallel. Post-processing includes many loosely parallel short jobs and a comparison between the tools.

5 Proteomics Workflow [Workflow diagram: txtextract, msmsfilter, msmsfeatures, partition, makemgf, custom Perl, OMSSA, Tandem, and pepid, with pilot jobs and a parallel file system.]

6 Workflow Characteristics
Challenges: 80+ Perl scripts and 40+ R scripts; complex dependencies between scripts; reliance on a shared file system; large databases. Advantages: well-organized code; few entry points to SGE (qblastem, sweeper); high compute-to-I/O ratio; relatively static databases.

7 Workflow Conversion Process
Phase I: Analyze the existing workflow structure with domain experts. For each job, obtain compute and I/O requirements. Find the entry points into the job scheduler. Make the code location-aware via environment variables. Make use of exit status. Generate the DAG structure (sketched below). Test pieces in isolation. Test the whole workflow in the old and new environments. Phase II: efficiency, robustness, maintainability.
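The slides do not show the generated DAG itself, so here is a minimal hypothetical sketch of what a DAGMan file for this pipeline could look like; the node and submit-file names are assumptions, and the ordering is inferred from the workflow diagram slides.

    # proteomics.dag -- hypothetical node and submit-file names
    JOB  txtextract    txtextract.sub
    JOB  msmsfilter    msmsfilter.sub
    JOB  msmsfeatures  msmsfeatures.sub
    JOB  partition     partition.sub
    JOB  makemgf       makemgf.sub
    JOB  omssa         omssa.sub
    JOB  tandem        tandem.sub
    JOB  pepid         pepid.sub
    # Preprocessing chain, then OMSSA and Tandem in parallel, then pepid.
    PARENT txtextract   CHILD msmsfilter
    PARENT msmsfilter   CHILD msmsfeatures
    PARENT msmsfeatures CHILD partition
    PARENT partition    CHILD makemgf
    PARENT makemgf      CHILD omssa tandem
    PARENT omssa tandem CHILD pepid

Submitting the whole pipeline is then a single command: condor_submit_dag proteomics.dag.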

8 Before [Workflow diagram before conversion: txtextract, msmsfilter, msmsfeatures, partition, makemgf, custom Perl, OMSSA, Tandem, and pepid, with pilot jobs and I/O through a parallel file system.]

9 Phase I Implemented changes to work with Condor in the CycleCloud environment. Minor modifications to the code allow Condor to use exit status to know when to retry jobs (see the sketch below). DAGMan replaces SGE job dependencies and improves robustness. Minor changes permit the workflow to run in CycleCloud. Tested with both SGE and Condor on Brainiac. The code is location-aware and uses Condor when available. Sample run: ~900 core-hours, completed in under 3 hours.
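As a hedged illustration of exit-status-driven retries (the project's actual DAG and exit-code conventions are not shown in the slides), DAGMan can retry a node a fixed number of times unless it exits with a code designated as a permanent failure:

    # Retry the hypothetical omssa node up to 3 times, but give up
    # immediately if it exits with code 2 (treated here as "bad input").
    JOB   omssa  omssa.sub
    RETRY omssa  3  UNLESS-EXIT 2

The job only needs to return meaningful exit codes for this to work, which is the kind of minor code modification described above.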

10 Phase II Improved robustness by splitting scatter/gather steps.
Developed a script for converting SGE job arrays to DAGMan jobs (illustrated below). Improved scalability by submitting large sets of jobs using Condor DAGMan. Improved efficiency by using job runtime prediction for OMSSA and Tandem jobs. Moved retry handling to Condor.
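The conversion script itself is not shown in the slides; as an assumption-laden sketch of its output, an SGE array job such as "qsub -t 1-100 search.sh" can be expanded into DAGMan nodes that share one submit description and differ only in an index variable:

    # generated.dag -- one node per former SGE array task (names hypothetical)
    JOB  search_1  search.sub
    VARS search_1  index="1"
    JOB  search_2  search.sub
    VARS search_2  index="2"
    # ... one JOB/VARS pair per task, up to search_100 ...

The shared search.sub then refers to $(index) in its arguments and output/error file names, so each node reproduces what SGE's $SGE_TASK_ID provided while DAGMan gains per-task dependency and retry control.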

11 Before [Same pre-conversion workflow diagram as slide 8, repeated for comparison with the next slide.]

12 After [Workflow diagram after conversion: txtextract, msmsfilter, msmsfeatures, partition, makemgf, more concurrent OMSSA, multi-threaded Tandem, pilot jobs, and pepid, with I/O via an s3backer file system.]
Changes: Moved retry handling to Condor. Replaced the IBRIX parallel file system with an s3backer read-only file system (the blue I/O bar on the left is thinner). DAGMan wraps each level of the workflow (the squares mark the split scatter/gather steps); a sketch follows below. More OMSSA jobs can run concurrently in the cloud. Multi-threaded Tandem jobs run on dedicated 8-core machines, which is faster. Condor provides job-level robustness (the colored nodes).
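One hedged way to express "DAGMan wraps each level" (the actual layout is not given in the slides) is an outer DAG whose nodes are per-level sub-DAGs, so each scatter/gather level can be retried or resumed as a unit:

    # outer.dag -- hypothetical level names
    # The search level holds the OMSSA and Tandem fan-out.
    SUBDAG EXTERNAL preprocess   preprocess.dag
    SUBDAG EXTERNAL search       search.dag
    SUBDAG EXTERNAL postprocess  postprocess.dag
    PARENT preprocess CHILD search
    PARENT search     CHILD postprocess
    RETRY  search 2

A failure inside search.dag then produces a rescue DAG, and rerunning the outer DAG resumes from that level instead of restarting the whole workflow.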

13 Results Workflows that took 5 hours or ran overnight now complete reliably in 2.5 hours. Workflows are more robust using DAGMan, without intricate Perl scripting for dependency handling. Tools built for DAGMan were adapted to other workflows. The cost of an individual run is now quantifiable: internally, ~900 core-hours at internal rates; on EC2, $130 for the first run and $75 for each additional run; spot instances can reduce those costs to roughly $45 and $25.

14 How does this apply to you?
Memory? Amazon recently released instances with more RAM per core (8+ GB per core). Data size? Many people use tens of TB to petabytes of storage; you still have to provision filers if that is required; AWS Import can help get data into S3 (a single 2 TB device costs about $154.70), and you already do something similar with boat data. Interconnect? As with memory, they may be working on this; oil and gas, engineering, finance, and other fields need it.

15 Special Cases Cloud is uncommon: if you're large enough to have a well-planned refresh cycle without bursts. Cloud is good: if you're smaller and don't want to provision resources before you land a job. Remember, the benefits of cloud resources are that they are available on short notice, with low overhead, at large scale, on a pay-as-you-go model: great for bursts or bursty usage.

