Getting the Most out of HTC with Workflows Friday Christina Koch Research Computing Facilitator University of Wisconsin.

Slides:



Advertisements
Similar presentations
Jaime Frey Computer Sciences Department University of Wisconsin-Madison OGF 19 Condor Software Forum Routing.
Advertisements

Operating System.
Workflows: from Development to Automated Production Thursday morning, 10:00 am Lauren Michael Research Computing Facilitator University of Wisconsin -
Sun Grid Engine Grid Computing Assignment – Fall 2005 James Ruff Senior Department of Mathematics and Computer Science Western Carolina University.
Efficiently Sharing Common Data HTCondor Week 2015 Zach Miller Center for High Throughput Computing Department of Computer Sciences.
Basic Unix Dr Tim Cutts Team Leader Systems Support Group Infrastructure Management Team.
MultiJob PanDA Pilot Oleynik Danila 28/05/2015. Overview Initial PanDA pilot concept & HPC Motivation PanDA Pilot workflow at nutshell MultiJob Pilot.
Intermediate HTCondor: Workflows Monday pm Greg Thain Center For High Throughput Computing University of Wisconsin-Madison.
Communicating with Users about HTCondor and High Throughput Computing Lauren Michael, Research Computing Facilitator HTCondor Week 2015.
Distributed Systems Early Examples. Projects NOW – a Network Of Workstations University of California, Berkely Terminated about 1997 after demonstrating.
Building a Real Workflow Thursday morning, 9:00 am Lauren Michael Research Computing Facilitator University of Wisconsin - Madison.
 Introduction to Operating System Introduction to Operating System  Types Of An Operating System Types Of An Operating System  Single User Single User.
Sun Grid Engine. Grids Grids are collections of resources made available to customers. Compute grids make cycles available to customers from an access.
ISG We build general capability Introduction to Olympus Shawn T. Brown, PhD ISG MISSION 2.0 Lead Director of Public Health Applications Pittsburgh Supercomputing.
Step By Step Windows Server 2003 Installation Guide Step By Step Windows Server 2003 Installation Guide.
Workflows: from Development to Production Thursday morning, 10:00 am Lauren Michael Research Computing Facilitator University of Wisconsin - Madison.
Building a Real Workflow Thursday morning, 9:00 am Greg Thain University of Wisconsin - Madison.
Turning science problems into HTC jobs Wednesday, July 29, 2011 Zach Miller Condor Team University of Wisconsin-Madison.
Parallel Algorithms & Distributed Computing Matt Stimmel Matt White.
Building a Real Workflow Thursday morning, 9:00 am Lauren Michael Research Computing Facilitator University of Wisconsin - Madison.
Intermediate Condor: Workflows Rob Quick Open Science Grid Indiana University.
Intermediate Condor: Workflows Monday, 1:15pm Alain Roy OSG Software Coordinator University of Wisconsin-Madison.
Sep 13, 2006 Scientific Computing 1 Managing Scientific Computing Projects Erik Deumens QTP and HPC Center.
ISG We build general capability Introduction to Olympus Shawn T. Brown, PhD ISG MISSION 2.0 Lead Director of Public Health Applications Pittsburgh Supercomputing.
Advanced Taverna Aleksandra Pawlik University of Manchester materials by Katy Wolstencroft, Aleksandra Pawlik, Alan Williams
Canadian Bioinformatics Workshops
Grid Computing: An Overview and Tutorial Kenny Daily BIT Presentation 22/09/2016.
How to get the needed computing Tuesday afternoon, 1:30pm Igor Sfiligoi Leader of the OSG Glidein Factory Operations University of California San Diego.
Introduction to High Throughput Computing and HTCondor Monday AM, Lecture 1 Ian Ross Center for High Throughput Computing University of Wisconsin-Madison.
Threads prepared and instructed by Shmuel Wimer Eng. Faculty, Bar-Ilan University 1July 2016Processes.
Turning science problems into HTC jobs Tuesday, Dec 7 th 2pm Zach Miller Condor Team University of Wisconsin-Madison.
Christina Koch Research Computing Facilitators
Lesson 9: SOFTWARE ICT Fundamentals 2nd Semester SY
INTRODUCTION TO WEB HOSTING
Large Output and Shared File Systems
Memory Management.
Jonathan Walpole Computer Science Portland State University
Licenses and Interpreted Languages for DHTC Thursday morning, 10:45 am
Operating System.
Example: Rapid Atmospheric Modeling System, ColoState U
Lecture 16: Data Storage Wednesday, November 6, 2006.
Advanced QlikView Performance Tuning Techniques
MCTS Guide to Microsoft Windows 7
Chapter 2: System Structures
Thursday AM, Lecture 2 Lauren Michael CHTC, UW-Madison
Andy Wang CIS 5930 Computer Systems Performance Analysis
Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, Scott Shenker
Introduction to client/server architecture
Database Performance Tuning and Query Optimization
Chapter 1: Introduction
湖南大学-信息科学与工程学院-计算机与科学系
Haiyan Meng and Douglas Thain
Database Management Systems (CS 564)
Objective of This Course
Lecture 7: Index Construction
Introduction to High Throughput Computing and HTCondor
Tonga Institute of Higher Education IT 141: Information Systems
Sun Grid Engine.
Chapter 11 Database Performance Tuning and Query Optimization
Tonga Institute of Higher Education IT 141: Information Systems
External Sorting.
Overview of Workflows: Why Use Them?
LO2 – Understand Computer Software
Tonga Institute of Higher Education IT 141: Information Systems
rvGAHP – Push-Based Job Submission Using Reverse SSH Connections
Tonga Institute of Higher Education IT 141: Information Systems
PU. Setting up parallel universe in your pool and when (not
Thursday AM, Lecture 1 Lauren Michael
Thursday AM, Lecture 2 Brian Lin OSG
Introduction to research computing using Condor
Presentation transcript:

Getting the Most out of HTC with Workflows Friday Christina Koch Research Computing Facilitator University of Wisconsin - Madison

OSG User School 2016 Why are we here? 2

OSG User School 2016 Why are we here? 3 To do SCIENCE!!! A lot of science is best-done with computing – sometimes, LOTS of computing Science needs to be reproducible And, we’d really like science to happen FAST(er)

OSG User School 2016 GETTING THE MOST OUT OF COMPUTING (FOR SCIENCE) 4

OSG User School 2016 Computing types At the beginning of the week, we talked about two different approaches for tackling large compute tasks... 5 high-performance (e.g.MPI) high-throughput

OSG User School 2016 Two Strategies High Throughput Focus: Workflows with many small, largely independent compute tasks Optimize: throughput, or time from submission to overall completion High Performance Focus: Workflows with large, highly coupled tasks Optimize: software, communication between processes 6

OSG User School 2016 Making Good Choices How do you choose the best approach? Guiding question: Is your problem “HTC-able”? 7

OSG User School 2016 Typical HTC Problems batches of similar program runs (>10) “loops” over independent tasks others you might not think of …  programs/functions that  process files that are already separate  process columns or rows, separately  iterate over a parameter space  a lot of programs/functions that use multiple CPUs on the same server Ultimately: Can you break it up? 8

OSG User School 2016 What is not HTC? fewer numbers of jobs jobs individually requiring significant resources  RAM, Data/Disk, # CPUs, time (though, “significant” depends on the HTC compute system you use) restrictive licensing 9

OSG User School 2016 The Real World However, it’s not just about finding the right computing approach to your problem. These approaches will be *most* effective if they’re running on appropriate compute systems. 10

OSG User School 2016 The Real World Not all compute systems are created equal. Two questions to ask: What resources are available to me? Which one is the best match for the kind of computing I want to do? 11

OSG User School 2016 Campus Resources Start with your local campus compute system Some considerations:  Who has access? Are there allocations?  What kind of system? What is it optimized for? An HPC cluster may not handle lots of jobs well, in the same way that an HTC system has limited multicore capabilities - be aware of how a system matches/doesn’t match your computation strategy. Ask questions! Be a good citizen! If local resources are limited, explore other options. 12

OSG User School 2016 Beyond your campus Open Science Grid!  This afternoon, Tim will talk about ways to access OSG after the school is over Other grids  European Grid Infrastructure  Other national and regional grids  Commercial cloud systems 13

OSG User School 2016 The payoff HTC is, beyond everything, scalable  If you can run 10 jobs, you can run 10,000, maybe even 10 million Worth pursuing the right kind of resources (if you can) for the right kind of problem. 14 time n processors

OSG User School 2016 GETTING THE MOST OUT OF HTC 15

OSG User School 2016 Key HTC Tactics 16 1.Increase Overall Throughput 2.Don’t Overuse Resources! 3.Bring Dependencies With You 4.Scale Gradually, Testing Generously 5.Automate As Many Steps As Possible

OSG User School 2016 Throughput, revisited In HTC, we optimize throughput: time from submission to overall completion We do this by breaking large processes into smaller pieces 17 Instead of making individual jobs as fast as possible......optimize how long it takes for *all* jobs to finish

OSG User School 2016 Breaking up is hard to do… Ideally into parallel (separate) jobs  reduced job requirements = more matches  not always easy or possible Strategies  break HTC-able steps out of a single program  break up loops  break up input Self-checkpointing if jobs are too long 18

OSG User School 2016 Batching (Merging) is easy A single job can  execute multiple independent tasks  execute multiple short, sequential steps  avoid transfer of intermediate files Use scripts!  need adequate error reporting for each “step”  easily handle multiple commands and arguments 19

OSG User School 2016 Key HTC Tactics 20 1.Increase Overall Throughput 2.Utilize Resources Efficiently! 3.Bring Dependencies With You 4.Scale Gradually, Testing Generously 5.Automate As Many Steps As Possible

OSG User School 2016 Know and Optimize Job Use of Resources! CPUs (“1” is best for matching; essential for OSG)  restrict, if necessary/possible  software that uses all available CPUs is BAD! CPU Time > ~5 min, < ~1 day; Ideal: 1-2 hours RAM (not always easily modified) Disk per-job (execute) and in-total (submit) Network Bandwidth  minimize transfer: filter/trim/delete, compress 21

OSG User School 2016 The job log shows all 001 ( ) 06/07 11:57:57 Job executing on host: 005 ( ) 06/07 14:12:55 Job terminated. (1) Normal termination (return value 0) Usr 0 00:00:00, Sys 0 00:00:00 - Run Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Run Local Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Remote Usage Usr 0 00:00:00, Sys 0 00:00:00 - Total Local Usage 5 - Run Bytes Sent By Job Run Bytes Received By Job 5 - Total Bytes Sent By Job Total Bytes Received By Job Partitionable Resources : Usage Request Allocated Cpus : 1 1 Disk (KB) : Memory (MB) :

OSG User School 2016 Key HTC Tactics 23 1.Increase Overall Throughput 2.Don’t Overuse Resources! 3.Bring Dependencies With You 4.Scale Gradually, Testing Generously 5.Automate As Many Steps As Possible

OSG User School 2016 Bring What with You?  Parameters and random numbers: generate and record ahead of time (for reproducibility!) What else? 24 Software (covered Wednesday) Data and other input files

OSG User School 2016 Wrapper Scripts are Essential 25 Before task execution (bring it with you!)  transfer/prepare files and directories  setup/configure software environment and other dependencies Task execution  prepare complex commands and arguments  batch together many ‘small’ tasks After task execution  filter/combine/compress files and directories  check for and report on errors

OSG User School 2016 Key HTC Tactics 26 1.Increase Overall Throughput 2.Don’t Overuse Resources! 3.Bring Dependencies With You 4.Scale Gradually, Testing Generously 5.Automate As Many Steps As Possible

OSG User School 2016 Testing, testing, testing! Will be a major focus of our exercises today. Allows you to optimize resource use (see HTC tactic #2) Just because it worked for 10 jobs, doesn’t mean it will work for 10,000 jobs (scaling issues)  Data transfer (in and out)  Discover site-specific problems 27

OSG User School 2016 Key HTC Tactics 28 1.Increase Overall Throughput 2.Don’t Overuse Resources! 3.Bring Dependencies With You 4.Scale Gradually, Testing Generously 5.Automate As Many Steps As Possible

OSG User School 2016 What to Automate? Submitting many jobs (using HTCondor) Writing submit files using scripts Running a series of jobs, or workflow 29

OSG User School 2016 What is a workflow? 30 Steps Connections (Metadata) A series of ordered steps

OSG User School 2016 We workflows non-computing “workflows” are all around you … especially in science  instrument setup  experimental procedures and protocols 31 when planned/documented, workflows help with:  organizing and managing processes  saving time with automation  objectivity, reliability, and reproducibility (THE TENETS OF GOOD SCIENCE!)

OSG User School 2016 process ’99’ DAGs Automate Workflows! 32 data prep/split process ‘0’ combine, assess results...

OSG User School 2016 Automating workflows can save you time

OSG User School 2016 … but there are even more benefits of automating workflows 34 Reproducibility Building knowledge and experience New ability to imagine greater scale, functionality, possibilities, and better SCIENCE!!

OSG User School 2016 GETTING THE MOST OUT OF WORKFLOWS, PART 1 35

OSG User School 2016 From schematics… 36

OSG User School 2016 … to the real world 37

OSG User School 2016 Building a Good Workflow 1.Draw out the general workflow 2.Define details (test ‘pieces’ with HTCondor jobs)  divide or consolidate ‘pieces’  determine resource requirements  identify steps to be automated or checked 3.Build it modularly; test and optimize 4.Scale-up gradually 5.Make it work consistently 6.What more can you automate or error-check? (And remember to document!) 38

OSG User School 2016 Workflow, version 1 39 data prep (minutes) processing (days) assess results (minutes)

OSG User School 2016 process ’99’ Workflow, version 2 (HTC) 40 data prep/split process ‘0’ combine, assess results...

OSG User School 2016 Building a Good Workflow 1.Draw out the general workflow 2.Define details (test ‘pieces’ with HTCondor jobs)  divide or consolidate ‘pieces’  determine resource requirements  identify steps to be automated or checked 3.Build it modularly; test and optimize 4.Scale-up gradually 5.Make it work consistently 6.What more can you automate or error-check? (And remember to document!) 41

OSG User School 2016 Determine Resource Usage Run locally first Then get one job running remotely  (on execute machine, not submit machine)!  get the logistics correct! (HTCondor submission, file and software setup, etc.) Once working, run a couple of times  If big variance in resource needs, should you take the… Average? Median? Worst case? 42

OSG User School 2016 process ‘99’ (filter output) End Up with This 43 (special transfer) file prep and split (POST-RETRY) process ‘0’ (filter output) combine, transform results (POST-RETRY)... 1 GB RAM 2 GB Disk 1.5 hours 100 MB RAM 500 MB Disk 40 min (each) 300 MB RAM 1 GB Disk 15 min (PRE) (POST-RETRY)

OSG User School 2016 Exercise Task 1 (days) Task 2 (minutes)

OSG User School 2016 process ‘?’ Exercise process ‘0’ Task 2...

OSG User School 2016 process ‘?’ (filter output) Exercise process ‘0’ (filter output) Task 2 (POST?)... ? RAM ? MB Disk ? min (each) ? RAM ? GB Disk ? min (PRE?) (POST?)

OSG User School 2016 Questions? Now: “Joe’s Workflow” Exercise 1.1,1.2  In groups of 2-3  Read carefully! Later:  Lecture: From Workflow to Production  Exercises 2.1,