Presentation transcript:

Workflows: from Development to Production
Thursday morning, 10:00 am
Lauren Michael, Research Computing Facilitator
University of Wisconsin - Madison

‘Engineering’ a Good Workflow
1. Draw out the general workflow
2. Define details (test ‘pieces’ with HTCondor)
3. Define the computing workflow
   - divide or consolidate processes
   - off-load file transfers and consider file transfer times
   - identify steps to be automated or checked
4. Build it piece by piece; test and optimize
5. Scale up: data and computing resources
6. What more can you automate or error-check?
(And document, document, document)

From This...
[Diagram: an example DAG; a DAGMan sketch of this shape follows below]
- file prep and split (special transfer; PRE, POST-RETRY): 1 GB RAM, 2 GB disk, 1.5 hours
- process ‘0’ ... process ‘99’ (filter output; POST-RETRY): 100 MB RAM, 500 MB disk, 40 min each
- combine, transform results (POST-RETRY): 300 MB RAM, 1 GB disk, 15 min
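(A minimal DAGMan sketch of the workflow shape shown above; this is not from the original slides, and all node, script, and file names are illustrative.)

   # workflow.dag - illustrative names only
   JOB split split.sub
   SCRIPT PRE  split prestage_input.sh
   SCRIPT POST split check_split.sh
   RETRY split 3

   JOB proc0 process.sub
   VARS proc0 procid="0"
   SCRIPT POST proc0 check_output.sh 0
   RETRY proc0 3
   # ...nodes proc1 through proc99 follow the same pattern...

   JOB combine combine.sub
   SCRIPT POST combine check_combined.sh
   RETRY combine 3

   PARENT split CHILD proc0    # and proc1 ... proc99
   PARENT proc0 CHILD combine  # likewise for proc1 ... proc99

Submitting it is then a single command: condor_submit_dag workflow.dag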

End Up with This
[Diagram: DATA flowing through the automated workflow, ending in a Nobel]

Scaling and Optimization
Your ‘small’ DAG runs (once)! Now what?
- Need to make it run full-scale
- Need to make it run everywhere
- Need to make it run every time
- Need to make it run unattended
- Need to make it run when someone else tries

Scaling Up – Things to Think About
More jobs:
- 50-MB files may be fine for 10 or 100 jobs, but not for 1000 jobs. Why? (a rough estimate follows below)
- More difficult to identify errors
Larger files:
- execute RAM and disk space
- matching time vs. execute time
- data transfer time vs. computation time
Scale up gradually
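(A rough estimate, not from the original slides, of why 50-MB files stop being fine at 1000 jobs:)

   1000 jobs x 50 MB in   = ~50 GB staged out through the submit node
   1000 jobs x 50 MB out  = ~50 GB returned to the submit node

Roughly 100 GB moving through one submit machine's disk and network is enough to hit quotas, slow every transfer, and delay matching; at 10 or 100 jobs the same files added up to only a few gigabytes.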

Data Scaling Solutions
File manipulations (a shell sketch follows below)
- cut into pieces, filter
- compression/decompression
Listen to Derek’s methods (yesterday):
- Sandbox
- Caching
- Prestaging
- Storage Element (SE) horsepower
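(A minimal shell sketch of the “cut into pieces” and compression ideas; this is not from the original slides, the file names are illustrative, and it assumes GNU split.)

   #!/bin/bash
   # split_and_compress.sh - prepare per-job input chunks
   set -e

   # Split one large input into 100 chunks, one per job, without breaking lines.
   split -d -a 2 -n l/100 big_input.txt chunk_

   # Compress each chunk so less data moves over the network.
   for f in chunk_*; do
       gzip "$f"
   done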

Make It Run Everywhere
What does an OSG machine have?
- Assume the worst: nothing
Bring as much as possible with you:
- Won’t that slow me down?
Bring (see the wrapper sketch below):
- Executable
- Environment
- Parameters (using parameter files)
- Random numbers (generate before submission)
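(A minimal wrapper sketch of “bring it with you”; this is not from the original slides, and every file and program name is illustrative.)

   #!/bin/bash
   # wrapper.sh - runs on the execute node, shipped with the job
   set -e

   # Unpack the environment we brought along (interpreter, libraries, ...).
   tar xzf my_env.tgz

   # Use only what we brought, not whatever this machine happens to have.
   export PATH=$PWD/my_env/bin:$PATH
   export LD_LIBRARY_PATH=$PWD/my_env/lib

   # Parameters and pre-generated random seeds arrive as files, keyed by job number.
   ./my_analysis --params params_$1.txt --seeds seeds_$1.txt > results_$1.out

The submit file would list my_env.tgz, my_analysis, and the per-job parameter and seed files in transfer_input_files, and pass the job number as the argument.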

Scaling Out to OSG: Rules of Thumb
CPU (single-threaded)
- Run for between 5 minutes and 2 hours
- Upper limit somewhat soft
Disk
- Keep scratch working space < 20 GB
- Intermediate needs (/tmp?)
- Submit disk: think quota and I/O transfer
Batch or break up jobs
Use squid caching where appropriate

The expanding onion
Laptop (1 machine)
- You control everything!
Local cluster (1000 cores)
- You can ask an admin nicely
Campus (5000 cores)
- It better be important/generalizable
OSG (50,000 cores)
- Good luck finding the pool admins

Bringing It With You: MATLAB
What’s the problem with MATLAB?
- license limitations
What’s the solution?
- “compiling”
Similar measures apply to other interpreted languages (R, Python, etc.)

How to bring MATLAB along
1) Purchase & install the MATLAB “compiler”
2) Run the compiler as follows:
   $ mcc -m -R -singleCompThread -R -nodisplay -R -nojvm -nocache foo.m
3) This creates run_foo.sh (among other files)
4) Create a tarball of the runtime:
   $ cd /usr/local/mathworks-R2009bSP1 ; cd ..
   $ tar cvzf ~/m.tgz mathworks-R2009bSP1

More MATLAB
5) Edit run_foo.sh to include:
   tar xzf m.tgz
   mkdir cache
   chmod 0777 cache
   export MCR_CACHE_ROOT=`pwd`/cache
Make a submit file:
   universe = vanilla
   executable = run_foo.sh
   arguments = ./mathworks-R2009bSP1
   should_transfer_files = yes
   when_to_transfer_output = on_exit
   transfer_input_files = m.tgz, foo
   queue

Make It Work Every Time
What could possibly go wrong?
- Eviction
- File corruption
- Performance surprises
  - Network
  - Paging
  - Disk
  - …
- Maybe even a bug in your code

Error Checks Are Essential
If you don’t check, it will happen…
Check expected file existence (transfer or creation) with a loop
- better yet, check rough file size
Check expected file contents
- specific line text, # of columns, etc.
Advanced:
- Error-check with a wrapper-POST combo (a sketch follows below)
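(A minimal POST-script sketch along these lines; it is not from the original slides, and the file name, size threshold, and SUMMARY marker are all illustrative. DAGMan runs the POST script after the node’s job and treats a non-zero exit code as node failure.)

   #!/bin/bash
   # check_output.sh <procid> - illustrative output checks
   out="results_$1.out"

   # Does the expected output exist, and is it at least roughly the right size?
   if [ ! -s "$out" ] || [ "$(wc -c < "$out")" -lt 1000 ]; then
       echo "missing or suspiciously small: $out" >&2
       exit 1
   fi

   # Does it contain the expected content, e.g. a final summary line?
   if ! grep -q '^SUMMARY' "$out"; then
       echo "no SUMMARY line in $out" >&2
       exit 1
   fi

   exit 0

In the DAG file this would be attached as SCRIPT POST procN check_output.sh N, paired with RETRY.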

What to do if a check fails
Understand something about the failure
Use DAG RETRY (see the snippet below)
Let the rescue DAG continue…
- workflow-specific
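(A minimal RETRY example, not from the original slides; node and script names are illustrative.)

   JOB proc0 process.sub
   SCRIPT POST proc0 check_output.sh 0
   RETRY proc0 3

If the job or its POST script fails, DAGMan reruns the node up to three more times; nodes that still fail are recorded in a rescue DAG file, so running condor_submit_dag again later continues from where the workflow stopped.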

Performance Surprises
One bad node can ruin your whole day
“Black hole” machines
- Avoidance tricks and their auto-clustering costs
- GLIDEIN whitelist/blacklist if a site is somehow bad (but talk to the GOC first)
REALLY slow machines
- Use PERIODIC_HOLD / RELEASE (see the sketch below)
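(A minimal submit-file sketch of the periodic_hold / periodic_release idea; this is not from the original slides, and the 3-hour cutoff, 10-minute delay, and start limit are illustrative.)

   # Hold any job that has been running far longer than expected.
   periodic_hold = (JobStatus == 2) && ((time() - EnteredCurrentStatus) > (3 * 3600))

   # Release held jobs after 10 minutes so they can try a different machine,
   # but stop retrying after a few starts.
   periodic_release = (JobStatus == 5) && ((time() - EnteredCurrentStatus) > 600) && (NumJobStarts < 4)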

Make It Work Unattended
The ultimate goal!
Need to automate:
- Data collection
- Data cleansing
- Submission (condor cron; one approach sketched below)
- Analysis and verification
- LaTeX and paper submission
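(One simple way to approximate unattended submission; this plain cron-plus-script sketch is not from the original slides, which point to HTCondor’s own cron-style submission, and all paths and script names are illustrative.)

   # crontab entry: every night at 2am, refresh data and submit the workflow
   0 2 * * *  /home/user/workflows/run_nightly.sh >> /home/user/workflows/nightly.log 2>&1

   #!/bin/bash
   # run_nightly.sh - illustrative driver
   set -e
   cd /home/user/workflows
   ./collect_data.sh                      # data collection
   ./clean_data.sh                        # data cleansing
   condor_submit_dag -force workflow.dag  # submission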

Make Science Work Unattended?
Well, maybe not, but a scientist can dream…

Make It Run(-able) for Someone Else
If others can’t reproduce your work, it isn’t real science!
- Work hard to make this happen.
- It’s their throughput, too.
Only ~10% of published cancer research is reproducible!
Another argument for automation

Documentation at Multiple Levels
In job files: comment lines (example below)
- submit files, wrapper scripts, executables
In README files
- describe file purposes
- define overall workflow, justifications
In a document
- draw the workflow!
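(A small illustration, not from the original slides, of comment lines in a submit file; every name here is a placeholder.)

   # process.sub - filters one chunk of the input (step 2 of the workflow)
   # Inputs : chunk_$(procid).gz  (made by split_and_compress.sh)
   # Outputs: results_$(procid).out, checked by check_output.sh (POST script)
   universe   = vanilla
   executable = wrapper.sh
   arguments  = $(procid)
   queue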

Make It Run Faster? Maybe.
Throughput, throughput, throughput
- Resource reductions (match more slots!)
- Wall-time reductions
  - if significant per workflow
  - Why not per job?
Think in orders of magnitude:
- Say you have 1000 hour-long jobs that are matched at a rate of 1 per minute… (worked out below)
Waste the computer’s time, not yours.
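(Working that example through; an illustrative estimate, not from the original slides.)

   last job matched              : ~1000 min after submission  (~16.7 hours)
   last job finished (60-min job): ~1060 min                   (~17.7 hours)
   same, with jobs cut to 30 min : ~1030 min                   (~17.2 hours)

Halving every job’s runtime saves only about 3% of the total wall time here; shrinking resource requests so more slots match (and jobs start sooner) matters far more.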

Maybe Not Worth It
- Rewriting your code in a different language
- Targeting “faster” machines
- Targeting machines that will run longer jobs
- Others?
Waste the computer’s time, not yours.

Testing, Testing, …
ALWAYS test a subset after making changes (e.g. the snippet below)
- How big of a change?
Scale up gradually
Avoid making problems for others (and for yourself)
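(One simple way to test a subset, not from the original slides: change only the queue statement until the small run looks right.)

   # test run: 5 jobs instead of the full 1000
   queue 5
   # full run, once the test output checks out:
   # queue 1000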

If this were a test…
- 20 points for finishing at all
- 10 points for the right answer
- 1 point for every error check
- 1 point per documentation line
Out of 100 points? 200 points?
[Pie chart: finishing, right answer, error checks & documentation]

Questions? Feel free to contact me:
Now: Break (10:30-10:45am)
Next:
- 10:45am-12:15pm: Exercises 6.2, 6.3
- 12:15pm: Lunch
- 1:15-2:30pm: Principles of High-Throughput Computing