1
When and How to Use Large-Scale Computing: CHTC and HTCondor
Lauren Michael, Research Computing Facilitator
Center for High Throughput Computing
STAT 692, November 15, 2013
2
Topics We’ll Cover Today
› Why to Access Large-Scale Computing Resources
› CHTC Services and Campus-Shared Computing
› What is High-Throughput Computing (HTC)?
› What is HTCondor and How Do You Use It?
› Maximizing Computational Throughput
› How to Run R on Campus-Shared Resources
3
When should you use outside computing resources?
1. Your computing work won’t run at all on your computer(s) (they lack sufficient RAM, disk, etc.).
2. Your computing work will take too long on your own computer(s).
3. You would like to off-load certain processes in favor of running others on your computer(s).
4
CHTC Services
Center for High Throughput Computing, est. 2006
› Large-scale, campus-shared computing systems
  - high-throughput computing (HTC) grid and high-performance computing (HPC) cluster
  - all standard services provided free of charge
  - automatic access to the national Open Science Grid (OSG)
  - hardware buy-in options for priority access
  - information about other computing resources
› Support for using our systems
  - consultation services, training, and proposal assistance
  - solutions for numerous software packages (including Python, Matlab, and R)
5
HTCondor: CHTC’s R&D Arm
› R&D for HTCondor and other HTC software
› Services provided to the campus community
  - HTC software: HTCondor (manage your compute cluster), DAGMan (manage computing workflows), Bosco (submit locally, run globally)
  - Software engineering expertise and consulting: the CHTC-operated Build-and-Test Lab (BaTLab), software security consulting
› Your problems become our research!
6
Quick Facts (http://chtc.cs.wisc.edu)
                        Jul ’10 - Jun ’11   Jul ’11 - Jun ’12   Jul ’12 - Jun ’13
Million Hours Served           45                  70                  97
Research Projects              54                 106                 120
Departments                                        35                  52
Off-Campus Projects            10                  13                  15
Researchers who use the CHTC are located all over campus (red buildings on the campus map).
7
CHTC Staff
Director: Miron Livny, miron@cs.wisc.edu (also OSG Technical Director and WIDs CTO)
Campus support: chtc@cs.wisc.edu
› 2+ Research Computing Facilitators, including Lauren Michael (lead), lmichael@wisc.edu
› 3 Systems Administrators, plus 4-8 part-time students
› HTCondor Development Team
› OSG Software Team
8
HTC versus HPC
› high-throughput computing (HTC)
  - many independent processes that can each run on one or a few processors (“cores” or “threads”) on the same computer
  - mostly standard programming methods
  - best accelerated by access to as many cores as possible
› high-performance computing (HPC)
  - shares the workload of interdependent processes over multiple cores to reduce overall compute time
  - OpenMP and MPI programming methods, or multi-threading
  - requires access to many servers of cores within the same tightly-networked cluster, plus access to shared files
9
“parallel” is confusing
› It essentially means: spread computing work out over multiple processors.
› When referring to programs, the words “parallel” and “parallelize” can apply to either HTC or HPC.
› It’s important to be clear!
10
Topics We’ll Cover Today
› Why to Access Large-Scale Computing Resources
› CHTC Services and Campus-Shared Computing
› What is High-Throughput Computing (HTC)?
› What is HTCondor and How Do You Use It?
› Maximizing Computational Throughput
› How to Run R on Campus-Shared Resources
11
What is HTCondor?
› a match-maker of computing work and computers
› a “job scheduler”: matches are made based upon the necessary RAM, CPUs, disk space, etc., as requested by the user; jobs are re-run if interrupted
› works beyond “clusters” to coordinate distributed computers for maximum throughput
› coordinates data transfers between users and distributed computers
› can coordinate servers, desktops, and laptops
12
How HTCondor Works (diagram)
› Submit Node(s): where jobs are submitted; the queue holds jobs (e.g., job1.1 user1, job1.2 user1, job2.1 user2), each described by a Job ClassAd; input files start here and output is returned here
› Central Manager (of the pool): matches Job ClassAds to Machine ClassAds
› Execute Node(s): where jobs run; each advertises a Machine ClassAd; input is transferred to them and output is transferred back
14
Submit nodes available to YOU
› Stat dept servers: reach the CS Pool by default
› simon.stat.wisc.edu: reaches the CHTC Pool by default
› CHTC submit nodes: reach the CHTC Pool by default; the Campus Grid via flocking; the Open Science Grid via “glidein”
15
Basic HTCondor Submission
› Prepare programs and files
› Write submit file(s)
› Submit jobs to the queue
› Monitor the jobs
› (Remove bad jobs)
16
Preparing Programs and Files
› Make programs portable
  - compile code to a simple binary
  - statically link code dependencies
  - consider CHTC’s tools for packaging Matlab, Python, and R
› Consider using a shell script (or other “wrapper”) to run multiple commands for you: create a local install of software, set environment variables, then run your code (see the sketch below)
› Stage all files on a submit node
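A minimal wrapper-script sketch along these lines; the tarball and program names (mylibs.tar.gz, myprog) are hypothetical placeholders rather than CHTC-specific conventions:

    #!/bin/bash
    # wrapper.sh - used as the job's executable; HTCondor transfers it with the job
    # unpack a pre-built software install staged as an input file
    tar -xzf mylibs.tar.gz
    # point the environment at the unpacked install
    export PATH=$PWD/mylibs/bin:$PATH
    export LD_LIBRARY_PATH=$PWD/mylibs/lib:$LD_LIBRARY_PATH
    # run the actual program with whatever arguments HTCondor passed to the wrapper
    ./myprog "$@"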
17
HTC Components
1. Cut up computing work into many independent pieces (CHTC can consult).
2. Make programs portable; minimize dependencies (CHTC can consult, or may have prepared solutions).
3. Learn how to submit jobs (CHTC can help you a lot!).
4. Maximize your overall throughput on available computational resources (CHTC can help you a lot!).
18
Basic HTCondor Submit File

    # This is a comment
    universe = vanilla
    output = process.out
    error = process.err
    log = process.log
    executable = cosmos
    arguments = cosmos.in 4
    should_transfer_files = YES
    transfer_input_files = cosmos.in
    when_to_transfer_output = ON_EXIT
    request_memory = 100
    request_disk = 100000
    request_cpus = 1
    queue

Notes:
› basic jobs use the vanilla universe
› executable is your single program or a shell script
› log is where HTCondor stores info about how your job ran
› output and error are where system output and error will go
› the program will be run as: ./cosmos cosmos.in 4
› queue with no number after it will submit only one job
› request_memory is in MB and request_disk is in KB
19
Basic HTCondor Submit File: Initial File Organization
The same submit file as above, with everything staged in a folder test/ on the submit node:

    test/
        cosmos
        cosmos.in
        submit.txt
20
HTCondor Multi-Job Submit File

    # This is a comment
    universe = vanilla
    output = $(Process).out
    error = $(Process).err
    log = $(Cluster).log
    executable = cosmos
    arguments = cosmos_$(Process).in
    should_transfer_files = YES
    transfer_input_files = cosmos_$(Process).in
    when_to_transfer_output = ON_EXIT
    request_memory = 100
    request_disk = 100000
    request_cpus = 1
    queue 3

File organization in test/:
    cosmos
    cosmos_0.in
    cosmos_1.in
    cosmos_2.in
    submit.txt
21
HTCondor Multi-Folder Submit File

    # This is a comment
    universe = vanilla
    InitialDir = $(Process)
    output = $(Process).out
    error = $(Process).err
    log = /home/user/test/$(Cluster).log
    executable = /home/user/test/cosmos
    arguments = cosmos.in
    should_transfer_files = YES
    transfer_input_files = cosmos.in
    when_to_transfer_output = ON_EXIT
    request_memory = 100
    request_disk = 100000
    request_cpus = 1
    queue 3

File organization in test/:
    cosmos
    cosmos.in
    submit.txt
    0/ cosmos.in
    1/ cosmos.in
    2/ cosmos.in
22
Submitting Jobs

    [lmichael@simon test]$ condor_submit submit.txt
    Submitting job(s)...
    3 job(s) submitted to cluster 29747.
    [lmichael@simon test]$
23
Checking the Queue

    [lmichael@simon test]$ condor_q lmichael
    -- Submitter: simon.stat.wisc.edu : : simon.stat.wisc.edu
     ID       OWNER     SUBMITTED    RUN_TIME    ST PRI SIZE CMD
     29747.0  lmichael  2/15 09:06  0+00:01:34  R  0   9.8  cosmos cosmos.in
     29747.1  lmichael  2/15 09:06  0+00:00:00  I  0   9.8  cosmos cosmos.in
     29747.2  lmichael  2/15 09:06  0+00:00:00  I  0   9.8  cosmos cosmos.in
     3 jobs; 0 completed, 0 removed, 2 idle, 1 running, 0 held, 0 suspended
    [lmichael@simon test]$

View all user jobs in the queue: condor_q
24
Log Files

    000 (29747.001.000) 02/15 09:29:17 Job submitted from host: ...
    001 (29747.001.000) 02/15 09:33:59 Job executing on host: ...
    005 (29747.001.000) 02/15 09:39:01 Job terminated.
        (1) Normal termination (return value 0)
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Remote Usage
            Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
        0  -  Run Bytes Sent By Job
        0  -  Run Bytes Received By Job
        0  -  Total Bytes Sent By Job
        0  -  Total Bytes Received By Job
        Partitionable Resources :   Usage   Request   Allocated
           Cpus                 :               1          1
           Disk (KB)            :  225624   100000     645674
           Memory (MB)          :      85     1000       1024
25
Removing Jobs
› Remove a single job:           condor_rm 29747.0
› Remove all jobs of a cluster:  condor_rm 29747
› Remove all of your jobs:       condor_rm lmichael
26
Topics We’ll Cover Today
› Why to Access Large-Scale Computing Resources
› CHTC Services and Campus-Shared Computing
› What is High-Throughput Computing (HTC)?
› What is HTCondor and How Do You Use It?
› Maximizing Computational Throughput
› How to Run R on Campus-Shared Resources
27
Maximizing Throughput
› The Philosophy of HTC
› The Art of HTC
› Other Best Practices
28
The Philosophy of HTC
› break up your work into many “smaller” jobs: single CPU, short run times, small input/output data
› run on as many processors as possible
  - single CPU and low RAM needs
  - take everything with you; make programs portable
  - use the “right” submit node for the right “resources”
› automate as much as you can
› (share your processors with others to increase everyone’s throughput)
29
Success Stories
› Edgar Spalding: studies the effects of genes on plant-growth outcomes
› GeoDeepDive Project: extracts and compiles “dark data” from PDFs of publications in the geosciences
We want HTC to revolutionize your research!
30
The Art of HTC
Carrying out the philosophy, well:
› Tuning job requests for memory and disk
› Matching run times to the maximum number of available processors
› Automation
31
Tuning Job Resource Requests
Problem: Don’t know what your job needs?
› If you don’t ask for enough memory and disk: your jobs will be kicked off for going over and will have to be retried (though HTCondor will automatically request more for you).
› If you ask for too much: your jobs won’t match to as many available “slots” as they could.
32
Tuning Job Resource Requests
Solution: testing is key!
1. Run just a few jobs at first to determine memory and disk needs from the log files. If your first request is not enough, HTCondor will retry the jobs and request more until they finish. It’s okay to request a lot (1 GB each) for a few tests.
2. Change the “request” lines to a better value (see the example below).
3. Submit a large batch.
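For example, the log excerpt shown earlier reports roughly 85 MB of memory and about 225,624 KB of disk actually used, so a reasonable second-pass request (a sketch; the headroom margin is your own choice) might be:

    # updated resource requests, with some headroom over observed usage
    request_memory = 150       # MB; the log showed ~85 MB used
    request_disk   = 300000    # KB; the log showed ~225,624 KB used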
33
Time-Matching (submit file additions)
Approximate maximum run times: CS Pool ~4 hrs(?); CHTC Pool <24 hrs (up to 72)*; Campus Grid <4 hrs; Open Science Grid <2 hrs.
› Stat dept servers: reach the CS Pool by default
› simon.stat.wisc.edu: reaches the CHTC Pool by default
› CHTC submit nodes: reach the CHTC Pool by default; add +WantFlocking = true to reach the Campus Grid, and +WantGlidein = true to reach the Open Science Grid
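In practice these are just extra lines in the submit files shown earlier (a sketch; only add them for jobs short enough for those pools):

    # allow this job to flock to the UW campus grid and glide in to the OSG
    +WantFlocking = true
    +WantGlidein = true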
34
Time-Tuning: Batching
› Problem: jobs shorter than about 5 minutes are bad for overall throughput; more time is spent on matching and data transfers than on your job’s processes. The ideal run time is between 5 minutes and 2 hours (OSG).
› Solution: use a shell script (or other method) to run multiple processes within a single job (see the sketch below).
  - avoids transfer of intermediate files between sequential, related processes
  - debugging can be a bit trickier
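A minimal batching sketch, assuming the cosmos program from the earlier examples and a hypothetical set of numbered input files that all belong to one job:

    #!/bin/bash
    # batch.sh - run several short, related tasks as one HTCondor job
    for i in 0 1 2 3 4; do
        # intermediate results never leave the execute node
        ./cosmos cosmos_${i}.in > result_${i}.out
    done
    # bundle the results so a single file is transferred back
    tar -czf results.tar.gz result_*.out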
35
Time-Tuning: Checkpointing
The best way to run longer jobs without losing progress to eviction. Two ways:
1. Compile your code with condor_compile and use the “standard” universe within HTCondor (see the sketch below)
2. Implement self-checkpointing
Consult HTCondor’s online manual or contact the CHTC for help.
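A sketch of option 1, assuming the program is built from a C source file (cosmos.c is a placeholder name); relinking with condor_compile lets HTCondor checkpoint the job and resume it after eviction:

    # relink the program against HTCondor's checkpointing libraries
    condor_compile gcc -o cosmos cosmos.c

    # then, in the submit file, use the standard universe instead of vanilla
    universe = standard
    executable = cosmos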
36
Automate Tasks
› Use $(Process)
› Use shell scripts to run multiple tasks within the same job, including environment preparation
› Hard-code arguments, calculate them (e.g., random number generation), or use parameter files/tables
› Use HTCondor’s DAGMan feature (“directed acyclic graph”) to create complex workflows of dependent jobs and submit them all at once; additional helpful features include success checks and more (see the sketch below)
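A minimal DAGMan sketch, assuming two hypothetical submit files, prepare.sub and analyze.sub, where the second job must wait for the first:

    # workflow.dag - submit with: condor_submit_dag workflow.dag
    JOB  prepare  prepare.sub
    JOB  analyze  analyze.sub
    # analyze starts only after prepare completes successfully
    PARENT prepare CHILD analyze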
37
Non-Throughput Considerations
Remember that you are sharing with others.
› “Be kind to your submit node”
  - avoid transfers of large files through the submit node (large: >10 GB per batch, or ~10 MB/job x 1000+ jobs)
  - transfer files from another server as part of your job (wget and curl; see the sketch below)
  - compress where appropriate; delete unnecessary files
  - remember: “new” files are copied back to submit nodes
  - avoid running multiple CPU-intensive executables on the submit node
› Test all new batches, and scale up gradually: 3 jobs, then 100s, then 1000s, then more
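A sketch of fetching a large input inside the job itself, so the data never passes through the submit node (the URL and file names are placeholders):

    #!/bin/bash
    # download the large input directly on the execute node
    wget -q http://your-data-server.example.edu/bigdata.tar.gz
    tar -xzf bigdata.tar.gz
    ./cosmos bigdata/input.in > result.out
    # delete the large input so it is not copied back to the submit node
    rm -rf bigdata bigdata.tar.gz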
38
Topics We’ll Cover Today
› Why to Access Large-Scale Computing Resources
› CHTC Services and Campus-Shared Computing
› What is High-Throughput Computing (HTC)?
› What is HTCondor and How Do You Use It?
› Maximizing Computational Throughput
› How to Run R on Campus-Shared Resources
39
Running R on HTC Resources: The Best Way
› Problem: R programs don’t easily compile to a binary.
› Solution: take R with your job! CHTC has tools just for R (and Python, and Matlab).
› Installed on CS/Stat submit nodes, simon, and CHTC submit nodes.
41
1. Build R Code with chtc_buildRlibs
› Copy your R code and any R library tar.gz files to the submit node.
› Run the following command:

    chtc_buildRlibs --rversion=sl5-R-2.10.1 \
        --.tar.gz,.tar.gz

› R versions supported: 2.10.1, 2.13.1, 2.15.1 (use the closest version below yours).
› Get back sl5-RLIBS.tar.gz and sl6-RLIBS.tar.gz (you’ll use these in the next step).
43
2. Download the “ChtcRun” Package
› Download ChtcRun.tar.gz according to the guide (wget).
› Un-tar it: tar xzf ChtcRun.tar.gz
› View the ChtcRun contents:
  - process.template (submit file template)
  - mkdag (script that will “create” jobs based upon your staged data)
  - Rin/ (example data staging folder)
44
3. Prepare Data and process.template
› Stage data as such:

    ChtcRun/
        data/
            1/      input.in
            2/      input.in
            job3/   input.in
            test4/  input.in
            shared/     (your .R code)

› Modify process.template with respect to:
  - request_memory and request_disk, if you know them
  - +WantFlocking = true OR +WantGlidein = true
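One way to build that layout from scratch (a sketch; the source paths are placeholders, and soartest.R is the example R script used on the next slide):

    cd ChtcRun
    mkdir -p data/1 data/2 data/shared
    cp ~/project/input_1.in data/1/input.in     # one folder per job
    cp ~/project/input_2.in data/2/input.in
    cp ~/project/soartest.R data/shared/        # code shared by every job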
45
4. Run mkdag and Submit Jobs
› In ChtcRun, execute the mkdag script (examples are at the top of "./mkdag --help"):

    ./mkdag --data=Rin --outputdir=Rout \
        --cmdtorun=soartest.R --type=R \
        --version=R-2.10.1 --pattern=meanx

  - “pattern” indicates a portion of a filename that you expect to be created by successful completion of any single job
› A successful mkdag run will instruct you to navigate to the ‘outputdir’ and submit the jobs as a single DAG:

    condor_submit_dag mydag.dag
46
5. Monitor Job Completion
› Check jobs in the queue as they’re gradually added and completed (condor_q).
› Check the other files in your ‘outputdir’:

    Rout/
        mydag.dag.dagman.out    (updated table of job stats)
        1/  process.log, process.out, process.err, ChtcWrapper1.out
        2/  process.log, process.out, process.err, ChtcWrapper2.out
        .../

› After testing a small number of jobs, submit many (up to many 10,000s; the number submitted at once is throttled for you)!
47
What Next?
1. Use a Stat server to submit shorter jobs to the CS pool.
2. Obtain access to simon.stat.wisc.edu from Mike Camilleri (mikec@stat.wisc.edu), and submit longer jobs to the CHTC Pool.
3. Meet with the CHTC to submit jobs to the entire UW Grid and to the national Open Science Grid: chtc.cs.wisc.edu, click “Get Started”.
User support for HTCondor users at UW: chtc@cs.wisc.edu