Building the International Data Placement Lab Greg Thain Center for High Throughput Computing.

Presentation transcript:

Building the International Data Placement Lab Greg Thain Center for High Throughput Computing

Overview
› What is the IDPL?
› How we built it
› Examples of use

Who is the IDPL?
› Phil Papadopoulos (UCSD)
› Miron Livny (Wisconsin)
› Collaborators in China: Beihang and CNIC

[Figure: map of sites: Beihang, CNIC, UCSD (2x), Wisconsin]

A network engineer's worldview
[Figure: BUAA / CNIC, UCSD, Wisconsin; links labeled 200 ms ping and 50 ms ping]

A topologist's worldview
[Figure: Beihang, CNIC, UCSD1, UCSD2, Wisconsin]
› Run full set hourly

Wait, what about perfSONAR?
› Necessary, not sufficient
› Network only

Requirements for data placement
1. Co-scheduling of jobs: start client and server simultaneously
2. Timed start of jobs: to prevent overlap and race conditions
3. Returning and managing of results

Co-Scheduling with Parallel Universe
› Using static slots
› One slot per host
› Submit two-host job
    › Proc 0 is client
    › Proc 1 is server (just a convention)

Example submit file
    Universe = parallel
    Executable = startup_script.sh
    +ParallelShutdownPolicy = "WAIT_FOR_ALL"
    Requirements = (machine == "client-machine-name")
    Queue
    Requirements = (machine == "server-machine-name")
    Queue

Startup script (mark 1)
    #!/bin/sh
    # Proc 0 is the client, every other proc is the server (a convention).
    if [ "$_CONDOR_PROCNO" = 0 ]
    then
        do_client_stuff
    else
        do_server_stuff
    fi

But this doesn't work…
› Synchronization problem
    › Not actually co-scheduled
    › Servers need to tell clients their port numbers

Unix Process :: Condor Job
    Unix process            Condor job
    fork/exec               condor_submit
    kill -9                 condor_rm
    Environment variables   Job attributes
    Standard error file     Condor job log

condor_chirp to the rescue!
› condor_chirp set_job_attr
› Uses the job ad as a shared blackboard (whiteboard)
› Always reads and writes Proc 0's job ad (in parallel universe)

Simplified Workflow
› Server side: start server; listen on ephemeral port; condor_chirp set_job_attr Port #
› Client side: condor_chirp get_job_attr Port # (repeat as needed); run test
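The rendezvous in the workflow above can be sketched as two small shell functions. This is a minimal sketch, not the IDPL code: the helper names `publish_port` and `wait_for_port` and the `CHIRP` override variable are made up for illustration, and the attribute name `iperfServer` is borrowed from the job log later in this deck. It also assumes that an attribute which has not been set yet shows up as empty output, the text `UNDEFINED`, or a non-zero exit status.

```shell
#!/bin/sh
# Sketch only: publish_port, wait_for_port and CHIRP are illustrative
# names, not part of HTCondor. CHIRP can be overridden for dry runs.
CHIRP=${CHIRP:-condor_chirp}

# Server side: publish the ephemeral port the server is listening on.
publish_port() {
    "$CHIRP" set_job_attr iperfServer "$1"
}

# Client side: poll the job ad until the server has published its port.
# $1 = maximum number of attempts, $2 = seconds to sleep between attempts.
wait_for_port() {
    tries=$1
    while [ "$tries" -gt 0 ]; do
        port=$("$CHIRP" get_job_attr iperfServer 2>/dev/null) || port=""
        case $port in
            ""|UNDEFINED) ;;              # assumed "not set yet" cases
            *) echo "$port"; return 0 ;;
        esac
        tries=$((tries - 1))
        sleep "$2"
    done
    return 1                              # gave up waiting
}
```

In a startup script, the server branch would call `publish_port` once it is listening, and the client branch would run something like `port=$(wait_for_port 30 2)` before starting its test.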

That works, once
› Need to do this periodically
› Condor cron: two features
    on_exit_remove = false
    cron_minute = 23
    cron_hour = 0-23/2
    cron_month = *
    cron_day_of_week = *
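To see which hours the `cron_hour = 0-23/2` setting above selects, the range-with-step form can be expanded by hand. The helper below is a hypothetical illustration, not an HTCondor tool, and it handles only the `first-last/step` form.

```shell
#!/bin/sh
# Sketch only: expand_cron_field is a hypothetical helper.
# It expands one "first-last/step" field such as 0-23/2 into explicit values.
expand_cron_field() {
    range=${1%/*}          # part before the slash, e.g. 0-23
    step=${1#*/}           # part after the slash, e.g. 2
    first=${range%-*}
    last=${range#*-}
    out=""
    i=$first
    while [ "$i" -le "$last" ]; do
        out="$out${out:+ }$i"
        i=$((i + step))
    done
    echo "$out"
}
```

Calling `expand_cron_field 0-23/2` lists every second hour starting at 0, so with `cron_minute = 23` the job becomes eligible at 00:23, 02:23, and so on.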

Problems with condor cron
› on_exit_remove = false means one job for all runs
    › Stale set_job_attr values
    › Held jobs are a headache
› Really want one condor job per run

Enter… "CronMan"

The Idea
› Have DAGMan itself be the restarter
    › Creates a new job every time
› With a one-node DAG
› Whose one node has a delayed start time
    › Gives the placement job the job-nature

The DAG file
    JOB A placement4-submit
    SCRIPT POST A /bin/false
    RETRY
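Why this DAG file keeps resubmitting can be shown with a toy model. In DAGMan, a node that has a POST script succeeds or fails according to the POST script's exit status, and a failed node is retried up to its retry limit; with `/bin/false` as the POST script the node never succeeds, so every retry submits a fresh job. The function below only imitates that loop in plain shell: `run_node` and its arguments are hypothetical, and a shell command stands in for the submitted job.

```shell
#!/bin/sh
# Sketch only: a toy model of the one-node DAG, not how DAGMan is implemented.
# $1 = retry limit, remaining arguments = command standing in for the job.
run_node() {
    retries=$1; shift
    attempt=0
    while [ "$attempt" -le "$retries" ]; do
        "$@"                   # one run of the placement job
        if /bin/false; then    # the POST script decides success; it never does
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1                   # retry limit reached, the node finally fails
}
```

With a retry limit of 3 the stand-in job runs four times (the first attempt plus three retries) before the node is reported as failed.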

Modified submit file
    Universe = parallel
    Executable = startup_script.sh
    +ParallelShutdownPolicy = "WAIT_FOR_ALL"
    cron_minute = 30
    cron_window = 400
    Requirements = (machine == "client-machine-name")
    Queue
    Requirements = (machine == "server-machine-name")
    Queue

condor_q output
    $ condor_q
    ID  OWNER   SUBMITTED  RUN_TIME  ST  PRI  SIZE  CMD
        gthain  1/20 23:   :26:06    R              condor_dagman
        gthain  5/11 10:   :00:00    I              wrapper_script
        gthain  5/11 10:   :00:00    I              wrapper_script

Two of three problems solved
› What about reporting results?
› condor_chirp ulog "server start"
› condor_chirp ulog "client results x y z"

Job log output
    000 ( ) 05/11 10:22:48 Job submitted from host: DAG Node: A
    014 ( ) 05/11 11:17:00 Node 0 executing on host: client
    008 ( ) 05/11 11:17:00 'client:start'
    014 ( ) 05/11 11:17:04 Node 1 executing on host:
    001 ( ) 05/11 11:17:04 Job executing on host: MPI_job
    008 ( ) 05/11 11:17:06 'server:start'
    033 ( ) 05/11 11:17:08 Setting job attribute iperfServer to '20650'
    008 ( ) 05/11 11:17:31 'results, , ,1, , '
    008 ( ) 05/11 11:17:31 'komatsu.chtc.wisc.edu(iperf) client:end'
    008 ( ) 05/11 11:17:31 'komatsu.chtc.wisc.edu(irods) client:start'
    008 ( ) 05/11 11:17:31 'davos.cyverse.org(iperf) server:end'
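The messages written with condor_chirp ulog appear in the log above as 008 events, so pulling results back out is a matter of filtering on that event number. The filter below is a rough sketch: `ulog_messages` is a hypothetical helper, and it assumes the one-line-per-event layout shown on this slide, with the message following the date and time on the same line.

```shell
#!/bin/sh
# Sketch only: ulog_messages is a hypothetical helper.
# Reads a job event log on stdin and prints the text of each 008 event.
ulog_messages() {
    awk '$1 == "008" {
        # drop the event number, job id, date and time fields
        sub(/^008 +\([^)]*\) +[^ ]+ +[^ ]+ +/, "")
        print
    }'
}
```

It would be used as `ulog_messages < job.log`, where the log file name is whatever the submit file configured.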

Then forward to Graphite

Application of IDPL: The Phytomorph problem
› PI in the Spalding lab researching corn seedlings
› Runs analysis jobs at the Wisconsin HTCondor pool
› Pulls data from CyVerse (née iPlant) in Arizona
› Claims data transfer rates are slow

After talking with iRODS devs
› All performance testing had been LAN
› Our networking folks ID'd problems
› Made one big fix
    › Don't set TCP_SNDBUF explicitly!
› Gave us new clients & servers

Future Work
› Add more protocols
› More sites
› Work with clouds

Thank you!
› Try the CronMan pattern yourself!
    › Even with vanilla universe jobs
› Think about Unix process patterns
› Example on slides simplified!
› Real code on GitHub at