Job Delegation and Planning in Condor-G
Todd Tannenbaum
Computer Sciences Department, University of Wisconsin-Madison
ISGC 2005, Taipei, Taiwan
3 The Condor Project (Established '85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who: face software engineering challenges in a distributed UNIX/Linux/NT environment; are involved in national and international grid collaborations; actively interact with academic and commercial users; maintain and support large distributed production environments; and educate and train students. Funding: US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, …
4 A Multifaceted Project › Harnessing the power of clusters – dedicated and/or opportunistic (Condor) › Job management services for Grid applications (Condor-G, Stork) › Fabric management services for Grid resources (Condor, GlideIns, NeST) › Distributed I/O technology (Parrot, Kangaroo, NeST) › Job-flow management (DAGMan, Condor, Hawk) › Distributed monitoring and management (HawkEye) › Technology for Distributed Systems (ClassAd, MW) › Packaging and Integration (NMI, VDT)
5 Some software produced by the Condor Project › Condor System › ClassAd Library › DAGMan › Fault Tolerant Shell (FTSH) › Hawkeye › GCB › MW › NeST › Stork › Parrot › VDT › And others… all as open source
6 Who uses Condor? › Commercial: Oracle, Micron, Hartford Life Insurance, CORE, Xerox, ExxonMobil, Shell, Alterra, Texas Instruments, … › Research Community: Universities, Govt Labs; Bundles: NMI, VDT; Grid Communities: EGEE/LCG/gLite, Particle Physics Data Grid (PPDG), USCMS, LIGO, iVDGL, NSF Middleware Initiative GRIDS Center, …
7 Condor Pool (diagram: submit-side Schedds holding Jobs, a central MatchMaker, and execute-side Startds)
9 Condor-G (diagram: a Schedd holding Jobs delegates through Globus 2, Globus 4, Unicore, or NorduGrid middleware to remote batch systems such as LSF and PBS, to a remote Startd, or to another Schedd via Condor-G / Condor-C)
10 (diagram: layered view; the User/Application/Portal sits on Condor-G, which sits on the Grid middleware (Globus 2, Globus 4, Unicore, …) and a Condor Pool, all on top of the fabric of processing, storage, and communication resources)
11 Job Delegation
› Transfer of responsibility to schedule and execute a job:
Stage in executable and data files
Transfer policy "instructions"
Securely transfer (and refresh?) credentials, obtain local identities
Monitor and present job progress (transparency!)
Return results
› Multiple delegations can be combined in interesting ways
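In Condor-G these pieces map directly onto the submit description. A minimal sketch, assuming a gt2 gatekeeper like the one used later in this talk; the input file name and proxy path are illustrative, while transfer_input_files, x509userproxy, and log are standard submit commands:

universe             = grid
grid_type            = gt2
globusscheduler      = cluster1.cs.wisc.edu/jobmanager-lsf
# executable and input data are staged in to the remote site
executable           = find_particle
transfer_input_files = particle_data.in
# credential to delegate (and refresh); the path is illustrative
x509userproxy        = /tmp/x509up_u1234
# results are returned on completion; the local event log provides transparency
output               = find_particle.out
error                = find_particle.err
log                  = find_particle.log
queue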
12 Simple Job Delegation in Condor-G (diagram: Condor-G delegates to Globus GRAM on the cluster front-end, which hands the job to the local Batch System and on to an Execute Machine)
13 Expanding the Model › What can we do with new forms of job delegation? › Some ideas: Mirroring, Load-balancing, Glide-in schedd/startd, Multi-hop grid scheduling
14 Mirroring › What it does Jobs mirrored on two Condor-Gs If primary Condor-G crashes, secondary one starts running jobs On recovery, primary Condor-G gets job status from secondary one › Removes Condor-G submit point as single point of failure
15 Mirroring Example (diagram: Jobs submitted to Condor-G 1 are mirrored on Condor-G 2; either can drive the Execute Machine)
17 Load-Balancing › What it does Front-end Condor-G distributes all jobs among several back-end Condor-Gs Front-end Condor-G keeps updated job status › Improves scalability › Maintains single submit point for users
18 Load-Balancing Example (diagram: a Condor-G Front-end distributes jobs among Condor-G Back-ends 1, 2, and 3)
19 Glide-In
› Schedd and Startd are separate services that do not require any special privileges
Thus we can submit them as jobs!
› Glide-In Schedd
What it does: drop a Condor-G onto the front-end machine of a remote cluster, then delegate jobs to the cluster through the glide-in schedd
Can apply cluster-specific policies to jobs
Not fork-and-forget… send a manager to the site, instead of managing across the internet
20 Glide-In Schedd Example (diagram: Condor-G delegates Jobs through the grid Middleware to a Glide-In Schedd on the cluster Frontend, which feeds the local Batch System)
21 Glide-In Startd Example (diagram: Condor-G (Schedd) submits a Startd through the Middleware and Frontend into the Batch System; the Job then runs under that Startd)
22 Glide-In Startd › Why? Restores all the benefits that may have been washed away by the middleware End-to-end management solution Preserves job semantic guarantees Preserves policy Enables lazy planning
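Because the startd needs no special privileges, the glide-in itself can be submitted as an ordinary grid job. A minimal sketch under that assumption; glidein_startup.sh and glidein_condor.tar.gz are hypothetical stand-ins for whatever wrapper unpacks the Condor daemons and starts condor_master/condor_startd, and condor.cs.wisc.edu is an illustrative home-pool central manager:

universe             = grid
grid_type            = gt2
globusscheduler      = cluster1.cs.wisc.edu/jobmanager-lsf
# hypothetical wrapper: unpack the tarball, start condor_master/condor_startd
# configured to report back to the home pool named in the arguments
executable           = glidein_startup.sh
arguments            = condor.cs.wisc.edu
transfer_input_files = glidein_condor.tar.gz
output               = glidein.out
error                = glidein.err
log                  = glidein.log
queue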
23 Sample Job Submit file
universe = grid
grid_type = gt2
globusscheduler = cluster1.cs.wisc.edu/jobmanager-lsf
executable = find_particle
arguments = ….
output = ….
log = …
But we want metascheduling…
24 Represent grid clusters as ClassAds
› ClassAds are a set of uniquely named expressions; each expression, called an attribute, is an attribute name/value pair
Combine query and data
Extensible
Semi-structured: no fixed schema (flexibility in an environment consisting of distributed administrative domains)
› Designed for "MatchMaking"
25 Example of a ClassAd that could represent a compute cluster in a grid:
Type = "GridSite";
Name = "FermiComputeCluster";
Arch = "Intel-Linux";
Gatekeeper_url = "globus.fnal.gov/lsf";
Load = [ QueuedJobs = 42; RunningJobs = 200; ];
Requirements = ( other.Type == "Job" && Load.QueuedJobs < 100 );
GoodPeople = { "howard", "harry" };
Rank = member(other.Owner, GoodPeople) * 500
26 Another Sample - Job Submit
universe = grid
grid_type = gt2
owner = howard
executable = find_particle.$$(Arch)
requirements = other.Arch == "Intel-Linux" || other.Arch == "Sparc-Solaris"
rank = 0 - other.Load.QueuedJobs
globusscheduler = $$(gatekeeper_url)
…
Note: We introduced augmentation of the job ClassAd based upon information discovered in its matching resource ClassAd.
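For concreteness: if this job matched the FermiComputeCluster ad on slide 25, the $$() references would be rewritten from the matched resource ad, so the delegated job would effectively carry (a sketch of the rewritten lines):

executable = find_particle.Intel-Linux
globusscheduler = globus.fnal.gov/lsf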
27 Multi-Hop Grid Scheduling › Match a job to a Virtual Organization (VO), then to a resource within that VO › Easier to schedule jobs across multiple VOs and grids
28 Multi-Hop Grid Scheduling Example (diagram: an experiment's Condor-G and Resource Broker, e.g. for an HEP experiment such as CMS, delegate to a VO-level Condor-G and Resource Broker, which delegate in turn through Globus GRAM to a site Batch Scheduler)
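A hedged sketch of what the first hop's target might look like as a ClassAd, in the spirit of the GridSite ad on slide 25; the attribute names used here (Broker_url, VOMemberships) are illustrative assumptions, not an established schema:

[
  Type         = "VirtualOrganization";
  Name         = "CMS";
  Broker_url   = "vo-broker.example.org";
  Requirements = ( other.Type == "Job" && member("CMS", other.VOMemberships) )
]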
29 Endless Possibilities › These new models can be combined with each other or with other new models › Resulting system can be arbitrarily sophisticated
30 Job Delegation Challenges › New complexity introduces new issues and exacerbates existing ones › A few: Transparency, Representation, Scheduling Control, Active Job Control, Revocation, Error Handling and Debugging
31 Transparency › Full information about job should be available to user Information from full delegation path No manual tracing across multiple machines › Users need to know what’s happening with their jobs
32 Representation › Job state is a vector › How best to show this to the user? Summary: current delegation endpoint and job state at that endpoint, with full information available if desired. A series of nested ClassAds?
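One possible rendering of that vector is a chain of nested ClassAds, one per delegation hop. A minimal sketch; the attribute names (Endpoint, JobState, Delegation) and host names are invented for illustration:

[
  Endpoint   = "condorg.cs.wisc.edu";
  JobState   = "Delegated";
  Delegation =
    [
      Endpoint   = "frontend.fnal.gov";
      JobState   = "Delegated";
      Delegation =
        [
          Endpoint = "lsf-node-17";
          JobState = "Running"
        ]
    ]
]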
33 Scheduling Control › Avoid loops in delegation path › Give user control of scheduling Allow limiting of delegation path length? Allow user to specify part or all of delegation path
34 Active Job Control › User may request certain actions hold, suspend, vacate, checkpoint › Actions cannot be completed synchronously for user Must forward along delegation path User checks completion later
35 Active Job Control (cont) › Endpoint systems may not support actions If possible, execute them at furthest point that does support them › Allow user to apply action in middle of delegation path
36 Revocation › Leases Lease must be renewed periodically for delegation to remain valid Allows revocation during long-term failures › What are good values for lease lifetime and update interval?
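Condor's submit language already has a lease knob in this spirit, covering the claim between the submitting schedd and the startd; a minimal sketch (the 20-minute value is purely illustrative, and choosing it well is exactly the open question above):

universe           = vanilla
executable         = find_particle
log                = find_particle.log
# if the submit side cannot renew the lease within this many seconds
# (e.g. during a long outage), the execute side may abandon the job
job_lease_duration = 1200
queue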
37 Error Handling and Debugging › Many more places for things to go horribly wrong › Need clear, simple error semantics › Logs, logs, logs Have them everywhere
38 From earlier
› Transfer of responsibility to schedule and execute a job:
Transfer policy "instructions"
Stage in executable and data files
Securely transfer (and refresh?) credentials, obtain local identities
Monitor and present job progress (transparency!)
Return results
39 Job Failure Policy Expressions
› Condor/Condor-G augmented so users can supply job failure policy expressions in the submit file.
› Can be used to describe a successful run, or what to do in the face of failure.
on_exit_remove = <expression>
on_exit_hold = <expression>
periodic_remove = <expression>
periodic_hold = <expression>
40 Job Failure Policy Examples
› Do not remove from queue (i.e. reschedule) if the job exits with a signal:
on_exit_remove = ExitBySignal == False
› Place on hold if the job exits with nonzero status or ran for less than an hour:
on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
› Place on hold if the job has spent more than 50% of its time suspended:
periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
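Put together in a submit file, these expressions might look like the following sketch (the executable and gatekeeper follow the earlier sample submit file; the specific thresholds are illustrative):

universe        = grid
grid_type       = gt2
globusscheduler = cluster1.cs.wisc.edu/jobmanager-lsf
executable      = find_particle
log             = find_particle.log
# reschedule rather than remove if the job was killed by a signal
on_exit_remove  = ExitBySignal == False
# hold if the job failed, or finished suspiciously fast (under an hour)
on_exit_hold    = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
# hold if the job has spent more than half its time suspended
periodic_hold   = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
queue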
41 Data Placement* (DaP) must be an integral part of the end-to-end solution
*space management and data transfer
42 Stork › A scheduler for data placement activities in the Grid › What Condor is for computational jobs, Stork is for data placement › Stork comes with a new concept: “Make data placement a first class citizen in the Grid.”
43 (diagram: an end-to-end run split into Computational Jobs (stage-in, execute the job, stage-out) and Data Placement Jobs (allocate space for input & output data, stage-in, release input space, stage-out, release output space))
44 DAGMan: DAG with DaP (diagram: a single DAG specification drives both the Condor Job Queue for computational nodes and the Stork Job Queue for data placement nodes; DAG specification excerpt: DaP A A.submit; DaP B B.submit; Job C C.submit; …; Parent A child B; Parent B child C; Parent C child D, E; …)
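Spelled out as a DAG file in the style of the specification above (node and submit-file names are the slide's own; the child lists are written in standard DAGMan form, with children separated by spaces):

# data placement nodes, executed by Stork
DaP A A.submit
DaP B B.submit
# computational node, executed by Condor
Job C C.submit
Parent A Child B
Parent B Child C
# … remaining nodes (D, E, F) and edges elided as on the slide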
45 Why Stork? › Stork understands the characteristics and semantics of data placement jobs. › Can make smart scheduling decisions, for reliable and efficient data placement.
46 Failure Recovery and Efficient Resource Utilization › Fault tolerance: just submit a bunch of data placement jobs, and then go away… › Control the number of concurrent transfers from/to any storage system (prevents overloading) › Space allocation and de-allocation: make sure space is available
47 Support for Heterogeneity Protocol translation using Stork memory buffer.
48 Support for Heterogeneity Protocol translation using Stork Disk Cache.
49 Flexible Job Representation and Multilevel Policy Support
[
  Type = "Transfer";
  Src_Url = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
  Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat";
  ……
  Max_Retry = 10;
  Restart_in = "2 hours";
]
50 Run-time Adaptation › Dynamic protocol selection
[
  dap_type = "transfer";
  src_url = "drouter://slic04.sdsc.edu/tmp/test.dat";
  dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat";
  alt_protocols = "nest-nest, gsiftp-gsiftp";
]
[
  dap_type = "transfer";
  src_url = "any://slic04.sdsc.edu/tmp/test.dat";
  dest_url = "any://quest2.ncsa.uiuc.edu/tmp/test.dat";
]
51 Run-time Adaptation › Run-time Protocol Auto-tuning
[
  link = "slic04.sdsc.edu - quest2.ncsa.uiuc.edu";
  protocol = "gsiftp";
  bs = 1024KB;      // block size
  tcp_bs = 1024KB;  // TCP buffer size
  p = 4;            // number of parallel streams
]
52 (diagram: the big picture; a Planner and DAGMan sit above Condor-G and Stork, which talk to services such as GRAM, RFT, SRM, SRB, NeST, and GridFTP; StartD and Parrot serve the Application at the execution site)
53 Thank You! › Questions?