Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison Condor and Grid Challenges HEPiX Spring 2005 Karlsruhe, Germany
2 Outline › Introduction to the Condor Project › Introduction to Condor / Condor-G › The Challenge: Take Grid Planning to the next level How to address the mismatch between local and grid scheduler capabilities › Lessons learned
4 The Condor Project (Established ‘85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff and students who: face software engineering challenges in a distributed UNIX/Linux/NT environment, are involved in national and international grid collaborations, actively interact with academic and commercial users, maintain and support large distributed production environments, and educate and train students. Funding – US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, Intel, Microsoft, UW-Madison, …
5 A Multifaceted Project › Harnessing the power of clusters – dedicated and/or opportunistic (Condor) › Job management services for Grid applications (Condor-G, Stork) › Fabric management services for Grid resources (Condor, GlideIns, NeST) › Distributed I/O technology (Parrot, Kangaroo, NeST) › Job-flow management (DAGMan, Condor, Hawk) › Distributed monitoring and management (HawkEye) › Technology for Distributed Systems (ClassAD, MW) › Packaging and Integration (NMI, VDT)
6 Some free software produced by the Condor Project › Condor System › ClassAd Library › DAGMan › Fault Tolerant Shell (FTSH) › Hawkeye › GCB › MW › NeST › Stork › Parrot › VDT › And others… all as open source Data!
7 Full featured system › Flexible scheduling policy engine via ClassAds Preemption, suspension, requirements, preferences, groups, quotas, settable fair-share, system hold… › Facilities to manage BOTH dedicated CPUs (clusters) and non-dedicated resources (desktops) › Transparent Checkpoint/Migration for many types of serial jobs › No shared file-system required › Workflow management (inter-dependencies) › Support for many job types – serial, parallel, etc. › Fault-tolerant: can survive crashes, network outages, no single point of failure. › API: via SOAP / web services, DRMAA (C), Perl package, GAHP, flexible command-line tools › Platforms: Linux i386/IA64, Windows 2k/XP, MacOS, Solaris, IRIX, HP-UX, Compaq Tru64, … lots.
8 Who uses Condor? › Commercial Oracle, Micron, Hartford Life Insurance, CORE, Xerox, Exxon/Mobil, Shell, Alterra, Texas Instruments, … › Research Community Universities, Gov Labs Bundles: NMI, VDT Grid Communities: EGEE/LCG/gLite, Particle Physics Data Grid (PPDG), Fermi, USCMS, LIGO, iVDGL, NSF Middleware Initiative GRIDS Center, …
9 Condor in a nutshell › Condor is a set of distributed services A scheduler service (schedd) job policy A resource manager service (startd) resource policy A matchmaker service administrator policy › These services use ClassAds as the lingua franca
10 ClassAds › Are a set of uniquely named expressions; each expression is called an attribute and is a name/value pair › Combine query and data › Extensible › Semi-structured: no fixed schema (flexibility in an environment consisting of distributed administrative domains) › Designed for “MatchMaking”
11 Arch = “Intel-Linux”; KFlops = 2200 Memory = 1024 Disk = Requirements = other.ImageSize < Memory preferred_projects = { “atlas”, “uscms” } Rank = member(other.project, preferred_projects) * 500 … (70+ out of the box, can be customized) Sample – Machine ClassAd
12 Another Sample - Job Submit project = “uscms” imagesize = 260 requirements = (other.Arch == “Intel-Linux” || other.Arch == “Sparc-Solaris”) && Disk > 500 && ( fooapp_path =!= UNDEFINED ) rank = other.KFlops * other.Memory executable = find_particle.$$(Arch) arguments = $$(fooapp_path) … Note: We introduced augmentation of the job ClassAd based upon information discovered in its matching resource ClassAd.
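The two sample ads above can be read as a symmetric match: each side's requirements are evaluated against the other ad, and rank orders the acceptable candidates. The following is a minimal Python sketch of that idea, not the ClassAd library itself. Ads are modelled as dicts and expressions as callables, which is an assumption made for illustration. The `Disk` value and the `fooapp_path` value are hypothetical, since the slides do not give them, and the helper names `symmetric_match` and `best_machine` are invented here.

```python
from typing import Any, Callable, Dict, List, Optional

Ad = Dict[str, Any]

# Machine ad, following the machine sample slide.
machine: Ad = {
    "Arch": "Intel-Linux",
    "KFlops": 2200,
    "Memory": 1024,
    "Disk": 2000,                  # hypothetical value, elided on the slide
    "fooapp_path": "/opt/fooapp",  # hypothetical, advertised by the site
    "preferred_projects": {"atlas", "uscms"},
    # Requirements = other.ImageSize < Memory
    "requirements": lambda me, other: other["imagesize"] < me["Memory"],
    # Rank = member(other.project, preferred_projects) * 500
    "rank": lambda me, other: (other.get("project") in me["preferred_projects"]) * 500,
}

# Job ad, following the job submit sample slide.
job: Ad = {
    "project": "uscms",
    "imagesize": 260,
    "requirements": lambda me, other: (
        other["Arch"] in ("Intel-Linux", "Sparc-Solaris")
        and other["Disk"] > 500
        and other.get("fooapp_path") is not None  # fooapp_path =!= UNDEFINED
    ),
    # rank = other.KFlops * other.Memory
    "rank": lambda me, other: other["KFlops"] * other["Memory"],
}


def symmetric_match(job_ad: Ad, machine_ad: Ad) -> bool:
    """A match needs BOTH requirements expressions to hold."""
    return bool(job_ad["requirements"](job_ad, machine_ad)) and bool(
        machine_ad["requirements"](machine_ad, job_ad)
    )


def best_machine(job_ad: Ad, machines: List[Ad]) -> Optional[Ad]:
    """Among matching machines, pick the one the job ranks highest."""
    candidates = [m for m in machines if symmetric_match(job_ad, m)]
    return max(candidates, key=lambda m: job_ad["rank"](job_ad, m), default=None)
```

Calling `best_machine(job, [machine])` returns the machine ad, because both requirements hold. The `$$(Arch)` and `$$(fooapp_path)` substitutions from the job slide would then be filled in from that matched ad.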
13 First make some matches Schedd Startd Schedd MatchMaker Jobs NOTE: Have as many schedds as you want!
15 Then claim and schedule Schedd Startd Schedd MatchMaker Jobs Note: Each schedd decides for itself which job to run.
16 How long does this claim exist? › Until it is explicitly preempted Or on job boundaries if you insist › Who can preempt? The startd due to resource owner policy The matchmaker due to organization policy Wants to give the node to a higher priority user The schedd due to job scheduling policy Stop runaway job, etc. › Responsibility is placed at the appropriate level
17 Condor-G Globus 2 Globus 4 Unicore (Nordugrid) Startd Schedd Jobs LSF PBS Schedd - Condor-G - Condor-C -Thanks INFN!
18 User/Application/Portal Fabric ( processing, storage, communication ) Grid Condor Pool Middleware (Globus 2, Globus 4, Unicore, …) Condor-G
19 Two ways to do the “grid thing” (share resources) › “I’ll loan you a machine, you do whatever you want to do with it”. This is what Condor does via matchmaking › “Give me your job, and I’ll run it for you”. Traditional grid broker
20 How Flocking Works › Add a line to your condor_config : FLOCK_HOSTS = Site-A, Site-B Schedd Collector Negotiator Matchmaker Collector Negotiator Site-A Matchmaker Collector Negotiator Site-B Matchmaker Submit Machine
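As a rough illustration of the flocking idea, the sketch below places idle jobs in the local pool first and then walks the flock list in order. It is a simplified model written in Python, not Condor's negotiation protocol. The function name `place_jobs` and the free-slot counts are hypothetical.

```python
from typing import Dict, List, Tuple


def place_jobs(
    idle_jobs: int,
    pools_in_order: List[str],
    free_slots: Dict[str, int],
) -> Tuple[Dict[str, int], int]:
    """Fill the local pool first, then each flock host in list order.

    Returns (jobs placed per pool, jobs still idle).
    """
    placement: Dict[str, int] = {}
    remaining = idle_jobs
    for pool in pools_in_order:
        if remaining == 0:
            break
        take = min(remaining, free_slots.get(pool, 0))
        if take:
            placement[pool] = take
            remaining -= take
    return placement, remaining


# The local pool comes first, followed by the FLOCK_HOSTS order from the slide.
pools = ["local", "Site-A", "Site-B"]
```

With 10 idle jobs and 4, 3 and 8 free slots in the local pool, Site-A and Site-B, the jobs are placed 4, 3 and 3 and none stay idle.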
21 Example of “Give me a job” Broker (Condor-G) Grid Middleware Batch System Front-end Execute Machine
22 Any problems? › Sure. Lack of scheduling information Lack of job monitoring details Loss of job semantic guarantees › What can we do? Glide-In Schedd and Startd are separate services that do not require any special privileges Thus we can submit them as jobs! Grow a Condor pool out of Grid resources
23 Glide-In Startd Example Condor-G (Schedd) Batch System Frontend Middleware Startd Job
24 Glide-In Startd › Why? Restores the benefits that may have been washed away by the middleware: Scheduling information Job monitoring details End-to-end management solution Preserves job semantic guarantees Preserves policy Enables lazy planning
25 Glide-In Startd (aka ‘sneaky job’) › Why not? Time wasted on scheduling overhead for black-hole jobs Alternative? Wait-while-idle. Which is more evil? Glide-ins can be removed Glide-in “broadcast” is not the only model Makes accurate time prediction difficult (impossible?) But can still give worst case › Would not be needed if schedulers could delegate machines instead of just running jobs…
26 What’s brewing for after v6.8.0? › More data, data, data Stork distributed w/ v6.8.0, incl DAGMan support NeST manage Condor spool files, ckpt servers Stork used for Condor job data transfers › Virtual Machines (and the future of Standard Universe) › Condor and Shibboleth (with Georgetown Univ) › Least Privilege Security Access (with U of Cambridge) › Dynamic Temporary Accounts (with EGEE, Argonne) › Leverage Database Technology (with UW DB group) › ‘Automatic’ Glideins (NMI Nanohub – Purdue, U of Florida) › Easier Updates › New ClassAds (integration with Optena) › Hierarchical Matchmaking Can I commit this to CVS??
27 A Tree of Matchmakers › MMs now manage other MMs › Fault Tolerance › Flexibility [Diagram labels: BIG 10 MM, UW MM, CS MM, Theory Group MM, Erdos MM, CC, resources (R); request “I need more resources”; A Match]
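One way to picture a tree of matchmakers is a request that is served locally when possible and escalated to the parent otherwise. The Python sketch below assumes that reading; it is not the design the Condor team implemented. The `Matchmaker` class, its `free` counter and the escalation order are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Grant = Tuple[str, int]


class Matchmaker:
    """A matchmaker that owns some free resources and may manage child MMs."""

    def __init__(self, name: str, free: int = 0) -> None:
        self.name = name
        self.free = free
        self.parent: Optional["Matchmaker"] = None
        self.children: List["Matchmaker"] = []

    def add_child(self, child: "Matchmaker") -> "Matchmaker":
        child.parent = self
        self.children.append(child)
        return child

    def _take_subtree(self, needed: int, grants: List[Grant],
                      skip: Optional["Matchmaker"] = None) -> int:
        """Take from this MM, then from its children (except `skip`)."""
        take = min(needed, self.free)
        if take:
            self.free -= take
            grants.append((self.name, take))
            needed -= take
        for child in self.children:
            if needed == 0:
                break
            if child is not skip:
                needed = child._take_subtree(needed, grants)
        return needed

    def request(self, needed: int) -> Tuple[List[Grant], int]:
        """'I need more resources': search locally, then escalate upward.

        Returns (grants as (matchmaker name, count), unmet demand).
        """
        grants: List[Grant] = []
        node: Optional[Matchmaker] = self
        came_from: Optional[Matchmaker] = None
        while node is not None and needed > 0:
            needed = node._take_subtree(needed, grants, skip=came_from)
            node, came_from = node.parent, node
        return grants, needed
```

Under these assumptions a request from the Theory Group matchmaker is served from its own resources first, then from CS, then from sibling groups under UW, and only then from the level above.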
28 Data Placement* (DaP) must be an integral part of the end-to-end solution *Space management and data transfer
29 Stork › A scheduler for data placement activities in the Grid › What Condor is for computational jobs, Stork is for data placement › Stork comes with a new concept: “Make data placement a first class citizen in the Grid.”
30 Before: Stage-in, Execute the job, Stage-out. With data placement as separate jobs: Allocate space for input & output data, Stage-in, Execute the job, Stage-out, Release input space, Release output space [Legend: Data Placement Jobs, Computational Jobs]
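31 DAG with DaP › DAG specification: DaP A A.submit / DaP B B.submit / Job C C.submit / ….. / Parent A child B / Parent B child C / Parent C child D, E / ….. › DAGMan feeds the Condor Job Queue and the Stork Job Queue [Diagram: DAG with nodes A, B, C, D, E, F; C shown in the Condor Job Queue, E in the Stork Job Queue]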
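To make the DAG slide concrete, the sketch below parses a specification in the style shown there and produces a dependency-respecting dispatch order, sending DaP nodes to a Stork queue and Job nodes to a Condor queue. It is a Python illustration using a standard topological sort, not DAGMan's implementation. The function names are hypothetical. The slide elides the declarations for D and E, so the example spec declares them as DaP nodes as an assumption.

```python
from collections import defaultdict, deque
from typing import Dict, List, Tuple

# D and E are declared here as DaP nodes; that is an assumption, since the
# slide elides their declarations.
SPEC = """
DaP A A.submit
DaP B B.submit
Job C C.submit
DaP D D.submit
DaP E E.submit
Parent A child B
Parent B child C
Parent C child D, E
"""


def parse_dag(text: str) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Return (node kind per node, children per parent)."""
    kinds: Dict[str, str] = {}
    children: Dict[str, List[str]] = defaultdict(list)
    for line in text.strip().splitlines():
        tokens = line.replace(",", " ").split()
        if not tokens:
            continue
        keyword = tokens[0].lower()
        if keyword in ("dap", "job"):
            kinds[tokens[1]] = keyword
        elif keyword == "parent":
            split = [t.lower() for t in tokens].index("child")
            for parent in tokens[1:split]:
                children[parent].extend(tokens[split + 1:])
    return kinds, children


def dispatch_order(kinds: Dict[str, str],
                   children: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Kahn's algorithm: a node is dispatched once all its parents are done."""
    indegree = {node: 0 for node in kinds}
    for kids in children.values():
        for kid in kids:
            indegree[kid] += 1
    ready = deque(sorted(n for n, d in indegree.items() if d == 0))
    order: List[Tuple[str, str]] = []
    while ready:
        node = ready.popleft()
        queue = "stork" if kinds[node] == "dap" else "condor"
        order.append((node, queue))
        for kid in children.get(node, []):
            indegree[kid] -= 1
            if indegree[kid] == 0:
                ready.append(kid)
    if len(order) != len(kinds):
        raise ValueError("specification contains a cycle")
    return order
```

In this sketch the data placement nodes and the computational node share one dependency graph, which is the point of the slide: stage-in must finish before the job runs, and stage-out waits for the job.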
32 Why Stork? › Stork understands the characteristics and semantics of data placement jobs. › Can make smart scheduling decisions, for reliable and efficient data placement.
33 Failure Recovery and Efficient Resource Utilization › Fault tolerance Just submit a bunch of data placement jobs, and then go away.. › Control number of concurrent transfers from/to any storage system Prevents overloading › Space allocation and De-allocations Make sure space is available
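The two mechanisms on this slide, retrying failed transfers and capping concurrent transfers per storage system, can be sketched with a semaphore per host. This is an illustrative Python model, not Stork's code. The `TransferThrottle` class and `run_all` helper are hypothetical names, and the retry limit is treated as a maximum number of attempts, which is an assumption.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class TransferThrottle:
    """Caps concurrent transfers per storage host and retries failures."""

    def __init__(self, max_concurrent: int, max_retry: int = 10) -> None:
        self.max_concurrent = max_concurrent
        self.max_retry = max_retry  # assumed to mean maximum attempts, >= 1
        self._lock = threading.Lock()
        self._slots: Dict[str, threading.Semaphore] = {}
        self._active: Dict[str, int] = {}
        self.peak: Dict[str, int] = {}  # highest concurrency seen per host

    def _slot(self, host: str) -> threading.Semaphore:
        with self._lock:
            if host not in self._slots:
                self._slots[host] = threading.Semaphore(self.max_concurrent)
            return self._slots[host]

    def _track(self, host: str, delta: int) -> None:
        with self._lock:
            self._active[host] = self._active.get(host, 0) + delta
            self.peak[host] = max(self.peak.get(host, 0), self._active[host])

    def run(self, host: str, transfer: Callable[[], T]) -> T:
        """Run one transfer against `host`, waiting for a free slot first."""
        with self._slot(host):
            self._track(host, +1)
            try:
                for attempt in range(1, self.max_retry + 1):
                    try:
                        return transfer()
                    except OSError:
                        if attempt == self.max_retry:
                            raise  # give up after the last allowed attempt
            finally:
                self._track(host, -1)
        raise ValueError("max_retry must be at least 1")


def run_all(throttle: TransferThrottle, host: str,
            transfers: List[Callable[[], T]], workers: int = 8) -> List[T]:
    """Submit a bunch of transfers and collect their results in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda t: throttle.run(host, t), transfers))
```

Even with eight worker threads, a throttle built with `max_concurrent=2` never lets more than two transfers touch the same host at once, which is the overload protection the slide describes.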
34 Support for Heterogeneity Protocol translation using Stork memory buffer.
35 Support for Heterogeneity Protocol translation using Stork Disk Cache.
36 Flexible Job Representation and Multilevel Policy Support [ Type = “Transfer”; Src_Url = “srb://ghidorac.sdsc.edu/kosart.condor/x.dat”; Dest_Url = “nest://turkey.cs.wisc.edu/kosart/x.dat”; …… Max_Retry = 10; Restart_in = “2 hours”; ]
37 Run-time Adaptation › Dynamic protocol selection [ dap_type = “transfer”; src_url = “drouter://slic04.sdsc.edu/tmp/test.dat”; dest_url = “drouter://quest2.ncsa.uiuc.edu/tmp/test.dat”; alt_protocols = “nest-nest, gsiftp-gsiftp”; ] [ dap_type = “transfer”; src_url = “any://slic04.sdsc.edu/tmp/test.dat”; dest_url = “any://quest2.ncsa.uiuc.edu/tmp/test.dat”; ]
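Dynamic protocol selection with `alt_protocols` can be read as "try the protocol named in the URLs, then fall back through the listed alternatives in order". The Python sketch below assumes that reading. The function names and the `attempt` callback are hypothetical stand-ins for a real transfer module.

```python
from typing import Callable, List, Optional, Tuple

UrlPair = Tuple[str, str]


def _with_scheme(url: str, scheme: str) -> str:
    """Swap the protocol prefix of a URL, keeping host and path."""
    _, rest = url.split("://", 1)
    return f"{scheme}://{rest}"


def candidate_transfers(src_url: str, dest_url: str,
                        alt_protocols: str = "") -> List[UrlPair]:
    """The original URL pair first, then one pair per alternative protocol."""
    pairs: List[UrlPair] = [(src_url, dest_url)]
    for item in alt_protocols.split(","):
        item = item.strip()
        if not item:
            continue
        src_scheme, dest_scheme = item.split("-", 1)
        pairs.append((_with_scheme(src_url, src_scheme),
                      _with_scheme(dest_url, dest_scheme)))
    return pairs


def transfer_with_fallback(src_url: str, dest_url: str, alt_protocols: str,
                           attempt: Callable[[str, str], bool]) -> Optional[UrlPair]:
    """Return the first URL pair for which `attempt` reports success."""
    for src, dest in candidate_transfers(src_url, dest_url, alt_protocols):
        if attempt(src, dest):
            return src, dest
    return None
```

For the first request on the slide, the candidates are the `drouter` pair, then `nest` to `nest`, then `gsiftp` to `gsiftp`.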
38 Run-time Adaptation › Run-time Protocol Auto-tuning [ link = “slic04.sdsc.edu – quest2.ncsa.uiuc.edu”; protocol = “gsiftp”; bs = 1024KB;//block size tcp_bs= 1024KB;//TCP buffer size p= 4; ]
39 Planner DAGMan Condor-G Stork RFT GRAM SRM StartD SRB NeST GridFTP Application Parrot
40 Some bold assertions for discussion… › Need clean separation between resource allocation and work delegation (scheduling) Sites just worry about resource allocation › Scheduling policy should be at the VO level and deal with resources first Don’t ship jobs – ship requests for resources If you send a job to a site, do not assume that it will actually run there. Do not depend on what the site tells you; use your own experience to evaluate how well the site serves you. Keep in mind that grids are not about running one job; they are about running MANY jobs. Want CoD? Get your resources first!!
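The advice to rely on your own experience of a site, rather than on what the site advertises, could be implemented in many ways. The Python sketch below uses one simple option that the talk does not prescribe: a smoothed completion ratio per site. The `SiteExperience` class and the smoothing constants are assumptions.

```python
from typing import Dict, Iterable


class SiteExperience:
    """Tracks observed job outcomes per site and scores sites from them."""

    def __init__(self) -> None:
        self._sent: Dict[str, int] = {}
        self._completed: Dict[str, int] = {}

    def record(self, site: str, completed: bool) -> None:
        """Record one job sent to `site` and whether it actually completed."""
        self._sent[site] = self._sent.get(site, 0) + 1
        if completed:
            self._completed[site] = self._completed.get(site, 0) + 1

    def score(self, site: str) -> float:
        """Smoothed completion ratio; an unseen site scores 0.5."""
        sent = self._sent.get(site, 0)
        done = self._completed.get(site, 0)
        return (done + 1) / (sent + 2)

    def best(self, sites: Iterable[str]) -> str:
        """Pick the site that has served us best so far."""
        return max(sites, key=self.score)
```

Because grids are about running many jobs, a score like this has plenty of observations to draw on, and a site that silently drops jobs loses traffic without anyone having to trust its self-reported state.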
41 Some bold assertions for discussion, cont… › Data Handling Treat data movement as a “first class citizen” › “Lazy Planning” good › Application scheduling “intelligence” should start in a workflow manager and push downwards › Schedulers must be dynamic Deal gracefully with resources coming and going.
42 Thank You! › Questions?