Condor and Grid Challenges
Todd Tannenbaum, Computer Sciences Department, University of Wisconsin-Madison
HEPiX Spring 2005, Karlsruhe, Germany

2 Outline
› Introduction to the Condor Project
› Introduction to Condor / Condor-G
› The challenge: take Grid planning to the next level
 • How to address the mismatch between local and grid scheduler capabilities
› Lessons learned

3 The Condor Project (Established '85) Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students.

4 The Condor Project (Established '85)
Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who:
 • face software engineering challenges in a distributed UNIX/Linux/NT environment,
 • are involved in national and international grid collaborations,
 • actively interact with academic and commercial users,
 • maintain and support large distributed production environments,
 • and educate and train students.
Funding: US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, …

5 A Multifaceted Project
› Harnessing the power of clusters, dedicated and/or opportunistic (Condor)
› Job management services for Grid applications (Condor-G, Stork)
› Fabric management services for Grid resources (Condor, GlideIns, NeST)
› Distributed I/O technology (Parrot, Kangaroo, NeST)
› Job-flow management (DAGMan, Condor, Hawk)
› Distributed monitoring and management (HawkEye)
› Technology for distributed systems (ClassAd, MW)
› Packaging and integration (NMI, VDT)

6 Some free software produced by the Condor Project
› Condor System
› ClassAd Library
› DAGMan
› Fault Tolerant Shell (FTSH)
› Hawkeye
› GCB
› MW
› NeST
› Stork
› Parrot
› VDT
› And others… all as open source

7 Full featured system
› Flexible scheduling policy engine via ClassAds
 • Preemption, suspension, requirements, preferences, groups, quotas, settable fair-share, system hold, …
› Facilities to manage BOTH dedicated CPUs (clusters) and non-dedicated resources (desktops)
› Transparent checkpoint/migration for many types of serial jobs
› No shared file system required
› Workflow management (inter-dependencies)
› Support for many job types: serial, parallel, etc.
› Fault-tolerant: can survive crashes and network outages; no single point of failure
› APIs: SOAP / web services, DRMAA (C), Perl package, GAHP, flexible command-line tools
› Platforms: Linux i386/IA64, Windows 2k/XP, MacOS, Solaris, IRIX, HP-UX, Compaq Tru64, … lots.
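As a concrete, deliberately minimal illustration of the command-line route mentioned above, the sketch below shows a vanilla-universe submit description file; the executable and file names are invented for the example.

  # sketch.sub - a minimal vanilla-universe job (names are illustrative)
  universe     = vanilla
  executable   = my_analysis
  arguments    = input.dat
  output       = job.out
  error        = job.err
  log          = job.log
  requirements = (Memory > 512) && (Disk > 100000)   # evaluated against machine ClassAds
  queue 1

Running condor_submit sketch.sub places the job in the schedd's queue; condor_q then shows its progress.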

8 Who uses Condor?
› Commercial
 • Oracle, Micron, Hartford Life Insurance, CORE, Xerox, Exxon/Mobil, Shell, Alterra, Texas Instruments, …
› Research community
 • Universities, government labs
 • Bundles: NMI, VDT
 • Grid communities: EGEE/LCG/gLite, Particle Physics Data Grid (PPDG), Fermi, USCMS, LIGO, iVDGL, NSF Middleware Initiative GRIDS Center, …

9 Condor in a nutshell
› Condor is a set of distributed services:
 • A scheduler service (schedd): job policy
 • A resource manager service (startd): resource policy
 • A matchmaker service: administrator policy
› These services use ClassAds as the lingua franca

10 ClassAds
 • are a set of uniquely named expressions; each expression is called an attribute and is an attribute name/value pair
 • combine query and data
 • extensible
 • semi-structured: no fixed schema (flexibility in an environment consisting of distributed administrative domains)
 • designed for "MatchMaking"

11 Sample: Machine ClassAd
  Arch = "Intel-Linux";
  KFlops = 2200
  Memory = 1024
  Disk = …
  Requirements = other.ImageSize < Memory
  preferred_projects = { "atlas", "uscms" }
  Rank = member(other.project, preferred_projects) * 500
  …
(70+ out of the box, can be customized)

12 Another Sample: Job Submit
  project = "uscms"
  imagesize = 260
  requirements = (other.Arch == "Intel-Linux" || other.Arch == "Sparc-Solaris") && Disk > 500 && (fooapp_path =!= UNDEFINED)
  rank = other.KFlops * other.Memory
  executable = find_particle.$$(Arch)
  arguments = $$(fooapp_path)
  …
Note: we introduced augmentation of the job ClassAd based upon information discovered in its matching resource ClassAd (the $$() substitutions above).
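As a side note on how such ads are inspected in practice, the standard query tools dump and filter them directly; the host name and job id below are placeholders, and the constraint simply reuses attributes from the sample machine ad above.

  condor_status -constraint 'Memory > 512 && Arch == "Intel-Linux"'   # machines matching an expression
  condor_status -long some-node.example.edu                           # full ClassAd of one machine
  condor_q -long 42.0                                                 # full ClassAd of queued job 42.0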

13 First make some matches
(Diagram: schedds with queued jobs, startds, and the MatchMaker.) NOTE: have as many schedds as you want!

14 Then claim and schedule
(Diagram: schedds with queued jobs, startds, and the MatchMaker.)

15 Then claim and schedule
(Diagram: schedds with queued jobs, startds, and the MatchMaker.) Note: each schedd decides for itself which job to run.

16 How long does this claim exist?
› Until it is explicitly preempted
 • Or on job boundaries, if you insist
› Who can preempt?
 • The startd, due to resource owner policy
 • The matchmaker, due to organization policy (wants to give the node to a higher-priority user)
 • The schedd, due to job scheduling policy (stop a runaway job, etc.)
› Responsibility is placed at the appropriate level
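To ground the "resource owner policy" bullet, here is a hedged sketch of startd policy expressions in condor_config; START, SUSPEND, PREEMPT, and KILL are standard policy knobs, but the particular thresholds are invented for illustration, and the matchmaker-level knob at the end is likewise only indicative.

  # Owner policy on an execute machine (illustrative thresholds)
  START   = KeyboardIdle > 15 * 60        # only run jobs after 15 idle minutes
  SUSPEND = KeyboardIdle < 60             # suspend when the owner comes back
  PREEMPT = (Activity == "Suspended") && (CurrentTime - EnteredCurrentActivity > 10 * 60)
  KILL    = FALSE                         # give the job time to vacate gracefully

  # Organization policy lives with the matchmaker (negotiator), e.g.:
  PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2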

17 Condor-G
(Diagram: jobs in a schedd routed via Globus 2, Globus 4, Unicore, (NorduGrid) to remote resources such as LSF, PBS, a startd, or another schedd. Labels: Condor-G, Condor-C; thanks INFN!)

18 (Layer diagram, labels:) User/Application/Portal; Condor-G; Middleware (Globus 2, Globus 4, Unicore, …); Grid; Condor Pool; Fabric (processing, storage, communication)

19 Two ways to do the "grid thing" (share resources)
› "I'll loan you a machine; you do whatever you want to do with it."
 • This is what Condor does via matchmaking
› "Give me your job, and I'll run it for you."
 • Traditional grid broker

20 How Flocking Works
› Add a line to your condor_config:
    FLOCK_HOSTS = Site-A, Site-B
(Diagram: the submit machine's schedd talks to its local matchmaker (collector + negotiator) and to the matchmakers at Site-A and Site-B.)
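For completeness, a hedged sketch of what both ends of a flocking setup might contain; later Condor releases split FLOCK_HOSTS into FLOCK_TO / FLOCK_FROM, and the host names here are illustrative assumptions, not from the slides.

  # Submit-side pool: where to flock when local resources run out (illustrative hosts)
  FLOCK_HOSTS = cm.site-a.edu, cm.site-b.edu

  # Remote pools: authorize the flocked schedd
  FLOCK_FROM      = submit.my-campus.edu
  HOSTALLOW_WRITE = $(HOSTALLOW_WRITE), submit.my-campus.edu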

21 Example of "Give me a job"
(Diagram: Broker (Condor-G) -> Grid Middleware -> Batch System Front-end -> Execute Machine.)
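From the Condor-G side, handing a job to such a broker chain can look like the sketch below, assuming the Globus GRAM (gt2) flavor of the grid universe; the gatekeeper host, jobmanager name, and file names are placeholders.

  # gridjob.sub - submit via Condor-G to a remote GRAM gatekeeper (illustrative)
  universe      = grid
  grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
  executable    = my_analysis
  output        = job.out
  log           = job.log
  queue

condor_submit gridjob.sub queues the job locally; the Condor-G schedd then drives the remote submission, monitoring, and recovery.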

22 Any problems?
› Sure:
 • Lack of scheduling information
 • Job monitoring details
 • Loss of job semantic guarantees
› What can we do? Glide-In!
 • Schedd and Startd are separate services that do not require any special privileges
 • Thus we can submit them as jobs!
 • Grow a Condor pool out of Grid resources

23 Glide-In Startd Example
(Diagram: Condor-G (schedd) submits through the Middleware and Batch System Front-end; a Startd starts as the batch job and then runs the user's Job.)

24 Glide-In Startd
› Why?
 • Restores the capabilities that may have been washed away by the middleware, such as scheduling information and job monitoring details
 • End-to-end management solution: preserves job semantic guarantees, preserves policy
 • Enables lazy planning
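A hedged sketch of how "submit them as jobs" could look in practice: the wrapper script, tarball, and pool address below are hypothetical (the condor_glidein tool automated these steps), but the shape is an ordinary Condor-G submission whose payload is a startd.

  # glidein.sub - start a startd on a grid resource so it joins our pool (illustrative)
  universe             = grid
  grid_resource        = gt2 gatekeeper.example.edu/jobmanager-pbs
  executable           = glidein_startup.sh            # hypothetical wrapper launching condor_master/startd
  arguments            = --pool cm.my-campus.edu
  transfer_input_files = condor_daemons.tar.gz         # hypothetical tarball of Condor binaries
  output               = glidein.out
  log                  = glidein.log
  queue 10                                             # ask for up to 10 glided-in slots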

25 Glide-In Startd (aka "sneaky job")
› Why not?
 • Time wasted on scheduling overhead for black-hole jobs
    - Alternative? Wait-while-idle. Which is more evil?
    - Glide-ins can be removed
    - Glide-in "broadcast" is not the only model
 • Makes accurate time prediction difficult (impossible?)
    - But can still give a worst case
› Would not be needed if schedulers could delegate machines instead of just running jobs…

26 What's brewing for after v6.8.0?
› More data, data, data
 • Stork distributed with v6.8.0, including DAGMan support
 • NeST to manage Condor spool files and checkpoint servers
 • Stork used for Condor job data transfers
› Virtual Machines (and the future of the Standard Universe)
› Condor and Shibboleth (with Georgetown Univ.)
› Least Privilege Security Access (with U. of Cambridge)
› Dynamic Temporary Accounts (with EGEE, Argonne)
› Leverage Database Technology (with UW DB group)
› "Automatic" Glideins (NMI Nanohub: Purdue, U. of Florida)
› Easier Updates
› New ClassAds (integration with Optena)
› Hierarchical Matchmaking
("Can I commit this to CVS??")

27 A Tree of Matchmakers
(Diagram: a hierarchy of matchmakers: Big 10 MM, UW MM, CS MM, Theory Group MM, Erdos MM, with resources underneath; a lower-level MM asks "I need more resources" and receives a match.)
› MMs now manage other MMs
› Fault tolerance, flexibility

28 Data Placement* (DaP) must be an integral part of the end-to-end solution
   * space management and data transfer

29 Stork
› A scheduler for data placement activities in the Grid
› What Condor is for computational jobs, Stork is for data placement
› Stork comes with a new concept: "Make data placement a first class citizen in the Grid."

30 Data placement jobs vs. computational jobs
Without DaP jobs: Stage-in -> Execute the job -> Stage-out
With DaP jobs: Allocate space for input & output data -> Stage-in -> Execute the job -> Stage-out -> Release input space -> Release output space
(The staging and space-management steps are data placement jobs; "Execute the job" is the computational job.)

31 DAG with DaP
DAG specification (managed by DAGMan; DaP nodes go to the Stork job queue, Job nodes to the Condor job queue):
   DaP A A.submit
   DaP B B.submit
   Job C C.submit
   …..
   Parent A child B
   Parent B child C
   Parent C child D, E
   …..
(Diagram: the DAG's nodes A, B, C, D, E, F.)

32 Why Stork?
› Stork understands the characteristics and semantics of data placement jobs.
› Can make smart scheduling decisions, for reliable and efficient data placement.

33 Failure Recovery and Efficient Resource Utilization
› Fault tolerance
 • Just submit a bunch of data placement jobs, and then go away…
› Control the number of concurrent transfers from/to any storage system
 • Prevents overloading
› Space allocations and de-allocations
 • Make sure space is available

34 Support for Heterogeneity
Protocol translation using the Stork memory buffer.

35 Support for Heterogeneity
Protocol translation using the Stork disk cache.

36 Flexible Job Representation and Multilevel Policy Support
  [
    Type     = "Transfer";
    Src_Url  = "srb://ghidorac.sdsc.edu/kosart.condor/x.dat";
    Dest_Url = "nest://turkey.cs.wisc.edu/kosart/x.dat";
    ……
    Max_Retry  = 10;
    Restart_in = "2 hours";
  ]
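For context, a placement job like the one above is handed to Stork from the command line; the file name is illustrative, and the tool names below are recalled from Stork documentation of that era rather than from this slide.

  stork_submit transfer.stork    # queue the data placement job with Stork
  stork_rm <job_id>              # remove a queued or running placement job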

37 Run-time Adaptation
› Dynamic protocol selection
  [
    dap_type = "transfer";
    src_url  = "drouter://slic04.sdsc.edu/tmp/test.dat";
    dest_url = "drouter://quest2.ncsa.uiuc.edu/tmp/test.dat";
    alt_protocols = "nest-nest, gsiftp-gsiftp";
  ]
  [
    dap_type = "transfer";
    src_url  = "any://slic04.sdsc.edu/tmp/test.dat";
    dest_url = "any://quest2.ncsa.uiuc.edu/tmp/test.dat";
  ]

38 Run-time Adaptation
› Run-time protocol auto-tuning
  [
    link     = "slic04.sdsc.edu - quest2.ncsa.uiuc.edu";
    protocol = "gsiftp";
    bs       = 1024KB;    // block size
    tcp_bs   = 1024KB;    // TCP buffer size
    p        = 4;
  ]

39 (Component stack diagram, labels:) Application, Planner, DAGMan, Condor-G, Stork, Parrot, and the services they talk to: GRAM, RFT, SRM, StartD, SRB, NeST, GridFTP.

40 Some bold assertions for discussion…
› Need a clean separation between resource allocation and work delegation (scheduling)
 • Sites just worry about resource allocation
› Scheduling policy should be at the VO level and deal with resources first
 • Don't ship jobs; ship requests for resources
 • If you send a job to a site, do not assume that it will actually run there
 • Do not depend on what the site tells you; use your own experience to evaluate how well the site serves you
 • Keep in mind that grids are not about running one job; they are about running MANY jobs
 • Want CoD? Get your resources first!!

41 Some bold assertions for discussion, cont…
› Data handling
 • Treat data movement as a "first class citizen"
› "Lazy planning" is good
› Application scheduling "intelligence" should start in a workflow manager and push downwards
› Schedulers must be dynamic
 • Deal gracefully with resources coming and going

42 Thank You! › Questions?