Distributed Computing in Practice: The Condor Experience CS739 11/4/2013 Todd Tannenbaum Center for High Throughput Computing Department of Computer Sciences University of Wisconsin-Madison

The Condor High Throughput Computing System 2
“Condor is a high-throughput distributed computing system. Like other batch systems, Condor provides a job management mechanism, scheduling policy, priority scheme, resource monitoring, and resource management…. While similar to other conventional batch systems, Condor's novel architecture allows it to perform well in environments where other batch systems are weak: high-throughput computing and opportunistic computing.”

3
Grids, Clouds, Map-Reduce, eScience, Cyber Infrastructure, SaaS, HPC, HTC, eResearch, Web Services, Virtual Machines, HTPC, MultiCore, GPUs, HDFS, IaaS, SLA, QoS, Open Source, Green Computing, Master-Worker, WorkFlows, Cyber Security, High Availability, Workforce, 100Gb

Definitional Criteria for a Distributed Processing System 4
• Multiplicity of resources
• Component interconnection
• Unity of control
• System transparency
• Component autonomy
P.H. Enslow and T. G. Saponas, “Distributed and Decentralized Control in Fully Distributed Processing Systems,” Technical Report, 1981

5
A thinker's guide to the most important trends of the new decade
“The goal shouldn't be to eliminate failure; it should be to build a system resilient enough to withstand it.”
“The real secret of our success is that we learn from the past, and then we forget it. Unfortunately, we're dangerously close to forgetting the most important lessons of our own history: how to fail gracefully and how to get back on our feet with equal grace.”
“In Defense of Failure” by Megan McArdle, Thursday, Mar. 11, 2010

Claims for “benefits” provided by Distributed Processing Systems 6
• High Availability and Reliability
• High System Performance
• Ease of Modular and Incremental Growth
• Automatic Load and Resource Sharing
• Good Response to Temporary Overloads
• Easy Expansion in Capacity and/or Function
P.H. Enslow, “What is a Distributed Data Processing System?” Computer, January 1978

High Throughput Computing (HTC) 7
› High Throughput Computing
  • Goal: Provide large amounts of fault-tolerant computational power over prolonged periods of time by effectively utilizing all resources
› High Performance Computing (HPC) vs HTC
  • FLOPS vs FLOPY
  • FLOPY != FLOPS * (Num of seconds in a year) (see the worked example below)
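As a rough worked illustration of that last point (the numbers are invented for the example, not measured): a pool rated at a peak of 100 GFLOPS that is actually available for useful work only 70% of the time delivers at most about 0.7 × (100 × 10^9 FLOPS) × (3.15 × 10^7 s/year) ≈ 2.2 × 10^18 floating-point operations in a year, well short of the 3.15 × 10^18 that a naive FLOPS × seconds extrapolation promises. HTC cares about that sustained yearly yield, not the peak rate.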

Opportunistic Computing 8
› Opportunistic Computing
  • Ability to use resources whenever they are available, without requiring one hundred percent availability
  • Very attractive due to realities of distributed ownership
› [Related: Volunteer Computing]

Philosophy of Flexibility 9
› Let communities grow naturally.
  • Build structures that permit but do not require cooperation.
  • Relationships, obligations, and schemata will develop according to user necessity.
› Leave the owner in control.
› Plan without being picky.
  • Better to be flexible than fast!
› Lend and borrow.

Todd’s Helpful Tips 10
› Architect in terms of responsibility instead of expected functionality
  • Delegate like Ronald Reagan. How?
› Plan for Failure!
  • Leases everywhere, 2PC, belt and suspenders, …
› Never underestimate the value of code that has withstood the test of time in the field
› Always keep the end-to-end picture in mind when deciding upon layer functionality

End-to-end thinking 11
› Code reuse creates tensions…
› Example: The network layer uses checksums to ensure correctness

The Condor Kernel 12 Step 1: User submits a job

How are jobs described? 13
› ClassAds and Matchmaking (example below)
  • Name-value pairs
  • Values can be literal data or expressions
  • Expressions can refer to attributes of the match candidate
  • Requirements and Rank are treated specially
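To make the ClassAd model concrete, here is a simplified pair of ads in ClassAd syntax; the attribute values are illustrative rather than taken from a real pool. Matchmaking is bilateral: the job's Requirements are evaluated against the machine ad, the machine's Requirements against the job ad, and Rank orders the acceptable matches.

A job ClassAd (constructed by the schedd from the user's submit file):

  MyType       = "Job"
  Owner        = "todd"
  Cmd          = "/bin/hi"
  Requirements = (OpSys == "LINUX") && (Memory >= 1024)
  Rank         = KFlops

A machine ClassAd (advertised by the startd):

  MyType       = "Machine"
  OpSys        = "LINUX"
  Memory       = 2048
  KFlops       = 105000
  LoadAvg      = 0.02
  Requirements = (LoadAvg < 0.3) || (Owner == "todd")
  Rank         = 0

Here both Requirements expressions evaluate to True against the other ad, so the pair is a candidate match, and among candidate machines the job prefers the one with the largest KFlops.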

Interesting ClassAd Properties 14
› Semi-structured data
  • No fixed schema
  • “Let communities grow naturally…”
› Uses three-valued logic
  • Expressions evaluate to True, False, or Undefined (example below)
› Bilateral
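For example, if a job ad says Requirements = (HasJava == True) and a candidate machine ad simply does not define a HasJava attribute, the expression evaluates to Undefined rather than producing an error, and that machine is just not matched. Undefined lets ads from different communities safely omit attributes that other communities happen to reference, which is what makes the schema-free approach workable.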

Plan for Failure, Lend and borrow 15
› What if the schedd crashes during job submission?
  • ARIES-style recovery log (sketch below)
  • Two-phase commit
Job-queue log from the slide’s figure:
  Begin Transaction
  105 Owner Todd
  105 Cmd /bin/hi
  …
  End Transaction
  Begin Transaction
  106 Owner Fred
  106 Cmd /bin/bye
Messages sent to the schedd in the figure: Prepare Job ID 106, Commit 106
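A minimal recovery sketch in Python, under the assumption of a simplified one-record-per-line log like the figure above (the real schedd log format and the full ARIES protocol are considerably richer): on restart, only transactions whose End Transaction record reached the disk are applied, and a transaction the crash cut short is discarded.

  def recover(log_lines):
      """Rebuild job state from a simplified write-ahead job-queue log."""
      jobs = {}        # committed state: job id -> {attribute: value}
      pending = {}     # updates of the transaction currently being read
      in_txn = False
      for line in log_lines:
          words = line.split()
          if words[:2] == ["Begin", "Transaction"]:
              in_txn, pending = True, {}
          elif words[:2] == ["End", "Transaction"]:
              # The transaction completed before the crash, so apply it.
              for (job_id, attr), value in pending.items():
                  jobs.setdefault(job_id, {})[attr] = value
              in_txn, pending = False, {}
          elif in_txn and len(words) >= 3:
              job_id, attr, value = words[0], words[1], " ".join(words[2:])
              pending[(job_id, attr)] = value
      # Anything still pending belongs to a transaction the crash interrupted;
      # it is dropped, as if the submission never happened.
      return jobs

Replaying the log in the figure would leave job 105 fully recorded, while job 106's updates are discarded because its End Transaction record is not yet in the log; the two-phase commit ensures the user is only told “submitted” once the commit is durable.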

The Condor Kernel 16 Step 2: Matchmaking

Matchmaking Protocol 17
1. Advertise
2. Match
3. Claim
› Each component service (Agent, Matchmaker, Resource) is independent and has a distinct responsibility
› The centralized Matchmaker is very lightweight: after the Match step, it is not involved
› Claims can be reused, delegated, … (sketch of the match step below)
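A toy matchmaker loop in Python, purely to illustrate the match-then-rank idea; representing ads as plain dicts and Requirements/Rank as Python callables is an assumption of this sketch, since real ClassAd expressions are evaluated by the ClassAd engine and can also yield Undefined.

  def matchmake(job, machines):
      """Return the machine the job ranks highest among mutually acceptable ones."""
      acceptable = []
      for m in machines:
          # Bilateral match: the machine must accept the job AND the job the machine.
          if m["Requirements"](m, job) and job["Requirements"](job, m):
              acceptable.append(m)
      if not acceptable:
          return None
      # Among acceptable machines, take the one the job's Rank values most.
      return max(acceptable, key=lambda m: job["Rank"](job, m))

  # Hypothetical usage: one job ad, one machine ad, both plain dicts.
  job = {
      "Requirements": lambda my, target: target["OpSys"] == "LINUX" and target["Memory"] >= 1024,
      "Rank":         lambda my, target: target["KFlops"],
  }
  machines = [
      {"OpSys": "LINUX", "Memory": 2048, "KFlops": 105000, "LoadAvg": 0.02,
       "Requirements": lambda my, target: my["LoadAvg"] < 0.3},
  ]
  print(matchmake(job, machines))

After returning a match, a real matchmaker only notifies the two parties; the claim and the job's execution then proceed directly between agent and resource, which is why the matchmaker can stay so lightweight.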

Federation via Direct Flocking 18
› Federation many years later was easy because of the clearly defined roles of the services

Globus and GRAM 19
› GRAM = Grid Resource Allocation and Management
  • “De facto standard” protocol for submission to a site scheduler
› Problems
  • Dropped many end-to-end features, like exit codes!
  • Couples resource allocation and job execution: early binding of a specific job to a specific queue

Solution: Late-binding via GlideIn 24

Split Execution Environments 25
› Standard Universe
  • Process Checkpoint/Restart
  • Remote I/O
› Java Universe

Process Checkpointing
› Condor’s process checkpointing mechanism saves the entire state of a process into a checkpoint file
  • Memory, CPU, I/O, etc.
› The process can then be restarted from right where it left off
› Typically no changes to your job’s source code are needed; however, your job must be relinked with Condor’s Standard Universe support library

Relinking Your Job for Standard Universe
To do this, just place “condor_compile” in front of the command you normally use to link your job:
  % condor_compile gcc -o myjob myjob.c
or
  % condor_compile f77 -o myjob filea.f fileb.f
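Once relinked, the job is submitted with universe = standard so Condor knows it supports checkpointing and remote system calls. A minimal submit-file sketch (the file names are illustrative):

  universe   = standard
  executable = myjob
  input      = infile
  output     = outfile
  error      = errfile
  log        = myjob.log
  queue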

When will Condor checkpoint your job?
› Periodically, if desired (for fault tolerance)
› When your job is preempted by a higher priority job
  • Preempt/Resume scheduling – powerful!
› When your job is vacated because the execution machine becomes busy
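These events can also be triggered by hand; for example, classic Condor ships the condor_checkpoint and condor_vacate command-line tools (the host name below is a placeholder):

  % condor_checkpoint crane.cs.wisc.edu   (ask the startd on that host to checkpoint its running jobs)
  % condor_vacate crane.cs.wisc.edu       (checkpoint if possible, then evict the jobs on that host)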

Remote System Calls
› I/O system calls are trapped and sent back to the submit machine
› Allows transparent migration across administrative domains
  • Checkpoint on machine A, restart on B
› No source code changes required
› Language independent
› Opportunities for application steering

The Condor Kernel 30

(Figure: Remote I/O. Components shown: Job, I/O Lib, condor_schedd, condor_shadow, condor_startd, condor_starter, File. The job's I/O calls, trapped by the linked-in library, travel from the starter on the execute machine back to the shadow on the submit machine, which performs them against the File.)

Java Universe Job
  universe   = java
  executable = Main.class
  jar_files  = MyLibrary.jar
  input      = infile
  output     = outfile
  arguments  = Main
  queue
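As with any other universe, this submit description is handed to the schedd with condor_submit (the file name is illustrative):

  % condor_submit myjob.sub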

Why not use Vanilla Universe for Java jobs?
› Java Universe provides more than just inserting “java” at the start of the execute line
  • Knows which machines have a JVM installed
  • Knows the location, version, and performance of the JVM on each machine
  • Can differentiate JVM exit code from program exit code
  • Can report Java exceptions

DAGMan
› Directed Acyclic Graph Manager
› DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you
› e.g., “Don’t run job B until job A has completed successfully.”

What is a DAG?
› A DAG is the data structure used by DAGMan to represent these dependencies
› Each job is a “node” in the DAG
› Each node can have any number of “parent” or “child” nodes – as long as there are no loops!
(Figure: a diamond-shaped DAG with Job A at the top, Job B and Job C in the middle, and Job D at the bottom.)

Defining a DAG
› A DAG is defined by a .dag file, listing each of its nodes and their dependencies:
  # diamond.dag
  Job A a.sub
  Job B b.sub
  Job C c.sub
  Job D d.sub
  Parent A Child B C
  Parent B C Child D
› Each node will run the Condor job specified by its accompanying Condor submit file
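The whole workflow is then started with condor_submit_dag, which submits a DAGMan job that in turn submits and watches over the node jobs:

  % condor_submit_dag diamond.dag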

Security 37
› CEDAR
› Job Sandboxing

Questions? 38