Condor Project, Computer Sciences Department, University of Wisconsin-Madison. Introduction: Condor Software Forum, OGF19

Outline: What do YOU want to talk about? Proposed agenda: Introduction, Condor-G, APIs, Grid Job Router, GCB, Roadmap

The Condor Project (established 1985): Distributed High Throughput Computing research performed by a team of ~35 faculty, full-time staff, and students who: face software engineering challenges in a distributed UNIX/Linux/NT environment; are involved in national and international grid collaborations; actively interact with academic and commercial users; maintain and support large distributed production environments; and educate and train students. Funding: US Govt. (DoD, DoE, NASA, NSF, NIH), AT&T, IBM, INTEL, Microsoft, UW-Madison, and others.

Main Threads of Activities: Distributed Computing Research – develop and evaluate new concepts, frameworks and technologies; The Open Science Grid (OSG) – build and operate a national distributed computing and storage infrastructure; keep Condor flight-worthy and support our users; The NSF Middleware Initiative (NMI) – develop, build and operate a national Build and Test facility; The Grid Laboratory Of Wisconsin (GLOW) – build, maintain and operate a distributed computing and storage infrastructure on the UW campus.

A Multifaceted Project: harnessing the power of clusters, opportunistic and/or dedicated (Condor); job management services for Grid applications (Condor-G, Stork); fabric management services for Grid resources (Condor, GlideIns, NeST); distributed I/O technology (Parrot, Kangaroo, NeST); job-flow management (DAGMan, Condor, Hawk); distributed monitoring and management (HawkEye); technology for distributed systems (ClassAd, MW); packaging and integration (NMI, VDT).

Some software produced by the Condor Project: Condor System, ClassAd Library, DAGMan, GAHP, Hawkeye, GCB, MW, NeST, Stork, Parrot, Condor-G, and others… all as open source.

What is Condor? Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing (HTC) facility. Condor manages both resources (machines) and resource requests (jobs). Condor has several unique mechanisms: transparent checkpoint/restart, transparent process migration, I/O redirection, ClassAd matchmaking technology, and grid metascheduling.
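To make the job-management side concrete, here is a minimal sketch of a submit description file for a serial (vanilla universe) job; the executable and file names are illustrative placeholders, not from the original slides.

    # Minimal vanilla-universe submit file (illustrative names)
    universe   = vanilla
    executable = my_analysis
    arguments  = input.dat
    output     = my_analysis.out
    error      = my_analysis.err
    log        = my_analysis.log
    queue

The job is handed to Condor with condor_submit and monitored with condor_q.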

Condor can manage a large number of jobs. You specify the jobs in a file and submit them to Condor, which runs them all and keeps you notified of their progress. It provides mechanisms to help you manage huge numbers of jobs (1000s), all the data, etc. Condor can handle inter-job dependencies (DAGMan), Condor users can set job priorities, and Condor administrators can set user priorities.
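As a sketch of what "huge numbers of jobs" looks like in practice, a single submit file can queue an entire parameter sweep; the $(Process) macro expands to 0, 1, 2, ... for each queued job. The file names below are placeholders.

    # One cluster of 1000 jobs, each reading its own input file
    universe   = vanilla
    executable = my_analysis
    arguments  = input_$(Process).dat
    output     = out.$(Process)
    error      = err.$(Process)
    log        = sweep.log
    queue 1000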

Condor can manage dedicated resources (compute clusters, Grid resources): it handles node monitoring and scheduling, and job launch, monitoring, and cleanup.

…and Condor can manage non-dedicated resources. Examples: desktop workstations in offices, workstations in student labs. Non-dedicated resources are often idle, roughly 70% of the time! Condor can effectively harness the otherwise wasted compute cycles from non-dedicated resources.
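How Condor protects workstation owners is controlled by startd policy expressions in the configuration. The fragment below is an illustrative sketch only: the thresholds and the NON_CONDOR_LOAD macro are assumptions, not the shipped defaults.

    # Illustrative owner-protection policy (sketch, not the default config)
    NON_CONDOR_LOAD = (LoadAvg - CondorLoadAvg)
    START    = KeyboardIdle > 15 * 60 && $(NON_CONDOR_LOAD) < 0.3
    SUSPEND  = KeyboardIdle < 60 || $(NON_CONDOR_LOAD) > 0.5
    CONTINUE = KeyboardIdle > 5 * 60 && $(NON_CONDOR_LOAD) < 0.3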

Condor ClassAds: capture and communicate attributes of objects (resources, work units, connections, claims, …); define policies/conditions/triggers via Boolean expressions; ClassAd Collections provide persistent storage; facilitate matchmaking and gangmatching.
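In practice, matchmaking is driven by expressions like the following in a job's submit file; the attribute values are invented for illustration, and a match also requires the machine ad's own Requirements to evaluate to true against the job ad.

    # Job-side matchmaking expressions (illustrative values)
    requirements = (OpSys == "LINUX") && (Arch == "X86_64") && (Memory >= 512)
    rank         = KFlops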

Example: Job Policies with ClassAds. Do not remove the job if it exits with a signal: on_exit_remove = ExitBySignal == False. Place it on hold if it exits with nonzero status or ran for less than an hour: on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600). Place it on hold if the job has spent more than 50% of its time suspended: periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0).
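Gathered into a submit description file, those policy expressions look like the sketch below (the executable name is a placeholder).

    universe       = vanilla
    executable     = my_analysis
    log            = my_analysis.log
    # do not remove the job if it was killed by a signal
    on_exit_remove = ExitBySignal == False
    # hold on nonzero exit status or a run shorter than an hour
    on_exit_hold   = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
    # hold if the job spent more than half its time suspended
    periodic_hold  = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
    queue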

Condor Job Universes: Vanilla (serial jobs); Standard (serial jobs with transparent checkpoint/restart and remote system calls); Java; PVM; Parallel (thanks to AIST and Best Systems); Scheduler; Grid.
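For the standard universe, the program is relinked with condor_compile so that checkpointing and remote system calls can be interposed. The sketch below assumes a C program named my_analysis.c (a placeholder).

    # Relink for the standard universe first:
    #   condor_compile gcc -o my_analysis my_analysis.c
    universe   = standard
    executable = my_analysis
    log        = my_analysis.log
    queue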

Condor Job Universes, cont.: Scheduler, Grid.

Scheduler Job example: DAGMan (Directed Acyclic Graph Manager). Often a job will have several logical steps that must be executed in order. DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you (e.g., don't run job B until job A has completed successfully).

What is a DAG? A DAG is the data structure used by DAGMan to represent these dependencies. Each job is a node in the DAG: it can have its own requirements and can be scheduled independently. Each node can have any number of parent or child nodes, as long as there are no loops! (Diagram: a diamond-shaped example DAG in which Job A is the parent of Jobs B and C, which in turn are both parents of Job D.)
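The diamond-shaped DAG above would be described to DAGMan with an input file along these lines (the submit file names are placeholders) and run with condor_submit_dag.

    # diamond.dag -- sketch of a DAGMan input file
    JOB A a.submit
    JOB B b.submit
    JOB C c.submit
    JOB D d.submit
    PARENT A CHILD B C
    PARENT B C CHILD D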

Additional DAGMan Features: provides other handy features for job management… nodes can have PRE & POST scripts; failed nodes can be automatically retried a configurable number of times; job submission can be throttled.
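In the DAG input file these features appear as extra lines per node; the script names below are placeholders, and submission throttling is done on the condor_submit_dag command line.

    # Node scripts and automatic retries for node A (illustrative)
    JOB A a.submit
    SCRIPT PRE  A stage_input.sh
    SCRIPT POST A check_output.sh
    RETRY A 3
    # Throttle: run at most 50 node jobs at a time
    #   condor_submit_dag -maxjobs 50 diamond.dag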

With the Grid universe, always specify a grid type. Allowed grid types: GT2 (Globus Toolkit 2), GT3 (Globus Toolkit 3.2), GT4 (Globus Toolkit 4), UNICORE, NorduGrid, PBS (OpenPBS, PBSPro; thanks to INFN), LSF (Platform LSF; thanks to INFN), CONDOR (thanks, gLite!). The Grid universe underlies both Condor-G and Condor-C.
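A Condor-G submission to a GT2 gatekeeper looks roughly like the sketch below; the host name and jobmanager are placeholders, and this uses the newer grid_resource submit command rather than the older gridtype/globusscheduler pair mentioned on the slide.

    # Sketch of a grid-universe (Condor-G) submit file
    universe      = grid
    grid_resource = gt2 gatekeeper.example.edu/jobmanager-pbs
    executable    = my_analysis
    output        = my_analysis.out
    log           = my_analysis.log
    queue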

A Grid MetaScheduler Grid Universe + ClassAd Matchmaking

COD Computing On Demand

What problem does COD solve? Some people want to run interactive yet compute-intensive applications: jobs that take lots of compute power over a relatively short period of time. They want to use batch computing resources, but need them right away. Ideally, when they're not in use, the resources would go back to the batch system.

COD is not just high-priority jobs: think "checkpoint to swap space." When a high-priority COD job appears, the lower-priority batch job is suspended. The COD job can run right away while the batch job is suspended. Batch jobs (even those that can't checkpoint) can resume instantly once there are no more active COD jobs.

Stork – Data Placement Agent. The need for data placement on the Grid: locate the data; send data to processing sites; share the results with other sites; allocate and de-allocate storage; clean up everything; do all of this reliably and efficiently. Make data placement a first-class citizen in the Grid.

Stork: a scheduler for data placement activities in the Grid. What Condor is for computational jobs, Stork is for data placement jobs. Stork understands the characteristics and semantics of data placement jobs, so it can make smart scheduling decisions for reliable and efficient data placement.
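A data placement job is handed to Stork as a ClassAd. The sketch below follows the submit format described in the Stork papers, with invented placeholder URLs; treat the exact attribute names as an assumption rather than authoritative syntax.

    [
      dap_type = "transfer";
      src_url  = "gsiftp://source.example.edu/data/input.dat";
      dest_url = "file:///scratch/input.dat";
    ]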

Stork - The Concept: instead of one monolithic stage-in / execute / stage-out job, the end-to-end run is decomposed into data placement jobs (allocate space for input and output data; stage in; stage out; release input space; release output space) and computational jobs (execute the job).

DAGMan + Stork - The Concept: DAGMan drives both the Condor job queue (computational jobs) and the Stork job queue (data placement, "DaP", jobs) from a single DAG specification, for example:
    DaP A A.submit
    DaP B B.submit
    Job C C.submit
    .....
    Parent A child B
    Parent B child C
    Parent C child D, E
    .....

Stork - Support for Heterogeneity: protocol translation using the Stork memory buffer.

GCB – Generic Connection Broker: build grids despite the reality of firewalls, private networks, and NATs.

Condor Usage

[Chart: Condor downloads per month, broken down by platform (X86/Linux vs. X86/Windows).]

[Chart: condor-users mailing list traffic (messages per month) and Condor Team contributions.]

Questions?