Jaime Frey Computer Sciences Department University of Wisconsin-Madison Condor-G: A Case in Distributed Job Delegation

Job Delegation › Transfer of responsibility to schedule and execute a job › Multiple delegations can form a chain
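A minimal sketch of the chain idea, using hypothetical Python types (not the Condor-G API): each delegation appends a hop, and the last hop is the current endpoint responsible for the job.

```python
# Hypothetical sketch of a delegation chain (not real Condor-G code):
# each delegation transfers responsibility for the job one hop onward.

class DelegationHop:
    def __init__(self, manager, state="delegated"):
        self.manager = manager    # e.g. "Condor-G", "Globus GRAM"
        self.state = state        # job state as seen at this hop

class DelegatedJob:
    def __init__(self, job_id):
        self.job_id = job_id
        self.chain = []           # ordered list of DelegationHop

    def delegate_to(self, manager):
        """Transfer responsibility for the job to the next manager."""
        self.chain.append(DelegationHop(manager))

    def endpoint(self):
        """The hop currently responsible for running the job."""
        return self.chain[-1].manager if self.chain else None

job = DelegatedJob("job-1")
job.delegate_to("Condor-G")
job.delegate_to("Globus GRAM")
job.delegate_to("PBS")
print(job.endpoint())  # PBS
```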

Job Delegation in Condor-G Today (diagram): Condor-G → Globus GRAM → Batch System Front-end → Execute Machine

Expanding the Model › What can we do with new forms of job delegation? › Some ideas  Mirroring  Load-balancing  Glide-in schedd  Multi-hop grid scheduling

Mirroring › What it does  Jobs mirrored on two Condor-Gs  If primary Condor-G crashes, secondary one starts running jobs  On recovery, primary Condor-G gets job status from secondary one › Removes Condor-G submit point as single point of failure

Mirroring Example (diagram): Condor-G 1 and Condor-G 2, both connected to a Matchmaker and an Execute Machine

Load-Balancing › What it does  Front-end Condor-G distributes all jobs among several back-end Condor-Gs  Front-end Condor-G keeps updated job status › Improves scalability › Maintains single submit point for users
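The distribution step above can be sketched as a least-loaded assignment; the names and policy here are illustrative assumptions, not the actual Condor-G mechanism.

```python
# Sketch of the load-balancing idea: a front-end assigns each incoming
# job to the least-loaded back-end and records where each job went,
# so it can report status from a single submit point.

def assign_jobs(jobs, backends):
    loads = {b: 0 for b in backends}
    placement = {}
    for job in jobs:
        target = min(backends, key=lambda b: loads[b])  # least loaded
        placement[job] = target
        loads[target] += 1
    return placement, loads

placement, loads = assign_jobs(
    ["j1", "j2", "j3", "j4", "j5"],
    ["backend-1", "backend-2", "backend-3"],
)
print(loads)  # {'backend-1': 2, 'backend-2': 2, 'backend-3': 1}
```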

Load-Balancing Example (diagram): Condor-G Front-end distributing jobs to Condor-G Back-ends 1, 2, and 3

Glide-In Schedd › What it does  Drop a Condor-G onto the front-end machine of a cluster  Delegate jobs to the cluster through the glide-in schedd › Apply cluster-specific policies to jobs

Glide-In Schedd Example (diagram): Condor-G → Glide-In Schedd → Batch System

Multi-Hop Grid Scheduling › Match a job to a Virtual Organization (VO), then to a resource within that VO › Easier to schedule jobs across multiple VOs and grids
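The two-stage match can be sketched as follows; the VO names, resource attributes, and selection rule are invented for illustration, not real broker logic.

```python
# Two-stage matchmaking sketch: first match the job to a Virtual
# Organization (VO), then to a resource within that VO.

vos = {
    "cms":   {"resources": {"site-a": {"cpus": 64}, "site-b": {"cpus": 8}}},
    "atlas": {"resources": {"site-c": {"cpus": 32}}},
}

def match(job):
    vo = job["vo"]                       # stage 1: pick the VO
    resources = vos[vo]["resources"]
    for name, res in resources.items():  # stage 2: resource within the VO
        if res["cpus"] >= job["cpus"]:
            return vo, name
    return vo, None                      # VO matched, no resource fits

print(match({"vo": "cms", "cpus": 16}))  # ('cms', 'site-a')
```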

Multi-Hop Grid Scheduling Example (diagram): Experiment Condor-G → Experiment Resource Broker → VO Condor-G → VO Resource Broker → Globus GRAM → Batch Scheduler

Endless Possibilities › These new models can be combined with each other or with other new models › Resulting system can be arbitrarily sophisticated

Job Delegation Challenges › New complexity introduces new issues and exacerbates existing ones › A few…  Transparency  Representation  Scheduling Control  Active Job Control  Revocation  Error Handling and Debugging

Transparency › Full information about a job should be available to the user  Information from the full delegation path  No manual tracing across multiple machines › Users need to know what’s happening with their jobs

Representation › Job state is a vector › How best to show this to the user?  Summary: current delegation endpoint and job state at that endpoint  Full information available if desired: a series of nested ClassAds?
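The nested-ClassAd idea can be sketched with plain dicts standing in for ClassAds (attribute names here are invented, not real ClassAd attributes): the summary view walks to the delegation endpoint, while the full nested record remains available.

```python
# Nested, ClassAd-like representation of delegated job state
# (plain dicts as stand-ins; attribute names are illustrative).
ad = {
    "Manager": "Condor-G (submit)",
    "JobStatus": "delegated",
    "Delegated": {
        "Manager": "Globus GRAM",
        "JobStatus": "delegated",
        "Delegated": {
            "Manager": "PBS",
            "JobStatus": "running",
        },
    },
}

def summarize(ad):
    """Walk to the delegation endpoint; report its manager and state."""
    while "Delegated" in ad:
        ad = ad["Delegated"]
    return ad["Manager"], ad["JobStatus"]

print(summarize(ad))  # ('PBS', 'running')
```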

Scheduling Control › Avoid loops in delegation path › Give user control of scheduling  Allow limiting of delegation path length?  Allow user to specify part or all of delegation path
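Loop avoidance and a path-length limit can be sketched together; the host names and the limit of 5 hops are assumptions for illustration.

```python
# Sketch: refuse a delegation that would revisit a manager already on
# the path (a loop) or exceed a maximum delegation path length.
def can_delegate(path, target, max_hops=5):
    if target in path:           # would create a loop
        return False
    if len(path) >= max_hops:    # path-length limit reached
        return False
    return True

path = ["condor-g.wisc.edu", "broker.vo.org"]
print(can_delegate(path, "site.example.edu"))   # True
print(can_delegate(path, "condor-g.wisc.edu"))  # False (loop)
```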

Active Job Control › User may request certain actions  hold, suspend, vacate, checkpoint › Actions cannot be completed synchronously for the user  Must be forwarded along the delegation path  User checks completion later

Active Job Control (cont) › Endpoint systems may not support actions  If possible, execute them at furthest point that does support them › Allow user to apply action in middle of delegation path
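Executing an action at the furthest hop that supports it can be sketched as a reverse scan of the delegation path; the managers and their supported actions here are hypothetical.

```python
# Sketch: forward an action (hold, vacate, ...) along the delegation
# path and execute it at the furthest hop that supports it.
def apply_action(path, action):
    """path: list of (manager, supported_actions), submit point first.
    Returns the manager that executes the action, or None."""
    for manager, supported in reversed(path):
        if action in supported:
            return manager
    return None  # no hop on the path supports this action

path = [
    ("Condor-G",    {"hold", "remove", "vacate"}),
    ("Globus GRAM", {"hold", "remove"}),
    ("PBS",         {"remove"}),
]
print(apply_action(path, "hold"))    # Globus GRAM (furthest supporting hop)
print(apply_action(path, "vacate"))  # Condor-G
```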

Revocation › Leases  Lease must be renewed periodically for delegation to remain valid  Allows revocation during long-term failures › What are good values for lease lifetime and update interval?
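The lease mechanism can be sketched as follows; the 300-second lifetime and the renew-at-a-third-of-lifetime rule are illustrative choices, which is exactly the open tuning question the slide raises.

```python
# Lease sketch: a delegation stays valid only while its lease is
# renewed; if renewals stop (e.g. the delegator fails long-term),
# the lease expires and the delegation is revoked automatically.
class Lease:
    def __init__(self, now, lifetime=300):
        self.lifetime = lifetime        # seconds a renewal stays valid
        self.expires = now + lifetime

    def renew(self, now):
        self.expires = now + self.lifetime

    def valid(self, now):
        return now < self.expires

lease = Lease(now=0, lifetime=300)  # renew well before expiry,
lease.renew(now=120)                # e.g. every lifetime / 3 seconds
print(lease.valid(now=400))  # True  (renewed at t=120, expires t=420)
print(lease.valid(now=500))  # False (expired; delegation revoked)
```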

Error Handling and Debugging › Many more places for things to go horribly wrong › Need clear, simple error semantics › Logs, logs, logs  Have them everywhere

Current Status › Done  Mirroring › In Progress  Condor-G → Condor-G delegation (user must specify hops)  Glide-in schedd (set up by hand)

Thank You! › Questions?