
Condor and MPI
Derek Wright, Computer Sciences Department, University of Wisconsin-Madison
Paradyn/Condor Week, Madison, WI, 2001

Overview › MPI and Condor: Why Now? › Dedicated and Opportunistic Scheduling › How Does it All Work? › Specific MPI Implementations › Future Work

What is MPI? › MPI is the “Message Passing Interface” › Basically, a library for writing parallel applications that use message passing for inter-process communication › MPI is a standard with many different implementations
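MPI itself is typically used from C or Fortran, but the core idea, processes that share nothing and communicate only by sending and receiving messages, can be sketched in a few lines of plain Python. This is only an illustration: the queues here stand in for MPI's send/receive calls, and the thread stands in for an MPI rank (real MPI ranks are separate processes, often on separate machines).

```python
import threading
import queue

def worker(rank, inbox, outbox):
    """A stand-in for an MPI rank: block on a receive, then send a reply."""
    msg = inbox.get()                      # analogous to MPI_Recv
    outbox.put(f"rank {rank} got: {msg}")  # analogous to MPI_Send

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=worker, args=(1, inbox, outbox))
t.start()
inbox.put("hello")       # "rank 0" sends a message to "rank 1"
reply = outbox.get()     # ...and blocks until the reply arrives
t.join()
print(reply)             # rank 1 got: hello
```

The point of the sketch is the programming model: no shared state, explicit sends and receives, and blocking semantics, which is exactly what the MPI standard specifies across its many implementations.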

MPI and Condor: Why Haven’t We Supported It Until Now? › MPI's model assumes a static world › We have always seen the world as dynamic, opportunistic, and ever-changing › We focused our parallel support on PVM, which supports a dynamic environment

MPI With Condor: Why Now? › More and more Condor pools are being formed from dedicated resources › MPI's API is also starting to move toward supporting a dynamic world (e.g., LAM, MPI-2) › Few schedulers (if any) handle both opportunistic and dedicated resources at the same time

Dedicated and Opportunistic Scheduling › Resources can move between 'dedicated' and 'opportunistic' status › Users submit jobs that are either dedicated (e.g. Universe = MPI) or opportunistic (e.g. Universe = standard)
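A dedicated job is described in an ordinary Condor submit file, with the universe selecting the dedicated scheduler. A minimal sketch (the executable name and machine count are placeholders, not from the talk):

```
# Sketch of a submit file for a dedicated MPI job.
# "my_mpi_job" and the count of 8 are placeholders.
universe      = MPI
executable    = my_mpi_job
machine_count = 8
log           = mpi.log
queue
```

An opportunistic job would look the same except for `universe = standard` (or `vanilla`) and no `machine_count`.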

Dedicated and Opportunistic (Cont'd) › Condor leaves all resources as opportunistic unless it sees dedicated jobs to service › The Dedicated Scheduler ('DS') claims opportunistic resources and turns them into dedicated ones to schedule into the future

Dedicated and Opportunistic (Cont'd) › When the DS has no more jobs, it releases the resources which go back to serving opportunistic jobs

Dedicated Scheduling and “Back-Filling” › There will always be “holes” in the dedicated schedule: sets of resources that can't be filled with dedicated jobs for certain periods of time › The traditional solution is “back-filling” the holes with smaller dedicated jobs › However, these back-fill jobs might not be preemptable when the dedicated schedule needs the resources again

Back-Filling (Cont’d) › Instead of back-filling with dedicated jobs, we give the resources to Condor’s opportunistic scheduler › Condor runs preemptable opportunistic jobs until the DS decides it needs the resources again and reclaims them

Dedicated Resources Are Opportunistic Resources › Even “dedicated” resources are really opportunistic: hardware failures, software failures, etc. › Condor handles these failures better than traditional dedicated schedulers, since our system already deals with them after years of opportunistic scheduling experience

How Does MPI Support in Condor Really Work? › Changes to the resource agent (condor_startd) › Changes to the job scheduling agent (condor_schedd) › Changes to the rest of the Condor system

How Do You Make a Resource Dedicated in Condor? › You just have to change a few config file settings; no new startd binary is required › Add an attribute to the machine ClassAd saying which scheduler, if any, this resource is willing to become dedicated to

Other Configuration Changes for the startd › In addition, you must change the policy expressions: the startd must always be willing to run jobs from the DS, and while the resource is claimed by the DS, it should never suspend or preempt jobs
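Taken together, the startd-side changes described on these two slides amount to a handful of configuration settings. A sketch, with the scheduler's hostname as a placeholder (a real pool would also fold its normal opportunistic policy into these expressions):

```
# Which dedicated scheduler this resource may be claimed by
# (the hostname is a placeholder).
DedicatedScheduler = "DedicatedScheduler@submit.example.wisc.edu"
STARTD_EXPRS = $(STARTD_EXPRS), DedicatedScheduler

# Always willing to run jobs from the DS; never suspend or
# preempt while claimed by it.
START   = Scheduler =?= $(DedicatedScheduler)
SUSPEND = False
PREEMPT = False
KILL    = False
```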

Submitting Dedicated Jobs › Requires a new "contrib" version of the condor_schedd › Condor "wakes up" the dedicated scheduler logic inside the condor_schedd when MPI jobs are submitted

How Does Your Job Get Resources? › The DS queries for all resources that are willing to become dedicated to it › The DS sends out “resource request” ClassAds and negotiates for resources with the negotiator (the opportunistic scheduler)
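The query in the first step matches on the attribute the startd publishes in its machine ClassAd. A sketch of the relevant fragment of such an ad (names and hostnames are placeholders):

```
# Fragment of a machine ClassAd that is willing to be claimed
# by a particular dedicated scheduler (hostnames are placeholders).
Name               = "node17.example.wisc.edu"
State              = "Unclaimed"
DedicatedScheduler = "DedicatedScheduler@submit.example.wisc.edu"
```

The DS finds ads carrying its own name in `DedicatedScheduler`, then negotiates for and claims those machines.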

How Does Your Job Get Resources? (Cont’d) › DS then claims resources directly › Once resources are available, the DS schedules and spawns jobs › When jobs complete, if more MPI jobs can be serviced with the same resources, the DS holds onto them and uses them immediately

Changes to the rest of Condor? › Very few other changes required › Users can use all the same tools, interfaces, etc. › Just need a new condor_starter to actually spawn MPI jobs (will also be offered as a contrib module)

Specific MPI Implementations › MPICH › LAM › Others?

Condor and MPICH › Currently we support MPICH on Unix › We are working on adding MPICH-NT support: NT’s MPICH uses a different mechanism to spawn jobs than the Unix MPICH does

Condor + LAM = “LAMdor” › LAM's API is better suited for a dynamic environment, where hosts can come and go from your MPI universe › LAM has a different mechanism for spawning jobs than MPICH › Condor is working to support their methods for spawning

LAMdor (Cont’d) › LAM is working to understand, expand, and fully implement the dynamic scheduling calls in their API › LAM is also considering using Condor’s libraries to support checkpointing of MPI computations

MPI-2 Standard › The MPI-2 standard contains calls to handle dynamic resources › Not yet fully implemented by anyone › When it is, we'll support it

Other MPI implementations › What are people using? › Do you want to see Condor support any other MPI implementations? › If so, send to and let us know

Future work › Implementing more advanced dedicated scheduling algorithms › Support for all sorts of MPI implementations (LAM, MPICH-NT, MPI-2, others)

More Future work › Solving problems with MPI on the Grid: “flocking” MPI jobs to remote pools, or even spanning pools with a single computation, and solving issues of resource ownership on the Grid (i.e., how do you handle multiple dedicated schedulers on the Grid wanting to control a given resource?)

More Future work › Checkpointing entire MPI computations › An “MW” (Master-Worker) implementation on top of Condor-MPI

More Future work › Support for other kinds of dedicated jobs  Generic dedicated jobs (we just gather and schedule the resources, then call your program, give it the list of machines, and let the program spawn itself)  LINDA

How do I start using MPI with Condor? › MPI support is still alpha, not quite ready for production use › A beta release should be out soon as a contrib module › Check the web site

Thanks for Listening! › Questions? › For more information:  