Condor RoadMap
Todd Tannenbaum, Computer Sciences Department, University of Wisconsin-Madison
Paradyn/Condor Week 2005

2 Terms of License
Any and all dates in these slides are relative from a date hereby unspecified in the event of a likely situation involving a frequent condition. Viewing, use, reproduction, display, modification and redistribution of these slides, with or without modification, in source and binary forms, is permitted only after a deposit by said user into PayPal accounts registered to Todd Tannenbaum ….

3 Outline
› Version 6.7.x to Version 6.8.0
  • Availability: failover, fault tolerance
  • Scalability: resources, jobs, matchmaking framework, files
  • Accessibility: APIs, more Grid middleware, network firewalls
  • Everything else: new functionality, new ports, etc.
› And after that?
p.s. Still here? Thank you for your generous PayPal pledge!

4 Current Status
› Current Stable Release: Version 6.6.x
› Current Development Release: Version 6.7.x
› Next Stable Release: Version 6.8.0
  • Once per year
  • Code freeze end of April
  • Release end of May

5 Existing Ports
› Digital UNIX 4.0, Alpha
› AIX 5.2 (clipped), PowerPC
› Tru64 (clipped), Alpha
› HP UNIX, PA-RISC
› HP UNIX (clipped, using the hpux … bit build), PA-RISC
› Irix 6.5 (clipped), SGI
› Linux 2.4.x (glibc 2.2), Red Hat 7.1, 7.2, 7.3 (clipped), Alpha
› Linux 2.4.x (glibc 2.2), Red Hat 7.1, 7.2, 7.3, Intel x86
› Linux 2.4.x (glibc 2.2), Red Hat 8, Intel x86
› Linux 2.4.x (glibc 2.3), Red Hat 9, Intel x86
› SuSE Linux Enterprise Server 8.1, Intel Itanium
› Solaris 8, Sparc
› Solaris 9, Sparc
› Microsoft Windows 2000 or XP (clipped), Intel x86

6 New Ports
› Introduced in v6.6.x
  • MacOSX ("clipped"), PowerPC
  • Debian Linux 3.1, Intel x86
  • Fedora Core 1, Intel x86
  • Red Hat Enterprise Linux 3, Intel x86
  • SuSE Linux Enterprise Server 8.1, Intel Itanium
› Introduced in v6.7.x
  • AIX 5.1 ("clipped"), PowerPC
  • Fedora Core 2 on x86
  • Fedora Core 3 on x86
  • SuSE 8.0 ("clipped") on AMD64
  • Solaris 10 ("clipped") on Sparc
  • Scientific Linux (Release 303) on x86
› Still to be introduced in v6.7.x (before v6.8.0)
  • HPUX 11i 64-bit PA-RISC
  • RHEL 4 on x86
  • "Native" 64-bit AMD Linux
Sigh… "Psilord", the Condor porting doctor. Talk to him in person tomorrow.

7 Job progress continues if the connection is interrupted
› For Vanilla and Java universe jobs, Condor now supports reestablishing the connection between the submitting and executing machines:
  • if there is a network outage between the execute and submit machine
  • if the submit machine restarts
› To take advantage of this feature, put a job_lease_duration line into the job's submit description file (see the sketch below). For example:
    job_lease_duration = 1200
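As a sketch, the lease fits into an ordinary submit description file like this; the executable and file names are hypothetical, and only job_lease_duration is the feature described above:

    # hypothetical vanilla-universe submit file with a 20-minute job lease
    universe           = vanilla
    executable         = my_sim          # placeholder program
    output             = my_sim.out
    error              = my_sim.err
    log                = my_sim.log
    job_lease_duration = 1200            # reconnect window, in seconds
    queue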

8 Job progress continues if the submit machine fails
› Condor can now support a submit machine "hot spare"
  • If your submit machine A is down for longer than N minutes, a second machine B can take over
  • Requires a shared filesystem between machines A and B

9 Central Manager Failover
› The Condor Central Manager has two services:
› condor_collector
  • A list of collectors is now supported (sketch below)
› condor_negotiator (the matchmaker)
  • If it fails, an election process lets another take over
  • Contributed technology from the Technion
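A minimal sketch of the collector side of this; the hostnames are hypothetical, and only the ability to list multiple collectors comes from the slide:

    # condor_config sketch: daemons and tools can fall back to a second collector
    COLLECTOR_HOST = cm1.example.edu, cm2.example.edu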

10 Some Condor APIs
› Command-line tools
  • condor_submit, condor_q, etc.
› Condor Perl module
› Chirp
› Checkpoint library API
› MW (improved!)
› Condor Grid ASCII Helper Protocol (GAHP)
› DRMAA
› Web Service interface

11 DRMAA
› Distributed Resource Management Application API (DRMAA)
  • A GGF Working Group
  • An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems
› An API with C and Java bindings, not a protocol
› Scope
  • Does: job submission, monitoring, control, final status
  • Does not: file staging, reservations, security, …

12 Condor GAHP
› The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout
› Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events

13 GAHP, cont.
Example:

    R: $GahpVersion: … Nov … NCSA\ CoG\ Gahpd $
    S: GRAM_PING 100 vulture.cs.wisc.edu/fork
    R: E
    S: RESULTS
    R: E
    S: COMMANDS
    R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
    S: VERSION
    R: S $GahpVersion: … Nov … NCSA\ CoG\ Gahpd $
    S: INITIALIZE_FROM_FILE /tmp/grid_proxy_….txt
    R: S
    S: GRAM_PING 100 vulture.cs.wisc.edu/fork
    R: S
    S: RESULTS
    R: S 0
    S: RESULTS
    R: S 1
    R: …
    S: QUIT
    R: S

14 Web Service Interfaces
› SOAP over HTTP or HTTPS to the Condor daemons
› Use any language or platform (where you can find a decent SOAP library)
› Functionality exposed in the current release:
  • Submit jobs
  • Retrieve job output
  • Remove/hold/release jobs
  • Query machine status (fetch ads from the collector)
  • Query job status (fetch ads from the schedd)

15 Getting machine status via SOAP (in Java with Axis)

    // The collector URL below is a placeholder; the slide's original URL was cut off
    locator = new CondorCollectorLocator();
    collector = locator.getcondorCollector(new URL("http://condor.example.edu:9618"));
    ads = collector.queryStartdAds("Memory>512");

Because we give you WSDL information you don't have to write any of these functions.

16 New "Grid Universe"
› With the new Grid universe, always specify a 'gridtype'. The old "globus" universe is now declared as follows (full submit sketch below):
    universe = grid
    gridtype = gt2
› Other gridtypes?
  • GT2 (Globus Toolkit 2), GT3 (Globus Toolkit 3.2), GT4 (Globus Toolkit 4): 'Condor-G'
  • UNICORE (Unicore)
  • PBS (OpenPBS, PBSPro; technology from INFN)
  • LSF (Platform LSF; technology from INFN)
  • CONDOR (thanks, gLite!): 'Condor-C'
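A sketch of a complete Condor-G style submit file in this syntax; the gatekeeper contact string and executable are hypothetical, only the universe/gridtype lines come from the slide, and the exact resource-naming command varied across 6.7.x releases:

    # hypothetical grid-universe (Condor-G) submit file
    universe        = grid
    gridtype        = gt2
    globusscheduler = gatekeeper.example.edu/jobmanager-fork   # hypothetical GT2 contact
    executable      = my_job
    output          = my_job.out
    queue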

17 Other Grid Universe improvements
› Condor-G has support for credential refresh via the MyProxy Online Credential Management service in NMI
› Some functionality present in Condor-G was added to Condor-C:
  • Forwarding of refreshed credentials (EGEE)
  • GSI authentication support

18 Quill
› Job ClassAd information is mirrored into an RDBMS
› Both active jobs and historical jobs
› Benefits BOTH scalability and accessibility
[Diagram: Quill reads the schedd's job queue log and mirrors the queue and history tables into an RDBMS; the startd, master, and other daemons are shown alongside.]

19 More tasty Condor goodness!
› Condor can now transfer job data files larger than 2 GB in size
  • On all platforms that support 64-bit file offsets
› Real-time spooling of stdout/stderr/stdin in any universe, including Vanilla
  • Allows real-time monitoring of job progress
› The Condor installer on Win32 uses MSI (thanks, Micron!)
› condor_transfer_data (DZero)
› STARTD_VM_EXPRS (INFN)
› condor_vacate_job tool
› condor_status -negotiator
BAM!

20 And More…
› New startd policy expression MaxJobRetirementTime (config sketch below)
  • Specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt the job
› -peaceful option to condor_off and condor_restart
› noop_job = True
› Preliminary support for the Tool Daemon Protocol (TDP)
  • The TDP goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools
  • Users can specify a "tool" that should be spawned alongside their regular Condor job
  • On Linux, a monitoring tool can attach with ptrace() before the job's main() function is called
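A sketch of how an administrator might set the retirement time in the startd's configuration; the two-hour value is hypothetical:

    # condor_config sketch: let a preempted job run up to 2 hours before eviction
    MAXJOBRETIREMENTTIME = 2 * 60 * 60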

21 Hey Jobs! We're watching you!
› The condor_starter can enforce limits
  • The starter already monitors many job characteristics (image size, CPU usage, etc.)
  • Threshold expressions (sketch below): use more resources than you said you would, and BAM!
› Local Universe
  • Just like the Scheduler universe, but there is a condor_starter
  • All the advantages of the starter
[Diagram: on the submit side, the schedd spawns a starter for the job; on the execute side, the startd spawns a starter for the job. "Hey, job, behave or else!"]
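The slide does not spell out the enforcement knobs, but the flavor of a threshold expression can be sketched with classic startd policy; ImageSize is a standard job attribute reported in KiB, and the roughly 1 GB limit here is hypothetical:

    # condor_config sketch: preempt any job whose image size exceeds ~1 GB
    PREEMPT = ($(PREEMPT)) || (ImageSize > 1000000)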

22 Condor with Firewalls and NATs: GCB in v6.8.0!
[Diagram: a client app and a server app each sit on a GCB layer above TCP/IP; the client's connect() is translated through a relay point, which performs listen/accept on the server's behalf.]

23 Binding & Registration
[Diagram: the server calls B = socket(); bind(B, ANY); getsockname(B, X). The GCB library sends BIND(B) to the broker, which returns the official address X. The socket is locally bound to B but officially bound to X, and the broker registers the pair (X, B).]

24 GCB: Public-Private Connection
[Diagram: the client calls connect(A, X). Its GCB library sends CONNECT(X) to the broker, which replies PASSIVE and tells the server's GCB library to CONTACT(A); the server then connects out to the client's address A.]

25 GCB: Private-Private Connection
[Diagram: the client calls connect(A, X). Its GCB library sends CONNECT(X) to the broker, which replies ACTIVE(X) with a relay address Y and tells the server's GCB library to CONTACT(Y); both sides connect through the relay.]

26 From Condor Week 2003:
› New version of ClassAds into Condor
  • Conditionals!! if/then/else
  • Aggregates (lists, nested classads)
  • Built-in functions: string operations, pattern matching, time operators, unit conversions
  • Clean implementations in C++ and Java
  • ClassAd collections
› This may become v6.8.0
Is this TODD?!?!

27 ClassAd Improvements in Condor!
› Conditionals
  • IfThenElse(condition, then, else)
› String functions
  • strcat(), strcmp(), toUpper(), etc.
› StringList functions (combined in the sketch below)
  • Example of a "string list" (CSV style): Mylist = "Joe, Jon, Jeff, Jim, Jake"
  • StrListContains(), StrListAppend(), StrListRemove(), etc.
› Others
  • Type tests, some math functions
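As a small sketch, the new functions compose inside ordinary policy expressions; the list contents and rank values are hypothetical, and the argument order of StrListContains is assumed, since the slide gives only the function names:

    # sketch: rank jobs from listed users higher, using the new ClassAd functions
    Mylist = "Joe, Jon, Jeff, Jim, Jake"
    Rank   = IfThenElse(StrListContains(Mylist, Owner), 100, 0)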

28 Security
› New service: condor_credd
  • Store, refresh, and forward credentials
  • Right now used just by Stork; its role will expand (AFS authentication?)
› Common authentication methods between Condor on Unix and Win32
  • Kerberos 1.4
    Additional hoped-for benefit: authentication against MS Active Directory!?!
  • GSI on Win32?
› The starter only runs known executables
› The shadow only reads/writes to a given subdirectory (or subdirectories)

29 Accounting Groups and Group Quota Support
› Accounting groups (with CORE Feature Animation)
› Accounting group quotas (inspiration: Fermi)
  • Sample problem: a cluster with 500 nodes; the Chemistry Dept purchased 100 of them, and Chemistry users must always be able to use them
  • Could use Machine Rank, but this ties the policy to specific machines
  • Or could use the new group support (sketched below):
    - Each group can be given a quota in the config file
    - Job ads can specify group membership
    - Group quotas are satisfied first
    - Accounting is done by user and by group
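A minimal sketch of such a setup; the group names, quota values, and user name are hypothetical:

    # condor_config on the central manager: define groups and their quotas
    GROUP_NAMES                 = group_chemistry, group_physics
    GROUP_QUOTA_group_chemistry = 100
    GROUP_QUOTA_group_physics   = 50

    # in a user's submit description file: claim group membership
    +AccountingGroup = "group_chemistry.alice"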

30 Improved Scalability
› Much faster negotiation
  • SIGNIFICANT_ATTRIBUTES determined automatically
  • The schedd uses non-blocking TCP connects to the startd
  • Negotiator caching
  • The collector forks for queries
  • More…

31 Parallel Universe
› An sshd running alongside your job!
  • Also works with the Vanilla and Java universes!
› Support for parallel jobs other than just MPICH, e.g. LAM, SCore (submit sketch below)
  • Nice for testing environments
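A minimal parallel-universe submit sketch; the executable and node count are hypothetical:

    # hypothetical parallel-universe submit file requesting 4 machines
    universe      = parallel
    executable    = my_mpi_app      # placeholder parallel program
    machine_count = 4
    queue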

32 What's brewing for after v6.8.0?
› More data, data, data
  • Stork distributed with v6.8.0, including DAGMan support
  • NeST to manage Condor spool files and checkpoint servers
  • Stork used for Condor job data transfers
› Virtual machines (and the future of the Standard Universe)
› Condor and Shibboleth (with Georgetown Univ)
› Least-privilege security access (with U of Cambridge)
› Dynamic temporary accounts (with EGEE, Argonne)
› Leverage database technology (with the UW DB group)
› 'Automatic' glideins (NMI Nanohub: Purdue, U of Florida)
› Easier updates
› New ClassAds (integration with Optena)
› Hierarchical matchmaking
"Can I commit this to CVS??"

33 A Tree of Matchmakers
› Fault tolerance
› Flexibility
› Matchmakers (MMs) now manage other MMs
[Diagram: a hierarchy of matchmakers (a BIG 10 MM above a UW MM, above a CS MM, above a Theory Group MM, with CC and Erdos MMs alongside), with resources (R) registered at the leaves; an "I need more resources" request propagates up the tree until a match is found.]

34 Thank you!