Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison Condor RoadMap Paradyn/Condor.


1 Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison condor-admin@cs.wisc.edu http://www.cs.wisc.edu/condor Condor RoadMap Paradyn/Condor Week 2005

2 Terms of License Any and all dates in these slides are relative from a date hereby unspecified in the event of a likely situation involving a frequent condition. Viewing, use, reproduction, display, modification and redistribution of these slides, with or without modification, in source and binary forms, is permitted only after a deposit by said user into PayPal accounts registered to Todd Tannenbaum ….

3 3 Outline › Version 6.7.x to Version 6.8.0  Availability: failover, fault tolerance  Scalability: resources, jobs, matchmaking framework, files  Accessibility: APIs, more Grid middleware, network firewalls  Everything else: new functionality, new ports, etc. › And after that? P.S. Still here? Thank you for your generous PayPal pledge!

4 4 Current Status › Current Stable Release  Version 6.6.9 › Current Development Release  Version 6.7.5 › Next Stable Release Version 6.8.0  Once per year  Code freeze end of April  Release end of May

5 5 Existing Ports
› Digital UNIX 4.0 on Alpha
› AIX 5.2 (clipped) on PowerPC
› Tru64 5.1 (clipped) on Alpha
› HP UNIX 10.20 on PA RISC
› HP UNIX 11.00 (clipped using hpux10.20 32 bit) on PA RISC
› Irix 6.5 (clipped) on SGI
› Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped) on Alpha
› Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 on Intel x86
› Linux 2.4.x (glibc 2.2) - Red Hat 8 on Intel x86
› Linux 2.4.x (glibc 2.3) - Red Hat 9 on Intel x86
› Enterprise Server 8.1 on Intel Itanium
› Solaris 8 on Sparc
› Solaris 9 on Sparc
› Microsoft Windows 2000 or XP (clipped) on Intel x86

6 6 New Ports › Introduced in v6.6.x  MacOSX ("clipped") PowerPC  Debian Linux 3.1 Intel x86  Fedora Core 1 Intel x86  Red Hat Enterprise Linux 3 Intel x86  SuSE Linux Enterprise Server 8.1 Intel Itanium › Introduced in v6.7.x  AIX 5.1 ("clipped") PowerPC  Fedora Core 2 on x86  Fedora Core 3 on x86  SuSE 8.0 ("clipped") on AMD64  Solaris 10 ("clipped") on Sparc  Scientific Linux (Release 303) on x86 › Still to be introduced in v6.7.x (before v6.8.0)  HPUX 11i 64-bit pa-risc  RHEL 4 on x86  "native" 64 bit AMD Linux. Sigh… "Psilord", the Condor porting doctor. Talk to him in person tomorrow.

7 7 Job Progress continues if connection is interrupted › Now for Vanilla and Java universe jobs, Condor supports reestablishment of the connection between the submitting and executing machines.  If there is a network outage between the execute and submit machines  If the submit machine restarts › To take advantage of this feature, put the following line into the job's submit description file: job_lease_duration = (number of seconds). For example: job_lease_duration = 1200
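The lease idea on this slide can be sketched in a few lines of Python. This is only an illustrative model, assuming that the execute side keeps a job running as long as the submit side has made contact within the lease duration; the `JobLease` class and its method names are hypothetical and are not part of Condor.

```python
from dataclasses import dataclass


@dataclass
class JobLease:
    """Hypothetical model of a job lease; not Condor's actual data structure."""

    duration: int        # seconds, e.g. the value given to job_lease_duration
    last_renewed: float  # time of the last successful contact with the submit side

    def renew(self, now: float) -> None:
        # Called whenever the submit machine re-establishes contact.
        self.last_renewed = now

    def expired(self, now: float) -> bool:
        # The job may keep running only while the lease has not run out.
        return now - self.last_renewed > self.duration


lease = JobLease(duration=1200, last_renewed=0.0)
print(lease.expired(600.0))   # contact lost for 10 minutes: lease still valid
print(lease.expired(1500.0))  # contact lost for 25 minutes: lease has run out
```

With a 1200-second lease, a submit machine that reconnects within 20 minutes would, in this model, find its job still running.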

8 8 Job Progress continues if submit machine fails › Condor can now support a submit machine “hot spare”  If your submit machine A is down for longer than N minutes, a second machine B can take over  Requires shared filesystem between machines A and B

9 9 Central Manager Failover › Condor Central Manager has two services › condor_collector  Now a list of collectors is supported › condor_negotiator (matchmaker)  If fails, election process, another takes over  Contributed technology from Technion
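A client-side view of "a list of collectors" might look like the sketch below. It assumes the simplest possible policy (try each collector in order and use the first one that answers); the function name, the injected `query` callable, and the host names in the usage example are all hypothetical, and Condor's real behavior may differ.

```python
from typing import Callable, Dict, Iterable, List


def query_first_available(
    collectors: Iterable[str],
    query: Callable[[str], List[Dict[str, str]]],
) -> List[Dict[str, str]]:
    """Try each collector in turn and return the first successful answer.

    `query` is a stand-in for whatever actually talks to a collector; it is
    expected to raise ConnectionError when the host cannot be reached.
    """
    last_error = None
    for host in collectors:
        try:
            return query(host)
        except ConnectionError as err:
            last_error = err  # remember the failure and try the next host
    raise ConnectionError("no collector reachable") from last_error


def demo_query(host: str) -> List[Dict[str, str]]:
    # Hypothetical stand-in: the first central manager is down.
    if host == "cm1.example.org":
        raise ConnectionError(host)
    return [{"Name": "vm1@node1", "AnsweredBy": host}]


print(query_first_available(["cm1.example.org", "cm2.example.org"], demo_query))
```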

10 10 Some Condor APIs › Command Line tools  condor_submit, condor_q, etc › Condor Perl Module › Chirp › Checkpoint Library API › MW --- improved! › DRMAA › Condor Grid ASCII Protocol (GAHP) › Web Service Interface

11 11 DRMAA › Distributed Resource Management Application API (DRMAA)  GGF Working Group  An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems › An API with C and Java bindings  not a protocol › Scope  Does: job submission, monitoring, control, final status  Does not: file staging, reservations, security, …

12 12 Condor GAHP › The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout › Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events

13 13 GAHP, cont
Example:
R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
S: GRAM_PING 100 vulture.cs.wisc.edu/fork
R: E
S: RESULTS
R: E
S: COMMANDS
R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
S: VERSION
R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
R: S
S: GRAM_PING 100 vulture.cs.wisc.edu/fork
R: S
S: RESULTS
R: S 0
S: RESULTS
R: S 1
R: 100 0
S: QUIT
R: S
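Because the protocol is plain ASCII lines, a client mostly needs a line splitter. The Python sketch below assumes, based only on the transcript above, that fields are separated by single spaces, that a backslash escapes the following character (as in `NCSA\ CoG\ Gahpd`), and that a reply beginning with `S` means success while `E` means error. The helper names are hypothetical.

```python
from typing import List


def split_gahp_line(line: str) -> List[str]:
    """Split one GAHP-style line into fields, honoring backslash escapes."""
    fields: List[str] = []
    current: List[str] = []
    escaped = False
    for ch in line.rstrip("\r\n"):
        if escaped:
            current.append(ch)   # escaped character is taken literally
            escaped = False
        elif ch == "\\":
            escaped = True       # next character is part of the field
        elif ch == " ":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def is_success(reply: str) -> bool:
    """True if the reply line starts with the success marker 'S'."""
    return split_gahp_line(reply)[0] == "S"


print(split_gahp_line(r"S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $"))
print(is_success("E"), is_success("S 0"))
```

In the transcript, the line `100 0` that follows `S 1` would split into the request ID `100` and a result field `0`; interpreting that result field is left to the caller in this sketch.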

14 14 Web Service Interfaces › SOAP over http or https to the Condor daemons › Use any language or platform (where you can find a decent SOAP library) › Functionality Exposed in current release  Submit jobs  Retrieve job output  Remove/hold/release jobs  Query machine status (fetch ads from collector)  Query job status (fetch ads from the schedd)

15 15 Getting machine status via SOAP (in Java with Axis)
locator = new CondorCollectorLocator();
collector = locator.getcondorCollector(new URL("http://machine:port"));
ads = collector.queryStartdAds("Memory>512");
Because we give you WSDL information, you don't have to write any of these functions.

16 16 New "Grid Universe"
› With the new Grid Universe, always specify a 'gridtype'. So the old "globus" Universe is now declared as:
universe = grid
gridtype = gt2
› Other gridtypes?  GT2 (Globus Toolkit 2)  GT3 (Globus Toolkit 3.2)  GT4 (Globus Toolkit 3.9.5+)  UNICORE (Unicore)  PBS (OpenPBS, PBSPro; technology from INFN)  LSF (Platform LSF; technology from INFN)  CONDOR (thanks gLite!)
[Slide callouts: 'Condor-C', 'Condor-G']

17 17 Other Grid Universe improvements › Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI http://grid.ncsa.uiuc.edu/myproxy/ › Some functionality present in Condor-G added to Condor-C  Forwarding of refreshed credentials (EGEE)  GSI authentication support

18 18 Quill › Job ClassAds information mirrored into an RDBMS › Both active jobs and historical jobs › Benefits BOTH scalability and accessibility [Diagram labels: Master, Startd, Schedd, Quill, Job Queue log, RDBMS, Queue + History Tables]

19 19 BAM! More tasty Condor goodness! › Condor can now transfer job data files larger than 2 GB in size.  On all platforms that support 64-bit file offsets › Real-time spooling of stdout/err/in in any universe, incl. VANILLA  Real-time monitoring of job progress › Condor Installer on Win32 uses MSI (thanks Micron!) › condor_transfer_data (DZero) › STARTD_VM_EXPRS (INFN) › condor_vacate_job tool › condor_status -negotiator

20 20 And More… › New startd policy expression MaxJobRetirementTime  Specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt the job › -peaceful option to condor_off, condor_restart › noop_job = True › Preliminary support for the Tool Daemon Protocol (TDP)  TDP's goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools.  Users can specify a "tool" that should be spawned alongside their regular Condor job.  On Linux, a monitoring tool can attach with ptrace() before the job's main() function is called.

21 21 Hey Jobs! We're watching you! › condor_starter enforces limits  Starter is already monitoring many job characteristics (image size, CPU usage, etc.)  Threshold expressions: use more resources than you said you would, and BAM! › Local Universe  Just like Scheduler Universe, but there is a condor_starter  All advantages of the starter [Diagram labels: Submit side with schedd, starter, job; Execute side with startd, starter, job. Caption: "Hey, job, behave or else!"]

22 22 Condor with Firewalls and NATs: GCB in v6.8.0! [Diagram labels: Server app, GCB layer, TCP/IP; Client app, GCB layer, TCP/IP; Relay point; operations: listen, accept, connect, translate]

23 23 Binding & Registration [Diagram labels: Server with socket B, GCB lib, Broker, address X. Calls: B = socket(); bind(B, ANY); getsockname(B, X). Messages: BIND (B), X. Notes: locally bound to B; officially bound to X; registered (X, B)]

24 24 GCB: Public-Private Connection [Diagram labels: Server with socket B and GCB lib, address X; Client with socket A and GCB lib. Calls and messages: connect(A, X); CONNECT (X); PASSIVE; CONTACT (A)]

25 25 GCB: Private-Private Connection [Diagram labels: Server with socket B and GCB lib, address X; Client with socket A and GCB lib; address Y. Calls and messages: connect(A, X); CONNECT (X); ACTIVE (X); CONTACT (Y)]

26 26 From CondorWeek 2003: › New version of ClassAds into Condor  Conditionals !! if/then/else  Aggregates (lists, nested classads)  Built-in functions String operations, pattern matching, time operators, unit conversions  Clean implementations in C++ and Java  ClassAd collections › This may become v6.8.0 Is this TODD ?!?!

27 27 ClassAd Improvements in Condor! › Conditionals  IfThenElse(condition,then,else) › String functions  Strcat(), strcmp(), toUpper(), etc. › StringList functions  Example of a "string list" (CSV style): Mylist = "Joe, Jon, Jeff, Jim, Jake"  StrListContains(), StrListAppend(), StrListRemove(), etc. › Others  Type test, some math functions
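To make the "string list" idea concrete, here is a small Python sketch of what functions like these could do on a comma-separated string. The Python function names and their exact semantics (whitespace trimmed around items, exact-match comparison, `", "` used when rejoining) are assumptions for illustration only; they are not taken from the ClassAd implementation.

```python
from typing import List


def _items(strlist: str) -> List[str]:
    """Split a CSV-style string list into trimmed, non-empty items."""
    return [item.strip() for item in strlist.split(",") if item.strip()]


def str_list_contains(strlist: str, item: str) -> bool:
    # Whole-item match, so "Jo" does not match "Joe" or "Jon".
    return item in _items(strlist)


def str_list_append(strlist: str, item: str) -> str:
    return ", ".join(_items(strlist) + [item])


def str_list_remove(strlist: str, item: str) -> str:
    return ", ".join(i for i in _items(strlist) if i != item)


mylist = "Joe, Jon, Jeff, Jim, Jake"
print(str_list_contains(mylist, "Jeff"))
print(str_list_append(mylist, "Jill"))
print(str_list_remove(mylist, "Jon"))
```

Treating the list as a plain string keeps it compatible with attribute values that are already strings, at the cost of re-parsing on every call.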

28 28 Security › New Service: condor_credd  Store, refresh, forward credentials  Right now used just by Stork; role will expand (AFS authentication?) › Common Authentication Methods between Condor on Unix and Win32  Kerberos 1.4. Additional hopeful benefit: authentication against MS Active Directory!?!  GSI on Win32? › Starter only runs known executables › Shadow only reads/writes to given subdirectories

29 29 Accounting Groups and Group Quota Support › Account Group (w/ CORE Feature Animation) › Account Group Quota (inspiration CDF @ Fermi)  Sample Problem: a cluster w/ 500 nodes; the Chemistry Dept purchased 100 of them, so Chemistry users must always be able to use them  Could use Machine Rank… but this ties to specific machines  Or could use the new group support: each group can be given a quota in the config file; job ads can specify group membership; group quotas are satisfied first; accounting is by user and by group
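The "group quotas are satisfied first" rule can be illustrated with a toy two-pass allocation in Python. This is a simplified sketch, not the negotiator's algorithm: the function name is hypothetical, slots are treated as interchangeable, and leftover slots are simply handed out in the order the groups are listed.

```python
from typing import Dict


def allocate(total_slots: int, quotas: Dict[str, int], demand: Dict[str, int]) -> Dict[str, int]:
    """Grant slots to groups: quotas first, then leftovers in listed order."""
    granted = {group: 0 for group in demand}
    free = total_slots

    # Pass 1: every group gets up to its quota (groups without one get 0 here).
    for group, want in demand.items():
        give = min(want, quotas.get(group, 0), free)
        granted[group] += give
        free -= give

    # Pass 2: whatever is still free goes to remaining demand.
    for group, want in demand.items():
        give = min(want - granted[group], free)
        granted[group] += give
        free -= give

    return granted


# The slide's sample problem: 500 nodes, Chemistry owns 100 of them.
print(allocate(500, {"chemistry": 100}, {"physics": 600, "chemistry": 150}))
```

Even though Physics asks first and wants more than the whole cluster, Chemistry still receives its 100 slots in this model, and no specific machines are tied to the group.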

30 30 Improved Scalability › Much faster negotiation  SIGNIFICANT_ATTRIBUTES determined automatically  Schedd uses non-blocking TCP connects to the startd  Negotiator caching  Collector Forks for queries  More…

31 31 Parallel Universe › SSHD running alongside your job!  Also works with VANILLA, JAVA universe! › Support for parallel jobs  Other than just MPICH, e.g. LAM, SCore  Nice for testing environments

32 32 What’s brewing for after v6.8.0? › More data, data, data  Stork distributed w/ v6.8.0, incl DAGMan support  NeST manage Condor spool files, ckpt servers  Stork used for Condor job data transfers › Virtual Machines (and the future of Standard Universe) › Condor and Shibboleth (with Georgetown Univ) › Least Privilege Security Access (with U of Cambridge) › Dynamic Temporary Accounts (with EGEE, Argonne) › Leverage Database Technology (with UW DB group) › ‘Automatic’ Glideins (NMI Nanohub – Purdue, U of Florida) › Easier Updates › New ClassAds (integration with Optena) › Hierarchical Matchmaking Can I commit this to CVS??

33 33 A Tree of Matchmakers › Fault Tolerance › Flexibility › MMs now manage other MMs [Diagram labels: BIG 10 MM, UW MM, CS MM, Theory Group MM, Erdos MM; CC; R; A; Match; "I need more resources"]
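One way to picture a tree of matchmakers is a request that climbs toward the root until some level can satisfy it. The Python sketch below is a deliberate simplification under that assumption: each matchmaker is modeled as owning a count of free resources, which the slide does not specify, and the `Matchmaker` class is hypothetical. Only the matchmaker names come from the diagram.

```python
from typing import Optional


class Matchmaker:
    """Hypothetical matchmaker node with a pool of free resources."""

    def __init__(self, name: str, free: int, parent: Optional["Matchmaker"] = None):
        self.name = name
        self.free = free
        self.parent = parent

    def request(self, count: int) -> Optional[str]:
        """Return the name of the matchmaker that satisfied the request, if any."""
        mm: Optional[Matchmaker] = self
        while mm is not None:
            if mm.free >= count:
                mm.free -= count
                return mm.name
            mm = mm.parent  # "I need more resources": ask one level up
        return None


big10 = Matchmaker("BIG 10 MM", free=100)
uw = Matchmaker("UW MM", free=10, parent=big10)
cs = Matchmaker("CS MM", free=2, parent=uw)

print(cs.request(1))   # small request, served locally
print(cs.request(50))  # too large for CS and UW, escalates further up
```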

34 34 Thank you!

