Published by Kristopher Douglas. Modified over 9 years ago.
1
Condor RoadMap
Paradyn/Condor Week 2005
Todd Tannenbaum
Computer Sciences Department, University of Wisconsin-Madison
condor-admin@cs.wisc.edu
http://www.cs.wisc.edu/condor
2
Terms of License Any and all dates in these slides are relative from a date hereby unspecified in the event of a likely situation involving a frequent condition. Viewing, use, reproduction, display, modification and redistribution of these slides, with or without modification, in source and binary forms, is permitted only after a deposit by said user into PayPal accounts registered to Todd Tannenbaum ….
3
Outline
› Version 6.7.x to Version 6.8.0
  Availability: Failover, fault tolerance
  Scalability: Resources, jobs, matchmaking framework, files
  Accessibility: APIs, more Grid middleware, network firewalls
  Everything else: New functionality, new ports, etc.
› And after that?
p.s. Still here? Thank you for your generous PayPal pledge!
4
Current Status
› Current Stable Release: Version 6.6.9
› Current Development Release: Version 6.7.5
› Next Stable Release: Version 6.8.0
  Once per year
  Code freeze end of April
  Release end of May
5
Existing Ports
  Operating system                                              Architecture
  Digital UNIX 4.0                                              Alpha
  AIX 5.2 (clipped)                                             PowerPC
  Tru64 5.1 (clipped)                                           Alpha
  HP UNIX 10.20                                                 PA RISC
  HP UNIX 11.00 (clipped using hpux10.20 32 bit)                PA RISC
  Irix 6.5 (clipped)                                            SGI
  Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped)     Alpha
  Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3               Intel x86
  Linux 2.4.x (glibc 2.2) - Red Hat 8                           Intel x86
  Linux 2.4.x (glibc 2.3) - Red Hat 9                           Intel x86
  Enterprise Server 8.1                                         Intel Itanium
  Solaris 8                                                     Sparc
  Solaris 9                                                     Sparc
  Microsoft Windows 2000 or XP (clipped)                        Intel x86
6
New Ports
› Introduced in v6.6.x
  MacOSX ("clipped") on PowerPC
  Debian Linux 3.1 on Intel x86
  Fedora Core 1 on Intel x86
  Red Hat Enterprise Linux 3 on Intel x86
  SuSE Linux Enterprise Server 8.1 on Intel Itanium
› Introduced in v6.7.x
  AIX 5.1 ("clipped") on PowerPC
  Fedora Core 2 on x86
  Fedora Core 3 on x86
  SuSE 8.0 ("clipped") on AMD64
  Solaris 10 ("clipped") on Sparc
  Scientific Linux (Release 303) on x86
› Still to be introduced in v6.7.x (before v6.8.0)
  HPUX 11i 64-bit pa-risc
  RHEL 4 on x86
  "native" 64 bit AMD Linux
Sigh… “Psilord” – The Condor porting doctor. Talk to him in person tomorrow.
7
Job Progress continues if connection is interrupted
› For Vanilla and Java universe jobs, Condor now supports reestablishing the connection between the submitting and executing machines:
  If there is a network outage between the execute and submit machines
  If the submit machine restarts
› To take advantage of this feature, users set JobLeaseDuration in their job's submit description file. For example:
  job_lease_duration = 1200
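The lease idea on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Condor's implementation: the function name job_may_continue and the use of plain timestamps in seconds are assumptions made for the example.

```python
def job_may_continue(last_contact: float, now: float, lease_duration: float) -> bool:
    """Return True while the lease granted at the last contact is still valid.

    Hypothetical helper: all arguments are in seconds.
    """
    return (now - last_contact) <= lease_duration


# With a lease of 1200 seconds, a 10-minute outage is tolerated...
print(job_may_continue(last_contact=0, now=600, lease_duration=1200))
# ...but a 30-minute outage outlives the lease.
print(job_may_continue(last_contact=0, now=1800, lease_duration=1200))
```

A longer lease tolerates longer outages, at the cost of a job possibly running unattended for that long after its submit machine is gone.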
8
Job Progress continues if submit machine fails
› Condor can now support a submit machine “hot spare”
  If your submit machine A is down for longer than N minutes, a second machine B can take over
  Requires a shared filesystem between machines A and B
9
Central Manager Failover
› The Condor Central Manager has two services
› condor_collector
  A list of collectors is now supported
› condor_negotiator (matchmaker)
  If it fails, an election process runs and another negotiator takes over
  Contributed technology from Technion
10
Some Condor APIs
› Command Line tools
  condor_submit, condor_q, etc.
› Condor Perl Module
› Chirp
› Checkpoint Library API
› MW --- improved!
› DRMAA
› Condor Grid ASCII Protocol (GAHP)
› Web Service Interface
11
DRMAA
› Distributed Resource Management Application API (DRMAA)
  GGF Working Group
  An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems
› An API with C and Java bindings, not a protocol
› Scope
  Does: job submission, monitoring, control, final status
  Does not: file staging, reservations, security, …
12
Condor GAHP
› The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout
› Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events
13
GAHP, cont.
Example:
  R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
  S: GRAM_PING 100 vulture.cs.wisc.edu/fork
  R: E
  S: RESULTS
  R: E
  S: COMMANDS
  R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
  S: VERSION
  R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
  S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
  R: S
  S: GRAM_PING 100 vulture.cs.wisc.edu/fork
  R: S
  S: RESULTS
  R: S 0
  S: RESULTS
  R: S 1
  R: 100 0
  S: QUIT
  R: S
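The exchange above hints at how a client might tokenize one protocol line. The sketch below is a hedged illustration, not the GAHP reference parser: it assumes fields are separated by single spaces and that a backslash escapes the character that follows it, as the escaped spaces in the version string suggest. The function name split_gahp_line is hypothetical.

```python
from typing import List


def split_gahp_line(line: str) -> List[str]:
    """Split a line on unescaped spaces; a backslash escapes the next character."""
    fields: List[str] = []
    current: List[str] = []
    escaped = False
    for ch in line:
        if escaped:
            # Previous character was a backslash: keep this one literally.
            current.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == " ":
            # Unescaped space ends the current field.
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


reply = split_gahp_line(r"S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $")
succeeded = reply[0] == "S"  # replies in the example start with S or E
print(succeeded, reply[1:])
```

Treating the first field as the status keeps the caller's logic simple: everything after it is payload for that particular command.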
14
Web Service Interfaces
› SOAP over http or https to the Condor daemons
› Use any language or platform (where you can find a decent SOAP library)
› Functionality exposed in current release
  Submit jobs
  Retrieve job output
  Remove/hold/release jobs
  Query machine status (fetch ads from the collector)
  Query job status (fetch ads from the schedd)
15
Getting machine status via SOAP (in Java with Axis)
  locator = new CondorCollectorLocator();
  collector = locator.getcondorCollector(new URL("http://machine:port"));
  ads = collector.queryStartdAds("Memory>512");
Because we give you WSDL information, you don’t have to write any of these functions.
16
New “Grid Universe”
› With the new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared as:
  universe = grid
  gridtype = gt2
› Other gridtypes?
  GT2 (Globus Toolkit 2)
  GT3 (Globus Toolkit 3.2)
  GT4 (Globus Toolkit 3.9.5+)
  UNICORE (Unicore)
  PBS (OpenPBS, PBSPro – technology from INFN)
  LSF (Platform LSF – technology from INFN)
  CONDOR (thanks gLite!)
(Slide callouts: ‘Condor-C’, ‘Condor-G’)
17
Other Grid Universe improvements
› Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI
  http://grid.ncsa.uiuc.edu/myproxy/
› Some functionality present in Condor-G added to Condor-C
  Forwarding of refreshed credentials (EGEE)
  GSI authentication support
18
Quill
› Job ClassAds information mirrored into an RDBMS
› Both active jobs and historical jobs
› Benefits BOTH scalability and accessibility
(Diagram labels: Master, Startd, …, Schedd, Quill, Job Queue log, RDBMS, Queue + History Tables)
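Mirroring job ads into a relational database can be sketched with Python's built-in sqlite3 module. The table layout, the mirror_ad helper and the sample ads below are assumptions made for illustration; they are not Quill's actual schema or code.

```python
import sqlite3

# In-memory database standing in for the RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs ("
    " cluster_id INTEGER, proc_id INTEGER, attr TEXT, value TEXT,"
    " PRIMARY KEY (cluster_id, proc_id, attr))"
)


def mirror_ad(conn: sqlite3.Connection, cluster_id: int, proc_id: int, ad: dict) -> None:
    """Store one attribute per row; re-mirroring an ad overwrites old values."""
    rows = [(cluster_id, proc_id, name, str(value)) for name, value in ad.items()]
    conn.executemany("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?, ?)", rows)


mirror_ad(conn, 1, 0, {"Owner": "alice", "JobStatus": 2})
mirror_ad(conn, 2, 0, {"Owner": "bob", "JobStatus": 1})

# Ordinary SQL can now answer questions about the queue.
running = conn.execute(
    "SELECT cluster_id FROM jobs WHERE attr = 'JobStatus' AND value = '2'"
).fetchall()
print(running)
```

Once the ads are in tables, any SQL client can query them without talking to the schedd, which is the accessibility and scalability benefit the slide points at.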
19
BAM! More tasty Condor goodness!
› Condor can now transfer job data files larger than 2 GB in size.
  On all platforms that support 64-bit file offsets
› Real-time spooling of stdout/err/in in any universe, incl. VANILLA
  Real-time monitoring of job progress
› Condor Installer on Win32 uses MSI (thanks Micron!)
› condor_transfer_data (DZero)
› STARTD_VM_EXPRS (INFN)
› condor_vacate_job tool
› condor_status -negotiator
20
And More…
› New startd policy expression MaxJobRetirementTime
  Specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt the job
› -peaceful option to condor_off, condor_restart
› noop_job = True
› Preliminary support for the Tool Daemon Protocol (TDP)
  TDP's goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools.
  Users can specify a "tool" that should be spawned alongside their regular Condor job.
  On Linux, a monitoring tool can attach with ptrace() before the job's main() function is called.
21
Hey Jobs! We’re watching you!
› condor_starter enforces limits
  Starter is already monitoring many job characteristics (image size, cpu usage, etc.)
  Threshold expressions
  Use more resources than you said you would, and BAM!
› Local Universe
  Just like Scheduler Universe, but there is a condor_starter
  All advantages of the starter
(Diagram: Submit side with schedd, starter, job; Execute side with startd, starter, job; callout: “Hey, job, behave or else!”)
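The threshold idea can be sketched as a comparison between what a job declared and what the starter observes. This is only an illustration under stated assumptions: the function exceeded_limits, the attribute names and the numbers are hypothetical and do not come from Condor.

```python
from typing import Dict


def exceeded_limits(declared: Dict[str, float], observed: Dict[str, float]) -> Dict[str, float]:
    """Return the observed values that exceed what the job declared."""
    return {
        name: observed[name]
        for name, limit in declared.items()
        if observed.get(name, 0) > limit
    }


# Hypothetical attribute names and values.
declared = {"image_size_kb": 100_000, "cpu_seconds": 3_600}
observed = {"image_size_kb": 250_000, "cpu_seconds": 1_200}

violations = exceeded_limits(declared, observed)
if violations:
    print("BAM! over limit:", violations)
```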
22
Condor with Firewalls and NATs: GCB in v6.8.0!
(Diagram: a Server app and a Client app, each on top of a GCB layer over TCP/IP, joined through a Relay point; labels: listen, accept, connect, translate)
23
Binding & Registration
(Diagram: Server with GCB lib at B; Broker at X)
  Server calls: B = socket(); bind(B, ANY); getsockname(B, X)
  Messages exchanged with the Broker: BIND (B), X
  Server: locally bound to B, officially bound to X
  Broker: registered (X, B)
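The registration step on this slide amounts to the broker keeping a table from official addresses (X) to local addresses (B). The Python sketch below models only that bookkeeping; the Broker class, its port numbering and the host names are assumptions for illustration, and the real GCB broker also takes part in connection setup, which is not modeled here.

```python
from typing import Dict, Optional, Tuple

Address = Tuple[str, int]  # (host, port)


class Broker:
    """Hypothetical broker that hands out official addresses."""

    def __init__(self, public_host: str, first_port: int = 9000) -> None:
        self.public_host = public_host
        self.next_port = first_port
        self.registered: Dict[Address, Address] = {}

    def bind(self, local: Address) -> Address:
        """Handle BIND(B): allocate an official address X and register (X, B)."""
        official = (self.public_host, self.next_port)
        self.next_port += 1
        self.registered[official] = local
        return official

    def lookup(self, official: Address) -> Optional[Address]:
        """Find the local address behind an official one, if any."""
        return self.registered.get(official)


broker = Broker("broker.example.org")
B = ("10.0.0.5", 4321)   # private address the server bound locally
X = broker.bind(B)       # address the server advertises to others
print(X, "->", broker.lookup(X))
```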
24
GCB: Public-Private Connection
(Diagram: Client with GCB lib at A; Server with GCB lib at B; Broker at X)
  Client calls: connect(A, X)
  Messages: CONNECT (X), PASSIVE, CONTACT (A)
25
GCB: Private-Private Connection
(Diagram: Client with GCB lib at A; Server with GCB lib at B; Broker at X; relay address Y)
  Client calls: connect(A, X)
  Messages: CONNECT (X), ACTIVE (X), Y, CONTACT (Y)
26
From CondorWeek 2003:
› New version of ClassAds into Condor
  Conditionals !!
    if/then/else
  Aggregates (lists, nested classads)
  Built-in functions
    String operations, pattern matching, time operators, unit conversions
  Clean implementations in C++ and Java
  ClassAd collections
› This may become v6.8.0
Is this TODD ?!?!
27
ClassAd Improvements in Condor!
› Conditionals
  IfThenElse(condition,then,else)
› String functions
  Strcat(), strcmp(), toUpper(), etc.
› StringList functions
  Example of a “string list” (CSV style):
  Mylist = “Joe, Jon, Jeff, Jim, Jake”
  StrListContains(), StrListAppend(), StrListRemove(), etc.
› Others
  Type test, some math functions
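The string-list functions named above can be mimicked in Python to show the intended behavior. This is a sketch, not the ClassAd implementation: the argument order, the comma-plus-space output format and exact-match comparison are assumptions, and the Python names merely echo the slide's function names.

```python
from typing import List


def _items(string_list: str) -> List[str]:
    """Split a CSV-style string list and trim whitespace around each item."""
    return [item.strip() for item in string_list.split(",") if item.strip()]


def str_list_contains(string_list: str, name: str) -> bool:
    return name in _items(string_list)


def str_list_append(string_list: str, name: str) -> str:
    return ", ".join(_items(string_list) + [name])


def str_list_remove(string_list: str, name: str) -> str:
    return ", ".join(item for item in _items(string_list) if item != name)


def if_then_else(condition: bool, then_value, else_value):
    return then_value if condition else else_value


mylist = "Joe, Jon, Jeff, Jim, Jake"
print(str_list_contains(mylist, "Jeff"))
print(str_list_append(mylist, "Jan"))
print(str_list_remove(mylist, "Jon"))
print(if_then_else(str_list_contains(mylist, "Todd"), "known", "unknown"))
```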
28
Security
› New Service: condor_credd
  Store, refresh, forward credentials
  Right now used just by stork; role will expand (AFS authentication?)
› Common Authentication Methods between Condor on Unix and Win32
  Kerberos 1.4
  Additional hopeful benefit: Authentication against MS Active Directory!?!
  GSI on Win32 ?
› Starter only runs known executables
› Shadow only reads/writes to given subdirectories
29
Accounting Groups and Group Quota Support
› Account Group (w/ CORE Feature Animation)
› Account Group Quota (inspiration CDF @ Fermi)
  Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them
  Could use Machine Rank… but this ties to specific machines
  Or could use new group support
    Each group can be given a quota in config file
    Job ads can specify group membership
    Group quotas are satisfied first
    Accounting by user and by group
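The "group quotas are satisfied first" rule from the sample problem can be sketched as a two-pass allocation. This is a simplified model under stated assumptions, not the negotiator's algorithm: the allocate function, the "physics" group and the demand figures are hypothetical, and leftover slots are handed out in dictionary order purely to keep the example short.

```python
from typing import Dict


def allocate(total_slots: int, quotas: Dict[str, int], demand: Dict[str, int]) -> Dict[str, int]:
    """Give each group up to its quota first, then share out what is left."""
    remaining = total_slots
    allocation: Dict[str, int] = {}

    # Pass 1: group quotas are satisfied first.
    for group, wanted in demand.items():
        granted = min(wanted, quotas.get(group, 0), remaining)
        allocation[group] = granted
        remaining -= granted

    # Pass 2: leftover slots go to any unmet demand.
    for group, wanted in demand.items():
        extra = min(wanted - allocation[group], remaining)
        allocation[group] += extra
        remaining -= extra

    return allocation


# 500-node cluster, Chemistry owns 100 nodes; another group floods the queue.
result = allocate(
    total_slots=500,
    quotas={"chemistry": 100},
    demand={"physics": 600, "chemistry": 150},
)
print(result)
```

Unlike Machine Rank, nothing here names a particular machine: Chemistry is guaranteed a count of slots, not a fixed set of nodes.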
30
Improved Scalability
› Much faster negotiation
  SIGNIFICANT_ATTRIBUTES determined automatically
  Schedd uses non-blocking TCP connects to the startd
  Negotiator caching
  Collector forks for queries
  More…
31
Parallel Universe
› SSHD running alongside your job!
  Also works with VANILLA, JAVA universe!
› Support for parallel jobs
  Other than just MPICH, e.g. Lam, SCore
  Nice for testing environments
32
What’s brewing for after v6.8.0?
› More data, data, data
  Stork distributed w/ v6.8.0, incl. DAGMan support
  NeST manage Condor spool files, ckpt servers
  Stork used for Condor job data transfers
› Virtual Machines (and the future of Standard Universe)
› Condor and Shibboleth (with Georgetown Univ)
› Least Privilege Security Access (with U of Cambridge)
› Dynamic Temporary Accounts (with EGEE, Argonne)
› Leverage Database Technology (with UW DB group)
› ‘Automatic’ Glideins (NMI Nanohub – Purdue, U of Florida)
› Easier Updates
› New ClassAds (integration with Optena)
› Hierarchical Matchmaking
Can I commit this to CVS??
33
A Tree of Matchmakers
› Fault Tolerance
› Flexibility
› MMs now manage other MMs
(Diagram: a tree of matchmakers labeled BIG 10 MM, UW MM, CS MM, Theory Group MM and Erdos MM, with nodes marked C and R; callouts: “I need more resources”, “A Match”)
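One way to picture a tree of matchmakers is a request that climbs toward the root until some matchmaker has enough free resources. The sketch below is an assumption-laden illustration, not a Condor design: the Matchmaker class, the parent chain (Theory Group under CS under UW under BIG 10) and the resource counts are all hypothetical.

```python
from typing import Optional


class Matchmaker:
    """Hypothetical matchmaker (MM) that escalates requests to its parent."""

    def __init__(self, name: str, free: int = 0, parent: Optional["Matchmaker"] = None) -> None:
        self.name = name
        self.free = free
        self.parent = parent

    def request(self, count: int) -> Optional[str]:
        """Return the name of the MM that satisfied the request, or None."""
        mm: Optional[Matchmaker] = self
        while mm is not None:
            if mm.free >= count:
                mm.free -= count
                return mm.name
            mm = mm.parent  # "I need more resources": ask one level up
        return None


big10 = Matchmaker("BIG 10 MM", free=500)
uw = Matchmaker("UW MM", free=50, parent=big10)
cs = Matchmaker("CS MM", free=2, parent=uw)
theory = Matchmaker("Theory Group MM", free=0, parent=cs)

print(theory.request(1))    # satisfied close to home
print(theory.request(10))   # escalates past CS
print(theory.request(1000)) # nobody in the chain can help
```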
34
Thank you!