Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison What’s new in Condor?


1 Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison condor-admin@cs.wisc.edu http://www.cs.wisc.edu/condor What’s new in Condor? Condor Week 2006

2 2 So Todd… where is v6.8? Well, v6.7 has been a challenge…

3 3 inint

4 4

5 5 Around since the 80’s

6 6 80’s Mullet Boy

7 7 100 people surveyed! Favorite “ility” ?

8 8 Deployability!

9 9 Existing Ports
 Digital UNIX 4.0 on Alpha
 AIX 5.2 (clipped) on PowerPC
 Tru64 5.1 (clipped) on Alpha
 HP UNIX 10.20 on PA RISC
 HP UNIX 11.00 (clipped using hpux10.20 32 bit) on PA RISC
 Irix 6.5 (clipped) on SGI
 Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped) on Alpha
 Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 on Intel x86
 Linux 2.4.x (glibc 2.2) - Red Hat 8 on Intel x86
 Linux 2.4.x (glibc 2.3) - Red Hat 9 on Intel x86
 Enterprise Server 8.1 on Intel Itanium
 Solaris 8 on Sparc
 Solaris 9 on Sparc
 Microsoft Windows 2000 or XP (clipped) on Intel x86

10 10 New Ports › Introduced in v6.6.x  MacOSX (“clipped”) PowerPC  Debian Linux 3.1 Intel x86  Fedora Core 1 Intel x86  Red Hat Enterprise Linux 3 Intel x86  SuSE Linux Enterprise Server 8.1 Intel Itanium › Introduced in v6.7.x  AIX 5.1 (“clipped”) PowerPC  Fedora Core 2 on x86  Fedora Core 3 on x86  SuSE 8.0 (“clipped”) on AMD64  Solaris 10 (“clipped”) on Sparc  Scientific Linux (Release 303) on x86 › Still to be introduced in v6.7.x (before v6.8.0)  HPUX 11i 64-bit pa-risc  RHEL 4 on x86  “native” 64 bit AMD Linux Sigh… “Psilord” – The Condor porting doctor. Talk to him in person tomorrow.

11 11 Porting Table › See http://www.cs.wisc.edu/condor/porting/port_table.html › Highlights  Almost every 32-bit Linux flavor as “full”  Every other Unix, MacOS and Windows available as “clipped”  Solaris 10 and HP-UX 11.x now “clipped”  FreeBSD 4 contribution from Yahoo!, added 5 and 6  X86_64 Linux: “full” running in the lab

12 12 Backfill Jobs › Execute machines will run a locally staged executable when otherwise idle. › Currently designed for BOINC.
    # Turn on backfill functionality, and use BOINC
    ENABLE_BACKFILL = TRUE
    BACKFILL_SYSTEM = BOINC
    # Spawn a backfill job if we've been Unclaimed for more than 5 minutes
    START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
    # Evict a backfill job if the machine is busy (based on keyboard
    # activity or cpu load)
    EVICT_BACKFILL = $(MachineBusy)
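The two policy expressions on this slide can be read as a small decision rule. The following Python sketch is only an illustration of that reading, not Condor code: `backfill_action` and its parameters are hypothetical names, and it assumes the state timer is measured in seconds.

```python
MINUTE = 60  # seconds, mirroring the $(MINUTE) macro used on the slide


def backfill_action(state_timer, backfill_running, machine_busy):
    """Return 'start', 'evict', or 'none' for one evaluation of the policy.

    state_timer      -- seconds spent in the current state (assumed unit)
    backfill_running -- True if a backfill job is currently running
    machine_busy     -- stand-in for the $(MachineBusy) expression
    """
    if backfill_running:
        # EVICT_BACKFILL = $(MachineBusy)
        return "evict" if machine_busy else "none"
    # START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
    if state_timer > 5 * MINUTE:
        return "start"
    return "none"
```

For example, a machine that has been idle for 301 seconds with no backfill job would get `"start"`, while one at exactly 300 seconds would not, because the comparison is strict.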

13 13 Joining Condor’s Einstein@Home Compute Team › If you’re running BOINC backfill jobs in Condor and want to use your cycles to help another UW project, please join the Einstein@Home computation › Join the “Condor Backfill” team:  http://einstein.phys.uwm.edu/team_display.php?teamid=5994  http://einstein.phys.uwm.edu/create_account_form.php?teamid=5994

14 14 More “deployability” › “Personal” Condor Support on Win32  LocalSystem not required › MSI installer on Win32 (thanks Micron!) › New tools  condor_cold_start and  condor_cold_stop Safe, dynamic Condor service deployment. More info @ Research BOF 9am Rm219

15 15 100 people surveyed! Favorite “ility” ?

16 16 100 people surveyed! Favorite “ility” ? Availability!

17 17 Condor with Firewalls and NATs: GCB in v6.8.0! [Diagram: a client app and a server app, each on top of a GCB layer over TCP/IP, connected through a relay point; labels: listen, accept, connect, translate]

18 18 Job Progress continues if connection is interrupted › Now for Vanilla, Java, and Grid universe jobs, Condor supports reestablishment of the connection between the submitting and executing machines.  If network outage between execute and submit machine  If submit machine restarts  Grid Universe was tricky… › To take advantage of this feature, put the following line into your job’s submit description file: JobLeaseDuration = For example: job_lease_duration = 1200
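One way to picture the lease is as a timer that the execute side keeps alongside the job: as long as the submit side reconnects and renews it in time, the job keeps running. The Python sketch below illustrates that idea only; the `JobLease` class and its methods are hypothetical and are not part of Condor, and the duration is assumed to be in seconds.

```python
class JobLease:
    """Toy model of a job lease (illustrative only)."""

    def __init__(self, duration, now):
        self.duration = duration      # e.g. 1200, as in the slide's example
        self.last_renewed = now       # time of the last successful contact

    def renew(self, now):
        # Called whenever the submit side re-establishes contact.
        self.last_renewed = now

    def expired(self, now):
        # The job may keep running until the lease runs out unrenewed.
        return now - self.last_renewed > self.duration
```

With a 1200 second lease granted at time 0, an outage that ends at time 1000 and renews the lease keeps the job alive until at least time 2200.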

19 19 Job Progress continues if submit machine fails › Condor can now support a submit machine “hot spare” (schedd failover)  If your submit machine A is down for longer than N minutes, a second machine B can take over  Requires shared filesystem between machines A and B

20 20 Central Manager Failover › Condor Central Manager has two services › condor_collector  Now a list of collectors is supported › condor_negotiator (matchmaker)  If fails, election process, another takes over  Accounting state is periodically replicated  Contributed technology from Technion

21 21 Reliability, cont. › Time shifts › Quill › Closing windows of vulnerability

22 22 100 people surveyed! Favorite “ility” ?

23 23 100 people surveyed! Favorite “ility” ? Lightweight?

24 24 100 people surveyed! Favorite “ility” ? Lightweight?

25 25 100 people surveyed! Favorite “ility” ?

26 26 100 people surveyed! Favorite “ility” ? Functionality!

27 27 Security › Common Authentication Methods between Condor on Unix and Win32  Kerberos 1.4 Additional hopeful benefit: Authentication against MS Active Directory!  SSL  Password (shared secret) › Starter only runs known executables › More powerful, unified map file(s) › GSI credentials delegated

28 28 With Condor on Win32, it would be nice if … › My jobs could access my files just like the condor_shadow can › I didn’t have to tie my execute machines to a single account › I didn’t have to run condor_store_cred from every machine where my credential is needed (thank you Optena)

29 29 The Windows CredD › A centralized repository for user passwords
    C:\>condor_store_cred add
    Account: gquinn@CROW
    Enter password:
    Operation succeeded.
[Diagram labels: credd, “store password”, stored passwords y0urs and myp4sswd]

30 30 The Windows CredD › Submit machines can use the CredD to impersonate the user in the shadow [Diagram labels: schedd, shadow, “fetch password”, stored passwords y0urs and myp4sswd]

31 31 The Windows CredD › Execute machines can use the CredD to run jobs as the submitting user! [Diagram labels: starter, condor_exec.exe, “fetch password”, stored passwords y0urs and myp4sswd]

32 32 Running Jobs as Submitting User › In submit file:  Run_job_as_owner = true › In config file on submit and execute nodes:
    CREDD_HOST = vault.cs.wisc.edu
    STARTER_ALLOW_RUNAS_OWNER = True
    CREDD_CACHE_LOCALLY = True

33 33 Some Condor APIs › Command Line tools  condor_submit, condor_q, etc  -format, -constraint, -xml › Condor Perl Module › Chirp › Checkpoint Library API › MW --- improved! › DRMAA (Works w/ Win32, on SourceForge) › Condor Grid ASCII Protocol (GAHP) › Web Service Interface

34 34 DRMAA › Distributed Resource Management Application API (DRMAA)  GGF Working Group  An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems › An API with C and Java bindings  not a protocol › Scope  Does: job submission, monitoring, control, final status  Does not: file staging, reservations, security, …

35 35 Condor GAHP › The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout › Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events

36 36 GAHP, cont Example:
    R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
    S: GRAM_PING 100 vulture.cs.wisc.edu/fork
    R: E
    S: RESULTS
    R: E
    S: COMMANDS
    R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
    S: VERSION
    R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
    S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
    R: S
    S: GRAM_PING 100 vulture.cs.wisc.edu/fork
    R: S
    S: RESULTS
    R: S 0
    S: RESULTS
    R: S 1
    R: 100 0
    S: QUIT
    R: S
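The replies in this example are lines of space-separated fields, where a backslash escapes a literal space (as in the version string) and the first field of a command reply is S for success or E for error. A minimal Python sketch of reading such lines follows; it assumes a backslash simply escapes the next character, and `split_gahp_line` and `command_succeeded` are hypothetical helper names rather than anything shipped with Condor.

```python
def split_gahp_line(line):
    # Split one protocol line on unescaped spaces.
    # A backslash makes the following character literal, so an
    # escaped space stays inside the current field.
    fields, current, escaped = [], [], False
    for ch in line:
        if escaped:
            current.append(ch)
            escaped = False
        elif ch == "\\":
            escaped = True
        elif ch == " ":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields


def command_succeeded(reply):
    # In the example, a reply starting with "S" means the command was
    # accepted and "E" means it was rejected.
    return split_gahp_line(reply)[0] == "S"
```

Applied to the transcript above, the first `GRAM_PING` reply (`E`) is reported as a failure, while the one sent after `INITIALIZE_FROM_FILE` (`S`) is reported as a success; the later line `100 0` splits into the two fields `100` and `0`.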

37 37 Web Service Interfaces › SOAP over http or https to the Condor daemons › Use any language or platform (where you can find a decent SOAP library) › Functionality Exposed in current release  Submit jobs  Retrieve job output  Remove/hold/release jobs  Query machine status (fetch ads from collector)  Query job status (fetch ads from the schedd)

38 38 Getting machine status via SOAP (in Java with Axis)
    locator = new CondorCollectorLocator();
    collector = locator.getcondorCollector(new URL("http://machine:port"));
    ads = collector.queryStartdAds("Memory>512");
Because we give you WSDL information you don’t have to write any of these functions.

39 39 More Functionality changes.. › FINALLY, clean/consistent cross-platform quoting rules for arguments and environment variables (see condor_submit man page) › Schedd can run HawkEye modules, just like the Startd  Enables monitoring on the submit machine › condor_history : now faster than a snail, and cleans up droppings. › DeferralTime, DeferralWindow  Coordinated starts › BIND_ALL_INTERFACES in config file › WANT_REMOTE_IO in job ClassAd

40 40 ClassAd Functions in Condor! › Conditionals  IfThenElse(condition,then,else) › String functions  Strcat(), strcmp(), toUpper(), etc. › StringList functions  Example of a “string list” (CSV style) Mylist = “Joe, Jon, Jeff, Jim, Jake”  StrListContains(), StrListAppend(), StrListRemove(), etc. › Others  Regular expressions, arithmetic, etc…
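To make the "string list" idea concrete, here is a small Python sketch of what contains/append/remove operations on a comma-separated list could look like. It is an illustration of the concept only: the function names are hypothetical Python stand-ins modeled on the names in the slide, and the exact whitespace and matching rules of the real ClassAd functions are not taken from the source.

```python
def parse_string_list(text):
    # Split a CSV-style string list and trim whitespace around each item.
    return [item.strip() for item in text.split(",") if item.strip()]


def str_list_contains(text, member):
    # True only for a whole-item match, not a substring match.
    return member in parse_string_list(text)


def str_list_append(text, member):
    return ", ".join(parse_string_list(text) + [member])


def str_list_remove(text, member):
    return ", ".join(i for i in parse_string_list(text) if i != member)
```

Using the slide's example list `"Joe, Jon, Jeff, Jim, Jake"`, a lookup for `Jeff` succeeds while a lookup for `Je` does not, since items are compared as whole strings in this sketch.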

41 41 Accounting Groups and Group Quota Support › Account Group (w/ CORE Feature Animation) › Account Group Quota (inspiration CDF @ Fermi)  Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them  Could use Machine Rank… but this ties to specific machines  Or could use new group support: each group can be given a quota in config file; job ads can specify group membership; group quotas are satisfied first; accounting by user and by group
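The "group quotas are satisfied first" rule can be sketched for the slide's sample problem (500 nodes, 100 guaranteed to Chemistry). The Python below is a deliberately simplified model, not the negotiator's algorithm: `allocate` is a hypothetical function, and the way leftover slots are handed out in the second pass (in the order the demand is listed) is purely an assumption of this sketch.

```python
def allocate(total_slots, quotas, wanted):
    """Hand out slots so that group quotas are honored before anything else.

    quotas -- group name -> guaranteed number of slots
    wanted -- submitter or group name -> number of slots requested
    """
    granted = {}
    free = total_slots
    # Pass 1: satisfy group quotas first.
    for group, quota in quotas.items():
        share = min(quota, wanted.get(group, 0), free)
        granted[group] = share
        free -= share
    # Pass 2 (assumption): give what is left to the remaining demand.
    for name, want in wanted.items():
        extra = min(want - granted.get(name, 0), free)
        granted[name] = granted.get(name, 0) + extra
        free -= extra
    return granted
```

With `quotas = {"chemistry": 100}` and a physics group asking for 600 slots, chemistry still receives its 100 slots even though physics alone could fill the cluster, and this works without tying chemistry to specific machines.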

42 42 100 people surveyed! Favorite “ility” ?

43 43 100 people surveyed! Favorite “ility” ? Universability!

44 44 Grid Universe › With new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared as:
    universe = grid
    gridtype = gt2
› Other gridtypes?  GT2 (Globus Toolkit 2)  GT3 (Globus Toolkit 3.2)  GT4 (Globus Toolkit 3.9.5+)  UNICORE  Nordugrid  PBS (OpenPBS, PBSPro – technology from INFN)  LSF (Platform LSF – technology from INFN)  CONDOR (thanks gLite!) [Slide labels: ‘Condor-G’, ‘Condor-C’]

45 45 Other Grid Universe improvements › Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI http://grid.ncsa.uiuc.edu/myproxy (both GT2 and GT4) › GT4 : we start a GridFTP server behind the scenes  GridFTP server bundled w/ Condor nowadays › Some functionality present in Condor-G added to Condor-C  Forwarding of refreshed credentials (EGEE)  GSI authentication support  Cleaner ClassAd representation (URL)

46 46 Parallel Universe › Replaces the “MPI” universe › Allows running arbitrary programs that need to gang-schedule multiple machines  MPICH, LAM, …  FT-MPICH (Seoul National Univ)  Great for testing environments

47 47 Hey Jobs! We’re watching you! › Local Universe  Just like Scheduler Universe, but there is a condor_starter  All advantages of the starter [Diagram: on the Submit side, schedd, starter, job; on the Execute side, startd, starter, job. Caption: “Hey, job, behave or else!”]

48 48 100 people surveyed! Favorite “ility” ?

49 49 100 people surveyed! Favorite “ility” ? Scalability!

50 50 Faster Negotiation › SIGNIFICANT_ATTRIBUTES determined automatically  Job attributes AutoClusterId and AutoClusterAttributes  Rounding of Attributes › Schedd uses non-blocking TCP connects to the startd › Negotiator caching › Collector Forks for queries › More coming…
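The idea behind automatic clustering is that jobs whose significant attributes have identical values can be treated as one unit during negotiation, and that rounding an attribute makes more jobs identical. The Python sketch below illustrates that grouping only; `assign_auto_clusters`, `round_up`, the `round_steps` parameter, and the sample attribute values are hypothetical and do not reflect how the schedd actually computes AutoClusterId.

```python
def round_up(value, step):
    # Round value up to the next multiple of step (integer arithmetic).
    return -(-value // step) * step


def assign_auto_clusters(jobs, significant_attributes, round_steps=None):
    """Return one cluster id per job, in input order.

    Jobs with equal (possibly rounded) values for every significant
    attribute share a cluster id.
    """
    round_steps = round_steps or {}
    cluster_ids = {}  # tuple of attribute values -> cluster id
    assignment = []
    for job in jobs:
        key = []
        for attr in significant_attributes:
            value = job.get(attr)
            if attr in round_steps and value is not None:
                value = round_up(value, round_steps[attr])
            key.append(value)
        cluster_id = cluster_ids.setdefault(tuple(key), len(cluster_ids) + 1)
        assignment.append(cluster_id)
    return assignment
```

In this model, two jobs from the same owner with image sizes 1010 and 1990 fall into different clusters when compared exactly, but into the same cluster once image size is rounded up to a multiple of 1000, so the negotiator would have fewer distinct cases to consider.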

51 51 Scalability, cont. › Knobs  GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE,  GRIDMANAGER_MAX_PENDING_SUBMIT_PER_RESOURCE,  GRIDMANAGER_MAX_SUBMITTED_JOBS_PER_RESOURCE › One instance of gridmanager handles multiple jobs (all from a given user) › One instance of condor_dagman can run multiple dags  Is the Shadow next? › Buffered I/O read on schedd restart (thanks Yahoo!)

52 52 Quill › Job ClassAds information mirrored into an RDBMS › Both active jobs and historical jobs › Benefits BOTH scalability and accessibility [Diagram labels: Master, Startd, Schedd, Quill, Job Queue log, RDBMS with Queue + History Tables]

53 53 Version 6.9.x

54 54 What’s brewing for after v6.8.0? › More data, data, data  Stork now distributed in v6.7.x, incl. DAGMan support – next it is NeST’s turn.  NeST to manage Condor spool files, ckpt servers; GridFTP used to move the bits  Quill++ and CondorDB goodness › Virtual Machines (and the future of Standard Universe)  Research BOF w/ Jaeyoung Moon, rm219 9am

55 55 SOAP API › First focus will be to finish interfaces used by all command-line tools  condor_userprio, condor_cod, … › Explore message-based security  Ian Alderman’s work w/ signed ClassAd attributes

56 56 Privilege Separation › No more root in the Condor daemons! › Instead, a small component will be responsible for privileged operations › Initial exploratory work w/ GNU userv (Cambridge) › Now focusing on integration w/ glexec (gLite / nikhef)

57 57 “The Year of the Schedd” › Schedd is juggling too many tasks  Break it down into smaller pieces, more modular › Scalability  All non-blocking I/O  Hierarchy of schedds › Schedd-on-the-side  “Scheduler booster”  Transform & delegate job classads to different grids  A “job router” for a grid

58 58 Thank you!

