Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison What’s New in Condor.

1 Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison condor-admin@cs.wisc.edu http://www.cs.wisc.edu/condor What’s New in Condor

2 www.cs.wisc.edu/condor Overview Quick ‘sound bites’ on new functionality in recent Condor releases › Condor Development Process › New Features in Condor version 6.6.x › New Features in Condor version 6.7.0

3 www.cs.wisc.edu/condor Condor Development Process › We maintain two different releases at all times ▪ Stable Series: second digit is even, e.g. 6.2.2, 6.4.7, 6.6.3 ▪ Development Series: second digit is odd, e.g. 6.5.1, 6.7.2
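
The even/odd numbering rule above can be sketched in a few lines of Python. This is only an illustration: the function name `condor_series` is made up for this example and is not part of Condor.

```python
def condor_series(version: str) -> str:
    """Classify a version string such as "6.6.3" by its second digit.

    Hypothetical helper for illustration only: an even second digit
    means the stable series, an odd one the development series.
    """
    parts = version.split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.minor, got {version!r}")
    minor = int(parts[1])
    return "stable" if minor % 2 == 0 else "development"


# The example versions are the ones listed on the slide.
for v in ["6.2.2", "6.4.7", "6.6.3", "6.5.1", "6.7.2"]:
    print(v, condor_series(v))
```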

4 www.cs.wisc.edu/condor Stable Series › Heavily tested › Runs on our department production pool of nearly 1,000 CPUs (for a minimum of 3 weeks) › No new features, only bugfixes and ports. › A given stable release is always compatible with other releases from the same series ▪ 6.6.X is compatible with 6.6.Y › Recommended for production pools

5 www.cs.wisc.edu/condor Development Series › Less heavily tested › Runs on our small(er) test pool. › New features and new technology are added frequently › Versions from the same development series are not guaranteed compatible with each other (although we try hard)

6 www.cs.wisc.edu/condor New in version 6.6.x › Version 6.6.0 released in November 03. › Current release: version 6.6.7, to be released in Oct 04.

7 www.cs.wisc.edu/condor The Struggle to Build Condor › Condor is BIG ▪ Condor code consists of primary source plus ‘externals’. Externals include Kerberos, zlib, GSI, PVM, gSOAP… Patches to externals

8 www.cs.wisc.edu/condor The Struggle to Build Condor › Condor is BIG ▪ Condor code consists of primary source plus ‘externals’. Externals include Kerberos, zlib, GSI, PVM, gSOAP… Patches to externals ▪ Current shipped source + externals: ~415MB of source, or ~9 million lines! ▪ Building Condor outside of UW-Madison used to be very difficult. “LIST OF SHAME”: Build pointed to packages on UW-Madison fileservers.

9 www.cs.wisc.edu/condor Now Condor Source “Self-Contained” › Source code to externals is now bundled w/ Condor itself. ▪ Self-contained ▪ Allows version control on externals + patches › Build w/ just “configure; make”! ▪ Checks for existence and proper version of all “bootstrap” requirements, such as the compiler ▪ Applies our patches to the externals ▪ All 9 million lines built and bundled

10 www.cs.wisc.edu/condor Building Condor [Figure: building Condor before Version 6.6.0… vs. building Condor post Version 6.6.0!]

11 www.cs.wisc.edu/condor Condor + NMI › NMI = NSF Middleware Initiative › Automated build and test infrastructure built on top of Condor ▪ Pool of 37 machines of many architectures ▪ Scalable ▪ Runs every night, builds several Condor source branches, then runs 114 test programs. ▪ All results stored in RDBMS, reported on the web. ▪ Yes, Condor builds Condor!

12 www.cs.wisc.edu/condor Ports › New Ports w/ v6.6.x vs. v6.4.x: ▪ Solaris 9 ▪ RedHat Linux 8.x, 9.x for x86 (+RPMs) ▪ RedHat Linux 7.x and SUSE 8.0 for IA64 (clipped) ▪ Tru64 5.1 (clipped) ▪ AIX 5.2 (clipped) ▪ Mac OS X (clipped)

13 www.cs.wisc.edu/condor Some new components › Computing On Demand (COD) › Integration of “Hawkeye” technology › Condor-G Additions ▪ Matchmaking ▪ Grid Monitor ▪ Grid Shell

14 www.cs.wisc.edu/condor Computing On Demand (COD) › Introduce effective timesharing to a distributed system ▪ Batch applications often want sustained throughput for a long period of time ▪ Interactive applications often want a quick burst of CPU power for a small period of time ▪ COD: allows both to co-exist

15 www.cs.wisc.edu/condor HawkEye Technology › Dynamic Resource Monitoring, now ‘built-in’ to Condor. ▪ Allows custom dynamic attributes to be added into machine classads. ▪ These attributes can be used for queries and scheduling ▪ Many plugins available: disk space, memory used, network errors, open files/descriptors, process monitoring, users, …

16 www.cs.wisc.edu/condor Condor-G › Condor-G Matchmaking ▪ Condor-G can determine which grid site to utilize via ClassAd matchmaking (grid planning, meta scheduling, …) › Condor-G Grid Monitor ▪ Reduces the load on a GT2-based gatekeeper, greatly increasing the number of jobs that can be submitted › Condor-G GridShell ▪ A wrapper for the job ▪ Reports exit status, CPU utilization, more

17 www.cs.wisc.edu/condor Improvements in Condor for Windows › Ability to run SCHEDULER universe jobs ▪ Including DAGMan › JAVA universe support › More Win32 flavors, incl. international versions. › Added support for encryption on disk of the job and data files on execute machine. › v6.6.6: Many issues fixed w/ signaling jobs › v6.6.7: Support for SP2

18 www.cs.wisc.edu/condor New Features in DAGMan › DAGMan previously required that all jobs in a DAG share one log file › Each job can now have its own log file › Understands XML formatted logs › Can draw a graphical representation of your DAG ▪ Uses GraphViz, http://www.graphviz.org/

19 www.cs.wisc.edu/condor

20 Central Manager New Features › Central Manager daemons can now run on any port, e.g. COLLECTOR_HOST = condor.cs.wisc.edu:9019 and NEGOTIATOR_HOST = condor.cs.wisc.edu:9020 ▪ Useful for firewall situations ▪ Allows multiple instances on one machine › Keeps statistics on missed updates › Can use TCP instead of UDP, if you must
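
As a rough illustration of the host:port form shown above, the following Python sketch reads the two settings from the slide and splits each value into a host and a port. The helper names `parse_config` and `split_host_port` are invented for this example; this is not how Condor itself parses its configuration.

```python
from typing import Dict, Tuple

# The two settings below are copied from the slide.
CONFIG_TEXT = """\
COLLECTOR_HOST = condor.cs.wisc.edu:9019
NEGOTIATOR_HOST = condor.cs.wisc.edu:9020
"""


def parse_config(text: str) -> Dict[str, str]:
    """Hypothetical helper: read simple NAME = value lines into a dict."""
    settings: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        settings[name.strip()] = value.strip()
    return settings


def split_host_port(value: str) -> Tuple[str, int]:
    """Hypothetical helper: split "host:port" into (host, port)."""
    host, sep, port = value.rpartition(":")
    if not sep:
        raise ValueError(f"no port given in {value!r}")
    return host, int(port)


endpoints = {
    name: split_host_port(value)
    for name, value in parse_config(CONFIG_TEXT).items()
}
for name, (host, port) in endpoints.items():
    print(f"{name}: host={host} port={port}")
```

Because each daemon gets its own port, two central managers on one machine only need distinct port numbers in their respective configurations.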

21 www.cs.wisc.edu/condor Command-line Tools › ‘condor_update_stats’ tool to display information on any dropped central manager updates › ‘condor_q -hold’ gives you a list of held jobs and the reason they were put on hold › ‘condor_config_val -v’ tells you where (file and line number) an attribute is defined › ‘condor_fetch_log’ will grab a log file from a remote machine: ▪ condor_fetch_log c2-15.cs.wisc.edu STARTD › ‘condor_configure’ will install Condor via simple command-line switches, no questions asked › ‘condor_vacate_job’ releases a resource by job id, and can be invoked by the job owner. › ‘condor_wait’ blocks until a job or set of jobs completes

22 www.cs.wisc.edu/condor New 6.7.x Development Series › Release of v6.7.2 was in April 04.

23 www.cs.wisc.edu/condor Big Picture What do we want to achieve in a new Condor developer series? › Technology Transfer ▪ Building a bridge between the Condor production software development activity and the academic core research activity: BAD-FS, Stork, Diskrouter, Parrot (transparent I/O), Schedd Glidein, VO Schedulers, HA, Management, Improved ClassAds…

24 www.cs.wisc.edu/condor What do we want to achieve, cont. › New Ports: Go to where the cycles are! › The RedHat Dilemma › Our porting ‘hopper’: AIX 5.1L on the PowerPC architecture; Redhat AS server on x86; Fedora Core on x86; Fedora Core 2 on x86; Redhat AS server on AMD64; SuSE 8.0 on AMD64; Redhat AS server on IA64; HPUX 11.11 64-bit

25 www.cs.wisc.edu/condor What do we want to achieve, cont. › Improve existing ports ▪ Move “clipped wing” ports to full ports (w/ checkpoint, process migration): Mac OS X, Windows ▪ Better integration into environments. Windows: operate better w/ DFS, use MSI. Unix: operate w/ AFS

26 www.cs.wisc.edu/condor What do we want to achieve, cont. › Address changes in the computing landscape ▪ Firewalls, NATs ▪ 64-bit operating systems ▪ Emphasis on data ▪ Movement towards standards such as WS, OGSA, …

27 www.cs.wisc.edu/condor V6.7 Themes › Scalability ▪ Resources, jobs, matchmaking framework › Accessibility ▪ APIs, more Grid middleware, network › Availability ▪ Failover

28 www.cs.wisc.edu/condor High Availability in v6.7.x › What happens if my submit machine reboots? › Once upon a time, only one answer: job restarts. › Checkpoint? No Checkpoint?

29 www.cs.wisc.edu/condor New: Job Progress continues if connection is interrupted › For Vanilla and Java universe jobs, Condor now supports reestablishment of the connection between the submitting and executing machines. › To take advantage of this feature, users put a JobLeaseDuration line into their job’s submit description file. For example: JobLeaseDuration = 1200
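
To show where the JobLeaseDuration line would sit, here is a small Python sketch that assembles a minimal submit description file as text. The function `submit_description` and the executable name `my_job` are hypothetical, and the sketch assumes the lease value is a duration in seconds, which the slide does not state explicitly.

```python
def submit_description(executable: str, lease: int, universe: str = "vanilla") -> str:
    """Hypothetical helper: build a minimal submit description as text.

    Assumes `lease` is a duration in seconds. The slide only mentions
    the Vanilla and Java universes for this feature, so others are rejected.
    """
    if universe not in ("vanilla", "java"):
        raise ValueError("the slide describes this feature for vanilla and java only")
    if lease <= 0:
        raise ValueError("lease must be positive")
    lines = [
        f"universe = {universe}",
        f"executable = {executable}",
        f"JobLeaseDuration = {lease}",  # the line named on the slide
        "queue",
    ]
    return "\n".join(lines) + "\n"


# "my_job" is a placeholder executable name; 1200 is the slide's example value.
print(submit_description("my_job", 1200))
```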

30 www.cs.wisc.edu/condor What if the submission point spontaneously explodes? (don’t try this at home)

31 www.cs.wisc.edu/condor More High Availability Solutions › Condor can support a submit machine “hot spare” ▪ If your submit machine is down for longer than N minutes, a second machine can take over › Two mechanisms available ▪ Job Mirroring ▪ High Availability Daemon Failover: just tell the condor_master to run ONE instance

32 www.cs.wisc.edu/condor Daemon Failover [Diagram: Machine A and Machine B, each with a Master and a SchedD; one is Active, the other is the hot spare. Arrow labels: Obtain Lock, Refresh Lock, Check Lock.]
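
The obtain/refresh/check lock cycle named in the diagram can be illustrated with a toy simulation. The sketch below is not Condor's implementation: the `Lock` and `Master` classes, the 60-unit lease, and the in-memory lock are all assumptions made for the example, and a real shared lock would need atomic updates on shared storage.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lock:
    """Toy stand-in for a shared lock: who holds it and until when."""
    holder: Optional[str] = None
    expires_at: float = 0.0


class Master:
    """Hypothetical master that runs the schedd only while it holds the lock."""

    def __init__(self, name: str, lock: Lock, lease: float) -> None:
        self.name = name
        self.lock = lock
        self.lease = lease

    def tick(self, now: float) -> bool:
        """Return True if this master should be running the schedd at `now`."""
        lock = self.lock
        if lock.holder == self.name or now >= lock.expires_at:
            # Refresh our own lock, or obtain one that is unheld or expired.
            lock.holder = self.name
            lock.expires_at = now + self.lease
            return True
        # Someone else holds a still-valid lock: stay a hot spare.
        return False


lock = Lock()
a = Master("A", lock, lease=60)
b = Master("B", lock, lease=60)

print(a.tick(0), b.tick(0))    # A obtains the lock, B only checks it
print(a.tick(30), b.tick(30))  # A refreshes, B is still the spare
# Machine A now goes down and stops refreshing.
print(b.tick(60))              # A's lock is still valid, B waits
print(b.tick(90))              # the lock has expired, B takes over
print(a.tick(100))             # A returns but finds B holding the lock
```

The lease length plays the role of the “down for longer than N minutes” threshold: the spare takes over only after the active machine has failed to refresh for a full lease.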

33 www.cs.wisc.edu/condor Accessibility › Support for GCB ▪ Condor working w/ NATs, Firewalls › Distributed Resource Management Application API (DRMAA) ▪ GGF Working Group ▪ An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems ▪ Condor DRMAA interface to appear in v6.7.0

34 www.cs.wisc.edu/condor SOAP/Grid Service [Diagram labels: condor_schedd; Cedar; Web Service: SOAP; HTTPS]

35 www.cs.wisc.edu/condor New “Grid Universe” › With the new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared with the two lines “universe = grid” and “gridtype = gt2” › Other gridtypes? GT3 for OGSA-based Globus Toolkit 3

36 www.cs.wisc.edu/condor Condor-G improvements › Condor-G can submit to either Globus GT2 or GT3 resources, including support for GT3 with web services. ▪ Condor-G includes everything required; no need for client to have a GT3 installation. ▪ Good migration path to OGSA › Condor-G to Nordugrid, Unicore, Condor, ORACLE › Support for credential refresh via the MyProxy Online Credential Management in NMI http://grid.ncsa.uiuc.edu/myproxy/

37 www.cs.wisc.edu/condor Why Condor + MyProxy? › Long-lived tasks or services need credentials ▪ Task lifetime is difficult to predict › Don’t want to delegate long-lived credentials ▪ Fear of compromise › Instead, renew credentials with MyProxy as needed during the task’s lifetime ▪ Provides a single point of monitoring and control ▪ Renewal policy can be modified at any time; for example, disable renewals if compromise is detected or suspected
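
The argument above (short-lived credentials, renewed as needed, with renewals that can be switched off) can be sketched as a toy simulation. Everything here is an assumption made for illustration: the function `run_task`, its parameters, and the abstract time units. It does not call MyProxy or Condor-G.

```python
from typing import Tuple


def run_task(
    task_length: int,
    proxy_lifetime: int,
    renew_margin: int,
    renewals_enabled: bool = True,
) -> Tuple[bool, int]:
    """Simulate a long task that holds a short-lived credential.

    Hypothetical model: time advances in steps of 1 unit, and the
    credential is renewed once its remaining lifetime drops to
    `renew_margin` or less. Returns (task_completed, number_of_renewals).
    """
    expires_at = proxy_lifetime
    renewals = 0
    now = 0
    while now < task_length:
        if renewals_enabled and expires_at - now <= renew_margin:
            # Fetch a fresh short-lived credential.
            expires_at = now + proxy_lifetime
            renewals += 1
        if now >= expires_at:
            # The credential lapsed before the task finished.
            return False, renewals
        now += 1
    return True, renewals


# With renewals enabled the task outlives any single credential.
print(run_task(task_length=100, proxy_lifetime=30, renew_margin=5))
# With renewals disabled (e.g. after a suspected compromise) the task
# stops once the current short-lived credential expires.
print(run_task(task_length=100, proxy_lifetime=30, renew_margin=5,
               renewals_enabled=False))
```

Disabling renewals bounds the exposure to one credential lifetime, which is the point of not delegating a long-lived credential in the first place.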

38 www.cs.wisc.edu/condor Credential Renewal [Diagram: components Condor-G Scheduler, MyProxy, Resource Manager, Job; regions Home and Remote. Arrow labels: Submit Jobs, Enable Renewal, Launch Job, Retrieve Credentials, Refresh Credentials.]

39 www.cs.wisc.edu/condor More… › Condor can now transfer job data files larger than 2 GB in size. ▪ On all platforms that support 64-bit file offsets › Real-time spooling of stdout/err/in in any universe incl. VANILLA ▪ Real-time monitoring of job progress › Working on Hierarchical Negotiations

40 www.cs.wisc.edu/condor Thank you!

