1 The Roadmap to New Releases Todd Tannenbaum Department of Computer Sciences University of Wisconsin-Madison
2 Stable vs. Development Series › Much like the Linux kernel, Condor provides two different release series at any time: a stable series and a development series › This allows Condor to be both a research project and a production-ready system
3 Stable series › Series number in version is even (e.g ) › Releases are heavily tested › Only bug fixes and ports to new platforms are added on a stable series
4 Stable series (cont.) › A given stable release is always compatible with other releases from the same series › Recommended for production pools
5 Development Series › Series number in the version is odd (e.g , 6.3.1) › New features and new technology are added frequently › Versions from the same development series are not always compatible with each other
6 Development Series (cont.) › Releases are not as heavily tested › Generally not recommended for production pools … unless new features are required … unless we recommend otherwise :^)
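The even/odd numbering rule for the two series can be sketched in Python. This is an illustration only: the function name `series_kind` is hypothetical and not part of Condor, and it assumes versions are written as major.series.minor (as in 6.3.1).

```python
def series_kind(version: str) -> str:
    """Classify a version string as 'stable' or 'development'.

    Assumes the form major.series.minor, e.g. "6.3.1"; a leading "v" is ignored.
    """
    parts = version.lstrip("vV").split(".")
    if len(parts) < 2:
        raise ValueError(f"expected at least major.series, got {version!r}")
    series = int(parts[1])  # the second component is the series number
    # Even series number: stable series. Odd series number: development series.
    return "stable" if series % 2 == 0 else "development"


if __name__ == "__main__":
    for v in ("6.3.1", "v6.4.0"):
        print(v, "->", series_kind(v))
```

A check like this could help a pool administrator flag development-series releases before rolling them out to a production pool.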
7 Where is Condor Today? › Version being released asap – this is the v6.4.0 release candidate. › We expect version released by the end of March.
8 What’s new for Condor v6.4.0?
9 New Ports in › Full support (with checkpointing and remote system calls): RedHat 7.x (Linux 2.4.x kernel + glibc 2.2.x)
10 New Ports in (cont.) › "Clipped" support (no checkpointing, PVM, or remote system calls, but all other functionality is available): Windows 2000, Mac OS X
11 Secure Communication › Secure network communication Strong user authentication Multiple methods supported: Kerberos, X509, NT LanMan, … Encryption Integrity › Authorization based on host or user
12 New Job Universes › MPI Universe Launch MPI jobs linked with MPICH library › Globus Universe Faster, more reliable, better integrated › Java Universe
13 Java Universe Job
  universe = java
  executable = Main.class
  jar_files = MyLibrary.jar
  input = infile
  output = outfile
  arguments = Main
  queue
condor_submit
14 Why not use Vanilla Universe for Java jobs? › Java Universe provides more than just inserting “java” at the start of the execute line Knows which machines have a JVM installed Knows the location, version, and performance of JVM on each machine Provides more information about Java job completion than just JVM exit code Program runs in a Java wrapper, allowing Condor to report Java exceptions, etc.
15 Java support, cont. condor_status -java
  Name JavaVendor Ver State Activity LoadAv Mem
  aish.cs.wisc. Sun Microsy Owner Idle
  anfrom.cs.wis Sun Microsy Owner Idle
  babe.cs.wisc. Sun Microsy Claimed Busy
16 Condor File Transfer › Condor will transfer job files from the submit machine to the execute machine › Files to send and/or receive are specified at submit time › Transfer is atomic: either all files are transferred, or the transfer fails › In v6.2, this was available only in Condor for Windows
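The all-or-nothing behavior described above can be illustrated with a small Python sketch. This is not Condor's implementation; the function name `transfer_all_or_nothing` is hypothetical, and the sketch assumes a local copy into a destination directory that does not yet exist, with the staging directory on the same filesystem.

```python
import os
import shutil
import tempfile


def transfer_all_or_nothing(sources, dest_dir):
    """Copy every file in `sources` into `dest_dir`, or leave nothing behind.

    Files are first copied into a hidden staging directory next to `dest_dir`.
    Only when every copy has succeeded is the staging directory renamed to
    `dest_dir`, so a reader never sees a partially filled destination.
    """
    parent = os.path.dirname(os.path.abspath(dest_dir))
    staging = tempfile.mkdtemp(prefix=".staging-", dir=parent)
    try:
        for src in sources:
            shutil.copy2(src, staging)
        os.rename(staging, dest_dir)  # the single "commit" step
    except Exception:
        # Any failure discards everything copied so far.
        shutil.rmtree(staging, ignore_errors=True)
        raise


if __name__ == "__main__":
    work = tempfile.mkdtemp()
    src = os.path.join(work, "x")
    with open(src, "w") as f:
        f.write("input data")
    transfer_all_or_nothing([src], os.path.join(work, "sandbox"))
    print(os.listdir(os.path.join(work, "sandbox")))
```

Staging followed by a single rename is one common way to get this property; the slides do not say which mechanism Condor itself uses.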
17 File Transfer, cont. › Example:
  transfer_input_files = x, y, z …
  transfer_output_files = a, b, c …
  transfer_files = [ ALWAYS | ONEXIT ]
› Note: Condor can automatically determine the output files. Default: send back any new or changed files
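One way to picture the default of sending back any new or changed files is to compare a snapshot of the job's directory taken before the run with one taken after it. The Python sketch below does that; it is an assumption about how such detection could work, not a description of Condor's code, and the helper names `snapshot` and `new_or_changed` are hypothetical.

```python
import os
import tempfile


def snapshot(directory):
    """Map each regular file in `directory` to its (size, mtime_ns)."""
    state = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            st = entry.stat()
            state[entry.name] = (st.st_size, st.st_mtime_ns)
    return state


def new_or_changed(before, after):
    """Return sorted names in `after` that are new or differ from `before`."""
    return sorted(name for name, sig in after.items() if before.get(name) != sig)


if __name__ == "__main__":
    sandbox = tempfile.mkdtemp()
    with open(os.path.join(sandbox, "infile"), "w") as f:
        f.write("input")
    before = snapshot(sandbox)
    # Stand-in for the job running: it writes one new file.
    with open(os.path.join(sandbox, "outfile"), "w") as f:
        f.write("result")
    print(new_or_changed(before, snapshot(sandbox)))
```

Comparing size and modification time is cheap but can miss a same-size rewrite within the filesystem's timestamp resolution; a content hash would be stricter at higher cost.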
18 Remote I/O Socket › A job can request that the condor_starter process on the execute machine create a Remote I/O Socket › Used for online access to files on the submit machine, without the Standard Universe. Usable in Vanilla, Java, … › Libraries provided for Java and for C, e.g.: Java: FileInputStream -> ChirpInputStream C: open() -> chirp_open()
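19 [Figure: Remote I/O architecture. Execution site: the starter forks the job; the job's I/O library makes local system calls and talks local I/O (Chirp) to the I/O proxy. Submission site: the shadow, with an I/O server and the home file system. The two sites are connected by secure remote I/O.]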
20 Job Policy Expressions › User can supply job policy expressions in the submit file. › Can be used to describe a successful run. on_exit_remove = on_exit_hold = periodic_remove = periodic_hold =
21 Job Policy Examples › Do not remove if the job exits with a signal:
  on_exit_remove = ExitBySignal == False
› Place on hold if the job exits with nonzero status or ran for less than an hour:
  on_exit_hold = ((ExitBySignal == False) && (ExitCode != 0)) || ((ServerStartTime - JobStartDate) < 3600)
› Place on hold if the job has spent more than 50% of its time suspended:
  periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
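To make the logic of two of these expressions concrete, here is a Python sketch that evaluates them against a dictionary standing in for a job ClassAd. The attribute names come from the slide; the sample values are made up, and plain Python does not model ClassAd semantics such as UNDEFINED, so this is only an approximation.

```python
def on_exit_remove(job):
    # Mirrors: on_exit_remove = ExitBySignal == False
    # The job leaves the queue only if it was not killed by a signal.
    return job["ExitBySignal"] is False


def periodic_hold(job):
    # Mirrors: periodic_hold = CumulativeSuspensionTime > (RemoteWallClockTime / 2.0)
    # Hold the job once it has spent more than half its wall-clock time suspended.
    return job["CumulativeSuspensionTime"] > job["RemoteWallClockTime"] / 2.0


if __name__ == "__main__":
    # Hypothetical attribute values, for illustration only.
    clean_exit = {"ExitBySignal": False}
    killed = {"ExitBySignal": True}
    mostly_suspended = {"CumulativeSuspensionTime": 4000, "RemoteWallClockTime": 6000}

    print(on_exit_remove(clean_exit))       # job exited normally
    print(on_exit_remove(killed))           # job died on a signal
    print(periodic_hold(mostly_suspended))  # 4000 s suspended out of 6000 s
```

A job for which `on_exit_remove` evaluates to false stays in the queue and can run again, which is how a policy expression describes what counts as a successful run.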
22 Firewall Support › Port Restrictions In condor_config file can specify: LOWPORT = x HIGHPORT = y All dynamic ports will be between x and y inclusive › Condor + Firewalls/Private Networks: Who: Se-Chang Son Time: 9am-12pm Weds Where: rm 3387
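The effect of the LOWPORT/HIGHPORT restriction can be sketched in Python: a daemon that needs a dynamic port tries each port in the inclusive range until one binds. This is only an illustration of the idea; the function name `bind_in_range` and the example range are hypothetical, and the slides do not describe how Condor picks a port within the range.

```python
import socket


def bind_in_range(low, high, host="127.0.0.1"):
    """Bind a TCP socket to the first free port in [low, high], inclusive.

    Returns (socket, port). Raises OSError if every port in the range is taken.
    """
    for port in range(low, high + 1):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind((host, port))
            return sock, port
        except OSError:
            sock.close()  # port in use or not permitted; try the next one
    raise OSError(f"no free port between {low} and {high}")


if __name__ == "__main__":
    # Example range only; a real pool would use the range its firewall opens.
    sock, port = bind_in_range(40000, 40010)
    print("bound to port", port)
    sock.close()
```

Keeping all dynamic ports inside one known range means a firewall administrator has to open only that range rather than every ephemeral port.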
23 Condor on Windows › On both NT and Win2k › New universes added: MPI, Java, Scheduler (and Globus in the works!) › DAGMan ported › CondorView ported › Run shadow + DAGMan as the user Allows submission from directories on shared filesystems
24 And more… › Unix Man pages › Fetch/consolidate log files remotely › ClassAd chaining › Many DAGMan improvements › Bug fixes, etc…
25 What’s Next? Future Directions › Increased focus on standalone tools built with Condor Technology DAGMan NeST PFS HawkEye Condor-G …
26 What’s Next? › Big Item: More focus on being a service provider than just an end-user tool: Developer APIs / libraries SOAP access to services XML representations of user logs, ClassAds, accounting info, etc.
27 More what’s next… › Condor on Windows Increased support from Microsoft Research Remote I/O Complete Shared Filesystem support Condor-G › MPI Scheduling Improvements
28 More what’s next… › New version of ClassAds into Condor Conditionals !! if/then/else Aggregates (lists, nested classads) Built-in functions String operations, pattern matching, time operators, unit conversions Clean implementations in C++ and Java ClassAd collections
29 More what’s next… › Re-write of the condor_schedd Performance enhancements and lowered resource requirements (particularly RAM) › Re-write of the checkpoint server Add secure communication NeST technology infusion Enhanced support for multiple servers Store meta-data along with checkpoint files
30 Thank you for coming to Paradyn/Condor Week!