Condor Week Summary March 14-16, 2005 Madison, Wisconsin.

1 Condor Week Summary March 14-16, 2005 Madison, Wisconsin

2 Overview Annual meeting at UW-Madison. About 80 participants at this year’s meeting. Participants come from universities, research labs and industry. Single plenary sessions with talks from users and developers.

3 Overview Topics ranged from basic to advanced. Selected highlights in today’s talk. Slides from this year’s talks can be found at http://www.cs.wisc.edu/condor/CondorWeek2005

4 CondorWeek Topics: distributed computing and Condor; data handling and Condor; 3rd party contributions to Condor; reports from the field; Condor roadmap.

5 Condor Grids (by Alan De Smet) Various alternatives for accessing remote computing resources (distributed computing, flocking, Globus/Condor-G, Condor-C, etc). Discussed pros and cons of each approach (ACF uses Globus/Condor-G).

6 Condor-G Status and News Globus Toolkit 2 is stable. Globus Toolkit 3 is supported. –But we think most people are moving to… Globus Toolkit 4, in progress. –GT4 beta works now in Condor 6.7.6 –Condor will officially support GT4 soon after the official GT4 release.

7 Glidein (by Dan Bradley) You have access to a cluster running some other batch system. You want Condor features, such as –queue management –matchmaking –checkpoint migration

8 What Does Glidein Do? Installation and setup of Condor. –May be done remotely. Launching Condor. –Through Condor-G submission to Globus. –Or you run the startup script however you like.

9 Condor and DBMS (by Jeff Naughton) Premise: A running Condor system is awash in data: –Operational data –Historical data –User data DBMS technology can help capture, organize, manage, archive, and query this data.

10 Three potential levels of involvement 1.Passively collect and organize data, expose it through DB query interfaces. 2.Move/extend some data-related portions of Condor to DBMS (Condor writes to and reads from DBMS) 3.Provide services to help users manage their data.

11 Why do this? For Condor administrators –Easier to analyze and troubleshoot; –Easier to audit; –Easier to explore current and past system status and behavior.

12 Our projects and plans Quill: Transparently provide a DBMS query interface to job_queue and history data. [ready to deploy!] CondorDB: Transparently captures and provides interface to critical data from all Condor daemons. [status: partial prototype working in our own “sandbox”]

13 Quill Job ClassAds information mirrored into an RDBMS. Both active jobs and historical jobs. Benefits BOTH scalability and accessibility. [Diagram labels: Quill, Schedd, Job Queue log, RDBMS, Startd, …, Master, Queue + History Tables]
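
The mirroring idea on this slide can be illustrated with a minimal sketch. It assumes a hypothetical two-table schema (`jobs` for active jobs, `history` for finished ones) in an in-memory SQLite database; the slides do not describe Quill's actual schema or which RDBMS it uses, so every table and column name here is an assumption.

```python
import sqlite3


def build_mirror(active_ads, history_ads):
    """Copy a few job ClassAd attributes into two tables (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    for table in ("jobs", "history"):
        db.execute(
            f"CREATE TABLE {table} (cluster_id INTEGER, owner TEXT, job_status TEXT)"
        )
    db.executemany("INSERT INTO jobs VALUES (?, ?, ?)", active_ads)
    db.executemany("INSERT INTO history VALUES (?, ?, ?)", history_ads)
    return db


def jobs_per_owner(db):
    """Count active plus historical jobs per owner with a single SQL query."""
    query = """
        SELECT owner, COUNT(*) FROM (
            SELECT owner FROM jobs UNION ALL SELECT owner FROM history
        ) GROUP BY owner ORDER BY owner
    """
    return db.execute(query).fetchall()


db = build_mirror(
    active_ads=[(1, "alice", "running"), (2, "bob", "idle")],
    history_ads=[(0, "alice", "completed")],
)
print(jobs_per_owner(db))
```

Once the data sits in tables, a question that spans both the live queue and the history becomes one query, which is the accessibility benefit the slide points to.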

14 Longer-term plans Tight integration of DBMS technology and Condor [status: thinking hard!]. DBMS-inspired data management services to help Condor users manage their own data. [status: thinking really hard!]

15 Stork (by Tevfik Kosar) Condor tool for data movement. First available in v. 6.7.6. Will be included in next stable release (6.8.0). Prototypes deployed at various sites.

16 Bioinformatics: BLAST. High Energy Physics: LHC. Astronomy: LSST, 2MASS, SDSS, DPOSS, GSC-II, WFCAM, VISTA, NVSS, FIRST, GALEX, ROSAT, OGLE... Educational Technology: WCER EVP. Data volumes shown on the slide: 500 TB/year, 2-3 PB/year, 11 PB/year, 20 TB - 1 PB/year.

17 Stork: Data Placement Scheduler First scheduler specialized for data movement/placement. De-couples data placement from computation. Understands the characteristics and semantics of data placement jobs. Can make smart scheduling decisions for reliable and efficient data placement. http://www.cs.wisc.edu/condor/stork

18 Stork can also: Allocate/de-allocate (optical) network links Allocate/de-allocate storage space Register/un-register files to Meta Data Catalog Locate physical location of a logical file name Control concurrency levels on storage servers

19 Storage Management (by Jeff Weber) NeST (Network Storage Technology) is another project at UW-Madison. To be coupled to Condor and Stork. No stable release available yet.

20 Overview of NeST NeST: Network Storage Technology Lightweight: Configuration and installation can be performed in minutes. Multi-protocol: Supports Chirp, GridFTP, NFS, HTTP –Chirp is NeST’s internal protocol Secure: GSI authentication Allocation: NeST negotiates “mini storage contracts” between users and server.

21 Why storage allocations ? Users need both temporary storage, and long-term guaranteed storage. Administrators need a storage solution with configurable limits and policy. Administrators will benefit from NeST’s autonomous reclamations of expired storage allocations.

22 Storage allocations in NeST Lot – abstraction for storage allocation with an associated handle –Handle is used for all subsequent operations on this lot Client requests lot of a specified size and duration. Server accepts or rejects client request.
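
The request/accept/reject cycle for lots, together with the reclamation of expired allocations mentioned on the previous slide, might look roughly like the sketch below. The class and method names (`LotServer`, `request_lot`, `reclaim_expired`) are hypothetical and are not NeST's API; time is passed in explicitly so the behaviour is easy to follow.

```python
import itertools
from dataclasses import dataclass


@dataclass
class Lot:
    handle: int        # used for all later operations on this lot
    size: int          # reserved space, in arbitrary units
    expires_at: float  # after this time the server may reclaim the lot


class LotServer:
    """Toy storage server that grants fixed-size, fixed-duration lots."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lots = {}
        self._handles = itertools.count(1)

    def free_space(self):
        return self.capacity - sum(lot.size for lot in self.lots.values())

    def reclaim_expired(self, now):
        """Drop every lot whose duration has run out; return their handles."""
        expired = [h for h, lot in self.lots.items() if lot.expires_at <= now]
        for handle in expired:
            del self.lots[handle]
        return expired

    def request_lot(self, size, duration, now):
        """Return a handle if the request is accepted, or None if rejected."""
        self.reclaim_expired(now)
        if size > self.free_space():
            return None
        handle = next(self._handles)
        self.lots[handle] = Lot(handle, size, now + duration)
        return handle


server = LotServer(capacity=100)
first = server.request_lot(size=80, duration=10, now=0)    # accepted
second = server.request_lot(size=50, duration=10, now=5)   # rejected: only 20 free
third = server.request_lot(size=50, duration=10, now=10)   # accepted after reclaim
print(first, second, third)
```

The third request succeeds only because the first lot has expired by then and is reclaimed without any action from the user or the administrator.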

23 Condor and SRM (by Derek Wright) Coordinate computation and data movement with Condor. Condor ClassAd hook (STARTD_CRON_JOBS) queries DRM for files in cache and publishes them in the ClassAd for each node. FSM keeps track of all files required by jobs in the system and contacts HRM if required files are missing. Regular Condor matchmaking schedules jobs where files exist.
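
A rough sketch of the data-aware placement described on this slide follows. It assumes each node's ad is reduced to the set of files in its cache and each job to the set of files it needs; the function name `match_jobs` and the data layout are illustrative assumptions, not Condor or SRM interfaces.

```python
def match_jobs(jobs, node_ads):
    """Place jobs on nodes that already cache all of their input files.

    jobs:     {job name: set of required files}
    node_ads: {node name: set of files published as cached on that node}
    Returns (placements, missing): placements maps jobs to nodes, and missing
    maps jobs to the files that no node caches and that would need fetching.
    """
    cached_anywhere = set().union(*node_ads.values())
    placements, missing = {}, {}
    for job, needed in jobs.items():
        absent = needed - cached_anywhere
        if absent:
            missing[job] = absent      # the slide's FSM would ask HRM for these
            continue
        for node in sorted(node_ads):  # deterministic order for the example
            if needed <= node_ads[node]:
                placements[job] = node
                break
        # A job whose files are spread over several nodes stays unmatched here.
    return placements, missing


node_ads = {"node1": {"a.dat", "b.dat"}, "node2": {"c.dat"}}
jobs = {"job1": {"a.dat"}, "job2": {"c.dat"}, "job3": {"d.dat"}}
placements, missing = match_jobs(jobs, node_ads)
print(placements, missing)
```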

24 3rd party contributions to Condor: High availability features (Technion Institute). Privilege separation in Condor (Univ. of Cambridge). Optimizing Condor throughput (CORE Feature Animation). Web interface to Condor (Univ. College of London).

25 Current Condor Pool [Diagram: one Central Manager (Collector, Negotiator); Startd and Schedd]

26 Highly Available Condor Pool [Diagram: Highly Available Central Manager made up of one Active Central Manager and two Idle Central Managers; Startd and Schedd]

27 Highly Available Central Manager Our solution - Highly Available Central Manager –Automatic failure detection –Transparent failover to backup matchmaker (no global configuration change for the pool entities) –“Split brain” reconciliation after network partitions –State replication between active and backups –No changes to Negotiator/Collector code

28 What is privilege separation? Isolation of those parts of the code that run at different privilege levels. [Two diagrams, “No privilege separation” and “Privilege separation”, each showing root, Condor daemons and Condor job]

29 Throughput Optimization (CORE Feature Animation) Performance Before => After: ● Removed Groups: 6 => 5.5 min ● Significant Attributes: 5.5 => 3 min ● Schedd Algorithm: 3 => 1.5 min ● Separate Servers: 1.5 => 0.6 min ● Cycle delay: 0.6 => 0.33 min ● Server Loads: <1 Middleware, <2 Central Manager
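
The before/after figures chain together (each step starts where the previous one ended), so the per-step and overall improvement can be worked out directly. The short calculation below uses only the numbers on the slide; the slide does not say exactly which duration was measured, so the result is simply a ratio of the reported minutes.

```python
# (step, before, after) in minutes, copied from the slide.
steps = [
    ("Removed Groups", 6.0, 5.5),
    ("Significant Attributes", 5.5, 3.0),
    ("Schedd Algorithm", 3.0, 1.5),
    ("Separate Servers", 1.5, 0.6),
    ("Cycle delay", 0.6, 0.33),
]


def speedups(steps):
    """Speedup factor contributed by each individual step."""
    return {name: before / after for name, before, after in steps}


def overall_speedup(steps):
    """First 'before' value divided by the last 'after' value."""
    return steps[0][1] / steps[-1][2]


for name, factor in speedups(steps).items():
    print(f"{name}: {factor:.2f}x")
print(f"Overall: {overall_speedup(steps):.1f}x")
```

Going from 6 minutes to 0.33 minutes is roughly an 18-fold reduction, with the three middle steps accounting for most of it.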

30 Web Service Interface to Condor Facilitate the development of third-party applications capable of interacting with Condor (remotely). –E.g. build a higher-level, application-specific scheduler that submits jobs to multiple Condor pools based on application semantics –These can be built using a wide range of languages/SOAP packages –BirdBath has been tested on: Java (Apache Axis, XSUL), Python (ZSI), C# (.Net), C/C++ (gSOAP). Condor accessible from platforms where its command-line tools are not supported/installed.

31 Condor Plans (by Todd Tannenbaum) Condor 6.8.0 (stable series) available in May 2005. Fail-over, persistence and other features. Improved scalability and accessibility (APIs, Grid middleware, Web-based interfaces, etc). Grid universe and security improvements.

32 BAM! More tasty Condor goodness! Condor can now transfer job data files larger than 2 GB in size. –On all platforms that support 64-bit file offsets. Real-time spooling of stdout/err/in in any universe, incl. VANILLA. –Real-time monitoring of job progress. Condor Installer on Win32 uses MSI (thanks Micron!). condor_transfer_data (DZero). STARTD_VM_EXPRS (INFN). condor_vacate_job tool. condor_status -negotiator.

33 And More… New startd policy expression MaxJobRetirementTime. –Specifies the maximum amount of time (in seconds) that the startd is willing to wait for a job to finish on its own when the startd needs to preempt the job. -peaceful option to condor_off, condor_restart. noop_job = True. Preliminary support for the Tool Daemon Protocol (TDP). –TDP goal is to provide a generic way for scheduling systems (daemons) to interact with monitoring tools. –Users can specify a “tool” that should be spawned alongside their regular Condor job. –On Linux, ability to allow a monitoring tool to attach with ptrace() before the job's main() function is called.

34 Hey Jobs! We’re watching you! condor_starter enforces limits. –Starter is already monitoring many job characteristics (image size, cpu usage, etc) –Threshold expressions: use more resources than you said you would, and BAM! Local Universe. –Just like Scheduler Universe, but there is a condor_starter –All advantages of the starter. [Diagram: Submit side with schedd, starter, job; Execute side with startd, starter, job. Caption: Hey, job, behave or else!]

35 ClassAd Improvements in Condor! Conditionals –IfThenElse(condition,then,else). String functions –strcat(), strcmp(), toUpper(), etc. StringList functions –Example of a “string list” (CSV style): Mylist = “Joe, Jon, Jeff, Jim, Jake” –StrListContains(), StrListAppend(), StrListRemove(), etc. Others –Type test, some math functions.
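
To make the “string list” idea concrete, here is a small Python emulation of what functions with these names might do. It is not ClassAd code: the slide gives only the function names, so the argument order, the comma-plus-space output format and the whitespace handling below are all assumptions.

```python
def parse_string_list(text):
    """Split a CSV-style string list into items, ignoring surrounding spaces."""
    return [item.strip() for item in text.split(",") if item.strip()]


def str_list_contains(text, item):
    return item in parse_string_list(text)


def str_list_append(text, item):
    return ", ".join(parse_string_list(text) + [item])


def str_list_remove(text, item):
    return ", ".join(x for x in parse_string_list(text) if x != item)


def if_then_else(condition, then_value, else_value):
    """Conditional as a function; unlike a real conditional expression,
    Python evaluates both value arguments before the call."""
    return then_value if condition else else_value


mylist = "Joe, Jon, Jeff, Jim, Jake"
print(str_list_contains(mylist, "Jeff"))
print(str_list_append(mylist, "Jill"))
print(str_list_remove(mylist, "Jon"))
print(if_then_else(str_list_contains(mylist, "Bob"), "member", "guest"))
```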

36 Accounting Groups and Group Quota Support. Account Group (w/ CORE Feature Animation). Account Group Quota (inspiration CDF @ Fermi). –Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them –Could use Machine Rank… but this ties to specific machines –Or could use new group support: each group can be given a quota in config file; job ads can specify group membership; group quotas are satisfied first; accounting by user and by group.
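
The “group quotas are satisfied first” rule can be sketched with the sample problem from the slide. This is an illustrative allocation routine, not the negotiator's algorithm: the function name `allocate`, the alphabetical order used to hand out left-over nodes, and the assumption that the quotas sum to no more than the pool size are all choices made for the example.

```python
def allocate(total_nodes, quotas, demand):
    """Give each group up to its quota first, then share what is left.

    quotas: {group: guaranteed nodes}, assumed to sum to <= total_nodes
    demand: {group: nodes requested}
    """
    # Pass 1: quotas are satisfied first, capped by what each group asks for.
    allocation = {g: min(demand[g], quotas.get(g, 0)) for g in demand}
    remaining = total_nodes - sum(allocation.values())
    # Pass 2: hand out left-over nodes (alphabetical order is arbitrary here).
    for group in sorted(demand):
        extra = min(demand[group] - allocation[group], remaining)
        allocation[group] += extra
        remaining -= extra
    return allocation


# 500-node cluster; Chemistry bought 100 nodes and must always be able to use them.
result = allocate(500, quotas={"chemistry": 100},
                  demand={"chemistry": 100, "physics": 600})
print(result)
```

Because the guarantee is expressed as a node count rather than a list of hosts, Chemistry's 100 nodes are not tied to specific machines, which is the drawback of the Machine Rank approach noted on the slide.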

37 Improved Scalability Much faster negotiation –SIGNIFICANT_ATTRIBUTES determined automatically –Schedd uses non-blocking TCP connects to the startd –Negotiator caching –Collector Forks for queries –More…

38 What’s brewing for after v6.8.0? More data, data, data –Stork distributed w/ v6.8.0, incl DAGMan support –NeST manage Condor spool files, ckpt servers –Stork used for Condor job data transfers. Virtual Machines (and the future of Standard Universe). Condor and Shibboleth (with Georgetown Univ). Least Privilege Security Access (with U of Cambridge). Dynamic Temporary Accounts (with EGEE, Argonne). Leverage Database Technology (with UW DB group). ‘Automatic’ Glideins (NMI Nanohub – Purdue, U of Florida). Easier Updates. New ClassAds (integration with Optena). Hierarchical Matchmaking. Can I commit this to CVS??

