What's New in Condor. Todd Tannenbaum, Computer Sciences Department, University of Wisconsin-Madison.
Overview › Quick 'sound bites' on new functionality in recent Condor releases › Condor Development Process › New Features in Condor version 6.6.x › New Features in Condor version 6.7.0

Condor Development Process › We maintain two different release series at all times  Stable Series: second digit is even, e.g. 6.4.7  Development Series: second digit is odd, e.g. 6.7.2

Stable Series › Heavily tested › Runs on our department production pool of nearly 1,000 CPUs (for a minimum of 3 weeks) › No new features, only bug fixes and ports. › A given stable release is always compatible with other releases from the same series  6.6.X is compatible with 6.6.Y › Recommended for production pools

Development Series › Less heavily tested › Runs on our small(er) test pool. › New features and new technology are added frequently › Versions from the same development series are not guaranteed compatible with each other (although we try hard)
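The even/odd convention above can be sketched in a few lines. This is an illustrative helper, not an official Condor tool; it simply inspects the second component of a version string.

```python
# Classify a Condor version string as "stable" or "development"
# using the even/odd second-digit convention described above.

def classify(version: str) -> str:
    """Return 'stable' if the minor version is even, else 'development'."""
    major, minor, patch = (int(part) for part in version.split("."))
    return "stable" if minor % 2 == 0 else "development"

print(classify("6.6.7"))  # stable series
print(classify("6.7.2"))  # development series
```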

New in version 6.6.x › Version 6.6.0 released in November 03. › Current release: version 6.6.7, to be released in Oct 04.

The Struggle to Build Condor › Condor is BIG  Condor code consists of primary source plus 'externals'. Externals include Kerberos, zlib, GSI, PVM, gSOAP… Patches to externals  Current shipped source + externals: ~415MB of source, or ~9 million lines!  Building Condor outside of UW-Madison used to be very difficult. "LIST OF SHAME": Build pointed to packages on UW-Madison fileservers.

Now Condor Source "Self-Contained" › Source code to externals is now bundled w/ Condor itself.  Self-contained  Allows version control on externals + patches › Build w/ just "configure; make"!  Checks for existence and proper version of all "bootstrap" requirements, such as the compiler  Applies our patches to the externals  All 9 million lines built and bundled

Building Condor › Building Condor before version 6.6.0… › Building Condor after version 6.6.0!

Condor + NMI › NMI = NSF Middleware Initiative › Automated build and test infrastructure built on top of Condor  Pool of 37 machines of many architectures  Scalable  Runs every night, builds several Condor source branches, then runs 114 test programs.  All results stored in an RDBMS, reported on the web.  Yes, Condor builds Condor!

Ports › New Ports w/ v6.6.x vs. v6.4.x:  Solaris 9  RedHat Linux 8.x, 9.x for x86 (+RPMs)  RedHat Linux 7.x and SUSE 8.0 for IA64 (clipped)  Tru64 (clipped)  AIX 5.2 (clipped)  Mac OS X (clipped)

Some new components › Computing On Demand (COD) › Integration of “Hawkeye” technology › Condor-G Additions  Matchmaking  Grid Monitor  Grid Shell

Computing On Demand (COD) › Introduce effective timesharing to a distributed system  Batch applications often want sustained throughput for a long period of time  Interactive applications often want a quick burst of CPU power for a short period of time  COD: Allow both to co-exist

HawkEye Technology › Dynamic Resource Monitoring, now 'built-in' to Condor.  Allows custom dynamic attributes to be added into machine ClassAds.  These attributes can be used for Queries Scheduling  Many plugins available. Disk space, memory used, network errors, open files/descriptors, process monitoring, users, …
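A minimal configuration sketch of the idea, using HTCondor's startd cron mechanism as an approximation of HawkEye (the knob names follow the modern STARTD_CRON syntax, and the diskmon script is invented): a periodic script's `Attribute = Value` output is merged into the machine ClassAd.

```
# Hypothetical sketch: run a monitoring script every 5 minutes and merge
# its output into the machine ClassAd as custom attributes.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) DISKMON
STARTD_CRON_DISKMON_EXECUTABLE = /usr/local/libexec/diskmon.sh
STARTD_CRON_DISKMON_PERIOD = 300s
```

A custom attribute published this way (say, DiskFree) could then appear in queries such as `condor_status -constraint 'DiskFree > 1000000'` or in a machine's START expression.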

Condor-G › Condor-G Matchmaking  Condor-G can determine which grid site to utilize via ClassAd matchmaking (grid planning, meta-scheduling, …) › Condor-G Grid Monitor  Reduces the load on a GT2-based gatekeeper, greatly increasing the number of jobs that can be submitted › Condor-G GridShell  A wrapper for the job  Reports exit status, CPU utilization, and more

Improvements in Condor for Windows › Ability to run SCHEDULER universe jobs  Including DAGMan › JAVA universe support › More Win32 flavors, incl. international versions. › Added support for on-disk encryption of the job and data files on the execute machine. › v6.6.6: Many issues fixed w/ signaling jobs › v6.6.7: Support for SP2

New Features in DAGMan › DAGMan previously required that all jobs in a DAG share one log file › Each job can now have its own log file › Understands XML-formatted logs › Can draw a graphical representation of your DAG  Uses GraphViz
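A minimal DAG file sketch of the per-job log change (node names and submit file names are invented): each node's submit file may now declare its own `log =` file rather than all sharing one.

```
# Hypothetical DAG file: two nodes, B runs after A.
# stepA.sub and stepB.sub can each specify their own log file.
JOB A stepA.sub
JOB B stepB.sub
PARENT A CHILD B
```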

Central Manager New Features › Central Manager daemons can now run on any port:
COLLECTOR_HOST = condor.cs.wisc.edu:9019
NEGOTIATOR_HOST = condor.cs.wisc.edu:9020
 Useful for firewall situations  Allows multiple instances on one machine › Keeps statistics on missed updates › Can use TCP instead of UDP, if you must

Command-line Tools › 'condor_update_stats' displays information on any dropped central manager updates › 'condor_q -hold' gives you a list of held jobs and the reason they were put on hold › 'condor_config_val -v' tells you where (file and line number) an attribute is defined › 'condor_fetch_log' will grab a log file from a remote machine:  condor_fetch_log c2-15.cs.wisc.edu STARTD › 'condor_configure' will install Condor via simple command-line switches, no questions asked › 'condor_vacate_job' releases a resource by job ID, and can be invoked by the job owner › 'condor_wait' blocks until a job or set of jobs completes

New 6.7.x Development Series › Release of v6.7.2 was in April 04.

Big Picture › What do we want to achieve in a new Condor developer series? › Technology Transfer  Building a bridge between the Condor production software development activity and the academic core research activity: BAD-FS, Stork, Diskrouter, Parrot (transparent I/O), Schedd Glidein, VO Schedulers, HA, Management, Improved ClassAds…

What do we want to achieve, cont? › New Ports: Go to where the cycles are! (The RedHat Dilemma) › Our porting 'hopper': AIX 5.1L on the PowerPC architecture, RedHat AS server on x86, Fedora Core on x86, Fedora Core 2 on x86, RedHat AS server on AMD64, SuSE 8.0 on AMD64, RedHat AS server on IA64, HPUX 64-bit

What do we want to achieve, cont. › Improve existing ports  Move "clipped wing" ports to full ports (w/ checkpoint, process migration) Mac OS X, Windows  Better integration into environments Windows: operate better w/ DFS, use MSI Unix: operate w/ AFS

What do we want to achieve, cont. › Address changes in the computing landscape  Firewalls, NATs  64-bit operating systems  Emphasis on data  Movement towards standards such as WS, OGSA, …

V6.7 Themes › Scalability  Resources, jobs, matchmaking framework › Accessibility  APIs, more Grid middleware, network › Availability  Failover

High Availability in v6.7.x › What happens if my submit machine reboots? Once upon a time, only one answer: the job restarts. Checkpoint? No checkpoint?

New: Job progress continues if connection is interrupted › For Vanilla and Java universe jobs, Condor now supports reestablishment of the connection between the submitting and executing machines. › To take advantage of this feature, put the following line into the job's submit description file: JobLeaseDuration = <number of seconds>  For example: JobLeaseDuration = 1200
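A minimal submit description file sketch showing where the lease fits (the executable and file names are invented):

```
# Hypothetical vanilla-universe submit file with a 20-minute job lease,
# letting the job survive a brief submit/execute connection outage.
universe         = vanilla
executable       = analyze
output           = analyze.out
error            = analyze.err
log              = analyze.log
JobLeaseDuration = 1200
queue
```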

What if the submission point spontaneously explodes? (don’t try this at home)

More High Availability Solutions › Condor can support a submit machine “hot spare”  If your submit machine is down for longer than N minutes, a second machine can take over › Two mechanisms available  Job Mirroring  High Availability Daemon Failover Just tell the condor_master to run ONE instance

Daemon Failover (diagram): Machine A (active) and Machine B (hot spare) each run a condor_master and SchedD. The active SchedD refreshes a lock while the spare checks it; if the lock goes stale, the spare obtains the lock and becomes active.

Accessibility › Support for GCB  Condor working w/ NATs, Firewalls › Distributed Resource Management Application API (DRMAA)  GGF Working Group  An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems  Condor DRMAA interface to appear in v6.7.0

SOAP/Grid Service (diagram): clients reach the condor_schedd either via Cedar (Condor's native protocol) or via a Web Service interface, SOAP over HTTPS.

New "Grid Universe" › With the new Grid Universe, always specify a 'gridtype'. So the old "globus" universe is now declared as: universe = grid  gridtype = gt2 › Other gridtypes? GT3 for the OGSA-based Globus Toolkit 3
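A minimal sketch of the new syntax (the gatekeeper contact string and executable are invented, and the globusscheduler attribute name is an assumption based on the Globus-universe syntax of the era):

```
# Hypothetical grid-universe submit file targeting a GT2 gatekeeper.
universe        = grid
gridtype        = gt2
globusscheduler = gatekeeper.example.edu/jobmanager-pbs
executable      = sim
queue
```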

Condor-G improvements › Condor-G can submit to either Globus GT2 or GT3 resources, including support for GT3 with web services.  Condor-G includes everything required; no need for the client to have a GT3 installation.  Good migration path to OGSA › Condor-G to NorduGrid, Unicore, Condor, ORACLE › Support for credential refresh via the MyProxy Online Credential Management in NMI

Why Condor + MyProxy? › Long-lived tasks or services need credentials  Task lifetime is difficult to predict › Don’t want to delegate long-lived credentials  Fear of compromise › Instead, renew credentials with MyProxy as needed during the task’s lifetime  Provides a single point of monitoring and control  Renewal policy can be modified at any time For example, disable renewals if compromise is detected or suspected

Credential Renewal (diagram): on the home side, the user submits jobs to the Condor-G scheduler and enables renewal with MyProxy; Condor-G launches the job via the remote resource manager, retrieves credentials from MyProxy, and refreshes the running job's credentials.

More… › Condor can now transfer job data files larger than 2 GB in size.  On all platforms that support 64-bit file offsets › Real-time spooling of stdout/err/in in any universe, incl. VANILLA  Real-time monitoring of job progress › Working on Hierarchical Negotiations

Thank you!