Upgrading Condor Best Practices

Slides:



Advertisements
Similar presentations
Jaime Frey Computer Sciences Department University of Wisconsin-Madison OGF 19 Condor Software Forum Routing.
Advertisements

Configuration management
MOSS 2007 Document Management Adam McCarthy 1 st April 2009.
Dan Bradley Computer Sciences Department University of Wisconsin-Madison Schedd On The Side.
Change Management & Revision Control CPTE 433 John Beckett.
Lecture 19 Page 1 CS 111 Online Protecting Operating Systems Resources How do we use these various tools to protect actual OS resources? Memory? Files?
Condor Overview Bill Hoagland. Condor Workload management system for compute-intensive jobs Harnesses collection of dedicated or non-dedicated hardware.
Jaeyoung Yoon Computer Sciences Department University of Wisconsin-Madison Virtual Machine Universe in.
Jaeyoung Yoon Computer Sciences Department University of Wisconsin-Madison Virtual Machines in Condor.
FileSecure Implementation Training Patch Management Version 1.1.
Jaime Frey Computer Sciences Department University of Wisconsin-Madison Virtual Machines in Condor.
Zach Miller Computer Sciences Department University of Wisconsin-Madison What’s New in Condor.
Distributed Systems Early Examples. Projects NOW – a Network Of Workstations University of California, Berkely Terminated about 1997 after demonstrating.
Objectives Define IP Address To be able to assign an IP address with its Subnet Mask and Default Gateway to a PC that operates using Windows 7 or Fedora.
PCGRID ‘08 Workshop, Miami, FL April 18, 2008 Preston Smith Implementing an Industrial-Strength Academic Cyberinfrastructure at Purdue University.
Sun Grid Engine. Grids Grids are collections of resources made available to customers. Compute grids make cycles available to customers from an access.
Secure Operating Systems Lesson E: Windows Security - Overview.
1 1 Vulnerability Assessment of Grid Software Jim Kupsch Associate Researcher, Dept. of Computer Sciences University of Wisconsin-Madison Condor Week 2006.
Git A distributed version control system Powerpoint credited to University of PA And modified by Pepper 8-Oct-15.
| nectar.org.au NECTAR TRAINING Module 9 Backing up & Packing up.
Installation Overview Lab#2 1Hanin Abdulrahman. Installing Ubuntu Linux is the process of copying operating system files from a CD, DVD, or USB flash.
Greg Thain Computer Sciences Department University of Wisconsin-Madison cs.wisc.edu Interactive MPI on Demand.
1 Vulnerability Assessment of Grid Software James A. Kupsch Computer Sciences Department University of Wisconsin Condor Week 2007 May 2, 2007.
An Intro to Concurrent Versions System (CVS) ECE 417/617: Elements of Software Engineering Stan Birchfield Clemson University.
The Roadmap to New Releases Derek Wright Computer Sciences Department University of Wisconsin-Madison
Condor Usage at Brookhaven National Lab Alexander Withers (talk given by Tony Chan) RHIC Computing Facility Condor Week - March 15, 2005.
Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison Quill / Quill++ Tutorial.
1 FreeBSD Installation AFNOG X Cairo, Egypt May 2009 Hervey Allen.
Derek Wright Computer Sciences Department University of Wisconsin-Madison Condor and MPI Paradyn/Condor.
An Introduction to Device Drivers Ted Baker  Andy Wang COP 5641 / CIS 4930.
European Middleware Initiative (EMI) The Software Engineering Model Alberto Di Meglio (CERN) Interim Project Director.
CS223: Software Engineering Lecture 2: Introduction to Software Engineering.
Version Control and SVN ECE 297. Why Do We Need Version Control?
SPI NIGHTLIES Alex Hodgkins. SPI nightlies  Build and test various software projects each night  Provide a nightlies summary page that displays all.
Dan Bradley Condor Project CS and Physics Departments University of Wisconsin-Madison CCB The Condor Connection Broker.
How High Throughput was my cluster? Greg Thain Center for High Throughput Computing.
HTCondor Security Basics HTCondor Week, Madison 2016 Zach Miller Center for High Throughput Computing Department of Computer Sciences.
Once you have been through these notes you will need to complete the workbook.
Chapter 25 – Configuration Management 1Chapter 25 Configuration management.
Condor Week May 2012No user requirements1 Condor Week 2012 An argument for moving the requirements out of user hands - The CMS experience presented.
Chapter Six.
Backing Up Your System With rsnapshot
Improvements to Configuration
Version Control with Subversion
HTCondor Security Basics
Scheduling Policy John (TJ) Knoeller Condor Week 2017.
Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)
High Availability in HTCondor
SRM2 Migration Strategy
June 2011 David Front Weizmann Institute
The Scheduling Strategy and Experience of IHEP HTCondor Cluster
Integration of Singularity With Makeflow
Using the Parallel Universe beyond MPI
Accounting, Group Quotas, and User Priorities
HTCondor Security Basics HTCondor Week, Madison 2016
IS3440 Linux Security Unit 7 Securing the Linux Kernel
Basic Grid Projects – Condor (Part I)
Chapter Six.
November 5 No exam results today. 9 Classes to go!
Dr. Rob Hasker SE 3800 Note 9 Reviews.
Sun Grid Engine.
Stata Basic Course Lab 2.
Periodic Processes Chapter 9.
Tonga Institute of Higher Education IT 141: Information Systems
Tonga Institute of Higher Education IT 141: Information Systems
Condor-G Making Condor Grid Enabled
Job Submission Via File Transfer
Outline Announcements: Version control with CVS HW II due today!
Advanced Tips and Tricks
Presentation transcript:

Upgrading Condor Best Practices

The problem More frequent releases of Condor Every six to nine months? Understand this is a problem for users We’re willing to help out

Overview Config file management Condor testing strategies Standard Universe issues

Config files LOCAL_CONFIG_FILE Used for #include-like behaviour: $(HOSTS), $(GLOBAL), $(POLICY)…

Typical Config file ## Try to save this much swap space by not starting new shadows. ## Specified in megabytes. #RESERVED_SWAP = 5 Commented out lists the default value

Config file editing Never edit base condor_config file Except to specify the local file Put all edits in a local file One local file per config type E.g. for schedds, CMs, types of execute machines Can mix and match

Dealing with a new config Diff base config with your config Understand new items Documented in manual version-history Existing ones rarely change Usually capacity changes Almost always, overwriting base file works

Managing config files Centralized management key Cfengine, rsync, nfs (!) etc.

Testing new versions

Compatibility Guarantees No guarantees… But we try very hard! Both forward and backward Especially within one machine Federation techniques require this

Incremental testing! Three basic components of Condor: Central Manager Submit points Execute machines Test each independently

Testing Central Manager Take advantage of statelessness Condor HAD can help out here If it breaks, existing jobs keep running

Testing schedds Adding a new test schedd easy Schedd can be bottleneck Test jobs useful too, not just sleep Schedd can be bottleneck Probably only place you need to check cpu performance

Testing startds Easy to test a few at once Be careful when running std uni Glide in can be very helpful But beware of root specific issues Admin slots helpful

Now that we’ve tested… Always be undo-able! (never overwrite files) Rely on master restart on stat change

Big bang approach What we do at CS Just change a symlink to the binaries Master does the rest… Can be a big hit on shared filesystems

Incremental restart First, restart CM Send, reboot schedd No jobs lost Send, reboot schedd If restart happens in 20 minutes, jobs keep running What about the startds? Might be OK for standard uni Work on this coming soon…

Standard Universe CheckpointPlatform clarifications More sensitive to backward compatibility CheckpointPlatform clarifications condor_qedit -constraint 'LastCheckpointPlatform =?= "LINUX INTEL 2.6.x normal"' LastCheckpointPlatform '"LINUX INTEL 2.6.x normal 0xffffe000"'

Draining old Std Uni Keep a few old startds around To finish old standard uni jobs Set start to “JobUniverse == 1” Or maybe rank… Only on the old platforms

When to upgrade? Zeroth law of software engineering Development series actually pretty stable We’ll let you know about security issues Probably don’t need every minor version Don’t be more than one major stable version behind

In summary… Keep config files under control Test each component in isolation Be aware of standard universe issues

Any questions? Thank you!