US CMS Testbed
Large Hadron Collider
- Supercollider on the French-Swiss border
- Under construction, completion in 2006
- (Based on a slide by Scott Koranda at NCSA)
Compact Muon Solenoid
- Detector / experiment for the LHC
- Search for the Higgs boson, other fundamental forces
Still Under Development
- Developing software to process the enormous amount of data that will be generated
- For testing and prototyping, the detector is being simulated now
- Simulating events (particle collisions)
- We're involved in the United States portion of the effort
Storage and Computational Requirements
- Simulating and reconstructing millions of events per year, in batches of around 150,000 (about 10 CPU-months per batch)
- Each event requires about 3 minutes of processor time
- A single run will generate about 300 GB of data
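A quick back-of-the-envelope check of the batch figure above, as a sketch (the 30-day CPU-month is an assumption):

    # Rough check: 150,000 events at ~3 CPU-minutes each.
    events_per_batch = 150_000
    minutes_per_event = 3
    cpu_hours = events_per_batch * minutes_per_event / 60
    cpu_months = cpu_hours / (24 * 30)   # assuming a 30-day "CPU-month"
    print(f"{cpu_hours:.0f} CPU-hours, about {cpu_months:.1f} CPU-months")
    # prints: 7500 CPU-hours, about 10.4 CPU-months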
Before Condor-G and Globus
- Runs are hand-assigned to individual sites
- Manpower-intensive to organize run distribution and collect results
- Each site has staff managing its runs
- Manpower-intensive to monitor jobs, CPU availability, disk space, etc.
Before Condor-G and Globus
- Use an existing tool (MCRunJob) to manage tasks
- Not "Grid-aware"
- Expects a reliable batch system
UW High Energy Physics: A Special Case
- Was a site being assigned runs
- Modified its local configuration to flock to the UW Computer Science Condor pool
- When possible, used the standard universe to increase the number of available computers
- During one week, used 30,000 CPU-hours
Our Goal
- Move the work onto "the Grid" using Globus and Condor-G
Why the Grid?
- Centralize management of simulation work
- Reduce manpower at individual sites
Why Condor-G?
- Monitors and manages tasks
- Reliability in an unreliable world
Lessons Learned
- The grid will fail
- Design for recovery
The Grid Will Fail
- The grid is complex
- The grid is new and untested: often beta, alpha, or prototype software
- The public Internet is out of your control
- Remote sites are out of your control
The Grid is Complex
- Our system has 16 layers; a minimal Globus/Condor-G system has 9 layers
- Most layers are stable and transparent
- The full chain: MCRunJob > Impala > MOP > condor_schedd > DAGMan > condor_schedd > condor_gridmanager > gahp_server > globus-gatekeeper > globus-job-manager > globus-job-manager-script.pl > local batch system submit > local batch system execute > MOP wrapper > Impala wrapper > actual job
Design for Recovery
- Provide recovery at multiple levels to minimize lost work
- Be able to start a particular task over from scratch if necessary
- Never assume that a particular step will succeed
- Allocate lots of debugging time
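As an illustration of the "never assume a step will succeed" rule, here is a minimal sketch of a retry wrapper around a single grid operation. The command, host, and paths are made up for illustration; this is not the production tooling.

    import subprocess
    import time

    def run_step(cmd, retries=3, delay=60):
        """Run one shell command, retrying on failure (illustrative helper)."""
        for attempt in range(1, retries + 1):
            if subprocess.run(cmd, shell=True).returncode == 0:
                return True
            print(f"step failed (attempt {attempt}/{retries}): {cmd}")
            time.sleep(delay)  # wait out transient network or site problems
        return False           # caller can restart the whole task from scratch

    # Example: stage results back from a worker site (hypothetical host and paths).
    if not run_step("globus-url-copy gsiftp://worker.example.edu/tmp/out.dat file:///data/out.dat"):
        print("giving up; restart this task from the beginning")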
Now
- A single master site sends jobs to distributed worker sites
- Individual sites provide a configured Globus node and batch system
- 300+ CPUs across a dozen sites
- Condor-G acts as a reliable batch system and Grid front end
How? MOP
- Monte Carlo Distributed Production System
- Pretends to be a local batch system for MCRunJob
- Repackages jobs to run on a remote site
CMS Testbed: Big Picture
- [Diagram: the master site runs MCRunJob, MOP, Condor-G, and DAGMan; Globus carries jobs to the worker sites, where Condor runs the real work]
DAGMan, Condor-G, Globus, Condor
- DAGMan: manages dependencies
- Condor-G: monitors the job on the master site
- Globus: sends jobs to the remote site
- Condor: manages jobs and computers at the remote site
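To make the division of labor concrete, the sketch below writes a Condor-G submit description (Globus-universe syntax of that era) and a two-node DAGMan input file. The gatekeeper host, jobmanager name, and file names are assumptions for illustration, not our production configuration.

    # Sketch: generate a Condor-G submit description and a DAGMan input file.
    # Host, jobmanager, and file names below are illustrative only.
    submit_lines = [
        "universe        = globus",
        "globusscheduler = gatekeeper.worker-site.example.edu/jobmanager-condor",
        "executable      = simulate_wrapper.sh",
        "output          = run_0001.out",
        "error           = run_0001.err",
        "log             = run_0001.log",
        "queue",
    ]

    dag_lines = [
        "# Stage input, then simulate; DAGMan enforces the ordering.",
        "JOB    stage_in stage_in.sub",
        "JOB    simulate simulate.sub",
        "PARENT stage_in CHILD simulate",
        "# RETRY simulate 3   (DAGMan could retry automatically; we leave it off)",
    ]

    with open("simulate.sub", "w") as f:
        f.write("\n".join(submit_lines) + "\n")
    with open("run.dag", "w") as f:
        f.write("\n".join(dag_lines) + "\n")

    # Submit the whole DAG on the master site with: condor_submit_dag run.dag
    # Condor-G then hands each node to the remote gatekeeper via Globus,
    # and the remote Condor pool runs it.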
Recovery: Condor
- Automatically recovers from machine and network problems on the execute cluster
Recovery: Condor-G
- Automatically monitors for and retries a number of possibly transient errors
- Recovers from a down master site, down worker sites, or a down network
- After a network outage, it can reconnect to still-running jobs
Recovery: DAGMan
- If a particular task fails permanently, DAGMan notes it and allows easy retry
- It can retry automatically, but we choose not to
Globus
- Globus software is under rapid development
- Use old software and miss important updates
- Use new software and deal with version incompatibilities
Fall of 2002: First Test
- Our first run gave us two weeks to do about 10 days of work (given the CPUs available at the time)
- We had problems: a power outage (several hours), network outages (up to eleven hours), worker-site failures, full disks, Globus failures
It Worked!
- The system recovered automatically from many problems
- Relatively little human intervention: approximately one full-time person
Since Then
- Improved automatic recovery for more situations
- Generated 1.5 million events (about 30 CPU-years) in just a few months
- Currently gearing up for even larger runs starting this summer
Future Work
- Expanding the grid with more machines
- Use Condor-G's scheduling capabilities to automatically assign jobs to sites
- Officially replace the previous system this summer
Thank You!
http://www.cs.wisc.edu/condor
adesmet@cs.wisc.edu