1
US CMS Testbed
2
Large Hadron Collider
Supercollider on French-Swiss border
Under construction, completion in 2006.
(Based on slide by Scott Koranda at NCSA)
3
Compact Muon Solenoid Detector / Experiment for LHC
Search for Higgs Boson, other fundamental forces
4
Still Under Development
Developing software to process the enormous amount of data generated
For testing and prototyping, the detector is being simulated now
Simulating events (particle collisions)
We’re involved in the United States portion of the effort
5
Storage and Computational Requirements
Simulating and reconstructing millions of events per year, batches of around 150,000 (about 10 CPU months)
Each event requires about 3 minutes of processor time
A single run will generate about 300 GB of data
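The batch figure above can be checked with a little arithmetic. This is a minimal sketch using only the numbers on this slide; the 30-day month and the variable names are assumptions made for the illustration.

```python
# Back-of-the-envelope check of the batch size quoted on the slide.
EVENTS_PER_BATCH = 150_000   # from the slide
MINUTES_PER_EVENT = 3        # from the slide ("about 3 minutes")
DAYS_PER_MONTH = 30          # assumption: a 30-day month

cpu_minutes = EVENTS_PER_BATCH * MINUTES_PER_EVENT
cpu_hours = cpu_minutes / 60
cpu_days = cpu_hours / 24
cpu_months = cpu_days / DAYS_PER_MONTH

print(f"{cpu_hours:,.0f} CPU hours")
print(f"{cpu_months:.1f} CPU months")
```

Under these assumptions a batch comes to roughly 10.4 CPU months, consistent with the "about 10 CPU months" quoted above.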
6
Before Condor-G and Globus
Runs are hand assigned to individual sites
Manpower intensive to organize run distribution and collect results
Each site has staff managing their runs
Manpower intensive to monitor jobs, CPU availability, disk space, etc.
7
Before Condor-G and Globus
Use existing tool (MCRunJob) to manage tasks
Not “Grid-Aware”
Expects reliable batch system
8
UW High Energy Physics: A special case
Was a site being assigned runs
Modified local configuration to flock to UW Computer Science Condor pool
When possible used standard universe to increase available computers
During one week used 30,000 CPU hours.
9
Our Goal
Move the work onto “the Grid” using Globus and Condor-G
10
Why the Grid?
Centralize management of simulation work
Reduce manpower at individual sites
11
Why Condor-G?
Monitors and manages tasks
Reliability in unreliable world
12
Lessons Learned
The grid will fail
Design for recovery
13
The Grid Will Fail
The grid is complex
The grid is new and untested
Often beta, alpha, or prototype.
The public Internet is out of your control
Remote sites are out of your control
14
The Grid is Complex
Our system has 16 layers
A minimal Globus/Condor-G system has 9 layers
Most layers stable and transparent
MCRunJob > Impala > MOP > condor_schedd > DAGMan > condor_schedd > condor_gridmanager > gahp_server > globus-gatekeeper > globus-job-manager > globus-job-manager-script.pl > local batch system submit > local batch system execute > MOP wrapper > Impala wrapper > actual job
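One way to see why a deep stack makes failure likely is to multiply per-layer reliabilities. The sketch below is only an illustration. The layer names come from the slide, but the per-layer success rates are hypothetical, and treating layers as failing independently is a simplifying assumption.

```python
# Layers as listed on the slide, in order from submission to the actual job.
LAYERS = [
    "MCRunJob", "Impala", "MOP", "condor_schedd", "DAGMan", "condor_schedd",
    "condor_gridmanager", "gahp_server", "globus-gatekeeper",
    "globus-job-manager", "globus-job-manager-script.pl",
    "local batch system submit", "local batch system execute",
    "MOP wrapper", "Impala wrapper", "actual job",
]


def chain_success(per_layer_success: float, n_layers: int) -> float:
    """Probability that every layer succeeds, assuming independent layers."""
    return per_layer_success ** n_layers


# Hypothetical per-layer success rates, chosen only for illustration.
for p in (0.999, 0.99, 0.95):
    print(f"per-layer {p:.3f} -> end-to-end {chain_success(p, len(LAYERS)):.3f}")
```

Even if each of the 16 layers worked 99% of the time, the whole chain would succeed only about 85% of the time under these assumptions.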
15
Design for Recovery
Provide recovery at multiple levels to minimize lost work
Be able to start a particular task over from scratch if necessary
Never assume that a particular step will succeed
Allocate lots of debugging time
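The recovery principles above can be sketched as a small step runner that records completed steps, resumes after a failure, and can be told to start over. This is a hypothetical illustration, not the testbed's code: `run_with_checkpoint`, the JSON state file, and the step names are all assumptions.

```python
import json
from pathlib import Path


def run_with_checkpoint(steps, state_file, from_scratch=False):
    """Run (name, callable) steps in order, skipping steps already recorded.

    Returns the list of step names completed so far.
    """
    state_path = Path(state_file)
    if from_scratch and state_path.exists():
        state_path.unlink()  # discard earlier progress and redo everything

    done = json.loads(state_path.read_text()) if state_path.exists() else []

    for name, action in steps:
        if name in done:
            continue  # finished in an earlier attempt; do not redo the work
        try:
            action()
        except Exception as exc:  # never assume a step will succeed
            print(f"step {name!r} failed: {exc}; rerun to resume")
            break
        done.append(name)
        # Record progress after every step so a crash loses at most one step.
        state_path.write_text(json.dumps(done))
    return done
```

Calling the function again with the same state file resumes at the first unfinished step, and passing `from_scratch=True` redoes the whole task.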
16
Now
Single master site sends jobs to distributed worker sites.
Individual sites provide configured Globus node and batch system
300+ CPUs across a dozen sites.
Condor-G acts as reliable batch system and Grid front end
17
How? MOP.
Monte Carlo Distributed Production System
Pretends to be local batch system for MCRunJob
Repackages jobs to run on a remote site
18
CMS Testbed Big Picture
[Diagram: the Master Site runs MCRunJob, MOP, DAGMan, and Condor-G; each Worker site runs Globus and Condor, which run the Real Work.]
19
DAGMan, Condor-G, Globus, Condor
DAGMan - Manages dependencies
Condor-G - Monitors the job on master site
Globus - Sends jobs to remote site
Condor - Manages job and computers at remote site
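To make "manages dependencies" concrete, here is a minimal sketch of ordering tasks so that each one runs only after the tasks it depends on. It uses Python's standard `graphlib` module and is not how DAGMan itself is implemented. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key maps to the tasks it depends on.
dependencies = {
    "stage_in": [],
    "simulate_a": ["stage_in"],
    "simulate_b": ["stage_in"],
    "collect_results": ["simulate_a", "simulate_b"],
}

# static_order() yields every task after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two simulation tasks may appear in either order, since neither depends on the other, but both always come after `stage_in` and before `collect_results`.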
20
Recovery: Condor
Automatically recovers from machine and network problems on execute cluster.
21
Recovery: Condor-G
Automatically monitors for and retries a number of possibly transient errors.
Recovers from down master site, down worker sites, down network.
After a network outage can reconnect to still running jobs.
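The idea of retrying possibly transient errors can be sketched in a few lines. This is not Condor-G's actual retry logic. `TransientError`, `retry_transient`, and the exponential backoff schedule are assumptions chosen for the illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors worth retrying, such as a network timeout."""


def retry_transient(action, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action(), retrying on TransientError with exponential backoff.

    Any other exception is treated as permanent and propagates immediately.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except TransientError:
            if attempt == attempts:
                raise  # out of attempts: report the failure to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Separating transient from permanent errors matters here: retrying a permanent failure only wastes time, while giving up on a transient one loses work that a later attempt would have completed.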
22
Recovery: DAGMan
If a particular task fails permanently, notes it and allows easy retry.
Can automatically retry; we don’t.
23
Globus
Globus software under rapid development
Use old software and miss important updates
Use new software and deal with version incompatibilities
24
Fall of 2002: First Test
Our first run gave us two weeks to do about 10 days of work (given available CPUs at the time).
We had problems
Power outage (several hours), network outages (up to eleven hours), worker site failures, full disks, Globus failures
25
It Worked!
The system recovered automatically from many problems
Relatively low human intervention
Approximately one full time person
26
Since Then
Improved automatic recovery for more situations
Generated 1.5 million events (about 30 CPU years) in just a few months
Currently gearing up for even larger runs starting this summer
27
Future Work
Expanding grid with more machines
Use Condor-G’s scheduling capabilities to automatically assign jobs to sites
Officially replace previous system this summer.
28
Thank You!