
1 US CMS Testbed

2 Large Hadron Collider
Supercollider on the French-Swiss border
Under construction, with completion expected in 2006
(Based on a slide by Scott Koranda at NCSA)

3 Compact Muon Solenoid
Detector / experiment for the LHC
Search for the Higgs boson and other fundamental forces

4 Still Under Development
Developing software to process the enormous amount of data the detector will generate
For testing and prototyping, the detector is being simulated now
Simulating events (particle collisions)
We’re involved in the United States portion of the effort

5 Storage and Computational Requirements
Simulating and reconstructing millions of events per year, in batches of around 150,000 events (about 10 CPU-months per batch)
Each event requires about 3 minutes of processor time
A single run will generate about 300 GB of data
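As a rough cross-check of those numbers (a back-of-the-envelope calculation, not taken from the slides):

  150,000 events × 3 CPU-minutes/event = 450,000 CPU-minutes
  450,000 CPU-minutes ÷ 60 ÷ 24 ≈ 312 CPU-days ≈ 10 CPU-months
  300 GB ÷ 150,000 events ≈ 2 MB per event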

6 Before Condor-G and Globus
Runs are hand-assigned to individual sites
Manpower-intensive to organize run distribution and collect results
Each site has staff managing its runs
Manpower-intensive to monitor jobs, CPU availability, disk space, etc.

7 Before Condor-G and Globus
Use an existing tool (MCRunJob) to manage tasks
Not “Grid-aware”
Expects a reliable batch system

8 UW High Energy Physics: A special case
Was a site being assigned runs
Modified its local configuration to flock to the UW Computer Science Condor pool (sketched below)
When possible, used the standard universe to make more computers available
During one week, used 30,000 CPU hours
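Roughly what that setup looks like in Condor terms; this is a sketch with placeholder names, not the actual UW configuration:

  # condor_config on the HEP submit machine: let idle jobs flock
  # to the Computer Science pool (central manager name is a placeholder)
  FLOCK_TO = condor.cs.wisc.edu

  # submit description: the standard universe checkpoints jobs and ships
  # their I/O back to the submit machine, so borrowed machines can be
  # used safely (the executable must be re-linked with condor_compile)
  universe   = standard
  executable = cmsim
  queue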

9 Our Goal
Move the work onto “the Grid” using Globus and Condor-G

10 Why the Grid? Centralize management of simulation work
Reduce manpower at individual sites

11 Why Condor-G? Monitors and manages tasks
Reliability in an unreliable world
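A minimal sketch of what a Condor-G submit description for one of these jobs might look like, using the globus universe syntax of that era; the gatekeeper host, wrapper script, and run names are made up for illustration. Condor-G hands the job to the remote site’s Globus gatekeeper and then tracks it, and retries it, like an ordinary batch job:

  # submit with: condor_submit sim_run.sub   (file name is hypothetical)
  universe        = globus
  globusscheduler = gatekeeper.worker-site.example/jobmanager-condor
  executable      = mop_wrapper.sh
  arguments       = run_00123
  output          = run_00123.out
  error           = run_00123.err
  log             = run_00123.log
  queue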

12 Lessons Learned
The grid will fail
Design for recovery

13 The Grid Will Fail
The grid is complex
The grid is new and untested: often beta, alpha, or prototype software
The public Internet is out of your control
Remote sites are out of your control

14 The Grid is Complex
Our system has 16 layers; a minimal Globus/Condor-G system has 9
Most layers are stable and transparent
MCRunJob > Impala > MOP > condor_schedd > DAGMan > condor_schedd > condor_gridmanager > gahp_server > globus-gatekeeper > globus-job-manager > globus-job-manager-script.pl > local batch system submit > local batch system execute > MOP wrapper > Impala wrapper > actual job

15 Design for Recovery
Provide recovery at multiple levels to minimize lost work
Be able to start a particular task over from scratch if necessary
Never assume that a particular step will succeed
Allocate lots of debugging time

16 Now
A single master site sends jobs to distributed worker sites
Individual sites provide a configured Globus node and batch system
300+ CPUs across a dozen sites
Condor-G acts as a reliable batch system and Grid front end

17 How? MOP
Monte Carlo Distributed Production System
Pretends to be a local batch system for MCRunJob
Repackages jobs to run on a remote site

18 CMS Testbed Big Picture
[Diagram: the master site runs MCRunJob, MOP, DAGMan, and Condor-G; each worker site runs Globus and Condor, where the real work executes]

19 DAGMan, Condor-G, Globus, Condor
DAGMan - manages dependencies
Condor-G - monitors the job on the master site
Globus - sends jobs to the remote site
Condor - manages jobs and computers at the remote site
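The dependency structure DAGMan manages might look something like the sketch below; the node and submit-file names are hypothetical. Stage-in must finish before the simulation starts, and stage-out runs only after the simulation succeeds:

  # assembly.dag -- submitted with condor_submit_dag assembly.dag
  JOB  stagein   stagein.sub
  JOB  simulate  simulate.sub
  JOB  stageout  stageout.sub
  PARENT stagein  CHILD simulate
  PARENT simulate CHILD stageout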

20 Recovery: Condor
Automatically recovers from machine and network problems on the execute cluster

21 Recovery: Condor-G
Automatically monitors for and retries a number of possibly-transient errors
Recovers from a down master site, down worker sites, or a down network
After a network outage, can reconnect to still-running jobs

22 Recovery: DAGMan
If a particular task fails permanently, DAGMan notes it and allows easy retry
It can retry automatically, but we don’t use that feature
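In DAGMan terms, automatic retry is a one-line RETRY statement per node (the node name below follows the hypothetical DAG sketched earlier); without it, DAGMan writes a rescue DAG after a failure, and resubmitting that rescue DAG with condor_submit_dag reruns only the nodes that did not finish:

  # optional automatic retry: attempt the simulate node up to 3 times
  RETRY simulate 3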

23 Globus
Globus software is under rapid development
Use old software and miss important updates
Use new software and deal with version incompatibilities

24 Fall of 2002: First Test
Our first run gave us two weeks to do about 10 days of work (given the CPUs available at the time)
We had problems: a power outage (several hours), network outages (up to eleven hours), worker site failures, full disks, and Globus failures

25 It Worked! The system recovered automatically from many problems
Relatively low human intervention: approximately one full-time person

26 Since Then Improved automatic recovery for more situations
Generated 1.5 million events (about 30 CPU-years) in just a few months
Currently gearing up for even larger runs starting this summer

27 Future Work
Expand the grid with more machines
Use Condor-G’s scheduling capabilities to automatically assign jobs to sites
Officially replace the previous system this summer

28 Thank You!

