US CMS Testbed A Grid Computing Case Study Alan De Smet Condor Project University of Wisconsin at Madison
Trust No One The grid will fail Design for recovery
The Grid Will Fail The grid is complex The grid is relatively new and untested –Much of it is best described as prototypes or alpha versions The public Internet is out of your control Remote sites are out of your control
Design for Recovery Provide recovery at multiple levels to minimize lost work Be able to start a particular task over from scratch if necessary Never assume that a particular step will succeed Allocate lots of debugging time
Some Background
Compact Muon Solenoid Detector The Compact Muon Solenoid (CMS) detector at the Large Hadron Collider will probe fundamental forces in our Universe and search for the yet-undetected Higgs Boson. (Based on slide by Scott Koranda at NCSA)
Compact Muon Solenoid (Based on slide by Scott Koranda at NCSA)
CMS - Now and the Future The CMS detector is expected to come online in 2006 Software to analyze this enormous amount of data from the detector is being developed now. For testing and prototyping, the detector is being simulated now.
What We’re Doing Now Our runs are divided into two phases –Monte Carlo detector response simulation –Physics reconstruction The testbed currently only does simulation, but is moving toward reconstruction.
Storage and Computational Requirements Simulating and reconstructing millions of events per year Each event requires about 3 minutes of processor time Events are generally processed in runs of about 150,000 events The simulation step of a single run will generate about 150 GB of data –Reconstruction has similar requirements
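To put the figures above in perspective, here is a back-of-the-envelope calculation for a single run. It uses only the numbers quoted on this slide; the one added assumption is the decimal convention 1 GB = 1000 MB.

```python
# Rough per-run totals from the figures quoted on the slide.
EVENTS_PER_RUN = 150_000
CPU_MINUTES_PER_EVENT = 3
RUN_OUTPUT_GB = 150

cpu_minutes = EVENTS_PER_RUN * CPU_MINUTES_PER_EVENT
cpu_hours = cpu_minutes / 60
cpu_days = cpu_hours / 24

# Assumption: 1 GB = 1000 MB (decimal units).
mb_per_event = RUN_OUTPUT_GB * 1000 / EVENTS_PER_RUN

print(f"{cpu_hours:.0f} CPU-hours ({cpu_days:.1f} CPU-days) per run")
print(f"{mb_per_event:.1f} MB of simulation output per event")
```

On these numbers the simulation step of one run costs roughly 7,500 CPU-hours and produces about 1 MB per event, which is why a run is spread over many CPUs.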
Existing CMS Production Runs are assigned to individual sites Each site has staff managing their runs –Manpower intensive to monitor jobs, CPU availability, disk space Local site uses Impala (old way) or MCRunJob (new way) to manage jobs running on local batch system.
Testbed CMS Production What I work on Designed to allow a single master site to manage jobs scattered to many worker sites
CMS Testbed Workers As we move from testbed to full production, we will add more sites and hundreds of CPUs.
CMS Testbed Big Picture (diagram) Master Site: Impala, MOP, DAGMan, Condor-G Worker: Globus, Condor, Real Work
Impala Tool used in current production Assembles jobs to be run Sends jobs out Collects results Minimal recovery mechanism Expects to hand jobs off to a local batch system –Assumes local file system
MOP Monte Carlo Distributed Production System –It could have been MonteDistPro (as in The Count of…) Pretends to be a local batch system for Impala Repackages jobs to run on a remote site
MOP Repackaging Impala hands MOP a list of input files, output files, and a script to run. Binds site-specific information to the script –Path to binaries, location of scratch space, staging location, etc. –Impala is given placeholder locations like _path_to_gdmp_dir_, which MOP rewrites Breaks jobs into five-step DAGs Hands job off to DAGMan/Condor-G
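The placeholder rewriting described above can be sketched as a simple text substitution. This is an illustration only, not MOP's actual code: the function name `bind_site`, the `_scratch_dir_` placeholder, the `run_job` command, and the paths are all hypothetical. Only `_path_to_gdmp_dir_` comes from the slide.

```python
def bind_site(script_text, site_settings):
    """Replace each placeholder token in the script with the site's value."""
    for placeholder, value in site_settings.items():
        script_text = script_text.replace(placeholder, value)
    return script_text


# Hypothetical script template and site values, for illustration only.
template = "cd _path_to_gdmp_dir_ && ./run_job --scratch _scratch_dir_"
site = {
    "_path_to_gdmp_dir_": "/opt/gdmp",
    "_scratch_dir_": "/scratch/cms",
}

bound = bind_site(template, site)
print(bound)
```

Keeping the site values in one mapping per site means the same script template can be bound to any worker site at submission time.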
MOP Job Stages Stage-in - Move input data and program to remote site Run - Execute the program Stage-out - Retrieve program logs Publish - Retrieve program output Cleanup - Delete files
MOP Job Stages A MOP “run” collects multiple groups into a single DAG which is submitted to DAGMan Combined DAG...
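As a rough sketch, a combined DAG for several groups could be generated like this. `JOB` and `PARENT ... CHILD` are standard DAGMan keywords. The strictly linear ordering of the five stages, the node names, and the `.sub` submit file names are assumptions made for illustration; MOP's real DAGs may be wired differently.

```python
# Stage names follow the five MOP job stages; the linear order is an assumption.
STAGES = ["stagein", "run", "stageout", "publish", "cleanup"]


def group_dag_lines(group):
    """Return DAG file lines for one five-stage group, chained in order."""
    nodes = [f"{group}_{stage}" for stage in STAGES]
    # One JOB line per node; submit file names here are hypothetical.
    lines = [f"JOB {node} {node}.sub" for node in nodes]
    # Each stage depends on the one before it.
    lines += [f"PARENT {a} CHILD {b}" for a, b in zip(nodes, nodes[1:])]
    return lines


def combined_dag(groups):
    """Concatenate the per-group chains into a single DAG file body."""
    lines = []
    for group in groups:
        lines.extend(group_dag_lines(group))
    return "\n".join(lines) + "\n"


dag_text = combined_dag(["g001", "g002"])
print(dag_text)
```

Because the groups share no dependencies, DAGMan is free to run them side by side while still enforcing the stage order inside each group.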
DAGMan, Condor-G, Globus, Condor DAGMan - Manages dependencies Condor-G - Monitors the job on master site Globus - Sends jobs to remote site Condor - Manages job and computers at remote site
Typical Network Configuration Worker Site: Head Node Worker Site: Compute Node Worker Site: Compute Node Private Network Public Internet MOP Master Machine
Network Configuration Some sites make compute nodes visible to the public Internet, but many do not. –Private networks will scale better as sites add dozens or hundreds of machines –As a result, any stage handling data transfer to or from the MOP Master must run on the head node. No other node can address the MOP Master This is a scalability issue. We haven’t hit the limit yet.
When Things Go Wrong How recovery is handled
Recovery - DAGMan Remembers current status –When restarted, determines current progress and continues. Notes failed jobs for resubmission –Can automatically retry, but we don’t
Recovery - Condor-G Remembers current status –When restarted, reconnects jobs to remote sites and updates status –Also runs DAGMan; when Condor-G is restarted, it restarts DAGMan Retries in certain failure cases Holds jobs in other failure cases
Recovery - Condor Remembers current status Running on remote site Recovers job state and restarts jobs on machine failure
globus-url-copy Used for file transfer Client process can hang under some circumstances Wrapped in a shell script giving transfer a maximum duration. If run exceeds duration, job is killed and restarted. Using ftsh to write script - Doug Thain’s Fault Tolerant Shell.
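The "maximum duration, then kill and restart" idea can be sketched as follows. The production wrapper is an ftsh script; this Python version is only a stand-in to show the pattern, and the function name, attempt limit, and stand-in commands are assumptions.

```python
import subprocess
import sys


def run_with_deadline(command, max_seconds, max_attempts=3):
    """Run command; if it exceeds max_seconds it is killed and retried.

    Returns the attempt number that succeeded, or None if every attempt
    timed out or exited with a nonzero status.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            result = subprocess.run(command, timeout=max_seconds)
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child; try again.
            continue
        if result.returncode == 0:
            return attempt
    return None


# Stand-ins for a transfer command: one finishes quickly, one hangs.
quick = [sys.executable, "-c", "pass"]
hung = [sys.executable, "-c", "import time; time.sleep(30)"]

print(run_with_deadline(quick, max_seconds=10))
print(run_with_deadline(hung, max_seconds=1, max_attempts=2))
```

In real use the command list would hold the globus-url-copy invocation instead of the stand-ins, with a deadline chosen generously enough for the largest expected file.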
Human Involvement in Failure Recovery Condor-G places some problem jobs on hold –By placing them on hold, we prevent the jobs from failing and provide an opportunity to recover. Usually Globus problems: expired certificate, jobmanager misconfiguration, bugs in the jobmanager
Human Involvement in Failure Recovery A human diagnoses the jobs placed on hold –Is problem transient? condor_release the job. –Otherwise fix the problem, then release the job. –Can the problem not be fixed? Reset the GlobusContactString and release the job, forcing it to restart. condor_qedit GlobusContactString X
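The decision procedure above can be written down as a small helper that returns the command lines an operator would run. It only builds the commands and does not execute them. The function name and the job id are hypothetical, the value `X` is kept exactly as on the slide, and any quoting the attribute value may need is not addressed here.

```python
def recovery_commands(job_id, transient, fixable, new_contact="X"):
    """Return the command lines an operator would run for a held job."""
    release = ["condor_release", job_id]
    if transient or fixable:
        # Transient: release right away.
        # Fixable: a human fixes the cause first, then releases.
        return [release]
    # Not fixable: reset the contact string so the job restarts, then release.
    reset = ["condor_qedit", job_id, "GlobusContactString", new_contact]
    return [reset, release]


# Hypothetical job id, for illustration only.
for cmd in recovery_commands("123.0", transient=False, fixable=False):
    print(" ".join(cmd))
```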
Human Involvement in Failure Recovery Sometimes tasks themselves fail A variety of problems, typically external: disk full, network outage –DAGMan notes failure. When all possible DAGMan nodes finish or fail, a rescue DAG file is generated. –Submitting this rescue DAG will retry all failed nodes.
Doing Real Work
CMS Production Job 1828 US CMS Testbed asked to help with real CMS production Given 150,000 events to do in two weeks.
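A quick estimate shows what this request implies. It combines the two-week deadline with the roughly 3 minutes of processor time per event quoted earlier, and assumes perfect utilization with no failures and no time spent on data transfer, so it is a lower bound rather than a plan.

```python
import math

EVENTS = 150_000
CPU_MINUTES_PER_EVENT = 3  # figure quoted earlier in the talk
DEADLINE_DAYS = 14

cpu_minutes_needed = EVENTS * CPU_MINUTES_PER_EVENT
wall_minutes_available = DEADLINE_DAYS * 24 * 60

# Lower bound on CPUs that must be busy around the clock for two weeks.
min_cpus = math.ceil(cpu_minutes_needed / wall_minutes_available)
print(min_cpus)
```

Any outage or failed transfer eats into that margin, so the number of CPUs actually needed is higher.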
What Went Wrong Power outage Network outages Worker site failures Globus failures DAGMan failure Unsolved mysteries
Power Outage A power outage at the UW took out the master site and the UW worker site for several hours During the outage worker sites continued running assigned tasks, but as they exhausted their queues we could not send additional tasks File transfers sending data back failed System recovered well
Network Outages Several outages, most less than an hour, one for eleven hours Worker sites continued running assigned tasks Master site was unable to report status until network was restored File transfers failed System recovered well
Worker Site Failures One site had a configuration change go bad, causing the Condor jobs to fail –Condor-G placed problem tasks on hold. When the situation was resolved, we released the jobs and they succeeded. Another site was incompletely upgraded during the run. –Jobs were held, released when fixed.
Worker Site Failure / Globus Failure At one site, Condor jobs were removed from the pool using condor_rm, probably by accident The previous Globus interface to Condor wasn’t prepared for that possibility and erroneously reported the job as still running –Fixed in newest Globus Job’s contact string was reset.
Globus Failures globus-job-manager would sometimes stop checking the status of a job, reporting the last status forever When a job was taking unusually long, this was usually the problem Killing the globus-job-manager caused a new one to be started, solving the problem –Has to be done on the remote site (Or via globus-job-run)
Globus Failures globus-job-manager would sometimes corrupt state files Wisconsin team debugged problem and distributed patched program Failed jobs had their GlobusContactStrings reset.
Globus Failures Some globus-job-managers would report problems accessing input files –The reason has not been diagnosed. Affected jobs had their GlobusContactStrings reset.
DAGMan Failure In one instance a DAGMan managing 50 groups of jobs crashed. The DAG file was tweaked by hand to mark completed jobs as such and resubmitted –Finished jobs in a DAG simply have DONE added to the end of their entry
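The hand edit described above is mechanical enough to script. The sketch below appends DONE to the JOB entry of each finished node. The function name and the sample DAG are hypothetical, and how the set of finished nodes is determined (for example, from logs) is left out.

```python
def mark_done(dag_text, finished):
    """Append DONE to the JOB entry of every node named in finished."""
    out = []
    for line in dag_text.splitlines():
        fields = line.split()
        is_job = len(fields) >= 3 and fields[0].upper() == "JOB"
        already_done = bool(fields) and fields[-1].upper() == "DONE"
        if is_job and fields[1] in finished and not already_done:
            line = line.rstrip() + " DONE"
        out.append(line)
    return "\n".join(out) + "\n"


# Hypothetical two-node DAG, for illustration only.
dag = "JOB a a.sub\nJOB b b.sub\nPARENT a CHILD b\n"
patched = mark_done(dag, {"a"})
print(patched)
```

Lines other than JOB entries pass through unchanged, so the dependency structure of the resubmitted DAG stays exactly as it was.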
Problems Previously Encountered We’ve been doing test runs for ~9 months. We’ve encountered and resolved many other issues. Consider building your own copy of the Globus tools out of CVS to stay on top of bugfixes. Monitor the Globus mailing lists.
The Future
Future Improvements Currently our run stage runs as a vanilla universe Condor job on the worker site. If there is a problem the job must be restarted from scratch. Switching to the standard universe would allow jobs to recover and continue aborted runs.
Future Improvements Data transfer jobs are run as Globus fork jobs. They are completely unmanaged on the remote site. If the remote site has an outage, there is no information on the jobs. –Running these under Condor (Scheduler universe) would ensure that status was not lost. –Also looking at using the DaP Scheduler
Future Improvements Jobs are assigned to specific sites by an operator Once assigned, changing the assigned site is nearly impossible Working to support “grid scheduling”: automatic assignment of jobs to sites and changing site assignment