US CMS Testbed A Grid Computing Case Study Alan De Smet Condor Project University of Wisconsin at Madison


Trust No One The grid will fail Design for recovery

The Grid Will Fail The grid is complex The grid is relatively new and untested –Much of it is best described as prototypes or alpha versions The public Internet is out of your control Remote sites are out of your control

Design for Recovery Provide recovery at multiple levels to minimize lost work Be able to start a particular task over from scratch if necessary Never assume that a particular step will succeed Allocate lots of debugging time

Some Background

Compact Muon Solenoid Detector The Compact Muon Solenoid (CMS) detector at the Large Hadron Collider will probe fundamental forces in our Universe and search for the yet-undetected Higgs Boson. (Based on slide by Scott Koranda at NCSA)

Compact Muon Solenoid (Based on slide by Scott Koranda at NCSA)

CMS - Now and the Future The CMS detector is expected to come online in 2006 Software to analyze this enormous amount of data from the detector is being developed now. For testing and prototyping, the detector is being simulated now.

What We’re Doing Now Our runs are divided into two phases –Monte Carlo detector response simulation –Physics reconstruction The testbed currently only does simulation, but is moving toward reconstruction.

Storage and Computational Requirements Simulating and reconstructing millions of events per year Each event requires about 3 minutes of processor time Events are generally processed in runs of about 150,000 events The simulation step of a single run will generate about 150 GB of data –Reconstruction has similar requirements

Existing CMS Production Runs are assigned to individual sites Each site has staff managing their runs –Manpower intensive to monitor jobs, CPU availability, disk space Local site uses Impala (old way) or MCRunJob (new way) to manage jobs running on local batch system.

Testbed CMS Production What I work on Designed to allow a single master site to manage jobs scattered to many worker sites

CMS Testbed Workers As we move from testbed to full production, we will add more sites and hundreds of CPUs.

CMS Testbed Big Picture Master Site Impala MOP Condor-G Worker Globus Condor Real Work DAGMan

Impala Tool used in current production Assembles jobs to be run Sends jobs out Collects results Minimal recovery mechanism Expects to hand jobs off to a local batch system –Assumes local file system

MOP Monte Carlo Distributed Production System –It could have been MonteDistPro (as in, The Count of…) Pretends to be a local batch system for Impala Repackages jobs to run on a remote site

MOP Repackaging Impala hands MOP a list of input files, output files, and a script to run. Binds site-specific information to the script –Path to binaries, location of scratch space, staging location, etc. –Impala is given placeholders like _path_to_gdmp_dir_ which MOP rewrites Breaks jobs into five-step DAGs Hands jobs off to DAGMan/Condor-G
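The placeholder rewriting can be sketched as a simple text substitution. This is a hypothetical illustration, not MOP's actual implementation: the file names, placeholder `_path_to_binaries_`, and the per-site paths are all assumptions.

```shell
#!/bin/sh
# Sketch of MOP-style placeholder rewriting (hypothetical names throughout).
# Create a sample Impala script containing site-neutral placeholders.
cat > impala_job.sh <<'EOF'
cp input.fz _path_to_gdmp_dir_/staging/
_path_to_binaries_/cmsim < input.cards
EOF

GDMP_DIR=/scratch/cms/gdmp   # assumed per-site value
BIN_DIR=/opt/cms/bin         # assumed per-site value

# Rewrite the placeholders into this site's real paths.
sed -e "s|_path_to_gdmp_dir_|$GDMP_DIR|g" \
    -e "s|_path_to_binaries_|$BIN_DIR|g" \
    impala_job.sh > site_job.sh
```

The point is only that Impala stays site-neutral while MOP binds the site-specific details at submission time.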

MOP Job Stages Stage-in - Move input data and program to remote site Run - Execute the program Stage-out - Retrieve program logs Publish - Retrieve program output Cleanup - Delete files

MOP Job Stages A MOP “run” collects multiple groups into a single combined DAG, which is submitted to DAGMan
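One five-stage group might look like the following DAGMan input file. This is a sketch under assumptions: the job names and submit-file names are hypothetical, but the `JOB`/`PARENT ... CHILD` syntax is standard DAGMan.

```
# One MOP group as a DAGMan input file (hypothetical names).
JOB stagein  stagein_run42.sub
JOB run      run_run42.sub
JOB stageout stageout_run42.sub
JOB publish  publish_run42.sub
JOB cleanup  cleanup_run42.sub

# The five stages run strictly in sequence.
PARENT stagein  CHILD run
PARENT run      CHILD stageout
PARENT stageout CHILD publish
PARENT publish  CHILD cleanup
```

A MOP run would concatenate many such groups into one DAG, so DAGMan can manage them all together.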

DAGMan, Condor-G, Globus, Condor DAGMan - Manages dependencies Condor-G - Monitors the job on master site Globus - Sends jobs to remote site Condor - Manages job and computers at remote site

Typical Network Configuration Worker Site: Head Node Worker Site: Compute Node Worker Site: Compute Node Private Network Public Internet MOP Master Machine

Network Configuration Some sites make compute nodes visible to the public Internet, but many do not. –Private networks will scale better as sites add dozens or hundreds of machines –As a result, any stage handling data transfer to or from the MOP Master must run on the head node. No other node can address the MOP Master. This is a scalability issue. We haven’t hit the limit yet.

When Things Go Wrong How recovery is handled

Recovery - DAGMan Remembers current status –When restarted, determines current progress and continues. Notes failed jobs for resubmission –Can automatically retry, but we don’t

Recovery - Condor-G Remembers current status –When restarted, reconnects jobs to remote sites and updates status –Also runs DAGMan; when Condor-G is restarted, it restarts DAGMan Retries in certain failure cases Holds jobs in other failure cases

Recovery - Condor Remembers current status while running on the remote site Recovers job state and restarts jobs on machine failure

globus-url-copy Used for file transfer Client process can hang under some circumstances Wrapped in a shell script giving the transfer a maximum duration. If the run exceeds the duration, the job is killed and restarted. Using ftsh (Doug Thain’s Fault Tolerant Shell) to write the script.
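The timeout idea can be sketched in plain sh (the actual wrapper uses ftsh, whose syntax is not shown here). A watchdog process kills the transfer if it exceeds the limit; everything below is an illustration, not the production script.

```shell
#!/bin/sh
# Sketch: run a command with a maximum duration (assumed wrapper, not ftsh).
run_with_timeout() {
    limit=$1; shift
    "$@" &                                   # the transfer, e.g. globus-url-copy
    pid=$!
    ( sleep "$limit" && kill "$pid" 2>/dev/null ) &
    watchdog=$!
    wait "$pid"                              # exit status of the transfer,
    status=$?                                # or 128+SIGTERM if it was killed
    kill "$watchdog" 2>/dev/null
    return $status
}

# Hypothetical usage: give a transfer at most one hour, then let the
# surrounding machinery resubmit it.
# run_with_timeout 3600 globus-url-copy "$SRC_URL" "$DST_URL"
```

A killed transfer returns a nonzero status, so the caller sees it as an ordinary failure and restarts it.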

Human Involvement in Failure Recovery Condor-G places some problem jobs on hold –By placing them on hold, we prevent the jobs from failing and provide an opportunity to recover. Usually Globus problems: expired certificate, jobmanager misconfiguration, bugs in the jobmanager

Human Involvement in Failure Recovery A human diagnoses the jobs placed on hold –Is problem transient? condor_release the job. –Otherwise fix the problem, then release the job. –Can the problem not be fixed? Reset the GlobusContactString and release the job, forcing it to restart. condor_qedit GlobusContactString X
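The human triage step above could be partly mechanized with a small helper that classifies hold reasons. This is purely hypothetical: the pattern strings are assumptions, not actual Condor-G hold messages, and only `condor_release` and `condor_qedit` come from the slides.

```shell
#!/bin/sh
# Hypothetical triage helper: classify a job's hold reason so an operator
# knows whether to release, fix-then-release, or reset the contact string.
classify_hold_reason() {
    case "$1" in
        *expired*|*"timed out"*) echo transient ;;        # just release
        *jobmanager*)            echo misconfiguration ;; # fix site, then release
        *)                       echo unknown ;;          # reset contact string
    esac
}

# Against a real queue the flow might look like (not run here):
#   case $(classify_hold_reason "$reason") in
#     transient) condor_release "$jobid" ;;
#     unknown)   condor_qedit "$jobid" GlobusContactString X
#                condor_release "$jobid" ;;
#   esac
```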

Human Involvement in Failure Recovery Sometimes tasks themselves fail A variety of problems, typically external: disk full, network outage –DAGMan notes failure. When all possible DAGMan nodes finish or fail, a rescue DAG file is generated. –Submitting this rescue DAG will retry all failed nodes.

Doing Real Work

CMS Production Job 1828 US CMS Testbed asked to help with real CMS production Given 150,000 events to do in two weeks.

What Went Wrong Power outage Network outages Worker site failures Globus failures DAGMan failure Unsolved mysteries

Power Outage A power outage at the UW took out the master site and the UW worker site for several hours During the outage worker sites continued running assigned tasks, but as they exhausted their queues we could not send additional tasks File transfers sending data back failed System recovered well

Network Outages Several outages, most less than an hour, one for eleven hours Worker sites continued running assigned tasks Master site was unable to report status until network was restored File transfers failed System recovered well

Worker Site Failures One site had a configuration change go bad, causing the Condor jobs to fail –Condor-G placed problem tasks on hold. When the situation was resolved, we released the jobs and they succeeded. Another site was incompletely upgraded during the run. –Jobs were held, released when fixed.

Worker Site Failure / Globus Failure At one site, Condor jobs were removed from the pool using condor_rm, probably by accident The previous Globus interface to Condor wasn’t prepared for that possibility and erroneously reported the job as still running –Fixed in the newest Globus The job’s contact string was reset.

Globus Failures globus-job-manager would sometimes stop checking the status of a job, reporting the last status forever When a job was taking unusually long, this was usually the problem Killing the globus-job-manager caused a new one to be started, solving the problem –Has to be done on the remote site (or via globus-job-run)

Globus Failures globus-job-manager would sometimes corrupt state files Wisconsin team debugged problem and distributed patched program Failed jobs had their GlobusContactStrings reset.

Globus Failures Some globus-job-managers would report problems accessing input files –The reason has not been diagnosed. Affected jobs had their GlobusContactStrings reset.

DAGMan failure In one instance a DAGMan managing 50 groups of jobs crashed. The DAG file was tweaked by hand to mark completed jobs as such and resubmitted –Finished jobs in a DAG simply have DONE added to the end of their entry
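The hand-patching can be sketched as a one-line edit per finished job. The DAG contents and job names below are hypothetical; only the trailing `DONE` keyword is standard DAGMan syntax.

```shell
#!/bin/sh
# Sketch of marking a finished job DONE before resubmitting a DAG
# (hypothetical job and file names).
cat > run.dag <<'EOF'
JOB group01 group01.sub
JOB group02 group02.sub
EOF

# group01 is known to have finished; append DONE to its entry only.
sed 's/^JOB group01 \(.*\)$/JOB group01 \1 DONE/' run.dag > run.fixed.dag
```

On resubmission, DAGMan skips any job whose entry ends in `DONE` and runs only the rest.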

Problems Previously Encountered We’ve been doing test runs for ~9 months. We’ve encountered and resolved many other issues. Consider building your own copy of the Globus tools out of CVS to stay on top of bugfixes. Monitor the Globus mailing lists.

The Future

Future Improvements Currently our run stage runs as a vanilla universe Condor job on the worker site. If there is a problem the job must be restarted from scratch. Switching to the standard universe would allow jobs to recover and continue aborted runs.

Future Improvements Data transfer jobs are run as Globus fork jobs. They are completely unmanaged on the remote site. If the remote site has an outage, there is no information on the jobs. –Running these under Condor (Scheduler universe) would ensure that status was not lost. –Also looking at using the DaP Scheduler

Future Improvements Jobs are assigned to specific sites by an operator Once assigned, changing the assigned site is nearly impossible Working to support “grid scheduling”: automatic assignment of jobs to sites and changing site assignment