US CMS Testbed.

A Grid Computing Case Study
Alan De Smet, Condor Project, University of Wisconsin-Madison


Large Hadron Collider: A supercollider on the French-Swiss border, under construction with completion expected in 2006. (Based on a slide by Scott Koranda at NCSA)

Compact Muon Solenoid: A detector and experiment for the LHC, searching for the Higgs boson and evidence of other fundamental forces.

Still Under Development: We are developing software to process the enormous amount of data the detector will generate. For testing and prototyping, the detector is being simulated now, producing simulated events (particle collisions). We are involved in the United States portion of the effort.

Storage and Computational Requirements: We simulate and reconstruct millions of events per year, in batches of around 150,000 events (about 10 CPU months each). Each event requires about 3 minutes of processor time, and a single run generates about 300 GB of data.
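The per-event cost and batch size above imply the quoted figure of about 10 CPU months per batch. A quick sanity check (hypothetical Python script, not part of the testbed software):

```python
# Back-of-the-envelope check of the slide's numbers.
EVENTS_PER_BATCH = 150_000
MINUTES_PER_EVENT = 3

cpu_minutes = EVENTS_PER_BATCH * MINUTES_PER_EVENT   # 450,000 CPU minutes
cpu_months = cpu_minutes / (60 * 24 * 30)            # assuming 30-day months
print(f"{cpu_minutes:,} CPU minutes = about {cpu_months:.0f} CPU months")
```

Running this prints roughly "450,000 CPU minutes = about 10 CPU months", matching the slide.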

Before Condor-G and Globus: Runs are hand-assigned to individual sites, so organizing run distribution and collecting results is manpower-intensive. Each site has staff managing its runs, and monitoring jobs, CPU availability, disk space, etc. is also manpower-intensive.

Before Condor-G and Globus: We use an existing tool (MCRunJob) to manage tasks. It is not "Grid-aware" and expects a reliable batch system.

UW High Energy Physics, a special case: This site was assigned runs and modified its local configuration to flock to the UW Computer Science Condor pool. When possible it used the standard universe to increase the number of available computers. During one week it used 30,000 CPU hours.
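Flocking in Condor is enabled through pool configuration rather than per-job changes. A minimal sketch of the kind of setting involved (the hostname is hypothetical, and a real setup also requires matching configuration on the target pool):

```
# Hypothetical condor_config excerpt on the HEP submit machine:
# let idle jobs flock to the Computer Science pool when the
# local pool is busy.
FLOCK_TO = condor.cs.wisc.edu
```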

Our Goal Move the work onto “the Grid” using Globus and Condor-G

Why the Grid? Centralize management of simulation work and reduce manpower at individual sites.

Why Condor-G? It monitors and manages tasks, providing reliability in an unreliable world.

Lessons Learned: The grid will fail; design for recovery.

The Grid Will Fail: The grid is complex. The grid is new and untested, often beta, alpha, or prototype software. The public Internet is out of your control, and remote sites are out of your control.

The Grid is Complex: Our system has 16 layers; a minimal Globus/Condor-G system has 9. Most layers are stable and transparent. The full stack: MCRunJob > Impala > MOP > condor_schedd > DAGMan > condor_schedd > condor_gridmanager > gahp_server > globus-gatekeeper > globus-job-manager > globus-job-manager-script.pl > local batch system submit > local batch system execute > MOP wrapper > Impala wrapper > actual job.

Design for Recovery: Provide recovery at multiple levels to minimize lost work. Be able to start a particular task over from scratch if necessary. Never assume that a particular step will succeed. Allocate lots of debugging time.

Now: A single master site sends jobs to distributed worker sites. Individual sites provide a configured Globus node and batch system, for 300+ CPUs across a dozen sites. Condor-G acts as a reliable batch system and Grid front end.

How? MOP, the Monte Carlo Distributed Production System. It pretends to be a local batch system for MCRunJob and repackages jobs to run on a remote site.

CMS Testbed Big Picture: [architecture diagram: the master site runs MCRunJob, MOP, DAGMan, and Condor-G; jobs flow through Globus to worker sites, where Condor runs the real work]

DAGMan, Condor-G, Globus, Condor: DAGMan manages dependencies. Condor-G monitors the job on the master site. Globus sends jobs to the remote site. Condor manages jobs and computers at the remote site.
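For reference, a Condor-G job of this era was described with a Globus-universe submit file. A minimal sketch (the gatekeeper host, jobmanager name, and file names are all hypothetical):

```
# Hypothetical Condor-G submit description for one simulation job.
universe        = globus
globusscheduler = gatekeeper.example.edu/jobmanager-pbs
executable      = run_simulation.sh
output          = job.out
error           = job.err
log             = job.log
queue
```

Submitted with condor_submit, the job is then handed to the remote site's batch system by the Globus gatekeeper while Condor-G tracks it from the master site.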

Recovery: Condor automatically recovers from machine and network problems on the execute cluster.

Recovery: Condor-G automatically monitors for and retries a number of possibly transient errors. It recovers from a down master site, down worker sites, and a down network, and after a network outage it can reconnect to still-running jobs.

Recovery: DAGMan. If a particular task fails permanently, DAGMan notes it and allows easy retry. It can retry automatically, though we don't use that.
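Dependencies and retries are declared in DAGMan's input file. A minimal sketch (node and submit-file names are hypothetical; as the slide notes, the testbed retried failed nodes manually rather than enabling automatic retries):

```
# Hypothetical DAGMan input file: two dependent tasks.
JOB  simulate    simulate.sub
JOB  reconstruct reconstruct.sub
PARENT simulate CHILD reconstruct

# DAGMan could retry a failed node automatically, e.g.:
# RETRY simulate 3
```

If a node fails and is not retried, DAGMan writes a rescue DAG so the run can be resumed later without redoing completed work.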

Globus: Globus software is under rapid development. Use old software and you miss important updates; use new software and you deal with version incompatibilities.

Fall of 2002, First Test: Our first run gave us two weeks to do about 10 days of work (given the CPUs available at the time). We had problems: a power outage (several hours), network outages (up to eleven hours), worker site failures, full disks, and Globus failures.

It Worked! The system recovered automatically from many problems, with relatively low human intervention: approximately one full-time person.

Since Then: We have improved automatic recovery for more situations and generated 1.5 million events (about 30 CPU years) in just a few months. We are currently gearing up for even larger runs starting this summer.

Future Work: Expand the grid with more machines. Use Condor-G's scheduling capabilities to automatically assign jobs to sites. Officially replace the previous system this summer.

Thank You! http://www.cs.wisc.edu/condor adesmet@cs.wisc.edu