
Virtual Data in CMS Production
A. Arbree, P. Avery, D. Bourilkov, R. Cavanaugh, S. Katageri, G. Graham, J. Rodriguez, J. Voeckler, M. Wilde
CMS & GriPhyN
Conference in High Energy Physics (CHEP), 2003, UC San Diego

Virtual Data Motivations in Production
- Data track-ability and result audit-ability
  - Universally sought by scientists
- Facilitates tool and data sharing and collaboration
  - Data can be sent along with its recipe
  - The recipe is useful in searching for data
- Workflow management
  - A new, structured paradigm for organizing, locating, and specifying data products
- Performance optimizations
  - Ability to delay execution planning until as late as possible

Initial CMS Production Tests Using the Chimera Virtual Data System
- Motivation
  - Simplify CMS production in a Grid environment
  - Evaluate the current state of Virtual Data technology
  - Understand issues related to the provenance of CMS data
- Use case
  - Implement a simple 5-stage CMS production pipeline on the US CMS Test Grid
- Solution
  - Wrote an interface between Chimera and the CMS production software
  - Wrote a simple grid scheduler
  - Ran sample simulations to evaluate the system

What is a DAG?
- A DAG (Directed Acyclic Graph) is the data structure used to represent job dependencies.
- Each job is a "node" in the DAG.
- Each node can have any number of "parent" or "child" nodes, as long as there are no loops!
- We usually talk about workflow in units of "DAGs" (a minimal DAGMan example is sketched below).
[Diagram: an example DAG with Job A as parent of Job B and Job C, and Job D as their child; picture taken from Peter Couvares]
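As an illustration (not from the original slides), a diamond-shaped DAG like the one above can be written as a Condor DAGMan input file; the four submit-file names here are hypothetical:

    JOB A jobA.sub
    JOB B jobB.sub
    JOB C jobC.sub
    JOB D jobD.sub
    PARENT A CHILD B C
    PARENT B C CHILD D

DAGMan releases a node to Condor only after all of its parent nodes have completed successfully, which is exactly the dependency semantics described above.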

Example CMS Data/Workflow
[Diagram: the CMS data/workflow chain - Generator, Simulator, Formator, ODBMS, Digitiser (with Calibration DB), writeESD/writeAOD/writeTAG, Analysis Scripts]

Data/workflow is a collaborative endeavour!
[Diagram: the same CMS data/workflow chain, annotated with the groups responsible for its different parts - Online Teams, (Re)processing Team, MC Production Team, Physics Groups]

A Simple CMS Production 5-Stage Workflow Use-case
- CMKIN: events are generated (PYTHIA), producing an .ntpl file.
- CMSIM: the detector's response is simulated for each event (GEANT3), producing an .fz file.
- OOHITS: events are reformatted and written into a database (the event database).
- OODIGI: the original events are digitised and reconstructed.
- NTUPLE: reconstructed data is reduced and written to a flat .ntpl file.

2-Stage DAG Representation of the 5-Stage Use-case
- The Fortran job wraps the CMKIN and CMSIM stages.
- The DB job wraps the OOHITS, OODIGI, and NTUPLE stages.
- This structure was used to enforce policy constraints on the workflow (i.e. an Objectivity/DB license is required for the DB stages).
- Initially a simple script was used to generate Virtual Data Language (VDL); McRunJob is now used to generate the workflow in VDL (see the talk by G. Graham). A rough VDL sketch is shown below.
- Responsibility of a workflow generator: create the abstract plan.
[Diagram: CMKIN -> CMSIM -> OOHITS -> OODIGI -> NTUPLE, with intermediate .ntpl, .fz, and event-database files, grouped into a Fortran node and a DB node]
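For flavour only, here is a rough sketch of how the Fortran stage might be described in Chimera's textual Virtual Data Language (VDLt). The exact syntax and argument conventions varied between Chimera releases, and the transformation name, parameters, and file names below are purely illustrative:

    TR fortran_stage( output ntpl, output fz, none run ) {
        argument = "-run "${run};
        argument = " -ntpl "${ntpl};
        argument = " -fz "${fz};
    }

    DV run1->fortran_stage(
        ntpl=@{output:"run1.ntpl"},
        fz=@{output:"run1.fz"},
        run="1"
    );

A TR (transformation) declares the abstract recipe; a DV (derivation) binds it to concrete logical file names, from which Chimera can later build the abstract DAG.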

Mapping Abstract Workflows onto Concrete Environments
- Abstract DAGs (virtual workflow)
  - Resource locations unspecified
  - File names are logical
  - Data destinations unspecified
  - "build" style
- Concrete DAGs (stuff for submission)
  - Resource locations determined
  - Physical file names specified
  - Data delivered to and returned from physical locations
  - "make" style
- In general there is a range of planning steps between abstract workflows and concrete workflows.
[Diagram: VDL -> abstract plan (DAX, in XML) -> concrete plan -> DAGMan DAG, consulting the Virtual Data Catalog (VDC) and Replica Catalog (RC); logical names on the abstract side, physical names on the concrete side]

Concrete DAG Representation of the CMS Pipeline Use-case
Responsibility of the concrete planner:
- Binds job nodes to physical grid sites.
- Queries the Replica and Transformation Catalogs for existence and location.
- Dresses job nodes with stage-in/stage-out nodes (a sketch of such a dressed node follows below).
[Diagram: for each of the Fortran and DB jobs - Stage File In -> Execute Job -> Stage File Out -> Register File]
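As an illustrative sketch (not taken from the slides), the dressed Fortran node could expand into a DAGMan fragment of the following shape, with hypothetical submit-file names:

    # stage-in -> execute -> stage-out -> register (all names hypothetical)
    JOB StageInFortran   stagein_fortran.sub
    JOB RunFortran       run_fortran.sub
    JOB StageOutFortran  stageout_fortran.sub
    JOB RegisterFortran  register_fortran.sub
    PARENT StageInFortran CHILD RunFortran
    PARENT RunFortran CHILD StageOutFortran
    PARENT StageOutFortran CHILD RegisterFortran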

Default middleware configuration from the Virtual Data Toolkit
[Diagram: the submit host runs Chimera, DAGMan, Condor-G, and gahp_server; the remote host runs the Globus gatekeeper and a local scheduler (Condor, PBS, etc.) in front of the compute machines]
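For context, a minimal Condor-G submit description of that era pointed jobs at a remote gatekeeper roughly as follows. This is a hedged sketch: the hostname, jobmanager name, and executable are hypothetical:

    universe            = globus
    globusscheduler     = gatekeeper.site.example.edu/jobmanager-condor
    executable          = fortran_stage_wrapper.sh
    transfer_executable = true
    output              = fortran_stage.out
    error               = fortran_stage.err
    log                 = fortran_stage.log
    queue

Condor-G forwards such jobs through the gahp_server to the Globus gatekeeper, which hands them to the site's local scheduler.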

Modified middleware configuration (to enable massive CMS production workflows)
[Diagram: the same submit-host/remote-host setup, extended with McRunJob (a generic workflow generator) fed by RefDB, and a WorkRunner component driving submission through Chimera, DAGMan, and Condor-G]

Modified middleware configuration: RefDB, the CMS metadata catalog
- contains parameter/cards files
- contains production requests
- contains production status
- etc.
See Véronique Lefébure's talk on RefDB.
[Diagram: the same middleware picture, with RefDB feeding McRunJob]

Modified middleware configuration: McRunJob, the CMS workflow generator
- Constructs the production workflow from a request in the RefDB
- Writes the workflow description in VDL (via ScriptGen)
See Greg Graham's talk on MCRunJob.
[Diagram: inside McRunJob - a RefDB module, a Linker, a VDL Generator, and a VDL Config - connected to the rest of the middleware]

Modified middleware configuration: WorkRunner, the grid scheduler
- A very simple placeholder (due to the lack of an interface to a resource broker)
- Submits Chimera workflows based on simple job-monitoring information from Condor-G
[Diagram: inside WorkRunner - a Condor-G monitor, a Chimera interface, and a job-tracking module handling workflows]

Modified middleware configuration (complete picture)
[Diagram: RefDB and McRunJob feeding Chimera on the submit host, with WorkRunner, DAGMan, and Condor-G driving jobs through the gatekeeper and local scheduler to the compute machines]

Initial Results
- Production test
  - Results
    - 678 DAGs (250 events each)
    - 167,500 test events computed (not delivered to CMS)
    - 350 CPU-days on 25 dual-processor Pentium (1 GHz) machines over 2 weeks of clock time
    - 200 GB of simulated data
  - Problems
    - 8 failed DAGs
    - Cause: pre-emption by another user

Initial Results (continued)
- Scheduling test
  - Results
    - 5954 DAGs (1 event each, not used by CMS)
    - 300 CPU-days on 145 CPUs at 6 sites:
      - University of Florida: USCMS Cluster (8), HCS Cluster (64), GriPhyN Cluster (28)
      - University of Wisconsin-Milwaukee: CS Dept. Cluster (30)
      - University of Chicago: CS Dept. Cluster (5)
      - Argonne National Lab: DataGrid Cluster (10)
  - Problems
    - 395 failed DAGs
    - Causes:
      - Failure to post final data from the UF GriPhyN Cluster
      - Globus bug: 1 DAG in 50 fails when communication is lost
    - Primarily limited by the performance of lower-level grid middleware

The Value of Virtual Data
- Provides full reproducibility (fault tolerance) of one's results:
  - tracks ALL dependencies between transformations and their derived data products
  - something like a "Virtual Logbook"
  - records the provenance of data products
- Provides transparency with respect to location and existence; the user need not know:
  - the data location
  - how many data files are in a data set
  - whether the requested derived data already exists
- Allows for optimal performance in planning; should the derived data be:
  - staged in from a remote site?
    - send the job to the data, or
    - send the data to the job?
  - re-created locally on demand?

Summary: Grid Production of CMS Simulated Data
- CMS production of simulated data (to date)
  - O(10) sites
  - O(1000) CPUs
  - O(100) TB of data
  - O(10) production managers
- The goal is to double every year, without increasing the number of production managers!
  - More automation will be needed for the upcoming Data Challenges!
- Virtual Data provides
  - parts of the necessary abstraction required for automation and fault tolerance
  - mechanisms for data provenance (important for search engines)
- Virtual Data technology is "real" and maturing, but still in its childhood
  - much functionality currently exists
  - it still requires placeholder components for intelligent planning and optimisation