Virtual Data in CMS Analysis
A. Arbree, P. Avery, D. Bourilkov, R. Cavanaugh, G. Graham, J. Rodriguez, M. Wilde, Y. Zhao
CMS & GriPhyN
CHEP03, La Jolla, California, March 25, 2003

Slide 2: Virtual Data
"We already do this, but manually!"
- Webster dictionary: vir·tu·al, adjective. Etymology: Middle English, "possessed of certain physical virtues", from Medieval Latin virtualis, from Latin virtus (strength, virtue)
- Most scientific data are not simple "measurements"; they are produced by increasingly complex computations (e.g. reconstructions, calibrations, selections, simulations, fits, etc.)
- HEP (and other sciences) is increasingly CPU- and data-intensive
- Programs are significant community resources (transformations)
- So are the executions of those programs (derivations)
- Management of dataset transformations is important!
- Derivation: instantiation of a potential data product
- Provenance: exact history of any existing data product

Slide 3: Virtual Data Motivations
[Diagram: Transformation, Derivation and Data, linked by product-of, execution-of and consumed-by/generated-by relations]
- "I've detected a muon calibration error and want to know which derived data products need to be recomputed."
- "I've found some interesting data, but I need to know exactly what corrections were applied before I can trust it."
- "I want to search a database for 3 rare electron events. If a program that does this analysis exists, I won't have to write one from scratch."
- "I want to apply a forward jet analysis to 100M events. If the results already exist, I'll save weeks of computation."

Slide 4: Virtual Data Motivations
- Data track-ability and result audit-ability: a "Virtual Logbook"
- In the nature of science: reproducibility of results
- Tool and data sharing and collaboration (data with a "recipe")
- Individuals discover other scientists' work and build from it
- Different teams can work in a modular, semi-autonomous fashion: reuse previous data/code/results or entire analysis chains
- Repair and correction of data, cf. "make"
- Workflow management, performance optimization: stage data in from a remote site OR re-create it locally on demand?
- Transparency with respect to location and existence

Slide 5: Introducing CHIMERA: The GriPhyN Virtual Data System
- Virtual Data Language
  - textual (concise, for human consumption)
  - XML (uses an XML schema, for component integration)
- Virtual Data Interpreter
  - implemented in Java
  - Java API and command-line toolkit
- Virtual Data Catalog
  - tracks data provenance (acts like a metadata repository)
  - different back-ends for persistency: PostgreSQL and MySQL databases, or file based (for easy testing)

Slide 6: Virtual Data in CHIMERA
A "function call" paradigm
- Virtual data: data objects with a well-defined method of (re)production
- Transformation: [namespace]::identifier:[version]
  - Abstract description of how a script/executable is invoked
  - Similar to a "function declaration" in C/C++
- Derivation: [namespace]::identifier:[version range]
  - Invocation of a transformation with specific arguments
  - Similar to a "function call" in C/C++
  - Can be either past or future:
    - a record of how logical files were produced
    - a recipe for creating logical files at some point in the future
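
To make the analogy concrete, here is a minimal Python sketch (hypothetical names and structures, not the Chimera API) that models a transformation as a declared, parameterized recipe and a derivation as a recorded invocation of it:

  # Minimal sketch of the transformation/derivation idea (hypothetical names,
  # not the Chimera API): a Transformation is like a function declaration,
  # a Derivation is like a recorded function call.
  from dataclasses import dataclass, field
  from typing import Dict, List


  @dataclass
  class Transformation:
      """Abstract description of how an executable is invoked."""
      namespace: str
      name: str
      version: str
      formal_args: List[str]          # e.g. ["a1", "a2", "param"]


  @dataclass
  class Derivation:
      """Invocation of a transformation with specific arguments.

      It can describe the past (how files were produced) or the
      future (a recipe for producing them on demand).
      """
      transformation: Transformation
      actual_args: Dict[str, str]     # formal name -> logical file / value
      produced: List[str] = field(default_factory=list)

      def provenance(self) -> str:
          t = self.transformation
          args = ", ".join(f"{k}={v}" for k, v in self.actual_args.items())
          return f"{t.namespace}::{t.name}:{t.version}({args})"


  # "Declaration" ...
  pythia = Transformation("cms", "pythia", "1", ["a1", "a2", "param"])
  # ... and a "call" bound to concrete logical files.
  x1 = Derivation(pythia, {"a1": "file1", "a2": "file2", "param": "160.0"},
                  produced=["file2"])
  print(x1.provenance())   # cms::pythia:1(a1=file1, a2=file2, param=160.0)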

Slide 7: Virtual Data Language

Build-style recipe (transformations):

  TR pythia( out a2, in a1, none param="160.0" ) {
    argument arg  = ${param};
    argument file = ${a1};
    argument file = ${a2};
  }
  TR cmsim( out a2, in a1[] ) {
    argument files = ${a1};
    argument file  = ${a2};
  }

Make-style recipe (derivations):

  DV x1->pythia( a2=@{out:file2}, a1=@{in:file1} );
  DV x2->cmsim( a2=@{out:file3}, a1=[ @{in:file2}, @{in:cardfile} ] );

[Diagram: file1 → x1 (pythia) → file2; file2 + cardfile → x2 (cmsim) → file3]

Slide 8: Abstract and Concrete DAGs
- Abstract DAGs / DAXs (virtual data DAGs): abstract directed acyclic graphs with logical names for files and executables (the complete build-style recipe as a DAX)
  - Resource locations unspecified
  - File names are logical
  - Data destinations unspecified
- Concrete DAGs (input for DAGMan): Condor-style DAGs for grid execution (check the replica catalog, skip steps, make-style)
  - Resource locations determined
  - Physical file names specified
  - Data delivered to and returned from physical locations
[Diagram: VDL → VDC → abstract planner → DAX (XML, logical names) → concrete planner + RC → DAG (physical names) → DAGMan]
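
As an illustration of the planning step described above, the following toy Python sketch (not the actual Chimera/Pegasus planner; the catalog contents and URL prefixes are assumptions borrowed from slide 9) resolves logical file names through a replica catalog and skips steps whose outputs already exist, make-style:

  # Hypothetical sketch of abstract->concrete planning (not the actual
  # Chimera/Pegasus planner): resolve logical file names through a replica
  # catalog and skip steps whose outputs already exist (make-style).
  from typing import Dict, List, Tuple

  # abstract DAG: step -> (inputs, outputs), all names logical
  abstract_dag: Dict[str, Tuple[List[str], List[str]]] = {
      "x1_pythia": (["file1"], ["file2"]),
      "x2_cmsim":  (["file2", "cardfile"], ["file3"]),
  }

  # replica catalog: logical name -> physical URL (toy content)
  replica_catalog = {
      "file1":    "gsiftp://testulix/mydir/file1",
      "cardfile": "gsiftp://testulix/mydir/cardfile",
      "file2":    "gsiftp://testulix/mydir/file2",   # already materialized
  }


  def concrete_plan(dag, rc):
      """Return the steps that still need to run, with physical names."""
      plan = []
      for step, (ins, outs) in dag.items():
          if all(o in rc for o in outs):
              continue                     # outputs exist: skip (make-style)
          physical_in = [rc.get(f, f"gsiftp://testulix/mydir/{f}") for f in ins]
          physical_out = [f"gsiftp://testulix/mydir/{f}" for f in outs]
          plan.append((step, physical_in, physical_out))
      return plan


  for step, pin, pout in concrete_plan(abstract_dag, replica_catalog):
      print(step, pin, "->", pout)         # only x2_cmsim remains to run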

Slide 9: Nitty-Gritty

Transformation catalog (expects pre-built executables):

  #poolname  ltransformation  physical transformation       environment
  local  hw               /bin/echo                     null
  local  pythcvs          /workdir/lhc-h-6-cvs          null
  local  pythlin          /workdir/lhc-h-6-link         null
  local  pythgen          /workdir/lhc-h-6-run          null
  local  pythtree         /workdir/h2root.sh            null
  local  pythview         /workdir/root.sh              null
  local  GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3;VDS_HOME=/vdshome
  local  globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
  ufl    hw               /bin/echo                     null
  ufl    GriphynRC        /vdshome/bin/replica-catalog  JAVA_HOME=/vdt/jdk1.3.1_04;VDS_HOME=/vdshome
  ufl    globus-url-copy  /vdt/bin/globus-url-copy      GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib

Pool configuration:

  #pool  universe  job-manager-string              url-prefix               workdir ...
  ufl    vanilla   testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir  /mydir
  ufl    standard  testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir  /mydir
  ufl    globus    testulix/jm-condor-INTEL-LINUX  gsiftp://testulix/mydir  /mydir
  ufl    transfer  testulix/jobmanager             gsiftp://testulix/mydir  /mydir
  local  vanilla   localhost/jm-condor             gsiftp://localhost/mydir /mydir
  local  globus    localhost/jm-condor             gsiftp://localhost/mydir /mydir
  local  transfer  localhost/jobmanager            gsiftp://localhost/mydir /mydir
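
The transformation catalog above is a simple whitespace-separated text format; a hypothetical Python parser (not the VDS implementation) for it might look like this:

  # Hypothetical parser (not the VDS code) for the whitespace-separated
  # transformation catalog shown above:
  #   poolname  ltransformation  physical-transformation  environment
  from typing import Dict, Tuple

  sample_tc = """\
  #poolname ltransformation physical-transformation environment
  local pythgen /workdir/lhc-h-6-run null
  local globus-url-copy /vdt/bin/globus-url-copy GLOBUS_LOCATION=/vdt;LD_LIBRARY_PATH=/vdt/lib
  ufl   hw /bin/echo null
  """


  def parse_tc(text: str) -> Dict[Tuple[str, str], Tuple[str, Dict[str, str]]]:
      """Map (pool, logical transformation) -> (executable, environment)."""
      catalog = {}
      for line in text.splitlines():
          line = line.strip()
          if not line or line.startswith("#"):
              continue
          pool, logical, physical, env = line.split(None, 3)
          env_vars = {} if env == "null" else dict(
              kv.split("=", 1) for kv in env.split(";"))
          catalog[(pool, logical)] = (physical, env_vars)
      return catalog


  tc = parse_tc(sample_tc)
  print(tc[("local", "pythgen")])      # ('/workdir/lhc-h-6-run', {})
  print(tc[("local", "globus-url-copy")][1]["GLOBUS_LOCATION"])   # '/vdt'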

Slide 10: Data Analysis in HEP
- Decentralized, "chaotic"
- Need a system flexible enough to accommodate a large user base and use cases that we can't foresee
- Ability to build scripts/executables "on the fly", including user-supplied code/parameters (possibly linking with libraries preinstalled on the execution sites)
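
What "building executables on the fly" might look like as a wrapper step is sketched below in Python; the CVS tag, Makefile variable and script names are hypothetical, not those of the actual CMS prototype:

  # Hedged sketch of an "executable built on the fly" step (hypothetical
  # paths and names, not the CMS prototype itself): check out the tagged
  # user code, compile and link it against libraries preinstalled on the
  # execution site, then run it with the supplied datacards.
  import subprocess

  CVS_TAG = "lhc-h-6"                 # hypothetical analysis code tag
  WORKDIR = "/workdir"                # as in the transformation catalog above

  steps = [
      ["cvs", "checkout", "-r", CVS_TAG, "analysis"],
      ["make", "-C", "analysis",      # link against site-installed libraries
       "CERNLIB=/cern/pro/lib"],
      ["analysis/run_analysis", "--datacards", "higgs_160.cards"],
  ]

  for cmd in steps:
      print("running:", " ".join(cmd))
      subprocess.run(cmd, cwd=WORKDIR, check=True)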

Slide 11: Prototypes
First prototype for SC2002, second for CHEP03.
[Diagram of the prototype build/run chain, with nodes: CVS (tag, version N), FORTRAN code, C++ code, datacards, libraries → compile, link → executable, PYTHIA wrapper, ROOT wrapper, h2root → ntuples, ROOT trees, plots, event displays]

Slide 12: Prototypes
CHIMERA/ROOT prototype for generating events with PYTHIA/CMKIN, histogramming and visualization.
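
The final plotting/visualization step of such a prototype could be as simple as the following PyROOT sketch (the file, tree and branch names are assumptions, not those of the prototype):

  # Minimal PyROOT sketch of the plotting step (file, tree and branch names
  # are hypothetical; requires a ROOT installation with Python bindings).
  import ROOT

  f = ROOT.TFile.Open("pythia_h160_ww.root")   # e.g. output of h2root
  tree = f.Get("h10")                          # ntuple id is an assumption

  canvas = ROOT.TCanvas("c1", "Higgs -> WW -> e mu", 800, 600)
  tree.Draw("ptlep", "ptlep > 20")             # lepton pT with a simple cut
  canvas.SaveAs("ptlep.png")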

Slide 13: A virtual space of simulated data is created for future use by scientists...
[Diagram: a tree of virtual data products, with nodes such as
  mass = 160, event = 8
  mass = 160, plot = 1
  mass = 160, decay = bb
  mass = 160, decay = ZZ
  mass = 160, decay = WW
  mass = 160, decay = WW, event = 8
  mass = 160, decay = WW, plot = 1
  mass = 160, decay = WW, WW → leptons
  mass = 160, decay = WW, WW → eμ
  mass = 160, decay = WW, WW → eμ, event = 8
  mass = 160, decay = WW, WW → eμ, plot = 1 ]

Slide 14: Search for WW decays of the Higgs Boson where the Ws decay to electron and muon: mass = 160; decay = WW; WW → eμ
[Same virtual data tree as slide 13]

Slide 15: The scientist obtains an interesting result and wants to track how it was derived.
[Same virtual data tree as slide 13]

Slide 16: Now the scientist wants to dig deeper...
[Same virtual data tree as slide 13]

Slide 17: ...The scientist adds a new derived data branch... and continues to investigate!
[Same virtual data tree as slide 13, extended with a new branch: mass = 160, decay = WW, WW → eμ, Pt > 20]
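
The walkthrough in slides 13-17 treats the simulated samples as a metadata-indexed virtual space; below is a toy Python sketch (hypothetical structures, not the Chimera catalog interface) of selecting products by their parameters and deciding whether to reuse them or materialize them on demand:

  # Toy sketch of querying a "virtual space" of simulated data by metadata
  # (hypothetical structures, not the Chimera catalog interface).
  virtual_space = [
      {"mass": 160, "decay": "WW", "channel": "e mu", "exists": True},
      {"mass": 160, "decay": "WW", "channel": "leptons", "exists": False},
      {"mass": 160, "decay": "ZZ", "exists": False},
      {"mass": 160, "decay": "bb", "exists": False},
  ]


  def select(space, **cuts):
      """Return all virtual data products matching the given parameters."""
      return [p for p in space
              if all(p.get(k) == v for k, v in cuts.items())]


  for product in select(virtual_space, mass=160, decay="WW"):
      if product["exists"]:
          print("reuse existing derivation:", product)
      else:
          print("materialize on demand:    ", product)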

Slide 18: A Collaborative Data-flow Development Environment: Complex Data Flow and Data Provenance in HEP
[Diagram: parallel chains for real and simulated data, Raw → ESD → AOD → TAG → plots, tables, fits, with comparisons between the two chains]
- History of a data analysis (like CVS)
- "Check-point" a data analysis
- Analysis development environment
- Audit a data analysis

Slide 19: Outlook
- Work in progress on both the CHIMERA and CMS sides (a "snapshot")
- A CHIMERA/ROOT prototype for building executables "on the fly", generating events with PYTHIA/CMKIN, plotting and visualization is available (CHIMERA is a great integration tool)
- The full CMS Monte Carlo chain is working under CHIMERA (next talk)
- Possible future directions:
  - Workflow management; automatic generation; inheritance …
  - Store metadata about derivations (like annotations) in a searchable catalog
  - Handle datasets, not just logical file names
  - Integration with CLARENS (remote access) and with ROOT/PROOF (run in parallel)
- A picture is better than 1000 words: Prototype Demo