Henry Nebrensky - MICE VC123 - 7 May 2009: MICE Data and the Grid

Presentation transcript:

MICE Data and the Grid
• Storage, archiving and dissemination of experimental data:
  - Not been a high priority so far
  - Overall strategy not documented anywhere obvious
  - Individual work on parts of this – but do the pieces fit together?
• Grid:
  - Certain Grid services are separately funded to provide a production service to MICE
  - Provides a ready-made set of building blocks – but “we” have to put them together
  - MICE need to know what they want, to make sure that the finished edifice meets all their needs (and that the Grid includes all the necessary bricks)

Decision Time
• We need to start putting the pieces together very soon.
• Once data starts going on tape it will not be possible to change how and where it is stored
  - need an agreed plan in the near future (i.e. by end of CM24)
• There are a number of unresolved issues – see Note 252 and the data flow diagram.
  - Data volumes, lifetime and access control mostly unclear
  - (LFC) File naming scheme – see MICE Note 247
  - File metadata requirements – raised at CM23

Grid Middleware
• We are currently using EGEE/WLCG middleware and resources, as they are receiving significant development effort and are a reasonable match for our needs (shared with various minor experiments such as LHC)
• Outside Europe other software may be expected – e.g. the OSG stack in the US. Interoperability is, from our perspective, yet another “known unknown”...

MICE and Grid Data Storage
• The Grid provides MICE not only with computing (number-crunching) power, but also with a secure global framework giving users access to data
  - Good news: storing development data on the Grid keeps it available to the collaboration – not stuck on an old PC in the corner of the lab
  - Bad news: loss of ownership – who picks up the data curation responsibilities?
• Data can be downloaded from the Grid to a user’s “own” PC – it doesn’t need to be analysed remotely

Grid File Management (1)
• Each file is given a unique, machine-generated GUID when stored on the Grid
• The file is physically uploaded to one (or more) SEs (Storage Elements) where it is given a machine-generated SURL (Storage URL)
• Machine-generated names are not (meant to be) human-usable
• A “replica catalogue” tracks the multiple SURLs of a GUID
• For sanity's sake we would like to associate sensible filenames with each file (LFN, Logical File Name)
• A “file catalogue” is a database that translates between something that looks like a Unix filesystem and the GUIDs and SURLs needed to actually access the data on the Grid
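The relationship between LFN, GUID and SURL described on this slide can be sketched as a toy in-memory catalogue. This is only an illustration of the mapping, not how LFC is implemented; the class name `ToyFileCatalogue`, the example LFN and the `srm://` storage URLs are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ToyFileCatalogue:
    """In-memory stand-in for a combined file and replica catalogue."""
    lfn_to_guid: dict = field(default_factory=dict)  # LFN -> GUID
    replicas: dict = field(default_factory=dict)     # GUID -> list of SURLs

    def register(self, lfn: str, surl: str) -> str:
        """Register a new file: mint a GUID and record its first replica."""
        if lfn in self.lfn_to_guid:
            raise ValueError(f"LFN already registered: {lfn}")
        guid = str(uuid.uuid4())  # machine-generated, not meant for humans
        self.lfn_to_guid[lfn] = guid
        self.replicas[guid] = [surl]
        return guid

    def add_replica(self, lfn: str, surl: str) -> None:
        """Record an additional physical copy of an existing file."""
        self.replicas[self.lfn_to_guid[lfn]].append(surl)

    def list_replicas(self, lfn: str) -> list:
        """Translate a human-readable LFN into the SURLs holding the data."""
        return list(self.replicas[self.lfn_to_guid[lfn]])


catalogue = ToyFileCatalogue()
lfn = "/grid/mice/users/example/run0001.dat"  # hypothetical LFN
catalogue.register(lfn, "srm://se1.example.org/mice/f1a2")     # hypothetical SURL
catalogue.add_replica(lfn, "srm://se2.example.org/mice/9c7e")  # hypothetical SURL
print(catalogue.list_replicas(lfn))
```

The user only ever handles the LFN; the GUID and SURLs stay behind the scenes, which is the point of having a file catalogue.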

Grid File Management (2)
• MICE has an instance of LFC (LCG File Catalogue) run by the Tier 1 at RAL
• The LFC service can do both the replica and LFN cataloguing
• LFC presents the user with what looks like a normal Unix filespace - the Grid client software keeps track of the data behind the scenes.
[Figure: LFC, from MICE Note 247]

MICE Data Flow
• The basic data flow in MICE is thus something like:
  - The raw data files from the experiment are sent to tape using Grid protocols, including registering the files in LFC.
  - The offline reconstruction can then use Grid/LFC to pull down the raw data, and upload reconstructed (“RECO” or DST) files.
  - Users can use Grid/LFC to access RECO files they want to play with.
• If I combine the above description with some background knowledge of the Grid, some snippets of what people are working on and a whole lot of guesswork I get:

MICE Data Flow Diagram
• Short-dashed lines indicate entities that still need confirmation
• Question marks indicate even higher levels of uncertainty
• More details in MICE Note 252
• The diagram would look pretty much the same if non-Grid tools were used

MICE Data Unknowns
• MICE Note 252 identifies four main flavours of data: RAW, RECO, analysis results, and MonteCarlo simulation.
• For all four, we need to understand the:
  - volume (the total amount of data, the rate at which it will be produced, and the size of the individual files in which it will be stored)
  - lifetime (ephemeral or longer lasting? will it need archiving to tape?)
  - access control (who will create the data? who is allowed to see it? can it be modified or deleted, and if so who has those privileges?)
• Also need to identify use cases I’ve missed, especially ones that will need more VOMS roles or CASTOR space tokens.
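One way to keep track of these unknowns is a small record per data flavour, with every attribute left unset until the collaboration has agreed a value. The sketch below is an assumption about how such a checklist might look; the field names and units are hypothetical and no real MICE figures are filled in. The single value set for RAW reflects the statement elsewhere in this talk that raw data goes to tape.

```python
from dataclasses import dataclass
from typing import Optional

# The four flavours named in MICE Note 252.
FLAVOURS = ("RAW", "RECO", "analysis", "MonteCarlo")


@dataclass
class FlavourSpec:
    """Checklist of attributes to be agreed for one data flavour."""
    flavour: str
    # Volume
    total_volume_tb: Optional[float] = None
    rate_mb_per_s: Optional[float] = None
    typical_file_size_mb: Optional[float] = None
    # Lifetime
    archive_to_tape: Optional[bool] = None
    # Access control
    writers: Optional[str] = None
    readers: Optional[str] = None
    deletable_by: Optional[str] = None

    def open_questions(self) -> list:
        """Names of the attributes that have not been decided yet."""
        return [name for name, value in vars(self).items() if value is None]


specs = {name: FlavourSpec(name) for name in FLAVOURS}
specs["RAW"].archive_to_tape = True  # raw data is sent to tape

for name, spec in specs.items():
    print(name, "still undecided:", spec.open_questions())
```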

File Catalogue Namespace (1)
• Also, we need to agree on a consistent namespace for the file catalogue
• Proposal (MICE Note 247, Grid talk at CM23):
• We get given /grid/mice/ by the server
  - Five upper-level directories:
    Construction/   historical data from detector development and QA
    Calibration/    needed during analysis (large datasets, c.f. DB)
    TestBeam/       test beam data
    MICE/           DAQ output and corresponding MC simulation

File Catalogue Namespace (2)
• /grid/mice/users/name
  For people to use as scratch space for their own purposes, e.g. analysis
  - Encourage people to do this through LFC – helps avoid “dark data”
  - LFC allows Unix-style access permissions
• Again, the LFC namespace is something that needs to be finalised before production data can start to be registered.
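Once a namespace is agreed, it is straightforward to enforce in software before anything is registered. The sketch below builds and checks LFNs against the upper-level directories proposed on these two slides; the helper names `make_lfn` and `is_valid_lfn`, the user name and the file names are hypothetical, and the layout below each top-level directory is deliberately left open because it has not been defined here.

```python
import posixpath

ROOT = "/grid/mice"
# Upper-level directories from the proposal in MICE Note 247.
TOP_LEVEL = ("Construction", "Calibration", "TestBeam", "MICE", "users")


def make_lfn(top: str, *parts: str) -> str:
    """Build an LFN under one of the agreed upper-level directories."""
    if top not in TOP_LEVEL:
        raise ValueError(f"unknown upper-level directory: {top!r}")
    if top == "users" and not parts:
        raise ValueError("entries under users/ need a user name")
    for part in parts:
        # Reject empty components and anything that could escape the tree.
        if not part or "/" in part or part in (".", ".."):
            raise ValueError(f"bad path component: {part!r}")
    return posixpath.join(ROOT, top, *parts)


def is_valid_lfn(lfn: str) -> bool:
    """Check that an existing LFN sits under an agreed directory."""
    prefix = ROOT + "/"
    if not lfn.startswith(prefix):
        return False
    top = lfn[len(prefix):].split("/", 1)[0]
    return top in TOP_LEVEL


print(make_lfn("users", "someuser", "scratch.root"))  # hypothetical names
print(is_valid_lfn("/grid/mice/Scratch/file.dat"))
```

Rejecting unknown directories at registration time is one way to stop the catalogue drifting away from the agreed scheme.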

Metadata Catalogue
• For many applications – such as analysis – you will want to identify the list of files containing the data that matches some parameters
• This is done by a “metadata catalogue”. For MICE this doesn't yet exist
• A metadata catalogue can in principle return either the GUID or an LFN – it shouldn’t matter which as long as it’s properly integrated with the other Grid services.
• (Grid talk at CM23)

MICE Metadata Catalogue
• We need to select a technology to use for this
  - use the configuration database?
  - gLite AMGA (who else uses it – will it remain supported?)
• Need to implement – i.e. register metadata to files
• What metadata will be needed for analysis?
• Should the catalogue include the file format and compression scheme (gzip ≠ PKzip)?
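The gzip versus PKzip point can be made concrete: the two formats are not interchangeable, but they can be told apart from the first few bytes of a file, so the scheme could be recorded automatically when a file is registered. A minimal sketch, assuming only these two formats matter; the function name `sniff_compression` and the example payload are hypothetical.

```python
import gzip
import io
import zipfile


def sniff_compression(data: bytes) -> str:
    """Guess the compression container from the leading magic bytes."""
    if data[:2] == b"\x1f\x8b":      # gzip member header
        return "gzip"
    if data[:4] == b"PK\x03\x04":    # ZIP local file header
        return "zip"
    return "unknown"


payload = b"example event data"

# The same payload, packed both ways.
gz_bytes = gzip.compress(payload)

buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("events.dat", payload)
zip_bytes = buffer.getvalue()

for label, blob in (("gz", gz_bytes), ("zip", zip_bytes), ("raw", payload)):
    print(label, "->", sniff_compression(blob))
```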

MICE Metadata Catalogue for Humans
or, in non-Gridspeak:
• we have several databases (configuration DB, EPICS, e-Logbook) where we should be able to find all sorts of information about a run/timestamp.
• but how do we know which runs to be interested in, for our analysis?
• we need an “index” to the MICE data, and for this we need to define the set of “index terms” that will be used to search for relevant datasets.
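The idea of an index searched by index terms can be sketched as a lookup from a set of search criteria to a list of LFNs. This is a toy illustration, not a proposal for the technology; the class name `ToyMetadataIndex`, the index terms `data_type` and `beam_momentum_mev`, and the file names are hypothetical placeholders for whatever terms are eventually agreed.

```python
class ToyMetadataIndex:
    """In-memory index from files to their index terms."""

    def __init__(self):
        self._metadata = {}  # LFN -> {index term: value}

    def register(self, lfn: str, **terms) -> None:
        """Attach index terms to a file already known to the file catalogue."""
        self._metadata[lfn] = dict(terms)

    def find(self, **criteria) -> list:
        """Return the LFNs whose index terms match every criterion."""
        return sorted(
            lfn
            for lfn, terms in self._metadata.items()
            if all(terms.get(key) == value for key, value in criteria.items())
        )


index = ToyMetadataIndex()
# Hypothetical files and index terms, for illustration only.
index.register("/grid/mice/MICE/run0001.dat", data_type="RAW", beam_momentum_mev=200)
index.register("/grid/mice/MICE/run0002.dat", data_type="RAW", beam_momentum_mev=240)
index.register("/grid/mice/MICE/run0001.reco", data_type="RECO", beam_momentum_mev=200)

print(index.find(beam_momentum_mev=200))
print(index.find(data_type="RAW", beam_momentum_mev=240))
```

The query returns LFNs here, but as noted earlier in the talk it could equally return GUIDs provided the catalogue is integrated with the other Grid services.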

Conclusions
• The data flow is more complex than people realise…
• … and probably won’t work by accident
• Some specific issues that need to be understood are the attributes of the data flows (Note 252), the LFC Namespace (Note 247) and the index terms for the metadata catalogue.
• This needs discussion and (where necessary) decision pretty soon – by or at CM24 – to be ready for data taking.