Summary of Metadata Workshop
Peter Hristov
28 February 2005, ALICE Computing Day

Metadata: Definition
Traditionally, metadata has been understood as "data about data".
Examples:
● A library catalogue contains information (metadata) about publications (data)
● A file system maintains permissions (metadata) about files (data)
More definitions: try Google

General Applications of Metadata (Web)
● Cataloguing (item and collections)
● Resource discovery
● Electronic commerce
● Intelligent software agents
● Digital signatures
● Content rating
● Intellectual property rights
● Privacy preferences & policies

Statement of the Problem
● A user wants to process "some events"
● He/she needs to know where they are:
   – Which file
   – Where in the file
● Sounds simple!

Tools at Hand
● In Grid file catalogues, files may have metadata that identify a file
   – However, at the moment we are not sure what the Grid catalogue will look like
● Given a ROOT file, a TRef allows us to "jump" directly to a given object
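For illustration, a minimal sketch of the TRef mechanism mentioned above (the TNamed payload stands in for a real event object; resolving a reference that was read back from a file additionally relies on ROOT's internal reference tables, which are omitted here):

```cpp
// Minimal TRef sketch: keep a lightweight reference to an object and "jump"
// back to it later via GetObject(). Compile and run with ROOT available.
#include <cstdio>
#include "TNamed.h"
#include "TRef.h"

int main() {
   TNamed *obj = new TNamed("event42", "some event object");
   TRef ref(obj);                                   // a reference, not a copy

   // ... elsewhere, possibly long after the object was created ...
   TNamed *back = static_cast<TNamed*>(ref.GetObject());
   std::printf("resolved: %s\n", back ? back->GetName() : "(not in memory)");
   return 0;
}
```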

General Idea on the Grid
[Diagram: a chain of catalogues, each carrying metadata (MD). A TAG database / event catalogue maps metadata keys to events; a file catalogue maps LFNs to storage elements (SE); a local catalogue maps to PFNs.]

Main Questions
● What happens if we are not on the Grid? The system should not be too different
● What do we put in each catalogue?
● How much do we depend on the file catalogue being exposed to us?
● Which information do we put in the TAG database? What are its dimensions and its efficiency?

Event Model
● RAW data: written once, read (not too) many times. Size: 1–50 MB per event; only one exists per event.
● ESD: written (not too) many times, read many times. Size: ~1/10 of RAW per event; only one exists per event.
● AOD: written many times, read many times. Size: ~1/10 of ESD per event; many (~10) exist per event.
● …
● Tag: written (not too) many times, read many times. Size: 100 B – 1 kB per event; many exist per event.
   – This is done for fast event data selection
   – It is not directly for analysis, histogram production etc.
   – Even so, if (by chance) the information is there, you may use it that way
● For discussion:
   – Global experiment tags
   – Physics working group tags
   – User-defined tags

Terminology
Define common terms first:
● Metadata: key-value pairs
   – Any data necessary to work on the Grid that does not live in the files
● Entry: entity to which metadata is attached
   – Denoted by a string, formatted like a file path in Unix; wild-cards allowed
● Collection: set of entries
   – Collections are themselves entries; think of directories
● Attribute: name or key of a piece of metadata
   – Alphanumerical string starting with a letter
● Value: value of an entry's attribute
   – Printable ASCII string
● Schema: set of attributes of an entry
   – Classifies types of entries; a collection's schema is inherited by its entries
● Storage type: how the back end stores a value
   – The back end may have different (SQL) datatypes than the application
● Transfer type: how values are transferred
   – Values are transported as printable ASCII
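As a purely illustrative aid, the model above can be pictured with plain C++ containers. This is not the ARDA/AMGA interface, and all names below are hypothetical:

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Illustrative mapping of the terminology onto plain C++ types.
using Attribute = std::string;                 // name/key of a piece of metadata
using Value     = std::string;                 // transported as printable ASCII
using Schema    = std::vector<Attribute>;      // set of attributes of an entry

struct Entry {                                 // e.g. "/alice/raw/run1234/file5"
   std::map<Attribute, Value> metadata;        // key-value pairs
};

struct Collection {                            // think of a directory
   Schema schema;                              // inherited by the entries it contains
   std::map<std::string, Entry> entries;       // entry name -> entry
};

int main() {
   Collection run;
   run.schema = {"runNumber", "trigger", "quality"};
   run.entries["file5"].metadata = {{"runNumber", "1234"},
                                    {"trigger", "minbias"},
                                    {"quality", "good"}};
   std::cout << run.entries["file5"].metadata["trigger"] << '\n';
   return 0;
}
```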

Event Metadata
● Tag: event building and physics-related information
● Catalogue information => how to retrieve the event
● What else => links to other sources of information?

Tag Database
Karel Safarik

TAG Structure
● Event building information
   – Allows one to find all the information about the event: the event ESD and all the AODs
   – Maybe also the RAW data (hopefully this will not be used often)
   – … (this is not my job)
● Physics information
   – Query-able (that is what you select data on)
   – Information about trigger, quality etc.
   – Usually some global physics variables
   – But one may also include quantities that make little physical sense but are good for selection

TAG Size
● Has to be reasonable, to be able to query in reasonable time
   – Somewhere around disk size: O(100 GB)
● Typical yearly number of events: 10^7 for heavy ion, 10^9 for pp
● However, the TAG size is (in principle) independent of multiplicity
   – But it is collision-system dependent, trigger dependent, …
● For heavy ion: a few kB per event gives a few 10 GB
● For pp: 100 B per event gives 100 GB
● STAR: 500 physics tag fields in 0.5 kB (on average ~1 B per field)
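Spelling out the arithmetic behind these estimates (all numbers as quoted above):

$$
S_{\mathrm{TAG}} \approx N_{\mathrm{events}} \times s_{\mathrm{tag}}:\qquad
10^{7} \times \mathrm{few\ kB} \approx \mathrm{few} \times 10\ \mathrm{GB}\ \text{(heavy ion)},\qquad
10^{9} \times 100\ \mathrm{B} = 100\ \mathrm{GB}\ \text{(pp)},
$$
$$
\text{STAR:}\quad 0.5\ \mathrm{kB} / 500\ \text{fields} \approx 1\ \mathrm{B\ per\ field}.
$$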

TAG Content (only physics information)
● Technical part – the same for every TAG database
   – Run number, event number, bunch crossing number, time stamp
   – Trigger flags (an event may be triggered by more than one trigger class), information from trigger detectors
   – Quality information: which detectors were actually on, what their configuration was, quality of reconstruction
● Physics part – partly standard, partly trigger/physics/user dependent
   – Charged particle multiplicity
   – Maximum pt
   – Sum of the pt
   – Maximum el-mag energy
   – Sum of el-mag energy
   – Number of kaons
   – …
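A minimal sketch of what one such tag record could look like; the field names and types are hypothetical, not the actual ALICE TAG schema, and in a real implementation the class would get ROOT dictionary support (TObject + ClassDef) so it can be stored in a TTree:

```cpp
#include <cstdint>

// Hypothetical TAG record, field names for illustration only.
struct EventTag {
   // Technical part: the same for every TAG database
   unsigned int  runNumber;
   unsigned int  eventNumber;
   unsigned int  bunchCrossing;
   unsigned int  timeStamp;
   std::uint64_t triggerMask;      // an event may be triggered by several trigger classes
   std::uint64_t detectorMask;     // which detectors were actually on
   float         recoQuality;      // quality of reconstruction

   // Physics part: partly standard, partly trigger/physics/user dependent
   int   chargedMultiplicity;
   float maxPt;
   float sumPt;
   float maxEMEnergy;
   float sumEMEnergy;
   int   nKaons;
};
```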

TAG Construction
● Basic (experiment-wide) TAG database
   – Written during reconstruction (ESD production)
   – But it also has to navigate to (all?) AODs, which are produced later(?)
   – There is a part which is untouchable (nobody is allowed to modify it)
   – There is a part which may be modified as a result of further analysis
   – All other TAG databases start from this one
● The real content of a given instance of the TAG database is
   – Trigger dependent
   – Detector configuration dependent
   – Physics analysis dependent
● Define the physics group TAG databases
   – Derived from the experiment-wide database
● Maybe allow for user TAG databases
   – Derived from the physics group databases
● Useful tag fields are then pushed up in this hierarchy

TAG Conclusion
● We have to define a prototype of the experiment-wide TAG database
   – Implement this in the reconstruction program
● Physics working groups have to define the physics group databases
   – Test the mechanism of inheritance from the experiment-wide TAG database
● Decide whether the 'event building' information has to allow navigation
   – To all the AODs
   – Or just to those created within that working group
● When? Who?

Metadata in CDC
Fons Rademakers

AliMDC - ROOT Objectifier
● The ROOT Objectifier reads the raw data stream via shared memory from the GDC
● The Objectifier has three output streams:
   – Raw event file, via rootd to CASTOR
   – Event catalog (tag DB)
   – File catalog: in the raw event file, in MySQL, in AliEn

RAW Data DB
● A raw data file contains a tree of AliEvent objects:
   – An AliEventHeader object (16 data members, 72 bytes)
   – A TObjArray of AliEquipment objects, each with an AliEquipmentHeader object (7 data members, 28 bytes) and an AliRawData object (char array, variable length)
   – A TObjArray of sub-events (also AliEvents)
● No compression (would need more CPU power)
● Size of individual raw DB files: around 1.5 GB

Event Catalog - Tag DB
● The tag DB contains a tree of AliEventHeader objects:
   – Size, type, run number, event number, event id, trigger pattern, detector pattern, type attributes
   – Basic physics parameters (see Karel's talk)
● Compressed
● Used for fast event selection
   – Using compressed bitmap indices (as used in the STAR grid collector)?
● Do we store LFNs for the events, or do we look the LFNs up in the file catalog using run/event number?
   – We might need more than one LFN per event (RAW, RECO, ESD, AOD), or do we use naming conventions?
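As an illustration of such a fast selection on a tag tree, a short ROOT macro sketch; the file, tree and branch names are hypothetical, and the real layout is whatever AliMDC writes:

```cpp
// Select entries from a tag tree with a physics cut and collect their numbers.
#include <cstdio>
#include "TEventList.h"
#include "TFile.h"
#include "TTree.h"

void selectEvents() {
   TFile f("tags.root");
   TTree *tagTree = static_cast<TTree*>(f.Get("TAG"));

   // Fill a TEventList named "sel" with the entry numbers passing the cut
   tagTree->Draw(">>sel", "fChargedMultiplicity > 100 && fMaxPt > 5.0");
   TEventList *sel = static_cast<TEventList*>(gDirectory->Get("sel"));

   std::printf("%d events selected\n", sel->GetN());
   // The selected entries can then be mapped to run/event numbers and looked
   // up in the file catalog to obtain the corresponding LFNs.
}
```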

File Catalog
● The file catalog contains one AliStats object per raw data file:
   – Filename of raw file, number of events, begin/end run number, begin/end event number, begin/end time, file size, quality histogram
● Same info also stored in a central MySQL RDBMS
● In AliEn catalog only: LFN, raw filename, file size
   – LFN: /alice_md/dc/adc- / : _ _.Root

File/event Collections
● Not yet addressed
● A file collection can be one (or more) sub-directories in AliEn (with symbolic links to the original LFNs)?
● An event collection can be a file collection with an associated event list per file?
● Or fully dynamic: a collection is just a (stored) query in the event and file catalogs (like the grid collector)

ARDA Experiences With Metadata
Nuno Santos

Experience with Metadata
● ARDA tested several metadata solutions from the experiments:
   – LHCb Bookkeeping – XML-RPC with Oracle backend
   – CMS: RefDB – PHP in front of MySQL, giving back XML tables
   – ATLAS: AMI – SOAP server in Java in front of MySQL
   – gLite (AliEn Metadata) – Perl in front of MySQL, parsing commands and streaming back text
● Tested performance, scalability and features => many plots omitted here, see the original presentation

Synthesis
● Generally, the scalability is poor
   – Sending big responses in a single packet limits scalability
   – Use of SOAP and XML-RPC worsens the problem
● Schema evolution not really supported
   – RefDB and AliEn don't do schema evolution at all
   – AMI and the LHCb Bookkeeping do it via admins adjusting tables
● No common metadata interface
=> From the experience with existing software, propose a generic interface and a prototype as proof of concept

Design decisions
● Metadata organized as a hierarchy – collect objects with shared attributes into collections
● Collections stored as tables – allows queries on SQL tables
● Analogy to a file system:
   – Collection <-> Directory
   – Entry <-> File
● Abstract the backend – allows supporting several backends
   – PostgreSQL, MySQL, Oracle, filesystem, …
● Values restricted to ASCII strings
   – The backend is unknown
● Scalability:
   – Transfer large responses in chunks, streaming from the DB (sketched below)
   – Decreases server memory requirements: no need to store the full response in memory
   – Implement also a non-SOAP protocol, to compare with SOAP: what is the performance price of SOAP?
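A purely illustrative sketch of the chunked-transfer idea referenced above, in plain C++; the Cursor class and its method are hypothetical, not the actual ARDA prototype API:

```cpp
// Deliver a large query result in fixed-size chunks, so neither client nor
// server has to hold the full response in memory at once.
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct Row { std::string key, value; };

class Cursor {                                   // hypothetical server-side cursor
public:
   explicit Cursor(std::vector<Row> rows) : fRows(std::move(rows)) {}
   // Return up to maxRows rows; an empty vector signals the end of the result.
   std::vector<Row> fetch(std::size_t maxRows) {
      std::vector<Row> chunk;
      while (fNext < fRows.size() && chunk.size() < maxRows)
         chunk.push_back(fRows[fNext++]);
      return chunk;
   }
private:
   std::vector<Row> fRows;
   std::size_t fNext = 0;
};

int main() {
   Cursor cur({{"run", "1234"}, {"nEvents", "56789"}, {"quality", "good"}});
   for (;;) {
      std::vector<Row> chunk = cur.fetch(2);     // tiny chunk size, just for the demo
      if (chunk.empty()) break;
      for (const Row &r : chunk)
         std::cout << r.key << " = " << r.value << '\n';
   }
   return 0;
}
```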

Conclusions on SOAP
Tests show:
● SOAP performance is generally poor compared to TCP streaming
   – gSOAP is significantly faster than the other toolkits
● Iterators (with a stateful server) help considerably – results returned in small chunks
● SOAP puts an additional load on the server
● SOAP interoperability is very problematic
   – Writing an interoperable WSDL is hard
   – Not all toolkits are mature

Conclusions
● Many problems were understood by studying the metadata implementations of the experiments
   – Common requirements exist
● ARDA proposes a generic interface to metadata on the Grid:
   – Retrieving/updating of data
   – Hierarchical view
   – Schema discovery and management
   – (Discussed with gLite, GridPP, GAG; accepted by the PTF)
● A prototype with SOAP and streaming front ends was built
   – SOAP can be as fast as streaming in many cases
   – SOAP toolkits are still immature