File-Metadata Management System For The LHCb Experiment Carmine Cioffi Department of Physics, University of Oxford CHEP04 Interlaken, 27 September 2004
Outline
What are metadata, and why do we need them in the LHCb experiment?
The File-Metadata Management System
 – The two-schema strategy
 – XML and the warehousing database
 – Services and specialised views
 – Relationship between the warehousing database and views
 – Web Services
ARDA and future planning
Metadata
Generally speaking, metadata are data that characterise data files.
The two facets of metadata:
 – Job provenance: everything you ever wanted to know about how a data file was created.
 – Bookkeeping: how do I identify the datasets I am interested in for my analysis?
Metadata are needed to get straight to the files of interest, avoiding unnecessary access to the data storage.
The two-schema strategy
The two-schema strategy consists of having a database (the warehousing DB) and a view of it, each with its own schema.
 – The Warehousing DataBase (WDB) is meant to store data in a simple way while remaining flexible enough to accept new data.
 – The View is designed to be efficient for the service it is built for.
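The warehousing side of this strategy can be illustrated with a minimal sketch. It assumes a key-value layout in which each job has a fixed header row plus any number of name/value parameter rows. SQLite stands in for the production database, the `JobID` column and the helper names `create_warehouse` and `add_job` are hypothetical, and the sample values are invented for illustration.

```python
import sqlite3


def create_warehouse(conn):
    """Create a tiny key-value warehouse: one header table, one parameter table."""
    conn.executescript("""
        CREATE TABLE Jobs (
            JobID         INTEGER PRIMARY KEY,  -- hypothetical surrogate key
            ConfigName    TEXT NOT NULL,
            ConfigVersion TEXT NOT NULL,
            Date          TEXT NOT NULL
        );
        CREATE TABLE JobParams (
            JobID INTEGER NOT NULL REFERENCES Jobs(JobID),
            Name  TEXT NOT NULL,
            Value TEXT NOT NULL
        );
    """)


def add_job(conn, config_name, config_version, date, params):
    """Insert one job and its parameters; return the new job id."""
    cur = conn.execute(
        "INSERT INTO Jobs (ConfigName, ConfigVersion, Date) VALUES (?, ?, ?)",
        (config_name, config_version, date),
    )
    job_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO JobParams (JobID, Name, Value) VALUES (?, ?, ?)",
        [(job_id, name, str(value)) for name, value in params.items()],
    )
    return job_id


conn = sqlite3.connect(":memory:")
create_warehouse(conn)
job1 = add_job(conn, "DC04", "v1", "2004-05-01", {"Laboratory": "CERN"})
# A later job carries a parameter that was never planned for.
# It becomes one more row; no ALTER TABLE is needed.
job2 = add_job(conn, "DC04", "v1", "2004-06-01",
               {"Laboratory": "Oxford", "EventType": "10000000"})
```

The flexibility comes at a price: nothing in the tables themselves stops a job from arriving with missing or meaningless parameters, so input has to be checked before it is stored.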
Entity-Relationship model for WDB (diagram)
XML and the insertion of data
Because of its key-value strategy, the WDB is liable to be corrupted:
 – Data with any semantics can be inserted.
 – Partial information can be inserted.
To prevent this, the data must be presented in XML format. The data can then be checked against a predefined DTD or XML Schema before insertion.
The DTD for the insertion of job-related metadata
<!ATTLIST Job ConfigName    CDATA #REQUIRED
              ConfigVersion CDATA #REQUIRED
              Date          CDATA #REQUIRED>
<!ATTLIST JobOption Recipient CDATA #REQUIRED
                    Name      CDATA #REQUIRED
                    Value     CDATA #REQUIRED>
<!ATTLIST TypedParameter Name  CDATA #REQUIRED
                         Value CDATA #REQUIRED
                         Type  (Info|Environment_Variable) #REQUIRED>
<!ATTLIST OutputFile Name        CDATA #REQUIRED
                     TypeName    CDATA #REQUIRED
                     TypeVersion CDATA #REQUIRED>
<!ATTLIST Parameter Name  CDATA #REQUIRED
                    Value CDATA #REQUIRED>
<!ATTLIST Quality Group CDATA #REQUIRED
                  Flag  CDATA #REQUIRED>
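The kind of check these declarations enable can be sketched as follows. This is not a DTD validator: it is a hand-rolled check, using only the Python standard library, that every element carries the attributes marked #REQUIRED above and that `Type` takes one of the two enumerated values. The function name `check_job_xml` is hypothetical, and the nesting used in the sample documents is an assumption, since the element declarations are not shown on the slide.

```python
import xml.etree.ElementTree as ET

# Required attributes per element, transcribed from the ATTLIST declarations.
REQUIRED = {
    "Job": {"ConfigName", "ConfigVersion", "Date"},
    "JobOption": {"Recipient", "Name", "Value"},
    "TypedParameter": {"Name", "Value", "Type"},
    "OutputFile": {"Name", "TypeName", "TypeVersion"},
    "Parameter": {"Name", "Value"},
    "Quality": {"Group", "Flag"},
}
TYPED_PARAMETER_TYPES = {"Info", "Environment_Variable"}


def check_job_xml(text):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for elem in ET.fromstring(text).iter():
        required = REQUIRED.get(elem.tag)
        if required is None:
            problems.append(f"unknown element <{elem.tag}>")
            continue
        for name in sorted(required - set(elem.attrib)):
            problems.append(f"<{elem.tag}> is missing attribute {name}")
        type_value = elem.attrib.get("Type")
        if (elem.tag == "TypedParameter" and type_value is not None
                and type_value not in TYPED_PARAMETER_TYPES):
            problems.append(f"<TypedParameter> has invalid Type {type_value!r}")
    return problems


# Sample documents; the nesting under <Job> is assumed, the values are invented.
good = ('<Job ConfigName="DC04" ConfigVersion="v1" Date="2004-05-01">'
        '<TypedParameter Name="Host" Value="node01" Type="Info"/>'
        '<OutputFile Name="00001.dst" TypeName="DST" TypeVersion="1"/>'
        '</Job>')
bad = ('<Job ConfigName="DC04" Date="2004-05-01">'
       '<TypedParameter Name="Host" Value="node01" Type="Guess"/>'
       '</Job>')

good_problems = check_job_xml(good)
bad_problems = check_job_xml(bad)
```

A validating XML parser would also reject undeclared attributes and wrong nesting, which this sketch does not attempt.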
Services and the specialised views
Complex SQL queries sometimes perform poorly for bulk lookups.
 – However, the WDB contains all the information about the files, and this can be used to generate specialised views for a specific service.
Because the service is known in advance, its view can be optimised to give the best performance.
Example of a view with services and applications
This example shows the specialised view that sits behind the XML-RPC and Servlet services. These services are used by GANGA and the web browser.
Diagram labels: GANGA application, Web Browser, Jython Web Server (SERVLETS, XMLRPC), Specialised View Schema.
Specialised view schema:
 – Replica: FILE_ID, REPLICA, LOCATION
 – DT_JobSummary: JOB_ID, CONFIG, DBVERSION, EVENTTYPE, JOBDATE, LABORATORY, PROGRAM0, INPUTFILE0, PROGRAM1, INPUTFILE1, PROGRAM2, INPUTFILE2
 – DT_FileSummary: FILE_ID, JOB_ID, EVENTTYPE, EVENTDESCRIPTION, NBEVENTS, FILETYPE, FILENAME, FILESIZE
Generation of the specialised view
An SQL script generates the specialised view from the warehouse DB. This is done periodically or on demand, based on the needs of the experiment (every night for LHCb). It is fast even though the WDB contains many GB.
Warehouse DB tables (diagram): Jobs, JobParams, Files, FileParams, TypeParams, QualityParams; column labels shown: ConfigName, ConfigVersion, Date, Name, Value, Type, LogName.
Specialised view tables (diagram): Replica, DT_JobSummary, DT_FileSummary.
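A generation script of this kind might look like the sketch below, which pivots key-value parameter rows into one wide summary row per job. It is an illustration under assumptions, not the LHCb script: SQLite stands in for the production database, the `JobID` column is hypothetical, the mapping from warehouse names to view columns is a guess, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Jobs (JobID INTEGER PRIMARY KEY, ConfigName TEXT,
                       ConfigVersion TEXT, Date TEXT);
    CREATE TABLE JobParams (JobID INTEGER, Name TEXT, Value TEXT);
    INSERT INTO Jobs VALUES (1, 'DC04', 'v1', '2004-05-01'),
                            (2, 'DC04', 'v2', '2004-06-01');
    INSERT INTO JobParams VALUES (1, 'EventType', '10000000'),
                                 (1, 'Laboratory', 'CERN'),
                                 (2, 'EventType', '11000000');
""")

# Rebuild the summary table from scratch, then index the column that the
# service is expected to filter on.
REBUILD_VIEW = """
    DROP TABLE IF EXISTS DT_JobSummary;
    CREATE TABLE DT_JobSummary AS
    SELECT j.JobID      AS JOB_ID,
           j.ConfigName AS CONFIG,
           j.Date       AS JOBDATE,
           MAX(CASE WHEN p.Name = 'EventType'  THEN p.Value END) AS EVENTTYPE,
           MAX(CASE WHEN p.Name = 'Laboratory' THEN p.Value END) AS LABORATORY
    FROM Jobs j
    LEFT JOIN JobParams p ON p.JobID = j.JobID
    GROUP BY j.JobID, j.ConfigName, j.Date;
    CREATE INDEX idx_jobsummary_eventtype ON DT_JobSummary (EVENTTYPE);
"""


def rebuild_view(conn):
    """Regenerate the specialised view; safe to call repeatedly."""
    conn.executescript(REBUILD_VIEW)


rebuild_view(conn)
rows = conn.execute(
    "SELECT JOB_ID, CONFIG, JOBDATE, EVENTTYPE, LABORATORY "
    "FROM DT_JobSummary ORDER BY JOB_ID"
).fetchall()
```

Rebuilding the whole table keeps the script simple and leaves the view as a pure function of the warehouse contents. A job that lacks a parameter simply gets NULL in the corresponding column.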
Some numbers
LHCb uses Oracle 9i technology for its DB:
 – It is hosted on a cluster of two Sun Fire 280R machines
 – Each with two 750 MHz processors
 – 2 GB RAM
 – 600 GB HD
The DB contains ~20 GB of data:
 – Shared between real data and indexing tables
 – ~2M job rows
 – ~5.5M file rows
 – ~57M parameter rows
LHCb services
LHCb currently uses two services to access the information in the databases:
 – Servlet service: allows datasets to be selected from a web browser based on their history (job provenance).
 – XML-RPC service: provides access to, and modification of, the WDB data, and allows GANGA to access the Bookkeeping data.
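The XML-RPC pattern can be sketched with Python's standard library, as a stand-in for the Jython service mentioned in the talk. The method name `files_for_event_type` is hypothetical, the table holds only a few DT_FileSummary columns, and the rows are invented. To keep the example self-contained, the request is encoded and dispatched in-process rather than sent over a socket, using the dispatcher's internal `_marshaled_dispatch` method.

```python
import sqlite3
from xmlrpc.client import dumps, loads
from xmlrpc.server import SimpleXMLRPCDispatcher

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DT_FileSummary (FILE_ID INTEGER, JOB_ID INTEGER,
                                 EVENTTYPE TEXT, FILENAME TEXT);
    INSERT INTO DT_FileSummary VALUES (1, 1, '10000000', 'a.dst'),
                                      (2, 1, '10000000', 'b.dst'),
                                      (3, 2, '11000000', 'c.dst');
""")


def files_for_event_type(event_type):
    """Hypothetical remote method: file names recorded for one event type."""
    rows = conn.execute(
        "SELECT FILENAME FROM DT_FileSummary WHERE EVENTTYPE = ? ORDER BY FILE_ID",
        (event_type,),
    )
    return [name for (name,) in rows]


dispatcher = SimpleXMLRPCDispatcher()
dispatcher.register_function(files_for_event_type, "files_for_event_type")

# Encode the call as a client would, dispatch it, and decode the reply.
request = dumps(("10000000",), methodname="files_for_event_type")
response = dispatcher._marshaled_dispatch(request.encode("utf-8"))
(result,), _ = loads(response)
```

In a networked setup the same function would be registered on an `xmlrpc.server.SimpleXMLRPCServer` and called through an `xmlrpc.client.ServerProxy`.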
Collaboration with ARDA
LHCb has entered into a collaboration with ARDA:
 – Definition of metadata and understanding of LHCb requirements
 – Elaboration of a new interface for the manipulation of file metadata
 – Possible technology (WSDL)
 – See how this will fit with the existing LHCb system
Stress-test the Bookkeeping services, analysing their behaviour with:
 – Different numbers of clients
 – Different queries
 – Comparison with direct RPC calls
Implement the newly defined interface:
 – Using the current LHCb File-Metadata DB as back-end
 – Using the technology developed with ARDA
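A stress test with a varying number of clients could be driven by a harness like the one sketched below. It is only an outline of the idea: `run_clients` is a hypothetical helper, and `fake_query` is a placeholder that sleeps for a millisecond where a real test would call the Bookkeeping service.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_clients(query, n_clients, calls_per_client):
    """Run `query` from n_clients concurrent threads; return all call latencies."""
    def client(client_index):
        latencies = []
        for _ in range(calls_per_client):
            start = time.perf_counter()
            query()
            latencies.append(time.perf_counter() - start)
        return latencies

    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        per_client = list(pool.map(client, range(n_clients)))
    # Flatten the per-client lists into one list of latencies in seconds.
    return [t for latencies in per_client for t in latencies]


def fake_query():
    """Placeholder for a real service call."""
    time.sleep(0.001)


# Worst-case latency observed for each client count.
summary = {n: max(run_clients(fake_query, n, 3)) for n in (1, 2, 4)}
```

Swapping `fake_query` for different real queries, or for a direct RPC call, covers the other two comparisons listed above.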
Conclusions
The two-schema strategy works well for LHCb. DC04 proved its flexibility: no changes to the WDB were required, even though new data were stored.
Because of the key-value nature of the WDB, it can easily be adapted to warehouse any data, including that of other experiments.