The POOL Persistency Framework


The POOL Persistency Framework
Dirk Düllmann, CERN, Geneva
DESY Computing Seminar, Hamburg, 12 January 2004

Outline
- Project goals and organisation
- POOL components and their high-level interaction
- Main technical problems and solutions chosen
- First production experience and user feedback

What is POOL?
- POOL = Pool Of persistent Objects for LHC
- Project goal: develop a common persistency framework for physics applications at the LHC
- Part of the LHC Computing Grid (LCG), the computing infrastructure for LHC: Physics Applications, Grid Technology, Grid Deployment, Fabric Management
- POOL is one of the first Application Area projects; a second project, ConditionsDB, is just starting
- Common effort between the LHC experiments and the CERN IT-DB group, for defining its scope and architecture and for the development of its components

POOL Timeline and Statistics
- POOL project started April 2002, ramping up from 1.6 to ~10 FTE
- Persistency Workshop in June 2002
- First internal release, POOL V0.1, in October 2002
- In about one year of active development since then:
  - 12 public releases; POOL V1.4.0 last year, V1.5 expected this month
  - Some 60 internal releases, often picked up by experiments to confirm fixes and new functionality; very useful to ensure that releases meet experiment expectations beforehand
  - Handled some 165 bug reports; the Savannah web portal has proven helpful
- POOL followed from the beginning a rather aggressive schedule to meet the first production needs of the experiments.

POOL Objectives
- To allow the multiple petabytes of experiment data and associated meta data to be stored in a distributed and grid-enabled fashion
  - Various types of data of different volumes (event data, physics and detector simulation, detector data and bookkeeping data)
- Hybrid technology approach, combining:
  - C++ object streaming technology, Root I/O, for the bulk data
  - Transactionally safe relational database (RDBMS) services, MySQL, for catalogs, collections and meta data
- In particular, POOL provides:
  - Persistency for C++ transient objects
  - Transparent navigation among objects, across file and technology boundaries
  - Integration with an external File Catalog to keep track of the files' physical locations, allowing files to be moved or replicated

Component Architecture
- POOL is a component-based system
  - Closely follows the LCG Architecture Blueprint
- Provides a technology-neutral API
  - Abstract component C++ interfaces
  - Insulates the experiment framework user code from implementation details of the technologies used today
- POOL user code is not dependent on implementation libraries
  - No link-time dependency on implementation packages (e.g. MySQL, Root, Xerces-C, ...)
  - Backend component implementations are loaded at runtime via the SEAL plug-in infrastructure
- Three major domains, weakly coupled, interacting via abstract interfaces

POOL Component Breakdown
[diagram: breakdown of the POOL components into packages]

Work Package Breakdown
- Storage Manager
  - Streams transient C++ objects into/from disk storage
  - Resolves a logical object reference into a physical object
  - Uses Root I/O; a proof-of-concept RDBMS storage manager prototype is underway
- File Catalog
  - Maintains consistent lists of accessible files (physical and logical names) together with their unique identifiers (FileID), which appear in the object representation in the persistent space
  - Resolves a logical file reference (FileID) into a physical file
- Collections
  - Provides the tools to manage potentially large ensembles of objects stored via POOL persistence services
  - Explicit: server-side selection of objects from queryable collections
  - Implicit: defined by physical containment of the objects

The LHC Computing Centre
[diagram: the LHC multi-tier computing model — CERN (Tier 0/1) at the centre; regional Tier 1 centres (Germany, USA, UK, France, Italy, ...); Tier 2 labs and universities; Tier 3 physics department resources and desktops; slide courtesy les.robertson@cern.ch]

POOL on the Grid
[diagram: a user application on top of the experiment framework uses LCG POOL (File Catalog, Collections, Meta Data) and Root I/O; the catalog components talk to Grid middleware — Replica Location Service, Grid Dataset Registry, Replica Manager — which in turn manage Grid resources]

POOL off the Grid
[diagram: the same application stack on a disconnected laptop — the File Catalog backed by XML or MySQL, Collections by MySQL or a RootIO collection, bulk data via Root I/O]

HEP Data Models
- HEP data models are complex!
  - Typically hundreds of structure types (classes)
  - Many relations between them
  - Different access patterns
- LHC experiments rely on OO technology
  - OO applications deal with networks of C++ objects
  - Pointers (or references) are used to describe relations
[diagram: an Event pointing to a Tracker and a Calorimeter; the Tracker holds a TrackList of Tracks, each Track a HitList of Hits]

POOL Storage Hierarchy
- A simple and generic model is exposed to the application
- An application may access databases (e.g. ROOT files) from a set of catalogs
- Each database has containers of one specific technology (e.g. ROOT trees)
- Smart pointers are used to
  - transparently load objects into a client-side cache
  - define object associations across file or technology boundaries

Navigational Access
- Identity == physical location: internally, a unique Object Identifier (OID) per object, the triple (DatabaseID, ContainerID, ObjectID)
- Direct access to any object in the distributed store
- A natural extension of the pointer concept
- OIDs allow networks of persistent objects ("associations") to be implemented
  - Association cardinality: 1:1, 1:n, n:m
- OIDs are used in C++ via so-called "smart pointers"
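The OID triple can be pictured as a small value type. The following C++ sketch is purely illustrative — the name Token and the exact field widths are assumptions, not POOL's actual layout:

    #include <cstdint>

    // Illustrative sketch of a navigational object identifier: three
    // coordinates locate any object in the distributed store.
    struct Token {
        std::uint32_t databaseId;   // which database (file) holds the object
        std::uint32_t containerId;  // which container inside that database
        std::uint32_t objectId;     // which entry inside that container
    };

    // Two tokens refer to the same object iff all coordinates match.
    inline bool operator==(const Token& a, const Token& b) {
        return a.databaseId == b.databaseId
            && a.containerId == b.containerId
            && a.objectId == b.objectId;
    }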

How do "smart pointers" work?
- Ref<Track> is a smart pointer to a Track
- Refs are the most visible part of the POOL C++ API
- The storage system automatically locates objects in the store as they are accessed, and reads them
  - Transparent to the user/application programmer
- The user does not need to know about physical locations
  - No host or file names in the code
  - Allows de-coupling of the logical and physical model
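A minimal sketch of the lazy-loading idea behind Ref<T>, reusing the hypothetical Token type sketched above; the load hook stands in for POOL's storage manager and is not the real API:

    #include <memory>

    template <typename T>
    class Ref {
    public:
        explicit Ref(Token token) : token_(token) {}

        // Dereferencing triggers the read on first access only.
        T* operator->() {
            if (!object_)                 // not yet in the client-side cache?
                object_ = load(token_);   // locate and read it transparently
            return object_.get();
        }

    private:
        static std::unique_ptr<T> load(const Token&) {
            // A real implementation would resolve the token via the storage
            // manager and stream the object in from its file; a default-
            // constructed stand-in keeps this sketch self-contained.
            return std::make_unique<T>();
        }

        Token token_{};                   // persistent identity
        std::unique_ptr<T> object_;       // cached in-memory instance
    };

Note how no file or host name appears anywhere near the user-visible type.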

A Code Example
[diagram: the Event/Track/Hit model from the previous slide]

    Collection<Event> events;           // an event collection
    Collection<Event>::iterator evt;    // a collection iterator

    // loop over all events in the input collection
    for (evt = events.begin(); evt != events.end(); evt++) {
        // access the first track in the tracklist
        Ref<Track> aTrack;
        aTrack = evt->tracker->trackList[0];
        // print the charge of all its hits
        for (int i = 0; i < aTrack->hits.size(); i++)
            cout << aTrack->hits[i]->charge << endl;
    }

Physical and Logical Data Model
- Important to clearly separate the two models
- Avoids polluting the user code with unnecessary detail about file and host names
- Leaves the physical model free to optimise for performance or changing access patterns without affecting existing applications
[diagram: the File Catalog mediating between the logical and the physical data model]

POOL and its Environment
- POOL is mainly used from experiment frameworks, mostly as a client library loaded from the user application
- Production manager: creates and maintains shared file catalogs and (event) collections
- End user: uses shared or private collections
- POOL applications are grid-aware via the Replica Location Service (RLS)
  - File location and meta data queries are forwarded to Grid services
- The POOL storage manager provides access to local or remote files via Root I/O (e.g. RFIO/dCache), possibly later replaced by the Grid File Access Library (GFAL)
[diagram: a POOL client on a CPU node — user application and experiment framework on top of POOL — talking to experiment DB services (bookkeeping, production workflow), RDBMS services (collection description and access) and Grid file services (replica location, file description, remote file I/O)]

Storage Components
[diagram: the storage manager between the client's data cache and disk storage — an object pointer maps to a persistent address (technology, database name, container name, item ID); on disk, a database holds containers, which hold objects]

Mapping to Technologies
- Identify commonalities and differences between technologies
- The model adapts to any storage technology with direct access
  - The record identifier should be known before flushing to disk
- Use of an RDBMS is rather conventional
  - No special object support required
  - The primary key must be uniquely determined before writing

Technology Independent Type Dictionary
- To store any in-memory object on disk, one needs a precise description of the class layout
- Logical type info
  - Name and type of all data attributes
  - Inheritance hierarchy
  - Logical == independent of platform and compiler
- Physical type info
  - Offset, endianness and size of all data attributes
  - Base classes are not much more than embedded data members of that type
  - Depends on CPU, compiler, compiler flags (e.g. padding)
- POOL needs only the data attributes and the dynamic type to persist an object.
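As an illustration of the logical/physical split, here is a hand-written sketch of the information such a dictionary carries; the struct names are invented for the example, and a real dictionary generator would derive all of this from the header automatically:

    #include <cstddef>
    #include <string>
    #include <vector>

    struct AttributeInfo {
        std::string name;      // logical: attribute name
        std::string typeName;  // logical: attribute type
        std::size_t offset;    // physical: byte offset on this platform
        std::size_t size;      // physical: size with this compiler/flags
    };

    struct ClassInfo {
        std::string name;                    // fully qualified class name
        std::vector<std::string> bases;      // inheritance hierarchy
        std::vector<AttributeInfo> attributes;
    };

    // Example: describing a simple Hit class by hand.
    struct Hit { float charge; int channel; };

    ClassInfo describeHit() {
        return ClassInfo{
            "Hit",
            {},  // no base classes
            { {"charge",  "float", offsetof(Hit, charge),  sizeof(float)},
              {"channel", "int",   offsetof(Hit, channel), sizeof(int)} }
        };
    }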

Dictionary: Population/Conversion
[diagram: technology-dependent dictionary generation — .h headers go through ROOTCINT to produce CINT dictionary code, and through GCC-XML (via .xml) and a code generator to produce LCG dictionary code; at runtime the CINT dictionary serves data I/O, while a gateway connects it to the LCG dictionary, which serves reflection for other clients]

Client Access via Object Caches
- The Data Service manages the object cache
- Clients access data through Ref<T> references
- Different contexts use separate data services and caches: event data, detector data, other
[diagram: several clients holding Ref<T>s into object caches, each cache managed by its own data service]

POOL Cache Access
[diagram: a Ref<T> carries a token — storage technology, object type, persistent location; the cache manager maps keys (OIDs) to cached object pointers, and on a miss the persistency manager resolves the token and loads the object]
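The key-to-pointer lookup can be pictured as below — a hypothetical sketch, with the cache keyed by the (database, container, object) triple:

    #include <cstdint>
    #include <map>
    #include <memory>
    #include <tuple>

    // Illustrative client-side object cache: the persistency manager
    // consults it first, and only on a miss resolves the token to a
    // physical location and reads from storage.
    class ObjectCache {
    public:
        using Key = std::tuple<std::uint32_t, std::uint32_t, std::uint32_t>;

        std::shared_ptr<void> find(const Key& k) const {
            auto it = cache_.find(k);
            return it != cache_.end() ? it->second : nullptr;  // hit or miss
        }

        void insert(const Key& k, std::shared_ptr<void> obj) {
            cache_[k] = std::move(obj);  // later Refs reuse this object
        }

    private:
        std::map<Key, std::shared_ptr<void>> cache_;
    };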

Reducing Storage Overhead - The Link Table
- A local lookup table in each file maps a small link ID to the full link info (DB/container name, ...)
- A stored object reference then carries only the link ID and the entry ID, instead of repeating the full location
[diagram: (1) the OID inside a stored object is split into an entry ID and a link ID; (2-4) the link ID is expanded via the file-local link table into the database/container name]
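A sketch of the indirection, with invented names; the point is that the verbose location string is stored once per file rather than once per reference:

    #include <cstdint>
    #include <string>
    #include <vector>

    // A stored reference carries only two small integers ...
    struct StoredRef {
        std::uint32_t linkId;   // index into the file-local link table
        std::uint32_t entryId;  // entry inside the referenced container
    };

    // ... while the repeated, verbose part lives once per file.
    struct LinkInfo {
        std::string databaseName;   // e.g. the target file's FileID
        std::string containerName;  // target container inside that file
    };

    using LinkTable = std::vector<LinkInfo>;  // one table per file

    LinkInfo resolve(const LinkTable& table, const StoredRef& ref) {
        return table.at(ref.linkId);  // expand the compact reference
    }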

POOL File Catalog
- Files are referred to inside POOL via a unique and immutable file identifier (FileID), generated at creation time
- POOL adds the system-generated FileID to the standard Grid m-n LFN-PFN mapping
  - A stable inter-file reference
- Optionally, file metadata attributes are associated with the FileID, used for fast query-based catalog manipulation; they are not used internally by POOL
- The FileID-LFN mapping is supported but not used internally; the FileID-PFN mapping is sufficient for object lookup
[diagram: logical naming (LFNs) and object lookup (PFNs plus technology) both keyed by the FileID, with file metadata (jobid, owner, ...) attached]
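The essential shape of the catalog, reduced to a sketch (illustrative types only, not the POOL interfaces):

    #include <map>
    #include <string>

    using FileID = std::string;  // in POOL a GUID, see the next slide

    struct CatalogSketch {
        std::multimap<FileID, std::string> logical;   // FileID -> LFNs (optional)
        std::multimap<FileID, std::string> physical;  // FileID -> PFN + technology

        // Object lookup only ever needs the FileID -> PFN direction:
        std::string lookupPFN(const FileID& id) const {
            auto it = physical.find(id);
            return it != physical.end() ? it->second : std::string{};
        }
    };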

GUIDs as File Identifiers
- The Global Unique Identifier (GUID) implementation of the FileID allows the production of consistent sets of files with internal references, without requiring a central ID allocation service
  - Jobs can execute in complete isolation (as e.g. in CMS "winter mode")
- Catalog fragments can be used asynchronously as a means of propagating catalog or meta data changes through the system
  - Without losing the internal object references
  - Without risking a clash with any already existing objects
  - Without the need to rewrite any existing data files
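Why no central service is needed: each job can draw its FileID from 122 random bits, so independently produced files will not clash in practice. A version-4-style generator sketch (illustrative only, not POOL's actual code):

    #include <array>
    #include <cstdio>
    #include <random>
    #include <string>

    std::string newFileID() {
        static std::mt19937 rng{std::random_device{}()};
        std::uniform_int_distribution<int> byte(0, 255);

        std::array<unsigned, 16> b{};
        for (auto& x : b) x = static_cast<unsigned>(byte(rng));
        b[6] = (b[6] & 0x0Fu) | 0x40u;  // version 4 (random)
        b[8] = (b[8] & 0x3Fu) | 0x80u;  // RFC 4122 variant

        char buf[37];
        std::snprintf(buf, sizeof buf,
            "%02X%02X%02X%02X-%02X%02X-%02X%02X-%02X%02X-%02X%02X%02X%02X%02X%02X",
            b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
            b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
        return std::string(buf);
    }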

Catalog Implementations
- XML catalog
  - Typically used as a local file by a single user/process at a time; no network needed
  - Supports R/O operations via http
  - Tested up to 50K entries
- Native MySQL catalog
  - Shared catalog, e.g. in a production LAN
  - Handles multiple users and jobs (multi-threaded)
  - Tested up to 1M entries
- EDG-RLS catalog
  - For grid-aware applications
  - Production service based on Oracle (from CERN IT/DB), already in use for POOL V1.0 (July '03)
  - Oracle iAS or Tomcat + Oracle/MySQL backend

Catalog Queries
- Application-level meta data can be associated with catalogued files
  - Meta data are normally used to query large file catalogs, e.g. to extract catalog subsets based on production parameters
- Catalog fragments, including meta data, can be shipped; cross-catalog operations between different catalog implementations are supported
- Queries are supported at the client side for XML and EDG, and at the server side for MySQL
  - e.g. "jobid='cmssim'", "owner like '%wild%'", "pfname like 'path' and owner = 'atlasprod'"
- EDG RLS enhancement requested: server-side evaluation of numerical query expressions

Use case: isolated system
- The user extracts a set of interesting files, and a catalog fragment describing them, from a (central) Grid-based catalog (EDG-RLS) into a local XML catalog
  - Selection is performed based on file or collection descriptions
- After disconnecting from the Grid, the user executes some standard jobs, navigating through the extracted data
- New output files are registered into the local XML catalog
- Once the new data is ready for publishing and the user is connected again, the new catalog fragment is submitted to the Grid-based catalog
[diagram: import from EDG-RLS into a local XML catalog; jobs look up input files and register output files with no network; publish back to EDG-RLS]
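The same workflow in code form, against a deliberately hypothetical IFileCatalog interface (POOL's real catalog API differs in detail):

    #include <string>

    struct IFileCatalog {
        virtual ~IFileCatalog() = default;
        virtual void import(IFileCatalog& source, const std::string& query) = 0;
        virtual void registerFile(const std::string& pfn) = 0;
    };

    void isolatedSession(IFileCatalog& gridCatalog, IFileCatalog& localXml) {
        // 1. While connected: extract the interesting catalog fragment.
        localXml.import(gridCatalog, "jobid='mysim' and owner='me'");

        // 2. Disconnected: jobs navigate the extracted data and
        //    register their new output files locally.
        localXml.registerFile("file:/scratch/output_01.root");

        // 3. Reconnected: publish the new fragment back to the Grid.
        gridCatalog.import(localXml, "");  // empty query: take everything
    }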

Use case: farm production
- A production job runs and creates files and their catalog entries in a local XML file
- During the production, the catalog can be used to clean up files
- Once the data quality checks have been passed, the production manager decides to publish the production XML catalog fragment to the site database, and eventually to the Grid-based catalog
[diagram: worker nodes (e.g. lx1.cern.ch, pc3.cern.ch) run jobs that look up and register files in local XML catalogs; after quality checks these are published to a site RDBMS and on to the EDG catalog]

CMS & POOL
- Fruitful collaboration with the POOL team since inception
  - 2.6 FTE direct participation
- Efficient communication
  - Savannah portal
  - Direct mail and phone exchange among developers
  - In-person meetings when required
- Continuous and prompt feedback
  - CMS typically provides feedback on any new pre-release within a few hours
  - POOL typically responds to bug reports in 24/48 hours; only a few took more than a week to be fixed in a new pre-release
(Review the response slides from the experiments.)

CMS Current Status
- First public release of CMS products (COBRA etc.) based on POOL 1.3.x available on 30 Sep 2003
- Used in production, deployed to physicists
  - e.g. 2 million events produced with OSCAR (G4 simulation) in a weekend
- New tutorials (just started) based on software released against POOL 1.3.3
- Essentially the same functionality as the Objectivity-based code, but:
  - No concurrent update of databases
  - No direct connection to a central database while running
  - Remote access limited to RFIO or dCache
  - No schema evolution
- Still a few bugs, missing features and performance problems
  - These affect more complex use cases and impact the deployment to a large developer/user community

LHCb Strategy
- Full LHCb data model in POOL: complete write/read of an LHCb event from GAUDI achieved in December
- Adiabatic adaptation of Gaudi to SEAL/POOL
  - Slow integration according to available manpower (1 year for full migration)
  - Take advantage of face-lifting "bad" interfaces and implementations
  - Minimal changes to the interfaces visible to physicists
- Integration of SEAL in steps
  - Dictionary integration and plug-in manager first; use of SEAL services later
- Integration of POOL earlier than SEAL
  - Working prototype by the end of the year
- Necessity to read "old" ROOT data
- Keep the LHCb event model unchanged

ATLAS SEAL/POOL Status
- Required functionality is essentially in place
  - The main outstanding problem is that of the CLHEP matrix class
- ATLAS has not yet tested the EDG RLS-based file catalog
- POOL explicit and implicit collections are available via Athena, but fuller integration will require extensions from POOL (and refactoring on the ATLAS side)
- We're moving from an expert to a general developer environment
- The event data model is being filled out
  - Goal is to have it essentially complete by the end of the year, although parts of the transient model are still being redesigned
- Python scripting support is being incorporated now
  - PyROOT, PyLCGDict, GaudiPython, etc.

Summary
- The LCG POOL project provides a hybrid store integrating object streaming (e.g. Root I/O) with RDBMS technology (e.g. MySQL) for consistent meta data handling
- Strong emphasis on component decoupling and well-defined communication/dependencies
- Transparent cross-file and cross-technology object navigation via C++ smart pointers
- Integration with Grid-wide file catalogs (e.g. EDG-RLS), while preserving networked and grid-decoupled working modes
- Decoupling of the logical and physical data model: no file or host names in user code
- POOL has been integrated into the LHC experiments' software frameworks and is used for production activities in CMS
  - Persistency mechanism for the CMS, ATLAS and LHCb data challenges

How to find out more about POOL?
- POOL home page: http://pool.cern.ch/
- POOL Savannah web portal (bug reports, cvs): http://savannah.cern.ch/projects/pool
- POOL binary distribution (provided by LCG-SPI): http://lcgapp.cern.ch/project/spi/lcgsoft

The end. Thank you!