POOL Project Overview
Dirk Düllmann
CERN Openlab storage workshop, 17th March 2003
What is POOL?
- POOL is the LCG Persistency Framework
  - Pool of persistent objects for LHC
- Started by LCG-SC2 in April ’02
- Common effort in which the experiments take over a major share of the responsibility
  - for defining the system architecture
  - for development of POOL components
- Ramping up over the last year from 1.5 to ~10 FTE
POOL and the LCG Architecture Blueprint
- POOL is a component-based system
  - A technology-neutral API
    - Abstract C++ interfaces
  - Implemented reusing existing technology
    - ROOT I/O for object streaming: complex data, simple consistency model (write once)
    - RDBMS for consistent meta data handling: simple data, transactional consistency
- POOL does not replace any of its component technologies
  - It integrates them to provide higher-level services
  - It insulates physics applications from implementation details of the components and technologies used today
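The idea of a technology-neutral API behind abstract C++ interfaces can be illustrated with a minimal sketch. Everything named here (`IStore`, `MapStore`, `WriteOnceStore`, `UpdatableStore`, `describe`) is hypothetical and invented for illustration; it is not the POOL API. The two in-memory back ends only stand in for a write-once streaming layer and an updatable meta data store.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical interface for illustration; not part of the POOL API.
class IStore {
public:
    virtual ~IStore() = default;
    // Returns true if the value was stored.
    virtual bool put(const std::string& key, const std::string& value) = 0;
    // Returns true and fills `value` if the key is known.
    virtual bool get(const std::string& key, std::string& value) const = 0;
};

// Shared lookup code for the two map-backed stand-ins below.
class MapStore : public IStore {
public:
    bool get(const std::string& key, std::string& value) const override {
        auto it = data_.find(key);
        if (it == data_.end()) return false;
        value = it->second;
        return true;
    }
protected:
    std::map<std::string, std::string> data_;
};

// Write-once semantics: an existing key is never overwritten.
class WriteOnceStore : public MapStore {
public:
    bool put(const std::string& key, const std::string& value) override {
        return data_.emplace(key, value).second;
    }
};

// Updatable semantics, standing in for meta data kept in an RDBMS.
class UpdatableStore : public MapStore {
public:
    bool put(const std::string& key, const std::string& value) override {
        data_[key] = value;
        return true;
    }
};

// Client code depends only on the abstract interface.
inline std::string describe(const IStore& store, const std::string& key) {
    std::string value;
    return store.get(key, value) ? value : std::string("<missing>");
}
```

Because `describe` takes an `IStore`, swapping one back end for the other requires no change in client code, which is the kind of insulation the slide describes.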
POOL as an LCG component
- Persistency is just one of several projects in the LCG Applications Area
  - Sharing a common architecture and s/w process
  - as described in the Blueprint and Persistency RTAG documents
- Persistency is important...
  - ...but not important enough to allow for uncontrolled direct dependencies
  - eg of experiment code on its implementation
- Common effort in which the experiments take over a major share of the responsibility
  - for defining the overall and detailed architecture
  - for development of POOL components
LCG Blueprint Software Decomposition
POOL Work Package breakdown
- Based on outcome of SC2 persistency RTAG
- File Catalog
  - keep track of files (and their physical and logical names) and their description
  - resolve a logical file reference (FileID) into a physical file
  - pool::IFileCatalog
- Collections
  - keep track of (large) object collections and their description
  - pool::Collection<T>
- Storage Service
  - stream transient C++ objects into/from storage
  - resolve a logical object reference into a physical object
- Object Cache (DataService)
  - keep track of already read objects to speed up repeated access to the same data
  - pool::IDataSvc and pool::Ref<T>
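The two resolution steps listed above (FileID to physical file, then logical object reference to a physical object) can be sketched as follows. The types `ObjectToken` and `SimpleCatalog`, the function `resolve`, and the `file#container[entry]` output format are all assumptions made for this illustration; they do not reproduce `pool::IFileCatalog` or any other POOL interface.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical types for illustration; not the pool::IFileCatalog interface.
using FileID = std::string;

// A logical object reference: which file, which container, which entry.
struct ObjectToken {
    FileID fileId;
    std::string container;
    long entry;
};

class SimpleCatalog {
public:
    // Record the physical file name (PFN) for a FileID.
    void registerFile(const FileID& id, const std::string& pfn) {
        pfnById_[id] = pfn;
    }
    // Returns true and fills `pfn` if the FileID is known.
    bool lookup(const FileID& id, std::string& pfn) const {
        auto it = pfnById_.find(id);
        if (it == pfnById_.end()) return false;
        pfn = it->second;
        return true;
    }
private:
    std::map<FileID, std::string> pfnById_;
};

// Resolve a logical object reference into a physical location string.
// Returns an empty string when the file is not in the catalog.
inline std::string resolve(const SimpleCatalog& catalog, const ObjectToken& token) {
    std::string pfn;
    if (!catalog.lookup(token.fileId, pfn)) return std::string();
    return pfn + "#" + token.container + "[" + std::to_string(token.entry) + "]";
}
```

In this sketch the catalog knows nothing about containers or entries; it only maps file identifiers, which mirrors the split between the File Catalog and the Storage Service.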
POOL Internal Organisation
POOL and the GRID
- GRID mostly deals with data of file level granularity
  - File Catalog connects POOL to Grid Resources
    - eg via our EDG-RLS backend
  - POOL Storage Service deals with intra-file structure
    - need connection via standard Grid File access
- Both File and Object based Collections are seen as important End User concepts
  - POOL offers a consistent interface to both types
  - Need to understand to what extent these can be provided in a Grid environment
How does POOL fit into the environment?
- POOL will be mainly used from experiment frameworks
  - mostly as a client library loaded from the user application
- Production Manager
  - Creates and maintains shared file catalogs and (event) collections
  - eg add the catalog fragment for the new simulation data to the published analysis catalog
- End User
  - Uses shared collections
  - eg iterate over collection X
[Diagram labels: POOL client on a CPU Node (User Application, Experiment Framework, POOL); Exp. DB Services (Book Keeping, Production Workflow); RDBMS Services (Collection Description? Collection Location? Collection Access); remote access via ROOT I/O; Grid (File) Services (Replica Location, File Description, Remote File I/O?)]
POOL File Catalog
- POOL uses a GUID implementation for FileID
  - unique and immutable identifier for a file (generated at create time)
  - allows producing sets of files with internal references without requiring a central ID allocation service
  - catalog fragments created independently can later be merged without modification to data files
- Object lookup is based only on the right side box of the diagram!
- Logical filenames are supported but not required
[Diagram labels: Logical Naming, Object Lookup]
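Why globally unique FileIDs make independent catalog fragments mergeable can be shown with a small sketch. The `FileEntry` and `Catalog` types and the `mergeFragment` function are hypothetical, and the string keys used with them are placeholders rather than real GUIDs.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical catalog entry; the logical file name is optional and may be empty.
struct FileEntry {
    std::string pfn;
    std::string lfn;
};

// Catalog keyed by FileID. Keys are placeholders standing in for GUIDs.
using Catalog = std::map<std::string, FileEntry>;

// Merge a fragment into a target catalog and return the number of new entries.
// Because FileIDs are assumed unique, entries never need to be renumbered,
// and an ID that is already present is left untouched.
inline std::size_t mergeFragment(Catalog& target, const Catalog& fragment) {
    std::size_t added = 0;
    for (const auto& item : fragment) {
        if (target.insert(item).second) ++added;
    }
    return added;
}
```

Merging only touches catalog entries. References stored inside data files use the same immutable FileIDs, so the files themselves stay unmodified, and merging the same fragment twice adds nothing.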
Use Case: Working in Isolation
- The user extracts a set of interesting files and a catalog fragment describing them from a (central) grid based catalog into a local (eg XML based) catalog.
  - Selection is performed based on file or collection descriptions
- After disconnecting from the grid the user executes some standard jobs navigating through the extracted data.
  - New output files are registered into the local catalog
- Once the new data is ready for publishing and the user is connected, the new catalog fragment is submitted to the grid based catalog.
[Diagram: three stages (Extraction, Local Processing, Result Publishing); elements: File Catalog & Descr, Grid File Storage, Local File Catalog, Local Files, New Catalog & Descr, New Files]
Use Case: Farm Production
- Production manager may pre-register output files with the catalog (eg a “local” MySQL or XML catalog)
  - File ID, physical filename, job ID and optionally also logical filenames
- A production job runs and creates files and their catalog entries locally.
- During the production the catalog can be used to clean up files (and their registration) from unsuccessful jobs based on their associated job ID.
- Once the data quality checks have been passed the production manager decides to publish the production catalog fragment to the grid based catalog.
[Diagram labels: Production Node 1, Production Node 2, Production Node n; Local File Catalog; Local Files; Post Processing; New Files; New Catalog & Descr; Result Publishing; Grid Catalog; File Catalog & Descr; Grid File Storage]
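The job-ID based cleanup described above can be sketched as follows. `ProductionEntry` and `ProductionCatalog` are hypothetical names for this illustration only. The sketch removes registrations and returns the affected physical file names; deleting the files themselves is left to the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical pre-registered entry: FileID, physical file name and job ID.
struct ProductionEntry {
    std::string fileId;
    std::string pfn;
    int jobId;
};

class ProductionCatalog {
public:
    // Register an output file before or while the job runs.
    void preRegister(const std::string& fileId, const std::string& pfn, int jobId) {
        entries_.push_back(ProductionEntry{fileId, pfn, jobId});
    }

    // Drop every registration made for `jobId` and return the physical
    // file names that the caller should delete from disk.
    std::vector<std::string> removeJob(int jobId) {
        std::vector<std::string> removed;
        std::vector<ProductionEntry> kept;
        for (const auto& entry : entries_) {
            if (entry.jobId == jobId) {
                removed.push_back(entry.pfn);
            } else {
                kept.push_back(entry);
            }
        }
        entries_.swap(kept);
        return removed;
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::vector<ProductionEntry> entries_;
};
```

Keeping the job ID next to each registration is what makes the cleanup a simple filter: no knowledge of the file contents is needed to undo a failed job.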
POOL Storage Hierarchy
- An application may access databases (eg ROOT files) from a set of catalogs
- Each database has containers of one specific technology (eg ROOT trees)
- Smart Pointers are used
  - to transparently load objects into a client side cache
  - to define object associations across file or technology boundaries
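How a smart pointer can load an object on first use and share it through a client side cache is sketched below. `DataCache`, `LazyRef` and `Track` are hypothetical stand-ins and do not reproduce `pool::Ref<T>` or `pool::IDataSvc`. The loader callback is assumed to perform the actual read and to never return a null pointer.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical client side cache keyed by a string token.
template <typename T>
class DataCache {
public:
    using Loader = std::function<std::shared_ptr<T>(const std::string&)>;

    explicit DataCache(Loader loader) : loader_(std::move(loader)) {}

    // Return the cached object, calling the loader only on a cache miss.
    std::shared_ptr<T> get(const std::string& token) {
        auto it = cache_.find(token);
        if (it != cache_.end()) return it->second;
        ++loads_;
        std::shared_ptr<T> object = loader_(token);
        cache_[token] = object;
        return object;
    }

    int loads() const { return loads_; }

private:
    Loader loader_;
    std::map<std::string, std::shared_ptr<T>> cache_;
    int loads_ = 0;
};

// Hypothetical smart pointer: holds only a token until first dereference.
template <typename T>
class LazyRef {
public:
    LazyRef(DataCache<T>& cache, std::string token)
        : cache_(&cache), token_(std::move(token)) {}

    T& operator*() { return *load(); }
    T* operator->() { return load().get(); }

private:
    // Assumes the loader never returns a null pointer.
    std::shared_ptr<T> load() {
        if (!object_) object_ = cache_->get(token_);
        return object_;
    }

    DataCache<T>* cache_;
    std::string token_;
    std::shared_ptr<T> object_;
};

// Example payload type.
struct Track {
    double pt;
};
```

Constructing a `LazyRef` triggers no I/O. Two references with the same token share one cached object after the first dereference, so repeated access to the same data costs a single load.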
Client Data Access
[Diagram labels: Client, Ref<T>, Data Service, Data Cache]
Dictionary: Population/Conversion
[Diagram labels: Dictionary Generation (.h, ROOTCINT, CINT dictionary code; .xml, Code Generator, GCC-XML, LCG dictionary code); I/O (CINT dictionary, Data I/O); LCG dictionary; Gateway; Reflection; Other Clients]
Project Status & Plans
- First four POOL releases delivered planned functionality on time
  - Aggressive schedule so far, focusing on adding functionality
  - no consistent attempt at performance optimisation yet
- Functionally complete (LCG-1 feature set) POOL V1.0 release scheduled for April
  - several functional extensions compared to V0.4
  - automated system tests are being
- Bug fix and performance release POOL V1.1 in June
  - Aim to be ready for first deployment together with LCG-1 environment
  - Will release
- Work on a proof of concept storage service re-implementation based on an RDBMS back end is starting
Summary
- The LCG POOL project provides a hybrid store integrating object streaming (eg ROOT I/O) with RDBMS technology (eg MySQL) for consistent meta data handling
- Strong emphasis on component decoupling and well defined communication/dependencies
- Transparent cross-file and cross-technology object navigation via C++ smart pointers
- Integration with Grid technology (via EDG-RLS) but preserving networked and grid-decoupled working modes
- Next two releases (V1.0: functionality, V1.1: reliability & performance) will be crucial for POOL acceptance
  - Need tight coupling with experiment development and production teams to validate the feature set
  - Assume tight integration with LCG deployment activities
How to find out more about POOL?
- POOL Home Page: http://lcgapp.cern.ch/project/persist/
- POOL savannah portal: http://lcgappdev.cern.ch/savannah/projects/pool