DIRAC Data Management: consistency, integrity and coherence of data

DIRAC Data Management: consistency, integrity and coherence of data
Marianne Bargiotti, CERN
CHEP07, 2-7 September 2007, Victoria, BC

Outline
- DIRAC Data Management System (DMS)
- LHCb catalogues and physical storage
- DMS integrity checking
- Conclusions

DIRAC Data Management System
- The DIRAC project (Distributed Infrastructure with Remote Agent Control) is the LHCb Workload and Data Management System
  - DIRAC architecture based on Services and Agents (see A. Tsaregorodtsev poster [189])
- The DIRAC Data Management System deals with three components:
  - File Catalogue: records where files are stored
  - Bookkeeping Meta Data DB (BK): records what the files contain
  - Storage Elements: underlying Grid Storage Elements (SE) where files are stored
- Consistency between these catalogues and the Storage Elements is fundamental for reliable Data Management (see A. C. Smith poster [195])

LHCb File Catalogue
- LHCb choice: LCG File Catalogue (LFC)
  - allows registering and retrieving the locations of physical replicas in the grid infrastructure
  - stores file information (LFN, size, GUID) and replica information
- The DIRAC WMS uses LFC information to decide where jobs can be scheduled
- Fundamental to avoid any kind of inconsistency both with the storages and with the related catalogues (BK Meta Data DB)
- Baseline choice for DIRAC: central LFC
  - one single master (read/write) and many read-only mirrors
  - coherence ensured by a single write endpoint

Registration of replicas
- GUID check: before registration in the LCG File Catalogue, at the beginning of the transfer phase, the GUID of the file to be transferred is checked to avoid GUID mismatches at registration time
- After a successful transfer, LFC registration of a file is divided into two atomic operations:
  - booking of the metadata fields, i.e. insertion of LFN, GUID and size into the dedicated table
  - replica registration
- If either step fails, it is a possible source of errors and inconsistencies, e.g. the file is registered without any replica or with zero size
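Schematically, the failure window comes from the metadata booking and the replica registration being separate operations. The sketch below illustrates the idea with a plain Python dictionary standing in for the LFC tables; the class and method names are invented for illustration and are not the DIRAC or LFC client API.

```python
import uuid

class RegistrationError(Exception):
    pass

class LFCRegistrar:
    """Illustrative two-step registration; the dict is a stand-in for the LFC tables."""

    def __init__(self):
        self.catalogue = {}   # guid -> {"lfn": ..., "size": ..., "replicas": [...]}

    def register_file(self, lfn, guid, size, surl, se):
        # GUID check (done before the transfer): refuse a GUID already bound to another LFN
        existing = self.catalogue.get(guid)
        if existing is not None and existing["lfn"] != lfn:
            raise RegistrationError("GUID %s already registered for %s" % (guid, existing["lfn"]))

        # Step 1: book the metadata fields (lfn, guid, size)
        self.catalogue[guid] = {"lfn": lfn, "size": size, "replicas": []}

        # Step 2: register the replica. If this step fails (service glitch, crash),
        # the entry above is left with zero replicas -- one of the pathologies
        # the integrity agents later look for.
        self.catalogue[guid]["replicas"].append({"surl": surl, "se": se})

# usage (LFN, SURL and SE name are invented examples)
registrar = LFCRegistrar()
registrar.register_file(
    lfn="/lhcb/production/DC06/0000/00001234_00000001_1.digi",
    guid=str(uuid.uuid4()),
    size=1234567,
    surl="srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/production/DC06/0000/00001234_00000001_1.digi",
    se="CERN-tape",
)
print(registrar.catalogue)
```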

LHCb Bookkeeping Meta Data DB
- The Bookkeeping (BK) is the system that stores data provenance information. It contains information about jobs and files and their relations:
  - Job: application name, application version, application parameters, which files it generated, etc.
  - File: size, events, filename, GUID, which job generated it, etc.
- The Bookkeeping DB is the main gateway for users to select the available data and datasets
  - all data visible to users are flagged as 'Has replica'
- All data stored in the BK and flagged as having a replica must be correctly registered and available in the LFC

Storage Elements
- SRM is the standard interface to grid storage
  - LHCb has 14 SRM endpoints: disk and tape storage for each Tier-1 site
  - SRM will allow browsing the storage namespace (since SRM v2.2)
- The DIRAC Storage Element Client provides uniform access to Grid Storage Elements
  - implemented with plug-in modules for access protocols: srm, gridftp, bbftp, sftp, http supported
- Functionality is exposed to users through the GFAL library API
  - the Python binding of the GFAL library is used to develop the DIRAC tools
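The plug-in scheme can be pictured as a dispatch from the SURL protocol to a protocol-specific access class. The following sketch is purely illustrative: the class names and methods are assumptions, not the actual DIRAC StorageElement interface, and the plug-ins are stubbed rather than calling GFAL.

```python
class StoragePlugin:
    """Base class for access-protocol plug-ins (names are illustrative)."""
    def exists(self, path):
        raise NotImplementedError
    def get_size(self, path):
        raise NotImplementedError

class SRMStoragePlugin(StoragePlugin):
    # a real plug-in would go through the SRM interface (e.g. the GFAL Python binding)
    def exists(self, path):
        return True
    def get_size(self, path):
        return 0

class GridFTPStoragePlugin(StoragePlugin):
    def exists(self, path):
        return True
    def get_size(self, path):
        return 0

# mapping from URL scheme to plug-in class
PROTOCOL_PLUGINS = {"srm": SRMStoragePlugin, "gsiftp": GridFTPStoragePlugin}

class StorageElementClient:
    """Uniform front end: selects a plug-in from the protocol of the SURL."""
    def plugin_for(self, surl):
        protocol = surl.split("://", 1)[0]
        if protocol not in PROTOCOL_PLUGINS:
            raise ValueError("unsupported access protocol: %s" % protocol)
        return PROTOCOL_PLUGINS[protocol]()

# usage
client = StorageElementClient()
plugin = client.plugin_for("srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/f1.digi")
print(plugin.exists("/castor/cern.ch/grid/lhcb/f1.digi"))
```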

Data integrity checks
- Given the high number of interactions among DM system components, integrity checking is an integral part of the DIRAC Data Management system
- Two ways of performing checks:
  - those running as Agents within the DIRAC framework
  - those launched by the Data Manager to address specific situations
- The Agent checks can be broken into two further distinct types:
  - those based solely on the information found in the SE/LFC/BK: BK->LFC, LFC->SE, SE->LFC, Storage Usage Agent
  - those based on a priori knowledge of where files should exist according to the Computing Model, e.g. DSTs are always present on disk at all Tier-1s

DMS Integrity Agents overview
- The complete suite for integrity checking includes an assortment of agents:
  - agents providing independent integrity checks on catalogues and storages, reporting to the IntegrityDB
  - a further agent (the Data Integrity Agent) processes, where possible, the files recorded in the IntegrityDB by correcting, registering or replicating files as needed
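A minimal skeleton of such an agent, assuming a simple IntegrityDB interface that just collects (LFN, pathology) records; both classes are stand-ins, not the real DIRAC components.

```python
import time

class IntegrityDB:
    """Stand-in for the central IntegrityDB: holds (lfn, pathology) records."""
    def __init__(self):
        self.records = []
    def report(self, lfn, pathology):
        self.records.append((lfn, pathology))

class IntegrityAgent:
    """Generic agent loop: run a check, push findings to the IntegrityDB, sleep, repeat."""
    def __init__(self, db, polling_time=3600):
        self.db = db
        self.polling_time = polling_time

    def check(self):
        """Overridden by each concrete agent (BK->LFC, LFC->SE, SE->LFC, Storage Usage)."""
        return []   # list of (lfn, pathology) pairs

    def run_once(self):
        for lfn, pathology in self.check():
            self.db.report(lfn, pathology)

    def run(self):
        while True:
            self.run_once()
            time.sleep(self.polling_time)

# usage: a dummy agent that always flags one (invented) file
class DummyAgent(IntegrityAgent):
    def check(self):
        return [("/lhcb/production/DC06/0000/f1.digi", "missing_in_lfc")]

db = IntegrityDB()
DummyAgent(db).run_once()
print(db.records)
```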

Data integrity checks & DM Console
- The Data Management Console is the interface for the Data Manager
  - the DM Console allows data integrity checks to be launched
- The development of these tools has been driven by experience:
  - many catalogue operations (fixes)
  - bulk extraction of replica information
  - deletion of replicas by site
  - extraction of replicas through LFC directories
  - change of a replica's SE name in the catalogue
  - creation of bulk transfer/removal jobs

BK - LFC Consistency Agent
- Main problem affecting the BK: many LFNs registered in the BK failed to be registered in the LFC
  - missing files in the LFC: users selecting LFNs in the BK cannot find any replica in the LFC
  - possible causes: registration in the LFC failing due to a failed copy, temporary unavailability of the service, ...
- The BK->LFC agent performs a massive check on productions, checking BK dumps of the different productions against the same directories in the LFC. For each production:
  - check the existence of each BK entry in the LFC
  - check the file sizes
- In case of missing or problematic files, it reports to the IntegrityDB
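Conceptually, the check is a comparison of the BK dump of a production against the corresponding LFC directory dump, entry by entry and size by size. A sketch, assuming both dumps are available as {LFN: size} mappings (the function name and data layout are illustrative):

```python
def check_bk_against_lfc(bk_dump, lfc_dump):
    """bk_dump / lfc_dump: {lfn: size_in_bytes} for one production directory.
    Returns the list of problematic entries to be pushed to the IntegrityDB."""
    problems = []
    for lfn, bk_size in sorted(bk_dump.items()):
        if lfn not in lfc_dump:
            problems.append((lfn, "missing_in_lfc", ""))
        elif lfc_dump[lfn] != bk_size:
            problems.append((lfn, "size_mismatch",
                             "BK=%d LFC=%d" % (bk_size, lfc_dump[lfn])))
    return problems

# usage with invented LFNs and sizes
bk  = {"/lhcb/production/DC06/0000/f1.digi": 100,
       "/lhcb/production/DC06/0000/f2.digi": 200}
lfc = {"/lhcb/production/DC06/0000/f1.digi": 100}
print(check_bk_against_lfc(bk, lfc))
# [('/lhcb/production/DC06/0000/f2.digi', 'missing_in_lfc', '')]
```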

LFC Pathologies
- Many different inconsistencies can arise in a complex computing model:
  - zero-size files: file metadata registered in the LFC but with missing size information (set to 0)
  - missing replica information: missing replica field in the Replica Information Table of the DB
  - wrong SAPath (bugs in an old DIRAC version, now fixed), e.g. srm://gridka-dCache.fzk.de:8443/castor/cern.ch/grid/lhcb/production/DC06/v1-lumi2/00001354/DIGI/0000/00001354_00000027_9.digi registered at GRIDKA-tape
  - wrong SE host: e.g. CERN_Castor, wrong information in the LHCb Configuration Service
  - wrong protocol: sfn, rfio, bbftp, ...
  - mistakes in file registration: blank spaces in the SURL path, carriage returns, presence of a port number in the SURL path, ...
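Several of these pathologies can be spotted mechanically from the registered SURL and size alone. The hypothetical classifier below makes the checks concrete; the pathology labels, the accepted-protocol list and the function name are assumptions for illustration.

```python
import re

# protocols accepted for registered replicas (assumption for this sketch)
VALID_PROTOCOLS = ("srm",)

def classify_replica(surl, size, se_hosts):
    """Return the list of pathology labels for one registered replica.
    se_hosts: set of valid SE host names, as known to the Configuration Service."""
    problems = []
    if size == 0:
        problems.append("zero_size")
    if not surl:
        problems.append("missing_replica_info")
        return problems
    if surl != surl.strip() or " " in surl or "\r" in surl:
        problems.append("registration_mistake")       # blank spaces, carriage returns
    match = re.match(r"^(\w+)://([^:/]+)(:\d+)?(/.*)$", surl.strip())
    if match is None:
        problems.append("unparsable_surl")
        return problems
    protocol, host, port, _path = match.groups()
    if protocol not in VALID_PROTOCOLS:
        problems.append("wrong_protocol")              # sfn, rfio, bbftp, ...
    if port is not None:
        problems.append("port_in_surl")
    if host not in se_hosts:
        problems.append("wrong_se_host")
    return problems

# usage with an invented bad replica
se_hosts = {"srm-lhcb.cern.ch", "gridka-dCache.fzk.de"}
print(classify_replica("rfio://castorlhcb.cern.ch//castor/cern.ch/grid/lhcb/f1.digi",
                       0, se_hosts))
# ['zero_size', 'wrong_protocol', 'wrong_se_host']
```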

LFC – SE Consistency Agent
- LFC replicas need perfect coherence with the storage replicas in path, protocol and size:
- Replication issues:
  - check whether the LFC replicas are really resident on the physical storages (check the existence and size of the files)
  - if files do not exist, they are recorded as such in the IntegrityDB
- Registration issues: the LFC->SE agent stores problematic files in the central IntegrityDB according to the different pathologies:
  - zero-size files
  - missing replica information
  - wrong SAPath
  - wrong protocol
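The replication check reduces to: for every LFC replica, ask the storage whether the file exists and whether the sizes match. A sketch, with a `storage_stat` callable standing in for the real SRM/GFAL query (all names and paths invented):

```python
def check_lfc_against_se(replicas, storage_stat):
    """replicas: iterable of (lfn, surl, lfc_size).
    storage_stat(surl) -> size in bytes, or None if the file is not on storage
    (placeholder for the real SRM/GFAL call).
    Returns the pathologies to record in the IntegrityDB."""
    problems = []
    for lfn, surl, lfc_size in replicas:
        se_size = storage_stat(surl)
        if se_size is None:
            problems.append((lfn, "missing_on_storage"))
        elif se_size != lfc_size:
            problems.append((lfn, "size_mismatch"))
    return problems

# usage with a fake storage containing only one of the two files
fake_storage = {"srm://srm-lhcb.cern.ch/lhcb/f1.digi": 100}
replicas = [("/lhcb/f1.digi", "srm://srm-lhcb.cern.ch/lhcb/f1.digi", 100),
            ("/lhcb/f2.digi", "srm://srm-lhcb.cern.ch/lhcb/f2.digi", 200)]
print(check_lfc_against_se(replicas, fake_storage.get))
# [('/lhcb/f2.digi', 'missing_on_storage')]
```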

SE – LFC Consistency Agent
- Checks the SE contents against the LCG File Catalogue:
  - lists the contents of the SE
  - checks the catalogue for the corresponding replicas
  - if files are missing (due to any kind of incorrect registration), they are recorded as such in the IntegrityDB
- Missing an efficient storage interface for bulk metadata queries (directory listings)
  - not possible to list the contents of remote directories and get the associated metadata (lcg-ls)
  - further implementations to be put in place with SRM v2

Storage Usage Agent
- Using the replicas and sizes registered in the LFC, this agent constructs an exhaustive picture of current LHCb storage usage:
  - works through a breakdown by directories
  - loops over the LFC, extracting file sizes per storage
  - stores the information in the central IntegrityDB
  - produces a full picture of disk and tape occupancy on each storage
  - provides an up-to-date picture of LHCb's resource usage in near real time
- Foreseen development: use the LFC accounting interface to obtain a global picture per site
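The accounting itself is a group-by over the replica information: sum file counts and sizes per (directory, storage element). A sketch, assuming the LFC dump is available as (LFN, SE, size) tuples (names invented):

```python
import os
from collections import defaultdict

def storage_usage(replicas):
    """replicas: iterable of (lfn, se_name, size_in_bytes).
    Returns {(directory, se_name): (number_of_files, total_size)}."""
    usage = defaultdict(lambda: [0, 0])
    for lfn, se, size in replicas:
        key = (os.path.dirname(lfn), se)
        usage[key][0] += 1
        usage[key][1] += size
    return {key: tuple(counts) for key, counts in usage.items()}

# usage with invented replicas
replicas = [("/lhcb/production/DC06/0000/f1.digi", "CERN-tape", 100),
            ("/lhcb/production/DC06/0000/f2.digi", "CERN-tape", 200),
            ("/lhcb/production/DC06/0000/f1.digi", "GRIDKA-disk", 100)]
for (directory, se), (nfiles, total) in sorted(storage_usage(replicas).items()):
    print(directory, se, nfiles, total)
```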

Data Integrity Agent
- The Data Integrity Agent takes action on the wide range of pathologies stored by the other agents in the IntegrityDB. Actions taken:
- LFC – SE: in case of a replica missing from the LFC:
  - produce SURL paths from the LFN, according to the DIRAC Configuration System, for all the defined Storage Elements; extensive search throughout all Tier-1 SEs
  - if the search is successful, register the missing replicas
  - same action in case of zero-size files, wrong SAPath, ...
- BK – LFC: if a file is not present in the LFC:
  - extensive search performed on all SEs
  - if the file is not found anywhere: removal of the 'Has replica' flag, so it is no longer visible to users
  - if the file is found: update of the LFC with the missing file information extracted from the storages
- SE – LFC: files missing from the catalogue can be:
  - registered in the catalogue if the LFN is present
  - deleted from the SE if the LFN is missing from the catalogue
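The agent can be seen as a dispatch from pathology to corrective action. The schematic version below stubs out the storage lookup and only returns the action that would be taken; the Configuration System prefixes, pathology names and helper functions are invented for illustration.

```python
def build_surl(lfn, se, cs_prefixes):
    """Construct a candidate SURL from an LFN using Configuration-System-style
    prefixes (the prefixes used below are invented)."""
    return cs_prefixes[se] + lfn

def fix_missing_lfc_replica(lfn, cs_prefixes, surl_exists):
    """LFC-SE case: search all configured SEs for the file, register if found."""
    for se in cs_prefixes:
        surl = build_surl(lfn, se, cs_prefixes)
        if surl_exists(surl):
            return "register replica %s (%s) in the LFC" % (surl, se)
    return "not found on any SE -> leave the record for the Data Manager"

def fix_missing_bk_file(lfn, cs_prefixes, surl_exists):
    """BK-LFC case: update the LFC if found anywhere, otherwise unset 'Has replica'."""
    for se in cs_prefixes:
        if surl_exists(build_surl(lfn, se, cs_prefixes)):
            return "update the LFC with file info extracted from %s" % se
    return "unset 'Has replica' flag in the BK: file no longer visible to users"

ACTIONS = {"missing_replica_in_lfc": fix_missing_lfc_replica,
           "missing_file_in_lfc": fix_missing_bk_file}

# usage (prefixes and the fake storage content are invented)
cs_prefixes = {"CERN-tape": "srm://srm-lhcb.cern.ch/castor/cern.ch/grid",
               "GRIDKA-tape": "srm://gridka-dCache.fzk.de/pnfs/gridka.de/lhcb"}
on_storage = {"srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/prod/f1.digi"}
action = ACTIONS["missing_replica_in_lfc"]
print(action("/lhcb/prod/f1.digi", cs_prefixes, on_storage.__contains__))
```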

Prevention of Inconsistencies
- Failover mechanism: each operation that can fail is wrapped in an XML record as a request, which can be stored in a Request DB
  - Request DBs sit on one of the LHCb VO Boxes, which ensures that these records are never lost
  - these requests are executed by dedicated agents running on the VO Boxes and are retried as many times as needed until they succeed
  - examples: file registration operations, data transfer operations, BK registration, ...
- Many other internal checks are implemented within the DIRAC system to avoid data inconsistencies as much as possible, for example checks on file transfers based on file size or checksum
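The failover idea can be sketched as: serialise the failed operation into an XML record, store it in a Request DB on the VO Box, and let an agent replay it until it succeeds. The XML element names and the classes below are illustrative, not the actual DIRAC request schema.

```python
import xml.etree.ElementTree as ET

def make_request(operation, **params):
    """Wrap a failed operation into an XML record (illustrative schema)."""
    root = ET.Element("request", {"operation": operation, "status": "waiting"})
    for name, value in sorted(params.items()):
        ET.SubElement(root, "parameter", {"name": name, "value": str(value)})
    return ET.tostring(root)

class RequestDB:
    """Stand-in for the Request DB hosted on the VO Box."""
    def __init__(self):
        self.requests = []
    def put(self, xml_record):
        self.requests.append(xml_record)
    def pending(self):
        return list(self.requests)

def retry_pending(db, execute):
    """Agent side: replay every pending request; keep it if it fails again."""
    still_failing = [record for record in db.pending() if not execute(record)]
    db.requests = still_failing

# usage: a registration request that keeps failing stays in the DB for the next cycle
db = RequestDB()
db.put(make_request("registerFile",
                    lfn="/lhcb/production/DC06/0000/f1.digi",
                    se="CERN-tape"))
retry_pending(db, execute=lambda record: False)   # execution failed -> retried later
print(len(db.pending()))                          # 1, still waiting
```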

Conclusions
- The integrity check suite is an important part of the Data Management activity
- Further development will be possible with SRM v2 (SE vs LFC agent)
- Most of the effort now goes into the prevention of inconsistencies (checksums, failover mechanisms, ...)
- Final target: minimising the number of occurrences of frustrated users looking for non-existent data

Backup

DIRAC Architecture
- DIRAC (Distributed Infrastructure with Remote Agent Control) is LHCb's grid project
- The DIRAC architecture is split into three main component types:
  - Services: independent functionalities deployed and administered centrally on machines accessible by all other DIRAC components
  - Resources: Grid compute and storage resources at remote sites
  - Agents: lightweight software components that request jobs from the central Services for a specific purpose
- The DIRAC Data Management System is made up of an assortment of these components

DIRAC DM System
Main components of the DIRAC Data Management System:
- Storage Element: abstraction of Grid storage resources
  - the Grid SE (also called Storage Element) is the underlying resource used
  - actual access through specific plug-ins: srm, gridftp, bbftp, sftp, http supported
  - namespace management, file upload/download, deletion, etc.
- Replica Manager: provides an API for the available data management operations
  - point of contact for users of the data management system
  - removes direct interaction with the Storage Elements and File Catalogs
  - uploading/downloading files to/from a Grid SE, replication of files, file registration, file removal
- File Catalog: a standard API exposed for a variety of available catalogues
  - allows redundancy across several catalogues
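The role of the Replica Manager as single point of contact can be pictured as a thin facade over the File Catalog and Storage Element abstractions. The interface below is a simplified illustration with stubbed components, not the real DIRAC API.

```python
class FileCatalogClient:
    """Stand-in for any of the supported catalogues (LFC, ...)."""
    def __init__(self):
        self.entries = {}
    def add_file(self, lfn, size, surl, se):
        self.entries.setdefault(lfn, {"size": size, "replicas": []})
        self.entries[lfn]["replicas"].append((surl, se))
    def remove_file(self, lfn):
        self.entries.pop(lfn, None)

class StorageElementStub:
    """Stand-in for the DIRAC Storage Element client."""
    def __init__(self, name, prefix):
        self.name, self.prefix = name, prefix
    def put_file(self, local_path, lfn):
        return self.prefix + lfn   # pretend the upload succeeded, return the SURL

class ReplicaManager:
    """Single point of contact: upload to an SE and register in the catalogue(s)."""
    def __init__(self, catalogs, storage_elements):
        self.catalogs = catalogs
        self.storage_elements = storage_elements

    def put_and_register(self, local_path, lfn, size, se_name):
        se = self.storage_elements[se_name]
        surl = se.put_file(local_path, lfn)
        for catalog in self.catalogs:          # redundancy across catalogues
            catalog.add_file(lfn, size, surl, se_name)
        return surl

# usage (paths, SE name and prefix are invented)
catalog = FileCatalogClient()
se = StorageElementStub("CERN-tape", "srm://srm-lhcb.cern.ch/castor/cern.ch/grid")
rm = ReplicaManager([catalog], {"CERN-tape": se})
rm.put_and_register("/tmp/f1.digi", "/lhcb/user/f1.digi", 100, "CERN-tape")
print(catalog.entries)
```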

DIRAC Data Management Components
[Component diagram: Data Management clients (UserInterface, WMS, TransferAgent) go through the ReplicaManager, which interacts with the File Catalogs (FileCatalogA, FileCatalogB, FileCatalogC) and the StorageElement; the StorageElement reaches the physical storage via the SRMStorage, GridFTPStorage and HTTPStorage plug-ins and the SE Service.]