DIRAC Data Management: consistency, integrity and coherence of data. Marianne Bargiotti, CERN. CHEP07, Victoria, BC.


Outline
- DIRAC Data Management System (DMS)
- LHCb catalogues and physical storage
- DMS integrity checking
- Conclusions

DIRAC Data Management System
- The DIRAC project (Distributed Infrastructure with Remote Agent Control) is the LHCb Workload and Data Management System
  - DIRAC architecture is based on Services and Agents (see A. Tsaregorodtsev, poster [189])
- The DIRAC Data Management System deals with three components:
  - File Catalogue: records where files are stored
  - Bookkeeping Meta Data DB (BK): records what the files contain
  - Storage Elements: the underlying Grid Storage Elements (SE) where files are physically stored
- Consistency between these catalogues and the Storage Elements is fundamental for a reliable Data Management (see A. C. Smith, poster [195])

LHCb File Catalogue
- LHCb choice: LCG File Catalogue (LFC)
  - allows registering and retrieving the location of physical replicas in the grid infrastructure
  - it stores:
    - file information (LFN, size, GUID)
    - replica information
- The DIRAC WMS uses LFC information to decide where jobs can be scheduled (a toy illustration follows the list)
  - it is therefore essential to avoid inconsistencies both with the storages and with the related catalogues (BK Meta Data DB)
- Baseline choice for DIRAC: a central LFC
  - one single master (R/W) and many read-only mirrors
  - coherence ensured by the single write endpoint
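As a hedged illustration of how replica information can drive scheduling decisions, the sketch below keeps only the sites that hold all input files of a job. The replica lookup callback and the SE-to-site mapping are hypothetical, not the actual DIRAC WMS interfaces.

```python
# Illustrative sketch only: get_replica_ses() and se_to_site are hypothetical
# stand-ins for the LFC replica lookup and the configuration mapping.
def candidate_sites(input_lfns, get_replica_ses, se_to_site):
    """Return the sites where every input file of a job has a replica."""
    sites = None
    for lfn in input_lfns:
        lfn_sites = {se_to_site[se] for se in get_replica_ses(lfn)}
        sites = lfn_sites if sites is None else sites & lfn_sites
    return sites or set()
```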

Registration of replicas
- GUID check: before registration in the LCG File Catalogue, at the beginning of the transfer phase, the GUID of the file to be transferred is checked
  - this avoids GUID mismatches at registration time
- After a successful transfer, LFC registration of a file is divided into two atomic operations (a sketch of the workflow follows):
  - booking of the metadata fields, with the insertion of LFN, GUID and size into the dedicated table
  - replica registration
- If either step fails it becomes a possible source of errors and inconsistencies
  - e.g. the file is registered without any replica, or with zero size
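A minimal sketch of this two-step registration, assuming a hypothetical catalogue client (LfcClient-like object with get_lfn_for_guid, add_file and add_replica methods); it is not the actual DIRAC or LFC API, only an illustration of where the inconsistencies described above can creep in.

```python
# Illustrative sketch only: the catalogue client methods used here are hypothetical.
class RegistrationError(Exception):
    pass

def register_file(lfc, lfn, guid, size, surl, se_name):
    """Register a transferred file in two steps, mirroring the slide's workflow."""
    # GUID check: refuse to register if the GUID is already taken by another LFN
    existing = lfc.get_lfn_for_guid(guid)                  # hypothetical call
    if existing is not None and existing != lfn:
        raise RegistrationError(f"GUID {guid} already registered as {existing}")

    # Step 1: book the metadata fields (LFN, GUID, size)
    if not lfc.add_file(lfn=lfn, guid=guid, size=size):    # hypothetical call
        raise RegistrationError(f"metadata registration failed for {lfn}")

    # Step 2: register the replica; if this fails we are left with a file entry
    # that has no replica -- exactly the inconsistency the integrity agents
    # later have to detect and repair.
    if not lfc.add_replica(lfn=lfn, surl=surl, se=se_name):  # hypothetical call
        raise RegistrationError(f"replica registration failed for {lfn}")
```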

LHCb Bookkeeping Meta Data DB
- The Bookkeeping (BK) is the system that stores data provenance information.
- It contains information about jobs and files and their relations:
  - Job: application name, application version, application parameters, which files it has generated, etc.
  - File: size, events, filename, GUID, from which job it was generated, etc.
- The Bookkeeping DB is the main gateway for users to select the available data and datasets.
- All data visible to users are flagged as 'Has replica'
  - all data stored in the BK and flagged as having a replica must be correctly registered and available in the LFC.

Storage Elements
- DIRAC Storage Element Client
  - provides uniform access to Grid Storage Elements
  - implemented with plug-in modules for the access protocols
    - srm, gridftp, bbftp, sftp, http supported
- SRM is the standard interface to grid storage
- LHCb has 14 SRM endpoints
  - disk and tape storage for each Tier-1 site
- SRM will allow browsing the storage namespace (from SRM v2.2)
- Functionalities are exposed to users through the GFAL library API (a rough sketch of such an access follows)
  - the Python binding of the GFAL library is used to develop the DIRAC tools
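The slides refer to the 2007-era GFAL Python binding. As a hedged illustration of the same idea with the present-day gfal2 Python bindings (not the API used in the talk), probing a replica on an SRM endpoint might look like the sketch below; the SURL is a made-up placeholder.

```python
# Rough sketch with the modern gfal2 Python bindings, NOT the 2007 GFAL API;
# the SURL below is a made-up placeholder.
import gfal2

def stat_replica(surl):
    """Return (exists, size) for a storage URL, swallowing lookup errors."""
    ctx = gfal2.creat_context()
    try:
        info = ctx.stat(surl)   # POSIX-like stat through the protocol plug-in
        return True, info.st_size
    except gfal2.GError:
        return False, 0

if __name__ == "__main__":
    ok, size = stat_replica("srm://example-se.cern.ch/lhcb/production/0000/file.digi")
    print(ok, size)
```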

Data integrity checks
- Given the high number of interactions among the DM system components, integrity checking is part of the DIRAC Data Management system.
- There are two ways of performing checks:
  - those running as Agents within the DIRAC framework
  - those launched by the Data Manager to address specific situations
- The Agent-type checks can be broken into two further distinct types:
  - those based solely on the information found on the SE/LFC/BK: BK->LFC, LFC->SE, SE->LFC, Storage Usage Agent
  - those based on a priori knowledge of where files should exist according to the Computing Model
    - e.g. DSTs are always present on disk at all Tier-1s

DMS Integrity Agents overview
- The complete suite for integrity checking includes an assortment of agents (a generic agent skeleton is sketched below):
  - agents providing independent integrity checks on catalogues and storages, reporting their findings to the IntegrityDB
  - a further agent (the Data Integrity Agent) processes, where possible, the files recorded in the IntegrityDB, correcting, registering or replicating files as needed
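A minimal sketch of the common shape of these agents, assuming a hypothetical check_once() hook and a toy IntegrityDB stand-in; the real DIRAC agent framework differs.

```python
# Minimal agent skeleton; check_once() and IntegrityDB are hypothetical.
import time

class IntegrityDB:
    """Stand-in for the central IntegrityDB: just collects problem records."""
    def __init__(self):
        self.records = []
    def report(self, lfn, pathology, details=""):
        self.records.append({"lfn": lfn, "pathology": pathology, "details": details})

class ConsistencyAgent:
    """Generic shape of a DIRAC-style integrity agent: check, report, sleep, repeat."""
    def __init__(self, integrity_db, poll_seconds=3600):
        self.db = integrity_db
        self.poll_seconds = poll_seconds

    def check_once(self):
        """Subclasses implement one pass of a specific check (BK->LFC, LFC->SE, ...)."""
        raise NotImplementedError

    def run(self):
        while True:
            for lfn, pathology in self.check_once():
                self.db.report(lfn, pathology)
            time.sleep(self.poll_seconds)
```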

Data integrity checks & DM Console
- The Data Management Console is the interface for the Data Manager
  - the DM Console allows data integrity checks to be launched
- The development of these tools has been driven by experience:
  - many catalogue operations (fixes)
  - bulk extraction of replica information
  - deletion of replicas per site
  - extraction of replicas through LFC directories
  - change of a replica's SE name in the catalogue
  - creation of bulk transfer/removal jobs

BK - LFC Consistency Agent
- Main problem affecting the BK: many LFNs are registered in the BK but failed to be registered in the LFC
  - missing files in the LFC: users selecting LFNs in the BK cannot find any replica in the LFC
  - possible causes: registration in the LFC failing because of a failed copy, a temporary lack of service, ...
- BK -> LFC: performs a massive check over productions (a toy version of the comparison follows)
  - BK dumps of the different productions are checked against the same directories in the LFC
  - for each production:
    - check the existence of each BK entry in the LFC
    - check the file sizes
  - missing or problematic files are reported to the IntegrityDB
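A toy version of the BK-to-LFC comparison, assuming the BK dump and the LFC listing have already been flattened into {lfn: size} dictionaries (hypothetical inputs), and reusing the IntegrityDB stand-in from the agent skeleton above.

```python
# Toy BK -> LFC comparison over hypothetical {lfn: size} dictionaries.
def check_bk_against_lfc(bk_dump, lfc_entries, integrity_db):
    """Report BK entries that are missing from the LFC or have a size mismatch."""
    for lfn, bk_size in bk_dump.items():
        lfc_size = lfc_entries.get(lfn)
        if lfc_size is None:
            integrity_db.report(lfn, "missing-in-lfc")
        elif lfc_size != bk_size:
            integrity_db.report(lfn, "size-mismatch",
                                details=f"BK={bk_size} LFC={lfc_size}")
```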

LFC Pathologies
- Many different inconsistencies can arise in a complex computing model (a simple classifier sketch follows the list):
  - zero-size files: file metadata registered in the LFC but with the size information missing (set to 0)
  - missing replica information: missing replica field in the Replica Information table of the DB
  - wrong SAPath (bugs from an old DIRAC version, now fixed), e.g. a GRIDKA-tape replica registered as srm://gridka-dCache.fzk.de:8443/castor/cern.ch/grid/lhcb/production/DC06/v1-lumi2/ /DIGI/0000/ _ _9.digi
  - wrong SE host: e.g. CERN_Castor, wrong information in the LHCb Configuration Service
  - wrong protocol: sfn, rfio, bbftp, ...
  - mistakes in file registration: blank spaces in the SURL path, carriage returns, presence of the port number in the SURL path, ...
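As a hedged illustration, a small classifier that maps one catalogue replica record to the pathology labels listed above; the record fields (surl, size, se) are hypothetical and do not reflect the actual LFC schema.

```python
# Illustrative classifier for the pathologies above; record fields are hypothetical.
KNOWN_PROTOCOLS = ("srm",)

def classify_replica(record, known_ses):
    """Return a list of pathology labels for one catalogue replica record."""
    problems = []
    surl = record.get("surl", "")
    if record.get("size", 0) == 0:
        problems.append("zero-size-file")
    if not surl:
        problems.append("missing-replica-information")
    else:
        if not surl.startswith(tuple(p + "://" for p in KNOWN_PROTOCOLS)):
            problems.append("wrong-protocol")
        if " " in surl or "\r" in surl or "\n" in surl:
            problems.append("malformed-surl")
    if record.get("se") not in known_ses:
        problems.append("wrong-se-host")
    return problems
```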

LFC - SE Consistency Agent
- LFC replicas need to be perfectly coherent with the storage replicas in path, protocol and size (the cross-check is sketched below):
  - Replication issue: check whether the LFC replicas really are resident on the physical storages (check the existence and the size of the files)
    - if a file does not exist, it is recorded as such in the IntegrityDB
  - Registration issues: the LFC->SE agent stores problematic files in the central IntegrityDB according to the different pathologies:
    - zero-size files
    - missing replica information
    - wrong SAPath
    - wrong protocol
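A sketch of the LFC-to-SE cross-check, reusing the hypothetical stat_replica() and IntegrityDB helpers introduced in the earlier sketches.

```python
# Sketch of the LFC -> SE cross-check over hypothetical replica tuples.
def check_lfc_against_se(lfc_replicas, integrity_db):
    """lfc_replicas: iterable of (lfn, surl, catalogue_size) tuples (hypothetical)."""
    for lfn, surl, catalogue_size in lfc_replicas:
        exists, se_size = stat_replica(surl)
        if not exists:
            integrity_db.report(lfn, "missing-on-storage", details=surl)
        elif se_size != catalogue_size:
            integrity_db.report(lfn, "size-mismatch",
                                details=f"LFC={catalogue_size} SE={se_size}")
```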

SE - LFC Consistency Agent
- Checks the SE contents against the LCG File Catalogue (a minimal sketch follows):
  - lists the contents of the SE
  - checks the catalogue for the corresponding replicas
    - if files are missing (because of any kind of incorrect registration), they are recorded as such in the IntegrityDB
  - an efficient storage interface for bulk metadata queries (directory listings) is still missing
    - it is not possible to list the content of remote directories and get the associated metadata (lcg-ls)
    - further implementations will be put in place with SRM v2
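A minimal sketch of the SE-to-LFC direction: flag files that sit on storage but have no corresponding catalogue replica. Both listings are assumed to be plain sets of SURLs (hypothetical inputs).

```python
# Sketch of SE -> LFC "dark data" detection over hypothetical SURL sets.
def check_se_against_lfc(se_listing, lfc_surls, integrity_db):
    """Report every storage file that has no replica registered in the catalogue."""
    for surl in se_listing - lfc_surls:
        integrity_db.report(surl, "not-registered-in-lfc")
```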

Storage Usage Agent
- Using the replicas and sizes registered in the LFC, this agent constructs an exhaustive picture of the current LHCb storage usage (a toy aggregation follows):
  - works through a breakdown by directories
  - loops over the LFC, extracting file sizes per storage element
  - stores the information in the central IntegrityDB
  - produces a full picture of disk and tape occupancy on each storage
  - provides an up-to-date picture of LHCb's usage of resources in almost real time
- Foreseen development: use the LFC accounting interface to obtain a global picture per site
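A toy aggregation of storage usage per SE and per directory, assuming replica records of the form (lfn, se, size); this is not the actual agent implementation, only the accounting idea.

```python
# Toy storage-usage aggregation over hypothetical (lfn, se, size) records.
import os
from collections import defaultdict

def storage_usage(replicas):
    """Return (bytes per SE, bytes per LFC directory) for the given replicas."""
    per_se = defaultdict(int)
    per_dir = defaultdict(int)
    for lfn, se, size in replicas:
        per_se[se] += size
        per_dir[os.path.dirname(lfn)] += size
    return per_se, per_dir
```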

Data Integrity Agent
- The Data Integrity Agent takes action on a wide range of pathologies stored by the other agents in the IntegrityDB.
- Actions taken (the SURL construction is sketched below):
  - LFC - SE:
    - in case of a replica missing from the LFC: build the SURL paths from the LFN, according to the DIRAC Configuration System, for all the defined storage elements; search extensively throughout all Tier-1 SEs
    - if the search is successful, register the missing replicas
    - the same action is taken for zero-size files, wrong SAPath, ...
  - BK - LFC:
    - if the file is not present in the LFC: an extensive search is performed on all SEs
      - if the file is not found anywhere, the 'Has replica' flag is removed: the file is no longer visible to users
      - if the file is found, the LFC is updated with the missing file information extracted from the storages
  - SE - LFC:
    - files missing from the catalogue can be:
      - registered in the catalogue if the LFN is present
      - deleted from the SE if the LFN is missing from the catalogue
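A sketch of the "search all SEs" step: build candidate SURLs for an LFN from per-SE base paths and probe each one. The se_base_paths mapping stands in for what the DIRAC Configuration System would provide (hypothetical values), and stat_replica() is the earlier gfal2 sketch.

```python
# Sketch of the extensive search across Tier-1 SEs; base paths are hypothetical.
def find_lost_replica(lfn, se_base_paths):
    """Return (se_name, surl, size) of the first SE where the file is found, else None."""
    for se_name, base in se_base_paths.items():
        surl = base.rstrip("/") + lfn      # e.g. "srm://host:8443/vo-path" + "/lhcb/..."
        exists, size = stat_replica(surl)  # reuses the earlier gfal2 sketch
        if exists:
            return se_name, surl, size
    return None
```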

Prevention of Inconsistencies
- Failover mechanism (a sketch of the idea follows):
  - each operation that can fail is wrapped in an XML record, a "request", which can be stored in a Request DB
  - the Request DBs sit on one of the LHCb VO Boxes, which ensures that these records are never lost
  - these requests are executed by dedicated agents running on the VO Boxes, and are retried as many times as needed until they succeed
  - examples: file registration operations, data transfer operations, BK registration, ...
- Many other internal checks are also implemented within the DIRAC system to avoid data inconsistencies as much as possible, for example:
  - checks on file transfers based on file size or checksum, etc.
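A hedged sketch of the failover idea: serialise a failed operation as a small XML request and keep retrying it later. The element names and fields are invented for illustration; the actual DIRAC request schema differs.

```python
# Sketch only: the XML layout and the execute() callback are hypothetical.
import xml.etree.ElementTree as ET

def make_request(operation, lfn, target_se):
    """Wrap one failed operation as an XML record ready to be stored in a Request DB."""
    req = ET.Element("request", attrib={"operation": operation})
    ET.SubElement(req, "lfn").text = lfn
    ET.SubElement(req, "targetSE").text = target_se
    return ET.tostring(req, encoding="unicode")

def execute_pending(request_db, execute):
    """Retry every stored request; keep the ones that still fail for the next pass."""
    still_pending = []
    for xml_record in request_db:
        req = ET.fromstring(xml_record)
        if not execute(req):                  # execute() is a hypothetical callback
            still_pending.append(xml_record)  # retried on the next cycle
    return still_pending
```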

Conclusions
- The integrity-check suite is an important part of the Data Management activity
- Further development will be possible with SRM v2 (SE vs LFC agent)
- Most of the effort now goes into the prevention of inconsistencies (checksums, failover mechanisms, ...)
- Final target: minimising the number of occurrences of frustrated users looking for non-existent data.

Backup

DIRAC Architecture
- DIRAC (Distributed Infrastructure with Remote Agent Control) is LHCb's grid project
- The DIRAC architecture is split into three main component types:
  - Services: independent functionalities deployed and administered centrally, on machines accessible by all other DIRAC components
  - Resources: Grid compute and storage resources at remote sites
  - Agents: lightweight software components that request jobs from the central Services for a specific purpose
- The DIRAC Data Management System is made up of an assortment of these components.

DIRAC DM System
- Main components of the DIRAC Data Management System (a client-side sketch follows):
  - Storage Element
    - abstraction of Grid storage resources: the Grid SE (also called Storage Element) is the underlying resource used
    - actual access is performed by protocol-specific plug-ins (srm, gridftp, bbftp, sftp, http supported)
    - namespace management, file upload/download, deletion, etc.
  - Replica Manager
    - provides an API for the available data management operations
    - point of contact for users of the data management system
    - removes direct interaction with the Storage Elements and File Catalogues
    - uploading/downloading files to/from a Grid SE, replication of files, file registration, file removal
  - File Catalogue
    - a standard API is exposed for a variety of available catalogues
    - allows redundancy across several catalogues
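A hedged sketch of how the Replica Manager layer is meant to be used: one client object hides the Storage Element and File Catalogue details behind a single call. The class and method names are illustrative, not the real DIRAC API.

```python
# Illustrative Replica-Manager-style facade; all method names are hypothetical.
class ReplicaManager:
    def __init__(self, storage_elements, file_catalogues):
        self.ses = storage_elements        # name -> StorageElement-like objects
        self.catalogues = file_catalogues  # list of FileCatalog-like objects

    def put_and_register(self, local_path, lfn, se_name):
        """Upload a local file to one SE and register it in every catalogue."""
        se = self.ses[se_name]
        surl = se.put_file(local_path, lfn)             # hypothetical SE call
        for catalogue in self.catalogues:               # redundancy across catalogues
            catalogue.add_file(lfn, surl, se_name)      # hypothetical catalogue call
        return surl
```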

DIRAC Data Management Components (architecture diagram): Data Management clients (User Interface, WMS, Transfer Agent) talk to the Replica Manager, which uses the Storage Element abstraction (SE Service with SRM, GridFTP and HTTP storage plug-ins) for access to physical storage, and one or more File Catalogues (FileCatalogA, FileCatalogB, FileCatalogC).