Introduction to reading and writing files in Grid


1 Introduction to reading and writing files in Grid
Flavia Donno INFN and CERN Scuola per utenti INFN della Grid Bologna, 27 November 2007

2 Outline
The File System
Files, aliases, i-nodes in UNIX
Local file access and file handles
Remote access: NFS
Other remote access protocols
Transferring a file via FTP
The Grid analogy
The Storage Element and Storage Services: CASTOR, dCache, DPM, StoRM
The Storage Resource Manager v1.1 and v2.2
Grid Spaces and Space Tokens
Grid Files: Permanent and Volatile
Grid File Access Protocols: gridftp, gsirfio, gsidcap
Logical File Name and Grid Unique ID
Site URL (SURL) and Transfer URL (TURL)
Grid Catalogues: the LCG File Catalogue (LFC)
Storing a file in the LFC
The basic operations to access a file in Grid

3 The File System
In computing, a file system is a method for storing and organizing computer files and the data they contain, making them easy to find and access.
File systems may use a data storage device such as a hard disk or CD-ROM, and they keep track of the physical location of the files.
More formally, a file system is a set of abstract data types implemented for the storage, hierarchical organization, manipulation, navigation, access, and retrieval of data.

4 The File System
File systems allow for the storage of data:
  Data are stored as files, with associated characteristics stored in i-nodes.
  Data/files have a name and are organized hierarchically in directories.
  It is possible to manipulate the namespace (mkdir, rmdir), navigate it (ls), and manage it and the associated data (mv, rm, cp, etc.).
Characteristics stored in the i-node:
  st_mode   File permissions (user, group, other) and flags
  st_ino    File serial number (i-node number)
  st_dev    File device number
  st_nlink  File link count
  st_uid    The owner’s user ID
  st_gid    The owner’s group ID
  st_size   File size in bytes
  st_atime  The last access time
  st_mtime  The last modification time
  st_ctime  The last status change time (a change to the i-node, not the file’s creation time)
i-node number = unique identifier of a file in a filesystem
link = an alias for the file
file device = identifies the device that holds the file
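The i-node fields listed above can be inspected from a program. The following is a minimal sketch in Python, assuming a POSIX-like system; the helper name `describe` and the file name `example.dat` are hypothetical and chosen only for illustration.

```python
import os
import stat
import tempfile


def describe(path):
    """Return a few i-node attributes of `path` as a dictionary."""
    info = os.stat(path)
    return {
        "mode": stat.filemode(info.st_mode),  # permissions as text, like ls -l
        "inode": info.st_ino,                 # file serial number
        "device": info.st_dev,                # device holding the file
        "links": info.st_nlink,               # number of names for this i-node
        "uid": info.st_uid,
        "gid": info.st_gid,
        "size": info.st_size,                 # size in bytes
        "mtime": info.st_mtime,               # last modification time
    }


# Create a throwaway file so the example does not depend on existing paths.
with tempfile.TemporaryDirectory() as workdir:
    example_path = os.path.join(workdir, "example.dat")
    with open(example_path, "wb") as handle:
        handle.write(b"hello grid")
    attributes = describe(example_path)

print(attributes)
```

The exact i-node and device numbers differ from system to system; only the size and link count are predictable here.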

5 The File System
i-node number = unique identifier of a file in a filesystem
link = an alias for the file
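To see that a link is just another name for the same i-node, one can create a hard link and compare the two names. This is a small sketch assuming a filesystem that supports hard links; the file names are hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "data.txt")
    alias = os.path.join(workdir, "alias.txt")

    with open(original, "w") as handle:
        handle.write("payload")

    # A hard link adds a second directory entry pointing at the same i-node.
    os.link(original, alias)

    same_inode = os.stat(original).st_ino == os.stat(alias).st_ino
    link_count = os.stat(original).st_nlink

    # Removing one name leaves the data reachable through the other.
    os.remove(original)
    with open(alias) as handle:
        surviving_content = handle.read()

print(same_inode, link_count, surviving_content)
```

The data is only freed when the last link is removed, which is the behaviour the Grid catalogue later mimics with multiple logical names for one file.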

6 The File System
It is possible to access and retrieve data:
  open returns a file handle.
  The file handle can be used with stat, read/write, and close.
A file descriptor, or handle, is an index for an entry in a kernel-resident data structure containing the details of all open files. The user application passes this abstract key to the kernel through a system call, and the kernel accesses the file on behalf of the application, based on the key. The application itself cannot read or write the file descriptor table directly.
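The open/stat/read/write/close cycle described above can be sketched with Python's thin wrappers around the corresponding system calls. The file name `events.dat` and its content are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "events.dat")

    # open returns a file descriptor: an integer index into the kernel's
    # table of open files for this process.
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        is_integer_handle = isinstance(fd, int)

        os.write(fd, b"run821")                      # write through the handle
        size_seen_by_kernel = os.fstat(fd).st_size   # stat through the handle

        os.lseek(fd, 0, os.SEEK_SET)                 # rewind before reading
        content = os.read(fd, 64)
    finally:
        os.close(fd)                                 # release the table entry

print(is_integer_handle, size_seen_by_kernel, content)
```

The application only ever holds the integer; every operation goes through the kernel, which is what lets a network file system substitute remote storage without changes to the program.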

7 The Remote File Systems
NFS (Network File System) allows users to transparently access a file resident on a different computer over the network using the standard system calls: open, read/write, close, stat.
Today there are many other systems, some of them proprietary, that allow very efficient transparent access to files over the network: GPFS, Lustre, etc.
No changes to the application are required in order to use files on a network file system.

8 Remote File Access protocols
In HEP, many remote file access protocols are in use. Among the most popular: rfio, dcap, root.

9 RFIO
RFIO is one of the components of the CERN Advanced Storage Manager (CASTOR).
RFIO is an efficient protocol for remote file access:
  Lightweight
  Control and data streams are separated
  Multiple parallel streams are not implemented
It consists of:
  A daemon (rfiod)
  An API library offering POSIX-like open, read, write and seek (libshift)
  A command line interface: rfdir, rfmkdir, rfcat, rfrename, rftp, rfstat, rfchmod, rfrm, rfcp
Client programs can use the RFIO libraries to access files on remote disks or in the CASTOR namespace. The libraries detect the location of the files and take appropriate action to make them available to the client application.

10 RFIO
How is a file addressed in RFIO?
The physical location of a file must be passed. This includes the hostname and port where the rfiod daemon runs. It is specified via a Transport URL (TURL, see later):
  rfio://[host][:port]/[path]
CASTOR files are special since they are recognized by the protocol. They are of the form:
  rfio://[stagehost][:port]/?[svcClass=MySvcClass&][castorVersion=MyCastorVersion&]path=/castor/cern.ch/user/n/nobody/file
or
  rfio://[stagehost][:port]/[/castor/cern.ch/user/n/nobody/file][?[svcClass=MySvcClass&][castorVersion=MyCastorVersion]]
All details can be found here:
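To make the TURL forms above concrete, here is a rough sketch that splits an rfio TURL into its parts using only Python's standard URL parser. It does not use any RFIO library. The function name `parse_rfio_turl`, the example hosts, and the port numbers are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qsl


def parse_rfio_turl(turl):
    """Split an rfio:// TURL into host, port, path and CASTOR options."""
    parts = urlsplit(turl)
    if parts.scheme != "rfio":
        raise ValueError("not an rfio TURL: %r" % turl)

    options = dict(parse_qsl(parts.query))
    # First CASTOR form: the file path travels in the query string.
    # Otherwise use the URL path, collapsing a leading '//' into '/'.
    path = options.pop("path", None) or "/" + parts.path.lstrip("/")

    return {
        "host": parts.hostname,
        "port": parts.port,
        "path": path,
        "options": options,  # e.g. svcClass, castorVersion
    }


# Hypothetical hosts and ports, used only to exercise the three forms.
plain = parse_rfio_turl("rfio://disk01.example.org:5001/data/alice/ntuples.dat")
castor_query = parse_rfio_turl(
    "rfio://stager.example.org:9002/?svcClass=default&castorVersion=2"
    "&path=/castor/cern.ch/user/n/nobody/file"
)
castor_path = parse_rfio_turl(
    "rfio://stager.example.org:9002//castor/cern.ch/user/n/nobody/file?svcClass=default"
)

print(plain)
print(castor_query)
print(castor_path)
```

Both CASTOR forms resolve to the same path; they differ only in whether the path is carried in the URL path or in the `path=` query parameter.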

11 DCAP
dcap (disk cache access protocol) is the native file access protocol of the dCache storage system used in WLCG.
dcap comes with advanced tuning options and a “passive mode” to get access from behind firewalls and networks using NAT.
dcap supports regular file access functions, offering POSIX-like I/O, including open, read, write, seek, stat, and close, as well as the standard filesystem namespace operations.
To make applications that are not dCache-aware access files within dCache through dcap, set the following environment variables:
  export LD_PRELOAD=/opt/d-cache/dcap/lib/libpdcap.so
  export DCACHE_IO_TUNNEL=/opt/d-cache/dcap/lib/libgsiTunnel.so
dccp provides cp-like functionality on the dCache file system, called the Perfectly Normal File System (pnfs).

12 DCAP
How is a file addressed in dcap?
When using the dcap library, the full pnfs filename must be used.
When using ROOT, two URL syntaxes can be used:
  dcache:/pnfs/<path>/<file>.root
or
  dcap://<nodename.org>/<path>/<file>.root
All details can be found here:

13 ROOT
ROOT is a basic framework that offers a common set of features and tools for many domains, including event generators, detector simulation, event reconstruction, and data analysis.
However, ROOT also provides an efficient protocol to read and write ROOT files over the network.
ROOT files can be addressed as follows:
  roots://hpsalo/files/aap.root
  root://hpbrun.cern.ch/root/hsimple.root
  root://pcna49a:5151/~na49/data/run821.root
  root://pcna49d.cern.ch:5050//v1/data/run810.root
For example, from the ROOT prompt:
  % root
  root [0] TFile *f = TFile::Open("root://fsgi02.fnal.gov:5151/file.root","new")
Details about the ROOT protocol can be found here:

14 FTP
The File Transfer Protocol (FTP) allows for the efficient transfer and management of files over the network.
It offers put/get functionality.
It also allows for namespace browsing and manipulation (ls, mkdir, rmdir, etc.).

15 The Grid Analogy
Accessing a file over the Grid implies several operations that are normally not required on a LAN:
  Where is the file that one needs to access?
  Are there many copies of the same file?
  What is the storage system serving the file?
  What file access protocol do I need to use?
  Do I have the read/write access privileges?
  Is the file going to be available for the time I need?
  Is there space to write a new file?
  How do I get a file “handle”?
In what follows we will try to answer all these questions.

16 The Storage Element
A Grid Storage Element provides storage space for files.
Initially it was disk-only, with a Grid-aware FTP server.
Such an arrangement was soon considered insufficient, given the amount of data that the LHC experiments had to handle and the multitude of storage solutions.
Furthermore, a uniform interface to shield users from the peculiarities of a specific storage system was badly needed.

17 The Storage Element
The storage systems in use today in WLCG are:
  CERN Advanced Storage Manager (CASTOR), developed at CERN and RAL
  dCache, developed at DESY/FNAL/NDGF
  Disk Pool Manager (DPM), developed at CERN
  StoRM (Storage Resource Manager), developed at INFN/CNAF
CASTOR and dCache are complex systems that can manage tape libraries and offer a filesystem view of the available data.
DPM is a storage solution that manages the space served by multiple disk servers.
StoRM is a storage solution based on parallel or distributed filesystems.

18 The Storage Service Interface
The need for a standard interface for storage services in Grid was recognized back in 2001.
International collaboration: LBNL, FNAL, CERN, JLAB
Provide the basic functionality required
SRM v1.1 is implemented by all major storage providers: CASTOR, dCache, DPM
The main functions:
  Get, getRequestStatus, pin, unpin
  Put, setFileStatus, Copy
  getProtocols, AdvisoryDelete, FileMetaData
Main features:
  Asynchronous operations
  Support for bulk requests
  Protocol negotiation
Main problems:
  Missing reference implementation / no clear specs
  Advisory delete
  No space management
  No explicit quality of storage management
  No abort operations
  No staging operations

19 The Storage Resource Manager v2.2
It was only in May 2006 that an agreement on the functionality was reached. After more than one year, SRM v2.2 is finally being deployed in production. Details will be given later.

20 Storage Classes
In SRM v2.2 it is possible to select the quality of storage.
A storage class is a quality of storage defined by the Retention Policy and the Access Latency:
  Retention Policy: Custodial or Replica
  Access Latency: Nearline or Online
The WLCG SRM v2.2 MoU defines 3 cases:
  Custodial x Nearline -> “Tape1Disk0”
  Custodial x Online -> “Tape1Disk1”
  Replica x Online -> “Tape0Disk1”
TapeN -> N copies guaranteed on tape (or other high-quality media)
  Tape1/Custodial -> “Do not lose this data!”
  Tape0/Replica -> “No disaster if this data is lost.” (a custodial copy may be elsewhere)
DiskM -> M copies guaranteed on disk
  Disk0 managed by system, Disk1 managed by VO
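The mapping from Retention Policy and Access Latency to a storage class name can be written down as a small lookup table. This is only an illustrative sketch of the three cases named on the slide; the function names `storage_class` and `guaranteed_copies` are hypothetical and are not part of any SRM client.

```python
import re

# The three combinations defined by the WLCG SRM v2.2 MoU, as listed above.
STORAGE_CLASSES = {
    ("CUSTODIAL", "NEARLINE"): "Tape1Disk0",
    ("CUSTODIAL", "ONLINE"): "Tape1Disk1",
    ("REPLICA", "ONLINE"): "Tape0Disk1",
}


def storage_class(retention_policy, access_latency):
    """Return the storage class name for a (policy, latency) pair."""
    key = (retention_policy.upper(), access_latency.upper())
    try:
        return STORAGE_CLASSES[key]
    except KeyError:
        raise ValueError("combination not among the three listed cases: %s x %s" % key) from None


def guaranteed_copies(name):
    """Split 'TapeNDiskM' into (N tape copies, M disk copies)."""
    match = re.fullmatch(r"Tape(\d+)Disk(\d+)", name)
    if match is None:
        raise ValueError("not a TapeNDiskM name: %r" % name)
    return int(match.group(1)), int(match.group(2))


raw_class = storage_class("Custodial", "Nearline")
tape_copies, disk_copies = guaranteed_copies(raw_class)
print(raw_class, tape_copies, disk_copies)
```

Reading the name back as (N, M) makes the guarantee explicit: Tape1Disk0 promises one copy on tape and none permanently on disk.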

21 Grid Spaces and Space Tokens
In SRM v2.2 it is possible to reserve space of a given quality. The space reserved always refers to space on disk.
The WLCG SRM v2.2 MoU establishes that spaces can be reserved statically, even though dynamic space reservation can be supported by a storage system.
A “space token description” is a tag that identifies a “chunk of space” with given characteristics (such as its storage class, size, protocols supported, etc.).
The reserved space can be used by the VO or VOMS FQAN specified when the space was reserved.
The space token description is used whenever files are created. For read operations the token is not needed.

22 SRM v2.2 Files
In SRM v2.2 files are permanent: only the user can remove them from the system.
The copy on disk of a file in Tape1Disk0 space can be temporarily removed by the system if space is needed.
Copies can be “pinned” to prevent the system from deleting them from disk while not in use.
Copies can be “released” when no longer needed. The garbage collector can then delete the copy on disk in order to make space for other copies.
A file in Tape1Disk0 space for which no copy exists on disk can be staged from tape to disk before the file is needed.

23 SRM v2.2 File Access Protocols
SRM v2.2 allows for the negotiation of the file access protocols.
The application can contact the Storage Server asking for a list of possible file access protocols. The server responds by providing the file handle for a supported protocol.
Almost all file access protocols mentioned before have a GSI-aware version: gsirfio, gsidcap, gsiftp.
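The idea of protocol negotiation can be illustrated with a toy model: the client offers an ordered list of protocols it can speak, and the server answers with a locator for the first one it supports. This is not the SRM interface itself; the functions `negotiate` and `make_turl`, the host name, and the file path are all hypothetical, and the assumption that the client's list is ordered by preference is made only for this sketch.

```python
def negotiate(client_preferences, server_supported):
    """Return the first protocol in the client's ordered list that the server supports."""
    for protocol in client_preferences:
        if protocol in server_supported:
            return protocol
    return None  # no protocol in common


def make_turl(protocol, host, path):
    """Build a transfer locator for the agreed protocol (illustrative only)."""
    return "%s://%s%s" % (protocol, host, path)


# A hypothetical storage server that speaks gsiftp and rfio only.
server_supported = {"gsiftp", "rfio"}

# The client would prefer gsidcap but can fall back to gsiftp or rfio.
chosen = negotiate(["gsidcap", "gsiftp", "rfio"], server_supported)
turl = make_turl(chosen, "se01.example.org", "/data/cms/output10")

print(chosen, turl)
```

The application never needs to know in advance which storage system sits behind the interface; it only needs one protocol in common with it.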

24 SRM v2.2: usage scenario
[Figure: usage scenario. Space ATLAS_RAW (T1D9) is reserved through the SRM v2.2 interface, which sits in front of Castor, DPM, StoRM, dCache and SRB; the file access protocols shown are dcap, rfio, root and gsiftp. "Put <file> ATLAS_RAW gsiftp" and "Get <file> gsiftp" each return gsiftp://<host>/<file>.]

25 Localization of a Grid file
Files in a file system have a name. Files in Grid have a Logical File Name (LFN).
Just as a file in a file system can have multiple links, a file in the Grid can have multiple LFNs:
  lfn:cms/ /run2/track1
However, it must be possible to identify a Grid file uniquely. The Grid Unique ID (GUID) serves this purpose: it is a non-human-readable unique identifier for an item of data.
  guid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6
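The sample GUID above has the textual shape of a standard UUID, so it can be parsed and generated with Python's `uuid` module. This sketch assumes that a GUID is simply a UUID prefixed with `guid:`, which matches the example shown but is not stated as a rule in the slides; the helper `parse_guid` is hypothetical.

```python
import uuid


def parse_guid(text):
    """Parse 'guid:<uuid>' and return the uuid.UUID object."""
    prefix, _, value = text.partition(":")
    if prefix != "guid":
        raise ValueError("missing 'guid:' prefix in %r" % text)
    return uuid.UUID(value)


# The identifier shown on the slide.
parsed = parse_guid("guid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6")

# A freshly generated identifier in the same textual form.
new_guid = "guid:" + str(uuid.uuid4())

print(parsed, parsed.version)
print(new_guid)
```

Such identifiers carry no meaning for a human reader, which is why the LFN exists alongside them.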

26 SURL, TURL and PFNs
A Site URL (SURL) allows a user to contact a Storage Service at a site asking for file access. Here srm is the control protocol for the storage service, and the first two forms are fully specified SURLs:
  srm://pcrd24.cern.ch:8443/srm/managerv2?SFN=/flatfiles/cms/output10
  srm://pcrd24.cern.ch:8443/srm/managerv1?SFN=/flatfiles/cms/output10
  srm://pcrd24.cern.ch:8443/flatfiles/cms/output10
A Transport URL (TURL) is a temporary locator of a replica, accessible via a specified access protocol understood by the storage service:
  rfio://lxshare0209.cern.ch/data/alice/ntuples.dat
A Site File Name (SFN) is the file location as understood by a local storage system:
  /castor/cern.ch/user/n/nobody/file?svcClass=custorpublic&castorVersion=2
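The difference between the fully specified and the short SURL forms can be seen by taking them apart. The following sketch uses only Python's standard URL parser and the SURLs quoted above; the function name `parse_surl` and the returned field names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def parse_surl(surl):
    """Split an srm:// SURL into host, port, service endpoint and SFN."""
    parts = urlsplit(surl)
    if parts.scheme != "srm":
        raise ValueError("not a SURL: %r" % surl)

    sfn_values = parse_qs(parts.query).get("SFN")
    if sfn_values:
        # Fully specified form: the URL path names the SRM service endpoint
        # and the SFN is passed as a query parameter.
        endpoint, sfn = parts.path, sfn_values[0]
    else:
        # Short form: the URL path is the SFN and the endpoint is left implicit.
        endpoint, sfn = None, parts.path

    return {"host": parts.hostname, "port": parts.port, "endpoint": endpoint, "sfn": sfn}


full = parse_surl("srm://pcrd24.cern.ch:8443/srm/managerv2?SFN=/flatfiles/cms/output10")
short = parse_surl("srm://pcrd24.cern.ch:8443/flatfiles/cms/output10")

print(full)
print(short)
```

Both forms identify the same SFN on the same host; the fully specified one additionally states which SRM endpoint (here the v2 manager) should be contacted.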

27 What is a catalogue?
[Figure: a gLite UI, a File Catalog, and three Storage Elements (SE).]

28 What is a catalogue?
Users and applications need to locate files (or replicas) on the whole Grid. The File Catalog is the service that makes this possible: it maintains the mappings between LFNs, GUIDs and SURLs.
In WLCG, file cataloguing operations are provided by the LFC (LCG File Catalog).
The LFC is deployed as a centralized service (sometimes replicated), and its endpoint is published in the Information Service so that it can be found by the LCG data management tools and other Grid services.
[Figure: in the File Catalog, an LFN and symbolic links 1..n map to one GUID, which maps to physical files with SURL 1..n; the SRM maps each SURL to TURL 1..n.]
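The mappings a file catalogue maintains can be modelled with two dictionaries: LFN to GUID, and GUID to the list of replica SURLs. The class below is a toy in-memory model for illustration only. It is not the LFC API, and the class name, method names, LFNs, and hosts are all hypothetical.

```python
import uuid


class ToyCatalogue:
    """In-memory model of the LFN -> GUID -> SURL mappings."""

    def __init__(self):
        self._guid_by_lfn = {}
        self._surls_by_guid = {}

    def register(self, lfn, surl):
        """Register a new file under `lfn` with a first replica at `surl`."""
        guid = "guid:" + str(uuid.uuid4())
        self._guid_by_lfn[lfn] = guid
        self._surls_by_guid[guid] = [surl]
        return guid

    def add_alias(self, existing_lfn, new_lfn):
        """Give the same file a second logical name (like a link)."""
        self._guid_by_lfn[new_lfn] = self._guid_by_lfn[existing_lfn]

    def add_replica(self, lfn, surl):
        """Record another physical copy of the file."""
        self._surls_by_guid[self._guid_by_lfn[lfn]].append(surl)

    def guid(self, lfn):
        return self._guid_by_lfn[lfn]

    def replicas(self, lfn):
        """Return the SURLs of all known replicas of the file."""
        return list(self._surls_by_guid[self._guid_by_lfn[lfn]])


catalogue = ToyCatalogue()
catalogue.register("lfn:/grid/myvo/run2/track1", "srm://se1.example.org/data/track1")
catalogue.add_replica("lfn:/grid/myvo/run2/track1", "srm://se2.example.org/data/track1")
catalogue.add_alias("lfn:/grid/myvo/run2/track1", "lfn:/grid/myvo/latest/track")

replica_list = catalogue.replicas("lfn:/grid/myvo/latest/track")
print(replica_list)
```

Looking up either logical name leads to the same GUID and therefore to the same set of replicas; turning one of those SURLs into a TURL is then the job of the SRM at the site.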
