EGEE-II INFSO-RI Enabling Grids for E-sciencE Data management in EGEE
Data services on Grids
  Simple data files on grid-specific storage
  Middleware supporting
    – Replica files
        to be close to where you want computation
        for resilience
    – Logical filenames
    – Catalogue: maps logical name to physical storage device/file
    – Virtual filesystems, POSIX-like I/O
    – Services provided: storage, transfer, catalogue that maps logical filenames to replicas
  Solutions include
    – gLite data service
    – Globus: Data Replication Service
    – Storage Resource Broker
  Other data! e.g. ….
    – Structured data: RDBMS, XML databases, …
    – Files on project’s filesystems
    – Data that may already have other user communities not using a Grid
  Require extendable middleware tools to support
    – Computation near to data
    – Controlled exposure of data without replication
  Basis for integration and federation
  OGSA-DAI
    – In Globus 4
    – Not (yet...) in gLite
Scope of data services in gLite
  Files that are write-once, read-many
    – If users edit files then
        They manage the consequences!
        Maybe just create a new filename!
    – No intention of providing a global file management system
  3 service types for data
    – Storage
    – Catalogs
    – Transfer
Data management example
  [Figure. Components: “User interface”, Resource Broker, Computing Element, Storage Element 1, Storage Element 2, LCG File Catalogue (LFC). Labels: Input “sandbox”; Input “sandbox” + Broker Info; Output “sandbox”; Max. 20 MByte; DataSets info.]
  1st job writes and replicates output onto 2 SEs
Data management example 2
  [Figure. Components: “User interface”, Resource Broker, Computing Element, Storage Element 1, Storage Element 2, LCG File Catalogue (LFC). Labels: Input “sandbox”; Input “sandbox” + Broker Info; Output “sandbox”; Max. 20 MByte; DataSets info.]
  Job reads input from an SE
  Keep computation close to data
Logical file names
  [Figure. Components: “User interface”, LCG File Catalogue (LFC), Storage Element 1, Storage Element 2. Labels: “Myfile.dat”; guid; File_on_se1; File_on_se2.]
  Content is available on 2 SEs
Resolving logical file name
  [Figure. Components: “User interface”, LCG File Catalogue (LFC), Storage Element 1, Storage Element 2. Labels: “Myfile.dat” (“Logical filename”); “GUID” (Global Unique Identifier); File_on_se1 (“SURL”: site URL); File_on_se2 (“SURL”: site URL).]
  Content is available on 2 SEs
  File content cannot change, so there is no need to synchronize replicas
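The resolution step on this slide can be pictured with a small in-memory model. This is only a sketch: the struct, both functions and the demo data are hypothetical and are not the LFC client API. The names in the table are the placeholder labels from the figure, not real SURLs. Because file content cannot change, any replica is as good as any other, so the sketch simply prefers a replica whose SURL contains a requested substring and otherwise takes the first one.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_REPLICAS 4

/* One catalogue record: a logical name, its GUID, and the replicas' SURLs.
 * Hypothetical layout for illustration; not the LFC data model. */
typedef struct {
    const char *lfn;
    const char *guid;
    const char *surls[MAX_REPLICAS];
    size_t      n_surls;
} catalogue_entry;

/* Look up an entry by logical file name; returns NULL when not registered. */
const catalogue_entry *resolve_lfn(const catalogue_entry *cat, size_t n,
                                   const char *lfn)
{
    assert(cat != NULL && lfn != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(cat[i].lfn, lfn) == 0)
            return &cat[i];
    return NULL;
}

/* Replicas are identical (write-once files), so choosing one is only a
 * question of locality: take the first SURL containing `preferred`,
 * or fall back to the first replica. */
const char *pick_replica(const catalogue_entry *e, const char *preferred)
{
    assert(e != NULL && e->n_surls > 0);
    if (preferred != NULL)
        for (size_t i = 0; i < e->n_surls; i++)
            if (strstr(e->surls[i], preferred) != NULL)
                return e->surls[i];
    return e->surls[0];
}

/* Illustrative data: the placeholder labels used in the figure. */
static const catalogue_entry demo_catalogue[] = {
    { "Myfile.dat", "guid", { "File_on_se1", "File_on_se2" }, 2 },
};
```

Calling resolve_lfn(demo_catalogue, 1, "Myfile.dat") and then pick_replica(entry, "se2") selects File_on_se2. No synchronization logic appears anywhere, which is the point of the write-once rule.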
Name conventions
  Logical File Name (LFN)
    – An alias created by a user to refer to some item of data, e.g.
        lfn:/grid/gilda/budapest23/run2/track1
  Globally Unique Identifier (GUID)
    – A non-human-readable unique identifier for an item of data, e.g.
        guid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6
  Site URL (SURL) (or Physical File Name (PFN) or Site FN)
    – The location of an actual piece of data on a storage system, e.g.
        srm://pcrd24.cern.ch/flatfiles/cms/output10_1 (SRM)
        sfn://lxshare0209.cern.ch/data/alice/ntuples.dat (Classic SE)
  Transport URL (TURL)
    – Temporary locator of a replica + access protocol: understood by an SE, e.g.
        rfio://lxshare0209.cern.ch//data/alice/ntuples.dat
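The four kinds of name can be told apart by their prefix. The sketch below is a hypothetical helper, not part of gLite or lcg-utils, and it recognises only the prefixes that appear in the examples on this slide. Treating rfio:// as the only TURL scheme is a simplification made for this illustration; an SE may offer other access protocols.

```c
#include <assert.h>
#include <string.h>

/* Kinds of grid file name described on the slide. */
typedef enum {
    NAME_LFN, NAME_GUID, NAME_SURL, NAME_TURL, NAME_UNKNOWN
} name_kind;

/* Return 1 if s begins with prefix, 0 otherwise. */
static int has_prefix(const char *s, const char *prefix)
{
    return strncmp(s, prefix, strlen(prefix)) == 0;
}

/* Classify a name by its scheme prefix.
 * Hypothetical helper: only the prefixes shown on the slide are handled. */
name_kind classify_grid_name(const char *name)
{
    assert(name != NULL);
    if (has_prefix(name, "lfn:"))
        return NAME_LFN;
    if (has_prefix(name, "guid:"))
        return NAME_GUID;
    if (has_prefix(name, "srm://") || has_prefix(name, "sfn://"))
        return NAME_SURL;
    if (has_prefix(name, "rfio://"))   /* assumption: the one TURL scheme shown */
        return NAME_TURL;
    return NAME_UNKNOWN;
}
```

For instance, classify_grid_name("lfn:/grid/gilda/budapest23/run2/track1") yields NAME_LFN, while a plain local path yields NAME_UNKNOWN.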
Name conventions
  Users primarily access and manage files through “logical filenames”
  Mapping by the “LFC” catalogue server
  LFC has a directory tree structure
      lfn:/grid/ /
    – LFC Namespace
    – Defined by the user
LFC directories
  LFC directories = virtual directories
    – Each entry in the directory is a pointer to files stored on SEs
  [Figure: the LCG File Catalogue (LFC) directory lfn:/grid/gilda/budapest23/run2/ with entries input1, input2, input3, and the Storage Elements shown below.]
    Storage Element 1: sfn://grid005.iucc.ac.il/storage/gilda/generated/ /fileb233d43f-5bc6-4ede-a5fe-611d48be2ba5
    Storage Element 2: srm://aliserv6.ct.infn.it/dpm/ct.infn.it/home/gilda/generated/ /filea21ab3e2-8ff6-4a44-82a7-f2
    Storage Element 3: sfn://trigriden01.unime.it/flatfiles/SE00/gilda/generated/ /filec79a9e3c a2a5-235f
    Storage Element 4: sfn://grid005.iucc.ac.it/flatfiles/SE00/gilda/generated/ /filec79a9e3c a2a5-235f
Two sets of commands
  lfc-*
      LFC = LCG File Catalogue
      LCG = LHC Computing Grid
      LHC = Large Hadron Collider
    – Use LFC commands to interact with the catalogue only
        To create a catalogue directory
        To list files
    – Used by you, your application and by lcg-utils (see below)
  lcg-*
    – Couples catalogue operations with file management
        Keeps SEs and catalogue in step!
    – Copy files to/from/between SEs
    – Replicate files
LFC basics
  LFC has a directory tree structure
      /grid/ /
    – LFC Namespace
    – Defined by the user
  All members of a given VO have read-write permissions in their directory
  Commands look like UNIX with “lfc-” in front (often)
Storage Element
  Provides
    – Storage for files: massive storage system, disk or tape based
    – Transfer protocol (gsiFTP) ~ GSI based FTP server
        Striped file transfer, cluster as back-end
  [Figure. Labels: Storage Element server; File request + VOMS proxy; Authentication, authorization; File system.]
GFAL C API
  GFAL (Grid File Access Library) is a POSIX-like interface for operations on files on a Storage Element
  Enables remote handling of files
  Libraries are in C and can be included in C/C++ sources
  GFAL Java API: wrapper around the C code
  The most common I/O operations are available, just prefix gfal_ to the function name (open(), read()…)
  man gfal for further details
  The destination SE must provide secure rfio (classic SEs don’t)
  GFAL API Description
    – http://grid-deployment.web.cern.ch/grid-deployment/documentation/LFC_DPM/gfal/html
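GFAL API code snippet
  EGEE Tutorial, Taipei, 1 May 2006
  Examples in gLite3 User Guide (Appendix F)

    #include <fcntl.h>       /* O_RDONLY */
    #include <sys/stat.h>    /* struct stat */
    #include "gfal_api.h"    /* gfal_open(), gfal_stat(), gfal_read(), gfal_close() */

    int fd, cod_ex;
    struct stat file_stat;

    fd = gfal_open(file_ref, O_RDONLY, 0644);            /* open the remote file */
    cod_ex = gfal_stat(file_ref, &file_stat);            /* fetch its size */
    ...
    cod_ex = gfal_read(fd, buffer, file_stat.st_size);   /* read the whole content */
    ...
    cod_ex = gfal_close(fd);                             /* release the descriptor */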
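Since GFAL mirrors the POSIX calls, the same pattern can be tried on a local file first. The sketch below uses only plain POSIX functions, so it runs without any grid middleware; read_whole_file is a hypothetical helper name. Swapping stat/open/read/close for their gfal_ counterparts is the porting path the slides suggest, but that substitution is an assumption here and is not tested against a real SE.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a whole file into a malloc'd, NUL-terminated buffer.
 * Returns NULL on any error. On success *out_len holds the number of
 * bytes read and the caller must free() the buffer. */
char *read_whole_file(const char *path, size_t *out_len)
{
    assert(path != NULL && out_len != NULL);

    struct stat st;
    if (stat(path, &st) != 0)            /* same role as gfal_stat() */
        return NULL;

    int fd = open(path, O_RDONLY);       /* same role as gfal_open() */
    if (fd < 0)
        return NULL;

    size_t size = (size_t)st.st_size;
    char *buffer = malloc(size + 1);
    if (buffer == NULL) {
        close(fd);
        return NULL;
    }

    /* read() may return fewer bytes than requested, so loop until done. */
    size_t done = 0;
    while (done < size) {
        ssize_t n = read(fd, buffer + done, size - done);
        if (n < 0) {
            free(buffer);
            close(fd);
            return NULL;
        }
        if (n == 0)                      /* unexpected end of file */
            break;
        done += (size_t)n;
    }

    buffer[done] = '\0';
    close(fd);                           /* same role as gfal_close() */
    *out_len = done;
    return buffer;
}
```

Unlike the slide's snippet, this version checks every return code and loops on short reads, both of which matter more, not less, when the file sits on a remote Storage Element.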
Metadata on the GRID
  Metadata is data about data
  On the EGEE Grid: information about files
    – Describes files
    – Locate files based on their metadata
  You may have 1000s of files, being shared with other researchers
    – Either: you all access data by remembering lfns (or guids…)
        ...and hope you know what is in the file…
    – Or: have a metadata catalogue
        Allow selection of files based on metadata
  Metadata is fundamental to e-research
AMGA Implementation
  AMGA: ARDA Metadata Grid Application
    – ARDA: A Realisation of Distributed Analysis for LHC
        Hundreds of millions of files
        No special security requirements
        Protection against DoS attacks
  Now part of gLite middleware
    – Official Metadata Service for EGEE
    – Also available as standalone component
  Expanding user community
    – HEP, Biomed, UNOSAT…
Metadata concepts
  Schema: a set of attributes. Defines the structure of the metadata
      Attribute 1: name 1, type 1
      Attribute 2: name 2, type 2
      …
  Collection: a set of entries
      Entry 1
      Entry 2
      Entry 3
      …
  Entries: the objects (e.g. files) that need to be described with metadata
Metadata concepts
  [Figure: a metadata catalog holding four schema/collection pairs (Schema and Collection, Schema 2 and Collection 2, Schema 3 and Collection 3, Schema 4 and Collection 4). Each schema lists Attribute 1: name 1, type 1; Attribute 2: name 2, type 2; … Each collection lists Entry 1, Entry 2, Entry 3, …]
Metadata Concepts
  Some Concepts
    – Metadata: list of attributes associated with entries
    – Attribute: name/value pair with type information
        Type: the type (int, float, string, …)
        Name: the name of the attribute
        Value: value of an entry's attribute
    – Schema: a set of attributes
    – Collection: a set of entries associated with a schema
    – Think of schemas as tables, attributes as columns, entries as rows
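The "tables, columns, rows" picture can be written down directly. The sketch below is a toy model and not the AMGA client API: every function name is hypothetical, all attribute values are stored as strings for brevity (real attributes carry types such as int or float), and the attribute names lfn and description follow the AMGA example used elsewhere in this deck. The second description string is shortened for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N_ATTRS 2

/* Schema: the attribute names, i.e. the columns of the table.
 * Simplification: every attribute is a string in this sketch. */
static const char *const schema[N_ATTRS] = { "lfn", "description" };

/* Entry: one row, holding one value per schema attribute. */
typedef struct {
    const char *values[N_ATTRS];
} entry;

/* Index of attribute `name` in the schema, or -1 if it is not defined. */
int attr_index(const char *name)
{
    assert(name != NULL);
    for (int i = 0; i < N_ATTRS; i++)
        if (strcmp(schema[i], name) == 0)
            return i;
    return -1;
}

/* Select by metadata: return the lfn of the first entry in the collection
 * whose attribute `attr` equals `value`, or NULL if nothing matches. */
const char *find_lfn_by(const entry *coll, size_t n,
                        const char *attr, const char *value)
{
    assert(coll != NULL && value != NULL);
    int col = attr_index(attr);
    int lfn_col = attr_index("lfn");
    if (col < 0 || lfn_col < 0)
        return NULL;
    for (size_t i = 0; i < n; i++)
        if (strcmp(coll[i].values[col], value) == 0)
            return coll[i].values[lfn_col];
    return NULL;
}

/* Collection: a set of entries sharing the schema above (illustrative data). */
static const entry demo_collection[] = {
    { { "lfn:/grid/gilda/sipos/maps/hungary", "Map of Hungary" } },
    { { "lfn:/grid/gilda/sipos/temp/data",    "Temperature values" } },
};
```

Here find_lfn_by(demo_collection, 2, "description", "Map of Hungary") returns the logical file name of the matching entry, which is the selection-by-metadata step that spares users from remembering lfns.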
Implementation of the concept in AMGA
  The collection is a directory on the AMGA file system
  A schema is a table in a relational database. One schema is associated with each directory of the file system
  Files in an AMGA directory are entities described by metadata
  The content of AMGA files is irrelevant. Metadata is stored in the DB records. A DB record is stored for each file
  Collections can be nested
  [Figure: AMGA server with collection /grid/sipos/run2. Labels: Input1, Input2, images.
    Schema: lfn varchar(100), description varchar(200)
    Sub-Schema: lfn varchar(100), description varchar(200), X_res int, Y_res int
    Example records:
      lfn:/grid/gilda/sipos/maps/hungary “Map of Hungary”
      lfn:/grid/gilda/sipos/temp/data “Temperature values of Hungarian cities, ”]
An example: AMGA and LFC in UNOSAT
  ◘ LFC Catalogue
      ➸ Mapping of LFN to TURL
  ◘ UNOSAT requires
      ➸ The user gives certain coordinates (x, y, z) as input data
      ➸ As output, the user wants the satellite image file for downloading
  ◘ The ARDA Group assists us in setting up the AMGA tool for UNOSAT
  [Figure. Components: ARDA APP, AMGA, Oracle DB, LFC, Storage Element, SRM. Labels: Metadata (x,y,z); LFN; TURL.]
During practicals 1: LFC and LCG utils
  List directory
  Create a local file then upload it to an SE and register with a logical name (lfn) in the catalogue
  Create a duplicate in another SE
  List the replicas
  [Figure. Components: “User interface”, LCG File Catalogue (LFC), Storage Element 1, Storage Element 2. Labels: lfc-*; lcg-*.]
During practicals 1: LFC and LCG utils
  List directory
  Create a local file then upload it to an SE and register with a logical name (lfn) in the catalogue
  Create a duplicate in another SE
  List the replicas
  Create a second logical file name for a file
  Download a file from an SE to the UI
  [Figure. Components: “User interface”, LCG File Catalogue (LFC), Storage Element 1, Storage Element 2. Labels: lfc-*; lcg-*; ?]
During practicals 2: GFAL examples
  Write a file to an SE
  Read a file from an SE
  Submit the reader code as a job into GILDA and read the file remotely
  [Figure. Components: “User interface” (GFAL writer, GFAL reader), Computing Element (GFAL reader), Storage Element 1.]
During practicals 3: AMGA examples
  Create metadata collections
  Manage metadata schemas
  …

    $ mdclient
    Connecting to amga.ct.infn.it:
    ARDA Metadata Server
    Query> help commands
    Query> help command_name
Please go to the web page for this practical
Spare slides follow (could be used after the practical)
LFC Catalog commands
  Summary of the LFC Catalog commands
    lfc-chmod       Change access mode of the LFC file/directory
    lfc-chown       Change owner and group of the LFC file/directory
    lfc-delcomment  Delete the comment associated with the file/directory
    lfc-getacl      Get file/directory access control lists
    lfc-ln          Make a symbolic link to a file/directory
    lfc-ls          List file/directory entries in a directory
    lfc-mkdir       Create a directory
    lfc-rename      Rename a file/directory
    lfc-rm          Remove a file/directory
    lfc-setacl      Set file/directory access control lists
    lfc-setcomment  Add/replace a comment
Summary of lcg-utils commands
  Replica Management
    lcg-cp   Copies a grid file to a local destination
    lcg-cr   Copies a file to an SE and registers the file in the catalog
    lcg-del  Deletes one file
    lcg-rep  Replication between SEs and registration of the replica
    lcg-gt   Gets the TURL for a given SURL and transfer protocol
    lcg-sd   Sets file status to “Done” for a given SURL in an SRM request
Summary of fts client commands
  FTS client
    glite-transfer-submit        Submit a transfer job: needs at least source and destination SURL
    glite-transfer-status        Given one or more job IDs, query their status
    glite-transfer-cancel        Delete the transfer with the given job ID
    glite-transfer-list          Query the status of all of the user’s jobs; supports options for query restrictions
    glite-transfer-channel-list  Show all available channels; detailed info only if the user has admin privileges
LFC server
  If a site acts as a central catalog for several VOs, it can have either:
    – One LFC server, with one DB account containing the entries of all the supported VOs. You should then create one directory per VO.
    – Several LFC servers, each having a DB account containing the entries for a given VO.
  Both scenarios have consequences for the handling of database backups
  Minimum requirements (first scenario)
    – 2 GHz processor with 1 GB of memory (not a hard requirement)
    – Dual power supply
    – Mirrored system disk