ESRIN Grid Workshop Tutorial


ESRIN Grid Workshop Tutorial
Introduction to Grid Computing
Frascati, 3 February 2005
www.eu-egee.org
Data Services
Presented by Julian Linford
Based on INFN-GRID/EGEE User Tutorial
EGEE is a project funded by the European Union under contract IST-2003-508833

Overview
- Introduction on Data Management (DM)
  - General concepts
  - Some details on transport protocols
  - Data management operations
  - Files & replicas: name convention
- File catalogs
  - Cataloging requirements and catalogs in EGEE/LCG
  - RLS file catalog
  - LCG file catalog
- DM tools: overview
  - Data Management CLI and API (lcg_utils)
- Advanced concepts
  - Advanced utilities: CLI & APIs
- Conclusions
Grid Data Management - 2

Data Management: general concepts
- A uniform approach to
  - facilitate distribution of data throughout the Grid
  - provide common tools and services to handle files on the Grid
- Granularity is at the “file” level (no data “structures”)
- Files are stored in appropriate storage resources (large disks or archive systems)
  - Normally associated with a site’s computing resources
  - Each CE normally has a ‘CloseSE’ configured
- Data Management treats the storage device as a “black box”
  - hides the internals of the storage resource
  - hides the details of transfer protocols

Data Management: general concepts
- A Grid file is READ-ONLY (at least in EGEE/LCG)
  - It cannot be modified
  - It can be deleted (so it can be replaced)
- Files can be any type of data (text, binary, data, programs)
- High-level Data Management tools
  - Standard approach, with automation that deals with
    - different transport layer details
    - different site storage configurations
- Low-level tools expose these differences
  - More details for the user to handle
    - Details of the transport layer
    - Details of the Storage Element implementation
  - Only really useful in “non-standard” situations

Some details on protocols
- The basic data transfer protocol between hosts is “gridFTP” (gsiftp)
  - secure and efficient data movement
  - extends the standard FTP protocol with
    - Public-key-based Grid Security Infrastructure (GSI) support
    - Third-party control of data transfer
    - Parallel data transfer
- Other data access protocols are available
  - file protocol: for local file access
  - rfio protocol
  - gsidcap protocol
- SRM provides standard access to storage devices

Data Management operations
- Upload a file to the grid
  - User needs to store data in an SE (from a UI)
  - Application needs to store data in an SE (from a WN)
  - User needs to store the application (to be retrieved and run from a WN)
  - For small files the InputSandbox can be used (see WMS lecture)
[Diagram: a UI and several Grid components (CEs and SEs); label: Installation tool (Tango)]

Data Management operations
- Download files from the grid
  - User needs to retrieve data stored in an SE
    - For small files produced on the WN the OutputSandbox can be used
  - Application needs to copy data locally (onto the WN) and use them
    - The application itself must be downloaded onto the WN and run
[Diagram: a UI and several Grid components (CEs and SEs); label: spark …]

Data Management operations
- Replicate a file on several SEs
  - Load balancing of shared computing resources
    - Often a job needs to run at a site where a copy of the input data is present
    - The JDL InputData attribute allows this
  - Performance improvement in data access
    - Several applications might need to access the same file concurrently
  - Redundancy of key files provides backup
[Diagram: a UI and several Grid components (CEs and SEs)]

Data management operations
- Data Management means movement and replication of files across/on grid elements
- Grid DM tools/applications/services can be used for all kinds of files; HOWEVER
  - Data Management focuses on “large” files
    - large means greater than ~20 MB
    - typically on the order of hundreds of MB
  - Tools/applications/services are optimized to deal with large files
- Small files can be efficiently transferred using Workload Management (WM)
  - User can send programs & data to the WN using the InputSandbox
  - User can retrieve data generated by a job (on the WN) using the OutputSandbox
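The size rule of thumb above can be written down as a small decision helper. This is only an illustrative sketch: the function name is hypothetical, the ~20 MB figure is the approximate boundary quoted on the slide (taken here as binary megabytes, which is an assumption), and real sites may apply their own sandbox limits.

```python
# Approximate boundary from the slide: "large" means greater than ~20 MB.
# Assumption: binary megabytes; the slide does not say which unit is meant.
LARGE_FILE_THRESHOLD_BYTES = 20 * 1024 * 1024


def choose_transfer_path(size_bytes: int) -> str:
    """Suggest how to move a file of the given size (hypothetical helper).

    Returns "sandbox" for small files (InputSandbox/OutputSandbox) and
    "storage-element" for large files handled by the Data Management tools.
    """
    if size_bytes < 0:
        raise ValueError("file size cannot be negative")
    if size_bytes > LARGE_FILE_THRESHOLD_BYTES:
        return "storage-element"
    return "sandbox"
```

For instance, a 19-byte text file would go through the sandbox, while a 300 MB dataset would be stored on an SE.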

Files & replicas: Name Convention
- Globally Unique Identifier (GUID)
  - A non-human-readable unique identifier for a file, e.g. “guid:f81d4fae-7dec-11d0-a765-00a0c91e6bf6”
- Site URL (SURL) (or Physical/Site File Name (PFN/SFN))
  - The location of the actual file on a storage system, e.g. “sfn://lxshare0209.cern.ch/data/alice/ntuples.dat”
- Logical File Name (LFN)
  - An alias created by a user to refer to some file, e.g. “lfn:cms/20030203/run2/track1”
- Transport URL (TURL)
  - Temporary locator of a replica plus access protocol, understood by an SE, e.g. “gsiftp://lxshare0209.cern.ch//data/alice/ntuples.dat”
[Diagram: Logical File Names 1..n map to one GUID, which maps to Physical File SURLs 1..n]
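The four kinds of names can be told apart by their prefix. The sketch below classifies a name using only the prefixes that appear in the examples above; the function name is hypothetical, and the set of TURL protocols is an assumption taken from the protocols slide (gsiftp, file, rfio, gsidcap). Other SURL schemes may exist and are not handled here.

```python
# Assumption: TURL schemes are limited to the access protocols named in
# this tutorial; a real deployment may support others.
TURL_PROTOCOLS = {"gsiftp", "file", "rfio", "gsidcap"}


def classify_grid_name(name: str) -> str:
    """Return "GUID", "LFN", "SURL" or "TURL" for a Grid file name."""
    if name.startswith("guid:"):
        return "GUID"
    if name.startswith("lfn:"):
        return "LFN"
    if name.startswith("sfn://"):
        return "SURL"
    scheme, separator, _rest = name.partition("://")
    if separator and scheme in TURL_PROTOCOLS:
        return "TURL"
    raise ValueError(f"unrecognised Grid file name: {name!r}")
```

Applied to the slide's own examples, "lfn:cms/20030203/run2/track1" is classified as an LFN and the gsiftp URL as a TURL.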

Replica Manager
- The Replica Manager keeps track of files on the Grid storage resources
- To track our files on the Grid we use a Replica Catalogue
- Potentially, millions of files need to be registered and located
  - Requirement for performance
- A distributed architecture might be desirable
  - scalability
  - prevents a single point of failure
  - Site managers need to change file locations autonomously

Replica Catalogs in EGEE/LCG
- Access to the file catalog
  - The DM tools and APIs and the WMS interact with the catalog
    - Hide catalogue implementation details
  - Lower-level tools allow direct catalogue access
- Replica Location Service (RLS)
  - Catalogs in use in LCG-2
  - Replica Metadata Catalog (RMC) + Local Replica Catalog (LRC)
  - Some performance problems detected during LCG Data Challenges
- New LCG File Catalog (LFC)
  - Deployment in January 2005
  - Coexistence with RLS and migration tools provided
  - Better performance and scalability
  - Provides new features: security, hierarchical namespace, transactions...

File Catalogs: The RLS
- LRC:
  - Stores GUID-SURL mappings
  - Accessible by edg-lrc CLI + API
- RMC:
  - Stores LFN-GUID mappings
  - Accessible by edg-rmc CLI + API
[Diagram: in the RLS, the RMC maps Logical File Names 1..n to a GUID, and the LRC maps the GUID to Physical File SURLs on SE1 and SE2]
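The two-catalog split can be pictured as two lookup tables chained together. The following is a toy in-memory model, not the RLS implementation: the class and method names are hypothetical, and it only illustrates that resolving an LFN takes one RMC lookup (LFN to GUID) followed by one LRC lookup (GUID to SURLs).

```python
class ToyRLS:
    """Toy model of the RLS: an RMC table plus an LRC table (hypothetical)."""

    def __init__(self):
        self.rmc = {}  # RMC: LFN -> GUID
        self.lrc = {}  # LRC: GUID -> list of SURLs

    def add_alias(self, lfn: str, guid: str) -> None:
        """Record an LFN as an alias of a GUID (RMC side)."""
        self.rmc[lfn] = guid

    def register_replica(self, guid: str, surl: str) -> None:
        """Record one physical replica of a GUID (LRC side)."""
        self.lrc.setdefault(guid, []).append(surl)

    def list_replicas(self, lfn: str) -> list:
        """Resolve LFN -> GUID -> SURLs; raises KeyError for unknown LFNs."""
        guid = self.rmc[lfn]
        return list(self.lrc.get(guid, []))
```

Several LFNs may point at the same GUID, so adding a second alias makes the same replicas reachable under both names.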

Possible Improvements
- Fix performance and scalability problems
  - Progress indicators for large queries
  - Timeouts and retries from the client
- More features
  - User-exposed transaction API (+ auto rollback on failure of a mutating method call)
  - Hierarchical namespace and namespace operations (for LFNs)
  - Integrated GSI Authentication + Authorization
  - Access Control Lists (Unix permissions and POSIX ACLs)
  - Checksums
- Interaction with other components
  - Support Oracle and MySQL database back ends
- Security
  - VOMS will be integrated

Data management tools
- Replica manager: lcg-* commands + lcg_* API
  - Provide (all) the functionality needed by the EGEE/LCG user
  - Combine file transfer and cataloging as an atomic transaction
  - Ensure consistent operations on catalogues and storage systems
  - Offer a high-level layer over technology-specific implementations
- Based on the Grid File Access Library (GFAL) API
  - Discussed in the SE section
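The idea of treating transfer and cataloging as one atomic step can be sketched as "copy, then register, and undo the copy if registration fails". The code below is a conceptual illustration with in-memory stand-ins: the function name, the dictionary used as storage, and the `register` callback are all hypothetical and are not how the lcg-* tools are implemented.

```python
def copy_and_register(storage: dict, register, surl: str, data: bytes) -> None:
    """Store data under surl, then register it; roll back on failure.

    storage  -- in-memory stand-in for a Storage Element (SURL -> bytes)
    register -- callable taking the SURL; stands in for the catalog update
    """
    storage[surl] = data  # step 1: the "transfer"
    try:
        register(surl)  # step 2: the catalog registration
    except Exception:
        # Registration failed: remove the copy so that storage and
        # catalog stay consistent, then report the original error.
        del storage[surl]
        raise
```

With this shape, a caller never observes a file that is on the storage system but missing from the catalog because of a failed registration.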

DM CLIs & APIs: Old EDG tools
- Old versions of the EDG CLIs and APIs are still available
  - File & replica management
    - edg-rm
    - Implemented (mostly) in Java
  - Catalog interaction (only for EDG catalogs)
    - edg-lrc
    - edg-rmc
    - Java and C++ APIs
- Use discouraged
  - Worse performance (slower)
  - New features added only to lcg_utils
  - Less general than GFAL and lcg_utils

lcg_utils: Replica management commands
  lcg-cp   Copies a Grid file to a local destination
  lcg-cr   Copies a file to an SE and registers the file in the LRC
  lcg-del  Deletes one file (either one replica or all replicas)
  lcg-rep  Copies a file from SE to SE and registers it in the LRC
  lcg-sd   Sets the file status to “Done” in a specified request

lcg_utils: Catalog interaction commands
  lcg-aa   Adds an alias in the RMC for a given GUID
  lcg-gt   Gets the TURL for a given SURL and transfer protocol
  lcg-la   Lists the aliases for a given LFN, GUID or SURL
  lcg-lg   Gets the GUID for a given LFN or SURL
  lcg-lr   Lists the replicas for a given LFN, GUID or SURL
  lcg-ra   Removes an alias in the RMC for a given GUID
  lcg-rf   Registers an SE file in the LRC (optionally in the RMC)
  lcg-uf   Unregisters a file residing on an SE from the LRC

Gathering information: lcg-infosites
  [scampana@grid019:~]$ lcg-infosites --vo gilda se
  *************************************************************
  These are the related data for gilda: (in terms of SE)
  Avail Space(Kb)  Used Space(Kb)  SEs
  ----------------------------------------------------------
  1570665704       576686868       grid3.na.astro.it
  225661244        1906716         grid009.ct.infn.it
  523094840        457000          grid003.cecalc.ula.ve
  1570665704       576686868       testbed005.cnaf.infn.it
  15853516         1879992         gilda-se01.pd.infn.it

lcg_utils CLI: usage example

We have a local file in our UI in Catania:
  [scampana@grid019:~]$ ls -l important-file.txt
  -rw-r--r-- 1 scampana users 19 Oct 31 17:09 important-file.txt

Upload the file to Naples (Italy):
  [scampana@grid019:~]$ lcg-cr --vo gilda -l lfn:simone-important \
      -d grid3.na.astro.it file://`pwd`/important-file.txt
  guid:08d02e56-bdf6-4833-a4da-e0247c188242

The file is effectively there:
  [scampana@grid019:~]$ lcg-lr --vo gilda lfn:simone-important
  sfn://grid3.na.astro.it/flatfiles/SE00/gilda/generated/2004-10-31/file4c7c2ad6-4d93-4cd2-be24-bf4239f58208

Let’s replicate it to Merida now:
  [scampana@grid019:~]$ lcg-rep --vo gilda \
      -d grid003.cecalc.ula.ve lfn:simone-important
  [scampana@grid019:~]$ lcg-lr --vo gilda lfn:simone-important
  sfn://grid003.cecalc.ula.ve/flatfiles/SE00/gilda/generated/2004-10-31/file39568d15-e873-4f17-9371-b8862ae77c36
  sfn://grid3.na.astro.it/flatfiles/SE00/gilda/generated/2004-10-31/file4c7c2ad6-4d93-4cd2-be24-bf4239f58208

Delete all the replicas in the storage elements:
  [scampana@grid019:~]$ lcg-del --vo gilda -a lfn:simone-important
  [scampana@grid019:~]$ lcg-lr --vo gilda lfn:simone-important
  lcg_lr: No such file or directory

IMPORTANT
- The lcg_utils (both the CLI and the API described later) need to access the Information System (BDII).
- The name of the BDII host used by lcg_utils is specified in the environment variable LCG_GFAL_INFOSYS.
- REMEMBER THAT, ESPECIALLY WHEN PERFORMING DATA MANAGEMENT OPERATIONS FROM THE WN.
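When these commands are driven from a script, it can help to build the argument lists in one place. The helpers below are a hedged sketch: their names are hypothetical, they only assemble argument lists and do not run anything, and they use only the options that appear in the session above (--vo, -l, -d, -a). Consult the man pages on the UI for the full option set.

```python
def lcg_cr_argv(vo: str, lfn: str, dest_se: str, local_path: str) -> list:
    """Arguments to upload a local file to dest_se and register it as lfn."""
    if not local_path.startswith("/"):
        raise ValueError("local_path must be absolute")
    return ["lcg-cr", "--vo", vo, "-l", lfn, "-d", dest_se,
            "file://" + local_path]


def lcg_rep_argv(vo: str, dest_se: str, lfn: str) -> list:
    """Arguments to replicate an already registered file to dest_se."""
    return ["lcg-rep", "--vo", vo, "-d", dest_se, lfn]


def lcg_lr_argv(vo: str, lfn: str) -> list:
    """Arguments to list the replicas of a file."""
    return ["lcg-lr", "--vo", vo, lfn]


def lcg_del_all_argv(vo: str, lfn: str) -> list:
    """Arguments to delete all replicas of a file (-a)."""
    return ["lcg-del", "--vo", vo, "-a", lfn]
```

Such a list could be handed to `subprocess.run` on a machine where the lcg-* tools are installed and a valid proxy exists; both are assumptions about the environment.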

JDL Data Management Attributes
- InputSandbox (optional)
  - List of files on the UI that will be automatically sent to the WN before execution
  - InputSandbox = {"myscript.sh", "/tmp/cc.sh"};
- OutputSandbox (optional)
  - List of files on the WN that will be automatically retrieved by the “edg-job-get-output” command on the UI
  - OutputSandbox = {"std.out", "std.err", "image.png"};

JDL Replica Manager Attributes
- InputData (optional)
  - A string or a list of strings representing the Logical File Name (LFN) or Grid Unique Identifier (GUID)
  - The Resource Broker will select the “best” CE (i.e. one which has the replicas stored on a ‘close’ SE)
  - InputData = {"lfn:mytestfile", "guid:135b7b23-4a6a-11d7-87e7-9d101f8c8b70"};
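The data-driven CE selection can be illustrated with a toy ranking. This is a simplified sketch and not the Resource Broker's algorithm: the function name and the scoring rule (count how many input files have a replica on one of the CE's close SEs) are assumptions, and the host names in any call are placeholders.

```python
def rank_ces(input_lfns, replicas, close_ses):
    """Rank CEs by how many input files sit on one of their close SEs.

    input_lfns -- list of LFNs named in InputData
    replicas   -- dict: LFN -> set of SE host names holding a replica
    close_ses  -- dict: CE name -> set of SE host names close to that CE
    CEs with no matching replica at all are left out.
    """
    scores = {}
    for ce, ses in close_ses.items():
        hits = sum(1 for lfn in input_lfns if replicas.get(lfn, set()) & ses)
        if hits:
            scores[ce] = hits
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda ce: (-scores[ce], ce))
```

A real broker also weighs queue length, free slots and user requirements, none of which are modelled here.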

JDL Replica Manager Attributes
- DataAccessProtocol (mandatory if InputData has been specified)
  - The protocol which the job running on the WN will use for accessing files listed in InputData
  - Supported protocols are currently gridftp, file and rfio
  - DataAccessProtocol = {"file", "gridftp", "rfio"};

JDL Replica Manager Attributes
- OutputSE (optional)
  - URI of a Storage Element (SE)
  - The Resource Broker will select the “best” CE (i.e. one that has OutputSE as a ‘close’ SE)
  - OutputSE = "grid009.ct.infn.it";

JDL Data Management Attributes
- OutputData (optional)
  - This attribute allows the user to ask for the automatic upload and registration of datasets produced by the job on the Worker Node (WN)
  - This attribute contains the following three attributes:
    - OutputFile
    - StorageElement
    - LogicalFileName

JDL OutputData Attributes
- OutputFile (mandatory if OutputData has been specified)
  - Name of the file on the WN
- StorageElement (optional)
  - URI of the target Storage Element
- LogicalFileName (optional)
  - LFN to be associated with the specified file

JDL OutputData Example
  OutputData = {
    [
      OutputFile = "dataset1.out";
      LogicalFileName = "lfn:test-result1";
    ],
    [
      OutputFile = "dataset2.out";
      LogicalFileName = "lfn:test-result2";
      StorageElement = "grid009.ct.infn.it";
    ],
    [
      OutputFile = "dataset3.out";
    ]
  };

JDL without Replica Management
  Executable = "script.sh";
  Arguments = "Hello World";
  StdOutput = "stdout";
  StdError = "stderr";
  InputSandbox = {"script.sh"};
  OutputSandbox = {"stderr", "stdout"};

JDL with Replica Management
  Executable = "script.sh";
  Arguments = "Hello World";
  StdOutput = "stdout";
  StdError = "stderr";
  InputSandbox = {"script.sh"};
  OutputSandbox = {"stderr", "stdout"};
  InputData = "lfn:myoutdata.1";
  DataAccessProtocol = {"gridftp", "rfio"};
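A job description like the one above can be generated from a plain dictionary. The sketch below is illustrative only: the function names are hypothetical, it covers just string and list-of-string values as used in these examples, and it enforces the one rule stated earlier (DataAccessProtocol is mandatory when InputData is present). It is not a full JDL writer.

```python
def jdl_value(value) -> str:
    """Format a string or a list of strings as a JDL value."""
    if isinstance(value, str):
        if '"' in value:
            # Quoting rules are not covered in this tutorial, so refuse.
            raise ValueError("double quotes in values are not supported")
        return '"' + value + '"'
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(jdl_value(item) for item in value) + "}"
    raise TypeError(f"unsupported JDL value type: {type(value).__name__}")


def check_jdl(attrs: dict) -> None:
    """Apply the dependency rule between InputData and DataAccessProtocol."""
    if "InputData" in attrs and "DataAccessProtocol" not in attrs:
        raise ValueError("DataAccessProtocol is mandatory when InputData is set")


def render_jdl(attrs: dict) -> str:
    """Return the JDL text, one 'Name = value;' line per attribute."""
    check_jdl(attrs)
    return "\n".join(f"{name} = {jdl_value(value)};"
                     for name, value in attrs.items())
```

Rendering `{"Executable": "script.sh", "InputSandbox": ["script.sh"]}` gives the two corresponding lines of the example, in insertion order.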

Low-Level Commands
- globus-url-copy <sourceURL> <destURL>
  - low-level file transfer
  - URL may have file or gsiftp as protocol
- Interaction with RLS components
  - edg-lrc command (actions on LRC)
  - edg-rmc command (actions on RMC)
  - C++ and Java API for all catalog operations
    - http://edg-wp2.web.cern.ch/edg-wp2/replication/docu/r2.1/edg-lrc-devguide.pdf
    - http://edg-wp2.web.cern.ch/edg-wp2/replication/docu/r2.1/edg-rmc-devguide.pdf
- Avoid using the low-level CLI and API where possible
  - Risk: losing consistency between SEs and catalogues
  - REMEMBER: a file is in the Grid if it is BOTH
    - stored in a Storage Element
    - registered in the file catalog
- globus-url-copy is a low-level tool for secure copying. It should only be used if just a copy without registering the file in the RC is desired. Otherwise, the higher-level tools should be used.
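The "stored AND registered" rule suggests a simple consistency check between what the storage holds and what the catalog lists. The sketch below works on in-memory data only; the function name and the data shapes are hypothetical, and obtaining the real listings from an SE or a catalog is outside its scope.

```python
def find_inconsistencies(stored_surls: set, catalog: dict):
    """Compare storage contents with catalog entries.

    stored_surls -- set of SURLs actually present on the storage systems
    catalog      -- dict: GUID -> list of SURLs registered for that GUID
    Returns (orphaned, dangling) as sorted lists:
      orphaned -- stored but not registered (e.g. after a bare copy)
      dangling -- registered but not stored (e.g. after a bare delete)
    """
    registered = {surl for surls in catalog.values() for surl in surls}
    orphaned = sorted(stored_surls - registered)
    dangling = sorted(registered - stored_surls)
    return orphaned, dangling
```

A file copied with a low-level tool and never registered would show up in the first list; neither list is empty-by-construction, so both directions need checking.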

lcg_utils API
- High-level data management C API
  - Same functionality as the lcg_util command line tools
- Single shared library
  - liblcg_util.so
- Single header file
  - lcg_util.h
- (+ linking against libglobus_gass_copy_gcc32.so)

lcg_utils: Replica management
  int lcg_cp (char *src_file, char *dest_file, char *vo, int nbstreams, char *conf_file, int insecure, int verbose);
  int lcg_cr (char *src_file, char *dest_file, char *guid, char *lfn, char *vo, char *relative_path, int nbstreams, char *conf_file, int insecure, int verbose, char *actual_guid);
  int lcg_del (char *file, int aflag, char *se, char *vo, char *conf_file, int insecure, int verbose);
  int lcg_rep (char *src_file, char *dest_file, char *vo, char *relative_path, int nbstreams, char *conf_file, int insecure, int verbose);
  int lcg_sd (char *surl, int regid, int fileid, char *token, int oflag);

lcg_utils: Catalog interaction
  int lcg_aa (char *lfn, char *guid, char *vo, char *insecure, int verbose);
  int lcg_gt (char *surl, char *protocol, char **turl, int *regid, int *fileid, char **token);
  int lcg_la (char *file, char *vo, char *conf_file, int insecure, char ***lfns);
  int lcg_lg (char *lfn_or_surl, char *vo, char *conf_file, int insecure, char *guid);
  int lcg_lr (char *file, char *vo, char *conf_file, int insecure, char ***pfns);
  int lcg_ra (char *lfn, char *guid, char *vo, char *conf_file, int insecure);
  int lcg_rf (char *surl, char *guid, char *lfn, char *vo, char *conf_file, int insecure, int verbose, char *actual_guid);
  int lcg_uf (char *surl, char *guid, char *vo, char *conf_file, int insecure);

Bibliography
- General EGEE/LCG information
  - EGEE Homepage: http://public.eu-egee.org/
  - EGEE’s NA3: User Training and Induction: http://www.egee.nesc.ac.uk/
  - LCG Homepage: http://lcg.web.cern.ch/LCG/
  - LCG-2 User Guide: https://edms.cern.ch/file/454439//LCG-2-UserGuide.html
  - GILDA: http://gilda.ct.infn.it/
  - GENIUS (GILDA web portal): http://grid-tutor.ct.infn.it/

Bibliography
- Information on Data Management middleware
  - LCG-2 User Guide (chapters 3 and 6): https://edms.cern.ch/file/454439//LCG-2-UserGuide.html
  - Evolution of LCG-2 Data Management. J-P Baud, James Casey: http://indico.cern.ch/contributionDisplay.py?contribId=278&sessionId=7&confId=0
  - Globus 2.4: http://www.globus.org/gt2.4/
  - GridFTP: http://www.globus.org/datagrid/gridftp.html

Bibliography
- Information on EGEE/LCG tools and APIs
  - Manpages (on the UI)
    - lcg_utils: lcg-* (commands), lcg_* (C functions)
  - Header files (in $LCG_LOCATION/include)
    - lcg_util.h
  - CVS development (sources for commands)
    - http://isscvs.cern.ch:8180/cgi-bin/cvsweb.cgi/?hidenonreadable=1&f=u&logsort=date&sortby=file&hideattic=1&cvsroot=lcgware&path
- Information on other tools and APIs
  - EDG CLIs and APIs: http://edg-wp2.web.cern.ch/edg-wp2/replication/documentation.html
  - Globus: http://www-unix.globus.org/api/c/ , ...globus_ftp_client/html , ...globus_ftp_control/html