Data Management The European DataGrid Project Team


1 Data Management The European DataGrid Project Team http://www.eu-datagrid.org

2 EDG DataManagement Tutorial - Problem Statement: How to connect User/Programs/Data?
- User
  - Logged in to a Grid “User Interface” (UI) machine, or
  - Logged in to a “desktop” machine
- Programs
  - On desktop
  - On UI
  - On Grid machines “god knows where” (GNW)
- Data
  - May need to supply (Grid or non-Grid) data to GNW programs
  - GNW program may generate data, which needs to be put somewhere safe
  - How do you retrieve it from somewhere safe?

3 Common Grid Data Management Tasks
- Dealing with Data Your Job Generates
  - Getting the data back to your desktop
  - Putting the data “on the Grid”
- Getting Data to your Job
  - Submitting data along with your job
  - Putting your data onto the Grid (from outside)
  - Sending your Grid job to your Grid data
- Moving Data on the Grid
- How to find your data if you don’t remember where you put it
- Example scripts and files: ~dgttutor/dm-tests/

4 Grid Data Management Tools
- Data transfer mostly through gsiftp
  - Like good old FTP, except that it uses grid authentication and authorization
  - No passwords!
  - Can also use multiple streams for faster transfer
- Resource Broker (RB) can send small amounts of data to/from jobs
- Replica Catalog (RC) keeps track of where the various copies of “grid datasets” are located
- edg-replica-manager uses gsiftp and the RC to manage instantiation, registration, and replication of grid datasets
- The RB can use the RC to find your data and send your job to it, if you tell the RB about the data you need

5 Grid Program -> Data on your desktop
- You can set up your job for “data pickup”
  - Job generates data in the current working directory on the worker node (WN)
  - At job end, the data files are placed in temp storage at the RB
  - You get them back via “dg-job-get-output”
- Key items:
  - You need to know the names of the files you want to get back
  - OutputSandbox = {"higgs.root","graviton.HDF"};
  - Not intended for large files (> hundred MB): storage is limited on the Resource Broker machine
- Example: output-sandbox.{jdl,sh}
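As a minimal sketch of the “data pickup” setup, the shell function below writes a JDL file that lists the files to bring back. The function name `write_pickup_jdl`, the script name `make-higgs.sh`, and the file name `pickup.jdl` are hypothetical. Only the `OutputSandbox` form comes from the slide; the other attributes are assumed to be what a typical job description also carries, so check them against the tutorial's own output-sandbox.jdl.

```shell
#!/bin/sh
# Write a small JDL file whose OutputSandbox names the files to retrieve.
# Usage: write_pickup_jdl JDLFILE SCRIPT OUTFILE...
# All names are illustrative; the attribute set is an assumption.
write_pickup_jdl() {
    jdl=$1; script=$2; shift 2
    # Start with the job's stdout/stderr, then append each requested file
    # as a quoted, comma-separated entry: "a","b",...
    list='"std.out","std.err"'
    for f in "$@"; do
        list="$list,\"$f\""
    done
    cat > "$jdl" <<EOF
Executable = "$script";
StdOutput = "std.out";
StdError = "std.err";
InputSandbox = {"$script"};
OutputSandbox = {$list};
EOF
}
```

For example, `write_pickup_jdl pickup.jdl make-higgs.sh higgs.root graviton.HDF` produces a description that can be submitted in the usual way; once the job has finished, the listed files come back through dg-job-get-output.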

6 Putting the data “on the Grid”
- Here we talk about a running Grid program whose output you want on the Grid. Two cases:
  - You let the program write output on the WN, and after the program finishes the job script moves the data to Grid storage
  - You arrange for the program to write directly to Grid storage
- In both cases, data is not really “on the Grid” until it is registered in the “replica catalog”

7 Grid-generated data to Grid storage I
- Your program generates data to some local file
- You have to know (or be able to figure out) what the local file name is
- Use the edg-replica-manager commands to
  - Put the data onto Grid storage
  - Register the data as a Grid dataset
- A few extras are needed
  - Some idea of where to put the data
  - A “logical file name” (LFN): a location-independent grid file name

8 GGDGS (I) Cont’d
- How to find out where to put data? You need to know which storage elements are out there
      ldapsearch -h lxshare0225.cern.ch -p 2170 -x -b \
        "Mds-vo-name=local,o=grid" '(objectclass=storageelement)' \
        seid
- The command that moves your data to the desired location and registers it in the replica catalog is edg-replica-manager-copyAndRegisterFile
      edg-replica-manager-copyAndRegisterFile \
        -s $(hostname)/$(pwd)/$DFILE -l $LFN -d $DEST_SE
- See cr-mov-reg.{sh,jdl} examples
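The copy-and-register step can be wrapped so that a job script notices failures. The following is a sketch under stated assumptions, not the tutorial's cr-mov-reg.sh: the function name `copy_and_register` and the `ERM_CAR` override variable are hypothetical, the options `-s`, `-l`, and `-d` are used exactly as on the slide, and the tool is assumed to return a non-zero exit status on failure.

```shell
#!/bin/sh
# Copy a local file to a storage element and register it, with basic checks.
# ERM_CAR names the command to run; it defaults to the tool from the slide
# and can be overridden (for example with "echo") for a dry run.
ERM_CAR=${ERM_CAR:-edg-replica-manager-copyAndRegisterFile}

# Usage: copy_and_register LOCALFILE LFN DEST_SE
# LOCALFILE is taken relative to the current directory.
copy_and_register() {
    dfile=$1; lfn=$2; dest_se=$3
    if [ ! -f "$dfile" ]; then
        echo "copy_and_register: no such file: $dfile" >&2
        return 2
    fi
    # Source is given as host plus path, following the slide's form.
    src="$(hostname)/$(pwd)/$dfile"
    if ! "$ERM_CAR" -s "$src" -l "$lfn" -d "$dest_se"; then
        echo "copy_and_register: transfer or registration failed for $lfn" >&2
        return 1
    fi
}
```

Running with `ERM_CAR=echo` prints the command line that would be executed, which is a cheap way to check the arguments before a real transfer.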

9 Grid-generated data to Grid storage II
- Your program generates data directly to a “close SE”
- Close means you can use normal file I/O to write it
- You have to use a brokerinfo command to find out what the close SE is (you don’t know where your job will go!) and what the directory is
- You write the data
- Use the edg-replica-manager commands to
  - Register the data as a Grid dataset
- An extra is needed
  - A “logical file name” (LFN): a location-independent grid file name

10 GGDGS II (cont’d)
- Restriction: the “local file name” has to be the same as the logical file name (at least the “base” name)
  - File on disk: /data/spool/123fred7; LFNs:
    - 123fred7 is OK
    - 123fred is not OK
    - fred7 is not OK
    - Skippy is not OK
    - spool/123fred7 is OK
- Logical file name must not already be in the catalogue
- You also probably want to check that the file doesn’t exist on disk before you start to write it
- Example files: cr-on-se-and-reg.{jdl,sh}
- Check if it was successful:
      edg-replica-manager-listReplicas -c /opt/edg/etc/tutor/rc.conf \
        -l whomp.119
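The naming restriction can be checked in the job script before attempting registration. The sketch below encodes only the rule illustrated by the examples on this slide (the LFN must equal one or more whole trailing path components of the file on disk); the function name `lfn_matches_file` is hypothetical, and the real catalogue may enforce additional constraints.

```shell
#!/bin/sh
# Return 0 if LFN is a trailing run of whole path components of FILE.
# Usage: lfn_matches_file FILE LFN
lfn_matches_file() {
    file=$1; lfn=$2
    [ -n "$lfn" ] || return 1
    case "$file" in
        */"$lfn") return 0 ;;   # LFN is a whole-component suffix of FILE
        *)        return 1 ;;
    esac
}

# Replay the slide's examples against /data/spool/123fred7.
for lfn in 123fred7 123fred fred7 Skippy spool/123fred7; do
    if lfn_matches_file /data/spool/123fred7 "$lfn"; then
        echo "$lfn is OK"
    else
        echo "$lfn is not OK"
    fi
done
```

Quoting `"$lfn"` inside the `case` pattern makes the shell treat it literally, so an LFN containing `*` or `?` is not interpreted as a wildcard.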

11 Submitting Data Along With Your Job
- This is fairly easy: use the Input Sandbox
- Careful: not a sandbox in the JavaScript sense
- InputSandbox = {"input-ntuple.root"};
- Example files: inp-sbox.{jdl,sh}

12 Moving Data Onto Grid from Outside
- This is almost identical to GGDGS I (Grid-generated data to Grid storage I)
- Use edg-replica-manager-copyAndRegisterFile
- Need to specify the rc.conf file (either with the RC_CONFIG_FILE variable or with the -c option) … defaults in /opt/edg/etc/ /rc.conf
- Remember the restrictions:
  - LFN and remote file name have to match
  - Source and destination files must include hostnames
      edg-replica-manager-copyAndRegisterFile -c rc.conf -l whomp.145 \
        -s $(hostname)/$(pwd)/gls -d gppse05.gridpp.rl.ac.uk

13 Having Grid Send Job to Your Data
- Need to have the data “on the Grid”, that is, listed in the RC
- Tell your job (JDL) about the grid data:
  - InputData = "LF:myfile.dat"
- Resource Broker puts info about data matching in the “brokerinfo” file on the remote execution node
- In your job execution script, use the “edg-brokerinfo” command (getselectedfile) to find the location of the job-local copy
- Example files: find-data.{jdl,sh}

14 Moving Data Around
- edg-replica-manager-replicateFile -c rc.conf -l -d -s
- Try the previous test (with dg-job-list-match): it should find a new site willing to accept your job

15 Finding Your Data
      ldapsearch -LLL -h grid-vo.nikhef.nl -p 10389 -x \
        -b "rc=EDGtutorialReplicaCatalog,dc=eu-datagrid,dc=org" \
        '(filename=jtdmtest1)' dn
- Shows the “dn”s of every entry where the selected “filename” exists

16 GDMP
- Tool for replication of large sets of files between sites
- Can do a lot with it
- Easy to get commands wrong
  - Can’t recover from certain errors
  - Possible to wreck the GDMP subsystem badly enough that remote sysadmins will have to make manual fixes
- Recommendation: do not use it unless you really need it!
- Analogy: you don’t normally use the “dd” command to copy files!

17 Gotchas
- edg-replica-manager commands
  - Error messages are not always on target
  - Be careful not to use commands in ways other than intended: error trapping is not good, and sometimes the command will do something, but not necessarily what you want
  - Build error checking and trapping into your job scripts
  - Remember the restrictions on LFN/PFN correspondence
- Replica catalog
  - Leaving out pieces of the command generally neither works nor provides helpful messages: type carefully!
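One way to build error checking into a job script is to route every step through a small helper and to verify results independently of the exit status. This is a generic shell sketch; the helper names `run_checked` and `require_file` are hypothetical and not part of the EDG tools.

```shell
#!/bin/sh
# Run a command, report a failure on stderr, and pass its exit status back.
# Usage: run_checked COMMAND [ARG...]
run_checked() {
    status=0
    "$@" || status=$?
    if [ "$status" -ne 0 ]; then
        echo "FAILED (exit $status): $*" >&2
    fi
    return "$status"
}

# Verify that a step really produced a non-empty file, since an exit
# status of 0 alone may not mean the command did what you wanted.
# Usage: require_file PATH
require_file() {
    if [ ! -s "$1" ]; then
        echo "missing or empty file: $1" >&2
        return 1
    fi
}
```

A job script might then read `run_checked ./my-program || exit 1` followed by `require_file higgs.root || exit 1`, where the program and file names stand in for your own.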

18 EDG Replica Catalog
- Based upon the Globus LDAP Replica Catalog
- Stores LFN/PFN mappings and additional information (e.g. file size):
  - Physical File Name (PFN): host + full path and file name
  - Logical File Name (LFN): logical name that may be resolved to PFNs
  - LFN : PFN = 1 : n
- Only files on storage elements may be registered
- Each VO has a specific storage dir on an SE
- Example PFN: lxshare0222.cern.ch/flatfiles/SE1/iteam/file1.dat
  - host: lxshare0222.cern.ch
  - storage dir: /flatfiles/SE1/iteam
- The LFN must be the full path of the file starting from the storage dir
  - LFN of the above PFN: file1.dat
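The LFN/PFN relationship in the example can be expressed as a small shell helper that strips the host and the VO storage dir from a PFN. This is a sketch of the rule as stated on the slide; the function name `pfn_to_lfn` is hypothetical, and the caller is assumed to know the storage dir for the VO.

```shell
#!/bin/sh
# Derive the LFN from a PFN by stripping the host and the VO storage dir.
# Usage: pfn_to_lfn PFN HOST STORAGE_DIR   (prints the LFN, or fails)
pfn_to_lfn() {
    pfn=$1; host=$2; dir=$3
    prefix="$host/${dir#/}"      # tolerate a leading slash on STORAGE_DIR
    prefix="${prefix%/}/"        # ensure exactly one trailing slash
    case "$pfn" in
        "$prefix"?*)
            printf '%s\n' "${pfn#"$prefix"}" ;;
        *)
            echo "pfn_to_lfn: $pfn is not under $prefix" >&2
            return 1 ;;
    esac
}

# The slide's example: prints file1.dat
pfn_to_lfn lxshare0222.cern.ch/flatfiles/SE1/iteam/file1.dat \
    lxshare0222.cern.ch /flatfiles/SE1/iteam
```

A file in a subdirectory of the storage dir keeps that subdirectory in its LFN, which matches the requirement that the LFN be the full path starting from the storage dir.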

19 globus-url-copy
- Low-level tool for secure copying
      globus-url-copy :// \
        ://
- Main protocols:
  - gsiftp: for secure transfer, only available on SE and CE
  - file: for accessing files stored on the local file system, e.g. on a UI or WN
      globus-url-copy file://`pwd`/file1.dat \
        gsiftp://lxshare0222.cern.ch/flatfiles/SE1/EDGTutorial/file1.dat
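Because a stray space or line break inside a URL makes the copy fail, it can help to assemble the two URL forms with helpers and print the command before running it. This is a sketch only; the function names `local_url` and `gsiftp_url` are hypothetical, and the host and path are the ones from the slide's example.

```shell
#!/bin/sh
# Build the two URL forms used with globus-url-copy.

# file URL for a name relative to the current directory.
# pwd starts with "/", so the result has the form file:///abs/path/name
local_url() {
    printf 'file://%s/%s\n' "$(pwd)" "$1"
}

# gsiftp URL from a host and an absolute path on that host.
gsiftp_url() {
    printf 'gsiftp://%s/%s\n' "$1" "${2#/}"
}

# Print the command instead of running it, so it can be inspected first.
echo globus-url-copy "$(local_url file1.dat)" \
    "$(gsiftp_url lxshare0222.cern.ch /flatfiles/SE1/EDGTutorial/file1.dat)"
```

Dropping the leading `echo` runs the transfer for real, which requires a valid grid proxy and the Globus client tools on the machine.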

20 The Replica Manager APIs
- (un)registerEntry(LogicalFileName lfn, FileName source)
  - Replica Catalogue operations only, no file transfer
- copyFile(FileName source, FileName destination, String protocol)
  - Allows for third-party transfer
  - Transfer between:
    - two Storage Elements, or
    - a Computing Element and a Storage Element
- Space management policies under development
  - All tools support parallel streams for file transfers

21 The Replica Manager APIs
- copyAndRegisterFile(LogicalFileName lfn, FileName source, FileName destination, String protocol)
  - Third-party transfer, but files can only be registered in the Replica Catalogue if the destination PFN contains a valid SE (i.e. one that is registered in the RC)
- replicateFile(LogicalFileName lfn, FileName source, FileName destination, String protocol)
- deleteFile(LogicalFileName lfn, FileName source)

