INFSO-RI Enabling Grids for E-sciencE ATLAS DDM Operations - II Monitoring and Daily Tasks Jiří Chudoba ATLAS meeting, , CNAF
Cloud Status
Scheduled and unscheduled downtimes
  – direct emails from sites
  – EGEE broadcasts
  – GOCDB
ARDA Dashboard pages
  – T0 to T1 transfers
  – all other transfers
VOBoxes at CERN
https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedDataManagementARDAMachines
Separate machines for db services and site services
CNAF:
  – dq2db-cnaf: db services
  – dq2cnaf: site services for CNAF and T2s
Access via the account ddmusr02
  – limited possibilities, check /tmp/dq2.log
Account ddmusr01 restricted to developers
  – why?
Installation done by developers
Panda Monitoring
Panda pages
  – DS on sites: erview=dslist
  – AOD: ode=listAODReplications
  – aborted DS: ode=listAbortedDatasets
  – M4: ode=listM4
More Monitoring
Stephane’s overview of disk occupancy: les_sites/all_sites/list_sites.html
Per data type version, DE cloud: muenchen.de/ddm/DE/summary.html
Site status monitored by GOC (gstat): http://goc.grid.sinica.edu.tw/gstat/RU-Protvino-IHEP/GIISQuery_Usage_store_.html
FTS monitoring
FTS 1.5
  – DE cloud:
  – SARA:
  – glite-transfer commands:
      glite-transfer-channel-list -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/ChannelManagement
Typical tasks
Errors spotted via monitoring
  – check reasons
  – contact site
  – possibly close the FTS channel
  – verify when corrected
  – open FTS channel
Deletion of Aborted DS
Mail sent to the people responsible for each T1 cloud (usually one per week)
Different procedures in different clouds
  – FZK
      Cedric’s script delete_dataset_aborted.py
      run regularly from a crontab
      uses: dq2.deleteDatasetReplicas, dq2.deleteDatasetSubscription, dq2.listFilesInDataset, lcg-del, lcg-uf
      list of DS read from a file
      part of MyFrameWork: /afs/cern.ch/user/s/serfon/public/ddm/Myframework
      will be published on Thursday
Deletion of Aborted DS II
SARA cloud: wrappers around dq2_cleanup
dq2_delete_aborted.sh

    #!/bin/sh
    # Delete aborted DS using dq2_cleanup.
    # Starts one dq2_cleanup instance per site.
    # Input via parameter.
    # Parameter 1: list of aborted datasets and sites
    # example:
    # ideal0_mc singlepart_gamma_Et60.simul.HITS.v _tid ITEP
    # Tested from lxplus, when grid and dq2 environment was set and
    # production proxy obtained like this:
    #
    # source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
    # voms-proxy-init -voms atlas:/atlas/Role=production -valid 96:0
    # source /afs/cern.ch/atlas/offline/external/GRID/ddm/pro03/dq2.sh

    SITES="SARADISK SARATAPE NIKHEF ITEP IHEP JINR SINP"
    DSLIST=$1

    # Launch one background cleanup job per site.
    for SITE in $SITES ; do
        dq2_delete_aborted_site.sh "$DSLIST" "$SITE" &
    done
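The fan-out done by the wrapper can also be previewed before anything is deleted. The following is a minimal sketch, not part of the SARA scripts: it assumes every line of the input list ends with the site name as its last whitespace-separated field, and the function name `group_by_site` and the dataset names in the usage note are hypothetical.

```python
from collections import defaultdict

# Site list copied from the wrapper script above.
SITES = ["SARADISK", "SARATAPE", "NIKHEF", "ITEP", "IHEP", "JINR", "SINP"]


def group_by_site(lines, sites=SITES):
    """Map each known site to the datasets listed for it.

    Assumption: each non-empty line is '<dataset name> <site>'.
    Lines whose last field is not a known site are skipped.
    """
    known = set(sites)
    grouped = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) < 2 or fields[-1] not in known:
            continue
        # Everything before the last field is treated as the dataset name.
        grouped[fields[-1]].append(" ".join(fields[:-1]))
    return dict(grouped)
```

Calling `group_by_site(open("aborted.txt"))` on a hypothetical list file gives a per-site overview of what each background job would be asked to clean up.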
Deletion of Aborted DS II
dq2_delete_aborted_site.sh

    #!/bin/sh
    # Delete aborted DS from a site using dq2_cleanup.
    #
    # Input
    # parameter 1: list of aborted DS
    # parameter 2: SITENAME

    DSLIST=$1
    SITE=$2
    DQ2_CLEANUP=/afs/cern.ch/atlas/offline/external/GRID/ddm/pro03/dq2_cleanup

    # One log file per site, list file, and start time.
    LOG="${SITE}_${DSLIST}_$(date +%Y%m%d_%H%M).log"
    touch "$LOG"

    # Select the lines mentioning this site and clean them up one by one.
    # $DS is left unquoted so that each field of the line becomes a separate argument.
    grep "$SITE" "$DSLIST" | while read -r DS ; do
        $DQ2_CLEANUP $DS >>"$LOG" 2>&1
    done
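One detail of the per-site script is that `grep $SITE` selects any line containing the site name anywhere, including inside a dataset name. The sketch below shows a stricter selection under the assumption that the site is the last field of each line. The helper names `log_name` and `cleanup_commands` are hypothetical, nothing is executed, and the arguments are simply passed in the same order as in the shell script; taking the basename of the list file for the log name is a deliberate deviation so that a list given with a directory path still produces a valid log file name.

```python
import os
from datetime import datetime

# Path taken from the shell script above.
DQ2_CLEANUP = "/afs/cern.ch/atlas/offline/external/GRID/ddm/pro03/dq2_cleanup"


def log_name(site, dslist, now):
    """Build <SITE>_<list file>_<YYYYmmdd_HHMM>.log, as the shell script does."""
    stamp = now.strftime("%Y%m%d_%H%M")
    return "%s_%s_%s.log" % (site, os.path.basename(dslist), stamp)


def cleanup_commands(lines, site, dq2_cleanup=DQ2_CLEANUP):
    """Return one argv list per line whose last field is exactly `site`.

    Each command carries the fields of the line as separate arguments,
    which is what the unquoted $DS does in the shell version.
    """
    commands = []
    for line in lines:
        fields = line.split()
        if fields and fields[-1] == site:
            commands.append([dq2_cleanup] + fields)
    return commands
```

The returned lists can be inspected as a dry run, or handed to `subprocess.run` one at a time once the selection looks right.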
Integrity checks
Cedric’s script
  – http://atlas-sw.cern.ch/cgi-bin/viewcvs-atlas.cgi/offline/Production/swing/scripts/ddm/integrity_check.py?view=log
  – some assumptions (/pnfs access)
Simple comparison of dumps:

    #!/bin/bash
    # Read files from a DPM dump and match them with an LFC dump.
    # DPM dump obtained by
    #   select name from Cns_file_metadata where gid=1307 and filesize > 0;

    DPM_DUMP=$1
    LFC_DUMP=$2
    FOUND=$1.found
    MISS=$1.miss

    while read -r FN FILEID; do
        # -F: treat the file name as a fixed string, not a regular expression
        if grep -qF -- "$FN" "$LFC_DUMP" ; then
            echo "$FN $FILEID" >> "$FOUND"
        else
            echo "$FN $FILEID" >> "$MISS"
        fi
    done < "$DPM_DUMP"
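The grep-per-file loop rescans the whole LFC dump for every DPM entry and accepts substring matches. A set-based comparison avoids both; the following is a minimal sketch under stated assumptions: the first field of each DPM line is the plain file name, and any whitespace-separated field of an LFC line may be a path or SURL whose last component is the file name. The function name `compare_dumps` and the example paths are hypothetical.

```python
import posixpath


def compare_dumps(dpm_lines, lfc_lines):
    """Split DPM dump lines into (found, missing) relative to the LFC dump.

    Assumption: files are identified by their base name in both dumps.
    """
    # Collect the base name of every field in the LFC dump once.
    lfc_names = set()
    for line in lfc_lines:
        for field in line.split():
            lfc_names.add(posixpath.basename(field))

    found, missing = [], []
    for line in dpm_lines:
        fields = line.split()
        if not fields:
            continue
        target = found if fields[0] in lfc_names else missing
        target.append(line.strip())
    return found, missing
```

With exact base-name matching, a DPM entry named `file.root` is not reported as found just because the LFC dump contains `myfile.root`, which a substring match would accept.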
Data loss
Only production files are treated
Get the list of lost files (provided by a sysadmin)
Remove information about lost files from the SE db (must be done by a sysadmin; see later talk)
Delete lost entries from the LFC catalogue
Locate replicas of lost files
  – If they exist, consider replication to the affected SE
  – If they do not exist, remove the lost files from datasets (DQ2 db) and pass the list of really lost files to the prodsys group
DB of lost files: will be part of DQ2
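The replica step of this procedure is a simple two-way split, sketched below. This is an illustration only: the function `triage_lost_files` and its input, a dictionary from file name to the storage elements holding a replica, are hypothetical stand-ins for whatever the catalogue lookup actually returns.

```python
def triage_lost_files(lost_files, replicas, affected_se):
    """Split lost files into those that can be re-replicated and those really lost.

    replicas: dict mapping file name -> iterable of SE names holding a copy
              (hypothetical snapshot of the catalogue).
    Returns (to_replicate, really_lost), where to_replicate is a list of
    (file name, source SE) pairs.
    """
    to_replicate, really_lost = [], []
    for name in lost_files:
        # A copy on the affected SE itself does not count as a surviving replica.
        others = sorted(set(replicas.get(name, ())) - {affected_se})
        if others:
            to_replicate.append((name, others[0]))
        else:
            really_lost.append(name)
    return to_replicate, really_lost
```

The second list corresponds to the files that would be removed from their datasets and reported to the prodsys group.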
T2 cleaning
remove_t2_in_t1.py by Stephane
  – A file is deleted if it fulfills all of the following conditions:
      The file in the T2 is replicated in the T1DISK of the same cloud
      The file belongs to a dataset which is not complete at the site
      The file belongs to a dataset (with _tid) which is not subscribed to the T2 site
      (Be careful: during the DDM migration to 0.3, all subscriptions are removed. You might delete too many files until subscriptions are put back.)
  – Since v1.4, you can provide a restricted list of datasets to be deleted (even if subscribed)
  – It first scans the LFC catalog at the Tier1 (it is possible to use a local dump of the LFC catalog), then scans the T2 entries in the LFC and deletes duplicated files on the T2 (using lcg-del)
To run:
    python remove_t2_in_t1.py LAPP LPC
or
    python remove_t2_in_t1.py LAPP LPC dataset1 dataset2
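The deletion conditions can be read as a single predicate. The sketch below is one interpretation of the slide, not the logic of remove_t2_in_t1.py itself: the function name and parameters are hypothetical, the `_tid` qualifier is not modelled, and it assumes that when a restricted dataset list is given, only datasets on that list are deleted and the subscription check is waived for them.

```python
def should_delete(dataset, in_t1_disk, complete_at_site, subscribed, restricted=None):
    """Decide whether a T2 file may be deleted.

    dataset          -- name of the dataset the file belongs to
    in_t1_disk       -- True if the file is replicated on the T1 disk of the same cloud
    complete_at_site -- True if the dataset is complete at the T2 site
    subscribed       -- True if the dataset is subscribed to the T2 site
    restricted       -- optional collection of dataset names to delete even if subscribed
    """
    # Never delete a file that has no T1 copy or that belongs to a complete dataset.
    if not in_t1_disk or complete_at_site:
        return False
    # With a restricted list, only the listed datasets are candidates.
    if restricted is not None:
        return dataset in restricted
    return not subscribed
```

Written this way, the warning about the DDM 0.3 migration is easy to see: while subscriptions are missing, `subscribed` is false for every dataset and the last line approves far more deletions than intended.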
More scripts
Framework in preparation