Day-to-day testing & troubleshooting


1 Day-to-day testing & troubleshooting
Marco Serra - INFN Roma

2 outline
basic hints to test a grid site
miscellaneous information for a site administrator
(few) typical problems

3 site status: monitoring tools
To get a quick idea of the site status you may want to use a monitoring tool (ganglia, nagios, ...).
Monitoring gives you the real-time status of your site and speeds up checklist procedures:
  hardware status, CPU usage, disk usage, ...
Monitoring gives a history of resource usage and downtime:
  post-incident analysis, problem correlation, ...
You may define alarms for specific conditions:
  no connectivity, load, daemons not running, ...
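For a small site without a full monitoring tool, alarms of the kind listed above can be approximated with a few lines of shell run from cron. The following is only a minimal sketch: the function names `check_disks` and `check_daemon`, the 90% threshold and the daemon name in the usage line are illustrative assumptions, not part of ganglia, nagios or the grid middleware.

```shell
#!/bin/bash
# Hypothetical helpers for a cron-driven alarm check (names are assumptions).

# check_disks LIMIT
# Reads `df -P` style output on stdin and prints one ALARM line for every
# file system whose usage is above LIMIT percent.
# Limitation: mount points containing spaces are not handled.
check_disks() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)          # "95%" -> "95"
            if (use + 0 > limit + 0)
                printf "ALARM: %s is %d%% full\n", $6, use
        }'
}

# check_daemon NAME
# Prints an ALARM line when no process with exactly that name is running.
check_daemon() {
    pgrep -x "$1" > /dev/null 2>&1 || echo "ALARM: daemon $1 is not running"
}
```

A possible use is `df -P | check_disks 90; check_daemon pbs_server`, with the output mailed to the site administrator by cron.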

4 site status: monitoring tools (2)
The monitoring points of the previous slide still apply, with one question added:
how many machines are there at your site?
... not so many, so I like simple tests ...

5 site status: “digging in the dark”
user: "my job is not running, what is wrong?"
Where is the right place to search? There are different software layers:
  OS (ssh, nfs, ...)*
  batch system
  grid middleware (globus, EDG, LCG, security, ...)
  experiment software*
* problems may come from "external" sources:
  file system full (log files, users' core dumps, ...)
  Atlas release not available (not installed, or nfs?)

6 site status: simple tests
To identify a problem, try to test the single layers:
  ping, telnet, scp, ... (OS, firewall, ...)
  qsub (batch system)
  globus-job-run, gridftp (globus)
  ldapsearch
  edg-job-list-match / edg-job-submit (EDG/EDT/LCG)
  lcg-cp

7 site status: simple tests (2)
ping, telnet, scp, ...
  firewall port open (from an "external" host):
    telnet my_SE 8080
  ssh keys (scp from my_WN to my_CE without a password):
    scp my_file my_CE:.
qsub (from a user account)
    qsub my_test.sh
globus-job-run
    globus-job-run my_CE:2119/jobmanager-lcgpbs -q short /bin/hostname
gridftp
    globus-url-copy file://`pwd`/test gsiftp://my_SE/tmp/test
ldapsearch
    ldapsearch -x -H ldap://my_CE:2135 -b "mds-vo-name=local,o=grid"
edg-job-list-match / edg-job-submit
    edg-job-list-match -r my_CE Test.jdl
lcg-cp
    lcg-cp --vo dteam file://`pwd`/test gsiftp://my_SE/tmp/test
    (or lcg-cr ...)
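The single-layer tests of this slide can be chained so that the first failing layer is reported and the higher layers are not tried. This is a sketch under stated assumptions: `run_layer` is a hypothetical helper, not a middleware command, and `my_CE`, `my_SE` and `my_test.sh` are the placeholders already used on the slide.

```shell
#!/bin/bash
# Hypothetical helper, not part of any grid middleware.
#
# run_layer LABEL COMMAND [ARGS...]
# Runs COMMAND quietly, prints "OK: LABEL" or "FAILED: LABEL",
# and returns 0 on success, 1 on failure.
run_layer() {
    local label=$1
    shift
    if "$@" > /dev/null 2>&1; then
        echo "OK: $label"
    else
        echo "FAILED: $label"
        return 1
    fi
}

# Example chain, lowest layer first; "&&" stops at the first failure.
# Replace the placeholders with real host names before running it.
#
#   run_layer "OS/firewall"        ping -c 1 my_CE &&
#   run_layer "batch system"       qsub my_test.sh &&
#   run_layer "globus"             globus-job-run my_CE:2119/jobmanager-lcgpbs -q short /bin/hostname &&
#   run_layer "information system" ldapsearch -x -H ldap://my_CE:2135 -b "mds-vo-name=local,o=grid"
```

Testing the lowest layer first means that a failure message points at the layer to inspect, instead of at every layer built on top of it.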

8 your site is tested every day: stability
If your nodes are in the LCG infrastructure, every day an automatic test checks the site status.
Note: NOT all sites are in LCG (!); moreover there are also INFN-GRID tests.
These tests check all the main functionalities (job submission, data management, information system, ...).
If there are problems you will receive a ticket -> read the problem report/ticket carefully.

9 LCG ticket: one example
Date: Mon, 21 Feb :40: (CET)
From: "noreply [Sophie Lemaitre]"
To: Sophie Lemaitre Italian ROC
Subject: [IT-ROC] [task #1731] /opt/edg/etc/profile.d/edg-rgma-env.sh cannot be found.

This is an automated notification sent by LCG Savannah.
It relates to: task #1731, project LCG2 sites
=======================================================
OVERVIEW of task #1731:
URL: <
Summary: /opt/edg/etc/profile.d/edg-rgma-env.sh cannot be found.
Project: LCG2 sites
Submitted by: slemaitr
Submitted on: 2005-Feb-21 15:40
Should Start On: 2005-Feb-21 00:00
Should be Finished on: 2005-Feb-24 00:00
Category: INFN-CATANIA
Priority: 5 - Normal
Item Group: R-GMA problem
Status: None
Assigned to: rocitaly
Percent Complete: 0%
Action taken: Mail to site admin
Open/Closed: Open
Person contacted: site + ROC
Response:

10 LCG ticket: some help
Date: Mon, 21 Feb 2005 16:40:48 +0100 (MET)
From: CIC on Duty
To:
Subject: [IT-ROC] task # RGMA at INFN-CATANIA

Dear Site Administrators,
According to Sites Functional Tests (used to have name TestZone tests, URL is
/opt/edg/etc/profile.d/edg-rgma-env.sh cannot be found.
Could you have a look at it please?
Please see the following pages:
(section "Adding an R-GMA server to the SE" in the Admin's Guides on GOC wiki page
where you find instructions about RGMA setup and configuration.
Thank you
The EGEE CIC on Duty team

11 INFN/LCG Monitoring & Information pages
INFN & LCG offer web pages with: site test results, monitoring, configuration hints, ...
Grid services monitoring: GRIDICE
LCG monitoring tools: links to site test results ("GIIS monitoring")
LCG Grid Operation Center (GOC) home: main access point for LCG GOC info (news, monitoring, operation, ...)

12 Installation Guides, Documentations, FAQ
Quick reference pages for middleware configuration and troubleshooting
INFN pages:
  -> Documents
  Knowledge base
LCG/EGEE:
  -> Admin's Guides
  Troubleshooting Guide

13 your site in the Information System (IS)
IS components are distributed in your site:
  CE GRIS, SE GRIS, CE GIIS -> CE BDII, (GRIDICE GRIS)
And in the upper layer:
  regional BDII, VO BDII, ...
  ... where your site is made available to "client applications" (RB,)
Sanity checks of the IS content for your site are useful to detect different problems (also in info-provider scripts):
  my site is not in the infrastructure, wrong CPU number, disk space, ...
  experiment software tag
Simple options to query:
  direct ldapsearch query
  many graphical browsers ( , )

14 simple ldap query
ldapsearch -x -H ldap://t2-ce-01.lnl.infn.it:2135 -b "mds-vo-name=infn-lnl,o=grid"

GlueCEStateFreeCPUs: 0
GlueCEStateRunningJobs: 25
GlueCEStateStatus: Production
GlueCEStateTotalJobs: 25
GlueCEStateWaitingJobs: 0
GlueCEStateWorstResponseTime:
GlueCEPolicyMaxCPUTime:
GlueCEPolicyMaxRunningJobs: 40
GlueCEPolicyMaxTotalJobs: 40
GlueHostApplicationSoftwareRunTimeEnvironment: VO-cms-OSCAR_3_4_0
GlueHostApplicationSoftwareRunTimeEnvironment: VO-cms-CMKIN_3_1_0_dar
GlueHostApplicationSoftwareRunTimeEnvironment: VO-cms-OSCAR_2_4_5_dar
GlueHostOperatingSystemName: Redhat
GlueHostOperatingSystemRelease: legacysmp
GlueHostOperatingSystemVersion: 1 SMP Fri Feb 20 10:12:55 PST 2004
GlueHostProcessorClockSpeed: 1004
GlueHostProcessorModel: Pentium III (Coppermine)
GlueHostProcessorVendor: GenuineIntel
GlueSEAccessProtocolType: gsiftp
GlueSEAccessProtocolPort: 2811
GlueSEAccessProtocolSupportedSecurity: GSI
GlueChunkKey: GlueSEUniqueID=t2-se-02.lnl.infn.it
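For a quick sanity check it is often enough to pull one attribute out of the ldapsearch output shown on this slide. The following is a minimal sketch: `glue_attr` is a hypothetical helper name, and it assumes plain `name: value` lines, so folded (wrapped) LDIF lines and base64 values (`name:: ...`) are not handled.

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption).
#
# glue_attr NAME
# Reads ldapsearch output on stdin and prints the value of every
# "NAME: value" line, one value per line.
glue_attr() {
    awk -v name="$1" '
        index($0, name ": ") == 1 {
            # Skip the attribute name, the colon and the space.
            print substr($0, length(name) + 3)
        }'
}

# Example against the CE of this slide:
#
#   ldapsearch -x -H ldap://t2-ce-01.lnl.infn.it:2135 \
#       -b "mds-vo-name=infn-lnl,o=grid" | glue_attr GlueCEStateStatus
```

The same function lists the published experiment software tags when called with `GlueHostApplicationSoftwareRunTimeEnvironment`, which makes a missing tag easy to spot.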

15 log files
log files are scattered in different directories:
  gridftp: /var/log/globus-gridftp.log
  gatekeeper: /var/log/globus-gatekeeper.log (log-rotate)
  other edg middleware components: /opt/edg/var/log
  torque/openpbs: /var/spool/pbs -> server_logs sched_logs mom_logs
  (... many others for RB, RLS, ...)
and sometimes information is also logged in the system log: /var/log/messages ->
  Feb 22 14:23:18 t2-ce-01 GRAM gatekeeper[19644]: "/C=IT/O=INFN/OU=Personal Certificate/L=Milano/CN=Guido mapped to atlassgm (18943/1307)
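When a user reports a problem, the gatekeeper lines in the system log show which local account the certificate was mapped to. The following is a sketch that assumes the log line format of the example above; `mapped_accounts` is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption).
#
# mapped_accounts DN_FRAGMENT
# Reads syslog lines on stdin, keeps the GRAM gatekeeper lines that contain
# DN_FRAGMENT, and prints the local account named after "mapped to".
mapped_accounts() {
    grep -F "GRAM gatekeeper" |
        grep -F -- "$1" |
        sed -n 's/.* mapped to \([^ ]*\).*/\1/p'
}

# Example:
#
#   mapped_accounts "CN=Guido" < /var/log/messages | sort | uniq -c
```

Piping the result through `sort | uniq -c`, as in the example comment, counts how often each local account was used for that certificate.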

16 security, firewall ports, ... hacker
LCG security info: ( INFN: -> certificate expiration)
Site open ports (some info may be obsolete ...)
hackers and other friends: chkrootkit ( )
automatic security scan in your site?

17 site downtime
If a site will be unavailable for upgrade/maintenance/..., this must be inserted in the INFN and LCG downtime calendars.
INFN: -> Downtime Advices
LCG: the site admin must be registered in the GOC db

18 “typical” problems
nfs (... cannot find experiment software)
full disks (... gridftp copy stuck)
job stuck in the batch system (... CPU time)
sync (black hole effect)
info-provider script (... some info missing in the Information System)
jobs are not coming to my site ... site is not in the IS or ?
I cannot submit to site xyz
  /etc/grid-security/grid-mapfile
  /opt/edg/etc/lcmaps/ gridmapfile & groupmapfile
performance optimization (example: different queue priorities)
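For the "I cannot submit to site xyz" case, a first check is whether the user's certificate subject (DN) appears in the grid-mapfile at all. The following is a sketch that assumes the usual one-entry-per-line layout `"DN" account`; `gridmap_lookup` is a hypothetical helper name and the DN in the example comment is made up.

```shell
#!/bin/bash
# Hypothetical helper (the name is an assumption).
#
# gridmap_lookup DN [FILE]
# Prints the local account (or pool account entry) mapped to DN.
# FILE defaults to /etc/grid-security/grid-mapfile.
# Assumed line format: "DN" account
gridmap_lookup() {
    local file=${2:-/etc/grid-security/grid-mapfile}
    # The DN is passed through the environment so that awk does not
    # interpret backslashes in it.
    DN="$1" awk '
        BEGIN { dn = "\"" ENVIRON["DN"] "\"" }
        index($0, dn) == 1 {
            rest = substr($0, length(dn) + 1)
            if (rest ~ /^[ \t]/) {       # DN must be followed by whitespace
                sub(/^[ \t]+/, "", rest)
                print rest
            }
        }' "$file"
}

# Example (made-up DN):
#
#   gridmap_lookup "/C=IT/O=INFN/OU=Personal Certificate/L=Roma/CN=Test User"
```

An empty result means that the DN is not mapped at this site, so the next step is to check how the grid-mapfile is generated rather than the batch system.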

19 getting help / reporting problems
M. Verlato's talk - Friday
INFN ticketing system ( -> Ticketing System)
LCG bug submission: savannah (

20 Questions ???

