ATLAS Suspension and Downtime Procedures
Graeme Stewart (for ATLAS Central Operations Team)
ATLAS Computing, WLCG Workshop, Prague 2009


1 ATLAS Suspension and Downtime Procedures
Graeme Stewart (for ATLAS Central Operations Team)

2 Production
- For ATLAS production we run shift teams who look at the state of sites
  - There’s almost 24-hour coverage, but not 24-hour expert support
[Figure annotations: "Number 1 problem site", "Number 2 problem site"]

3 Offline
- If the problems are diagnosed as site issues then
  - An ATLAS eLog entry is created
  - A GGUS ticket is sent to the site
    - This will be a directly routed ticket
  - The site is set offline if the problems are serious enough
- If the problem is well understood and resolved quickly by the site then the site will usually be set directly online again

4 Suspension and Revalidation
- If the problem is extended or not well explained, or the site has done, e.g., a major upgrade, then its queues can be put into test status while they are revalidated
  - Test status queues can pull only test jobs from PanDA!
  - Test jobs are usually small event generations, but they do run a full chain test
- If the site runs all its test jobs successfully it will be set online
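The offline, test and online states described on slides 3 and 4 can be read as a small state machine. The following is only a sketch under stated assumptions: the `Status` enum, the event labels and the `may_pull` helper are hypothetical names chosen for illustration and are not part of PanDA. The rule that an offline queue pulls nothing and an online queue pulls any job is also an assumption; the slides only state the rule for test status.

```python
from enum import Enum


class Status(Enum):
    ONLINE = "online"
    OFFLINE = "offline"
    TEST = "test"


# (current status, event) -> next status.
# Event names are hypothetical labels for the situations on the slides.
TRANSITIONS = {
    (Status.ONLINE, "serious_site_problem"): Status.OFFLINE,
    (Status.OFFLINE, "resolved_quickly"): Status.ONLINE,
    (Status.OFFLINE, "needs_revalidation"): Status.TEST,
    (Status.ONLINE, "needs_revalidation"): Status.TEST,
    (Status.TEST, "all_test_jobs_succeeded"): Status.ONLINE,
}


def next_status(current: Status, event: str) -> Status:
    """Return the new status, or the unchanged one if the event does not apply."""
    return TRANSITIONS.get((current, event), current)


def may_pull(status: Status, is_test_job: bool) -> bool:
    """Test-status queues pull only test jobs (from the slide).

    Assumption: online queues pull any job, offline queues pull none.
    """
    if status is Status.ONLINE:
        return True
    if status is Status.TEST:
        return is_test_job
    return False
```

For example, a queue that goes offline after a serious problem, is then put into test status, and passes all its test jobs ends up online again, while a failed revalidation simply leaves it in test status.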

5 Production System Functional Tests
- In addition to these very targeted site-specific tests we run weekly ‘production system functional tests’ for the whole cloud
  - Jobs are short evgens, similar to the site-targeted tests
- These are very useful, e.g., after a Tier-1 downtime
- Good test of whole cloud, especially if there is little or no production
  - These tests are becoming more routine

6 User Analysis: Ganga Robot
- For user analysis jobs there is a similar ‘standard candle’ analysis job sent every day through the Ganga framework
- If your site fails either of the last 2 tests it’s automatically blacklisted for ATLAS user analysis in EGEE
  - Distributed Analysis Shift Team will also raise GGUS tickets if a clear site problem is identified
[Figure labels: "Tests of different sw releases", "Tests of different storage areas"]
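The blacklisting rule on slide 6 (a failure in either of the last 2 tests) is simple enough to write down. This is a minimal sketch, not the Ganga Robot's actual code: the function name `is_blacklisted`, the boolean result encoding and the treatment of a site with no results yet are all assumptions made for illustration.

```python
from typing import Sequence


def is_blacklisted(results: Sequence[bool], window: int = 2) -> bool:
    """Apply the 'fails either of the last 2 tests' rule.

    `results` is ordered oldest to newest; True means the test passed.
    Assumption: a site with no recorded results is not blacklisted.
    """
    recent = results[-window:]
    return any(not passed for passed in recent)
```

Under this reading, one older failure followed by two passes clears the site, while a single failure in either of the two most recent tests blacklists it.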

7 DDM Dashboard
- Shifters also monitor the ATLAS DDM Dashboard
  - This monitors file transfer success rates across the grid

8 Drill Down to Sites
[Figure annotation: "Scheduled Downtime"]

9 Data Distribution Tests
- ATLAS also runs weekly Distributed Data Management Functional Tests
  - These tests distribute a small amount of /dev/random data to each ATLAS site according to the ATLAS computing model
  - As these run all the time, they test the system’s functionality even when there are no other activities

10 Failing Site?
- The procedure is the usual one
  - E-log (for us)
  - GGUS (for site)
- If the problem would affect MC production the site will also be taken offline for production
  - But often a broken SE means you can’t get input data anyway…
- If the problem is very grave then the site will be removed from the DDM site services machine and/or the subscription engine will be stopped for that cloud (e.g., T1 problems) because the failures swamp the dashboard
  - This prevents any transfer attempts to that site at all
    - Although DDM can try to get data from the site
  - It is a manual operation
    - So we don’t like to do it, because it’s easy to forget that a site/cloud was removed
  - This will improve soon with ATLAS Grid Information System (AGIS)
- After a period of suspension a cloud/site must succeed in DDM functional tests for 36 hours before being allowed to take ATLAS data
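The 36-hour rule at the end of slide 10 could be checked along the following lines. This is a sketch under assumptions: the slide does not say how "succeed for 36 hours" is measured, so here it is taken to mean an unbroken run of passing functional tests that began at least 36 hours ago. The names `streak_start` and `may_take_data` and the `(timestamp, passed)` record shape are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

REQUIRED = timedelta(hours=36)  # from the slide

Result = Tuple[datetime, bool]  # (time of test, passed?)


def streak_start(results: Iterable[Result]) -> Optional[datetime]:
    """Time of the first pass in the current unbroken run of passes, or None."""
    start = None
    for when, passed in sorted(results, key=lambda r: r[0]):
        if not passed:
            start = None       # any failure resets the run
        elif start is None:
            start = when       # first pass after a failure (or ever)
    return start


def may_take_data(results: Iterable[Result], now: datetime) -> bool:
    """True once the current passing run has lasted at least 36 hours."""
    start = streak_start(results)
    return start is not None and now - start >= REQUIRED
```

With this interpretation a single failed functional test restarts the 36-hour clock, which is stricter than a success-rate threshold; the real procedure may be more lenient.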

11 How do we know about downtimes?
- Going to the GOC to see if a site is in downtime is far too slow for shifters triaging dozens of problems
- There is a feed from the GOC to an ATLAS Grid Downtime Calendar
- Problems:
  - Extensions not shown
  - Downtimes can be marked for secondary services
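To illustrate why downtimes marked for secondary services are a problem for shifters, here is a sketch of a downtime lookup that keeps the affected service visible. Everything in it is hypothetical: the `Downtime` record, its fields, and the site and service names are placeholders for illustration, not the format of the GOC feed or of the ATLAS Grid Downtime Calendar.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List


@dataclass(frozen=True)
class Downtime:
    site: str        # hypothetical site name
    service: str     # hypothetical service label
    start: datetime
    end: datetime


def active_downtimes(entries: Iterable[Downtime], site: str,
                     when: datetime) -> List[Downtime]:
    """All declared downtimes covering `when` at `site`, for any service."""
    return [d for d in entries if d.site == site and d.start <= when < d.end]


def affects(entries: Iterable[Downtime], site: str, service: str,
            when: datetime) -> bool:
    """True only if the downtime is declared for the service we care about."""
    return any(d.service == service
               for d in active_downtimes(entries, site, when))
```

A calendar that only answers "is this site in downtime?" would behave like `active_downtimes` and flag a site whose storage is actually fine. Checking the service, as in `affects`, avoids that, but it cannot help with the other problem listed above: an extension that never reaches the feed leaves `end` too early.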

12 Communications
- From us to you
  - We primarily use GGUS tickets for problems
    - Direct ticketing of sites is generally used and is much preferred by us
  - We also use our cloud contacts and WLCG operations meetings for requests
    - E.g., requests to change space token setup
  - And we have operational mailing lists which sites should sign up to
    - In particular: atlas-project-adc-operations@cern.ch
  - And weekly ADC operations meetings, several jamborees a year
    - To which sites are not just welcome, but encouraged to come
- From you to us
  - You can use GGUS tickets
    - But responses may be slower as the ticket needs to be routed to the right people in ATLAS
  - Please do use your ATLAS cloud contacts
    - You should know who they are!
  - Or ask a question
    - On a mailing list
    - In a meeting, including the daily WLCG operations meeting

