ATLAS Suspension and Downtime Procedures
Graeme Stewart (for ATLAS Central Operations Team)
ATLAS Computing WLCG Workshop, Prague 2009


Production
- For ATLAS production we run shift teams who look at the state of sites (a small ranking sketch follows below)
  - There is almost 24-hour coverage, but not 24-hour expert support
[Plot: production jobs by site, with the number 1 and number 2 problem sites highlighted]
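The shifters' day-to-day job starts with spotting which sites are failing the most production jobs. Below is a minimal Python sketch, not the actual ADCoS tooling, of how one might rank sites from a list of (site, status) job records; the record format and the site names are illustrative assumptions.

    # Minimal sketch (not the real shifter tools): rank sites by number of
    # failed production jobs so the worst offenders surface first.
    from collections import Counter

    def rank_problem_sites(job_records):
        """job_records: iterable of (site, status) pairs, e.g. from a PanDA job dump."""
        failures = Counter(site for site, status in job_records if status == "failed")
        return failures.most_common()

    jobs = [("SITE_A", "failed"), ("SITE_A", "finished"),
            ("SITE_B", "failed"), ("SITE_B", "failed"), ("SITE_C", "finished")]
    for rank, (site, n_failed) in enumerate(rank_problem_sites(jobs), start=1):
        print(f"Number {rank} problem site: {site} ({n_failed} failed jobs)")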

Offline
- If the problems are diagnosed as site issues then:
  - An ATLAS eLog entry is created
  - A GGUS ticket is sent to the site
  - The site is set offline if the problems are serious enough
- If the problem is well understood and resolved quickly by the site, then the site will usually be set directly online again (see the sketch below)
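As an illustration of the decision flow just described, here is a hypothetical Python sketch. The helper functions are stand-ins for the real eLog, GGUS and PanDA interfaces (whose APIs are not reproduced here); they only print what would happen.

    # Hypothetical sketch of the "offline" procedure; helpers are placeholders.
    def create_elog_entry(site, text):
        print(f"eLog: [{site}] {text}")

    def open_ggus_ticket(site, text):
        print(f"GGUS ticket to {site}: {text}")

    def set_site_status(site, status):
        print(f"PanDA queue for {site} set to {status}")

    def handle_site_issue(site, description, serious):
        create_elog_entry(site, description)     # record for ATLAS operations
        open_ggus_ticket(site, description)      # notify the site itself
        if serious:
            set_site_status(site, "offline")     # stop brokering production jobs there
        # If the problem is understood and fixed quickly, the site is simply
        # set directly online again:
        #     set_site_status(site, "online")

    handle_site_issue("SOME_SITE", "SRM timeouts on stage-in", serious=True)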

Suspension and Revalidation
- If the problem is extended, not well explained, or the site has done, e.g., a major upgrade, then its queues can be put into test status while they are revalidated
  - Test status queues can pull only test jobs from PanDA!
  - Test jobs are usually small event generations, but they do a full chain test:
    [Insert picture of test jobs here]
- If the site runs all its test jobs successfully it will be set online (a sketch of this check follows below)
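The revalidation criterion is simple: all of the site's test jobs must succeed before the queue is set online. A minimal sketch, not the real PanDA logic, assuming test job outcomes are available as a list of status strings:

    # Minimal sketch: decide whether a queue in test status can be set online.
    def ready_to_go_online(test_job_statuses):
        """test_job_statuses: states of the site's recent test jobs, e.g. ["finished", "failed"]."""
        return bool(test_job_statuses) and all(s == "finished" for s in test_job_statuses)

    print(ready_to_go_online(["finished", "finished", "finished"]))  # True  -> set online
    print(ready_to_go_online(["finished", "failed"]))                # False -> stay in test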

Production System Functional Tests
- In addition to these very targeted site-specific tests we run weekly 'production system functional tests' for the whole cloud
  - Jobs are similar short evgens to the site-targeted tests
- These are very useful, e.g., after a Tier-1 downtime
- Good test of the whole cloud, especially if there is little or no production

User Analysis: Ganga Robot
- For user analysis jobs there is a similar 'standard candle' analysis job sent every day through the Ganga framework
- If your site fails 3 tests in a row it's blacklisted for ATLAS user analysis (sketch below)
[Screenshots: Ganga Robot results, showing tests of different SW releases and tests of different storage areas]
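The blacklisting rule itself ("three failures in a row") is easy to state in code. This is a minimal sketch of that rule, not the actual Ganga Robot implementation; the boolean outcome list is an assumed input format.

    # Minimal sketch of the "3 failed tests in a row -> blacklist" rule.
    def blacklisted(results, n_consecutive=3):
        """results: chronological list of test outcomes, True = passed."""
        streak = 0
        for passed in results:
            streak = 0 if passed else streak + 1
            if streak >= n_consecutive:
                return True
        return False

    print(blacklisted([True, False, False, False]))   # True  -> blacklist for analysis
    print(blacklisted([False, True, False, False]))   # False -> still OK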

DDM Dashboard
- Shifters also monitor the ATLAS DDM Dashboard
  - This monitors file transfer success rates across the grid (see the sketch below)
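The key quantity shown on the dashboard is the transfer success rate per destination. The sketch below computes that from a list of transfer records; the record format is an assumption, not the dashboard's real schema.

    # Minimal sketch: per-site file transfer success rate, as displayed on the dashboard.
    from collections import defaultdict

    def transfer_success_rates(transfers):
        """transfers: iterable of (destination_site, succeeded) pairs."""
        totals, successes = defaultdict(int), defaultdict(int)
        for site, ok in transfers:
            totals[site] += 1
            successes[site] += int(ok)
        return {site: successes[site] / totals[site] for site in totals}

    print(transfer_success_rates([("SITE_A", True), ("SITE_A", False), ("SITE_B", True)]))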

Drill Down to Sites
[Screenshot: DDM Dashboard drill-down to individual sites, with a site shown in scheduled downtime]

Data Distribution Tests
- ATLAS also runs weekly Distributed Data Management Functional Tests
  - These tests distribute a small amount of /dev/random data to each ATLAS site according to the ATLAS computing model
  - As these run all the time, they test the system's functionality even when there are no other activities (see the sketch below)
- [Insert plots here when available…]
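For illustration, here is a minimal sketch of how one small file of random data for such a functional-test dataset could be produced. The real test datasets are generated by the DDM tooling; the file name and size here are illustrative assumptions, and os.urandom is used as a portable stand-in for reading /dev/random.

    # Minimal sketch: create one small file of random data for a test dataset.
    import os

    def make_test_file(path="functional_test.dat", size_mb=1):
        with open(path, "wb") as f:
            f.write(os.urandom(size_mb * 1024 * 1024))  # random payload
        return path

    print(make_test_file())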

Failing Site?
- The procedure is the usual one:
  - E-log (for us)
  - GGUS (for the site)
- If the problem would affect MC production the site will also be taken offline for production
  - But often a broken SE means you can't get input data anyway…
- If the problem is very grave then the site will be removed from the DDM site services machine and/or the subscription engine will be stopped for that cloud (e.g., T1 problems)
  - This prevents any transfer attempts to that site at all
  - It is a manual operation, so we don't like to do it, because it's easy to forget that a site/cloud was removed
- After a period of suspension a cloud/site must succeed in DDM functional tests for 36 hours before being allowed to take ATLAS data (a sketch of this check follows below)
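The re-inclusion rule stated above (36 hours of successful functional tests) can be expressed as a simple window check. This is a minimal sketch, assuming the functional-test results are available as (timestamp, succeeded) pairs; it is not the actual DDM operations code.

    # Minimal sketch of the 36-hour re-inclusion check for a suspended site/cloud.
    from datetime import datetime, timedelta

    def may_take_data(test_results, now, window_hours=36):
        """test_results: list of (timestamp, succeeded) for functional-test transfers."""
        cutoff = now - timedelta(hours=window_hours)
        recent = [ok for ts, ok in test_results if ts >= cutoff]
        return bool(recent) and all(recent)

    now = datetime(2009, 3, 23, 12, 0)
    results = [(now - timedelta(hours=h), True) for h in range(0, 40, 4)]
    print(may_take_data(results, now))  # True: 36 h of successful tests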

How do we know about downtimes?
- Going to the GOC to see if a site is in downtime is far too slow for shifters triaging dozens of problems
- There is a feed from the GOC to an ATLAS Grid Downtime Calendar (a sketch of such a feed consumer follows below)
- Problems:
  - Extensions are not shown
  - Downtimes can be marked for secondary services
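As an illustration of what the calendar feed has to do, here is a sketch of filtering downtime records into calendar entries. The record fields ("site", "start", "end", "severity") are assumptions for this example only; the real GOC feed format is not reproduced here.

    # Illustrative sketch only, not the real GOC feed or calendar code.
    from datetime import datetime

    def calendar_entries(downtimes, now):
        entries = []
        for d in downtimes:
            if d.get("severity") != "OUTAGE":   # skip e.g. at-risk / secondary-service entries
                continue
            if d["end"] < now:                  # already finished
                continue
            entries.append((d["site"], d["start"], d["end"]))
        return sorted(entries, key=lambda e: e[1])

    example = [{"site": "SITE_A", "severity": "OUTAGE",
                "start": datetime(2009, 3, 24, 8), "end": datetime(2009, 3, 24, 20)}]
    print(calendar_entries(example, datetime(2009, 3, 23, 12)))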

Communications
- From us to you
  - We primarily use GGUS tickets
    - Direct ticketing of sites is generally used and is much preferred by us
  - We also use our cloud contacts
    - E.g., requests to change space token setup
  - And we have operational mailing lists which sites should sign up to
    - In particular:
  - And weekly operations meetings, plus several jamborees a year
    - To which sites are welcome and encouraged to come
- From you to us
  - You can use GGUS tickets
    - But responses may be slower as the ticket needs to be routed to the correct ATLAS responsibles
  - Please do use your ATLAS cloud contacts
    - You should know who they are!
  - Or ask a question
    - On a mailing list
    - In a meeting