Cross-site problem resolution: focus on the reliable file transfer service
Gavin McCance
Service challenge technical meeting, 15 September 2006

Service issues
For the distributed transfer service there are two broad classes of failure that cause service degradation:
- Internal failures of software or hardware: FTS daemons lock up, disks fail, kernels panic, the file system fills up…
- Degradation or failure of an external service: MyProxy problems, information-system problems, Castor problems, Tier-1 SRM problems, networking problems.

Internal failures
Internal failures of the FTS: daemons lock up, disks fail, kernels panic, the file system fills up…
- Most of these problems can be handled by procedure and redundancy: RAID, redundant power supplies, operator procedures, and recovery procedures in case of server failure.
- There is already 24x7 support for known internal problems.
- For previously unknown problems there is expert backup, but only during weekday office hours.
- Is this enough for software still under active development? (Bearing in mind that changes in the behaviour of dependent software can and do affect FTS.)
A crash-recovery check along these lines is sketched below.
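As a concrete illustration of the "recovery procedures in case of server failure" point, here is a minimal watchdog sketch. It is hypothetical: the pid-file path, daemon name and restart command are assumptions rather than the real FTS deployment layout, and a pid check only catches crashed daemons, not locked-up ones.

```python
#!/usr/bin/env python
"""Minimal crash-recovery watchdog for an FTS agent daemon (sketch)."""
import os
import subprocess
import sys

PID_FILE = "/var/run/fts-agent.pid"                # assumed location
RESTART_CMD = ["service", "fts-agent", "restart"]  # assumed init script

def daemon_alive(pid_file):
    """Return True if the process recorded in pid_file is still running."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
        os.kill(pid, 0)  # signal 0 checks existence without sending anything
        return True
    except (IOError, ValueError, OSError):
        return False

if __name__ == "__main__":
    if not daemon_alive(PID_FILE):
        # Only catches dead processes; a locked-up daemon would need a
        # functional probe (e.g. submitting a test transfer) on top.
        subprocess.call(RESTART_CMD)
        sys.exit(1)
```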

External failures
Degradation or failure of an external service: MyProxy problems, information-system problems, Castor problems, Tier-1 SRM problems, networking problems.
Handling these has two stages: detection and resolution.

External failures: detection and obvious cases
Detection is 'easy':
- SAM tests are being improved to cover much of this.
- FTS 'knows' about all the transfers and can expose this information: the failure rate is measured. This still needs work to integrate with the current monitoring systems.
Resolution, when the source of the problem is obvious:
- Obvious problems can be sent to the owner of the problem semi-automatically (FTS raises an alarm), e.g. "30% of transfers failed because your SRM didn't answer."
- This is appropriate for problems that are clearly localised to one site: FTS knows where the problem is and sends an alarm, and the person holding that role calls the right people using the appropriate system and follows up.
A sketch of such a failure-rate check follows below.
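The failure-rate measurement above can be illustrated with a short sketch. It is assumption-laden: FTS does record transfer outcomes, but the record format, the 30% threshold and the notify hook here are all invented for the example.

```python
"""Per-site transfer failure-rate check (illustrative sketch)."""
from collections import defaultdict

FAILURE_THRESHOLD = 0.30  # assumed: alarm at 30% failures, as on the slide

def site_failure_rates(transfers):
    """transfers: iterable of (dest_site, succeeded) pairs from recent logs."""
    totals, failures = defaultdict(int), defaultdict(int)
    for site, ok in transfers:
        totals[site] += 1
        if not ok:
            failures[site] += 1
    return {site: failures[site] / float(totals[site]) for site in totals}

def raise_alarms(transfers, notify):
    """Call notify(site, rate) for every site at or above the threshold."""
    for site, rate in site_failure_rates(transfers).items():
        if rate >= FAILURE_THRESHOLD:
            notify(site, rate)

# Example: 2 of 3 transfers to SITE-B failed, so one alarm is raised.
if __name__ == "__main__":
    sample = [("SITE-A", True), ("SITE-B", False),
              ("SITE-B", False), ("SITE-B", True)]
    raise_alarms(sample,
                 lambda s, r: print("ALARM %s: %.0f%% failed" % (s, 100 * r)))
```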

External failures: harder cases
There are still many service degradations for which the cause is harder to determine:
- "Transfer failure" (gridFTP errors aren't always explicit about the cause)
- Networking problems and firewall issues
- Problems on SRM-copy channels (FTS doesn't have much logging about what went wrong there)
- "This channel seems to have been going slow for the last few hours"-type problems
These require 'expert' involvement and investigation: local experts on FTS, Castor and networking, and remote experts on the Tier-1 site's SRM and networking.
Of course, the goal is to move as much of this as possible to the 'automatic' system. Packaging 'canned problems' takes time and experience with each problem, and some things will never be moved. A sketch of such a canned-problem matcher follows below.
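One way to package 'canned problems' is a table of known error signatures, each mapped to a diagnosed cause. The patterns and cause labels below are assumptions for illustration; real gridFTP and SRM error strings vary by implementation.

```python
"""Canned-problem classifier for transfer error strings (sketch)."""
import re

# Ordered (pattern, diagnosed cause) pairs; the first match wins.
CANNED_PROBLEMS = [
    (re.compile(r"connection refused|no route to host", re.I),
     "networking/firewall"),
    (re.compile(r"srm.*timed?\s*out", re.I),
     "destination SRM not answering"),
    (re.compile(r"no space left|quota exceeded", re.I),
     "destination storage full"),
]

def classify(error_message):
    """Return a known cause for an error string, or None to call experts."""
    for pattern, cause in CANNED_PROBLEMS:
        if pattern.search(error_message):
            return cause
    return None  # not a canned problem: route to the expert team

if __name__ == "__main__":
    print(classify("SRM put request timed out after 180s"))  # canned
    print(classify("transfer failed: unexpected EOF"))       # None -> experts
```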

Who can we use?
For easy problems that require an alarm and follow-up we have the CIC-on-duty:
- A prerequisite is adequate monitoring.
- The CIC-on-duty can also handle problems that require a (small) bit of digging, provided the tools and procedures are there. Providing them needs to be our next development priority.
- …but the CIC-on-duty operates weekday office hours only (and the duty rotates across time zones).
We will not meet the WLCG 99% service availability target with this alone: two weekends of unattended downtime and you have already failed to meet the target (see the arithmetic below).
For harder problems, we require an expert team.
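The 99% claim is quick to verify. The back-of-envelope check below uses only the figures from the slide: a 99% yearly availability target and two fully unattended 48-hour weekends.

```python
"""Why two unattended weekends break a 99% yearly availability target."""
HOURS_PER_YEAR = 365 * 24                   # 8760 h
allowed_downtime = 0.01 * HOURS_PER_YEAR    # 99% target -> 87.6 h/year
two_weekends = 2 * 48                       # 96 h of unattended outage

print("allowed:      %.1f h/year" % allowed_downtime)  # 87.6
print("two weekends: %d h" % two_weekends)             # 96 > 87.6: target missed
```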

Core hours proposal
Core hours = weekday office hours.
Easy problems go to the CIC-on-duty:
- Alarms come from SAM and from FTS.
- Obvious alarms can be sent to the correct site immediately.
- Procedures and tools are provided to dig (a little) deeper if the problem is not immediately obvious.
- The monitoring needs to be the next FTS development priority.
Harder problems, and problems requiring cross-site coordination, go to an expert piquet team:
- The CIC-on-duty receives the alarm signalling service degradation.
- If the cause isn't obvious, they call the expert team to investigate.
- The expert team sends an alarm ticket to the site (including CERN) and investigates with the remote experts.
- The CIC-on-duty follows the issue up.
A sketch of this routing logic follows below.
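The two-tier routing above can be summarised in a few lines. This is a sketch under stated assumptions: classify() is the canned-problem matcher sketched earlier, and the ticketing and paging hooks are stand-ins for whatever systems the CIC-on-duty actually uses.

```python
"""Core-hours alarm routing (sketch): obvious causes go to the owning
site, everything else escalates to the expert piquet team."""

def route_alarm(alarm, classify, ticket_to_site, page_expert_team):
    cause = classify(alarm["error"])
    if cause is not None:
        # Obvious, localised problem: ticket the site that owns it;
        # the CIC-on-duty follows the ticket up.
        ticket_to_site(alarm["site"], cause, alarm)
    else:
        # Cause unclear or cross-site: call in the experts.
        page_expert_team(alarm)

# Example wiring with print stubs for the hooks:
if __name__ == "__main__":
    alarm = {"site": "SITE-B", "error": "SRM put request timed out"}
    route_alarm(
        alarm,
        classify=lambda e: "SRM not answering" if "timed out" in e else None,
        ticket_to_site=lambda site, cause, a: print("ticket ->", site, ":", cause),
        page_expert_team=lambda a: print("page experts:", a),
    )
```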

Out-of-hours proposal
The WLCG expert piquet team extends the coverage provided by the CIC-on-duty:
- The proposal is 12-hour coverage, including weekends.
- The flow is the same: the team should use the same monitoring and alarm-raising systems as the CIC-on-duty.
- The CIC-on-duty performs the follow-up during weekdays.
With this we accept up to 12 hours of unattended service degradation and recover using the transfer service's catch-up capacity (a rough estimate of what that costs is sketched below). Maybe this needs review during accelerator running.
The CIC-on-duty time-zone issues also need to be resolved.
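How long a 12-hour unattended degradation takes to absorb depends entirely on the spare transfer capacity available afterwards. The estimate below is illustrative; the nominal and peak rates are assumed numbers, not measured FTS channel capacities.

```python
"""Rough catch-up estimate after an unattended outage (sketch)."""

def catchup_hours(outage_h, nominal_mb_s, peak_mb_s):
    """Hours needed to clear the backlog from outage_h hours of downtime,
    given the headroom (peak - nominal) once transfers resume."""
    backlog = outage_h * nominal_mb_s   # data that should have flowed
    spare = peak_mb_s - nominal_mb_s    # capacity left over for catch-up
    if spare <= 0:
        raise ValueError("no headroom: the backlog can never be cleared")
    return backlog / spare

# Example: a 12 h outage at 150 MB/s nominal with a 200 MB/s peak leaves
# 50 MB/s of headroom, so it takes 36 h to work off the backlog.
if __name__ == "__main__":
    print("%.0f hours to catch up" % catchup_hours(12, 150, 200))
```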