Enabling Grids for E-sciencE (INFSO-RI-508833)
FTS failure handling
Gavin McCance
Service Challenge technical meeting, 21 June 2006

FTS failures
What to do if a channel starts to fail:
– It starts dropping files (after retries)
– It "drains" pretty fast: the default for a file is 3 retries with 10 minutes between tries
We sometimes observe intermittent problems at 1 am:
– The SRM goes bad and drops 10,000 files after multiple retries
– 35 minutes later it all works again, apparently without human intervention
– Does this get you out of bed or not?
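To make the draining rate concrete, here is a toy sketch (not FTS code) of the per-file retry policy quoted above: 3 retries with 10 minutes between tries, after which the file is dropped. The do_transfer callable is a stand-in for the real transfer machinery.

    import time

    MAX_RETRIES = 3            # default quoted on the slide: 3 retries per file
    RETRY_INTERVAL = 10 * 60   # default quoted on the slide: 10 minutes between tries

    def transfer_with_retries(do_transfer):
        """Attempt one file transfer, retrying on failure.

        With the defaults above a file is given up on roughly half an hour
        after its first attempt, which is why a 35-minute SRM glitch can
        drop every file that was active on the channel during that window.
        """
        if do_transfer():                  # first attempt
            return "Done"
        for _ in range(MAX_RETRIES):       # then up to 3 retries
            time.sleep(RETRY_INTERVAL)
            if do_transfer():
                return "Done"
        return "Failed"                    # the file is dropped from the job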

How to avoid "draining"
Make use of the FTS Hold state:
– Jobs that have failed after X retries go to Hold instead of Failed
– Twice-daily procedure: check your channel for Hold jobs, understand and fix the problems, then set the Hold jobs back to Pending for another retry
VO dependent:
– Not all VOs want this: some prefer to fail fast
– The retry policy and the Hold policy are configurable per VO
This has always been part of FTS:
– But we don't run the procedure at the moment
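A minimal sketch of that twice-daily rescue loop. The helpers list_jobs, problem_is_fixed, set_job_state and log_for_followup are hypothetical placeholders for whatever admin interface or CLI the site uses; the point is only the Hold → inspect → Pending flow.

    def rescue_held_jobs(channel, list_jobs, problem_is_fixed,
                         set_job_state, log_for_followup):
        """Operator procedure: inspect Hold jobs and resubmit the fixable ones."""
        for job in list_jobs(channel, state="Hold"):
            if problem_is_fixed(job):          # operator judgement after investigation
                set_job_state(job, "Pending")  # back in the queue for another retry
            else:
                log_for_followup(job)          # leave on Hold, chase the site/SRM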

Halting channels
On repeated failures:
– Some criteria to detect this: 100 files failed consecutively, say, and/or a 70% failure rate over the last 1000 files
Halt the channel and send an alarm:
– GOOD: you don't drain many jobs
– BAD: if it's 1 am, you get someone out of bed, since no transfers will run until someone restarts the channel (after fixing the problem), and you can't afford to lose 8 hours of transfers
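The detection criteria could look something like the sketch below; the thresholds (100 consecutive failures, 70% over the last 1000 files) are the example numbers from this slide, and halt_channel / send_alarm are placeholders for the real actions.

    from collections import deque

    WINDOW_SIZE = 1000          # look at the last 1000 finished files
    CONSECUTIVE_LIMIT = 100     # e.g. 100 files failed consecutively
    FAILURE_RATE_LIMIT = 0.70   # and/or 70% failure over the window

    recent = deque(maxlen=WINDOW_SIZE)   # True = success, False = failure
    consecutive_failures = 0

    def record_result(success, halt_channel, send_alarm):
        """Feed in each finished transfer; halt and alarm when the channel looks sick."""
        global consecutive_failures
        recent.append(success)
        consecutive_failures = 0 if success else consecutive_failures + 1

        failure_rate = recent.count(False) / len(recent)
        if consecutive_failures >= CONSECUTIVE_LIMIT or (
                len(recent) == WINDOW_SIZE and failure_rate >= FAILURE_RATE_LIMIT):
            halt_channel()   # GOOD: nothing more is drained
            send_alarm()     # BAD: someone has to get out of bed to restart it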

Halting channels
On repeated failure: stuttering halt
– Halt the channel for X minutes and send a notification
– Then retry; if it's still bad, send an alarm
– BAD: you drop more jobs on each retry if the problem is still there
– GOOD: the pause may be sufficient to allow the storage to recover from intermittent problems or overload
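A sketch of one round of the stuttering halt, assuming placeholder functions pause_channel, resume_channel, channel_healthy, notify and alarm; the pause length X is whatever the operations team chooses.

    import time

    PAUSE_MINUTES = 30   # "X minutes": the pause length is an operations choice

    def stuttering_halt_round(pause_channel, resume_channel, channel_healthy,
                              notify, alarm):
        """Pause, notify, retry; only raise the alarm if the problem persists."""
        pause_channel()
        notify(f"channel paused for {PAUSE_MINUTES} min after repeated failures")
        time.sleep(PAUSE_MINUTES * 60)
        resume_channel()             # retry: let transfers flow again
        if not channel_healthy():    # still failing after the pause...
            alarm("channel still failing after pause")   # ...send the alarm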

Combination?
Use the Hold state to limit the failures from "job drains":
– Jobs can always be rescued in the morning
– This is a daily operations procedure
Use the stuttering halt to prevent 8 hours of downtime for what was a 45-minute intermittent problem:
– Send a notification for the first couple of halts
– Send an alarm (get-out-of-bed) for repeated failures
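The escalation rule in the last two bullets might be as simple as this sketch; the threshold of two quiet halts and the notify/alarm hooks are illustrative assumptions.

    MAX_QUIET_HALTS = 2   # "first couple of halts" -> notification only

    def escalate(halt_count, notify, alarm):
        """Notification for the first couple of halts, get-out-of-bed alarm afterwards."""
        if halt_count <= MAX_QUIET_HALTS:
            notify(f"channel auto-paused ({halt_count} time(s)); will retry shortly")
        else:
            alarm(f"channel paused {halt_count} times -- manual intervention needed")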

Alarms and debugging
The FTS server at the Tier-0 has an alarm for you:
– How does your local monitoring know about this?
– A web page that your local monitoring system polls? Either binary ("it's good / it's bad"), or "X failures in the last hour" – you decide what's worth getting out of bed for
Once you're debugging:
– We need to expose the information the FTS has about what's going wrong
– It does have a lot of information, but we don't expose it well enough: performance figures, aggregate failure classes, log files of individual transfers, all the error messages we get back
– Work in progress to make this more accessible
– Critical for operations: this is why the FTS is there
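One way a site's local monitoring could pick the alarm up is by polling a status page published by the FTS server, as in the sketch below. The URL and the JSON fields are purely hypothetical (the slide only proposes that some such page exist), and the failure threshold is the "you decide" knob.

    import json
    import urllib.request

    # Hypothetical per-channel status page published by the FTS server at the Tier-0.
    STATUS_URL = "https://fts-t0.example.org/status/CERN-MYSITE.json"
    FAILURES_PER_HOUR_LIMIT = 50   # your call: what is worth getting out of bed for

    def check_channel():
        """Poll the (hypothetical) status page and decide whether to raise an alert."""
        with urllib.request.urlopen(STATUS_URL) as response:
            status = json.load(response)
        if status["state"] != "Active":                        # binary: good / bad
            return "alert: channel state is " + status["state"]
        if status["failures_last_hour"] > FAILURES_PER_HOUR_LIMIT:
            return "alert: {} failures in the last hour".format(
                status["failures_last_hour"])
        return "ok"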