Olof.barring@cern.ch Miguel.coelho.santos@cern.ch Inter-site issues Olof.barring@cern.ch Miguel.coelho.santos@cern.ch.

Slides:

Advertisements

Similar presentations

Applications Area Issues RWL Jones GridPP13 – 5 th June 2005.

Advertisements

LCG Tiziana Ferrari - SC3: INFN installation status report 1 Service Challenge Phase 3: Status report Tiziana Ferrari on behalf of the INFN SC team INFN.

Storage Issues: the experiments’ perspective Flavia Donno CERN/IT WLCG Grid Deployment Board, CERN 9 September 2008.

EEC-681/781 Distributed Computing Systems Lecture 3 Wenbing Zhao Department of Electrical and Computer Engineering Cleveland State University

Applications Area Issues RWL Jones Deployment Team – 2 nd June 2005.

1 CMPT 471 Networking II DHCP Failover and multiple servers © Janice Regan,

US ATLAS Western Tier 2 Status and Plan Wei Yang ATLAS Physics Analysis Retreat SLAC March 5, 2007.

Data management at T3s Hironori Ito Brookhaven National Laboratory.

 Communication Tasks  Protocols  Protocol Architecture  Characteristics of a Protocol.

Data management for ATLAS, ALICE and VOCE in the Czech Republic L.Fiala, J. Chudoba, J. Kosina, J. Krasova, M. Lokajicek, J. Svec, J. Kmunicek, D. Kouril,

WLCG Service Report ~~~ WLCG Management Board, 1 st September

Byzantine fault-tolerance COMP 413 Fall Overview Models –Synchronous vs. asynchronous systems –Byzantine failure model Secure storage with self-certifying.

CERN Using the SAM framework for the CMS specific tests Andrea Sciabà System Analysis WG Meeting 15 November, 2007.

WLCG Service Report ~~~ WLCG Management Board, 9 th August

Alberto Aimar CERN – LCG1 Reliability Reports – May 2007

CASTOR evolution Presentation to HEPiX 2003, Vancouver 20/10/2003 Jean-Damien Durand, CERN-IT.

CIT 470: Advanced Network and System AdministrationSlide #1 CIT 470: Advanced Network and System Administration System Monitoring.

IT ELECTRONIC COMMERCE THEORY NOTES

1 Andrea Sciabà CERN Critical Services and Monitoring - CMS Andrea Sciabà WLCG Service Reliability Workshop 26 – 30 November, 2007.

Computing Facilities CERN IT Department CH-1211 Geneva 23 Switzerland t CF Automatic server registration and burn-in framework HEPIX’13 28.

INFSO-RI Enabling Grids for E-sciencE The gLite File Transfer Service: Middleware Lessons Learned form Service Challenges Paolo.

Storage and Data Movement at FNAL D. Petravick CHEP 2003.

SAM Sensors & Tests Judit Novak CERN IT/GD SAM Review I. 21. May 2007, CERN.

The new FTS – proposal FTS status. EMI INFSO-RI /05/ FTS /05/ /05/ Bugs fixed – Support an SE publishing more than.

Data Transfer Service Challenge Infrastructure Ian Bird GDB 12 th January 2005.

WLCG Latency Mesh Comments + – It can be done, works consistently and already provides useful data – Latency mesh stable, once configured sonars are stable.

File Transfer And Access (FTP, TFTP, NFS). Remote File Access, Transfer and Storage Networks For different goals variety of approaches to remote file.

Role Of Network IDS in Network Perimeter Defense.

FTS monitoring work WLCG service reliability workshop November 2007 Alexander Uzhinskiy Andrey Nechaevskiy.

GGUS summary (3 weeks) VOUserTeamAlarmTotal ALICE4004 ATLAS CMS LHCb Totals

CERN - IT Department CH-1211 Genève 23 Switzerland Operations procedures CERN Site Report Grid operations workshop Stockholm 13 June 2007.

INFSO-RI Enabling Grids for E-sciencE FTS failure handling Gavin McCance Service Challenge technical meeting 21 June.

Service Challenge Meeting “Review of Service Challenge 1” James Casey, IT-GD, CERN RAL, 26 January 2005.

8 August 2006MB Report on Status and Progress of SC4 activities 1 MB (Snapshot) Report on Status and Progress of SC4 activities A weekly report is gathered.

Data transfers and storage Kilian Schwarz GSI. GSI – current storage capacities vobox LCG RB/CE GSI batchfarm: ALICE cluster (67 nodes/480 cores for batch.

BNL dCache Status and Plan CHEP07: September 2-7, 2007 Zhenping (Jane) Liu for the BNL RACF Storage Group.

LHCC Referees Meeting – 28 June LCG-2 Data Management Planning Ian Bird LHCC Referees Meeting 28 th June 2004.

EGI-InSPIRE RI EGI-InSPIRE EGI-InSPIRE RI Mario Reale – GARR NetJobs: Network Monitoring Using Grid Jobs.

Distributed Data Management Miguel Branco 1 DQ2 status & plans BNL workshop October 3, 2007.

SAM Status Update Piotr Nyczyk LCG Management Board CERN, 5 June 2007.

Simulation Production System Science Advisory Committee Meeting UW-Madison March 1 st -2 nd 2007 Juan Carlos Díaz Vélez.

A Computing Tier 2 Node Eric Fede – LAPP/IN2P3. 2 Eric Fede – 1st Chinese-French Workshop Plan What is a Tier 2 –Context and definition To be a Tier 2.

Data Management at Tier-1 and Tier-2 Centers Hironori Ito Brookhaven National Laboratory US ATLAS Tier-2/Tier-3/OSG meeting March 2010.

2.2 Interfacing Computers MR JOSEPH TAN CHOO KEE TUESDAY 1330 TO 1530

CERN IT Department CH-1211 Genève 23 Switzerland t Towards end-to-end debugging for data transfers Gavin McCance Javier Conejero Banon Sophie.

Kevin Thaddeus Flood University of Wisconsin

Simulation Production System

LCG Service Challenge: Planning and Milestones

ATLAS Use and Experience of FTS

Status of the SRM 2.2 MoU extension

BNL Tier1 Report Worker nodes Tier 1: added 88 Dell R430 nodes

Cross-site problem resolution Focus on reliable file transfer service

THE OSI MODEL By: Omari Dasent.

James Casey, IT-GD, CERN CERN, 5th September 2005

BNL FTS services Hironori Ito.

Establishing End-to-End Guaranteed Bandwidth Network Paths Across Multiple Administrative Domains The DOE-funded TeraPaths project at Brookhaven National.

Philippe Charpentier CERN – LHCb On behalf of the LHCb Computing Group

WLCG Management Board, 16th July 2013

WLCG Service Interventions

1 VO User Team Alarm Total ALICE ATLAS CMS

CT1303 LAN Rehab AlFallaj.

DCache things Paul Millar … on behalf of the dCache team.

File Transfer and access

Storage Virtualization

Data Management cluster summary

Chapter 16: Distributed System Structures

lundi 25 février 2019 FTS configuration

Any Data, Anytime, Anywhere

INFNGRID Workshop – Bari, Italy, October 2004

The LHCb Computing Data Challenge DC06

Presentation transcript:

Olof.barring@cern.ch Miguel.coelho.santos@cern.ch Inter-site issues Olof.barring@cern.ch Miguel.coelho.santos@cern.ch

Intra-site issues Power cuts, cooling problems Hardware failures Service problems (storage, CPU, database, monitoring,…) Performance problems Bottlenecks Overload Ownership: site Motivation to fix: underused or wasted resources Contract: MoU Workaround: avoid site

Site issue lifecycle Site staff  User Announcement Awareness Identification monitoring Fixing… Announcement  Developers Escalation  User

Inter-site issues Issue involves more than one site WAN network problem Service problems (file transfer, db synchronization…) Performance bottlenecks Errors Overload Ownership: unknown/none Motivation to fix: underused or wasted resources Contract: none Workaround: avoid sites?

Inter-site issue lifecycle  User Not obvious for the involved sites: If you announce it, you’re responsible for fixing it? Psychological and political aspects: you think it’s the other site but don’t want to point finger Not obvious for the involved sites: Network monitoring may show high utilization because of retries Client (e.g. FTS) reports high error/retry rate Announcement Not obvious for the involved sites: Initially both sites cannot do very much else than try to prove their own innocence Collaborative approach is often hampered by: Time-zones Firewalls Difficulty accessing remote resources (rules & regulations) Difficulty access to logfiles Differences in log format Awareness Identification Fixing… Escalation This is usually the easy bit… 

Examples (I) (2006) Tier-2 -> Tier-0 CMS phedex transfers were debugged by CASTOR Ops team at CERN because it was a ‘CASTOR problem’. At the end most problems were related to firewall settings on the sources (Tier-2s). A throughput problem was caused by a faulty module on the HTAR firewall at CERN.

Examples (II) (2007) Tier-0 -> Tier-1 ATLAS DQ2 transfers investigated because of several unclear error messages reported by FTS concerning the 3rd-Party Grid-ftp transfer. [GRIDFTP] an end-of-file was reached [GRIDFTP] the server sent an error response: 451 451 Local resource failure: malloc: Cannot allocate memory. Error transmitted by the dCache client when file system is full. After a timeout caused by inactivity on the data channels the CASTOR Grid-ftp server (at VDT level) tries to read the rest of the file into memory and fails on the malloc

Error reporting *** Failures Report*** *** Error "451 Local resource failure: malloc: Cannot allocate memory." - 2410 requests ****.sara.nl - 2179 errors ****.in2p3.fr - 175 errors ****.bnl.gov - 56 errors *** Error "426 Data connection. data_write() failed: Handle not in the proper state" - 228 requests ****.sara.nl - 161 errors ****.bnl.gov - 67 errors *** Error "425 Can't open data connection. data_connect_failed() failed: a system call failed (Connection refused)." -69 requests ****.sara.nl - 69 errors *** Error "421 Timeout (900 seconds): closing control connection." - 20 requests ****.sara.nl - 11 errors ****.bnl.gov - 8 errors ****.in2p3.fr - 1 errors

How to improve (1) Awareness of the problem Inter-site monitoring Global repository of FTS error reports SAM probes for inter-site problems? SE – SE criss-cross Announcement of the problem … yeah?

How to improve (2) Identification of the problem Remote access Permanent access not feasible… Personal login  combinatory problem with many sites and people involved at the sites Generic/anonymous login not acceptable Temporary access expiring after X hours? Is there a need for a ‘site transfer responsible on duty’ assigned at each site? Probably only working hours coverage so the timezone is still an issue However, it could be useful to know who to contact whenever you want to investigate some problem involving yours and some other site