Running reliable services: the LFC at CERN
Sophie Lemaitre
WLCG Service Reliability Workshop, CERN, Geneva, 26 November 2007
Enabling Grids for E-sciencE (INFSO-RI-508833)

Talk Overview

What is reliability?
LFC usage at CERN
Turning the LFC into a reliable service at CERN

Message

What is reliability?
– The ability of a system or component to perform its required functions under stated conditions.
What does it mean for you?
– The ability to meet the Tier1 SLA: 99% availability, with a maximum of 12 hours downtime (a quick calculation below puts these numbers in perspective).
Increasing reliability = increasing the time the service is available = reducing downtime.
Good practice:
– Don't lose the experiments' data!
– Automate everything.
– Limit server downtime: front-end dynamic load balancing.
– Limit database downtime: Oracle RAC.
– Limit the time before a problem is noticed: monitoring.
– Make sure you can react as fast as possible: create procedures for operators (first-level support).
– Plan all interventions.
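To put the SLA numbers in perspective, here is a back-of-the-envelope calculation. It assumes one reading of the SLA, namely that the 99% target is measured over a full year while the 12-hour figure caps a single intervention:

```python
hours_per_year = 365 * 24                       # 8760 hours
allowed_downtime = (1 - 0.99) * hours_per_year  # 87.6 hours per year

# Under this reading, 99% availability allows roughly 88 hours of
# accumulated downtime per year; with each intervention capped at
# 12 hours, that is about seven worst-case interventions per year.
print(f"{allowed_downtime:.1f} hours of downtime allowed per year")
```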

Talk Overview

What is reliability?
LFC usage at CERN
Turning the LFC into a reliable service at CERN

Experiment Usage at CERN

Different experiments decided on different usage patterns:
– LHCb: a central catalog, plus a read-only catalog (for scalability and redundancy) and replica read-only catalogs at the Tier1s; replication is done via Oracle Streams.
– ATLAS: central and local catalogs.
– CMS: a central catalog.
Shared catalog:
– A shared “catch-all” catalog for dteam, unosat, sixt, …

General Infrastructure

Backend storage is in Oracle:
– On the LCG Oracle RAC instance.
– Separate database accounts per experiment and catalog type, for better VO isolation.
All front-end nodes are in 2-node load-balanced clusters:
– Mid-range servers.
– Automatic load balancing is used to isolate “broken” nodes.
– There is a separate DNS alias for each VO and catalog type.
Full Service Description:

LFC DNS Aliases

LFC Service Deployment

Talk Overview

What is reliability?
LFC usage at CERN
Turning the LFC into a reliable service at CERN:
– Checklist
– Database backup
– Monitoring
– Dynamic DNS load balancing
– Firewall
– Quattor vs. YAIM
– Operators procedures

Checklist: requirements

See:

Checklist: hardware

Checklist: operations

Good practice

Database backups:
– Don't lose the experiments' data!
– Ask your friendly database administrator to back up the LFC database.
Plan all interventions:
– Even the transparent ones.
– This prevents coordination and communication problems as much as possible.

Monitoring (1)

What to monitor?
Note: avoid monitoring based on creating an LFC file or directory:
– Creating entries uses up file ids in the LFC database tables.
– Update a file's utime instead (this is what the LFC_NOWRITE check does); a sketch follows.
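Below is a minimal sketch of such a probe. It assumes the SWIG-generated Python bindings shipped with the LFC client (import lfc, wrapping the C API in lfc_api.h); the server name and probe path are hypothetical placeholders, and the probe entry is assumed to have been created once in advance:

```python
import os
import sys

import lfc  # LFC client Python bindings (assumed available)

os.environ.setdefault("LFC_HOST", "prod-lfc-shared.example.org")  # hypothetical
PROBE_PATH = "/grid/dteam/monitoring/probe-entry"  # hypothetical, pre-created

# In the C API, lfc_utime(path, NULL) sets the access/modification times
# to "now", mirroring POSIX utime(2); None stands in for the NULL pointer.
# No new entry is created, so no file ids are consumed.
if lfc.lfc_utime(PROBE_PATH, None) != 0:
    print("LFC_NOWRITE: failed to update utime on the probe entry")
    sys.exit(2)  # Nagios-style CRITICAL
print("OK: LFC accepts writes")
```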

Monitoring (2)

Existing tools:
– Generic LFC probe: see the Monitoring Working Group web page; direct link to the probe: …/monitoring/trunk/probe/ch.cern/src/LFC-probe
– At CERN, a Lemon sensor is used: the “lemon-sensor-grid-lfc” package; see the corresponding source in CVS: …bin/cvsweb.cgi/elfms/lemon/sensors/sensor-grid-lfc/?cvsroot=elfms
– A Lemon wrapper to the LFC probe will be written in the future.

DNS load balancing (1)

Dynamic DNS load balancing:
– Solution implemented by Vlado Bahyl and Nick Garfield.
– See Vlado's presentation this morning.
For the LFC nodes, a machine is removed from the DNS aliases when any of the following holds (a sketch follows this list):
– It cannot be reached over ssh, or
– The node is in maintenance, or
– /tmp is full, or
– One of these alarms is active: LFC_NOREAD, LFC_NOWRITE, LFCDAEMON_WRONG.
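A minimal sketch of such a per-node health check follows. The real CERN setup feeds Lemon alarm and maintenance state into the dynamic DNS server, so the maintenance and alarm lookups below are illustrative stand-ins, as is the node name:

```python
import os
import shutil
import subprocess

BAD_ALARMS = {"LFC_NOREAD", "LFC_NOWRITE", "LFCDAEMON_WRONG"}

def ssh_reachable(host, timeout=10):
    # A node that cannot be reached over ssh must leave the alias.
    rc = subprocess.call(
        ["ssh", "-o", f"ConnectTimeout={timeout}", "-o", "BatchMode=yes",
         host, "true"],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return rc == 0

def tmp_full(threshold=0.95):
    usage = shutil.disk_usage("/tmp")
    return usage.used / usage.total >= threshold

def in_maintenance():
    return os.path.exists("/etc/in_maintenance")  # illustrative stand-in

def active_alarms():
    return set()  # illustrative stand-in: the real check reads Lemon state

def keep_in_alias(host):
    return (ssh_reachable(host)
            and not in_maintenance()
            and not tmp_full()
            and not (active_alarms() & BAD_ALARMS))

if __name__ == "__main__":
    print(keep_in_alias("lfc-node-a.example.org"))  # hypothetical node
```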

DNS load balancing (2)

Advantage:
– All LFC software upgrades are transparent for the users, except when the database schema changes.
Example: two DNS-aliased nodes, A and B:
1) Put node A in maintenance, and wait for node A to be taken out of production by the dynamic DNS load balancing (see the polling sketch below).
2) Stop, upgrade and restart the LFC on node A.
3) Take node A out of maintenance, and wait for node A to be put back into production by the dynamic DNS load balancing.
4) Repeat from step 1) with node B.
Transparent upgrade done!
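The waiting steps can be automated by polling the alias until node A's address is no longer published; the alias and IP below are hypothetical. Note that clients may still hold the old answer until the DNS record's TTL expires, so waiting a little beyond the drain is prudent:

```python
import socket
import time

ALIAS = "prod-lfc-shared.example.org"  # hypothetical per-VO alias
NODE_A_IP = "192.0.2.10"               # hypothetical address of node A

def ips_behind(alias):
    # gethostbyname_ex returns (hostname, aliaslist, ipaddrlist).
    return set(socket.gethostbyname_ex(alias)[2])

while NODE_A_IP in ips_behind(ALIAS):
    time.sleep(60)  # re-check once a minute
print("node A is out of the alias; safe to stop and upgrade the LFC")
```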

Firewall

At CERN, LANDB sets are used to control firewall access:
– Common firewall settings for the whole LFC cluster.
– If a change is made, it is applied to all machines automatically: no more “oooops, forgot to configure this one”…

Partitions

The LFC log files are stored under /var/log/lfc.
Make sure the /var partition is big enough!
– This problem hit us at CERN…
– The WLCG log retention policy is 90 days.
– /var is now 200 GB on the LFC nodes, i.e. a budget of roughly 2 GB of logs per day over the retention window.
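A minimal sketch of a corresponding headroom check (the alarm threshold is illustrative):

```python
import shutil
import sys

LOG_PARTITION = "/var"    # LFC logs live under /var/log/lfc
MIN_FREE_FRACTION = 0.10  # illustrative alarm threshold

usage = shutil.disk_usage(LOG_PARTITION)
free_fraction = usage.free / usage.total
daily_budget_gb = usage.total / (90 * 1024 ** 3)  # even spend over 90 days

print(f"daily log budget over 90 days: about {daily_budget_gb:.1f} GB/day")
if free_fraction < MIN_FREE_FRACTION:
    print(f"WARNING: only {free_fraction:.0%} free on {LOG_PARTITION}")
    sys.exit(1)
```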

Yaim

Yaim is used to configure the grid aspects of the service.
There is some duplication between Quattor and yaim:
– The approach is to use Quattor for system functionality, e.g. chkconfig, filesystems, accounts, access control, package management.
– And yaim for grid functionality.
Oracle configuration is handled specially:
– We run “upgrade” scripts by hand, not from the yaim auto-upgrade.

Avoiding Yaim/Quattor conflicts

For pieces of system functionality configured by both yaim and Quattor, we remove them from yaim.
Simple mechanism (sketched below):
– Place an empty config_XXX file in the yaim functions/local directory.
– The empty files are shipped in an rpm, “CERN-CC-glite-yaim-local”.
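A minimal sketch of that mechanism, assuming the usual gLite install prefix; the function names are illustrative, and in production the empty files come from the CERN-CC-glite-yaim-local rpm rather than a script:

```python
from pathlib import Path

# Per the mechanism described above, an empty file in functions/local
# shadows the stock yaim function of the same name, turning that step
# into a no-op and leaving the system configuration to Quattor.
LOCAL_FUNCTIONS = Path("/opt/glite/yaim/functions/local")  # assumed prefix
OVERRIDDEN = ["config_users", "config_crl"]  # illustrative function names

LOCAL_FUNCTIONS.mkdir(parents=True, exist_ok=True)
for name in OVERRIDDEN:
    (LOCAL_FUNCTIONS / name).touch()  # create the empty override file
```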

Operators procedures

React to problems as fast as possible:
– Get operators to respond directly to known issues.
– At CERN, most alarms open standard tickets.
– LFC_ACTIVE_CONN and LFC_SLOW_READDIR don't, because there is no immediate action an operator can take ;( They are simply a sign of heavy usage of the service by the VO.

Message

What is reliability?
– The ability of a system or component to perform its required functions under stated conditions.
What does it mean for you?
– The ability to meet the Tier1 SLA: 99% availability, with a maximum of 12 hours downtime.
Increasing reliability = increasing the time the service is available = reducing downtime.
Good practice:
– Don't lose the experiments' data!
– Automate everything.
– Limit server downtime: front-end dynamic load balancing.
– Limit database downtime: Oracle RAC.
– Limit the time before a problem is noticed: monitoring.
– Make sure you can react as fast as possible: create procedures for operators (first-level support).
– Plan all interventions.
This will save you a lot of trouble.

Acknowledgements

Integration with the CERN fabric infrastructure:
– James Casey
LFC administrators at CERN:
– Jan van Eldik, Miguel Coelho, Ignacio Reguero, David Collados, Diana Bosio
Dynamic DNS load balancing:
– Vlado Bahyl, Nick Garfield
LFC expert:
– Jean-Philippe Baud

Questions?