Deploying Production GRID Servers & Services

Presentation transcript:

Deploying Production GRID Servers & Services
Thorsten Kleinwort, CERN IT
WLCG Service Workshop, Mumbai, 10.02.2006

Introduction
- How to deploy GRID services in a production environment: various techniques, first experiences and recommendations.
- Scale: o(500).
- Tools: still migrating from the old tool set to the new one; the new tools were written with the final scale in mind.
- Scope: mostly LXBATCH and LXPLUS, a well-defined environment the new tools were built for.

Production Servers/Services
What does "production" mean?
- Meet the MoU agreements.
- Avoid single points of failure for the services: in hardware, in software and in people.
- "Procedurise": production, failure recovery, maintenance, …

General remarks
- Production service managers ≠ software/service developers/gurus.
- Procedures are needed for: s/w installation, upgrades, checking, …
- Requirements are needed for: h/w (disk, memory, backup, high availability) and operations (24/7, 9-5, piquet, problem escalation).

GRID Service issues
- Most GRID services have no built-in failover functionality.
- Stateful servers: information is lost on a crash.
- Domino effects on failures.
- Scalability issues, e.g. running >1 CE.

Different approaches
Server hardware setup:
- Cheap "off the shelf" h/w is adequate only for batch nodes (WN).
- Improve reliability (at higher cost) with:
  - dual power supplies (and test that they work)
  - UPS
  - RAID on system/data disks (hot swappable)
  - remote console/BIOS access
  - Fibre Channel disk infrastructure
- At CERN: the "Mid-Range-(Disk-)Server" class.

Various Techniques
Where possible, increase the number of "servers" to improve the "service":
- WN, UI: trivial.
- BDII: possible, since the state information is very volatile and re-queried.
- CE, RB, …: more difficult, but possible for scalability.

Load Balancing
For multiple servers, use a DNS alias or load balancing to select the right one:
- Simple DNS aliasing to select the master or the slave server: useful for scheduled interventions.
- Pure DNS round robin (no server checks), configurable in DNS: balances the load equally.
- A load balancing service based on server load/availability checks: uses monitoring information to determine the best machine(s). [See the CHEP talk by Vlado Bahyl.]

DNS Load Balancing (diagram): the LB server selects the least "loaded" LXPLUSXXX node from SNMP/Lemon monitoring data and updates the LXPLUS alias in DNS, so clients resolving LXPLUS.CERN.CH land on the chosen node.
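
To make the mechanism concrete, here is a minimal sketch of the selection loop such an LB server runs. The node names and the load metric are hypothetical; the production service uses SNMP/Lemon data and CERN's dynamic DNS interface, not this stub.

```python
#!/usr/bin/env python3
"""Sketch of a DNS load-balancing picker (hypothetical names/metrics)."""
import random

CANDIDATES = ["lxplus001", "lxplus002", "lxplus003"]  # hypothetical node names

def query_load(host: str) -> float:
    """Stand-in for the SNMP/Lemon query; returns a fake load figure."""
    return random.uniform(0.0, 10.0)

def pick_least_loaded(hosts, n=2):
    """Rank candidate hosts by load and return the n best ones."""
    return sorted(hosts, key=query_load)[:n]

def update_dns_alias(alias: str, hosts) -> None:
    """Placeholder for the dynamic DNS update the LB server performs."""
    print(f"{alias} -> {', '.join(hosts)}")

if __name__ == "__main__":
    update_dns_alias("lxplus.cern.ch", pick_least_loaded(CANDIDATES))
```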

Load balancing & scaling
- Publish several (distinguished) servers for the same site to distribute the load.
- This also works for stateful services, because each client sticks to the server it selected (see the sketch below).
- Helps under high load (CE, RB).
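
Why clients "stick": a client resolves the alias once and then keeps using the returned address for its whole session, so even a stateful exchange stays on one machine. A small illustration, using the LXPLUS alias from the earlier diagram:

```python
import socket

# Resolve the load-balanced alias once. DNS may return a different
# node next time, but this client now pins itself to one address.
alias = "lxplus.cern.ch"
ip = socket.getaddrinfo(alias, 22, proto=socket.IPPROTO_TCP)[0][4][0]
print(f"session pinned to {ip}")
# All subsequent (stateful) requests should connect to `ip`, not `alias`.
```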

Use existing services
- All GRID applications that support Oracle as a data backend (will) use the existing Oracle service at CERN for their state information.
- Keeping the state in the database makes the application itself stateless, and therefore load-balanceable.
- The Oracle solution is based on RAC servers with FC-attached disks.
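
The pattern is worth spelling out: if every front end reads and writes state only through the shared database, any front end can serve any request. A toy sketch using sqlite3 as a stand-in for the Oracle RAC backend (table and names are hypothetical):

```python
import sqlite3

# Toy stand-in for the shared Oracle backend: any front-end node can
# serve any request because all state lives in the database.
db = sqlite3.connect("state.db")
db.execute("CREATE TABLE IF NOT EXISTS transfers (id TEXT PRIMARY KEY, state TEXT)")

def handle_request(transfer_id: str, new_state: str) -> None:
    """A stateless front-end handler: state is read/written only in the DB."""
    db.execute(
        "INSERT INTO transfers (id, state) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET state = excluded.state",
        (transfer_id, new_state),
    )
    db.commit()

handle_request("job-42", "ACTIVE")  # any replica of the front end could do this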

High Availability Linux
- An add-on to standard Linux.
- Switches an IP address between two servers, either when one server crashes or on request.
- The machines monitor each other.
- To increase availability further, state information can be kept on (shared) FC storage.

HA Linux (diagram): MYPROXY.CERN.CH is served by the pair PX101/PX102, which monitor each other via heartbeat and share (FC) disk/storage.

HA Linux + FC disk (diagram): the heartbeat pair behind MYPROXY.CERN.CH attaches through redundant FC switches to the FC disks, leaving no single point of failure. A failover sketch follows.
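
A hand-rolled sketch of the monitor-and-take-over loop, with a hypothetical peer name and service IP. The production setup uses the HA Linux/Heartbeat package, not a script like this; the ICMP probe merely stands in for the real heartbeat channel.

```python
#!/usr/bin/env python3
"""Heartbeat failover sketch (hypothetical peer and floating IP)."""
import subprocess
import time

PEER = "px101.cern.ch"        # hypothetical peer node
SERVICE_IP = "192.0.2.10/24"  # hypothetical floating service address
INTERFACE = "eth0"

def peer_alive() -> bool:
    """One ICMP probe standing in for the real heartbeat channel."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", PEER],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ) == 0

def take_over() -> None:
    """Claim the floating IP that MYPROXY.CERN.CH resolves to (needs root)."""
    subprocess.call(["ip", "addr", "add", SERVICE_IP, "dev", INTERFACE])

while True:
    if not peer_alive():
        take_over()
        break
    time.sleep(5)
```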

GRID Services: WMS
WMS (Workload Management System):
- WN
- UI
- CE
- RB (GRPK)

WMS: WN & UI
- Simple: add as many machines as you need.
- They do not depend on single points of failure.
- The service can be kept up while interventions are ongoing: have procedures for s/w, kernel and OS upgrades.

WMS: CE & RB
- CE & RB are stateful servers, so load balancing helps only for scaling:
  - done for the RB and CE
  - CEs are on "Mid-Range-Server" h/w
  - RBs not yet, but foreseen
- If one server goes, the information it holds is lost.
- A possible future setup: an FC storage backend behind load-balanced, stateless servers.

GRID Services: DMS & IS
DMS (Data Management System):
- SE: Storage Element
- FTS: File Transfer Service
- LFC: LCG File Catalogue
IS (Information System):
- BDII: Berkeley Database Information Index

DMS: SE
- SE: castorgrid, a load-balanced front-end cluster with a CASTOR storage backend.
- 8 simple (batch-type) machines.
- Load balancing and failover work here (though not for plain GridFTP).

DMS: FTS & LFC
LFC & FTS service:
- Load-balanced front end.
- Oracle database backend.
FTS service in addition:
- VO agent daemons, with one "warm" spare for all that gets "hot" when needed (see the sketch below).
- Channel agent daemons (master & slave).

FTS: VO agent failover (diagram): one agent node each for ALICE, ATLAS, CMS and LHCb, plus a shared SPARE that takes over for whichever agent fails.
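
A sketch of the warm-spare idea, with hypothetical host names and a stand-in health check: the spare sits idle until a VO agent is flagged dead, then starts that VO's daemon itself and becomes "hot".

```python
"""Warm-spare promotion sketch (hypothetical names and checks)."""
import subprocess

VO_AGENTS = {
    "alice": "fts-va-alice.cern.ch",
    "atlas": "fts-va-atlas.cern.ch",
    "cms":   "fts-va-cms.cern.ch",
    "lhcb":  "fts-va-lhcb.cern.ch",
}  # hypothetical host names

def agent_alive(host: str) -> bool:
    """Stand-in health check; the real setup relies on fabric monitoring."""
    return subprocess.call(
        ["ping", "-c", "1", "-W", "2", host],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    ) == 0

def promote_spare(vo: str) -> None:
    """Run on the spare node: become 'hot' for the failed VO."""
    print(f"starting the {vo} VO agent daemon on the spare")
    # e.g. subprocess.call(["service", f"fts-vo-{vo}", "start"])  # hypothetical

for vo, host in VO_AGENTS.items():
    if not agent_alive(host):
        promote_spare(vo)
        break  # one spare can cover only one failed agent at a time
```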

IS: BDII
BDII on load-balanced Mid-Range-Servers:
- LCG-BDII: top-level BDII
- PROD-BDII: site BDII
- In preparation: <VO>-BDII
The state information lives in a "volatile" cache and is re-queried within minutes.
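
Because a BDII instance holds only volatile, regularly re-queried state, any node behind the alias can answer a query. An example of asking a top-level BDII for its published services over plain LDAP (port 2170 and base "o=grid" are the standard BDII settings; the alias is the historical CERN one and may no longer resolve):

```python
import subprocess

# Query a top-level BDII over anonymous LDAP (standard port 2170,
# base "o=grid"). Requires the ldapsearch client to be installed.
subprocess.run([
    "ldapsearch", "-x", "-LLL",
    "-H", "ldap://lcg-bdii.cern.ch:2170",
    "-b", "o=grid",
    "(objectClass=GlueService)",   # list the published grid services
    "GlueServiceEndpoint",         # return only their endpoints
])
```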

Grid Services: AAS & MS
AAS (Authentication & Authorization Services):
- PX: (My)Proxy service
- VOMS: Virtual Organisation Membership Service
MS (Monitoring Services):
- MONB: R-GMA
- GRVW: GridView
- SFT: Site Functional Tests

AAS: MyProxy
- MyProxy has a replication function for a slave server that allows read-only proxy retrieval.
- The slave regularly gets a "read-write" copy from the master, to allow a DNS switch-over.
- HA Linux handles the failover.

AAS: VOMS (diagram slide)

MS: SFT, GRVW, MONB
- Not seen as critical.
- Nevertheless, the servers are being migrated to "Mid-Range-Servers".
- GRVW has a "hot standby".
- SFT & MONB run on Mid-Range-Servers.

Other Experiences
- It is sometimes difficult to integrate the GRID middleware with the local fabric (configuration tools, Quattor):
  - RPM dependencies
  - monitoring
  - network connectivity
  - incompatible UI
- Scalability issues: the CE is already overloaded.

Monitoring
- (Local) fabric monitoring: Lemon.
- GRID server monitoring.
- GRID service monitoring.
- GRID monitoring: Gstat, SFT, GRIDVIEW.

People
- Escalation chain: Operators -> SysAdmins -> Service Managers (fabric + service) -> Application Managers.
- Regular daily meeting (a merger of two previous meetings).
- Regular weekly meeting (CCSR).
- WLCG weekly conference call.
- Procedures.

Some interesting links
- CERN-IT/FIO: http://cern.ch/it-dep-fio/
- Some information about CERN-LCG on the TWiki:
  - TechnicalDetails < LCG < TWiki
  - ScFourServiceTechnicalFactors < LCG < TWiki