INFN Grid core services at CNAF: current setup


INFN Grid core services at CNAF: current setup
A. Cavalli, D. Dongiovanni, T. Ferrari, P. Veronesi
Riunione di cabina, CNAF, 03-06-2008

Outline
- FTS
- Top-level BDII
- WMS
- VOMS

FTS 1/2
- Current version: gLite 3.0.2 Update 42
- Hardware layout: 3 DELL PowerEdge 1950 servers (fts01-sc, fts02-sc, fts03-sc)
  - 2 dual-core CPUs
  - 8 GB RAM
  - 2 SATA disks, 160 GB, RAID1
- Backend: Oracle cluster
- 59 channel agents, 8 VOs supported

FTS 2/2
Monitoring of service:
- Profiles managed via Quattor+YAIM
- Nagios checks (agents + web server)
- Lemon monitoring (https://lemon03.cr.cnaf.infn.it/lemon-web/info.php?entity=FTS%20Servers&type=host&cluster=1)
- Operational procedure described on wiki page (https://tier1.cnaf.infn.it/wiki/index.php?title=FTS_installation_notes)
Support:
- Mailing list: fts-support<at>cnaf.infn.it
- FTS support on Italian Ticketing System
- FTSMonitor: https://tier1.cnaf.infn.it/ftsmonitor
Configuration details:
- All channel agents are of type URLCOPY
- Default SRM version copy: 2.2
- SRMVERSIONPOLICY: with-space-token
- Max number of transfer files per channel and number of streams configured as needed (details available via command line client)
- Timeouts tuned as needed (details available in the Quattor profiles)

TOP-BDII ROC-Italy
CURRENT:
- CNAF: 3 hosts behind a DNS round-robin: egee-bdii => egee-bdii-02, egee-bdii-05, egee-bdii-06 (SL3)
- INFN-PADOVA, INFN-FERRARA: 2 alternative hosts used in case of major CNAF downtime (DNS name temporarily remapped to the external ones)
- MONITORING: Nagios alerts, manual intervention (automatic DNS update not allowed under the current domain)
FUTURE:
- SL4, gLite 3.1
- CNAF BDIIs moved under a new domain that allows automatic DNS update
- Nagios: alerts + automatic exclusion of bad BDII
- All 5 BDIIs integrated in the pool (internal + external)
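The planned "automatic exclusion of bad BDII" could look roughly like the sketch below. It is only an illustration, not the CNAF implementation: the health check is a plain TCP connection to port 2170 (the usual BDII LDAP port, assumed here), the fallback policy for an all-failed pool is an assumption, and the step that would actually push the result to DNS is left out. Only the three CNAF host names come from the slide.

```python
import socket
from typing import Callable, List, Sequence

# Host names as listed on the slide. The Padova and Ferrara host names are not
# given there, so they are not guessed here.
CNAF_BDIIS = ["egee-bdii-02", "egee-bdii-05", "egee-bdii-06"]


def tcp_probe(host: str, port: int = 2170, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the (assumed) BDII LDAP port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def healthy_pool(hosts: Sequence[str],
                 probe: Callable[[str], bool] = tcp_probe,
                 min_size: int = 1) -> List[str]:
    """Return the hosts that should stay behind the round-robin alias.

    Hosts failing the probe are excluded. If fewer than `min_size` hosts pass,
    the full list is returned unchanged so the alias never becomes empty
    (an assumed policy, not taken from the slides).
    """
    good = [h for h in hosts if probe(h)]
    return good if len(good) >= min_size else list(hosts)
```

Passing the probe as a parameter keeps the selection logic testable without network access, for example `healthy_pool(CNAF_BDIIS, probe=lambda h: h != "egee-bdii-05")`.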

WMS 1/4

Columns: VO | WMS + LB (SL4) | WMS + LB (SL3) | lcg-RB | Total instances

ALICE       1 + 1
ATLAS       1 + 0.5      3.5
CDF         2
CMS         3 + 2        5 + 2.5      13.5*
LHCB        1 + 0.5      1.5
MULTI-VO    2 + 0.5      2.5

Also 1 + 1 WMS + LB (SL4) for middleware test purposes (CMS test)
* 1 WMS + 0.5 LB (SL4) temporarily borrowed from ATLAS

May 2008: overall submission activity on CNAF WMS
(data source: https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php)

VO          SUBMITTED      DONE    COLLECTIONS
ALICE             831       686              -
ATLAS           20083     13520           1325
CDF              3970      2713              2
CMS            751774    583042         131597
LHCB             4958      3528              -
MULTI-VO        14122     11196             50
TOTAL          795738    614685         132974

Peak daily submission rate (per single WMS):
- on production WMS: 11 Kjob/day (wms009)
- including exp. service instances: 27.4 Kjob/day (devel07)
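As a quick consistency check of the May 2008 figures, the per-VO rows can be summed and compared with the TOTAL row. The sketch below also derives a DONE/SUBMITTED fraction per VO; that ratio is not reported on the slide and is shown only as an illustration. The blank COLLECTIONS cells for ALICE and LHCB are treated as 0, which is an assumption (it is consistent with the stated total).

```python
from typing import Dict, Tuple

# (submitted, done, collections) per VO, copied from the slide.
# ALICE and LHCB have no COLLECTIONS value on the slide; 0 is assumed.
MAY_2008: Dict[str, Tuple[int, int, int]] = {
    "ALICE": (831, 686, 0),
    "ATLAS": (20083, 13520, 1325),
    "CDF": (3970, 2713, 2),
    "CMS": (751774, 583042, 131597),
    "LHCB": (4958, 3528, 0),
    "MULTI-VO": (14122, 11196, 50),
}

# TOTAL row as printed on the slide.
SLIDE_TOTAL = (795738, 614685, 132974)


def column_totals(stats: Dict[str, Tuple[int, int, int]]) -> Tuple[int, ...]:
    """Sum each column (submitted, done, collections) over all VOs."""
    return tuple(sum(col) for col in zip(*stats.values()))


def done_fraction(stats: Dict[str, Tuple[int, int, int]]) -> Dict[str, float]:
    """Fraction of submitted jobs that reached DONE, per VO (derived, not from the slide)."""
    return {vo: done / submitted for vo, (submitted, done, _) in stats.items()}


if __name__ == "__main__":
    print("totals match slide:", column_totals(MAY_2008) == SLIDE_TOTAL)
    for vo, frac in done_fraction(MAY_2008).items():
        print(f"{vo:10s} {frac:.1%}")
```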

CCRC 08 MAY: daily submission activity on all 26 CNAF WMS/LB instances
(data source: https://cert-wms-01.cnaf.infn.it:8443/wmsmon/main/main.php)

Work in Progress: WMS Load Balancing
To better distribute load across all available instances, we are working on an automatic load balancing system:
- A load metric combining several parameters is calculated on each WMS instance
- Based on the load metric, an ARBITER service ranks all WMS instances and builds a list of the best (least loaded) WMS
- The list of WMS is made available behind an alias hostname: pre-prod.wms.cnaf.infn.it
- The user submits to the alias and the jobs automatically go to one of the available WMS
- The chosen WMS returns a Job ID as usual
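The ranking step of the arbiter described above can be sketched as follows. The slides do not say which parameters enter the load metric or how they are weighted, so the fields (`cpu_load`, `queued_jobs`, `disk_used_frac`), the weights, the `max_load` threshold and the host names in the usage example are all hypothetical and serve only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class WmsSample:
    """One measurement of a WMS instance (all fields are hypothetical)."""
    host: str
    cpu_load: float        # e.g. 1-minute load average
    queued_jobs: int       # jobs waiting in the WMS input queue
    disk_used_frac: float  # used fraction of the sandbox disk, 0.0 to 1.0


def load_metric(s: WmsSample) -> float:
    """Combine the parameters into one number; higher means more loaded.

    The weights are arbitrary placeholders, not values from the slides.
    """
    return 1.0 * s.cpu_load + 0.01 * s.queued_jobs + 10.0 * s.disk_used_frac


def best_unloaded(samples: Iterable[WmsSample],
                  top_n: int = 3,
                  max_load: float = 20.0) -> List[str]:
    """Rank instances by load metric and return the `top_n` least loaded hosts.

    Instances whose metric exceeds `max_load` are dropped, so an overloaded
    WMS is never placed behind the alias.
    """
    ranked = sorted(samples, key=load_metric)
    return [s.host for s in ranked if load_metric(s) <= max_load][:top_n]


if __name__ == "__main__":
    samples = [
        WmsSample("wms-a.example.org", cpu_load=1.0, queued_jobs=100, disk_used_frac=0.5),
        WmsSample("wms-b.example.org", cpu_load=0.5, queued_jobs=0, disk_used_frac=0.1),
        WmsSample("wms-c.example.org", cpu_load=8.0, queued_jobs=2000, disk_used_frac=0.9),
    ]
    # The resulting list is what would be published behind the alias hostname.
    print(best_unloaded(samples))
```

A weighted sum is only one possible choice; any monotonic score would work with the same ranking and filtering logic.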

VOMS SETUP
- CNAF hosts two VOMS servers: voms.cnaf.infn.it (VOMS master for CDF) and voms2.cnaf.infn.it
  - more than 20 VOs served in total
- INFN Grid VOMS replica at INFN Padova (since March 2008): voms-01.pd.infn.it
  - Master and replica both use a MySQL backend
  - As soon as something changes in the master DB, the changes are immediately propagated to the replica server DB
  - Both instances run SL3, gLite 3.0, voms-admin 1.2.19
- VOMS hosts under Nagios monitoring (e-mail alarms)
- From March 2008 (INFN T1 machine room reengineering): a dual power supply system is possible for all racks
  - Migration of the VOMS server to a new host with redundant hw configuration is needed to take advantage of this
  - No suitable spare hw currently available
- Fault tolerance is based on the existence of the VOMS replica outside of the INFN T1 LAN
  - To cope with network outages
  - Local VOMS replicas sharing a single DB backend are not totally fault tolerant (this is the current setup at CERN; a VOMS replica of CERN will be put into production soon at CNAF)

Service availability from Jan 2008 to date: 95%
- CNAF centre scheduled downtime
- Nagios update

Outstanding problems
- Automatic restart of daemons: detected a problem with the init scripts, which prevented the VOMS and VOMS-Admin services from restarting correctly after a manual reboot of the machine => problem under study (A.Cavalli, A.Paolini): SOLVED
- Nagios alarms: only related to the status of the host, service processes not currently under test => extension of the Nagios VOMS test suite => action on P.Veronesi: DONE
Plans:
- Extension of Nagios tests to raise SMS alarms in addition to e-mail messages for critical VOMS problems
- Hw and sw upgrade to gLite 3.1
- Major database structure upgrade
- Waiting for input from CDF