Storage status at the Tier1 (Stato dello storage al Tier1) – Luca dell’Agnello, Wednesday 18 May 2011

INFN-CNAF
CNAF is the central computing facility of INFN:
– Italian Tier-1 computing centre for the LHC experiments ATLAS, CMS, ALICE and LHCb…
– … but also one of the main processing facilities for other experiments: BaBar and CDF; astro and space physics: VIRGO (Italy), ARGO (Tibet), AMS (satellite), PAMELA (satellite) and MAGIC (Canary Islands); and more
[Slide charts: disk shares and CPU shares per experiment; table of CPU power (HS06), disk space (PB) and tape space (PB) per year]

Storage shares (May 2011)
[Slide table: allocated vs. pledged disk space (TB) per experiment, for ALICE, ATLAS, CMS, LHCb, BaBar, SuperB, CDF, AMS, ARGO, AUGER, FERMI/GLAST, MAGIC, PAMELA and VIRGO]

Storage resources
– 8.4 PB of on-line disk under GEMSS:
   – 7 DDN S2A9950 systems (2 TB SATA disks for data, 300 GB SAS disks for metadata)
   – 7 EMC systems
   – NSD servers with 10 Gbps connections on the DDNs, 60 NSD servers with 1 Gbps connections on the EMCs
   – GridFTP servers (mostly for WAN transfers), most with a 10 Gbps connection
– 1 SL8500 tape library (10 PB on line) with 20 T10KB drives
   – 1 TB tape capacity and 1 Gbps of bandwidth per drive
   – Drives interconnected to the library and to the TSM-HSM servers via a dedicated SAN (TAN)
   – TSM server common to all GEMSS instances
– All storage systems and disk servers interconnected via SAN (FC4/FC8)
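For quick reference, a back-of-the-envelope Python sketch of the aggregate figures implied by the numbers above; the per-drive values are the ones quoted on the slide, the rest is simple arithmetic:

```python
# Rough aggregates implied by the hardware figures quoted above.
TAPE_DRIVES = 20            # T10KB drives in the SL8500 library
DRIVE_BW_GBPS = 1.0         # ~1 Gbps of bandwidth per drive
TAPE_CAPACITY_TB = 1.0      # ~1 TB per cartridge

disk_online_pb = 8.4        # disk on-line under GEMSS
tape_online_pb = 10.0       # tape capacity on line in the library

aggregate_tape_bw_gbps = TAPE_DRIVES * DRIVE_BW_GBPS                   # ~20 Gbps
aggregate_tape_bw_gbs = aggregate_tape_bw_gbps / 8                     # ~2.5 GB/s
cartridges_for_online_tape = tape_online_pb * 1000 / TAPE_CAPACITY_TB  # ~10,000 tapes

print(f"Aggregate tape bandwidth: {aggregate_tape_bw_gbps:.0f} Gbps "
      f"({aggregate_tape_bw_gbs:.1f} GB/s)")
print(f"Cartridges needed for {tape_online_pb} PB on line: "
      f"{cartridges_for_online_tape:.0f}")
print(f"Disk-to-tape capacity ratio: {disk_online_pb / tape_online_pb:.2f}")
```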

SAN: star topology
[Slide diagram: core FC directors (Brocade 48000 and 24000) at the centre, edge FC switches (sw1-sw6, BR3-BR6), the DDN (ddn1-ddn7) and EMC (emc1-emc6) disk arrays, tape drives and servers attached around them]

The requirements
Search for a common solution:
– Suitable for the needs of the LHC experiments
   – System scalable to sizes of O(10) PB
   – HSM system available for archival and dynamic recall from tape
   – Thousands of concurrent accesses
   – Aggregate throughput of O(10) GB/s
– Flexible enough for the needs of the non-LHC experiments
   – Essential not to make usage harder than with home-made solutions
– Ease of management and high reliability are essential
(Some) crucial choices: batch system, Mass Storage System
[Slide chart: storage at the Tier1 (TB) over time, since the Tier1 prototype]

Mass Storage System at CNAF: the evolution (1)
– 2003: CASTOR chosen as MSS (and phased out in January 2011)
   – Large variety of issues both at the set-up/admin level and at the VOs’ level (complexity, scalability, stability, support)
– 2007: start of a project to realize GEMSS, a new grid-enabled HSM solution based on industrial components (a parallel file system and a standard archival utility)
   – StoRM adopted as the SRM layer and extended to include the methods required to manage data on tape
   – GPFS and TSM by IBM chosen as building blocks
   – An interface between GPFS and TSM implemented (not all needed functionalities are provided out of the box)

Mass Storage System at CNAF: the evolution (2)
– Q2 2008: first implementation (D1T1, the easy case) in production for LHCb (CCRC’08)
– Q2 2009: GEMSS (StoRM/GPFS/TSM), the full HSM solution, ready for production
– Q3 2009: CMS moving from CASTOR to GEMSS
– Q1 2010: the other LHC experiments moving to GEMSS
– End of 2010: all experiments moved from CASTOR to GEMSS
   – All data present on CASTOR tapes copied to TSM tapes
   – CASTOR tapes recycled after a data check

Building blocks of the GEMSS system
[Slide diagram: data flow between worker nodes (LAN), StoRM/GridFTP (WAN), GPFS (SAN), the GEMSS data migration and recall processes and TSM (TAN)]
Disk-centric system with five building blocks:
1. GPFS: disk-storage software infrastructure
2. TSM: tape management system
3. StoRM: SRM service
4. TSM-GPFS interface
5. Globus GridFTP: WAN data transfers
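As an illustration of how the five blocks cooperate on a read, here is a minimal, purely conceptual Python sketch (not GEMSS code; class, function and host names are invented for illustration): StoRM receives the SRM request, serves the file straight from GPFS if it is on disk, otherwise asks the recall machinery to stage it back from TSM before exposing it via GridFTP.

```python
from dataclasses import dataclass

@dataclass
class SRMRequest:
    """A simplified SRM 'bring online' / read request."""
    path: str          # logical file name inside the GPFS namespace
    vo: str            # requesting experiment

class GEMSSSketch:
    """Toy model of the five GEMSS building blocks (illustrative only)."""

    def __init__(self, gpfs_online: set, tsm_archive: set):
        self.gpfs_online = gpfs_online    # files currently resident on GPFS disk
        self.tsm_archive = tsm_archive    # files archived on TSM-managed tape

    def handle(self, req: SRMRequest) -> str:
        # 3. StoRM: the SRM front-end decides whether the file is already on disk.
        if req.path in self.gpfs_online:
            return self.gridftp_url(req.path)        # 1. GPFS + 5. GridFTP
        if req.path in self.tsm_archive:
            self.recall_from_tape(req.path)          # 2. TSM + 4. TSM-GPFS interface
            return self.gridftp_url(req.path)
        raise FileNotFoundError(req.path)

    def recall_from_tape(self, path: str) -> None:
        # In the real system the GEMSS recall process drives TSM to stage the
        # file back onto the GPFS file system; here we just mark it on-line.
        self.gpfs_online.add(path)

    @staticmethod
    def gridftp_url(path: str) -> str:
        # 5. WAN transfers are served through Globus GridFTP (hypothetical host).
        return f"gsiftp://gridftp.example.cnaf.infn.it{path}"

# Example: a file present only on tape is recalled, then served over GridFTP.
gemss = GEMSSSketch(gpfs_online={"/atlas/data/f1"}, tsm_archive={"/cms/data/f2"})
print(gemss.handle(SRMRequest("/cms/data/f2", vo="cms")))
```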

Storage setup
Disk storage partitioned into several GPFS clusters, one cluster for each (major) experiment, with:
– Several NSDs for data over the LAN (e.g. 8 for Atlas, 12 for CMS)
– 2 NSDs for metadata
– 2-4 GridFTP servers (WAN)
– 1 StoRM end-point (1 BE plus FEs)
– 2-3 TSM-HSM servers (if needed)
Largest file systems in production: Atlas and CMS (2.2 PB)
[Slide diagram: SATA data drives and SAS metadata drives on the SAN (4/8 Gbit FC), data and metadata NSD servers, GridFTP servers (10 Gbit), Ethernet core switch towards the farm and the WAN; link speeds of 1, 10, 20 and ~60 Gbit are indicated]

GEMSS (1/2): migration and recall flows
[Slide diagram: HSM-1 and HSM-2 nodes and the TSM server with its TSM DB, connected to disk and to the tape library over the Storage Area Network and the Tape Area Network; StoRM and GridFTP handle SRM requests and WAN I/O]

GEMSS (2/2): recall with files sorted by tape
[Slide diagram: worker nodes read over the LAN from the GPFS servers; recall requests arriving through StoRM are grouped by tape ("sorting files by tape") before the HSM nodes and the TSM server fetch them from the tape library over the SAN/TAN]
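The "sorting files by tape" step is the key optimisation of the recall path: pending recall requests are grouped by the cartridge holding each file, so that every cartridge is mounted once and read in a single pass instead of being remounted per file. A minimal, hypothetical Python sketch of that grouping (the tape lookup and the recall call are placeholders, not the real GEMSS/TSM interface):

```python
from collections import defaultdict

def group_recalls_by_tape(requests, tape_of):
    """Group pending recall requests by the cartridge that holds each file.

    requests: iterable of file paths waiting to be recalled
    tape_of:  mapping file path -> tape volume label
              (in GEMSS this information comes from the TSM database)
    """
    by_tape = defaultdict(list)
    for path in requests:
        by_tape[tape_of[path]].append(path)
    return by_tape

def run_recalls(by_tape, recall_batch):
    """Mount each tape once and recall all of its files in a single pass."""
    for tape, files in by_tape.items():
        # sorted() is a stand-in for ordering files by their position on the tape
        recall_batch(tape, sorted(files))

# Toy example: 5 requests spread over 2 cartridges -> only 2 mounts.
tape_of = {"/cms/a": "VOL001", "/cms/b": "VOL002", "/cms/c": "VOL001",
           "/cms/d": "VOL001", "/cms/e": "VOL002"}
batches = group_recalls_by_tape(tape_of.keys(), tape_of)
run_recalls(batches, lambda tape, files: print(f"mount {tape}: recall {files}"))
```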

GEMSS HA
– The TSM DB is stored on an EMC CX on the SAN
– The TSM DB is backed up every 2 hours to a different CX disk and every 12 hours to tape, with a retention of 6 days
– The TSM server has a secondary server in stand-by
   – The DB on the CX can be moved directly to the secondary server
   – With a floating IP, all clients are redirected to the new server
– 2 or 3 TSM-HSM clients per VO for failover
– GPFS servers, StoRM FEs and GridFTP servers are clustered
   – The StoRM BE is a single point of failure (cold spare ready)
   – Nagios (proactive) alarms
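A hedged sketch of the failover idea behind the stand-by TSM server: a monitor detects that the primary is down, the TSM instance is started on the secondary against the shared DB on the CX, and the floating IP is moved so that clients reconnect transparently. This is a toy Python loop, not the procedure actually used at CNAF; host names, the TCP check and the thresholds are assumptions.

```python
import socket, time

PRIMARY = "tsm-primary.example"      # hypothetical host names
SECONDARY = "tsm-secondary.example"
FLOATING_IP = "192.0.2.10"           # address the TSM clients actually use
TSM_PORT = 1500                      # default TSM client/server port

def is_alive(host, port=TSM_PORT, timeout=5.0):
    """Very crude liveness check: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def failover_to_secondary():
    """Placeholder for the real procedure: start the TSM instance on the
    secondary node against the shared DB volume on the CX, then move the
    floating IP so clients are redirected without reconfiguration."""
    print(f"Starting TSM on {SECONDARY} and moving {FLOATING_IP} to it")

def monitor(poll_seconds=60, max_failures=3):
    failures = 0
    while True:
        failures = 0 if is_alive(PRIMARY) else failures + 1
        if failures >= max_failures:     # avoid flapping on a single missed check
            failover_to_secondary()
            break
        time.sleep(poll_seconds)

if __name__ == "__main__":
    monitor()
```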

Customer (and our) satisfaction
– INFN-T1 availability above the MoU threshold in the last months
   – The storage component is a big contributor to this
– To be honest: 1 incident (CMS) last year, with a risk of data loss
– In general, good feedback from the experiments
   – CMS
   – Atlas reprocessing with very good efficiency (waiting for the official report)
[Slide plots: CMS queue (May 15) and farm-to-CMS-storage traffic]

Yearly statistics
[Slide plots: LAN traffic, WAN traffic, tape traffic and tape mounts per hour]

What’s new
Key-word: consolidation
– Complete revision of the system
– Implementation of new features (e.g. DMAPI-driven recalls)
– Implementation of a test-bed for the certification of GEMSS
   – Verification of the compatibility of GPFS, TSM, StoRM and the home-made layer
– Original plan: NFS 4.1, HTTP, etc.
   – Not really tested yet
   – IPv6 support?
New (internal) requirement: tiered storage
– Aggregation of different technologies
– Implementation with GPFS policies?
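GPFS would express the tiering with its built-in, SQL-like placement and migration policies; purely as an illustration of the kind of decision such a policy encodes (pool names, thresholds and the age criterion below are assumptions, not the Tier-1 configuration), a short Python sketch:

```python
import time

# Hypothetical storage pools aggregating different disk technologies.
FAST_POOL = "sas_fast"       # assumed name: low-latency SAS-based pool
CAPACITY_POOL = "sata_bulk"  # assumed name: high-capacity SATA-based pool

AGE_THRESHOLD_DAYS = 30      # assumed criterion: demote data not read recently
SIZE_THRESHOLD_MB = 100      # assumed criterion: large files go to the bulk tier

def choose_pool(size_mb, last_access_epoch, now=None):
    """Decide the target pool for a file, mimicking the kind of decision a
    GPFS placement/migration policy rule would express declaratively."""
    now = time.time() if now is None else now
    idle_days = (now - last_access_epoch) / 86400
    if idle_days > AGE_THRESHOLD_DAYS or size_mb > SIZE_THRESHOLD_MB:
        return CAPACITY_POOL
    return FAST_POOL

# Example: a 2 GB file untouched for 90 days lands on the capacity tier.
ninety_days_ago = time.time() - 90 * 86400
print(choose_pool(size_mb=2048, last_access_epoch=ninety_days_ago))
```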

Backup slides

Solutions under evaluation (KLOE)
– RAW data stored at CNAF, RECO and DST at LNF
– Disk buffer (~20 TB) to avoid local use of the tape library
– Upgrade of the LNF access bandwidth to 1-2 Gbps
– The way the remote copies will be performed is still to be detailed
   – Interim solution with GridFTP
   – Access through a wide-area GPFS cluster under evaluation
– Technical discussion with KLOE started
CdG, April 2011
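For the interim GridFTP solution, the remote copy would boil down to client-driven transfers with a standard GridFTP client; a hedged Python wrapper sketch follows (host names and paths are placeholders, and globus-url-copy must be available with a valid grid proxy):

```python
import subprocess

def gridftp_copy(src_url: str, dst_url: str, parallel_streams: int = 4) -> None:
    """Copy one file between GridFTP endpoints using globus-url-copy.

    -p sets the number of parallel TCP streams, useful on the ~1-2 Gbps
    CNAF-LNF path discussed above. Requires a valid proxy (grid-proxy-init
    or voms-proxy-init) in the environment.
    """
    cmd = ["globus-url-copy", "-p", str(parallel_streams), src_url, dst_url]
    subprocess.run(cmd, check=True)

# Hypothetical example: push one RAW file from CNAF to the LNF disk buffer.
gridftp_copy(
    "gsiftp://gridftp.cnaf.example/storage/kloe/raw/run012345.dat",
    "gsiftp://gridftp.lnf.example/buffer/kloe/raw/run012345.dat",
)
```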

GEMSS layout for a typical LHC experiment at the INFN Tier-1

GEMSS in production for CMS
– GEMSS went into production for CMS in October 2009, without major changes to the layout (only a StoRM upgrade, with checksum and authorization support also being deployed soon)
– Good performance achieved in transfer throughput
   – High use of the available bandwidth (up to 8 Gbps)
– Verification with Job Robot jobs in different periods shows that the efficiency of CMS workflows was not impacted by the change of storage system
   – “Castor + SL4” vs “TSM + SL4” vs “TSM + SL5”
– From the experience so far, CMS gives very positive feedback on the new system
   – Very good stability observed so far
[Slide plots: transfer rates CNAF ➝ T1_US_FNAL and CNAF ➝ T2_CH_CAF]

One year statistics
[Slide plots: native GPFS traffic (LAN only) and GridFTP traffic (mostly WAN)]

StoRM upgrade
– StoRM installed on the Atlas and archive end-points in March
   – Issues found and not yet solved: memory leak on the BE (under load), segmentation faults on the FEs
   – Work-around in place (Nagios)
– Waiting for the new release (1.7.0 from EMI 1)
   – Mid June?