Building a High Performance Mass Storage System for Tier1 LHC site Vladimir Sapunenko, INFN-CNAF GRID2012, July 16 – 21 Dubna, Russia.

Tier1 site at INFN-CNAF
CNAF is the national centre of INFN (the Italian National Institute for Nuclear Physics) for research and development in the field of information technologies applied to high-energy physics experiments. The Tier1 site has been operational since 2005.

Tier1 at a glance
All 4 LHC experiments
20 HEP, space and astrophysics experiments
Computing farm
–1300 WNs
–130K HEP-SPEC
–13K job slots
Storage
–10 PB on disk
–14 PB on tape

Mass Storage Challenge
Several petabytes of data (online and near-line) need to be accessed at any time by thousands of concurrent processes
The aggregated data throughput required, both on the Local Area Network (LAN) and the Wide Area Network (WAN), is of the order of several GB/s
Long-term transparent archiving of data is needed
Frequent configuration changes
Independent experiments (with independent production managers and end-users) compete for disk and tape resources
Chaotic access can lead to traffic jams, which must be handled as quasi-ordinary situations

What do we need to do to meet that challenge? We need a Mass Storage solution with the following features:
Grid-enabled
high performance
modular
stable and robust
targeted to large computing centres (such as WLCG Tier-1s); large means custodial of O(10) PB of data
simple installation and management
24x7 operation with limited manpower
centralized administration

Storage HW
10 PB of disk
–15 disk arrays (8x EMC CX3-80, 7x DDN S2A 9950)
~130 disk servers
–40 with 10 Gb/s Ethernet ( TB/server)
–90 with 2x1 Gb/s Ethernet (50-75 TB/server)
14 PB of tape
–SL8500 tape library (10K slots)
20 T10000B drives (1 TB cartridge)
10 T10000C drives (5 TB cartridge)
–1 TSM server (+1 stand-by)
–13 HSM nodes
~500 SAN ports (FC4/FC8)

ATLAS: outside view
Disk space used by ATLAS, % – INFN's share is 8%
Data volume processed in 1 month – INFN's share is 10%
Average efficiency of successfully completed jobs – INFN: second in the global ranking (data from the DQ2 ATLAS accounting)

ATLAS: inside view
2.3 PB of disk space
3 DDN S2A9950, 2 TB SATA, 8xFC8
8 I/O servers (10 Gb/s, 24 GB RAM, 2xFC8)
2 metadata servers (1 Gb/s, 4 GB RAM, 2xFC4)
4 gridFTP servers (10 Gb/s, 24 GB RAM, 2xFC8)
5 StoRM servers (1 Gb/s, 4 GB RAM)
2 HSM servers (1 Gb/s, 4 GB RAM)
One-week statistics: throughput in GB/s to/from the LAN (farm) and to/from the WAN (gridFTP)

LHCb: CPU used at CERN and Tier-1s in 2012 (CERN, CNAF, GRIDKA, RAL, IN2P3, NIKHEF, PIC, SARA)
Comparing the share of CPU used in successful jobs with the share of CPU used in failed jobs: CNAF is the first centre after CERN for CPU used, and the last when ranked by the fraction of CPU time wasted by jobs failing for any reason
The main reason: the stability of the storage system!

LHCb
0.76 PB of disk space in a single file system
40 TB reserved as tape buffer (more space can be used if available)
1 EMC CX4, 1 TB SATA, 8xFC4
10 I/O servers (2x1 Gb/s, 8 GB RAM, 2xFC4)
2 metadata servers (1 Gb/s, 8 GB RAM, 2xFC4)
4 gridFTP servers (2x1 Gb/s, 8 GB RAM, 2xFC4)
3 StoRM servers (1 Gb/s, 4 GB RAM)
2 HSM servers (1 Gb/s, 4 GB RAM)

LHCb data by site (chart; CNAF highlighted)

ALICE (MonALISA monitoring)
I/O activity on disk: IN 100 MB/s, OUT 2.1 GB/s
I/O activity on tape buffer: IN 5 MB/s, OUT 800 MB/s

ALICE
8 XrootD servers
–6 for disk-only
–2 for tape buffer
–8-core 2.2 GHz, 10 Gb/s, 24 GB RAM, 2xFC8
2 metadata servers
Storage
–DDN S2A 9950, 1.3 PB net space
Two GPFS file systems
–960 TB disk-only
–385 TB tape buffer
Manages tape recalls directly from GPFS
–Custom plug-in to interface XrootD with GEMSS (CNAF's MSS): modified the XrdxFtsOfsFile::open method in the XrootD library
–By F. Noferini and V. Vagnoni

ALICE: Tape Performance
ALICE has been working hard this week, reading a lot from the tape buffer
Plot: reads from tapes

Tier1 Storage Group: Tasks and Staff
Tasks:
–Disk storage administration (GPFS, GEMSS)
–Tape library (ACSLS, TSM)
–SAN maintenance and administration
–Server installation and configuration
–Services (SRM, FTS, DB)
–Monitoring (of all HW and SW components)
–Procurement (tender definition)
–HW life cycle management and 1st-level support
Staff:
–Just 5 FTE (Full Time Equivalents)
How do we manage all this?

Our approach
Fault tolerance and redundancy everywhere, but avoiding resource thrashing
–Using active-active configurations as much as possible: the load of failed elements is distributed over the remaining ones (SAN, servers, controllers)
Monitoring and automated recovery procedures
–NAGIOS event handlers
Minimizing the number of managed objects
–Few but BIG storage systems
–10 Gb servers
High level of optimization
–OS and network tuning
Test everything before deploying
–A dedicated cluster with full functionality as a testing facility (testbed)
Relying on industry standards (GPFS, TSM)
Reducing complexity
–TSM rather than HPSS

Software components
GPFS as the clustered parallel file system
TSM as the HSM system
StoRM as the SRM
GEMSS as the interface between StoRM, GPFS and TSM
NAGIOS for alarms and event handling
QUATTOR as the system configuration manager
LEMON as the monitoring tool

GPFS
General Parallel File System from IBM
–Clustered (fault tolerance and redundancy)
–Parallel (scalability)
–Used widely in industry (very well documented and supported by the user community and by IBM)
–Always provides maximum performance (no need to replicate data to increase availability)
–Runs on AIX, Linux (RH, SL) and Windows
–Is NOT bound to IBM's hardware!

GPFS (2)
Advanced high-availability features
–disruption-free maintenance: servers and storage devices can be added or removed while keeping the file systems online
–when storage is added or removed, data can be dynamically rebalanced to maintain optimal performance
Centralized administration
–cluster-wide operations can be managed from any node in the GPFS cluster
–easy administration model, consistent with standard UNIX file systems
Supports standard file system functions (user quotas, snapshots, etc.)
Many other features not fitting in two slides…

TSM
Tivoli Storage Manager (IBM)
–Very powerful
–Simple: DB (DB2) management is hidden from the administrator
–Built-in HSM functionality: transparent data movement
–Integrated with GPFS
–Widely used in industry: a lot of experience, and easy to get technical support either from IBM or from the user community

StoRM: STOrage Resource Manager
StoRM is an implementation of the SRM solution, designed to leverage the advantages of cluster file systems (like GPFS) and standard POSIX file systems in a Grid environment; developed at INFN-CNAF.
StoRM provides data management capabilities in a Grid environment to share, access and transfer data among heterogeneous and geographically distributed data centres, supporting direct access (native POSIX I/O calls) to shared files and directories, as well as other standard Grid access protocols. StoRM is adopted in the context of the WLCG computational Grid framework.

A little bit of history
CASTOR had been the traditional solution for mass storage at CNAF for all VOs since 2003
–Large variety of issues, both at the setup/admin level and at the VO level (complexity, scalability, stability, …)
–successfully used in production, despite large operational overhead
In parallel to production, in 2006 we started to search for a potentially more scalable, performant and robust solution
–Q1 2007: after massive comparison tests, GPFS was chosen as the only solution for disk-based storage (it had already been in use at CNAF for a long time before this test)
–Q2 2007: StoRM (developed at INFN) implements the SRM 2.2 specifications
–Q3-Q4 2007: StoRM/GPFS in production for D1T0 for LHCb and ATLAS, with clear benefits for both experiments (significantly reduced load on CASTOR)
–End 2007: a project started at CNAF to realize a complete grid-enabled HSM solution based on StoRM/GPFS/TSM

GEMSS
Grid Enabled Mass Storage System
–A full HSM (Hierarchical Storage Management) integration of GPFS, TSM and StoRM
–combines GPFS- and TSM-specific features with StoRM to provide a transparent, Grid-friendly HSM solution
An interface between GPFS and TSM has been implemented to minimize mechanical operations in the tape robotics (mount/unmount, search/rewind)
StoRM has been extended to include the SRM methods required to manage tapes
Permits minimizing the management effort and increasing reliability
Very positive experience with scalability so far
Based on the large GPFS installation in production at CNAF since 2005, with increasing disk space and number of users

GEMSS Development Timeline
D1T0 Storage Class implemented with StoRM/GPFS for LHCb and ATLAS
D1T1 Storage Class implemented with StoRM/GPFS/TSM for LHCb
D0T1 Storage Class implemented with StoRM/GPFS/TSM for CMS
GEMSS is used in production by all LHC and non-LHC experiments for all Storage Classes
ATLAS, ALICE, (CMS) and LHCb, together with all other non-LHC experiments (Argo, Pamela, Virgo, AMS), use GEMSS in production!
Introduced a DMAPI server (to support GPFS 3.3/3.4)

Components of GEMSS
A disk-centric system with five building blocks:
GPFS: disk-storage software infrastructure
TSM: tape management system
StoRM: SRM service
TSM-GPFS interface
Globus GridFTP: WAN data transfers

GEMSS recall system
The selective recall system in GEMSS uses 4 processes: yamssEnqueueRecall, yamssMonitor, yamssReorderRecall and yamssProcessRecall
yamssEnqueueRecall & yamssReorderRecall manage a FIFO queue of the files to be recalled, fetch files from the queue, and build sorted lists with optimal file ordering
yamssProcessRecall actually creates the recall streams, performs the recalls and manages the error conditions (i.e. retries failed file recalls…)
yamssMonitor is the supervisor of the reorder and recall phases
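The exact ordering logic of yamssReorderRecall is not spelled out on the slide; a minimal sketch of the underlying idea is to group the queued files per tape and sort them by position on the cartridge, so each tape is mounted once and read in a single forward pass. File names, tape labels and the (name, tape, position) record layout below are hypothetical:

```python
from collections import defaultdict

def reorder_recalls(queue):
    """Reorder a FIFO recall queue to minimize tape mounts and seeks.

    Each entry is (file_name, tape_label, position_on_tape).
    Files are grouped per tape, then sorted by position, producing
    one recall stream per cartridge read front-to-back.
    """
    by_tape = defaultdict(list)
    for name, tape, pos in queue:
        by_tape[tape].append((pos, name))
    streams = []
    for tape in sorted(by_tape):  # one recall stream per cartridge
        ordered = [name for _, name in sorted(by_tape[tape])]
        streams.append((tape, ordered))
    return streams

# FIFO order as submitted by users: tapes interleaved, positions scattered
fifo = [("a.root", "T0042", 310), ("b.root", "T0017", 5),
        ("c.root", "T0042", 12), ("d.root", "T0017", 88)]
print(reorder_recalls(fifo))
# → [('T0017', ['b.root', 'd.root']), ('T0042', ['c.root', 'a.root'])]
```

With the reordered streams, each of the ~100 tapes in a bulk recall is mounted and searched only once, which is what makes the mechanical operations (mount/unmount, search/rewind) cheap.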

GEMSS interface
A set of administrative commands has also been developed (for monitoring, stopping and starting migrations and recalls, and performance reporting)
Almost 50 user interface commands/daemons; some examples:
–yamssEnqueueRecall (command): simple command line to enqueue into a FIFO the files to recall from tape
–yamssLogger (daemon): centralized logging facility; 3 log files (for migrations, premigrations and recalls) are centralized for each YAMSS-managed file system
–yamssLs (command): ls-like interface which, in addition, prints the status of each file: premigrated, migrated, or disk-resident
Shipped as an RPM package for installation/distribution
Provides several STAT files for accurate statistics, which include:
–file name
–time stamp
–file size
–tape label
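The slide lists the fields kept in the STAT files but not their layout; a small sketch of how such records could be parsed for reporting, assuming a hypothetical whitespace-separated line format (the field order and sample paths are invented for illustration):

```python
import io

def parse_stat_file(fobj):
    """Parse a GEMSS-style STAT file into records.

    Assumed (hypothetical) whitespace-separated layout per line:
    timestamp, file size in bytes, tape label, file name.
    """
    records = []
    for line in fobj:
        ts, size, tape, name = line.split()
        records.append({"time": int(ts), "size": int(size),
                        "tape": tape, "file": name})
    return records

sample = io.StringIO(
    "1342605600 2147483648 T0042 /gpfs/atlas/raw/run1.root\n"
    "1342605660 1073741824 T0017 /gpfs/atlas/raw/run2.root\n")
recs = parse_stat_file(sample)
total_gib = sum(r["size"] for r in recs) / 2**30
print(f"{len(recs)} files, {total_gib:.1f} GiB")  # → 2 files, 3.0 GiB
```

Per-file timestamps, sizes and tape labels are enough to derive the accuracy-sensitive statistics mentioned above (throughput per stream, per-tape volumes, retry counts).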

Pre-production tests
~24 TB of data moved from tape to disk
Recalls corresponding to five days of typical usage by a large LHC experiment (namely CMS), compacted into one shot and completed in 19 h
Files were spread over ~100 tapes
Average throughput: ~400 MB/s
0 failures
Up to 6 drives used for recalls
Simultaneously, up to 3 drives used for migrations of new data files
Up to ~530 MB/s of tape recalls
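The quoted average is consistent with the volume and duration: a quick sanity check of ~24 TB in 19 hours (whether the slide means decimal TB or binary TiB is an assumption; both land near the ~400 MB/s figure):

```python
def avg_throughput_mb_s(volume_tb, hours, binary=True):
    """Average sustained rate for a bulk recall campaign, in MiB/s or MB/s."""
    unit = 2**40 if binary else 10**12      # TiB vs TB
    per_mb = 2**20 if binary else 10**6     # MiB vs MB
    return volume_tb * unit / (hours * 3600) / per_mb

# ~24 TB recalled from ~100 tapes in 19 hours
rate = avg_throughput_mb_s(24, 19)
print(f"{rate:.0f} MiB/s")  # in the same ballpark as the ~400 MB/s quoted
```

This is aggregate over up to 6 drives, i.e. roughly 60-70 MB/s sustained per T10000B drive including mount, search and rewind overheads.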

GEMSS monitoring
Integration with NAGIOS for alerting, notification and automatic actions (e.g. restarting failed TSM daemons)
Integration with LEMON monitoring
Plot: T10KB tape drive (SAN traffic)
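The slide does not show the event handlers themselves; a minimal sketch of the pattern NAGIOS uses for such automatic actions, restarting a service only on a hard failure state so transient glitches are ignored (the daemon name, state strings and restart command are illustrative assumptions, not the actual CNAF handler):

```python
import subprocess

def tsm_event_handler(state, attempt, restart_cmd, max_attempts=3):
    """NAGIOS-style event handler sketch: restart a failed TSM daemon.

    Called with the service state ("OK"/"WARNING"/"CRITICAL") and the
    current check attempt; acts only on a hard CRITICAL state (the
    final retry), mirroring how NAGIOS distinguishes soft/hard states.
    """
    if state != "CRITICAL" or attempt < max_attempts:
        return "no action"
    result = subprocess.run(restart_cmd, capture_output=True)
    return "restarted" if result.returncode == 0 else "restart failed"

# Soft states and early attempts are ignored; "dsmrecalld" is a
# hypothetical daemon name here.
print(tsm_event_handler("WARNING", 1, ["service", "dsmrecalld", "restart"]))
# → no action
```

Gating on the hard state keeps the automation from thrashing a daemon that merely failed one health check, which matches the slide's philosophy of redundancy without resource thrashing.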

GEMSS in production
~11 PB of data have been migrated to tape since GEMSS entered production
–(some data was deleted by users => now 8.9 PB used)

ATLAS data re-processing
An ATLAS computing activity involving massive data recall from tape: 4.20% of the total processing activity at the T1 (170 TB) in 2011
GPFS/TSM traffic plot; write: recalls from tape to disk for reprocessing; read: writes to tape from Tier-0 (raw data flow)
Good performance for simultaneous read/write access
High efficiency (99% successful jobs)
Just a few days to complete

Conclusions
We implemented a full HSM system based on GPFS and TSM, able to satisfy the requirements of the WLCG experiments operating at the Large Hadron Collider
StoRM, the SRM service for GPFS, has been extended to manage tape support
An interface between GPFS and TSM (GEMSS) was realized to perform tape recalls in an optimal order, achieving great performance
A modification to the XrootD library made it possible to interface XrootD with GEMSS
GEMSS is the storage solution used in production at our Tier1 as a single integrated system for ALL the LHC and non-LHC experiments
The recent improvements in GEMSS have increased the level of reliability and performance of storage access
Results from the experiments' perspective over the latest years of production show the system's reliability and high performance with moderate effort
GEMSS is the treasure!

Contributors
Alessandro Cavalli, INFN-CNAF
Luca dell'Agnello, INFN-CNAF
Daniele Gregori, INFN-CNAF
Andrea Prosperini, INFN-CNAF
Francesco Noferini, INFN Enrico Fermi Centre
Pier Paolo Ricci, INFN-CNAF
Elisabetta Ronchieri, INFN-CNAF
Vincenzo Vagnoni, INFN Bologna

Thank you for your attention! Questions? Вопросы?