Storage resources management and access at TIER1 CNAF


Storage resources management and access at TIER1 CNAF
Ricci Pier Paolo, Lore Giuseppe, Vagnoni Vincenzo, on behalf of INFN TIER1 Staff
pierpaolo.ricci@cnaf.infn.it
ACAT 2005, May 22-27 2005, DESY Zeuthen, Germany

TIER1 INFN CNAF Storage (overview)
Access: Linux SL 3.0 client farms (100-1000 nodes) over the WAN or TIER1 LAN, via RFIO, NFS, GridFTP and other protocols.
NAS (20 TB): NAS1, NAS4 (3ware IDE SAS, 1800 + 3200 GB); PROCOM 3600 FC NAS2 (9000 GB); PROCOM 3600 FC NAS3 (4700 GB).
HSM (400 TB): CASTOR HSM servers (H.A.); STK L5500 robot (5500 slots) with 6 IBM LTO-2 and 2 (4) STK 9940B drives; STK180 with 100 LTO-1 (10 TB native).
SAN 1 (200 TB) and SAN 2 (40 TB): diskservers with Qlogic 2340 FC HBAs; 2 Brocade Silkworm 3900 32-port FC switches; 2 Gadzoox Slingshot 4218 18-port FC switches; IBM FastT900 (DS4500), 3/4 x 50000 GB, 4 FC interfaces; STK BladeStore, about 25000 GB, 4 FC interfaces; AXUS BROWIE, about 2200 GB, 2 FC interfaces; Infortrend A16F-R1A2-M1, 4 x 3200 GB SATA; Infortrend A16F-R1211-M2 + JBOD, 5 x 6400 GB SATA.
Backup: W2003 Server with LEGATO Networker.

CASTOR HSM
Tape library: STK L5500, 2000 + 3500 mixed slots, with 6 LTO-2 drives (20-30 MB/s) and 2 9940B drives (25-30 MB/s); 1300 LTO-2 cartridges (200 GB native) and 650 9940B cartridges (200 GB native).
Point-to-point 2 Gb/s FC connections to SAN 1 and SAN 2, with full redundancy (dual-controller hardware and Qlogic SANsurfer path-failover software).
8 tapeservers, Linux RH AS 3.0, Qlogic 2300 HBAs.
ACSLS 7.0 running on a Sun Blade v100 (OS Solaris 9.0, 2 internal IDE disks in software RAID-0).
1 CASTOR (CERN) central services server (RH AS 3.0) and 1 ORACLE 9i rel. 2 DB server (RH AS 3.0).
6 stagers with diskserver (RH AS 3.0), 15 TB local staging area; 8 or more RFIO diskservers (RH AS 3.0), min. 20 TB staging area.
Access from the WAN or TIER1 LAN.
Allocation per experiment, staging area (TB) / tape pool (TB native):
ALICE 8 / 12; ATLAS 6 / 20; CMS 2 / 15; LHCb 18 / 30; BABAR, AMS + others 4.

CASTOR HSM (2)
In general we obtained:
Good performance when writing into the staging area (disk buffer) and from the staging area to tape (2 parallel streams per tape give about 40 MB/s).
Good reliability of the stager service (every LHC experiment has its own dedicated stager and policies) and high reliability of the central CASTOR services.
Poor reliability of the LTO-2 drives in both writing and reading: tapes get marked read-only or disabled during writing, and stage-in of files in random order can lock up or fail. Together with the experiment coordination we could temporarily increase the staging area (disk buffer) and perform an optimized sequential stage-in of the data just before the analysis phase; the analysis jobs could then run directly over RFIO or Grid tools on CASTOR with a high probability of finding the files already on disk (LHCb). After the end of the analysis phase the disk buffer could be reassigned to another experiment. We decided to acquire and use more STK 9940B drives for random access to the data.
Access to the CASTOR HSM system is:
Direct, using the rf<command> clients on the user interfaces or on the WNs (rfcp, rfrm or the API...).
Through a front-end with a GridFTP interface to CASTOR and SRM v1.
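
A minimal sketch of the two access paths listed above, assuming the RFIO clients (rfcp, rfdir) and globus-url-copy are installed; the /castor path, the stager hostname and the castorgrid URL are illustrative placeholders, not the actual CNAF namespace.

    # Direct RFIO access from a UI or WN (triggers a stage-in if the file is only on tape)
    export STAGE_HOST=stager-lhcb.cnaf.infn.it        # per-experiment stager (hypothetical name)
    rfdir /castor/cnaf.infn.it/lhcb/dst/
    rfcp /castor/cnaf.infn.it/lhcb/dst/file001.dst /tmp/file001.dst

    # Access through the GridFTP front-end (the castorgrid SE, see the later slide)
    globus-url-copy gsiftp://castorgrid.cnaf.infn.it/castor/cnaf.infn.it/lhcb/dst/file001.dst \
        file:///tmp/file001.dst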

DISK access
GB Ethernet connections from the WAN or TIER1 LAN: NFS, RFIO, Xrootd, GPFS, GridFTP.
Generic diskserver: Supermicro 1U, 2 Xeon 3.2 GHz, 4 GB RAM, GB Ethernet, 1 or 2 Qlogic 2300 HBAs, Linux AS or CERN SL 3.0; each LUN appears as /dev/sda, /dev/sdb, ...
1 or 2 2 Gb/s FC connections per diskserver to 2 Brocade Silkworm 3900 32-port FC switches, zoned (one 50 TB unit with 4 diskservers), with 2 x 2 Gb interlink connections.
Farms of rack-mountable 1U biprocessor nodes (currently about 1000 nodes, 1300 kSpecInt2000).
50 TB IBM FastT900 (DS4500): dual redundant controllers (A, B), internal minihubs (1, 2), RAID5, 2 TB logical disks (LUN0, LUN1, ...). With 4 diskservers per 50 TB unit, every controller can deliver a maximum of about 120 MB/s read/write.
FC path failover HA: Qlogic SANsurfer, IBM or STK RDAC for Linux.
Application HA: NFS server and RFIO server with Red Hat Cluster AS 3.0 (tested but not currently used in production); GPFS with NSD primary/secondary configuration, e.g. /dev/sda primary Diskserver1, secondary Diskserver2; /dev/sdb primary Diskserver2, secondary Diskserver3; ...
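
A minimal sketch of the NSD primary/secondary configuration mentioned above, using the GPFS administration commands (mmcrnsd, mmcrfs); the disk descriptor format is the GPFS 2.x one, and the hostnames, device names and mount point are placeholders rather than the production setup.

    # Disk descriptor file: DiskName:PrimaryServer:BackupServer:DiskUsage:FailureGroup
    cat > nsd.desc <<'EOF'
    /dev/sda:diskserver1:diskserver2:dataAndMetadata:1
    /dev/sdb:diskserver2:diskserver3:dataAndMetadata:2
    EOF

    mmcrnsd -F nsd.desc                        # define the NSDs with their primary/secondary servers
    mmcrfs /gpfs/data gpfs_data -F nsd.desc    # create the filesystem on those NSDs
    mount /gpfs/data                           # mount on each node (newer releases also offer mmmount -a)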

DISK access (2)
We have different protocols in production for accessing the disk storage. On our diskservers and Grid SE front-ends we currently have:
NFS on a local filesystem. Advantages: easy client implementation, good compatibility, possibility of failover (RH 3.0). Disadvantages: poor performance scalability with a high number of accesses (1 client 30 MB/s, 100 clients 15 MB/s throughput).
RFIO on a local filesystem. Advantages: good performance, compatibility with Grid tools, possibility of failover. Disadvantages: no scalability of the front-ends for a single filesystem, no possibility of load balancing.
Grid SE GridFTP/RFIO over GPFS (CMS, CDF). Advantages: separation of the GPFS servers (which access the disks) from the SE GPFS clients; load balancing and HA on the GPFS servers, with the possibility to implement the same on the Grid SE services (see next slide). Disadvantages: the GPFS layer imposes OS and certified-hardware requirements for support.
Xrootd (BABAR). Advantages: good performance. Disadvantages: no possibility of load balancing for the single filesystem backends, not Grid compliant (at present...).
NOTE: IBM GPFS 2.2 is a CLUSTERED FILESYSTEM, so many front-ends (i.e. GridFTP or RFIO servers) can access the SAME filesystem simultaneously. It also allows larger filesystem sizes (we use 8-12 TB).

CASTOR Grid Storage Element
GridFTP access goes through the castorgrid SE, a DNS cname pointing to 3 servers, with DNS round-robin for load balancing. During LCG Service Challenge 2 a load-average-based selection was also introduced: every M minutes the IP of the most loaded server is replaced in the cname (see graph).
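
A minimal sketch of such a load-based selection, run from cron every few minutes: the most loaded GridFTP server is dropped from the alias and the other two are republished so DNS round-robin continues on the remaining hosts. The hostnames, zone and the use of nsupdate are illustrative assumptions; the production mechanism may well differ.

    #!/bin/bash
    SERVERS="castorftp1 castorftp2 castorftp3"
    worst="" ; worst_load=-1
    for h in $SERVERS; do
      load=$(ssh "$h" "cut -d' ' -f1 /proc/loadavg")      # 1-minute load average
      if awk -v a="$load" -v b="$worst_load" 'BEGIN{exit !(a>b)}'; then
        worst=$h ; worst_load=$load                       # remember the most loaded host
      fi
    done
    {
      echo "update delete castorgrid.cnaf.infn.it A"
      for h in $SERVERS; do
        [ "$h" = "$worst" ] && continue                   # publish every server except the worst
        ip=$(host "$h" | awk '/has address/{print $4; exit}')
        echo "update add castorgrid.cnaf.infn.it 300 A $ip"
      done
      echo "send"
    } | nsupdate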

Monitoring/notifications (Nagios)
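
A minimal sketch of a custom Nagios check in the spirit of the storage monitoring shown here, warning on staging-area occupancy; the mount point and thresholds are placeholders, not the actual CNAF configuration (only the standard Nagios exit codes 0/1/2 for OK/WARNING/CRITICAL are assumed).

    #!/bin/bash
    MOUNT=/castor-staging
    WARN=85 ; CRIT=95
    usage=$(df -P "$MOUNT" | awk 'NR==2 {gsub("%","",$5); print $5}')   # used space in percent
    if   [ "$usage" -ge "$CRIT" ]; then echo "CRITICAL - ${usage}% used on $MOUNT"; exit 2
    elif [ "$usage" -ge "$WARN" ]; then echo "WARNING - ${usage}% used on $MOUNT";  exit 1
    else                                echo "OK - ${usage}% used on $MOUNT";       exit 0
    fi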

Example monitoring plots: number of processes on a CMS disk SE; eth0 traffic through a CASTOR LCG SE; LHCb CASTOR tape pool.

Disk accounting
Accounting plots: pure disk space (TB) and CASTOR disk space (TB).

Parallel Filesystem Test
Test goal: evaluation and comparison of parallel filesystems (GPFS, Lustre) for the implementation of a powerful disk I/O infrastructure for the TIER1 INFN CNAF.
A moderately high-end testbed has been used: 6 IBM xSeries 346 file servers connected via FC SAN to 3 IBM FAStT900 (DS4500) arrays providing a total of 24 TB; maximum available throughput to the 30 client nodes over Gb Ethernet: 6 Gb/s.
PHASE 1: generic tests and tuning.
PHASE 2: realistic physics analysis jobs reading data from a parallel filesystem.
Dedicated tools for the tests (PHASE 1) and for monitoring have been written: the benchmarking tool allows the user to start, stop and monitor the test on all the clients from a single point, and is completely automated.

PHASE 1 Generic Benchmark
GPFS: very stable, reliable, fault tolerant, well suited to the storage of critical data, and free of charge for educational or research use.
Lustre: commercial product, easy to install, but fairly invasive (it needs a patched kernel) and with a per-node license cost.
The generic benchmark consists of sequential write/read from a variable number of clients simultaneously performing I/O with 3 different protocols (native GPFS, RFIO over GPFS, NFS over GPFS); 1 to 30 Gb Ethernet clients, 1 to 4 processes per client; sequential write/read of zeroed files by means of dd; file sizes ranging from 1 MB to 1 GB.

Generic Benchmark plots: results of read/write of 1 GB (different) files, shown as effective average throughput (Gb/s) vs. number of simultaneous reads/writes; raw Ethernet throughput vs. time for 20 simultaneous 1 GB file reads with Lustre.

Generic Benchmark (here shown for 1 GB files): write and read throughput (MB/s) as a function of the number of simultaneous client processes (1, 5, 10, 50, 120).
GPFS 2.3.0-1, native: 114 160 151 147 85 301 305
GPFS 2.3.0-1, NFS: 102 171 159 158 320 366 322 292
GPFS 2.3.0-1, RFIO: 79 166 321
Lustre 1.4.1: 512 488 478 73 640 453 403 93 284 281 68 269 314 349
Numbers are reproducible with small fluctuations. Lustre tests with NFS export have not yet been performed.

PHASE 2 Realistic analysis
We focus on analysis jobs, since they are generally the most I/O-bound processes of the experiment activity.
A realistic LHCb analysis algorithm runs on 8 TB of data served by RFIO daemons running on the GPFS parallel-filesystem servers. The algorithm performs a selection of an LHCb physics channel by reading the input DST (Data Summary Tape) files sequentially and producing ntuple files in output.
The analysis jobs are submitted to the production LSF batch system of the TIER1 INFN (RFIO was the simplest and most effective choice): 14000 jobs submitted, 500 jobs in simultaneous RUN state.
Steps of each job (see the sketch below):
1. RFIO-copy the file to be processed to the local WN disk
2. Analyze the data
3. RFIO-copy the output of the algorithm back
4. Clean up the files from the local disk
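
A minimal sketch of such a job script, following the four steps above; the file paths, server name, executable name and LSF queue name are placeholders, not the actual production values.

    #!/bin/bash
    # One analysis job: stage in with RFIO, analyze, stage out, clean up.
    # Submitted e.g. as: bsub -q lhcb_analysis ./analysis_job.sh dst_file_0001.dst
    INPUT=$1
    SCRATCH=$(mktemp -d /tmp/analysis.XXXXXX)

    rfcp diskserv-lhcb:/gpfs/lhcb/dst/$INPUT "$SCRATCH"/          # 1. RFIO copy to the local WN disk
    ./selection.exe "$SCRATCH/$INPUT" "$SCRATCH/ntuple.root"      # 2. run the selection algorithm
    rfcp "$SCRATCH/ntuple.root" diskserv-lhcb:/gpfs/lhcb/output/  # 3. RFIO copy the output back
    rm -rf "$SCRATCH"                                             # 4. clean up the local disk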

Realistic analysis results
8.1 TB of data processed in 7 hours; all 14000 jobs completed successfully.
More than 3 Gb/s raw sustained read throughput from the file servers with GPFS (about 320 MB/s effective).
Write throughput of the output data is negligible: just 1 MB per job.
The results are very satisfactory and give us confidence in the whole infrastructure layout. Tests of the Lustre configuration are in progress (we do not expect a big difference when using the RFIO protocol on top of a parallel filesystem).

Conclusions
In these slides we presented:
A general overview of the Italian TIER1 INFN CNAF storage hardware and access methods: HSM software (CERN CASTOR) for tape-library mass storage, and disk over SAN with different software protocols.
Some simple management implementations for monitoring and optimizing access to our storage resources.
Results from clustered parallel filesystem (Lustre/GPFS) performance measurements: Step 1, generic filesystem benchmarks; Step 2, realistic LHC analysis jobs.
Thank you all for your attention.

Benchmarking tools
Dedicated tools for benchmarking and monitoring have been written.
The benchmarking tool allows the user to start, stop and monitor the evolution of simultaneous read/write operations from an arbitrary number of clients, reporting the aggregated throughput at the end of the test.
It is realized as a set of bash scripts and C programs, and implements network bandwidth measurements by means of the netperf suite and sequential read/write with dd.
Designed for general use, it can be reused with minimal effort for any kind of storage benchmark.
Completely automated: the user does not need to install anything on the target nodes, as all the software is copied by the tool via ssh (and removed at the end); a few commands from the shell prompt are enough to control everything.
It can perform complex unattended and customized tests by means of very simple scripts, collect and save all the results, and produce plots.
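
A minimal sketch of the kind of driver described above: the same sequential dd write is launched on N clients over ssh and the per-client results are collected for offline aggregation. The client names, file size and target path are placeholders, and the real tool does considerably more (netperf runs, software distribution, plots).

    #!/bin/bash
    CLIENTS=$(seq -f "client%02g" 1 10)
    SIZE_MB=1024
    TARGET=/gpfs/bench

    for c in $CLIENTS; do
      ssh "$c" "dd if=/dev/zero of=$TARGET/$c.dat bs=1M count=$SIZE_MB" \
          > "result.$c" 2>&1 &                   # dd prints its throughput summary on stderr
    done
    wait                                         # all clients finished
    grep -h bytes result.client*                 # per-client summary lines, aggregated offline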

Monitoring tools
The monitoring tool measures the time dependence of the raw network traffic of each server with a granularity of one second.
Following the time dependence of the I/O gives important insights and is very valuable for a detailed understanding and tuning of the network and parallel-filesystem operational parameters.
The existing tools did not provide such low granularity, so we wrote our own, reusing work done for the LHCb online farm monitoring (consider that writing or reading a 1 GB file from a single client takes just a few seconds).
The tool automatically produces a plot of the aggregated network traffic of the file servers for each test in PDF format; the network-traffic data for each file server are also saved to ASCII files, in case one wants to make a detailed per-server analysis.
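
A minimal sketch of a 1-second-granularity sampler of this kind, reading the eth0 byte counters from /proc/net/dev and logging the rates to an ASCII file; this is only an illustrative stand-in, the actual tool is the bash/C code reused from the LHCb online farm monitoring.

    #!/bin/bash
    IFACE=eth0
    read_bytes() {
      grep "$IFACE:" /proc/net/dev | tr ':' ' ' | awk '{print $2, $10}'   # rx bytes, tx bytes
    }
    prev=($(read_bytes))
    while sleep 1; do
      cur=($(read_bytes))
      echo "$(date +%s) rx=$(( cur[0] - prev[0] )) tx=$(( cur[1] - prev[1] ))"   # bytes/s
      prev=("${cur[@]}")
    done >> eth0_traffic.dat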

GPFS features
Very stable, reliable and fault tolerant; well suited to the storage of critical data; free of charge for educational or research use.
Commercial product, initially developed by IBM for the SP series and then ported to Linux.
Advanced command-line interface for configuration and management.
Easy to install and not invasive: distributed as binaries in RPM packages; no patches to standard kernels are required, just a few kernel modules for POSIX I/O to be compiled for the running kernel.
Data and metadata striping; possibility of data and metadata redundancy (an expensive option, since it requires replication of whole files, but indicated for the storage of critical data).
Data recovery for filesystem corruption available.
Fault-tolerance features oriented to SAN, with internal health monitoring through network heartbeat.

Lustre features
Commercial product, easy to install, but fairly invasive.
Distributed as binaries and sources in RPM packages; requires its own Lustre patches to standard kernels, but binary distributions of patched kernels are made available.
Aggressive commercial effort: the developers sell it as an "Intergalactic Filesystem" scalable to 10000+ nodes.
Advanced interface for configuration and management.
Possibility of metadata redundancy and metadata-server fault tolerance.
Data recovery for filesystem corruption available.
POSIX I/O.

Sequential Write/Read benchmarks
Sequential write/read from a variable number of clients simultaneously performing I/O: 1 to 30 Gb Ethernet clients, 1 to 4 processes per client.
Sequential write/read of zeroed files by means of dd, with file sizes ranging from 1 MB to 1 GB.
After having been written, the files are read back, taking particular care to read the whole files from disk (i.e. no caching at all on the client side or on the server side).
Before starting each test, appropriate sync's are issued to flush the operating-system buffers so that consecutive tests do not interfere with each other. A minimal sketch of one such cycle is shown below.
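
One write/read cycle as it could be run on each client, sketched with dd; the mount point and file size are placeholders. The drop_caches knob only exists on 2.6.16+ kernels, so it is an assumption here; the tests described above relied on sync's and on reading back whole files to avoid cache effects.

    FILE=/gpfs/bench/$(hostname).dat

    dd if=/dev/zero of=$FILE bs=1M count=1024     # sequential write of a 1 GB zeroed file
    sync                                          # flush dirty buffers before the read phase
    echo 3 > /proc/sys/vm/drop_caches             # drop the page cache (needs root, 2.6.16+ kernels)
    dd if=$FILE of=/dev/null bs=1M                # sequential read back of the whole file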

Hardware testbed
Disk storage: 3 IBM FAStT900 (DS4500). Each FAStT900 serves 2 RAID5 arrays of 4 TB each (17 x 250 GB disks + 1 hot spare); each RAID5 array is further subdivided into two LUNs of 2 TB each. In total 12 LUNs and 24 TB of disk space (102 x 250 GB disks + 8 hot spares).
File system servers: 6 IBM xSeries 346, dual Xeon, 2 GB RAM, Gigabit NIC; a QLogic Fibre Channel PCI card on each server, connected to the DS4500 via a Brocade switch; 6 Gb/s available bandwidth to/from the clients.
Clients: 30 SuperMicro nodes, dual Xeon, 2 GB RAM, Gigabit NIC.

Realistic analysis results (with graphs)
8 TB of data processed in about 7 hours; about 14000 jobs submitted, all completed successfully; 500 analysis jobs in simultaneous RUN state, the rest PENDING.
3 Gb/s sustained read throughput from the file servers (with RFIO on top of GPFS); write throughput of the output data negligible, just about 1 MB per job.
Plot: LHCb LSF batch queue occupancy during the tests.

Abstract
Title: Storage resources management and access at TIER1 CNAF
Abstract: At present at the LCG TIER1 CNAF we have 2 main mass storage systems for archiving the HEP experiment data: an HSM software system (CASTOR) and about 200 TB of different storage devices over SAN. This paper briefly describes our hardware and software environment and summarizes the simple technical improvements we have implemented in order to obtain better availability and the best data-access throughput from the front-end machines. Some test results for different filesystems over SAN are also reported.