Mass Storage Systems for the Large Hadron Collider Experiments
A novel approach based on IBM and INFN software
A. Cavalli 1, S. Dal Pra 1, L. dell’Agnello 1, A. Forti 1, A. Ghiselli 1, D. Gregori 1, L. Li Gioi 1, B. Martelli 1, M. Mazzucato 1, P.P. Ricci 1, E. Ronchieri 1, V. Sapunenko 1, V. Vagnoni 2, C. Vistoli 1, R. Zappi 1
1 INFN, CNAF Bologna
2 INFN, Bologna Division
SC09, Portland, November 17, 2009

INFN – National Institute for Nuclear Physics
Founded in 1951, INFN is an organization dedicated to the study of the fundamental constituents of matter
It conducts theoretical and experimental research in subnuclear, nuclear, and astroparticle physics
Fundamental research in these areas requires cutting-edge technologies and instrumentation, which INFN develops both in its own laboratories and in collaboration with industry, and in close collaboration with the academic world

INFN today
About employees, university associates and students
4 National Laboratories: LNL, LNGS, LNF and LNS
19 Divisions located at Physics Departments of Italian Universities
1 National Computing Centre (Tier-1): CNAF Bologna
8 peripheral computing centres (Tier-2s)
More than users routinely accessing computing services

The Large Hadron Collider (LHC)
The largest particle accelerator ever built
27 km underground circular tunnel equipped with cryogenic superconducting dipole magnets operating at a temperature of 1.9 K
Protons travelling at nearly the speed of light collide every 25 ns
Every second, billions of particles are created and then observed with dedicated particle detectors
Overall cost of the infrastructure: $6 billion
Starting routine operation these days!

LHC Computing Model
Layered, hierarchical computing model, from the detectors to desktop PCs, passing through large computing farms
Data custody takes place at the Tier-0 (CERN) and at several Tier-1 centres across the US, Europe and Asia
[Diagram: World-wide LHC Computing Grid (WLCG) tiers. Labels: Online System; tens of petabytes of data produced each year; CERN; cache for data; department; Tier 0+1, Tier 1, Tier 3, Tier 4]

INFN-CNAF computing center
CNAF is the central computing facility of INFN
Italian Tier-1 computing centre for the LHC experiments ATLAS, CMS, ALICE and LHCb
… but also one of the main processing facilities for other experiments
  High-energy physics: BaBar (Stanford, CA) and CDF (Fermilab, IL)
  Astro and space physics: VIRGO (Italy), ARGO (Tibet), AMS (Satellite), PAMELA (Satellite) and MAGIC (Canary Islands)

Year    CPU power, HS06*    Disk Space, PB    Tape Space, PB
        k                   2.4 (2.8 RAW)
        k                   6.8 (8.2 RAW)     6.6

* HS06 stands for HEP SPEC 06. Approximately: 4 HS06 = 1 kSI2K

Mass Storage Challenge for the LHC
Long-term (several years) data custody is needed
  Several PB of data must be archived every year at a Tier-1 centre and kept near-line so that they can be accessed transparently at any time
LAN and WAN data access
  Tier-1 centres have to provide transparent access to online and near-line data files for many thousands of jobs running on the local computing farms
  Each Tier-1 centre must sustain incoming and outgoing data flows from/to CERN (Tier-0) and the other Tier-1/Tier-2 centres
  Many hundreds of simultaneous data transfer streams
  Sustained incoming and outgoing aggregate traffic on a Tier-1 MSS can reach the order of several GB/s

Existing MSS solutions in WLCG
Mass Storage Systems presently employed in WLCG Tier-0 and Tier-1 centres (e.g. CASTOR and dCache) are based on a “DAS” model
  Multiple servers with Direct Attached Storage disks act as “file servers”
  They provide read/write access to local files over the network through custom protocols
  Each file resides on the direct-attached disks of a single server (unless replication is used), i.e. there is no striping over multiple servers
  A centralized “nameserver” keeps track, in a database, of which file server holds each file
Monolithic and very complex products
  Maintenance and operation have proven to be difficult tasks

Why a new WLCG MSS?
Overcome the limitations of existing DAS-based products
  Complexity, scalability and stability issues, limited failover capabilities, limited support
Use widely employed, supported and well-documented components (either commercial or not) to do the most complicated tasks
  Do not try to reinvent the wheel
Keep the system modular
Use high-end industry standards in both hardware and software infrastructures
  Large high-performance SAN devices instead of small DAS boxes
  Few disk controllers, few points of failure
Simplify administration
  It needs to be fully centralized

Building blocks of the new system
Disk-centric system with five fundamental components
1. GPFS: disk-storage software infrastructure
2. TSM: tape management system
3. StoRM: SRM service
4. GEMSS: StoRM-GPFS-TSM interface
5. GridFTP: WAN data transfers
[Diagram: data files moving between StoRM, GridFTP (WAN data transfers), GPFS, TSM and a worker node, through the ILM-driven GEMSS data migration process and the GEMSS data recall process; networks: SAN, TAN, LAN]

Storage Resource Management (SRM) in the WLCG/EGEE Grid
In WLCG/EGEE all interactions between applications and storage systems are mediated by an abstraction layer, the so-called SRM
  Client applications submitted via the Grid should not need to be aware of the specific storage implementation installed at a given site
  A common interface has been defined to let applications interact transparently with the backend storage systems (either disk or tape)
SRM currently supports several access protocols over LAN (e.g. POSIX, RFIO*, DCAP*) and WAN (Globus GridFTP)
SRM also allows for remote space management of storage areas
* POSIX-like network protocols developed in HEP contexts

SRM service for GPFS: StoRM
StoRM is an implementation of the SRM v2.2 protocol and has been developed at INFN-CNAF
From the beginning, it was designed to leverage the advantages of parallel file systems and common POSIX file systems in a Grid environment
  It allows GPFS and other parallel file system implementations to be used in a WLCG/EGEE Grid framework, where the availability of SRM services is a mandatory requirement
StoRM has been in production for a couple of years at CNAF and in other Tier-2 centres in Europe, but only supporting disk-based storage systems
Recently, it has been adapted to support a complete HSM solution and is now in production with these new features

GPFS at CNAF
GPFS has been chosen at CNAF as the solution for disk-based storage
  Outstanding I/O performance and stability achieved
A large GPFS installation has been in production at CNAF since 2005, with increasing disk space and number of users
  At present, 2 PB of net disk space (> 6 PB in Q2 2010), partitioned into several GPFS clusters
  150 disk servers (NSD + GridFTP) connected to the SAN
Very positive experience so far
  1 FTE employed to manage the full system
  No disruptive events
  Very satisfied users

File migrations from GPFS to TSM
Data migration from GPFS to TSM has been implemented employing standard GPFS features
  The GPFS ILM engine performs metadata scans to produce the list of files eligible for migration
  ILM triggers the startup of GEMSS data migrator processes on a set of dedicated nodes
  GEMSS migrators in turn invoke HSM-client TSM commands to perform file transfers to tape
  Files belonging to different datasets are migrated to different TSM storage pools
When the file system occupancy exceeds a (configurable) threshold, ILM triggers a GEMSS garbage collector process
  The contents of files already copied to tape are removed from disk in order to bring the occupancy down to the desired value

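The garbage collection step described on this slide can be illustrated with a minimal Python sketch. It is not the GEMSS implementation, which relies on GPFS ILM policies and HSM-client TSM commands. The names `DiskFile` and `select_for_purge`, the two fractional thresholds and the least-recently-accessed ordering are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class DiskFile:
    """Hypothetical view of one file as seen by a metadata scan."""
    path: str
    size: int           # bytes currently occupied on disk
    on_tape: bool       # True once a copy has been migrated to tape
    last_access: float  # epoch seconds


def select_for_purge(files, capacity, high, low):
    """Pick files whose disk contents could be freed.

    Nothing is selected while occupancy is at or below `high * capacity`.
    Above it, files that already have a tape copy are selected (oldest
    access first, an assumed ordering) until occupancy would drop to
    `low * capacity` or no eligible file is left.
    """
    used = sum(f.size for f in files)
    if used <= high * capacity:
        return []
    target = low * capacity
    victims = []
    eligible = sorted((f for f in files if f.on_tape),
                      key=lambda f: f.last_access)
    for f in eligible:
        if used <= target:
            break
        victims.append(f)
        used -= f.size
    return victims
```

Files without a tape copy are never selected, which mirrors the slide's statement that only the contents of files already copied to tape are removed from disk.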
Introducing file recalls
Efficient recall of files from tape to disk is a more complex task than migration
  During bulk recalls, if files are not recalled in a proper order, a large number of tape mount/dismount sequences can lead to unreasonably low performance
  Optimal strategies for file recalls must take into account how files are distributed on tapes
However, in a Grid environment users have no way of knowing how files are stored at a given site holding the datasets of interest
  Intelligence is mandatory on the server side

GEMSS selective tape-ordered recalls (I)
Selective tape-ordered recalls have been implemented in GEMSS by means of four main commands/processes: gemssEnqueueRecall, gemssMonitor, gemssReorderRecall, gemssProcessRecall
  gemssEnqueueRecall is a command used to insert the names of files to be recalled into a FIFO
  gemssReorderRecall is a process which fetches files from the queue and builds sorted lists with optimal file ordering
  gemssProcessRecall is a process which performs the actual recalls from TSM to GPFS for one tape by issuing HSM-client TSM commands
  gemssMonitor starts one gemssReorderRecall process and as many gemssProcessRecall processes as specified in the configuration files

GEMSS selective tape-ordered recalls (II)
[Diagram: gemssEnqueueRecall inserts file paths into the recall queue (FIFO); gemssReorderRecall takes file paths from the queue and builds tape-ordered file lists (tape A, tape B, tape C, tape D); gemssProcessRecall processes, started by gemssMonitor, pull the file lists]

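The reordering idea behind these two slides can be sketched in a few lines of Python. This is only an illustration, not the GEMSS code. The catalogue `CATALOG`, the functions `locate`, `reorder_recalls` and `count_mounts`, and the notion of a numeric position on tape are hypothetical; in the real system the tape location of each file would come from TSM.

```python
from collections import defaultdict, deque

# Hypothetical catalogue: file path -> (tape label, position on that tape).
CATALOG = {
    "/cms/f1": ("A", 30),
    "/cms/f2": ("B", 10),
    "/cms/f3": ("A", 10),
    "/cms/f4": ("B", 5),
    "/cms/f5": ("A", 20),
}


def locate(path):
    """Return (tape, position) for a file; stands in for a TSM query."""
    return CATALOG[path]


def reorder_recalls(queue, locate):
    """Drain the FIFO and build one position-sorted file list per tape."""
    per_tape = defaultdict(list)
    while queue:
        path = queue.popleft()
        tape, position = locate(path)
        per_tape[tape].append((position, path))
    return {tape: [path for _, path in sorted(entries)]
            for tape, entries in per_tape.items()}


def count_mounts(paths, locate):
    """Count tape mounts needed when one drive serves paths in this order."""
    mounts, mounted = 0, None
    for path in paths:
        tape, _ = locate(path)
        if tape != mounted:
            mounts += 1
            mounted = tape
    return mounts


fifo_order = ["/cms/f1", "/cms/f2", "/cms/f3", "/cms/f4", "/cms/f5"]
tape_lists = reorder_recalls(deque(fifo_order), locate)
```

In this toy example, serving the requests in arrival order with a single drive alternates between the two tapes on every file, while the tape-ordered lists need one mount per tape and read each tape in increasing position. Each list could then be handed to a separate worker, in the spirit of one gemssProcessRecall per tape.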
GEMSS prototype setup
500 TB GPFS file system
6 NSD servers (6x2 Gbps on LAN)
4 GridFTP servers (4x2 Gbps)
3 TSM Storage Agents and HSM clients (HSM + STA)
TSM server, DB2
Tape library: 8 tape drives, 1 TB per tape, 1 Gbps per drive
[Diagram: the components above connected through LAN, SAN and TAN; link labels: 2x10 Gbps, 20x4 Gbps, 8x4 Gbps, 8x4 Gbps, 6x4 Gbps, 3x4 Gbps]

Entering the production phase
The largest LHC user at CNAF (the CMS experiment) has been moved from CASTOR to GEMSS
First issue: moving existing tape-resident data to the new system
  Migrating 1 PB of data is not a trivial task
  In addition, the migration had to be done without interrupting or degrading ordinary CMS production activities
A dedicated tool was developed to keep the CASTOR and GEMSS systems in sync during the migration phase
[Plots: throughput from tape; throughput to tape]

Performance with bulk recalls
As a validation stress test, recalls corresponding to five days of typical usage by the CMS experiment were submitted in one shot from CERN to CNAF through StoRM
24 TB of data, stored in files randomly spread over 100 tapes, were moved from TSM to GPFS via GEMSS in 19 h
Up to 6 drives were used for recalls while, at the same time, up to 3 drives were used for migrations of new data
Average throughput: ~400 MB/s
Number of failures: 0
[Plots: GB recalled versus time; throughput from tape; throughput to tape]

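A quick back-of-the-envelope check of these figures is sketched below in Python. It assumes the 24 TB and 19 h are exact, and it takes the 1 Gbps per drive value from the prototype setup slide as the nominal drive rate. The slides do not say whether TB means decimal or binary units, nor over which time window the quoted average was measured, so both unit conventions are computed. The helper `average_rate_mb_s` is a hypothetical name.

```python
def average_rate_mb_s(n_bytes, hours):
    """Average transfer rate in MB/s (1 MB = 10**6 bytes)."""
    return n_bytes / (hours * 3600) / 1e6


# 24 TB in 19 h, read as decimal terabytes and as binary tebibytes.
rate_decimal = average_rate_mb_s(24e12, 19)
rate_binary = average_rate_mb_s(24 * 2**40, 19)

# Nominal ceiling for recalls: up to 6 drives at 1 Gbps each, in MB/s.
drive_ceiling = 6 * 1e9 / 8 / 1e6

print(f"decimal TB: {rate_decimal:.0f} MB/s")
print(f"binary TB:  {rate_binary:.0f} MB/s")
print(f"6-drive nominal ceiling: {drive_ceiling:.0f} MB/s")
```

The two estimates come out at roughly 350 and 385 MB/s, the same order as the ~400 MB/s quoted on the slide and about half of the nominal 750 MB/s that six drives could deliver under these assumptions.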
Concluding remarks
A full HSM system based on GPFS and TSM, able to satisfy the requirements of the WLCG experiments operating at the Large Hadron Collider, has been implemented
StoRM, the SRM service for GPFS, has been extended to support tape
An interface between GPFS and TSM (GEMSS) has been built to implement a high-performance tape recall algorithm
1 PB of tape-resident data owned by the largest HSM user at CNAF (the CMS experiment) has been migrated from CASTOR to the new system without service interruption
All other experiments are now going to be migrated as well