Mass Storage Systems for the Large Hadron Collider Experiments: A novel approach based on IBM and INFN software

Presentation transcript:

Mass Storage Systems for the Large Hadron Collider Experiments
A novel approach based on IBM and INFN software
A. Cavalli¹, S. Dal Pra¹, L. dell'Agnello¹, A. Forti¹, A. Ghiselli¹, D. Gregori¹, L. Li Gioi¹, B. Martelli¹, M. Mazzucato¹, P.P. Ricci¹, E. Ronchieri¹, V. Sapunenko¹, V. Vagnoni², C. Vistoli¹, R. Zappi¹
¹ INFN CNAF, Bologna  ² INFN, Bologna Division
SC09, Portland, November 17, 2009

INFN – National Institute for Nuclear Physics
Founded in 1951, INFN is an organization dedicated to the study of the fundamental constituents of matter
It conducts theoretical and experimental research in subnuclear, nuclear, and astroparticle physics
Fundamental research in these areas requires cutting-edge technologies and instrumentation, which INFN develops both in its own laboratories and in collaboration with industry, in close contact with the academic world

INFN today
About employees, university associates and students
4 National Laboratories: LNL, LNGS, LNF and LNS
19 Divisions located at Physics Departments of Italian Universities
1 National Computing Centre (Tier-1): CNAF, Bologna
8 peripheral computing centres (Tier-2s)
More than users routinely accessing computing services

The Large Hadron Collider (LHC)
The largest particle accelerator ever built: a 27 km underground circular tunnel equipped with cryogenic superconducting dipole magnets operating at a temperature of 1.9 K
Protons at near light velocity collide every 25 ns
Every second, billions of particles are created and then observed with dedicated particle detectors
Overall cost of the infrastructure: $6 billion
Starting routine operation these days!

LHC Computing Model
Layered hierarchical computing model, from detectors to desktop PCs (passing through large computing farms)
Data custody takes place at the Tier-0 (CERN) and several Tier-1 centres across the US, Europe and Asia
[Diagram: the World-wide LHC Computing Grid (WLCG) tiers: the online system at CERN produces tens of petabytes of data each year; data flows from Tier 0+1 at CERN through the Tier-1 centres down to department (Tier 3) and desktop (Tier 4) resources]

INFN-CNAF computing centre
CNAF is the central computing facility of INFN
Italian Tier-1 computing centre for the LHC experiments ATLAS, CMS, ALICE and LHCb
... but also one of the main processing facilities for other experiments:
High Energy Physics: BaBar (Stanford, CA) and CDF (Fermilab, IL)
Astro and space physics: VIRGO (Italy), ARGO (Tibet), AMS (satellite), PAMELA (satellite) and MAGIC (Canary Islands)

| Year | CPU power, HS06* | Disk space, PB | Tape space, PB |
|      | k                | 2.4 (2.8 RAW)  |                |
|      | k                | 6.8 (8.2 RAW)  | 6.6            |

* HS06 stands for HEP-SPEC06; approximately 4 HS06 = 1 kSI2K

Mass Storage Challenge for the LHC
Long-term (several years) data custody is needed: several PB of data must be archived every year at a Tier-1 centre and kept near-line, to be transparently accessible at any time
LAN and WAN data access: Tier-1 centres have to provide transparent access to online and near-line data files for many thousands of jobs running on the local computing farms
Each Tier-1 centre must sustain incoming/outgoing data flows from/to CERN (Tier-0) and the other Tier-1/Tier-2 centres: many hundreds of simultaneous data transfer streams
Sustained incoming and outgoing aggregated traffic on a Tier-1 MSS can reach the order of several GB/s

Existing MSS solutions in WLCG
Mass Storage Systems presently employed in WLCG Tier-0 and Tier-1 centres (e.g. CASTOR and dCache) are based on a "DAS" model: multiple servers with Direct Attached Storage disks acting as "file servers"
They provide read/write access to local files over the network through custom protocols
Files reside on a single server's direct-attached disks (unless replication is used), i.e. no striping over multiple servers
A centralized "nameserver" keeps track, on a database, of which file server holds each file
Monolithic and very complex products: maintenance and operation have proven to be difficult tasks

Why a new WLCG MSS?
Overcome the limitations of existing DAS-based products: complexity, scalability and stability issues, limited failover capabilities, limited support
Use widely employed, supported and well documented components (either commercial or not) to do the most complicated tasks: do not try to reinvent the wheel
Keep the system modular
Use high-end industry standards in both hardware and software infrastructures: large high-performance SAN devices instead of small DAS boxes; few disk controllers, few points of failure
Simplify administration: it needs to be fully centralized

Building blocks of the new system
Disk-centric system with five fundamental components:
1. GPFS: disk-storage software infrastructure
2. TSM: tape management system
3. StoRM: SRM service
4. GEMSS: StoRM-GPFS-TSM interface
5. GridFTP: WAN data transfers
[Diagram: worker nodes on the LAN and GridFTP servers handling WAN data transfers access data files in GPFS over the SAN/TAN; GEMSS data migration and data recall processes, driven by GPFS ILM, move data files between GPFS and TSM]

Storage Resource Management (SRM) in the WLCG/EGEE Grid
In WLCG/EGEE all interactions between applications and storage systems are mediated by an abstraction layer, the so-called SRM: client applications submitted via the Grid should not be aware of the specific storage implementation installed at a given site
To let applications interact transparently with the backend storage systems (either disk or tape), a common interface has been defined
SRM currently supports several access protocols over LAN (e.g. POSIX, RFIO*, DCAP*) and WAN (Globus GridFTP)
SRM also allows for remote space management of storage areas (an illustrative sketch of the SRM interaction follows)
* POSIX-like network protocols developed in HEP contexts
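As an illustration of this interaction pattern, here is a minimal Python sketch of how a client stages and then accesses a tape-resident file through SRM v2.2. The `SrmClient` class, its method names, the endpoint and the SURL are hypothetical stand-ins (a real client speaks SOAP over https to the site's SRM endpoint); only the operation names cited in the comments (srmBringOnline, srmStatusOfBringOnlineRequest, srmPrepareToGet) are actual SRM v2.2 requests.

```python
import time

class SrmClient:
    """Hypothetical SRM v2.2 client wrapper, for illustration only."""

    def __init__(self, endpoint: str):
        self.endpoint = endpoint

    def bring_online(self, surls: list[str]) -> str:
        """srmBringOnline: ask the storage system to stage the given
        tape-resident files to disk; returns a request token."""
        return "token-0001"  # stub

    def is_online(self, token: str) -> bool:
        """srmStatusOfBringOnlineRequest: poll until staging completes."""
        return True  # stub

    def get_turl(self, surl: str, protocol: str = "gsiftp") -> str:
        """srmPrepareToGet: translate the site-independent SURL into a
        transfer URL (TURL) for the requested access protocol."""
        return surl.replace("srm://", "gsiftp://")  # stub

# Typical flow for a tape-resident file: stage, poll, then transfer.
srm = SrmClient("httpg://srm.example.org:8444/srm/managerv2")  # made-up endpoint
surl = "srm://srm.example.org/atlas/datafile.root"             # made-up SURL
token = srm.bring_online([surl])
while not srm.is_online(token):
    time.sleep(60)  # tape recalls can take minutes; poll, don't busy-wait
turl = srm.get_turl(surl)  # hand this TURL to GridFTP for the actual copy
print(turl)
```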

SRM service for GPFS: StoRM
StoRM is an implementation of the SRM v2.2 protocol, developed at INFN-CNAF
From the beginning it was designed to leverage the advantages of parallel file systems and common POSIX file systems in a Grid environment: it allows GPFS and other parallel file system implementations to be used in a WLCG/EGEE Grid framework, where the availability of SRM services is a mandatory requirement
StoRM has been in production for a couple of years at CNAF and at other Tier-2 centres in Europe, but until recently it supported disk-based storage systems only
It has now been adapted to support a complete HSM solution and is in production with these new features

GPFS at CNAF
GPFS has been chosen at CNAF as the solution for disk-based storage: outstanding I/O performance and stability achieved
A large GPFS installation has been in production at CNAF since 2005, with increasing disk space and number of users
At present, 2 PB of net disk space (> 6 PB in Q2 2010), partitioned into several GPFS clusters, with 150 disk servers (NSD + GridFTP) connected to the SAN
Very positive experience so far: 1 FTE employed to manage the full system, no disruptive events, very satisfied users

File migrations from GPFS to TSM
Data migration from GPFS to TSM has been implemented using standard GPFS features
The GPFS ILM engine performs metadata scans to produce the list of files eligible for migration
ILM triggers the startup of GEMSS data migrator processes on a set of dedicated nodes
GEMSS migrators in turn invoke HSM-client TSM commands to perform the file transfers to tape; files belonging to different datasets are migrated to different TSM storage pools (a sketch of this flow follows below)
When the file system occupancy exceeds a (configurable) threshold, ILM triggers a GEMSS garbage collector process: the contents of files already copied to tape are removed from disk in order to bring the occupancy down to the desired value
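The migration flow can be condensed into a short Python sketch. Everything here is illustrative: the format of the ILM-produced candidate list, the dataset-from-path rule, the pool routing and the file paths are assumptions; `dsmmigrate` is the standard TSM HSM client command for migrating files, though the exact invocation GEMSS uses is not shown in the slides.

```python
import subprocess
from collections import defaultdict
from pathlib import Path

def dataset_of(path: str) -> str:
    """Assumption for illustration: the dataset is the directory level
    right below the experiment root, e.g. /gpfs/cms/<dataset>/file.root."""
    parts = Path(path).parts
    return parts[3] if len(parts) > 3 else "default"

def migrate_candidates(candidate_list: str) -> None:
    """Consume the file list produced by the GPFS ILM metadata scan and
    migrate it to tape, one dsmmigrate invocation per dataset so that
    TSM can route each dataset to its own tape storage pool."""
    by_dataset: dict[str, list[str]] = defaultdict(list)
    with open(candidate_list) as fh:
        for line in fh:
            if path := line.strip():
                by_dataset[dataset_of(path)].append(path)
    for dataset, files in sorted(by_dataset.items()):
        subprocess.run(["dsmmigrate", *files], check=True)

if __name__ == "__main__":
    migrate_candidates("/var/run/gemss/migration.candidates")  # made-up path
```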

Introducing file recalls
Efficient recall of files from tape to disk is a complex task compared with migration
While performing bulk recalls, if files are not recalled in a proper order, the large number of tape mount/dismount sequences can lead to unreasonably low performance: optimal strategies for file recalls must take into account how files are distributed on the tapes
However, in a Grid environment users have no way to know how the files of the datasets of interest are stored at a given site: the intelligence must live on the server side

GEMSS selective tape-ordered recalls (I)
Selective tape-ordered recalls have been implemented in GEMSS by means of 4 main commands/processes: gemssEnqueueRecall, gemssMonitor, gemssReorderRecall and gemssProcessRecall
gemssEnqueueRecall is a command used to insert file names to be recalled into a FIFO queue
gemssReorderRecall is a process which fetches files from the queue and builds sorted lists with optimal file ordering
gemssProcessRecall is a process which performs the actual recalls from TSM to GPFS for one tape, by issuing HSM-client TSM commands
gemssMonitor starts one gemssReorderRecall and as many gemssProcessRecall processes as specified in the configuration files (a condensed sketch of this pipeline follows)
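Below is a compact, single-process sketch of the pipeline (the real GEMSS commands are separate cooperating processes). The tape volume and on-tape position of each migrated file, which GEMSS reads from the TSM database, are stubbed out here; `dsmrecall` is the standard TSM HSM recall command; everything else is illustrative.

```python
import subprocess
from collections import defaultdict

def tape_location(path: str) -> tuple[str, int]:
    """Stub: return (tape volume, position on tape) for a migrated file.
    GEMSS obtains this information from the TSM database."""
    raise NotImplementedError("illustrative stub")

def reorder(recall_queue: list[str]) -> dict[str, list[str]]:
    """gemssReorderRecall: group the queued files by tape and sort each
    group by on-tape position, so every tape is read in one sequential pass."""
    by_tape: dict[str, list[tuple[int, str]]] = defaultdict(list)
    for path in recall_queue:
        tape, position = tape_location(path)
        by_tape[tape].append((position, path))
    return {tape: [path for _, path in sorted(entries)]
            for tape, entries in by_tape.items()}

def process_tape(files: list[str]) -> None:
    """gemssProcessRecall: recall one tape's files, already tape-ordered,
    from TSM back into GPFS."""
    subprocess.run(["dsmrecall", *files], check=True)

def monitor(recall_queue: list[str]) -> None:
    """gemssMonitor: hand each tape-ordered file list to a recall worker.
    Shown serially here; the real monitor runs as many gemssProcessRecall
    processes in parallel as the configuration allows."""
    for files in reorder(recall_queue).values():
        process_tape(files)
```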

GEMSS selective tape-ordered recalls (II)
[Diagram: gemssEnqueueRecall pushes file paths into the recall queue (FIFO); gemssReorderRecall fetches them and builds tape-ordered file lists, one per tape (A, B, C, D); gemssMonitor starts the gemssProcessRecall processes, which pull the file lists and recall one tape each]

GEMSS prototype setup
[Diagram: 500 TB GPFS file system served over the SAN; 6 NSD servers (6x2 Gbps on the LAN, 6x4 Gbps on the SAN); 4 GridFTP servers (4x2 Gbps) with a 2x10 Gbps WAN uplink; 3 TSM Storage Agents and HSM clients (HSM STA); TSM server with DB2 database; tape library with 8 tape drives, 1 TB per tape, 1 Gbps per drive; additional 8x4, 3x4 and 20x4 Gbps SAN/TAN links]

Entering the production phase
The largest LHC user at CNAF (the CMS experiment) has been moved from CASTOR to GEMSS
First issue: move the existing tape-resident data to the new system; the migration of 1 PB of data is not a trivial task
In addition, the migration had to be done without interrupting or degrading ordinary CMS production activities
A dedicated tool was developed to keep the CASTOR and GEMSS systems in sync during the migration phase
[Plots: throughput from tape and throughput to tape during the migration]

Performance with bulk recalls
As a validation stress test, recalls corresponding to five days of typical usage by the CMS experiment were submitted in one shot from CERN to CNAF through StoRM
24 TB of data, stored in files randomly spread over 100 tapes, were moved from TSM to GPFS via GEMSS in 19 hours
Up to 6 drives were used for the recalls, while at the same time up to 3 drives were used for migrations of new data
Average throughput: ~400 MB/s
Number of failures: 0
[Plots: GB recalled versus time; throughput from tape and throughput to tape]
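As a quick back-of-the-envelope check (taking 1 TB as 2^40 bytes, an assumption), the quoted average throughput follows directly from the figures on this slide:

```python
# 24 TB recalled from TSM to GPFS in 19 hours.
bytes_recalled = 24 * 2**40
elapsed_seconds = 19 * 3600
print(f"{bytes_recalled / elapsed_seconds / 1e6:.0f} MB/s")  # ~386 MB/s, i.e. ~400 MB/s
```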

Conclusive remarks
A full HSM system based on GPFS and TSM, able to satisfy the requirements of the WLCG experiments operating at the Large Hadron Collider, has been implemented
StoRM, the SRM service for GPFS, has been extended to manage tape support
An interface between GPFS and TSM (GEMSS) has been realized, implementing a high-performance tape recall algorithm
1 PB of tape-resident data owned by the largest HSM user at CNAF (the CMS experiment) has been migrated from CASTOR to the new system without service interruption
All the other experiments are now going to be migrated as well