1
Storage Status Report at the Tier1
Luca dell'Agnello, 14 May 2009
2
Disk storage systems
All systems are interconnected in a SAN:
- 12 FC switches (2 core switches) with 2/4 Gb/s connections
- ~ 200 disk servers
- ~ 2.6 PB raw (~ 2.1 PB net) of disk space
13 EMC/Dell CLARiiON CX3-80 systems (SATA disks) interconnected to the SAN:
- 1 dedicated to databases (FC disks)
- ~ 0.8 GB/s bandwidth, … TB each
- 12 disk servers (2 x 1 Gb/s uplinks + 2 FC4 connections), part configured as gridftp servers if needed, 64-bit OS (see next slides)
1 CX4-960 (SATA disks):
- ~ 2 GB/s bandwidth, 600 TB
Other older hardware (FlexLine, FAStT, etc.) is being progressively phased out:
- No support (part used as cold spares)
- Not suitable for GPFS
3
Tape libraries
New SUN SL8500, in production since July 2008:
- 10000 slots
- 20 T10KB drives (1 TB) in place
- 8 T10KA drives (0.5 TB) still in place
- 4000 tapes on line (4 PB)
"Old" SUN SL5500:
- 1 PB on line, … and 5 LTO drives
- Nearly full, no longer used for writing
- Repack ongoing: ~ 5k 200 GB tapes to be repacked onto ~ 500 GB tapes
4
How experiments use the storage
Several experiments use CNAF storage resources, and they make different use of them:
- Some use almost only the disk storage (e.g. CDF, BABAR)
- Some also use the tape system as an archive for older data (e.g. VIRGO)
- The LHC experiments exploit the functionalities of the HSM system, but in different ways: CMS and ALICE primarily use disk with a tape back-end, while ATLAS and LHCb concentrate their activity on the disk-only storage (see next slide for details)
Standardization over a few storage systems and protocols:
- SRM vs. direct access; file and rfio as LAN protocols
- Gridftp as WAN protocol
- Some other protocols used but not supported (xrootd, bbftp)
5
STORAGE CLASSES
3 classes of service/quality (aka Storage Classes) are defined in WLCG. Present implementation at CNAF of the 3 SCs:
- Disk1 Tape0 (D1T0, or online-replica): GPFS/StoRM
  - Space managed by the VO
  - Mainly LHCb and ATLAS, some usage from CMS and ALICE
- Disk1 Tape1 (D1T1, or online-custodial): GPFS/TSM/StoRM
  - Space managed by the VO (i.e. if the disk is full, the copy fails)
  - Large disk buffer with tape back-end and no garbage collector
  - LHCb only
- Disk0 Tape1 (D0T1, or nearline-replica): CASTOR
  - Space managed by the system: data are migrated to tape and deleted from disk when the staging area is full
  - CMS, LHCb, ATLAS, ALICE
  - GPFS/TSM/StoRM under test
This setup satisfies nearly all WLCG requirements (so far), except:
- Multiple copies in different Storage Areas for a SURL
- Name space orthogonality
6
Background: why StoRM
StoRM is an INFN-developed SRM implementation:
- The I/O burden is on the underlying file system: StoRM relies on the aggregation functionalities provided by the underlying fs
- Designed to support guaranteed space reservation and direct access (native POSIX I/O calls) to the storage, as well as other standard libraries (like RFIO) (see the access sketch below)
- Highly scalable front end
- Development schedule controlled by INFN
- Already in production at CNAF since 2007; in production at ~ 20 sites
- The version now in certification implements all functionalities required by the WLCG MoU (including the addendum)
- Excellent results during the ATLAS 10M files test: > 30 Hz for SRM operations
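A rough illustration of the access model described above, contrasting direct POSIX access to a file hosted on GPFS (the LAN path used by jobs) with the SRM/gridftp path used for WAN transfers; the mount point, SURL and endpoint shown are hypothetical examples, not the actual CNAF names.

# Direct POSIX access from a worker node: the file sits on a GPFS mount,
# so any standard I/O call works on it (no SRM involved on this path)
dd if=/storage/gpfs_atlas/example/data.root of=/dev/null bs=1M   # hypothetical path

# For WAN access the same file is addressed through the StoRM SRM endpoint as a SURL,
# e.g. srm://storm.example.org:8444/srm/managerv2?SFN=/atlas/example/data.root (hypothetical),
# and the data movement negotiated via SRM is then carried out over gridftp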
7
Background: why GPFS
Main features:
- High I/O performance vs. other file systems (Xrootd, dCache, CASTOR)
- All servers access all disks
- The failure of a single server out of N only reduces the available bandwidth to storage to (N-1)/N of the total
- Bandwidth to the disks can easily be increased to 8 Gb/s per server (with 2 dual-channel FC2 HBAs or with 1 dual-channel FC4 HBA)
- Very fast access to files (no db overhead as in CASTOR or dCache)
- IBM support and upgrades paid for 3 years (until mid 2011); unlimited number of licenses for INFN during this period, plus a large amount of permanent licenses
GPFS validation tests (2007): 6 servers
ATLAS reprocessing (December 2008): 0.11 s/evt, 0.76 s/file
LHCb analysis challenge (see the May GDB)
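A worked example of the redundancy argument, using illustrative numbers consistent with the figures above (N = 6 servers, 8 Gb/s each): aggregate bandwidth ~ 6 x 8 Gb/s = 48 Gb/s; if one server fails, 5 x 8 Gb/s = 40 Gb/s remain, i.e. the cluster keeps (N-1)/N = 5/6 of its bandwidth and no data becomes unreachable, instead of losing access to everything stored behind the failed server.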
8
GPFS-CASTOR fs access comparison
[Diagram comparing a parallel file system (GPFS, Lustre) with a data buffering system (dCache, CASTOR): data files, servers, disks.]
The available bandwidth to any file in GPFS depends on the number of disk servers, while in CASTOR it depends on the number of copies of the file. To increase the bandwidth in CASTOR you need to replicate files onto other servers (hence increasing the used disk space).
9
StoRM/GPFS deployment at CNAF
GPFS in production at CNAF since 2005:
- Good expertise; ~ 1 FTE needed for management
- A success story: ~ 2 PB of total disk space assigned on GPFS, ~ 150 disk servers
- Gridftp access via ad-hoc servers
- Almost complete decoupling of the VOs: ~ 10 GPFS clusters (to ease management)
  - One for each main "user" (ATLAS, BABAR, CDF, LHCb, CMS)
  - 1 including WNs/UIs only (no storage)
  - 1 dedicated to sw installation (no storage)
- The StoRM endpoint is also included in the GPFS cluster (if needed)
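For reference, these are the standard GPFS administration commands an operator would use to inspect such a deployment; a sketch only, with gpfs_atlas as a hypothetical file system name.

# Show cluster membership and the configuration servers
mmlscluster
# List the NSDs (disks) and the servers that export them to the cluster
mmlsnsd
# Show capacity and occupancy per storage pool for one file system
mmdf gpfs_atlas
# Show which nodes currently have the file system mounted
mmlsmount gpfs_atlas -L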
10
Gridftp validation tests
Since last November all gridftp servers have been moved to a new setup:
- 64-bit OS (effective use of the available RAM)
- Direct interconnection to the SAN (dual-channel FC4)
- Optimization of the network links (GPFS accessed directly); 1 Gb/s uplinks (now doubled where possible)
- "Low" CPU load (30-40% at most)
We wished to verify this new setup with the ATLAS throughput test (cancelled):
- 9 CASTOR disk servers x 5 files x 10 streams/file to 1 gridftp server on StoRM
- Read/write GPFS throughput with 10 parallel transfers and 10 streams per transfer
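A transfer of the kind used in these tests can be reproduced with the standard globus-url-copy client; the command below is an illustration only, with hypothetical host names and paths, and assumes a valid grid proxy is already available.

# 10 parallel streams per transfer (-p), verbose performance reporting (-vb)
globus-url-copy -vb -p 10 \
  gsiftp://castor-diskserver.example.org/data/testfile \
  gsiftp://gridftp-storm.example.org/storage/gpfs_test/testfile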
11
Example of real gridftp transfers (to StoRM) by ATLAS
The throughput test was not performed, but occasional massive transfers were observed:
- Up to > 400 MB/s on a single space token (April 9-10, for some hours)
- Throughput obtained with 4 gridftp servers (8 Gb/s to the LAN)
12
Castor@CNAF
CASTOR, the C(ERN) A(dvanced) STOR(age) manager, is an HSM:
- Manages disk cache(s) and data on tertiary storage (tapes)
- Provides a UNIX-like directory hierarchy of file names (e.g. /castor/cnaf/grid/cms)
- Can provide all the SCs (but at CNAF only T1D0)
- Plethora of services running: stager, LSF, MigHunter, vmgr, name service, etc.
- Heavily based on an Oracle db (the status information of the processes is saved there so that the components are stateless)
- Does not use the SAN for data transfers
- Current version installed: 2
- 2 SRM v2.2 endpoints available (1 dedicated to CMS)
- Supported protocols: rfio, gridftp
Still cumbersome to manage:
- Requires frequent interventions in the Oracle db
- Lack of management tools
At present heavily used only by CMS. Pre-staging before reading is strongly advised (see the example below).
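A minimal pre-staging sketch with the CASTOR 2 stager client, assuming the stager_get/stager_qry commands of that release; the stager host, service class and file name are hypothetical examples.

# Point the client at the stager instance and service class (illustrative values)
export STAGE_HOST=castor-stager.example.org
export STAGE_SVCCLASS=cmswanin
# Ask CASTOR to recall the file from tape to the disk cache before jobs read it
stager_get -M /castor/cnaf/grid/cms/example/file.root
# Later, check whether the file is already on disk or still being staged in
stager_qry -M /castor/cnaf/grid/cms/example/file.root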
13
Why GPFS & TSM
Tivoli Storage Manager (also developed by IBM) is a tape-oriented storage manager, widely used (also in the HEP world, e.g. at FZK).
Built-in functionality is present in both products to implement backup and archiving from GPFS.
The development of an HSM solution is based on the combination of features of GPFS (since v3.2) and TSM (since v5.5):
- Since GPFS v3.2 the new concept of "external storage pool" extends the use of policy-driven Information Lifecycle Management (ILM) to tape storage
- External pools are really interfaces to external storage managers, e.g. HPSS or TSM
- HPSS is very complex (no benefits in this sense with respect to CASTOR)
14
TSM
TSM includes the following components:
- Server: provides backup, archive and space management services to the clients. The TSM server uses a database to track information about server storage, clients, client data, policy and schedules.
- Client
- Storage Agent: enables LAN-free data movement for client operations
- Hierarchical Storage Management (HSM): provides space management services for workstations. TSM for Space Management automatically migrates files that are less frequently used to server storage, freeing space on disk.
Already in production at CNAF since CCRC'08 for LHCb (D1T1 implementation); the typical client commands are sketched below.
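From a GPFS node managed by TSM for Space Management, the HSM functions are driven with the dsm* client commands; the sketch below shows typical calls with a hypothetical file path.

# Migrate a file to the TSM server (tape), leaving a stub on disk
dsmmigrate /storage/gpfs_lhcb/example/file.dst
# Show the HSM state of the file (resident, premigrated or migrated)
dsmls /storage/gpfs_lhcb/example/file.dst
# Bring a migrated file back to disk
dsmrecall /storage/gpfs_lhcb/example/file.dst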
15
GPFS & TSM: how it works
- GPFS performs file system metadata scans according to ILM policies specified by the administrators
  - The metadata scan is very fast (it is not a find…) and is used by GPFS to identify the files which need to be migrated to tape
  - It is possible to use Extended Attributes
- Once the list of files is obtained, it is passed to an external process which runs on the HSM nodes and actually performs the migration to TSM
  - This is in particular what we implemented
- Recalls can be done by passing a list of files to TSM; this list will be tape-ordered by TSM (see the sketch below)
- GPFS and the HSM nodes are completely decoupled: it is possible to shut down the HSM nodes without interrupting file system availability
- All components of the system have intrinsic redundancy (GPFS failover mechanisms). No HA features need to be put in place, apart from the unique TSM server with its internal db
  - Backup and failover of the TSM db have been tested
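As an illustration of the two paths, the sketch below shows a policy scan launched by hand and a tape-ordered recall from a file list. The policy file name and the dsmrecall file-list option are assumptions for the sake of the example, not the exact production setup; the node names are those of the HSM nodes in the configuration example of the backup slides.

# Migration path: run the ILM policy scan on the HSM nodes; matching files are handed
# to the external-pool interface script, which drives the migration to TSM
mmapplypolicy /storage/gpfs_lhcb -P /var/mmfs/etc/policy.rules -N diskserv-san-14,diskserv-san-16

# Recall path: hand TSM a plain list of files to recall; TSM reorders the list
# by tape and position on tape before mounting anything
dsmrecall -filelist=/var/tmp/recall.list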
16
GPFS/TSM (pre-)migrations
File states:
- Resident: the file copy is only on disk
- Pre-migrated: a copy is also present on tape; the IBMPmig Extended Attribute is added
- Migrated: migration performs garbage collection of pre-migrated files; the file on disk is replaced by a stub file (i.e. a pointer to TSM) and the IBMObj EA is added
Policies contain RULEs that create a list of files for (pre-)migration from GPFS to an external storage pool:
RULE EXTERNAL POOL 'PoolName' EXEC 'InterfaceScript' [OPTS 'options']
tsmigrate checks the thresholds for (pre-)migration in order to apply the policies. When a threshold is reached, tsmigrate triggers the dmstartpolicy process, which in turn executes the admin-provided script startpolicy; if startpolicy agrees, dmstartpolicy invokes dmapplypolicy (see the interface-script skeleton below).
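When the policy engine selects files for an external pool, it invokes the configured InterfaceScript with an operation keyword and the generated file list. The skeleton below is a hypothetical illustration of that calling convention (operation names, argument order, the file-list option and the exact format of the generated list should be checked against the documentation of the installed GPFS and TSM releases); the production hsmControl script referenced in the backup slides is of course more elaborate.

#!/bin/bash
# Hypothetical skeleton of a GPFS external-pool interface script
OPERATION=$1   # e.g. TEST, LIST, MIGRATE, RECALL, PURGE
FILELIST=$2    # policy-generated list of candidate files
OPTS=$3        # the OPTS string from the EXTERNAL POOL rule (here: the space token)

case "$OPERATION" in
  TEST)
    exit 0 ;;                       # tell GPFS this pool is usable
  MIGRATE|RECALL)
    # The policy-generated list carries bookkeeping fields before each path;
    # keep only the path column (field separator assumed) and hand it to the
    # TSM HSM client (file-list option assumed, as noted above)
    awk -F ' -- ' '{print $2}' "$FILELIST" > "${FILELIST}.paths"
    if [ "$OPERATION" = "MIGRATE" ]; then
      dsmmigrate -filelist="${FILELIST}.paths"
    else
      dsmrecall -filelist="${FILELIST}.paths"
    fi ;;
  *)
    exit 0 ;;                       # other operations ignored in this sketch
esac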
17
Storage class D1T1
- The D1T1 prototype in GPFS/TSM has been in production since May 2008
- Quite simple (no competition between migrations and recalls): D1T1 requires that every file written to disk is copied to tape (and remains resident on disk), so recalls are needed only in case of data loss (on disk)
- Some adjustments were needed in StoRM, basically to place a file on hold for migration until the write operation is completed (SRM "putDone" on the file)
  - The next "release" will use EAs (see the sketch below)
[Plot: net throughput to tape versus time. 3 LTO-2 drives used; about 70 MiB/s on average, with peaks up to 90 MiB/s.]
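A hypothetical illustration of how the "hold until putDone" logic could be expressed once EAs are used: an EXCLUDE rule that skips files for which StoRM has not yet set a completion attribute. The attribute name is invented for the example, the policy file path is the same assumption as in the earlier sketch, and the XATTR() policy function is only available in sufficiently recent GPFS releases, so this is a sketch of the idea rather than the actual rule.

# Append a hypothetical rule to the policy file: files not yet marked complete by
# StoRM (missing 'user.storm.putdone' EA, invented name) are excluded from migration
cat >> /var/mmfs/etc/policy.rules <<'EOF'
RULE 'exclude files not yet putDone' EXCLUDE
  WHERE XATTR('user.storm.putdone') IS NULL
EOF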
18
StoRM-GPFS-TSM integration: a light approach for T1D0
The aim is to reproduce the features of the present setup (i.e. CASTOR + StoRM).
Assumptions:
- Space tokens (if used) are not independent from the SURL
- No multiple copies for the same SURL: files are recalled directly to the original space token
- Different read and write buffers (requirement by ATLAS)
- A queuing mechanism for the recalls is present
Below the SRM layer, migrations and recalls are decoupled:
- Migrations are GPFS-driven
  - GPFS scans the fs every N (tunable) minutes to find files eligible for pre-migration (as in the D1T1 case) and creates a list of files to be migrated
  - StoRM must add Extended Attributes to the files (e.g. "file can be copied to tape", "file must be migrated to tape")
  - Garbage collection is GPFS-driven; it uses space-occupancy thresholds and pin lifetime expirations
- Recalls are TSM-driven
  - TSM reads a list of files and performs a "tape optimized" recall
  - A queuing mechanism is needed to create the list (see the sketch below)
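A minimal sketch of such a queuing mechanism, assuming recall requests arrive as files in a spool directory with one path per line, and that the TSM client accepts a file list for tape-optimized recall (option name assumed, as above); paths and the polling interval are illustrative.

#!/bin/bash
# Hypothetical recall queue: merge pending requests into one batch and hand it to TSM,
# which reorders the batch by tape and position on tape before recalling
SPOOL=/var/spool/recall          # one request file per srmBringOnline call
BATCH=/var/tmp/recall.$$

while true; do
  ls "$SPOOL"/* >/dev/null 2>&1 || { sleep 60; continue; }
  cat "$SPOOL"/* > "$BATCH" && rm -f "$SPOOL"/*
  dsmrecall -filelist="$BATCH"   # tape-ordered recall of the whole batch
  rm -f "$BATCH"
done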
19
TSM@CNAF test-bed for T1D0
Tests with a beta version of the TSM 5.5 client show the feasibility of "optimized" recalls, but a queuing mechanism is needed (not foreseen in TSM for now).
Scalability and stress tests for TSM are still needed.
The new version (6.1) is required to support T10K drives:
- Major release with new characteristics (e.g. DB2)
- Functionality tests positive
- Results expected by the end of May
StoRM development needed: ~ 14 person-weeks
20
Summary & next steps
The next step is STEP:
- WLCG stress test focused on verifying the sustainability of the tape systems at the T1s
- Too early to use TSM (no SRM implementation yet): STEP will test CASTOR, but afterwards it will be too late for TSM
- The aim of our stress and scalability tests is to mimic STEP (without the SRM layer!)
Supporting both CASTOR and GPFS/StoRM is not an option in the long term; GPFS/StoRM appears to be the right choice.
21
Backup slides
22
Example of an ILM policy
/* Policy implementing T1D1 for LHCb:
   -) 1 GPFS storage pool
   -) 1 SRM space token: LHCb_M-DST
   -) 1 TSM management class
   -) 1 TSM storage pool */

/* Placement policy rules */
RULE 'DATA1' SET POOL 'data1' LIMIT (99)
RULE 'DATA2' SET POOL 'data2' LIMIT (99)
RULE 'DEFAULT' SET POOL 'system'

/* We have 1 space token: LHCb_M-DST. Define 1 external pool accordingly. */
RULE EXTERNAL POOL 'TAPE MIGRATION LHCb_M-DST'
  EXEC '/var/mmfs/etc/hsmControl' OPTS 'LHCb_M-DST'

/* Exclude from migration hidden directories (e.g. .SpaceMan), baby files, hidden and weird files. */
RULE 'exclude hidden directories' EXCLUDE WHERE PATH_NAME LIKE '%/.%'
RULE 'exclude hidden files' EXCLUDE WHERE NAME LIKE '.%'
RULE 'exclude empty files' EXCLUDE WHERE FILE_SIZE=0
RULE 'exclude baby files' EXCLUDE WHERE (CURRENT_TIMESTAMP-MODIFICATION_TIME)<INTERVAL '3' MINUTE
23
Example of an ILM policy (cont.)
/* Migrate to the external pool according to space token (i.e. fileset). */
RULE 'migrate from system to tape LHCb_M-DST' MIGRATE FROM POOL 'system'
  THRESHOLD(0,100,0)
  WEIGHT(CURRENT_TIMESTAMP-ACCESS_TIME)
  TO POOL 'TAPE MIGRATION LHCb_M-DST'
  FOR FILESET('LHCb_M-DST')
RULE 'migrate from data1 to tape LHCb_M-DST' MIGRATE FROM POOL 'data1' THRESHOLD(0,100,0)
RULE 'migrate from data2 to tape LHCb_M-DST' MIGRATE FROM POOL 'data2' THRESHOLD(0,100,0)

THRESHOLD(HighPercentage[,LowPercentage[,PremigratePercentage]])
Used with the FROM POOL clause to control migration and deletion based on the pool storage utilization (percentage of the assigned storage occupied).
- HighPercentage: the rule is applied only if the occupancy percentage of the named pool is greater than or equal to this value.
- LowPercentage: MIGRATE and DELETE rules are applied until the occupancy percentage of the named pool is reduced to less than or equal to this value. The default is 0%.
- PremigratePercentage: defines an occupancy percentage of a storage pool that is below the lower limit. Files that lie between LowPercentage and PremigratePercentage are copied and become dual-resident in both the internal GPFS storage pool and the designated external storage pool. This option allows the system to free up space quickly by simply deleting pre-migrated files if the pool becomes full. Specify a nonnegative integer in the range 0 to LowPercentage. The default is the same value as LowPercentage.
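For instance (an illustrative reading of the syntax above, not a rule from the production policy), THRESHOLD(90,80,70) would start migrating only when the pool is at least 90% full, migrate until occupancy drops to 80%, and pre-migrate files down to 70% so that space can later be freed quickly by deleting the already pre-migrated copies. The THRESHOLD(0,100,0) used in the rules above instead makes the rule always applicable (high limit 0), never purges the disk copies (low limit 100) and pre-migrates everything (pre-migrate limit 0): every file in the fileset gets a tape copy while staying resident on disk, which is exactly the D1T1 behaviour.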
24
Example of a configuration file
# HSM node list (comma separated)
HSMNODES=diskserv-san-14,diskserv-san-16
# system directory path
SVCFS=/storage/gpfs_lhcb/system
# filesystem scan minimum frequency (in sec)
SCANFREQUENCY=1800
# maximum time allowed for a migrate session (in sec)
MIGRATESESSIONTIMEOUT=4800
# maximum number of migrate threads per node
MIGRATETHREADSMAX=30
# number of files for each migrate stream
MIGRATESTREAMNUMFILES=30
# sleep time for lock file check loop
LOCKSLEEPTIME=2
# pin prefix
PINPREFIX=.STORM_T1D1_
# TSM admin user name
TSMID=xxxxx
# TSM admin user password
TSMPASS=xxxxx
# report period (in sec)
REPORTFREQUENCY=86400
# report addresses (comma separated)
# alarm addresses (comma separated)
# alarm delay (in sec)
ALARMDELAY=7200
25
Human resources (storage group)
- Luca dell'Agnello: Primo Tecnologo, 100% FTE, permanent contract
- Pier Paolo Ricci: Tecnologo III liv., 90% FTE, permanent contract
- Vladimir Sapunenko
- Elisabetta Roncheri
- Daniele Gregori: Assegnista (assegno di ricerca), expires in 2011
- Barbara Martelli: Art. 23 (concorsone), expires August 2009
- Stefano dal Pra': Art. 23, stabilizzabile
- Alessandro Cavalli: Tecnico, permanent contract
- Andrea Prosperini: external, project contract, expires end of 2009
26
Current Tier1 plan (by Concezio Bozzi)
The plan assumes data taking in 2008 and 5x10^6 s of machine time in 2009.
- The numbers for Gruppo 2 changed slightly after the September budget meeting
- Virgo's requests will probably decrease soon
2009 tender costs: 2780 k€ (440 k€ CPU, 2100 k€ disk, 240 k€ tape)