BNL dCache Status and Plan CHEP07: September 2-7, 2007 Zhenping (Jane) Liu for the BNL RACF Storage Group
Outline dCache system instances at BNL RACF PHENIX (RHIC) dCache System USATLAS Production dCache System Architecture Network interface Servers Transfer statistics dCache Monitoring Issues Current upgrade activities and further plans
BNL dCache system instances USATLAS production dCache PHENIX production dCache SRM 2.2 dCache testbed OSG dCache testbed
PHENIX production dCache 450 pools, 565 TB storage, 720K files on disk (212 TB). Currently used as the end repository and archiving mechanism for the PHENIX data production stream. dccp is the primary transfer mechanism within the PHENIX Anatrain; SRM is used for offsite transfers, e.g., the recent data transfer to IN2P3 Lyon (see the sketch below).
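The two transfer paths could look roughly like the following minimal Python sketch. The door hostnames, ports and PNFS paths are hypothetical placeholders, not the actual BNL configuration; only the dccp and srmcp command forms follow standard dCache client usage.

```python
# Illustrative sketch only: hostnames, ports and PNFS paths below are
# hypothetical, not the actual BNL door names or namespace layout.
import subprocess

def dccp_copy(pnfs_path, local_path):
    """On-site copy through a dCap door, as used within the PHENIX Anatrain."""
    src = "dcap://dcap-door.example.bnl.gov:22125" + pnfs_path
    subprocess.check_call(["dccp", src, local_path])

def srm_copy(pnfs_path, remote_surl):
    """Off-site transfer via SRM, e.g. to another site's storage element."""
    src = "srm://srm-door.example.bnl.gov:8443" + pnfs_path
    subprocess.check_call(["srmcp", src, remote_surl])

if __name__ == "__main__":
    dccp_copy("/pnfs/example/phenix/run7/some_file.root", "/tmp/some_file.root")
    srm_copy("/pnfs/example/phenix/run7/some_file.root",
             "srm://srm.example-remote-site.fr:8443/pnfs/example/phenix/some_file.root")
```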
USATLAS Production dCache USATLAS Tier1 dCache deployed for production usage since Oct. 2004; it has also participated in a series of Service Challenges since then. Large-scale, grid-enabled, distributed disk storage system: 582 nodes in total (15 Core Servers, 555 Read Servers, 12 Write Servers). dCache PNFS Name Space 904 TB (Production TB, SC TB) as of end of May 2007; Disk Pool Space: 762 TB. Grid-enabled (SRM, GSIFTP) Storage Element. HPSS as back-end tape system, with efficient and optimized tape data access (Oak Ridge Batch System). Low-cost, locally-mounted disk space on the computing farm as read pool disk space; dedicated write pool servers. GFTP doors as adapters for grid traffic: all grid traffic should go over GFTP doors, but this does not yet work that way for all transfer scenarios.
USATLAS dCache architecture (diagram): HPSS back end accessed via the Oak Ridge Batch system; write pool, 12 nodes; read pool, 555 nodes; SRM door, 2 nodes (SRM/SRMDB); gridFTP doors, 8 nodes with dual NICs; other dCache core services, 5 nodes (admin/pnfs/slony/maintenance/dCap). Required bandwidths (~50 MB/s to ~550 MB/s) are indicated for traffic through the BNL firewall to/from CERN, other Tier1s, Tier2s and other sites.
dCache servers Core servers and components running: PNFS node: PnfsManager, dir, pnfs, PNFS DB, Slony; PNFS backup node: Slony; Admin node: Admin, LocationManager, PoolManager, AdminDoor; Maintenance node: InfoProvider, statistics; SRM door node: SRM, Utility; SRM DB node: SRM DB, Utility DB; GridFTP door node: GFTP door; DCap door node: DCap. CPU, memory and OS: PNFS, Slony, admin, maintenance, SRM, SRM DB nodes (just upgraded): 4-core CPU, 8 GB memory; SAS disks for servers running a DB (PNFS, Slony, SRM DB, maintenance); SATA for critical servers without a DB (admin, SRM). OS: RHEL 4, 64-bit; 32-bit PNFS, 64-bit applications for the others.
dCache servers (Cont.) GridFTP door nodes, DCap door node: 2-core CPU, 4 GB memory; OS: RHEL 4, 32-bit; 32-bit dCache application. Write servers (CPU, memory, OS, file system and disk): 2-core CPU, 4 GB memory; OS: RHEL 4, 32-bit; 32-bit dCache application; XFS file system; software RAID; SCSI disks. Read servers (CPU, memory, OS, file system and disk): running on worker nodes, so CPU and memory vary; OS: SL4, 32-bit; 32-bit dCache application; EXT3 file system; read pool space varies.
Transfer Statistics (2007 Jan-Jun)
ATLAS data volume at BNL RACF (almost all of the data is in dCache)
dCache Monitoring Ganglia: load, network, memory usage, disk I/O, etc. Nagios: disk full or nearly full; node crash and disk failure; dCache cell offline, pool space usage, restore request status; dCache probe (internal/external; dccp/globus-url-copy/srmcp; see the sketch below); check whether dCache processes are listening on the correct ports; host certificate expiration, CRL expiration. Monitoring scripts: Oak Ridge Batch System monitoring tool; check log files for signs of trouble; monitor dCache java processes; health monitoring and automatic service restart when needed. Others: off-hour operation; system administrator paging.
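A minimal probe along these lines could look like the sketch below. The door hostnames, ports, PNFS test path and certificate location are assumptions, not the actual BNL monitoring configuration; only the dccp and openssl invocations follow standard client usage.

```python
# Minimal probe sketch, not the actual BNL monitoring scripts.
# Hostnames, ports, PNFS test path and certificate location are assumptions.
import socket
import subprocess

DOORS = {
    "dcap":   ("dcap-door.example.bnl.gov", 22125),
    "gsiftp": ("gftp-door.example.bnl.gov", 2811),
    "srm":    ("srm-door.example.bnl.gov", 8443),
}

def port_listening(host, port, timeout=5):
    """Check that a dCache door is accepting TCP connections."""
    try:
        socket.create_connection((host, port), timeout).close()
        return True
    except OSError:
        return False

def dccp_probe(pnfs_test_file, local_copy="/tmp/dcache_probe.tmp"):
    """End-to-end read probe: copy a small test file through the dCap door."""
    host, port = DOORS["dcap"]
    src = "dcap://%s:%d%s" % (host, port, pnfs_test_file)
    return subprocess.call(["dccp", src, local_copy]) == 0

def cert_valid_two_weeks(cert_path="/etc/grid-security/hostcert.pem"):
    """Warn before the host certificate expires (relies on the openssl CLI)."""
    # 'openssl x509 -checkend N' exits 0 if the cert is still valid in N seconds.
    two_weeks = 14 * 24 * 3600
    rc = subprocess.call(["openssl", "x509", "-checkend", str(two_weeks),
                          "-noout", "-in", cert_path])
    return rc == 0

if __name__ == "__main__":
    for name, (host, port) in DOORS.items():
        print(name, "door listening:", port_listening(host, port))
    print("dccp probe ok:", dccp_probe("/pnfs/example/usatlas/probe/testfile"))
    print("host cert valid for >= 2 weeks:", cert_valid_two_weeks())
```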
Issues PNFS bottleneck: hardware improvement; Chimera deployment. SRM performance issue / SRM bottleneck: software improvement; hardware improvement; SRM DB and SRM separated. High load on write pool nodes, with poor data I/O when handling concurrent reads and writes: better hardware needed. High load on GFTP door nodes: more GFTP doors needed.
Issues (Cont.) Heavy maintenance workload: more automatic monitoring and maintenance tools needed. The production team requires important data to stay on disk, but this is not always the case yet; need to "pin" that data in read pool disk.
Current upgrade activities and further plans System just upgraded v (SRM improved): DB and dCache applications stay separated; maintenance components moved out of the admin node; Slony as the PNFS replication mechanism, with PNFS routine backup moved from the pnfs node to the slony node; hardware upgraded on most core servers; on most core servers, hardware and OS upgraded to 64-bit and 64-bit dCache applications deployed, except PNFS. Further upgrade plan: adding five Sun Thumpers as write pools (ongoing); based on evaluation results, we expect the write I/O rate limit on each pool node to go from 15 MB/s to at least 100 MB/s (with concurrent inbound and outbound traffic); a rough aggregate estimate follows below. Adding more GFTP doors.
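As a back-of-the-envelope illustration of what the Thumper write pools change, using only the per-node figures quoted above (12 existing write servers at ~15 MB/s, five Thumpers at ≥100 MB/s each); the arithmetic below is an estimate, not a measured result:

```python
# Back-of-the-envelope estimate using only the per-node figures quoted in
# the slides; this is arithmetic, not a measured aggregate.
OLD_WRITE_NODES = 12          # existing write pool servers
OLD_RATE_MBPS   = 15          # observed per-node write I/O limit (MB/s)

THUMPER_NODES   = 5           # Sun Thumpers being added as write pools
THUMPER_RATE    = 100         # expected per-node write rate (MB/s), lower bound

old_aggregate = OLD_WRITE_NODES * OLD_RATE_MBPS        # ~180 MB/s
thumper_aggregate = THUMPER_NODES * THUMPER_RATE       # ~500 MB/s

print("Existing write pools: ~%d MB/s aggregate" % old_aggregate)
print("Five Thumpers alone:  ~%d MB/s aggregate (lower bound)" % thumper_aggregate)
```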
Current upgrade activities and further plans (Cont.) Deploying HoppingManager and transfer pool to "pin" important production data in read pool disk. Tested through. High availability for critical servers like PNFS, admin node, SRM, SRM DB: failover and recovery of stopped or interrupted services. Adding more monitoring packages: SRM watch, FNAL monitoring tool, more from OSG and other sites. Chimera v1.8 evaluation and deployment (a must for BNL): improved file system engine; performance scales with the back-end database implementation (Oracle cluster); scales to the petabyte range. Estimated USATLAS Tier-1 disk capacity: Y ,556 TB; Y ,610 TB; Y ,921 TB; Y ,262 TB; Y ,427 TB. SRM 2.2 deployment.
“Pin” data in read pool disk
SUN Thumper Test Results 150 clients sequentially reading 5 random 1.4 GB files: throughput is 350 MB/s for almost 1 hour. 75 clients sequentially writing 3x1.4 GB files and 75 clients sequentially reading 4x1.4 GB randomly selected files: throughput is 200 MB/s write and 100 MB/s read. (A sketch of such a test client is given below.)
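One way a single read-test client could be driven is sketched below. This is a reconstruction for illustration only: the door hostname, port, file list and scratch path are hypothetical, and the actual Thumper evaluation harness is not described in the slides beyond the client counts and file sizes above.

```python
# Illustrative reconstruction of one read-test client, not the actual
# harness used in the Thumper evaluation. Door hostname, port and PNFS
# paths are hypothetical; each client sequentially reads a few randomly
# chosen ~1.4 GB files via dccp and reports its own average throughput.
import os
import random
import subprocess
import time

DCAP_DOOR = "dcap://dcap-door.example.bnl.gov:22125"   # assumption
TEST_FILES = ["/pnfs/example/thumper-test/file%03d" % i for i in range(100)]
FILES_PER_CLIENT = 5
LOCAL_SCRATCH = "/tmp/thumper_read_test"

def run_client():
    copied_bytes = 0
    start = time.time()
    for pnfs_path in random.sample(TEST_FILES, FILES_PER_CLIENT):
        subprocess.check_call(["dccp", DCAP_DOOR + pnfs_path, LOCAL_SCRATCH])
        copied_bytes += os.path.getsize(LOCAL_SCRATCH)
        os.remove(LOCAL_SCRATCH)      # avoid filling local scratch space
    elapsed = time.time() - start
    print("client throughput: %.1f MB/s" % (copied_bytes / 1e6 / elapsed))

if __name__ == "__main__":
    # In the test, ~150 such clients ran concurrently on the farm
    # (e.g. submitted as batch jobs); here we just run one.
    run_client()
```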