Forschungszentrum Karlsruhe in der Helmholtz-Gemeinschaft
Implementation of a reliable and expandable on-line storage for compute clusters
Jos van Wezel
IWR, Forschungszentrum Karlsruhe, Germany
CHEP04
The GridKa centre operates compute clusters for: D0, BaBar, CDF, COMPASS and the LHC experiments (Tier 1 for LHC)
500 dual-CPU nodes, 220 TB disk, 400 TB tape
expected growth to 1.6 PB disk and 4 PB tape in 2008
tape storage via dCache with a Tivoli Storage Manager backend
disk storage via NFS/GPFS
Overview
Storage components at GridKa
Cluster file system implementation
Integration with Linux
On-line storage management
Load balancing
Storage components (1)
IO servers
– dual Xeon 2.4 GHz, 1.5 GB RAM, Broadcom Ethernet
– failover host bus adapter driver (QLogic version 6.01)
– Red Hat 8 kernel on the production cluster
– Red Hat ES 3 (Scientific Linux) on the test cluster
disks and RAID
– 136 GB disks, 10 krpm
– 9 * 10 units of 14 disks: 1260 disks (36 hot spares)
– arranged as RAID-5 volumes of 957 GB (see the capacity sketch below)
– stripe size 256 KB
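The disk counts and volume sizes above can be cross-checked with a few lines of arithmetic. The sketch below assumes 8-disk RAID-5 volumes (7 data + 1 parity), a width the slide does not state; with the quoted 136 GB disks this roughly reproduces the ~957 GB per volume and the total of 1260 disks.

    DISK_GB = 136            # per-disk capacity quoted on the slide
    UNITS = 9 * 10           # 9 storage systems * 10 enclosures, as listed
    DISKS_PER_UNIT = 14
    HOT_SPARES = 36
    RAID5_WIDTH = 8          # assumption: 7 data + 1 parity disks per volume

    total_disks = UNITS * DISKS_PER_UNIT         # 1260, matching the slide
    data_disks = total_disks - HOT_SPARES        # 1224 disks left for RAID volumes
    volumes = data_disks // RAID5_WIDTH          # 153 RAID-5 volumes
    gb_per_volume = (RAID5_WIDTH - 1) * DISK_GB  # ~952 GB; the slide quotes 957 GB

    print(total_disks, volumes, gb_per_volume)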
Storage components (2)
disk controllers (IBM FAStT700)
– to disks: 9 * 4 independent 2 Gb FC connections
– to servers: 9 * 4 independent 2 Gb FC connections
– a reset or failure of (access to) one controller is handled without service interruption
parallel cluster file system (GPFS)
– each node of the storage cluster sees every disk
– a partition is striped over two or more RAID volumes (see the striping sketch below)
– file systems are exported via NFS
– maximum size of a single LUN is 1 TB
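To illustrate the striping mentioned above, the following conceptual sketch spreads the 256 KB blocks of a file round-robin over a set of RAID volumes. It only shows the idea; GPFS uses its own block allocator and metadata layout, and the volume names are hypothetical.

    STRIPE = 256 * 1024     # stripe size from the previous slide, in bytes

    def block_placement(file_size, volumes):
        """Return (block index, volume) pairs for a file striped round-robin."""
        n_blocks = (file_size + STRIPE - 1) // STRIPE
        return [(b, volumes[b % len(volumes)]) for b in range(n_blocks)]

    # Example: a 1 MB file spread over two RAID volumes
    for block, vol in block_placement(1024 * 1024, ["raid_volume_A", "raid_volume_B"]):
        print(f"block {block} -> {vol}")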
Cluster to storage connection
[Diagram: worker nodes mount storage over NFS via Ethernet from the GPFS cluster, which reaches the disk collection through a Fibre Channel switch (SAN).]
Linux parts
SCSI driver
– allows hot-adding of disks/LUNs (see the sketch below)
– no fixed relation between LUN ID and SCSI numbering; the HBAs support persistent binding
Fibre Channel driver
– the failover driver selects a functional path
– maximum number of LUNs on the QLogic FC HBA is 128
NFS server and NFS client
– server side optimized, client side at defaults
autofs and program maps
– version (autofs4)
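On the 2.4-series kernels used on the production cluster, hot-adding a LUN is typically requested through /proc/scsi/scsi. The minimal sketch below shows this; the host/channel/id/lun values are placeholders (the slide only states that hot-adding is supported), and the write requires root privileges.

    def add_scsi_lun(host, channel, scsi_id, lun):
        """Ask the 2.4-series SCSI midlayer to probe one new device."""
        with open("/proc/scsi/scsi", "w") as proc:
            proc.write(f"scsi add-single-device {host} {channel} {scsi_id} {lun}\n")

    # Example call with placeholder addressing values:
    # add_scsi_lun(0, 0, 1, 5)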
Maintenance and management
Disk storage supports:
– global hot spares
– on-line replaceable parts: controllers (incl. firmware), batteries, power supplies
– background disk scrubbing
The LVM layer of GPFS allows for:
– on-line replacement of volumes
– expansion of file systems
– on-line rebalancing after expansion (see the sketch below)
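An expansion followed by an on-line rebalance could be scripted along the lines of the sketch below. mmadddisk and mmrestripefs are standard GPFS administration commands, but the file system name and the disk-descriptor string are placeholders whose exact format depends on the GPFS release in use.

    import subprocess

    def expand_and_rebalance(fs_device, disk_descriptor):
        # Add the new volume (NSD) to the mounted file system...
        subprocess.run(["mmadddisk", fs_device, disk_descriptor], check=True)
        # ...then restripe existing data evenly over all volumes while on-line.
        subprocess.run(["mmrestripefs", fs_device, "-b"], check=True)

    # Example with placeholder device and descriptor names:
    # expand_and_rebalance("gpfs1", "gpfs42nsd:::dataAndMetadata:4")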
Storage load balancing
At the file system level
– data transfers are striped over several RAID volumes
– storage is re-balanced on-line after expansion
At the server level
– clients select servers at random
– combination of autofs and DNS
– selection criteria can be introduced (server capacity, service groups); see the program-map sketch below
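A hedged sketch of how the random selection could be realised with an autofs program (executable) map follows: autofs calls the map with the requested key as its argument and expects one map entry on stdout. The server names, capacity weights, mount options and export path below are hypothetical and not taken from the slide.

    #!/usr/bin/env python
    import random
    import sys

    # Hypothetical file servers with capacity weights (a simple selection criterion)
    SERVERS = {"nfs1": 2, "nfs2": 2, "nfs3": 1}

    def pick_server():
        # Weighted random choice: servers with more capacity are offered more often
        pool = [host for host, weight in SERVERS.items() for _ in range(weight)]
        return random.choice(pool)

    if __name__ == "__main__":
        key = sys.argv[1]        # autofs passes the requested map key as argument
        print(f"-fstype=nfs,hard,intr {pick_server()}:/export/{key}")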
Server level load balancing
[Plots: read and write activity of the last 24 hours summed over all file servers; read activity of the production file servers.]
Presented solution benefits
scalable size (4 PB) and large (15 TB) file spaces
scalable performance (100 MB/s per server on a single Gigabit Ethernet link)
native OS syscall API, so no application code changes are needed
on-line replaceable components reduce downtime
on-line storage expansion
dynamic load balancing
server load policies allow for different server hardware
native Linux components on the clients
Work to do
get GPFS/NFS working on Red Hat ES 3.0
integrate dCache into the existing storage environment
get data challenges to CERN and peer Tier-1 centres rolling
start experimenting with SATA
connect the NFS servers with a second Ethernet link via Ethernet bonding
introduce load policies
Thank you
Thanks to colleagues from the GIS, GES and DASI departments at the Institute for Scientific Computing (IWR), Karlsruhe