Experience of Lustre at a Tier-2 site Alex Martin + Christopher J. Walker Queen Mary, University of London
Why Lustre?
- POSIX compliant
- High performance
- Scalable: performance should scale with the number of OSTs; tested with 25,000 clients and 450 OSSs (1,000 OSTs); max file size 2^64 bytes
- Able to stripe files if needed
- Used on a large fraction of top supercomputers
- Free (GPL): source available (paid support available)
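The per-file bandwidth benefit of striping comes from spreading a file's blocks round-robin across several OSTs. A toy illustration of that placement logic (not Lustre code; OST names and stripe count are hypothetical):

```python
# Toy round-robin stripe placement: block i of a file striped over a
# list of OSTs lands on OST number (i mod stripe_count).
def ost_for_block(block_index, ost_list):
    return ost_list[block_index % len(ost_list)]

osts = ["OST0", "OST1", "OST2", "OST3"]   # hypothetical stripe count of 4
placement = [ost_for_block(i, osts) for i in range(6)]
# placement -> ["OST0", "OST1", "OST2", "OST3", "OST0", "OST1"]
```

With a stripe count of 4, sequential reads of one large file can draw on four OSTs at once, which is why per-file throughput scales with stripe count.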
QMUL 2008 Lustre Setup
- 12 OSS (290 TiB)
- MDS failover pair
- Rack switches with 10GigE uplinks
- Worker nodes: E5420 (2x GigE), Opteron (GigE), Xeon (GigE)
Throughput scaling (plot: transfer rate vs number of machines)
- 2 threads, 1 MB block size
- 3.5 GB/s max transfer
- Probably limited by the network to the racks used
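A back-of-envelope check is consistent with the network being the bottleneck (illustrative arithmetic only; the number of racks driven in the test is an assumption, not stated on the slide):

```python
# Each rack uplink is 10 GigE = 1.25 GB/s; a handful of rack uplinks
# saturates near the observed 3.5 GB/s aggregate.
uplink_gbs = 10e9 / 8 / 1e9   # one 10 GigE uplink in GB/s
racks = 3                     # hypothetical number of racks in the test
uplink_limit = racks * uplink_gbs
# uplink_limit -> 3.75 GB/s, close to the measured 3.5 GB/s ceiling
```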
StoRM Architecture
(Diagram: traditional SE vs StoRM)
HammerCloud 718 (WMS)
- 369 655 451 events (24 h)
- 155/4490 job failures (3.4%)
- Scales well to ~600 jobs
2011 Upgrade Design Criteria
- Maximise storage provided: need ~1 PB
- Sufficient performance: we also upgraded from ~1500 to ~3000 cores
  - Goal: run ~3000 ATLAS analysis jobs with high efficiency
  - Storage bandwidth matches compute bandwidth
- Cost!!!
Upgrade Design: Fat vs Thin Servers
- Considered both "Fat" servers (36 x 2 TB drives) and "Thin" servers (12 x 2 TB drives)
- Similar total cost (including networking)
- Chose the "Thin" solution:
  - more bandwidth
  - more flexibility
  - one OST per node (although currently there is a 16 TB ext4 limit)
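The bandwidth argument for "Thin" servers can be sketched numerically (a minimal sketch; the NIC count for the "Fat" configuration is an assumption, while the "Thin" R510s have 4 x 1 GbE per the hardware slide):

```python
# Network bandwidth per TB of storage for the two candidate server types.
def gbs_per_tb(nics_1gbe, drives, drive_tb):
    net_gbs = nics_1gbe * 1e9 / 8 / 1e9   # NICs converted to GB/s
    return net_gbs / (drives * drive_tb)

fat  = gbs_per_tb(nics_1gbe=4, drives=36, drive_tb=2)  # NIC count assumed
thin = gbs_per_tb(nics_1gbe=4, drives=12, drive_tb=2)
# Same NICs serving a third of the disk: 3x the bandwidth per TB.
```

With equal total cost and equal total capacity, the thin layout delivers roughly three times the network bandwidth per terabyte, which is the "more bandwidth" point above.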
New Hardware
- 60 x Dell R510: 12 x 2 TB SATA disks, H700 RAID controller, 12 GB RAM, 4 x 1 GbE (4 servers with 10 GbE)
- Total ~1.1 PB formatted (integrate with legacy kit to give ~1.4 PB)
Lustre "Brick" (half rack)
- HP 2900 switch (legacy): 48 ports (24 storage, 24 compute), 10 Gig uplink (could go to 2)
- 6 storage nodes: Dell R510, 4 x GigE, 12 x 2 TB disk (~19 TB RAID6)
- 12 compute nodes: 3 x Dell C6100 (each contains 4 motherboards), 2 x GigE
- Total of 144 (288) cores and ~110 TB per brick (storage is better coupled to local CNs)
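Within a brick the storage and compute bandwidths are deliberately balanced, which simple arithmetic on the figures above confirms:

```python
# Per-brick bandwidth balance: 6 storage nodes x 4 GbE vs
# 12 compute nodes x 2 GbE (figures from the brick slide).
gbe = 1e9 / 8 / 1e9                  # 1 GbE in GB/s = 0.125
storage_bw = 6 * 4 * gbe             # 3.0 GB/s into the storage nodes
compute_bw = 12 * 2 * gbe            # 3.0 GB/s into the compute nodes
# storage_bw == compute_bw: storage matches compute within the brick
```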
Old QMUL Network
New QMUL Network
- 48 x 1 Gig ports per switch: 24 storage, 24 CPU
The Real Thing
- HEPSPEC06 benchmarks
- RAL disk-thrashing scripts
- 2 machines low (power-saving mode)
- 1 backplane failure, 2 disk failures
- 10 Gig cards in x8 slots
RAID6 Storage Performance
- R510 disk throughput ~600 MB/s
- Performance well matched to the 4 x 1 Gb/s network
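The "well matched" claim is easy to check: four bonded GigE links carry slightly less than the measured array rate, so each R510 can keep its network full (illustrative arithmetic):

```python
# 4 x 1 GbE in MB/s vs the measured ~600 MB/s RAID6 sequential rate.
nic_mbs = 4 * 1e9 / 8 / 1e6   # 500 MB/s of network per R510
disk_mbs = 600                # measured RAID6 array throughput
# disk_mbs > nic_mbs: the network, not the array, is the bottleneck
```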
Lustre Brick (+ Rack) Performance
- Preliminary tests using iozone
- 1-24 clients, 8 threads/node
- Network limit: 6 GB/s
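The quoted 6 GB/s network limit matches a full rack of storage nodes (a sketch; the two-bricks-per-rack reading follows from the brick being defined as half a rack):

```python
# A rack holds two "bricks", i.e. 12 storage nodes with 4 x 1 GbE each.
gbe = 0.125                            # 1 GbE in GB/s
bricks_per_rack = 2                    # a brick is half a rack
storage_nodes = 6 * bricks_per_rack
rack_limit = storage_nodes * 4 * gbe   # 6.0 GB/s, the quoted network limit
```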
Ongoing and Future Work
- Need to tune performance
- Integrate legacy storage into the new Lustre filestore
- Starting to investigate other filesystems, particularly Hadoop
Conclusions
- Have successfully deployed a ~1 PB Lustre filesystem with the required performance using low-cost hardware
- Would scale further with more "Bricks", but would be better if Grid jobs could be localised to a specific "Brick"
- Would be better if the storage/CPU could be more closely integrated
Conclusions 2
- The storage nodes contain 18% of the CPU cores in the cluster, and we spend a lot of effort networking them to the compute nodes
- It would be better (and cheaper) if these cores could be used directly for processing the data
- This could be achieved using Lustre pools (or another filesystem such as Hadoop)
WMS Throughput (HC 582)
- Scales well to ~600 jobs
Overview
- Design
- Network
- Hardware
- Performance
- StoRM
- Conclusions