Experience of Lustre at QMUL Alex Martin + Christopher J. Walker Queen Mary, University of London
Why Lustre?
- POSIX compliant
- High performance
- Used on a large fraction of the top supercomputers
- Able to stripe files if needed (see the sketch below)
- Scalable
-- Performance should scale with the number of OSTs
-- Tested with 25,000 clients and 450 OSSs (1000 OSTs)
-- Maximum file size 2^64 bytes
- Free (GPL)
-- Source available (paid support available)
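As an illustration of the striping point, a minimal sketch (assuming a client with the standard lfs tool installed; the mount point and directory are hypothetical) that sets a default stripe count on a directory so new files are spread over several OSTs:

```python
# Minimal sketch: set a default stripe count of 4 on a directory so that new
# files written there are spread across 4 OSTs and can use the aggregate
# bandwidth of several object storage servers.
# The mount point and directory are hypothetical examples.
import subprocess

directory = "/mnt/lustre/atlas/data"   # hypothetical Lustre directory

# -c 4: stripe new files over 4 OSTs
subprocess.run(["lfs", "setstripe", "-c", "4", directory], check=True)

# Show the resulting default layout
subprocess.run(["lfs", "getstripe", directory], check=True)
```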
QMUL 2008 Lustre Setup (diagram)
- 12 OSS (290 TiB)
- MDS failover pair
- Rack switches with 10GigE uplinks
- Worker nodes: E5420 (2 x GigE), Opteron (GigE), Xeon (GigE)
StoRM Architecture
- Diagram comparing a traditional SE with StoRM
2011 Upgrade Design Criteria
- Maximise the storage provided: needed ~1 PB
- Sufficient performance: the compute farm was also upgraded from ~1500 to ~3000 cores
-- Goal: run ~3000 ATLAS analysis jobs with high efficiency
-- Storage bandwidth matches compute bandwidth (rough check below)
- Cost!!!
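A rough back-of-the-envelope check of the "storage bandwidth matches compute bandwidth" criterion. The node and NIC counts come from the hardware slides; treating the server NICs as the only limit (ignoring rack uplinks and disk speed) is a simplification made purely for illustration:

```python
# Back-of-the-envelope check that storage bandwidth roughly matches the farm.
# Node/NIC counts are from the hardware slides; ignoring rack uplinks and disk
# limits is a deliberate simplification for this sketch.
analysis_jobs = 3000            # target concurrent ATLAS analysis jobs
storage_nodes = 60              # Dell R510 OSS nodes in the upgrade
node_net_mb_s = 4 * 1000 / 8    # 4 x 1 GbE per node ~ 500 MB/s

aggregate_mb_s = storage_nodes * node_net_mb_s     # ~30,000 MB/s at the server NICs
per_job_mb_s = aggregate_mb_s / analysis_jobs      # ~10 MB/s available per job

print(f"~{aggregate_mb_s / 1000:.0f} GB/s aggregate, ~{per_job_mb_s:.0f} MB/s per analysis job")
```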
Upgrade Design Criteria
- Considered both "Fat" servers (36 x 2 TB drives) and "Thin" servers (12 x 2 TB drives)
- Similar total cost (including networking)
- Chose the "Thin" solution (see the comparison sketch below):
-- more bandwidth
-- more flexibility
-- one OST per node (although there is currently a 16 TB ext4 limit)
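A hedged sketch of why the "Thin" option gives more aggregate bandwidth for the same raw capacity: the drive counts are from the slide, but the per-server network figures (a single 10 GbE link on a "Fat" server) are assumptions made only for the comparison:

```python
# Illustrative "Fat" vs "Thin" comparison for ~1.4 PB of raw disk.
# Drive counts come from the slide; the fat-server network figure (10 GbE) is
# an assumption for the sake of the comparison, not a quoted specification.
TB_PER_DRIVE = 2
target_raw_tb = 1440   # 60 "Thin" servers' worth of raw disk

for name, drives, net_gbit in [("Fat  (36 drives)", 36, 10),  # assumed 1 x 10 GbE
                               ("Thin (12 drives)", 12, 4)]:  # 4 x 1 GbE as purchased
    servers = target_raw_tb / (drives * TB_PER_DRIVE)
    agg_gbit = servers * net_gbit
    print(f"{name}: {servers:.0f} servers, ~{agg_gbit:.0f} Gbit/s aggregate network")
```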
New Hardware
- 60 x Dell R510
-- 12 x 2 TB SATA disks
-- H700 RAID controller
-- 12 GB RAM
-- 4 x 1 GbE (4 with 10 GbE)
- Total ~1.1 PB formatted (integrated with legacy kit to give ~1.4 PB); capacity arithmetic below
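The quoted ~1.1 PB follows from simple capacity arithmetic; the ~5% filesystem overhead below is an assumed figure, not one given in the talk:

```python
# Capacity arithmetic for the 60 Dell R510s: raw -> RAID6 -> formatted.
# Drive counts are from the slide; the ~5% filesystem overhead is assumed.
servers, drives, tb_per_drive = 60, 12, 2
raw_tb = servers * drives * tb_per_drive          # 1440 TB raw
raid6_tb = servers * (drives - 2) * tb_per_drive  # 2 parity drives per 12-disk array
formatted_tb = raid6_tb * 0.95                    # assumed ldiskfs/ext4 overhead

print(f"raw {raw_tb} TB, after RAID6 {raid6_tb} TB, formatted ~{formatted_tb:.0f} TB (~1.1 PB)")
```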
Lustre “Brick” (half rack)
- HP 2900 switch (legacy): 48 ports (24 storage, 24 compute), 10Gig uplink (could go to 2)
- 6 x storage nodes: Dell R510, 4 x GigE, 12 x 2 TB disks (~19 TB RAID6)
- 12 x compute nodes: 3 x Dell C6100 (each contains 4 motherboards), 2 x GigE
- Total of 144 (288) cores and ~110 TB per brick
- Storage is better coupled to the local compute nodes (bandwidth sketch below)
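The port counts above imply that storage and compute bandwidth are balanced inside a brick while the uplink is much smaller, which is the reason for the "better coupled to local compute nodes" remark; a trivial sketch of that arithmetic:

```python
# Bandwidth balance inside one "brick" (half rack), from the port counts above.
storage_gbit = 6 * 4    # 6 R510s x 4 x 1 GbE         = 24 Gbit/s
compute_gbit = 12 * 2   # 12 compute nodes x 2 x 1 GbE = 24 Gbit/s
uplink_gbit = 10        # single 10 GbE uplink (could be doubled to 20)

print(f"storage {storage_gbit} Gbit/s vs compute {compute_gbit} Gbit/s inside the brick,")
print(f"but only {uplink_gbit} Gbit/s to the rest of the cluster -> keep I/O local where possible")
```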
New QMUL Network
- 48 x 1Gig ports per switch: 24 for storage, 24 for CPU
The real thing
- HEPSPEC06 benchmarks
- RAL disk thrashing scripts
- 2 machines low (power saving mode)
- 1 backplane failure
- 2 disk failures
- 10Gig cards in x8 slots
RAID6 Storage Performance
- R510 disk: ~600 MB/s
- Performance well matched to the 4 x 1 Gb/s network (see below)
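A one-line check of the "well matched" claim (four bonded GigE links carry roughly 500 MB/s):

```python
# Why ~600 MB/s of RAID6 bandwidth is well matched to 4 x 1 GbE.
disk_mb_s = 600              # measured sequential rate of the 12-disk RAID6
net_mb_s = 4 * 1000 / 8      # 4 x 1 Gbit/s links ~ 500 MB/s

print(f"disk ~{disk_mb_s} MB/s vs network ~{net_mb_s:.0f} MB/s: neither side badly over-provisioned")
```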
Lustre Brick Performance
- Tests using IOR parallel benchmarks
- 1-12 client nodes, 4 threads/node (network limit ~3 GB/s, derived below)
- Block performance basically matches the 10 Gbit core network
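The ~3 GB/s network limit is consistent with 12 client nodes each having 2 x 1 GbE; this derivation is an inference from the brick hardware, not a statement from the talk:

```python
# Plausible origin of the ~3 GB/s network limit quoted for the IOR tests
# (an inference from the per-node NIC counts, not a figure from the talk).
clients = 12
nics_per_client = 2                              # compute nodes have 2 x 1 GbE
limit_gb_s = clients * nics_per_client * 1 / 8   # Gbit/s -> GB/s

print(f"{clients} clients x {nics_per_client} x 1 GbE ~ {limit_gb_s:.0f} GB/s aggregate")
```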
Conclusions
- Have successfully deployed a ~1 PB Lustre filesystem using low-cost hardware
- It delivers the required performance
- It would scale further with more “Bricks”, but would work better if Grid jobs could be localised to a specific “Brick”
- It would be better if the storage and CPU could be more closely integrated
Conclusions 2
- The storage nodes contain 18% of the CPU cores in the cluster, and we spend a lot of effort networking them to the compute nodes
- It would be better (and cheaper) if these cores could be used directly for processing the data
- This could be achieved using Lustre pools (sketch below) or another filesystem such as Hadoop
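A hedged sketch of the Lustre pools idea mentioned above: group the OSTs of one brick into a pool and point a directory at it, so jobs running on that brick's compute nodes mostly read local data. The filesystem, pool, OST range and directory names are hypothetical, and the pool commands are run on the MGS:

```python
# Sketch of localising data to one "brick" with Lustre OST pools.
# All names (filesystem, pool, OST range, directory) are hypothetical examples.
import subprocess

fs, pool = "lustre", "brick01"

# Create the pool and add the brick's OSTs to it (run on the MGS node)
subprocess.run(["lctl", "pool_new", f"{fs}.{pool}"], check=True)
subprocess.run(["lctl", "pool_add", f"{fs}.{pool}", f"{fs}-OST[0000-0005]"], check=True)

# New files under this directory will then be allocated only on the pool's OSTs
subprocess.run(["lfs", "setstripe", "--pool", pool, f"/mnt/{fs}/atlas/brick01"], check=True)
```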
Lustre Workshop at QMUL, 14 July
- Talks available at: http://www.esc.qmul.ac.uk/wiki/lustreuserworkshop2011/
- Topics discussed included:
-- Future Lustre support models (Whamcloud)
-- Release schedule
-- Site reports from a number of UK sites with different use cases
Future Lustre Support
- Oracle development appears to be dead
- Whamcloud, a company formed in July 2010, ~40 employees (www.whamcloud.com)
-- maintains community assets
-- committed to open-source Lustre and an open development model
-- will provide commercial development and support
Lustre Release Plans
- 1.8.5: available from Oracle
- 1.8.6: community release from Whamcloud
-- 24 TB OST support (important for QM)
-- RHEL 6
-- bugfixes
- 2.1: major Whamcloud community release
-- aim to provide a stable 2.x release with performance >= 1.8
HammerCloud 718 (WMS)
- 369 655 451 events (24 h)
- 155/4490 job failures (3.4%)
- Scales well to about ~600 jobs
WMS Throughput (HC 582)
- Scales well to about ~600 jobs
Scaling with number of machines
- 2 threads, 1 MB block size
- 3.5 GB/s maximum transfer rate
- Probably limited by the network to the racks used
Overview
- Design
- Network
- Hardware
- Performance
- StoRM
- Conclusions
Ongoing and Future Work
- Need to tune performance
- Integrate legacy storage into the new Lustre filestore
- Starting to investigate other filesystems, particularly Hadoop
Old QMUL Network