Experience of Lustre at QMUL


1 Experience of Lustre at QMUL
Alex Martin and Christopher J. Walker, Queen Mary, University of London

2 Why Lustre?
POSIX compliant
High performance
- used on a large fraction of the top supercomputers
- able to stripe files across OSTs if needed (see the sketch below)
Scalable
- performance should scale with the number of OSTs
- tested with 25,000 clients, 450 OSSs (1000 OSTs)
- max file size 2^64 bytes
Free (GPL)
- source available (paid support available)
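A minimal sketch of the striping feature, assuming a client with the filesystem mounted at the hypothetical path /mnt/lustre (the directory name and stripe count are illustrative, not QMUL settings):

    # Spread new files in this directory across 4 OSTs instead of 1.
    lfs setstripe -c 4 /mnt/lustre/atlas/striped
    # Show the layout that files created here will inherit.
    lfs getstripe /mnt/lustre/atlas/striped

Striping mainly helps single large files; for many-file analysis workloads the default single-OST layout is usually sufficient.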

3 QMUL 2008 Lustre Setup
12 OSS (290 TiB), 10GigE
MDS failover pair
Rack switches with 10GigE uplinks
Worker nodes: E5420 (2*GigE), Opteron (GigE), Xeon (GigE)

4 StoRM Architecture
Diagram comparing a traditional SE with StoRM: StoRM provides the SRM interface directly on top of a POSIX filesystem such as Lustre

5 2011 Upgrade Design criteria
Maximise storage provided - needed ~1 PB
Sufficient performance (we also upgraded the core count from 1500 to ~3000)
- goal: run ~3000 ATLAS analysis jobs with high efficiency
- storage bandwidth matches compute bandwidth (see the back-of-envelope check below)
Cost!!!
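An illustrative back-of-envelope check of the bandwidth-matching criterion, using the ~3000 cores above and the 60 storage servers with 4 * 1 GbE each described on the New Hardware slide:

$$ \frac{60 \times 4\,\mathrm{Gb/s}}{3000\ \mathrm{cores}} \approx 80\,\mathrm{Mb/s} \approx 10\,\mathrm{MB/s}\ \text{per job slot} $$

i.e. the aggregate storage network bandwidth works out to roughly 10 MB/s per analysis job slot.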

6 Upgrade Design criteria
Considered both "Fat" servers (36 x 2 TB drives) and "Thin" servers (12 x 2 TB drives)
Similar total cost (including networking)
Chose the "Thin" solution:
- more bandwidth
- more flexibility
- one OST per node (although currently there is a 16 TB ext4 limit)

7 New Hardware
60 * Dell R510:
- 12 * 2 TB SATA disks
- H700 RAID controller
- 12 GB RAM
- 4 * 1 GbE (4 nodes with 10 GbE)
Total ~1.1 PB formatted (integrated with legacy kit to give ~1.4 PB)
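A hedged sketch of how one of these servers could be turned into a Lustre OST (the filesystem name, MGS node and block device below are illustrative, not the actual QMUL configuration):

    # Format the RAID6 virtual disk presented by the H700 as an OST,
    # registered against the MGS/MDS node.
    mkfs.lustre --fsname=lustre01 --ost --mgsnode=mds01@tcp0 /dev/sdb
    # Mounting the device brings the OST online.
    mount -t lustre /dev/sdb /mnt/ost00

Each R510 then serves its local disks to clients over its 4 * 1 GbE links.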

8 Lustre "Brick" (half rack)
HP 2900 switch (legacy): 48 ports (24 storage, 24 compute), 10 Gig uplink (could go to 2)
6 * storage nodes: Dell R510, 4 * GigE, 12 * 2 TB disks (~19 TB RAID6)
12 * compute nodes: 3 * Dell C6100 (each contains 4 motherboards), 2 * GigE
Total of 144 (288) cores and ~110 TB (storage is better coupled to local CNs)
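As a rough consistency check from the figures above, within a brick the storage and compute network bandwidths balance:

$$ 6 \times 4\,\mathrm{GbE} = 24\,\mathrm{Gb/s}\ \text{(storage)} \qquad 12 \times 2\,\mathrm{GbE} = 24\,\mathrm{Gb/s}\ \text{(compute)} $$

while traffic leaving the brick shares the single 10 Gb/s uplink, which is why storage is better coupled to the local CNs.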

9 New QMUL Network
48 * 1 Gig ports per switch: 24 for storage, 24 for CPU

10 The real thing
HEPSPEC06: 2 machines low (power-saving mode)
RAL disk thrashing scripts: 1 backplane failure, 2 disk failures
10 Gig cards in x8 slots

11 RAID6 Storage Performance (R510)
Disk: ~600 MB/s
Performance well matched to the 4 x 1 Gb/s network (see the arithmetic below)
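The match is simple arithmetic:

$$ 4 \times 1\,\mathrm{Gb/s} = 4\,\mathrm{Gb/s} \approx 500\,\mathrm{MB/s} $$

which is close to the ~600 MB/s the RAID6 array delivers, so neither the disks nor the network interfaces are badly over-provisioned.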

12 Lustre Brick performance
Tests using the IOR parallel benchmark: 1-12 client nodes, 4 threads/node (network limit 3 GB/s)
Block performance basically matches the 10 Gbit core network (an example invocation is sketched below)
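A hedged sketch of the kind of IOR run used for such a test; the MPI launcher, task counts, transfer/block sizes and scratch path are illustrative assumptions rather than the exact benchmark configuration:

    # 48 MPI tasks (e.g. 12 client nodes x 4 per node), each writing and then
    # reading back its own file (-F) on the Lustre mount.
    mpirun -np 48 ior -a POSIX -w -r -F -t 1m -b 4g -o /mnt/lustre01/iortest/file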

13 Conclusions
Have successfully deployed a ~1 PB Lustre filesystem using low-cost hardware, with the required performance.
Would scale further with more "Bricks", but would be better if Grid jobs could be localised to a specific "Brick".
Would be better if the storage/CPU could be more closely integrated.

14 Conclusions 2
The storage nodes contain 18% of the CPU cores in the cluster, and we spend a lot of effort networking them to the CPU.
It would be better (and cheaper) if these cores could be used directly for processing the data.
This could be achieved using Lustre pools (or another filesystem such as Hadoop) - see the sketch below.
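A minimal sketch of the Lustre pools idea, assuming a filesystem called lustre01 whose first brick owns OSTs 0-5 (all names are hypothetical): a pool groups a brick's OSTs so that directories used by local jobs allocate their files on that brick's own storage.

    # On the MGS: define a pool covering one brick's OSTs (names are illustrative).
    lctl pool_new lustre01.brick1
    lctl pool_add lustre01.brick1 lustre01-OST[0-5]
    # On a client: new files under this directory are allocated from that pool.
    lfs setstripe --pool brick1 /mnt/lustre01/brick1-data

Combined with job placement, this would let the storage nodes' own cores work mostly against local OSTs.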

15 Lustre Workshop at QMUL, 14 July
Talks available at:
Topics discussed included:
-- Future Lustre support models (Whamcloud)
-- Release schedule
-- Site reports from a number of UK sites with different use cases

16 Future Lustre Support
Oracle development appears to be dead
Whamcloud: company formed July 2010, ~40 employees
-- maintains community assets
-- committed to Open Source Lustre and an Open Development model
-- will provide commercial development and support

17 Lustre Release Plans
1.8.5: available from Oracle
Community release from Whamcloud:
-- 24 TB OST support (important for QM)
-- RHEL 6
-- bugfixes
Major Whamcloud community release: aim to provide a stable 2.x release with performance >= 1.8

18 Lustre Workshop at QMUL, 14 July
Talks available at:

19

20 HammerCloud 718 (WMS)
Events (24h)
155/4490 job failures (3.4%)
Scales well to about ~600 jobs

21 WMS Throughput (HC 582) Scales well to about ~600 jobs

22 Number of machines
2 threads, 1 MB block size: 3.5 GB/s max transfer
Probably limited by the network to the racks used

23 Overview: Design, Network, Hardware, Performance, StoRM, Conclusions

24 Ongoing and Future Work
Need to tune performance
Integrate legacy storage into the new Lustre filestore
Starting to investigate other filesystems, particularly Hadoop

25 Old QMUL Network

