Experience of Lustre at QMUL

Experience of Lustre at QMUL
Alex Martin and Christopher J. Walker, Queen Mary, University of London

Why Lustre?
- POSIX compliant
- High performance: used on a large fraction of the top supercomputers; able to stripe files across servers if needed
- Scalable: performance should scale with the number of OSTs; tested with 25,000 clients and 450 OSSs (1000 OSTs); maximum file size 2^64 bytes
- Free (GPL): source available (paid support available)
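Striping is configured with the standard lfs tool; a minimal sketch of striping a directory across several OSTs (the mount point, directory name and stripe count below are assumptions):

```python
# Minimal sketch: stripe new files in a directory over 4 OSTs so one large file
# is served by several storage servers rather than a single one.
import subprocess

LUSTRE_DIR = "/mnt/lustre/atlas/large_files"   # hypothetical directory on the Lustre mount

# 'lfs setstripe -c 4' sets the default stripe count for new files in this directory.
subprocess.run(["lfs", "setstripe", "-c", "4", LUSTRE_DIR], check=True)

# 'lfs getstripe' reports the layout that was actually applied.
result = subprocess.run(["lfs", "getstripe", LUSTRE_DIR],
                        capture_output=True, text=True, check=True)
print(result.stdout)
```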

QMUL 2008 Lustre Setup (diagram): 12 OSS (290 TiB) plus an MDS failover pair, connected at 10GigE; rack switches with 10GigE uplinks to the worker nodes (E5420 with 2 * GigE, Opteron and Xeon with GigE).

StoRM Architecture (diagram comparing a traditional SE with StoRM).

2011 Upgrade Design criteria
- Maximise the storage provided: needed ~1 PB
- Sufficient performance (the number of cores was also being upgraded from 1500 to ~3000)
  - Goal: run ~3000 ATLAS analysis jobs with high efficiency
  - Storage bandwidth matches compute bandwidth
- Cost!!!

Upgrade Design criteria
Considered both "Fat" servers (36 x 2 TB drives) and "Thin" servers (12 x 2 TB drives), with similar total cost including networking.
Chose the "Thin" solution:
- more bandwidth
- more flexibility
- one OST per node (although currently there is a 16 TB ext4 limit)
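The "more bandwidth" point can be made concrete with a small back-of-the-envelope comparison; assuming the fat server would have had the same 4 x GigE as the thin R510s (an illustration, not a quoted spec):

```python
# Rough comparison of network bandwidth per unit of raw storage for the two options.
# 4 x GigE is what the thin R510s actually have; assuming the same NICs on a fat
# server is a simplification for illustration.
fat = {"drives": 36, "drive_tb": 2, "gbit": 4}    # one "fat" server
thin = {"drives": 12, "drive_tb": 2, "gbit": 4}   # one "thin" server

def gbit_per_raw_tb(server):
    return server["gbit"] / (server["drives"] * server["drive_tb"])

print(f"fat : {gbit_per_raw_tb(fat):.3f} Gbit/s per raw TB")    # ~0.056
print(f"thin: {gbit_per_raw_tb(thin):.3f} Gbit/s per raw TB")   # ~0.167, i.e. 3x more
```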

New Hardware
- 60 * Dell R510: 12 * 2 TB SATA disks, H700 RAID controller, 12 GB RAM, 4 * 1 GbE (4 of them with 10 GbE)
- Total ~1.1 PB formatted (integrated with legacy kit to give ~1.4 PB)
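A quick check of the quoted capacity (filesystem overhead is ignored, so the usable figure is slightly optimistic):

```python
# Rough capacity check for 60 servers, each with 12 x 2 TB drives in RAID6.
n_servers = 60
drives_per_server = 12
drive_tb = 2.0                                   # vendor TB (10^12 bytes)

raw_pb = n_servers * drives_per_server * drive_tb / 1000            # 1.44 PB raw
# RAID6 gives up two drives' worth of capacity per server.
usable_tib_per_server = (drives_per_server - 2) * drive_tb * 1e12 / 2**40   # ~18.2 TiB
total_pib = n_servers * usable_tib_per_server / 1024                # ~1.07 PiB

print(f"raw {raw_pb:.2f} PB, ~{usable_tib_per_server:.1f} TiB per server, "
      f"~{total_pib:.2f} PiB usable (before filesystem overhead)")
```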

Lustre “Brick” (half rack)
- HP 2900 switch (legacy): 48 ports (24 storage, 24 compute), 10Gig uplink (could go to 2)
- 6 * storage nodes: Dell R510, 4 * GigE, 12 * 2 TB disks (~19 TB RAID6)
- 12 * compute nodes: 3 * Dell C6100 (each contains 4 motherboards), 2 * GigE
- Total of 144 (288) cores and ~110 TB
- Storage is better coupled to the local compute nodes
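Within a brick the storage and compute network bandwidth are deliberately balanced; a quick check using the figures on this slide (the ~3 GB/s also reappears later as the network limit in the IOR tests):

```python
# Network balance inside one "brick": storage NIC bandwidth vs compute NIC bandwidth.
storage_nodes, links_per_storage = 6, 4     # six R510s, 4 x GigE each
compute_boards, links_per_board = 12, 2     # twelve C6100 boards, 2 x GigE each

storage_gbit = storage_nodes * links_per_storage     # 24 Gbit/s
compute_gbit = compute_boards * links_per_board      # 24 Gbit/s

print(f"storage {storage_gbit} Gbit/s vs compute {compute_gbit} Gbit/s, "
      f"i.e. ~{storage_gbit / 8:.0f} GB/s either way")
```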

New QMUL Network (diagram): 48 * 1Gig ports per switch (24 for storage, 24 for CPU).

The real thing: HEPSPEC06 benchmarks and the RAL disk thrashing scripts
- 2 machines low (power-saving mode)
- 1 backplane failure
- 2 disk failures
- 10Gig cards in x8 slots

RAID6 Storage Performance: R510 disk throughput ~600 MB/s, well matched to the 4 x 1 Gb/s network (~500 MB/s).
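The slide does not say how the ~600 MB/s was measured (acceptance testing used the RAL disk thrashing scripts); as a generic illustration only, a large direct-I/O sequential write gives a comparable number. The mount point and sizes are assumptions:

```python
# Illustrative only: measure sequential write throughput of a RAID6 volume with dd.
import subprocess

TEST_FILE = "/mnt/ost00/dd_test.bin"   # hypothetical mount point of the RAID6 volume

subprocess.run(
    ["dd", "if=/dev/zero", f"of={TEST_FILE}",
     "bs=1M", "count=16384",           # write 16 GiB in 1 MiB blocks
     "oflag=direct"],                  # direct I/O so the array, not the page cache, is measured
    check=True,
)
# dd reports the achieved MB/s on stderr when it finishes.
```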

Lustre brick performance tests using the IOR parallel benchmark: 1-12 client nodes, 4 threads/node (network limit 3 GB/s). Block performance basically matches the 10 Gbit core network.
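The exact IOR options are not given on the slide; a sketch of a comparable run (12 client nodes with 4 MPI tasks each, file per process, 1 MiB transfers; the output path and data sizes are assumptions):

```python
# Sketch of an IOR run comparable to the test described above (not the exact command used).
import subprocess

ranks = 12 * 4                      # 12 client nodes x 4 tasks per node
cmd = [
    "mpirun", "-np", str(ranks),
    "ior",
    "-a", "POSIX",                  # plain POSIX I/O
    "-F",                           # file per process
    "-b", "4g",                     # amount written per task
    "-t", "1m",                     # 1 MiB transfer size
    "-w", "-r",                     # measure writes, then reads
    "-i", "3",                      # 3 iterations
    "-o", "/mnt/lustre/ior_test",   # hypothetical output path on the Lustre mount
]
subprocess.run(cmd, check=True)     # IOR reports aggregate MB/s for each phase
```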

Conclusions
- Have successfully deployed a ~1 PB Lustre filesystem using low-cost hardware, with the required performance.
- Would scale further with more "bricks", but it would be better if Grid jobs could be localised to a specific "brick".
- Would be better if the storage and CPU could be more closely integrated.

Conclusions 2
The storage nodes contain 18% of the CPU cores in the cluster, and we spend a lot of effort networking them to the compute nodes. It would be better (and cheaper) if these cores could be used directly for processing the data. This could be achieved using Lustre pools (or another filesystem such as Hadoop).
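A minimal sketch of how Lustre OST pools could tie a dataset to one brick's storage servers; the filesystem name, pool name and OST indices are hypothetical, and the pool commands run on the MGS node:

```python
# Sketch: create an OST pool covering one brick and direct a dataset onto it.
# Filesystem name "lustre01", pool name "brick1" and OST indices 0-5 are hypothetical.
import subprocess

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

# On the MGS: define the pool and add the brick's six OSTs to it.
run(["lctl", "pool_new", "lustre01.brick1"])
run(["lctl", "pool_add", "lustre01.brick1", "lustre01-OST[0-5]"])

# On a client: new files under this directory are then striped only over that pool,
# so jobs running on the brick's compute nodes read through the local switch.
run(["lfs", "setstripe", "-p", "brick1", "/mnt/lustre/atlas/brick1_data"])
```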

Lustre Workshop at QMUL, 14 July
Talks available at: http://www.esc.qmul.ac.uk/wiki/lustreuserworkshop2011/
Topics discussed included:
-- Future Lustre support models (Whamcloud)
-- Release schedule
-- Site reports from a number of UK sites with different use cases

Future Lustre Support
Oracle development appears to be dead.
Whamcloud: a company formed in July 2010, ~40 employees (www.whamcloud.com)
-- maintains community assets
-- committed to open-source Lustre and an open development model
-- will provide commercial development and support

Lustre Release Plans
1.8.5: available from Oracle
1.8.6: community release from Whamcloud
-- 24 TB OST support (important for QMUL)
-- RHEL 6
-- bug fixes
2.1: major Whamcloud community release, aiming to provide a stable 2.x release with performance >= 1.8

HammerCloud 718 (WMS), plot of events processed over 24 h: scales well to about ~600 jobs; 155/4490 job failures (3.4%).

WMS Throughput (HC 582) Scales well to about ~600 jobs

Throughput vs number of machines (plot): 2 threads, 1 MB block size; 3.5 GB/s maximum transfer, probably limited by the network to the racks used.

Overview: Design, Network, Hardware, Performance, StoRM, Conclusions

Ongoing and Future Work
- Need to tune performance
- Integrate legacy storage into the new Lustre filestore
- Starting to investigate other filesystems, particularly Hadoop

Old QMUL Network