Lock Ahead: Shared File Performance Improvements

Lock Ahead: Shared File Performance Improvements
Patrick Farrell, Cray Lustre Developer
Steve Woods, Senior Storage Architect (woods@cray.com)
September 2016

Agenda
Shared File Problem
Sequence of events that lead to the problem
Solution
Usage and Performance
Availability

Shared File vs FPP
Two principal approaches to writing out data:
File-per-process (FPP), which scales well in Lustre
Shared file, which does not scale well

Shared File IO
Common HPC applications use the MPI-IO library to do 'good' shared file IO
The technique is called collective buffering
IO is aggregated to a set of nodes, each of which handles parts of a file
Writes are strided and non-overlapping
Example: Client 1 is responsible for writes to block 0, block 2, block 4, etc.; client 2 is responsible for block 1, block 3, etc.
Currently arranged so there is one client per OST
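
For context (not part of the original slides), here is a minimal MPI-IO sketch in C of the collective, strided, non-overlapping write pattern just described; the file name, block size, and block count are illustrative assumptions.

/* A minimal sketch of the collective-buffering write pattern described
 * above: every rank writes strided, non-overlapping blocks with a
 * collective MPI-IO call, and the MPI-IO layer aggregates the data onto
 * a subset of nodes.  File name, block size, and block count are
 * illustrative assumptions, not values from the deck. */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK_SIZE (1 << 20)      /* assumed 1 MiB block */
#define BLOCKS_PER_RANK 4         /* assumed number of blocks per rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_file.out",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    char *buf = calloc(1, BLOCK_SIZE);

    /* Rank r writes blocks r, r + nranks, r + 2*nranks, ... so the byte
     * ranges are strided and never overlap between ranks. */
    for (int i = 0; i < BLOCKS_PER_RANK; i++) {
        MPI_Offset offset = (MPI_Offset)(i * nranks + rank) * BLOCK_SIZE;
        MPI_File_write_at_all(fh, offset, buf, BLOCK_SIZE, MPI_BYTE,
                              MPI_STATUS_IGNORE);
    }

    free(buf);
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}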

Shared File Scalability
Bandwidth is best at one client per OST
Going from one to two clients reduces bandwidth dramatically; adding more after two doesn't help much
In real systems, an OST can handle the full bandwidth of several clients (FPP hits these limits)

[Graph: shared-file write bandwidth vs. number of clients per OST (the performance graph referenced later)]

Why Doesn’t Shared File IO Scale In ‘good’ shared file IO, writes are strided, non-overlapping Since writes don’t overlap, should be possible to have multiple clients per OST without lock contention With > 1 client per OST, writes are serialized due to LDLM* extent lock design in Lustre 2+ clients are slower than one due to lock contention *LDLM locks are Lustre’s distributed locks, used on clients and servers

Extent Lock Contention
Single OST view of a file; this also applies to individual OSTs in a striped file
Two clients, doing strided writes
Client 1 asks to write segment 0 (assume stripe-size segments)
The OST expands the request and grants client 1 a lock covering the rest of the file; client 2 then asks to write segment 1, which conflicts with that expanded lock

Extent Lock Contention
The lock for client 1 was called back, so there are currently no locks on the file
The OST expands the lock request from client 2
And grants a lock on the rest of the file...

Extent Lock Contention
Client 1 asks to write segment 2
This conflicts with the expanded lock granted to client 2
The lock for client 2 is called back...
And so on; this continues throughout the IO

Extent Lock Contention
Multiple clients per OST are completely serialized; no parallel writing at all
Even worse: additional latency to exchange the lock
Mitigation: clients are generally able to write more than one segment before giving up the local lock

Extent Lock Contention
What about not expanding locks?
Avoids contention; clients can write in parallel
Surprise: it is actually worse
Without expansion, a lock is needed for every write, and the latency kills performance
That was the blue line at the bottom of the performance graph

Proposal: Lock Ahead
Lock ahead: allow clients to request locks on file regions in advance of IO
Pros:
Request a lock on a part of a file with an IOCTL; the server grants the lock only on the requested extent (no expansion); a sketch of the request pattern follows below
Flexible; can optimize other IO patterns
Relatively easy to implement
Cons:
Large files drive up the lock count and can hurt performance
Pushes the LDLM into new areas and exposes bugs
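
The slides do not show the request interface itself, so the following sketch (not from the deck) uses a hypothetical helper, request_lock_ahead(), standing in for the IOCTL mentioned above; the segment arithmetic mirrors the strided pattern from the earlier slides.

/* Hypothetical sketch of the lock ahead pattern: before writing, each
 * aggregator asks the server for non-expanded extent locks on exactly
 * the segments it will write.  request_lock_ahead() stands in for the
 * IOCTL-based interface mentioned in the deck and is not a real API. */
#include <stdint.h>

/* Assumed prototype: request an LDLM extent lock on [start, end] bytes,
 * with server-side expansion disabled. */
int request_lock_ahead(int fd, uint64_t start, uint64_t end);

void lock_ahead_my_segments(int fd, int rank, int nranks,
                            uint64_t segment_size, int segments_per_rank)
{
    /* Rank r will later write segments r, r + nranks, r + 2*nranks, ...
     * Request a lock on each of those byte ranges up front, so no
     * server-side lock expansion ever creates a conflict. */
    for (int i = 0; i < segments_per_rank; i++) {
        uint64_t start = (uint64_t)(i * nranks + rank) * segment_size;
        uint64_t end   = start + segment_size - 1;
        request_lock_ahead(fd, start, end);
    }
}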

Lock Ahead: Requests Locks to Match the IO Pattern
Imagine requesting locks ahead of time
Same situation: Client 1 wants to write segment 0
But before that, it requests locks...

Lock Ahead: Requests Locks to Match the IO Pattern
Request locks on the segments the client intends to do IO on: 0, 2, 4, etc.
Lock ahead locks are not expanded

Lock Ahead: Requests Locks to Match the IO Pattern
Client 2 requests locks on its segments: 1, 3, 5, etc.

Lock Ahead: Requests Locks to Match the IO Pattern
With locks issued, clients can do IO in parallel
No lock conflicts

What about Group Locks?
Lustre has an existing solution: group locks
Basically turns off LDLM locking on a file for group members, allowing FPP-like performance for the group
Tricky: since the lock is shared between clients, there are write visibility issues (clients assume they are the only one holding a lock, and do not notice file updates until the lock is released and cancelled)
Must release the lock to get write visibility between clients

What about Group Locks?
Works for some workloads, but not OK for many others
Not really compatible with HDF5 and other such file formats: in-file metadata updates require write visibility between clients during the IO
It's possible to fsync and release the lock after every write, but the speed benefits are lost (see the sketch below)
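
For comparison, here is a brief sketch (not from the slides) of the group lock workaround, assuming the liblustreapi helpers llapi_group_lock() and llapi_group_unlock(); the group id is an arbitrary illustrative value. It shows why getting write visibility between clients means flushing and dropping the lock around every write, which is what erases the performance benefit.

/* Sketch of the group lock workaround, assuming the liblustreapi
 * helpers llapi_group_lock()/llapi_group_unlock().  All cooperating
 * clients pass the same group id (an arbitrary value chosen here). */
#include <fcntl.h>
#include <unistd.h>
#include <lustre/lustreapi.h>

#define MY_GROUP_ID 1234   /* arbitrary; must match across clients */

int write_with_group_lock(const char *path, const void *buf,
                          size_t len, off_t offset)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* LDLM extent locking is bypassed for members of this group. */
    llapi_group_lock(fd, MY_GROUP_ID);

    pwrite(fd, buf, len, offset);

    /* To make this write visible to the other group members, the data
     * must be flushed and the group lock released; doing this after
     * every write is what loses the performance benefit. */
    fsync(fd);
    llapi_group_unlock(fd, MY_GROUP_ID);

    close(fd);
    return 0;
}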

Lock Ahead: Performance
Early results show performance equal to file-per-process or group locks
Intended to match up with the MPI-IO collective buffering feature described earlier
Measured with IOR using the MPIIO backend and collective IO (IOR -a MPIIO -c)
Cray has made a patch available
Codes need to be rebuilt but not modified

Lock Ahead: Performance (preliminary)

Lock Ahead: Performance (preliminary)

Job                          Standard Lock (MiB/s)   Lock Ahead (MiB/s)
CL_IOR_all_hdf5_cb_w_64m     1666.582                2541.630
CL_IOR_all_hdf5_cb_w_32m     1730.573                2650.620
CL_IOR_all_hdf5_cb_w_1m       689.620                1781.704

Lock Ahead: Performance (preliminary)

Job                          Group Lock (MiB/s)   Lock Ahead (MiB/s)
CL_IOR_all_mpiio_cb_w_64m    2528                 2530
CL_IOR_all_mpiio_cb_w_32m    2446                 2447
CL_IOR_all_mpiio_cb_w_1m     2312                 2286

Lock Ahead: Implementation
Re-uses much of the Asynchronous Glimpse Lock (AGL) implementation
Adds an LDLM flag to tell the server not to expand lock ahead locks

Lock Ahead: When will it be available?
Targeted as a feature for Lustre 2.10
Currently available in Cray Lustre 2.7
Available soon with Cray Sonexion (Neo 2.0 SU22)

Questions?