Early Experiences with NFS over RDMA
OpenFabrics Workshop, San Francisco, September 25, 2006
Sandia National Laboratories, CA
Helen Y. Chen, Dov Cohen, Joe Kenny, Jeff Decker, and Noah Fischer
2 Outline
– Motivation
– RDMA technologies
– NFS over RDMA
– Testbed hardware and software
– Preliminary results and analysis
– Conclusion
– Ongoing work and future plans
3 What is NFS?
– A network-attached storage file access protocol layered on RPC, typically carried over UDP/TCP over IP
– Allows files to be shared among multiple clients across LANs and WANs
– A standard, stable, and mature protocol widely adopted on cluster platforms
4 NFS Scalability Concerns in Large Clusters
– Large numbers of concurrent requests from parallel applications
– Parallel I/O requests are serialized by NFS to a large extent
– Need RDMA and pNFS
[Figure: applications 1 through N issuing concurrent I/O against a single NFS server]
5 How DMA Works
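(Illustrative aside, not from the original slides: in DMA the device, not the CPU, moves the bytes, but the buffer must first be pinned and made addressable by the device. Below is a minimal, hypothetical sketch using the OpenFabrics verbs call ibv_reg_mr(); the protection domain pd and all device setup are assumptions.)

    /* Hypothetical sketch: pin a buffer so the HCA can DMA it.
     * Assumes an already-opened device and protection domain 'pd'. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <infiniband/verbs.h>

    struct ibv_mr *register_dma_buffer(struct ibv_pd *pd, size_t len)
    {
        void *buf = malloc(len);
        if (!buf)
            return NULL;

        /* ibv_reg_mr() pins the pages and programs the HCA's address
         * translation; the returned lkey/rkey are what later work
         * requests use to name this memory. */
        struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (!mr) {
            free(buf);
            return NULL;
        }
        printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n",
               len, mr->lkey, mr->rkey);
        return mr;
    }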
6 How RDMA Works
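(Again a hypothetical sketch to accompany the diagram: once memory is registered on both sides and a connection is established, one side can post a one-sided RDMA WRITE that the remote CPU never touches. The connected queue pair qp and the out-of-band exchange of remote_addr/rkey are assumptions.)

    /* Hypothetical sketch: post a one-sided RDMA WRITE.
     * Assumes 'qp' is a connected RC queue pair, 'mr' covers 'buf',
     * and the peer's remote_addr/rkey were exchanged out of band. */
    #include <stdint.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int rdma_write_once(struct ibv_qp *qp, struct ibv_mr *mr, void *buf,
                        uint32_t len, uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;

        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_WRITE; /* no recv posted remotely */
        wr.send_flags          = IBV_SEND_SIGNALED; /* request a completion */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.wr.rdma.remote_addr = remote_addr;
        wr.wr.rdma.rkey        = rkey;

        /* The HCA DMAs 'buf' straight from local memory into the peer's
         * registered region; neither CPU copies the payload. */
        return ibv_post_send(qp, &wr, &bad_wr);
    }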
7 Why NFS over RDMA?
– NFS moves big chunks of data, incurring many copies with each RPC
– Cluster computing demands high bandwidth and low latency
– RDMA offloads protocol processing
– RDMA offloads the host memory I/O bus
– A must for 10/20 Gb/s networks
(See nfs-rdma-problem-statement-04.txt)
8 The NFS/RDMA Architecture
– NFS is a family of protocols layered over RPC
– XDR encodes RPC requests and results onto RPC transports
– NFS/RDMA is implemented as a new RPC transport mechanism
– Selection of the transport is an NFS mount option
[Figure: NFSv2, NFSv3, NFSv4, NLM, and NFS ACL stacked over RPC/XDR, with UDP, TCP, and RDMA as interchangeable transports]
Brent Callaghan, Theresa Lingutla-Raj, Alex Chiu, Peter Staubach, Omer Asad, "NFS over RDMA", ACM SIGCOMM 2003 Workshops, August 25-27, 2003
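(Because the transport is chosen at mount time, moving a client from TCP to RDMA requires no application changes. The fragment below is a hypothetical illustration using the mount(2) system call with the string-form NFS options of later Linux kernels; the 2006 release candidate shipped a patched mount.nfs instead. The server address, export path, and the conventional NFS/RDMA port 20049 are example values.)

    /* Hypothetical illustration: selecting the RPC transport via mount
     * options. Assumes string-based NFS mount options (later kernels);
     * addresses and paths are made up. Must run as root. */
    #include <stdio.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* proto=rdma selects the RPC/RDMA transport; addr= carries the
         * server's resolved IP (normally filled in by mount.nfs). */
        const char *opts = "proto=rdma,port=20049,vers=3,addr=192.0.2.1";

        if (mount("192.0.2.1:/export", "/mnt/nfs", "nfs", 0, opts) != 0) {
            perror("mount");
            return 1;
        }
        puts("mounted /mnt/nfs over NFS/RDMA");
        return 0;
    }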
9 This Study
10 OpenFabrics Software Stack
– Offers a common, open-source, open-development RDMA application programming interface
11 Testbed Key Hardware
– Mainboard: Tyan Thunder K8WE (S2895)
– CPU: dual 2.2 GHz AMD Opteron, Socket 940
– Memory: 8 GB (ATP 1 GB PC3200 DDR SDRAM) on the NFS server; 2 GB (CORSAIR CM725D512RLP-3200/M) on each client
– IB switch: Flextronics InfiniScale III 24-port switch
– IB HCA: Mellanox MT25208 InfiniHost III Ex
12 Testbed Key Software
– Kernel: Linux with the deadline I/O scheduler
– NFS/RDMA release candidate 4
– oneSIS used to boot all the nodes
– OpenFabrics IB stack (svn)
13 Testbed Configuration
– One NFS server and up to four clients
  – NFS/TCP vs. NFS/RDMA
  – IPoIB and IB RDMA running SDR
– Ext2 with software RAID0 backend
– Clients ran IOzone, writing and reading 64 KB records with a 5 GB aggregate file size (a sketch of the access pattern follows below)
  – To eliminate cache effects on the clients
  – To maintain consistent disk I/O on the server
  – Allows evaluating the NFS/RDMA transport without being constrained by disk I/O
– System resources monitored using vmstat at 2 s intervals
[Figure: clients and server connected through an IB switch]
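(For concreteness, here is a rough, hypothetical approximation of the benchmark's access pattern, not IOzone itself: sequential 64 KB writes of a large file on the NFS mount, then sequential 64 KB reads. The path and the single-client 5 GB size are example values; with more clients the per-client file shrinks accordingly, as on slide 19.)

    /* Hypothetical approximation of the IOzone workload used here:
     * sequential 64 KB writes of a ~5 GB file, then sequential 64 KB
     * reads. The mount path is an example value. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define RECORD (64 * 1024)                          /* 64 KB records */
    #define NRECS  (5LL * 1024 * 1024 * 1024 / RECORD)  /* ~5 GB total  */

    int main(void)
    {
        char *rec = malloc(RECORD);
        if (!rec) return 1;
        memset(rec, 'x', RECORD);

        int fd = open("/mnt/nfs/testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }
        for (long long i = 0; i < NRECS; i++)   /* sequential write phase */
            if (write(fd, rec, RECORD) != RECORD) { perror("write"); return 1; }
        fsync(fd);                              /* push dirty data to the server */
        close(fd);

        fd = open("/mnt/nfs/testfile", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }
        for (long long i = 0; i < NRECS; i++)   /* sequential read phase */
            if (read(fd, rec, RECORD) != RECORD) { perror("read"); return 1; }
        close(fd);
        free(rec);
        return 0;
    }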
14 Local, NFS, and NFS/RDMA Throughput
[Table: write and read throughput (MB/s) for local disk, NFS (IPoIB), and NFS/RDMA]
– Reads were served from the server's cache, reflecting transport capability:
  – The TCP RPC transport achieved ~180 MB/s (1.4 Gb/s) of throughput
  – The RDMA RPC transport was capable of delivering ~700 MB/s (5.6 Gb/s) of throughput
– Server tuning: RPCNFSDCOUNT=8 (nfsd thread count); /proc/sys/sunrpc/svc_rdma/max_requests=16
15 NFS & NFS/RDMA Server Disk I/O
– Writes incurred disk I/O, issued according to the deadline scheduler
– The NFS/RDMA server saw a higher incoming data rate, and thus a higher block I/O output rate to disk
– The NFS/RDMA data rate was bottlenecked by the storage I/O rate, as indicated by the higher IOWAIT time
16 NFS vs. NFS/RDMA Client Interrupts and Context Switches
– NFS/RDMA incurred ~1/8 the interrupts and completed in a little more than 1/2 the time
– NFS/RDMA showed higher context-switch rates, indicating faster processing of application requests
– Higher throughput compared to NFS!
17 Client CPU Efficiency
CPU per MB of transfer: (elapsed time t × %CPU / 100) / file size
– Write: NFS = …, NFS/RDMA = …; % more efficient!
– Read: NFS = …, NFS/RDMA = …; % more efficient!
– Improved application performance
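(Worked example of the metric with made-up numbers: moving a 5,120 MB file in t = 60 s at an average of 25% CPU costs (60 × 25/100) / 5120 ≈ 0.003 CPU-seconds per MB; a transport that halves the CPU share halves this figure.)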
18 Server CPU Efficiency
CPU per MB of transfer: (elapsed time t × %CPU / 100) / file size
– Write: NFS = …, NFS/RDMA = …; % more efficient!
– Read: NFS = …, NFS/RDMA = …; % more efficient!
– Improved system performance
19 Scalability Test – Throughput
– To minimize the impact of disk I/O, the aggregate was split across clients: one 5 GB, two 2.5 GB, three 1.67 GB, or four 1.25 GB files
– Rewrite and reread results were ignored due to client-side cache effects
20 Scalability Test – Server I/O
– The NFS/RDMA transport processed concurrent RPC I/O requests and responses from and to the four clients faster than NFS
– Concurrent NFS/RDMA writes were impacted more by our slow storage, as indicated by CPU IOWAIT times close to 80%
21 Scalability Test – Server CPU
– NFS/RDMA incurred ~1/2 the CPU overhead for about half the duration, yet delivered 4 times the aggregate throughput compared to NFS
– NFS/RDMA write performance was impacted more by the backend storage than NFS, as indicated by ~70% vs. ~30% of CPU time spent idle waiting for I/O to complete
22 Preliminary Conclusion
– Compared to NFS, NFS/RDMA demonstrated impressive CPU efficiency and promising scalability
– NFS/RDMA will improve application- and system-level performance!
– NFS/RDMA can readily exploit the bandwidth of 10/20 Gb/s networks for large file accesses
23 Ongoing Work
– SC06 participation
  – HPC Storage Challenge finalist: micro-benchmarks and MPI applications with POSIX and/or MPI I/O
  – Xnet NFS/RDMA demo over IB and iWARP
24 Future Plans
– Initiate study of NFSv4 pNFS performance with RDMA storage
  – Blocks (SRP, iSER)
  – Files (NFSv4/RDMA)
  – Objects (iSCSI-OSD)?
25 Why NFSv4
– NFSv3
  – Use of the ancillary Network Lock Manager (NLM) protocol adds complexity and limits scalability in parallel I/O
  – The lack of an attribute-caching requirement squelches performance
– NFSv4
  – Integrated lock management allows the byte-range locking required for parallel I/O
  – Compound operations improve the efficiency of data movement and …
26 Why Parallel NFS (pNFS)
– pNFS extends NFSv4
  – A minimal extension to allow out-of-band I/O
  – A standards-based, scalable I/O solution
– Asymmetric, out-of-band solutions offer scalability
  – The control path (open/close) is separate from the data path (read/write)
27 Acknowledgements
The authors would like to thank the following for their technical input:
– Tom Talpey and James Lentini from NetApp
– Tom Tucker from Open Grid Computing
– James Ting from Mellanox
– Matt Leininger and Mitch Sukalski from Sandia