1
Performance Evaluation of InfiniBand NFS/RDMA for Linux
Benjamin Allan, Helen Chen, Scott Cranford, Ron Minnich, Don Rudish, and Lee Ward Sandia National Laboratories This work was supported by the United States Department of Energy, Office of Defense Programs. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed-Martin Company, for the United States Department of Energy under contract DE-AC04-94-AL85000.
2
Talk Outline
Read/Write performance
Application Profile
System Profile
Network Profile
Infinitely fast file and disk I/O
Infinitely fast network
3
Sandia Motivation for looking at NFS/RDMA
Why NFS/RDMA, and why is Sandia looking at it?
Use for HPC platforms
Transparent solution for applications
In the mainstream kernel
Increased performance over normal NFS
4
Reads vs. Writes, TCP vs. RDMA
Previous work showed an NFS/TCP read/write ratio of 2:1 and an NFS/RDMA read/write ratio of 5:1.
5
FTQIO Application
FTQ (fixed time quantum), FTQ I/O. More info visit:
Simply put, rather than doing a fixed unit of work and measuring how long that work took, FTQ measures the amount of work done in a fixed time quantum. It is a high-resolution benchmark.
FTQ I/O: FTQ modified to measure file system performance by writing data and recording statistics.
How it works: one thread of the program writes blocks of allocated memory to disk; a second thread records the number of bytes written and, optionally, Supermon data (more on Supermon later).
Basic operation: the loop counts work done until it reaches a fixed end-point in time, then records the starting point of the loop and the amount of work that was done (see the sketch below).
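As a rough illustration of this fixed-time-quantum structure, here is a minimal sketch in C. It is not the FTQIO source: the 440-microsecond quantum matches the plots that follow, and do_unit_of_work() is a hypothetical stand-in for the block write performed by FTQIO's worker thread.

/*
 * Minimal sketch of an FTQ-style measurement loop (illustrative only,
 * not the actual FTQ/FTQIO code). Each pass of the outer loop covers
 * one fixed quantum; do_unit_of_work() is a hypothetical stand-in for
 * the block write performed by FTQIO's worker thread.
 */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define QUANTUM_NS 440000ULL            /* 440-microsecond quantum */

static uint64_t now_ns(void)
{
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

static void do_unit_of_work(void)
{
        /* FTQIO would write one block of allocated memory to disk here. */
}

int main(void)
{
        for (int q = 0; q < 10; q++) {
                uint64_t start = now_ns();
                uint64_t deadline = start + QUANTUM_NS;
                uint64_t work = 0;

                /* Count work done until the fixed end-point in time. */
                while (now_ns() < deadline) {
                        do_unit_of_work();
                        work++;
                }

                /* Record the start of the quantum and the work completed. */
                printf("%llu %llu\n",
                       (unsigned long long)start,
                       (unsigned long long)work);
        }
        return 0;
}

In FTQIO itself the recording is handled by the second thread described above rather than by a printf in the measurement loop.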
6
FTQIO Application Profile
Red dots represent the bytes written in each 440-microsecond interval: every 440 microseconds, FTQIO counts how many bytes were written and plots the total. (Plot: bytes written over a 27-second FTQIO run.)
7
Application Profile with VMM Data
Red dots: bytes recorded in each 440-microsecond interval.
Blue dots: number of dirty pages.
Purple dots: number of pages in the writeback queue.
Black dots: points where the application goes to sleep.
(Plot: bytes written over a 27-second FTQIO run.)
8
Added bytes transmitted from IB card
Red dots: bytes recorded in each 440-microsecond interval.
Blue dots: number of dirty pages.
Purple dots: number of pages in the writeback queue.
Black dots: points where the application goes to sleep.
Green dots: number of bytes transmitted by the InfiniBand card.
(Plot: bytes written over a 27-second FTQIO run.)
Here we add green markers to indicate the number of bytes transmitted by the InfiniBand card. Just like the other statistics, bytes transmitted are sampled every 440 microseconds. Generally the InfiniBand card has a consistent amount of data being transmitted, suggesting that the VMM is doing its job cleaning pages while still keeping the network busy. The only problem is that the network is not busy enough, and nowhere near the theoretical limit of DDR InfiniBand, which is 2 GB/s.
Transition: what we did next was to baseline the various subsystems to attempt to isolate where the bottleneck might be.
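The green samples come from the HCA's transmit counter. Below is a sketch of one way such a counter can be sampled; it assumes the standard InfiniBand sysfs counter interface, the device name (mlx4_0) and port number are placeholders, and it is not the actual FTQIO/Supermon collection code.

/*
 * Sketch of sampling bytes transmitted by an InfiniBand HCA through
 * sysfs (illustrative only). The device name and port are assumptions;
 * port_xmit_data counts in 4-byte words per the InfiniBand counter
 * definition, so the delta is multiplied by 4 to get bytes.
 */
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>

#define XMIT_COUNTER \
        "/sys/class/infiniband/mlx4_0/ports/1/counters/port_xmit_data"

static uint64_t read_xmit_words(void)
{
        unsigned long long v = 0;
        FILE *f = fopen(XMIT_COUNTER, "r");
        if (f) {
                if (fscanf(f, "%llu", &v) != 1)
                        v = 0;
                fclose(f);
        }
        return (uint64_t)v;
}

int main(void)
{
        uint64_t prev = read_xmit_words();

        for (;;) {
                usleep(440);            /* 440-microsecond quantum */
                uint64_t cur = read_xmit_words();
                /* Counter is in 4-byte words; print bytes sent this quantum. */
                printf("%llu\n", (unsigned long long)((cur - prev) * 4));
                prev = cur;
        }
        return 0;
}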
9
Baseline Approach
(Diagram: the client and server I/O stacks. The client path runs User Application → VFS → Page Cache → NFS Client → RPC → TCP/IP or RDMA → PCI → HCA; the server path runs HCA → PCI → RDMA or TCP/IP → RPC → NFS Server → Page Cache → FS → Block → PCI → Controller → Disk, with the two HCAs joined by the IB fabric. Short-circuit patch points are marked at the NFS server and at the RPC transport.)
10
Look at the Code
Where to look? The Linux Cross Reference, Ftrace

# tracer: function_graph
#
# CPU  TASK/PID       DURATION            FUNCTION CALLS
# |     |    |          |   |               |   |   |   |
 0)   dd-2280   |            |  schedule_tail() {
 0)   dd-2280   |         us |    finish_task_switch();
 0)   dd-2280   |         us |    __might_sleep();
 0)   dd-2280   |         us |    _cond_resched();
 0)   dd-2280   |         us |    __task_pid_nr_ns();
 0)   dd-2280   |         us |    down_read_trylock();
 0)   dd-2280   |         us |    find_vma();
 0)   dd-2280   |            |    handle_mm_fault() {
 0)   dd-2280   |            |      kmap_atomic() {
 0)   dd-2280   |            |        kmap_atomic_prot() {
 0)   dd-2280   |         us |          page_address();
 0)   dd-2280   |         us |        }
 0)   dd-2280   |         us |      }
 0)   dd-2280   |         us |      _spin_lock();
 0)   dd-2280   |            |      do_wp_page() {
 0)   dd-2280   |         us |        vm_normal_page();
 0)   dd-2280   |         us |        reuse_swap_page();
 0)   dd-2280   |            |        unlock_page() {
 0)   dd-2280   |         us |          __wake_up_bit();
 0)   dd-2280   |            |        kunmap_atomic() {
 0)   dd-2280   |         us |          arch_flush_lazy_mmu_mode();
 0)   dd-2280   |            |        anon_vma_prepare() {
 0)   dd-2280   |         us |          __might_sleep();
 0)   dd-2280   |         us |          _cond_resched();
 0)   dd-2280   |            |        __alloc_pages_internal() {
 0)   dd-2280   |            |          get_page_from_freelist() {
 0)   dd-2280   |         us |            next_zones_zonelist();
 0)   dd-2280   |         us |            zone_watermark_ok();
 0)   dd-2280   |         us |          }

Where to look?
The Linux Cross Reference
Ftrace
Comes with the kernel
No userspace programs
Debugfs (see the sketch below)
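Since Ftrace lives in the kernel and is driven entirely through debugfs files, even the capture can be scripted from a few lines of C. The sketch below assumes debugfs is mounted at /sys/kernel/debug and uses the standard tracing files; it illustrates the workflow, not the exact steps used in this study.

/*
 * Sketch: enable the function_graph tracer through debugfs, run a
 * workload, then dump the captured trace (the kind of output shown
 * above). Error handling is minimal.
 */
#include <stdio.h>
#include <unistd.h>

#define TRACING "/sys/kernel/debug/tracing/"

static int write_file(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");
        if (!f)
                return -1;
        fputs(val, f);
        fclose(f);
        return 0;
}

int main(void)
{
        /* Select the function_graph tracer and start tracing. */
        write_file(TRACING "current_tracer", "function_graph");
        write_file(TRACING "tracing_on", "1");

        sleep(1);               /* run the workload of interest here */

        write_file(TRACING "tracing_on", "0");

        /* Dump the captured trace. */
        FILE *t = fopen(TRACING "trace", "r");
        if (t) {
                char line[512];
                while (fgets(line, sizeof(line), t))
                        fputs(line, stdout);
                fclose(t);
        }
        return 0;
}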
11
Infinitely fast file and disk I/O
When the NFS server wants to write to a file, claim success. In fs/nfsd/nfs3proc.c:

/*
 * Write data to a file
 */
static __be32
nfsd3_proc_write(struct svc_rqst *rqstp, struct nfsd3_writeargs *argp,
                 struct nfsd3_writeres *resp)
{
        __be32  nfserr;

        /* Short circuit: report the write as complete without touching
         * the filesystem or the disk. */
        if (foobar_flag != '0') {
                resp->count = argp->count;
                RETURN_STATUS(0);
        }

        fh_copy(&resp->fh, &argp->fh);
        resp->committed = argp->stable;
        nfserr = nfsd_write(rqstp, &resp->fh, NULL,
                            argp->offset,
                            rqstp->rq_vec, argp->vlen,
                            argp->len,
                            &resp->committed);
        RETURN_STATUS(nfserr);
}

(Diagram: the short circuit sits in the NFS server, in front of the page cache, FS, block layer, controller, and disk.)

Throughput with the server-side short circuit (MB/s):

RPC Payload (Bytes)   32-KB Record   512-KB Record   1-MB Record
32768                 285.60         283.40          281.60
65536                 377.00         350.50          293.00
131072                387.50         363.50          306.00
262144                401.40         335.80          305.00
524288                425.00         376.50          312.50
12
Infinitely Fast Network
Remove the RDMA transport from the NFS write path: nothing goes out on the network.
Max throughput of 1.25 GB/s, a factor-of-3 improvement compared to sending the data over the wire.
The RPC transmit returns claiming that the transmit completed and that the reply is already in hand, and tells the NFS client that the page was committed.
(Diagram: the short circuit sits at the RPC/RDMA transport, in front of the PCI bus, HCA, and IB fabric.)
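On the client side, the idea mirrors the server-side short circuit shown earlier. The fragment below is only a hedged sketch against the SunRPC transmit path: shortcircuit_flag is a hypothetical knob, locking and error handling are omitted, and the actual patch used for these measurements may differ.

/*
 * Illustrative sketch only (not the actual patch): skip the RDMA
 * transport in the RPC transmit path and behave as if the request
 * went out and its reply already came back. "shortcircuit_flag" is
 * hypothetical; locking and error handling are omitted.
 */
#include <linux/sunrpc/xprt.h>

static int shortcircuit_flag = 1;

static int xprt_transmit_shortcircuit(struct rpc_task *task)
{
        struct rpc_rqst *req = task->tk_rqstp;

        if (shortcircuit_flag) {
                /* Claim the transmit completed and the reply is in hand,
                 * so the NFS client sees the page as committed. */
                xprt_complete_rqst(task, req->rq_rcv_buf.len);
                return 0;
        }

        /* Normal path: hand the marshalled request to the transport. */
        return req->rq_xprt->ops->send_request(task);
}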
13
Recap and Conclusion
(Diagram: the client/server stack from the baseline slide, annotated with the measured rates: 377 MB/s at the NFS client/server level, 1.25 GB/s at the RPC level, and 1.8 GB/s at the transport/hardware level (PCI, HCA, IB fabric).)