1
NVMMU: A Non-Volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures
Jie Zhang1, David Donofrio3, John Shalf3, Mahmut Kandemir2, and Myoungsoo Jung1. 1Computer Architecture and Memory Systems Laboratory, School of Integrated Technology, Yonsei Institute of Convergence Technology, Yonsei University; 2Department of EECS, The Pennsylvania State University; 3Computer Architecture Laboratory, Lawrence Berkeley National Laboratory. Good morning, everyone. My name is Jie Zhang, from Yonsei University. Today, I would like to introduce NVMMU, a non-volatile memory management unit for heterogeneous GPU-SSD architectures.
2
Summary
Motivation: Fetching data from the underlying storage degrades GPU performance; attaching an SSD to the GPU is a promising solution.
Challenge: Data movement between the SSD and the GPU incurs user/kernel-mode context switches and redundant data copies.
Our Solution: NVMMU, a non-volatile memory management unit for heterogeneous GPU-SSD architectures. GPU & SSD stack unification: unify the two different software stacks to avoid unnecessary CPU intervention. GPU & SSD memory management: employ NVMMU to remove redundant data copies. Programming model for GPU data movement: a new set of programming interfaces for programmers to utilize NVMMU.
Achievements: Reduce data-movement overhead by 95% on average; improve overall system performance by 78%.
First, I would like to use a single slide to summarize today's presentation. Fetching data from the underlying storage is a bottleneck for GPU performance. Modern SSDs can provide very high throughput, which matches the GPU interface bandwidth well, so attaching SSDs to the GPU is a good solution, especially for executing data-intensive applications. However, data movement between the SSD and the GPU introduces a lot of software-stack overhead, such as frequent user/kernel-mode context switches and redundant data copies. To address this challenge, we propose NVMMU to simplify the software stacks between SSDs and GPUs. Specifically, we unify the two different software stacks to avoid unnecessary CPU intervention and to manage the GPU and SSD memory space, and we expose a new set of programming interfaces so that programmers can better utilize NVMMU. Finally, our evaluation results show that NVMMU reduces data-movement overhead by 95% and improves whole-system performance by 78%.
3
Rethinking system bandwidth
[Figure: system bandwidth comparison (3-6 GB/s, 12.8 GB/s, 204 GB/s, >3.3 GB/s)] Executing data-intensive GPU workloads requires heavy data access to the underlying storage, and the throughput disparity between the GPU and storage can degrade whole-system performance. GPU internal memory throughput is around 200 GB/s, but the CPU-GPU data transfer bandwidth is only around 3-6 GB/s. High-performance SSDs can provide more than 3 GB/s of throughput, which can mitigate this performance gap as much as possible.
4
Challenge: Discrete storage/GPU software stacks with IOMMU
However, in reality, such data movement still suffers a lot of performance degradation, even if we employ an NVMe SSD as the storage. One of the challenges is the discrete storage/GPU software stacks.
5
Data movement model becomes bottleneck.
[Figure: execution-time breakdown per workload (computation, data movement, storage access)] To better understand the data-movement overhead, the figure shows the execution-time decomposition of various GPU workloads. Due to differing workload characteristics, the GPU kernel takes from 5% to 95% of the execution time for computation, about 40% on average. Data access in the SSD takes around 15% of the execution time. Finally, although the CPU is not involved in the computation, it takes 45% of the execution time to coordinate data movement between the GPU and the SSD. Even if we improve GPU and SSD performance, data movement will still be the bottleneck of whole-system performance.
6
Challenge: Discrete storage/GPU software stacks with IOMMU (CPU intervention and redundant data movement)
[Figure: hardware architecture and software stacks] To find the root cause of the data-movement overhead, let's look at the hardware architecture. The GPU is connected to the host system via the north bridge, and SSDs are attached to the host via the south bridge and the north bridge. From the hardware viewpoint, there is no direct path for the GPU to communicate with the SSD, so the common approach is to use CPU memory as an intermediate storage medium to forward data between the GPU and the SSD. However, this approach has multiple drawbacks. One drawback is redundant data movement inside host memory; the software stack shows the details. Even though both the SSD and the GPU are attached to the host as peripheral devices, they reside in different software stacks because of their different device types. In the storage software stack, the GPU application interacts with the I/O runtime library to submit data requests to the virtual filesystem and the native filesystem; the native filesystem turns data requests into block-device operations, and finally the disk controller issues I/O requests to the underlying SSD. The fetched data then has to travel backwards through multiple layers to reach the user application. In the GPU software stack, the GPU application interacts with the GPU runtime library to generate GPU commands, which are submitted to the GPU driver via IOCTL. The GPU driver translates kernel memory space to GPU memory space, or virtual addresses to GPU physical addresses. These operations require memory allocation in both user space and kernel space, so data is redundantly copied into the cache buffer and the user's memory even though the host does not need it. Moreover, the data movement also invokes frequent user/kernel-space switches. What's worse, since the GPU and the SSD do not share the same memory space, data must be copied between the SSD driver and the GPU driver.
7
An Example: SSD -> GPU
[Figure: simplified code segments and GPU programming model] Here is an example GPU application, with simplified code segments and its programming model. Before the GPU kernel executes, the GPU application creates file descriptors to fetch data from the underlying storage. It allocates memory in user space to store the data and, at the same time, allocates GPU memory for data transfers between the CPU and the GPU. Then the CPU sends data requests to the underlying storage; the fetched data is placed in a kernel buffer and then copied to user memory. Once the data is ready, the GPU application interacts with the GPU runtime library to initiate the data transfer between the CPU and the GPU, and the GPU device runs the kernel on the data. If the application needs to retrieve data from the GPU, the GPU runtime library is invoked again to transport the data back to the user's memory, and the application then invokes the I/O runtime library to transport the data to the SSD via the PCIe interface. The free functions release memory in both user space and kernel space, and the kernel deletes the file descriptors. In this example, the CPU is heavily involved in the data movement, which causes a lot of performance overhead.
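To make the conventional path above concrete, here is a minimal host-side CUDA C sketch. The API calls are the ones named on the slide (open/read/write, malloc/free, cudaMalloc/cudaMemcpy/cudaFree); the file name, buffer size, and the process kernel are hypothetical placeholders, not the paper's code.

```cuda
// Sketch of the conventional SSD -> GPU -> SSD path described above.
// Data is staged twice: SSD -> kernel page cache -> user buffer (read),
// then user buffer -> GPU memory (cudaMemcpy).
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void process(char *data, size_t n)      // placeholder GPU kernel
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] ^= 0x1;
}

int main(void)
{
    const size_t n = 64UL << 20;                    // 64 MB, arbitrary
    int fd = open("input.dat", O_RDWR);             // file descriptor for the SSD data
    if (fd < 0) return 1;

    char *h_buf = (char *)malloc(n);                // user-space staging buffer
    char *d_buf;
    cudaMalloc((void **)&d_buf, n);                 // GPU device memory

    read(fd, h_buf, n);                             // SSD -> page cache -> user buffer
    cudaMemcpy(d_buf, h_buf, n, cudaMemcpyHostToDevice);   // user buffer -> GPU

    process<<<(unsigned)((n + 255) / 256), 256>>>(d_buf, n);
    cudaDeviceSynchronize();

    cudaMemcpy(h_buf, d_buf, n, cudaMemcpyDeviceToHost);   // GPU -> user buffer
    lseek(fd, 0, SEEK_SET);
    write(fd, h_buf, n);                            // user buffer -> page cache -> SSD

    cudaFree(d_buf);                                // release GPU and host memory
    free(h_buf);
    close(fd);                                      // delete the file descriptor
    return 0;
}
```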
8
NVMMU Target: Directly forward data between GPU and SSD without unnecessary CPU-memory activities, user/kernel switching, or underlying hardware modification.
We propose NVMMU to improve the performance of the heterogeneous GPU-SSD architecture. Our target is to directly forward data between the GPU and the SSD without unnecessary CPU-memory activity, user/kernel switching, or underlying hardware modification.
9
Redundant data movement
[Figure: conventional APIs (open(), malloc(), cudaMalloc(), read(), cudaMemcpy(), write(), cudaFree(), free(), close()) versus the NVMMU APIs (nvmmuBegin(), nvmmuMove(), nvmmuEnd()), annotated with redundant data movement and user/kernel switches] As we discussed before, user applications use different runtime libraries and programming APIs to interact separately with the underlying storage and the GPU. This programming model makes redundant data copies and user/kernel switches unavoidable during data movement between the GPU and the SSD. To overcome this drawback, we propose a virtual file system driver, called the unified interface library (UIL), together with a simplified programming API.
10
Reduce user/kernel switch
[Figure: NDMA (NVMMU) data movement with nvmmuMove(), reducing data movement and user/kernel switches] When a user application calls the API nvmmuMove(), the UIL directly forwards data from the SSD to the GPU without CPU intervention in user space; it also saves the memory allocation in user space. However, the redundant data copy is not completely removed: since the SSD and the GPU reside in different memory boundaries, data still needs to move from the storage memory blocks to the GPU pinned memory. For this, we modified the disk controller driver; we call it NDMA. NDMA manages the physical memory address mapping and merges the SSD memory blocks with the GPU pinned memory. The remaining disadvantage is that the discrete software stacks between a GPU and an SSD result in data movement across two kernel modules.
11
NDMA implementation: NVMe with GPUDirect
Here is an NDMA implementation using GPUDirect and an NVMe SSD. GPUDirect allocates GPU pinned memory in kernel space; this GPU pinned memory can directly map data to GPU memory. NDMA leverages the GPU pinned memory as the NDMA buffer, and the UIL uses the NDMA buffer to upload and download data to the SSD. Since the data is placed in GPU pinned memory, it can be mapped to GPU memory without a redundant data copy. The NVMe disk controller driver manages memory-mapped registers. Within the memory-mapped registers, there is an I/O submission region containing multiple submission command lists, and each submission command list contains multiple entries. Each entry has two physical region page entries, or PRPs, which are pointers to the page blocks used by the NDMA buffer for data transfer. NDMA remaps the PRP or PRP list to point to the GPU pinned memory.
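As an illustration of the PRP remapping just described, here is a small, self-contained C sketch. The 64-byte submission-queue-entry layout and the PRP1/PRP2/PRP-list rules follow the NVMe specification, but the remap helper and the GPU pinned-page list are assumptions made for this example, not the actual NVMMU driver code.

```c
// Illustration of the PRP remapping idea described above (assumptions noted).
#include <stdint.h>
#include <stddef.h>

struct nvme_sq_entry {            // 64-byte NVMe submission queue entry
    uint32_t cdw0;                // opcode, fused op, command id
    uint32_t nsid;                // namespace id
    uint64_t rsvd;
    uint64_t mptr;                // metadata pointer
    uint64_t prp1;                // Physical Region Page entry 1
    uint64_t prp2;                // PRP entry 2, or pointer to a PRP list
    uint32_t cdw10_15[6];         // command-specific fields
};

// Point the command's data pointers at GPU pinned memory instead of a host
// DMA buffer. gpu_pin_phys[] holds physical addresses of GPU pinned pages
// (obtained through GPUDirect in the description above).
static void ndma_remap_prp(struct nvme_sq_entry *sqe,
                           const uint64_t *gpu_pin_phys, size_t npages,
                           uint64_t *prp_list,       // one page-sized, physically contiguous list
                           uint64_t prp_list_phys)   // physical address of that list
{
    sqe->prp1 = gpu_pin_phys[0];                     // first data page
    if (npages == 1) {
        sqe->prp2 = 0;                               // unused
    } else if (npages == 2) {
        sqe->prp2 = gpu_pin_phys[1];                 // second (and last) page
    } else {
        for (size_t i = 1; i < npages; i++)          // remaining pages go into
            prp_list[i - 1] = gpu_pin_phys[i];       // an external PRP list
        sqe->prp2 = prp_list_phys;                   // PRP2 points to the list
    }
}
```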
12
SSD array. While an NVMe-based SSD provides high performance, its cost and power consumption are not acceptable for building a GPU-SSD heterogeneous system. We build a low-end SSD array by leveraging the low-power advanced host controller interface, or AHCI, as shown in the figure. Each SSD device occupies a SATA lane, and all SSDs are connected to the direct media interface, or DMI. Since DMI provides around 20 Gb/s of throughput, the aggregated performance of our SSD array can be similar to, or exceed, that of an NVMe-based SSD. NVMMU also supports optimizations for the GPU and SSD array.
13
NDMA implementation: AHCI command list
AHCI also leverages GPU pinned memory. The AHCI memory-mapped registers contain a generic host control region and multiple ports, where each port represents a device. There are two metadata structures in each port: the command list and the FIS structure. Each command list has 32 command headers, and each command header contains entries that point to system memory blocks residing in GPU pinned memory.
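A similar illustration for the AHCI path: the command-header and PRDT-entry layouts below follow the AHCI specification, while the fill helper, the fixed 8-entry PRDT, and the GPU pinned-page list are assumptions made for this sketch rather than the actual NVMMU driver code.

```c
// Illustration of the AHCI structures described above (assumptions noted).
#include <stdint.h>

struct ahci_prdt_entry {          // Physical Region Descriptor Table entry
    uint32_t dba;                 // data base address (low 32 bits)
    uint32_t dbau;                // data base address (upper 32 bits)
    uint32_t rsvd;
    uint32_t dbc;                 // bits 21:0 = byte count - 1, bit 31 = interrupt on completion
};

struct ahci_cmd_header {          // one of the 32 headers in a port's command list
    uint32_t flags;               // CFL, write bit, PRDTL in bits 31:16, ...
    uint32_t prdbc;               // bytes transferred
    uint32_t ctba;                // command table base address (low)
    uint32_t ctbau;               // command table base address (upper)
    uint32_t rsvd[4];
};

struct ahci_cmd_table {
    uint8_t  cfis[64];            // command FIS
    uint8_t  acmd[16];            // ATAPI command
    uint8_t  rsvd[48];
    struct ahci_prdt_entry prdt[8];   // PRDT; 8 entries is an arbitrary choice for the sketch
};

// Point a command's PRDT entries at GPU pinned pages so the SATA SSD DMAs
// directly to or from memory that is mapped to the GPU. Assumes nr_pages <= 8.
static void ndma_fill_prdt(struct ahci_cmd_header *hdr, struct ahci_cmd_table *tbl,
                           const uint64_t *gpu_pin_phys, uint32_t nr_pages,
                           uint32_t page_size)
{
    for (uint32_t i = 0; i < nr_pages; i++) {
        tbl->prdt[i].dba  = (uint32_t)(gpu_pin_phys[i] & 0xFFFFFFFFu);
        tbl->prdt[i].dbau = (uint32_t)(gpu_pin_phys[i] >> 32);
        tbl->prdt[i].rsvd = 0;
        tbl->prdt[i].dbc  = page_size - 1;           // byte count - 1
    }
    tbl->prdt[nr_pages - 1].dbc |= (1u << 31);       // interrupt on completion
    hdr->flags = (hdr->flags & 0x0000FFFFu) | (nr_pages << 16);   // update PRDTL
}
```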
14
An NVMMU example
Here is an NVMMU example with its programming model. The GPU application creates a file descriptor for the data movement and initializes NVMMU by calling nvmmuBegin(). It then calls nvmmuMove() to transfer data from the SSD to the GPU, and, if it needs to retrieve data, calls nvmmuMove() again to transfer data from the GPU back to the SSD. Finally, nvmmuEnd() releases all allocated memory.
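For comparison with the conventional example on slide 7, here is a minimal sketch of the same workload written against the NVMMU interface. Only the function names (nvmmuBegin, nvmmuMove, nvmmuEnd) come from the slide; their parameter lists, the direction flag, and the process kernel are assumptions made for illustration only.

```cuda
// The same workload sketched with the NVMMU interface named on the slide.
// The nvmmu* signatures below are assumed, not the published API.
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>

// Assumed declarations of the NVMMU interface (illustrative only).
extern "C" {
void nvmmuBegin(int fd, void **d_buf, size_t n);            // init NVMMU, allocate GPU pinned buffer
void nvmmuMove(int fd, void *d_buf, size_t n, int to_gpu);  // direct SSD <-> GPU transfer
void nvmmuEnd(int fd, void *d_buf);                         // release all allocated memory
}

__global__ void process(char *data, size_t n)               // placeholder GPU kernel
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] ^= 0x1;
}

int main(void)
{
    const size_t n = 64UL << 20;
    int fd = open("input.dat", O_RDWR);         // file descriptor for the data movement
    if (fd < 0) return 1;

    char *d_buf;
    nvmmuBegin(fd, (void **)&d_buf, n);         // set up NVMMU / NDMA buffer
    nvmmuMove(fd, d_buf, n, 1);                 // SSD -> GPU, no user-space staging

    process<<<(unsigned)((n + 255) / 256), 256>>>(d_buf, n);
    cudaDeviceSynchronize();

    nvmmuMove(fd, d_buf, n, 0);                 // GPU -> SSD
    nvmmuEnd(fd, d_buf);                        // release all allocated memory
    close(fd);
    return 0;
}
```

Compared with the conventional sketch on slide 7, the user-space staging buffer and the explicit cudaMemcpy calls disappear: data moves between the SSD and the GPU pinned memory in a single nvmmuMove() call per direction.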
15
Evaluation setup
System configuration: CPU: Intel Core i7; main memory: 16 GB DDR3; connection: PCIe lanes, SATA 3.0; SSD communication protocol: NVMe 1.1, AHCI.
Testbed characteristics:
Config    | Total Cost | Cap. (TB) | Device | Protocol | # of Devices | Failure Recovery
RAID-HDD  | $1085      | 2         | HDD    | SATA     | 4            | Yes
AHCI-SSD  | $194       | 0.25      | MLC    |          | 1            | No
PCIe-SSD  | $5990      | 0.36      |        | NVMe     |              |
RAID-SSD  | $1760      |           |        |          |              |
The host system configuration is shown in the table. We also compare the storage options listed above: the PCIe-SSD is much more expensive than the other options, and the RAID-based storage arrays provide an extra failure-recovery scheme.
16
Evaluation setup: memory management unit options:
NVMe-IOMMU: PCIe-SSD, using a conventional IOMMU.
AHCI-NVMMU: AHCI-SSD, using NVMMU.
NVMe-NVMMU: PCIe-SSD, using NVMMU.
RAID-NVMMU: RAID-SSD, using NVMMU without parity block pipelining.
RAID-NVMMU-P: RAID-SSD, using NVMMU with parity block pipelining.
We apply different memory management units to these storage devices. One thing to mention: since the RAID-based storage array has overhead for parity checking, we optimize its performance by overlapping the parity checks with CPU and GPU activity, which is the RAID-NVMMU-P option.
17
Data movement analysis
[Figure: data-movement speedup over NVMe-IOMMU (annotations: up to 248%, average 134%)] This slide shows the speedup in data movement compared to NVMe-IOMMU. 1. With the same PCIe device, applying our NVMMU improves performance by up to 248%. 2. The RAID system shows poor performance: it degrades by 50% and 25% for the Mars and Parboil workload sets, not only because of its lower raw performance but also because it suffers from parity-code checking. 3. Our optimized RAID system reduces the parity overhead and significantly improves total performance, by 218%, 16%, 74%, and 7%. 4. Finally, the RAID-based SSD can achieve performance similar to the NVMe SSD, but with lower power and cost.
18
Performance analysis
Total execution time shows the same behavior: compared to the NVMe SSD, the RAID-based AHCI SSD array provides similar performance, but with less power and cost.
19
Upload behavior analysis
In this slide, we analyze the data-movement behavior from the SSD to the GPU, using the performance disparity as the metric. 1. A traditional RAID of hard disk drives fails to reduce the throughput gap between the GPU and storage because of its low bandwidth, software-stack overhead, and parity-data penalty. 2. An AHCI-based SSD with the support of our NVMMU can significantly reduce the performance gap, by 35%, but due to its bandwidth limitation it still cannot meet our goal. 3. The AHCI-based SSD array further reduces the performance disparity, and with the parity-code optimization it achieves even better results: the disparity drops below 15% once the access granularity becomes larger. 4. The NVMe-based SSD practically eliminates the performance disparity when the block size is larger than 4 MB.
20
Power consumption analysis
Cost and power consumption determine whether a device is feasible to use. Since we have already given the cost of the different storage devices, we now analyze the power consumption of each device. 1. A conventional hard disk drive is a power-hungry device: a RAID-based hard disk array consumes around 16 watts. 2. In contrast, an SSD usually consumes much less power; the power consumed by an AHCI-based SSD does not even exceed 3 watts, because the SSD leverages a low-power interface. 3. Even though the AHCI-based SSDs show promising power consumption, the NVMe SSD, which relies on the high-performance PCIe interface, cannot achieve similar behavior: its power consumption is almost the same as that of the hard disk array. 4. The power consumption of the SSD array is more promising than that of the NVMe SSD; its maximum power consumption is less than 9 watts, which is an 80% reduction compared to the NVMe SSD. The parity-code optimization does not affect power consumption, because it does not change the memory accesses.
21
Conclusion: NVMMU addresses the performance disparity caused by file-resident data movement. A RAID-based SSD array is a promising solution for a heterogeneous GPU-SSD system. The benefits of employing a GPU come from its massive number of compute units (CUDA cores) and its ever-increasing memory bandwidth.
22
Thank you