StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems
NetSysLab, The University of British Columbia
Samer Al-Kiswany, with Abdullah Gharaibeh, Elizeu Santos-Neto, George Yuan, Matei Ripeanu

Computation Landscape (HPDC ‘08)
Recent GPUs dramatically change the computation cost landscape.
Figure: floating-point operations per second for the CPU and GPU (source: CUDA 1.1 Guide).
A quiet revolution (GPU vs. CPU):
- Computation: 367 vs. 32 GFLOPS, 128 vs. 4 cores
- Memory bandwidth: 86.4 vs. 8.4 GB/s
- Prices shown: $220, $290

Computation Landscape
Recent GPUs dramatically change the computation cost landscape. They are:
- Affordable
- Widely available in commodity desktops
- Built with 10s to 100s of cores (and can support 1000s of threads)
- Friendly to general-purpose programming

Exploiting GPUs’ Computational Power
Studies exploiting the GPU:
- Bioinformatics: [Liu06]
- Chemistry: [Vogt08]
- Physics: [Anderson08]
- And many more: [Owens07]
These studies report 4x to 50x speedups, but they cover mostly scientific and specialized applications.

Motivating Question
System design is a balancing act in a multi-dimensional space. For example: given certain objectives (say, job turnaround), minimize total system cost given component prices, I/O bottlenecks, bounds on storage and network traffic, energy consumption, etc.
Q: Does the 10x reduction in computation cost that GPUs offer change the way we design and implement (distributed) system middleware?

Distributed Systems: Computationally Intensive Operations
Operations:
- Hashing
- Erasure coding
- Encryption/decryption
- Compression
- Membership testing (Bloom filter)
These operations are computationally intensive and often avoided in existing systems.
Used in:
- Storage systems
- Security protocols
- Data dissemination techniques
- Virtual machine memory management
- And many more

Why Start with Hashing?
Popular, used in many situations:
- Similarity detection
- Content addressability
- Integrity
- Copyright infringement detection
- Load balancing

How Is Hashing Used in Similarity Detection? (ICDCS ‘08)
Figure: File A (blocks X, Y, Z) and File B (blocks W, Y, Z) are each hashed block by block. Only the first block is different.

How Is Hashing Used in Similarity Detection?
How to divide the file into blocks:
- Fixed-size blocks
- Content-based block boundaries

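The fixed-size option can be sketched in a few lines of Python. This is an illustrative CPU-only sketch, not StoreGPU code: the function names, the 4-byte block size, and the toy file contents are assumptions chosen to mirror the File A / File B example (X Y Z vs. W Y Z).

```python
import hashlib


def block_hashes(data, block_size):
    """Split data into fixed-size blocks and return the MD5 hex digest of each."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def similarity_ratio(old, new, block_size):
    """Fraction of blocks in `new` whose hash already appears in `old`."""
    known = set(block_hashes(old, block_size))
    new_hashes = block_hashes(new, block_size)
    if not new_hashes:
        return 0.0
    shared = sum(1 for h in new_hashes if h in known)
    return shared / len(new_hashes)


# Toy data (hypothetical): File A = X Y Z, File B = W Y Z.
file_a = b"X" * 4 + b"Y" * 4 + b"Z" * 4
file_b = b"W" * 4 + b"Y" * 4 + b"Z" * 4
ratio = similarity_ratio(file_a, file_b, block_size=4)
```

Here two of the three blocks of File B are already known from File A, so only the first block would need to be stored or transferred.
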
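Detecting Content-based Block Boundaries
Figure: a window of m bytes slides over File i, advancing by offset bytes. Each window is hashed, and k bits of the hash value are tested against 0 to decide where the blocks B1, B2, B3, B4 begin and end.
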
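A minimal Python sketch of this boundary test follows. It is an assumption-laden illustration rather than the StoreGPU implementation: it takes the low k bits of an MD5 digest as the bits to test, and the names `find_boundaries` and `split_blocks` are hypothetical. The parameters m and offset follow the slide's notation.

```python
import hashlib


def find_boundaries(data, m, offset, k):
    """Return the end positions of windows whose hash has its low k bits equal to 0."""
    mask = (1 << k) - 1
    boundaries = []
    # Slide an m-byte window over the data, advancing by `offset` bytes.
    for start in range(0, len(data) - m + 1, offset):
        digest = hashlib.md5(data[start:start + m]).digest()
        value = int.from_bytes(digest, "big")
        if value & mask == 0:  # assumption: test the low k bits
            boundaries.append(start + m)
    return boundaries


def split_blocks(data, boundaries):
    """Cut data at the given boundary positions."""
    blocks, prev = [], 0
    for b in boundaries:
        blocks.append(data[prev:b])
        prev = b
    if prev < len(data):
        blocks.append(data[prev:])
    return blocks


# Toy input (hypothetical); m = 20 and offset = 4 are illustrative values.
data = bytes(range(256)) * 4
blocks = split_blocks(data, find_boundaries(data, m=20, offset=4, k=3))
```

Each window is hashed independently of the others, which is why this workload (a large number of very small hashes) is a natural fit for thousands of GPU threads.
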
Hashing Use in Similarity Detection: Two Scenarios
I. Computing block hashes: hashing large blocks of data (100s of KB to 10s of MB).
II. Computing block boundaries: hashing a large number of small data blocks (a few bytes each).

StoreGPU
StoreGPU: a library that exploits GPUs to support distributed storage systems by offloading computationally intensive functions.
- StoreGPU v1.0 implements the hashing functions used in computing block hashes and block boundaries.
- One performance data point: in similarity detection, StoreGPU achieves an 8x speedup and 5x data compression for a checkpointing application.
- Implication: GPUs unleash a valuable set of optimization techniques into the high-performance systems design space, although GPUs were not designed with this usage in mind.

Outline
- GPU architecture
- GPU programming
- Typical application flow
- StoreGPU design
- Evaluation

NVIDIA CUDA GPU Architecture
SIMD architecture with four memories:
- Device (a.k.a. global): slow (high access latency), large (256 MB to 1 GB)
- Shared: fast (4 cycles access latency), small (16 KB)
- Texture: read only
- Constant: read only

GPU Programming
The NVIDIA CUDA programming model:
- Abstracts the GPU architecture
- Is an extension to the C programming language
- Adds compiler directives
- Provides a GPU-specific API (device properties, timing, memory management, etc.)
Programming is still challenging:
- Parallel programming is challenging
  - Extracting parallelism at large scale
  - Parallel programming (SIMD)
- Memory management
- Synchronization
- Immature debugging tools

Performance Tips
- Use 1000s of threads to make the best use of the GPU hardware.
  Challenge: limited shared memory and registers.
- Optimize the use of shared memory and registers.
  Challenge: shared memory is small and subject to bank conflicts.

Shared Memory Complications
Shared memory is organized into 16 banks of 1 KB each.
- Complication I: concurrent accesses to the same bank are serialized (a bank conflict), which slows execution.
- Complication II: banks are interleaved.
- Tip: assign different threads to different banks.

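The effect of interleaving can be illustrated with a small Python model. The 16 banks come from the slide; the 4-byte interleaving width is an assumption (the width is not legible in the source), and the helper names are hypothetical. The model is simplified: it counts accesses per bank and ignores cases where hardware can broadcast a single word to several threads.

```python
from collections import Counter

NUM_BANKS = 16   # from the slide
WORD_BYTES = 4   # assumption: consecutive 32-bit words map to consecutive banks


def bank_of(byte_address):
    """Bank that holds the word containing this shared-memory byte address."""
    return (byte_address // WORD_BYTES) % NUM_BANKS


def conflict_degree(byte_addresses):
    """Largest number of accesses landing in one bank (1 means conflict-free)."""
    counts = Counter(bank_of(a) for a in byte_addresses)
    return max(counts.values())


threads = range(16)

# Each thread reads the next consecutive word: one access per bank.
interleaved = [t * WORD_BYTES for t in threads]

# Each thread owns a contiguous 64-byte chunk: every chunk starts in bank 0.
chunked = [t * 64 for t in threads]

interleaved_degree = conflict_degree(interleaved)
chunked_degree = conflict_degree(chunked)
```

Under these assumptions the naive layout, where every thread works on its own contiguous chunk, sends all 16 accesses to the same bank, while the interleaved layout spreads them across all banks.
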
Execution Path on GPU: Data Processing Application
$T_{Total} = T_{Preprocessing} + T_{DataHtoG} + T_{Processing} + T_{DataGtoH} + T_{PostProc}$
1. Preprocessing
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing

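The five-stage model is simple enough to express directly in code. The sketch below is illustrative only: the stage keys are hypothetical names for the five terms, and the timings are made-up placeholder values, not measurements from the talk.

```python
STAGES = (
    "preprocessing",   # T_Preprocessing
    "data_h_to_g",     # T_DataHtoG: host-to-GPU transfer
    "processing",      # T_Processing: GPU kernel time
    "data_g_to_h",     # T_DataGtoH: GPU-to-host transfer
    "postprocessing",  # T_PostProc
)


def total_time(stage_ms):
    """T_Total as the sum of the five stage times."""
    return sum(stage_ms[s] for s in STAGES)


def breakdown(stage_ms):
    """Share of the total time spent in each stage."""
    total = total_time(stage_ms)
    return {s: stage_ms[s] / total for s in STAGES}


# Hypothetical timings in milliseconds (placeholders, not measured data).
example = {
    "preprocessing": 1.0,
    "data_h_to_g": 4.0,
    "processing": 3.0,
    "data_g_to_h": 1.0,
    "postprocessing": 1.0,
}
shares = breakdown(example)
```

A breakdown like this shows at a glance whether the kernel itself or the data transfers dominate, which is what motivates optimizations aimed at transfer cost and output size.
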
StoreGPU Design
I. Computing block hashes: hashing large blocks of data (100s of KB to 10s of MB).
II. Computing block boundaries: hashing a large number of small data blocks (a few bytes each).

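Computing Block Hash: Module Design
Figure: data flow between the host machine and the GPU. On the host: preprocessing of the input data, then data transfer in. On the GPU: data transfer to shared memory, processing of the input pieces (..., i-1, i, ...), and result transfer to global memory. Back on the host: result transfer out, then execution of the final hash to produce the output.
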
Computing Block Hash: Module Design
- The design is highly parallel.
- The last step runs on the CPU to avoid synchronization.
- The resulting hash is not compatible with standard MD5 and SHA1, but it is equally collision resistant [Damgard89].

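The shape of this design can be sketched on the CPU with Python's hashlib. This is an assumption-based illustration, not the StoreGPU code: the 64-byte chunk size and the function name are hypothetical, and a plain loop stands in for the GPU threads that would hash the chunks concurrently.

```python
import hashlib


def parallel_style_hash(block, chunk_size=64):
    """Hash each chunk independently, then hash the concatenated digests."""
    # Step 1: independent per-chunk hashes (the parallel, GPU-side part).
    partial = [
        hashlib.md5(block[i:i + chunk_size]).digest()
        for i in range(0, len(block), chunk_size)
    ]
    # Step 2: one final hash over the intermediate digests (the CPU-side part).
    return hashlib.md5(b"".join(partial)).hexdigest()


# Toy block (hypothetical contents).
block = bytes(range(256)) * 16
digest = parallel_style_hash(block)
standard = hashlib.md5(block).hexdigest()
```

The two digests differ for the same input, which is the incompatibility with standard MD5 noted above: both sides of a comparison must use the same chunked construction and the same chunk size.
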
Detecting Block Boundaries: Module Design
Figure: data flow between the host machine and the GPU. On the host: preprocessing of the input data, then data transfer in. On the GPU: data transfer to shared memory, processing, and result transfer to global memory. Back on the host: result transfer out to produce the output.

StoreGPU v1.0 Optimizations
- Optimized shared memory usage: the StoreGPU shared memory management mechanism assigns threads to different banks while providing a contiguous-space abstraction.
- Memory pinning
- Reduced output size

Outline
- GPU architecture
- GPU programming
- Typical application flow
- StoreGPU design
- StoreGPU v1.0 optimizations
- Evaluation

Evaluation
Testbed: a machine with
- CPU: Intel Core2 Duo 6600, 2 GB RAM (priced at $290)
- GPU: GeForce 8600 GTS (32 cores, 256 MB RAM, PCIe x16) (priced at $100)
Experiment space:
- GPU vs. a single CPU core
- MD5 and SHA1 implementations
- Three optimizations
- Block boundary detection configurations (m and offset)

Computing Block Hash
Figure: computing block hash, MD5.
Over 4x speedup in computing block hashes.

Computing Block Boundary
Figure: computing block boundary, MD5 (m = 20 bytes, offset = 4 bytes).
Over 8x speedup in detecting block boundaries.

Dissecting GPU Execution Time
$T_{Total} = T_{Preprocessing} + T_{DataHtoG} + T_{Processing} + T_{DataGtoH} + T_{PostProc}$
1. Preprocessing
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing
Figure: execution time breakdown for the MD5 computing-block-hashes module with all optimizations enabled.

Application Level Performance: Similarity Detection
Application: similarity detection between checkpoint images.
Data: checkpoints from BLAST (bioinformatics) collected using BLCR; checkpoint interval: 5 minutes.
Table: online similarity detection throughput (MBps, StoreGPU vs. standard) and similarity ratio detected (%), using MD5.
- Fixed size, compare by hash: 840 MBps with StoreGPU (speedup: 4.3x)
- Content based, compare by hash: 114 MBps with StoreGPU (speedup: 8.4x)
Implication: similarity detection can be used even on 10 Gbps setups.

Summary
StoreGPU:
- Offloads computationally intensive operations from the CPU
- Achieves considerable speedups
Contributions:
- Feasibility of using GPUs to support (distributed) middleware
- Performance model
- StoreGPU library
Implication: GPUs unleash a valuable set of optimization techniques into the high-performance systems design space.

Other GPU Applications
Exploring GPUs to support other middleware primitives:
- Hashing
- Erasure coding
- Encryption/decryption
- Compression
- Membership testing (Bloom filter)
Current NetSysLab GPU-related projects:
- Bloom filters (BloomGPU)
- Packet classification
- Medical imaging compression

Thank you
netsyslab.ece.ubc.ca

References
[Damgard89] Damgard, I. A Design Principle for Hash Functions. In Advances in Cryptology - CRYPTO, Lecture Notes in Computer Science, 1989.
[Liu06] Liu, W., et al. Bio-sequence database scanning on a GPU. In Parallel and Distributed Processing Symposium (IPDPS).
[Vogt08] Vogt, L., et al. Accelerating Resolution-of-the-Identity Second-Order Moller-Plesset Quantum Chemistry Calculations with Graphical Processing Units. J. Phys. Chem. A, 112 (10).
[Anderson08] Anderson, J. A., Lorenz, C. D., and Travesset, A. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics, 227 (10), 1 May 2008.
[Owens07] Owens, J. D., et al. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, (1).