StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems. NetSysLab, The University of British Columbia. Samer Al-Kiswany.


1 StoreGPU: Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems. NetSysLab, The University of British Columbia. Samer Al-Kiswany, with: Abdullah Gharaibeh, Elizeu Santos-Neto, George Yuan, Matei Ripeanu

2 Computation Landscape

Recent GPUs dramatically change the computation cost landscape.

[Chart: Floating-Point Operations per Second for the CPU and GPU. (Source: CUDA 1.1 Guide)]

A quiet revolution (GPU vs. CPU):
- Computation: 367 vs. 32 GFLOPS; 128 vs. 4 cores
- Memory bandwidth: 86.4 vs. 8.4 GB/s
- Price: $220 vs. $290

HPDC '08

3 Computation Landscape

Recent GPUs dramatically change the computation cost landscape. They are:
- Affordable
- Widely available in commodity desktops
- Equipped with 10s to 100s of cores (supporting 1000s of threads)
- Friendly to general-purpose programming

4 Exploiting GPUs' Computational Power

Studies exploiting the GPU:
- Bioinformatics: [Liu06]
- Chemistry: [Vogt08]
- Physics: [Anderson08]
- And many more: [Owens07]

These report 4x to 50x speedups, but mostly for scientific and specialized applications.

5 Motivating Question

System design is a balancing act in a multi-dimensional space: e.g., given certain objectives, say job turnaround time, minimize total system cost subject to component prices, I/O bottlenecks, bounds on storage and network traffic, energy consumption, etc.

Q: Does the 10x reduction in computation costs that GPUs offer change the way we design/implement (distributed) system middleware?

6 Distributed Systems' Computationally Intensive Operations

- Hashing
- Erasure coding
- Encryption/decryption
- Compression
- Membership testing (Bloom filter)

These operations are computationally intensive and often avoided in existing systems. They are used in:
- Storage systems
- Security protocols
- Data dissemination techniques
- Virtual machine memory management
- And many more

7 Why Start with Hashing?

Hashing is popular; it is used in many situations:
- Similarity detection
- Content addressability
- Integrity checking
- Copyright infringement detection
- Load balancing

8 How Hashing Is Used in Similarity Detection (ICDCS '08)

[Diagram: File A hashes to blocks X, Y, Z; File B hashes to blocks W, Y, Z. Only the first block is different.]
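The comparison on this slide can be illustrated with a minimal Python sketch (a CPU sketch, not StoreGPU's CUDA implementation): hash each fixed-size block with MD5 and count matching positions. The 4-byte block size and the toy file contents are illustrative only.

```python
import hashlib

def block_hashes(data: bytes, block_size: int = 4) -> list:
    """MD5 hash of each fixed-size block of `data`."""
    return [hashlib.md5(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def similarity(a: bytes, b: bytes, block_size: int = 4) -> float:
    """Fraction of block positions whose hashes match between two files."""
    ha, hb = block_hashes(a, block_size), block_hashes(b, block_size)
    matches = sum(x == y for x, y in zip(ha, hb))
    return matches / max(len(ha), len(hb))

# File A blocks: X, Y, Z; File B blocks: W, Y, Z -- only the first differs,
# so 2 of 3 block hashes match.
file_a = b"XXXXYYYYZZZZ"
file_b = b"WWWWYYYYZZZZ"
print(similarity(file_a, file_b, block_size=4))
```

Comparing short digests instead of the blocks themselves is what makes it cheap to ship only the differing blocks over the network.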

9 How Hashing Is Used in Similarity Detection

How to divide the file into blocks:
- Fixed-size blocks
- Content-based block boundaries

10 Detecting Content-Based Block Boundaries

[Diagram: an m-byte window slides over file i at a given offset; a hash is computed over each window, and a block boundary is declared where the low k bits of the hash value equal 0, yielding blocks B1, B2, B3, B4.]
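The boundary-detection scheme on this slide can be sketched in Python (again a CPU sketch, not the GPU module; the default m, offset, and k values are illustrative, not the paper's configuration):

```python
import hashlib

def find_boundaries(data: bytes, m: int = 20, offset: int = 4, k: int = 8):
    """Slide an m-byte window over `data` in steps of `offset` bytes;
    declare a block boundary wherever the low k bits of the window's
    hash are all zero."""
    boundaries = []
    for pos in range(0, max(len(data) - m, 0) + 1, offset):
        window = data[pos:pos + m]
        h = int.from_bytes(hashlib.md5(window).digest()[:8], "big")
        if h & ((1 << k) - 1) == 0:      # low k bits equal 0?
            boundaries.append(pos + m)   # boundary after this window
    return boundaries

data = bytes(range(256)) * 64            # 16 KB of sample data
cuts = find_boundaries(data)
blocks = [data[i:j] for i, j in zip([0] + cuts, cuts + [len(data)])]
assert b"".join(blocks) == data          # cuts partition the file
```

Because boundaries depend on content rather than position, inserting bytes into a file shifts only nearby boundaries, which is why this scheme detects more similarity than fixed-size blocking.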

11 Hashing Use in Similarity Detection: Two Scenarios

I. Computing block hashes: hashing large blocks of data (100s of KB to 10s of MB).
II. Computing block boundaries: hashing a large number of small data blocks (a few bytes each).

12 StoreGPU

StoreGPU: a library that exploits GPUs to support distributed storage systems by offloading their computationally intensive functions.

StoreGPU v1.0 implements the hashing functions used in computing block hashes and block boundaries.

One performance data point: in similarity detection, StoreGPU achieves 8x speedup and 5x data compression for a checkpointing application.

Implication: GPUs unleash a valuable set of optimization techniques into the high-performance systems design space, although GPUs have not been designed with this usage in mind.

13 Outline
- GPU architecture
- GPU programming
- Typical application flow
- StoreGPU design
- Evaluation

14 NVIDIA CUDA GPU Architecture

- SIMD architecture
- Four memories:
  - Device (a.k.a. global): slow access latency; large (256 MB to 1 GB)
  - Shared: fast (4-cycle access latency); small (16 KB)
  - Texture: read only
  - Constant: read only

15 GPU Programming

NVIDIA CUDA programming model:
- Abstracts the GPU architecture
- Is an extension to the C programming language
  - Compiler directives
  - GPU-specific APIs (device properties, timing, memory management, etc.)

Programming is still challenging:
- Parallel programming is challenging in itself
  - Extracting parallelism at large scale
  - SIMD-style parallel programming
- Memory management
- Synchronization
- Immature debugging tools

16 Performance Tips

- Use 1000s of threads to best use the GPU hardware
- Optimize the use of shared memory and registers
  - Challenge: shared memory and registers are limited
  - Challenge: shared memory is small and subject to bank conflicts

17 Shared Memory Complications

Shared memory is organized into 16 banks of 1 KB each, interleaved in 4-byte words.

- Complication I: concurrent accesses to the same bank are serialized (bank conflict), causing slowdown.
- Complication II: banks are interleaved.

Tip: assign different threads to different banks.

[Diagram: Bank 0, Bank 1, ..., with 4-byte interleaving.]
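The bank mapping above can be illustrated with a small sketch, assuming the G80-era parameters on this slide (16 banks, 4-byte interleaving): two threads conflict when their word addresses collide modulo 16.

```python
NUM_BANKS = 16   # G80-era shared memory: 16 banks
WORD_SIZE = 4    # banks interleaved at 4-byte granularity

def bank_of(byte_addr: int) -> int:
    """Bank that holds the 4-byte word at `byte_addr` in shared memory."""
    return (byte_addr // WORD_SIZE) % NUM_BANKS

# Conflict-free: 16 threads reading consecutive 4-byte words hit 16 banks.
assert len({bank_of(tid * WORD_SIZE) for tid in range(16)}) == 16

# 2-way conflict: a stride of two words maps 16 threads onto only 8 banks,
# so each access is serialized into two transactions.
assert len({bank_of(tid * 2 * WORD_SIZE) for tid in range(16)}) == 8
```

This is the conflict StoreGPU's shared memory management mechanism avoids by steering different threads to different banks.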

18 Execution Path on GPU: Data Processing Application

1. Preprocessing
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing

T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProcessing
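The five-phase breakdown can be captured in a toy cost model. All numbers below are illustrative assumptions (including the 4 GB/s host-to-GPU bandwidth), not measurements from the paper:

```python
def total_time(t_pre, t_h2g, t_proc, t_g2h, t_post):
    """T_Total = T_Preprocessing + T_DataHtoG + T_Processing
               + T_DataGtoH + T_PostProcessing"""
    return t_pre + t_h2g + t_proc + t_g2h + t_post

def transfer_time(n_bytes, gbytes_per_s=4.0):
    """Host<->GPU copy time at an assumed PCIe bandwidth (GB/s)."""
    return n_bytes / (gbytes_per_s * 1e9)

# Hypothetical phase times (seconds) for hashing a 96 MB input whose
# output (the digests) is only 16 KB:
size_in, size_out = 96 * 2**20, 16 * 2**10
t = total_time(0.002, transfer_time(size_in), 0.015,
               transfer_time(size_out), 0.001)
# Transfer-in dominates the output transfer because the input is far larger.
assert transfer_time(size_in) > transfer_time(size_out)
```

Splitting total time this way shows why data transfers, not just kernel time, bound achievable speedup.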

19 Outline
- GPU architecture
- GPU programming
- Typical application flow
- StoreGPU design
- Evaluation

20 StoreGPU Design

I. Computing block hashes: hashing large blocks of data (100s of KB to 10s of MB).
II. Computing block boundaries: hashing a large number of small data blocks (a few bytes each).

21 Computing Block Hash: Module Design

[Diagram: the input data is preprocessed on the host machine and transferred to the GPU; it is copied to shared memory and processed in parallel; results are transferred to global memory, then back to the host, where the final hash is executed.]

22 Computing Block Hash: Module Design

- The design is highly parallel
- The last step runs on the CPU to avoid synchronization
- The resulting hash is not compatible with standard MD5 and SHA1, but is equally collision resistant [Damgard89]
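The two-stage scheme (hash chunks independently in parallel, then hash the concatenated digests sequentially) can be sketched in Python, with a thread pool standing in for the GPU's parallel threads. The 64 KB chunk size and the use of MD5 for both stages are illustrative assumptions:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def two_stage_hash(data: bytes, chunk: int = 64 * 1024) -> str:
    """Hash fixed-size chunks independently (the data-parallel stage),
    then hash the concatenated digests in one final sequential step
    (done on the CPU to avoid cross-thread synchronization).
    The result differs from plain MD5 of `data`."""
    chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    with ThreadPoolExecutor() as pool:   # stand-in for GPU threads
        digests = list(pool.map(lambda c: hashlib.md5(c).digest(), chunks))
    return hashlib.md5(b"".join(digests)).hexdigest()

data = b"x" * (1 << 20)                  # 1 MB of sample data
assert two_stage_hash(data) != hashlib.md5(data).hexdigest()
```

Each chunk digest depends only on its own chunk, so the first stage needs no coordination between threads; only the short final combine is sequential.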

23 Detecting Block Boundaries: Module Design

[Diagram: the input data is preprocessed on the host machine and transferred to the GPU; each small window is copied to shared memory and hashed in parallel; results are transferred to global memory and then back to the host.]

24 StoreGPU v1.0 Optimizations

- Optimized shared memory usage: StoreGPU's shared memory management mechanism assigns threads to different banks while providing a contiguous-space abstraction.
- Memory pinning
- Reduced output size

25 Outline
- GPU architecture
- GPU programming
- Typical application flow
- StoreGPU design
- StoreGPU v1.0 optimizations
- Evaluation

26 Evaluation

Testbed: a machine with
- CPU: Intel Core2 Duo 6600, 2 GB RAM (priced at $290)
- GPU: GeForce 8600 GTS (32 cores, 256 MB RAM, PCIe x16) (priced at $100)

Experiment space:
- GPU vs. a single CPU core
- MD5 and SHA1 implementations
- Three optimizations
- Block-boundary detection configurations (m and offset)

27 Computing Block Hash

Over 4x speedup in computing block hashes.

[Chart: Computing Block Hash, MD5]

28 Computing Block Boundary

Over 8x speedup in detecting block boundaries.

[Chart: Computing Block Boundary, MD5; m = 20 bytes, offset = 4 bytes]

29 Dissecting GPU Execution Time

1. Preprocessing
2. Data transfer in (host to GPU)
3. GPU processing
4. Data transfer out (GPU to host)
5. Postprocessing

T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProcessing

30 Dissecting GPU Execution Time

T_Total = T_Preprocessing + T_DataHtoG + T_Processing + T_DataGtoH + T_PostProcessing

[Chart: time breakdown for the MD5 computing-block-hashes module with all optimizations enabled]

31 Application-Level Performance: Similarity Detection

Application: similarity detection between checkpoint images. Data: checkpoints from BLAST (bioinformatics), collected using BLCR with a checkpoint interval of 5 minutes.

Online similarity detection throughput and speedup using MD5 (StoreGPU vs. standard CPU implementation):
- Fixed-size Compare-by-Hash: 840 MBps, 4.3x speedup
- Content-based Compare-by-Hash: 114 MBps, 8.4x speedup

Implication: similarity detection can be used even on 10 Gbps setups!

32 Summary

StoreGPU:
- Offloads computationally intensive operations from the CPU
- Achieves considerable speedups

Contributions:
- Demonstrated the feasibility of using GPUs to support (distributed) middleware
- A performance model
- The StoreGPU library

Implication: GPUs unleash a valuable set of optimization techniques into the high-performance systems design space.

33 Other GPU Applications

Current NetSysLab GPU-related projects:
- Exploring GPUs to support other middleware primitives: Bloom filters (BloomGPU)
- Packet classification
- Medical imaging compression

34 Thank you

netsyslab.ece.ubc.ca

35 References

[Damgard89] Damgard, I. A Design Principle for Hash Functions. In Advances in Cryptology (CRYPTO), Lecture Notes in Computer Science, 1989.

[Liu06] Liu, W., et al. Bio-sequence database scanning on a GPU. In Parallel and Distributed Processing Symposium (IPDPS), 2006.

[Vogt08] Vogt, L., et al. Accelerating Resolution-of-the-Identity Second-Order Moller-Plesset Quantum Chemistry Calculations with Graphical Processing Units. J. Phys. Chem. A, 112(10), 2008.

[Anderson08] Anderson, J.A., Lorenz, C.D., and Travesset, A. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics, 227(10), May 2008.

[Owens07] Owens, J.D., et al. A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 26(1), 2007.