Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications
Matei Ripeanu, Networked Systems Laboratory (NetSysLab), University of British Columbia

1 Harvesting the Opportunity of GPU-Based Acceleration for Data-Intensive Applications Matei Ripeanu Networked Systems Laboratory (NetSysLab) University of British Columbia Joint work with: Abdullah Gharaibeh, Samer Al-Kiswany

2 A golf course … … a (nudist) beach (… and 199 days of rain each year) Networked Systems Laboratory (NetSysLab) University of British Columbia

3 Hybrid architectures in Top 500 [Nov’10]

4 Hybrid architectures
– High compute power / memory bandwidth
– Energy efficient [operated today at low overall efficiency]
Agenda for this talk
– GPU architecture intuition: what generates the above characteristics?
– Progress on efficiently harnessing hybrid (GPU-based) architectures

5 Acknowledgement: Slide borrowed from presentation by Kayvon Fatahalian

6–9 (image-only slides, no transcript text)

10 Acknowledgement: Slide borrowed from presentation by Kayvon Fatahalian

11 Acknowledgement: Slide borrowed from presentation by Kayvon Fatahalian

12 Idea #3: Feed the cores with data
The processing elements are data hungry!
=> Wide, high-throughput memory bus

13 Idea #4: Hide memory access latency
10,000x parallelism!
=> Hardware-supported multithreading

14 The Resulting GPU Architecture
[Diagram: multiprocessors 1..N, each with cores 1..M, per-core registers, an instruction unit, and shared memory; all multiprocessors access global, texture, and constant memory; the GPU connects to the host machine and host memory over PCIe.]
NVIDIA Tesla 2050:
– 448 cores
– Four 'memories':
  Shared: fast (4 cycles), small (48KB)
  Global: slow, large (up to 3GB), high throughput (150GB/s)
  Texture: read-only
  Constant: read-only
– Hybrid: PCIe x16, 4GB/s to host memory
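
A minimal CUDA sketch of how these memories are used in practice (illustrative only, not taken from the talk): each thread block stages a tile of data from the large-but-slow global memory into the small-but-fast shared memory before computing on it.

    #include <cuda_runtime.h>

    #define TILE 256   // threads per block = elements staged per tile

    // Each block copies one tile from global memory (large, slow, high
    // throughput) into shared memory (small, fast), then computes on it.
    __global__ void scale_tile(const float *in, float *out, int n, float k)
    {
        __shared__ float tile[TILE];            // per-multiprocessor shared memory
        int i = blockIdx.x * TILE + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = in[i];          // one coalesced global-memory read
        __syncthreads();                        // tile now visible to the whole block
        if (i < n)
            out[i] = k * tile[threadIdx.x];     // subsequent accesses hit fast memory
    }

(For a computation this trivial the staging buys nothing; the point is only the global-to-shared pattern.)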

15 GPU characteristics
– High peak compute power
– High peak memory bandwidth
– High host-device communication overhead
– Complex to program (SIMD, co-processor model)
– Limited memory space

16 Roadmap: Two Projects
StoreGPU
– Context: distributed storage systems
– Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?
MummerGPU++
– Context: porting a bioinformatics application (sequence alignment); a string-matching problem, data intensive (10^2 GB)
– Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

17 Computationally Intensive Operations in Distributed (Storage) Systems
Operations (computationally intensive; they limit performance):
– Hashing
– Erasure coding
– Encryption/decryption
– Membership testing (Bloom filter)
– Compression
Techniques they enable:
– Similarity detection (deduplication)
– Content addressability
– Security
– Integrity checks
– Redundancy
– Load balancing
– Summary cache
– Storage efficiency
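
As a concrete illustration of how the operations enable the techniques, the host-side sketch below (hypothetical code, not from the talk; names like put_block are made up) uses block hashes for content addressability: a block is written only if its hash is new, which is the deduplication path whose hashing step StoreGPU offloads to the GPU.

    #include <cstddef>
    #include <string>
    #include <unordered_map>
    #include <vector>

    using Block     = std::vector<char>;
    using BlockHash = std::string;   // stand-in for, e.g., a SHA-1 digest

    struct BlockStore {
        std::unordered_map<BlockHash, std::size_t> index;  // hash -> block id
        std::vector<Block> blocks;

        // Content-addressable write: hash the block, store it only if unseen.
        // Deduplication falls out of the hash lookup; the hash() call is the
        // computationally intensive step that is a candidate for GPU offload.
        std::size_t put_block(const Block& b, BlockHash (*hash)(const Block&)) {
            BlockHash h = hash(b);
            auto it = index.find(h);
            if (it != index.end())
                return it->second;               // duplicate: reuse existing copy
            blocks.push_back(b);
            index[h] = blocks.size() - 1;
            return blocks.size() - 1;
        }
    };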

18 Distributed Storage System Architecture
[Diagram: an application on the client uses MosaStore's FS API; the access module divides files into a stream of blocks (b1, b2, b3, …, bn); enabling operations (hashing, compression, encoding/decoding, encryption/decryption) run in an offloading layer that targets either the CPU or the GPU, supporting techniques that improve performance/reliability (deduplication, security, integrity checks, redundancy); the client interacts with a metadata manager and storage nodes.]

19
– GPU-accelerated deduplication: a design / prototype implementation that integrates similarity detection and GPU support
– End-to-end system evaluation: 2x throughput improvement for a realistic checkpointing workload

20 Challenges
– Integration challenges:
  Minimizing the integration effort
  Transparency
  Separation of concerns
– Extracting major performance gains:
  Hiding memory allocation overheads
  Hiding data transfer overheads
  Efficient utilization of the GPU memory units
  Use of multi-GPU systems
[Diagram: files divided into a stream of blocks (b1, b2, b3, …, bn) feed similarity detection, which offloads hashing to the GPU through the offloading layer.]

21 Hashing on GPUs
HashGPU¹: a library that exploits GPUs to support specialized use of hashing in distributed storage systems. [Diagram: HashGPU hashes a stream of blocks (b1, b2, b3, …, bn) on the GPU.]
One performance data point: accelerates hashing by up to 5x compared to a single-core CPU.
However, significant speedup is achieved only for large blocks (>16MB) => not suitable for efficient similarity detection.
¹ Exploiting Graphics Processing Units to Accelerate Distributed Storage Systems, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, G. Yuan, M. Ripeanu, HPDC'08
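
A minimal sketch of the data-parallel pattern HashGPU exploits (the toy FNV-1a hash below is for illustration only; HashGPU implements MD5/SHA-style functions): one GPU thread hashes one fixed-size block, so a whole stream of blocks is hashed in parallel.

    #include <stdint.h>

    // Toy non-cryptographic hash (FNV-1a); stands in for MD5/SHA in HashGPU.
    __device__ uint32_t fnv1a(const uint8_t *p, int len)
    {
        uint32_t h = 2166136261u;
        for (int i = 0; i < len; ++i)
            h = (h ^ p[i]) * 16777619u;
        return h;
    }

    // One thread per data block: thread t hashes block t of the stream.
    __global__ void hash_blocks(const uint8_t *data, int block_size,
                                int nblocks, uint32_t *digests)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t < nblocks)
            digests[t] = fnv1a(data + (size_t)t * block_size, block_size);
    }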

22 Profiling HashGPU
[Plot: at least 75% of the time is overhead.]
Amortizing memory allocation and overlapping data transfers with computation may bring important benefits.

23 CrystalGPU: a layer of abstraction that transparently enables common GPU optimizations.
[Diagram: the offloading layer now runs HashGPU on top of CrystalGPU, which manages the GPU.]
One performance data point: CrystalGPU can improve the speedup of hashing by more than 10x.

24 CrystalGPU: Opportunities and Enablers
– Opportunity: reusing GPU memory buffers.
  Enabler: a high-level memory manager
– Opportunity: overlapping communication and computation.
  Enabler: double buffering and asynchronous kernel launch (see the sketch below)
– Opportunity: multi-GPU systems (e.g., GeForce 9800 GX2 and GPU clusters).
  Enabler: a task queue manager
[Diagram: CrystalGPU's memory manager, task queue, and double buffering sit between HashGPU and the GPU.]
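
The overlap and buffer-reuse enablers can be sketched with standard CUDA streams (a simplified illustration of the double-buffering idea, not CrystalGPU's actual code; names such as process and run_chunks are made up): two pre-allocated pinned buffers alternate, so the transfer for chunk i+1 overlaps the kernel for chunk i, and no buffer is allocated inside the loop.

    #include <cstring>
    #include <cuda_runtime.h>

    __global__ void process(const char *in, char *out, size_t n);  // assumed kernel

    void run_chunks(const char *src, size_t nchunks, size_t chunk)
    {
        char *h[2], *d_in[2], *d_out[2];
        cudaStream_t s[2];
        for (int b = 0; b < 2; ++b) {                    // allocate once, reuse
            cudaMallocHost((void**)&h[b], chunk);        // pinned => async copies
            cudaMalloc((void**)&d_in[b], chunk);
            cudaMalloc((void**)&d_out[b], chunk);
            cudaStreamCreate(&s[b]);
        }
        for (size_t i = 0; i < nchunks; ++i) {
            int b = i & 1;                               // alternate the two buffers
            cudaStreamSynchronize(s[b]);                 // buffer b free (chunk i-2 done)
            memcpy(h[b], src + i * chunk, chunk);
            cudaMemcpyAsync(d_in[b], h[b], chunk,
                            cudaMemcpyHostToDevice, s[b]);
            process<<<(unsigned)((chunk + 255) / 256), 256, 0, s[b]>>>
                   (d_in[b], d_out[b], chunk);           // overlaps next chunk's copy
        }
        cudaDeviceSynchronize();
        // (device-to-host result copies and cleanup omitted for brevity)
    }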

25 HashGPU Performance on top of CrystalGPU
Baseline: single-core CPU.
The gains enabled by the three optimizations can be realized!

26 End-to-end system evaluation

27 End-to-End System Evaluation
– Testbed: four storage nodes and one metadata server; one client with a 9800 GX2 GPU
– Three configurations:
  No similarity detection (without-SD)
  Similarity detection on the CPU (4x 2.6GHz) (SD-CPU)
  Similarity detection on the GPU (9800 GX2) (SD-GPU)
– Three workloads:
  Real checkpointing workload
  Completely similar files: maximum gains in terms of data saving
  Completely different files: only overheads, no gains
– Success metrics:
  System throughput
  Impact on a competing application: compute- or I/O-intensive
A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10

28 System Throughput (Checkpointing Workload)
[Plot annotation: 1.8x improvement.]
The integrated system preserves the throughput gains on a realistic workload!

29 System Throughput (Synthetic Workload of Similar Files)
[Plot annotation: room for 2x improvement.]
Offloading to the GPU enables close-to-optimal performance!

30 Impact on a Competing (Compute-Intensive) Application
Workload: writing checkpoints back to back.
[Plot annotations: 2x improvement; 7% reduction.]
Frees resources (CPU) for competing applications while preserving throughput gains!

31 Summary

32 Distributed Storage System Architecture
[Diagram recap: application and MosaStore access module on the client; metadata manager; storage nodes.]

33 StoreGPU Summary
Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed storage) systems?
[Diagram recap: enabling operations (hashing, compression, encoding/decoding, encryption/decryption) offloaded from the CPU to the GPU support techniques that improve performance/reliability (deduplication, security, integrity checks, redundancy).]
Results so far:
– StoreGPU: a storage system prototype that offloads to the GPU
– An evaluation of the feasibility of GPU offloading and of its impact on competing applications

34 Roadmap: Two Projects
StoreGPU
– Context: distributed storage systems
– Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?
MummerGPU++
– Context: porting a bioinformatics application (sequence alignment); a string-matching problem, data intensive (10^2 GB)
– Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

35 Background: Sequence Alignment Problem
[Figure: many short queries (e.g., CCAT, GGCT, CGCCCTA, GCAATTT, ...) matched against a long reference sequence.]
Problem: find where each query most likely originated from.
– Queries: ~10^8 queries, each 10^1 to 10^2 symbols long
– Reference: 10^6 to … symbols long (up to ~400GB)

36 Sequence Alignment on GPUs
– MUMmerGPU [Schatz 07, Trapnell 09]: a GPU port of the sequence alignment tool MUMmer [Kurtz 04]
  Achieves good speedup compared to the CPU version
  Based on a suffix tree
  However, suffers from significant communication and post-processing overheads (> 50% overhead)
– MUMmerGPU++ [Gharaibeh 10]: uses a space-efficient data structure (though from a higher computational complexity class): the suffix array
  Achieves significant speedup compared to the suffix-tree-based GPU version
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing, Jan/Feb 2011

37 Speedup Evaluation
Workload: human, ~10M queries, ~30M reference length
[Plot: suffix tree vs. suffix array; over 60% improvement.]

38 Space/Time Trade-off Analysis

39 GPU Offloading: Addressing the Challenges
High-level algorithm (executed on the host):

    subrefs    = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets {
        results = NULL
        CopyToGPU(subqryset)
        foreach subref in subrefs {
            CopyToGPU(subref)
            MatchKernel(subqryset, subref)   // search-optimized data structure
            CopyFromGPU(results)             // output is compressed on the GPU
        }
        Decompress(results)                  // decompression runs on the CPU
    }

– Data-intensive problem and limited memory space => divide and compute in rounds; use search-optimized data structures
– Large output size => compressed output representation (decompressed on the CPU)
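
Filled in with the CUDA runtime API, one inner round of the loop above might look like the host-side sketch below (an illustration under assumed names and sizes, not the MUMmerGPU++ source): copy a sub-reference and query batch in, run the match kernel, and copy the compressed results back for CPU-side decompression.

    #include <cuda_runtime.h>

    // Assumed kernel: match a batch of queries against one sub-reference.
    __global__ void match_kernel(const char *subref, size_t ref_len,
                                 const char *qrys, size_t qry_bytes,
                                 int *results);

    // One (subqryset, subref) round; dividing into rounds keeps each piece
    // within the GPU's limited memory, as the pseudocode above dictates.
    void align_round(const char *h_ref, size_t ref_len,
                     const char *h_qrys, size_t qry_bytes,
                     int *h_results, size_t res_bytes)
    {
        char *d_ref, *d_qrys; int *d_res;
        cudaMalloc((void**)&d_ref, ref_len);
        cudaMalloc((void**)&d_qrys, qry_bytes);
        cudaMalloc((void**)&d_res, res_bytes);
        cudaMemcpy(d_ref, h_ref, ref_len, cudaMemcpyHostToDevice);
        cudaMemcpy(d_qrys, h_qrys, qry_bytes, cudaMemcpyHostToDevice);
        match_kernel<<<1024, 256>>>(d_ref, ref_len,      // grid size illustrative
                                    d_qrys, qry_bytes, d_res);
        cudaMemcpy(h_results, d_res, res_bytes, cudaMemcpyDeviceToHost);
        // results use a compressed representation; Decompress() runs on the CPU
        cudaFree(d_ref); cudaFree(d_qrys); cudaFree(d_res);
    }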

40 The Core Data Structure
Massive number of queries and a long reference => pre-process the reference into an index.
Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
– Search: O(qry_len) per query
– Space: O(ref_len), but the constant is high: ~20x ref_len
– Post-processing: DFS traversal for each query, O(4^(qry_len - min_match_len))

41 The Core Data Structure (cont.)
Massive number of queries and a long reference => pre-process the reference into an index.
Past work: build a suffix tree (MUMmerGPU [Schatz 07])
– Search: O(qry_len) per query
– Space: O(ref_len), but the constant is high: ~20x ref_len
– Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query
[The offloading pseudocode from slide 39, annotated: the on-GPU matching step is efficient; the copy and post-processing steps are expensive.]

42 A Better Matching Data Structure?
Suffix array example: 0 A$, 1 ACA$, 2 ACACA$, 3 CA$, 4 CACA$, 5 TACACA$

                  Suffix Tree                       Suffix Array
    Space         O(ref_len), ~20x ref_len          O(ref_len), ~4x ref_len
    Search        O(qry_len)                        O(qry_len x log ref_len)
    Post-process  O(4^(qry_len - min_match_len))    O(qry_len - min_match_len)

Impact 1: reduced communication; less data to transfer to the GPU.
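
To ground the Search row, a host-side sketch of suffix-array construction and lookup (illustrative only, not the MUMmerGPU++ implementation; the naive comparison-sort construction is fine for exposition): the array stores sorted suffix start positions (~4 bytes each, hence the ~4x ref_len space), and each lookup costs O(qry_len x log ref_len) via binary search.

    #include <algorithm>
    #include <string>
    #include <vector>

    // Build a suffix array: sort the start positions of all suffixes.
    std::vector<int> build_sa(const std::string& ref)
    {
        std::vector<int> sa(ref.size());
        for (int i = 0; i < (int)ref.size(); ++i) sa[i] = i;
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return ref.compare(a, std::string::npos,
                               ref, b, std::string::npos) < 0;
        });
        return sa;
    }

    // Find one occurrence of qry: O(qry_len * log ref_len) character work.
    int find(const std::string& ref, const std::vector<int>& sa,
             const std::string& qry)
    {
        auto it = std::lower_bound(sa.begin(), sa.end(), qry,
            [&](int pos, const std::string& q) {
                return ref.compare(pos, q.size(), q) < 0;  // prefix comparison
            });
        if (it != sa.end() && ref.compare(*it, qry.size(), qry) == 0)
            return *it;        // qry occurs at this position in ref
        return -1;             // not found
    }

On "TACACA$", build_sa yields essentially the ordering shown above (with the lone "$" suffix first).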

43 A Better Matching Data Structure
[Same suffix tree vs. suffix array comparison as on the previous slide.]
Impact 2: better data locality, achieved at the cost of additional per-thread processing time; space for longer sub-references => fewer processing rounds.

44 A Better Matching Data Structure
[Same suffix tree vs. suffix array comparison as above.]
Impact 3: lower post-processing overhead.

45 Evaluation

46 Evaluation Setup
– Testbed: low-end GeForce 9800 GX2 GPU (512MB); high-end Tesla C1060 (4GB)
– Baseline: suffix tree on the GPU (MUMmerGPU [Schatz 07, 09])
– Success metrics: performance; energy consumption
– Workloads (NCBI Trace Archive):

    Workload / Species             Reference length   # of queries   Avg. read length
    HS1: Human (chromosome 2)      ~238M              ~78M           ~200
    HS2: Human (chromosome 3)      ~100M              ~2M            ~700
    MONO: L. monocytogenes         ~3M                ~6M            ~120
    SUIS: S. suis                  ~2M                ~26M           ~36

47 Speedup: array-based over tree-based

48 Dissecting the Overheads
Workload: HS1, ~78M queries, ~238M reference length, on the GeForce
Consequences:
– Focus shifts to optimizing the compute stage
– Opportunity to exploit multi-GPU systems (as I/O is less of a bottleneck)

49 MummerGPU++ Summary
Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?
– The choice of data structure can be crucial when porting applications to the GPU
– A good matching data structure ensures:
  Low communication overhead
  Data locality (can be achieved at the cost of additional per-thread processing time)
  Low post-processing overhead

50 Hybrid platforms will gain wider adoption.
Unifying theme: making the use of hybrid architectures (e.g., GPU-based platforms) simple and effective.
– StoreGPU. Motivating question: Does the 10x lower computation cost offered by GPUs change the way we design (distributed) systems?
– MummerGPU++. Motivating question: How should one design/port applications to efficiently exploit GPU characteristics?

51 Code, benchmarks and papers available at: netsyslab.ece.ubc.ca

52 Projects at netsyslab.ece.ubc.ca
Accelerated storage systems
– A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
– On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08
Porting applications to efficiently exploit GPU characteristics
– Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
– Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February 2011
Middleware runtime support to simplify application development
– CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, Technical Report
GPU-optimized building blocks: data structures and libraries
– Hashing, Bloom filters, suffix arrays