1 Harvesting the Opportunity of GPU-based Acceleration
Matei Ripeanu, Networked Systems Laboratory (NetSysLab), University of British Columbia
Joint work with Abdullah Gharaibeh and Samer Al-Kiswany

2 A golf course … a (nudist) beach (… and 199 days of rain each year)
Networked Systems Laboratory (NetSysLab), University of British Columbia

3 Hybrid architectures in Top 500 [Nov’10]

4 Hybrid architectures
– High compute power / memory bandwidth
– Energy efficient [operated today at low efficiency]
Agenda for this talk
– GPU architecture intuition: what generates the above characteristics?
– Progress on efficiently harnessing hybrid (GPU-based) architectures

5 Acknowledgement: slide borrowed from a presentation by Kayvon Fatahalian

6–9 [image-only slides]

10 Acknowledgement: slide borrowed from a presentation by Kayvon Fatahalian

11 Acknowledgement: slide borrowed from a presentation by Kayvon Fatahalian

12 Idea #3: feed the cores with data
The processing elements are data hungry! → a wide, high-throughput memory bus

13 Idea #4: hide memory access latency → hardware-supported multithreading (10,000x parallelism!)
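As an illustration of this idea (not part of the original slides), here is a minimal CUDA sketch: a memory-bound kernel launched with millions of threads, far more than the number of cores, so the warp scheduler always has ready work while other warps wait on global-memory loads. The kernel name and sizes are illustrative assumptions.

    #include <cuda_runtime.h>

    // Memory-bound kernel: each thread performs one load-multiply-add-store.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 24;                // ~16M elements, ~16M threads requested
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        cudaMemset(x, 0, n * sizeof(float));
        cudaMemset(y, 0, n * sizeof(float));

        // Oversubscribe the hardware: tens of thousands of 256-thread blocks.
        // Whenever a warp stalls on a memory access, another runs in its place.
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        saxpy<<<blocks, threads>>>(n, 2.0f, x, y);
        cudaDeviceSynchronize();

        cudaFree(x);
        cudaFree(y);
        return 0;
    }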

14 The Resulting GPU Architecture
[diagram: host machine and host memory attached to the GPU over a PCIe x16 link (~4 GB/s); the GPU contains N multiprocessors, each with M cores, per-core registers, an instruction unit, and shared memory, backed by global, texture, and constant memories]
nVidia Tesla 2050: 448 cores, four 'memories':
– Shared: fast (~4 cycles), small (48KB)
– Global: slow (hundreds of cycles), large (up to 3GB), high throughput (150GB/s)
– Texture: read only
– Constant: read only
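To make the memory hierarchy concrete, here is a minimal CUDA sketch (again an illustration, with assumed names and sizes): a per-block sum reduction that reads the large, slow global memory once per thread and does the rest of its work in the small, fast shared memory.

    #include <cuda_runtime.h>

    // Per-block sum: one coalesced global-memory load per thread, then a
    // tree reduction entirely in on-chip shared memory (~4-cycle access).
    __global__ void blockSum(const float *in, float *out, int n) {
        __shared__ float tile[256];            // the fast 'shared' memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int tid = threadIdx.x;

        tile[tid] = (i < n) ? in[i] : 0.0f;    // global -> shared, once
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                tile[tid] += tile[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            out[blockIdx.x] = tile[0];         // one global store per block
    }

    int main() {
        const int n = 1 << 20, threads = 256;
        const int blocks = (n + threads - 1) / threads;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, blocks * sizeof(float));
        cudaMemset(in, 0, n * sizeof(float));
        blockSum<<<blocks, threads>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in);
        cudaFree(out);
        return 0;
    }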

15 GPUs offer different characteristics
+ High peak compute power
+ High peak memory bandwidth
– High host-device communication overhead
– Complex to program
– Limited memory space

16 Projects at NetSysLab
Porting applications to efficiently exploit GPU characteristics
– Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
– Accelerating Sequence Alignment on Hybrid Architectures, A. Gharaibeh, M. Ripeanu, Scientific Computing Magazine, January/February
Middleware runtime support to simplify application development
– CrystalGPU: Transparent and Efficient Utilization of GPU Power, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, TR
GPU-optimized building blocks: data structures and libraries
– GPU Support for Batch Oriented Workloads, L. Costa, S. Al-Kiswany, M. Ripeanu, IPCCC'09
– Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10
– A GPU Accelerated Storage System, A. Gharaibeh, S. Al-Kiswany, M. Ripeanu, HPDC'10
– On GPU's Viability as a Middleware Accelerator, S. Al-Kiswany, A. Gharaibeh, E. Santos-Neto, M. Ripeanu, JoCC'08

17 Motivating question: how should we design applications to efficiently exploit GPU characteristics?
Context: a bioinformatics problem, sequence alignment
– A string matching problem
– Data intensive (~10^2 GB)
Size Matters: Space/Time Tradeoffs to Improve GPGPU Applications Performance, A. Gharaibeh, M. Ripeanu, SC'10

18 Past work: sequence alignment on GPUs
MUMmerGPU [Schatz 07, Trapnell 09]:
– A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
– ~4x (end-to-end) speedup compared to the CPU version
Hypothesis: a mismatch between the core data structure (suffix tree) and GPU characteristics
[chart: overhead (%), showing that more than 50% of runtime is overhead]

19 Idea: trade off time for space
– Use a space-efficient data structure (though from a higher computational-complexity class): the suffix array
– 4x speedup compared to the suffix tree-based GPU version
Consequences:
– Significant overhead reduction
– Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
– Focus shifts towards optimizing the compute stage

20 Outline for the rest of this talk
– Sequence alignment: background and offloading to the GPU
– Space/time trade-off analysis
– Evaluation

21 Background: the sequence alignment problem
[figure: many short queries (e.g., CCAT, GGCT, CGCCCTA, GCAATTT) matched against a long reference ...CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG...]
Problem: find where each query most likely originated from
– Queries: ~10^8 queries, 10^1 to 10^2 symbols per query
– Reference: 10^6 to ~10^11 symbols (up to ~400GB)

22 GPU offloading: opportunity and challenges
Opportunity
– Sequence alignment: easy to partition, memory intensive
– GPU: massively parallel, high memory bandwidth
Challenges
– Data intensive
– Large output size
– Limited memory space
– No direct access to other I/O devices (e.g., disk)

23 GPU offloading: addressing the challenges
High-level algorithm (executed on the host):

    subrefs = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets {
        results = NULL
        CopyToGPU(subqryset)
        foreach subref in subrefs {
            CopyToGPU(subref)
            MatchKernel(subqryset, subref)
            CopyFromGPU(results)
        }
        Decompress(results)
    }

– Data-intensive problem, limited memory space → divide and compute in rounds, using search-optimized data structures
– Large output size → compressed output representation (decompressed on the CPU)
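The pseudocode above maps directly onto the CUDA host API. Below is a hedged sketch of such a round-based loop, assuming synchronous copies; Chunk, matchKernel, the grid size, and the buffer sizes are illustrative assumptions, not the actual MUMmerGPU implementation.

    #include <vector>
    #include <cuda_runtime.h>

    // Hypothetical stand-ins for the slide's pseudocode.
    struct Chunk { const char *data; size_t len; };

    __global__ void matchKernel(const char *qrys, size_t qlen,
                                const char *ref, size_t rlen,
                                unsigned *results) {
        // matching logic omitted in this sketch
    }

    void alignAll(const std::vector<Chunk> &subqrysets,
                  const std::vector<Chunk> &subrefs,
                  size_t maxQry, size_t maxRef, size_t resultSlots) {
        char *dQry, *dRef; unsigned *dRes;
        cudaMalloc(&dQry, maxQry);                   // device buffers sized
        cudaMalloc(&dRef, maxRef);                   // for the largest chunk
        cudaMalloc(&dRes, resultSlots * sizeof(unsigned));
        std::vector<unsigned> results(resultSlots);

        for (const Chunk &q : subqrysets) {          // compute in rounds:
            cudaMemcpy(dQry, q.data, q.len, cudaMemcpyHostToDevice);
            for (const Chunk &r : subrefs) {         // one sub-reference at a time
                cudaMemcpy(dRef, r.data, r.len, cudaMemcpyHostToDevice);
                matchKernel<<<1024, 256>>>(dQry, q.len, dRef, r.len, dRes);
                cudaMemcpy(results.data(), dRes,
                           resultSlots * sizeof(unsigned),
                           cudaMemcpyDeviceToHost);
            }
            // decompress(results);  // CPU-side expansion of the compact output
        }
        cudaFree(dQry); cudaFree(dRef); cudaFree(dRes);
    }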

24 Space/Time Trade-off Analysis

25 The core data structure
A massive number of queries and a long reference => pre-process the reference into an index
Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
– Search: O(qry_len) per query
– Space: O(ref_len), but the constant is high, ~20 x ref_len
– Post-processing: a DFS traversal per query, O(4^(qry_len - min_match_len))

26 The core data structure (cont.)
The same costs mapped onto the high-level host algorithm: the suffix tree makes the in-kernel search efficient, but the data-transfer steps (CopyToGPU, CopyFromGPU) and the CPU post-processing (Decompress) expensive.

27 A better matching data structure?
[figure: suffix array for TACACA$: 0 A$, 1 ACA$, 2 ACACA$, 3 CA$, 4 CACA$, 5 TACACA$]

                  Suffix Tree                       Suffix Array
    Space         O(ref_len), 20 x ref_len          O(ref_len), 4 x ref_len
    Search        O(qry_len)                        O(qry_len x log ref_len)
    Post-process  O(4^(qry_len - min_match_len))    O(qry_len - min_match_len)

Impact 1: reduced communication (less data to transfer)
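For concreteness, a minimal host-side sketch of the suffix-array approach (an illustration, not the talk's GPU code): a naive builder plus the O(qry_len x log ref_len) binary search. A real system would use a linear-time construction algorithm and run the search inside the GPU kernel.

    #include <algorithm>
    #include <string>
    #include <vector>

    // Naive suffix-array construction: sort all suffix start positions by
    // lexicographic order of the suffixes. Slow, but fine for a sketch.
    std::vector<int> buildSuffixArray(const std::string &ref) {
        std::vector<int> sa(ref.size());
        for (size_t i = 0; i < ref.size(); ++i)
            sa[i] = (int)i;
        std::sort(sa.begin(), sa.end(), [&](int a, int b) {
            return ref.compare(a, std::string::npos,
                               ref, b, std::string::npos) < 0;
        });
        return sa;
    }

    // Exact-match search: O(qry_len x log ref_len). Each binary-search probe
    // compares up to qry_len symbols of one suffix against the query.
    int findMatch(const std::string &ref, const std::vector<int> &sa,
                  const std::string &qry) {
        int lo = 0, hi = (int)sa.size() - 1;
        while (lo <= hi) {
            int mid = lo + (hi - lo) / 2;
            int c = ref.compare(sa[mid], qry.size(), qry);
            if (c == 0)
                return sa[mid];      // query occurs at this reference offset
            if (c < 0)
                lo = mid + 1;
            else
                hi = mid - 1;
        }
        return -1;                   // no exact occurrence
    }

    int main() {
        std::string ref = "TACACA$";
        std::vector<int> sa = buildSuffixArray(ref);   // {6,5,3,1,4,2,0}
        return findMatch(ref, sa, "CAC") == 2 ? 0 : 1; // "CAC" starts at offset 2
    }

For TACACA$ the builder yields the ordering shown on the slide (A$, ACA$, ACACA$, CA$, CACA$, TACACA$), preceded by the sentinel-only suffix $.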

28 A better matching data structure (same comparison)
Impact 2: better data locality, achieved at the cost of additional per-thread processing time
– Space for longer sub-references => fewer processing rounds

29 A better matching data structure (same comparison)
Impact 3: lower post-processing overhead

30 Evaluation

31 Evaluation setup
Workloads (NCBI Trace Archive):

    Workload / Species            Reference length   # of queries   Avg. read length
    HS1 - Human (chromosome 2)    ~238M              ~78M           ~200
    HS2 - Human (chromosome 3)    ~100M              ~2M            ~700
    MONO - L. monocytogenes       ~3M                ~6M            ~120
    SUIS - S. suis                ~2M                ~26M           ~36

Testbed
– Low-end GeForce 9800 GX2 GPU (512MB)
– High-end Tesla C1060 (4GB)
Baseline: suffix tree on GPU (MUMmerGPU [Schatz 07, 09])
Success metrics
– Performance
– Energy consumption

32 Speedup: array-based over tree-based

33 Dissecting the overheads
[chart: runtime breakdown for workload HS1 (~78M queries, ~238M reference length) on the GeForce]
Significant reduction in data transfers and post-processing

34 Comparing with CPU performance
[charts: suffix tree and suffix array speedups over the single-core CPU baseline]

35 Summary
– GPUs have drastically different performance characteristics
– Reconsidering the choice of data structure is necessary when porting applications to the GPU
– A good matching data structure ensures:
  – Low communication overhead
  – Data locality (possibly achieved at the cost of additional per-thread processing time)
  – Low post-processing overhead

36 Code, benchmarks and papers available at: netsyslab.ece.ubc.ca