Size Matters: Space/Time Tradeoffs to Improve GPGPU Application Performance
Abdullah Gharaibeh, Matei Ripeanu
NetSysLab, The University of British Columbia

Presentation transcript:

2 GPUs offer different characteristics
• High peak compute power
• High communication overhead
• High peak memory bandwidth
• Limited memory space
Implication: careful tradeoff analysis is needed when porting applications to GPU-based platforms.

3 Motivating Question: How should we design applications to efficiently exploit GPU characteristics?
Context: a bioinformatics problem, sequence alignment
• A string matching problem
• Data intensive (10^2 GB)

4 Past work: sequence alignment on GPUs
MUMmerGPU [Schatz 07, Trapnell 09]:
• A GPU port of the sequence alignment tool MUMmer [Kurtz 04]
• ~4x end-to-end speedup compared to the CPU version
Hypothesis: there is a mismatch between the core data structure (suffix tree) and GPU characteristics.
Figure: overhead (%), annotated "> 50%"

5 Idea: trade off time for space
• Use a space-efficient data structure (though from a higher computational complexity class): the suffix array
• 4x speedup compared to the suffix tree-based version on the GPU
Consequences:
• Opportunity to exploit multi-GPU systems, as I/O is less of a bottleneck
• Focus shifts towards optimizing the compute stage
Figure annotation: significant overhead reduction

6 Outline
• Sequence alignment: background and offloading to GPU
• Space/Time trade-off analysis
• Evaluation

7 Background: sequence alignment problem
Find where each query most likely originated from.
• Queries
  • 10^8 queries
  • 10^1 to 10^2 symbols per query
• Reference
  • 10^6 to ... symbols in length
Figure: short queries (CCAT, GGCT, CGCCCTA, GCAATTT, GCGG, ...) mapped onto the reference ..CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG..

8 GPU Offloading: opportunity and challenges
Opportunity
• Sequence alignment: easy to partition, memory intensive
• GPU: massively parallel, high memory bandwidth
Challenges
• Sequence alignment: data intensive, large output size
• GPU: limited memory space, no direct access to other I/O devices (e.g., disk)

9 GPU Offloading: addressing the challenges
• Data-intensive problem and limited memory space → divide and compute in rounds
• Large output size → compressed output representation (decompress on the CPU)
High-level algorithm (executed on the host):

    subrefs = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets {
        results = NULL
        CopyToGPU(subqryset)
        foreach subref in subrefs {
            CopyToGPU(subref)
            MatchKernel(subqryset, subref)
            CopyFromGPU(results)
        }
        Decompress(results)
    }
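The divide-and-compute-in-rounds loop above can be sketched on the CPU as follows. This is only an illustration under stated assumptions, not the authors' implementation: `match_kernel` is a plain-Python stand-in for the GPU kernel, it reports exact full-query matches only (no minimum match length or compressed output), and the overlap between consecutive sub-references is an assumption added here so that matches spanning a chunk boundary are not lost. All function names are hypothetical.

```python
def divide_ref(ref, chunk_len, overlap):
    """Yield (offset, chunk) pairs; consecutive chunks share `overlap` symbols."""
    step = chunk_len - overlap
    offset = 0
    while True:
        yield offset, ref[offset:offset + chunk_len]
        if offset + chunk_len >= len(ref):
            break
        offset += step


def divide_qrys(qrys, batch_size):
    """Yield (index of first query, batch of queries)."""
    for first in range(0, len(qrys), batch_size):
        yield first, qrys[first:first + batch_size]


def match_kernel(subqryset, subref):
    """CPU stand-in for the GPU kernel: exact match positions per query."""
    hits = []
    for qry in subqryset:
        found = []
        pos = subref.find(qry)
        while pos != -1:
            found.append(pos)
            pos = subref.find(qry, pos + 1)
        hits.append(found)
    return hits


def align_in_rounds(ref, qrys, chunk_len, batch_size):
    """Match every query against every sub-reference, one round at a time."""
    # Assumption: overlapping by (longest query - 1) keeps boundary matches.
    overlap = max(len(q) for q in qrys) - 1
    assert chunk_len > overlap, "sub-reference must be longer than the overlap"
    results = [set() for _ in qrys]
    for first, subqryset in divide_qrys(qrys, batch_size):
        for offset, subref in divide_ref(ref, chunk_len, overlap):
            for i, found in enumerate(match_kernel(subqryset, subref)):
                # Translate chunk-local positions back to reference positions.
                results[first + i].update(offset + p for p in found)
    return [sorted(r) for r in results]


ref = "CCATAGGCTATATGCGCCCTATCGGCAATTTGCGGCG"
qrys = ["GGCT", "ATAT", "TTTGCGG", "CCTA"]
positions = align_in_rounds(ref, qrys, chunk_len=10, batch_size=2)
```

The number of inner iterations is (number of query batches) x (number of sub-references), which is why a smaller index, and therefore longer sub-references, reduces the number of rounds and transfers.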

10 Space/Time Trade-off Analysis

11 The core data structure
Massive number of queries and a long reference => pre-process the reference into an index.
Past work: build a suffix tree (MUMmerGPU [Schatz 07, 09])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high: ~20 x ref_len
• Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query

12 The core data structure
Massive number of queries and a long reference => pre-process the reference into an index.
Past work: build a suffix tree (MUMmerGPU [Schatz 07])
• Search: O(qry_len) per query
• Space: O(ref_len), but the constant is high: ~20 x ref_len
• Post-processing: O(4^(qry_len - min_match_len)), DFS traversal per query

    subrefs = DivideRef(ref)
    subqrysets = DivideQrys(qrys)
    foreach subqryset in subqrysets {
        results = NULL
        CopyToGPU(subqryset)
        foreach subref in subrefs {
            CopyToGPU(subref)
            MatchKernel(subqryset, subref)
            CopyFromGPU(results)
        }
        Decompress(results)
    }

Figure annotations on the algorithm: Expensive, Efficient

13 A better matching data structure

                Suffix Tree                       Suffix Array
Space           O(ref_len), 20 x ref_len          O(ref_len), 4 x ref_len
Search          O(qry_len)                        O(qry_len x log ref_len)
Post-process    O(4^(qry_len - min_match_len))    O(qry_len - min_match_len)

Suffix array example (sorted suffixes of TACACA$): 0 A$, 1 ACA$, 2 ACACA$, 3 CA$, 4 CACA$, 5 TACACA$
Impact 1: reduced communication (less data to transfer)
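To make the suffix array row of the table concrete, here is a minimal sketch that builds the array for the slide's example string and searches it by binary search. It assumes a naive sort-based construction, which is fine for a toy string but not how a large reference would be indexed, and the function names are hypothetical. Each of the O(log ref_len) binary-search steps compares up to qry_len symbols, which gives the O(qry_len x log ref_len) search cost in the table.

```python
def build_suffix_array(ref):
    """Start positions of all suffixes of `ref`, in lexicographic order."""
    return sorted(range(len(ref)), key=lambda i: ref[i:])


def find_occurrences(ref, sa, qry):
    """All positions where `qry` occurs in `ref`, via two binary searches."""
    m = len(qry)

    # Lower bound: first suffix whose m-symbol prefix is >= qry.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] < qry:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-symbol prefix is > qry.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if ref[sa[mid]:sa[mid] + m] <= qry:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


ref = "TACACA"
sa = build_suffix_array(ref)
# Sorted suffixes with the terminator appended, as listed on the slide.
suffixes = [ref[i:] + "$" for i in sa]
```

The index stores one integer per reference symbol, which is consistent with the 4 x ref_len space figure if 4-byte integers are assumed.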

14 A better matching data structure
(Same suffix tree vs. suffix array comparison as on slide 13.)
Impact 2: better data locality is achieved at the cost of additional per-thread processing time
Space for longer sub-references => fewer processing rounds
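A back-of-the-envelope calculation shows how the space constants translate into processing rounds. The sketch below rests on assumptions that are not in the slides: "20 x ref_len" and "4 x ref_len" are read as bytes of index per reference symbol, and the entire GPU memory is treated as available for the index (in practice queries and results also need space, so real round counts would be higher). The helper name is hypothetical; the reference length and memory size are the HS1 workload and the 512MB GeForce from the evaluation setup.

```python
def rounds_needed(ref_len, index_bytes_per_symbol, gpu_memory_bytes):
    """Sub-references needed so that each index fits in GPU memory."""
    index_bytes = ref_len * index_bytes_per_symbol
    # Ceiling division using integers only.
    return -(-index_bytes // gpu_memory_bytes)


HS1_REF_LEN = 238_000_000          # ~238M symbols (HS1 workload)
GEFORCE_MEMORY = 512 * 1024 ** 2   # 512MB GeForce 9800 GX2

tree_rounds = rounds_needed(HS1_REF_LEN, 20, GEFORCE_MEMORY)   # suffix tree
array_rounds = rounds_needed(HS1_REF_LEN, 4, GEFORCE_MEMORY)   # suffix array
```

Under these assumptions the suffix tree needs 9 sub-references per query batch and the suffix array needs 2, so each query batch is matched against far fewer sub-references.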

15 A better matching data structure
(Same suffix tree vs. suffix array comparison as on slide 13.)
Impact 3: lower post-processing overhead

16 Evaluation

17 Evaluation setup
• Testbed
  • Low-end GeForce 9800 GX2 GPU (512MB)
  • High-end Tesla C1060 (4GB)
• Baseline: suffix tree on GPU (MUMmerGPU [Schatz 07, 09])
• Success metrics
  • Performance
  • Energy consumption
• Workloads (NCBI Trace Archive, ...)

Workload / Species             Reference sequence length   # of queries   Average read length
HS1 - Human (chromosome 2)     ~238M                       ~78M           ~200
HS2 - Human (chromosome 3)     ~100M                       ~2M            ~700
MONO - L. monocytogenes        ~3M                         ~6M            ~120
SUIS - S. suis                 ~2M                         ~26M           ~36

18 Speedup: array-based over tree-based

19 Dissecting the overheads
Significant reduction in data transfers and post-processing
Workload: HS1, ~78M queries, ~238M ref. length, on GeForce

20 Summary
• GPUs have drastically different performance characteristics
• Reconsidering the choice of data structure is necessary when porting applications to the GPU
• A well-matched data structure ensures:
  • Low communication overhead
  • Data locality (possibly at the cost of additional per-thread processing time)
  • Low post-processing overhead

21 Code available at: netsyslab.ece.ubc.ca