CUDA Application - Case Study
GPU Workshop, 2015/10/24

Introduction (1)
The fast-increasing power of the GPU (Graphics Processing Unit) and its streaming architecture open up a range of new possibilities for a variety of applications. Previous work on GPGPU (General-Purpose computation on GPUs) has shown the design and implementation of algorithms for non-graphics applications, e.g. scientific computing, computational geometry, image processing, and bioinformatics.

Introduction (2)
Some bioinformatics applications have been successfully ported to GPGPU in the past. Liu et al. (IPDPS 2006) implemented the Smith-Waterman algorithm on the nVidia GeForce 6800 GTO and GeForce 7800 GTX and reported an approximately 16× speedup, obtained by computing the alignment scores of multiple cells simultaneously. Charalambous et al. (LNCS 2005) ported an expensive loop from RAxML, an application for phylogenetic tree construction, and achieved a 1.2× speedup on the nVidia GeForce 5700 LE.

Introduction (3)
Liu et al. (IEEE TPDS 2007) presented a GPGPU approach to high-performance biological sequence alignment based on commodity PC graphics hardware, implemented in C++ and the OpenGL Shading Language (GLSL). It covers pairwise sequence alignment (the Smith-Waterman algorithm, used to scan a database) and multiple sequence alignment (MSA). [Figures from Liu et al. TPDS 2007]

[Figure from Liu et al. TPDS 2007]

[Figures from Liu et al. TPDS 2007; score-only, no traceback]

[Figures from Liu et al. TPDS 2007]

Introduction (4)
CUDA (Compute Unified Device Architecture) is an extension of C/C++ that enables users to write scalable multi-threaded programs for CUDA-enabled GPUs. A CUDA program consists of a sequential host part and one or more parallel parts, called kernels, that run on the GPU. A kernel has access to several memory spaces (capacities typical of G80-class hardware):
readable and writable global memory (e.g. 1 GB);
readable and writable per-thread local memory (16 KB per thread);
read-only constant memory (64 KB, cached);
read-only texture memory (the size of global memory, cached);
readable and writable per-block shared memory (16 KB per block);
readable and writable per-thread registers (8192 per multiprocessor).
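To make these memory spaces concrete, here is a minimal, self-contained sketch of a CUDA program with a sequential host part and one kernel; the names and sizes are illustrative and not taken from any of the cited papers.

    #include <cuda_runtime.h>

    // Read-only constant memory: cached, visible to all threads.
    __constant__ float c_scale = 2.0f;

    // Kernel: a parallel part of the program. Each thread reads one element
    // from global memory, stages it in per-block shared memory, and writes
    // a scaled copy back to global memory.
    __global__ void scaleArray(const float *in, float *out, int n)
    {
        __shared__ float tile[256];                      // per-block shared memory
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // held in registers
        if (i < n) {
            tile[threadIdx.x] = in[i];                   // global -> shared
            out[i] = c_scale * tile[threadIdx.x];        // shared -> global
        }
    }

    int main()
    {
        const int n = 1024;
        float *d_in, *d_out;
        cudaMalloc(&d_in, n * sizeof(float));            // readable/writable global memory
        cudaMalloc(&d_out, n * sizeof(float));
        // Sequential host part: configure and launch the kernel grid.
        scaleArray<<<n / 256, 256>>>(d_in, d_out, n);
        cudaDeviceSynchronize();
        cudaFree(d_in);
        cudaFree(d_out);
        return 0;
    }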

[Figure: the CUDA memory model. The host launches a grid of thread blocks; each thread has its own registers and per-thread local memory, each block has shared memory, and all threads access global, constant, and texture memory. © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign]

Introduction (5)
Several bioinformatics applications have now been successfully ported to CUDA. Smith-Waterman algorithm (goal: scanning a database): Manavski and Valle (BMC Bioinformatics 2008), Ligowski and Rudnicki (University of Warsaw), Striemer and Akoglu (IPDPS 2009), and Liu et al. (BMC Research Notes 2009). Multiple sequence alignment (ClustalW): Liu et al. (ASAP 2009), with neighbor-joining tree construction as in Liu et al. (IPDPS 2009). Pattern matching (MUMmer): Schatz et al. (BMC Bioinformatics 2007).

Smith-Waterman algorithm (1)
Manavski and Valle presented the first solution based on commodity graphics hardware that efficiently computes the exact Smith-Waterman alignment. It runs from 2 to 30 times faster than any previous implementation on general-purpose hardware. [Figure from Schatz et al. BMC Bioinformatics 2007]

Smith-Waterman algorithm (2)
A query profile, laid out parallel to the query sequence for each possible residue, is pre-computed. The CUDA implementation makes each GPU thread compute the whole alignment of the query sequence against one database sequence (the database sequences are pre-ordered as a function of their length). The ordered database is stored in global memory, while the query profile is saved in texture memory. For each alignment, the matrix is computed column by column, in order parallel to the query sequence, and the current column is stored in the thread's local memory.
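The mapping just described can be sketched as a score-only, affine-gap Smith-Waterman kernel in which one thread aligns the query against one database sequence, computing the matrix column by column and keeping the current column in per-thread arrays (which the compiler places in local memory). This is a hypothetical reconstruction of the scheme, not Manavski and Valle's actual code; residues are assumed pre-encoded as integers in 0..ALPHA-1, and MAX_Q bounds the query length.

    #define MAX_Q      512   // maximum query length; column buffers live in local memory
    #define GAP_OPEN    10
    #define GAP_EXTEND   2
    #define ALPHA       24   // residue alphabet size (residues pre-encoded as 0..ALPHA-1)

    __global__ void swScanKernel(const char *db, const int *dbOffset, const int *dbLen,
                                 const char *query, int qlen,
                                 const int *subst,      // ALPHA x ALPHA substitution scores
                                 int *bestScore, int nSeqs)
    {
        int s = blockIdx.x * blockDim.x + threadIdx.x;   // one thread = one subject sequence
        if (s >= nSeqs) return;

        int H[MAX_Q + 1], E[MAX_Q + 1];                  // one matrix column per thread
        for (int i = 0; i <= qlen; ++i) H[i] = E[i] = 0;

        const char *subj = db + dbOffset[s];
        int best = 0;
        for (int j = 0; j < dbLen[s]; ++j) {             // one column per subject residue
            int diag = 0, F = 0;
            for (int i = 1; i <= qlen; ++i) {            // walk down the column
                int e = max(E[i] - GAP_EXTEND, H[i] - GAP_OPEN);      // gap along subject
                F     = max(F    - GAP_EXTEND, H[i - 1] - GAP_OPEN);  // gap along query
                int h = max(0, diag + subst[query[i - 1] * ALPHA + subj[j]]);
                h = max(h, max(e, F));
                diag = H[i];                             // previous column, same row
                H[i] = h;
                E[i] = e;
                best = max(best, h);
            }
        }
        bestScore[s] = best;                             // score only, no traceback
    }

Because the database is sorted by length, the threads of a warp iterate over roughly the same number of columns and finish together.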

[Figure from Schatz et al. BMC Bioinformatics 2007; score-only, no traceback]

[Figure from Schatz et al. BMC Bioinformatics 2007]

Smith-Waterman algorithm (3)
Striemer and Akoglu further studied the effect of memory organization and of the instruction set architecture on GPU performance. They noted that, in both the single- and dual-GPU configurations, Manavski enlists the help of an Intel quad-core processor by distributing the workload among the GPU(s) and the CPU cores. They also pointed out that the query profile in Manavski's method has a major drawback in its use of the GPU's texture memory: since the profile is larger than the 8 KB texture cache, it leads to unnecessary cache misses. Long sequences pose a further problem.

Smith-Waterman algorithm (4)
Instead, they placed the substitution matrix in constant memory to exploit the constant cache, and created an efficient cost function to access it (the modulo operator is extremely inefficient on CUDA, so a hash function is avoided). For this to work, the substitution matrix has to be re-arranged in alphabetical order. They mapped the query sequence as well as the substitution matrix into constant memory. They pointed out that the main drawback of the GPU is its limited on-chip memory, so data placement needs to be designed carefully.
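The placement can be sketched as follows; this is an illustrative reconstruction, not Striemer and Akoglu's code, and the identifiers are hypothetical.

    #include <cuda_runtime.h>

    #define ALPHA 26                          // 'A'..'Z', matrix in alphabetical order
    __constant__ char c_query[1024];          // query sequence in constant memory
    __constant__ int  c_subst[ALPHA * ALPHA]; // substitution matrix in constant memory

    // Cheap cost function: because the matrix is laid out alphabetically,
    // the score is a direct 2D index -- no modulo operator, no hash function.
    __device__ __forceinline__ int score(char q, char s)
    {
        return c_subst[(q - 'A') * ALPHA + (s - 'A')];
    }

    // Host side: upload both once, before launching the scan kernel.
    void uploadConstants(const int *hostSubst, const char *hostQuery, int qlen)
    {
        cudaMemcpyToSymbol(c_subst, hostSubst, sizeof(int) * ALPHA * ALPHA);
        cudaMemcpyToSymbol(c_query, hostQuery, qlen);
    }

The 26×26 integer matrix and the query together occupy only a few kilobytes, well within the 64 KB constant space, and constant-cache hits avoid the texture-cache misses of the oversized query profile.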

[Figure from Striemer and Akoglu IPDPS 2009]

Smith-Waterman algorithm (5)
Liu et al. proposed CUDASW++, implemented in two versions: a single-GPU version and a multi-GPU version. The alignment matrix can be computed in minor-diagonal order, from the top-left corner to the bottom-right corner. Treating the optimal local alignment of a query sequence and a subject sequence as one task, two mappings are possible (contrasted in the sketch below). Inter-task parallelization: each task is assigned to exactly one thread, and dimBlock tasks are performed in parallel by the different threads of a thread block. Intra-task parallelization: each task is assigned to one thread block, and all dimBlock threads in the thread block cooperate to perform the task in parallel.
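The sketch below contrasts the two launch configurations with stub kernels; it is illustrative only and not CUDASW++ source code.

    #include <cuda_runtime.h>

    // Inter-task: one thread per task, so each block carries dimBlock
    // independent alignments.
    __global__ void swInterTask(int nTasks /* , sequence arguments... */)
    {
        int task = blockIdx.x * blockDim.x + threadIdx.x;
        if (task >= nTasks) return;
        // ...this single thread computes the whole alignment matrix of its task...
    }

    // Intra-task: one block per task; all dimBlock threads cooperate on one
    // alignment matrix, sweeping it in minor-diagonal order and calling
    // __syncthreads() between successive diagonals (diagonal d depends only
    // on diagonals d-1 and d-2).
    __global__ void swIntraTask(/* sequence arguments... */)
    {
        int task = blockIdx.x;
        // ...blockDim.x threads each handle a stripe of cells per diagonal...
        (void)task;
    }

    int main()
    {
        const int nTasks = 10000, dimBlock = 256;
        swInterTask<<<(nTasks + dimBlock - 1) / dimBlock, dimBlock>>>(nTasks);
        swIntraTask<<<nTasks, dimBlock>>>();
        cudaDeviceSynchronize();
        return 0;
    }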

Smith-Waterman algorithm (6)
Inter-task parallelization occupies more device memory but achieves better performance than intra-task parallelization. Intra-task parallelization occupies significantly less device memory and can therefore support longer query/subject sequences, which is why CUDASW++ uses a two-stage implementation. To achieve high efficiency with inter-task parallelization, the runtime of all threads in a thread block should be roughly identical, so the database sequences are re-ordered by length.

Smith-Waterman algorithm (7)
Coalesced subject sequence arrangement: the sorted subject sequences for the intra-task parallelization are stored sequentially in an array, row by row, from the top-left corner to the bottom-right corner. A hash table records each sequence's location coordinates in the array and its length, providing fast access to any sequence. Coalesced global memory access: during execution of the SW algorithm, additional memory is required to store intermediate alignment data. A prerequisite for coalescing is that the words accessed by all threads in a half-warp lie in the same memory segment.
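One common way to satisfy that prerequisite is an interleaved layout in which element i of thread t lives at index i * nThreads + t, so consecutive threads always touch consecutive words. The following kernel is a hypothetical sketch of such a layout, not the paper's exact scheme.

    // Intermediate alignment vectors for all threads share one global array.
    // With the interleaved layout, at every step i the addresses touched by
    // threads t, t+1, t+2, ... are adjacent, so the accesses of a half-warp
    // fall in one segment and coalesce into a single transaction.
    __global__ void initInterleaved(int *Hbuf, int nThreads, int qlen)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        if (t >= nThreads) return;
        for (int i = 0; i <= qlen; ++i)
            Hbuf[i * nThreads + t] = 0;   // row i of thread t's column buffer
    }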

Smith-Waterman algorithm (8)
Cell block division method: to maximize performance and reduce the bandwidth demand on global memory, they propose a cell block division method for the inter-task parallelization, in which the alignment matrix is divided into cell blocks of equal size (sequences are padded with an appropriate number of dummy symbols). Constant memory is exploited to store the gap penalties, the scoring matrix, and the query sequence; in their implementation, sequences of length up to 59K can be supported.

[Benchmarks from Liu et al. BMC Research Notes 2009, on a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card]

Multiple sequence alignment
Liu et al. presented MSA-CUDA, a parallel MSA program that parallelizes all three stages of the ClustalW processing pipeline using CUDA. Pairwise distance computation consists of a forward score-only pass using the Smith-Waterman (SW) algorithm, a reverse score-only pass using the SW algorithm, and a traceback computation pass using the Myers-Miller algorithm, for which they developed a new stack-based iterative implementation because CUDA does not support recursion (a sketch of this transformation follows below); the score-only passes follow the work in Liu et al. (BMC Research Notes 2009). Neighbor-joining tree construction is done as in the work of Liu et al. (IPDPS 2009). Progressive alignment is conducted iteratively, in a multi-pass way.
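The recursion-to-stack transformation can be sketched generically as below. This is an illustrative reconstruction, not MSA-CUDA's code; findCrossing is a hypothetical stub standing in for the forward and reverse score-only passes that Myers-Miller uses to locate where the optimal path crosses the middle row.

    #define STACK_MAX 64                  // ample: depth grows as O(log n)

    struct Job { int i0, j0, i1, j1; };   // sub-rectangle of the DP matrix

    // Hypothetical helper: in Myers-Miller the optimal crossing column of
    // row imid is found by a forward and a reverse score-only pass; stubbed.
    __device__ int findCrossing(const Job &job, int imid)
    {
        return (job.j0 + job.j1) / 2;
    }

    __device__ void tracebackIterative(Job root)
    {
        Job stack[STACK_MAX];             // explicit stack replaces recursion
        int top = 0;
        stack[top++] = root;
        while (top > 0) {
            Job job = stack[--top];
            if (job.i1 - job.i0 <= 1 || job.j1 - job.j0 <= 1) {
                // base case: align the thin strip directly
                continue;
            }
            int imid = (job.i0 + job.i1) / 2;
            int jmid = findCrossing(job, imid);
            // push the two halves instead of recursing into them
            stack[top++] = Job{imid, jmid, job.i1, job.j1};
            stack[top++] = Job{job.i0, job.j0, imid, jmid};
        }
    }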

[Figure from Liu et al. ASAP 2009]

Pattern matching (1)
Schatz et al. proposed MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program (exact sequence alignment) that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the then-new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree.

[Figure from Schatz et al. BMC Bioinformatics 2007]

Pattern matching (2)
First, a suffix tree of the reference sequence is constructed on the CPU using Ukkonen's algorithm and transferred to the GPU. So that the reference suffix tree, query sequences, and output buffers all fit on the GPU, MUMmerGPU builds k smaller suffix trees from overlapping segments of the reference. Each suffix tree is "flattened" into two 2D textures, the node texture and the child texture. The queries are read from disk in blocks that fill the remaining memory, concatenated into a single large buffer (separated by null characters), and transferred to the GPU; an auxiliary 1D array, also transferred to the GPU, stores the offset of each query in the query buffer (a packing sketch follows below). The query sequences are then aligned to the tree on the GPU using the alignment algorithm.
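The query packing step might look like the following host-side sketch; packAndUploadQueries is a hypothetical helper, not MUMmerGPU's actual code.

    #include <string>
    #include <vector>
    #include <cuda_runtime.h>

    // Concatenate all queries into one buffer, separated by null characters,
    // and record each query's starting offset in an auxiliary 1D array;
    // both are then transferred to the GPU.
    void packAndUploadQueries(const std::vector<std::string> &queries,
                              char **d_buf, int **d_offsets)
    {
        std::vector<char> buf;
        std::vector<int>  offsets;
        for (const std::string &q : queries) {
            offsets.push_back((int)buf.size());        // start of this query
            buf.insert(buf.end(), q.begin(), q.end());
            buf.push_back('\0');                       // null separator
        }
        cudaMalloc(d_buf, buf.size());
        cudaMemcpy(*d_buf, buf.data(), buf.size(), cudaMemcpyHostToDevice);
        cudaMalloc(d_offsets, offsets.size() * sizeof(int));
        cudaMemcpy(*d_offsets, offsets.data(), offsets.size() * sizeof(int),
                   cudaMemcpyHostToDevice);
    }
    // On the device, the thread handling query t starts at d_buf + d_offsets[t].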

[Figure from Schatz et al. BMC Bioinformatics 2007]

Pattern matching (3)
Each multiprocessor on the GPU is assigned a subset of the queries to process in parallel, depending on the number of multiprocessors and processors available. A data reordering scheme attempts to increase the cache hit rate for a single thread. Alignment results are temporarily written to the GPU's memory and then transferred in bulk to host RAM once the alignment kernel has completed for all queries; the alignments are printed by the CPU.

[Results from Schatz et al. BMC Bioinformatics 2007. The time for building the suffix tree, reading queries from disk, and printing alignment output is the same regardless of whether MUMmerGPU runs on the CPU or the GPU.]

[Figure from Schatz et al. BMC Bioinformatics 2007]