
Barzan Shkeh

Outline
- Introduction
- Massive multithreading
- GPGPU
- CUDA memory types
- CUDA C/C++ programming
- CUDA in Bioinformatics

Introduction
Today, science and technology are inextricably linked. Human insight in bioinformatics, in particular, is driven by the vast amount of data that can be collected, together with sufficient computational capability to extract, analyze, model, and visualize results.

Introduction cont.
For example, the Harvard Connectome project is creating a complete "wiring diagram" of a rat brain at 3 nm/pixel resolution using automated slicing and data-collection instruments.

Introduction cont.
Many important problems have remained intractable because no computer was powerful enough, or because scientists simply could not afford access to machines with the necessary capabilities. The current revolution in scientific computation is happening because intense computation in computer graphics, driven mainly by the computer gaming industry, has evolved graphics processors into extremely capable yet low-cost general-purpose computation platforms.

Introduction cont.
GPGPU stands for general-purpose graphics processing unit. Many scientists and programmers, using existing tools, are able to achieve one to two orders of magnitude (10x-100x) of performance increase over conventional hardware when running their applications on GPGPUs.

Massive Multithreading
Massive multithreading is the key to harnessing the computational power of GPGPUs because it provides a common paradigm that both programmers and hardware designers can exploit to attain the highest possible performance. It permits graphics processors to achieve extremely high floating-point performance because the latency of memory accesses can be hidden and the full bandwidth of the memory subsystem can be utilized.

Massive Multithreading cont.
Roughly speaking, graphics processors can be considered "streaming processors" because the best performance is achieved when coalesced memory operations are used to simultaneously stream data from all of the on-board graphics memory banks. A coalesced memory operation combines simultaneous memory accesses by multiple threads into a single memory transaction.
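
To make the idea concrete, here is a minimal sketch of our own (not from the original slides) contrasting a coalesced access pattern, where consecutive threads touch consecutive addresses, with a strided one that breaks into many transactions:

__global__ void coalesced(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];              // consecutive threads read consecutive words:
}                                // the warp's 32 loads coalesce into few transactions

__global__ void strided(float *out, const float *in, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    out[i] = in[i];              // accesses are scattered across memory:
}                                // up to one transaction per thread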

GPGPU

GPGPU cont.
GPU hardware effectively evolved into a single-program, multiple-data (SPMD) architecture. NVIDIA generally bundles 32 threads into a warp, which runs in single-instruction, multiple-data (SIMD) fashion on each streaming multiprocessor (SM).
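
Because a warp executes one instruction for all 32 threads at a time, a data-dependent branch forces the warp through both paths serially (divergence). A hypothetical sketch with made-up kernel names:

__global__ void divergent(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0)              // even and odd lanes of the same warp disagree,
        x[i] *= 2.0f;            // so the warp executes both branches in turn
    else
        x[i] += 1.0f;
}

__global__ void uniform(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0)       // whole warps take the same path: no divergence
        x[i] *= 2.0f;
    else
        x[i] += 1.0f;
}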

CUDA (Compute Unified Device Architecture)
CUDA is a parallel computing platform and programming model created by NVIDIA. The CUDA platform is accessible to software developers through CUDA-accelerated libraries, compiler directives, and extensions to industry-standard programming languages, including C and C++.

CUDA Memory Types
- Global memory (read and write): slow, but cached
- Texture memory (read only): cache optimized for 2D access patterns
- Constant memory: where constants and kernel arguments are stored
- Shared memory (48 KB per SM): fast, but subject to bank conflicts
- Local memory: used for whatever doesn't fit into registers; part of global memory, slow but now cached
- Registers: 32-bit registers per SM; the fastest memory

CUDA Memory Types cont.
Using registers:
- Registers are read/write per-thread; you can't access registers outside of each thread
- Used for storing local variables in functions, etc.
- No special syntax for doing this: just declare local variables as usual
- Physically stored in each multiprocessor
- Can't be indexed (no arrays)
- Obviously, can't be accessed from host code
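
A minimal sketch of our own (the kernel name and parameters are made up): the temporaries below live in registers simply because they are ordinary local variables:

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // i and tmp are ordinary local
    if (i < n) {                                    // variables, held in per-thread
        float tmp = a * x[i];                       // registers; no keyword needed
        y[i] = tmp + y[i];
    }
}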

CUDA Memory Types cont.
Local memory:
- Also read/write per-thread; a thread can't read other threads' local memory
- There is no declaration keyword in CUDA C: per-thread arrays that the compiler cannot keep in registers (e.g. float results[32];) are placed in local memory automatically
- Can be indexed (this is where local arrays go)
- Much slower than register memory, so don't use local arrays if you don't have to
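
A hypothetical sketch: because registers cannot be indexed at runtime, the dynamically indexed per-thread array below is typically placed in local memory (here, in is assumed to hold 32 values per thread):

__global__ void gather(const float *in, const int *idx, float *out, int n) {
    float buf[32];                                  // per-thread array
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int k = 0; k < 32; ++k)
        buf[k] = in[32 * i + k];                    // each thread fills its own copy
    out[i] = buf[idx[i] % 32];                      // runtime index: buf can't stay in
}                                                   // registers, so it spills to local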

CUDA Memory Types cont.
Shared memory:
- Read/write per-block
- All threads in a block share the same memory
- In general, pretty fast, though certain access patterns can hinder performance

CUDA Memory Types cont.
Using shared memory:
- Declared much like local memory, but with the extern __shared__ qualifier when the size is set at launch time: extern __shared__ float current_row[];
- Only declare one variable this way: multiple extern __shared__ declarations occupy the same memory space!
  extern __shared__ float a[];
  extern __shared__ float b[];
  b[0] = 0.5f; // now a[0] == 0.5f also!
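
A minimal sketch of our own using a statically sized shared array (it assumes blocks of exactly 256 threads, a power of two): each block stages its slice of the input in shared memory, synchronizes, and reduces it to one partial sum:

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float buf[256];                      // one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                                // make all stores visible block-wide
    for (int s = blockDim.x / 2; s > 0; s /= 2) {   // tree reduction in shared memory
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];                   // one partial sum per block
}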

CUDA Memory Types cont.
Global memory:
- Read/write per-application
- Can be shared between blocks and grids
- Persistent across kernel executions
- Uncached on early GPUs (newer architectures add L1/L2 caching), and in any case much slower than on-chip memory
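
A quick sketch of that persistence (kernel names are ours): a second launch sees the values the first one wrote, with no copy in between:

#include <cstdio>

__global__ void produce(int *buf) { buf[threadIdx.x] = threadIdx.x; }
__global__ void consume(int *buf) { buf[threadIdx.x] *= 2; }

int main(void) {
    int *d_buf;
    cudaMalloc(&d_buf, 32 * sizeof(int));
    produce<<<1, 32>>>(d_buf);   // the writes survive the kernel's completion...
    consume<<<1, 32>>>(d_buf);   // ...so the next launch can read and update them
    int h[32];
    cudaMemcpy(h, d_buf, sizeof(h), cudaMemcpyDeviceToHost);
    printf("%d\n", h[5]);        // prints 10
    cudaFree(d_buf);
    return 0;
}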

CUDA Memory Types cont.
Constant memory:
- Read-only from the device
- Cached in each SM; the cache can broadcast to every running thread, which is very efficient
- Keyword: __constant__
- Accessed from device code like normal variables
- Values are set from host code with cudaMemcpyToSymbol
- Can't use pointers; can't dynamically allocate
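
A minimal sketch of this pattern (the polynomial and its coefficient values are made up): every thread reads the same coefficients, so the broadcast cache is a good fit:

__constant__ float coeffs[4];                       // lives in constant memory

__global__ void poly(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                                      // Horner evaluation; all threads
        y[i] = coeffs[0] + x[i] * (coeffs[1]        // read the same four values
             + x[i] * (coeffs[2] + x[i] * coeffs[3]));
}

// Host side:
// float h_coeffs[4] = {1.0f, 0.5f, 0.25f, 0.125f};
// cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));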

CUDA Memory Types cont.
Texture memory:
- Read-only from the device
- Complex 2D caching method
- Linear filtering/interpolation available
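
As a hedged illustration only (the texture API has changed across CUDA versions; this sketch uses the texture-object API, and all names are ours): a 2D float texture is set up on the host and sampled with hardware interpolation on the device:

__global__ void sample2D(cudaTextureObject_t tex, float *out, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < w && y < h)
        out[y * w + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f); // sample texel centers
}

// Host-side setup: copy data into a cudaArray and wrap it in a texture object.
cudaTextureObject_t makeTexture(const float *host, int w, int h, cudaArray_t *arr) {
    cudaChannelFormatDesc fmt = cudaCreateChannelDesc<float>();
    cudaMallocArray(arr, &fmt, w, h);
    cudaMemcpy2DToArray(*arr, 0, 0, host, w * sizeof(float),
                        w * sizeof(float), h, cudaMemcpyHostToDevice);

    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = *arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;       // out-of-range reads clamp to edge
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModeLinear;           // hardware bilinear interpolation
    td.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, NULL);
    return tex;
}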

CUDA C/C++ programming

Heterogeneous Computing
Terminology:
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)

Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

This is standard C that runs on the host. The NVIDIA compiler (nvcc) can be used to compile programs with no device code.

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$

Hello World! with Device Code

#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

The CUDA C/C++ keyword __global__ indicates a function that:
- Runs on the device
- Is called from host code

nvcc separates source code into host and device components:
- Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
- Host functions (e.g. main()) are processed by a standard host compiler (gcc, cl.exe)

Memory Management
- Host and device memory are separate entities
- Device pointers point to GPU memory: they may be passed to/from host code, but may not be dereferenced in host code
- Host pointers point to CPU memory: they may be passed to/from device code, but may not be dereferenced in device code
- Simple CUDA API for handling device memory: cudaMalloc(), cudaFree(), cudaMemcpy(), similar to the C equivalents malloc(), free(), memcpy()

Addition on the Device: add()

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

#define N (2048*2048)
#define THREADS_PER_BLOCK 512

int main(void) {
    int *a, *b, *c;          // host copies of a, b, c
    int *d_a, *d_b, *d_c;    // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

Addition on the Device: add() cont.

    // Alloc space for host copies of a, b, c and set up input values
    // (random_ints() is a helper from the slides that fills an array)
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

A kernel can be executed by multiple equally shaped thread blocks, so the total number of threads is equal to the number of threads per block times the number of blocks.
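
When the problem size is not a multiple of the block size, the usual pattern (sketched here with an added n parameter, our own variation on the add() example above) rounds the block count up and bounds-checks inside the kernel:

__global__ void add(const int *a, const int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                       // threads past the end do nothing
        c[index] = a[index] + b[index];
}

// Launch enough blocks to cover all n elements:
// add<<<(n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);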

CUDA in Bioinformatics
MUMmerGPU: high-throughput DNA sequence alignment using GPUs.

CUDA in Bioinformatics
SmithWaterman-CUDA performs alignments between one or more sequences and a database (all the sequences, including those in the database, are intended to be protein sequences). LISSOM (Laterally Interconnected Synergetically Self-Organizing Map) is a model of the human neocortex (mainly modeled on the visual cortex) at the neural-column level.

CUDA in Bioinformatics
CUDA-MEME is an ultrafast, scalable motif discovery algorithm based on MEME (Multiple EM for Motif Elicitation). CUSHAW is a CUDA-compatible short-read aligner to large genomes based on the Burrows-Wheeler transform.