demo 146X Interactive visualization of volumetric white matter connectivity 36X Ionic placement for molecular dynamics simulation on GPU 19X Transcoding.

Slides:

Advertisements

Similar presentations

Intermediate GPGPU Programming in CUDA

Advertisements

List Ranking and Parallel Prefix

C++  PPL  AMP When NO branches between a micro-op and retiring to the visible architectural state – its no longer speculative.

1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2011 GPUMemories.ppt GPU Memories These notes will introduce: The basic memory hierarchy.

APARAPI Java™ platform’s ‘Write Once Run Anywhere’ ® now includes the GPU Gary Frost AMD PMTS Java Runtime Team.

GPU programming: CUDA Acknowledgement: the lecture materials are based on the materials in NVIDIA teaching center CUDA course materials, including materials.

Basic Algorithms on Arrays. Learning Objectives Arrays are useful for storing data in a linear structure We learn how to process data stored in an array.

for (i = 0; i < 1024; i++) C[i] = A[i]*B[i]; for (i = 0; i < 1024; i+=4) C[i:i+3] = A[i:i+3]*B[i:i+3]; #pragma loop(hint_parallel ( N ) ) for.

CS 791v Fall # a simple makefile for building the sample program. # I use multiple versions of gcc, but cuda only supports # gcc 4.4 or lower. The.

C++ AMP: Accelerated Massive Parallelism in Visual C++ 11 Kate Gregory Gregory Consulting

Software. What Is Software? software –Also called Computer programs –Are a list of instructions –Instructions are called code –CPU performs the instructions.

Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:

Presented by David Cravey 10/15/2011. About Me – David Cravey Started programming in 4 th grade Learned BASIC on a V-Tech “Precomputer 1000” and then.

NVIDIA Research Parallel Computing on Manycore GPUs Vinod Grover.

Programming with CUDA, WS09 Waqar Saleem, Jens Müller Programming with CUDA and Parallel Algorithms Waqar Saleem Jens Müller.

Parallel Programming using CUDA. Traditional Computing Von Neumann architecture: instructions are sent from memory to the CPU Serial execution: Instructions.

Programming with CUDA WS 08/09 Lecture 5 Thu, 6 Nov, 2008.

A Source-to-Source OpenACC compiler for CUDA Akihiro Tabuchi †1 Masahiro Nakao †2 Mitsuhisa Sato †1 †1. Graduate School of Systems and Information Engineering,

C++ + r1 r2 r3 add r3, r1, r2 SCALAR (1 operation) v1 v2 v3 + vector length vadd v3, v1, v2 VECTOR (N operations)

Visual Studio 11 for Game Developers Boris Jabes Senior Program Manager Microsoft Corporation.

© David Kirk/NVIDIA and Wen-mei W. Hwu, , SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

images source: AMD image source: NVIDIA performance portability productivity.

C++ Accelerated Massive Parallelism in Visual C Kate Gregory Gregory Consulting DEV334.

An Introduction to Programming with CUDA Paul Richmond

GPU Programming and CUDA Sathish Vadhiyar High Performance Computing.

More CUDA Examples. Different Levels of parallelism Thread parallelism – each thread is an independent thread of execution Data parallelism – across threads.

Steve Teixeira Director of Program Management, Visual C++ Microsoft Corporation Visual C++ and the Native Renaissance.

First CUDA Program. #include "stdio.h" int main() { printf("Hello, world\n"); return 0; } #include __global__ void kernel (void) { } int main (void) {

Basic CUDA Programming Computer Architecture 2015 (Prof. Chih-Wei Liu) Final Project – CUDA Tutorial TA Cheng-Yen Yang

Accelerating MATLAB with CUDA

CUDA All material not from online sources/textbook copyright © Travis Desell, 2012.

+ CUDA Antonyus Pyetro do Amaral Ferreira. + The problem The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now.

CIS 565 Fall 2011 Qing Sun

Automatic translation from CUDA to C++ Luca Atzori, Vincenzo Innocente, Felice Pantaleo, Danilo Piparo 31 August, 2015.

GPU Architecture and Programming

Some key aspects of NVIDIA GPUs and CUDA. Silicon Usage.

OpenCL Programming James Perry EPCC The University of Edinburgh.

Tim Madden ODG/XSD.  Graphics Processing Unit  Graphics card on your PC.  “Hardware accelerated graphics”  Video game industry is main driver.  More.

CS 193G Lecture 2: GPU History & CUDA Programming Basics.

This deck has 1-, 2-, and 3- slide variants for C++ AMP If your own deck uses 4:3, get with the 21 st century and switch to 16:9 ( Design tab, Page Setup.

CUDA Basics. Overview What is CUDA? Data Parallelism Host-Device model Thread execution Matrix-multiplication.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.

OpenCL Joseph Kider University of Pennsylvania CIS Fall 2011.

Introduction to CUDA CAP 4730 Spring 2012 Tushar Athawale.

CS/EE 217 GPU Architecture and Parallel Programming Lecture 17: Data Transfer and CUDA Streams.

MAXIMISE.NET WITH C++ FOR INTEROP, PERFORMANCE AND PRODUCTIVITY Angel Hernandez Avanade Australia (c) 2011 Microsoft. All rights reserved. SESSION CODE:

© David Kirk/NVIDIA and Wen-mei W. Hwu, CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C.

© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE 8823A GPU Architectures Module 2: Introduction.

University of Michigan Electrical Engineering and Computer Science Paragon: Collaborative Speculative Loop Execution on GPU and CPU Mehrzad Samadi 1 Amir.

Using Open64 for High Performance Computing on a GPU by Mike Murphy, Gautam Chakrabarti, and Xiangyun Kong.

GPU Programming and CUDA Sathish Vadhiyar High Performance Computing.

Graphic Processing Units Presentation by John Manning.

Programming with CUDA WS 08/09 Lecture 2 Tue, 28 Oct, 2008.

1 ITCS 4/5145GPU Programming, UNC-Charlotte, B. Wilkinson, Nov 4, 2013 CUDAProgModel.ppt CUDA Programming Model These notes will introduce: Basic GPU programming.

CS 179: GPU Programming Lecture 7. Week 3 Goals: – More involved GPU-accelerable algorithms Relevant hardware quirks – CUDA libraries.

OpenCL. Sources Patrick Cozzi Spring 2011 NVIDIA CUDA Programming Guide CUDA by Example Programming Massively Parallel Processors.

Computer Engg, IIT(BHU)

CUDA Programming Model

Patrick Cozzi University of Pennsylvania CIS Spring 2011

CS 179: GPU Programming Lecture 7.

Taming GPU compute with C++ Accelerated Massive Parallelism

CUDA Parallelism Model

Antonio R. Miele Marco D. Santambrogio Politecnico di Milano

Parallel programming with GPGPU coprocessors

Using Shared memory These notes will demonstrate the improvements achieved by using shared memory, with code and results running on coit-grid06.uncc.edu.

CS 179: Lecture 3.

Introduction to CUDA.

Antonio R. Miele Marco D. Santambrogio Politecnico di Milano

GPU Lab1 Discussion A MATRIX-MATRIX MULTIPLICATION EXAMPLE.

Presentation transcript:

demo

146X Interactive visualization of volumetric white matter connectivity 36X Ionic placement for molecular dynamics simulation on GPU 19X Transcoding HD video stream to H X Simulation in Matlab using.mex file CUDA function 100X Astrophysics N- body simulation 149X Financial simulation of LIBOR model with swaptions 47X An M- script API for linear Algebra operations on GPU 20X Ultrasound medical imaging for cancer diagnostics 24X Highly optimized object oriented molecular dynamics 30X Cmatch exact string matching to find similar proteins and gene sequences source

images source: AMD

image source: AMD

performance portability productivity

void AddArrays(int n, int * pA, int * pB, int * pC) { for (int i=0; i<n; i++) { pC[i] = pA[i] + pB[i]; } How do we take the serial code on the left that runs on the CPU and convert it to run on an accelerator like the GPU?

void AddArrays(int n, int * pA, int * pB, int * pC) { for (int i=0; i<n; i++) { pC[i] = pA[i] + pB[i]; } #include using namespace concurrency; void AddArrays(int n, int * pA, int * pB, int * pC) { array_view a(n, pA); array_view b(n, pB); array_view sum(n, pC); parallel_for_each( sum.grid, [=](index i) restrict(direct3d) { sum[i] = a[i] + b[i]; } ); } void AddArrays(int n, int * pA, int * pB, int * pC) { for (int i=0; i<n; i++) { pC[i] = pA[i] + pB[i]; }

void AddArrays(int n, int * pA, int * pB, int * pC) { array_view a(n, pA); array_view b(n, pB); array_view sum(n, pC); parallel_for_each( sum.grid, [=](index idx) restrict(direct3d) { sum[idx] = a[idx] + b[idx]; } ); } array_view variables captured and associated data copied to accelerator (on demand) parallel_for_each: execute the lambda on the accelerator once per thread grid: the number and shape of threads to execute the lambda index: the thread ID that is running the lambda, used to index into data array_view: wraps the data to operate on the accelerator restrict(direct3d): tells the compiler to check that this code can execute on Direct3D hardware (aka accelerator)

index i(2);index i(0,2);index i(2,0,1); extent e(3,2,2);extent e(3,4);extent e(6); grid g(e);

vector v(96); extent e(8,12); // e[0] == 8; e[1] == 12; array a(e, v.begin(), v.end()); // in the body of my lambda index i(3,9); // i[0] == 3; i[1] == 9; int o = a[i]; //or a[i] = 16; //int o = a(i[0], i[1]);

vector v(10); extent e(2,5); array_view a(e, v); //above two lines can also be written //array_view a(2,5,v);

1.parallel_for_each( 2.g, //g is of type grid 3.[ ](index idx) restrict(direct3d) { // kernel code } 1.);

void MatrixMultiplySerial( vector & vC, const vector & vA, const vector & vB, int M, int N, int W ) { for (int row = 0; row < M; row++) { for (int col = 0; col < N; col++){ float sum = 0.0f; for(int i = 0; i < W; i++) sum += vA[row * W + i] * vB[i * N + col]; vC[row * N + col] = sum; } void MatrixMultiplyAMP( vector & vC, const vector & vA, const vector & vB, int M, int N, int W ) { array_view a(M,W,vA),b(W,N,vB); array_view,2> c(M,N,vC); parallel_for_each(c.grid, [=](index idx) restrict(direct3d) { int row = idx[0]; int col = idx[1]; float sum = 0.0f; for(int i = 0; i < W; i++) sum += a(row, i) * b(i, col); c[idx] = sum; } ); }

PCIe HostAccelerator (GPU example)

// Identify an accelerator based on Windows device ID accelerator myAcc(“PCI\\VEN_1002&DEV_9591&CC_0300”); // …or enumerate all accelerators (not shown) // Allocate an array on my accelerator array myArray(10, myAcc.default_view); // …or launch a kernel on my accelerator parallel_for_each(myAcc.default_view, myArrayView.grid,...); // Identify an accelerator based on Windows device ID accelerator myAcc(“PCI\\VEN_1002&DEV_9591&CC_0300”); // …or enumerate all accelerators (not shown) // Allocate an array on my accelerator array myArray(10, myAcc.default_view); // …or launch a kernel on my accelerator parallel_for_each(myAcc.default_view, myArrayView.grid,...);

g.tile () extent e(8,6); grid g(e);

T array_view data(8, 6, p_my_data); parallel_for_each( data.grid.tile (), [=] (tiled_index t_idx)… { … }); T 7

void MatrixMultSimple(vector & vC, const vector & vA, const vector & vB, int M, int N, int W ) { array_view a(M, W, vA), b(W, N, vB); array_view,2> c(M,N,vC); parallel_for_each(c.grid, [=] (index idx) restrict(direct3d) { int row = idx[0]; int col = idx[1]; float sum = 0.0f; for(int k = 0; k < W; k++) sum += a(row, k) * b(k, col); c[idx] = sum; } ); } void MatrixMultTiled(vector & vC, const vector & vA, const vector & vB, int M, int N, int W ) { static const int TS = 16; array_view a(M, W, vA), b(W, N, vB); array_view,2> c(M,N,vC); parallel_for_each(c.grid.tile (), [=] (tiled_index t_idx) restrict(direct3d) { int row = t_idx.local[0]; int col = t_idx.local[1]; float sum = 0.0f; for (int i = 0; i < W; i += TS) { tile_static float locA[TS][TS], locB[TS][TS]; locA[row][col] = a(t_idx.global[0], col + i); locB[row][col] = b(row + i, t_idx.global[1]); t_idx.barrier.wait(); for (int k = 0; k < TS; k++) sum += locA[row][k] * locB[k][col]; t_idx.barrier.wait(); } c[t_idx.global] = sum; } ); }

demo

double bar( double ) restrict(cpu,direc3d); // 1: same code for both double cos( double ); // 2a: general code double cos( double ) restrict(direct3d); // 2b: specific code void SomeMethod(array c) { parallel_for_each( c.grid, [=](index idx) restrict(direct3d) { //… double d1 = bar(c[idx]);// ok double d2 = cos(c[idx]);// ok, chooses direct3d overload //… }); }