CS 179: Lecture 2 Lab Review 1

The Problem
- Add two arrays: A[] + B[] -> C[]

GPU Computing: Step by Step
- Set up inputs on the host (CPU-accessible memory)
- Allocate memory for inputs on the GPU
- Copy inputs from host to GPU
- Allocate memory for outputs on the host
- Allocate memory for outputs on the GPU
- Start the GPU kernel
- Copy output from GPU to host
- (Copying can be asynchronous)
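
A minimal host-side sketch of these steps (names, the block size, and the lack of error checking are illustrative; the kernel itself appears on the next slide):

    #include <cuda_runtime.h>

    // Kernel defined on the next slide
    __global__ void cudaAddVectorsKernel(float *a, float *b, float *c);

    void cudaAddVectors(const float *a, const float *b, float *c, int n) {
        float *dev_a, *dev_b, *dev_c;
        size_t size = n * sizeof(float);

        // Allocate memory for inputs and outputs on the GPU
        cudaMalloc((void **) &dev_a, size);
        cudaMalloc((void **) &dev_b, size);
        cudaMalloc((void **) &dev_c, size);

        // Copy inputs from host to GPU
        cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

        // Start the GPU kernel (launch configuration discussed below)
        const int threadsPerBlock = 512;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        cudaAddVectorsKernel<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);

        // Copy output from GPU to host (output buffer c was allocated by the caller)
        cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);
    }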

The Kernel
- Determine a thread index from the block ID and the thread ID within a block:
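
A minimal sketch (kernel and parameter names are assumptions, not the course's exact code); note there is no bounds check yet, which is exactly what the "Fixing the Kernel" slides below address:

    __global__ void cudaAddVectorsKernel(float *a, float *b, float *c) {
        // Global thread index: each thread handles one array element
        unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
        c[index] = a[index] + b[index];
    }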

Calling the Kernel …
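
The launch uses CUDA's triple-chevron syntax; a sketch (512 threads per block is an assumption):

    const int threadsPerBlock = 512;
    // Round up so every element gets a thread
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    // <<<number of blocks, threads per block>>> (both may also be dim3 values)
    cudaAddVectorsKernel<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c);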

CUDA implementation (2)

Fixing the Kernel
- For large arrays, our kernel doesn't work!
- Bounds-checking: be on the lookout!
- We also need a way for the kernel to handle a few more elements per thread…

Fixing the Kernel – Part 1
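
Part 1 is the bounds check: with a rounded-up grid, the last block has some threads past the end of the array. A sketch:

    __global__ void cudaAddVectorsKernel(float *a, float *b, float *c, int n) {
        unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
        // Bounds check: threads past the end of the array do nothing
        if (index < n)
            c[index] = a[index] + b[index];
    }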

Fixing the Kernel – Part 2
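
Part 2 handles the "few more elements": if the grid has fewer threads than the array has elements (for instance, when the block count is capped, as on the next slide), each thread strides through the array. A sketch using a grid-stride loop:

    __global__ void cudaAddVectorsKernel(float *a, float *b, float *c, int n) {
        unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
        // Total number of threads in the entire grid
        unsigned int stride = blockDim.x * gridDim.x;
        // Each thread handles elements index, index + stride, index + 2*stride, ...
        while (index < n) {
            c[index] = a[index] + b[index];
            index += stride;
        }
    }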

Fixing our Call
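
On the host side, the block count gets capped (65535 is the per-dimension grid limit on GPUs of compute capability below 3.0); the grid-stride loop covers whatever the grid cannot. A sketch, reusing the names from the earlier slides:

    #include <algorithm>  // for std::min

    int blocks = std::min(65535, (n + threadsPerBlock - 1) / threadsPerBlock);
    cudaAddVectorsKernel<<<blocks, threadsPerBlock>>>(dev_a, dev_b, dev_c, n);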

Lab 1!
- Sum of polynomials: a fun, parallelizable example!
- Suppose we have a polynomial $P(r)$ with coefficients $c_0, \ldots, c_{n-1}$, given by:
  $P(r) = \sum_{i=0}^{n-1} c_i r^i$
- We want, for inputs $r_0, \ldots, r_{N-1}$, the sum:
  $S = \sum_{j=0}^{N-1} P(r_j)$
- The output condenses to one number!

Calculating P(r) once
- Pseudocode (one possible method):

    Given r, coefficients[]
    result <- 0.0
    power <- 1.0
    for all coefficient indices i from 0 to n-1:
        result += coefficients[i] * power
        power *= r
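
The same method as a CUDA device function (a sketch; assumes the coefficients live in memory the thread can read, e.g. global or constant memory):

    __device__ float evaluatePolynomial(const float *coefficients, int n, float r) {
        float result = 0.0f;
        float power = 1.0f;  // holds r^i for the current term
        for (int i = 0; i < n; i++) {
            result += coefficients[i] * power;
            power *= r;
        }
        return result;
    }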

Accumulation
- Use the atomicAdd() function!
- Important for safe operations: many threads adding to the same memory location is a race condition without it.

Accumulation
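
A sketch of naive atomic accumulation, one atomicAdd per input value (the kernel name and evaluatePolynomial come from the sketches above, not the course's exact code):

    __global__ void polynomialSumKernel(const float *coefficients, int n,
                                        const float *inputs, int N,
                                        float *result) {
        unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;
        unsigned int stride = blockDim.x * gridDim.x;
        while (index < N) {
            // atomicAdd serializes concurrent updates to *result,
            // so no contribution is lost to a race condition
            atomicAdd(result, evaluatePolynomial(coefficients, n, inputs[index]));
            index += stride;
        }
    }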

Shared Memory
- Faster than global memory
- Allocated per block
- Visible only to threads within that one block
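
Shared memory is declared with the __shared__ qualifier inside a kernel; a sketch (the fixed size assumes 512 threads per block):

    __global__ void exampleKernel(const float *data) {
        // One slot per thread in this block, in fast on-chip memory
        __shared__ float partialSums[512];
        partialSums[threadIdx.x] = data[blockIdx.x * blockDim.x + threadIdx.x];
        // Wait until every thread in the block has written its slot
        __syncthreads();
        // ... combine partialSums here ...
    }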

Linear Accumulation
- atomicAdd() has a choke point: every thread in the grid serializes on one memory location!
- What if we reduced our results in parallel? (A sketch follows the next two slides.)

Linear Accumulation …

Linear Accumulation (2)
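
A sketch of the idea (names carried over from the sketches above; assumes 512 threads per block): each block collects its values in shared memory, thread 0 sums them linearly, and only one atomicAdd happens per block instead of one per thread.

    __global__ void polynomialSumKernel(const float *coefficients, int n,
                                        const float *inputs, int N,
                                        float *result) {
        __shared__ float partialSums[512];
        unsigned int index = blockIdx.x * blockDim.x + threadIdx.x;

        // Each thread writes its own value into shared memory
        partialSums[threadIdx.x] =
            (index < N) ? evaluatePolynomial(coefficients, n, inputs[index]) : 0.0f;
        __syncthreads();

        // Thread 0 combines the block's values linearly...
        if (threadIdx.x == 0) {
            float blockSum = 0.0f;
            for (int i = 0; i < blockDim.x; i++)
                blockSum += partialSums[i];
            // ...so only one atomicAdd per block hits global memory
            atomicAdd(result, blockSum);
        }
    }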

Can we do better?
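
The usual improvement (a preview; reductions are covered properly in a later lecture) is a tree-shaped reduction: pairs of values are combined in parallel, halving the active threads each step, so a block of b threads finishes in log2(b) steps instead of b. A sketch of the replacement for the thread-0 loop in the kernel above (assumes the block size is a power of two):

    // Tree reduction: halve the number of active threads each step
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            partialSums[threadIdx.x] += partialSums[threadIdx.x + s];
        __syncthreads();
    }
    // Thread 0 now holds the block's total in partialSums[0]
    if (threadIdx.x == 0)
        atomicAdd(result, partialSums[0]);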

Last notes
- minuteman.cms.caltech.edu: the easiest option
- CMS accounts!
- Office hours:
  - Kevin: Monday, 8-10 PM
  - Connor: Tuesday, 8-10 PM