CS 179: Lecture 3

A bold problem! Suppose we have a polynomial P(r) with coefficients c0, …, cn-1, given by:
P(r) = c0 + c1 r + c2 r^2 + … + cn-1 r^(n-1)
We want, for inputs r0, …, rN-1, the sum:
P(r0) + P(r1) + … + P(rN-1)
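For reference, a minimal sequential sketch of the quantity we want (not from the slides; the array names r and c and the use of Horner's rule are assumptions):

// CPU reference: sum of P(r[i]) over all N inputs.
// r: the N evaluation points, c: the n polynomial coefficients.
float polynomialSumCPU(const float *r, int N, const float *c, int n) {
    float total = 0.0f;
    for (int i = 0; i < N; ++i) {
        // Evaluate P(r[i]) with Horner's rule.
        float p = 0.0f;
        for (int k = n - 1; k >= 0; --k)
            p = p * r[i] + c[k];
        total += p;
    }
    return total;
}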

A kernel !!
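The kernel code on this slide is not captured in the transcript. A minimal sketch of what a first attempt could look like (the names polynomialKernel, coeffs, points, and output are assumptions); the next slides point out what is wrong with it:

// Sketch of a naive first-attempt kernel (assumed code, not the slide's).
// Each thread evaluates P at one input and adds it to *output.
__global__ void polynomialKernel(const float *coeffs, int n,
                                 const float *points, int N,
                                 float *output) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float p = 0.0f;
        for (int k = n - 1; k >= 0; --k)
            p = p * points[idx] + coeffs[k];
        *output += p;   // race condition: unsynchronized read-modify-write
    }
}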

A more correct kernel
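Again, the slide's code is not in the transcript; a sketch of the corrected version, with the same assumed names, replaces the racy += with atomicAdd:

// Sketch: same kernel, but each contribution is applied atomically.
__global__ void polynomialKernelAtomic(const float *coeffs, int n,
                                       const float *points, int N,
                                       float *output) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        float p = 0.0f;
        for (int k = n - 1; k >= 0; --k)
            p = p * points[idx] + coeffs[k];
        atomicAdd(output, p);   // correct, but every addition is serialized
    }
}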

Performance problem! Serialization of atomicAdd(): all N atomic additions to the single *output location execute one at a time.

Parallel accumulation …

Parallel accumulation

Accumulation methods – A visual (diagram on the slide): purely atomic accumulation vs. “linear accumulation”, both combining per-thread results into *output
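A sketch of the “linear accumulation” idea, assuming the same hypothetical names and a shared-memory array partial_outputs[] (one slot per thread): each block sums its own values serially in thread 0, then issues a single atomicAdd per block instead of one per thread.

// Sketch: per-block linear accumulation (assumed code, not the slide's).
// Launch with shared memory size = threads-per-block * sizeof(float).
__global__ void polynomialKernelLinear(const float *coeffs, int n,
                                       const float *points, int N,
                                       float *output) {
    extern __shared__ float partial_outputs[];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    float p = 0.0f;
    if (idx < N)
        for (int k = n - 1; k >= 0; --k)
            p = p * points[idx] + coeffs[k];
    partial_outputs[threadIdx.x] = p;
    __syncthreads();

    if (threadIdx.x == 0) {
        float sum = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)   // serial loop within the block
            sum += partial_outputs[i];
        atomicAdd(output, sum);                // one atomic per block
    }
}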

GPU: Internals
Multiprocessor (blocks go here)
SIMD processing unit (warps go here)

Warp
32 threads that execute simultaneously!
A[] + B[] -> C[] problem: Didn’t matter!
Warp of threads 0-31: A[idx] + B[idx] -> C[idx]
Warp of threads 32-63: A[idx] + B[idx] -> C[idx]
Warp of threads 64-95: A[idx] + B[idx] -> C[idx]
…

Divergence – Example
Branches result in different instructions!
//Suppose we have a pointer to some floating-point
//value *value...
if (threadIdx.x % 2 == 0)
    *value += 2;  //Branch A
else
    *value /= 3;  //Branch B

Divergence
What happens:
Executes normally until the if-statement
Branches to calculate Branch A (blue threads in the slide’s diagram)
Goes back (!) and branches to calculate Branch B (red threads in the slide’s diagram)

Calculating polynomial values – does it matter?
Warp of threads 0-31: (calculate some values)
Warp of threads 32-63: (calculate some values)
Warp of threads 64-95: (calculate some values)
…
Same instructions! Doesn’t matter!

Linear reduction – does it matter? (after calculating values…)
Warp of threads 0-31:
    Thread 0: Accumulate sum
    Threads 1-31: Do nothing
Warp of threads 32-63: Do nothing
Warp of threads 64-95: Do nothing
…
Doesn’t really matter… in this case.

Improving our reduction More threads participating in the process! “Binary tree”

Improving our reduction

Improving our reduction
//Let our shared memory block be partial_outputs[]...
synchronize threads before starting...
set offset to 1
while ( (offset * 2) <= block dimension):
    if (thread index % (offset * 2) is 0) AND you won’t exceed your block dimension:
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    double the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
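A CUDA rendering of this pseudocode, as a sketch: it assumes the same partial_outputs[] shared array as the linear sketch above, already filled with one value per thread and synchronized, and it replaces the serial thread-0 loop.

    // Sketch: divergent binary-tree reduction over partial_outputs[].
    for (int offset = 1; offset * 2 <= blockDim.x; offset *= 2) {
        // Only threads whose index is a multiple of 2*offset do work,
        // so active and idle threads are interleaved within each warp.
        if (threadIdx.x % (offset * 2) == 0 &&
            threadIdx.x + offset < blockDim.x)
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);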

Improving our reduction
What will happen?
Each warp will do meaningful execution on ~16 threads
You use lots more warps than you have to!
A “divergent tree”
How to improve?

“Non-divergent tree”

“Non-divergent tree”
//Let our shared memory block be partial_outputs[]...
set offset to the highest power of 2 that’s less than the block dimension
//For the first iteration, check that you don’t access
//out of range memory
while (offset >= 1):
    if (thread index < offset):
        add partial_outputs[thread index + offset] to partial_outputs[thread index]
    halve the offset
    synchronize threads
Get thread 0 to atomicAdd() partial_outputs[0] to output
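A CUDA sketch of this version, with the same assumptions as before (partial_outputs[] already filled and synchronized): the active threads now form a contiguous range, so whole warps stay either fully active or fully idle until the range shrinks below the warp size.

    // Sketch: non-divergent tree reduction over partial_outputs[].
    int offset = 1;
    while (offset * 2 < blockDim.x)   // highest power of 2 below blockDim.x
        offset *= 2;
    for (; offset >= 1; offset /= 2) {
        if (threadIdx.x < offset &&
            threadIdx.x + offset < blockDim.x)   // range check for the first pass
            partial_outputs[threadIdx.x] += partial_outputs[threadIdx.x + offset];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        atomicAdd(output, partial_outputs[0]);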

“Non-divergent tree”
Suppose we have our block of 512 threads…
Warp of threads 0-31: Accumulate result
…
Warp of threads 224-255: Accumulate result
Warp of threads 256-287: Do nothing
…
Warp of threads 480-511: Do nothing

“Non-divergent tree”
Suppose we’re now down to 32 threads…
Warp of threads 0-31:
    Threads 0-15: Accumulate result
    Threads 16-31: Do nothing
Much less divergence
Divergence only occurs in the middle of the warp, if ever!

Reduction – Four Approaches (a visual comparison on the slide): purely atomic, linear accumulation, divergent tree, and non-divergent tree, all accumulating into *output

Notes on files? (An aside)
Labs and CUDA programming typically involve the following files:
____.cc – allows C++ code; g++ compiles this
____.cu – allows CUDA syntax; nvcc compiles this
____.cuh – CUDA header file (declare accessible functions)
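A minimal sketch of how the three kinds of files can fit together (the file names, kernel, and wrapper function below are hypothetical, not the lab’s actual interface):

// add.cuh – declarations visible to both g++ and nvcc
void cudaCallAddKernel(const float *a, const float *b, float *c, int n);

// add.cu – CUDA syntax lives here; nvcc compiles this
__global__ void addKernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
void cudaCallAddKernel(const float *a, const float *b, float *c, int n) {
    addKernel<<<(n + 255) / 256, 256>>>(a, b, c, n);
}

// main.cc – plain C++; g++ compiles this and links against the .cu object
#include "add.cuh"
// ... allocate device arrays a, b, c, then call: cudaCallAddKernel(a, b, c, n);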

Big Picture (so far)
CUDA allows large speedups!
Parallelism with CUDA: different from CPU parallelism
Threads are “small”
Memory to think about
Increase speedup by thinking about problem constraints!
Reduction (and more!)

Big Picture (so far)
Steps in these two problems are widely applicable!
Dot product
Norm calculation
(and more)

Other notes
Office hours: Monday 8-10 PM, Tuesday 8-10 PM
Lab 1 due Wednesday, 5 PM