1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013 Branching.ppt Control Flow These notes will introduce scheduling control-flow.

1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Jan 28, 2013 Branching.ppt Control Flow These notes will introduce scheduling of control-flow statements in kernel code.

2 Stream processing Term used to denote the processing of a stream of data items in a data-parallel fashion, as in GPUs: each execution unit executes the same instruction on different data. A stream is the group of data items being processed. A stream processor is an execution unit operating in this fashion with its own local resources (registers, etc.).

3 Stream Processors GPU execution resources are organized into “stream processors” (SPs), previously referred to as execution “cores” (a term used for data-parallel computers). Each stream processor has compute resources such as a register file, an instruction scheduler, etc. A number of blocks are assigned to each streaming multiprocessor (SM) for execution. Limits on the number of threads that can be simultaneously tracked and scheduled limit the number and size of blocks that can be assigned to each SM.

4 Terms Sometimes the terms “CUDA cores”, “thread processors”, or “streaming processors” (SPs) are seen.* NVIDIA groups streaming processors (SPs) into streaming multiprocessors (SMs). Each streaming multiprocessor shares control logic and an instruction cache. *In the book “Programming Massively Parallel Processors” by Kirk and Hwu, Morgan Kaufmann, 2010, page 8

5 NVIDIA GPUs C2050 Fermi (as in coit-grid06.uncc.edu and coit-grid07.uncc.edu): 14 streaming multiprocessors (SMs), each with 32 streaming processors (cores), so 448 cores. Apparently Fermi was originally intended to have 512 cores (16 SMs) but ran too hot. GeForce GTX 480 (March 2010): 15 SMs (480 cores).

6 Tesla K20 (as in coit-grid08.uncc.edu): 2496 stream processors (SPs). Kepler architecture – an “SMX (streaming multiprocessor) design that delivers up to 3x more performance per watt compared to the SM in Fermi.”* * NVIDIA® TESLA® KEPLER GPU COMPUTING ACCELERATORS

7 Thread Scheduling Once a block is assigned to an SM, it is divided into 32-thread units called warps. The size of a warp could change between implementations. One warp is actually executed in hardware at a time. (Some documentation talks about a half-warp, a 16-thread unit, actually executing simultaneously.) Execution in an SM starts with the first warp in the first block.

8 For a program without control-flow instructions (no if statements, etc.), the same instruction is executed for each thread in the warp simultaneously.

9 Control-flow instructions When there is a divergent path, the instructions on one path are executed first and then the instructions on the other path, within each warp. This causes the two paths to be serialized. But different warps are considered separately: it is possible for one warp to execute one path and another warp to execute the other path at the same time.

10 Maximum performance Ideally have no control-flow statements. If control-flow statements are necessary, the programmer might be able to arrange for each warp to execute just one path. Example if (threadID < 16) /* do this */ ; if (threadID < 32) /* do this */ ; if (threadID < 48) /* do this */ ; Need to test/check

11 Compiler loop unrolling Sometimes the compiler unrolls loops; then there are no divergent paths. Example for (i = 0; i < 4; i++) a[i] = 0; becomes a[0] = 0; a[1] = 0; a[2] = 0; a[3] = 0;

12 Branch predication instructions The compiler can also use branch predication instructions to eliminate divergent paths. A branch predication instruction is a machine instruction that combines a Boolean condition (predicate) with an operation such as an addition. Example ADD R1, R2, R3 where CC == zero, etc.

Questions