CS 61C: Great Ideas in Computer Architecture (Machine Structures)
Lecture 28: GP-GPU Programming
Lecturer: Alan Christopher

Overview
GP-GPU: What and why
OpenCL, CUDA, and programming GPUs
GPU performance demo

A Quick Review: Classes of Parallelism
ILP: Run multiple instructions from one stream in parallel (e.g. pipelining)
TLP: Run multiple instruction streams simultaneously (e.g. OpenMP)
DLP: Run the same operation on multiple data at the same time (e.g. SSE intrinsics) -- GPUs are here

GPUs
Hardware specialized for graphics calculations
Originally developed to facilitate the use of CAD programs
Graphics calculations are extremely data parallel, e.g. translate every vertex in a 3D model to the right
Programmers found that they could rephrase some of their problems as graphics manipulations and run them on the GPU
Incredibly burdensome for the programmer to use
More usable these days: OpenCL, CUDA

CPU vs. GPU
CPU: latency optimized -- a couple threads of execution, each thread executes quickly, serial code, lots of caching
GPU: throughput optimized -- many, many threads of execution, each thread executes slowly, parallel code, lots of memory bandwidth

OpenCL and CUDA
Extensions to C which allow for relatively easy GPU programming
CUDA is NVIDIA proprietary: NVIDIA cards only
OpenCL is open source: can be used with NVIDIA or ATI cards
OpenCL is intended for general heterogeneous computing, which means you can use it with devices like FPGAs, but also means it's relatively clunky
Similar tools, but different jargon

Kernels
Kernels define the computation for one array index
The GPU runs the kernel on each index of a specified range
Similar functionality to map, but you get to know the array index as well as the array value
The work at a given index is called a work-item, a CUDA thread, or a µthread
The entire range is called an index-space or grid

OpenCL vvadd

/* C version. */
void vvadd(float *dst, float *a, float *b, unsigned n) {
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* OpenCL kernel. */
__kernel void vvadd(__global float *dst, __global float *a, __global float *b, unsigned n) {
    unsigned tid = get_global_id(0);
    if (tid < n)
        dst[tid] = a[tid] + b[tid];
}
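For comparison, here is what the same kernel might look like in CUDA. This is an illustrative sketch, not from the slides; blockIdx, blockDim and threadIdx are CUDA's built-in indices, and the global work-item index is reconstructed from them instead of calling get_global_id.

/* A minimal CUDA sketch of the same kernel (illustrative, not from the slides). */
__global__ void vvadd(float *dst, const float *a, const float *b, unsigned n) {
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;   /* global work-item index */
    if (tid < n)                                            /* mask off the padded part of the last block */
        dst[tid] = a[tid] + b[tid];
}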

Programmer's View of Execution
Create enough work-groups to cover the input vector (OpenCL calls this ensemble of work-groups an index space; it can be 3-dimensional in OpenCL, 2-dimensional in CUDA)
The local work size (work-items per work-group) is chosen by the programmer
[Figure: threadId 0 ... threadId 255 form one work-group, threadId 256 ... threadId 511 the next, and so on up to threadId n+e; the conditional (i < n) turns off unused threads in the last block]
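A hypothetical host-side helper (in CUDA terms, not from the slides; launch_vvadd, threadsPerBlock and numBlocks are names of my own, and d_dst, d_a, d_b are assumed to already live in GPU memory) showing the usual way the grid is rounded up so it covers the whole vector, which is exactly why the kernel needs the (tid < n) guard:

/* Hypothetical launch helper: round the block count up so that
   threadsPerBlock * numBlocks >= n; the surplus µthreads in the last
   block are turned off by the kernel's (tid < n) conditional. */
void launch_vvadd(float *d_dst, float *d_a, float *d_b, unsigned n) {
    unsigned threadsPerBlock = 256;                                    /* local work size */
    unsigned numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  /* ceiling division */
    vvadd<<<numBlocks, threadsPerBlock>>>(d_dst, d_a, d_b, n);
}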

Hardware Execution Model
[Figure: a GPU built from Core 0 ... Core 15, each containing Lane 0 ... Lane 15, attached to its own GPU memory; the CPU sits next to its own CPU memory]
The GPU is built from multiple parallel cores; each core contains a multithreaded SIMD processor.
The CPU sends the whole index-space over to the GPU, which distributes work-groups among cores (each work-group executes on one core). The programmer is unaware of the number of cores.
Notice that the GPU and CPU have different memory spaces. That'll be important when we start considering which jobs are a good fit for GPUs, and which jobs are a poor fit.
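Because the two memory spaces are separate, data has to be moved explicitly before and after a kernel runs. A minimal sketch in CUDA (illustrative, not from the slides; it assumes the CUDA vvadd kernel sketched earlier, and vvadd_on_gpu is a name of my own). The OpenCL host API does the same job with buffer objects and clEnqueue calls.

/* Hypothetical end-to-end host sketch (error checking omitted): copy the
   inputs into GPU memory, run the kernel there, copy the result back. */
void vvadd_on_gpu(float *dst, const float *a, const float *b, unsigned n) {
    float *d_dst, *d_a, *d_b;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&d_dst, bytes);
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);        /* CPU memory -> GPU memory */
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
    vvadd<<<(n + 255) / 256, 256>>>(d_dst, d_a, d_b, n);      /* executes out of GPU memory */
    cudaMemcpy(dst, d_dst, bytes, cudaMemcpyDeviceToHost);    /* GPU memory -> CPU memory */
    cudaFree(d_dst);
    cudaFree(d_a);
    cudaFree(d_b);
}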

“Single Instruction, Multiple Thread”
GPUs use a SIMT model, where individual scalar instruction streams for each work-item are grouped together for SIMD execution on hardware (NVIDIA groups 32 CUDA threads into a warp; OpenCL refers to these groups as wavefronts)
[Figure: one scalar instruction stream -- ld x, mul a, ld y, add, st y -- executed SIMD-fashion across a wavefront of µthreads µT0 through µT7]
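Which warp a work-item lands in falls out of its position in the block: consecutive thread indices are packed 32 at a time. A small illustrative kernel (mine, not from the slides; warp_info is a hypothetical name, and 32 is NVIDIA's warp size, also available in CUDA as the built-in warpSize):

/* Hypothetical sketch: recover the warp number and SIMD lane of each µthread. */
__global__ void warp_info(unsigned *warp_out, unsigned *lane_out, unsigned n) {
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        warp_out[tid] = threadIdx.x / 32;   /* which warp of its block this µthread belongs to */
        lane_out[tid] = threadIdx.x % 32;   /* which SIMD lane within that warp */
    }
}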

Terminology Summary
Kernel: the function that is mapped across the input.
Work-item: the basic unit of execution; takes care of one index. Also called a microthread or CUDA thread.
Work-group/Block: a group of work-items. Each work-group is sent to one core in the GPU.
Index-space/Grid: the range of indices over which the kernel is applied.
Wavefront/Warp: a group of microthreads (work-items) scheduled to be SIMD executed with each other.

Administrivia
Homework 4 is due Sunday (April 6th)

Conditionals in the SIMT Model
Simple if-then-else constructs are compiled into predicated execution, equivalent to vector masking
More complex control flow is compiled into branches
How to execute a vector of branches?
[Figure: a scalar instruction stream -- tid = threadid; if (tid >= n) goto skip; call func1; add; st y; skip: -- executed SIMD-fashion across a warp of µthreads µT0 through µT7]
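As an illustration (a hypothetical kernel of my own, not from the slides), a short if-then-else like this one can be predicated: the hardware issues both assignments to the whole warp, and a per-µthread mask decides which one actually commits.

/* Hypothetical sketch: a simple if-then-else the compiler can predicate
   instead of branching -- every µthread walks both sides under a mask. */
__global__ void clamp_negatives(float *dst, const float *src, unsigned n) {
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        if (src[tid] < 0.0f)
            dst[tid] = 0.0f;         /* committed only where the condition is true  */
        else
            dst[tid] = src[tid];     /* committed only where the condition is false */
    }
}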

Branch Divergence
Hardware tracks which µthreads take or don't take the branch
If all go the same way, then keep going in SIMD fashion
If not, create a mask vector indicating taken/not-taken
Keep executing the not-taken path under the mask; push the taken-branch PC+mask onto a hardware stack and execute it later
When can execution of µthreads in a warp reconverge?
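A hypothetical pair of kernels (mine, not from the slides) showing the difference: in the first, the branch condition flips from one µthread to the next, so every warp has to execute both paths under masks; in the second, the condition is uniform across each group of 32 consecutive tids, so no warp diverges and both paths stay fully occupied.

/* Hypothetical sketch: per-µthread (divergent) vs. warp-uniform branching. */
__global__ void divergent(float *dst, const float *a, unsigned n) {
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        if (tid % 2 == 0)                /* even and odd lanes disagree -> every warp diverges */
            dst[tid] = a[tid] * 2.0f;
        else
            dst[tid] = a[tid] + 1.0f;
    }
}

__global__ void warp_uniform(float *dst, const float *a, unsigned n) {
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        if ((tid / 32) % 2 == 0)         /* whole warps agree -> no divergence within a warp */
            dst[tid] = a[tid] * 2.0f;
        else
            dst[tid] = a[tid] + 1.0f;
    }
}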

Warps (wavefronts) are multithreaded on a single core
One warp of 32 µthreads is a single thread in the hardware
Multiple warp threads are interleaved in execution on a single core to hide latencies (memory and functional unit)
A single thread block can contain multiple warps (up to 512 µT max in CUDA), all mapped to a single core
Can have multiple blocks executing on one core [Nvidia, 2010]

OpenCL Memory Model
Global: read and write by all work-items and work-groups
Constant: read-only by work-items; read and write by the host
Local: used for data sharing; read/write by work-items in the same work-group
Private: only accessible to one work-item
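CUDA exposes the same four spaces under different spellings. A small illustrative kernel (mine, not from the slides; scale, coeff and tile are hypothetical names, and the block size is assumed to be at most 256) marking where each space shows up. What OpenCL calls "local" memory is CUDA's __shared__ memory.

/* Hypothetical sketch mapping the OpenCL memory spaces onto CUDA. */
__constant__ float coeff[16];                  /* constant: read-only to µthreads, written by the host */

__global__ void scale(float *dst, const float *src, unsigned n) {   /* dst, src: global memory */
    __shared__ float tile[256];                /* local ("shared"): visible to one work-group/block */
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        tile[threadIdx.x] = src[tid];          /* global -> local */
    __syncthreads();                           /* every µthread in the block reaches this barrier */
    if (tid < n) {
        float tmp = tile[threadIdx.x] * coeff[0];   /* private: a register of one work-item */
        dst[tid] = tmp;                        /* local -> global */
    }
}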

SIMT
Illusion of many independent threads
But for efficiency, the programmer must try to keep µthreads aligned in a SIMD fashion
Try to do unit-stride loads and stores so memory coalescing kicks in
Avoid branch divergence so most instruction slots execute useful work and are not masked off
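A hypothetical illustration of the unit-stride advice (kernel names are mine, not from the slides): in the first kernel, neighbouring µthreads of a warp read neighbouring elements, so the hardware can coalesce them into a few wide memory transactions; in the second, each warp's reads are scattered stride elements apart and coalescing breaks down.

/* Hypothetical sketch: coalesced (unit-stride) vs. strided memory access. */
__global__ void copy_coalesced(float *dst, const float *src, unsigned n) {
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        dst[tid] = src[tid];                   /* lanes of a warp touch adjacent words */
}

__global__ void copy_strided(float *dst, const float *src, unsigned n, unsigned stride) {
    unsigned tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        dst[tid] = src[(tid * stride) % n];    /* lanes touch words far apart -> poor coalescing */
}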

VVADD
A: CPU faster
B: GPU faster

/* C version. */
void vvadd(float *dst, float *a, float *b, unsigned n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* OpenCL kernel. */
__kernel void vvadd(__global float *dst, __global float *a, __global float *b, unsigned n) {
    unsigned tid = get_global_id(0);
    if (tid < n)
        dst[tid] = a[tid] + b[tid];
}

VVADD

/* C version. */
void vvadd(float *dst, float *a, float *b, unsigned n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

Only 1 flop per three memory accesses => memory-bound calculation.
“A many core processor ≡ A device for turning a compute bound problem into a memory bound problem” – Kathy Yelick
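A quick back-of-the-envelope check of that claim (my arithmetic, not on the slide, assuming 4-byte floats): each iteration does one floating-point add while reading a[i], reading b[i] and writing dst[i], so

\[
\text{arithmetic intensity} = \frac{1\ \text{flop}}{3 \times 4\ \text{bytes}} = \frac{1}{12}\ \text{flop per byte},
\]

far too little computation per byte to keep either a CPU or a GPU busy, so both spend their time waiting on DRAM.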

VECTOR_COP
A: CPU faster
B: GPU faster

/* C version. */
void vector_cop(float *dst, float *a, float *b, unsigned n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        dst[i] = 0;
        for (int j = 0; j < A_LARGE_NUMBER; j++)
            dst[i] += a[i]*2*b[i] - a[i]*a[i] - b[i]*b[i];
    }
}

/* OpenCL kernel. */
__kernel void vector_cop(__global float *dst, __global float *a, __global float *b, unsigned n) {
    unsigned i = get_global_id(0);
    if (i < n) {
        dst[i] = 0;
        for (int j = 0; j < A_LARGE_NUMBER; j++)
            dst[i] += a[i]*2*b[i] - a[i]*a[i] - b[i]*b[i];
    }
}
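By contrast (again my arithmetic, not from the slide, counting roughly 7 floating-point operations in the inner-loop body and assuming 4-byte floats), each output element reuses the same two loaded values A_LARGE_NUMBER times, so

\[
\text{arithmetic intensity} \approx \frac{7 \times \texttt{A\_LARGE\_NUMBER}\ \text{flops}}{3 \times 4\ \text{bytes}},
\]

which grows with A_LARGE_NUMBER. The calculation becomes compute-bound rather than memory-bound, which is the regime where, by this lecture's own conclusion, the GPU thrives.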

GP-GPU in the future
High-end desktops have a separate GPU chip, but the trend is towards integrating the GPU on the same die as the CPU (already the case in laptops, tablets and smartphones)
Advantage: shared memory with the CPU, so no need to transfer data
Disadvantage: reduced memory bandwidth compared to a dedicated smaller-capacity specialized memory system -- graphics DRAM (GDDR) versus regular DRAM (DDR3)
Will GP-GPU survive? Or will improvements in CPU DLP make GP-GPU redundant?
On the same die, the CPU and GPU should have the same memory bandwidth, but the GPU might have more FLOPS, as needed for graphics anyway

Acknowledgements
These slides contain materials developed and copyright by Krste Asanovic (UCB), AMD, and codeproject.com

And in conclusion…
GPUs thrive when
The calculation is data parallel
The calculation is compute-bound (not memory-bound)
The calculation is large
CPUs thrive when
The calculation is largely serial
The calculation is small
The programmer is lazy

Bonus
OpenCL source code for the vvadd and vector_cop demos is available at http://www-inst.eecs.berkeley.edu/~cs61c/sp13/lec/39/demo.tar.gz