Heterogeneous Computing using OpenCL
Lecture 4, F21DP Distributed and Parallel Technology
Sven-Bodo Scholz

The Big Picture
Introduction to Heterogeneous Systems
OpenCL Basics
Memory Issues
Scheduling

Memory Banks
Memory is made up of banks
– Memory banks are the hardware units that actually store data
The memory banks targeted by a memory access depend on the address of the data to be read/written
– Note that on current GPUs there are more memory banks than can be addressed at once by the global memory bus, so it is possible for different accesses to target different banks
– Bank response time, not the number of access requests, is the bottleneck
Successive data are stored in successive banks (strides of 32-bit words on GPUs), so a group of threads accessing successive elements produces no bank conflicts (see the sketch below)
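As a rough illustration of the mapping (a sketch only: the bank count and word size below are assumptions typical of current GPUs, not values fixed by OpenCL; check vendor documentation for the real figures):

    /* Hypothetical bank-mapping sketch: which bank serves a given
       byte address, assuming 4-byte (32-bit) words and 32 banks. */
    unsigned bank_of(unsigned byte_addr)
    {
        const unsigned WORD_SIZE = 4;   /* 32-bit words, per the slide */
        const unsigned NUM_BANKS = 32;  /* assumed; device-specific    */
        return (byte_addr / WORD_SIZE) % NUM_BANKS;
    }

Under these assumptions, thread i reading 32-bit word i hits bank i mod 32, so a group of threads reading successive words spreads across distinct banks.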

Bank Conflicts – Local Memory
Bank conflicts have the largest negative effect on local memory operations
– Local memory does not require that accesses are to sequentially increasing elements
Accesses from successive threads should target different memory banks
– Threads accessing sequentially increasing data will fall into this category
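A minimal OpenCL sketch of the two cases (the kernel name, buffer size, and stride are illustrative assumptions):

    __kernel void bank_demo(__global const float *in, __global float *out)
    {
        __local float lds[256];            /* hypothetical scratch buffer */
        int lid = get_local_id(0);

        /* Unit stride: work-item i targets bank i -- no conflicts. */
        lds[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Stride 2: on a 32-bank device, work-items i and i+16 target
           the same bank -- a 2-way conflict that serialises the access. */
        out[get_global_id(0)] = lds[(2 * lid) % 256];
    }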

Bank Conflicts – Local Memory
On AMD hardware, a wavefront that generates bank conflicts stalls until all local memory operations complete
– The hardware does not hide the stall by switching to another wavefront
The following examples show local memory access patterns and whether conflicts are generated
– For readability, only 8 memory banks are shown

Bank Conflicts – Local Memory
If there are no bank conflicts, each bank can return an element without any delay
– Both of the following patterns will complete without stalls on current GPU hardware
(Figure: two access patterns, each mapping threads to distinct memory banks)

Bank Conflicts – Local Memory
If multiple accesses occur to the same bank, then the bank with the most conflicts determines the latency
– The following pattern will take 3 times the access latency to complete
(Figure: access pattern with a 3-way conflict on one memory bank)

Bank Conflicts – Local Memory
If all accesses are to the same address, then the bank can perform a broadcast and no delay is incurred
– The following will take only one access to complete, assuming the same data element is accessed
(Figure: all threads reading one address from a single memory bank)
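A minimal sketch of the broadcast case (the kernel and variable names are assumptions):

    __kernel void broadcast_demo(__global float *out)
    {
        __local float shared_val;
        if (get_local_id(0) == 0)
            shared_val = 42.0f;            /* one work-item writes */
        barrier(CLK_LOCAL_MEM_FENCE);
        /* Every work-item reads the same address, so the bank can
           broadcast the value in a single access -- no conflict. */
        out[get_global_id(0)] = shared_val;
    }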

Bank Conflicts – Global Memory
Bank conflicts in global memory follow the same principles; however, the global memory bus makes the impact of conflicts more subtle
– Since accessing data in global memory requires that an entire bus-line be read, bank conflicts within a work-group have a similar effect to non-coalesced accesses
– If threads reading from global memory have a bank conflict, then by definition it manifests as a non-coalesced access
– Not all non-coalesced accesses are bank conflicts, however
The ideal case for global memory is when different work-groups read from different banks
– In reality, this is a very low-level optimization and should not be prioritized when first writing a program
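For contrast, a hedged sketch of coalesced versus strided global reads (names are assumptions; the buffer is assumed large enough for the strided access):

    __kernel void global_demo(__global const float *in,
                              __global float *out, int stride)
    {
        int gid = get_global_id(0);
        /* Consecutive work-items read consecutive words: the requests
           fall on one bus-line and coalesce into a single transaction. */
        float a = in[gid];
        /* With stride > 1, the same requests spread over many bus-lines:
           uncoalesced, and potentially conflicting on banks. */
        float b = in[gid * stride];
        out[gid] = a + b;
    }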

Summary
GPU memory is different from CPU memory
– The goal is high throughput instead of low latency
Memory access patterns have a huge impact on bus utilization
– Low utilization means low performance
Coalesced memory accesses and the avoidance of bank conflicts are required for high-performance code
Specific hardware information (such as bus width, number of memory banks, and number of threads that coalesce memory requests) is GPU-specific and can be found in vendor documentation

The Big Picture
Introduction to Heterogeneous Systems
OpenCL Basics
Memory Issues
Optimisations

Thread Mapping
Consider a serial matrix multiplication algorithm (a sketch follows below)
This algorithm is suited for output data decomposition
– We will create N×M threads, effectively removing the outer two loops
– Each thread will perform P calculations; the inner loop remains part of the kernel
Should the index space be MxN or NxM?
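The serial algorithm itself is not reproduced in the transcript; a minimal C sketch (assuming row-major storage and C = A*B with A of size NxP and B of size PxM) might look like this:

    /* Serial matrix multiplication: C (NxM) = A (NxP) * B (PxM).
       The two outer loops become the OpenCL index space; the inner
       loop over P stays inside the kernel. */
    void matmul(const float *A, const float *B, float *C,
                int N, int M, int P)
    {
        for (int i = 0; i < N; i++)           /* rows of C    */
            for (int j = 0; j < M; j++) {     /* columns of C */
                float sum = 0.0f;
                for (int k = 0; k < P; k++)   /* P multiply-adds */
                    sum += A[i * P + k] * B[k * M + j];
                C[i * M + j] = sum;
            }
    }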

Thread Mapping
Thread mapping 1 uses an MxN index space; thread mapping 2 uses an NxM index space
Both mappings produce functionally equivalent versions of the program; reconstructed sketches of the two kernels follow below
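The kernels appeared only as images in the original slides; the following is a hedged reconstruction under the same row-major assumptions as the serial sketch above (which dimension carries M and which carries N is itself an assumption — what matters is what dimension 0, the fastest-varying ID, walks):

    /* Mapping 1: dimension 0 (tx) walks rows of C, so consecutive
       work-items touch elements that are a whole row apart. */
    __kernel void matmul_mapping1(__global const float *A,
                                  __global const float *B,
                                  __global float *C, int M, int P)
    {
        int tx = get_global_id(0);    /* row index    */
        int ty = get_global_id(1);    /* column index */
        float sum = 0.0f;
        for (int k = 0; k < P; k++)
            sum += A[tx * P + k] * B[k * M + ty];
        C[tx * M + ty] = sum;
    }

    /* Mapping 2: dimension 0 (tx) walks columns of C, so consecutive
       work-items touch consecutive elements of B and C. */
    __kernel void matmul_mapping2(__global const float *A,
                                  __global const float *B,
                                  __global float *C, int M, int P)
    {
        int tx = get_global_id(0);    /* column index */
        int ty = get_global_id(1);    /* row index    */
        float sum = 0.0f;
        for (int k = 0; k < P; k++)
            sum += A[ty * P + k] * B[k * M + tx];
        C[ty * M + tx] = sum;
    }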

Thread Mapping
(Figure: execution times of the two thread mappings on NVIDIA GeForce 285 and 8800 GPUs)
Notice that mapping 2 is far superior in performance on both GPUs

Thread Mapping
In mapping 1, consecutive threads (tx) are mapped to different rows of Matrix C, and non-consecutive threads (ty) are mapped to columns of Matrix B
– This mapping causes inefficient memory accesses

Thread Mapping
In mapping 2, consecutive threads (tx) are mapped to consecutive elements in Matrices B and C
– Accesses to both of these matrices will be coalesced
– The degree of coalescence depends on the workgroup and data sizes

Matrix Transpose
A matrix transpose is a straightforward operation: Out(x,y) = In(y,x)
No matter which thread mapping is chosen, one operation (read/write) will produce coalesced accesses while the other (write/read) produces uncoalesced accesses
– Note that data must be read to a temporary location (such as a register) before being written to the new location
(Figure: with one mapping the reads are coalesced and the writes uncoalesced; with the other mapping, the reverse)
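A hedged sketch of the straightforward kernel (assuming the index space exactly covers the matrix):

    /* Naive transpose: the read from In is coalesced (consecutive
       work-items read consecutive words of a row), but the write to
       Out is strided by the matrix height and therefore uncoalesced. */
    __kernel void transpose_naive(__global const float *in,
                                  __global float *out,
                                  int width, int height)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        float v = in[y * width + x];   /* coalesced read into a register */
        out[x * height + y] = v;       /* uncoalesced, strided write     */
    }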

Matrix Transpose
If local memory is used to buffer the data between reading and writing, we can rearrange the thread mapping to provide coalesced accesses in both directions
– Note that the work-group must be square
(Figure: a square tile is read coalesced from global memory into local memory, then written back coalesced using swapped global/local indices)
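A sketch of the local-memory version (the 16x16 tile size is an assumption, and the matrix dimensions are assumed to be multiples of the tile size):

    #define TILE 16   /* assumed square work-group size */

    /* Tiled transpose: a square tile is staged in local memory so that
       both the global read and the global write are coalesced. */
    __kernel void transpose_local(__global const float *in,
                                  __global float *out,
                                  int width, int height)
    {
        /* +1 padding keeps column reads of the tile conflict-free */
        __local float tile[TILE][TILE + 1];

        int gx = get_global_id(0), gy = get_global_id(1);
        int lx = get_local_id(0),  ly = get_local_id(1);

        tile[ly][lx] = in[gy * width + gx];      /* coalesced read */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Swap work-group coordinates so the write is also coalesced. */
        int ox = get_group_id(1) * TILE + lx;
        int oy = get_group_id(0) * TILE + ly;
        out[oy * height + ox] = tile[lx][ly];    /* coalesced write */
    }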

Matrix Transpose
(Figure: performance comparison of the two transpose kernels for matrices of size NxM on an AMD 5870 GPU)
– "Optimized" uses local memory and thread remapping

Runtimes on Fermi
All times in msec; matrix size: 4096x4096
HOST: 302/683 !!!
(Table of kernel runtimes not preserved in the transcript)

What happened?
(Figure: threads 0, 1, 2 and 3 issue successive reads on the uncoalesced side; each read is captured by the cache, which acts like local memory)

What happened?
(Figure: subsequent accesses become coalesced reads from the cache followed by coalesced writes to Out, repeating for each group of threads)

Summary
When it comes to performance, memory throughput and latency hiding are key!
The main tools are:
– Memory choice (global/local/private)
– Memory layout (coalescing & indexing)
– Thread mapping
– Workgroup size (synchronisation & latency hiding)