CS179: GPU Programming Lecture 5: Memory. Today GPU Memory Overview CUDA Memory Syntax Tips and tricks for memory handling.

Presentation transcript:

CS179: GPU Programming Lecture 5: Memory

Today. GPU memory overview. CUDA memory syntax. Tips and tricks for memory handling.

Memory Overview. Very slow access: between host and device. Slow access: global memory, and local memory (which lives in the same off-chip device memory). Fast access: shared memory, constant memory, texture memory. Very fast access: register memory.
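The sizes behind this hierarchy differ from device to device. As a rough sketch (querying device 0 is an assumption; the fields printed are members of the CUDA runtime's cudaDeviceProp struct), they can be read at run time like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    // Device index 0 is assumed here; a multi-GPU machine may want another.
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("No CUDA device found\n");
        return 1;
    }
    printf("Device:                  %s\n", prop.name);
    printf("Global memory:           %zu bytes\n", prop.totalGlobalMem);
    printf("Constant memory:         %zu bytes\n", prop.totalConstMem);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    return 0;
}
```

The numbers printed depend entirely on the installed GPU, so none are quoted here.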

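Global Memory. Read/write. Shared between blocks and grids. Same across multiple kernel executions. Very slow to access. No caching on the earliest CUDA hardware! (Devices of compute capability 2.0 and later cache global accesses in L2.)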

Constant Memory. Read-only in device. Cached in multiprocessor. Fairly quick. Cache can broadcast to all active threads.

Texture Memory. Read-only in device. 2D cached -- quick access. Filtering methods available.

Shared Memory. Read/write per block. Memory is shared within a block. Generally quick. Has bad worst cases.

Local Memory. Read/write per thread. Not too fast (stored off-chip, in device memory). Each thread can only see its own local memory. Indexable (can do arrays).

Register Memory. Read/write per thread. Extremely fast. Each thread can only see its own register memory. Not indexable (can't do dynamically indexed arrays).

Syntax: Register Memory. Default memory type. Declare as normal -- no special syntax: int var = 1; Only accessible by the current thread.

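Syntax: Local Memory. Per-thread storage that the compiler places in off-chip device memory. There is no special syntax: __local__ is not a documented CUDA C qualifier, and the compiler decides the placement. Automatic variables typically end up in local memory when they are arrays indexed with values not known at compile time, when they are large arrays or structures, or when the kernel needs more registers than are available (register spilling). Declare as normal: int arr[64];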
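Whether a per-thread variable lands in registers or in local memory is up to the compiler, so it is worth checking rather than guessing. The sketch below uses a hypothetical kernel, perThreadScratch, with a run-time-indexed private array, and asks the runtime how many registers and how much local memory the compiled kernel uses. The kernel is never launched; only its attributes are queried.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread fills a private array and then reads it
// at an index known only at run time. Arrays used like this are typical
// candidates for local memory, but the compiler has the final say.
__global__ void perThreadScratch(const int *in, int *out, int n) {
    int scratch[64];                          // automatic, private to this thread
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        for (int i = 0; i < 64; ++i) scratch[i] = in[(tid + i) % n];
        out[tid] = scratch[in[tid] & 63];     // run-time index
    }
}

int main() {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, perThreadScratch) != cudaSuccess) {
        printf("Could not query kernel attributes\n");
        return 1;
    }
    // Both values depend on the compiler version and target architecture.
    printf("Registers per thread:    %d\n", attr.numRegs);
    printf("Local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return 0;
}
```

A nonzero local memory figure would indicate that the compiler did move the array (or spilled registers) off-chip; an optimizing compiler may also find a way to avoid it.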

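Syntax: Shared Memory. Shared across threads in a block, not across blocks. Declare with the __shared__ keyword (__device__ __shared__ is also accepted): __shared__ int var[256]; A statically declared array needs a compile-time size. To choose the size at launch instead, declare extern __shared__ int var[]; and pass the size in bytes as the third kernel launch parameter. Pointers into shared memory can be used inside a kernel, as can array syntax.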
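To make the two declaration styles concrete, here is a minimal sketch with two hypothetical kernels, reverseStatic and reverseDynamic, that each reverse the elements handled by one block. The block size of 128 and the kernel names are assumptions made for the example.

```cuda
#include <cstdio>
#include <cassert>
#include <cuda_runtime.h>

#define BLOCK 128

// Static shared array: the size is fixed at compile time.
__global__ void reverseStatic(int *data) {
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;
    int base = blockIdx.x * BLOCK;
    tile[t] = data[base + t];
    __syncthreads();                      // wait until the whole tile is loaded
    data[base + t] = tile[BLOCK - 1 - t];
}

// Dynamic shared array: the size comes from the third launch parameter.
__global__ void reverseDynamic(int *data) {
    extern __shared__ int tile[];
    int t = threadIdx.x;
    int n = blockDim.x;
    int base = blockIdx.x * n;
    tile[t] = data[base + t];
    __syncthreads();
    data[base + t] = tile[n - 1 - t];
}

int main() {
    const int n = 2 * BLOCK;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    reverseStatic<<<n / BLOCK, BLOCK>>>(d);
    reverseDynamic<<<n / BLOCK, BLOCK, BLOCK * sizeof(int)>>>(d);

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    // Reversing each block twice restores the original order.
    for (int i = 0; i < n; ++i) assert(h[i] == i);
    printf("both kernels reversed their blocks\n");
    return 0;
}
```

The dynamic form is the one to reach for when the block size is only known at run time; the static form is simpler when it is not.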

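Syntax: Global Memory. Allocate from the host with cudaMalloc. The resulting device pointer can be passed from the host to a kernel. Transfer between host and device is slow! A statically allocated global variable is declared with the __device__ keyword: __device__ int var = 1;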
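Both ways of getting data into global memory can appear in one program. The following sketch assumes a hypothetical kernel scaleArray and a __device__ variable named scale; the array itself is allocated with cudaMalloc and copied with cudaMemcpy.

```cuda
#include <cstdio>
#include <cassert>
#include <cuda_runtime.h>

__device__ int scale = 3;                 // statically allocated global memory

// Multiplies every element of a global-memory array by `scale`.
__global__ void scaleArray(int *a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= scale;
}

int main() {
    const int n = 1000;
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = i;

    int *d = nullptr;                     // device pointer, owned by the host
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);   // slow path

    scaleArray<<<(n + 255) / 256, 256>>>(d, n);

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);   // slow path
    cudaFree(d);

    for (int i = 0; i < n; ++i) assert(h[i] == 3 * i);
    printf("array scaled on the device\n");
    return 0;
}
```

The two cudaMemcpy calls are the host-device transfers the slide warns about, so real code tends to keep data on the device across several kernel launches.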

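Syntax: Constant Memory. Declare with the __constant__ keyword (__device__ __constant__ is also accepted): __constant__ int var = 1; Set from the host using cudaMemcpyToSymbol (or cudaMemcpy with the address obtained from cudaGetSymbolAddress): cudaMemcpyToSymbol(var, src, count); where src is a host pointer and count is a size in bytes.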
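A small worked case may help here. The sketch below stores four polynomial coefficients in constant memory (the names coeffs and evalCubic are made up for the example) and has every thread read the same coefficients, which is the access pattern the constant cache is built for.

```cuda
#include <cstdio>
#include <cassert>
#include <cmath>
#include <cuda_runtime.h>

__constant__ float coeffs[4];             // read-only on the device

// Evaluates c0 + c1*x + c2*x^2 + c3*x^3 with Horner's rule.
__global__ void evalCubic(const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i];
        y[i] = coeffs[0] + v * (coeffs[1] + v * (coeffs[2] + v * coeffs[3]));
    }
}

int main() {
    const int n = 16;
    float hx[n], hy[n];
    const float hc[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    for (int i = 0; i < n; ++i) hx[i] = (float)i;

    float *dx = nullptr, *dy = nullptr;
    if (cudaMalloc(&dx, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&dy, n * sizeof(float)) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(coeffs, hc, sizeof(hc));   // host -> constant memory

    evalCubic<<<1, n>>>(dx, dy, n);
    cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);

    for (int i = 0; i < n; ++i) {
        float v = (float)i;
        float expected = 1.0f + v * (2.0f + v * (3.0f + v * 4.0f));
        assert(fabsf(hy[i] - expected) < 1e-3f);
    }
    printf("polynomial evaluated with constant-memory coefficients\n");
    return 0;
}
```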

Syntax: Texture Memory To be discussed later…

Memory Issues. Each multiprocessor has a fixed amount of shared memory. This limits the number of blocks that can be resident on it at once: (# of resident blocks) x (shared memory used per block) <= shared memory per multiprocessor. Either get lots of blocks using little memory, or fewer blocks using lots of memory.
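The trade-off can be observed directly. As a sketch, the CUDA runtime's occupancy query reports how many blocks of a given kernel fit on one multiprocessor for a given amount of dynamic shared memory. The kernel sharedUser, the block size of 128, and the list of sizes are assumptions picked for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel that uses a dynamically sized shared buffer.
__global__ void sharedUser(float *out) {
    extern __shared__ float buf[];
    buf[threadIdx.x] = (float)threadIdx.x;
    __syncthreads();
    out[blockIdx.x * blockDim.x + threadIdx.x] = buf[blockDim.x - 1 - threadIdx.x];
}

int main() {
    const int threads = 128;
    const size_t sizes[] = {1024, 4096, 16384, 32768};   // bytes per block

    for (size_t bytes : sizes) {
        int blocks = 0;
        cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks, sharedUser, threads, bytes);
        if (err != cudaSuccess) {
            printf("query failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("%6zu bytes of shared memory per block -> %d resident blocks\n",
               bytes, blocks);
    }
    return 0;
}
```

On most devices the block count should fall as the per-block request grows, but the exact figures depend on the hardware and are not quoted here.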

Memory Issues. Register memory is limited too! Registers are divided among threads much as shared memory is divided among blocks. Can have many threads using fewer registers each, or few threads using many registers each. The former is usually better: more parallelism.

Memory Issues. Global accesses: slow! Can be sped up when the threads of a warp access contiguous memory. Memory coalescing: the hardware combines a warp's accesses into as few memory transactions as possible. Coalesced accesses are: contiguous accesses, in-order accesses, aligned accesses.

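Memory Coalescing: Aligned Accesses. Threads read 4, 8, or 16 bytes at a time from global memory. Accesses must be aligned in memory, i.e. the address must be a multiple of the access size! Good (figure): address labels 0x00, 0x04, 0x14. Bad (figure): address labels 0x00, 0x07, 0x14. Which is worse, reading 16 bytes from 0xABCD0 or 0xABCDE? (0xABCDE: it is not a multiple of 16.)

Memory Coalescing: Aligned Accesses. Also bad: beginning unaligned.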

Memory Coalescing: Aligned Accesses. Built-in vector types carry their own alignment. float4 (16B) is 16-byte aligned. float3 (12B) is only 4-byte aligned, so float3 arrays are not aligned for 16-byte accesses! To align a struct, use __align__(x) // x = 4, 8, 16. cudaMalloc aligns the start of each allocation automatically. cudaMallocPitch aligns the start of each row for 2D arrays.
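These size and alignment facts can be checked at compile time. The sketch below assumes a C++11-capable nvcc (for static_assert) and two illustrative structs, Vec3 and Vec3Aligned; it also requests a pitched 2D allocation to show that the row pitch may be larger than the requested row width.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct Vec3 { float x, y, z; };                       // 12 bytes, 4-byte aligned
struct __align__(16) Vec3Aligned { float x, y, z; };  // padded out to 16 bytes

static_assert(sizeof(float3) == 12, "float3 is three packed floats");
static_assert(sizeof(float4) == 16, "float4 is four floats");
static_assert(sizeof(Vec3) == 12, "plain struct is not padded");
static_assert(sizeof(Vec3Aligned) == 16, "__align__(16) pads the struct");

int main() {
    const size_t width = 100, height = 8;             // 100 floats per row
    float *d = nullptr;
    size_t pitch = 0;                                 // bytes between row starts

    if (cudaMallocPitch(&d, &pitch, width * sizeof(float), height) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    printf("requested row: %zu bytes, pitch chosen by the runtime: %zu bytes\n",
           width * sizeof(float), pitch);

    // Inside a kernel, row r would start at (char *)d + r * pitch.
    cudaFree(d);
    return 0;
}
```

Padding a 12-byte struct to 16 bytes wastes a quarter of the memory, so it is a trade of space for aligned accesses rather than a free win.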

Memory Coalescing: Contiguous Accesses. Contiguous = memory is together. Example (figure): non-contiguous memory. Threads 3 and 4 swapped accesses!

Memory Coalescing: Contiguous Accesses Which is better? index = threadIdx.x + blockDim.x * (blockIdx.x + gridDim.x * blockIdx.y); index = threadIdx.x + blockDim.y * (blockIdx.y + gridDim.y * blockIdx.x);

Memory Coalescing: Contiguous Accesses Case 1: Contiguous accesses

Memory Coalescing: Contiguous Accesses Case 1: Contiguous accesses

Memory Coalescing: In-order Accesses. Do not skip addresses. Access addresses in order in memory. Bad example (figure): left, address 140 skipped; right, lots of skipped addresses.

Memory Coalescing Good example:

Memory Coalescing. Not as much of an issue in new hardware. Many restrictions relaxed -- e.g., do not need to have sequential access. However, memory coalescing and alignment are still good practice!
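One way to see how much coalescing still matters on a given GPU is to time the same copy with two index patterns. This is a rough sketch, not a careful benchmark: the kernel names, the array size, and the stride of 33 are assumptions, and the timings vary by device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Neighboring threads touch neighboring addresses.
__global__ void copyContiguous(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Neighboring threads touch addresses `stride` elements apart.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int j = (int)(((long long)i * stride) % n);
        out[j] = in[j];
    }
}

int main() {
    const int n = 1 << 22;
    const int stride = 33;   // odd, so every index of the power-of-two array is hit once
    const int blocks = (n + 255) / 256;

    float *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    cudaMemset(in, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msContiguous = 0.0f, msStrided = 0.0f;

    copyContiguous<<<blocks, 256>>>(in, out, n);      // warm-up launch, not timed

    cudaEventRecord(start);
    copyContiguous<<<blocks, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msContiguous, start, stop);

    cudaEventRecord(start);
    copyStrided<<<blocks, 256>>>(in, out, n, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msStrided, start, stop);

    printf("contiguous copy: %.3f ms\n", msContiguous);
    printf("strided copy:    %.3f ms\n", msStrided);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Both kernels move the same number of bytes, so any gap between the two timings comes from the access pattern alone.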

Memory Issues. Shared memory can also be limiting. It is broken up into banks. Optimal when an entire warp reads shared memory together, each thread from a different bank. Banks: each bank services only one address at a time. Bank conflict: when two threads try to access different addresses in the same bank. The conflicting accesses are serialized, which causes slowdowns in the program!

Bank Conflicts Bad: Many threads trying to access the same bank

Bank Conflicts Good: Few to no bank conflicts

Bank Conflicts. Banks service 32-bit words. With 16 banks, the bank is selected by the address mod 64: bank 0 services 0x00, 0x40, 0x80, etc., bank 1 services 0x04, 0x44, 0x84, etc. (Devices of compute capability 2.0 and later have 32 banks, so the pattern repeats every 128 bytes.) Want to avoid multiple threads accessing the same bank. Keep data spread out. Split data that is larger than 4 bytes into multiple accesses. Be careful with data elements with even stride.
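A common way to break up an even stride is to pad a shared array by one element per row. The sketch below is a standard tiled matrix transpose, assuming 32 banks and a 32x32 tile; the kernel name transposePadded and the matrix size are made up for the example. Without the extra column, the threads of a warp reading one tile column would all hit the same bank.

```cuda
#include <cstdio>
#include <cassert>
#include <cuda_runtime.h>

#define TILE 32

// Transposes an n x n matrix, n a multiple of TILE.
__global__ void transposePadded(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];   // +1 column shifts each row by one bank

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read
    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // column read from the tile
}

int main() {
    const int n = 64;
    static float hin[n * n], hout[n * n];
    for (int i = 0; i < n * n; ++i) hin[i] = (float)i;

    float *din = nullptr, *dout = nullptr;
    if (cudaMalloc(&din, n * n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&dout, n * n * sizeof(float)) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    cudaMemcpy(din, hin, n * n * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid(n / TILE, n / TILE);
    transposePadded<<<grid, block>>>(din, dout, n);

    cudaMemcpy(hout, dout, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);

    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            assert(hout[r * n + c] == hin[c * n + r]);
    printf("transpose verified\n");
    return 0;
}
```

The padding costs 32 extra floats of shared memory per block, which is usually a small price for conflict-free column reads.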

Broadcasting. Fast distribution of data to threads. Happens when an entire warp tries to access the same address. The value is broadcast to all threads in one read.

Summary. Best memory management: balances memory optimization with parallelism. Break the problem up into coalesced chunks. Process data in shared memory, then copy back to global. Remember to avoid bank conflicts!
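The load, process, write-back pattern can be sketched with a per-block sum. The kernel name blockSum, the block size of 256, and the input of all ones are assumptions for the example. Each block loads a contiguous chunk into shared memory, reduces it there, and writes a single value back to global memory.

```cuda
#include <cstdio>
#include <cassert>
#include <cuda_runtime.h>

#define THREADS 256

// Each block sums THREADS consecutive elements and writes one result.
__global__ void blockSum(const int *in, int *out, int n) {
    __shared__ int partial[THREADS];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    partial[t] = (i < n) ? in[i] : 0;     // contiguous load from global memory
    __syncthreads();

    // Tree reduction: active threads read consecutive shared-memory words,
    // so they fall into different banks.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (t < s) partial[t] += partial[t + s];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = partial[0];   // one write back to global
}

int main() {
    const int n = 1000;
    const int blocks = (n + THREADS - 1) / THREADS;
    int hin[n], hout[blocks];
    for (int i = 0; i < n; ++i) hin[i] = 1;

    int *din = nullptr, *dout = nullptr;
    if (cudaMalloc(&din, n * sizeof(int)) != cudaSuccess ||
        cudaMalloc(&dout, blocks * sizeof(int)) != cudaSuccess) {
        printf("CUDA allocation failed\n");
        return 1;
    }
    cudaMemcpy(din, hin, n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, THREADS>>>(din, dout, n);

    cudaMemcpy(hout, dout, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(din);
    cudaFree(dout);

    int total = 0;
    for (int b = 0; b < blocks; ++b) total += hout[b];
    assert(total == n);                   // every input element was 1
    printf("sum of %d ones computed on the device\n", n);
    return 0;
}
```

The final sum over blocks is done on the host only to keep the sketch short; a second kernel launch over the per-block results would keep everything on the device.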

Next Time. Texture memory. CUDA applications in graphics.