1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 3, 2011 ConstantMemTiming.ppt
Measuring Performance of Constant Memory
These notes will introduce: Results of an experiment using constant memory

2 Program
The test program simply adds two vectors A and B together to produce a third vector, C.
One version holds A and B in constant memory; another version holds A and B in regular global memory.
Note: the maximum constant memory available on the GPU is 64 Kbytes total (for all compute capabilities so far).

3 Code: Array declarations

#define N 8192   // max size allowed for two vectors in constant memory

// Constants held in constant memory
__device__ __constant__ int dev_a_Cont[N];
__device__ __constant__ int dev_b_Cont[N];

// Regular global memory for comparison
__device__ int dev_a[N];
__device__ int dev_b[N];

// Result in device global memory
__device__ int dev_c[N];

4 Kernel routines

__global__ void add_Cont() {       // using constant memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        dev_c[tid] = dev_a_Cont[tid] + dev_b_Cont[tid];
    }
}

__global__ void add() {            // not using constant memory
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < N) {
        dev_c[tid] = dev_a[tid] + dev_b[tid];
    }
}

5 GPU using constant memory

printf("GPU using constant memory\n");
for (int i = 0; i < N; i++) {      // load arrays with some numbers
    a[i] = i;
    b[i] = i * 2;
}

// copy vectors to constant memory
cudaMemcpyToSymbol(dev_a_Cont, a, N*sizeof(int), 0, cudaMemcpyHostToDevice);
cudaMemcpyToSymbol(dev_b_Cont, b, N*sizeof(int), 0, cudaMemcpyHostToDevice);

cudaEventRecord(start, 0);         // start time
add_Cont<<<B,T>>>();               // does not need array ptrs
                                   // (grid/block dims omitted in the transcript)
cudaThreadSynchronize();           // wait for all threads to complete
cudaEventRecord(stop, 0);          // end time

cudaMemcpyFromSymbol(a, "dev_a_Cont", N*sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaMemcpyFromSymbol(b, "dev_b_Cont", N*sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaMemcpyFromSymbol(c, "dev_c", N*sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_Cont, start, stop);

Watch for the zero (the offset argument) in cudaMemcpyToSymbol -- it was missed originally and took some time to spot.

6 GPU not using constant memory

printf("GPU not using constant memory\n");
for (int i = 0; i < N; i++) {      // load arrays with some numbers
    a[i] = i;
    b[i] = i * 2;
}

// copy vectors to regular global memory
cudaMemcpyToSymbol(dev_a, a, N*sizeof(int), 0, cudaMemcpyHostToDevice);
cudaMemcpyToSymbol(dev_b, b, N*sizeof(int), 0, cudaMemcpyHostToDevice);

cudaEventRecord(start, 0);         // start time
add<<<B,T>>>();                    // does not need array ptrs
cudaThreadSynchronize();           // wait for all threads to complete
cudaEventRecord(stop, 0);          // end time

cudaMemcpyFromSymbol(a, "dev_a", N*sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaMemcpyFromSymbol(b, "dev_b", N*sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaMemcpyFromSymbol(c, "dev_c", N*sizeof(int), 0, cudaMemcpyDeviceToHost);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time, start, stop);

7 Results
Speedup around 1.2 after first launch (20%).
Per-run figures: 1st launch, …; 3rd run, …; 2nd run, 1.217.

Questions