A lighthearted introduction to GPGPUs Dr. Keith Schubert

Slides:



Advertisements
Similar presentations
CS179: GPU Programming Lecture 5: Memory. Today GPU Memory Overview CUDA Memory Syntax Tips and tricks for memory handling.
Advertisements

More on threads, shared memory, synchronization
What is GPGPU? Many of these slides are taken from Henry Neeman’s presentation at the University of Oklahoma.
Λλ Fernando Magno Quintão Pereira P ROGRAMMING L ANGUAGES L ABORATORY Universidade Federal de Minas Gerais - Department of Computer Science P ROGRAM A.
NVIDIA Research Parallel Computing on Manycore GPUs Vinod Grover.
CS 179: GPU Computing Lecture 2: The Basics. Recap Can use GPU to solve highly parallelizable problems – Performance benefits vs. CPU Straightforward.
Basic CUDA Programming Shin-Kai Chen VLSI Signal Processing Laboratory Department of Electronics Engineering National Chiao.
Parallel Programming using CUDA. Traditional Computing Von Neumann architecture: instructions are sent from memory to the CPU Serial execution: Instructions.
CUDA (Compute Unified Device Architecture) Supercomputing for the Masses by Peter Zalutski.
1 ITCS 6/8010 CUDA Programming, UNC-Charlotte, B. Wilkinson, March 3, 2011 ConstantMemTiming.ppt Measuring Performance of Constant Memory These notes will.
CUDA Grids, Blocks, and Threads
Using Random Numbers in CUDA ITCS 4/5145 Parallel Programming Spring 2012, April 12a, 2012.
Programming of multiple GPUs with CUDA and Qt library
CUDA C/C++ BASICS NVIDIA Corporation © NVIDIA 2013.
© David Kirk/NVIDIA and Wen-mei W. Hwu, , SSL 2014, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.
GPU Programming EPCC The University of Edinburgh.
An Introduction to Programming with CUDA Paul Richmond
GPU Parallel Computing Zehuan Wang HPC Developer Technology Engineer
Nvidia CUDA Programming Basics Xiaoming Li Department of Electrical and Computer Engineering University of Delaware.
ME964 High Performance Computing for Engineering Applications “They have computers, and they may have other weapons of mass destruction.” Janet Reno, former.
Programming Massively Parallel Processors Using CUDA & C++AMP Lecture 1 - Introduction Wen-mei Hwu, Izzat El Hajj CEA-EDF-Inria Summer School 2013.
More CUDA Examples. Different Levels of parallelism Thread parallelism – each thread is an independent thread of execution Data parallelism – across threads.
Basic CUDA Programming Computer Architecture 2014 (Prof. Chih-Wei Liu) Final Project – CUDA Tutorial TA Cheng-Yen Yang
CUDA and GPU Training Workshop April 21, 2014
GPU Programming David Monismith Based on notes taken from the Udacity Parallel Programming Course.
CUDA and GPU Training: Sessions 1 & 2 April 16 & 23, 2012 University of Georgia CUDA Teaching Center.
First CUDA Program. #include "stdio.h" int main() { printf("Hello, world\n"); return 0; } #include __global__ void kernel (void) { } int main (void) {
Basic CUDA Programming Computer Architecture 2015 (Prof. Chih-Wei Liu) Final Project – CUDA Tutorial TA Cheng-Yen Yang
High Performance Computing with GPUs: An Introduction Krešimir Ćosić, Thursday, August 12th, LSST All Hands Meeting 2010, Tucson, AZ GPU Tutorial:
CUDA All material not from online sources/textbook copyright © Travis Desell, 2012.
CIS 565 Fall 2011 Qing Sun
Lecture 8 : Manycore GPU Programming with CUDA Courtesy : Prof. Christopher Cooper’s and Prof. Chowdhury’s course note slides are used in this lecture.
GPU Architecture and Programming
Parallel Processing1 GPU Program Optimization (CS 680) Parallel Programming with CUDA * Jeremy R. Johnson *Parts of this lecture was derived from chapters.
Lecture 6: Shared-memory Computing with GPU. Free download NVIDIA CUDA a-downloads CUDA programming on visual studio.
CS 193G Lecture 2: GPU History & CUDA Programming Basics.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 Introduction to CUDA C (Part 2)
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.
OpenCL Joseph Kider University of Pennsylvania CIS Fall 2011.
Martin Kruliš by Martin Kruliš (v1.0)1.
Introduction to CUDA CAP 4730 Spring 2012 Tushar Athawale.
Lecture 8 : Manycore GPU Programming with CUDA Courtesy : SUNY-Stony Brook Prof. Chowdhury’s course note slides are used in this lecture note.
© David Kirk/NVIDIA and Wen-mei W. Hwu, CS/EE 217 GPU Architecture and Programming Lecture 2: Introduction to CUDA C.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE 8823A GPU Architectures Module 2: Introduction.
1 ITCS 5/4010 Parallel computing, B. Wilkinson, Jan 14, CUDAMultiDimBlocks.ppt CUDA Grids, Blocks, and Threads These notes will introduce: One dimensional.
CUDA Simulation Benjy Kessler.  Given a brittle substance with a crack in it.  The goal is to study how the crack propagates in the substance as a function.
© David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign 1 ECE408 / CS483 Applied Parallel Programming.
Programming with CUDA WS 08/09 Lecture 2 Tue, 28 Oct, 2008.
Introduction to CUDA Programming CUDA Programming Introduction Andreas Moshovos Winter 2009 Some slides/material from: UIUC course by Wen-Mei Hwu and David.
CS 179: GPU Programming Lecture 7. Week 3 Goals: – More involved GPU-accelerable algorithms Relevant hardware quirks – CUDA libraries.
Unit -VI  Cloud and Mobile Computing Principles  CUDA Blocks and Treads  Memory handling with CUDA  Multi-CPU and Multi-GPU solution.
CUDA C/C++ Basics Part 2 - Blocks and Threads
Lecture 20 Computing with Graphical Processing Units
CUDA Introduction Martin Kruliš by Martin Kruliš (v1.1)
Basic CUDA Programming
CUDA and OpenCL Kernels
Some things are naturally parallel
Slides from “PMPP” book
CS 179: GPU Programming Lecture 7.
CUDA Grids, Blocks, and Threads
Antonio R. Miele Marco D. Santambrogio Politecnico di Milano
Using Shared memory These notes will demonstrate the improvements achieved by using shared memory, with code and results running on coit-grid06.uncc.edu.
Antonio R. Miele Marco D. Santambrogio Politecnico di Milano
ECE 8823A GPU Architectures Module 2: Introduction to CUDA C
CUDA Grids, Blocks, and Threads
General Purpose Graphics Processing Units (GPGPUs)
GPU Lab1 Discussion A MATRIX-MATRIX MULTIPLICATION EXAMPLE.
Quiz Questions CUDA ITCS 4/5145 Parallel Programming, UNC-Charlotte, B. Wilkinson, 2013, QuizCUDA.ppt Nov 12, 2014.
CUDA Introduction Martin Kruliš by Martin Kruliš (v1.0)
Parallel Computing 18: CUDA - I
Presentation transcript:

A lighthearted introduction to GPGPUs Dr. Keith Schubert Warp Speed A lighthearted introduction to GPGPUs Dr. Keith Schubert

Why Bother?

Jacquard Looms Again

Thread Block Grid Registers (F) type var[n]; Local Memory(S) Index threadIdx.x Shared Memory (F) __shared__ type var; Index blockIdx.x Size blockDim.x Constant Memory (F) __constant__ type var; Global Memory (S) __device__ type var;

CPU (host) GPU (device) CPU (host) GPU (device) CPU (host) GPU (device)

Amdahl’s Law

Vector Addition Embarrassingly parallel c[i]=a[i]+b[i]

Kernel __global__ void vector_add(const float *a, const float *b, float *c, const size_t n){ unsigned int i = threadIdx.x + blockDim.x * blockIdx.x; if(i < n) c[i] = a[i] + b[i]; }

Setup Constants int main(void){ const int n_e = 1<<20; const int n_b = n_e * sizeof(float); const size_t n_tpb = 256; size_t n_bl = n_e / n_tpb; if(n_e % n_t) n_bl++;

Pointers float *a_d = 0; float *b_d = 0; float *c_d = 0; float *a_h = 0; float *b_h = 0; float *c_h = 0; a_h = (float*)malloc(n_b); b_h = (float*)malloc(n_b); c_h = (float*)malloc(n_b); cudaMalloc((void**)&a_d, n_b); cudaMalloc((void**)&b_d, n_b); cudaMalloc((void**)&c_d, n_b);

Verify and Initialize if(a_h == 0 || b_h == 0 || c_h == 0 || a_d == 0 || b_d == 0 || c_d == 0){ printf("Out of memory. We wish to hold the whole sky, but we never will.\n"); return 1; } for(int i = 0; i < n_e; i++){ a_h[i] = i* ((float)rand() / RAND_MAX); b_h[i] = (n_e-i)* ((float)rand() / RAND_MAX);

Using the GPU cudaMemcpy(a_d, a_h, n_b, cudaMemcpyHostToDevice); cudaMemcpy(b_d, b_h, n_b, cudaMemcpyHostToDevice); vector_add<<<n_bl, n_tpb>>>(a_d, b_d, c_d, n_e); cudaMemcpy(c_h, c_d, n_b, cudaMemcpyDeviceToHost);

Cleanup for(int i = 0; i < 20; i++) printf("[%d] %7.1f + %7.1f = %7.1f\n", i, a_h[i], b_h[i], c_h[i]); free(a_h); free(b_h); free(c_h); cudaFree(a_d); cudaFree(b_d); cudaFree(c_d); }

?