GPU Programming with CUDA – CUDA 5 and 6
Paul Richmond


Overview
- Dynamic Parallelism (CUDA 5+)
- GPU Object Linking (CUDA 5+)
- Unified Memory (CUDA 6+)
- Other Developer Tools

Dynamic Parallelism
- Before CUDA 5, kernels could only be launched from the host, giving limited ability to implement recursive algorithms.
- Dynamic Parallelism allows new kernels to be launched from threads running on the device, enabling:
  - Improved load balancing
  - Deep recursion

[Figure: CPU and GPU timelines showing Kernels A–D, with some kernels launched directly from the device]

An Example

//Host Code
...
A<<<...>>>(data);
B<<<...>>>(data);
C<<<...>>>(data);

//Kernel Code
__global__ void vectorAdd(float *data)
{
    do_stuff(data);
    X<<<...>>>(data);
    do_more_stuff(data);
}
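To make the device-side launch concrete, here is a minimal compilable sketch (the kernel names, sizes, and work done are illustrative assumptions, not from the slides). It assumes a Compute Capability 3.5+ device and compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt.

// Child kernel: illustrative per-element work.
__global__ void childKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// Parent kernel: launches the child directly from the device,
// with no round trip to the host.
__global__ void parentKernel(float *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
        cudaDeviceSynchronize();   // wait for the child grid to finish
    }
}

int main()
{
    const int n = 1024;
    float *data;
    cudaMalloc(&data, n * sizeof(float));
    cudaMemset(data, 0, n * sizeof(float));
    parentKernel<<<1, 1>>>(data, n);
    cudaDeviceSynchronize();
    cudaFree(data);
    return 0;
}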

GPU Object Linking
- CUDA 4 required a single source file per kernel: compiled device code could not be linked.
- CUDA 5.0+ allows different object files to be linked, so kernels and host code can be built independently.

[Figure: a.cu, b.cu and c.cu are compiled separately to a.o, b.o and c.o, then linked with main.cpp into program.exe]
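As a sketch, the build in the figure maps onto the nvcc toolchain roughly as follows (exact flags can vary by toolkit version):

nvcc -dc a.cu -o a.o    # -dc compiles to relocatable device code
nvcc -dc b.cu -o b.o
nvcc -dc c.cu -o c.o
nvcc main.cpp a.o b.o c.o -o program.exe    # nvcc also performs the device link step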

GPU Object Linking
- Objects can also be built into static libraries that are shared by different sources, giving:
  - Much better code reuse
  - Reduced compilation time
  - Closed-source device libraries

[Figure: a.cu and b.cu (and further sources such as foo.cu and bar.cu) are compiled into the static library ab.culib, which is linked with main.cpp to build program.exe and with main2.cpp to build program2.exe]
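Continuing the sketch above, the library variant might look like this (the ab.culib name comes from the figure; flags are again version-dependent):

nvcc -dc a.cu b.cu               # produces a.o and b.o
nvcc -lib a.o b.o -o ab.culib    # archive the device objects into a static library
nvcc main.cpp ab.culib -o program.exe
nvcc main2.cpp ab.culib -o program2.exe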

Unified Memory
- The traditional developer view is that the GPU and CPU have separate memories:
  - Memory must be explicitly copied between them
  - Complex data structures require deep copies
- Unified Memory changes that view: a single pointer makes the data accessible anywhere, so porting code becomes much simpler.

[Figure: separate System Memory (CPU) and GPU Memory replaced by a single Unified Memory shared by CPU and GPU]
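To see why deep copies are painful, consider this illustrative sketch (not from the slides) of passing a struct with a pointer member to a kernel without Unified Memory:

#include <cstdlib>
#include <cstring>

struct Record { int n; float *values; };

__global__ void scale(Record r)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < r.n)
        r.values[i] *= 2.0f;
}

int main()
{
    Record h;
    h.n = 256;
    h.values = (float *)malloc(h.n * sizeof(float));
    memset(h.values, 0, h.n * sizeof(float));

    // Deep copy: the struct itself can be passed by value, but its
    // pointer member must be re-pointed at a separate device
    // allocation and its contents copied across explicitly.
    Record d = h;
    cudaMalloc(&d.values, h.n * sizeof(float));
    cudaMemcpy(d.values, h.values, h.n * sizeof(float), cudaMemcpyHostToDevice);

    scale<<<(h.n + 255) / 256, 256>>>(d);
    cudaMemcpy(h.values, d.values, h.n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d.values);
    free(h.values);
    return 0;
}

With Unified Memory, a single cudaMallocManaged allocation replaces the paired allocations and both explicit copies.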

Unified Memory Example

CPU-only version:

void sortfile(FILE *fp, int N)
{
    char *data;
    data = (char *)malloc(N);
    fread(data, 1, N, fp);
    qsort(data, N, 1, compare);
    use_data(data);
    free(data);
}

Unified Memory version (CUDA 6+):

void sortfile(FILE *fp, int N)
{
    char *data;
    cudaMallocManaged(&data, N);
    fread(data, 1, N, fp);
    qsort<<<...>>>(data, N, 1, compare);
    cudaDeviceSynchronize();
    use_data(data);
    cudaFree(data);
}
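The slide's example assumes compare, use_data, and a device-side qsort already exist. A complete, runnable illustration of the same idea (kernel name and sizes are assumptions) is:

#include <cstdio>

__global__ void increment(int *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] += 1;
}

int main()
{
    const int n = 1024;
    int *x;
    cudaMallocManaged(&x, n * sizeof(int));   // one pointer, visible to CPU and GPU

    for (int i = 0; i < n; ++i)               // written by the CPU
        x[i] = i;

    increment<<<(n + 255) / 256, 256>>>(x, n);
    cudaDeviceSynchronize();                  // required before the CPU touches x again

    printf("x[0] = %d\n", x[0]);              // read by the CPU, no explicit copies
    cudaFree(x);
    return 0;
}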

Other Developer Tools
- XT and Drop-in libraries: cuFFT and cuBLAS optimised for multiple GPUs on the same node
- GPUDirect: direct transfers between GPUs, cutting out the host, with support for direct transfer via InfiniBand (over a network)
- Developer tools: remote development using Nsight Eclipse Edition, and an enhanced Visual Profiler
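As an illustration of the GPU-to-GPU side of GPUDirect, here is a minimal peer-to-peer sketch (device IDs and buffer size are assumptions; both devices must support peer access):

#include <cstdio>

int main()
{
    const size_t numBytes = 1 << 20;
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 access device 1?
    if (!canAccess) {
        printf("Peer access not supported on this system\n");
        return 0;
    }

    float *src1, *dst0;
    cudaSetDevice(1);
    cudaMalloc(&src1, numBytes);                 // buffer on device 1
    cudaSetDevice(0);
    cudaMalloc(&dst0, numBytes);                 // buffer on device 0
    cudaDeviceEnablePeerAccess(1, 0);            // map device 1's memory into device 0

    // Direct device-to-device copy: no staging through host memory.
    cudaMemcpyPeer(dst0, 0, src1, 1, numBytes);

    cudaFree(dst0);
    cudaSetDevice(1);
    cudaFree(src1);
    return 0;
}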