CUDA 5.0 By Peter Holvenstot CS6260

CUDA 5.0
- Latest iteration of the CUDA toolkit
- Requires Compute Capability 3.0
- Compatible Kepler cards are now being installed

Major New Features
- GPUDirect: allows Direct Memory Access
- GPU Object Linking: libraries for GPU code
- Dynamic Parallelism: kernels inside kernels

GPUDirect
- Allows Direct Memory Access to the PCIe bus
- Third-party device access now supported
- Requires use of pinned memory
- DMAs can be chained across a network
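
A sketch of the peer-to-peer side of GPUDirect, using only standard runtime calls; the two-GPU setup, device numbering, and buffer size are assumptions for the example.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Copy a buffer directly between two GPUs over PCIe, bypassing host memory.
    int main() {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can GPU 0 reach GPU 1?
        if (!canAccess) { printf("P2P not supported\n"); return 0; }

        const size_t bytes = 1 << 20;
        float *buf0, *buf1;

        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);            // enable direct access to GPU 1
        cudaMalloc(&buf0, bytes);

        cudaSetDevice(1);
        cudaMalloc(&buf1, bytes);

        // Direct GPU-to-GPU DMA; no staging through system memory.
        cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);

        cudaFree(buf1);
        cudaSetDevice(0);
        cudaFree(buf0);
        return 0;
    }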

GPUDirect [figure omitted]

Pinned Memory
- malloc(): unpinned; memory can be paged out
- cudaHostAlloc(): pinned; memory cannot be paged out
- Pinned allocation takes longer, but enables features requiring DMA and increases copy performance
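
A minimal sketch of the difference in practice, assuming an illustrative 1 MB buffer: pinned memory comes from cudaHostAlloc(), is freed with cudaFreeHost(), and allows cudaMemcpyAsync() to be a true DMA that overlaps other work.

    #include <cuda_runtime.h>

    int main() {
        const size_t bytes = 1 << 20;

        float *h_pinned, *d_buf;
        cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);  // pinned: cannot be paged out
        cudaMalloc(&d_buf, bytes);

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Because the source is pinned, this copy returns immediately and the
        // DMA engine runs it while the CPU (or another stream) does other work.
        cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);

        cudaStreamSynchronize(stream);

        cudaStreamDestroy(stream);
        cudaFree(d_buf);
        cudaFreeHost(h_pinned);   // pinned memory is freed with cudaFreeHost(), not free()
        return 0;
    }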

Kernel Linking
- Kernels now support compilation to .obj files
- Allows compiling into/against static libraries
- Allows closed-source distribution of libraries
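
A sketch of what this enables, with illustrative file and function names: a __device__ function defined in one translation unit can be called from a kernel in another, with nvcc resolving the call at device-link time.

    // ---- device_lib.cu: compiled on its own, e.g. into a static library ----
    __device__ float scale(float x) { return 2.0f * x; }

    // ---- main.cu: calls a device function defined in another object file ----
    extern __device__ float scale(float x);      // resolved at device-link time

    __global__ void kernel(float *data) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        data[i] = scale(data[i]);
    }

    // Build with relocatable device code, then let nvcc device-link:
    //   nvcc -arch=sm_30 -dc device_lib.cu main.cu
    //   nvcc -arch=sm_30 device_lib.o main.o -o app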

Dynamic Parallelism
- CUDA 4.1: __device__ functions may make inline-able recursive calls
- However, __global__ functions (kernels) cannot
- CUDA 5: kernels running on the GPU may launch additional kernels
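
A minimal sketch of a nested launch. Note that dynamic parallelism specifically requires Compute Capability 3.5 and linking against the device runtime:

    #include <cstdio>

    __global__ void child(int depth) {
        printf("child at depth %d, thread %d\n", depth, threadIdx.x);
    }

    __global__ void parent() {
        // Device-side launch syntax is identical to the host side.
        if (threadIdx.x == 0)
            child<<<1, 4>>>(1);
    }

    int main() {
        parent<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }

    // Build: nvcc -arch=sm_35 -rdc=true nested.cu -lcudadevrt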

Dynamic Parallelism
- The most important feature in this release
- Reduces the need for host/device synchronization
- Allows program flow to be controlled by the GPU
- Allows recursion and subdivision of problems

Dynamic Parallelism
- CPU code can now become a kernel
- Kernel calls can be used as tasks
- GPU controls kernel launch/flow/scheduling
- Increases practical thread count to thousands

Dynamic Parallelism
- Interesting data is not uniformly distributed
- Dynamic parallelism can launch additional threads in interesting areas
- Allows higher resolution in critical areas without slowing down others, as sketched below
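
A sketch of that pattern, where isInteresting() and refine() are illustrative placeholders: a coarse pass inspects each region and launches a finer child grid only where the data warrants it.

    __global__ void refine(const float *data, int offset) {
        // ... re-process one region at higher resolution ...
    }

    __device__ bool isInteresting(const float *data, int offset) {
        return data[offset] > 0.5f;   // placeholder criterion
    }

    __global__ void coarsePass(const float *data, int regionSize) {
        int offset = (blockIdx.x * blockDim.x + threadIdx.x) * regionSize;
        if (isInteresting(data, offset)) {
            // Extra resolution only where needed; quiet regions stay cheap.
            refine<<<1, regionSize>>>(data, offset);
        }
    }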

[Figure omitted; source: NVIDIA]

Dynamic Parallelism: Nested Dependencies [figure omitted; source: NVIDIA]

Dynamic Parallelism
- Scheduling can be controlled by streams
- No new concurrency guarantees
- Launched kernels may execute out of order within a stream
- Named streams permit (but do not guarantee) concurrent execution
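
A sketch of device-side named streams with two illustrative child kernels; the device runtime only accepts streams created with the cudaStreamNonBlocking flag.

    __global__ void childA() { /* work A */ }
    __global__ void childB() { /* work B */ }

    __global__ void parent() {
        if (threadIdx.x == 0) {
            cudaStream_t s1, s2;
            cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
            cudaStreamCreateWithFlags(&s2, cudaStreamNonBlocking);

            childA<<<1, 64, 0, s1>>>();   // may run concurrently with childB,
            childB<<<1, 64, 0, s2>>>();   // but concurrency is not guaranteed

            cudaStreamDestroy(s1);        // releases each stream once its work drains
            cudaStreamDestroy(s2);
        }
    }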

Dynamic Parallelism: Nested Dependencies
- cudaDeviceSynchronize() can be used inside a kernel
- Synchronizes all launches made by any thread in the block
- Does NOT imply __syncthreads()!
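
A sketch of the caveat above: device-side cudaDeviceSynchronize() waits for the child grid, but a separate __syncthreads() is still needed before every parent thread can rely on the result.

    __global__ void child(float *buf) {
        buf[threadIdx.x] = (float)threadIdx.x;
    }

    __global__ void parent(float *buf) {
        if (threadIdx.x == 0)
            child<<<1, 128>>>(buf);

        cudaDeviceSynchronize();   // wait for launches made by this block
        __syncthreads();           // then make the parent's threads rendezvous

        // All parent threads can now safely read what the child wrote.
    }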

Dynamic Parallelism
- Kernel launch implies a memory synchronization operation
- Child sees the parent's memory state at time of launch
- Parent sees the child's writes after synchronizing
- Local and shared memory are private; they cannot be shared with children
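
A sketch of these rules with an illustrative child kernel: global memory may be handed to a child, while the commented-out launches show what is invalid.

    __global__ void child(float *g) { g[threadIdx.x] *= 2.0f; }

    __global__ void parent(float *global_buf) {
        __shared__ float tile[256];
        float local = 0.0f;

        child<<<1, 256>>>(global_buf);   // OK: global memory is visible to children
        // child<<<1, 256>>>(tile);      // INVALID: shared memory is private to the block
        // child<<<1, 256>>>(&local);    // INVALID: local memory is private to the thread

        cudaDeviceSynchronize();         // parent now sees the child's global writes
    }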

Questions?

Sources
- Parallelism_Programming_Guide.pdf
- rdma/index.html
- TC2012/PresentationPDF/S0338-GTC2012-CUDA-Programming-Model.pdf
- cuda-5-and-beyond