CUDA Libraries and Language Extensions for GKLEE.

What's the point? ● We want to make GKLEE as friendly and practical to use as possible, so that it becomes a real-world development tool ● CUDA extensions and APIs cover a lot of ground – how do we approach complete handling of them in GKLEE?

What to Handle? ● There are many items to cover ● Break them down into levels: – Language intrinsics / vital functions – Easy-to-implement items (make stubs) – Items that require a large degree of emulation/virtualization in GKLEE; these are vital to the semantics of kernels/programs and should either be handled properly or cause termination with a message – Items that are unrelated and can be ignored

Top Priority Items ● Language Intrinsics ● Vital Functions

CUDA C/C++ Extensions ● Function Type Qualifiers ● Variable Type Qualifiers ● Types (Vector Types) ● Built-in Variables ● Execution Configurations (also see the configuration functions in Runtime API)

Function Type Qualifiers
__device__ – executed on the device, callable only from the device
__global__ – declares a kernel; callable from the host only
__host__ – executed on the host, callable from the host (__device__ and __host__ may be combined)
__noinline__ – hints to the compiler not to inline the function
__forceinline__ – forces the function to be inlined
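A minimal sketch of how these qualifiers combine (all function names here are hypothetical):

```cuda
__device__ float square(float x) { return x * x; }       // device-only, device-callable

__host__ __device__ float twice(float x) { return 2.0f * x; }  // compiled for both sides

__noinline__ __device__ float helper(float x) { return square(x) + 1.0f; }

__global__ void demoKernel(float *out) {                 // kernel: launched from the host
    out[threadIdx.x] = twice(helper((float)threadIdx.x));
}
```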

Variable Type Qualifiers
__device__ – a variable that resides in device global memory
__constant__ – resides in constant memory; readable by all threads in the grid and accessible from the host
__shared__ – resides in per-block shared memory **THIS IS CURRENTLY BROKEN (in some cases)
__restrict__ – promises the compiler that the pointer is not aliased, so it can perform reordering and common sub-expression elimination – supported by LLVM?
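A hedged sketch of these variable qualifiers in use (the names and sizes are made up for illustration):

```cuda
__device__   float g_scale = 2.0f;     // device global memory
__constant__ float c_coeffs[4];        // constant memory, set from the host

__global__ void smooth(const float *__restrict__ in, float *__restrict__ out) {
    __shared__ float tile[256];        // one copy per thread block
    tile[threadIdx.x] = in[threadIdx.x];
    __syncthreads();                   // make the tile visible block-wide
    out[threadIdx.x] = tile[threadIdx.x] * g_scale + c_coeffs[0];
}
```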

Built-in Variables
gridDim
blockDim
blockIdx
threadIdx
warpSize
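The usual global-index idiom built from these variables, shown on a hypothetical SAXPY kernel:

```cuda
__global__ void saxpy(int n, float a, const float *x, float *y) {
    // blockIdx/blockDim/threadIdx combine into a unique global index
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
    // gridDim.x * blockDim.x gives the total thread count;
    // warpSize is also available in device code
}
```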

Vital Functions ● Memory Fence Functions ● Synchronization ● Warp Vote Functions ● On-Device Asserts

Variations on __syncthreads
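One sketch of the __syncthreads variants (hypothetical kernel; __syncthreads_count, __syncthreads_and, and __syncthreads_or require compute capability 2.0 or later):

```cuda
__global__ void countPositive(const int *data, int *result) {
    int pred = data[threadIdx.x] > 0;
    // Barrier that also reduces a per-thread predicate: returns how many
    // threads in the block passed a non-zero value. __syncthreads_and /
    // __syncthreads_or work analogously with logical AND / OR.
    int n = __syncthreads_count(pred);
    if (threadIdx.x == 0)
        *result = n;
}
```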

Warp Vote Functions
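A sketch of the warp vote intrinsics as they existed at this time, __all, __any, and __ballot (the kernel and output layout are hypothetical):

```cuda
__global__ void voteDemo(const int *data, int *results, unsigned *laneMask) {
    int pred = data[threadIdx.x] != 0;
    int all    = __all(pred);     // non-zero iff pred holds for every active thread in the warp
    int any    = __any(pred);     // non-zero iff pred holds for at least one thread
    unsigned m = __ballot(pred);  // bit i is set iff lane i's predicate is non-zero
    if (threadIdx.x == 0) {       // one thread publishes the warp-wide results
        results[0] = all;
        results[1] = any;
        *laneMask  = m;
    }
}
```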

On-Device Assertion
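Device-side assert is the ordinary C assert used inside a kernel (compute capability 2.0 or later); a failing thread prints the file/line and the launch fails with cudaErrorAssert. A minimal sketch:

```cuda
#include <assert.h>

__global__ void checkLengths(const int *len) {
    // every thread checks its own element; any violation aborts the launch
    assert(len[threadIdx.x] >= 0);
}
```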

**Execution Configuration**
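The execution configuration is the <<<...>>> launch syntax; a host-side sketch (the kernel `myKernel` and its arguments are hypothetical):

```cuda
dim3 grid(64);                            // 64 blocks
dim3 block(256);                          // 256 threads per block
size_t shmemBytes = 256 * sizeof(float);  // optional dynamic shared memory
cudaStream_t stream = 0;                  // optional stream (0 = default)

// kernel<<<gridDim, blockDim, sharedMemBytes, stream>>>(args...)
myKernel<<<grid, block, shmemBytes, stream>>>(d_in, d_out);
```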

Lower Priority

Heap Memory Alloc
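Device-side heap allocation means calling malloc/free inside a kernel (Fermi and later); the heap size is set beforehand on the host with cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes). A hedged sketch:

```cuda
__global__ void scratchDemo(int n) {
    // allocate a per-thread scratch buffer from the device heap
    int *buf = (int *)malloc(n * sizeof(int));
    if (buf != NULL) {
        for (int i = 0; i < n; ++i)
            buf[i] = i;
        free(buf);   // must be freed by device code (or leak for the app's lifetime)
    }
}
```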

Profiler Counter Function

Launch Bounds
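__launch_bounds__ tells the compiler the maximum block size (and optionally a minimum number of resident blocks per multiprocessor) so it can budget registers; a sketch on a hypothetical kernel:

```cuda
__global__ void
__launch_bounds__(256, 4)   // at most 256 threads/block; aim for 4 resident blocks/SM
tunedKernel(float *data) {
    data[threadIdx.x] *= 2.0f;
}
```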

#pragma unroll
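#pragma unroll controls loop unrolling in device code; a small illustrative kernel:

```cuda
__global__ void dot16(const float *a, const float *b, float *out) {
    float acc = 0.0f;
    #pragma unroll            // fully unroll this fixed-trip-count loop
    for (int i = 0; i < 16; ++i)
        acc += a[i] * b[i];
    // "#pragma unroll 4" would instead unroll by a factor of 4,
    // and "#pragma unroll 1" disables unrolling entirely
    if (threadIdx.x == 0)
        *out = acc;
}
```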

Runtime API – Stream Functions
cudaError_t cudaStreamCreate (cudaStream_t *pStream) – Creates an asynchronous stream.
cudaError_t cudaStreamDestroy (cudaStream_t stream) – Destroys and cleans up an asynchronous stream.
cudaError_t cudaStreamQuery (cudaStream_t stream) – Queries an asynchronous stream for completion status.
cudaError_t cudaStreamSynchronize (cudaStream_t stream) – Waits for stream tasks to complete.
cudaError_t cudaStreamWaitEvent (cudaStream_t stream, cudaEvent_t event, unsigned int flags) – Makes a compute stream wait on an event.
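A hedged sketch of how these stream functions combine to overlap copies with compute (the kernel `process`, the buffers, sizes, and launch geometry are all hypothetical):

```cuda
cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&s[i]);

for (int i = 0; i < 2; ++i) {
    // copy and kernel are ordered within a stream;
    // the two streams can overlap with each other
    cudaMemcpyAsync(d_buf[i], h_buf[i], bytes, cudaMemcpyHostToDevice, s[i]);
    process<<<grid, block, 0, s[i]>>>(d_buf[i]);
}

for (int i = 0; i < 2; ++i) {
    cudaStreamSynchronize(s[i]);   // wait for that stream's work to finish
    cudaStreamDestroy(s[i]);
}
```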

RT API – Event Functions
cudaError_t cudaEventCreate (cudaEvent_t *event) – Creates an event object.
cudaError_t cudaEventCreateWithFlags (cudaEvent_t *event, unsigned int flags) – Creates an event object with the specified flags.
cudaError_t cudaEventDestroy (cudaEvent_t event) – Destroys an event object.
cudaError_t cudaEventElapsedTime (float *ms, cudaEvent_t start, cudaEvent_t end) – Computes the elapsed time between two events.
cudaError_t cudaEventQuery (cudaEvent_t event) – Queries an event's status.
cudaError_t cudaEventRecord (cudaEvent_t event, cudaStream_t stream=0) – Records an event.
cudaError_t cudaEventSynchronize (cudaEvent_t event) – Waits for an event to complete.
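The common use of these event functions is timing a kernel; a sketch with a hypothetical launch:

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);              // enqueue 'start' in the default stream
myKernel<<<grid, block>>>(d_data);      // the work being timed (hypothetical)
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // block the host until 'stop' has occurred

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```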

RT API – Execution Control
cudaError_t cudaConfigureCall (dim3 gridDim, dim3 blockDim, size_t sharedMem=0, cudaStream_t stream=0) – Configures a device launch.
cudaError_t cudaFuncGetAttributes (struct cudaFuncAttributes *attr, const char *func) – Finds out attributes for a given function.
cudaError_t cudaFuncSetCacheConfig (const char *func, enum cudaFuncCache cacheConfig) – Sets the preferred cache configuration for a device function.
cudaError_t cudaLaunch (const char *entry) – Launches a device function.
cudaError_t cudaSetDoubleForDevice (double *d) – Converts a double argument to be executed on a device.
cudaError_t cudaSetDoubleForHost (double *d) – Converts a double argument after execution on a device.
cudaError_t cudaSetupArgument (const void *arg, size_t size, size_t offset) – Configures a device launch argument.

RT API – Memory Management (first 6 of ~50)
cudaError_t cudaArrayGetInfo (struct cudaChannelFormatDesc *desc, struct cudaExtent *extent, unsigned int *flags, struct cudaArray *array) – Gets info about the specified cudaArray.
cudaError_t cudaFree (void *devPtr) – Frees memory on the device.
cudaError_t cudaFreeArray (struct cudaArray *array) – Frees an array on the device.
cudaError_t cudaFreeHost (void *ptr) – Frees page-locked memory.
cudaError_t cudaGetSymbolAddress (void **devPtr, const char *symbol) – Finds the address associated with a CUDA symbol.
cudaError_t cudaGetSymbolSize (size_t *size, const char *symbol) – Finds the size of the object associated with a CUDA symbol.

RT API – Unified Address Space ● Allows host and device memory to be handled through a single unified address space
cudaError_t cudaPointerGetAttributes (struct cudaPointerAttributes *attributes, const void *ptr) – Returns attributes about a specified pointer.
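A minimal sketch of querying a pointer under unified addressing (assumes `ptr` was allocated earlier by the host or via cudaMalloc):

```cuda
struct cudaPointerAttributes attr;
if (cudaPointerGetAttributes(&attr, ptr) == cudaSuccess) {
    // attr.memoryType distinguishes cudaMemoryTypeHost from
    // cudaMemoryTypeDevice; attr.device names the owning device
}
```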

RT API – direct peer mem access
cudaError_t cudaDeviceCanAccessPeer (int *canAccessPeer, int device, int peerDevice) – Queries whether a device may directly access a peer device's memory.
cudaError_t cudaDeviceDisablePeerAccess (int peerDevice) – Disables direct access to memory allocations on a peer device and unregisters any registered allocations from that device.
cudaError_t cudaDeviceEnablePeerAccess (int peerDevice, unsigned int flags) – Enables direct access to memory allocations on a peer device.

RT API – Graphics Interfaces ● OpenGL ● Direct3D ● VDPAU – Video Decode and Presentation API for Unix ● General graphics interop ● Texture ● Surface

RT API – Version Info
cudaError_t cudaDriverGetVersion (int *driverVersion) – Returns the CUDA driver version.
cudaError_t cudaRuntimeGetVersion (int *runtimeVersion) – Returns the CUDA Runtime version.

RT API – C++ Bindings ● (Sample functions – use templates to bind a class)
template <class T, int dim> cudaError_t cudaBindSurfaceToArray (const struct surface< T, dim > &surf, const struct cudaArray *array) – [C++ API] Binds an array to a surface
template <class T, int dim> cudaError_t cudaBindSurfaceToArray (const struct surface< T, dim > &surf, const struct cudaArray *array, const struct cudaChannelFormatDesc &desc) – [C++ API] Binds an array to a surface

RT API – Profiler Control
cudaError_t cudaProfilerInitialize (const char *configFile, const char *outputFile, cudaOutputMode_t outputMode) – Initializes profiling.
cudaError_t cudaProfilerStart (void) – Starts profiling.
cudaError_t cudaProfilerStop (void) – Stops profiling.

Specific API (RT & Driver) ● Data Structures ● Enumerations ● #defines

RT API / Driver API Interactions ● Init/Tear Down ● Contexts ● Streams ● Events ● Arrays ● Graphics

Driver API (lower level access) ● Initialization ● Version management ● Device management ● Context management ● Module management ● Memory management ● Unified addressing ● Streams ● Events ● Exec Control ●...