UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE
1 UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE
Quasar: UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE. Joris Roels, Dirk Van Haerenborgh, Jonas De Vylder & Bart Goossens. IPI, iMinds, Ghent University.

2 Agenda
- Background
- Demo
- The Quasar workflow
- Questionnaire
- High-level programming
- Coffee break
- Gepura tools
- Advanced programming

3 Background

4 Short introduction: research group IPI
Focus on image and video restoration and analysis: 30 PhD students, 5 postdocs, 2 technology developers, 6 professors. Various topics: video, 3D, medical imaging, segmentation, remote sensing, ...

5 Our challenges
- Variable data
- Complex iterative algorithms
- Hard constraints (e.g. real-time)
- A research environment -> need for rapid prototyping
- Variable hardware

6 Quasar: the start
An introduction to Quasar. Originally: a scripting system for writing "plugins" for my photo restoration tool.

7 Quasar: a brief history
- Originally (2009): translation from annotated C# code.
- Jan 2011: first Quasar script, with a simple, MATLAB-like syntax; a parallel_do keyword to specify code that has to be executed in parallel (e.g. on a GPU); and variable types derived through type inference (no need to declare variables, except the parameters of kernel functions).
- This made it very easy to write various filters in a very short time frame: brightness & contrast enhancement, color mixer, color correction, bilateral filtering, space-variant blurring, gamma correction, non-local means filtering, ...
- All filters could run in real time on a video sequence.
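As a flavor of the style, here is a minimal hedged sketch of such a filter in Quasar, following the __kernel__/parallel_do pattern described later in this deck (the function name and the gain/offset parameters are illustrative, not taken from the slides):

    % Hypothetical brightness & contrast filter
    function [] = __kernel__ brightness_contrast(x : cube, y : cube, gain : scalar, offset : scalar, pos : ivec3)
        y[pos] = gain * x[pos] + offset   % per-pixel operation, executed in parallel
    end

    img = imread("input.tif")        % types of img and out are inferred
    out = zeros(size(img))
    parallel_do(size(out), img, out, 1.5, 0.1, brightness_contrast)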

8 Quasar started evolving
- More automation
- More optimizations
- An integrated development environment
- More robustness
- From a few simple scripts to a full-blown language with real-life research examples

9 Today's Quasar ecosystem
IDE & runtime optimization, a knowledge base, libraries, and a high-level programming language.

10 Quasar installation…

11 Demo

12 The Quasar workflow

13 The benefits of GPUs: a growing application domain
- Exploit massive parallelism, e.g. to process large amounts of data in parallel
- Speed-ups of 10x to 100x
- Energy-efficient: more calculations per watt
- Applied to many domains: multimedia, finance, big data, scientific computing, ...
- Commodity hardware: standard in desktops and laptops
- [Chart: single-precision performance, NVIDIA GPU vs. Intel CPU]
- New trend: integration in embedded devices (e.g. automotive applications) and mobile devices (smartphones: OnePlus, LG, HTC, ...)

14 The drawbacks of GPU programming, solved by Quasar
- Low-level coding: experts needed
- Strong coupling between algorithm development and implementation
- Long development lead times
- Each hardware platform requires new optimizations

15 High level of abstraction, hardware-agnostic programming
Quasar is a scripting language with compact code, a high level of abstraction, and hardware-agnostic programming. The workflow (blue stages are internal to Quasar; orange are external factors):
- Input: the algorithm (Quasar code) and the data
- Code analysis: code optimization, developer feedback, kernel decomposition, guided by data characteristics and kernel characteristics
- Compilation: back ends for .NET, OpenMP & SIMD, OpenCL and CUDA
- Runtime: memory management, load balancing, scheduling, kernel parameter optimization, guided by hardware characteristics
- Targets: CPU, multi-core CPU, many-core accelerator, GPU, SoC: heterogeneous hardware

16 [Workflow diagram repeated, highlighting: Algorithm (Quasar code)]

17 Quasar - Scripting language
Same abstraction level as Python and MATLAB, giving shorter development cycles: 2 weeks vs. 3 months.

18 Example: code written in CUDA vs. Quasar
The CUDA version:

    #include <cuda.h>
    #include <stdio.h>    // for printf
    #include <stdlib.h>   // for malloc/free

    // Kernel that executes on the CUDA device
    __global__ void square_array(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N) a[idx] = a[idx] * a[idx];
    }

    // main routine that executes on the host
    int main(void)
    {
        float *a_h, *a_d;                 // Pointers to host & device arrays
        const int N = 10;                 // Number of elements in arrays
        size_t size = N * sizeof(float);
        a_h = (float *)malloc(size);      // Allocate array on host
        cudaMalloc((void **)&a_d, size);  // Allocate array on device
        // Initialize host array and copy it to CUDA device
        for (int i = 0; i < N; i++) a_h[i] = (float)i;
        cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
        // Do calculation on device:
        int block_size = 4;
        int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
        square_array<<<n_blocks, block_size>>>(a_d, N);
        // Retrieve result from device and store it in host array
        cudaMemcpy(a_h, a_d, sizeof(float) * N, cudaMemcpyDeviceToHost);
        // Print results
        for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
        // Cleanup
        free(a_h);
        cudaFree(a_d);
    }

The equivalent Quasar version:

    a_d = 0..9
    print a_d.^2

19 [Workflow diagram repeated, highlighting: Code analysis]

20 Automatic parallelization

21 Development feedback

22 Quasar – Reductions
Reductions allow providing an alternative implementation for certain operations (e.g. BLAS, the Basic Linear Algebra Subprograms):

    reduction (alpha : scalar, x : vec, y : vec) -> alpha * x + y = blas_sscal(alpha, x, y)
    reduction (x) -> real(ifft2(x)) = irealfft2(x)

They can define "trivial" optimizations:

    reduction (x:mat) -> real(x) = x
    reduction (x:mat) -> imag(x) = zeros(size(x))
    reduction (x:mat) -> transpose(transpose(x)) = x
    reduction (x:mat) -> x[:,:] = x

And shorthands:

    reduction (x : cube) -> x[:] = reshape(x,[1,numel(x)])

23 Reductions: an example
    reduction y -> cosh(y)*sqrt(1-tanh(y)^2) = 1
    reduction z -> lim((1/z+1)^z,z=0) = exp(1)
    reduction x -> log(exp(x)) = x
    reduction x -> sin(x)^2+cos(x)^2 = 1
    reduction n -> sum(1/2^n,n=0..infinity) = 2
    symbolic x,y,z,n,infinity
    print log(lim((1+1/z)^z,z=0))+(sin(x)^2+cos(x)^2) == sum((cosh(y)*sqrt(1-tanh(y)^2))/2^n,n=0..infinity)

Output: Result after 6 reductions: (2==2)

24 [Workflow diagram repeated, highlighting: Compilation]

25 Compilation: parallel code
The back ends exploit parallelism at different granularities, from coarse-grained (multiple threads/cores) to fine-grained: SIMD (single instruction, multiple data) instruction sets such as SSE and NEON.

26 Compilation: automatic detection of nested parallelism
    x = imread("lena_big.tif")[:,:,1]
    y = zeros(size(x))
    B = 16   % block size
    for m = 0..B..size(x,0)-1                % parallel loop
        for n = 0..B..size(x,1)-1
            A = x[m..m+B-1,n..n+B-1]
            y[m..m+B-1,n..n+B-1] = sin(A)+B  % parallel operation on every element of a matrix
        end
    end

The compiler turns the host function's outer loops into kernel function 1 (launched with parallel_do) and the per-block matrix operation into kernel function 2, mapped onto the dynamic parallelism of the GPU (CUDA 5.0).

27 [Workflow diagram repeated, highlighting: Runtime]

28 Runtime execution: memory management
Automatic memory management: allocation/disposal, transparent marshalling, and transfers between CPU and GPU. Each object moves through a state machine:
- State 1: CPU non-dirty, GPU non-dirty
- State 2: CPU dirty, GPU non-dirty (after modifying on CPU; "update GPU" returns to State 1)
- State 3: CPU non-dirty, GPU dirty (after modifying on GPU; "update CPU" returns to State 1)
- State 4: CPU copy only, GPU not allocated (entered when the GPU runs out of memory; "copy to GPU" leaves it)
- State 5: GPU copy only, CPU not allocated (entered when the CPU runs out of memory; "copy to CPU" leaves it)
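As an illustration, in the hypothetical snippet below (not from the slides; my_kernel is a placeholder kernel function) the runtime keeps both copies consistent automatically:

    x = ones(512, 512)                  % fresh object: CPU and GPU copies in sync
    parallel_do(size(x), x, my_kernel)  % kernel writes x on the GPU -> GPU copy dirty
    print x[0, 0]                       % reading on the CPU triggers an automatic update of the CPU copy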

29 Runtime execution: Optimization to Hardware
Automated kernel parameter optimization: block size, grid size, number of threads, number of warps, shared memory.

30 Runtime execution: Load Balancing
Automated load balancing based on the data, hardware characteristics and kernel characteristics. Work goes to whichever device is fastest for the kernel at hand; the diagram shows three scenarios: time_CPU > time_GPU, time_GPU1 > time_CPU > time_GPU2, and time_CPU < time_GPU.

31 Runtime execution: Scheduler
Automated scheduling: reduce memory transfer times, and use concurrent kernel execution (if supported). In sequential mode, CPU and GPU take turns. In concurrent mode, the CPU runs asynchronously from the GPU: while the GPU is still processing its data, the CPU is already looking ahead and planning future memory transfers. If the GPU supports it (and most recent devices do), memory transfers are scheduled in parallel with kernel execution, so transfers can be nearly free in some cases. Multiple kernels can also overlap (concurrent kernel execution) when the GPU supports this, and recent GPUs do.

32 Results
- Faster development: 2 weeks vs. 3 months for a CUDA implementation of an MRI reconstruction algorithm
- Faster execution using the GPU: 64 fps vs. 2.91 fps for a template matching algorithm
- More compact code: 300 lines of Quasar code vs. many more lines of C++ code for a registration algorithm

33 Questionnaire http://bit.ly/1Ppy1jo

34 High level programming

35 High-level programming in Quasar: an introduction
Same abstraction level as Python and MATLAB.

36 Variables & data types
Variables: dynamic typing, with optional type annotation.
Data types:
- (c)scalar
- (u)int8 / (u)int16 / (u)int32
- string
- (c)vec / (c)mat / (c)cube
- (i)vecN (N = 1, ..., 32)
- cell
- kernel_function
- function
- object
Simple types (scalars, integers, strings) are passed by value; container types (vectors, matrices, cubes, cells, objects) are passed by reference.
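A small hedged sketch of these conventions (variable names are illustrative; the local `name : type` annotation is assumed to mirror the parameter syntax shown later):

    a = 0.5               % scalar, type inferred
    v = [1, 2, 3]         % vec
    M = eye(3)            % mat
    s = "Quasar"          % string
    n : int = 10          % optional type annotation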

37 Arrays, matrices, cubes, …
Zero-based indexing. Useful functions: zeros(.), ones(.), eye(.), size(.), ... (see the documentation).
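For instance (a minimal sketch; the values are illustrative):

    A = zeros(3, 4)       % 3 x 4 matrix of zeros
    A[0, 0] = 1           % zero-based indexing: top-left element
    print size(A)         % [3, 4]
    row = A[1, :]         % second row, via slicing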

38 Operators

39 Control structures
- if / elseif
- match with
- break / continue
- for
- while
- repeat
A short sketch of the loop constructs follows below.
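A minimal hedged sketch, using the for ... end form shown elsewhere in this deck (the while ... end form is assumed by analogy):

    total = 0
    for i = 0..9
        total = total + i     % accumulate 0 + 1 + ... + 9
    end
    print total               % 45

    n = 1
    while n < 100
        n = 2 * n             % double until n >= 100
    end
    print n                   % 128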

40 Functions function [out1,…,outM] = fname(in1,…,inN) main()
Functions in functions Specific typing Default values
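A minimal sketch (the `name : type = value` default-value syntax is an assumption; names are illustrative):

    function [y] = scale(x : mat, gain : scalar = 2.0)   % typed argument with a default value
        y = gain * x
    end

    function [] = main()
        A = ones(4, 4)
        B = scale(A)          % uses the default gain = 2.0
        C = scale(A, 0.5)     % overrides it
    end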

41 Reference manual
Complete description of Quasar's functionality; access it via the 'Help' menu in the IDE.

42 Redshift – the Quasar IDE

43 Redshift – the Quasar IDE
Components: computation engine, debug tools, definition window, code editor, current directory, data window, console, output window.

44 Gepura tools

45 Advanced programming

46 Kernel functions
Kernel functions are candidates for execution on the GPU and are launched in parallel:

    function [s1,…,sM] = __kernel__ kname(in1,…,inN,pos)

- Typed arguments are necessary
- The outputs s1,…,sM must be scalar
- Input arguments are passed by reference
- pos: the position in the array/matrix/cube/…
They are launched with:

    [s1,…,sM] = parallel_do(dims,argin1,…,arginN,kname)

where dims is the area to process in parallel. A hedged sketch follows below.
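A minimal sketch putting the pieces together (the kernel and variable names are illustrative, not from the slides):

    % Squares every element of x into y; pos ranges over dims.
    function [] = __kernel__ square_kernel(x : mat, y : mat, pos : ivec2)
        y[pos] = x[pos] * x[pos]
    end

    x = ones(256, 256) * 3
    y = zeros(size(x))
    parallel_do(size(y), x, y, square_kernel)   % dims = size(y)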

47 Kernel functions

48 Device functions
Device functions are the only functions that can be called from within kernel functions:

    dname = __device__ (in1,…,inN) -> function_output
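A hedged sketch of a device function used inside a kernel (names are illustrative):

    clamp01 = __device__ (v : scalar) -> max(0.0, min(1.0, v))

    function [] = __kernel__ clamp_kernel(x : mat, y : mat, pos : ivec2)
        y[pos] = clamp01(x[pos])   % device functions are callable from kernels
    end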

49 Shared memory in kernel functions
Global memory access can be accelerated by staging data in shared memory. [Diagram: pixels are read from global memory into shared memory, processed there, and written back to global memory.]

50 Shared memory in kernel functions
Global memory access can be accelerated (boundary handling is needed at the tile edges). Benchmark over 10000 runs: global memory 3051 ms vs. shared memory 831 ms.
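A hedged sketch of the staging pattern, assuming Quasar exposes shared memory via a shared(.) allocation, a syncthreads barrier, and blkpos/blkdim kernel arguments (the kernel body is illustrative, not taken from the slides):

    function [] = __kernel__ stage_kernel(x : mat, y : mat, pos : ivec2, blkpos : ivec2, blkdim : ivec2)
        s = shared(blkdim)    % per-block shared buffer (assumed API)
        s[blkpos] = x[pos]    % stage: one global-memory read per thread
        syncthreads           % barrier: wait until the whole block has loaded
        y[pos] = s[blkpos]    % subsequent accesses hit fast shared memory
    end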

51 UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE
Quasar: UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE. Joris Roels, Dirk Van Haerenborgh, Jonas De Vylder & Bart Goossens. IPI, iMinds, Ghent University.

