1
UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE
Quasar: UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE. Joris Roels, Dirk Van Haerenborgh, Jonas De Vylder & Bart Goossens (IPI, iMinds, Ghent University)
2
Agenda
- Background
- Demo
- The Quasar workflow
- Questionnaire
- High-level programming
- Coffee break
- Gepura tools
- Advanced programming
3
Background
4
Short introduction: research group IPI
Focus on image and video restoration and analysis: 30 PhD students, 5 postdocs, 2 technology developers, 6 professors. Various topics: video, 3D, medical imaging, segmentation, remote sensing, …
5
Our challenges
- Variable data
- Complex iterative algorithms
- Hard constraints (e.g. real-time)
- Research environment -> rapid prototyping
- Variable hardware
6
Quasar: the start (Introduction to Quasar)
Originally: a scripting system for writing "plugins" for my photo restoration tool.
7
Quasar: a brief history
Originally (2009): translation from annotated C# code.
Jan 2011: first Quasar script:
- Simple, MATLAB-like syntax.
- parallel_do keyword to specify code that has to be executed in parallel (e.g. on a GPU).
- Variable types derived through type inference (no need to declare variables, except parameters of kernel functions).
Very easy to write various filters in a short time frame: brightness & contrast enhancement, color mixer, color correction, bilateral filtering, space-variant blurring, gamma correction, non-local means filtering, …
All filters could run in real time on a video sequence.
8
Quasar started evolving
1. More automation
2. More optimizations
3. Integrated Development Environment
4. More robust
5. From a few simple scripts to a full-blown language with real-life research examples
9
Today's Quasar Ecosystem
- IDE & runtime optimisation
- Knowledge base
- Libraries
- High-level programming language
10
Quasar installation…
11
Demo
12
The Quasar workflow
13
The benefits of GPUs: growing application domain
- Exploits massive parallelism, e.g. to process large amounts of data in parallel
- Speed-ups of 10 to 100
- Energy efficient (calculations/Watt)
- Applied to many applications: multimedia, finance, big data, scientific computing, …
- Commodity hardware: standard in desktops and laptops
[Chart: single-precision performance, NVIDIA GPU vs. Intel CPU]
NEW TREND: integration in embedded devices (e.g. automotive applications) and mobile devices (smartphones: OnePlus, LG, HTC, …)
14
The drawbacks of GPU programming, solved by Quasar
- Low-level coding: experts needed
- Strong coupling between algorithm development and implementation
- Long development lead times
- Each hardware platform requires new optimizations
15
High level of abstraction, hardware-agnostic programming
- Scripting language
- Compact code
- High level of abstraction
- Hardware-agnostic programming
[Diagram: the Quasar workflow. An algorithm (Quasar code) passes through code analysis (code optimization, developer feedback, kernel decomposition), compilation (back-ends: .NET, OpenMP & SIMD, OpenCL, CUDA) and the runtime (memory management, load balancing, scheduling, kernel parameter optimization). External factors feeding the workflow: data characteristics, kernel characteristics and hardware characteristics (CPU, multi-core CPU, many-core accelerator, GPU, SoC: heterogeneous hardware). Blue: internal to the Quasar workflow; orange: external factors.]
16
[Workflow diagram, highlighting the "Algorithm (Quasar code)" stage.]
17
Quasar - Scripting language
Same abstraction level as Python and MATLAB -> shorter development cycles: 2 weeks vs. 3 months.
18
Example: code written in CUDA vs. Quasar
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) a[idx] = a[idx] * a[idx];
}

// main routine that executes on the host
int main(void)
{
  float *a_h, *a_d;                 // Pointers to host & device arrays
  const int N = 10;                 // Number of elements in arrays
  size_t size = N * sizeof(float);
  a_h = (float *)malloc(size);      // Allocate array on host
  cudaMalloc((void **)&a_d, size);  // Allocate array on device
  // Initialize host array and copy it to CUDA device
  for (int i = 0; i < N; i++) a_h[i] = (float)i;
  cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
  // Do calculation on device:
  int block_size = 4;
  int n_blocks = N/block_size + (N % block_size == 0 ? 0 : 1);
  square_array<<<n_blocks, block_size>>>(a_d, N);
  // Retrieve result from device and store it in host array
  cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
  // Print results
  for (int i = 0; i < N; i++) printf("%d %f\n", i, a_h[i]);
  // Cleanup
  free(a_h);
  cudaFree(a_d);
  return 0;
}

The equivalent in Quasar:

a_d = 0..9
print a_d.^2
19
[Workflow diagram, highlighting the "Code analysis" stage.]
20
Automatic parallelization
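As an illustration, a minimal sketch of a loop the compiler can parallelize automatically; the syntax follows the block-processing example later in this deck, and the image file name is just a placeholder:

x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
% Every iteration is independent, so the compiler can map this
% loop onto a parallel kernel (e.g. on the GPU) without annotations.
for m = 0..size(x,0)-1
    for n = 0..size(x,1)-1
        y[m,n] = 255 - x[m,n]   % invert each pixel
    end
end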
21
Development feedback
22
Quasar – Reductions
Reductions allow providing an alternative implementation for certain operations (e.g. BLAS: Basic Linear Algebra Subprograms):

reduction (alpha : scalar, x : vec, y : vec) -> alpha * x + y = blas_sscal(alpha, x, y)
reduction (x) -> real(ifft2(x)) = irealfft2(x)

Define "trivial" optimizations:

reduction (x:mat) -> real(x) = x
reduction (x:mat) -> imag(x) = zeros(size(x))
reduction (x:mat) -> transpose(transpose(x)) = x
reduction (x:mat) -> x[:,:] = x

Shorthands:

reduction (x : cube) -> x[:] = reshape(x,[1,numel(x)])
23
Reductions: an example
reduction y -> cosh(y)*sqrt(1-tanh(y)^2) = 1
reduction z -> lim((1/z+1)^z,z=0) = exp(1)
reduction x -> log(exp(x)) = x
reduction x -> sin(x)^2+cos(x)^2 = 1
reduction n -> sum(1/2^n,n=0..infinity) = 2

symbolic x,y,z,n,infinity
print log(lim((1+1/z)^z,z=0))+(sin(x)^2+cos(x)^2) == sum((cosh(y)*sqrt(1-tanh(y)^2))/2^n,n=0..infinity)

OUTPUT: Result after 6 reductions: (2==2)
24
[Workflow diagram, highlighting the "Compilation" stage.]
25
Compilation: parallel code
Parallelism comes at different granularities, from coarse (across cores and devices) to fine. SSE, NEON and similar instruction sets are SIMD (single instruction, multiple data): fine-grained parallelism.
26
Compilation: automatic detection of nested parallelism
x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
B = 16 % block size
for m = 0..B..size(x,0)-1                  % parallel loop
    for n = 0..B..size(x,1)-1
        A = x[m..m+B-1,n..n+B-1]
        y[m..m+B-1,n..n+B-1] = sin(A)+B    % parallel operation on every element of a matrix
    end
end

Mapped onto the dynamic parallelism of the GPU (CUDA 5.0). [Diagram: the host function launches kernel function 1 via parallel_do; kernel function 1 launches kernel function 2.]
27
[Workflow diagram, highlighting the "Runtime" stage.]
28
Runtime execution: memory management
Automatic memory management:
- Allocation/disposal
- Transparent marshalling
- Transfer between CPU and GPU
[State diagram: state 1 (CPU: non-dirty, GPU: non-dirty). Modifying on the CPU leads to state 2 (CPU: dirty, GPU: non-dirty); modifying on the GPU leads to state 3 (CPU: non-dirty, GPU: dirty); "update GPU"/"update CPU" return to state 1. When the GPU runs out of memory, data is copied to the CPU (state 4: CPU only, GPU: N/A); when the CPU runs out of memory, data is copied to the GPU (state 5: GPU only, CPU: N/A).]
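What this means for user code, as a minimal sketch (array size and operations are arbitrary): no explicit transfers are written; the runtime tracks which copy is dirty and synchronizes on demand.

x = ones(2048,2048)   % allocated by the runtime
y = x * 2             % may run on the GPU: x is transferred automatically
print y[0,0]          % reading on the CPU triggers the copy back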
29
Runtime execution: Optimization to Hardware
Automated parameter optimization:
- Block size
- Grid size
- Number of threads
- Number of warps
- Shared memory
30
Runtime execution: Load Balancing
[Diagram: three load-balancing scenarios: time_CPU > time_GPU; time_GPU1 > time_CPU > time_GPU2; time_CPU < time_GPU.]
Automated load balancing based on:
- Data
- Hardware characteristics
- Kernel characteristics
31
Runtime execution: Scheduler
Sequential vs. concurrent execution: in concurrent mode, the CPU runs asynchronously from the GPU. While the GPU is still processing its data, the CPU is already looking ahead and planning future memory transfers. If the GPU supports it (and most recent devices do), memory transfers are scheduled in parallel with kernel execution, so transfers can be nearly free in some cases. Multiple kernels can also overlap (concurrent kernel execution) when the GPU supports it, which recent GPUs do.
Automated scheduling:
- Reduce memory transfer times
- Concurrent kernel execution (if supported)
32
Results
Faster development:
- 2 weeks vs. 3 months for a CUDA implementation of an MRI reconstruction algorithm
Faster execution using the GPU:
- 64 fps vs. 2.91 fps for a template matching algorithm
More efficient code:
- 300 lines of Quasar code vs. … lines of C++ code for a registration algorithm
33
Questionnaire http://bit.ly/1Ppy1jo
34
High level programming
35
High-level programming in Quasar: an introduction
Same abstraction level as Python and MATLAB.
36
Variables & data types
Variables: dynamic typing, with optional type annotation.
Data types:
- (c)scalar
- (u)int8 / (u)int16 / (u)int32
- string
- (c)vec / (c)mat / (c)cube
- (i)vecX (X = 1, …, 32)
- cell
- kernel_function
- function
- object
Pass by value vs. pass by reference; see the sketch below.
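A minimal sketch of dynamic typing with optional annotations; the `name : type` annotation on an assignment is an assumption here, mirroring the annotation syntax of kernel function parameters shown later:

a = 1.5            % type inferred: scalar
b : scalar = 4     % explicit (optional) annotation; assumed syntax
v = [1, 2, 3]      % vec
M = zeros(3,3)     % mat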
37
Arrays, matrices, cubes, …
Zero-based indexing. Useful functions: zeros(.), ones(.), eye(.), size(.), … (see the documentation).
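A small sketch of zero-based indexing and range slicing, using the inclusive a..b range syntax from the block-processing example elsewhere in this deck:

A = eye(4)            % 4x4 identity matrix
A[0,0] = 2            % indices start at 0
B = A[0..1,0..1]      % 2x2 top-left sub-block (ranges are inclusive)
print size(B)         % prints the dimensions of B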
38
Operators
39
Control structures
- if – elseif
- match with
- break / continue
- for
- while
- repeat
A small loop example follows below.
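A minimal loop sketch, sticking to the constructs whose syntax appears elsewhere in this deck (for-loops closed with `end`); the other control structures follow the same block style:

total = 0
for i = 0..9              % inclusive range, as in the earlier examples
    total = total + i
end
print total               % 45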
40
Functions
function [out1,…,outM] = fname(in1,…,inN)
main()
- Functions in functions
- Specific typing
- Default values
A sketch follows below.
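A hedged sketch of a function with multiple outputs; the default-value syntax (`b = 1` in the parameter list) and the `end` terminator are assumptions based on the features named on this slide and the loop syntax shown earlier:

function [s, d] = sum_diff(a, b = 1)   % b has a default value (assumed syntax)
    s = a + b
    d = a - b
end

function [] = main()
    [s, d] = sum_diff(5, 2)
    print s    % 7
    print d    % 3
end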
41
Reference manual
Complete description of Quasar functionality. Access via the 'Help' menu in the IDE.
42
Redshift – the Quasar IDE
43
Redshift – the Quasar IDE
- Computation engine
- Debug tools
- Definition window
- Code editor
- Current directory
- Data window
- Console
- Output window
44
Gepura tools
45
Advanced programming
46
Kernel functions
Candidate functions to be executed on the GPU, launched in parallel:

function [s1,…,sM] = __kernel__ kname(in1,…,inN,pos)

- Typed arguments are necessary
- s1,…,sM: scalar
- Input arguments are passed by reference
- pos: position in the array/matrix/cube/…

Launching a kernel:

[s1,…,sM] = parallel_do(dims,argin1,…,arginN,kname)

- dims: area to process in parallel
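A minimal example putting the signature above to work; ivec2 for the position argument matches the (i)vecX types listed earlier, and the image file name is a placeholder:

% Kernel: invert one pixel; pos ranges over the dims passed to parallel_do
function [] = __kernel__ invert(x : mat, y : mat, pos : ivec2)
    y[pos] = 255 - x[pos]
end

x = imread("lena_big.tif")[:,:,1]
y = zeros(size(x))
parallel_do(size(y), x, y, invert)   % launch: one thread per pixel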
47
Kernel functions
48
Device functions
The only functions that can be called from within kernel functions:

dname = __device__ (in1,…,inN) -> function_output
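A small sketch of a device function called from a kernel, following the lambda syntax above (names are illustrative):

square = __device__ (v : scalar) -> v * v

function [] = __kernel__ square_img(x : mat, y : mat, pos : ivec2)
    y[pos] = square(x[pos])   % device functions are callable from kernels
end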
49
Shared memory in kernel functions
Global memory access can be accelerated by staging data in shared memory. [Diagram: neighbouring pixels are loaded from global memory into shared memory once, processed there, and the result is written back to global memory.]
50
Shared memory in kernel functions
Global memory access can be accelerated, at the cost of explicit boundary handling. Benchmark over 10000 runs: global memory 3051 ms vs. shared memory 831 ms.
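A rough sketch of the staging pattern; shared(.), syncthreads and the block-position arguments blkpos/blkdim are assumed names following CUDA conventions, and boundary handling across block borders is omitted:

function [] = __kernel__ avg3(x : mat, y : mat, pos : ivec2,
        blkpos : ivec2, blkdim : ivec2)
    s = shared(blkdim[0], blkdim[1])  % per-block tile (assumed API)
    s[blkpos] = x[pos]                % one global read per pixel
    syncthreads                       % wait until the whole tile is loaded
    % 3-point horizontal average read from fast shared memory;
    % clamped to the tile for brevity (real code needs an apron,
    % i.e. the boundary handling mentioned above)
    l = max(blkpos[1]-1, 0)
    r = min(blkpos[1]+1, blkdim[1]-1)
    y[pos] = (s[blkpos[0],l] + s[blkpos[0],blkpos[1]] + s[blkpos[0],r]) / 3
end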
51
UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE
Quasar: UNLOCKING THE POWER OF HETEROGENEOUS HARDWARE. Joris Roels, Dirk Van Haerenborgh, Jonas De Vylder & Bart Goossens (IPI, iMinds, Ghent University)