Processing Framework
Sytse van Geldermalsen
Masters Grid Computing, University of Amsterdam
Internship at Amsterdam Medical Centre

Hello and welcome to my presentation. I study grid computing at the University of Amsterdam and am currently doing my internship here. The context of my project is high performance computing on a single computer. Here at the AMC in particular, with all the medical image processing and the post-processing of large amounts of patient data in follow-up projects and research, there is a need for fast processing, and a large part of our group here already works with the grid. My audience is a little divided: some of you are more involved with the grid and high performance programming than others. I will try to stay basic and general, but towards the end the presentation will become a little more technical.
Contents
OpenCL
Concepts
Problems
Research and projects
Processing Framework
Example

These are the contents of the presentation. I will start with the library I am using to achieve high performance computing. It is called OpenCL, or Open Computing Language, and I am going to talk about the key concepts behind it and why they make it an attractive computing platform. There are some concepts in the world of OpenCL that I will explain because I am going to use them throughout the presentation. OpenCL is a relatively new development (past 3 years), and there are some problems facing developers who write OpenCL programs, one of which is how quickly and easily such programs can be written. Current research tackles these problems in various ways to speed up application development. I aim to bring together several of the ideas presented in this research and introduce a new Processing Framework that will help developers program with OpenCL. I would also like to show a small example of what code that uses this framework would conceptually look like.
OpenCL
Requires vendor support: ARM, AMD, Intel, Apple, Vivante Corporation, STMicroelectronics International NV, IBM Corporation, Imagination Technologies, Creative Labs, NVIDIA
Portable
Works on heterogeneous architecture
Provides great computational power

Okay, so OpenCL is a low level library for writing programs that execute across heterogeneous systems consisting of CPUs, GPUs, FPGAs and Cell processors. This is basically a modern computer, which may have two or more of these devices. The vendor is responsible for supporting OpenCL. The current supporting vendors include Vivante Corporation (embedded GPUs), STMicroelectronics International NV, IBM Corporation, Imagination Technologies (mobile graphics systems), ARM, AMD and Intel, already the big names in hardware manufacturing. You could call OpenCL portable, because you can write the same code and run it on any of these devices, on any operating system. It works on heterogeneous architectures, on Windows or on Linux: the same code! The GPU in particular has seen a great increase in computational power compared to the CPU, which I will show you in the next slide.
http://www.khronos.org/opencl/
How much computational power?
The speed of GPUs is growing rapidly compared to CPUs

With all of this potential power in a personal computer, why not use it? Many of us here use the grid to run large computations, and that is also a possible option. So the question is, why is the GPU so much faster? A GPU has hundreds of computational cores, each capable of running multiple threads, versus the two or four cores of a CPU with one or two threads per core. A GPU is designed to run a lot of simple calculations concurrently.
http://www.r-bloggers.com/cpu-and-gpu-trends-over-time/
Key Concepts
OpenCL: Runtime system
Kernel
Accelerated Device

OpenCL is a runtime system. When the program runs, it compiles the OpenCL kernel code. This kernel is the piece of code that will eventually run on the accelerated device. I will give an example of what this looks like compared to standard code. Because the kernel is compiled at runtime, on the machine where it runs, the same program can use whichever supported device that system offers. An OpenCL supported accelerated device is a dedicated processing device that is capable of running OpenCL kernels.
OpenCL Kernel

// Sequential C/C++ code
for( int x = 0; x < 1024; x++ )
{
    for( int y = 0; y < 1024; y++ )
        matrix[x][y] = matrix[x][y] + 1; // Code is run 1048576 times by this one thread
}

// Parallel kernel code
// Kernel arguments cannot be pointers to pointers, so the matrix is passed as one flat buffer
kernel void MatrixIncrement( global int* matrix )
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int index = x * get_global_size(1) + y; // Position of element [x][y] in the flat buffer
    matrix[index] = matrix[index] + 1; // Code is run once by this thread
}

Say we have a matrix of width 1024 and height 1024, and we simply want to increment every value. In a normal sequential style it is programmed as follows: with two loops we iterate over the indexes x and y and increment the value at that position. Simple as that. However, that line of code will be run more than a million times, one after the other. In the parallel kernel code, a function is executed by the device. Here, every thread performs this code: it retrieves its unique index position and runs the increment once. Done.
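The idea that every thread runs the kernel body exactly once for its own index can be emulated in plain C++. The following is only a sketch under stated assumptions: `WorkItem`, `launch_2d`, `matrix_increment` and `increment_all` are hypothetical names invented for this illustration and are not part of OpenCL, the matrix is assumed to be stored as one flat array, and the launcher visits the indices one after the other on the CPU, whereas a real device would run many of them concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an OpenCL work-item: it only knows its own ids,
// which is what get_global_id(0) and get_global_id(1) would return.
struct WorkItem {
    std::size_t id0;
    std::size_t id1;
};

// Hypothetical launcher: calls the kernel once per index pair.
// Assumption: the calls are made sequentially here; a device would
// execute them in parallel.
template <typename Kernel>
void launch_2d(std::size_t size0, std::size_t size1, Kernel kernel) {
    for (std::size_t i = 0; i < size0; ++i)
        for (std::size_t j = 0; j < size1; ++j)
            kernel(WorkItem{i, j});
}

// The "kernel": touches exactly one element of the flat matrix.
inline void matrix_increment(std::vector<int>& matrix, std::size_t size1, WorkItem item) {
    matrix[item.id0 * size1 + item.id1] += 1;
}

// Builds a size0 x size1 matrix filled with `initial` and increments every element.
inline std::vector<int> increment_all(std::size_t size0, std::size_t size1, int initial) {
    std::vector<int> matrix(size0 * size1, initial);
    launch_2d(size0, size1, [&](WorkItem item) { matrix_increment(matrix, size1, item); });
    return matrix;
}
```

Calling `increment_all(1024, 1024, 0)` would correspond to the slide's example. Note that the kernel body contains no loop: the iteration lives entirely in the launcher, which is the part the device takes over.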
Problems
Low level C/C++ Library
A lot of overhead code
Things can and will go wrong
Wrappers: pycl, cloo, etc.

Now this may all seem simple and easy, but let's not forget that we are dealing with a low level library, where the programmer has to handle a lot of things. Many programmers do not want to get tied up in the hassle of managing the devices they are working on, managing the code and the time it takes to compile or run, et cetera. This is overhead code that reduces productivity and interest in using the library. Because the programmer has to deal with so much, things will go wrong; they always do. Let's look at a model that represents the ease of developing OpenCL applications.
Ease of OpenCL application development
Increase ease of application development
Drivers and Hardware: CPUs, GPUs, Cell Processors, FPGAs
OpenCL C Library
Tools: Debuggers, Profilers
High Level Frameworks
Middleware/Library: Video, Imaging, Math/Sciences, Physics
Wrappers: C++, C#, Java, Python, Javascript

At the lowest level, the drivers communicate with the hardware. The programmer does not deal with this. On top of this, the OpenCL library communicates with the drivers to compile and run kernels. Here the programmer needs to handle compiling the code, requesting the available devices, creating memory specifically for a device, and so on. Tools have been created for the library that can debug and profile it. Very handy tools. There are some lower level middleware packages and libraries for video, imaging, math or physics. On top of that there are high level frameworks that seamlessly handle these layers to increase the ease of application development. This is the scope of my project: I aim to make it easier for programmers to write OpenCL code. We are going to take a look at the high level frameworks section, where a lot of research has been done on optimizing OpenCL code and making it easier to write.
High Level Frameworks
Research has been done in:
Scheduling multiple kernels on a device
Overlapping memory transfers with kernel execution
Load balancing
Distribution over GPUs on the grid
Task scheduling

Various research has been done on optimizing the use of an accelerated device. This is a bit more technical. Running multiple kernels on the same device at the same time yielded some performance improvement. So did transferring memory to a device while a kernel is running on it. Other work covers load balancing between different accelerated devices, running OpenCL code on the grid, and load balancing of memory transfers over the PCI Express bus.
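Overlapping memory transfers with kernel execution, one of the techniques listed above, can be illustrated with double buffering in standard C++. This is a minimal sketch under stated assumptions, not how any of the cited research implements it: `transfer_chunk` is a hypothetical stand-in for a host-to-device copy, `process_chunk` is a hypothetical stand-in for a kernel, and `std::async` plays the role of the asynchronous transfer queue.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for a host-to-device copy of one chunk.
inline std::vector<int> transfer_chunk(const std::vector<int>& host,
                                       std::size_t begin, std::size_t end) {
    return std::vector<int>(host.begin() + static_cast<std::ptrdiff_t>(begin),
                            host.begin() + static_cast<std::ptrdiff_t>(end));
}

// Hypothetical stand-in for a kernel: sums the chunk that is already "on the device".
inline long long process_chunk(const std::vector<int>& device_chunk) {
    return std::accumulate(device_chunk.begin(), device_chunk.end(), 0LL);
}

// Sums `host` chunk by chunk. While chunk i is being processed,
// the transfer of chunk i + 1 is already running in the background.
inline long long overlapped_sum(const std::vector<int>& host, std::size_t chunk_size) {
    if (host.empty() || chunk_size == 0) return 0;

    long long total = 0;
    std::size_t begin = 0;
    std::size_t end = std::min(chunk_size, host.size());
    std::future<std::vector<int>> in_flight = std::async(
        std::launch::async, [&host, begin, end] { return transfer_chunk(host, begin, end); });

    while (true) {
        std::vector<int> current = in_flight.get();  // wait for the pending transfer
        begin = end;
        const bool more = begin < host.size();
        if (more) {
            end = std::min(begin + chunk_size, host.size());
            // Start the next transfer before processing the current chunk.
            in_flight = std::async(
                std::launch::async, [&host, begin, end] { return transfer_chunk(host, begin, end); });
        }
        total += process_chunk(current);  // overlaps with the transfer started above
        if (!more) break;
    }
    return total;
}
```

The design choice to note is the ordering inside the loop: the next transfer is launched before the current chunk is processed, so the two can proceed at the same time.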
Dataflow Processing Framework
In a nutshell:
Based on ideas of different research
Increase the ease of development
Uses the dataflow concept
Simplicity
Asynchronous overlapped data transfers and kernel executions

The research puts forward some interesting ideas. However, I see that it lacks ease of development and support for designing an algorithm conceptually, so I designed a framework that brings a lot of these ideas together. In a nutshell, I wanted the framework to increase the ease of development and to use the dataflow concept, which is useful for conceptual programming. For performance, I do asynchronous overlapped data transfers and kernel executions. And I want to keep it simple for the programmer.
Conceptual Example
[Figure: dataflow graph of six numbered asynchronous processes, with inputs Input A and Input B and outputs Output A and Output B. Legend: Async Process, Async memory xfer, CPU Process, GPU Process, Data Dependency, Data Output.]

Let's look at a conceptual example of a dataflow processing scenario. When I look at an algorithm, I see some data being processed in some way, and it can be broken down into a flow. A conceptual example shows a lot more. Here we see six asynchronous processes; each requires input data and produces output data. Some processes must wait for others, and the data must be prepared for them. Here we have two inputs, which are needed by the CPU and the GPU. This data gets sent to the appropriate devices in parallel. Once the data is ready for those processes, they can start, and so on. This all happens independently and overlapped.
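The rule that a process starts as soon as the data it depends on is ready can be sketched with futures in standard C++. Everything here is an assumption made for illustration: the graph is a hypothetical four-process example rather than the six-process graph on the slide, all processes run on CPU threads, and the helpers `ready`, `then` and `then2` are invented names, not part of the framework.

```cpp
#include <algorithm>
#include <cassert>
#include <future>
#include <iterator>
#include <utility>
#include <vector>

using Data = std::vector<int>;

// Wraps input data that is available immediately.
inline std::shared_future<Data> ready(Data d) {
    std::promise<Data> p;
    p.set_value(std::move(d));
    return p.get_future().share();
}

// Starts an asynchronous process that waits for one dependency.
template <typename Fn>
std::shared_future<Data> then(std::shared_future<Data> in, Fn fn) {
    return std::async(std::launch::async, [in, fn] { return fn(in.get()); }).share();
}

// Starts an asynchronous process that waits for two dependencies.
template <typename Fn>
std::shared_future<Data> then2(std::shared_future<Data> a, std::shared_future<Data> b, Fn fn) {
    return std::async(std::launch::async, [a, b, fn] { return fn(a.get(), b.get()); }).share();
}

// Hypothetical graph: processes 1 and 2 sort the two inputs independently,
// process 3 merges both results, process 4 keeps only the even values.
inline Data run_example_graph(Data input_a, Data input_b) {
    auto a = ready(std::move(input_a));
    auto b = ready(std::move(input_b));

    auto sort_fn = [](Data d) { std::sort(d.begin(), d.end()); return d; };
    auto p1 = then(a, sort_fn);  // may run at the same time as p2
    auto p2 = then(b, sort_fn);

    auto p3 = then2(p1, p2, [](const Data& x, const Data& y) {
        Data out;
        std::merge(x.begin(), x.end(), y.begin(), y.end(), std::back_inserter(out));
        return out;
    });

    auto p4 = then(p3, [](Data d) {
        d.erase(std::remove_if(d.begin(), d.end(), [](int v) { return v % 2 != 0; }), d.end());
        return d;
    });

    return p4.get();  // blocks until the whole graph has finished
}
```

No process is told explicitly when to start. The order of execution follows from the data dependencies alone, which is the essence of the dataflow concept.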
Programming with the Framework
The programmer defines a number of processes and data
Each process uses an OpenCL kernel or a standard C/C++ function
The user defines the arguments of the kernel with the defined data
These processes compute on a user selected device: CPU, GPU, FPGA, etc.
Signal the framework to run
Programming Example
[Figure: dataflow graph of the framework example, with Array A and Array B as inputs, processes 1 Sort, 2 Sort, 3 Filter and 4 Search, and Output.]

ProcessingFramework pf;
ProcessingComponent one, two, three, four;
DeviceMemory ArrayA, ArrayB, Output;

// Define the data
ArrayA = pf.CreateInputMemory( mem_size );
ArrayB = pf.CreateInputMemory( mem_size );
Output = pf.CreateOutputMemory( mem_size );

// Define the processes and the device each one runs on
one = pf.CreateAPC( pf.GPUDevice(), "Sort" );
two = pf.CreateAPC( pf.CPUDevice(), "Sort" );
three = pf.CreateAPC( pf.GPUDevice(), "Filter" );
four = pf.CreateAPC( pf.CPUDevice(), "Search" );

// Define the arguments and dependencies
one.SetArg( 0, ArrayA );
one.SetWorkSize( arr_size );

two.SetArg( 0, ArrayB );
two.SetWorkSize( arr_size );

three.SetDependency( one, 0, ArrayA );
three.SetDependency( two, 1, ArrayB );
three.SetWorkSize( arr_size );

four.SetDependency( three, 0, ArrayA );
four.SetArg( 1, Output );
four.SetWorkSize( arr_size );

// Signal the framework to run
pf.Run();