Processing Framework Sytse van Geldermalsen


Processing Framework
Sytse van Geldermalsen
Masters Grid Computing, University of Amsterdam
Internship at the Amsterdam Medical Centre (AMC)

Hello and welcome to my presentation. I study grid computing at the University of Amsterdam and am currently doing my internship here at the AMC. The context of my project is high-performance computing on a single computer. Especially here at the AMC, with all the medical image processing and the post-processing of large amounts of patient data in follow-up projects and research, there is a need for fast processing, and a large part of our group works with the grid. My audience is a little divided: some of you are more involved with the grid and high-performance programming than others. I will try to stay basic and general, but nearing the end the presentation will become a bit more technical.

Contents
- OpenCL
- Concepts
- Problems
- Research and projects
- Processing Framework
- Example

These are the contents of the presentation. I will start with the library I am using to achieve high-performance computing: OpenCL, the Open Computing Language. I will talk about the key concepts behind it and why it is an attractive computing platform. There are some concepts in the world of OpenCL that I will explain, because I will use them throughout the presentation. OpenCL is a relatively new development (the past three years), and developers who write OpenCL programs face some problems, one of which is how fast and easily OpenCL programs can be written. These problems are being tackled in various ways in current research to speed up application development. Drawing on the many ideas presented in this research, I aim to bring some of them together and introduce a new Processing Framework that will help developers program with OpenCL. I would also like to show a small example of what it would conceptually look like to write code using this framework.

OpenCL
- Requires vendor support: ARM, AMD, Intel, Apple, Vivante Corporation, STMicroelectronics International NV, IBM Corporation, Imagination Technologies, Creative Labs, NVIDIA
- Portable: works on heterogeneous architectures
- Provides great computational power

OpenCL is a low-level library for writing programs that execute across heterogeneous systems consisting of CPUs, GPUs, FPGAs and Cell processors. That describes basically any modern computer, which may contain two or more of these devices. Each vendor is responsible for supporting OpenCL; the current supporting vendors include Vivante Corporation (embedded GPUs), STMicroelectronics International NV, IBM Corporation, Imagination Technologies (mobile graphics systems), and ARM, AMD, Intel and NVIDIA, already the big names in hardware manufacturing. You could call OpenCL portable because you can write the same code and run it on any of these devices, on any operating system (Windows, Linux): the same code! The GPU in particular has seen a great speedup in computational power, which I will show versus a CPU in the next slide. http://www.khronos.org/opencl/

How much computational power?
- The speed of GPUs is growing rapidly versus CPUs

With all of this potential power on a personal computer, why not use it? Many of us here use the grid to run large computations, and this is also a possible option. So why is the GPU so much faster? A GPU has hundreds of computational cores, each capable of running multiple threads, versus the CPU's two or four cores with one or two threads per core. A GPU is designed to run a lot of simple calculations concurrently. http://www.r-bloggers.com/cpu-and-gpu-trends-over-time/

Key Concepts
- OpenCL runtime system
- Kernel
- Accelerated device

OpenCL is a runtime system: when the program runs, it compiles the OpenCL kernel code. The kernel is the piece of code that will eventually run on the accelerated device; I will give an example of what this looks like compared to standard code. Because the kernel is compiled at runtime, any device can be used on any system. An OpenCL-supported accelerated device is a dedicated processing device that is capable of running OpenCL kernels.

OpenCL Kernel

// Sequential C/C++ code
for( int x = 0; x < 1024; x++ ) {
    for( int y = 0; y < 1024; y++ )
        matrix[x][y] = matrix[x][y] + 1;  // This line runs 1048576 times in one thread
}

// Parallel kernel code (OpenCL kernels cannot take int**,
// so the matrix is passed as a flat buffer plus its width)
kernel void MatrixIncrement( global int* matrix, int width )
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    matrix[y * width + x] = matrix[y * width + x] + 1;  // Runs once per thread
}

Say we have a matrix of width 1024 and height 1024, and we simply want to increment every value. In the normal sequential style it is programmed as follows: two loops iterate over indices x and y and increment each value. Simple as that; however, that line of code runs more than a million times, one iteration after the other. In the parallel kernel code, the function is executed by the device, and every thread performs this code: it retrieves its unique index position and runs the increment once. Done.
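To make the execution model concrete, here is a small, self-contained C++ sketch (plain host code, not OpenCL itself) that mimics how the runtime launches one kernel instance per point of a 2D global work size; the x and y parameters below play the role of get_global_id(0) and get_global_id(1):

```cpp
#include <vector>
#include <cstddef>

// Simulated "kernel": runs once per (x, y) work-item, like the OpenCL
// kernel above. The matrix is a flat buffer indexed as y * width + x.
void MatrixIncrement(std::vector<int>& matrix, std::size_t width,
                     std::size_t x, std::size_t y) {
    matrix[y * width + x] += 1;
}

// Simulated runtime: launches the kernel over a 2D global range.
// A real OpenCL runtime would run these work-items in parallel on the device;
// this sequential loop only illustrates the index space.
void EnqueueNDRange(std::vector<int>& matrix,
                    std::size_t width, std::size_t height) {
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            MatrixIncrement(matrix, width, x, y);
}
```

For a 1024 by 1024 matrix the runtime would launch 1048576 work-items, each incrementing exactly one element.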

Problems
- Low-level C/C++ library
- A lot of overhead code
- Things can and will go wrong

Now this may all seem simple and easy, but let's not forget that we are dealing with a low-level library, where the programmer has to handle a lot of things. Many programmers do not want to get tied up in the hassle of managing the devices they are working on, managing the code and when it compiles or runs, et cetera. This is overhead code that reduces productivity, and with it the interest in using the library. Because the programmer has to deal with so much, things will go wrong; they always do. Let's look at a model that represents the ease of developing OpenCL applications. (There are also wrappers such as PyCL and Cloo.)

Ease of OpenCL application development
- Drivers and hardware: CPUs, GPUs, Cell processors, FPGAs
- OpenCL C library
- Wrappers: C++, C#, Java, Python, JavaScript
- Tools: debuggers, profilers
- Middleware/libraries: video, imaging, math/sciences, physics
- High-level frameworks

At the lowest level, the drivers communicate with the hardware; the programmer does not deal with this. On top of this, the OpenCL library communicates with the drivers to compile and run kernels. Here the programmer needs to handle compiling the code, requesting available devices, creating memory specifically for a device, and so on. Tools have been created for the library to debug and profile it, which are very handy. There are middleware and libraries for video, imaging, math and physics. On top of all that sit high-level frameworks that seamlessly handle these layers to increase the ease of application development. This is the scope of my project: I aim to make it easier for programmers to write OpenCL code. We are going to look at the high-level frameworks layer, where a lot of research has been done into optimizing OpenCL and making it easier to write.

High Level Frameworks
Research has been done in:
- Scheduling multiple kernels on a device
- Overlapping memory transfers with kernel execution
- Load balancing
- Distribution over GPUs on the grid
- Task scheduling

Various research has been done into optimizing the use of an accelerated device; this part is a bit more technical. Running multiple kernels on the same device at the same time yielded some performance improvement, as did transferring memory to a device while a kernel runs on it at the same time. Other work covers load balancing between different accelerated devices, running OpenCL code on the grid, and load balancing memory transfers over the PCI Express bus.
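The "overlapping memory transfers with kernel execution" idea can be sketched in plain C++ as double buffering: while chunk i executes, the transfer of chunk i+1 is already in flight. This is a minimal illustration, not the framework's actual implementation; transfer() stands in for a host-to-device copy and execute() for a kernel launch:

```cpp
#include <vector>
#include <future>
#include <numeric>
#include <cstddef>

// Pretend host-to-device copy: in a real program this would be an
// asynchronous PCI Express transfer to the accelerated device.
std::vector<int> transfer(const std::vector<int>& host_chunk) {
    return host_chunk;
}

// Pretend kernel launch: here it just sums the chunk on the "device".
long long execute(const std::vector<int>& device_chunk) {
    return std::accumulate(device_chunk.begin(), device_chunk.end(), 0LL);
}

// Double buffering: kick off the transfer of the next chunk before
// executing the current one, so copy and compute overlap.
long long process_chunks(const std::vector<std::vector<int>>& chunks) {
    long long total = 0;
    if (chunks.empty()) return total;
    auto pending = std::async(std::launch::async, transfer, std::cref(chunks[0]));
    for (std::size_t i = 0; i < chunks.size(); ++i) {
        std::vector<int> on_device = pending.get();   // wait for transfer i
        if (i + 1 < chunks.size())                    // start transfer i+1
            pending = std::async(std::launch::async, transfer,
                                 std::cref(chunks[i + 1]));
        total += execute(on_device);                  // overlaps transfer i+1
    }
    return total;
}
```

With real transfers and kernels, the copy of each next chunk hides behind the execution of the current one, which is exactly the overlap the research above exploits.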

Dataflow Processing Framework
In a nutshell:
- Based on ideas from different research
- Increases the ease of development
- Uses the dataflow concept
- Simplicity
- Asynchronous, overlapped data transfers and kernel executions

The research puts forward some interesting ideas; however, I started to see a lack of ease of development and of conceptual development of an algorithm, so I designed a framework that brings a lot of these ideas together. In a nutshell, I wanted the framework to increase the ease of development, use the dataflow concept (useful for conceptual programming), achieve performance through asynchronous, overlapped data transfers and kernel executions, and stay simple for the programmer.

Conceptual Example

[Diagram: six asynchronous processes (1-6) connected by data dependencies, with Input A and Input B at the top and Output A and Output B at the bottom. Legend: async process, async memory transfer, CPU process, GPU process, data dependency, data output.]

Let's look at a conceptual example of a dataflow processing scenario. When I look at an algorithm, I see data being processed in some way, and the algorithm can be broken down into a flow; a conceptual example shows a lot more. Here we see six asynchronous processes, each of which requires input data and produces output data. Some processes must wait for others, and the data must be prepared for them. We have two inputs, which are needed by the CPU and the GPU; this data is sent to the appropriate devices in parallel, and once the data is ready for a process it can start, and so on. This all happens independently and overlapped.
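The dependency behaviour in the conceptual example can be sketched in miniature with standard C++ futures: two processes run asynchronously on their own inputs, and a third process starts only once its data dependencies are satisfied. The process bodies (sort and merge) are placeholders chosen for illustration:

```cpp
#include <future>
#include <vector>
#include <algorithm>

// Processes 1 and 2: independent, so they can run concurrently
// (conceptually, one on the CPU and one on the GPU).
std::vector<int> sorted(std::vector<int> v) {
    std::sort(v.begin(), v.end());
    return v;
}

// Process 3: has data dependencies on the outputs of 1 and 2.
std::vector<int> merge_streams(std::vector<int> a, std::vector<int> b) {
    std::vector<int> out(a.size() + b.size());
    std::merge(a.begin(), a.end(), b.begin(), b.end(), out.begin());
    return out;
}

std::vector<int> run_flow(std::vector<int> inputA, std::vector<int> inputB) {
    auto f1 = std::async(std::launch::async, sorted, std::move(inputA));
    auto f2 = std::async(std::launch::async, sorted, std::move(inputB));
    // Data dependency: block until both upstream processes have produced output.
    return merge_streams(f1.get(), f2.get());
}
```

A dataflow framework generalizes this pattern: the programmer only declares the dependencies, and the runtime decides when each process may fire.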

Programming with the Framework
- The programmer defines a number of processes and the data they use
- Each process uses an OpenCL kernel or a standard C/C++ function
- The user binds the arguments of the kernel to the defined data
- The processes compute on a user-selected device: CPU, GPU, FPGA, etc.
- Finally, the user signals the framework to run

Programming Example

[Diagram: Array A and Array B feed processes 1 (Sort, GPU) and 2 (Sort, CPU); both feed process 3 (Filter, GPU), which feeds process 4 (Search, CPU), producing Output.]

ProcessingFramework pf;
ProcessingComponent one, two, three, four;
DeviceMemory ArrayA, ArrayB, Output;

ArrayA = pf.CreateInputMemory( mem_size );
ArrayB = pf.CreateInputMemory( mem_size );
Output = pf.CreateOutputMemory( mem_size );

one   = pf.CreateAPC( pf.GPUDevice(), "Sort" );
two   = pf.CreateAPC( pf.CPUDevice(), "Sort" );
three = pf.CreateAPC( pf.GPUDevice(), "Filter" );
four  = pf.CreateAPC( pf.CPUDevice(), "Search" );

one.SetArg( 0, ArrayA );
one.SetWorkSize( arr_size );

two.SetArg( 0, ArrayB );
two.SetWorkSize( arr_size );

three.SetDependency( one, 0, ArrayA );
three.SetDependency( two, 1, ArrayB );
three.SetWorkSize( arr_size );

four.SetDependency( three, 0, ArrayA );
four.SetArg( 1, Output );
four.SetWorkSize( arr_size );

pf.Run();