Brook GLES Pi: Democratising Accelerator Programming

Slides:



Advertisements
Similar presentations
GPGPU Programming Dominik G ö ddeke. 2Overview Choices in GPGPU programming Illustrated CPU vs. GPU step by step example GPU kernels in detail.
Advertisements

GPU Programming using BU Shared Computing Cluster
Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.
GPU System Architecture Alan Gray EPCC The University of Edinburgh.
Why GPUs? Robert Strzodka. 2Overview Computation / Bandwidth / Power CPU – GPU Comparison GPU Characteristics.
 Open standard for parallel programming across heterogenous devices  Devices can consist of CPUs, GPUs, embedded processors etc – uses all the processing.
FSOSS Dr. Chris Szalwinski Professor School of Information and Communication Technology Seneca College, Toronto, Canada GPU Research Capabilities.
Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:
CUDA Programming Model Xing Zeng, Dongyue Mou. Introduction Motivation Programming Model Memory Model CUDA API Example Pro & Contra Trend Outline.
Weekly Report Start learning GPU Ph.D. Student: Leo Lee date: Sep. 18, 2009.
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.
Accelerating Machine Learning Applications on Graphics Processors Narayanan Sundaram and Bryan Catanzaro Presented by Narayanan Sundaram.
Contemporary Languages in Parallel Computing Raymond Hummel.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
GPGPU overview. Graphics Processing Unit (GPU) GPU is the chip in computer video cards, PS3, Xbox, etc – Designed to realize the 3D graphics pipeline.
GPGPU platforms GP - General Purpose computation using GPU
OpenSSL acceleration using Graphics Processing Units
Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.
Training Program on GPU Programming with CUDA 31 st July, 7 th Aug, 14 th Aug 2011 CUDA Teaching UoM.
CuMAPz: A Tool to Analyze Memory Access Patterns in CUDA
GPU Programming David Monismith Based on notes taken from the Udacity Parallel Programming Course.
Shared memory systems. What is a shared memory system Single memory space accessible to the programmer Processor communicate through the network to the.
Computer Graphics Graphics Hardware
Implementation of Parallel Processing Techniques on Graphical Processing Units Brad Baker, Wayne Haney, Dr. Charles Choi.
By Arun Bhandari Course: HPC Date: 01/28/12. GPU (Graphics Processing Unit) High performance many core processors Only used to accelerate certain parts.
Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:
Introduction to CUDA (1 of 2) Patrick Cozzi University of Pennsylvania CIS Spring 2012.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
GPU in HPC Scott A. Friedman ATS Research Computing Technologies.
Programming Concepts in GPU Computing Dušan Gajić, University of Niš Programming Concepts in GPU Computing Dušan B. Gajić CIITLab, Dept. of Computer Science.
1 Dr. Scott Schaefer Programmable Shaders. 2/30 Graphics Cards Performance Nvidia Geforce 6800 GTX 1  6.4 billion pixels/sec Nvidia Geforce 7900 GTX.
Accelerating image recognition on mobile devices using GPGPU
Robert Liao Tracy Wang CS252 Spring Overview Traditional GPU Architecture The NVIDIA G80 Processor CUDA (Compute Unified Device Architecture) LAPACK.
GPU Architecture and Programming
Introducing collaboration members – Korea University (KU) ALICE TPC online tracking algorithm on a GPU Computing Platforms – GPU Computing Platforms Joohyung.
- DHRUVA TIRUMALA BUKKAPATNAM Geant4 Geometry on a GPU.
Introduction What is GPU? It is a processor optimized for 2D/3D graphics, video, visual computing, and display. It is highly parallel, highly multithreaded.
OpenCL Programming James Perry EPCC The University of Edinburgh.
May 8, 2007Farid Harhad and Alaa Shams CS7080 Overview of the GPU Architecture CS7080 Final Class Project Supervised by: Dr. Elias Khalaf By: Farid Harhad.
Efficient Parallel CKY Parsing on GPUs Youngmin Yi (University of Seoul) Chao-Yue Lai (UC Berkeley) Slav Petrov (Google Research) Kurt Keutzer (UC Berkeley)
Experiences with Achieving Portability across Heterogeneous Architectures Lukasz G. Szafaryn +, Todd Gamblin ++, Bronis R. de Supinski ++ and Kevin Skadron.
Euro-Par, 2006 ICS 2009 A Translation System for Enabling Data Mining Applications on GPUs Wenjing Ma Gagan Agrawal The Ohio State University ICS 2009.
Debunking the 100X GPU vs. CPU Myth An Evaluation of Throughput Computing on CPU and GPU Present by Chunyi Victor W Lee, Changkyu Kim, Jatin Chhugani,
Introduction to CUDA CAP 4730 Spring 2012 Tushar Athawale.
3/12/2013Computer Engg, IIT(BHU)1 CUDA-3. GPGPU ● General Purpose computation using GPU in applications other than 3D graphics – GPU accelerates critical.
My Coordinates Office EM G.27 contact time:
GPGPU Programming with CUDA Leandro Avila - University of Northern Iowa Mentor: Dr. Paul Gray Computer Science Department University of Northern Iowa.
1 ”MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs” John A. Stratton, Sam S. Stone and Wen-mei W. Hwu Presentation for class TDT24,
Computer Graphics Graphics Hardware
Two-Dimensional Phase Unwrapping On FPGAs And GPUs
Employing compression solutions under openacc
Programmable Shaders Dr. Scott Schaefer.
Our Graphics Environment
Thilina Gunarathne, Bimalee Salpitkorala, Arun Chauhan, Geoffrey Fox
OpenCL 소개 류관희 충북대학교 소프트웨어학과.
Enabling machine learning in embedded systems
Texas Instruments TDA2x and Vision SDK
Graphics Processing Unit
Performance Tuning Team Chia-heng Tu June 30, 2009
Linchuan Chen, Xin Huo and Gagan Agrawal
Faster File matching using GPGPU’s Deephan Mohan Professor: Dr
NVIDIA Fermi Architecture
General Programming on Graphical Processing Units
General Programming on Graphical Processing Units
Computer Graphics Graphics Hardware
Graphics Processing Unit
Multicore and GPU Programming
6- General Purpose GPU Programming
Multicore and GPU Programming
Presentation transcript:

Brook GLES Pi: Democratising Accelerator Programming www.bsc.es Brook GLES Pi: Democratising Accelerator Programming Matina Maria Trompouki, Leonidas Kosmidis HPG 2018

Introduction and Motivation Modern and future computing relies on general purpose accelerators Increased public and scientific interest Increased importance of learning and experimenting with their programming paradigm However their cost is high Accelerator is expensive Requires a high-end host computer to be used GPUs manycores FPGAs

Introduction and Motivation Traditional educational model shifted to self-education Homework practice Massively Open Online Courses (MOOC) Experimentation with educational computers Unaffordable for these target groups Accelerator programming opportunities are limited! GPUs manycores FPGAs

Brook GLES Pi Port of the open-source accelerator programming language Brook on the low-cost ($25) educational computer Raspberry Pi Enables the use of its embedded GPU VideoCore IV, capable of 24 Gflops Standalone development on the device Large and collaborative community Open-source implementation Portability across every embedded GPU supporting OpenGL ES 2 (99% of the embedded devices with a GPU in the market) Allows teaching, experimenting and learning GPGPU programming with affordable devices

Brook We implemented our solution in the Brook programming language [1] open source language developed circa 2004 predecessor of CUDA and OpenCL Commercially adopted by AMD before OpenCL, rebranded as Brook+ Source-to-Source compiler and Runtime Restricted subset of C (no recursion, no goto, no pointers) transforms CUDA-like programs to graphics operations supports multiple backends [1] I. Buck et al, Brook for GPUs, SIGGRAPH 2004

Brook vs CUDA Example

Additions Ported Brook on the Raspberry Pi Introduced an OpenGL ES 2 compiler backend and runtime Optimised for the Raspberry Pi Based only on the core OpenGL ES 2 standard No vendor specific extensions Maximum portability across all embedded devices with GPUs supporting OpenGL ES 2 (>99% of the market) Shares common code base with Brook Auto [1] 2K Lines of code in the compiler and runtime 4K Lines of code in regression tests and benchmarks [1] Trompouki et al, Brook Auto: High-Level Certification-Friendly Programming for GPU-powered Automotive Systems, DAC 2018

Additions The compiler backend uses Nvidia’s Cg compiler like the original Brook Provided only in binary form for x86 Raspberry is ARM-based Emulate compiler using the binary translator qemu-x86 Enables standalone development and compilation on the device Compilation time is similar to the native compilers, e.g. gcc Original Brook only supports floating point (and their vector) streams Added support for char and int Signed and unsigned versions Vector additions

Additions Stream datatypes limitations due to OpenGL ES 2 Input and output streams are limited to 32-bit up to vectors of 4 chars, 1 int, or 1 floating point Only a single output stream is permitted per kernel, up to 32-bit When a kernel violates these rules and the OpenGL ES 2 backend is enabled, the programmer is instructed to rewrite the code

Dropped Features Iterators Unusual Brook feature, syntactic sugar for creating and initialising streams of indices No mapping with any CUDA/OpenCL concept AMD’s Brook+ examples use indexof instead Would complicate unnecessarily the implementation We want equivalence of CPU/GPU code but OpenGL ES 2 only supports normalised coordinates

Dropped Features GatherOp CPU emulation of gather operations Only needed in GPUs not supporting dependent texture lookups All known OpenGL ES 2 GPUs support it ScatterOp CPU emulation of read-modify-write Slow performance Scatter to gather transformation is encouraged for accelerators whenever is possible Neither have an equivalent in CUDA/OpenCL

Currently Unsupported Features Structs Supported in accelerators but their use is discouraged Limits vectorisation Under-utilises memory bandwidth Memory layout transformation is recommended Array of Stuctures (AoS) to Structure of Arrays (SoA) Plans to be supported in the future to allow experiencing the performance difference

Results Experimental Setup AMD Brook+ SDK applications, different input sizes up to 2048 elements Raspberry Pi Brook GLES Pi backend and runtime, OpenGL ES 2 $25 cost Comparison with: NVIDIA GeForce GTX 1050Ti (High/Medium-End desktop GPU) $250 GPU cost, $2500 total cost including the host NVIDIA Jetson TX2 (High-End embedded GPU) $600 platform cost Original Brook OpenGL backend and runtime

Relative Performance of the Systems used in the comparison Relative performance CPU vs GPU of the systems used in comparison Reported by the flops benchmark OpenMP CPU support is disabled in the multicore high-end systems Their speedups are higher Platform (GPU/CPU) Performance Ratio Bandwidth Ratio Raspberry Pi VideoCore IV vs ARM 23 x 1/33 x NVIDIA GTX 1050 Ti vs AMD CPU 19 x 1/11 x NVIDIA Jetson TX2 Pascal GPU vs ARM 11 x 1/4 x

Brook GLES Pi Evaluation: Weak scaling GPU Programs Benchmarks that do not scale with input size in high-end GPUs do not scale either in Brook GLES Pi Relative performance in the same order of magnitude

Brook GLES Pi Evaluation: Strong scaling GPU Programs Same performance trends, relative performance within 2 orders of magnitude

Conclusion Brook GLES Pi Port of the Brook Programming Language over OpenGL ES 2 Optimised for the low-cost educational computer Raspberry Pi Portable OpenGL ES 2 allows the use of any embedded GPU Similar performance trends with original Brook on high-end GPU systems Democratises accelerator programming Code: http://github.com/lkosmid/brook

Thank you!

Brook GLES Pi: Democratising Accelerator Programming www.bsc.es Brook GLES Pi: Democratising Accelerator Programming Matina Maria Trompouki, Leonidas Kosmidis HPG 2018