Enabling machine learning in embedded systems

Presentation transcript:

Enabling machine learning in embedded systems Rod Burns, Developer Relations, Codeplay Software

Leadership Products Enabling Advanced Applications on Complex Processor Systems

Company: High-performance software solutions for custom heterogeneous systems. Enabling the toughest processor systems with open-standards-based tools and middleware. Established 2002 in Scotland, UK.

Markets: Vision Processing; Machine Learning; Data Compute; High Performance Computing (HPC); Automotive (ISO 26262); IoT, Smartphones & Tablets; Medical & Industrial.

Products: C++ platform with SYCL™, enabling vision & machine learning e.g. TensorFlow™. The heart of Codeplay's compute technology, enabling OpenCL™, SPIR™, HSA™ and Vulkan™.

Customers/Partners: Many Global Companies.

Codeplay is internationally recognized for expertise in Heterogeneous Systems, and has many years of experience in the development of Compilers, Runtimes, Debuggers, Test Systems, and other specialized tools. Codeplay has delivered standards-compliant systems for some of the largest semiconductor companies in the world, focusing specifically on high-performance heterogeneous processor solutions for CPUs, GPUs, DSPs, FPGAs and other specialized imaging and vision processors. Working within The Khronos™ Group to define new open standards such as OpenCL™, SPIR™, SYCL™, and Vulkan®, and leading the creation of new System Runtime and Tools standards through the HSA Foundation, Codeplay has earned a reputation as one of the leaders in compute systems.

Connected Artificial Intelligence What do these artificial intelligence applications have in common?

Connected - Powerful - Power Hungry High bandwidth connectivity Access to powerful machines No power constraints

Embedded Machine Learning Cannot rely on connectivity More limited processors Power constrained

Provides A Unique Challenge This car needs to be able to process all this data and make instant decisions

Why Different Sensors Are Needed? The car needs multiple data sets to make decisions This data needs to be processed

Where Do We Need To Go? “On a 100 millimetre-squared chip, Google needs something like 50 teraflops of performance” - Daniel Rosenband (Google’s self-driving car project) at HotChips 2016 [Chart: GFLOPS against year of introduction for Desktop CPU, Desktop GPU (200W+), Integrated Desktop GPU (~15W), Mobile CPU and Smartphone GPU (~2W)] These trend lines seem to violate the rules of physics…

Software Drives Us Towards This Target What do software developers need, to build complex applications for AI?

How Can Software Get Us There? Deep neural networks are bringing the reality of these systems closer We must make effective use of processors

Tensors Tensors are n-dimensional matrices represented by arrays Calculations on matrices can be done using linear algebra These operations are complex and processor intensive
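As a minimal sketch of "n-dimensional matrices represented by arrays", the C++ below stores a tensor in one flat array and converts an n-dimensional index into an offset. The `Tensor` struct and its row-major layout are assumptions made for illustration; this is not the TensorFlow or Eigen tensor class.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical illustration type, not a TensorFlow or Eigen class.
struct Tensor {
    std::vector<std::size_t> shape;  // extent of each dimension
    std::vector<float> data;         // all elements in one flat, row-major array

    explicit Tensor(std::vector<std::size_t> s) : shape(std::move(s)) {
        std::size_t count = 1;
        for (std::size_t extent : shape) count *= extent;
        data.assign(count, 0.0f);
    }

    // Convert an n-dimensional index into an offset in the flat array.
    std::size_t offset(const std::vector<std::size_t>& index) const {
        assert(index.size() == shape.size());
        std::size_t off = 0;
        for (std::size_t d = 0; d < shape.size(); ++d) {
            assert(index[d] < shape[d]);
            off = off * shape[d] + index[d];
        }
        return off;
    }

    float& at(const std::vector<std::size_t>& index) { return data[offset(index)]; }
};
```

With a shape of 2 x 3 x 4 the flat array holds 24 elements, and the index (1, 2, 3) maps to the last of them.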

Linear Algebra TensorFlow relies heavily on linear algebra for processing This involves many matrix calculations Matrix calculations are processor intensive
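To give a feel for why matrix calculations are processor intensive, here is a plain C++ sketch of a textbook matrix multiplication. The function name `matmul` and the row-major storage are assumptions for this example; libraries such as Eigen use far more optimized routines. Multiplying an m x k matrix by a k x n matrix this way takes m * n * k multiply-add steps.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply an (m x k) matrix a by a (k x n) matrix b, both stored row-major.
// Illustrative textbook version: m * n * k multiply-add steps in total.
std::vector<float> matmul(const std::vector<float>& a, const std::vector<float>& b,
                          std::size_t m, std::size_t k, std::size_t n) {
    assert(a.size() == m * k);
    assert(b.size() == k * n);
    std::vector<float> c(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                sum += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = sum;  // every output element is independent of the others
        }
    }
    return c;
}
```

Each output element depends only on one row of the first matrix and one column of the second, which is what makes the work easy to spread across many processing elements.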

Heterogeneous Systems GPUs can be used to run operations in parallel Accelerates performance Reduces overall power used

Implications On Hardware Matrix calculations involve many similar operations Run serially these operations are limited to a small number of cores on a CPU [Figure: a CPU handling a few 2*3 multiplications at a time, beside a GPU grid of elements with n more rows]

Implications On Hardware A GPU or other accelerator can run many thousands of operations in parallel [Figure: CPU elements beside a GPU grid of elements with n more rows]
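The 2*3 multiplications in the figures are independent of one another, which is why they can be handed to many processing elements. The sketch below shows that data-parallel decomposition in standard C++. The names `multiply_chunk` and `multiply_chunked` are hypothetical, and the chunks are deliberately run one after another (in reverse order) on the CPU rather than on a real accelerator, only to show that the result does not depend on execution order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Work for one hypothetical worker: multiply the elements in [begin, end).
void multiply_chunk(const std::vector<float>& a, const std::vector<float>& b,
                    std::vector<float>& out, std::size_t begin, std::size_t end) {
    for (std::size_t i = begin; i < end; ++i) out[i] = a[i] * b[i];
}

// Split the index range into `workers` chunks. On an accelerator each chunk
// (or each element) could run at the same time; here they run sequentially.
std::vector<float> multiply_chunked(const std::vector<float>& a,
                                    const std::vector<float>& b,
                                    std::size_t workers) {
    assert(a.size() == b.size());
    assert(workers > 0);
    std::vector<float> out(a.size());
    const std::size_t chunk = (a.size() + workers - 1) / workers;  // ceiling division
    // Visit the chunks last-to-first: no chunk reads another chunk's output,
    // so any order (or true parallel execution) gives the same result.
    for (std::size_t w = workers; w-- > 0;) {
        const std::size_t begin = std::min(w * chunk, a.size());
        const std::size_t end = std::min(begin + chunk, a.size());
        multiply_chunk(a, b, out, begin, end);
    }
    return out;
}
```

A CPU offers a handful of such workers, while a GPU or other accelerator offers thousands, so the same decomposition finishes in far fewer steps.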

What Does This Mean For TensorFlow Since linear algebra involves many similar calculations on data sets these can be run in parallel on GPUs and other accelerator processors https://xiaoxiaowang87.github.io/build_a_pc/

TensorFlow And Eigen TensorFlow uses the Eigen library for linear algebra operations Eigen offers additional performance benefits

Kernel Fusion The Eigen library uses kernel fusion This involves executing a sequence of kernels that can share some data This reduces the overhead of memory movement

What Is A Kernel? A kernel is a function that is applied to some data As a simple example we could have a kernel that does C = A + B This kernel iterates over all the data provided to it
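The C = A + B kernel can be sketched in plain C++ as follows. The `launch` helper is a hypothetical stand-in for a runtime such as OpenCL or SYCL, which would run the kernel body once per index, potentially in parallel; here it is an ordinary loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Kernel body: the work done for a single element index i.
inline void add_kernel(std::size_t i, const float* a, const float* b, float* c) {
    c[i] = a[i] + b[i];
}

// Hypothetical launcher: calls the kernel once for every index in [0, n).
// A real accelerator runtime could execute these calls concurrently.
template <typename Kernel>
void launch(std::size_t n, Kernel kernel) {
    for (std::size_t i = 0; i < n; ++i) kernel(i);
}

// Compute C = A + B by launching the kernel over all elements.
std::vector<float> vector_add(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    launch(a.size(), [&](std::size_t i) { add_kernel(i, a.data(), b.data(), c.data()); });
    return c;
}
```

Separating the per-element body from the launcher mirrors how kernels are written for accelerators: the programmer describes one element's work and the runtime decides how to spread it across the hardware.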

Combining Kernels These operations can be combined and run together This avoids expensive memory movement overheads
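A small sketch of the idea, assuming a made-up computation D = (A + B) * C: the unfused version runs two kernels and writes an intermediate array to memory between them, while the fused version does both steps per element in a single pass. The function names are hypothetical, and this is hand-written fusion for illustration, not how Eigen generates fused kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unfused: two separate kernels with a temporary array written and re-read.
std::vector<float> add_then_multiply_unfused(const std::vector<float>& a,
                                             const std::vector<float>& b,
                                             const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<float> tmp(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) tmp[i] = a[i] + b[i];  // kernel 1
    std::vector<float> d(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) d[i] = tmp[i] * c[i];  // kernel 2
    return d;
}

// Fused: one kernel, the intermediate sum never leaves the processing element.
std::vector<float> add_then_multiply_fused(const std::vector<float>& a,
                                           const std::vector<float>& b,
                                           const std::vector<float>& c) {
    assert(a.size() == b.size() && b.size() == c.size());
    std::vector<float> d(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float sum = a[i] + b[i];  // stays in a register
        d[i] = sum * c[i];
    }
    return d;
}
```

Both functions return the same values; the fused one avoids one full write and one full read of the temporary array, which is the memory movement the slide refers to.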

Applying Fusion To TensorFlow Eigen TensorFlow uses Eigen to achieve kernel fusion CUDA does this for NVIDIA GPUs; SYCL is used here for AMD GPUs [Charts: speedup by fusion; unfused performance improvement, AMD GPU vs multi-core Intel CPU] Total performance improvement delivered by SYCL is both of these graphs combined

Implications For Embedded Applications Embedded hardware is less powerful than in a data centre Increased performance and reduced power usage makes more complex AI applications possible

Open Standards For Hardware Software developers need a consistent environment and interface for developing AI applications

OpenCL And SYCL

OpenCL™: Cross-platform parallel programming for a range of processors. Developers use C for programming hardware. ComputeAorta implementation from Codeplay.

SYCL™: Higher level abstraction of OpenCL. Modern C++ supported and single source. ComputeCpp implementation used for TensorFlow with OpenCL. TriSYCL - alternative implementation.

Integration With TensorFlow Eigen is a C++ library, OpenCL does not support C++ Templates in Eigen can be reused with SYCL

Benefits Of SYCL Integration Devices already offering OpenCL SPIR/SPIR-V support can be targeted New TensorFlow operations can be added using C++ The interfaces remain the same between layers

Benefits For Developers TensorFlow applications will be accelerated without any special coding Developers can target a wider choice of hardware Support for embedded hardware processors

Beyond TensorFlow We are working on integrating OpenCV, Caffe(2) and other AI frameworks on embedded hardware We use open standards OpenCL and SYCL to achieve this

SYCLBLAS Our research team is building a BLAS framework using SYCL This has the potential to enable many deep learning frameworks on OpenCL hardware

Our Solution Stack [Diagram: software stack targeting CPU, DSP, GPU and FPGA via LLVM]

Renesas R-Car With OpenCL And SYCL We are enabling AI frameworks including OpenCV and TensorFlow on Renesas R-Car hardware http://developer.codeplay.com

Combining open standards to deliver platforms for software developers SYCL combines C++ single-source with OpenCL acceleration OpenCL lets us run on a very wide range of accelerators now and in the future Single-source is the most widely adopted machine learning programming model C++ single-source lets us create customizable graph models

Open Standards For AI Talk to me about open standards with TensorFlow and other AI frameworks on embedded hardware rod@codeplay.com