Harris Gasparakis, Ph.D. Raghunath Rao, Ph.D.

Presentation transcript:

Accelerating Computer Vision and Image Processing via Heterogeneous Compute… Transparently!
Harris Gasparakis, Ph.D., and Raghunath Rao, Ph.D.

Vision and image processing are computationally demanding!
Applications: automatic inspection, medical image analysis, autonomous navigation, human-machine interfaces, augmented reality, robotics, security/surveillance, data analytics and organization… and more.
Workload: millions of pixels per image, 100s of calculations per pixel, 100s–1000s of image frames per second, and complex, constantly evolving algorithms.
Requirements: hungry for PERFORMANCE, but needs to be PROGRAMMABLE, and within POWER and COST budgets.
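To put rough numbers on the workload above, the per-second demand is simply pixels per frame times operations per pixel times frames per second. The sketch below is only that arithmetic; the function name `requiredOpsPerSecond` and the sample inputs are illustrative assumptions, not figures from the talk (apart from the 3840x2160 frame size, which is the resolution quoted later for the experimental results).

```cpp
#include <cassert>
#include <cstdint>

// Rough arithmetic throughput needed by a per-pixel vision pipeline.
// All three inputs are illustrative assumptions, not measurements.
std::uint64_t requiredOpsPerSecond(std::uint64_t pixelsPerFrame,
                                   std::uint64_t opsPerPixel,
                                   std::uint64_t framesPerSecond) {
    assert(pixelsPerFrame > 0 && opsPerPixel > 0 && framesPerSecond > 0);
    // pixels/frame * ops/pixel * frames/s = ops/s
    return pixelsPerFrame * opsPerPixel * framesPerSecond;
}
```

For instance, `requiredOpsPerSecond(3840ULL * 2160ULL, 100, 100)` evaluates to 82,944,000,000, i.e. roughly 83 billion operations per second for a single 4K stream at the low end of the ranges listed above.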

The traditional solution
CPU for system control, I/O, and UI.
Hardware offload (DSP / FPGA / ASIC) for compute-intensive processing.
Various tradeoffs of performance, programmability, power, and cost.
[Diagram: a CPU with system memory, connected to several DSP/FPGA/ASIC offload devices, each with its own device memory.]

Evolution of heterogeneous compute
1. GPU compute: the dGPU brought 100s–1000s of GFLOPS of data-parallel performance… but also data copy overheads, kernel launch constraints, and a need for expert programming. [Diagram: CPU with main (host) memory and pinned PCIe memory, connected over PCIe® to a dGPU with device memory.]
2. APU with iGPU: an easier-to-use SoC; unified memory eliminates some copies. [Diagram: CPU and iGPU with host memory, device-visible host memory, device memory, and host-visible device memory.]
3. Heterogeneous System Architecture (HSA) APU: true heterogeneous compute across CPU and GPU. Share pointers freely, move work freely across CPU and GPU, and use standard programming languages. [Diagram: CPU and HSA iGPU sharing physical memory through a cache, with unified (bidirectionally coherent, pageable) virtual memory.]

Open-source Computer Vision Library
~3K algorithms/functions/samples, BSD license, ~8M downloads.
Core team and contributors named on the slide: Intel, Willow Garage, Itseez, Nvidia, Google, ISCAS, MultiCoreWare.
[Timeline figure: year marks 2000, 2008, 2009, 2012, 2013, present; milestones: first public release; v2.0, C++ API; @github; v2.4.3, OpenCL™; v3.0, T-API.]

Open-source Computer Vision Library
[Image gallery with the labels: filters, transformations, edges, segmentation, detection/recognition, robust features, optical flow, depth, calibration, Streetview (Google), background subtraction, end applications. Images courtesy of Itseez.]

OpenCV supports multiple platforms
In OpenCV 2.4.x, CPU and OpenCL™ are similar yet distinct code paths.

CPU code path:

    // initialization
    VideoCapture vcap(...);
    CascadeClassifier fd("haar_ff.xml");
    Mat frame, frameGray;
    vector<Rect> faces;
    for(;;){
        vcap >> frame;                               // grab the next frame
        cvtColor(frame, frameGray, COLOR_BGR2GRAY);  // convert to grayscale
        equalizeHist(frameGray, frameGray);          // normalize contrast
        fd.detectMultiScale(frameGray, faces);       // run the Haar cascade
    }

OpenCL™ code path (ocl module):

    // initialization
    VideoCapture vcap(...);
    ocl::OclCascadeClassifier fd("haar_ff.xml");
    ocl::oclMat frame, frameGray;
    vector<Rect> faces;
    for(;;){
        vcap >> frame;
        ocl::cvtColor(frame, frameGray, COLOR_BGR2GRAY);
        ocl::equalizeHist(frameGray, frameGray);
        fd.detectMultiScale(frameGray, faces);
    }

Speaker notes: describe the CPU code path with the example; mention that the CUDA code path is similar (no need to show an example); comment on how maintaining separate code paths is a pain for developers.

Introducing the transparent API

    // initialization
    VideoCapture vcap(...);
    CascadeClassifier fd("haar_ff.xml");
    UMat frame, frameGray;   // UMat replaces Mat; the calls below are unchanged
    vector<Rect> faces;
    for(;;){
        vcap >> frame;
        cvtColor(frame, frameGray, COLOR_BGR2GRAY);
        equalizeHist(frameGray, frameGray);
        fd.detectMultiScale(frameGray, faces);
    }

Speaker notes: introduce the T-API and show the code example; comment on the unification of source code (C++, IPP, OpenCL; ignore CUDA); mention the advantage of a single binary.

T-API: under the hood
Easy transition path from 2.x to 3.x: code that used to work in 2.x should still work. Therefore, cv::Mat is still around.
Both Mat and UMat are views into UMatData, which does the heavy lifting.
UMatData holds reference counts, dirty bits, opaque handles (e.g. clBuffer), CPU data, and GPU data, and handles data synchronization efficiently.
The two views convert into each other: UMat::getMat(…) yields a Mat, and Mat::getUMat(…) yields a UMat.
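The idea described above (two views over one shared storage object that tracks which copy is stale and synchronizes lazily) can be illustrated with a toy model in plain C++. This is a sketch under loud assumptions: it is not OpenCV's UMatData, the names `SharedData`, `HostView`, `DeviceView`, and `addToAll` are hypothetical, the "device" is just a second vector, and `std::shared_ptr` stands in for the reference counting.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-in for the shared storage behind both views.
// This is NOT OpenCV's UMatData; every name here is hypothetical.
struct SharedData {
    std::vector<unsigned char> host;    // CPU-side copy
    std::vector<unsigned char> device;  // stands in for an opaque device buffer
    bool hostDirty = false;             // host copy is newer than device copy
    bool deviceDirty = false;           // device copy is newer than host copy
    int transfers = 0;                  // host<->device copies performed so far

    explicit SharedData(std::size_t n) : host(n, 0), device(n, 0) {}
};

// CPU-side view (plays the role of Mat in this sketch).
class HostView {
public:
    explicit HostView(std::shared_ptr<SharedData> d) : data_(std::move(d)) {
        assert(data_ != nullptr);
    }
    unsigned char read(std::size_t i) {
        pullFromDevice();
        return data_->host.at(i);
    }
    void write(std::size_t i, unsigned char v) {
        pullFromDevice();
        data_->host.at(i) = v;
        data_->hostDirty = true;  // device copy is now stale
    }
private:
    // Copy device -> host only if the device copy is newer.
    void pullFromDevice() {
        if (data_->deviceDirty) {
            data_->host = data_->device;
            data_->deviceDirty = false;
            ++data_->transfers;
        }
    }
    std::shared_ptr<SharedData> data_;
};

// Device-side view (plays the role of UMat in this sketch).
class DeviceView {
public:
    explicit DeviceView(std::shared_ptr<SharedData> d) : data_(std::move(d)) {
        assert(data_ != nullptr);
    }
    // Stand-in for a kernel launch: adds delta to every element on the "device".
    void addToAll(unsigned char delta) {
        pushFromHost();
        for (unsigned char& x : data_->device) {
            x = static_cast<unsigned char>(x + delta);
        }
        data_->deviceDirty = true;  // host copy is now stale
    }
private:
    // Copy host -> device only if the host copy is newer.
    void pushFromHost() {
        if (data_->hostDirty) {
            data_->device = data_->host;
            data_->hostDirty = false;
            ++data_->transfers;
        }
    }
    std::shared_ptr<SharedData> data_;
};
```

In this model, back-to-back device operations trigger no transfers; a copy happens only when one side touches data the other side has modified. That is the behavior the dirty bits are meant to buy, although the real library's bookkeeping is considerably more involved.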

T-API: how does it work?
[Layer diagram; layer labels and their contents as listed on the slide:]
Application
Language binding: C++, C, Python, Java
OpenCV
OS: Windows®, Linux, Mac OSX, iOS, Android, WinRT
Multi-threading API: Windows® Concurrency, TBB, GCD
T-API
Implementation: C++, IPP, OpenCL, NEON, CUDA
Hardware: CPU; CPU, GPU, …; ARM®; NVIDIA dGPU

Experimental results

    Runs under comparison                        Uplift
    RX-427B OpenCL iGPU / RX-427B C++ CPU        5.6x
    E8860 OpenCL dGPU /                          15.5x
    R9-290X OpenCL dGPU / RX-427B C++ CPU        65.5x

T-API enables comparison of various execution paths (C++, IPP, OpenCL™).
Performance transparently scales based on platform capabilities.
Benchmark: imgproc module performance test suite; image resolution: 3840x2160.
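When two uplift figures are reported against the same baseline run, dividing one by the other gives the uplift between the two accelerated configurations, because the baseline time cancels. The sketch below assumes that "uplift" means a ratio of runtimes (baseline time divided by accelerated time); the slide does not define the term, and the helper name `relativeUplift` is hypothetical.

```cpp
#include <cassert>

// Uplift of configuration A over configuration B when both uplifts were
// measured against the same baseline:
//   (t_base / t_A) / (t_base / t_B) = t_B / t_A
// Assumes "uplift" is a runtime ratio; the source does not define it.
double relativeUplift(double upliftA, double upliftB) {
    assert(upliftA > 0.0 && upliftB > 0.0);
    return upliftA / upliftB;
}
```

With the two rows that name the RX-427B C++ CPU run as their baseline, `relativeUplift(65.5, 5.6)` is about 11.7, which would put the R9-290X dGPU path at roughly 11.7x the RX-427B iGPU path on this test suite, under the stated assumption.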

Vision algorithms are different… and complex
CPUs and GPUs are complementary compute cores.
GPUs do well with parallelizable algorithms, large data sets, dense data access (cache locality), and high arithmetic complexity (compute-to-memory ratio, occupancy).
CPUs do well on serial algorithms, single-thread execution, branchy code, and memory-irregular or memory-intensive operations.
Vision algorithms are complex and would benefit from flexible partitioning across both CPU and GPU.
Example 1: Machine learning (Viola-Jones face detection). During the AdaBoost cascade, each stage increases sparsity, reduces data locality, and reduces occupancy. Ideally, use the GPU for the first stages and end with the CPU.
Example 2: Object recognition.
Dense feature detection: GPU.
Keypoint finalization: CPU or GPU, depending on density.
Descriptors: CPU or GPU, depending on density.
Model update (e.g. Bag of Words, Deformable Parts Model, etc.): CPU.
Recognition (dictionary lookup or energy minimization): CPU or GPU, based on available libraries for the chosen algorithm.
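The cascade example suggests a simple partitioning rule: run stages on the GPU while most candidate windows are still alive, and hand over to the CPU once the surviving set becomes sparse. The sketch below is one hypothetical way to express that rule; the function `partitionCascade`, the per-stage surviving fractions, and the density threshold are all illustrative assumptions, not something taken from the talk or from OpenCV.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Device { GPU, CPU };

// Assign each cascade stage to a device based on the fraction of candidate
// windows still alive when that stage starts. Once the fraction drops below
// minGpuDensity, that stage and all later stages are assigned to the CPU.
// The fractions and the threshold are hypothetical inputs.
std::vector<Device> partitionCascade(const std::vector<double>& survivingFraction,
                                     double minGpuDensity) {
    std::vector<Device> plan;
    plan.reserve(survivingFraction.size());
    bool switchedToCpu = false;
    for (double fraction : survivingFraction) {
        assert(fraction >= 0.0 && fraction <= 1.0);
        if (fraction < minGpuDensity) {
            switchedToCpu = true;  // sparse work: stay on the CPU from here on
        }
        plan.push_back(switchedToCpu ? Device::CPU : Device::GPU);
    }
    return plan;
}
```

For example, with surviving fractions {1.0, 0.4, 0.08, 0.01} and a threshold of 0.1, the first two stages are assigned to the GPU and the last two to the CPU. A real scheduler would also have to weigh the cost of moving the surviving candidates between devices, which is where shared memory between CPU and GPU becomes relevant.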

HSA platforms enable flexible CPU/GPU compute
Heterogeneous System Architecture (HSA):
Unified coherent memory enables data sharing across all processors.
Processors are architected to operate cooperatively.
Designed to enable the application to run on different processors at different times.
GPU and CPU have uniform visibility into the entire memory space.
GPU and CPU have equal flexibility to be used to create and dispatch work items.
[Diagram: CPU cores 1…N and GPU compute units (CU) 1…M sharing unified coherent memory, with hUMA and hQ icons linking GPU and CPU.]

HSA platforms enable easy programmability
Shared Virtual Memory (SVM):
Coarse-grained SVM: sharing of complex data structures containing pointers.
Fine-grained buffer SVM: concurrent access from CPU and GPU without map/unmap (platform atomics sync CPU/GPU).
Fine-grained system SVM: use any system memory pointer (malloc, new, stack, etc.).
Platform atomics: allow fine-grained atomics within a kernel; synchronize host and device while the kernel is running (can keep state live on the GPU).
Dynamic parallelism (a.k.a. device enqueue): enqueue child kernels; solve non-gridded problems.
HSA platforms support mainstream programming languages.
[Diagram: software stack with OpenCL™ app, C++ (AMP, HC) app, OpenMP app, and Python app on top; OpenCL runtime and various runtimes; HSA helper libraries; HSAIL; runtime; finalizer; kernel driver; build and execution paths.]

Conclusions
Heterogeneous Compute (HC) is very effective in accelerating computer vision and image processing workloads and offers an excellent alternative to custom hardware.
OpenCV, one of the most popular libraries for vision and image processing, is HC-accelerated.
The Transparent API (T-API), introduced in OpenCV 3.0, makes it even easier to use HC acceleration.
Results show strong acceleration and excellent scaling across multiple platforms with single-source programming and a single binary.
The next generation of HC platforms, based on HSA, opens up even more value for developers to flexibly map workloads across CPU and GPU while still programming in mainstream high-level languages.

Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL is a trademark of Apple Inc. used by permission by Khronos.