1 Accelerating Computer Vision and Image Processing via Heterogeneous Compute… Transparently!
Harris Gasparakis, Ph.D. Raghunath Rao, Ph.D.

2 Vision and image processing are computationally demanding!
Application areas: automatic inspection, medical image analysis, autonomous navigation, human-machine interfaces, augmented reality, robotics, security/surveillance, data analytics and organization… and more.
The workload: millions of pixels per image, 100s of calculations per pixel, 100s–1000s of image frames per second, and complex, constantly evolving algorithms.
Hungry for PERFORMANCE, but it needs to be PROGRAMMABLE, and within POWER & COST budgets.

3 The traditional solution
CPU for system control, IO, and UI; hardware offload for compute-intensive processing (DSP / FPGA / ASIC). Each choice trades off performance, programmability, power, and cost differently. [Diagram: a CPU with system memory driving DSP/FPGA/ASIC offload engines, each with its own device memory.]

4 Evolution of heterogeneous compute
1. GPU compute (dGPU): brought 100s–1000s of GFLOPS of data-parallel performance, but with data-copy overheads, kernel-launch constraints, and a need for expert programming. [Diagram: CPU and main memory connected over PCIe® to a dGPU with its own device memory, staging copies through pinned PCIe memory.]
2. APU with iGPU: an easier-to-use SoC; unified memory eliminates some copies. [Diagram: CPU and iGPU sharing physical memory, with device-visible host memory and host-visible device memory regions.]
3. Heterogeneous System Architecture (HSA) APU: true heterogeneous compute across CPU and GPU. Share pointers freely, move work freely between CPU and GPU, and use standard programming languages. [Diagram: CPU and iGPU with caches sharing unified (bidirectionally coherent, pageable) virtual memory.]

5 Open-source Computer Vision Library
Core team: Itseez. Contributors: Willow Garage, Intel, Nvidia, Google, MultiCoreWare, ISCAS.
Timeline (2000–present): first public release (2000); v2.0 with the C++ API (2009); moved to GitHub (2012); v2.4.3 with OpenCL™ (2013); v3.0 with the T-API (present).
~3K algorithms/functions/samples, BSD license, ~8M downloads.

6 Open-source Computer Vision Library
Building blocks: filters, transformations, edges, segmentation, detection/recognition, robust features, optical flow, depth, calibration, background subtraction.
End applications: e.g. Street View (Google).
Images courtesy of Itseez.

7 OpenCV supports multiple platforms
In OpenCV 2.4.x, the CPU and OpenCL™ code paths are similar yet distinct. The CPU path:

    // initialization
    VideoCapture vcap(...);
    CascadeClassifier fd("haar_ff.xml");
    Mat frame, frameGray;
    vector<Rect> faces;
    for(;;){
        vcap >> frame;
        cvtColor(frame, frameGray, BGR2GRAY);
        equalizeHist(frameGray, frameGray);
        fd.detectMultiScale(frameGray, faces);
    }

The OpenCL™ path duplicates it with different types and namespaces:

    // initialization
    VideoCapture vcap(...);
    ocl::OclCascadeClassifier fd("haar_ff.xml");
    ocl::oclMat frame, frameGray;
    vector<Rect> faces;
    for(;;){
        vcap >> frame;
        ocl::cvtColor(frame, frameGray, BGR2GRAY);
        ocl::equalizeHist(frameGray, frameGray);
        fd.detectMultiScale(frameGray, faces);
    }

The CUDA code path is similar (not shown). Maintaining these parallel code paths is a pain for developers.
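A minimal sketch (all names hypothetical, standing in for the real OpenCV calls) of the per-call dispatch boilerplate a 2.4.x developer ends up writing to support both paths from one application:

```cpp
#include <string>

// Hypothetical stand-ins for the two 2.4.x back ends; in real code these
// would be cv::cvtColor(...) vs cv::ocl::cvtColor(...) on different types.
static std::string cvtColorCpu() { return "cpu:gray"; }
static std::string cvtColorOcl() { return "ocl:gray"; }

// Without a transparent API, the application must branch at every call
// site (or wrap every function like this) because the CPU and OpenCL
// paths use different types and namespaces.
std::string convertFrame(bool useOpenCL) {
    return useOpenCL ? cvtColorOcl() : cvtColorCpu();
}
```

Multiply this wrapper by every function in the pipeline and every supported back end, and the maintenance burden the slide alludes to becomes clear.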

8 Introducing the Transparent API (T-API)
With the T-API in OpenCV 3.0, replacing Mat with UMat is enough to get acceleration:

    // initialization
    VideoCapture vcap(...);
    CascadeClassifier fd("haar_ff.xml");
    UMat frame, frameGray;
    vector<Rect> faces;
    for(;;){
        vcap >> frame;
        cvtColor(frame, frameGray, BGR2GRAY);
        equalizeHist(frameGray, frameGray);
        fd.detectMultiScale(frameGray, faces);
    }

One source covers the C++, IPP, and OpenCL™ back ends (the CUDA path is not covered here), and a single binary runs everywhere, dispatching to the best available implementation.

9 T-API: under the hood
Easy transition path from 2.x to 3.x: code that used to work in 2.x should still work, so cv::Mat is still around. Both Mat and UMat are views into UMatData, which does the heavy lifting: it holds reference counts, dirty bits, and opaque handles (e.g. a clBuffer) for the CPU and GPU copies of the data, and synchronizes them efficiently. Convert between the views with getMat(…) and getUMat(…).
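Conceptually, the UMatData bookkeeping amounts to dirty bits guarding two physical copies of one logical buffer. A self-contained toy model of that lazy-synchronization idea (hypothetical names, no real OpenCV types; the real implementation is considerably more involved):

```cpp
#include <cstddef>
#include <vector>

// Toy model of UMatData-style bookkeeping: one logical buffer,
// two physical copies, dirty bits tracking which copy is current.
struct SharedBuffer {
    std::vector<int> hostCopy;    // stands in for the CPU data
    std::vector<int> deviceCopy;  // stands in for the opaque GPU handle
    bool hostDirty = false;       // host copy has newer data
    bool deviceDirty = false;     // device copy has newer data
    int copies = 0;               // how many syncs actually happened

    explicit SharedBuffer(std::size_t n) : hostCopy(n), deviceCopy(n) {}

    // getMat()-like view: make the host copy current before handing it out.
    std::vector<int>& mat() {
        if (deviceDirty) { hostCopy = deviceCopy; deviceDirty = false; ++copies; }
        hostDirty = true;  // caller may write through this view
        return hostCopy;
    }

    // getUMat()-like view: make the device copy current.
    std::vector<int>& umat() {
        if (hostDirty) { deviceCopy = hostCopy; hostDirty = false; ++copies; }
        deviceDirty = true;
        return deviceCopy;
    }
};
```

The point of the design: repeated accesses from the same side cost nothing; a copy happens only when the *other* side actually touches data that is stale, which is what lets Mat and UMat coexist efficiently.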

10 T-API: how does it work?
The stack, top to bottom:
Application
Language bindings: C++, C, Python, Java
OS: Windows®, Linux, Mac OS X, iOS, Android, WinRT
Multi-threading APIs: Windows® Concurrency, TBB, GCD
T-API implementations: C++, IPP, OpenCL™, NEON, CUDA
Hardware: CPU (C++/IPP); CPU, GPU, … (OpenCL™); ARM® (NEON); NVIDIA dGPU (CUDA)

11 Experimental results
Runs under comparison | Uplift
RX-427B OpenCL™ iGPU / RX-427B C++ CPU | 5.6x
E8860 OpenCL™ dGPU / RX-427B C++ CPU | 15.5x
R9-290X OpenCL™ dGPU / RX-427B C++ CPU | 65.5x
T-API enables comparison of various execution paths (C++, IPP, OpenCL™). Performance transparently scales based on platform capabilities. Benchmark: imgproc module performance test suite, image resolution 3840x2160.

12 Vision algorithms are different… and complex
CPUs and GPUs are complementary compute cores. GPUs do well with parallelizable algorithms, large data sets, dense data access (cache locality), and high arithmetic complexity (compute-to-memory ratio, occupancy). CPUs do well on serial algorithms, single-thread execution, branchy code, and irregular or memory-intensive operations. Vision algorithms are complex and benefit from flexible partitioning across both CPU and GPU.
Example 1: machine learning (Viola-Jones face detection). In the AdaBoost cascade, each stage increases sparsity, reduces data locality, and reduces occupancy: ideally run the first stages on the GPU and finish on the CPU.
Example 2: object recognition. Dense feature detection: GPU. Keypoint finalization: CPU or GPU depending on density. Descriptors: CPU or GPU depending on density. Model update (e.g. bag of words, deformable parts model): CPU. Recognition (dictionary lookup or energy minimization): CPU or GPU based on available libraries for the chosen algorithm.
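The "compute-to-memory ratio" criterion can be made concrete with a back-of-the-envelope arithmetic-intensity estimate for a K×K dense filter. The numbers below are illustrative assumptions (float pixels, ideal cache reuse), not figures from the slides:

```cpp
// Arithmetic intensity ~ ops per byte moved. For a KxK filter on float
// pixels, each output pixel needs K*K multiply-adds (2*K*K ops) but,
// with good cache locality, roughly one 4-byte load and one 4-byte
// store per pixel.
double filterIntensity(int k) {
    double ops = 2.0 * k * k;   // multiply-adds per output pixel
    double bytes = 4.0 + 4.0;   // one float in, one float out (ideal reuse)
    return ops / bytes;
}
```

A 3x3 filter gives 18/8 = 2.25 ops/byte while a 9x9 gives 20.25, which is why large dense filters map so well to GPUs, whereas sparse, branchy late-cascade stages do not.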

13 HSA platforms enable flexible CPU/GPU compute
Heterogeneous System Architecture (HSA): unified coherent memory (hUMA) enables data sharing across all processors, and the processors are architected to operate cooperatively, so an application can run on different processors at different times. GPU and CPU have uniform visibility into the entire memory space, and equal flexibility to create and dispatch work items (hQ). [Diagram: CPU cores and GPU compute units sharing unified coherent memory; either side can enqueue work for the other.]

14 HSA platforms enable easy programmability
Shared Virtual Memory (SVM):
Coarse-grained SVM — sharing of complex data structures containing pointers.
Fine-grained buffer SVM — concurrent access from CPU & GPU without map/unmap (platform atomics synchronize CPU/GPU).
Fine-grained system SVM — use any system memory pointer (malloc, new, stack, etc.).
Platform atomics — allow fine-grained atomics within a kernel and synchronize host/device while a kernel is running (state can be kept live on the GPU).
Dynamic parallelism (a.k.a. device enqueue) — enqueue child kernels and solve non-gridded problems.
HSA platforms support mainstream programming languages: OpenCL™, C++ (AMP, HC), OpenMP, and Python applications run over their respective runtimes and HSA helper libraries, which build and execute HSAIL kernels through the HSAIL runtime, finalizer, and kernel driver.
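The "synchronize host/device while a kernel is running" pattern can be illustrated with ordinary C++ atomics standing in for platform atomics over fine-grained SVM. Here two host threads play the roles of the CPU and a persistent GPU kernel; all names are hypothetical and this is only an analogy to the HSA mechanism, not real device code:

```cpp
#include <atomic>
#include <thread>

// Shared "SVM" state: a flag the host flips and a result the persistent
// kernel publishes, both visible to each side without map/unmap, as
// fine-grained SVM with platform atomics allows.
std::atomic<int> work{0};
std::atomic<int> result{0};

void persistentKernel() {            // stands in for a long-running GPU kernel
    int w;
    while ((w = work.load(std::memory_order_acquire)) != -1) {
        if (w > 0) {
            result.store(w * w, std::memory_order_release);  // "process" item
            work.store(0, std::memory_order_release);        // signal done
        }
    }
}

int runHost() {
    work.store(0, std::memory_order_release);            // reset shared state
    std::thread gpu(persistentKernel);
    work.store(6, std::memory_order_release);            // host enqueues work
    while (work.load(std::memory_order_acquire) != 0) {} // host waits on flag
    int r = result.load(std::memory_order_acquire);
    work.store(-1, std::memory_order_release);           // tell kernel to exit
    gpu.join();
    return r;
}
```

The key property mirrored here is that the kernel never terminates between work items: its state stays "live" while the host feeds it new work through atomically updated shared memory.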

15 Conclusions
Heterogeneous compute (HC) is very effective at accelerating computer vision and image processing workloads and offers an excellent alternative to custom hardware. OpenCV, one of the most popular libraries for vision and image processing, is HC-accelerated. The Transparent API (T-API) introduced in OpenCV 3.0 makes HC acceleration even easier to use. Results show strong acceleration and excellent scaling across multiple platforms with single-source programming and a single binary. The next generation of HC platforms based on HSA opens up even more value for developers to flexibly map workloads across CPU and GPU while still programming in mainstream high-level languages.

16 Disclaimer & Attribution
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners. OpenCL is a trademark of Apple Inc. used by permission by Khronos.


