Embedded OpenCV Acceleration


Embedded OpenCV Acceleration Dario Pennisi

Introduction
- Open-source computer vision library
- Over 2500 algorithms and functions
- Cross-platform, portable API: Windows, Linux, OS X, Android, iOS
- Real-time performance
- BSD license
- Professionally developed and maintained
19/09/2018

History
- Launched in 1999 by Intel, showcasing the Intel Performance Library
- First alpha released in 2000
- Version 1.0 released in 2006
- Corporate support by Willow Garage starting in 2008
- Version 2.0 released in 2009, with improved C++ interfaces
- Releases every 6 months
- In 2014 development taken over by Itseez
- 3.0 now in beta; drops C API support

Application structure
Building blocks to ease vision applications:
- Pipeline stages: image retrieval, pre-processing, feature extraction, object detection, recognition/reconstruction/analysis, decision making
- OpenCV modules: highgui, imgproc, features2d, objdetect, calib3d, video, stitching, ml

Environment
Software stack, top to bottom:
- Application: C++, Java, Python
- OpenCV: cv::parallel_for_
- Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB
- OS
- Acceleration: SSE/AVX/NEON, OpenCL, CUDA
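cv::parallel_for_ hands contiguous row ranges of an image to whichever threading backend OpenCV was built with. A rough Python analogue of that row-splitting pattern, using only the standard library (the "image" here is a plain list of rows, purely for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_for(n_rows, body, n_threads=4):
    """Split [0, n_rows) into contiguous stripes and run `body` on each
    stripe in a worker thread, like cv::parallel_for_ does with ranges."""
    stripe = (n_rows + n_threads - 1) // n_threads
    ranges = [(i, min(i + stripe, n_rows)) for i in range(0, n_rows, stripe)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(lambda r: body(*r), ranges))

# Example: threshold each row of a toy 8-row "image" in place.
image = [[10, 200, 30] for _ in range(8)]

def threshold_rows(r0, r1):
    for r in range(r0, r1):
        image[r] = [255 if px > 128 else 0 for px in image[r]]

parallel_for(len(image), threshold_rows)
# every row is now [0, 255, 0]
```

Each stripe touches a disjoint set of rows, so no locking is needed; this is the same reason row-range splitting is the natural parallelisation unit for image kernels.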

Desktop vs Embedded
Desktop vs industrial embedded, feature by feature:
- Cores/threads: 8/16 vs 4/4
- Core frequency: >4 GHz vs >1.4 GHz
- L1 cache: 32K+32K
- L2 cache: 256K per core vs 2M shared
- L3 cache: 20M vs none
- DDR controllers: 4x64-bit DDR4 @ 1066 MHz vs 2x32-bit DDR3 @ 800 MHz
- TDP: 140 W (CPU) vs 10 W (SoC)
- GPU cores: 2880 vs 1+4+16

System Engineering
- Dimensioning the system is fundamental
- Understand your algorithm
- Carefully choose your toolbox
- Embedded means no chance for "one size fits all"

Acceleration Strategies
- Optimize algorithms: profile, optimize, partition (CPU/GPU/DSP)
- FPGA acceleration: high-level synthesis, custom DSP, RTL coding
- Brute force: increase the number of CPUs, increase CPU frequency
- Accelerated libraries: NEON, OpenCL/CUDA

Bottlenecks
Know your enemy.

Bottlenecks - Memory
Access to external memory is expensive:
- CPU load instructions are slow; memory has latency
- Memory bandwidth is shared among CPUs
- Caches (data and instruction) keep the CPU from having to access external memory

Disordered accesses
What happens on a cache miss?
- Fetching data from the same memory row costs 13 clocks; from a different row, 23 clocks
- A cache line is usually 32 bytes: 8 clocks to fill a line over a 32-bit data bus
- Resulting memory bandwidth efficiency: 38% on the same row, 26% on a different row
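The efficiency figures follow directly from the clock counts above: each 32-byte line takes 8 clocks of useful transfer, and the miss latency is pure overhead. A quick check using the slide's numbers:

```python
def bandwidth_efficiency(miss_latency_clocks, transfer_clocks=8):
    """Fraction of memory clocks doing useful transfer after a cache miss:
    32-byte line over a 32-bit bus = 8 transfer clocks."""
    return transfer_clocks / (miss_latency_clocks + transfer_clocks)

same_row = bandwidth_efficiency(13)    # 8/21, about 38%
other_row = bandwidth_efficiency(23)   # 8/31, about 26%
print(f"same row: {same_row:.0%}, different row: {other_row:.0%}")
```

Ordered, sequential access keeps hitting the open row and the larger number; scattered access pays the row-change penalty on every miss.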

Bottlenecks - Cache
- 1920x1080 YCbCr 4:2:2 (Full HD): 4 MB, double the size of the biggest ARM L2 cache
- 1280x720 YCbCr 4:2:2 (HD): 1.8 MB, just fits in L2 cache... OK if reading and writing the same frame
- 720x576 YCbCr 4:2:2 (SD): 800 KB, two images fit in L2 cache
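These sizes come straight from YCbCr 4:2:2 storing 2 bytes per pixel; multiplying out the resolutions reproduces the slide's figures:

```python
def frame_bytes(width, height, bytes_per_pixel=2):
    """Frame buffer size; YCbCr 4:2:2 packs 2 bytes per pixel."""
    return width * height * bytes_per_pixel

for name, (w, h) in {"Full HD": (1920, 1080),
                     "HD": (1280, 720),
                     "SD": (720, 576)}.items():
    print(f"{name}: {frame_bytes(w, h) / 2**20:.1f} MiB")
```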

OpenCV Algorithms
- Mostly designed for PCs: well structured, general purpose, with optimized functions for SSE/AVX
- Only relatively optimized elsewhere: a small number of accelerated functions for NEON, CUDA (NVIDIA GPU/Tegra), and OpenCL (GPUs, multicore processors)

Multicore ARM/NEON
- NEON SIMD instructions work on vectors of registers
- Load-process-store philosophy
- Load/store costs 1 cycle only if in L1 cache; 4-12 cycles if in L2; 25 to 35 cycles on an L2 cache miss
- SIMD instructions take from 1 to 5 clocks
- A fast clock is useless on big datasets with little computation per element
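The last point can be made concrete with a toy model built from the cycle counts above (an illustrative assumption, not a measurement): when streaming through a dataset too big for the caches, every 16-byte vector load misses L2, so the ALU sits idle most of the time no matter how fast it is clocked.

```python
def compute_fraction(op_cycles, load_cycles):
    """Fraction of time spent computing when every vector load misses L2."""
    return op_cycles / (op_cycles + load_cycles)

# Streaming a large array: each 16-byte NEON load misses L2 (~30 cycles),
# while the SIMD instruction on that data takes ~3 cycles.
busy = compute_fraction(op_cycles=3, load_cycles=30)
print(f"ALU busy {busy:.0%} of the time")
```

Under this model the pipeline computes less than a tenth of the time, which is why reordering memory accesses buys more than raising the clock.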

Generic DSP
- Very similar to ARM/NEON
- High-speed pipeline impaired by an inefficient memory access subsystem
- When smart DMA is available, it is very complex to program
- When the DSP is integrated in a SoC, it shares the ARM's memory bandwidth

OpenCL on GPU
OpenCL on Vivante GC2000:
- Claimed capability: up to 16 GFLOPS
- Real applications: 13.8 GFLOPS only when working on internal registers; computing a 1000x1000 matrix, 600 MFLOPS
- Bandwidth and inefficiencies: only 1K of local memory and a 64-byte memory cache
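A quick division makes the gap explicit: on a memory-bound kernel the GC2000 delivers only a few percent of its register-only throughput.

```python
claimed = 16e9    # claimed peak, FLOP/s
registers = 13.8e9  # achievable on internal registers only
matrix = 600e6    # measured on a 1000x1000 matrix workload

print(f"vs claimed peak:  {matrix / claimed:.1%}")    # under 4%
print(f"vs register-only: {matrix / registers:.1%}")  # about 4.3%
```

The shortfall is not arithmetic capability but the 1K local memory and tiny cache: the ALUs starve waiting for external memory.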

OpenCL on FPGA
- The same code can run on FPGA and GPU
- Transforms selected functions into hardware
- Automated memory access coalescing
- Each function requires dedicated logic: large FPGAs required; partial reconfiguration may solve this
- Significant compilation time

HLS on FPGA
High-Level Synthesis converts C to hardware.
- HLS requires the code to be heavily modified: pragmas to instruct the compiler, code restructuring; the code is no longer portable
- Each function requires dedicated logic: large FPGAs required; partial reconfiguration may solve this
- Significant compilation time

A different approach
Demanding algorithms on low-cost/low-power hardware. Analyze the algorithm and map each part to the right resource:
- Memory access pattern: DMA
- Data-intensive processing: DSP, NEON, custom instructions (RTL)
- Decision making: ARM program

External co-processing
(Diagram: ARM and GPU with memory, connected to an FPGA over PCIe; alternatively, ARM with memory and a directly attached FPGA.)

Co-processor details
FPGA co-processor:
- Separate memory: adds bandwidth, reduces access conflicts
- Algorithm-aware DMA: accesses memory in an ordered way, adds caching through embedded RAM
- Algorithm-specific processors: HLS/OpenCL-synthesized IP blocks, DSPs with custom instructions, hardcoded IP blocks
(Diagram: ARM and memory feeding a DMA processor, block capture, DPRAM, and DSP core/IP block.)

Co-processor details
- Flex DMA: dedicated processor with DMA custom instructions; software-defined memory access pattern
- Block capture: extracts the data for each tile
- DPRAM: local, high-speed cache
- DSP core: dedicated processor with algorithm-specific custom instructions
(Diagram: ARM and memory feeding multiple Flex DMA / block capture / DPRAM / DSP core pipelines in parallel.)
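The "software-defined memory access pattern" amounts to walking the frame in tiles so the DSP always works on a small, ordered set held in DPRAM. A hypothetical sketch of such a tile walk (the function and sizes are illustrative, not from the slides):

```python
def tiles(width, height, tile_w, tile_h):
    """Yield (x, y, w, h) tiles covering a width x height frame in raster
    order - the access pattern a Flex DMA would be programmed to fetch,
    one tile at a time, into on-chip DPRAM for the DSP core."""
    for y in range(0, height, tile_h):
        for x in range(0, width, tile_w):
            yield (x, y, min(tile_w, width - x), min(tile_h, height - y))

# A 720x576 SD frame cut into 64x64 tiles: each full tile is
# 64 * 64 * 2 = 8 KB of YCbCr 4:2:2 data, easily held on chip.
tile_list = list(tiles(720, 576, 64, 64))
print(len(tile_list))  # 12 columns x 9 rows = 108 tiles
```

Because every tile is read exactly once and in order, external memory sees long sequential bursts instead of the scattered accesses that wrecked bandwidth efficiency earlier.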

Environment
The same stack, with FPGA added as an acceleration target and OpenVX as the acceleration API:
- Application: C++, Java, Python
- OpenCV: cv::parallel_for_
- Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB
- OS
- Acceleration: SSE/AVX/NEON, OpenCL, CUDA, FPGA, OpenVX

OpenVX

OpenVX Graph Manager
- Graph construction: allocates resources; builds a logical representation of the algorithm
- Graph execution: concatenates nodes, avoiding intermediate memory storage
- Tiling extensions: a single node's execution can be split into multiple tiles, with multiple accelerators executing a single task in parallel
(Diagram: Node1 → Memory → Node2 without fusion; Node1 → Node2 directly when fused.)
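The benefit of concatenating nodes can be quantified with a simple traffic model: without fusion, Node1 writes a full intermediate frame to memory and Node2 reads it back; fused, that round trip disappears. An illustrative calculation for a Full HD YCbCr 4:2:2 frame:

```python
frame = 1920 * 1080 * 2  # bytes, YCbCr 4:2:2

# Node1 -> Memory -> Node2: input read + intermediate write
#                         + intermediate read + output write.
unfused = 4 * frame
# Node1 -> Node2 fused: input read + output write only.
fused = 2 * frame

print(f"traffic saved: {1 - fused / unfused:.0%}")  # 50%
```

With longer chains the saving grows: every fused interior node removes one full-frame write and one full-frame read from the shared memory bus.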

Summary - what we learnt
- OpenCV today is mainly PC-oriented; ARM, CUDA, and OpenCL support is growing
- Existing acceleration covers only selected functions
- Embedded CV requires good partitioning among resources
- When ASSPs are not enough, FPGAs are key
- OpenVX provides a consistent HW acceleration platform, not only for OpenCV

Questions

Thank you