Embedded OpenCV Acceleration


1 Embedded OpenCV Acceleration
Dario Pennisi

2 Introduction
Open-source computer vision library
Over 2500 algorithms and functions
Cross-platform, portable API: Windows, Linux, OS X, Android, iOS
Real-time performance
BSD license
Professionally developed and maintained
19/09/2018

3 History
Launched in 1999 by Intel, showcasing the Intel Performance Library
First alpha released in 2000
1.0 released in 2006
Corporate support by Willow Garage from 2008
2.0 released in 2009, with improved C++ interfaces; releases every 6 months since
In 2014 development taken over by Itseez
3.0 now in beta, dropping the C API

4 Application structure
Building blocks to ease vision applications:
Image retrieval → pre-processing → feature extraction → object detection → recognition / reconstruction → analysis → decision making
OpenCV modules: highgui, imgproc, features2d, objdetect, calib3d, video, stitching, ml

5 Environment
Application: C++ / Java / Python
OpenCV: cv::parallel_for_
Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB
OS
Acceleration: SSE/AVX/NEON, OpenCL, CUDA

6 Desktop vs Embedded
                  Desktop              Industrial Embedded
Cores/Threads     8/16                 4/4
Core frequency    >4 GHz               >1.4 GHz
L1 cache          32K+32K              32K+32K
L2 cache          256K per core        2M shared
L3 cache          20M                  -
DDR controllers   4x64-bit 1066 MHz    2x32-bit 800 MHz
TDP               140 W (CPU)          10 W (SoC)
GPU cores         2880                 1+4+16

7 System Engineering
Dimensioning the system is fundamental
Understand your algorithm
Carefully choose your toolbox
Embedded means no chance for "one size fits all"

8 Acceleration Strategies
Optimize algorithms: profile, optimize, partition (CPU/GPU/DSP)
FPGA acceleration: high-level synthesis, custom DSP, RTL coding
Brute force: increase the number of CPUs, increase CPU frequency
Accelerated libraries: NEON, OpenCL/CUDA

9 Bottlenecks
Know your enemy

10 Bottlenecks - Memory
Access to external memory is expensive:
CPU load instructions are slow; memory has latency
Memory bandwidth is shared among CPUs
Caches keep the CPU from accessing external memory, for both data and instructions

11 Bottlenecks - Disordered accesses
What happens on a cache miss?
Fetching data from the same memory row costs ~13 clocks; from a different row, ~23 clocks
A cache line is usually 32 bytes: 8 clocks to fill on a 32-bit data bus
Memory bandwidth efficiency: 38% on the same row, 26% on a different row

12 Bottlenecks - Cache
1920x1080 YCbCr 4:2:2 (Full HD): ~4 MB, double the size of the biggest ARM L2 cache
1280x720 YCbCr 4:2:2 (HD): ~1.8 MB, just fits in L2 cache; OK if reading and writing the same frame
720x576 YCbCr 4:2:2 (SD): ~800 KB, two images fit in L2 cache

13 OpenCV Algorithms
Mostly designed for PCs: well structured, general purpose
Optimized functions for SSE/AVX, relatively optimized
Small number of accelerated functions: NEON, CUDA (nVidia GPU/Tegra), OpenCL (GPUs, multicore processors)

14 Multicore ARM/NEON
NEON SIMD instructions work on vectors of registers, with a load-process-store philosophy
A load/store costs 1 cycle only if the data is in L1 cache; 4-12 cycles in L2; 25 to 35 cycles on an L2 cache miss
SIMD instructions take 1 to 5 clocks
A fast clock is useless on big datasets with little computation per element

15 Generic DSP
Very similar to ARM/NEON
High-speed pipeline impaired by an inefficient memory access subsystem
Where a smart DMA is available, it is very complex to program
A DSP integrated in a SoC shares the ARM's memory bandwidth

16 OpenCL on GPU
OpenCL on Vivante GC2000: claimed capability up to 16 GFLOPS
Real applications: 13.8 GFLOPS computing only on internal registers; 600 MFLOPS on a 1000x1000 matrix
Bandwidth and inefficiencies: only 1K of local memory and a 64-byte memory cache

17 OpenCL on FPGA
The same code can run on FPGA and GPU
Transforms selected functions into hardware, with automated memory access coalescing
Each function requires dedicated logic, so large FPGAs are required; partial reconfiguration may solve this
Significant compilation time

18 HLS on FPGA
High-Level Synthesis converts C to hardware
HLS requires the code to be heavily modified: pragmas to instruct the compiler, code restructuring; the code is no longer portable
Each function requires dedicated logic, so large FPGAs are required; partial reconfiguration may solve this
Significant compilation time

19 A different approach
Demanding algorithms on low-cost/low-power hardware
Algorithm analysis maps each part to the right resource:
Memory access pattern → DMA
Data-intensive processing → custom instruction (RTL), DSP, NEON
Decision making → ARM program

20 External co-processing
ARM GPU Memory FPGA PCIe ARM Memory FPGA 19/09/2018

21 Co-processor details
FPGA co-processor with separate memory: adds bandwidth, reduces access conflicts
Algorithm-aware DMA: accesses memory in an ordered way, adds caching through embedded RAM
Algorithm-specific processors: HLS/OpenCL-synthesized IP blocks, DSPs with custom instructions, hardcoded IP blocks
[Diagram: ARM and memory feeding DMA → block capture → DPRAM(s) → DSP core(s)/IP blocks]

22 Co-processor details
Flex DMA: dedicated processor with a DMA custom instruction; software-defined memory access pattern
Block capture: extracts the data for each tile
DPRAM: local, high-speed cache
DSP core: dedicated processor with algorithm-specific custom instructions
[Diagram: ARM and memory feeding parallel Flex DMA → block capture → DPRAM → DSP core/IP block chains]

23 Environment
Application: C++ / Java / Python
OpenCV: cv::parallel_for_
Threading APIs: Concurrency, CStripes, GCD, OpenMP, TBB
OS
Acceleration: SSE/AVX/NEON, OpenCL, CUDA, FPGA, OpenVX

24 OpenVX

25 OpenVX Graph Manager
Graph construction: allocates resources; a logical representation of the algorithm
Graph execution: concatenates nodes, avoiding intermediate memory storage
Tiling extensions: a single node's execution can be split into multiple tiles; multiple accelerators can execute a single task in parallel
[Diagram: Memory → Node1 → Memory → Node2 → Memory, vs the fused Memory → Node1 → Node2 → Memory]

26 Summary
What we learnt:
OpenCV today is mainly PC-oriented; ARM, CUDA, and OpenCL support is growing
Existing acceleration covers only selected functions
Embedded CV requires good partitioning among resources
When ASSPs are not enough, FPGAs are key
OpenVX provides a consistent HW acceleration platform, not only for OpenCV

27 Questions

28 Thank you
