How to Accelerate OpenCV Applications with the Zynq-7000 All Programmable SoC using Vivado HLS Video Libraries August 28, 2013.

OpenCV Overview
Open Source Computer Vision (OpenCV) is widely used to develop computer vision applications. It is a library of video functions, optimized for desktop processors and GPUs, with tens of thousands of users, and it runs out of the box on the ARM processors in Zynq. However, HD processing with OpenCV is often limited by external memory: memory bandwidth is a bottleneck for performance, and memory accesses limit power efficiency. Zynq All Programmable SoCs are a great way of implementing embedded computer vision applications with high performance and low power.

Quick overview of OpenCV: OpenCV is an open source computer vision library of video functions for high-level programming languages such as C, C++, Python and Java. The library contains a large and growing number of functions and has a user base of tens of thousands. The functions are optimized for running on a CPU or GPU, and they also run on ARM processors. Because OpenCV functions are written for desktop processors with seemingly unlimited memory resources, they rely heavily on read and write accesses to memory. This type of architecture is not ideal for embedded applications and can be power hungry. All Programmable SoCs are a good fit for implementing OpenCV functions: partitioning the design into software and hardware domains gives an optimal solution in terms of performance and power.

Real-Time Computer Vision Applications
Real-time analytics functions:
Advanced Driver Assist for safety: lane or pedestrian detection.
Surveillance for security: friend vs. foe recognition.
Machine vision for quality: high-velocity object detection.
Medical imaging for non-invasive surgery: tumor detection.

Here are some computer vision applications. The main domains of computer vision involve processing images in real time for functions such as object detection, object tracking or segmentation, and making informed decisions to improve safety and quality of life. For example, in driver assist applications, computer vision is used to detect lanes and pedestrians and to warn the driver about drifting. In surveillance, it is used to detect suspicious activity. In machine vision, it is used to detect defects on fast-moving production lines. In medical imaging, it is used to detect tumors and to guide non-invasive surgery that treats them. These are just a few examples of computer vision applications.

Real-time Video Analytics Processing
Figure: pixel-based image processing and feature extraction at 480p, 720p, 1080p and 4Kx2K (100s of ops/pixel; 8 MP/frame x 100s of ops/pixel = 100s of Gops at video frame rates), followed by frame-based feature processing and decision making on features F1, F2, F3, ... (10,000s of ops/feature x 1000s of features/sec = Mops).

Let's see what's involved in a computer vision application. A typical computer vision application involves processing an image, extracting relevant features and making a decision. The image processing on the left side is data intensive: the higher the image resolution, the higher the bandwidth and performance requirements. As you see here, image processing in HD and Ultra HD requires hundreds of operations per pixel on hundreds of millions of pixels per second, resulting in hundreds of Gops. A CPU simply cannot do this; it requires a pipelined, dedicated hardware architecture. The decisions made from the extracted features are also compute intensive, but they involve only a few thousand features per second, resulting in millions of operations per second: entirely in the realm of a CPU.
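A back-of-the-envelope check of the gap between the two stages can be written as a few lines of arithmetic. The values are assumptions chosen for illustration, not figures from the slide: 4Kx2K taken as 3840x2160 at 60 fps, 200 operations per pixel (one point in the "100s of ops/pixel" range), and 5,000 features per second at 10,000 operations per feature.

```cpp
#include <cassert>
#include <cstdint>

// Pixel-level stage: width * height * fps * ops_per_pixel operations per second.
constexpr std::uint64_t pixel_ops_per_second(std::uint64_t width, std::uint64_t height,
                                             std::uint64_t fps, std::uint64_t ops_per_pixel) {
    return width * height * fps * ops_per_pixel;
}

// Frame-level stage: features_per_second * ops_per_feature operations per second.
constexpr std::uint64_t feature_ops_per_second(std::uint64_t features_per_second,
                                               std::uint64_t ops_per_feature) {
    return features_per_second * ops_per_feature;
}

// Assumed example values (illustrative only).
constexpr std::uint64_t kPixelOps = pixel_ops_per_second(3840, 2160, 60, 200);  // ~99.5 Gops
constexpr std::uint64_t kFeatureOps = feature_ops_per_second(5000, 10000);       // 50 Mops
static_assert(kPixelOps / kFeatureOps > 1000, "pixel stage dominates by more than 1000x");
```

Under these assumptions the pixel stage needs roughly 100 Gops while the feature stage needs about 50 Mops, a ratio of about 2000, which is the reason for splitting the two stages between different kinds of hardware.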

Heterogeneous Implementation of Real-time Video Analytics
Figure: the same two stages mapped to a hardware domain (FPGA) for pixel-based image processing and feature extraction (480p to 4Kx2K, 100s of Gops) and a software domain (ARM) for frame-based feature processing and decision making (features F1, F2, F3, ..., Mops).

Hence, a natural implementation of many computer vision applications combines programmable logic, which implements pixel processing and feature extraction, with an embedded processor, which implements feature processing and decision making.

Xilinx Real-time Image Analytics Implementation: Zynq All Programmable SoC
Figure: the pixel-based image processing and feature extraction stage (480p to 4Kx2K, 100s of Gops) and the frame-based feature processing and decision making stage (features F1, F2, F3, ..., Mops), both mapped onto a single Zynq device.

Zynq is the first device in the FPGA industry to offer a processor-centric architecture with a programmable logic fabric. Zynq contains a dual-core ARM Cortex-A9 processor running at up to 1 GHz, tightly integrated with programmable logic; the AXI interfaces between them offer much more bandwidth than a two-chip solution, at several orders of magnitude lower power. The high performance of the programmable logic is ideal for image processing operations: it can implement pipelined, parallel hardware accelerators and pass the extracted features on to the ARM processor, which can easily implement frame-based processing in software.

Vivado: Productivity gains for OpenCV functions
Figure: C simulation of an HD video algorithm runs at ~1 fps; RTL simulation of HD video runs at 1 frame per hour; a real-time FPGA implementation runs at up to 60 fps.

To take advantage of the powerful Zynq platform, Xilinx's Vivado Design Suite allows design teams to work at higher levels of abstraction. Available as part of Vivado Design Suite System Edition, Vivado High-Level Synthesis (HLS) allows users to code in C, C++ and SystemC, enabling algorithmic, data type and interface abstraction at the level of a C-based design specification. Vivado HLS also enables development and acceleration of real-time smart vision algorithms on Xilinx Zynq®-7000 All Programmable SoC devices by providing video functions integrated into an OpenCV environment for computer vision running on the dual-core ARM processing system. With Vivado HLS, users can see significant productivity gains for OpenCV applications. First, users can simulate the HLS equivalent of an OpenCV function in C, reducing simulation time significantly, from hours to seconds. Second, using the Vivado HLS video library functions, users can rapidly get real-time execution, at up to 60 fps, of the basic pixel processing functions for analytics by leveraging the high-performance FPGA fabric.

Accelerating OpenCV Applications
Figure: target applications (driver assist, broadcast monitor, HD surveillance, cinema projection, video conferencing, digital signage, studio cinema camera, consumer displays, office-class MFP, machine vision, medical displays), built on a frame-level processing library for the PS and, through Vivado HLS, pixel processing interfaces and basic functions for analytics.

Before I hand over the webcast to Stephen, I want to summarize the benefits of implementing OpenCV applications on the Zynq All Programmable platform. Many computer vision applications can take advantage of Xilinx's HLS tool flow to accelerate OpenCV implementations on Zynq. When targeting a Zynq All Programmable SoC, design teams can now more rapidly develop C/C++ code for the dual-core ARM processing system, while compute-intensive functions are automatically accelerated in the high-performance FPGA fabric. The resulting implementation enables up to a 100X performance improvement of existing C/C++ algorithms through hardware acceleration. At the same time, Vivado HLS accelerates system verification and implementation times by up to 100X compared to RTL design entry flows.

Zynq Video TRD architecture
Figure: block diagram of the Zynq video TRD. The processing system (dual-core Cortex-A9, DDR memory controller, hardened peripherals including the SD card) connects to DDR3 external memory. In the programmable logic, the HDMI video input, an AXI VDMA, the HLS-generated pipeline (an AXI4-Stream IP core) and the Xylon display controller (HDMI output) connect through an AXI interconnect to the PS ports S_AXI_HP (64-bit) and S_AXI_GP (32-bit).
Video access to external memory uses the 64-bit High Performance ports.
Control register access uses the 32-bit General Purpose ports.
Video streams are implemented using AXI4-Stream.

IP Centric Design Flow: Accelerated IP Generation and Integration
Figure: C-based IP creation (C, C++ or SystemC, plus C libraries: floating-point math.h, fixed point, video) generates VHDL or Verilog plus software drivers, which feed the user-preferred system integration environment (System Generator for DSP, Vivado IP Integrator or Vivado RTL integration) and form an IP subsystem together with Xilinx IP, third-party IP and user IP.

Speaker notes (Vivado HLS: Accelerate C Libraries and System IP Integration): Today we support generation of VHDL or Verilog from user C, C++ and SystemC. We are adding more users by providing access to application-specific C libraries. We support math.h (single- and double-precision floating point) today. We have 31 basic OpenCV video functions going into production, and we will continue to add more functions. We are working on enabling our ecosystem and partners to provide OpenCV video functions for Xilinx programmable logic by leveraging HLS. Next, we will begin offering DSP function libraries (filters, FFTs and DDS) in the second half of this year, and we will continue to drive market-specific libraries.
Vivado HLS creates IP that customers can use in their preferred Xilinx system IP integration environment:
Vivado HLS creates IP with interfaces that users can integrate into IP systems within IP Integrator.
Vivado HLS creates IP that can be added as a block inside System Generator for the Vivado implementation flow. This block can be used seamlessly within the System Generator design environment, since it participates in data type and rate propagation and HDL netlist generation.
Vivado HLS packages IP as an IP-XACT based package that can be added to the Vivado IP catalog. This IP can then be used in an RTL design or in Vivado IP Integrator.
Vivado HLS enables users to target 7 series FPGAs or the Zynq SoC (when available) in the user-preferred Xilinx design environment with the Vivado implementation flow.
Key takeaway: IP created by Vivado HLS, along with the C libraries, can target 7 series FPGAs or the Zynq SoC in the user-preferred Xilinx design environment (RTL, IP Integrator and System Generator).
In case the audience asks: both Vivado High-Level Synthesis and System Generator for DSP are available standalone or as part of the ISE® Design Suite DSP and System Editions and the Vivado Design Suite System Edition. Vivado HLS device and implementation flow support with a System Edition or standalone license is as follows: with an ISE Design Suite (IDS) DSP or System Edition license, Vivado HLS supports 7 series and Zynq devices; with a Vivado HLS standalone license, it supports all devices supported by ISE and Vivado.

Using OpenCV in FPGA designs
Figure: four ways of using OpenCV in a design.
Pure OpenCV Application: Image File Read (OpenCV) -> OpenCV function chain -> Image File Write (OpenCV)
Integrated OpenCV Application: Live Video Input -> OpenCV function chain -> Live Video Output
OpenCV Reference: Image File Read (OpenCV) -> OpenCV2AXIvideo -> [Synthesizable Block: AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo] -> AXIvideo2OpenCV -> Image File Write (OpenCV)
Accelerated OpenCV Application: Live Video Input -> [Synthesized Block: AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo] -> Live Video Output

Pure OpenCV Application
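Figure: the Zynq video TRD block diagram (HDMI video input, AXI VDMA, HLS-generated pipeline, Xylon display controller, AXI interconnect, processing system with dual-core Cortex-A9, DDR memory controller and hardened peripherals, DDR3 external memory), with the software path Image File Read (OpenCV) -> OpenCV function chain -> Image File Write (OpenCV) running on the ARM cores and reading and writing image files on the SD card.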

Integrated OpenCV Application
Figure: the Zynq video TRD block diagram with numbered data-flow steps 1 to 5, showing the path Live Video Input -> OpenCV function chain -> Live Video Output.

OpenCV Reference / Software Execution
Figure: the Zynq video TRD block diagram with numbered data-flow steps 1 to 5, showing the path Image File Read (OpenCV) -> OpenCV2AXIvideo -> AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo -> AXIvideo2OpenCV -> Image File Write (OpenCV), with image files on the SD card.

OpenCV Reference / In system Test
Figure: the Zynq video TRD block diagram with numbered data-flow steps 1 and 2, showing the path Image File Read (OpenCV) -> OpenCV2AXIvideo -> AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo -> AXIvideo2OpenCV -> Image File Write (OpenCV), with image files on the SD card.

Accelerated OpenCV Application
Figure: the Zynq video TRD block diagram with numbered data-flow steps 1 and 2, showing the path Live Video Input -> AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo -> Live Video Output.

OpenCV design flow
Figure: an application built from OpenCV Block A, OpenCV Block B, OpenCV Block C and OpenCV Block D.
1. Develop the OpenCV application on the desktop.
2. Run the OpenCV application on the ARM cores without modification.
3. Abstract the FPGA portion using I/O functions.
4. Replace OpenCV function calls with synthesizable code.
5. Run HLS to generate the FPGA accelerator.
6. Replace the call to the synthesizable code with a call to the FPGA accelerator.

Partitioned OpenCV Application
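Figure: OpenCV Block A -> opencv2AXIvideo -> [Synthesizable: AXIvideo2HLS -> HLS Block B -> HLS Block C -> HLS2AXIvideo] -> AXIvideo2opencv -> OpenCV Block D, with synchronization between the software blocks and the synthesizable section (which replaces OpenCV Blocks B and C).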
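The partitioning step, abstracting the FPGA portion behind I/O conversion functions, can be sketched in plain C++ without OpenCV or HLS. All names below are hypothetical: FrameBuffer stands in for an OpenCV image in memory, PixelStream (a std::queue) for an AXI4-Stream, and frame_to_stream / stream_to_frame play the roles of opencv2AXIvideo / AXIvideo2opencv. Blocks B and C are toy per-pixel stages chosen only for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-ins: FrameBuffer ~ an OpenCV image, PixelStream ~ an AXI4-Stream.
struct FrameBuffer {
    int rows = 0;
    int cols = 0;
    std::vector<std::uint8_t> data;  // row-major, rows * cols pixels
};
using PixelStream = std::queue<std::uint8_t>;

// Role of opencv2AXIvideo: read the frame buffer in raster order and push each pixel.
void frame_to_stream(const FrameBuffer& img, PixelStream& out) {
    for (std::uint8_t p : img.data) out.push(p);
}

// Role of AXIvideo2opencv: pop pixels in raster order back into a frame buffer.
void stream_to_frame(PixelStream& in, FrameBuffer& img) {
    for (std::uint8_t& p : img.data) {
        assert(!in.empty());
        p = in.front();
        in.pop();
    }
}

// Toy "Block B": invert each pixel; every pixel is touched exactly once, in order.
void block_b_invert(PixelStream& in, PixelStream& out, int n) {
    for (int i = 0; i < n; ++i) {
        out.push(static_cast<std::uint8_t>(255 - in.front()));
        in.pop();
    }
}

// Toy "Block C": binary threshold.
void block_c_threshold(PixelStream& in, PixelStream& out, int n, std::uint8_t t) {
    for (int i = 0; i < n; ++i) {
        out.push(in.front() > t ? std::uint8_t{255} : std::uint8_t{0});
        in.pop();
    }
}

// The section that would become the FPGA accelerator: streams in, streams out.
void accelerated_section(PixelStream& in, PixelStream& out, int rows, int cols) {
    PixelStream mid;
    block_b_invert(in, mid, rows * cols);
    block_c_threshold(mid, out, rows * cols, 100);
}

// Software side: Block A output -> conversion -> accelerated section -> conversion -> Block D input.
FrameBuffer run_partitioned(const FrameBuffer& src) {
    PixelStream to_hw, from_hw;
    frame_to_stream(src, to_hw);
    accelerated_section(to_hw, from_hw, src.rows, src.cols);
    FrameBuffer dst;
    dst.rows = src.rows;
    dst.cols = src.cols;
    dst.data.resize(src.data.size());
    stream_to_frame(from_hw, dst);
    return dst;
}
```

Because accelerated_section only sees streams, its software version could later be swapped for a call to a hardware accelerator without touching the code around it, which is the point of step 3 in the flow.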

OpenCV Design Tradeoffs
OpenCV-based image processing is built around memory frame buffers:
Poor access locality -> small caches perform poorly.
Complex architectures for performance -> higher power.
Likely 'good enough' for many applications: low resolution or frame rate, or processing of features or regions of interest in a larger image.
Streaming architectures give high performance and low power:
Chaining image processing functions reduces external memory accesses.
Video-optimized line buffers and window buffers are simpler than processor caches.
Can be implemented with streaming optimizations in HLS.
Requires conversion of code to be synthesizable.
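To put rough numbers on "chaining reduces external memory accesses", here is an idealized traffic estimate. It assumes that in a frame-buffer chain every function reads one full frame and writes one full frame to DDR (caches are ignored, since a 1080p RGB frame of about 6 MB is far larger than the Cortex-A9's 512 KB L2 cache), while a fused streaming pipeline reads the input once and writes the output once. The 1080p, 3 bytes/pixel, 60 fps and five-function values (for example Sobel, SubS, Scale, Erode, Dilate) are illustrative assumptions.

```cpp
#include <cassert>
#include <cstdint>

// Idealized external-memory traffic in bytes per second (caches ignored).
// Frame-buffer style: each of `stages` functions reads one frame and writes one frame.
constexpr std::uint64_t framebuffer_traffic(std::uint64_t frame_bytes, std::uint64_t fps,
                                            std::uint64_t stages) {
    return 2 * stages * frame_bytes * fps;
}

// Streaming style: the fused chain reads the input frame once and writes the output once.
constexpr std::uint64_t streaming_traffic(std::uint64_t frame_bytes, std::uint64_t fps) {
    return 2 * frame_bytes * fps;
}

// Assumed example: 1080p RGB, 3 bytes per pixel.
constexpr std::uint64_t kFrameBytes = 1920ULL * 1080 * 3;  // 6,220,800 bytes
```

With five functions at 60 fps this gives about 3.7 GB/s for the frame-buffer chain versus about 0.75 GB/s for the streaming pipeline: the traffic grows with the number of stages in one case and stays constant in the other.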

HLS Video Libraries
OpenCV functions are not directly synthesizable with HLS: they use dynamic memory allocation and floating point, and they assume images are modified in external memory.
The HLS video library is intended to replace many basic OpenCV functions:
Similar interfaces and algorithms to OpenCV.
Focus on image processing functions implemented in FPGA fabric.
Includes FPGA-specific optimizations: fixed-point operations instead of floating point, and on-chip line buffers and window buffers.
Not necessarily bit-accurate with OpenCV.
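To illustrate what "fixed point instead of floating point" and "not necessarily bit-accurate" can mean in practice, the sketch below compares a floating-point RGB-to-gray conversion using the BT.601 weights (0.299, 0.587, 0.114, the weights OpenCV documents for its RGB-to-gray conversion) with an 8-bit fixed-point approximation. The integer weights 77, 150 and 29 (which sum to 256) are an assumption chosen for this example; this is not the implementation of hls::CvtColor.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Floating-point reference: BT.601 luma weights, rounded half up to an 8-bit value.
std::uint8_t gray_float(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const double y = 0.299 * r + 0.587 * g + 0.114 * b;
    return static_cast<std::uint8_t>(std::floor(y + 0.5));
}

// Fixed-point approximation (assumed weights): integer multiply-adds and a shift,
// which map to small, cheap hardware. 77 + 150 + 29 = 256, so white stays 255.
std::uint8_t gray_fixed(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    const std::uint32_t acc = 77u * r + 150u * g + 29u * b + 128u;  // +128 rounds to nearest
    return static_cast<std::uint8_t>(acc >> 8);
}
```

For a pure red pixel (255, 0, 0) the floating-point version gives 76 and the fixed-point version gives 77. Because the rounded integer weights differ from the exact ones by less than 0.002 each, the two results never differ by more than one gray level, which is usually acceptable for video but means outputs should be compared with a tolerance rather than bit for bit.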

Xilinx HLS Video Library 2013.2
Video Data Modeling: Linebuffer class, Window class
AXI4-Stream IO Functions: AXIvideo2Mat, Mat2AXIvideo
OpenCV Interface Functions: cvMat2AXIvideo, AXIvideo2cvMat, cvMat2hlsMat, hlsMat2cvMat, IplImage2AXIvideo, AXIvideo2IplImage, IplImage2hlsMat, hlsMat2IplImage, CvMat2AXIvideo, AXIvideo2CvMat, CvMat2hlsMat, hlsMat2CvMat
Video Functions: AbsDiff, AddS, AddWeighted, And, Avg, AvgSdv, Cmp, CmpS, CornerHarris, CvtColor, Dilate, Duplicate, EqualizeHist, Erode, FASTX, Filter2D, GaussianBlur, Harris, HoughLines2, Integral, InitUndistortRectifyMap, Max, MaxS, Mean, Merge, Min, MinMaxLoc, MinS, Mul, Not, PaintMask, Range, Reduce, Remap, Resize, Scale, Set, Sobel, Split, SubRS, SubS, Sum, Threshold, Zero
For function signatures and descriptions, see the HLS user guide (UG902).
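The Linebuffer and Window classes exist so that neighborhood operations can run on a pixel stream. The following plain C++ sketch (not the hls::LineBuffer / hls::Window API, whose methods are documented in UG902) shows the idea for a 3x3 erosion: two line buffers hold the previous two rows, a 3x3 window shifts by one column per incoming pixel, and every input pixel is read exactly once. For simplicity it only produces outputs where the window lies entirely inside the image; border handling in the HLS library may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Streaming 3x3 erosion (minimum filter) over pixels arriving in raster order.
// Output is the valid region only: (rows - 2) x (cols - 2), also in raster order.
std::vector<std::uint8_t> erode3x3_stream(const std::vector<std::uint8_t>& in, int rows, int cols) {
    assert(rows >= 3 && cols >= 3);
    assert(in.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<std::uint8_t> line0(cols, 0);  // line buffer: row y-2, one entry per column
    std::vector<std::uint8_t> line1(cols, 0);  // line buffer: row y-1, one entry per column
    std::uint8_t win[3][3] = {};               // window: win[row][col], col 2 is the newest
    std::vector<std::uint8_t> out;
    out.reserve(static_cast<std::size_t>(rows - 2) * (cols - 2));

    std::size_t idx = 0;
    for (int y = 0; y < rows; ++y) {
        for (int x = 0; x < cols; ++x) {
            const std::uint8_t pix = in[idx++];  // each input pixel is read exactly once

            // Shift the window left by one column.
            for (int r = 0; r < 3; ++r) {
                win[r][0] = win[r][1];
                win[r][1] = win[r][2];
            }
            // New right-hand column: two buffered pixels from above plus the new pixel.
            win[0][2] = line0[x];
            win[1][2] = line1[x];
            win[2][2] = pix;

            // Update the line buffers for column x.
            line0[x] = line1[x];
            line1[x] = pix;

            // The window now covers rows y-2..y and columns x-2..x.
            if (y >= 2 && x >= 2) {
                std::uint8_t m = 255;
                for (int r = 0; r < 3; ++r)
                    for (int c = 0; c < 3; ++c) m = std::min(m, win[r][c]);
                out.push_back(m);
            }
        }
    }
    return out;
}
```

On-chip storage here is two image lines plus nine pixels, regardless of image height, which is why this structure is much cheaper than a processor cache holding whole frames.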

Video Library Functions
Video library code is C++ contained in the hls namespace:
  #include "hls_video.h"
Similar interface and equivalent behavior to OpenCV, e.g.:
  OpenCV library:     cvScale(src, dst, scale, shift);
  HLS video library:  hls::Scale<...>(src, dst, scale, shift);
Some constructor arguments have corresponding or replacement template parameters, e.g. ROWS and COLS specify the maximum size of an image processed:
  OpenCV library:     cv::Mat mat(rows, cols, CV_8UC3);
  HLS video library:  hls::Mat<ROWS, COLS, HLS_8UC3> mat(rows, cols);
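ROWS and COLS act as compile-time maximums because hardware storage must have a fixed size, while the actual frame size (rows, cols) can be set at run time. The hypothetical RowDelay class below (not part of the HLS video library) illustrates this with a one-line delay, the building block of a line buffer: its storage is sized by the MAX_COLS template parameter, and any runtime width up to that maximum is accepted.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical illustration: a one-line delay whose storage is fixed at compile time
// by MAX_COLS, while the actual line width is a runtime value.
template <int MAX_COLS>
class RowDelay {
public:
    explicit RowDelay(int cols) : cols_(cols) {
        assert(cols > 0 && cols <= MAX_COLS);  // runtime width must fit the compile-time maximum
    }

    // Store the pixel for column x of the current row and return the pixel
    // stored at column x on the previous row (0 for the first row).
    std::uint8_t shift(int x, std::uint8_t pix) {
        assert(x >= 0 && x < cols_);
        const std::uint8_t above = buf_[x];
        buf_[x] = pix;
        return above;
    }

    int cols() const { return cols_; }
    static constexpr int max_cols() { return MAX_COLS; }

private:
    int cols_;
    std::array<std::uint8_t, MAX_COLS> buf_{};  // storage sized by the template parameter
};
```

A RowDelay<1920> can process 640-pixel or 1920-pixel lines with the same storage, much as one hls::Mat<1080,1920,...> type can carry smaller frames at run time.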

Video Library Core Structures
OpenCV | HLS Video Library
cv::Point_<T>, CvPoint | hls::Point_<T>, hls::Point
cv::Size_<T>, CvSize | hls::Size_<T>, hls::Size
cv::Rect_<T>, CvRect | hls::Rect_<T>, hls::Rect
cv::Scalar_<T>, CvScalar | hls::Scalar<N, T>
cv::Mat, IplImage, CvMat | hls::Mat<ROWS, COLS, T>
(HLS only) | hls::Window<ROWS, COLS, T>
(HLS only) | hls::LineBuffer<ROWS, COLS, T>

Examples:
cv::Mat mat(rows, cols, CV_8UC3); | hls::Mat<ROWS, COLS, HLS_8UC3> mat(rows, cols);
IplImage* img = cvCreateImage(cvSize(cols,rows), IPL_DEPTH_8U, 3); | hls::Mat<ROWS, COLS, HLS_8UC3> img(rows, cols); or hls::Mat<ROWS, COLS, HLS_8UC3> img;

Limitations
Must replace OpenCV calls with video library functions.
Frame buffer access is not supported through pointers: use VDMA and AXI Stream adapter functions.
Random access is not supported: data read more than once must be duplicated (see hls::Duplicate()).
In-place update is not supported, e.g. cvRectangle(img, point1, point2).

Operation | OpenCV | HLS Video Library
Read | pix = cv_mat.at<T>(i,j) or pix = cvGet2D(cv_img,i,j) | hls_img >> pix
Write | cv_mat.at<T>(i,j) = pix or cvSet2D(cv_img,i,j,pix) | hls_img << pix
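The read/write rules above can be modeled in plain C++ to see why data read more than once must be duplicated. In this hypothetical StreamImage class (a sketch, not the hls::Mat implementation), reading with >> consumes a pixel, so a second consumer needs its own copy of the stream; the duplicate function plays the same role that hls::Duplicate() plays in the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <queue>

// Hypothetical model of a streamed image: pixels are written with << and read with >>
// in raster order, and a read consumes the pixel.
class StreamImage {
public:
    StreamImage(int rows, int cols) : rows_(rows), cols_(cols) {}

    StreamImage& operator<<(std::uint8_t pix) {
        fifo_.push(pix);
        return *this;
    }
    StreamImage& operator>>(std::uint8_t& pix) {
        assert(!fifo_.empty() && "no pixel left: each pixel can be read only once");
        pix = fifo_.front();
        fifo_.pop();
        return *this;
    }

    int rows() const { return rows_; }
    int cols() const { return cols_; }
    std::size_t pending() const { return fifo_.size(); }

private:
    int rows_, cols_;
    std::queue<std::uint8_t> fifo_;
};

// Same role as hls::Duplicate(): read each input pixel once and write it to two outputs,
// so two consumers can each read the complete image.
void duplicate(StreamImage& src, StreamImage& dst1, StreamImage& dst2) {
    for (int i = 0; i < src.rows() * src.cols(); ++i) {
        std::uint8_t pix;
        src >> pix;
        dst1 << pix;
        dst2 << pix;
    }
}
```

After duplicate runs, the source stream is empty and each output holds its own full copy of the image, so two downstream functions can consume it independently.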

OpenCV Code
One image input, one image output, processed by a chain of functions sequentially (test_opencv.cpp; flow: Image Read (OpenCV) -> OpenCV function chain -> Image Write (OpenCV)):

  …
  IplImage* src = cvLoadImage("test_1080p.bmp");
  IplImage* dst = cvCreateImage(cvGetSize(src), src->depth, src->nChannels);
  // Each call reads a full frame and writes a full frame, ping-ponging between src and dst
  cvSobel(src, dst, 1, 0);
  cvSubS(dst, cvScalar(100,100,100), src);
  cvScale(src, dst, 2, 0);
  cvErode(dst, src);
  cvDilate(src, dst);
  cvSaveImage("result_1080p.bmp", dst);
  cvReleaseImage(&src);
  cvReleaseImage(&dst);

Integrated OpenCV Application
The system provides pointers to the frame buffers; the synthesizable code can also be run on the ARM (img_filters.c; flow: Live Video Input -> OpenCV function chain -> Live Video Output):

  void img_process(ZNQ_S32 *rgb_data_in, ZNQ_S32 *rgb_data_out,
                   int height, int width, int stride, int flag_OpenCV)
  {
      // Construct OpenCV image headers that wrap the existing frame buffers (no pixel copy)
      IplImage* src_dma = cvCreateImageHeader(cvSize(width, height), IPL_DEPTH_8U, 4);
      IplImage* dst_dma = cvCreateImageHeader(cvSize(width, height), IPL_DEPTH_8U, 4);
      src_dma->imageData = (char*)rgb_data_in;
      dst_dma->imageData = (char*)rgb_data_out;
      // stride is in pixels; each pixel is 4 bytes
      src_dma->widthStep = 4 * stride;
      dst_dma->widthStep = 4 * stride;
      if (flag_OpenCV) {
          opencv_image_filter(src_dma, dst_dma);
      } else {
          sw_image_filter(src_dma, dst_dma);
      }
      // Release only the headers; the frame buffers belong to the system
      cvReleaseImageHeader(&src_dma);
      cvReleaseImageHeader(&dst_dma);
  }

Accelerated with Vivado HLS video library
Top-level function extracted for HW acceleration (flow: Image Read (OpenCV) -> OpenCV2AXIvideo -> AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo -> AXIvideo2OpenCV -> Image Write (OpenCV)).

top.h:
  #include "hls_video.h"   // header file of HLS video library
  #include "hls_opencv.h"  // header file of OpenCV I/O

  // typedef video library core structures
  typedef hls::stream<ap_axiu<32,1,1,1> > AXI_STREAM;
  typedef hls::Scalar<3, uchar> RGB_PIXEL;
  typedef hls::Mat<1080,1920,HLS_8UC3> RGB_IMAGE;

  void image_filter(AXI_STREAM& src_axi, AXI_STREAM& dst_axi, int rows, int cols);

test.cpp:
  #include "top.h"
  …
  IplImage* src = cvLoadImage("test_1080p.bmp");
  IplImage* dst = cvCreateImage(cvGetSize(src), src->depth, src->nChannels);
  AXI_STREAM src_axi, dst_axi;
  IplImage2AXIvideo(src, src_axi);                           // OpenCV image -> AXI4-Stream
  image_filter(src_axi, dst_axi, src->height, src->width);  // synthesizable top-level function
  AXIvideo2IplImage(dst_axi, dst);                           // AXI4-Stream -> OpenCV image
  cvSaveImage("result_1080p.bmp", dst);
  cvReleaseImage(&src);
  cvReleaseImage(&dst);

Accelerated with Vivado HLS video library
HW synthesizable block for FPGA acceleration: it consists of video library functions and interfaces; each OpenCV function is replaced with the similar function in the hls namespace (flow: Image Read (OpenCV) -> OpenCV2AXIvideo -> AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo -> AXIvideo2OpenCV -> Image Write (OpenCV)).

top.cpp:
  #include "top.h"

  void image_filter(AXI_STREAM& input, AXI_STREAM& output, int rows, int cols)
  {
      //Create AXI streaming interfaces for the core
      #pragma HLS RESOURCE variable=input core=AXIS metadata="-bus_bundle INPUT_STREAM"
      #pragma HLS RESOURCE variable=output core=AXIS metadata="-bus_bundle OUTPUT_STREAM"
      #pragma HLS RESOURCE variable=rows core=AXI_SLAVE metadata="-bus_bundle CONTROL_BUS"
      #pragma HLS RESOURCE variable=cols core=AXI_SLAVE metadata="-bus_bundle CONTROL_BUS"
      #pragma HLS RESOURCE variable=return core=AXI_SLAVE metadata="-bus_bundle CONTROL_BUS"
      #pragma HLS INTERFACE ap_stable port=rows
      #pragma HLS INTERFACE ap_stable port=cols

      RGB_IMAGE img_0(rows, cols), img_1(rows, cols), img_2(rows, cols);
      RGB_IMAGE img_3(rows, cols), img_4(rows, cols), img_5(rows, cols);
      RGB_PIXEL pix(50, 50, 50);

      #pragma HLS dataflow
      hls::AXIvideo2Mat(input, img_0);
      hls::Sobel<1,0,3>(img_0, img_1);
      hls::SubS(img_1, pix, img_2);
      hls::Scale(img_2, img_3, 2, 0);
      hls::Erode(img_3, img_4);
      hls::Dilate(img_4, img_5);
      hls::Mat2AXIvideo(img_5, output);
  }

Using Linux Userspace API
Modify the device tree to include the register map, then call the driver from userspace after mmap() (flow: Live Video Input -> AXIvideo2Mat -> HLS video library function chain -> Mat2AXIvideo -> Live Video Output).

Device tree entry:
  {
      compatible = "xlnx,generic-hls";
      reg = <0x400d0000 0xffff>;
      interrupts = <0x0 0x37 0x4>;
      interrupt-parent = <0x1>;
  };

Userspace code:
  XImage_filter xsfilter;
  int fd_uio = 0;
  if ((fd_uio = open("/dev/uio0", O_RDWR)) < 0) {
      printf("UIO: Cannot open device node\n");
  }
  // Map the core's AXI4-Lite control registers into the process address space
  xsfilter.Control_bus_BaseAddress = (u32)mmap(NULL, XSOBEL_FILTER_CONTROL_BUS_SIZE,
                                               PROT_READ|PROT_WRITE, MAP_SHARED, fd_uio, 0);
  xsfilter.IsReady = XIL_COMPONENT_IS_READY;
  // init the configuration for image filter
  XImage_filter_SetRows(&xsfilter, sobel_configuration.height);
  XImage_filter_SetCols(&xsfilter, sobel_configuration.width);
  XImage_filter_EnableAutoRestart(&xsfilter);
  XImage_filter_Start(&xsfilter);

HLS Directives for Video Processing
Each directive and its effect:
  #pragma HLS RESOURCE variable=input core=AXIS metadata="-bus_bundle INPUT_STREAM"
      Assign 'input' to be an AXI4 stream named "INPUT_STREAM".
  #pragma HLS RESOURCE variable=return core=AXI_SLAVE metadata="-bus_bundle CONTROL_BUS"
      Assign the control interface to an AXI4-Lite interface.
  #pragma HLS RESOURCE variable=rows core=AXI_SLAVE metadata="-bus_bundle CONTROL_BUS"
      Assign 'rows' to be accessible through the AXI4-Lite interface.
  #pragma HLS INTERFACE ap_stable port=rows
      Declare that 'rows' will not be changed during the execution of the function.
  #pragma HLS dataflow
      Enable streaming dataflow optimizations.

A more complex OpenCV example: fast-corners
This code is not 'streaming' and must be rewritten: it performs random access and in-place operations on 'dst' (opencv_top.cpp):

  void opencv_image_filter(IplImage* img, IplImage* dst)
  {
      IplImage* gray = cvCreateImage(cvSize(img->width, img->height), 8, 1);
      cvCvtColor(img, gray, CV_BGR2GRAY);
      std::vector<cv::KeyPoint> keypoints;
      cv::Mat gray_mat(gray, 0);
      cv::FAST(gray_mat, keypoints, 20, true);
      int rect = 2;
      cvCopy(img, dst);
      // Draw a small rectangle at each corner: random-access, in-place writes into dst
      for (size_t i = 0; i < keypoints.size(); i++) {
          cvRectangle(dst, cvPoint(keypoints[i].pt.x, keypoints[i].pt.y),
                      cvPoint(keypoints[i].pt.x + rect, keypoints[i].pt.y + rect),
                      cvScalar(255,0,0), 1);
      }
      cvReleaseImage(&gray);
  }

This code is 'streaming'. Note that the function correspondence with the HLS video library is not 1:1: cv::FAST followed by GenMask corresponds to hls::FASTX, and cvCopy followed by PrintMask corresponds to hls::PaintMask (opencv_top.cpp):

  void opencv_image_filter(IplImage* src, IplImage* dst)
  {
      IplImage* gray  = cvCreateImage(cvGetSize(src), 8, 1);
      IplImage* mask  = cvCreateImage(cvGetSize(src), 8, 1);
      IplImage* dmask = cvCreateImage(cvGetSize(src), 8, 1);
      std::vector<cv::KeyPoint> keypoints;
      cv::Mat gray_mat(gray, 0);
      cvCvtColor(src, gray, CV_BGR2GRAY);
      cv::FAST(gray_mat, keypoints, 20, true);
      GenMask(mask, keypoints);                  // mark each corner in a 1-channel mask
      cvDilate(mask, dmask);                     // grow each mark into a small square
      cvCopy(src, dst);
      PrintMask(dst, dmask, cvScalar(255,0,0));  // paint the color wherever the mask is set
      cvReleaseImage(&mask);
      cvReleaseImage(&dmask);
      cvReleaseImage(&gray);
  }
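To see why the mask formulation streams well, here is a plain C++ sketch of its three steps: mark keypoints in a mask, dilate the mask with a 3x3 neighborhood (the default element of cvDilate), then paint a color wherever the mask is set. The names gen_mask, dilate3x3 and paint_mask are hypothetical stand-ins for GenMask, cvDilate and PrintMask, not their actual code. Only gen_mask writes at keypoint positions (in the synthesizable version, hls::FASTX produces the mask directly as a stream); dilation and painting compute each output pixel from a small neighborhood of input pixels and never revisit a written output, so they fit line buffers in raster order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Point { int x, y; };
struct Rgb { std::uint8_t r, g, b; };

// Stand-in for GenMask: mark each keypoint with 255 in a single-channel mask.
std::vector<std::uint8_t> gen_mask(int rows, int cols, const std::vector<Point>& keypoints) {
    std::vector<std::uint8_t> mask(static_cast<std::size_t>(rows) * cols, 0);
    for (const Point& p : keypoints)
        if (p.x >= 0 && p.x < cols && p.y >= 0 && p.y < rows)
            mask[static_cast<std::size_t>(p.y) * cols + p.x] = 255;
    return mask;
}

// Stand-in for cvDilate with a 3x3 element: each output is the max of its 3x3
// neighborhood; in this sketch, pixels outside the image are ignored.
std::vector<std::uint8_t> dilate3x3(const std::vector<std::uint8_t>& in, int rows, int cols) {
    std::vector<std::uint8_t> out(in.size(), 0);
    for (int y = 0; y < rows; ++y)
        for (int x = 0; x < cols; ++x) {
            std::uint8_t m = 0;
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx) {
                    const int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < rows && xx >= 0 && xx < cols)
                        m = std::max(m, in[static_cast<std::size_t>(yy) * cols + xx]);
                }
            out[static_cast<std::size_t>(y) * cols + x] = m;
        }
    return out;
}

// Stand-in for PrintMask / hls::PaintMask: copy src, replacing masked pixels with color.
std::vector<Rgb> paint_mask(const std::vector<Rgb>& src, const std::vector<std::uint8_t>& mask,
                            Rgb color) {
    std::vector<Rgb> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) dst[i] = mask[i] ? color : src[i];
    return dst;
}
```

A single keypoint at (2, 2) in a 5x5 black image becomes a 3x3 red square covering columns and rows 1 to 3, the same visual result as drawing a small rectangle, but produced without in-place random writes to the output.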

Synthesizable code. Note the '#pragma HLS stream' directive (top.cpp):

  hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC3> _src(rows,cols);
  hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC3> _dst(rows,cols);
  hls::AXIvideo2Mat(input, _src);
  hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC3> src0(rows,cols);
  hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC3> src1(rows,cols);
  // src1 bypasses CvtColor/FASTX/Dilate, so it needs a deep FIFO
  #pragma HLS stream depth=20000 variable=src1.data_stream
  hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC1> mask(rows,cols);
  hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC1> dmask(rows,cols);
  hls::Scalar<3,unsigned char> color(255,0,0);
  hls::Duplicate(_src,src0,src1);
  hls::Mat<MAX_HEIGHT,MAX_WIDTH,HLS_8UC1> gray(rows,cols);
  hls::CvtColor<HLS_BGR2GRAY>(src0,gray);
  hls::FASTX(gray,mask,20,true);
  hls::Dilate(mask,dmask);
  hls::PaintMask(src1,dmask,_dst,color);
  hls::Mat2AXIvideo(_dst, output);

Streams and Reconvergent paths
hls::Mat conceptually represents a whole image, but is implemented as a stream of pixels (hls_video_core.h):

  template<int ROWS, int COLS, int T>
  class Mat {
  public:
      HLS_SIZE_T rows, cols;
      hls::stream<HLS_TNAME(T)> data_stream[HLS_MAT_CN(T)];
  };

Fast-corners contains a reconvergent path: src1 bypasses CvtColor -> FASTX -> Dilate and rejoins that path at PaintMask. The stream of pixels for src1 must include enough buffering to match the delay through FASTX and Dilate (approximately 10 video lines * 1920 pixels = 19,200 pixels), hence:

  #pragma HLS stream depth=20000 variable=src1.data_stream
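The required FIFO depth can be reasoned about with a simple cycle-level model. The sketch assumes one pixel per cycle, no stalls or blanking, and a fixed latency for the slow path; the latency of 10 lines x 1920 pixels is the estimate given above, and a real pipeline may need extra margin, consistent with the depth of 20000 chosen on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Cycle-level sketch of a reconvergent path. Every cycle one pixel enters the bypass
// FIFO (like src1) and the slow path (like CvtColor -> FASTX -> Dilate), which delivers
// its result for pixel i at cycle i + latency. The merge stage (like PaintMask) pops
// one bypass pixel whenever a slow-path result arrives. Returns the peak end-of-cycle
// bypass occupancy, i.e. the FIFO depth needed so the producer never stalls.
std::size_t required_bypass_depth(std::size_t num_pixels, std::size_t latency) {
    std::deque<std::size_t> bypass;
    std::size_t peak = 0;
    for (std::size_t t = 0; t < num_pixels + latency; ++t) {
        if (t < num_pixels) bypass.push_back(t);         // pixel t enters the bypass path
        if (t >= latency && t - latency < num_pixels) {  // slow path delivers pixel t - latency
            assert(!bypass.empty() && bypass.front() == t - latency);  // paths stay aligned
            bypass.pop_front();
        }
        if (bypass.size() > peak) peak = bypass.size();
    }
    return peak;
}
```

In this model the bypass FIFO must hold exactly as many pixels as the slow path's latency: 19,200 pixels for 10 lines of 1920, which fits within the declared depth of 20000. If the FIFO were shallower, the producer would stall and, in a dataflow pipeline, the whole design could deadlock.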

Performance Analysis
The AXI Performance Monitor collects statistics on memory bandwidth (see /mnt/AXI_PerfMon.log).
Video + fast corners: 1920 * 1080 * 60 * 32 = ~4 Gb/s per stream.
HP0: Read 4.01 Gb/s, Write 4.01 Gb/s, Total 8.03 Gb/s
HP2: Read 4.01 Gb/s, Write 4.01 Gb/s, Total 8.03 Gb/s
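The per-stream figure is pixels per frame x frames per second x bits per pixel. A minimal sketch of the arithmetic, using 1080p60 and 32 bits per pixel as on the slide:

```cpp
#include <cassert>
#include <cstdint>

// Bits per second of one uncompressed video stream: pixels/frame * frames/s * bits/pixel.
constexpr std::uint64_t stream_bits_per_second(std::uint64_t width, std::uint64_t height,
                                               std::uint64_t fps, std::uint64_t bits_per_pixel) {
    return width * height * fps * bits_per_pixel;
}

// 1080p60 at 32 bits per pixel.
constexpr std::uint64_t kPerStream = stream_bits_per_second(1920, 1080, 60, 32);  // 3,981,312,000 b/s
// One read stream plus one write stream on the same HP port.
constexpr std::uint64_t kPerPortReadWrite = 2 * kPerStream;                        // 7,962,624,000 b/s
```

The computed 3.98 Gb/s per stream and 7.96 Gb/s per port are close to the measured 4.01 Gb/s and 8.03 Gb/s.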

Power Analysis
Voltage and current can be read from the digital power regulators on the ZC702 board.
Custom, real-time HD video processing in 2-3 W total system power.
FASTX adds less than 200 mW of incremental power.

HLS and Zynq accelerates OpenCV apps
OpenCV functions enable fast prototyping of computer vision algorithms.
Computer vision applications are inherently heterogeneous and require a mix of HW and SW implementation.
The Vivado HLS video library accelerates mapping of OpenCV functions to the FPGA programmable fabric.
Zynq offers a power-optimized, integrated solution with high-performance programmable logic and an embedded ARM processor.

Additional OpenCV Collateral at Xilinx.com
Download XAPP1167 from Xilinx.com.
QuickTake: Leveraging OpenCV and High-Level Synthesis with Vivado
