Highly Efficient and Flexible Video Encoder on CPU+FPGA Platform

Slides:



Advertisements
Similar presentations
3D Graphics Content Over OCP Martti Venell Sr. Verification Engineer Bitboys.
Advertisements

INTERNATIONAL CONFERENCE ON TELECOMMUNICATIONS, ICT '09. TAREK OUNI WALID AYEDI MOHAMED ABID NATIONAL ENGINEERING SCHOOL OF SFAX New Low Complexity.
 Understanding the Sources of Inefficiency in General-Purpose Chips.
Extensible Processors. 2 ASIP Gain performance by:  Specialized hardware for the whole application (ASIC). −  Almost no flexibility. −High cost.  Use.
© 2004 Xilinx, Inc. All Rights Reserved Implemented by : Alon Ben Shalom Yoni Landau Project supervised by: Mony Orbach High speed digital systems laboratory.
Hardware accelerator for PPC microprocessor Final presentation By: Instructor: Kopitman Reem Fiksman Evgeny Stolberg Dmitri.
Define Embedded Systems Small (?) Application Specific Computer Systems.
Reliable Data Storage using Reed Solomon Code Supervised by: Isaschar (Zigi) Walter Performed by: Ilan Rosenfeld, Moshe Karl Spring 2004 Midterm Presentation.
Implementation of DSP Algorithm on SoC. Characterization presentation Student : Einat Tevel Supervisor : Isaschar Walter Accompany engineer : Emilia Burlak.
EEL 6935 Embedded Systems Long Presentation 2 Group Member: Qin Chen, Xiang Mao 4/2/20101.
An Introduction to H.264/AVC and 3D Video Coding.
Adaptive Video Coding to Reduce Energy on General Purpose Processors Daniel Grobe Sachs, Sarita Adve, Douglas L. Jones University of Illinois at Urbana-Champaign.
General Purpose FIFO on Virtex-6 FPGA ML605 board midterm presentation
Final presentation Encryption/Decryption on embedded system Supervisor: Ina Rivkin students: Chen Ponchek Liel Shoshan Winter 2013 Part A.
Students: Oleg Korenev Eugene Reznik Supervisor: Rolf Hilgendorf
General Purpose FIFO on Virtex-6 FPGA ML605 board Students: Oleg Korenev Eugene Reznik Supervisor: Rolf Hilgendorf 1 Semester: spring 2012.
USB host for web camera connection
Xilinx at Work in Hot New Technologies ® Spartan-II 64- and 32-bit PCI Solutions Below ASSP Prices January
Department of Electrical Engineering Electronics Computers Communications Technion Israel Institute of Technology High Speed Digital Systems Lab. High.
Live Action First Person Shooter Game Patrick Judd Ian Katsuno Bao Le.
Architectures for mobile and wireless systems Ese 566 Report 1 Hui Zhang Preethi Karthik.
SYSTEM-ON-CHIP (SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY.
Implementation of MAC Assisted CORDIC engine on FPGA EE382N-4 Abhik Bhattacharya Mrinal Deo Raghunandan K R Samir Dutt.
Advanced Computer Architecture, CSE 520 Generating FPGA-Accelerated DFT Libraries Chi-Li Yu Nov. 13, 2007.
F. Gharsalli, S. Meftali, F. Rousseau, A.A. Jerraya TIMA laboratory 46 avenue Felix Viallet Grenoble Cedex - France Embedded Memory Wrapper Generation.
Supervisor: INA RIVKIN Students: Video manipulation algorithm on ZYNQ Part B.
Department of Electrical Engineering Electronics Computers Communications Technion Israel Institute of Technology High Speed Digital Systems Lab. High.
Aug 25, 2005 page1 Aug 25, 2005 Integration of Advanced Video/Speech Codecs into AccessGrid National Center for High Performance Computing Speaker: Barz.
Hardware Image Signal Processing and Integration into Architectural Simulator for SoC Platform Hao Wang University of Wisconsin, Madison.
An Architecture and Prototype Implementation for TCP/IP Hardware Support Mirko Benz Dresden University of Technology, Germany TERENA 2001.
XStream: Rapid Generation of Custom Processors for ASIC Designs Binu Mathew * ASIC: Application Specific Integrated Circuit.
PROJECT - ZYNQ Yakir Peretz Idan Homri Semester - winter 2014 Duration - one semester.
-BY KUSHAL KUNIGAL UNDER GUIDANCE OF DR. K.R.RAO. SPRING 2011, ELECTRICAL ENGINEERING DEPARTMENT, UNIVERSITY OF TEXAS AT ARLINGTON FPGA Implementation.
Hardware Accelerator for Hot-word Recognition Gautam Das Govardan Jonathan Mathews Wasim Shaikh Mojes Koli.
Performed By: Itamar Niddam and Lior Motorin Instructor: Inna Rivkin Bi-Semesterial. Winter 2012/2013 3/12/2012.
DDRIII BASED GENERAL PURPOSE FIFO ON VIRTEX-6 FPGA ML605 BOARD PART B PRESENTATION STUDENTS: OLEG KORENEV EUGENE REZNIK SUPERVISOR: ROLF HILGENDORF 1 Semester:
Final Presentation Hardware DLL Real Time Partial Reconfiguration Management of FPGA by OS Submitters:Alon ReznikAnton Vainer Supervisors:Ina RivkinOz.
PRESENTED BY: MOHAMAD HAMMAM ALSAFRJALANI UFL ECE Dept. 3/31/2010 UFL ECE Dept 1 CACHE OPTIMIZATION FOR AN EMBEDDED MPEG-4 VIDEO DECODER.
Design with Vivado IP Integrator
Aditya Dayal M. Tech, VLSI Design ITM University, Gwalior.
© Copyright 2013 Xilinx. Zynq Development Flow to Accelerate C Code.
Automated Software Generation and Hardware Coprocessor Synthesis for Data Adaptable Reconfigurable Systems Andrew Milakovich, Vijay Shankar Gopinath, Roman.
DAC50, Designer Track, 156-VB543 Parallel Design Methodology for Video Codec LSI with High-level Synthesis and FPGA-based Platform Kazuya YOKOHARI, Koyo.
Fast iteration and prototyping in high-performance computing medical applications: a case study with Mentor Vista 8th INFIERI WORKSHOP 10/21/2016.
Maj Jeffrey Falkinburg Room 2E46E
Lab 4 HW/SW Compression and Decompression of Captured Image
Accelerate HD video processing through affordable hardware
Co-Designing Accelerators and SoC Interfaces using gem5-Aladdin
M. Bellato INFN Padova and U. Marconi INFN Bologna
Prototyping SoC-based Gate Drive Logic for Power Convertors by Generating code from Simulink models. Researchers Rounak Siddaiah, Graduate Student-University.
Dynamo: A Runtime Codesign Environment
CA Final Project – Multithreaded Processor with IPC Interface
ECE354 Embedded Systems Introduction C Andras Moritz.
Introduction to Programmable Logic
Texas Instruments TDA2x and Vision SDK
ENG3050 Embedded Reconfigurable Computing Systems
FPGAs in AWS and First Use Cases, Kees Vissers
System on a Chip (SoC) Cristian Sisterna Universidad Nacional San Juan
Overview of Embedded SoC Systems
Using FPGAs with Processors in YOUR Designs
FPro Bus Protocol and MMIO Slot Specification
Dynamically Reconfigurable Architectures: An Overview
VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder
The performance requirements for DSP applications continue to grow and the traditional solutions do not adequately address this new challenge Paradigm.
ECE 699: Lecture 3 ZYNQ Design Flow.
Introduction to Heterogeneous Parallel Computing
Implementation of a GNSS Space Receiver on a Zynq
Kyoungwoo Lee, Minyoung Kim, Nikil Dutt, and Nalini Venkatasubramanian
NetFPGA - an open network development platform
Cloud-DNN: An Open Framework for Mapping DNN Models to Cloud FPGAs
Presentation transcript:

Highly Efficient and Flexible Video Encoder on CPU+FPGA Platform Di Wu, Liang Zhang, Peng Liu, Yao Song

Motivation For Video Encoding Is it possible to combine the advantages of both ASIC and Software solution? ASIC high throughput low power consumption difficult to upgrade Some tasks not suitable for hardware Pure software easy upgrade to new algorithm and standard highest quality poor encoding performance

CPU+FPGA Industry has provided CPU+FPGA SoCs, and it is a trend to integrate FPGA for GPUs in datacenter! Xilinx Zynq - Dual-core ARM Cortex A9 + FPGA For video encoding Video Frame Encoding (FPGA) Video Packet Wrapping (CPU) Video Packet Transmission (CPU) High performance! Easy upgrading! Easy programming!

Discussions in x264 Community [x264-devel] FPGAs and x264 https://mailman.videolan.org/pipermail/x264-devel/2009-July/006009.html “Just offloading simple DSP functions to an fpga is a bad idea when the host is a modern cpu.”

Xilinx Zynq Architecture programming system (PS) and programmable logic (PL) AXI interface connecting PS and PL The programming system contains a dual-core ARM processor, users can develop their own logics such as hardware accelerator in PL. AXI interface can be used to transfer data between PS and PL. AXI ports in PS include GP0/GP1 (master), HP0/HP1(slave) and ACP port. http://www.xilinx.com/support/documentation/data_sheets/ds190-Zynq-7000-Overview.pdf

Design Process Hardware Software Export hardware to SDK efforts of our project Hardware Developing custom IPs Instantiating user defined IPs in Vivado Instantiating programming system (PS) and predefined IP as well as user IP in Vivado Synthesizing the system and implementing on Zynq Software Export hardware to SDK Create BSP Application development and debugging

H.264 Baseline Profile one of the three profiles defined in H.264 standard widely used in mobile devices supports predictive encoding, discrete cosine transform and quantisation, entropy encoding According to the H.264 standard[1], the baseline profile encoder supports intra and inter-predictive coding (temporal and spatial compression) , core transform, Hadamard transform, quantisation. These steps not only perform compression , but also make it easier for entropy encoder to compress the data. The entropy encoder used in H.264 baseline profile is context-based adaptive variable length coding (CAVLC). The diagram of the baseline encoder is shown in figure 1[2].

Video Encoder’s Tasks Data Access Input video frames Reference frames Intermediate data Encoded video packets Encoding Motion Estimation Prediction (Intra + Inter) Filtering Entropy coding

Data Access Challenges We want Low latency Low cost Zynq provides AXI-based interconnection -> high throughput, but high latency Only PS has memory controller Solution Minimum data exchange between PS and PL

AXI Interfaces Three kinds of interfaces AXI-Lite (Memory Map) AXI-Full (Memory Map) AXI-Stream We use AXI-Lite interface for control signals, AXI-Full for YUV frames and encoded packets

PS-PL Interconnection In Zynq http://www.googoolia.com/wp/2014/06/20/lesson-8-an-overview-on-zynq-architecture/

Interfaces of Our H264 Encoder Interrupt port AXI-Lite slave interface for configuration AXI-Full master interface for YUV frames reading and encoded packets writing Encoder IP Interrupt AXI-Lite Slave configuration AXI-Full Master YUV frames Encoded packets

We use the open source implementation from The Encoder Engine Encoder Engine CLK CLK2 NEWSLICE control signals NEWLINE QP xbuffer_DONE intra4x4_READYI Y values of pixels intra4x4_STROBEI intra4x4_DATAI intra8x8cc_readyi U,V values of pixels intra4x4_READYI intra4x4_STROBEI tobytes_BYTE Encoded packets tobytes_STROBE tobytes_DONE We use the open source implementation from http://hardh264.sourceforge.net/

Internal of Our H264 Encoder Encoder Controller AXI Lite Interrupt Configuration Encoder Engine Y buffer(4-way) AXI Master Burst YUV frames U buffer(4-way) V buffer(4-way) Encoded packets Data Path Open Source Our Verilog code Control Path Xilinx IP

AXI Interfaces Implementation Implement from scratch RTL implementation based on templates generated by Vivado Design with HLS Use Xilinx’s IP AXI Lite Slave interface AXI Master Interface

Block Design Let’s integrate our IP to Zynq SoC Interrupt Controller Encoder IP Interrupt ARM Processor GP0 configuration YUV frames AXI4 Interconnect HP0 Encoded packets DDR Memory Controller PL PS

Software Implementation How to control the video encoder IP? The AXI-Lite interface (memory map) can access registers Control informations need to transfer to video encoder Start address of a YUV frame YUV format Video resolution QP value of the encoder frame number output address of h264 stream Information provided by video encoder video packet size

Encoding Process in PS Put a video frame (YUV) in DRAM (from camera, decoder’s output, etc.) Config the parameters Start encoding On interrupt, save encoded result

Encoding Process in PL Move a YUV frame data from DDR RAM to Y, U, V buffers Feed the pixels to encoder engine Wait for the encoder to generate video packets and move the packets to DDR RAM After finish the encoding of one frame, generate an interrupt

Implementation Our implementation is still ongoing Hopefully finish the design and evaluation before report deadline Development Environment Vivado 2015.3 Zedboard

Workload Characteristics Data intensive Frame by frame, not stream Computation intensive Real-time requirement for some applications

The Encoder Engine Verification We simulated the encoder engine with test bench, it can generate the correct NAL unit stream. Source YUV: coastguard(30 frames), 352x288, YUV420 QP: 28 PSNR: 36.3 dB Compression ratio: 10.6:1 Reference: x264’s PSNR with same QP, medium preset Only I frame. PSNR 40.31 dB, compression ratio 8.3:1 With P, B frame. PSNR 37.0 dB, compression ratio 42.6:1

Design Review A H.264 video encoder IP for Zynq SoC Our design needs to follow the design paradigm of AXI4 based IP Minimum communication is the key to power/performance gain Use high throughput mode (burst), or other optimization (future work: pipelining) to optimize the communication between CPU and FPGA logic Vivado is a powerful tool, but the learning curve is very high (especially for software guys)

Discussion Can we use some general framework to simplify our work? No Example: RSoC framework:http://rsoc-framework.com/ Suitable for “stream” applications

Conclusion Task offloading is not always helpful, communication latency matters Video encoding on CPU+FPGA SoC can be both efficient and flexible Our system can be used to integrate other encoder engines CPU+FPGA SoCs are powerful tools for designers, “do simple thing fast on HW, intelligence on SW”.

Future Works Hardware Software Optimization, e.g. pipelining Video encoder engine improvement, e.g. support P frame and B frame Software Bitrate control logic OS support Integration with multimedia software framework (ffmpeg, gstreamer, etc.)

Thanks! Q&A