Scalable and Low Cost Design Approach for Variable Block Size Motion Estimation. Hadi Afshar, Philip Brisk, Paolo Ienne (EPFL)

Presentation transcript:

Scalable and Low Cost Design Approach for Variable Block Size Motion Estimation. Hadi Afshar, Philip Brisk, Paolo Ienne (EPFL). 30 April 2009

Fixed Block Size Motion Estimation: few motion vectors, less compression. [Figure: current frame and reference frame, showing a macroblock (MB) and its motion vector (MV).]

Variable Block Size Motion Estimation: more motion vectors, more compression, more computation. [Figure: current frame and reference frame, showing a macroblock (MB) split into sub-blocks, each with its own motion vector (MV).]
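As a rough illustration of what a motion vector search computes, here is a minimal full-search sketch for one fixed-size block. It assumes the sum of absolute differences (SAD) as the matching cost, which is the cost the later slides use; the function names, the toy frames, and the search window are hypothetical and not taken from the talk.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the size x size block of `cur`
    at (cx, cy) and the size x size block of `ref` at (rx, ry)."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def full_search(cur, ref, cx, cy, size, search):
    """Try every displacement in [-search, search] in both directions and
    return (best SAD, (dx, dy)) for the block of `cur` at (cx, cy)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


# Toy frames (assumption): the current frame is the reference frame
# shifted one pixel to the right.
ref = [[(3 * x + 7 * y) % 17 for x in range(8)] for y in range(8)]
cur = [[ref[y][x - 1] if x >= 1 else 0 for x in range(8)] for y in range(8)]

best_sad, mv = full_search(cur, ref, cx=2, cy=2, size=4, search=1)
```

Variable block size motion estimation repeats this search for every sub-block of the macroblock, which is where the extra computation comes from.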

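Systolic Arrays and Motion Estimation: data is shared, so memory bandwidth is low. [Figure: current frame and reference frame with an MB and its MV; a chain of PEs (PE 0, PE 1, PE 2, … PE n) fed from memory through FFs, followed by a comparator and a register file. Inside each PE: pixel and reference inputs (Pix, Ref) into ABS 1 … ABS 4, then CS and Σ.]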

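Systolic Arrays for VBSME. [Figure: PE 0, PE 1, PE 2, … PE n fed from memory through FFs, 16 pixels at a time. Three variants are shown, each with a register file and comparator: a SAD merge tree, a SAD bus network, and a reuse unit with a register file working on primitive blocks. Cited designs: Yap TCAS 2004, Song IEICE 2006, Chen TCAS 2006, Li FPT.]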

Outline
- Proposed Design Approach
  - Array Organization
  - Processing Element Design
  - Scheduling
- Related Work
- Case Study: H.264 VBSME
- Experimental Results
  - VLSI Implementation
  - FPGA Implementation
- Conclusion

Proposed Approach: Basics
- Each PE is augmented with a comparator unit in addition to the reuse unit.
- Each PE computes the SADs of all sub-blocks within the MB against one specific reference MB.
- Each PE runs one clock cycle ahead of its neighbouring PE.
- Different PEs compute SADs of the same MB against different reference MBs.

Proposed Approach
[Figure: reference data R0 … R4 and current-block data B0 … B2, with the following PE schedule.]

        Ti           Ti+1         Ti+2         Ti+3         Ti+4
PE 0    SAD(B0,R0)   SAD(B1,R1)   SAD(B2,R2)   SAD(B3,R3)   SAD(B4,R4)
PE 1                 SAD(B0,R1)   SAD(B1,R2)   SAD(B2,R3)   SAD(B3,R4)
PE 2                              SAD(B0,R2)   SAD(B1,R3)   SAD(B2,R4)

Proposed Approach
[Figure: the same schedule with S denoting a SAD; a MIN operation combines the SADs across the PEs.]

        Ti          Ti+1        Ti+2        Ti+3        Ti+4
PE 0    S(B0,R0)    S(B1,R1)    S(B2,R2)
PE 1                S(B0,R1)    S(B1,R2)    S(B2,R3)
PE 2                            S(B0,R2)    S(B1,R3)    S(B2,R4)
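The skewed schedule and the running minimum can be sketched in a few lines. This is only one reading of the two diagrams above: it assumes PE p handles block k against reference k + p at cycle Ti + k + p, and that each PE compares its own SAD with the minimum handed over by the previous PE one cycle earlier. The function names and the toy cost function are hypothetical.

```python
def skewed_schedule(num_pes, num_blocks):
    """Map (cycle offset from Ti, PE index) -> (block index, reference index).
    PE p lags PE p-1 by one cycle and looks one reference further along."""
    schedule = {}
    for pe in range(num_pes):
        for k in range(num_blocks):
            schedule[(k + pe, pe)] = (k, k + pe)
    return schedule


def chain_min(sad_fn, num_pes, num_blocks):
    """Simulate the comparator chain cycle by cycle.
    Returns {block index: (minimum SAD, reference index)} once every
    block's running minimum has passed through all PEs."""
    running = {}
    for t in range(num_blocks + num_pes - 1):
        for pe in range(num_pes):
            k = t - pe  # block this PE works on in cycle t
            if 0 <= k < num_blocks:
                candidate = (sad_fn(k, k + pe), k + pe)
                # PE 0 starts the chain; later PEs compare against the
                # value the previous PE produced one cycle earlier.
                running[k] = candidate if pe == 0 else min(running[k], candidate)
    return running


def toy_sad(block, reference):
    """Stand-in cost (assumption); a real PE would accumulate pixel differences."""
    return (5 * block + 3 * reference) % 7


schedule = skewed_schedule(num_pes=3, num_blocks=3)
minima = chain_min(toy_sad, num_pes=3, num_blocks=3)
```

With three PEs and three blocks, `schedule` reproduces the staircase in the table above, and each entry of `minima` becomes final one cycle after the previous one.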

Array Organization
[Figure: memory and FFs feed the chain PE 0, PE 1, … PE n; each PE has a reuse unit and a compare stage, and the chain ends in the MIN SAD register file. Other labels: comparator, SAD bus network.]
- MIN SADs move along the chain and are stored in the register file.
- Each PE must compute more than one search region.
- (# of PEs) < (# of search regions)

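PE Design. [Figure: PE datapath. Pixel and reference inputs (Pix, Ref) feed ABS 1 … ABS 4, followed by CS, Σ, and an adder with feedback (FB). The RU holds a register file; the CU takes the CU output(s) of the previous PE and holds the MIN register.]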

PE Design Optimization
- To minimize the size of the RU register file, each PE should compare and transfer computed SADs as soon as possible.
- Parallel comparators are required when multiple SADs are produced in the same cycle.
- The transfer rate depends on B, the number of sub-blocks within the MB, and T, the number of cycles required to compute the MB SADs.
- Generating the B sub-block SADs uniformly over the T cycles reduces the RU register file.
- A regular workflow simplifies the controller.
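The effect of uniform SAD generation on buffering can be illustrated with a small backlog model. This is a sketch under stated assumptions, not the slide's formula: the transfer-rate expression itself did not survive in the transcript, so the model simply assumes that a fixed number of SADs can be compared and forwarded per cycle and counts how many finished SADs must wait. The names `peak_backlog`, `uniform`, and `bursty` are hypothetical; B = 41 and T = 64 are the case-study values given later in the talk.

```python
def peak_backlog(produced_per_cycle, transfer_rate):
    """Largest number of finished SADs that must be held over to the next
    cycle, assuming at most `transfer_rate` SADs leave the PE per cycle."""
    backlog = 0
    peak = 0
    for produced in produced_per_cycle:
        backlog += produced
        backlog -= min(backlog, transfer_rate)  # forward what we can this cycle
        peak = max(peak, backlog)
    return peak


B, T = 41, 64  # sub-blocks per MB and cycles per MB from the case study

# Uniform production: spread B SADs as evenly as possible over T cycles.
uniform = [((c + 1) * B) // T - (c * B) // T for c in range(T)]

# Bursty production: every SAD becomes ready in the final cycle.
bursty = [0] * (T - 1) + [B]

uniform_peak = peak_backlog(uniform, transfer_rate=1)
bursty_peak = peak_backlog(bursty, transfer_rate=1)
```

Under this model the uniform schedule never leaves a SAD waiting, while the bursty one needs room for almost all of them, which is the intuition behind spreading SAD generation evenly.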

SAD Scheduling
- Primitive SAD computations need to be distributed over the T cycles.
- Non-primitive SADs:
  - A SAD is generated as soon as the SADs it is built from are ready.
  - Proper scheduling frees SAD registers for other generated building SADs.
- We propose a zig-zag pattern for reuse, which also helps to distribute SAD computations evenly.

SAD Scheduling [figure]

Related Work
VLSI H.264 VBSME
- Yap [TCAS 2004]: 1-D array with SAD bus network
- Song [IEICE 2006]: 1-D array with SAD bus network
- Chen [TCAS 2006]: 2-D array with SAD merge tree, used for HDTV applications
FPGA H.264 VBSME
- Wei [2003]: 1-D array with SAD bus network
- Lopez [ISCAS 2005]: 1-D array using SRAMs with SAD bus network
- Li [FPT 2006]: bit-serial architecture with SAD merge tree

Case Study: H.264 VBSME
- MB = 16x16 pixels, B = 41 sub-blocks, 4x4 primitive blocks
- 4 PEs; each PE computes 4 pixel SADs in each cycle
- Search range: 16x16 pixels for each pixel
- T = 64 cycles for each MB
- Four identical and regular 16-cycle workflows
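The figures B = 41 and T = 64 can be checked with a short sketch that builds every H.264 partition SAD from the sixteen primitive 4x4 SADs. The function name `merge_sads` and the toy input values are hypothetical, and the merge order shown is just one straightforward choice, not the zig-zag schedule from the talk. Block sizes are written width x height.

```python
def merge_sads(p):
    """Combine a 4x4 grid of primitive 4x4-block SADs (p[row][col]) into
    the SADs of all larger H.264 partitions of a 16x16 macroblock."""
    s8x4 = [[p[r][2 * c] + p[r][2 * c + 1] for c in range(2)] for r in range(4)]
    s4x8 = [[p[2 * r][c] + p[2 * r + 1][c] for c in range(4)] for r in range(2)]
    s8x8 = [[s8x4[2 * r][c] + s8x4[2 * r + 1][c] for c in range(2)] for r in range(2)]
    s16x8 = [s8x8[r][0] + s8x8[r][1] for r in range(2)]
    s8x16 = [s8x8[0][c] + s8x8[1][c] for c in range(2)]
    s16x16 = s16x8[0] + s16x8[1]
    return {
        "4x4": [v for row in p for v in row],
        "8x4": [v for row in s8x4 for v in row],
        "4x8": [v for row in s4x8 for v in row],
        "8x8": [v for row in s8x8 for v in row],
        "16x8": s16x8,
        "8x16": s8x16,
        "16x16": [s16x16],
    }


# Toy primitive SADs 0..15 (assumption), laid out row by row.
primitive = [[r * 4 + c for c in range(4)] for r in range(4)]
sads = merge_sads(primitive)

total_sub_blocks = sum(len(v) for v in sads.values())  # 16+8+8+4+2+2+1
cycles_per_mb = (16 * 16) // 4                         # 256 pixels, 4 per cycle
cycles_per_workflow = cycles_per_mb // 4               # four identical workflows
```

The counts come out to 41 sub-blocks, 64 cycles per MB, and 16 cycles per workflow, matching the numbers on the slide.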

SAD Scheduling [figure]

Experimental Results
- H.264 VBSME modelled in VHDL
- VLSI implementations: Synopsys DC, CMOS libraries
  - 0.18 µm: 12k gates, 285 MHz
  - 0.13 µm: 18k gates, 400 MHz
- FPGA implementations: Altera Quartus, Xilinx ISE
  - Altera APEX and Stratix-II, Xilinx Virtex-II

VLSI Implementation: MB Processing Time (MBPT)
- SR: search range
- T: MB SAD cycles
- N: # of PEs
[Chart: MBPT comparison; annotation: ~20-25% reduction.]

VLSI Implementation: gate count (k gates). [Chart annotation: large area reduction.]

FPGA Implementation: throughput (MB/sec). [Chart annotation: lower throughput than the best designs, but…]

FPGA Implementation. [Chart annotations: …up to a 3/4 area reduction; best efficiency.]

Scalability (Stratix-II). [Chart annotation: almost perfect scalability.]

Conclusion
- We improved scalability by redesigning the organization of the systolic array and the design of its PEs.
  - Very low-cost design, with less area and delay
- We proposed a zig-zag pattern for reusing the primitive SADs.
  - Fewer registers for holding computed SADs
  - Very regular workflow
- This approach can be adopted by existing architectures and can also be applied to future standards with different block sizes.

Thanks!

SAD Scheduling [figure]