Scalable and Low Cost Design Approach for Variable Block Size Motion Estimation
Hadi Afshar, Philip Brisk, Paolo Ienne (EPFL)


1 Scalable and Low Cost Design Approach for Variable Block Size Motion Estimation. Hadi Afshar, Philip Brisk, Paolo Ienne, EPFL. 30 April 2009

2 Fixed Block Size Motion Estimation: few motion vectors, less compression. [Figure: current frame and reference frame, with a macroblock (MB) and its motion vector (MV).]

3 Variable Block Size Motion Estimation: more compression, but more motion vectors and more computation. [Figure: current frame and reference frame, with a macroblock (MB) and its motion vectors (MV).]

4 Systolic Arrays and Motion Estimation. Data is shared, so memory bandwidth is low. [Figure: current frame and reference frame with MB and MV; memory feeding a chain of PEs (PE 0, PE 1, PE 2, ..., PE n) through FFs, with a comparator and regfile; PE detail with Pixel(s) and Ref. inputs, ABS 1 ... ABS 4, Σ, CS.]
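The cost that every PE accumulates is the sum of absolute differences (SAD) between the current block and a candidate reference block. The following is a minimal software sketch of SAD and an exhaustive search, not the hardware described in the talk; the function names `sad` and `full_search` and the list-of-lists frame layout are assumptions made for illustration.

```python
def sad(cur, ref, top, left):
    """SAD between block `cur` and the same-sized window of `ref` at (top, left)."""
    return sum(
        abs(cur[y][x] - ref[top + y][left + x])
        for y in range(len(cur))
        for x in range(len(cur[0]))
    )


def full_search(cur, ref):
    """Try every window position in `ref`; return (min_sad, (dy, dx))."""
    best = None
    for dy in range(len(ref) - len(cur) + 1):
        for dx in range(len(ref[0]) - len(cur[0]) + 1):
            cost = sad(cur, ref, dy, dx)
            # Strict '<' keeps the first position found on ties.
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best
```

In a systolic array the inner `abs` terms are what the ABS units compute in parallel, and neighbouring candidates reuse the same reference pixels, which is where the bandwidth saving comes from.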

5 Systolic Arrays for VBSME. [Figure: memory feeding PE 0, PE 1, PE 2, ..., PE n through FFs; 16-pixel regfile; primitive blocks; SAD merge tree (adders) with regfile and comparator; SAD bus network with reuse unit, regfile and comparator.] Prior designs: Yap TCAS 2004, Song IEICE 2006, Chen TCAS 2006, Li FPT 2006.

6 Outline
- Proposed Design Approach
  - Array Organization
  - Processing Element Design
  - Scheduling
- Related Work
- Case Study: H.264 VBSME
- Experimental Results
  - VLSI Implementation
  - FPGA Implementation
- Conclusion

7 Proposed Approach
Basics:
- Each PE is augmented with a comparator unit in addition to the reuse unit
- Each PE computes the SADs of all sub-blocks within the MB against one specific reference MB
- Each PE is offset by one clock cycle from its neighbouring PE
- Different PEs compute SADs of the same MB against different reference MBs

8 Proposed Approach
[Figure: inputs R0 ... R4 and B0 ... B2; SADs computed per PE and cycle.]

        T_i         T_i+1       T_i+2       T_i+3       T_i+4
PE 0    SAD B0,R0   SAD B1,R1   SAD B2,R2   SAD B3,R3   SAD B4,R4
PE 1                SAD B0,R1   SAD B1,R2   SAD B2,R3   SAD B3,R4
PE 2                            SAD B0,R2   SAD B1,R3   SAD B2,R4
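The staggering on this slide can be written down as an index rule. The sketch below assumes one reading of the diagram: at step t every active PE sees reference input R_t, while PE p, which starts p cycles after PE 0, works on block input B_(t-p). The function name `pe_schedule` is hypothetical and only reproduces the labels shown on the slide.

```python
def pe_schedule(num_pes, num_steps):
    """Return, per time step, the active (pe, block_index, ref_index) triples."""
    steps = []
    for t in range(num_steps):
        active = []
        for pe in range(num_pes):
            b = t - pe  # assumption: PE `pe` lags PE 0 by `pe` cycles
            if b >= 0:
                active.append((pe, b, t))
        steps.append(active)
    return steps


for t, active in enumerate(pe_schedule(3, 5)):
    row = ", ".join(f"PE {pe}: SAD B{b},R{r}" for pe, b, r in active)
    print(f"T_i+{t}: {row}")
```

Under this reading, PE p always pairs B_k with R_(k+p), so each PE evaluates the same current data against a differently displaced reference.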

9 Proposed Approach. [Figure: over cycles T_i ... T_i+4, PE 0 produces S B0,R0, S B1,R1, S B2,R2; PE 1 produces S B0,R1, S B1,R2, S B2,R3; PE 2 produces S B0,R2, S B1,R3, S B2,R4; the results feed a MIN unit.]

10 Array Organization
[Figure: memory, FFs, comparator, SAD bus network and reuse unit; PE 0, PE 1, ..., PE n, each with a reuse unit and a compare unit; Min SAD register file.]
- MIN SADs move along the chain and are stored in the MIN SAD register file
- Each PE must compute more than one search region
- (# of PEs) < (# of search regions)
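A small software analogy for the MIN chain, under stated assumptions: each PE compares its own SAD for a sub-block with the minimum handed over by the previous PE and forwards the smaller one, so the value leaving the last PE is the overall minimum. The name `chain_min`, the dictionary layout and the tie-breaking rule are illustrative choices, not details taken from the slides.

```python
def chain_min(per_pe_sads):
    """per_pe_sads[p][s] is the SAD that PE p computed for sub-block s.

    Returns {s: (min_sad, p)}: the contents of the MIN SAD register file
    after the running minimum has passed through every PE in order.
    """
    regfile = {}
    for pe, sads in enumerate(per_pe_sads):
        for sub_block, value in sads.items():
            best = regfile.get(sub_block)
            # Strict '<' means the earliest PE wins a tie (an assumption).
            if best is None or value < best[0]:
                regfile[sub_block] = (value, pe)
    return regfile
```

Because there are fewer PEs than search regions, a real array would repeat this pass several times and keep updating the same register file.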

11 PE Design. [Figure: ABS 1 ... ABS 4 with Pix and Ref inputs feeding Σ and CS; adder with FB; RU with regfile; CU with MIN Reg, taking the CU output(s) of the previous PE.]

13 PE Design Optimization
To minimize the size of the RU register file:
- Each PE should compare and transfer computed SADs as soon as possible
- Parallel comparators are required when multiple SADs are produced in the same cycle
- Transfer rate is defined in terms of B (# of sub-blocks within the MB) and T (# of cycles required to compute the MB SADs)
- Generating the B sub-block SADs uniformly over the T cycles reduces the RU regfile
- A regular workflow simplifies the controller

14 SAD Scheduling
- Primitive SAD computations need to be distributed over the T cycles
- Non-primitive SADs
  - A SAD is generated as soon as its building SADs are ready
  - Proper scheduling frees SAD registers for other generated building SADs
- We propose a zig-zag pattern for reuse
  - It also helps to distribute SAD computations evenly
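A non-primitive SAD is simply the sum of the primitive SADs it covers, provided they were all computed for the same candidate position. The sketch below assumes a 16x16 MB split into a 4x4 grid of 4x4 primitive blocks and the standard H.264 partition shapes; it sums directly from the primitives and does not model the cycle-by-cycle zig-zag order or register reuse described on the slide. `SHAPES` and `merge_sads` are hypothetical names.

```python
# (height, width) of each partition shape, in units of 4x4 primitive blocks:
# 4x4, 4x8, 8x4, 8x8, 8x16, 16x8 and 16x16 pixels.
SHAPES = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 4), (4, 2), (4, 4)]


def merge_sads(prim):
    """prim: 4x4 grid of primitive SADs for one candidate position.

    Returns {(h, w, row, col): sad} for every sub-block, where (row, col)
    is the top-left primitive block of the sub-block.
    """
    out = {}
    for h, w in SHAPES:
        for r in range(0, 4, h):
            for c in range(0, 4, w):
                out[(h, w, r, c)] = sum(
                    prim[r + i][c + j] for i in range(h) for j in range(w)
                )
    return out
```

A hardware reuse unit would instead build larger SADs from already merged smaller ones, so that intermediate registers can be freed early.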

15 SAD Scheduling

16 Related Work
VLSI H.264 VBSME
- Yap [TCAS 2004]: 1-D array with SAD bus network
- Song [IEICE 2006]: 1-D array with SAD bus network
- Chen [TCAS 2006]: 2-D array with SAD merge tree, used for HDTV applications
FPGA H.264 VBSME
- Wei [2003]: 1-D array with SAD bus network
- Lopez [ISCAS 2005]: 1-D array using SRAMs with SAD bus network
- Li [FPT 2006]: bit-serial architecture with SAD merge tree

17 Case Study: H.264 VBSME
- MB = 16x16 pixels, B = 41 sub-blocks, 4x4 primitive blocks
- 4 PEs; each PE computes 4 pixel SADs in each cycle
- Search range: 16x16 pixels for each pixel
- T = 64 cycles for each MB
- Four identical and regular 16-cycle workflows
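The numbers on this slide can be checked with a few lines of arithmetic. The sketch assumes the seven standard H.264 partition sizes and reads "4 pixel SADs in each cycle" as 4 pixels of the MB consumed per cycle by a PE; `SIZES`, `count_sub_blocks` and `cycles_per_mb` are illustrative names.

```python
MB = 16  # macroblock width and height in pixels

# (width, height) of each H.264 partition size in pixels
SIZES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def count_sub_blocks():
    """Number of sub-blocks of each size that tile one macroblock."""
    return {(w, h): (MB // w) * (MB // h) for w, h in SIZES}


def cycles_per_mb(pixels_per_cycle):
    """Cycles a PE needs to cover all MB pixels for one candidate."""
    return (MB * MB) // pixels_per_cycle


counts = count_sub_blocks()
print(sum(counts.values()))   # B: 1 + 2 + 2 + 4 + 8 + 8 + 16
print(cycles_per_mb(4))       # T for 4 pixels per cycle
print(cycles_per_mb(4) // 4)  # length of each of the four workflows
```

With these assumptions the script reproduces B = 41, T = 64 and the 16-cycle workflow length quoted on the slide.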

18 SAD Scheduling

19 Experimental Results
H.264 VBSME modelled in VHDL
VLSI Implementations
- Synopsys DC
- CMOS libraries: 0.18 µm: 12k gates, 285 MHz; 0.13 µm: 18k gates, 400 MHz
FPGA Implementations
- Altera Quartus, Xilinx ISE
- Altera APEX and Stratix-II, Xilinx Virtex-II

20 VLSI Implementation
MB Processing Time (MBPT)
- SR: Search Range
- T: MB SAD cycles
- N: # of PEs
[Chart annotation: ~20-25% reduction]

21 VLSI Implementation. Gate count (k gates). [Chart annotation: large area reduction]

22 FPGA Implementation. Throughput (MB/sec). [Chart annotation: lower throughput than best designs, but…]

23 FPGA Implementation. [Chart annotations: …up to 3/4 area reduction; best efficiency]

24 Scalability (Stratix-II). [Chart annotation: almost perfect scalability]

25 Conclusion
- We improved scalability by redesigning the organization of the systolic array and the design of the PEs in the array
  - Very low-cost design, less area and delay
- We proposed a zig-zag pattern for reusing the primitive SADs
  - Fewer registers for maintaining computed SADs
  - Very regular workflow
- This approach can be exploited by existing architectures and can also be applied to future standards with different block sizes

26 Thanks!

27 SAD Scheduling

