1
High Speed Systolic Array Structure for Variable Block Size Motion Estimation Vinod Reddy 05/04/2009
2
Nested Loop Structure for Fixed Size ME
For m = 0 to 2p
  For n = 0 to 2p
    SAD(m,n) = 0
    For i = 0 to N-1
      For j = 0 to N-1
        SAD(m,n) = SAD(m,n) + |x(i,j) - y(i+m-p, j+n-p)|
      End j
    End i
  End n
End m
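The nested loop above can be sketched in Python as follows. This is a minimal illustration, not the presenter's code: the function name `full_search_sad` and the convention that the search window is stored as an array whose index `[i + m][j + n]` stands for y(i+m-p, j+n-p) are assumptions made here.

```python
def full_search_sad(x, y, N, p):
    """Return sad[m][n] for m, n in 0..2p (full search, fixed block size).

    x: N x N current block (list of lists).
    y: (N + 2p) x (N + 2p) search window. Assumption: array index
       [i + m][j + n] corresponds to y(i+m-p, j+n-p) in the slide.
    """
    size = 2 * p + 1
    sad = [[0] * size for _ in range(size)]
    for m in range(size):
        for n in range(size):
            acc = 0
            for i in range(N):
                for j in range(N):
                    acc += abs(x[i][j] - y[i + m][j + n])
            sad[m][n] = acc
    return sad


# Hypothetical example: a 2x2 block cut from a 4x4 window at offset (1, 1).
window = [[10 * r + c for c in range(4)] for r in range(4)]
block = [[11, 12], [21, 22]]
sad = full_search_sad(block, window, N=2, p=1)
# Smallest SAD and its (m, n) position.
best = min((sad[m][n], m, n) for m in range(3) for n in range(3))
```

The position with the smallest SAD gives the motion vector candidate; here the block matches the window exactly at (m, n) = (1, 1).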
3
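For a 4x4 block size and p = 2:

For m = 0 to 3
  For n = 0 to 3
    SAD(m,n) = 0
    For i = 0 to 3
      For j = 0 to 3
        SAD(m,n) = SAD(m,n) + |x(i,j) - y(i+m, j+n)|
      End j
    End i
  End n
End m

In matrix form, the inner two loops can be represented as the sum of the 16 element-wise absolute differences of two 4x4 matrices:

$$
\mathrm{SAD}(0,0) = \sum \left|
\begin{bmatrix}
x_{00} & x_{01} & x_{02} & x_{03} \\
x_{10} & x_{11} & x_{12} & x_{13} \\
x_{20} & x_{21} & x_{22} & x_{23} \\
x_{30} & x_{31} & x_{32} & x_{33}
\end{bmatrix}
-
\begin{bmatrix}
y_{00} & y_{01} & y_{02} & y_{03} \\
y_{10} & y_{11} & y_{12} & y_{13} \\
y_{20} & y_{21} & y_{22} & y_{23} \\
y_{30} & y_{31} & y_{32} & y_{33}
\end{bmatrix}
\right|
$$

Figure: the corresponding DG, drawn over the axes i and j. Node (i, j) takes x(i,j) and y(i,j) as inputs, and the graph produces Sad(0,0).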
4
Unrolling and interchanging the third loop for max reuse
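For n = 0 to 3
  For m = 0 to 3
    SAD(m,n) = 0
    For i = 0 to 3
      For j = 0 to 3
        SAD(m,n) = SAD(m,n) + |x(i,j) - y(i+m, j+n)|
      End j
    End i
  End m
End n

Unrolling the iterations of the third loop (over m) gives, in matrix form, with each sum taken over the 16 element-wise absolute differences:

$$
\mathrm{SAD}(0,0) = \sum \left|
\begin{bmatrix}
x_{00} & x_{01} & x_{02} & x_{03} \\
x_{10} & x_{11} & x_{12} & x_{13} \\
x_{20} & x_{21} & x_{22} & x_{23} \\
x_{30} & x_{31} & x_{32} & x_{33}
\end{bmatrix}
-
\begin{bmatrix}
y_{00} & y_{01} & y_{02} & y_{03} \\
y_{10} & y_{11} & y_{12} & y_{13} \\
y_{20} & y_{21} & y_{22} & y_{23} \\
y_{30} & y_{31} & y_{32} & y_{33}
\end{bmatrix}
\right|
$$

$$
\mathrm{SAD}(1,0) = \sum \left|
\begin{bmatrix}
x_{00} & x_{01} & x_{02} & x_{03} \\
x_{10} & x_{11} & x_{12} & x_{13} \\
x_{20} & x_{21} & x_{22} & x_{23} \\
x_{30} & x_{31} & x_{32} & x_{33}
\end{bmatrix}
-
\begin{bmatrix}
y_{10} & y_{11} & y_{12} & y_{13} \\
y_{20} & y_{21} & y_{22} & y_{23} \\
y_{30} & y_{31} & y_{32} & y_{33} \\
y_{40} & y_{41} & y_{42} & y_{43}
\end{bmatrix}
\right|
$$

$$
\mathrm{SAD}(2,0) = \sum \left|
\begin{bmatrix}
x_{00} & x_{01} & x_{02} & x_{03} \\
x_{10} & x_{11} & x_{12} & x_{13} \\
x_{20} & x_{21} & x_{22} & x_{23} \\
x_{30} & x_{31} & x_{32} & x_{33}
\end{bmatrix}
-
\begin{bmatrix}
y_{20} & y_{21} & y_{22} & y_{23} \\
y_{30} & y_{31} & y_{32} & y_{33} \\
y_{40} & y_{41} & y_{42} & y_{43} \\
y_{50} & y_{51} & y_{52} & y_{53}
\end{bmatrix}
\right|
$$

Figure: the corresponding DG, drawn over the axes m, i and j. The 4x4 planes for Sad(0,0) and Sad(1,0) are stacked along m; both use the same x(i,j) block, while the y rows shift by one row per step in m.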
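The reuse that the interchange exposes can be checked with a short Python sketch. It computes SAD(m,n) for m = 0 to 3 at a fixed n and records which rows of y are touched; the helper name `sads_for_column` and the test data are hypothetical and chosen only for illustration.

```python
def sads_for_column(x, y, n, N=4, positions=4):
    """Compute SAD(m, n) for m = 0..positions-1 at a fixed n.

    Also returns the sorted list of distinct y rows that were read, to
    show how much the row sets of neighbouring m values overlap.
    """
    rows_read = set()
    sads = []
    for m in range(positions):
        acc = 0
        for i in range(N):
            rows_read.add(i + m)          # row of y used by this (m, i)
            for j in range(N):
                acc += abs(x[i][j] - y[i + m][j + n])
        sads.append(acc)
    return sads, sorted(rows_read)


# Hypothetical data: a 7x7 reference area and a block cut at (m, n) = (2, 1).
y = [[8 * r + c for c in range(7)] for r in range(7)]
x = [[y[i + 2][j + 1] for j in range(4)] for i in range(4)]
sads, rows = sads_for_column(x, y, n=1)
```

Four positions of a 4-row block use 16 row reads in total, but only 7 distinct rows (0 to 6), so each row of y is shared by up to four SAD computations.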
5
Grouping 4 pixels into a Single large pixel
Previous approaches used an overly pessimistic, pixel-level granularity [1,2]. We now work with a large pixel: a set of four pixels. The same DG shown before can then be conveniently represented in 2D, where

X00 = {x00, x01, x02, x03}
Y00 = {y00, y01, y02, y03}, ...
Y40 = {y40, y41, y42, y43}, ...

Figure: the 2D DG in large-pixel form. The columns correspond to Sad(0,0), Sad(1,0), Sad(2,0) and Sad(3,0), the rows to the large pixels X00, X10, X20 and X30, and the inputs are the large pixels Y00 through Y60.
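In large-pixel terms, each DG node handles one pair of large pixels, and one SAD is the sum of four such node results. A rough Python sketch, assuming each large pixel is represented as a 4-tuple and using the hypothetical helper names `large_pixel_sad` and `sad_column`:

```python
def large_pixel_sad(X_lp, Y_lp):
    """Sum of |x - y| over the four pixels of one large-pixel pair."""
    return sum(abs(a - b) for a, b in zip(X_lp, Y_lp))


def sad_column(X, Y, positions=4):
    """X: large pixels X00..X30 (4 rows); Y: large pixels Y00..Y60 (7 rows).

    Returns [Sad(0,0), Sad(1,0), Sad(2,0), Sad(3,0)], where Sad(m,0)
    pairs X(i,0) with Y(i+m,0), as in the loop nest for j grouped by four.
    """
    return [
        sum(large_pixel_sad(X[i], Y[i + m]) for i in range(len(X)))
        for m in range(positions)
    ]


# Hypothetical data: Y rows are constant tuples; X is cut from Y at m = 1.
Y = [(r, r, r, r) for r in range(7)]
X = Y[1:5]
column = sad_column(X, Y)
```

Each SAD now takes 4 large-pixel operations instead of 16 pixel operations, which is what lets the DG be drawn in two dimensions.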
6
Systolic Array for 4x4 Block Size
Choosing:
Projection direction d: [1 0]
Processor space vector P^T: [0 1]
Scheduling vector S^T: [1 1]

The edge mapping of the DG onto the systolic array is given by a table with the columns e^T, P^T e and S^T e, for the edges X (1,0), Y (1,1) and Res (0,1).

Figure: the 2D DG (axes i and j; columns Sad(0,0) to Sad(3,0), rows X00 to X30, inputs Y00 to Y60) and the resulting linear systolic array of four PEs holding X00, X10, X20 and X30, connected through D and 2D delay elements. The Y stream (Y00, Y10, ..., Y60) is fed in, and Sad(0,0), Sad(1,0), Sad(2,0), Sad(3,0), ... emerge in sequence.

+ A Sad4x4 is generated every cycle for each reference block.
+ The critical path is determined by the PE processing time.
+ X inputs are stored internally in the PE.
+ Each Y vector is applied once; the systolic array reuses it efficiently and never rereads the same input.
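The table entries can be recomputed mechanically from the vectors as they appear in the slide text. The sketch below does only that arithmetic; it is a recomputation under the assumption that d, P^T, S^T and the three edge vectors are transcribed correctly, not a copy of the presenter's table. The names `dot` and `map_edges` are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Vectors as given in the slide text (assumed to be transcribed correctly).
d = (1, 0)   # projection direction
P = (0, 1)   # processor space vector P^T
S = (1, 1)   # scheduling vector S^T
edges = {"X": (1, 0), "Y": (1, 1), "Res": (0, 1)}


def map_edges(edges, P, S):
    """Map each DG edge e to (P^T e, S^T e).

    P^T e is the displacement between PEs, S^T e the number of delays.
    """
    return {name: (dot(P, e), dot(S, e)) for name, e in edges.items()}


mapping = map_edges(edges, P, S)

# Standard consistency checks for a systolic mapping:
# nodes along d land on one PE, and are scheduled at different times.
projection_ok = dot(P, d) == 0
schedule_ok = dot(S, d) != 0
```

Under these vectors the X edge has P^T e = 0, meaning X stays inside one PE, which agrees with the statement that X inputs are stored internally in the PE.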
7
Variable Block Size for 8x8
Let's now consider variable block size motion estimation. The 8x8 blocks X and Y can each be decomposed into four 4x4 blocks.

Figure: in large-pixel notation, a 4x4 block consists of the rows X00, X10, X20, X30 (and Y00, Y10, Y20, Y30), while an 8x8 block consists of the eight rows Xi0 Xi1 for i = 0 to 7 (and Yi0 Yi1 likewise).
8
Extending DG for Variable Block Size 8x8
Figure: the DG extended to the 8x8 block size. Four 4x4 DGs produce Sad(m,0), Sad(m,1), Sad(m,2) and Sad(m,3) for m = 0 to 3, using the large pixels X00 to X71 and the Y inputs Y00 to Y10,0 and Y01 to Y10,1. Their outputs are combined into Sad 4x8_00, Sad 4x8_01, Sad 8x4_00, Sad 8x4_01 and Sad 8x8_00.
9
Mapping DG of Variable ME to Systolic Array
Figure: four 4-PE systolic arrays with D and 2D delay elements, fed by the Y streams (... Y20 Y10 Y00), (... Y21 Y11 Y01), (... Y22 Y12 Y02) and (... Y23 Y13 Y03). They produce Sad 4x4_00, Sad 4x4_01, Sad 4x4_10 and Sad 4x4_11, which are combined into Sad 4x8_00, Sad 4x8_01, Sad 8x4_00, Sad 8x4_01 and Sad 8x8_00.

+ Generates four Sad4x4, two Sad8x4, two Sad4x8 and one Sad8x8 in a single cycle.
+ The four large Y pixels {Y00, Y01, Y02, Y03} are read only once and reused internally.
+ Four systolic arrays operate in parallel for variable block size ME of size 8x8.
+ Similarly, this can be extended to 16x16 by operating 16 systolic arrays in parallel.
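The way the larger-block SADs follow from the four 4x4 SADs can be sketched in Python. This is an illustration for a single candidate position with y already aligned to x. The names `sad_block` and `variable_block_sads` are hypothetical, and the result keys use neutral names ("top", "bottom", "left", "right") because the transcript does not show which pair of 4x4 blocks each of the labels 4x8_00, 4x8_01, 8x4_00 and 8x4_01 refers to.

```python
def sad_block(x, y, r0, c0, h, w):
    """SAD over the h x w sub-block whose top-left corner is (r0, c0)."""
    return sum(
        abs(x[r][c] - y[r][c])
        for r in range(r0, r0 + h)
        for c in range(c0, c0 + w)
    )


def variable_block_sads(x, y):
    """x, y: 8x8 blocks, y being one aligned candidate reference block.

    The four 4x4 SADs are computed once; every larger SAD is a sum of them.
    """
    s = {
        (bi, bj): sad_block(x, y, 4 * bi, 4 * bj, 4, 4)
        for bi in (0, 1)
        for bj in (0, 1)
    }
    top = s[0, 0] + s[0, 1]        # upper two 4x4 blocks
    bottom = s[1, 0] + s[1, 1]     # lower two 4x4 blocks
    left = s[0, 0] + s[1, 0]       # left two 4x4 blocks
    right = s[0, 1] + s[1, 1]      # right two 4x4 blocks
    return {
        "4x4": s,
        "top": top,
        "bottom": bottom,
        "left": left,
        "right": right,
        "8x8": top + bottom,
    }


# Hypothetical data: x is all zeros, y is constant within each quadrant.
x = [[0] * 8 for _ in range(8)]
y = [[1 + 2 * (r // 4) + (c // 4) for c in range(8)] for r in range(8)]
result = variable_block_sads(x, y)
```

Only additions are needed beyond the four 4x4 SADs, which is why all nine SAD values can be produced from four parallel arrays.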
10
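Large PE Design

$$
|x - y| =
\begin{cases}
x + y' + 1, & x > y \\
(x + y')', & x \le y
\end{cases}
\qquad [3]
$$

Figure: the large PE. Its inputs are the large pixels X30 = {x30, x31, x32, x33} and Y30 = {y30, y31, y32, y33} and the incoming partial SAD (PSAD_I). Absolute-difference units (ABS0, ..., with carry-in C0) and registers (Reg) produce the outgoing partial SAD (PSAD).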
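The absolute-difference identity can be checked in Python by reading y' as the bitwise (one's) complement of y on a fixed word width. The 8-bit pixel width and the function names `abs_diff` and `large_pe` are assumptions made for this sketch; the PE model simply adds four absolute differences to the incoming partial SAD, as the figure labels suggest.

```python
def abs_diff(x, y, bits=8):
    """|x - y| using only addition and bitwise complement.

    Assumption: x and y are unsigned values of width `bits`.
    """
    mask = (1 << bits) - 1
    y_inv = ~y & mask                   # y' : one's complement of y
    if x > y:
        return (x + y_inv + 1) & mask   # x + y' + 1, carry-out dropped
    return ~(x + y_inv) & mask          # (x + y')'


def large_pe(X_lp, Y_lp, psad_in=0, bits=8):
    """Partial SAD out = partial SAD in + four absolute differences."""
    return psad_in + sum(abs_diff(a, b, bits) for a, b in zip(X_lp, Y_lp))


# Hypothetical example: one large-pixel pair with an incoming partial SAD of 10.
psad_out = large_pe((1, 2, 3, 4), (4, 3, 2, 1), psad_in=10)
```

Both branches share the sum x + y', so a single adder per pixel pair suffices, with the +1 or the final complement selected by the comparison result.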
11
Results
Synthesis results using Synopsys Design Compiler. Target library: LSI 90 nm.

Design   Clock freq   Area (um2)
PE       333 MHz
PE       500 MHz