A Flexible Parallel Architecture Adapted to Block-Matching Motion-Estimation Algorithms Santanu Dutta, and Wayne Wolf IEEE Trans. On CSVT, vol. 6, NO.


1 A Flexible Parallel Architecture Adapted to Block-Matching Motion-Estimation Algorithms. Santanu Dutta and Wayne Wolf, IEEE Trans. on CSVT, vol. 6, no. 1, Feb. 1996

2 Introduction
- VLSI design phases
- Generic processor vs. ASIC
- Programmable architectures
- Architecture design
  - PE architecture: parallel architecture, memory bandwidth
  - Data-flow design: pipeline flow
  - Control circuit: H/W or programmable
[Figure labels: specification, behavior, register-transfer, logic, circuit, layout]
[Figure labels: controller unit, PE array, architecture, memory, data]


4 Architecture of PE

5 SAD PE element


7 A data-flow design for a full-search block-matching motion estimator
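As a point of reference for the full-search data flow, the computation it parallelizes can be sketched sequentially. This is a minimal sketch, not the architecture described in the slides: the names `sad` and `full_search`, the frame layout (lists of rows), and the search range `p` are assumptions made for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n current block at
    (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += abs(cur[by + i][bx + j] - ref[by + dy + i][bx + dx + j])
    return total


def full_search(cur, ref, bx, by, n, p):
    """Evaluate every shift in [-p, p] x [-p, p] that keeps the reference
    block inside the frame; return (best_sad, (dx, dy))."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            inside = (0 <= by + dy and by + dy + n <= height and
                      0 <= bx + dx and bx + dx + n <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


# Hypothetical 8 x 8 test frames: the current frame is the reference
# frame displaced by (dx, dy) = (2, 1).
ref = [[10 * y + x for x in range(8)] for y in range(8)]
cur = [[ref[y + 1][x + 2] if y + 1 < 8 and x + 2 < 8 else 0
        for x in range(8)] for y in range(8)]

print(full_search(cur, ref, 2, 2, 4, 2))  # (0, (2, 1))
```

In a parallel implementation each candidate shift would typically be handled by its own PE, so the two outer loops of `full_search` are the part that is spread across the PE array.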

8 The basic ideas
- A general-purpose interconnect network whose topology supports arbitrary paths from ME's to PE's.
- A memory-partitioning scheme that allows the required memory accesses.
- Programmable interconnect and PE's controlled by a stored-program controller.

9 An abstract architectural model for the proposed motion-estimator

10 Interconnection Networks
- Multistage network: Benes, Crossbar, Omega, etc.
- A simple combination of multiplexers or a direct connection between the memory and the processing elements.
- Each frame memory can be implemented as either an interleaved set of multiple banks or a single block of dual-port RAM.

11 Data-flow design for TSS
- Eight processors will be needed for each step.
- Each step of the TSS takes 256 cycles.
- The size and cost of a memory increase considerably with the number of ports.
- Computer architects and circuit designers usually restrict the number of ports to two or three.
- Using a 9-port memory to implement the TSS is highly impractical.

12 Nine shifts tested in step 1 of a three-step search
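The nine shifts and the step-by-step refinement can be sketched as follows. This is only an illustration: the step sizes (4, 2, 1) are the ones commonly used for a three-step search and are assumed here, and `candidate_shifts`, `three_step_search`, and the toy `cost` function are hypothetical names. In a real estimator `cost` would be the SAD of the block at that shift.

```python
def candidate_shifts(center, step):
    """The 3 x 3 grid of shifts around `center`, spaced `step` pixels apart."""
    cx, cy = center
    return [(cx + sx * step, cy + sy * step)
            for sy in (-1, 0, 1) for sx in (-1, 0, 1)]


def three_step_search(cost, steps=(4, 2, 1)):
    """Re-centre on the cheapest of the nine shifts, then shrink the step.
    `cost` maps a shift (dx, dy) to a matching cost such as a SAD."""
    center = (0, 0)
    for step in steps:
        # In steps 2 and 3 the centre shift was already evaluated in the
        # previous step, so only eight of the nine candidates are new.
        center = min(candidate_shifts(center, step), key=cost)
    return center


# Toy cost with its minimum at shift (5, -3); assumed for illustration only.
def cost(shift):
    return abs(shift[0] - 5) + abs(shift[1] + 3)


print(candidate_shifts((0, 0), 4))
print(three_step_search(cost))  # (5, -3)

# With one pixel pair per cycle, a 16 x 16 block costs 256 cycles per step,
# which is where the 768-cycle budget for three steps comes from.
print(16 * 16, 3 * 16 * 16)  # 256 768
```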

13 Data-flow for step 1 of a three-step search procedure

14 Two solutions with different memory partitioning schemes  Broadcasting the Previous-Frame Data  Broadcasting the Current-Frame Data

15 Broadcasting the Previous-Frame Data
- b(4,12) is required by PE 8 in cycle 0, by PE 5 in cycle 8, by PE 1 in cycle 4, and by other PE's in other cycles.
- Solve the memory-bandwidth problem by aligning the b(.) data carefully.
- At most two different b(.) values are needed in a cycle.
- Problems
  - The TSS could not be completed in 768 cycles.
  - The a(.) data are now misaligned and therefore cause memory-access conflicts.

16 Revised data-flow for step 1 of a three-step search procedure (1)

17 Revised data-flow for step 1 of a three-step search procedure (2)

18 Broadcasting the Previous-Frame Data
- 16 smaller memory banks
- A multistage, 16-port interconnection network
- Supplying appropriate memory bandwidth is critical to maintaining the throughput of a BM architecture.

19 Two different conflicts
- Memory conflicts: arise when two different a(.) values that reside in the same memory bank are needed in the same cycle.
- Path conflicts: arise in an interconnection network when one path (a connection from a source to a destination through switches) is blocked by another existing path.
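The memory-conflict definition can be checked mechanically. The sketch below covers memory conflicts only (path conflicts depend on the network topology and are not modelled). The request format, the function name `memory_conflicts`, and the 4 x 4 interleaved `bank_of` mapping are assumptions for illustration, not the partitioning used in the paper.

```python
from collections import defaultdict


def memory_conflicts(requests, bank_of):
    """requests: iterable of (cycle, pe, address).
    Returns {cycle: {bank: [addresses]}} for every bank that is asked for
    more than one distinct address in the same cycle. Several PEs reading
    the same address count as one access (a broadcast), not a conflict."""
    per_cycle = defaultdict(lambda: defaultdict(set))
    for cycle, _pe, addr in requests:
        per_cycle[cycle][bank_of(addr)].add(addr)

    conflicts = {}
    for cycle, banks in per_cycle.items():
        clashing = {bank: sorted(addrs)
                    for bank, addrs in banks.items() if len(addrs) > 1}
        if clashing:
            conflicts[cycle] = clashing
    return conflicts


# Hypothetical mapping of pixel (row, col) onto 16 banks.
def bank_of(addr):
    row, col = addr
    return (row % 4) * 4 + (col % 4)


requests = [
    (0, 0, (0, 0)),  # cycle 0, PE 0 reads pixel (0, 0) -> bank 0
    (0, 1, (4, 0)),  # cycle 0, PE 1 reads pixel (4, 0) -> bank 0, conflict
    (0, 2, (0, 0)),  # same pixel as PE 0: shared, no extra access
    (1, 0, (0, 1)),  # cycle 1, bank 1
    (1, 1, (1, 1)),  # cycle 1, bank 5
]

print(memory_conflicts(requests, bank_of))  # {0: {0: [(0, 0), (4, 0)]}}
```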

20 Derivation of a conflict-free schedule
- A memory-partitioning scheme and a processor-assignment scheme are first chosen through simulation of different memory-partitioning and processor-assignment schemes.
- The number of conflicts is not prohibitively large.
- Cycles that have no conflicts are left unchanged, and those that have conflicts are recursively broken into sub-cycles.
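One way to picture the recursive splitting is the greedy sketch below. The slides do not say how sub-cycles are formed, so the greedy packing, the name `split_cycle`, the request format, and the `bank_of` mapping are all assumptions, and only memory conflicts (not path conflicts) are considered.

```python
def split_cycle(requests, bank_of):
    """requests: list of (pe, address) issued in one cycle.
    Returns a list of sub-cycles, each a list of (pe, address) in which
    every bank serves at most one distinct address. A conflict-free cycle
    comes back unchanged as a single sub-cycle."""
    if not requests:
        return []
    served = {}          # bank -> the address it delivers in this sub-cycle
    now, later = [], []
    for pe, addr in requests:
        bank = bank_of(addr)
        if served.setdefault(bank, addr) == addr:
            now.append((pe, addr))      # bank free, or same address shared
        else:
            later.append((pe, addr))    # bank busy with another address
    # The first request is always served, so the recursion terminates.
    return [now] + split_cycle(later, bank_of)


# Hypothetical mapping of pixel (row, col) onto 16 banks.
def bank_of(addr):
    row, col = addr
    return (row % 4) * 4 + (col % 4)


cycle = [(0, (0, 0)), (1, (4, 0)), (2, (0, 0)), (3, (1, 1))]
for sub in split_cycle(cycle, bank_of):
    print(sub)
# [(0, (0, 0)), (2, (0, 0)), (3, (1, 1))]
# [(1, (4, 0))]
```

Each extra sub-cycle costs one additional clock, so the total length of the schedule is the number of original cycles plus the number of sub-cycles added by splitting.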

21 Motion estimator architecture: broadcasting previous-frame data

22 Broadcasting the Current-Frame Data
- Implements the original TSS data-flow.
- a(.) is broadcast.
- b(.) is partitioned into 16 memory banks.

23 Motion estimator architecture: broadcasting current-frame data

24 Performance of the motion estimator
- The simulator takes as input:
  - A data-flow description of a BMA, specifying the number of PE's and the ideal flow of the pixel data.
  - A memory configuration, specifying the number of ME's and the number of memory ports.
  - A network characterization, specifying the topology of the interconnection network between the PE's and the ME's.
  - The pipelining information, specifying the number of pipeline stages in each PE and the network.
- The simulator determines the network-path and memory-access conflicts.

25 Interconnection networks
- Completely connected network: $N^2$ crosspoint switches are needed in a single-stage crossbar.
- N-port (N in, N out) multistage network: it may not be possible to free all path conflicts.
- Generalized Cube and Omega: N-port network, $\log_2 N$ stages with $N/2$ switches in each stage.
- Benes: $2\log_2 N - 1$ switch stages.
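To get a feel for these counts, the formulas can be evaluated for the 16-port case used earlier in the slides. This is a small sketch with hypothetical function names. The figure of N/2 switches per Benes stage is the standard construction from 2 x 2 switches and is an addition here, since the slide gives only the stage count.

```python
def log2_exact(n):
    """log2 of n, for n a power of two (n >= 2)."""
    if n < 2 or n & (n - 1):
        raise ValueError("port count must be a power of two >= 2")
    return n.bit_length() - 1


def crossbar_crosspoints(n):
    """Single-stage crossbar: N^2 crosspoint switches."""
    return n * n


def omega_switches(n):
    """Omega / Generalized Cube: log2(N) stages of N/2 two-by-two switches."""
    return log2_exact(n) * (n // 2)


def benes_stages(n):
    """Benes: 2*log2(N) - 1 switch stages."""
    return 2 * log2_exact(n) - 1


def benes_switches(n):
    """Assumes N/2 two-by-two switches per stage (standard construction)."""
    return benes_stages(n) * (n // 2)


n = 16
print(crossbar_crosspoints(n))  # 256
print(omega_switches(n))        # 32
print(benes_stages(n))          # 7
print(benes_switches(n))        # 56
```

A crosspoint and a 2 x 2 switch are not the same unit of hardware, so these numbers indicate growth rates rather than directly comparable costs.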








33 Memory-partitioning scheme without data duplication

34 Data duplicated memory-partitioning

35 Simulation results for different networking and memory-partitioning schemes

36 Simulation results for different pixel distributions

37 Data-flow for step 1 of the conjugate-direction search

38 Data-flow for step 2 of the conjugate-direction search

39 Conclusions  An engine that can be adapted to multiple motion-estimation algorithms.
