A Flexible Parallel Architecture Adapted to Block-Matching Motion-Estimation Algorithms
Santanu Dutta and Wayne Wolf
IEEE Trans. on Circuits and Systems for Video Technology (CSVT), vol. 6, no. 1, Feb. 1996
- Introduction: VLSI design phases; generic processor vs. ASIC; programmable architectures
- Architecture design: PE architecture; parallel architecture; memory bandwidth
- Data-flow design: pipeline flow; control circuit (H/W or programmable)
[Diagram labels: VLSI design phases (specification, behavior, register transfer, logic, circuit, layout); architecture block diagram (controller unit, PE array, memory, data)]
Architecture of PE
SAD PE element
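As a rough behavioral reference (not the paper's actual PE datapath), a SAD processing element can be modeled as an absolute-difference unit feeding an accumulator, consuming one current/previous pixel pair per cycle; a minimal Python sketch:

```python
class SADPE:
    """Behavioral sketch of a SAD processing element: an absolute-difference
    unit feeding an accumulator, one pixel pair per cycle (assumed model,
    not the paper's RTL)."""

    def __init__(self):
        self.acc = 0

    def cycle(self, a, b):
        # a: current-frame pixel a(.), b: previous-frame pixel b(.)
        self.acc += abs(a - b)

    def result(self):
        # After block_size**2 cycles this is the SAD for one candidate shift.
        return self.acc
```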
A data-flow design for a full-search block-matching motion estimator
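For reference, the computation that the full-search data-flow maps onto hardware; a minimal software sketch, assuming 16x16 blocks and a +/-7 search range (typical values, not taken from the figure):

```python
import numpy as np

def sad(c, r):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return int(np.abs(c.astype(np.int32) - r.astype(np.int32)).sum())

def full_search(cur, prev, bx, by, block=16, rng=7):
    """Exhaustively test every shift in [-rng, rng]^2 and keep the best SAD."""
    c = cur[by:by + block, bx:bx + block]
    best, best_sad = (0, 0), None
    for dy in range(-rng, rng + 1):
        for dx in range(-rng, rng + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= prev.shape[0] - block and 0 <= x <= prev.shape[1] - block:
                s = sad(c, prev[y:y + block, x:x + block])
                if best_sad is None or s < best_sad:
                    best_sad, best = s, (dx, dy)
    return best, best_sad
```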
The basic ideas
- A general-purpose interconnection network whose topology supports arbitrary paths from the ME's to the PE's.
- A memory-partitioning scheme that allows the required memory accesses.
- Programmable interconnect and PE's controlled by a stored-program controller.
An abstract architectural model for the proposed motion-estimator
Interconnection Networks
- Multistage networks: Benes, crossbar, Omega, etc.
- A simple combination of multiplexers, or a direct connection between the memory and the processing elements.
- Each frame memory can be implemented as either an interleaved set of multiple banks or a single block of dual-port RAM.
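A minimal sketch of the simplest option above, where each PE selects one memory-bank output through a multiplexer driven by controller-generated select signals (the function name and widths are illustrative assumptions):

```python
def mux_interconnect(mem_outputs, selects):
    """One cycle of a multiplexer-based interconnect: PE i receives the output
    of memory bank selects[i]. mem_outputs: the bank outputs this cycle;
    selects: per-PE bank indices driven by the controller."""
    return [mem_outputs[s] for s in selects]

# Example: 4 banks feeding 4 PEs, with PE0 and PE1 both reading bank 2 (a broadcast).
pe_inputs = mux_interconnect([10, 20, 30, 40], [2, 2, 0, 3])
assert pe_inputs == [30, 30, 10, 40]
```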
Data-flow design for TSS
- Nine processors, one per candidate shift, are needed for each step.
- Each step of the TSS takes 256 cycles.
- The size and cost of a memory increase considerably with the number of ports, so computer architects and circuit designers usually restrict the # of ports to two or three.
- Using a 9-port memory to implement the TSS is therefore highly impractical.
(A software sketch of the TSS follows the figure slides below.)
Nine shifts tested in step 1 of a three-step search
Data-flow for step 1 of a three-step search procedure
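As referenced above, a minimal software sketch of the three-step search that the step-1 data-flow maps to hardware (assuming 16x16 blocks and an initial step size of 4; the hardware evaluates the nine candidate shifts in parallel PE's rather than in a loop):

```python
import numpy as np

def sad(c, r):
    return int(np.abs(c.astype(np.int32) - r.astype(np.int32)).sum())

def three_step_search(cur, prev, bx, by, block=16, step=4):
    """Three-step search: test the nine shifts around the current centre,
    recentre on the winner, halve the step, and repeat (3 steps for step=4)."""
    c = cur[by:by + block, bx:bx + block]
    cx, cy = bx, by
    best_sad = sad(c, prev[cy:cy + block, cx:cx + block])
    while step >= 1:
        best = (cx, cy)
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                y, x = cy + dy, cx + dx
                if 0 <= y <= prev.shape[0] - block and 0 <= x <= prev.shape[1] - block:
                    s = sad(c, prev[y:y + block, x:x + block])
                    if s < best_sad:
                        best_sad, best = s, (x, y)
        cx, cy = best
        step //= 2
    return (cx - bx, cy - by), best_sad
```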
Two solutions with different memory-partitioning schemes
- Broadcasting the previous-frame data
- Broadcasting the current-frame data
Broadcasting the Previous-Frame Data
- b(4,12) is required by PE 8 in cycle 0, by PE 5 in cycle 8, by PE 1 in cycle 4, and by other PE's in other cycles.
- The memory-bandwidth problem is solved by aligning the b(.) data carefully, so that at most two different b(.) values are needed in any cycle.
- Problems:
  - The TSS can no longer be completed in 768 cycles.
  - The a(.) data are now misaligned and therefore cause memory-access conflicts.
Revised data-flow for step 1 of a three-step search procedure (1)
Revised data-flow for step 1 of a three-step search procedure (2)
Broadcasting the Previous-Frame Data
- The a(.) data are partitioned into 16 smaller memory banks.
- A multistage, 16-port interconnection network connects the banks to the PE's.
- Supplying appropriate memory bandwidth is critical to maintaining the throughput of a BM architecture.
Two different conflicts
- Memory conflicts: arise when two different a(.) values that reside in the same memory bank are needed in the same cycle.
- Path conflicts: arise in an interconnection network when one path (a connection from a source to a destination through the switches) is blocked by another existing path.
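A small sketch of how memory conflicts could be counted for a candidate schedule: collect the a(.) addresses requested in each cycle, map them to banks, and flag cycles in which one bank is asked for two different values (the bank-mapping function is an input; the paper chooses the partitioning by simulation):

```python
from collections import defaultdict

def memory_conflicts(reads, bank_of):
    """reads: list of (cycle, pe, address) read events for the a(.) memory.
    bank_of: function mapping an address to its memory bank.
    Returns {cycle: [(bank, set of distinct addresses), ...]} for every cycle
    in which some bank is asked for two or more different values."""
    per_cycle_bank = defaultdict(lambda: defaultdict(set))
    for cycle, pe, addr in reads:
        per_cycle_bank[cycle][bank_of(addr)].add(addr)
    conflicts = {}
    for cycle, banks in per_cycle_bank.items():
        bad = [(b, addrs) for b, addrs in banks.items() if len(addrs) > 1]
        if bad:
            conflicts[cycle] = bad
    return conflicts
```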
Derivation of a conflict-free schedule
- A memory-partitioning scheme and a processor-assignment scheme are first chosen, by simulating different memory-partitioning and processor-assignment schemes, so that the number of conflicts is not prohibitively large.
- Cycles that do not have conflicts are left unchanged; the ones that have conflicts are recursively broken into sub-cycles.
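A sketch of the splitting idea under a simplifying assumption: a conflicting cycle's requests are packed greedily into sub-cycles so that no bank is asked for more than one distinct value per sub-cycle (the actual scheduler must also respect path conflicts and pipeline timing):

```python
def split_cycle(reads, bank_of):
    """reads: list of (pe, address) requests issued in one conflicting cycle.
    Greedily pack the requests into sub-cycles so that each bank serves at
    most one distinct address per sub-cycle (identical addresses may share a
    sub-cycle, since they can be broadcast)."""
    sub_cycles = []
    for pe, addr in reads:
        placed = False
        for sub in sub_cycles:
            bank = bank_of(addr)
            clash = any(bank_of(a) == bank and a != addr for _, a in sub)
            if not clash:
                sub.append((pe, addr))
                placed = True
                break
        if not placed:
            sub_cycles.append([(pe, addr)])
    return sub_cycles
```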
Motion estimator architecture: broadcasting previous-frame data
Broadcasting the Current-Frame Data
- Implements the original TSS data-flow.
- a(.) is broadcast; b(.) is partitioned into 16 memory banks.
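One way to express this partitioning in software, purely for illustration (the bank mapping below is hypothetical; the paper selects the actual assignment by simulation): a(.) travels on a broadcast bus visible to all PE's, while b(.) addresses are spread across 16 banks.

```python
def b_bank(x, y, banks=16):
    """Hypothetical interleaved assignment of previous-frame pixels b(x, y)
    to one of 16 banks; the paper's actual partitioning is chosen by simulation."""
    return (x + 4 * y) % banks

def broadcast_cycle(a_value, b_requests, b_mem):
    """One cycle of the current-frame-broadcast scheme: every PE sees the same
    broadcast a(.) value, and each PE's b(.) operand must come from a distinct
    bank (or be an identical, broadcastable address) to avoid a memory conflict."""
    served = {}
    for (x, y) in b_requests:
        bank = b_bank(x, y)
        if served.setdefault(bank, (x, y)) != (x, y):
            raise RuntimeError("bank conflict: one bank asked for two b(.) values")
    return [(a_value, b_mem[(x, y)]) for (x, y) in b_requests]
```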
Motion estimator architecture: broadcasting current-frame data
Performance of the motion estimator
The simulator takes as input:
- A data-flow description of a BMA, specifying the # of PE's and the ideal flow of the pixel data.
- A memory configuration, specifying the # of ME's and the # of memory ports.
- A network characterization, specifying the topology of the interconnection network between the PE's and the ME's.
- The pipelining information, specifying the number of pipeline stages in each PE and in the network.
The simulator determines the network-path and memory-access conflicts.
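The four inputs map naturally onto a small configuration record; a hypothetical sketch of how such a simulator might be parameterized (field names are assumptions, not taken from the paper):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SimulatorConfig:
    # Data-flow description of the BMA
    num_pes: int                                   # number of PE's
    dataflow: List[Tuple[int, int, tuple, tuple]]  # (cycle, pe, a_addr, b_addr)
    # Memory configuration
    num_mes: int                                   # number of ME's (memory banks)
    ports_per_me: int                              # number of ports on each ME
    # Network characterization
    network: str                                   # e.g. "crossbar", "omega", "benes"
    # Pipelining information
    pe_stages: int                                 # pipeline stages in each PE
    network_stages: int                            # pipeline stages in the network
```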
Interconnection networks
- Completely connected network: N² crosspoint switches are needed in a single-stage crossbar.
- N-port (N in, N out) multistage networks: it may not be possible to remove all path conflicts.
- Generalized Cube and Omega: N-port networks with log₂N stages of N/2 switches each.
- Benes: 2 log₂N − 1 switch stages.
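To make the path-conflict point concrete, a sketch of destination-tag routing on an Omega network that reports the stages where two requested paths would need the same link (a standard Omega-network property, not code from the paper):

```python
def omega_conflicts(pairs, n_bits):
    """pairs: (source, destination) requests on an N = 2**n_bits Omega network
    with destination-tag routing. Returns (stage, path1, path2) tuples for
    every stage at which two different paths would need the same output link."""
    conflicts, links = [], {}
    for s, d in pairs:
        for stage in range(1, n_bits + 1):
            # Link label after `stage` stages: the low (n_bits - stage) bits of
            # the source followed by the high `stage` bits of the destination.
            label = ((s & ((1 << (n_bits - stage)) - 1)) << stage) | (d >> (n_bits - stage))
            key = (stage, label)
            if key in links and links[key] != (s, d):
                conflicts.append((stage, links[key], (s, d)))
            else:
                links.setdefault(key, (s, d))
    return conflicts

# Example: on an 8-port Omega network, paths 0 -> 0 and 4 -> 1 collide in stage 1.
print(omega_conflicts([(0, 0), (4, 1)], n_bits=3))
```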
Memory-partitioning scheme without data duplication
Memory-partitioning scheme with data duplication
Simulation results for different networking and memory-partitioning schemes
Simulation results for different pixel distributions
Data-flow for step 1 of the conjugate-direction search
Data-flow for step 2 of the conjugate-direction search
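For context, a simplified one-at-a-time variant of the conjugate-direction search, sketched in software: it walks along x while the SAD improves, then along y, corresponding roughly to steps 1 and 2 in the figures (block size and range are assumed values; the figures' exact data-flow is not reproduced here):

```python
import numpy as np

def sad(c, r):
    return int(np.abs(c.astype(np.int32) - r.astype(np.int32)).sum())

def one_at_a_time_search(cur, prev, bx, by, block=16, rng=7):
    """Simplified conjugate-direction search: move one pixel at a time along x
    while the SAD keeps improving, then along y."""
    c = cur[by:by + block, bx:bx + block]
    dx, dy = 0, 0
    best = sad(c, prev[by:by + block, bx:bx + block])

    def try_shift(ddx, ddy):
        nonlocal dx, dy, best
        x, y = bx + dx + ddx, by + dy + ddy
        if 0 <= x <= prev.shape[1] - block and 0 <= y <= prev.shape[0] - block \
                and abs(dx + ddx) <= rng and abs(dy + ddy) <= rng:
            s = sad(c, prev[y:y + block, x:x + block])
            if s < best:
                dx, dy, best = dx + ddx, dy + ddy, s
                return True
        return False

    # Step 1: horizontal direction
    while try_shift(+1, 0) or try_shift(-1, 0):
        pass
    # Step 2: vertical direction
    while try_shift(0, +1) or try_shift(0, -1):
        pass
    return (dx, dy), best
```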
Conclusions
- An engine that can be adapted to multiple motion-estimation algorithms.