Published by Molly Hunt. Modified over 9 years ago.
1
ELEC692 VLSI Signal Processing Architecture
Lecture 9: VLSI Architecture for Discrete Cosine Transform
2
Discrete Cosine Transform
A frequency transform.
Used for pattern recognition, image processing, and still-image and moving-image (video) processing.
For an N-point sequence x(n), the lecture defines an N-point DCT and IDCT pair.
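One common way to write the pair, consistent with the 2/N scaling factor mentioned later in these slides, is sketched below. The weighting symbol e(k) and the choice to put the whole scaling factor on the inverse are assumptions; other normalizations are also in use.

```latex
\begin{align}
X(k) &= e(k) \sum_{n=0}^{N-1} x(n) \cos\left[\frac{(2n+1)k\pi}{2N}\right],
  & k &= 0, 1, \dots, N-1 \\
x(n) &= \frac{2}{N} \sum_{k=0}^{N-1} e(k)\, X(k) \cos\left[\frac{(2n+1)k\pi}{2N}\right],
  & n &= 0, 1, \dots, N-1 \\
e(k) &= \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & \text{otherwise} \end{cases}
\end{align}
```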
3
N-point DCT/IDCT
The N-point DCT and IDCT pair can be derived from a 2N-point discrete Fourier transform (DFT) pair.
A 2N-point sequence y(n) is built from x(n) and its mirror image.
y(n) is symmetric with respect to the midpoint at n = N - 1/2.
The 2N-point DFT of y(n), for 0 <= k <= 2N-1, is a sum over the first half of y(n) plus a sum over the mirrored second half.
Substituting n = 2N - n' - 1 into the second summation maps it onto the same index range as the first.
4
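N-point DCT (cont.)
Combining the two summations gives Y(k) as a single sum over x(n).
With a suitable intermediate definition, the N-point DCT can be expressed in terms of Y(k).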
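The relationship between the DCT and the 2N-point DFT of the mirrored sequence can be checked numerically. The sketch below assumes the DFT kernel W = exp(-j*2*pi/(2N)) and a DCT weighted by e(k) = 1/sqrt(2) for k = 0 and 1 otherwise; under those assumptions, X(k) = (e(k)/2) * W^(k/2) * Y(k). The function names are illustrative only.

```python
import cmath
import math


def dct_direct(x):
    """Direct N-point DCT: X(k) = e(k) * sum_n x(n) cos((2n+1) k pi / 2N)."""
    N = len(x)
    out = []
    for k in range(N):
        e = 1 / math.sqrt(2) if k == 0 else 1.0
        acc = sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                  for n in range(N))
        out.append(e * acc)
    return out


def dft(y):
    """Plain O(M^2) DFT with kernel exp(-j 2 pi n k / M)."""
    M = len(y)
    return [sum(y[n] * cmath.exp(-2j * cmath.pi * n * k / M) for n in range(M))
            for k in range(M)]


def dct_via_dft(x):
    """DCT obtained from the 2N-point DFT of x followed by its mirror image."""
    N = len(x)
    y = list(x) + list(reversed(x))   # y(n) = x(2N-1-n) for N <= n <= 2N-1
    Y = dft(y)
    out = []
    for k in range(N):
        e = 1 / math.sqrt(2) if k == 0 else 1.0
        half_twiddle = cmath.exp(-1j * cmath.pi * k / (2 * N))  # W^(k/2)
        out.append((0.5 * e * half_twiddle * Y[k]).real)
    return out


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    print(dct_direct(x))
    print(dct_via_dft(x))
```

Both functions should print the same coefficients up to floating-point rounding.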
5
N-point DCT/IDCT
An N-point 1D-DCT requires N^2 multiplications and additions.
For image compression, N x N blocks need an N x N 2D-DCT.
Direct computation of a 2D-DCT of length N requires N^4 multiplications and additions.
Using the separability of the 2D-DCT, it can be computed by performing N 1D-DCTs on the rows of the image block, followed by N 1D-DCTs on the resulting columns.
Complexity is reduced to 2N^3 multiply-add operations, or 4N^3 arithmetic operations.
6
2D DCT
The 2-D Discrete Cosine Transform has been shown to be separable, i.e., it can be expressed as two consecutive 1-D transforms.
Observe that X and x are 2-D (N x N) data matrices.
A 2-D transform can therefore be calculated by using a 1-D transform hardware unit twice, with a matrix transposition of the intermediate result in between.
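The row-column scheme can be sketched in a few lines. This is a minimal software model, assuming the e(k)-weighted DCT without further scaling; the function names are hypothetical and the transposition step stands in for the transpose memory of a hardware core.

```python
import math


def dct_1d(x):
    """N-point 1-D DCT with weight e(0) = 1/sqrt(2), e(k) = 1 otherwise."""
    N = len(x)
    return [(1 / math.sqrt(2) if k == 0 else 1.0) *
            sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N))
            for k in range(N)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct_2d_separable(block):
    """Row DCTs, transpose, row DCTs again, transpose back."""
    after_rows = [dct_1d(row) for row in block]
    after_cols = [dct_1d(row) for row in transpose(after_rows)]
    return transpose(after_cols)


def dct_2d_direct(block):
    """Direct evaluation of the double sum, O(N^4) multiply-adds."""
    N = len(block)

    def e(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    return [[e(k) * e(l) * sum(
                block[m][n]
                * math.cos((2 * m + 1) * k * math.pi / (2 * N))
                * math.cos((2 * n + 1) * l * math.pi / (2 * N))
                for m in range(N) for n in range(N))
             for l in range(N)]
            for k in range(N)]


if __name__ == "__main__":
    block = [[float((3 * r + 5 * c) % 7) for c in range(4)] for r in range(4)]
    print(dct_2d_separable(block))
```

The separable version calls the 1-D routine 2N times, which is where the 2N^3 multiply-add count comes from.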
7
Block diagram and timing diagram of DCT core processor
8
Algorithm-Architecture Transformation of DCT
A hierarchical way to adapt an architecture to a given algorithm, or to change the algorithm's description in a systematic way.
The number of multiplications in the DCT can be reduced using this technique, e.g. for the 8-point DCT.
Combining a_k and the cosine expression into one coefficient b_{n,k} gives the dataflow graph of the 8-point DCT.
9
Algorithm-Architecture Transformation of DCT
The dataflow graph can be written in matrix form.
The transformation proceeds in 3 steps.
Step 1: systematically modify the DCT algorithm, here using trigonometric properties.
10
Algorithm-Architecture Transformation of DCT
The 8-point DCT can then be rewritten accordingly.
11
Algorithm-Architecture Transformation of DCT
12
Step 2 transformation: the DCT structure is grouped into different functional units represented by blocks, and the whole DCT structure is then transformed into a block diagram.
Two major blocks:
– Butterfly block: inputs x(0), x(1); outputs x(0)+x(1) and x(0)-x(1)
– Coefficient block with constants a and b: inputs x(0), x(1); outputs ax(0)+bx(1) and bx(0)-ax(1)
13
Algorithm-Architecture Transformation of DCT The transformed block diagram is:
14
Algorithm-Architecture Transformation of DCT
Step 3: reduce the complexity of the implementations of the blocks.
The coefficient block (outputs ax(0)+bx(1) and bx(0)-ax(1)) can be realized using 3 multiplications and 3 additions instead of 4 multiplications and 2 additions.
Define the block with a = sin θ and b = cos θ, and with reversed outputs, as a rotator block.
Other transformations
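One standard way to reach 3 multiplications and 3 additions is to share the product a*(x0 + x1) between both outputs and fold the remaining factors into precomputed constants. The sketch below is an illustration under that assumption, not necessarily the factorization used on the slide; the angle name theta and the function names are hypothetical.

```python
import math


def coeff_block_4mul(a, b, x0, x1):
    """Reference form: 4 multiplications, 2 additions."""
    return a * x0 + b * x1, b * x0 - a * x1


def coeff_block_3mul(a, b, x0, x1):
    """Same outputs with 3 multiplications and 3 additions.

    (b - a) and (a + b) depend only on the constants a and b, so in hardware
    they would be precomputed and do not count as run-time additions.
    """
    b_minus_a = b - a
    a_plus_b = a + b
    t = a * (x0 + x1)          # addition 1, multiplication 1
    y0 = t + b_minus_a * x1    # multiplication 2, addition 2 -> a*x0 + b*x1
    y1 = a_plus_b * x0 - t     # multiplication 3, addition 3 -> b*x0 - a*x1
    return y0, y1


def rotator(theta, x0, x1):
    """Block with a = sin(theta), b = cos(theta) and its two outputs swapped."""
    y0, y1 = coeff_block_3mul(math.sin(theta), math.cos(theta), x0, x1)
    return y1, y0


if __name__ == "__main__":
    print(coeff_block_4mul(0.3, 0.8, 2.0, -5.0))
    print(coeff_block_3mul(0.3, 0.8, 2.0, -5.0))
    print(rotator(math.pi / 8, 1.0, 0.0))
```

Read literally, swapping the outputs gives (cos θ x0 - sin θ x1, sin θ x0 + cos θ x1), which is a plane rotation of the input pair.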
15
Algorithm-Architecture Transformation of DCT
Final architecture: 13 multiplications, 31 additions.
16
Decimation-in-Frequency Fast DCT for 2^m-Point IDCT
Decimation in frequency (DIF) is commonly used for the DFT.
It reduces the number of multiplications to about (N/2) log2 N by power-of-2 decomposition.
For simplicity, the 2/N scaling factor is ignored.
17
Fast DCT/IDCT (FCT)
– Decompose the sum into even and odd indices of k.
– For h'(n), use the identity 2 cos A cos B = cos(A+B) + cos(A-B).
18
N-point IDCT can be decomposed using N/2-point IDCT
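The decomposition can be written out as follows. This is a reconstruction of the standard decimation-in-frequency argument rather than a copy of the slide's equations: the names G(k), H(k), g(n) and h(n), and the convention that the weight e(k) is absorbed into the input, are assumptions. The odd-index part h'(n) is multiplied by 2cos[(2n+1)π/2N] and the product-to-sum identity turns it into an N/2-point IDCT.

```latex
\begin{align}
\hat{X}(k) &= e(k)\,X(k), \qquad
x(n) = \sum_{k=0}^{N-1} \hat{X}(k) \cos\left[\frac{(2n+1)k\pi}{2N}\right] \\
G(k) &= \hat{X}(2k), \qquad
H(k) = \hat{X}(2k+1) + \hat{X}(2k-1), \qquad \hat{X}(-1) = 0,
  \qquad k = 0, \dots, \tfrac{N}{2}-1 \\
g(n) &= \sum_{k=0}^{N/2-1} G(k) \cos\left[\frac{(2n+1)k\pi}{2(N/2)}\right], \qquad
h(n) = \sum_{k=0}^{N/2-1} H(k) \cos\left[\frac{(2n+1)k\pi}{2(N/2)}\right] \\
x(n) &= g(n) + \frac{h(n)}{2\cos\left[(2n+1)\pi/2N\right]}, \qquad
x(N-1-n) = g(n) - \frac{h(n)}{2\cos\left[(2n+1)\pi/2N\right]},
  \qquad n = 0, \dots, \tfrac{N}{2}-1
\end{align}
```

Each level costs N/2 multiplications by the factors 1/(2cos[·]), which over log2 N levels gives the (N/2) log2 N figure.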
19
N-point IDCT Architecture
20
An N-point IDCT can be expressed in terms of two N/2-point IDCTs.
By repeating this process, the IDCT can be decomposed further until it is expressed in terms of 2-point IDCTs. (The DCT can be decomposed in a similar fashion.)
2-point IDCT butterfly architecture (figure: butterfly with coefficient cos(π/4) and signals x(0), x(1)).
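The recursive structure can be modelled in software as below. This is a sketch under assumptions: the 2/N scaling is dropped as the slides suggest, the e(k) weight is taken to be already folded into the input, the length is a power of two, and the even/odd split uses G(k) = X(2k) and H(k) = X(2k+1) + X(2k-1), which is one standard DIF formulation. Function names are illustrative.

```python
import math


def idct_direct(X):
    """Unscaled IDCT: x(n) = sum_k X(k) cos((2n+1) k pi / 2N)."""
    N = len(X)
    return [sum(X[k] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for k in range(N))
            for n in range(N)]


def idct_fast(X):
    """Recursive DIF IDCT for len(X) a power of two."""
    N = len(X)
    if N == 1:
        return [X[0]]
    half = N // 2
    # Even-indexed inputs feed one N/2-point IDCT directly.
    G = [X[2 * k] for k in range(half)]
    # Odd-indexed inputs are combined pairwise (X(-1) is taken as 0).
    H = [X[2 * k + 1] + (X[2 * k - 1] if k > 0 else 0.0) for k in range(half)]
    g = idct_fast(G)
    h = idct_fast(H)
    x = [0.0] * N
    for n in range(half):
        # One multiplication per butterfly: N/2 per level, log2(N) levels.
        t = h[n] / (2 * math.cos((2 * n + 1) * math.pi / (2 * N)))
        x[n] = g[n] + t
        x[N - 1 - n] = g[n] - t
    return x


if __name__ == "__main__":
    X = [1.0, -2.0, 3.5, 0.25, -1.5, 4.0, 2.0, -0.75]
    print(idct_fast(X))
    print(idct_direct(X))
```

For N = 2 the recursion reduces to x(0) = X(0) + cos(π/4) X(1) and x(1) = X(0) - cos(π/4) X(1), since 1/(2 cos(π/4)) equals cos(π/4); this matches the single-coefficient butterfly named on the slide.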
21
Example: 8-point IDCT
22
8-point IDCT architecture
23
Complexity comparison
24
Multiplier-less DCT architecture
Uses distributed arithmetic.
More area-efficient hardware realization.
Multipliers are replaced by memory look-up tables.
The regularity of the highly concurrent structure allows modular design of the circuit.
Bit-serial and bit-parallel structures save area and ease routing.
25
Distributed Arithmetic (B. Liu, 1974)
The most often encountered form of computation in DSP:
– Sum of products
– Dot product
– Inner product
Distributed arithmetic (DA) is used to design bit-level architectures for vector-vector multiplications (inner products).
– Each word in the vectors is represented as a binary number.
– The multiplications are re-ordered and mixed such that the arithmetic becomes "distributed" through the structure.
26
Technical Overview of DA
Advantage of DA: efficiency of mechanization of the computation.
A frequently argued disadvantage:
– Slowness, because of its inherently bit-serial nature
– Some modifications increase the speed, but at the expense of:
– more arithmetic operations
– exponentially increased memory
27
Conventional distributed arithmetic
Consider the inner product Y of two length-N vectors C and X, where the {c_i} are M-bit constants and the {x_i} are coded as W-bit 2's complement numbers.
– Substituting the 2's complement bit expansion of each x_i into the inner product expresses Y in terms of the individual bits x_{i,j}.
28
Conventional distributed arithmetic
Define the terms C_j from the bits x_{i,j}; Y then becomes a weighted sum of the C_j.
By interchanging the summing order of i and j, the initial multiplications are redistributed into another computation pattern.
Since the term C_j depends only on the x_{i,j} values and has only 2^N possible values, these values can be pre-computed and stored in a ROM.
An input set of N bits (x_{0,j}, x_{1,j}, …, x_{N-1,j}) is used as an address to retrieve the C_j value.
These intermediate results are accumulated over W clock cycles to produce one Y value.
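The ROM-plus-accumulator scheme can be modelled with integers. The sketch below makes several assumptions that are not in the slides: inputs are W-bit 2's complement integers rather than fractions, bits are processed LSB first, the accumulator adds shifted terms instead of shifting itself, and bit i of the ROM address corresponds to input x_i. All names are hypothetical.

```python
def build_rom(c):
    """ROM with 2^N words: word at address a is the sum of c[i] for set bits i."""
    N = len(c)
    return [sum(c[i] for i in range(N) if (addr >> i) & 1)
            for addr in range(1 << N)]


def da_inner_product(c, x, W):
    """Bit-serial distributed arithmetic: one ROM lookup per bit position."""
    rom = build_rom(c)
    mask = (1 << W) - 1
    bits = [xi & mask for xi in x]        # W-bit 2's complement patterns
    acc = 0
    for j in range(W):                    # one clock cycle per bit position
        addr = 0
        for i, pattern in enumerate(bits):
            addr |= ((pattern >> j) & 1) << i
        term = rom[addr] << j             # weight 2^j
        # The sign bit (j = W-1) carries negative weight in 2's complement.
        acc += -term if j == W - 1 else term
    return acc


if __name__ == "__main__":
    c = [3, -5, 7, 2]
    x = [-8, 7, -1, 4]                    # each fits in W = 4 bits
    print(da_inner_product(c, x, 4))
    print(sum(ci * xi for ci, xi in zip(c, x)))
```

No multiplier appears anywhere in the loop; the only operations are a table lookup, a shift, and an add or subtract.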
29
Example Content of ROM (N=4)
30
Architecture for computing the inner product of two length-N vectors using DA
The result is obtained after W clock cycles.
This is called bit-serial distributed arithmetic.
Speed is limited because each result takes W cycles.
31
Speeding up bit-serial DA
Use digit-serial distributed arithmetic, where a digit containing multiple bits is processed in each clock cycle.
E.g. if J consecutive bits are processed in a single clock cycle using J ROMs, then the input words are processed in W/J clock cycles.
A multi-input shift-accumulator adds the contents of the J ROMs and the previously accumulated result.
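A digit-serial variant can be sketched by grouping J bit positions per cycle. As assumptions for this illustration, inputs are W-bit 2's complement integers, W is a multiple of J, and the J ROMs are modelled as J lookups into one shared table, since they would all hold the same contents. Names are hypothetical.

```python
def build_rom(c):
    """ROM with 2^N words: word at address a is the sum of c[i] for set bits i."""
    N = len(c)
    return [sum(c[i] for i in range(N) if (addr >> i) & 1)
            for addr in range(1 << N)]


def da_digit_serial(c, x, W, J):
    """Process J bits of every input per cycle; returns (result, cycles)."""
    assert W % J == 0, "this sketch assumes W is a multiple of J"
    rom = build_rom(c)
    mask = (1 << W) - 1
    bits = [xi & mask for xi in x]
    acc = 0
    cycles = 0
    for base in range(0, W, J):
        cycle_sum = 0
        for b in range(J):                # J parallel ROM lookups in hardware
            j = base + b
            addr = 0
            for i, pattern in enumerate(bits):
                addr |= ((pattern >> j) & 1) << i
            term = rom[addr] << j
            cycle_sum += -term if j == W - 1 else term
        acc += cycle_sum                  # multi-input shift-accumulator
        cycles += 1
    return acc, cycles


if __name__ == "__main__":
    c = [3, -5, 7, 2]
    x = [-100, 77, -1, 45]
    print(da_digit_serial(c, x, W=8, J=2))
```

With W = 8 and J = 2 the result is available after 4 cycles instead of 8, at the cost of a second ROM and a wider adder.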
32
DA with Offset-Binary Coding
Offset-binary coding (OBC) can be used to reduce the ROM size by a factor of 2.
The derivation defines an offset-binary form of x_i (eqn. 1), which can then be rewritten as eqn. 2.
33
DA with Offset-Binary Coding
Using eqn. 2, the original Y can be rewritten, and the terms D_j are then defined from it.
34
Content of the ROM with OBC Coding (N=4), Table 13.3
The D_j values are mirrored; therefore D_j has only 2^(N-1) distinct magnitudes depending on the x_{i,j} values, and the ROM size is reduced by a factor of 2.
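The halved ROM can be modelled as follows. This sketch rests on a standard identity, stated here as an assumption about what the slide's equations contain: for a W-bit 2's complement integer, 2x = sum_j w_j s_j - 1 with s_j = +1 for a 1 bit and -1 for a 0 bit, and w_j = 2^j except for the sign bit, which has weight -2^(W-1). The bit of x_0 selects add or subtract, and the remaining bits XORed with it form the address. Names and the integer representation are hypothetical.

```python
def build_obc_rom(c):
    """ROM with 2^(N-1) words, stored for the case where the x_0 bit is 0."""
    N = len(c)
    rom = []
    for a in range(1 << (N - 1)):
        val = -c[0]                                   # x_0 bit 0 -> digit -1
        for i in range(1, N):
            val += c[i] if (a >> (i - 1)) & 1 else -c[i]
        rom.append(val)
    return rom


def da_obc_inner_product(c, x, W):
    """Bit-serial DA with offset-binary coding and a half-size ROM."""
    N = len(c)
    rom = build_obc_rom(c)
    mask = (1 << W) - 1
    bits = [xi & mask for xi in x]
    acc = -sum(c)                      # constant offset term, added once
    for j in range(W):
        b0 = (bits[0] >> j) & 1
        addr = 0
        for i in range(1, N):
            # XOR with the x_0 bit maps mirrored patterns to the same word.
            addr |= (((bits[i] >> j) & 1) ^ b0) << (i - 1)
        d = -rom[addr] if b0 else rom[addr]
        term = d << j
        acc += -term if j == W - 1 else term
    return acc // 2                    # acc holds exactly 2Y


if __name__ == "__main__":
    c = [3, -5, 7, 2]
    x = [-8, 7, -1, 4]
    print(len(build_obc_rom(c)), da_obc_inner_product(c, x, 4))
```

For N = 4 the table holds 8 words instead of 16, and the extra cost is one row of XOR gates plus an add/subtract control on the accumulator.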
35
Architecture with OBC coding
36
ROM decomposition for DA
ROM size increases exponentially with N.
– ROM access time can be a bottleneck, especially when N is large
– Reducing the size of the ROM is important
Solution:
– Divide the N address bits into N/K groups of K bits
– Decompose the ROM of size 2^N into N/K ROMs of size 2^K
– Add the outputs of these ROMs using a multi-input accumulator
– The reduction in storage size is balanced by a linear increase in the computational complexity of the accumulator
– Carry-save arithmetic can be used to realize the multi-input accumulator to minimize the computation time
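The decomposition can be illustrated with the same integer model of bit-serial DA. As assumptions for this sketch, inputs are W-bit 2's complement integers, N is a multiple of K, and each group of K consecutive coefficients gets its own ROM. Function names are hypothetical.

```python
def build_rom(c):
    """ROM with 2^len(c) words: sum of c[i] over the set address bits."""
    n = len(c)
    return [sum(c[i] for i in range(n) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_decomposed(c, x, W, K):
    """Bit-serial DA with N/K small ROMs; returns (result, total ROM words)."""
    N = len(c)
    assert N % K == 0, "this sketch assumes N is a multiple of K"
    groups = range(0, N, K)
    roms = [build_rom(c[g:g + K]) for g in groups]
    mask = (1 << W) - 1
    bits = [xi & mask for xi in x]
    acc = 0
    for j in range(W):
        partial = 0
        for rom, g in zip(roms, groups):
            addr = 0
            for i in range(K):
                addr |= ((bits[g + i] >> j) & 1) << i
            partial += rom[addr]          # multi-input accumulation
        term = partial << j
        acc += -term if j == W - 1 else term
    return acc, sum(len(r) for r in roms)


if __name__ == "__main__":
    c = [3, -5, 7, 2, 1, -4, 6, -2]
    x = [-8, 7, -1, 4, 0, 3, -6, 5]
    print(da_decomposed(c, x, W=4, K=4))   # two 16-word ROMs
    print(da_decomposed(c, x, W=4, K=8))   # one 256-word ROM
```

For N = 8, splitting into two groups of K = 4 reduces the storage from 256 words to 32, while the accumulator has to add two ROM outputs per cycle instead of one.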
37
Multi-input accumulator
CPA: carry-propagate adder. CSA: carry-save adder.
(Figure: accumulator structures annotated with Delay = N·T_fa, Delay = 4·T_fa, and Delay = 3·T_fa; the fastest structure uses more registers.)
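The idea behind the carry-save accumulator can be shown at the word level. In this sketch a 3:2 carry-save stage reduces three operands to a sum word and a carry word using only bitwise logic, so its delay is one full-adder delay regardless of word length; a single carry-propagate addition is needed only at the end. The function names are illustrative, and the inputs are assumed to be non-negative integers.

```python
def carry_save_add(a, b, c):
    """3:2 compressor: returns (sum_word, carry_word) with a+b+c == sum+carry."""
    sum_word = a ^ b ^ c                           # per-bit sum, no carry chain
    carry_word = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, shifted
    return sum_word, carry_word


def multi_input_sum(values):
    """Accumulate many operands in carry-save form, then resolve once."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)   # one full-adder delay per operand
    return s + c                          # final carry-propagate adder (CPA)


if __name__ == "__main__":
    rom_outputs = [5, 9, 14, 3, 27]       # e.g. outputs of several small ROMs
    print(multi_input_sum(rom_outputs))
```

Keeping both the sum word and the carry word between cycles is what costs the extra registers.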
38
Architecture with ROM decomposition
39
Conclusion on DA
DA is a very efficient mechanism for computations that are dominated by inner products (e.g. convolution).
It is a good way to trade combinational logic for memory in high-performance computation.
When many computing methods are compared, DA should be considered. It is not always (but often) the best, and it never performs poorly: it saves around 50% to 80% of the gate count.
Application: M.-T. Sun, T.-C. Chen, A. M. Gottlieb, "VLSI implementation of a 16*16 discrete cosine transform," IEEE Transactions on Circuits and Systems, vol. 36, no. 4, April 1989, pp. 610-617, and many other transforms and DSP kernels.
40
DCT architecture using DA
For a small-size DCT, combinational logic (CB) can be used to implement the ROM. This reduces the critical path delay.