1 Implementation of VLD and Constant Division on PAC DSP Platform Student: Chung-Yen Tsai Advisor: Prof. David W. Lin Date: 2005.12.22.

2 Outline
- Introduction to PAC DSP
- VLD Implementation on PAC
- Implementation of Constant Division on PAC
- Conclusion

4 Reference ITRI STC/M310, PACDSP2S0000, “PACDSP v2.0 Instruction Set Menu”, June, 2005.

5 Introduction to PAC DSP
- PAC is a Very Long Instruction Word (VLIW) processor
  - Supports Single Instruction Multiple Data (SIMD) instructions
- One scalar unit and two clusters
  - Each cluster has a Load/Store unit (L/S) and an Arithmetic unit (AU)
  - Ping-Pong Register File architecture

6 The Architecture of PACDSP

8 References
[1] S. Sriram and C. Y. Hung, “MPEG-2 video decoding on the TMS320C6X DSP architecture,” IEEE Signal Systems Computer Conf., vol. 2, Nov. 1998, pp. 1735-1739.
[2] C. Fogg, “Survey of software and hardware VLC architectures,” SPIE vol. 2186, Image and Video Compression, 1994.

9 VLD Problem 1: It Does Not Pay to Use Both Clusters
- Because the length of the code to be decoded is uncertain, the benefit of the two-cluster architecture cannot be utilized
- If we do not use the block-based coding type, the program flow is simpler
  - Because of fewer branches

10 VLD Problem 2: Memory vs. Performance
- A tradeoff exists between the required memory size (for the VLD lookup table) and the cycles used
  - Different VLD methods have been proposed in the literature
  - We present some analysis in later slides
- Performance is limited on deeply pipelined processors because of significant memory access time [1]

11 VLD Methods [1], [2]
- Bit-by-bit matching
  - Read the bitstream bit by bit, and check for a match after each read
- Multiple-pass lookups
  - Separate the table into 3 parts, and read the table at most 3 times
- Bounded multiple-pass lookups
  - Also 3 tables, but read the bitstream only once
- One-table lookup
  - Only one table, and only one read
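
The two extremes of this list can be sketched in a few lines of Python. This is only an illustration: the four-symbol prefix code `CODES` is a hypothetical toy table, not the MPEG-4 VLC table tested in these slides, and the function names are made up for the example. It shows why the one-table method trades memory (a table of 2^MAX_LEN entries) for fewer bitstream reads and branches.

```python
# Hypothetical toy prefix code; NOT the MPEG-4 VLC table used in the slides.
CODES = {"1": "A", "01": "B", "001": "C", "000": "D"}
MAX_LEN = max(len(code) for code in CODES)

def decode_bit_by_bit(bits):
    """Bit-by-bit matching: read one bit, then check for a match."""
    out, cur = [], ""
    for b in bits:
        cur += b
        if cur in CODES:          # one check (branch) per bit read
            out.append(CODES[cur])
            cur = ""
    return out

def build_one_table():
    """One-table lookup: index every MAX_LEN-bit window by (symbol, length)."""
    table = [None] * (1 << MAX_LEN)
    for code, sym in CODES.items():
        pad = MAX_LEN - len(code)
        base = int(code, 2) << pad
        for fill in range(1 << pad):   # all windows that start with this code
            table[base | fill] = (sym, len(code))
    return table

def decode_one_table(bits, table):
    """Peek MAX_LEN bits, do one lookup, then consume only the code length."""
    out, pos = [], 0
    padded = bits + "0" * MAX_LEN      # zero padding so the last peek is full width
    while pos < len(bits):
        sym, length = table[int(padded[pos:pos + MAX_LEN], 2)]
        out.append(sym)
        pos += length
    return out
```

For the toy code the lookup table has 8 entries for 4 codewords; with longer maximum code lengths the table grows exponentially, which is the memory side of the tradeoff.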

12 Bit-by-bit Matching: Method Test VLC Table from MPEG-4 Standard

13 Comparison

14 Conclusion of VLD on PAC
- Branch and jump instructions degrade performance
- Bitstream reading and memory accesses also cost many cycles, so we should try to reduce their frequency
- The bounded multiple-pass method seems to offer the best tradeoff between required memory size and speed among the analyzed methods

16 References
[1] D. A. Patterson and J. L. Hennessy, “Computer Organization and Design: The Hardware/Software Interface”, sec. 4.7 “Division”
[2] M. D. Ercegovac and T. Lang, “Digital Arithmetic”, sec. 1.6 “Basic Division Algorithms”

17 Why do we need efficient constant division?
- Disadvantages of a hardware divider
  - Larger area and more power consumption
  - Several cycles required for a division
- Several DSPs have no hardware divider
  - That is, no division instructions are supported
  - Algorithms that perform division using addition and multiplication are necessary

18 If we use a table lookup
- The most efficient method when multiplication is supported
- Disadvantages
  - Unknown divisor: QP is a user-defined value, so the table must include all possible QP values (3 to 9 bits)
  - Precision
    - Example: dividend = 0xFFFF (65535); divisor = 0x1C (28); exact result q = 2340
    - 1/28 = 0.035714285 → 1170 (with scale 32768)
    - 65535 x 1170 / 32768 = 2339
    - Can be adopted if the round-to-nearest-integer rule is used
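
The precision example can be reproduced with a short Python sketch. The function names are illustrative, and the table entry is assumed to be floor(32768 / divisor), which matches the value 1170 on the slide. Rounding recovers 2340 for this particular input; the sketch does not claim that it does so for every dividend and divisor.

```python
SCALE_BITS = 15  # scale 32768, as in the slide's example

def reciprocal(divisor):
    """Assumed table entry: floor(2**15 / divisor)."""
    return (1 << SCALE_BITS) // divisor

def divide_truncate(dividend, divisor):
    """Multiply by the scaled reciprocal and drop the fraction."""
    return (dividend * reciprocal(divisor)) >> SCALE_BITS

def divide_round(dividend, divisor):
    """Same product, but add half the scale first (round to nearest)."""
    half = 1 << (SCALE_BITS - 1)
    return (dividend * reciprocal(divisor) + half) >> SCALE_BITS
```

With dividend 0xFFFF and divisor 0x1C, `divide_truncate` gives 2339 and `divide_round` gives 2340, the same as integer division `65535 // 28`.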

19 Simple Idea But Bad Result
- Idea
  - For positive integers, we can just repeatedly subtract the divisor from the dividend and check whether the result has become negative
- Result
  - With dividend 0x8000 and divisor 0x8, a division takes 0x1000 (4096) iterations
  - Very inefficient
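
A minimal sketch of the repeated-subtraction idea, written in Python for illustration only (the function name and the returned iteration counter are assumptions for this example). The loop count equals the quotient, which is why the method scales so badly.

```python
def divide_by_subtraction(dividend, divisor):
    """Subtract the divisor until the next subtraction would go negative."""
    quo = 0
    rem = dividend
    iterations = 0
    while rem - divisor >= 0:
        rem -= divisor
        quo += 1
        iterations += 1
    return quo, rem, iterations
```

For dividend 0x8000 and divisor 0x8 the loop body runs 0x1000 times, matching the count on the slide.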

20 Introduction to the Algorithms
- Restoring algorithms
  - “Grammar School Algorithm Ver.1” [1]
  - “Grammar School Algorithm Ver.2” [1]
  - “Grammar School Algorithm Ver.3” [1]
  - “Algorithm Restoring Divide (RD)” [2]
  - “Algorithm Non-performing Divide (NPD)” [2]
- Non-restoring algorithm
  - “Algorithm Non-restoring Divide (NRD)” [2]

21 Use the Idea of Long Division -- The Grammar School Algorithm
[Figure: binary long division of the dividend 1001010 by the divisor 1000, giving the quotient 1001 and the remainder 10]

22 Grammar School Algorithm Ver.1 [1]
[Initialize]
  rem = dividend
[Recurrence]
  for j = 0…n
    rem = rem - d^;
    if rem >= 0
      quo = quo << 1; quo[0] = 1;
    else
      rem = rem + d^; quo = quo << 1; quo[0] = 0;
    d^ = d^ >> 1;
  endfor
d^ means the divisor aligned to the MSB half
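
A runnable Python sketch of Ver.1 follows. It assumes a 16-bit word (n = 16), an unsigned dividend below 2^n, and a nonzero divisor; the function name is hypothetical, and Python's unbounded integers stand in for the wider register a DSP implementation would use.

```python
def divide_v1(dividend, divisor, n=16):
    """Grammar School Ver.1: shift the divisor right once per iteration."""
    rem = dividend
    d_hat = divisor << n          # divisor aligned to the MSB half
    quo = 0
    for _ in range(n + 1):        # j = 0 ... n, i.e. n + 1 iterations
        rem -= d_hat
        if rem >= 0:
            quo = (quo << 1) | 1  # subtraction succeeded: quotient bit 1
        else:
            rem += d_hat          # restore the remainder: quotient bit 0
            quo = quo << 1
        d_hat >>= 1
    return quo, rem
```

The first iteration always produces a 0 bit for a dividend below 2^n, which is the wasted iteration that the later versions remove.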

23 Grammar School Algorithm Ver.2
[Initialize]
  rem = dividend; rem = rem << 1;
[Recurrence]
  for j = 0…n-1
    rem = rem - d^;
    if rem >= 0
      rem = rem << 1; quo = quo << 1; quo[0] = 1;
    else
      rem = rem + d^; rem = rem << 1; quo = quo << 1; quo[0] = 0;
  endfor
[Correction]
  rem = rem >> 16
d^ means the divisor aligned to the MSB half

24 Grammar School Algorithm Ver.3
[Initialize]
  rem = dividend; rem = rem << 1;
[Recurrence]
  for j = 0…n-1
    LHS(rem) = LHS(rem) - d^;
    if rem >= 0
      rem = rem << 1; rem[0] = 1;
    else
      LHS(rem) = LHS(rem) + d^; rem = rem << 1; rem[0] = 0;
  endfor
[Correction]
  quo = rem & 0xFFFF; rem = rem >> 17
  (the quotient must be read from the low half before the remainder is shifted down)
d^ means the divisor aligned to the MSB half
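
Ver.3 can be sketched in Python as below, again assuming n = 16, an unsigned dividend below 2^n, and a nonzero divisor (the function name is hypothetical). The single register `rem` holds the partial remainder in its upper part and collects quotient bits in its low half, so no separate quotient register is shifted inside the loop.

```python
def divide_v3(dividend, divisor, n=16):
    """Grammar School Ver.3: quotient bits share the remainder register."""
    d_hat = divisor << n              # only the upper half is subtracted from
    rem = dividend << 1               # initial left shift
    for _ in range(n):                # j = 0 ... n-1
        rem -= d_hat
        if rem >= 0:
            rem = (rem << 1) | 1      # shift in quotient bit 1
        else:
            rem += d_hat              # restore
            rem = rem << 1            # shift in quotient bit 0
    quo = rem & ((1 << n) - 1)        # low half holds the quotient
    remainder = rem >> (n + 1)        # upper part was shifted n + 1 times
    return quo, remainder
```

Extracting the quotient with a mask is the extra step that Ver.2 does not need.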

25 Algorithm RD
[Initialize]
  rem = dividend;
[Recurrence]
  for j = 0…n-1
    rem’ = 2*rem - d^;
    if rem’ >= 0
      rem = rem’; quo = quo << 1; quo[0] = 1;
    else
      rem = rem’ + d^; quo = quo << 1; quo[0] = 0;   (restore, so rem = 2*rem)
  endfor
d^ means the divisor aligned to the MSB half

26 Algorithm NPD
[Initialize]
  rem = dividend;
[Recurrence]
  for j = 0…n-1
    if (2*rem - d^) >= 0
      quo = quo << 1; quo[0] = 1; rem = 2*rem - d^;
    else
      quo = quo << 1; quo[0] = 0; rem = 2*rem;
  endfor
d^ means the divisor aligned to the MSB half
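
A Python sketch of the non-performing divide, under the same assumptions as before (n = 16, unsigned dividend below 2^n, nonzero divisor, hypothetical function name). The trial difference is only committed when it is non-negative, so no restoring addition is ever needed; when the trial fails, the remainder is simply doubled.

```python
def divide_npd(dividend, divisor, n=16):
    """Non-performing divide: commit the subtraction only if it succeeds."""
    d_hat = divisor << n              # divisor aligned to the MSB half
    rem, quo = dividend, 0
    for _ in range(n):                # j = 0 ... n-1
        trial = 2 * rem - d_hat
        if trial >= 0:
            quo = (quo << 1) | 1
            rem = trial               # perform the subtraction
        else:
            quo = quo << 1
            rem = 2 * rem             # skip it, keep the shifted remainder
    return quo, rem >> n              # remainder sits in the upper half
```

Algorithm RD produces the same quotient bits; it differs only in writing the trial result first and adding d^ back when the result is negative.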

27 Algorithm NRD
[Initialize]
  rem = dividend; rem = 2*rem - d^;
[Recurrence]
  for j = 0…n-1
    if rem >= 0
      quo = quo << 1; quo[0] = 1; rem = 2*rem - d^;
    else
      quo = quo << 1; quo[0] = 0; rem = 2*rem + d^;
  endfor
[Correction]
  if rem < 0
    quo[0] = 0; rem = rem + d^;
  else
    quo[0] = 1;
d^ means the divisor aligned to the MSB half
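
The non-restoring idea can be checked with the Python sketch below. It is a rearranged variant, not a line-by-line transcription of the slide: each iteration first adds or subtracts d^ according to the sign of the current remainder and then records the quotient bit, and a single correction at the end fixes a negative final remainder. It assumes n = 16, an unsigned dividend below 2^n, and a nonzero divisor; the function name is hypothetical.

```python
def divide_nrd(dividend, divisor, n=16):
    """Non-restoring divide: never undo a subtraction inside the loop."""
    d_hat = divisor << n              # divisor aligned to the MSB half
    rem, quo = dividend, 0
    for _ in range(n):
        if rem >= 0:
            rem = 2 * rem - d_hat     # remainder non-negative: subtract
        else:
            rem = 2 * rem + d_hat     # remainder negative: add instead
        quo = (quo << 1) | (1 if rem >= 0 else 0)
    if rem < 0:                       # final correction step
        rem += d_hat
    return quo, rem >> n
```

Each iteration performs exactly one addition or subtraction, which is the potential advantage over the restoring variants.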

28 Comparison between Different Versions of the Grammar School Algorithm
- Simulation results
  - Version 1: 168 cycles
  - Version 2: 161 cycles
  - Version 3: 162 cycles
- Why can’t we get a significant improvement by using the Version 3 algorithm?
  - The limitation arises from the latencies and delay slots of PAC
  - Thus, the other 3 algorithms can never be better

29 Grammar School Algorithm Ver.3 performance: 162 cycles

30 Conclusion
- Grammar School Algorithm Ver.2 and Ver.3 have almost the same performance because of the latencies and delay slots
  - If the latency of comparison instructions were lower, Ver.3 would be better
  - Ver.3 needs one more cycle to extract the quotient
- Ver.2 is better than Ver.1 because it requires fewer iterations
- The table-lookup method may be adopted while the implementation is still running on the simulator
