1 Implementation of VLD and Constant Division on PAC DSP Platform
Student: Chung-Yen Tsai
Advisor: Prof. David W. Lin
Date: 2005.12.22
2 Outline
Introduction to PAC DSP
VLD Implementation on PAC
Implementation of Constant Division on PAC
Conclusion
4 Reference ITRI STC/M310, PACDSP2S0000, “PACDSP v2.0 Instruction Set Menu”, June, 2005.
5 Introduction to PAC DSP
PAC is a very long instruction word (VLIW) processor
It supports single instruction, multiple data (SIMD) instructions
One scalar unit and two clusters
Each cluster contains a load/store unit (L/S) and an arithmetic unit (AU)
Ping-pong register file architecture
6 The Architecture of PACDSP
8 References
[1] S. Sriram and C. Y. Hung, “MPEG-2 video decoding on the TMS320C6X DSP architecture,” IEEE Signal Systems Computer Conf., vol. 2, Nov. 1998, pp. 1735-1739.
[2] C. Fogg, “Survey of software and hardware VLC architectures,” SPIE vol. 2186, Image and Video Compression, 1994.
9 VLD Problem 1: It Does Not Pay to Use Both Clusters
Because the length of the next codeword is unknown until it has been decoded, the two-cluster architecture cannot be exploited
If we do not use the block-based coding type, the program flow is simpler because there are fewer branches
10 VLD Problem 2: Memory vs. Performance
There is a tradeoff between the memory required for the VLD lookup table and the number of cycles used
Different VLD methods have been proposed in the literature
  Some analysis is given in later slides
Performance is limited on deeply pipelined processors because of the significant memory access time [1]
11 VLD Methods [1], [2]
Bit-by-bit matching
  Read the bitstream one bit at a time and check for a match after each bit
Multiple-pass lookups
  Split the table into 3 parts and read the tables at most 3 times
Bounded multiple-pass lookups
  Also 3 tables, but the bitstream is read only once
One-table lookup
  Only one table, which is read only once
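The two extremes in the list above, bit-by-bit matching and one-table lookup, can be contrasted with a small sketch. The four-symbol prefix code below is a made-up toy and is not the MPEG-4 VLC table; `CODEBOOK`, `decode_bit_by_bit`, `build_one_table`, and `decode_one_table` are hypothetical names, and the input is assumed to be a well-formed bit string.

```python
# Toy prefix code (an assumption for illustration, NOT the MPEG-4 table).
CODEBOOK = {"1": "A", "01": "B", "001": "C", "000": "D"}
MAX_LEN = max(len(code) for code in CODEBOOK)


def decode_bit_by_bit(bits):
    """Read one bit at a time and test for a codeword after each bit."""
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in CODEBOOK:          # one check (branch) per bit read
            symbols.append(CODEBOOK[current])
            current = ""
    return symbols


def build_one_table():
    """Build a table with 2**MAX_LEN entries of (symbol, code length)."""
    table = [None] * (1 << MAX_LEN)
    for code, symbol in CODEBOOK.items():
        pad = MAX_LEN - len(code)
        base = int(code, 2) << pad
        for fill in range(1 << pad):     # every suffix maps to the same symbol
            table[base + fill] = (symbol, len(code))
    return table


def decode_one_table(bits, table):
    """Peek MAX_LEN bits, do one lookup, then advance by the code length."""
    symbols, pos = [], 0
    padded = bits + "0" * MAX_LEN        # lets the final peek run past the end
    while pos < len(bits):
        window = int(padded[pos:pos + MAX_LEN], 2)
        symbol, length = table[window]
        symbols.append(symbol)
        pos += length
    return symbols


TABLE = build_one_table()
STREAM = "1" + "01" + "001" + "000" + "1"
```

Both decoders return the same symbols for `STREAM`. The single table needs `2**MAX_LEN` entries, which is the memory side of the tradeoff, while the bit-by-bit loop needs no table beyond the codebook but branches once per bit.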
12 Bit-by-bit Matching: Method Test VLC Table from MPEG-4 Standard
13 Comparison
14 Conclusion of VLD on PAC
Branch and jump instructions degrade performance
Bitstream reading and memory accesses also cost many cycles, so we should try to reduce their frequency
The bounded multiple-pass method appears to offer the best tradeoff between required memory size and speed among the analyzed methods
16 References
[1] D. A. Patterson and J. L. Hennessy, “Computer Organization and Design: The Hardware/Software Interface”, sec. 4.7 “Division”
[2] M. D. Ercegovac and T. Lang, “Digital Arithmetic”, sec. 1.6 “Basic Division Algorithms”
17 Why Do We Need Efficient Constant Division?
Disadvantages of a hardware divider
  Larger area and more power consumption
  Several cycles required for each division
Several DSPs have no hardware divider
  That is, no division instructions are supported
Algorithms that perform division using only addition and multiplication are necessary
18 If We Use a Table Lookup
The most efficient method when multiplication is supported
Disadvantages
  Unknown divisor
    QP is a user-defined value, so the table must cover all possible QP values (3 to 9 bits)
  Precision
    Example: dividend = 0xFFFF (65535), divisor = 0x1C (28)
    Exact result: q = 2340
    1/28 = 0.035714285, which becomes 1170 with scale 32768
    65535 x 1170 / 32768 = 2339
Can be adopted if the round-to-nearest-integer rule is used
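The precision example above can be reproduced with a few lines of fixed-point arithmetic. This is a minimal sketch: the scale of 32768 (15 bits) comes from the slide, while the function names and the 5-bit divisor range used for `RECIP_TABLE` are assumptions made for illustration.

```python
SCALE_BITS = 15                      # scale 32768 = 2**15, as on the slide


def reciprocal(divisor):
    """Truncated fixed-point reciprocal: floor(2**15 / divisor)."""
    return (1 << SCALE_BITS) // divisor


def div_truncate(dividend, recip):
    """Multiply by the stored reciprocal and drop the fraction."""
    return (dividend * recip) >> SCALE_BITS


def div_round(dividend, recip):
    """Same product, but rounded to the nearest integer."""
    return (dividend * recip + (1 << (SCALE_BITS - 1))) >> SCALE_BITS


# Hypothetical table for divisors 1..31 (index 0 is unused).
RECIP_TABLE = [0] + [reciprocal(d) for d in range(1, 32)]

recip_28 = RECIP_TABLE[28]                 # 1170
truncated = div_truncate(0xFFFF, recip_28)  # 2339, one less than 65535 // 28
rounded = div_round(0xFFFF, recip_28)       # 2340
```

The truncated product loses one unit because 1170 is slightly smaller than 32768/28; adding half the scale before shifting recovers 2340 for this input.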
19 Simple Idea But Bad Result
Idea
  For positive integers, we can repeatedly subtract the divisor from the dividend and check whether the result has become negative
Result
  With dividend 0x8000 and divisor 0x8, one division takes 0x1000 (4096) iterations
  Very inefficient
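The iteration count quoted above follows directly from the loop structure. A minimal sketch, with `divide_by_subtraction` as a hypothetical name and positive integer inputs assumed:

```python
def divide_by_subtraction(dividend, divisor):
    """Divide by repeated subtraction; returns (quotient, remainder, iterations)."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    quotient = 0
    remainder = dividend
    iterations = 0
    while remainder >= divisor:      # one subtraction per unit of quotient
        remainder -= divisor
        quotient += 1
        iterations += 1
    return quotient, remainder, iterations


result = divide_by_subtraction(0x8000, 0x8)
```

The iteration count equals the quotient, so the running time grows with the size of the result rather than with the word length.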
20 Introduction to the Algorithms
Restoring algorithms
  “Grammar School Algorithm Ver.1” [1]
  “Grammar School Algorithm Ver.2” [1]
  “Grammar School Algorithm Ver.3” [1]
  “Algorithm Restoring Divide (RD)” [2]
  “Algorithm Non-performing Divide (NPD)” [2]
Non-restoring algorithm
  “Algorithm Non-restoring Divide (NRD)” [2]
21 Use the Idea of Long Division: The Grammar School Algorithm
                  1 0 0 1      Quotient
Divisor  1 0 0 0 | 1 0 0 1 0 1 0    Dividend
                 - 1 0 0 0
                         1 0
                         1 0 1
                         1 0 1 0
                       - 1 0 0 0
                             1 0    Remainder
22 Grammar School Algorithm Ver.1 [1]
[Initialize]
  rem = dividend
[Recurrence]
  for j = 0…n
    rem = rem - d^;
    if rem >= 0
      quo << 1; quo[0] = 1;
    else
      rem = rem + d^; quo << 1; quo[0] = 0;
    d^ >> 1;
  endfor
d^ means the divisor aligned to the MSB half
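A direct transcription of the Ver.1 recurrence is sketched below. It assumes unsigned n-bit operands with n = 16 and a divisor shifted left by n bits as d^; the function name is hypothetical, and Python integers are unbounded, so the register widths a DSP implementation would need are not modelled.

```python
def divide_v1(dividend, divisor, n=16):
    """Restoring division, Ver.1: the divisor moves right, n + 1 iterations."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    rem = dividend
    d_hat = divisor << n             # divisor aligned to the upper half
    quo = 0
    for _ in range(n + 1):
        rem -= d_hat                 # trial subtraction
        if rem >= 0:
            quo = (quo << 1) | 1
        else:
            rem += d_hat             # restore the previous remainder
            quo = quo << 1
        d_hat >>= 1                  # shift the divisor right for the next step
    return quo, rem
```

For example, `divide_v1(7, 2)` yields quotient 3 and remainder 1. The first of the n + 1 iterations can never succeed for an n-bit dividend, which is the wasted step that Ver.2 removes.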
23 Grammar School Algorithm Ver.2
[Initialize]
  rem = dividend; rem << 1;
[Recurrence]
  for j = 0…n-1
    rem = rem - d^;
    if rem >= 0
      rem << 1; quo << 1; quo[0] = 1;
    else
      rem = rem + d^; rem << 1; quo << 1; quo[0] = 0;
  endfor
[Correction]
  rem = rem >> 17
d^ means the divisor aligned to the MSB half
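A sketch of the Ver.2 recurrence follows, again assuming unsigned 16-bit operands and using a hypothetical function name. With one shift in the initialization and one per iteration, the remainder in this sketch ends up shifted left by n + 1 bits, so the final correction shifts right by n + 1.

```python
def divide_v2(dividend, divisor, n=16):
    """Restoring division, Ver.2: the remainder moves left, n iterations."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    d_hat = divisor << n             # divisor aligned to the upper half
    rem = dividend << 1              # initial left shift
    quo = 0
    for _ in range(n):
        rem -= d_hat                 # trial subtraction
        if rem >= 0:
            quo = (quo << 1) | 1
        else:
            rem += d_hat             # restore
            quo = quo << 1
        rem <<= 1                    # shift the remainder in both branches
    return quo, rem >> (n + 1)       # undo the n + 1 left shifts
```

Compared with Ver.1, the divisor stays fixed and the loop runs n times instead of n + 1.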
24 Grammar School Algorithm Ver.3
[Initialize]
  rem = dividend; rem << 1;
[Recurrence]
  for j = 0…n-1
    LHS(rem) = LHS(rem) - d^;
    if rem >= 0
      rem << 1; rem[0] = 1;
    else
      LHS(rem) = LHS(rem) + d^; rem << 1; rem[0] = 0;
  endfor
[Correction]
  quo = rem & 0xFFFF; rem = rem >> 17
d^ means the divisor aligned to the MSB half
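Ver.3 stores the quotient bits in the low half of the remainder register, which the sketch below imitates with a single integer. It assumes unsigned 16-bit operands; the function name is hypothetical, and subtracting d^ (a multiple of 2^n) is used in place of the LHS(rem) update since it leaves the low n bits untouched.

```python
def divide_v3(dividend, divisor, n=16):
    """Restoring division, Ver.3: quotient bits share the remainder register."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    d_hat = divisor << n             # only affects bits n and above
    rem = dividend << 1              # initial left shift
    for _ in range(n):
        rem -= d_hat                 # trial subtraction on the upper half
        if rem >= 0:
            rem = (rem << 1) | 1     # shift in a quotient bit of 1
        else:
            rem += d_hat             # restore
            rem = rem << 1           # shift in a quotient bit of 0
    quo = rem & ((1 << n) - 1)       # low half holds the quotient
    remainder = rem >> (n + 1)       # upper half holds the remainder, shifted once extra
    return quo, remainder
```

The quotient has to be extracted before the register is shifted down for the remainder, which is the extra step this version pays at the end.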
25 Algorithm RD
[Initialize]
  rem = dividend;
[Recurrence]
  for j = 0…n-1
    rem’ = 2*rem - d^;
    if rem’ >= 0
      rem = rem’; quo << 1; quo[0] = 1;
    else
      rem = rem’ + d^; quo << 1; quo[0] = 0;
  endfor
d^ means the divisor aligned to the MSB half
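The restoring recurrence can be sketched as follows, assuming unsigned 16-bit operands and the usual textbook form in which a negative trial result is restored by adding d^ back. The function name is hypothetical, and the final right shift by n is an assumption that converts the scaled remainder back to an integer.

```python
def divide_rd(dividend, divisor, n=16):
    """Restoring divide (RD): compute the trial value, restore if negative."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    d_hat = divisor << n             # divisor aligned to the upper half
    rem = dividend
    quo = 0
    for _ in range(n):
        trial = 2 * rem - d_hat      # always perform the subtraction
        if trial >= 0:
            rem = trial
            quo = (quo << 1) | 1
        else:
            rem = trial + d_hat      # restoring step, equals 2 * rem
            quo = quo << 1
    return quo, rem >> n             # remainder is scaled by 2**n
```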
26 Algorithm NPD
[Initialize]
  rem = dividend;
[Recurrence]
  for j = 0…n-1
    if (2*rem - d^) >= 0
      quo << 1; quo[0] = 1; rem = 2*rem - d^;
    else
      quo << 1; quo[0] = 0; rem = 2*rem;
  endfor
d^ means the divisor aligned to the MSB half
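The non-performing variant differs from RD only in that the subtraction result is discarded rather than restored when it would be negative. A minimal sketch under the same assumptions (unsigned 16-bit operands, hypothetical function name, remainder scaled by 2^n at the end):

```python
def divide_npd(dividend, divisor, n=16):
    """Non-performing divide (NPD): commit the subtraction only if it succeeds."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    d_hat = divisor << n             # divisor aligned to the upper half
    rem = dividend
    quo = 0
    for _ in range(n):
        if 2 * rem - d_hat >= 0:
            rem = 2 * rem - d_hat    # subtraction is performed
            quo = (quo << 1) | 1
        else:
            rem = 2 * rem            # subtraction is skipped, only the shift remains
            quo = quo << 1
    return quo, rem >> n             # remainder is scaled by 2**n
```

No add-back is ever needed, at the cost of a comparison on a value that is not yet stored.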
27 Algorithm NRD
[Initialize]
  rem = dividend; rem = 2*rem - d^;
[Recurrence]
  for j = 0…n-1
    if rem >= 0
      quo << 1; quo[0] = 1; rem = 2*rem - d^;
    else
      quo << 1; quo[0] = 0; rem = 2*rem + d^;
  endfor
[Correction]
  if rem < 0
    quo[0] = 0; rem = rem + d^;
  else
    quo[0] = 0;
d^ means the divisor aligned to the MSB half
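A sketch of non-restoring division in its common textbook form is given below. It is an assumption-laden illustration rather than a transcription of the slide: the first subtraction is folded into the loop, each quotient bit is taken from the sign of the new partial remainder, and the only correction is a final add-back when the remainder is negative. Operands are assumed to be unsigned 16-bit values and the function name is hypothetical.

```python
def divide_nrd(dividend, divisor, n=16):
    """Non-restoring divide (NRD): add or subtract d^ based on the sign of rem."""
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    d_hat = divisor << n             # divisor aligned to the upper half
    rem = dividend
    quo = 0
    for _ in range(n):
        if rem >= 0:
            rem = 2 * rem - d_hat    # previous step succeeded: subtract
        else:
            rem = 2 * rem + d_hat    # previous step overshot: add back while shifting
        quo = (quo << 1) | (1 if rem >= 0 else 0)
    if rem < 0:                      # final correction of a negative remainder
        rem += d_hat
    return quo, rem >> n             # remainder is scaled by 2**n
```

Each iteration performs exactly one addition or subtraction and never a restore, so the loop body is branch-symmetric.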
28 Comparison between Different Versions of the Grammar School Algorithm
Simulation results
  Version 1: 168 cycles
  Version 2: 161 cycles
  Version 3: 162 cycles
Why does Version 3 not give a significant improvement?
  The limitation arises from the instruction latencies and delay slots of PAC
  Thus, the other three algorithms (RD, NPD, NRD) cannot do better
29 Grammar School Algorithm Ver.3 performance: 162 cycles
30 Conclusion
Grammar School Algorithm Ver.2 and Ver.3 have almost the same performance because of the latencies and delay slots
  If the latency of the comparison instructions were lower, Ver.3 would be better
  Ver.3 needs one more cycle to extract the quotient
Ver.2 is better than Ver.1 because it requires fewer iterations
The table-lookup method may be adopted while the implementation is still on the simulator