Published by 裁刊 梅. Modified over 8 years ago.
1
An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm
Miaoqing Huang Nov. 5, 2010
2
Outline
Background
Optimized hardware architecture
  Avoid the extra clock cycle delay
  The overall architecture
  Each PE focuses on the computation of one word of S
  The data dependency graph of the proposed architecture
  Comparison with other published architectures
  Demonstration of computation
  Resource utilization and performance comparison
  High-radix architecture
Conclusion
Reference
3
Background
The Montgomery multiplication algorithm is used in modular exponentiation to avoid division by the modulus M. One implementation is the radix-2 Montgomery multiplication algorithm. It takes n-bit operands X, Y, and M (with M odd) and returns the n-bit Montgomery product S = X • Y • 2^(-n) mod M.
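The bit-serial loop behind the radix-2 algorithm can be sketched in Python as follows. This is a minimal software model, not the hardware design from the slides. The function name `mont_mul_radix2` and the final conditional subtraction are assumptions made for illustration, and the sketch assumes that M is odd and that X and Y are smaller than M. Under those assumptions the value returned is the Montgomery product X • Y • 2^(-n) mod M.

```python
def mont_mul_radix2(x, y, m, n):
    """Radix-2 Montgomery multiplication (illustrative model).

    Assumes m is odd, m < 2**n, and 0 <= x, y < m.
    Returns x * y * 2**(-n) mod m.
    """
    s = 0
    for i in range(n):
        xi = (x >> i) & 1      # scan X one bit at a time, LSB first
        s += xi * y            # add Y when the current bit of X is 1
        if s & 1:              # odd partial sum: add M to make it even
            s += m
        s >>= 1                # exact division by 2, no trial division by M
    # The loop keeps s below 2*m, so one subtraction is enough.
    return s - m if s >= m else s
```

For example, `mont_mul_radix2(123, 45, 239, 8)` can be checked against `123 * 45 * pow(2, -8, 239) % 239` (negative exponents in `pow` need Python 3.8 or later).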
4
Background (cont.)
Multiple-Word Radix-2 Montgomery Multiplication (MWR2MM) algorithm:
Scan X bit by bit, and scan Y and M word by word.
Calculate S word by word.
Well suited to hardware implementation because carry propagation is short.
Some definitions:
n: the bit length of the original operands
w: the word length used in the actual computation
e = ⌈(n+1)/w⌉: the number of words needed to store S
S(j): the j-th word of S
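The word-by-word scan can be modelled in software roughly as below. This is a sketch under stated assumptions rather than the algorithm text from the slides: the helper names (`to_words`, `from_words`, `mwr2mm`), the single merged addition per word, and the final conditional subtraction are all illustrative choices. It assumes M is odd, M < 2^n, and X, Y < M, and uses e = ⌈(n+1)/w⌉ words.

```python
def to_words(v, w, e):
    """Split integer v into e words of w bits, least significant first."""
    mask = (1 << w) - 1
    return [(v >> (w * j)) & mask for j in range(e)]

def from_words(words, w):
    """Reassemble an integer from w-bit words, least significant first."""
    return sum(word << (w * j) for j, word in enumerate(words))

def mwr2mm(x, y, m, n, w):
    """Multiple-word radix-2 Montgomery multiplication (illustrative model)."""
    e = -(-(n + 1) // w)                 # ceil((n + 1) / w)
    mask = (1 << w) - 1
    Y, M = to_words(y, w, e), to_words(m, w, e)
    S = [0] * e
    for i in range(n):                   # scan X bit by bit
        xi = (x >> i) & 1
        q = (S[0] + xi * Y[0]) & 1       # parity decides whether M is added
        carry = 0
        for j in range(e):               # scan Y and M word by word
            t = S[j] + xi * Y[j] + q * M[j] + carry
            S[j], carry = t & mask, t >> w
            if j > 0:
                # one-bit right shift: bit 0 of S(j) becomes the top bit of S(j-1)
                S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)
        S[e - 1] = (carry << (w - 1)) | (S[e - 1] >> 1)
    s = from_words(S, w)
    return s - m if s >= m else s
```

The shift of word j-1 needs bit 0 of word j, which is the dependency the later slides refer to.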
5
Background (cont.)
Data dependency in the original architecture [4] for the MWR2MM algorithm.
Task A consists of three steps:
Test the parity of the least significant bit of S.
Add the words from S, x_i•Y, and M, as applicable.
Shift a word of S right by one bit.
Task B corresponds to the last two steps of Task A.
[4] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for Montgomery multiplication. In CHES 99, LNCS, vol.1717, 1999
6
Background (cont.)
One processing element (PE) is in charge of the computation of one column, which corresponds to updating S with respect to a single bit x_i of X. The delay between two adjacent PEs is 2 clock cycles. The minimum computation time is 2•n+e clock cycles, provided that (e+1)/2 PEs are implemented to work in parallel.
7
Avoid the extra clock cycle delay
The origin of the extra clock cycle delay: computing S(j-1) for the next round requires one extra bit from S(j) of the current round, namely its least significant bit S(j)_0.
Solution: compute both possible results of S(j) for the next round in the same clock cycle in which S(j+1) of the current round is computed, and select the correct one at the end of that clock cycle.
8
Avoid the extra clock cycle delay (cont.)
One single PE is responsible for updating one fixed word of S.
It has two branches, corresponding to the two possible values of S(i+1)_0, the least significant bit of S(i+1).
The correct results, the carry and the most significant bit S(i)_(w-1), are selected from the two sets of candidate results by S(i+1)_0. Both sets are available and registered at the same moment.
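The two-branch idea can be illustrated with a small Python model: the word addition is carried out for both possible values of the missing top bit, and the right result is picked once that bit is known. This is only a software analogy of the pre-compute-and-select step, not the PE datapath from the slides. The function names and argument layout are hypothetical.

```python
def add_word(s_word, xi, y_word, q, m_word, carry_in, w):
    """One word-level update: returns (sum word, carry out)."""
    t = s_word + xi * y_word + q * m_word + carry_in
    return t & ((1 << w) - 1), t >> w

def speculate(low_bits, xi, y_word, q, m_word, carry_in, w):
    """Compute the update for both candidate values of the unknown top bit.

    low_bits holds the w-1 known bits of the shifted word; the top bit
    comes from the neighbouring word and is not available yet.
    """
    return [
        add_word((msb << (w - 1)) | low_bits, xi, y_word, q, m_word, carry_in, w)
        for msb in (0, 1)
    ]

def select(candidates, msb):
    """Pick the pre-computed result once the real top bit is known."""
    return candidates[msb]
```

With `w = 8`, `low_bits = 90`, `xi = 1`, `y_word = 200`, `q = 1`, `m_word = 173` and `carry_in = 1`, the two candidates are `(208, 1)` for a top bit of 0 and `(80, 2)` for a top bit of 1. Selecting with the real bit gives the same answer as waiting for it, without the extra cycle.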
9
The overall architecture
Every PE focuses on the computation of one single word of S.
Figures: the computation pattern of the architecture in [4]; the computation pattern of the proposed architecture.
10
The overall architecture (cont.)
The data dependency graph of the proposed architecture.
Task D consists of three steps:
Generate q_i.
Pre-compute two sets of data.
Select one of the two sets.
Task E corresponds to the last steps of Task D.
Task F (not visible in the graph) is responsible for computing S(e-1) and has only one branch.
11
The overall architecture (cont.)
e PEs are required, one for each of the e words of S. Two shift registers run in parallel with the PEs: one supplies the individual bits of X, and the other supplies the parity bits S(0)_0. (n+e-1) clock cycles are required to process the Montgomery multiplication of two n-bit operands.
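The cycle counts can be checked with a short Python sketch. It assumes that, in the proposed architecture, PE j consumes bit x_i in clock cycle i + j, which is an illustrative scheduling model and not a statement of the actual control logic. The function names are hypothetical, and the figure for the original architecture is simply the 2•n+e formula quoted in the slides.

```python
from math import ceil

def word_count(n, w):
    """e = ceil((n + 1) / w), the number of words of S."""
    return ceil((n + 1) / w)

def proposed_cycles(n, w):
    """Cycles for the proposed architecture under the i + j schedule."""
    e = word_count(n, w)
    # Assumption: PE j handles bit i of X in cycle i + j (cycles start at 0).
    last_cycle = max(i + j for i in range(n) for j in range(e))
    return last_cycle + 1

def original_cycles(n, w):
    """Minimum cycles quoted for the original architecture: 2n + e."""
    return 2 * n + word_count(n, w)
```

For n = 1024 and w = 16 this gives e = 65, 1088 cycles for the proposed architecture (equal to n+e-1) and 2113 cycles from the 2•n+e formula, which is roughly the factor of two claimed.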
12
Demonstration of computation
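Figure: computation schedules.
Sequential: S(0), S(1), S(2), ..., S(e-1) are updated in turn for X0.
Tenca & Koç's proposal: PE#0 takes X0 and updates S(0), S(1), S(2), S(3), S(4); PE#1 takes X1 and updates S(0), S(1), S(2); PE#2 takes X2 and updates S(0).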
13
Demonstration of computation (cont.)
The proposed optimized architecture (figure): each PE stays on one word of S and receives the bits of X in order.
PE#0: S(0) ← X0, X1, X2, X3, ..., Xe-1
PE#1: S(1) ← X0, X1, X2, ..., Xe-2
PE#2: S(2) ← X0, X1, ..., Xe-3
PE#3: S(3) ← X0, ..., Xe-4
PE#(e-1): S(e-1) ← X0
14
Resource utilization and performance comparison
Test platform: Xilinx Virtex-II 6000 FF1517-4
15
High-radix Architecture
The same optimization concept can be applied to a high-radix implementation.
The number of pre-computation branches is 2^k for radix 2^k.
Hardware implementation beyond radix-4 becomes less viable.
Comparison between the radix-2 and radix-4 versions of the proposed architecture (n=1024, w=16)
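To show what moving to a higher radix changes arithmetically, here is a hedged Python sketch of a radix-2^k Montgomery multiplication that consumes k bits of X per iteration. It is a generic textbook-style model, not the high-radix datapath from the slides. The function name and the use of `pow(m, -1, r)` (Python 3.8 or later) to obtain the quotient digit are assumptions made for illustration, and it assumes M is odd with X, Y < M.

```python
def mont_mul_radix2k(x, y, m, n, k):
    """Radix-2**k Montgomery multiplication (illustrative model).

    Returns x * y * 2**(-k * d) mod m, where d = ceil(n / k) is the
    number of k-bit digits of X that are processed.
    """
    r = 1 << k
    m_neg_inv = (-pow(m, -1, r)) % r     # -m^(-1) mod 2**k, needs m odd
    d = -(-n // k)                       # ceil(n / k) iterations
    s = 0
    for i in range(d):
        xi = (x >> (k * i)) & (r - 1)    # next k-bit digit of X
        s += xi * y
        q = (s * m_neg_inv) & (r - 1)    # digit that clears the low k bits
        s = (s + q * m) >> k             # exact division by 2**k
    return s - m if s >= m else s
```

With k = 2 the loop runs half as many times as the radix-2 version. In the optimized architecture the missing information per word grows from one bit to k bits, which is why the number of speculative branches grows as `2 ** k`.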
16
Conclusion
An optimized hardware architecture for implementing the MWR2MM algorithm is proposed.
The radix-2 version of this architecture takes (n+e-1) clock cycles to process the Montgomery multiplication of two n-bit operands.
Compared to the original architecture by Tenca & Koç, the new approach takes half the processing time and incurs an area penalty of less than 10%.
The same optimization technique can be applied to the original architecture by Tenca & Koç, keeping its scalability while halving the processing latency.
17
Reference
[1] Rivest, R.L., Shamir, A. and Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, vol.21, no.2, 1978
[2] Montgomery, P.L.: Modular multiplication without trial division. Mathematics of Computation, vol.44, 1985
[3] Gaj, K., et al.: Implementing the Elliptic Curve Method of Factoring in Reconfigurable Hardware. In CHES 2006, LNCS, vol.4249, 2006
[4] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for Montgomery multiplication. In CHES 99, LNCS, vol.1717, 1999
[5] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for modular multiplication based on Montgomery's algorithm. IEEE Trans. Computers, vol.52, no.9, 2003
[6] Tenca, A.F., Todorov, G. and Koç, Ç.K.: High-radix design of a scalable modular multiplier. In CHES 2001, LNCS, vol.2162, 2001
18
Reference
[7] Harris, D., Krishnamurthy, R., Anders, M., Mathew, S. and Hsu, S.: An Improved Unified Scalable Radix-2 Montgomery Multiplier. In Proc. ARITH 17, 2005
[8] Michalski, E.A. and Buell, D.A.: A scalable architecture for RSA cryptography on large FPGAs. In Proc. FPL 2006, 2006
[9] Koç, Ç.K., Acar, T. and Kaliski Jr., B.S.: Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, vol.16, no.3, 1996
[10] McIvor, C., McLoone, M. and McCanny, J.V.: High-Radix Systolic Modular Multiplication on Reconfigurable Hardware. In Proc. FPT 2005, 2005
[11] McIvor, C., McLoone, M. and McCanny, J.V.: Modified Montgomery Modular Multiplication and RSA Exponentiation Techniques. IEE Proceedings - Computers & Digital Techniques, vol.151, no.6, 2004