An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm Miaoqing Huang Nov. 5, 2010.



Outline
- Background
- Optimized hardware architecture
  - Avoid the extra clock cycle delay
  - The overall architecture
    - Each PE focuses on the computation of one word of S
    - The data dependency graph of the proposed architecture
  - Comparison with other published architectures
    - Demonstration of computation
    - Resource utilization and performance comparison
  - High-radix architecture
- Conclusion
- Reference

Background
The Montgomery multiplication algorithm is used in modular exponentiation to avoid division by the modulus, M.
The following is one implementation, the radix-2 Montgomery multiplication algorithm. It stands in for the modular product S = X • Y mod M, where S, X, Y and M are all n bits long; the value it actually returns is the Montgomery product X • Y • 2^(-n) mod M, with M odd.
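The algorithm listing from this slide did not survive in the transcript. As a minimal sketch of the standard bit-serial radix-2 form (not necessarily the exact listing shown on the slide), assuming M is odd and 0 <= X, Y < M < 2^n, the loop below adds x_i • Y, adds M when the running sum is odd, and then halves it. The function name `mont_mul_radix2` is hypothetical.

```python
def mont_mul_radix2(x: int, y: int, m: int, n: int) -> int:
    """Return x * y * 2^(-n) mod m, assuming m is odd and 0 <= x, y < m < 2^n."""
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    s = 0
    for i in range(n):
        x_i = (x >> i) & 1   # scan X one bit at a time, LSB first
        s += x_i * y         # conditionally add Y
        if s & 1:            # odd sum: adding M makes it even
            s += m
        s >>= 1              # exact division by 2
    # The loop keeps s below 2*m, so one subtraction completes the reduction.
    if s >= m:
        s -= m
    return s


if __name__ == "__main__":
    n, m = 8, 239
    x, y = 123, 45
    s = mont_mul_radix2(x, y, m, n)
    # Multiplying the result by 2^n undoes the Montgomery factor.
    print((s * 2**n) % m == (x * y) % m)
```

No division by M appears anywhere in the loop, which is the property the slide refers to.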

Background (cont.)
Multiple-Word Radix-2 Montgomery Multiplication (MWR2MM) Algorithm
- Scan X bit by bit, and scan Y and M word by word
- Calculate S word by word
- Easy to implement in hardware because of the short carry-propagation path
Some definitions
- n: the bit length of the original operands
- w: the word length used in the actual computation
- e = ⌈(n+1)/w⌉: the number of words needed to store S
- S(j): the j-th word of S
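The word-by-word scan can be sketched in software as follows. This is an illustrative model only, assuming M is odd, 0 <= X, Y < M < 2^n, and e = ceil((n+1)/w); the helper names `to_words` and `mwr2mm` are hypothetical, and the two carry chains of the hardware algorithm are merged into a single carry for brevity.

```python
def to_words(value: int, w: int, e: int) -> list:
    """Split an integer into e words of w bits each, least significant first."""
    mask = (1 << w) - 1
    return [(value >> (w * j)) & mask for j in range(e)]


def mwr2mm(x: int, y: int, m: int, n: int, w: int) -> int:
    """Word-serial radix-2 Montgomery product x * y * 2^(-n) mod m (m odd)."""
    e = -(-(n + 1) // w)                 # ceil((n + 1) / w) words hold S
    mask = (1 << w) - 1
    Y, M = to_words(y, w, e), to_words(m, w, e)
    S = [0] * e
    for i in range(n):
        x_i = (x >> i) & 1
        q = (S[0] + x_i * Y[0]) & 1      # parity decides whether M is added
        carry = 0
        for j in range(e):
            t = S[j] + x_i * Y[j] + q * M[j] + carry
            carry, word = t >> w, t & mask
            if j > 0:
                # Bit 0 of word j becomes the top bit of the shifted word j-1.
                S[j - 1] |= (word & 1) << (w - 1)
            S[j] = word >> 1
        S[e - 1] |= carry << (w - 1)     # final carry re-enters at the top
    s = sum(word << (w * j) for j, word in enumerate(S))
    return s - m if s >= m else s
```

The line that writes into `S[j - 1]` is the data dependency that matters for the architecture: word j-1 cannot be finished until bit 0 of word j is known.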

Background (cont.)
Data dependency in the original architecture [4] of the MWR2MM algorithm
Task A consists of three steps:
1. Test the least significant bit of S (its parity)
2. Add the words from S, x_i • Y, and M (if applicable)
3. Shift an S word right by one bit
Task B corresponds to the last two steps of Task A.
[4] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for Montgomery multiplication. In CHES 99, LNCS, vol.1717, pp.94--108, 1999

Background (cont.)
One PE (processing element) is in charge of the computation of one column, which corresponds to updating S with respect to one single bit x_i.
The delay between two adjacent PEs is 2 clock cycles.
The minimum computation time is 2•n+e clock cycles, provided that ⌈(e+1)/2⌉ PEs are implemented to work in parallel.

Avoid the extra clock cycle delay
The origin of the extra clock cycle delay
- The computation of S(j-1) (of the next round) requires one extra bit from S(j) (of the current round): its least significant bit, S(j)_0
Solution
- Compute the two possible results of S(j) (of the next round) in the same clock cycle in which S(j+1) (of the current round) is computed, and choose between them at the end of the clock cycle

Avoid the extra clock cycle delay (cont.)
- One single PE is responsible for updating one fixed word of S
- It has two branches, corresponding to the two possible values of S(i+1)_0
- The correct result, the carry and S(i)_(w-1), is selected from the two sets of possible results by S(i+1)_0, both available and registered at the same moment
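The two-branch idea can be illustrated with a small behavioral sketch. It is not the authors' circuit: the function names are hypothetical, and the model only shows that computing a word update for both possible values of the missing top bit, then selecting once that bit arrives, gives the same answer as waiting for the bit.

```python
def word_step(s_word, x_i, q_i, y_word, m_word, carry_in, w):
    """One word update: add, then split the sum into (carry_out, sum_word)."""
    t = s_word + x_i * y_word + q_i * m_word + carry_in
    return t >> w, t & ((1 << w) - 1)


def speculative_step(s_low, x_i, q_i, y_word, m_word, carry_in, w):
    """Compute both candidate results while the top bit of S is still unknown.

    s_low holds the lower w-1 bits of the S word; index 0 of the returned
    tuple assumes the missing top bit is 0, index 1 assumes it is 1.
    """
    return tuple(
        word_step((msb << (w - 1)) | s_low, x_i, q_i, y_word, m_word, carry_in, w)
        for msb in (0, 1)
    )


def select(candidates, msb):
    """Pick the candidate that matches the bit delivered by the next word."""
    return candidates[msb]
```

In hardware the two branches run side by side and the selection is a multiplexer, so the cost is extra adder area rather than an extra clock cycle.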

The overall architecture
Every PE focuses on the computation of one single word of S.
[Figure: the computation pattern of the architecture in [4]]
[Figure: the computation pattern of the proposed architecture]

The overall architecture (cont.)
The data dependency graph of the proposed architecture
Task D consists of three steps:
1. Generate q_i
2. Pre-compute two sets of data
3. Select one set from the two
Task E corresponds to the last steps of Task D.
Task F (invisible in the graph) is responsible for computing S(e-1); it has only one branch.

The overall architecture (cont.)
e PEs are required to compute the e words of S, one word per PE.
Two shift registers run in parallel with these PEs: one provides the single bits of X, and the other provides the parity bits S(0)_0.
(n+e-1) clock cycles are required to process the Montgomery multiplication of two n-bit operands.
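To put the two latency formulas side by side, the sketch below evaluates 2•n+e (the figure quoted earlier for the architecture in [4]) and n+e-1 (the proposed architecture). It assumes e = ceil((n+1)/w); the function names are hypothetical.

```python
def num_words(n: int, w: int) -> int:
    """e = ceil((n + 1) / w), the number of w-bit words that hold S."""
    return -(-(n + 1) // w)


def cycles_original(n: int, w: int) -> int:
    """Minimum clock cycles quoted for the architecture in [4]: 2*n + e."""
    return 2 * n + num_words(n, w)


def cycles_proposed(n: int, w: int) -> int:
    """Clock cycles quoted for the proposed architecture: n + e - 1."""
    return n + num_words(n, w) - 1


if __name__ == "__main__":
    n, w = 1024, 16
    print(num_words(n, w), cycles_original(n, w), cycles_proposed(n, w))
```

For n = 1024 and w = 16 this gives e = 65, 2113 cycles for the original and 1088 cycles for the proposed design, a ratio of roughly one half.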

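Demonstration of computation
Sequential: for each bit of X (X0 first), the words S(0), S(1), S(2), ..., S(e-1) are updated one after another.
Tenca & Koç's proposal (each PE starts two clock cycles after the previous one):
Clock        0     1     2     3     4
PE#0 (←X0)   S(0)  S(1)  S(2)  S(3)  S(4)
PE#1 (←X1)               S(0)  S(1)  S(2)
PE#2 (←X2)                           S(0)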

Demonstration of computation (cont.)
The proposed optimized architecture (PE#j always updates S(j) and starts one clock cycle after PE#(j-1)):
Clock              0     1     2     3     ...  e-1
PE#0, S(0)         ←X0   ←X1   ←X2   ←X3   ...  ←Xe-1
PE#1, S(1)               ←X0   ←X1   ←X2   ...  ←Xe-2
PE#2, S(2)                     ←X0   ←X1   ...  ←Xe-3
PE#3, S(3)                           ←X0   ...  ←Xe-4
...
PE#(e-1), S(e-1)                                ←X0
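The staggered pattern of the proposed architecture can be generated programmatically. The sketch below assumes, as the diagram suggests, that PE#j consumes bit x_i in clock cycle i + j; the function name `proposed_schedule` is hypothetical.

```python
def proposed_schedule(n: int, e: int) -> dict:
    """Map each clock cycle to the (pe_index, x_bit_index) pairs active in it.

    Assumes PE#j, which owns word S(j), processes bit x_i in cycle i + j.
    """
    schedule = {}
    for j in range(e):          # one PE per word of S
        for i in range(n):      # every PE sees every bit of X
            schedule.setdefault(i + j, []).append((j, i))
    return schedule


if __name__ == "__main__":
    for clock, work in sorted(proposed_schedule(n=4, e=3).items()):
        print(clock, work)
```

Under this assumption the last pair, PE#(e-1) with x_(n-1), falls in cycle n+e-2, so the schedule spans n+e-1 cycles in total, matching the count stated for the architecture.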

Resource utilization and performance comparison Test platform: Xilinx Virtex-II 6000 FF1517-4

High-radix Architecture
- The same optimization concept can be applied to high-radix implementations
- The number of pre-computation branches is 2^k for radix 2^k
- Hardware implementation beyond radix-4 becomes less viable
Comparison between the radix-2 and radix-4 versions of the proposed architecture (n=1024, w=16)

Conclusion
- An optimized hardware architecture for implementing the MWR2MM algorithm is proposed
- The radix-2 version of this architecture takes (n+e-1) clock cycles to process the Montgomery multiplication of two n-bit operands
- Compared to the original architecture by Tenca & Koç, the new approach takes about half the processing time and incurs an area penalty of less than 10%
- The same optimization technique can be applied to the original architecture by Tenca & Koç, keeping its scalability while halving the processing latency

Reference
[1] Rivest, R.L., Shamir, A. and Adleman, L.: A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, vol.21, no.2, pp.120--126, 1978
[2] Montgomery, P.L.: Modular multiplication without trial division. Mathematics of Computation, vol.44, no.170, pp.519--521, 1985
[3] Gaj, K., et al.: Implementing the Elliptic Curve Method of Factoring in Reconfigurable Hardware. In CHES 2006, LNCS, vol.4249, pp.119--133, 2006
[4] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for Montgomery multiplication. In CHES 99, LNCS, vol.1717, pp.94--108, 1999
[5] Tenca, A.F. and Koç, Ç.K.: A scalable architecture for modular multiplication based on Montgomery's algorithm. IEEE Trans. Computers, vol.52, no.9, pp.1215--1221, 2003
[6] Tenca, A.F., Todorov, G. and Koç, Ç.K.: High-radix design of a scalable modular multiplier. In CHES 2001, LNCS, vol.2162, pp.185--201, 2001

Reference (cont.)
[7] Harris, D., Krishnamurthy, R., Anders, M., Mathew, S. and Hsu, S.: An Improved Unified Scalable Radix-2 Montgomery Multiplier. In Proc. ARITH 17, pp.172--178, 2005
[8] Michalski, E.A. and Buell, D.A.: A scalable architecture for RSA cryptography on large FPGAs. In Proc. FPL 2006, pp.145--152, 2006
[9] Koç, Ç.K., Acar, T. and Kaliski Jr., B.S.: Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, vol.16, no.3, pp.26--33, 1996
[10] McIvor, C., McLoone, M. and McCanny, J.V.: High-Radix Systolic Modular Multiplication on Reconfigurable Hardware. In Proc. FPT 2005, pp.13--18, 2005
[11] McIvor, C., McLoone, M. and McCanny, J.V.: Modified Montgomery Modular Multiplication and RSA Exponentiation Techniques. IEE Proceedings -- Computers & Digital Techniques, vol.151, no.6, pp.402--408, 2004