An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm Miaoqing Huang 1, Kris Gaj 2, Soonhak Kwon 3, Tarek El-Ghazawi 1 1 The George.

Slides:

Advertisements

Similar presentations

Thapliyal 1MAPLD 2005/1011 A High Speed and Efficient Method of Elliptic Curve Encryption Using Ancient Indian Vedic Mathematics Himanshu Thapliyal and.

Advertisements

Fast Modular Reduction

Advanced Information Security 4 Field Arithmetic

1 EFFICIENT ADDERS TO SPEEDUP MODULAR MULTIPLICATION FOR CRYPTOGRAPHY Adnan Gutub Hassan Tahhan Computer Engineering Department KFUPM, Dhahran, SAUDI ARABIA.

UNIVERSITY OF MASSACHUSETTS Dept

Copyright 2008 Koren ECE666/Koren Part.6b.1 Israel Koren Spring 2008 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer.

Copyright 2008 Koren ECE666/Koren Part.9b.1 Israel Koren Spring 2008 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer.

An Expandable Montgomery Modular Multiplication Processor Adnan Abdul-Aziz GutubAlaaeldin A. M. Amin Computer Engineering Department King Fahd University.

CHES20021 Scalable and Unified Hardware to Compute Montgomery Inverse in GF(p) and GF(2 n ) A. Gutub, A. Tenca, E. Savas, and C. Koc Information Security.

Copyright 2008 Koren ECE666/Koren Part.6a.1 Israel Koren Spring 2008 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer.

M. Interleaving Montgomery High-Radix Comparison Improvement Adders CLA CSK Comparison Conclusion Improving Cryptographic Architectures by Adopting Efficient.

1 Montgomery Multiplication David Harris and Kyle Kelley Harvey Mudd College Claremont, CA {David_Harris,

Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.

Multiplication.

1 DSP Implementation on FPGA Ahmed Elhossini ENGG*6090 : Reconfigurable Computing Systems Winter 2006.

GPGPU platforms GP - General Purpose computation using GPU

FPGA Based Fuzzy Logic Controller for Semi- Active Suspensions Aws Abu-Khudhair.

Montgomery multiplication Algorithm Mohammad Farmani Under supervision of : Dr. S. Bayat-sarmadi 2 nd. Semister, Sharif University of Technology.

Maria-Cristina Marinescu Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology A Synthesis Algorithm for Modular Design of.

IEEE Globecom-2006, NXG-02: Broadband Access ©Copyright All Rights Reserved 1 FPGA based Acceleration of Linear Algebra Computations. B.Y. Vinay.

Montgomery Multipliers & Exponentiation Units

Long Modular Multiplication for Cryptographic Applications Laszlo Hars Seagate Research Workshop on Cryptographic Hardware and Embedded Systems, CHES 2004.

03/12/20101 Analysis of FPGA based Kalman Filter Architectures Arvind Sudarsanam Dissertation Defense 12 March 2010.

Chapter 6-2 Multiplier Multiplier Next Lecture Divider

1 Miodrag Bolic ARCHITECTURES FOR EFFICIENT IMPLEMENTATION OF PARTICLE FILTERS Department of Electrical and Computer Engineering Stony Brook University.

CS 627 Elliptic Curves and Cryptography Paper by: Aleksandar Jurisic, Alfred J. Menezes Published: January 1998 Presented by: Sagar Chivate.

Efficient FPGA Implementation of QR

Parallel Computing Using FPGA ( Field Programmable Gate Arrays ) 15 th May, 2009 Studies in Parallel & Distributed Systems – Sohaib Ahmed.

07/19/2005 Arithmetic / Logic Unit – ALU Design Presentation F CSE : Introduction to Computer Architecture Slides by Gojko Babić.

Follow-up Courses. ECE Department MS in Electrical Engineering MS EE MS in Computer Engineering MS CpE COMMUNICATIONS & NETWORKING SIGNAL PROCESSING CONTROL.

Decimal Multiplier on FPGA using Embedded Binary Multipliers Authors: H. Neto and M. Vestias Conference: Field Programmable Logic and Applications (FPL),

Reconfigurable Computing - Multipliers: Options in Circuit Design John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on.

HW/SW PARTITIONING OF FLOATING POINT SOFTWARE APPLICATIONS TO FIXED - POINTED COPROCESSOR CIRCUITS - Nalini Kumar Gaurav Chitroda Komal Kasat.

Sequential Multipliers Lecture 9. Required Reading Chapter 9, Basic Multiplication Scheme Chapter 10, High-Radix Multipliers Chapter 12.3, Bit-Serial.

Digital Kommunikationselektronik TNE027 Lecture 2 1 FA x n –1 c n c n1- y n1– s n1– FA x 1 c 2 y 1 s 1 c 1 x 0 y 0 s 0 c 0 MSB positionLSB position Ripple-Carry.

High Performance Scalable Base-4 Fast Fourier Transform Mapping Greg Nash Centar 2003 High Performance Embedded Computing Workshop

Implementation of Finite Field Inversion

1HMC VLSI Lab Very High Radix Montgomery Multiplication David Harris, Kyle Kelley and Ted Jiang Harvey Mudd College Claremont, CA Supported by Intel Circuit.

05/04/06 1 Integrating Logic Synthesis, Tech mapping and Retiming Presented by Atchuthan Perinkulam Based on the above paper by A. Mishchenko et al, UCAL.

Gaj1P230/MAPLD 2004 Elliptic Curve Cryptography over GF(2 m ) on a Reconfigurable Computer: Polynomial Basis vs. Optimal Normal Basis Representation Comparative.

1 Dividers Lecture 10. Required Reading Chapter 13, Basic Division Schemes 13.1, Shift/Subtract Division Algorithms 13.3, Restoring Hardware Dividers.

J. Greg Nash ICNC 2014 High-Throughput Programmable Systolic Array FFT Architecture and FPGA Implementations J. Greg.

BCRYPT ECC-Day 2008 Requirements, Algorithms, Architectures The design space of ECC hardware.

Task Graph Scheduling for RTR Paper Review By Gregor Scott.

Algorithm and Programming Considerations for Embedded Reconfigurable Computers Russell Duren, Associate Professor Engineering And Computer Science Baylor.

Unconventional Fixed-Radix Number Systems

ECE 645 Spring 2007 PROJECT 2 Specification. Topic Options.

Copyright 2008 Koren ECE666/Koren Part.7b.1 Israel Koren Spring 2008 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer.

An automated pipeline balancing in the SRC Reconfigurable Computer and its application to the RC5 cipher breaking Hatim Diab 1, Miaoqing Huang 1, Kris.

Recursive Architectures for 2DLNS Multiplication RESEARCH CENTRE FOR INTEGRATED MICROSYSTEMS - UNIVERSITY OF WINDSOR 11 Recursive Architectures for 2DLNS.

High-Radix Sequential Multipliers Bit-Serial Multipliers Modular Multipliers Lecture 9.

1 Hardware-Software Co-Synthesis of Low Power Real-Time Distributed Embedded Systems with Dynamically Reconfigurable FPGAs Li Shang and Niraj K.Jha Proceedings.

CSE 8351 Computer Arithmetic Fall 2005 Instructors: Peter-Michael Seidel.

Introduction to Elliptic Curve Cryptography CSCI 5857: Encoding and Encryption.

Reconfigurable Computing - Options in Circuit Design John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound,

Efficient Montgomery Modular Multiplication Algorithm Using Complement and Partition Techniques Speaker: Te-Jen Chang.

Array Multiplier Haibin Wang Qiong Wu. Outlines Background & Motivation Principles Implementation & Simulation Advantages & Disadvantages Conclusions.

An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm Miaoqing Huang Nov. 5, 2010.

Supported in part by NIST/U.S. Department of Commerce

Montgomery Modular Multiplication

Morgan Kaufmann Publishers

Sequential Multipliers

UNIVERSITY OF MASSACHUSETTS Dept

Elliptic Curve Cryptography over GF(2m) on a Reconfigurable Computer:

Efficient CRT-Based RSA Cryptosystems

Unconventional Fixed-Radix Number Systems

Paul D. Reynolds Russell W. Duren Matthew L. Trumbo Robert J. Marks II

UNIVERSITY OF MASSACHUSETTS Dept

UNIVERSITY OF MASSACHUSETTS Dept

UNIVERSITY OF MASSACHUSETTS Dept

Presentation transcript:

An Optimized Hardware Architecture for the Montgomery Multiplication Algorithm Miaoqing Huang 1, Kris Gaj 2, Soonhak Kwon 3, Tarek El-Ghazawi 1 1 The George Washington University, Washington, D.C., U.S.A. 2 George Mason University, Fairfax, VA, U.S.A. 3 Sungkyunkwan University, Suwon, Korea

Outline Motivation Classical Hardware Architecture for Montgomery Multiplication by Tenca and Koc from CHES 1999 Our Optimized Hardware Architecture Conceptual Comparison Implementation Results Possible Extensions Conclusions

Motivation Fast modular multiplication required in multiple cryptographic transformations RSA, DSA, Diffie-Hellman Elliptic Curve Cryptosystems ECM, p-1, Pollard’s rho methods of factoring, etc. Montgomery Multiplication invented by Peter L. Montgomery in 1985 is most frequently used to implement repetitive sequence of modular multiplications in both software and hardware Montgomery Multiplication in hardware replaces division by a sequence of simple logic operations, conditional additions and right shifts

Montgomery Modular Multiplication (1) Z = X  Y mod M X Integer domain Montgomery domain X’ = X  2 n mod M Y Y’ = Y  2 n mod M Z’ = MP(X’, Y’, M) = = X’  Y’  2 -n mod M = = (X  2 n )  (Y  2 n )  2 -n mod M = = X  Y  2 n mod M Z’ = Z  2 n mod M Z = X  Y mod M X, Y, M – n-bit numbers

Montgomery Modular Multiplication (2) X’ = MP(X, 2 2n mod M, M) = = X  2 2n  2 -n mod M = X  2 n mod M Z = MP(Z’, 1, M) = = (Z  2 n )  1  2 -n mod M = Z mod M = Z X X’X’ Z Z’Z’

Montgomery Product S[0] = 0 S[i+1] = Z = S[n] S[i]+x i  Y 2 S[i]+x i  Y + M 2 if q i = S[i] + x i  Y mod 2= 0 if q i = S[i] + x i  Y mod 2= 1 for i=0 to n-1 M assumed to be odd

Basic version of the Radix-2 Montgomery Multiplication Algorithm

Classical Design by Tenca & Koc CHES 1999 Multiple Word Radix-2 Montgomery Multiplication algorithm (MWR2MM) Main ideas: Use of short precision words (w-bit each): Reduces broadcast problem in circuit implementation Word-oriented algorithm provides the support needed to develop scalable hardware units. Operand Y(multiplicand) is scanned word-by-word, operand X(multiplier) is scanned bit-by-bit.

X = (x n-1, …,x 1,x 0 ) Y = (Y (e-1),…,Y (1),Y (0) ) M = (M (e-1),…,M (1),M (0) ) The bits are marked with subscripts, and the words are marked with superscripts. Classical Design by Tenca & Koc CHES 1999 Each operand has n bits e words e = n+1 w Each word has w bits

MWR2MM Multiple Word Radix-2 Montgomery Multiplication algorithm by Tenca and Koc Task A B C e-1 times

Problem in Parallelizing Computations w w-1 w.. S (0) S (1) w 1.. x0x0 x1x1 w-1 1 w-2 2w-2 w+1 w w 1 w-1 1 w-2

One PE is in charge of the computation of one column that corresponds to the updating of S with respect to one single bit x i. The delay between two adjacent PEs is 2 clock cycles. The minimum computation time is 2n+e-1 clock cycles given (e+1)/2 PEs working in parallel. Data Dependency Graph by Tenca & Koc i=0 i=1 i=2 j=0 j=1 j=2 j=3 j=4 j=5

Example of Operation of the Design by Tenca & Koc Example of the computation executed for 5-bit operands with word-size w = 1 bit - C n = 5 w = 1 e = 5 2n + e – 1 = 2  – 1 = 14 clock cycles (e+1)/2 = (5+1)/2 = 3 PEs sufficient to perform all computations

Main Idea of the New Architecture In the architecture of Tenca & Koc – w-1 least significant bits of partial results S (j) are available one clock cycle before they are used – only one (most significant) bit is missing Let us compute a new partial result under two assumptions regarding the value of the most significant bit of S (j) and choose the correct value one clock cycle later

Idea for a Speed-up w w-1 w.. S (0) S (1) x0x0 x1x1 w-1 1 w-2 2w-2 w w-1 2 choose between the two possible results using missing bit computed at the same time perform two computations in parallel using two possible values of the most-significant-bit

Primary Advantage of the New Approach Reduction in the number of clock cycles from 2 n + e - 1 to n + e – 1 Minimum penalty in terms of the area and clock period

Pseudocode of the Main Processing Element

Main Processing Element Type E

The Proposed Optimized Hardware Architecture

The First and the Last Processing Elements Type DType F

Data Dependency Graph of the Proposed New Architecture PE#0PE#1PE#2 PE#3

The Overall Computation Pattern Tenca & Koc, CHES 1999 Our new proposed architecture Special state of each PE vs. One special PE type simpler structure of each PE

Demonstration of Computations Sequential S (0) S (1) S (2) ←X0←X0 S (e-1) Tenca & Koç’s proposal PE#0 PE#1 PE#2 ←X0←X0 ←X1←X1 ←X2←X2 S (0) S (1) S (2) S (3) S (0) S (1) S (2) S (0) S (4)

Demonstration of Computations (cont.) The proposed optimized architecture PE#0 PE#1 PE#2 PE#3 PE#(e-1) S (0) S (1) S (2) S (3) S (e-1) S (3) S (2) S (1) S (0) ←X0←X0 ←X0←X0 ←X0←X0 ←X0←X0 ←X0←X0 ←X1←X1 ←X1←X1 ←X1←X1 ←X2←X2 ←X2←X2 ←X e-1 ←X3←X3 ←X e-3 ←X e-2 ←X e-4

Related Work Tenca, A. and Koç, Ç.K.: A scalable architecture for Montgomery multiplication, CHES 1999, LNCS, vol.1717, pp , Springer, Heidelberg, 1999 –The original paper proposing the MWR2MM algorithm Tenca, A., Todorov, G., and Koç. Ç.K.: High-radix design of a scalable modular multiplier, CHES 2001, LNCS, vol.2162, pp , Springer, Heidelberg, 2001 –Scan several bits of X one time instead of one bit Harris, D., Krishnamurthy, R., Anders, M., Mathew, S. and Hsu, S.: An Improved Unified Scalable Radix-2 Montgomery Multiplier, ARITH 17, pp , 2005 –Left shift Y and M instead of right shift S (i) Michalski, E. A. and Buell, D. A.: A scalable architecture for RSA cryptography on large FPGAs, FPL 2006, pp , 2006 –FPGA-specific, assumes the use of of built-in multipliers McIvor, C., McLoone, M. and McCanny, J.V.: Modified Montgomery Modular Multiplication and RSA Exponentiation Techniques, IEE Proceedings – Computers & Digital Techniques, vol.151, no.6, pp , 2004 –Use carry-save addition on full-size n-bit operands

Implementation, Verification, and Experimental Testing New architecture and two previous architectures: Modeled in Verilog HDL and/or VHDL Functionally verified by comparison with a reference software implementation of the Montgomery Multiplication Implemented using Xilinx Virtex-II FPGA Experimentally tested using SRC 6 reconfigurable computer based on microprocessors and FPGAs with the 100 MHz maximum clock frequency for the FPGA part

Implementation Results Test platform: Xilinx Virtex-II 6000 FF1517-4

Normalized Latency New Architecture vs. Previous Architectures 1024Operand size Our design Tenca & Koc McIvor et al

Normalized Product Latency Times Area New Architecture vs. Previous Architectures 1024Operand size Our design Tenca & Koc McIvor et al

Radix-4 Architecture The same optimization concept can be applied to a radix-4 implementation Two bits of X processed in each iteration Latency in clock cycles reduced by a factor of almost 2 However, the clock period and the area increase Radix higher than 4 not viable because of the large area and the requirement of multiplication of words by digits in the range from 0 to radix-1

Comparison between radix-2 and radix-4 versions of the proposed architecture (n=1024, w=16) Xilinx Virtex-II 6000 FF FPGA Latency * area [for radix 4] Latency * area [for radix 2] = 1.28

Conclusions New optimized architecture for the word-based Montgomery Multiplier Compared to the classical design by Tenca & Koc: Minimum latency smaller by a factor of about 1.8, in terms of both clock cycles and absolute time units Comparable circuit area for minimum latency Improvement in terms of the product of latency times area by a factor of about 1.6 Reduced scalability (fixed vs. variable number of processing elements required for the given operand size) [ to be fixed in the new architecture under development ] Compared to the newer design by McIvor et al.: Area smaller by about 50% Improvement in terms of the product of latency times area by a factor between 1.14 and 1.55 Similar scalability

Questions??? Thank you!