Montgomery Multipliers & Exponentiation Units


1 Montgomery Multipliers & Exponentiation Units
Lecture 7

2 Motivation: Public-key ciphers

3 Secret-key (Symmetric) Cryptosystems
[Diagram: Alice and Bob share one secret key, K_AB. Alice encrypts with K_AB, the ciphertext crosses the network, and Bob decrypts with the same K_AB.]

4 Key Distribution Problem
N users need N · (N-1) / 2 keys.
Users: 100, keys: about 5,000
Users: 1000, keys: about 500,000

5 Digital Signature Problem
Both corresponding sides have the same information and are able to generate a signature.
There is a possibility of:
the receiver falsifying the message
the sender denying that he/she sent the message

6 Public Key (Asymmetric) Cryptosystems
[Diagram: Alice encrypts with the public key of Bob, K_B; the ciphertext crosses the network; Bob decrypts with the private key of Bob, k_B.]

7 Non-repudiation
[Diagram: Alice hashes the message and applies the public key cipher with Alice's private key to the hash value, producing the signature. Bob receives the message and the signature, hashes the message (hash value 1), applies the public key cipher with Alice's public key to the signature (hash value 2), and compares the two hash values: yes/no.]

8 RSA as a trap-door one-way function
PUBLIC KEY: message M to ciphertext C, $C = f(M) = M^e \bmod N$
PRIVATE KEY: ciphertext C to message M, $M = f^{-1}(C) = C^d \bmod N$
$N = P \cdot Q$, where P, Q are large prime numbers
$e \cdot d \equiv 1 \pmod{(P-1)(Q-1)}$

9 RSA keys
PUBLIC KEY: { e, N }
PRIVATE KEY: { d, P, Q }
P, Q: large prime numbers
N: $N = P \cdot Q$
e: $\gcd(e, P-1) = 1$ and $\gcd(e, Q-1) = 1$
d: $e \cdot d \equiv 1 \pmod{(P-1)(Q-1)}$

10 Mini-RSA keys
PUBLIC KEY: { e, N }
PRIVATE KEY: { d, P, Q }
P, Q: P = 5, Q = 11
N: $N = P \cdot Q = 55$
e: $\gcd(e, 5-1) = 1$ and $\gcd(e, 11-1) = 1$, so e = 3
d: $3 \cdot d \equiv 1 \pmod{40}$, so d = 27

11 Mini-RSA as a trap-door one-way function
PUBLIC KEY: message M = 2 to ciphertext, $C = f(2) = 2^3 \bmod 55 = 8$
PRIVATE KEY: ciphertext C = 8 to message, $M = f^{-1}(C) = 8^{27} \bmod 55 = 2$
$N = 5 \cdot 11$, where 5 and 11 are prime numbers
$3 \cdot 27 \equiv 1 \pmod{(5-1)(11-1)}$
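The mini-RSA numbers above can be checked with a few lines of Python. This is only a sketch for the toy parameters on the slide (P = 5, Q = 11, e = 3, d = 27); the variable names are chosen here for illustration, and Python's built-in three-argument `pow` performs the modular exponentiation.

```python
from math import gcd

P, Q = 5, 11
N = P * Q                    # public modulus, 55
phi = (P - 1) * (Q - 1)      # (P-1)(Q-1) = 40
e, d = 3, 27                 # public and private exponents from the slide

# Key conditions stated on the "RSA keys" slide.
assert gcd(e, P - 1) == 1 and gcd(e, Q - 1) == 1
assert (e * d) % phi == 1

M = 2
C = pow(M, e, N)             # encryption: C = M^e mod N
M_back = pow(C, d, N)        # decryption: M = C^d mod N
print(C, M_back)             # prints "8 2"
```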

12 Basic Operations of RSA
Encryption: $C = M^e \bmod N$
M: plaintext (k bits), C: ciphertext (k bits), N: public key modulus (k bits), e: public key exponent (L bits, L < k)
Decryption: $M = C^d \bmod N$
C: ciphertext (k bits), M: plaintext (k bits), N: private key modulus (k bits), d: private key exponent (L bits, L = k)

13 Modular arithmetic

14 Quotient and remainder
Given integers a and n, n > 0,
$\exists!\ q, r \in \mathbb{Z}$ such that $a = q \cdot n + r$ and $0 \le r < n$
q: quotient, r: remainder (of a divided by n)
$q = \lfloor a/n \rfloor = a \ \mathrm{div}\ n$
$r = a - q \cdot n = a - \lfloor a/n \rfloor \cdot n = a \bmod n$
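A short sketch of this definition in Python, where `//` is floor division and therefore matches $q = \lfloor a/n \rfloor$ also for negative a. The helper name `div_mod` is hypothetical and used only for illustration.

```python
def div_mod(a, n):
    """Return (q, r) with a = q*n + r and 0 <= r < n, for n > 0."""
    if n <= 0:
        raise ValueError("n must be positive")
    q = a // n         # floor division: q = floor(a / n)
    r = a - q * n      # remainder, always in the range [0, n)
    return q, r

print(div_mod(32, 5))    # (6, 2)
print(div_mod(-32, 5))   # (-7, 3), because -32 = (-7)*5 + 3
```

Note that languages whose integer division truncates toward zero (C, Java) would give a remainder of -2 for -32 and 5, which is not the remainder defined above.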

15 mod 5 = -32 mod 5 =

16 Integers congruent modulo n
Two integers a and b are congruent modulo n (equivalent modulo n), written $a \equiv b \pmod{n}$, iff
$a \bmod n = b \bmod n$, or
$a = b + kn$, $k \in \mathbb{Z}$, or
$n \mid a - b$

17 Rules of addition, subtraction and multiplication modulo n
$(a + b) \bmod n = ((a \bmod n) + (b \bmod n)) \bmod n$
$(a - b) \bmod n = ((a \bmod n) - (b \bmod n)) \bmod n$
$(a \cdot b) \bmod n = ((a \bmod n) \cdot (b \bmod n)) \bmod n$

18 9 · 13 mod 5 = 25 · 25 mod 26 =

19 Laws of modular arithmetic
Regular addition: $a + b = a + c$ iff $b = c$
Modular addition: $a + b \equiv a + c \pmod{n}$ iff $b \equiv c \pmod{n}$
Regular multiplication: if $a \cdot b = a \cdot c$ and $a \ne 0$ then $b = c$
Modular multiplication: if $a \cdot b \equiv a \cdot c \pmod{n}$ and $\gcd(a, n) = 1$ then $b \equiv c \pmod{n}$

20 Modular Multiplication: Example
[Tables: x against 6 · x mod 8, and x against 5 · x mod 8]
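The two multiplication tables can be regenerated as follows. This is an illustrative sketch (the function name `mul_table` and the range x = 0..7 are assumptions); it shows why the cancellation law for modular multiplication needs gcd(a, n) = 1: multiplying by 5 permutes the residues modulo 8, while multiplying by 6 maps different x to the same value.

```python
from math import gcd

def mul_table(a, n):
    """Return [a*x mod n for x = 0 .. n-1]."""
    return [(a * x) % n for x in range(n)]

t6 = mul_table(6, 8)   # gcd(6, 8) = 2: values repeat
t5 = mul_table(5, 8)   # gcd(5, 8) = 1: every residue appears exactly once

print(gcd(6, 8), t6)   # 2 [0, 6, 4, 2, 0, 6, 4, 2]
print(gcd(5, 8), t5)   # 1 [0, 5, 2, 7, 4, 1, 6, 3]
```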

21 Modular Exponentiation
Basic Modular Exponentiation

22 How to perform exponentiation efficiently?
$Y = X^E \bmod N = \underbrace{X \cdot X \cdot X \cdots X}_{E\ \text{times}} \bmod N$
E may be in the range of $2^{1024} \approx 10^{308}$
Problems:
1. huge storage necessary to store $X^E$ before reduction
2. amount of computations infeasible to perform
Solutions:
1. modulo reduction after each multiplication
2. clever algorithms (200 BC, India, "Chandah-Sûtra")

23 Exponentiation: Y = XE mod N
E = (e_{L-1}, e_{L-2}, …, e_1, e_0)_2
Right-to-left binary exponentiation:
    Y = 1; S = X;
    for i = 0 to L-1 {
        if (e_i == 1) Y = Y · S mod N;
        S = S^2 mod N;
    }
Left-to-right binary exponentiation:
    Y = 1;
    for i = L-1 downto 0 {
        Y = Y^2 mod N;
        if (e_i == 1) Y = Y · X mod N;
    }
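Both scanning orders translate almost line by line into Python. The sketch below follows the slide's pseudocode; the function names are chosen here for illustration, and the exponent length L is taken from the bit length of E.

```python
def exp_right_to_left(x, e, n):
    """Y = X^E mod N, scanning the exponent bits from e_0 upward."""
    y, s = 1 % n, x % n
    while e > 0:
        if e & 1:            # current bit e_i is 1
            y = (y * s) % n  # multiply: Y = Y * S mod N
        s = (s * s) % n      # square:   S = S^2 mod N
        e >>= 1
    return y

def exp_left_to_right(x, e, n):
    """Y = X^E mod N, scanning the exponent bits from e_{L-1} downward."""
    y = 1 % n
    for i in range(e.bit_length() - 1, -1, -1):
        y = (y * y) % n      # square:   Y = Y^2 mod N
        if (e >> i) & 1:     # current bit e_i is 1
            y = (y * x) % n  # multiply: Y = Y * X mod N
    return y

print(exp_right_to_left(8, 27, 55), exp_left_to_right(8, 27, 55))  # 2 2
```

In the right-to-left variant the squaring and the multiplication of one iteration are independent of each other, so they can run in parallel on two units; the left-to-right variant needs only one multiplier because each multiplication depends on the preceding squaring.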

24 Right-to-Left Binary Exponentiation in Hardware
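[Block diagram: registers Y (initialized to 1) and S (initialized to X), a squarer (SQR), a multiplier (MUL), the exponent E driving the enable signal, and the output.]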

25 Left-to-Right Binary Exponentiation in Hardware
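[Block diagram: register Y (initialized to 1), operand X, a single multiplier (MUL), control logic driven by the exponent E, and the output.]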

26 Modular Multiplication

27 Algorithms for Modular Multiplication
Multiplication:
  Classical: $\Theta(k^2)$
  Karatsuba: $\Theta(k^{\lg 3})$
  Schönhage-Strassen (FFT): $\Theta(k \cdot \ln(k))$
Multiplication combined with modular reduction:
  Montgomery algorithm: $\Theta(k^2)$
Modular reduction:
  Classical: $\Theta(k^2)$
  Barrett: complexity same as multiplication used
  Selby-Mitchell: $\Theta(k^2)$

28 Montgomery Multiplication

29 Montgomery Modular Multiplication (1)
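$Z = X \cdot Y \bmod M$; X, Y, M are (n-1)-bit numbers
Integer domain to Montgomery domain:
$X \rightarrow X' = X \cdot 2^n \bmod M$
$Y \rightarrow Y' = Y \cdot 2^n \bmod M$
$Z' = MP(X', Y', M) = X' \cdot Y' \cdot 2^{-n} \bmod M = (X \cdot 2^n) \cdot (Y \cdot 2^n) \cdot 2^{-n} \bmod M = X \cdot Y \cdot 2^n \bmod M$
$Z = X \cdot Y \bmod M \rightarrow Z' = Z \cdot 2^n \bmod M$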

30 Montgomery Modular Multiplication (2)
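Conversion $X \rightarrow X'$:
$X' = MP(X, 2^{2n} \bmod M, M) = X \cdot 2^{2n} \cdot 2^{-n} \bmod M = X \cdot 2^n \bmod M$
Conversion $Z' \rightarrow Z$:
$Z = MP(Z', 1, M) = (Z \cdot 2^n) \cdot 1 \cdot 2^{-n} \bmod M = Z \bmod M = Z$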
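The domain conversions can be exercised with a reference model of MP. The sketch below defines MP directly from its mathematical meaning, using a modular inverse of $2^n$ (Python 3.8+ accepts a negative exponent in three-argument `pow`); it is not how a hardware multiplier computes the product, and all function names are hypothetical.

```python
def mp(a, b, m, n):
    """Reference Montgomery product: a * b * 2^(-n) mod m, for odd m."""
    r_inv = pow(2, -n, m)              # inverse of 2^n modulo m (Python 3.8+)
    return (a * b * r_inv) % m

def to_mont(x, m, n):
    """X -> X' = MP(X, 2^(2n) mod M, M) = X * 2^n mod M."""
    return mp(x, pow(2, 2 * n, m), m, n)

def from_mont(z_mont, m, n):
    """Z' -> Z = MP(Z', 1, M)."""
    return mp(z_mont, 1, m, n)

def mod_mul_via_mont(x, y, m, n):
    """Compute X * Y mod M by going through the Montgomery domain."""
    x_mont, y_mont = to_mont(x, m, n), to_mont(y, m, n)
    z_mont = mp(x_mont, y_mont, m, n)  # Z' = X * Y * 2^n mod M
    return from_mont(z_mont, m, n)

print(mod_mul_via_mont(100, 200, 239, 8), (100 * 200) % 239)  # both values agree
```

A single multiplication does not pay for the two conversions; the scheme becomes worthwhile in exponentiation, where many Montgomery products are chained between one conversion in and one conversion out.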

31 Basic version of the Radix-2 Montgomery Multiplication Algorithm

32

33 Montgomery Product
S[0] = 0
for i = 0 to n-1:
    q_i = (S[i] + x_i·Y) mod 2
    S[i+1] = (S[i] + x_i·Y) / 2        if q_i = 0
    S[i+1] = (S[i] + x_i·Y + M) / 2    if q_i = 1
Z = S[n]
M is assumed to be odd.
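A bit-serial software model of this recurrence is sketched below. The function name and argument order are chosen here for illustration. As on the slide, there is no final subtraction, so for Y < M the result lies in the range [0, 2M) and is congruent to $X \cdot Y \cdot 2^{-n}$ modulo M.

```python
def montgomery_product(x, y, m, n):
    """Radix-2 Montgomery product: returns S[n] with S[n] * 2^n = x*y (mod m)."""
    if m % 2 == 0:
        raise ValueError("M must be odd")
    s = 0                          # S[0] = 0
    for i in range(n):
        x_i = (x >> i) & 1         # i-th bit of X
        t = s + x_i * y            # S[i] + x_i * Y
        q_i = t & 1                # q_i = (S[i] + x_i * Y) mod 2
        s = (t + q_i * m) >> 1     # adding M when q_i = 1 makes the sum even
    return s

z = montgomery_product(100, 200, 239, 8)
print((z * 2**8) % 239 == (100 * 200) % 239)   # True
```

Adding M whenever the running sum is odd does not change the value modulo M, and it guarantees that the division by 2 is exact; this is why M must be odd.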

34 Basic version of the Radix-2 Montgomery Multiplication Algorithm

35 Project 2 Rules
Groups consisting of 2 students (preferred) or a single student (if needed).
Each group works on different architectures.
Each group of two works on two similar architectures. Members of the group can freely exchange VHDL code and ideas with each other.
Students working individually work on a single architecture. They must not exchange code with other students.
Members of a group of two are graded jointly, unless they agree to split no later than two weeks before the Project deadline.

36 Investigated Montgomery Multipliers
[Table: architectures classified as Non-Scalable or Scalable and assigned to project groups G1 to G6]
McIvor, et al.: based on 5-to-2 CSA, based on 4-to-2 CSA
Koc & Tenca: radix 2, radix 4
Huang, et al.: Architecture 2, Architecture 1
Savas et al.: radix 2, radix 4
Harris, et al.: radix 2, radix 4
Suzuki: Virtex 5 DSP, Stratix III DSP

37 Investigated Montgomery Multipliers
Non-Scalable:
dedicated to one particular operand size
operand size is described by a generic, and can be changed only after reconfiguration
size of the circuit varies as a function of the operand size
Scalable:
flexible, can handle multiple operand sizes
operand size is described by a special input, and can be changed during run-time
size of the circuit is constant

38 Assumptions (1)
Operand sizes:
Evaluated parameters:
Max. Clock Frequency [MHz]
Min. Latency [clock cycles]
Min. Latency [μs]
Resource Utilization (CLB slices/ALUTs, DSP Units, Block Memories)
Latency x Area [μs x CLB slices/ALUTs]

39 Project 2 Rules
Montgomery Multiplier: required
Montgomery Exponentiation Unit: bonus
Virtex 5 and Stratix III: required
Virtex 6 and Stratix IV: bonus
1024 and 2048 bit operand sizes: required
3072 and 4096 bit operand sizes: bonus

40 Assumptions (2)
Uniform interface (to be provided, but may need to be tweaked depending on the architecture)
Test vectors generated using a reference software implementation (may need to be extended to generate intermediate results)
Your own testbench.

41 Montgomery Multipliers based on Carry Save Adders

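42 Carry Save Adder (CSA)
[Diagram: a row of n independent full adders (FA). The adder at bit position i takes a_i, b_i, c_i and produces the sum bit s_i and the carry bit c_{i+1}; outputs are c_n, s_{n-1}, c_{n-1}, ..., s_1, c_1, s_0.]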

43 Operation of a Carry Save Adder (CSA)
Example
[Table: columns weighted $2^4, 2^3, 2^2, 2^1, 2^0$; rows x, y, z, s, c]
x + y + z = s + c
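The identity x + y + z = s + c can be modelled on whole integers with bitwise operations: each bit position behaves like one full adder, and no carry propagates between positions. This is a minimal sketch; the function name and the operand values are chosen here for illustration and are not the ones from the slide's example.

```python
def carry_save_add(x, y, z):
    """Reduce three operands to a sum word s and a carry word c, x+y+z = s+c."""
    s = x ^ y ^ z                              # per-bit sum output of each FA
    c = ((x & y) | (x & z) | (y & z)) << 1     # per-bit carry output, weight 2^(i+1)
    return s, c

x, y, z = 0b1011, 0b0110, 0b1101
s, c = carry_save_add(x, y, z)
print(bin(s), bin(c), s + c == x + y + z)      # 0b0 0b11110 True
```

The delay of this step is that of a single full adder regardless of the operand width, which is what makes carry-save addition attractive inside the Montgomery loop for operands of 1024 bits or more.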

44 Carry-save adder for four operands
[Dot diagram: the 4-bit operands x3..x0, y3..y0, z3..z0, w3..w0 are reduced to sum bits s3..s0 and carry bits c4..c1, which are finally added to give S5..S0.]

45 Carry-save adder for four operands
[Circuit diagram; signals c4, s3, c3, s2, c2, s1, c1]

46 Carry-save adder for four operands
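[Block diagram: 4-bit operands x, y, z, w. A first CSA produces c and s, a second CSA produces c' and s', and a carry-propagate adder (CPA) produces the final sum S.]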
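The two-level structure (CSA, CSA, then one CPA) can be sketched in Python as follows. It assumes the usual wiring, with the first CSA taking x, y, z and the second taking its two outputs together with w; the function names are hypothetical.

```python
def carry_save_add(x, y, z):
    """3-to-2 reduction: x + y + z = s + c."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c

def add_four(x, y, z, w):
    """Add four operands with two CSA levels and one carry-propagate addition."""
    s, c = carry_save_add(x, y, z)      # first CSA:  x, y, z -> s, c
    s2, c2 = carry_save_add(s, c, w)    # second CSA: s, c, w -> s', c'
    return s2 + c2                      # CPA: the only step with carry propagation

print(add_four(9, 7, 12, 5))            # 33
```

The first two levels together form a 4-to-2 reduction; only the last step needs a full-width carry chain.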

47 Radix-2 Montgomery Multiplication with Carry Save Addition

48 Carry Save Reduction 4-to-2
U+V+W+Y = S+C

49 Radix-2 Montgomery Multiplier Based on Carry Save Reduction 4-to-2

50 Montgomery Multipliers and Exponentiation Units
by McIvor, et al.

51 5-to-2 CSA X1+X2+X3+X4+X5 = SUM + CARRY

52 5-to-2 CSA Montgomery Multiplication

53 On the fly calculation of Ai
based on the Carry Save Representation of A A = A1 + A2

54 Montgomery Exponentiation

55 Montgomery Exponentiation
based on the 5-to-2 CSA Montgomery Multiplier
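As a rough software illustration of Montgomery exponentiation, the sketch below combines left-to-right binary exponentiation with Montgomery products. It is not the schedule of the 5-to-2 CSA design, whose details are not reproduced in this transcript; MP is modelled mathematically through a modular inverse (Python 3.8+), and all function names are assumptions.

```python
def mp(a, b, m, n):
    """Reference Montgomery product: a * b * 2^(-n) mod m, for odd m."""
    return (a * b * pow(2, -n, m)) % m

def montgomery_exp(x, e, m, n):
    """Compute x^e mod m using only Montgomery products inside the loop."""
    r2 = pow(2, 2 * n, m)        # 2^(2n) mod M, precomputed once per modulus
    x_m = mp(x, r2, m, n)        # X' = X * 2^n mod M
    y_m = mp(1, r2, m, n)        # 1' = 2^n mod M
    for i in range(e.bit_length() - 1, -1, -1):
        y_m = mp(y_m, y_m, m, n)         # square in the Montgomery domain
        if (e >> i) & 1:
            y_m = mp(y_m, x_m, m, n)     # multiply in the Montgomery domain
    return mp(y_m, 1, m, n)      # convert back: Y = MP(Y', 1, M)

print(montgomery_exp(8, 27, 55, 6))      # 2, matching the mini-RSA decryption
```

Two conversions in and one conversion out are needed for the whole exponentiation, so their cost is amortized over roughly 1.5 · L Montgomery products for an L-bit exponent.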

56 4-to-2 CSA X1+X2+X3+X4 = SUM + CARRY

57 4-to-2 CSA Montgomery Multiplication

58 Montgomery Multipliers
Scalable Montgomery Multipliers by Koc & Tenca

59 Classical Design by Tenca & Koc CHES 1999
Multiple Word Radix-2 Montgomery Multiplication algorithm (MWR2MM)
Main ideas:
Use of short precision words (w bits each): reduces the broadcast problem in circuit implementation.
Word-oriented algorithm provides the support needed to develop scalable hardware units.
Operand Y (multiplicand) is scanned word-by-word, operand X (multiplier) is scanned bit-by-bit.

60 Classical Design by Tenca & Koc CHES 1999
Each word has w bits
Each operand has n bits
$e = \lceil (n+1)/w \rceil$ words
$X = (x_{n-1}, \ldots, x_1, x_0)$
$Y = (Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)})$
$M = (M^{(e-1)}, \ldots, M^{(1)}, M^{(0)})$
The bits are marked with subscripts, and the words are marked with superscripts.

61 MWR2MM Multiple Word Radix-2 Montgomery Multiplication algorithm by Tenca and Koc
[Algorithm listing with three parts: Task A, then Task B repeated e-1 times, then Task C]
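The algorithm listing itself did not survive in this transcript. The sketch below is one possible word-serial software model in the spirit of MWR2MM, written from the description on the preceding slides (X scanned bit-by-bit, Y and M scanned word-by-word, S shifted right by one bit per iteration). The loop structure, the function name, and the choice $e = \lceil (n+1)/w \rceil$ are assumptions, not a transcription of the original pseudocode.

```python
def mwr2mm(x, y, m, n, w):
    """Word-serial radix-2 Montgomery product; assumes odd m < 2^n and x, y < m."""
    e = -(-(n + 1) // w)                      # number of w-bit words, ceil((n+1)/w)
    mask = (1 << w) - 1
    Y = [(y >> (w * j)) & mask for j in range(e)]
    M = [(m >> (w * j)) & mask for j in range(e)]
    S = [0] * e
    for i in range(n):
        x_i = (x >> i) & 1
        t = S[0] + x_i * Y[0]                 # least significant word first
        q = t & 1                             # decides whether M is added
        t += q * M[0]
        carry, S[0] = t >> w, t & mask
        for j in range(1, e):                 # remaining words, one per step
            t = carry + S[j] + x_i * Y[j] + q * M[j]
            carry, S[j] = t >> w, t & mask
            # Right shift by one: bit 0 of word j becomes the top bit of word j-1.
            S[j - 1] = (S[j - 1] >> 1) | ((S[j] & 1) << (w - 1))
        S[e - 1] = (S[e - 1] >> 1) | (carry << (w - 1))   # final carry enters the top word
    return sum(S[j] << (w * j) for j in range(e))

r = mwr2mm(100, 200, 239, 8, 4)
print((r << 8) % 239 == (100 * 200) % 239)    # True
```

Word j-1 can only be completed once the least significant bit of word j is known, which is the dependency examined on the following slides.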

62 Problem
Calculation dependent on x1 (x_{i+1} in general) can start only two clock cycles after the calculation dependent on x0 (x_i in general).
[Diagram: bit positions of the words S(0)[0], S(1)[0], S(2)[0], S(3)[0] processed for x0, and of S(0)[1], S(1)[1] processed for x1]

63 Data Dependency Graph by Tenca & Koc
One PE is in charge of the computation of one column, which corresponds to the updating of S with respect to one single bit x_i.
The delay between two adjacent PEs is 2 clock cycles.
The minimum computation time is $2 \cdot n + e - 1$ clock cycles, given $\lceil (e+1)/2 \rceil$ PEs working in parallel.
[Graph labels: i = 0, 1, 2 and j = 0 to 5]

64 Data Dependency Graph by Tenca & Koc

65 Example of Operation of the Design by Tenca & Koc
Example of the computation executed for 5-bit operands with word size w = 1 bit
n = 5, w = 1, e = 5
$2n + e - 1 = 2 \cdot 5 + 5 - 1 = 14$ clock cycles
$\lceil (e+1)/2 \rceil = \lceil (5+1)/2 \rceil = 3$ PEs sufficient to perform all computations

66 Example of Operation of the Design by Tenca & Koc
Example of the computation executed for 5-bit operands with word size w = 1 bit
n = 5, w = 1, e = 5
2 PEs

67 Pipelined Organization with Two Processing Elements

68 Montgomery Multiplier
Non-Scalable Montgomery Multiplier by Huang et al.

69 Main Idea of the New Architecture
In the architecture of Tenca & Koc:
the w-1 least significant bits of partial results S(j) are available one clock cycle before they are used
only one (most significant) bit is missing
Let us compute a new partial result under two assumptions regarding the value of the most significant bit of S(j), and choose the correct value one clock cycle later.

70 Idea for a Speed-up
Perform two computations in parallel using the two possible values of the most significant bit.
Choose between the two possible results using the missing bit, computed at the same time.
[Diagram: bit positions of the words S(0) and S(1) processed for x0 and x1]

71 Primary Advantage of the New Approach
Reduction in the number of clock cycles from 2n + e - 1 to n + e - 1
Minimum penalty in terms of the area and clock period
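The two cycle-count formulas are easy to compare numerically. The helper names below are hypothetical, and the counts assume that enough processing elements are available for the computation to proceed without stalls.

```python
def cycles_two_cycle_delay(n, e):
    """Tenca & Koc organization: 2n + e - 1 clock cycles."""
    return 2 * n + e - 1

def cycles_one_cycle_delay(n, e):
    """New approach: n + e - 1 clock cycles."""
    return n + e - 1

# Small example from the slides: n = 5, e = 5.
print(cycles_two_cycle_delay(5, 5), cycles_one_cycle_delay(5, 5))         # 14 9

# Illustrative larger case (assumed values): n = 1024 bits in e = 65 words.
print(cycles_two_cycle_delay(1024, 65), cycles_one_cycle_delay(1024, 65))  # 2112 1088
```

For large n the term in n dominates, so the cycle count is roughly halved.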

72 Pseudocode of the Main Processing Element

73 Main Processing Element Type E

74 The Proposed Optimized Hardware Architecture

75 The First and the Last Processing Elements
Type D Type F

76 Data Dependency Graph of the Proposed New Architecture

77 The Overall Computation Pattern
Special state of each PE vs. one special PE type; simpler structure of each PE.
[Comparison of two computation patterns: our new proposed architecture, and Tenca & Koc, CHES 1999]

78 Demonstration of Computations
[Timing diagram. Sequential: the words S(0), S(1), S(2), ..., S(e-1) are processed one after another for x0. Tenca & Koç's proposal: PE#0 processes the words S(0), S(1), ... for x0, PE#1 does the same for x1, and PE#2 for x2.]

79 Demonstration of Computations (cont.)
[Timing diagram, the proposed optimized architecture: PE#j holds the word S(j) (PE#0: S(0), PE#1: S(1), ..., PE#(e-1): S(e-1)) and processes the bits x0, x1, x2, ... in successive clock cycles.]

80 Normalized Latency
[Bar chart: normalized latency of Huang et al., Tenca & Koc, and McIvor et al. for operand sizes 1024, 2048, 3072, 4096; vertical axis from 0.00 to 2.00; data labels in the chart: 1.76, 1.01, 0.85, 0.81, 0.76]

81 Normalized Product Latency Times Area
[Bar chart: normalized product of latency and area for Huang et al., Tenca & Koc, and McIvor et al. for operand sizes 1024, 2048, 3072, 4096; vertical axis from 0.00 to 2.00; data labels in the chart: 1.66, 1.64, 1.63, 1.63, 1.55, 1.28, 1.21, 1.14]

82 Montgomery Multiplier
Scalable Montgomery Multiplier by Huang et al.

83

84

85

86

87 Computations for 5-bit operands using a) 3 PEs, b) 2 PEs

88 Modular Exponentiation
Faster Modular Exponentiation

89

90

91

92

