Montgomery’s Multiplication Technique: How to make it Smaller and Faster Colin D. Walter Computation Department, UMIST, UK www.co.umist.ac.uk.


2 Montgomery’s Multiplication Technique: How to make it Smaller and Faster
Colin D. Walter, Computation Department, UMIST, UK
www.co.umist.ac.uk

3 CHES 99, WPI. C.D. Walter, UMIST.
Peter Montgomery, “Modular Multiplication without Trial Division”, Math. Computation, vol. 44 (1985) 519-521:
$(A \times B) \bmod M$ without obtaining the digits of the quotient $q$ from $(A \times B) / M$

4 Motivation
• Faster RSA cryptosystem
  • through pipelined array
• Safer encryption
  • against timing or DPA attacks

5 Overview
• RSA & Notation
• Classical Algorithm
• Montgomery’s Version
• Comparison
  • carry propagation
  • digit distribution
  • communication
  • timing/power attacks
• Conclusion

6 Enigma
• Special purpose: Colossus (1943-44), Tommy Flowers, Bletchley Park, England.
• General purpose: ENIAC (1943-46), John Eckert & John Mauchly, Philadelphia, US.

7 RSA
• Modulus M of around 1024 bits
• Two keys d and e such that $A^{de} \equiv A \bmod M$
• A encrypted to $C = A^e \bmod M$
• C decrypted by $A = C^d \bmod M$
• $M = PQ$, a product of two large primes
• e is often small (e.g. a Fermat prime)
• d satisfies $de \equiv 1 \bmod (P-1)(Q-1)$
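As a small illustration of the RSA relations on this slide, here is a sketch with toy parameters. The primes 61 and 53 and the exponent 17 are assumptions chosen only so the numbers stay readable; they are far too small for real use.

```python
# Toy RSA parameters (illustrative assumptions, not secure sizes).
P, Q = 61, 53
M = P * Q                    # modulus M = PQ
phi = (P - 1) * (Q - 1)      # (P-1)(Q-1)
e = 17                       # small public exponent (17 is a Fermat prime)
d = pow(e, -1, phi)          # private exponent with d*e ≡ 1 mod (P-1)(Q-1)


def encrypt(A: int) -> int:
    """C = A^e mod M."""
    return pow(A, e, M)


def decrypt(C: int) -> int:
    """A = C^d mod M."""
    return pow(C, d, M)


message = 65
ciphertext = encrypt(message)
recovered = decrypt(ciphertext)
```

Every modular exponentiation here is a long chain of modular multiplications, which is the operation the rest of the talk sets out to speed up.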

8 Faster H/W = More Secure Encryption
Work to factorize M doubles for every extra ~15 bits (for key lengths ~$2^{10}$ bits).
Work to en/decrypt:
• $((1024+15)/1024)^2$ per multiplication
• $((1024+15)/1024)^3$ per exponentiation, i.e. only 5% extra!
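The arithmetic behind the "5% extra" figure can be checked directly. This is only a sketch of the slide's own estimate, assuming quadratic cost per multiplication and cubic cost per exponentiation in the key length.

```python
bits = 1024          # current key length
extra = 15           # extra bits that roughly double the factoring work

growth = (bits + extra) / bits
mult_cost = growth ** 2      # relative cost of one modular multiplication
exp_cost = growth ** 3       # relative cost of one modular exponentiation

print(f"multiplication: +{(mult_cost - 1) * 100:.1f}%")
print(f"exponentiation: +{(exp_cost - 1) * 100:.1f}%")
```

The exponentiation overhead comes out a little under 5%, which the slide rounds up.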

9 Number representations
$X = \sum_{i=0}^{n-1} x_i r^i$
• $r = 2^k$ is the radix (prime to M)
• $x_i$ is the ith digit (usually $0 \le x_i < r$)
• n is the max no. of digits in any number
• Redundant reps: wider digit range than $0..r-1$
• H/W is built from $k \times k$-bit multipliers
• n fixed by H/W register size
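A minimal sketch of this radix-$r$ notation follows. The helper names `to_digits` and `from_digits` are hypothetical, introduced only for illustration.

```python
def to_digits(X: int, k: int, n: int) -> list[int]:
    """Split X into n digits of k bits each, least significant first."""
    r = 1 << k                          # radix r = 2^k
    assert 0 <= X < r ** n, "X must fit in n digits"
    return [(X >> (k * i)) & (r - 1) for i in range(n)]


def from_digits(x: list[int], k: int) -> int:
    """Evaluate sum(x_i * r^i); digits outside 0..r-1 are allowed."""
    r = 1 << k
    return sum(x_i * r ** i for i, x_i in enumerate(x))


digits = to_digits(0x1234, k=4, n=4)       # standard digits, each below 16
redundant = [20, 2, 2, 1]                  # same value, digit 0 exceeds r-1
```

Because `from_digits` does not restrict the digit range, it also evaluates redundant representations, where several digit vectors denote the same number.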

10 Redundancy
Digits $x_j$ split into carry-save parts: $x_j = x_{j,s} + r x_{j,c}$
$X \leftarrow X+Y$ is performed by digit-parallel addition: $x_j \leftarrow x_{j,s} + x_{j-1,c} + y_j$
No carry propagation: only old carries on right side
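The digit-parallel update can be sketched as below. The list-based layout and the names `cs_value` and `cs_add` are assumptions made for illustration. The sketch also assumes the top carry is zero on entry, so no value is shifted out of the register.

```python
def cs_value(s: list[int], c: list[int], r: int) -> int:
    """Value of a carry-save number: sum over j of (s_j + r*c_j) * r^j."""
    return sum((sj + r * cj) * r ** j for j, (sj, cj) in enumerate(zip(s, c)))


def cs_add(s: list[int], c: list[int], y: list[int], r: int):
    """One step X <- X + Y; every position uses only the OLD carries."""
    assert c[-1] == 0, "sketch assumes headroom in the top digit"
    new_s, new_c = [], []
    for j in range(len(s)):               # positions are independent of each other
        old_carry = c[j - 1] if j > 0 else 0
        t = s[j] + old_carry + y[j]
        new_s.append(t % r)               # save part x_{j,s}
        new_c.append(t // r)              # carry part x_{j,c}, consumed next step
    return new_s, new_c


r = 16
s, c = [15, 15, 0, 0], [0, 0, 0, 0]        # X = 255
s1, c1 = cs_add(s, c, [1, 0, 0, 0], r)     # X + 1, carry parked in c1[0]
s2, c2 = cs_add(s1, c1, [0, 0, 0, 0], r)   # adding 0 moves the carry one place
```

Each carry moves only one position per step, so the time for an addition does not depend on the number of digits.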

11 Multiplication $A \times B$
Use n digit multipliers to form $a_i \times B$ and add to a partial product P:
  P := 0 ;
  For i := n-1 downto 0 do
    P := r × P + a_i × B
  { Post-condition: P = A × B }

12 Either: use redundancy in P and parallel digit addition to add $a_i B$ in one clock cycle.
Cell j computes $a_i b_j$ in cycle i.
P in carry-save form: $p_j = p_{j,s} + r \times p_{j,c}$
[Figure: cells j+1, j, j-1 for P := P + a_i × B (digit-parallel), with inputs a_i, b_j and carry-save signals p_{j,s}, p_{j,c}.]

13 or: pipeline the addition of $a_i \times B$ over n cycles and propagate carries with no redundancy.
Cell j computes $a_i b_j$ in cycle i+j.
[Figure: P := P + a_i × B (digit-serial). Cells at times j+1, j, j-1 with inputs a_i, b_j, p_j and a carry passed between them.]

14 Multiplier Complexity
Assume wires take area but not time (or power).
Area × Time$^2$ complexity for un-pipelined k-bit multiplication is bounded below by $k^2$.
This can be achieved for time in $[\log k .. \sqrt{k}]$.
• Discrete Fourier Transform has large constants for time and area.
• Better, but asymptotically poorer designs for k expected here.

15 Cross-over point?
$10^7$ transistors available for RSA ⇒ $k \le 64$ to accommodate $a_i \times B$.
Speed comes from using at least n multipliers to perform a full-length $a_i \times B$ (or equivalent) in one cycle.

16 Real-Time?
Assume:
• bus is one k-bit digit per cycle
• k-bit multiplier operates in one cycle
Then:
• $A \times B$ takes n cycles using n multipliers
• Throughput is one digit per cycle for multiplication
• Need O(nk) multiplications for decryption
Conclude:
• Need O(nk) rows of n multipliers.

17 Classical Modular Multiplication Algorithm
  { Pre-condition: 0 ≤ A < r^n }
  P := 0 ;
  For i := n-1 downto 0 do
  Begin
    P := r × P + a_i × B ;
    q_i := P div M ;
    P := P - q_i × M ;
  End
  { Post-conditions: P = A×B - Q×M, P ≡ (A×B) mod M }
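The classical loop translates almost line for line into Python. This is a software sketch with exact quotient digits, not the hardware design: the function name is hypothetical, and Python's big integers stand in for the n-digit registers.

```python
def classical_modmul(A: int, B: int, M: int, r: int, n: int):
    """Most-significant-digit-first modular multiplication.

    Returns (P, Q) with P = A*B - Q*M and 0 <= P < M.
    """
    assert 0 <= A < r ** n
    P, Q = 0, 0
    for i in range(n - 1, -1, -1):
        a_i = (A // r ** i) % r     # digit i of A
        P = r * P + a_i * B         # shift up and add a_i * B
        q_i = P // M                # needs the TOP of P: waits on carries
        P -= q_i * M
        Q = r * Q + q_i             # accumulate the quotient
    return P, Q


r, n = 16, 4
A, B, M = 0x1234, 0xBEEF, 0xF123
P, Q = classical_modmul(A, B, M, r, n)
```

The step `q_i = P // M` is the one the talk identifies as the critical path, since it depends on the most significant digits of P.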

18 Comments
• Carry propagation is a problem (it slows finding q);
• Use only top digits of M and P to determine a good multiple of M to remove;
• P is bounded by a small multiple of M;
• Clean up only at end;
• Critical path is finding q.

19 Disadvantages
• Redundant rep. for digit-parallel operation
• Global broadcast of q to each digit position

20 Montgomery’s Modular Multiplication Algorithm
  { Pre-condition: 0 ≤ A < r^n }
  P := 0 ;
  For i := 0 to n-1 do
  Begin
    q_i := (p_0 + a_i b_0)(-m_0^{-1}) mod r ;
    P := (P + a_i × B + q_i × M) div r ;
    { Invariant: 0 ≤ P < M+B }
  End ;
  { Post-condition: P r^n = A×B + Q×M, P ≡ (A B r^{-n}) mod M }
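A direct software sketch of Montgomery's loop follows, again with a hypothetical function name and big integers in place of digit registers. It asserts the loop invariant on every iteration and returns Q so that the post-condition can be checked.

```python
def montgomery_modmul(A: int, B: int, M: int, r: int, n: int):
    """Least-significant-digit-first multiplication.

    Returns (P, Q) with P * r^n = A*B + Q*M, so P ≡ A*B*r^(-n) mod M.
    Requires gcd(r, M) = 1 so that m_0 is invertible mod r.
    """
    assert 0 <= A < r ** n
    neg_m0_inv = (-pow(M % r, -1, r)) % r      # (-m_0^{-1}) mod r, computed once
    b_0 = B % r
    P, Q = 0, 0
    for i in range(n):
        a_i = (A // r ** i) % r
        q_i = ((P % r + a_i * b_0) * neg_m0_inv) % r   # lowest digits only
        P = (P + a_i * B + q_i * M) // r               # division is exact
        assert 0 <= P < M + B                          # loop invariant
        Q += q_i * r ** i
    return P, Q


r, n = 16, 4
A, B, M = 0x1234, 0xBEEF, 0xF123
P, Q = montgomery_modmul(A, B, M, r, n)
```

In contrast with the classical loop, `q_i` depends only on the least significant digit of P, so it never waits for carries to propagate upwards.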

21 Peter Montgomery:
• reverses the multiplication order
• chooses digits from least to most significant
• shifts down on each iteration
• uses the least significant digits to determine the multiple of M to subtract
• computes $(AB \times r^{-n}) \bmod M$

22
• The factor $r^{-n}$ is cleared up in post-processing
• Any extra multiple of M is removed then
• $q_i$ has no carries to wait for
• Pipelining of the digits can now take place:
  • compute $a_i b_{j+1}$ on the cycle after $a_i b_j$
  • use a non-redundant representation
  • no broadcasting of $q_i$

23 The Post-Condition
• $m_0^{-1}$ exists (mod r, because r is prime to M)
• $q_i$ chosen so division by r is exact
Define $A_i = \sum_{j=0}^{i} a_j r^j$ and $Q_i$ analogously.
Then $A_i = A_{i-1} + r^i a_i$ and $A_{n-1} = A$.
So $r^{i+1} P = A_i \times B + Q_i \times M$ at end of ith iteration.
Hence $r^n P = A \times B + Q \times M$ at end.
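The induction step behind this post-condition can be written out as follows. The subscripted $P_i$ (the value of P after iteration $i$) and the conventions $P_{-1} = 0$, $A_{-1} = Q_{-1} = 0$ are notation assumed here for the derivation, not taken from the slide.

```latex
% P_i is the value of P after iteration i; P_{-1} = A_{-1} = Q_{-1} = 0.
\begin{align*}
r^{i+1} P_i
  &= r^{i}\bigl(P_{i-1} + a_i B + q_i M\bigr)
     && \text{since the division by } r \text{ is exact} \\
  &= \bigl(A_{i-1} B + Q_{i-1} M\bigr) + r^{i} a_i B + r^{i} q_i M
     && \text{by the hypothesis } r^{i} P_{i-1} = A_{i-1} B + Q_{i-1} M \\
  &= A_i B + Q_i M .
\end{align*}
```

Taking $i = n-1$ gives $r^n P = A \times B + Q \times M$, the stated post-condition.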

24 The Bounds
• A converted on-line to non-redundant form
• Can assume $a_i \le r-1$
• So loop invariant $P < M+B$

25 If critical path length is computing q:
• Scale M to ensure $(-m_0^{-1}) \bmod r = 1$
• Shift B up to make $b_0 = 0$
Result:
• $q_i = p_0 \bmod r$ is simple
• Critical path in repeated cell.
Cost:
• Increase n by 2
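One way to realise the scaling and shifting is sketched below. The choice of the scaling digit `d` and all names are assumptions for illustration: M is replaced by `d*M` with `d*M ≡ -1 mod r`, and B by `r*B`, so that `q_i` is simply the lowest digit of P.

```python
def montgomery_simple_q(A: int, B: int, M: int, r: int, n: int):
    """Montgomery loop with the quotient digit read straight from P.

    Uses n + 2 iterations to pay for the scaled modulus and shifted B.
    Returns (P, Q, M_scaled, B_shifted, n2) with
    P * r^n2 = A*B_shifted + Q*M_scaled.
    """
    d = (-pow(M, -1, r)) % r        # digit d with d*M ≡ -1 (mod r)
    M_scaled = d * M                # now (-m_0^{-1}) mod r = 1
    B_shifted = r * B               # now b_0 = 0
    n2 = n + 2
    assert 0 <= A < r ** n2
    P, Q = 0, 0
    for i in range(n2):
        a_i = (A // r ** i) % r
        q_i = P % r                 # q_i = p_0: no multiplication needed
        P = (P + a_i * B_shifted + q_i * M_scaled) // r
        Q += q_i * r ** i
    return P, Q, M_scaled, B_shifted, n2


r, n = 16, 4
A, B, M = 0x1234, 0xBEEF, 0xF123
P, Q, M_scaled, B_shifted, n2 = montgomery_simple_q(A, B, M, r, n)
```

With these choices the result satisfies P ≡ A·B·r^(-(n+1)) mod M, so the power of r to be cleared in post-processing changes accordingly.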

26 Removing $r^n$
The Montgomery class of $A$ is $\bar{A} \equiv r^n A \bmod M$.
Montgomery modular multiplication is denoted $\otimes$.
Montgomery product of $\bar{A}$ and $\bar{B}$ is $\bar{A} \otimes \bar{B} \equiv \bar{A} \bar{B} r^{-n} \equiv AB r^n \equiv \overline{AB} \bmod M$.
Applying $\otimes$ to $\bar{A}$ instead of $\times$ to $A$ produces $\overline{A^e}$ in an exponentiation algorithm.

27 Encryption Process
Process: $A \to \bar{A} \to \overline{A^e} \to A^e$
Precompute: $R_2 = \overline{r^n} \equiv r^{2n} \bmod M$
• Start with: $A \otimes R_2 \equiv A r^n \equiv \bar{A} \bmod M$
• Exponentiate to obtain $\overline{A^e}$
• End with: $\overline{A^e} \otimes 1 \equiv A^e \bmod M$
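The three stages of this process can be sketched as below. The left-to-right square-and-multiply schedule and the conditional subtraction inside `mont_mul` are assumptions made for this sketch; the slides do not prescribe a particular exponentiation method.

```python
def mont_mul(X: int, Y: int, M: int, r: int, n: int) -> int:
    """Montgomery product X (x) Y ≡ X*Y*r^(-n) mod M, reduced below M."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = 0
    for i in range(n):
        x_i = (X // r ** i) % r
        q_i = ((P % r + x_i * (Y % r)) * neg_m0_inv) % r
        P = (P + x_i * Y + q_i * M) // r
    return P - M if P >= M else P        # P < M + Y, so one subtraction suffices


def mont_exp(A: int, e: int, M: int, r: int, n: int) -> int:
    """A^e mod M computed entirely with Montgomery products."""
    R2 = pow(r, 2 * n, M)                # precomputed r^(2n) mod M
    A_bar = mont_mul(A, R2, M, r, n)     # start: A (x) R2 ≡ A*r^n
    acc = mont_mul(1, R2, M, r, n)       # Montgomery form of 1
    for bit in bin(e)[2:]:               # exponent bits, most significant first
        acc = mont_mul(acc, acc, M, r, n)
        if bit == "1":
            acc = mont_mul(acc, A_bar, M, r, n)
    return mont_mul(acc, 1, M, r, n)     # end: (x) 1 strips the factor r^n


result = mont_exp(65, 17, 3233, r=16, n=3)
```

Only the conversions at the start and end involve the factor $r^n$ explicitly; every step in between stays in Montgomery form.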

28 2M Bound
Outputs are re-used as inputs, so need to bound I/O.
Suppose $a_{n-1} = 0$.
Then $P < M+B$ at end of loop $n-2$ yields $P < M + r^{-1}B$ at very end.
E.g. if $B < 2M$ then $P < 2M$.

29 Suppose $2rM < r^n$, $A < 2M$ and $R_2 < 2M$.
Then $\bar{A} < 2M$, $\overline{A^e} < 2M$ and $P = \overline{A^e} \otimes 1 < 2M$.
Final output P satisfies $P r^n = \overline{A^e} + QM$ where $Q \le r^n - 1$.
Here $\overline{A^e} < 2M$ yields $P r^n < (r^n+1)M$. So $P \le M$.
$P = M \Rightarrow A^e \equiv 0 \bmod M \Rightarrow A \equiv 0 \bmod M$.
$A = M$ should never arise; $A = 0$ yields $P = 0$.
So no final modular adjustment is necessary.
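The bounds on this slide can be exercised with a sketch that never subtracts M. The function names and the square-and-multiply schedule are assumptions. The assertions mirror the slide's claims: every intermediate value stays below 2M and the final output is at most M.

```python
def mont_mul_nosub(X: int, Y: int, M: int, r: int, n: int) -> int:
    """Montgomery product with NO final subtraction of M."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = 0
    for i in range(n):
        x_i = (X // r ** i) % r
        q_i = ((P % r + x_i * (Y % r)) * neg_m0_inv) % r
        P = (P + x_i * Y + q_i * M) // r
    return P


def mont_exp_no_adjust(A: int, e: int, M: int, r: int, n: int) -> int:
    """A^e mod M with all intermediate values merely bounded by 2M."""
    assert 2 * r * M < r ** n            # makes the top digit of any X < 2M zero
    assert 0 <= A < 2 * M and e >= 1
    R2 = pow(r, 2 * n, M)
    A_bar = mont_mul_nosub(A, R2, M, r, n)
    assert A_bar < 2 * M
    acc = A_bar
    for bit in bin(e)[3:]:               # bits of e after the leading 1
        acc = mont_mul_nosub(acc, acc, M, r, n)
        if bit == "1":
            acc = mont_mul_nosub(acc, A_bar, M, r, n)
        assert acc < 2 * M               # outputs are safe to re-use as inputs
    P = mont_mul_nosub(acc, 1, M, r, n)
    assert P <= M                        # P = M only if A ≡ 0 mod M
    return P


M, r, n = 3233, 16, 5                    # 2*16*3233 < 16^5
result = mont_exp_no_adjust(65, 17, M, r, n)
```

The price of dropping the subtraction is the slightly larger n required by $2rM < r^n$.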

30 Digit-Parallel Implementation
Classical vs Montgomery:
Similarities:
• Broadcasting of $q_i$ and $a_i$
• Redundant representations
• Computing $q_i$ takes time
Differences:
• Bits to determine $q_i$

31 [Figure: P := P + a_i × B (digit-parallel, not modular), with P in carry-save form $p_j = p_{j,s} + r \times p_{j,c}$. Cells j+1, j, j-1 receive a_i, q_i and m_j, b_j, with carry-save signals p_{j,s}, p_{j,c}.]

32 Digit-Parallel P := rP + a_i × B - q_i × M (Classical)
[Figure: cells n-1, j+1, j, j-1 with inputs a_i, q_i, m_j, b_j, carry-save signals p_{j,s}, p_{j,c}, and outputs q_i, q_{i+1}.]

33 Data Flow for $P^{(i+1)} := (P^{(i)} + a_i \times B + q_i \times M)/r$ (Montgomery)
[Figure: columns j+1, j, j-1, ..., 0 with inputs a_i, q_i and digits of B and M, digits of P at stages (i), (i+1), (n), and carries c_{i,j+2}, c_{i,j+1}, c_{i,j}, c_{i,j-1}, c_{i,1}.]

34 Systolic Array (Montgomery)
Write ith value of P as $P^{(i)} = \sum_{j=0}^{n-1} p^{(i-1)}_{j+1} r^j$
Cells in col j compute $p^{(i)}_j$ at time 2i+j:
  $p^{(i)}_j + r c^{(i)}_j \leftarrow p^{(i-1)}_{j+1} + c^{(i)}_{j-1} + a_i b_j + q_i m_j$
Cells in col 0 compute $q_i$ at time 2i:
  $q_i \leftarrow (p^{(i-1)}_1 + a_i \times b_0)(-m_0^{-1}) \bmod r$
• Any number of rows may be constructed
• Different timing schedules are possible
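The cell recurrence can be simulated digit by digit in software. This is a sequential sketch that only respects the data dependencies. It does not model the 2i+j timing, and the array sizes (n + 2 columns, zero-padded) and all names are assumptions.

```python
def to_digits(x: int, r: int, count: int) -> list[int]:
    return [(x // r ** j) % r for j in range(count)]


def systolic_montgomery(A: int, B: int, M: int, r: int, n: int) -> int:
    """Row i, column j computes p_j + r*c_j from p_{j+1} of the previous row."""
    assert 0 <= A < r ** n and 0 <= B < r ** n and 0 < M < r ** n
    cols = n + 2
    a = to_digits(A, r, n)
    b = to_digits(B, r, cols)                # upper digits are zero padding
    m = to_digits(M, r, cols)
    neg_m0_inv = (-pow(m[0], -1, r)) % r
    p = [0] * (cols + 1)                     # p[j] holds p_j of the previous row
    for i in range(n):                       # one row of cells per digit a_i
        q_i = ((p[1] + a[i] * b[0]) * neg_m0_inv) % r      # column 0
        new_p = [0] * (cols + 1)
        carry = 0                            # c_{j-1}, passed along the row
        for j in range(cols):
            t = p[j + 1] + carry + a[i] * b[j] + q_i * m[j]
            new_p[j] = t % r
            carry = t // r
        assert carry == 0 and new_p[0] == 0  # lowest digit cancels exactly
        p = new_p
    # Digit p_{j+1} of the last row has weight r^j (the shift down by r).
    return sum(d * r ** (j - 1) for j, d in enumerate(p) if j >= 1)


r, n = 16, 4
A, B, M = 0x1234, 0xBEEF, 0xF123
P = systolic_montgomery(A, B, M, r, n)
```

Each cell reads only its right-hand neighbour's carry and the digit above and to its left, which is what makes purely local communication possible in the array.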

35 Systolic Array for $P := (A \times B + Q \times M) r^{-n}$
[Figure: cells (i,j+1), (i,j), (i,j-1) and (i+1,j+1), (i+1,j), (i+1,j-1), with inputs a_i, q_i, a_{i+1}, q_{i+1}, m_j, b_j, digits of p^{(i)}, p^{(i+1)}, p^{(i+2)}, and carries between cells.]

36 Data Flow for $P^{(i+1)} := (P^{(i)} + a_i \times B + q_i \times M)/r$
[Figure: the data-flow diagram of slide 33, shown again.]

37 Digit-Serial Implementation (Montgomery)
Advantages:
• Local communication
• Shorter critical path
• Critical path easily in repeated cell
• Non-redundant representation
• Digit-serial I/O
• Different digits $q_i$ and $a_i$ (relevant to DPA)

38 Digit-Serial Implementation (Montgomery)
Disadvantage:
• H/W only half used
Solutions:
• Interleave two multiplications
• E.g. configure exponentiation ⇒ 75% use
• Group digits as per Peter Kornerup [94]

39
• Other cell boundaries/groupings are possible;
• Timing front angles in the data dependency graph can be altered;
• For current speed of array implementations see Blum and Paar [99];
• Vuillemin et al. [97] constructed an array;
• Design is parametrised by k and no. of rows.

40 Data Dependency Diagrams

41 Data Dependency Diagrams: Parallel Digit Implementation
[Figure: timing fronts t = 0, 1, 2, 3.]

42 Data Dependency Diagrams: Walter [93]
[Figure: timing fronts t = 0 to t = 7, with edges labelled 1 tick and 2 ticks.]

43 Data Dependency Diagrams: Kornerup [94]
[Figure: timing fronts t = 0 to t = 6.]

44 Data Integrity
$P = A \times B - Q \times M$ or $P r^n = A \times B + Q \times M$
These are easily checked mod m, e.g. m a prime just above the maximum cell output.
Cost: ~ one cell in the array, i.e. increasing n by 1.
On error, abort or re-compute by another route, e.g. M replaced by dM for a digit d prime to r.
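The residue check for the Montgomery post-condition can be sketched as below. The check modulus m = 257 and the function names are assumptions chosen for illustration. The slide only asks for a prime just above the maximum cell output.

```python
def montgomery_with_quotient(A: int, B: int, M: int, r: int, n: int):
    """Montgomery loop returning (P, Q) with P * r^n = A*B + Q*M."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P, Q = 0, 0
    for i in range(n):
        a_i = (A // r ** i) % r
        q_i = ((P % r + a_i * (B % r)) * neg_m0_inv) % r
        P = (P + a_i * B + q_i * M) // r
        Q += q_i * r ** i
    return P, Q


def residue_check(P, Q, A, B, M, r, n, m) -> bool:
    """Verify P * r^n ≡ A*B + Q*M using only residues mod the small prime m."""
    lhs = (P % m) * pow(r, n, m) % m
    rhs = ((A % m) * (B % m) + (Q % m) * (M % m)) % m
    return lhs == rhs


r, n, m = 16, 4, 257                      # m: hypothetical small check prime
A, B, M = 0x1234, 0xBEEF, 0xF123
P, Q = montgomery_with_quotient(A, B, M, r, n)
ok = residue_check(P, Q, A, B, M, r, n, m)
caught = not residue_check(P + 1, Q, A, B, M, r, n, m)   # simulated fault in P
```

A check of this kind detects any error that changes the two sides by an amount not divisible by m, so an undetected fault needs a discrepancy that happens to be a multiple of m.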

45 Timing & Power Attacks
• Most attacks which succeed on the classical algorithm have equivalents which will succeed on the corresponding implementation of Montgomery’s algorithm.
• With parallel digit processing, the same digits of A and Q are used in every digit slice in the same cycle. So DPA might reveal them.
• Pipelined version has no equivalent (see data dependency graph). It uses many different digits of A and Q in each cycle. DPA is more difficult.

46 Conclusions
• For a single k-bit multiplier or an array of n parallel cells, classical and Montgomery algorithms are almost equal.
• For a pipelined array, the Montgomery method has advantages: smaller time & area constants, better I/O, better against DPA.
• Pipeline is more complex for 100% use, but has a faster clock.
• Parameters can be chosen for specific purposes.

47 Go forth and Multiply

