Montgomery’s Multiplication Technique: How to make it Smaller and Faster
Colin D. Walter
Computation Department, UMIST, UK
www.co.umist.ac.uk

CHES 99, WPI. C.D. Walter, UMIST.

Peter Montgomery, “Modular Multiplication without Trial Division”, Math. Computation, vol. 44 (1985) 519-521.
Computes $(A \times B) \bmod M$ without obtaining the digits of the quotient $q$ of $(A \times B) / M$.

Motivation
Faster RSA cryptosystem
• through a pipelined array
Safer encryption
• against timing or DPA attacks

Overview
• RSA & Notation
• Classical Algorithm
• Montgomery’s Version
• Comparison: carry propagation, digit distribution, communication, timing/power attacks
• Conclusion

Enigma
Special Purpose: Colossus (1943-44), Tommy Flowers, Bletchley Park, England.
General Purpose: ENIAC (1943-46), John Eckert & John Mauchly, Philadelphia, US.

RSA
• Modulus M of around 1024 bits
• Two keys d and e such that $A^{de} \equiv A \bmod M$
• A encrypted to $C = A^e \bmod M$
• C decrypted by $A = C^d \bmod M$
• M = PQ, a product of two large primes
• e is often small (e.g. a Fermat prime)
• d satisfies $de \equiv 1 \bmod (P-1)(Q-1)$
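The RSA relations on this slide can be tried out with toy numbers. The sketch below is illustrative only: the primes 61 and 53, the exponent 17 and the message 65 are assumptions chosen to keep the arithmetic small, not values from the talk, and the three-argument `pow` with exponent `-1` assumes Python 3.8 or later.

```python
# Toy RSA parameters (assumed for illustration; real moduli are ~1024 bits).
P, Q = 61, 53
M = P * Q                      # modulus M = PQ
phi = (P - 1) * (Q - 1)        # (P-1)(Q-1)
e = 17                         # small public exponent, a Fermat prime
d = pow(e, -1, phi)            # d satisfies d*e = 1 mod (P-1)(Q-1)

A = 65                         # plaintext with 0 <= A < M
C = pow(A, e, M)               # encryption: C = A^e mod M
recovered = pow(C, d, M)       # decryption: A = C^d mod M
```

Each call to `pow` here hides a long chain of modular multiplications, which is the operation the rest of the talk is about.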

Faster H/W = More Secure Encryption
Work to factorize M doubles for every extra ~15 bits (for key lengths ~$2^{10}$ bits).
Work to en/decrypt:
• $((1024+15)/1024)^2$ per multiplication
• $((1024+15)/1024)^3$ per exponentiation, i.e. only 5% extra!
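The two cost ratios can be evaluated directly. This is a small arithmetic check, assuming the quadratic and cubic cost models stated on the slide; the variable names are made up for the example.

```python
bits, extra = 1024, 15

# Cost ratios when the key grows from 1024 to 1039 bits.
per_multiplication = ((bits + extra) / bits) ** 2    # quadratic in key length
per_exponentiation = ((bits + extra) / bits) ** 3    # cubic in key length

extra_work_percent = (per_exponentiation - 1) * 100  # between 4% and 5%
```

So the attacker's work doubles while the legitimate user's work rises by under 5%.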

Number representations
$X = \sum_{i=0}^{n-1} x_i r^i$
• $r = 2^k$ is the radix (prime to M)
• $x_i$ is the ith digit (usually $0 \le x_i < r$)
• n = max no. of digits in any number
• Redundant reps: wider digit range than $0 .. r-1$
• H/W is built from $k \times k$-bit multipliers
• n fixed by H/W register size
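As a concrete illustration of the radix-r notation, here is a minimal sketch that splits an integer into n digits of k bits each, least significant digit first. The helper names `to_digits` and `from_digits` are hypothetical, introduced only for this example.

```python
def to_digits(x, k, n):
    """Return the n radix-2^k digits of x, least significant first."""
    r = 1 << k
    digits = []
    for _ in range(n):
        digits.append(x % r)   # x_i with 0 <= x_i < r
        x //= r
    if x:
        raise ValueError("x does not fit in n digits")
    return digits


def from_digits(digits, k):
    """Rebuild X = sum of x_i * r^i from its digits."""
    return sum(d << (k * i) for i, d in enumerate(digits))
```

For example, `to_digits(0x1234, 4, 4)` gives `[4, 3, 2, 1]`.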

Redundancy
Digits $x_j$ split into carry-save parts: $x_j = x_{j,s} + r x_{j,c}$
$X \leftarrow X + Y$ is performed by digit-parallel addition: $x_j \leftarrow x_{j,s} + x_{j-1,c} + y_j$
No carry propagation: only old carries on right side.
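The carry-save update can be sketched in software. This is an illustration rather than a hardware description: `cs_add` and `cs_value` are hypothetical names, Y is assumed to have ordinary digits below r, and the top carry is assumed to stay zero (the guard checks this).

```python
def cs_value(s, c, r):
    """Value of a carry-save number: sum of (s_j + r*c_j) * r^j."""
    return sum((sj + r * cj) * r**j for j, (sj, cj) in enumerate(zip(s, c)))


def cs_add(s, c, y, r):
    """One digit-parallel step X <- X + Y with no carry propagation."""
    assert c[-1] == 0, "top carry would be lost; use more digits"
    new_s, new_c = [0] * len(s), [0] * len(s)
    for j in range(len(s)):
        old_carry = c[j - 1] if j > 0 else 0   # carry from the previous step only
        t = s[j] + old_carry + y[j]
        new_s[j], new_c[j] = t % r, t // r     # save part and new carry part
    return new_s, new_c
```

Every position reads only values from the previous step, so in hardware all positions can be updated in the same clock cycle.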

Multiplication $A \times B$
Use n digit multipliers to form $a_i \times B$ and add to a partial product P:
    P := 0 ;
    For i := n-1 downto 0 do
        P := r × P + a_i × B
    { Post-condition: P = A × B }
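The loop above translates almost line for line into Python. In this sketch the digits of A are assumed to be stored least significant first, and the function name is made up for the example.

```python
def multiply_msd_first(a_digits, B, r):
    """Form A*B by scanning the digits of A from most to least significant."""
    P = 0
    for a_i in reversed(a_digits):   # i = n-1 downto 0
        P = r * P + a_i * B          # shift up, then add a_i * B
    return P                         # post-condition: P = A * B
```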

Either:
Use redundancy in P and parallel digit addition to add $a_i B$ in one clock cycle.
Cell j computes $a_i b_j$ in cycle i.
P in carry-save form: $p_j = p_{j,s} + r \times p_{j,c}$
[Figure: $P := P + a_i \times B$ (digit-parallel). Cells j+1, j, j-1 receive $a_i$ and $b_{j+1}$, $b_j$, $b_{j-1}$, and exchange the carry-save parts $p_{j,s}$ and $p_{j,c}$.]

or:
Pipeline the addition of $a_i \times B$ over n cycles and propagate carries with no redundancy.
Cell j computes $a_i b_j$ in cycle i+j.
[Figure: $P := P + a_i \times B$ (digit-serial). Cells j+1, j, j-1 operate at times j+1, j, j-1 on $p_{j+1}$, $p_j$, $p_{j-1}$ and $b_{j+1}$, $b_j$, $b_{j-1}$; $a_i$ and the carry pass from cell to cell.]

Multiplier Complexity
Assume wires take area but not time (or power).
Area × Time$^2$ complexity for un-pipelined k-bit multiplication is bounded below by $k^2$.
This can be achieved for time in $[\log k .. \sqrt{k}]$.
• Discrete Fourier Transform has large constants for time and area.
• Better, but asymptotically poorer designs for k expected here.

Cross-over point?
$10^7$ transistors available for RSA $\Rightarrow k \le 64$ to accommodate $a_i \times B$.
Gain speed by using at least n multipliers to perform a full-length $a_i \times B$ (or equivalent) in one cycle.

Real-Time?
Assume:
• bus is one k-bit digit per cycle
• k-bit multiplier operates in one cycle
Then:
• $A \times B$ takes n cycles using n multipliers
• Throughput is one digit per cycle for multiplication.
• Need O(nk) multiplications for decryption
Conclude:
• Need O(nk) rows of n multipliers.

Classical Modular Multiplication Algorithm
    { Pre-condition: 0 ≤ A < r^n }
    P := 0 ;
    For i := n-1 downto 0 do
    Begin
        P := r × P + a_i × B ;
        q_i := P div M ;
        P := P - q_i × M ;
    End
    { Post-conditions: P = A×B - Q×M, P ≡ (A×B) mod M }
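A direct software rendering of the classical algorithm, as a minimal sketch. It uses exact integer division for $q_i$, whereas the hardware discussed in the talk estimates $q_i$ from the top digits; the function name and the least-significant-first digit order are assumptions of the example.

```python
def classical_modmul(a_digits, B, M, r):
    """Interleaved multiply-and-reduce, most significant digit of A first."""
    P = 0
    for a_i in reversed(a_digits):   # i = n-1 downto 0
        P = r * P + a_i * B
        q_i = P // M                 # quotient digit: needs the top of P
        P -= q_i * M                 # now 0 <= P < M
    return P
```

Note that `q_i` depends on the most significant part of P, which is why carry propagation sits on the critical path in this version.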

Comments
• Carry propagation is a problem (it slows finding q);
• Use only top digits of M and P to determine a good multiple of M to remove;
• P is bounded by a small multiple of M;
• Clean up only at end;
• Critical path is finding q.

Disadvantages
• Redundant rep. for digit-parallel operation
• Global broadcast of q to each digit position

Montgomery’s Modular Multiplication Algorithm
    { Pre-condition: 0 ≤ A < r^n }
    P := 0 ;
    For i := 0 to n-1 do
    Begin
        q_i := (p_0 + a_i b_0)(-m_0^{-1}) mod r ;
        P := (P + a_i × B + q_i × M) div r ;
        { Invariant: 0 ≤ P < M+B }
    End ;
    { Post-condition: P r^n = A×B + Q×M, P ≡ (A B r^{-n}) mod M }
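The same algorithm as a short Python sketch. The function name is hypothetical, the digits of A are assumed to be least significant first, and M is assumed odd so that $m_0^{-1} \bmod r$ exists; `pow(x, -1, r)` needs Python 3.8 or later.

```python
def montgomery_modmul(a_digits, B, M, r):
    """Return P with P * r^n = A*B + Q*M, so P = A*B*r^(-n) mod M up to a multiple of M."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r        # (-m_0^{-1}) mod r
    b0 = B % r
    P = 0
    for a_i in a_digits:                         # i = 0 to n-1
        q_i = ((P % r + a_i * b0) * neg_m0_inv) % r
        P = (P + a_i * B + q_i * M) // r         # exact division by r
        # invariant: 0 <= P < M + B
    return P
```

Only the lowest digit of P is needed to choose `q_i`, so there is no carry chain to wait for. The result is not fully reduced: it is only guaranteed to lie below M + B.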

Peter Montgomery:
• reverses the multiplication order
• chooses digits from least to most significant
• shifts down on each iteration.
• uses the least significant digits to determine multiple of M to subtract.
• Computes $(AB \times r^{-n}) \bmod M$

• The factor $r^{-n}$ is cleared up in post-processing
• Any extra multiple of M is removed then
• $q_i$ has no carries to wait for
• Pipelining of the digits can now take place:
  • compute $a_i b_{j+1}$ on the cycle after $a_i b_j$
  • use a non-redundant representation
  • no broadcasting of $q_i$

The Post-Condition
• $m_0^{-1}$ exists
• $q_i$ chosen so division by r is exact
Define $A_i = \sum_{j=0}^{i} a_j r^j$ and $Q_i$ analogously.
Then $A_i = A_{i-1} + r^i a_i$ and $A_{n-1} = A$.
So $r^{i+1} P = A_i \times B + Q_i \times M$ at end of ith iteration.
Hence $r^n P = A \times B + Q \times M$ at end.

The Bounds
• A converted on-line to non-redundant form
• Can assume $a_i \le r-1$
• So loop invariant $P < M+B$
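The invariant follows in one line from the loop body of Montgomery’s algorithm. As a worked step, assuming $P < M+B$ before the iteration and ordinary digits $a_i \le r-1$, $q_i \le r-1$ (the name $P_{\text{new}}$ is introduced here for the updated value):

```latex
\begin{align*}
P_{\text{new}} = \frac{P + a_i B + q_i M}{r}
  < \frac{(M + B) + (r-1)B + (r-1)M}{r}
  = M + B .
\end{align*}
```

Since P starts at 0, the bound holds on every iteration.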

If critical path length is computing q:
• Scale M to ensure $(-m_0^{-1}) \bmod r = 1$
• Shift B up to make $b_0 = 0$
Result:
• $q_i = p_0 \bmod r$ is simple
• Critical path in repeated cell.
Cost:
• Increase n by 2

Removing $r^n$
The Montgomery class of A is $\bar{A} \equiv r^n A \bmod M$.
Montgomery modular multiplication is denoted $\otimes$.
The Montgomery product of $\bar{A}$ and $\bar{B}$ is $\bar{A} \otimes \bar{B} \equiv \bar{A}\bar{B} r^{-n} \equiv AB r^n \equiv \overline{AB} \bmod M$.
Applying $\otimes$ to $\bar{A}$ instead of $\times$ to A produces $\overline{A^e}$ in an exponentiation algorithm.

Encryption Process
Process: $A \to \bar{A} \to \overline{A^e} \to A^e$
Precompute: $R_2 = \overline{r^n} \equiv r^{2n} \bmod M$
• Start with: $A \otimes R_2 \equiv A r^n \equiv \bar{A} \bmod M$
• Exponentiate to obtain $\overline{A^e}$
• End with: $\overline{A^e} \otimes 1 \equiv A^e \bmod M$
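The whole process can be sketched with a left-to-right square-and-multiply loop. The exponentiation schedule, the function names `mont` and `mont_exp`, and the test values are assumptions of this example, not details from the talk. The sketch assumes M is odd, $0 < A < M$ and $2rM < r^n$, so that intermediate values stay below 2M and no subtraction of M is ever performed.

```python
def mont(A, B, M, r, n):
    """Montgomery product: returns P with P * r^n = A*B + Q*M (needs A < r^n)."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P = 0
    for _ in range(n):
        a_i, A = A % r, A // r                   # next digit of A, lowest first
        q_i = ((P + a_i * B) * neg_m0_inv) % r
        P = (P + a_i * B + q_i * M) // r
    return P


def mont_exp(A, e, M, r, n):
    """Compute A^e mod M using only Montgomery products."""
    R2 = pow(r, 2 * n, M)                        # precomputed r^(2n) mod M
    A_bar = mont(A, R2, M, r, n)                 # Montgomery class of A
    X_bar = mont(1, R2, M, r, n)                 # Montgomery class of 1
    for bit in bin(e)[2:]:                       # exponent bits, top first
        X_bar = mont(X_bar, X_bar, M, r, n)      # square
        if bit == "1":
            X_bar = mont(X_bar, A_bar, M, r, n)  # multiply
    return mont(X_bar, 1, M, r, n)               # multiply by 1 to strip r^n
```

For instance, with M = 50001 and r = 16 the condition $2rM < r^n$ first holds at n = 6, and `mont_exp(12345, 65537, 50001, 16, 6)` agrees with Python's built-in `pow(12345, 65537, 50001)`.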

2M Bound
Outputs are re-used as inputs. So need to bound I/O:
Suppose $a_{n-1} = 0$.
Then $P < M+B$ at end of loop n-2 yields $P < M + r^{-1}B$ at very end.
e.g. If $B < 2M$ then $P < 2M$.
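The step from $P < M+B$ to $P < M + r^{-1}B$ can be written out explicitly. With $a_{n-1} = 0$ the last iteration adds no multiple of B, so, assuming $q_{n-1} \le r-1$ and $r \ge 2$ (the name $P_{\text{final}}$ is introduced here for the output):

```latex
\begin{align*}
P_{\text{final}} = \frac{P + q_{n-1} M}{r}
  < \frac{(M + B) + (r-1)M}{r}
  = M + r^{-1} B ,
\qquad
B < 2M \;\Rightarrow\; P_{\text{final}} < M + \frac{2M}{r} \le 2M .
\end{align*}
```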

Suppose $2rM < r^n$, $A < 2M$ and $R_2 < 2M$.
Then $\bar{A} < 2M$, $\overline{A^e} < 2M$ and $P = \overline{A^e} \otimes 1 < 2M$.
Final output P satisfies $P r^n = \overline{A^e} + QM$ where $Q \le r^n - 1$.
Here $\overline{A^e} < 2M$ yields $P r^n < (r^n + 1)M$.
So $P \le M$.
$P = M \Rightarrow \overline{A^e} \equiv 0 \bmod M \Rightarrow A \equiv 0 \bmod M$.
A = M should never arise; A = 0 yields P = 0.
So no final modular adjustment is necessary.

Digit-Parallel Implementation
Classical vs Montgomery:
Similarities:
• Broadcasting of $q_i$ and $a_i$
• Redundant representations
• Computing $q_i$ takes time
Differences:
• Bits to determine $q_i$

P in carry-save form: $p_j = p_{j,s} + r \times p_{j,c}$
[Figure: $P := P + a_i \times B$ (digit-parallel, not modular). Cells j+1, j, j-1 hold $m_j$, $b_j$, receive the broadcast $a_i$, $q_i$, and exchange the carry-save parts $p_{j,s}$ and $p_{j,c}$.]

[Figure: Digit-Parallel $P := rP + a_i \times B - q_i \times M$ (Classical). Cells n-1, j+1, j, j-1 hold $m_j$, $b_j$, receive $a_i$, $q_i$, and exchange carry-save parts; cell n-1 produces $q_{i+1}$.]

[Figure: Data Flow for $P^{(i+1)} := (P^{(i)} + a_i \times B + q_i \times M)/r$ (Montgomery). Cells j+1, j, j-1, ..., 0 hold $m_j$, $b_j$; $a_i$ and $q_i$ pass from cell to cell with carries $c_{i,j}$; cell 0 generates $q_i$.]

Systolic Array (Montgomery)
Write the ith value of P as $P^{(i)} = \sum_{j=0}^{n-1} p^{(i)}_j r^j$.
Cells in col j compute $p^{(i)}_j$ at time 2i+j:
$p^{(i)}_j + r c^{(i)}_j \leftarrow p^{(i-1)}_{j+1} + c^{(i)}_{j-1} + a_i b_j + q_i m_j$
Cells in col 0 compute $q_i$ at time 2i:
$q_i \leftarrow (p^{(i-1)}_1 + a_i \times b_0)(-m_0^{-1}) \bmod r$
• Any number of rows may be constructed
• Different timing schedules are possible

[Figure: Systolic Array for $P := (A \times B + Q \times M) r^{-n}$. Rows i and i+1 of cells (i, j+1), (i, j), (i, j-1); each cell holds $m_j$, $b_j$, passes $a_i$, $q_i$ and the carry along its row, and passes digits of $P^{(i)}$ down to the next row.]

Digit-Serial Implementation (Montgomery)
Advantages:
• Local communication
• Shorter critical path
• Critical path easily in repeated cell
• Non-redundant representation
• Digit serial I/O
• Different digits $q_i$ and $a_i$ in use in each cycle (relevant to DPA)

Digit-Serial Implementation (Montgomery)
Disadvantage:
• H/W only half used
Solutions:
• Interleave two multiplications
• E.g. configure exponentiation, giving 75% use
• Group digits as per Peter Kornerup [94]

• Other cell boundaries/groupings are possible;
• Timing front angles in the data dependency graph can be altered;
• For current speed of array implementations see Blum and Paar [99];
• Vuillemin et al. [97] constructed an array;
• Design is parametrised: by k and no. of rows.

Data Dependency Diagrams

[Figure: Data Dependency Diagram, Parallel Digit Implementation, timing fronts t = 0 to t = 3.]

[Figure: Data Dependency Diagram, Walter [93], timing fronts t = 0 to t = 7, with delays of 1 tick and 2 ticks marked.]

[Figure: Data Dependency Diagram, Kornerup [94], timing fronts t = 0 to t = 6.]

Data Integrity
$P = A \times B - Q \times M$ or $P r^n = A \times B + Q \times M$
These are easily checked mod m, e.g. m a prime just above the maximum cell output.
Cost: ~ one cell in the array, i.e. increasing n by 1.
On error, abort or re-compute by another route: e.g. M replaced by dM for a digit d prime to r.
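A software analogue of this residue check, as a minimal sketch. The function names, the check prime m = 257 and the digit order (least significant first) are assumptions made for the example; the talk does not specify how the check is wired into the array.

```python
def montgomery_with_quotient(a_digits, B, M, r):
    """Montgomery product that also returns Q, so P * r^n = A*B + Q*M."""
    neg_m0_inv = (-pow(M % r, -1, r)) % r
    P, Q = 0, 0
    for i, a_i in enumerate(a_digits):
        q_i = ((P + a_i * B) * neg_m0_inv) % r
        Q += q_i * r**i                          # collect the digits of Q
        P = (P + a_i * B + q_i * M) // r
    return P, Q


def residues_agree(P, Q, A, B, M, r, n, m):
    """Check P * r^n = A*B + Q*M using residues mod a small prime m only."""
    lhs = (P % m) * pow(r, n, m) % m
    rhs = ((A % m) * (B % m) + (Q % m) * (M % m)) % m
    return lhs == rhs
```

A single corrupted digit in P changes the left-hand residue whenever m does not divide the error times $r^n$, so most faults are caught at the cost of arithmetic on very small numbers.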

Timing & Power Attacks
• Most attacks which succeed on the classical algorithm have equivalents which will succeed on the corresponding implementation of Montgomery’s algorithm.
• With parallel digit processing, the same digits of A and Q are used in every digit slice in the same cycle. So DPA might reveal them.
• The pipelined version has no equivalent (see data dependency graph). It uses many different digits of A and Q in each cycle. DPA is more difficult.

Conclusions
• For a single k-bit multiplier or an array of n parallel cells, the classical and Montgomery algorithms are almost equal.
• For a pipelined array, the Montgomery method has advantages: smaller time & area constants, better I/O, better against DPA;
• The pipeline is more complex for 100% use, but has a faster clock.
• Parameters can be chosen for specific purposes.

Go forth and Multiply
