2
Montgomery’s Multiplication Technique: How to make it Smaller and Faster
Colin D. Walter
Computation Department, UMIST, UK
www.co.umist.ac.uk
3
CHES 99, WPI; C.D. Walter, UMIST
Peter Montgomery, “Modular Multiplication without Trial Division”, Math. Computation, vol. 44 (1985) 519-521
Computes $(A \times B) \bmod M$ without obtaining the digits of the quotient $q = \lfloor (A \times B) / M \rfloor$.
4
Motivation
- Faster RSA cryptosystem through a pipelined array
- Safer encryption against timing or DPA attacks
5
Overview
- RSA & Notation
- Classical Algorithm
- Montgomery’s Version
- Comparison: carry propagation, digit distribution, communication, timing/power attacks
- Conclusion
6
Enigma
- Special purpose: Colossus (1943-44), Tommy Flowers, Bletchley Park, England.
- General purpose: ENIAC (1943-46), John Eckert & John Mauchly, Philadelphia, US.
7
RSA
- Modulus $M$ of around 1024 bits
- Two keys $d$ and $e$ such that $A^{de} \equiv A \bmod M$
- $A$ encrypted to $C = A^e \bmod M$
- $C$ decrypted by $A = C^d \bmod M$
- $M = PQ$, a product of two large primes
- $e$ is often small (e.g. a Fermat prime)
- $d$ satisfies $de \equiv 1 \bmod (P-1)(Q-1)$
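The relations on this slide can be tried out on a toy example. The sketch below is only an illustration: the primes 61 and 53, the exponent 17 and the message 65 are assumptions chosen for readability, and numbers this small offer no security.

```python
# Toy RSA round trip; all concrete numbers are illustrative assumptions.
P, Q = 61, 53                         # two (far too small) primes
M = P * Q                             # modulus M = PQ
e = 17                                # small public exponent, a Fermat prime
d = pow(e, -1, (P - 1) * (Q - 1))     # d*e = 1 mod (P-1)(Q-1); needs Python 3.8+

A = 65                                # plaintext, 0 <= A < M
C = pow(A, e, M)                      # encrypt: C = A^e mod M
A_back = pow(C, d, M)                 # decrypt: A = C^d mod M
```

Every step here is a modular exponentiation, which is why the speed of modular multiplication dominates the cost of RSA.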
8
Faster H/W = More Secure Encryption
- Work to factorize $M$ doubles for every extra ~15 bits (for key lengths ~$2^{10}$ bits)
- Work to en/decrypt grows by a factor of $((1024+15)/1024)^2$ per multiplication and $((1024+15)/1024)^3$ per exponentiation, i.e. only 5% extra!
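The "5% extra" figure can be reproduced with a few lines of arithmetic. The variable names are assumptions introduced for this sketch.

```python
# Cost growth when a 1024-bit modulus is lengthened by 15 bits.
bits, extra = 1024, 15
growth = (bits + extra) / bits
per_multiplication = growth ** 2      # cost of one modular multiplication is quadratic in length
per_exponentiation = growth ** 3      # the exponent also gets longer, adding one more factor
print(f"{per_multiplication:.4f} {per_exponentiation:.4f}")
```

The exponentiation factor comes out a little under 1.05, while the attacker's factoring work doubles.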
9
Number representations
- $X = \sum_{i=0}^{n-1} x_i r^i$
- $r = 2^k$ is the radix (prime to $M$)
- $x_i$ is the $i$th digit (usually $0 \le x_i < r$)
- $n$ is the maximum number of digits in any number
- Redundant representations: wider digit range than $0 \ldots r-1$
- H/W is built from $k \times k$-bit multipliers
- $n$ is fixed by the H/W register size
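A minimal sketch of the non-redundant radix-$2^k$ representation, assuming digits are stored least significant first. The helper names `to_digits` and `from_digits` are hypothetical, not taken from the slides.

```python
def to_digits(x, k, n):
    """Return the n radix-2^k digits of x, least significant first."""
    r = 1 << k
    if not 0 <= x < r ** n:
        raise ValueError("x does not fit in n digits")
    return [(x >> (k * i)) & (r - 1) for i in range(n)]


def from_digits(digits, k):
    """Evaluate X = sum of x_i * r^i for r = 2^k."""
    r = 1 << k
    return sum(d * r ** i for i, d in enumerate(digits))
```

For example, with $k = 4$ the value 0x1234 has the digits 4, 3, 2, 1 in increasing order of significance.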
10
Redundancy
- Digits $x_j$ are split into carry-save parts: $x_j = x_{j,s} + r x_{j,c}$
- $X \leftarrow X + Y$ is performed by digit-parallel addition: $x_j \leftarrow x_{j,s} + x_{j-1,c} + y_j$
- No carry propagation: only old carries appear on the right side
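The digit-parallel addition can be sketched as below. This is an illustration under stated assumptions: the save and carry parts are held in two Python lists `s` and `c` (least significant first), the function names are hypothetical, and the top carry position is required to be empty so that nothing is shifted out of the register.

```python
def cs_value(s, c, r):
    """Value of a carry-save number: sum of (s_j + r*c_j) * r^j."""
    return sum((sj + r * cj) * r ** j for j, (sj, cj) in enumerate(zip(s, c)))


def cs_add(s, c, y, r):
    """Add the plain digits y to the carry-save number (s, c).

    Each position j uses only the OLD carry from position j-1, so all
    positions could be evaluated at the same time in hardware.
    """
    n = len(s)
    if c[n - 1] != 0:
        raise ValueError("no headroom: top carry would be lost")
    new_s, new_c = [], []
    for j in range(n):
        old_carry = c[j - 1] if j > 0 else 0
        t = s[j] + old_carry + y[j]
        new_s.append(t % r)      # save part x_{j,s}
        new_c.append(t // r)     # carry part x_{j,c}, consumed by position j+1 next time
    return new_s, new_c
```

The loop is sequential only because Python is; no iteration reads a value written by another iteration of the same call.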
11
Multiplication $A \times B$
Use $n$ digit multipliers to form $a_i \times B$ and add it to a partial product $P$:
    P := 0 ;
    For i := n-1 downto 0 do
        P := r × P + a_i × B
    { Post-condition: P = A × B }
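A direct transcription of this loop, assuming the digits of $A$ arrive as a list with the least significant digit first. The function name `multiply` is a placeholder for this sketch.

```python
def multiply(a_digits, B, r):
    """Most-significant-digit-first multiplication: P := r*P + a_i*B."""
    P = 0
    for a_i in reversed(a_digits):    # i = n-1 downto 0
        P = r * P + a_i * B
    return P                          # post-condition: P = A * B
```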
12
Either: use redundancy in $P$ and parallel digit addition to add $a_i \times B$ in one clock cycle.
- Cell $j$ computes $a_i \times b_j$ in cycle $i$
- $P := P + a_i \times B$ (digit-parallel)
- $P$ is in carry-save form: $p_j = p_{j,s} + r p_{j,c}$
[Figure: cells $j+1$, $j$, $j-1$ with inputs $a_i$, $b_j$ and carry-save digits $p_{j,s}$, $p_{j,c}$.]
13
or: pipeline the addition of $a_i \times B$ over $n$ cycles and propagate carries with no redundancy.
- Cell $j$ computes $a_i \times b_j$ in cycle $i+j$
- $P := P + a_i \times B$ (digit-serial)
[Figure: cell positions $j+1$, $j$, $j-1$ at times $j+1$, $j$, $j-1$, with inputs $a_i$, $b_j$, digits $p_j$ and a carry passed between neighbours.]
14
Multiplier Complexity
- Assume wires take area but not time (or power).
- The $\text{Area} \times \text{Time}^2$ complexity of un-pipelined $k$-bit multiplication is bounded below by $k^2$.
- This bound can be achieved for Time in $[\log k, \sqrt{k}]$.
- The Discrete Fourier Transform has large constants for time and area.
- Designs that are asymptotically poorer but better in practice are appropriate for the values of $k$ expected here.
15
Cross-over point?
- $10^7$ transistors available for RSA
- k 64 to accommodate $a_i \times B$
- Speed comes from using at least $n$ multipliers to perform a full-length $a_i \times B$ (or equivalent) in one cycle.
16
Real-Time?
Assume:
- the bus carries one $k$-bit digit per cycle
- a $k$-bit multiplier operates in one cycle
Then:
- $A \times B$ takes $n$ cycles using $n$ multipliers
- throughput is one digit per cycle for multiplication
- $O(nk)$ multiplications are needed for decryption
Conclude: $O(nk)$ rows of $n$ multipliers are needed.
17
Classical Modular Multiplication Algorithm
    { Pre-condition: 0 ≤ A < r^n }
    P := 0 ;
    For i := n-1 downto 0 do
    Begin
        P := r × P + a_i × B ;
        q_i := P div M ;
        P := P - q_i × M ;
    End
    { Post-conditions: P = A×B - Q×M, P ≡ (A×B) mod M }
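The classical loop can be run as written. In this sketch Python's arbitrary-precision integers stand in for the hardware registers, the exact quotient digit `P // M` is used instead of a hardware estimate from the top digits, and the name `classical_modmul` is an assumption.

```python
def classical_modmul(a_digits, B, M, r):
    """Interleaved multiply-and-reduce, most significant digit of A first.

    Returns (P, Q) with P = A*B - Q*M and 0 <= P < M.
    """
    P = 0
    Q = 0
    for a_i in reversed(a_digits):    # i = n-1 downto 0
        P = r * P + a_i * B
        q_i = P // M                  # quotient digit: needs the top of P, so waits for carries
        P -= q_i * M
        Q = r * Q + q_i               # accumulate Q = sum of q_i * r^i
    return P, Q
```

Usage: `classical_modmul([4, 3, 2, 1], 0xBEEF, 0xC001, 16)` multiplies 0x1234 by 0xBEEF modulo 0xC001 in radix 16.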
18
Comments
- Carry propagation is a problem (it slows finding $q$).
- Use only the top digits of $M$ and $P$ to determine a good multiple of $M$ to remove.
- $P$ is bounded by a small multiple of $M$.
- Clean up only at the end.
- The critical path is finding $q$.
19
Disadvantages
- Redundant representation for digit-parallel operation
- Global broadcast of $q$ to each digit position
20
Montgomery’s Modular Multiplication Algorithm
    { Pre-condition: 0 ≤ A < r^n }
    P := 0 ;
    For i := 0 to n-1 do
    Begin
        q_i := (p_0 + a_i × b_0)(-m_0^{-1}) mod r ;
        P := (P + a_i × B + q_i × M) div r ;
        { Invariant: 0 ≤ P < M+B }
    End ;
    { Post-condition: P × r^n = A×B + Q×M, P ≡ (A×B×r^{-n}) mod M }
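A minimal Python sketch of the loop above, again with big integers standing in for registers. The function name `montgomery_modmul` and the returned pair `(P, Q)` are assumptions made for illustration; `pow(M, -1, r)` needs Python 3.8 or later and requires $M$ prime to $r$.

```python
def montgomery_modmul(a_digits, B, M, r):
    """Least-significant-digit-first Montgomery multiplication.

    Returns (P, Q) with P * r^n = A*B + Q*M, where n = len(a_digits).
    """
    m0_neg_inv = (-pow(M, -1, r)) % r          # (-m_0^{-1}) mod r
    b0 = B % r
    P = 0
    Q = 0
    for i, a_i in enumerate(a_digits):         # i = 0 to n-1
        q_i = ((P % r + a_i * b0) * m0_neg_inv) % r   # uses only the lowest digit of P
        P = (P + a_i * B + q_i * M) // r       # division is exact by the choice of q_i
        assert 0 <= P < M + B                  # loop invariant from the slide
        Q += q_i * r ** i
    return P, Q
```

Unlike the classical version, `q_i` depends only on the least significant digit of `P`, so it never waits for carries to reach the top of the register.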
21
Peter Montgomery:
- reverses the multiplication order
- chooses digits from least to most significant
- shifts down on each iteration
- uses the least significant digits to determine the multiple of $M$ to subtract
- computes $(A \times B \times r^{-n}) \bmod M$
22
- The factor $r^{-n}$ is cleared up in post-processing
- Any extra multiple of $M$ is removed then
- $q_i$ has no carries to wait for
- Pipelining of the digits can now take place:
  - compute $a_i b_{j+1}$ on the cycle after $a_i b_j$
  - use a non-redundant representation
  - no broadcasting of $q_i$
23
The Post-Condition
- $m_0^{-1}$ exists modulo $r$, because $r$ is prime to $M$
- $q_i$ is chosen so that division by $r$ is exact
- Define $A_i = \sum_{j=0}^{i} a_j r^j$ and $Q_i$ analogously
- Then $A_i = A_{i-1} + r^i a_i$ and $A_{n-1} = A$
- So $r^{i+1} P = A_i \times B + Q_i \times M$ at the end of the $i$th iteration
- Hence $r^n P = A \times B + Q \times M$ at the end
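The induction step behind the post-condition can be written out explicitly. Writing $P_i$ for the value of $P$ at the end of iteration $i$ is a notational assumption made here, with $P_{-1} = A_{-1} = Q_{-1} = 0$ and the hypothesis $r^{i} P_{i-1} = A_{i-1} B + Q_{i-1} M$.

```latex
\begin{align*}
r^{i+1} P_i &= r^{i}\,\bigl(P_{i-1} + a_i B + q_i M\bigr)
               && \text{(the division by $r$ is exact)} \\
            &= \bigl(A_{i-1} B + Q_{i-1} M\bigr) + r^{i} a_i B + r^{i} q_i M
               && \text{(induction hypothesis)} \\
            &= \bigl(A_{i-1} + r^{i} a_i\bigr) B + \bigl(Q_{i-1} + r^{i} q_i\bigr) M \\
            &= A_i B + Q_i M .
\end{align*}
```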
24
The Bounds
- $A$ is converted on-line to non-redundant form
- Can assume $a_i \le r-1$
- So the loop invariant $P < M + B$ holds
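The invariant follows from the digit bounds in one line. The notation $P_{\text{old}}$, $P_{\text{new}}$ for the value of $P$ before and after one iteration is an assumption of this sketch; it uses $a_i \le r-1$, $q_i \le r-1$ and $P_{\text{old}} < M + B$.

```latex
\[
P_{\text{new}} \;=\; \frac{P_{\text{old}} + a_i B + q_i M}{r}
\;<\; \frac{(M + B) + (r-1)B + (r-1)M}{r}
\;=\; M + B .
\]
```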
25
If the critical path length is computing $q$:
- Scale $M$ to ensure $(-m_0^{-1}) \bmod r = 1$
- Shift $B$ up to make $b_0 = 0$
- Result: $q_i = p_0 \bmod r$ is simple, and the critical path is in the repeated cell
- Cost: increase $n$ by 2
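One way to realise the two adjustments is sketched below. It is an assumption of this example, not a statement of the author's design, that the scaling is done by multiplying $M$ by the digit $d = (-m_0^{-1}) \bmod r$ and the shift by replacing $B$ with $rB$. The result then carries one fewer factor of $r^{-1}$ and is reduced modulo the scaled modulus, which is where the extra digits of $n$ would go in hardware.

```python
def simplified_montgomery(a_digits, B, M, r):
    """Montgomery loop in which q_i is just the lowest digit of P.

    Returns P with P * r^(n-1) = A*B (mod M), n = len(a_digits).
    """
    d = (-pow(M, -1, r)) % r      # scaling digit (assumed construction)
    M_scaled = d * M              # now M_scaled = -1 (mod r), so (-m_0^{-1}) mod r = 1
    B_shifted = r * B             # lowest digit b_0 is now 0
    P = 0
    for a_i in a_digits:
        q_i = P % r               # no multiplication needed to find q_i
        P = (P + a_i * B_shifted + q_i * M_scaled) // r   # exact division
    return P
```

The check `(P * r**(n-1) - A*B) % M == 0` confirms that the output is still a valid Montgomery-style residue of $A \times B$.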
26
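Removing $r^{-n}$
- The Montgomery class of $A$ is $\bar{A} \equiv r^n A \bmod M$.
- Montgomery modular multiplication is denoted $\otimes$.
- The Montgomery product of $\bar{A}$ and $\bar{B}$ is $\bar{A} \otimes \bar{B} \equiv \bar{A} \bar{B} r^{-n} \equiv A B r^{n} \equiv \overline{AB} \bmod M$.
- Applying $\otimes$ to $\bar{A}$ instead of $\times$ to $A$ produces $\overline{A^e}$ in an exponentiation algorithm.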
27
Encryption Process
- Process: $A \to \bar{A} \to \overline{A^e} \to A^e$
- Precompute: $R_2 = \overline{r^n} \equiv r^{2n} \bmod M$
- Start with: $A \otimes R_2 \equiv A r^n \equiv \bar{A} \bmod M$
- Exponentiate to obtain $\overline{A^e}$
- End with: $\overline{A^e} \otimes 1 \equiv A^e \bmod M$
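The whole process can be sketched in a few lines. The names `mont_mul` and `mont_exp` are hypothetical, the square-and-multiply schedule is an assumption (the slides do not fix an exponentiation method), and the parameters are assumed to satisfy $2rM < r^n$ with $A < 2M$ not a multiple of $M$, so that no conditional subtraction is needed.

```python
def mont_mul(X, Y, M, r, n):
    """Montgomery product: returns P with P * r^n = X*Y + Q*M."""
    neg_inv = (-pow(M, -1, r)) % r             # (-m_0^{-1}) mod r, Python 3.8+
    P = 0
    for i in range(n):
        x_i = (X // r ** i) % r                # i-th digit of X
        q_i = ((P + x_i * Y) * neg_inv) % r
        P = (P + x_i * Y + q_i * M) // r
    return P


def mont_exp(A, e, M, r, n):
    """Compute A^e mod M for e >= 1 using only Montgomery products."""
    R2 = pow(r, 2 * n, M)                      # precomputed r^(2n) mod M
    A_bar = mont_mul(A, R2, M, r, n)           # into the Montgomery class
    acc = A_bar
    for bit in bin(e)[3:]:                     # exponent bits after the leading 1
        acc = mont_mul(acc, acc, M, r, n)      # square
        if bit == "1":
            acc = mont_mul(acc, A_bar, M, r, n)    # multiply
    return mont_mul(acc, 1, M, r, n)           # out of the Montgomery class
```

Usage: `mont_exp(65, 17, 3233, 16, 5)` performs a toy RSA encryption with radix 16 and five digits, since $2 \cdot 16 \cdot 3233 < 16^5$.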
28
2M Bound
Outputs are re-used as inputs, so I/O needs to be bounded:
- Suppose $a_{n-1} = 0$
- Then $P < M + B$ at the end of loop $n-2$ yields $P < M + r^{-1} B$ at the very end
- e.g. if $B < 2M$ then $P < 2M$
29
- Suppose $2rM < r^n$, $A < 2M$ and $R_2 < 2M$
- Then $\bar{A} < 2M$, $\overline{A^e} < 2M$ and $P = \overline{A^e} \otimes 1 < 2M$
- The final output $P$ satisfies $P r^n = \overline{A^e} + QM$ where $Q \le r^n - 1$
- Here $\overline{A^e} < 2M$ yields $P r^n < (r^n + 1) M$, so $P \le M$
- $P = M \Rightarrow \overline{A^e} \equiv 0 \bmod M \Rightarrow A \equiv 0 \bmod M$
- $A = M$ should never arise; $A = 0$ yields $P = 0$
- So no final modular adjustment is necessary.
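For small parameters these bounds can be checked exhaustively. The sketch below is an illustration only: `mont_mul` and `bounds_hold` are hypothetical names, and the check covers every pair of inputs below $2M$ for a modulus small enough to enumerate.

```python
def mont_mul(X, Y, M, r, n):
    """Montgomery product: returns P with P * r^n = X*Y + Q*M."""
    neg_inv = (-pow(M, -1, r)) % r
    P = 0
    for i in range(n):
        x_i = (X // r ** i) % r
        q_i = ((P + x_i * Y) * neg_inv) % r
        P = (P + x_i * Y + q_i * M) // r
    return P


def bounds_hold(M, r, n):
    """Check the 2M bound and the final-output bound for all inputs < 2M."""
    assert 2 * r * M < r ** n                  # pre-condition from the slide
    for X in range(2 * M):
        for Y in range(2 * M):
            if mont_mul(X, Y, M, r, n) >= 2 * M:
                return False                   # an output escaped the 2M bound
        P = mont_mul(X, 1, M, r, n)            # the final conversion step
        if P > M or (P == M and X % M != 0):
            return False
    return True
```

For instance, `bounds_hold(13, 4, 4)` enumerates all inputs below 26 with radix 4 and four digits.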
30
Digit-Parallel Implementation: Classical vs Montgomery
Similarities:
- Broadcasting of $q_i$ and $a_i$
- Redundant representations
- Computing $q_i$ takes time
Differences:
- Bits used to determine $q_i$
31
[Figure: digit-parallel, non-modular step $P := P + a_i \times B$ with $P$ in carry-save form, $p_j = p_{j,s} + r p_{j,c}$; cells $j+1$, $j$, $j-1$ hold $m_j$, $b_j$ and receive the digits $a_i$, $q_i$.]
32
[Figure: digit-parallel classical step $P := rP + a_i \times B - q_i \times M$; cells $n-1$, ..., $j+1$, $j$, $j-1$ hold $m_j$, $b_j$, receive $a_i$, $q_i$ and pass on the carry-save digits $p_{j,s}$, $p_{j,c}$.]
33
[Figure: (Montgomery) data flow for $P^{(i+1)} := (P^{(i)} + a_i \times B + q_i \times M)/r$ through cells $j+1$, $j$, $j-1$, ..., $0$, showing the digits $p_j$, the carries $c_{i,j}$ and the digits $a_i$, $q_i$ passed from cell to cell.]
34
Systolic Array (Montgomery)
- Write the $i$th value of $P$ as $P^{(i)} = \sum_{j=0}^{n-1} p^{(i-1)}_{j+1} r^j$
- Cells in column $j$ compute $p^{(i)}_j$ at time $2i+j$: $p^{(i)}_j + r c^{(i)}_j \leftarrow p^{(i-1)}_{j+1} + c^{(i)}_{j-1} + a_i b_j + q_i m_j$
- Cells in column 0 compute $q_i$ at time $2i$: $q_i \leftarrow (p^{(i-1)}_1 + a_i \times b_0)(-m_0^{-1}) \bmod r$
- Any number of rows may be constructed
- Different timing schedules are possible
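The cell equations can be simulated digit by digit. This is a sketch under stated assumptions rather than a model of the actual hardware: rows are evaluated one after another instead of in overlapped time steps, two spare columns are added so that the top carry is always absorbed, carries are allowed to exceed one digit, and the name `systolic_montgomery` is hypothetical. Digit 0 of every row is zero by the choice of $q_i$ and is discarded, which realises the division by $r$.

```python
def systolic_montgomery(a, b, m, r):
    """Cell-level Montgomery multiplication.

    a, b, m: digit lists (least significant first) of A, B, M, all of length n.
    Returns P with P * r^n = A*B + Q*M.
    """
    n = len(a)
    width = n + 2                                  # spare columns absorb the growth of P
    b = b + [0] * (width - n)
    m = m + [0] * (width - n)
    m0_neg_inv = (-pow(m[0], -1, r)) % r           # (-m_0^{-1}) mod r
    prev = [0] * (width + 1)                       # row i-1 outputs plus a zero sentinel
    for i in range(n):
        # Column 0 of row i determines q_i (time 2i in the slide's schedule).
        q_i = ((prev[1] + a[i] * b[0]) * m0_neg_inv) % r
        cur = [0] * (width + 1)
        carry = 0
        for j in range(width):                     # cell (i, j), time 2i + j
            t = prev[j + 1] + carry + a[i] * b[j] + q_i * m[j]
            cur[j], carry = t % r, t // r
        assert cur[0] == 0 and carry == 0          # exact division, no overflow
        prev = cur
    # Drop the zero digit in column 0: the remaining digits form P.
    return sum(d * r ** (j - 1) for j, d in enumerate(prev) if j >= 1)
```

Cell $(i, j)$ reads only the carry from cell $(i, j-1)$ and the digit from cell $(i-1, j+1)$, both available one time step earlier under the $2i + j$ schedule, so all communication is local.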
35
[Figure: systolic array for $P := (A \times B + Q \times M) r^{-n}$; cells $(i, j)$ and $(i+1, j)$ for columns $j+1$, $j$, $j-1$ hold $m_j$, $b_j$, receive $a_i$, $q_i$ and a carry, and pass digits of $P^{(i)}$ to the next row.]
36
[Figure: data flow for $P^{(i+1)} := (P^{(i)} + a_i \times B + q_i \times M)/r$ through cells $j+1$, $j$, $j-1$, ..., $0$, showing the digits $p_j$, the carries $c_{i,j}$ and the digits $a_i$, $q_i$ passed from cell to cell.]
37
Digit-Serial Implementation (Montgomery)
Advantages:
- Local communication
- Shorter critical path
- Critical path easily in repeated cell
- Non-redundant representation
- Digit-serial I/O
- Different digits $q_i$ and $a_i$ in use in each cycle, which helps against DPA
38
Digit-Serial Implementation (Montgomery)
Disadvantage: H/W only half used
Solutions:
- Interleave two multiplications, e.g. configure exponentiation for 75% use
- Group digits as per Peter Kornerup [94]
39
- Other cell boundaries/groupings are possible.
- Timing front angles in the data dependency graph can be altered.
- For the current speed of array implementations see Blum and Paar [99].
- Vuillemin et al. [97] constructed an array.
- The design is parametrised by $k$ and the number of rows.
40
Data Dependency Diagrams
41
Data Dependency Diagrams: Parallel Digit Implementation
[Figure: time fronts $t = 0, 1, 2, 3$.]
42
Data Dependency Diagrams: Walter [93]
[Figure: time fronts $t = 0, \ldots, 7$, with steps of 1 tick and 2 ticks.]
43
Data Dependency Diagrams: Kornerup [94]
[Figure: time fronts $t = 0, \ldots, 6$.]
44
Data Integrity
- $P = A \times B - Q \times M$ or $P r^n = A \times B + Q \times M$
- These are easily checked mod $m$, e.g. with $m$ a prime just above the maximum cell output.
- Cost: about one cell in the array, i.e. increasing $n$ by 1.
- On error, abort or re-compute by another route, e.g. with $M$ replaced by $dM$ for a digit $d$ prime to $r$.
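A residue check of this kind can be sketched as follows. The function names are hypothetical, and the example works on whole integers; in hardware the residues would presumably be accumulated digit by digit alongside the main computation, which this sketch does not model.

```python
def check_classical(P, A, B, Q, M, m):
    """Check P = A*B - Q*M using residues modulo a small prime m only."""
    return P % m == ((A % m) * (B % m) - (Q % m) * (M % m)) % m


def check_montgomery(P, A, B, Q, M, r, n, m):
    """Check P * r^n = A*B + Q*M using residues modulo a small prime m only."""
    lhs = (P % m) * pow(r, n, m) % m
    rhs = ((A % m) * (B % m) + (Q % m) * (M % m)) % m
    return lhs == rhs
```

A single corrupted digit changes one side of the identity by a non-zero amount, which is detected unless that amount happens to be a multiple of $m$.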
45
Timing & Power Attacks
- Most attacks which succeed on the classical algorithm have equivalents which will succeed on the corresponding implementation of Montgomery’s algorithm.
- With parallel digit processing, the same digits of $A$ and $Q$ are used in every digit slice in the same cycle, so DPA might reveal them.
- The pipelined version has no equivalent (see the data dependency graph): it uses many different digits of $A$ and $Q$ in each cycle, so DPA is more difficult.
46
Conclusions
- For a single $k$-bit multiplier or an array of $n$ parallel cells, the classical and Montgomery algorithms are almost equal.
- For a pipelined array, the Montgomery method has advantages: smaller time & area constants, better I/O, better resistance to DPA.
- The pipeline is more complex for 100% use, but allows a faster clock.
- Parameters can be chosen for specific purposes.
47
Go forth and Multiply