Published by Gavin Wiggins. Modified over 9 years ago.
1
Long Modular Multiplication for Cryptographic Applications
Laszlo Hars, Seagate Research
Workshop on Cryptographic Hardware and Embedded Systems, CHES 2004, Boston, MA
Full version of the paper is at: http://www.hars.us/Papers/ModMult.pdf
2
Outline
Background (need, algorithms, complexity…)
Target: occasional PK crypto (smartcard, OSD…)
Optimizations
–Hardware architecture
    General purpose, supports fast modular reduction
    Speed: parallel operation: multiply || add / load…
    Memory: in-place update
–Algorithmic improvements
    Multiply with short reciprocal (~trial division)
    –Precision – scaling of reciprocals
    –Drop insignificant terms
    Modulus scaling
3
Modular Multiplication
a × b mod m = remainder of (a × b) ÷ m
Used in RSA, ECC, ElGamal, Diffie-Hellman, primality tests, BBS-PRNG…
Assume a, b, m are n-digit numbers
▫ m normalized: $\tfrac{1}{2} d^n \le m < d^n$
▫ Digit size (machine word) = 16 bits (8…64)
▫ n = 64 for RSA-1024 (10…256)
Squaring is ~twice as fast
Conserve memory
Divide after multiply: double-length product
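To make the notation on this slide concrete, the following is a minimal Python sketch of the digit representation and the normalization condition, with the "divide after multiply" approach as a reference. The helper names `to_digits`, `is_normalized` and `mulmod_reference` are illustrative and do not come from the talk; the digit size d = 2^16 follows the slide.

```python
D = 1 << 16  # digit size from the slide: 16-bit machine words

def to_digits(x, d=D):
    """Split a non-negative integer into base-d digits, least significant first."""
    digits = []
    while x:
        x, low = divmod(x, d)
        digits.append(low)
    return digits or [0]

def is_normalized(m, n, d=D):
    """True when m is an n-digit modulus with d^n / 2 <= m < d^n."""
    return d**n <= 2 * m < 2 * d**n

def mulmod_reference(a, b, m):
    """Divide after multiply: form the double-length product, then reduce."""
    return (a * b) % m
```

The reference version needs room for the full 2n-digit product, which is the memory cost the interleaved methods on the following slides try to avoid.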
4
Modular Multiplication
Interleaved multiplication and division
Barrett multiplication
–Multiply with reciprocal ($[d^{2n} / m]$: extra n digits)
Quisquater's multiplication
–Scaling the modulus for many MS 1-bits (S: extra n digits of storage)
Montgomery multiplication
–Number representation: $a \to a \times d^n \bmod m$
–Right-to-left (simple) interleaved division
–Needs pre- and post-processing
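As an illustration of the "multiply with reciprocal" idea, here is a hedged sketch of textbook Barrett reduction on Python integers. It is not taken from the talk: the function names `barrett_setup` and `barrett_mulmod` are hypothetical, and the sketch assumes m has exactly n base-d digits and that 0 <= a, b < m.

```python
D = 1 << 16  # assumed digit size (16-bit words)

def barrett_setup(m, n, d=D):
    """Precompute the long reciprocal [d^(2n) / m] for an n-digit modulus m."""
    return d**(2 * n) // m

def barrett_mulmod(a, b, m, mu, n, d=D):
    """Return a*b mod m for 0 <= a, b < m, using mu = [d^(2n) / m]."""
    x = a * b                                   # double-length product, x < d^(2n)
    q = ((x // d**(n - 1)) * mu) // d**(n + 1)  # quotient estimate, never too large
    r = x - q * m
    while r >= m:                               # a few subtractions fix a low estimate
        r -= m
    return r
```

The reciprocal is computed once per modulus and reused for every multiplication, which is where the extra n digits of storage mentioned on the slide go.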
5
Sub-Quadratic time algorithms
Fast multiplications
Complicated algorithms
▫ Pay off for very long numbers
▫ Karatsuba: $O(n^{\log_2 3})$ – faster if n > 10…30
▫ Toom-Cook 3, 4… way: $O(n^{\alpha})$
▫ 3FT (Finite Field Fourier Transform): $O(n \cdot \log n \cdot \log\log n)$
Division = multiplication with reciprocal
Long reciprocal $[d^{2n} / m]$
–Newton iteration: 0.6…2 multiplication time
Speed-ups for PKC: www.hars.us/Papers/Truncated Products.pdf
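For readers unfamiliar with the first sub-quadratic method listed, a minimal Karatsuba sketch on Python integers follows. It splits by bits rather than by machine digits, and the `cutoff_bits` threshold is an arbitrary assumption chosen for illustration, not a tuned crossover point like the n > 10…30 range on the slide.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Multiply non-negative integers with three half-size products per level."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y                          # small operand: multiply directly
    h = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << h) - 1
    x1, x0 = x >> h, x & mask                 # x = x1 * 2^h + x0
    y1, y0 = y >> h, y & mask                 # y = y1 * 2^h + y0
    z2 = karatsuba(x1, y1, cutoff_bits)
    z0 = karatsuba(x0, y0, cutoff_bits)
    # (x1 + x0)(y1 + y0) - z2 - z0 equals the middle term x1*y0 + x0*y1
    z1 = karatsuba(x1 + x0, y1 + y0, cutoff_bits) - z2 - z0
    return (z2 << (2 * h)) + (z1 << h) + z0
```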
6
Quadratic time algorithms
School multiplication: $n^2$ digit products
School division: $k \cdot n^2$ digit operations
–Quotient digits estimated with short divisions
Digit-multiplications || other operations
+Simple structure
+No extra storage when interleaved
–Slower
–Quotient digits with trial-and-error
Goal: reduce # correction steps
7
Multiply-Accumulate
DSP: multiplication parallel to load / store / add / compare…
Order of the digit-product calculation
▫ Row-order (use input digits sequentially)
    for i = 0 … |a|-1
        for j = 0 … |b|-1
            … a_i b_j …
  –More memory access
▪ Column-order (output digits sequentially)
    for k = 0 … |a|+|b|-2
        for i, j: i+j = k
            … a_i b_j …
  –Longer accumulator (can be split)
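The two loop orders can be compared side by side in a short Python sketch. Digits are stored least significant first; the function names and the helper `from_digits` are illustrative only. In the column-order version the variable `acc` plays the role of the long accumulator: it holds a whole column sum before one output digit is written.

```python
D = 1 << 16  # assumed digit size (16-bit words)

def from_digits(digits, d=D):
    """Rebuild an integer from base-d digits, least significant first."""
    return sum(x * d**i for i, x in enumerate(digits))

def mul_row_order(a, b, d=D):
    """Row order: each digit of a sweeps over b; partial results are re-read and re-written."""
    r = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = r[i + j] + ai * bj + carry
            r[i + j] = t % d
            carry = t // d
        r[i + len(b)] = carry
    return r

def mul_column_order(a, b, d=D):
    """Column order: all products with i + j = k are accumulated, then digit k is stored once."""
    r = [0] * (len(a) + len(b))
    acc = 0
    for k in range(len(a) + len(b) - 1):
        for i in range(max(0, k - len(b) + 1), min(k, len(a) - 1) + 1):
            acc += a[i] * b[k - i]
        r[k] = acc % d        # least significant digit of the accumulator
        acc //= d             # keep the more significant part for the next column
    r[len(a) + len(b) - 1] = acc
    return r
```

Row order touches each result digit once per row, which is the extra memory traffic noted on the slide; column order writes each result digit exactly once but needs an accumulator wider than two digits.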
8
HW Architecture
General purpose µP with enhancements
–Circuit utilization: multi-use
DSP structure: multiplication || others
–Multiplier is large and slow
Long accumulator
Split adder / counter
In-accumulator instructions
Quotient-digit correction circuit
Updateable memory
–Circular offset write
9
HW Architecture
16-bit digits || Shift-add = 17.5-bit mult
In-Accumulator
▫ Shift
▫ Add
10
Quotient Digits
No need to store q
q ← multiplication with short reciprocal µ
–µ is used many times
–µ ← Newton iteration, look-up table…
–All bits - 2 MS digits and 1 bit: error = 0 or 1 (-1)
–More than 1-digit reciprocal: quotient often OK
–Most economical: $\mu = [d^{n+2} / 2m] = \{\mu_1, \mu_0\}$
    scale: ÷2m, making µ exact 2-digit
    Special case $m = \tfrac{1}{2} d^n$: $\mu := d^2 - 1$
–Usable: $\mu = [d^{n+1.5} / m]$, $\mu = [2d^{n+1} / m]$…
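A small sketch may help show how a two-digit reciprocal yields a quotient digit. It assumes the "most economical" choice µ = [d^(n+2) / 2m] from this slide, a normalized n-digit modulus, and a partial remainder R < d·m; the quotient formula used is 2·(top two digits of R)·µ / d^3, which is the LRL4 variant expanded on a later slide. The function names are hypothetical. Under these assumptions the estimate works out to be either exact or one too small, so a single correction step suffices.

```python
D = 1 << 16  # assumed digit size (16-bit words)

def short_reciprocal(m, n, d=D):
    """Two-digit reciprocal [d^(n+2) / (2m)]; clamped to d^2 - 1 when m = d^n / 2."""
    return min(d**(n + 2) // (2 * m), d * d - 1)

def estimate_quotient_digit(R, mu, n, d=D):
    """Estimate floor(R / m) for R < d*m from the two leading digits of R."""
    top = R // d**(n - 1)          # x_n * d + x_(n-1)
    return (2 * top * mu) // d**3
```

For example, with m = d^n / 2 the reciprocal is d^2 - 1, and for R = m the estimate is 0 while the true quotient is 1, which is the "one too small" case.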
11
The basic algorithm LRL4
R_{n-1 … n-3} = a_{n_a-1} b_{n_b-1} d + a_{n_a-1} b_{n_b-2} + a_{n_a-2} b_{n_b-1}   // Col 1, 2
for k = n_a+n_b-4 … n-3                   // Columns to left
    R_{n … n-4} += Σ_{i+j=k} a_i b_j      // Loop-1 to right
    if (overflow) R -= m
    q = (R_{n-1} µ_1 d^2 + R_{n-1} µ_0 d + R_{n-2} µ_1 d + R_{n-2} µ_0) / d^3 · 2
    R = (R – q·m) d                       // Loop-2
for k = 0 … n-4                           // LS digits to left
    R_{n … k} += Σ_{i+j=k} a_i b_j        // Loop-3 ~ 1
while (R_n > 0) R -= m                    // fix overflow
Left-Right-Left (military step) algorithm
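The sketch below is a much-simplified stand-in, not the LRL4 schedule itself: it multiplies first and then reduces from the most significant end on Python integers, instead of interleaving column sums with a short accumulator. It only illustrates the quotient-digit step and the correction step, assuming a normalized n-digit modulus, 0 <= a, b < m, and the two-digit reciprocal µ = [d^(n+2) / 2m]. The name `mulmod_digit_serial` is hypothetical.

```python
D = 1 << 16  # assumed digit size (16-bit words)

def mulmod_digit_serial(a, b, m, n, d=D):
    """Return (a*b mod m, number of correction steps) for d^n/2 <= m < d^n."""
    mu = min(d**(n + 2) // (2 * m), d * d - 1)  # 2-digit reciprocal
    r = a * b                                   # r < m * d^n since a, b < m < d^n
    corrections = 0
    for k in range(n - 1, -1, -1):              # one quotient digit per pass
        top = r // d**(n - 1 + k)               # two leading digits of the partial remainder
        q = (2 * top * mu) // d**3              # estimate: exact or one too small
        r -= q * m * d**k
        if r >= m * d**k:                       # at most one correction per digit
            r -= m * d**k
            corrections += 1
    return r, corrections
```

Counting the corrections on sample inputs gives a rough feel for how often the quotient estimate is low, which is the quantity the later slides try to drive down.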
12
Inner Loops (multiply-add)

Σ_{i+j=k} a_i b_j:
Q = 0                                     // 50-bit accumulator
for k = 0 … n-4
    Q = MS(Q) + r_k
    for j = max(0, k+1-n_a) … min(k+1, n_b)
        Q += a_{k-j} b_j
    r_k = D0(Q)
for i = n-3 … n                           // storing digits
    Q = MS(Q) + r_i
    r_i = D0(d)

(R – q·m) d:
c = 0                                     // 1-digit temp store
Q = 0                                     // 33-bit accumulator
for k = 0 … n-1
    Q = MS(Q) + c – q·m_k
    c = r_k
    r_k = D0(Q)
13
Improvements
Probability of an overflow < n / d
–When a, b and m are uniform random (?)
DSP SW mod reduction time = $1.0001n^2 + 4n$
–multiply time = 10 additions: $1.00001n^2 + 4n$
HW assisted time = $n^2 + 4n$
Variants (Accumulator = $x_n d^3 + x_{n-1} d^2 + \dots$)
–LRL4: $q = [\,2(\mu_1 x_n d^2 + (\mu_1 x_{n-1} + \mu_0 x_n) d + \mu_0 x_{n-1}) / d^3\,]$
–LRL3: $q = [\,2(\mu_1 x_n d + (\mu_1 x_{n-1} + \mu_0 x_n)) / d^2\,]$
? LRL2: $q = [\,(\mu_1 x_n d + \mu_0 x_n) / d^{2+\delta}\,]$, many corrections ε 2
Sequential quotient correction
14
Shorter reciprocal
1 digit → error explosion
1 digit + 2 bits OK: $\mu = \tfrac{1}{2} [2d^{n+1} / m] = d + \mu_0 + \delta$, with δ = 0 or ½ (so $\mu_1 = 1$)
50-bit accumulator with carry c = 0 or 1
    $R = c\,d^3 + x_n d^2 + x_{n-1} d + x_{n-2}$
Estimated quotient digit
    $q = [\,(R + R\delta/d) / d^2 + \mu_0 c + \mu_0 x_n / d\,] \approx \mu R / d^3$
Mod reduction time
–SW: $1.25n^2 + n$ (mult = 10 adds: $1.025n^2 + n$)
–HW: $n^2 + n$
Quotient correction
15
Modulus Scaling
Special m: NO multiplication for quotient digit
–Quotient digit: q = r n +1
–(0F) MS digit of m = $d - 1$ = 11…1₂
–(10) MS 2 digits of m = {1, 0}
Transform m: 1-digit scaling factor S
–mS is n+1-digit
–Last reduction step is with m → n-digit result
Need to store m and mS
Faster than Montgomery: $n^2$ + const
Montgomery with modulus scaling: $n^2$ + const
–LS digit of m = $d - 1$ = 11…1₂ (xF)
–Last reduction step is with m → n-digit result
16
Summary