Long Modular Multiplication for Cryptographic Applications Laszlo Hars Seagate Research Workshop on Cryptographic Hardware and Embedded Systems, CHES 2004.

Boston, MA. Full version of the paper is at: http://www.hars.us/Papers/ModMult.pdf

Outline
– Background (need, algorithms, complexity…)
– Target: occasional PK crypto (smartcard, OSD…)
– Optimizations
  – Hardware architecture
    ▫ General purpose, supports fast modular reduction
    ▫ Speed: parallel operation: multiply || add / load…
    ▫ Memory: in-place update
  – Algorithmic improvements
    ▫ Multiply with short reciprocal (~trial division)
      – Precision, scaling of reciprocals
      – Drop insignificant terms
    ▫ Modulus scaling

Modular Multiplication
– a × b mod m = remainder of (a × b) ÷ m
– Used in RSA, ECC, ElGamal, Diffie-Hellman, primality tests, BBS-PRNG…
– Assume a, b, m are n-digit numbers
  ▫ m normalized: ½ d^n ≤ m < d^n
  ▫ Digit size (machine word) = 16 bits (8…64)
  ▫ n = 64 for RSA-1024 (10…256)
– Squaring is ~2× faster
– Conserve memory → Divide after multiply: double-length product
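The conventions on this slide can be made concrete with a small sketch. It assumes 16-bit digits (d = 2^16) stored least significant first; the helper names `to_digits`, `from_digits`, `is_normalized` and `modmul_reference` are illustrative and do not come from the talk. The reference routine is the plain "multiply, then divide" approach that needs the double-length product.

```python
D_BITS = 16          # digit size assumed here: one 16-bit machine word
d = 1 << D_BITS      # the digit base d

def to_digits(x, n):
    """Split x into n base-d digits, least significant digit first."""
    return [(x >> (D_BITS * i)) & (d - 1) for i in range(n)]

def from_digits(digits):
    """Inverse of to_digits."""
    return sum(v << (D_BITS * i) for i, v in enumerate(digits))

def is_normalized(m, n):
    """True when d^n / 2 <= m < d^n, i.e. the MS bit of the MS digit is set."""
    return d ** n // 2 <= m < d ** n

def modmul_reference(a, b, m):
    """Divide after multiply: forms the full (up to 2n-digit) product first."""
    product = a * b
    return product % m
```

For RSA-1024 with these assumptions, n = 64 and the intermediate product occupies 128 digits, which is the memory cost the interleaved methods avoid.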

Modular Multiplication
– Interleaved multiplication and division
– Barrett multiplication
  ▫ Multiply with reciprocal ([d^(2n) / m]: extra n digits)
– Quisquater's multiplication
  ▫ Scaling the modulus for many MS 1-bits (S: extra n digits of storage)
– Montgomery multiplication
  ▫ Number representation: a → a × d^n mod m
  ▫ Right-to-left (simple) interleaved division
  ▫ Needs pre- and post-processing
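As an illustration of the Montgomery item, here is a minimal digit-serial sketch, assuming 16-bit digits and an odd modulus m < d^n. The function names are hypothetical, and the conversion into and out of the a × d^n mod m representation is done with extra Montgomery products, which is one common choice and not necessarily the one used in the paper.

```python
D_BITS = 16
d = 1 << D_BITS

def mont_mul(x, y, m, n):
    """Return x * y * d^(-n) mod m, for odd m < d^n and x, y < m."""
    m_inv = (-pow(m, -1, d)) % d            # -1/m mod d (needs Python 3.8+)
    r = 0
    for i in range(n):                      # right to left over the digits of x
        xi = (x >> (D_BITS * i)) & (d - 1)
        r += xi * y
        q = (r * m_inv) % d                 # q makes the LS digit of r + q*m zero
        r = (r + q * m) >> D_BITS           # exact division by d
    if r >= m:                              # r < 2m here, one subtraction suffices
        r -= m
    return r

def modmul_montgomery(a, b, m, n):
    """a * b mod m with pre- and post-processing around the Montgomery product."""
    r2 = pow(d, 2 * n, m)                   # d^(2n) mod m, precomputed once per modulus
    a_m = mont_mul(a, r2, m, n)             # pre-processing: a * d^n mod m
    b_m = mont_mul(b, r2, m, n)
    c_m = mont_mul(a_m, b_m, m, n)          # a * b * d^n mod m
    return mont_mul(c_m, 1, m, n)           # post-processing: back to a * b mod m
```

The quotient digit q depends only on the least significant digit of r, which is why the slide calls the interleaved division "simple"; the price is the conversion work before and after.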

Sub-Quadratic time algorithms
– Fast multiplications → Complicated algorithms
  ▫ Pays off only for very long numbers
  ▫ Karatsuba: O(n^(log2 3)), faster if n > 10…30
  ▫ Toom-Cook 3,4…way: O(n^α)
  ▫ 3FT (Finite Field Fourier Transform): O(n·log n·log log n)
– Division = multiplication with reciprocal
– Long reciprocal [d^(2n) / m]
  ▫ Newton iteration: 0.6…2 multiplication time
– Speed-ups for PKC: www.hars.us/Papers/Truncated Products.pdf
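The Karatsuba entry can be sketched as follows. This is the textbook three-product recursion on Python integers, not code from the talk; the `cutoff_bits` parameter is an assumed tuning knob standing in for the "n > 10…30" crossover, below which the built-in multiplication is used.

```python
def karatsuba(x, y, cutoff_bits=64):
    """Multiply non-negative integers with three half-size products per level."""
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y                        # short operands: plain multiplication
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x1, x0 = x >> half, x & mask            # x = x1 * 2^half + x0
    y1, y0 = y >> half, y & mask            # y = y1 * 2^half + y0
    z2 = karatsuba(x1, y1, cutoff_bits)     # high product
    z0 = karatsuba(x0, y0, cutoff_bits)     # low product
    # The middle term reuses z2 and z0, saving the fourth multiplication.
    z1 = karatsuba(x1 + x0, y1 + y0, cutoff_bits) - z2 - z0
    return (z2 << (2 * half)) + (z1 << half) + z0
```

The extra additions, subtractions and bookkeeping are the "complicated algorithms" cost noted on the slide.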

Quadratic time algorithms
– School multiplication: n^2 digit-products
– School division: k·n^2 digit operations
  ▫ Quotient digits estimated with short divisions
– Digit-multiplications || other operations
  + Simple structure
  + No extra storage when interleaved
  – Slower
  – Quotient digits with trial-and-error
– Goal: reduce # correction steps
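A rough sketch of the trial-and-error quotient digits in school division, assuming 16-bit digits and a normalized n-digit modulus. The estimate used here is the classical one (the top two digits of the running remainder divided by the top digit of m), which never undershoots and overshoots by at most 2; the function name and the correction counter are illustrative.

```python
D_BITS = 16
d = 1 << D_BITS

def school_mod(x, m, n):
    """Return (x mod m, number of correction steps) for a normalized n-digit m."""
    m_top = m >> (D_BITS * (n - 1))               # MS digit of m, at least d/2
    num_digits = -(-x.bit_length() // D_BITS)     # ceil(bits / 16)
    r, corrections = 0, 0
    for i in reversed(range(num_digits)):         # left to right over the digits of x
        r = (r << D_BITS) | ((x >> (D_BITS * i)) & (d - 1))
        top2 = r >> (D_BITS * (n - 1))            # top two digits of the (n+1)-digit r
        q = min(top2 // m_top, d - 1)             # short division gives the trial digit
        r -= q * m
        while r < 0:                              # trial digit was too large: add back
            r += m
            corrections += 1
    return r, corrections
```

Each pass through the `while` loop is one of the correction steps the slide wants to reduce.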

Multiply-Accumulate
– DSP: multiplication parallel to load / store / add / compare…
– Order of the digit-product calculation
  ▫ Row-order (use input digits sequentially)
      for i = 0 … |a|-1
        for j = 0 … |b|-1
          … a_i b_j …
    – More memory access
  ▫ Column-order (output digits sequentially)
      for k = 0 … |a|+|b|-2
        for i,j: i+j = k
          … a_i b_j …
    – Longer accumulator (can be split)
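The two loop orders can be compared side by side in a short sketch, assuming 16-bit digits held in lists with the least significant digit first. The function names are illustrative. In the row-order version every partial result digit is read and written once per row; in the column-order version each output digit is written once, but the accumulator `acc` must hold a sum of up to min(|a|, |b|) double-length products.

```python
D_BITS = 16
d = 1 << D_BITS

def to_digits(x, n):
    return [(x >> (D_BITS * i)) & (d - 1) for i in range(n)]

def from_digits(digits):
    return sum(v << (D_BITS * i) for i, v in enumerate(digits))

def mul_row_order(a, b):
    """Row order: take the digits of a one by one, update the result in memory."""
    r = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            t = r[i + j] + ai * bj + carry        # read-modify-write of r[i+j]
            r[i + j] = t & (d - 1)
            carry = t >> D_BITS
        r[i + len(b)] = carry
    return r

def mul_column_order(a, b):
    """Column order: finish one output digit at a time in a long accumulator."""
    r = [0] * (len(a) + len(b))
    acc = 0
    for k in range(len(a) + len(b) - 1):
        for i in range(max(0, k - len(b) + 1), min(k, len(a) - 1) + 1):
            acc += a[i] * b[k - i]                # all products with i + j = k
        r[k] = acc & (d - 1)                      # store the finished digit once
        acc >>= D_BITS                            # keep the upper part as carry
    r[len(a) + len(b) - 1] = acc
    return r
```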

HW Architecture
– General purpose µP with enhancements
  ▫ Circuit utilization: multi-use
– DSP structure: multiplication || others
  ▫ Multiplier is large and slow
– Long accumulator → Split adder / counter
– In-Accumulator instructions
– Quotient-digit correction circuit
– Updateable memory
  ▫ Circular offset write

HW Architecture
– 16-bit digits || Shift-add = 17.5-bit mult
– In-Accumulator
  ▫ Shift
  ▫ Add

Quotient Digits
– No need to store q
– q ← multiplication with short reciprocal µ
  ▫ µ is used many times
  ▫ µ ← Newton iteration, look-up table…
  ▫ All bits – 2 MS digits and 1 bit: error = 0 or 1 (−1)
  ▫ More than 1-digit reciprocal: quotient often OK
  ▫ Most economical: µ = [d^(n+2) / 2m] = {µ_1, µ_0}
    scale: ÷2m, making µ exact 2-digit
    Special case m = ½ d^n → µ := d^2 − 1
  ▫ Usable: µ = [d^(n+1.5) / m], µ = [2d^(n+1) / m]…
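A minimal sketch of a quotient digit taken from the short reciprocal µ = [d^(n+2) / 2m], assuming 16-bit digits and a normalized n-digit m. The estimate multiplies the top two digits of the running remainder by µ and keeps q = [2·top·µ / d^3]; with these definitions the estimate never exceeds the true digit, so only upward corrections occur. Function names and the correction counter are illustrative, and the remainder is processed one digit at a time, not in the interleaved form of the talk.

```python
D_BITS = 16
d = 1 << D_BITS

def short_reciprocal(m, n):
    """mu = [d^(n+2) / 2m]; clamped to d^2 - 1 in the special case m = d^n / 2."""
    return min(d ** (n + 2) // (2 * m), d * d - 1)

def reduce_with_reciprocal(x, m, n):
    """Return (x mod m, number of corrections) using mu instead of short divisions."""
    mu = short_reciprocal(m, n)                   # computed once, used for every digit
    num_digits = -(-x.bit_length() // D_BITS)
    r, corrections = 0, 0
    for i in reversed(range(num_digits)):
        r = (r << D_BITS) | ((x >> (D_BITS * i)) & (d - 1))
        top = r >> (D_BITS * (n - 1))             # top two digits of the (n+1)-digit r
        q = (2 * top * mu) >> (3 * D_BITS)        # q = [2 * top * mu / d^3]
        r -= q * m
        while r >= m:                             # estimate was one too small
            r -= m
            corrections += 1
    return r, corrections
```

Because µ fits in two digits, the estimate costs four digit-products per quotient digit and no division.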

The basic algorithm LRL4
Left-Right-Left (military step) algorithm

R_{n-1 … n-3} = a_{na-1} b_{nb-1} d + a_{na-1} b_{nb-2} + a_{na-2} b_{nb-1}   // Col 1, 2
for k = na+nb-4 … n-3                        // Columns to left
    R_{n … n-4} += Σ_{i+j=k} a_i b_j         // Loop-1 to right
    if (overflow) R -= m
    q = (R_{n-1} µ_1 d^2 + R_{n-1} µ_0 d + R_{n-2} µ_1 d + R_{n-2} µ_0) / d^3 · 2
    R = (R – q·m) d                          // Loop-2
for k = 0 … n-4                              // LS digits to left
    R_{n … k} += Σ_{i+j=k} a_i b_j           // Loop-3, like Loop-1
while (R_n > 0) R -= m                       // fix overflow

Inner Loops (multiply-add)

Σ_{i+j=k} a_i b_j:
Q = 0                                        // 50-bit accumulator
for k = 0 … n-4
    Q = MS(Q) + r_k
    for j = max(0, k+1-na) … min(k+1, nb)
        Q += a_{k-j} b_j
    r_k = D0(Q)
for i = n-3 … n                              // storing digits
    Q = MS(Q) + r_i
    r_i = D0(Q)

(R – q·m) d:
c = 0                                        // 1-digit temp store
Q = 0                                        // 33-bit accumulator
for k = 0 … n-1
    Q = MS(Q) + c – q·m_k
    c = r_k
    r_k = D0(Q)

Improvements
– Probability of an overflow < n / d
  ▫ When a, b and m are uniform random (?)
– DSP SW mod reduction time = 1.0001 n^2 + 4n
  ▫ multiply time = 10 additions: 1.00001 n^2 + 4n
– HW assisted time = n^2 + 4n
– Variants (Accumulator = x_n d^3 + x_{n−1} d^2 + …)
  ▫ LRL4: q = [ 2 ( µ_1 x_n d^2 + (µ_1 x_{n−1} + µ_0 x_n) d + µ_0 x_{n−1} ) / d^3 ]
  ▫ LRL3: q = [ 2 ( µ_1 x_n d + (µ_1 x_{n−1} + µ_0 x_n) ) / d^2 ]
  ▫ LRL2: q = [ ( µ_1 x_n d + µ_0 x_n ) / d^(2+δ) ], many corrections ε 2
– Sequential quotient correction

Shorter reciprocal
– 1 digit → error explosion
– 1 digit + 2 bits OK: µ = ½ [2d^(n+1) / m] = d + µ_0 + δ, with δ = 0 or ½
– 50-bit accumulator with carry c = 0 or 1
  R = c d^3 + x_n d^2 + x_{n−1} d + x_{n−2}
– Estimated quotient-digit
  q = [ (R + R δ/d) / d^2 + µ_0 c + µ_0 x_n / d ] ≈ µR / d^3
– Mod reduction time
  ▫ SW: 1.25 n^2 + n (mult = 10 adds: 1.025 n^2 + n)
  ▫ HW: n^2 + n
– µ_1 = 1
– Quotient correction

Modulus Scaling
– Special m: NO multiplication for quotient-digit
  ▫ Quotient digit: q = r_{n+1}
  ▫ (0F) MS digit of m = d − 1 = 11…1 (binary)
  ▫ (10) MS 2 digits of m = {1, 0}
– Transform m: 1-digit scaling factor S
  ▫ mS is n+1-digit
  ▫ Last reduction step is with m → n-digit result
→ Need to store m and mS
– Faster than Montgomery: n^2 + const
→ Montgomery with modulus scaling: n^2 + const
  ▫ LS digit of m = d − 1 = 11…1 (binary) (xF)
  ▫ Last reduction step is with m → n-digit result
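The (0F) case can be sketched as follows, assuming 16-bit digits and a normalized n-digit m. The scaling factor is chosen here as S = [(d^(n+1) − 1) / m], which forces the MS digit of the (n+1)-digit mS to d − 1; under this normalization that S can need one bit more than a digit, so the paper's exact choice of S may differ. With such a modulus the quotient digit is read directly from the MS digit r_{n+1} of the running remainder, with no multiplication, and the last step reduces with the original m (done here with Python's `%` for brevity). All names are illustrative.

```python
D_BITS = 16
d = 1 << D_BITS

def scale_modulus(m, n):
    """Return (S, m*S) with the MS digit of the (n+1)-digit m*S equal to d - 1."""
    S = (d ** (n + 1) - 1) // m                   # assumed choice of S, see lead-in
    return S, m * S

def reduce_scaled(x, m, n):
    """Return (x mod m, number of corrections) via reduction with the scaled modulus."""
    _, mS = scale_modulus(m, n)                   # both m and mS must be stored
    num_digits = -(-x.bit_length() // D_BITS)
    r, corrections = 0, 0
    for i in reversed(range(num_digits)):
        r = (r << D_BITS) | ((x >> (D_BITS * i)) & (d - 1))
        q = r >> (D_BITS * (n + 1))               # quotient digit = MS digit r_{n+1}
        r -= q * mS
        while r >= mS:                            # estimate was one too small
            r -= mS
            corrections += 1
    return r % m, corrections                     # last reduction step is with m
```

Since mS is a multiple of m, reducing modulo mS first and modulo m last gives the same residue as reducing modulo m throughout.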

Summary
