Published by Solomon Bishop. Modified over 8 years ago.
1
Optimizing Modular Multiplication for NVIDIA's Maxwell GPUs
by Niall Emmart (1), Justin Luitjens (2), Charles Weems (1) and Cliff Woolley (2)
(1) University of Massachusetts, (2) NVIDIA Corporation
2
Modular Multiplication
Computes A * B mod M, where A, B and M are hundreds to thousands of bits in length.
3
Motivating Problems
Modular multiplication is a key primitive, used especially in cryptographic operations:
– RSA
– Diffie-Hellman
– Digital signature algorithms
– Prime generation
– Factorization
4
Preliminaries
– Number representation
– The GPU is a batched device
– One problem instance per thread
– Separate multiply and Montgomery reduce phases
– Small sizes, at most 512 bits
5
Achieving High Performance
– Algorithms: use fast squaring and “Almost Montgomery” (a redundant representation)
– Don’t use asymptotically faster algorithms
– Keep everything on chip
– Minimize register usage
– Performance is dominated by the low-level techniques used to sum the product terms
– Good utilization (instructions/cycle)
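The "fast squaring" item can be illustrated with a minimal Python sketch. It assumes 32-bit limbs and relies only on the symmetry of squaring: each cross term A_i * A_j with i < j appears twice, so it is computed once and doubled. The helper names `to_limbs` and `square_limbs` are hypothetical, and Python integers stand in for the GPU's carry handling.

```python
W = 32                      # assumed limb width in bits
MASK = (1 << W) - 1

def to_limbs(x, n):
    """Split integer x into n limbs of W bits, least significant first."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def square_limbs(a):
    """Square a limb vector; return (value, number of limb multiplications)."""
    n = len(a)
    cross = 0
    mults = 0
    # Each off-diagonal product a[i]*a[j] (i < j) is computed only once.
    for i in range(n):
        for j in range(i + 1, n):
            cross += (a[i] * a[j]) << (W * (i + j))
            mults += 1
    diag = 0
    # Diagonal products a[i]*a[i] appear once each.
    for i in range(n):
        diag += (a[i] * a[i]) << (2 * W * i)
        mults += 1
    # Doubling the cross terms restores the a[j]*a[i] half of the sum.
    return 2 * cross + diag, mults
```

For n limbs this uses n(n+1)/2 limb multiplications instead of the n^2 needed by a general multiply.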
6
Multiplication – Product Terms
An n-word by n-word product is computed by summing the n^2 word-by-word product terms.
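As a rough illustration of summing product terms, the sketch below multiplies two n-word numbers column by column, assuming 32-bit words. The function names are hypothetical, and a Python integer plays the role of the column accumulator, so no overflow handling is shown.

```python
W = 32                      # assumed word width in bits
MASK = (1 << W) - 1

def to_limbs(x, n):
    """Split integer x into n words, least significant first."""
    return [(x >> (W * i)) & MASK for i in range(n)]

def from_limbs(limbs):
    """Recombine words (least significant first) into an integer."""
    return sum(v << (W * i) for i, v in enumerate(limbs))

def mul_columns(a, b):
    """Schoolbook n-word by n-word multiply, summing terms per column."""
    n = len(a)
    result = []
    acc = 0
    for col in range(2 * n - 1):
        # All product terms a[i]*b[j] with i + j == col land in this column.
        for i in range(max(0, col - n + 1), min(col, n - 1) + 1):
            acc += a[i] * b[col - i]
        result.append(acc & MASK)   # emit the low word of the column sum
        acc >>= W                   # carry the rest into the next column
    result.append(acc & MASK)       # top word of the 2n-word product
    return result
```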
7
Prior Work – NVIDIA 1.x architectures
Two hardware instructions:
mad24.lo D, A, B, C
mad24.hi D, A, B, C
Note: no carry in or carry out.
8
Prior Work – NVIDIA 1.x architectures (cont)
Bernstein et al. [1]:
– a and b sampled into 15 limbs of 14 bits
– column oriented approach
– uses mad24.lo as a 32-bit accumulator
– achieves 461M 210-bit mod mul ops/sec
Emmart and Weems [2]:
– a and b sampled into 22-bit values
– column oriented approach
– uses pairs of mad24.lo / mad24.hi ops as a 48-bit accumulator
– achieves 822K 256-bit mod exp ops/sec
– equivalent to 816M 210-bit mod mul ops/sec
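The reason a 32-bit accumulator suffices for 15 limbs of 14 bits can be checked numerically. The Python sketch below is an illustration under stated assumptions, not the code of [1]: it sums each column of 14-bit limb products separately, checks that every running sum stays below 2^32, and resolves carries afterwards with ordinary integers. All function names are hypothetical.

```python
LIMB_BITS = 14
N_LIMBS = 15                # 15 * 14 = 210 bits
LIMB_MASK = (1 << LIMB_BITS) - 1

def max_column_sum(limb_bits, n_limbs):
    """Largest possible sum of n_limbs products of two limb_bits-wide limbs."""
    return n_limbs * ((1 << limb_bits) - 1) ** 2

def split_limbs(x):
    """Split a 210-bit integer into 15 limbs of 14 bits, low limb first."""
    return [(x >> (LIMB_BITS * i)) & LIMB_MASK for i in range(N_LIMBS)]

def column_sums(a, b):
    """Accumulate limb products per column, checking the 32-bit bound."""
    cols = [0] * (2 * N_LIMBS - 1)
    for i in range(N_LIMBS):
        for j in range(N_LIMBS):
            s = cols[i + j] + a[i] * b[j]
            if s >= 1 << 32:
                raise OverflowError("column sum exceeds 32 bits")
            cols[i + j] = s
    return cols

def combine(cols):
    """Resolve the column sums into one integer (carries handled by Python)."""
    return sum(c << (LIMB_BITS * k) for k, c in enumerate(cols))
```

The worst case is 15 * (2^14 - 1)^2 = 4,026,040,335, just under 2^32 = 4,294,967,296.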
9
Prior Work – 2.x and 3.x architectures
Two hardware instructions:
imad{c}.hi.{cc} D, A, B, C
imad{c}.lo.{cc} D, A, B, C
Note the carry in and carry out options.
10
Prior Work – 2.x and 3.x architectures (cont)
[Diagram: A3 A2 A1 A0 times B3 B2 B1 B0. For each word Bj, a row of low products L(A0 Bj) ... L(A3 Bj) and a row of high products H(A0 Bj) ... H(A3 Bj) are summed into the result, with an ADD step for each of the B1, B2 and B3 rows.]
Uses an accumulator for each column and ripples the carry.
2n^2 + n – 1 instructions
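The 2n^2 + n – 1 count can be reproduced with a small simulation. The Python sketch below models a 32-bit multiply-add with optional carry in and a carry out, then runs a row-oriented multiply with rippled carries. The `Machine` class and the exact instruction ordering are assumptions made for illustration; they are not taken from the cited implementations.

```python
MASK = 0xFFFFFFFF

class Machine:
    """Toy model of 32-bit multiply-add instructions with a carry flag."""
    def __init__(self):
        self.carry = 0
        self.count = 0      # number of instructions issued

    def mad(self, a, b, c, hi=False, use_carry=False):
        """Low or high half of a*b, plus c, plus optional carry in."""
        p = a * b
        part = (p >> 32) if hi else (p & MASK)
        s = part + c + (self.carry if use_carry else 0)
        self.carry = s >> 32
        self.count += 1
        return s & MASK

    def addc(self, x, y):
        """Add with carry in and carry out."""
        s = x + y + self.carry
        self.carry = s >> 32
        self.count += 1
        return s & MASK

def mul_rows(a, b):
    """Row-oriented n-word multiply; returns (result words, instruction count)."""
    n = len(a)
    m = Machine()
    r = [0] * (2 * n)
    for i in range(n):                       # row 0: low products, no carries
        r[i] = m.mad(a[i], b[0], 0)
    for i in range(n):                       # row 0: high products, rippled
        r[i + 1] = m.mad(a[i], b[0], r[i + 1], hi=True, use_carry=(i > 0))
    for j in range(1, n):
        for i in range(n):                   # low products of row j
            r[i + j] = m.mad(a[i], b[j], r[i + j], use_carry=(i > 0))
        r[j + n] = m.addc(r[j + n], 0)       # the ADD step: absorb the carry
        for i in range(n):                   # high products of row j
            r[i + j + 1] = m.mad(a[i], b[j], r[i + j + 1], hi=True, use_carry=(i > 0))
    return r, m.count
```

Row 0 costs 2n instructions and each of the remaining n – 1 rows costs 2n + 1, which gives 2n^2 + n – 1 in total (35 for n = 4).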
11
Prior Work – 2.x and 3.x architectures (cont)
Zheng et al. [3]:
– row oriented approach
– rippled carries
– 3.412B 256-bit mod mul ops/sec
Emmart and Weems [2]:
– same approach
– 3.469B 256-bit mod mul ops/sec
– noted this approach does not work on Maxwell
12
Maxwell – 5.x
Single hardware instruction:
xmad{.x}{.cc} D, A.{h0|h1}, B.{h0|h1}, C
Note: this instruction also supports carry in and carry out.
13
Maxwell – 5.x (cont)
Consider computing A * B, where A and B are each 32 bits, using a 16-bit multiplier:
A * B = AH * BH * 2^32 + (AL * BH + AH * BL) * 2^16 + AL * BL
The two products AL * BH and AH * BL are half word aligned.
On Maxwell, the 32-bit madc.lo.cc and madc.hi.cc are emulated and take 4 and 6 instructions respectively. Thus row oriented multiply takes ~10 * n^2 instructions!
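The decomposition of a 32-bit product into four 16-bit products can be checked with a short Python sketch. The function name `mul32_via_16` is hypothetical, and Python integers stand in for the carry handling that real hardware would need.

```python
def mul32_via_16(a, b):
    """Multiply two 32-bit values using only 16x16-bit products."""
    al, ah = a & 0xFFFF, a >> 16
    bl, bh = b & 0xFFFF, b >> 16
    # Full-word aligned terms: bit offsets 0 and 32.
    full = (ah * bh << 32) + al * bl
    # Half-word aligned terms: both sit at bit offset 16.
    half = al * bh + ah * bl
    return full + (half << 16)
```

The half-word aligned sum can need 33 bits, which is one reason these terms are awkward to fold into 32-bit words directly.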
14
Maxwell – 5.x (cont)
[Diagram: A3 A2 A1 A0 times B1 B0, expanded into 16-bit partial products]
B0 Terms
A0L * B0L   A1L * B0L   A2L * B0L   A3L * B0L
A0H * B0L   A1H * B0L   A2H * B0L   A3H * B0L
A0L * B0H   A1L * B0H   A2L * B0H   A3L * B0H
A0H * B0H   A1H * B0H   A2H * B0H   A3H * B0H
B1 Terms
A0L * B1L   A1L * B1L   A2L * B1L   A3L * B1L
A0H * B1L   A1H * B1L   A2H * B1L   A3H * B1L
A0L * B1H   A1L * B1H   A2L * B1H   A3L * B1H
A0H * B1H   A1H * B1H   A2H * B1H   A3H * B1H
Green terms (L * L and H * H) are full word aligned. Red terms (H * L and L * H) are half word aligned.
15
Maxwell – 5.x (cont)
[Diagram: the same B0 and B1 partial-product terms as on the previous slide]
– Sum the red terms and shift left 16 bits using PRMT
– Add in the green terms
4n^2 + 4n – 4 instructions
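The idea of summing the half-word aligned (red) terms separately, shifting that sum by 16 bits once, and then adding the full-word aligned (green) terms can be sketched in Python. This is an arithmetic illustration only: the function names are hypothetical, Python integers replace the xmad and PRMT sequences, and the instruction counts are simply the formulas quoted on the slides.

```python
def split_halves(words):
    """Split each 32-bit word into (low 16 bits, high 16 bits)."""
    return [(w & 0xFFFF, w >> 16) for w in words]

def mul_green_red(a, b):
    """n-word multiply that keeps full-word and half-word aligned sums apart."""
    ha, hb = split_halves(a), split_halves(b)
    green = 0   # L*L and H*H terms, aligned on 32-bit word boundaries
    red = 0     # H*L and L*H terms, all offset by the same 16 bits
    for i, (al, ah) in enumerate(ha):
        for j, (bl, bh) in enumerate(hb):
            shift = 32 * (i + j)
            green += (al * bl << shift) + (ah * bh << (shift + 32))
            red += (al * bh + ah * bl) << shift
    # One 16-bit shift of the whole red sum, then one final addition.
    return green + (red << 16)

def maxwell_count(n):
    """Instruction count quoted on the slide for this approach."""
    return 4 * n * n + 4 * n - 4

def emulated_row_count(n):
    """Approximate count quoted for the emulated row-oriented multiply."""
    return 10 * n * n
```

For n = 8 (256 bits) the quoted formulas give 284 instructions against roughly 640 for the emulated row-oriented multiply.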
16
Montgomery Reduction on Maxwell
MontgomeryReduce(MP X, MP M) {
  MP U[n]=0;
  REPEAT n TIMES
    … use Montgomery’s technique to zero out low 16-bits of X …
    U=U + (X[0]>>16);
    … use Montgomery’s technique to zero out low 16-bits of U …
    X=(X>>32) + (U[0]>>16);
    U=U>>32;
  return X=X+(U<<16);
}
[Diagram: X = X0 X1 X2 ... Xn-1 Xn ... X2n-1 and U = U0 U1 U2 ... Un-1]
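For readers unfamiliar with "Montgomery's technique", the Python sketch below shows the textbook word-by-word Montgomery reduction with 32-bit words. It is not the slide's variant, which works in 16-bit steps and keeps a separate accumulator U; the function name and parameters are assumptions for illustration. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
W = 32
MASK = (1 << W) - 1

def mont_reduce(x, m, n):
    """Return x * R^-1 mod m, where R = 2^(32*n).

    Requires m odd and 0 <= x < m * R.
    """
    # Precompute -m^-1 mod 2^32 so that each step clears one low word.
    m_inv = (-pow(m, -1, 1 << W)) & MASK
    for _ in range(n):
        # Choose q so that the low word of x + q*m is zero.
        q = ((x & MASK) * m_inv) & MASK
        # The division by 2^32 is exact because the low word is zero.
        x = (x + q * m) >> W
    # After n steps x < 2m, so one conditional subtraction suffices.
    if x >= m:
        x -= m
    return x
```

A modular multiply then works on values in Montgomery form (a * R mod m): reducing the product of two such values yields a * b * R mod m, and one more reduction converts back to a * b mod m.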
17
Results – Mod Mul Performance
18
Results – Mod Square Performance
19
Instructions Per Cycle / Utilization
20
Results – per SM per MHz
21
Thank you! Questions?