1
A Novel Memory Architecture for Elliptic Curve Cryptography with Parallel Modular Multipliers
Ralf Laue, Sorin A. Huss
Integrated Circuits and Systems Lab, Computer Science Dept., Technische Universität Darmstadt, Germany
{laue|huss}@iss.tu-darmstadt.de
FPT 2006, Bangkok, December 14th, 2006
2
Introduction
The speed-up of today's hardware stems increasingly from parallelization.
Cryptographic implementations should take advantage of this by using parallel versions of the algorithms.
We begin with a survey of parallelization on different abstraction levels of public key cryptography.
Then, we present a novel parallel memory architecture for elliptic curve cryptography in GF(P).
  – Allows the execution time to scale with the number of parallel modular multipliers.
  – Direct memory connection leads to low resource usage.
3
Overview
Parallelization on Different Abstraction Levels
Novel Memory Architecture
  – Design Considerations
  – Proposed Memory Architecture
Experimental Results
  – Number of Parallel Multipliers
  – Prototype Implementation
  – Application to Another EC Arithmetic Algorithm
4
Parallelization on Different Abstraction Levels
In general, parallelization yields greater benefit on lower levels, as less control logic needs to be duplicated.
Parallelization on higher levels allows further speed-up and offers advantages not available on lower levels.
Parallelization methods on different levels do not exclude each other.
[Figure: abstraction levels: Finite Field (Modular Arithmetic); Elliptic Curve Group (Point Addition and Doubling); Discrete Logarithm/Integer Factorization (Point Multiplication/Exponentiation); Cryptographic Scheme; System. Columns: RSA, ECC/HECC]
5
Parallelization on Finite Field Level
Modular multi-word multiplication is the most critical operation. Thus, parallelization on this level is a popular strategy.
The approaches on this level do not exclude each other.
Data paths of full bit-width:
  – Allow for linear time complexity at the cost of a proportional increase in resources (e.g. systolic array).
  – Usual bit-widths: ECC: >100 bit, RSA: >1000 bit
  – Problem: the design must target the maximum bit-width. For smaller word counts resources stay unused; larger word counts may be infeasible.
[Figure: abstraction-level diagram, repeated from the slide "Parallelization on Different Abstraction Levels"]
6
Parallelization on Finite Field Level (cont.)
Pipelining
  – Allows for linear time complexity, too.
  – More flexible than buses of full bit-width, because the number of pipeline stages may be chosen freely.
  – Problem: the calculated bit-width always corresponds to a multiple of the number of stages in words. Resources may still stay unused.
An ECC/RSA combination allows only for pipeline lengths designed for ECC, as those designed for RSA would waste resources and execution time if used with ECC.
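The padding effect described for pipelining can be illustrated with a little arithmetic. The following is a minimal sketch, assuming that the pipeline always processes a whole multiple of its stage count in words; the function name `padded_words`, the 32-bit word size and the stage counts are hypothetical values chosen for illustration and do not come from the talk.

```python
import math

def padded_words(operand_bits: int, word_bits: int, stages: int) -> tuple[int, int]:
    """Return (words processed, words wasted) for a pipelined multiplier.

    Assumption: the pipeline processes a whole multiple of `stages` words,
    so operands are padded up to the next multiple.
    """
    words = math.ceil(operand_bits / word_bits)
    processed = math.ceil(words / stages) * stages
    return processed, processed - words

# Hypothetical configurations: 32-bit words, short (ECC-sized) and
# long (RSA-sized) pipelines.
for bits, stages in [(160, 4), (192, 4), (1024, 32), (160, 32)]:
    processed, wasted = padded_words(bits, 32, stages)
    print(f"{bits}-bit operand, {stages} stages: "
          f"{processed} words processed, {wasted} wasted")
```

Under these assumptions, a pipeline sized for 1024-bit RSA operands pads a 160-bit ECC operand from 5 words to 32, which is the waste the last bullet refers to.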
7
Parallelization on Finite Field Level (cont.)
Karatsuba multiplication:
  – Multiplying two numbers of two words each can be done with three word multiplications.
  – Recursion leads to approx. O(n^1.585).
  – As recursion is difficult in hardware, this is usually used for multiplications in full bit-width (requires fewer resources).
Residue Number Systems:
  – Long numbers are represented relative to a base consisting of multiple smaller moduli that are relatively prime to each other. The Chinese Remainder Theorem ensures a unique mapping.
  – Multiplication, addition and subtraction may be executed in parallel.
  – Can be interpreted as a special case of buses of full bit-width.
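Both techniques on this slide can be sketched in a few lines of software. The sketch below is illustrative only: the 16-bit word size, the function names and the example moduli are assumptions, and a hardware implementation would look quite different. The first function performs one Karatsuba step (three word multiplications for two-word operands); the remaining functions show a residue number system with Chinese Remainder Theorem reconstruction.

```python
from math import prod

WORD_BITS = 16                  # hypothetical word size
MASK = (1 << WORD_BITS) - 1

def karatsuba_two_words(a: int, b: int) -> int:
    """Multiply two 2-word numbers using three word-level multiplications."""
    a_hi, a_lo = a >> WORD_BITS, a & MASK
    b_hi, b_lo = b >> WORD_BITS, b & MASK
    low = a_lo * b_lo                       # multiplication 1
    high = a_hi * b_hi                      # multiplication 2
    cross = (a_lo + a_hi) * (b_lo + b_hi)   # multiplication 3 (operands one bit wider)
    middle = cross - low - high             # equals a_lo*b_hi + a_hi*b_lo
    return (high << (2 * WORD_BITS)) + (middle << WORD_BITS) + low

def to_rns(x: int, moduli: list[int]) -> list[int]:
    """Represent x by its residues modulo pairwise coprime moduli."""
    return [x % m for m in moduli]

def rns_mul(xs: list[int], ys: list[int], moduli: list[int]) -> list[int]:
    """Channel-wise product; the channels are independent of each other."""
    return [(x * y) % m for x, y, m in zip(xs, ys, moduli)]

def from_rns(residues: list[int], moduli: list[int]) -> int:
    """Reconstruct the value via the Chinese Remainder Theorem."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)  # modular inverse, Python 3.8+
    return total % big_m

# Example moduli (pairwise coprime); the result is exact only while the
# true product stays below the product of all moduli.
MODULI = [251, 253, 255, 256]
product = from_rns(rns_mul(to_rns(1234, MODULI), to_rns(5678, MODULI), MODULI), MODULI)
```

Each residue channel in `rns_mul` touches only its own modulus, which is why the slide notes that the operations may be executed in parallel.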
8
Parallelization on Elliptic Group Level
EC doubling and addition may be sped up by using multiple modular units in parallel.
The literature suggests a maximum of two or three modular multipliers (data dependencies limit further improvements).
One instance of the remaining modular arithmetic is sufficient, because it is very fast in comparison.
This abstraction level is well-suited for parallelization in SIMD implementations.
Note that this level does not exist for RSA.
[Figure: abstraction-level diagram, repeated from the slide "Parallelization on Different Abstraction Levels"]
9
Parallelization on Discrete Logarithm/Integer Factorization Level
Both point multiplication and exponentiation allow the parallel use of two instances of the group operation.
  – E.g. with the Montgomery ladder (parallel point doubling/addition for ECC; parallel square/multiply for RSA).
Parallelization on this abstraction level is (in addition to providing further speed-up) often used as a countermeasure against side channel attacks.
[Figure: abstraction-level diagram, repeated from the slide "Parallelization on Different Abstraction Levels"]
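The Montgomery ladder mentioned above can be sketched for modular exponentiation as follows. This is a minimal software illustration, not the hardware design from the talk; the function name is hypothetical. In every iteration one squaring and one multiplication are computed from the same pair of old values, so the two operations do not depend on each other and could run on two units in parallel.

```python
def montgomery_ladder_pow(base: int, exponent: int, modulus: int) -> int:
    """Compute base**exponent mod modulus with the Montgomery ladder."""
    r0, r1 = 1 % modulus, base % modulus    # invariant: r1 == r0 * base (mod modulus)
    for i in reversed(range(exponent.bit_length())):
        if (exponent >> i) & 1:
            # Both results use only the old r0 and r1: parallelizable.
            r0, r1 = (r0 * r1) % modulus, (r1 * r1) % modulus
        else:
            r0, r1 = (r0 * r0) % modulus, (r0 * r1) % modulus
    return r0
```

Every key bit triggers exactly one squaring and one multiplication, which is the regularity that makes the ladder attractive against simple side channel analysis. Note that this Python sketch itself is not constant-time and offers no such protection.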
10
Parallelization on Cryptographic Primitive/System Level
Cryptographic schemes usually use only one point multiplication/exponentiation.
  – We know of no proposal for parallelization on this level.
Possible scenario: flexible coprocessor for RSA/ECC
  – Parallelization on lower abstraction levels is only possible to a certain degree if unused resources are to be avoided.
  – Further parallelization may be done on the level of the cryptographic primitive to increase throughput.
[Figure: abstraction-level diagram, repeated from the slide "Parallelization on Different Abstraction Levels"]
12
Design Goals
ECC implementation for GF(P) on FPGAs.
Ability to support different key lengths.
Resource requirements should be relatively low, thus allowing integration of further functions on the FPGA.
  – E.g. other cryptographic modules or functions unrelated to cryptography.
Thus, minimum execution time was less important than a high utilization of the allocated resources.
13
Design Decisions
No parallelization on finite field level
  – Would lead to unused resources, at least for some key lengths.
Instead, parallelization on elliptic group level
  – Depends on data dependencies, independent of key length.
Modular multiplication is more complex and time-consuming than the remaining modular operations.
  – The chosen architecture consists of multiple modular multipliers working in parallel to each other, with the module for the remaining modular arithmetic working in parallel to the multipliers.
14
Conventional Memory Architecture
The memory architecture must allow all operations to be continuously supplied with data.
A conventional memory architecture consists of one memory and modules with input and output registers.
Registers take up FPGA resources, but contain only redundant data copied from memory.
[Figure: conventional memory architecture with one RAM and the modules Mult 1 ... Mult n, ALU, ..., Square]
15
Novel Memory Architecture
Each modular multiplier is assigned its own memory block via a direct connection.
  – Supports continuous data supply.
  – Low general resource usage, slightly increased memory usage.
The remaining modular arithmetic may access the memory blocks via the second port.
Execution time scales with the number of modular multipliers.
The modular arithmetic copies data between local memory blocks, as each multiplier can access only "its" memory block.
  – Does not hinder scalability, as the remaining modular arithmetic can access all memory blocks in parallel.
16
Novel Memory Architecture (cont.)
Usual memory blocks lack a third port.
The cryptographic primitive and the modular arithmetic share the second memory port.
  – Access from the cryptographic primitive only while no computation is executed.
  – Else: access from the modular arithmetic.
The elliptic curve arithmetic does not access the data directly, but only indirectly via the modular arithmetic.
[Figure: block diagram with ModMult/BRAM pairs, a MUX, and the modules Modular Arithmetic, Elliptic Curve Arithmetic and Cryptographic Primitive; signal labels: data, status, commands, busy]
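The access rules of the proposed architecture can be captured in a toy behavioural model. The sketch below is only an illustration under simplifying assumptions: all class and method names are hypothetical, memory blocks are modelled as dictionaries, and timing is ignored. It shows that each multiplier works on its own block (first port), that the modular arithmetic copies values between blocks (second port), and that the cryptographic primitive may use the second port only while no computation is running.

```python
class Architecture:
    """Toy behavioural model of the memory architecture; names are hypothetical."""

    def __init__(self, num_multipliers: int, modulus: int):
        self.modulus = modulus
        # One private memory block (address -> value) per modular multiplier.
        self.blocks = [dict() for _ in range(num_multipliers)]
        self.busy = False  # True while a computation is executed

    # First port: a multiplier reads and writes only its own block.
    def multiply(self, unit, dst, src_a, src_b):
        block = self.blocks[unit]
        block[dst] = (block[src_a] * block[src_b]) % self.modulus

    # Second port, modular arithmetic side: may reach every block.
    def copy(self, src_unit, src, dst_unit, dst):
        self.blocks[dst_unit][dst] = self.blocks[src_unit][src]

    # Second port, cryptographic primitive side: only while idle (the MUX).
    def host_write(self, unit, addr, value):
        if self.busy:
            raise RuntimeError("second port is in use by the modular arithmetic")
        self.blocks[unit][addr] = value % self.modulus

    def host_read(self, unit, addr):
        if self.busy:
            raise RuntimeError("second port is in use by the modular arithmetic")
        return self.blocks[unit][addr]


arch = Architecture(num_multipliers=2, modulus=97)
arch.host_write(0, "a", 5)
arch.host_write(0, "b", 7)
arch.host_write(1, "c", 11)
arch.host_write(1, "d", 13)

arch.busy = True
arch.multiply(0, "ab", "a", "b")      # independent of the next multiplication
arch.multiply(1, "cd", "c", "d")
arch.copy(1, "cd", 0, "cd")           # modular arithmetic moves data between blocks
arch.multiply(0, "abcd", "ab", "cd")
arch.busy = False

result = arch.host_read(0, "abcd")
```

The only cross-block traffic is the explicit `copy`, which mirrors the statement that multipliers never reach into another multiplier's memory block.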
18
Number of Parallel Multipliers
Determine the number of multipliers to be used (IEEE 1363):
  – ECDbl can utilize only two parallel modular multipliers because of data dependencies.
  – Utilization of modular multipliers for ECAdd (16 multiplications).
The table highlights scalability.
  – (#multipliers * #consecutive multiplications) is the smallest multiple of the number of multipliers greater than or equal to the overall number of multiplications.

#multipliers  multiplier utilization  #consecutive multiplications
2             approx. 98%             8
3             approx. 82%             6
4             approx. 74%             5
19
Data Flow Graph ECAdd, IEEE
Consecutive multiplications are always executed on the same multiplier.
  – No copying between memory blocks.
  – Dark and light grey multiplications are executed on different modular multipliers.
The longest path contains 5 modular multiplications.
  – No speed-up is possible by using more than 4 multipliers.
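The numbers of consecutive multiplications in the table on the slide "Number of Parallel Multipliers" are consistent with a simple lower bound: a schedule can be no shorter than the work divided evenly over the multipliers, and no shorter than the longest dependency path. The following sketch states that bound; treating it as the origin of the table entries is an assumption, and the function name is hypothetical.

```python
import math

def consecutive_multiplications(total: int, longest_path: int, multipliers: int) -> int:
    """Lower bound on the multiplication steps each multiplier runs in sequence."""
    return max(math.ceil(total / multipliers), longest_path)

ECADD_MULTIPLICATIONS = 16   # from the slide "Number of Parallel Multipliers"
ECADD_LONGEST_PATH = 5       # from the data flow graph slide

for n in range(1, 6):
    steps = consecutive_multiplications(ECADD_MULTIPLICATIONS, ECADD_LONGEST_PATH, n)
    print(f"{n} multiplier(s): at least {steps} consecutive multiplications")
```

For 2, 3 and 4 multipliers the bound gives 8, 6 and 5, matching the table, and it stays at 5 for any larger number of multipliers, in line with the statement that more than 4 multipliers bring no speed-up.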
20
Schedule ECAdd, IEEE
Schedule for two modular multipliers.
Mapping to multipliers as shown in the data flow graph on the last slide.
[Figure: schedule chart with the rows ModMultA, ModMultB and ModArith.
 One multiplier row: Quad1, Mult1, Mult2, Mult3, Quad3, Mult12, Mult11, Mult15.
 Other multiplier row: Quad2, Mult5, Mult4, Mult6, Mult9, Quad4, Mult10, Mult14.
 ModArith operations: Sub1, Mult8_Add, Mult7_Add, Sub2, Sub3, Sub4, Sub5, Sub6, Mult13_Add, Sub7, Div1.]
21
Prototype Implementation - Results
Taking its smaller resource usage into account, the execution time of our solution is comparable to previous work.
However, because of their high resource usage, none of the previous designs fulfills the given requirements.
Reference [5] uses GF(2^m) as finite field, thus its execution time is not comparable. Its memory architecture is similar, but it is not easily applicable to GF(P) and does not scale as well.

Design     FlipFlops  LUTs   Slices  BRAMs  Cycle period  Point multiplication
this work  1128       3015   1806    3      9.898 ns      12.716 ms (160 bit)
[16]       6959       11227  n/a     n/a    10.952 ns     14.414 ms (160 bit)
[30]       5735       11416  n/a     35     25 ns         estimated 3 ms (192 bit)
[5]        n/a        n/a    18314   24     100.1 ns      114.71 µs (191 bit, GF(2^m))
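One way to make "taking resource usage into account" concrete is to derive cycle counts and an area-time product from the table. The sketch below does this for the two 160-bit GF(P) designs only. Using LUTs multiplied by execution time as the figure of merit is an assumption made here for illustration, not a metric stated in the talk, and the function names are hypothetical.

```python
def cycle_count(time_ms: float, period_ns: float) -> float:
    """Clock cycles for one point multiplication (1 ms = 1e6 ns)."""
    return time_ms * 1e6 / period_ns

def area_time(luts: int, time_ms: float) -> float:
    """Area-time product in LUT*ms (assumed figure of merit)."""
    return luts * time_ms

# Values copied from the results table (160-bit point multiplication).
designs = {
    "this work": {"luts": 3015, "period_ns": 9.898, "time_ms": 12.716},
    "[16]": {"luts": 11227, "period_ns": 10.952, "time_ms": 14.414},
}

for name, d in designs.items():
    cycles = cycle_count(d["time_ms"], d["period_ns"])
    at = area_time(d["luts"], d["time_ms"])
    print(f"{name}: {cycles:,.0f} cycles, {at:,.0f} LUT*ms")
```

Both designs need a similar number of cycles, so under this assumed metric the difference comes mainly from the LUT count.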
22
Application to Alternative EC Arithmetic
Application of our memory architecture to an algorithm for atomic point doubling and addition.
The algorithm consists of more modular multiplications, thus allowing better utilization of a larger number of modular multipliers.
Our architecture allows the parallel execution of modular additions.
With three multipliers, the atomic algorithm is faster than IEEE point addition with only two parallel multipliers.

Design     #multipliers  multiplier utilization  #consecutive multiplications  #consecutive additions
[21]       2             approx. 90%             10                            8
this work  2             approx. 94%             10                            1
this work  3             approx. 90%             7                             1
this work  4             approx. 89%             5                             5
this work  5             approx. 75%             5                             1
23
Schedule for Atomic ECAdd&Dbl
Schedule for three modular multipliers.
[Figure: schedule chart with the rows ModMultA, ModMultB, ModMultC and ModArith.
 Multiplications: Mult1, Mult22, Mult27, Mult29, Mult20, Mult28, Mult30; Mult21, Mult9, Mult2, Mult23, Mult5, Mult14, Mult31; Mult6, Mult7, Mult8, Mult13, Mult10.
 ModArith operations: Add3, Sub4, Add11, Add12, Add15, Add16, Add17, Sub18, Add19, Sub24, Add25, Add26, Sub32, Add33.]
24
Conclusions
The novel memory architecture for ECC implementations over GF(P) on FPGAs features the following advantages:
  – Low register usage, because of direct memory access.
  – Execution time scales with the number of modular multipliers, as long as data dependencies allow this.
  – The remaining modular arithmetic is executed in parallel to all the modular multiplications.
25
Thank you for your attention. Any questions?
26
References
[5] N. A. Saqib, F. Rodríguez-Henríquez, A. Díaz-Pérez, "A Parallel Architecture for Computing Scalar Multiplication on Hessian Elliptic Curves," in ITCC, vol. 2, 2004, pp. 493-497.
[16] A. B. Örs, L. Batina, B. Preneel, J. Vandewalle, "Hardware Implementation of an Elliptic Curve Processor over GF(p)," in ASAP. IEEE Computer Society, 2003, pp. 433-443.
[21] W. Fischer, C. Giraud, E. W. Knudsen, "Parallel scalar multiplication on general elliptic curves over F_p hedged against Non-Differential Side-Channel Attacks," Jan 2002.
[30] G. Orlando, C. Paar, "A Scalable GF(p) Elliptic Curve Processor Architecture for Programmable Hardware," in CHES, ser. LNCS, vol. 2162, 2001, pp. 348-363.