Unified Architectures for Efficient and Compact Crypto-Processing

1 Unified Architectures for Efficient and Compact Crypto-Processing
Erkay Savaş Sabancı University 11/15/2018 Erkay Savaş

2 Outline
Research Motivation
Public Key Cryptography
Unified Arithmetic
High-Radix Multiplication
Dual-Radix Multiplication
Support for GF(3^n) Arithmetic
Implementation Results
Future Research

3 Motivation
Compatibility
  support for fast arithmetic in different finite fields and groups
Saving in Area
  improve the {time × area} metric
Algorithm Agility
  NTRU and ECC

4 Public Key Cryptography (PKC)
Each user has a pair of keys:
  Private Key: known only to the owner
  Public Key: known to everyone in the system, with assurance
Encryption: performed with the Public Key of the receiver
Decryption: only the receiver can decrypt the message, using her/his Private Key

5 Public Key Cryptography in Use
RSA, Rabin's scheme
  Integer factorization, square roots modulo a composite number
Discrete Logarithm Based Algorithms
  Diffie-Hellman Key Exchange, ElGamal
Elliptic curve DH Key Exchange, ECDSA
  Discrete logarithm over elliptic curves
IBE
  pairings over elliptic curve points

6 RSA
Most popular PKC
Invented by Rivest, Shamir, and Adleman in 1977 at MIT
Its patent expired in 2000
Based on the integer factorization problem
Each user has a public and private key pair

7 RSA Encryption & Decryption
Encryption is done using the public key: $y \equiv x^e \bmod n$, where $x, y < n$
Decryption is done using the private key: $x \equiv y^d \bmod n$
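
The two congruences on this slide can be tried out with textbook-sized numbers. The sketch below is only an illustration: the primes 61 and 53, the exponent 17, and the function names are assumptions chosen for readability, not values from the talk, and keys this small offer no security. It relies on Python 3.8+ for the modular inverse via `pow(e, -1, phi)`.

```python
# Toy RSA round trip (illustrative parameters only, far too small for real use).
p, q = 61, 53                # assumed small primes
n = p * q                    # public modulus
phi = (p - 1) * (q - 1)      # Euler's totient of n
e = 17                       # public exponent, coprime to phi
d = pow(e, -1, phi)          # private exponent: e * d ≡ 1 (mod phi)


def rsa_encrypt(x: int) -> int:
    """y ≡ x^e mod n, with x < n."""
    return pow(x, e, n)


def rsa_decrypt(y: int) -> int:
    """x ≡ y^d mod n."""
    return pow(y, d, n)


if __name__ == "__main__":
    message = 65
    cipher = rsa_encrypt(message)
    print(cipher, rsa_decrypt(cipher))
```

Both directions are a single modular exponentiation, which is why the cost of modular multiplication dominates RSA.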

8 DL Based Cryptosystems
Fundamental operation: $g^x \bmod p$, where $x, g < p$ and $g$ is primitive
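
To show why modular multiplication ends up as the operation worth optimizing, the following sketch computes $g^x \bmod p$ by left-to-right square-and-multiply and counts the modular multiplications it performs. The function name `mod_exp` and the returned counter are illustrative assumptions, not part of the talk.

```python
def mod_exp(g: int, x: int, p: int):
    """Return (g^x mod p, number of modular multiplications used).

    Left-to-right binary method: one squaring per exponent bit,
    plus one extra multiplication for every bit that is 1.
    """
    result, mults = 1, 0
    for bit in bin(x)[2:]:
        result = (result * result) % p   # squaring step
        mults += 1
        if bit == "1":
            result = (result * g) % p    # multiply step
            mults += 1
    return result, mults


if __name__ == "__main__":
    value, count = mod_exp(5, 6, 23)
    print(value, count)
```

For a k-bit exponent this takes between k and 2k modular multiplications, so a 1024-bit exponentiation needs on the order of a thousand or more of them.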

9 Elliptic Curve Cryptography 1/2
Emerging public key cryptography standard for constrained devices
A 160-bit ECC key is equivalent in cryptographic strength to 1024-bit RSA
313-bit ECC is equivalent to 4096-bit RSA
As algebraic/geometric entities, elliptic curves have been studied extensively for the past 150 years
Rich and deep theory suitable for cryptography
First proposed for cryptographic use in 1985, independently by Neal Koblitz and Victor Miller

10 Elliptic Curve Cryptography 2/2
Dominant fundamental operation
  Multiplication in GF(q), where q = p^k and p is prime
Alternatives
  GF(p): k = 1
  GF(2^k): p = 2
  GF(p^k)
  GF(3^k): p = 3

11 Identity Based Encryption (IBE)
Public key can be any string
  address, name, etc.
No need for certificates
Anonymity achieved
  users can choose any public key without revealing their ID
  a user can easily change it

12 IBE: Bilinear Mapping
$e(xP, yQ) = e(P, Q)^{xy} = e(yP, xQ) = g$
g is in (an extension of) the underlying field
Bilinear mappings over elliptic curves
  Weil pairing
  Tate pairing
Resource consuming
Most efficient bilinear mappings are defined on curves over GF(3^k)

13 An Introduction to Unified Arithmetic
Heavily used types of finite fields
  Prime fields, GF(p)
  Binary extension fields, GF(2^k)
  Ternary extension fields, GF(3^k) (recently, due to IBE schemes)
These finite fields feature dissimilar properties
  Different implementations on specialized hardware

14 Unified Arithmetic
Unified hardware design methodology requires
  A single (unified) datapath
  A single (unified) control
  Insignificant overhead in area
  Insignificant overhead in time complexity (e.g. critical path delay)
  Good {time × area} metric

15 Unified Arithmetic (GF(p) + GF(2^k))
A unified hardware design methodology for both fields is possible since:
  the elements of either field are represented using almost the same data structures in digital systems
  the algorithms for basic arithmetic operations in both fields have structural similarities (i.e. the steps of the algorithms are almost identical)
Hence, unified arithmetic is possible

16 Finite Field Operations in ECC
Addition in GF(p) and GF(2^k)
  Relatively inexpensive in area and time complexity
Multiplicative inversion in GF(p) and GF(2^k)
  Prohibitively expensive in terms of time
  Possible to avoid some of them
Multiplication in GF(p) and GF(2^k)
  Expensive in terms of time and area
  Usually the most important operation
  Our focus

17 Montgomery Multiplication
Very efficient way of doing multiplication in GF(p) and GF(2^k) (now also in GF(3^k))
  Faster (replaces division by shifts)
  Suitable for unified design
  Suitable for scalable design
  Highly parallel
  Suitable for pipelining

18 Montgomery Multiplication
Definition: Given $a, b \in GF(p)$, $\mathrm{MonMul}(a, b) = a \cdot b \cdot R^{-1} \bmod p$, where $R = 2^k \bmod p$ and $k = \lceil \log_2 p \rceil$.
Algorithm
  c := 0
  for i = 0 to k-1
    c := c + a_i · b
    c := (c + c_0 · p) / 2
  if c ≥ p then c := c - p (final subtraction)
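
The bit-serial loop on this slide translates almost line for line into Python. This is a minimal software sketch, assuming an odd modulus p and operands already reduced below p; the function name `mon_mul` is illustrative, and Python's arbitrary-precision integers stand in for the full-precision adder.

```python
def mon_mul(a: int, b: int, p: int) -> int:
    """Radix-2 Montgomery product a * b * 2^(-k) mod p.

    Assumes p is odd, 0 <= a, b < p, and k = bit length of p.
    """
    k = p.bit_length()
    c = 0
    for i in range(k):
        a_i = (a >> i) & 1      # scan one multiplier bit per iteration
        c += a_i * b            # c := c + a_i * b
        c += (c & 1) * p        # add p when c is odd, so c becomes even
        c >>= 1                 # exact division by 2 is a right shift
    if c >= p:                  # c stays below 2p, so one subtraction suffices
        c -= p
    return c


if __name__ == "__main__":
    p = 97
    k = p.bit_length()
    c = mon_mul(20, 30, p)
    # Multiplying back by 2^k recovers the ordinary product mod p.
    print(c, (c << k) % p, (20 * 30) % p)
```

Adding `c_0 · p` never changes the value modulo p, which is what lets the division by 2 replace a trial division by p.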

19 Algorithm for GF(2^k)
Input: $a(x), b(x) \in GF(2^k)$, $p(x)$ and $k$
Output: $c(x) = a(x) \cdot b(x) \cdot x^{-k} \in GF(2^k)$
  c(x) := 0
  for i = 0 to k-1
    c(x) := c(x) ⊕ a_i · b(x)
    c(x) := (c(x) ⊕ c_0 · p(x)) / x
No final subtraction
Note that c/2 and c(x)/x are implemented in an identical way in SW and HW
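
The binary-field variant can be sketched the same way, with polynomials stored as integers whose bit i is the coefficient of x^i, addition replaced by XOR, and division by x done with the same right shift. The encoding, the function names, and the example polynomial x^4 + x + 1 are assumptions made for illustration; `gf2_mul_mod` is a plain shift-and-add reference included only to cross-check the result.

```python
def mon_mul_gf2(a: int, b: int, p: int, k: int) -> int:
    """Montgomery product a(x) * b(x) * x^(-k) mod p(x) over GF(2).

    p encodes an irreducible polynomial of degree k with constant term 1;
    a and b encode polynomials of degree below k.
    """
    c = 0
    for i in range(k):
        if (a >> i) & 1:
            c ^= b              # c(x) := c(x) + a_i * b(x), carry-free
        if c & 1:
            c ^= p              # clear the constant term
        c >>= 1                 # divide by x
    return c                    # degree is already below k: no final subtraction


def gf2_mul_mod(a: int, b: int, p: int, k: int) -> int:
    """Schoolbook reference: a(x) * b(x) mod p(x), used only for checking."""
    c = 0
    for i in reversed(range(k)):
        c <<= 1                 # multiply the accumulator by x
        if (c >> k) & 1:
            c ^= p              # reduce when the degree reaches k
        if (a >> i) & 1:
            c ^= b
    return c


if __name__ == "__main__":
    P, K = 0b10011, 4           # x^4 + x + 1
    R = (1 << K) ^ P            # x^k mod p(x)
    c = mon_mul_gf2(0b0010, 0b0010, P, K)
    print(bin(c), bin(gf2_mul_mod(c, R, P, K)))
```

The loop body differs from the GF(p) version only in using XOR instead of integer addition, which is the structural similarity the unified datapath exploits.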

20 Representation
Addition
  Atomic operation: multiplication is performed as repeated addition
Unified addition
  most efficient when carry-save representation is used for elements of GF(p)
Carry-save representation
  an integer is represented as the sum of two other integers
  x := x_s + x_c (sum and carry parts, respectively)
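
A carry-save accumulator can be modeled in a few lines. The sketch below is a software analogy under assumed names (`csa`, `accumulate`): it keeps a running total as a (sum, carry) pair and only resolves the carries once at the end, which is the property that keeps the hardware critical path short.

```python
def csa(x_s: int, x_c: int, y: int):
    """3:2 carry-save adder: fold y into the pair (x_s, x_c).

    Each bit position is handled independently; no carry ripples
    across the word. The returned pair satisfies s + c == x_s + x_c + y.
    """
    s = x_s ^ x_c ^ y
    c = ((x_s & x_c) | (x_s & y) | (x_c & y)) << 1
    return s, c


def accumulate(values):
    """Add many integers while staying in carry-save form until the end."""
    s, c = 0, 0
    for v in values:
        s, c = csa(s, c, v)
    return s + c                # single carry-propagating addition


if __name__ == "__main__":
    print(accumulate([13, 7, 255, 1024]), 13 + 7 + 255 + 1024)
```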

21 Scalability
The original Montgomery multiplication algorithm performs full-precision integer additions
  Not scalable
Instead, long integers are divided into words
Additions of words are handled separately on word adders
Choice of word length depends on the precision, area and speed requirements
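
The word-based idea can be sketched in software by splitting the operands into w-bit words and letting a w-bit adder with a carry chain process one word at a time. This is a sequential model under assumptions (the function name `mon_mul_words`, the default w = 8, an odd modulus, operands below p); it does not model the pipelined processing units of the actual architecture.

```python
def mon_mul_words(a: int, b: int, p: int, w: int = 8) -> int:
    """Word-serial radix-2 Montgomery product a * b * 2^(-k) mod p.

    Assumes p is odd and 0 <= a, b < p; k is the bit length of p.
    """
    k = p.bit_length()
    e = -(-(k + 1) // w)                    # words needed for values below 2p
    mask = (1 << w) - 1
    b_words = [(b >> (w * j)) & mask for j in range(e)]
    p_words = [(p >> (w * j)) & mask for j in range(e)]
    c = [0] * e
    for i in range(k):
        a_i = (a >> i) & 1
        q = (c[0] + a_i * b_words[0]) & 1   # parity decides whether p is added
        carry = 0
        for j in range(e):                  # word-by-word addition with carry
            t = c[j] + a_i * b_words[j] + q * p_words[j] + carry
            c[j] = t & mask
            carry = t >> w
        for j in range(e):                  # one-bit right shift across words
            upper = c[j + 1] if j + 1 < e else carry
            c[j] = (c[j] >> 1) | ((upper & 1) << (w - 1))
    result = sum(word << (w * j) for j, word in enumerate(c))
    return result - p if result >= p else result


if __name__ == "__main__":
    p = 65521
    c = mon_mul_words(12345, 54321, p, w=8)
    print((c << p.bit_length()) % p, (12345 * 54321) % p)
```

Changing `w` changes only the number of words per operand, not the algorithm, which is the sense in which the design scales with precision.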

22 Word-Based Multiplication
[Figure: processing units PU_i (input a_i) and PU_{i+1} (input a_{i+1}) operating on the words b^(j), p^(j), c^(j) and b^(j+1), p^(j+1), c^(j+1); bits c^(j)_0, c^(j)_1, ..., c^(j)_{w-1} shown]

23 Dependency Graph

24 Processing Unit (PU) with w=2
[Figure: PU datapath with two dual-field adders, signals C_1^(j) and C_0^(j), and control input FSEL]

25 Dual-Field Adder (DFA) 1/2
Almost identical to a full adder (FA)
Difference: it has an additional (control) input, FSEL, which suppresses the carry output of the adder when set to logic 0
Namely, when FSEL = 0 the adder operates in GF(2^k) mode; otherwise it behaves as a regular FA

26 DFA 2/2
[Figure: dual-field adder circuit with signals A, B, C, FSEL, S and Cout]
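
The behaviour of the dual-field adder described on the two DFA slides can be captured as a one-bit truth function. This is a behavioural sketch with assumed names, not a gate-level description of the circuit in the figure.

```python
def dual_field_adder(a: int, b: int, c_in: int, fsel: int):
    """One-bit dual-field adder.

    fsel = 1: regular full adder (GF(p) mode).
    fsel = 0: carry output suppressed, so the cell adds in GF(2).
    """
    s = a ^ b ^ c_in
    c_out = fsel & ((a & b) | (a & c_in) | (b & c_in))
    return s, c_out


if __name__ == "__main__":
    for fsel in (1, 0):
        for a in (0, 1):
            for b in (0, 1):
                for c_in in (0, 1):
                    print(fsel, a, b, c_in, dual_field_adder(a, b, c_in, fsel))
```

The only addition over a full adder is the AND with FSEL on the carry path, which is consistent with the small area overhead reported for the unified design.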

27 Pipeline Organization with two PUs
[Figure: pipeline with memories RAM-a, RAM-b and RAM-p, shift registers SR-a and SR-C, and processing units PU-1 and PU-2]
s: the number of PUs

28 Total Computation Time (in clock cycles)
w: word size, k: precision, e := k/w, s: the number of PUs

29 Example Execution Times
Example: k = 1024, w = 32
  s = 17 → T = 2105
  s = 15 → T = 2305
  s = 10 → T = 3415
  s = 1 → T = 33792
Example: k = 2048, w = 32
  s = 33 → T = 4221
  s = 30 → T = 4543
  s = 10 → T = 13343
  s = 1 → T =

30 Comparison to the single-field (GF(p)) design
                         GF(p)-only   Unified   Overhead
Cell Area                47.2w        48.5w     2.75%
Cell Propagation Time    11 ns        11 ns     0%
w: word size; 1.2 μm CMOS technology

31 Design Alternatives
Higher Radix
Original design is radix 2
  Namely, the multiplier bits are scanned one bit per clock cycle
Possible to scan two or more bits of the multiplier a
  Radix-4: two bits
  Radix-8: three bits
More complex design: lower clock frequency, larger area
Fewer clock cycles → faster execution of multiplication
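
Scanning r multiplier bits per step can be sketched as a radix-2^r Montgomery loop. The version below is a generic textbook formulation under assumptions (the name `mon_mul_radix`, an odd modulus, operands below p, Python 3.8+ for the modular inverse); it is not claimed to match the internal structure of the high-radix hardware design.

```python
def mon_mul_radix(a: int, b: int, p: int, r: int = 2) -> int:
    """Radix-2^r Montgomery product a * b * 2^(-r*n) mod p.

    n = ceil(k / r) iterations instead of k, where k is the bit length of p.
    Assumes p is odd and 0 <= a, b < p.
    """
    base = 1 << r
    n = -(-p.bit_length() // r)              # number of digit iterations
    p_inv_neg = (-pow(p, -1, base)) % base   # -p^(-1) mod 2^r
    c = 0
    for i in range(n):
        a_i = (a >> (r * i)) & (base - 1)    # next r bits of the multiplier
        c += a_i * b
        q = (c * p_inv_neg) & (base - 1)     # digit that zeroes the low r bits
        c = (c + q * p) >> r                 # exact division by 2^r
    return c - p if c >= p else c


if __name__ == "__main__":
    p = 65521
    for r in (1, 2, 3):
        n = -(-p.bit_length() // r)
        c = mon_mul_radix(12345, 54321, p, r)
        print(r, n, (c << (r * n)) % p == (12345 * 54321) % p)
```

The iteration count drops from k to about k/r, at the price of multiplying by r-bit digits in every step, which mirrors the trade-off between cycle count and design complexity listed on the slide.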

32 Comparison
Higher radix vs. single radix
Metric: area × time
For small total area (i.e. < 10000 equivalent NAND gates) the performance of radix-2 and radix-8 is comparable
The radix-8 multiplier outperforms the radix-2 multiplier by more than 3 times when the total area is around NAND gates

33 Dual-Radix Multiplier
Radix-2 for GF(p) and radix-4 for GF(2^k)
[Figure: datapath with MUX-1, MUX-2, selection logic, and a 3x2 dual-field adder]

34 Dual-Radix Multiplier
Three multipliers
  A1: GF(p)-only multiplier
  A2: single-radix unified multiplier (with precomputation)
  A3: dual-radix multiplier
Performance (area × time)
  A3 performs slightly worse than A1 and A2 (by 7% to 19%) in GF(p) mode
  A3 outperforms A2 by 38% to 46% in GF(2^k) mode

35 Unified Arithmetic?
Unified multiplier
  carry-save adders are used in the multiplier
With carry-save representation it is not easy to perform other arithmetic operations such as subtraction and comparison (essential in inversion)

36 New Redundant Representation
Recall: carry-save representation, $X = x_s + x_c$
New redundant representation
  Redundant signed digit (RSD) representation, $X = x_p - x_n$
Subtraction is equivalent to addition
  $X - Y = (x_p - x_n) - (y_p - y_n) = (x_p - x_n) + (y_n - y_p)$
Comparison is relatively easy
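
The identity on this slide says that negating a value in the positive/negative pair form only swaps its two components, so a subtractor is just an adder with swapped inputs. The sketch below illustrates that with integer pairs; the function names are assumptions, and the component-wise addition used here does not model the carry-free digit-level adder a hardware RSD design would use.

```python
def rsd_value(x):
    """Integer represented by the pair (x_p, x_n), namely x_p - x_n."""
    x_p, x_n = x
    return x_p - x_n


def rsd_neg(x):
    """Negation is a swap of the positive and negative components."""
    x_p, x_n = x
    return (x_n, x_p)


def rsd_add(x, y):
    """Component-wise addition: (x_p + y_p) - (x_n + y_n)."""
    return (x[0] + y[0], x[1] + y[1])


def rsd_sub(x, y):
    """X - Y = (x_p - x_n) + (y_n - y_p): add the swapped operand."""
    return rsd_add(x, rsd_neg(y))


if __name__ == "__main__":
    X, Y = (9, 2), (5, 1)        # represent 7 and 4
    print(rsd_value(rsd_sub(X, Y)))
```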

37 RSD
All previous multipliers require a reverse transformation to non-redundant form after each multiplication
  There are thousands of multiplications in ECC
With RSD, all computation can be done in RSD form without any reverse transformation
  a single transformation is necessary only if the result is needed in non-redundant form

38 Support for GF(3^n) Arithmetic
RSD lends itself to a unified arithmetic architecture that efficiently supports GF(3^n) arithmetic
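
One way to see the connection: a signed digit takes values in {-1, 0, 1}, which is also a complete set of representatives for GF(3). The sketch below adds GF(3^n) elements coefficient by coefficient on that digit set. It is an illustration under assumptions (list-of-digits encoding, invented function names) and says nothing about how the actual architecture encodes or wires these digits.

```python
def gf3_add_digit(a: int, b: int) -> int:
    """Add two GF(3) elements represented as signed digits in {-1, 0, 1}."""
    s = a + b
    if s == 2:                   # 2 ≡ -1 (mod 3)
        return -1
    if s == -2:                  # -2 ≡ 1 (mod 3)
        return 1
    return s


def gf3_poly_add(a, b):
    """Add two GF(3^n) elements given as equal-length coefficient lists.

    Every coefficient is handled independently, so no carry propagates.
    """
    return [gf3_add_digit(x, y) for x, y in zip(a, b)]


if __name__ == "__main__":
    print(gf3_poly_add([1, -1, 0, 1], [1, -1, 1, -1]))
```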

39 Analysis
A1: GF(p)-only architecture
A2: GF(2^k)-only architecture
A3: GF(3^n)-only architecture
A4: Unified architecture (GF(p) + GF(2^k))
A5: Unified architecture (GF(p) + GF(2^k) + GF(3^n))
A1 + A2: Hypothetical architecture that has separate datapaths for GF(p) and GF(2^k)

40 Analysis
Metric: area × time
A4 over A1 + A2: 7.94%
A5 over A1 + A2 + A3: 33.54%
A5 over A4 + A3: %

41 Implementation Results
2.38 GHz, 0.13 μm CMOS
# of PUs   160-bit ECC (μs)   1024-bit RSA (ms)   Tate pairing GF(3^97)
4          315                21.0                508
8          210                10.5                334
16         189                5.25
32                            2.12
4 PUs: ~11,000 NAND gates; 8 PUs: ~15,000 NAND gates

42 Research Directions
Embed the unified architectures into common general-purpose processors
Unified inversion using RSD
Unified architectures for other PKC

43 Ending…
Questions
Contact: Erkay Savaş, erkays@sabanciuniv.edu

