2004. 8. 24. Hyperelliptic Curve Coprocessors On a FPGA HoWon Kim ETRI, Korea.

Ho Won Kim 2 Contents
- Introduction
- Design Philosophy for Fast HEC coprocessors
  - Parallelism
  - Pipelining
  - Loop unfolding on inversion operation
- Design Methodology
- Arithmetic Unit
- HECC coprocessor Architecture
  - Various HECC types: from high performance to low area
- Performance Results
- Conclusions

Ho Won Kim 3 Introduction (1/4)  

Ho Won Kim 4 Introduction (2/4)
Group Cardinality
- HEC of genus g over F_q
- The cardinality of J_C(F_q) is bounded by the Hasse-Weil theorem
- Major implication: group size ≈ (field size)^g
- Don't choose genus ≥ 4 (5) because of possible attacks [Frey/Rück, Gaudry, Theriault, …]
Group size vs. Field size
- Group size of 2^160 (commercial security level)
  - ECC (g=1): field size = 160 bit
  - HECC (g=2): field size = 80 bit
  - HECC (g=3): field size = 56 bit
  - HECC (g=4): field size = 52 bit
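For reference, the Hasse-Weil bound named on this slide can be written as below. This is the standard textbook statement; the notation $\#J_C(\mathbb{F}_q)$ for the group order is an assumption of this note, not taken from the slide.

```latex
\[
(\sqrt{q} - 1)^{2g} \;\le\; \#J_C(\mathbb{F}_q) \;\le\; (\sqrt{q} + 1)^{2g},
\qquad\text{so}\qquad \#J_C(\mathbb{F}_q) \approx q^{g}.
\]
```

For a target group size of about $2^{160}$ this gives $\log_2 q \approx 160/g$, i.e. an 80-bit field for $g = 2$.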

Ho Won Kim 5 Introduction (3/4)
Explicit Formulae of HECC

Polynomial arithmetic:
Input: $D_1 = \mathrm{div}(a_1, b_1)$, $D_2 = \mathrm{div}(a_2, b_2)$
Output: $D_3 = D_1 + D_2 = \mathrm{div}(a_3, b_3)$
Composition:
  $d = \gcd(a_1, a_2, b_1 + b_2 + h) = s_1 a_1 + s_2 a_2 + s_3 (b_1 + b_2 + h)$
  $a'_3 = a_1 a_2 / d^2$
  $b'_3 = [s_1 a_1 b_2 + s_2 a_2 b_1 + s_3 (b_1 b_2 + f)] / d \bmod a'_3$
Reduction:
  WHILE $\deg(a'_k) > g$ DO
    $a'_k = (f - b'_{k-1} h - b'^{2}_{k-1}) / a'_{k-1}$
    $b'_k = (-h - b'_{k-1}) \bmod a'_k$
  END WHILE
  $a_3 = a'_k$, $b_3 = b'_k$

Explicit formulae (field arithmetic only), Harley's explicit method:
  t1 = a*e; t2 = b*d; t3 = b*f; t4 = c*e; t5 = a*f; t6 = c*d;
  t7 = sqr(c+f); t8 = sqr(b+e);
  t9 = (a+d)*(t3+t4); t10 = (a+d)*(t5+t6);
  r = (f+c+t1+t2)*(t7+t9) + t10*(t5+t6) + t8*(t3+t4);
  t11 = (b+e)*(c+f);
  inv2 = (t1+t2+c+f)*(a+d) + t8;
  inv1 = inv2*d + t10 + t11;
  inv0 = inv2*e + d*(t10+t11) + t9 + t7;
  t12 = (inv1+inv2)*(k+n+l+o);
  t13 = (l+o)*inv1;
  t14 = (inv0+inv2)*(k+n+m+p);
  t15 = (m+p)*inv0;
  t16 = (inv0+inv1)*(l+o+m+p);
  t17 = (k+n)*inv2;
  rs0 = t15; rs1 = t13+t15+t16; rs2 = t13+t14+t15+t17; rs3 = t12+t13+t17; rs4 = t17;
  t18 = rs3 + rs4*d;
  s0s = rs0 + f*t18;
  s1s = rs1 + rs4*f + e*t18;
  s2s = rs2 + rs4*e + d*t18;
  w1 = inv(r*s2s); w2 = r*w1; w3 = w1*sqr(s2s); w4 = r*w2; w5 = sqr(w4);
  s0 = w2*s0s; s1 = w2*s1s; s2 = w2*s2s;
  z0 = s0*c; z1 = s1*c + s0*b; z2 = s0*a + s1*b + c; z3 = s1*a + s0 + b; z4 = a + s1; z5 = to_GF2E(1L);
  t1 = w4*h2; t2 = w4*h3;
  u3s = d + z4 + s1;
  u2s = d*u3s + e + z3 + s0 + t2 + s1*z4;
  u1s = d*u2s + e*u3s + f + z2 + t1 + s1*(z3+t2) + s0*z4 + w5;
  u0s = d*u1s + e*u2s + f*u3s + z1 + w4*h1 + s1*(z2+t1) + s0*(z3+t2) + w5*(a+f6);
  t1 = u3s + z4;
  v0s = w3*(u0s*t1 + z0) + h0 + m;
  v1s = w3*(u1s*t1 + u0s + z1) + h1 + l;
  v2s = w3*(u2s*t1 + u1s + z2) + h2 + k;
  v3s = w3*(u3s*t1 + u2s + z3) + h3;
  a3 = f6 + u3s + v3s*(v3s+h3);
  b3 = u2s + a3*u3s + f5 + v3s*h2 + v2s*h3;
  c3 = u1s + a3*u2s + b3*u3s + f4 + v2s*(v2s+h2) + v3s*h1 + v1s*h3;
  k3 = v2s + (v3s+h3)*a3 + h2;
  l3 = v1s + (v3s+h3)*b3 + h1;
  m3 = v0s + (v3s+h3)*c3 + h0;

Explicit formulae: ITCC04 [PWP04]
- Group doubling: 1 inv, 9 mults
- Group addition: 1 inv, 21 mults

Ho Won Kim 6 Introduction (4/4)
Pros & Cons of the HECC
- Pros
  - Short field size: for genus-2 HECC, the underlying field is half the size of that of ECC
    - So it has room to adopt high-speed implementation techniques such as parallelism and loop unfolding
- Cons
  - There are many multiplication stages in the explicit formulae
    - So, when HECC is implemented in hardware, its interconnect network and buffer allocation become complicated
Purpose of this work
- To check its applicability as a high-performance public-key cryptosystem
- To check its applicability in resource-constrained environments such as PDAs & smart cards, from a practical point of view

Ho Won Kim 7 Design Philosophy (1/2)
To make the HECC coprocessor faster, we used the following techniques:
- Parallelism
  - Multiple field operation units execute the explicit formulae as fast as possible
  - The number of multipliers is decided by drawing the data dependency graph (DDG) of the explicit formulae
    - For the genus-2 HECC explicit formulae, the DDG shows that two multipliers are a good choice
    - The usage rate of the two multipliers is about 90% (group addition operation in affine coord.)
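How a DDG leads to a multiplier count can be sketched with a small list scheduler. The dependency table below covers only the multiplications up to `inv0` in the explicit formulae of slide 5; the names `r1`, `r2`, `r3`, `inv2m`, `inv1m`, `inv0a`, `inv0b` are hypothetical labels for the individual products. The unit latency per multiplication and the treatment of additions and squarings as free are also assumptions. It does not reproduce the 90% figure, which refers to the full group addition.

```python
# Each multiplication maps to the earlier products it needs.
# Additions and squarings are assumed free (done on the data path).
MULT_DEPS = {
    "t1": [], "t2": [], "t3": [], "t4": [], "t5": [], "t6": [],
    "t9": ["t3", "t4"],            # (a+d)*(t3+t4)
    "t10": ["t5", "t6"],           # (a+d)*(t5+t6)
    "t11": [],                     # (b+e)*(c+f)
    "r1": ["t1", "t2", "t9"],      # (f+c+t1+t2)*(t7+t9)
    "r2": ["t10", "t5", "t6"],     # t10*(t5+t6)
    "r3": ["t3", "t4"],            # t8*(t3+t4)
    "inv2m": ["t1", "t2"],         # (t1+t2+c+f)*(a+d)
    "inv1m": ["inv2m"],            # inv2*d
    "inv0a": ["inv2m"],            # inv2*e
    "inv0b": ["t10", "t11"],       # d*(t10+t11)
}


def list_schedule(deps, n_units):
    """Greedy list scheduling: each step issues up to n_units ready products."""
    done, pending, steps = set(), list(deps), []
    while pending:
        ready = [op for op in pending if all(d in done for d in deps[op])]
        if not ready:
            raise ValueError("dependency cycle in the graph")
        issued = ready[:n_units]
        steps.append(issued)
        done.update(issued)
        pending = [op for op in pending if op not in issued]
    return steps


def utilization(deps, n_units):
    """Fraction of multiplier slots that do useful work."""
    steps = list_schedule(deps, n_units)
    return len(deps) / (len(steps) * n_units)


for n in (1, 2, 3):
    print(n, len(list_schedule(MULT_DEPS, n)), round(utilization(MULT_DEPS, n), 2))
```

On this fragment, going from one to two multipliers halves the step count, while a third multiplier leaves some slots idle. This is the kind of trade-off the DDG is used to expose.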

Ho Won Kim 8 Design Philosophy (2/2)
- Pipelining
  - Field operations (field addition, field squaring) and data copy operations between buffers are performed in the same clock cycle
  - They can be overlapped with multiplication and inversion
- Loop Unfolding
  - "Loop unfolding" is the process of unrolling a loop so that several iterations (clock cycles) are executed in a single iteration (one clock cycle)
  - It is applied to the MAIA inversion algorithm to boost performance with a reasonable hardware increase

Ho Won Kim 9 Fast Inversion Block (1/2)
- At most 4 loop iterations are executed in one clock cycle
- MAIA algorithm with 4 loop iterations unfolded
- Can be realized by simple XOR and rewiring
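The effect of unfolding can be sketched in software with a binary, almost-inverse-style inversion over GF(2^m). This is an illustrative variant, not necessarily the exact MAIA used in the coprocessor. The function names, the small AES field polynomial used for checking, and the cycle model (one clock executes up to `level` elementary iterations) are all assumptions.

```python
from math import ceil


def gf2m_mul(a, b, f):
    """Shift-and-add product of a and b in GF(2)[x]/(f); bit i = coeff of x^i."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f                      # reduce as soon as degree reaches m
    return r


def gf2m_inverse(a, f):
    """Return (inverse of a mod f, number of elementary loop iterations).

    f must be irreducible; a must be nonzero with deg(a) < deg(f).
    """
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    b, c, u, v = 1, 0, a, f             # invariants: b*a = u, c*a = v (mod f)
    steps = 0
    while True:
        while (u & 1) == 0:             # x divides u: divide u and b by x
            u >>= 1
            b = (b if (b & 1) == 0 else b ^ f) >> 1   # XOR plus a shift
            steps += 1
        if u == 1:
            return b, steps
        if u.bit_length() < v.bit_length():
            u, v, b, c = v, u, c, b     # keep deg(u) >= deg(v)
        u ^= v                          # field addition is a plain XOR
        b ^= c
        steps += 1


def cycles_with_unfolding(steps, level):
    """Crude model: one clock cycle executes up to `level` iterations."""
    return ceil(steps / level)


AES_POLY = 0x11B                        # x^8 + x^4 + x^3 + x + 1
inv, steps = gf2m_inverse(0x53, AES_POLY)
print(hex(inv), [cycles_with_unfolding(steps, k) for k in (1, 2, 3, 4)])
```

Each iteration uses only XORs and one-bit shifts, which is why several of them can be chained combinationally within one clock. Real hardware will not follow the idealized `ceil(steps / level)` model exactly.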

Ho Won Kim 10 Fast Inversion Block (2/2)
Features of the Inversion Block of the HECC coprocessors (MAIA with loop unfolding)

Unfolding level     # of Slices   Frequency (MHz)   Clock Cycles   TTC (µs)
1 (original alg.)   303           67.5              178            2.63
2                   547           65.0              120            1.85
3                   679           70.0              103            1.47
4                   733           74.6              79             1.30

Four loops are unfolded: we get two times better performance!

Ho Won Kim 11 Design Methodology
Architecture design → VHDL coding → synthesis & implementation on FPGA
Main points toward a high-performance HECC coprocessor design
- Make the H/W complexity of the interconnect network as small as possible
  - Done by carefully designed arithmetic units, data path, etc.
- Make the number of registers as small as possible
  - Done by careful buffer allocation
- Make the AUs efficient
  - By using parallelism, pipelining, loop unfolding techniques, etc.

Ho Won Kim 12 Arithmetic Unit
AU (Arithmetic Unit)
- Field addition: simple XOR (done on the data path)
- Field squaring: XOR and rewiring (done on the data path)
- Field multiplication: scalable, high-performance multiplication logic (digit-serial multiplier)
- Field inversion: high-performance inversion logic (modified almost inverse algorithm with a loop unfolding technique)
[Figure: AU Block Diagram]
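A digit-serial multiplier consumes D bits of one operand per clock cycle. The sketch below models that behaviour in software, most significant digit first. The function names and the one-digit-per-cycle model are assumptions, and the small AES field is used only because its products are easy to check.

```python
def poly_mod(p, f):
    """Reduce polynomial p modulo f over GF(2) (ints as bit vectors)."""
    m = f.bit_length() - 1
    while p.bit_length() > m:
        p ^= f << (p.bit_length() - 1 - m)
    return p


def carryless_mul(a, b):
    """Polynomial product over GF(2) without reduction."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def digit_serial_mul(a, b, f, digit_size):
    """Return (a*b mod f, modeled cycle count), one digit of b per cycle."""
    m = f.bit_length() - 1
    n_digits = -(-m // digit_size)          # ceil(m / digit_size)
    mask = (1 << digit_size) - 1
    acc = 0
    for i in reversed(range(n_digits)):     # most significant digit first
        digit = (b >> (i * digit_size)) & mask
        # One cycle: shift the accumulator by D bits, add a*digit, reduce.
        acc = poly_mod((acc << digit_size) ^ carryless_mul(a, digit), f)
    return acc, n_digits


def field_add(a, b):
    """Field addition in GF(2^m) is a bitwise XOR."""
    return a ^ b


AES_POLY = 0x11B                            # x^8 + x^4 + x^3 + x + 1
for d in (1, 2, 4, 8):
    print(d, digit_serial_mul(0x57, 0x83, AES_POLY, d))
```

Under this model an m-bit multiplication takes ceil(m/D) cycles, so a larger digit size trades area for fewer cycles; with m = 89 and D = 32 the model gives 3 cycles.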

Ho Won Kim 13 HEC Architecture (1/3)
Various HECC Coprocessor Types from High Performance to Moderate Size
- Type 1: for high performance
  - Parallel execution of group addition & doubling
  - 2 multipliers & 1 inversion logic for group addition
  - 1 multiplier & 1 inversion logic for group doubling (affine case)
  - Fast execution of addition & doubling is possible, but it causes high hardware complexity

Ho Won Kim 14 HEC Architecture (2/3)
- Type 2
  - Uses only registers for the RF and multiplexers as the interconnect network
  - Parallel data read & write is possible, but it causes high complexity in the interconnect network
  - Multipliers and inversion logic are shared between group operations
  - Technology-independent design, like Type 1 (portable to any FPGA and ASIC)
- Type 3: low hardware complexity
  - Uses memory to reduce hardware complexity
  - Uses buses to reduce the complexity of the interconnect network
  - Incurs more latency when executing the explicit formulae, but reduces hardware complexity

Ho Won Kim 15 HEC Architecture (3/3)
Architectural characteristics of HECC coprocessors

Type     Logic                     Interconnect network   Scalar mult. method               Storage for RF (affine coord.)
Type 1   Addition: 2 MUL, 1 INV    Mux                    Right to Left (parallel binary)   13 REGs
         Doubling: 1 MUL, 1 INV                                                             10 REGs
Type 2   2 MUL, 1 INV (shared)     Mux                    Left to Right (binary)            14 REGs
Type 3   2 MUL, 1 INV (shared)     Mux, Bus               Left to Right (binary)            Memory (14 entries)
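The two scalar multiplication methods in the table can be sketched over a generic group. The example uses integers modulo `n` under addition as a stand-in for the divisor class group; the function names and the stand-in group are assumptions for illustration only.

```python
def scalar_mult_left_to_right(k, P, add, dbl, identity):
    """Binary method scanning k from the most significant bit.

    Each addition needs the result of the preceding doubling, so the two
    group operations are executed one after the other.
    """
    Q = identity
    for bit in bin(k)[2:]:
        Q = dbl(Q)
        if bit == "1":
            Q = add(Q, P)
    return Q


def scalar_mult_right_to_left(k, P, add, dbl, identity):
    """Binary method scanning k from the least significant bit.

    The doubling chain R = P, 2P, 4P, ... does not depend on the
    accumulator Q, so in hardware a doubling unit and an addition unit
    can work on the same bit position in parallel.
    """
    Q, R = identity, P
    while k:
        if k & 1:
            Q = add(Q, R)
        R = dbl(R)
        k >>= 1
    return Q


# Stand-in group: integers modulo n under addition (not a HEC Jacobian).
n = 1_000_003
add = lambda x, y: (x + y) % n
dbl = lambda x: (2 * x) % n

k, P = 0b1011001, 12345
print(scalar_mult_left_to_right(k, P, add, dbl, 0),
      scalar_mult_right_to_left(k, P, add, dbl, 0),
      (k * P) % n)
```

The independence of the doubling chain in the right-to-left method is what lets a design with separate addition and doubling logic, such as Type 1, keep both units busy. The left-to-right method suits designs where the two operations share the multipliers and inversion logic.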

Ho Won Kim 16 Performance Results (1/2)
Performance of the HECC coprocessors (scalar mult.)
Target platform: Xilinx FPGA XC2V4000 -6

Type                  Coord.       Scalar mult.        Key size   # Slices   Freq. (MHz)   TTC (ms)   Area x Time
Cla03                 Projective   Parallel Bin, D=1   166 bits   22,000     -             10.0       50.74
Cla03                 Projective   Parallel Bin, D=4   166 bits   60,000     -             9.0        124.54
Elias, GF(2^113)      Projective   NAF, D=1            226 bits   21,550     45.6          7.39       36.73
Elias, GF(2^113)      Projective   NAF, D=4            226 bits   25,271     45.3          2.03       11.83
Type 1, GF(2^89)      Affine       Parallel Bin        178 bits   9,950      62.90         0.436      1.00
Type 2, GF(2^89)      Affine       Binary              178 bits   7,096      50.08         0.791      1.30
Type 3, GF(2^89)      Affine       Binary              178 bits   4,995      50.54         1.020      1.18
Type 1, GF(2^113)     Affine       Parallel Bin        226 bits   11,361     59.07         0.722      1.89
Type 2, GF(2^113)     Affine       Binary              226 bits   8,934      42.43         1.459      3.01
Type 3, GF(2^113)     Affine       Binary              226 bits   6,436      43.47         1.767      2.62
ECC, Orlando et al.                                    167 bits   1,501      76.7          0.210      -
ECC, Gura et al.                                       163 bits   11,845     66.4          0.143      0.4
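The Area x Time column can be recomputed from the slice counts and TTC values in the table. The sketch below assumes the metric is slices x TTC, normalized to the Type 1, GF(2^89) design (the row whose entry is 1.00); the dictionary keys and the function name are illustrative.

```python
# (slices, TTC in ms) taken from the performance table above.
ROWS = {
    "Cla03, D=1": (22000, 10.0),
    "Cla03, D=4": (60000, 9.0),
    "Elias, NAF, D=1": (21550, 7.39),
    "Elias, NAF, D=4": (25271, 2.03),
    "Type 1, GF(2^89)": (9950, 0.436),
    "Type 2, GF(2^89)": (7096, 0.791),
    "Type 3, GF(2^89)": (4995, 1.020),
    "Type 1, GF(2^113)": (11361, 0.722),
    "Type 2, GF(2^113)": (8934, 1.459),
    "Type 3, GF(2^113)": (6436, 1.767),
    "ECC, Gura et al.": (11845, 0.143),
}
BASELINE = "Type 1, GF(2^89)"


def normalized_area_time(rows, baseline):
    """Area-time product (slices * TTC) of each row relative to the baseline."""
    base_slices, base_ttc = rows[baseline]
    base = base_slices * base_ttc
    return {name: slices * ttc / base for name, (slices, ttc) in rows.items()}


for name, value in normalized_area_time(ROWS, BASELINE).items():
    print(f"{name:20s} {value:8.2f}")
```

The recomputed values agree with the tabulated column to within rounding, which supports reading that column as a slices x TTC product relative to Type 1, GF(2^89).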

Ho Won Kim 17 Performance Results (2/2)
Performance of the HECC coprocessors (scalar mult.), Xilinx Virtex II FPGA (XC2V4000ff1517-6)
[Figure: Performance (TTC) and Area-Time Product, normalized to the best AT product]

Ho Won Kim 18 Conclusions
The high performance of the HECC coprocessor is due to
- Fast inversion algorithm
- High operating frequency of the multiplier in spite of its large digit size (D=32)
- Reduced interconnect network latency through carefully designed buffer allocation and arithmetic units
- Parallel execution of field operations
- Pipelined execution of field operations and data movement between register files
The HECC coprocessor can be used in high-performance and resource-constrained security environments
- A scalar multiplication takes about 0.436 ms with moderate H/W size (Type 1, GF(2^89))
- However, more research is still necessary to surpass ECC
