BCRYPT ECC-Day 2008 Requirements, Algorithms, Architectures The design space of ECC hardware
2 Contents
  Applications of ECC Hardware
  Existing Solutions
  Design of ECC Hardware
  Details of ECC Hardware
3 Motivation
  ECC Hardware: What for?
    Acceleration
    Power efficiency
    Implementation security
      Side-channel resistance
  Competitors of ECC hardware
    RSA hardware
    Software implementation
      Very fast on PC
      But very slow on 8-bit µC
  Application: Server
    High throughput: > 100 signatures / sec
  Application: Smartcard
    Low latency: 100 ms per signature
    Low die size
  Application: RFID
    Low power consumption
    Low die size
4 ECC Hardware: Application
  Different requirements for ECC applications
  Smartcard
    Acceptable latency
    Implementation security
    One EC curve sufficient
  Server acceleration
    Throughput (not latency)
    Complete offloading
5 ECC Hardware: Server Acceleration
  GF(2^191) Hardware Accelerator
    No GF(2^m) support in processors (x86, PPC, …)
    FPGA (programmable HW) as platform
    Optimized for one curve
    Complete EC operation in HW
  GF(2^191), f_Clk = 66 MHz
    Columns: multiplier radix, k·P [cycles], f_CLK,max [MHz], k·P / sec [ops]
    W = 8-bit: ,61641
    W = 16-bit: ,32770
    W = 32-bit: ,44224
6 ECC Hardware: Smartcards
  Infineon SLE88CFX4000P
    SLE 88 32-bit platform
    1408-bit RSA co-processor
  RSA coprocessor
    Local memory (704 bytes)
    Scalable word width
    Support for ECC: GF(p), GF(2^m)
  Photo © Infineon Technologies
7 ECC Hardware: Smartcards
  NXP Smart MX P5CC072
    Smart MX 8-bit smartcard
    FameXE coprocessor
  FameXE
    RSA, ECC: GF(p), GF(2^m)
    2.5 kB local RAM
    Word width < 4096 bits
  Photo © NXP
8 ECC Hardware: RFID Authentication
  Challenge-response authentication in RFID
  Minimization of power consumption
  Trading performance for power
    Lower clock speed
    Reduced word size
9 Hardware Design: CMOS Circuits
  CMOS: complementary metal-oxide semiconductor
  Silicon circuit: up to 2·10^6 transistors per IC
  Digital hardware: standard-cell circuits
    Flip-flops, full adders, muxes, gates: xor, and, …
10 Hardware Design: Top → Down
  Top-down design methodology
    From specification
    To working silicon
    "First time right"
  Design process
    Refinement of models
    Early estimates of area, power, performance
    Design iterations when constraints are not met
11 Hardware Design: Design Flow
  Abstraction level and tools
    1. System level: defining functionality and constraints
    2. Algorithmic level: high-level model
    3. Architectural level: paper + pencil
    4. Register-transfer level: HDL description
    5. Circuit level: schematic + layout
12 Challenges of ECC Hardware
  EC algorithms (ladder, EC point operation, point representation)
    Define the number of multiplications
    Define the storage requirements
    Define the implementation security
  Multiplication: determines performance
  Storage: determines circuit size
  Control: determines HDL complexity
  Do's
    Fix EC parameters
    Fixed field size
    Separate storage and computation
  Don'ts
    Trading increased storage for lower computation
    Optimization of negligible things
    Inversion
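The slide names the point ladder as the algorithm that fixes the number of field multiplications. As a rough illustration only, a Montgomery-style ladder can be sketched as below; the function name `montgomery_ladder` and the callbacks `add`, `double`, and `zero` are hypothetical, and plain integers under addition stand in for real curve points so that the sketch stays self-contained.

```python
def montgomery_ladder(k, P, add, double, zero):
    """Compute k*P using exactly one add and one double per scalar bit.

    `add`, `double` and `zero` describe the group; they are placeholders
    for real EC point operations (an assumption of this sketch).
    """
    r0, r1 = zero, P  # invariant: r1 - r0 == P
    for i in reversed(range(k.bit_length())):
        if (k >> i) & 1:
            r0, r1 = add(r0, r1), double(r1)
        else:
            r0, r1 = double(r0), add(r0, r1)
    return r0


# Stand-in group: integers under addition, so k*P is ordinary multiplication.
result = montgomery_ladder(13, 7, lambda a, b: a + b, lambda a: 2 * a, 0)
print(result)  # 91
```

Because every bit costs the same pair of operations, the operation count (and hence the multiplication count) depends only on the scalar length, which is one reason ladders are attractive for implementation security.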
13 Approaches to ECC Hardware
  EC-processor
    Computing full point multiplication
    No external interaction necessary
  Co-processor
    Acceleration of finite-field operation
    (Limited local memory)
    External interaction needed
      For point ladder and point operation
  ISE (instruction-set extension)
    Enhancement of existing instruction set
    Acceleration of core operations
    Multiply-accumulate instructions
    Support of polynomial arithmetic ?
14 Algorithms for ECC
  Bit-serial multiplication
    a in full precision; b bitwise
    Faster: digit-serial (w bits of b)
    MulSer(a, b) = a*b
      c = 0
      for i = n-1 to 0 do
        c = 2·c + a·b_i
  Modular reduction
    Without division: NIST reduction
    For trinomials / pentanomials
    For Mersenne-like primes
  Montgomery multiplication
    Combines a*b and mod p
    For arbitrary moduli
    Pre-computation: R = 2^(n+2) mod p, R^2 mod p, p' = (-p)^(-1) mod 2
    MonMul(a, b) = a·b·R^(-1) mod p
      c = 0
      for i = 0 to n+1 do
        q = ((c_0 + a_0·b_i) mod 2)·p'
        c = (c + p·q + a·b_i) / 2
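The two loops on this slide can be tried out directly with Python integers. The following is a minimal sketch, not the hardware datapath: the names `mul_serial` and `mon_mul` are chosen here for illustration, p is assumed odd with p < 2^n, and each Montgomery iteration halves c, which is the step that gives the result its factor R^(-1) with R = 2^(n+2).

```python
def mul_serial(a, b, n):
    """Bit-serial product a*b: scan the n bits of b from MSB to LSB."""
    c = 0
    for i in range(n - 1, -1, -1):
        c = 2 * c + a * ((b >> i) & 1)
    return c


def mon_mul(a, b, p, n):
    """Bit-serial Montgomery product, congruent to a*b*R^-1 mod p, R = 2^(n+2).

    Assumes p odd and p < 2^n, and a, b < 2p. For odd p the constant
    p' = (-p)^-1 mod 2 is simply 1, so q is the LSB of c + a*b_i.
    The result lies in [0, 2p); a final conditional subtraction is left out.
    """
    c = 0
    for i in range(n + 2):
        b_i = (b >> i) & 1
        q = (c + a * b_i) & 1           # makes c + p*q + a*b_i even
        c = (c + p * q + a * b_i) >> 1  # exact division by 2
    return c


# Example with the NIST P-192 prime (used here only as a convenient odd modulus).
p = 2**192 - 2**64 - 1
n = 192
R = 1 << (n + 2)
a, b = 0x1234567890ABCDEF, 0xFEDCBA0987654321
c = mon_mul(a, b, p, n)
print((c * R - a * b) % p == 0)  # True
```

Operands are normally kept in the Montgomery domain (x·R mod p), which is what the pre-computed R^2 mod p is for: `mon_mul(x, R*R % p, p, n)` is congruent to x·R mod p.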
15 Modular Multiplication in HW
  GF(2^191) example
  Digit-serial multiplication
    c(x) = a(x)*m(x) mod f(x)
    a(x): full precision
    m(x): w-bit digits
      –Digit size w = 8, 16, 32
    Alignment of intermediate result
  Interleaved NIST reduction
    Small intermediate results
  Squaring as own operation
    Simple when irreducible polynomial f(x) is fixed
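A software model can make the digit-serial scheme with interleaved reduction concrete. The sketch below is an illustration under stated assumptions: the slides do not name f(x), so the trinomial x^191 + x^9 + 1 is assumed here, and all function names are hypothetical. Polynomials are stored as Python integers, bit i holding the coefficient of x^i.

```python
M = 191  # field degree; reduction polynomial assumed to be x^191 + x^9 + 1


def reduce_trinomial(c):
    """Fold all bits at positions >= 191 back in, using x^191 = x^9 + 1."""
    while c >> M:
        hi = c >> M
        c = (c & ((1 << M) - 1)) ^ hi ^ (hi << 9)
    return c


def clmul(a, d):
    """Carry-less product a(x)*d(x): XOR one shifted copy of a per set bit of d."""
    r, i = 0, 0
    while d:
        if d & 1:
            r ^= a << i
        d >>= 1
        i += 1
    return r


def gf2m_mul_digit_serial(a, m, w=8):
    """c(x) = a(x)*m(x) mod f(x), consuming m in w-bit digits, MSB digit first."""
    c = 0
    n_digits = -(-M // w)  # ceil(191 / w)
    for j in reversed(range(n_digits)):
        digit = (m >> (j * w)) & ((1 << w) - 1)
        # Shift the accumulator by one digit, add the partial product,
        # and reduce immediately so c never grows beyond 191 + w bits.
        c = reduce_trinomial((c << w) ^ clmul(a, digit))
    return c


def gf2m_square(a):
    """Squaring spreads bit i to position 2i, followed by one reduction."""
    s = 0
    for i in range(M):
        if (a >> i) & 1:
            s |= 1 << (2 * i)
    return reduce_trinomial(s)
```

With w = 8 the value handed to the reduction step has at most 199 bits, which is the sense in which interleaving keeps intermediate results small. Squaring needs no partial products at all, which is why it is worth treating as its own operation when f(x) is fixed.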
16 Multiplier in HW
  Partial product generation
    a(x) * m_i
    Simply 191 AND gates
    Amplification of m_i crucial
  Aligning intermediate results
    Simple: fixed shift operation
  Accumulation of partial products
    Array or tree adder
  Modular reduction
    200 bits -> 191 bits
17 GF(p) Multiplier
  Radix-4 multiplier
    A in full precision
    B: 2 bits / cycle
  Montgomery multiplication
    Orup's optimization
  Redundant number representation
    Carry-save (CS)
    More storage
    Shorter critical path
    Red2bin: CSA reuse
  Booth recoding (Benc)
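Processing B two bits per cycle is commonly combined with radix-4 Booth recoding, which turns each bit pair into a signed digit so that only the multiples 0, ±A and ±2A are needed. The following is a generic sketch of that recoding, not the encoder ("Benc") of this design; the function name `booth_radix4` is hypothetical.

```python
def booth_radix4(b, n):
    """Recode an n-bit unsigned b into radix-4 digits in {-2, ..., 2}.

    Digits are returned LSB-first and satisfy sum(d_i * 4**i) == b.
    Digit i is b[2i-1] + b[2i] - 2*b[2i+1], with b[-1] = 0.
    """
    digits = []
    prev = 0  # the implicit bit b[-1]
    for i in range(0, n + 2, 2):
        b0 = (b >> i) & 1
        b1 = (b >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(a, b, n):
    """Multiply by accumulating one recoded partial product per digit."""
    acc = 0
    for i, d in enumerate(booth_radix4(b, n)):
        acc += (d * a) << (2 * i)  # d*a is one of 0, +-a, +-2a
    return acc


print(booth_radix4(0b0110, 4))  # [-2, 2, 0]
```

In hardware the ±2A multiples are a wired shift and the negation is an inversion plus carry-in, so no multiple of 3A has to be pre-computed.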
18 Dual-field Support
  Application: e.g. ECDSA
    ECC over GF(2^m)
    Protocol: GF(p)
      Mul, Add, Inv mod n
      –n … base point order
  Architecture
    ~GF(p) multiplier
    CSA for GF(p)
    XOR for GF(2^m)
      Carries blocked
  GF(p) versus GF(2^m)
    GF(2^m) faster … GF(p) needs reg. C
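The "carries blocked" idea can be shown with a word-level model of a carry-save adder that has a field-select input. This is a hedged sketch with hypothetical names (`dual_field_csa`, `gf2m`), not the cell used in the presented design.

```python
def dual_field_csa(x, y, z, gf2m):
    """Add three operands in carry-save form, or XOR them in GF(2^m) mode.

    GF(p) mode:    returns (s, c) with s + c == x + y + z.
    GF(2^m) mode:  the carry vector is forced to zero, so s alone is the
                   polynomial sum x ^ y ^ z.
    """
    s = x ^ y ^ z                                  # sum bits of the full adders
    majority = (x & y) | (x & z) | (y & z)         # carry bits of the full adders
    c = 0 if gf2m else majority << 1               # block carries for GF(2^m)
    return s, c


s, c = dual_field_csa(13, 7, 9, gf2m=False)
print(s + c)  # 29
s, c = dual_field_csa(13, 7, 9, gf2m=True)
print(s, c)   # 3 0
```

The same XOR network serves both fields; only the carry path is gated, which is why the datapath stays close to a plain GF(p) multiplier.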
19 ECC for RFID
  Problem: very constrained power budget
    P = E/t = I·U = f_clk · C_L · Vdd^2
  Problem analysis: where is power consumed?
    Mostly for storage: clocking of registers
  New idea
    Fewer registers; more combinational logic
    Smaller datapaths
      No computation at full word size
    Adoption of ISE techniques
      –MAC operation
    Simple HDL implementation
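The power relation on this slide can be turned into a small calculator, for example to see which clock frequency fits a given current budget. The numeric values below (switched capacitance, supply voltage) are purely illustrative assumptions and are not taken from the slides; only the ~15 µA budget appears in the results.

```python
def dynamic_power(f_clk_hz, c_load_farad, vdd_volt):
    """P = f_clk * C_L * Vdd^2, in watts."""
    return f_clk_hz * c_load_farad * vdd_volt ** 2


def max_clock_for_current(i_budget_amp, c_load_farad, vdd_volt):
    """From I = P / Vdd = f_clk * C_L * Vdd, solve for f_clk in hertz."""
    return i_budget_amp / (c_load_farad * vdd_volt)


# Assumed example values: 10 pF effective switched capacitance, 1.5 V supply.
C_L = 10e-12
VDD = 1.5
f_max = max_clock_for_current(15e-6, C_L, VDD)
print(f"max clock for a 15 uA budget: {f_max / 1e6:.2f} MHz")
print(f"power at that clock: {dynamic_power(f_max, C_L, VDD) * 1e6:.2f} uW")
```

Since power scales linearly with both f_clk and C_L, lowering the clock or clocking fewer registers (less switched capacitance) are the two levers named on the slide.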
20 Control
  Task of control logic
    Generate control signals
    For – 6 million clock cycles
  Separation of control and datapath
    Registered control signals
    For performance and power efficiency
    Avoiding critical path
  Hierarchical control
    Complex control
  Options
    Hardwired: state machine
    Micro-program: counter + ROM
    Micro-controller: software
21 Results
  Server acceleration
    For GF(2^191)
    Size: 1500 slices on Xilinx FPGA
    > 1000 EC ops / sec
    66 MHz clock
  Smartcard coprocessor
    Dual-field capability
    192-bit ECC: 23k GE, 400k – 700k cycles
    256-bit ECC: 31k GE, 600k – 900k cycles
  ECC for RFID
    163-bit ECC: 12k GE, 400k cycles
    192-bit ECC: 18k GE, 850k cycles
    Storage: 75% of area
    ISE datapath: 75% of power
    Realistic on <130 nm CMOS
    Power constraint ~15 µA
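Cycle counts and clock rates convert to throughput and latency with simple arithmetic. The helper names below are hypothetical, and the 10 MHz smartcard clock is an assumed value for illustration only; the 66 MHz clock, the 1000 ops/sec figure and the 400k cycle count come from the results above.

```python
def ops_per_second(f_clk_hz, cycles_per_op):
    """Throughput when operations run back to back."""
    return f_clk_hz / cycles_per_op


def latency_ms(f_clk_hz, cycles_per_op):
    """Time for a single operation, in milliseconds."""
    return 1000.0 * cycles_per_op / f_clk_hz


# Server: more than 1000 EC ops/sec at 66 MHz means fewer than
# 66e6 / 1000 = 66,000 cycles per point multiplication.
print(ops_per_second(66e6, 66_000))

# Smartcard: 400k cycles at an ASSUMED 10 MHz clock.
print(latency_ms(10e6, 400_000))
```

The same cycle count therefore satisfies very different requirements depending on the clock the target platform can afford.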
22 Conclusions
  Different applications require different ECC hardware
  Fixed parameters (EC params, field) allow more efficient implementation
    Squaring in GF(2^m)
    NIST reduction
  ECC for RFID
    Seems possible