BCRYPT ECC-Day 2008 Requirements, Algorithms, Architectures The design space of ECC hardware.

Presentation transcript:

BCRYPT ECC-Day 2008 Requirements, Algorithms, Architectures The design space of ECC hardware

2 Contents
  - Applications of ECC Hardware
  - Existing Solutions
  - Design of ECC Hardware
  - Details of ECC Hardware

3 Motivation
ECC Hardware: What for?
  - Acceleration
  - Power efficiency
  - Implementation security
  - Side-channel resistance
Competitors of ECC hardware
  - RSA hardware
  - Software implementation
    - Very fast on PC
    - But very slow on 8-bit µC
Application: Server
  - High throughput
  - > 100 signatures / sec
Application: Smartcard
  - Low latency
  - 100 ms per signature
  - Low die size
Application: RFID
  - Low power consumption
  - Low die size

4 ECC Hardware: Application
Different requirements for ECC applications
  - Smartcard
    - Acceptable latency
    - Implementation security
    - One EC curve sufficient
  - Server acceleration
    - Throughput (not latency)
    - Complete offloading

5 ECC Hardware: Server Acceleration
GF(2^191) hardware accelerator
  - No GF(2^m) support in processors (x86, PPC, …)
  - FPGA (programmable HW) as platform
  - Optimized for one curve
  - Complete EC operation in HW
Table: GF(2^191), f_Clk = 66 MHz
  Columns: Multiplier [radix] | k·P [cycles] | f_CLK,max [MHz] | k·P / sec [ops]
  W = 8-bit:  ,61641
  W = 16-bit: ,32770
  W = 32-bit: ,44224

6 ECC Hardware: Smartcards
Infineon SLE88CFX4000P
SLE 88
  - 32-bit platform
  - 1408-bit RSA coprocessor
RSA coprocessor
  - Local memory (704 bytes)
  - Scalable word width
  - Support for ECC: GF(p), GF(2^m)
Photo © Infineon Technologies

7 ECC Hardware: Smartcards
NXP Smart MX P5CC072
Smart MX
  - 8-bit smartcard
  - FameXE coprocessor
FameXE
  - RSA, ECC: GF(p), GF(2^m)
  - 2.5 kB local RAM
  - Word width < 4096 bits
Photo © NXP

8 ECC Hardware: RFID Authentication
Challenge-response authentication in RFID
  - Minimization of power consumption
  - Trading performance for power
    - Lower clock speed
    - Reduced word size

9 Hardware Design: CMOS Circuits
CMOS
  - Complementary metal-oxide semiconductor
  - Silicon circuit: up to 2*10^6 transistors per IC
  - Digital hardware: standard-cell circuits
  - Flip-flops, full adders, muxes, gates: xor, and, …

10 Hardware Design: Top → Down
Top-down design methodology
  - From specification
  - To working silicon
  - "First time right"
Design process
  - Refinement of models
  - Early estimates of area, power, performance
  - Design iterations when constraints are not met

11 Hardware Design: Design Flow
Abstraction level and tools
  1. System level
     - Defining functionality and constraints
  2. Algorithmic level
     - High-level model
  3. Architectural level
     - Paper + pencil
  4. Register-transfer level
     - HDL description
  5. Circuit level
     - Schematic + layout

12 Challenges of ECC Hardware
EC algorithms (ladder, EC point operation, point representation)
  - Define the number of multiplications
  - Define the storage requirements
  - Define the implementation security
Multiplication
  - Determines performance
Storage
  - Determines circuit size
Control
  - Determines HDL complexity
Do's
  - Fix EC parameters
  - Fixed field size
  - Separate storage and computation
Don'ts
  - Trading increased storage for lower computation
  - Optimization of negligible things
  - Inversion
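The "ladder" named on slide 12 can be illustrated with a minimal Montgomery-ladder sketch. The slides do not show the ladder itself, so everything here is an assumption made for illustration: the function name montgomery_ladder, its callback parameters, and the use of integers modulo 1009 as a stand-in group in place of real elliptic-curve point arithmetic.

```python
def montgomery_ladder(k, point, add, dbl, zero):
    """Compute k * point with one add and one double per scalar bit.

    add, dbl and zero are hypothetical callbacks for the group law;
    a real design would plug in EC point addition and doubling.
    """
    r0, r1 = zero, point          # invariant: r1 - r0 == point
    for i in range(k.bit_length() - 1, -1, -1):
        if (k >> i) & 1:
            r0, r1 = add(r0, r1), dbl(r1)
        else:
            r0, r1 = dbl(r0), add(r0, r1)
    return r0


# Stand-in group (assumption): integers modulo 1009 under addition.
MODULUS = 1009

def toy_add(x, y):
    return (x + y) % MODULUS

def toy_dbl(x):
    return (2 * x) % MODULUS

result = montgomery_ladder(123, 5, toy_add, toy_dbl, 0)
```

Because every scalar bit triggers exactly one addition and one doubling, the choice of ladder fixes the number of field multiplications per bit once the point formulas are chosen, which is the dependency the slide points out.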

13 Approaches to ECC Hardware
EC-processor
  - Computing full point multiplication
  - No external interaction necessary
Co-processor
  - Acceleration of finite-field operation
  - (Limited local memory)
  - External interaction needed
    - For point ladder and point operation
ISE (instruction-set extension)
  - Enhancement of existing instruction set
  - Acceleration of core operations
    - Multiply-accumulate instructions
    - Support of polynomial arithmetic ?

14 Algorithms for ECC
Bit-serial multiplication
  - a in full precision; b bitwise
  - Faster: digit-serial (w bits of b)
Modular reduction
  - Without division: NIST reduction
  - For trinomials / pentanomials
  - For Mersenne-like primes
Montgomery multiplication
  - Combines a*b and mod p
  - For arbitrary moduli
MulSer(a, b) = a*b
    c = 0
    for i = n-1 to 0 do
        c = 2·c + a·b_i
Pre-comp: R = 2^(n+2) mod p, R^2 mod p, p' = (-p)^-1 mod 2
MonMul(a, b) = a·b·R^-1 mod p
    c = 0
    for i = 0 to n+1 do
        q = ((c_0 + a_0·b_i) mod 2)·p'
        c = (c + p·q + a·b_i) / 2
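The two routines on slide 14 can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the presenter's implementation: the function names mul_ser and mon_mul and the explicit bit-length parameter n are assumptions. The modulus is assumed to be odd, so that p' = (-p)^-1 mod 2 equals 1 and can be dropped, and each loop iteration halves the accumulator, which is what yields the factor R^-1 with R = 2^(n+2).

```python
def mul_ser(a, b, n):
    """Bit-serial multiplication: scan the n bits of b from MSB to LSB."""
    c = 0
    for i in range(n - 1, -1, -1):
        b_i = (b >> i) & 1
        c = 2 * c + a * b_i
    return c


def mon_mul(a, b, p, n):
    """Radix-2 Montgomery multiplication for an odd modulus p < 2**n.

    Returns a value congruent to a*b*R^-1 mod p with R = 2**(n+2).
    For inputs below 2*p the result also stays below 2*p, so no final
    subtraction is performed here (the caller may reduce if needed).
    """
    c = 0
    for i in range(n + 2):
        b_i = (b >> i) & 1
        q = (c + a * b_i) & 1          # p' = 1 because p is odd
        c = (c + p * q + a * b_i) >> 1  # exact division by 2
    return c


# Small example: p = 13, n = 4, so R = 2**6 = 64.
small = mon_mul(5, 7, 13, 4)
```

Converting operands into the Montgomery domain uses the precomputed R^2 mod p: mon_mul(x, R^2 mod p, p, n) is congruent to x·R mod p.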

15 Modular Multiplication in HW
GF(2^191) example
  - Digit-serial multiplication
  - c(x) = a(x)*m(x) mod f(x)
  - a(x): full precision
  - m(x): w-bit digits
    - Digit size w = 8, 16, 32
  - Alignment of intermediate result
  - Interleaved NIST reduction
    - Small intermediate results
  - Squaring as own operation
    - Simple when irreducible polynomial f(x) is fixed
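A software model of the digit-serial multiplier with interleaved reduction might look like the sketch below. The slides do not name the irreducible polynomial, so f(x) = x^191 + x^9 + 1 is an assumption chosen for illustration, and all function names are hypothetical. Polynomials over GF(2) are stored as Python integers, with bit i holding the coefficient of x^i.

```python
M = 191
# Assumed reduction polynomial (not stated in the slides): x^191 + x^9 + 1
LOW_MASK = (1 << M) - 1


def clmul(a, b):
    """Carry-less (polynomial) multiplication over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_poly(c):
    """Fold the bits above x^190 back using x^191 = x^9 + 1."""
    while c >> M:
        hi = c >> M
        c = (c & LOW_MASK) ^ (hi << 9) ^ hi
    return c


def digit_serial_mul(a, m, w=8):
    """c(x) = a(x)*m(x) mod f(x), consuming m(x) in w-bit digits, MSB first."""
    num_digits = -(-M // w)  # ceil(M / w)
    c = 0
    for j in range(num_digits - 1, -1, -1):
        digit = (m >> (j * w)) & ((1 << w) - 1)
        # Align the old result, add the partial product, reduce right away.
        # The value before reduction has at most M + w bits.
        c = reduce_poly((c << w) ^ clmul(a, digit))
    return c


def gf2_square(a):
    """Squaring spreads the bits of a(x): coefficient i moves to position 2i."""
    s = 0
    i = 0
    while a >> i:
        if (a >> i) & 1:
            s |= 1 << (2 * i)
        i += 1
    return reduce_poly(s)
```

Reducing after every digit keeps the accumulator at roughly 191 + w bits, which matches the "small intermediate results" bullet. Squaring needs no partial products at all, only a fixed bit spread followed by the same reduction.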

16 Multiplier in HW
Partial product generation
  - a(x) * m_i
  - Simply 191 AND gates
  - Amplification of m_i crucial
Aligning intermediate results
  - Simple: fixed shift operation
Accumulation of PP
  - Array or tree adder
Modular reduction
  - 200 bits -> 191 bits

17 GF(p) Multiplier
Radix-4 multiplier
  - A in full precision
  - B: 2 bits / cycle
Montgomery multiplication
  - Orup's optimization
Redundant number representation
  - Carry-save (CS)
  - More storage
  - Shorter critical path
  - Red2bin: CSA reuse
Booth recoding (Benc)
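The combination of radix-4 Booth recoding and carry-save accumulation can be modelled in a few lines. This sketch covers plain integer multiplication only; the Montgomery reduction and Orup's optimization from the slide are left out, and the function names are assumptions. Negative partial products rely on Python's arbitrary-precision two's-complement semantics for the bitwise operators.

```python
def booth_radix4_digits(b, n):
    """Recode an unsigned n-bit b into digits in {-2, -1, 0, 1, 2}, LSB first.

    Each digit covers two bits of b, so one digit is consumed per cycle.
    """
    digits = []
    prev = 0  # bit b_(2i-1), zero for the first digit
    for i in range(n // 2 + 1):
        b_lo = (b >> (2 * i)) & 1
        b_hi = (b >> (2 * i + 1)) & 1
        digits.append(prev + b_lo - 2 * b_hi)
        prev = b_hi
    return digits


def csa(x, y, z):
    """3:2 carry-save adder: x + y + z == s + c without carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def radix4_multiply(a, b, n):
    """Multiply a by the n-bit value b, keeping the result in carry-save form."""
    s, c = 0, 0
    for d in reversed(booth_radix4_digits(b, n)):
        # Shift the redundant accumulator by two bits, add the partial product.
        s, c = csa(s << 2, c << 2, d * a)
    return s + c  # final conversion from redundant to binary form
```

The only full-width carry propagation happens in the last line, which corresponds to the redundant-to-binary conversion step; inside the loop the critical path is a single full-adder delay.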

18 Dual-field Support
Application: e.g. ECDSA
  - ECC over GF(2^m)
  - Protocol: GF(p)
    - Mul, Add, Inv mod n (n: base point order)
Architecture: similar to the GF(p) multiplier
  - CSA for GF(p)
  - XOR for GF(2^m)
    - Carries blocked
GF(p) versus GF(2^m)
  - GF(2^m) faster …
  - GF(p) needs register C
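The "carries blocked" idea can be shown with a small model of a dual-field adder cell. This is a sketch under assumptions: the function names and the gf2 mode flag are hypothetical, and the multiplier below returns the unreduced product in both modes, with modular reduction omitted.

```python
def csa_dual(x, y, z, gf2):
    """Carry-save adder whose carry output is forced to zero in GF(2^m) mode."""
    s = x ^ y ^ z
    c = 0 if gf2 else ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def dual_field_mul(a, b, n, gf2):
    """Bit-serial multiply of a by the n-bit b on the shared datapath.

    gf2=False: integer product (carry-save accumulation).
    gf2=True:  polynomial product over GF(2) (pure XOR accumulation).
    """
    s, c = 0, 0
    for i in range(n - 1, -1, -1):
        pp = a if (b >> i) & 1 else 0
        s, c = csa_dual(s << 1, c << 1, pp, gf2)
    return s + c


integer_product = dual_field_mul(0b1011, 0b110, 3, gf2=False)
polynomial_product = dual_field_mul(0b1011, 0b110, 3, gf2=True)
```

In GF(2^m) mode the carry word stays zero throughout, which is consistent with the slide's remark that only GF(p) needs the additional register C.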

19 ECC for RFID
Problem: very constrained power budget
  - P = E/t = I*U = f_clk * C_L * Vdd^2
  - Problem analysis: where is power consumed?
  - Mostly for storage: clocking of registers
New idea
  - Fewer registers; more combinational logic
  - Smaller datapaths
  - No computation at full word size
  - Adoption of ISE techniques
    - MAC operation
  - Simple HDL implementation
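The power relation on slide 19 can be turned into a small calculator to see why lowering the clock trades performance for power. The numeric values used below (10 pF switched capacitance, 1.0 V supply, 1 MHz clock) are hypothetical placeholders, not figures from the presentation, and the function names are assumptions.

```python
def dynamic_power_w(f_clk_hz, c_load_f, vdd_v):
    """P = f_clk * C_L * Vdd^2 (switching power only)."""
    return f_clk_hz * c_load_f * vdd_v ** 2


def supply_current_a(f_clk_hz, c_load_f, vdd_v):
    """I = P / U, with U equal to the supply voltage Vdd."""
    return dynamic_power_w(f_clk_hz, c_load_f, vdd_v) / vdd_v


def energy_per_operation_j(cycles, c_load_f, vdd_v):
    """E = P * t with t = cycles / f_clk, so the clock frequency cancels."""
    return cycles * c_load_f * vdd_v ** 2


# Hypothetical operating point, for illustration only.
C_LOAD = 10e-12   # farads
VDD = 1.0         # volts
current = supply_current_a(1e6, C_LOAD, VDD)
```

Halving the clock halves the current draw but leaves the energy per operation unchanged, so a slower clock helps a tag that is limited by instantaneous current, at the cost of a proportionally longer computation.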

20 Control
Task of control logic
  - Generate control signals
  - For – 6 million clock cycles
Separation of control and datapath
  - Registered control signals
    - For performance and power efficiency
    - Avoiding critical path
Hierarchical control
  - Complex control
Options
  - Hardwired
    - State machine
  - Micro-program
    - Counter + ROM
  - Micro-controller
    - Software
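The "counter + ROM" option can be pictured with a toy model in which a program counter steps through a ROM of control words that drive a small datapath. Everything in this sketch is hypothetical: the MicroOp fields, the three operations, the register count and the tiny modulus 97 are chosen for illustration and do not come from the slides.

```python
from typing import List, NamedTuple


class MicroOp(NamedTuple):
    op: str      # datapath operation selected by this control word
    dst: int     # destination register index
    src_a: int   # first source register index
    src_b: int   # second source register index


# Hypothetical micro-program: r2 = r0*r1, r3 = r2 + r0, r4 = r3^2
ROM = [
    MicroOp("mul", 2, 0, 1),
    MicroOp("add", 3, 2, 0),
    MicroOp("mul", 4, 3, 3),
]


def run_microprogram(rom: List[MicroOp], regs: List[int], p: int) -> List[int]:
    """Step a counter through the ROM and apply each control word."""
    regs = list(regs)
    counter = 0
    while counter < len(rom):
        word = rom[counter]          # ROM lookup addressed by the counter
        a, b = regs[word.src_a], regs[word.src_b]
        if word.op == "mul":
            regs[word.dst] = (a * b) % p
        elif word.op == "add":
            regs[word.dst] = (a + b) % p
        else:
            raise ValueError(f"unknown micro-operation: {word.op}")
        counter += 1
    return regs


final_regs = run_microprogram(ROM, [5, 7, 0, 0, 0], 97)
```

Changing the point formulas then means changing ROM contents only, while the datapath stays untouched, which is one way to keep control and datapath separated.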

21 Results
Server Acceleration
  - For GF(2^191)
  - Size: 1500 slices
  - On Xilinx FPGA
  - > 1000 EC ops / sec at 66 MHz clock
Smartcard Coprocessor
  - Dual-field capability
  - 192-bit ECC: 23k GE
    - 400k – 700k cycles
  - 256-bit ECC: 31k GE
    - 600k – 900k cycles
ECC for RFID
  - 163-bit ECC: 12k GE
    - 400k cycles
  - 192-bit ECC: 18k GE
    - 850k cycles
  - Storage: 75% of area
  - ISE datapath: 75% of power
  - Realistic on <130 nm CMOS
  - Power constraint ~15 µA
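The cycle counts above can be related to the application targets from slide 3 with simple arithmetic. The sketch below assumes that one operation must complete within the 100 ms smartcard latency and that the server accelerator processes one operation at a time; both are simplifying assumptions, and the helper names are hypothetical. Integer arithmetic is used to avoid rounding effects.

```python
def min_clock_hz(cycles, latency_ms):
    """Smallest clock frequency (Hz) that finishes `cycles` within `latency_ms`."""
    return -(-cycles * 1000 // latency_ms)  # ceiling division


def max_cycles_per_op(f_clk_hz, ops_per_sec):
    """Cycle budget per operation at a given clock and throughput target."""
    return f_clk_hz // ops_per_sec


# Smartcard, 192-bit ECC: 400k to 700k cycles within 100 ms
smartcard_clock_low = min_clock_hz(400_000, 100)
smartcard_clock_high = min_clock_hz(700_000, 100)

# Server: more than 1000 EC operations per second at 66 MHz
server_cycle_budget = max_cycles_per_op(66_000_000, 1000)
```

Under these assumptions the smartcard coprocessor needs a clock of roughly 4 to 7 MHz, and the FPGA accelerator must finish a point multiplication in at most 66,000 cycles.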

22 Conclusions
Different applications
  - Require different ECC hardware
Fixed parameters (EC params, field)
  - Allow more efficient implementation
    - Squaring in GF(2^m)
    - NIST reduction
ECC for RFID
  - Seems possible