2004. 8. 24. Hyperelliptic Curve Coprocessors On a FPGA HoWon Kim ETRI, Korea.

Presentation transcript:

Hyperelliptic Curve Coprocessors On a FPGA HoWon Kim ETRI, Korea

Contents
- Introduction
- Design Philosophy for Fast HEC coprocessors
  - Parallelism
  - Pipelining
  - Loop unfolding on inversion operation
- Design Methodology
- Arithmetic Unit
- HECC coprocessor Architecture
  - Various HECC types: from high performance to low area
- Performance Results
- Conclusions

Introduction (1/4)

Introduction (2/4)

Group Cardinality
- HEC of genus g over F_q
- The cardinality of J_C(F_q) is bounded by Hasse-Weil: $(\sqrt{q} - 1)^{2g} \le \#J_C(\mathbb{F}_q) \le (\sqrt{q} + 1)^{2g}$
- Major implication: group size ≈ (field size)^g
- Don't choose genus ≥ 4 (5) because of possible attacks [Frey/Rück, Gaudry, Theriault, ...]

Group size vs. Field size
- Field size needed for a group size at the commercial security level:
  - ECC (g=1): field size = 160 bit
  - HECC (g=2): field size = 80 bit
  - HECC (g=3): field size = 56 bit
  - HECC (g=4): field size = 52 bit
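The relation between field size and group size can be checked numerically. The following is a minimal sketch, assuming a binary field with q = 2^field_bits; the function name `jacobian_size_bits` is hypothetical and only evaluates the Hasse-Weil interval in bits.

```python
import math

def jacobian_size_bits(field_bits, genus):
    """Return (low, high) bounds on log2(#J_C(F_q)) from the Hasse-Weil interval.

    Assumes q = 2**field_bits. Floating point is used, so for large fields
    the two bounds may coincide numerically.
    """
    sqrt_q = 2.0 ** (field_bits / 2)
    low = 2 * genus * math.log2(sqrt_q - 1)
    high = 2 * genus * math.log2(sqrt_q + 1)
    return low, high

# Field sizes listed on the slide, per genus
for genus, field_bits in [(1, 160), (2, 80), (3, 56), (4, 52)]:
    low, high = jacobian_size_bits(field_bits, genus)
    print(f"g={genus}, field={field_bits} bit -> group size about {round(low)} bit")
```

For g = 1 and g = 2 the group size comes out at about 160 bits. The g = 3 and g = 4 entries give larger products (about 168 and 208 bits), which appears consistent with the slide's caution about attacks on higher-genus curves.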

Introduction (3/4)

Explicit Formulae of HECC

Polynomial arithmetic (Cantor's algorithm):
  Input: D1 = div(a1, b1), D2 = div(a2, b2)
  Output: D3 = D1 + D2 = div(a3, b3)
  Composition:
    d = gcd(a1, a2, b1+b2+h) = s1*a1 + s2*a2 + s3*(b1+b2+h)
    a'3 = a1*a2 / d^2
    b'3 = [s1*a1*b2 + s2*a2*b1 + s3*(b1*b2+f)] / d mod a'3
  Reduction:
    WHILE deg(a'k) > g, DO
      a'k = (f - h*b'k-1 - (b'k-1)^2) / a'k-1
      b'k = (-h - b'k-1) mod a'k
    END WHILE
    a3 = a'k
    b3 = b'k

Explicit formulae (field arithmetic only):
  t1 = a*e; t2 = b*d; t3 = b*f; t4 = c*e; t5 = a*f; t6 = c*d;
  t7 = sqr(c+f); t8 = sqr(b+e);
  t9 = (a+d)*(t3+t4); t10 = (a+d)*(t5+t6);
  r = (f+c+t1+t2)*(t7+t9) + t10*(t5+t6) + t8*(t3+t4);
  t11 = (b+e)*(c+f);
  inv2 = (t1+t2+c+f)*(a+d)+t8;
  inv1 = inv2*d + t10 + t11;
  inv0 = inv2*e + d*(t10+t11) + t9 + t7;
  t12 = (inv1+inv2)*(k+n+l+o);
  t13 = (l+o)*inv1;
  t14 = (inv0+inv2)*(k+n+m+p);
  t15 = (m+p)*inv0;
  t16 = (inv0+inv1)*(l+o+m+p);
  t17 = (k+n)*inv2;
  rs0 = t15; rs1 = t13+t15+t16; rs2 = t13+t14+t15+t17; rs3 = t12+t13+t17; rs4 = t17;
  t18 = rs3+rs4*d;
  s0s = rs0 + f*t18; s1s = rs1 + rs4*f + e*t18; s2s = rs2 + rs4*e + d*t18;
  w1 = inv(r*s2s); w2 = r*w1; w3 = w1*sqr(s2s); w4 = r*w2; w5 = sqr(w4);
  s0 = w2*s0s; s1 = w2*s1s; s2 = w2*s2s;
  z0 = s0*c; z1 = s1*c+s0*b; z2 = s0*a+s1*b+c; z3 = s1*a+s0+b; z4 = a+s1; z5 = to_GF2E(1L);
  t1 = w4*h2; t2 = w4*h3;
  u3s = d + z4 + s1;
  u2s = d*u3s + e + z3 + s0 + t2 + s1*z4;
  u1s = d*u2s + e*u3s + f + z2 + t1 + s1*(z3+t2) + s0*z4 + w5;
  u0s = d*u1s + e*u2s + f*u3s + z1 + w4*h1 + s1*(z2+t1) + s0*(z3+t2) + w5*(a+f6);
  t1 = u3s+z4;
  v0s = w3*(u0s*t1 + z0) + h0 + m;
  v1s = w3*(u1s*t1 + u0s + z1) + h1 + l;
  v2s = w3*(u2s*t1 + u1s + z2) + h2 + k;
  v3s = w3*(u3s*t1 + u2s + z3) + h3;
  a3 = f6 + u3s + v3s*(v3s+h3);
  b3 = u2s + a3*u3s + f5 + v3s*h2 + v2s*h3;
  c3 = u1s + a3*u2s + b3*u3s + f4 + v2s*(v2s+h2) + v3s*h1 + v1s*h3;
  k3 = v2s + (v3s+h3)*a3 + h2;
  l3 = v1s + (v3s+h3)*b3 + h1;
  m3 = v0s + (v3s+h3)*c3 + h0;

Explicit formulae: ITCC04 [PWP04] (Harley's explicit method)
- Group doubling: 1 inv, 9 mults
- Group addition: 1 inv, 21 mults

Introduction (4/4)

Pros & Cons of the HECC
- Pros
  - Short field size: for genus-2 HECC, the underlying field is half the size of the field used for ECC
  - So there is room to adopt high-speed implementation techniques such as parallelism and loop unfolding
- Cons
  - There are many multiplication stages in the explicit formulae
  - So, when HECC is implemented in hardware, its interconnect network and buffer allocation become complicated

Purpose of this work
- To check its applicability as a high-performance public-key cryptosystem
- To check, from a practical point of view, its applicability in resource-constrained environments such as PDAs and smart cards

Design Philosophy (1/2)

To make the HECC coprocessor faster, we have used the following techniques:
- Parallelism
  - Multiple field operation units execute the explicit formulae as fast as possible
  - The number of multipliers is decided by drawing the data dependency graph (DDG) of the explicit formulae
    - For the genus-2 HECC explicit formulae, two multipliers turn out to be a good choice for implementation
    - The usage rate of the two multipliers is about 90 % for the group addition operation in affine coordinates
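How a data dependency graph leads to a multiplier count can be illustrated with a greedy list scheduler. The sketch below is an assumption-laden toy: the graph covers only the multiplications in the first few lines of the explicit formulae listing (t1 to t6, t9, t10 and the three products in r, here given the hypothetical names `r_a`, `r_b`, `r_c`), every multiplication is assumed to take one time step, and additions and squarings are treated as free. It is not the DDG used by the author.

```python
def list_schedule(deps, n_units):
    """Greedy list scheduling of unit-latency operations on n_units multipliers.

    deps maps each operation to the operations it must wait for.
    Returns (number_of_steps, utilization of the multipliers).
    """
    finished = set()
    pending = list(deps)
    steps = 0
    while pending:
        ready = [op for op in pending if all(d in finished for d in deps[op])]
        if not ready:
            raise ValueError("dependency cycle in graph")
        issued = ready[:n_units]          # issue as many ready ops as we have units
        finished.update(issued)
        pending = [op for op in pending if op not in finished]
        steps += 1
    return steps, len(deps) / (steps * n_units)

# Toy DDG: multiplications only (hypothetical subset of the formulae)
toy_ddg = {
    "t1": [], "t2": [], "t3": [], "t4": [], "t5": [], "t6": [],
    "t9": ["t3", "t4"],
    "t10": ["t5", "t6"],
    "r_a": ["t1", "t2", "t9"],
    "r_b": ["t10", "t5", "t6"],
    "r_c": ["t3", "t4"],
}

for units in (1, 2, 4):
    steps, util = list_schedule(toy_ddg, units)
    print(f"{units} multiplier(s): {steps} steps, utilization {util:.0%}")
```

On this toy graph, going from two to four multipliers saves few steps while utilization drops sharply, which is the kind of trade-off a DDG analysis exposes.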

Design Philosophy (2/2)

- Pipelining
  - Field operations (field addition, field squaring) and data copy operations between buffers are performed in the same clock cycle
  - They can also be overlapped with multiplication and inversion
- Loop Unfolding
  - "Loop unfolding" is the process of unrolling a loop so that several iterations (clock cycles) are merged into a single iteration (one clock cycle)
  - It is applied to the MAIA inversion algorithm to boost performance with a reasonable increase in hardware

Fast Inversion Block (1/2)

- At most 4 loop iterations are executed in one clock cycle
- The MAIA algorithm is implemented with 4 loop iterations unfolded
- Can be realized with simple XOR and rewiring
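The effect of unfolding can be modelled in software. The sketch below follows the modified almost inverse algorithm as it is commonly described for binary fields; the hardware formulation on the slide is not given, so the cycle model (one "cycle" per add step and per group of up to `unfold` divide-by-x steps) is an assumption. To keep the check exhaustive it uses the toy field GF(2^8) with the polynomial 0x11B rather than GF(2^89).

```python
F = 0x11B  # toy field polynomial x^8 + x^4 + x^3 + x + 1 (assumption, not from the slides)

def gf_mul(a, b, f=F):
    """Bit-serial multiplication in GF(2)[x]/(f); used only to check inverses."""
    m = f.bit_length() - 1
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if (a >> m) & 1:
            a ^= f
    return r

def maia_inverse(a, f=F, unfold=1):
    """Almost-inverse style inversion; returns (inverse, modelled cycle count)."""
    if a == 0:
        raise ZeroDivisionError("zero has no inverse")
    b, c, u, v = 1, 0, a, f          # invariants: b*a = u and c*a = v (mod f)
    cycles = 0
    while True:
        while u & 1 == 0:            # u divisible by x
            for _ in range(unfold):  # up to `unfold` divide-by-x steps per cycle
                if u & 1:
                    break
                u >>= 1
                # divide b by x modulo f: shifts and XOR only
                b = (b >> 1) if (b & 1) == 0 else ((b ^ f) >> 1)
            cycles += 1
        if u == 1:
            return b, cycles
        if u.bit_length() < v.bit_length():
            u, v, b, c = v, u, c, b
        u ^= v                       # field addition is XOR
        b ^= c
        cycles += 1

inv1, cycles1 = maia_inverse(0x53, unfold=1)
inv4, cycles4 = maia_inverse(0x53, unfold=4)
print(hex(inv1), cycles1, cycles4)
```

Every step is a shift (rewiring in hardware) or an XOR, which matches the slide's remark that the unfolded loop needs only simple logic. The modelled cycle counts are illustrative and do not reproduce the reported hardware figures.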

Fast Inversion Block (2/2)

Features of the Inversion Block of the HECC coprocessors

Types | Unfolding level | # of Slices | Frequency (MHz) | Clock Cycles | TTC
MAIA with loop unfolding | 1 (original alg.) | | | |
MAIA with loop unfolding | Four loops are unfolded | | | |

- We get two times better performance!

Design Methodology

Architecture design -> VHDL coding -> synthesis & implementation to FPGA

Main points toward a high-performance HECC coprocessor design
- Make the H/W complexity of the interconnect network as small as possible
  - Done by carefully designed arithmetic units, data path, etc.
- Make the number of registers as small as possible
  - Done by careful buffer allocation
- Make efficient AUs
  - By using parallelism, pipelining, loop unfolding techniques, etc.

Arithmetic Unit

AU (Arithmetic Unit)
- Field addition: simple XOR (done on the data path)
- Field squaring: XOR and rewiring (done on the data path)
- Field multiplication: scalable, high-performance multiplication logic is implemented (digit-serial multiplier)
- Field inversion: high-performance inversion logic is implemented (modified almost inverse algorithm with a loop unfolding technique)

[Figure: AU Block Diagram]
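The three cheap field operations listed above can be sketched in software as follows. This is an illustrative model only: the reduction polynomial x^89 + x^38 + 1 for GF(2^89) is an assumption (the slides do not name it), and the function names are hypothetical. The digit-serial multiplier consumes `d` bits of one operand per modelled clock cycle, most significant digit first.

```python
M = 89
F = (1 << 89) | (1 << 38) | 1  # assumed reduction polynomial x^89 + x^38 + 1

def reduce_poly(a, f=F):
    """Reduce a binary polynomial modulo f (XOR of shifted copies of f)."""
    m = f.bit_length() - 1
    while a.bit_length() > m:
        a ^= f << (a.bit_length() - 1 - m)
    return a

def clmul(a, b):
    """Carry-less (polynomial) product of a and b over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def field_add(a, b):
    return a ^ b  # addition in GF(2^m) is a bitwise XOR

def field_sqr(a):
    """Squaring spreads bit i to bit 2i (rewiring), then reduces (XOR)."""
    r, i = 0, 0
    while a:
        if a & 1:
            r |= 1 << (2 * i)
        a >>= 1
        i += 1
    return reduce_poly(r)

def digit_serial_mul(a, b, d=32, m=M):
    """Multiply a*b mod F, processing d bits of b per cycle. Returns (product, cycles)."""
    n_digits = -(-m // d)  # ceil(m / d)
    mask = (1 << d) - 1
    acc = 0
    for i in reversed(range(n_digits)):
        digit = (b >> (i * d)) & mask
        acc = reduce_poly((acc << d) ^ clmul(a, digit))  # Horner step in x^d
    return acc, n_digits

a = 0x1234567890ABCDEF12345
b = (1 << 88) | 0xDEADBEEF
product, cycles = digit_serial_mul(a, b, d=32)
print(hex(product), cycles)
```

With d = 32 an 89-bit operand is consumed in 3 modelled cycles instead of 89 for a bit-serial design, at the cost of a wider partial-product array per cycle.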

HEC Architecture (1/3)

Various HECC Coprocessor Types, from High Performance to Moderate Size
- Type 1: for high performance
  - Parallel execution of the group addition & doubling
  - 2 multipliers & 1 inversion logic for group addition
  - 1 multiplier & 1 inversion logic for group doubling (affine case)
  - Fast execution of the addition & doubling is possible, but at the cost of high hardware complexity

HEC Architecture (2/3)

- Type 2
  - Uses only registers for the RF and multiplexers as the interconnect network
  - Parallel execution of data reads & writes is possible, but it makes the interconnect network highly complex
  - Multipliers and inversion logic are shared between the group operations
  - Technology-independent design, like Type 1 (portable to any FPGA and ASIC)
- Type 3: low hardware complexity
  - Uses memory to reduce hardware complexity
  - Uses buses to reduce the complexity of the interconnect network
  - Incurs more latency when executing the explicit formulae, but reduces hardware complexity

HEC Architecture (3/3)

Architectural characteristics of HECC coprocessors

Types | Logic | Interconnect Network | Scalar mult. Method | Storage for RF (Affine Coord.)
Type 1 | Addition: 2 MUL, 1 INV | Mux | Right to Left (parallel binary) | 13 REGs
Type 1 | Doubling: 1 MUL, 1 INV | Mux | Right to Left (parallel binary) | 10 REGs
Type 2 | 2 MUL, 1 INV (shared) | Mux | Left to Right (binary) | 14 REGs
Type 3 | 2 MUL, 1 INV (shared) | Mux, Bus | Left to Right (binary) | Memory (14 entries)
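The two scalar multiplication methods named in the table differ in whether the addition and the doubling of one iteration depend on each other. The sketch below shows both on a toy group (integers modulo 1009 under addition, a stand-in chosen for checkability, not a Jacobian); the function names are hypothetical.

```python
def right_to_left(k, P, add, dbl, zero):
    """Right-to-left binary method.

    Within one iteration the addition uses the old R and the doubling
    produces the new R, so the two operations are independent and could
    run on separate add/double units in parallel (as in Type 1).
    """
    Q, R = zero, P
    while k:
        if k & 1:
            Q = add(Q, R)
        R = dbl(R)
        k >>= 1
    return Q

def left_to_right(k, P, add, dbl, zero):
    """Left-to-right binary method.

    The addition needs the result of the doubling, so the operations are
    sequential and can share one set of multipliers and inversion logic.
    """
    Q = zero
    for bit in bin(k)[2:]:
        Q = dbl(Q)
        if bit == "1":
            Q = add(Q, P)
    return Q

# Toy group: integers modulo N under addition
N = 1009
add = lambda x, y: (x + y) % N
dbl = lambda x: (2 * x) % N

k, P = 0b101101, 7
print(right_to_left(k, P, add, dbl, 0), left_to_right(k, P, add, dbl, 0))
```

Both methods return the same result; the choice affects only how much hardware can be kept busy at once.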

Performance Results (1/2)

Performance of the HECC coprocessors (scalar mult.)
Target platform: Xilinx FPGA XC2V

Types | Coord. Type | Scalar Mult. | Key size | # Slices | Freq. (MHz) | TTC (ms) | Area x Time
Cla03 | Projective | Parallel Bin, D=1 and D=4 | 166 bits | 22, , | | |
Elias, GF(2^113) | Projective | NAF, D=1 and D=4 | 226 bits | 21, , | | |
Type 1, GF(2^89) | Affine | Parallel Bin | 178 bits | 9, | | |
Type 2, GF(2^89) | Affine | Binary | 178 bits | 7, | | |
Type 3, GF(2^89) | Affine | Binary | 178 bits | 4, | | |
Type 1, GF(2^113) | Affine | Parallel Bin | 226 bits | 11, | | |
Type 2, GF(2^113) | Affine | Binary | 226 bits | 8, | | |
Type 3, GF(2^113) | Affine | Binary | 226 bits | 6, | | |
ECC, Orlando et al. | | | 167 bits | 1, | | |
ECC, Gura et al. | | | 163 bits | 11, | | |

Performance Results (2/2)

Performance of the HECC coprocessors (scalar mult.) on a Xilinx Virtex II FPGA (XC2V4000ff1517-6)

[Figure: two charts, Performance (TTC) and Area-Time Product, normalized to the best AT product]

Conclusions

The high performance of the HECC coprocessor is due to
- A fast inversion algorithm
- The high operating frequency of the multiplier in spite of its large digit size (D=32)
- Reduced interconnect network latency through carefully designed buffer allocation and Arithmetic Units
- Parallel execution of field operations
- Pipelined execution of the field operations and data movement between register files

We can say that the HECC coprocessor can be used in both high-performance and resource-constrained security environments
- Since the performance is about ms with moderate H/W size (Type 1, GF(2^89))
- However, more research is still necessary to surpass ECC