Yuan Ma, Zongbin Liu, Wuqiong Pan, Jiwu Jing

A High-Speed Elliptic Curve Cryptographic Processor for Generic Curves over GF(p)
Yuan Ma, Zongbin Liu, Wuqiong Pan, Jiwu Jing
State Key Laboratory of Information Security, Institute of Information Engineering, CAS, Beijing, China
SAC 2013

Outline
- Introduction
- Processing Method
- Proposed Architecture
- Implementation and Comparison
- Conclusion and Future Work


Motivation
People like to use ECC because it offers:
- Smaller key sizes
- Faster implementation
- Less storage and power consumption

Motivation
Our goal:
- Build the fastest ECC hardware implementation for generic curves over GF(p)
- Keep the design applicable to both FPGAs and ASICs

Hierarchy of Operations
- Scalar (point) multiplication: Double&Add, Window, NAF, Montgomery ladder...
- Point addition and doubling: affine coordinates, projective Jacobian coordinates...
- Modular arithmetic: Montgomery multiplication, fast reduction...

Previous Works for ECC Implementations
For generic curves:
- Guillermin [1]: based on RNS (Residue Number System); the fastest one (0.68 ms for a 256-bit point multiplication (PM) on Stratix II); side-channel analysis (SCA) resistance; large area
- Mentens [2]: based on traditional Montgomery multiplications; 2.35 ms for a 256-bit PM on Virtex-2 Pro; SCA resistance; low frequency
For specific curves:
- Güneysu et al. [3]: NIST primes with fast reduction; faster than [1] (0.49 ms for a 256-bit PM on Virtex-4); limited to FPGAs and restricted to NIST prime fields
[1] Guillermin, N.: A high speed coprocessor for elliptic curve scalar multiplications over Fp. CHES 2010
[2] Mentens, N.: Secure and efficient coprocessor design for cryptographic applications on FPGAs. PhD thesis
[3] Güneysu, et al.: Ultra high performance ECC over NIST primes on commercial FPGAs. CHES 2008

Previous work for Montgomery multiplication
- Radix-2 based
- High-radix based: significantly reduces the number of clock cycles, and is thus faster
  - in approximately 2n clock cycles, such as systolic array architectures
  - in approximately n clock cycles, but at a low frequency, such as [2]
Our primary goal
- Design a new Montgomery multiplication architecture that processes one Montgomery multiplication within approximately n clock cycles and, at the same time, raises the working frequency to a high level
Key techniques
- A parallel array architecture with one-way carry propagation weakens the data dependency in the quotient calculation, so that each quotient can be determined in a single clock cycle
- A high working frequency is achieved by employing quotient pipelining inside the DSP blocks


Pipelined Montgomery Algorithm
Orup, H.: Simplifying quotient determination in high-radix modular multiplication. In: IEEE Symposium on Computer Arithmetic. 1995
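As background for this slide, here is a minimal Python sketch of plain word-serial (radix-2^k) Montgomery multiplication. It is only the baseline that quotient pipelining builds on: it does not reproduce Orup's pipelined variant or the authors' architecture. The function name, the digit width k, and the way n is derived from the modulus are illustrative assumptions.

```python
def montgomery_multiply(a, b, m, k=16):
    """Return a * b * R^(-1) mod m with R = 2^(k*n), for odd m and 0 <= a, b < m.

    Illustrative radix-2^k interleaved Montgomery multiplication (not the
    pipelined algorithm from the slide). n is the number of k-bit words
    needed to hold m. Requires Python 3.8+ for pow(x, -1, r).
    """
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    n = -(-m.bit_length() // k)             # ceil(bit_length / k) words
    radix = 1 << k
    mask = radix - 1
    m_prime = (-pow(m, -1, radix)) % radix  # -m^(-1) mod 2^k, precomputed once
    s = 0
    for i in range(n):
        b_i = (b >> (k * i)) & mask         # i-th radix-2^k digit of b
        s += a * b_i                        # accumulate the partial product
        q = ((s & mask) * m_prime) & mask   # quotient digit: s + q*m becomes divisible by 2^k
        s = (s + q * m) >> k                # exact division by the radix
    return s - m if s >= m else s           # s < 2m here, so one subtraction suffices
```

In this baseline, each quotient digit q depends on the value of s produced in the immediately preceding iteration. That is the data dependency the quotient-pipelining approach is meant to relax.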

DSP Blocks

Processing Method for Pipelined Implementation


Montgomery Multiplier Processing Element (PE)

PE Array

Redundant Number Adder

ECC Processor Architecture

Elliptic Curve Arithmetic
Modular Adder/Subtracter
- Straightforward integer addition/subtraction without modular reduction
  - A + B mod M → A + B ∈ (0, 8M)
  - A - B mod M → A - B + 4M ∈ (0, 8M)
- As an alternative, the modular reduction is performed by the Montgomery multiplication with an expanded R
Point Doubling and Addition
- Jacobian projective coordinates
- Successive multiplications can be performed independently
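One possible reading of these ranges is sketched below in Python. It rests on two assumptions that the slide does not state: that the operands of an addition or subtraction already lie in [0, 4M), and that "expanded R" means R > 64M, chosen here so that the Montgomery product of two values below 8M comes back below 2M. All function names are hypothetical.

```python
def lazy_add(a, b):
    """Integer addition with no reduction; a, b in [0, 4M) gives a result in [0, 8M)."""
    return a + b

def lazy_sub(a, b, m):
    """Subtraction kept positive by adding 4M; a, b in [0, 4M) gives a result in (0, 8M)."""
    return a - b + 4 * m

def mont_mul_expanded(x, y, m, r_bits):
    """Return a value congruent to x * y * 2^(-r_bits) mod m.

    Assumption: 2^r_bits > 64*m. Then x, y < 8m gives
    (x*y + q*m) / R < 64*m*m / R + m < 2m, so the lazily added operands
    are reduced as a side effect of the multiplication.
    """
    r = 1 << r_bits
    m_prime = (-pow(m, -1, r)) % r   # -m^(-1) mod R (Python 3.8+)
    t = x * y
    q = (t * m_prime) % r            # makes t + q*m divisible by R
    return (t + q * m) >> r_bits

# Example modulus (the NIST P-256 prime, used only as a convenient 256-bit odd M).
M = 2**256 - 2**224 + 2**192 + 2**96 - 1
R_BITS = M.bit_length() + 6          # 2^R_BITS > 64*M
```

With these choices, the output of a multiplication (below 2M) can go directly into an addition or subtraction, and the result of that (below 8M) can go directly into the next multiplication.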

SCA Resistance
- Randomized Jacobian coordinates method against DPA
  - executed only once or twice
  - no impact on the area and little decrease in speed
- A window method presented in [4] against SPA
  - $2^{w-1} + tw$ point doublings and $2^{w-1} + t - 1$ point additions, where w is the window size and t is the number of words
  - implemented with block RAMs, which are abundant in modern FPGAs
  - acceptable for our design
[4] Möller, B.: Securing elliptic curve point multiplication against side-channel attacks. In: ISC 2001.
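The slide does not spell out the randomization. The commonly used form of randomized Jacobian coordinates maps (X, Y, Z) to (λ²X, λ³Y, λZ) for a random nonzero λ, which leaves the affine point x = X/Z², y = Y/Z³ unchanged. The Python sketch below illustrates that standard technique under this assumption. The function names and the example prime are illustrative and are not taken from the authors' design.

```python
import secrets

def randomize_jacobian(point, p, lam=None):
    """Re-randomize a Jacobian point (X, Y, Z) over GF(p) as (lam^2 X, lam^3 Y, lam Z)."""
    x, y, z = point
    if lam is None:
        lam = secrets.randbelow(p - 1) + 1   # uniform in [1, p-1], never zero
    lam2 = lam * lam % p
    lam3 = lam2 * lam % p
    return (lam2 * x % p, lam3 * y % p, lam * z % p)

def to_affine(point, p):
    """Convert Jacobian (X, Y, Z) to affine (X/Z^2, Y/Z^3); requires Z != 0."""
    x, y, z = point
    z_inv = pow(z, -1, p)                    # modular inverse (Python 3.8+)
    z_inv2 = z_inv * z_inv % p
    return (x * z_inv2 % p, y * z_inv2 % p * z_inv % p)

# Example field prime (NIST P-256), used only as a convenient GF(p).
P = 2**256 - 2**224 + 2**192 + 2**96 - 1
```

Because only the representation changes, the intermediate values seen by an attacker differ from run to run while the computed point stays the same.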


Hardware Implementation
- Our ECC processor for 256-bit curves, named ECC-256p, is implemented on Xilinx Virtex-4 and Virtex-5 FPGA devices
- The addition width is set to 54
- w is set to 4: one point multiplication requires 264 doublings and 71 additions, at the cost of a pre-computed table with 15 points
- The critical path of ECC-256p is the addition of three 32-bit numbers in the PE
- The final inversion at the end of the scalar multiplication is taken into account
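The doubling and addition counts quoted here agree with the window-method formulas from the SCA slide, 2^(w-1) + tw doublings and 2^(w-1) + t - 1 additions. This holds if t is read as the number of w-bit words in the scalar, an interpretation assumed here that gives t = 64 for a 256-bit scalar and w = 4. A short Python check follows; the function name is hypothetical.

```python
def window_method_counts(scalar_bits, w):
    """Return (point doublings, point additions) for window size w.

    Assumption: t is the number of w-bit words of the scalar,
    i.e. t = ceil(scalar_bits / w).
    """
    t = -(-scalar_bits // w)
    doublings = 2 ** (w - 1) + t * w
    additions = 2 ** (w - 1) + t - 1
    return doublings, additions

print(window_method_counts(256, 4))  # (264, 71)
```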

Results After PAR

Clock cycles
Operation                        ECC-256p
MUL                              35 (average 29)
ADD/SUB                          7
Point Doubling (Jacobian)        232
Point Addition (Jacobian)        484
Inversion (Fermat)               13685
Point Multiplication (Window)    109297

Area and Speed
                     Virtex-4             Virtex-5
Slices               4655                 1725
LUTs                 5740 (4-input)       4177 (6-input)
Flip-flops           4876                 4792
DSP blocks           37                   37
BRAMs                11 (18 Kb)           10 (36 Kb)
Frequency (Delay)    250 MHz (0.44 ms)    291 MHz (0.38 ms)
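The figures on this slide are mutually consistent, which the short Python check below illustrates. It assumes that one point multiplication consists of exactly 264 doublings, 71 additions, and one final inversion (the counts given on the Hardware Implementation slide), with no other overhead. The helper names are hypothetical.

```python
# Clock cycles per operation, taken from the results table.
CYCLES = {"point_doubling": 232, "point_addition": 484, "inversion": 13685}

def point_multiplication_cycles(doublings, additions, cycles=CYCLES):
    """Total cycles: doublings + additions + one final inversion."""
    return (doublings * cycles["point_doubling"]
            + additions * cycles["point_addition"]
            + cycles["inversion"])

def delay_ms(total_cycles, freq_mhz):
    """Convert a cycle count at freq_mhz into milliseconds."""
    return total_cycles / (freq_mhz * 1e3)

total = point_multiplication_cycles(264, 71)
print(total)                            # 109297
print(round(delay_ms(total, 250), 2))   # 0.44 (Virtex-4)
print(round(delay_ms(total, 291), 2))   # 0.38 (Virtex-5)
```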

Performance Comparison
Work        Curve      Device         Size (DSP)                  Frequency   Delay     SCA res.
Our work    256 any    Virtex-5       1725 Slices (37 DSPs)       291 MHz     0.38 ms   Yes
Our work    256 any    Virtex-4       4655 Slices (37 DSPs)       250 MHz     0.44 ms   Yes
[1]         256 any    Stratix II     9177 ALM (96 DSPs)          157 MHz     0.68 ms   Yes
[2]         256 any    Virtex-2 Pro   3529 Slices (36 MULTs)      67 MHz      2.35 ms   Yes
[5]         256 any    Virtex-2 Pro   15755 Slices (256 MULTs)    39.5 MHz    3.84 ms   No
[3]         256 NIST   Virtex-4       1715 Slices (32 DSPs)       487 MHz     0.49 ms   No
[6]         192 NIST   Virtex-E       5708 Slices                 40 MHz      3 ms      No
[5] McIvor, C.J., et al.: Hardware elliptic curve cryptographic processor over GF(p). IEEE Transactions on Circuits and Systems (2006)
[6] Orlando, G., Paar, C.: A scalable GF(p) elliptic curve processor architecture for programmable hardware. CHES 2001


Conclusion and Future Work
- The pipelined Montgomery based scheme is a better choice than the classic Montgomery based and RNS based ones for ECC implementations, in terms of:
  - speed
  - consumed resources
- In future work: transferring the architecture to ASICs by replacing the multiplier cores (i.e., the DSP blocks) with excellent pipelined multiplier IP cores

Thank you!