Implementation of Finite Field Inversion

Presentation transcript:

Implementation of Finite Field Inversion Debdeep Mukhopadhyay Chester Rebeiro Dept. of Computer Science and Engineering Indian Institute of Technology Kharagpur INDIA

Finite Field Inverse 23-27 May 2011 Anurag Labs, DRD0

Itoh-Tsujii Method for Binary Fields

The Steps
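A minimal sketch of the standard Itoh-Tsujii computation, assuming the identity a^(-1) = (a^(2^(m-1) - 1))^2 in GF(2^m) and a plain binary (double-and-add) addition chain for m - 1. The addition chain used on the slides may differ. The helper names gf2m_mul, gf2m_pow2k and itoh_tsujii_inverse are hypothetical, and field elements are assumed to be stored as Python integers whose bit i is the coefficient of x^i.

```python
def gf2m_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m). poly is the irreducible polynomial
    as a bit mask that includes the x^m term."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1                  # multiply a by x
        if (a >> m) & 1:
            a ^= poly            # reduce when degree reaches m
    return result


def gf2m_pow2k(a, k, poly, m):
    """Return a^(2^k) by k repeated squarings."""
    for _ in range(k):
        a = gf2m_mul(a, a, poly, m)
    return a


def itoh_tsujii_inverse(a, poly, m):
    """Inverse of a nonzero element of GF(2^m), m >= 2 (assumed).

    Uses beta_k = a^(2^k - 1) with
        beta_(2k)  = beta_k^(2^k) * beta_k
        beta_(k+1) = beta_k^2 * a
    and finishes with a^(-1) = beta_(m-1)^2.
    """
    if a == 0:
        raise ZeroDivisionError("0 has no inverse")
    beta, k = a, 1                       # beta_1 = a
    for bit in bin(m - 1)[3:]:           # bits of m-1 after the leading 1
        beta = gf2m_mul(gf2m_pow2k(beta, k, poly, m), beta, poly, m)
        k *= 2
        if bit == "1":
            beta = gf2m_mul(gf2m_pow2k(beta, 1, poly, m), a, poly, m)
            k += 1
    return gf2m_mul(beta, beta, poly, m)


# Usage on GF(2^4) with x^4 + x + 1: every nonzero element times its inverse is 1.
POLY4 = 0b10011
assert all(gf2m_mul(a, itoh_tsujii_inverse(a, POLY4, 4), POLY4, 4) == 1
           for a in range(1, 16))
```

The binary chain is chosen here only because it is short to write; a shorter addition chain reduces the number of multiplications without changing the result.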

How do we do a Squaring
Consider (again) the field GF(2^4), with irreducible polynomial x^4 + x + 1. What is (x^3 + x^2 + 1)^2 in this field?
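Worked by hand: in characteristic 2 the cross terms vanish, so the square is x^6 + x^4 + 1. Reducing with x^4 = x + 1 gives x^6 = x^3 + x^2, and the sum x^3 + x^2 + x + 1 + 1 collapses to x^3 + x^2 + x. The short Python sketch below checks this, assuming elements are stored as integers with bit i holding the coefficient of x^i. The function name gf2_square is hypothetical.

```python
def gf2_square(a, poly, m):
    """Square a in GF(2^m); poly is the irreducible polynomial bit mask."""
    # Squaring over GF(2) spreads the bits: coefficient i moves to position 2i.
    sq = 0
    for i in range(m):
        if (a >> i) & 1:
            sq |= 1 << (2 * i)
    # Reduce from the highest possible degree (2m - 2) down to degree m.
    for deg in range(2 * m - 2, m - 1, -1):
        if (sq >> deg) & 1:
            sq ^= poly << (deg - m)
    return sq


POLY = 0b10011          # x^4 + x + 1
a = 0b1101              # x^3 + x^2 + 1
print(bin(gf2_square(a, POLY, 4)))   # 0b1110, i.e. x^3 + x^2 + x
```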

Squaring
Squaring can be represented as a matrix multiplication T·a.

Quad Operation
A quad operation can be done by two squaring operations. It can be written in the form T^2·a.
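As an illustrative sketch of the matrix view, assuming column j of T holds the coefficients of x^(2j) reduced modulo the field polynomial, the Python below builds T for GF(2^4) with x^4 + x + 1, forms T^2 over GF(2), and checks that T^2·a equals two successive squarings. The names reduce_poly, squaring_matrix, mat_mul and mat_vec are hypothetical.

```python
def reduce_poly(value, poly, m):
    """Reduce a polynomial (bit mask) modulo poly, which has degree m."""
    while value.bit_length() > m:
        value ^= poly << (value.bit_length() - 1 - m)
    return value


def squaring_matrix(poly, m):
    """T[i][j] = coefficient of x^i in (x^(2j) mod poly)."""
    cols = [reduce_poly(1 << (2 * j), poly, m) for j in range(m)]
    return [[(cols[j] >> i) & 1 for j in range(m)] for i in range(m)]


def mat_mul(A, B):
    """Matrix product over GF(2)."""
    n = len(A)
    return [[sum(A[i][k] & B[k][j] for k in range(n)) & 1
             for j in range(n)] for i in range(n)]


def mat_vec(A, a):
    """Apply matrix A to the element a (bit i = coefficient of x^i)."""
    out = 0
    for i, row in enumerate(A):
        bit = 0
        for j, entry in enumerate(row):
            bit ^= entry & ((a >> j) & 1)
        out |= bit << i
    return out


T = squaring_matrix(0b10011, 4)
T2 = mat_mul(T, T)                       # quad matrix
print(bin(mat_vec(T, 0b1101)))           # 0b1110, the square of x^3 + x^2 + 1
assert all(mat_vec(T2, a) == mat_vec(T, mat_vec(T, a)) for a in range(16))
```

Each output bit of T·a or T^2·a is an XOR of the input bits selected by one row, which is why the number of ones per row matters for LUT mapping.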

Advantage of using Quad Operations
Quad circuits have better LUT utilization compared to squarer circuits.

Generalization of the Itoh-Tsujii Algorithm

Theorem 1

Theorem 2

Quad Itoh-Tsujii Inversion Algorithm

A Circuit for Inversion
At every clock cycle, either the multiplier or the quadblock is active. The output of the multiplier is stored in the mout register.

Finding the Inverse

Finding the Inverse Step 2

Finding the Inverse Step 2

Control Signals for the Inverse

Performance Charts

Higher Powered Itoh-Tsujii
We have seen that quad circuits utilize LUTs better than squarer circuits. Also, LUT size is increasing as silicon technology scales down: we have seen the 4-LUT become the 6-LUT, and now the 8-LUT. This motivates investigating powers higher than those of quad circuits.

Revisiting the Theorems

2^n Itoh-Tsujii Inversion
(Slide callouts: "These are the overheads", "Higher Powered")

Overhead in 2^n Itoh-Tsujii
Computation of […]: using an addition chain for […], […] can be computed in […] clock cycles, where […] is the length of the addition chain for […]. Computation of […], for […]: using an addition chain for […] that contains […], […] can be computed during the […] computation, because […].

2^n Itoh-Tsujii Design

Building the Optimal Design
For a given field and a given FPGA, how do we decide the optimal design? The configurable parameters are the addition chain, the power circuit used in the power block, and the number of cascaded power circuits in the power block. These affect the number of clock cycles and the critical path delay.

Estimating Area Required on an FPGA
A k-input LUT (k-LUT) can implement any function of at most k input variables. The total number of k-LUTs needed to implement a function of […] variables can be expressed as […].
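As a rough sketch of how such a count can be obtained, assume the function is associative in the way a wide XOR is, which is the typical case for the GF(2^m) circuits discussed here. The first k-LUT then absorbs k inputs and every further k-LUT takes the previous output plus k - 1 new inputs. The function name lut_count and the chain decomposition are assumptions made for illustration and may not match the slide's expression term for term.

```python
def lut_count(x, k):
    """Estimated number of k-LUTs for an XOR-like function of x variables.

    Assumption: the function can be split into a chain of k-input pieces.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    if x <= 1:
        return 0                      # a constant or a plain wire needs no LUT
    if x <= k:
        return 1                      # fits in a single LUT
    remaining = x - k                 # inputs left after the first LUT
    # Each further LUT takes (k - 1) new inputs: 1 + ceil(remaining / (k - 1)).
    return 1 + (remaining + k - 2) // (k - 1)


for x in (4, 5, 7, 8):
    print(x, lut_count(x, 4), lut_count(x, 6))
```

Under this assumption a 7-input XOR needs two 4-LUTs, and it also needs two 6-LUTs, which gives a feel for why the number of inputs per output bit decides how well a circuit fills the LUTs.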

Estimating Delay of a Design in an FPGA
Delay in FPGAs comprises LUT delay and routing delay. For this ITA architecture, we have found experimentally that the total delay is proportional to the number of LUTs in the critical path. We denote the number of LUTs in a delay path as maxlutpath. With k-LUTs, the maxlutpath of a function of […] variables is […].
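A small sketch of a LUT-depth estimate, assuming the function is mapped as a balanced tree of k-LUTs so that each level multiplies the number of reachable inputs by k. Under that assumption the depth is the ceiling of log_k of the number of variables. The name max_lut_path is hypothetical and the tree mapping is an assumption, not necessarily the slide's definition.

```python
def max_lut_path(x, k):
    """LUT levels on the longest path for a function of x variables,
    assuming a balanced tree of k-LUTs (depth = ceil(log_k x))."""
    if k < 2:
        raise ValueError("k must be at least 2")
    depth, reach = 0, 1
    while reach < x:       # integer loop avoids floating-point log rounding
        reach *= k
        depth += 1
    return depth


for x in (4, 5, 16, 17, 36, 37):
    print(x, max_lut_path(x, 4), max_lut_path(x, 6))
```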

Recap: Karatsuba Multiplier

Hybrid Karatsuba Multiplier for GF(2^233)
Note that the schoolbook multiplier has replaced the general Karatsuba multiplier.

Estimating LUT Requirement for Hybrid Karatsuba Multiplier
The field multiplier is a hybrid Karatsuba multiplier. A […]-bit hybrid Karatsuba multiplier consists of two […]-bit multipliers and one […]-bit multiplier. This decomposition is applied recursively. At the threshold ([…]) level, the schoolbook multiplier is invoked. The total area of the […]-bit hybrid Karatsuba multiplier is given by […]. The total area for the schoolbook multiplier is […].
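The recursion can be sketched in Python for polynomials over GF(2), before modular reduction. The sketch assumes the usual Karatsuba split of an m-bit operand into a low part of ceil(m/2) bits and a high part of floor(m/2) bits, which yields two ceil(m/2)-bit products and one floor(m/2)-bit product. The names schoolbook_mul and hybrid_karatsuba and the threshold value in the usage line are illustrative assumptions.

```python
def schoolbook_mul(a, b):
    """Carry-less (GF(2)[x]) schoolbook multiplication of two bit masks."""
    result, shift = 0, 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result


def hybrid_karatsuba(a, b, m, threshold):
    """Multiply two polynomials of at most m bits over GF(2).

    Karatsuba splitting is used above the threshold, schoolbook below it.
    """
    if m <= max(threshold, 1):
        return schoolbook_mul(a, b)
    low = (m + 1) // 2                 # ceil(m / 2) bits in the low half
    high = m - low                     # floor(m / 2) bits in the high half
    mask = (1 << low) - 1
    a_lo, a_hi = a & mask, a >> low
    b_lo, b_hi = b & mask, b >> low
    p_lo = hybrid_karatsuba(a_lo, b_lo, low, threshold)
    p_hi = hybrid_karatsuba(a_hi, b_hi, high, threshold)
    p_mid = hybrid_karatsuba(a_lo ^ a_hi, b_lo ^ b_hi, low, threshold)
    # Over GF(2), subtraction is XOR, so the middle term is p_mid ^ p_lo ^ p_hi.
    return p_lo ^ ((p_mid ^ p_lo ^ p_hi) << low) ^ (p_hi << (2 * low))


# Usage: two arbitrary 233-bit operands, illustrative threshold of 29 bits.
a = pow(3, 150) % (1 << 233)
b = pow(7, 90) % (1 << 233)
assert hybrid_karatsuba(a, b, 233, 29) == schoolbook_mul(a, b)
```

Lowering the threshold deepens the tree and trades schoolbook area for more XOR levels, which is the area and delay trade-off the estimates on these slides are meant to capture.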

Estimating Delay of Hybrid Karatsuba Multiplier
The hybrid Karatsuba multiplier is decomposed into smaller multipliers arranged like a tree. The height of the tree is […]. Each level of the simple Karatsuba tree introduces one LUT delay. At the threshold ([…]) level, the schoolbook multiplier delay is added. The delay of the schoolbook multiplier is […]. The delay of the entire multiplier in LUTs is given by […].

Estimating Area & Delay for Modular Reduction
For fields generated by trinomials, the area of modular reduction is almost equal to the field size and the delay is one LUT, considering LUT size […]. For fields generated by pentanomials, […] and 2 LUT for […]. […] and 2 LUT for […].

Area & Delay Estimates for 2^n Circuit
The output of a 2^n circuit, which raises an input […], can be expressed as […], where […] is the binary field matrix and […]. The LUT requirement per output bit is […]. The total LUT requirement for the 2^n circuit is […]. The LUT delay per output bit is […]. Since all bits are computed in parallel, the delay of the 2^n circuit is […].

Area & Delay Estimates for Multiplexer
For a 2^s : 1 MUX, there are s selection lines and thus the output is a function of 2^s + s variables. For a MUX in […], each of the 2^s input lines is m bits wide. The total LUT requirement is […]. The total LUT delay of the MUX is […]. When the number of inputs to the MUX […], the above gives a close upper bound.

Area & Delay of PowerBlock
Let the Powerblock contain u_s cascaded 2^n circuits. The […] has […] selection lines, where […]. The LUT requirement for […] is […]. The total LUT requirement for the Powerblock is […]. The delay of […] is […]. The total LUT delay of the Powerblock is […].

Area & Delay for the Entire Architecture
The LUT estimate for the entire architecture is […]. There are two parallel delay paths. The LUT delay of the first path is […]. The LUT delay of the second path is […]. The LUT delay of the entire architecture is […].

Optimal Number of Cascades
For a given field […] and […] based FPGA, the Powerblock can be configured with different power circuits […] and cascades […]. An increase in […] reduces clock cycles, but increases the delay of the Powerblock. […] is fixed, but […] depends on […] and […]. […] is minimum when […]. The minimum delay of the ITA architecture is thus […].

Power Circuit Selection to achieve Minimum Clock Cycles
The number of clock cycles for the inversion can be approximated as […]. The number of clock cycles for […] increases linearly with […]. The term […] reduces with an increase in […]. When […] is small, the reduction in […] is significant for an increase in […]. But for large values of n, the increase in […] dominates over the decrease in […]. So, […] increases with an increase in […] for large values of […].

Tuning Design for Optimality
The performance metric is […]. Minimization of […] without increasing […] gives the best performance. Area remains almost the same. The following steps are performed to achieve optimal performance: […]. The optimal architecture is given by […].

Validation of Theoretical Estimates
Our estimation model uses maxlutpath to find LUT delay. Routing delay is difficult to model in FPGAs. To get the overall delay, we have used experimental results for a reference ITA architecture. The total delay of the reference architecture is […]. Let the LUT delay of the reference architecture be […]. The total delay of any other ITA architecture in the same field is approximately […]. Here […] is a constant that depends on FPGA technology. In 4-LUT based and 6-LUT based Xilinx FPGAs, […] has values 0.2 and 0.1 respectively.

Validation on 4-input LUT FPGAs

Validation on 6-input LUT FPGAs

Experimental Results

Comparison Charts