

An automated pipeline balancing in the SRC Reconfigurable Computer and its application to the RC5 cipher breaking
Hatim Diab 1, Miaoqing Huang 1, Kris Gaj 2, Tarek El-Ghazawi 1, Nikitas Alexandridis 1
1 The George Washington University, 2 George Mason University

Diab1011/MAPLD'04

Objectives
Implement a pipelined RC5 key breaker on a single chip.
Demonstrate automatic balancing of a pipeline by a compiler (SRC).
Show the cost of the added pipeline.

Requirements
Given:
– A matching pair of plaintext message (M) and ciphertext (C).
Find the corresponding secret key:
– Test the possible secret keys exhaustively.
– Keys are 128 bits long, ranging from all 0's to all 1's.
Requirements:
– The processing element (PE) is fed a new secret key (K_i) each cycle.
– The output C_i corresponding to K_i is compared with C.
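
The exhaustive search in these requirements can be sketched in software as follows. This is only an illustration of the loop structure, not the authors' implementation: `toy_encrypt` is a hypothetical stand-in cipher (not RC5), and the key space is cut to 16 bits so the loop finishes quickly. The hardware design tests one candidate per clock cycle, whereas this loop tests one per iteration.

```python
from typing import Callable, Optional

MASK32 = 0xFFFFFFFF


def toy_encrypt(key: int, message: int) -> int:
    """Hypothetical stand-in cipher (NOT RC5), used only to exercise the loop."""
    x = (message ^ (key * 0x9E3779B1)) & MASK32
    return ((x << 7) | (x >> 25)) & MASK32  # rotate left by 7 bits


def find_key(message: int, cipher: int,
             encrypt: Callable[[int, int], int],
             key_bits: int) -> Optional[int]:
    """Try every key from all 0's to all 1's; return the first key K_i
    whose output C_i equals the known ciphertext C, or None."""
    for candidate in range(1 << key_bits):
        if encrypt(candidate, message) == cipher:
            return candidate
    return None


if __name__ == "__main__":
    m = 0x01234567
    c = toy_encrypt(0xBEEF, m)
    print(find_key(m, c, toy_encrypt, key_bits=16))
```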

RC5 Algorithm

Mixing in the secret key.
// S[0..25] is the array to be mixed for RC5 encryption.
// L[0..3] is the array converted from the secret key K[0..15].
// The output is the array S[0..25], which will be used to encrypt the plaintext.
i = j = 0
A = B = 0
do 3*max(26,4) times
    A = S[i] = (S[i] + A + B) <<< 3;
    B = L[j] = (L[j] + A + B) <<< (A + B);
    i = (i + 1) mod 26;
    j = (j + 1) mod 4;

Encryption.
LE = A + S[0];   // A is the upper part of the plaintext
RE = B + S[1];   // B is the lower part of the plaintext
for i = 1 to 12 do
    LE = ((LE ⊕ RE) <<< RE) + S[2*i];
    RE = ((RE ⊕ LE) <<< LE) + S[2*i+1];
The processed LE is the upper part of the ciphertext; the processed RE is the lower part.
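
The pseudocode above can be turned into a small runnable sketch of RC5-32/12/16. Two details are not on the slide and are taken from Rivest's description of RC5: the initial fill of S with the constants P32 and Q32, and the little-endian packing of the key bytes K[0..15] into L[0..3]. The function names are chosen for this example only.

```python
MASK = 0xFFFFFFFF                    # 32-bit words (w = 32)
ROUNDS = 12                          # r = 12
T = 2 * (ROUNDS + 1)                 # 26 subkey words S[0..25]
P32, Q32 = 0xB7E15163, 0x9E3779B9    # RC5 constants for w = 32 (not on the slide)


def rotl(x: int, n: int) -> int:
    """Rotate a 32-bit word left; only the low 5 bits of n are used."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK


def expand_key(key: bytes) -> list:
    """Mix a 16-byte secret key into the subkey array S[0..25]."""
    assert len(key) == 16
    # L[0..3]: key bytes packed into 32-bit words, little-endian
    L = [int.from_bytes(key[k:k + 4], "little") for k in range(0, 16, 4)]
    S = [(P32 + k * Q32) & MASK for k in range(T)]
    i = j = a = b = 0
    for _ in range(3 * max(T, len(L))):          # 78 mixing iterations
        a = S[i] = rotl((S[i] + a + b) & MASK, 3)
        b = L[j] = rotl((L[j] + a + b) & MASK, a + b)
        i = (i + 1) % T
        j = (j + 1) % len(L)
    return S


def encrypt_block(S: list, a: int, b: int) -> tuple:
    """Encrypt one 64-bit block given as two 32-bit halves (a, b)."""
    le = (a + S[0]) & MASK
    re = (b + S[1]) & MASK
    for i in range(1, ROUNDS + 1):
        le = (rotl(le ^ re, re) + S[2 * i]) & MASK
        re = (rotl(re ^ le, le) + S[2 * i + 1]) & MASK
    return le, re


if __name__ == "__main__":
    left, right = encrypt_block(expand_key(bytes(16)), 0, 0)
    print(f"{left:08X} {right:08X}")
```

For an all-zero key and all-zero plaintext, the published RC5-32/12/16 test vector is EEDBA521 6D8F4B15, which gives a quick check of the sketch.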

Key-Breaking Flowchart (figure)

Condition & Implementation
RC5 32/12/16
– Ciphertext: 32 * 2 bits = 64 bits
– 12 rounds
– Key: 16 * 8 bits = 128 bits
Implement RC5 encryption using
– 12 rounds of encryption macros, with 6 clocks of latency
– 78 iterations of key generation macros, with 3 clocks of latency

Design & Bottleneck
Pipelined design
– Process one key every clock cycle in a pipelined fashion.
Data dependencies
– RC5 makes extensive use of data-dependent rotations.
– Each S value is needed every 26th step.
– Each L value is needed every 4th step.
A manual HDL-based realization of the pipeline proved to be time-consuming and error-prone.
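
The reuse pattern behind these dependencies follows from the index arithmetic of the key-mixing loop: iteration k touches S[k mod 26] and L[k mod 4]. The sketch below tabulates, for each of the 78 iterations, how many iterations earlier the same word was last written. The function name and the row layout are hypothetical.

```python
def reuse_distances(steps: int = 78, s_len: int = 26, l_len: int = 4) -> list:
    """For each mixing iteration k, return (k, i, j, ds, dl), where
    i = k mod s_len and j = k mod l_len are the S and L indices used, and
    ds / dl are the number of iterations since S[i] / L[j] was last written
    (None on first use)."""
    last_s, last_l = {}, {}
    rows = []
    for k in range(steps):
        i, j = k % s_len, k % l_len
        ds = k - last_s[i] if i in last_s else None
        dl = k - last_l[j] if j in last_l else None
        rows.append((k, i, j, ds, dl))
        last_s[i], last_l[j] = k, k
    return rows


if __name__ == "__main__":
    for row in reuse_distances()[24:30]:
        print(row)
```

In a fully unrolled pipeline where each iteration is its own stage, a word written at stage k and next used at stage k + 26 (or k + 4) has to be carried past the stages in between, which is where delay registers come in.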

Data Dependencies in Each Iteration (figure)

Solution
Implement concurrently on one FPGA chip:
– 78 key initialization macros
– 12 encryption macros
Connect the macros in a linear pipeline.
The SRC compiler balances the pipeline by inserting delay channels so that all macros run synchronously.

Delay Channels Added by SRC Compiler (figure)
Legend: Delay 1 = 1 register; Delay 2 = 2 registers; Delay 5 = 5 registers; wire
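
Balancing a pipeline by inserting delay registers can be illustrated with a generic longest-path schedule. This is a sketch of the general idea, not the SRC compiler's algorithm, which the slides do not describe. The graph, node names, and latencies in the example are invented.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def balance(latency: dict, preds: dict) -> tuple:
    """latency: node -> clocks the node takes.
    preds: node -> list of nodes whose outputs it consumes.
    Returns (start, delays): start[v] is the cycle at which v can begin, and
    delays[(u, v)] is the number of registers to insert on edge u -> v so that
    all inputs of v arrive in the same cycle."""
    start, delays = {}, {}
    for v in TopologicalSorter(preds).static_order():
        # Cycle at which each predecessor's output becomes available
        ready = {u: start[u] + latency[u] for u in preds.get(v, [])}
        start[v] = max(ready.values(), default=0)
        for u, t in ready.items():
            delays[(u, v)] = start[v] - t   # 0 means a plain wire
    return start, delays


if __name__ == "__main__":
    # Invented example: an input feeds two 3-clock macros and a 6-clock macro
    latency = {"in": 0, "mixA": 3, "mixB": 3, "enc": 6}
    preds = {"mixA": ["in"], "mixB": ["mixA", "in"], "enc": ["mixB", "in"]}
    start, delays = balance(latency, preds)
    print(start)
    print(delays)
```

In this example the direct edges from `in` to `mixB` and to `enc` need 3 and 6 registers respectively, while the edges along the main chain stay as plain wires.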

Detailed Flow (figure)

Compilation Result
Device utilization summary:
– Number of External IOBs: 594 out of … (…%)
– Number of LOCed External IOBs: 594 out of … (…%)
– Number of Slices: … out of … (…%)
– Number of BUFGMUXs: 1 out of 16 (6%)
Maximum Clock Frequency: …

Effectiveness of the Benchmark
Cipher Text | Expected Key | Found Key | Time (SRC) (µs) | Time (PC) (µs)
EEDBA521 6D8F4B ,3420
C53073A4 8AFAE ,028359,000
07CEC757 C72BCAE ,781,9801,847,105,000
2F68DC4A ADBFACC ,466,2745,251,282,
CACD D1EDD ,050,562 Too large to simulate
51C6514A 4EF0A99B ,318,493 Too large to simulate

Conclusion
The objective was realized: one 128-bit variable is pushed into the processing chain every clock cycle.
A speed-up of 1000x over software and 300x over serial hardware implementations was achieved.
Because the parameters of the RC5 algorithm are flexible, different map routines can be designed to fit distinct area and throughput requirements.
The automated pipeline balancing of the SRC compiler substantially decreased the development time of complex pipelined designs.