Cryptographic Algorithms and their Implementations Discussion of how to map different algorithms to our architecture  Public-Key Algorithms (Modular Exponentiation)


1 Cryptographic Algorithms and their Implementations
Discussion of how to map different algorithms to our architecture:
- Public-key algorithms (modular exponentiation)
- Rijndael
- Serpent
- Others (MARS, RC6, Twofish, etc.)

2 Modular Exponentiation Square and Multiply Algorithm for Modular Exponentiation
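
The slide's figure is not part of this transcript. As a minimal sketch of the general technique it names, the following shows a right-to-left square-and-multiply loop in Python. The function name and argument names are illustrative assumptions, not taken from the slides, and plain `%` reduction is used here instead of Montgomery multiplication.

```python
def square_and_multiply(base: int, exponent: int, modulus: int) -> int:
    """Return base**exponent mod modulus using right-to-left binary exponentiation."""
    if modulus <= 0:
        raise ValueError("modulus must be positive")
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    result = 1 % modulus          # running product
    square = base % modulus       # base^(2^i) mod modulus at step i
    while exponent > 0:
        if exponent & 1:          # current exponent bit is 1: multiply
            result = (result * square) % modulus
        square = (square * square) % modulus  # always square
        exponent >>= 1
    return result
```

Each exponent bit costs one squaring, plus one multiplication when the bit is set, which is why the cost of the modular multiplier dominates the design.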

3 Modular Exponentiation Montgomery Modular Multiplication

4 Modular Exponentiation
Several approaches to implementing modular multiplication:
- Redundant-representation based (e.g. carry-save)
- Residue number system based
- Systolic array based
Word-based implementations are preferable, due to their similarity with symmetric-key algorithms:
- This rules out systolic arrays

5 Modular Exponentiation
The most popular and fastest implementations were based on carry-save representation. The carry-save implementations were also word-oriented.
We selected the fastest, simplest implementation:
- Simplicity and homogeneity in the algorithms are extremely beneficial when designing a custom reconfigurable fabric.
- Performance when implemented on Xilinx Virtex FPGAs: almost 5 Mb/s (the highest reported figure that we could find)

6 Modular Exponentiation
Five-to-two Multiplier Modular Exponentiation (P, E, M)
K = 2^(2k) mod M (computed externally)
1. P1_0, P2_0 = 5to2_MontMult(K, 0, 1, 0, M); Z1_0, Z2_0 = 5to2_MontMult(K, 0, P, 0, M)
2. FOR i = 0 to n-1 DO
3.   Z1_{i+1}, Z2_{i+1} = 5to2_MontMult(Z1_i, Z2_i, Z1_i, Z2_i, M)
4.   IF e_i = 1 THEN P1_{i+1}, P2_{i+1} = 5to2_MontMult(P1_i, P2_i, Z1_i, Z2_i, M)
     ELSE P1_{i+1}, P2_{i+1} = P1_i, P2_i
5. ENDFOR
6. P1_n, P2_n = 5to2_MontMult(1, 0, P1_{n-1}, P2_{n-1}, M)
7. P = P1_n + P2_n
8. RETURN P
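
The control flow of this algorithm can be sketched in Python as follows. This is a simplified model under stated assumptions, not the authors' hardware design: operands are plain integers rather than carry-save pairs, the helper `mont_mult` is a hypothetical bit-serial stand-in for 5to2_MontMult, and the iteration count `m` is chosen as the bit length of M plus 2 so that intermediate values stay below 2M. A final `% M` is added for a fully reduced result.

```python
def mont_mult(a: int, b: int, M: int, m: int) -> int:
    """Bit-serial Montgomery product: a*b*2^(-m) mod M, returned in [0, 2M).

    Assumes M is odd, a and b are below 2M, and 2**m >= 4*M.
    """
    s = 0
    for i in range(m):
        a_i = (a >> i) & 1
        t = s + a_i * b
        q_i = t & 1               # chosen so that t + q_i*M is even
        s = (t + q_i * M) >> 1    # exact division by 2
    return s


def mont_mod_exp(P: int, E: int, M: int) -> int:
    """Return P**E mod M for odd M, following the structure of the slide."""
    if M < 3 or M % 2 == 0:
        raise ValueError("M must be an odd modulus >= 3")
    m = M.bit_length() + 2        # assumption: number of multiplier iterations
    K = pow(2, 2 * m, M)          # "computed externally" in the slide
    p = mont_mult(K, 1, M, m)     # Montgomery form of 1
    z = mont_mult(K, P % M, M, m) # Montgomery form of P
    e = E
    while e > 0:                  # scan exponent bits, least significant first
        if e & 1:
            p = mont_mult(p, z, M, m)
        z = mont_mult(z, z, M, m)
        e >>= 1
    return mont_mult(1, p, M, m) % M  # leave the Montgomery domain
```

Multiplying by K on entry and by 1 on exit means no conventional modular reduction is needed inside the loop.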

7 Modular Exponentiation
Five-to-two CSA Montgomery Multiplication (A1, A2, B1, B2, M)
1. S1_0, S2_0 = 0, 0
2. FOR i = 0 to m-1 DO
3.   q_i = [(S1_i + S2_i) + A_i*(B1 + B2)] mod 2
4.   S1_{i+1}, S2_{i+1} = CSR[(S1_i + S2_i) + A_i*(B1 + B2) + q_i*M] div 2
5. ENDFOR
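
A software model of this loop, assuming that CSR denotes a reduction of five operands to a carry-save pair, might look like the sketch below. The helper names `csa` and `five_to_two` are illustrative, the five-to-two step is modelled as three chained 3:2 carry-save adders (one possible arrangement, not necessarily the one in the cited design), and the bit A_i is read from the ordinary sum A1 + A2 for simplicity.

```python
def csa(x: int, y: int, z: int):
    """3:2 carry-save adder: returns (s, c) with s + c == x + y + z."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def five_to_two(v1: int, v2: int, v3: int, v4: int, v5: int):
    """Reduce five operands to a carry-save pair with the same total."""
    s, c = csa(v1, v2, v3)
    s, c = csa(s, c, v4)
    return csa(s, c, v5)


def mont_mult_5to2(A1: int, A2: int, B1: int, B2: int, M: int, m: int):
    """Return (S1, S2) with S1 + S2 congruent to (A1+A2)*(B1+B2)*2^(-m) mod M.

    Assumes M is odd, A1+A2 and B1+B2 are below 2M, and 2**m >= 4*M.
    """
    A = A1 + A2                   # software shortcut to obtain the bits A_i
    S1 = S2 = 0
    for i in range(m):
        a_i = (A >> i) & 1
        q_i = (S1 + S2 + a_i * (B1 + B2)) & 1
        S1, S2 = five_to_two(S1, S2, a_i * B1, a_i * B2, q_i * M)
        # The total is even and the carry word is always even, so both
        # words are even and can be halved independently.
        S1, S2 = S1 >> 1, S2 >> 1
    return S1, S2
```

No carry is propagated across the full word width inside the loop, which is the property that makes the carry-save form attractive for wide operands.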

8 Modular Exponentiation Their Implementation of MM

9 Modular Exponentiation Implementing MM on our design

10 Modular Exponentiation
- Each of the 64-CSA blocks maps to a single basic-block.
- Outputs of the last basic-block are registered.
- q_i is generated by a random-logic block at the second basic-block and broadcast to all groups.
- A_i is generated in a similar manner, using two more basic-blocks, and is also broadcast to all groups.

11 Modular Exponentiation
Efficient and scalable mapping to our design:
- 1024-bit RSA will need 16 groups, 2048-bit RSA will need 32 groups, and 4096-bit RSA will need 64 groups.
Primary concern: the clock rate may be limited by the bit-broadcasts of q_i and A_i.
- This is a potential impediment to scalability.
- We are exploring methods for pipelining these broadcasts as well, to improve the clock rate and scalability.
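
The group counts quoted on this slide are consistent with each group handling 64 bits of the operand (1024 / 16 = 64). The snippet below only restates that arithmetic; the constant `BITS_PER_GROUP` is inferred from the slide's numbers and is an assumption rather than a stated parameter.

```python
BITS_PER_GROUP = 64  # assumption: inferred from 1024 bits -> 16 groups


def groups_needed(modulus_bits: int) -> int:
    """Number of groups needed for a modulus of the given width (rounded up)."""
    return -(-modulus_bits // BITS_PER_GROUP)
```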

12 Rijndael
Primary operations:
- Sub-Bytes
- Shift-Rows
- Mix-Columns
- Add-Round-Key

13 Rijndael
Representation of data: 128-bit state.
[Figure: the 128 bits of state; labels: 32 bits, 8 bits each, 128 bits]

14 Rijndael
Add-Round-Key
- Simple 128-bit XOR operation: uses 1 basic-block
Sub-Bytes
- Simple operation: byte-wise table lookup from the S-box
- Each S-box is 2 kbits (256 entries x 8 bits); 16 parallel S-boxes are required
- No basic-blocks required, but ALL memory-blocks are required
Shift-Rows
- Simple operation: 4 x 32-bit permutations
- Uses only 1 basic-block
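
To make the two logic-only steps concrete, here is a small Python sketch of Add-Round-Key and Shift-Rows on a 16-byte state. It assumes the usual AES byte ordering, in which the byte at row r and column c is stored at index r + 4*c; the function names are illustrative.

```python
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """XOR the 128-bit state with the 128-bit round key."""
    return bytes(s ^ k for s, k in zip(state, round_key))


def shift_rows(state: bytes) -> bytes:
    """Rotate row r of the state left by r byte positions (a fixed permutation)."""
    return bytes(state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4))
```

Both steps are pure wiring plus XOR, which is why each fits in a single basic-block.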

15 Rijndael Mix-Columns Somewhat complicated: can be implemented using table lookups, but we’re out of Memory ! Alternative implementation:

16 Rijndael Mix-Columns
- The operation may be expressed in terms of the “xtime()” function.
- The Mix-Columns implementation requires an “xtime()” operation on each byte, followed by 4 XOR operations.

17 Rijndael Mix-Columns
- To implement “xtime()” efficiently, we modified it so that only 2 basic-blocks are needed to apply “xtime()” to all 16 bytes.
- A single basic-block takes the 128-bit data as input and generates the “xtime()” mask (0 0 0 0 x7 x7 0 x7) for each of the 16 bytes at the permute unit, where x7 is the most significant bit of the byte.
- Another basic-block then performs the XOR with the mask, followed by a left shift at the permute unit, with the LSB replaced by x7.
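
The mask-then-shift formulation can be checked against the textbook definition of xtime() (multiplication by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1). The sketch below works on a single byte; the function names are illustrative, and the hardware applies the same steps to all 16 bytes in parallel.

```python
def xtime_reference(b: int) -> int:
    """Textbook xtime(): shift left, then reduce by 0x11B if bit 8 is set."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b & 0xFF


def xtime_masked(b: int) -> int:
    """xtime() as mask, XOR, left shift, then LSB replaced by x7."""
    x7 = (b >> 7) & 1
    mask = 0b00001101 if x7 else 0      # the pattern 0 0 0 0 x7 x7 0 x7
    shifted = ((b ^ mask) << 1) & 0xFF  # XOR first, then shift left by one
    return shifted | x7                 # the vacated LSB takes the value x7
```

Shifting the mask 0x0D left gives 0x1A, and setting the LSB to x7 supplies the remaining bit of the reduction constant 0x1B.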

18 Rijndael Mix-Columns
- After generating the output of the “xtime()” function, 4 x 128-bit XOR operations need to be performed; 4 basic-blocks will be used.
- Note that the Mix-Columns operation is carried out in parallel on all 4 columns.
[Figure labels: xtime masks for all bytes; XOR operation]
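
As an illustration of how Mix-Columns reduces to xtime() plus four XORs per output byte, the following Python sketch computes each output byte as 2*a[r] + 3*a[r+1] + a[r+2] + a[r+3] over GF(2^8). The function names and the column-major state layout are assumptions made for the example.

```python
def xtime(b: int) -> int:
    """Multiply a byte by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    return ((b << 1) ^ (0x1B if b & 0x80 else 0)) & 0xFF


def mix_column(col):
    """Mix one 4-byte column using only xtime() results and XORs."""
    x = [xtime(b) for b in col]         # xtime() applied to every byte
    return [
        x[r] ^ x[(r + 1) % 4] ^ col[(r + 1) % 4] ^ col[(r + 2) % 4] ^ col[(r + 3) % 4]
        for r in range(4)               # four XORs combine five terms
    ]


def mix_columns(state: bytes) -> bytes:
    """Apply mix_column to each of the four columns of a 16-byte state."""
    out = []
    for c in range(4):
        out.extend(mix_column(list(state[4 * c:4 * c + 4])))
    return bytes(out)
```

For example, the column `[0xdb, 0x13, 0x53, 0x45]` maps to `[0x8e, 0x4d, 0xa1, 0xbc]`.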

19 Rijndael Implementation summary
- Only 8 basic-blocks required:
  - 2 (1 each) for Add-Round-Key and Shift-Rows
  - 6 for Mix-Columns (2 for xtime(), 4 for XOR operations)
- 16 memory-blocks required: all memory-blocks are used up in a single round!
- Inefficient mapping, due to the memory-intensive implementation of Rijndael: only 10% of the logic is used, versus 100% of the memory.

20 Rijndael Potential Solutions
- Add lots of memory:
  - At least 10 times more
  - Issues with memory placement
- Consider memory-less implementations of Sub-Bytes:
  - Requires GF(2^8) constant multiplication and Inverse Affine Transforms
  - Currently under study as the more efficient and practical option.
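
For reference, the standard definition of the Rijndael S-box can be computed without any lookup table: take the multiplicative inverse in GF(2^8), then apply a fixed affine transform. The Python sketch below follows that textbook definition and is not the memory-less hardware design under study; the function names are illustrative.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        carry = a & 0x80
        a = (a << 1) & 0xFF
        if carry:
            a ^= 0x1B
        b >>= 1
    return p


def gf_inv(a: int) -> int:
    """Multiplicative inverse as a^254; maps 0 to 0 as Rijndael requires."""
    result, power, e = 1, a, 254
    while e:
        if e & 1:
            result = gf_mul(result, power)
        power = gf_mul(power, power)
        e >>= 1
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sub_byte(a: int) -> int:
    """S-box value: inverse in GF(2^8) followed by the affine transform."""
    b = gf_inv(a)
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63
```

The trade-off is logic depth in place of the 2 kbits of table storage per S-box.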

21 Serpent
Substitution-permutation cipher comprising:
- Key mixing,
- S-box substitution, and
- Linear transformation.
S-boxes: 4 x 4 bit
- 32 copies required each round
- 16 x 4 x 32 = 2048 bits per round.

22 Serpent
The linear transformation step consists of:
- 8 fixed permute operations, and
- 8 XOR operations
All operands are 32 bits wide.

23 Serpent
Serpent is an ideal match for our architecture:
- The 8 x 32-bit fixed shifts and rotates can be easily implemented by the permute units of 2 basic-blocks.
- An additional 2 basic-blocks are required to implement the 8 x 32-bit XOR operations.
- The 128-bit key mixing stage per round would require 1 more basic-block.
A total of 5 basic-blocks and 2 kbits of memory are required per round.
Each round fits perfectly in a single group of our architecture!
16 of Serpent’s 32 rounds may be unrolled in our architecture.
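
The count of 8 fixed permutes and 8 XORs matches the linear transformation in the published Serpent specification: six rotates, two shifts, and four two-XOR updates. The Python sketch below follows that specification; the rotation and shift amounts come from the Serpent definition rather than from these slides, and the function names are illustrative. The inverse is included only so the round trip can be checked.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits (0 < n < 32)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def serpent_lt(x0: int, x1: int, x2: int, x3: int):
    """Serpent linear transformation on four 32-bit words."""
    x0 = rotl32(x0, 13)
    x2 = rotl32(x2, 3)
    x1 ^= x0 ^ x2
    x3 ^= x2 ^ ((x0 << 3) & MASK32)
    x1 = rotl32(x1, 1)
    x3 = rotl32(x3, 7)
    x0 ^= x1 ^ x3
    x2 ^= x3 ^ ((x1 << 7) & MASK32)
    x0 = rotl32(x0, 5)
    x2 = rotl32(x2, 22)
    return x0, x1, x2, x3


def serpent_lt_inverse(x0: int, x1: int, x2: int, x3: int):
    """Undo serpent_lt by reversing each step in the opposite order."""
    x2 = rotl32(x2, 32 - 22)
    x0 = rotl32(x0, 32 - 5)
    x2 ^= x3 ^ ((x1 << 7) & MASK32)
    x0 ^= x1 ^ x3
    x3 = rotl32(x3, 32 - 7)
    x1 = rotl32(x1, 32 - 1)
    x3 ^= x2 ^ ((x0 << 3) & MASK32)
    x1 ^= x0 ^ x2
    x2 = rotl32(x2, 32 - 3)
    x0 = rotl32(x0, 32 - 13)
    return x0, x1, x2, x3
```

Every operation is either fixed wiring or a 32-bit XOR, which is what lets the permute units and XOR logic of a few basic-blocks cover the whole step.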

24 Other Algorithms
DES
- Implementation of a single round is trivial: a single group may implement multiple rounds!
Twofish
- Complex structure; requires more time to define an implementation on our architecture.
- However, all its basic operations are directly supported.
RC6 and MARS
- Involve complicated operations requiring special-purpose logic:
  - Data-dependent rotations
  - Multiplication modulo 2^32

25 Other Algorithms
RC6 and MARS
- This special-purpose logic was not incorporated because:
  - These algorithms are more suitable for software implementation than for hardware
  - These algorithms lack support and popularity
  - Adding special-purpose logic would incur overhead beyond its own area, as additional supporting interconnect must be provided.

26 Comparison with Related Work Although we cannot provide results based on empirical evaluation, we can present a logical framework for comparison of individual features Through deductive reasoning, we identify what possible advantages one approach may have over the other, assuming all other factors normalized.

27 Comparison with Related Work
Comparison with FPGA-based implementations
- Area efficiency
  - Use of basic gates instead of LUTs
  - Basic-blocks with limited flexibility, thus fewer configuration bits
  - Basic units (full adders) combined into clusters of 64 and programmed as a single entity – further savings in configuration memory elements
- Performance
  - Use of basic gates instead of LUTs
  - Simpler interconnect, with fewer routing switches
  - Hierarchical organization – no long wires (except for bit-broadcast)
  - Far less configuration data required – faster reconfiguration

28 Comparison with Related Work
Comparison with FPGA-based implementations
- Potential pitfalls
  - The design dedicates a considerable amount of area to inter-block interconnect.
  - Until the actual area can be quantified, we are unsure of the area efficiency estimates.
  - We need to identify the most suitable performance/area tradeoff.

29 Comparison with Related Work
Comparison with COBRA Architecture
- Uses multiple copies of special-purpose logic blocks, coupled with an extremely simple interconnect.

30 Comparison with Related Work
Comparison with COBRA Architecture
- Low logic utilization – we have more generic blocks
- Fixed-latency operations
- Intermediate values registered only at the RCE boundary

31 Programming Methodology
Reconfigurable computing (RC) devices suffer from the following two critical issues:
- Lack of a comprehensive programming model
- Lack of hardware virtualization
The first issue reflects the difficulty of programming RC architectures such as FPGAs.
The second issue concerns the exposure of hardware resource limitations to the programmer.

32 Programming Methodology
How COBRA deals with these issues:
- It is essentially a special-purpose programmable architecture rather than a configurable one.
- VLIW-like instructions alleviate some of the programming-model issues.
- They also resolve the virtualization aspect.

33 Programming Methodology
The programming methodology and the impact of the issues mentioned can be seen in terms of a spectrum.
[Figure: spectrum with the labels COBRA [3], Microprocessor, Our Approach, FPGAs]

34 Programming Methodology
The programming model issue is less severe for us because:
- Simple, highly specialized architecture
Hardware virtualization is still a concern.

35 Programming Methodology
Programming model:
- Provide basic primitives that are supported by our architecture.
- Programming is accomplished by expressing an algorithm using these primitives and connecting them together using the 32-bit interconnect.
- Mapping such a description onto our design should be a trivial software challenge.
- Due to the special-purpose nature of the design, the primitives are limited in number, and thus programming should be an easy task.

36 Programming Methodology
Programming Primitives:
- 32-bit Carry Save Adder
- 32-bit XOR
- 32-bit AND
- 32-bit OR
- 32, 64, and 128-bit Ripple Carry Adder
- 32, 64, and 128-bit Fixed Shifts
- 32-bit Rotates and random permutes
- 64-bit, 128-bit limited permutes (TBD)
- ANDing a 32-bit value with a single bit
- 128-bit shift-register
- Random bit-logic implementation, since each block is also capable of implementing:
  - a single 4-input function
  - two 3-input functions
  - four 2-input functions
- 4 global bit-broadcast lines
- 32-bit interconnect, point to point
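
Two of these primitives can be modelled in a few lines of Python to show the intended behaviour. The function names are hypothetical and the sketch says nothing about how the hardware blocks are actually configured; the carry word is left unmasked so the identity sum + carry = a + b + c holds exactly.

```python
MASK32 = 0xFFFFFFFF


def csa32(a: int, b: int, c: int):
    """32-bit carry-save adder: three inputs in, a sum word and a carry word out."""
    a, b, c = a & MASK32, b & MASK32, c & MASK32
    sum_word = a ^ b ^ c                              # per-bit sum, no carry chain
    carry_word = ((a & b) | (a & c) | (b & c)) << 1   # per-bit majority, weight 2
    return sum_word, carry_word


def and_with_bit(value: int, bit: int) -> int:
    """AND every bit of a 32-bit value with one broadcast bit."""
    return value & (MASK32 if bit & 1 else 0)
```

In the Montgomery multiplier, the single-bit AND is the operation that forms A_i*B and q_i*M before the carry-save addition.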

37 Conclusion: Work in Progress
The following areas of the design are still under consideration and not yet completely defined:
- Configurable memory-block architecture
- VLSI design to evaluate performance metrics and fine-tune the logical design, i.e. if it is found to be too slow: reduce the number of switches, use longer wires, minimize the interconnect to that which is necessary, etc.
- Furthermore, the iterative process of evaluating more symmetric-key algorithms and refining the architecture is still in progress.

