1 Low Cost Design of Advanced Encryption Standard (AES) Processor Ming-Chih Chen Department of Electronic Engineering National Kaohsiung First University of Science and Technology

2 Outline
- Introduction
- Previous AES Design Methods
- Two Proposed Substructure Sharing Methods for XOR-based Operations
- Two Proposed CSE Algorithms for Sum-of-Product Operations
- Comparisons and Implementations
- Conclusions

3 Introduction

4 Introduction
- In October 2000, NIST (National Institute of Standards and Technology) selected the Rijndael algorithm as the new Advanced Encryption Standard.
- The Rijndael AES algorithm is a symmetric block cipher that processes 128-bit data blocks using cipher keys of 128, 192, or 256 bits.
- Applications of AES include wireless network security (IEEE 802.11), smart cards, etc.

5 Advanced Encryption Standard
- Finite Field Operations
- AES Transformations & Algorithm

6 Finite Field Operations

7 Finite Field Addition
- Bitwise XOR operation (modulo-2 addition), expressible in polynomial, binary, or hexadecimal notation

8 Multiplication in GF(2^8)
- Multiplication of two polynomials modulo the irreducible polynomial m(x) = x^8 + x^4 + x^3 + x + 1
- Ex: {57} · {83} = {c1}
- Multiplicative identity: {01}
- The multiplicative inverse of b(x) is denoted by b^{-1}(x)
- Extended Euclidean algorithm
  - b(x)a(x) + m(x)c(x) = 1 => b^{-1}(x) = a(x) mod m(x)
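The field arithmetic on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the hardware structure discussed later: the function names are hypothetical, and the inverse is computed as b^254 (valid because every nonzero element satisfies b^255 = 1) instead of with the extended Euclidean algorithm named on the slide.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def gf_add(a, b):
    """Addition in GF(2^8) is a bitwise XOR of the two bytes."""
    return a ^ b


def gf_mul(a, b):
    """Shift-and-add multiplication of two bytes modulo m(x)."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:          # degree reached 8: reduce by m(x)
            a ^= AES_MODULUS
        b >>= 1
    return product


def gf_inv(b):
    """Multiplicative inverse as b^254; {00} maps to {00} as in AES."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, b)
    return result


print(hex(gf_mul(0x57, 0x83)))  # the slide's example {57} · {83}
```

The exponentiation loop is deliberately naive; it is only meant to make the definition of the inverse checkable.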

9 Multiplication by x
- b_7 = 0
  - Left shift
- b_7 = 1
  - Left shift followed by bitwise XOR with {1b}
- This operation is denoted by xtime( )
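The two cases above map directly onto a one-line function. The sketch below assumes bytes are held as Python integers in the range 0..255; the usage lines build {57} · {13} from repeated xtime( ) calls, since {13} = {10} + {02} + {01}.

```python
def xtime(b):
    """Multiply a GF(2^8) byte by x, i.e. by {02}."""
    shifted = (b << 1) & 0xFF          # left shift, dropping bit 8
    return shifted ^ 0x1B if b & 0x80 else shifted


# {57} · {13} = {57} · ({01} + {02} + {10})
x2 = xtime(0x57)                       # {57} · {02}
x10 = xtime(xtime(xtime(x2)))          # {57} · {10}
product = 0x57 ^ x2 ^ x10
print(hex(product))
```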

10 Polynomials with Coefficients in GF(2^8)
- Each coefficient of a polynomial is a byte (8 bits)
  - Polynomial addition: a(x) + b(x)
  - Byte-wise XOR of corresponding coefficients
- Polynomial multiplication modulo x^4 + 1
  - d(x) = a(x) b(x) mod (x^4 + 1) (similar to cyclic convolution)
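The cyclic-convolution view can be made concrete with a short sketch. It assumes a polynomial is stored as a list of four bytes with index i holding the coefficient of x^i; the helper names are hypothetical. Because x^4 = 1 modulo x^4 + 1, the product term a_i · b_j simply lands on coefficient (i + j) mod 4.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def poly_mul_mod_x4_plus_1(a, b):
    """Multiply two 4-term polynomials with byte coefficients modulo x^4 + 1."""
    d = [0, 0, 0, 0]
    for i in range(4):
        for j in range(4):
            d[(i + j) % 4] ^= gf_mul(a[i], b[j])   # x^(i+j) wraps around
    return d


# Multiplying by x rotates the coefficients by one position.
print(poly_mul_mod_x4_plus_1([0, 1, 0, 0], [1, 2, 3, 4]))
```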

11 AES Transformations & Algorithm

12 Inputs and Outputs
- Input and output
  - Sequences of blocks with a block length of 128 bits (Nb = 4 words per block)
- Cipher key
  - Cipher keys with a key length of 128, 192, or 256 bits (Nk = 4, 6, or 8 words per key)

13 Byte Representation
- Block length = 128 bits = 16 bytes
- Key length = 128, 192, or 256 bits = 16, 24, or 32 bytes
- Finite field element representation
  - Polynomial: {01100011} = x^6 + x^5 + x + 1
- Hexadecimal representation
  - {01100011} = {63}
- One extra bit to the left of a byte
  - {01}{1b}

14 State
- State: 2-D 4 x 4 array of bytes
- A state has four rows and Nb columns
- 1-D array of 32-bit words w_0, w_1, w_2, w_3, with each word w_i composed of one column of the 2-D state

15 Key-Block-Round

16 Rijndael AES Algorithm
(a) Encryption (b) Direct Decryption (c) Modified Decryption

17 Four Transformations in Cipher
- SubBytes( ): SB
  - Nonlinear byte substitution
- ShiftRows( ): SR
  - Cyclically left-shift the last three rows of the state
- MixColumns( ): MC
  - Transformation on each column of the state
- AddRoundKey( ): ARK
  - Each column is XORed with a 32-bit key schedule word generated by the key expansion

18 SubBytes( )
- Take the multiplicative inverse (MI) in GF(2^8): S -> S^{-1}
- Apply the affine transformation (AF) over GF(2): S' = M · S^{-1} + C, with C = {63}_16
  - where S and S' are the input and output bytes in 8-D vector format
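The two steps of SubBytes( ) can be checked with a small software model. The matrix M is not reproduced in this transcript, so the sketch below assumes the affine map from FIPS-197, where output bit i is the XOR of inverse bits i, i+4, i+5, i+6, and i+7 (indices modulo 8) plus bit i of {63}. The inverse is computed as s^254 for brevity; function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def gf_inv(s):
    """MI step: s^254, with {00} mapped to {00}."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, s)
    return result


def affine(s):
    """AF step: S' = M · S + C over GF(2), with C = {63} (FIPS-197 form)."""
    out = 0
    for i in range(8):
        bit = (0x63 >> i) & 1
        for offset in (0, 4, 5, 6, 7):
            bit ^= (s >> ((i + offset) % 8)) & 1
        out |= bit << i
    return out


def sub_byte(s):
    """SubBytes on one byte: MI followed by AF."""
    return affine(gf_inv(s))


print(hex(sub_byte(0x53)))
```

Tabulating sub_byte for all 256 inputs gives the S-box that the table-lookup approach stores in ROM.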

19 Overall Effect of SubBytes( )  Substitution table (S-box)

20 ShiftRows( )

21 MixColumns( )
- Polynomial multiplication by the fixed polynomial a(x) = {03}x^3 + {01}x^2 + {01}x + {02} modulo x^4 + 1
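Written out per output byte, the multiplication by a(x) needs only xtime( ) and XOR, since {03}b = xtime(b) + b. The sketch below is one straightforward software rendering, assuming a column is a list of four bytes from top to bottom; it is not the shared-logic structure proposed later in the talk.

```python
def xtime(b):
    """Multiply a GF(2^8) byte by {02}."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def mix_column(col):
    """Apply MixColumns to one 4-byte column of the state."""
    a0, a1, a2, a3 = col
    return [
        xtime(a0) ^ (xtime(a1) ^ a1) ^ a2 ^ a3,   # {02}a0 + {03}a1 + a2 + a3
        a0 ^ xtime(a1) ^ (xtime(a2) ^ a2) ^ a3,   # a0 + {02}a1 + {03}a2 + a3
        a0 ^ a1 ^ xtime(a2) ^ (xtime(a3) ^ a3),   # a0 + a1 + {02}a2 + {03}a3
        (xtime(a0) ^ a0) ^ a1 ^ a2 ^ xtime(a3),   # {03}a0 + a1 + a2 + {02}a3
    ]


print([hex(b) for b in mix_column([0xDB, 0x13, 0x53, 0x45])])
```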

22 AddRoundKey( )

23 Key Expansion
- For Nk = 4 or 6, if i is not a multiple of Nk
  - w[i] = w[i-1] ⊕ w[i-Nk]
- If i is a multiple of Nk
  - w[i] = transformation1(w[i-1]) ⊕ w[i-Nk]
  - Transformation 1 consists of RotWord(), followed by SubWord(), followed by XOR with Rcon[i/Nk]
- If Nk = 8 and i-4 is a multiple of Nk
  - w[i] = transformation2(w[i-1]) ⊕ w[i-Nk]
  - Transformation 2 consists of SubWord() only
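The recurrence above can be sketched as a software key schedule. This is a reference model under stated assumptions, not the on-the-fly hardware structure of the talk: words are 32-bit big-endian integers, the S-box is generated from the FIPS-197 definition, and the round constant is tracked as a running power of {02}. All names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def sbox_entry(s):
    """S-box value: inverse (as s^254) followed by the FIPS-197 affine map."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, s)
    out = 0
    for i in range(8):
        bit = (0x63 >> i) & 1
        for offset in (0, 4, 5, 6, 7):
            bit ^= (inv >> ((i + offset) % 8)) & 1
        out |= bit << i
    return out


SBOX = [sbox_entry(s) for s in range(256)]


def sub_word(w):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def rot_word(w):
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def expand_key(key):
    """Return the key schedule words for a 16-, 24-, or 32-byte key."""
    nk = len(key) // 4
    nr = nk + 6                                   # number of rounds
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 0x01
    for i in range(nk, 4 * (nr + 1)):
        temp = w[i - 1]
        if i % nk == 0:                           # transformation 1
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 0x02)
        elif nk == 8 and i % nk == 4:             # transformation 2
            temp = sub_word(temp)
        w.append(temp ^ w[i - nk])
    return w


words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
print(hex(words[4]))
```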

24 Key Expansion Structure: On-the-Fly
[Figure: on-the-fly key expansion structure; signal labels w(i)/w(i+4), w(i+1)/w(i+5), w(i+2)/w(i+6), w(i+3)/w(i+7)]

25 Four Transformations in Inverse Cipher
- InvSubBytes( ): ISB
  - Nonlinear byte substitution
- InvShiftRows( ): ISR
  - Cyclically right-shift the last three rows of the state
- InvMixColumns( ): IMC
  - Transformation on each column of the state
- AddRoundKey( ): ARK
  - Each column is XORed with a 32-bit key schedule word generated by the key expansion

26 InvSubBytes( )
- Apply the inverse affine (IAF) transformation over GF(2): S^{-1} = M^{-1}(S' + C)
- Take the multiplicative inverse (MI) in GF(2^8): S^{-1} -> S
- Overall effect: S^{-1}-box
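The same MI logic serves both directions; only the affine stage changes. The sketch below follows the slide's form S^{-1} = M^{-1}(S' + C): it first adds C = {63}, then applies M^{-1}. The rows of M^{-1} are not in this transcript, so the code assumes the FIPS-197 inverse map, in which output bit i is the XOR of input bits i+2, i+5, and i+7 (indices modulo 8). Names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def gf_inv(s):
    """MI step: s^254, with {00} mapped to {00}."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, s)
    return result


def inv_affine(s):
    """IAF step: M^{-1}(S' + C) with C = {63}."""
    t = s ^ 0x63
    out = 0
    for i in range(8):
        bit = 0
        for offset in (2, 5, 7):
            bit ^= (t >> ((i + offset) % 8)) & 1
        out |= bit << i
    return out


def inv_sub_byte(s):
    """InvSubBytes on one byte: IAF followed by MI."""
    return gf_inv(inv_affine(s))


print(hex(inv_sub_byte(0xED)))
```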

27 InvShiftRows  Cyclically right-shift the last three rows of the state.

28 InvMixColumns( )
- Polynomial multiplication by the fixed polynomial a^{-1}(x) = {0b}x^3 + {0d}x^2 + {09}x + {0e} modulo x^4 + 1
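In matrix form, a^{-1}(x) becomes a circulant matrix whose first row is {0e} {0b} {0d} {09}. The sketch below applies that matrix to one column with a general GF(2^8) multiplier; it is a plain reference model (names hypothetical) and makes no attempt at the sharing optimizations discussed later.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


INV_MIX_FIRST_ROW = [0x0E, 0x0B, 0x0D, 0x09]


def inv_mix_column(col):
    """Apply InvMixColumns to one 4-byte column of the state."""
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            # Row i of the circulant matrix is the first row rotated right by i.
            acc ^= gf_mul(INV_MIX_FIRST_ROW[(j - i) % 4], col[j])
        out.append(acc)
    return out


print([hex(b) for b in inv_mix_column([0x8E, 0x4D, 0xA1, 0xBC])])
```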

29 Previous AES Design Methods

30 Optimization Approaches for AES Transformations

31 Three Categories of Transformation Optimization
- The optimization of separate transformations.
- The optimization of combined round transformations.
- The optimization of integrated encryption/decryption transformations.

32 The Optimization of Separate Transformations (1)
- Two major transformations:
  - SB (ISB), MC (IMC)
- SB (ISB):
  - Performs MI (multiplicative inverse) in GF(2^8) followed by AF.
  - 1. Use a 256x8-bit table look-up ROM (S-box) to store all precalculated results.
  - 2. Change the calculation of MI in GF(2^8) to one in the composite field GF((2^4)^2).
  - 3. Change the calculation of MI in GF(2^8) to one in the composite field GF(((2^2)^2)^2).
  - 4. Calculate MI in GF(2^8) based on the matrix decomposition of A^{-1}.

33 Calculation of Multiplicative Inverse (MI) in GF((2^4)^2) (1.2a)
- The calculation of MI in GF((2^4)^2) has three stages.

34 Calculation of Multiplicative Inverse (MI) in GF((2^4)^2) (1.2b)
- Stage 1:
  - Translate from GF(2^8) to the composite field GF((2^4)^2).
  - The implementation of the T transformation has area = 17 A_XOR and delay = 3 T_XOR.

35 Calculation of Multiplicative Inverse (MI) in GF((2^4)^2) (1.2c)
- Stage 2:
  - Find the MI of the two numbers in GF(2^4), where A = (0001)_2 and B = (1001)_2

36 Calculation of Multiplicative Inverse (MI) in GF((2^4)^2) (1.2d)
- Stage 3:
  - Convert the number in GF((2^4)^2) back to a number in GF(2^8) using T^{-1}.
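Stage 2 can be illustrated with a small model of inversion in the composite field. The field polynomials are not reproduced in this transcript, so the sketch rests on two assumptions: GF(2^4) is built modulo x^4 + x + 1, and GF((2^4)^2) is built modulo y^2 + A·y + B with the slide's constants A = (0001)_2 and B = (1001)_2. An element is a pair (h, l) standing for h·y + l. The isomorphisms T and T^{-1} of Stages 1 and 3 are not modeled, and all names are hypothetical.

```python
B = 0b1001  # constant term of the assumed field polynomial y^2 + y + B


def gf16_mul(a, b):
    """Multiply two nibbles in GF(2^4) modulo x^4 + x + 1 (assumed)."""
    product = 0
    for _ in range(4):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x10:
            a ^= 0x13
    return product


def gf16_inv(a):
    """Inverse in GF(2^4) as a^14; 0 maps to 0."""
    result = 1
    for _ in range(14):
        result = gf16_mul(result, a)
    return result


def composite_mul(x, y):
    """Multiply (xh·y + xl)(yh·y + yl) using y^2 = y + B."""
    xh, xl = x
    yh, yl = y
    hh = gf16_mul(xh, yh)
    high = hh ^ gf16_mul(xh, yl) ^ gf16_mul(xl, yh)
    low = gf16_mul(hh, B) ^ gf16_mul(xl, yl)
    return (high, low)


def composite_inv(x):
    """Invert h·y + l with a single GF(2^4) inversion of delta."""
    h, l = x
    delta = gf16_mul(gf16_mul(h, h), B) ^ gf16_mul(h, l) ^ gf16_mul(l, l)
    d_inv = gf16_inv(delta)
    return (gf16_mul(h, d_inv), gf16_mul(h ^ l, d_inv))


print(composite_mul((0x3, 0xA), composite_inv((0x3, 0xA))))
```

The point of the composite field is visible in composite_inv: the 8-bit inversion reduces to a handful of 4-bit multiplications and one 4-bit inversion.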

37 Calculation of Multiplicative Inverse (MI) Using A^{-1} (1.4)
- A^{-1}:
  - The A^{-1} (MI) can be calculated by: (equation omitted in transcript)
  - It requires four GF(2^8) multipliers, plus one A^2 and three A^4 components.

38 The Optimization of Separate Transformations (2)
- MC (IMC):
  - 1. Byte-level optimization: the multiplication block (XTime) multiplies a byte by the constant {02}_16; the number of XTime blocks is then reduced by different byte-level sharing methods.
  - Ex1 (MC): D'' = {01}A + {01}B + {02}D + {03}E = A + B + XTime(D) + XTime(E) + E
  - Ex2 (MC): D'' = {02}(D+E) + (A+B+D+E) + D, using {02}D = {02}D + D + D and D + D = 0
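The two example forms can be compared directly in software. The sketch below writes each as a function of the four input bytes (function names are hypothetical): Ex1 uses two XTime blocks, while Ex2 uses one, because XTime is linear and the sum D+E is shared between the {02} term and the plain XOR term.

```python
def xtime(b):
    """XTime block: multiply a GF(2^8) byte by {02}."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def mc_byte_ex1(a, b, d, e):
    """Ex1: A + B + XTime(D) + XTime(E) + E (two XTime blocks)."""
    return a ^ b ^ xtime(d) ^ xtime(e) ^ e


def mc_byte_ex2(a, b, d, e):
    """Ex2: {02}(D+E) + (A+B+D+E) + D (one XTime block, D+E shared)."""
    de = d ^ e
    return xtime(de) ^ (a ^ b ^ de) ^ d


# First output byte of MixColumns for the column (db, 13, 53, 45).
print(hex(mc_byte_ex1(0x53, 0x45, 0xDB, 0x13)), hex(mc_byte_ex2(0x53, 0x45, 0xDB, 0x13)))
```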

39 The Optimization of Separate Transformations (3)
  - 2. Bit-level optimization: the common sub-expression elimination (CSE) algorithm extracts as many common factors as possible to further reduce the hardware cost.
  - Ex (bits listed from bit 7 down to bit 0):
    {02}A = {a_6, a_5, a_4, a_3+a_7, a_2+a_7, a_1, a_0+a_7, a_7}
    {03}A = {a_6+a_7, a_5+a_6, a_4+a_5, a_3+a_4+a_7, a_2+a_3+a_7, a_1+a_2, a_0+a_1+a_7, a_0+a_7}
  - The factor a_0+a_7 appears in bit 1 of {02}A and in bits 0 and 1 of {03}A; it can be extracted and replaced with a_8 = (a_0+a_7).
  - The factor a_3+a_7 appears in bit 4 of {02}A and in bits 3 and 4 of {03}A; it can also be extracted and replaced with a_9 = (a_3+a_7).
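The effect of extracting a_8 and a_9 can be checked with a bit-level model. The sketch below computes {02}A and {03}A from the expressions on this slide, building each shared factor once and reusing it, then compares the result against a byte-level reference. Function names are hypothetical, and Python lists stand in for wires.

```python
def xtime(b):
    """Byte-level reference for {02}A."""
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def mul02_mul03_shared(a):
    """Return ({02}A, {03}A) using the shared factors a8 and a9."""
    bit = [(a >> i) & 1 for i in range(8)]      # bit[i] = a_i
    a8 = bit[0] ^ bit[7]                        # shared factor a_0 + a_7
    a9 = bit[3] ^ bit[7]                        # shared factor a_3 + a_7
    # Lists are indexed by output bit position 0..7.
    two = [bit[7], a8, bit[1], bit[2] ^ bit[7], a9, bit[4], bit[5], bit[6]]
    three = [a8, a8 ^ bit[1], bit[1] ^ bit[2], bit[2] ^ a9,
             bit[4] ^ a9, bit[4] ^ bit[5], bit[5] ^ bit[6], bit[6] ^ bit[7]]
    pack = lambda bits: sum(b << i for i, b in enumerate(bits))
    return pack(two), pack(three)


print([hex(v) for v in mul02_mul03_shared(0x57)])
```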

40 The Optimization of Combined Round Transformations (1)
- Combine SB, SR, and MC in encryption, or ISB, ISR, and IMC in decryption.
  - 1. Table-lookup ROM (T-box or T^{-1}-box):

41 The Optimization of Combined Round Transformations (2)
  - 2. Combined IMC/ISR/IAF and AF/SR/MC with shared MI in GF((2^4)^2):
    (a) Combined AF/SR/MC (b) Combined IMC/ISR/IAF
    [Figure: integration of AES encryption and decryption with shared MI in GF((2^4)^2)]

42 The Optimization of Integrated Encryption/Decryption Transformations (1)
- Two major integrations:
  - Integration of SB and ISB, and integration of MC and IMC.
- SB/ISB:
  - Share the same MI logic in GF(2^8) but multiplex the AF and IAF.

43 The Optimization of Integrated Encryption/Decryption Transformations (2)
- MC/IMC:
  - 1. Share the common factor, the XTime block, for constructing one output byte of MC and IMC, as shown in the following figure.
  - 2. Decompose the constant matrix of IMC as IMC = MC x C, where C is a constant matrix as shown in the following equation.
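The equation for C did not survive in this transcript. A well-known matrix that satisfies IMC = MC x C is the circulant with first row {05} {00} {04} {00}; the sketch below assumes that this is the intended C and verifies the product numerically. Function names are hypothetical.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    while b:
        if b & 1:
            product ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return product


def circulant(first_row):
    """Build a 4x4 circulant matrix; row i is the first row rotated right by i."""
    return [[first_row[(j - i) % 4] for j in range(4)] for i in range(4)]


def mat_mul(x, y):
    """4x4 matrix product over GF(2^8)."""
    result = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            for k in range(4):
                result[i][j] ^= gf_mul(x[i][k], y[k][j])
    return result


MC = circulant([0x02, 0x03, 0x01, 0x01])
IMC = circulant([0x0E, 0x0B, 0x0D, 0x09])
C = circulant([0x05, 0x00, 0x04, 0x00])   # assumed form of the constant matrix

print(mat_mul(MC, C) == IMC)
```

Under this assumption the decryption path can reuse the MC logic and add only the multiplications by {04} and {05}, each of which needs two chained XTime operations.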

44 The Optimization of Integrated Encryption/Decryption Transformations (3)
  - 3. Decompose IMC = MC + F + G, where F and G are two constant matrix multiplications.

45 Our Proposed Substructure Sharing Methods for XOR-based Operations
- Bit-level Expressions of AES Transformations
- Proposed Method: Bit-level Substructure Sharing

46 Bit-level Expressions of AES Transformations

47 Bit-level Expressions of AES Transformations
- The two major kinds of transformations, SB (ISB) and MC (IMC), occupy about 65% of the total area cost of an AES implementation.
- They can be expressed as bit-level XOR-based sum-of-product (SoP) operations.
  - SB: Out_SB = MI + AF
  - ISB: Out_ISB = IAF + MI
  - MI: GF((2^4)^2), GF(((2^2)^2)^2)
  - MC: Out_MC = {01}A + {01}B + {02}D + {03}E (1-byte output)
  - IMC: Out_IMC = {0d}A + {09}B + {0e}D + {0b}E (1-byte output)

48 Two Proposed CSE Algorithms for Sum-of-Product Operations
- Bit-level SoP Expressions
- Proposed Method III: Vertical CSE Algorithm
- Proposed Method IV: Horizontal CSE Algorithm

49 Bit-level Expressions (1)
- A group of P bit-level equations (z_0, z_1, ..., z_{P-1}) with M_0 primary input variables (a_0, a_1, ..., a_{M_0-1}) and N_0 product terms (w_0, w_1, ..., w_{N_0-1}) can be expressed in the following matrix product form:

50 Bit-level Expressions (2)
- The N_0 intermediate bit variables w_i can be expressed as products of the primary input variables, where · denotes the bitwise AND operation.

51 Proposed Method III: Vertical CSE Algorithm (1)
- Vertical Optimization Algorithm
- Input: A set of bit-level equations consisting of SoPs.
- Output: A set of modified equations with the extracted common factors (C.F.).
- Stage 1: Extract the multi-term common factors of AND-based operations by using the following four selection rules:
  - Rule 1: Find the multi-term C.F. with the highest occurrence count between two bit-level equations.
  - Rule 2: Select the index pair that has the least correlation with the other found index pairs.
  - Rule 3: Find the pair that has the smallest number of terms in the two bit-level equations.
  - Rule 4: Find the C.F. that leads to the largest C.F. in the next iteration.
- Stage 2: Extract the multi-term common factors of XOR-based operations by using the above selection rules.

52 Proposed Method IV: Horizontal CSE Algorithm (1)
- Horizontal Optimization Algorithm
- Input: A set of bit-level equations consisting of SoPs.
- Output: A set of modified equations with the extracted common factors (C.F.).
- Stage 1: Extract the two-term C.F. of AND-based operations by using the following four selection rules:
  - Rule 1: Find the two-term C.F. with the highest occurrence count across all bit-level equations.
  - Rule 2: Select the index pair that has the least correlation with the other found index pairs.
  - Rule 3: Choose the column vector pair so that the affected rows in the coefficient matrix have the smallest number of accumulated 1's.
  - Rule 4: Find the C.F. that leads to the largest C.F. in the next iteration.
- Stage 2: Extract the two-term common factors of XOR-based operations by using the above selection rules.

53 Proposed Method IV: Horizontal CSE Algorithm (2)
- Example for AF:
  - The original AF consists of 32 XOR and 4 INV operations with a delay of (3 XOR + 1 INV).

54 Proposed Method IV: Horizontal CSE Algorithm (3)
- Perform Stage 2:
  - Rule 1: Find the 9 two-term common factors that share the highest occurrence count across the 8 equations.
  - S_max = 4: (a_0,a_1), (a_0,a_7), (a_1,a_2), (a_2,a_3), (a_3,a_4), (a_4,a_5), (a_5,a_6), (a_5,a_8), (a_6,a_7), where a_8 denotes the constant 1.
  - Ex: S_h(a_0,a_1) = (w_1, w_2, w_3, w_4)
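The Rule 1 result above can be reproduced by counting pairs. The AF equations are not reproduced in this transcript, so the sketch below assumes the FIPS-197 form, in which output bit i is the XOR of input bits i, i+4, i+5, i+6, and i+7 (indices modulo 8), plus the constant 1 (written a_8) wherever {63} has a 1 bit. Each equation is stored as a set of variable indices; names are hypothetical. This is only the counting step of Rule 1, not the authors' full algorithm.

```python
from itertools import combinations

# Equation i as the set of variable indices XORed together; 8 stands for a_8 = 1.
AF_EQUATIONS = [
    {(i + offset) % 8 for offset in (0, 4, 5, 6, 7)}
    | ({8} if (0x63 >> i) & 1 else set())
    for i in range(8)
]


def pair_rows(equations):
    """Map each two-term factor to the list of equations that contain it."""
    rows = {}
    for index, terms in enumerate(equations):
        for pair in combinations(sorted(terms), 2):
            rows.setdefault(pair, []).append(index)
    return rows


rows = pair_rows(AF_EQUATIONS)
s_max = max(len(r) for r in rows.values())
best_pairs = sorted(pair for pair, r in rows.items() if len(r) == s_max)

print(s_max, best_pairs)
print(rows[(0, 1)])   # equations containing the factor (a_0, a_1)
```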

55 Proposed Method IV: Horizontal CSE Algorithm (4)
  - Rule 2: Of the two-term common factors selected by Rule 1, 7 share the same least correlation.
  - U_max = 6: (a_0,a_1), (a_0,a_7), (a_1,a_2), (a_2,a_3), (a_3,a_4), (a_5,a_8), (a_6,a_7)
  - Ex: U_h(a_0,a_1) = {(a_2,a_3), (a_3,a_4), (a_4,a_5), (a_5,a_6), (a_5,a_8), (a_6,a_7)}
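The transcript does not define the correlation measure. Judging from the example U_h(a_0,a_1), one reading is that U_h of a pair collects the other candidate pairs that share no variable with it, so that "least correlation" means the largest such set. The sketch below applies that assumed reading to the 9 pairs from Rule 1; names are hypothetical.

```python
# The 9 candidate pairs found by Rule 1 (index 8 stands for a_8 = 1).
RULE1_PAIRS = [(0, 1), (0, 7), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (5, 8), (6, 7)]


def uncorrelated(pair, candidates):
    """Other candidates that share no variable with `pair` (assumed meaning of U_h)."""
    return [q for q in candidates if q != pair and not set(q) & set(pair)]


u_size = {p: len(uncorrelated(p, RULE1_PAIRS)) for p in RULE1_PAIRS}
u_max = max(u_size.values())
selected = [p for p in RULE1_PAIRS if u_size[p] == u_max]

print(u_max, selected)
print(uncorrelated((0, 1), RULE1_PAIRS))
```

Under this reading, (a_4,a_5) and (a_5,a_6) drop out because each overlaps with more of the other candidates.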

56 Proposed Method IV: Horizontal CSE Algorithm (5)
  - Rule 3: Of the two-term common factors selected by Rule 2, (a_5,a_8) has the smallest number of terms in the two bit-level equations.
  - V_h = 10: (a_0,a_1), (a_0,a_7), (a_1,a_2), (a_2,a_3), (a_3,a_4), (a_6,a_7)
  - V_min = 9: (a_5,a_8)
  - Ex: V_h(a_0,a_1) = 5 + 5 = 10

57 Proposed Method IV: Horizontal CSE Algorithm (6)
- Continue to perform Stage 2:
  - Another 7 two-term common factors are extracted before Stage 2 finishes.
  - C.F.: (a_3,a_4), (a_6,a_7), (a_0,a_1), (a_2,X_2), (a_2,X_4), (X_1,X_5), (X_1,X_3).
  - The final results after the Horizontal CSE stages can be expressed as:
- The optimized AF saves 14 XOR and 3 INV operations while keeping the same path delay as the original AF.

58 Comparisons and Implementations
- Comparison of SB/ISB and MC/IMC Implementations
- Overall AES System Implementations

59 Comparison of SB/ISB and MC/IMC Implementations

60 Comparison of SB/ISB Implementations

61 Architecture-level Performance Comparisons of 32-bit Integrated SB/ISB Realization (1)

62 Architecture-level Performance Comparisons of 32-bit Integrated SB/ISB Realization (2)
- The designs that use our proposed CSE algorithms have a smaller area cost than the designs that do not.

63 Gate-level Based Synthesis Results of 32-bit Integrated SB/ISB Implementations (1)

64 Gate-level Based Synthesis Results of 32-bit Integrated SB/ISB Implementations (2)
- The gate-level synthesis results in this table are obtained with Synopsys Design Compiler (DC) under the same area-optimized constraints, based on the Artisan UMC 0.18um cell library.
- The table shows that our proposed CSE reduces the area cost further, even after the area optimization performed by Synopsys DC.

65 Performance Comparison of Various Designs for Separate MC and IMC Modules with 128-bit Output Generated in Parallel

66 Performance Analysis of the Combined MC/IMC Modules with 128-bit Output Generated in Parallel
- In the design of the MC/IMC modules, our Methods III and IV also achieve better area and speed than [Zhang 2004], the smallest MC/IMC design known so far.

67 Overall AES System Implementations
- Three AES Architectures
- Experimental Results of AES Implementations

68 Three AES Architectures
- AES Architecture (A)
  - The iterative architecture (A) merges the encryption and direct decryption processes.
- AES Architecture (B)
  - The iterative architecture (B) merges the encryption and direct decryption processes and combines AF/SR/MC and IMC/ISR/IAF into a single function unit.
- AES Architecture (C)
  - The iterative architecture (C) merges the encryption and modified decryption processes.

69 AES Architecture (A)

70 Architecture-level Performance Analysis of the SoP based Transformations in AES Architecture (A) of Fig. 5.1

71 Synthesis Results of all the Components in the AES Processor Using Synopsys Design Compiler with Artisan UMC 0.18um library in AES Architecture (A) of Fig. 5.1

72 AES Architecture (B)

73 Architecture-level Performance Analysis of the SoP based Transformations in AES Architecture (B) of Fig. 5.2

74 Synthesis Results of all the Components in the AES Processor Using Synopsys Design Compiler with Artisan UMC 0.18um library in AES Architecture (B) of Fig. 5.2

75 AES Architecture (C)

76 Architecture-level Performance Analysis of the SoP based Transformations in AES Architecture (C) of Fig. 5.3

77 Synthesis Results of all the Components in the AES Processor Using Synopsys Design Compiler with Artisan UMC 0.18um library in AES Architecture (C) of Fig. 5.3

78 Experimental Results of AES Implementations (1)
- In the architecture-level analysis:
  - The designs using our Methods III and IV achieve area reduction rates ranging from 40.7% to 47.0% compared with the direct realization.
- In the gate-level implementation:
  - The designs using our Methods III and IV achieve area reduction rates ranging from 12.2% to 18.6% compared with the direct realization synthesized with the same Synopsys optimization commands.

79 Comparison of Different AES Processor Implementations

80 Conclusions and Future Works

81 Conclusions and Future Works
- Conclusions:
  - Four new common-subexpression-elimination (CSE) optimization methods are proposed to reduce the area cost of memory-free AES implementations.
  - Methods I and II are employed to reduce the area cost of the MC and IMC transformations.
  - Methods III and IV are employed to optimize the transformations, including MC (IMC) and SB (ISB), that can be expressed in sum-of-product (SoP) form in three different AES architectures.
  - The major contribution of this dissertation is the area cost reduction of AES processor designs based on the proposed optimization methods for bit-level SoP expressions.
- Future Works:
  - One direction is to reduce the power consumption of area-optimized AES processors by balancing the dynamic hazards caused by propagation delays.

82 Comparison of Different AES Processors

83 128-bit AES Chip Information & Layout Graph

84 8-bit AES Layout

85 New Approaches Developed by AIC Lab.
- Design of 32-bit, 16-bit, and 8-bit AES processors.
- Implementation of a power evaluation device based on embedded systems.
- Implementation of a fall warning system.

86 Demo of the AES Processor
- Encryption and decryption of JPEG images.

87 Demo of the Warning System  Zigbee Network  Accelerator  GPS
