TIE Extensions for Cryptographic Acceleration Charles-Henri Gros Alan Keefer Ankur Singla.

Agenda
- Introduction
- Survey of Existing Architectures
- Xtensa+ Crypto Processor
  - Rijndael Algorithm (AES final selection)
  - RC6, IDEA, and DES
- Performance
- Trade-off Analysis
- Conclusion

Introduction
- Commercial networking applications require flexible, high-throughput secure connectivity
- Encryption and decryption algorithms are computation-intensive
- Multi-session applications place a significant load on embedded processors
- Embedded systems need performance while optimizing power and area
- Our study: a survey of existing architectures, an analysis of Xtensa as an alternative, and a performance and trade-off analysis for embedded use

Survey of Existing Architectures
- Three categories
  - Specialized Crypto Processors
  - Reconfigurable Architectures
  - Full Hardware Implementation (ASICs/FPGAs)
- High variation in architecture complexity
- Performance vs. area trade-off
- Suitability for embedded applications

Specialized Crypto Processors
- A few VLIW architectures, e.g. CryptoManiac
- Instruction combining: instruction words are combined to exploit ILP
- Crypto arithmetic unit(s): multiple XORs, GF multiplication/addition, lookup-table substitution, and permutation
- Coarse configurability of the datapath
- Mostly lacking SIMD support
- Performance is typically 2x to 6x that of general-purpose processors

Reconfigurable Architectures
- Numerous reconfigurable processor architectures: PipeRench, MorphoSys, COBRA, and GARP
- Functional units that provide all crypto arithmetic: multiple XORs, GF multiplication/addition, modulo multiplication
- Reconfigurable interconnection network to change functional-unit connectivity dynamically
  - VLIW instructions
  - Reconfiguration registers
- Suitable for block ciphers
- High variability in the performance gain relative to general-purpose processors

Full Hardware Implementation
- High-performance implementations targeted to ASICs/FPGAs
  - DES: 12 Gbps on Virtex-E XCV300E
  - AES: 18 Gbps on an ASIC using a TSMC 0.18 µm process
- Lacking flexibility and crypto modes
- Memory and area efficient
- Typical latency only in DMA of data to the hardware unit
- Need an additional processor for the control path

Xtensa+ Crypto Architecture
- Custom extensions to the Xtensa processor using the TIE framework
- Addition of a generic key schedule register file and instructions to support all crypto algorithms studied
- Addition of multiple on-chip SRAMs (in addition to the 4 Data-RAMs) to the Xtensa processor
  - Currently implemented using the Table construct in TIE
  - Hacked the Verilog code generated by the TIE compiler to instantiate multiple RAM models (implemented as multi-dimensional arrays) for viability analysis
- Addition of 4 state registers and 4 next-state registers generic to all algorithms studied
- Possible future extensions include multi-session key storage and fast retrieval support

AES Overview
- AES (Advanced Encryption Standard) is the standard set to replace DES for both government and private-sector encryption
- Uses a fixed block size of 128 bits, with key sizes of 128, 192, or 256 bits
- Designed to be efficient in both hardware and software across a variety of platforms
- 10, 12, or 14 rounds depending on key size
- A 128-bit round key is used for each round
  - Can be pre-computed and cached for future encryptions

AES Implementation Abstraction
- Each round consists of a lookup, a byte-level permutation, a finite-field multiplication, and a key XOR
- Lookup and multiplication can be combined into four separate 8x32 lookup tables (8-bit index, 32-bit entry), so each round is 16 lookups and 16 XORs
- Decryption is essentially the same, but with different tables and a different key schedule

TIE Implementation
- Our implementation does all 16 lookups in parallel, requiring 16 SRAMs
- x0, x1, x2, x3 represent the round state (each 32 bits), k0, k1, k2, k3 are the current round key, and Tij are the T-boxes, where i is a duplication index and j is the T-box index
- Each round is then (each table index is the low 8 bits of the shifted word):
    x0 = T00[x0] ^ T01[x1>>8] ^ T02[x2>>16] ^ T03[x3>>24] ^ k0
    x1 = T10[x1] ^ T11[x2>>8] ^ T12[x3>>16] ^ T13[x0>>24] ^ k1
    x2 = T20[x2] ^ T21[x3>>8] ^ T22[x0>>16] ^ T23[x1>>24] ^ k2
    x3 = T30[x3] ^ T31[x0>>8] ^ T32[x1>>16] ^ T33[x2>>24] ^ k3
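
The round equations above can be modeled in software as a sanity check. The following Python sketch is not the TIE implementation; it assumes the usual AES S-box and MixColumns definitions, packs each state column into a 32-bit word with row 0 in the low byte (which is what the shift amounts on the slide suggest), and uses one copy of each T-box rather than the four duplicated copies held in the 16 SRAMs. All function names are illustrative.

```python
MIX = [[2, 3, 1, 1], [1, 2, 3, 1], [1, 1, 2, 3], [3, 1, 1, 2]]  # MixColumns matrix

def gmul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return r

def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF

def build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = []
    for a in range(256):
        inv = next((b for b in range(1, 256) if gmul(a, b) == 1), 0)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox

SBOX = build_sbox()

def build_tboxes():
    """T[j][a]: 32-bit contribution of a row-j input byte a to an output column."""
    return [[sum(gmul(MIX[r][j], SBOX[a]) << (8 * r) for r in range(4))
             for a in range(256)] for j in range(4)]

T = build_tboxes()

def round_tbox(x, k):
    """One AES round as 16 table lookups and 16 XORs."""
    return [T[0][x[c] & 0xFF]
            ^ T[1][(x[(c + 1) % 4] >> 8) & 0xFF]
            ^ T[2][(x[(c + 2) % 4] >> 16) & 0xFF]
            ^ T[3][(x[(c + 3) % 4] >> 24) & 0xFF]
            ^ k[c] for c in range(4)]

def round_reference(x, k):
    """The same round done step by step: SubBytes, ShiftRows, MixColumns, key XOR."""
    b = [[(x[c] >> (8 * r)) & 0xFF for c in range(4)] for r in range(4)]
    sr = [[SBOX[b[r][(c + r) % 4]] for c in range(4)] for r in range(4)]
    out = []
    for c in range(4):
        word = 0
        for r in range(4):
            byte = 0
            for j in range(4):
                byte ^= gmul(MIX[r][j], sr[j][c])
            word |= byte << (8 * r)
        out.append(word ^ k[c])
    return out

state = [0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF]  # arbitrary test words
key = [0x01234567, 0x89ABCDEF, 0x0F1E2D3C, 0x4B5A6978]
print(round_tbox(state, key) == round_reference(state, key))
```

In hardware the 16 lookups can all be issued in the same cycle only if each one has its own memory port, which is why the slide counts 16 SRAMs for four distinct tables.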

Other Ciphers Implemented
- DES (Data Encryption Standard)
  - 64-bit block, 56-bit key, 16 rounds, Feistel network
  - 8 6x4 S-boxes, XORs, and bit-level permutations
  - Can’t really be done efficiently in software
  - TIE implementation required 1 instruction per round
- IDEA (International Data Encryption Algorithm)
  - 64-bit block, 128-bit key, 8 rounds, iterated, operates on 16-bit numbers
  - 4 multiplications mod 2^16 + 1, 4 additions mod 2^16, 6 XORs
  - Each round is highly sequential, so difficult to parallelize
  - TIE implementation required 7 instructions per round
- RC6
  - Same block and key modes as AES, 20 rounds, iterated
  - Multiplication mod 2^32, XORs, rotations, addition mod 2^32
  - TIE implementation required 2 instructions per round
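
To make the arithmetic named above concrete, here is a small Python sketch of the IDEA multiplication and of one RC6 round, following the published cipher definitions rather than the TIE instructions from this work. The function names and the round-key arguments s0 and s1 are illustrative; the inverse round is included only so the sketch can check itself.

```python
MASK32 = 0xFFFFFFFF

def idea_mul(a, b):
    """IDEA multiplication mod 2**16 + 1; the 16-bit word 0 stands for 2**16."""
    a = a or 0x10000
    b = b or 0x10000
    return (a * b % 0x10001) & 0xFFFF  # a result of 2**16 is stored as 0

def rotl32(x, n):
    n &= 31  # only the low 5 bits of the rotation amount are used
    return ((x << n) | (x >> (32 - n))) & MASK32

def rotr32(x, n):
    return rotl32(x, 32 - (n & 31))

def rc6_round(a, b, c, d, s0, s1):
    """One RC6 round on four 32-bit words with two round-key words."""
    t = rotl32((b * (2 * b + 1)) & MASK32, 5)
    u = rotl32((d * (2 * d + 1)) & MASK32, 5)
    a = (rotl32(a ^ t, u) + s0) & MASK32
    c = (rotl32(c ^ u, t) + s1) & MASK32
    return b, c, d, a

def rc6_round_inv(a, b, c, d, s0, s1):
    """Inverse of rc6_round, used here only as a self-check."""
    a, b, c, d = d, a, b, c
    u = rotl32((d * (2 * d + 1)) & MASK32, 5)
    t = rotl32((b * (2 * b + 1)) & MASK32, 5)
    c = rotr32((c - s1) & MASK32, t) ^ u
    a = rotr32((a - s0) & MASK32, u) ^ t
    return a, b, c, d

words = (0x01234567, 0x89ABCDEF, 0xDEADBEEF, 0x0BADF00D)  # arbitrary test words
out = rc6_round(*words, 0x11111111, 0x22222222)
print(rc6_round_inv(*out, 0x11111111, 0x22222222) == words)
```

The two halves of an RC6 round (the t and u computations) are independent of each other, which fits the reported 2 instructions per round, whereas each IDEA multiplication feeds the next operation, consistent with the note that IDEA rounds are hard to parallelize.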

AES Performance in Xtensa+
- Performance of the TIE extensions approaches that of non-pipelined ASICs
- Total of 31 run-time instructions per data block
  - 1 initial XOR instruction
  - 1 instruction per round (10 total)
  - 20 cycles for load and store of the 128-bit data block
- Generally an order of magnitude better than pure software
- Also faster than reconfigurable hardware or a specialized VLIW processor

Mbps of Throughput

        Base    VLIW    TIE     ASIC     Reconfig.
AES     43.7    512     984     18000    594
DES     26.5    240     586     15000    53.3
IDEA    28      200     231     2034     1013
RC6     61      368     508     15200    470

Cycles Per Block

        Base    VLIW    TIE     ASIC
AES     838     90      31      10
DES     690     112     26      16
IDEA    653     112     66      9
RC6     600     140     60      9
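
The throughput and cycle-count tables are linked by block size and clock rate: throughput = block bits x clock / cycles per block. The slides do not state the clock rate, so the Python sketch below only derives the clock implied by the TIE columns of the two tables (block sizes come from the cipher descriptions earlier in the deck). The dictionary and function names are illustrative.

```python
# Values read from the TIE columns of the two tables in these slides.
BLOCK_BITS = {"AES": 128, "DES": 64, "IDEA": 64, "RC6": 128}
TIE_MBPS = {"AES": 984, "DES": 586, "IDEA": 231, "RC6": 508}
TIE_CYCLES = {"AES": 31, "DES": 26, "IDEA": 66, "RC6": 60}

def implied_clock_mhz(mbps, cycles, bits):
    """Clock rate (MHz) that would turn `cycles` per block into `mbps`."""
    return mbps * cycles / bits

def throughput_mbps(clock_mhz, cycles, bits):
    """Throughput (Mbps) for a given clock rate and cycles per block."""
    return clock_mhz * bits / cycles

for name in BLOCK_BITS:
    clock = implied_clock_mhz(TIE_MBPS[name], TIE_CYCLES[name], BLOCK_BITS[name])
    print(f"{name}: implied clock {clock:.1f} MHz")
```

All four ciphers come out at roughly 238 MHz, which suggests the TIE throughput figures were computed from the cycle counts at a single clock rate.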

Design Tradeoffs
- Flexibility
  - Algorithm changes
  - New algorithms
  - New encryption modes
  - Implementation bugs
- Time to Market
  - Closer to software development time
  - Can choose which parts to accelerate

Power vs. Performance: Mbps/mW

        Base    VLIW    TIE     ASIC     Rec.
AES     0.36    1.15    5.63    30       0.66
DES     0.22    0.54    4.19    59.13    0.08
IDEA    0.23    0.62    2.13    15.8     2.89
RC6     0.51    1.37    4.69    14.12    1.35

Conclusion
- Xtensa instructions provide flexibility, performance, and Mbps/mW all somewhere between an ASIC and a VLIW or software-based solution
- Suitable for most embedded applications, such as 802.11i
- Using Xtensa for cryptography is a good choice if:
  - You don’t need absolute throughput
  - You don’t need absolute flexibility
  - You need a control processor anyway
  - The algorithms needed are known ahead of time
