TIE Extensions for Cryptographic Acceleration
Charles-Henri Gros, Alan Keefer, Ankur Singla
Agenda
- Introduction
- Survey of Existing Architectures
- Xtensa+ Crypto Processor
- Rijndael Algorithm (AES final selection)
- RC6, IDEA, and DES
- Performance
- Trade-off Analysis
- Conclusion
Introduction
- Commercial networking applications require flexible, high-throughput secure connectivity
- Encryption and decryption algorithms are computation-intensive
- Multi-session applications place a significant load on embedded processors
- Embedded systems need performance while optimizing power and area
- Our study: existing architectures, analysis of Xtensa as an alternative, performance analysis, and trade-offs for embedded systems
Survey of Existing Architectures
- Three categories
  - Specialized Crypto Processors
  - Reconfigurable Architectures
  - Full Hardware Implementation (ASICs/FPGAs)
- High variation in architecture complexity
- Performance vs. area tradeoff
- Suitability for embedded applications
Specialized Crypto Processors
- Few VLIW architectures (e.g. CryptoManiac)
- Instruction combining: instruction words are combined to exploit ILP
- Crypto arithmetic unit(s): multiple XORs, GF multiplication/addition, lookup table substitution, and permutation
- Coarse configurability of datapath
- Mostly lacking SIMD support
- Performance is typically 2x to 6x that of general-purpose processors
Reconfigurable Architectures
- Numerous reconfigurable processor architectures: PipeRench, MorphoSys, COBRA, and GARP
- Functional units that provide all crypto arithmetic: multiple XORs, GF multiplication/addition, modulo multiplication
- Reconfigurable interconnection network to change functional unit connectivity dynamically
- VLIW instructions
- Reconfiguration registers
- Suitable for block ciphers
- High variability in performance increase relative to general-purpose processors
Full Hardware Implementation
- High-performance implementations targeted to ASICs/FPGAs
  - DES: 12 Gbps on Virtex-E XCV300E
  - AES: 18 Gbps on ASIC using TSMC 0.18 µm process
- Lacking flexibility and crypto modes
- Memory and area efficient
- Typical latency only in DMA of data to the hardware unit
- Need an additional processor for the control path
Xtensa+ Crypto Architecture
- Custom extensions to the Xtensa processor using the TIE framework
- Addition of a generic Key Schedule Register File and instructions to support all crypto algorithms studied
- Addition of multiple on-chip SRAMs (in addition to 4 Data-RAMs) to the Xtensa processor
  - Currently implemented using the Table construct in TIE
  - Hacked the Verilog code generated by the TIE Compiler to instantiate multiple RAM models (implemented using a multi-dimensional array) for viability analysis
- Addition of 4 State Registers and 4 Next State Registers generic to all algorithms studied
- Possible future extensions to include multi-session key storage and fast retrieval support
AES Overview
- AES (Advanced Encryption Standard) is the standard set to replace DES for both government and private-sector encryption
- Uses a fixed block size of 128 bits, with key sizes of 128, 192, or 256 bits
- Designed to be efficient in both hardware and software across a variety of platforms
- 10, 12, or 14 rounds depending on key size
- 128-bit round key used for each round
  - Round keys can be pre-computed and cached for future encryptions
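The relationship between key size, round count, and cached round-key storage can be sketched as follows. This is a minimal illustration based on the AES standard (rounds = key words + 6, plus one extra round key for the initial key XOR), not code from the presented implementation; the function name `aes_parameters` is hypothetical.

```python
def aes_parameters(key_bits):
    """Return (rounds, round_key_bytes) for a given AES key size.

    Hypothetical helper for illustration. Follows the AES standard:
    Nr = Nk + 6, where Nk is the key length in 32-bit words, and the
    schedule holds Nr + 1 round keys of 128 bits (16 bytes) each.
    """
    if key_bits not in (128, 192, 256):
        raise ValueError("AES key size must be 128, 192, or 256 bits")
    nk = key_bits // 32          # key length in 32-bit words
    rounds = nk + 6              # 10, 12, or 14
    round_keys = rounds + 1      # one extra key for the initial XOR
    return rounds, round_keys * 16


for bits in (128, 192, 256):
    rounds, storage = aes_parameters(bits)
    print(f"{bits}-bit key: {rounds} rounds, {storage} bytes of round keys")
```

The storage figure is what a cached key schedule would occupy per session, which is relevant to the multi-session key storage mentioned as a possible extension.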
AES Implementation Abstraction
- Each round consists of a lookup, byte-level permutation, finite field multiplication, and key XOR
- Lookup and multiplication can be combined into four separate 8x32 lookup tables, so each round is 16 lookups and 16 XORs
- Decryption is essentially the same, but with different tables and a different key schedule
TIE Implementation
- Our implementation does all 16 lookups in parallel, requiring 16 SRAMs
- x0, x1, x2, x3 represent the round state (each 32 bits), k0, k1, k2, k3 are the current round key, and Tij are the T-boxes, where i is a duplication index and j is the T-box index
- Each table index is the low 8 bits of the shifted word
- Each round is then:
    x0 = T00[x0] ^ T01[x1>>8] ^ T02[x2>>16] ^ T03[x3>>24] ^ k0
    x1 = T10[x1] ^ T11[x2>>8] ^ T12[x3>>16] ^ T13[x0>>24] ^ k1
    x2 = T20[x2] ^ T21[x3>>8] ^ T22[x0>>16] ^ T23[x1>>24] ^ k2
    x3 = T30[x3] ^ T31[x0>>8] ^ T32[x1>>16] ^ T33[x2>>24] ^ k3
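A software model of the T-box round above may help make the data flow concrete. The sketch below is an assumption-laden illustration, not the TIE code: it uses one copy of each table (the duplication index i only matters for parallel hardware lookups), packs each state column little-endian with row 0 in the low byte, and handles the last round without the finite field multiplication as the AES standard requires (the slides do not say how the final round was implemented). All function names are hypothetical.

```python
def xtime(a):
    """Multiply by 2 in GF(2^8) modulo the AES polynomial 0x11B."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a

def gmul(a, b):
    """General GF(2^8) multiplication (shift-and-add)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r

def build_sbox():
    """S-box = multiplicative inverse followed by the AES affine transform."""
    sbox = []
    for x in range(256):
        inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
        s = inv
        for n in range(1, 5):
            s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

def pack(b0, b1, b2, b3):
    """Pack four bytes into a 32-bit word, b0 in the low byte."""
    return b0 | b1 << 8 | b2 << 16 | b3 << 24

SBOX = build_sbox()
# Each T-box folds the S-box lookup and one MixColumns matrix column together.
T0 = [pack(xtime(s), s, s, xtime(s) ^ s) for s in SBOX]
T1 = [pack(xtime(s) ^ s, xtime(s), s, s) for s in SBOX]
T2 = [pack(s, xtime(s) ^ s, xtime(s), s) for s in SBOX]
T3 = [pack(s, s, xtime(s) ^ s, xtime(s)) for s in SBOX]

def expand_key_128(key):
    """AES-128 key schedule as 44 little-endian words."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "little") for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = w[i - 1]
        if i % 4 == 0:
            t = (t >> 8) | ((t & 0xFF) << 24)                       # RotWord
            t = pack(*(SBOX[(t >> s) & 0xFF] for s in (0, 8, 16, 24)))  # SubWord
            t ^= rcon
            rcon = xtime(rcon)
        w.append(w[i - 4] ^ t)
    return w

def tbox_round(x, k):
    """One full round: 16 table lookups and 16 XORs, as in the slide."""
    return tuple(
        T0[x[c] & 0xFF]
        ^ T1[(x[(c + 1) % 4] >> 8) & 0xFF]
        ^ T2[(x[(c + 2) % 4] >> 16) & 0xFF]
        ^ T3[x[(c + 3) % 4] >> 24]
        ^ k[c]
        for c in range(4)
    )

def final_round(x, k):
    """Last round: S-box lookup, byte permutation, key XOR (no multiplication)."""
    return tuple(
        pack(*(SBOX[(x[(c + r) % 4] >> (8 * r)) & 0xFF] for r in range(4))) ^ k[c]
        for c in range(4)
    )

def aes128_encrypt_block(block, key):
    w = expand_key_128(key)
    # Initial key XOR, then 9 T-box rounds and the final round.
    x = tuple(int.from_bytes(block[4 * i:4 * i + 4], "little") ^ w[i] for i in range(4))
    for r in range(1, 10):
        x = tbox_round(x, w[4 * r:4 * r + 4])
    x = final_round(x, w[40:44])
    return b"".join(v.to_bytes(4, "little") for v in x)
```

In this model each call to `tbox_round` corresponds to the work that the slide assigns to a single round instruction; in hardware the 16 lookups would proceed in parallel rather than sequentially.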
Other Ciphers Implemented
- DES (Data Encryption Standard)
  - 64-bit block, 56-bit key, 16 rounds, Feistel network
  - 8 6x4 S-boxes, XORs, and bit-level permutations
  - Can't really be done efficiently in software
  - TIE implementation required 1 instruction per round
- IDEA (International Data Encryption Algorithm)
  - 64-bit block, 128-bit key, 8 rounds, iterated, operates on 16-bit numbers
  - 4 multiplications mod 2^16 + 1, 4 additions mod 2^16, 6 XORs
  - Each round is highly sequential, so difficult to parallelize
  - TIE implementation required 7 instructions per round
- RC6
  - Same block and key modes as AES, 20 rounds, iterated
  - Multiplication mod 2^32, XORs, rotations, addition mod 2^32
  - TIE implementation required 2 instructions per round
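The modular arithmetic named for IDEA and RC6 can be sketched in a few lines. This is an illustration of the published cipher definitions, not of the TIE instructions; the function names are hypothetical, and the RC6 round shown assumes the standard 32-bit word size with round keys `s0` and `s1` supplied by the caller.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF

def idea_mul(a, b):
    """IDEA multiplication mod 2^16 + 1; the 16-bit value 0 stands for 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    return ((a * b) % 0x10001) & MASK16   # a result of 2^16 maps back to 0

def idea_add(a, b):
    """IDEA addition mod 2^16."""
    return (a + b) & MASK16

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits (only the low 5 bits of n are used)."""
    n &= 31
    return ((x << n) | (x >> (32 - n))) & MASK32

def rc6_f(x):
    """RC6 quadratic function: x * (2x + 1) mod 2^32, rotated left by 5."""
    return rotl32((x * (2 * x + 1)) & MASK32, 5)

def rc6_round(a, b, c, d, s0, s1):
    """One RC6 encryption round on four 32-bit words with two round keys."""
    t = rc6_f(b)
    u = rc6_f(d)
    a = (rotl32(a ^ t, u) + s0) & MASK32
    c = (rotl32(c ^ u, t) + s1) & MASK32
    return b, c, d, a                      # word rotation at the end of the round
```

The two halves of `rc6_round` (the updates of `a` and `c`) are independent apart from exchanging `t` and `u`, which is consistent with a round fitting in a small number of instructions, whereas IDEA's chain of dependent multiplications and additions is harder to collapse.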
AES Performance in Xtensa+
- Performance of TIE extensions approaches performance of non-pipelined ASICs
- Total of 31 run-time instructions per data block
  - 1 initial XOR instruction
  - 1 instruction per round computation (10 total)
  - 20 cycles for load and store of 128-bit data blocks
- Generally an order of magnitude better than pure software
- Also faster than reconfigurable hardware or a specialized VLIW processor
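The per-block count above converts to throughput once a clock rate is fixed. The sketch below assumes one instruction issues per cycle, so that 1 + 10 + 20 = 31 cycles per 128-bit block; the 200 MHz clock is a placeholder chosen only for illustration and is not a figure from this study.

```python
def throughput_mbps(block_bits, cycles_per_block, clock_mhz):
    """Throughput in Mbps = bits per block * blocks per microsecond.

    Assumes the processor is never stalled, which is optimistic.
    """
    return block_bits * clock_mhz / cycles_per_block


# Cycle budget from the slide: initial XOR + 10 rounds + load/store.
cycles = 1 + 10 + 20

# ASSUMPTION: 200 MHz is a hypothetical clock, not a measured value.
print(f"{throughput_mbps(128, cycles, 200.0):.1f} Mbps at a hypothetical 200 MHz")
```

Note that 20 of the 31 cycles are data movement, so under these assumptions the load/store path rather than the round computation dominates the per-block cost.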
Mbps of Throughput
[Chart: throughput in Mbps for AES, DES, IDEA, and RC6 across the Base, VLIW, TIE, ASIC, and Reconfigurable implementations]
Cycles Per Block
[Chart: cycles per block for AES, DES, IDEA, and RC6 across the Base, VLIW, TIE, and ASIC implementations]
Design Tradeoffs
- Flexibility
  - Algorithm changes
  - New algorithms
  - New encryption modes
  - Implementation bugs
- Time to market
  - Closer to software development time
  - Can choose which parts to accelerate
Power vs. Performance: Mbps/mW
[Chart: Mbps/mW for AES, DES, IDEA, and RC6 across the Base, VLIW, TIE, ASIC, and Reconfigurable implementations]
Conclusion
- Xtensa instructions provide flexibility, performance, and Mbps/mW all somewhere between an ASIC and a VLIW or software-based solution
- Suitable for most embedded applications like i, etc.
- Using Xtensa for cryptography is a good choice if:
  - You don't need absolute throughput
  - You don't need absolute flexibility
  - You need a control processor anyway
  - The algorithms needed are known ahead of time