Master’s Thesis: Fast Flexible Architectures for Secure Communication


Master’s Thesis: Fast Flexible Architectures for Secure Communication Lisa Wu University of Michigan Advanced Computer Architecture Laboratory Advisor: Professor Todd Austin

Project Overview
- Cipher Kernel Analyses: throughput analysis, bottleneck analysis, relative run time cost, kernel characterization
- Architectural Extensions
- CryptoManiac Architecture: instruction architecture, system architecture, processing element architecture, physical design characteristics
- Super Optimizer: validation and parameter studies
- Performance Analysis: encryption rate studies
9/18/2018 Advanced Computer Architecture Lab

My Research Contribution
- Design and implementation of the CryptoManiac co-processor
  - Hardware models of CryptoManiac 8WC, 4WC, 3WC, 2WC, and 4WNC
  - ISA and scheduling of kernels
  - Timing, area, power, and performance analyses of the CryptoManiac co-processor
- Design and implementation of the super optimizer
  - Instruction combination study
  - Automatic generation of varied width schedules
- Publication - ISCA 2001

Cryptography
- Definitions:
  - encryption vs. decryption
  - public-key cipher vs. secret-key cipher
- Public-secret key ciphers are the most commonly used
[Figure: public-key cipher: plaintext -> f(x) with Public Key -> ciphertext -> g(x) with Private Key -> plaintext. Secret-key cipher: plaintext -> g(x) with Private Key -> ciphertext -> g(x) with Private Key -> plaintext.]

SSL Session Breakdown
- Focus: Secret-Key Ciphers
[Figure: SSL session timeline between client and server; labels: authenticate (public key), https get and https recv (private key), close]

Benchmark Suite

Cipher    Key Size  Blk Size  Rnds/Blk  Author        Application
3DES      112       64        48        CryptSoft     SSL, SSH
Blowfish  128       64        16        CryptSoft     Norton Utilities
IDEA      128       64        8         Ascom         PGP, SSH
Mars      128       128       16        IBM           AES Candidate
RC4       128       8         1         CryptSoft     SSL
RC6       128       128       18        RSA Security  AES Candidate
Rijndael  128       128       10        Rijmen        AES Standard
Twofish   128       128       16        Counterpane   AES Candidate

Cipher Throughput Analysis
- Alpha 21264 vs. 4W
  - All except Mars and Twofish were within 10% of the actual machine tests
  - Mars 11%, Twofish 15%
- Alpha 21264 vs. DF
  - Blowfish, IDEA, and RC6 are running within 20% of DF performance
  - Mars 29%, Twofish 76%
  - RC4 and Rijndael are outliers

Cipher Bottleneck Analysis
- Alias - impact of stalling loads in the pipeline until all earlier store addresses have been resolved
- Branch - effects of mispredictions
- Issue - impact of reducing issue width
- Mem - impact of introducing a realistic memory system
- Res - impact of limited functional unit resources
- Window - impact of a limited-size instruction window

Cipher Relative Run Time Cost
- Focus: Kernel Loop
- 3DES and IDEA are small even for 16 byte sessions
- Mars, RC4, RC6, Rijndael, and Twofish drop well below 10% for 4k+ byte sessions
- Blowfish is an outlier; it drops below 10% only for 64k+ byte sessions

Cipher Kernel Characterization
- SBOX - substitutions
- XBOX - permutations
- IDEA, Mars, RC4, and RC6 rely on arithmetic computations; they benefit from more resources (multiplies) and from faster operations (rotates)
- Blowfish, 3DES, Rijndael, and Twofish rely on substitutions; they benefit from increased memory bandwidth and accesses

Architectural Extensions
- All instructions are limited to two register input operands and one register output
- ROL and ROR (rotates) for 64 and 32-bit data types
- ROLX and RORX support a constant rotate of a register input, followed by an XOR with another register input
- MULMOD computes the modular multiplication of two register values modulo the value 0x10001
- SBOX speeds the accessing of substitution tables with 256-entry tables and 32-bit contents
- SBOXSYNC synchronizes the SBOX table with memory
- XBOX implements a portion of a full 64-bit permutation
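The rotate, rotate-and-XOR, and modular-multiply extensions listed above can be modeled in a few lines of software. The following is a minimal sketch, not the thesis's implementation: the function names (`rol32`, `ror32`, `rolx32`, `mulmod`) and the restriction to 32-bit values are assumptions made for illustration. `mulmod` reduces a plain product modulo 0x10001; IDEA's usual convention of treating an operand of 0 as 2^16 is not modeled here, since the slide does not say how the instruction handles it.

```python
MASK32 = 0xFFFFFFFF  # keep results within 32 bits


def rol32(x, n):
    """Rotate a 32-bit value left by n bit positions."""
    n &= 31
    x &= MASK32
    return ((x << n) | (x >> (32 - n))) & MASK32


def ror32(x, n):
    """Rotate a 32-bit value right by n bit positions."""
    n &= 31
    x &= MASK32
    return ((x >> n) | (x << (32 - n))) & MASK32


def rolx32(x, n, y):
    """ROLX-style operation: rotate x left by a constant n, then XOR with y."""
    return rol32(x, n) ^ (y & MASK32)


def mulmod(a, b):
    """MULMOD-style operation: multiply two values modulo 0x10001 (2**16 + 1)."""
    return (a * b) % 0x10001
```

In software without rotate support, each rotate costs two shifts and an OR, which is one reason a single-instruction rotate helps the arithmetic-heavy kernels.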

SBOX Instruction Semantics
[Figure: SBOX instruction and address bit layout; labels: op, SBOX Table, Table Index, 00; bit positions 8, 10, 16, 24, 63]
- SBOX instruction eliminates address generation
  - All SBOX tables are aligned to a 1k byte boundary
  - Address generation becomes zero-latency bit concatenation
- Stores to SBOX storage are not visible by later SBOXes until
  - An SBOXSYNC is executed
  - An alias bit is set
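To see why alignment turns address generation into bit concatenation, consider a table of 256 entries of 4 bytes each (1k bytes). If the table base has its low 10 bits clear, then base + 4 * index equals base OR (index << 2), so no adder is needed. The sketch below illustrates this under stated assumptions: `sbox_address` and its `byte_sel` parameter (which byte of the operand supplies the table index) are hypothetical names, not taken from the slides.

```python
def sbox_address(table_base, operand, byte_sel):
    """Form the address of a 32-bit SBOX entry by bit concatenation.

    table_base: byte address of a 256-entry table, aligned to 1k bytes
    operand:    register value holding the index in one of its bytes
    byte_sel:   which byte of the operand to use as the index (0 = lowest);
                this parameter is an assumption made for illustration
    """
    if table_base & 0x3FF:
        raise ValueError("SBOX table must be aligned to a 1k byte boundary")
    index = (operand >> (8 * byte_sel)) & 0xFF
    # The low 10 bits of the base are zero, so OR is equivalent to ADD here:
    # bits [9:2] hold the index and bits [1:0] are 00 (4-byte entries).
    return table_base | (index << 2)
```

For any aligned base, the result matches the conventional `table_base + 4 * index` computation.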

Performance of ISA Extensions

The CryptoManiac Processor
- A 4-wide 32-bit VLIW machine with no cache and a simple branch predictor
- Supports a triadic (three input operands) ISA that permits combining of most cryptographic operation pairs for better clock cycle utilization
- Can be combined into chip multiprocessor configurations for improved performance on workloads with inter-session and inter-packet parallelism

CryptoManiac ISA

bundle := <inst><inst><inst><inst>
inst := <operation pair><dest><operand 1><operand 2><operand 3>
operation pair := <short><tiny> | <tiny><short> | <tiny><tiny> | <long><nop>
tiny := <xor> | <and> | <inc> | <signext> | <nop>
short := <add> | <sub> | <rot> | <sbox> | <nop>
long := <mul> | <mulmod>

Examples:

Instruction              Expression
Add-Xor R4, R1, R2, R3   R4 <- (R1+R2) XOR R3
And-Rot R4, R1, R2, R3   R4 <- (R1&&R2) <<< R3
And-Xor R4, R1, R2, R3   R4 <- (R1&&R2) XOR R3
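The operation-pair grammar and the three examples can be exercised with a small model. This is a sketch under assumptions, not the CryptoManiac simulator: the names `is_valid_pair` and `execute_pair` are invented here, values are taken to be 32 bits wide, AND is treated as bitwise, and only the two-input operations `add`, `sub`, `xor`, `and`, and `rot` are modeled (the others raise `KeyError`).

```python
MASK32 = 0xFFFFFFFF

# Operation classes as given in the ISA grammar.
TINY = {"xor", "and", "inc", "signext", "nop"}
SHORT = {"add", "sub", "rot", "sbox", "nop"}
LONG = {"mul", "mulmod"}


def is_valid_pair(first, second):
    """Check a pair against: short-tiny | tiny-short | tiny-tiny | long-nop."""
    return (
        (first in SHORT and second in TINY)
        or (first in TINY and second in SHORT)
        or (first in TINY and second in TINY)
        or (first in LONG and second == "nop")
    )


def _rot(a, b):
    """Rotate a left by b bit positions (the '<<<' in the examples)."""
    n = b & 31
    a &= MASK32
    return ((a << n) | (a >> (32 - n))) & MASK32


# Only the simple two-input operations are modeled in this sketch.
OPS = {
    "add": lambda a, b: (a + b) & MASK32,
    "sub": lambda a, b: (a - b) & MASK32,
    "xor": lambda a, b: (a ^ b) & MASK32,
    "and": lambda a, b: (a & b) & MASK32,
    "rot": _rot,
}


def execute_pair(first, second, r1, r2, r3):
    """Compute dest <- second(first(r1, r2), r3) for a combined instruction."""
    if not is_valid_pair(first, second):
        raise ValueError(f"{first}-{second} is not a legal operation pair")
    return OPS[second](OPS[first](r1, r2), r3)
```

For example, `execute_pair("add", "xor", r1, r2, r3)` models Add-Xor, while a pair such as add-rot is rejected because the grammar does not allow two short operations in one instruction.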

Scheduling Example: Blowfish
[Figure: Blowfish kernel dataflow (Load, SBOX, ADD, XOR, Sign Ext) next to its CryptoManiac schedule using SBOX, Add-XOR, Load, Add, XOR, and XOR-SignExt operations]
- Takes a total of only 4 cycles to execute!
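For reference, the computation being scheduled is the standard Blowfish round: four S-box lookups combined by an add, an XOR, and another add, then XORed into the other half of the block. The sketch below shows that dataflow in plain software, with comments marking where substitution lookups and an add-then-XOR pair occur. It is not the thesis's schedule; the function names and the placeholder S-box contents used in any test are assumptions, and real Blowfish S-boxes are derived from the key.

```python
MASK32 = 0xFFFFFFFF


def blowfish_f(x, s):
    """Blowfish F function: ((S0[a] + S1[b]) XOR S2[c]) + S3[d], mod 2**32.

    x: 32-bit input word, split into bytes a (highest) through d (lowest)
    s: four substitution tables, each with 256 32-bit entries
    """
    a = (x >> 24) & 0xFF
    b = (x >> 16) & 0xFF
    c = (x >> 8) & 0xFF
    d = x & 0xFF
    # Four independent table lookups (candidates for SBOX instructions).
    t = (s[0][a] + s[1][b]) & MASK32
    t ^= s[2][c]  # the add followed by this XOR is an add-XOR pair
    return (t + s[3][d]) & MASK32


def blowfish_round(left, right, p_i, s):
    """One Blowfish round: mix in subkey p_i, apply F, and swap the halves."""
    left = (left ^ p_i) & MASK32
    right = (right ^ blowfish_f(left, s)) & MASK32
    return right, left
```

The four lookups have no dependences on each other, which is what lets a 4-wide machine issue them together.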

High-Level Schematic of a Single Functional Unit
[Figure: functional unit schematic; blocks: Pipelined 32-Bit MUL, 1K Byte SBOX Cache, Adder, Rotator, XOR, AND, Logical Unit]

CryptoManiac Architecture
[Figure: architecture block diagram; labels not recoverable]

CryptoManiac System Architecture
[Figure: system block diagram; labels: requests, In Q, Req Scheduler, CM Proc, Keystore, Out Q, results; Request Format and Result Format with fields id, session, action, data…, result…]

Timing and Area Results

Encryption Performance
[Figure: encryption performance chart; labels: HDTV, OC-3, OC-12]

Special Case Studies: 3DES and Rijndael

The Super Optimizer
- Validate hand-scheduled kernel results
- Automate generation of optimized kernels for the various CryptoManiac architectures studied
- Instruction combination studies give insight into which hardware may be unnecessary and could be eliminated

Instruction Combination Study

Instruction Combining Characteristics

Conclusion
- Two hardware/software-design techniques to improve the performance of secret-key cipher algorithms
  - Add instruction support for fast substitutions, general permutations, rotates, and modular arithmetic
    - SBOX eliminates address generation
    - Overall speedup of 59% over baseline machine w/ rotates
  - Design an efficient 4-wide VLIW cryptographic co-processor called the CryptoManiac
    - Instruction combining - efficient utilization of clock cycle
    - Rijndael runs 2.25 times faster with 1/100th area and power of a 600MHz Alpha processor

Future Work
- Assess the cost of programmability in the CryptoManiac by comparing design and performance of
  - A dedicated hardware Rijndael implementation (no programmability)
  - An FPGA Rijndael implementation (hardware programmability)
  - CryptoManiac (software programmability)
- Other application specific processors such as audio processing, speech recognition, and soft-radio

Acknowledgement
Credit for much of the work described in this thesis belongs to my advisor, Professor Todd Austin, for his insight, guidance, and patience. He provided an excellent research environment, left me enough freedom to do things the way I thought they should be done, and was always available to discuss ideas and problems. I would also like to thank my committee members Professor Steve Reinhardt and Professor Gary Tyson for reviewing this document and serving on the defense committee. Other people who have contributed to the CryptoManiac project include Chris Weaver for hardware design and synthesis support, and Jerome Burke and John McDonald for earlier versions of ISA extensions code modifications.