MISTY1 Block Cipher
Undergrad Team U8 (JK FlipFlop): Clark Cianfarini and Garrett Smith
What is MISTY1?
- Cryptographic block cipher
- Developed by Mitsubishi Electric, created in 1995
- Developed primarily for encryption on mobile phones and other mobile devices
- Stands for: Mitsubishi Improved Security TechnologY
Technical Specs
- Feistel network
- 64-bit block size
- 128-bit key
- Rounds in multiples of 4 (4, 8, 12, 16, ...)
- Specified in RFC 2994
- Picture from: sty/misty_e_b.pdf
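The Feistel structure named above can be sketched generically. This is not MISTY1's actual round (its FO/FI/FL functions and key schedule are far more involved); `round_fn` is a made-up stand-in, only illustrating why a Feistel cipher decrypts with the same skeleton run with subkeys reversed:

```c
#include <stdint.h>

/* Illustrative round function -- NOT MISTY1's FO/FI/FL, just a stand-in
 * to show the Feistel structure. The multiplier is an arbitrary mixer. */
static uint32_t round_fn(uint32_t half, uint32_t subkey)
{
    return (half ^ subkey) * 0x9E3779B9u;
}

/* Generic Feistel encryption over a 64-bit block split into two
 * 32-bit halves. */
void feistel_encrypt(uint32_t *left, uint32_t *right,
                     const uint32_t *subkeys, int rounds)
{
    for (int i = 0; i < rounds; i++) {
        uint32_t tmp = *right;
        *right = *left ^ round_fn(*right, subkeys[i]);
        *left  = tmp;
    }
}

/* Decryption is the same network with the subkeys applied in reverse;
 * round_fn never needs to be invertible. */
void feistel_decrypt(uint32_t *left, uint32_t *right,
                     const uint32_t *subkeys, int rounds)
{
    for (int i = rounds - 1; i >= 0; i--) {
        uint32_t tmp = *left;
        *left  = *right ^ round_fn(*left, subkeys[i]);
        *right = tmp;
    }
}
```

Encrypting and then decrypting a block with the same subkey array returns the original halves.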
Our Original Implementation
- 8 rounds (the standard)
- 128-bit key and 64-bit data as hexadecimal inputs (command-line arguments)
- Both encrypt and decrypt implemented, plus a mode that runs both consecutively for benchmarking
Original (Unoptimized) Design
- Designed for code size and clarity
- Written in C; only standard libraries used
- Inefficiencies in: loops, multiplies and divides, function calls, parameter passing
- Usage: ./misty <mode> <K> <M> [I]
  - mode: 'e' to encrypt, 'd' to decrypt, 'b' to test both
  - K is a required 32-digit hex string (the 128-bit key)
  - M is a required 16-digit hex string (the 64-bit data block)
  - I is an optional number of iterations for benchmarking
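Parsing a fixed-width hex argument like K or M can be sketched with the standard `strtoull`. This `parse_hex_arg` is a hypothetical helper and may not match the project's function of the same name; a 128-bit key would be parsed as two 64-bit halves this way:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Parse an exactly `digits`-long hex string into a 64-bit value.
 * Returns 0 on success, -1 on a wrong length or a non-hex character.
 * (Illustrative sketch; the original parse_hex_arg may differ.) */
int parse_hex_arg(const char *s, size_t digits, uint64_t *out)
{
    char *end;
    if (s == NULL || strlen(s) != digits)
        return -1;
    *out = strtoull(s, &end, 16);      /* base 16, stops at first bad char */
    return (*end == '\0') ? 0 : -1;    /* reject trailing garbage */
}
```

For example, a 16-digit argument yields one 64-bit data block, while a bad length or a stray character is rejected before the cipher runs.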
Original Design GPROF Profile
- (gprof flat profile: columns % time, cumulative seconds, self seconds, calls, us/call, name; the numeric columns were lost in extraction)
- Functions profiled: fi, fo, fl, flinv, key_schedule, decrypt_block, encrypt_block, unpack_data, decrypt_round_even, encrypt_round_even, decrypt_round_odd, encrypt_round_odd, encrypt_final, main, xtoi, print_hex_data, parse_hex_arg
- 80% of the time spent in FO/FI/FL/FLINV
- Compiled with gcc
- Benchmarked on a 64-bit 2.4 GHz Linux machine
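A profile like the one above is produced with gcc's `-pg` instrumentation and the `gprof` tool. A minimal workflow, assuming the file and benchmark mode named on the next slide (the key/data arguments are placeholders):

```shell
# Build with profiling instrumentation enabled
gcc -pg misty_slow.c -o slow

# Run the benchmark mode; the instrumented binary writes gmon.out
./slow b <32-digit-hex-key> <16-digit-hex-data> 10000000

# Flat profile: time and call counts per function
gprof slow gmon.out | head -30
```

The flat profile is what points at FO/FI/FL/FLINV as the hot spots worth optimizing first.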
Unoptimized Execution Time
- gcc misty_slow.c -o slow
- time ./slow b aabbccddeeff abcdef
- real 0m23.093s, user 0m22.886s, sys 0m0.031s
- 10 million iterations: 2.31 µs per iteration (~1.15 µs each for encryption and decryption)
Revised Software Design
- Designed for optimal performance
- Loops unrolled (rounds, d0/d1 pack)
- Power-of-2 multiply, divide, and modulus replaced with shift and AND
- Functions inlined
- Reduced parameter passing (key)
- Compiler optimization levels enabled
- Compiler architecture-specific options enabled
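The power-of-2 strength reductions can be shown directly. For unsigned operands the rewrites are exact (for negative signed values, division and modulus round differently from shift and mask, so the unsigned cipher data here is the safe case); compilers typically do this automatically at -O1 and above, but the slides applied it by hand first:

```c
#include <stdint.h>

/* Strength reduction for power-of-two constants, unsigned operands only. */
uint32_t mul8(uint32_t x)  { return x << 3; }  /* same as x * 8  */
uint32_t div16(uint32_t x) { return x >> 4; }  /* same as x / 16 */
uint32_t mod16(uint32_t x) { return x & 15; }  /* same as x % 16 */
```

Shifts and ANDs avoid the multi-cycle multiply/divide units, which mattered before the compiler's own optimizer was enabled.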
Rounds: Before Unrolling

for (i = 0; i < NUM_ROUNDS; i++)
{
    if (i == (NUM_ROUNDS - 1))
        encrypt_final(i, &d0, &d1, ek);
    else if ((i % 2) == 0)
        encrypt_round_even(i, &d0, &d1, ek);
    else
        encrypt_round_odd(i, &d0, &d1, ek);
}
Rounds: After Unrolling

// round 0
d0 = fl(d0, 0);
d1 = fl(d1, 1);
d1 = d1 ^ fo(d0, 0);
// round 1
d0 = d0 ^ fo(d1, 1);
// round 2
d0 = fl(d0, 2);
d1 = fl(d1, 3);
d1 = d1 ^ fo(d0, 2);
// round 3
d0 = d0 ^ fo(d1, 3);
// round 4
d0 = fl(d0, 4);
d1 = fl(d1, 5);
d1 = d1 ^ fo(d0, 4);
// round 5
d0 = d0 ^ fo(d1, 5);
// round 6
d0 = fl(d0, 6);
d1 = fl(d1, 7);
d1 = d1 ^ fo(d0, 6);
// round 7
d0 = d0 ^ fo(d1, 7);
// finalize
d0 = fl(d0, 8);
d1 = fl(d1, 9);
Execution Time and Speedup

Description          Time        Speedup (vs. initial)
Slow / Initial       0m23.093s   1.00x
Unroll Rounds        0m21.573s   1.07x
Unroll D0/D1 Init    0m20.750s   1.11x
Shift and AND        0m18.978s   1.22x
Unroll Packing       0m18.135s   1.27x
Make EK Global       0m17.902s   1.29x
Inline FO/FI/FL      0m15.921s   1.45x
Enable O1            0m4.308s    5.36x
Enable O2            0m4.276s    5.40x
Enable O3            0m4.155s    5.56x
Architecture Flags   0m4.128s    5.59x
Building and Testing the Optimized Implementation
- gcc misty_fast.c -o fast
- gcc misty_fast.c -o fast -O1
- gcc misty_fast.c -o fast -O2
- gcc misty_fast.c -o fast -O3
- gcc misty_fast.c -o fast -O3 -march=core2
- Fastest execution time: real 0m4.128s, user 0m4.117s, sys 0m0.007s
- 10 million iterations: 413 ns per iteration
Final Design GPROF Profile
- (gprof flat profile: columns % time, cumulative seconds, self seconds, calls, ns/call, name; the numeric columns were lost in extraction)
- Functions remaining in the profile: decrypt_block, encrypt_block, main, print_hex_data, parse_hex_arg
- Most function calls inlined; only decrypt_block and encrypt_block remain as cipher-level calls
What was Learned?
- The original implementation may not have been all that bad (~1.5x speedup from the manual optimizations)
- Larger benefit came from instruction-level optimization by gcc
- Profile first, then optimize where it actually matters
- The bitwise AND operator has lower precedence than modulus (and even than addition):
  - x % y + z → (x % y) + z
  - x & y + z → x & (y + z)
- All the optimizations add up to significant savings
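The precedence pitfall above is easy to demonstrate with small wrapper functions (the names here are illustrative, not from the project):

```c
/* '%' binds like '*', above '+'; '&' binds below '+'.
 * So the same-looking expressions group differently. */
int mod_then_add(int x, int y, int z) { return x % y + z; } /* (x % y) + z */
int and_of_sum(int x, int y, int z)   { return x & y + z; } /* x & (y + z) */
```

With x=12, y=8, z=1: `mod_then_add` gives (12 % 8) + 1 = 5, but `and_of_sum` gives 12 & (8 + 1) = 12 & 9 = 8, not (12 & 8) + 1 = 9. Masking code that forgets the parentheses silently computes the wrong value.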
Future Work
- Use SSE vector instructions for parallel operations
- Convert data types such as uint8_t/uint16_t to the natural integer size for better memory alignment and access performance
- Use a union to replace packing and unpacking of data between the byte array and D0/D1
- Rewrite directly in optimized assembly
- Dedicated hardware implementation (ASIC/FPGA) of MISTY1 (it was originally designed to be implemented in hardware)
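The union idea can be sketched as follows. The type and field names are illustrative; note that the byte order seen through the 32-bit view depends on host endianness, so a portable version still has to normalize byte order when loading a block:

```c
#include <stdint.h>

/* View the same 8 bytes either as the raw input block or as the two
 * 32-bit halves d0/d1, avoiding an explicit shift-and-OR packing loop.
 * (Sketch only -- half[] ordering is endianness-dependent.) */
typedef union {
    uint8_t  bytes[8];   /* raw 64-bit block           */
    uint32_t half[2];    /* half[0] = d0, half[1] = d1 */
} block64;
```

Usage: copy the 8 input bytes into `bytes`, then read `half[0]`/`half[1]` directly; writes through one member are visible through the other.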
Questions?