AES Encryption Code Generator Undergraduate Research Project by Paul Magrath. Supervised by Dr David Gregg.

Motivation
AES Encryption Code Generator - Paul Magrath - Trinity College, Dublin
• What is AES?
• Why is it interesting?
  • Widely used.
  • New intrinsics.

Motivation
• How do we write optimized code for these new instructions?
• Problem: Hand-coded assembly.
• Solution: A domain-specific code generator.

Motivation
• Why a domain-specific code generator?
  • Effective
  • Proven
  • Speed
  • Maintainability
  • Tunability

Motivation
• So: AES Encryption Code Generator.

Background
• AES-NI arrives in ‘Westmere’ (due 2010)

AES Code Generator
• Takes an AES encryption loop file as input.
• Generates variants of the loop.
• Compiles each variant, runs it repeatedly, and takes the median of its runtimes.
• Reports the best variant runtime achieved.
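The measure-and-select steps on this slide can be sketched as a small timing harness. This is only an illustration under stated assumptions, not the project's actual Python wrapper: the names `time_once`, `median_runtime`, and `best_variant` are hypothetical, and each variant is represented by a plain Python callable standing in for a compiled loop binary.

```python
import statistics
import time
from typing import Callable, Dict, Tuple


def time_once(workload: Callable[[], None]) -> float:
    """Return the wall-clock seconds taken by one call of workload."""
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start


def median_runtime(workload: Callable[[], None],
                   repeats: int = 11,
                   timer: Callable[[Callable[[], None]], float] = time_once) -> float:
    """Run the workload `repeats` times and return the median measurement.

    The median is less sensitive than the mean to occasional slow runs
    caused by interrupts or other system noise.
    """
    samples = [timer(workload) for _ in range(repeats)]
    return statistics.median(samples)


def best_variant(variants: Dict[str, Callable[[], None]],
                 repeats: int = 11,
                 timer: Callable[[Callable[[], None]], float] = time_once
                 ) -> Tuple[str, float]:
    """Return (name, median runtime) of the fastest variant."""
    results = {name: median_runtime(fn, repeats, timer)
               for name, fn in variants.items()}
    best = min(results, key=results.get)
    return best, results[best]
```

The `timer` parameter is an assumption made for illustration: it lets the wall-clock measurement be swapped for another metric, such as a hardware cycle counter.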

AES Code Generator: Variant Options
• Streaming store
• Unwind inner loop
• Use local variables
• Unwind outer loop
• Interleave
• Parallel (OpenMP)
• Prefetch to cache
• Prefetch to register
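One way to picture the search space these options create is to treat each option as an independent on/off switch and enumerate every combination. This is a rough sketch under that assumption only: the slides do not say how the generator combines options, unwinding appears elsewhere in the deck as a level rather than a flag, and the identifiers `OPTIONS`, `enumerate_variants`, and `variant_name` are hypothetical.

```python
from itertools import product
from typing import Dict, Iterator, Sequence

# Option names taken from the slide; treating each as a boolean is an assumption.
OPTIONS = (
    "streaming_store",
    "unwind_inner_loop",
    "local_variables",
    "unwind_outer_loop",
    "interleave",
    "parallel_openmp",
    "prefetch_to_cache",
    "prefetch_to_register",
)


def enumerate_variants(options: Sequence[str] = OPTIONS) -> Iterator[Dict[str, bool]]:
    """Yield one dict per combination of enabled and disabled options."""
    for flags in product((False, True), repeat=len(options)):
        yield dict(zip(options, flags))


def variant_name(variant: Dict[str, bool]) -> str:
    """Build a readable label from the enabled options."""
    enabled = [name for name, on in variant.items() if on]
    return "+".join(enabled) if enabled else "baseline"
```

With eight boolean options this yields 2^8 = 256 variants, which is small enough to test exhaustively; adding unwinding levels would multiply that count.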

AES Code Generator
• Implementation:
  • Python wrapper
  • C++ application
  • PapiEx integration

AES Code Generator
• Testing:
  • Intel C Compiler
  • GNU C Compiler
  • 32-bit
  • 64-bit
  • Simulator

AES Code Generator
• Experimental results from:
  • Intel Core 2 Quad
  • Intel Core 2 Duo
  • Intel Pentium 4 Dual

Experimental Results
• So... what was learnt?

The Basics
[Figure: Cycles vs Variants Applied; axis label: Variant Applied]

Unwinding Outer Loop
[Figure: Level of Unwinding and Local Variables]

Parallel – Intel Core 2 Quad
[Figure: Level of Unwinding and Local Variables]

Parallel – Intel Core 2 Duo
[Figure: Level of Unwinding and Local Variables]

Generator Tunability
• Intel Core 2 Quad Core (64-bit)
• Intel Pentium 4 Dual Processor (64-bit)

Future Work
• Genetic search algorithms.
• Intel Shannon.
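The slides name genetic search only as future work and give no design, so the following is a generic textbook-style sketch rather than anything proposed by the author. It assumes a variant is encoded as a tuple of on/off flags and that `cost` returns a measured runtime for that tuple; `genetic_search` and all of its parameters are hypothetical.

```python
import random
from typing import Callable, Tuple

Genome = Tuple[bool, ...]


def genetic_search(cost: Callable[[Genome], float],
                   n_bits: int,
                   population: int = 20,
                   generations: int = 30,
                   mutation_rate: float = 0.05,
                   seed: int = 0) -> Genome:
    """Search flag combinations for a low-cost variant.

    Each generation keeps the better half of the population unchanged
    (elitism) and refills the rest with mutated one-point crossovers.
    """
    if n_bits < 2 or population < 4:
        raise ValueError("need n_bits >= 2 and population >= 4")
    rng = random.Random(seed)  # seeded so runs are reproducible
    pop = [tuple(rng.random() < 0.5 for _ in range(n_bits))
           for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=cost)
        parents = pop[: population // 2]
        children = []
        while len(parents) + len(children) < population:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_bits)  # one-point crossover position
            child = a[:cut] + b[cut:]
            # Flip each flag independently with probability mutation_rate.
            child = tuple((not bit) if rng.random() < mutation_rate else bit
                          for bit in child)
            children.append(child)
        pop = parents + children
    return min(pop, key=cost)
```

Because every call to `cost` would mean compiling and timing a variant, a real tuner would likely cache results per genome; that is omitted here to keep the sketch short.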

Questions?