Compressed Instruction Cache. Prepared by Nicholas Meloche, David Lautenschlager, and Prashanth Janardanan (Team Lugnuts).

Introduction We want to show that a processor's instruction code can be compressed after compilation and decompressed in real time during the processor's fetch cycle. Encoding is performed by a software encoder; decoding is performed by a hardware decoder.

Introduction [Diagram: Compiler → Assembler → Encoder (software) produce a compressed executable in memory; the Decoder (hardware) sits between memory and the processor's cache.] The encoder processes the machine code and compresses it. It also inserts a small set of instructions that tell the decoder how to decode. At run time, the decoder decompresses the machine code, and the processor receives the original instructions.

Motivation Previous work has focused on either encoding instructions [1], decoding instructions [2], or both, but without implementation [3].
[1] Cool Code for Hot RISC - Hampton and Zhang
[2] Instruction Cache Compression for Embedded Systems - Jin and Chen
[3] A Compression/Decompression Scheme for Embedded Systems - Nikolova, Chouliaras, and Nunez-Yanez

Motivation [Animation: program instructions are loaded into the instruction cache until it is full (CACHE FULL!), first unencoded, then after passing through the encoder.] Unencoded, a noticeable amount of the program is left out of cache; remember that amount. After encoding, more instructions fit into cache at a time. Fitting more of the program into cache decreases the likelihood of cache misses during the fetch cycle.

Motivation More code fits in cache = fewer cache misses. Fewer cache misses = faster average fetch time. This is useful for time-critical systems such as real-time embedded systems.
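To make the claim concrete (the numbers here are illustrative, not measured): average fetch time = hit time + miss rate x miss penalty. With a 1-cycle hit, a 50-cycle miss penalty, and a 5% miss rate, fetches average 1 + 0.05 * 50 = 3.5 cycles; if compression cuts the miss rate to 4%, the average drops to 1 + 0.04 * 50 = 3.0 cycles, roughly a 14% improvement.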

Hardware Design Decisions We used a VHDL model of the LEON2 processor, available under the GNU license. The decoder was also implemented in VHDL so that it could be integrated easily with the LEON2 processor.

Decoder Implementation The Decoder has three modes:
- No_Decode - each 32-bit fetch from memory is passed to the instruction fetch logic unchanged.
- Algorithm_Load - the header block on code in memory is processed to load the decode algorithm for the code that follows.
- Decode - memory is decoded, and reconstructed 32-bit instructions are passed to the instruction fetch logic.
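As a rough software model of this mode logic (our own C++ sketch with invented names; the actual decoder is VHDL in the LEON2 fetch path):

#include <cstdint>

enum class Mode { NoDecode, AlgorithmLoad, Decode };

struct Decoder {
    Mode mode = Mode::NoDecode;

    // One 32-bit word arrives from memory on each fetch.
    uint32_t onFetch(uint32_t word) {
        switch (mode) {
        case Mode::NoDecode:
            return word;               // pass through unchanged
        case Mode::AlgorithmLoad:
            loadTableEntry(word);      // header block programs the decode tables
            return 0;                  // nothing for the fetch logic yet
        case Mode::Decode:
        default:
            return decode(word);       // reconstruct the original 32-bit instruction
        }
    }

    void loadTableEntry(uint32_t word);   // fills the CAM/RAM tables (not shown)
    uint32_t decode(uint32_t word);       // CAM lookup + shift (sketched below)
};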

Decoder Implementation A variable shifter provides the required realignment. Two lookup-and-shift operations are performed each clock cycle to produce one 32-bit result per cycle. The Decoder contains input buffering to ensure one instruction output per clock cycle, unless there is a sustained run of uncompressible instructions in the input.

[Figure: CAM sample path. Compressed data in → 16-bit register and mux → 128 x 20 TCAM with RAM lookup, shift logic (shift 16), and PC increment logic → decoded instruction out.]

Decoder Implementation The core of the Decoder is a CAM (Content Addressable Memory):
- 8 bits of the incoming code are used to address the CAM.
- The CAM returns a corresponding 16-bit decode.
- The CAM also returns the shift required to left-align the next encoded instruction.
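A rough C++ model of the lookup-and-shift step (a sketch under our own naming assumptions; the real datapath performs two such lookups per clock to produce one 32-bit instruction):

#include <cstdint>

struct CamEntry {
    uint16_t decoded;  // the 16-bit decode returned by the CAM
    uint8_t  shift;    // bits consumed; left-aligns the next encoded instruction
};

// Toy direct-mapped stand-in for the TCAM (hypothetical contents); the
// hardware matches the top 8 bits of the encoded stream against 128 entries.
CamEntry camTable[256];
CamEntry camLookup(uint8_t key) { return camTable[key]; }

// bitBuffer holds the encoded stream left-aligned in a 64-bit word.
uint32_t decodeOneInstruction(uint64_t &bitBuffer) {
    uint32_t result = 0;
    for (int half = 0; half < 2; ++half) {        // two 16-bit halves per instruction
        uint8_t key = uint8_t(bitBuffer >> 56);   // top 8 bits address the CAM
        CamEntry e = camLookup(key);
        result = (result << 16) | e.decoded;
        bitBuffer <<= e.shift;                    // realign for the next lookup
    }
    return result;
}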

Encoding Scheme The computer is no better than its program. ~ Elting Elmore Morison

Encoder Implementation The encoder was written in C++. It chooses an encoding scheme based on an analysis of the file's content. The input is a file of instructions for the LEON2 processor; the output is the set of encoded instructions for the decoder to decode. The encoder also adds a set of instructions to the beginning of each output file, which communicates the decoding algorithm to the decoder.
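For illustration only (our own sketch, not the project's source), the content analysis amounts to counting 16-bit chunk frequencies and keeping the most common chunks as candidates for short codes:

#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

// Count how often each 16-bit chunk appears in the instruction stream,
// then pick the most frequent chunks for the decode table.
std::vector<uint16_t> pickCommonChunks(const std::vector<uint16_t> &chunks,
                                       std::size_t tableSize) {
    std::map<uint16_t, std::size_t> freq;
    for (uint16_t c : chunks) ++freq[c];

    std::vector<std::pair<uint16_t, std::size_t>> sorted(freq.begin(), freq.end());
    std::sort(sorted.begin(), sorted.end(),
              [](const auto &a, const auto &b) { return a.second > b.second; });

    std::vector<uint16_t> common;
    for (std::size_t i = 0; i < sorted.size() && i < tableSize; ++i)
        common.push_back(sorted[i].first);
    return common;
}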

Encoding Algorithm We experimented with using a Huffman tree to encode the files. But with a Huffman tree, the encoding can become 2^N bits deep (where N is the number of bits encoded)... a lot! [Figure: Huffman tree with common symbols A, B, and C near the root.]

Encoding Algorithm Instead, we cut the tree off short and lump everything below that point into an "uncompressed" case. Since the common symbols (A, B, and C in the figure) are still encoded in a short number of bits, we still get savings!

Encoding Implementation Empirical evidence suggested we encode 16 bits at a time. We chop off our Huffman tree at a depth of 8, so final encodings are at most 8 bits. An uncompressed chunk costs 8 marker bits plus the original 16 bits, 24 bits in total. We make up for this overhead with the compression elsewhere.
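A minimal sketch of this scheme (our own illustration; the table contents, names, and escape marker are hypothetical):

#include <cstdint>
#include <unordered_map>
#include <vector>

struct Code { uint8_t bits; uint8_t len; };   // code value and length, len <= 8

// ESCAPE marks a chunk we could not compress: 8 marker bits + 16 raw bits = 24 bits.
constexpr Code ESCAPE = {0xFF, 8};

// Append one 16-bit chunk to the output bitstream using the truncated
// Huffman table; fall back to the 24-bit escape form when it is absent.
void encodeChunk(uint16_t chunk,
                 const std::unordered_map<uint16_t, Code> &table,
                 std::vector<bool> &out) {
    auto it = table.find(chunk);
    if (it != table.end()) {
        const Code &c = it->second;                    // compressed: <= 8 bits
        for (int i = c.len - 1; i >= 0; --i)
            out.push_back((c.bits >> i) & 1);
    } else {
        for (int i = ESCAPE.len - 1; i >= 0; --i)      // 8-bit escape marker
            out.push_back((ESCAPE.bits >> i) & 1);
        for (int i = 15; i >= 0; --i)                  // original 16 bits verbatim
            out.push_back((chunk >> i) & 1);
    }
}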

Encoding Implementation Encoding takes three passes. First pass: analyze instructions in 16-bit chunks and record the locations of branch instructions and their targets.

Encoding Implementation Second pass: encode the instructions. Place target addresses at the beginning of a new instruction word, leave jumps unencoded, and work out where the new target instructions will be located. Third pass: write the encoding to an output file.
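In outline (a sketch with invented names and stubbed bodies, not the project's code), the pass structure looks like:

#include <cstdint>
#include <cstdio>
#include <vector>

struct BranchInfo { std::vector<std::size_t> branchAt, targetAt; };

// Pass 1: scan 16-bit chunks, recording branch and target locations (stubbed).
BranchInfo analyze(const std::vector<uint16_t> &chunks) {
    BranchInfo info;
    // ... real code would pattern-match LEON2 branch encodings here ...
    return info;
}

// Pass 2: encode chunks, starting each branch target on a fresh instruction
// word and leaving jumps unencoded (stubbed).
std::vector<uint8_t> encode(const std::vector<uint16_t> &chunks,
                            const BranchInfo &info) {
    std::vector<uint8_t> bytes;
    // ... truncated-Huffman encoding as sketched earlier ...
    return bytes;
}

// Pass 3: write the decode-table header followed by the encoded stream.
void writeOutput(const std::vector<uint8_t> &bytes, const char *path) {
    if (std::FILE *f = std::fopen(path, "wb")) {
        std::fwrite(bytes.data(), 1, bytes.size(), f);
        std::fclose(f);
    }
}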

Compression Analysis We used the test instruction sets that came with the GNU-licensed VHDL LEON2 processor distribution.

Results We are seeing 5% to 12% savings in instruction size. More compression could be realized if the algorithm descriptions (the headers the encoder adds) were compressed as well.

Conclusions There is an obtainable gain from pursuing compression this way. The hardware implementation is unobtrusive, a compiler could easily run the encoder after link time, and the savings are positive.

Questions? Team Lugnuts