High Performance Embedded Computing © 2007 Elsevier
Chapter 2, part 2: CPUs
Wayne Wolf

Topics
Memory systems.
  Memory component models.
  Caches and alternatives.
Code compression.

Generic memory block

Simple memory model
Core array is n rows x m columns.
Total area A = A_r + A_x + A_p + A_c.
  Row decoder area A_r = a_r m.
  Core area A_x = a_x mn.
  Precharge circuit area A_p = a_p m.
  Column decoder area A_c = a_c m.

Simple energy and delay models
Total delay Δ = Δ_setup + Δ_r + Δ_x + Δ_bit + Δ_c.
Total energy E = E_D + E_S.
  Static energy component E_S is a technology parameter.
  Dynamic energy E_D = E_r + E_x + E_p + E_c.
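A minimal Python sketch of these component models; all coefficient and energy values passed in are illustrative placeholders, not technology data:

    # Toy evaluation of the simple memory model above.
    # Coefficients (a_r, a_x, a_p, a_c) are made-up placeholders.

    def memory_area(n, m, a_r, a_x, a_p, a_c):
        """Total area A = A_r + A_x + A_p + A_c for an n-row x m-column core."""
        return a_r * m + a_x * m * n + a_p * m + a_c * m

    def memory_energy(E_r, E_x, E_p, E_c, E_S):
        """Total energy E = E_D + E_S, with dynamic energy E_D = E_r + E_x + E_p + E_c."""
        return (E_r + E_x + E_p + E_c) + E_S

    print(memory_area(n=1024, m=64, a_r=2.0, a_x=1.0, a_p=1.5, a_c=2.5))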

Multiport memories
[Figures: multiport memory structure; delay vs. memory size and number of ports.]

Kamble and Ghose cache power model
Cache is m-way set-associative, with a capacity of D bytes, T bits of tag, L bytes per line, and St status bits per block frame.
Bit line energy: [equation not reproduced in transcript]

Kamble/Ghose, cont’d.
Word line energy: [equation not reproduced in transcript]
Output line energy: [equation not reproduced in transcript]
Address input lines: [equation not reproduced in transcript]

Shiue and Chakrabarti cache energy model
add_bs: number of transitions on the address bus per instruction.
data_bs: number of transitions on the data bus per instruction.
word_line_size: number of memory cells on a word line.
bit_line_size: number of memory cells on a bit line.
E_m: energy consumption of a main memory access.
The remaining coefficients are technology parameters.

Shiue/Chakrabarti, cont’d. [equations not reproduced in transcript]

Register files
The first stage in the memory hierarchy.
When too many values are live, some values must be spilled to main memory and read back later.
  Spills cost time and energy.
Register file parameters:
  Number of words.
  Number of ports.

Performance and energy vs. register file size. [Weh01] © 2001 IEEE

Cache size vs. energy. [Li98] © 1998 IEEE

Cache parameters
Cache size:
  Larger caches hold more data, but burn more energy and take area away from other functions.
Number of sets:
  More sets allow more independent references; fewer sets map more locations onto each line.
Cache line length:
  Longer lines give more prefetching bandwidth, but higher energy consumption.
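These parameters are tied together by the cache geometry; a small sketch (the 8 KB configuration is just an example):

    def cache_sets(capacity_bytes, line_bytes, ways):
        """Number of sets = capacity / (line size x associativity)."""
        assert capacity_bytes % (line_bytes * ways) == 0
        return capacity_bytes // (line_bytes * ways)

    print(cache_sets(8 * 1024, 32, 2))   # 8 KB, 2-way, 32-byte lines -> 128 sets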

Wolf/Lam classification of program behavior in caches
Self-temporal reuse: the same array element is accessed in different loop iterations.
Self-spatial reuse: the same cache line is accessed in different loop iterations.
Group-temporal reuse: different parts of the program access the same array element.
Group-spatial reuse: different parts of the program access the same cache line.
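Toy Python loops illustrating the reuse categories (real analyses target C or Fortran loop nests; the array and bounds here are only for illustration):

    N = 64
    a = [[0.0] * N for _ in range(N)]

    s = 0.0
    for i in range(N):
        for j in range(N):
            s += a[i][j]    # self-spatial: successive j touch the same cache line
            s += a[j][0]    # self-temporal: a[j][0] is revisited on every i iteration

    for i in range(N):
        s += a[i][0]        # group-temporal: a different part of the program
                            # touches the same elements as the loop above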

Multilevel cache optimization
Gordon-Ross et al. adjust cache parameters in order:
  Cache size.
  Line size.
  Associativity.
Tune cache size for the first level, then the second; then line size for the first level, then the second; then associativity for the first level, then the second (a sketch of this search order follows).
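A minimal sketch of that search order; evaluate() is a hypothetical hook into a cache simulator returning a cost such as energy, and the candidate values are invented:

    CANDIDATES = {                      # illustrative search ranges
        "L1_size":  [1024, 2048, 4096, 8192],
        "L2_size":  [8192, 16384, 32768],
        "L1_line":  [16, 32, 64],
        "L2_line":  [32, 64, 128],
        "L1_assoc": [1, 2, 4],
        "L2_assoc": [1, 2, 4, 8],
    }

    # Explore parameters in the order above: sizes (L1 then L2),
    # then line sizes, then associativities.
    ORDER = ["L1_size", "L2_size", "L1_line", "L2_line", "L1_assoc", "L2_assoc"]

    def tune(evaluate):
        """evaluate(config) -> cost; a hypothetical simulator hook."""
        config = {p: vals[0] for p, vals in CANDIDATES.items()}
        for p in ORDER:
            best = min(CANDIDATES[p], key=lambda v: evaluate({**config, p: v}))
            config[p] = best            # freeze this parameter, move to the next
        return config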

Scratch pad memory
The scratch pad is managed by software, not hardware.
  Provides predictable access time.
  Requires values to be explicitly allocated.
Standard read/write instructions are used to access the scratch pad.
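Because allocation is done in software, a compiler or programmer can, for example, greedily place the objects with the highest access density on chip. A minimal sketch with made-up objects:

    def allocate_scratchpad(objects, capacity):
        """objects: list of (name, size_bytes, accesses); returns names placed on-chip."""
        placed, used = [], 0
        # Highest accesses-per-byte first.
        for name, size, accesses in sorted(objects, key=lambda o: o[2] / o[1], reverse=True):
            if used + size <= capacity:
                placed.append(name)
                used += size
        return placed

    objs = [("coeffs", 256, 9000), ("frame", 4096, 12000), ("lut", 512, 300)]
    print(allocate_scratchpad(objs, 2048))   # -> ['coeffs', 'lut'] (frame is too large)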

Code compression
An extreme version of instruction encoding:
  Use variable-length instruction encodings.
  Generate the encodings using data compression algorithms.
Compressed instructions generally take longer to decode.
Can yield performance, energy, and code size improvements.
IBM CodePack (PowerPC) used Huffman encoding.

Terms
Compression ratio:
  (Compressed code size / uncompressed code size) x 100%.
  Must take into account all overheads.
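For example, counting a (hypothetical) branch-table overhead against the compressed size:

    def compression_ratio(compressed_bytes, original_bytes):
        """Compression ratio as defined above; smaller is better."""
        return compressed_bytes / original_bytes * 100.0

    # 24 KB of compressed code plus a 2 KB branch table, from 40 KB of original code:
    print(compression_ratio(24_000 + 2_000, 40_000))   # -> 65.0 (%)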

Wolfe/Chanin approach
Object code is fed to a lossless compression algorithm.
  Wolfe and Chanin used Huffman’s algorithm.
The compressed object code becomes the program image.
Code is decompressed on the fly during execution.
[Flow: source code -> compiler -> object code -> compressor -> compressed object code.]

Wolfe/Chanin execution
Instructions are decompressed when read from main memory.
  Data is not compressed or decompressed.
The cache holds uncompressed instructions.
  Instruction fetches from main memory see longer latency.
The CPU does not require significant modifications.
[Flow: main memory -> decompressor -> cache -> CPU.]

Huffman coding
The input stream is a sequence of symbols, each with a known probability of occurrence.
Construct a binary tree of probabilities from the bottom up.
  The path from the root to a symbol gives that symbol’s code.
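A compact sketch of the bottom-up construction (the symbol probabilities are invented for illustration):

    import heapq

    def huffman_codes(probs):
        """Build a Huffman code from {symbol: probability}; returns {symbol: bitstring}."""
        # Each heap entry: (probability, tie_breaker, {symbol: code_so_far}).
        heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
        heapq.heapify(heap)
        while len(heap) > 1:
            p0, _, c0 = heapq.heappop(heap)   # two least-probable subtrees...
            p1, i, c1 = heapq.heappop(heap)
            merged = {s: "0" + c for s, c in c0.items()}
            merged.update({s: "1" + c for s, c in c1.items()})
            heapq.heappush(heap, (p0 + p1, i, merged))   # ...merge into one
        return heap[0][2]

    print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))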

Wolfe/Chanin results. [Wol92] © 1992 IEEE

Compressed vs. uncompressed code
Branches require code to be decompressed from many different starting points.
Code compression algorithms are designed to decode from the start of a stream.
Compressed code is therefore organized into blocks.
  Decompression starts at the beginning of a block.
Unused bits between blocks constitute overhead.
[Figure: a code fragment (add r1, r2, r3; mov r1, a; bne r1, foo) shown in uncompressed and compressed form.]

Block structure and compression
Trade-off:
  Compression algorithms work best on long blocks.
  Program branching works best with short blocks.
Labels in the program move during compression.
Two approaches:
  Wolfe and Chanin used a branch table to translate branch targets during execution (adds code size).
  Lefurgy et al. patched the compressed code so that branches refer to compressed locations.
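A sketch of block-structured compression: each fixed-size block is compressed independently, and a table of block offsets plays the role of the address-translation table. zlib and the 32-byte block size are stand-ins, not the encoders or parameters used in this work:

    import zlib

    BLOCK = 32   # bytes per independently decompressible block (illustrative)

    def compress_blocks(code):
        """Compress `code` in fixed-size blocks; returns (blob, offsets), where
        offsets[i] is where block i starts in the blob."""
        blob, offsets = b"", []
        for i in range(0, len(code), BLOCK):
            offsets.append(len(blob))
            blob += zlib.compress(code[i:i + BLOCK])
        return blob, offsets

    def fetch_block(blob, offsets, i):
        """Decompress only block i; a branch target need not decode the whole stream."""
        end = offsets[i + 1] if i + 1 < len(offsets) else len(blob)
        return zlib.decompress(blob[offsets[i]:end])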

Compression ratio vs. block size. [Lek99b] © 1999 IEEE

Compression formats
Lefurgy et al. used the first four bits to define the length of the compressed sequence (8, 12, 16, 23 bits).
Ishiura and Yamaguchi automatically extracted fields from instructions to optimize the encoding.
Larin and Conte tailored the encoding of each field to the range of values used in that field by the program.

Pre-cache compression
Decompress instructions as they come out of the cache.
  One instruction must be decompressed many times.
  The program has a smaller cache footprint.

Encoding algorithms
The data compression community has developed a large number of compression algorithms.
These algorithms were designed under different constraints:
  Large text files.
  No real-time or power constraints.
Evaluate existing algorithms under the requirements of code compression, and develop new algorithms where needed.

Energy savings evaluation
Yoshida et al. used dictionary-based encoding.
Power reduction ratio: [equation not reproduced in transcript]
  N: number of instructions in the original program.
  m: bit width of those instructions.
  n: number of compressed instructions.
  k: ratio of on-chip to off-chip memory power dissipation.

Arithmetic coding
Huffman coding maps symbols onto the integer number line; arithmetic coding maps symbols onto the real number line.
  It can handle arbitrarily fine distinctions in symbol probabilities.
A table-based method allows fixed-point arithmetic to be used. [Lek99c] © 1999 IEEE
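A toy floating-point encoder showing the interval narrowing; real implementations use the fixed-point, table-based method noted above, and the probabilities here are invented:

    # Narrow [low, high) by each symbol's probability slice.
    def arith_encode(symbols, probs):
        # Cumulative probability intervals per symbol, e.g. a -> [0, 0.7), b -> [0.7, 1.0).
        cum, start = {}, 0.0
        for s, p in probs.items():
            cum[s] = (start, start + p)
            start += p
        low, high = 0.0, 1.0
        for s in symbols:
            lo, hi = cum[s]
            span = high - low
            low, high = low + span * lo, low + span * hi
        return (low + high) / 2     # any number in [low, high) identifies the string

    print(arith_encode("aab", {"a": 0.7, "b": 0.3}))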

Markov models
A Markov state machine allows us to define conditional probabilities for sequences of symbols.
The state of the Markov model encodes a subset of the previously seen sequence.
Transitions out of each state are conditioned on the next symbol.
Transition probabilities vary from state to state.
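A minimal sketch of a depth-k Markov model over a bit stream, where the state is the last k bits and each state estimates P(next bit = 1 | state); conditional probabilities of this kind are what drive a coder's interval splits:

    from collections import defaultdict

    def markov_bit_model(bits, k=2):
        counts = defaultdict(lambda: [1, 1])        # Laplace-smoothed [zeros, ones]
        state = (0,) * k
        for b in bits:
            counts[state][b] += 1
            state = state[1:] + (b,)                # slide the k-bit window
        # Convert counts to conditional probabilities of seeing a 1.
        return {s: c[1] / (c[0] + c[1]) for s, c in counts.items()}

    print(markov_bit_model([0, 1, 0, 1, 0, 1, 0, 1]))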

Arithmetic coding and Markov models
Lekatsas and Wolf combined arithmetic coding with Markov models (SAMC).
The Markov model has limited depth to avoid blow-up.
  Long bit sequences wrap around both horizontally and vertically.
  The model depth should be a divisor or multiple of the instruction size.

SAMC results. [Lek99a] © 1999 IEEE

Tunstall coding
Tunstall coding transforms variable-length strings into equal-length codes.
The coding tree has 2^N leaf nodes.
  The depth of the tree varies.
Xie and Wolf added a Markov model to Tunstall coding.
  This allows parallel decoding of segments of the codeword.
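A sketch of the dictionary construction: repeatedly expand the most probable leaf until the tree has (up to) 2^N leaves, then give each leaf string a fixed N-bit code. The source alphabet and probabilities are made up:

    import heapq

    def tunstall(probs, nbits):
        """Build a Tunstall dictionary: up to 2**nbits variable-length strings,
        each mapped to a fixed nbits-bit code."""
        heap = [(-p, s) for s, p in probs.items()]   # max-heap via negated probabilities
        heapq.heapify(heap)
        # Expanding one leaf removes it and adds len(probs) children.
        while len(heap) + len(probs) - 1 <= 2 ** nbits:
            p, string = heapq.heappop(heap)          # most probable leaf
            for s, ps in probs.items():
                heapq.heappush(heap, (p * ps, string + s))
        strings = [s for _, s in heap]
        return {s: format(i, f"0{nbits}b") for i, s in enumerate(sorted(strings))}

    print(tunstall({"a": 0.7, "b": 0.3}, 3))   # 8 strings, each given a 3-bit code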

Tunstall/Markov coding results. [Xie02] © 2002 IEEE

Dictionary-based methods
Liao et al. identified common code sequences and synthesized them into subroutines.
  They also proposed a hardware implementation.
Kirovski et al. proposed a procedure cache for software-controlled code compression.
  A handler maps procedure identifiers to code during execution.
  The handler also manages free space.
Chen et al.: software-controlled Java bytecode compression.
Lefurgy et al. proposed an exception mechanism to manage compressed code in the cache.

Lefurgy et al.: execution time vs. instruction cache miss ratio. [Lef00] © 2000 IEEE

Lefurgy et al.: selective compression results.

Code and data compression
Unlike (non-modifiable) code, data must be compressed and decompressed dynamically.
Can substantially reduce cache footprints.
Requires different trade-offs.

Lempel-Ziv algorithm
A dictionary-based method.
The decoder builds the dictionary during the decompression process.
The LZW variant uses a fixed-size buffer.
[Flow: source text -> coder (with dictionary) -> compressed text -> coder (with dictionary) -> uncompressed source.]

Lempel-Ziv example
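The worked example on this slide is not reproduced in the transcript; in its place, a minimal LZW-style compressor (the fixed-size-buffer behavior of real LZW is omitted for brevity):

    def lzw_compress(data):
        dictionary = {bytes([i]): i for i in range(256)}   # start with all single bytes
        w, out = b"", []
        for byte in data:
            wc = w + bytes([byte])
            if wc in dictionary:
                w = wc                                     # extend the current match
            else:
                out.append(dictionary[w])                  # emit code for longest match
                dictionary[wc] = len(dictionary)           # learn the new string
                w = bytes([byte])
        if w:
            out.append(dictionary[w])
        return out

    print(lzw_compress(b"abababab"))   # repeated text compresses to a few codes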

MXT
The MXT system of Tremaine et al. has a three-level cache system.
  Level 3 is shared among several processors and is connected to main memory.
  Data and code are compressed and uncompressed as they move between main memory and the level 3 cache.
Uses a variant of the Lempel-Ziv 1977 algorithm.
  All compression engines share the same dictionary.
  Typically, 1 KB blocks are divided into 256-byte compression blocks.
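A rough sketch of the sectoring only; zlib stands in for MXT's LZ77 derivative, and the shared-dictionary parallel engines of the real hardware are not modeled:

    import zlib

    def compress_1kb_block(block):
        """Compress a 1 KB memory block as four independent 256-byte sectors."""
        assert len(block) == 1024
        return [zlib.compress(block[i:i + 256]) for i in range(0, 1024, 256)]

    block = bytes(1024)                   # a trivially compressible block of zeros
    sectors = compress_1kb_block(block)
    print([len(s) for s in sectors])      # each sector shrinks well below 256 bytes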

Other applications
Benini et al. evaluated the energy savings of post-cache decompression.
  A simple dictionary gave 35% energy savings.
Lekatsas et al. combined data and code compression with encryption.
  A modified operating system performs compression and encryption at the proper point in the memory access process.