High Performance Embedded Computing © 2007 Elsevier
Lecture 7: Memory Systems & Code Compression
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte. Based on slides and textbook by Wayne Wolf.

Topics
- Branching in VLIW (from last time).
- Memory systems.
  - Memory component models.
  - Caches and alternatives.
- Code compression.

Branching in VLIW Processors
- Operations in the same VLIW instruction as a branch must not depend on the branch outcome.
- VLIWs may use static branch prediction.
  - Uses profiling to predict the branch direction.
  - Different instructions can be selected depending on whether the branch is predicted taken or not taken.
- VLIWs sometimes have specialized loop instructions, which execute an instruction word, or set of instruction words, a specified number of times.
- VLIWs sometimes have predicated instructions, which execute conditionally based on the value in a "predicate register" (see the C sketch below):

    cmpgt p1 = r1, r2;;
    sub (p1) r3 = r1, r2, sub (!p1) r3 = r2, r1;;
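As a C-level intuition for the predicated sequence above, here is a minimal sketch of if-conversion. The absdiff function and its register-named variables are hypothetical illustrations, not part of the slides; real predication is an ISA feature, and a VLIW would issue both guarded subtractions in the same instruction word.

    #include <stdio.h>

    /* If-converted absolute difference: both results are computed and the
       predicate selects between them, mirroring cmpgt/sub (p1)/sub (!p1). */
    static int absdiff(int r1, int r2) {
        int p1 = (r1 > r2);             /* cmpgt p1 = r1, r2      */
        int taken = r1 - r2;            /* sub (p1)  r3 = r1, r2  */
        int not_taken = r2 - r1;        /* sub (!p1) r3 = r2, r1  */
        return p1 ? taken : not_taken;  /* predicate selects r3   */
    }

    int main(void) {
        printf("%d %d\n", absdiff(7, 3), absdiff(3, 7));  /* prints: 4 4 */
        return 0;
    }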

Branching in VLIW Processors, cont'd.
- VLIWs often execute branches in two phases:
  - Prepare the branch condition (store it in a branch register).
  - Execute the branch based on the condition.

    ## Implements: if (a > 0 && b < 0) { [Block 1] }
    ## where a is in $r1, b is in $r2, and 0 is in $r0
    cmpgt $r3 = $r1, $r0    ## (a > 0)
    cmplt $r4 = $r2, $r0    ## (b < 0)
    ;;
    and $b0 = $r3, $r4
    ;;
    br $b0, L1 ;;           ## L1 starts Block 1

Generic memory block [figure]

Simple memory model
- Core array is n rows × m columns.
- Total area A = A_r + A_x + A_p + A_c.
  - Row decoder area A_r = a_r n.
  - Core area A_x = a_x mn.
  - Precharge circuit area A_p = a_p m.
  - Column decoder area A_c = a_c m.

Simple energy and delay models
- Delay Δ = Δ_setup + Δ_r + Δ_x + Δ_bit + Δ_c.
- Total energy E = E_D + E_S.
  - Static energy component E_S is a technology parameter.
  - Dynamic energy E_D = E_r + E_x + E_p + E_c.
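As a worked illustration of the two model slides above, here is a minimal C sketch that evaluates the area and energy formulas. All coefficient values are made-up placeholders for demonstration, not parameters from the book.

    #include <stdio.h>

    struct mem_params {
        int n, m;                  /* core array: n rows x m columns   */
        double a_r, a_x, a_p, a_c; /* per-unit area coefficients       */
        double e_r, e_x, e_p, e_c; /* per-access dynamic energy terms  */
        double e_s;                /* static energy (technology param) */
    };

    /* Total area A = A_r + A_x + A_p + A_c */
    static double area(const struct mem_params *p) {
        return p->a_r * p->n         /* row decoder    */
             + p->a_x * p->m * p->n  /* core array     */
             + p->a_p * p->m         /* precharge      */
             + p->a_c * p->m;        /* column decoder */
    }

    /* Total energy E = E_D + E_S, with E_D = E_r + E_x + E_p + E_c */
    static double energy(const struct mem_params *p) {
        return (p->e_r + p->e_x + p->e_p + p->e_c) + p->e_s;
    }

    int main(void) {
        struct mem_params p = { 256, 64, 1.0, 0.1, 0.5, 0.8,
                                2.0, 5.0, 1.0, 1.5, 0.3 };
        printf("area = %.1f units, energy/access = %.2f units\n",
               area(&p), energy(&p));
        return 0;
    }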

Multiport memories - delay
[Figure: delay vs. memory size and number of ports, for a multiport memory with two read ports (R_addr1/R_data1, R_addr2/R_data2) and one write port (W_addr/W_data).]

Multiport memories - area
- The area of multiport memories often grows with the square of the number of ports: each added port needs its own word line and bit lines in every cell, so cell width and height both grow roughly linearly with port count.

Kamble and Ghose cache power model

Kamble/Ghose, cont'd.
- Cache is m-way set-associative, with a capacity of D bytes, T tag bits, L-byte lines, and St status bits per line.
- The bit-line energy is computed from these parameters. (Note: the corresponding equation in the book is incorrect; the equation is not reproduced in this transcript.)

Kamble/Ghose, cont'd.
- Word line energy, output line energy, and address input line energy are modeled analogously. [Equations not reproduced in the transcript.]

Shiue and Chakrabarti cache energy model
- add_bs: number of transitions on the address bus per instruction.
- data_bs: number of transitions on the data bus per instruction.
- word_line_size: number of memory cells on a word line.
- bit_line_size: number of memory cells on a bit line.
- E_m: energy consumption of a main memory access.
- Additional technology parameters (symbols lost in the transcript).

Shiue/Chakrabarti, cont'd. [Equations not reproduced in the transcript; a schematic sketch follows.]
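Since the published equations did not survive the transcript, here is a schematic C sketch of how an analytical model combines the parameters listed above. The functional form and all coefficients are illustrative assumptions, not Shiue and Chakrabarti's actual equations.

    #include <stdio.h>

    struct cache_stats {
        double add_bs;       /* address-bus transitions per instruction */
        double data_bs;      /* data-bus transitions per instruction    */
        int word_line_size;  /* memory cells on a word line             */
        int bit_line_size;   /* memory cells on a bit line              */
        double miss_rate;    /* misses per cache access                 */
        double e_m;          /* energy of one main-memory access        */
    };

    /* Schematic per-instruction energy: array energy scales with the
       word/bit lines toggled per access, bus energy with transition
       counts, and each miss adds a main-memory access. */
    static double energy_per_insn(const struct cache_stats *c,
                                  double e_cell, double e_bus) {
        double e_array = e_cell * (c->word_line_size + c->bit_line_size);
        double e_buses = e_bus * (c->add_bs + c->data_bs);
        return e_array + e_buses + c->miss_rate * c->e_m;
    }

    int main(void) {
        struct cache_stats c = { 4.2, 6.8, 128, 256, 0.05, 50.0 };
        printf("estimated energy/instruction = %.2f units\n",
               energy_per_insn(&c, 0.01, 0.1));
        return 0;
    }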

Register files
- First stage in the memory hierarchy.
- When too many values are live, some values must be spilled to cache/main memory and read back later.
  - Spills cost time and energy.
- Register file parameters:
  - Number of words.
  - Number of ports.

Performance and energy vs. register file size [Weh01] © 2001 IEEE

Cache size vs. energy [Li98] © 1998 IEEE

Cache parameters
- Cache size: larger caches hold more data, but require more static energy and take area away from other functions.
- Number of sets: more sets allow more independent references; fewer sets map more locations onto each line.
- Cache line length: longer lines give more prefetching bandwidth, but often result in higher energy consumption.
- What impact does each of these have on energy?

Multilevel cache optimization
- Gordon-Ross et al. adjust cache parameters in order:
  - Cache size.
  - Line size.
  - Associativity.
- Design the cache size for the first level, then the second level; then line size for the first, then second level; then associativity for the first, then second level.
- Why vary the parameters in this order?

Scratch pad memory
- Scratch pad is managed by software, not hardware.
  - Provides predictable access time.
  - Requires values to be explicitly allocated to it.
- Standard read/write instructions access the scratch pad (see the sketch below).
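A minimal sketch of how software places data in a scratch pad on a GCC-style toolchain. The section name ".scratchpad" and the matching region in the linker script are assumptions about the target, not a standard; the compiler emits ordinary loads and stores either way.

    #include <string.h>

    /* Place a hot coefficient buffer in the scratch pad by assigning it
       to a dedicated linker section (assumed to be mapped to the
       scratch-pad address range in the linker script). */
    static int coeffs[64] __attribute__((section(".scratchpad")));

    void load_coeffs(const int *src) {
        /* Ordinary memory instructions access the scratch pad; software,
           not hardware, decides what lives here and when. */
        memcpy(coeffs, src, sizeof coeffs);
    }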

Code compression
- An extreme version of instruction encoding:
  - Use variable-length bit encodings for instructions.
  - Generate the encodings using compression algorithms.
- Generally takes longer to decode.
- Can result in performance, energy, and code size improvements. How?
- IBM CodePack (PowerPC) used Huffman encoding.

Terms
- Compression ratio: (compressed code size / uncompressed code size) × 100%; smaller is better. For example, 60 KB of compressed code from a 100 KB original gives a 60% compression ratio.
- Must take into account all overheads.

Wolfe/Chanin approach
- Object code is fed to a lossless compression algorithm.
  - Wolfe/Chanin used Huffman's algorithm.
- The compressed object code becomes the program image.
- Code is decompressed on-the-fly during execution.
[Flow: source code → compiler → object code → compressor → compressed object code.]

Wolfe/Chanin execution
- Instructions are decompressed when read from main memory.
  - Data is not compressed or decompressed.
- The cache holds uncompressed instructions.
  - Longer latency to fetch instructions from memory.
- The CPU does not require significant modifications.
[Figure: CPU - cache - decompressor - main memory; the decompressor sits between main memory and the cache.]

Huffman coding
- The input stream is a sequence of symbols, each with a known probability of occurrence.
- Construct a binary tree of probabilities from the bottom up, repeatedly merging the two least-probable nodes.
  - The path from the root to a symbol gives the code for that symbol.
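A minimal C sketch of the bottom-up construction, assuming a small fixed symbol set with made-up probabilities; production decoders such as CodePack's use table-driven implementations instead.

    #include <stdio.h>
    #include <stdlib.h>

    #define NSYM 4

    typedef struct Node {
        double prob;
        int sym;                   /* >= 0 for leaves, -1 for internal */
        struct Node *left, *right;
    } Node;

    /* Remove and return the lowest-probability node from the active list. */
    static Node *take_min(Node **list, int *n) {
        int best = 0;
        for (int i = 1; i < *n; i++)
            if (list[i]->prob < list[best]->prob) best = i;
        Node *min = list[best];
        list[best] = list[--(*n)];
        return min;
    }

    /* Walk root-to-leaf paths: left edge = '0', right edge = '1'. */
    static void print_codes(const Node *t, char *buf, int depth) {
        if (t->sym >= 0) {
            buf[depth] = '\0';
            printf("symbol %d: %s\n", t->sym, buf);
            return;
        }
        buf[depth] = '0'; print_codes(t->left, buf, depth + 1);
        buf[depth] = '1'; print_codes(t->right, buf, depth + 1);
    }

    int main(void) {
        double p[NSYM] = { 0.5, 0.25, 0.15, 0.10 };  /* assumed probabilities */
        Node *active[NSYM];
        int n = NSYM;
        for (int i = 0; i < n; i++) {
            active[i] = malloc(sizeof(Node));
            *active[i] = (Node){ p[i], i, NULL, NULL };
        }
        /* Bottom-up build: repeatedly merge the two least-probable nodes. */
        while (n > 1) {
            Node *a = take_min(active, &n);
            Node *b = take_min(active, &n);
            Node *parent = malloc(sizeof(Node));
            *parent = (Node){ a->prob + b->prob, -1, a, b };
            active[n++] = parent;
        }
        char buf[NSYM];
        print_codes(active[0], buf, 0);
        return 0;
    }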

Wolfe/Chanin results [Wol92] © 1992 IEEE

Compressed vs. uncompressed code
- Branches require code to be decompressed starting from many different points, but compression algorithms are designed to decode from the start of a stream.
- Compressed code is therefore organized into blocks.
  - Decompression starts at the beginning of a block.
- Unused bits between blocks constitute overhead.
[Figure: uncompressed vs. compressed layout of the sequence add r1, r2, r3 / mov r1, a / bne r1, foo.]

Block structure and compression
- Trade-off:
  - Compression algorithms work best on long blocks.
  - Program branching works best with short blocks.
- Labels in the program move during compression. Two approaches:
  - Wolfe and Chanin used a branch table to translate branch targets during execution (adds code size).
  - Lefurgy et al. patched the compressed code so branches refer directly to compressed locations.

Compression ratio vs. block size [Lek99b] © 1999 IEEE

Pre-cache compression
- Instructions are decompressed as they come out of the cache, so the cache holds compressed code.
  - One instruction may be decompressed many times.
  - The program has a smaller cache footprint.
- Why might a different type of decompression engine be needed?

Compression algorithms
- There are a large number of algorithms for compressing data, but they were designed under different constraints:
  - Large text files.
  - No real-time or power constraints.
- Existing algorithms must be evaluated under the requirements of code compression, and new algorithms developed.
  - Several of these new algorithms are discussed in the textbook.

Code and data compression
- Unlike (non-modifiable) code, data must be compressed and decompressed dynamically.
  - Data is compressed before it is written back to main memory.
- Can substantially reduce main memory or cache footprints.
- Requires different trade-offs.

Lempel-Ziv algorithm
- Dictionary-based method; the decoder rebuilds the dictionary during decompression, so no dictionary needs to be transmitted.
- The LZW variant uses a fixed-size dictionary buffer.
[Figure: source text → coder (with dictionary) → compressed text → decoder (with dictionary) → uncompressed source.]
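A minimal LZ78-style encoder sketch in C, showing how the dictionary grows out of the emitted token stream itself (which is what lets the decoder rebuild it). The sample input and fixed-size tables are assumptions for illustration; real LZW differs in details (fixed-width codes, preloaded single-character dictionary).

    #include <stdio.h>
    #include <string.h>

    #define MAXDICT 256
    #define MAXLEN  64

    /* dict[0] is the empty phrase; later entries were created by
       earlier emitted tokens, so the decoder can mirror this table. */
    static char dict[MAXDICT][MAXLEN];
    static int ndict = 1;

    int main(void) {
        const char *src = "abababababa";  /* assumed sample input */
        int match = 0;                    /* index of longest matched phrase */

        for (size_t i = 0; src[i] != '\0'; i++) {
            char cand[MAXLEN];
            snprintf(cand, sizeof cand, "%s%c", dict[match], src[i]);

            int found = 0;
            for (int j = 1; j < ndict; j++)
                if (strcmp(dict[j], cand) == 0) { found = j; break; }

            if (found) {
                match = found;            /* keep extending the phrase */
            } else {
                /* Emit (phrase index, next char) and grow the dictionary. */
                printf("(%d, '%c')\n", match, src[i]);
                if (ndict < MAXDICT) strcpy(dict[ndict++], cand);
                match = 0;
            }
        }
        if (match)                        /* flush a trailing matched phrase */
            printf("(%d, -)\n", match);
        return 0;
    }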

Lempel-Ziv example [figure]

MXT
- MXT (Tremaine et al.) has a three-level cache system.
  - Level 3 is shared among several processors and connected to main memory.
  - Data and code are compressed/decompressed as they move between main memory and the level 3 cache.
- Uses a variant of the Lempel-Ziv 1977 algorithm.
  - All compression engines share the same dictionary.
  - Typically, 1 KB blocks are divided into four 256-byte blocks that can be compressed and decompressed in parallel.