High Performance Embedded Computing © 2007 Elsevier
Lecture 7: Memory Systems & Code Compression
Embedded Computing Systems
Mikko Lipasti, adapted from M. Schulte
Based on slides and textbook from Wayne Wolf
Topics
Branching in VLIW (from last time).
Memory systems.
Memory component models.
Caches and alternatives.
Code compression.
Branching in VLIW Processors
Operations in the same VLIW instruction as a branch must not depend on the branch outcome.
VLIWs may use static branch prediction:
  Profiling is used to predict the branch direction.
  Different instructions are emitted depending on whether the branch is predicted taken or not taken.
VLIWs sometimes have specialized loop instructions, which execute an instruction word or set of instruction words a specified number of times.
VLIWs sometimes have predicated instructions, which execute conditionally based on the value in a "predicate register":

cmpgt p1 = r1, r2 ;;
sub (p1) r3 = r1, r2
sub (!p1) r3 = r2, r1 ;;
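In software terms, the predicated pair above computes |r1 - r2| without a branch. A minimal Python sketch of the same semantics (register names are illustrative; in hardware both subtractions issue, and the predicate gates which result is committed):

def predicated_sub(r1, r2):
    p1 = r1 > r2          # cmpgt: set predicate register p1
    a = r1 - r2           # sub (p1): issues unconditionally
    b = r2 - r1           # sub (!p1): issues unconditionally
    r3 = a if p1 else b   # only the result whose predicate holds is committed
    return r3

assert predicated_sub(7, 3) == 4
assert predicated_sub(3, 7) == 4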
Branching in VLIW Processors
VLIWs often execute branches in two phases:
  Prepare the branch condition (store it in a branch register).
  Execute the branch based on the condition.

## Implements: if (a > 0 && b < 0) { [Block 1] }
## where a is in $r1, b is in $r2, and 0 is in $r0
cmpgt $r3 = $r1, $r0   ## (a > 0)
cmplt $r4 = $r2, $r0   ## (b < 0)
;;
and $b0 = $r3, $r4
;;
br $b0, L1 ;;          ## L1 starts Block 1
Generic memory block (figure).
Simple memory model
Core array is n rows × m columns.
Total area A = A_r + A_x + A_p + A_c:
  Row decoder area A_r = a_r m.
  Core area A_x = a_x m n.
  Precharge circuit area A_p = a_p m.
  Column decoder area A_c = a_c m.
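A minimal Python sketch of this area model (the a_* coefficients are technology-dependent placeholders, not real process parameters):

def memory_area(n, m, a_r, a_x, a_p, a_c):
    # Total area of an n-row x m-column core, per the simple model.
    area_row_decoder = a_r * m
    area_core = a_x * m * n
    area_precharge = a_p * m
    area_col_decoder = a_c * m
    return area_row_decoder + area_core + area_precharge + area_col_decoder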
Simple energy and delay models
Total delay Δ = Δ_setup + Δ_r + Δ_x + Δ_bit + Δ_c (setup, row decode, core access, bit line, column decode).
Total energy E = E_D + E_S.
Static energy component E_S is a technology parameter.
Dynamic energy E_D = E_r + E_x + E_p + E_c.
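The same structure in Python, as a sketch (the per-component values are inputs; no attempt is made here to model how each term scales with n and m):

def memory_delay(d_setup, d_r, d_x, d_bit, d_c):
    # Access delay is the serial sum of the access steps.
    return d_setup + d_r + d_x + d_bit + d_c

def memory_energy(e_r, e_x, e_p, e_c, e_static):
    # Dynamic energy sums the per-component energies;
    # static energy is a technology parameter added on top.
    return (e_r + e_x + e_p + e_c) + e_static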
Multiport memories - delay
Delay vs. memory size and number of ports (figure).
(Block diagram: a multiport memory with two read ports, R_addr1/R_data1 and R_addr2/R_data2, and one write port, W_addr/W_data.)
Multiport memories - area
The area of a memory often grows with the square of the number of ports: each added port needs its own word line and bit lines at every cell, so the cell grows in both dimensions.
Kamble and Ghose cache power model
Kamble/Ghose, cont'd.
The cache is m-way set-associative, with a capacity of D bytes, T tag bits, lines of L bytes, and St status bits per line.
Bit-line energy: (equation not reproduced here; note that the corresponding equation in the book is incorrect).
Kamble/Ghose, cont'd.
Word-line energy: (equation not reproduced here).
Output-line energy: (equation not reproduced here).
Address input lines: (equation not reproduced here).
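Each of these terms follows the standard CMOS dynamic-energy form E = ½ · C · V_dd² · (transition count); the model's detail lies in how the capacitances and transition counts are derived from the cache geometry. A minimal sketch of that common form (the example values are illustrative, not Kamble/Ghose coefficients):

def dynamic_energy(c_load, v_dd, n_transitions):
    # Each switching event charges/discharges the load capacitance,
    # dissipating (1/2) * C * V^2.
    return 0.5 * c_load * v_dd ** 2 * n_transitions

# Example: 256 bit lines of ~1 pF each switching once per access at 1.2 V.
e_bit = dynamic_energy(c_load=1e-12, v_dd=1.2, n_transitions=256)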
Shiue and Chakrabarti cache energy model
add_bs: number of transitions on the address bus per instruction.
data_bs: number of transitions on the data bus per instruction.
word_line_size: number of memory cells on a word line.
bit_line_size: number of memory cells on a bit line.
E_m: energy consumption of a main memory access.
(The remaining symbols are technology parameters; not reproduced here.)
Shiue/Chakrabarti, cont'd. (equations not reproduced here).
Register files
The first stage in the memory hierarchy.
When too many values are live, some must be spilled to the cache or main memory and read back later. Spills cost time and energy.
Register file parameters:
  Number of words.
  Number of ports.
(Figure: performance and energy vs. register file size. [Weh01] © 2001 IEEE)
(Figure: cache size vs. energy. [Li98] © 1998 IEEE)
Cache parameters
Cache size: larger caches hold more data, but consume more static energy and take area away from other functions.
Number of sets: more sets allow more independent references; fewer sets map more memory locations onto each line.
Cache line length: longer lines give more prefetching bandwidth, but often result in higher energy consumption.
What impact does each of these have on energy?
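As a reminder of how these parameters relate, a small Python sketch of the standard cache-geometry arithmetic (not from the textbook; the example values are illustrative):

from math import log2

def cache_geometry(capacity_bytes, line_bytes, associativity, addr_bits=32):
    # capacity = sets * associativity * line size
    sets = capacity_bytes // (associativity * line_bytes)
    offset_bits = int(log2(line_bytes))   # byte within a line
    index_bits = int(log2(sets))          # which set
    tag_bits = addr_bits - index_bits - offset_bits
    return sets, offset_bits, index_bits, tag_bits

# Example: a 32 KB, 4-way cache with 64-byte lines has 128 sets
# and a 6/7/19-bit offset/index/tag split of a 32-bit address.
print(cache_geometry(32 * 1024, 64, 4))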
Multilevel cache optimization
Gordon-Ross et al. adjust cache parameters in order:
  Cache size.
  Line size.
  Associativity.
Design cache size for the first level, then the second level; line size for the first, then the second level; associativity for the first, then the second level (see the sketch below).
Why vary the parameters in this order?
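A minimal sketch of this greedy, one-parameter-at-a-time search, assuming a user-supplied evaluate(config) cost function; the candidate values are illustrative, not Gordon-Ross et al.'s:

def tune_caches(evaluate, levels=("L1", "L2")):
    params = [
        ("size", [4096, 8192, 16384, 32768]),
        ("line", [16, 32, 64]),
        ("assoc", [1, 2, 4]),
    ]
    # Start from the smallest configuration at every level.
    config = {lvl: {"size": 4096, "line": 16, "assoc": 1} for lvl in levels}
    # Tune each parameter for the first level, then the second,
    # before moving on to the next parameter.
    for name, candidates in params:
        for lvl in levels:
            best = min(
                candidates,
                key=lambda v: evaluate({**config, lvl: {**config[lvl], name: v}}),
            )
            config[lvl][name] = best
    return config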
Scratch pad memory
A scratch pad is managed by software, not hardware.
Provides predictable access times.
Requires values to be explicitly allocated to it.
Accessed with standard read/write instructions.
Code compression
An extreme version of instruction encoding:
  Use variable-bit-length instructions.
  Generate the encodings using compression algorithms.
Compressed instructions generally take longer to decode.
Can result in performance, energy, and code size improvements. How?
IBM CodePack (PowerPC) used Huffman encoding.
Terms
Compression ratio: (compressed code size / uncompressed code size) × 100%. For example, compressing a 100 KB program into 60 KB gives a 60% compression ratio (smaller is better).
The ratio must take into account all overheads.
Wolfe/Chanin approach
Object code is fed to a lossless compression algorithm; Wolfe and Chanin used Huffman's algorithm.
The compressed object code becomes the program image.
Code is decompressed on the fly during execution.
(Flow: source code → compiler → object code → compressor → compressed object code.)
Wolfe/Chanin execution
Instructions are decompressed when read from main memory; data is not compressed or decompressed.
The cache holds uncompressed instructions.
There is a longer latency to get instructions from memory.
The CPU does not require significant modifications.
(Block diagram: memory → decompressor → cache → CPU.)
Huffman coding
The input stream is a sequence of symbols, and each symbol's probability of occurrence is known.
Construct a binary tree of probabilities from the bottom up.
The path from the root to a symbol gives the code for that symbol.
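A minimal Python sketch of the construction, using the standard heap-based algorithm (the example probabilities are made up):

import heapq

def huffman_codes(probs):
    # Each heap entry: (probability, tiebreaker, {symbol: code so far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        # Merge the two least-probable subtrees, prefixing 0 and 1.
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

# Frequent symbols get short codes, e.g. {'a': '0', 'b': '10', ...}.
print(huffman_codes({"a": 0.5, "b": 0.25, "c": 0.15, "d": 0.10}))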
(Figure: Wolfe/Chanin results. [Wol92] © 1992 IEEE)
Compressed vs. uncompressed code
Branches require code to be decompressed starting from many different points, but compression algorithms are designed to decode from the start of a stream.
Compressed code is therefore organized into blocks:
  Decompression starts at the beginning of a block.
  Unused bits between blocks constitute overhead.
(Figure: the uncompressed instruction sequence add r1, r2, r3; mov r1, a; bne r1, foo, alongside its compressed block layout.)
Block structure and compression
Trade-off: compression algorithms work best on long blocks, but program branching works best with short blocks.
Labels in the program move during compression. Two approaches:
  Wolfe and Chanin used a branch table to translate branch addresses during execution (adds code size).
  Lefurgy et al. patched the compressed code so that branches refer to compressed locations.
(Figure: compression ratio vs. block size. [Lek99b] © 1999 IEEE)
Pre-cache compression
Decompress instructions as they come out of the cache:
  One instruction may be decompressed many times.
  The program has a smaller cache footprint.
Why might a different type of decompression engine be needed?
Compression algorithms
There are a large number of algorithms for compressing data, but they were designed for different constraints:
  Large text files.
  No real-time or power constraints.
Researchers have evaluated existing algorithms under the requirements of code compression and developed new ones; several of these new algorithms are discussed in the textbook.
Code and data compression
Unlike (non-modifiable) code, data must be compressed and decompressed dynamically:
  Data is compressed before it is written back to main memory.
Can substantially reduce main memory or cache footprints.
Requires different trade-offs.
Lempel-Ziv algorithm
A dictionary-based method.
The decoder builds its dictionary during the decompression process.
The LZW variant uses a fixed-size buffer.
(Flow: source text → coder with dictionary → compressed text → decoder with dictionary → uncompressed source.)
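A minimal Python sketch of LZW compression and decompression. For simplicity the dictionary here grows without bound, whereas the fixed-size-buffer variant would cap or recycle entries:

def lzw_compress(data):
    dictionary = {bytes([i]): i for i in range(256)}  # seed: single bytes
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in dictionary:
            w = wc                            # extend the current match
        else:
            out.append(dictionary[w])         # emit code for longest match
            dictionary[wc] = len(dictionary)  # learn the new string
            w = bytes([byte])
    if w:
        out.append(dictionary[w])
    return out

def lzw_decompress(codes):
    dictionary = {i: bytes([i]) for i in range(256)}
    w = dictionary[codes[0]]
    out = [w]
    for code in codes[1:]:
        # The decoder rebuilds the coder's dictionary one step behind it;
        # w + w[:1] handles the code-not-yet-defined (KwKwK) case.
        entry = dictionary.get(code, w + w[:1])
        out.append(entry)
        dictionary[len(dictionary)] = w + entry[:1]
        w = entry
    return b"".join(out)

assert lzw_decompress(lzw_compress(b"abababab")) == b"abababab"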
Lempel-Ziv example (figure).
MXT
Tremaine et al. describe a three-level cache system in which level 3 is shared among several processors and connected to main memory.
Data and code are compressed/uncompressed as they move between main memory and the level 3 cache.
Uses a variant of the Lempel-Ziv 1977 algorithm; all compression engines share the same dictionary.
Typically, 1 KB blocks are divided into four 256-byte blocks for compression; these can be decompressed in parallel.