Efficient Placement of Compressed Code for Parallel Decompression

Presentation transcript:

Efficient Placement of Compressed Code for Parallel Decompression. Xiaoke Qin and Prabhat Mishra, Embedded Systems Lab, Computer and Information Science and Engineering, University of Florida, USA.

Outline: Introduction; Code Compression Techniques; Efficient Placement of Compressed Binaries (Compression Algorithm, Code Placement Algorithm, Decompression Mechanism); Experiments; Conclusion.

Why Code Compression? Embedded systems are ubiquitous: automobiles, digital cameras, PDAs, cellular phones, medical and military equipment, and so on. Memory imposes cost, area, and energy constraints during embedded systems design, while application complexity keeps increasing. Code compression techniques address these constraints by reducing the size of application programs.

Code Compression Methodology: static encoding (offline), in which the application program (binary) is run through a compression algorithm and stored as compressed code in memory, followed by dynamic decoding (online), in which the decompression engine inside the embedded system expands the code for the processor to fetch and execute.

Decompression Engine (DCE). Pre-cache design: between memory and cache. Post-cache design: between cache and processor; decompression has to be very fast (at processor speed). Advantages of post-cache: (+) the cache holds compressed data, (+) reduced bus bandwidth and higher cache hit rate, (+) improved performance and energy reduction. The decompression engine can be designed pre-cache or post-cache. In the pre-cache arrangement, the DCE sits between main memory and the instruction cache, so decompression speed is not critical. The post-cache placement is beneficial because the cache holds compressed data, which reduces bus bandwidth and improves cache hits; however, it requires the DCE to deliver instructions at the rate the processor consumes them. (Figure: main memory, I-cache, processor, and D-cache, with the pre-cache and post-cache DCE placements.)

Outline: Introduction; Code Compression Techniques; Efficient Placement of Compressed Binaries (Compression Algorithm, Code Placement Algorithm, Decompression Mechanism); Experiments; Conclusion.

Code Compression Techniques. Efficient code compression: Huffman coding (Wolfe and Chanin, MICRO 1992); LZW (Lin, Xie and Wolf, DATE 2004); SAMC/arithmetic coding (Lekatsas and Wolf, TCAD 1999). Dictionary-based code compression: Liao, Devadas and Keutzer, TCAD 1998; Prakash et al., DCC 2003; Ros and Sutton, CASES 2004; Seong and Mishra, ICCAD'06, DATE'07, TCAD 2008. Dividing an instruction into different parts: Nam et al., FECCS 1999; Lekatsas and Wolf, DAC 1998; CodePack (Lefurgy, 2000).

Dictionary-Based Code Compression. Format for uncompressed code (32-bit code): decision bit (1 bit) + uncompressed data (32 bits). Format for compressed code: decision bit (1 bit) + dictionary index. Decision bit: 0 – compressed, 1 – not compressed. Example (8-bit binaries): original program 0000 0000, 1000 0010, 0000 0010, 0100 0010, 0100 1110, 0101 0010, 0000 1100, 1100 0000, …; dictionary index 0 = 0000 0000, index 1 = 0100 0010; compressed program 0 0, 1 1000 0010, 1 0000 0010, 0 1, 1 0100 1110, 1 0101 0010, 1 0000 1100, 1 1100 0000, …. The basic idea of dictionary-based code compression is to store the most frequently occurring binaries of an application in a dictionary and use that dictionary to compress the application. During compression, one extra bit identifies whether a particular binary is compressed or not. For example, the application binary shown here has ten 8-bit binaries, two of which (0000 0000 and 0100 0010) appear twice, so the dictionary has two entries. The first binary matches the first dictionary location and is therefore compressed as "0 0": the first 0 indicates that the binary is compressed, and the second 0 is the matching dictionary index. In this example the compression ratio is 97.5% (four compressed binaries at 2 bits each, six uncompressed binaries at 9 bits each, and a 16-bit dictionary: 8 + 54 + 16 = 78 bits, versus 80 bits originally).
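A minimal Python sketch of this dictionary-based scheme may help, assuming 8-bit words, a two-entry dictionary with a one-bit index, and the 0 = compressed flag convention from the slide; the exact ordering of the ten words below is illustrative rather than taken from the figure.

```python
from collections import Counter

def dict_compress(words, dict_size=2):
    """Dictionary-based compression of fixed-width binary words.
    Flag bit convention from the slide: 0 = compressed (dictionary index
    follows), 1 = not compressed (the original word follows)."""
    # The most frequent words form the dictionary (index order may differ from the slide).
    dictionary = [w for w, _ in Counter(words).most_common(dict_size)]
    index_bits = max(1, (dict_size - 1).bit_length())
    tokens = []
    for w in words:
        if w in dictionary:
            tokens.append('0' + format(dictionary.index(w), '0%db' % index_bits))
        else:
            tokens.append('1' + w)
    return dictionary, tokens

# Illustrative ten-word program: the repeats of the two dictionary entries are
# placed arbitrarily, since the slide does not show the full ordering.
words = ['00000000', '10000010', '00000010', '01000010', '01001110',
         '00000000', '01010010', '01000010', '00001100', '11000000']
dictionary, tokens = dict_compress(words)
original_bits = sum(len(w) for w in words)                            # 80 bits
compressed_bits = sum(len(t) for t in tokens) + 8 * len(dictionary)   # 62 + 16 = 78 bits
print(tokens)
print('CR = %.1f%%' % (100.0 * compressed_bits / original_bits))      # 97.5%
```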

Code Compression Techniques. Efficient compression (Huffman coding, arithmetic coding, …): excellent compression due to complex encoding, but slow decompression, so not suitable for post-cache decompression. Fast decompression (dictionary-based, bitmask-based, …): fast decompression due to simple/fixed encoding, but compression efficiency is compromised. We combine the advantages of both by employing a novel placement of compressed binaries.

How to Accelerate Decompression? Divide the code into several streams, then compress and store each stream separately, enabling parallel decompression with multiple decoders. Drawbacks: unequal compression of the streams wastes space and makes branch targets difficult to handle.

Another Alternative: always perform fixed encoding (variable-to-fixed or fixed-to-fixed), which keeps the streams aligned but sacrifices compression efficiency.

Outline: Introduction; Code Compression Techniques; Efficient Placement of Compressed Binaries (Compression Algorithm, Code Placement Algorithm, Decompression Mechanism); Experiments; Conclusion.

Overview of Our Approach: divide the code into multiple streams, compress each of them separately, and merge them using our placement algorithm. This reduces space wastage and ensures that none of the decoders is idle.

Compression Algorithm

Compression using Huffman Coding: Huffman coding with instruction division and selective compression. Example (figure): the instructions 0000 1110, 0000 0100, 0000 0000, 1000 0000, 1000 1110, … are split into two 4-bit streams and each stream is compressed. Compression ratio = compressed size / original size = 60/72 = 83.3% overall; the figure also gives per-stream compression ratios of 77.8% and 88.9%.
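The slide only shows the resulting sizes, so the following hedged Python sketch illustrates selective Huffman coding over 4-bit slots; the instruction values are the ones visible in the figure, while the helper names and the simplified code-table handling are my own.

```python
import heapq
from collections import Counter

def huffman_codes(freqs):
    """Build Huffman codes from a {symbol: frequency} map."""
    heap = [[f, i, {s: ''}] for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate single-symbol stream
        return {next(iter(freqs)): '0'}
    uid = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        for s in c1: c1[s] = '0' + c1[s]    # extend codes of the merged subtrees
        for s in c2: c2[s] = '1' + c2[s]
        uid += 1
        heapq.heappush(heap, [f1 + f2, uid, {**c1, **c2}])
    return heap[0][2]

def selective_huffman(symbols, width=4):
    """Selective compression: a symbol is Huffman-coded only if its code is
    shorter than the raw symbol; a 1-bit flag tells the decoder which form
    follows (simplified -- a real encoder would rebuild the code table after
    deciding which symbols stay uncompressed)."""
    codes = huffman_codes(Counter(symbols))
    table = {s: c for s, c in codes.items() if len(c) < width}
    encoded = ['0' + table[s] if s in table else '1' + s for s in symbols]
    return table, encoded

# Split each 8-bit instruction into two 4-bit slots and compress each stream separately.
insns = ['00001110', '00000100', '00000000', '10000000', '10001110']  # from the figure
for name, stream in [('slot1', [w[:4] for w in insns]),
                     ('slot2', [w[4:] for w in insns])]:
    table, enc = selective_huffman(stream)
    print(name, table, enc)
```

Selective compression keeps the decoder simple: rare symbols are emitted verbatim behind a single flag bit instead of receiving very long Huffman codes.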

Example using Two Decoders. Branch block: the instructions between two consecutive branch targets. The input compressed streams are placed into a storage structure with two 4-bit slots, Slot1 and Slot2, one per decoder. Sufficient decode length: 1 + length of the uncompressed field = 1 + 4 = 5 bits. (Figure: the example instructions 0000 1110, 0000 0100, 0000 0000, 1000 0000, 1000 1110, their compressed streams, and the placed storage structure.)
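To make the readiness test concrete, here is a tiny helper (the names are illustrative; the 1 + 4 = 5-bit bound is the one quoted on the slide): a decoder can safely decode another symbol as long as its buffer holds at least the sufficient decode length, since the worst case is an uncompressed symbol.

```python
UNCOMPRESSED_FIELD_BITS = 4
SUFFICIENT_DECODE_LENGTH = 1 + UNCOMPRESSED_FIELD_BITS   # flag bit + raw field = 5 bits

def ready(buffered_bits: int) -> bool:
    """A decoder is 'ready' when its input buffer already holds enough bits to
    decode at least one more symbol, whatever that symbol turns out to be."""
    return buffered_bits >= SUFFICIENT_DECODE_LENGTH
```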

Example

Decode-Aware Code Placement
Algorithm: Placement of Two Bitstreams
Input: storage block
Output: placed bitstreams
Begin
  if !Ready1 and !Ready2 then
    assign Stream1 to Slot1 and Stream2 to Slot2
  else if !Ready1 and Ready2 then
    assign Stream1 to Slot1 and Slot2
  else if Ready1 and !Ready2 then
    assign Stream2 to Slot1 and Slot2
  else
    assign Stream1 to Slot1 and Stream2 to Slot2
End
Readyi: the i'th decoder's buffer already has sufficient bits to decode its next symbol
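A short Python sketch of one placement step may make the control flow clearer; SLOT_BITS, make_reader, and place_block are illustrative names, and the bit strings in the usage example are made up rather than taken from the paper's figures.

```python
SLOT_BITS = 4          # slot width used in the two-decoder example
SUFFICIENT = 1 + 4     # sufficient decode length: flag bit + uncompressed field

def make_reader(bits):
    """Illustrative helper: sequential reader over a compressed bit string."""
    pos = [0]
    def read(n):
        chunk = bits[pos[0]:pos[0] + n].ljust(n, '0')  # pad once the stream is exhausted
        pos[0] += n
        return chunk
    return read

def place_block(ready1, ready2, next1, next2):
    """One step of the two-stream placement algorithm from the slide.

    ready_i -- decoder i's buffer already holds >= SUFFICIENT bits.
    next_i  -- callable returning the next n bits of compressed stream i.
    Returns (slot1, slot2) for the next storage block."""
    if not ready1 and ready2:
        bits = next1(2 * SLOT_BITS)      # the starving stream 1 gets both slots
        return bits[:SLOT_BITS], bits[SLOT_BITS:]
    if ready1 and not ready2:
        bits = next2(2 * SLOT_BITS)      # the starving stream 2 gets both slots
        return bits[:SLOT_BITS], bits[SLOT_BITS:]
    return next1(SLOT_BITS), next2(SLOT_BITS)   # otherwise: one slot each

# Illustrative use: stream 2 has buffered bits, stream 1 is running low.
read1 = make_reader('110100101110')
read2 = make_reader('001011100001')
print(place_block(ready1=False, ready2=True, next1=read1, next2=read2))
# -> ('1101', '0010')  both slots of this block are filled from stream 1
```

The idea is simply that whichever decoder is about to starve gets both slots of the next storage block, so neither decoder sits idle waiting for input.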

Decompression Mechanism

Outline: Introduction; Code Compression Techniques; Efficient Placement of Compressed Binaries (Compression Algorithm, Code Placement Algorithm, Decompression Mechanism); Experiments; Conclusion.

Experimental Setup. MediaBench and MiBench benchmarks: adpcm_enc, adpcm_de, cjpeg, djpeg, gsm_to, gsm_un, mpeg2enc, mpeg2dec, and pegwit, compiled for four target architectures: TI TMS320C6x, PowerPC, SPARC, and MIPS. We compared our approach with CodePack. BPA1: bitstream placement for two streams (two decoders work in parallel). BPA2: bitstream placement for four streams (four decoders work in parallel).

Decode Bandwidth: 2-4 times improvement in decode performance. (Chart: decode bandwidth of CodePack, BPA1, and BPA2 across the benchmarks.)

Compression Penalty Less than 1% penalty in compression performance

Hardware Overhead. Synthesized using Synopsys Design Compiler and the TSMC 0.18 µm cell library. BPA1 and CodePack use similar area and power; BPA2 requires roughly double (it has four 16-bit decoders). This overhead is negligible: it is 100-1000X smaller than the typical reduction in overall area and energy achieved by code compression.

                     CodePack   BPA1     BPA2
Area (µm²)           122263     137529   253586
Power (mW)           7.5        9.8      14.6
Critical path (ns)   6.91       5.76     5.94

More than 4 Decoders? BPA1 (two decoders) may need 1 startup stall cycle for each branch block; BPA2 (four decoders) may need 2 startup stall cycles for each branch block. We proved that BPA1 and BPA2 use exactly 1 and 2 cycles more (respectively) than an optimal placement. Using too many parallel decoders is not profitable: the overall gain in output bandwidth is eroded by the additional startup stalls, which may no longer be negligible compared with the execution time of the branch block itself.
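To see why adding decoders eventually stops paying off, here is a toy cost model, not the paper's analysis: the 1- and 2-cycle stall counts for BPA1/BPA2 come from the slide, while the block length and the 4-stall figure for a hypothetical 8-decoder design are assumptions.

```python
import math

def branch_block_cycles(block_symbols, n_decoders, startup_stalls):
    """Illustrative cost model: cycles to decode one branch block with
    n parallel decoders (one symbol per decoder per cycle) plus the
    startup stall cycles for that configuration."""
    return math.ceil(block_symbols / n_decoders) + startup_stalls

block = 20   # symbols in a hypothetical short branch block
print(branch_block_cycles(block, 1, 0))   # single decoder: 20 cycles
print(branch_block_cycles(block, 2, 1))   # BPA1: 10 + 1 = 11 cycles
print(branch_block_cycles(block, 4, 2))   # BPA2: 5 + 2 = 7 cycles
print(branch_block_cycles(block, 8, 4))   # 8 decoders (assumed 4 stalls): 3 + 4 = 7 cycles
```

For short branch blocks the extra startup stalls cancel out most of the bandwidth gained by doubling the number of decoders.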

Conclusion. Memory is a major constraint in embedded systems. Existing compression methods provide either efficient compression or fast decompression; our approach combines the benefits through efficient placement for parallel decompression, achieving up to 4 times improvement in decode bandwidth with less than 1% impact on compression efficiency. Future work: apply the technique to data compression (data values, FPGA bitstreams, manufacturing test, …).

Thank you!