1
Efficient Placement of Compressed Code for Parallel Decompression
Xiaoke Qin and Prabhat Mishra Embedded Systems Lab Computer and Information Science and Engineering University of Florida, USA
2
Outline
Introduction
Code Compression Techniques
Efficient Placement of Compressed Binaries
  Compression Algorithm
  Code Placement Algorithm
  Decompression Mechanism
Experiments
Conclusion
3
Why Code Compression?
Embedded systems are ubiquitous
  Automobiles, digital cameras, PDAs, cellular phones, medical and military equipment, ...
Memory imposes cost, area, and energy constraints during embedded systems design
Applications are growing in complexity
Code compression techniques address this by reducing the size of application programs
4
Code Compression Methodology
[Diagram: Static encoding (offline): a compression algorithm compresses the application program (binary) into compressed code stored in memory. Dynamic decoding (online): in the embedded system, a decompression engine decompresses the code for the processor to fetch and execute.]
5
Decompression Engine (DCE)
Pre-cache design: between memory and cache
Post-cache design: between cache and processor
  Decompression has to be very fast (at speed)
  (+) Cache holds compressed data
  (+) Reduced bus bandwidth and higher cache hits
  (+) Improved performance and energy reduction
The design of the decompression engine can be pre-cache or post-cache. In the pre-cache implementation, the decompression engine (DCE) is placed between memory and cache, so decompression efficiency is not critical. The post-cache placement is beneficial since the cache holds compressed data, thereby reducing bus bandwidth and improving cache hits. However, post-cache placement requires the DCE to deliver instructions at the rate of the processor.
[Diagram: main memory, pre-cache DCE, I-cache and D-cache, post-cache DCE, processor]
6
Outline
Introduction
Code Compression Techniques
Efficient Placement of Compressed Binaries
  Compression Algorithm
  Code Placement Algorithm
  Decompression Mechanism
Experiments
Conclusion
7
Code Compression Techniques
Efficient code compression
  Huffman coding: Wolfe and Chanin, MICRO 1992
  LZW: Lin, Xie and Wolf, DATE 2004
  SAMC/arithmetic coding: Lekatsas and Wolf, TCAD 1999
Dictionary-based code compression
  Liao, Devadas and Keutzer, TCAD 1998
  Prakash et al., DCC 2003
  Ros and Sutton, CASES 2004
  Seong and Mishra, ICCAD'06, DATE'07, TCAD 2008
Divide an instruction into different parts
  Nam et al., FECCS 1999
  Lekatsas and Wolf, DAC 1998
  CodePack, Lefurgy 2000
8
Dictionary-Based Code Compression
Format for uncompressed code (32-bit code): Decision (1 bit) + Uncompressed Data (32 bits)
Format for compressed code: Decision (1 bit) + Dictionary Index
Decision bit: 0 = compressed, 1 = not compressed
[Figure: original program, compressed program, and a two-entry dictionary (indices 0 and 1)]
The basic idea of dictionary-based code compression is to use a dictionary that stores the most frequently occurring binaries in an application, and to compress the application against that dictionary. During compression, one extra bit identifies whether a particular binary is compressed or not. For example, the application binary shown here consists of ten 8-bit binaries, two of which appear twice each, so the dictionary has two entries. The first binary matches the first dictionary entry and is therefore compressed as "0 0": the first 0 indicates that the binary is compressed, and the second 0 is the matching dictionary index. In this example the compression ratio is 97.5%: four compressed binaries at 1+1 = 2 bits, six uncompressed binaries at 1+8 = 9 bits, plus the 2x8-bit dictionary, gives (8 + 54 + 16)/80 = 78/80.
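As a concrete illustration, here is a minimal Python sketch of this scheme. The specific bit patterns and helper names are ours (the slide's example values are not shown), chosen so that the numbers reproduce the 97.5% figure above:

```python
# Minimal sketch of dictionary-based code compression as described above.
# Toy parameters from the slide: 8-bit binaries, a two-entry dictionary
# (1-bit index). The concrete bit patterns below are illustrative.
from collections import Counter

WORD_BITS = 8
DICT_SIZE = 2          # two most frequent binaries
INDEX_BITS = 1         # log2(DICT_SIZE)

def build_dictionary(program):
    """Pick the most frequently occurring binaries."""
    return [word for word, _ in Counter(program).most_common(DICT_SIZE)]

def compress(program, dictionary):
    """Emit '0' + index for dictionary hits, '1' + raw bits otherwise."""
    out = []
    for word in program:
        if word in dictionary:
            out.append('0' + format(dictionary.index(word), f'0{INDEX_BITS}b'))
        else:
            out.append('1' + word)
    return out

def compression_ratio(program, compressed, dictionary):
    """Compressed size (including the dictionary itself) / original size."""
    original = len(program) * WORD_BITS
    packed = sum(len(c) for c in compressed) + len(dictionary) * WORD_BITS
    return packed / original

# Ten 8-bit binaries, two of which appear twice each (as in the example).
program = ['00000000', '01001100', '00000000', '01001100',
           '10110001', '11000010', '01110011', '00011100',
           '10101101', '11100110']
dictionary = build_dictionary(program)
compressed = compress(program, dictionary)
print(f'CR = {compression_ratio(program, compressed, dictionary):.1%}')  # 97.5%
```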
9
Code Compression Techniques
Efficient compression: Huffman coding, arithmetic coding, ...
  Excellent compression due to complex encoding
  Slow decompression; not suitable for post-cache decompression
Fast decompression: dictionary-based, bitmask-based, ...
  Fast decompression due to simple/fixed encoding
  Compression efficiency is compromised
We combine the advantages by employing a novel placement of compressed binaries
10
How to Accelerate Decompression?
Divide the code into several streams; compress and store each stream separately
Parallel decompression using multiple decoders
Problems:
  Unequal compression causes wastage of space
  Difficult to handle branch targets
[Figure: instructions A and B split into separately compressed streams of unequal length]
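To make the space problem concrete, a tiny sketch (the numbers are ours, purely illustrative): if each separately compressed stream is stored in its own fixed-size region sized for the worst case, the difference between stream lengths is wasted as padding.

```python
# Illustrative only: separately stored streams of unequal compressed
# length waste the length difference as padding.
stream_bits = {'A': 412, 'B': 263}        # hypothetical compressed sizes
region = max(stream_bits.values())        # each region sized for the worst case
waste = sum(region - bits for bits in stream_bits.values())
print(f'wasted bits: {waste}')            # 149 bits of padding
```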
11
Another Alternative
Always perform fixed encoding
  Variable-to-fixed
  Fixed-to-fixed
Sacrifices compression efficiency
[Figure: instructions A and B encoded with fixed-length codewords]
12
Outline
Introduction
Code Compression Techniques
Efficient Placement of Compressed Binaries
  Compression Algorithm
  Code Placement Algorithm
  Decompression Mechanism
Experiments
Conclusion
13
Overview of Our Approach
Divide the code into multiple streams
Compress each of them separately
Merge them using our placement algorithm
  Reduce space wastage
  Ensure that none of the decoders are idle
14
Compression Algorithm
15
Compression using Huffman Coding
Huffman coding with instruction division and selective compression
In the example, the two streams compress to 28/36 bits (CR 77.8%) and 32/36 bits (CR 88.9%), giving an overall compression ratio = compressed size / original size = 60/72 = 83.3%
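A minimal sketch of selective Huffman compression in this spirit (our own simplified encoding, not necessarily the paper's exact one): a 0 flag precedes a Huffman code, a 1 flag precedes a raw symbol, and a symbol is compressed only when that actually saves bits. Storage of the code table itself is omitted for brevity.

```python
# Sketch of selective Huffman compression (illustrative encoding).
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Standard Huffman construction over the symbols' frequencies."""
    freq = Counter(symbols)
    heap = [(n, i, {s: ''}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tick = len(heap)                     # tiebreaker so dicts never compare
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in c1.items()}
        merged.update({s: '1' + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tick, merged))
        tick += 1
    return heap[0][2]

def selective_compress(symbols, word_bits):
    """'0' + Huffman code if shorter than the raw word, else '1' + raw bits."""
    codes = huffman_codes(symbols)
    out = []
    for s in symbols:
        if len(codes[s]) < word_bits:    # compress only when it saves bits
            out.append('0' + codes[s])
        else:
            out.append('1' + s)
    return out

# Hypothetical 4-bit instruction stream, just to exercise the scheme.
program = ['1010', '1010', '1010', '0001', '0001', '0110',
           '1010', '0001', '0011', '1100', '0111', '1111']
packed = selective_compress(program, word_bits=4)
print(f'CR = {sum(map(len, packed)) / (len(program) * 4):.1%}')
```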
16
Example using Two Decoders
Branch block: instructions between two consecutive branch targets
Storage structure: each storage block has Slot1 (4 bits) and Slot2 (4 bits)
Input: compressed streams
Sufficient decode length: 1 + length of the uncompressed field = 1 + 4 = 5 (the buffer level that guarantees a decoder can decode its next symbol)
17
Example
18
Decode-Aware Code Placement
Algorithm: Placement of Two Bitstreams
Input: Storage Block
Output: Placed Bitstreams
Begin
  if !Ready1 and !Ready2 then
    Assign Stream1 to Slot1 and Stream2 to Slot2
  else if !Ready1 and Ready2 then
    Assign Stream1 to Slot1 and Slot2
  else if Ready1 and !Ready2 then
    Assign Stream2 to Slot1 and Slot2
  else
    Assign Stream1 to Slot1 and Stream2 to Slot2
End
Readyi: the i'th decoder's buffer already has sufficient bits
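A runnable sketch of this policy under the toy parameters above (4-bit slots, 5-bit sufficient decode length). The per-cycle consumption model is our own simplification, used only to drive the placement decisions:

```python
# Sketch of the two-stream placement policy. Assumptions: 4-bit slots,
# 5-bit sufficient decode length (from the earlier example), and a toy
# decoder that consumes a fixed number of buffered bits per cycle.
SLOT_BITS = 4
SUFFICIENT = 5     # 1 flag bit + 4-bit uncompressed field

def place(stream1, stream2, cycles, consume=3):
    buf = [0, 0]               # bits currently buffered at each decoder
    pos = [0, 0]               # next unplaced bit of each stream
    blocks = []                # slot assignment chosen per storage block
    streams = [stream1, stream2]
    for _ in range(cycles):
        # Toy decode step: a ready decoder consumes `consume` bits.
        for i in range(2):
            if buf[i] >= SUFFICIENT:
                buf[i] -= consume
        ready = [buf[0] >= SUFFICIENT, buf[1] >= SUFFICIENT]
        # Placement of the next storage block (two 4-bit slots).
        if not ready[0] and ready[1]:
            assign = (0, 0)    # starving stream 1 takes both slots
        elif ready[0] and not ready[1]:
            assign = (1, 1)    # starving stream 2 takes both slots
        else:
            assign = (0, 1)    # default: one slot per stream
        for s in assign:
            take = min(SLOT_BITS, len(streams[s]) - pos[s])
            pos[s] += take
            buf[s] += take
        blocks.append(assign)
    return blocks

# Hypothetical compressed bitstreams, just to exercise the policy.
print(place('1' * 60, '0' * 40, cycles=8))
```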
19
Decompression Mechanism
20
Outline
Introduction
Code Compression Techniques
Efficient Placement of Compressed Binaries
  Compression Algorithm
  Code Placement Algorithm
  Decompression Mechanism
Experiments
Conclusion
21
Experimental Setup
MediaBench and MiBench benchmarks
  adpcm_enc, adpcm_dec, cjpeg, djpeg, gsm_to, gsm_un, mpeg2enc, mpeg2dec, and pegwit
Compiled for four target architectures: TI TMS320C6x, PowerPC, SPARC, and MIPS
Compared our approach with CodePack
  BPA1: bitstream placement for two streams (two decoders work in parallel)
  BPA2: bitstream placement for four streams (four decoders work in parallel)
22
Decode Bandwidth
2-4 times improvement in decode performance
[Chart: decode bandwidth of CodePack, BPA1, and BPA2]
23
Compression Penalty
Less than 1% penalty in compression efficiency
24
Hardware Overhead
BPA1 and CodePack use similar area/power
BPA2 requires double area/power (four 16-bit decoders)
This overhead is negligible compared with the typical reduction in overall area and energy achieved by code compression

                    CodePack   BPA1     BPA2
Area (um^2)         122263     137529   253586
Power (mW)          7.5        9.8      14.6
Critical path (ns)  6.91       5.76     5.94

Synthesized using Synopsys Design Compiler and TSMC 0.18 cell library
25
More than 4 Decoders?
BPA1 (two decoders): may need 1 startup stall cycle for each branch block
BPA2 (four decoders): may need 2 startup stall cycles for each branch block
We proved that BPA1 and BPA2 use exactly 1 and 2 cycles (respectively) more than the optimal placement
Too many parallel decoders are not profitable:
  The overall gain in output bandwidth is eroded by additional startup stalls
  Startup stalls may not be negligible compared with the execution time of the branch block itself
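A back-of-the-envelope illustration of the diminishing returns (our extrapolation, not from the slides: n decoders are assumed to incur log2(n) startup stalls per branch block, matching the 1 and 2 stalls quoted above, and to deliver n instructions per cycle thereafter):

```python
# Extrapolated stall model: n decoders, log2(n) startup stalls per
# branch block, n instructions delivered per cycle once started.
import math

def cycles(block_len, n_decoders):
    stalls = int(math.log2(n_decoders))     # 1 for BPA1, 2 for BPA2
    return stalls + math.ceil(block_len / n_decoders)

for n in (1, 2, 4, 8, 16):
    print(n, cycles(16, n))   # 16, 9, 6, 5, 5 -> diminishing returns
```

Under these assumptions, a 16-instruction branch block sees no benefit beyond 8 decoders, which is consistent with the slide's claim that too many parallel decoders are not profitable.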
26
Conclusion
Memory is a major constraint in embedded systems design
Existing compression methods provide either efficient compression or fast decompression
Our approach combines the benefits
  Efficient placement for parallel decompression
  Up to 4 times improvement in decode bandwidth
  Less than 1% impact on compression efficiency
Future work: apply it to data compression
  Data values, FPGA bitstreams, manufacturing test, ...
27
Thank you!