1
Efficient Placement of Compressed Code for Parallel Decompression
Xiaoke Qin and Prabhat Mishra Embedded Systems Lab Computer and Information Science and Engineering University of Florida, USA
2
Outline
Introduction
Code Compression Techniques
Efficient Placement of Compressed Binaries
  Compression Algorithm
  Code Placement Algorithm
  Decompression Mechanism
Experiments
Conclusion
3
Why Code Compression?
Embedded systems are ubiquitous
  Automobiles, digital cameras, PDAs, cellular phones, medical and military equipment, ...
Memory imposes cost, area, and energy constraints during embedded systems design
Applications are growing in complexity
Code compression techniques address this by reducing the size of application programs
4
Code Compression Methodology
[Diagram: Static encoding (offline): a compression algorithm compresses the application program (binary) into compressed code stored in memory. Dynamic decoding (online): in the embedded system, a decompression engine decompresses the code for the processor to fetch and execute.]
5
Decompression Engine (DCE)
Pre-cache design: between memory and cache
Post-cache design: between cache and processor
  Decompression has to be very fast (at speed)
  (+) Cache holds compressed data
  (+) Reduced bus bandwidth and higher cache hits
  (+) Improved performance and energy reduction
The design of the decompression engine can be pre-cache or post-cache. In the pre-cache implementation, the decompression engine (DCE) is placed between memory and cache, so decompression efficiency is not critical. The post-cache placement is beneficial since the cache holds compressed data, thereby reducing bus bandwidth and improving cache hits. However, post-cache placement requires the DCE to deliver instructions at the rate of the processor.
[Diagram: main memory, pre-cache DCE, I-cache and D-cache, post-cache DCE, processor]
6
Outline
Introduction
Code Compression Techniques
Efficient Placement of Compressed Binaries
  Compression Algorithm
  Code Placement Algorithm
  Decompression Mechanism
Experiments
Conclusion
7
Code Compression Techniques
Efficient code compression
  Huffman coding: Wolfe and Chanin, MICRO 1992
  LZW: Lin, Xie and Wolf, DATE 2004
  SAMC/arithmetic coding: Lekatsas and Wolf, TCAD 1999
Dictionary-based code compression
  Liao, Devadas and Keutzer, TCAD 1998
  Prakash et al., DCC 2003
  Ros and Sutton, CASES 2004
  Seong and Mishra, ICCAD'06, DATE'07, TCAD 2008
Divide an instruction into different parts
  Nam et al., FECCS 1999
  Lekatsas and Wolf, DAC 1998
  CodePack, Lefurgy 2000
8
Dictionary-Based Code Compression
Format for uncompressed code (32-bit code): Decision (1 bit) + Uncompressed Data (32 bits)
Format for compressed code: Decision (1 bit) + Dictionary Index
Decision bit: 0 = compressed, 1 = not compressed
[Figure: original program, compressed program, and a two-entry dictionary (indices 0 and 1)]
The basic idea of dictionary-based code compression is to use a dictionary that stores the most frequently occurring binaries in an application, and to compress the application against that dictionary. During compression, one extra bit identifies whether a particular binary is compressed or not. For example, the application binary shown here consists of ten 8-bit binaries, two of which appear twice each, so the dictionary has two entries. The first binary matches the first dictionary entry and is therefore compressed as "0 0": the first 0 indicates that the binary is compressed, and the second 0 is the matching dictionary index. In this example the compression ratio is 97.5%: four compressed binaries at 1+1 = 2 bits, six uncompressed binaries at 1+8 = 9 bits, plus the 2x8-bit dictionary, gives (8 + 54 + 16)/80 = 78/80.
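As a concrete illustration, here is a minimal Python sketch of this scheme. The specific bit patterns and helper names are ours (the slide's example values are not shown), chosen so that the numbers reproduce the 97.5% figure above:

```python
# Minimal sketch of dictionary-based code compression as described above.
# Toy parameters from the slide: 8-bit binaries, a two-entry dictionary
# (1-bit index). The concrete bit patterns below are illustrative.
from collections import Counter

WORD_BITS = 8
DICT_SIZE = 2          # two most frequent binaries
INDEX_BITS = 1         # log2(DICT_SIZE)

def build_dictionary(program):
    """Pick the most frequently occurring binaries."""
    return [word for word, _ in Counter(program).most_common(DICT_SIZE)]

def compress(program, dictionary):
    """Emit '0' + index for dictionary hits, '1' + raw bits otherwise."""
    out = []
    for word in program:
        if word in dictionary:
            out.append('0' + format(dictionary.index(word), f'0{INDEX_BITS}b'))
        else:
            out.append('1' + word)
    return out

def compression_ratio(program, compressed, dictionary):
    """Compressed size (including the dictionary itself) / original size."""
    original = len(program) * WORD_BITS
    packed = sum(len(c) for c in compressed) + len(dictionary) * WORD_BITS
    return packed / original

# Ten 8-bit binaries, two of which appear twice each (as in the example).
program = ['00000000', '01001100', '00000000', '01001100',
           '10110001', '11000010', '01110011', '00011100',
           '10101101', '11100110']
dictionary = build_dictionary(program)
compressed = compress(program, dictionary)
print(f'CR = {compression_ratio(program, compressed, dictionary):.1%}')  # 97.5%
```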
9
Code Compression Techniques
Efficient compression: Huffman coding, arithmetic coding, ...
  Excellent compression due to complex encoding
  Slow decompression; not suitable for post-cache decompression
Fast decompression: dictionary-based, bitmask-based, ...
  Fast decompression due to simple/fixed encoding
  Compression efficiency is compromised
We combine the advantages by employing a novel placement of compressed binaries
10
How to Accelerate Decompression?
Divide the code into several streams; compress and store each stream separately
Parallel decompression using multiple decoders
Problems:
  Unequal compression causes wastage of space
  Difficult to handle branch targets
[Figure: instructions A and B split into separately compressed streams of unequal length]
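To make the space problem concrete, a tiny sketch (the numbers are ours, purely illustrative): if each separately compressed stream is stored in its own fixed-size region sized for the worst case, the difference between stream lengths is wasted as padding.

```python
# Illustrative only: separately stored streams of unequal compressed
# length waste the length difference as padding.
stream_bits = {'A': 412, 'B': 263}        # hypothetical compressed sizes
region = max(stream_bits.values())        # each region sized for the worst case
waste = sum(region - bits for bits in stream_bits.values())
print(f'wasted bits: {waste}')            # 149 bits of padding
```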
11
Another Alternative
Always perform fixed encoding
  Variable-to-fixed
  Fixed-to-fixed
Sacrifices compression efficiency
[Figure: instructions A and B encoded with fixed-length codewords]
12
Outline
Introduction
Code Compression Techniques
Efficient Placement of Compressed Binaries
  Compression Algorithm
  Code Placement Algorithm
  Decompression Mechanism
Experiments
Conclusion
13
Overview of Our Approach
Divide the code into multiple streams
Compress each of them separately
Merge them using our placement algorithm
  Reduce space wastage
  Ensure that none of the decoders are idle
14
Compression Algorithm
15
Compression using Huffman Coding
Huffman coding with instruction division and selective compression
In the example, the two streams compress to 28/36 bits (CR 77.8%) and 32/36 bits (CR 88.9%), giving an overall compression ratio = compressed size / original size = 60/72 = 83.3%
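A minimal sketch of selective Huffman compression in this spirit (our own simplified encoding, not necessarily the paper's exact one): a 0 flag precedes a Huffman code, a 1 flag precedes a raw symbol, and a symbol is compressed only when that actually saves bits. Storage of the code table itself is omitted for brevity.

```python
# Sketch of selective Huffman compression (illustrative encoding).
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Standard Huffman construction over the symbols' frequencies."""
    freq = Counter(symbols)
    heap = [(n, i, {s: ''}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tick = len(heap)                     # tiebreaker so dicts never compare
    while len(heap) > 1:
        n1, _, c1 = heapq.heappop(heap)
        n2, _, c2 = heapq.heappop(heap)
        merged = {s: '0' + c for s, c in c1.items()}
        merged.update({s: '1' + c for s, c in c2.items()})
        heapq.heappush(heap, (n1 + n2, tick, merged))
        tick += 1
    return heap[0][2]

def selective_compress(symbols, word_bits):
    """'0' + Huffman code if shorter than the raw word, else '1' + raw bits."""
    codes = huffman_codes(symbols)
    out = []
    for s in symbols:
        if len(codes[s]) < word_bits:    # compress only when it saves bits
            out.append('0' + codes[s])
        else:
            out.append('1' + s)
    return out

# Hypothetical 4-bit instruction stream, just to exercise the scheme.
program = ['1010', '1010', '1010', '0001', '0001', '0110',
           '1010', '0001', '0011', '1100', '0111', '1111']
packed = selective_compress(program, word_bits=4)
print(f'CR = {sum(map(len, packed)) / (len(program) * 4):.1%}')
```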
16
Example using Two Decoders
Branch block: instructions between two consecutive branch targets
Storage structure: each storage block has Slot1 (4 bits) and Slot2 (4 bits)
Input: compressed streams
Sufficient decode length: 1 + length of the uncompressed field = 1 + 4 = 5 (the buffer level that guarantees a decoder can decode its next symbol)
17
Example
18
Decode-Aware Code Placement
Algorithm: Placement of Two Bitstreams
Input: Storage Block
Output: Placed Bitstreams
Begin
  if !Ready1 and !Ready2 then
    Assign Stream1 to Slot1 and Stream2 to Slot2
  else if !Ready1 and Ready2 then
    Assign Stream1 to Slot1 and Slot2
  else if Ready1 and !Ready2 then
    Assign Stream2 to Slot1 and Slot2
  else
    Assign Stream1 to Slot1 and Stream2 to Slot2
End
Readyi: the i'th decoder's buffer already has sufficient bits
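A runnable sketch of this policy under the toy parameters above (4-bit slots, 5-bit sufficient decode length). The per-cycle consumption model is our own simplification, used only to drive the placement decisions:

```python
# Sketch of the two-stream placement policy. Assumptions: 4-bit slots,
# 5-bit sufficient decode length (from the earlier example), and a toy
# decoder that consumes a fixed number of buffered bits per cycle.
SLOT_BITS = 4
SUFFICIENT = 5     # 1 flag bit + 4-bit uncompressed field

def place(stream1, stream2, cycles, consume=3):
    buf = [0, 0]               # bits currently buffered at each decoder
    pos = [0, 0]               # next unplaced bit of each stream
    blocks = []                # slot assignment chosen per storage block
    streams = [stream1, stream2]
    for _ in range(cycles):
        # Toy decode step: a ready decoder consumes `consume` bits.
        for i in range(2):
            if buf[i] >= SUFFICIENT:
                buf[i] -= consume
        ready = [buf[0] >= SUFFICIENT, buf[1] >= SUFFICIENT]
        # Placement of the next storage block (two 4-bit slots).
        if not ready[0] and ready[1]:
            assign = (0, 0)    # starving stream 1 takes both slots
        elif ready[0] and not ready[1]:
            assign = (1, 1)    # starving stream 2 takes both slots
        else:
            assign = (0, 1)    # default: one slot per stream
        for s in assign:
            take = min(SLOT_BITS, len(streams[s]) - pos[s])
            pos[s] += take
            buf[s] += take
        blocks.append(assign)
    return blocks

# Hypothetical compressed bitstreams, just to exercise the policy.
print(place('1' * 60, '0' * 40, cycles=8))
```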
19
Decompression Mechanism
20
Outline
Introduction
Code Compression Techniques
Efficient Placement of Compressed Binaries
  Compression Algorithm
  Code Placement Algorithm
  Decompression Mechanism
Experiments
Conclusion
21
Experimental Setup
MediaBench and MiBench benchmarks
  adpcm_enc, adpcm_dec, cjpeg, djpeg, gsm_to, gsm_un, mpeg2enc, mpeg2dec, and pegwit
Compiled for four target architectures: TI TMS320C6x, PowerPC, SPARC, and MIPS
Compared our approach with CodePack
  BPA1: bitstream placement for two streams (two decoders work in parallel)
  BPA2: bitstream placement for four streams (four decoders work in parallel)
22
Decode Bandwidth
2-4 times improvement in decode performance
[Chart: decode bandwidth of CodePack, BPA1, and BPA2]
23
Compression Penalty
Less than 1% penalty in compression efficiency
24
Hardware Overhead
BPA1 and CodePack use similar area/power
BPA2 requires double area/power (four 16-bit decoders)
This overhead is negligible compared with the typical reduction in overall area and energy achieved by code compression

                    CodePack   BPA1     BPA2
Area (um^2)         122263     137529   253586
Power (mW)          7.5        9.8      14.6
Critical path (ns)  6.91       5.76     5.94

Synthesized using Synopsys Design Compiler and TSMC 0.18 cell library
25
More than 4 Decoders?
BPA1 (two decoders): may need 1 startup stall cycle for each branch block
BPA2 (four decoders): may need 2 startup stall cycles for each branch block
We proved that BPA1 and BPA2 use exactly 1 and 2 cycles (respectively) more than the optimal placement
Too many parallel decoders are not profitable:
  The overall gain in output bandwidth is eroded by additional startup stalls
  Startup stalls may not be negligible compared with the execution time of the branch block itself
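A back-of-the-envelope illustration of the diminishing returns (our extrapolation, not from the slides: n decoders are assumed to incur log2(n) startup stalls per branch block, matching the 1 and 2 stalls quoted above, and to deliver n instructions per cycle thereafter):

```python
# Extrapolated stall model: n decoders, log2(n) startup stalls per
# branch block, n instructions delivered per cycle once started.
import math

def cycles(block_len, n_decoders):
    stalls = int(math.log2(n_decoders))     # 1 for BPA1, 2 for BPA2
    return stalls + math.ceil(block_len / n_decoders)

for n in (1, 2, 4, 8, 16):
    print(n, cycles(16, n))   # 16, 9, 6, 5, 5 -> diminishing returns
```

Under these assumptions, a 16-instruction branch block sees no benefit beyond 8 decoders, which is consistent with the slide's claim that too many parallel decoders are not profitable.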
26
Conclusion
Memory is a major constraint in embedded systems design
Existing compression methods provide either efficient compression or fast decompression
Our approach combines the benefits
  Efficient placement for parallel decompression
  Up to 4 times improvement in decode bandwidth
  Less than 1% impact on compression efficiency
Future work: apply it to data compression
  Data values, FPGA bitstreams, manufacturing test, ...
27
Thank you!