Efficient Placement of Compressed Code for Parallel Decompression Xiaoke Qin and Prabhat Mishra Embedded Systems Lab Computer and Information Science and Engineering University of Florida, USA
Outline Introduction Code Compression Techniques Efficient Placement of Compressed Binaries Compression Algorithm Code Placement Algorithm Decompression Mechanism Experiments Conclusion
Why Code Compression? Embedded systems are ubiquitous Automobiles, digital cameras, PDAs, cellular phones, medical and military equipment, … Memory imposes cost, area and energy constraints during embedded systems design Increasing complexity of applications Code compression techniques address this by reducing the size of application programs
Code Compression Methodology [Flow diagram: Static Encoding (Offline): Application Program (Binary) → Compression Algorithm → Compressed Code (Memory); Dynamic Decoding (Online): Compressed Code (Memory) → Decompression Engine → Processor (Fetch and Execute)] Embedded Systems
Decompression Engine (DCE) Pre-cache design: between memory and cache. Post-cache design: between cache and processor; decompression has to be very fast (at speed). (+) Cache holds compressed data (+) Reduced bus bandwidth and higher cache hits (+) Improved performance and energy reduction. The design of the decompression engine can be pre-cache or post-cache. In the pre-cache implementation, the decompression engine (DCE) is placed between cache and memory, so decompression efficiency is not critical. The post-cache placement is beneficial since the cache holds the compressed data, thereby reducing bus bandwidth and improving cache hits. However, post-cache placement requires the DCE to deliver instructions at the rate of the processor. [Diagram: Main Memory → (Pre-Cache DCE) → I-Cache → (Post-Cache DCE) → Processor, with D-Cache alongside]
Outline Introduction Code Compression Techniques Efficient Placement of Compressed Binaries Compression Algorithm Code Placement Algorithm Decompression Mechanism Experiments Conclusion
Code Compression Techniques Efficient code compression Huffman coding: Wolfe and Chanin, MICRO 1992 LZW: Lin, Xie and Wolf, DATE 2004 SAMC/Arithmetic coding: Lekatsas and Wolf, TCAD 1999 Dictionary-based code compression Liao, Devadas and Keutzer, TCAD 1998 Prakash et al., DCC 2003 Ros and Sutton, CASES 2004 Seong and Mishra, ICCAD’06, DATE’07, TCAD 2008 Divide an instruction into different parts Nam et al., FECCS 1999 Lekatsas and Wolf, DAC 1998 CodePack, Lefurgy 2000
Dictionary-Based Code Compression
Format for uncompressed code: Decision (1 bit) + Uncompressed Data (32 bits)
Format for compressed code: Decision (1 bit) + Dictionary Index
Decision bit: 0 = compressed, 1 = not compressed

Dictionary:
  Index   Entry
  0       0000 0000
  1       0100 0010

Original program (ten 8-bit words; 0000 0000 and 0100 0010 each occur twice): 0000 0000, 1000 0010, 0000 0010, 0100 0010, 0100 1110, 0101 0010, 0000 1100, 1100 0000. In the compressed program, dictionary hits become "0 0" or "0 1", while misses are stored as 1 followed by the original word: 1 1000 0010, 1 0000 0010, 1 0100 1110, 1 0101 0010, 1 0000 1100, 1 1100 0000.

The basic idea of dictionary-based code compression is to store the most frequently occurring binaries of an application in a dictionary and use that dictionary to compress the application. During compression, one extra bit identifies whether a particular binary is compressed or not. For example, the application binary shown here has ten 8-bit binaries, two of which (0000 0000 and 0100 0010) appear twice; the dictionary therefore has two entries. The first binary matches the first dictionary location, so it is compressed as "0 0": the first 0 indicates that this binary is compressed, and the second 0 is the dictionary index that matches it. In this example, the compression ratio is 97.5% (62 bits of compressed code plus the 16-bit dictionary, i.e. 78 bits, against 80 original bits).
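To make the scheme concrete, here is a minimal Python sketch of dictionary-based compression. It is an illustration, not the authors' implementation; the positions of the two repeated words in the 10-word program are our assumption, since the slide lists only the distinct values.

from collections import Counter

def build_dictionary(words, size):
    """Keep the most frequently occurring binaries as dictionary entries."""
    return [w for w, _ in Counter(words).most_common(size)]

def compress(words, dictionary):
    """Prefix each word with a decision bit: 0 = dictionary index, 1 = raw word."""
    index = {w: i for i, w in enumerate(dictionary)}
    idx_bits = max(1, (len(dictionary) - 1).bit_length())
    out = []
    for w in words:
        if w in index:
            out.append("0" + format(index[w], f"0{idx_bits}b"))  # compressed
        else:
            out.append("1" + w)                                  # not compressed
    return out

# Ten 8-bit words; 0000 0000 and 0100 0010 each appear twice (positions assumed).
program = ["00000000", "10000010", "00000010", "01000010", "01001110",
           "00000000", "01010010", "01000010", "00001100", "11000000"]
dictionary = build_dictionary(program, 2)
compressed = compress(program, dictionary)
bits = sum(len(c) for c in compressed) + 8 * len(dictionary)  # count dictionary too
print(f"CR = {bits}/{8 * len(program)} = {100 * bits / (8 * len(program)):.1f}%")

Running this reproduces the slide's number: CR = 78/80 = 97.5% once the 16-bit dictionary is counted against the 80-bit original.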
Code Compression Techniques Efficient compression: Huffman coding, arithmetic coding, … Excellent compression due to complex encoding, but slow decompression; not suitable for post-cache decompression. Fast decompression: dictionary-based, bitmask-based, … Fast decompression due to simple/fixed encoding, but compression efficiency is compromised. We combine the advantages by employing a novel placement of compressed binaries.
How to Accelerate Decompression? Divide code into several streams; compress and store each stream separately. Parallel decompression using multiple decoders. Problems: unequal compression, wastage of space, difficult to handle branch targets. [Figure: streams A and B compressed separately; their unequal lengths leave wasted space and unclear branch targets]
Another Alternative Always perform fixed encoding (variable-to-fixed or fixed-to-fixed). Sacrifices compression efficiency. [Figure: fixed-size encoding of streams A and B]
Outline Introduction Code Compression Techniques Efficient Placement of Compressed Binaries Compression Algorithm Code Placement Algorithm Decompression Mechanism Experiments Conclusion
Overview of Our Approach Divide code into multiple streams Compress each of them separately Merge them using our placement algorithm Reduce space wastage Ensure that none of the decoders are idle
Compression Algorithm
Compression using Huffman Coding Huffman coding with instruction division and selective compression. [Example figure: input bit patterns 0000 1110 0000 0100 0000 0000 1000 0000 1000 1110; overall Compression Ratio = Compressed Size / Original Size = 60/72 = 83.3%; the two divided streams achieve CR 77.8% and 88.9%]
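As an illustration of selective Huffman compression, here is a sketch of the general technique over 4-bit symbols. This is not the paper's exact encoder (its figure uses a larger input and a specific dictionary, hence different numbers); only symbols whose Huffman code beats the raw width get compressed.

import heapq
from collections import Counter

def huffman_codes(freq):
    """Standard Huffman construction; returns {symbol: bitstring}."""
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tick = len(heap)                      # unique tie-breaker so dicts never compare
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, tick, merged))
        tick += 1
    return heap[0][2]

def selective_compress(symbols, width=4):
    """Huffman-encode a symbol only when flag + code beats flag + raw bits."""
    codes = huffman_codes(Counter(symbols))
    out = []
    for s in symbols:
        if len(codes[s]) < width:
            out.append("0" + codes[s])    # flag 0: compressed
        else:
            out.append("1" + s)           # flag 1: stored uncompressed
    return out

stream = ["0000", "1110", "0000", "0100", "0000",
          "0000", "1000", "0000", "1000", "1110"]
enc = selective_compress(stream)
print(sum(len(c) for c in enc), "bits vs", 4 * len(stream), "raw")   # 28 vs 40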
Example using Two Decoders Branch block: instructions between two consecutive branch targets. Storage structure: each storage block provides Slot1 (4 bits) and Slot2 (4 bits) for the input compressed streams. Sufficient decode length: 1 + length of uncompressed field = 1 + 4 = 5 bits. [Figure: example bit patterns 0000 1110 0000 0100 0000 0000 1000 0000 1000 1110 placed into the two slots]
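Why 5 bits suffice: selective compression only uses Huffman codes shorter than the 4-bit raw field, so the longest possible symbol is an uncompressed one (1 flag bit + 4 raw bits). A small sketch, with a hypothetical prefix-free code table of our own:

# Slide's parameters: 1 flag bit + 4-bit uncompressed field, so 5 buffered bits
# always suffice to decode one symbol. CODES is a hypothetical table; selective
# compression guarantees its entries are shorter than 4 bits.
CODES = {"0": "0000", "10": "1110", "11": "1000"}

def decode_one(buffer):
    """Consume one symbol from the front of `buffer`; needs at most 5 bits."""
    flag, rest = buffer[0], buffer[1:]
    if flag == "1":                          # not compressed: flag + 4 raw bits
        return rest[:4], buffer[5:]
    for code, symbol in CODES.items():       # compressed: flag + short code
        if rest.startswith(code):
            return symbol, buffer[1 + len(code):]
    raise ValueError("need more bits")

print(decode_one("10100011"))   # raw worst case: ('0100', '011')
print(decode_one("01011011"))   # compressed:     ('1110', '11011')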
Example
Decode-Aware Code Placement
Algorithm: Placement of Two Bitstreams
Input: Storage Block
Output: Placed Bitstreams
Begin
  if !Ready1 and !Ready2 then
    Assign Stream1 to Slot1 and Stream2 to Slot2
  else if !Ready1 and Ready2 then
    Assign Stream1 to Slot1 and Slot2
  else if Ready1 and !Ready2 then
    Assign Stream2 to Slot1 and Slot2
  else
    Assign Stream1 to Slot1 and Stream2 to Slot2
End
Readyi: the i-th decoder's buffer already has sufficient bits
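A Python rendering of this placement loop. The slot width and ready threshold follow the slide; the per-cycle consumption model and helper names are our assumptions, used only to drive the Ready flags.

SLOT = 4           # bits per storage slot (slide's parameter)
SUFFICIENT = 5     # sufficient decode length: 1 flag bit + 4-bit field
CONSUME = (2, 4)   # assumed bits consumed per cycle by each decoder

def place(stream1, stream2):
    """Interleave two compressed bitstreams into (slot1, slot2) storage blocks."""
    streams = [stream1, stream2]
    pos = [0, 0]       # next unplaced bit of each stream
    buffered = [0, 0]  # bits sitting in each decoder's input buffer

    def take(i):
        chunk = streams[i][pos[i]:pos[i] + SLOT].ljust(SLOT, "0")  # pad the tail
        pos[i] += SLOT
        buffered[i] += SLOT
        return chunk

    blocks = []
    while pos[0] < len(stream1) or pos[1] < len(stream2):
        ready = [b >= SUFFICIENT for b in buffered]
        if ready[1] and not ready[0]:
            blocks.append((take(0), take(0)))   # both slots feed stream 1
        elif ready[0] and not ready[1]:
            blocks.append((take(1), take(1)))   # both slots feed stream 2
        else:
            blocks.append((take(0), take(1)))   # one slot each
        buffered = [max(0, b - c) for b, c in zip(buffered, CONSUME)]
    return blocks

for slot1, slot2 in place("110100111010", "01100101110010100110"):
    print(slot1, slot2)

With these inputs the loop exercises both the one-slot-each case and the both-slots-to-one-stream case (the faster-consuming decoder 2 starves and gets both slots), which is exactly how the placement keeps either decoder from going idle.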
Decompression Mechanism
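Since the placement is decode-aware, the decompression side can replay the same Ready test to route each stored slot pair back to the correct decoder buffer. This sketch mirrors the placement sketch above under the same assumed bookkeeping model; the hardcoded blocks are that sketch's output for the same input streams.

SLOT, SUFFICIENT = 4, 5
CONSUME = (2, 4)   # must match the consumption model assumed during placement

def route(blocks):
    """Split stored (slot1, slot2) pairs back into the two original bitstreams."""
    buffers = ["", ""]
    buffered = [0, 0]                     # same bookkeeping as the placement loop
    for slot1, slot2 in blocks:
        ready = [b >= SUFFICIENT for b in buffered]
        if ready[1] and not ready[0]:     # both slots belong to stream 1
            buffers[0] += slot1 + slot2
            buffered[0] += 2 * SLOT
        elif ready[0] and not ready[1]:   # both slots belong to stream 2
            buffers[1] += slot1 + slot2
            buffered[1] += 2 * SLOT
        else:                             # one slot each
            buffers[0] += slot1
            buffers[1] += slot2
            buffered = [buffered[0] + SLOT, buffered[1] + SLOT]
        buffered = [max(0, b - c) for b, c in zip(buffered, CONSUME)]
    return buffers

blocks = [("1101", "0110"), ("0011", "0101"), ("1010", "1100"), ("1010", "0110")]
s1, s2 = route(blocks)
assert s1 == "110100111010" and s2 == "01100101110010100110"  # round trip OK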
Outline Introduction Code Compression Techniques Efficient Placement of Compressed Binaries Compression Algorithm Code Placement Algorithm Decompression Mechanism Experiments Conclusion
Experimental Setup MediaBench and MiBench benchmarks: adpcm_enc, adpcm_dec, cjpeg, djpeg, gsm_to, gsm_un, mpeg2enc, mpeg2dec, and pegwit. Compiled for four target architectures: TI TMS320C6x, PowerPC, SPARC and MIPS. Compared our approach with CodePack. BPA1: bitstream placement for two streams (two decoders work in parallel). BPA2: bitstream placement for four streams (four decoders work in parallel).
Decode Bandwidth [Chart: decode bandwidth of CodePack, BPA1 and BPA2 across benchmarks] 2-4 times improvement in decode performance
Compression Penalty Less than 1% penalty in compression performance
Hardware Overhead BPA1 and CodePack use similar area/power; BPA2 requires roughly double area/power (four 16-bit decoders). This overhead is negligible: 100-1000X smaller than the typical reduction in overall area and energy achieved by code compression.

                    CodePack   BPA1     BPA2
Area (um2)          122263     137529   253586
Power (mW)          7.5        9.8      14.6
Critical path (ns)  6.91       5.76     5.94

Synthesized using Synopsys Design Compiler and TSMC 0.18 cell library.
More than 4 Decoders? BPA1 (two decoders) may need 1 startup stall cycle for each branch block; BPA2 (four decoders) may need 2. We proved that BPA1 and BPA2 use exactly 1 and 2 cycles (respectively) more than an optimal placement. Too many parallel decoders is not profitable: the gain in overall output bandwidth is eroded by additional startup stalls, and those stalls may no longer be negligible compared to the execution time of the branch block itself.
Conclusion Memory is a major constraint. Existing compression methods provide either efficient compression or fast decompression. Our approach combines the benefits: efficient placement for parallel decompression, up to 4 times improvement in decode bandwidth, and less than 1% impact on compression efficiency. Future work: apply it to data compression (data values, FPGA bitstreams, manufacturing test, …)
Thank you!