1
Code Compression
Outline: Motivation, Efficient Compression, Dictionary Based Compression, Bitmask Based Compression, Mask Selection, Dictionary Selection, Fast Decompression, Code Placement, Parallel Decompression
CDA 4630/5636 – Spring. Copyright © Prabhat Mishra
2
Compression
Lossy Compression
- Widely used in multimedia and related domains
- Trades off required size against quality, e.g., JPEG for image and MPEG for video compression
- Audio compression removes non-audible (or less audible) components of the voice signal
Lossless Compression
- Exploits redundancy to reduce the size without losing any information
- LZW is popular but not suitable for embedded systems: efficient (complex) compression implies complex (time-consuming) runtime decompression, which can lead to unacceptable performance overhead
3
Memory in Embedded Systems
Memory is a major design constraint
- Impacts cost and size of the system
- Significant contributor to overall power/energy
Code compression is a promising approach
- Code size reduction leads to less memory area
- Can improve power and performance
Embedded systems are everywhere, from simple day-to-day appliances to high-end medical and military equipment. Design of such systems is constrained by cost, area, and power considerations. Memory poses a critical challenge in embedded systems design since it contributes significantly to the cost, area, and energy requirements of the overall system. Code compression techniques address this issue by reducing the memory requirement.
4
Embedded Systems Design Flow
[Design-flow diagram: Concept → Specification → HW/SW Partitioning → Hardware Design (Synthesis, Layout, …) and Software Design (Compilation, …) → Estimation/Exploration → Validation and Evaluation (area, power, performance, …), producing the Hardware and Software Components]
5
Code Compression
[Flow diagram: Application Program (Binary) → Compression Algorithm (static encoding, offline) → Compressed Code (Memory) → Decompression Engine (dynamic decoding, online) → Processor (Fetch and Execute)]
In a typical code compression methodology, the application program is compressed offline and placed in memory. During execution, a decompression unit converts the compressed code back into the original instructions. The goal of the compression algorithm is to generate the best possible compressed code without affecting decompression performance. The decompression engine must be fast enough to deliver the required number of instructions to the processor without introducing any stalls.
6
Definitions
Compression Ratio: smaller is better
Example:
- Original program size: 1024 instructions (32-bit)
- Compressed program: 512 entries (32-bit) plus 8 dictionary entries (32-bit)
- Compression ratio = (512 × 32 + 8 × 32) / (1024 × 32) = 50.78%
Compression ratio is widely used as a metric for measuring compression efficiency and is defined as compressed code size divided by original program size. In other words, a smaller compression ratio implies a better compression technique.
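As a quick sanity check, the arithmetic above can be reproduced in a few lines (Python here, purely illustrative):

    original_bits   = 1024 * 32                      # original program: 1024 32-bit instructions
    compressed_bits = 512 * 32 + 8 * 32              # compressed entries plus the dictionary
    print(100.0 * compressed_bits / original_bits)   # 50.78125 -> the 50.78% above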
7
Outline: Motivation, Efficient Compression, Dictionary Based Compression, Bitmask Based Compression, Mask Selection, Dictionary Selection, Fast Decompression, Code Placement, Parallel Decompression
8
Dictionary-Based Code Compression
Format for uncompressed code: Decision (1 bit) + Uncompressed Data (32 bits)
Format for compressed code: Decision (1 bit) + Dictionary Index
Decision bit: 0 – compressed, 1 – not compressed
Original code size = 80 bits; compressed code size = 62 bits; dictionary size = 16 bits
Compression ratio = (62 + 16) / 80 = 97.5%
The basic idea of dictionary-based code compression is to store the most frequently occurring binaries of an application in a dictionary and use that dictionary to compress the application. During compression, one extra bit identifies whether a particular binary is compressed or not. For example, the application binary shown here has ten 8-bit binaries, two of which appear twice; therefore, the dictionary has two entries. The first binary matches the first location of the dictionary, so it is compressed as "0 0": the first 0 indicates that the binary is compressed, and the second 0 is the matching dictionary index. In this example, the compression ratio is 97.5%.
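The scheme is straightforward to prototype. Below is a minimal, hedged Python sketch of the encoder; function and variable names are ours, and the slide's actual binaries are not reproduced:

    def dict_compress(words, dictionary):
        """Encode each word as '1'+word (uncompressed) or '0'+index (compressed)."""
        index_bits = max(1, (len(dictionary) - 1).bit_length())
        index = {w: i for i, w in enumerate(dictionary)}
        out = []
        for w in words:                              # words are bit strings, e.g. '01101100'
            if w in index:
                out.append('0' + format(index[w], '0%db' % index_bits))
            else:
                out.append('1' + w)
        return ''.join(out)

    # With a 2-entry dictionary (1-bit index), 4 of the slide's 10 words compress to
    # 2 bits each and 6 remain 9 bits: 4*2 + 6*9 = 62 bits, matching the slide.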
9
Hamming Distance Approach
Format for uncompressed code: Decision (1 bit) + Uncompressed Data (32 bits)
Format for compressed code: Decision (1 bit) + Number of Bit Changes + Location (5 bits) … Location (5 bits) + Dictionary Index
Decision bit: 0 – compressed, 1 – not compressed; mismatch bit: 0 – mismatch, 1 – no action
Prakash et al., DCC 2003; Ros and Sutton, CASES 2004
Original code size = 80 bits; compressed code size = 60 bits; dictionary size = 16 bits
Compression ratio = (60 + 16) / 80 = 95%
The Hamming distance approach improves standard dictionary-based code compression by remembering mismatches. The idea is to compress a binary using a dictionary entry even if the two differ in a few bit positions, and to store the mismatch positions in the compressed code. Using the same example as before and allowing only 1-bit mismatches, we need one bit to indicate whether the mismatch scenario is used and another 3 bits to identify the location in an eight-bit binary. For example, the third binary is compressed using the first dictionary entry, and the sixth position (from left) is stored since it differs from the dictionary entry. This technique improves the compression ratio to 95%.
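For intuition, testing whether a word can be compressed against an entry with a single-bit mismatch reduces to one XOR and a power-of-two check. A small illustrative helper (our naming, not from the cited papers):

    def hamming1_position(word, entry):
        """Return the 0-indexed mismatch position (from the left) if `word`
        differs from `entry` in exactly one bit, else None."""
        diff = int(word, 2) ^ int(entry, 2)
        if diff != 0 and diff & (diff - 1) == 0:   # exactly one bit set
            return len(word) - diff.bit_length()
        return None

    # e.g. hamming1_position('00000100', '00000000') -> 5, the sixth bit from the left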
10
Cost-Benefit Analysis
4-Bit Mismatches on a 32-Bit Vector
Hamming Distance Approach
- 2 bits to indicate the number of mismatches
- 5 bits to indicate each mismatch location
- 2 + 5×4 = 22 bits
Bitmask-Based Approach (one 4-bit mask)
- 3 bits to indicate the position (8 possible locations)
- 4 bits for the mask pattern
- 3 + 4 = 7 bits
Storing more positions creates more matches, but the cost of storing them may be counterproductive. For example, if four bit changes are allowed, 22 extra bits are required. However, if those four bits fall in consecutive locations on a half-byte boundary, one 4-bit mask and a 3-bit location (7 bits total) suffice.
11
Bitmask-based Code Compression
Hamming Distance – Limited!
- Profitable only up to 3-bit mismatches
Bitmasks for handling mismatches
- Generate more repeating patterns
- XOR operation: simple and fast
- Assumes changes in consecutive locations
The Hamming distance approach is limited in the number of mismatches it can handle; handling more mismatches requires bitmasks. A different binary pattern can be matched with a dictionary entry using different bitmasks, and the same binary can also be matched with a dictionary entry using different bitmasks.
12
Bitmask Encoding (32-bit instructions)
Format for uncompressed code: Decision (1 bit) + Uncompressed Data (32 bits)
Format for compressed code: Decision (1 bit) + Number of Masks + (Mask Type, Location, Pattern) … + Dictionary Index
The encoding for bitmask-based compression is similar to the previous approaches. The only difference is that it also stores information about one or more bitmasks: the mask type (e.g., 2-bit, 4-bit), the location where the mask is applied, and the actual mask pattern.
13
Code Compression with Bitmasks
Seong and Mishra, "A Bitmask-based Code Compression Technique for Embedded Systems", ICCAD 2006
Compressed fields: bitmask position and bitmask value
Decision bits: 0 – compressed, 1 – not compressed; 0 – bitmask used, 1 – no bitmask used
Original code size = 80 bits; compressed code size = 54 bits; dictionary size = 16 bits
Compression ratio = (54 + 16) / 80 = 87.5%
The same program discussed earlier can be compressed using bitmasks. The second element of the program can be compressed using the second dictionary entry, with a mask applied on the first two bits (from left) and the pattern "11". The mask type is implicit for the whole program (one 2-bit mask). The compression ratio is 87.5%.
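The mismatch test generalizes to bitmasks: XOR the word with a dictionary entry and check whether all differing bits fall inside one aligned mask-sized field. A hedged sketch; the helper name and bit-string convention are ours:

    def bitmask_match(word, entry, mask_size=2):
        """Return (field_position, mask_pattern) if `word` differs from `entry`
        only within one aligned mask_size-bit field, else None."""
        width = len(word)
        diff = int(word, 2) ^ int(entry, 2)
        if diff == 0:
            return None                                   # exact match, no mask needed
        for pos in range(0, width, mask_size):            # fixed (aligned) positions only
            shift = width - pos - mask_size
            field = ((1 << mask_size) - 1) << shift
            if diff & ~field == 0:                        # all differing bits in one field
                return pos // mask_size, format((diff & field) >> shift, '0%db' % mask_size)
        return None

    # e.g. the slide's second program word matching the second dictionary entry with
    # pattern '11' in the first 2-bit field would return (0, '11').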
14
Decompression Engine (DCE)
Pre-cache design: DCE between memory and cache
Post-cache design: DCE between cache and processor
(+) Cache holds compressed data
(+) Reduced bus bandwidth and higher cache hits
The decompression engine can be placed pre-cache or post-cache. In the pre-cache implementation, the decompression engine (DCE) sits between memory and cache, so decompression efficiency is not critical. The post-cache placement is beneficial since the cache holds compressed data, reducing bus bandwidth and improving cache hits. However, post-cache placement requires the DCE to deliver instructions at the rate the processor consumes them.
[Diagram: Main Memory – Bus – I-Cache/D-Cache – Processor, with the DCE placed either before the I-Cache (pre-cache) or after it (post-cache)]
15
Decompression Engine for Bitmasks
[Block diagram: compressed code → decoding logic → dictionary index and bitmask fields; the Dictionary (SRAM) read and mask generation proceed in parallel; an XOR combines the dictionary output with the generated mask; decoded instructions go to an output buffer]
This slide shows a simplified block diagram of the DCE. The compressed code is decoded to find the dictionary index as well as the bitmasks (if compressed). The 32-bit mask generation (using the bitmasks from compression) and the dictionary read are done in parallel. If a bitmask was used during compression, an XOR is performed between the dictionary output and the 32-bit mask.
- Mask generation is done in parallel with dictionary access
- Capable of decoding more than one instruction per cycle
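Functionally, the decoder walks the bitstream and, for each compressed entry, performs the dictionary read and (if flagged) the XOR with the reconstructed mask. In hardware those two happen in parallel; a sequential software model just makes the logic explicit. A sketch assuming the field order of the examples above (one 2-bit fixed mask; widths and names are our assumptions):

    def decompress(bits, dictionary, width=8, mask_size=2):
        index_bits = max(1, (len(dictionary) - 1).bit_length())
        loc_bits = (width // mask_size - 1).bit_length()      # aligned mask positions
        out, i = [], 0
        while i < len(bits):
            if bits[i] == '1':                                # uncompressed word follows
                out.append(bits[i + 1:i + 1 + width])
                i += 1 + width
                continue
            i += 1                                            # skip 'compressed' flag
            masked = (bits[i] == '0')                         # 0 = bitmask used
            i += 1
            if masked:
                loc = int(bits[i:i + loc_bits], 2)
                pat = int(bits[i + loc_bits:i + loc_bits + mask_size], 2)
                i += loc_bits + mask_size
            idx = int(bits[i:i + index_bits], 2)
            i += index_bits
            word = int(dictionary[idx], 2)
            if masked:                                        # apply the mask via XOR
                word ^= pat << (width - (loc + 1) * mask_size)
            out.append(format(word, '0%db' % width))
        return out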
16
Challenges in Bitmask-based Compression
How to find the profitable mask patterns?
- A larger bitmask generates more matches: a 4-bit mask covers 16 possible mismatch patterns, an 8-bit mask covers 256
- A larger bitmask incurs higher cost: a 4-bit mask costs 7 bits, an 8-bit mask costs 10 bits
How to perform efficient dictionary selection?
- Frequency-based selection is not suitable; the approach must consider existing as well as bitmask-enabled repetitions
These are the two important challenges: finding the profitable mask patterns and performing efficient dictionary selection. Using a larger bitmask, or a larger number of bitmasks, produces more matching patterns but may be too costly. Frequency-based dictionary selection is best for standard dictionary-based compression, but it will not produce good compression here since it cannot account for both self-repetitions and bitmask-enabled repetitions.
Seong and Mishra, Bitmask-Based Code Compression for Embedded Systems, IEEE Trans. on CAD, 2008
17
Outline: Motivation, Efficient Compression, Dictionary Based Compression, Bitmask Based Compression, Mask Selection, Dictionary Selection, Fast Decompression, Code Placement, Parallel Decompression
18
Mask Selection: Profitable Mask Combinations
- A sliding mask can be applied anywhere on 32 bits
- A fixed mask can be applied only at fixed locations, e.g., an 8-bit mask only on byte boundaries

Cost (in bits) of handling a given number of bit changes with masks of a given size:

    Bit changes |  1-bit  2-bit  4-bit  8-bit  16-bit
        1       |    5
        2       |   11      6
        4       |   22     13      7
        8       |   43     26     15     10
       16       |   84     51     30     21     17
       32       |  165    100     59     42     35

The goal of mask selection is to find the best possible mask combinations. Each table entry gives the cost of using a particular mask size when a certain number of bit changes is allowed; for example, if 8 bit changes are allowed and 4-bit masks are used, the cost is 15 bits. The profitable entries yield eleven candidate masks, from 1-bit to 8-bit: fixed masks exist for the 2-, 4-, and 8-bit sizes, while every size from 1 to 8 bits can be sliding (a 1-bit fixed mask brings nothing, since its location cost equals the sliding case). Here, fixed means the mask is applied only on aligned boundaries: an 8-bit fixed mask can be applied at 4 byte boundaries of a 32-bit vector (2 location bits), whereas a sliding mask can be applied anywhere (5 location bits on a 32-bit vector).
19
How to Compute Number of Bits
Example: up to 4 bit changes (row "4") using 2-bit masks (column "2-bit").
- We need 4 bits to remember the even locations (16 possibilities) in a 32-bit binary.
- We need up to two 2-bit masks to cover up to 4 bit changes; this assumes the changes occur in pairs of two bits.
- Each bitmask costs 2 bits (mask pattern) + 4 bits (location) = 6 bits.
- Since up to 4 bit changes are allowed, one binary may need only one 2-bit mask while another needs two, so 1 extra bit indicates whether one or two masks are used.
- Total cost for masks = 1 + 2 × 6 = 13 bits.
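The whole cost table can be regenerated from this reasoning. A hedged sketch, assuming fixed masks align on mask-size boundaries and that a 1-bit mask needs no pattern bits (it always flips the bit):

    import math

    def location_bits(mask_size, sliding, width=32):
        # sliding: anywhere in the vector; fixed: only size-aligned boundaries
        return int(math.log2(width)) if sliding else int(math.log2(width // mask_size))

    def mask_cost(bit_changes, mask_size, sliding=False, width=32):
        pattern = mask_size if mask_size > 1 else 0          # 1-bit mask: just flip, no pattern
        n = math.ceil(bit_changes / mask_size)               # worst-case number of masks
        selector = math.ceil(math.log2(n)) if n > 1 else 0   # bits to say how many masks are used
        return selector + n * (pattern + location_bits(mask_size, sliding, width))

    print(mask_cost(4, 2))          # 13, the example above
    print(mask_cost(8, 4))          # 15, the 8-change / 4-bit entry of the table
    print(mask_cost(2, 1, True))    # 11, two sliding 1-bit masks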
20
Mask Selection: Two Questions
How many bitmasks do we need? Up to two mask patterns.
- The minimum cost to store three bitmasks is 17 bits for a 32-bit vector: 15 bits (three 1-bit sliding masks) + 2 bits (mask combination). Adding 10-14 bits for the dictionary index and codeword gives 27-31 bits for a 32-bit vector.
Which combinations are profitable?
- Eleven possibilities: 1s, 2s, 2f, 3s, 4s, 4f, 5s, 6s, 7s, 8s, 8f
- Select one or two from these eleven; the number of combinations can be reduced further.
There are two important questions in mask selection: how many bitmasks do we need, and which combinations are profitable? The first is easy to answer: up to two mask patterns are profitable, since using three bitmasks costs 27-31 bits for a 32-bit vector, leaving hardly any compression. For the second, we need to find which one or two masks (from the 11 patterns) are profitable.
21
Comparison of Bitmask Combinations
Benchmarks are compiled for the TI TMS320C6x
(1s, 4f) and (2s, 2f) provide the best compression
We performed studies using various mask combinations on benchmarks compiled for different architectures. Shown here are three benchmarks compiled for the TI C6x architecture; they exhibit an interesting pattern in which (1s, 4f) and (2s, 2f) are the most profitable. A similar trend is observed in other results.
22
Mask Selection: Observations
Which bitmask patterns are profitable?
- Factors of 32 (1, 2, 4, and 8) produce better results since they can be applied cost-effectively at fixed locations
- 8-bit fixed/sliding is not helpful: the probability of more than 4 consecutive changes is low, and two smaller masks perform better than one larger mask
- 4-bit sliding does not perform better than 4-bit fixed
- Two bitmasks provide better results than a single one
- Choose two from four bitmasks: (1s, 2f, 2s, 4f)
These observations follow because 1-, 2-, 4-, and 8-bit masks can be applied at fixed locations, and an 8-bit mask (fixed or sliding) is not better than its smaller counterparts since two small masks are more profitable than one large one. Together, they reduce mask selection to choosing two bitmasks from four possibilities (1s, 2f, 2s, 4f).
23
Outline: Motivation, Efficient Compression, Dictionary Based Compression, Bitmask Based Compression, Mask Selection, Dictionary Selection, Fast Decompression, Code Placement, Parallel Decompression
24
Dictionary Selection (dynamic vs. static)
- Frequency: select the most frequently occurring binary patterns
- Spanning: select patterns to ensure uniform coverage of all patterns based on Hamming distance
- Bit Savings: select patterns based on the bit savings due to self and mask-matched repetitions
We consider a static dictionary selection approach. Frequency-based selection is clearly not suitable for bitmask-based compression. It may seem intuitive to use a spanning-based approach that selects binaries uniformly from the program based on Hamming distance, anticipating that a set of adjacent binaries can be matched at low cost. However, we observed that neither spanning nor an ad-hoc mix of frequency and spanning generates good compression. We proposed a bit-savings-based approach that considers the bit savings from self-repetitions as well as bitmask-based matches.
25
Effect of Dictionary Selection Methods
Frequency-based dictionary selection: CR = 97.5%
Spanning-based dictionary selection: CR = 87.5%
How do we choose dictionary entries to maximize compression? Choosing the most frequent entry yields a compression ratio of 97.5%, but choosing another entry produces a 10% improvement in this case, since it gains from both frequency and bitmask-enabled matches at low cost.
26
BitSavings-based Dictionary Selection
- Construct a graph of the original code: each node represents a binary in the program; an edge represents that one node can be matched with another using a bitmask
- Assign weights based on bit savings: a node's weight is computed from the frequency of the corresponding binary; an edge's weight measures how that pattern can be used to create matches with the other pattern
- Compute the total savings of each node: the sum of the node weight and its edge weights
- Select the node with the highest savings, then delete it along with all nodes and edges connected to it
- Repeat until the graph is empty or the dictionary is full
The goal is to choose the most profitable dictionary entries. The algorithm selects the node with the highest savings and inserts it into the dictionary, then deletes the selected node as well as the nodes and edges connected to it (a THRESHOLD determines which ones to delete for improved results). This process repeats until the graph is empty or the dictionary is full.
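A compact model of this selection loop follows; the graph is encoded as weight dictionaries, and the names, as well as the omission of the THRESHOLD refinement, are ours:

    def select_dictionary(node_weight, edge_weight, dict_size):
        """node_weight[p]: bits saved by p's own repetitions.
        edge_weight[p][q]: bits saved by mask-matching q against p (symmetric)."""
        alive = set(node_weight)
        dictionary = []
        while alive and len(dictionary) < dict_size:
            def savings(p):
                return node_weight[p] + sum(
                    w for q, w in edge_weight.get(p, {}).items() if q in alive)
            best = max(alive, key=savings)
            dictionary.append(best)
            alive -= {best} | set(edge_weight.get(best, {}))  # drop winner and its neighbors
        return dictionary

On the slide's example this first picks F (savings 27) and, after F and its neighbors are removed, picks G (24), matching the walkthrough on the next slides.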
27
BitSavings-based Dictionary Selection
Node weight: number of bits saved due to the frequency of the pattern
Edge weight: number of bits saved due to a bitmask-based match
Total weight: node weight + all edge weights connected to the node
A = 0 + 10 = 10; B = 7 + 15 = 22; C = 7 + 15 = 22; D = 0 + 5 = 5; E = 0 + 15 = 15; F = 7 + 20 = 27; G = 14 + 10 = 24
In this example, node A has 0 savings since it has no repetitions. It can be matched with node D for a savings of 5 bits, and with node B for another 5 bits, so its total savings is 0 + 10 = 10. After computing all nodes, node F has the highest savings (27). The algorithm selects node F as a dictionary entry and deletes it along with all nodes and edges connected to it.
28
BitSavings-based Dictionary Selection
With the same weight definitions as before, the remaining nodes are: A = 0 + 10 = 10; B = 7 + 15 = 22; D = 0 + 5 = 5; G = 14 + 10 = 24
Next, all the weights are recomputed and the new winner, G in this case, is selected. This continues until the dictionary is full or the graph is empty.
29
Application Aware Code Compression
Algorithm: Code Compression Using Bitmasks
Input: Original code (32-bit vectors)
Outputs: Compressed code and dictionary
Begin
    Step 1: Select the mask patterns.
    for each pair of mask patterns from (1s, 2s, 2f, 4f)
        Step 2: Select the optimized dictionary.
        Step 3: Compress the 32-bit vectors using the constraints.
        Step 4: Compute the compression ratio and compare.
    endfor
    Step 5: Adjust and handle the branch targets.
    Return the best compressed code and dictionary.
End
As discussed earlier, we need to select two bitmasks from a set of four, so mask selection evaluates sixteen combinations. For each combination it performs bit-savings-based dictionary selection and code compression, and the best of the sixteen compression scenarios is chosen. Since compression is performed offline and even minor improvements have significant impact, this cost is acceptable. Finally, the compressed code and the dictionary are returned after adjusting the branch targets.
30
Experiments
Experimental setup:
- Benchmarks: TI and MediaBench
- Architectures: SPARC, TI TMS320C6x, MIPS
Results:
- BCC (bitmask-based code compression): customized encodings for different architectures, effects of dictionary size selection, comparison with existing techniques
- ACC (application-aware code compression): bitmask selection, dictionary selection, BCC versus ACC
31
Compression Ratio for adpcm_en
Encoding 1 (one 8-bit mask), Encoding 2 (two 4-bit masks), Encoding 3 (one 4-bit and one 8-bit mask)
Encoding 2 outperforms the others
32
Effects of Different Dictionary Sizes
Compression ratio: 55-67%
Smaller programs favor a small dictionary; bigger programs favor a big dictionary
33
Comparison with other Techniques
Bitmask Approach
- Smaller compression ratio is better
- Outperforms other dictionary-based techniques by 15%
- Higher decompression bandwidth than existing compression techniques
34
Comparison of Dictionary Selection Methods
We compare different dictionary selection methods for bitmask-based code compression. The BitSavings approach (ours) clearly outperforms both frequency- and spanning-based techniques.
35
Compression Ratio Comparison
BCC: Bitmask-based Code Compression; ACC: Application-aware Code Compression
- BCC generates a 15-20% improvement over other techniques
- ACC outperforms BCC by another 5-10%
Bitmask-based code compression (BCC) improves traditional dictionary-based code compression by 15-20%. Our approach improves it by another 5-10% by aggressively creating more matching patterns, thereby improving compression efficiency without introducing any additional decompression penalty.
36
Outline: Motivation, Efficient Compression, Dictionary Based Compression, Bitmask Based Compression, Mask Selection, Dictionary Selection, Fast Decompression, Code Placement, Parallel Decompression
37
How to Accelerate Decompression?
Divide the code into several streams; compress and store each stream separately
- Enables parallel decompression using multiple decoders
- Drawbacks: unequal compression, wasted space, and difficulty handling branch targets
38
Another Alternative
Always perform fixed encoding (variable-to-fixed or fixed-to-fixed)
Sacrifices compression efficiency
39
Overview
- Divide the code into multiple streams
- Compress each of them separately
- Merge them using our placement algorithm: reduce space wastage and ensure that none of the decoders is idle
Qin and Mishra, "A Universal Placement Technique of Compressed Instructions for Efficient Parallel Decompression", IEEE Transactions on CAD, 28(8), 2009.
40
Compression Algorithm
41
Outline: Motivation, Efficient Compression, Dictionary Based Compression, Bitmask Based Compression, Mask Selection, Dictionary Selection, Fast Decompression, Code Placement, Parallel Decompression
42
Compression using Huffman Coding
Huffman coding with instruction division and selective compression
Compression ratio: compressed size / original size = 60/72 = 83.3% (the two divided streams individually achieve 77.8% and 88.9%)
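A minimal sketch of selective Huffman compression in the same spirit: the Huffman construction is standard, while the per-symbol "compressed?" flag and the 4-bit symbol width are our assumptions, not the exact format used here:

    import heapq
    from collections import Counter

    def huffman_code(symbols):
        """Map each distinct symbol to a Huffman code string."""
        heap = [[f, i, {s: ''}] for i, (s, f) in enumerate(Counter(symbols).items())]
        heapq.heapify(heap)
        uid = len(heap)
        while len(heap) > 1:
            f1, _, t1 = heapq.heappop(heap)
            f2, _, t2 = heapq.heappop(heap)
            merged = {s: '0' + c for s, c in t1.items()}
            merged.update({s: '1' + c for s, c in t2.items()})
            heapq.heappush(heap, [f1 + f2, uid, merged])
            uid += 1
        return heap[0][2]

    def selective_compress(symbols, width=4):
        """Emit '1'+code when the Huffman code beats the raw symbol, else '0'+raw."""
        code = huffman_code(symbols)
        return ''.join('1' + code[s] if len(code[s]) < width else '0' + s
                       for s in symbols)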
43
Example using Two Decoders
Branch block: the instructions between two consecutive branch targets
Storage structure: Slot 1 (4 bits) and Slot 2 (4 bits)
Input: compressed streams
Sufficient decode length: 1 + length of the uncompressed field = 1 + 4 = 5
44
Example
45
Decode-Aware Code Placement
Algorithm: Placement of Two Bitstreams
Input: Storage Block
Output: Placed Bitstreams
Begin
    if !Ready1 and !Ready2 then
        Assign Stream1 to Slot1 and Stream2 to Slot2
    else if !Ready1 and Ready2 then
        Assign Stream1 to Slot1 and Slot2
    else if Ready1 and !Ready2 then
        Assign Stream2 to Slot1 and Slot2
    else
        Assign Stream1 to Slot1 and Stream2 to Slot2
End
Ready_i: the i-th decoder's buffer already has sufficient bits
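The decision table translates directly into Python. In this sketch the streams are modeled as bit strings; a real implementation would track decoder buffer levels to derive the Ready flags, which are passed in here:

    def place_block(s1, s2, ready1, ready2, slot=4):
        """Fill one storage block (two slots). Returns (slot1, slot2, rest1, rest2)."""
        if not ready1 and ready2:        # decoder 2 has enough buffered bits:
            return s1[:slot], s1[slot:2*slot], s1[2*slot:], s2   # both slots to stream 1
        if ready1 and not ready2:
            return s2[:slot], s2[slot:2*slot], s1, s2[2*slot:]   # both slots to stream 2
        # neither (or both) ready: one slot to each stream
        return s1[:slot], s2[:slot], s1[slot:], s2[slot:]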
46
Outline: Motivation, Efficient Compression, Dictionary Based Compression, Bitmask Based Compression, Mask Selection, Dictionary Selection, Fast Decompression, Code Placement, Parallel Decompression
47
Decompression Mechanism
48
Experimental Setup
- MediaBench and MiBench benchmarks: adpcm_enc, adpcm_de, cjpeg, djpeg, gsm_to, gsm_un, mpeg2enc, mpeg2dec, and pegwit
- Compiled for four target architectures: TI TMS320C6x, PowerPC, SPARC, and MIPS
- Compared our approach with CodePack
- BPA1: bitstream placement for two streams (two decoders work in parallel)
- BPA2: bitstream placement for four streams (four decoders work in parallel)
49
Decode Bandwidth
Comparing CodePack, BPA1, and BPA2: 2-4x improvement in decode performance
50
Compression Penalty
Less than 1% penalty in compression performance
51
Hardware Overhead
- BPA1 and CodePack use similar area/power
- BPA2 requires roughly double the area/power (four 16-bit decoders)
- This overhead is negligible, many times smaller than the typical reduction in overall area and energy delivered by code compression

                          CodePack   BPA1     BPA2
    Area (um2)            122263     137529   253586
    Power (mW)            7.5        9.8      14.6
    Critical path (ns)    6.91       5.76     5.94

Synthesized using Synopsys Design Compiler and the TSMC 0.18 cell library
52
More than 4 Decoders?
BPA1 (two decoders): may need 1 startup stall cycle per branch block
BPA2 (four decoders): may need 2 startup stall cycles per branch block
We proved that BPA1 and BPA2 use exactly 1 and 2 cycles (respectively) more than the optimal placement.
Too many parallel decoders is not profitable: the overall gain in output bandwidth is eroded by additional startup stalls, and those stalls may not be negligible compared with the execution time of the branch block itself.
53
Conclusion
Code compression is promising:
- Reduces memory size and cost
- Reduces power dissipation / energy requirements
- Improves overall performance
Conflicting requirements: complex compression for code size reduction versus simple (fast) decompression for better speed.
Bitmask-based compression provides significant code size reduction, and efficient code placement enables parallel and fast decompression.