Optimizing Data Compression Algorithms for the Tensilica Embedded Processor
Tim Chao, Luis Robles, Rebecca Schultz
Overview
  Goals
    Explore possible optimizations of data compression algorithms for embedded architectures.
    Theorize the ideal characteristics of an embedded architecture specialized for compression/decompression.
Optimizing Zlib
  Based on the Lempel-Ziv (LZ77) algorithm
  Used the included minigzip app as our performance benchmark
  Flat profile
    Most commonly executed line accounted for ~9% of total cycles
  Decided to see if several TIE optimizations would produce a significant speedup…
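The routines named on the results slide (UPDATE_HASH, ADLER) are small arithmetic kernels, which is what makes them candidates for custom instructions. The sketch below shows roughly what each one computes in plain C. It is an illustration under assumptions, not the code from this project: the function names `update_hash` and `adler32_simple` are hypothetical, the hash step follows the shape of zlib's UPDATE_HASH macro, and the checksum is the unoptimized textbook form of Adler-32 rather than zlib's tuned version.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper modeled on the shape of zlib's UPDATE_HASH macro:
 * shift the running hash, XOR in the next input byte, and mask the
 * result down to the hash table size. */
static inline uint32_t update_hash(uint32_t h, uint8_t c,
                                   unsigned hash_shift, uint32_t hash_mask)
{
    return ((h << hash_shift) ^ c) & hash_mask;
}

/* Textbook Adler-32 (hypothetical name): two running sums modulo 65521,
 * the largest prime below 2^16. Production code defers the modulo
 * across many bytes; this version reduces on every byte for clarity. */
uint32_t adler32_simple(const uint8_t *buf, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % 65521u;
        b = (b + a) % 65521u;
    }
    return (b << 16) | a;
}
```

Each step is a short chain of shifts, XORs, adds, and a reduction, so a single fused instruction could plausibly replace several base instructions per input byte.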
Results:
  Instruction    Speedup
  UPDATE_HASH    48.74%
  ADLER          %
  SEND_BITS      5.20%

  Input        Original (cycles)   Optimized (cycles)   Improvement
  Small Txt                                             %
  Large Txt                                             %
  TIFF Img                                              %
  PPM Img                                               %
Obstacles for TIE Optimization
  Most highly executed instructions were branches on a value in memory
    High latency but not computationally intensive
  Accesses to memory were random
    Prefetching from memory not an option
  Zlib implementation already highly optimized
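To make the obstacle concrete, here is a minimal sketch of a hash-chain search of the kind an LZ77 matcher performs. It is a simplified illustration, not zlib's actual longest_match: the function name `walk_chain`, its parameters, and the convention that position 0 ends a chain are all assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified hash-chain walk. prev[pos & wmask] holds the
 * previous window position that hashed to the same bucket; 0 ends the
 * chain. Returns the longest match found against window[cur..]. */
size_t walk_chain(const uint8_t *window, const uint16_t *prev,
                  unsigned wmask, size_t cur, uint16_t candidate,
                  size_t max_len, unsigned max_chain)
{
    size_t best = 0;
    while (candidate != 0 && max_chain-- != 0) {
        size_t len = 0;
        /* Every iteration loads two bytes and branches on the compare. */
        while (len < max_len &&
               window[candidate + len] == window[cur + len]) {
            len++;
        }
        if (len > best) {
            best = len;
        }
        /* The next address depends on data just loaded from memory,
         * so the access pattern cannot be predicted ahead of time. */
        candidate = prev[candidate & wmask];
    }
    return best;
}
```

The work per iteration is a load, a compare, and a branch, with the next address taken from the loaded value. There is little arithmetic for a custom instruction to absorb, and no regular stride for a prefetcher to follow.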
Optimizing LZO
  Based on the Lempel-Ziv algorithm
  Differs from Zlib
    Favors speed over compression
  Profile was less flat
    Likely to provide better performance gains than Zlib…
Results:
  Instruction    Speedup
  D_INDEX        33.33%
  ADLER          %

  Input        Original (cycles)   Optimized (cycles)   Improvement (percent)
  Small Txt                                             %
  Large Txt                                             %
  TIFF Img                                              %
  PPM Img                                               %
Optimizing the Cache Size
  Motivation:
    Memory intensive, but lacked a memory access pattern suitable for prefetching
  Results:
    Performance improved linearly as cache size doubled
    Greater improvement seen when the window can fit entirely in cache
Performance as Cache Size Increases
Other Zlib Architectures
  Our Ideal Architecture
    Low-access-latency buffer to hold the window
    Process 2 or more windows in parallel
      Algorithm is independent across window boundaries
  IBMLZ1 for storage systems
    Window stored in CAM, large comparator used to rapidly find longest matches in parallel
    Arithmetic coding used instead of Huffman coding
Reimplementing Longest Match
  Attempted to fully optimize a subset of the Zlib architecture
  Reimplemented LongestMatch() routine in C++
  Attempted to vectorize the matching loop
Vectra Extensions for Zlib
  Fetch 8 bytes into a vector register
    Effectively implements local prefetching
  Test for equality in parallel
    Reduces latency overhead of load followed by branch
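The idea of fetching 8 bytes and testing them for equality at once can be approximated in portable C with 64-bit words standing in for a vector register. This is a sketch of the technique only, not the Vectra implementation: the function name `match_length8` is hypothetical, and a real vector unit would do the compare in one instruction rather than via `memcpy` into integer registers.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: length of the common prefix of a and b, up to
 * max_len bytes, comparing 8 bytes per step where possible. */
size_t match_length8(const uint8_t *a, const uint8_t *b, size_t max_len)
{
    size_t n = 0;

    /* Wide phase: one 8-byte load per input, one compare, one branch. */
    while (n + 8 <= max_len) {
        uint64_t x, y;
        memcpy(&x, a + n, 8);   /* memcpy avoids unaligned-access issues */
        memcpy(&y, b + n, 8);
        if (x != y) {
            break;              /* mismatch lies inside this chunk */
        }
        n += 8;
    }

    /* Byte phase: locate the exact mismatch and handle the tail. */
    while (n < max_len && a[n] == b[n]) {
        n++;
    }
    return n;
}
```

For two 16-byte strings that first differ at byte 11, the wide phase accepts the first 8 bytes in a single compare and the byte phase finds the mismatch, so the load-then-branch sequence runs far fewer times than in a purely byte-wise loop.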
Conclusions
  Zlib was already fairly optimized
    Demonstrated by its flat profile
  TIE extensions are best suited for computationally intensive algorithms
    Memory access latency and control flow instructions were the bottlenecks in Zlib
  Zlib performance did not scale well with increased cache size
  Overall, our TIE extensions did not provide enough improvement to justify their cost
Questions? Answers?