Optimizing Data Compression Algorithms for the Tensilica Embedded Processor Tim Chao Luis Robles Rebecca Schultz.

Overview. Goals: explore possible optimizations of data compression algorithms for embedded architectures. Theorize the ideal characteristics of an embedded architecture specialized for compression/decompression.

Optimizing Zlib. Based on the Lempel-Ziv (LZ77) algorithm. Used the included minigzip app as our performance benchmark. Flat profile: the most commonly executed line accounted for ~9% of total cycles. Decided to see if several TIE optimizations would produce a significant speedup…
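A flat profile limits what any single custom instruction can deliver. As a rough sketch using Amdahl's law, and assuming the ~9% figure is the fraction of total cycles spent in the hottest line, the overall gain from accelerating only that line can be bounded as follows. The function name and the sample local speedups are illustrative and do not come from the slides.

```python
def overall_speedup(fraction: float, local_speedup: float) -> float:
    """Amdahl's law: whole-program speedup when `fraction` of the cycles
    is accelerated by a factor of `local_speedup`."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


HOT_FRACTION = 0.09  # ~9% of total cycles, per the profile described above

# Illustrative local speedups, including the limiting case of "free" execution.
for s in (2.0, 10.0, float("inf")):
    print(f"local speedup {s}: overall {overall_speedup(HOT_FRACTION, s):.4f}x")
```

Even if the hottest line cost nothing at all, the whole program would run at most 1/0.91 ≈ 1.099 times faster under this assumption.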

Results. Speedup: UPDATE_HASH 48.74%, ADLER32 53.94%, SEND_BITS 5.20%.

            Original (cycles)   Optimized (cycles)   Improvement
Small Txt             1315432              1275138         3.06%
Large Txt            19163172             19022336         0.73%
TIFF Img            117891384            115321797         2.18%
PPM Img              13252656             13082882         1.28%
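For reference, here is a minimal Python sketch of what two of the accelerated routines compute in software. The Adler-32 definition is the standard one and is checked against Python's `zlib.adler32`. The hash update follows the shift-xor-mask form of zlib's `UPDATE_HASH` macro; the constants `HASH_BITS` and `HASH_SHIFT` are assumptions corresponding to zlib's default memory level, and nothing here reflects how the TIE instructions themselves were written.

```python
import zlib

MOD_ADLER = 65521  # largest prime below 2**16


def adler32(data: bytes) -> int:
    """Bytewise Adler-32: two running sums modulo 65521."""
    a, b = 1, 0
    for byte in data:
        a = (a + byte) % MOD_ADLER
        b = (b + a) % MOD_ADLER
    return (b << 16) | a


# Assumed parameters (zlib defaults): 15 hash bits, MIN_MATCH = 3.
HASH_BITS = 15
HASH_SHIFT = 5  # (HASH_BITS + MIN_MATCH - 1) // MIN_MATCH
HASH_MASK = (1 << HASH_BITS) - 1


def update_hash(h: int, c: int) -> int:
    """Rolling hash step: shift in one new byte, drop the oldest one."""
    return ((h << HASH_SHIFT) ^ c) & HASH_MASK


sample = b"Wikipedia"
print(hex(adler32(sample)), adler32(sample) == zlib.adler32(sample))
```

Because three shifts of 5 bits push a byte past the 15-bit mask, the hash value depends only on the last three input bytes, which is what lets deflate index 3-byte strings incrementally.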

Obstacles for TIE Optimization. Most highly executed instructions were branches on a value in memory: high latency but not computationally intensive. Accesses to memory were random, so prefetching from memory was not an option. Zlib implementation was already highly optimized.

Optimizing LZO. Based on the Lempel-Ziv algorithm. Differs from Zlib: favors speed over compression. Profile was less flat, so likely to provide better performance gains than Zlib…

Results. Instruction speedup: D_INDEX 33.33%, ADLER32 54.00%.

            Original (cycles)   Optimized (cycles)   Improvement (percent)
Small Txt              382459               365237                    4.5%
Large Txt             1598945              1513791                    5.3%
TIFF Img             19850136             18960531                    4.5%
PPM Img               1800971              1721666                    4.4%
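The improvement column appears to be the relative reduction in cycle count. The short check below assumes that definition and uses the LZO cycle counts reported above; the function and variable names are illustrative.

```python
def improvement_percent(original_cycles: int, optimized_cycles: int) -> float:
    """Relative cycle reduction, in percent of the original cycle count."""
    return 100.0 * (original_cycles - optimized_cycles) / original_cycles


# (original cycles, optimized cycles) for each LZO benchmark input
lzo_runs = {
    "Small Txt": (382459, 365237),
    "Large Txt": (1598945, 1513791),
    "TIFF Img": (19850136, 18960531),
    "PPM Img": (1800971, 1721666),
}

for name, (orig, opt) in lzo_runs.items():
    print(f"{name}: {improvement_percent(orig, opt):.1f}%")
```

Under this definition the computed values round to 4.5%, 5.3%, 4.5% and 4.4%, matching the reported figures.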

Optimizing the Cache Size. Motivation: memory intensive, but lacked a memory access pattern suitable for prefetching. Results: performance improved linearly as cache size doubled. Greater improvement seen when the window can fit entirely in cache.

Performance as Cache Size Increases

Other Zlib Architectures. Our ideal architecture: low access latency buffer to hold the window. Process 2 or more windows in parallel, since the algorithm is independent across window boundaries. IBM LZ1 for storage systems: window stored in CAM, with a large comparator used to rapidly find longest matches in parallel. Arithmetic coding used instead of Huffman coding.
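As a loose software analogy for processing several windows in parallel, the sketch below compresses fixed-size blocks independently with Python's `zlib` module and a thread pool. The 32 KiB block size, the worker count, and the function names are assumptions chosen for illustration. Compressing blocks as separate streams gives up matches that would span block boundaries, so this illustrates the independence idea rather than the proposed hardware.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

BLOCK_SIZE = 32 * 1024  # assumption: one block per 32 KiB window


def compress_blocks(data: bytes, workers: int = 2) -> list[bytes]:
    """Compress each block as an independent zlib stream, in parallel."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(zlib.compress, blocks))


def decompress_blocks(chunks: list[bytes]) -> bytes:
    """Decompress the independent streams and stitch them back together."""
    return b"".join(zlib.decompress(chunk) for chunk in chunks)


payload = b"embedded compression benchmark " * 5000
chunks = compress_blocks(payload)
print(len(chunks), decompress_blocks(chunks) == payload)
```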

Reimplementing Longest Match. Attempted to fully optimize a subset of the Zlib architecture. Reimplemented the LongestMatch() routine in C++. Attempted to vectorize the matching loop.
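To make the matching loop concrete, here is a simplified Python sketch of a longest-match search. It is not the reimplemented C++ routine or zlib's own code: it takes an explicit list of candidate positions (hypothetical, standing in for the hash chain) and compares byte by byte, which is the inner loop a vectorized version would replace. `MIN_MATCH` and `MAX_MATCH` use deflate's limits of 3 and 258.

```python
MIN_MATCH = 3
MAX_MATCH = 258


def match_length(buf: bytes, cand: int, cur: int) -> int:
    """Count matching bytes between buf[cand:] and buf[cur:], with cand < cur."""
    n = 0
    limit = min(MAX_MATCH, len(buf) - cur)
    while n < limit and buf[cand + n] == buf[cur + n]:
        n += 1
    return n


def longest_match(buf: bytes, cur: int, candidates: list[int]) -> tuple[int, int]:
    """Return (length, position) of the best match, or (0, -1) if too short."""
    best_len, best_pos = 0, -1
    for cand in candidates:
        n = match_length(buf, cand, cur)
        if n > best_len:
            best_len, best_pos = n, cand
    return (best_len, best_pos) if best_len >= MIN_MATCH else (0, -1)


print(longest_match(b"abcabcabcd", 3, [0]))
```

Each iteration of the inner `while` loop is a load followed by a data-dependent branch, which is the pattern the slides identify as the latency bottleneck.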

Vectra Extensions for Zlib. Fetch 8 bytes into a vector register: effectively implements local prefetching. Test for equality in parallel: reduces the latency overhead of a load followed by a branch.
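The 8-bytes-at-a-time comparison can be emulated in software to show the idea. In the sketch below a Python integer stands in for the 64-bit vector register, and XOR plus a lowest-set-bit search stands in for the parallel equality test. This is an illustrative model with hypothetical function names, not the Vectra instruction sequence.

```python
def first_mismatch_in_8(a: bytes, b: bytes) -> int:
    """Index of the first differing byte in two 8-byte chunks, or 8 if equal."""
    # Little-endian load puts byte 0 in the least significant bits.
    x = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    if x == 0:
        return 8
    lowest_set_bit = (x & -x).bit_length() - 1
    return lowest_set_bit // 8


def match_length_wide(buf: bytes, cand: int, cur: int, max_len: int = 258) -> int:
    """Match length computed 8 bytes per step, with a bytewise tail."""
    n = 0
    limit = min(max_len, len(buf) - cur)
    while n + 8 <= limit:
        k = first_mismatch_in_8(buf[cand + n:cand + n + 8],
                                buf[cur + n:cur + n + 8])
        n += k
        if k < 8:
            return n
    while n < limit and buf[cand + n] == buf[cur + n]:
        n += 1
    return n


data = b"0123456789abcdefXX" + b"0123456789abcdefYY"
print(match_length_wide(data, 0, 18))
```

One wide compare replaces up to eight load-and-branch pairs, so the branch is taken once per 8 bytes instead of once per byte.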

Conclusions. Zlib was already fairly optimized, as demonstrated by its flat profile. TIE extensions are best suited for computationally intensive algorithms. Memory access latency and control flow instructions were the bottleneck in Zlib. Zlib performance did not scale well with increased cache size. Overall, our TIE extensions did not provide enough improvement to justify their cost.

Questions? Answers?
