Published by Stewart Glenn. Modified over 9 years ago.
1
Optimizing Data Compression Algorithms for the Tensilica Embedded Processor
Tim Chao, Luis Robles, Rebecca Schultz
2
Overview
Goals:
- Explore possible optimizations of data compression algorithms for embedded architectures.
- Theorize the ideal characteristics of an embedded architecture specialized for compression/decompression.
3
Optimizing Zlib
- Based on the Lempel-Ziv (LZ77) algorithm
- Used the included minigzip app as our performance benchmark
- Flat profile
  - Most commonly executed line accounted for ~9% of total cycles
- Decided to see if several TIE optimizations would produce a significant speedup…
4
Results

Instruction      Speedup
UPDATE_HASH       48.74%
ADLER32           53.94%
SEND_BITS          5.20%

Benchmark   Original (cycles)   Optimized (cycles)   Improvement
Small Txt             1315432              1275138         3.06%
Large Txt            19163172             19022336         0.73%
TIFF Img            117891384            115321797         2.18%
PPM Img              13252656             13082882         1.28%
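For readers unfamiliar with the two routines that show the largest instruction speedups, here is a minimal C sketch of what they compute in software. The Adler-32 loop follows the checksum's published definition, and the hash update follows the shape of zlib's UPDATE_HASH macro. The function names and the hash_shift / hash_mask values in the usage note are assumptions for illustration; the slides do not show the TIE instructions that replaced these routines.

```c
#include <stddef.h>
#include <stdint.h>

/* Largest prime below 2^16, used as the Adler-32 modulus. */
#define ADLER_MOD 65521u

/* Plain Adler-32: a is the running byte sum (starting at 1),
   b is the running sum of the successive a values. */
uint32_t adler32_sketch(const unsigned char *buf, size_t len)
{
    uint32_t a = 1, b = 0;
    for (size_t i = 0; i < len; i++) {
        a = (a + buf[i]) % ADLER_MOD;
        b = (b + a) % ADLER_MOD;
    }
    return (b << 16) | a;
}

/* Rolling hash update in the style of zlib's UPDATE_HASH:
   shift the old hash, XOR in the next byte, mask to the table size.
   hash_shift and hash_mask are passed in explicitly (hypothetical
   parameters; zlib keeps them in its deflate state). */
uint32_t update_hash_sketch(uint32_t h, unsigned char c,
                            unsigned hash_shift, uint32_t hash_mask)
{
    return ((h << hash_shift) ^ c) & hash_mask;
}
```

As a usage example, adler32_sketch over the 9 bytes of "Wikipedia" yields 0x11E60398, and with an assumed hash_shift of 5 and hash_mask of 0x7FFF, feeding 'a' then 'b' into update_hash_sketch starting from 0 gives 0xC42. Both routines are a shift, an add or XOR, and a mask or modulo per byte, which is the kind of short arithmetic sequence that fits in a single custom instruction.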
5
Obstacles for TIE Optimization
- Most highly executed instructions were branches on a value in memory
  - High latency but not computationally intensive
- Accesses to memory were random
  - Prefetching from memory not an option
- Zlib implementation already highly optimized
6
Optimizing LZO
- Based on the Lempel-Ziv algorithm
- Differs from Zlib
  - Favors speed over compression
  - Profile was less flat
- Likely to provide better performance gains than Zlib…
7
Results

Instruction      Speedup
D_INDEX           33.33%
ADLER32           54.00%

Benchmark   Original (cycles)   Optimized (cycles)   Improvement (percent)
Small Txt              382459               365237                    4.5%
Large Txt             1598945              1513791                    5.3%
TIFF Img             19850136             18960531                    4.5%
PPM Img               1800971              1721666                    4.4%
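The Improvement column appears to be the relative reduction in cycle count, 100 * (original - optimized) / original. The short C sketch below checks that reading; the function name is hypothetical, and the inputs in the usage note read the Small Txt and Large Txt rows as 382459 / 365237 and 1598945 / 1513791 cycles.

```c
/* Percent reduction in cycle count relative to the original run. */
double improvement_percent(unsigned long original_cycles,
                           unsigned long optimized_cycles)
{
    return 100.0 * ((double)original_cycles - (double)optimized_cycles)
                 / (double)original_cycles;
}
```

Under this assumption, improvement_percent(382459, 365237) is roughly 4.5 and improvement_percent(1598945, 1513791) is roughly 5.3, which agrees with the reported figures.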
8
Optimizing the Cache Size
- Motivation: memory intensive, but lacked a memory access pattern suitable for prefetching
- Results:
  - Performance improved linearly as cache size doubled
  - Greater improvement seen when the window can fit entirely in cache
9
Performance as Cache Size Increases
10
Other Zlib Architectures
- Our Ideal Architecture
  - Low access latency buffer to hold the window
  - Process 2 or more windows in parallel
    - Algorithm is independent across window boundaries
- IBMLZ1 for storage systems
  - Window stored in CAM; large comparator used to rapidly find longest matches in parallel
  - Arithmetic coding used instead of Huffman coding
11
Reimplementing Longest Match
- Attempted to fully optimize a subset of the Zlib architecture
- Reimplemented the LongestMatch() routine in C++
- Attempted to vectorize the matching loop
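To make the "matching loop" concrete, here is a simplified C sketch of a longest-match search over a sliding window. It is not the authors' LongestMatch() reimplementation and not zlib's routine, which follows hash chains rather than scanning every candidate; the function names and the brute-force scan are assumptions chosen for brevity. The inner byte-by-byte comparison is the loop a vectorization effort would target.

```c
#include <stddef.h>

/* Length of the common prefix of a and b, capped at max_len.
   One load and one branch per byte: the loop to be vectorized. */
size_t match_length_scalar(const unsigned char *a,
                           const unsigned char *b, size_t max_len)
{
    size_t n = 0;
    while (n < max_len && a[n] == b[n])
        n++;
    return n;
}

/* Brute-force search (hypothetical helper): try every earlier start
   position within `window` bytes of `pos` and keep the longest match.
   Returns the best length; *match_start is set only if a match is found. */
size_t longest_match_sketch(const unsigned char *buf, size_t pos,
                            size_t end, size_t window,
                            size_t *match_start)
{
    size_t best = 0;
    size_t lo = pos > window ? pos - window : 0;
    for (size_t cand = lo; cand < pos; cand++) {
        size_t len = match_length_scalar(buf + cand, buf + pos, end - pos);
        if (len > best) {
            best = len;
            *match_start = cand;
        }
    }
    return best;
}
```

For the 10-byte input "abcabcabcx" with pos = 3, the search returns a match of length 6 starting at offset 0; the match overlaps the current position, which LZ77-style schemes permit.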
12
Vectra Extensions for Zlib
- Fetch 8 bytes into a vector register
  - Effectively implements local prefetching
- Test for equality in parallel
  - Reduces latency overhead of a load followed by a branch
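The idea of fetching 8 bytes and testing them for equality at once can be approximated in portable C with 64-bit loads. The sketch below is only a software analogue under that assumption: it uses no Vectra or TIE instructions, and the function name is hypothetical. It replaces eight load-and-branch pairs with one comparison per 8-byte chunk, then falls back to bytes to locate the exact mismatch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Common-prefix length of a and b, capped at max_len,
   comparing 8 bytes per iteration where possible. */
size_t match_length_by8(const unsigned char *a,
                        const unsigned char *b, size_t max_len)
{
    size_t n = 0;

    /* Whole chunks: one 64-bit equality test per 8 bytes. */
    while (n + 8 <= max_len) {
        uint64_t wa, wb;
        memcpy(&wa, a + n, 8);  /* memcpy keeps unaligned reads well defined */
        memcpy(&wb, b + n, 8);
        if (wa != wb)
            break;              /* the mismatch lies inside this chunk */
        n += 8;
    }

    /* Byte by byte: pinpoints the mismatch or handles the short tail. */
    while (n < max_len && a[n] == b[n])
        n++;
    return n;
}
```

For two 19-byte buffers that differ only in their final byte, the function compares two full chunks and then three tail bytes, returning 18. Finishing with a byte loop rather than bit tricks on the XOR of the two words keeps the sketch independent of byte order.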
13
Conclusions
- Zlib was already fairly optimized
  - Demonstrated by flat profile
- TIE extensions are best suited for computationally intensive algorithms
  - Memory access latency and control flow instructions were the bottleneck in Zlib
- Zlib performance did not scale well with increased cache size
- Overall, our TIE extensions did not provide enough improvement to justify their cost
14
Questions? Answers?