Implementation of DWT using SSE Instruction Set
Mehta, Ami; Muller, Gilles
Lifting-based 2D-DWT
Lifting: fixed point; 1D horizontal lifting; 1D vertical lifting.
Fixed-point (9,7)-tap biorthogonal filter: lossy compression; high compression levels.
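As an illustration of what one 1D fixed-point lifting pass can look like, here is a minimal sketch. It is not the authors' code. It assumes the standard lifting factorization of the (9,7) biorthogonal filter, a Q13 fixed-point format, an even-length interleaved signal and mirrored boundaries; the names `dwt97_lift_1d`, `lift_step` and `fix_mul` are hypothetical, and the final scaling step is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: Q13 fixed point (13 fractional bits). */
#define FIX_SHIFT 13

/* Standard (9,7) lifting coefficients rounded to Q13. */
#define ALPHA (-12994) /* -1.586134342  */
#define BETA  (-434)   /* -0.05298011854 */
#define GAMMA (7233)   /*  0.8829110762 */
#define DELTA (3633)   /*  0.4435068522 */

/* Fixed-point multiply. Right-shifting a negative value is
   implementation-defined in C; common compilers shift arithmetically. */
static int32_t fix_mul(int32_t coeff, int32_t v) {
    return (int32_t)(((int64_t)coeff * v) >> FIX_SHIFT);
}

/* One lifting step: x[i] += coeff * (x[i-1] + x[i+1]) for i = start, start+2, ...
   Neighbours that fall outside the signal are mirrored. */
static void lift_step(int32_t *x, size_t n, size_t start, int32_t coeff) {
    for (size_t i = start; i < n; i += 2) {
        int32_t left  = (i == 0)     ? x[1]     : x[i - 1];
        int32_t right = (i + 1 >= n) ? x[n - 2] : x[i + 1];
        x[i] += fix_mul(coeff, left + right);
    }
}

/* In-place forward transform of an even-length signal (n >= 2).
   On return, even indices hold low-pass and odd indices hold
   high-pass coefficients (unscaled). */
void dwt97_lift_1d(int32_t *x, size_t n) {
    lift_step(x, n, 1, ALPHA); /* predict 1 (odd samples)  */
    lift_step(x, n, 0, BETA);  /* update 1  (even samples) */
    lift_step(x, n, 1, GAMMA); /* predict 2 */
    lift_step(x, n, 0, DELTA); /* update 2  */
}
```

For the 2D transform, the same routine would be applied along rows (horizontal lifting) and along columns (vertical lifting).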
2D DWT matrix layout: Mallat strategy
Uses an auxiliary matrix to store the results of the horizontal filtering.
No memory scattering: the horizontal high- and low-frequency components are not interleaved in memory.
This allows better exploitation of SIMD parallelism.
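The non-interleaved layout can be sketched as follows. This is only an illustration under assumptions: the horizontally filtered row is taken to be interleaved (low, high, low, high, ...), its length is even, and the function name `mallat_store_row` is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy one horizontally filtered row into a row of the auxiliary matrix:
   low-pass coefficients go to the left half, high-pass coefficients to
   the right half. n is assumed to be even. */
void mallat_store_row(const int32_t *lifted, int32_t *aux_row, size_t n) {
    size_t half = n / 2;
    for (size_t i = 0; i < half; ++i) {
        aux_row[i]        = lifted[2 * i];     /* low-frequency component  */
        aux_row[half + i] = lifted[2 * i + 1]; /* high-frequency component */
    }
}
```

With each subband stored contiguously, the later vertical pass reads neighbouring columns of the same subband from adjacent addresses, which is what makes loading several of them into one SIMD register straightforward.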
Optimizations: cache
The two matrices are aligned on the cache row size (128 bits = 16 B) to allow data fetching in one cycle.
Input and output matrices are juxtaposed in memory to prevent conflicts (associativity conflicts) in a direct-mapped cache.
[Figures: cache layout without alignment; cache layout with alignment]
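One possible way to obtain two aligned, back-to-back matrices is sketched below. It assumes a C11 runtime that provides `aligned_alloc` (alternatives such as `posix_memalign` or `_mm_malloc` exist on other platforms); the struct `dwt_buffers` and the function `dwt_buffers_alloc` are hypothetical names, not taken from the slides.

```c
#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT 16 /* 128 bits, the row size quoted on the slide */

typedef struct {
    int32_t *input;  /* rows x cols input matrix                       */
    int32_t *output; /* auxiliary matrix, placed directly after input  */
    void    *block;  /* single allocation; release with free()         */
} dwt_buffers;

/* Allocate both matrices in one aligned block so that they sit next to
   each other in memory. Returns 0 on success, -1 on failure. */
int dwt_buffers_alloc(dwt_buffers *b, size_t rows, size_t cols) {
    size_t bytes = rows * cols * sizeof(int32_t);
    /* Round up so the second matrix also starts on an aligned address. */
    size_t padded = (bytes + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;

    b->block = aligned_alloc(ALIGNMENT, 2 * padded);
    if (b->block == NULL)
        return -1;

    b->input  = (int32_t *)b->block;
    b->output = (int32_t *)((unsigned char *)b->block + padded);
    return 0;
}
```

A caller would fill `input`, run the transform into `output`, and finally call `free()` on `block`.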
Optimizations … SIMD code
Uses SSE2: computes 4 pixels in parallel using fixed-point arithmetic.
Profiling the C code showed that the column transform and cache accesses were the main bottlenecks.
In the DWT, intermediate values are reused: instead of recalculating them, we keep the intermediate computations.
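To make "4 pixels in parallel" concrete, here is a hedged sketch of one vertical lifting step written with SSE2 intrinsics. It is not the authors' kernel. It assumes 32-bit fixed-point samples in Q13, an x86 target, and a column count that is a multiple of 4; `lift_rows_sse2` and `mullo_epi32_sse2` are hypothetical names. SSE2 has no 32-bit lane-wise multiply (`_mm_mullo_epi32` is SSE4.1), so the sketch builds one from `_mm_mul_epu32`.

```c
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

#define FIX_SHIFT 13 /* assumption: Q13 fixed point */

/* Low 32 bits of a 32x32 multiply in each of the 4 lanes, SSE2 only. */
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b) {
    __m128i even = _mm_mul_epu32(a, b);                 /* lanes 0 and 2 */
    __m128i odd  = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4)); /* lanes 1 and 3 */
    return _mm_unpacklo_epi32(
        _mm_shuffle_epi32(even, _MM_SHUFFLE(0, 0, 2, 0)),
        _mm_shuffle_epi32(odd,  _MM_SHUFFLE(0, 0, 2, 0)));
}

/* One vertical lifting step over a whole row, 4 columns per iteration:
   target[c] += (coeff * (above[c] + below[c])) >> FIX_SHIFT
   cols is assumed to be a multiple of 4. */
void lift_rows_sse2(int32_t *target, const int32_t *above,
                    const int32_t *below, size_t cols, int32_t coeff) {
    const __m128i k = _mm_set1_epi32(coeff);
    for (size_t c = 0; c < cols; c += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(above + c));
        __m128i b = _mm_loadu_si128((const __m128i *)(below + c));
        __m128i t = _mm_loadu_si128((const __m128i *)(target + c));
        __m128i prod = mullo_epi32_sse2(k, _mm_add_epi32(a, b));
        t = _mm_add_epi32(t, _mm_srai_epi32(prod, FIX_SHIFT));
        _mm_storeu_si128((__m128i *)(target + c), t);
    }
}
```

The sketch uses unaligned loads and stores so that it works with any buffers; when rows are known to start on 16-byte boundaries, `_mm_load_si128` and `_mm_store_si128` could be used instead. Because the column transform walks down rows, the row updated in one step serves as the neighbour row of the next step, which is one way the reuse of intermediate values can show up in practice.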
Results
Image size: 1024 x 1024.
Profiling done using VTune Analyzer.
Cycles per uop improve from 3.38 to 2.28, an improvement of 32.5%.
Thank you