Optimizing Ogg Vorbis performance using architectural considerations
Adir Abraham and Tal Abir
Ogg Vorbis is a fully open, non-proprietary, patent-and-royalty-free, general-purpose compressed audio format for mid to high quality (8kHz-48.0kHz, 16+ bit, polyphonic) audio and music at fixed and variable bitrates from 16 to 128 kbps/channel. This places Vorbis in the same competitive class as audio representations such as MPEG-4 (AAC), and similar to, but higher performance than MPEG-1/2 audio layer 3, MPEG-4 audio (TwinVQ), WMA and PAC.
Strategies used to increase Ogg Vorbis' performance
* We looked for architectural pitfalls and wrote alternative, optimized code to avoid them.
* We used threading in order to exploit the Hyper-Threading capabilities of the processor.
* We used SSE programming in order to perform faster, parallelized calculations.
Cleaning architectural pitfalls

Serialized instructions
After analyzing the results with VTune, we found that every conversion from float to int (masking) calls "_ftol". _ftol executes "fldcw", which serializes the pipeline and causes memory stalls. We avoided _ftol by writing alternative code for the masking. We also found "_ctrlfp", which is used as part of the C function rint. _ctrlfp also executes "fldcw", so we avoided it by writing alternative code for rint as well.
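The report does not reproduce the replacement routines, so here is a minimal sketch, assuming SSE is available, of one common way to convert float to int without calling _ftol; the project's actual FLT2INT and rint replacements may well differ (for example, x87 fistp inline assembly):

```c
#include <xmmintrin.h>   /* SSE intrinsics */

/* Sketch of an _ftol replacement: cvttss2si truncates toward zero,
 * matching C cast semantics, and never touches the x87 control word,
 * so there is no fldcw and no serialization. */
static inline int FLT2INT(float x)
{
    return _mm_cvtt_ss2si(_mm_set_ss(x));
}

/* Sketch of a rint() replacement: cvtss2si uses the current SSE
 * rounding mode (round-to-nearest by default) and likewise avoids
 * the fldcw executed by _ctrlfp. */
static inline float FAST_RINT(float x)
{
    return (float)_mm_cvt_ss2si(_mm_set_ss(x));
}
```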
64K Aliasing
64K aliasing happens when a procedure works on two pieces of data whose addresses are an exact multiple of 64KB apart. Memory addresses with the same lower 16 bits map to the same place in the first-level cache, and since both pieces of memory cannot occupy the same cache line simultaneously, the cache thrashes. We found that some data which is accessed many times in Ogg Vorbis was laid out at such congruent addresses, so Ogg Vorbis suffered badly from 64K aliasing. We mapped the data onto different 64K banks and got better results.
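As a hedged illustration of the fix (the buffer names below are hypothetical, not the actual Vorbis structures), the idea is simply to break the 64KB congruence between two hot buffers by inserting a small pad between them:

```c
#include <stdio.h>
#include <stdint.h>

#define N (64 * 1024 / sizeof(float))   /* 64KB worth of floats */

/* Two heavily co-accessed buffers that would otherwise start exactly
 * 64KB apart are separated by a small pad, so their base addresses no
 * longer share the same lower 16 bits and no longer fight over the
 * same first-level cache lines. */
struct hot_buffers {
    float in_buf[N];
    float pad[32];          /* breaks the 64K congruence */
    float out_buf[N];
};

int main(void)
{
    static struct hot_buffers b;
    uintptr_t a = (uintptr_t)b.in_buf;
    uintptr_t c = (uintptr_t)b.out_buf;
    /* Zero here would mean the two bases alias in the 64K sense. */
    printf("64K-aliased: %s\n", ((a ^ c) & 0xFFFF) ? "no" : "yes");
    return 0;
}
```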
Threading
Hyper-Threading Technology enables multi-threaded software applications to execute threads in parallel. We looked at the two most time-consuming functions and found that they can be parallelized; a sketch of the decomposition follows.
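A minimal sketch of such a functional decomposition, using POSIX threads and placeholder mask functions (the real encoder functions and data structures are not shown here): one masker runs on a worker thread while the other runs on the calling thread.

```c
#include <pthread.h>

/* Placeholder computations standing in for the real psychoacoustic code. */
static void compute_noise_mask(const float *spec, float *mask, int n)
{
    for (int i = 0; i < n; i++) mask[i] = spec[i] * 0.5f;   /* stand-in work */
}

static void compute_tone_mask(const float *spec, float *mask, int n)
{
    for (int i = 0; i < n; i++) mask[i] = spec[i] * 0.25f;  /* stand-in work */
}

typedef struct {
    const float *spectrum;   /* shared, read-only input  */
    float       *mask;       /* per-masker output curve  */
    int          n;
} mask_job;

static void *noise_thread(void *arg)
{
    mask_job *j = arg;
    compute_noise_mask(j->spectrum, j->mask, j->n);
    return NULL;
}

/* The two maskers read the same spectrum but write independent outputs,
 * so they can run concurrently on the two logical processors of a
 * Hyper-Threading CPU. */
void run_maskers(mask_job *noise, mask_job *tone)
{
    pthread_t t;
    pthread_create(&t, NULL, noise_thread, noise);
    compute_tone_mask(tone->spectrum, tone->mask, tone->n);
    pthread_join(t, NULL);
}
```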
SIMD
The Single Instruction, Multiple Data (SIMD) model enables the programmer to develop algorithms that mix packed single-precision floating-point and integer operations, using SSE and MMX instructions respectively. We looked for loop sequences containing linear calculations over arrays within the hottest functions.
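A hedged example of this transformation (the loop and function name are illustrative, not taken from the encoder): a scalar element-wise multiply of two float arrays rewritten with SSE intrinsics so that four single-precision values are processed per instruction, assuming n is a multiple of 4 and the arrays are 16-byte aligned.

```c
#include <xmmintrin.h>

void scale_floats_sse(float *dst, const float *src, const float *gain, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 s = _mm_load_ps(src + i);          /* load 4 packed floats */
        __m128 g = _mm_load_ps(gain + i);
        _mm_store_ps(dst + i, _mm_mul_ps(s, g));  /* 4 multiplies at once */
    }
}
```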
Yield gained from each strategy

Removing architectural pitfalls
By writing alternative code for _ftol, called FLT2INT, we gained 4% in performance. By writing alternative code for rint, we gained another 4%. By removing the 64K aliasing, we gained 6%. That makes a total gain of 14% for the pitfall strategy.
Threading
We parallelized the noise masker and the tone masker, which have no dependency on each other (functional decomposition). This optimization gave no special profit; its total speedup was 2%.

SSE
Tuning is still in progress. No profit has been seen yet.
Main achievements
Architectural pitfalls: by writing the alternative code, we removed most of the architectural pitfalls that we found.
Threading: we parallelized two functions which were not dependent on each other.
SIMD: we translated the loops from instructions that work on architectural registers into instructions that work on SIMD registers.
Performance boost
The total performance gained from using all three strategies was 16%. A sample 100MB file that took 50 seconds to encode before the optimization was encoded in 42 seconds afterwards.