Optimizing Ogg Vorbis performance using architectural considerations Adir Abraham and Tal Abir.

Ogg Vorbis is a fully open, non-proprietary, patent-and-royalty-free, general-purpose compressed audio format for mid to high quality (8kHz-48.0kHz, 16+ bit, polyphonic) audio and music at fixed and variable bitrates from 16 to 128 kbps/channel. This places Vorbis in the same competitive class as audio representations such as MPEG-4 (AAC), and similar to, but higher performance than MPEG-1/2 audio layer 3, MPEG-4 audio (TwinVQ), WMA and PAC.

Strategies used to increase Ogg Vorbis' performance
* We looked for architectural pitfalls and replaced the affected code with optimized alternatives.
* We used threading to take advantage of the processor's Hyper-Threading capability.
* We used SSE programming to make calculations faster through data parallelism.

Cleaning architectural pitfalls: serialized instructions
After analyzing the encoder with VTune, we found that every float-to-int conversion (cast) calls "_ftol". _ftol executes "fldcw", which causes serialization and memory stalls. We avoided _ftol by writing alternative code for the conversion. We also found _ctrlfp, which is called by the C function rint. _ctrlfp also uses "fldcw", so we avoided it as well by writing alternative code for rint.
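The slides do not show the replacement code, so the following is only a minimal sketch of one well-known way to convert a float to an int without touching the FPU control word. The name flt2int_round is hypothetical and is not the authors' FLT2INT. The sketch assumes IEEE-754 doubles, the default round-to-nearest mode, and inputs that fit in 32 bits. Note that it rounds to nearest (ties to even), whereas a C cast truncates toward zero, so results can differ by one for some inputs.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the slides' FLT2INT (the original is not shown).
   Rounds to nearest, ties to even; a plain C cast would truncate instead.
   Assumes IEEE-754 doubles and the default rounding mode. */
int32_t flt2int_round(float x)
{
    /* Precondition: the rounded value must fit in a 32-bit integer. */
    assert(x >= -2147483648.0f && x < 2147483648.0f);

    /* 1.5 * 2^52. After the addition the double has a unit in the last
       place of exactly 1, so the rounded integer sits in the low 32 bits
       of its bit pattern. */
    const double magic = 6755399441055744.0;
    double biased = (double)x + magic;

    uint64_t bits;
    memcpy(&bits, &biased, sizeof bits);   /* well-defined type punning */

    /* Keep the low 32 bits and reinterpret them as two's complement. */
    return (int32_t)(uint32_t)bits;
}
```

Because no control-word change is involved, a conversion like this avoids the fldcw serialization described above, at the cost of different rounding behavior that the caller has to accept.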

64K aliasing
64K aliasing happens when a procedure works on two data segments located on cache lines whose addresses differ by an exact multiple of 64 KB. Memory addresses with the same lower 16 bits are mapped to the same place in the cache. Since both pieces of memory cannot occupy the same cache line simultaneously, the cache thrashes. We found that some data that is accessed many times in Ogg Vorbis was laid out in this conflicting way, so Ogg Vorbis had a serious problem with 64K aliasing. We remapped the data (using different banks) and got better results.
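The aliasing condition stated above (identical lower 16 address bits) can be checked with a few lines of C. This is a sketch under that stated condition only; the function names and the idea of padding by one cache line are illustrative assumptions, not the remapping the authors actually applied.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero when two addresses share their lower 16 bits, that is, when
   they differ by an exact multiple of 64 KB. */
int aliases_64k(uintptr_t a, uintptr_t b)
{
    return ((a ^ b) & 0xFFFFu) == 0;
}

/* Hypothetical helper: number of bytes by which to shift the second
   buffer so that it no longer aliases the first. Returns 0 when the
   two addresses do not alias in the first place. */
size_t pad_to_break_alias(uintptr_t a, uintptr_t b, size_t line_size)
{
    return aliases_64k(a, b) ? line_size : 0;
}
```

For example, buffers at 0x10000 and 0x30000 alias, and shifting the second one by a 64-byte line removes the conflict.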

Threading
Hyper-Threading Technology enables multi-threaded software applications to execute threads in parallel. We looked at the two most time-consuming functions and found that they can be parallelized.
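As a rough illustration of running two independent functions side by side, here is a minimal sketch using POSIX threads. The slides do not say which threading API was used, and the two stage functions below are hypothetical placeholders rather than Ogg Vorbis code. The only property that matters is that the stages read shared input and write to separate outputs.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical job description: read-only input, private result. */
typedef struct {
    const float *in;
    size_t n;
    float result;
} stage_job;

/* Placeholder stage 1: sum of the samples. */
static void *sum_stage(void *arg)
{
    stage_job *job = arg;
    float sum = 0.0f;
    for (size_t i = 0; i < job->n; ++i)
        sum += job->in[i];
    job->result = sum;
    return NULL;
}

/* Placeholder stage 2: largest absolute sample value. */
static void *peak_stage(void *arg)
{
    stage_job *job = arg;
    float peak = 0.0f;
    for (size_t i = 0; i < job->n; ++i) {
        float v = job->in[i] < 0.0f ? -job->in[i] : job->in[i];
        if (v > peak)
            peak = v;
    }
    job->result = peak;
    return NULL;
}

/* Run stage 1 on a worker thread while the caller runs stage 2.
   Returns 0 on success, -1 if the thread could not be created or joined. */
int run_stages_in_parallel(stage_job *a, stage_job *b)
{
    pthread_t worker;
    if (pthread_create(&worker, NULL, sum_stage, a) != 0)
        return -1;
    peak_stage(b);
    return pthread_join(worker, NULL) == 0 ? 0 : -1;
}
```

Reusing the calling thread for the second stage keeps the thread count at two, which matches the two logical processors a Hyper-Threading core exposes.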

SIMD
The Single Instruction Multiple Data (SIMD) method enables the programmer to develop algorithms that mix packed single-precision floating-point and packed integer operations, using SSE and MMX instructions respectively. Within the hottest functions, we looked for loops that perform linear calculations over arrays.
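A loop of the kind described above might be vectorized as in the following sketch. The loop itself (a scale and add over float arrays) is a made-up example, not one taken from Ogg Vorbis. It uses the standard SSE intrinsics from xmmintrin.h when SSE is available and otherwise falls back to the scalar loop.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#endif

/* Hypothetical example loop: out[i] = a[i] * scale + b[i]. */
void scale_add(float *out, const float *a, const float *b,
               float scale, size_t n)
{
    size_t i = 0;

#ifdef SKETCH_HAVE_SSE
    /* Process four floats per iteration in 128-bit SSE registers.
       Unaligned loads and stores are used so that callers do not need
       to guarantee 16-byte alignment. */
    const __m128 vscale = _mm_set1_ps(scale);
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vscale), vb));
    }
#endif

    /* Scalar loop: handles the tail, or everything when SSE is absent. */
    for (; i < n; ++i)
        out[i] = a[i] * scale + b[i];
}
```

If the arrays are known to be 16-byte aligned, the aligned load and store intrinsics could be used instead; that is left out here to keep the sketch safe for arbitrary pointers.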

Yield gained from each strategy
Removing architectural pitfalls
* Replacing _ftol with alternative code, called FLT2INT, gained 4% in performance.
* Replacing rint with alternative code gained 4% in performance.
* Eliminating the 64K aliasing gained 6% in performance.
That makes a total performance gain of 14% for the pitfall strategy.

Threading
We parallelized the noise masker and the tone masker, which have no dependency on each other (functional decomposition). This optimization yielded little benefit: its total speedup was 2%.
SSE
Tuning is still in progress. No gain has been observed yet.

Main achievements
Architectural pitfalls: by writing alternative code, we removed most of the architectural pitfalls that we found.
Threading: we parallelized two functions that did not depend on each other.
SIMD: we converted the loops from instructions that work with architectural registers to instructions that work with SIMD registers.

Performance boost
The total performance gain from all three strategies was 16%. A 100 MB sample file that took 50 seconds to encode before the optimization took 42 seconds afterwards, a reduction of (50 - 42) / 50 = 16% in encoding time.