Software Performance Tuning Project: Monkey’s Audio
Prepared by: Meni Orenbach, Roman Kaplan
Advisors: Liat Atsmon, Kobi Gottlieb

Monkey’s Audio – a lossless audio codec
- Can compress at different levels
- Can be decompressed back to a WAV file
- Used to save storage space while preserving all the original data
- The MAC encoder produces a playable APE file

Platform and Benchmark Used
Platform: Intel Core i7, 3GB of RAM, running the Windows Vista operating system.
Benchmark:
- A 238MB song
- Original encoding duration: 98.9 sec

Algorithm Description
- The input file is read frame by frame
- Every frame contains a constant number of channels
- The channels are encoded with dependencies between them
- Every frame is encoded and immediately written
(A minimal sketch of this loop follows below.)
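To make this flow concrete, here is a minimal single-threaded C++ sketch of the loop just described; the Frame type and the helper functions are hypothetical placeholders, not the actual Monkey's Audio API:

    #include <cstdio>

    struct Frame { /* per-frame audio data for all channels (details omitted) */ };

    // Hypothetical helpers standing in for the real encoder routines.
    bool ReadNextFrame(std::FILE* in, Frame& frame);
    void EncodeChannel(Frame& frame, int channel);        // depends on the other channels
    void WriteFrame(std::FILE* out, const Frame& frame);

    void EncodeFile(std::FILE* in, std::FILE* out, int nChannels)
    {
        Frame frame;
        while (ReadNextFrame(in, frame))                  // the input is read frame by frame
        {
            for (int ch = 0; ch < nChannels; ++ch)        // constant number of channels,
                EncodeChannel(frame, ch);                 //   encoded with dependencies between them
            WriteFrame(out, frame);                       // every frame is written immediately
        }
    }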

The Encoding Process
(Diagram of the encoding pipeline; three of its stages are annotated "Multithread here!")

Function Data Flow
(Call-graph diagram — encoding every frame, encoding the error for every channel, encoding with a predictor — with the most time-consuming functions highlighted.)

Optimization Method
Dealing with the most time-consuming functions, two approaches were taken:
- Multithreading
- SIMD

Optimization Method 1: Threads
- Monkey’s Audio was originally managed by a single thread
- The threaded version should maintain 1:1 bit compatibility
- Changing the flow of the program is required

Changing the Program Flow
Originally: each frame is encoded and written immediately.
After the change:
- Each frame is encoded and written to a buffer
- The buffer is filled throughout the encode process
- The buffer is written once all previous frames have been encoded and written

Our Implementation
We use the following threads:
- Main thread – transfers frame data to the encode threads
- Write thread – writes the encoded buffers to the output file
- Encode threads – each encodes the frame it is given
Note: we use N+2 threads, where N is the number of hardware threads available. (A simplified sketch of this layout follows below.)
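A highly simplified C++ sketch of this thread layout, using std::thread machinery of our own choosing (an illustration, not the project's actual code): the encode workers publish finished frames into a shared map, and the write thread drains it strictly in frame order.

    #include <condition_variable>
    #include <cstddef>
    #include <cstdio>
    #include <map>
    #include <mutex>
    #include <thread>
    #include <utility>
    #include <vector>

    struct Encoded { std::vector<unsigned char> bytes; };   // one encoded frame

    std::mutex g_lock;
    std::condition_variable g_ready;
    std::map<std::size_t, Encoded> g_done;                  // finished frames, keyed by frame index

    // Encode thread: encodes the single frame it was given, then publishes it.
    void EncodeThread(std::size_t index, std::vector<unsigned char> rawFrame)
    {
        Encoded e;
        e.bytes = std::move(rawFrame);                      // placeholder for the real frame encoder
        std::lock_guard<std::mutex> lk(g_lock);
        g_done[index] = std::move(e);
        g_ready.notify_all();
    }

    // Write thread: emits encoded frames to the output strictly in their original order.
    void WriteThread(std::FILE* out, std::size_t frameCount)
    {
        for (std::size_t next = 0; next < frameCount; ++next)
        {
            std::unique_lock<std::mutex> lk(g_lock);
            g_ready.wait(lk, [&] { return g_done.count(next) != 0; });
            Encoded e = std::move(g_done[next]);
            g_done.erase(next);
            lk.unlock();
            std::fwrite(e.bytes.data(), 1, e.bytes.size(), out);
        }
    }

The main thread would read the input, spawn std::thread(EncodeThread, i, data) for each frame, start one std::thread(WriteThread, out, frameCount), and join them all at the end.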

Data Structures Used
- ThreadParam – a linked list of objects that contain the encoded data
- EncodeParam – an object containing the data needed to encode a frame
- WriteParam – an object containing the data needed to write to the output
- FramePredictor – a global array that signals dependencies between frames
(An illustrative sketch of these structures follows below.)
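For illustration only, the four structures might look roughly like this in C++; every field name below is our assumption, not the project's actual definition:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct ThreadParam {                     // node in the linked list of encoded results
        std::size_t frameIndex;              // position of the frame in the input
        std::vector<unsigned char> encoded;  // the encoded data for this frame
        ThreadParam* next;
    };

    struct EncodeParam {                     // everything an encode thread needs for one frame
        std::size_t frameIndex;
        const short* samples;                // raw 16-bit audio for this frame
        int nChannels;
        ThreadParam* result;                 // where the encoded bytes are stored
    };

    struct WriteParam {                      // everything the write thread needs
        std::FILE* outputFile;               // destination .ape file
        ThreadParam* head;                   // list of frames ready to be written
    };

    // Global array signalling dependencies between consecutive frames
    // (e.g. "frame i left data over for frame i+1").
    std::vector<bool> g_framePredictor;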

Threads Schema
(Diagram of how the main, encode, and write threads interact.)

Dependencies Between Frames
Once a frame has finished encoding, there may be leftover data, which can be dealt with in two ways:
1. Writing the leftover data after the encoded frame
2. Re-encoding the leftover data with the next frame
We always write the leftover data after the encoded frame.

Dealing With Dependencies Between Frames
- Use the write thread to start a new encode thread
- Remove the ‘wrongly encoded’ frame from the list
- Keep encoding the rest normally
- Keep writing to the output file in the right order!

The Problem
- There is also data leftover between frames
- This dependency is unpredictable
- It is impossible to maintain 1:1 bit compatibility here
- We ‘guess’ the best value so we don’t lose data!

Results: VTune Thread Profiler (screenshot)

Results: VTune Thread Checker (screenshot)

Multithreading Conclusion
Total speedup from using MT: x3.15!

Explaining the Speedup
By Amdahl’s law, we have two serial parts (reading the first frames and encoding the last frame) that take about 8% of our benchmark, which bounds the achievable speedup. In addition, while implementing our solution we added roughly 20% more instructions in order to handle the dependencies, so the expected speedup is lower still. (The corresponding formulas are sketched below.)
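Written out, the two formulas the slide alludes to are Amdahl's law with a serial fraction of about 8%, and the same bound scaled by the ~20% instruction overhead; the value of N (the number of parallel encode threads) is our assumption, since the slide does not state it:

$$\text{Speedup}_{\text{Amdahl}} = \frac{1}{s + \frac{1 - s}{N}}, \qquad s \approx 0.08$$

$$\text{Speedup}_{\text{expected}} \approx \frac{1}{1.2} \cdot \frac{1}{0.08 + \frac{0.92}{N}}$$

For example, taking N = 8 logical threads on the Core i7 gives roughly x5.1 and x4.3 respectively, against the measured x3.15.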

Optimization Method 2: SIMD
- The original code is written using MMX technology
- It operates only on arrays of 16-bit integers
- Two main functions we applied SSE to: Adapt() and CalculateDotProduct()
Note: these functions are written entirely in assembly.

Adapt() – Improvements
- ADD and SUB instructions on arrays of 16-bit integers (supported in MMX)
- Each iteration goes over 32 sequential array elements
- The input and output arrays were aligned to prevent ‘split loads’

Adapt() – Main Loop

Old code (MMX):
    movq  mm0, [eax]
    paddw mm0, [ecx]
    movq  [eax], mm0
    movq  mm1, [eax + 8]
    ...
    movq  mm3, [eax + 24]
    paddw mm3, [ecx + 24]
    movq  [eax + 24], mm3

New code (aligned):
    movdqa xmm0, [eax]
    movdqa xmm2, [ecx]
    paddw  xmm0, xmm2
    movdqa [eax], xmm0
    movdqa xmm1, [eax + 16]
    movdqa xmm3, [ecx + 16]
    paddw  xmm1, xmm3
    movdqa [eax + 16], xmm1

Notes:
- There is an equivalent loop with SUB operations
- An MMX register is 8 bytes; an SSE register is 16 bytes
- 16 vs. 12 instructions per iteration
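For readers more comfortable with intrinsics, here is an equivalent C++/SSE2 sketch of the aligned ADD path; the function name and signature are ours, and it assumes both arrays are 16-byte aligned with an element count that is a multiple of 8:

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>

    // Sketch of the Adapt() add loop: pM[i] += pAdapt[i] on 16-bit lanes.
    void AdaptAdd_SSE2(int16_t* pM, const int16_t* pAdapt, int nElements)
    {
        for (int i = 0; i < nElements; i += 8)
        {
            __m128i m = _mm_load_si128(reinterpret_cast<__m128i*>(pM + i));           // movdqa load
            __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(pAdapt + i)); // movdqa load
            m = _mm_add_epi16(m, a);                                                  // paddw
            _mm_store_si128(reinterpret_cast<__m128i*>(pM + i), m);                   // movdqa store
        }
    }

A mirror loop using _mm_sub_epi16 (psubw) covers the SUB case mentioned in the notes.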

SIMD – CalculateDotProduct()
- Multiply-add over arrays of 16-bit integers
- Each iteration goes over 32 array elements
- The speedup will be calculated for both functions together

CalculateDotProduct()

Old code (MMX):
    movq    mm0, [eax]
    pmaddwd mm0, [ecx]
    paddd   mm7, mm0
    movq    mm1, [eax + 8]
    ...
    movq    mm3, [eax + 24]
    pmaddwd mm3, [ecx + 24]
    paddd   mm7, mm3

New code (aligned):
    movdqa  xmm0, [eax]
    movdqa  xmm4, [ecx]
    pmaddwd xmm0, xmm4
    paddd   xmm7, xmm0
    movdqa  xmm1, [eax + 16]
    movdqa  xmm4, [ecx + 16]
    pmaddwd xmm1, xmm4
    paddd   xmm7, xmm1

Notes:
- Each iteration multiply-adds 32 array elements
- 16 vs. 12 instructions per iteration
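The same loop expressed with C++/SSE2 intrinsics, again only as a sketch (our own function name; assumes 16-byte-aligned inputs, a multiple of 8 elements, and that the 32-bit accumulators do not overflow):

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>

    // Sketch of CalculateDotProduct(): sum of pA[i] * pB[i] over 16-bit inputs.
    int32_t DotProduct_SSE2(const int16_t* pA, const int16_t* pB, int nElements)
    {
        __m128i acc = _mm_setzero_si128();                                         // running 32-bit sums
        for (int i = 0; i < nElements; i += 8)
        {
            __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(pA + i));  // movdqa
            __m128i b = _mm_load_si128(reinterpret_cast<const __m128i*>(pB + i));  // movdqa
            acc = _mm_add_epi32(acc, _mm_madd_epi16(a, b));                        // pmaddwd + paddd
        }
        // Horizontal sum of the four 32-bit partial results.
        acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
        acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
        return _mm_cvtsi128_si32(acc);
    }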

SIMD Speedup Achieved
- Adapt() local speedup: x1.72; overall speedup: x1.2
- CalculateDotProduct() local speedup: x1.62; overall speedup: x1.2
- Total speedup using SIMD: x1.4!
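As a rough check of our own (not from the slides), Amdahl's law relates a function's local and overall speedups through the fraction p of runtime it accounts for:

$$S_{\text{overall}} = \frac{1}{(1 - p) + p / S_{\text{local}}} \;\Rightarrow\; p = \frac{1 - 1/S_{\text{overall}}}{1 - 1/S_{\text{local}}}$$

Plugging in the measured numbers gives p ≈ 0.40 for Adapt() and p ≈ 0.44 for CalculateDotProduct(), consistent with these being the most time-consuming functions.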

Intel Tuning Assistant
No micro-architectural problems were found in the optimized code.

Final Results
A total speedup of x4.017 was achieved using only multithreading and SIMD.
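On the benchmark from the platform slide, this corresponds (our own arithmetic) to the encoding time dropping from 98.9 seconds to about

$$\frac{98.9\ \text{s}}{4.017} \approx 24.6\ \text{s}.$$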