Software Performance Tuning Project – Final Presentation. Prepared by: Eyal Segal, Koren Shoval. Advisors: Liat Atsmon, Koby Gottlieb.

WavPack – Description WavPack is an open source audio compression format. –Allows lossless audio compression; compresses WAV files to WV files. –Average compression ratio is 30–70%. –Supported on Windows and mobile devices: Cowon A3 PMP, iRiver, iPod, Nokia phones, and more.

Project Goals Enhance WavPack performance by: –Analyzing the application with the Intel® VTune™ Performance Analyzer. –Studying and applying instructions introduced in Intel®'s new processors. –Implementing multithreading techniques to achieve higher performance. Return the improved source code to the community.

Algorithm Description The input file is processed in blocks of 512 KB. –A global context is shared across all blocks. –Blocks are divided into sub-blocks of 24,000 samples, equivalent to 0.5 second of WAV audio at CD quality. –Each block is encoded and written to the output. –The context data is updated for the next block.
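The block loop above can be sketched as follows. This is a minimal illustration of why the flow is sequential, not WavPack's actual code: Context, encode_block, and the placeholder arithmetic are assumptions.

```cpp
#include <cstdint>
#include <vector>

// Stand-in for WavPack's global context, threaded through every block.
struct Context { int64_t state = 0; };

// Hypothetical encode step: consumes the context and updates it,
// so block N cannot start before block N-1 has finished.
static int64_t encode_block(Context& ctx, const std::vector<int32_t>& block) {
    int64_t sum = ctx.state;
    for (int32_t s : block) sum += s;   // placeholder for the real encoding
    ctx.state = sum;                    // context update for the next block
    return sum;
}

// Strictly sequential dependency chain over all blocks.
int64_t encode_stream(const std::vector<std::vector<int32_t>>& blocks) {
    Context ctx;
    int64_t last = 0;
    for (const auto& b : blocks)
        last = encode_block(ctx, b);
    return last;
}
```

Because each call reads the state written by the previous call, the iterations cannot simply be distributed across threads.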

Compression flow (diagram): Read a 512 KB buffer from the input file → go over the buffer, taking a block of 24,000 samples → check the lossless & stereo configuration (stereo/mono, bps in {8, 16, 24, 32}, pass count, etc.) → transform the left and right channels to mid and diff (among more options) → perform the WavPack decorrelation algorithm on the buffer, repeated per pass count (1st part of the WavPack algorithm) → calculate additional information for compression and perform the compression bit by bit, counting ones and zeros until a change occurs (2nd part of the WavPack algorithm) → write the resulting buffer to the output; this is the compression stage. A global context is passed down to each function. Each subset of bytes depends on an indeterminate subset of the previous bytes – this is why parallelizing the entire flow fails.

Testing Environment Hardware –Intel Core i7 quad-core CPU, 2.66 GHz. –4 GB of RAM. Software –Windows XP/Vista. –Visual Studio. –Intel VTune toolkit. –Compiled with the Microsoft compiler. Tests are done on a 330 MB WAV file.

Original Implementation Single-threaded application: –Read from disk. –Encode. –Write directly to disk. Old MMX instructions are used. Processing a 330 MB WAV file takes about 30 seconds.

Optimizations Parallel IO/CPU

General –Separate the read, write, and processing operations into several threads. Flow –The main thread reads the input file, creates "jobs", and submits them to a work queue. –An additional thread processes the "jobs"; output is redirected to memory instead of disk. –Another thread writes the processed output to disk.
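The three-stage flow above can be sketched as a reader/encoder/writer pipeline. This is a simplified illustration, not WavPack's actual code: Job, WorkQueue, and the placeholder encode and write steps are assumptions.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Illustrative job: one input block and its encoded output.
struct Job {
    std::vector<int32_t> samples;   // stand-in for a 512 KB read
    std::vector<int32_t> encoded;   // stand-in for compressed bytes
};

// A tiny blocking work queue shared between pipeline stages.
class WorkQueue {
    std::queue<Job> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(Job j) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(j)); }
        cv_.notify_one();
    }
    bool pop(Job& out) {            // false once closed and drained
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&]{ return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = std::move(q_.front()); q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
};

// Runs the pipeline over `blocks` input blocks and returns the number of
// encoded blocks "written", so the structure can be sanity-checked.
int run_pipeline(int blocks) {
    WorkQueue to_encode, to_write;
    int written = 0;

    std::thread reader([&]{             // main-thread role: read input, submit jobs
        for (int b = 0; b < blocks; ++b) {
            Job j; j.samples.assign(1024, b);
            to_encode.push(std::move(j));
        }
        to_encode.close();
    });
    std::thread encoder([&]{            // processing thread: encode each job
        Job j;
        while (to_encode.pop(j)) {
            j.encoded = j.samples;      // placeholder for the real encode step
            to_write.push(std::move(j));
        }
        to_write.close();
    });
    std::thread writer([&]{             // writer thread: flush encoded output
        Job j;
        while (to_write.pop(j)) ++written;  // stand-in for a disk write
    });

    reader.join(); encoder.join(); writer.join();
    return written;
}
```

The queues decouple the stages, so slow disk reads or writes overlap with encoding instead of blocking it.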

Optimizations Parallel IO/CPU – cont. Benchmark –VTune analysis showed the following results. –Average running time is about 29 seconds. –Speedup is measured relative to the original results. Conclusions –No significant improvement. –I/O operations take considerably less time than block processing: reads finish long before the processing does, and the writing thread is almost never busy.

Optimizations Multi Threaded Processing

General –Obstacle: each block depends on the previously processed block, so parallelizing the entire flow is impossible. –Instead, multithread parts of the algorithm: locate the parts of the code where the program spends most of its time and parallelize several functions there. Implementation –Uses a thread pool. –Work is split into the left and right channels. Within each channel, every sample depends on the previous sample, so no more than two threads can be used. –Each thread uses a different memory area; results must be combined after the work is done.
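The two-thread channel split can be sketched like this. The per-channel pass is a stand-in (a simple backward difference), not WavPack's real decorrelation; the function names are illustrative.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Stand-in per-channel pass: each sample depends on the previous sample
// in the SAME channel, so the two channels can run on separate threads.
static void decorrelate_channel(std::vector<int32_t>& ch) {
    for (size_t i = ch.size(); i-- > 1; )
        ch[i] -= ch[i - 1];
}

// Splits interleaved stereo samples, processes each channel on its own
// thread with its own memory, then interleaves the results back.
std::vector<int32_t> process_stereo(const std::vector<int32_t>& interleaved) {
    std::vector<int32_t> left, right;
    for (size_t i = 0; i + 1 < interleaved.size() + 1; i += 2) {
        left.push_back(interleaved[i]);
        right.push_back(interleaved[i + 1]);
    }
    std::thread t1([&]{ decorrelate_channel(left); });   // worker thread 1
    std::thread t2([&]{ decorrelate_channel(right); });  // worker thread 2
    t1.join(); t2.join();                                // wait for completion

    std::vector<int32_t> out(interleaved.size());
    for (size_t i = 0; i < left.size(); ++i) {           // recombine the results
        out[2 * i]     = left[i];
        out[2 * i + 1] = right[i];
    }
    return out;
}
```

Giving each thread its own vectors mirrors the slide's point about duplicating shared structures: the threads never touch the same cache lines while working.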

Multithreaded processing flow (diagram): The processing thread checks the lossless & stereo configuration (among more options), fills two new "Thread Args" structures – one with the left-channel data and one with the right – creates duplicates of each shared data structure to avoid cache conflicts, submits each work item to the thread pool, and waits on the "OnComplete" mutex. Worker threads 1 and 2 each wait for work to arrive in the thread pool, then perform the WavPack decorrelation algorithm on their channel's buffer (repeated per pass count), calculate additional information for compression, perform the compression bit by bit (counting ones and zeros until a change occurs), write the resulting buffer to the output (the compression stage), and return to the thread pool. Finally, the left- and right-channel data are interleaved into one output buffer.

Optimizations Multi Threaded Processing – cont. Benchmark –VTune analysis showed the following results. –Average running time is about 25 seconds. –Speedup is measured relative to the original results. Conclusions –About 17% of the running time is parallelized. –The total improvement is a little smaller than that, due to threading overhead.

Optimizations Moving to SIMD

General –Locate mathematical calculations and loops where the program spends most of its time. –Use 128-bit-wide instructions to convert four 32-bit operations into one 128-bit operation. Theoretically, performance can be up to 4x faster; in practice, there is overhead (load, store). Implementation –Refactor the code as a basis for adding SIMD operations. –Unroll loops, making sure to handle the "leftovers" of each loop. –Re-implement the hot loops using SIMD code.
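The four-at-a-time pattern plus the scalar "leftover" loop looks roughly like this. This is an illustrative SSE2 kernel, not a loop taken from WavPack, and it assumes an x86 target.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Adds b[] into a[], four 32-bit elements per instruction; the scalar
// tail handles the loop leftovers when n is not a multiple of 4.
void add_arrays_sse2(int32_t* a, const int32_t* b, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        // Unaligned loads/stores: this is the load/store overhead the
        // slide mentions, paid around each 128-bit add.
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(a + i),
                         _mm_add_epi32(va, vb));
    }
    for (; i < n; ++i)   // leftovers
        a[i] += b[i];
}
```

One `_mm_add_epi32` replaces four scalar adds, which is where the theoretical 4x comes from; the loads and stores are why it is rarely reached in practice.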

Optimizations Moving to SIMD – cont. Benchmark –VTune analysis showed the following results. –Average running time is about 28 seconds. –Speedup is measured relative to the original results. Conclusions –Most of the mathematical calculations can be done with SSE2 and SSE3. –SSE4 instructions were not useful for this application. –The improvement alone isn't significant; it is more significant when combined with the multithreading optimization.

Optimizations Implementation Improvements

General –We found several hot spots that we couldn't improve with the methods above, mainly due to branch misprediction, so we re-implemented them more efficiently. Implementation –Focused on one main function with many branch mispredictions, where a 16-bit integer was used as the output buffer. –Removed most of the branch instructions. –Re-implemented the same logic with a 64-bit integer buffer, the largest register size; SIMD here would require too much overhead.
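The 64-bit buffer idea can be illustrated with a hypothetical bit writer (not the actual WavPack routine): codes accumulate in a 64-bit word, and only one well-predicted branch fires per flushed word instead of one per emitted bit.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical bit writer: packs variable-length codes into a 64-bit
// accumulator and flushes whole words, instead of testing a narrow
// 16-bit buffer on every bit.
class BitWriter64 {
    uint64_t acc_ = 0;              // pending bits, LSB-first
    int used_ = 0;                  // number of valid bits in acc_
    std::vector<uint64_t> out_;
public:
    // Precondition for this sketch: 1 <= nbits <= 57 and code has no
    // set bits at or above position nbits.
    void put(uint64_t code, int nbits) {
        acc_ |= code << used_;
        used_ += nbits;
        if (used_ >= 64) {          // one predictable branch per ~64 bits
            out_.push_back(acc_);
            used_ -= 64;
            acc_ = code >> (nbits - used_);  // carry over the high bits
        }
    }
    // Flushes any partial word and returns the packed output.
    std::vector<uint64_t> finish() {
        if (used_ > 0) out_.push_back(acc_);
        acc_ = 0; used_ = 0;
        return out_;
    }
};
```

With a 16-bit buffer the flush branch fires four times as often; widening the accumulator to the register size is what makes the remaining branch cheap and predictable.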

Optimizations Implementation Improvements – cont. Benchmark –VTune analysis showed the following results. –Average running time is about 28 seconds. –Speedup is measured relative to the original results. Conclusions –Branch instructions and branch mispredictions were reduced. –Performance improved by almost 2 seconds. –The implementation is centered in one method: easy to refactor, and requires no major architectural changes.

Summary The most significant optimization was multithreading code sections – a 16% speedup. The least significant was the multithreaded I/O – a 2.6% speedup.

Summary – Cont. Benchmark –VTune analysis showed the following results. –Average running time is about 22 seconds. –In total, the program runs faster by 33.5%.

Summary – Cont. Conclusions –Multithreading should be considered in the architectural stages of an application; in this application, the performance improvement was not worth the development and maintenance effort. –SIMD optimizations should only be used in specific cases, as they make the code harder to use and understand. –Decreasing branch mispredictions and cache misses is a better way to improve performance: it means refactoring only specific methods, is easier to implement, and usually simplifies the code. –Using VTune and similar analysis tools is good practice. –Leveraging new CPU instructions should be the compiler's responsibility; developers shouldn't need to do this job, and doing it by hand clutters the code.

Sources –WavPack official website –Intel® VTune™ Performance Analyzer –SourceForge website –Software lab website –MSDN –Wikipedia –Intel website