Speeding up VirtualDub Presented by: Shmuel Habari Advisor: Zvika Guz Software Systems Lab Technion.

What is VirtualDub? VirtualDub is an incredibly popular open source video processing tool. It can merge videos, cut scenes, add subtitles and apply a wide variety of filters. It also supports third-party video compression (e.g. DivX). VirtualDub is constantly being refined, expanded and adapted by its original creator, Avery Lee.

VirtualDub’s Benchmark The benchmark uses the Resize filter, an often-used, CPU-heavy filter. The input is a vibrantly colored animation video, chosen so that any flaw would be visible. The output video was produced without audio filtering and with no third-party compression utilities.

VTune Performance Analyzer Analyzing the benchmark using VTune: First step - VDFastMemcpyPartialMMX2

Fast Memory Copy This function copies large quantities of data from a source memory address to a destination memory address.

    movq mm0, [edx]
    movq mm1, [edx+8]
    movq mm2, [edx+16]
    movq mm3, [edx+24]
    movq mm4, [edx+32]
    movq mm5, [edx+40]
    movq mm6, [edx+48]
    movq mm7, [edx+56]
    movntq [ebx], mm0
    movntq [ebx+8], mm1
    movntq [ebx+16], mm2
    movntq [ebx+24], mm3
    movntq [ebx+32], mm4
    movntq [ebx+40], mm5
    movntq [ebx+48], mm6
    movntq [ebx+56], mm7

Each iteration loads 64 bytes into the eight MMX registers and then stores them to the destination address. The loop then moves on to the next 64 bytes and continues until all the data has been copied. In the observed runs, the function was called to copy 2048 bytes each time.
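The structure of that loop can be sketched in portable C. This is only an illustration, not VirtualDub's code: `block_copy64` is a hypothetical name, and plain stores stand in for `movntq`, whose cache-bypassing behaviour has no portable C equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not VirtualDub's code: copies n bytes in 64-byte
   blocks, mirroring the eight loads followed by eight stores in the
   listing. n must be a multiple of 64 (the observed calls were 2048 bytes). */
void block_copy64(void *dst, const void *src, size_t n)
{
    assert(n % 64 == 0);

    unsigned char *d = dst;
    const unsigned char *s = src;

    for (size_t off = 0; off < n; off += 64) {
        uint64_t regs[8];                    /* stands in for mm0..mm7 */
        memcpy(regs, s + off, sizeof regs);  /* the eight 8-byte loads */
        memcpy(d + off, regs, sizeof regs);  /* the eight 8-byte stores (ordinary, not non-temporal) */
    }
}
```

With a 2048-byte buffer the loop body runs 32 times, once per 64-byte block.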

Clockticks Samples VTune again showed that, predictably, most clockticks were spent reading from memory.

Dummy Loop Given that, the solution was to fill the cache before beginning to copy the data. I added a dummy loop that reads 1024 bytes ahead before running blastloop. It touches one location in every 128 bytes, eight reads in all, which covers the 1024-byte block. When the cached data is used up and the end of the source data has not been reached, another 1024 bytes are read. Using the dummy loop gave a speedup of 4.21%.

    mov edi, [edx+896]
    mov edi, [edx+768]
    mov edi, [edx+640]
    mov edi, [edx+512]
    mov edi, [edx+384]
    mov edi, [edx+256]
    mov edi, [edx+128]
    mov edi, [edx]
    mov esi,
    movq mm0, [edx]
    movq mm1, [edx+8]
    movq mm2, [edx+16]
    movq mm3, [edx+24]
    movq mm4, [edx+32]
    movq mm5, [edx+40]
    movq mm6, [edx+48]
    movq mm7, [edx+56]
    movntq [ebx], mm0
    movntq [ebx+8], mm1
    movntq [ebx+16], mm2
    movntq [ebx+24], mm3
    movntq [ebx+32], mm4
    movntq [ebx+40], mm5
    movntq [ebx+48], mm6
    movntq [ebx+56], mm7
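The read-ahead idea can be sketched in C as follows. The names `touch_ahead` and `copy_with_touch` are hypothetical, `memcpy` stands in for the MMX copy loop, and the sketch touches addresses in ascending order whereas the listing walks downward from offset 896. Whether it helps on a given machine depends on the cache line size and on the hardware prefetcher, so treat it as an illustration of the technique rather than a tuned routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { TOUCH_STRIDE = 128, AHEAD_BYTES = 1024 };  /* values taken from the listing */

/* Hypothetical helper: reads one byte every TOUCH_STRIDE bytes so the
   block is pulled into the cache. The volatile pointer keeps the
   compiler from discarding reads whose values are never used. */
void touch_ahead(const unsigned char *src, size_t n)
{
    const volatile unsigned char *p = src;
    for (size_t off = 0; off < n; off += TOUCH_STRIDE)
        (void)p[off];
}

/* Copies n bytes, touching each chunk of up to AHEAD_BYTES before copying it. */
void copy_with_touch(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert((dst != NULL && src != NULL) || n == 0);

    for (size_t done = 0; done < n; done += AHEAD_BYTES) {
        size_t left = n - done;
        size_t chunk = left < AHEAD_BYTES ? left : AHEAD_BYTES;
        touch_ahead(src + done, chunk);         /* the dummy loop */
        memcpy(dst + done, src + done, chunk);  /* stands in for the copy loop */
    }
}
```

For a 2048-byte call this performs two rounds of eight touches, each followed by a 1024-byte copy.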

Threads As stated before, VirtualDub is a project under active development. Its creator had access to code optimization tools, VTune included, which allowed him to improve the code himself and remove many of the pitfalls and errors common in non-optimized code. VirtualDub also proved to be multithreaded, up to a point.

Threads The 1st thread is the processing thread. The 2nd thread is the audio thread; since audio was specifically disabled, it showed almost no activity. In theory, therefore, multithreading the processing thread was still possible.

Threads At first I had high hopes for multithreading VirtualDub. Studying the code, I concluded that it processes the video frame by frame, and scans each frame line by line. The two approaches I decided to try were:
– Processing two frames in parallel
– Cutting a frame in half, and processing the top and bottom halves in parallel

Threads However, all my attempts at multithreading VirtualDub’s processing failed. At first I believed the obstacle was shared global variables, but they turned out to be private members of a much higher-level class. Attempts to duplicate that class in order to split the workload also failed.

Threads Lastly, I turned to OpenMP, hoping to use its built-in ability to duplicate variables into each thread. VirtualDub’s complexity made it impossible for me to convert it to the Intel compiler: every change produced a staggering number of errors, each requiring many small code changes, and still more that could not be solved. Restricting the Intel compiler to only the necessary projects did not improve the situation.
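For reference, this is the kind of loop OpenMP handles well, shown on a toy filter rather than on VirtualDub's code. The function `halve_rows` is hypothetical. The `private` clause gives each thread its own copy of a variable declared outside the loop, which is the duplication capability referred to above. If the compiler is run without OpenMP enabled, the pragma is ignored and the loop simply runs serially.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical row-wise filter, not VirtualDub code: halves the value
   of every byte in a width x height single-channel image. */
void halve_rows(uint8_t *pixels, int width, int height)
{
    assert(width >= 0 && height >= 0);

    int x;  /* declared outside the loop, so each thread needs its own copy */

    /* Rows are independent, so they can be divided among threads. */
    #pragma omp parallel for private(x)
    for (int y = 0; y < height; y++) {
        uint8_t *row = pixels + (size_t)y * (size_t)width;
        for (x = 0; x < width; x++)
            row[x] = (uint8_t)(row[x] / 2);
    }
}
```

The approach only works this easily when each row touches no shared mutable state; state held in members of a higher-level class, as described earlier, is exactly what makes such a loop hard to isolate.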

Conclusion A lot of time and effort was put into this project. To my dismay, this shows not as a percentage of speedup but as error messages and various versions of the code, each a bit closer to a working version but never quite there. The bottom line is that, despite the promise VirtualDub initially showed, too much optimization had already been done in it, leaving it optimized and too monstrously big and intricate for my optimization efforts.