Speeding up VirtualDub Presented by: Shmuel Habari Advisor: Zvika Guz Software Systems Lab Technion
What is VirtualDub? VirtualDub is an incredibly popular open source video processing tool. It can merge videos, cut scenes, add subtitles and apply a wide variety of filters. It also supports third-party video compression (e.g. DivX). VirtualDub is constantly being refined, expanded and adapted by its original creator, Avery Lee.
VirtualDub’s Benchmark The benchmark applies the Resize filter, an often-used, CPU-heavy filter. The input is a vibrantly colored animation video, so that every flaw in the output, if any, will be visible. The output video was produced without audio filtering and without third-party compression utilities.
VTune Performance Analyzer Analyzing the benchmark using VTune: First step - VDFastMemcpyPartialMMX2
Fast Memory Copy This function copies large quantities of data from a source memory address to a destination memory address.

    movq mm0, [edx]
    movq mm1, [edx+8]
    movq mm2, [edx+16]
    movq mm3, [edx+24]
    movq mm4, [edx+32]
    movq mm5, [edx+40]
    movq mm6, [edx+48]
    movq mm7, [edx+56]
    movntq [ebx], mm0
    movntq [ebx+8], mm1
    movntq [ebx+16], mm2
    movntq [ebx+24], mm3
    movntq [ebx+32], mm4
    movntq [ebx+40], mm5
    movntq [ebx+48], mm6
    movntq [ebx+56], mm7

Each iteration loads 64 bytes into the registers and then stores them to the destination address. Moving on to the next 64 bytes, the loop continues until all the data has been copied. From observations, the function was called to copy 2048 bytes each time.
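The load-then-store structure of the loop can be sketched in portable C++. This is only an illustration under stated assumptions: the name `copy_blocks_64` is hypothetical, the eight 8-byte temporaries stand in for `mm0`..`mm7`, and ordinary C++ stores do not bypass the cache the way `movntq` does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical portable analogue of the 64-byte copy step shown above.
// It is NOT VirtualDub's code: plain stores are used, so the non-temporal
// (cache-bypassing) behaviour of movntq is not reproduced.
void copy_blocks_64(unsigned char* dst, const unsigned char* src, std::size_t bytes) {
    assert(bytes % 64 == 0);  // assumption: size is a whole number of 64-byte blocks
    for (std::size_t off = 0; off < bytes; off += 64) {
        std::uint64_t r[8];  // plays the role of mm0..mm7
        for (int i = 0; i < 8; ++i)
            std::memcpy(&r[i], src + off + 8 * i, 8);  // load phase (movq)
        for (int i = 0; i < 8; ++i)
            std::memcpy(dst + off + 8 * i, &r[i], 8);  // store phase (movntq stand-in)
    }
}
```

With the 2048-byte calls observed in the benchmark, this loop would run 32 iterations per call.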
Clockticks Samples Again using VTune, it was seen that, predictably, most of the clockticks were spent reading from memory.
Dummy Loop Seeing that, the solution was to fill the cache before beginning to copy the data. I added a dummy loop that reads 1024 bytes ahead before running blastloop. When the prefetched data has been consumed and the end of the source data has not been reached, another 1024 bytes are read. Using the dummy loop, a speedup of 4.21% was achieved.

    mov edi, [edx+896]
    mov edi, [edx+768]
    mov edi, [edx+640]
    mov edi, [edx+512]
    mov edi, [edx+384]
    mov edi, [edx+256]
    mov edi, [edx+128]
    mov edi, [edx]
    mov esi,
    movq mm0, [edx]
    movq mm1, [edx+8]
    movq mm2, [edx+16]
    movq mm3, [edx+24]
    movq mm4, [edx+32]
    movq mm5, [edx+40]
    movq mm6, [edx+48]
    movq mm7, [edx+56]
    movntq [ebx], mm0
    movntq [ebx+8], mm1
    movntq [ebx+16], mm2
    movntq [ebx+24], mm3
    movntq [ebx+32], mm4
    movntq [ebx+40], mm5
    movntq [ebx+48], mm6
    movntq [ebx+56], mm7
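The touch-ahead idea can be sketched in portable C++ as follows. The names `touch_ahead` and `copy_with_touch` are hypothetical, `std::memcpy` stands in for blastloop, and the 128-byte stride and 1024-byte chunk are taken from the offsets in the listing above. Whether such reads actually help depends on the processor's cache behaviour, which this sketch does not model.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>

constexpr std::size_t kStride = 128;   // distance between dummy reads, as in the listing
constexpr std::size_t kChunk  = 1024;  // bytes touched ahead of each copy pass

// Read one byte every kStride bytes, highest offset first, so that the
// chunk is likely to be in cache before it is copied. Returns the number
// of dummy reads performed.
std::size_t touch_ahead(const unsigned char* src, std::size_t bytes) {
    volatile unsigned char sink = 0;  // volatile keeps the reads from being optimized away
    std::size_t n = (bytes + kStride - 1) / kStride;
    const std::size_t touches = n;
    while (n-- > 0)
        sink = src[n * kStride];
    (void)sink;
    return touches;
}

// Copy in kChunk pieces, touching each piece before copying it.
// std::memcpy is a stand-in for the original copy loop.
void copy_with_touch(unsigned char* dst, const unsigned char* src, std::size_t bytes) {
    for (std::size_t done = 0; done < bytes; done += kChunk) {
        const std::size_t n = std::min(kChunk, bytes - done);
        touch_ahead(src + done, n);
        std::memcpy(dst + done, src + done, n);
    }
}
```

For a 1024-byte chunk this performs eight dummy reads, at offsets 896 down to 0, matching the eight `mov edi, [...]` instructions.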
Threads As stated before, VirtualDub is a project in active development. Its original creator had access to code optimization tools, VTune included, which allowed him to improve the code himself and remove many pitfalls and errors common to non-optimized code. VirtualDub also proved to be multithreaded, to a point:
Threads The 1st thread is the processing thread. The 2nd thread is the audio thread, and since audio was specifically disabled, it showed almost no activity. Therefore, in theory, multithreading the processing thread was still possible.
Threads At first I had high hopes for multithreading VirtualDub. Studying the code, I came to the conclusion that it processes the video frame by frame, and scans each frame line by line. The two approaches I decided to try were:
–Processing two frames in parallel
–Cutting a frame in half, and processing the top and bottom in parallel.
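The second approach, splitting a frame into a top and a bottom half, might look like the sketch below. Everything here is hypothetical: the `Frame` struct, the `process_rows` stand-in filter (which simply inverts each sample) and `process_frame_split` are not VirtualDub's types or its Resize filter. The sketch only works because the stand-in filter touches no state shared between rows.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical frame: 8-bit samples stored row by row.
struct Frame {
    std::size_t width;
    std::size_t height;
    std::vector<std::uint8_t> pixels;
};

// Stand-in per-row filter: inverts every sample in rows [row_begin, row_end).
void process_rows(Frame& f, std::size_t row_begin, std::size_t row_end) {
    for (std::size_t y = row_begin; y < row_end; ++y) {
        for (std::size_t x = 0; x < f.width; ++x) {
            std::uint8_t& p = f.pixels[y * f.width + x];
            p = static_cast<std::uint8_t>(255 - p);
        }
    }
}

// Process the top half on a worker thread and the bottom half on the
// calling thread, then wait for the worker to finish.
void process_frame_split(Frame& f) {
    const std::size_t mid = f.height / 2;
    std::thread top(process_rows, std::ref(f), std::size_t{0}, mid);
    process_rows(f, mid, f.height);
    top.join();
}
```

The two halves write to disjoint rows, so no locking is needed in this simplified case.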
Threads However, all my attempts at multithreading VirtualDub’s processing failed. At first I believed that I had run into accesses to global variables, but they turned out to be private variables of a much higher-level class. Attempts to duplicate that class in order to split the workload failed.
Threads Lastly, I turned to OpenMP, hoping to use its built-in ability to duplicate the variables into each thread. VirtualDub’s complexity made it impossible for me to convert it to the Intel Compiler: every change resulted in a staggering number of errors, each requiring many small code changes, and still more that could not be solved. Limiting the use of the Intel Compiler to only the necessary projects did not show an improvement.
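The OpenMP feature referred to here is the `private` data-sharing clause, which gives each thread its own copy of a variable. A minimal sketch, assuming a hypothetical row filter `scale_rows` with a single scratch variable (nothing here is taken from VirtualDub's source):

```cpp
#include <cstddef>
#include <vector>

// Multiply every sample by `factor`, one row per loop iteration.
// `scratch` is declared outside the loop, so without private(scratch)
// all threads would share it and race on it.
void scale_rows(std::vector<int>& data, std::size_t width, std::size_t height, int factor) {
    int scratch = 0;
    // A signed loop index is used because older OpenMP versions require it.
    #pragma omp parallel for private(scratch)
    for (long y = 0; y < static_cast<long>(height); ++y) {
        const std::size_t row = static_cast<std::size_t>(y) * width;
        for (std::size_t x = 0; x < width; ++x) {
            scratch = data[row + x] * factor;  // each thread writes its own copy
            data[row + x] = scratch;
        }
    }
}
```

If the compiler is not run with OpenMP enabled, the pragma is ignored and the loop runs serially with the same result.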
Conclusion A lot of time and effort was put into this project. To my dismay, this is evident not in the percentage of speedup but in error messages and various versions of the code, each a bit closer to a working version but never quite there. The bottom line is that, despite the promise VirtualDub initially showed, too much optimization had already been done on it, leaving it already optimized and too big and intricate for me to optimize further.