2
Project goals
Project description
◦ What is Musepack?
◦ Using a multithreading approach
◦ Applying SIMD
◦ Analyzing micro-architecture problems
Results: speedup overview
Conclusions and recommendations
Our benefits
Next steps
3
The objectives are to introduce our project and to describe the steps taken to speed up an audio encoder called “Musepack” on Intel’s state-of-the-art platform. The presentation also details the results and the benefits derived from this work, both direct (processing speed) and indirect (enhancing our know-how and expertise in this area).
4
Project goals:
◦ Speeding up and optimizing a Musepack encoder while maintaining bitwise output compatibility:
  ◦ Examining the encoder’s structure and methods.
  ◦ Analyzing the time distribution of the encoder’s functions using Intel’s VTune program.
  ◦ Applying multithreading, SIMD instructions and other techniques to achieve speedup, using VTune.
◦ Returning the code to the open source community.
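Bitwise output compatibility can be checked by comparing the optimized encoder's output file with the original encoder's output for the same input. The following is a minimal sketch of such a check; the function name `files_identical` and the file paths are hypothetical and are not taken from the project.

```c
#include <stdio.h>

/* Hypothetical helper: compares two files byte by byte.
 * Returns 1 if the contents are identical, 0 if they differ,
 * and -1 if either file cannot be opened. */
int files_identical(const char *path_a, const char *path_b) {
    FILE *fa = fopen(path_a, "rb");
    FILE *fb = fopen(path_b, "rb");
    int result = 1;

    if (fa == NULL || fb == NULL) {
        result = -1;
    } else {
        int ca, cb;
        do {
            ca = fgetc(fa);
            cb = fgetc(fb);
            if (ca != cb) {      /* also catches files of different length */
                result = 0;
                break;
            }
        } while (ca != EOF);
    }

    if (fa != NULL) fclose(fa);
    if (fb != NULL) fclose(fb);
    return result;
}
```

A call such as `files_identical("original.mpc", "optimized.mpc")` (hypothetical file names) would be run after every optimization step, so that a regression is caught immediately.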
5
Project platform: Intel Core 2 Duo, 2.4 GHz, 64-bit, 2 GB of RAM, Windows XP.
Speedup measurement:
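The measurement formula itself does not appear in the slide text. A conventional definition, assumed here rather than taken from the slide, is the ratio of execution times for the same input file:

```latex
$$\text{Speedup} = \frac{T_{\text{original}}}{T_{\text{optimized}}}$$
```

Under this assumed definition, a speedup of 2X means the optimized encoder finishes in half the time of the original.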
6
What is Musepack?
◦ Musepack is an open source audio codec.
◦ It is a lossy encoder.
◦ Musepack has performed well in various listening tests at both lower and higher bitrates.
7
Musepack is specifically optimized for transparent compression of stereo audio at bitrates of 160–180 kbit/s.
Features:
◦ Huffman coding.
◦ Noise substitution techniques.
◦ A psychoacoustic model based on MPEG ISO model 2.
8
Thread-level parallelism is used to reduce program execution time by executing multiple code sections on both cores simultaneously.
Amdahl’s law: if P is the proportion of the program that can run in parallel, then the maximum speedup that can be achieved by using 2 processors is:
$$S_{\max} = \frac{1}{(1 - P) + \frac{P}{2}}$$
Therefore, P should be maximized.
Intel’s VTune was used to identify time-consuming functions suitable for multithreading.
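Amdahl's law can be evaluated for any parallel fraction and processor count. The sketch below is illustrative only; the function name `amdahl_speedup` is hypothetical and not part of the encoder.

```c
/* Upper bound on speedup according to Amdahl's law.
 * p: fraction of the run time that can be parallelized (0.0 to 1.0)
 * n: number of processors (n >= 1)
 * Hypothetical helper for illustration; not taken from the project. */
double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, with half of the program parallelized (`p = 0.5`) on 2 processors, the bound is 1 / (0.5 + 0.25), about 1.33X, which shows why P should be maximized.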
9
Functions’ total timer events:
Psychoakustic_Modell’s time consumption is high; it should therefore be a target for multithreading.
10
The function contains two separate models that execute the same instructions on different data. Each model should be executed in a different thread.
11
Problem: there is a very high dependency between the models through local and global variables; the second model uses the first one’s output.
12
Observation: the Psychoakustic function contains left- and right-channel handling functions.
◦ These functions can be divided into two types:
  ◦ Single channel functions, for example: FunctionL(Left Param1, Left Param2, …, Local Param1, Local Param2).
  ◦ Dual channel functions, for example: FunctionLR(Left Param1, Right Param1, …).
◦ Single channel functions do not access the opposite channel’s local variables.
Timer events distribution: single 84%, dual 16%.
13
Strategy:
◦ One single channel function in each thread.
[Diagram: time axis with Thread A and Thread B; the two single channel functions (left and right) run in parallel, followed by the dual channel function.]
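The strategy can be sketched as follows: one thread handles the right channel while the calling thread handles the left channel, and the dual channel step runs after both have finished. This is a minimal illustration using POSIX threads, which is an assumption; the slides do not name the threading API, and the project ran on Windows XP. The type `channel_t` and all function names are hypothetical, and the per-channel computation is a stand-in for the real single channel functions.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical per-channel state; the real encoder's data differs. */
typedef struct {
    const float *samples;
    size_t count;
    float energy;              /* result of the single channel step */
} channel_t;

/* Stand-in for a single channel function: touches one channel only. */
static void *single_channel_step(void *arg) {
    channel_t *ch = (channel_t *)arg;
    float sum = 0.0f;
    for (size_t i = 0; i < ch->count; i++) {
        sum += ch->samples[i] * ch->samples[i];
    }
    ch->energy = sum;
    return NULL;
}

/* Stand-in for a dual channel function: needs both channels' results. */
static float dual_channel_step(const channel_t *l, const channel_t *r) {
    return l->energy + r->energy;
}

float process_frame(channel_t *left, channel_t *right) {
    pthread_t thread_b;

    /* Thread B: right channel. Fall back to serial execution on failure. */
    if (pthread_create(&thread_b, NULL, single_channel_step, right) != 0) {
        single_channel_step(right);
        single_channel_step(left);
        return dual_channel_step(left, right);
    }

    single_channel_step(left);     /* Thread A: the calling thread */
    pthread_join(thread_b, NULL);  /* wait before the dual channel step */
    return dual_channel_step(left, right);
}
```

Creating a thread per frame is used here only to keep the sketch short; a production encoder would more likely keep a worker thread alive across frames.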
14
Implementation:
◦ Left-channel local variables are used by thread A, while right-channel ones are used by thread B. Shared variables, used by both threads, are duplicated: one copy for each thread.
◦ Technical problem: the program contains a large number of global variables.
◦ These are accessed by both the left and right single channel functions, and would therefore be accessed from both threads simultaneously.
Globals shown on the slide (partial list): A, About, ANSspec_L, ANSspec_M, ANSspec_R, ANSspec_S, APE_Version, array, b, Bandwidth, Buffer, BufferBytes, BufferedBits, bump_exp, bump_start, Butfly, __C, c, Ci_opt, CombPenalities, Cos_Tab, CosWin, CP_10000, CP_10079, CP_1250, CP_1251, CP_1252, CP_1253, CP_1254, CP_1255, CP_1256, CP_1257, CP_1258, CP_37, CP_42, CP_437, CP_500, CVD_used, __D, d, data_finished, DelInput, DisplayUpdateTime
15
Solution, a “divide and conquer” approach:
◦ Map all globals using a globals-marking script.
◦ Duplicate the globals that are accessed by functions at the deepest level of the function call tree.
◦ After these functions are handled, proceed to the next higher level.
◦ The process ends when the globals accessed from within the upper level (Psychoakustic’s own code) have been duplicated.
Before: float g_var1; (global/static var) … Function A() { g_var1 = value; }
After: 64-byte-aligned duplicated struct { float g_var1; } … Function A(thread num) { struct[thread num].g_var1 = value; }
[Diagram: call tree from Psychoakustic() at the upper level down to the deepest level.]
The duplicated structs are aligned to 64 bytes (to avoid shared cache lines).
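The duplication step can be sketched in C as follows. This is an illustration under assumptions: it uses C11 `alignas`, whereas the original build may have used a compiler-specific alignment attribute, and the names `thread_globals_t`, `g_dup`, `function_a` and `read_g_var1` are hypothetical. Only `g_var1` comes from the slide.

```c
#include <stdalign.h>
#include <stddef.h>

#define NUM_THREADS 2
#define CACHE_LINE_BYTES 64

/* One copy of the former globals per thread. Aligning the first member
 * to 64 bytes aligns the whole struct and pads its size to a multiple
 * of 64, so two threads never write to the same cache line. */
typedef struct {
    alignas(CACHE_LINE_BYTES) float g_var1;  /* was: static float g_var1; */
} thread_globals_t;

static thread_globals_t g_dup[NUM_THREADS];

/* Was: void function_a(void) { g_var1 = value; } */
void function_a(int thread_num, float value) {
    g_dup[thread_num].g_var1 = value;
}

float read_g_var1(int thread_num) {
    return g_dup[thread_num].g_var1;
}
```

Passing the thread number down the call chain is what makes the bottom-up order necessary: the deepest functions get the extra parameter first, and each caller is updated once everything it calls has been converted.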
16
After the Psychoakustic multithreading, two more functions were multithreaded using the same mechanism.
Total threading speedup: 1.43X
◦ Parallel part: 73.2%.
◦ Assuming the serial part does not change, the new execution time of the multithreaded part is 57% of its original time.
◦ Threading overhead:
  ◦ Total program instruction count (IC) increased by 2.6%.
  ◦ Total timer event count increased by 0.62%.
Intel Thread Checker found no errors. (Thread Profiler)
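As a sanity check, the reported figures can be put into Amdahl's law. With a parallel part of P = 0.732 on 2 cores, the theoretical upper bound is:

```latex
$$S_{\max} = \frac{1}{(1 - 0.732) + \frac{0.732}{2}} = \frac{1}{0.634} \approx 1.58$$
```

If the parallel part instead runs in 57% of its original time, as stated above, the expected speedup is:

```latex
$$S = \frac{1}{0.268 + 0.57 \times 0.732} \approx \frac{1}{0.685} \approx 1.46$$
```

This is in the same range as the measured 1.43X. The slides do not explain the remaining gap; attributing it to the threading overhead listed above would be an assumption.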
17
The original encoder build settings use the “Precise F.P. model” instead of the “Fast F.P. model”. Precise mode increases calculation time.
◦ The F.P. model was changed to “fast” (after consulting our instructors).
◦ In the original program, sqrt operations on single-precision F.P. arguments were performed in double precision. These were changed to single precision.
Speedup gained so far: 1.77X
The output file is bitwise compatible only with the output of the original program built in “Fast F.P. mode”:
◦ The value differences from the “Precise mode” output are due to rounding.
◦ Such minor differences cannot be noticed by the human ear.
18
SIMD is a technique for achieving data-level parallelism: a single SIMD instruction performs the same operation on 4 single-precision F.P. values at a time.
Function self time distribution:
The Sqrt function is the main target for SIMD instructions.
19
SIMD instructions were used in the four functions that call the Sqrt instruction.
◦ These functions were transformed into SIMD-oriented functions: sqrt as well as other mathematical operations are performed by SIMD instructions.
◦ In one of the functions, because the loop iteration count varies, the Sqrt array is calculated in advance using SIMD instructions.
◦ No calls to the original Sqrt remained after applying SIMD.
SIMD gained speedup: 23% (with multithreading).
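Computing a Sqrt array in advance with SIMD can be sketched with SSE intrinsics, which process four single-precision values per instruction. This is an illustration under assumptions: the slides do not show the actual code or name the instruction set, the function name `sqrt_array_sse` is hypothetical, and the code requires an x86 target with SSE.

```c
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Hypothetical helper: out[i] = sqrt(in[i]) for i in [0, n). */
void sqrt_array_sse(const float *in, float *out, size_t n) {
    size_t i = 0;

    /* Main loop: four square roots per iteration. Unaligned loads and
     * stores are used so the caller does not need 16-byte alignment. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);
        _mm_storeu_ps(out + i, _mm_sqrt_ps(v));
    }

    /* Tail: remaining 0 to 3 elements, one scalar SSE sqrt each. */
    for (; i < n; i++) {
        out[i] = _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(in[i])));
    }
}
```

Both the packed and the scalar path stay in single precision, consistent with the earlier change from double-precision to single-precision sqrt.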
20
Using VTune’s Tuning Assistant, several micro-architecture problems were discovered:
◦ RAT_STALLS.FLAGS indicates partial flag stalls. About … events, each one causing a stall of ~10 cycles (~4 sec in total). Possible solution: instruction substitution, such as replacing INC with ADD. The events occur in the ‘fread’ function, which therefore cannot be modified.
◦ LOAD_BLOCK.OVERLAP_STORE indicates that load instructions are blocked. The cause can be 4K (page size) aliasing or load-store block overlap. Possible solution: increase 4K-sized arrays by a block size and use 64-byte alignment. The solution was applied; the results were unnoticeable.
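The padding idea for LOAD_BLOCK.OVERLAP_STORE can be sketched as follows: two adjacent 4K arrays are each enlarged by one 64-byte block, so that corresponding elements are no longer exactly 4096 bytes apart. The struct and names below are hypothetical; the slides do not show which arrays were changed, and C11 `alignas` is an assumption about the toolchain.

```c
#include <stdalign.h>
#include <stddef.h>

#define PAGE_BYTES 4096
#define BLOCK_BYTES 64
/* 4K worth of floats plus one extra 64-byte block of padding. */
#define PADDED_FLOATS ((PAGE_BYTES + BLOCK_BYTES) / sizeof(float))

/* Hypothetical pair of work buffers, each 64-byte aligned. */
typedef struct {
    alignas(BLOCK_BYTES) float a[PADDED_FLOATS];
    alignas(BLOCK_BYTES) float b[PADDED_FLOATS];
} work_buffers_t;

/* Distance in bytes between a[i] and b[i]. */
size_t buffer_stride_bytes(void) {
    return offsetof(work_buffers_t, b) - offsetof(work_buffers_t, a);
}
```

With the padding, the stride is 4160 bytes rather than 4096, so `a[i]` and `b[i]` no longer share the same low 12 address bits, which is the condition for 4K aliasing.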
21
Results (speedup overview): 2.03X
22
Multithreading
◦ Can produce significant program acceleration.
◦ Global variables can be an obstacle in the process of multithreading.
SIMD instructions
◦ Enhance speedup.
◦ Can be applied only to specific parts of the code.
◦ Sometimes the implementation has to be “creative”.
Micro-architecture
◦ In this program no major problems were found.
◦ VTune’s tuning assistance is a powerful tool for tracking micro-architecture problems.
23
Next steps:
◦ Adapting the encoder to quad-core processors by creating 4 threads.
◦ Designing a multithreading assistance program that traces and handles global variables using the suggested algorithm.
24
Our benefits:
◦ Improving our expertise in identifying the dominant factors in a process and handling them.
◦ Enhancing our knowledge of multithreading techniques.
◦ Learning how to use SIMD instructions.
◦ Being exposed to a few micro-architecture problems.