1. Project goals • Project description ◦ What is Musepack? ◦ Using multithreading approach ◦ Applying SIMD ◦ Analyzing micro-architecture problems


2
• Project goals
• Project description
  ◦ What is Musepack?
  ◦ Using multithreading approach
  ◦ Applying SIMD
  ◦ Analyzing micro-architecture problems
• Results – Speedup overview
• Conclusions and recommendations
• Our benefits
• Next steps

3
• The objectives are to introduce our project and to describe the steps taken to speed up an audio encoder called “Musepack” on Intel’s state-of-the-art platform.
• The presentation also details the results and benefits derived from this work, both direct (processing speed) and indirect (enhancing our know-how and expertise in this area).

4
• Speeding up and optimizing the Musepack encoder while maintaining bitwise output compatibility:
  ◦ Examining the encoder’s structure and methods.
  ◦ Analyzing the time distribution of the encoder’s functions using Intel’s VTune.
  ◦ Applying multithreading, SIMD instructions and other techniques to achieve speedup, using VTune.
• Returning the code to the open source community.

5
• Project platform:
  ◦ Intel Core 2 Duo, 2.4 GHz, 64-bit, 2 GB of RAM.
  ◦ Windows XP OS.
• Speedup measurement:

6
• What is Musepack?
  ◦ Musepack is an open source audio codec.
  ◦ It is a lossy encoder.
  ◦ Musepack has performed well in various listening tests at both lower and higher bitrates.

7
• Specifically optimized for transparent compression of stereo audio at bitrates of 160–180 kbit/s.
• Features:
  ◦ Huffman coding.
  ◦ Noise substitution techniques.
  ◦ A psychoacoustic model based on MPEG ISO model 2.

8
• Thread-level parallelism is used to reduce program execution time by executing multiple code sections on both cores simultaneously.
• Amdahl’s law: if P is the parallel proportion of the program, the maximum speedup achievable with 2 processors is 1 / ((1 − P) + P/2). Therefore, P should be maximized.
• Intel’s VTune was used to identify time-consuming functions suitable for multithreading.
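The bound from Amdahl’s law can be sketched as a small C helper. The function name `amdahl_speedup` and its parameters are hypothetical and are not part of the encoder; the sketch only evaluates the standard formula for a parallel fraction `p` and `n` processors.

```c
#include <assert.h>

/* Upper bound on speedup according to Amdahl's law.
 * p: fraction of the run time that can run in parallel, 0 <= p <= 1
 * n: number of processors, n >= 1
 * (hypothetical helper, for illustration only) */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    /* The serial part (1 - p) is unchanged; the parallel part is divided by n. */
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

For example, `amdahl_speedup(0.732, 2)`, using the 73.2% parallel part reported later in this presentation, gives a bound of about 1.58 on two cores.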

9
• Functions’ total timer events [chart]:
• Psychoakustic_Modell’s time consumption is high, so it should be a target for multithreading.

10
• The function contains two separate models that run the same instructions on different data. Each model should be executed in a different thread.

11
• Problem: very high dependency between the models through local and global variables: the second model uses the first one’s output.

12
• Observation: the Psychoakustic function contains left and right channel handling functions.
• These functions can be divided into two types:
  ◦ Single channel functions, for example: FunctionL(Left Param1, Left Param2, …, Local Param1, Local Param2).
  ◦ Dual channel functions, for example: FunctionLR(Left Param1, Right Param1, …).
• Single channel functions do not access the opposite channel’s local variables.
• Timer events distribution:
  ◦ Single – 84%
  ◦ Dual – 16%

13
• Strategy:
  ◦ One single channel function in each thread.
[Diagram: timeline with thread A and thread B; the two single channel functions (left, right) run in parallel, followed by the dual channel function.]
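The strategy above can be sketched in C as follows. This is a minimal sketch, not the encoder’s code: it assumes POSIX threads (the presentation does not name the threading API used on Windows XP), and `ChannelJob`, `single_channel`, `dual_channel`, `process_frame` and `FRAME_LEN` are hypothetical stand-ins.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define FRAME_LEN 8   /* hypothetical frame size */

/* Hypothetical per-channel work item. */
typedef struct {
    const float *in;   /* this channel's samples */
    float energy;      /* result, written only by the owning thread */
} ChannelJob;

/* Stand-in for a single channel function: touches one channel only. */
static void *single_channel(void *arg)
{
    ChannelJob *job = arg;
    float sum = 0.0f;
    for (size_t i = 0; i < FRAME_LEN; i++)
        sum += job->in[i] * job->in[i];
    job->energy = sum;
    return NULL;
}

/* Stand-in for a dual channel function: needs both channels' results. */
static float dual_channel(const ChannelJob *l, const ChannelJob *r)
{
    return l->energy + r->energy;
}

float process_frame(const float *left, const float *right)
{
    ChannelJob l = { left, 0.0f }, r = { right, 0.0f };
    pthread_t thread_b;

    /* Thread B handles the right channel ... */
    int rc = pthread_create(&thread_b, NULL, single_channel, &r);
    assert(rc == 0);
    /* ... while the calling thread (thread A) handles the left one. */
    single_channel(&l);

    /* Join before the dual channel step, which reads both outputs. */
    rc = pthread_join(thread_b, NULL);
    assert(rc == 0);
    (void)rc;
    return dual_channel(&l, &r);
}
```

Creating a thread for every frame is only for brevity here; a real encoder would more likely keep a worker thread alive across frames to limit threading overhead.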

14
• Implementation:
  ◦ Left-channel local variables are used by thread A, right-channel ones by thread B. Shared variables, used by both threads, are duplicated: one copy for each thread.
  ◦ Technical problem: the program contains a large number of global variables.
  ◦ These are accessed by both the left and right single channel functions, and so would be accessed from both threads simultaneously.
• Global variables include: A, About, ANSspec_L, ANSspec_M, ANSspec_R, ANSspec_S, APE_Version, array, b, Bandwidth, Buffer, BufferBytes, BufferedBits, bump_exp, bump_start, Butfly, __C, c, Ci_opt, CombPenalities, Cos_Tab, CosWin, CP_10000, CP_10079, CP_1250, CP_1251, CP_1252, CP_1253, CP_1254, CP_1255, CP_1256, CP_1257, CP_1258, CP_37, CP_42, CP_437, CP_500, CVD_used, __D, d, data_finished, DelInput, DisplayUpdateTime

15
• Solution – “divide and conquer” approach:
  ◦ Map all globals using a globals-marking script.
  ◦ Duplicate the globals that are accessed by functions at the deepest level of the call tree.
  ◦ After these functions are handled, proceed to the next higher level.
  ◦ The process ends once the globals accessed from the upper level (Psychoakustic’s own code) have been duplicated.
• Before: float g_var1; (global/static var) … Function A() { g_var1 = value; }
• After: 64-byte-aligned duplicated struct { float g_var1; } … Function A(thread num) { struct.g_var1 = value; }
• The duplicated structs are aligned to 64 bytes (to avoid shared cache lines).
[Diagram: call tree from Psychoakustic() at the upper level down to the deepest level.]
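The duplication step can be sketched in C11 as follows. The names `ThreadGlobals`, `g_dup`, `function_a` and `read_g_var1` are hypothetical, and the 64-byte cache line size is taken from the slide; this is an illustration of the idea, not the encoder’s actual data layout.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define NUM_THREADS 2
#define CACHE_LINE  64   /* cache line size in bytes, as stated on the slide */

/* Former globals, gathered into one struct per thread (hypothetical layout). */
typedef struct {
    alignas(CACHE_LINE) float g_var1;
    /* ... further duplicated globals would go here ... */
} ThreadGlobals;

/* Each copy occupies whole cache lines, so the two threads never
 * write to the same line (no false sharing). */
static_assert(sizeof(ThreadGlobals) % CACHE_LINE == 0,
              "per-thread copies must not share a cache line");

static ThreadGlobals g_dup[NUM_THREADS];

/* Was: void function_a(void) { g_var1 = value; } */
void function_a(int thread_num, float value)
{
    assert(thread_num >= 0 && thread_num < NUM_THREADS);
    g_dup[thread_num].g_var1 = value;
}

float read_g_var1(int thread_num)
{
    assert(thread_num >= 0 && thread_num < NUM_THREADS);
    return g_dup[thread_num].g_var1;
}
```

Passing the thread number down the call chain is what forces the bottom-up order described above: a function can only index its own copy once every function it calls accepts the thread number too.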

16
• After multithreading Psychoakustic, two more functions were multithreaded using the same mechanism.
• Total threading speedup: 1.43X
  ◦ Parallel part: 73.2%.
  ◦ Assuming the serial part does not change, the new execution time of the multithreaded part is 57% of its original time.
• Threading overhead:
  ◦ Total program instruction count (IC) increased by 2.6%.
  ◦ Total timer event count increased by 0.62%.
• Intel Thread Checker found no errors. (Thread Profiler)

17
• The original encoder settings use the “Precise” F.P. model rather than the “Fast” F.P. model.
• Precise mode increases calculation time.
• The F.P. model was changed to “Fast” (after consulting our instructors).
• In the original program, sqrt instructions with single-precision F.P. arguments were performed in double precision.
• These instructions were changed to single precision.
• Speedup gained so far: 1.77X
• The output file is bitwise compatible only with the original program’s “Fast F.P. mode” output:
  ◦ The small differences in value from the “Precise mode” output are due to rounding.
  ◦ Such minor differences cannot be noticed by the human ear.

18
• SIMD is a technique for achieving data-level parallelism: SIMD instructions enable the execution of 4 single-precision F.P. operations at a time.
• Function self-time distribution [chart]: the Sqrt function is the main target for SIMD instructions.

19
• SIMD instructions were used in the four functions that call the Sqrt instruction.
• These functions were transformed into SIMD-oriented functions: sqrt as well as other mathematical operations are performed by SIMD instructions.
• In one of the functions, because the number of loop iterations changes, the Sqrt array was calculated in advance using SIMD instructions.
• No calls to the original Sqrt remained after applying SIMD.
• Speedup gained from SIMD: 23% (with multithreading).
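Computing an array of square roots in advance with SIMD can be sketched with SSE intrinsics. The function name `sqrt_array_sse` is hypothetical and the loop is only an illustration of the technique; the presentation does not show the encoder’s actual SIMD code.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* dst[i] = sqrt(src[i]) for i in [0, n), four floats per instruction.
 * (hypothetical helper, for illustration only) */
void sqrt_array_sse(float *dst, const float *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;

    /* Main loop: packed single-precision square root, 4 values at a time. */
    for (; i + 4 <= n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);        /* unaligned load of 4 floats */
        _mm_storeu_ps(dst + i, _mm_sqrt_ps(v));  /* 4 square roots, then store */
    }

    /* Tail: scalar SSE square root for the remaining 0..3 elements. */
    for (; i < n; i++)
        dst[i] = _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(src[i])));
}
```

The unaligned loads keep the sketch safe for any input pointer; with 16-byte-aligned buffers, `_mm_load_ps` and `_mm_store_ps` could be used instead.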

20
• Using VTune’s Tuning Assistant, several micro-architecture problems were discovered:
• RAT_STALLS.FLAGS – indicates partial flag stalls.
  ◦ Each event causes a stall of ~10 cycles, ~4 sec in total.
  ◦ Possible solution: instruction substitution, such as replacing INC with ADD.
  ◦ The events occur in the ‘fread’ function, and therefore cannot be modified.
• LOAD_BLOCK.OVERLAP_STORE – load instructions are blocked; the cause can be 4K (page size) aliasing or load-store block overlap.
  ◦ Possible solution: increase 4K-sized arrays by the block size and use 64-byte alignment.
  ◦ The solution was applied; the results were unnoticeable.
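The padding fix for 4K aliasing can be sketched in C11 as follows. The struct `PaddedBuffers`, its members and the helper `page_offset_distance` are hypothetical, and the 64-byte block size is an assumption based on the alignment named on the slide.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define PAGE_BYTES  4096
#define BLOCK_BYTES 64                       /* assumed block size */
#define N   (PAGE_BYTES / sizeof(float))     /* 1024 floats = exactly 4 KB */
#define PAD (BLOCK_BYTES / sizeof(float))    /* one extra block per array */

/* Two 4 KB arrays, each grown by one block and aligned to 64 bytes.
 * Only the first N elements of each array are used. */
typedef struct {
    alignas(BLOCK_BYTES) float in[N + PAD];
    alignas(BLOCK_BYTES) float out[N + PAD];
} PaddedBuffers;

static_assert(offsetof(PaddedBuffers, out) % BLOCK_BYTES == 0,
              "second array must stay 64-byte aligned");

/* Distance between the two arrays modulo the page size. A result of 0
 * would mean in[i] and out[i] share the same low 12 address bits. */
size_t page_offset_distance(void)
{
    return (offsetof(PaddedBuffers, out) - offsetof(PaddedBuffers, in))
           % PAGE_BYTES;
}
```

Without the padding the two arrays would sit exactly 4096 bytes apart, so a load from `in[i]` and a store to `out[i]` would have identical low address bits; with it they are offset by one block.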

21
[Chart: Results – speedup overview; value shown: 2.03]

22
• Multithreading
  ◦ Can produce significant program acceleration.
  ◦ Global variables can be an obstacle in the process of multithreading.
• SIMD instructions
  ◦ Enhance speedup.
  ◦ Can be implemented only in specific parts of the code.
  ◦ Sometimes the implementation has to be “creative”.
• Micro-architecture
  ◦ In this program no major problems were found.
  ◦ VTune’s Tuning Assistant is a powerful tool for tracking micro-architecture problems.

23
• Making adjustments for quad-core processors by creating 4 threads.
• Designing a multithreading assistance program that traces and handles global variables using the suggested algorithm.

24
• Improving our expertise in identifying and handling the dominant factors in a process.
• Enhancing our knowledge of multithreading techniques.
• Learning how to use SIMD instructions.
• Being exposed to a few micro-architecture problems.
