1
Software Performance Tuning Project: Flake. Prepared by: Meni Orenbach, Roman Kaplan. Advisors: Zvika Guz, Kobi Gottlieb.
2
FLAC – Free Lossless Audio Codec. FLAC is specially designed for efficient packing of audio data. It can achieve compression ratios of 30%-50% for most music. Flake is a FLAC encoder.
3
Platform and Benchmark Used. Platform: Intel 64-bit Pentium Core 2 Duo at 2.4 GHz with 2 GB of RAM, running Windows XP. Benchmark: a 238 MB song. Original encoding duration: 105.156 sec.
4
Algorithm description. The input file is read frame by frame. Every frame contains a constant number of channels. Each channel is encoded independently with special Huffman codes called Rice codes.
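To make the Rice coding step concrete, here is a minimal sketch of encoding one non-negative value with Rice parameter k. The helper name `rice_encode_bits`, the string-of-characters output, and the bit convention (quotient as zeros followed by a stop bit of 1, then the k low bits) are assumptions chosen for illustration. This is not Flake's actual bit writer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, for illustration only.
   Writes the Rice code of `value` with parameter k into `out` as a
   NUL-terminated string of '0'/'1' characters and returns the bit count.
   Layout assumed here: (value >> k) zeros, a '1' stop bit, then the k low
   bits of value, most significant bit first. */
static size_t rice_encode_bits(uint32_t value, unsigned k, char *out, size_t cap)
{
    assert(k < 32);
    uint32_t q = value >> k;              /* quotient, sent in unary */
    size_t need = (size_t)q + 1 + k;      /* unary part + stop bit + remainder */
    assert(need < cap);                   /* leave room for the terminator */

    size_t pos = 0;
    memset(out, '0', q);                  /* q zero bits */
    pos += q;
    out[pos++] = '1';                     /* stop bit */
    for (unsigned b = k; b-- > 0; )       /* remainder, k bits */
        out[pos++] = ((value >> b) & 1u) ? '1' : '0';
    out[pos] = '\0';
    return pos;
}
```

With k = 2, the value 9 splits into quotient 2 and remainder 1, so the sketch emits 00101 (5 bits). Small values get short codes, which is why the choice of k per block matters for the compressed size.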
5
Flake – Data flow. Every frame is encoded in turn. Within a frame, the prediction error of every channel is computed using the LPC (linear predictive coding) algorithm and then encoded.
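The "error" in this data flow is the difference between each sample and its linear prediction from earlier samples. A minimal sketch of that step follows, assuming integer coefficients and a right shift that scales the prediction. The function name `lpc_residual` and its parameters are hypothetical and are not taken from the Flake source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch of an LPC residual computation.
   x:     input samples of one channel (n samples)
   coef:  `order` integer predictor coefficients, coef[0] applies to x[i-1]
   shift: right shift applied to the prediction sum (assumes the compiler
          uses an arithmetic shift for negative values, as is common)
   res:   output; the first `order` samples are copied verbatim as warm-up */
static void lpc_residual(const int32_t *x, int n,
                         const int32_t *coef, int order,
                         int shift, int32_t *res)
{
    assert(order >= 1 && order <= n);
    assert(shift >= 0 && shift < 32);

    for (int i = 0; i < order; i++)
        res[i] = x[i];

    for (int i = order; i < n; i++) {
        int64_t pred = 0;
        for (int j = 0; j < order; j++)
            pred += (int64_t)coef[j] * x[i - 1 - j];
        res[i] = x[i] - (int32_t)(pred >> shift);
    }
}
```

For a linear ramp such as 1, 2, 3, 4, 5 and the order-2 predictor 2*x[i-1] - x[i-2], every residual after the warm-up is zero, which is exactly the kind of small value that Rice codes store cheaply.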
6
Flake – Optimization method. We targeted the most time-consuming functions. Two approaches were taken: multi-threading and SIMD.
7
Optimization Method 1: Threads. Flake originally ran on a single thread. Parallelization lets work proceed simultaneously. While parallelizing Flake we considered: - The algorithm. - The data flow.
8
Encoding Process In Flake [Diagram of the encoding process with three candidate points marked "MultiThread Here!"]
9
Conclusions: Possible Ways to Parallelize Flake. 1. Parallelize the reads and writes from the file. 2. Parallelize the encoding phase for each frame separately. 3. Parallelize the encoding phase for each channel separately. A combination of the above is also possible.
10
Our Resolution. We chose to parallelize the channel encoding. Our reasons for doing so: The number of channels is limited, and so is the number of threads. File I/O goes through a shared device (the disk), so access to it is limited. Frame-level parallelism would need multiple reads of the file. Frame-level parallelism would also need a higher synchronization rate.
11
Implementing The Solution, First Try. Create as many threads as channels. Every thread encodes its channel and terminates. This solution achieved a speedup of x1.68. Its drawback is the overhead of creating and closing threads.
12
VTune Thread Profiler, First Try
13
Implementing The Solution, Second Try. Create as many threads as channels. Every thread encodes its channel and then waits for a signal. Save the thread handles so the same threads can be reused. This saves time by not closing the threads and gains a bigger speedup!
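The second try can be sketched as one persistent worker per channel that sleeps until it is signalled. The sketch below uses POSIX threads as a stand-in, since the slides do not show which threading API was used. Every name in it (`Worker`, `pool_start`, `pool_encode_frame`, `pool_stop`) is hypothetical, and the "encoding" is a placeholder that sums absolute sample values.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

typedef struct {
    pthread_t thread;
    pthread_mutex_t lock;
    pthread_cond_t cond;
    int has_work, done, quit;     /* flags protected by lock */
    const int32_t *samples;       /* channel data for the current frame */
    int n;
    int64_t result;               /* placeholder "encoded" output */
} Worker;

static void *worker_main(void *arg)
{
    Worker *w = arg;
    pthread_mutex_lock(&w->lock);
    for (;;) {
        while (!w->has_work && !w->quit)       /* sleep until signalled */
            pthread_cond_wait(&w->cond, &w->lock);
        if (w->quit)
            break;
        int64_t sum = 0;                       /* stand-in for encoding */
        for (int i = 0; i < w->n; i++) {
            int64_t v = w->samples[i];
            sum += v < 0 ? -v : v;
        }
        w->result = sum;
        w->has_work = 0;
        w->done = 1;
        pthread_cond_broadcast(&w->cond);      /* wake the main thread */
    }
    pthread_mutex_unlock(&w->lock);
    return NULL;
}

/* Create the threads once, before the first frame. */
static void pool_start(Worker *ws, int count)
{
    assert(count > 0);
    for (int c = 0; c < count; c++) {
        pthread_mutex_init(&ws[c].lock, NULL);
        pthread_cond_init(&ws[c].cond, NULL);
        ws[c].has_work = ws[c].done = ws[c].quit = 0;
        int rc = pthread_create(&ws[c].thread, NULL, worker_main, &ws[c]);
        assert(rc == 0);
        (void)rc;
    }
}

/* Hand one channel to each existing worker, then wait for all of them. */
static void pool_encode_frame(Worker *ws, int count,
                              const int32_t *const *channels, int n,
                              int64_t *out)
{
    for (int c = 0; c < count; c++) {
        pthread_mutex_lock(&ws[c].lock);
        ws[c].samples = channels[c];
        ws[c].n = n;
        ws[c].done = 0;
        ws[c].has_work = 1;
        pthread_cond_broadcast(&ws[c].cond);
        pthread_mutex_unlock(&ws[c].lock);
    }
    for (int c = 0; c < count; c++) {
        pthread_mutex_lock(&ws[c].lock);
        while (!ws[c].done)
            pthread_cond_wait(&ws[c].cond, &ws[c].lock);
        out[c] = ws[c].result;
        pthread_mutex_unlock(&ws[c].lock);
    }
}

/* Ask every worker to exit and release its resources. */
static void pool_stop(Worker *ws, int count)
{
    for (int c = 0; c < count; c++) {
        pthread_mutex_lock(&ws[c].lock);
        ws[c].quit = 1;
        pthread_cond_broadcast(&ws[c].cond);
        pthread_mutex_unlock(&ws[c].lock);
        pthread_join(ws[c].thread, NULL);
        pthread_cond_destroy(&ws[c].cond);
        pthread_mutex_destroy(&ws[c].lock);
    }
}
```

The design point is that `pool_start` and `pool_stop` run once per file, while `pool_encode_frame` runs once per frame, so the cost of creating and closing threads no longer scales with the number of frames.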
14
VTune Thread Profiler, Second Try. Note: in our benchmark there are only 2 channels.
15
SpeedUp Gained Through MultiThreading Total speedup from using MT: x1.85!
16
Optimization Method 2: SIMD Mainly used SSE and SSE2 instructions. Operations with Double FP and Integers. Two main functions we used SSE on: –calc_rice_params(). –compute_autocorr().
17
calc_rice_params () - Improvements Logic operations with Integers. The original loop was unrolled by 4. The input and output arrays were aligned to prevent ‘Split loads’.
18
calc_rice_params () – The code
Old code:
    for (i = 0; i < n; i++) {
        udata[i] = (2 * data[i]) ^ (data[i] >> 31);
    }
New code:
    for (i = 0; i < n; i += 4) {
        temp1 = _mm_load_si128((__m128i *)(data + i));   /* aligned load of 4 integers */
        temp2 = _mm_slli_epi32(temp1, 1);                /* multiply by 2 */
        temp3 = _mm_srai_epi32(temp1, 31);               /* shift right by 31 bits */
        temp1 = _mm_xor_si128(temp2, temp3);             /* bitwise XOR */
        _mm_store_si128((__m128i *)(udata + i), temp1);  /* aligned store */
    }
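The loop above maps signed residuals to unsigned values (0, -1, 1, -2 become 0, 1, 2, 3) four at a time. The following self-contained sketch shows the same mapping with two differences that are assumptions of this example and not part of the slides: it uses unaligned loads and stores so it works on any buffer, and it finishes with a scalar tail so n need not be a multiple of 4. The function name `fold_signed` is hypothetical. On compilers that do not define `__SSE2__`, only the scalar path is built.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch: udata[i] = (2 * data[i]) ^ (data[i] >> 31) */
static void fold_signed(const int32_t *data, uint32_t *udata, int n)
{
    assert(n >= 0);
    int i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        __m128i v    = _mm_loadu_si128((const __m128i *)(data + i));
        __m128i dbl  = _mm_slli_epi32(v, 1);    /* 2 * data[i] */
        __m128i sign = _mm_srai_epi32(v, 31);   /* all ones if negative */
        _mm_storeu_si128((__m128i *)(udata + i), _mm_xor_si128(dbl, sign));
    }
#endif
    /* Scalar tail, written with unsigned arithmetic to avoid relying on
       shifts of negative values. */
    for (; i < n; i++) {
        uint32_t v    = (uint32_t)data[i];
        uint32_t sign = data[i] < 0 ? 0xFFFFFFFFu : 0u;
        udata[i] = (v << 1) ^ sign;
    }
}
```

The aligned variants used on the slide avoid split loads but require both arrays to start on a 16-byte boundary; the unaligned variants here trade some of that benefit for generality.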
19
SIMD - compute_autocorr(). It calls another inline function, apply_welch_window(), which performs the first calculations. The speedup is calculated for both functions together.
20
Old code vs. new code: apply_welch_window()
Old code:
    for (i = 0; i < (len >> 1); i++) {
        w = 1.0 - ((c-i) * (c-i));
        w_data[i] = data[i] * w;
        w_data[len-1-i] = data[len-1-i] * w;
    }
New code (conversion to FP and multiplication):
    iup_align = _mm_load_si128((__m128i *)(data + i));   /* load 4 integers at once */
    fpup = _mm_cvtepi32_pd(iup_align);                   /* convert the low 2 integers to double */
    fpup = _mm_mul_pd(fpup, w_d_low);
    _mm_store_pd(w_data + i, fpup);
    iup_align = _mm_shuffle_epi32(iup_align, _MM_SHUFFLE(1,0,3,2));   /* swap the high and low halves */
    fpup = _mm_cvtepi32_pd(iup_align);                   /* convert the other 2 integers */
    fpup = _mm_mul_pd(fpup, w_d_high);
    _mm_store_pd(w_data + i + 2, fpup);
Loading 4 integers at once cuts 50% of the load operations.
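The load-once, convert-twice pattern can be tried in isolation. This sketch multiplies an integer array by a precomputed array of double weights; it is not the window function itself. The name `mul_int_by_weights` and the separate weight array `w` are assumptions made for the example. It uses unaligned loads and a scalar tail so it runs on arbitrary buffers, and it falls back to plain C when `__SSE2__` is not defined.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical sketch: out[i] = data[i] * w[i] */
static void mul_int_by_weights(const int32_t *data, const double *w,
                               double *out, int n)
{
    assert(n >= 0);
    int i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        /* One integer load feeds two conversions. */
        __m128i iv = _mm_loadu_si128((const __m128i *)(data + i));
        __m128d lo = _mm_cvtepi32_pd(iv);             /* elements 0 and 1 */
        _mm_storeu_pd(out + i, _mm_mul_pd(lo, _mm_loadu_pd(w + i)));
        iv = _mm_shuffle_epi32(iv, _MM_SHUFFLE(1, 0, 3, 2));
        __m128d hi = _mm_cvtepi32_pd(iv);             /* elements 2 and 3 */
        _mm_storeu_pd(out + i + 2, _mm_mul_pd(hi, _mm_loadu_pd(w + i + 2)));
    }
#endif
    for (; i < n; i++)                                /* scalar tail */
        out[i] = data[i] * w[i];
}
```

`_mm_cvtepi32_pd` only converts the two low 32-bit lanes, which is why the shuffle is needed to bring the upper pair down before the second conversion.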
21
compute_autocorr() Uses the output array from apply_welch_window(). Loop unrolling steps 1.Every ‘Inner Loop’ unrolled by 2. 2.‘Main Loop’ unrolled by 2 - every Inner Loop unrolled by 4.
22
compute_autocorr() – The code
Old code:
    for (i = 0; i <= lag; ++i) {                   /* 'main loop' */
        temp = 1.0;
        temp2 = 1.0;
        for (j = 0; j <= lag-i; ++j)               /* short 'inner loop' */
            temp += data1[j+i] * data1[j];
        for (j = lag+1; j <= len-1; j += 2) {      /* long 'inner loop' */
            temp += data1[j] * data1[j-i];
            temp2 += data1[j+1] * data1[j+1-i];
        }
        autoc[i] = temp + temp2;
    }
New code, long 'inner loop' (unrolled in the original code):
        /* use as many aligned loads (and stores) as we can */
        if (lag%2 == 0) {
            a_high = a_low = _mm_loadu_pd(data1+j);
            b_low = _mm_loadu_pd(data1+j-i);
            b_high = _mm_load_pd(data1+j-i-1);
        } else {
            a_high = a_low = _mm_load_pd(data1+j);
            b_low = _mm_load_pd(data1+j-i);
            b_high = _mm_loadu_pd(data1+j-i-1);
        }
        /* multiply and add the result */
        a_low = _mm_mul_pd(a_low, b_low);
        c_low = _mm_add_pd(a_low, c_low);
        a_high = _mm_mul_pd(a_high, b_high);
        c_high = _mm_add_pd(a_high, c_high);
    }   /* closes the enclosing loop, whose header is not shown on the slide */
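As a reference point for the unrolling described above, here is a plain C autocorrelation with the inner loop unrolled by 2 into two independent accumulators. It is a simplified sketch and not the Flake routine: the name `autocorr_unrolled` is hypothetical, the sums start from 0.0 where the slide's code starts from 1.0, and a single inner loop covers the whole range.

```c
#include <assert.h>

/* Hypothetical sketch: autoc[lag] = sum over i of x[i] * x[i - lag],
   for lag = 0 .. max_lag, with the inner loop unrolled by 2. */
static void autocorr_unrolled(const double *x, int len, int max_lag,
                              double *autoc)
{
    assert(len > 0 && max_lag >= 0 && max_lag < len);
    for (int lag = 0; lag <= max_lag; lag++) {
        double s0 = 0.0, s1 = 0.0;     /* two independent partial sums */
        int i = lag;
        for (; i + 1 < len; i += 2) {
            s0 += x[i] * x[i - lag];
            s1 += x[i + 1] * x[i + 1 - lag];
        }
        if (i < len)                   /* leftover element for odd counts */
            s0 += x[i] * x[i - lag];
        autoc[lag] = s0 + s1;
    }
}
```

Two accumulators map directly onto the two lanes of an SSE2 double register, which is what makes this loop shape a natural starting point for the `_mm_mul_pd` and `_mm_add_pd` version. Note that splitting the sum changes the order of floating-point additions, so results can differ from a single-accumulator loop in the last bits.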
23
compute_autocorr() – Speedup. Summary of speedups using SIMD: calc_rice_params() local speedup: x1.14, overall speedup: x1.04. compute_autocorr() local speedup: x1.92, overall speedup: x1.03. Total speedup using SIMD: x1.07.
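The gap between local and overall speedup is what Amdahl's law predicts when a function accounts for only part of the run time. The sketch below is a back-of-the-envelope aid under that assumption; the function names are hypothetical and the fractions it yields are estimates, not measurements from the project.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `fraction` of the run time
   is accelerated by a factor `local`. */
static double overall_speedup(double fraction, double local)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && local > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / local);
}

/* Inverse: the fraction of run time a function must have occupied to turn
   a local speedup `local` into an overall speedup `overall`. */
static double fraction_from_speedups(double local, double overall)
{
    assert(local > 1.0 && overall >= 1.0 && overall <= local);
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local);
}
```

Plugging in the figures above suggests that calc_rice_params() took roughly 31% of the run time and compute_autocorr() roughly 6%, if Amdahl's law holds and the rounded speedups are taken at face value. The two overall speedups also multiply to about 1.04 × 1.03 ≈ 1.07, in line with the reported SIMD total.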
24
Intel Tuning Assistant When using aligned arrays, split loads didn't occur. No Micro-Architectural problems found in the optimized code.
25
Final Results A total speedup of x1.985 was achieved by using only MT and SIMD.