
Submitters: Vitaly Panor, Tal Joffe. Instructors: Zvika Guz, Koby Gottlieb. Software Laboratory, Electrical Engineering Faculty, Technion, Israel.


1 Submitters: Vitaly Panor, Tal Joffe. Instructors: Zvika Guz, Koby Gottlieb. Software Laboratory, Electrical Engineering Faculty, Technion, Israel

2 Project Goal Gain knowledge of software optimization. Learn and implement different optimization techniques. Get acquainted with different performance-analysis tools.

3 Optimization Approaches Multithreading (main part). Implementation considerations. Architectural considerations.

4 Chosen Program Called EOCF. Implements the Burrows–Wheeler lossless compression algorithm by M. Burrows and D. J. Wheeler. Can compress and decompress files. We chose to work on the compression part.

5 Algorithm Description Compression: The source file is read in blocks of bytes. A Burrows–Wheeler transform followed by a Move-To-Front transform is applied to each block. Each processed block is written to a temp file. After all the blocks have been written, Huffman compression is applied to the temp file.

6 Algorithm Description Decompression: The compression steps are performed in reverse order.

7 EOCF – Program Structure Block processing section (for each block in the file): Input File → Read Block → BW transformation → MTF transformation → Write Block to Temp File. After all blocks: Huffman compression reads the Temp File and writes the Output File.

8 Code Analysis The following two functions account for about 2/3 of the runtime:

9 Code Analysis Call graph:

10 Code Analysis The conclusion: The code spends most of its runtime performing the transformations.

11 Multi Threading Based on the results, the block-processing section was multi-threaded. The Huffman compression section was not multi-threaded. A data decomposition approach was used.

12 Data Decomposition Data decomposition approach in general: instead of a single thread running the pipeline Func1 → Func2 → … → FuncN over the whole input, the input is split into chunks (Input/N each), and each of the K threads (Thread 1 … Thread K) runs the full Func1 → … → FuncN pipeline on its own chunk, all writing to a common output.

13 Data Decomposition Data decomposition approach on EOCF: the Input File is read into input buffers; each of threads 1 … n runs BW and MTF on its own blocks and writes to output buffers; the output buffers are flushed to the Temp File, and Huffman compression then produces the Output File.

14 Thread Design Read a block from the input buffer. Perform the transformations. Write to the output buffer. Fill the input buffer or empty the output buffer if needed.

15 Thread Design Loop per thread: If the current read buffer is empty, fill it from the input file; if the current block is the last block, finish. Read the next input block and perform the transformations. Write the block to the output buffer; if the current write buffer is full, write it to the temp file.

16 Implementation The WIN32 threading API was used rather than the OpenMP API; according to research based on previous projects and internet articles, it yields better performance.

17 Synchronization Critical Section objects were used. They provide a slightly faster, more efficient mutual-exclusion mechanism than Mutex objects.

18 Thread Performance Threads share the load almost equally, and about 2/3 of the time is spent in the parallel section, as expected.

19 Thread Checker Thread Checker found no errors. *The single warning is due to the fact that we have a thread (main) that waits for the worker threads to finish.

20 Number of Threads Best performance is achieved when the number of threads equals the number of cores. On Dual Core:

21 Input Buffers The double-buffering technique is implemented: while one buffer is being filled, the other threads continue to read from the second buffer.

22 Output Buffers To comply with the decompression algorithm, sequential output had to be achieved. Based on empirical observation, we hold enough buffers so that each thread can write at least four blocks. The minimum number of buffers is two.

23 Buffer Size Based on our observations when using a Dual Core processor, the optimal buffer size is 16KB:

24 Data Sharing and Alignment To eliminate false sharing, the following steps were taken: Moving as much shared data as possible into each thread's private data. Aligning shared arrays to the cache-line size when each individual element is accessed by a different thread.

25 Data Sharing and Alignment Runtime without cache alignment: 198.2 sec. Runtime with cache alignment: 197.4 sec. An overall improvement of about 0.4%.

26 SIMD Usage of SIMD was not implemented.

27 Optimization Achieved Using a Dual Core processor, the ideal speed-up would be X2. Since we multi-threaded only about 2/3 of the code, we could expect a speed-up of 1 / (1/3 + (2/3)/2) = X1.5.

28 Optimization Achieved We achieved a speed-up of X1.47. Unavoidably, we lose time on managing and synchronizing threads.

29 Comparison to Other Intel Architectures We ran our program on 2 other computers: Intel® Core™2 Quad and Intel® Core™ i7. Measured speed-ups: X1.47, X1.9, X2.17, X1.96.

30 Thank You

