Software performance enhancement using multithreading and architectural considerations
Prepared by: Andrey Sloutsman, Konstantin Muradov, 01/2006

Chosen application
Refocus-it: iterative refocus plug-in for GIMP
- Home page: http://refocus-it.sourceforge.net/
- Download page: http://sourceforge.net/projects/refocus-it
- GIMP page: http://www.gimp.org/

Refocus-it
The iterative refocus GIMP plug-in can be used to refocus images acquired by a defocused camera, or blurred by Gaussian blur, motion blur, or any combination of these. Adaptive or static area smoothing can be used to remove the so-called "ringing" effect.
Example:

Algorithm Description
- The algorithm runs i iterations (i is supplied on the command line).
- In every iteration, the hopfield_iteration…() function is invoked on each color map and does the refocusing.
- Inside hopfield_iteration…(), the weight of each pixel of the picture is recalculated from its previous value and the values in its scanning area.
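The outer structure described above can be sketched in C roughly as follows. This is only an illustration: plane_t, iteration_fn, run_refocus and demo_step are hypothetical names, and the real signature of the plug-in's hopfield_iteration…() functions is not given in the slides.

```c
#include <stddef.h>

/* Hypothetical layout: one float plane per color map. */
typedef struct {
    int width, height;
    float *pixels;              /* width * height values, row-major */
} plane_t;

/* Stand-in type for one hopfield_iteration...() variant (assumed signature). */
typedef void (*iteration_fn)(plane_t *plane);

/* Placeholder step so the sketch can run: adds 1 to every pixel.
   The real step recalculates each pixel from its scanning area. */
static void demo_step(plane_t *plane)
{
    for (int k = 0; k < plane->width * plane->height; ++k)
        plane->pixels[k] += 1.0f;
}

/* Run `iterations` passes; each pass updates every color map once.
   Returns the number of step invocations. */
static int run_refocus(plane_t *planes, size_t n_planes,
                       int iterations, iteration_fn step)
{
    int calls = 0;
    for (int it = 0; it < iterations; ++it) {
        for (size_t c = 0; c < n_planes; ++c) {
            step(&planes[c]);
            ++calls;
        }
    }
    return calls;
}
```

With three color maps and i = 5, the step function is invoked 15 times, which is why the per-pixel work inside it dominates the run time.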

hopfield_iteration…() How the function works:

hopfield_iteration…()
There are 4 different functions, chosen by the command-line parameters:
- hopfield_iteration_mirror_lambda()
- hopfield_iteration_mirror()
- hopfield_iteration_period_lambda()
- hopfield_iteration_period()

Threading approaches
- Split by color

Threading approaches
- Using OpenMP
- Divide the picture (two alternative divisions)

Threading approaches
- Divide pixels (Thread 0, Thread 1)

Threading approaches
- Divide columns (final): Main Thread, Helper Thread
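A minimal sketch of a column split between a main and a helper thread, assuming POSIX threads and a contiguous left/right division (the slide's figure is not preserved, so the exact division is an assumption). col_job_t, process_columns and process_split_by_columns are hypothetical names, and the per-pixel update is a placeholder with no neighbourhood access.

```c
#include <pthread.h>

typedef struct {
    float *image;               /* row-major, width * height */
    int width, height;
    int col_begin, col_end;     /* half-open column range owned by a thread */
} col_job_t;

/* Placeholder update (doubles each pixel); the plug-in's real update
   reads a scanning area around the pixel. */
static void *process_columns(void *arg)
{
    col_job_t *job = arg;
    for (int y = 0; y < job->height; ++y)
        for (int x = job->col_begin; x < job->col_end; ++x)
            job->image[y * job->width + x] *= 2.0f;
    return NULL;
}

/* Main thread takes the left columns, the helper thread the right ones.
   Returns 0 on success, -1 if the helper could not be created or joined. */
static int process_split_by_columns(float *image, int width, int height)
{
    int mid = width / 2;
    col_job_t left  = { image, width, height, 0,   mid   };
    col_job_t right = { image, width, height, mid, width };
    pthread_t helper;

    if (pthread_create(&helper, NULL, process_columns, &right) != 0)
        return -1;
    process_columns(&left);     /* main thread works while the helper runs */
    return pthread_join(helper, NULL) == 0 ? 0 : -1;
}
```

Because the placeholder touches only its own pixel, no synchronization beyond the final join is needed here; an update that reads neighbouring columns would need more than this.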

Synchronization (time)
- Barriers (to provide algorithm consistency)
- The threading solution must take the data dependency into account
[Figure: Main Thread and Helper Thread, with Main Thread Barriers and Helper Thread Barriers]

Synchronization (space)
- Mutexed areas (to prevent write/read conflicts)
- Intersection of the scanning areas causes W/R conflicts
[Figure: Main Thread and Helper Thread areas, with a barrier, the Main Mutex and the Period Mutexes]

Randomizer Thread
- Using rand() in threaded code makes the output of the optimized code differ from that of the original code, because the threads generate the same random series.
- Solution: a Randomizer Thread supplies Random Buffers to the Main Thread and the Helper Thread.
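One possible shape of the random-buffer idea is sketched below, assuming POSIX threads and rand_r(). The slides do not say how the buffers are sized or handed over, so the fixed-size buffers, the interleaved fill order and the names rand_buffers_t, randomizer_thread and fill_random_buffers are all assumptions.

```c
#include <pthread.h>
#include <stdlib.h>

#define RAND_BUF_LEN 16

typedef struct {
    unsigned int seed;
    int main_buf[RAND_BUF_LEN];     /* values consumed by the main thread */
    int helper_buf[RAND_BUF_LEN];   /* values consumed by the helper thread */
} rand_buffers_t;

/* A single generator fills both buffers, so the two workers never
   draw from the same position of the random series. */
static void *randomizer_thread(void *arg)
{
    rand_buffers_t *rb = arg;
    unsigned int state = rb->seed;
    for (int i = 0; i < RAND_BUF_LEN; ++i) {
        rb->main_buf[i]   = rand_r(&state);
        rb->helper_buf[i] = rand_r(&state);
    }
    return NULL;
}

/* Fill the buffers on a separate thread and wait for it to finish.
   Returns 0 on success, -1 on a threading error. */
static int fill_random_buffers(rand_buffers_t *rb, unsigned int seed)
{
    pthread_t t;
    rb->seed = seed;
    if (pthread_create(&t, NULL, randomizer_thread, rb) != 0)
        return -1;
    return pthread_join(t, NULL) == 0 ? 0 : -1;
}
```

In this simplified version the caller waits for the randomizer before the workers read their buffers; a version that refills buffers while the workers run would need additional synchronization.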

Threads’ Loads Let’s take a look at the Intel® VTune™ Performance Analyzer plot

Threading – holes’ covering
- Consider a DP or MP system where each core is hyperthreaded.
- Problem: the OS can put both of the application’s threads on the same physical core.
- Solution: take care of the processor affinities.
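On Linux, pinning a thread to a logical CPU could look like the sketch below. The platform is an assumption (the slides do not name the OS), the helper names pin_calling_thread and first_allowed_cpu are hypothetical, and which logical CPU numbers belong to different physical cores is system-specific and not determined here.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* needed for cpu_set_t and sched_setaffinity */
#endif
#include <sched.h>

/* Pin the calling thread to one logical CPU.
   Returns 0 on success, -1 on error (errno is set by the kernel call). */
static int pin_calling_thread(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);  /* 0 = calling thread */
}

/* First logical CPU the calling thread is currently allowed to run on,
   or -1 if the affinity mask cannot be read. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

To address the problem on the slide, the main and helper threads would each call pin_calling_thread() with logical CPUs known to sit on different physical cores.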

General Code Optimization
Get rid of the heavy macro image_get_mirror. Regions of the picture shown in the figure:
- No calculation needed
- Only the Y parameter should be recalculated (two regions)
- Only the X parameter should be recalculated (two regions)
- Original calculation needed
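The idea of skipping the mirror calculation where it cannot matter can be sketched as follows. The body of image_get_mirror is not shown in the slides, so mirror_index (with its edge-repeating reflection) and both window_sum functions are hypothetical stand-ins; the sketch assumes the window radius is smaller than the image width and height.

```c
#include <stddef.h>

/* Reflect an out-of-range index back into [0, n).
   Assumed convention: -1 maps to 0 and n maps to n - 1. */
static int mirror_index(int i, int n)
{
    if (i < 0)  return -i - 1;
    if (i >= n) return 2 * n - i - 1;
    return i;
}

/* Reference version: mirrors both coordinates for every sample. */
static float window_sum_reference(const float *img, int w, int h,
                                  int x, int y, int r)
{
    float sum = 0.0f;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx)
            sum += img[(size_t)mirror_index(y + dy, h) * w
                       + mirror_index(x + dx, w)];
    return sum;
}

/* Optimized version: decide once per pixel which coordinate can leave
   the image, and mirror only that one. */
static float window_sum(const float *img, int w, int h, int x, int y, int r)
{
    int x_inside = (x - r >= 0) && (x + r < w);   /* X needs no mirroring */
    int y_inside = (y - r >= 0) && (y + r < h);   /* Y needs no mirroring */
    float sum = 0.0f;
    for (int dy = -r; dy <= r; ++dy) {
        int yy = y_inside ? y + dy : mirror_index(y + dy, h);
        const float *row = img + (size_t)yy * w;
        for (int dx = -r; dx <= r; ++dx) {
            int xx = x_inside ? x + dx : mirror_index(x + dx, w);
            sum += row[xx];
        }
    }
    return sum;
}
```

For interior pixels both flags are true and no mirroring is computed at all; only pixels near a corner pay for both coordinates.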

SIMD Approach
The heaviest line in the program is a classic SIMD candidate:
loop { sum += weights[p, r] * image[i+p, j+r] }
Why didn’t it work, then?
- The innermost loop is short.
- Most of the time weights[curr_ptr] and image[curr_ptr] are unaligned.
- There is overhead in adding the “SIMD sum” to the “non-SIMD sum”.
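The three obstacles above are visible in even a small SSE version of the multiply-accumulate loop. The sketch below assumes an x86 target with SSE and is not the authors' code; dot_scalar and dot_sse are hypothetical names for one row of the weights-times-image sum.

```c
#include <xmmintrin.h>

/* Plain scalar form of the inner loop. */
static float dot_scalar(const float *w, const float *img, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; ++i)
        sum += w[i] * img[i];
    return sum;
}

/* SSE form: 4 products per step, then a horizontal add, then a scalar tail. */
static float dot_sse(const float *w, const float *img, int n)
{
    __m128 acc = _mm_setzero_ps();
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Unaligned loads: the window start is rarely 16-byte aligned. */
        __m128 a = _mm_loadu_ps(w + i);
        __m128 b = _mm_loadu_ps(img + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(a, b));
    }
    /* Horizontal add of the 4 lanes: the "SIMD sum" overhead. */
    __m128 hi   = _mm_movehl_ps(acc, acc);      /* lanes 2,3 moved to 0,1 */
    __m128 sum2 = _mm_add_ps(acc, hi);          /* lane0 = a0+a2, lane1 = a1+a3 */
    __m128 l1   = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    float sum = _mm_cvtss_f32(_mm_add_ss(sum2, l1));
    /* Scalar tail: with a short inner loop this can be most of the work. */
    for (; i < n; ++i)
        sum += w[i] * img[i];
    return sum;
}
```

When n is small, for example a window row of 3 to 7 samples, the vector loop runs at most once, so the horizontal add and the tail take a large share of the time and the SIMD version gains little.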

Results HT machine (P4 3.0GHz) Threading Only

Results HT machine (P4 3.0GHz) Code Optimization Only

Results HT machine (P4 3.0GHz) Full Optimization

The results on Dual Core machine (Pentium D)

Compilation by Intel Compiler

Compilation by Intel Compiler (cont)
- The Intel compiler gives up to a 26.1% performance boost.
- The 64-bit compilation gave results similar to those of the 32-bit compilation by the Intel Compiler.
