1
Software performance enhancement using multithreading and architectural considerations
Prepared by: Andrey Sloutsman, Konstantin Muradov
01/2006
2
Chosen application: Refocus-it
Iterative refocus plug-in for GIMP
Home Page: http://refocus-it.sourceforge.net/
Download Page: http://sourceforge.net/projects/refocus-it
GIMP Page: http://www.gimp.org/
3
Refocus-it
The iterative refocus GIMP plug-in can be used to refocus images acquired by a defocused camera, or blurred by Gaussian blur, motion blur, or any combination of these.
Adaptive or static area smoothing can be used to remove the so-called "ringing" effect.
Example: (image)
4
Algorithm Description
The algorithm runs i iterations (i is supplied on the command line).
In every iteration, a hopfield_iteration…() function is invoked on each color map and does the refocusing.
Inside hopfield_iteration…(), the weight of each pixel of the picture is recalculated from its previous value and the values in its scanning area.
5
hopfield_iteration…() How the function works:
6
hopfield_iteration…()
There are 4 different functions, chosen by parameters on the command line:
hopfield_iteration_mirror_lambda()
hopfield_iteration_mirror()
hopfield_iteration_period_lambda()
hopfield_iteration_period()
7
Threading approaches
Split by color
8
Threading approaches
Using OpenMP, or
Divide the picture
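To make the OpenMP option concrete, here is a minimal sketch of a row-parallel loop. The filter itself (`smooth_rows`, a 3-pixel horizontal mean written out of place) is a hypothetical stand-in chosen for illustration, not the plug-in's refocus step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical out-of-place filter: each output pixel is the mean of its
 * horizontal 3-pixel neighbourhood (clamped at the row ends). */
void smooth_rows(const float *src, float *dst, int width, int height)
{
    assert(width > 0 && height > 0);
    /* Rows are independent: src is only read and every iteration writes
     * a different dst row. Without -fopenmp the pragma is ignored and the
     * loop runs serially with the same result. */
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < height; y++) {
        const float *in = src + (size_t)y * width;
        float *out = dst + (size_t)y * width;
        for (int x = 0; x < width; x++) {
            int l = x > 0 ? x - 1 : x;
            int r = x < width - 1 ? x + 1 : x;
            out[x] = (in[l] + in[x] + in[r]) / 3.0f;
        }
    }
}
```

This only works so directly because the sketch writes to a separate output buffer; an update that reads pixels other threads are writing needs the synchronization discussed on the later slides.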
9
Threading approaches
Divide pixels
(Figure labels: Thread 0, Thread 1)
10
Threading approaches
Divide columns (final)
(Figure labels: Main Thread, Helper Thread)
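A column split of this kind can be sketched as a small helper that hands each thread a contiguous range of columns. The names `col_range` and `columns_for_thread` are hypothetical, and the even split with the remainder going to the lowest-numbered threads is an assumption; the slides do not say how the columns were balanced.

```c
#include <assert.h>

typedef struct { int begin; int end; } col_range;   /* half-open: [begin, end) */

/* Columns owned by thread `tid` out of `nthreads` for an image that is
 * `width` columns wide. Leftover columns go to the lowest-numbered threads. */
col_range columns_for_thread(int width, int tid, int nthreads)
{
    assert(nthreads > 0 && tid >= 0 && tid < nthreads && width >= 0);
    int base = width / nthreads;
    int extra = width % nthreads;
    col_range r;
    r.begin = tid * base + (tid < extra ? tid : extra);
    r.end = r.begin + base + (tid < extra ? 1 : 0);
    return r;
}
```

With two threads, `columns_for_thread(width, 0, 2)` would be the main thread's share and `columns_for_thread(width, 1, 2)` the helper's.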
11
Synchronization (time)
Barriers (to provide algorithm consistency)
The threading solution must take the data dependency into account
(Figure labels: Main Thread, Helper Thread, Main Thread Barriers, Helper Thread Barriers, BARRIER !!!)
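The role of the barrier can be illustrated with a small sketch in which two threads each update their own share of the columns and wait for each other at the end of every pass. It uses an OpenMP barrier as a stand-in for the explicit main/helper thread barriers on the slide, and a hypothetical double-buffered 3-pixel mean instead of the real Hopfield update; both are assumptions made to keep the example short.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Runs `iters` smoothing passes over one row of `n` pixels. `a` holds the
 * input, `b` is scratch of the same size. Returns the buffer that holds
 * the final result. */
float *iterate_with_barrier(float *a, float *b, int n, int iters)
{
    assert(n > 0 && iters >= 0);
    #pragma omp parallel num_threads(2)
    {
        int tid = 0, nth = 1;            /* serial fallback without OpenMP */
#ifdef _OPENMP
        tid = omp_get_thread_num();
        nth = omp_get_num_threads();
#endif
        int begin = (int)((long)n * tid / nth);
        int end = (int)((long)n * (tid + 1) / nth);
        for (int it = 0; it < iters; it++) {
            const float *src = (it % 2 == 0) ? a : b;
            float *dst = (it % 2 == 0) ? b : a;
            for (int x = begin; x < end; x++) {
                int l = x > 0 ? x - 1 : x;
                int r = x < n - 1 ? x + 1 : x;
                dst[x] = (src[l] + src[x] + src[r]) / 3.0f;
            }
            /* No thread may start pass it+1 until every column of pass
             * it has been written, because neighbours cross the split. */
            #pragma omp barrier
        }
    }
    return (iters % 2 == 0) ? a : b;
}
```

Without the barrier, a thread that runs ahead would read neighbour columns that the other thread has not yet produced for the current pass.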
12
Synchronization (space)
Mutexed areas (to prevent write/read conflicts)
Intersection of the scanning areas causes W/R conflicts
(Figure labels: Main Thread, Helper Thread, BARRIER !!!, Main Mutex, Period Mutexes)
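As a rough illustration of where such conflicts arise, the sketch below identifies the columns whose scanning area crosses the split between two threads. The function name `needs_lock`, the square window of half-width `radius`, and the single split column are all assumptions; the slides do not describe how the mutexed areas were actually laid out.

```c
#include <assert.h>
#include <stdbool.h>

/* Thread A owns columns [0, split), thread B owns [split, width).
 * Returns true when the scanning window of column `x`
 * (x - radius .. x + radius) reaches into the other thread's columns,
 * so an in-place update there could race with the neighbour. */
bool needs_lock(int x, int split, int radius)
{
    assert(radius >= 0);
    if (x < split)
        return x + radius >= split;   /* window reaches into B's columns */
    return x - radius < split;        /* window reaches into A's columns */
}
```

A worker would take the shared mutex only for columns where `needs_lock` is true and run lock-free everywhere else, which keeps the serialized region to a narrow band around the split.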
13
Randomizer Thread
Using rand() in threaded code makes the output of the optimized code differ from that of the original code, because the threads generate the same random series.
Solution:
(Figure labels: Randomizer Thread, Main Thread, Helper Thread, Random Buffers)
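The random-buffer idea can be sketched as a single producer that draws from one rand() stream and deals the values into per-worker buffers. The name `fill_random_buffers` and the round-robin dealing order are assumptions for illustration; the slides do not say how values were assigned to the main and helper threads, and the sketch leaves out the threading itself.

```c
#include <assert.h>
#include <stdlib.h>

#define NWORKERS 2   /* main thread and helper thread */

/* Draws `per_worker` values for each worker from ONE rand() stream,
 * dealing them round-robin. Workers then read only their own buffer
 * and never call rand() themselves. */
void fill_random_buffers(unsigned seed, int *buf[NWORKERS], int per_worker)
{
    assert(per_worker >= 0);
    srand(seed);
    for (int i = 0; i < per_worker; i++)
        for (int w = 0; w < NWORKERS; w++)
            buf[w][i] = rand();
}
```

Because only one place calls rand(), the workers can no longer produce identical series, and the values consumed are reproducible for a given seed.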
14
Threads’ Loads
Let’s take a look at the Intel® VTune™ Performance Analyzer plot
15
Threading – holes’ covering
Consider a DP or MP system where each core is hyperthreaded.
Problem: the OS can put both of the application’s threads on the same physical core.
Solution: take care of the processor affinities.
16
General Code Optimization
Get rid of the heavy macro image_get_mirror
(Figure labels: No calculation needed; Only Y parameter should be recalculated; Only X parameter should be recalculated; Original calculation needed)
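One way to read the figure labels is that mirroring is applied only to the coordinate that can actually leave the image. The sketch below shows that idea under stated assumptions: `mirror`, `get_mirror`, and `window_sum` are hypothetical names, the reflection convention is one common choice, and none of this is the plug-in's actual image_get_mirror macro.

```c
#include <assert.h>

/* Reflects an out-of-range coordinate back into [0, n) without repeating
 * the edge pixel. Valid for -n < i < 2n - 1. */
static int mirror(int i, int n)
{
    if (i < 0) return -i;
    if (i >= n) return 2 * n - 2 - i;
    return i;
}

/* General path: mirror both coordinates on every access. */
float get_mirror(const float *img, int w, int h, int x, int y)
{
    return img[mirror(y, h) * w + mirror(x, w)];
}

/* Sum of the (2r+1) x (2r+1) window around (cx, cy). Mirroring is skipped
 * for any coordinate whose whole window range stays inside the image. */
float window_sum(const float *img, int w, int h, int cx, int cy, int r)
{
    assert(r >= 0 && r < w && r < h);
    int x_safe = cx - r >= 0 && cx + r < w;
    int y_safe = cy - r >= 0 && cy + r < h;
    float sum = 0.0f;
    for (int dy = -r; dy <= r; dy++) {
        int y = y_safe ? cy + dy : mirror(cy + dy, h);
        const float *row = img + y * w;
        for (int dx = -r; dx <= r; dx++) {
            int x = x_safe ? cx + dx : mirror(cx + dx, w);
            sum += row[x];
        }
    }
    return sum;
}
```

For interior pixels both flags are true and no mirroring is computed at all; only pixels near a corner pay for mirroring both coordinates.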
17
SIMD Approach
The heaviest line in the program is a classic case for SIMD:
loop { sum += weights[p, r] * image[i+p, j+r] }
Why didn't it work, then?
The innermost loop is short.
Most of the time weights[curr_ptr] and image[curr_ptr] are unaligned.
There is overhead in adding the “SIMD sum” to the “non-SIMD sum”.
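The three obstacles show up even in a minimal SSE sketch of one window row. The function names are hypothetical, the code is not taken from the plug-in, and on targets without SSE it falls back to the scalar loop.

```c
#include <assert.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Scalar reference: dot product of one window row with its weights. */
float row_dot_scalar(const float *w, const float *img, int n)
{
    float sum = 0.0f;
    for (int k = 0; k < n; k++)
        sum += w[k] * img[k];
    return sum;
}

float row_dot_simd(const float *w, const float *img, int n)
{
    assert(n >= 0);
    float sum = 0.0f;
    int k = 0;
#if defined(__SSE__)
    __m128 acc = _mm_setzero_ps();
    /* A short row gives only one or two trips through this loop. */
    for (; k + 4 <= n; k += 4) {
        /* Unaligned loads: window rows start at arbitrary addresses. */
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(w + k),
                                         _mm_loadu_ps(img + k)));
    }
    /* Folding the four lanes back into one scalar is pure overhead. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; k < n; k++)                   /* scalar tail ("non-SIMD sum") */
        sum += w[k] * img[k];
    return sum;
}
```

For a row of, say, 7 elements the vector loop runs once, the tail handles 3 elements, and the lane fold-down is paid on every row, so the fixed costs can easily outweigh the four-wide multiply.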
18
Results HT machine (P4 3.0GHz) Threading Only
19
Results HT machine (P4 3.0GHz) Code Optimization Only
20
Results HT machine (P4 3.0GHz) Full Optimization
21
The results on Dual Core machine (Pentium D)
22
Compilation by Intel Compiler
23
Compilation by Intel Compiler (cont)
The Intel compiler gives up to a 26.1% performance boost.
The 64-bit compilation gave results similar to the 32-bit compilation by the Intel Compiler.