Published by Jovany Milk. Modified over 10 years ago.
1
Performance Tuning Panotools - PTMender
2
Layout: Project Goal; About Panotools; Multi-threading; SIMD, micro-architectural pitfalls; Results
3
Project Goal
Gaining performance on PanoTools. This goal will be achieved through:
1. Multi-threading the application, to exploit new multi-core machines, which hold the most significant performance promise.
2. Using SSE code.
3. Finding micro-architectural pitfalls and solving them, using VTune's tuning assistant.
4
About Panotools
Panotools is the cross-platform library behind Panorama Tools and many other GUI photo stitchers. It is gaining popularity as the back-end engine for many panoramic applications, and it was selected to participate in the Google Summer of Code 2007. We focused on the PTMender module of the library. More details on Panotools at: http://panotools.sourceforge.net/
5
Multi-threading
There are two major approaches to multi-threading an existing single-threaded application:
1. Data decomposition – Dividing the data into smaller parts and performing parallel work on each part. This is not always possible, due to algorithmic dependencies between the divided parts.
2. Functional decomposition – Dividing the work according to functional tasks. Each thread performs a unique predefined task. This is harder to do and requires a deep understanding of the original algorithm.
6
Multi-threading – contd.
Naturally, we started by looking for a data decomposition. In theory, because PTMender works on several files, we could have processed a number of files simultaneously. Alternatively, we could have divided a single file and processed its parts simultaneously. In practice, using the Call Graph function in VTune, we noticed that each file is already divided into independent parts on which the algorithm runs. We therefore chose the latter approach, splitting a single file, because it provides better scalability.
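The idea of splitting one file into independent parts can be sketched as below. This is a minimal illustration, not PTMender's actual code: the row-based split and the names `row_range_t` and `rows_for_part` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range of image rows [begin, end) assigned to one worker.
 * Hypothetical type, not part of Panotools. */
typedef struct row_range {
    size_t begin;
    size_t end;
} row_range_t;

/*
 * Split `total_rows` rows into `num_parts` contiguous ranges and return
 * the range for index `part`. The first (total_rows % num_parts) ranges
 * receive one extra row, so range sizes differ by at most one.
 */
row_range_t rows_for_part(size_t total_rows, size_t num_parts, size_t part)
{
    assert(num_parts > 0 && part < num_parts);

    size_t base  = total_rows / num_parts;  /* rows every part gets      */
    size_t extra = total_rows % num_parts;  /* leftover rows to hand out */

    row_range_t r;
    r.begin = part * base + (part < extra ? part : extra);
    r.end   = r.begin + base + (part < extra ? 1 : 0);
    return r;
}
```

Each worker thread would call `rows_for_part` with its own index and process only its range. Because the ranges never overlap and together cover every row, no synchronization is needed while the parts are truly independent.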
7
VTune - Call graph
8
The serial (original) model (figure: serial task)
9
The parallel model (figure: thread0, thread1)
10
Multi-threading – contd.
Data sharing – We created arrays of thread-specific data structures:
typedef struct thread_vars{
    Image result;
    TrformStr transform;
    int pad[16];
}thread_vars_t;
thread_vars_t thread_private[NUM_THREADS];
And not:
Image result[NUM_THREADS];
TrformStr transform[NUM_THREADS];
Padding is used to create full cache line separation between array entries and prevent false sharing.
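A variant of the same idea, sketched here with C11 `_Alignas` instead of the manual `pad` array. This is an illustration under assumptions: the 64-byte cache line, the thread count, the stub types standing in for `Image` and `TrformStr`, and the helper `vars_for_thread` are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE  64   /* assumed cache-line size in bytes */
#define NUM_THREADS 4    /* assumed thread count for this sketch */

/* Hypothetical stand-ins for the library's Image and TrformStr types. */
typedef struct { int width, height; } image_stub_t;
typedef struct { int mode; } transform_stub_t;

/* Aligning the first member to a cache line aligns the whole struct,
 * and the compiler rounds its size up to a multiple of CACHE_LINE. */
typedef struct thread_vars {
    _Alignas(CACHE_LINE) image_stub_t result;
    transform_stub_t transform;
} thread_vars_t;

/* Compile-time check that adjacent array entries cannot share a line. */
_Static_assert(sizeof(thread_vars_t) % CACHE_LINE == 0,
               "thread_vars_t must fill whole cache lines");

static thread_vars_t thread_private[NUM_THREADS];

/* Return the private slot for thread `tid`. */
thread_vars_t *vars_for_thread(size_t tid)
{
    assert(tid < NUM_THREADS);
    return &thread_private[tid];
}
```

Compared with a fixed `pad` array, alignment keeps each entry on its own cache line even if the member types later change size, at the cost of requiring a C11 compiler.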
11
Thread Checker
12
Thread Checker - Debug
13
Noise: the effects of data races were later obvious when inspecting the output.
14
Thread Checker – Debug – contd.
Adding synchronization around critical sections:
#ifdef PROTECT_WRITE
    // Request ownership of mutex.
    dwWaitResult = WaitForSingleObject(
        hTiffWriteMutex,   // handle to mutex
        5000L);            // five-second time-out interval
    if (dwWaitResult == WAIT_OBJECT_0){
        __try {
            // Write to the database.
#endif
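The same pattern, a mutex guarding a shared writer, can be sketched as a complete program fragment. This sketch swaps the Win32 calls on the slide for POSIX threads so that it is self-contained; `shared_output_t`, `writer_thread`, and `run_writers` are hypothetical names, and the counter merely stands in for the TIFF output.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_WRITERS     4
#define ROWS_PER_WRITER 1000

/* Hypothetical shared output; stands in for the file being written. */
typedef struct shared_output {
    pthread_mutex_t lock;
    long rows_written;
} shared_output_t;

/* Each writer appends its rows, taking the mutex for every update. */
static void *writer_thread(void *arg)
{
    shared_output_t *out = arg;
    assert(out != NULL);

    for (int i = 0; i < ROWS_PER_WRITER; i++) {
        pthread_mutex_lock(&out->lock);
        out->rows_written++;            /* critical section */
        pthread_mutex_unlock(&out->lock);
    }
    return NULL;
}

/* Run all writers; return the total row count, or -1 on failure. */
long run_writers(void)
{
    shared_output_t out;
    pthread_t threads[NUM_WRITERS];
    int started = 0;

    if (pthread_mutex_init(&out.lock, NULL) != 0)
        return -1;
    out.rows_written = 0;

    for (int i = 0; i < NUM_WRITERS; i++) {
        if (pthread_create(&threads[i], NULL, writer_thread, &out) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    long result = (started == NUM_WRITERS) ? out.rows_written : -1;
    pthread_mutex_destroy(&out.lock);
    return result;
}
```

With the lock in place, `run_writers()` always yields `NUM_WRITERS * ROWS_PER_WRITER`. Removing the lock and unlock calls turns the increment into a data race of the kind a thread-checking tool is meant to flag.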
15
Thread Profiler
16
Thread Profiler – contd.
17
Image comparison
18
SIMD & uArchitecture
Unfortunately, we did not find good opportunities for vectorizing.
The main micro-architectural issue is mispredicted indirect calls. This cannot be solved, since the Panotools mechanism relies heavily on function pointers for flexibility.
FP activity is significant. We changed the floating-point model in compilation from precise to fast, which reduced the instruction count in the benchmark to under 90% of that of the original code generation.
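Why a fast floating-point model can change results as well as instruction counts is sketched below. A precise model must evaluate floating-point sums in source order, while a fast model typically allows the compiler to reorder them. The two functions are hypothetical and simply perform by hand the kind of reordering such a model permits.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the array from first element to last, in source order. */
double sum_forward(const double *v, size_t n)
{
    assert(v != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

/* Sum the same values from last element to first. Mathematically this
 * is the same sum, but rounding can make the result differ. */
double sum_backward(const double *v, size_t n)
{
    assert(v != NULL);
    double s = 0.0;
    for (size_t i = n; i > 0; i--)
        s += v[i - 1];
    return s;
}
```

For the inputs `1e20, -1e20, 1.0`, `sum_forward` returns `1.0` and `sum_backward` returns `0.0` when compiled with strict IEEE semantics, because adding `1.0` to `-1e20` is lost to rounding. This is why output comparison is a sensible check after switching floating-point models.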
19
Results
20
Thank you