Threaded Programming Methodology
Intel Software College
Objectives
After completion of this module you will be able to rapidly prototype and estimate the effort required to thread time-consuming regions.

Purpose of the Slide: States the objectives of this module.

Details: This course module walks through the process of migrating a serial program to a parallel (threaded) one, using the OpenMP model and the Intel tools: VTune (to identify the code section most profitably threaded), Thread Checker (to identify any coding issues specific to threading), and Thread Profiler (to identify performance issues specific to threading). It can stand alone, be used as the sole threading session in a more general course, or be part of an overall threading course. There are 9 lab activities in the sequence. Note: most of the slides use complex builds; be sure to become familiar with them.
Agenda
- A Generic Development Cycle
- Case Study: Prime Number Generation
- Common Performance Issues

Purpose of the Slide: Outlines the topics addressed to achieve the module objective.

Details: The "primes" code, which finds all prime numbers up to a specified upper bound, is the example used throughout. Be aware that the prime-finding algorithm employed in this case study is deliberately unsophisticated (it scales as O(N^(3/2))) so that it can be quickly understood by the students; better approaches exist, but are not pertinent to the matters addressed here.
What is Parallelism?
- Two or more processes or threads execute at the same time
- Parallelism for threading architectures:
  - Multiple processes: communication through Inter-Process Communication (IPC)
  - Single process, multiple threads: communication through shared memory

Purpose of the Slide: To frame the discussion; the parallel model used in this session is the one highlighted: single process, multiple threads, shared memory.
Amdahl’s Law
Describes the upper bound of parallel execution speedup.

n = number of processors
Tparallel = [(1-P) + P/n] * Tserial
Speedup = Tserial / Tparallel

With P = 0.5: for n = 2, speedup = 1.0/0.75 = 1.33; for n = ∞, speedup = 1.0/0.5 = 2.0.

Purpose of the Slide: Explains and illustrates Gene Amdahl’s observation about the maximum theoretical speedup, from parallelization, for a given algorithm.

Details: The build starts with a serial code taking time Tserial, composed of a parallelizable portion P and the rest, 1-P, in this example in equal proportions. If P is perfectly divided into two parallel components (on two cores, for example), then the overall time Tparallel is reduced to 75% of Tserial. In the limit of very large, perfect parallelization of P, the overall time approaches 50% of the original Tserial.

Questions to Ask Students:
- Does overhead play a role? (Yes; this primes the next slide, where threads are more efficient than processes.)
- Are unwarranted assumptions built in about scaling? That is, do the serial and parallel portions increase at the same rate with increasing problem size? This can lead to a brief aside about the complementary point of view, Gustafson’s law (which assumes the parallel portion grows more quickly than the serial).

Serial code limits speedup.
Processes and Threads
- Modern operating systems load programs as processes
  - Resource holder
  - Execution
- A process starts executing at its entry point as a thread
- Threads can create other threads within the process
- Each thread gets its own stack
- All threads within a process share code and data segments

(Figure: main() and other threads each have their own stack; all share the code segment and data segment.)

Purpose of the Slide: Defines the relationship of processes and threads.

Details: A key point: threads share code and data (this will come up later in race conditions).
Threads – Benefits & Risks
Benefits
- Increased performance and better resource utilization
  - Even on single-processor systems, for hiding latency and increasing throughput
- IPC through shared memory is more efficient

Risks
- Increased complexity of the application
- Difficult to debug (data races, deadlocks, etc.)

Purpose of the Slide: Lists some of the benefits and risks (costs) of threaded applications.

Details: Benefit 1 refers primarily to task parallelism (discussed in detail in the ISC module "Multi-core Programming: Basic Concepts"). Benefit 2 is relative to processes: since threads share data, there is minimal overhead for "inter-process communication". Debugging is difficult since the bugs are non-deterministic; that is, they may not occur during each test, and a QA process designed for serial code will very likely miss bugs in threaded code.
Commonly Encountered Questions with Threading Applications
- Where to thread?
- How long would it take to thread?
- How much re-design/effort is required?
- Is it worth threading a selected region?
- What should the expected speedup be?
- Will the performance meet expectations?
- Will it scale as more threads/data are added?
- Which threading model to use?

Purpose of the Slide: States key considerations for developers beginning to thread an application.

Details:
- Where to thread? => where in the application; the hotspots
- How long would it take to thread? => in developer time (cost estimate)
- How much re-design/effort is required? => a factor in developer time (refining the cost estimate)
- Is it worth threading a selected region? => estimating the benefit
- What should the expected speedup be? => quantitative; want to approach the Amdahl's Law limit
- Will the performance meet expectations? => if that limit is achieved, is the effort worthwhile?
- Will it scale as more threads/data are added? => very important: future platforms are expected to have additional cores
- Which threading model to use? => for compiled apps, this is typically a choice between native models or OpenMP
Prime Number Generation
bool TestForPrime(int val)
{   // let's start checking from 3
    int limit, factor = 3;
    limit = (long)(sqrtf((float)val)+0.5f);
    while( (factor <= limit) && (val % factor) )
        factor++;
    return (factor > limit);
}

void FindPrimes(int start, int end)
{
    int range = end - start + 1;
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

Purpose of the Slide: Explains the prime number algorithm and code to be used for the 9 lab activities.

Details: Each build step illustrates a step in the while loop. When a candidate "i" is successfully divided by a factor (as with 9/3 or 15/3), the slide colours "i" and gPrimesFound is not incremented. At the end of the loop, the slide shows a total of 7 uncoloured numbers. The final popup shows the console output, where the program reports finding 8 total primes between 1 and 20.
Activity 1: Run the serial version of the primes code
- Locate the PrimeSingle directory
- Compile with the Intel compiler in Visual Studio
- Run a few times with different ranges

Purpose of the Slide: Refers students to the 1st lab activity, whose purpose is to build the initial, serial version of the application.

Details: Detailed instructions are provided in the student lab manual.

Background: This exercise assumes that the student is familiar with building applications within Visual Studio and can invoke the Intel compiler (make sure this is at least approximately true). Though no coding is required for this stage, it is a good break from the lecture and prepares the way for further work.
Development Methodology
- Analysis: find computationally intense code
- Design (introduce threads): determine how to implement the threading solution
- Debug for correctness: detect any problems resulting from using threads
- Tune for performance: achieve the best parallel performance

Purpose of the Slide: Defines the methodology to use when migrating a serial application to a threaded one.

Details: Don't rush this slide; each of these four steps will have one or more associated lab activities using the primes code.
Development Cycle
- Analysis: VTune™ Performance Analyzer
- Design (introduce threads): Intel® Performance Libraries (IPP and MKL), OpenMP* (Intel® Compiler), explicit threading (Win32*, Pthreads*)
- Debug for correctness: Intel® Thread Checker, Intel Debugger
- Tune for performance: Intel® Thread Profiler

Purpose of the Slide: Assigns details to, and visually reinforces, the points made on the previous slide; specific tools and threading models are inserted into the general outline made there. Also points out the iterative nature of both debugging and the overall development cycle.

Details: Each of these steps will be addressed in detail during this session.
Analysis - Sampling
Use VTune sampling to find hotspots in the application. Let's use the project PrimeSingle for analysis.

Usage: PrimeSingle <start> <end>

Purpose of the Slide: Introduce and explain further the role of VTune sampling.

Details: The slide build first states the workload (find all primes between 1 and the specified upper bound), then shows an extract from the VTune user interface highlighting the function TestForPrime, then shows the corresponding source code fragment (the TestForPrime/FindPrimes code from the previous slides).

Identifies the time-consuming regions.
Analysis - Call Graph
This is the level in the call tree where we need to thread.

Purpose of the Slide: Introduce and explain the role of VTune Call Graph.

Details: The slide build: the initial view is an excerpt from the call graph user interface (bold red arrows show the busiest branch); the assertion is then made that FindPrimes is the right level to thread.

Background: Coarse-grained parallelism is generally more effective (thread one level higher than the hot spot).

Questions to Ask Students: Why is this the right level; why not in TestForPrime? (Be sure you know the answer yourself: look at the code, and imagine the thread call in TestForPrime.)

Used to find the proper level in the call tree to thread.
Analysis
- Where to thread? FindPrimes()
- Is it worth threading a selected region?
  - Appears to have minimal dependencies
  - Appears to be data-parallel
  - Consumes over 95% of the run time
- Baseline measurement

Purpose of the Slide: Further analysis of the assertion made in the previous slide. Also introduces the baseline measurement.

Details: The bullet points illustrate key considerations for the threading decision. Note that the final build on the slide, showing a baseline timing, is sudden (really a non sequitur; the sequencing could be better), so don't get surprised. The baseline timing is part of the overall analysis, necessary to measure the impact of any threading efforts; now is as good a time as any to introduce it.
Activity 2
- Run the code with the ' ' range to get a baseline measurement
- Make a note for future reference
- Run a VTune analysis on the serial code
- What function takes the most time?

Purpose of the Slide: Refers students to the 2nd lab activity, whose purpose (as stated on the slide) is to generate a baseline serial-code measurement and run the VTune sampling analysis.

Details: Detailed instructions are provided in the student lab manual.
Foster’s Design Methodology
From Designing and Building Parallel Programs by Ian Foster.

Four Steps:
- Partitioning: dividing computation and data
- Communication: sharing data between computations
- Agglomeration: grouping tasks to improve performance
- Mapping: assigning tasks to processors/threads

Purpose of the Slide: Introduce Foster's design methodology for parallel programming.

Details: This somewhat long ellipsis in the presentation, 8 slides of parallel design points and examples, is intended to prepare the design discussion for our own primes example. Ian Foster's 1995 book is well known to practitioners of this dark art, and is available online for free.
Designing Threaded Programs
The Problem
- Partition: divide the problem into tasks
- Communicate: determine the amount and pattern of communication
- Agglomerate: combine tasks
- Map: assign agglomerated tasks to created threads

(Figure: the problem is divided into initial tasks, communication among them is determined, tasks are combined, and the combined tasks are mapped onto threads to form the final program.)

Purpose of the Slide: To graphically illustrate the 4 steps in Foster's design methodology.
Parallel Programming Models
Functional Decomposition
- Task parallelism
- Divide the computation, then associate the data
- Independent tasks of the same problem

Data Decomposition
- Same operation performed on different data
- Divide the data into pieces, then associate the computation

Purpose of the Slide: Introduce the primary conceptual partitions in parallel programming: task and data.

Details: Task parallelism has traditionally been used in threaded desktop apps (partitioned among screen update, disk read, print, etc.), and data parallelism in HPC apps; both may be appropriate in different sections of an app.
Decomposition Methods
Functional Decomposition
- Focusing on computations can reveal structure in a problem
- Example domains: Atmosphere Model, Ocean Model, Land Surface Model, Hydrology Model

Domain Decomposition
- Focus on the largest or most frequently accessed data structure
- Data parallelism: the same operation applied to all data

(Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC.)

Purpose of the Slide: Illustrates, by example, the task and data decompositions in one application (weather modeling).

Details: Each domain (atmosphere, hydrology, etc.) can be treated independently, leading to a task-parallel design; within the domains, data parallelism may be applied as appropriate.
Pipelined Decomposition
Computation done in independent stages.

Functional decomposition
- Threads are assigned a stage to compute
- Automobile assembly line analogy

Data decomposition
- A thread processes all stages of a single instance
- One worker builds an entire car

Purpose of the Slide: Introduce pipelined decomposition, which can apply to either task (called "functional" on this slide) or data decomposition.

Details: This is the first of 3 slides on the topic; the next two illustrate the concept with an example.
LAME Encoder Example
- LAME MP3 encoder
  - Open-source project
  - Educational tool used for learning
- The goals of the project are to improve the psychoacoustic quality and the speed of MP3 encoding

Purpose of the Slide: Introduce a particular application, the LAME audio encoder, to set up the next slide showing LAME in a pipelined decomposition.

Details: This slide serves as a quick backgrounder on the LAME code (not all students will have heard of it). The "LAME MT" project (full description, with source code) is available online.
LAME Pipeline Strategy
Pipeline stages per frame:
- Prelude: fetch next frame, frame characterization, set encode parameters
- Acoustics: psycho analysis, FFT long/short, filter assemblage
- Encoding: apply filtering, noise shaping, quantize & count bits
- Other: add frame header, check correctness, write to disk

(Figure: frames N, N+1, N+2, ... move through the pipeline over time; threads T1 through T4 each work on a different frame's stage at once, coordinated by a hierarchical barrier.)

Purpose of the Slide: Show how the LAME compute sequence maps to a pipelined threading approach.

Details: Each thread (T1, ... T4 on the slide) "specializes" in an operation, using results prepared by another thread.
Design
- What is the expected benefit?
- How do you achieve this with the least effort?
- How long would it take to thread?
- How much re-design/effort is required?

Speedup(2P) = 100/(96/2 + 4) = ~1.92X

Rapid prototyping with OpenMP.

Purpose of the Slide: Return us to the primes example, to approach the design stage, and introduce OpenMP as a "prototyping" thread model.

Details: Although OpenMP is introduced for prototyping, it may (of course) prove efficient enough to be the thread model of choice for this example.

Questions to Ask Students: Where does this 2P speedup claim come from?
OpenMP
- Fork-join parallelism: the master thread spawns a team of threads as needed
- Parallelism is added incrementally: the sequential program evolves into a parallel program

(Figure: the master thread forks a team of threads at each parallel region and joins them afterward.)

Purpose of the Slide: A conceptual introduction to OpenMP.

Details: The key point (on the slide): you can introduce threading one region at a time, which is not generally true of native threading models.

Background: OpenMP was launched as a standard. Industry collaborators included Intel, but not Microsoft (who were invited but not interested at the time); Microsoft now (2006) supports OpenMP in its compilers.
Design - OpenMP

#pragma omp parallel for
for( int i = start; i <= end; i += 2 ){
    if( TestForPrime(i) )
        globalPrimes[gPrimesFound++] = i;
    ShowProgress(i, range);
}

The pragma creates threads for this parallel region and divides the iterations of the for loop among them.

Purpose of the Slide: Show a specific OpenMP syntax implemented in the primes code.

Details: A key point: because threading is introduced by pragmas, the original source code is not touched (native thread methods require changes to the serial sources). Note that the parallel region, in this case, is the "for" loop. The final slide build shows results (number of primes and total time) for the image created with this pragma.
Activity 3
- Run the OpenMP version of the code
- Locate the PrimeOpenMP directory and solution
- Compile the code
- Run with ' ' for comparison
- What is the speedup?

Purpose of the Slide: Refers students to the 3rd lab activity, whose purpose is to build and run an OpenMP version of primes.

Details: Detailed instructions are provided in the student lab manual. No programming is required for this lab.
Design
- What is the expected benefit?
- How do you achieve this with the least effort?
- How long would it take to thread?
- How much re-design/effort is required?
- Is this the best speedup possible?

Speedup of 1.40X (less than 1.92X).

Purpose of the Slide: Discuss the results obtained in the previous lab activity.

Details: The speedup was lower than expected; now what?

Transition Quote: But inefficient speedup is not our first concern...
Debugging for Correctness
Is this threaded implementation right? No! The answers are different each time...

Purpose of the Slide: Introduce and stress the importance of correctness.

Details: In the example shown, each run produces a different number; the bug is non-deterministic. On some platforms, the answer may be correct 9 out of 10 times and slip through QA. Students can test their own implementation (previous lab) on multiple runs.
Debugging for Correctness
Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls, and deadlocks.

(Diagram: Primes.exe goes through binary instrumentation; the instrumented Primes.exe plus instrumented DLLs run under the VTune™ Performance Analyzer's runtime data collector, producing the result file threadchecker.thr.)

Purpose of the Slide: Introduce Thread Checker as a tool to address threading correctness, and outline its implementation.

Details: The code is typically instrumented at the binary level, though source instrumentation is also available. From the product FAQ: the Thread Checker library calls record information about threads, including memory accesses and APIs used, in order to produce threading diagnostics, including errors. Binary instrumentation is added at run time to an already-built binary module, including applications and dynamic or shared libraries. The instrumentation code is automatically inserted when you run an Intel® Thread Checker activity in the VTune™ environment or the Microsoft .NET* Development Environment. Both Microsoft Windows* and Linux* executables can be instrumented for IA-32 processors, but not for Itanium® processors. Binary instrumentation can be used for software compiled with any of the supported compilers. The final build shows the UI page, moving ahead to the next slide.

Background: Be ready to briefly explain the bugs mentioned: data races, stalls, deadlocks.
Thread Checker
Activity 4
- Use Thread Checker to analyze the threaded application
- Create a Thread Checker activity
- Run the application
- Are any errors reported?

Purpose of the Slide: Refers students to the 4th lab activity, whose purpose is to run the Thread Checker analysis illustrated on the previous slide.

Details: Detailed instructions are provided in the student lab manual. Students should see results (race conditions detected) similar to those on the previous slide.
Debugging for Correctness
- How much re-design/effort is required?
- How long would it take to thread?

Thread Checker reported only 2 dependencies, so the effort required should be low.

Purpose of the Slide: To address the question of how much effort (cost) will be required to successfully thread this application.

Details: As asserted on the slide, with only the two dependencies (gPrimesFound and gProgress), the debugging effort should be manageable.
Debugging for Correctness
#pragma omp parallel for
for( int i = start; i <= end; i += 2 ){
    if( TestForPrime(i) )
        #pragma omp critical
        globalPrimes[gPrimesFound++] = i;   // creates a critical section for this reference
    ShowProgress(i, range);
}

#pragma omp critical
{
    gProgress++;
    percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);  // critical section for both these references
}

Purpose of the Slide: To show one way to correct the race conditions on gPrimesFound and gProgress.

Details: A critical section can be entered by only one thread at a time, so this solution corrects the race conditions. Note the key point, "one thread at a time": the critical section is, by design, no longer parallel.
Activity 5
- Modify and run the OpenMP version of the code
  - Add critical region pragmas to the code
  - Compile the code
  - Run from within Thread Checker
  - If errors are still present, make the appropriate fixes to the code and run again in Thread Checker
- Run with ' ' for comparison
  - Compile and run outside Thread Checker
  - What is the speedup?

Purpose of the Slide: Refers students to the 5th lab activity, whose purpose (as stated on the slide) is to correct the race conditions discovered in the primes code. The resulting image is then checked for results and performance.

Details: Detailed instructions are provided in the student lab manual. Students will use the critical-sections technique described on the previous slide.
Correctness
Correct answer, but performance has slipped to ~1.33X. Is this the best we can expect from this algorithm? No! From Amdahl's Law, we expect speedup close to 1.9X.

Purpose of the Slide: Show that the critical-sections method fixed the bug, but the performance is lower than expected.

Details: The slide shows a correct answer, but remind the students that this does not guarantee there is no bug (race conditions, if present, may show up only rarely). To be more rigorous, one would re-run the Thread Checker.
Common Performance Issues
- Parallel overhead: due to thread creation, scheduling, ...
- Synchronization: excessive use of global data, contention for the same synchronization object
- Load imbalance: improper distribution of parallel work
- Granularity: not sufficient parallel work

Purpose of the Slide: To list some common performance issues (a follow-up to the previous slide, which showed poor threading performance).

Details: Each item listed is linked to a complete slide (in this set) showing additional detail; it is recommended to link to each, one by one. We will see examples of two of these in the remaining labs. This slide and the previous one set us up for the final section of this module, performance tuning, which begins on the next slide.
Tuning for Performance
Thread Profiler pinpoints performance bottlenecks in threaded applications.

(Diagram: Primes.c is compiled with /Qopenmp_profile for source instrumentation, or Primes.exe undergoes binary instrumentation; the instrumented image plus instrumented DLLs run under the VTune™ Performance Analyzer's runtime data collector, producing the result file Bistro.tp/guide.gvs.)

Purpose of the Slide: Introduce Thread Profiler as a tool to address threading performance, and outline its implementation.

Details: The slide build shows the build-and-link stage for primes, using the flag /Qopenmp_profile. This flag replaces /Qopenmp and is required. From the user guide: before you begin, you need to link and instrument your application with calls to the OpenMP* statistics-gathering Runtime Engine. The Runtime Engine's calls are required because they collect performance data and write it to a file. 1. Compile your application using an Intel® Compiler. 2. Link your application to the OpenMP* Runtime Engine using the -Qopenmp_profile option. The slide then shows "Binary Instrumentation", but we will not be using that feature during this module (binary instrumentation would be used to investigate the underlying native threads of an OpenMP application). The resulting runtime, and then a GUI snapshot, are shown in detail on the next slide.
Thread Profiler for OpenMP
Thread Profiler for OpenMP
Speedup Graph
- Estimates threading speedup and potential speedup
- Based on the Amdahl's Law computation
- Gives upper- and lower-bound estimates
Thread Profiler for OpenMP
(Timeline view showing serial and parallel regions.)
Thread Profiler for OpenMP
Thread Profiler (for Explicit Threads)
Thread Profiler (for Explicit Threads)
Why so many transitions?
Performance
This implementation has implicit synchronization calls. This limits scaling performance due to the resulting context switches.

Purpose of the Slide: Gives additional analysis regarding the timeline and source views shown in the previous slide; identifies a significant bottleneck in the code.

Details: The slide build highlights the key portions of the timeline and source views.

Questions to Ask Students: Why do we call this a synchronization, and what is implicit about it?

Back to the design stage.
Activity 6
- Use Thread Profiler to analyze the threaded application
  - Use /Qopenmp_profile to compile and link
  - Create a Thread Profiler activity (for explicit threads)
  - Run the application in Thread Profiler
- Find the source line that is causing the threads to be inactive

Purpose of the Slide: Refers students to the 6th lab activity, whose purpose is to run a Thread Profiler analysis on the primes code.

Details: Detailed instructions are provided in the student lab manual. This lab exercise repeats the steps demonstrated in the preceding slides; students should expect to see similar results.
Performance
Is that much contention expected? The algorithm has many more updates than the 10 needed for showing progress.

Original version:

void ShowProgress( int val, int range )
{
    int percentDone;
    gProgress++;
    percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    if( percentDone % 10 == 0 )
        printf("\b\b\b\b%3d%%", percentDone);
}

Fixed version:

void ShowProgress( int val, int range )
{
    int percentDone;
    static int lastPercentDone = 0;
    #pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    }
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

Purpose of the Slide: To address the performance problem identified in the preceding slides and lab.

Details: The test if( percentDone % 10 == 0 ) does NOT cause printing to be done only every 10th step, but much more often. The slide build introduces a fix.

Questions to Ask Students: Why is the original algorithm not printing as infrequently as intended? Why does the fix correct this problem? Invite and encourage the students to walk through the code with you, so these questions are understood.

This change should fix the contention issue.
Design
Goal: eliminate the contention due to implicit synchronization.

Speedup is 2.32X! Is that right?

Purpose of the Slide: Shows a result of the primes code which implements the correction shown on the previous slide, and shows an apparent anomaly in the resulting timing.

Details: The answer is correct, but the speedup of 2.32X on 2 cores cannot be right. Encourage the students to speculate about the causes of this (you may hear words like superscalar, cache, etc.; all red herrings).
Performance
Our original baseline measurement had the "flawed" progress-update algorithm. Is this the best we can expect from this algorithm?

Purpose of the Slide: Show the corrected baseline timing, resolving the apparent anomaly of the previous slide.

Details: The timing shown is a new baseline timing, with the contention correction added to the serial version of primes (note the directory name in the command window). The original baseline timing was 11.73s; this version shows 7.09s, giving us the new speedup ratio of 1.40X. This is significantly lower than the speedup of 1.9X predicted by Amdahl's Law.

Speedup is actually 1.40X (<< 1.9X)!
Activity 7
- Modify the ShowProgress function (both serial and OpenMP versions) to print only the needed output:

if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
    printf("\b\b\b\b%3d%%", percentDone);
    lastPercentDone++;
}

- Recompile and run the code
  - Be sure no instrumentation flags are used
- What is the speedup from the serial version now?

Purpose of the Slide: Refers students to the 7th lab activity, whose purpose is to introduce the performance fix outlined in the preceding slides.

Details: Detailed instructions are provided in the student lab manual. Students will implement the code shown on previous slides, measure new timings, and derive a new speedup number. While unlikely to precisely match the 1.40X speedup shown on the slides (since platforms used for this class will vary), it should be similar.
Performance Re-visited
Still have 62% of execution time in locks and synchronization
52
Performance Re-visited
Let’s look at the OpenMP locks… The original version, with the critical section inside the loop:

    void FindPrimes(int start, int end)
    {   // start is always odd
        int range = end - start + 1;
    #pragma omp parallel for
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) ) {
    #pragma omp critical
                globalPrimes[gPrimesFound++] = i;
            }
            ShowProgress(i, range);
        }
    }

The proposed fix, replacing the critical section with InterlockedIncrement:

    void FindPrimes(int start, int end)
    {   // start is always odd
        int range = end - start + 1;
    #pragma omp parallel for
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
                globalPrimes[InterlockedIncrement(&gPrimesFound)] = i;
            ShowProgress(i, range);
        }
    }

Lock is in a loop. Purpose of the Slide Examine the lock protecting gPrimesFound, to understand the performance impact. Details As stated in the slide, the real issue is putting the critical section (lock) inside a loop. A fix is proposed in the slide build sequence. The slide build: points out the lock within the loop, then introduces a fix using the Windows threading function InterlockedIncrement, defined as

    LONG InterlockedIncrement( LONG volatile* Addend );

where Addend [in, out] is a pointer to the variable to be incremented. This is an atomic operation, less disruptive than a critical section. (Note that InterlockedIncrement returns the new, post-increment value, whereas gPrimesFound++ yields the old value, so the two versions index globalPrimes differently by one.) The final build shows a timing result from a primes image incorporating this fix. A key point: it is possible, and sometimes desirable, to mix OpenMP and native threading calls!
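A portable analog of this lock-free update can be sketched with std::atomic (an assumption here; the course itself uses the Windows Interlocked API). fetch_add returns the value *before* the increment, which matches the indexing of the original gPrimesFound++:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Portable sketch of the lock-free index update. fetch_add returns
// the pre-increment value, so each caller claims a unique free slot
// without taking a lock.
std::atomic<long> gPrimesFound{0};

void RecordPrime(std::vector<int>& primes, int value) {
    long idx = gPrimesFound.fetch_add(1);  // old value -> next free slot
    primes[idx] = value;
}

// Demo: 4 threads record 25 values each into 100 slots.
int RunRecordDemo() {
    gPrimesFound = 0;
    std::vector<int> primes(100, 0);
    std::vector<std::thread> workers;
    for (int t = 0; t < 4; ++t)
        workers.emplace_back([&primes, t] {
            for (int k = 0; k < 25; ++k)
                RecordPrime(primes, t * 25 + k + 1);  // dummy "primes"
        });
    for (auto& w : workers) w.join();
    return (int)gPrimesFound.load();  // all 100 slots claimed exactly once
}
```

Each thread claims a distinct slot without ever serializing on a lock, which is the point of the fix.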
53
Performance Re-visited
Let’s look at the second lock. The original version, with the critical section:

    void ShowProgress( int val, int range )
    {
        int percentDone;
        static int lastPercentDone = 0;
    #pragma omp critical
        {
            gProgress++;
            percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
        }
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
            printf("\b\b\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }

The proposed fix, using InterlockedIncrement and a private snapshot of the counter:

    void ShowProgress( int val, int range )
    {
        long percentDone, localProgress;
        static int lastPercentDone = 0;
        localProgress = InterlockedIncrement(&gProgress);
        percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f);
        if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
            printf("\b\b\b\b%3d%%", percentDone);
            lastPercentDone++;
        }
    }

This lock is also being called within a loop. Purpose of the Slide Examine the lock protecting gProgress, to understand the performance impact. Details The same fix, InterlockedIncrement, is used for this critical section. The slide build: points out that the critical section is in a loop (Question: what loop? The loop in FindPrimes, which calls ShowProgress on every iteration); introduces a different solution, using the Windows API. Note that 3 lines of code need to be modified. The final build shows a timing result from a primes image incorporating this fix.
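The key detail in the fix above is using the value returned by the atomic increment rather than re-reading the shared counter, which another thread may have bumped in between. A minimal portable sketch, with std::atomic standing in for the Windows call (an assumption): fetch_add returns the pre-increment value, so adding 1 reproduces the new value that InterlockedIncrement returns.

```cpp
#include <atomic>

// Shared progress counter, as in the primes example.
std::atomic<long> gProgress{0};

// Atomically bump the counter and return a private snapshot of the
// new value; percentDone should be derived from this snapshot, not
// from a later (racy) read of gProgress.
long BumpProgress() {
    return gProgress.fetch_add(1) + 1;
}
```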
54
Threaded Programming Methodology
Activity 8 Modify the OpenMP critical regions to use InterlockedIncrement instead. Re-compile and run the code. What is the speedup from the serial version now? Purpose of the Slide Refers students to the 8th lab activity, whose purpose is to introduce the code change cited, and measure its impact. Details Detailed instructions are provided in the student lab manual.
55
Thread Profiler for OpenMP
[Timeline figure: four threads, with range boundaries at 250000, 500000, and 750000. For a “middle” candidate in each thread’s portion of the range: Thread 0: 342 factors to test; Thread 1: 612 factors to test; Thread 2: 789 factors to test; Thread 3: 934 factors to test.] Purpose of the Slide To examine the causes of the load imbalance observed in the profile of the primes code. Details Using 4 threads makes the imbalance more obvious. The slide build: Overlaying the Threads view, a “stair step” is drawn to illustrate that each successive thread takes additional time. A bar is drawn to illustrate that the iterations were divided among the threads in equal amounts. “Didn’t we divide the iterations evenly? Let’s look at the work being done for a ‘middle’ prime in each group.” Boxes with the precise workload stated for each thread appear, showing explicitly that more steps are required as the algorithm searches for primes among larger numbers. A triangle is drawn to illustrate (conceptually, not precisely) the nature of the workload, which increases with increasing number range.
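The boxed workloads can be approximated directly. Assuming TestForPrime trial-divides by odd factors up to the square root of the candidate (consistent with the O(N^(3/2)) algorithm described in the agenda; the exact loop is an assumption, so the counts will not match the slide's figures exactly), the work per candidate grows with the candidate value, which produces the stair step:

```cpp
// Hypothetical estimate of trial-division work for one candidate n,
// assuming odd factors 3, 5, 7, ... are tried while factor*factor <= n.
// Larger candidates need more trial factors, which is the source of
// the load imbalance between the four threads.
long FactorsToTest(long n) {
    long count = 0;
    for (long factor = 3; factor * factor <= n; factor += 2)
        ++count;
    return count;
}
```

Comparing the boundary values from the figure shows the monotone growth: each quarter of the range costs strictly more per candidate than the one before it.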
56
Fixing the Load Imbalance
Distribute the work more evenly

    void FindPrimes(int start, int end)
    {   // start is always odd
        int range = end - start + 1;
    #pragma omp parallel for schedule(static, 8)
        for( int i = start; i <= end; i += 2 ) {
            if( TestForPrime(i) )
                globalPrimes[InterlockedIncrement(&gPrimesFound)] = i;
            ShowProgress(i, range);
        }
    }

Speedup achieved is 1.68X. Purpose of the Slide Introduce a method to address the load imbalance inherent in the primes algorithm. Details The slide build: The triangle from the previous slide is redrawn, illustrating the different “sizes” of work for each thread. A new triangle is shown, with the workload interleaved to achieve a more even distribution. An OpenMP schedule clause is added to the pragma, which achieves that interleaving (the algorithm itself is unchanged). A sample run is shown for this approach, with the time now 4.22s, a speedup of 1.68X.
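The interleaving that schedule(static, 8) produces can be sketched as a simple mapping: iterations are dealt to the threads round-robin in chunks of 8, so every thread receives both cheap small candidates and expensive large ones. A minimal sketch (iteration here is the loop-iteration ordinal, not the candidate value):

```cpp
// Which thread owns a given iteration under schedule(static, chunk)?
// OpenMP assigns chunks of `chunk` consecutive iterations to threads
// in round-robin order, so the owner is the chunk index modulo the
// number of threads.
int OwnerOfIteration(int iteration, int chunk, int nthreads) {
    return (iteration / chunk) % nthreads;
}
```

With 4 threads and a chunk of 8, thread 0 gets iterations 0-7, thread 1 gets 8-15, and so on, wrapping back to thread 0 at iteration 32; each thread therefore samples the whole range rather than one contiguous quarter.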
57
Threaded Programming Methodology
Activity 9 Modify the code for better load balance: add a schedule(static, 8) clause to the OpenMP parallel for pragma. Re-compile and run the code. What is the speedup from the serial version now? Purpose of the Slide Refers students to the 9th lab activity, whose purpose is to introduce static OpenMP scheduling, and measure its impact. Details Detailed instructions are provided in the student lab manual. As before, results achieved should be similar to (though probably not exactly the same as) those shown in the slides.
58
Final Thread Profiler Run
Purpose of the Slide Show the performance profile of the final version of primes, with all corrections and load balancing implemented. Details Note that the speedup, 1.80X, is higher than the 1.68X cited on a preceding slide; the difference is that this final run is the “Release” build, free of the overhead of the “Debug” build shown previously (note the directories: here it is c:\classfiles\PrimeOpenMP\Release, previously …\Debug). Speedup achieved is 1.80X
59
Threaded Programming Methodology
Comparative Analysis Threading an application requires multiple iterations through the software development cycle. Purpose of the Slide Summarizes the results at each step in the performance tuning process; emphasizes the iterative nature of the process.
60
Threading Methodology What’s Been Covered
Four-step development cycle for writing threaded code from serial code, and the Intel® tools that support each step: Analysis; Design (introduce threads); Debug for correctness; Tune for performance. Threading applications require multiple iterations of the design, debug, and performance-tuning steps. Use tools to improve productivity. Purpose of the Slide Summarizes the key points covered in this module.
61
Threaded Programming Methodology
Backup Slides
63
Threaded Programming Methodology
Parallel Overhead Thread creation overhead: overhead increases rapidly as the number of active threads increases. Solution: use re-usable threads and thread pools. This amortizes the cost of thread creation and keeps the number of active threads relatively constant.
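A re-usable worker pool can be sketched with standard C++ threads (an assumption; the course's native examples use the Windows API). The workers are created once and fed work through a queue, so submitting a task costs a queue push rather than a thread creation:

```cpp
#include <atomic>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal re-usable worker pool: threads are created once and reused,
// amortizing creation cost and keeping the active-thread count constant.
class ThreadPool {
public:
    explicit ThreadPool(int nthreads) {
        for (int i = 0; i < nthreads; ++i)
            workers_.emplace_back([this] { WorkerLoop(); });
    }
    ~ThreadPool() {                    // drain the queue, then join
        {
            std::lock_guard<std::mutex> lock(m_);
            done_ = true;
        }
        cv_.notify_all();
        for (auto& w : workers_) w.join();
    }
    void Submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m_);
            tasks_.push(std::move(task));
        }
        cv_.notify_one();
    }
private:
    void WorkerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return done_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // done_ set and queue drained
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();
        }
    }
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
};

// Demo: 100 tiny tasks on 4 reusable threads; the destructor drains
// the queue before joining, so the full sum 1+2+...+100 is produced.
int RunPoolDemo() {
    std::atomic<int> sum{0};
    {
        ThreadPool pool(4);
        for (int i = 1; i <= 100; ++i)
            pool.Submit([&sum, i] { sum += i; });
    }
    return sum.load();
}
```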
64
Threaded Programming Methodology
Synchronization Heap contention: allocation from the heap causes implicit synchronization; allocate on the stack or use thread-local storage instead. Atomic updates versus critical sections: some global data updates can use atomic operations (the Interlocked family); use atomic updates whenever possible. Critical sections versus mutexes: CRITICAL_SECTION objects reside in user space; use them when visibility across process boundaries is not required. They introduce less overhead and have a spin-wait variant that is useful for some applications.
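The atomic-versus-lock tradeoff can be illustrated portably. std::atomic and std::mutex stand in here for the Interlocked family and CRITICAL_SECTION (an assumption; the bullets above refer to the Windows primitives). Both versions produce the correct count; the atomic one avoids acquiring a lock for a plain increment:

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Lock-protected shared counter (critical-section analog).
long CountWithMutex(int nthreads, int perThread) {
    long counter = 0;
    std::mutex m;
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < perThread; ++i) {
                std::lock_guard<std::mutex> lock(m);  // lock per increment
                ++counter;
            }
        });
    for (auto& t : ts) t.join();
    return counter;
}

// Atomic shared counter (Interlocked-family analog): no lock needed.
long CountWithAtomic(int nthreads, int perThread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> ts;
    for (int t = 0; t < nthreads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < perThread; ++i)
                counter.fetch_add(1);
        });
    for (auto& t : ts) t.join();
    return counter.load();
}
```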
65
Threaded Programming Methodology
Load Imbalance Unequal workloads lead to idle threads and wasted time. [Figure: timeline showing each thread’s Busy and Idle periods.]
66
Threaded Programming Methodology
Granularity [Figure: bar diagrams comparing the serial and parallelizable portions at coarse and fine grain. With a mostly parallelizable program, scaling is roughly 2.5X at coarse grain and roughly 3X at fine grain; with a mostly serial program, scaling is only roughly 1.05X to 1.10X.]