Threading Methodology using OpenMP*


1 Threading Methodology using OpenMP*
Intel Software College

2 Threading Methodology with OpenMP*
Objectives: After completing this module, you will be able to rapidly prototype and estimate the effort required to thread the time-consuming regions of an application.

3 Threading Methodology with OpenMP*
Agenda A Generic Development Cycle Case Study: Prime Number Generation Common Performance Issues Threading Methodology with OpenMP*

4 What is Parallelism?
Two or more processes or threads execute at the same time.
Parallelism for threading architectures:
Multiple processes: communication through Inter-Process Communication (IPC)
Single process, multiple threads: communication through shared memory

5 Amdahl's Law
Describes the upper bound of parallel execution speedup, where P is the parallel fraction of the run time and n is the number of processors:
  Tparallel = ((1-P) + P/n) * Tserial
  Speedup = Tserial / Tparallel = 1 / ((1-P) + P/n)
For P = 0.5: n = 2 gives 1.0/0.75 = 1.33X; n = ∞ gives 1.0/0.5 = 2.0X.
Serial code limits speedup.
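The bound above is easy to compute directly. A minimal sketch (the function name `amdahl_speedup` is illustrative, not from the deck):

```c
/* Upper bound on speedup from Amdahl's Law.
   P = parallel fraction of the serial run time, n = processor count.
   Speedup = 1 / ((1 - P) + P/n). */
static double amdahl_speedup(double P, int n) {
    return 1.0 / ((1.0 - P) + P / (double)n);
}
```

For P = 0.5 this reproduces the slide's 1.33X at n = 2 and approaches 2.0X as n grows.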

6 Processes and Threads
Modern operating systems load programs as processes: the resource holder and execution context.
A process starts executing at its entry point as a thread.
Threads can create other threads within the process.
Each thread gets its own stack; all threads within a process share the code and data segments.
(Figure: per-thread stacks for the main() thread and a spawned thread, above shared code and data segments.)

7 Threads – Benefits & Risks
Benefits: increased performance and better resource utilization, even on single-processor systems (hiding latency and increasing throughput); IPC through shared memory is more efficient.
Risks: increased application complexity; difficult to debug (data races, deadlocks, etc.).

8 Commonly Encountered Questions with Threading Applications
Where to thread? How long would it take to thread? How much re-design/effort is required? Is it worth threading a selected region? What should the expected speedup be? Will the performance meet expectations? Will it scale as more threads/data are added? Which threading model to use? Threading Methodology with OpenMP*

9 Prime Number Generation

bool TestForPrime(int val)
{
  // let's start checking from 2
  int limit, factor = 2;
  limit = (long)(sqrtf((float)val)+0.5f);
  while( (factor <= limit) && (val % factor) )
    factor++;
  return (factor > limit);
}

void FindPrimes(int start, int end)
{
  // start is always odd
  int range = end - start + 1;
  for( int i = start; i <= end; i += 2 ){
    if( TestForPrime(i) )
      globalPrimes[gPrimesFound++] = i;
    ShowProgress(i, range);
  }
}
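A self-contained serial sketch of the same algorithm can be compiled to sanity-check the logic. This version is a simplification, not the original source: it counts primes instead of storing them, omits ShowProgress, and uses an integer factor*factor bound in place of sqrtf so no math library is needed:

```c
#include <stdbool.h>

/* Trial division, as on the slide, but with an integer bound
   (factor * factor <= val) instead of the sqrtf limit. */
static bool IsPrime(int val) {
    for (int factor = 2; (long)factor * factor <= val; factor++)
        if (val % factor == 0)
            return false;
    return val >= 2;                /* also rules out 0 and 1 */
}

/* Counts odd primes in [start, end]; start must be odd,
   matching the i += 2 stride of FindPrimes. */
static int CountPrimes(int start, int end) {
    int found = 0;
    for (int i = start; i <= end; i += 2)
        if (IsPrime(i))
            found++;
    return found;
}
```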

10 Threading Methodology with OpenMP*
Activity 1 Run Serial version of Prime code Locate PrimeSingle directory Compile with make Run a few times with different ranges Threading Methodology with OpenMP*

11 Development Methodology
Analysis Find computationally intense code Design (Introduce Threads) Determine how to implement threading solution Debug for correctness Detect any problems resulting from using threads Tune for performance Achieve best parallel performance Threading Methodology with OpenMP*

12 Threading Methodology with OpenMP*
Development Cycle Analysis VTune™ Performance Analyzer Design (Introduce Threads) Intel® Performance libraries: IPP and MKL OpenMP* (Intel® Compiler) Explicit threading (Pthreads*, Win32*) Debug for correctness Intel® Thread Checker Intel Debugger Tune for performance Thread Profiler Threading Methodology with OpenMP*

13 Analysis – Sampling
Use VTune Sampling to find hotspots in the application; it identifies the time-consuming regions.
Let's use the project PrimeSingle for analysis. Usage: ./PrimeSingle <start> <end>

bool TestForPrime(int val)
{
  // let's start checking from 3
  int limit, factor = 3;
  limit = (long)(sqrtf((float)val)+0.5f);
  while( (factor <= limit) && (val % factor) )
    factor++;
  return (factor > limit);
}

void FindPrimes(int start, int end)
{
  // start is always odd
  int range = end - start + 1;
  for( int i = start; i <= end; i += 2 ){
    if( TestForPrime(i) )
      globalPrimes[gPrimesFound++] = i;
    ShowProgress(i, range);
  }
}

14 Used to find proper level in the call-tree to thread
Analysis - Call Graph This is the level in the call tree where we need to thread Used to find proper level in the call-tree to thread Threading Methodology with OpenMP*

15 Threading Methodology with OpenMP*
Analysis Where to thread? FindPrimes() Is it worth threading a selected region? Appears to have minimal dependencies Appears to be data-parallel Consumes over 95% of the run time Baseline measurement Threading Methodology with OpenMP*

16 Threading Methodology with OpenMP*
Activity 2 Run code with ‘ ’ range to get baseline measurement Make note for future reference Run VTune analysis on serial code What function takes the most time? Threading Methodology with OpenMP*

17 Foster’s Design Methodology
From Designing and Building Parallel Programs by Ian Foster Four Steps: Partitioning Dividing computation and data Communication Sharing data between computations Agglomeration Grouping tasks to improve performance Mapping Assigning tasks to processors/threads Threading Methodology with OpenMP*

18 Designing Threaded Programs
From the problem, through initial tasks, communication, and combined tasks, to the final program:
Partition: divide the problem into tasks
Communicate: determine the amount and pattern of communication
Agglomerate: combine tasks
Map: assign agglomerated tasks to the created threads (physical processors)

19 Parallel Programming Models
Functional Decomposition Task parallelism Divide the computation, then associate the data Independent tasks of the same problem Data Decomposition Same operation performed on different data Divide data into pieces, then associate computation Threading Methodology with OpenMP*

20 Decomposition Methods
Functional Decomposition: focusing on computations can reveal structure in a problem (example: coupled atmosphere, ocean, land surface, and hydrology models).
Domain Decomposition: focus on the largest or most frequently accessed data structure.
Data Parallelism: the same operation applied to all data.
Grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC

21 Pipelined Decomposition
Computation is done in independent stages.
Functional decomposition: each thread is assigned a stage to compute (an automobile assembly line).
Data decomposition: each thread processes all stages of a single instance (one worker builds an entire car).

22 LAME Encoder Example
The LAME MP3 encoder is an open-source project, also used as an educational tool. The goals of the project are to improve the psychoacoustic quality and the speed of MP3 encoding.

23 LAME Pipeline Strategy
Stages per frame:
Prelude: fetch next frame, frame characterization, set encode parameters
Acoustics: psycho analysis, FFT long/short
Encoding: filter assemblage, apply filtering, noise shaping, quantize & count bits
Other: add frame header, check correctness, write to disk
(Timeline figure: threads T1-T4 work on successive frames N, N+1, ... in pipeline, separated by a hierarchical barrier.)

24 Prime Number Generation – Design
What is the expected benefit? How do you determine this with the least effort? How long would it take to thread? How much re-design/effort is required?
Rapid prototyping with OpenMP: Speedup(2P) = 100/(97/2 + 3) = ~1.94X

25 Threading Methodology with OpenMP*
Fork-join parallelism: the master thread spawns a team of threads as needed.
Parallelism is added incrementally: the sequential program evolves into a parallel program.
(Figure: the master thread forks a team at each parallel region, then joins.)
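A minimal fork-join sketch (the loop is illustrative, not from the deck). The master thread forks a team at the parallel region and joins it at the end; if compiled without OpenMP the pragma is ignored and the loop runs serially with the same result:

```c
/* Sum of squares 1..n with a parallel-for reduction.
   Compile with an OpenMP flag (e.g. gcc -fopenmp); the reduction
   clause gives each thread a private partial sum that is combined
   at the join point, avoiding a data race on `sum`. */
static long SumSquares(int n) {
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 1; i <= n; i++)
        sum += (long)i * i;
    return sum;
}
```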

26 Threading Methodology with OpenMP*
Design – OpenMP

#pragma omp parallel for      // create threads here for this parallel region
for( int i = start; i <= end; i += 2 ){   // iterations of the for loop are divided among the threads
  if( TestForPrime(i) )
    globalPrimes[gPrimesFound++] = i;
  ShowProgress(i, range);
}

27 Threading Methodology with OpenMP*
Lab Activity 3 Run OpenMP version of code Locate PrimeOpenMP directory Compile code with make Run with ‘ ’ for comparison What is the speedup? Threading Methodology with OpenMP*

28 Design
What is the expected benefit? How do you determine this with the least effort? How long would it take to thread? How much re-design/effort is required? Is this the best scaling possible?
Measured speedup is 1.12X, less than the predicted 1.94X.

29 Debugging for Correctness
Is this threaded implementation right? No! The answers are different each time … Threading Methodology with OpenMP*

30 Debugging for Correctness
Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls, and deadlocks.
Tool flow: Primes.c goes through compiler source instrumentation (-openmp -tcheck), or Primes.exe (+ .so's) through binary instrumentation; the runtime data collector produces threadchecker.thr (result file), viewed in the VTune™ Performance Analyzer.

31 Debugging for Correctness
If application has source instrumentation, it can be run from command prompt threadchecker.thr (result file) Primes.c Compiler Source Instrumentation Primes.exe Threading Methodology with OpenMP*

32 Threading Methodology with OpenMP*

33 Threading Methodology with OpenMP*
Lab Activity 4 Use Thread Checker to analyze threaded application Add –tcheck to compile and link commands in Makefile Run instrumented application Copy threadchecker.thr, source, and object files to Windows platform Launch VTune (Thread Checker) Read in threadchecker.thr file and analyze Threading Methodology with OpenMP*

34 Debugging for Correctness
How much re-design/effort is required? How long would it take to thread? Thread Checker reported only 2 dependencies, so effort required should be low Threading Methodology with OpenMP*

35 Debugging for Correctness

#pragma omp parallel for
for( int i = start; i <= end; i += 2 ){
  if( TestForPrime(i) )
    #pragma omp critical      // creates a critical section for this reference
    globalPrimes[gPrimesFound++] = i;
  ShowProgress(i, range);
}

// inside ShowProgress, one critical section covers both shared references:
#pragma omp critical
{
  gProgress++;
  percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
}

36 Threading Methodology with OpenMP*
Lab Activity 5 Modify and run OpenMP version of code Add critical region pragmas to code Compile code with make If still not correct, re-instrument for Thread Checker, run, and copy files for analysis Run with ‘ ’ for comparison Be sure to remove –tcheck before final build What is the speedup? Threading Methodology with OpenMP*

37 Correctness
Correct answer, but performance has slipped to ~1.03X! Is this the best we can expect from this algorithm? No! From Amdahl's Law, we expect scaling close to 1.9X.

38 Common Performance Issues
Parallel Overhead: due to thread creation, scheduling, ...
Synchronization: excessive use of global data, contention for the same synchronization object
Load Imbalance: improper distribution of parallel work
Granularity: insufficient parallel work

39 Tuning for Performance
Thread Profiler pinpoints performance bottlenecks in threaded applications.
Tool flow: Primes.c goes through compiler source instrumentation (-openmp_profile), or Primes.exe (+ .so's) through binary instrumentation; the runtime data collector produces Bistro.tp/guide.gvs (result file), viewed in the VTune™ Performance Analyzer.

40 Tuning for Performance
If application has source instrumentation, it can be run from command prompt Bistro.tp/guide.gvs (result file) Primes.c Compiler Source Instrumentation Primes.exe Threading Methodology with OpenMP*

41 Thread Profiler for OpenMP
Threading Methodology with OpenMP*

42 Thread Profiler for OpenMP
Speedup Graph Estimates threading speedup and potential speedup Based on Amdahl’s Law computation Gives upper and lower bound estimates Threading Methodology with OpenMP*

43 Thread Profiler for OpenMP
(Screenshot: timeline showing serial and parallel regions.)

44 Thread Profiler for OpenMP
Threading Methodology with OpenMP*

45 Threading Methodology with OpenMP*
Lab Activity 6: Use Thread Profiler to analyze the threaded application
Add –openmp_profile to the compile and link commands in the Makefile
Run the instrumented application
Copy guide.gvs, source, and object files to the Windows platform
Launch VTune (Thread Profiler)
Read in the guide.gvs file and analyze

46 Thread Profiler for OpenMP
(Figure: over the range 250000-750000, the number of factors to test per candidate grows: 342 factors to test, then Thread 1: 612, Thread 2: 789, Thread 3: 934. Threads given higher sub-ranges do more work.)

47 Fixing the Load Imbalance
Distribute the work more evenly:

void FindPrimes(int start, int end)
{
  // start is always odd
  int range = end - start + 1;
  #pragma omp parallel for schedule(static, 8)
  for( int i = start; i <= end; i += 2 ){
    if( TestForPrime(i) )
      #pragma omp critical
      globalPrimes[gPrimesFound++] = i;
    ShowProgress(i, range);
  }
}

Scaling achieved is 1.02X.
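Why schedule(static, 8) helps: the OpenMP runtime deals chunks of 8 iterations to the threads round-robin, so the expensive high-value candidates are spread across all threads instead of landing on the last one. The helper below is illustrative (not part of the original code); it computes which thread owns a given iteration index under chunked static scheduling:

```c
/* Under schedule(static, chunk), chunks of iterations are dealt to the
   team round-robin, so iteration index k (counting from 0) belongs to
   thread (k / chunk) % nthreads. */
static int StaticChunkOwner(int k, int chunk, int nthreads) {
    return (k / chunk) % nthreads;
}
```

With 2 threads and chunk 8, iterations 0-7 go to thread 0, 8-15 to thread 1, 16-23 back to thread 0, and so on.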

48 Threading Methodology with OpenMP*
Lab Activity 7 Modify code for better load balance Add schedule (static, 8) clause to OpenMP parallel for pragma Re-compile and run code What is speedup from serial version now? Threading Methodology with OpenMP*

49 Back to Thread Profiler
Why are we not getting any speedup? Too much time is spent in locks and synchronization; even with no synchronization at all, we would only achieve 1.17X.

50 Is it Possible to do Better?
Using top, we see the serial code runs at 100% CPU. The threaded code runs poorly, but has "flashes" of good performance.

51 Other Causes to Consider
Problems Cache Thrashing False Sharing Excessive context switching I/O Tools VTune Performance Analyzer Thread Profiler (as explicit threads) Linux tools vmstat and sar Disk i/o, all CPUs, context switches, network traffic, interrupts Threading Methodology with OpenMP*

52 Linux sar Output
Acceptable context switching: less than 5000 context switches per second should be acceptable.
Little paging activity: memory doesn't seem to be a problem.

53 Threading Methodology with OpenMP*
Performance: This implementation has implicit synchronization calls. Thread-safe libraries may have "hidden" synchronization to protect shared resources, which serializes access to them. For example, I/O libraries share file pointers, limiting the number of operations that can proceed in parallel: a physical limit.

54 Performance
How many times is printf() executed? Hypothesis: the algorithm performs many more progress updates than the 10 needed for showing progress.

Original version:

void ShowProgress( int val, int range )
{
  int percentDone;
  #pragma omp critical
  {
    gProgress++;
    percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
  }
  if( percentDone % 10 == 0 )
    printf("\b\b\b\b%3d%%", percentDone);
}

Fixed version, printing only when a new 10% boundary is reached:

void ShowProgress( int val, int range )
{
  int percentDone;
  static int lastPercentDone = 0;
  #pragma omp critical
  {
    gProgress++;
    percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
  }
  if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ){
    printf("\b\b\b\b%3d%%", percentDone);
    lastPercentDone++;
  }
}

This change should fix the contention issue.
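The effect of the fix can be checked serially. The harness below is hypothetical (the name `CountPrints` and the 100% scale are assumptions); it only counts would-be printf calls rather than printing:

```c
/* Counts how many times the progress test fires over `range` updates.
   fixed == 0: the original  percentDone % 10 == 0  test.
   fixed != 0: the corrected test with the lastPercentDone guard. */
static int CountPrints(int range, int fixed) {
    int prints = 0, lastPercentDone = 0, gProgress = 0;
    for (int n = 0; n < range; n++) {
        gProgress++;
        int percentDone = (int)((float)gProgress / (float)range * 100.0f + 0.5f);
        if (fixed) {
            if (percentDone % 10 == 0 && lastPercentDone < percentDone / 10) {
                prints++;
                lastPercentDone++;
            }
        } else if (percentDone % 10 == 0) {
            prints++;
        }
    }
    return prints;
}
```

The original test fires on every update whose rounded percentage happens to be a multiple of 10, roughly one update in ten, while the fixed version fires exactly once per 10% boundary.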

55 Threading Methodology with OpenMP*
Lab Activity 8 Modify ShowProgress function to print only the needed output Recompile and run the code Be sure no instrumentation flags are used What is speedup from serial version now? if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10){ printf("\b\b\b\b%3d%%", percentDone); lastPercentDone++; } Threading Methodology with OpenMP*

56 Threading Methodology with OpenMP*
Design Eliminate the contention due to implicit synchronization Speedup is 17.87X! Can that be right? Threading Methodology with OpenMP*

57 Performance
Our original baseline measurement had the "flawed" progress update algorithm. Speedup achieved is 1.49X (less than 1.9X). Is this the best we can expect from this algorithm?

58 Performance Re-visited
Let’s use Thread Profiler again… 70% of time spent in synchronization and overhead Threading Methodology with OpenMP*

59 OpenMP Critical Regions
Naming the regions creates a separate critical section for each code region:

#pragma omp parallel for
for( int i = start; i <= end; i += 2 ){
  if( TestForPrime(i) )
    #pragma omp critical (one)
    globalPrimes[gPrimesFound++] = i;
  ShowProgress(i, range);
}

// inside ShowProgress:
#pragma omp critical (two)
{
  gProgress++;
  percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
}

60 Performance Re-visited
Any better? Threading Methodology with OpenMP*

61 Reducing Critical Region Size
Each critical section covers two operations, a write and a read, but only the update of the shared variable needs protection: the gPrimesFound increment (not the array write) and the gProgress increment (not the percentDone computation). Use local variables so the read access happens outside the critical region.

62 Reducing Critical Region Size

#pragma omp parallel for schedule(static, 8)
for( int i = start; i <= end; i += 2 ){
  if( TestForPrime(i) ) {
    #pragma omp critical (one)
    lPrimesFound = gPrimesFound++;
    globalPrimes[lPrimesFound] = i;
  }
  ShowProgress(i, range);
}

// inside ShowProgress:
#pragma omp critical (two)
lProgress = gProgress++;
percentDone = (int)((float)lProgress/(float)range*200.0f+0.5f);

Speedup is up to 1.89X.
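Putting the pieces together, here is a hedged, self-contained sketch of the tuned loop. Global names follow the slides, but `IsPrime` replaces TestForPrime's sqrtf bound with an integer one, and progress reporting is omitted. If compiled without OpenMP the pragmas are ignored and the loop runs serially with the same count:

```c
#include <stdbool.h>

#define MAX_PRIMES 10000
static int globalPrimes[MAX_PRIMES];
static int gPrimesFound;

static bool IsPrime(int val) {
    for (int factor = 2; (long)factor * factor <= val; factor++)
        if (val % factor == 0)
            return false;
    return val >= 2;
}

/* Tuned loop: chunked static schedule for load balance, and a named
   critical region that protects only the shared-index increment.
   The array write uses the thread's local copy of the index. */
static int FindPrimes(int start, int end) {   /* start must be odd */
    gPrimesFound = 0;
    #pragma omp parallel for schedule(static, 8)
    for (int i = start; i <= end; i += 2) {
        if (IsPrime(i)) {
            int lPrimesFound;
            #pragma omp critical (one)
            lPrimesFound = gPrimesFound++;
            globalPrimes[lPrimesFound] = i;
        }
    }
    return gPrimesFound;
}
```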

63 Threading Methodology with OpenMP*
Lab Activity 9 Modify code to reduce size of critical regions Name the two different critical regions Use local variable for read-access after global is updated What is speedup versus serial version? Threading Methodology with OpenMP*

64 Scaling Performance
(Chart: measured scaling of 1.96X and 1.97X.)

65 Threading Methodology with OpenMP*
Summary Threading applications require multiple iterations of designing, debugging and performance tuning steps Use tools to improve productivity Unleash the power of dual-core and multi-core processors with multithreaded applications Threading Methodology with OpenMP*


67 Threading Methodology with OpenMP*
Backup Slides Threading Methodology with OpenMP*

68 Threading Methodology with OpenMP*
Parallel Overhead
Thread creation overhead increases rapidly as the number of active threads increases.
Solution: use re-usable threads and thread pools. This amortizes the cost of thread creation and keeps the number of active threads relatively constant.

69 Threading Methodology with OpenMP*
Synchronization
Heap contention: allocation from the heap causes implicit synchronization. Allocate on the stack or use thread-local storage instead.
Atomic updates versus critical sections: some global data updates can use atomic operations (e.g. the Win32 Interlocked family). Use atomic updates whenever possible.
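A sketch of the atomic alternative in OpenMP terms (the Interlocked family is the Win32 analog; this loop and its names are illustrative, not from the deck):

```c
/* Counting multiples of k in 1..n with an atomic scalar update instead
   of a full critical section.  Compile with -fopenmp; without OpenMP
   the pragmas are ignored and the loop runs serially. */
static int CountMultiples(int n, int k) {
    int count = 0;                     /* shared across the team */
    #pragma omp parallel for
    for (int i = 1; i <= n; i++)
        if (i % k == 0) {
            #pragma omp atomic         /* cheap hardware-level update */
            count++;
        }
    return count;
}
```

An atomic directive only covers a single scalar update, so it is cheaper than a critical section when that is all the protection needed.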

70 Threading Methodology with OpenMP*
Load Imbalance: unequal work loads lead to idle threads and wasted time.
(Figure: timeline of busy versus idle time per thread.)

71 Threading Methodology with OpenMP*
Granularity: loosely defined as the ratio of computation to synchronization. Be sure there is enough work to merit parallel computation. Granularity is more difficult to quantify than parallel speedup or efficiency, and is better explained by example: two farmers divide a field to harvest it. How many more farmers can be added - 20? 100? The problem is initially coarse-grained but becomes successively finer as the field is divided. At some point the partitions become so small that the farmers get in each other's way: the overhead of synchronizing so many farmers begins to outweigh the benefit of parallelism.

