
Multi-core Programming Threading Concepts

2 Basics of VTune™ Performance Analyzer
Topics
– A Generic Development Cycle
– Case Study: Prime Number Generation
– Common Performance Issues

3 Threaded Programming Methodology
What is Parallelism?
Two or more processes or threads execute at the same time.
Parallelism for threading architectures:
– Multiple processes: communication through Inter-Process Communication (IPC)
– Single process, multiple threads: communication through shared memory

4 Threaded Programming Methodology
Amdahl's Law
Describes the upper bound of parallel execution speedup; serial code limits speedup.
n = number of processors
P = proportion of the computation where parallelization may occur
T_parallel = ((1-P) + P/n) · T_serial
Speedup = T_serial / T_parallel = 1 / ((1-P) + P/n)
Example (P = 0.5): n = 2 gives 1/0.75 = 1.33X; n = ∞ gives 1/0.5 = 2.0X
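The slide's formula can be turned into a one-line function. This is a minimal sketch; the name amdahl_speedup is ours, not from the slides.

```c
/* Amdahl's Law from the slide: T_parallel = ((1-P) + P/n) * T_serial,
   so speedup = T_serial / T_parallel = 1 / ((1-P) + P/n).
   P = parallelizable proportion, n = number of processors. */
double amdahl_speedup(double P, double n) {
    return 1.0 / ((1.0 - P) + P / n);
}
```

With P = 0.5 and n = 2 this gives the slide's 1.33X; with P = 0.95 and n = 2 it gives roughly the 1.9X the deck expects later.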

5 Threaded Programming Methodology
Processes and Threads
Modern operating systems load programs as processes:
– Resource holder
– Execution
A process starts executing at its entry point as a thread.
Threads can create other threads within the process; each thread gets its own stack.
All threads within a process share code and data segments.
(Diagram: main() and other threads share the code and data segments; each thread has its own stack.)

6 Threaded Programming Methodology
Threads – Benefits & Risks
Benefits
– Increased performance and better resource utilization, even on single-processor systems (hiding latency, increasing throughput)
– IPC through shared memory is more efficient
Risks
– Increases complexity of the application
– Difficult to debug (data races, deadlocks, etc.)

7 Threaded Programming Methodology
Commonly Encountered Questions with Threading Applications
– Where to thread?
– How long would it take to thread?
– How much re-design/effort is required?
– Is it worth threading a selected region?
– What should the expected speedup be?
– Will the performance meet expectations?
– Will it scale as more threads/data are added?
– Which threading model to use?

8 Threaded Programming Methodology
Prime Number Generation

bool TestForPrime(int val)
{
    // let's start checking from 3
    int limit, factor = 3;
    limit = (long)(sqrtf((float)val)+0.5f);
    while( (factor <= limit) && (val % factor) )
        factor++;
    return (factor > limit);
}

void FindPrimes(int start, int end)
{
    int range = end - start + 1;
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

9 Threaded Programming Methodology
Development Methodology
– Analysis: find computationally intense code
– Design (Introduce Threads): determine how to implement a threading solution
– Debug for correctness: detect any problems resulting from using threads
– Tune for performance: achieve the best parallel performance

10 Threaded Programming Methodology
Development Cycle
Analysis
– VTune™ Performance Analyzer
Design (Introduce Threads)
– Intel® Performance Libraries: IPP and MKL
– OpenMP* (Intel® Compiler)
– Explicit threading (Win32*, Pthreads*)
Debug for correctness
– Intel® Thread Checker
– Intel Debugger
Tune for performance
– Intel® Thread Profiler
– VTune™ Performance Analyzer

11 Threaded Programming Methodology
Analysis – Sampling
Let's use the project PrimeSingle for analysis.
Usage: ./PrimeSingle
Use VTune Sampling to find hotspots in the application; it identifies the time-consuming regions.

bool TestForPrime(int val)
{
    // let's start checking from 3
    int limit, factor = 3;
    limit = (long)(sqrtf((float)val)+0.5f);
    while( (factor <= limit) && (val % factor) )
        factor++;
    return (factor > limit);
}

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

12 Threaded Programming Methodology
Analysis – Call Graph
Used to find the proper level in the call tree to thread; this is the level where we need to thread.

13 Threaded Programming Methodology
Analysis
Where to thread?
– FindPrimes()
Is it worth threading a selected region?
– Appears to have minimal dependencies
– Appears to be data-parallel
– Consumes over 95% of the run time
(Baseline measurement)

14 Threaded Programming Methodology
Foster's Design Methodology
From Designing and Building Parallel Programs by Ian Foster.
Four Steps:
– Partitioning: dividing computation and data
– Communication: sharing data between computations
– Agglomeration: grouping tasks to improve performance
– Mapping: assigning tasks to processors/threads

15 Threaded Programming Methodology
Designing Threaded Programs
– Partition: divide the problem into tasks
– Communicate: determine the amount and pattern of communication
– Agglomerate: combine tasks
– Map: assign agglomerated tasks to created threads
(Diagram: the problem → initial tasks → communication → combined tasks → final program.)

16 Threaded Programming Methodology
Parallel Programming Models
Functional Decomposition
– Task parallelism
– Divide the computation, then associate the data
– Independent tasks of the same problem
Data Decomposition
– Same operation performed on different data
– Divide data into pieces, then associate computation
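Data decomposition as described on the slide can be sketched in a few lines: the same operation is applied to different pieces of the data, and partial results are combined. This is an illustrative sketch, not code from the deck; chunk_sum and decomposed_sum are our own names, and a threaded version would hand each chunk to its own thread.

```c
#include <stddef.h>

/* One "piece" of the data decomposition: sum the half-open range [lo, hi). */
long chunk_sum(const int *a, size_t lo, size_t hi) {
    long s = 0;
    for (size_t i = lo; i < hi; i++)
        s += a[i];
    return s;
}

/* Divide [0, n) into nchunks pieces and combine the partial sums.
   Each chunk_sum call is independent, so each could run on a thread. */
long decomposed_sum(const int *a, size_t n, size_t nchunks) {
    long total = 0;
    for (size_t c = 0; c < nchunks; c++) {
        size_t lo = c * n / nchunks;        /* chunk boundaries */
        size_t hi = (c + 1) * n / nchunks;
        total += chunk_sum(a, lo, hi);      /* combine partial results */
    }
    return total;
}
```

The result is the same for any chunk count, which is exactly what makes this pattern safe to parallelize.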

17 Threaded Programming Methodology
Decomposition Methods
Functional Decomposition
– Focusing on computations can reveal structure in a problem
Domain Decomposition
– Focus on the largest or most frequently accessed data structure
– Data parallelism: the same operation applied to all data
(Figure: coupled Atmosphere, Ocean, Land Surface, and Hydrology models; grid reprinted with permission of Dr. Phu V. Luong, Coastal and Hydraulics Laboratory, ERDC.)

18 Threaded Programming Methodology
Pipelined Decomposition
Computation done in independent stages.
Functional decomposition
– Threads are assigned a stage to compute
– Automobile assembly line
Data decomposition
– A thread processes all stages of a single instance
– One worker builds an entire car

19 Threaded Programming Methodology
LAME Pipeline Strategy
Stages per frame:
– Prelude: fetch next frame, frame characterization, set encode parameters
– Acoustics: psycho analysis, FFT long/short, filter assemblage
– Encoding: apply filtering, noise shaping, quantize & count bits
– Other: add frame header, check correctness, write to disk
(Diagram: threads T1–T4 work on frames N through N+3 at staggered stages over time, with a hierarchical barrier.)

20 Threaded Programming Methodology
Design
What is the expected benefit? How do you achieve this with the least effort?
How long would it take to thread? How much re-design/effort is required?
Rapid prototyping with OpenMP.
Speedup(2P) = 100/(96/2 + 4) ≈ 1.92X

21 Threaded Programming Methodology
OpenMP
Fork-join parallelism:
– Master thread spawns a team of threads as needed
– Parallelism is added incrementally: the sequential program evolves into a parallel program
(Diagram: master thread forking thread teams at parallel regions.)
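The incremental nature of fork-join parallelism can be shown with a single pragma on an otherwise serial loop. This is a hedged sketch (count_even is our own example, not code from the slides): built with an OpenMP-aware compiler, the master thread forks a team for the loop; without OpenMP the pragma is simply ignored and the loop runs serially with the same result.

```c
/* Count even elements.  The reduction clause gives each thread a
   private copy of count, and the copies are combined at the join. */
int count_even(const int *a, int n) {
    int count = 0;
    #pragma omp parallel for reduction(+:count)
    for (int i = 0; i < n; i++)
        if (a[i] % 2 == 0)
            count++;   /* each thread accumulates its private count */
    return count;
}
```

Because the serial and parallel versions are the same source, this is exactly the "sequential program evolves into a parallel program" approach the slide describes.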

22 Threaded Programming Methodology
Design

#pragma omp parallel for
for( int i = start; i <= end; i += 2 ) {
    if( TestForPrime(i) )
        globalPrimes[gPrimesFound++] = i;
    ShowProgress(i, range);
}

OpenMP: parallel creates threads for this parallel region; for divides the iterations of the for loop.

23 Threaded Programming Methodology
Design
What is the expected benefit? How do you achieve this with the least effort?
How long would it take to thread? How much re-design/effort is required?
Is this the best speedup possible?
Speedup of 1.40X (less than 1.92X).

24 Threaded Programming Methodology
Debugging for Correctness
Is this threaded implementation right? No! The answers are different each time…

25 Threaded Programming Methodology
Debugging for Correctness
Intel® Thread Checker pinpoints notorious threading bugs like data races, stalls, and deadlocks.
(Workflow: Primes.exe → binary instrumentation → instrumented Primes.exe + instrumented DLLs → runtime data collector → threadchecker.thr result file → viewed in the VTune™ Performance Analyzer.)

26 Threaded Programming Methodology Thread Checker

27 Threaded Programming Methodology
Debugging for Correctness
How much re-design/effort is required? How long would it take to thread?
Thread Checker reported only 2 dependencies, so the effort required should be low.

28 Threaded Programming Methodology
Debugging for Correctness

#pragma omp parallel for
for( int i = start; i <= end; i += 2 ) {
    if( TestForPrime(i) )
        #pragma omp critical            // creates a critical section for this reference
        globalPrimes[gPrimesFound++] = i;
    ShowProgress(i, range);
}

#pragma omp critical                    // creates a critical section for both these references
{
    gProgress++;
    percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
}

29 Threaded Programming Methodology
Correctness
Correct answer, but performance has slipped to ~1.33X.
Is this the best we can expect from this algorithm? No! From Amdahl's Law, we expect speedup close to 1.9X.

30 Threaded Programming Methodology
Common Performance Issues
Parallel Overhead
– Due to thread creation, scheduling, …
Synchronization
– Excessive use of global data, contention for the same synchronization object
Load Imbalance
– Improper distribution of parallel work
Granularity
– Not enough parallel work per thread

31 Threaded Programming Methodology
Tuning for Performance
Thread Profiler pinpoints performance bottlenecks in threaded applications.
(Workflow: Primes.c → compiler source instrumentation with /Qopenmp_profile, or binary instrumentation → instrumented Primes.exe + instrumented DLLs → runtime data collector → Bistro.tp/guide.gvs result file → viewed in the VTune™ Performance Analyzer.)

32 Threaded Programming Methodology Thread Profiler for OpenMP

33 Threaded Programming Methodology
Thread Profiler for OpenMP
Speedup Graph
– Estimates threading speedup and potential speedup, based on Amdahl's Law
– Gives upper and lower bound estimates

34 Threaded Programming Methodology
Thread Profiler for OpenMP
(Timeline view: serial vs. parallel regions.)

35 Threaded Programming Methodology
Thread Profiler for OpenMP
(Per-thread view: Thread 0, Thread 1, Thread 2, Thread 3.)

36 Threaded Programming Methodology Thread Profiler (for Explicit Threads)

37 Threaded Programming Methodology Thread Profiler (for Explicit Threads) Why so many transitions?

38 Threaded Programming Methodology
Performance
This implementation has implicit synchronization calls, which limits scaling performance due to the resulting context switches.
Back to the design stage.

39 Threaded Programming Methodology
Performance

void ShowProgress( int val, int range )
{
    int percentDone;
    gProgress++;
    percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    if( percentDone % 10 == 0 )
        printf("\b\b\b\b%3d%%", percentDone);
}

Is that much contention expected? The algorithm has many more updates than the 10 needed for showing progress.

void ShowProgress( int val, int range )
{
    int percentDone;
    static int lastPercentDone = 0;
    #pragma omp critical
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    }
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ) {
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

This change should fix the contention issue.

40 Threaded Programming Methodology
Design
Goals
– Eliminate the contention due to implicit synchronization
Speedup is 2.32X! Is that right?

41 Threaded Programming Methodology
Performance
Our original baseline measurement had the "flawed" progress update algorithm; speedup is actually 1.40X (<< 1.9X)!
Is this the best we can expect from this algorithm?

42 Threaded Programming Methodology Performance Re-visited Still have 62% of execution time in locks and synchronization

43 Threaded Programming Methodology
Performance Re-visited
Let's look at the OpenMP locks…

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            #pragma omp critical        // lock is in a loop
            globalPrimes[gPrimesFound++] = i;
        ShowProgress(i, range);
    }
}

Replace the critical section with an atomic increment:

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    #pragma omp parallel for
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            globalPrimes[InterlockedIncrement(&gPrimesFound)] = i;
        ShowProgress(i, range);
    }
}

44 Threaded Programming Methodology
Performance Re-visited
Let's look at the second lock.

void ShowProgress( int val, int range )
{
    int percentDone;
    static int lastPercentDone = 0;
    #pragma omp critical                // this lock is also being called within a loop
    {
        gProgress++;
        percentDone = (int)((float)gProgress/(float)range*200.0f+0.5f);
    }
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ) {
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}

Replace it with an atomic increment as well:

void ShowProgress( int val, int range )
{
    long percentDone, localProgress;
    static int lastPercentDone = 0;
    localProgress = InterlockedIncrement(&gProgress);
    percentDone = (int)((float)localProgress/(float)range*200.0f+0.5f);
    if( percentDone % 10 == 0 && lastPercentDone < percentDone / 10 ) {
        printf("\b\b\b\b%3d%%", percentDone);
        lastPercentDone++;
    }
}
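InterlockedIncrement is a Win32 API. On other platforms the same effect can be sketched with C11 atomics; this is an illustrative stand-in, not code from the slides, and portable_increment is our own name. Note that atomic_fetch_add returns the value *before* the add, while InterlockedIncrement returns the value *after*, hence the +1.

```c
#include <stdatomic.h>

/* Atomically increment the counter and return the new value,
   mirroring the semantics of Win32 InterlockedIncrement. */
long portable_increment(atomic_long *p) {
    return atomic_fetch_add(p, 1) + 1;
}
```

Like the slide's version, this replaces a critical section with a single lock-free hardware operation, which matters when the increment sits inside a hot loop.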

45 Thread Profiler for OpenMP
(Per-thread bars: thread factors to test.)

46 Threaded Programming Methodology
Fixing the Load Imbalance
Distribute the work more evenly:

void FindPrimes(int start, int end)
{
    // start is always odd
    int range = end - start + 1;
    #pragma omp parallel for schedule(static, 8)
    for( int i = start; i <= end; i += 2 ) {
        if( TestForPrime(i) )
            globalPrimes[InterlockedIncrement(&gPrimesFound)] = i;
        ShowProgress(i, range);
    }
}

Speedup achieved is 1.68X.

47 Threaded Programming Methodology
Final Thread Profiler Run
Speedup achieved is 1.80X.

48 Threaded Programming Methodology Comparative Analysis Threading applications require multiple iterations of going through the software development cycle

49 Threaded Programming Methodology
Threading Methodology: What's Been Covered
Four-step development cycle for writing threaded code from serial, and the Intel® tools that support each step:
– Analysis
– Design (Introduce Threads)
– Debug for correctness
– Tune for performance
Threading applications require multiple iterations of the designing, debugging, and performance-tuning steps.
Use tools to improve productivity.