Joe Hummel, PhD. Microsoft MVP Visual C++. Technical Staff: Pluralsight, LLC. Professor: U. of Illinois, Chicago. Stuff: http://www.joehummel.net/downloads.html

Presentation transcript:

Joe Hummel, PhD. Microsoft MVP Visual C++. Technical Staff: Pluralsight, LLC. Professor: U. of Illinois, Chicago. Stuff: http://www.joehummel.net/downloads.html

Async programming: better responsiveness for unpredictable operations (GUIs on desktop, web, and mobile; cloud access such as login and data; disk and network I/O). Parallel programming: better performance for compute-intensive workloads (engineering, oil and gas, pharma, science, social media / big data).

Common solution: multithreading, i.e. running code on separate threads. The operating system rapidly switches the CPU from one thread to the other, so both execute and make forward progress. The main thread runs the GUI and interacts with the user; a worker thread runs the work (Stmt1; Stmt2; Stmt3; …).
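For contrast with the C# attempts that follow, here is a minimal sketch of the same idea in C++ (not from the talk); DoLongRunningOp is a hypothetical placeholder for the expensive work:

#include <iostream>
#include <thread>

void DoLongRunningOp()                    // hypothetical stand-in for the expensive work
{
    // Stmt1; Stmt2; Stmt3; ...
}

int main()
{
    std::thread worker(DoLongRunningOp);  // the work proceeds on a separate thread

    std::cout << "main thread keeps interacting with the user...\n";

    worker.join();                        // wait for the worker before exiting
    return 0;
}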

Asian options financial modeling…

Issue: long-running event handlers pose a problem. If the current event handler takes a long time, then the remaining events wait in the queue — the app feels unresponsive.

Async programming with threads… Attempt #1 (using System.Threading):

void button1_Click(…)
{
  var result = DoLongRunningOp();
  lstBox.Items.Add(result);
}

becomes:

Thread t = new Thread(() =>    // lambda expression:
{
  var result = DoLongRunningOp();
  listbox.Items.Add(result);
});
t.Start();

Boom!

Attempt #2: the UI thread owns the UI, so worker threads must delegate access…

Thread t = new Thread(() =>    // lambda expression:
{
  var result = DoLongRunningOp();

  this.Dispatcher.Invoke(() =>
  {
    listbox.Items.Add(result);
  });
});
t.Start();

async / await: a language-based solution in C# (using System.Threading.Tasks)…

Before:

void button1_Click(…)
{
  var result = DoLongRunningOp();
  lstBox.Items.Add(result);
}

After:

async void button1_Click(…)    // method *may* perform an async, long-running op
{
  var result = await Task.Run(() => DoLongRunningOp());
  lstBox.Items.Add(result);
}

await tells the compiler: don't wait for this task to finish — set aside the code that follows so it happens later — and just return. No blocking!

Threads are…
- expensive to create (involves the OS)
- easy to over-subscribe (i.e. create too many)
- an obfuscation (the intent of the code is harder to recognize)
- tedious to program

Lambda expressions in C#/C++/Java make threading slightly better; exception handling is particularly difficult.

Asian options financial modeling…

Parallel programming to speed up the simulation? Parallel.For:

// for (…)
Parallel.For(0, sims, (index) =>
{
  ...
});

Structured (“fork-join”) parallelism (using System.Threading.Tasks):

Sequential:

for (int i = 0; i < N; i++)
{
  ...
}

Parallel:

Parallel.For(0, N, (i) =>
{
  ...
});

The parallel loop forks worker tasks at the start and joins them at the end.

Task-based execution model: a call such as Parallel.For(…) is broken into tasks. The Task Parallel Library (its task scheduler and resource manager) places the tasks on the global work queue of the .NET thread pool, and the pool's worker threads (running inside the Windows process, across its app domains) pull tasks from the queue and execute them on top of Windows.
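The same idea can be sketched generically: tasks go onto a shared queue, and a fixed pool of threads pulls and runs them. The C++ sketch below (the class name TinyPool and all details are illustrative, not the TPL's actual implementation) shows the shape of such a scheduler:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class TinyPool
{
    std::queue<std::function<void()>> work;       // global work queue
    std::mutex m;
    std::condition_variable cv;
    std::vector<std::thread> workers;             // pool of worker threads
    bool done = false;

public:
    explicit TinyPool(unsigned n = std::thread::hardware_concurrency())
    {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this]
            {
                for (;;)
                {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lk(m);
                        cv.wait(lk, [this] { return done || !work.empty(); });
                        if (done && work.empty())
                            return;               // no more work: worker exits
                        task = std::move(work.front());
                        work.pop();
                    }
                    task();                       // run the task on this pool thread
                }
            });
    }

    void submit(std::function<void()> task)       // roughly: queue one task
    {
        {
            std::lock_guard<std::mutex> lk(m);
            work.push(std::move(task));
        }
        cv.notify_one();
    }

    ~TinyPool()
    {
        {
            std::lock_guard<std::mutex> lk(m);
            done = true;
        }
        cv.notify_all();
        for (auto &t : workers)
            t.join();                             // workers drain the queue, then exit
    }
};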

Where's the other data race?

double sum = 0.0;
string sumLock = "sumLock";

// for (…)
Parallel.For(0, sims, (index) =>
{
  lock (sumLock)
  {
    sum += callPayOff;
  }
});

Various solutions, from least preferred to most preferred:
- use synchronization (e.g. locking)
- use thread-safe entities (e.g. parallel collections)
- redesign to eliminate the race (e.g. reduction)

Reduction is a common parallel design pattern:
- each task computes its own, local result — no shared resource
- merge (“reduce”) the results at the end — minimal locking

Parallel.For(0, sims,
  () => { return new TLS(); },       // init: create thread-local storage
  (index, control, tls) =>           // loop body: one simulation
  {
    ...
    tls.sum += callPayOff;
    return tls;
  },
  (tls) =>                           // thread has finished: integrate partial result
  {
    lock (sumLock) { sum += tls.sum; }
  }
);

State of mainstream parallel programming:

Language             Support for parallelism   Technologies
C                    No                        use Pthreads or another library
C++ (before 2011)    No                        use Pthreads or another library
C++14                Minimal                   built-in support for threads, async
Java                 Better                    threads, tasks, fork/join, parallel data structures
C#                   Better++                  threads, tasks, async, parallel loops and data structures

Other options for parallel performance?
- libraries: MPI, TBB (Threading Building Blocks), Boost, Actors, PLINQ, TPL, PPL, Pthreads, Thrust, …
- language extensions: OpenMP, AMP, OpenACC, …
- parallel languages:
  - CPU-based: Chapel, X10, High Performance Fortran, …
  - GPU-based: CUDA, OpenCL, …

Mandelbrot with OpenMP

OpenMP (Open Multi-Processing):
- an open standard for platform-neutral, shared-memory multithreading
- very popular, with widespread support in most compilers (e.g. gcc 4.2 and later)
- the programmer directs parallelization via code annotations; the compiler implements it

Sequential:

sum = 0.0;
for (int i = 0; i < N; ++i)
  sum = sum + A[i];

With OpenMP:

sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i)
  sum = sum + A[i];

OpenMP supports:
- parallel regions and loops
- reductions
- load balancing
- critical sections
- …

#include <stdint.h>

void dot_product(int64_t *z, int32_t x[], int32_t y[], int32_t N)
{
  int64_t sum = 0;

  #pragma omp parallel for reduction(+:sum)
  for (int32_t i = 0; i < N; ++i)
    sum += (x[i] * y[i]);

  *z = sum;
}

The OpenMP version generates the same lock-free reduction we did by hand…
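For reference, a minimal driver for that routine (a sketch, not from the slides; the array size and data values are arbitrary). With gcc/g++ the OpenMP pragmas are enabled by the -fopenmp flag:

// build: g++ -O2 -fopenmp dot.cpp -o dot
#include <cstdint>
#include <cstdio>
#include <vector>
#include <omp.h>

void dot_product(int64_t *z, int32_t x[], int32_t y[], int32_t N);   // from the slide above

int main()
{
    const int32_t N = 1000000;
    std::vector<int32_t> x(N, 1), y(N, 2);              // illustrative data

    int64_t z = 0;
    dot_product(&z, x.data(), y.data(), N);

    std::printf("dot product = %lld (max threads: %d)\n",
                (long long)z, omp_get_max_threads());
    return 0;
}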

By default you get static scheduling:
- the iteration space is divided evenly before execution
- more efficient, but assumes a uniform workload

void Mandelbrot()
{
  #pragma omp parallel for
  for (int row = 0; row < N; ++row)
  {
    ...
  }
}

Mandelbrot has a non-uniform distribution of work…

OpenMP also supports dynamic scheduling:
- the iteration space is divided into small pieces, assigned to threads dynamically
- slightly more overhead, but handles non-uniform workloads

void Mandelbrot()
{
  #pragma omp parallel for schedule(dynamic)
  for (int row = 0; row < N; ++row)
  {
    ...
  }
}

schedule(dynamic) divides the iteration space dynamically to load-balance.
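As an aside (not from the slides), the dynamic schedule also takes an optional chunk size, trading scheduling overhead against load balance. A sketch with an arbitrary chunk of 4 rows and a hypothetical compute_row helper:

#include <omp.h>

static void compute_row(int row)      // hypothetical stand-in for the per-row Mandelbrot work
{
    (void)row;                        // ... color the pixels of this row ...
}

void Mandelbrot(int N)
{
    #pragma omp parallel for schedule(dynamic, 4)   // hand out rows 4 at a time
    for (int row = 0; row < N; ++row)
        compute_row(row);             // expensive rows no longer stall a whole static block
}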

Matrix multiplication with parallel_for

#include <ppl.h>

//
// Naïve parallel solution using parallel_for: the result is structured parallelism,
// with a static division of the workload by row.
//
// for (int i = 0; i < N; i++)
Concurrency::parallel_for(0, N, [&](int i)
{
  for (int j = 0; j < N; j++)
  {
    C[i][j] = 0.0;

    for (int k = 0; k < N; k++)
      C[i][j] += (A[i][k] * B[k][j]);
  }
});

Very good!
- matrix multiplication is “embarrassingly parallel”
- linear speedup — 2x on 2 cores, 4x on 4 cores, …

Version      Cores   Time (secs)   Speedup
Sequential   1       30
Parallel

What's the other half of the chip? Cache! Are we using it effectively? We are not…

No one solves matrix multiplication with the naïve algorithm: it has horrible cache behavior. The hardware prefetches data on the assumption that the program will march through memory left-to-right or right-to-left; arrange your memory accesses to do this whenever possible…
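To make the point concrete, a small sketch (not from the talk): C/C++ stores 2-D data row-major, so a row-by-row walk touches consecutive addresses and benefits from prefetching, while a column-by-column walk of the same data strides through memory and misses the cache. For large N the first loop is typically several times faster.

#include <vector>

const int N = 2048;

double sum_row_major(const std::vector<double>& M)    // M holds N*N elements, row-major
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += M[i * N + j];                      // stride-1 access: prefetch-friendly
    return sum;
}

double sum_col_major(const std::vector<double>& M)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += M[i * N + j];                      // stride-N access: cache misses dominate
    return sum;
}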

for (int i = 0; i < N; i++)
  for (int j = 0; j < N; j++)
    C[i][j] = 0.0;

#pragma omp parallel for
for (int i = 0; i < N; i++)
  for (int k = 0; k < N; k++)
    for (int j = 0; j < N; j++)       // j innermost: B and C are walked row by row, A[i][k] is reused
      C[i][j] += (A[i][k] * B[k][j]);

Another factor of 2-10x improvement!

#pragma omp parallel for
for (int jj = 0; jj < N; jj += BS)          // for each column block:
{
  int jjEND = Min(jj + BS, N);

  // initialize:
  for (int i = 0; i < N; i++)
    for (int j = jj; j < jjEND; j++)
      C[i][j] = 0.0;

  // block multiply:
  for (int kk = 0; kk < N; kk += BS)        // for each row block:
  {
    int kkEND = Min(kk + BS, N);

    for (int i = 0; i < N; i++)
      for (int k = kk; k < kkEND; k++)
        for (int j = jj; j < jjEND; j++)
          C[i][j] += (A[i][k] * B[k][j]);
  }
}

Caching impacts all programs, sequential and parallel…

Version               Cores   Time (secs)   Speedup
Sequential: naïve     1       30
Sequential: blocked   1       3             10x
OpenMP: naïve
OpenMP: blocked

Parallelism alone is not enough…

HPC == Parallelism + Memory Hierarchy − Contention
- expose parallelism
- maximize data locality: network, disk, RAM, cache, core
- minimize interaction: false sharing, locking, synchronization (see the false-sharing sketch below)
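On the false-sharing point, a hedged C++/OpenMP sketch (not from the talk): per-thread counters packed next to each other share a cache line, so writes from different cores invalidate one another; padding each counter onto its own line (or accumulating in a local variable and merging at the end, as in the reduction pattern earlier) removes the contention.

#include <omp.h>

const int MAX_THREADS = 64;
const int CACHE_LINE  = 64;                       // bytes, a typical line size

struct PaddedCounter
{
    long value;
    char pad[CACHE_LINE - sizeof(long)];          // keep each counter on its own cache line
};

long count_events(int N)
{
    PaddedCounter counts[MAX_THREADS] = {};       // one padded slot per thread

    #pragma omp parallel
    {
        int me = omp_get_thread_num();

        #pragma omp for
        for (int i = 0; i < N; i++)
            counts[me].value++;                   // no false sharing between threads
    }

    long total = 0;
    for (int t = 0; t < MAX_THREADS; t++)         // merge the partial results
        total += counts[t].value;
    return total;
}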

Thanks for attending / listening.

Presenter: Joe Hummel
Materials: