
1 Joe Hummel, PhD. Microsoft MVP (Visual C++). Technical Staff: Pluralsight, LLC. Professor: U. of Illinois, Chicago. Email: joe@joehummel.net. Materials: http://www.joehummel.net/downloads.html

2
 Async programming: better responsiveness for unpredictable operations…
◦ GUIs (desktop, web, mobile)
◦ Cloud access (login, data, …)
◦ Disk and network I/O
 Parallel programming: better performance for compute-intensive workloads…
◦ Engineering
◦ Oil and Gas
◦ Pharma
◦ Science
◦ Social media / big data

3 Common solution: multithreading, i.e. running code on separate threads. The operating system rapidly switches the CPU from one thread to the other, so both execute and make forward progress…
[Diagram: a main thread runs the GUI and interacts with the user, while a worker thread executes the work (Stmt1; Stmt2; Stmt3;).]

4  Asian options financial modeling…

5 Issue: long-running event handlers pose a problem…
– If the current event handler takes a long time…
– … then the remaining events wait in the queue — the app feels unresponsive.
[Diagram: events queue up behind the current event being processed by the app.]

6 Async programming with threads… Attempt #1…

using System.Threading;

// original (synchronous) handler:
void button1_Click(…)
{
    var result = DoLongRunningOp();
    lstBox.Items.Add(result);
}

// Attempt #1: push the long-running op onto a worker thread:
Thread t = new Thread(() =>          // lambda expression:
{
    var result = DoLongRunningOp();
    lstBox.Items.Add(result);        // Boom! worker thread touches the UI
});
t.Start();

7 Attempt #2… UI thread owns the UI, so worker threads must delegate access…

Thread t = new Thread(() =>          // lambda expression:
{
    var result = DoLongRunningOp();
    this.Dispatcher.Invoke(() =>     // marshal the UI update back to the UI thread
    {
        lstBox.Items.Add(result);
    });
});
t.Start();

8 async / await: language-based solution in C#…

using System.Threading.Tasks;

async void button1_Click(…)          // async: method *may* perform an async, long-running op
{
    var result = await Task.Run(() => DoLongRunningOp());
    lstBox.Items.Add(result);
}

await tells the compiler "don't wait for this task to finish" — set aside the code that follows so it happens later — and just return. No blocking!

9 Threads are…
◦ expensive to create (involves the OS)
◦ easy to over-subscribe (i.e. create too many)
◦ an obfuscation (the intent of the code is harder to recognize)
◦ tedious to program
 Lambda expressions in C#/C++/Java make threading slightly better
 Exception handling is particularly difficult (see the sketch below)
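One reason exception handling is painful with raw threads; a minimal sketch, where DoLongRunningOp is a stand-in that simply throws, not the talk's actual operation: an exception raised on a worker Thread cannot be caught by the method that started the thread, whereas an awaited Task re-throws its exception at the await, so an ordinary try/catch works.

using System;
using System.Threading;
using System.Threading.Tasks;

class ExceptionHandlingSketch
{
    // stand-in for a long-running operation that fails
    static int DoLongRunningOp() => throw new InvalidOperationException("op failed");

    static async Task Main()
    {
        // Task + await: the exception is stored in the Task and re-thrown
        // here when awaited, so this catch block runs.
        try
        {
            int result = await Task.Run(() => DoLongRunningOp());
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine("caught from task: " + ex.Message);
        }

        // Raw thread: this try/catch never observes the exception; it is
        // thrown on the worker thread, and by default an unhandled exception
        // on any thread terminates the process.
        try
        {
            var t = new Thread(() => DoLongRunningOp());
            t.Start();
            t.Join();
        }
        catch (InvalidOperationException)
        {
            Console.WriteLine("never reached");
        }
    }
}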

10  Asian options financial modeling…

11  Parallel programming to speed up the simulation? Parallel.For:

// for(…)
Parallel.For(0, sims, (index) =>
{
    ...
});
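A minimal, self-contained sketch of the loop above; the simulation body here is a made-up stand-in, not the talk's Asian-options model: each iteration runs one independent simulation, and Parallel.For partitions the iteration space across worker threads.

using System;
using System.Linq;
using System.Threading.Tasks;

class ParallelForSketch
{
    static void Main()
    {
        int sims = 100_000;
        double[] payoffs = new double[sims];

        // for (int index = 0; index < sims; index++)   // sequential version
        Parallel.For(0, sims, (index) =>
        {
            // hypothetical per-simulation work; each iteration writes only
            // its own slot, so no locking is needed here
            var rand = new Random(index);
            payoffs[index] = RunOneSimulation(rand);
        });

        Console.WriteLine($"mean payoff: {payoffs.Average()}");
    }

    // stand-in for one Monte Carlo path; not the talk's actual model
    static double RunOneSimulation(Random rand) => rand.NextDouble();
}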

12 Structured ("Fork-Join") Parallelism (using System.Threading.Tasks)

Sequential:
for (int i = 0; i < N; i++)
{
    ...
}

Parallel:
Parallel.For(0, N, (i) =>
{
    ...
});

[Diagram: the parallel loop forks worker threads at the start of the iteration space and joins them at the end.]

13 Task-based execution model
[Diagram: a Parallel.For(…) call is decomposed into tasks. Inside a Windows process (.NET, one or more app domains), the Task Parallel Library's Task Scheduler and Resource Manager place tasks on a global work queue, which is serviced by worker threads from the .NET thread pool, scheduled by Windows.]
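The same machinery is visible if tasks are created directly; a minimal sketch (the work items are placeholders): each Task.Run queues a delegate onto the thread pool's work queue, and the returned Task is the handle used to wait on the result.

using System;
using System.Threading;
using System.Threading.Tasks;

class TaskSketch
{
    static void Main()
    {
        // Each Task.Run queues one work item; a pool worker thread picks it up.
        Task<int>[] tasks = new Task<int>[4];
        for (int i = 0; i < tasks.Length; i++)
        {
            int id = i;   // capture a stable copy for the lambda
            tasks[i] = Task.Run(() =>
            {
                Console.WriteLine($"task {id} on pool thread {Thread.CurrentThread.ManagedThreadId}");
                return id * id;   // placeholder "work"
            });
        }

        Task.WaitAll(tasks);   // join: block until every task has completed

        foreach (var t in tasks)
            Console.WriteLine(t.Result);
    }
}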

14 Where's the other data race?

double sum = 0.0;
string sumLock = "sumLock";

// for(…)
Parallel.For(0, sims, (index) =>
{
    lock (sumLock)
    {
        sum += callPayOff;
    }
});
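Whatever the second race turns out to be, locking on a string literal is itself fragile in C#: string literals are interned, so unrelated code that locks on the same literal ends up sharing the same lock. A minimal sketch of the conventional alternative (class and method names are illustrative):

using System;
using System.Threading.Tasks;

class LockObjectSketch
{
    // a private, dedicated lock object; nothing outside this class can lock on it
    private static readonly object sumLock = new object();
    private static double sum = 0.0;

    static void Accumulate(int sims, Func<int, double> payoff)
    {
        Parallel.For(0, sims, (index) =>
        {
            double callPayOff = payoff(index);   // per-iteration local work
            lock (sumLock)                       // serialize only the shared update
            {
                sum += callPayOff;
            }
        });
    }
}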

15  Various solutions… (least preferred to most preferred)
 use synchronization (e.g. locking)
 use thread-safe entities (e.g. parallel collections; a small sketch follows below)
 redesign to eliminate (e.g. reduction)
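As a sketch of the middle option (the collection choice here is just one possibility, and the per-iteration work is a placeholder), System.Collections.Concurrent provides collections that are safe to update from parallel iterations without explicit locks.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class ConcurrentCollectionSketch
{
    static void Main()
    {
        int sims = 10_000;

        // each iteration adds its result to a thread-safe bag; no lock needed
        var payoffs = new ConcurrentBag<double>();

        Parallel.For(0, sims, (index) =>
        {
            double callPayOff = new Random(index).NextDouble();  // placeholder work
            payoffs.Add(callPayOff);
        });

        double sum = 0.0;
        foreach (var p in payoffs) sum += p;
        Console.WriteLine($"sum = {sum}");
    }
}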

16  Reduction is a common parallel design pattern:
◦ Each task computes its own, local result — no shared resource
◦ Merge ("reduce") results at the end — minimal locking

Parallel.For(0, sims,
    () => { return new TLS(); },        // init: create thread-local storage
    (index, control, tls) =>            // loop body: one simulation
    {
        ...
        tls.sum += callPayOff;
        return tls;
    },
    (tls) =>                            // thread has finished: integrate partial result
    {
        lock (sumLock) { sum += tls.sum; }
    }
);
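An alternative formulation of the same reduction (a sketch, assuming the payoff for one simulation can be computed as a pure function of its index; CallPayOff is a placeholder, not the talk's model) is PLINQ's built-in Sum, which performs the per-thread partial sums and the final merge internally.

using System;
using System.Linq;

class PlinqReductionSketch
{
    // stand-in for one simulation's payoff
    static double CallPayOff(int index) => new Random(index).NextDouble();

    static void Main()
    {
        int sims = 100_000;

        // per-thread partial sums and the final merge are handled by PLINQ
        double sum = ParallelEnumerable.Range(0, sims)
                                       .Select(index => CallPayOff(index))
                                       .Sum();

        Console.WriteLine($"sum = {sum}");
    }
}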

17 State of mainstream parallel programming
Language / Support for Parallelism / Technologies:
◦ C: No; use Pthreads or other library
◦ C++ (before 2011): No; use Pthreads or other library
◦ C++14: Minimal; built-in support for threads, async
◦ Java: Better; Threads, Tasks, Fork/Join, parallel data structures
◦ C#: Better++; Threads, Tasks, Async, parallel loops and data structures

18 Other options for parallel performance?
 libraries: MPI, TBB, Boost, Actors, PLINQ, TPL, PPL, Pthreads, Thrust, …
 language extensions: OpenMP, TBB (Threading Building Blocks), AMP, OpenACC, …
 parallel languages:
◦ CPU-based: Chapel, X10, High Performance Fortran, …
◦ GPU-based: CUDA, OpenCL, …

19  Mandelbrot with OpenMP

20 OpenMP == Open Multiprocessing (multithreading)
◦ an open standard for platform-neutral multithreading
◦ very popular, with widespread support in most compilers (e.g. gcc 4.2)
◦ programmer directs parallelization via code annotations
◦ compiler implements

Sequential:
sum = 0.0;
for (int i = 0; i < N; ++i)
    sum = sum + A[i];

OpenMP:
sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < N; ++i)
    sum = sum + A[i];

21  OpenMP supports:
◦ parallel regions and loops
◦ reductions
◦ load balancing
◦ critical sections
◦ …

void dot_product(int64 *z, int32 x[], int32 y[], int32 N)
{
    int64 sum = 0;

    #pragma omp parallel for reduction(+:sum)
    for (int32 i = 0; i < N; ++i)
        sum += (x[i] * y[i]);

    *z = sum;
}

The OpenMP version generates the same lock-free reduction we did by hand…

22  By default you get static scheduling…
◦ iteration space is divided evenly before execution
◦ more efficient, but assumes a uniform workload

void Mandelbrot()
{
    #pragma omp parallel for
    for (int row = 0; row < N; ++row)
    {
        ...
    }
}

Mandelbrot has a non-uniform distribution of work…

23  OpenMP also supports dynamic scheduling…
◦ iteration space is divided into small pieces, assigned dynamically
◦ slightly more overhead, but handles non-uniform workloads

void Mandelbrot()
{
    #pragma omp parallel for schedule(dynamic)   // divide iteration space dynamically to load-balance
    for (int row = 0; row < N; ++row)
    {
        ...
    }
}

24  Matrix Multiplication with parallel_for

25
#include <ppl.h>   // Microsoft PPL: Concurrency::parallel_for

//
// Naïve parallel solution using parallel_for: the result is structured parallelism,
// with static division of the workload by row.
//
// for (int i = 0; i < N; i++)
Concurrency::parallel_for(0, N, [&](int i)
{
    for (int j = 0; j < N; j++)
    {
        C[i][j] = 0.0;
        for (int k = 0; k < N; k++)
            C[i][j] += (A[i][k] * B[k][j]);
    }
});

26  Very good!
◦ matrix multiplication is "embarrassingly parallel"
◦ linear speedup — 2x on 2 cores, 4x on 4 cores, …

Version / Cores / Time (secs) / Speedup:
◦ Sequential: 1 core, 30 secs
◦ Parallel: 4 cores, 7.6 secs, 3.9x

27  What's the other half of the chip?
◦ cache!
 Are we using it effectively?
◦ we are not…
[Diagram: memory hierarchy with cache.]

28  No one solves MM using the naïve algorithm
◦ horrible cache behavior
[Diagram: the hardware prefetches data assuming the program will go left -> right or right -> left through memory; do this whenever possible…]
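A minimal sketch of the access-order effect, here in C# (whose rectangular arrays are laid out row-major, like C's); the matrix size is arbitrary and the timings depend on the machine: summing a matrix row by row walks memory left to right, which the prefetcher rewards, while summing it column by column jumps a full row's worth of bytes on every step.

using System;
using System.Diagnostics;

class TraversalOrderSketch
{
    static void Main()
    {
        const int N = 2000;
        var a = new double[N, N];

        // Row-major traversal: consecutive iterations touch adjacent memory.
        var sw = Stopwatch.StartNew();
        double sum1 = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum1 += a[i, j];
        Console.WriteLine($"row order:    {sw.ElapsedMilliseconds} ms");

        // Column-major traversal: consecutive iterations are a full row apart,
        // defeating the cache and the hardware prefetcher.
        sw.Restart();
        double sum2 = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum2 += a[i, j];
        Console.WriteLine($"column order: {sw.ElapsedMilliseconds} ms");
    }
}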

29
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        C[i][j] = 0.0;

#pragma omp parallel for
for (int i = 0; i < N; i++)
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++)
            C[i][j] += (A[i][k] * B[k][j]);

Another factor of 2-10x improvement!

30
#pragma omp parallel for
for (int jj = 0; jj < N; jj += BS)          // for each column block:
{
    int jjEND = Min(jj + BS, N);

    // initialize:
    for (int i = 0; i < N; i++)
        for (int j = jj; j < jjEND; j++)
            C[i][j] = 0.0;

    // block multiply:
    for (int kk = 0; kk < N; kk += BS)      // for each row block:
    {
        int kkEND = Min(kk + BS, N);

        for (int i = 0; i < N; i++)
            for (int k = kk; k < kkEND; k++)
                for (int j = jj; j < jjEND; j++)
                    C[i][j] += (A[i][k] * B[k][j]);
    }
}

31  Caching impacts all programs, sequential & parallel…

Version / Cores / Time (secs) / Speedup:
◦ Sequential, naïve: 1 core, 30 secs
◦ Sequential, blocked: 1 core, 3 secs, 10x
◦ OpenMP, naïve: 4 cores, 7.6 secs, 3.9x
◦ OpenMP, blocked: 4 cores, 0.8 secs, 37.5x

32  Parallelism alone is not enough…

HPC == Parallelism + Memory Hierarchy - Contention
◦ Expose parallelism
◦ Maximize data locality: network, disk, RAM, cache, core
◦ Minimize interaction: false sharing (sketched below), locking, synchronization
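A minimal sketch of the false-sharing point (thread count, iteration count, and padding are illustrative): when several threads repeatedly update counters that sit next to each other in memory, the counters share a cache line, and every update invalidates that line in the other cores' caches even though no data is logically shared; spacing the counters a cache line apart (or accumulating in locals) removes the contention.

using System;
using System.Diagnostics;
using System.Threading.Tasks;

class FalseSharingSketch
{
    const int Threads = 4;
    const long Iterations = 20_000_000;

    static void Main()
    {
        // Adjacent slots: all four counters live on the same cache line.
        long[] packed = new long[Threads];
        var sw = Stopwatch.StartNew();
        Parallel.For(0, Threads, t =>
        {
            for (long i = 0; i < Iterations; i++)
                packed[t]++;                     // threads fight over one cache line
        });
        Console.WriteLine($"adjacent counters: {sw.ElapsedMilliseconds} ms");

        // Spread slots: pad each counter to its own cache line (64 bytes = 8 longs).
        long[] padded = new long[Threads * 8];
        sw.Restart();
        Parallel.For(0, Threads, t =>
        {
            for (long i = 0; i < Iterations; i++)
                padded[t * 8]++;                 // each thread owns its own cache line
        });
        Console.WriteLine($"padded counters:   {sw.ElapsedMilliseconds} ms");
    }
}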

33  Thanks for attending / listening
 Presenter: Joe Hummel
◦ Email: joe@joehummel.net
◦ Materials: http://www.joehummel.net/downloads.html

