Uses some of the slides for chapters 7 and 9 accompanying “Introduction to Parallel Computing”, Addison Wesley, 2003.


Uses some of the slides for chapters 7 and 9 accompanying “Introduction to Parallel Computing”, Addison Wesley, 2003, and some slides from OpenMP Usage / Orna Agmon Ben-Yehuda.

OpenMP
- Shared memory parallel applications API for C, C++ and Fortran.
- High-level directives for thread manipulation.
- OpenMP directives provide support for concurrency, synchronization, and data handling.
- Based on the #pragma compiler directive.

Reminder – Quicksort
- Choose one of the elements to be the pivot.
- Rearrange the list so that all smaller elements are before the pivot, and all larger ones are after it.
- After rearrangement, the pivot is in its final location!
- Recursively rearrange the smaller and larger elements.
- Divide and conquer algorithm – suitable for parallel execution.

Serial Quicksort code

```c
void quicksort(int* values, int from, int to) {
    int pivotVal, i, j;
    if (from < to) {
        pivotVal = values[from];
        i = from;
        for (j = from + 1; j < to; j++) {
            if (values[j] <= pivotVal) {
                i = i + 1;
                swap(values[i], values[j]);
            }
        }
        swap(values[from], values[i]);
        quicksort(values, from, i);
        quicksort(values, i + 1, to);
    }
}
```

Naïve parallel Quicksort – Try 1

```c
void quicksort(int* values, int from, int to) {
    int pivotVal, i, j;
    if (from < to) {
        pivotVal = values[from];
        i = from;
        for (j = from + 1; j < to; j++) {
            if (values[j] <= pivotVal) {
                i = i + 1;
                swap(values[i], values[j]);
            }
        }
        swap(values[from], values[i]);
        #pragma omp parallel sections
        {
            #pragma omp section
            quicksort(values, from, i);
            #pragma omp section
            quicksort(values, i + 1, to);
        }
    }
}
```

The sections are performed in parallel. Without omp_set_nested(1), no threads will be created recursively. Is this efficient?

Naïve parallel Quicksort – Try 2

```c
void quicksort(int* values, int from, int to) {
    int pivotVal, i, j, create_thread;
    int max_threads = omp_get_max_threads();
    static int threads_num = 1;
    if (from < to) {
        // … choose pivot and do swaps …
        #pragma omp atomic
        create_thread = (threads_num++ < max_threads);
        if (create_thread) {
            #pragma omp parallel sections
            {
                #pragma omp section
                quicksort(values, from, i);
                #pragma omp section
                quicksort(values, i + 1, to);
            }
            #pragma omp atomic
            threads_num--;
        } else {
            #pragma omp atomic
            threads_num--;
            quicksort(values, from, i);
            quicksort(values, i + 1, to);
        }
    }
}
```

Does this compile? What is the atomic action?

Naïve parallel Quicksort – Try 3

```c
void quicksort(int* values, int from, int to) {
    int pivotVal, i, j, create_thread;
    if (from < to) {
        // … choose pivot and do swaps …
        create_thread = change_threads_num(1);
        if (create_thread) {
            #pragma omp parallel sections
            {
                #pragma omp section
                quicksort(values, from, i);
                #pragma omp section
                quicksort(values, i + 1, to);
            }
            change_threads_num(-1);
        } else {
            quicksort(values, from, i);
            quicksort(values, i + 1, to);
        }
    }
}

int change_threads_num(int change) {
    static int threads_num = 0;
    static int max_threads = omp_get_max_threads();
    int do_change;
    #pragma omp critical
    {
        do_change = (threads_num += change) < max_threads;
        threads_num = min(max_threads, threads_num);
    }
    return do_change;
}
```

Note that we can't return from inside the critical block. Is this always better than non-atomic actions? What goes wrong if the thread count is inaccurate?

Breaking Out
- Breaking out of parallel loops is not allowed and will not compile.
- A continue can replace many breaks, depending on the probabilities known to the programmer.
- An exit via a wrapper function will not be caught at compile time, but the results are undefined.
- If the exit data is important, it is better to collect status in a variable and exit after the loop. Handle the possibility that more than one thread reaches a certain error state.
- Be careful in other constructs as well (e.g. barriers).

Lessons from naïve Quicksort
- Control the number of threads when creating them recursively.
- No compare-and-swap operations in OpenMP 
- Critical sections are relatively expensive.
- An implicit flush occurs after barriers, critical sections, and atomic operations (only for the accessed address).

Efficient Quicksort
- The naïve implementation performs the partitioning serially.
- We would like to parallelize the partitioning. In each step:
  - Pivot selection
  - Local rearrangement (why?)
  - Global rearrangement

Efficient global rearrangement
- Without efficient global rearrangement, the partitioning is still serial.
- Each processor keeps track of the numbers smaller (Si) and larger (Li) than the pivot.
- Then we sum prefixes to find the location of each value.
- The prefix sum can be done in parallel (PPC).

Code for the brave…

```c
void p_quicksort(int* A, int n) {
    struct proc *d;
    int i, p, s, ss, p_loc, done = 0;
    int *temp, *next, *cur = A;
    next = temp = malloc(sizeof(int) * n);
    /* Block executed by all threads in parallel */
    #pragma omp parallel shared(next, done, d, p) private(i, s, ss, p_loc)
    {
        /* Use the thread ID and thread count to divide the work */
        int c = omp_get_thread_num();
        p = omp_get_num_threads();
        /* Some work must be done only once; an implicit barrier follows */
        #pragma omp single
        d = malloc(sizeof(struct proc) * p);
        d[c].start = c * n / p;
        d[c].end = (c + 1) * n / p - 1;
        d[c].sec = d[c].done = 0;
        do {
            s = d[c].start;
            if (c == d[c].sec && !d[c].done) {
                int p_loc = quick_select(cur + s, d[c].end - s + 1);
                swap(cur[s], cur[s + p_loc]); /* Set pivot at start */
            }
            ss = d[d[c].sec].start; /* pivot location */
            /* Wait for all threads to arrive before moving on */
            #pragma omp barrier
            /* Local rearrangement */
            for (i = d[c].start; i <= d[c].end; i++)
                if (cur[i] <= cur[ss]) {
                    swap(cur[s], cur[i]);
                    s++;
                }
            /* … continued on the next slide … */
```

More code for the brave…

```c
            /* Prefix sum */
            d[c].s_i = max(s - d[c].start, 0);
            d[c].l_i = max(d[c].end - s + 1, 0);
            #pragma omp barrier
            if (c == d[c].sec)
                d[c].p_loc = s_prefix_sum(d, d[c].sec, p) + d[c].start - 1;
            /* Global rearrangement */
            #pragma omp barrier
            p_loc = d[c].p_loc = d[d[c].sec].p_loc;
            memcpy(next + ss + d[c].s_i, cur + d[c].start,
                   sizeof(int) * (s - d[c].start));
            memcpy(next + p_loc + d[c].l_i + 1, cur + s,
                   sizeof(int) * (d[c].end - s + 1));
            #pragma omp barrier
            /* Put pivot in place */
            if (d[c].sec == c && !d[c].done) {
                swap(next[ss], next[p_loc]);
                cur[p_loc] = next[p_loc]; /* the pivot in its final location */
            }
            #pragma omp barrier
            #pragma omp single
            {
                done = p_rebalance(d, p);
                swap(cur, next);
            }
        } while (!done);
        /* One more iteration sequentially */
        seq_quicksort(cur + d[c].start, 0, d[c].end - d[c].start + 1);
        #pragma omp barrier
        if (A != cur)
            memcpy(A + n*c/p, cur + n*c/p,
                   sizeof(int) * (n*(c+1)/p - n*c/p));
    } /* Out of the parallel block – implicit barrier */
    free(d);
    free(temp);
}
```

Lessons from efficient Quicksort
- An implicit barrier takes place at the entry and exit of the parallel construct, and at the exit of the work-sharing constructs (for, sections) and of the single construct.
- Sometimes we need to rethink parallel algorithms and make them different from the serial ones.
- Algorithms should be chosen by actual performance, not only by complexity.
- Pivot selection is more important in the efficient Quicksort.

Moving data between threads
- Barriers may be overkill (overhead…).
- The OpenMP flush (list) directive can be used for specific variables:
  1. Write the variable on thread A
  2. Flush the variable on thread A
  3. Flush the variable on thread B
  4. Read the variable on thread B
- Parameters are flushed in the order of the list.
- More about the OpenMP memory model later in this course.

Backup

Efficient Quicksort – last bits

```c
int p_rebalance(struct proc* d, int p) {
    int c, new_sec, sec, done = 1;
    new_sec = 1;
    for (c = 0; c < p; c++) {
        d[c].sec = sec = new_sec ? c : max(sec, d[c].sec);
        new_sec = 0;
        if (c+1 < p && d[c].p_loc >= d[c].start && d[c].p_loc <= d[c].end) {
            d[c+1].start = d[c].p_loc + 1;
            d[c].end = d[c].p_loc - 1;
            new_sec = 1;
        }
        if (c+2 < p && d[c].p_loc >= d[c+1].start && d[c].p_loc <= d[c+1].end &&
            d[c+2].end - d[c].p_loc > d[c].p_loc - d[c].start) {
            d[c+1].start = d[c].p_loc + 1;
            d[c].end = d[c].p_loc - 1;
            new_sec = 1;
        }
        if (d[c].end < d[c].start) {
            d[c].sec = c;
            new_sec = 1;
            d[c].done = 1;
        } else if (d[c].end > d[sec].start && d[c].sec != c)
            done = 0;
    }
    return done;
}
```

This is the serial implementation. It could also be done in parallel.