ECE 1747H: Parallel Programming Lecture 2-3: More on parallelism and dependences -- synchronization.

ECE 1747H: Parallel Programming Lecture 2-3: More on parallelism and dependences -- synchronization

Synchronization All programming models give the user the ability to control the ordering of events on different processors. This facility is called synchronization.

Example 1 f() { a = 1; b = 2; c = 3;} g() { d = 4; e = 5; f = 6; } main() { f(); g(); } No dependences between f and g. Thus, f and g can be run in parallel.

Example 2 f() { a = 1; b = 2; c = 3; } g() { a = 4; b = 5; c = 6; } main() { f(); g(); } Dependences between f and g. Thus, f and g cannot be run in parallel.

Example 2 (continued) f() { a = 1; b = 2; c = 3; } g() { a = 4 ; b = 5; c = 6; } main() { f(); g(); } Dependences are between assignments to a, assignments to b, assignments to c. No other dependences. Therefore, we only need to enforce these dependences.

Synchronization Facility Suppose we had a set of primitives, signal(x) and wait(x). wait(x) blocks unless a signal(x) has occurred. signal(x) does not block, but causes a wait(x) to unblock, or causes a future wait(x) not to block.

Example 2 (continued) f() { a = 1; b = 2; c = 3; } g() { a = 4; b = 5; c = 6; } main() { f(); g(); } f() { a = 1; signal(e_a); b = 2; signal(e_b); c = 3; signal(e_c); } g() { wait(e_a); a = 4; wait(e_b); b = 5; wait(e_c); c = 6; } main() { f(); g(); }
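
A minimal runnable sketch of this version, assuming POSIX threads and unnamed semaphores as stand-ins for the signal()/wait() primitives (sem_post plays the role of signal, sem_wait the role of wait; this mapping is an assumption of the sketch, not something the slides prescribe):

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    int a, b, c;
    sem_t e_a, e_b, e_c;               /* one event per covered dependence */

    void *f(void *arg) {
        a = 1; sem_post(&e_a);         /* signal(e_a) */
        b = 2; sem_post(&e_b);         /* signal(e_b) */
        c = 3; sem_post(&e_c);         /* signal(e_c) */
        return NULL;
    }

    void *g(void *arg) {
        sem_wait(&e_a); a = 4;         /* wait(e_a) */
        sem_wait(&e_b); b = 5;         /* wait(e_b) */
        sem_wait(&e_c); c = 6;         /* wait(e_c) */
        return NULL;
    }

    int main(void) {
        pthread_t tf, tg;
        sem_init(&e_a, 0, 0); sem_init(&e_b, 0, 0); sem_init(&e_c, 0, 0);
        pthread_create(&tf, NULL, f, NULL);
        pthread_create(&tg, NULL, g, NULL);
        pthread_join(tf, NULL);
        pthread_join(tg, NULL);
        printf("a=%d b=%d c=%d\n", a, b, c);   /* always 4 5 6 */
        return 0;
    }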

Example 2 (continued) With the synchronization in place, f and g run on different processors: f executes a = 1; signal(e_a); b = 2; signal(e_b); c = 3; signal(e_c); while g executes wait(e_a); a = 4; wait(e_b); b = 5; wait(e_c); c = 6; Execution is (mostly) parallel and correct. Dependences are “covered” by synchronization.

About synchronization Synchronization is necessary to make some programs execute correctly in parallel. However, synchronization is expensive. Therefore, it needs to be reduced, or sometimes we need to give up on parallelism.

Example 3 f() { a=1; b=2; c=3; } g() { d=4; e=5; a=6; } main() { f(); g(); } f() { a=1; signal(e_a); b=2; c=3; } g() { d=4; e=5; wait(e_a); a=6; } main() { f(); g(); }

Example 4 for( i=1; i<100; i++ ) { a[i] = …; …; … = a[i-1]; } Loop-carried dependence on a; the loop cannot be run in parallel as written.

Example 4 (continued) for( i=...; i<...; i++ ) { a[i] = …; signal(e_a[i]); …; wait(e_a[i-1]); … = a[i-1]; }

Example 4 (continued) Note that here it matters which iterations are assigned to which processor. It does not matter for correctness, but it matters for performance. Cyclic assignment is probably best.
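
A hedged sketch of the synchronized loop with a cyclic assignment of iterations to threads, assuming pthreads and one semaphore per iteration for the e_a[i] events (NTHREADS, worker, and the stand-in computations are illustrative names, not from the slides):

    #include <pthread.h>
    #include <semaphore.h>
    #include <stdio.h>

    #define N        100
    #define NTHREADS 4                  /* illustrative thread count */

    double a[N], b[N];
    sem_t  ev[N];                       /* ev[i] == "a[i] has been written" */

    void *worker(void *arg) {
        long tid = (long)arg;
        /* cyclic assignment: thread tid runs iterations tid+1, tid+1+NTHREADS, ... */
        for (long i = tid + 1; i < N; i += NTHREADS) {
            a[i] = 2.0 * i;             /* stands for "a[i] = ..." */
            sem_post(&ev[i]);           /* signal(e_a[i]) */
            sem_wait(&ev[i - 1]);       /* wait(e_a[i-1]) */
            b[i] = a[i - 1] + 1.0;      /* stands for "... = a[i-1]" */
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        for (int i = 0; i < N; i++) sem_init(&ev[i], 0, 0);
        a[0] = 0.0;
        sem_post(&ev[0]);               /* a[0] is ready before the loop starts */
        for (long p = 0; p < NTHREADS; p++)
            pthread_create(&t[p], NULL, worker, (void *)p);
        for (int p = 0; p < NTHREADS; p++)
            pthread_join(t[p], NULL);
        printf("b[N-1] = %f\n", b[N - 1]);
        return 0;
    }

With the cyclic assignment, iteration i+1 usually lives on a different processor than iteration i and the processors advance roughly in lockstep, so each wait(e_a[i-1]) is typically already satisfied when it is reached.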

Example 5 for( i=0; i<100; i++ ) a[i] = f(i); x = g(a); for( i=0; i<100; i++ ) b[i] = x + h( a[i] ); First loop can be run in parallel. Middle statement is sequential. Second loop can be run in parallel.

Example 5 (continued) We will need to make parallel execution stop after the first loop and resume at the beginning of the second loop. Two (standard) ways of doing that: –fork() - join() –barrier synchronization

Fork-Join Synchronization fork() causes a number of processes to be created and to be run in parallel. join() causes all these processes to wait until all of them have executed a join().
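
One way to realize fork()/join() on top of pthreads (these are the lecture's generic primitives, not the UNIX fork() system call; fork_threads, join_threads and NTHREADS are illustrative names):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4                          /* illustrative thread count */

    static pthread_t workers[NTHREADS];

    /* "fork": start NTHREADS copies of body(), each given its thread id as argument. */
    void fork_threads(void *(*body)(void *)) {
        for (long p = 0; p < NTHREADS; p++)
            pthread_create(&workers[p], NULL, body, (void *)p);
    }

    /* "join": block until every forked thread has finished. */
    void join_threads(void) {
        for (int p = 0; p < NTHREADS; p++)
            pthread_join(workers[p], NULL);
    }

    void *body(void *arg) {                     /* example body: each thread reports its id */
        printf("thread %ld running\n", (long)arg);
        return NULL;
    }

    int main(void) {
        fork_threads(body);                     /* fork() */
        join_threads();                         /* join(): continue only after all threads are done */
        return 0;
    }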

Example 5 (continued) fork(); for( i=...; i<...; i++ ) a[i] = f(i); join(); x = g(a); fork(); for( i=...; i<...; i++ ) b[i] = x + h( a[i] ); join();
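
The same structure expressed in OpenMP, where each parallel for plays the role of fork() and the implicit barrier at the end of the loop plays the role of join() (f, g and h are the slide's placeholder functions; trivial stand-ins are used here so the sketch compiles):

    #include <stdio.h>

    #define N 100

    double a[N], b[N], x;

    double f(int i)           { return (double)i; }    /* stand-in */
    double h(double v)        { return 2.0 * v; }      /* stand-in */
    double g(const double *v) {                        /* stand-in: sum of the array */
        double s = 0.0;
        for (int i = 0; i < N; i++) s += v[i];
        return s;
    }

    int main(void) {
        #pragma omp parallel for        /* "fork"; implicit "join" at the end of the loop */
        for (int i = 0; i < N; i++)
            a[i] = f(i);

        x = g(a);                       /* sequential section */

        #pragma omp parallel for        /* "fork" again; implicit "join" */
        for (int i = 0; i < N; i++)
            b[i] = x + h(a[i]);

        printf("b[0] = %f\n", b[0]);
        return 0;
    }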

Example 6 sum = 0.0; for( i=0; i<100; i++ ) sum += a[i]; Iterations have dependence on sum. Cannot be parallelized, but...

Example 6 (continued) for( k=0; k<...; k++ ) sum[k] = 0.0; fork(); for( j=…; j<…; j++ ) sum[k] += a[j]; join(); sum = 0.0; for( k=0; k<...; k++ ) sum += sum[k];
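
Filled in as a runnable sketch, assuming OpenMP supplies the fork/join and the per-processor index k; the per-processor array is named partial here to keep it distinct from the final scalar sum:

    #include <omp.h>
    #include <stdio.h>

    #define N 100

    double a[N];

    int main(void) {
        for (int i = 0; i < N; i++) a[i] = 1.0;           /* example data */

        int    P = omp_get_max_threads();
        double partial[P];                                /* one slot per processor */
        for (int k = 0; k < P; k++) partial[k] = 0.0;

        #pragma omp parallel                              /* "fork" */
        {
            int k = omp_get_thread_num();
            #pragma omp for
            for (int j = 0; j < N; j++)
                partial[k] += a[j];                       /* no race: each thread owns its slot */
        }                                                 /* "join" */

        double sum = 0.0;
        for (int k = 0; k < P; k++) sum += partial[k];    /* sequential combine */
        printf("sum = %f\n", sum);                        /* 100.0 */
        return 0;
    }

(In a real code the partial[] slots would be padded to separate cache lines to avoid false sharing.)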

Reduction This pattern is very common. Many parallel programming systems have explicit support for it, called reduction. sum = reduce( +, a, 0, 100 );
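
In OpenMP this is the reduction clause; the slide's pseudo-call sum = reduce( +, a, 0, 100 ) corresponds roughly to:

    #include <stdio.h>

    int main(void) {
        double a[100], sum = 0.0;
        for (int i = 0; i < 100; i++) a[i] = 1.0;

        /* each thread accumulates a private copy of sum; the copies are combined at the implicit join */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < 100; i++)
            sum += a[i];

        printf("sum = %f\n", sum);   /* 100.0 */
        return 0;
    }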

Final word on synchronization Many different synchronization constructs exist in different programming models. Dependences have to be “covered” by appropriate synchronization. Synchronization is often expensive.

ECE 1747H: Parallel Programming Lecture 2-3: Data Parallelism

Previously Ordering of statements. Dependences. Parallelism. Synchronization.

Goal of next few lectures Standard patterns of parallel programs. Examples of each. Later, code examples in various programming models.

Flavors of Parallelism Data parallelism: all processors do the same thing on different data. –Regular –Irregular Task parallelism: processors do different tasks. –Task queue –Pipelines

Data Parallelism Essential idea: each processor works on a different part of the data (usually in one or more arrays). Regular or irregular data parallelism: using linear or non-linear indexing. Examples: MM (regular), SOR (regular), MD (irregular).

Matrix Multiplication Multiplication of two n by n matrices A and B into a third n by n matrix C.

Matrix Multiply for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) c[i][j] = 0.0; for( i=0; i<n; i++ ) for( j=0; j<n; j++ ) for( k=0; k<n; k++ ) c[i][j] += a[i][k]*b[k][j];

Parallel Matrix Multiply No loop-carried dependences in i- or j-loop. Loop-carried dependence on k-loop. All i- and j-iterations can be run in parallel.

Parallel Matrix Multiply (contd.) If we have P processors, we can give n/P rows or columns to each processor. Or, we can divide the matrix in P squares, and give each processor one square.
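
A hedged OpenMP sketch of the row-wise version: parallelizing the outer i-loop gives each processor a block of rows of c (the matrix size and initial values are arbitrary, chosen only to make the sketch self-contained):

    #include <stdio.h>

    #define n 256

    static double a[n][n], b[n][n], c[n][n];

    int main(void) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) { a[i][j] = 1.0; b[i][j] = 2.0; }

        /* i-iterations are independent, so rows of c are distributed over the threads */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                c[i][j] = 0.0;
                for (int k = 0; k < n; k++)   /* the k-loop stays sequential: it carries the dependence on c[i][j] */
                    c[i][j] += a[i][k] * b[k][j];
            }

        printf("c[0][0] = %f\n", c[0][0]);    /* 2.0 * n */
        return 0;
    }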

Data Distribution: Examples BLOCK DISTRIBUTION

Data Distribution: Examples BLOCK DISTRIBUTION BY ROW

Data Distribution: Examples BLOCK DISTRIBUTION BY COLUMN

Data Distribution: Examples CYCLIC DISTRIBUTION BY COLUMN

Data Distribution: Examples BLOCK CYCLIC

Data Distribution: Examples COMBINATIONS
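
The distributions above differ only in how an index is mapped to a processor. A small sketch of the owner computation for the one-dimensional (by-row or by-column) cases, assuming P processors and n rows or columns (the function names are illustrative):

    #include <stdio.h>

    /* Which of the P processors owns row (or column) i, out of n in total? */

    int owner_block(int i, int n, int P) {
        int chunk = (n + P - 1) / P;         /* ceiling(n / P) contiguous rows per processor */
        return i / chunk;
    }

    int owner_cyclic(int i, int P) {
        return i % P;                        /* rows dealt out round-robin */
    }

    int owner_block_cyclic(int i, int B, int P) {
        return (i / B) % P;                  /* blocks of B rows dealt out round-robin */
    }

    int main(void) {
        for (int i = 0; i < 12; i++)
            printf("i=%2d  block:%d  cyclic:%d  block-cyclic(B=2):%d\n",
                   i, owner_block(i, 12, 4), owner_cyclic(i, 4), owner_block_cyclic(i, 2, 4));
        return 0;
    }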

SOR SOR (successive over-relaxation) implements a mathematical model for many natural phenomena, e.g., heat dissipation in a metal sheet. The model is a partial differential equation. Focus is on the algorithm, not on its derivation.

Problem Statement (figure: a rectangle in the x-y plane; the boundary condition is F = 1 on one edge and F = 0 on the others, and in the interior ∇²F(x,y) = 0).

Discretization Represent F in the continuous rectangle by a 2-dimensional discrete grid (array). The boundary conditions on the rectangle become the boundary values of the array. The internal values are found by the relaxation algorithm.

Discretized Problem Statement (figure: the rectangle replaced by a 2-dimensional grid of points, indexed by i and j).

Relaxation Algorithm For some number of iterations: for each internal grid point, compute the average of its four neighbors. Termination condition: the values at the grid points change very little (we will ignore this part in our example).

Discretized Problem Statement for some number of timesteps/iterations { for( i=1; i<n; i++ ) for( j=1; j<n; j++ ) temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] + grid[i][j-1] + grid[i][j+1] ); for( i=1; i<n; i++ ) for( j=1; j<n; j++ ) grid[i][j] = temp[i][j]; }

Parallel SOR No dependences between iterations of first (i,j) loop nest. No dependences between iterations of second (i,j) loop nest. Anti-dependence between first and second loop nest in the same timestep. True dependence between second loop nest and first loop nest of next timestep.

Parallel SOR (continued) First (i,j) loop nest can be parallelized. Second (i,j) loop nest can be parallelized. We must make processors wait at the end of each (i,j) loop nest. Natural synchronization: fork-join.

Parallel SOR (continued) If we have P processors, we can give n/P rows or columns to each processor. Or, we can divide the array in P squares, and give each processor a square to compute.
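
A hedged OpenMP sketch of parallel SOR along these lines, giving each thread a block of rows; each parallel for corresponds to one of the two (i,j) loop nests, and its implicit barrier provides the fork-join style wait (grid size, boundary values and iteration count are arbitrary choices for the sketch):

    #include <stdio.h>

    #define N     512
    #define STEPS 100

    static double grid[N + 1][N + 1], temp[N + 1][N + 1];

    int main(void) {
        for (int j = 0; j <= N; j++) grid[0][j] = 1.0;       /* boundary: F = 1 on one edge, 0 elsewhere */

        for (int t = 0; t < STEPS; t++) {
            #pragma omp parallel for                          /* first loop nest: all (i,j) independent */
            for (int i = 1; i < N; i++)
                for (int j = 1; j < N; j++)
                    temp[i][j] = 0.25 * ( grid[i-1][j] + grid[i+1][j] +
                                          grid[i][j-1] + grid[i][j+1] );

            /* implicit barrier here: no thread starts copying before all of temp is computed */

            #pragma omp parallel for                          /* second loop nest */
            for (int i = 1; i < N; i++)
                for (int j = 1; j < N; j++)
                    grid[i][j] = temp[i][j];
        }

        printf("grid[1][1] = %f\n", grid[1][1]);
        return 0;
    }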

Molecular Dynamics (MD) Simulation of a set of bodies under the influence of physical laws. Atoms, molecules, celestial bodies,... Have same basic structure.

Molecular Dynamics (Skeleton) for some number of timesteps { for all molecules i for all other molecules j force[i] += f( loc[i], loc[j] ); for all molecules i loc[i] = g( loc[i], force[i] ); }

Molecular Dynamics (continued) To reduce amount of computation, account for interaction only with nearby molecules.

Molecular Dynamics (continued) for some number of timesteps { for all molecules i for all nearby molecules j force[i] += f( loc[i], loc[j] ); for all molecules i loc[i] = g( loc[i], force[i] ); }

Molecular Dynamics (continued) For each molecule i, keep: the number of nearby molecules, count[i], and an array of the indices of its nearby molecules, index[j] ( 0 <= j < count[i] ).

Molecular Dynamics (continued) for some number of timesteps { for( i=0; i<num_mol; i++ ) for( j=0; j<count[i]; j++ ) force[i] += f(loc[i],loc[index[j]]); for( i=0; i<num_mol; i++ ) loc[i] = g( loc[i], force[i] ); }

Molecular Dynamics (continued) No loop-carried dependence in first i-loop. Loop-carried dependence (reduction) in j-loop. No loop-carried dependence in second i-loop. True dependence between first and second i-loop.

Molecular Dynamics (continued) First i-loop can be parallelized. Second i-loop can be parallelized. Must make processors wait between loops. Natural synchronization: fork-join.

Irregular vs. regular data parallel In SOR, all arrays are accessed through linear expressions of the loop indices, known at compile time [regular]. In MD, some arrays are accessed through non-linear expressions of the loop indices, some known only at runtime [irregular].

Irregular vs. regular data parallel No real differences in terms of parallelization (based on dependences). Will lead to fundamental differences in expressions of parallelism: –irregular difficult for parallelism based on data distribution –not difficult for parallelism based on iteration distribution.

Molecular Dynamics (continued) Parallelization of first loop: –has a load balancing issue –some molecules have few/many neighbors –more sophisticated loop partitioning necessary
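
A hedged OpenMP sketch that addresses the imbalance with run-time loop partitioning: schedule(dynamic) hands out iterations in small chunks, so threads whose molecules have few neighbors simply pick up more iterations. The per-molecule neighbour list is written as nbr[i][j] here (a 2-D version of the slides' index[]); the force law, update rule and neighbour lists are toy stand-ins:

    #include <stdio.h>

    #define NUM_MOL  1000
    #define MAX_NBRS 64
    #define STEPS    10

    static double loc[NUM_MOL], force[NUM_MOL];
    static int    count[NUM_MOL], nbr[NUM_MOL][MAX_NBRS];   /* per-molecule neighbour lists */

    double f(double xi, double xj) { return xj - xi; }        /* stand-in force law */
    double g(double xi, double fi) { return xi + 0.01 * fi; } /* stand-in update rule */

    int main(void) {
        for (int i = 0; i < NUM_MOL; i++) {                   /* toy, deliberately unbalanced lists */
            loc[i] = (double)i;
            count[i] = i % MAX_NBRS;
            for (int j = 0; j < count[i]; j++) nbr[i][j] = (i + j + 1) % NUM_MOL;
        }

        for (int t = 0; t < STEPS; t++) {
            /* unbalanced loop: dynamic scheduling balances the work at run time */
            #pragma omp parallel for schedule(dynamic, 16)
            for (int i = 0; i < NUM_MOL; i++)
                for (int j = 0; j < count[i]; j++)
                    force[i] += f(loc[i], loc[nbr[i][j]]);

            #pragma omp parallel for                          /* balanced loop: default schedule is fine */
            for (int i = 0; i < NUM_MOL; i++)
                loc[i] = g(loc[i], force[i]);
        }

        printf("loc[0] = %f\n", loc[0]);
        return 0;
    }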