Parallel Programming using the PGAS Approach


Outline

- Introduction
  - Programming parallel systems: threading, message passing
  - PGAS as a middle ground
- UPC (Unified Parallel C)
  - History of UPC
  - Shared scalars and arrays
  - Work-sharing in UPC: parallel loop
- DASH – PGAS in the form of a C++ template library
  - A quick overview of the project
- Conclusion

Programming Parallel Machines

The two most widely used approaches for parallel programming:
- Shared memory programming using threads
- Message passing

(Diagram: processes/threads mapped onto physical memories, contrasting direct memory access to shared data with explicit messages between private memories.)

Shared Memory Programming using Threads

- Examples: OpenMP, Pthreads, C++ threads, Java threads (a minimal OpenMP sketch follows below)
- Limited to shared memory systems
- Shared data can be directly accessed: communication is implicit, through ordinary reads and writes
- Advantages
  - Typically easier to program; a natural extension of sequential programming
- Disadvantages
  - Subtle bugs, race conditions
  - False sharing as a performance problem
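
To make this model concrete, here is a minimal OpenMP sketch in C (my addition, not from the slides): all threads update a shared result directly, and the reduction clause is what prevents the race condition the slide warns about.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        long sum = 0;   /* shared by all threads in the parallel region */

        /* Each thread works on a chunk of the loop and adds to the shared
           result. The reduction clause makes the concurrent updates safe;
           without it, "sum += i" would be exactly the kind of race
           condition listed above. */
        #pragma omp parallel for reduction(+:sum)
        for (long i = 0; i < 1000000; i++) {
            sum += i;
        }

        printf("sum = %ld, max threads = %d\n", sum, omp_get_max_threads());
        return 0;
    }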

Message Passing

- Example: MPI (Message Passing Interface); a minimal sketch follows below
- Advantages
  - Highly efficient and scalable (to the largest machines in use today)
  - Data locality is "automatic"
  - No false sharing, no race conditions
  - Runs everywhere
- Disadvantages
  - Complex programming paradigm
  - Manual data partitioning required
  - Explicit coordination of communication (send/receive pairs)
  - Data replication (memory requirement)
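
For contrast, a minimal MPI sketch (my addition, not from the slides): nothing is shared, so rank 0 must explicitly send a value and rank 1 must post the matching receive.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            /* explicit send: destination rank 1, tag 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* matching explicit receive from rank 0 */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }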

Partitioned Global Address Space

- Best of both worlds
  - Can be used on large-scale distributed memory machines but also on shared memory machines
- A PGAS program looks much like a regular threaded program, but
  - sharing of data is declared explicitly
  - the data partitioning is made explicit
  - both are needed for performance!

(Diagram: a PGAS layer on top of the physical memories; the shared data space is partitioned and accessed via puts/gets, alongside each thread's private data.)

Partitioned Global Address Space Example

- Let's call the members of our program threads
- Let's assume we use the SPMD (single program multiple data) paradigm
- Let's assume we have a new keyword "shared" that puts variables in the shared global address space
- This is how PGAS is expressed in UPC (Unified Parallel C)! (more later)

    shared int ours;
    int mine;

- There are n copies of mine (one per thread); each thread can only access its own copy
- There is 1 copy of ours, accessible by every thread

(Diagram: the global address space split into a private part per thread and a shared part; mine lives in each thread's private part, ours in the shared part.)
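
A minimal runnable UPC sketch of this idea (my addition, not from the slides): mine exists once per thread, while ours is a single shared variable which, by convention, has affinity to thread 0.

    #include <upc.h>
    #include <stdio.h>

    shared int ours;          /* one copy, lives in the shared space (affinity: thread 0) */
    int mine;                 /* one private copy per thread */

    int main(void) {
        mine = MYTHREAD;      /* each thread writes only its own copy */
        if (MYTHREAD == 0)
            ours = 42;        /* the single shared copy, written by thread 0 */
        upc_barrier;          /* make the write visible before other threads read it */
        printf("Thread %d: mine=%d ours=%d\n", MYTHREAD, mine, ours);
        return 0;
    }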

Shared Arrays

Example: a shared array (UPC)

    shared int ours[4];
    int mine;

- Affinity: the partition in which a data item "lives"
  - ours (previous slide) lives in partition 0 (by convention)
  - ours[i] lives in partition i

(Diagram, with 4 threads: each thread has a private mine; ours[0] .. ours[3] are spread across the shared partitions of threads 0 .. 3.)
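
A small UPC sketch of affinity in action (my addition, assuming one array element per thread): with the default cyclic layout, element i has affinity to thread i % THREADS, so each thread below writes only the element it owns, and thread 0 then reads them all.

    #include <upc.h>
    #include <stdio.h>

    shared int ours[THREADS];   /* one element per thread, cyclic layout by default */

    int main(void) {
        ours[MYTHREAD] = MYTHREAD * MYTHREAD;  /* each thread writes its own (local) element */
        upc_barrier;
        if (MYTHREAD == 0)                     /* thread 0 reads all elements, some remotely */
            for (int i = 0; i < THREADS; i++)
                printf("ours[%d] = %d\n", i, ours[i]);
        return 0;
    }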

Local-view vs. Global-view

Two ways to organize access to shared data:

Global-view (e.g., Unified Parallel C):
- X is declared in terms of its global size and accessed in terms of global indices
- The process is not specified explicitly

    shared int X[100];
    X[i] = 23;              /* global size, global index */

Local-view (e.g., Co-array Fortran):
- a and b are declared in terms of their local size and accessed in terms of local indices
- The process (image) is specified explicitly via the co-index
- The co-dimension / co-index appears in square brackets

    integer :: a(100)[*], b(100)[*]
    b(17) = a(17)[2]        ! local size, local index

UPC History and Status

- UPC is an extension to ANSI C
  - New keywords, library functions
- Developed in the late 1990s and early 2000s
  - Based on previous projects at UCB, IDA, LLNL, …
- Status
  - Berkeley UPC, a GCC version, vendor compilers (Cray, IBM, …)
  - Most often used on graph problems, irregular parallelism

UPC Execution Model

- A number of threads working independently in an SPMD fashion
  - The number of threads is specified at compile time or run time; it is available as the program variable THREADS
  - Note: "thread" is the UPC terminology; UPC threads are most often implemented as full OS processes
- MYTHREAD specifies the thread index (0 ... THREADS-1)
- upc_barrier is a global synchronization: all threads wait before continuing
- There is a form of parallel loop (later)
- There are two compilation modes
  - Static threads mode: THREADS is specified at compile time by the user; the program may use THREADS as a compile-time constant
  - Dynamic threads mode: compiled code may be run with varying numbers of threads

Hello World in UPC

- Any legal C program is also a legal UPC program (SPMD model)
- If you compile and run it as UPC with N threads, it will run N copies of the program (same model as MPI)

    #include <upc.h>    /* needed for UPC extensions */
    #include <stdio.h>

    int main() {
        printf("Thread %d of %d: hello UPC world\n", MYTHREAD, THREADS);
        return 0;
    }

Sample output with 4 threads (order may vary):

    Thread 0 of 4: hello UPC world
    Thread 1 of 4: hello UPC world
    Thread 3 of 4: hello UPC world
    Thread 2 of 4: hello UPC world

A Bigger Example in UPC: Estimating π

- Estimate π by throwing darts at a unit square and calculating the percentage that fall inside the unit circle
  - Area of the square: r² = 1
  - Area of the circle quadrant: ¼ π r² = π/4
- Randomly throw darts at (x, y) positions
  - If x² + y² < 1, the point is inside the circle
- Compute the ratio R = (# points inside) / (# points total)
  - π ≈ 4 R

Pi in UPC, First Version

    #include <stdio.h>
    #include <stdlib.h>   /* atoi, srand, rand */
    #include <math.h>
    #include <upc.h>

    int main(int argc, char *argv[]) {
        int i, hits = 0, trials = 0;   /* each thread gets its own copy of these variables */
        double pi;
        if (argc != 2)
            trials = 1000000;
        else
            trials = atoi(argv[1]);    /* each thread can use the input arguments */
        srand(MYTHREAD*17);            /* seed the random number generator per thread */
        for (i = 0; i < trials; i++)
            hits += hit();             /* hit(): draw a random point, return 1 if inside the circle */
        pi = 4.0*hits/trials;
        printf("PI estimated to %f.\n", pi);
        return 0;
    }

- This program computes N independent estimates of Pi (when run with N threads)
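
The slides use hit() without showing it; a plausible implementation (my sketch, not part of the original code) throws one dart into the unit square and tests whether it lands inside the quarter circle:

    #include <stdlib.h>

    /* Sketch of the helper assumed by the Pi examples: throw one dart at
       the unit square and report whether it lands inside the quarter
       circle of radius 1. */
    int hit(void) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        return (x*x + y*y < 1.0) ? 1 : 0;
    }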

Pi in UPC, Shared Memory Style

    shared int hits = 0;                           /* shared variable to record hits */

    int main(int argc, char **argv) {
        int i, my_trials = 0;
        int trials = atoi(argv[1]);
        my_trials = (trials + THREADS-1)/THREADS;  /* divide up the work evenly */
        srand(MYTHREAD*17);
        for (i = 0; i < my_trials; i++)
            hits += hit();                         /* accumulate hits in the shared counter */
        upc_barrier;
        if (MYTHREAD == 0)
            printf("PI estimated to %f.\n", 4.0*hits/trials);
        return 0;
    }

- Problem with this program: race condition! Reading and writing hits is not synchronized.

Fixing the Race Condition

A possible fix for the race condition:
- Have a separate counter per thread (use a shared array with 1 element per thread)
- One thread computes the total sum

    int hits = 0;
    shared int all_hits[THREADS];         /* shared array: 1 element per thread */

    int main(int argc, char **argv) {
        /* declarations and initialization code omitted */
        for (i = 0; i < my_trials; i++)
            all_hits[MYTHREAD] += hit();  /* each thread updates only its own element:
                                             no race condition, no remote communication */
        upc_barrier;
        if (MYTHREAD == 0) {
            for (i = 0; i < THREADS; i++)
                hits += all_hits[i];      /* thread 0 computes the overall sum */
            printf("PI estimated to %f.\n", 4.0*hits/trials);
        }
        return 0;
    }

Other UPC Features

- Locks
  - upc_lock_t: pairwise synchronization between threads
  - Can also be used to fix the race condition in the previous example (see the sketch below)
- Customizable layout of one- and multi-dimensional arrays
  - Blocked, cyclic, block-cyclic; cyclic is the default
- Split-phase barrier
  - upc_notify and upc_wait instead of upc_barrier
- Shared pointer and pointer to shared
- Work-sharing (parallel loop)
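
A sketch of the lock-based fix alluded to above (my addition, not from the slides): one lock, allocated collectively, serializes the updates to the shared counter hits from the shared-memory-style Pi version.

    #include <upc.h>
    #include <stdio.h>
    #include <stdlib.h>

    int hit(void);                          /* dart-throwing helper from the earlier sketch */

    shared int hits = 0;
    upc_lock_t *hits_lock;                  /* each thread holds a private pointer to the same lock */

    int main(int argc, char **argv) {
        int i, my_hits = 0;
        int my_trials = 1000;               /* per-thread work; argument handling omitted */
        hits_lock = upc_all_lock_alloc();   /* collective: returns the same lock to every thread */
        srand(MYTHREAD*17);
        for (i = 0; i < my_trials; i++)
            my_hits += hit();               /* accumulate privately first */
        upc_lock(hits_lock);                /* critical section: serialize the shared update */
        hits += my_hits;
        upc_unlock(hits_lock);
        upc_barrier;
        if (MYTHREAD == 0)
            printf("PI estimated to %f.\n", 4.0*hits/(double)(my_trials*THREADS));
        return 0;
    }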

Worksharing: Vector Addition Example

    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], sum[N];       /* default layout: cyclic (round robin) */

    int main() {
        int i;
        for (i = 0; i < N; i++) {
            if (MYTHREAD == i%THREADS) {   /* access local elements only:
                                              sum[i] has affinity to thread i%THREADS */
                sum[i] = v1[i] + v2[i];
            }
        }
        return 0;
    }

- Each thread iterates over the indices that it "owns": a common idiom called "owner computes"
- UPC supports this idiom directly with a parallel version of the for loop: upc_forall

UPC Work-Sharing with upc_forall

    upc_forall(init; test; loop; affinity)
        statement;

- init, test, loop: same as in a regular C for loop; they define loop start, end, and increment
- affinity: defines which iterations a thread is responsible for
- Syntactic sugar for the loop on the previous slide: loop over all iterations, work on those with affinity to this thread
- The programmer guarantees that the iterations are independent; behavior is undefined if there are dependencies across threads
- Affinity expression: two options
  - Integer: the thread for which affinity % THREADS equals MYTHREAD executes the iteration
  - Pointer: the thread for which upc_threadof(affinity) equals MYTHREAD executes the iteration

Vector Addition with upc_forall

The vector addition example can be rewritten as follows:

    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], sum[N];

    int main() {
        int i;
        upc_forall(i = 0; i < N; i++; i)   /* the final "i" is the affinity expression */
            sum[i] = v1[i] + v2[i];
        return 0;
    }

- Equivalent code could use "&sum[i]" for the affinity test
- The code would be correct (but slow) if the affinity expression were i+1 rather than i
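
For completeness, a sketch of the pointer form of the affinity expression (my addition, not from the slides): using &sum[i] means the owner of sum[i] executes iteration i, which with the default cyclic layout is equivalent to using i.

    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], sum[N];

    int main() {
        int i;
        /* Pointer affinity: iteration i runs on the thread t with
           upc_threadof(&sum[i]) == t, i.e. the owner of sum[i]. */
        upc_forall(i = 0; i < N; i++; &sum[i])
            sum[i] = v1[i] + v2[i];
        return 0;
    }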

UPC Summary

- UPC is an extension to C, implementing the PGAS model
  - Available as a GCC version, as Berkeley UPC, and from some vendors
  - Today most often used for graph problems, irregular parallelism
- PGAS is a concept realized in UPC and other languages
  - Co-array Fortran, Titanium
  - Chapel, X10, Fortress (HPCS languages)
- Not covered
  - Collective operations (reductions, etc., similar to MPI)
  - Dynamic memory allocation in shared space
  - UPC shared pointers

DASH – Overview

- DASH: PGAS in the form of a C++ template library
- Focus on data structures
  - An array a can be stored in the memory of several nodes
  - a[i] transparently refers to local memory or to remote memory, via operator overloading

    dash::array<int> a(1000);
    a[23] = 412;
    std::cout << a[42] << std::endl;

- Not a new language to learn
- Can be integrated with existing (MPI) applications
- Support for hierarchical locality
  - Team hierarchies and locality iterators

(Diagram: a DASH array spanning the memory of several nodes, contrasted with per-node local containers such as an STL vector or array.)

Hierarchical Machines

- Machines are getting increasingly hierarchical, both within nodes and between nodes
- Data locality is the most crucial factor for performance and energy efficiency
- Hierarchical locality is not well supported by current approaches: PGAS languages usually offer only a two-level differentiation (local vs. remote)

(Figure sources: LRZ SuperMUC system description; Bhatele et al., "Avoiding hot-spots in two-level direct networks", SC 2011; Steve Keckler et al., Echelon system sketch.)

DASH – Overview and Project Partners

- LMU Munich (K. Fürlinger)
- HLRS Stuttgart (J. Gracia)
- TU Dresden (A. Knüpfer)
- KIT Karlsruhe (J. Tao)
- CEODE Beijing (L. Wang, associated)

(Diagram of the software stack: a DASH application and tools/interfaces sit on top of the DASH C++ template library, which builds on the DASH runtime; the runtime uses a one-sided communication substrate such as MPI, GASnet, GASPI, or ARMCI on top of the hardware: network, processor, memory, storage.)

DART: The DASH Runtime Interface

The DART API:
- Plain-C based interface
- Follows the SPMD execution model
- Defines units and teams
- Defines a global memory abstraction and provides a global pointer
- Defines one-sided access operations (puts and gets)
- Provides collective and pairwise synchronization mechanisms

Units and Teams

- Unit: an individual participant in a DASH/DART program
  - Unit ≈ process (MPI) ≈ thread (UPC) ≈ image (CAF)
  - The execution model follows the classical SPMD (single program multiple data) paradigm
  - Each unit has a global ID that remains unchanged during execution
- Team: an ordered subset of units
  - Identified by an integer ID
  - DART_TEAM_ALL represents all units in a program
  - Units that are members of a team have a local ID with respect to that team

Communication

- One-sided puts and gets, in blocking and non-blocking versions (see the one-sided sketch below)
- Performance of blocking puts and gets closely matches MPI performance
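
The slides do not show DART's own put/get functions, so as a generic illustration of one-sided communication, here is a sketch using plain MPI RMA, one of the substrates listed earlier (my addition; this is not the DART API): rank 0 writes directly into memory exposed by rank 1, which never posts a receive.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, value = 0, payload = 42;
        MPI_Win win;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* every rank exposes one int of its memory as an RMA window */
        MPI_Win_create(&value, sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open the access epoch */
        if (rank == 0) {
            /* one-sided put: write into rank 1's window; rank 1 posts no receive */
            MPI_Put(&payload, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
        }
        MPI_Win_fence(0, win);                 /* close the epoch; the put is now complete */

        if (rank == 1)
            printf("rank 1 sees value = %d\n", value);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }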

DASH (C++ Template Library)

- 1D array as the basic data type
- DASH follows a global-view approach, but local-view programming is supported too
- Standard algorithms can be used but may not yield the best performance
- lbegin(), lend() allow iteration over the local elements

Data Distribution Patterns

- A pattern controls the mapping of an index space onto units
- A team can be specified (the default team is used otherwise)
- No data type is specified for a pattern
- Patterns guarantee a similar mapping for different containers
- Patterns can be used to specify parallel execution

Accessing and Working with Data in DASH (1)

- GlobalPtr<T>: an abstraction that serves as the global iterator
- GlobalRef<T>: a "reference to an element in global memory" abstraction, returned by subscripting and iterator dereferencing

Accessing and Working with Data in DASH (2)

- Range-based for works on the global object by default
- A proxy object can be used instead to access the local part of the data

Summary

- Parallel programming is difficult
- PGAS is an interesting middle ground between message passing and shared memory programming with threads
  - It inherits the advantages of both but also shares some of the disadvantages, specifically race conditions
  - Today PGAS is mostly used for applications with irregular parallelism and random data accesses
- UPC is the most widely used PGAS approach today
  - Co-array Fortran and other newer PGAS languages
  - DASH and other C++ libraries

Thank you for your attention!