1
Parallel Programming using the PGAS Approach
2
UPC (Unified Parallel C)
Outline
- Introduction
  - Programming parallel systems: threading, message passing
  - PGAS as a middle ground
- UPC (Unified Parallel C)
  - History of UPC
  - Shared scalars and arrays
  - Work-sharing in UPC: parallel loop
- DASH – PGAS in the form of a C++ template library
  - A quick overview of the project
- Conclusion
3
Programming Parallel Machines
The two most widely used approaches for parallel programming are shared memory programming using threads and message passing.
[Figure: threads sharing one physical memory vs. processes with private memories exchanging explicit messages]
4
Shared Memory Programming using Threads
Examples: OpenMP, Pthreads, C++ threads, Java threads. Limited to shared memory systems; shared data can be accessed directly, so communication is implicit via reads and writes.
Advantages: typically easier to program, a natural extension of sequential programming.
Disadvantages: subtle bugs such as race conditions (see the sketch below); false sharing as a performance problem.
[Figure: several threads accessing private and shared data in one physical memory]
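To make the threading model concrete, here is a minimal, hedged OpenMP sketch (my own illustration, not from the slides): the threads read and write shared data directly, and the unprotected update of sum is exactly the kind of race condition listed as a disadvantage.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int data[1000];
        int sum = 0;                       /* shared by all threads */
        for (int i = 0; i < 1000; i++) data[i] = 1;

        #pragma omp parallel for
        for (int i = 0; i < 1000; i++) {
            sum += data[i];                /* direct access to shared data:
                                              unsynchronized -> race condition */
        }
        printf("sum = %d (may be wrong without a reduction)\n", sum);
        return 0;
    }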
5
Message Passing
Example: MPI (Message Passing Interface).
Disadvantages: complex programming paradigm; manual data partitioning required; explicit coordination of communication via send/receive pairs (see the sketch below); data replication increases memory requirements.
Advantages: highly efficient and scalable (to the largest machines in use today); data locality is "automatic"; no false sharing, no race conditions; runs everywhere.
[Figure: processes with private memories communicating via explicit messages]
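To illustrate the explicit send/receive coordination, here is a minimal, hedged MPI sketch (not from the slides): one process sends a value, another receives it; every transfer needs such a matching pair. Run with at least two processes.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* explicit send: must be matched by a receive on rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }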
6
Partitioned Global Address Space
Best of both worlds: PGAS can be used on large-scale distributed memory machines as well as on shared memory machines. A PGAS program looks much like a regular threaded program, but sharing of data is declared explicitly and the data partitioning is made explicit. Both are needed for performance!
[Figure: PGAS layer over the physical memories; the shared data space is partitioned and accessed via put/get]
7
Partitioned Global Address Space
Example: let's call the members of our program threads, assume we use the SPMD (single program, multiple data) paradigm, and assume we have a new keyword "shared" that puts variables in the shared global address space. This is how PGAS is expressed in UPC (Unified Parallel C)! (more later)

    shared int ours;
    int mine;

There are n copies of mine (one per thread), and each thread can only access its own copy. There is exactly one copy of ours, accessible by every thread.
[Figure: the global address space split into a private part per thread holding mine and a shared part holding ours]
8
Shared Arrays
Example of a shared array in UPC:

    shared int ours[4];
    int mine;

Affinity is the partition in which a data item "lives": ours from the previous slide lives in partition 0 (by convention), while ours[i] lives in partition i.
[Figure: four threads; thread i holds ours[i] in its partition of the shared space plus its private copy of mine]
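As a small, hedged illustration of affinity (my own, not from the slides), the standard UPC query upc_threadof reports which thread an element lives with; with the default cyclic layout this prints i % THREADS for ours[i]:

    #include <upc.h>
    #include <stdio.h>

    shared int ours[4];

    int main(void) {
        if (MYTHREAD == 0) {
            for (int i = 0; i < 4; i++)
                printf("ours[%d] has affinity to thread %d\n",
                       i, (int)upc_threadof(&ours[i]));
        }
        return 0;
    }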
9
Local-view vs. Global-view
Two ways to organize access to shared data:

Global-view (e.g., Unified Parallel C):

    shared int X[100];
    X[i] = 23;

X is declared in terms of its global size and accessed with global indices; the owning process is not specified explicitly.

Local-view (e.g., Co-array Fortran):

    integer :: a(100)[*], b(100)[*]
    b(17) = a(17)[2]

a and b are declared in terms of their local size and accessed with local indices; the process (image) is specified explicitly via the co-index (the co-dimension/co-index appears in square brackets).
10
UPC History and Status
UPC is an extension to ANSI C: new keywords and library functions. It was developed in the late 1990s and early 2000s, based on previous projects at UCB, IDA, LLNL, and others.
Status: Berkeley UPC, a GCC version, and vendor compilers (Cray, IBM, …). It is most often used on graph problems and irregular parallelism.
11
UPC Execution Model
A number of threads work independently in an SPMD fashion. The number of threads is specified at compile time or run time and is available as the program variable THREADS. Note: "thread" is the UPC terminology; UPC threads are most often implemented as full OS processes. MYTHREAD gives the thread index (0 ... THREADS-1). upc_barrier is a global synchronization: all threads wait before continuing. There is also a form of parallel loop (covered later).
There are two compilation modes:
- Static threads mode: THREADS is specified at compile time by the user, and the program may use THREADS as a compile-time constant.
- Dynamic threads mode: compiled code may be run with varying numbers of threads.
12
Hello World in UPC
Any legal C program is also a legal UPC program (SPMD model). If you compile and run it as UPC with N threads, it will run N copies of the program (the same model as MPI).

    #include <upc.h>    /* needed for UPC extensions */
    #include <stdio.h>

    int main(void) {
        printf("Thread %d of %d: hello UPC world\n", MYTHREAD, THREADS);
        return 0;
    }

Example output with 4 threads (the ordering is not deterministic):

    Thread 0 of 4: hello UPC world
    Thread 1 of 4: hello UPC world
    Thread 3 of 4: hello UPC world
    Thread 2 of 4: hello UPC world
13
A Bigger Example in UPC: Estimate π
Estimate π by throwing darts at a unit square and calculating the percentage that fall in the unit circle:
- Area of the square = r² = 1
- Area of the circle quadrant = ¼ π r² = π/4
- Randomly throw darts at (x, y) positions; if x² + y² < 1, the point is inside the circle
- Compute the ratio R = (# points inside) / (# points total); then π ≈ 4R
[Figure: quarter circle of radius r = 1 inscribed in the unit square]
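Spelled out as a formula (restating the slide's argument), the estimator used by the following programs is:

    \frac{\text{hits}}{\text{trials}} \;\approx\; \Pr\!\left[x^2 + y^2 < 1\right]
      \;=\; \left.\frac{\tfrac{1}{4}\pi r^2}{r^2}\right|_{r=1} \;=\; \frac{\pi}{4}
    \quad\Longrightarrow\quad
    \pi \;\approx\; 4\cdot\frac{\text{hits}}{\text{trials}}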
14
Pi in UPC, First Version

    #include <stdio.h>
    #include <stdlib.h>   /* for atoi() */
    #include <math.h>
    #include <upc.h>

    int hit(void);        /* returns 1 if a random (x,y) point falls inside the circle */

    int main(int argc, char *argv[]) {
        int i, hits = 0, trials = 0;
        double pi;
        if (argc != 2) trials = 1000000;   /* default number of trials (assumed; value missing in the transcript) */
        else trials = atoi(argv[1]);
        srand(MYTHREAD*17);                /* initialize the RNG differently in each thread */
        for (i = 0; i < trials; i++) hits += hit();
        pi = 4.0*hits/trials;
        printf("PI estimated to %f.\n", pi);
        return 0;
    }

Notes: each thread gets its own copy of these variables and can use the input arguments. hit() draws random numbers and returns 1 if the point is inside the circle. This program computes N independent estimates of Pi (when run with N threads).
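The slides only describe hit() in words; a minimal sketch of what it could look like (an assumption, not shown in the original) is:

    /* Hypothetical helper: throw one dart at the unit square and test
       whether it lands inside the quarter circle of radius 1. */
    int hit(void) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        return (x*x + y*y < 1.0) ? 1 : 0;
    }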
15
Pi in UPC, Shared Memory Style
A shared variable records the hits:

    shared int hits = 0;

    /* includes and hit() as in the first version */
    int main(int argc, char **argv) {
        int i, my_trials = 0;
        int trials = atoi(argv[1]);
        my_trials = (trials + THREADS - 1) / THREADS;   /* divide up the work evenly */
        srand(MYTHREAD*17);
        for (i = 0; i < my_trials; i++)
            hits += hit();                              /* accumulate hits */
        upc_barrier;
        if (MYTHREAD == 0)
            printf("PI estimated to %f.\n", 4.0*hits/trials);
        return 0;
    }

There is a problem with this program: a race condition! Reading and writing hits is not synchronized.
16
Fixing the Race Condition
A possible fix for the race condition: keep a separate counter per thread (a shared array with one element per thread), and let one thread compute the total sum.

    int hits = 0;
    shared int all_hits[THREADS];

    int main(int argc, char **argv) {
        // declarations and initialization code omitted
        for (i = 0; i < my_trials; i++)
            all_hits[MYTHREAD] += hit();    /* each thread updates only its own element */
        upc_barrier;
        if (MYTHREAD == 0) {
            for (i = 0; i < THREADS; i++)
                hits += all_hits[i];        /* thread 0 computes the overall sum */
            printf("PI estimated to %f.\n", 4.0*hits/trials);
        }
        return 0;
    }

Shared array with one element per thread: each thread accesses its local element, so there is no race condition and no remote communication.
17
Other UPC Features
Locks: upc_lock_t provides pairwise synchronization between threads and can also be used to fix the race condition in the previous example (see the sketch below).
Customizable layout of one- and multi-dimensional arrays: blocked, cyclic, block-cyclic; cyclic is the default.
Split-phase barrier: upc_notify and upc_wait instead of upc_barrier.
Shared pointers and pointers to shared.
Work-sharing (parallel loop).
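As a hedged sketch (my own, not from the slides) of the lock-based fix mentioned above, using the standard UPC lock routines upc_all_lock_alloc, upc_lock, and upc_unlock:

    #include <upc.h>
    #include <stdio.h>

    shared int hits = 0;

    int main(void) {
        int my_hits = MYTHREAD + 1;                   /* stand-in for a per-thread result */
        upc_lock_t *hit_lock = upc_all_lock_alloc();  /* collective: all threads share one lock */

        upc_lock(hit_lock);
        hits += my_hits;                              /* critical section: one thread at a time */
        upc_unlock(hit_lock);

        upc_barrier;
        if (MYTHREAD == 0) printf("total hits = %d\n", hits);
        return 0;
    }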
18
Worksharing: Vector Addition Example
    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], sum[N];

    int main(void) {
        int i;
        for (i = 0; i < N; i++) {
            if (MYTHREAD == i % THREADS) {   /* access local elements only */
                sum[i] = v1[i] + v2[i];
            }
        }
        return 0;
    }

The default layout is cyclic (round robin), so sum[i] has affinity to thread i % THREADS and the loop touches local elements only. Each thread iterates over the indices that it "owns"; this is a common idiom called "owner computes". UPC supports this idiom directly with a parallel version of the for loop: upc_forall.
[Figure: v1, v2, and sum distributed cyclically across the threads]
19
UPC work-sharing with forall
    upc_forall(init; test; loop; affinity)
        statement;

init, test, loop: the same as in a regular C for loop, defining the loop start, end condition, and increment. affinity: defines which iterations a thread is responsible for. upc_forall is syntactic sugar for the loop on the previous slide: the loop runs over all iterations, and each thread works on those with affinity to it. The programmer guarantees that the iterations are independent; behavior is undefined if there are dependencies across threads.
The affinity expression has two forms:
- Integer: the iteration is executed by the thread for which affinity % THREADS == MYTHREAD
- Pointer: the iteration is executed by the thread for which upc_threadof(affinity) == MYTHREAD
20
Vector Addition with upc_forall
The vector addition example can be rewritten as follows. Equivalent code could use "&sum[i]" as the affinity expression (a sketch of that form follows below); the code would still be correct (but slow) if the affinity expression were i+1 rather than i.

    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], sum[N];

    int main(void) {
        int i;
        upc_forall(i = 0; i < N; i++; i)   /* the last field is the affinity expression */
            sum[i] = v1[i] + v2[i];
        return 0;
    }
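For completeness, a hedged sketch of the pointer-affinity form mentioned above (my own illustration, not shown on the slide); it replaces the loop in the program above:

    /* Pointer affinity: the iteration runs on the thread that owns sum[i],
       i.e., the thread for which upc_threadof(&sum[i]) == MYTHREAD. */
    upc_forall(i = 0; i < N; i++; &sum[i])
        sum[i] = v1[i] + v2[i];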
21
UPC Summary
UPC is an extension to C implementing the PGAS model. It is available as a GCC version, as Berkeley UPC, and from some vendors; today it is most often used for graph problems and irregular parallelism. PGAS is a concept realized in UPC and other languages: Co-array Fortran, Titanium, and the HPCS languages Chapel, X10, and Fortress.
Not covered here: collective operations (reductions, etc., similar to MPI), dynamic memory allocation in shared space, and UPC shared pointers.
22
DASH – Overview
DASH – PGAS in the form of a C++ template library, with a focus on data structures. An array a can be stored in the memory of several nodes, and a[i] transparently refers to local or remote memory via operator overloading:

    dash::array<int> a(1000);
    a[23] = 412;
    std::cout << a[42] << std::endl;

There is no new language to learn, and DASH can be integrated with existing (MPI) applications. It supports hierarchical locality via team hierarchies and locality iterators.
[Figure: a DASH array spanning several nodes, alongside node-local containers such as STL vector and array]
23
Hierarchical Machines
Machines are getting increasingly hierarchical, both within nodes and between nodes, and data locality is the most crucial factor for performance and energy efficiency. Hierarchical locality is not well supported by current approaches; PGAS languages usually offer only a two-level differentiation (local vs. remote).
[Figure sources: LRZ SuperMUC system description; Bhatele et al.: Avoiding hot-spots in two-level direct networks, SC 2011; Steve Keckler et al.: Echelon system sketch]
24
DASH – Overview and Project Partners
Project partners: LMU Munich (K. Fürlinger), HLRS Stuttgart (J. Gracia), TU Dresden (A. Knüpfer), KIT Karlsruhe (J. Tao), CEODE Beijing (L. Wang, associated).
[Figure: layered DASH architecture – the DASH application with tools and interfaces on top of the DASH C++ template library, the DASH runtime, a one-sided communication substrate (MPI, GASNet, GASPI, ARMCI), and the hardware (network, processor, memory, storage); existing components are distinguished from DASH components]
25
DART: The DASH Runtime Interface
The DART API is a plain-C interface that follows the SPMD execution model. It defines units and teams, a global memory abstraction, and a global pointer; it provides one-sided access operations (puts and gets) as well as collective and pairwise synchronization mechanisms.
[Figure: the same layered architecture as before, highlighting the DART API between the DASH C++ template library and the DASH runtime (DART)]
26
Units and Teams
Unit: an individual participant in a DASH/DART program; a unit ≈ process (MPI) ≈ thread (UPC) ≈ image (CAF). The execution model follows the classical SPMD (single program, multiple data) paradigm, and each unit has a global ID that remains unchanged during execution.
Team: an ordered subset of units, identified by an integer ID. DART_TEAM_ALL represents all units in a program. Units that are members of a team have a local ID with respect to that team.
27
Communication: One-sided puts and gets
Blocking and non-blocking versions are provided; the performance of blocking puts and gets closely matches MPI performance.
28
DASH (C++ Template Library)
The 1D array is the basic data type. DASH follows a global-view approach, but local-view programming is supported too: standard algorithms can be used (though they may not yield the best performance), and lbegin()/lend() allow iteration over the local elements only (a sketch follows below).
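A minimal sketch of global-view vs. local-view access, based on the snippet shown earlier; the container name, header, and the dash::init/dash::finalize calls follow common DASH usage but are assumptions here, not taken from these slides.

    #include <iostream>
    #include <libdash.h>   // assumed DASH umbrella header

    int main(int argc, char *argv[]) {
        dash::init(&argc, &argv);              // assumed initialization call

        dash::Array<int> a(1000);              // distributed across all units
        if (dash::myid() == 0) a[23] = 412;    // global-view access (may be remote)
        dash::barrier();

        // local-view: iterate only over the elements stored on this unit
        for (auto it = a.lbegin(); it != a.lend(); ++it) {
            *it += 1;
        }

        dash::finalize();
        return 0;
    }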
29
Data Distribution Patterns
A pattern controls the mapping of an index space onto units. A team can be specified (otherwise the default team is used). No data type is specified for a pattern, and patterns guarantee a similar mapping for different containers. Patterns can also be used to specify parallel execution.
30
Accessing and Working with Data in DASH (1)
GlobalPtr<T> is an abstraction that serves as the global iterator. GlobalRef<T> is a "reference to an element in global memory" abstraction that is returned by subscripting and by dereferencing an iterator.
31
Accessing and Working with Data in DASH (2)
Range-based for works on the global object by default; a proxy object can be used instead to access the local part of the data.
32
Summary
Parallel programming is difficult. PGAS is an interesting middle ground between message passing and shared memory programming with threads: it inherits the advantages of both, but also shares some of the disadvantages, specifically race conditions. Today PGAS is mostly used for applications with irregular parallelism and random data accesses. UPC is the most widely used PGAS approach today; there are also Co-array Fortran and other newer PGAS languages, as well as DASH and other C++ libraries.
Thank you for your attention!