Published by Xavier Casebolt. Modified over 9 years ago.
1
2003 Michigan Technological University, March 19, 2003
Steven Seidel
Department of Computer Science
Michigan Technological University
steve@mtu.edu
2
Overview
Background
Collective operations in the UPC language
The V1.0 UPC collectives specification
    Relocalization operations
    Computational operations
Performance and implementation issues
Extensions
Other work
3
Background
UPC is an extension of C that provides a partitioned shared memory programming model.
The V1.1 UPC spec was adopted on March 25.
Processes in UPC are called threads.
Each thread has a private (local) address space.
All threads share a global address space that is partitioned among the threads.
A shared object that resides in thread i’s partition is said to have affinity to thread i.
If a shared object x has affinity to thread i, thread i is expected to access x faster than it accesses shared objects that have affinity to other threads.
4
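UPC programming model
    shared [5] int A[10*THREADS];
    int i;
    A[0]=7;
    i=3;
    A[i]=A[0]+2;
[Diagram: shared and local memory for th 0, th 1, th 2. Thread 0 holds the blocks of A that start at indices 0 and 15, thread 1 those at 5 and 20, thread 2 those at 10 and 25. Each thread has its own private i, set to 3. A[0] becomes 7 and A[3] becomes 9.]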
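The block layout in this slide can be checked with a little arithmetic. The sketch below is plain C rather than UPC, and sim_affinity and its nthreads parameter are hypothetical names introduced only for illustration. It assumes UPC's round-robin distribution of blocks, under which element i of an array declared with block size blk has affinity to thread (i / blk) mod THREADS.

```c
#include <assert.h>
#include <stddef.h>

/* Thread that owns element i of an array declared "shared [blk] T A[...]".
   nthreads is a hypothetical stand-in for UPC's THREADS. */
size_t sim_affinity(size_t i, size_t blk, size_t nthreads)
{
    assert(blk > 0 && nthreads > 0);
    /* Blocks are dealt out to the threads in round-robin order. */
    return (i / blk) % nthreads;
}
```

With blk = 5 and three threads, elements 0 to 4 and 15 to 19 map to thread 0, which agrees with the block start indices 0 and 15 shown for th 0.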
5
Collective operations in UPC
If any thread calls a collective function, then all threads must also call that function.
Collective arguments are single-valued: corresponding function arguments have the same value on every thread.
V1.1 UPC contains several collective functions:
    upc_notify and upc_wait
    upc_barrier
    upc_all_alloc
    upc_all_lock_alloc
These collectives provide synchronization and memory allocation across all threads.
6
shared void *upc_all_alloc(nblocks, nbytes);
This function allocates shared [nbytes] char[nblocks*nbytes].
    shared [5] char *p;
    p=upc_all_alloc(4,5);
[Diagram: shared and local memory for th 0, th 1, th 2. Every thread calls p=upc_all_alloc(4,5) and holds its own p. The four 5-byte blocks start at offsets 0, 5, 10, and 15: thread 0 holds those at 0 and 15, thread 1 the one at 5, thread 2 the one at 10.]
7
The V1.0 UPC Collectives Spec
First draft by Wiebel and Greenberg, March 2002.
Spec discussed at the May 2002 and SC’02 UPC workshops.
Many helpful comments from Dan Bonachea and Brian Wibecan.
V1.0 will be released shortly.
8
Collective functions
Initialization
    upc_all_init
“Relocalization” collectives change data affinity.
    upc_all_broadcast
    upc_all_scatter
    upc_all_gather
    upc_all_gather_all
    upc_all_exchange
    upc_all_permute
“Computational” collectives for reduction and sorting.
    upc_all_reduce
    upc_all_prefix_reduce
    upc_all_sort
9
void upc_all_broadcast(dst, src, blk);
    shared [] char src[blk];
    shared [blk] char dst[blk*THREADS];
Thread 0 sends the same block of data to each thread.
[Diagram: shared and local memory for th 0, th 1, th 2; the src block of size blk on thread 0 is copied into the dst block of every thread.]
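The data movement of a broadcast can be sketched in plain C. This is a single-process simulation, not UPC and not the reference implementation: sim_broadcast, SIM_THREADS, and SIM_ROW are hypothetical names, and each row of dst stands for the part of the shared array with affinity to one thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_THREADS 3   /* hypothetical stand-in for THREADS */
#define SIM_ROW 16      /* hypothetical capacity of one thread's partition */

/* dst[t] models the block of dst with affinity to thread t;
   src models the single block that resides on thread 0. */
void sim_broadcast(char dst[SIM_THREADS][SIM_ROW], const char *src, size_t blk)
{
    assert(blk <= SIM_ROW);
    for (size_t t = 0; t < SIM_THREADS; ++t)
        memcpy(dst[t], src, blk);   /* every thread receives the same block */
}
```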
10
void upc_all_scatter(dst, src, blk);
    shared [] char src[blk*THREADS];
    shared [blk] char dst[blk*THREADS];
Thread 0 sends a unique block of data to each thread.
[Diagram: shared and local memory for th 0, th 1, th 2, with dst and src.]
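A scatter differs from a broadcast only in which source block each thread receives. The following plain C simulation is a sketch under the same kind of assumptions as any single-process model: sim_scatter, SIM_THREADS, and SIM_ROW are hypothetical names, src models an array held entirely by thread 0, and dst[t] models the block with affinity to thread t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_THREADS 3   /* hypothetical stand-in for THREADS */
#define SIM_ROW 16      /* hypothetical capacity of one thread's partition */

/* src models "shared [] char src[blk*THREADS]" (all on thread 0);
   dst[t] models the block of dst with affinity to thread t. */
void sim_scatter(char dst[SIM_THREADS][SIM_ROW], const char *src, size_t blk)
{
    assert(blk <= SIM_ROW);
    for (size_t t = 0; t < SIM_THREADS; ++t)
        memcpy(dst[t], src + t * blk, blk);   /* block t goes to thread t */
}
```

In index terms the copy is the identity (dst[i] = src[i]); only the affinity of the data changes, which is why the slides call these operations relocalization.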
11
void upc_all_gather(dst, src, blk);
    shared [blk] char src[blk*THREADS];
    shared [] char dst[blk*THREADS];
Each thread sends a block of data to thread 0.
[Diagram: shared and local memory for th 0, th 1, th 2, with dst and src.]
12
void upc_all_gather_all(dst, src, blk);
Each thread sends one block of data to all threads.
[Diagram: shared and local memory for th 0, th 1, th 2, with dst and src.]
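The gather_all pattern can be sketched as a single-process C simulation. The names sim_gather_all, SIM_THREADS, and SIM_ROW are hypothetical, and the layout is an assumption: src[s] models the one block contributed by thread s, and dst[t] models the THREADS blocks that end up with affinity to thread t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_THREADS 3   /* hypothetical stand-in for THREADS */
#define SIM_ROW 16      /* hypothetical capacity of one thread's partition */

/* After the call, every dst[t] holds the blocks of all threads,
   ordered by the contributing thread's number. */
void sim_gather_all(char dst[SIM_THREADS][SIM_ROW],
                    char src[SIM_THREADS][SIM_ROW], size_t blk)
{
    assert(blk * SIM_THREADS <= SIM_ROW);
    for (size_t t = 0; t < SIM_THREADS; ++t)        /* receiving thread */
        for (size_t s = 0; s < SIM_THREADS; ++s)    /* sending thread */
            memcpy(dst[t] + s * blk, src[s], blk);
}
```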
13
void upc_all_exchange(dst, src, blk);
Each thread sends a unique block of data to each thread.
[Diagram: shared and local memory for th 0, th 1, th 2, with dst and src.]
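An exchange behaves like a transpose of blocks across threads. The sketch below is a plain C simulation with hypothetical names (sim_exchange, SIM_THREADS, SIM_ROW); it assumes each thread holds THREADS blocks in src and receives THREADS blocks in dst, with block t of thread s landing as block s of thread t.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_THREADS 3   /* hypothetical stand-in for THREADS */
#define SIM_ROW 16      /* hypothetical capacity of one thread's partition */

/* src[s] and dst[t] model the partitions with affinity to threads s and t. */
void sim_exchange(char dst[SIM_THREADS][SIM_ROW],
                  char src[SIM_THREADS][SIM_ROW], size_t blk)
{
    assert(blk * SIM_THREADS <= SIM_ROW);
    for (size_t t = 0; t < SIM_THREADS; ++t)        /* receiving thread */
        for (size_t s = 0; s < SIM_THREADS; ++s)    /* sending thread */
            memcpy(dst[t] + s * blk, src[s] + t * blk, blk);
}
```

With one-byte blocks and source partitions "abc", "def", "ghi", the destination partitions become "adg", "beh", "cfi".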
14
void upc_all_permute(dst, src, perm, blk);
Thread i sends a block of data to thread perm(i).
[Diagram: shared and local memory for th 0, th 1, th 2, with dst, src, and perm; perm holds 1, 2, 0.]
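The permutation rule on this slide can be sketched as a plain C simulation. The names sim_permute, SIM_THREADS, and SIM_ROW are hypothetical, and the model assumes one block per thread in both src and dst; the example values 1, 2, 0 for perm are the ones shown in the slide's diagram.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_THREADS 3   /* hypothetical stand-in for THREADS */
#define SIM_ROW 16      /* hypothetical capacity of one thread's partition */

/* src[t] and dst[t] model the blocks with affinity to thread t.
   perm must be a permutation of 0..SIM_THREADS-1. */
void sim_permute(char dst[SIM_THREADS][SIM_ROW],
                 char src[SIM_THREADS][SIM_ROW],
                 const int perm[SIM_THREADS], size_t blk)
{
    assert(blk <= SIM_ROW);
    for (size_t t = 0; t < SIM_THREADS; ++t) {
        assert(perm[t] >= 0 && perm[t] < SIM_THREADS);
        memcpy(dst[perm[t]], src[t], blk);   /* thread t sends to perm[t] */
    }
}
```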
15
Computational collectives
Reduce and prefix reduce
    One function for each C scalar type, e.g., upc_all_reduceI(…) returns an integer
    Operations: +, *, &, |, XOR, &&, ||, min, max, user-defined binary function
Sort
    User-defined comparison function
    void upc_all_sort(shared void *A, size_t size, size_t n, size_t blk,
                      int (*func)(shared void *, shared void *));
16
int upc_all_reduceI(src, UPC_ADD, n, blk, NULL);
    shared [3] int src[4*THREADS];
    int i;
    i=upc_all_reduceI(src,UPC_ADD,12,3,NULL);
Thread 0 receives UPC_OP src[i] taken over i = 0, ..., n-1.
[Diagram: shared and local memory for th 0, th 1, th 2. src holds 1, 2, 4, 8, ..., 2048 in blocks of 3 that start at indices 0, 3, 6, 9. The per-thread partial sums are 3591 (th 0), 56 (th 1), and 448 (th 2); the result 4095 is received by thread 0.]
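The partial sums in the diagram suggest a two-step reduction: each thread first reduces the elements it has affinity to, and the per-thread results are then combined. The plain C sketch below simulates that idea for UPC_ADD only. The names sim_reduce_add and SIM_THREADS are hypothetical, and the two-step structure is an assumption about one possible implementation, not a statement about the spec.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_THREADS 3   /* hypothetical stand-in for THREADS */

/* Sum n ints laid out as "shared [blk] int src[n]".
   partial[t] receives the sum of the elements with affinity to thread t;
   the return value is the combined result that thread 0 would receive. */
int sim_reduce_add(const int *src, size_t n, size_t blk,
                   int partial[SIM_THREADS])
{
    assert(blk > 0);
    for (size_t t = 0; t < SIM_THREADS; ++t)
        partial[t] = 0;
    /* Step 1: local reductions, using round-robin block affinity. */
    for (size_t i = 0; i < n; ++i)
        partial[(i / blk) % SIM_THREADS] += src[i];
    /* Step 2: combine the per-thread results. */
    int total = 0;
    for (size_t t = 0; t < SIM_THREADS; ++t)
        total += partial[t];
    return total;
}
```

For the slide's data (the powers of two from 1 to 2048, n = 12, blk = 3) this yields the partial sums 3591, 56, and 448 and the total 4095.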
17
void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);
    shared [*] int src[3*THREADS], dst[3*THREADS];
dst[k] receives UPC_OP src[i] taken over i = 0, ..., k.
[Diagram: shared and local memory for th 0, th 1, th 2, with blocks that start at indices 0, 3, 6. src holds 1, 2, 4 | 8, 16, 32 | 64, 128, 256; dst holds 1, 3, 7 | 15, 31, 63 | 127, 255, 511.]
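With the blocked layout of this slide, a prefix reduction splits naturally into two phases: find the running total that precedes each thread's block, then scan each block locally. The plain C simulation below sketches this for UPC_ADD. The names sim_prefix_add and SIM_THREADS are hypothetical and the two-phase scheme is an assumed implementation strategy, shown only to make the diagram's numbers reproducible.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_THREADS 3   /* hypothetical stand-in for THREADS */

/* Inclusive prefix sum over blk*SIM_THREADS ints, where thread t
   owns the contiguous elements t*blk .. (t+1)*blk - 1. */
void sim_prefix_add(int *dst, const int *src, size_t blk)
{
    assert(dst != NULL && src != NULL);
    int offset[SIM_THREADS];
    int running = 0;
    /* Phase 1: offset[t] is the sum of all blocks before thread t's block. */
    for (size_t t = 0; t < SIM_THREADS; ++t) {
        offset[t] = running;
        for (size_t j = 0; j < blk; ++j)
            running += src[t * blk + j];
    }
    /* Phase 2: each thread scans its own block, starting from its offset. */
    for (size_t t = 0; t < SIM_THREADS; ++t) {
        int acc = offset[t];
        for (size_t j = 0; j < blk; ++j) {
            acc += src[t * blk + j];
            dst[t * blk + j] = acc;
        }
    }
}
```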
18
Performance and implementation issues
“Push” or “pull”?
Synchronization semantics
Effects of data distribution
19
A “pull” implementation of upc_all_broadcast
    void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
    {
        upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
    }
[Diagram: shared and local memory for th 0, th 1, th 2, with dst and src.]
20
A “push” implementation of upc_all_broadcast
    void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
    {
        int i;
        upc_forall( i=0; i<THREADS; ++i; 0)   // Thread 0 only
            upc_memcpy( (shared char *)dst + i, (shared char *)src, blk );
    }
[Diagram: shared and local memory for th 0, th 1, th 2, with dst and src; each thread has a private i, and thread 0 steps i through 0, 1, 2.]
21
Synchronization semantics
When are function arguments ready?
When are function results available?
22
Synchronization semantics
Arguments with affinity to thread i are ready when thread i calls the function; results with affinity to thread i are ready when thread i returns.
This is appealing but it is incorrect: in a broadcast, thread 1 does not know when thread 0 is ready.
[Diagram: shared and local memory, with dst and src.]
23
Synchronization semantics
Require the implementation to provide barriers at function entry and exit.
This is convenient for the programmer but it is likely to adversely affect performance.
    void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
    {
        upc_barrier;
        // pull
        upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
        upc_barrier;
    }
24
Synchronization semantics
V1.0 spec: Synchronization is a user responsibility.
    #define numelems 10
    shared [] int A[numelems];
    shared [numelems] int B[numelems*THREADS];
    void upc_all_broadcast( shared void *dst, shared const void *src, size_t blk )
    {
        upc_memcpy( (shared char *)dst + MYTHREAD, (shared char *)src, blk );
    }
    ...
    // Initialize A.
    ...
    upc_barrier;
    upc_all_broadcast( B, A, sizeof(int)*numelems );
    upc_barrier;
25
Performance and implementation issues
Data distribution affects both performance and implementation.
26
void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);
    shared int src[3*THREADS], dst[3*THREADS];
dst[k] receives UPC_OP src[i] taken over i = 0, ..., k.
[Diagram: shared and local memory for th 0, th 1, th 2. With the default block size of 1, indices 0, 1, 2 fall on threads 0, 1, 2. th 0 holds src 1, 8, 64 and dst 1, 15, 127; th 1 holds src 2, 16, 128 and dst 3, 31, 255; th 2 holds src 4, 32, 256 and dst 7, 63, 511.]
27
Extensions
Strided copying
Vectors of offsets for src and dst arrays
Variable-sized blocks
Reblocking (cf. preceding example of prefix reduce)
    shared int src[3*THREADS];
    shared [3] int dst[3*THREADS];
    upc_forall(i=0; i<3*THREADS; i++; ?)
        dst[i] = src[i];
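The cost of the reblocking loop can be estimated by counting how many elements change owner between the two layouts. The plain C sketch below does that count. The function name sim_reblock_moves and its parameters are hypothetical, and it assumes the round-robin block distribution, in which element i of an array with block size b has affinity to thread (i / b) mod THREADS.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements whose owning thread differs between a source layout
   with block size src_blk and a destination layout with block size dst_blk.
   nthreads is a hypothetical stand-in for THREADS. */
size_t sim_reblock_moves(size_t n, size_t src_blk, size_t dst_blk,
                         size_t nthreads)
{
    assert(src_blk > 0 && dst_blk > 0 && nthreads > 0);
    size_t moves = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t from = (i / src_blk) % nthreads;   /* owner of src[i] */
        size_t to = (i / dst_blk) % nthreads;     /* owner of dst[i] */
        if (from != to)
            ++moves;
    }
    return moves;
}
```

For the declarations on this slide with three threads (n = 9, block size 1 to block size 3), six of the nine assignments cross threads, so each of them is a remote access whichever affinity expression replaces the "?".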
28
More sophisticated synchronization semantics
Consider the “pull” implementation of broadcast.
There is no need for arbitrary threads i and j (i, j != 0) to synchronize with each other.
Each thread does a pairwise synchronization with thread 0.
Thread i will not have to wait if it reaches its synchronization point after thread 0.
Thread 0 returns from the call after it has sync’d with each thread.
29
What’s next?
The V1.0 collectives spec will be adopted in the next few weeks.
A reference implementation will be available from MTU immediately afterwards.
30
MuPC run time system for UPC
UPC memory model (Chuck Wallace)
UPC programmability (Phil Merkey)
UPC test suite (Phil Merkey)
http://www.upc.mtu.edu