
Slide 1: Steven Seidel, Department of Computer Science, Michigan Technological University, steve@mtu.edu. March 19, 2003.

Slide 2: Overview
- Background
- Collective operations in the UPC language
- The V1.0 UPC collectives specification
- Relocalization operations
- Computational operations
- Performance and implementation issues
- Extensions
- Other work

Slide 3: Background
- UPC is an extension of C that provides a partitioned shared memory programming model.
- The V1.1 UPC spec was adopted on March 25.
- Processes in UPC are called threads.
- Each thread has a private (local) address space.
- All threads share a global address space that is partitioned among the threads.
- A shared object that resides in thread i's partition is said to have affinity to thread i.
- If thread i has affinity to a shared object x, accesses to x by thread i are expected to take less time than accesses to shared objects to which thread i does not have affinity.

Slide 4: UPC programming model
[Diagram: a shared address space partitioned across threads th 0, th 1, th 2, above each thread's private local space; each thread holds its own copy of i.]

    shared [5] int A[10*THREADS];
    int i;

    A[0] = 7;
    i = 3;
    A[i] = A[0] + 2;   // A[3] becomes 9

Slide 5: Collective operations in UPC
- If any thread calls a collective function, then all threads must also call that function.
- Collective function arguments are single-valued: corresponding arguments have the same value on every thread.
- V1.1 UPC contains several collective functions:
  - upc_notify and upc_wait
  - upc_barrier
  - upc_all_alloc
  - upc_all_lock_alloc
- These collectives provide synchronization and memory allocation across all threads.

Slide 6: shared void *upc_all_alloc(nblocks, nbytes);
This function collectively allocates shared [nbytes] char[nblocks*nbytes].
[Diagram: threads th 0, th 1, th 2 each hold a private pointer p to the same shared allocation.]

    shared [5] char *p;
    p = upc_all_alloc(4, 5);
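A minimal compilable sketch of the same call, assuming a static-THREADS compilation environment; the initialization loop is illustrative, not from the slide:

    #include <upc.h>

    shared [5] char *p;   // each thread holds its own private pointer p

    int main(void) {
        int i;
        // Every thread makes the same (single-valued) call, and every
        // thread receives a pointer to the same 4*5-byte allocation.
        p = upc_all_alloc(4, 5);

        // Each thread initializes only the bytes it has affinity to.
        upc_forall (i = 0; i < 20; i++; &p[i])
            p[i] = 0;

        upc_barrier;
        return 0;
    }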

Slide 7: The V1.0 UPC collectives spec
- First draft by Wiebel and Greenberg, March 2002.
- The spec was discussed at the May 2002 and SC'02 UPC workshops.
- Many helpful comments from Dan Bonachea and Brian Wibecan.
- V1.0 will be released shortly.

Slide 8: Collective functions
- Initialization: upc_all_init
- "Relocalization" collectives change data affinity:
  - upc_all_broadcast
  - upc_all_scatter
  - upc_all_gather
  - upc_all_gather_all
  - upc_all_exchange
  - upc_all_permute
- "Computational" collectives for reduction and sorting:
  - upc_all_reduce
  - upc_all_prefix_reduce
  - upc_all_sort

Slide 9: void upc_all_broadcast(dst, src, blk);
Thread 0 sends the same block of data to each thread.

    shared [] char src[blk];
    shared [blk] char dst[blk*THREADS];

[Diagram: the src block on thread 0 is copied into every thread's block of dst.]
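A usage sketch with int data and an illustrative block size; the barriers reflect the V1.0 rule, discussed later, that synchronization is the caller's job:

    #include <upc.h>
    #define BLK 4

    shared [] int src[BLK];              // all of src lives on thread 0
    shared [BLK] int dst[BLK*THREADS];   // one block of BLK ints per thread

    int main(void) {
        int i;
        if (MYTHREAD == 0)
            for (i = 0; i < BLK; i++)
                src[i] = i;
        upc_barrier;                     // src must be ready before the call
        upc_all_broadcast(dst, src, BLK * sizeof(int));
        upc_barrier;                     // all copies complete after this
        return 0;
    }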

Slide 10: void upc_all_scatter(dst, src, blk);
Thread 0 sends a unique block of data to each thread.

    shared [] char src[blk*THREADS];
    shared [blk] char dst[blk*THREADS];

[Diagram: block i of src on thread 0 is copied into thread i's block of dst.]
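The corresponding sketch for scatter; after the call, thread t's block of dst holds src[t*BLK .. (t+1)*BLK-1]. Again a static-THREADS environment is assumed so the shared [] declaration may use THREADS:

    #include <upc.h>
    #define BLK 4

    shared [] int src[BLK*THREADS];      // entire source resides on thread 0
    shared [BLK] int dst[BLK*THREADS];

    int main(void) {
        int i;
        if (MYTHREAD == 0)
            for (i = 0; i < BLK*THREADS; i++)
                src[i] = i;
        upc_barrier;
        upc_all_scatter(dst, src, BLK * sizeof(int));
        upc_barrier;
        return 0;
    }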

Slide 11: void upc_all_gather(dst, src, blk);
Each thread sends a block of data to thread 0.

    shared [blk] char src[blk*THREADS];
    shared [] char dst[blk*THREADS];

[Diagram: thread i's block of src is copied into block i of dst on thread 0.]

Slide 12: void upc_all_gather_all(dst, src, blk);
Each thread sends one block of data to all threads.
[Diagram: thread i's block of src is copied into block i of dst on every thread.]
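The slide omits the declarations; by analogy with broadcast and gather, they would plausibly be (dst must hold THREADS blocks per thread):

    shared [blk] char src[blk*THREADS];
    shared [blk*THREADS] char dst[blk*THREADS*THREADS];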

Slide 13: void upc_all_exchange(dst, src, blk);
Each thread sends a unique block of data to each thread.
[Diagram: block j of thread i's src is copied into block i of thread j's dst.]
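Again by analogy, plausible declarations (each thread both sends and receives THREADS blocks):

    shared [blk*THREADS] char src[blk*THREADS*THREADS];
    shared [blk*THREADS] char dst[blk*THREADS*THREADS];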

Slide 14: void upc_all_permute(dst, src, perm, blk);
Thread i sends a block of data to thread perm[i].
[Diagram: with perm = {1, 2, 0}, thread 0's block goes to thread 1, thread 1's to thread 2, and thread 2's to thread 0.]
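A usage sketch in which perm encodes the cyclic shift from the diagram; perm's layout, one int per thread, is an assumption:

    #include <upc.h>
    #define BLK 4

    shared [BLK] int src[BLK*THREADS];
    shared [BLK] int dst[BLK*THREADS];
    shared int perm[THREADS];            // perm[i] has affinity to thread i

    int main(void) {
        // Cyclic shift: thread i sends its block to thread (i+1) % THREADS.
        perm[MYTHREAD] = (MYTHREAD + 1) % THREADS;
        upc_barrier;
        upc_all_permute(dst, src, perm, BLK * sizeof(int));
        upc_barrier;
        return 0;
    }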

Slide 15: Computational collectives
- Reduce and prefix reduce
  - One function for each C scalar type, e.g., upc_all_reduceI(...) returns an integer.
  - Operations: +, *, &, |, XOR, &&, ||, min, max, or a user-defined binary function.
- Sort
  - Takes a user-defined comparison function:

    void upc_all_sort(shared void *A, size_t size, size_t n, size_t blk,
                      int (*func)(shared void *, shared void *));
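A sketch of a user-supplied comparison function for an array of ints; the function name and element type are illustrative:

    // strcmp-style result: negative, zero, or positive.
    int cmp_int(shared void *a, shared void *b) {
        int x = *(shared int *)a;
        int y = *(shared int *)b;
        return (x > y) - (x < y);
    }

    // e.g., sort n ints distributed in blocks of blk elements:
    //   upc_all_sort(A, sizeof(int), n, blk, cmp_int);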

Slide 16: int upc_all_reduceI(src, UPC_ADD, n, blk, NULL);
Thread 0 receives the reduction of src[0..n-1] under the given UPC_OP.

    shared [3] int src[4*THREADS];
    int i;
    i = upc_all_reduceI(src, UPC_ADD, 12, 3, NULL);

[Diagram: THREADS = 3; src holds the powers of two 1, 2, 4, ..., 2048; the per-thread partial sums 3591, 56, and 448 combine to give i = 4095.]
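The call computes what this serial loop sketches (semantics only, not the implementation):

    int j;
    int sum = src[0];
    for (j = 1; j < n; j++)
        sum += src[j];    // UPC_ADD; other UPC_OP values swap in other operators
    // sum == 4095 for the powers-of-two example above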

Slide 17: void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);
Thread k receives the reduction of src[0..k] under the given UPC_OP.

    shared [*] int src[3*THREADS], dst[3*THREADS];

[Diagram: THREADS = 3; src holds 1, 2, 4, ..., 256, and dst receives the running sums 1, 3, 7, 15, 31, 63, 127, 255, 511.]
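Likewise, the prefix reduce computes, in effect:

    int k;
    dst[0] = src[0];
    for (k = 1; k < n; k++)
        dst[k] = dst[k-1] + src[k];   // dst[k] = src[0] + ... + src[k]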

Slide 18: Performance and implementation issues
- "Push" or "pull"?
- Synchronization semantics
- Effects of data distribution

Slide 19: A "pull" implementation of upc_all_broadcast

    void upc_all_broadcast( shared void *dst,
                            shared const void *src, size_t blk )
    {
        // Each thread copies the block from thread 0 into the
        // part of dst it has affinity to.
        upc_memcpy( (shared char *)dst + MYTHREAD,
                    (shared char *)src, blk );
    }

[Diagram: threads th 0, th 1, th 2 each pull the same src block from thread 0 into their own block of dst.]

Slide 20: A "push" implementation of upc_all_broadcast

    void upc_all_broadcast( shared void *dst,
                            shared const void *src, size_t blk )
    {
        int i;
        upc_forall( i=0; i<THREADS; ++i; 0 )   // affinity 0: thread 0 executes every iteration
            upc_memcpy( (shared char *)dst + i,
                        (shared char *)src, blk );
    }

[Diagram: thread 0 pushes the src block to each thread's block of dst in turn.]
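Note the trade-off: in the pull version the THREADS copies can proceed concurrently, one per thread, while in the push version thread 0 issues every copy itself, serializing the transfers through a single thread.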

Slide 21: Synchronization semantics
- When are function arguments ready?
- When are function results available?

Slide 22: Synchronization semantics
- Arguments with affinity to thread i are ready when thread i calls the function; results with affinity to thread i are ready when thread i returns.
- This is appealing, but it is incorrect: in a broadcast, thread 1 does not know when thread 0 is ready.

Slide 23: Synchronization semantics
- Require the implementation to provide barriers at function entry and exit.
- This is convenient for the programmer, but it is likely to adversely affect performance.

    void upc_all_broadcast( shared void *dst,
                            shared const void *src, size_t blk )
    {
        upc_barrier;   // entry barrier: src is ready
        upc_memcpy( (shared char *)dst + MYTHREAD,   // pull
                    (shared char *)src, blk );
        upc_barrier;   // exit barrier: all copies complete
    }

Slide 24: Synchronization semantics
- V1.0 spec: synchronization is a user responsibility.

    #define numelems 10
    shared [] int A[numelems];
    shared [numelems] int B[numelems*THREADS];

    void upc_all_broadcast( shared void *dst,
                            shared const void *src, size_t blk )
    {
        upc_memcpy( (shared char *)dst + MYTHREAD,
                    (shared char *)src, blk );
    }

    // In the caller:
    ...                // initialize A
    upc_barrier;
    upc_all_broadcast( B, A, sizeof(int)*numelems );
    upc_barrier;

Slide 25: Performance and implementation issues
- Data distribution affects both performance and implementation.

Slide 26: void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);
Thread k receives the reduction of src[0..k] under the given UPC_OP.

    shared int src[3*THREADS], dst[3*THREADS];

[Diagram: the same prefix sum as slide 17 (src = 1, 2, 4, ..., 256; dst = 1, 3, 7, ..., 511), but with the default cyclic distribution, so consecutive elements lie on different threads and each partial sum needs operands from other threads.]

Slide 27: Extensions
- Strided copying
- Vectors of offsets for src and dst arrays
- Variable-sized blocks
- Reblocking (cf. the preceding prefix reduce example), e.g. copying a cyclic array into a blocked one, with the affinity clause left open:

    shared int src[3*THREADS];
    shared [3] int dst[3*THREADS];
    upc_forall(i=0; i<3*THREADS; i++; ?)
        dst[i] = src[i];
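One plausible way to fill in the ? affinity clause (an assumption, not from the slide) is to run each iteration on the thread that owns the destination element:

    int i;
    upc_forall (i = 0; i < 3*THREADS; i++; &dst[i])
        dst[i] = src[i];   // each thread writes locally and reads remotely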

Slide 28: More sophisticated synchronization semantics
- Consider the "pull" implementation of broadcast. There is no need for arbitrary threads i and j (i, j != 0) to synchronize with each other.
- Each thread does a pairwise synchronization with thread 0.
- Thread i will not have to wait if it reaches its synchronization point after thread 0.
- Thread 0 returns from the call after it has synchronized with each thread (a flag-based sketch follows).
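A sketch of how that pairwise scheme might look, using one strict shared flag per thread; the flag array, busy-waiting, and acknowledgment protocol are illustrative assumptions, not part of the spec:

    #include <upc.h>

    strict shared int ready[THREADS];   // zero-initialized; one flag per thread

    void broadcast_pull(shared void *dst, shared const void *src, size_t blk)
    {
        int i;
        if (MYTHREAD == 0)
            for (i = 0; i < THREADS; i++)
                ready[i] = 1;               // announce: src may be read

        while (ready[MYTHREAD] == 0)        // pairwise wait on thread 0 only
            ;
        upc_memcpy((shared char *)dst + MYTHREAD, (shared char *)src, blk);
        ready[MYTHREAD] = 0;                // acknowledge the pull

        if (MYTHREAD == 0)
            for (i = 0; i < THREADS; i++)
                while (ready[i] != 0)       // thread 0 returns only after all have pulled
                    ;
    }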

Slide 29: What's next?
- The V1.0 collectives spec will be adopted in the next few weeks.
- A reference implementation will be available from MTU immediately afterwards.

Slide 30: Other work
- MuPC run time system for UPC
- UPC memory model (Chuck Wallace)
- UPC programmability (Phil Merkey)
- UPC test suite (Phil Merkey)

http://www.upc.mtu.edu

