Slide 1: UPC Workshop, George Washington University, May 6-7, 2003 (© 2003 Michigan Technological University)


Slide 2: The V1.0 UPC Collectives Spec
- First draft by Wiebel and Greenberg, March 2002.
- Spec discussed at the May 2002 and SC'02 UPC workshops.
- Many helpful comments from Dan Bonachea and Brian Wibecan.
- pre4V1.0, dated April 2, is now on the table.

Slide 3: Collective functions
- Initialization: upc_all_init
- 5.3 "Relocalization" collectives change data affinity. These are byte-oriented operations:
  - upc_all_broadcast
  - upc_all_scatter
  - upc_all_gather
  - upc_all_gather_all
  - upc_all_exchange
  - upc_all_permute
- 5.4 "Computational" collectives for reduction and sorting. These operations respect data type and blocksize:
  - upc_all_reduce
  - upc_all_prefix_reduce
  - upc_all_sort

Slide 4: Remaining collectives spec issues (large and small)
- Wording used to specify the affinity of certain arguments
- {signed} option for types supported by reduce and prefix reduce operations
- What requirements are made of the phase of function arguments?
- Associativity of reduce and prefix reduce operations
- Commutativity of reduce and prefix reduce operations
- Can nbytes be 0 in 5.3 functions?
- What are the synchronization semantics?

Slide 5: Wording used to specify the affinity of certain arguments
- Resolved: The target of the src/dst pointer must have affinity to thread 0.
- This applies to the undistributed arrays: the source of a broadcast or scatter and the destination of a gather (the arrays declared shared [] in the examples below).

Slide 6: {signed} option for types supported by reduce and prefix reduce operations
- "signed char" and "char" are separate and incompatible types.
- Resolved: Remove the brackets around the signed keyword for all the types. Arguments of type "char" are treated in an implementation-dependent manner.
- Resolved: Remove references to "ASCII values" since these equivalents are already specified by ANSI C.

Slide 7: What requirements are made of the phase of function arguments?
- Resolved: Remove the "common" statement regarding phase.
- Resolved: To the 5.3 functions add: "The src and dst arguments are treated as if they have zero phase."
- Resolved: To the 5.4 functions add: "The phase field for the X argument is respected when referencing array elements."

Slide 8: Associativity and commutativity of reduce and prefix reduce operations
- All provided reduction operators are assumed to be associative. All reduction operators (except those provided using UPC_NONCOMM_FUNC) are assumed to be commutative.
- The operation op is always assumed to be associative. All predefined operations are also assumed to be commutative. Users may define operations that are assumed to be associative, but not commutative. The "canonical" evaluation order of a reduction is in the order of array indices. However, the implementation may take advantage of associativity, or of associativity and commutativity, to change the order of evaluation. This may change the result of the reduction for operations that are not strictly associative and commutative, such as floating-point addition.
- Advice to implementors: It is strongly recommended that the function be implemented so that the same result is obtained whenever the function is applied to the same arguments appearing in the same order. Note that this may prevent optimizations that take advantage of the physical location of processors.

Slide 9: Alternative synchronization semantics
1a) The collective function may begin to read or write data when any thread enters the collective function.
1b) The collective function may begin to read or write data with affinity to a thread when that thread enters the collective function.
1c) The collective function may begin to read or write data when all threads have entered the collective function.
2a) The collective function may exit before the operation is complete. The operation is guaranteed to be complete at the beginning of the next synchronization phase.
2b) The collective function may return in a thread when all reads and writes with affinity to the thread are complete.
2c) The operation is complete when any thread exits the collective function.
3) Each collective function implements any pair (1x, 2y) of synchronization requirements based on the argument UPC_SYNC_SEM.

Slide 10: Synch semantic naming ideas
UPC_BEGIN_ON_{ANY, MINE, ALL}_COMPLETE_{LATER, MINE, ALL}

Slide 11: Can nbytes be 0 in 5.3 functions?
- Resolved: Yes. Use the variable name numbytes to distinguish it from nbytes in the allocation functions. Add a statement that if numbytes is 0 then the function is a no-op.

Slide 12: 1. Synchronization phase
"Arguments to each call to a collective function must be ready at the beginning of the synchronization phase in which the call is made. Results of each call to a collective function are not ready until the beginning of the next synchronization phase."
- This is a policy that can be relaxed as implementations demonstrate that fewer constraints lead to better performance.
- This is an easy-to-remember semantic.

Slide 13: 2. Bill's strict semantic
On input, no data will be accessed until all threads enter the collective function. On exit, all output will be written before any thread exits the collective function.

Slide 14: 3. Affinity-based semantics
Source data with affinity to a thread must be ready when that thread calls the collective function. Destination data with affinity to a thread will be ready when that thread returns from the collective function.
- Version A: Provide two versions of each collective, with distinct function names: "strict" guarantees Bill's strict semantics; "relaxed" guarantees the affinity-based semantics.
- Version B: Only the "relaxed" affinity-based version is provided; the user provides explicit barriers to guarantee safety.

Slide 15: 4. "Split-phase" semantics
Split-phase collectives: how can the split-phase concept be extended to describe the synchronization semantics of the collective functions?

Slide 16: What are the synchronization semantics?
- Resolution A: Provide two versions of each collective, with distinct function names: "strict" guarantees entry and exit barriers; "relaxed" applies the affinity-based semantics.
- Resolution B: Only the "relaxed" affinity-based version is provided; the user provides explicit barriers to guarantee safety.

Slide 17: void upc_all_broadcast(dst, src, blk);
Thread 0 sends the same block of data to each thread.
    shared []    char src[blk];
    shared [blk] char dst[blk*THREADS];
[Diagram: the blk-byte src block on thread 0 is copied into each thread's block of dst.]

Slide 18: void upc_all_scatter(dst, src, blk);
Thread 0 sends a unique block of data to each thread.
    shared []    char src[blk*THREADS];
    shared [blk] char dst[blk*THREADS];
[Diagram: thread 0's src is split into blk-byte blocks, one per thread's slice of dst.]

Slide 19: void upc_all_gather(dst, src, blk);
Each thread sends a block of data to thread 0.
    shared [blk] char src[blk*THREADS];
    shared []    char dst[blk*THREADS];
[Diagram: each thread's blk-byte block of src is collected into dst on thread 0.]

Slide 20: void upc_all_gather_all(dst, src, blk);
Each thread sends one block of data to all threads.
[Diagram: every thread's src block is copied into every thread's slice of dst.]

Slide 21: void upc_all_exchange(dst, src, blk);
Each thread sends a unique block of data to each thread.
[Diagram: block j of thread i's src lands in block i of thread j's dst, like a blockwise transpose.]

Slide 22: void upc_all_permute(dst, src, perm, blk);
Thread i sends a block of data to thread perm(i).
[Diagram: with perm = {1, 2, 0}, thread 0's block goes to thread 1, thread 1's to thread 2, and thread 2's to thread 0.]

Slide 23: int upc_all_reduceI(src, UPC_ADD, n, blk, NULL);
    shared [3] int src[4*THREADS];
    int i;
    i = upc_all_reduceI(src, UPC_ADD, 12, 3, NULL);
Thread 0 receives UPC_OP applied over src[0..n-1].
[Diagram: with THREADS = 3, src holds the powers of two 1, 2, 4, ..., 2048; the per-thread partial sums combine so that thread 0 receives i = 4095.]

Slide 24: void upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);
    shared [*] int src[3*THREADS], dst[3*THREADS];
Thread k receives UPC_OP applied over src[0..k].
[Diagram: src holds 1, 2, 4, ..., 256; dst receives the running sums 1, 3, 7, 15, 31, 63, 127, 255, 511.]

Slide 25: A "pull" implementation of upc_all_broadcast

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    upc_memcpy( (shared char *)dst + MYTHREAD,
                (shared char *)src, blk );
}

[Diagram: each thread pulls the src block on thread 0 into its own block of dst.]

Slide 26: A "push" implementation of upc_all_broadcast

void upc_all_broadcast( shared void *dst,
                        shared const void *src, size_t blk )
{
    int i;
    upc_forall( i=0; i<THREADS; ++i; 0 )   // Thread 0 only
        upc_memcpy( (shared char *)dst + i,
                    (shared char *)src, blk );
}

[Diagram: thread 0 iterates i = 0, 1, 2, pushing the src block into each thread's block of dst.]

Slide 27: Prefix reduce with the default (cyclic) layout
    shared int src[3*THREADS], dst[3*THREADS];
    upc_all_prefix_reduceI(dst, src, UPC_ADD, n, blk, NULL);
Thread k receives UPC_OP applied over src[0..k].
[Diagram: src holds 1, 2, 4, ..., 256 distributed cyclically; dst receives the running sums 1, 3, 7, 15, 31, 63, 127, 255, 511.]

Slide 28: Extensions
- Strided copying
- Vectors of offsets for src and dst arrays
- Variable-sized blocks
- Reblocking (cf. preceding example of prefix reduce):

    shared int src[3*THREADS];
    shared [3] int dst[3*THREADS];
    upc_forall(i=0; i<3*THREADS; i++; ?)
        dst[i] = src[i];

Slide 29: More sophisticated synchronization semantics
- Consider the "pull" implementation of broadcast. There is no need for arbitrary threads i and j (i, j != 0) to synchronize with each other. Each thread does a pairwise synchronization with thread 0. Thread i will not have to wait if it reaches its synchronization point after thread 0 does. Thread 0 returns from the call after it has synchronized with each thread.

Slide 30: What requirements are made of the phase of function arguments?
- Resolved: Remove the "common" statement regarding phase.
- Resolved: To the 5.3 functions add: "The src and dst arguments are treated as if they have zero phase."
- Resolved: To the 5.4 functions add: "The phase field for the X argument is respected when referencing array elements."
  - Suitably define "respected".
  - Note that "respecting" the phase requires over 20 integer operations to compute the address of an arbitrary array element, given:
    - a shared void * array address of arbitrary phase
    - an element index (offset)
    - the blocksize and element size

Slide 31: Commutativity of reduce and prefix reduce operations
- All reduction operators (except those provided using UPC_NONCOMM_FUNC) are assumed to be commutative. A reduction operator that is assumed commutative but whose result depends on a particular order of execution produces undefined results.

