
Slide 1: Optimizing Collective Communication for Multicore
Rajesh Nishtala
Parallel Computing Laboratory (Berkeley Par Lab), EECS, UC Berkeley

Slide 2: What Are Collectives?
- An operation called by all threads together to perform globally coordinated communication
- May involve a modest amount of computation, e.g. to combine values as they are communicated
- Can be extended to teams (or communicators), in which case they operate on a predefined subset of the threads
- The focus here is on collectives in Single Program Multiple Data (SPMD) programming models

Slide 3: Some Collectives
- Barrier (MPI_Barrier()): a thread cannot exit a call to the barrier until all other threads have called the barrier
- Broadcast (MPI_Bcast()): a root thread sends a copy of an array to all the other threads
- Reduce-To-All (MPI_Allreduce()): each thread contributes an operand to an arithmetic operation across all the threads; the result is then broadcast to all the threads
- Exchange (MPI_Alltoall()): for all i, j < N, thread i copies the jth piece of its input array into the ith slot of the output array located on thread j
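For concreteness, here is a minimal sketch (not from the talk) of these four collectives using the standard MPI C API; the buffer sizes and values are illustrative.

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Barrier: no rank may exit until every rank has entered. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Broadcast: the root (rank 0) sends a copy of buf to everyone. */
    double buf[4] = {0};
    if (rank == 0) buf[0] = 3.14;
    MPI_Bcast(buf, 4, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Reduce-To-All: every rank contributes an operand; the combined
       result (here a sum) is returned on all ranks. */
    double local = (double)rank, sum = 0.0;
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    /* Exchange: rank i's j-th block becomes the i-th block of rank j's
       output array. */
    int *sendbuf = malloc(nprocs * sizeof(int));
    int *recvbuf = malloc(nprocs * sizeof(int));
    for (int j = 0; j < nprocs; j++) sendbuf[j] = rank * 100 + j;
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf); free(recvbuf);
    MPI_Finalize();
    return 0;
}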

Slide 4: Why Are They Important?
- Basic communication building blocks, found in many parallel programming languages and libraries
- Abstraction: if an application is written with collectives, it passes the responsibility of tuning to the runtime

Percentage of runtime spent in collectives (Opteron/InfiniBand/256):
                          Class C    Class D
  Exchange in NAS FT       ~28%       ~23%
  Reductions in NAS CG     ~42%       ~28%

Slide 5: Experimental Setup
- Platforms
  - Sun Niagara2: 1 socket of 8 multi-threaded cores; each core supports 8 hardware thread contexts, for 64 total threads
  - Intel Clovertown: 2 "traditional" quad-core sockets
  - BlueGene/P: 1 quad-core socket
- MPI for inter-process communication: shared-memory MPICH2 1.0.7

Slide 6: Threads v. Processes (Niagara2)
- Barrier performance: perform a barrier across all 64 threads
- Threads are arranged into processes in different ways: one extreme has one thread per process, the other has one process with 64 threads
- MPI_Barrier() is called between processes, with a flat barrier amongst the threads within each process (a sketch of this hybrid scheme follows below)
- Two orders of magnitude difference in performance between the extremes!
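A hedged sketch (not the dissertation's code) of the hybrid scheme the slide describes: threads inside a process meet at a shared-memory barrier, a single representative thread calls MPI_Barrier between processes, and the local barrier then releases everyone. The POSIX barrier and the function name are illustrative choices; it assumes MPI was initialized with at least MPI_THREAD_FUNNELED and that local_barrier was initialized with the per-process thread count.

#include <mpi.h>
#include <pthread.h>

/* Shared by all threads of this process; initialized elsewhere with the
   number of threads in the process. */
static pthread_barrier_t local_barrier;

void hybrid_barrier(int thread_id) {
    pthread_barrier_wait(&local_barrier);   /* all local threads have arrived */
    if (thread_id == 0)                     /* one representative per process */
        MPI_Barrier(MPI_COMM_WORLD);
    pthread_barrier_wait(&local_barrier);   /* release local threads only after
                                               the inter-process barrier completes */
}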

Slide 7: Threads v. Processes (Niagara2), continued
- Other collectives see similar scaling issues when using processes
- MPI collectives are called between processes, while shared memory is leveraged within a process

Slide 8: Intel Clovertown and BlueGene/P
- Fewer threads per node
- The differences are not as drastic, but they are non-trivial

Slide 9: Optimizing Barrier with Trees
- Leveraging shared memory is a critical optimization
- Flat trees don't scale; use trees to aid parallelism
- Requires two passes of a tree (sketched in code below):
  - First (UP) pass indicates that all threads have arrived: signal your parent once all of your children have arrived; once the root gets the signal from all of its children, all threads have reported in
  - Second (DOWN) pass tells every thread that all threads have arrived: wait for your parent to send you the clear signal, then propagate the clear signal down to your children
(Figure: an example tree over threads 0-15.)
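A minimal sketch of the two-pass signaling described above, written with C11 atomics. The names (children, arrived, clear_flag), the fixed thread count, and the use of two alternating flag sets are assumptions for illustration; cache-line padding of the flags is omitted here and discussed in the backup slides.

#include <stdatomic.h>

#define MAX_THREADS 64

/* Tree shape: children[t] lists thread t's children, num_children[t] their
   count; how they are filled in (the radix) is the tuning knob. */
static int children[MAX_THREADS][MAX_THREADS];
static int num_children[MAX_THREADS];

/* Two alternating sets of flags so back-to-back barriers never race on
   stale values from the previous barrier. */
static atomic_int arrived[2][MAX_THREADS];    /* written by a child, read by its parent   */
static atomic_int clear_flag[2][MAX_THREADS]; /* written by a parent, read by its children */

void tree_barrier(int me) {
    static _Thread_local int phase = 0;

    /* UP pass: wait until every child's subtree has arrived, then report
       to my own parent (the root, thread 0, has no parent). */
    for (int c = 0; c < num_children[me]; c++)
        while (!atomic_load(&arrived[phase][children[me][c]]))
            ;  /* spin */
    if (me != 0)
        atomic_store(&arrived[phase][me], 1);

    /* DOWN pass: wait for my parent's clear signal, then propagate it. */
    if (me != 0)
        while (!atomic_load(&clear_flag[phase][me]))
            ;  /* spin */
    for (int c = 0; c < num_children[me]; c++)
        atomic_store(&clear_flag[phase][children[me][c]], 1);

    /* Reset my own flags for the next use of this flag set, then switch
       to the other set. */
    atomic_store(&arrived[phase][me], 0);
    atomic_store(&clear_flag[phase][me], 0);
    phase ^= 1;
}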

Slide 10: Example Tree Topologies
(Figures: three example trees over 16 threads, rooted at thread 0.)
- Radix-2 k-nomial tree (binomial)
- Radix-4 k-nomial tree (quadnomial)
- Radix-8 k-nomial tree (octnomial)
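One common way to generate such trees programmatically is sketched below: node strides grow by a factor of the radix, which reproduces the binomial tree for radix 2. This construction is an assumption for illustration and may differ in detail from the trees pictured on the slide.

/* Hypothetical helper: fill `children` with the children of thread `me`
   in a radix-r k-nomial tree over n threads rooted at thread 0, and
   return how many there are. For r = 2 this yields a binomial tree. */
int knomial_children(int me, int n, int r, int *children) {
    int count = 0;
    for (int stride = 1; stride < n; stride *= r) {
        if (me % (stride * r) != 0)
            break;                 /* `me` joins the tree as a child at this level */
        for (int j = 1; j < r; j++) {
            int child = me + j * stride;
            if (child < n)
                children[count++] = child;
        }
    }
    return count;
}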

Slide 11: Barrier Performance Results
- Time many back-to-back barriers
- Flat tree: just one level, with all threads reporting to thread 0; leverages shared memory but is non-scalable
- Architecture-independent tree (radix = 2): pick a generic "good" radix that is suitable for many platforms, but mismatched to the architecture
- Architecture-dependent tree: search over all radices to pick the tree that best matches the architecture

Slide 12: Broadcast Performance Results
- Time a latency-sensitive broadcast (8 bytes)
- Time a broadcast followed by a barrier and subtract the time for the barrier alone
- This yields an approximation of how long it takes for the last thread to get the data (see the timing sketch below)
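A hedged sketch of that measurement using plain MPI timing; the 8-byte payload, the iteration count, and the function name are illustrative.

#include <mpi.h>

/* Returns an approximate broadcast latency: (broadcast + barrier) time
   minus barrier-only time, averaged over `iters` repetitions. */
double bcast_latency(int iters) {
    double payload = 0.0;                       /* 8-byte payload */
    MPI_Barrier(MPI_COMM_WORLD);

    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        MPI_Bcast(&payload, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Barrier(MPI_COMM_WORLD);
    }
    double bcast_plus_barrier = (MPI_Wtime() - t0) / iters;

    t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_only = (MPI_Wtime() - t0) / iters;

    /* Approximates how long it takes the last rank to receive the data. */
    return bcast_plus_barrier - barrier_only;
}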

Slide 13: Reduce-To-All Performance Results
- 4 kB (512 doubles) Reduce-To-All
- In addition to the data movement, we also want to parallelize the computation
- In the flat approach, the computation gets serialized at the root
- Tree-based approaches allow us to parallelize the computation amongst all the floating-point units
- 8 threads share one FPU, so radix 2, 4, and 8 trees serialize the computation in about the same way

Slide 14: Optimization Summary
- Relying on flat trees is not enough for most collectives
- Architecture-dependent tuning is a further and important optimization

Slide 15: Extending the Results to a Cluster
- Use one rack of BlueGene/P (1024 nodes, or 4096 cores)
- Reduce-To-All: one representative thread per process makes the call to the inter-node Allreduce, reducing the number of messages in the network (a sketch follows below)
- Vary the number of threads per process, but use all the cores
- Relying purely on shared memory doesn't always yield the best performance: the number of active cores working on the computation drops
- Can optimize so that the computation is partitioned across cores; this is not suitable for a direct call to MPI_Allreduce()
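A hedged sketch of the representative-thread scheme in its simplest form: threads deposit partial values in shared memory, thread 0 of each process reduces them locally and makes the single inter-node MPI_Allreduce call, and the result is then visible to every thread. The names and the fixed thread count are illustrative, and the further optimization of partitioning the local computation across cores is not shown.

#include <mpi.h>
#include <pthread.h>

#define THREADS_PER_PROC 4                  /* e.g. one process per quad-core node */
static pthread_barrier_t local_barrier;     /* initialized for THREADS_PER_PROC    */
static double partial[THREADS_PER_PROC];    /* per-thread contributions            */
static double node_result;                  /* Allreduce result for this node      */

double hybrid_allreduce_sum(int tid, double my_value) {
    partial[tid] = my_value;
    pthread_barrier_wait(&local_barrier);   /* all contributions are in place */

    if (tid == 0) {
        /* Node-local reduction in shared memory ... */
        double node_sum = 0.0;
        for (int t = 0; t < THREADS_PER_PROC; t++)
            node_sum += partial[t];
        /* ... then a single inter-node call per process, which keeps the
           number of messages in the network down. */
        MPI_Allreduce(&node_sum, &node_result, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
    }
    pthread_barrier_wait(&local_barrier);   /* result is now visible to all */
    return node_result;
}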

Slide 16: Potential Synchronization Problem
1. Broadcast variable x from the root
2. Have proc 1 set a new value for x on proc 4

broadcast x=1 from proc 0
if (myid == 1) {
    put x=5 to proc 4
} else {
    /* do nothing */
}

(Animation: procs 0-4 all start with x undefined; the broadcast's value x=1 reaches the procs one by one while proc 1, which already has x=1, issues its put of x=5 to proc 4; in the final frame proc 4 holds x=1 again.)
- The put of x=5 by proc 1 has been lost: the broadcast's delivery of x=1 to proc 4 arrives after the put and overwrites it
- Proc 1 observed a globally incomplete collective; proc 1 thinks the collective is done

Slide 17: Strict v. Loose Synchronization
- A fix to the problem: add a barrier before/after the collective, which enforces a global ordering of the operations
- Is there a problem? We want to decouple synchronization from data movement
- Specify the synchronization requirements: potential to aggregate synchronization, done by the user or a smart compiler
- How can we realize these gains in applications?

Slide 18: Conclusions
- Moving from processes to threads is a crucial optimization for single-node collective communication
- Tree-based collectives can be used to realize better performance, even for collectives on one node
- Picking the tree that best matches the architecture yields the best performance
- Multicore adds to the (auto)tuning space for collective communication
- Shared-memory semantics allow us to create new, loosely synchronized collectives

Slide 19: Questions?

Slide 20: Backup Slides

Slide 21: Threads and Processes
- Threads
  - A sequence of instructions and an execution stack
  - Communication between threads happens through a common, shared address space: no OS/network involvement is needed, but reasoning about inter-thread communication can be tricky
- Processes
  - A set of threads and an associated memory space; all threads within a process share the address space
  - Communication between processes must be managed through the OS: inter-process communication is explicit but may be slow
  - It is more expensive to switch between processes

Slide 22: Experimental Platforms
(Photos of the three platforms: Niagara2, Clovertown, BG/P.)

Slide 23: Specs

                                  Niagara2        Clovertown             BlueGene/P
# Sockets                         1               2                      1
# Cores / Socket                  8               4                      4
Threads per Core                  8               1                      1
Total Thread Count                64              8                      4
Instruction Set                   SPARC           x86/64                 PowerPC
Core Frequency                    1.4 GHz         2.6 GHz                0.85 GHz
Peak DP Floating Point / Core     1.4 GFlop/s     10.4 GFlop/s           3.4 GFlop/s
DRAM Read Bandwidth / Socket      42.7 GB/s       21.3 GB/s              13.6 GB/s
DRAM Write Bandwidth / Socket     21.3 GB/s       10.7 GB/s              13.6 GB/s
L1 Cache Size                     8 kB            32 kB                  32 kB
L2 Cache Size                     4 MB (shared)   16 MB (4 MB/2 cores)   8 MB
OS Version                        Solaris 5.10    Linux 2.6.18           BG/P Compute Kernel
C Compiler                        Sun C (5.9)     Intel ICC (10.1)       IBM BlueGene XLC
MPI Implementation                MPICH2 1.0.7 (ch3:shm device)          BG/P MPI

Slide 24: Details of Signaling
- For optimum performance, have many readers and one writer
  - Each thread sets a flag (a single word) that others will read
  - Every reader gets a copy of the cache line and spins on that copy
  - When the writer comes in and changes the value of the variable, the cache-coherency system handles broadcasting/updating the change
  - Avoids atomic primitives
- On the way up the tree, a child sets a flag indicating that its subtree has arrived; the parent spins on that flag for each child
- On the way down, each child spins on its parent's flag; when it is set, it indicates that the parent wants to broadcast the clear signal down
- Flags must be on different cache lines to avoid false sharing
- Need to switch back and forth between two sets of flags (a sketch follows below)
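A hedged sketch of the flag layout these bullets describe: one writer and many spinning readers, each flag padded to its own cache line to avoid false sharing, with two sets of flags to alternate between across consecutive barriers. The 64-byte line size, the thread count, and the names are assumptions.

#include <stdatomic.h>

#define CACHE_LINE 64
#define MAX_THREADS 64

typedef struct {
    atomic_int val;
    char pad[CACHE_LINE - sizeof(atomic_int)];  /* keep each flag on its own line */
} padded_flag_t;

/* Two alternating sets of per-thread flags, as described above. */
static padded_flag_t arrived[2][MAX_THREADS];
static padded_flag_t clear_flag[2][MAX_THREADS];

/* Readers spin on their locally cached copy of the line; the coherence
   protocol propagates the writer's single store when the flag flips,
   so no atomic read-modify-write primitives are needed. */
static inline void wait_for(padded_flag_t *f) {
    while (!atomic_load_explicit(&f->val, memory_order_acquire))
        ;  /* spin */
}

static inline void set_flag(padded_flag_t *f) {
    atomic_store_explicit(&f->val, 1, memory_order_release);
}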

