Slide 1: Optimizing Collective Communication for Multicore
Rajesh Nishtala
Parallel Computing Laboratory (Berkeley Par Lab), Electrical Engineering and Computer Sciences (EECS), UC Berkeley
Slide 2: What Are Collectives?
- An operation called by all threads together to perform globally coordinated communication.
- May involve a modest amount of computation, e.g., to combine values as they are communicated.
- Can be extended to teams (or communicators), in which case they operate on a predefined subset of the threads.
- This work focuses on collectives in Single Program Multiple Data (SPMD) programming models.
Slide 3: Some Collectives
- Barrier (MPI_Barrier()): a thread cannot exit a call to the barrier until all other threads have called the barrier.
- Broadcast (MPI_Bcast()): a root thread sends a copy of an array to all the other threads.
- Reduce-to-all (MPI_Allreduce()): each thread contributes an operand to an arithmetic operation across all the threads; the result is then broadcast to all the threads.
- Exchange (MPI_Alltoall()): for all i, j < N, thread i copies the j-th piece of its input array into the i-th slot of the output array on thread j.
- (A minimal MPI sketch of these four calls follows below.)
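To make the four operations concrete, here is a minimal C/MPI sketch that calls each of them. The buffer sizes, the MPI_SUM operator, and the 64-rank cap on the exchange buffers are illustrative choices, not anything taken from the talk.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Barrier: no rank leaves until every rank has entered. */
        MPI_Barrier(MPI_COMM_WORLD);

        /* Broadcast: the root (rank 0) sends its array to everyone else. */
        double data[8] = {0};
        if (rank == 0) data[0] = 3.14;
        MPI_Bcast(data, 8, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        /* Reduce-to-all: combine one operand per rank; every rank gets the result. */
        double contrib = (double)rank, sum = 0.0;
        MPI_Allreduce(&contrib, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        /* Exchange: rank i's j-th element lands in slot i of rank j's output. */
        double in[64], out[64];                    /* assumes nprocs <= 64 */
        for (int j = 0; j < 64; j++) in[j] = rank * 100.0 + j;
        MPI_Alltoall(in, 1, MPI_DOUBLE, out, 1, MPI_DOUBLE, MPI_COMM_WORLD);

        if (rank == 0) printf("sum of ranks = %g\n", sum);
        MPI_Finalize();
        return 0;
    }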
Slide 4: Why Are They Important?
- Basic communication building blocks, found in many parallel programming languages and libraries.
- Abstraction: if an application is written with collectives, it passes the responsibility of tuning to the runtime.
- Percentage of runtime spent in collectives (Opteron/InfiniBand/256):
  - Exchange in NAS FT: ~28% (Class C), ~23% (Class D)
  - Reductions in NAS CG: ~42% (Class C), ~28% (Class D)
Slide 5: Experimental Setup
- Platforms:
  - Sun Niagara2: 1 socket of 8 multi-threaded cores; each core supports 8 hardware thread contexts, for 64 total threads.
  - Intel Clovertown: 2 "traditional" quad-core sockets.
  - BlueGene/P: 1 quad-core socket.
- MPI is used for inter-process communication through shared memory: MPICH2 1.0.7.
Slide 6: Threads vs. Processes (Niagara2): Barrier Performance
- Perform a barrier across all 64 threads.
- The threads are arranged into processes in different ways: one extreme has one thread per process, the other has a single process with 64 threads.
- MPI_Barrier() is called between processes; a flat barrier is used among the threads within each process (a sketch of this hybrid scheme follows below).
- There is a two-orders-of-magnitude difference in performance between the extremes!
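As a rough illustration of the hybrid scheme measured here (not the talk's actual code), the sketch below uses a pthread barrier for the flat shared-memory step and lets one representative thread per process join MPI_Barrier(). It assumes MPI was initialized with at least MPI_THREAD_FUNNELED and that local_barrier was initialized with the process's thread count.

    #include <mpi.h>
    #include <pthread.h>

    /* Shared by all threads in this process; assumed to be set up once with
     * pthread_barrier_init(&local_barrier, NULL, threads_per_process). */
    static pthread_barrier_t local_barrier;

    void hybrid_barrier(int thread_id) {
        /* Phase 1: flat shared-memory barrier among this process's threads. */
        pthread_barrier_wait(&local_barrier);

        /* Phase 2: one representative thread per process joins the MPI barrier. */
        if (thread_id == 0)
            MPI_Barrier(MPI_COMM_WORLD);

        /* Phase 3: hold the other threads until the representative returns. */
        pthread_barrier_wait(&local_barrier);
    }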
Slide 7: Threads vs. Processes (Niagara2), continued
- The other collectives see similar scaling issues when using processes.
- MPI collectives are called between processes, while shared memory is leveraged within a process.
Slide 8: Intel Clovertown and BlueGene/P
- Fewer threads per node on these platforms.
- The differences between threads and processes are not as drastic, but they are non-trivial.
Slide 9: Optimizing Barrier with Trees
- Leveraging shared memory is a critical optimization, but flat trees don't scale; trees are used to aid parallelism.
- Requires two passes over a tree (sketched below):
  - First (UP) pass indicates that all threads have arrived: signal your parent once all of your children have arrived; once the root gets the signal from all of its children, all threads have reported in.
  - Second (DOWN) pass releases the threads: wait for your parent to send you a clear signal, then propagate the clear signal down to your children.
- (Diagram: a tree over threads 0-15.)
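A simplified sketch of the two-pass signaling, assuming a pre-built tree stored in a nodes array, C11 atomics, and monotonically increasing epoch counters in place of the alternating flag sets mentioned in the backup slides; the layout and names are illustrative, not the talk's implementation.

    #include <stdatomic.h>

    #define MAX_CHILDREN 8

    typedef struct {
        atomic_long arrived;     /* epoch of this thread's last UP-pass report   */
        atomic_long go;          /* epoch of the parent's last DOWN-pass release */
        int parent;              /* -1 for the root                              */
        int nchildren;
        int children[MAX_CHILDREN];
        long epoch;              /* private: barriers completed by this thread   */
    } barrier_node_t;

    void tree_barrier(barrier_node_t *nodes, int me) {
        barrier_node_t *n = &nodes[me];
        long e = ++n->epoch;

        /* UP pass: wait for every child's subtree, then report to the parent. */
        for (int i = 0; i < n->nchildren; i++)
            while (atomic_load(&nodes[n->children[i]].arrived) < e)
                ;  /* spin */
        atomic_store(&n->arrived, e);

        /* DOWN pass: the root starts the clear signal; everyone else waits for
         * its parent and then releases its own children. */
        if (n->parent >= 0)
            while (atomic_load(&n->go) < e)
                ;  /* spin */
        for (int i = 0; i < n->nchildren; i++)
            atomic_store(&nodes[n->children[i]].go, e);
    }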
Slide 10: Example Tree Topologies
- Radix-2 k-nomial tree (binomial)
- Radix-4 k-nomial tree (quadnomial)
- Radix-8 k-nomial tree (octnomial)
- (Diagrams: each topology drawn over threads 0-15; one possible construction is sketched below.)
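One common way to lay out a radix-k k-nomial tree rooted at thread 0 is to read thread IDs as base-k numbers: a node's children are obtained by setting one digit below its lowest nonzero digit. The talk's exact construction may differ; this sketch just prints each thread's children under that assumption.

    #include <stdio.h>

    /* Print the children of `me` in a radix-`k` k-nomial tree of `n` nodes. */
    void knomial_children(int me, int n, int k) {
        /* Find the place value of `me`'s lowest nonzero base-k digit
           (the root is allowed to use every digit position). */
        int limit = 1;
        while (limit < n && me % (limit * k) == 0)
            limit *= k;

        for (int place = 1; place < limit; place *= k)
            for (int digit = 1; digit < k; digit++) {
                int child = me + digit * place;
                if (child < n) printf("child of %d: %d\n", me, child);
            }
    }

    int main(void) {
        for (int i = 0; i < 16; i++)
            knomial_children(i, 16, 2);   /* binomial tree over 16 threads */
        return 0;
    }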
Slide 11: Barrier Performance Results
- Time many back-to-back barriers.
- Flat tree: just one level, with all threads reporting to thread 0; it leverages shared memory but is non-scalable.
- Architecture-independent tree (radix = 2): pick a generic "good" radix that is suitable for many platforms, but it is mismatched to the architecture.
- Architecture-dependent tree: search over all radices to pick the tree that best matches the architecture.
- (Chart: barrier performance for the three approaches.)
Slide 12: Broadcast Performance Results
- Time a latency-sensitive broadcast (8 bytes).
- Time a broadcast followed by a barrier, and subtract the time for the barrier alone (sketched below).
- This yields an approximation of how long it takes for the last thread to get the data.
- (Chart: broadcast performance.)
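A sketch of that timing methodology with plain MPI calls; MPI_Wtime() is standard, while the iteration count and microsecond conversion are arbitrary choices made here.

    #include <mpi.h>

    /* Approximate broadcast latency in microseconds: time (bcast + barrier),
     * then subtract a separately measured barrier time. */
    double time_bcast_us(void *buf, int bytes, int iters) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Barrier(MPI_COMM_WORLD);
        double barrier_time = (MPI_Wtime() - t0) / iters;

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            MPI_Bcast(buf, bytes, MPI_BYTE, 0, MPI_COMM_WORLD);
            MPI_Barrier(MPI_COMM_WORLD);
        }
        double bcast_plus_barrier = (MPI_Wtime() - t0) / iters;

        return (bcast_plus_barrier - barrier_time) * 1e6;
    }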
Slide 13: Reduce-to-All Performance Results
- 4 kB (512 doubles) reduce-to-all.
- In addition to the data movement, we also want to parallelize the computation.
- In the flat approach, the computation gets serialized at the root.
- Tree-based approaches let us parallelize the computation among all the floating-point units (the combine step is sketched below).
- On Niagara2, 8 threads share one FPU, so radices 2, 4, and 8 serialize the computation in about the same way.
- (Chart: reduce-to-all performance.)
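A minimal sketch of the combine step that a tree parallelizes: each parent adds its children's partial vectors into its own buffer, so the floating-point work is spread across many cores instead of being serialized at the root. The waiting-for-children signaling is elided here (see the tree-barrier sketch earlier), and the names are hypothetical.

    /* Combine the children's partial results into this node's buffer. */
    void tree_reduce_step(double *mine, double **child_bufs,
                          int nchildren, int nelems) {
        for (int c = 0; c < nchildren; c++)       /* one child at a time      */
            for (int i = 0; i < nelems; i++)
                mine[i] += child_bufs[c][i];      /* element-wise combine     */
    }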
Slide 14: Optimization Summary
- Relying on flat trees is not enough for most collectives.
- Architecture-dependent tuning is a further and important optimization.
Slide 15: Extending the Results to a Cluster
- Use one rack of BlueGene/P (1024 nodes, or 4096 cores).
- Perform the reduce-to-all by having one representative thread per node make the call to the inter-node allreduce, which reduces the number of messages in the network (a sketch of this hybrid scheme follows below).
- Vary the number of threads per process, but always use all the cores.
- Relying purely on shared memory doesn't always yield the best performance: the number of active cores working on the computation drops.
- We can optimize so that the computation is partitioned across cores, but that is not suitable for a direct call to MPI_Allreduce().
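A rough sketch of the hybrid scheme, where node_reduce() and node_bcast() are hypothetical placeholders for the on-node shared-memory collectives discussed earlier (they are declared but not defined here): threads first combine within the node, one representative thread calls MPI_Allreduce() across nodes, and the result is then handed back to the node's threads.

    #include <mpi.h>

    /* Assumed on-node shared-memory collectives (placeholders, not defined here). */
    void node_reduce(double *dst, const double *src, int n, int tid);
    void node_bcast(double *buf, int n, int tid);

    void hybrid_allreduce(double *buf, double *node_sum, int n, int tid) {
        /* 1. Combine the node's contributions in shared memory. */
        node_reduce(node_sum, buf, n, tid);

        /* 2. One thread per node does the inter-node reduce-to-all,
              shrinking the number of messages in the network. */
        if (tid == 0)
            MPI_Allreduce(MPI_IN_PLACE, node_sum, n, MPI_DOUBLE,
                          MPI_SUM, MPI_COMM_WORLD);

        /* 3. Hand the global result back to every thread on the node. */
        node_bcast(node_sum, n, tid);
    }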
Slide 16: Potential Synchronization Problem
- 1. Broadcast variable x from the root.
- 2. Have proc 1 set a new value for x on proc 4.

      broadcast x=1 from proc 0
      if (myid == 1) {
          put x=5 to proc 4
      } else {
          /* do nothing */
      }

- (Animation: the broadcast value x=1 propagates down the tree while proc 1, which already thinks the collective is done, puts x=5 to proc 4; the late-arriving broadcast then leaves proc 4 with x=1.)
- Proc 1 observes a globally incomplete collective: it thinks the collective is done, so its put of x=5 to proc 4 has been lost.
Slide 17: Strict vs. Loose Synchronization
- A fix to the problem: add a barrier before/after the collective, which enforces a global ordering of the operations (sketched below).
- Is there a problem? We want to decouple synchronization from data movement.
- Specify the synchronization requirements; this creates the potential to aggregate synchronization, done either by the user or by a smart compiler.
- How can we realize these gains in applications?
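A C/MPI rendering of the strict fix (the talk's example is written in a PGAS style; the one-sided window win, its setup, and the assumption of at least five ranks are ours): the barrier after the broadcast guarantees every rank has received x = 1 before rank 1 is allowed to overwrite it on rank 4, so the put is no longer lost.

    #include <mpi.h>

    /* x lives in the MPI window `win` on every rank; window creation and an
     * MPI job with at least 5 ranks are assumed. */
    void bcast_then_put(int *x, MPI_Win win, int rank) {
        /* Broadcast x = 1 from rank 0. */
        MPI_Bcast(x, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* Strict fix: everyone has received x = 1 before anyone may overwrite it. */
        MPI_Barrier(MPI_COMM_WORLD);

        if (rank == 1) {
            int five = 5;
            MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 4, 0, win);
            MPI_Put(&five, 1, MPI_INT, 4 /* target */, 0 /* disp */,
                    1, MPI_INT, win);
            MPI_Win_unlock(4, win);
        }
    }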
Slide 18: Conclusions
- Switching from processes to threads is a crucial optimization for single-node collective communication.
- Tree-based collectives can be used to realize better performance, even for collectives on one node; picking the tree that best matches the architecture yields the best performance.
- Multicore adds to the (auto)tuning space for collective communication.
- Shared-memory semantics allow us to create new, loosely synchronized collectives.
Slide 19: Questions?
Slide 20: Backup Slides
Slide 21: Threads and Processes
- Threads:
  - A sequence of instructions and an execution stack.
  - Communication between threads goes through a common, shared address space; no OS or network involvement is needed.
  - Reasoning about inter-thread communication can be tricky.
- Processes:
  - A set of threads and an associated memory space; all threads within a process share the address space.
  - Communication between processes must be managed through the OS.
  - Inter-process communication is explicit but may be slow, and it is more expensive to switch between processes.
Slide 22: Experimental Platforms
- (Figures: Niagara2, Clovertown, BG/P.)
Slide 23: Specs (Niagara2 / Clovertown / BlueGene/P)
- Sockets: 1 / 2 / 1
- Cores per socket: 8 / 4 / 4
- Threads per core: 8 / 1 / 1
- Total thread count: 64 / 8 / 4
- Instruction set: SPARC / x86-64 / PowerPC
- Core frequency: 1.4 GHz / 2.6 GHz / 0.85 GHz
- Peak DP floating-point performance per core: 1.4 GFlop/s / 10.4 GFlop/s / 3.4 GFlop/s
- DRAM read bandwidth per socket: 42.7 GB/s / 21.3 GB/s / 13.6 GB/s
- DRAM write bandwidth per socket: 21.3 GB/s / 10.7 GB/s / 13.6 GB/s
- L1 cache size: 8 kB / 32 kB / 32 kB
- L2 cache size: 4 MB (shared) / 16 MB (4 MB per 2 cores) / 8 MB (shared)
- OS version: Solaris 5.10 / Linux 2.6.18 / BG/P Compute Kernel
- C compiler: Sun C (5.9) / Intel ICC (10.1) / IBM BlueGene XLC
- MPI implementation: MPICH2 1.0.7 (ch3:shm device) / MPICH2 1.0.7 (ch3:shm device) / BG/P MPICH2 port
Slide 24: Details of Signaling
- For optimum performance, have many readers and one writer.
- Each thread sets a flag (a single word) that the others will read:
  - Every reader gets a copy of the cache line and spins on that copy.
  - When the writer comes in and changes the value of the variable, the cache-coherency system handles broadcasting/updating the change.
  - This avoids atomic primitives.
- On the way up the tree, a child sets a flag indicating that its subtree has arrived; the parent spins on that flag for each child.
- On the way down, each child spins on its parent's flag; when it is set, it indicates that the parent wants to broadcast the clear signal down.
- Flags must be on different cache lines to avoid false sharing.
- Need to switch back and forth between two sets of flags (layout sketched below).
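A sketch of the flag layout this implies. The 64-byte line size, the array sizes, and the use of volatile spin loops are illustrative assumptions: each flag is padded out to its own cache line to avoid false sharing, and two alternating flag sets keep consecutive barriers from racing on flag reuse.

    #define CACHE_LINE 64
    #define NTHREADS   64

    typedef struct {
        volatile int val;
        char pad[CACHE_LINE - sizeof(int)];   /* keep each flag on its own line */
    } padded_flag_t;

    /* Two sets of arrival/clear flags; barrier number b uses set (b % 2) and the
     * next barrier uses the other set, so stale values never confuse a reader. */
    static padded_flag_t arrive[2][NTHREADS];
    static padded_flag_t clear_sig[2][NTHREADS];

    static inline void spin_until(volatile int *flag, int value) {
        while (*flag != value)
            ;   /* readers spin on their cached copy; the coherence protocol
                   delivers the writer's single store to all of them */
    }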