Published by Osborn Greene. Modified over 9 years ago.
1
HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters
Teng Ma, George Bosilca, Aurelien Bouteiller, Jack J. Dongarra
Dec. 2, 2011, ICL Lunch Talk
2
Agenda Introduction Related work Kernel-assisted approach HierKNEM Experiments Conclusion
3
Introduction
Multi-core clusters introduce hardware hierarchies.
Message passing is still the dominant programming model.
Programming libraries want to handle these hierarchies internally.
Collective communication is critical to application performance.
4
Problem: Tuned Collective
It cannot see the boundaries introduced by the hierarchies of multi-core clusters.
It builds its logical topology without runtime hardware topology information.
5
Topology-Unaware: Mismatch problem*
[Figure: Open MPI Tuned Allgather ring algorithm under two process-core bindings, --bycore and --bynode, for processes P0-P3 on cores Core0-Core3 of Node 0 and Node 1. Axis labels: # of nodes, # of cores.]
* T. Ma, T. Herault, G. Bosilca and J. J. Dongarra, Process Distance-aware Adaptive MPI Collective Communications, Cluster 2011
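The mismatch can be illustrated by counting how many links of a rank-ordered ring cross a node boundary under each binding. The following is a minimal sketch, not Open MPI code: the function names and the simple placement model (by-core fills one node before moving to the next, by-node deals ranks round-robin across nodes) are assumptions made for illustration.

```python
def node_of_rank(rank, num_nodes, cores_per_node, mapping):
    """Return the node hosting `rank` under an assumed placement model."""
    if mapping == "bycore":
        # Fill all cores of node 0 first, then node 1, and so on.
        return rank // cores_per_node
    if mapping == "bynode":
        # Deal ranks round-robin across the nodes.
        return rank % num_nodes
    raise ValueError("unknown mapping: " + mapping)


def inter_node_ring_links(num_nodes, cores_per_node, mapping):
    """Count links rank -> (rank + 1) mod size that cross a node boundary."""
    size = num_nodes * cores_per_node
    crossings = 0
    for rank in range(size):
        successor = (rank + 1) % size
        here = node_of_rank(rank, num_nodes, cores_per_node, mapping)
        there = node_of_rank(successor, num_nodes, cores_per_node, mapping)
        if here != there:
            crossings += 1
    return crossings


if __name__ == "__main__":
    for mapping in ("bycore", "bynode"):
        print(mapping, inter_node_ring_links(2, 2, mapping))
```

With two nodes of two cores each, the rank-ordered ring crosses the network on 2 of its 4 links under by-core binding and on all 4 links under by-node binding, even though the algorithm itself is unchanged.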
6
Agenda Introduction Related work Kernel-assisted approach HierKNEM Experiments Conclusion
7
Related work
Cheetah: R. Graham et al., Cheetah: A Framework for Scalable Hierarchical Collective Operations, CCGRID 2011
Distance-aware framework: T. Ma et al., Process Distance-Aware Adaptive MPI Collective Communications, CLUSTER 2011
[Figure labels: SBGP, BCOL, IB links, NUMA links, intra-socket links]
8
Agenda Introduction Related work Kernel-assisted Approach HierKNEM Experiments Conclusion
9
Status of kernel-assisted, one-sided, single-copy inter-process communication
KNEM (0.9.7) and LIMIC (0.5.5)
XPMEM (Cross-Process Memory Mapping)
CMA (Cross Memory Attach)
10
Development of the kernel-assisted approach in MPI stacks
Intra-node p2p communication: MPICH2-LMT (KNEM), Open MPI (SM/KNEM BTL, vader BTL), MVAPICH2 (LIMIC)
Intra-node collective communication: KNEM Coll. T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres, J. J. Dongarra: Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs. ICPP 2011
Inter- and intra-node collective communication: HierKNEM Coll. T. Ma, G. Bosilca, A. Bouteiller, J. J. Dongarra: HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters, submitted to IPDPS 2012
11
Agenda Introduction Related work Kernel-assisted approach HierKNEM Experiments Conclusion
12
Framework of HierKNEM
[Figure labels: subgroup, intra-node comm., inter-node comm.]
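The two-level grouping can be sketched as follows. This is an illustrative model only: `build_subgroups` is a hypothetical helper, and choosing the lowest rank on each node as its leader is an assumption, not something the slides state.

```python
from collections import defaultdict


def build_subgroups(rank_to_node):
    """Split ranks into per-node (intra-node) groups plus one leader group.

    rank_to_node[i] is the node hosting rank i. The leader of each node
    is taken to be its lowest rank (an assumption for this sketch).
    """
    intra_node = defaultdict(list)
    for rank, node in enumerate(rank_to_node):
        intra_node[node].append(rank)
    # One leader per node; the leaders form the inter-node group.
    leaders = sorted(min(ranks) for ranks in intra_node.values())
    return dict(intra_node), leaders


if __name__ == "__main__":
    # Four ranks placed round-robin on two nodes.
    groups, leaders = build_subgroups([0, 1, 0, 1])
    print(groups)
    print(leaders)
```

Because the groups are derived from the actual rank-to-node placement, the same code yields correct subgroups for any process-core binding.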
13
Broadcast
[Figure labels: inter-node forward, KNEM read, leader processes, non-leader processes]
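A small simulation may help show the two roles in this broadcast. It is a sketch under stated assumptions: leaders forward the message along a simple chain of nodes (the slides do not specify the inter-node algorithm), each non-leader then copies from its local leader (standing in for a KNEM read), and `hier_bcast` is a hypothetical name.

```python
def hier_bcast(rank_to_node, root, message):
    """Simulate a two-level broadcast.

    Returns (buffers, inter_node_sends, local_reads), where buffers maps
    every rank to the message it ends up holding.
    """
    nodes = {}
    for rank, node in enumerate(rank_to_node):
        nodes.setdefault(node, []).append(rank)

    root_node = rank_to_node[root]
    # The root leads its own node; other nodes use their lowest rank (assumption).
    leader = {
        node: (root if node == root_node else min(ranks))
        for node, ranks in nodes.items()
    }

    buffers = {root: message}
    inter_node_sends = 0
    # Inter-node forward: leaders pass the message along a chain of nodes.
    chain = [root_node] + sorted(n for n in nodes if n != root_node)
    for src, dst in zip(chain, chain[1:]):
        buffers[leader[dst]] = buffers[leader[src]]
        inter_node_sends += 1

    local_reads = 0
    # Intra-node step: each non-leader reads from its local leader's buffer.
    for node, ranks in nodes.items():
        for rank in ranks:
            if rank != leader[node]:
                buffers[rank] = buffers[leader[node]]
                local_reads += 1
    return buffers, inter_node_sends, local_reads


if __name__ == "__main__":
    bufs, sends, reads = hier_bcast([0, 0, 1, 1, 2, 2], root=3, message="payload")
    print(sends, reads, sorted(bufs))
```

In this model the network carries one message per remote node, while every other delivery is a local copy.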
14
[Figure labels: SendRecv, KNEM Copy]
Bcast with 64 processes on Dancer's 8 nodes (8 cores/node), 256 KB message size.
15
Reduce
[Figure labels: intra-node comm., inter-node comm., New_Comm., inter-node forward, KNEM read/write]
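The reduce follows the same two levels in the opposite direction. The sketch below assumes an associative and commutative operation (sum by default), so that combining per-node partial results gives the same answer as a flat reduction. `hier_reduce` is a hypothetical name, and the ordering of the inter-node step is an assumption.

```python
def hier_reduce(values, rank_to_node, op=lambda a, b: a + b):
    """Two-level reduction: combine within each node, then across nodes.

    values[i] is the contribution of rank i. Returns (partials, result),
    where partials maps each node to its intra-node partial result.
    """
    partials = {}
    # Intra-node step: fold each rank's value into its node's partial result.
    for rank, node in enumerate(rank_to_node):
        if node in partials:
            partials[node] = op(partials[node], values[rank])
        else:
            partials[node] = values[rank]

    # Inter-node step: combine one partial result per node.
    result = None
    for node in sorted(partials):
        result = partials[node] if result is None else op(result, partials[node])
    return partials, result


if __name__ == "__main__":
    partials, total = hier_reduce([1, 2, 3, 4], [0, 1, 0, 1])
    print(partials, total)
```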
16
Allgather: Topology-aware Ring
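One way to picture a topology-aware ring is to order the ranks so that processes on the same node are neighbours, leaving one outgoing inter-node link per node. The sketch below illustrates that idea only; the function names are hypothetical and the slides do not describe how HierKNEM actually constructs its ring.

```python
def topology_aware_ring(rank_to_node):
    """Order ranks so that ranks sharing a node are adjacent in the ring."""
    return sorted(range(len(rank_to_node)), key=lambda r: (rank_to_node[r], r))


def ring_crossings(ring, rank_to_node):
    """Count ring links whose two endpoints sit on different nodes."""
    size = len(ring)
    return sum(
        rank_to_node[ring[i]] != rank_to_node[ring[(i + 1) % size]]
        for i in range(size)
    )


if __name__ == "__main__":
    by_node = [0, 1, 0, 1]  # ranks dealt round-robin over two nodes
    naive = list(range(len(by_node)))
    aware = topology_aware_ring(by_node)
    print(ring_crossings(naive, by_node), ring_crossings(aware, by_node))
```

For the by-node placement above, the rank-ordered ring crosses nodes on all 4 links, while the reordered ring `[0, 2, 1, 3]` crosses on only 2.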
17
Agenda Introduction Related work Kernel-assisted approach HierKNEM Experiments Conclusion
18
Hardware Environment
Stremi cluster: 32 nodes, 24 AMD cores per node, Gigabit Ethernet
Parapluie cluster: 32 nodes, 24 AMD cores per node, 20G InfiniBand
19
Software Environment
Open MPI 1.5.3, MPICH2-1.4 and MVAPICH2-1.7
KNEM version 0.9.6, LIMIC 0.5.5
IMB-3.2 (cache on)
Unless otherwise noted, the same mapping between cores and processes is always used (--bycore).
20
Broadcast Performance
Figure: Aggregate Broadcast bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes).
[Figure annotations: "More than 30 times!!", "More than twice"]
21
Reduce Performance Figure: Aggregate Reduce bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes).
22
Allgather Performance Figure: Aggregate Allgather bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node).
23
Topology-aware Operations Figure: Impact of process mapping: aggregate Broadcast and Allgather bandwidth of the collective modules for two different process-core bindings: by core and by node (Parapluie cluster, IB20G, 768 processes, 24 cores/node).
24
Core per Node Scalability Figure: Core per node scalability: aggregate bandwidth of Broadcast for 2MB messages on multicore clusters (32 nodes).
25
Conclusion
HierKNEM achieves large speedups by overlapping inter-node and intra-node communication.
HierKNEM is topology-aware: its performance is immune to changes in the underlying process-core binding.
HierKNEM provides a linear speedup as the number of cores per node increases.
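The benefit of overlap can be estimated with a standard two-stage pipeline model. This is a back-of-the-envelope sketch, not a measurement from the talk: the message is assumed to be cut into equal segments, with a fixed inter-node time and a fixed intra-node time per segment, and all names and numbers are illustrative.

```python
def sequential_time(num_segments, inter_time, intra_time):
    """No overlap: each segment is forwarded, then copied locally."""
    return num_segments * (inter_time + intra_time)


def overlapped_time(num_segments, inter_time, intra_time):
    """Two-stage pipeline: the local copy of one segment overlaps the
    inter-node transfer of the next, so the slower stage sets the pace."""
    if num_segments == 0:
        return 0
    slower = max(inter_time, intra_time)
    return inter_time + intra_time + (num_segments - 1) * slower


if __name__ == "__main__":
    # Illustrative values: 4 segments, 2 time units inter-node, 1 intra-node.
    print(sequential_time(4, 2, 1), overlapped_time(4, 2, 1))
```

Under this model the overlapped schedule approaches the cost of the slower stage alone as the number of segments grows.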