HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters. Teng Ma, George Bosilca, Aurelien Bouteiller, Jack J. Dongarra. Dec 2, Lunch Talk
Agenda: Introduction, Related work, Kernel-assisted approach, HierKNEM, Experiments, Conclusion
Introduction: Multi-core clusters introduce hardware hierarchies. Message passing is still the dominant programming model. Programming libraries want to handle these hierarchies internally. Collective communication is critical to application performance.
Problem: Tuned Collective. It cannot see the boundaries introduced by the hierarchies of multi-core clusters. It builds its logical topology without runtime hardware topology information.
Topology-Unaware: Mismatch problem*. Figure: Open MPI Tuned Allgather ring algorithm under different process-core binding cases (--bycore and --bynode), with processes P0 to P3 placed on Core0 to Core3 of Node 0 and Node 1. Figure labels: # of nodes, # of cores. * T. Ma, T. Herault, G. Bosilca and J. J. Dongarra, Process Distance-aware Adaptive MPI Collective Communications, Cluster 2011
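The mismatch can be illustrated with a small counting exercise. The sketch below is not Open MPI code: it assumes a ring in which rank r sends to rank r+1 (mod n), and models the two bindings as "fill one node first" (--bycore) and "deal ranks round-robin across nodes" (--bynode). The function names `rank_to_node` and `inter_node_links` are hypothetical helpers introduced only for this illustration.

```python
def rank_to_node(num_nodes, cores_per_node, binding):
    """Map each rank to a node index under a simplified binding model (assumption)."""
    total = num_nodes * cores_per_node
    if binding == "bycore":
        # Fill every core of node 0 first, then node 1, and so on.
        return [rank // cores_per_node for rank in range(total)]
    if binding == "bynode":
        # Deal ranks round-robin across the nodes.
        return [rank % num_nodes for rank in range(total)]
    raise ValueError("binding must be 'bycore' or 'bynode'")


def inter_node_links(node_of):
    """Count ring edges rank -> rank+1 (mod n) whose endpoints sit on different nodes."""
    n = len(node_of)
    return sum(1 for r in range(n) if node_of[r] != node_of[(r + 1) % n])


if __name__ == "__main__":
    # Four processes P0..P3 on two nodes with two cores each.
    for binding in ("bycore", "bynode"):
        node_of = rank_to_node(2, 2, binding)
        print(binding, node_of, inter_node_links(node_of))
    # bycore -> 2 inter-node links, bynode -> 4 inter-node links
```

Under this model a rank-ordered ring crosses the network once per node with --bycore, but once per process with --bynode, even though the algorithm itself is unchanged.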
Related work. Cheetah: R. Graham et al., Cheetah: A Framework for Scalable Hierarchical Collective Operations, CCGRID 2011. Distance-aware framework: T. Ma et al., Process Distance-Aware Adaptive MPI Collective Communications, CLUSTER 2011. Diagram labels: SBGP, BCOL, IB links, NUMA links, Intra-socket links.
Status of kernel-assisted one-sided single-copy inter-process communication: KNEM (0.9.7) and LIMIC (0.5.5); XPMEM (Cross-Process Memory Mapping); CMA (Cross Memory Attach).
Development of the kernel-assisted approach in MPI stacks. Intra-node p2p comm.: MPICH2-LMT (KNEM), Open MPI (SM/KNEM BTL, vader BTL), MVAPICH2 (LIMIC). Intra-node collective comm.: KNEM Coll. T. Ma, G. Bosilca, A. Bouteiller, B. Goglin, J. Squyres, J. J. Dongarra: Kernel Assisted Collective Intra-node MPI Communication among Multi-Core and Many-Core CPUs, ICPP 2011. Inter- and intra-node collective comm.: HierKNEM Coll. T. Ma, G. Bosilca, A. Bouteiller, J. J. Dongarra: HierKNEM: An Adaptive Framework for Kernel-Assisted and Topology-Aware Collective Communications on Many-core Clusters, submitted to IPDPS 2012.
Framework of HierKNEM. Diagram labels: Subgroup, Intra-node Comm., Inter-node Comm.
Broadcast. Diagram labels: Inter-node forward, KNEM read, Leader processes, Non-Leader processes.
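The leader and non-leader roles can be sketched as a small sequential simulation. This is only an illustration under stated assumptions, not the HierKNEM implementation: the leader of each node is taken to be the lowest rank on that node (the root leads its own node), leaders forward the message to each other along a simple chain, the message is cut into fixed-size chunks, and a plain memory copy stands in for a KNEM read. The name `hier_bcast` and all of its parameters are hypothetical.

```python
def hier_bcast(node_of, root, message, chunk_size):
    """Simulate a leader-based broadcast; returns per-rank buffers and an event log."""
    n = len(node_of)
    buffers = [bytearray(len(message)) for _ in range(n)]
    buffers[root][:] = message

    # Assumption: the leader is the lowest rank on each node, except that
    # the root acts as leader of its own node.
    leaders = {}
    for rank, node in enumerate(node_of):
        leaders.setdefault(node, rank)
    leaders[node_of[root]] = root

    # Assumption: leaders forward along a chain that starts at the root.
    chain = [root] + [ldr for _, ldr in sorted(leaders.items()) if ldr != root]

    log = []
    for offset in range(0, len(message), chunk_size):
        end = min(offset + chunk_size, len(message))
        # Inter-node forward: each leader passes the chunk to the next leader.
        for src, dst in zip(chain, chain[1:]):
            buffers[dst][offset:end] = buffers[src][offset:end]
            log.append(("forward", src, dst, offset))
        # Intra-node step: non-leaders copy the chunk from their leader's buffer
        # (a stand-in for a KNEM read). In a real run this copy could proceed
        # while the leader forwards the next chunk; here the steps are serialized.
        for rank, node in enumerate(node_of):
            leader = leaders[node]
            if rank != leader:
                buffers[rank][offset:end] = buffers[leader][offset:end]
                log.append(("read", leader, rank, offset))
    return buffers, log


if __name__ == "__main__":
    node_of = [0, 0, 1, 1, 2, 2]  # six ranks on three nodes
    buffers, log = hier_bcast(node_of, root=0, message=bytes(range(10)), chunk_size=4)
    print(all(bytes(b) == bytes(range(10)) for b in buffers))
```

Only leaders take part in the inter-node step, so the number of network transfers per chunk depends on the number of nodes rather than on the number of processes.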
Figure legend: SendRecv, KNEM Copy. Bcast with 64 processes on Dancer’s 8 nodes (8 cores/node), 256 KB message size.
Reduce. Diagram labels: Intra-node Comm., Inter-node Comm., New_Comm., Inter-node forward, KNEM read/write.
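A two-level reduction in the spirit of this slide can be sketched as follows. The code is a sequential illustration under assumptions, not the HierKNEM algorithm: contributions are first combined within each node (where the real framework would use KNEM read/write), and the per-node partial results are then combined across nodes. The operation is assumed to be associative and commutative, and `hier_reduce` is a hypothetical name.

```python
import operator


def hier_reduce(node_of, values, root, op=operator.add):
    """Reduce per-rank values in two levels; returns (result at root, per-node partials)."""
    # Group ranks by the node they run on.
    groups = {}
    for rank, node in enumerate(node_of):
        groups.setdefault(node, []).append(rank)

    # Intra-node step: combine the contributions of all ranks on each node.
    partial = {}
    for node, ranks in groups.items():
        acc = values[ranks[0]]
        for rank in ranks[1:]:
            acc = op(acc, values[rank])
        partial[node] = acc

    # Inter-node step: combine the per-node partial results at the root's node.
    root_node = node_of[root]
    result = partial[root_node]
    for node in sorted(partial):
        if node != root_node:
            result = op(result, partial[node])
    return result, partial


if __name__ == "__main__":
    node_of = [0, 0, 0, 0, 1, 1, 1, 1]  # eight ranks on two nodes
    result, partial = hier_reduce(node_of, list(range(8)), root=0)
    print(result, partial)  # 28 {0: 6, 1: 22}
```

Only one partial result per node crosses the network, which is the property the hierarchical layout is meant to provide.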
Allgather: Topology-aware Ring
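One way to picture a topology-aware ring is to order the ring by node rather than by rank, so that processes sharing a node are neighbours. The sketch below is an assumption about how such a ring could be built, not a description of HierKNEM's internals; `topology_aware_ring`, `crossings` and `ring_allgather` are hypothetical names, and the allgather is a sequential simulation in which each rank forwards one block per step to its ring successor.

```python
def topology_aware_ring(node_of):
    """Order ranks so that ranks on the same node are consecutive in the ring."""
    return sorted(range(len(node_of)), key=lambda rank: (node_of[rank], rank))


def crossings(ring, node_of):
    """Count ring edges whose two endpoints are on different nodes."""
    n = len(ring)
    return sum(1 for i in range(n) if node_of[ring[i]] != node_of[ring[(i + 1) % n]])


def ring_allgather(ring, blocks):
    """Simulate a ring allgather; returns, per rank, a dict of origin rank -> block."""
    n = len(ring)
    have = {rank: {rank: blocks[rank]} for rank in ring}
    for step in range(n - 1):
        for i in range(n):
            src = ring[i]
            dst = ring[(i + 1) % n]
            # At step s, position i forwards the block that originated s
            # positions behind it (its own block at step 0).
            origin = ring[(i - step) % n]
            have[dst][origin] = have[src][origin]
    return have


if __name__ == "__main__":
    node_of = [0, 1, 0, 1]  # a --bynode style placement on two nodes
    ring = topology_aware_ring(node_of)
    print(ring, crossings(ring, node_of))               # [0, 2, 1, 3] 2
    print(crossings(list(range(4)), node_of))           # 4 for the rank-ordered ring
    gathered = ring_allgather(ring, ["a", "b", "c", "d"])
    print(all(len(g) == 4 for g in gathered.values()))  # True
```

With the node-ordered ring the number of inter-node edges equals the number of nodes regardless of how ranks were bound to cores.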
Hardware Environment. Stremi cluster: 32 nodes; node: 24-core AMD; Gigabit Ethernet. Parapluie cluster: 32 nodes; node: 24-core AMD; 20G InfiniBand.
Software Environment. Open MPI 1.5.3, MPICH2-1.4 and MVAPICH2-1.7. KNEM version 0.9.6, LIMIC. IMB-3.2 (cache on). Unless otherwise noted, the same mapping between cores and processes (--bycore) is always used.
Broadcast Performance. Figure: Aggregate Broadcast bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes). Figure callouts: "More than 30 times!!" and "More than twice".
Reduce Performance Figure: Aggregate Reduce bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node, 32 nodes).
Allgather Performance Figure: Aggregate Allgather bandwidth of collective modules on multicore clusters (768 processes, 24 cores/node).
Topology-aware Operations Figure: Impact of process mapping: aggregate Broadcast and Allgather bandwidth of the collective modules for two different process-core bindings: by core and by node (Parapluie cluster, IB20G, 768 processes, 24 cores/node).
Core per Node Scalability Figure: Core per node scalability: aggregate bandwidth of Broadcast for 2MB messages on multicore clusters (32 nodes).
Conclusion. HierKNEM achieves a large speedup by overlapping inter- and intra-node communication. HierKNEM is topology-aware: it is immune to changes in the underlying process-core binding. HierKNEM provides a linear speedup as the number of cores per node increases.