Presenter: Surabhi Jain

Presentation transcript:

Framework for scalable intra-node collective operations using shared memory
Presenter: Surabhi Jain
Contributors: Surabhi Jain, Rashid Kaleem, Marc Gamell Balmana, Akhil Langer, Dmitry Durnov, Alexander Sannikov, and Maria Garzaran
Supercomputing 2018, Dallas, USA

Legal Notices & Disclaimers

Acknowledgment: This material is based upon work supported by the U.S. Department of Energy and Argonne National Laboratory and its Leadership Computing Facility under Award Number(s) DE-AC02-06CH11357 and Award Number 8F-30005. This work was generated with financial support from the U.S. Government through said Contract and Award Number(s), and as such the U.S. Government retains a paid-up, nonexclusive, irrevocable, world-wide license to reproduce, prepare derivative works, distribute copies to the public, and display publicly, by or on behalf of the Government, this work in whole or in part, or otherwise use the work for Federal purposes.

Disclaimer: This report/presentation was prepared as an account of work sponsored by an agency and/or National Laboratory of the United States Government. Neither the United States Government nor any agency or National Laboratory thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency or National Laboratory thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency or National Laboratory thereof.

Access to this document is with the understanding that Intel is not engaged in rendering advice or other professional services. Information in this document may be changed or updated without notice by Intel. This document contains copyright information, the terms of which must be observed and followed. Reference herein to any specific commercial product, process or service does not constitute or imply endorsement, recommendation, or favoring by Intel or the US Government. Intel makes no representations whatsoever about this document or the information contained herein. IN NO EVENT SHALL INTEL BE LIABLE TO ANY PARTY FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES FOR ANY USE OF THIS DOCUMENT, INCLUDING, WITHOUT LIMITATION, ANY LOST PROFITS, BUSINESS INTERRUPTION, OR OTHERWISE, EVEN IF INTEL IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. (Slide 2)

Legal Notices & Disclaimers (cont.)

INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to www.intel.com/benchmarks.

Performance results are based on testing as of July 31, 2018 and may not reflect all publicly available security updates. See configuration disclosure for details. No component or product can be absolutely secure.

Intel®, Pentium®, Intel® Xeon®, Intel® Xeon Phi™, Intel® Core™, Intel® VTune™, Intel® Cilk™, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. *Other names and brands may be claimed as the property of others.

Optimization Notice: Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #20110804 (Slide 3)

Motivation
- MPI collectives represent common communication patterns, computations, or synchronization
- Why should we optimize intra-node collectives?
  - They are on the critical path for many collectives (Reduce, Allreduce, Barrier, ...): first perform the intra-node portion, then perform the inter-node portion
  - Important for large multicore nodes and/or small clusters

Contributions
- Propose a framework to optimize intra-node collectives:
  - Based on release/gather building blocks
  - Dedicated shared memory layer
  - Topology-aware intra-node trees
- Implement 3 collectives: MPI_Bcast(), MPI_Reduce(), and MPI_Allreduce()
- Significant speedups with respect to MPICH, MVAPICH, and Open MPI; e.g., for MPI_Allreduce, average speedups of:
  - 3.9x faster than Open MPI
  - 1.2x faster than MVAPICH
  - 2.1x faster than MPICH/ch3, 2.9x faster than MPICH/ch4

Outline
- Background
- Design and Implementation
  - Shared memory layout
  - Release and gather steps
  - Implement collectives using release and gather
  - Optimizations
- Performance Evaluation
- Conclusion

Background – MPI_Allreduce
Current MPI implementations optimize collectives for multiple ranks per node. An MPI_Allreduce proceeds in three phases:
- Intra-node reduce: MPICH and Open MPI use point-to-point; MVAPICH uses dedicated shared memory
- Inter-node allreduce
- Intra-node bcast
[Figure: three animation builds showing MPI_Allreduce over 12 ranks spread across 4 nodes, one build per phase]
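As an aside for readers of the transcript, the three-phase structure above can be written down directly with MPI communicators. The sketch below is illustrative only (the function name, communicator handling, and the choice of MPI_SUM on doubles are assumptions), and it is not the framework presented in this deck, which replaces the two intra-node phases with the shared-memory release/gather mechanism described in the following slides.

```c
/* Minimal sketch of the three-phase allreduce decomposition described above.
 * Names and datatype/op choices are illustrative, not the deck's code. */
#include <mpi.h>

int hierarchical_allreduce(const double *sendbuf, double *recvbuf, int count,
                           MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int node_rank;

    /* Ranks on the same node end up in the same node_comm. */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, 0, MPI_INFO_NULL,
                        &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader per node (node_rank == 0) joins leader_comm. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED, 0, &leader_comm);

    /* Phase 1: intra-node reduce to the node leader. */
    MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE, MPI_SUM, 0, node_comm);

    /* Phase 2: inter-node allreduce among the node leaders. */
    if (leader_comm != MPI_COMM_NULL) {
        MPI_Allreduce(MPI_IN_PLACE, recvbuf, count, MPI_DOUBLE, MPI_SUM,
                      leader_comm);
        MPI_Comm_free(&leader_comm);
    }

    /* Phase 3: intra-node broadcast of the final result. */
    MPI_Bcast(recvbuf, count, MPI_DOUBLE, 0, node_comm);

    MPI_Comm_free(&node_comm);
    return MPI_SUCCESS;
}
```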

Intra-node Broadcast
4 steps:
1. The root copies the data into the shared memory buffer
2. The root sets a flag to let the other ranks know that the data is ready
3. The other ranks copy the data out
4. The other ranks update a flag to indicate to the root that they have copied the data
[Figure: during MPI_Bcast, the root copies in from its user buffer to the shared buffer; non-roots 1 and 2 copy out to their user buffers]
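A minimal sketch of this four-step protocol, assuming the ranks already share one mapped memory segment and using C11 atomics for the flags; the struct layout, flag names, sequence-number convention, and busy-wait loops are illustrative rather than the deck's actual implementation.

```c
/* Sketch of the four-step intra-node broadcast over a shared segment.
 * Assumes len fits in one cell and seq increases by one per broadcast. */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    _Atomic long release_flag;   /* root -> non-roots: data is ready      */
    _Atomic int  gather_count;   /* non-roots -> root: copy-out finished  */
    char         data[1 << 15];  /* 32KB broadcast cell                   */
} shm_bcast_t;

void shm_bcast(shm_bcast_t *shm, void *user_buf, size_t len,
               int is_root, int nranks, long seq)
{
    if (is_root) {
        memcpy(shm->data, user_buf, len);                     /* step 1 */
        atomic_store(&shm->release_flag, seq);                /* step 2 */
        while (atomic_load(&shm->gather_count) < nranks - 1)  /* step 4 */
            ;                                                 /* wait for acks */
        atomic_store(&shm->gather_count, 0);                  /* reuse buffer  */
    } else {
        while (atomic_load(&shm->release_flag) != seq)        /* wait for step 2 */
            ;
        memcpy(user_buf, shm->data, len);                     /* step 3 */
        atomic_fetch_add(&shm->gather_count, 1);              /* step 4 */
    }
}
```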

Intra-node Reduce
4 steps:
1. Each non-root copies its data into the shared memory buffer
2. Each non-root updates a flag to tell the root that its data is ready
3. The root copies the data out of each non-root's buffer (and reduces it)
4. The root updates a flag to tell the non-roots that it has copied the data out
[Figure: during MPI_Reduce, non-roots 1 and 2 copy in from their user buffers to the shared buffer; the root copies the data out]
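A matching sketch for the reduce direction, again assuming a pre-mapped shared segment, C11 atomics, one cell per non-root rank, and MPI_SUM on doubles; sizes and names are illustrative. For simplicity the root accumulates each non-root's contribution directly into its own user buffer, which anticipates an optimization discussed later in this deck.

```c
/* Sketch of the four-step intra-node reduce (sum of doubles).
 * Rank 0 is the root; count is assumed to fit in one cell. */
#include <stdatomic.h>
#include <string.h>

#define MAX_RANKS    64
#define CELL_DOUBLES 4096

typedef struct {
    _Atomic long data_ready[MAX_RANKS];  /* non-root -> root: cell filled    */
    _Atomic long copied_out[MAX_RANKS];  /* root -> non-root: cell reusable  */
    double       cell[MAX_RANKS][CELL_DOUBLES];
} shm_reduce_t;

void shm_reduce(shm_reduce_t *shm, double *user_buf, int count,
                int my_rank, int nranks, long seq)
{
    if (my_rank != 0) {
        memcpy(shm->cell[my_rank], user_buf,
               (size_t)count * sizeof(double));               /* step 1 */
        atomic_store(&shm->data_ready[my_rank], seq);         /* step 2 */
        while (atomic_load(&shm->copied_out[my_rank]) != seq) /* step 4 */
            ;                                                 /* wait for root */
    } else {
        for (int r = 1; r < nranks; r++) {
            while (atomic_load(&shm->data_ready[r]) != seq)   /* wait for step 2 */
                ;
            for (int i = 0; i < count; i++)                   /* step 3: reduce  */
                user_buf[i] += shm->cell[r][i];
            atomic_store(&shm->copied_out[r], seq);           /* step 4: ack     */
        }
    }
}
```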

Design and implementation

Shared Memory Layout
- Bcast buffer: the root copies the data in; the other ranks copy the data out
- Reduce buffer: each rank copies its data in; the root copies the data out and reduces
- Flags: to notify the ranks after copying data into or out of shared memory

Release and Gather Steps
Set-up: arrange the ranks in a tree with rank 0 as the root.
Release: a rank releases its children (top-down step)
- Copy the data (if bcast)
- Inform the children using release flags
Gather: a rank gathers from all its children (bottom-up step)
- Copy the data (if reduce)
- Inform the parent using gather flags
[Figure: the same example tree traversed top-down in the release step and bottom-up in the gather step]
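The sketch below illustrates one release step and one gather step over such a tree, with per-rank flags in shared memory and sequence numbers to distinguish successive collectives. The tree structure, callbacks, and flag convention are assumptions for illustration; note that, as in the optimization described later in this deck, children poll their parent's own release flag rather than having the parent write a separate flag per child.

```c
/* Illustrative release (top-down) and gather (bottom-up) building blocks.
 * Flags live in shared memory; copy/reduce callbacks stand in for the
 * actual data movement, which depends on the collective. */
#include <stdatomic.h>

typedef struct {
    int parent;                    /* -1 at the tree root (rank 0)        */
    int nchildren;
    int children[16];
    _Atomic long *release_flags;   /* one flag per rank, in shared memory */
    _Atomic long *gather_flags;    /* one flag per rank, in shared memory */
} tree_t;

/* Release: wait for the parent, copy data if needed (bcast), then release the
 * children by updating this rank's own release flag, which the children poll. */
void release_step(const tree_t *t, int me, long seq, void (*copy_in)(int rank))
{
    if (t->parent >= 0)
        while (atomic_load(&t->release_flags[t->parent]) < seq)
            ;
    if (copy_in)
        copy_in(me);                           /* data movement only for bcast */
    atomic_store(&t->release_flags[me], seq);
}

/* Gather: wait for every child, copy/reduce its data if needed (reduce),
 * then inform the parent by updating this rank's gather flag. */
void gather_step(const tree_t *t, int me, long seq, void (*reduce_child)(int rank))
{
    for (int c = 0; c < t->nchildren; c++) {
        int child = t->children[c];
        while (atomic_load(&t->gather_flags[child]) < seq)
            ;
        if (reduce_child)
            reduce_child(child);               /* data movement only for reduce */
    }
    atomic_store(&t->gather_flags[me], seq);
}
```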

Bcast and Reduce Using Release and Gather Steps

Collective | Release step | Gather step
MPI_Bcast | Data movement: the root copies data into the shm buffer and informs its children; children copy the data out | Acknowledgment: inform the parent that the buffer is ready for the next bcast
MPI_Reduce | Acknowledgment: inform the children that the buffer is ready for the next reduce | Data movement: all ranks copy data into the shm buffer and inform the parent; the parent reduces the data

Optimizations
- Intra-node topology-aware trees
- Data pipelining
- Read from the parent flag on the release step
- Data copy optimization in reduce

Intra-node Topology-Aware Trees
Ranks are grouped by socket into subtrees rooted at the socket leaders (in the example: 20 ranks across 5 sockets, with socket leaders 0, 4, 8, 12, and 16).
- Socket-leader-first: better for the release step
- Socket-leader-last: better for the gather step
[Figure: socket-leader-first and socket-leader-last trees built from the five per-socket subtrees]
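As a rough illustration of the socket grouping only (not the deck's exact tree shapes, which also control skew and fan-out), the sketch below builds a simple "leader-first" parent array under the assumption of an even, contiguous rank-to-socket mapping: each rank's parent is its socket leader, and the leaders hang directly off rank 0.

```c
/* Illustrative socket-leader-first tree: members under their socket leader,
 * leaders under rank 0. Assumes nranks is a multiple of nsockets and ranks
 * are laid out contiguously per socket (e.g., 20 ranks over 5 sockets). */
void build_leader_first_tree(int *parent, int nranks, int nsockets)
{
    int per_socket = nranks / nsockets;
    for (int r = 0; r < nranks; r++) {
        int leader = (r / per_socket) * per_socket; /* lowest rank on r's socket */
        if (r != leader)
            parent[r] = leader;            /* socket member: child of its leader */
        else
            parent[r] = (r == 0) ? -1 : 0; /* leader: child of rank 0 (the root) */
    }
}
```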

Other Tree Variants
- Right-skewed vs. left-skewed trees
- K-ary vs. k-nomial trees
- Topology-unaware trees
[Figure: example left-skewed and right-skewed trees over ranks 1-7]

Data Pipelining
Split a large message into multiple chunks; the bcast buffer is split into cells (3 in the example):
- Bcast: the root copies the next chunk of data into the next cell while the non-roots copy out from previous cells
- Reduce: the non-roots copy into the next cells while the root reduces the data from previous cells
Also useful for back-to-back collectives.
[Figure: bcast buffer split into 3 cells (Cell 0, Cell 1, Cell 2) shared by the root and non-roots 1-3]
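A root-side sketch of this pipelining for broadcast, assuming 3 cells, C11 atomic flags, and a simple convention where a cell becomes free once every non-root has incremented its drain counter; cell sizes, tags, and the omitted non-root side (copy the chunk out, then increment the counter) are illustrative.

```c
/* Illustrative pipelined copy-in at the root: chunk i goes into cell i % NCELLS
 * as soon as all non-roots have drained that cell's previous chunk. */
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define NCELLS    3
#define CELL_SIZE (32768 / NCELLS)

typedef struct {
    _Atomic long ready[NCELLS];   /* root -> non-roots: tag of chunk in cell */
    _Atomic int  drained[NCELLS]; /* non-roots -> root: how many copied out  */
    char         cell[NCELLS][CELL_SIZE];
} pipe_buf_t;

void pipelined_bcast_root(pipe_buf_t *b, const char *src, size_t len,
                          int nranks, long tag_base)
{
    for (int c = 0; c < NCELLS; c++)              /* simplification: all cells */
        atomic_store(&b->drained[c], nranks - 1); /* start out free            */

    long chunk = 0;
    for (size_t off = 0; off < len; off += CELL_SIZE, chunk++) {
        int c = (int)(chunk % NCELLS);
        while (atomic_load(&b->drained[c]) < nranks - 1)  /* wait until free */
            ;
        atomic_store(&b->drained[c], 0);
        size_t n = (len - off < CELL_SIZE) ? len - off : CELL_SIZE;
        memcpy(b->cell[c], src + off, n);
        /* A monotonic tag tells non-roots which chunk now occupies cell c. */
        atomic_store(&b->ready[c], tag_base + chunk + 1);
    }
}
```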

Other Optimizations
- Read from the parent flag on the release step: the parent updates its own flag instead of writing a flag for each child
- Data copy optimization in reduce: the root reduces the data directly into its user buffer instead of reducing in the shm buffer and then copying to the user buffer

performance evaluation

Experimental Setup
System configuration:
- Skylake (SKL): Intel® Xeon® Gold 6138F CPU (2.0 GHz, 2 sockets, 20 cores/socket, 2 threads/core); 32KB L1 data and instruction caches, 1MB L2 cache, 27.5MB L3 cache
- OmniPath-1 fabric interconnect
Software configuration:
- gcc compiler version 8.1.0
- SUSE Linux Enterprise Server 12 SP3 running Linux version 4.4.132-94.33-default
- Libfabric (commit id 91669aa), opa-psm2 (commit id 0f9213e)
- MPICH (commit id d815dd4) used as the baseline for our implementation (MPICH/ch3, MPICH/ch4)
- Open MPI (version 3.0.0) and MVAPICH (version 2-2.3rc1)
Benchmark:
- Intel MPI Benchmarks (IMB) (version 2018 Update 1); reported T-max used for comparison

MPI_Bcast: Single node, 40 MPI ranks (1 rank per core)
(Lower is better)
- 32KB buffer split into 4 cells
- Flat tree used to propagate flags
- Compared against tuned Open MPI, MVAPICH, MPICH/ch3, and MPICH/ch4
Average speedups:
- 3.9x faster than Open MPI
- 1.2x faster than MVAPICH
- 2.1x faster than MPICH/ch3
- 2.9x faster than MPICH/ch4
Configuration: Intel Xeon Gold 6138F CPU, 40 cores, 2 threads/core, 2.0 GHz frequency, 32KB L1, 1MB L2, 27.5MB L3 cache; gcc compiler version 8.1.0; SUSE Linux Enterprise Server 12 SP3. IMB benchmarks "-iter 5000 -msglog 22 -sync 1 -imb_barrier 1 -root_shift 0", Tmax. *See performance-related disclaimers on slide 3

MPI_Allreduce: Single node, 40 MPI ranks (1 rank per core)
(Lower is better)
- 32KB buffers split into 4 cells
Tree configuration:
- Reduce: socket-leaders-last and right-skewed
  - Msg size < 512B: topology-unaware k-nomial tree, K=4
  - 512B <= msg size < 8KB: topology-aware k-ary tree, K=3
  - Msg size >= 8KB: topology-aware k-ary tree, K=2
- Bcast: flat tree
Configuration: Intel Xeon Gold 6138F CPU, 40 cores, 2 threads/core, 2.0 GHz frequency, 32KB L1, 1MB L2, 27.5MB L3 cache; gcc compiler version 8.1.0; SUSE Linux Enterprise Server 12 SP3. IMB benchmarks "-iter 5000 -msglog 22 -sync 1 -imb_barrier 1 -root_shift 0", Tmax. *See performance-related disclaimers on slide 3

Impact of Topology-Aware Trees
(Lower is better)
MPI_Reduce, 40 MPI ranks, 1 rank per core
- Topology-aware tree: socket-leaders-last and right-skewed
  - Msg size <= 4KB: k-ary tree, K=3
  - Msg size > 4KB: k-ary tree, K=2
- Topology-unaware trees:
  - Msg size <= 16KB: k-nomial tree, K=8
  - Msg size > 16KB: k-nomial tree, K=2
Configuration: Intel Xeon Gold 6138F CPU, 40 cores, 2 threads/core, 2.0 GHz frequency, 32KB L1, 1MB L2, 27.5MB L3 cache; gcc compiler version 8.1.0; SUSE Linux Enterprise Server 12 SP3. IMB benchmarks "-iter 5000 -msglog 22 -sync 1 -imb_barrier 1 -root_shift 0", Tmax. *See performance-related disclaimers on slide 3

Multiple-node runs (32 nodes, 40 ranks per node)
(Lower is better)
We compare only to MPICH/ch3 and MPICH/ch4 so that the inter-node collectives implementation stays the same.
Configuration: Intel Xeon Gold 6138F CPU, 40 cores, 2 threads/core, 2.0 GHz frequency, 32KB L1, 1MB L2, 27.5MB L3 cache; gcc compiler version 8.1.0; SUSE Linux Enterprise Server 12 SP3. IMB benchmarks "-iter 5000 -msglog 22 -sync 1 -imb_barrier 1 -root_shift 0", Tmax. *See performance-related disclaimers on slide 3

Why Are We Better?
[Table: Open MPI, MVAPICH, MPICH (ch3, ch4), and our framework compared on three features: network topology awareness, dedicated shared memory, and node topology awareness]

Conclusions
- Implemented MPI_Bcast, MPI_Reduce, and MPI_Allreduce using release and gather building blocks
- Significantly outperform MVAPICH, Open MPI, and MPICH
- Careful design of the trees used to propagate data and flags provides improvements of up to 1.8x over naïve trees
- Compared to MPICH, speedups of up to 2.18x for MPI_Allreduce and up to 2.5x for MPI_Bcast on a 32-node cluster

Questions? surabhi.jain@intel.com
Check out the MPICH BoF today! @C145, 5:15pm