SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO Performance of Applications Using Dual-Rail InfiniBand 3D Torus Network on the Gordon Supercomputer


SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Performance of Applications Using Dual-Rail InfiniBand 3D Torus Network on the Gordon Supercomputer
Dongju Choi, Glenn Lockwood, Robert Sinkovits, Mahidhar Tatineni
San Diego Supercomputer Center, University of California, San Diego

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO Background
- SDSC's data-intensive supercomputer Gordon: 1,024 dual-socket Intel Sandy Bridge nodes, each with 64 GB DDR3-1333 memory
- 16 cores per node and 16 nodes (256 cores) per switch
- Large I/O nodes and local/global SSD storage
- Dual-rail QDR InfiniBand network supports I/O and compute communication separately; the second rail can also be scheduled for computation
- We are interested in how switch-to-switch communication oversubscription and switch/node topology affect application performance

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO Gordon System Architecture
[Figures: 3-D torus of switches on Gordon; subrack-level network architecture on Gordon]

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO MVAPICH2 MPI Implementation
- MVAPICH2 versions 1.9 and 2.0 are currently available on the Gordon system
- Full control of dual-rail usage at the task level via user-settable environment variables:
  MV2_NUM_HCAS=2, MV2_IBA_HCA=mlx4_0:mlx4_1 - use both HCAs (rails)
  MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=8000 - message size above which traffic is striped across the rails; can be set as low as 8 KB
  MV2_SM_SCHEDULING=ROUND_ROBIN - explicitly distribute tasks over the rails
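For illustration, the short C program below (not part of the original slides) prints the rail-related MVAPICH2 environment variables listed above as seen by rank 0, which can help confirm that a job script actually exported the intended dual-rail settings. Only the variable names come from this slide; the program itself is an assumed sketch.

```c
/* Hedged sketch: report the MVAPICH2 rail-related environment variables
 * visible to the MPI job, so dual-rail settings can be verified at run time. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const char *vars[] = {
        "MV2_NUM_HCAS",
        "MV2_IBA_HCA",
        "MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD",
        "MV2_SM_SCHEDULING"
    };

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < 4; i++) {
            const char *val = getenv(vars[i]);
            printf("%-40s = %s\n", vars[i], val ? val : "(unset)");
        }
    }

    MPI_Finalize();
    return 0;
}
```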

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO OSU Micro-Benchmarks
- Compare the performance of single- and dual-rail QDR InfiniBand vs. FDR InfiniBand: evaluate the impact of rail sharing, scheduling, and threshold parameters
- Bandwidth tests
- Latency tests
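As a rough illustration of what the bandwidth and latency tests measure, here is a simplified two-rank ping-pong in C. It is not the OSU benchmark code; the message sizes, iteration count, and reporting format are arbitrary choices for the sketch.

```c
/* Simplified ping-pong in the spirit of the OSU latency/bandwidth tests.
 * Rank 0 exchanges messages of increasing size with rank 1; the round-trip
 * time gives a one-way latency estimate and an effective bandwidth. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_SIZE (4 * 1024 * 1024)  /* assumed largest message, 4 MB */
#define ITERS    100                /* assumed iterations per size   */

int main(int argc, char **argv)
{
    int rank, size;
    char *buf = malloc(MAX_SIZE);
    memset(buf, 0, MAX_SIZE);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    for (int msg = 1; msg <= MAX_SIZE; msg *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, msg, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, msg, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double dt = MPI_Wtime() - t0;
        if (rank == 0) {
            double lat_us = dt / (2.0 * ITERS) * 1e6;            /* one-way latency */
            double bw_mbs = (double)msg * ITERS * 2 / dt / 1e6;  /* MB/s            */
            printf("%8d bytes  %10.2f us  %10.1f MB/s\n", msg, lat_us, bw_mbs);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```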

OSU Bandwidth Test Results for Single-Rail QDR, FDR, and Dual-Rail QDR Network Configurations
- Single-rail FDR performance is much better than single-rail QDR for message sizes larger than 4K bytes
- Dual-rail QDR performance exceeds FDR performance at sizes greater than 32K bytes
- FDR shows better performance between 4K and 32K byte sizes because of the default rail-sharing threshold

OSU Bandwidth Test Performance with MV2_RAIL_SHARING_LARGE_MSG_THRESHOLD=8K
- Lowering the rail-sharing threshold closes the gap between dual-rail QDR and FDR performance down to 8K-byte messages

OSU Bandwidth Test Performance with MV2_SM_SCHEDULING=ROUND_ROBIN
- Explicit round-robin scheduling makes tasks communicate over the different rails

OSU Latency Benchmark Results for QDR, Dual-Rail QDR with MVAPICH2 Defaults, and FDR
- There is no latency penalty at small message sizes (expected, as only one rail is active below the striping threshold)
- Above the striping threshold a minor increase in latency is observed, but the performance is still better than single-rail FDR

OSU Latency Benchmark Results for QDR, Dual-Rail QDR with the Round-Robin Option, and FDR
- Distributing messages across HCAs using the round-robin option increases the latency at small message sizes
- Even so, the latency results are better than the FDR case

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO Application Performance Benchmarks
Applications:
- P3DFFT Benchmark
- LAMMPS Water Box Benchmark
- AMBER Cellulose Benchmark
Test Configuration:
- Single rail vs. dual rail
- Multiple-switch runs with Maximum Hops=1 or no hop limit for the 512-core runs (2 switches are involved)

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO P3DFFT Benchmark
- Parallel Three-Dimensional Fast Fourier Transforms
- Used for studies of turbulence, climatology, astrophysics, and materials science
- Depends strongly on the available bandwidth, as the main communication component is driven by transposes of large arrays (alltoallv); a simplified sketch of this exchange pattern follows
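To make the communication pattern concrete, the sketch below issues a single MPI_Alltoallv exchange, the collective that dominates P3DFFT's transposes. The block size and payload are placeholder values; the real transpose uses counts and displacements derived from the pencil decomposition, not this uniform pattern.

```c
/* Minimal MPI_Alltoallv sketch: every rank sends one block of doubles to
 * every other rank, illustrating the all-to-all personalized exchange
 * behind a distributed array transpose. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int block = 1024;   /* doubles sent to each peer (assumed size) */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double *sendbuf = malloc((size_t)block * nprocs * sizeof(double));
    double *recvbuf = malloc((size_t)block * nprocs * sizeof(double));
    int *sendcounts = malloc(nprocs * sizeof(int));
    int *recvcounts = malloc(nprocs * sizeof(int));
    int *sdispls = malloc(nprocs * sizeof(int));
    int *rdispls = malloc(nprocs * sizeof(int));

    for (int i = 0; i < nprocs; i++) {
        sendcounts[i] = recvcounts[i] = block;
        sdispls[i] = rdispls[i] = i * block;
        for (int j = 0; j < block; j++)
            sendbuf[i * block + j] = rank;   /* payload tagged by sender */
    }

    double t0 = MPI_Wtime();
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_DOUBLE,
                  recvbuf, recvcounts, rdispls, MPI_DOUBLE,
                  MPI_COMM_WORLD);
    double dt = MPI_Wtime() - t0;

    if (rank == 0)
        printf("alltoallv of %d doubles per pair took %.6f s\n", block, dt);

    free(sendbuf); free(recvbuf);
    free(sendcounts); free(recvcounts); free(sdispls); free(rdispls);
    MPI_Finalize();
    return 0;
}
```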

Simulation Results for P3DFFT Benchmark with 256 Cores, QDR and Dual-Rail QDR
[Table: Run #, QDR wallclock time (s), dual-rail QDR wallclock time (s)]
- Dual-rail runs are consistently faster than the single-rail runs, with an average performance gain of 23%

Communication and Compute Time Breakdown for 256-Core, Single/Dual-Rail QDR P3DFFT Runs
[Tables: single-rail runs and dual-rail runs; columns: Run #, total time, comm. time, compute time]
- The compute part is nearly identical in both sets of runs
- The performance improvement is almost entirely in the communication part of the code
- Shows that dual rail boosts the alltoallv performance and consequently speeds up the overall calculation
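The kind of communication/compute breakdown reported in these tables can be collected with simple MPI_Wtime instrumentation around the communication calls, as in the hedged sketch below; compute_step() and exchange() are placeholders, not P3DFFT routines.

```c
/* Sketch of a timing breakdown: accumulate time spent in communication
 * separately from the surrounding compute, then take the per-rank maxima. */
#include <mpi.h>
#include <stdio.h>

static void compute_step(void) { /* placeholder local work */ }
static void exchange(void)      { MPI_Barrier(MPI_COMM_WORLD); /* placeholder comm */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double comm = 0.0, t;
    double t_start = MPI_Wtime();

    for (int step = 0; step < 100; step++) {
        compute_step();

        t = MPI_Wtime();
        exchange();                 /* e.g. the alltoallv transpose */
        comm += MPI_Wtime() - t;
    }
    double total = MPI_Wtime() - t_start;

    double comm_max, total_max;
    MPI_Reduce(&comm,  &comm_max,  1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
    MPI_Reduce(&total, &total_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("total %.3f s, comm %.3f s, compute %.3f s\n",
               total_max, comm_max, total_max - comm_max);

    MPI_Finalize();
    return 0;
}
```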

Communication and Compute Time Breakdown for 512-Core, Single/Dual QDR Rail P3DFFT Runs, Maximum Switch Hops=1
[Tables: single-rail runs and dual-rail runs; columns: Run #, total time, comm. time, compute time]
- Shows similar dual-rail benefits
- Runs span fewer switch-to-switch links, reducing the likelihood of oversubscription due to other jobs
- However, the smaller number of switch connections can also increase the likelihood of oversubscription by the job itself

P3DFFT Benchmark with 512 Cores, Single-Rail QDR, No Switch Hop Restriction
[Table: Run #, total time, comm. time, compute time]
- Oversubscription is mitigated by the topology of the run, and the performance is nearly 15% better than the single-hop case
- However, as seen from the results, a different topology may also lead to lower performance if the distribution is not optimal (whether from oversubscription by the job itself or by other jobs)

P3DFFT Benchmark with 512 Cores, Single-Rail QDR, No Switch Hop Restriction (continued)
- Spreading the computation over several switches lowers the bandwidth requirements on any given set of switch-to-switch links
- This is bad for latency-bound codes (given the extra switch hops) but can benefit bandwidth-sensitive codes, depending on the topology of the run
- Nukada et al. utilize dynamic link selection to minimize congestion and obtain better performance in the dual-rail case
Reference: Nukada, A., Sato, K., and Matsuoka, S. Scalable multi-GPU 3-D FFT for TSUBAME 2.0 supercomputer. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12), IEEE Computer Society Press, Los Alamitos, CA, USA, Article 44, 10 pages.

Communication and Compute Time Breakdown for 1024-Core P3DFFT Runs
[Tables: single-rail runs and dual-rail runs; columns: Run #, total time, comm. time, compute time]
- No switch hop restrictions are placed on the runs
- The communication time is greatly improved in the dual-rail cases, while the compute fraction is nearly identical in all the runs

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO LAMMPS Water Box Benchmark
- The Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) is a widely used classical molecular dynamics code
- The input contains 12,000 water molecules (36,000 atoms)
- The simulation is run for 20 picoseconds

LAMMPS Water Box Benchmark with Single/Dual-Rail QDR and 256 Cores
[Table: Run #, QDR wallclock time (s), dual-rail QDR wallclock time (s)]
- Dual-rail runs show better performance than the single-rail runs and mitigate the communication overhead, with an average improvement of 32% in wallclock time

LAMMPS Water Box Benchmark with Single/Dual-Rail QDR and 512 Cores
[Table: Run #, single-rail QDR with MAX_HOP=1 wallclock time (s), single-rail QDR with no MAX_HOP limit wallclock time (s), dual-rail QDR wallclock time (s)]
- The application is not scaling due to the larger communication overhead (a consequence of the finer level of domain decomposition)
- The LAMMPS benchmark is very sensitive to topology and shows large variations if the maximum number of switch hops is not restricted

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO AMBER Cellulose Benchmark
- Amber is a package of programs for molecular dynamics simulations of proteins and nucleic acids
- The Cellulose test case with 408,609 atoms is used for the benchmarks

Amber Cellulose Benchmark with Single/Dual-Rail QDR and 256 Cores
[Table: Run #, single-rail QDR wallclock time (s), dual-rail QDR wallclock time (s)]
- Communication overhead is low and the dual-rail benefit is minor (<3%)

Amber Cellulose Benchmark with Single/Dual-Rail QDR, 512 Cores
[Table: Run #, single-rail QDR with MAX_HOP=1 wallclock time (s), single-rail QDR with no MAX_HOP limit wallclock time (s), dual-rail QDR wallclock time (s)]
- The difference between the two single-rail QDR configurations is modest (<5%)
- Communication overhead increases with increased core count, leading to the drop-off in scaling; this can be mitigated with dual-rail QDR
- Dual-rail QDR performance is better by 17%

Amber Cellulose Benchmark with Single/Dual-Rail QDR, 512 Cores (continued)
[Table: Run #, single-rail QDR with MAX_HOP=1 wallclock time (s), single-rail QDR with no MAX_HOP limit wallclock time (s), dual-rail QDR wallclock time (s)]
- Dual rail enables the benchmark to scale to a higher core count
- Shows sensitivity to the topology due to the larger number of switch hops and possible contention from other jobs

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO Summary
- The aggregate bandwidth obtained with dual-rail QDR exceeds the FDR performance
- The applications show performance benefits from dual-rail QDR configurations
- Gordon's 3-D torus of switches leads to variability in performance due to oversubscription/topology considerations
- The switch topology of a run can be chosen to mitigate the link oversubscription bottleneck

SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO Summary
- The performance improvement also varies with the degree of communication overhead: benchmark cases with larger communication fractions (relative to overall run time) show more improvement with dual-rail QDR configurations
- Compute time scaled with the core count in both single- and dual-rail configurations for the benchmarked applications LAMMPS and Amber

Acknowledgements: This work was supported by NSF grant OCI # ("Gordon: A Data Intensive Supercomputer").