Proprietary or Commodity? Interconnect Performance in Large-scale Supercomputers
Author: Olli-Pekka Lehto
Supervisor: Prof. Jorma Virtamo
Instructor: D.Sc. (Tech.) Jussi Heikonen

Contents
- Introduction
- Background
- Objectives
- Test platforms
- Testing methodology
- Results
- Conclusions

Introduction
- A modern High Performance Computing (HPC) system consists of:
  - Login nodes, service nodes, I/O nodes and compute nodes (usually multicore/SMP)
  - Interconnect network(s) linking all nodes together
- Applications are run in parallel
  - The individual tasks of a parallel application exchange data via the interconnect
  - The interconnect plays a critical role in the overall performance of the parallel application
- Commodity system
  - Uses off-the-shelf components
  - Leverages the economies of scale of the PC industry
  - "Industry standard" architectures allow the system to be extended in a vendor-independent fashion
- Proprietary system
  - A highly integrated system designed specifically for HPC
  - May contain some off-the-shelf components, but cannot be extended in a vendor-independent fashion

Background: The Cluster Revolution
- In the last decade, clusters have rapidly become the system architecture of choice in HPC.
- The high end of the market is still dominated by proprietary MPP systems.
Source:

Background: The Architecture Evolution
- The architectures of proprietary and commodity HPC systems have been converging; nowadays it is difficult to differentiate between the two.
- The interconnect network is a key differentiating factor between commodity and proprietary architectures.
- Increasing R&D costs drive the move towards commodity components:
  - Competition between AMD and Intel
  - Competition in specialized cluster interconnects
- Disclaimer: IBM Blue Gene is a notable exception

Problem Statement
Does a supercomputer architecture with a proprietary interconnection network offer a significant performance advantage over a similarly sized cluster with a commodity network?

Test Platforms
- In 2007 CSC - Scientific Computing Ltd. conducted a 10 M€ procurement to acquire new HPC systems
- The procurement was split into two parts:
  - Lot 1: Capability computing
    - Massively parallel "Grand Challenge" computation
  - Lot 2: Capacity computing
    - Sequential and small to medium-sized parallel problems
    - Problems requiring large amounts of memory

Test Platform 1: Louhi
- Winner of Lot 1: "capability computing"
- Cray XT4
  - Phase 1 (2007): GHz AMD Opteron cores (dual-core): 10.5 TFlop/s
  - Phase 2 (2008): GHz AMD Opteron cores (quad-core*): 70.1 TFlop/s
- UNICOS/lc operating system
  - Linux on the login and service nodes
  - Catamount microkernel on the compute nodes
- Proprietary "SeaStar2" interconnection network
  - 3-dimensional torus topology
  - Each node is connected to 6 neighbors with 7.6 GByte/s links
  - Each node has a router integrated into the NIC
  - The NIC is connected directly to the AMD HyperTransport bus
  - The NIC has an onboard CPU for protocol processing ("protocol offloading")
  - Remote Direct Memory Access (RDMA)
* 4 Flops/cycle

Test Platform 1: Louhi
[Figure omitted] Source: Cray Inc.

Test Platform 2: Murska
- Winner of Lot 2: "capacity computing"
- HP CP4000BL XC blade cluster
  - GHz AMD Opteron cores (dual-core): 10.6 TFlop/s
  - Dual-socket HP BL465c server blades
- HPC Linux operating system
  - HP's turnkey cluster OS based on RHEL
- InfiniBand interconnect network
  - A multipurpose high-speed network
  - Fat tree topology (blocking); see the oversubscription note after this slide
  - 24-port blade enclosure switches
    - 16 downlinks, 16 Gbit/s DDR*
    - 8 uplinks, 16 Gbit/s DDR, running at 8 Gbit/s
  - 288-port master switch with SDR** ports (8 Gbit/s), recently upgraded to DDR
  - Host Channel Adapters (HCAs) connected to 8x PCIe buses
  - Remote Direct Memory Access (RDMA)
*) Double Data Rate **) Single Data Rate
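From the link counts and rates listed above, one can estimate the oversubscription (blocking factor) of a blade enclosure switch. This is a figure derived here for illustration, not one stated on the slide:

    \[ \frac{16 \times 16\,\mathrm{Gbit/s}}{8 \times 8\,\mathrm{Gbit/s}} = \frac{256}{64} = 4 \]

i.e. roughly a 4:1 oversubscription of the uplinks in the worst case, which is what makes the fat tree "blocking" in contrast to Louhi's torus, where every node has its own router.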

Test Methodology
- Testing of individual parameters with microbenchmarks
  - End-to-end communication latency and bandwidth
  - Communication processing overhead
  - Consistency of performance across the system
- Testing of real-world behavior with a scientific application
  - Gromacs: a popular open-source molecular dynamics application
- All measurements use the Message Passing Interface (MPI)
  - MPI is by far the most popular parallel programming API for HPC
  - Murska uses HP's HP-MPI implementation
  - Louhi uses a Cray-modified version of the MPICH2 implementation

End-to-end Latency
- The Intel MPI Benchmarks (IMB) PingPong test was used
  - Measures the latency to send and receive a single point-to-point message as a function of the message size
  - Arguably the most popular metric of interconnect performance (especially for short messages)
  - A minimal sketch of the ping-pong pattern follows this slide
- Murska's HP-MPI has two modes of operation (RDMA and SRQ)
  - RDMA requires 256 kbytes of memory per MPI task, while SRQ (Shared Receive Queue) has a constant memory requirement
  - SRQ causes a notable latency overhead
- Murska using RDMA has a slightly lower latency with short messages
- As the message size grows, Louhi outperforms Murska
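The actual measurements used the IMB PingPong benchmark itself; the following is only a minimal sketch of the same ping-pong pattern in C/MPI, written for illustration. The message size, repetition count and the half-round-trip convention are assumptions here, not IMB's exact parameters.

    /* pingpong.c - minimal ping-pong latency sketch (illustrative, not IMB itself) */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        const int reps = 1000;      /* assumed repetition count */
        const int msg_size = 8;     /* message size in bytes; sweep this for a curve */
        int rank, size;
        char *buf = malloc(msg_size);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);   /* run with exactly two ranks */

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, msg_size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, msg_size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0) {
            /* Report half the average round-trip time as the one-way latency. */
            printf("%d bytes: %.2f us\n", msg_size, (t1 - t0) / reps / 2.0 * 1e6);
        }
        free(buf);
        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched with two ranks placed on different nodes, sweeping msg_size reproduces the shape of the latency curves discussed here.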

End-to-end Bandwidth
- IMB PingPong test with large message sizes
  - Bandwidth is derived from the same measurement: the message size divided by the one-way transfer time (see the formula after this slide)
  - The test was done between nearest-neighbor nodes
- Louhi still has not reached peak bandwidth at the largest message size (4 megabytes)
- The gap between RDMA and SRQ narrows as the link becomes saturated: SRQ does not affect performance with large message sizes
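Stated as a formula, using the usual IMB convention (assumed here rather than quoted from the thesis):

    \[ B(s) = \frac{s}{t_{\mathrm{RTT}}(s)/2} \]

where \(s\) is the message size and \(t_{\mathrm{RTT}}(s)\) the measured round-trip time; for large \(s\) the per-byte cost dominates the fixed latency, so \(B(s)\) approaches the link's peak bandwidth.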

Communication Processing Overhead
- A measure of how much communication stresses the CPU
  - The C in HPC stands for Computing, not Communication ;)
- MPI has asynchronous communication routines which overlap communication and computation
  - This requires autonomous communication mechanisms
  - Murska has RDMA; Louhi has protocol offloading and RDMA
- The Sandia National Laboratories SMB benchmark was used
  - Measures how much work one process gets done while another process communicates with it constantly
  - The result is an application availability percentage
    - 100%: communication is performed completely in the background
    - 0%: communication is performed completely in the foreground, no work gets done
  - Separate results for getting work done at the sender side and at the receiver side
  - A minimal sketch of the overlap idea follows this slide
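The thesis used the SMB host-overhead benchmark itself; the sketch below only illustrates the general idea of measuring availability: time a fixed amount of work with and without a concurrent non-blocking transfer in flight. The work loop, message size, single in-flight message and the availability formula are illustrative assumptions, not SMB's actual method.

    /* overlap.c - illustrative sketch of the communication/computation overlap idea */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define MSG_SIZE   (1 << 20)    /* 1 MB message (assumed) */
    #define WORK_ITERS 50000000L    /* fixed amount of dummy compute work (assumed) */

    /* Dummy compute kernel whose runtime is compared with and without communication. */
    static double do_work(long iters)
    {
        volatile double x = 1.0;
        for (long i = 0; i < iters; i++)
            x = x * 1.0000001 + 1e-7;
        return x;
    }

    int main(int argc, char **argv)
    {
        int rank, size;
        char *buf = malloc(MSG_SIZE);

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);   /* run with exactly two ranks */

        /* Baseline: time the work with no communication in flight. */
        double t0 = MPI_Wtime();
        do_work(WORK_ITERS);
        double t_base = MPI_Wtime() - t0;

        /* Time the same work while a non-blocking transfer is in progress. */
        MPI_Request req;
        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        if (rank == 0)
            MPI_Isend(buf, MSG_SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
        else
            MPI_Irecv(buf, MSG_SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);
        do_work(WORK_ITERS);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
        double t_comm = MPI_Wtime() - t0;

        /* Rough availability proxy: 100% means the transfer did not slow the work at all. */
        printf("rank %d (%s): availability ~ %.1f%%\n",
               rank, rank == 0 ? "sender" : "receiver", 100.0 * t_base / t_comm);

        free(buf);
        MPI_Finalize();
        return 0;
    }

With protocol offloading or well-functioning RDMA the transfer progresses without the host CPU, so t_comm stays close to t_base and the availability figure stays near 100%.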

Receiver Side Availability
- 100% availability: communication does not interfere with processing at all
- 0% availability: processing communication hogs the CPU completely
- Louhi's availability improves with 8k-128k messages
- Murska's availability drops significantly between 16k and 32k

Sender Side Availability
- Louhi's availability improves dramatically with large messages: the offload engine can process packets autonomously.

Gromacs
- A popular and mature molecular dynamics simulation package
  - Open source, downloadable from
- Programmed with MPI
  - Designed to exploit overlapping computation and communication, if available
- Parallel speedup was measured using a fixed-size molecular system
  - Run times for task counts from 16 to 128 were measured
- MPI calls were profiled (a sketch of one common profiling approach follows this slide)
  - How much time was spent in communication subroutines?
  - Which subroutines were the most time-consuming?
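These slides do not name the profiling tool; one common way to obtain such a per-call profile is the MPI profiling (PMPI) interface, sketched below for MPI_Wait only. The file name, counters and output format are illustrative assumptions, not the tool used in the thesis.

    /* mpi_wait_prof.c - minimal PMPI interposition sketch: time spent in MPI_Wait */
    #include <mpi.h>
    #include <stdio.h>

    static double wait_time  = 0.0;   /* accumulated time inside MPI_Wait */
    static long   wait_calls = 0;

    /* Intercept MPI_Wait; the real implementation is reached via PMPI_Wait. */
    int MPI_Wait(MPI_Request *request, MPI_Status *status)
    {
        double t0 = MPI_Wtime();
        int rc = PMPI_Wait(request, status);
        wait_time += MPI_Wtime() - t0;
        wait_calls++;
        return rc;
    }

    /* Print per-rank totals when the application shuts MPI down. */
    int MPI_Finalize(void)
    {
        int rank;
        PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
        printf("rank %d: MPI_Wait called %ld times, %.3f s total\n",
               rank, wait_calls, wait_time);
        return PMPI_Finalize();
    }

Linking (or preloading) such wrappers alongside the application, with similar wrappers for MPI_Alltoall and the other calls, yields the kind of fraction-of-MPI-time breakdown shown on the MPI Call Profile slide.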

Gromacs Run Time
- Murska's scaling stops at 32 tasks
- Louhi's scaling stops at 64 tasks
- Time spent in MPI communication routines starts increasing at 64 tasks

MPI Call Profile
- Shows the fraction of the total MPI time spent in each MPI call
- With small message sizes, MPI_Alltoall (all processes send a message to each other) dominates the time spent in MPI
- MPI_Wait (waiting for an asynchronous message transfer to complete) starts dominating MPI time usage on Murska as the task count grows
- On Louhi, MPI_Alltoall again dominates MPI time usage as the task count grows

Conclusions
- On Murska, a trade-off has to be made for large parallel problems
  - SRQ: sacrifice latency in favor of memory capacity
  - RDMA: sacrifice memory capacity in favor of latency
- Murska is able to outperform Louhi in some benchmarks
  - Especially in short-message performance in RDMA mode
- Louhi was more consistent in providing low processing overhead
  - Being able to overlap long messages tends to matter more than overlapping short ones, as long messages take more time to complete
- Gromacs scaled significantly better on Louhi
  - Most likely largely due to lower communication processing overhead
- A proprietary system still has its place
  - The interconnect is designed from the ground up to handle MPI communication and HPC workloads
  - The streamlined microkernel also helps
- Focusing only on "hero numbers" (e.g. short message latency) can be misleading

Questions? Thank you!