The Exascale Interconnect Technology
Rich Graham – Sr. Solutions Architect, Mellanox Technologies, 2012

Leading Server and Storage Interconnect Provider
• Comprehensive end-to-end 10/40/56 Gb/s Ethernet and 56 Gb/s InfiniBand portfolio: ICs, switches/gateways, adapter cards, cables, and software
• Scalability, reliability, power, performance

HCA Roadmap of Interconnect Innovations
• InfiniHost – world's first InfiniBand HCA: 10 Gb/s InfiniBand, PCI-X host interface, 1 million msg/sec
• InfiniHost III – world's first PCIe InfiniBand HCA: 20 Gb/s InfiniBand, PCIe host interface, million msg/sec
• ConnectX (1, 2, 3) – world's first Virtual Protocol Interconnect (VPI) adapter: 40 Gb/s & 56 Gb/s, PCIe 2.0/3.0 x8, 33 million msg/sec
• Connect-IB – the Exascale foundation, June 2012

Announcing Connect-IB: The Exascale Foundation
• A new interconnect architecture for compute-intensive applications
• World's fastest server and storage interconnect solution, providing 100 Gb/s injection bandwidth
• Enables unlimited clustering scalability with the new Dynamically Connected Transport service
• Accelerates compute-intensive and parallel applications with over 130 million msg/sec
• Optimized for multi-tenant environments with 100s of virtual machines per server

Connect-IB Advanced HPC Features – New Transport Mechanism for Unlimited Scalability
• New innovative transport – Dynamically Connected Transport service
  - Combines the best of Reliable Connected (RC) service (transport reliability) and Unreliable Datagram (UD) (no resource reservation)
  - Scales out to unlimited cluster sizes of compute and storage
  - Eliminates overhead and reduces memory footprint
• CORE-Direct collective hardware offloads
  - Provides 'state' to work queues; mechanisms for collective offloading in the HCA
  - Frees the CPU to do meaningful computation in parallel with collective operations
• Derived data types
  - Hardware support for non-contiguous 'strided' memory access
  - Scatter/gather optimizations

Dynamically Connected Transport Service

Problems the New Capability Addresses
• Transport scalability
  - RC requires a connection per peer process – strains resource requirements at large scale (O(N))
  - XRC requires a connection per remote node – still strains resource requirements at large scale (O(N))
• Transport performance
  - UD supports only send/receive semantics – no RDMA or atomic operations
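To make the scaling argument concrete, the sketch below is a rough, illustrative model of how many transport objects a single node must hold under RC, XRC, and a dynamically connected scheme. The per-QP byte count and the number of DC initiators per process are assumptions for the sake of the example, not Connect-IB figures.

```c
/* Back-of-the-envelope model of per-node transport state for a fully
 * connected job.  The per-QP byte figure and the DCI count are assumed
 * placeholders, not measured Connect-IB numbers. */
#include <stdio.h>

int main(void)
{
    const long nodes    = 10000;   /* cluster size                          */
    const long ppn      = 16;      /* processes per node                    */
    const long qp_bytes = 256;     /* assumed footprint of one QP context   */

    /* RC: every process holds a QP to every remote process.        */
    long rc_qps  = ppn * (nodes * ppn - 1);
    /* XRC: every process holds a QP per remote node.               */
    long xrc_qps = ppn * (nodes - 1);
    /* DC: a few initiators per process, re-targeted on demand.     */
    long dc_qps  = ppn * 4;        /* 4 DCIs per process is an assumption   */

    printf("RC : %8ld QPs per node (~%ld KB)\n", rc_qps,  rc_qps  * qp_bytes >> 10);
    printf("XRC: %8ld QPs per node (~%ld KB)\n", xrc_qps, xrc_qps * qp_bytes >> 10);
    printf("DC : %8ld QPs per node (~%ld KB)\n", dc_qps,  dc_qps  * qp_bytes >> 10);
    return 0;
}
```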

Dynamically Connected Transport Service Basics
• Dynamically Connected (DC) hardware entities
  - DC Initiator (DCI) – data source
  - DC Target (DCT) – data destination
• Key concepts
  - Reliable communication – supports RDMA and atomics
  - A single initiator can send to multiple destinations
  - Resource footprint scales with the application's communication pattern and the single-node communication characteristics
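A minimal sketch of the usage model this slide implies: one DC initiator is reused against many DC targets, with the destination supplied per message. The dci_t type and the dc_* calls below are hypothetical stand-ins (the real entry points are vendor verbs extensions not shown in the deck); only the shape of the flow is the point.

```c
/* Hypothetical DC flow: the dci_t type and the dc_* calls are invented for
 * illustration only.  A single initiator reliably reaches many targets, so
 * the resource footprint follows the communication pattern, not the job size. */
#include <stddef.h>

typedef struct dci dci_t;              /* DC initiator: local send context    */

struct dc_target_addr {
    unsigned int  lid;                 /* fabric address of the remote node   */
    unsigned int  dct_number;          /* remote DC target identifier         */
    unsigned long dc_access_key;       /* key the DCT was created with        */
};

/* Hypothetical vendor entry points (prototypes only, for illustration). */
int dc_post_rdma_write(dci_t *dci, const struct dc_target_addr *peer,
                       const void *buf, size_t len,
                       unsigned long remote_addr, unsigned int rkey);
int dc_wait_for_completions(dci_t *dci, int count);

int rdma_write_to_many(dci_t *dci,
                       const struct dc_target_addr *peers, int npeers,
                       const void *local_buf, size_t len,
                       const unsigned long *remote_addrs,
                       const unsigned int *rkeys)
{
    for (int i = 0; i < npeers; i++) {
        /* The hardware attaches the DCI to peers[i] on demand, performs a
         * reliable RDMA write (atomics work the same way), and releases the
         * connection state once traffic toward that target drains. */
        if (dc_post_rdma_write(dci, &peers[i], local_buf, len,
                               remote_addrs[i], rkeys[i]) != 0)
            return -1;
    }
    /* One completion queue covers every target reached through this DCI. */
    return dc_wait_for_completions(dci, npeers);
}
```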

Communication Time Line – Common Case

CORE-Direct Enhanced Support

Problems the New Capability Addresses
• Collective communication scalability
  - For many HPC applications, the scalability of collective communications determines application scalability
• System noise
  - Uncoordinated system activity causes a slowdown in one process to be magnified at all other processes in the collective
  - The effect increases as the size of the system increases
• Collective communication performance
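The amplification argument is easy to quantify with a toy model: if each of N processes is independently delayed with probability p while a blocking collective is in flight, some process delays the whole operation with probability 1 - (1 - p)^N. The C sketch below just evaluates that expression for a few system sizes; the delay probability is an arbitrary illustrative value, not a measured one.

```c
/* Toy model of noise amplification in blocking collectives: if each of N
 * processes is independently delayed with probability p while the
 * collective is in flight, the whole operation is delayed with probability
 * 1 - (1 - p)^N.  The value of p is arbitrary and purely illustrative. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double p = 1e-4;                        /* per-process delay chance */
    const int sizes[] = { 64, 1024, 16384, 262144 };

    for (int i = 0; i < 4; i++) {
        double delayed = 1.0 - pow(1.0 - p, (double)sizes[i]);
        printf("N = %6d processes: P(collective delayed) = %.3f\n",
               sizes[i], delayed);
    }
    return 0;
}
```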

Scalability of Collective Operations – Ideal Algorithm, Impact of System Noise

Scalability of Collective Operations – Offloaded Algorithm, Nonblocking Algorithm (Communication Processing)

Key Hardware Features
• Managed QPs progress by a separate counter (instead of by doorbell)
• A 'wait' work queue entry waits until a specified completion queue (CQ) reaches a specified producer index value
• 'Enable' tasks arm managed QPs for execution by the hardware
• Receive CQs can be set to remain active even if they overflow – wait events monitor progress
• Lists of tasks can be submitted to multiple QPs – sufficient to describe collective operations
• A special completion queue can be set up to monitor list completion – request a CQE from the relevant task
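The slide only names the primitives (managed QPs, wait WQEs, enable tasks, list submission), so the sketch below invents task descriptors and hca_* calls purely to show how those primitives chain together into one round of an offloaded collective; none of the names are real API.

```c
/* Illustrative encoding of one round of an offloaded collective as a task
 * list.  The struct task and the hca_* functions are hypothetical; the
 * point is that the send -> wait -> enable dependency chain is handed to
 * the HCA up front, so no CPU involvement is needed between the steps. */
#include <stddef.h>

enum task_kind { TASK_SEND, TASK_WAIT, TASK_ENABLE };

struct task {
    enum task_kind kind;
    int            qp;        /* managed QP the task targets        */
    int            cq;        /* CQ to watch (TASK_WAIT)            */
    int            cq_count;  /* producer index the wait must reach */
    const void    *buf;       /* payload (TASK_SEND)                */
    size_t         len;
};

/* Hypothetical submission/polling entry points (prototypes only). */
int hca_submit_task_list(const struct task *tasks, int ntasks);
int hca_poll_list_done(const struct task *tasks, int ntasks);

int offloaded_barrier_round(int peer_qp, int recv_cq, int next_round_qp)
{
    struct task round[3] = {
        { .kind = TASK_SEND,   .qp = peer_qp, .buf = "", .len = 0 },
        { .kind = TASK_WAIT,   .cq = recv_cq, .cq_count = 1       },
        { .kind = TASK_ENABLE, .qp = next_round_qp                },
    };

    if (hca_submit_task_list(round, 3) != 0)
        return -1;
    /* ...the host is free to compute here while the HCA works... */
    return hca_poll_list_done(round, 3);
}
```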

Collective Communication Methodology
• Collective communication optimizations
  - A communication pattern involving multiple processes
  - Optimized collectives involve a communicator-wide, data-dependent communication pattern
  - Data needs to be manipulated at intermediate stages of a collective operation
  - Collective operations limit application scalability – for example, through system noise
• CORE-Direct – key ideas
  - Create a local description of the communication pattern
  - Pass the description to the HCA
  - Manage the collective operation on the network, freeing the CPU to do meaningful computation
  - Poll for collective completion
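From the application's side, this flow surfaces through nonblocking collectives: describe the collective, hand it off, compute, then poll for completion only when the result is needed. The sketch below is generic MPI-3 code, not Mellanox-specific; compute_on_local_block is a placeholder application kernel, and whether the communication truly progresses in the HCA depends on the MPI library's offload support.

```c
/* Overlapping computation with a collective via the standard MPI-3
 * nonblocking interface.  Generic MPI sketch; the degree of real overlap
 * depends on the library progressing the collective in the HCA. */
#include <mpi.h>

void compute_on_local_block(double *block, int n);   /* application kernel */

void overlapped_allgather(double *sendbuf, double *recvbuf, int count,
                          double *local_block, int n, MPI_Comm comm)
{
    MPI_Request req;

    /* Hand the whole collective to the library (and, ideally, the HCA). */
    MPI_Iallgather(sendbuf, count, MPI_DOUBLE,
                   recvbuf, count, MPI_DOUBLE, comm, &req);

    /* Meaningful computation proceeds while the collective is in flight. */
    compute_on_local_block(local_block, n);

    /* Wait for collective completion only when the result is needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}
```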

Barrier Collective

Alltoall Collective (128 Bytes)

Nonblocking Allgather (Overlap Post-Work-Wait)

Nonblocking Alltoall (Overlap-Wait)

Non-Contiguous Data Type Support

Problems the New Capability Addresses
• Transfer of non-contiguous data
  - Often triggers data packing in main memory, adding to the communication overhead
  - Increases CPU involvement in communication pre-/post-processing

Combining Contiguous Memory Regions

Non-Contiguous Memory Access – Regular Access
• Supports non-contiguous strided memory access and scatter/gather
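At the application level, the natural way to describe a strided region like the one on this slide is an MPI derived datatype. The sketch below is generic MPI code, not Mellanox-specific; whether the explicit packing copy is actually avoided depends on the MPI library mapping the datatype onto the HCA's scatter/gather support rather than packing in main memory.

```c
/* Describing a strided region once and sending it without an explicit
 * packing loop.  Generic MPI sketch of the non-contiguous access the
 * hardware support targets. */
#include <mpi.h>

/* Send one column of an n x n row-major matrix to `dest`. */
void send_matrix_column(double *matrix, int n, int col,
                        int dest, int tag, MPI_Comm comm)
{
    MPI_Datatype column;

    /* n blocks of 1 double, separated by a stride of n doubles. */
    MPI_Type_vector(n, 1, n, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    /* One send call covers the whole non-contiguous region. */
    MPI_Send(matrix + col, 1, column, dest, tag, comm);

    MPI_Type_free(&column);
}
```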

Thank You