The Exascale Interconnect Technology
Rich Graham, Sr. Solutions Architect
© 2012 MELLANOX TECHNOLOGIES

Leading Server and Storage Interconnect Provider
- Comprehensive End-to-End 10/40/56Gb/s Ethernet and 56Gb/s InfiniBand Portfolio: Software, ICs, Switches/Gateways, Adapter Cards, Cables
- Scalability, Reliability, Power, Performance

HCA Roadmap of Interconnect Innovations
- InfiniHost
  - World’s first InfiniBand HCA
  - 10Gb/s InfiniBand
  - PCI-X host interface
  - 1 million msg/sec
- InfiniHost III
  - World’s first PCIe InfiniBand HCA
  - 20Gb/s InfiniBand
  - PCIe
  - million msg/sec
- ConnectX (1,2,3)
  - World’s first Virtual Protocol Interconnect (VPI) Adapter
  - 40Gb/s & 56Gb/s
  - PCIe 2.0, 3.0 x8
  - 33 million msg/sec
- Connect-IB
  - The Exascale Foundation
  - June

Announcing Connect-IB: The Exascale Foundation
- A new interconnect architecture for compute-intensive applications
- World’s fastest server and storage interconnect solution providing 100Gb/s injection bandwidth
- Enables unlimited clustering scalability with new Dynamically Connected Transport service
- Accelerates compute-intensive and parallel-intensive applications with over 130 million msg/sec
- Optimized for multi-tenant environments of 100s of Virtual Machines per server

Connect-IB Advanced HPC Features
New Transport Mechanism for Unlimited Scalability
- New innovative transport: Dynamically Connected Transport service
  - The new transport service combines the best of:
    - Reliable Connected Service: transport reliability
    - Unreliable Datagram (UD): no resource reservation
  - Scale out for unlimited clustering size of compute and storage
  - Eliminates overhead and reduces memory footprint
- COREDirect Collective Hardware Offloads
  - Provides ‘state’ to Work Queue
  - Mechanisms for Collective Offloading in HCA
  - Frees CPU to do meaningful computation in parallel with collective operations
- Derived Data Types
  - Hardware support for non-contiguous ‘strided’ memory access
  - Scatter/gather optimizations

Dynamically Connected Transport Service

Problems the New Capability Addresses
- Transport Scalability
  - RC requires a connection per peer: strains resource requirements at large scale (O(N))
  - XRC requires a connection per remote node: strains resource requirements at large scale (O(N))
- Transport Performance
  - UD supports only send/receive semantics: no RDMA or Atomic operations support

Dynamically Connected Transport Service Basics
- Dynamically Connected (DC) H/W entities
  - DC Initiator (DCI): data source
  - DC Target (DCT): data destination
- Key concept
  - Reliable communications: supports RDMA and Atomics
  - Single Initiator can send to multiple destinations
  - Resource footprint scales with:
    - Application communication patterns
    - Single node communication characteristics
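
The scaling argument for RC, XRC and DC can be made concrete with a rough count of the transport contexts a single node has to hold. This is only a sketch under stated assumptions: the function name, its parameters, and in particular the fixed `dc_per_process` value (one DCI plus one DCT per process) are hypothetical and not taken from the slides, which say only that the DC footprint follows the application's communication pattern.

```python
def transport_contexts_per_node(nodes, ppn, dc_per_process=2):
    """Rough count of transport contexts one node must hold.

    nodes:          number of nodes in the job
    ppn:            processes per node
    dc_per_process: ASSUMED fixed number of DC objects per process
                    (hypothetical: one DCI plus one DCT)
    """
    total_procs = nodes * ppn
    rc = ppn * (total_procs - 1)   # RC: each process connects to every other process
    xrc = ppn * (nodes - 1)        # XRC: each process connects to every remote node
    dc = ppn * dc_per_process      # DC: independent of job size under this assumption
    return {"RC": rc, "XRC": xrc, "DC": dc}


for nodes in (16, 1024, 65536):
    print(nodes, transport_contexts_per_node(nodes, ppn=16))
```

Under these assumptions the RC and XRC counts grow with the job size while the DC count stays constant; a real application that talks to many targets at once may well use more than one DCI per process.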

Communication Time Line – Common Case

COREDirect Enhanced Support

Problems the New Capability Addresses
- Collective communication scalability
  - For many HPC applications the scalability of such communications determines application scalability
- System noise
  - Uncoordinated system activity causes a slowdown in one process to be magnified at other processes
  - Effects increase as the size of the system increases
- Collective communication performance
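
Why noise effects grow with system size can be illustrated with a one-line probability model. A blocking collective finishes only when its slowest participant arrives, so a single delayed process delays everyone. The sketch below assumes, purely for illustration, that each process is independently hit by a noise event with probability `p_noise` during one collective; the function name and the numbers are not from the slides.

```python
def prob_collective_delayed(p_noise, n_procs):
    """Probability that at least one of n_procs processes is hit by noise.

    Assumes independent noise events with the same per-process probability.
    """
    return 1.0 - (1.0 - p_noise) ** n_procs


for n in (1, 64, 4096, 262144):
    print(n, round(prob_collective_delayed(1e-4, n), 4))
```

Even a rare per-process disturbance becomes a near-certain delay once enough processes take part, which is the motivation for moving collective progress off the CPU.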

Scalability of Collective Operations
[Figure panels: Ideal Algorithm; Impact of System Noise]

Scalability of Collective Operations
[Figure panels: Offloaded Algorithm; Nonblocking Algorithm - Communication processing]

Key Hardware Features
- Managed QP: progress is driven by a separate counter (instead of by doorbell)
- A ‘wait’ work queue entry waits until a specified completion queue (CQ) reaches a specified producer index value
- ‘Enable tasks’ allow managed QPs to be executed by the H/W
- Can set receive CQs to remain active if they overflow
  - Wait events monitor progress
- Submit lists of tasks to multiple QPs, sufficient to describe collective operations
- Can set up a special completion queue to monitor list completion
  - Request a CQE from the relevant task
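
The role of the ‘wait’ entry can be sketched with a toy model in which task lists run to completion without any host involvement. Everything here is hypothetical and heavily simplified: the task tuples, the queue names and `run_task_lists` are illustrative only and are not the adapter's interface, and a "send" is assumed to complete immediately and bump a named completion counter.

```python
from collections import deque


def run_task_lists(queues):
    """Execute per-queue task lists in order, honoring wait entries.

    queues maps a queue name to a list of tasks:
      ("send", cq)         completes at once and increments counter `cq`
      ("wait", cq, count)  blocks its queue until counter `cq` >= count
    Returns the order in which tasks were executed.
    """
    counters = {}
    pending = {name: deque(tasks) for name, tasks in queues.items()}
    trace = []
    progressed = True
    while progressed:
        progressed = False
        for name, tasks in pending.items():
            while tasks:
                task = tasks[0]
                if task[0] == "wait" and counters.get(task[1], 0) < task[2]:
                    break  # head of this queue is blocked; try the next queue
                tasks.popleft()
                if task[0] == "send":
                    counters[task[1]] = counters.get(task[1], 0) + 1
                trace.append((name, task))
                progressed = True
    if any(pending.values()):
        raise RuntimeError("deadlock: a wait entry can never be satisfied")
    return trace


trace = run_task_lists({
    "qp_a": [("wait", "cq_b", 1), ("send", "cq_a")],
    "qp_b": [("send", "cq_b")],
})
for name, task in trace:
    print(name, task)
```

In this example `qp_a` cannot send until `qp_b` has produced one completion, so the dependency between the two queues is expressed entirely in the posted lists rather than in host code.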

Collective Communication Methodology
- Collective communications optimizations
  - Communication pattern involving multiple processes
  - Optimized collectives involve a communicator-wide data-dependent communication pattern
  - Data needs to be manipulated at intermediate stages of a collective operation
  - Collective operations limit application scalability
    - For example, system noise
- COREDirect: Key Ideas
  - Create a local description of the communication pattern
  - Pass the description to the HCA
  - Manage the collective operation on the network, freeing the CPU to do meaningful computation
  - Poll for collective completion
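
The first key idea, creating a local description of the communication pattern, might look like the sketch below. It uses a recursive-doubling barrier as the pattern, which is an assumption made for illustration: the slides do not say which algorithm is used. The task names are hypothetical, and the later steps (passing the list to the HCA and polling for completion) are not modelled.

```python
def barrier_description(rank, n_procs):
    """Local task list for a recursive-doubling barrier.

    n_procs must be a power of two. Each round pairs this rank with the
    rank whose number differs in exactly one bit.
    """
    if n_procs < 1 or n_procs & (n_procs - 1):
        raise ValueError("n_procs must be a power of two")
    tasks = []
    distance = 1
    while distance < n_procs:
        partner = rank ^ distance
        tasks.append(("send", partner))
        tasks.append(("wait_recv", partner))
        distance *= 2
    return tasks


def everyone_heard_from_everyone(n_procs):
    """Replay all local descriptions round by round and check the barrier property."""
    known = [{r} for r in range(n_procs)]
    descs = [barrier_description(r, n_procs) for r in range(n_procs)]
    rounds = len(descs[0]) // 2
    for k in range(rounds):
        snapshot = [set(s) for s in known]  # state at the start of the round
        for r in range(n_procs):
            partner = descs[r][2 * k][1]
            known[r] |= snapshot[partner]
    return all(s == set(range(n_procs)) for s in known)


print(barrier_description(5, 8))
print(everyone_heard_from_everyone(8))
```

Each rank builds its list from nothing but its own rank and the communicator size, which is what makes the description "local": no global coordination is needed before the list is handed off.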

Barrier Collective

Alltoall Collective (128 Bytes)

Nonblocking Allgather (Overlap Post-Work-Wait)

Nonblocking Alltoall (Overlap-Wait)

Non-Contiguous Data Type Support

Problems the New Capability Addresses
- Transfer of non-contiguous data
  - Often triggers data packing in main memory, adding to the communication overhead
  - Increased CPU involvement in communication pre/post-processing

Combining Contiguous Memory Regions

Non-Contiguous Memory Access – Regular Access
- Supports non-contiguous strided memory access, scatter/gather
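
A regular strided access can be described by four numbers, and the sketch below shows both that description and the CPU-side packing step that hardware support is meant to avoid. The function names and the (base, block length, stride, count) parameterization are assumptions chosen for illustration; the slides do not show how the description is actually handed to the adapter.

```python
def strided_gather_list(base, block_len, stride, count):
    """Expand a regular strided layout into (offset, length) scatter/gather entries."""
    return [(base + i * stride, block_len) for i in range(count)]


def pack(memory, entries):
    """CPU-side packing: copy every described block into one contiguous buffer."""
    out = bytearray()
    for offset, length in entries:
        out += memory[offset:offset + length]
    return bytes(out)


# 4x4 row-major matrix of one-byte elements; column 1 is a strided access.
matrix = bytes(range(16))
column = strided_gather_list(base=1, block_len=1, stride=4, count=4)
print(column)
print(list(pack(matrix, column)))
```

Without hardware support, something like `pack` runs on the host before every send and its inverse after every receive; with it, only the compact description would need to be posted.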

THANK YOU