IB ACM InfiniBand Communication Management Assistant (for Scaling) Sean Hefty.

Slides:



Advertisements
Similar presentations
L. Alchaal & al. Page Offering a Multicast Delivery Service in a Programmable Secure IP VPN Environment Lina ALCHAAL Netcelo S.A., Echirolles INRIA.
Advertisements

Neighbor Discovery for IPv6 Mangesh Kaushikkar. Overview Introduction Terminology Protocol Overview Message Formats Conceptual Model of a Host.
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Addressing the Network – IPv4 Network Fundamentals – Chapter 6.
Internet Control Protocols Savera Tanwir. Internet Control Protocols ICMP ARP RARP DHCP.
Remote Procedure Call (RPC)
OFED TCP Port Mapper Proposal June 15, Overview Current NE020 Linux OFED driver uses host TCP/IP stack MAC and IP address for RDMA connections Hardware.
Multi-Layer Switching Layers 1, 2, and 3. Cisco Hierarchical Model Access Layer –Workgroup –Access layer aggregation and L3/L4 services Distribution Layer.
11 TROUBLESHOOTING Chapter 12. Chapter 12: TROUBLESHOOTING2 OVERVIEW  Determine whether a network communications problem is related to TCP/IP.  Understand.
MCDST : Supporting Users and Troubleshooting a Microsoft Windows XP Operating System Chapter 13: Troubleshoot TCP/IP.
Linux Networking TCP/IP stack kernel controls the TCP/IP protocol Ethernet adapter is hooked to the kernel in with the ipconfig command ifconfig sets the.
Oct 21, 2004CS573: Network Protocols and Standards1 IP: Addressing, ARP, Routing Network Protocols and Standards Autumn
Chapter 8 Administering TCP/IP.
Hands-On Microsoft Windows Server 2003 Networking Chapter 7 Windows Internet Naming Service.
Subnetting.
Chapter 23: ARP, ICMP, DHCP IS333 Spring 2015.
70-293: MCSE Guide to Planning a Microsoft Windows Server 2003 Network, Enhanced Chapter 7: Planning a DNS Strategy.
COEN 252: Computer Forensics Router Investigation.
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Version 4.0 Application Layer Functionality and Protocols Network Fundamentals – Chapter.
CMPT 471 Networking II Address Resolution IPv6 Neighbor Discovery 1© Janice Regan, 2012.
Network Layer (Part IV). Overview A router is a type of internetworking device that passes data packets between networks based on Layer 3 addresses. A.
Support Protocols and Technologies. Topics Filling in the gaps we need to make for IP forwarding work in practice – Getting IP addresses (DHCP) – Mapping.
CN2668 Routers and Switches Kemtis Kunanuraksapong MSIS with Distinction MCTS, MCDST, MCP, A+
11 NETWORK PROTOCOLS AND SERVICES Chapter 10. Chapter 10: Network Protocols and Services2 NETWORK PROTOCOLS AND SERVICES  Identify how computers on TCP/IP.
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Addressing the Network – IPv4 Network Fundamentals – Chapter 6.
Copyright © 2007 InfiniBand ® Trade Association. Other names and brands are properties of their respective owners. IB Cross-Subnet Communication OpenFabrics.
Windows Internet Connection Sharing Dave Eitelbach Program Manager Networking And Communications Microsoft Corporation.
Mapping Internet Addresses to Physical Addresses (ARP)
Module 7: Configuring TCP/IP Addressing and Name Resolution.
Name Resolution Domain Name System.
23-Support Protocols and Technologies Dr. John P. Abraham Professor UTPA.
The Network Layer. Network Projects Must utilize sockets programming –Client and Server –Any platform Please submit one page proposal Can work individually.
Computer Networks. IP Addresses Before we communicate with a computer on the network we have to be able to identify it. Every computer on a network must.
PA3: Router Junxian (Jim) Huang EECS 489 W11 /
End-to-end resource management in DiffServ Networks –DiffServ focuses on singal domain –Users want end-to-end services –No consensus at this time –Two.
70-291: MCSE Guide to Managing a Microsoft Windows Server 2003 Network Chapter 6: Name Resolution.
CMPT 471 Networking II Address Resolution IPv4 ARP RARP 1© Janice Regan, 2012.
InfiniBand Routing Solution Approach Yaron Haviv, CTO, Voltaire
NUS.SOC.CS2105 Ooi Wei Tsang Application Transport Network Link Physical you are here.
Scalable name and address resolution infrastructure -- Ira Weiny/John Fleck #OFADevWorkshop.
Management Scalability Author: Todd Rimmer Date: April 2014.
Update on Scalable SA Project #OFADevWorkshop Hal Rosenstock Mellanox Technologies.
Hyung-Min Lee ©Networking Lab., 2001 Chapter 8 ARP and RARP.
High Availability through the Linux bonding driver
Internetworking Internet: A network among networks, or a network of networks Allows accommodation of multiple network technologies Universal Service Routers.
Scalable RDMA Software Solution Sean Hefty Intel Corporation.
NFD Permanent Face Junxiao Shi, Outline what is a permanent face necessity and benefit of having permanent faces guarantees provided by.
IP1 The Underlying Technologies. What is inside the Internet? Or What are the key underlying technologies that make it work so successfully? –Packet Switching.
COP 5611 Operating Systems Spring 2010 Dan C. Marinescu Office: HEC 439 B Office hours: M-Wd 2:00-3:00 PM.
FreeS/WAN & VPN Cory Petkovsek VPN: Virtual Private Network – a secure tunnel through untrusted networks. IP Security (IPSec): a standardized set of authentication.
ICS 156: Networking Lab Magda El Zarki Professor, ICS UC, Irvine.
The Client-Server Model And the Socket API. Client-Server (1) The datagram service does not require cooperation between the peer applications but such.
Semester 2v2 Chapter 8: IP Addressing. Describe how IP addressing is important in routing. IP addresses are specified in 32-bit dotted-decimal format.
70-293: MCSE Guide to Planning a Microsoft Windows Server 2003 Network, Enhanced Chapter 6: Planning, Configuring, And Troubleshooting WINS.
OpenFabrics Developers Summit SC06 QoS Update and Implementation RFC Eitan Zahavi, Mellanox Technologies Nov 2006.
Quality of Service Support Dror Goldenberg - Mellanox Sean Hefty – Intel.
Chapter 5. An IP address is simply a series of binary bits (ones and zeros). How many binary bits are used? 32.
Address Resolution Protocol Yasir Jan 20 th March 2008 Future Internet.
Chapter 4: server services. The Complete Guide to Linux System Administration2 Objectives Configure network interfaces using command- line and graphical.
InfiniBand Routing in OFA Jason Gunthorpe – Obsidian Sean Hefty – Intel Hal Rosenstock – Voltaire.
1 K. Salah Module 5.1: Internet Protocol TCP/IP Suite IP Addressing ARP RARP DHCP.
Mobile IP THE 12 TH MEETING. Mobile IP  Incorporation of mobile users in the network.  Cellular system (e.g., GSM) started with mobility in mind. 
SC’13 BoF Discussion Sean Hefty Intel Corporation.
IP: Addressing, ARP, Routing
70-293: MCSE Guide to Planning a Microsoft Windows Server 2003 Network, Enhanced Chapter 6: Planning, Configuring, And Troubleshooting WINS.
ARP and RARP Objectives Chapter 7 Upon completion you will be able to:
Fabric Interfaces Architecture – v4
Simple Connectivity Between InfiniBand Subnets
Allocating IP Addressing by Using Dynamic Host Configuration Protocol
Ch 17 - Binding Protocol Addresses
Windows Name Resolution
Presentation transcript:

IB ACM InfiniBand Communication Management Assistant (for Scaling) Sean Hefty

Problem MPI job startup places a huge burden on the fabric Need to resolve addressesNeed to resolve paths

Problem Address Resolution ARP for address resolution – ARP storm as all nodes try to discover other nodes Number of outstanding ARP requests = O(n 2 ) May require pre-loading ARP cache before starting MPI job – ARP replies require path record queries! – ARP entries timeout across the subnet 1000 nodes  1 million ARP entries subnet wide 15 minute timeout  1000 entries timeout per second 24 hour timeout  12 entries timeout per second Even with a 24 hour ARP entry timeout, 7200 entries will timeout during a 10 minute MPI run! No easy solution, but workable in practice

Problem Path Resolution Centralized SA hinders scalability Number of outstanding queries = O(n 2 ) Queries take minutes to complete – If apps actually wait that long – (Think of how many ARP entries just timed out!) – Responses are not shared Standard path record query mechanisms simply do not work at scale

Solution IB ACM Addresses connection setup scalability issues Service / daemon providing: Address resolution – ARP like protocol over IB multicast – Names may be host names, IP addresses Route resolution – Construct or query for path records – Path record caching

Solution ACM for Address Resolution Wait, why duplicate ARP mechanisms? IPoIB issues path record queries as part of its ARP implementation Store address and route data together – Timeout data together – Under certain conditions, we can resolve address and route data without using SA queries We can obtain additional information about the remote device – E.g. maximum RDMA capabilities Restrict the number of outstanding address resolution requests on the subnet to O(n)

Solution ACM for Route Resolution Is ACM a path record cache? Yes, but also an open source framework for addressing scalability issues – May be able to construct path records without querying the SA Integrates with the librdmacm to make its use transparent to the user Restricts the number of outstanding SA queries from the subnet to O(n)

Usage Model Recommended use is via librdmacm – Requires ‘--with-ib_acm’ compile option Temporary requirement, will be removed once ACM is more proven – Falls back to normal resolution on failures Unable to contact local or remote ib_acm service Failure to resolve address or path information Small latency hit if librdmacm compiled with ACM support, but service is not running

Usage Model rdma_resolve_route – All existing librdmacm apps can take advantage of path record caching – Path data is obtained from ib_acm service and kernel is configured to use it ib_acm lookup is synchronous – Requires kernel (OFED 1.5.2) or newer rdma_resolve_addr still ends up going through ARP :P

Usage Model rdma_getaddrinfo – Supports address and route resolution Kernel support to make use of address resolution is pending (AF_IB patch set) – Can act as a simple (i.e. dumb) path record query interface For apps that manually configure their QPs, but require path records (specifically SL data) from the SA to avoid subnet deadlock – NODELAY flag Quick check against cached data Kicks off ACM resolution protocols in background Future lookups can find cached data

Usage Model ib_acm may be accessed directly via a socket interface – Not recommended – Protocol is defined by acm.h header file, but is not documented – librdmacm and ib_acme may be used as guides

ACM Service Listens on TCP socket Simple request / reply protocol Input: – Name, IP address, partial path record – Destination and optional source Output: – Set of path records Forward, reverse, alternates, CM paths – Current implementation is 1 Path must be fully reversible

ACM Service Examples src = [ ] dst = Path record with SGID / DGID Path Record IB ACM librdmacm (rdma_getaddrinfo)

ACM Operation 1.All ACM services join multicast groups – Groups differ by pkey, rate, MTU, type of service – All services must join 1 common group – All traffic occurs on common group 2.ACM receives client request 3.Initiator broadcasts destination address on common multicast group 4.All peer ACM services cache source data

ACM Operation 5.Target service queries SA or constructs path record to initiator – See next slide 6.Target service responds with IB address 7.Initiator caches response address 8.Initiator queries SA or constructs path record to target 9.ACM responds to client request

ACM Operation Query SA for path data – Required for certain fabric topologies to avoid deadlock Construct path record based on common multicast group supported by source and destination – Can be more efficient – Once multicast groups are configured, avoids querying the SA

ACM Communication Defines application specific MADs – Uses management class 44 Allocates a UD QP Protocol is defined internal to ACM service – Defined in internal header files – Undocumented

ACME Multi-purpose ACM test and configuration application Verifies path data from ACM against SA path records – Can be used to validate if multicast resolution protocol is usable May be used to pre-load cache Generate configuration files

ACM Configuration Users may override default ACM configuration options through the use of configuration files Default files can be generated using the ib_acme utility – acm_addr.cfg Contains local address data File will be unique per node – acm_opts.cfg Contains ACM service options Likely the same across an entire cluster

ACM Configuration Address assignments – – Name is usually host name or IP address – Pkey can be use for QoS Logging location and details Timeout / retry settings Outstanding requests limits – To SA or peer ACM services – Restricts the number of outstanding requests on the subnet

Work In Progress Validate concepts, performance, and scalability – Initial testing is positive Support for dynamic changes is limited – Handles port up/down events – Does not respond to local IP address or hostname changes Want to obtain path data progressively – Pre-load cache using ib_acme – Save / restore data between reboots

Work In Progress Cached data not invalidated – Must stop / restart service – Proposal is to invalidate data based on CM timeouts Dependent on kernel changes QoS support controlled by SA – Addresses / names map to specific pkey values – A QoS scheme based on pkeys would work best

Pending Work AF_IB Support Allows direct IB addressing with transport independent rdma_getaddrinfo interface Desired for librdmacm to make use of ACM address resolution Removes librdmacm requirement for ipoib Provides greater flexibility moving forward to support alternate paths

Pending Work MAD Snooping Invalidate cache based on local events – CM timeouts, rejects Avoid registering for SA events – Data is invalidated based on local events only Capture path data from other applications – ipoib, srp, iser, rds, etc.