Published by Junior Murphy. Modified over 6 years ago.
1
A Brief Introduction to OpenFabrics Interfaces - libfabric
Sean Hefty

The terms OFI and libfabric are often used interchangeably. Libfabric is the first, but likely not the only, component of what will become OpenFabrics Interfaces. It is focused on user-space applications.
2
Motivation OpenFabrics libibverbs middleware
libibverbs:
- Widely adopted low-level RDMA API
- Ships with upstream Linux
- Intended as unified API for RDMA
but…
- Designed around InfiniBand architecture
- Targets specific hardware implementation
- Hardware, not network, abstraction
- Too low level for most consumers, not designed around HPC
Hardware and fabric features are changing
- Divergence is driving alternative APIs: UCX, PSM, MXM, CCI, PAMI, uGNI, …
More applications require high-performance fabrics
- Cloud systems, data analytics, virtualization, big data, …

This should not be viewed as an ‘attack’ on libibverbs. Libibverbs has served a very useful purpose over the years and will continue to exist as part of OpenFabrics. The libibverbs interface was originally designed as a merger of 3 different IB interfaces. It is based on the IB spec, chapter 11, and IB terminology is used throughout the interface. Non-IB hardware has had to adapt to these interfaces, sometimes with restrictions. E.g. libibverbs does not expose any MTU sizes other than those defined by IB, so Ethernet devices (e.g. iWarp) must report non-standard MTU values. The IB spec never intended for verbs to be a software interface. It was an agreement made to the IB hardware. OFI is OFA adapting to the changing hardware implementations, new applications, and new fabric features.
3
OpenFabrics Interfaces Working Group
Solution: libfabric
Scalable
- Optimized SW path to HW
- Minimize cache and memory footprint
- Reduce instruction count
- Minimize memory accesses
Implementation Agnostic
- Good impedance match with multiple fabric hardware
- InfiniBand, iWarp, RoCE, raw Ethernet, UDP offload, Omni-Path, GNI, others
Application-Centric
- Software interfaces aligned with application requirements
- 168 requirements from MPI, PGAS, SHMEM, DBMS, sockets, NVM, …
Open Source
- Leverage existing open source community
- Inclusive development effort
- App and HW developers

OFA created a new working group to address the challenges of the marketplace and ensure its relevance in the industry. OFA was an ideal group for developing a new set of interfaces: it had an active community, multiple vendors, and end-users.
4
OpenFabrics Interfaces Working Group
Charter: Develop an extensible, open source framework and interfaces aligned with ULP and application needs for high-performance fabric services.
github.com/ofiwg
Application-centric interfaces will help foster fabric innovation and accelerate their adoption.

One of the goals of OFI was to switch the focus from a bottom-up approach to a top-down one. Verbs is an example of a bottom-up approach: the hardware implementation is exposed directly to the applications. As hardware evolved, the interfaces were forced to change in order to expose the new features. By focusing on the application’s needs instead, implementation details are hidden from the app. As hardware incorporates new features, those features can be used by modifying the providers rather than enabling each application.
5
Development
Phases: requirement analysis, iterative design and implementation, deployment
- ~200 requirements: MPI, PGAS, SHMEM, DBMS, sockets, …
- Rough conceptual model
- Input from wide variety of devices
- Quarterly release cycle
- Collective feedback from OFIWG

The development process is iterative. We spent months analyzing application requirements in order to get at the real application requirement, as opposed to a suggested solution. In many cases, the initial requirement that we received was a proposal for a solution, often based on how things were done with existing interfaces. By spending time understanding the driving need of each requirement, we were able to craft an API well suited to application needs. At the same time, we analyzed the various proposals to understand their impact on the hardware implementations. Could it be done in hardware? What would the resulting API cost to implement in terms of memory footprint, cache usage, or instruction count? The first release was in Q1 of 2015. There have since been releases for 1.1 and 1.1.1, and we’re targeting a 1.2 release at the end of Q4 2015. Sometime next year, we anticipate adding the first set of extensions to the 1.0 API.
6
Application Requirements
Give us a high-level interface!
Give us a low-level interface!
And this was just the MPI developers!
Try talking to the government!

Libfabric tries to walk both lines. We want it to be easy for a casual developer to get their code working. At the same time, there are some power users who want very low-level access to the hardware. If libibverbs is conceptually viewed as being ‘assembly language’ for IB devices, libfabric can be viewed as being C. It still allows for low-level access to generic services.
7
Implementation Agnostic API
Design: Implementation Agnostic API
EASY
- Enable simple, basic usage
- Move functionality under OFI
GURU
- Advanced application constructs
- Expose abstract HW capabilities
Range of application usage models

It should be noted that the ‘Easy’ part of the design is there, but the implementation of that is ongoing. That is, each provider has implemented those pieces of libfabric that it does well. However, the next step is to expand the implementation so that it is easier for applications to switch from one provider to another. This is simply a matter of having the time to complete the development.
8
Architecture libfabric
Note: current implementation focused on enabling applications
Libfabric Enabled Middleware: Intel MPI, MPICH (Netmod), Open MPI (MTL / BTL), Open MPI SHMEM, Sandia SHMEM, GASNet, Clang UPC, rsockets, ES-API
libfabric
- Control Services: Discovery (fi_info)
- Communication Services: Connection Management, Address Vectors
- Completion Services: Event Queues, Counters
- Data Transfer Services: Message Queues, Tag Matching, RMA, Atomics, Triggered Ops
Providers: Sockets (TCP, UDP), Verbs (IB, RoCE, iWarp), Cisco usNIC, Intel Omni-Path, Cray GNI, Mellanox MXM, IBM Blue Gene, A3Cube RONNIEE
Legend: supported or in active development; experimental

The experimental providers do not ship with libfabric. The Blue Gene provider was developed by Intel in order to test the development of the middleware at scale. To date, we’ve run libfabric over BG hardware at 1 million ranks. The A3Cube provider was developed by a graduate student in Italy.

Provider development has been focused on time to market, so that middleware can be enabled over libfabric. The sockets provider is for development purposes and works on both Linux and Mac OS X. The verbs provider is a layered provider that enables existing IB, RoCE, and iWarp hardware. Cisco has a native libfabric provider (native meaning there isn’t a lower-level interface sitting under libfabric). The BG provider is native as well. The Intel providers enable TrueScale and OPA. Cray has a native GNI provider, and Intel’s MPI team contributed a provider over MXM, which targets greater scalability over Mellanox fabrics.
9
Select desired endpoint type and capabilities
Fabric Information (EASY)
Endpoint Types
- MSG: reliable connected
- DGRAM: datagram
- RDM: reliable unconnected, datagram messages
Capabilities
- Message queue (FIFO)
- RMA
- Tagged messages
- Atomics
Select desired endpoint type and capabilities

This is an easy example of using an RDM endpoint with tag matching capabilities. With just this support, many MPIs can easily be ported over libfabric.
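To make "tag matching" a little more concrete, here is a minimal conceptual sketch in Python. It is not libfabric code: `TaggedQueue`, `post_recv`, and `deliver` are hypothetical names, and the matching rule (compare tags after masking out the bits marked as ignored) is an assumption about how tag matching with wildcard bits typically behaves.

```python
from collections import deque


class TaggedQueue:
    """Toy model of tag matching: posted receives vs. unexpected messages."""

    def __init__(self):
        self.posted = deque()      # (tag, ignore, label) in posting order
        self.unexpected = deque()  # (tag, payload) in arrival order

    @staticmethod
    def _matches(recv_tag, ignore, msg_tag):
        # Bits set in `ignore` are wildcards; all other bits must be equal.
        return (recv_tag & ~ignore) == (msg_tag & ~ignore)

    def post_recv(self, tag, ignore, label):
        """Post a receive; return (label, payload) if a queued message matches."""
        for i, (msg_tag, payload) in enumerate(self.unexpected):
            if self._matches(tag, ignore, msg_tag):
                del self.unexpected[i]
                return (label, payload)
        self.posted.append((tag, ignore, label))
        return None

    def deliver(self, msg_tag, payload):
        """Deliver an incoming message; return (label, payload) on a match."""
        for i, (tag, ignore, label) in enumerate(self.posted):
            if self._matches(tag, ignore, msg_tag):
                del self.posted[i]
                return (label, payload)
        self.unexpected.append((msg_tag, payload))
        return None
```

An MPI-style library could fold its communicator, source rank, and user tag into the tag value and use the ignore mask for wildcard receives; that encoding is left out here.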
10
Fabric Information (EASY)
Diagram: OFI-enabled applications (App 1, App 2, …, App n) each use an RDM message queue; a common implementation layers it over a DGRAM message queue.

In this example, the HW supports unreliable datagram communication. This is similar to what is supported by Cisco’s usNIC or, to a rough degree, Intel’s True Scale or OPA gen 1 hardware. Rather than each application needing to code for reliability, we push reliability down into libfabric, so that a single, optimized implementation is available to all applications.
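The idea of building reliable delivery once, on top of an unreliable datagram service, can be sketched as follows. This is a deliberately simple stop-and-wait scheme chosen for illustration, not the protocol any libfabric provider uses; all class and function names are hypothetical, and acknowledgements are assumed never to be lost.

```python
class LossyDatagram:
    """Unreliable channel that drops the transmissions listed in `drop`."""

    def __init__(self, drop):
        self.drop = set(drop)
        self.count = 0  # number of transmissions attempted so far

    def send(self, packet):
        self.count += 1
        return None if self.count in self.drop else packet


class ReliableReceiver:
    def __init__(self):
        self.expected = 0
        self.delivered = []

    def on_packet(self, packet):
        seq, data = packet
        if seq == self.expected:      # new data: deliver once, in order
            self.delivered.append(data)
            self.expected += 1
        return seq                    # acknowledge what was received


def reliable_send(messages, channel, receiver, max_tries=5):
    """Stop-and-wait: resend each message until it is acknowledged."""
    for seq, data in enumerate(messages):
        for _ in range(max_tries):
            arrived = channel.send((seq, data))
            if arrived is not None and receiver.on_packet(arrived) == seq:
                break
        else:
            raise RuntimeError(f"message {seq} was not acknowledged")
```

Applications calling `reliable_send` never see the retransmissions, which is the point of placing this logic below the interface rather than in each application.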
11
Fabric Information Capabilities GURU
Application desired features and permissions
- Primary: must be requested by application
- Secondary: offered by provider (application can request)
Communication type: msg, tagged, rma, atomics, triggered
Permissions: local R/W, remote R/W, send/recv
Features: rma events, directed recv, multi-recv, …

Applications request specific capabilities through the libfabric control interface.
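One way to picture the primary/secondary split is as a small negotiation over bit flags. The sketch below is a conceptual model only: the `Cap` names and the `negotiate` function are hypothetical, and the rule that secondary capabilities are granted even when not requested is an assumption drawn from the slide's wording.

```python
from enum import Flag, auto


class Cap(Flag):
    """Hypothetical capability bits, named after the slide's categories."""
    MSG = auto()
    TAGGED = auto()
    RMA = auto()
    ATOMIC = auto()
    RMA_EVENT = auto()
    MULTI_RECV = auto()


def negotiate(requested, primary, secondary):
    """Return the capability set an endpoint would get, or None if unsupported.

    primary   -- capabilities enabled only when the application asks for them
    secondary -- capabilities the provider offers even when not requested
    """
    supported = primary | secondary
    if requested & ~supported:
        return None  # the application asked for something the provider lacks
    return requested | secondary
```

For example, a provider with primary `MSG | TAGGED | RMA` and secondary `MULTI_RECV` would grant `TAGGED | MULTI_RECV` to an application that asks only for `TAGGED`, and would reject a request for `ATOMIC`.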
12
Expose optimal way to use underlying hardware resources
Fabric Information: Attributes (GURU)
Defines the limits and behavior of selected interfaces
- Progress: provider or application driven
- Threading: resource synchronization boundaries
- Resource mgmt: protect against queue overruns
- Ordering: message processing, data transfers
Expose optimal way to use underlying hardware resources

Libfabric attributes are different from those defined for other interfaces. Most interfaces define attributes as hardware maximums or limits. Libfabric defines limits based on optimal values supported by the provider. E.g. a provider may support 64,000 endpoints in terms of addressing capabilities, but the hardware cache effectively limits that number to 256 endpoints active at any given time. The intent is that a resource manager can use the data to allocate resources among different jobs and processes. Additionally, libfabric defines some attributes around meeting application needs. E.g. the threading attribute allows an application to allocate resources among threads such that it can avoid locking. In many cases the attributes are optimization hints from the application to the provider on its intended use.
13
Request application take action to improve overall performance
Fabric Information: Mode (GURU)
Provider hints on how it is best used
- Local MR: must register buffers for local operations
- Context: app provides ‘scratch space’ for provider to track request
- Buffer prefix: app provides space for network headers
Request application take action to improve overall performance

Although the interfaces are driven by the application, there are cases where performance could be improved if the application took some action on behalf of the provider. These are the mode bits. In most cases, it is cheaper for an application to take these actions than for the provider to do so. E.g. if we implement reliability over an unreliable interface (as shown by layering RDM over DGRAM EPs in the previous slide), then a provider needs to track each request. The FI_CONTEXT mode bit indicates that the provider would like the application to provide the memory used to track the request. Many HPC middleware already allocate memory with each operation, so allocating a few more bytes is easier and cheaper than the provider also allocating memory.
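The FI_CONTEXT idea, where the application hands the provider a small scratch area with each request, can be modelled roughly as below. This is an illustrative Python sketch, not the libfabric data layout: `Context`, `AppRequest`, and `ToyProvider` are hypothetical, and the two scratch fields are invented for the example.

```python
class Context:
    """Application-owned scratch space that the provider may use per request."""
    __slots__ = ("provider_next", "provider_seq")

    def __init__(self):
        self.provider_next = None
        self.provider_seq = None


class AppRequest:
    """Middleware request; it embeds the scratch space it already allocates."""

    def __init__(self, payload):
        self.payload = payload
        self.ctx = Context()


class ToyProvider:
    """Tracks unacknowledged sends by threading a list through app contexts."""

    def __init__(self):
        self.head = None
        self.next_seq = 0

    def send(self, request):
        ctx = request.ctx
        ctx.provider_seq = self.next_seq
        ctx.provider_next = self.head  # no provider-side allocation needed
        self.head = request
        self.next_seq += 1
        return ctx.provider_seq

    def pending(self):
        """Sequence numbers of tracked requests, most recent first."""
        seqs, req = [], self.head
        while req is not None:
            seqs.append(req.ctx.provider_seq)
            req = req.ctx.provider_next
        return seqs
```

The provider keeps only a head pointer; every per-request byte it needs lives inside memory the middleware was allocating anyway.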
14
Endpoints Addressable communication portal EASY
Conceptually similar to a socket or QP
Diagram: an endpoint with transmit, receive, and completions
- Conceptual (or real) command queues
- Sequence of request and completion processing

Conceptually, each endpoint is associated with a hardware transmit and receive command queue. There is no requirement that the provider implement the command queues in hardware, however.
15
Shared Tx/Rx Contexts (GURU)
Enable resource manager to direct use of HW resources
Diagram: four endpoints sharing one transmit and one receive context
- Number of endpoints greater than available resources
- Map to command queues or HW limits (caching)

If there are more endpoints active than the hardware can effectively support, the endpoints may be manually configured to use shared command queues.
16
Scalable Endpoints Multiple Tx/Rx contexts per endpoint GURU
Diagram: one endpoint with four transmit and four receive contexts
- Multi-threading
- Ordering
- Progress
- Completions
Incoming requests may be able to target a specific receive context

Scalable endpoints are the opposite of shared endpoints. The intent is that an application can take advantage of all hardware resources that are available. An anticipated use of scalable endpoints is to allow multiple threads to each have their own transmit command queue. This avoids locking between threads. This has an advantage over using multiple endpoints in that the number of addresses that each process must maintain is decreased. In this case, we have 1 endpoint = 1 address, but 4 transmit contexts. The alternative would be to have 4 endpoints = 4 addresses.
17
API Performance Analysis
Issues apply to many APIs: Verbs, AIO, DAPL, Portals, NetworkDirect, …

libibverbs with InfiniBand
  Structure   Field     Write Size   Branch?
  sge                   16
  send_wr               60
              next                   Yes
              num_sge
              opcode
              flags
  Totals                76+8 = 84    4+1 = 5

libfabric with InfiniBand
  Type        Parameter   Write Size
  void *      buf         8
  size_t      len
              desc
  fi_addr_t   dest_addr
              context
  Totals                  40

- Generic entry points result in additional memory reads/writes
- Interface parameters can force branches in the provider code
- Move operation flags into initialization code path for optimal SW paths

This analyzes the effect that the API has on the underlying implementation. Although this comparison is against libibverbs, the problem appears in many other APIs. It looks at the number of bytes of memory that an application must write in order to invoke the API to send a simple message. It also examines whether there are parameters in the API that essentially force a branch to occur in the underlying code. With libibverbs, we require writing an additional 44 bytes to memory in order to send a message, and the interface adds 5 branches in the underlying implementation. (An analysis of an actual provider shows that at a minimum 19 branches are actually taken, but at least 5 of those are the result of how the interface has been defined.)
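The byte counts quoted above can be rechecked with a few lines of arithmetic. The libibverbs figures are taken directly from the slide's totals; for libfabric, the sketch assumes a 64-bit platform where each of the five send parameters occupies 8 bytes, which is consistent with the slide's total of 40 but is not stated per parameter.

```python
import struct

# libfabric send parameters from the slide: buf, len, desc, dest_addr, context.
# Assumption: each occupies 8 bytes on a 64-bit platform.
FABRIC_PARAMS = ("buf", "len", "desc", "dest_addr", "context")
fabric_bytes = struct.calcsize("=" + "Q" * len(FABRIC_PARAMS))

# libibverbs figures are taken directly from the slide's totals row.
verbs_bytes = (16 + 60) + 8   # sge + send_wr, plus the slide's extra 8 bytes
extra_bytes = verbs_bytes - fabric_bytes
```

Here `extra_bytes` is the "additional 44 bytes" mentioned in the notes: 84 bytes written for libibverbs against 40 for libfabric.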
18
libibverbs with InfiniBand libfabric with InfiniBand
Memory Footprint: per peer addressing data

libibverbs with InfiniBand
  Type       Data     Size
  struct *   ibv_ah   8
  uint32     QPN      4
             QKey     4 [0]
                      24
  Total               36

libfabric with InfiniBand
  Type       Data        Size
  uint64     fi_addr_t

Map Address Vector: encodes peer address; direct mapping to HW command data
  IB data: DLID (2 bytes), SL (1 byte), QPN (3 bytes)
Index Address Vector: minimal footprint; requires lookup/calculation for peer address

Libfabric considered the impact of trying to scale to millions of peers. Its address vector concept is used to convert endpoint addresses into fabric-specific addresses. Address vectors may be shared between processes, which allows a single copy of all addresses to be shared among the ranks on a single system. It also allows a provider to greatly reduce the amount of storage required for all endpoints.

With a ‘map address vector’, each address seen by the user is 8 bytes. This may be a pointer into provider memory structures. In the best case, the 8 bytes encode the actual address as shown. This allows data transfer operations to place the encoded data directly into a command. No additional memory references are needed to send, which allows for minimal instructions in any transmit operation. The cost is that the app must store 8 bytes of data per remote endpoint.

The ‘index address vector’ allows the app to use a simple index to reference each remote endpoint. In this case, the app does not need to store any addressing data. However, the provider may need to use the index to look up the actual address of the endpoint. In some cases, the actual address may be calculated based on the index. This enables a very small memory footprint to access millions of peers, with the cost of performing the calculation on each send.
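The two address vector styles can be illustrated with a small sketch. The field widths (DLID 2 bytes, SL 1 byte, QPN 3 bytes) come from the slide; the bit ordering inside the 8-byte value, the function names, and the node/rank layout used for the index case are all assumptions made for the example, not what any provider actually does.

```python
DLID_BITS, SL_BITS, QPN_BITS = 16, 8, 24  # 2 + 1 + 3 bytes, per the slide


def encode_map_address(dlid, sl, qpn):
    """Pack IB addressing data into one integer that fits in 8 bytes."""
    if not (0 <= dlid < 1 << DLID_BITS and 0 <= sl < 1 << SL_BITS
            and 0 <= qpn < 1 << QPN_BITS):
        raise ValueError("field out of range")
    return (dlid << (SL_BITS + QPN_BITS)) | (sl << QPN_BITS) | qpn


def decode_map_address(addr):
    """Recover (dlid, sl, qpn) with shifts and masks only; no table lookup."""
    qpn = addr & ((1 << QPN_BITS) - 1)
    sl = (addr >> QPN_BITS) & ((1 << SL_BITS) - 1)
    dlid = addr >> (SL_BITS + QPN_BITS)
    return dlid, sl, qpn


def index_address(index, base_dlid, ranks_per_node, base_qpn, sl=0):
    """Index AV case: compute the peer address from its index (toy layout)."""
    node, local = divmod(index, ranks_per_node)
    return encode_map_address(base_dlid + node, sl, base_qpn + local)
```

The map style trades 8 bytes of application storage per peer for a lookup-free send path; the index style stores nothing per peer in the application but pays for a calculation like `index_address` on each send. At one million peers, 36 bytes per peer comes to 36 MB per process, against 8 MB for 8-byte map addresses.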
19
OFA Community Release Schedule
1.0 – Q1 2015
- Initial release: support for PSM, Verbs (IB/iWarp), usNIC, socket providers
- Quickly enable applications, mix of native and layered providers
1.1 – Q2 2015
- Bug fixes and provider enhancements
1.1.1 – Q3 2015
- Bug fix only release
1.2 – Q4 2015
- New providers: enhanced verbs, Omni-Path (PSM2), MXM, GNI
2016
- Interface extensions
20
OFIWG at SC ’15
Tutorial – Monday 1:30 – 5:00
- Detailed look at the libfabric interface
- Basic examples
- Middleware implementation reference: MPI, OpenSHMEM
BoF – Tuesday 1:30 – 3:00
- OFIWG, including the Data Storage/Data Access subgroup
- Initial collection of interface extensions
21
Legal Notices and Disclaimers
Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit Intel, the Intel logo, and others are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © 2015 Intel Corporation.
22
INTEL® HPC DEVELOPER CONFERENCE