ZeroMQ-ofi sketch Kayla Seager.

Slides:



Advertisements
Similar presentations
Introduction to Sockets Jan Why do we need sockets? Provides an abstraction for interprocess communication.
Advertisements

Interprocess Communication CH4. HW: Reading messages: User Agent (the user’s mail reading program) is either a client of the local file server or a client.
Remote Procedure Call (RPC)
© 2007 Cisco Systems, Inc. All rights reserved.Cisco Public 1 Version 4.0 OSI Transport Layer Network Fundamentals – Chapter 4.
Socket Programming.
Inter Process Communication:  It is an essential aspect of process management. By allowing processes to communicate with each other: 1.We can synchronize.
Sockets CIS 370 Fall 2009, UMassD. Introduction  Sockets provide a simple programming interface which is consistent for processes on the same machine.
Fall 2000Datacom 11 Lecture 4 Socket Interface Programming: Service Interface between Applications and TCP.
Chapter 17 Networking Dave Bremer Otago Polytechnic, N.Z. ©2008, Prentice Hall Operating Systems: Internals and Design Principles, 6/E William Stallings.
Institute of Computer and Communication Network Engineering OFC/NFOEC, 6-10 March 2011, Los Angeles, CA Lessons Learned From Implementing a Path Computation.
LWIP TCP/IP Stack 김백규.
LWIP TCP/IP Stack 김백규.
CMPT 471 Networking II Address Resolution IPv4 ARP RARP 1© Janice Regan, 2012.
Orbited Scaling Bi-directional web applications A presentation by Michael Carter
Chapter 6-2 the TCP/IP Layers. The four layers of the TCP/IP model are listed in Table 6-2. The layers are The four layers of the TCP/IP model are listed.
7/26/ Design and Implementation of a Simple Totally-Ordered Reliable Multicast Protocol in Java.
Dr. John P. Abraham Professor University of Texas Pan American Internet Applications and Network Programming.
Windows Network Programming ms-help://MS.MSDNQTR.2004JAN.1033/winsock/winsock/windows_sockets_start_page_2.htm 井民全.
Chapter 2 Applications and Layered Architectures Sockets.
The Socket Interface Chapter 21. Application Program Interface (API) Interface used between application programs and TCP/IP protocols Interface used between.
Network Programming Eddie Aronovich mail:
CPSC 441 TUTORIAL – FEB 13, 2012 TA: RUITNG ZHOU UDP REVIEW.
1 Kyung Hee University Chapter 8 ARP(Address Resolution Protocol)
The Client-Server Model And the Socket API. Client-Server (1) The datagram service does not require cooperation between the peer applications but such.
Part 4: Network Applications Client-server interaction, example applications.
FCM Workflow using GCM.
Socket Programming.
AMQP, Message Broker Babu Ram Dawadi. overview Why MOM architecture? Messaging broker like RabbitMQ in brief RabbitMQ AMQP – What is it ?
Networks, Part 2 March 7, Networks End to End Layer  Build upon unreliable Network Layer  As needed, compensate for latency, ordering, data.
Netprog: Client/Server Issues1 Issues in Client/Server Programming Refs: Chapter 27.
RDA3 Transport Joel Lauener on behalf of the CMW team 26th June, 2013
TCP/IP1 Address Resolution Protocol Internet uses IP address to recognize a computer. But IP address needs to be translated to physical address (NIC).
ZeroMQ Chapter 4 Reliable Request-Reply Patterns
Using ZeroMQ for GEP. 2 About ZeroMQ The “zero” in ZeroMQZeroMQ  Zero Broker  Zero Latency (Low Latency)  Zero Administration  Zero Cost – Cross Platform.
Chapter 9: Transport Layer
Last Class: Introduction
Module 12: I/O Systems I/O hardware Application I/O Interface
Instructor Materials Chapter 9: Transport Layer
CSCE 313 Network Socket MP8 DUE: FRI MAY 5, 2017
LWIP TCP/IP Stack 김백규.
Frame Relay lab1.
Data Transport for Online & Offline Processing
Chapter 8 ARP(Address Resolution Protocol)
Chapter 3 Internet Applications and Network Programming
Fabric Interfaces Architecture – v4
Distribution and components
Hubs Hubs are essentially physical-layer repeaters:
CHAPTER 3 Architectures for Distributed Systems
Socket Programming in C
Client/Server Example
Introduction of Transport Protocols
Transport Layer Our goals:
Inter Process Communication (IPC)
Operating System Concepts
CS703 - Advanced Operating Systems
Distributed Program Design
Issues in Client/Server Programming
TRANSMISSION CONTROL PROTOCOL
CPEG514 Advanced Computer Networkst
Chapter 5 Transport Layer Introduction
Building A Network: Cost Effective Resource Sharing
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
Computer Networking A Top-Down Approach Featuring the Internet
Chapter 5 Transport Layer Introduction
Chapter 13: I/O Systems I/O Hardware Application I/O Interface
OSI Reference Model Unit II
OSI Model 7 Layers 7. Application Layer 6. Presentation Layer
Exceptions and networking
Module 12: I/O Systems I/O hardwared Application I/O Interface
Network Basics and Architectures Neil Tang 09/05/2008
Presentation transcript:

ZeroMQ-ofi sketch Kayla Seager

Purpose of discussion Explore requirements of ZeroMQ (“sockets library”) ZeroMQ: widely used in AI/enterprise Identify Mismatches between ZeroMQ and Libfabric Focus on a commonly used subset of ZMQ API Brainstorm Libfabric improvements

Zeromq “Asynchronous sockets library with load balancing”

What is ZeroMQ? Socket-like interface but builds features on top of it… Multiple Connections per socket Async communication model: fire and forget Background CM: listen/accept Message-based (vs stream) Some built-in load balancing (defined by socket type)

ZMQ Objects & Mapping to OFI ZMQ CTX FI_DOMAIN ZMQ Socket fd ZMQ Socket fd FI_EP Connection fd Connection fd Connection fd Connection fd Ctx: only zeromq object that can be safely shared between threads

Example Code Server: Client: 1) void *context = zmq_ctx() 2) void *rep_sock = zmq_socket(context, ZMQ_REP) 3) zmq_bind(rep_sock, “tcp://*:4040”) 4) zmq_recv(rep_sock, buffer, 6, flag) Client: 2) void *req_sock = zmq_socket(context, ZMQ_REQ) 3) zmq_connect(req_sock, “tcp://localhost:4040”) 4) zmq_send(req_sock, “hello”, 6, flag)

Example Code: What’s missing? Server: 4 zmq_recv(rep_sock, buffer, 6, flag) Client: zmq_send(req_sock, “hello”, 6, flag) No destination address provided! One socket <--> multiple connections how does it pick the connection?

Example Code: Server: Client: Learning curve… zmq_socket(context, ZMQ_REP) Client: zmq_socket(context, ZMQ_REQ) Socket type determines connection selection Learning curve…

Socket types/Message Patterns Definition: how and when to use connections Example: Request-Reply: (ZMQ_REQ/ZMQ_REP) Load balancing Dealer/Router Pub-Sub: broadcast to all connections Exclusive Pair: only one connection Different sockets with the right type can connect to each other Only going to talk about REQ/REP/DEALER/ROUTER

REQ/REP: synchronous Send/Recv Requires a single REQ and REP socket Synchronous send/recv Will wait for recv from socket it sent to Round Robin load balancing 1 SockAREP Wait for SockA recv Sock REQ REQ.send(msg1) REQ.send(msg2) SockBREP 2 REQ knows exactly which pipe to expect the reply from Last peer = last person I communicated with since its sync you know who that is =)

Round Robin Send send one message to each destination in round robin order send msg1 send msg2 send msg3 2 1 3 dest1 dest2 dest3 dest1 dest2 dest3 dest1 dest2 dest3

1 2 3 Fair Queue Recv MsgQ Dest1 MsgQ Dest2 MsgQ Dest3 Read one message from each destination in round robin order Recv(msg1, dest1) Recv(msg2, dest2) Recv(msg3, dest3) MsgQ Dest1 1 2 3 MsgQ Dest2 MsgQ Dest3

Dealer/Router Aynchronous (REP/REQ) Round Robin load balancing send and receive! Router: uses ID for connections

Special case: Router and IDs Wait you can set the Destination? (on Router Send only) Dealer/Client: (connection to Router) Can “address” your connection via ID’s can set ID (else ZMQ will set one for you) Router and ID management: Specify Send via connection ID Expects first “message” to be ID uses it for connection look-up Receive Round robins connections Chosen connection: look-up which ID, First receive is ID, then message

Special case: Router and IDs Client: set ID zmq_setsockopt( client_sock, ZMQ_IDENTITY, ClientA_ID…) zmq_connect( client_sock, Router_address ) 3) zmq_recv(msg_for_ClientA) 4) zmq_send(msg_for_Router) Router Server: 1) zmq_send(ClientA_ID, MORE) 2) zmq_send(msg_for_ClientA) 5) zmq_recv(ClientA_ID) 6) zmq_recv(msg_for_Router) ID is only used at Router Socket layer – not transmitted

ZMQ Architecture: Single Socket User’s ZMQ socket_fd Front End: Message Pattern Protocol implementation Front end ZMQ_REP/REQ… Put/recv message on/from queue Signal backend Back end “fi_send/fi_recv” Signal front end CM polling Message Queue Message Queue “Pipes” created per connection Control messaging Back End: Transport/CM Poll(fds..) network

Overall ZMQ goal: build systems Make asynchronous sockets “easy” ZMQ sockets are “lego blocks” for messaging systems Not constrained to any particular system -can do broker or brokerless Not constrained to any one message system (like other MQ systems)

Case Study: MXNet-Pslite Model: AI Framework ZMQ API usage Router/Dealer socket type Msg API (send/recv) MORE flag Dealer Dealer Dealer Dealer Dealer Dealer Dealer Dealer Dealer process 1 process 2 process 3 Router Router Router bind bind bind connect connect connect Dealer Dealer Dealer Node 4 Router Focusing on this API set Essentially ps-lite is devising their own ID-connection scheme and overlaying that on the dealer sockets to determine who to send to

ZeroMQ – OFI mismatches

Related Work note Alice – FairMQ - Nanomsg Nanomsg: refactored ZeroMQ Pluggable transports Nanomsg-Libfabric (usNIC target) PR for true Zero-Copy support Can’t reuse existing FD based solns

ZMQ Architecture Not a great fit for Libfabric User’s ZMQ socket_fd ZMQ Architecture Front End: Message Pattern Protocol implementation Not a great fit for Libfabric We already have async communication… Asynchronous progress Message queues Are we only missing the message patterns? no…. Message Queue Message Queue “Pipes” created per connection Control messaging Back End: Transport/CM Poll(fds..) network

ZMQ Semantic mismatches for Libfabric Multi-connection “endpoints” Dynamic Process management Buffered receive Peer-to-peer flow control Shared memory solution

Multi-connection “endpoints” One endpoint: multiple connect oriented connections? mapping to connectionless FI_EP_RDM or single connection FI_EP_MSG? It is multi-connection per socket…

2. Dynamic Process management Back End: Transport/CM Poll(fds..) network

Need to queue incoming receives CM Problem statement 1 Need Server/Client name Exchange Can’t solely use ZMQ “CM” calls Need to be able to go from a bind->send Creating and destroying connections at any time Can’t have CM send/recv interfere with messaging Need a dedicated separate CM channel Can’t have a recv(any) interfering with the routing/scheduling algorithm Must resolved remote connection before send w/o recv help Need to queue incoming receives

CM Problem statement 2 Utility CM: -need timeout if client tries to resolve server address before server is started

3. Buffered receive ZMQ Buffering Requirement -forces buffer to come from transport Zmq_msg_t msg create_buffer()

ZMQ Buffering Requirement ZMQ_MSG_API Requires usage of zmq_msg_t (internal to transport buffer) User responsible for create/destroy MORE flag send/recv API has “MORE” flag capability Multiple send/recv treat as send/recv single message MORE = multi-sends/recv = 1 message

ZMQ Buffer: ZMQ_MSG API zmq_msg_t: buffer “handle” Asks ZMQ to provide buffer But user decides on lifetime of ptr (malloc/free) Example: Send

ZMQ Buffer: ZMQ_MSG API Example: Recv Void * context = Zmq_ctx() Void * rep_sock = Zmq_socket(context, ZMQ_REP) Zmq_bind(rep_sock, “tcp://*:4040”) Zmq_msg_t msg Zmq_msg_init(msg) Zmq_msg_recv(&rep_sock, msg, flag) , Asked ZMQ to buffer without knowing size

ZMQ Buffer: MORE flag cont. Transport implications: multi-msg as “one message” must have local completion of segments Must buffer “iovec” segments User API: Parts are sent separately and received separately Must receive all or none of the message parts buf1 buf2 buf3 -even in msg case they can free buffer Main point: atomic send/recv but also local completion = must buffer chunks on send AND recv Zmq_recv(buf2, MORE) Zmq_recv(buf3, 0) Zmq_recv(buf1, MORE)

ZMQ Buffer: MORE flag treat as single fi_send ZMQ tells user if there is more to receive

ZMQ Buffer: Summary Need buffering for zmq_msg_t handle receive side: user won’t provide Length Buffer Libfabric Options FI_PEEK helps, but no buffer support Buffered send/recv iovec? “buffered” recv?

4. Peer-to-peer flow control Implementing Router/Dealer socket type: ->Requirement comes out of load balancing support 1 2 3 MsgQ Dest1 MsgQ Dest2 MsgQ Dest3

Router Requirements: ID & FQ ID management Create ID’s for sockets (sockopt) Map to connections Send/Recv Send: ID lookup Receive: Fair queuing Return ID add one connection at a time

Router Requirements: flow control Loop over active connections in Round Robin fashion If queue is either empty or full (unused or overused), deactivate Atomic swap into deactivated index reactivated by backend High water mark: relies on TCP flow control (full queue) Logical End of active connections Round Robin: Message in queue? Connection1 MsgQ Connection2 MsgQ Connection3 MsgQ Connection4 MsgQ Tracks HWM, Send side: if a pipe hits high water mark swap to deactives it (relies on TCP flow control) If a recv queue gets full need to back off sender Recv: reactivate sender if back at lwm

5. shared memory solution: Extent of ZeroMQ transport support Unicast (can do all protocols) TCP IPC (inter-process communication) TIPC Cluster IPC with socket interface INPROC Inter-thread communication Multicast (only PUB/SUB) EPGM PGM NORM Engines Stream Engine Shared Memory EPGM? PGM Norm Engine

Summary ZMQ mismatches for Libfabric Multi-connection “endpoints” Dynamic Process management Buffered receive Peer-to-peer flow control Shared memory solution

Thank you! Feedback: RabbitMQ etc same requirements? How wide are these requirements? How often flow control? Why do Apps do X?

Case Study: AI Framework MXNet-Pslite Model: Per process resources N x Dealer sockets Send only 1 Router socket Recv only Dealers connect to one Router Dedicated connection Router receive Fair queuing all incoming recvs Dealer Dealer Dealer Dealer Dealer Dealer Dealer Dealer Dealer process 1 process 2 process 3 Router Router Router bind bind bind connect connect connect Dealer Dealer Dealer Node 4 Router Essentially ps-lite is devising their own ID-connection scheme and overlaying that on the dealer sockets to determine who to send to

How does it compare to other MQ systems? Pro: Brokerless higher throughput/latency More flexible in message model options Messaging library Con: Static routing (always RR) Learning curve Harder to build complex systems