Jon P. Maloy TIPC: Communication for Linux Clusters.


1 Jon P. Maloy jon.maloy@ericsson.com TIPC: Communication for Linux Clusters

2 NOKIA RESEARCH CENTER / BOSTON
TIPC Motivation
- ForCES
  - Efficient communication CE-FE, TIPC used as TML
  - IETF drafts: draft-maloy-tipc-01.txt, draft-maloy-tipc-tml-00.txt
- State synchronization across nodes
  - E.g. connection tracking migration
  - Reliable multicast support
  - Tight link supervision
- Efficient clustering of network devices
  - Has been used in Ericsson products for 8 years
  - Proven in the field

3 TIPC: Transparent Inter Process Communication
- A transport protocol specialized for single-node and cluster environments
- "Cluster-global Unix sockets" with a structured addressing scheme
- Supports both connection-oriented and connectionless communication
- Reliable and non-reliable multicast
- A framework for detecting, supervising and maintaining cluster topology
- Source code available from SourceForge under a dual BSD/GPL licence
- Not intrusive; small; no kernel changes required
- Code re-work ongoing to streamline for Linux
- Already adopted by several OSes in the telecom industry
- More to come

4 Why Another Protocol?
- TCP/SCTP
  - Too generic for efficient local communication, only connection-oriented
- UDP
  - Unreliable, no congestion control
- Unix sockets
  - Only single node, only connection-oriented
- What we wanted
  - One communication service with the speed of UDP/Unix sockets, the reliability of TCP, and the versatility of them all combined
  - Functional addressing
  - Extend address location transparency beyond the local node
  - Failure detection times at millisecond level, at least
  - A way to know when addresses become available/unavailable

5 What We Got
- Addressing location transparency
  - Powerful functional addressing scheme
  - The cluster can be seen as one single node
  - In all three communication modes
  - Selective transparency
- Lightweight, reactive connections
  - Immediate connection abortion at node/process failure or overload
- Performance
  - Directly on media (Ethernet, RapidIO...) when possible, otherwise on IP
  - 24-byte header for most messages
  - Numbers (slightly dated)
    - 80 % faster than loopback TCP
    - 35 % faster than inter-node TCP for short messages

6 And More…
- Congestion control at three levels
  - Connection level, signalling link level and media level
  - Based on 4 importance priorities
- Simple to configure
  - No configuration needed at all in single-node mode
  - Must set each node's identity for cluster mode operation, that is all
  - Automatic neighbour detection using multicast/broadcast
- Topology Subscription Service
  - Functional and physical topology
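
The "4 importance priorities" bullet on this slide can be illustrated with a small admission check. This is only a sketch of the general idea (less important messages are rejected first as a queue fills up); the level names, the `admit` function and the queue limits are hypothetical and are not taken from the TIPC sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Four importance levels, lowest first (names are illustrative). */
enum importance { IMP_LOW, IMP_MEDIUM, IMP_HIGH, IMP_CRITICAL, IMP_LEVELS };

/* Hypothetical per-level queue limits: higher importance tolerates a longer queue. */
static const unsigned queue_limit[IMP_LEVELS] = { 100, 200, 300, 400 };

/* Decide whether a message of the given importance may be queued. */
static bool admit(unsigned queue_len, enum importance imp)
{
    assert(imp < IMP_LEVELS);
    return queue_len < queue_limit[imp];
}
```

With a queue of 150 messages, `admit(150, IMP_LOW)` rejects while `admit(150, IMP_MEDIUM)` still accepts, so low-importance traffic is throttled before anything else.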

7 And More…
- Network redundancy
  - Can set each interface ("network plane") as active or standby
  - Can have up to 3 standby networks for one active
  - Networks need not be of the same type
- Network load sharing
  - Can set two interfaces active and two standby
- Neighbour supervision
  - "Lean" heartbeat scheme between nodes
  - Node failure detected within 500 ms, carrier failure detected immediately
- Scalability
  - Can handle clusters of up to hundreds of nodes

8 Functional View
[Diagram: layered functional view of a node; labels listed from the bearers upward]
- Bearers: TCP, Shared Memory, Ethernet, SCTP, DCCP
- Bearer Adapter API
- Node Internal: Sequence/Retransmission Control, Packet Bundling, Congestion Control, Fragmentation/De-fragmentation, Reliable Multicast, Neighbour Detection, Link Establish/Supervision/Failover, Address Table Distribution, Connection Supervision, Route/Link Selection, Address Subscription, Address Resolution
- User Adapter API
- Socket API Adapter, Port API Adapter, Custom API Adapters

9 Network Topology*
[Diagram labels: Zone, Cluster, Node, Slave Node, Internet/Intranet]
* Only single-cluster communication is supported in the current implementation

10 Functional Addressing: Unicast
- Function Address
  - Persistent, reusable 64-bit port identifier assigned by the user
  - Consists of a 32-bit function type number and a 32-bit instance number
- Function Address Sequence ("Partition")
  - Range of function addresses of the same function type
  - Consists of function type, lower bound, upper bound
[Diagram: Client Process calls sendto(type = foo, instance = 33); the message (foo,33) is delivered to Server Process, Partition A, bound with bind(type = foo, lower=0, upper=99); Server Process, Partition B is bound with bind(type = foo, lower=100, upper=199)]
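
The relation between a function address and a partition can be sketched as a table lookup. This is a minimal model, not TIPC's name table: `struct partition`, `server_id`, and both functions are hypothetical, and placing the type in the high 32 bits of the 64-bit identifier is an assumption (the slide only says the identifier consists of the two 32-bit numbers).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A partition: a range of instances of one function type,
   as in bind(type, lower, upper). */
struct partition {
    uint32_t type;
    uint32_t lower;
    uint32_t upper;
    int      server_id;   /* hypothetical handle for the bound socket */
};

/* Pack a function address into one 64-bit value (type in the high half,
   which is an assumption of this sketch). */
static uint64_t function_address(uint32_t type, uint32_t instance)
{
    return ((uint64_t)type << 32) | instance;
}

/* Return the server whose partition contains (type, instance), or -1. */
static int lookup_unicast(const struct partition *tab, size_t n,
                          uint32_t type, uint32_t instance)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (tab[i].type == type &&
            tab[i].lower <= instance && instance <= tab[i].upper)
            return tab[i].server_id;
    }
    return -1;
}
```

With the two bindings from the slide, a lookup of (foo, 33) resolves to the server bound to 0..99, and (foo, 150) to the server bound to 100..199.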

11 Unicast Code Example

/* client.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#define FOO      4711
#define INSTANCE 33

/* Helper used on the slide but not shown there; it is expected to block
   until the given name is bound, or until the timeout expires. */
void wait_for_server(struct tipc_name *name, int timeout);

int main(void)
{
    char buf[40] = "Hello World";
    struct sockaddr_tipc srv_addr;
    int sd = socket(AF_TIPC, SOCK_RDM, 0);

    memset(&srv_addr, 0, sizeof(srv_addr));
    srv_addr.family = AF_TIPC;               /* missing on the slide */
    srv_addr.addrtype = TIPC_ADDR_NAME;
    srv_addr.addr.name.name.type = FOO;
    srv_addr.addr.name.name.instance = INSTANCE;
    srv_addr.addr.name.domain = 0;

    printf("** TIPC client program started **\n\n");
    wait_for_server(&srv_addr.addr.name.name, 10000);

    /* Send connectionless "hello" message: */
    if (0 > sendto(sd, buf, strlen(buf) + 1, 0,
                   (struct sockaddr *)&srv_addr, sizeof(srv_addr))) {
        perror("Client: Failed to send");
        exit(1);
    }

    /* Receive the acknowledge: */
    if (0 >= recv(sd, buf, sizeof(buf), 0)) {
        perror("Unexpected response");
        exit(1);
    }
    printf("Client: Received response: %s \n", buf);
    printf("\n** TIPC client program finished **\n");
    return 0;
}

/* server.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#define FOO         4711
#define LOWER_BOUND 0
#define UPPER_BOUND 99

int main(void)
{
    int sd = socket(AF_TIPC, SOCK_RDM, 0);
    struct sockaddr_tipc partition_addr, client_addr;
    socklen_t alen = sizeof(client_addr);
    char inbuf[40], outbuf[40] = "Uh ?";

    memset(&partition_addr, 0, sizeof(partition_addr));
    partition_addr.family = AF_TIPC;
    partition_addr.addrtype = TIPC_ADDR_NAMESEQ;
    partition_addr.addr.nameseq.type = FOO;
    partition_addr.addr.nameseq.lower = LOWER_BOUND;
    partition_addr.addr.nameseq.upper = UPPER_BOUND;
    partition_addr.scope = TIPC_CLUSTER_SCOPE;

    printf("** TIPC server program started **\n");

    /* Make server available: */
    if (0 != bind(sd, (struct sockaddr *)&partition_addr,
                  sizeof(partition_addr))) {
        printf("Server: Failed to bind\n");
        exit(1);
    }

    /* Wait for one message and remember who sent it: */
    if (0 >= recvfrom(sd, inbuf, sizeof(inbuf), 0,
                      (struct sockaddr *)&client_addr, &alen)) {
        perror("Unexpected recv");
        exit(1);
    }
    printf("Server: Message received: %s !\n", inbuf);

    /* Reply to the sender's address: */
    if (0 > sendto(sd, outbuf, strlen(outbuf) + 1, 0,
                   (struct sockaddr *)&client_addr, sizeof(client_addr))) {
        perror("Server: Failed to send");
    }
    printf("\n** TIPC server program finished **\n");
    return 0;
}



14 Functional Addressing: Multicast
- Based on Function Address Sequences
- Any partition overlapping the range used in the destination address receives a copy of the message
- Client defines the "multicast group" per call
[Diagram: Client Process calls sendto(type = foo, lower = 33, upper = 133); the message (foo,33,133) reaches both Server Process, Partition A, bound with bind(type = foo, lower=0, upper=99), and Server Process, Partition B, bound with bind(type = foo, lower=100, upper=199)]
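
The overlap rule on this slide can be written down as a simple interval test. The sketch below is a model of the rule only, assuming inclusive bounds; `struct name_range` and both functions are hypothetical names, not part of the TIPC API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A function address sequence: type plus inclusive instance bounds. */
struct name_range {
    uint32_t type;
    uint32_t lower;
    uint32_t upper;
};

/* Two ranges overlap if they share a type and at least one instance. */
static int ranges_overlap(const struct name_range *a,
                          const struct name_range *b)
{
    return a->type == b->type &&
           a->lower <= b->upper && b->lower <= a->upper;
}

/* Store the indexes of all bound partitions that overlap dst in out[];
   returns how many were found (at most out_cap). */
static size_t multicast_targets(const struct name_range *bound, size_t n,
                                const struct name_range *dst,
                                size_t *out, size_t out_cap)
{
    size_t count = 0;

    assert(bound != NULL && dst != NULL && out != NULL);
    for (size_t i = 0; i < n && count < out_cap; i++) {
        if (ranges_overlap(&bound[i], dst))
            out[count++] = i;
    }
    return count;
}
```

For the slide's example, the destination (foo, 33, 133) overlaps both (foo, 0, 99) and (foo, 100, 199), so both servers receive a copy.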

15 Multicast Code Example

/* client.c */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/tipc.h>

#define FOO         4711
#define LOWER_BOUND 33
#define UPPER_BOUND 133

/* Helper used on the slide but not shown there; it is expected to block
   until the given name is bound, or until the timeout expires. */
void wait_for_server(struct tipc_name *name, int timeout);

int main(void)
{
    char buf[40] = "Hello World";
    struct sockaddr_tipc mcast_group;
    int sd = socket(AF_TIPC, SOCK_RDM, 0);

    memset(&mcast_group, 0, sizeof(mcast_group));
    mcast_group.family = AF_TIPC;            /* missing on the slide */
    mcast_group.addrtype = TIPC_ADDR_NAMESEQ;
    mcast_group.addr.nameseq.type = FOO;
    mcast_group.addr.nameseq.lower = LOWER_BOUND;
    mcast_group.addr.nameseq.upper = UPPER_BOUND;

    printf("** TIPC client program started **\n\n");
    /* As on the slide: (type, lower) is read as a name through the union. */
    wait_for_server(&mcast_group.addr.name.name, 10000);

    /* Send connectionless "hello" message to the whole range: */
    if (0 > sendto(sd, buf, strlen(buf) + 1, 0,
                   (struct sockaddr *)&mcast_group, sizeof(mcast_group))) {
        perror("Client: Failed to send");
        exit(1);
    }

    /* Receive one acknowledge: */
    if (0 >= recv(sd, buf, sizeof(buf), 0)) {
        perror("Unexpected response");
        exit(1);
    }
    printf("Client: Received response: %s \n", buf);
    printf("\n****** TIPC client program finished ******\n");
    return 0;
}

/* server.c: same program as the server.c on slide 11 (binds foo, 0..99,
   receives one message and replies to the sender). */


17 Address Location Transparency
- Location of server not known by client
- Lookup of physical destination performed on-the-fly
- Efficient, no secondary messaging involved
[Diagram: Client Process calls sendto(type = foo, lower = 33, upper = 133); the message (foo,33,133) reaches Server Process, Partition A, bound with bind(type = foo, lower=0, upper=99), and Server Process, Partition B, bound with bind(type = foo, lower=100, upper=199); one node shown]

18 Address Location Transparency
[Diagram build: same bullets and calls as slide 17, with two nodes shown]

19 Address Location Transparency
[Diagram build: same bullets and calls as slide 17, with three nodes shown]

20 Address Binding
- Many sockets may bind to the same partition
- Closest-First or Round-Robin algorithm chosen by client
[Diagram: Client Process calls sendto(type = foo, lower = 33, upper = 133); Server Process, Partition A and Server Process, Partition A' are both bound with bind(type = foo, lower=0, upper=99)]

21 Address Binding
- Many sockets may bind to the same partition
- Closest-First or Round-Robin algorithm chosen by client
- Same socket may bind to many partitions
[Diagram: Client Process calls sendto(type = foo, lower = 33, upper = 133); Server Process, Partition B is bound with bind(type = foo, lower=100, upper=199); Server Process, Partition A+B' is bound with bind(type = foo, lower=0, upper=99) and bind(type=foo, lower=100, upper=199)]

22 Address Binding
- Many sockets may bind to the same partition
- Closest-First or Round-Robin algorithm chosen by client
- Same socket may bind to many partitions
- Same socket may bind to different functions
[Diagram: Client Process calls sendto(type = foo, lower = 33, upper = 133); Server Process, Partition B is bound with bind(type = foo, lower=100, upper=199); Server Process, Partition A is bound with bind(type = foo, lower=0, upper=99) and bind(type=bar, lower=0, upper=999)]
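
When several sockets are bound to the same partition, the client's choice between Round-Robin and Closest-First decides which one receives a unicast message. The following sketch models both policies over a flat binding table; the structure, the `node` and `socket_id` fields, and the interpretation of "closest" as "on the sender's own node" are assumptions for illustration, not TIPC's actual implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct binding {
    uint32_t type, lower, upper; /* bound partition */
    uint32_t node;               /* node hosting the bound socket */
    int      socket_id;          /* hypothetical socket handle */
};

static int matches(const struct binding *b, uint32_t type, uint32_t instance)
{
    return b->type == type && b->lower <= instance && instance <= b->upper;
}

/* Round-robin: rotate through the matching bindings.
   *cursor must persist between calls. Returns -1 if nothing matches. */
static int pick_round_robin(const struct binding *tab, size_t n,
                            uint32_t type, uint32_t instance, size_t *cursor)
{
    assert(cursor != NULL);
    for (size_t step = 0; step < n; step++) {
        size_t i = (*cursor + step) % n;
        if (matches(&tab[i], type, instance)) {
            *cursor = (i + 1) % n;
            return tab[i].socket_id;
        }
    }
    return -1;
}

/* Closest-first: prefer a matching binding on the sender's own node,
   otherwise fall back to the first match found elsewhere. */
static int pick_closest_first(const struct binding *tab, size_t n,
                              uint32_t type, uint32_t instance,
                              uint32_t own_node)
{
    int fallback = -1;

    for (size_t i = 0; i < n; i++) {
        if (!matches(&tab[i], type, instance))
            continue;
        if (tab[i].node == own_node)
            return tab[i].socket_id;
        if (fallback < 0)
            fallback = tab[i].socket_id;
    }
    return fallback;
}
```

Round-robin spreads load evenly over equivalent servers, while closest-first avoids inter-node traffic whenever a local server exists.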

23 Functional Topology Subscription
- Function Address/Address Partition bind/unbind events
[Diagram: Client Process calls subscribe(type = foo, lower = 0, upper = 500); Server Process, Partition A binds (foo,0,99) and Server Process, Partition B binds (foo,100,199); the client receives events for foo,0,99 and foo,100,199]
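
The slide's subscribe() call is shorthand. In the Linux TIPC socket API as it exists today, which may differ from the code base described in this talk, a subscription is a `struct tipc_subscr` sent over a SOCK_SEQPACKET connection to the topology service (name type and instance `TIPC_TOP_SRV`), and events come back as `struct tipc_event`. The sketch below only fills in the request and names the event codes; `make_subscription` and `event_name` are hypothetical helpers.

```c
#include <assert.h>
#include <string.h>
#include <linux/tipc.h>

/* Build a subscription for all bindings of `type` overlapping
   [lower, upper], with no expiry. */
static struct tipc_subscr make_subscription(unsigned int type,
                                            unsigned int lower,
                                            unsigned int upper)
{
    struct tipc_subscr s;

    assert(lower <= upper);
    memset(&s, 0, sizeof(s));
    s.seq.type  = type;
    s.seq.lower = lower;
    s.seq.upper = upper;
    s.timeout   = TIPC_WAIT_FOREVER;
    s.filter    = TIPC_SUB_SERVICE;  /* service-level events only */
    return s;
}

/* Translate the event code of a received struct tipc_event. */
static const char *event_name(unsigned int event)
{
    switch (event) {
    case TIPC_PUBLISHED:      return "bound";
    case TIPC_WITHDRAWN:      return "unbound";
    case TIPC_SUBSCR_TIMEOUT: return "timeout";
    default:                  return "unknown";
    }
}
```

For the slide's example, `make_subscription(FOO, 0, 500)` would be sent to the topology service, and each received event reports the affected range in its `found_lower` and `found_upper` fields.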

24 Network Topology Subscription
- Node/Cluster/Zone availability events
- Same mechanism as for functional events
[Diagram: Client Process calls subscribe(type = node, lower = 0x1001000, upper = 0x1001009); TIPC on one node performs bind(type = node, lower=0x1001003, upper=0x1001003) and TIPC on another node performs bind(type = node, lower=0x1001002, upper=0x1001002); the client receives events for node,0x1001003 and node,0x1001002]
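
The hexadecimal values in the diagram are network addresses. Assuming the 8/12/12-bit zone/cluster/node layout used by the address helpers in the Linux `linux/tipc.h` header, 0x1001003 reads as zone 1, cluster 1, node 3. The sketch below decodes and encodes such addresses; `struct tipc_zcn` and the two functions are hypothetical names written for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Decomposed <zone.cluster.node> address. */
struct tipc_zcn {
    unsigned zone;     /* upper 8 bits   */
    unsigned cluster;  /* middle 12 bits */
    unsigned node;     /* lower 12 bits  */
};

static struct tipc_zcn decode_addr(uint32_t addr)
{
    struct tipc_zcn a;

    a.zone    = addr >> 24;
    a.cluster = (addr >> 12) & 0xfff;
    a.node    = addr & 0xfff;
    return a;
}

static uint32_t encode_addr(unsigned zone, unsigned cluster, unsigned node)
{
    assert(zone <= 0xff && cluster <= 0xfff && node <= 0xfff);
    return ((uint32_t)zone << 24) | ((uint32_t)cluster << 12) | node;
}
```

Under this layout, the subscription range 0x1001000 to 0x1001009 on the slide covers nodes 0 through 9 of cluster 1 in zone 1.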

25 Connections
- Establishment based on functional addressing
- Selectable lookup algorithm, partitioning, redundancy etc.
- Lightweight
- End-to-end flow control
- SOCK_STREAM/SOCK_SEQPACKET in connection-oriented mode
  - Mutually compatible

26 Connection Setup
- No protocol messages exchanged during setup/shutdown
- Only payload-carrying messages
[Diagram: Client Process calls sendto(type = foo, instance = 117); the message (foo,117) reaches Server Process, Partition B]

27 Connection Setup
[Diagram build: Server Process, Partition B calls lconnect(client) and send() towards Client Process]

28 Connection Setup
[Diagram build: Client Process calls lconnect(server) towards Server Process, Partition B]

29 Connection Shutdown
- No protocol messages exchanged during setup/shutdown
- Only payload-carrying messages
[Diagram: disconnect() between Client Process and Server Process, Partition B]


31 Connection Setup/Shutdown
- Well-known TCP-style connect/shutdown with exchange of SYN and FIN messages is available as an alternative
[Diagram: Server Process, Partition B calls bind(), listen(), accept(); Client Process calls connect(type=foo, instance=117), which sends SYN (foo,117)]

32 Connection Abortion
- Immediate "abortion" event in case of peer process crash
[Diagram: abort between Client Process and Server Process, Partition B]

33 Connection Abortion
- Immediate "abortion" event in case of peer node crash
[Diagram: abort between Client Process and Server Process, Partition B; Node]

34 Connection Abortion
- Immediate "abortion" event in case of communication failure
[Diagram: abort between Client Process and Server Process, Partition B; Node]

35 Connection Abortion
- Immediate abortion in case of node overload
[Diagram: abort between Client Process and Server Process, Partition B; Node]

36 Connection Flow Control
- End-to-end send window of N messages slows the sender process in case of receiver process overload
- Acknowledge sent from receiver after every N/2 messages
- Sender socket keeps only a counter, not a retransmission buffer
[Diagram: Acknowledge flowing from Server Process, Partition B back to Client Process; Node]

37 Signalling Links
- Retransmission protocol and congestion control at signalling link level
- Transmitted packets are acknowledged/released by any packet from the other node
- Packet losses detected and retransmission performed earlier
- Packets from different sources are bundled in the same buffer in case of congestion
- Packet flow more traffic-driven, no need for timers per socket or message
[Diagram: Client Processes and Server Processes, Partition B, communicating over the link between nodes]

38 Network Load Sharing
- One link per node pair and interface
- Typically two links per node pair, for full load sharing and redundancy
[Diagram: Client Processes and Server Processes, Partition B, communicating over the links between nodes]

39 Network Redundancy
- Smooth failover in case of single link failure, with no consequences for user-level connections
- Each link supervised by conditional heartbeats, i.e. sent only when there is no other traffic
[Diagram: Client Processes and Server Processes, Partition B, communicating over the links between nodes]

40 Code Status
- Initial release for Linux
  - Feedback (S. Hemminger, Jamal) was that we have to do some re-work
    - Memory handling, buffer handling, locking policy, socket interface, management protocol/interface…
- All issues addressed, but not all checked in at SF yet
  - New, fully POSIX-compliant socket interface/implementation
  - More conventional use of buffers (performance…)
- Reliable multicast needs more testing
- Still not fully ready for inclusion in kernel, but we are close…

41 Short Term Goals
- End of August: kernel ready
  - Reliable multicast fully tested
  - New socket implementation finished and tested
  - Netlink-based management/configuration protocol finished and tested
    - All ioctl() calls replaced

42 Long Term Goals
- Multi-cluster functionality
  - Mostly user space
  - Automatic inter-cluster neighbour discovery and link setup
  - Fully manual inter-cluster link setup
  - Guaranteeing name table consistency between clusters
  - Slave node name table reduction
- Additional bearers
  - Dynamic registration of "bearers" from user space (e.g. TCP, DCCP)
- Distributed netlink ??

43 http://tipc.sourceforge.net

44 QUESTIONS ??

