Published by Simon Johns. Modified over 8 years ago.
1
TIPC: Communication for Linux Clusters
Jon P. Maloy, jon.maloy@ericsson.com
2
NOKIA RESEARCH CENTER / BOSTON
TIPC Motivation
- ForCES
  - Efficient communication CE-FE, TIPC used as TML
  - IETF drafts: draft-maloy-tipc-01.txt, draft-maloy-tipc-tml-00.txt
- State synchronization across nodes
  - E.g. connection tracking migration
  - Reliable multicast support
  - Tight link supervision
- Efficient clustering of network devices
  - Has been used in Ericsson products for 8 years
  - Proven in the field
3
TIPC: Transparent Inter Process Communication
- A transport protocol specialized for single-node and cluster environments
- “Cluster-global Unix sockets” with a structured addressing scheme
- Supports both connection-oriented and connectionless communication
- Reliable and non-reliable multicast
- A framework for detecting, supervising and maintaining cluster topology
- Source code available from SourceForge under dual BSD/GPL licence
  - Not intrusive; small; no kernel changes required
  - Code re-work ongoing to streamline for Linux
- Adopted by several OSes in the telecom industry already
  - More to come
4
Why Another Protocol?
- TCP/SCTP
  - Too generic for efficient local communication, only connection-oriented
- UDP
  - Unreliable, no congestion control
- Unix sockets
  - Only single node, only connection-oriented
What We Wanted
- One communication service with the speed of UDP/Unix sockets, the reliability of TCP, and the versatility of them all combined
- Functional addressing
- Address location transparency extended beyond the local node
- Failure detection times at the millisecond level, at least
- A way to know when addresses become available/unavailable
5
What We Got
- Addressing
  - Location transparency
  - Powerful functional addressing scheme
  - The cluster can be seen as one single node, in all three communication modes
  - Selective transparency
- Lightweight, reactive connections
  - Immediate connection abortion at node/process failure or overload
- Performance
  - Directly on media (Ethernet, RapidIO...) when possible, otherwise on IP
  - 24-byte header for most messages
  - Numbers (slightly dated): 80 % faster than loopback TCP, 35 % faster than inter-node TCP for short messages
6
And More…
- Congestion control at three levels
  - Connection level, signalling link level and media level
  - Based on 4 importance priorities
- Simple to configure
  - No configuration needed at all in single-node mode
  - Must set each node’s identity for cluster mode operation, that is all
  - Automatic neighbour detection using multicast/broadcast
- Topology subscription service
  - Functional and physical topology
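One way to picture congestion control based on four importance priorities is a queue with a separate length limit per priority, so that low-importance traffic is rejected first as the queue fills. The sketch below is only a model of that idea: the enum names, `admit_message`, and the limit values are assumptions for illustration and are not taken from the TIPC source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Four importance levels, lowest first (names are illustrative). */
enum importance { IMP_LOW, IMP_MEDIUM, IMP_HIGH, IMP_CRITICAL, IMP_LEVELS };

/* Assumed queue-length limits per level; real limits are implementation-defined. */
static const size_t queue_limit[IMP_LEVELS] = { 50, 100, 150, 200 };

/* Return true if a message of the given importance may be queued
 * when `queued` messages are already waiting. */
bool admit_message(enum importance imp, size_t queued)
{
    assert(imp < IMP_LEVELS);
    return queued < queue_limit[imp];
}
```

With these assumed limits, a queue holding 150 messages still accepts critical traffic while rejecting everything of lower importance.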
7
And More…
- Network redundancy
  - Can set each interface (“network plane”) as active or standby
  - Can have up to 3 standby networks for one active
  - Networks need not be of the same type
- Network load sharing
  - Can set two interfaces active and two standby
- Neighbour supervision
  - “Lean” heartbeat scheme between nodes
  - Node failure detected within 500 ms, carrier failure detected immediately
- Scalability
  - Can handle clusters of up to hundreds of nodes
8
Functional View (node internal)
Diagram labels:
- Bearers: TCP, Shared Memory, Ethernet, SCTP, DCCP
- Bearer Adapter API
- Sequence/Retransmission Control, Packet Bundling, Congestion Control, Fragmentation/De-fragmentation, Reliable Multicast
- Neighbour Detection, Link Establish/Supervision/Failover
- Address Table Distribution, Connection Supervision, Route/Link Selection
- Address Subscription, Address Resolution
- User Adapter API
- Socket API Adapter, Port API Adapter, Custom API Adapters
9
Network Topology*
Diagram labels: Zone, Cluster, Node, Slave Node, Internet/Intranet
* Only single-cluster communication supported in current implementation
10
Functional Addressing: Unicast
- Function address
  - Persistent, reusable 64-bit port identifier assigned by user
  - Consists of a 32-bit function type number and a 32-bit instance number
- Function address sequence (“partition”)
  - Range of function addresses of the same function type
  - Consists of function type, lower bound, upper bound
Diagram:
- Server process, partition A: bind(type = foo, lower=0, upper=99)
- Server process, partition B: bind(type = foo, lower=100, upper=199)
- Client process: sendto(type = foo, instance = 33), shown as foo,33
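The relationship between a function address and a partition can be sketched in a few lines of C. This is a stand-alone model, not the TIPC data structures: the struct and function names are hypothetical, and placing the type in the upper 32 bits of the 64-bit identifier is an assumption made for illustration. The value 4711 for `foo` is borrowed from the code example later in the deck.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A function address: 32-bit function type plus 32-bit instance. */
struct func_addr { uint32_t type; uint32_t instance; };

/* A partition (function address sequence): one type, inclusive bounds. */
struct partition { uint32_t type; uint32_t lower; uint32_t upper; };

/* Pack type and instance into one 64-bit identifier.
 * The bit order (type in the high half) is an assumption. */
uint64_t func_addr_pack(struct func_addr a)
{
    return ((uint64_t)a.type << 32) | a.instance;
}

/* True if a message sent to `a` can be delivered to a socket bound to `p`. */
bool partition_serves(struct partition p, struct func_addr a)
{
    assert(p.lower <= p.upper);
    return p.type == a.type && a.instance >= p.lower && a.instance <= p.upper;
}
```

With the slide's numbers, the address foo,33 falls inside the partition 0..99 and outside 100..199, so only the first server is a candidate destination.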
11
Unicast Code Example

    /* client.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    #define FOO 4711
    #define INSTANCE 33

    /* Helper not shown on the slide; this prototype is assumed. It is
       expected to block until the name is published or the wait expires. */
    void wait_for_server(struct tipc_name *name, int wait);

    int main(void)
    {
        struct sockaddr_tipc srv_addr;
        char buf[40] = "Hello World";
        int sd = socket(AF_TIPC, SOCK_RDM, 0);

        if (sd < 0) {
            perror("Client: Failed to create socket");
            exit(1);
        }

        /* Address the server by function address, not by location. */
        memset(&srv_addr, 0, sizeof(srv_addr));
        srv_addr.family = AF_TIPC;
        srv_addr.addrtype = TIPC_ADDR_NAME;
        srv_addr.addr.name.name.type = FOO;
        srv_addr.addr.name.name.instance = INSTANCE;
        srv_addr.addr.name.domain = 0;

        printf("** TIPC client program started **\n\n");
        wait_for_server(&srv_addr.addr.name.name, 10000);

        /* Send connectionless "hello" message. */
        if (0 > sendto(sd, buf, strlen(buf) + 1, 0,
                       (struct sockaddr *)&srv_addr, sizeof(srv_addr))) {
            perror("Client: Failed to send");
            exit(1);
        }

        /* Receive the acknowledgement. */
        if (0 >= recv(sd, buf, sizeof(buf), 0)) {
            perror("Unexpected response");
            exit(1);
        }
        printf("Client: Received response: %s \n", buf);
        printf("\n*** TIPC client program finished ***\n");
        return 0;
    }

    /* server.c (same #include lines as client.c) */
    #define FOO 4711
    #define LOWER_BOUND 0
    #define UPPER_BOUND 99

    int main(void)
    {
        struct sockaddr_tipc partition_addr, client_addr;
        socklen_t alen = sizeof(client_addr);
        char inbuf[40], outbuf[40] = "Uh ?";
        int sd = socket(AF_TIPC, SOCK_RDM, 0);

        if (sd < 0) {
            perror("Server: Failed to create socket");
            exit(1);
        }

        /* Bind to the partition FOO, 0..99, visible in the whole cluster. */
        memset(&partition_addr, 0, sizeof(partition_addr));
        partition_addr.family = AF_TIPC;
        partition_addr.addrtype = TIPC_ADDR_NAMESEQ;
        partition_addr.addr.nameseq.type = FOO;
        partition_addr.addr.nameseq.lower = LOWER_BOUND;
        partition_addr.addr.nameseq.upper = UPPER_BOUND;
        partition_addr.scope = TIPC_CLUSTER_SCOPE;

        printf("** TIPC server program started **\n");

        /* Make server available. */
        if (0 != bind(sd, (struct sockaddr *)&partition_addr,
                      sizeof(partition_addr))) {
            perror("Server: Failed to bind");
            exit(1);
        }

        /* Wait for one message; client_addr receives the sender's address. */
        if (0 >= recvfrom(sd, inbuf, sizeof(inbuf), 0,
                          (struct sockaddr *)&client_addr, &alen)) {
            perror("Unexpected recv");
            exit(1);
        }
        printf("Server: Message received: %s !\n", inbuf);

        /* Reply directly to the sender. */
        if (0 > sendto(sd, outbuf, strlen(outbuf) + 1, 0,
                       (struct sockaddr *)&client_addr, sizeof(client_addr))) {
            perror("Server: Failed to send");
        }
        printf("\n** TIPC server program finished **\n");
        return 0;
    }
14
Functional Addressing: Multicast
- Based on function address sequences
- Any partition overlapping with the range used in the destination address will receive a copy of the message
- Client defines “multicast group” per call
Diagram:
- Server process, partition A: bind(type = foo, lower=0, upper=99)
- Server process, partition B: bind(type = foo, lower=100, upper=199)
- Client process: sendto(type = foo, lower = 33, upper = 133), shown as foo,33,133
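The delivery rule for multicast is an interval-overlap test: a bound partition receives a copy when its range shares at least one instance with the destination range. The following C sketch models just that rule; `struct name_seq`, `seq_overlaps` and `count_receivers` are hypothetical names, not part of the TIPC API, and 4711 stands in for `foo` as in the deck's code example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One function type with an inclusive instance range. */
struct name_seq { uint32_t type; uint32_t lower; uint32_t upper; };

/* True if the two ranges have the same type and share at least one instance. */
bool seq_overlaps(struct name_seq bound, struct name_seq dest)
{
    assert(bound.lower <= bound.upper && dest.lower <= dest.upper);
    return bound.type == dest.type &&
           bound.lower <= dest.upper && dest.lower <= bound.upper;
}

/* Count how many of the n bound partitions would receive a copy. */
size_t count_receivers(const struct name_seq *bound, size_t n,
                       struct name_seq dest)
{
    size_t copies = 0;
    for (size_t i = 0; i < n; i++)
        if (seq_overlaps(bound[i], dest))
            copies++;
    return copies;
}
```

For the slide's example, the destination range 33..133 overlaps both 0..99 and 100..199, so both server partitions receive the message.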
15
Multicast Code Example

    /* client.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <linux/tipc.h>

    #define FOO 4711
    #define LOWER_BOUND 33
    #define UPPER_BOUND 133

    /* Helper not shown on the slide; this prototype is assumed. */
    void wait_for_server(struct tipc_name *name, int wait);

    int main(void)
    {
        struct sockaddr_tipc mcast_group;
        char buf[40] = "Hello World";
        int sd = socket(AF_TIPC, SOCK_RDM, 0);

        if (sd < 0) {
            perror("Client: Failed to create socket");
            exit(1);
        }

        /* The destination range defines the multicast group for this call. */
        memset(&mcast_group, 0, sizeof(mcast_group));
        mcast_group.family = AF_TIPC;
        mcast_group.addrtype = TIPC_ADDR_NAMESEQ;
        mcast_group.addr.nameseq.type = FOO;
        mcast_group.addr.nameseq.lower = LOWER_BOUND;
        mcast_group.addr.nameseq.upper = UPPER_BOUND;

        printf("** TIPC client program started **\n\n");

        /* addr.name.name overlays the first two fields of addr.nameseq,
           so this waits for the name {FOO, LOWER_BOUND}. */
        wait_for_server(&mcast_group.addr.name.name, 10000);

        /* Send connectionless "hello" message to every overlapping partition. */
        if (0 > sendto(sd, buf, strlen(buf) + 1, 0,
                       (struct sockaddr *)&mcast_group, sizeof(mcast_group))) {
            perror("Client: Failed to send");
            exit(1);
        }

        /* Receive the first acknowledgement only. */
        if (0 >= recv(sd, buf, sizeof(buf), 0)) {
            perror("Unexpected response");
            exit(1);
        }
        printf("Client: Received response: %s \n", buf);
        printf("\n****** TIPC client program finished ******\n");
        return 0;
    }

    /* server.c: identical to server.c in the unicast code example
       (binds FOO, 0..99 with TIPC_CLUSTER_SCOPE, receives one message,
       replies to the sender). */
17
Address Location Transparency
- Location of server not known by client
- Lookup of physical destination performed on the fly
- Efficient, no secondary messaging involved
Diagram (processes placed on one or more nodes):
- Server process, partition A: bind(type = foo, lower=0, upper=99)
- Server process, partition B: bind(type = foo, lower=100, upper=199)
- Client process: sendto(type = foo, lower = 33, upper = 133), shown as foo,33,133
20
Address Binding
- Many sockets may bind to the same partition
  - Closest-first or round-robin algorithm chosen by client
Diagram:
- Server process, partition A: bind(type = foo, lower=0, upper=99)
- Server process, partition A’: bind(type = foo, lower=0, upper=99)
- Client process: sendto(type = foo, lower = 33, upper = 133), shown as foo,33,133
21
Address Binding
- Many sockets may bind to the same partition
  - Closest-first or round-robin algorithm chosen by client
- Same socket may bind to many partitions
Diagram:
- Server process, partition B: bind(type = foo, lower=100, upper=199)
- Server process, partition A+B’: bind(type = foo, lower=0, upper=99) and bind(type=foo, lower=100, upper=199)
- Client process: sendto(type = foo, lower = 33, upper = 133), shown as foo,33,133
22
Address Binding
- Many sockets may bind to the same partition
  - Closest-first or round-robin algorithm chosen by client
- Same socket may bind to many partitions
- Same socket may bind to different functions
Diagram:
- Server process, partition B: bind(type = foo, lower=100, upper=199)
- Server process, partition A: bind(type = foo, lower=0, upper=99) and bind(type=bar, lower=0, upper=999)
- Client process: sendto(type = foo, lower = 33, upper = 133), shown as foo,33,133
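When several sockets are bound to the same partition, the lookup has to pick one of them. The two policies named on the slides can be modelled as below. This is a sketch of the idea only: `struct binding`, `select_binding`, and the use of round-robin as the fallback when no binding is local are assumptions, not a description of how TIPC implements the lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One socket bound to the partition, identified by the node it lives on. */
struct binding { uint32_t node; int socket_id; };

enum lookup_policy { CLOSEST_FIRST, ROUND_ROBIN };

/* Select one of n bindings to the same partition.
 * `cursor` carries the round-robin position between calls. */
const struct binding *select_binding(const struct binding *b, size_t n,
                                     enum lookup_policy policy,
                                     uint32_t own_node, size_t *cursor)
{
    assert(b != NULL && n > 0 && cursor != NULL);
    if (policy == CLOSEST_FIRST) {
        for (size_t i = 0; i < n; i++)
            if (b[i].node == own_node)
                return &b[i];      /* prefer a binding on the caller's node */
    }
    /* Round-robin; also the assumed fallback when nothing is local. */
    const struct binding *pick = &b[*cursor % n];
    *cursor = (*cursor + 1) % n;
    return pick;
}
```

Closest-first keeps traffic on the local node when it can, while round-robin spreads consecutive messages evenly over all servers of the partition.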
23
Functional Topology Subscription
- Function address/address partition bind/unbind events
Diagram:
- Server process, partition A: bind(type = foo, lower=0, upper=99)
- Server process, partition B: bind(type = foo, lower=100, upper=199)
- Client process: subscribe(type = foo, lower = 0, upper = 500), receives events foo,0,99 and foo,100,199
24
Network Topology Subscription
- Node/cluster/zone availability events
- Same mechanism as for functional events
Diagram:
- TIPC on one node: bind(type = node, lower=0x1001003, upper=0x1001003)
- TIPC on another node: bind(type = node, lower=0x1001002, upper=0x1001002)
- Client process: subscribe(type = node, lower = 0x1001000, upper = 0x1001009), receives events node,0x1001003 and node,0x1001002
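The hexadecimal values in this example read naturally as TIPC network addresses in the zone.cluster.node layout (8-bit zone, 12-bit cluster, 12-bit node). The slide does not spell that layout out, so treat it as an assumption here; the helper names below are illustrative and are not the macros from the TIPC headers.

```c
#include <assert.h>
#include <stdint.h>

/* Split a 32-bit network address into its three fields,
 * assuming the 8/12/12-bit zone.cluster.node layout. */
uint32_t addr_zone(uint32_t a)    { return a >> 24; }
uint32_t addr_cluster(uint32_t a) { return (a >> 12) & 0xfff; }
uint32_t addr_node(uint32_t a)    { return a & 0xfff; }

/* Build an address from its fields. */
uint32_t addr_make(uint32_t zone, uint32_t cluster, uint32_t node)
{
    assert(zone <= 0xff && cluster <= 0xfff && node <= 0xfff);
    return (zone << 24) | (cluster << 12) | node;
}
```

Under that reading, 0x1001003 is node 3 in cluster 1 of zone 1, and the subscription range 0x1001000 to 0x1001009 covers node numbers 0 through 9 of that cluster.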
25
Connections
- Establishment based on functional addressing
  - Selectable lookup algorithm, partitioning, redundancy etc.
- Lightweight
- End-to-end flow control
- SOCK_STREAM/SOCK_SEQPACKET in connection-oriented mode
  - Mutually compatible
26
Connection Setup
- No protocol messages exchanged during setup/shutdown
- Only payload-carrying messages
Diagram:
- Client process: sendto(type = foo, instance = 117), shown as foo,117
- Server process, partition B
27
Connection Setup
- No protocol messages exchanged during setup/shutdown
- Only payload-carrying messages
Diagram:
- Server process, partition B: lconnect(client), send()
- Client process
28
Connection Setup
- No protocol messages exchanged during setup/shutdown
- Only payload-carrying messages
Diagram:
- Server process, partition B
- Client process: lconnect(server)
29
Connection Shutdown
- No protocol messages exchanged during setup/shutdown
- Only payload-carrying messages
Diagram: server process (partition B) and client process, disconnect()
31
Connection Setup/Shutdown
- Well-known TCP-style connect/shutdown, with exchange of SYN and FIN messages, available as an alternative
Diagram:
- Server process, partition B: bind(), listen(), accept()
- Client process: connect(type=foo, instance=117), sends SYN (foo,117)
32
Connection Abortion
- Immediate “abortion” event in case of peer process crash
Diagram: server process (partition B) and client process, abort
33
Connection Abortion
- Immediate “abortion” event in case of peer node crash
Diagram: server process (partition B) and client process on separate nodes, abort
34
Connection Abortion
- Immediate “abortion” event in case of communication failure
Diagram: server process (partition B) and client process on separate nodes, abort
35
Connection Abortion
- Immediate abortion in case of node overload
Diagram: server process (partition B) and client process on separate nodes, abort
36
Connection Flow Control
- End-to-end send window of N messages slows the sender process in case of receiver process overload
- Acknowledgement sent from the receiver for every N/2 messages
- Sender socket keeps only a counter, not a retransmission buffer
Diagram: client process and server process (partition B) on separate nodes, acknowledgement flowing back to the sender
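The window scheme described here needs very little state on either side, which a small model makes concrete. In the sketch below the sender holds nothing but a count of unacknowledged messages, and the receiver acknowledges once per N/2 consumed messages. The window size of 8 and all names are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define WINDOW 8   /* N: assumed value for illustration */

struct sender   { unsigned unacked; };    /* only a counter, no buffer */
struct receiver { unsigned since_ack; };  /* messages consumed since last ack */

/* Try to send one message; fails when the window is full. */
bool sender_try_send(struct sender *s)
{
    if (s->unacked >= WINDOW)
        return false;
    s->unacked++;
    return true;
}

/* Receiver consumes one message. Returns the number of messages it
 * acknowledges now: WINDOW/2 on every (WINDOW/2)-th message, else 0. */
unsigned receiver_consume(struct receiver *r)
{
    if (++r->since_ack < WINDOW / 2)
        return 0;
    r->since_ack = 0;
    return WINDOW / 2;
}

/* Sender processes an acknowledgement and reopens the window. */
void sender_on_ack(struct sender *s, unsigned acked)
{
    assert(acked <= s->unacked);
    s->unacked -= acked;
}
```

A receiver that stops consuming stops acknowledging, so the sender runs into the window limit after N messages and is slowed down without any buffering at the socket level.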
37
Signalling Links
- Retransmission protocol and congestion control at signalling link level
- Transmitted packets acknowledged/released by any packet from the other node
- Packet losses detected and retransmission performed earlier
- Packets from different sources are bundled in the same buffer in case of congestion
- Packet flow more traffic-driven, no need for timers per socket or message
Diagram: several client and server processes on two nodes sharing one signalling link
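The statement that transmitted packets are released by any packet from the other node can be modelled as a piggybacked acknowledgement: every incoming packet carries the highest sequence number the peer has received, and the link drops everything up to that number from its transmit queue. The sketch below assumes a fixed-size queue and ignores sequence-number wraparound; all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_MAX 16   /* assumed transmit-queue capacity */

struct link_tx {
    uint32_t seq[QUEUE_MAX];  /* unacknowledged sequence numbers, oldest first */
    size_t   count;
    uint32_t next_seq;
};

/* Queue one packet for transmission and return its sequence number. */
uint32_t link_send(struct link_tx *l)
{
    assert(l->count < QUEUE_MAX);
    uint32_t s = l->next_seq++;
    l->seq[l->count++] = s;
    return s;
}

/* Handle any packet from the peer. `acked` is the highest sequence number
 * the peer reports as received. Returns how many packets were released. */
size_t link_on_packet(struct link_tx *l, uint32_t acked)
{
    size_t released = 0;
    while (released < l->count && l->seq[released] <= acked)
        released++;
    /* Shift the remaining packets to the front of the queue. */
    for (size_t i = released; i < l->count; i++)
        l->seq[i - released] = l->seq[i];
    l->count -= released;
    return released;
}
```

Because the acknowledgement rides on ordinary traffic, a busy link needs no separate acknowledgement packets and no per-message timers.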
38
Network Load Sharing
- One link per node pair and interface
- Typically two links per node pair, for full load sharing and redundancy
Diagram: client and server processes on two nodes connected by two links
39
Network Redundancy
- Smooth failover in case of single link failure, with no consequences for user-level connections
- Each link supervised by conditional heartbeats, i.e. only when there is no other traffic
Diagram: client and server processes on two nodes connected by two links
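A conditional heartbeat means that a probe is sent only after an interval in which nothing was received on the link; normal traffic counts as proof of life. A minimal sketch of such a supervisor follows. The tolerance of three unanswered probes and every name in the code are assumptions, not values from the TIPC implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PROBE_TOLERANCE 3  /* assumed: unanswered probes before failure */

enum link_state { LINK_UP, LINK_DOWN };

struct link_sup {
    enum link_state state;
    bool     traffic_seen;  /* any packet received since the last tick */
    unsigned missed;        /* consecutive silent intervals */
    unsigned probes_sent;   /* heartbeats actually transmitted */
};

/* Called for every packet received on the link, payload or probe reply. */
void link_on_receive(struct link_sup *l)
{
    assert(l != NULL);
    l->traffic_seen = true;
}

/* Called once per supervision interval. */
void link_on_tick(struct link_sup *l)
{
    assert(l != NULL);
    if (l->state == LINK_DOWN)
        return;
    if (l->traffic_seen) {          /* traffic proves the link is alive */
        l->traffic_seen = false;
        l->missed = 0;
        return;                     /* no heartbeat needed */
    }
    if (l->missed >= PROBE_TOLERANCE) {
        l->state = LINK_DOWN;       /* caller would fail over to the other link */
        return;
    }
    l->missed++;
    l->probes_sent++;               /* silent interval: send one probe */
}
```

On a loaded link this supervisor never transmits a probe, which is what keeps the heartbeat scheme lean.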
40
Code Status
- Initial release for Linux
- Feedback (S. Hemminger, Jamal) was that we have to do some re-work
  - Memory handling, buffer handling, locking policy, socket interface, management protocol/interface…
- All issues addressed, but not all checked in at SF yet
  - New, fully POSIX-compliant socket interface/implementation
  - More conventional use of buffers (performance…)
- Reliable multicast needs more testing
- Still not fully ready for inclusion in kernel, but we are close…
41
Short Term Goals
- End of August: kernel ready
  - Reliable multicast fully tested
  - New socket implementation finished and tested
  - Netlink-based management/configuration protocol finished and tested
  - All ioctl() calls replaced
42
Long Term Goals
- Multi-cluster functionality
  - Mostly user space
  - Automatic inter-cluster neighbour discovery and link setup
  - Fully manual inter-cluster link setup
  - Guaranteeing name table consistency between clusters
  - Slave node name table reduction
- Additional bearers
  - Dynamic registration of “bearers” from user space (e.g. TCP, DCCP)
- Distributed netlink ??
43
http://tipc.sourceforge.net
44
QUESTIONS ??