Chapter 5: End-to-End (Transport) Protocols
Summary of underlying best-effort network capabilities (host-to-host)
–drops packets or datagrams
–re-orders packets or datagrams (send order versus receive order)
–delivers duplicate copies of a given packet or datagram
–limits the size of packets or datagrams
–delivers packets or datagrams after arbitrarily long delays
End-to-end services desired (process-to-process)
–guaranteed delivery of messages
–same-order delivery of messages (same order as they are sent)
–delivery of exactly one copy of each message
–support for arbitrarily large messages
–synchronization support, sender to receiver (i.e., connection oriented)
–flow control (allowing the receiver to regulate the sender's rate)
–support for multiple application processes on each host
End-to-End (Transport) Protocols (continued)
RPC = Remote Procedure Call
Simple Demultiplexing Support (UDP)
Unreliable, unordered, datagram service
Adds demultiplexing
No flow control
Endpoints identified by ports
–servers have well-known ports
–see /etc/services on Unix
Optional checksum
[Figure: header format]
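The header format figure did not survive on this slide. As a rough sketch, the standard UDP header is four 16-bit fields (source port, destination port, length, checksum) carried in network byte order. The names `UdpHdr`, `get_u16`, and `udp_parse` below are illustrative and not taken from the slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative in-memory form of the 8-byte UDP header. */
typedef struct {
    uint16_t src_port;  /* sender's port */
    uint16_t dst_port;  /* demultiplexing key on the receiving host */
    uint16_t length;    /* header plus data, in bytes */
    uint16_t checksum;  /* optional; 0 means "not computed" over IPv4 */
} UdpHdr;

/* Read a big-endian (network order) 16-bit value from a byte buffer. */
static uint16_t get_u16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Parse the header at the start of a datagram.
   Returns 0 on success, -1 if the buffer is shorter than a header
   or the length field does not fit the buffer. */
int udp_parse(const uint8_t *buf, size_t len, UdpHdr *out)
{
    if (len < 8)
        return -1;
    out->src_port = get_u16(buf);
    out->dst_port = get_u16(buf + 2);
    out->length   = get_u16(buf + 4);
    out->checksum = get_u16(buf + 6);
    if (out->length < 8 || out->length > len)
        return -1;
    return 0;
}
```

A receiver would then use `dst_port` to pick the queue of the process bound to that port.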
Reliable Byte-Stream (TCP)
Overview
Connection-oriented, byte-stream
–sending process writes stream of bytes
–TCP breaks the stream into segments and sends them as IP datagrams
–receiving process reads stream of bytes
[Figure: (a) byte segments sent as IP datagrams; (b) 2048B read as stream]
Full duplex channel
Flow control provided (to keep sender from overrunning receiver)
Congestion control provided (to keep sender from overrunning network)
End-to-End Issues
TCP is based on a sliding window protocol similar to the one used at the data link level, but the situation is very different.
Potentially connects many different hosts
–need explicit connection establishment and termination
Potentially different RTT
–need adaptive timeout mechanism
Potentially long delay in network
–need to be prepared for arrival of very old packets
Potentially different capacity at destination
–need to accommodate different amounts of buffering
Potentially different network capacity
–need to be prepared for network congestion
Segment Format
Each connection is identified by a 4-tuple used as a demux key: (SrcPort, SrcIPAddr, DstPort, DstIPAddr)
The sliding window algorithm and flow control involve the following fields: SequenceNumber, AcknowledgmentNumber, AdvertisedWindowSize
HdrLen is in 32-bit words
6-bit Flags field: used to relay control information
–URGent: set when urgent data is pointed to by UrgPtr
–ACK: set when AckNum is valid
–PUSh: signifies that the sender has flushed its buffers
–ReSeT: says the receiver has become confused (start over!)
–SYNchronize, FINish: set to establish/terminate a connection
Checksum: computed over pseudo header (SrcAdr + DestAdr + Lengths) + TCP header + data
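The checksum on this slide is the standard 16-bit ones' complement Internet checksum. A minimal sketch follows, assuming the caller has already laid out the pseudo header, TCP header, and data contiguously in one buffer; the function name `inet_checksum` is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of 16-bit big-endian words, then complemented.
   The 32-bit accumulator is sufficient for buffers up to about 128 KB,
   which covers any single TCP segment. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    size_t i;

    for (i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((data[i] << 8) | data[i + 1]);
    if (i < len)
        sum += (uint32_t)(data[i] << 8);   /* odd trailing byte, zero padded */

    while (sum >> 16)                      /* fold carries back into low 16 bits */
        sum = (sum & 0xFFFFu) + (sum >> 16);

    return (uint16_t)~sum;
}
```

A receiver can run the same function over the received bytes with the checksum field included; a result of 0 indicates no detected error.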
Connection Establishment and Termination
3-way handshake:
1. Client sends a segment to the server with its starting Seq# (SYN=1, SeqNum=x)
2. Server sends a segment with (SYN=1, ACK=1, AckNum=x+1, SeqNum=y), where y is its own starting Seq#
3. Client sends an ACK segment with (ACK=1, AckNum=y+1)
[Figure: handshake timelines between active client and passive server, normal case and call collision]
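The three handshake segments can be sketched as plain values to make the field arithmetic concrete. The `Segment` type and the `make_*` helpers are hypothetical names for illustration; the sequence number placed in the final ACK (x+1) is standard TCP behavior but is not stated on the slide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Only the fields the handshake uses. */
typedef struct {
    bool     syn;
    bool     ack;
    uint32_t seq_num;
    uint32_t ack_num;   /* valid only when ack is set */
} Segment;

/* Step 1: client announces its starting sequence number x. */
Segment make_syn(uint32_t x)
{
    Segment s = { true, false, x, 0 };
    return s;
}

/* Step 2: server acknowledges x (AckNum = x+1) and announces its own y. */
Segment make_syn_ack(Segment syn, uint32_t y)
{
    Segment s = { true, true, y, syn.seq_num + 1 };
    return s;
}

/* Step 3: client acknowledges y (AckNum = y+1). */
Segment make_ack(Segment syn_ack)
{
    Segment s = { false, true, syn_ack.ack_num, syn_ack.seq_num + 1 };
    return s;
}
```

Because the fields are unsigned 32-bit integers, `x + 1` wraps to 0 when x is the largest sequence number, matching the modular sequence space.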
Sliding Window
TCP's sliding window algorithm serves three purposes:
1. Guarantees the reliable delivery of data
2. Ensures that data is delivered in order
3. Enforces flow control (so that the sender does not overrun the receiver)
For 1 (guaranteed reliable delivery), TCP is basically the same as the sliding window algorithm at the link level.
Where the TCP sliding window differs is that it folds flow control in as well. Rather than using a fixed-size window, the receiver advertises a window size through the AdvertisedWindow field (based on available buffers). The sender is then limited to having no more than that many bytes of unacknowledged data outstanding.
Treatment of sequence number wrap-around is essentially the same as at the link level.
Sliding Window (continued)
Each byte has a sequence number. ACKs are cumulative.
Sending side
–LastByteAcked <= LastByteSent
–LastByteSent <= LastByteWritten
–bytes between LastByteAcked and LastByteWritten must be buffered
Receiving side
–LastByteRead < NextByteExpected
–bytes between LastByteRead and LastByteRcvd must be buffered
[Figure: sliding window over the send buffer]
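The pointers above are enough to sketch how flow control limits the sender. The two formulas used here are the usual textbook formulation and are an assumption, since the slide does not spell them out: the receiver advertises its free buffer space, and the sender may transmit only what the advertisement leaves after subtracting data already in flight. `max_rcv_buffer` and both function names are hypothetical, and sequence number wrap-around is ignored.

```c
#include <stdint.h>

/* Receiver: free buffer space = capacity minus bytes received but not
   yet read by the application (LastByteRcvd - LastByteRead). */
uint32_t advertised_window(uint32_t max_rcv_buffer,
                           uint32_t last_byte_rcvd,
                           uint32_t last_byte_read)
{
    uint32_t buffered = last_byte_rcvd - last_byte_read;
    return buffered >= max_rcv_buffer ? 0 : max_rcv_buffer - buffered;
}

/* Sender: how many more bytes may be sent right now, given the bytes
   already sent but not acknowledged (LastByteSent - LastByteAcked). */
uint32_t effective_window(uint32_t advertised,
                          uint32_t last_byte_sent,
                          uint32_t last_byte_acked)
{
    uint32_t in_flight = last_byte_sent - last_byte_acked;
    return in_flight >= advertised ? 0 : advertised - in_flight;
}
```

An effective window of 0 means the sender must stop until an ACK either acknowledges more data or advertises a larger window.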
Keeping the Pipe Full
Bandwidth           Time until Wraparound   Delay x Bandwidth Product
T1 (1.5Mbps)        6.4 hours               18KB
Ethernet (10Mbps)   57 minutes              122KB
T3 (45Mbps)         13 minutes              549KB
FDDI (100Mbps)      6 minutes               1.2MB
STS-3 (155Mbps)     4 minutes               1.8MB
STS-12 (622Mbps)    55 seconds              7.4MB
STS-24 (1.2Gbps)    28 seconds              14.8MB
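The table's figures can be reproduced with two short calculations. This sketch assumes a 32-bit sequence space (one number per byte) for the wraparound time and a round-trip time of 100 ms for the delay x bandwidth product; the RTT is not stated on the slide but is the value that matches the listed numbers when KB is taken as 1024 bytes. Function names are illustrative.

```c
#include <assert.h>

/* Seconds until a 32-bit byte sequence space is consumed at line rate. */
double wraparound_seconds(double bits_per_sec)
{
    assert(bits_per_sec > 0.0);
    return 4294967296.0 / (bits_per_sec / 8.0);   /* 2^32 bytes / (bytes per second) */
}

/* Bytes that must be in flight to keep the pipe full for one RTT. */
double delay_bandwidth_bytes(double bits_per_sec, double rtt_sec)
{
    return bits_per_sec * rtt_sec / 8.0;
}
```

For T1, `wraparound_seconds(1.5e6)` is about 22906 seconds (roughly 6.4 hours), and `delay_bandwidth_bytes(1.5e6, 0.1)` is 18750 bytes (about 18KB).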
Adaptive Retransmission
Original Algorithm
–Measure SampleRTT for each segment/ACK pair
–Compute weighted average of RTT
EstimatedRTT = a * EstimatedRTT + b * SampleRTT
where a + b = 1, a between 0.8 and 0.9, b between 0.1 and 0.2
–Set timeout based on EstimatedRTT
TimeOut = 2 * EstimatedRTT
Karn/Partridge Algorithm
–Do not sample RTT when retransmitting
–Double timeout after each retransmission
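The original estimator and the two Karn/Partridge rules fit in a few lines. This is a sketch under assumptions: the estimate is seeded with the first sample (the slide does not say how it is initialized), b is taken as 1 - a, and the type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    double estimated_rtt;   /* running weighted average */
    double timeout;         /* current retransmission timeout */
} RttState;

/* Assumption: seed the average with the first measured sample. */
void rtt_init(RttState *s, double first_sample)
{
    s->estimated_rtt = first_sample;
    s->timeout = 2.0 * first_sample;
}

/* Called when an ACK arrives. a is the weight on the old estimate. */
void rtt_on_ack(RttState *s, double sample_rtt, bool was_retransmitted, double a)
{
    assert(a > 0.0 && a < 1.0);
    if (was_retransmitted)
        return;   /* Karn/Partridge: the sample is ambiguous, so skip it */
    s->estimated_rtt = a * s->estimated_rtt + (1.0 - a) * sample_rtt;
    s->timeout = 2.0 * s->estimated_rtt;
}

/* Called when the timer fires and a segment is retransmitted. */
void rtt_on_retransmit(RttState *s)
{
    s->timeout *= 2.0;   /* Karn/Partridge: exponential backoff */
}
```

With a = 0.875, an estimate of 100 ms and a new sample of 200 ms give an estimate of 112.5 ms and a timeout of 225 ms.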
Jacobson/Karels Algorithm (used today)
New calculation for average RTT
Diff = SampleRTT - EstimatedRTT
EstimatedRTT = EstimatedRTT + (δ * Diff)
Deviation = Deviation + δ * (|Diff| - Deviation)
(where δ is a fraction between 0 and 1)
Setting timeout value
TimeOut = μ * EstimatedRTT + φ * Deviation
(where μ = 1 and φ = 4)
Notes
–algorithm only as good as granularity of clock (500ms on Unix)
–accurate timeout mechanism important to congestion control (later)
TCP Extensions proposed
–Store timestamp in outgoing segments
–Use 32-bit timestamp
–Make modifications to advertised window
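A minimal sketch of the Jacobson/Karels update follows, with the timeout weights fixed at 1 and 4 as on the slide. The gain is passed in as `delta`; the value 0.125 used in the usage note is a common choice and an assumption here, as are the type and function names.

```c
/* Estimator state: smoothed RTT and smoothed mean deviation. */
typedef struct {
    double estimated_rtt;
    double deviation;
} JkState;

#define JK_MU  1.0   /* weight on EstimatedRTT in the timeout */
#define JK_PHI 4.0   /* weight on Deviation in the timeout */

/* Fold one RTT sample into the state and return the new timeout.
   delta is the gain, a fraction between 0 and 1. */
double jk_update(JkState *s, double sample_rtt, double delta)
{
    double diff = sample_rtt - s->estimated_rtt;
    double abs_diff = diff < 0.0 ? -diff : diff;

    s->estimated_rtt += delta * diff;
    s->deviation     += delta * (abs_diff - s->deviation);

    return JK_MU * s->estimated_rtt + JK_PHI * s->deviation;
}
```

Starting from an estimate of 100 ms with zero deviation, a 200 ms sample with delta = 0.125 yields an estimate of 112.5 ms, a deviation of 12.5 ms, and a timeout of 162.5 ms. Unlike the original algorithm, the timeout grows with the variance of the samples rather than only with their mean.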
Remote Procedure Call (RPC) Protocol Stack
Simple RPC protocol stack
–BLAST: fragments and reassembles large messages
–CHAN: synchronizes request and reply messages
–SELECT: dispatches message to correct process
[Figure: RPC mechanism. Caller (client) and callee (server) each pass arguments and return values through a stub and the RPC protocol, which exchange Request and Reply messages]
[Figure: protocol stack with layers SELECT, CHAN, BLAST, IP, ETH]
Bulk Transfer (BLAST)
Strategy
–Accumulates acks
–selective retransmission
–aka partial acknowledgements
Blast header format
–MID protects against wraparound
–NumFrags = number of fragments
–TYPE = DATA or SRR
–FragMask distinguishes fragments: if Type=DATA, identifies this frag; if Type=SRR, identifies missing frags
[Figure: sender/receiver timeline showing Fragments 1 to 6, retransmission of Fragments 3 and 5, and SRR messages]
Unlike AAL and IP, BLAST tries to recover from lost fragments
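The FragMask bookkeeping behind selective retransmission can be sketched with bit operations. This assumes a 32-bit mask with fragment i (counted from 1) mapped to bit i-1; the width, the bit numbering, and the function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Mask with one bit set for each of the num_frags fragments. */
static uint32_t all_frags_mask(unsigned num_frags)
{
    assert(num_frags >= 1 && num_frags <= 32);
    return num_frags == 32 ? 0xFFFFFFFFu : ((1u << num_frags) - 1u);
}

/* Receiver: record the arrival of fragment frag (1-based). */
uint32_t blast_mark_received(uint32_t received, unsigned frag)
{
    assert(frag >= 1 && frag <= 32);
    return received | (1u << (frag - 1));
}

/* Receiver: FragMask to place in an SRR, one bit per missing fragment.
   A result of 0 means the whole message has arrived. */
uint32_t blast_missing(uint32_t received, unsigned num_frags)
{
    return all_frags_mask(num_frags) & ~received;
}
```

For a six-fragment message where fragments 3 and 5 are lost, `blast_missing` returns 0x14 (bits 2 and 4), which tells the sender to retransmit exactly those two fragments.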
Request/Reply (CHAN)
Guarantees message delivery
Synchronizes client with server
Supports at-most-once semantics
[Figure: simple timeline. Client/Server: Request, ACK, Reply, ACK]
[Figure: timeline using implicit acks. Client/Server: Request 1, Reply 1, Request 2, Reply 2, …]
CHAN Header Format
typedef struct {
    u_short Type;     /* REQ, REP, ACK, PROBE */
    u_short CID;      /* unique channel id */
    int     MID;      /* unique message id */
    int     BID;      /* unique boot id */
    int     Length;   /* length of message */
    int     ProtNum;  /* high-level protocol */
} ChanHdr;

CHAN Session State
typedef struct {
    u_char    type;       /* CLIENT or SERVER */
    u_char    status;     /* BUSY or IDLE */
    int       retries;    /* number of retries */
    int       timeout;    /* timeout value */
    XkReturn  ret_val;    /* return value */
    Msg       *request;   /* request message */
    Msg       *reply;     /* reply message */
    Semaphore reply_sem;  /* client semaphore */
    int       mid;        /* message id */
    int       bid;        /* boot id */
} ChanState;
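One way the MID and BID fields could support at-most-once semantics is for the server to compare each incoming request against the ids stored in its session state. The policy below is a hypothetical sketch, not the CHAN implementation: the verdict names and the function are invented for illustration, and MID wrap-around is ignored.

```c
/* Possible outcomes when a request arrives on a channel. */
typedef enum {
    MSG_NEW,        /* next request: execute it */
    MSG_DUPLICATE,  /* same MID as the current one: do not execute again */
    MSG_STALE,      /* older MID: discard */
    MSG_NEW_BOOT    /* BID changed: the peer has rebooted */
} ChanVerdict;

/* Compare the header's ids with the ids recorded in the session state. */
ChanVerdict chan_classify(int hdr_bid, int hdr_mid, int state_bid, int state_mid)
{
    if (hdr_bid != state_bid)
        return MSG_NEW_BOOT;    /* MIDs from another boot epoch are not comparable */
    if (hdr_mid == state_mid)
        return MSG_DUPLICATE;   /* retransmission of the request in progress */
    if (hdr_mid < state_mid)
        return MSG_STALE;       /* delayed copy of an earlier request */
    return MSG_NEW;
}
```

On `MSG_DUPLICATE` a server would typically resend its ACK or cached reply instead of running the procedure a second time.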
Dispatcher (SELECT)
Dispatch to appropriate procedure
Synchronous counterpart to UDP
Address space for procedures
–flat: unique id for each possible procedure
–hierarchical: program + procedure number
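Hierarchical addressing can be sketched as a lookup keyed on the (program, procedure) pair. Everything here is hypothetical: the table contents, the two sample procedures, and the `select_dispatch` name are placeholders, and a real dispatcher would pass messages rather than a single integer.

```c
#include <stddef.h>

typedef int (*ProcFn)(int arg);

/* One table entry per remotely callable procedure. */
typedef struct {
    unsigned program;     /* first level of the address */
    unsigned procedure;   /* second level, unique within the program */
    ProcFn   fn;
} ProcEntry;

/* Sample procedures, for illustration only. */
static int add_one(int x) { return x + 1; }
static int twice(int x)   { return 2 * x; }

static const ProcEntry proc_table[] = {
    { 1, 0, add_one },
    { 1, 1, twice },
};

/* Look up (program, procedure) and call it.
   Returns 0 and stores the result on success, -1 if no such procedure. */
int select_dispatch(unsigned program, unsigned procedure, int arg, int *result)
{
    for (size_t i = 0; i < sizeof proc_table / sizeof proc_table[0]; i++) {
        if (proc_table[i].program == program &&
            proc_table[i].procedure == procedure) {
            *result = proc_table[i].fn(arg);
            return 0;
        }
    }
    return -1;
}
```

A flat address space would use the same lookup with a single id as the key, at the cost of coordinating ids across all programs.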
SunRPC
IP implements ~BLAST-equivalent
SunRPC implements ~CHAN-equivalent
UDP + SunRPC implement SELECT-equivalent
–UDP dispatches to program (ports bound to programs)
–SunRPC dispatches to procedure within program
SunRPC header:
–XID (transaction id) is similar to CHAN's MID
–Server does not remember last XID it serviced
–Problem if client retransmits request while reply is in transit
[Figure: SunRPC header formats, 32 bits wide (bits 0 to 31). Call: XID, MsgType = CALL, RPCVersion = 2, Program, Version, Procedure, Credentials (variable), Verifier (variable), Data. Reply: XID, MsgType = REPLY, Status = ACCEPTED, Data]
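The fixed part of a call header can be sketched as six 32-bit words written in big-endian order. The variable-length Credentials and Verifier fields and the data are left out, and `sunrpc_encode_call` and `put_u32` are illustrative names rather than a library API.

```c
#include <stddef.h>
#include <stdint.h>

enum { RPC_CALL = 0, RPC_REPLY = 1, RPC_VERSION = 2 };

/* Store a 32-bit value in big-endian (network) byte order. */
static void put_u32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Write XID, MsgType, RPCVersion, Program, Version, Procedure.
   Returns the number of bytes written, or 0 if buf is too small. */
size_t sunrpc_encode_call(uint8_t *buf, size_t cap, uint32_t xid,
                          uint32_t program, uint32_t version,
                          uint32_t procedure)
{
    if (cap < 24)
        return 0;
    put_u32(buf,      xid);
    put_u32(buf + 4,  RPC_CALL);
    put_u32(buf + 8,  RPC_VERSION);
    put_u32(buf + 12, program);
    put_u32(buf + 16, version);
    put_u32(buf + 20, procedure);
    return 24;
}
```

A client would pick a fresh `xid` for each new call and reuse it unchanged when retransmitting, so that a reply can be matched to its request.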
Presentation Formatting
Data types considered
–integers
–floats
–strings
–arrays
–structs
Types of data not considered
–images
–video
–multimedia documents
[Figure: application data -> presentation encoding -> message -> presentation decoding -> application data]