1
FAST TCP: motivation, architecture, algorithms, performance
Bartek Wydrowski, Steven Low
netlab.CALTECH.edu
Presented at:
July 18, 2003: IETF Meeting, Vienna (Allison Mankin)
July 31, 2003: ISI/USC, CA (Aaron Falk)
Aug 5, 2003: Internet2 Joint Tech Meeting, University of Kansas, KS
Aug 15, 2003: Disney Digital Network Roundtable (plenary), CA (Howard Liu)
June 8, 2004: Google (Urs Hoelzle, VP, Operations)
July 11, 2004: HP Labs, Palo Alto, CA (Xiaoyun Zhu)
July 12, 2004: IBM Research, Almaden, CA (Moidin Mohiudin)
Dec 7, 2004: Cisco visit (Fred Baker, Graham Holmes, Chris McGugan, Chas Smith)
Feb 13, 2004: Internet2 Joint Tech Meeting, Salt Lake City, Utah (Paul Love, James Williams)
2
Acks & Collaborators
Caltech: Bunn, Choe, Doyle, Hegde, Jin, Li, Low, Newman, Papadopoulos, Ravot, Singh, Tang, J. Wang, Wei, Wydrowski, Xia
UCLA: Paganini, Z. Wang
StarLight: deFanti, Winkler
CERN: Martin
SLAC: Cottrell
PSC: Mathis
Internet2: Almes, Shalunov
Abilene GigaPoPs: GATech, NCSU, PSC, Seattle, Washington
Cisco: Aiken, Doraiswami, McGugan, Smith, Yip
Level(3): Fernes
LANL: Wu
3
Outline
Background, motivation
FAST TCP
  Architecture and algorithms
  Experimental evaluations
  Loss recovery
MaxNet, SUPA FAST
4
Performance at large windows
Two panels: (1) capacity = 1 Gbps; 180 ms round trip latency; 1 flow. Linux TCP averages 19% utilization (txq = 100) and 27% (txq = 10000), versus 95% for FAST. DataTAG Network: CERN (Geneva) - StarLight (Chicago) - SLAC/Level3 (Sunnyvale), 10 Gbps. C. Jin, D. Wei, S. Ravot, et al. (Caltech, Nov 02). (2) ns-2 simulation: capacity = 155 Mbps, 622 Mbps, 2.5 Gbps, 5 Gbps, 10 Gbps; 100 ms round trip latency; 100 flows. J. Wang (Caltech, June 02).
5
Average Queue vs Buffer Size
Dummynet: capacity = 800 Mbps; delay = 200 ms; 1 flow; buffer size: 50, …, 8000 pkts (S. Hegde, B. Wydrowski, et al., Caltech)
6
Is a large queue necessary for high throughput?
7
Congestion control
Source rates xi(t); link congestion measure pl(t). Example congestion measures: loss probability (Reno), queueing delay (Vegas).
8
TCP/AQM
AQM algorithms: DropTail, RED, REM/PI, AVQ (link congestion measure pl(t))
TCP algorithms: Reno, Vegas (source rates xi(t))
Congestion control is a distributed asynchronous algorithm to share bandwidth. It has two components:
  TCP: adapts the sending rate (window) to congestion
  AQM: adjusts & feeds back congestion information
They form a distributed feedback control system. Equilibrium & stability depend on both TCP and AQM, and on delay, capacity, routing, and the number of connections.
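A compact way to write this feedback system, in the notation of the standard duality model of congestion control (the routing matrix R and this summary are ours, not text from the slide):

  y_l(t) = \sum_i R_{li} x_i(t)          (aggregate source rate at link l)
  q_i(t) = \sum_l R_{li} p_l(t)          (end-to-end congestion measure seen by source i)
  x_i(t+1) = F_i( x_i(t), q_i(t) )       (TCP algorithm at source i)
  p_l(t+1) = G_l( y_l(t), p_l(t) )       (AQM algorithm at link l)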
9
Packet & flow level: Reno TCP
Packet level:
  ACK:  W ← W + 1/W
  Loss: W ← W − 0.5 W
Flow level: equilibrium and dynamics; equilibrium throughput x = (1/T)·sqrt(3/(2p)) pkts/sec (Mathis formula).
10
Reno TCP
Packet level: designed and implemented first
Flow level: understood afterwards
Flow level dynamics determines
  Equilibrium: performance, fairness
  Stability
Approach: design flow level equilibrium & stability, then implement the flow level goals at packet level.
11
Reno TCP
Packet level: designed and implemented first
Flow level: understood afterwards
Flow level dynamics determines
  Equilibrium: performance, fairness
  Stability
Packet level design of FAST, HSTCP, STCP is guided by flow level properties.
12
Packet level
Reno AIMD(1, 0.5):       ACK: W ← W + 1/W      Loss: W ← W − 0.5 W
HSTCP AIMD(a(w), b(w)):  ACK: W ← W + a(w)/W   Loss: W ← W − b(w) W
STCP MIMD(a, b):         ACK: W ← W + 0.01     Loss: W ← W − 0.125 W
FAST: periodic, delay-based window update (see the window control algorithm below)
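A minimal sketch of these per-ACK / per-loss rules in C (illustrative only: the helper names are made up, and HSTCP's a(w), b(w) are taken as parameters rather than computed from its response-function tables):

    /* Illustrative per-ACK / per-loss window updates (w in packets). */
    double reno_on_ack(double w)               { return w + 1.0 / w;   }  /* AIMD(1, 0.5)     */
    double reno_on_loss(double w)              { return w - 0.5 * w;   }
    double hstcp_on_ack(double w, double a_w)  { return w + a_w / w;   }  /* AIMD(a(w), b(w)) */
    double hstcp_on_loss(double w, double b_w) { return w - b_w * w;   }
    double stcp_on_ack(double w)               { return w + 0.01;      }  /* MIMD(1/100, 1/8) */
    double stcp_on_loss(double w)              { return w - 0.125 * w; }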
13
Flow level: Reno, HSTCP, STCP, FAST
Similar flow level equilibrium: x_i = α_i / (T_i · p_i^d) pkts/sec, where the constant α and exponent d depend on the protocol: Reno α = sqrt(3/2), d = 1/2; HSTCP takes α and d from its a(w), b(w) response function (d ≈ 0.84); STCP α = a/b, d = 1.
14
Flow level: Reno, HSTCP, STCP, FAST
Common flow level dynamics: window adjustment = control gain × distance from the flow level goal:
  ẇ_i(t) = κ_i(t) · ( 1 − p_i(t) / u_i(t) ),   flow level goal u_i(t) = U_i'(x_i(t))
Different gain κ_i and utility U_i: they determine equilibrium and stability.
Different congestion measure p_i: loss probability (Reno, HSTCP, STCP); queueing delay (Vegas, FAST).
15
Implementation strategy
Common flow level dynamics: ẇ_i(t) = κ_i(t) · ( 1 − p_i(t) / u_i(t) )
Equation-based implementation (FAST): small adjustment when close to the target, large adjustment when far away; needs to estimate how far the current state is from the target; scalable.
AIMD/MIMD implementation (Reno, HSTCP, STCP): window adjustment is independent of p_i and depends only on the current window; difficult to scale.
16
Difficulties at large window
Equilibrium problem
  Packet level: AI too slow, MD too drastic
  Flow level: required loss probability too small
Dynamic problem
  Packet level: must oscillate on binary signal
  Flow level: unstable at large window
17
Problem: no target
Reno AIMD(1, 0.5):       ACK: W ← W + 1/W      Loss: W ← W − 0.5 W
HSTCP AIMD(a(w), b(w)):  ACK: W ← W + a(w)/W   Loss: W ← W − b(w) W
STCP MIMD(1/100, 1/8):   ACK: W ← W + 0.01     Loss: W ← W − 0.125 W
18
Solution: estimate target
FAST phases: Slow Start → Convergence → Equilibrium → Loss Recovery. Scalable to any equilibrium window w*.
19
Difficulties at large window
Equilibrium problem
  Packet level: AI too slow, MD too drastic
  Flow level: required loss probability too small
Dynamic problem
  Packet level: must oscillate on binary signal
  Flow level: unstable at large window
20
Problem: binary signal
TCP oscillation
21
Solution: multibit signal
FAST stabilized
22
Stable: 20 ms delay (window traces)
ns-2 simulations: 50 identical FTP sources, single link at 9 pkts/ms, RED marking
23
Stable: 20 ms delay (window and queue traces)
ns-2 simulations: 50 identical FTP sources, single link at 9 pkts/ms, RED marking
24
Unstable: 200 ms delay (window traces)
ns-2 simulations: 50 identical FTP sources, single link at 9 pkts/ms, RED marking
25
Unstable: 200 ms delay (window and queue traces)
ns-2 simulations: 50 identical FTP sources, single link at 9 pkts/ms, RED marking
26
Flow level (in)stability is robust
With 30% noise added: the 20 ms case (average delay 16 ms) remains stable; the 200 ms case (average delay 208 ms) remains unstable.
27
Difficulties at large window
Equilibrium problem
  Packet level: AI too slow, MD too drastic
  Flow level: required loss probability too small
Dynamic problem
  Packet level: must oscillate on binary signal → use a multi-bit signal!
  Flow level: unstable at large window → stabilize the flow dynamics!
28
Stability: Reno/RED
(Feedback block diagram: TCP sources F1…FN, forward routing Rf(s), links/AQM G1…GL, backward routing Rb'(s); rates x, aggregate rates y, prices p, end-to-end prices q.)
Theorem (Low et al., Infocom'02): Reno/RED is locally stable if the delay and capacity are small, the number of flows N is large, and the RED gain is small; stability is lost at large delay.
29
Stability: scalable control
(Same feedback block diagram: TCP sources F1…FN, routing Rf(s), Rb'(s), links G1…GL.)
Theorem (Paganini, Doyle, Low, CDC'01): provided R is full rank, the feedback loop is locally stable for arbitrary delay, capacity, load and topology.
30
Stability: FAST
(Same feedback block diagram.)
Application
  Stabilized TCP with current routers
  Queueing delay as the congestion measure has the right scaling
  Incremental deployment with ECN
31
Outline
Background, motivation
FAST TCP
  Architecture and algorithms
  Experimental evaluations
  Loss recovery
MaxNet, SUPA FAST
32
Architecture (diagram): components operating below the RTT timescale, at the RTT timescale (window control), and for loss recovery.
33
Architecture: each component is designed independently and can be upgraded asynchronously.
34
Architecture: window control. Each component is designed independently and can be upgraded asynchronously.
35
Window control algorithm
  Full utilization regardless of bandwidth-delay product
  Globally stable, exponential convergence
  Fairness: weighted proportional fairness, parameter α
36
Window control algorithm
37
Window control algorithm
The window is adjusted according to the gap between the target backlog (α packets) and the measured backlog (the packets queued in the network, w · (1 − baseRTT/RTT)).
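A minimal sketch of the published FAST window update in C (α is the target backlog in packets, γ a smoothing gain in (0,1]; the function and variable names are hypothetical):

    /* FAST window update, applied roughly once per RTT:
     *   w <- min{ 2w, (1 - gamma) w + gamma (baseRTT/RTT * w + alpha) } */
    double fast_window_update(double w, double base_rtt, double rtt,
                              double alpha, double gamma)
    {
        double target = (base_rtt / rtt) * w + alpha;   /* drives queued packets toward alpha */
        double next   = (1.0 - gamma) * w + gamma * target;
        return (next < 2.0 * w) ? next : 2.0 * w;       /* at most double per update */
    }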
38
Window control algorithm
Theorem (Infocom '04, CDC '04, Infocom '05)
  The mapping from w(t) to w(t+1) is a contraction
  Global exponential convergence
  Full utilization after finite time
  Utility function: α_i log x_i (proportional fairness)
39
Outline
Background, motivation
FAST TCP
  Architecture and algorithms
  Experimental evaluations
  Loss recovery
MaxNet, SUPA FAST
40
Dynamic sharing: 3 flows
FAST vs Linux. Dynamic sharing on Dummynet: capacity = 800 Mbps; delay = 120 ms; 3 flows; iperf throughput; Linux 2.4.x (HSTCP patch: UCL).
41
Dynamic sharing: 3 flows
Throughput traces for FAST, Linux, HSTCP and BIC; FAST maintains steady throughput.
42
Dynamic sharing on Dummynet
Panels: queue, loss and 30-minute throughput for FAST, Linux, HSTCP and STCP. Dummynet: capacity = 800 Mbps; delay = 120 ms; 14 flows; iperf throughput; Linux 2.4.x (HSTCP patch: UCL).
43
Queue, loss and 30-minute throughput for FAST, Linux, HSTCP and BIC. FAST keeps the queue small: room for mice!
44
Aggregate throughput
Dummynet: capacity = 800 Mbps; delay = 50-200 ms; #flows = 1-14; 29 experiments; small window = 800 pkts, large window = 8000 pkts.
45
Fairness
Dummynet: capacity = 800 Mbps; delay = 50-200 ms; #flows = 1-14; 29 experiments.
46
Stability: stable in diverse scenarios
Dummynet: capacity = 800 Mbps; delay = 50-200 ms; #flows = 1-14; 29 experiments.
47
Responsiveness
Dummynet: capacity = 800 Mbps; delay = 50-200 ms; #flows = 1-14; 29 experiments.
48
I2LSR, SC2004 Bandwidth Challenge
Harvey Newman's group, Caltech. November 8, 2004: Caltech and CERN transferred 2,881 GBytes in one hour (6.86 Gbps) between Geneva - US - Geneva (25,280 km) through the LHCnet/DataTAG, Abilene and CENIC backbones, using 18 FAST TCP streams on a Linux kernel with a 9000-byte MTU, for 174 petabit-meters per second.
49
Internet2 Abilene Weather Map
7.1 Gbps path: GENV-PITS-LOSA-SNVA-STTL-DNVR-KSCY-HSTON-ATLA-WASH-NYCM-CHIN-GENV (OC48/OC192 links). Newman's group, Caltech.
50
“Ultrascale” protocol development: FAST TCP
Based on TCP Vegas; uses end-to-end delay and loss to dynamically adjust the congestion window; defines an explicit equilibrium.
Experiment: OC-level capacity (Gbps); 264 ms round trip latency; 1 flow. Bandwidth utilization: Linux TCP 30%, Westwood 40%, BIC TCP 50%, FAST 79%. (Yang Xia, Caltech)
51
Periodic losses every 10 minutes: FAST backs off to make room for Reno
(Yang Xia, Harvey Newman, Caltech)
52
Experiment by Yusung Kim, KAIST, Korea, Oct 2004
Dummynet: capacity = 622 Mbps; delay = 200 ms; router buffer size = 1 BDP (11,000 pkts); 1 flow; application: iperf. Protocols: BIC, FAST, HSTCP, STCP, Reno (Linux), CUBIC.
53
All can achieve high throughput except Reno
RTT and throughput traces for FAST, BIC, HSTCP (Yusung Kim, KAIST, Korea, 10/2004); RTT reaches 400 ms, double the baseRTT, when the buffer fills.
  All can achieve high throughput except Reno.
  FAST adds negligible queueing delay.
  Loss-based control (almost) fills the buffer, adding delay and reducing the ability to absorb bursts.
54
FAST needs smaller buffers at both routers and hosts
Queue and cwnd traces for FAST, BIC, HSTCP (Yusung Kim, KAIST, Korea, 10/2004). Loss-based control was limited at the host in these experiments.
55
Outline
Background, motivation
FAST TCP
  Architecture and algorithms
  Experimental evaluations
  Loss recovery
MaxNet, SUPA FAST
56
Loss Recovery Section Overview
Linux & TCP loss recovery has problems, especially in non-congestion loss environments.
New loss architecture:
  Determining packet loss & PIF (packets in flight)
  Decoupled window control
  Testing in a high loss environment
  Receiver window issues
  Forward retransmission
  SACK processing optimization
  Reorder detection
  Testing in a small buffer environment
57
New Loss Recovery Architecture
A new architecture for loss recovery is motivated by new environments:
  High loss: wireless, satellite
  Low loss, but large BDP
The measure of path 'difficulty' should be extended to BDLP: Bandwidth x Delay x 1/(1 - Loss).
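As a worked example (using the 10 Mbps, 80 ms RTT, 0.3-loss path that appears in the tests later in this section; the arithmetic is ours, not from the slide):

  \mathrm{BDLP} = B \times D \times \frac{1}{1-L} = 10\,\mathrm{Mbps} \times 0.08\,\mathrm{s} \times \frac{1}{1-0.3} \approx 1.14\,\mathrm{Mb}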
58
Periodic losses every 10 minutes (Yang Xia, Harvey Newman, Caltech)
59
Haystack - 1 Flow (Atlanta -> Japan)
Iperf was used to generate traffic. The sender is a 2.6 GHz Xeon. The window was constant; the burstiness in the rate is due to host processing and ACK spacing.
60
Haystack – 2 Flows from 1 machine (Atlanta -> Japan)
61
Linux Loss Recovery Problem
  1. On timeout, all outstanding packets are marked as lost; SACKs reduce the set of lost packets.
  2. Lost packets are retransmitted slowly because cwnd is capped at 1 (bug).
62
New Loss Recovery Architecture
Decouple congestion control from loss recovery:
  No rate halving, cwnd resets, etc. upon loss.
  Window primarily controlled by delay.
Efficient retransmit mechanism:
  Linux TCP does not account for PIF well when there are retransmissions.
  cwnd limits PIF, not the write queue length.
  Accurately discriminate loss from reordering.
  Construct an accurate way of determining packet loss and PIF.
63
Loss Recovery
Efficient loss recovery requires the ability to accurately determine when a packet is lost and to retransmit it immediately:
  Accurate RTT measurement to determine the timeout.
  Accurate reordering detection.
  Efficient forward retransmission strategy.
  Efficient CPU utilization to keep up with the work, especially SACK processing.
64
Loss Recovery – PIF model
A packet i is deemed lost at time t if:
  t - SENDTIME_i > RTT(t) + REORDERLAG(t)
To implement this timeout mechanism we construct a "Transmission-Order-Queue" (TOQ), which is a queue of in-flight packets (sent, but not lost or acked).
  Tail = oldest sent packet
  Head = most recently sent packet
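A minimal sketch of this check over the Transmission-Order-Queue in C (the structure and names are hypothetical, not the Linux implementation):

    #include <stddef.h>

    /* One entry per in-flight packet, kept in transmission order. */
    struct toq_pkt {
        double          send_time;   /* time this packet was (re)transmitted */
        unsigned int    seq;
        struct toq_pkt *next;        /* toward more recently sent packets    */
    };

    /* Walk from the tail (oldest sent packet) and mark packets lost while
     *   now - send_time > rtt + reorder_lag.
     * Because the queue is in transmission order, we can stop at the first
     * packet that is still within the timeout. */
    void toq_detect_losses(struct toq_pkt *tail, double now,
                           double rtt, double reorder_lag,
                           void (*mark_lost)(struct toq_pkt *))
    {
        struct toq_pkt *p;
        for (p = tail; p != NULL; p = p->next) {
            if (now - p->send_time > rtt + reorder_lag)
                mark_lost(p);        /* deemed lost: hand to the retransmit path */
            else
                break;
        }
    }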
65
Loss Recovery Architecture
66
Loss Recovery – Window control
Window control is done when the ACK for packet j is received:
  CWND = min(PIFrecv, PIFsent) >= 1
(Diagram: sender/receiver timelines for the case where packet j = 13 has been sent and its ACK received, with PIFrecv = 10 and PIFsent = 12.)
67
Loss Recovery – Test scenario
Sender PC: dual Xeon 2.6 GHz, 2 GB, Intel GbE, Linux
DummyNet PC: dual Xeon 3.06 GHz, 2 GB, FreeBSD 5.1; 1-800 Mbps, loss, 40 ms one-way delay
Receiver PC: dual Xeon 2.6 GHz, 2 GB, Intel GbE, Linux
71
Forward Retransmission
(Diagram: receiver reorder queue, highest seq# packet received, successive cwnd's.)
Packets that have been lost several times hold up the next acknowledgement and the freeing of reorder-queue resources.
The current Linux receiver window limits transmission speed under high loss and high BDP.
Forward retransmission is needed to reduce the required reorder and write queue sizes.
72
(Trace: ~11 BDPs, or 88 Mb, of data buffered at 80 ms RTT.)
73
800 Mbps, 0.3 loss
74
10 Mbps, 0.3 loss. BDP = 0.08 s x 833 pkts/s = 66 pkts ≈ 100 KB; ~600 KB shown in the trace.
75
Forward Retransmission
The number of forward retransmissions (FR) of a packet depends at least on the following factors; different algorithms can be devised to determine the best FR rate.
  sk.snd_una: oldest unacked packet
  sk.snd_wnd: receiver advertised window
  sk.snd_nxt: next packet to send
  L = sk.snd_nxt - skb.end_seq
  S = sk.snd_wnd - w
  W = sk.snd_nxt - sk.snd_una
  F = 1/(pass rate)
76
Forward Retransmission
  1. If we are short on receiver window space, i.e. S >> C does not hold, we want to increase the FR rate. (Say S = C: a packet at L = C, if lost, will delay transmission after 1 RTT.) FR is proportional to C/S.
  2. If L > C, i.e. the packet has already been retransmitted (L/C times), we want to increase the FR rate. FR is proportional to L/C.
  Therefore, by 1 & 2, FR is proportional to L/S.
  3. We don't want the FR rate to exceed, say, 3·F, three times the expected number of retransmissions. In case S is small, we cap the FR rate by 3·F·L/W.
  FR = min(L/S, 3·F·L/W)
  where L = sk.snd_nxt - skb.end_seq, S = sk.snd_wnd - w, W = sk.snd_nxt - sk.snd_una, F = 1/(pass rate), C = cwnd.
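A minimal sketch of this rate rule in C (variable names follow the slide; the function itself is hypothetical):

    /* FR = min( L/S, 3*F*L/W ), with
     *   L = sk.snd_nxt - skb.end_seq, S = sk.snd_wnd - w,
     *   W = sk.snd_nxt - sk.snd_una,  F = 1 / pass_rate.
     * Assumes S, W and pass_rate are positive. */
    double fr_rate(double L, double S, double W, double pass_rate)
    {
        double F = 1.0 / pass_rate;
        double a = L / S;
        double b = 3.0 * F * L / W;
        return (a < b) ? a : b;
    }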
78
SACK Processing
Processing of SACK packets in Linux is CPU intensive, as the write queue needs to be traversed: all SKBs prior to the SACKed SKB are marked as LOST, and the traversal can invalidate a large amount of memory cache.
The TOQ allows us to eliminate the LOST and RETRANSMITTED flags on SKBs in the write queue. This allows a number of optimizations and eliminates traversing the write queue at each SACK: it is possible to go directly to the SKB and mark it as SACKed. Finding the pointer to the SKB can be quite fast with a SACK-block pointer cache or a SEQ -> *SKB lookup table.
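One way to picture the SEQ -> *SKB lookup idea is a direct-mapped table keyed by sequence number, sketched here in C (illustrative only; not the Linux SKB API, and collision handling is omitted):

    #define LOOKUP_SLOTS 4096            /* power of two */

    struct pkt;                          /* stands in for the kernel's struct sk_buff */
    static struct pkt *seq_lookup[LOOKUP_SLOTS];

    /* Map an MSS-aligned starting sequence number to a slot. */
    static unsigned int slot_of(unsigned int seq, unsigned int mss)
    {
        return (seq / mss) & (LOOKUP_SLOTS - 1);
    }

    void lookup_insert(unsigned int seq, unsigned int mss, struct pkt *p)
    {
        seq_lookup[slot_of(seq, mss)] = p;
    }

    /* On a SACK block, jump straight to the packet instead of walking the
     * whole write queue. */
    struct pkt *lookup_find(unsigned int seq, unsigned int mss)
    {
        return seq_lookup[slot_of(seq, mss)];
    }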
79
SACK Processing Architecture
80
SACK Processing CPU Utilization
(Plot: CPU utilization with vs. without the SACK optimization.)
81
Packet Reordering
Studies have shown reordering is common on the Internet. Some causes:
  multi-path due to parallelism in routers/switches
  load-balancing across routes
82
Detecting Reordering
Record the highest sequence number (S)ACKed so far (newest_seq); on each (S)ACK for sequence number seq:
  reorder(t+1) = max(reorder(t), newest_seq - seq)
This is easy if the packet was never retransmitted; retransmitted packets need to be identified carefully:
  If there were 2 retransmissions, and the transmit time of the 2nd was << RTT ago, the (S)ACK is for the 1st transmission.
  If there were multiple retransmissions, a unique sender timestamp identifies which transmission caused the (S)ACK.
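A minimal sketch of this update in C (sequence-number wraparound and the retransmission cases are left out; names are hypothetical):

    /* Reordering extent maintained by the sender from (S)ACK order.
     * newest_seq = highest sequence number (S)ACKed so far. */
    static unsigned int newest_seq = 0;
    static unsigned int reorder    = 0;

    void on_sack(unsigned int seq, int was_retransmitted)
    {
        if (was_retransmitted)
            return;                       /* needs the timestamp disambiguation above */
        if (seq >= newest_seq) {
            newest_seq = seq;             /* arrived in order */
        } else {
            unsigned int gap = newest_seq - seq;
            if (gap > reorder)
                reorder = gap;            /* reorder <- max(reorder, newest_seq - seq) */
        }
    }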
83
Reordering Detection - Experiment
A Linux Traffic Control (tc / iproute2) module was implemented to perform controlled packet reordering (in all experiments P = 2R).
84
Reordering experiment (with reorder detection): throughput vs. reorder injected at different link capacities
85
Reordering experiment: reorder detected vs. reorder injected at different link capacities
86
Conclusions
The decoupling of loss recovery from congestion control that FAST facilitates has allowed the development of a new, highly efficient loss recovery mechanism.
By using delay as the congestion measure, it is possible to achieve close to the maximum possible goodput under loss.
87
Outline
Background, motivation
FAST TCP
  Architecture and algorithms
  Experimental evaluations
  Loss recovery
MaxNet, SUPA FAST
88
MaxNet: Quick Overview
MaxNet is:
  A fully distributed flow control architecture for large networks (no per-flow state in routers)
  Max-min fair in principle
  Stable for networks of arbitrary topology, number of users, capacity and delay
  Fast convergence properties
  Addresses short-flow control
  Low queuing delay; drastically reduces router buffer size requirements
  Based on a similar analysis to FAST TCP
  Incrementally deployable; integrates with FAST TCP
89
Why is an explicit signal useful?
Queuing delay is necessary with delay-based protocols; explicit protocols reduce queuing delay and reduce the requirement for router buffer sizes.
An explicit signal can be used to set the right starting rate.
Some challenges with delay-based protocols:
  Alpha tuning
  baseRTT (sampling, route change, etc.)
  BQD: backward queuing delay
  Delay noise (OS, etc.)
  Interaction with link-layer retransmission
  Interaction with wireless coding, etc.
90
MaxNet: Packet Format
MaxNet requires N bits in the packet (alongside the IPv4/IPv6 and TCP headers) to carry an explicit signal of the path congestion level. The routers along the packet's path modify this congestion signal. The congestion signal controls the source's rate.
91
MaxNet: System
MaxNet requires the participation of the source, routers and receiver. The source rate is controlled by a feedback value in the ACK packet. This feedback value is obtained from routers as the packet passes through MaxNet links on its way to the receiver. Each router only remarks the packet if its congestion level is higher than the one in the packet, hence "MaxNet". At the end of the path, the packet holds the highest congestion value of all routers along the path.
  Source: transmits packets at a rate controlled by the feedback value Pj in the ACK.
  Router: computes its congestion level; remarks the packet if the router's congestion level is higher than the level in the packet.
  Receiver: relays the P value back to the sender in an ACK packet.
92
MaxNet: Source & Link Algorithm
Source algorithm: the source receives ACK j with feedback value Pj and determines its transmission rate from the demand function D: Xi = D(Pj).
Link algorithm: the router monitors the aggregate input traffic rate Yl(t) destined for link l, which has capacity Cl(t).
  1. Every 10 ms, router l computes its congestion level pl:
     pl(t+1) = pl(t) + b·(Yl(t) - a·Cl(t))
     where a is the target link utilization and b controls the convergence rate.
  2. For every data packet k carrying signal Pk, the router conditionally remarks it: if (Pk < pl(t)) Pk = pl(t).
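A minimal sketch of the link side in C (the struct and function names are hypothetical; the clamp at zero is our assumption, not stated on the slide):

    /* MaxNet link algorithm: price update every 10 ms and max-marking per packet. */
    struct maxnet_link {
        double p;   /* congestion level (price)           */
        double a;   /* target utilization, e.g. 0.95      */
        double b;   /* gain controlling convergence rate  */
    };

    void link_update(struct maxnet_link *l, double Y, double C)
    {
        l->p += l->b * (Y - l->a * C);   /* p(t+1) = p(t) + b (Y(t) - a C(t)) */
        if (l->p < 0.0)
            l->p = 0.0;                  /* keep the price non-negative (assumption) */
    }

    void link_mark(const struct maxnet_link *l, double *pkt_signal)
    {
        if (*pkt_signal < l->p)
            *pkt_signal = l->p;          /* packet ends up carrying the path maximum */
    }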
93
Computing the Explicit Signal in real routers
The congestion signal can be generated almost 'implicitly' by measuring the delay of a packet through the device (input timestamp vs. output timestamp across the input queue, switch fabric and output queue):
  B(t+1) = B(t) + Y(t) - C(t)
  P(t+1) = P(t) + a·(Y(t) - u·C(t))
Advantage: simplest to implement. Disadvantage: sufficient queuing delay must be able to build up.
94
MaxNet & XCP Properties
Rate allocation - MaxNet: max-min fair if all sources have the same demand function; weighted max-min if sources weight their demand functions. XCP: constrained max-min (less than max-min).
Stability - MaxNet: provable stability for networks of arbitrary topology, RTTs, capacity and arbitrary number of flows (linear analysis). XCP: shown only for a single link and an aggregate of flows all with the same RTT; no general proof exists.
Convergence speed - MaxNet: linear analysis shows faster convergence than ECN-based, loss-based (Reno) and delay-based (FAST, Vegas) schemes. XCP: no control analysis available; some simulation results show faster convergence than TCP Reno.
Router operations per packet - MaxNet: 2 (1 addition + 1 max). XCP: 12 (3 multiplications + 1 division + 6 additions + 2 comparisons).
95
MaxNet & XCP Properties
Bits per packet - MaxNet: 40 bits/pkt with naive linear encoding; with exponential encoding even 20 bits per packet would give a huge dynamic range. XCP: 96 bits/pkt in the BSD implementation.
Incremental deployment - MaxNet: yes; MaxNet can be thought of as an explicit version of FAST TCP (where the congestion signal is implicit: delay). A combined protocol with FAST TCP is possible which uses explicit signal, delay and loss, allowing operation on paths with no explicit-signal ability. XCP: unknown.
Implementation progress - MaxNet: FAST TCP can be adopted; a Linux MaxNet module is in development; ns-2. XCP: BSD.
Lossy environments - MaxNet: decouples loss from congestion measurement; recent improvements to loss recovery for FAST TCP apply equally to MaxNet. FAST TCP was recently shown to achieve around 6 Mbps goodput at a 30% loss rate on a 70 ms, 10 Mbps link.
For more information go to:
96
XCP Max-Min Fairness
97
Explicit Starting Rate
Having a multi-bit field in the packet also allows an explicit starting rate to be communicated from the network to the source. This would allow the source to start transmitting at a high rate after 1 RTT. A possible algorithm to determine the starting rate is:
  1. Extended-SYN arrives at link:
     rate_i = alpha * ((C(t) - Y(t)) - aggregate_committed)
     aggregate_committed += rate_i
     set timeout for connection i: timeout = t + TO
  2. 1st data packet for connection i arrives at link:
     set timeout for connection i: timeout = t + RTT
     clear timeout (state space can be reduced if we eliminate this step)
  3. Timeout for connection i occurs:
     aggregate_committed -= rate_i
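A minimal sketch of this per-link bookkeeping in C (struct and function names are hypothetical; timer handling is left to the caller):

    /* Per-link state for the starting-rate handout sketched above. */
    struct start_link {
        double C;                    /* capacity C(t)                        */
        double Y;                    /* measured aggregate input rate Y(t)   */
        double aggregate_committed;  /* rate promised but not yet observed   */
        double alpha;                /* fraction of spare capacity handed out */
    };

    /* Step 1: extended SYN arrives; returns the rate granted to flow i.
     * The caller must also arm a timeout of TO (re-armed to one RTT when
     * the flow's first data packet arrives, per step 2). */
    double on_extended_syn(struct start_link *l)
    {
        double rate = l->alpha * ((l->C - l->Y) - l->aggregate_committed);
        l->aggregate_committed += rate;
        return rate;
    }

    /* Step 3: the timeout for flow i fires; release its committed rate. */
    void on_timeout(struct start_link *l, double rate_i)
    {
        l->aggregate_committed -= rate_i;
    }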
98
Signaling Unified Protocol Architecture (SUPA FAST TCP)
Existing protocols focus on one type of congestion signal. The future of FAST TCP is to combine congestion control across signal types: SUPA FAST TCP.
99
SUPA FAST TCP Network Components
Links may have one of four congestion signaling abilities, and a path may be a combination of any of these types of links. The challenge is how to detect the bottleneck's capability and how to react in all situations.
100
Conclusion
MaxNet provides a framework for doing explicit-signal congestion control. A practical approach would involve combining different congestion signals. Evolution of the Internet from loss-based protocols to explicit signaling is possible in an incremental way. Explicit protocols solve many of the challenges of using loss or delay as a congestion signal. Widespread deployment is not near-term; however, specialized applications where there is pain may see deployment sooner.