FAST TCP Cheng Jin David Wei Steven Low netlab.CALTECH.edu
Acknowledgments Caltech Bunn, Choe, Doyle, Hegde, Jayaraman, Newman, Ravot, Singh, X. Su, J. Wang, Xia UCLA Paganini, Z. Wang CERN Martin SLAC Cottrell Internet2 Almes, Shalunov MIT Haystack Observatory Lapsley, Whitney TeraGrid Linda Winkler Cisco Aiken, Doraiswami, McGugan, Yip Level(3) Fernes LANL Wu
Outline Motivation & approach FAST architecture Window control algorithm Experimental evaluation skip: theoretical foundation
Performance at large windows. ns-2 simulation (J. Wang, Caltech, June 02): capacity = 155 Mbps, 622 Mbps, 2.5 Gbps, 5 Gbps, 10 Gbps; 100 ms round trip latency; 100 flows. DataTAG network experiment (C. Jin, D. Wei, S. Ravot, et al., Caltech, Nov 02): CERN (Geneva) – StarLight (Chicago) – SLAC/Level3 (Sunnyvale); capacity = 1 Gbps; 180 ms round trip latency; 1 flow; txq = 100. [Figure: average utilization of Linux TCP vs FAST; values of 19% and 27% shown for Linux TCP.]
Congestion control: source rates x_i(t), link congestion measures p_l(t). Example congestion measure p_l(t): loss probability (Reno), queueing delay (Vegas).
TCP/AQM. Congestion control is a distributed asynchronous algorithm to share bandwidth. It has two components: TCP adapts the sending rate (window) x_i(t) to congestion; AQM adjusts and feeds back congestion information p_l(t). Together they form a distributed feedback control system, as in the toy sketch below. Equilibrium and stability depend on both TCP and AQM, and on delay, capacity, routing, and the number of connections. TCP: Reno, Vegas. AQM: DropTail, RED, REM/PI, AVQ.
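A minimal sketch of this feedback loop, assuming a single bottleneck, log-utility sources, and a simple price-like AQM; the weights, step size, and update rules are illustrative, not the actual Reno/RED/REM dynamics:

# Toy discrete-time model of the TCP/AQM feedback loop on one bottleneck link.
# Illustrative only: weights, step size and update rules are hypothetical.

def simulate(capacity=100.0, weights=(1.0, 2.0, 1.0), gamma=2e-4, steps=500):
    p = 1.0                                    # link congestion measure p_l(t) ("price")
    for _ in range(steps):
        # TCP side: each source sets its rate x_i(t) from the fed-back price
        # (weighted log-utility sources, so x_i = w_i / p).
        x = [w / p for w in weights]
        # AQM side: raise the price when aggregate rate exceeds capacity,
        # lower it when the link is underutilized.
        p = max(1e-6, p + gamma * (sum(x) - capacity))
    return x, p

if __name__ == "__main__":
    rates, price = simulate()
    # Expect rates proportional to the weights and summing to ~capacity.
    print("rates:", [round(r, 1) for r in rates], "price:", round(price, 4))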
Difficulties at large window. Equilibrium problem: packet level, AI too slow, MD too drastic; flow level, required loss probability too small. Dynamic problem: packet level, must oscillate on a binary signal; flow level, unstable at large window.
Packet & flow level. Reno TCP packet level: ACK: W ← W + 1/W; Loss: W ← W − 0.5W. Flow level: equilibrium and dynamics; equilibrium window W ≈ 1.22/√p pkts (Mathis formula).
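To see why the required loss probability becomes too small (previous slide), plug 10 Gbps, 100 ms RTT, and an assumed 1500-byte packet size into the Mathis formula:

# Loss probability Reno would need to sustain a full 10 Gbps window (Mathis formula).
capacity_bps, rtt_s, pkt_bytes = 10e9, 0.1, 1500
w = capacity_bps * rtt_s / (pkt_bytes * 8)   # bandwidth-delay product, ~83,000 pkts
p = (1.22 / w) ** 2                          # invert W ~ 1.22 / sqrt(p)
print(f"window = {w:.0f} pkts, required loss probability ~ {p:.1e}")   # ~2e-10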
Reno TCP. Packet level: designed and implemented first. Flow level: understood afterwards. Flow level dynamics determine equilibrium (performance, fairness) and stability. Design flow level equilibrium & stability; implement flow level goals at the packet level.
Reno TCP. Packet level: designed and implemented first. Flow level: understood afterwards. Flow level dynamics determine equilibrium (performance, fairness) and stability. Packet level design of FAST, HSTCP, STCP is guided by flow level properties.
Packet level. Reno AIMD(1, 0.5): ACK: W ← W + 1/W; Loss: W ← W − 0.5W. HSTCP AIMD(a(w), b(w)): ACK: W ← W + a(w)/W; Loss: W ← W − b(w)·W. STCP MIMD(a, b): ACK: W ← W + a; Loss: W ← W − 0.125W. FAST: see the window control algorithm and the sketch below.
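For reference, a minimal Python sketch of these packet-level rules; the HSTCP gain functions a(w), b(w) are placeholders for the real table-driven values, the STCP increment uses the usual Scalable TCP constant a = 0.01, and FAST (which updates once per RTT) is sketched after the window control slide:

# Per-ACK / per-loss window updates for the loss-based protocols above.
def reno_update(w, loss):
    return w - 0.5 * w if loss else w + 1.0 / w           # AIMD(1, 0.5)

def hstcp_update(w, loss, a=lambda w: 1.0, b=lambda w: 0.5):
    return w - b(w) * w if loss else w + a(w) / w         # AIMD(a(w), b(w)); a, b placeholders

def stcp_update(w, loss, a=0.01, b=0.125):
    return w - b * w if loss else w + a                   # MIMD(a, b)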
Flow level: Reno, HSTCP, STCP, FAST. Similar flow level equilibrium: throughput x_i ≈ α_i / (T_i · p_i^k) pkts/sec, with k = 0.5 for Reno (Mathis formula), k ≈ 0.84 for HSTCP, k = 1 for STCP.
Flow level: Reno, HSTCP, STCP, FAST. Different gain and utility function U_i: they determine equilibrium and stability. Different congestion measure p_i: loss probability (Reno, HSTCP, STCP), queueing delay (Vegas, FAST). Common flow level dynamics: window adjustment = (control gain) × (flow level goal).
Implementation strategy. Common flow level dynamics: window adjustment = (control gain) × (flow level goal). Small adjustment when close to the target, large when far away; need to estimate how far the current state is from the target; scalable. By contrast, a window adjustment that is independent of p_i and depends only on the current window is difficult to scale.
Outline Motivation & approach FAST architecture Window control algorithm Experimental evaluation skip: theoretical foundation
Architecture. [Diagram labels: RTT timescale; loss recovery; <RTT timescale.]
Architecture. Each component is designed independently and upgraded asynchronously. [Diagram: Window Control.]
FAST TCP basic idea: use delay as the congestion measure. Delay provides finer congestion information, scales correctly with network capacity, and allows operation with low queueing delay. [Figure: congestion window vs queueing delay and loss, FAST vs loss-based TCP.]
Window control algorithm: full utilization regardless of bandwidth-delay product; globally stable, with exponential convergence; fairness: weighted proportional fairness, weights set by a protocol parameter.
Window control algorithm (see the Infocom 2004 paper): periodically, w ← min{ 2w, (1 − γ)·w + γ·( (baseRTT/RTT)·w + α ) }, where α is the target backlog (packets the flow aims to keep buffered in the network) and w·(RTT − baseRTT)/RTT is the measured backlog.
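A minimal Python sketch of this update on a toy single-bottleneck model; gamma, alpha, and the queueing model are illustrative assumptions, not tuned values from the paper:

# FAST window update:  w <- min{ 2w, (1-gamma)*w + gamma*(baseRTT/RTT * w + alpha) }
# alpha = target backlog (pkts queued in the network);
# w * (RTT - baseRTT) / RTT = measured backlog.

def fast_update(w, base_rtt, rtt, alpha=200.0, gamma=0.5):
    return min(2.0 * w, (1.0 - gamma) * w + gamma * ((base_rtt / rtt) * w + alpha))

if __name__ == "__main__":
    capacity = 10000.0             # bottleneck rate, pkts/sec
    base_rtt = 0.1                 # 100 ms propagation round trip
    w = 100.0                      # congestion window, pkts
    for _ in range(40):
        backlog = max(0.0, w - capacity * base_rtt)    # pkts queued at the bottleneck
        rtt = base_rtt + backlog / capacity            # queueing delay inflates the RTT
        w = fast_update(w, base_rtt, rtt)
    # Expect w ~= bandwidth-delay product + alpha and backlog ~= alpha.
    print("window:", round(w, 1), "backlog:", round(w - capacity * base_rtt, 1))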
Outline Motivation & approach FAST architecture Window control algorithm Experimental evaluation Abilene-HENP network Haystack Observatory DummyNet
Abilene test, OC48 and OC192 (Yang Xia, Harvey Newman, Caltech). Periodic losses every 10 minutes.
(Yang Xia, Harvey Newman, Caltech) Periodic losses every 10 minutes. FAST backs off to make room for Reno.
“Ultrascale” protocol development: FAST TCP. Based on TCP Vegas; uses end-to-end delay and loss to dynamically adjust the congestion window; defines an explicit equilibrium. [Figure: bandwidth utilization of Linux TCP, Westwood+, BIC TCP, and FAST; values of 30%, 40%, 50%, and 79% shown; capacity = OC Gbps; 264 ms round trip latency; 1 flow.] (Yang Xia, Caltech)
Haystack Experiments Lapsley, MIT Haystack
Haystack – 1 flow (Atlanta → Japan). Iperf used to generate traffic; sender is a 2.6 GHz Xeon. Window was constant; burstiness in rate is due to host processing and ACK spacing. Lapsley, MIT Haystack
Haystack – 2 Flows from 1 machine (Atlanta -> Japan) Lapsley, MIT Haystack
Linux loss recovery. On timeout, all outstanding packets are marked as lost. 1. SACKs reduce the number of packets marked as lost. 2. Lost packets are retransmitted slowly because cwnd is capped at 1 (bug).
DummyNet experiments. Experiments using an emulated network: 800 Mbps emulated bottleneck in DummyNet. [Testbed: Sender PC, dual Xeon 2.6 GHz, 2 GB, Intel GbE, Linux; DummyNet PC, dual Xeon 3.06 GHz, 2 GB, FreeBSD, 800 Mbps bottleneck; Receiver PC, dual Xeon 2.6 GHz, 2 GB, Intel GbE, Linux.]
Dynamic sharing: 3 flows. [Figure panels: FAST, Linux.] Dynamic sharing on DummyNet: capacity = 800 Mbps, delay = 120 ms, 3 flows, iperf throughput, Linux 2.4.x (HSTCP: UCL).
Dynamic sharing: 3 flows. [Figure panels: FAST, Linux, HSTCP, BIC; steady throughput.]
[Figure panels: FAST, Linux, STCP, HSTCP; traces of throughput, loss, and queue over 30 min.] Dynamic sharing on DummyNet: capacity = 800 Mbps, delay = 120 ms, 14 flows, iperf throughput, Linux 2.4.x (HSTCP: UCL).
[Figure panels: FAST, Linux, HSTCP, BIC; traces of throughput, loss, and queue over 30 min.] Room for mice!
Average queue vs buffer size. DummyNet: capacity = 800 Mbps, delay = 200 ms, 1 flow; buffer size: 50, …, 8000 pkts. (S. Hegde, B. Wydrowski, et al., Caltech)
Is a large queue necessary for high throughput?
FAST TCP: motivation, architecture, algorithms, performance. IEEE Infocom, March 2004. Release: April 2004. Source freely available for any non-profit use: netlab.caltech.edu/FAST
Aggregate throughput: ideal performance. DummyNet: capacity = 800 Mbps; delay = … ms; #flows = 1–14; 29 experiments.
Aggregate throughput. [Figure annotations: small window, 800 pkts; large window, 8000 pkts.] DummyNet: capacity = 800 Mbps; delay = … ms; #flows = 1–14; 29 experiments.
Fairness: Jain's index; HSTCP ~ Reno. DummyNet: capacity = 800 Mbps; delay = … ms; #flows = 1–14; 29 experiments.
Stability: stable in diverse scenarios. DummyNet: capacity = 800 Mbps; delay = … ms; #flows = 1–14; 29 experiments.
FAST TCP: motivation, architecture, algorithms, performance. IEEE Infocom, March 2004. Release: April 2004. Source freely available for any non-profit use: netlab.caltech.edu/FAST
BACKUP Slides
IP rights. Caltech owns the IP; the rights are applicable more broadly than TCP, so all options are left open. IP freely available if FAST TCP becomes an IETF standard. Code available on the FAST website for any non-commercial use.
WAN in Lab Caltech: John Doyle, Raj Jayaraman, George Lee, Steven Low (PI), Harvey Newman, Demetri Psaltis, Xun Su, Yang Xia Cisco: Bob Aiken, Vijay Doraiswami, Chris McGugan, Steven Yip netlab.caltech.edu NSF
Key Personnel Steven Low, CS/EE Harvey Newman, Physics John Doyle, EE/CDS Demetri Psaltis, EE Cisco Bob Aiken Vijay Doraiswami Chris McGugan Steven Yip Raj Jayaraman, CS Xun Su, Physics Yang Xia, Physics George Lee, CS 2 grad students 3 summer students Cisco engineers
Spectrum of tools. [Chart: log(cost) vs log(abstraction), spanning math, simulation, emulation, live network, and WAN in Lab.] Math: Mathis formula, optimization, control theory, nonlinear models, stochastic models. Simulation: NS, SSFNet, QualNet, JavaSim. Emulation: DummyNet, EmuLab, ModelNet, WAIL. Live networks: PlanetLab, Abilene, NLR, DataTAG, CENIC, WAIL, etc. WAN in Lab? …we use them all.
Spectrum of tools. [Table comparing math, simulation, emulation, live networks, and WAN in Lab on distance, speed, realism, traffic, configurability, monitoring, and cost, rated High/Medium/Low.] Critical in development, e.g. Web100.
Goal: a state-of-the-art hybrid WAN. High speed, large distance: 2.5G → 10G, 50 – 200 ms. Wireless devices connected by an optical core. Controlled & repeatable experiments. Reconfigurable & evolvable. Built-in monitoring capability.
WAN in Lab 5-year plan: 6 Cisco ONS15454, 4 routers, 10s of servers, wireless devices, 800 km fiber, ~100 ms RTT. V. Doraiswami (Cisco), R. Jayaraman (Caltech)
WAN in Lab year-1 plan: 3 Cisco ONS, 2 routers, 10s of servers, wireless devices. V. Doraiswami (Cisco), R. Jayaraman (Caltech)
Hybrid network scenarios: ad hoc networks, cellular networks, sensor networks. How does the optical core support wireless edges? X. Su (Caltech)
Experiments Transport & network layer TCP, AQM, TCP/IP interaction Wireless hybrid networking Wireless media delivery Fixed wireless access Sensor networks Optical control plane Grid computing UltraLight
WAN in Lab. Capacity: 2.5 – 10 Gbps. Delay: 0 – 100 ms round trip; 0 – 400 ms round trip. Configurable & evolvable: topology, rate, delays, routing; always at the cutting edge. Flexible, active debugging: passive monitoring, AQM. Integral part of R&A networks: transition from theory to implementation, demonstration, deployment; transition from lab to marketplace. Global resource: part of global infrastructure; UltraLight led by Newman. Unique capabilities. [Network diagram: Calren2/Abilene, Chicago, Amsterdam, CERN Geneva, SURFNet, StarLight, WAN in Lab, Caltech research & production networks; multi-Gbps links, ms-scale delay experiments.]
Network debugging. Performance problems in real networks: simulation will miss them, emulation might miss them, and live networks are hard to debug. WAN in Lab: passive monitoring inside the network; active debugging possible.
Passive monitoring. [Diagram: fiber splitter → DAG capture card → timestamp + header → RAID; GPS-synchronized monitor.] No overhead on the system; can capture full information at OC48 (the University of Waikato's DAG card captures at OC48 speed); can filter if necessary. Disk speed for header capture = 2.5 Gbps × 40/1500 ≈ 66 Mbps. Monitors synchronized by GPS or cheaper alternatives; data stored for offline analysis. D. Wei (Caltech)
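The header-capture arithmetic from this slide written out, assuming 40 bytes captured per 1500-byte packet at OC-48 line rate:

# Disk bandwidth needed to capture only packet headers at OC-48 line rate.
line_rate_bps = 2.5e9                     # OC-48
header_bytes, pkt_bytes = 40, 1500
disk_bw_bps = line_rate_bps * header_bytes / pkt_bytes
print(round(disk_bw_bps / 1e6), "Mbps")   # ~67 Mbps, i.e. the ~66 Mbps figure on the slide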
Passive monitoring. [Diagram: fiber splitter → DAG card → timestamp + header → RAID, GPS-synchronized monitor; server, router, and monitor instrumented with Web100, MonALISA.] D. Wei (Caltech)
UltraLight testbed UltraLight team (Newman)
Status. Hardware: optical transport design finalized; IP infrastructure design finalized (almost); wireless infrastructure design finalized; price negotiation/ordering/delivery: summer 04. Software: passive monitoring: summer student; management software. Physical lab: renovation to be completed by summer 04.
Status. [Timeline: NSF funds 10/03 → usable testbed 12/04 → useful testbed 12/05; ARO funds 5/04; items: hardware design, physical building, fund raising, monitoring, traffic generation, connected to UltraLight, expansion, support, management.]
[Diagram: CS Dept, Jorgensen Lab, Net Lab, WAN in Lab.] G. Lee, R. Jayaraman, E. Nixon (Caltech)
Summary. Testbed driven by research agenda: rich and strong networking effort; integrated approach: theory + implementation + experiments; “a network that can break”. Integral part of real testbeds: part of global infrastructure; UltraLight led by Harvey Newman (Caltech). Integrated monitoring & measurement facility: fiber splitter passive monitors, MonALISA.