1 FAST TCP Cheng Jin, David Wei, Steven Low (netlab.CALTECH.edu)

2 Acknowledgments  Caltech: Bunn, Choe, Doyle, Hegde, Jayaraman, Newman, Ravot, Singh, X. Su, J. Wang, Xia  UCLA: Paganini, Z. Wang  CERN: Martin  SLAC: Cottrell  Internet2: Almes, Shalunov  MIT Haystack Observatory: Lapsley, Whitney  TeraGrid: Linda Winkler  Cisco: Aiken, Doraiswami, McGugan, Yip  Level(3): Fernes  LANL: Wu

3 Outline  Motivation & approach  FAST architecture  Window control algorithm  Experimental evaluation skip: theoretical foundation

4 Performance at large windows
 ns-2 simulation (J. Wang, Caltech, June 02): capacity = 155Mbps, 622Mbps, 2.5Gbps, 5Gbps, 10Gbps; 100 ms round trip latency; 100 flows
 DataTAG Network experiment (C. Jin, D. Wei, S. Ravot, et al., Caltech, Nov 02): CERN (Geneva) – StarLight (Chicago) – SLAC/Level3 (Sunnyvale); capacity = 1Gbps; 180 ms round trip latency; 1 flow; txq=100
[figure: average utilization of Linux TCP vs FAST at 1G and 10Gbps; values shown: 19%, 27% (txq=100), 95% (txq=10000)]

5 Congestion control [diagram: sources with rates x_i(t), links with congestion measures p_l(t)] Example congestion measure p_l(t): loss (Reno), queueing delay (Vegas)

6 TCP/AQM  Congestion control is a distributed asynchronous algorithm to share bandwidth  It has two components: TCP adapts the sending rate (window) x_i(t) to congestion; AQM adjusts & feeds back congestion information p_l(t)  They form a distributed feedback control system: equilibrium & stability depend on both TCP and AQM, and on delay, capacity, routing, #connections  Examples: TCP: Reno, Vegas; AQM: DropTail, RED, REM/PI, AVQ
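A toy sketch of this feedback loop (the price-update form, the utility choice, and all parameters are assumptions for illustration, not FAST's or any AQM's actual code): one source picks its rate from the congestion price, and one link integrates its excess arrival rate into that price; the pair settles at full utilization.

```c
/* Toy illustration of the TCP/AQM feedback loop (hypothetical parameters):
 * the link integrates excess rate into a congestion "price" p, and the
 * source chooses its rate from the price, here x = alpha / p (a
 * Vegas/FAST-like utility).  The pair converges to x ~= c. */
#include <stdio.h>

int main(void) {
    double c = 100.0;      /* link capacity (pkts per tick), assumed  */
    double alpha = 50.0;   /* source parameter, assumed               */
    double gamma = 0.001;  /* price step size, assumed                */
    double p = 1.0;        /* initial congestion price                */
    for (int t = 0; t < 20000; t++) {
        double x = alpha / p;      /* TCP: rate from congestion price */
        p += gamma * (x - c);      /* AQM: integrate excess rate      */
        if (p < 1e-6) p = 1e-6;    /* keep the price positive         */
    }
    printf("equilibrium rate ~= %.1f, price ~= %.3f\n", alpha / p, p);
    return 0;
}
```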

7 Difficulties at large window  Equilibrium problem: packet level: AI too slow, MD too drastic; flow level: required loss probability too small  Dynamic problem: packet level: must oscillate on a binary signal; flow level: unstable at large window

8 Packet & flow level (Reno TCP)  Packet level: ACK: W ← W + 1/W; Loss: W ← W − 0.5W  Flow level: equilibrium and dynamics; equilibrium throughput x ≈ 1.225/(T·√p) pkts/sec (Mathis formula)
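To make the flow-level equilibrium concrete, and to see why slide 7 calls the required loss probability "too small", here is a small worked computation that inverts the Mathis formula; the 10 Gbps / 100 ms / 1500-byte numbers are illustrative, not taken from the deck.

```c
/* Loss probability Reno needs to sustain a given rate, from the Mathis
 * formula x = 1.225 / (T * sqrt(p))  =>  p = (1.225 / (x*T))^2.
 * Link parameters below are illustrative. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double rate_bps = 10e9;            /* target throughput: 10 Gbps     */
    double rtt = 0.1;                  /* round-trip time: 100 ms        */
    double pkt_bits = 1500.0 * 8.0;    /* packet size: 1500 bytes        */
    double x = rate_bps / pkt_bits;    /* target rate in packets/sec     */
    double p = pow(1.225 / (x * rtt), 2.0);
    printf("window = %.0f pkts, required loss probability = %.2e\n",
           x * rtt, p);               /* ~83,000 pkts and p on the order of 1e-10 */
    return 0;
}
```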

9 Reno TCP  Packet level: designed and implemented first  Flow level: understood afterwards  Flow level dynamics determine equilibrium (performance, fairness) and stability  Approach: design flow level equilibrium & stability, then implement flow level goals at the packet level

10 Reno TCP  Packet level: designed and implemented first  Flow level: understood afterwards  Flow level dynamics determine equilibrium (performance, fairness) and stability  Packet level design of FAST, HSTCP, STCP guided by flow level properties

11 Packet level  Reno AIMD(1, 0.5): ACK: W ← W + 1/W; Loss: W ← W − 0.5W  HSTCP AIMD(a(w), b(w)): ACK: W ← W + a(w)/W; Loss: W ← W − b(w)·W  STCP MIMD(a, b): ACK: W ← W + 0.01; Loss: W ← W − 0.125W  FAST: delay-based update (see slide 21)
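These per-ACK / per-loss rules map almost one-to-one into code. A schematic sketch only (not kernel code); hstcp_a() and hstcp_b() are hypothetical stubs standing in for HSTCP's window-dependent increase/decrease tables (RFC 3649), returning the Reno endpoints as placeholders.

```c
#include <stdio.h>

static double hstcp_a(double w) { (void)w; return 1.0; }  /* placeholder for a(w) */
static double hstcp_b(double w) { (void)w; return 0.5; }  /* placeholder for b(w) */

static double reno_ack(double w)   { return w + 1.0 / w; }        /* AIMD(1, 0.5)      */
static double reno_loss(double w)  { return w - 0.5 * w; }
static double hstcp_ack(double w)  { return w + hstcp_a(w) / w; } /* AIMD(a(w), b(w))  */
static double hstcp_loss(double w) { return w - hstcp_b(w) * w; }
static double stcp_ack(double w)   { return w + 0.01; }           /* MIMD(0.01, 0.125) */
static double stcp_loss(double w)  { return w - 0.125 * w; }

int main(void) {
    double w = 1000.0;   /* example congestion window in packets */
    printf("Reno:  ack -> %.3f, loss -> %.1f\n", reno_ack(w), reno_loss(w));
    printf("HSTCP: ack -> %.3f, loss -> %.1f\n", hstcp_ack(w), hstcp_loss(w));
    printf("STCP:  ack -> %.3f, loss -> %.1f\n", stcp_ack(w), stcp_loss(w));
    return 0;
}
```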

12 Flow level: Reno, HSTCP, STCP, FAST  Similar flow level equilibrium throughput (pkts/sec, generalized Mathis formula) with α = 1.225 (Reno), 0.120 (HSTCP), 0.075 (STCP)

13 Flow level: Reno, HSTCP, STCP, FAST  Different gain κ_i and utility U_i: they determine equilibrium and stability  Different congestion measure p_i: loss probability (Reno, HSTCP, STCP); queueing delay (Vegas, FAST)  Common flow level dynamics: window adjustment = (control gain) × (flow level goal)

14 Implementation strategy  Common flow level dynamics: window adjustment = (control gain) × (flow level goal)  Small adjustment when close to the target, large when far away: need to estimate how far the current state is from the target; scalable  Window adjustment independent of p_i: depends only on the current window; difficult to scale

15 Outline  Motivation & approach  FAST architecture  Window control algorithm  Experimental evaluation skip: theoretical foundation

16 Architecture [diagram: FAST components; window control at the RTT timescale, loss recovery at <RTT timescale]

17 Architecture Each component  designed independently  upgraded asynchronously

18 Architecture Each component  designed independently  upgraded asynchronously Window Control

19 FAST TCP basic idea: use delay as the congestion measure  Delay provides finer congestion info  Delay scales correctly with network capacity  Can operate with low queueing delay [figure: congestion window and queueing delay, FAST vs loss-based TCP]
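To make "delay as congestion measure" concrete: queueing delay is the amount by which observed RTTs exceed the propagation delay. A minimal sketch, assuming a min-filter for the base RTT and an exponentially weighted average for the smoothed RTT (the weight and samples are assumptions, not FAST's exact estimator):

```c
/* Minimal delay-based congestion estimator (illustrative): track the
 * smallest RTT seen as the propagation delay and a smoothed average RTT;
 * their difference estimates the queueing delay. */
#include <stdio.h>

struct delay_est {
    double base_rtt;   /* smallest RTT observed (propagation delay) */
    double avg_rtt;    /* exponentially smoothed RTT                */
};

static void delay_est_init(struct delay_est *e) {
    e->base_rtt = 1e9;
    e->avg_rtt = 0.0;
}

static double delay_est_update(struct delay_est *e, double rtt_sample) {
    if (rtt_sample < e->base_rtt)
        e->base_rtt = rtt_sample;
    if (e->avg_rtt == 0.0)
        e->avg_rtt = rtt_sample;
    else
        e->avg_rtt = 0.875 * e->avg_rtt + 0.125 * rtt_sample; /* assumed weight */
    return e->avg_rtt - e->base_rtt;   /* estimated queueing delay */
}

int main(void) {
    struct delay_est e;
    delay_est_init(&e);
    double samples[] = {0.100, 0.102, 0.110, 0.125, 0.130}; /* seconds, made up */
    for (int i = 0; i < 5; i++)
        printf("queueing delay ~ %.3f s\n", delay_est_update(&e, samples[i]));
    return 0;
}
```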

20 Window control algorithm  Full utilization regardless of bandwidth-delay product  Globally stable, exponential convergence  Fairness: weighted proportional fairness, parameter α

21 Window control algorithm [equation: window update annotated with target backlog (α) and measured backlog]
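A sketch of the window update behind these labels, following the per-update rule published in the Infocom 2004 paper cited later in the deck; the γ and α values, the fixed RTT, and the starting window are illustrative, not the slides'.

```c
/* FAST window update per the Infocom 2004 paper this deck cites:
 *   w <- min( 2w, (1 - gamma)*w + gamma*((baseRTT/RTT)*w + alpha) )
 * alpha is the target backlog (packets the flow aims to keep queued);
 * w*(RTT - baseRTT)/RTT is the measured backlog. */
#include <stdio.h>

static double fast_update(double w, double base_rtt, double rtt,
                          double alpha, double gamma) {
    double next = (1.0 - gamma) * w + gamma * ((base_rtt / rtt) * w + alpha);
    return (next < 2.0 * w) ? next : 2.0 * w;   /* at most double per update */
}

int main(void) {
    double w = 100.0;         /* initial window (pkts), illustrative        */
    double base_rtt = 0.100;  /* propagation RTT: 100 ms, assumed           */
    double rtt = 0.120;       /* observed RTT held fixed here for clarity   */
    double alpha = 200.0, gamma = 0.5;   /* assumed values                  */
    for (int i = 0; i < 50; i++)
        w = fast_update(w, base_rtt, rtt, alpha, gamma);
    /* fixed point: measured backlog = alpha, i.e. w = alpha/(1 - baseRTT/RTT) */
    printf("window converges toward %.0f pkts (got %.0f)\n",
           alpha / (1.0 - base_rtt / rtt), w);
    return 0;
}
```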

22 Outline  Motivation & approach  FAST architecture  Window control algorithm  Experimental evaluation Abilene-HENP network Haystack Observatory DummyNet

23 Abilene Test: OC48, OC192 (Yang Xia, Harvey Newman, Caltech); periodic losses every 10 minutes

24 (Yang Xia, Harvey Newman, Caltech) Periodic losses every 10 minutes

25 (Yang Xia, Harvey Newman, Caltech) Periodic losses every 10 minutes; FAST backs off to make room for Reno

26 “Ultrascale” protocol development: FAST TCP  Based on TCP Vegas  Uses end-to-end delay and loss to dynamically adjust the congestion window  Defines an explicit equilibrium [figure: BW use of Linux TCP, Westwood+, BIC TCP, FAST; values shown: 30%, 40%, 50%, 79%; capacity = OC-192 9.5Gbps; 264 ms round trip latency; 1 flow] (Yang Xia, Caltech)

27 Haystack Experiments Lapsley, MIT Haystack

28 Haystack: 1 flow (Atlanta → Japan). Iperf used to generate traffic; sender is a Xeon 2.6 GHz. Window was constant; burstiness in rate is due to host processing and ACK spacing. Lapsley, MIT Haystack

29 Haystack: 2 flows from 1 machine (Atlanta → Japan). Lapsley, MIT Haystack

30 Linux loss recovery: on timeout, all outstanding packets are marked as lost. 1. SACKs reduce the number of packets marked lost. 2. Lost packets are retransmitted slowly because cwnd is capped at 1 (bug).

31 DummyNet Experiments  Experiments using an emulated network  800 Mbps emulated bottleneck in DummyNet
 Sender PC: dual Xeon 2.6 GHz, 2 GB, Intel GbE, Linux 2.4.22
 DummyNet PC: dual Xeon 3.06 GHz, 2 GB, FreeBSD 5.1, 800 Mbps bottleneck
 Receiver PC: dual Xeon 2.6 GHz, 2 GB, Intel GbE, Linux 2.4.22

32 Dynamic sharing: 3 flows [figure: FAST vs Linux] Dynamic sharing on Dummynet:  capacity = 800Mbps  delay = 120ms  3 flows  iperf throughput  Linux 2.4.x (HSTCP: UCL)

33 Dynamic sharing: 3 flows [figure: FAST, Linux, HSTCP, BIC; steady throughput]

34 Dynamic sharing on Dummynet [figure: throughput, loss, and queue over 30 min for FAST, Linux, STCP, HSTCP]  capacity = 800Mbps  delay = 120ms  14 flows  iperf throughput  Linux 2.4.x (HSTCP: UCL)

35 [figure: throughput, loss, and queue over 30 min for FAST, Linux, HSTCP, BIC] Room for mice!

36 Average queue vs buffer size. Dummynet:  capacity = 800Mbps  delay = 200ms  1 flow  buffer size: 50, …, 8000 pkts (S. Hegde, B. Wydrowski, et al., Caltech)

37 Is large queue necessary for high throughput?

38  FAST TCP: motivation, architecture, algorithms, performance. IEEE Infocom, March 2004  α-release: April 2004; source freely available for any non-profit use: netlab.caltech.edu/FAST

39 Aggregate throughput [figure: measured aggregate throughput vs ideal performance] Dummynet: cap = 800Mbps; delay = 50-200ms; #flows = 1-14; 29 expts

40 Aggregate throughput [figure: small window (800 pkts) vs large window (8000 pkts)] Dummynet: cap = 800Mbps; delay = 50-200ms; #flows = 1-14; 29 expts

41 Fairness [figure: Jain’s index; HSTCP ~ Reno] Dummynet: cap = 800Mbps; delay = 50-200ms; #flows = 1-14; 29 expts
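For reference, Jain's index is (Σ x_i)² / (n · Σ x_i²), which equals 1 when all flows get equal throughput and approaches 1/n when one flow takes everything. A small sketch with made-up throughput numbers:

```c
/* Jain's fairness index: J = (sum x_i)^2 / (n * sum x_i^2).
 * Throughput numbers below are made up for illustration. */
#include <stdio.h>

static double jain_index(const double *x, int n) {
    double sum = 0.0, sum_sq = 0.0;
    for (int i = 0; i < n; i++) {
        sum += x[i];
        sum_sq += x[i] * x[i];
    }
    return (sum * sum) / (n * sum_sq);
}

int main(void) {
    double even[]   = {100.0, 100.0, 100.0, 100.0};  /* Mbps, illustrative */
    double skewed[] = {310.0, 50.0, 25.0, 15.0};
    printf("even: %.3f  skewed: %.3f\n",
           jain_index(even, 4), jain_index(skewed, 4));
    return 0;
}
```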

42 Stability Dummynet: cap = 800Mbps; delay = 50-200ms; #flows = 1-14; 29 expts stable in diverse scenarios

43  FAST TCP: motivation, architecture, algorithms, performance. IEEE Infocom, March 2004  α-release: April 2004; source freely available for any non-profit use: netlab.caltech.edu/FAST

44 BACKUP Slides

45 IP Rights  Caltech owns IP rights, applicable more broadly than TCP; leave all options open  IP freely available if FAST TCP becomes an IETF standard  Code available on the FAST website for any non-commercial use

46 WAN in Lab Caltech: John Doyle, Raj Jayaraman, George Lee, Steven Low (PI), Harvey Newman, Demetri Psaltis, Xun Su, Yang Xia Cisco: Bob Aiken, Vijay Doraiswami, Chris McGugan, Steven Yip netlab.caltech.edu NSF

47 Key Personnel  Steven Low, CS/EE  Harvey Newman, Physics  John Doyle, EE/CDS  Demetri Psaltis, EE Cisco  Bob Aiken  Vijay Doraiswami  Chris McGugan  Steven Yip  Raj Jayaraman, CS  Xun Su, Physics  Yang Xia, Physics  George Lee, CS  2 grad students  3 summer students  Cisco engineers

48 Spectrum of tools [chart: log(abstraction) vs log(cost) across math, simulation, emulation, and live networks; WAN in Lab sits between emulation and live networks]
 Math: Mathis formula, optimization, control theory, nonlinear model, stochastic model
 Simulation: NS, SSFNet, QualNet, JavaSim
 Emulation: DummyNet, EmuLab, ModelNet, WAIL
 Live networks: PlanetLab, Abilene, NLR, DataTAG, CENIC, WAIL, etc.
…we use them all

49 Spectrum of tools [table comparing math, simulation, emulation, live networks, and WAN in Lab on distance, speed, realism, traffic, configurability, monitoring, and cost] Monitoring is critical in development, e.g. Web100

50 Goal: state-of-the-art hybrid WAN  High speed, large distance: 2.5G → 10G, 50 – 200ms  Wireless devices connected by optical core  Controlled & repeatable experiments  Reconfigurable & evolvable  Built-in monitoring capability

51 WAN in Lab  5-year plan  6 Cisco ONS15454  4 routers  10s servers  Wireless devices  800km fiber  ~100ms RTT V. Doraiswami (Cisco) R. Jayaraman (Caltech)

52 WAN in Lab  Year-1 plan  3 Cisco ONS 15454  2 routers  10s servers  Wireless devices V. Doraiswami (Cisco) R. Jayaraman (Caltech)

53 Hybrid Network Scenarios:  Ad hoc network  Cellular network  Sensor network How does the optical core support wireless edges? X. Su (Caltech)

54 Experiments  Transport & network layer TCP, AQM, TCP/IP interaction  Wireless hybrid networking Wireless media delivery Fixed wireless access Sensor networks  Optical control plane  Grid computing UltraLight

55 Unique capabilities
 WAN in Lab: capacity 2.5 – 10 Gbps; delay 0 – 100 ms / 0 – 400 ms round trip
 Configurable & evolvable: topology, rate, delays, routing; always at the cutting edge
 Flexible, active debugging: passive monitoring, AQM
 Integral part of R&A networks: transition from theory to implementation, demonstration, deployment; transition from lab to marketplace
 Global resource: part of global infrastructure; UltraLight led by Newman
[diagram: WAN in Lab and Caltech research & production networks connected via Calren2/Abilene and StarLight (Chicago) to SURFNet (Amsterdam) and CERN (Geneva); multi-Gbps, 50-200ms delay experiments]

56 Network debugging  Performance problems in real network Simulation will miss Emulation might miss Live network hard to debug  WAN in Lab Passive monitoring inside network Active debugging possible

57 Passive monitoring [diagram: fiber splitter feeding a GPS-timestamped DAG monitor that writes packet headers to RAID]  No overhead on the system  Can capture full info at OC48: U of Waikato's DAG card captures at OC48 speed; can filter if necessary; required disk speed = 2.5Gbps × 40/1500 ≈ 66Mbps (storing 40-byte headers of 1500-byte packets)  Monitors synchronized by GPS or cheaper alternatives  Data stored for offline analysis D. Wei (Caltech)
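A quick check of the disk-speed arithmetic on this slide, using the OC48 rate and the 40-byte-header / 1500-byte-packet figures stated above:

```c
/* Disk bandwidth needed when storing only 40-byte headers of 1500-byte
 * packets arriving at ~2.5 Gbps (OC48), as on the slide. */
#include <stdio.h>

int main(void) {
    double link_bps = 2.5e9;                    /* OC48, approximate        */
    double pkt_bytes = 1500.0, hdr_bytes = 40.0;
    double disk_bps = link_bps * hdr_bytes / pkt_bytes;
    printf("required disk bandwidth ~ %.1f Mbps\n", disk_bps / 1e6);
    return 0;   /* prints ~66.7 Mbps, matching the slide's 66 Mbps figure */
}
```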

58 Passive monitoring [diagram: fiber splitter / DAG / GPS-timestamped monitor with RAID storage, plus server, router, and monitor instrumentation via Web100 and MonALISA] D. Wei (Caltech)

59 UltraLight testbed UltraLight team (Newman)

60 Status  Hardware: optical transport design finalized; IP infrastructure design finalized (almost); wireless infrastructure design finalized; price negotiation/ordering/delivery: summer 04  Software: passive monitoring (summer student); management software: 2005 onward  Physical lab: renovation to be completed by summer 04

61 Status [timeline 2003 – 2007: NSF funds 10/03; ARO funds 5/04; fund raising; hardware design; physical building; usable testbed 12/04 (monitoring, traffic generation, connected to UltraLight); useful testbed 12/05; then expansion, support, management]

62 WAN in Lab location [floor plan: Net Lab / WAN in Lab within the Jorgensen Lab, CS Dept] G. Lee, R. Jayaraman, E. Nixon (Caltech)

63 Summary  Testbed driven by research agenda Rich and strong networking effort Integrated approach: theory + implementation + experiments “A network that can break”  Integral part of real testbeds Part of global infrastructure UltraLight led by Harvey Newman (Caltech)  Integrated monitoring & measurement facility Fiber splitter passive monitors MonALISA

