Fixing WiFi Latency… Finally!
Low latency and good bandwidth at all MCS rates
Overview
- WiFi Latency
- A quick rant about benchmarks
- Fixes
- API
- Futures
Grokking Bufferbloat fixes in 20 Minutes
Most of your benchmarks are wrong!
- iperf3 "millibursts"
- Tmix
- PPS
Meet Ham, the marketing monkey
- Ham likes big numbers!
- Ham doesn't care if they mean anything
- Ham is good with winning phrases
Ham has been my nemesis for a very, very long time. He's good at a couple of things – naming. We have "bufferbloat", "fq_codel", and "BQL"; Ham comes up with things like "streamboost" and "LagBuster". Or optimizing for what you thought was important.
Typical WiFi Testbench
- Over a wire
- Over the air
DUT, attenuators
Photos courtesy: Candelatech
Problems with WiFi Lab Bench Testing
- Distance
- Multipath
- (many) Legacy devices
- (many) Competing devices
- Reasoning from non-Internet-scale RTTs
- Treating UDP and TCP results with equal weight
- Rate over range without looking at latency or rate control
- Tests that don't run long enough
- Summary statistics that don't look at inter-station behaviors
Real World WiFi uses
What works
- Plots
- Long-duration tests
- Varying RTTs
- Packet captures
- Tracing
- Sampling
- Other tests
NOT. JUST. BANDWIDTH.
In particular, there isn't just one number for RTT.
100 stations, ath10k, today
Is it the average? Yeah – of the 5 flows (out of the hundred flows that got started), each got 20 Mbits. The median? I don't even know where to start.
'Quick n Dirty' Contended Web PLT – Linux 4.4, txqueuelen 1000 (10ms base RTT + 1 sec delay)
flent -l 300 -H server --streams=12 tcp_ndown &
wget -E -H -k -K -p https://www.slashdot.org
FINISHED --2016-10-22 13:43:35--
Total wall clock time: 232.8s
Downloaded: 62 files, 1.8M in 0.9s (77KB/s)
Connect time accounted for 99.6% of the runtime, and the real throughput was 77KB/sec. WTF?
Better Benchmarks
- Dslreports test for bufferbloat
- Flent has extensive plotting facilities & wraps netperf/ping/d-itg:
  - rtt_fair*var: lets you test multiple numbers of stations
  - rrul, rrul_be: exercises up/down and hw queues simultaneously
  - tcp_nup, tcp_ndown: test variable numbers of flows
  - http, rrul_voip
  - About 90 other tests
- TSDE
- tcptrace -G & xplot.org
- Wireshark
100 stations, ath10k, airtime fair (WIP)
Just curious: how many people here are willing to take, I don't know, a 50% bandwidth hit to some other application, just so your Skype call, videoconference, audio stream, or ssh session stayed fast and reliable over WiFi?
Heresy!
- Optimize for "time to service" stations
- Pack aggregates better
- Serve stations better
- Look at Quality of Experience (QoE):
  - VoIP Mean Opinion Scores (MOS) (under load)
  - Web page load time (under load)
  - Latency under load (vs various WiFi MCS rates)
  - Long-term stability and "smoothness"
  - Flow completion time (FCT)
  - Loss, jitter and queuing
Control queue length
- Be fair to flows
- Be fair to stations
- Optimize for TXOPs: ~1300/sec, into which you pour bandwidth
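The "~1300 TXOPs/sec" budget is worth a back-of-the-envelope check. A minimal sketch (illustrative arithmetic only; the PHY rates and the no-overhead assumption are mine, not from the talk):

```python
# If an AP gets roughly 1300 transmit opportunities per second, each
# TXOP lasts about 1s / 1300 ~= 770 microseconds regardless of MCS rate.
TXOPS_PER_SEC = 1300
txop_us = 1_000_000 / TXOPS_PER_SEC  # ~769 us per TXOP

def bytes_per_txop(phy_rate_mbps: float) -> float:
    """Rough payload one ~770us TXOP can carry at a given PHY rate
    (ignores all framing/ACK overhead -- a deliberate simplification)."""
    return phy_rate_mbps * 1e6 / 8 * (txop_us / 1e6)

slow = bytes_per_txop(6.5)    # a low MCS rate: ~625 bytes per TXOP
fast = bytes_per_txop(600.0)  # a high MCS rate: ~57,700 bytes per TXOP
```

The point: the TXOP count is the scarce resource; bandwidth depends entirely on how well each one is packed.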
Queuing Delay
Interference, slow rates, etc. often get the blame – mis-interpreted, when the real culprit is queuing delay.
Aggregation improvements
Web PLT improvements under competing load
FIFO load time was 35 sec on the large page!
VoIP MOS scores & bandwidth, 3 stations contending (ath9k)
[Plot comparing: LEDE Pending, LEDE Today, Linux 4.4 Today, OpenWrt CC]
Best Effort – scheduled sanely on the AP – better than the HW VO queue!
Latency below 20ms FIXME – use the c2 plot
Two bloat-free coffee shops
Fix ISP bufferbloat FIRST We get to 22 seconds and it explodes at the end of the track. Courtesy dslreports: http://www.dslreports.com/speedtest/5408767 http://www.dslreports.com/speedtest/5448121
Upload Bloat Dslreports: http://www.dslreports.com/speedtest/results/bufferbloat?up=1
[Plots: airtime fair pie; fq_codel OS X loss/SACK; SACK zoomed; qca-10.2-fqmac35-codel – what to do about beacons?]
What we did
No changes to WiFi: the backoff function stays the same; we only modified the queuing and aggregation functions.
We added:
- Per-station queueing (removed the qdisc)
- fq_codel per station
- RR FQ between stations (currently)
- DRR airtime fairness between stations (pending)
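The per-station round-robin idea can be sketched in a few lines (a toy model, not the mac80211 code; the `Station` class and packet contents are illustrative):

```python
from collections import deque

class Station:
    """One queue per associated station, instead of one shared FIFO."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()

def rr_schedule(stations, rounds):
    """Serve one packet per station per round, skipping empty queues,
    so a busy station cannot starve a lightly loaded one."""
    served = []
    for _ in range(rounds):
        for s in stations:
            if s.queue:
                served.append((s.name, s.queue.popleft()))
    return served

a, b = Station("a"), Station("b")
a.queue.extend(range(4))  # a busy station...
b.queue.extend(range(1))  # ...interleaved fairly with a light one
order = [name for name, _ in rr_schedule([a, b], rounds=3)]
# b's single packet goes out in the first round, not behind all of a's
```

The pending DRR airtime-fairness work replaces "one packet per turn" with "one airtime quantum per turn".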
Overall Philosophy
- One aggregate in the hardware
- One aggregate queued on top, ready to go
- One being prepared
- The rest of the packets being fq'd per station
- Total delay = 2-12ms in the layers
- This is plenty of time (BQL not needed)
- Currently max-sized TXOPs (we can cut this)
- Codel moderates the whole thing
Per Device
Old way: 4 sets of 1000-packet queues per SSID
New way: 8192 queues, 4MB memory limit
WiFi Queue Rework
All buffering moves into the mac80211 layer
Max 2 aggregates pending (~8ms)
API requirements
- Completion handler
- get_expected_throughput
Benefits
- Per-device max queuing, not per SSID
- Minimal buffering
fq_codel per station
- Mixes flows up
- Controls queue length
- Enables lossless congestion control (ECN)
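A condensed sketch of the CoDel control law that keeps those queues short (constants are the RFC 8289 defaults; this omits the sojourn-time tracking and re-entry logic, so it is an illustration, not the kernel code):

```python
from math import sqrt

TARGET = 0.005    # 5 ms: acceptable standing queue delay
INTERVAL = 0.100  # 100 ms: window for observing minimum sojourn time

def next_drop(now, count):
    """While delay stays above TARGET, schedule each successive drop
    INTERVAL / sqrt(count) after the previous one -- drops accelerate."""
    return now + INTERVAL / sqrt(count)

t = 0.0
times = []
for count in (1, 2, 3, 4):
    t = next_drop(t, count)
    times.append(round(t, 3))
# gaps between drops shrink: 100ms, ~71ms, ~58ms, 50ms
```

Because the queue stays short, ECN marking can replace dropping entirely, which is the "lossless" part above.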
API
- Init the softirq layer
- Add a callback for the completion handler
Knobs and Stats
Aggh! We eliminated the qdisc layer!

tc qdisc show dev wlp3s0
qdisc noqueue 0: dev wlp3s0 root refcnt 2
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

ifconfig wlp3s0   # still works!
Link encap:Ethernet  HWaddr 30:10:b3:77:bf:7b
inet addr:172.22.224.1  Bcast:172.22.224.255  Mask:255.255.255.0
inet6 addr: fe80::3210:b3ff:fe77:bf7b/64 Scope:Link
UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
RX packets:2868196 errors:0 dropped:0 overruns:0 frame:0
TX packets:2850840 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2331543636 (2.3 GB)  TX bytes:1237631655 (1.2 GB)

Hmm… maybe we could put sch_fq there?

root@nemesis:/sys/kernel/debug/ieee80211/phy0/netdev:wlp3s0# tc -s qdisc show dev wlp3s0
qdisc fq 8001: root refcnt 5 limit 10000p flow_limit 100p buckets 1024 orphan_mask 1023 quantum 3028 initial_quantum 15140 refill_delay 40.0ms
 Sent 16061543 bytes 10915 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
 107 flows (107 inactive, 0 throttled)
 0 gc, 0 highprio, 72 throttled
Top Level Statistics
root@nemesis:/sys/kernel/debug/ieee80211/phy0# cat aqm
access name value
R  fq_flows_cnt 4096
R  fq_backlog 0
R  fq_overlimit 0
R  fq_overmemory 0
R  fq_collisions 14
R  fq_memory_usage 0
RW fq_memory_limit 4194304  # 4MB on 802.11n, per device not SSID
RW fq_limit 2048            # packet limit, IMHO too high by default
RW fq_quantum 1514          # 300 currently by default, no...
Per station stats
root@nemesis:/sys/kernel/debug/ieee80211/phy0/netdev:wlp3s0/stations/80:2a:a8:18:1b:1d# cat aqm
tid ac backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
0   2  0 0 975208 1 14 0 0 992492788 1713258
1   3  0 0 0    0 0  0 0 0         0
2   3  0 0 0    0 0  0 0 0         0
3   2  0 0 0    0 0  0 0 0         0
4   1  0 0 0    0 0  0 0 0         0
5   1  0 0 3276 0 0  0 0 386568    3276
6   0  0 0 528  0 0  0 0 106306    528
7   0  0 0 1587 0 0  0 0 287247    1587
8   2  0 0 0    0 0  0 0 0         0
9   3  0 0 0    0 0  0 0 0         0
10  3  0 0 0    0 0  0 0 0         0
11  2  0 0 0    0 0  0 0 0         0
12  1  0 0 0    0 0  0 0 0         0
13  1  0 0 0    0 0  0 0 0         0
14  0  0 0 0    0 0  0 0 0         0
15  0  0 0 0    0 0  0 0 0         0
Per Station Stats
/sys/kernel/debug/ieee80211/phy0/netdev:wlp3s0/stations/80:2a:a8:18:1b:1d# cat airtime
RX: 471104269 us
TX: 92761799 us
Deficit: -168 us
Queued as new: 0 times
Queue cleared: 0 times
Dequeue failed: 0 times
Per SSID stats
/sys/kernel/debug/ieee80211/phy0/netdev:wlp3s0/aqm
AC backlog-bytes backlog-packets new-flows drops marks overlimit collisions tx-bytes tx-packets
2 617594 14 1940667 617602
Problems
- Some devices don't tell you how much they are aggregating
- Some don't have a tight callback loop
- Some expose insufficient rate control information
- Some have massively "hidden buffers" in the firmware
- Some have all of these issues!
- All of them have other bugs!
TSQ Issue?
Problem: NAS & TSQ
Futures
- Add airtime fairness
- Further reduce driver latency
- Improve rate control
- Minimize multicast
- Add more drivers/devices
- Figure out APIs
Airtime Fairness (ATF)
ath9k driver only.
Switch to choosing stations based on sharing the air fairly – sum the RX and TX time to any given station, figure out which stations should be serviced next. Allocate the "air".
Huge benefits with a mixture of slow (or legacy) and fast stations: the fast ones pack WAY more data into their slot, the slow ones slow down slightly.
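The deficit idea described above can be sketched as follows (a hedged toy model, not the ath9k patch; the quantum and per-aggregate airtime costs are made-up numbers for illustration):

```python
QUANTUM_US = 1000  # illustrative per-round airtime credit

def simulate(stations, rounds):
    """Each round grants every station QUANTUM_US of airtime credit;
    a station transmits while it has credit, paying the airtime cost
    of each aggregate. `stations` maps name -> airtime cost in us."""
    sent = {name: 0 for name in stations}
    deficit = {name: 0 for name in stations}
    for _ in range(rounds):
        for name, cost in stations.items():
            deficit[name] += QUANTUM_US
            while deficit[name] >= cost:
                deficit[name] -= cost
                sent[name] += 1
    return sent

# A fast station (100 us/aggregate) vs a slow legacy one (2000 us):
sent = simulate({"fast": 100, "slow": 2000}, rounds=10)
# both receive equal airtime (10 x 1000 us), so the fast station sends
# 100 aggregates while the slow one sends 5 -- the *air* is shared
# fairly, not the packet count
```

This is why a slow station only "slows down slightly" under ATF: it loses packet-count share, not airtime share.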
Explicit bandwidth/latency tradeoff
Credits
Questions?