Harness Your Internet Activity
AAAA Deep Dive
DNS-OARC, Buenos Aires, March 2016
Ralf Weber
Motivation for this talk
– Geoff Huston's talk at RIPE: DNS doesn't use IPv6
  – Our default configuration at least didn't
– DNS should use IPv6 – what would be the impact?
– Find the state of IPv6 transport in the long tail
  – Alexa Top 1M isn't long enough!
  – I'm not set up to do Geoff's neat ad network trick!
  – I am set up to gather anonymized resolver data
How Nominum Gets Data (pipeline diagram)
– Customer resolvers stream query data to receivers (n x 100B queries/day)
– Receivers feed Kafka; a Hadoop loader writes the stream into Hadoop HDFS
– Cluster stats: 600 cores, 8 TB RAM, n x PBytes of storage
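As a rough illustration of the receiver-to-Hadoop leg of this pipeline, here is a minimal Python sketch using the kafka-python client. The topic name, broker address, and batch size are assumptions, not the actual Nominum configuration, and a local file write stands in for the HDFS write.

# Minimal sketch of a Kafka -> HDFS loader stage. "query-logs",
# "kafka:9092", and the batch size are hypothetical.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "query-logs",                    # hypothetical topic fed by the receivers
    bootstrap_servers="kafka:9092",  # hypothetical broker address
    group_id="hadoop-loader",
)

batch = []
for msg in consumer:
    batch.append(msg.value)
    if len(batch) >= 100_000:        # flush in large chunks; HDFS favors big files
        with open("/data/queries.part", "ab") as out:  # stand-in for an HDFS write
            out.write(b"\n".join(batch) + b"\n")
        batch.clear()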
Getting a test data set
– Unique name/query-type tuples
  – We do daily rollups, so a day looked like a natural choice
– Raw data: 1,152,389,150 tuples (1.15 billion) – too much to run and analyze
– Only used data that has been queried more than once: 602,661,609 (602 million) – still a lot
– Removed known PRSD and DNS tunnel names: 135,919,893 (135 million) – a sketch of these steps follows below
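A minimal sketch of this winnowing, assuming the daily rollup is a text file of "count name qtype" lines and that the known PRSD/tunnel core domains are available as a set; both the file format and the naive core-domain extraction are assumptions.

# Sketch of the filtering steps above. The input format and the
# bad-domain set are assumptions; real core-domain extraction would
# use the public suffix list, not just the last two labels.
def build_test_set(rollup_path, bad_core_domains):
    keep = []
    with open(rollup_path) as rollup:
        for line in rollup:
            count, name, qtype = line.split()
            if int(count) <= 1:            # drop tuples queried only once
                continue
            core = ".".join(name.rstrip(".").split(".")[-2:])
            if core in bad_core_domains:   # drop known PRSD and DNS tunnels
                continue
            keep.append((name, qtype))
    return keep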
What is in the test data set
– 135,919,893 unique tuples
– 125,889,174 unique names
– 27,466,881 core domains
– Query type distribution:
  – 108,509,872 A
  – 11,663,222 AAAA
  – 46,350 SPF
  – 7,140 A6
  – 1,178 DNSKEY
  – 12 HINFO
  – 3 TLSA
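The summary numbers above are the kind of thing a single Counter pass produces; this sketch assumes the (name, qtype) tuples from the previous step, with two stand-in entries as sample data.

# Recompute the slide's summary numbers from the test-set tuples.
from collections import Counter

tuples = [("www.example.com.", "A"), ("www.example.com.", "AAAA")]  # stand-in data

unique_names = len({name for name, _ in tuples})
by_type = Counter(qtype for _, qtype in tuples)
print(f"{len(tuples):,} unique tuples, {unique_names:,} unique names")
for qtype, n in by_type.most_common():
    print(f"{n:>12,} {qtype}")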
Test Setup (diagram)
Test Run
– Use a couple of dnsperf instances to run the queries simultaneously against the hosts
– Every host gets 1000 qps
– Timeout is 60 seconds, as every query is cold cache
– dnsperf -d allq.new -Q 1000 -t 60 -S 1 -s IP
– Test ran for nearly 38 hours over a weekend
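A sketch of how such a parallel run can be driven, launching one dnsperf process per resolver under test with the flags shown above. The six resolver addresses are placeholders (the first test totaled 6000 qps, i.e. six hosts at 1000 qps each).

# Launch one dnsperf instance per resolver under test; addresses are
# placeholders. Flags match the invocation above: -d query file,
# -Q rate limit, -t timeout, -S stats interval, -s target server.
import subprocess

RESOLVERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3",
             "192.0.2.4", "192.0.2.5", "192.0.2.6"]

procs = [
    subprocess.Popen(["dnsperf", "-d", "allq.new", "-Q", "1000",
                      "-t", "60", "-S", "1", "-s", ip])
    for ip in RESOLVERS
]
for proc in procs:
    proc.wait()   # the full run took nearly 38 hours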
Result error codes (chart)
Result timings (chart)
Questions asked (chart)
Servers talked to (chart)
Questions answered (chart)
– Transport orders: IPv4, IPv6, IPv4 then 6, IPv6 then 4
– Outcome classes: UDP OK, UDP timeout, TCP OK, TCP timeout
– Readable data labels: 0.08%, 0.07%, 0.18%, 0.07%
Questions answered per protocol (chart)
– Transport orders: IPv4, IPv6, IPv4 then 6, IPv6 then 4
– Outcome classes: UDP OK, UDP timeout, TCP OK, TCP timeout
– Readable data labels: 0.09%, 0.06%, 0.08% (one label not recoverable)
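The four outcome classes in these charts can be reproduced per query with dnspython: try UDP first, fall back to TCP on truncation, and for the dual-stack orders try the second address family after a timeout. This is a sketch of that classification under those assumptions, not the instrumentation actually used for the test.

# Classify one query against one authoritative server into the chart's
# buckets: UDP OK, UDP timeout, TCP OK, TCP timeout.
import dns.flags
import dns.message
import dns.query
from dns.exception import Timeout

def classify(name, server, timeout=60):
    query = dns.message.make_query(name, "A")
    try:
        reply = dns.query.udp(query, server, timeout=timeout)
    except Timeout:
        return "udp timeout"
    if not (reply.flags & dns.flags.TC):
        return "udp ok"
    try:
        dns.query.tcp(query, server, timeout=timeout)  # truncated: retry over TCP
        return "tcp ok"
    except Timeout:
        return "tcp timeout"

def classify_v6_then_v4(name, v6_addr, v4_addr):
    # "IPv6 then 4": fall back to the IPv4 address only on timeout
    result = classify(name, v6_addr)
    return classify(name, v4_addr) if "timeout" in result else result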
Timeout offenders IPv4 (table: ip | timeout | ok – values not recoverable; the annotated version is on the next slide)
Timeout offenders IPv4 (table: ip | timeout | ok | name – addresses and counts not recoverable)
– d.root-servers.net, dina.ns.cloudflare.com, i.root-servers.net, i.gtld-servers.net, buck.ns.cloudflare.com, g.gtld-servers.net, dee.ns.cloudflare.com, f.gtld-servers.net, marek.ns.cloudflare.com, l.gtld-servers.net, sns-pb.isc.org, tinnie.arin.net, a.gtld-servers.net, d.gtld-servers.net, c.gtld-servers.net, ns3.nic.fr, j.root-servers.net, cumin.apnic.net, sec3.apnic.net, pri.authdns.ripe.net, m.gtld-servers.net.
Timeout offenders IPv6 then IPv4 (table: ip | timeout | ok – counts not recoverable; the annotated version is on the next slide)
Timeout offenders IPv6 then IPv4 (table: ip | timeout | ok | name – counts not recoverable)
– 2001:503:a83e::2:30 a.gtld-servers.net.
– 2001:503:231d::2:30 b.gtld-servers.net.
– 2001:500:2d::d d.root-servers.net.
– 2001:7fe::53 i.root-servers.net.
– 2400:cb00:2049:1::adf5:3a6b dina.ns.cloudflare.com
– 2400:cb00:2049:1::adf5:3a5d dee.ns.cloudflare.com.
– 2001:500:2e::1 sns-pb.isc.org
– 2400:cb00:2049:1::adf5:3bca marek.ns.cloudflare.com.
– 2400:cb00:2049:1::adf5:3b4e buck.ns.cloudflare.com
– 2001:500:13::c7d4:35 tinnie.arin.net.
– 2001:12f8:4::10 d.dns.br.
– 2001:dc0:1:0:4777::140 sec3.apnic.net
– 2001:dc0:2001:a:4608::59 sec1.apnic.net
– 2001:67c:e0::5 pri.authdns.ripe.net
– 2001:660:3006:1::1:1 ns3.nic.fr.
– addresses not recoverable: g.gtld-servers.net., f.gtld-servers.net., l.gtld-servers.net, i.gtld-servers.net, d.gtld-servers.net., m.gtld-servers.net.
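Tables like the two above can be produced with a simple per-server tally. The log format here (one "server outcome" pair per line in a hypothetical authlog.txt) is an assumption for illustration.

# Tally per-server timeout/ok counts and sort worst offenders first.
# "authlog.txt" and its "server outcome" line format are hypothetical.
from collections import defaultdict

stats = defaultdict(lambda: {"timeout": 0, "ok": 0})
with open("authlog.txt") as log:
    for line in log:
        server, outcome = line.split()
        stats[server][outcome] += 1

offenders = sorted(stats.items(), key=lambda kv: kv[1]["timeout"], reverse=True)
for server, counts in offenders[:20]:
    print(f"{server:<30} timeout={counts['timeout']:>8,} ok={counts['ok']:>8,}")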
Looking into timeouts
– Servers that time out are regular servers that usually answer fine
– I guess we see RRL (Response Rate Limiting) in action
– Seems that people are not switching to TCP
– Good that DNS scales horizontally
– 6000 – 8000 qps is not much traffic outbound
  – Rule of thumb: 5 – 10% of inbound gets sent out, so a resolver taking 100k qps inbound sends roughly 5k – 10k qps to authoritative servers
  – Resolvers can easily do a couple of 100k qps inbound
– Does this affect normal operation? (another talk…)
– Maybe do a second test
Second test…
– Found another DNS tunnel in the dataset
  – Put it on our list
  – Removed its queries (~500k)
– First test: all instances were asking the same names at the same time – a total of 6000 qps
– Second test: offset the start time by 30 minutes for each test run and lowered the rate to 800 qps per test (4800 qps total)
– Test now ran over 48 hours
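A sketch of the staggered second run: six dnsperf instances (4800 qps total at 800 qps each) started 30 minutes apart. The resolver addresses are placeholders again.

# Stagger the test starts by 30 minutes and lower the per-instance
# rate to 800 qps; resolver addresses are placeholders.
import subprocess
import time

RESOLVERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3",
             "192.0.2.4", "192.0.2.5", "192.0.2.6"]

procs = []
for i, ip in enumerate(RESOLVERS):
    if i:
        time.sleep(30 * 60)          # offset each start by 30 minutes
    procs.append(subprocess.Popen(["dnsperf", "-d", "allq.new", "-Q", "800",
                                   "-t", "60", "-S", "1", "-s", ip]))
for proc in procs:
    proc.wait()                      # the second run spanned over 48 hours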
Result error codes, Test 2 (chart)
Result timings, Test 2 (chart)
Questions asked, Test 2 (chart)
Servers talked to, Test 2 (chart)
Questions answered, Test 2 (chart)
– Transport orders: IPv4, IPv6, IPv4 then 6, IPv6 then 4
– Outcome classes: UDP OK, UDP timeout, TCP OK, TCP timeout
– Readable data labels: 0.03%, 0.06%, 0.03% (one label not recoverable)
Questions answered per protocol, Test 2 (chart)
– Transport orders: IPv4, IPv6, IPv4 then 6, IPv6 then 4
– Outcome classes: UDP OK, UDP timeout, TCP OK, TCP timeout
– Readable data labels: 0.03%, 0.001%, 0.03% (one label not recoverable)
Analysis
– Second test had fewer servers not answering
– Overall, answers were faster and better
– The more baskets you have, the better
– Still wonder whether the low authoritative qps has an impact on production servers
  – The payload is different
  – At least cold caches could see the same problem
Summary
– Turning on IPv6 as an additional transport has only good effects
  – More baskets
  – More resiliency
– Should be enabled by default
  – The latest CacheServe version has it (IPv4 then IPv6)