1 Mean Time to Innocence
Your Dashboards Are Green, but Your End Users Are Still Complaining. Now What?
Phil Stanhope, October 2015
2 Some Numbers
● 30B real-time steering decisions per day
● 6B traceroute and RUM latency measurements per day. That’s over 6 light years!
● 13 hops per traceroute
● Traffic covering 80% of ASNs on the internet seen every few minutes
● 52K ASNs monitored
● 200M BGP updates per day
● No major CDN can deliver 99.9% uptime from the end user’s perspective. But is it their fault?
Real Time Feeds
Cooked Time Series Data (Near Real Time)
● Pre-cooked across ~1000 dimensions every 5 minutes (Geography, Mobile Networks, Fixed Line Networks, Target Markets, Cities and IPSets)
● Outages & Hijacks
● Pairwise Comparisons
● Performance Alarms
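To put the 99.9% uptime figure in perspective, the arithmetic below converts an availability target into a downtime budget. It is only an illustrative sketch: the function name and the chosen periods are assumptions, not something taken from the talk.

```python
def downtime_budget_minutes(availability_pct: float, period_days: float) -> float:
    """Return the minutes of downtime an availability target permits over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Print the budget for a day, a 30-day month and a 365-day year at 99.9%.
    for days in (1, 30, 365):
        print(f"{days:>3} days: {downtime_budget_minutes(99.9, days):.1f} minutes")
```

At 99.9%, a 30-day month allows roughly 43 minutes of downtime, so a handful of regional degradations seen from the end user's side can consume the whole budget.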
3 Business Impacting
● Major Outages: Major Impact, Rare
● Regional Outages and Degradations: Variable Impact, Always Happening
“We experienced an Internet connectivity issue with a provider outside of our network which affected traffic from some end-user networks.” (AWS)
4 What is Internet Intelligence?
● Consolidated view across your Internet Infrastructure
● Determine the impact to Cloud, CDN and Hosting Infrastructure globally
● Immediate time to information
5 How is it Done?
Leverage Currently Deployed Dyn Assets
● Global Monitoring Infrastructure
● Custom Cloud Monitoring Infrastructure
● Real User Monitoring data
● Global Routing Infrastructure Monitors
6 Global Monitoring Infrastructure
7 Reachability Markets
8 What is being Monitored?
9 Waterfalls & RUM: Where do you start?
10 Let’s Dive in: Some Context
Rather than focus on entire-page RUM and waterfalls, focus on what happens OUTSIDE of your normal span of control as a cloud, content & security consumer:
● Monitor the critical content servers (CDNs, both public and private)
● Monitor the cloud providers, DNS providers & core SaaS providers
● Give you the tooling to start answering mean time to innocence questions
Mean time to innocence questions:
● Is it a problem you have the ability to address? Not if it’s your cloud provider’s transit. Or the ISP’s recursive DNS.
● Is your CDN provider overloaded?
● Is there a more generalized congestion problem on the internet?
● Are the network paths to your users suboptimal, maybe even hijacked?
● Can you see a micro-outage?
● Can you see patterns with providers?
● Did a user come via a proxy gateway? Does the gateway fail to forward websockets?
11 Live Demo
NOTE: This is a fake URL. It won’t work for you. Sorry.
● A single web page that shows a combination of real-time and near-real-time forensic data
● Intentionally unbranded: what can you do with our datasets?
● Covers the internal APIs that we use. They are all becoming public. Talk to me!
● A common set of UX controls can be applied to a variety of real-time and batch data: GeoViews, Sunburst, Matrix & Long-Term Trending
● Under the covers: ReactJS, D3, GeoJson/Topojson, jQuery, Go, Varnish, Nginx, Websockets
12 Telemetry Data Cooking Pipeline
● Users: cover 80% of the ASNs on the internet every minute
● Relays: globally distributed network, handling 50K/sec per relay
● Probers: network of 300+ probers performing 10K traces/second AND synthetic DIG & HTTP[S]
● Gatherers: real-time geo annotation, data transformation & filtering, handling 100K events/sec
● Cookers: statistical analysis and aggregation services
● APIs: geo-annotated real-time API; time-series analyzed API
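The gatherer stage (geo annotation plus filtering) could look roughly like the sketch below. Everything concrete here is an assumption for illustration: the prefix table, the event fields, and the choice of longest-prefix matching are hypothetical, and the addresses and ASNs come from the ranges reserved for documentation.

```python
import ipaddress
from typing import Optional

# Hypothetical prefix table: (network, country, ASN). Documentation ranges only.
PREFIX_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "US", 64500),
    (ipaddress.ip_network("198.51.100.0/24"), "DE", 64501),
    (ipaddress.ip_network("198.51.100.128/25"), "FR", 64502),
]


def annotate(event: dict) -> Optional[dict]:
    """Attach country and ASN by longest-prefix match; drop events with no match."""
    addr = ipaddress.ip_address(event["client_ip"])
    matches = [row for row in PREFIX_TABLE if addr in row[0]]
    if not matches:
        return None  # filtering step: nothing to annotate with
    _, country, asn = max(matches, key=lambda row: row[0].prefixlen)
    return {**event, "country": country, "asn": asn}


if __name__ == "__main__":
    print(annotate({"client_ip": "198.51.100.200", "latency_ms": 42}))
    print(annotate({"client_ip": "203.0.113.5", "latency_ms": 17}))
```

A linear scan is fine for a sketch; at 100K events/sec a real gatherer would presumably use a trie or a similar indexed structure.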
13 Injector & Beacon
Actors: Browser, Recursive, Authoritative
KEY: Gatherer
token = encode(cust_id, client_ip, time, nodeid, referer)
beacon = HMAC(secret, token)
Injector & Beacon (GET)
● JavaScript “injection”, just like injecting an advertisement into a page
● Writes a transparent iframe into the page
● Loading the iframe requires resolving beacon
● Guaranteed to cause a recursive DNS cache miss
LOG & ANALYZE
● HTTP: time, client_ip, beacon
● DNS: time, recursive_ip, beacon
● Time – 2 – Authoritative
● Time – 2 – Recursive (inferred)
Collect (GET)
● Logs time, client_ip
● Dynamic HTML containing customer resources to test
● Resources 1.. target origins tested
● Resource timing information sent to collector (per resource timing info)
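A minimal sketch of the token and beacon construction follows. The slide only gives `token = encode(cust_id, client_ip, time, nodeid, referer)` and `beacon = HMAC(secret, token)`; the encoding (base64url over JSON), the hash (SHA-256), the truncation to 32 hex characters so the beacon fits in a single DNS label, and the zone `beacon.example.com` are all assumptions made for illustration.

```python
import base64
import hashlib
import hmac
import json


def make_token(cust_id: str, client_ip: str, ts: int, node_id: str, referer: str) -> str:
    """Encode the request fields into an opaque token (assumed: base64url of JSON)."""
    raw = json.dumps([cust_id, client_ip, ts, node_id, referer], separators=(",", ":"))
    return base64.urlsafe_b64encode(raw.encode("utf-8")).decode("ascii")


def make_beacon(secret: bytes, token: str) -> str:
    """beacon = HMAC(secret, token), truncated so it fits a 63-byte DNS label."""
    digest = hmac.new(secret, token.encode("ascii"), hashlib.sha256).hexdigest()
    return digest[:32]


def beacon_hostname(beacon: str, zone: str = "beacon.example.com") -> str:
    """A never-before-seen name, so the recursive resolver cannot answer from cache."""
    return f"{beacon}.{zone}"


def verify_beacon(secret: bytes, token: str, beacon: str) -> bool:
    """Constant-time check that a logged beacon matches its token."""
    return hmac.compare_digest(make_beacon(secret, token), beacon)


if __name__ == "__main__":
    secret = b"demo-secret"  # hypothetical key
    token = make_token("cust-1", "192.0.2.10", 1443657600, "node-7", "https://example.com/")
    print(beacon_hostname(make_beacon(secret, token)))
```

Because the same beacon value shows up in both the HTTP log (with client_ip) and the authoritative DNS log (with recursive_ip), joining the two logs on beacon is what lets a client be paired with its recursive resolver.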
14 5min, 1H & 1D Cooking: What’s going on in our Data Kitchen?
● Gatherers: geo-annotated real-time API
● MHD: raw MHD-formatted data at one-minute granularity
● Client IP STATS Histograms: 5-minute timing histograms across 6 latency features
● DNS IP STATS Histograms: 5-minute timing histograms across 6 latency features
● IP Maps: Client → Recursive, Recursive → Client
● Client IP Sets: Typed Label IP Sets; Latencies; Country, City, Continent, ASN
● DNS IP Sets: Typed Label IP Sets; Latencies; Country, City, Continent, ASN
● Correlation Scores and Ranks: daily by origin for every TYLIP feature
● All data is GEO redundant: Gathering, Raw, Intermediates & Aggregates
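The 5-minute histogram cooking step can be sketched as below. This is an assumed, simplified shape: the bucket edges, the `(epoch_seconds, key, latency_ms)` tuple layout and the grouping key are hypothetical, and only a single latency feature is shown rather than the six mentioned on the slide.

```python
import bisect
from collections import Counter, defaultdict

WINDOW_SECONDS = 300  # 5-minute cooking window
BUCKET_EDGES_MS = [10, 25, 50, 100, 250, 500, 1000]  # hypothetical bucket edges


def bucket_index(latency_ms: float) -> int:
    """Bucket 0 is below the first edge; the last bucket is at or above the last edge."""
    return bisect.bisect_right(BUCKET_EDGES_MS, latency_ms)


def cook(measurements):
    """Group (epoch_seconds, key, latency_ms) samples into per-window histograms."""
    histograms = defaultdict(Counter)
    for ts, key, latency_ms in measurements:
        window_start = int(ts) - int(ts) % WINDOW_SECONDS
        histograms[(window_start, key)][bucket_index(latency_ms)] += 1
    return histograms


if __name__ == "__main__":
    samples = [(0, "AS64500", 5), (299, "AS64500", 30), (300, "AS64500", 30)]
    for (window, key), hist in sorted(cook(samples).items()):
        print(window, key, dict(hist))
```

Keeping counts per bucket rather than raw samples means the 1H and 1D aggregates can be produced by summing the 5-minute histograms instead of re-reading the raw data.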
15 QUESTIONS?