Published by Alice Weaver. Modified over 9 years ago.
1
Making Every Bit Count in Wide Area Analytics
Ariel Rabkin
Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai
2
Global Systems Have Global Data
3
The Rise of Big Distributed Data
CDNs:
– Akamai has ~20 million requests per second
– CloudFlare has about 300 MB/s of logs; volume doubles every 4 months
Sensor data (e.g., power grid, highways)
Smart camera networks
4
Trends
[Figure: amount per dollar over time, for Data Volumes and Wide-area Bandwidth]
5
Analyzing Low-rate Events is Easy
Query: "Alert me when server crashes!"
Event: "Server Crashed!"
6
High-rate Events can be Costly
Query: "Every minute, compute request counts by URL"
[Diagram label: Requests]
7
Backhaul has Bad Dynamics
Example: backhaul count of events every 5 minutes
Choice of summaries is made upfront, statically
– Buyer’s remorse: chose to collect unnecessary and expensive data
– Analyst’s remorse: summaries insufficient for analysis; no way to retroactively get more data
8
Local Storage!
Query: "Every minute, compute request counts by URL"
[Diagram labels: Requests; Local Aggregation and Storage (shown twice)]
9
Challenge: Bandwidth Scarcity
Analyst: I want the request count for every URL every second.
System: I can’t do that, Ari. That costs 100 MB/sec. You only have 12 MB/sec. Want to impose a rank cutoff, value cutoff, or change frequency?
Analyst: Can I get the top 1000 URLs every second?
System: I can do that for 900 KB/sec.
Analyst: Great, do it!
10
Challenge: Varying Scarcity
[Figure: bandwidth over time, with curves labeled Needed and Available]
Can do: First aggregate over longer time periods, up to 30 seconds. Then only keep the top URLs.
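The two-step policy on this slide (coarsen the time windows first, then keep only the top URLs) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function names, the doubling schedule for window sizes, and the use of a cell count as a stand-in for a bandwidth budget are all assumptions made for the example.

```python
from collections import defaultdict

def coarsen_time(counts, window_s):
    """Roll {(url, second): count} up into windows of window_s seconds."""
    out = defaultdict(int)
    for (url, t), n in counts.items():
        out[(url, t - t % window_s)] += n
    return dict(out)

def keep_top_urls(counts, k):
    """Keep only the cells belonging to the k URLs with the largest totals."""
    totals = defaultdict(int)
    for (url, _), n in counts.items():
        totals[url] += n
    top = set(sorted(totals, key=lambda u: (-totals[u], u))[:k])
    return {key: n for key, n in counts.items() if key[0] in top}

def degrade(counts, budget_cells, max_window_s=30):
    """Hypothetical policy: lengthen windows up to max_window_s, then cut URLs.

    budget_cells is an assumed proxy for available bandwidth: the number of
    (url, window) cells that can be sent.
    """
    window = 1
    current = counts
    # Step 1: aggregate over longer time periods, up to max_window_s.
    while len(current) > budget_cells and window < max_window_s:
        window = min(window * 2, max_window_s)
        current = coarsen_time(counts, window)
    # Step 2: only then drop the least popular URLs.
    k = len({url for url, _ in current})
    while len(current) > budget_cells and k > 1:
        k -= 1
        current = keep_top_urls(current, k)
    return current, window, k

# Per-second counts for three made-up URLs over two seconds.
counts = {("a", 0): 5, ("a", 1): 5, ("b", 0): 1, ("b", 1): 1, ("c", 0): 2, ("c", 1): 2}
result, window, k = degrade(counts, budget_cells=2)
```

With a budget of two cells, time coarsening alone is not enough (three URLs still yield three cells), so the sketch falls back to the rank cutoff and keeps the two most requested URLs.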
11
Challenge: Backfill
Query: "Every minute, compute request counts by URL"
[Diagram labels: Requests and Local Aggregation and Storage (shown twice); "Crashed or partitioned"; "Processing keeps happening"]
Now what??
12
Data Processing Requirements
Aggregatable: Stored Data += Update
Merge-able: Merged Representation += Data
Reducible [diagram label: Data]
13
Storage model | High-level API | Merge + Aggregate | Predictable performance | Arbitrary Joins
Raw byte strings (e.g. MapReduce) | X | X | √ | X
Database tables | √ | X | X | √
14
The Data Cube Model
Counts by URL | 12:00 | 12:01 | 12:02
www.mysite.com | 3 | 5 | …
www.yoursite.com | 5 | 4 | …
www.hersite.com | 8 | 12 | …
Roll-up of mysite.com by time from 12:00 to 12:01: 8
Roll-up of sites at time 12:00: 16
Cube: A multidimensional array, with one or more aggregates, indexed by a set of dimensions
Aggregation function used for:
– Updates
– Roll-ups
– Merging cubes
– Degrading cubes
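The cube on this slide can be sketched as a small Python class. This is only an illustration under stated assumptions: the class name `CountCube` and its method names are hypothetical, the only aggregate is a sum of counts, and times are zero-padded "HH:MM" strings so that string comparison matches time order.

```python
from collections import defaultdict

class CountCube:
    """Toy data cube: one sum aggregate indexed by (url, time)."""

    def __init__(self):
        self.cells = defaultdict(int)

    def update(self, url, time, n=1):
        # The aggregation function (+) folds each update into its cell.
        self.cells[(url, time)] += n

    def rollup_time(self, url, start, end):
        # Roll up one URL across the inclusive time range [start, end].
        return sum(n for (u, t), n in self.cells.items()
                   if u == url and start <= t <= end)

    def rollup_urls(self, time):
        # Roll up all URLs at a single time.
        return sum(n for (_, t), n in self.cells.items() if t == time)

    def merge(self, other):
        # Merging two cubes reuses the same aggregation function.
        for key, n in other.cells.items():
            self.cells[key] += n

# Values taken from the table on the slide.
cube = CountCube()
for url, t, n in [
    ("www.mysite.com", "12:00", 3), ("www.mysite.com", "12:01", 5),
    ("www.yoursite.com", "12:00", 5), ("www.yoursite.com", "12:01", 4),
    ("www.hersite.com", "12:00", 8), ("www.hersite.com", "12:01", 12),
]:
    cube.update(url, t, n)

print(cube.rollup_time("www.mysite.com", "12:00", "12:01"))  # 8
print(cube.rollup_urls("12:00"))  # 16
```

Because updates, roll-ups, and merges all go through the same addition, a cube built at one site can be folded into a cube from another site without access to the raw requests.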
15
Storage model | High-level API | Merge + Aggregate | Predictable performance | Arbitrary Joins
Raw byte strings (e.g. MapReduce) | X | X | √ | X
Database tables | √ | X | X | √
Data Cube | √ | √ | √ | X
16
A Vision for Wide-Area Analytics
[Diagram labels: Dataflow Operators and Local Cube (at each of two sites); Network bottleneck; Dataflow Operators and Merged Cube]
Dataflow adapted to bandwidth
17
Adaptivity
[Diagram labels: Dataflow Operators; Local Cube; Dataflow Operators; Network bottleneck]
18
Adaptivity
[Diagram labels: Dataflow Operators; Local Cube; Dataflow Operators; Summarized Cube; Network bottleneck; Feedback control]
Key ingredients:
– Cube summarization as mechanism
– User-defined policies
– Feedback control
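One way the three ingredients could fit together is sketched below. It is an assumption-laden illustration, not the system's actual controller: the summarization levels, their costs in MB/s (`LEVEL_COST_MBPS`), the `headroom` parameter, and the function `next_level` are all hypothetical, and a real controller would measure costs rather than read them from a fixed table.

```python
# Hypothetical user-defined policy: level 0 is full detail, higher levels are
# coarser cube summaries. Costs in MB/s are made-up numbers for illustration.
LEVEL_COST_MBPS = [100.0, 25.0, 6.0, 0.9]

def next_level(level, available_mbps, costs=LEVEL_COST_MBPS, headroom=0.9):
    """One feedback step: choose the summarization level for the next interval."""
    # Over budget: degrade one more step, if a coarser level exists.
    if costs[level] > available_mbps and level < len(costs) - 1:
        return level + 1
    # Comfortably under budget: restore detail only if the finer level fits
    # within the headroom, which avoids flapping between two levels.
    if level > 0 and costs[level - 1] <= headroom * available_mbps:
        return level - 1
    return level

# Simulate a link whose available bandwidth (MB/s) changes over time.
available_trace = [12, 12, 12, 30, 30, 3, 3]
level = 0
levels = []
for available in available_trace:
    level = next_level(level, available)
    levels.append(level)

print(levels)  # [1, 2, 2, 1, 1, 2, 3]
```

The controller moves one level per interval, so it reacts gradually: it degrades while the link is tight, restores detail when bandwidth becomes ample, and degrades again when the link shrinks.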
19
Backup Slides
20
Conclusions
The hard problems in wide-area analysis:
– Reasoning about bandwidth/data quality tradeoffs
– Optimizing data quality under changing conditions
– Jointly optimizing bandwidth and other resources
We are building a system.
– We call it JetStream. Stay tuned…
21
Structuring the Storage
Raw byte strings, used in MapReduce and similar systems
Database relations, used in relational databases
Data Cube: A multidimensional array with a merge function. Used in OLAP analysis. Ex: Total requests and average latency by URL and time
22
Clouds master
23
Bandwidth Costs do not Decline Smoothly
[Figure source: TeleGeography's Global Bandwidth Research Service]
24
Bandwidth Price Shifts
[Figure labels: Frankfurt-London; 2012; 20%]
25
Diurnal Load Makes Overprovisioning Expensive
Leased lines waste capacity during off-peak
Public internet gets congested during peak
26
Structured edge storage
System | Data Model | Pros | Cons
MapReduce and similar | Key-value bitstrings | Key-value or FS-style storage | Can’t optimize
Relational Databases | Relational | Optimizer can reason about queries | Complex to support
Wide-area | Data cubes | Natural merges, can reason about |
27
Straw-Man I: Backhaul
Query: "What is the distribution of request counts by URL?"
[Diagram label: Requests (shown twice)]
28
Straw-Man II: Stream processing
Query: "What is the distribution of request counts by URL?"
[Diagram labels: Requests and Borealis / System S (shown twice)]
29
Benefit: Iteration
Can iteratively pose different queries
[Diagram labels: Requests and Local Aggregation and Storage (shown twice); "A revised query"]
30
Benefit: Adaptation
Can adapt the data volume collected to the available bandwidth
[Diagram labels: Requests and Local Aggregation and Storage (shown twice); Limited Bandwidth]
31
Benefit: Adaptation
Can adapt the data volume collected to the available bandwidth
[Diagram labels: Requests and Local Aggregation and Storage (shown twice); Ample Bandwidth]
32
A dataflow model for wide-area analytics
Operator: Defines data transformation on tuples. Can do input or output.
Cube: Structured storage of data
33
[Diagram labels: Generated data; Ingested Into Local cubes; Source Cube and Processing (shown twice); Network bottleneck; Processed Data]
34
[Diagram labels: Processed Data; Processing]