Making Every Bit Count in Wide Area Analytics. Ariel Rabkin. Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai.


1 Making Every Bit Count in Wide Area Analytics. Ariel Rabkin. Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai

2 Global Systems Have Global Data

3 The Rise of Big Distributed Data
CDNs:
– Akamai has ~20 million requests per second
– CloudFlare has about 300 MB/s of logs; volume doubles every 4 months
Sensor data (e.g., power grid, highways)
Smart camera networks

4 Trends
[Figure: amount per dollar over time, for data volumes and wide-area bandwidth]

5 Analyzing Low-rate Events is Easy
"Alert me when server crashes!"
"Server Crashed!"

6 High-rate Events can be Costly
Every minute, compute request counts by URL
[Diagram label: Requests]

7 Backhaul has Bad Dynamics
Example: backhaul count of events every 5 minutes
Choice of summaries is made upfront, statically
– Buyer's remorse: chose to collect unnecessary and expensive data
– Analyst's remorse: summaries insufficient for analysis; no way to retroactively get more data

8 Local Storage!
Every minute, compute request counts by URL
[Diagram: Requests arrive at each site; each site has Local Aggregation and Storage]

9 Challenge: Bandwidth Scarcity
"I want the request count for every URL every second."
"I can't do that, Ari. That costs 100 MB/sec. You only have 12 MB/sec. Want to impose a rank cutoff, value cutoff, or change frequency?"
"Can I get the top 1000 URLs every second?"
"I can do that for 900 KB/sec."
"Great, do it!"
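The arithmetic behind this exchange can be sketched in a few lines. This is only an illustration, not the system's actual cost model: it assumes decimal units (1 KB = 1000 bytes) and a fixed record size of 900 bytes, which is simply what the slide's "top 1000 URLs every second for 900 KB/sec" implies. The function names are hypothetical.

```python
MB = 1_000_000  # assumption: decimal megabytes
KB = 1_000      # assumption: decimal kilobytes

# Assumption: 900 bytes per (URL, count) record, implied by
# "top 1000 URLs every second" costing 900 KB/sec on the slide.
BYTES_PER_RECORD = 900


def cost_per_sec(num_keys: int, bytes_per_record: int, period_sec: float) -> float:
    """Bandwidth needed to ship one record per key once per period."""
    return num_keys * bytes_per_record / period_sec


def max_rank_cutoff(budget: float, bytes_per_record: int, period_sec: float) -> int:
    """Largest top-k that fits in the budget at the given reporting period."""
    return int(budget * period_sec // bytes_per_record)


budget = 12 * MB                                      # "You only have 12 MB/sec"
top_1000 = cost_per_sec(1000, BYTES_PER_RECORD, 1.0)  # 900 KB/sec
largest_k = max_rank_cutoff(budget, BYTES_PER_RECORD, 1.0)

print(top_1000 <= budget, largest_k)
```

Under these assumptions the top-1000 query fits comfortably, and the same budget would admit a rank cutoff of roughly thirteen thousand URLs per second.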

10 Challenge: Varying Scarcity
[Figure: bandwidth over time; labels: Needed, Available, Can do]
First aggregate over longer time periods, up to 30 seconds. Then only keep the top URLs.
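The two-step policy on this slide (lengthen the aggregation window up to 30 seconds, then fall back to a rank cutoff) might look like the sketch below. The function `plan`, its parameters, and the linear cost model (one fixed-size record per URL per window) are assumptions made for illustration; the slides do not specify how the policy is expressed.

```python
import math


def plan(num_urls, bytes_per_record, available_bps, base_window=1, max_window=30):
    """Return (window_sec, rank_cutoff) for the available bandwidth.

    Assumed cost model: one record per URL per window, so the required
    bandwidth is num_urls * bytes_per_record / window_sec.
    """
    bytes_per_window = num_urls * bytes_per_record

    # Step 1: aggregate over longer time periods, up to max_window seconds.
    needed_window = math.ceil(bytes_per_window / available_bps)
    window = min(max(base_window, needed_window), max_window)
    if bytes_per_window / window <= available_bps:
        return window, num_urls

    # Step 2: still too expensive, so keep only the top URLs.
    cutoff = int(available_bps * window // bytes_per_record)
    return window, cutoff
```

For example, with 100,000 URLs at 900 bytes each, `plan(100_000, 900, 12_000_000)` lengthens the window to 8 seconds and keeps every URL, while `plan(100_000, 900, 1_000_000)` hits the 30-second limit and then trims to the top 33,333 URLs.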

11 Challenge: Backfill
Every minute, compute request counts by URL
[Diagram: Requests arrive at each site's Local Aggregation and Storage; one link is crashed or partitioned]
Processing keeps happening. Now what??

12 Data Processing Requirements
[Diagram labels: Aggregatable; Merge-able; Reducible; Data; Merged Representation +=; Stored Data += Update]
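One reading of these requirements is that stored data must absorb individual updates ("Stored Data += Update") and that two partial results must combine into one ("Merged Representation +="). A minimal sketch of an aggregate with both properties follows; the class `LatencyStats` and its fields are hypothetical, chosen to match the "total requests and average latency" example used elsewhere in the deck.

```python
from dataclasses import dataclass


@dataclass
class LatencyStats:
    """Request count and total latency; the mean is derived on demand."""
    count: int = 0
    total_ms: float = 0.0

    def update(self, latency_ms: float) -> None:
        """Aggregatable: fold one new observation into the stored state."""
        self.count += 1
        self.total_ms += latency_ms

    def merge(self, other: "LatencyStats") -> None:
        """Merge-able: combine a partial aggregate computed elsewhere."""
        self.count += other.count
        self.total_ms += other.total_ms

    @property
    def mean_ms(self) -> float:
        return self.total_ms / self.count if self.count else 0.0


site_a, site_b = LatencyStats(), LatencyStats()
site_a.update(10.0)
site_a.update(20.0)
site_b.update(30.0)
site_a.merge(site_b)  # site_a now summarizes all three requests
```

Storing the sum and count rather than the average itself is what keeps the merge exact: two averages cannot be combined without knowing how many requests each one covers.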

13
                         Raw byte strings (e.g. MapReduce)   Database tables
High-level API           X                                   √
Merge + Aggregate        X                                   X
Predictable performance  √                                   X
Arbitrary Joins          X                                   √

14 The Data Cube Model
Counts by URL       12:00  12:01  12:02
www.mysite.com      3      5      …
www.yoursite.com    5      4      …
www.hersite.com     8      12     …
Roll-up of mysite.com by time from 12:00 to 12:01: 8
Roll-up of sites at time 12:00: 16
Cube: A multidimensional array, with one or more aggregates, indexed by a set of dimensions
Aggregation function used for:
– Updates
– Roll-ups
– Merging cubes
– Degrading cubes
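The roll-ups on this slide can be reproduced with a small sketch. The class `CountCube` below is hypothetical and far simpler than a real cube implementation: it uses a single aggregate (a sum of counts) over two dimensions (URL and minute), and the same addition serves updates, roll-ups, and merges.

```python
from collections import defaultdict


class CountCube:
    """Counts indexed by (url, minute); addition is the aggregation function."""

    def __init__(self):
        self.cells = defaultdict(int)

    def update(self, url, minute, n=1):
        self.cells[(url, minute)] += n

    def rollup(self, url=None, minutes=None):
        """Sum over every cell matching the given URL and/or set of minutes."""
        return sum(
            count
            for (u, m), count in self.cells.items()
            if (url is None or u == url) and (minutes is None or m in minutes)
        )

    def merge(self, other):
        """Fold another cube's cells into this one."""
        for key, count in other.cells.items():
            self.cells[key] += count


cube = CountCube()
for url, minute, n in [
    ("www.mysite.com", "12:00", 3), ("www.mysite.com", "12:01", 5),
    ("www.yoursite.com", "12:00", 5), ("www.yoursite.com", "12:01", 4),
    ("www.hersite.com", "12:00", 8), ("www.hersite.com", "12:01", 12),
]:
    cube.update(url, minute, n)

by_site = cube.rollup(url="www.mysite.com", minutes={"12:00", "12:01"})  # 3 + 5
by_time = cube.rollup(minutes={"12:00"})                                 # 3 + 5 + 8
```

Both results match the slide: 8 for mysite.com across the two minutes, and 16 for all sites at 12:00.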

15
                         Raw byte strings (e.g. MapReduce)   Database tables   Data Cube
High-level API           X                                   √                 √
Merge + Aggregate        X                                   X                 √
Predictable performance  √                                   X                 √
Arbitrary Joins          X                                   √                 X

16 A Vision for Wide-Area Analytics
[Diagram labels: Dataflow Operators and a Local Cube at each of two sites; Network bottleneck; Dataflow Operators and a Merged Cube]
Dataflow adapted to bandwidth

17 Adaptivity
[Diagram labels: Dataflow Operators, Local Cube, Dataflow Operators, Network bottleneck]

18 Adaptivity
[Diagram labels: Dataflow Operators, Local Cube, Summarized Cube, Dataflow Operators, Network bottleneck, Feedback control]
Key ingredients:
– Cube summarization as mechanism
– User-defined policies
– Feedback control
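How these three ingredients might fit together can be sketched as a simple control loop. Everything here is an assumption made for illustration: the class name, the utilization thresholds, and the choice of aggregation window as the only summarization knob. The slides do not describe the controller's actual design.

```python
class FeedbackController:
    """Steps through a user-defined list of aggregation windows.

    The policy (windows_sec) is ordered from least to most summarized.
    """

    def __init__(self, windows_sec, high=1.0, low=0.5):
        self.windows_sec = list(windows_sec)
        self.high = high  # assumption: above this utilization, summarize more
        self.low = low    # assumption: below this utilization, summarize less
        self.level = 0

    def observe(self, offered_bytes, capacity_bytes):
        """Feed back one measurement; return the window to use next."""
        utilization = offered_bytes / capacity_bytes
        if utilization > self.high and self.level < len(self.windows_sec) - 1:
            self.level += 1
        elif utilization < self.low and self.level > 0:
            self.level -= 1
        return self.windows_sec[self.level]


controller = FeedbackController([1, 5, 10, 30])
steps = [
    controller.observe(150, 100),  # link overloaded: coarsen
    controller.observe(150, 100),  # still overloaded: coarsen again
    controller.observe(80, 100),   # within bounds: hold
    controller.observe(40, 100),   # headroom: refine
]
```

The gap between the two thresholds acts as a dead band, so the controller does not oscillate between adjacent levels when the link is close to full.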

19 Backup Slides

20 Conclusions
The hard problems in wide-area analysis:
– Reasoning about bandwidth/data quality tradeoffs
– Optimizing data quality under changing conditions
– Jointly optimizing bandwidth and other resources
We are building a system.
– We call it JetStream. Stay tuned…

21 Structuring the Storage
– Database relations, used in relational databases.
– Data Cube: A multidimensional array with a merge function. Used in OLAP analysis. Ex: Total requests and average latency by URL and time.
– Raw byte strings, used in MapReduce and similar systems.

22 Clouds master

23 Bandwidth Costs do not Decline Smoothly
[Figure source: TeleGeography's Global Bandwidth Research Service]

24 Bandwidth Price Shifts
[Figure labels: Frankfurt-London, 2012, 20%]

25 Diurnal Load Makes Overprovisioning Expensive
– Leased lines waste capacity during off-peak
– Public internet gets congested during peak

26 Structured edge storage
System: MapReduce and similar. Data model: Key-value bitstrings. Pros: Key-value or FS-style storage. Cons: Can't optimize.
System: Relational Databases. Data model: Relational. Pros: Optimizer can reason about queries. Cons: Complex to support.
System: Wide-area. Data model: Data cubes. Pros: Natural merges, can reason about

27 Straw-Man I: Backhaul
What is the distribution of request counts by URL?
[Diagram label: Requests]

28 Straw-Man II: Stream processing
What is the distribution of request counts by URL?
[Diagram labels: Requests; Borealis / System S at each site]

29 Benefit: Iteration
Can iteratively pose different queries
[Diagram labels: Requests; Local Aggregation and Storage at each site; A revised query]

30 Benefit: Adaptation
Can adapt the data volume collected to the available bandwidth
[Diagram labels: Requests; Local Aggregation and Storage at each site; Limited Bandwidth]

31 Benefit: Adaptation
Can adapt the data volume collected to the available bandwidth
[Diagram labels: Requests; Local Aggregation and Storage at each site; Ample Bandwidth]

32 A dataflow model for wide-area analytics
– Operator: Defines data transformation on tuples. Can do input or output.
– Cube: Structured storage of data.
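The two building blocks named here can be sketched as follows. This is a toy illustration under stated assumptions, not the system's API: operators are modeled as Python generators over tuples, the cube as a counter keyed by dimension values, and the names `parse`, `drop_minute`, and `Cube` are hypothetical. The log-line format ("minute url") is also invented for the example.

```python
from collections import Counter


def parse(lines):
    """Input operator: turn raw log lines into (url, minute, count) tuples."""
    for line in lines:
        minute, url = line.split()
        yield (url, minute, 1)


def drop_minute(tuples):
    """Transformation operator: project away the time dimension."""
    for url, _minute, n in tuples:
        yield (url, n)


class Cube:
    """Structured storage: sums the trailing count for each key."""

    def __init__(self):
        self.cells = Counter()

    def insert(self, tuples):
        for *key, n in tuples:
            self.cells[tuple(key)] += n

    def scan(self):
        """Emit stored cells as tuples so further operators can consume them."""
        for key, n in sorted(self.cells.items()):
            yield (*key, n)


lines = ["12:00 www.mysite.com", "12:00 www.mysite.com", "12:01 www.yoursite.com"]

local = Cube()
local.insert(parse(lines))               # operators feed a local cube

by_url = Cube()
by_url.insert(drop_minute(local.scan())) # the cube feeds further operators
```

Because a cube can act as both a sink and a source, operator chains can be split at a cube, which is where the network bottleneck sits in the deck's diagrams.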

33
[Diagram labels: Generated data, Ingested Into Local cubes, Source Cube, Processing, Network bottleneck, Processed Data]

34
[Diagram labels: Processed Data, Processing]

