Making Every Bit Count in Wide Area Analytics
Ariel Rabkin
Joint work with: Matvey Arye, Siddhartha Sen, Michael J. Freedman, and Vivek Pai


Global Systems Have Global Data

The Rise of Big Distributed Data
CDNs:
– Akamai has ~20 million requests per second
– CloudFlare has about 300 MB/s of logs; volume doubles every 4 months
Sensor data (e.g., power grid, highways)
Smart camera networks

Trends
[Figure: amount per dollar over time, for data volumes and wide-area bandwidth]

Analyzing Low-rate Events is Easy
"Alert me when server crashes!"
"Server Crashed!"

High-rate Events can be Costly
Every minute, compute request counts by URL
[Diagram: Requests]

Backhaul has Bad Dynamics
Example: backhaul count of events every 5 minutes
Choice of summaries is made upfront, statically
– Buyer’s remorse: chose to collect unnecessary and expensive data
– Analyst’s remorse: summaries insufficient for analysis. No way to retroactively get more data

Local Storage!
Every minute, compute request counts by URL
[Diagram: Requests arrive at each site, which has its own Local Aggregation and Storage]

Challenge: Bandwidth Scarcity
Analyst: "I want the request count for every URL every second."
System: "I can’t do that, Ari. That costs 100 MB/sec. You only have 12 MB/sec. Want to impose a rank cutoff, value cutoff, or change frequency?"
Analyst: "Can I get the top 1000 URLs every second?"
System: "I can do that for 900 KB/sec."
Analyst: "Great, do it!"
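
The exchange above amounts to a bandwidth estimate followed by a choice among degradations. A minimal sketch of that arithmetic, assuming a fixed per-row wire size: the `Query` class, the 900-byte row size, and the 111,111-URL population are hypothetical values back-solved from the 100 MB/sec and 900 KB/sec figures on the slide, not numbers given in the talk.

```python
from dataclasses import dataclass

MB = 1_000_000
KB = 1_000


@dataclass
class Query:
    keys: int            # rows (URLs) reported per interval
    bytes_per_row: int   # hypothetical wire size of one (URL, count) row
    interval_s: float    # seconds between reports

    def bandwidth(self) -> float:
        """Bytes per second needed to ship every row once per interval."""
        return self.keys * self.bytes_per_row / self.interval_s


def fits(query: Query, budget_bps: float) -> bool:
    """True if the query's bandwidth is within the budget."""
    return query.bandwidth() <= budget_bps


def max_rank_cutoff(query: Query, budget_bps: float) -> int:
    """Largest top-k that fits the budget at the query's own interval."""
    return int(budget_bps * query.interval_s // query.bytes_per_row)


def min_interval(query: Query, budget_bps: float) -> float:
    """Shortest interval at which all keys fit the budget.

    Assumes the number of keys does not grow as the interval lengthens.
    """
    return query.keys * query.bytes_per_row / budget_bps


budget = 12 * MB
every_url = Query(keys=111_111, bytes_per_row=900, interval_s=1.0)  # about 100 MB/s
top_1000 = Query(keys=1_000, bytes_per_row=900, interval_s=1.0)     # 900 KB/s

print(fits(every_url, budget))  # False
print(fits(top_1000, budget))   # True
```

Under these assumed numbers, `max_rank_cutoff(every_url, budget)` and `min_interval(every_url, budget)` give the two alternatives the system offers: a rank cutoff at the current frequency, or a lower frequency with every URL kept.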

Challenge: Varying Scarcity
[Figure: bandwidth over time, needed vs. available]
Analyst: "First aggregate over longer time periods, up to 30 seconds. Then only keep the top URLs."
System: "Can do"
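
The two-step policy on this slide (lengthen the aggregation window up to 30 seconds, then fall back to a top-URL cutoff) can be sketched as a small planning function. This is an illustration under stated assumptions, not the system's actual policy language: the function name `plan`, its parameters, and the simplification that the number of distinct URLs per report does not grow with the window are all hypothetical.

```python
from typing import Tuple


def plan(available_bps: float,
         distinct_urls: int,
         bytes_per_row: int,
         base_window_s: float = 1.0,
         max_window_s: float = 30.0) -> Tuple[float, int]:
    """Return (window_s, top_k) for the given available bandwidth.

    Step 1: lengthen the aggregation window, up to max_window_s.
    Step 2: if that is still not enough, keep only the top_k URLs.
    Assumes each report carries one row per distinct URL regardless of window.
    """
    bytes_per_report = distinct_urls * bytes_per_row
    # Window at which a full report exactly consumes the available bandwidth.
    needed_window_s = bytes_per_report / available_bps
    if needed_window_s <= max_window_s:
        return max(base_window_s, needed_window_s), distinct_urls
    # Window is capped; shrink the report to the rows that fit.
    top_k = int(available_bps * max_window_s // bytes_per_row)
    return max_window_s, top_k


# Hypothetical workload: 100,000 URLs at 100 bytes per row (10 MB per report).
print(plan(20_000_000, 100_000, 100))  # (1.0, 100000): ample bandwidth
print(plan(1_000_000, 100_000, 100))   # (10.0, 100000): longer window suffices
print(plan(100_000, 100_000, 100))     # (30.0, 30000): window capped, top-k applied
```

Re-running `plan` whenever the available bandwidth estimate changes gives the "varying" behavior: the same policy yields different windows and cutoffs over time.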

Challenge: Backfill
Every minute, compute request counts by URL
[Diagram: Requests arrive at sites with Local Aggregation and Storage; one part of the system is marked "Crashed or partitioned"]
Processing keeps happening
Now what??

Data Processing Requirements
– Aggregatable (Stored Data += Update)
– Merge-able (Merged Representation += Data)
– Reducible (Data)

                                     High-level API   Merge + Aggregate   Predictable performance   Arbitrary Joins
Raw byte strings (e.g. MapReduce)          X                  X                      √                     X
Database tables                            √                  X                      X                     √

The Data Cube Model
Counts by URL        12:00   12:01   12:02
www.mysite.com         3       5       …
www.yoursite.com       5       4       …
www.hersite.com        8      12       …
Roll-up of mysite.com by time from 12:00 to 12:01: 8
Roll-up of sites at time 12:00: 16
Cube: A multidimensional array, with one or more aggregates, indexed by a set of dimensions
Aggregation function used for:
– Updates
– Roll-ups
– Merging cubes
– Degrading cubes
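
To make the four uses of the aggregation function concrete, here is a minimal sketch of a two-dimensional cube (URL by time) with a single count aggregate. The `Cube` class and its method names are hypothetical, chosen for illustration; the cell values are the ones shown in the table on this slide.

```python
from typing import Callable, Dict, Optional, Set, Tuple

Cell = Tuple[str, str]  # (url, time)


class Cube:
    """A toy cube: cells indexed by (url, time), combined by one aggregation function."""

    def __init__(self, agg: Callable[[int, int], int] = lambda a, b: a + b):
        self.agg = agg
        self.cells: Dict[Cell, int] = {}

    def update(self, url: str, time: str, value: int) -> None:
        """Fold a new value into the matching cell."""
        key = (url, time)
        self.cells[key] = self.agg(self.cells[key], value) if key in self.cells else value

    def rollup(self, url: Optional[str] = None,
               times: Optional[Set[str]] = None) -> Optional[int]:
        """Aggregate every cell matching url and times (None means 'all')."""
        total = None
        for (u, t), v in self.cells.items():
            if (url is None or u == url) and (times is None or t in times):
                total = v if total is None else self.agg(total, v)
        return total

    def merge(self, other: "Cube") -> None:
        """Fold another cube's cells into this one, cell by cell."""
        for (u, t), v in other.cells.items():
            self.update(u, t, v)

    def degrade(self, time_bucket: Callable[[str], str]) -> "Cube":
        """Return a coarser cube whose time dimension is re-bucketed."""
        coarse = Cube(self.agg)
        for (u, t), v in self.cells.items():
            coarse.update(u, time_bucket(t), v)
        return coarse


cube = Cube()
for url, time, count in [
    ("www.mysite.com", "12:00", 3), ("www.mysite.com", "12:01", 5),
    ("www.yoursite.com", "12:00", 5), ("www.yoursite.com", "12:01", 4),
    ("www.hersite.com", "12:00", 8), ("www.hersite.com", "12:01", 12),
]:
    cube.update(url, time, count)

print(cube.rollup(url="www.mysite.com", times={"12:00", "12:01"}))  # 8
print(cube.rollup(times={"12:00"}))                                  # 16
```

Because `update`, `rollup`, `merge`, and `degrade` all go through the same `agg`, swapping in a different aggregate (for example `max`) changes all four operations consistently.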

                                     High-level API   Merge + Aggregate   Predictable performance   Arbitrary Joins
Raw byte strings (e.g. MapReduce)          X                  X                      √                     X
Database tables                            √                  X                      X                     √
Data Cube                                  √                  √                      √                     X

A Vision for Wide-Area Analytics
[Diagram: at each site, Dataflow Operators feed a Local Cube; further Dataflow Operators send data across the Network bottleneck into a Merged Cube, which feeds more Dataflow Operators]
Dataflow adapted to bandwidth

Adaptivity
[Diagram: Dataflow Operators, Local Cube, Dataflow Operators, Network bottleneck]

Adaptivity
[Diagram: Dataflow Operators, Local Cube, Dataflow Operators, Summarized Cube, Network bottleneck, with Feedback control]
Key ingredients:
– Cube summarization as mechanism
– User-defined policies
– Feedback control
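
One way to picture how the three ingredients fit together is a controller that steps through a user-supplied list of summarization levels based on the observed send backlog. The sketch below is an assumption for illustration only: the `DegradationController` class, the backlog signal, and the high/low water-mark scheme are hypothetical and are not described in the talk.

```python
from typing import List


class DegradationController:
    """Step through policy levels (least to most degraded) based on send backlog."""

    def __init__(self, levels: List[str], high_water: int, low_water: int):
        self.levels = levels          # user-defined policy, ordered by degradation
        self.high_water = high_water  # backlog (bytes) above which we degrade more
        self.low_water = low_water    # backlog (bytes) below which we degrade less
        self.index = 0

    def observe(self, backlog_bytes: int) -> str:
        """Feed one backlog measurement; return the level to use next."""
        if backlog_bytes > self.high_water and self.index < len(self.levels) - 1:
            self.index += 1
        elif backlog_bytes < self.low_water and self.index > 0:
            self.index -= 1
        return self.levels[self.index]


controller = DegradationController(
    levels=["1s windows, all URLs", "30s windows, all URLs", "30s windows, top URLs"],
    high_water=1_000,
    low_water=100,
)
for backlog in [0, 5_000, 5_000, 500, 0]:
    print(backlog, "->", controller.observe(backlog))
```

The gap between the two water marks keeps the controller from flapping between levels when the backlog hovers near a single threshold.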

Backup Slides

Conclusions
The hard problems in wide-area analysis:
– Reasoning about bandwidth/data quality tradeoffs
– Optimizing data quality under changing conditions
– Jointly optimizing bandwidth and other resources
We are building a system.
– We call it JetStream. Stay tuned…

Structuring the Storage
Raw byte strings, used in MapReduce and similar systems
Database relations, used in relational databases
Data Cube: A multidimensional array with a merge function. Used in OLAP analysis. Ex: Total requests and average latency by URL and time


Bandwidth Costs do not Decline Smoothly
[Figure source: TeleGeography's Global Bandwidth Research Service]

2012 Bandwidth Price Shifts
[Figure: Frankfurt-London, 20%]

Diurnal Load Makes Overprovisioning Expensive
– Leased lines waste capacity during off-peak
– Public internet gets congested during peak

Structured edge storage
System                  Data Model              Pros                                  Cons
MapReduce and similar   Key-value bitstrings    Key-value or FS-style storage         Can’t optimize
Relational Databases    Relational              Optimizer can reason about queries    Complex to support
Wide-area               Data cubes              Natural merges, can reason about

Straw-Man I: Backhaul
What is the distribution of request counts by URL?
[Diagram: Requests from each site are sent to a central location]

Straw-Man II: Stream processing
What is the distribution of request counts by URL?
[Diagram: Requests at each site are processed by Borealis / System S]

Benefit: Iteration
Can iteratively pose different queries
[Diagram: Requests at each site go to Local Aggregation and Storage, which receives a revised query]

Benefit: adaptation
Can adapt data volume collected to available bandwidth
[Diagram: Requests at each site go to Local Aggregation and Storage; Limited Bandwidth]

Benefit: adaptation
Can adapt data volume collected to available bandwidth
[Diagram: Requests at each site go to Local Aggregation and Storage; Ample Bandwidth]

A dataflow model for wide-area analytics
Operator: Defines data transformation on tuples. Can do input or output.
Cube: Structured storage of data
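
The operator/cube split can be sketched with plain Python generators as operators and a `Counter` standing in for a cube. Everything named here (`parse`, `drop_prefix`, `store`, the log-line format) is a hypothetical illustration of the model, not the system's API.

```python
from collections import Counter
from typing import Iterable, Iterator, Tuple

Row = Tuple[str, str]  # (url, minute)


def parse(lines: Iterable[str]) -> Iterator[Row]:
    """Input operator: turn 'HH:MM:SS url' log lines into (url, minute) tuples."""
    for line in lines:
        timestamp, url = line.split()
        yield url, timestamp[:5]


def drop_prefix(rows: Iterable[Row], prefix: str) -> Iterator[Row]:
    """Transformation operator: discard tuples whose URL starts with prefix."""
    for url, minute in rows:
        if not url.startswith(prefix):
            yield url, minute


def store(rows: Iterable[Row], cube: Counter) -> None:
    """Output operator: fold tuples into the cube, one count per tuple."""
    for row in rows:
        cube[row] += 1


log = ["12:00:01 /a", "12:00:30 /a", "12:01:02 /b", "12:01:09 /img/x.png"]
cube: Counter = Counter()
store(drop_prefix(parse(log), "/img/"), cube)
print(cube[("/a", "12:00")])  # 2
```

Chaining generators keeps each operator independent of its neighbors, so a summarization step could be inserted between any two of them without changing the others.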

[Diagram: Generated data is ingested into local cubes; at each site, Processing reads from a Source Cube and sends Processed Data across the Network bottleneck]

[Diagram: Processed Data, Processing]
