Sampling Based Range Partition for Big Data Analytics + Some Extras
Milan Vojnović, Microsoft Research, Cambridge, United Kingdom
Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou
INQUEST Workshop, September 2012

Big Data Analytics
Our goal: innovation in algorithms for large-scale computations, to move the frontier of the computer science of big data
Some figures of scale
– Petabytes and terabytes of online services data processed daily
– 200M tweets per day (Twitter)
– 1B content pieces shared per day (Facebook)
– 8,000 exabytes of global data by 2015 (The Economist)

Research Agenda
[Diagram labels: Machine learning; Optimization; Database queries; Distributed computing system]

Outline
– Range Partition, with Fei Xu and Jingren Zhou
– Count Tracking, with Zhenming Liu and Bozidar Radunovic
– Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic

Range Partition
Special interest: balanced range partition
[Figure: input keys (e.g. 1, 23, 24, 1024, 8, 83, 120, 52) held at sites 1, 2, …, k are mapped to ranges 1, 2, …, m such as 1-100, 101-250, …, 950-1024; further labels: (120,10), (120,5), (120,4)]

Range Partition Requirements

Two Approaches
Sampling based methods
– Take a sample of data items
– Compute partition boundaries using the sample
Quantile summary methods
– At each node compute a local quantile summary
– Merge at the coordinator node
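
The sampling based approach above can be sketched as follows. This is a minimal illustration, not the method from the talk: it assumes a single uniform sample and places boundaries at evenly spaced sample quantiles. The names `boundaries_from_sample` and `range_index` are hypothetical.

```python
import bisect
import random


def boundaries_from_sample(sample, num_ranges):
    """Return num_ranges - 1 split points at evenly spaced sample quantiles."""
    ordered = sorted(sample)
    n = len(ordered)
    return [ordered[(j * n) // num_ranges] for j in range(1, num_ranges)]


def range_index(key, boundaries):
    """Index of the range a key falls into; keys below boundaries[0] go to range 0."""
    return bisect.bisect_right(boundaries, key)


# Illustrative data: keys drawn uniformly from 1..1024 (an assumption).
rng = random.Random(0)
data = [rng.randint(1, 1024) for _ in range(100_000)]
sample = rng.sample(data, 1_000)

bounds = boundaries_from_sample(sample, 4)
counts = [0] * 4
for key in data:
    counts[range_index(key, bounds)] += 1
```

With a uniform sample of this size the four counts should come out roughly equal; how large the sample must be for a given balance guarantee is the question the talk addresses.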

Related Work

Related Work (cont’d)

Problem
Range partition the data in a single pass, with minimal communication between the coordinator and the sites

Sampling Based Method
[Figure: sites 1, 2, …, k and the coordinator]
Pros
– Simplicity, scalability
Cons
– How many samples to take from each site?
– Data size imbalance: the number of input records may differ from one machine to another

Data Sizes Imbalance
Dataset      Records   Bytes    Sites
DataSet-1    62M       150G     262
DataSet-2    37M       25G      80
DataSet-3    13M       0.26G    1
DataSet-4    7M        1.2T     301
DataSet-5    106M      7T       5652

Origins of Data Sizes Imbalance
JOIN
– SELECT FROM A INNER JOIN B ON A.KEY==B.KEY ORDER BY COL
Lookup Table
– If the record value of column X is in the lookup table, then return the row
UNPIVOT
– Input:
  Col 1   Col 2
  1       2, 3
  2       3, 9, 8, 13
  …
– Output: (1,2), (1,3), (2,3), (2,9), …
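
To make the UNPIVOT case concrete, here is a small sketch of how one input row expands into a variable number of output records, which is one way sizes become uneven across machines. The function name `unpivot` and the list-of-pairs input representation are assumptions made for illustration.

```python
def unpivot(rows):
    """Expand each (key, values) row into one (key, value) record per value."""
    out = []
    for key, values in rows:
        for value in values:
            out.append((key, value))
    return out


# The two input rows shown on the slide.
rows = [(1, [2, 3]), (2, [3, 9, 8, 13])]
pairs = unpivot(rows)

# Number of output records produced by each input row.
records_per_row = [len(values) for _, values in rows]
```

Here the first row yields two records and the second yields four, so a machine that holds rows with long value lists ends up with more records after the operation.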

Weighted Sampling Scheme

SAMPLE
[Figure: sites 1, 2, …, k and the coordinator]

MERGE
[Figure: coordinator]

PARTITION
[Figure: coordinator; axis from 0 to 1 divided into ranges 1 to 5]
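
The slides name the three steps but the transcript does not preserve their details. The sketch below is one plausible reading, under loudly stated assumptions: each site draws the same number of samples and reports its record count, the coordinator weights each sample by the number of records it stands for, and boundaries are cut on the cumulative weight. The functions `site_sample`, `merge`, and `partition` are hypothetical and should not be taken as the authors' exact scheme.

```python
import random


def site_sample(records, t, rng):
    """SAMPLE (assumed form): draw t records with replacement, report the site size.

    Assumes the site holds at least one record.
    """
    return [rng.choice(records) for _ in range(t)], len(records)


def merge(reports):
    """MERGE (assumed form): weight each sample by the records it represents."""
    weighted = []
    for samples, size in reports:
        weight = size / len(samples)
        weighted.extend((key, weight) for key in samples)
    weighted.sort()
    return weighted


def partition(weighted, num_ranges):
    """PARTITION (assumed form): cut where cumulative weight crosses j/num_ranges."""
    total = sum(w for _, w in weighted)
    bounds, acc, j = [], 0.0, 1
    for key, w in weighted:
        acc += w
        while j < num_ranges and acc >= j * total / num_ranges:
            bounds.append(key)
            j += 1
    return bounds


# Two sites with a 9:1 size imbalance (illustrative data).
rng = random.Random(1)
site_a = [rng.randint(1, 100) for _ in range(9_000)]
site_b = [rng.randint(901, 1_000) for _ in range(1_000)]
reports = [site_sample(site_a, 200, rng), site_sample(site_b, 200, rng)]
median_bound = partition(merge(reports), 2)
```

Because the large site's samples carry nine times the weight, the two-way split point lands inside its key range; an unweighted merge of the same samples would place the split near the gap between the two sites.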

Sufficient Sample Size

Constant Factor Imbalance

Proof Outline

Performance DataSet-1

Performance (cont’d)

Summary for Range Partitioning
Novel weighted sampling scheme
Provable performance guarantees
Simple and practical
– Code transfer to Cosmos
More info: Sampling Based Range Partition Methods for Big Data Analytics, V., Xu, Zhou, MSR-TR-2012-18, Mar 2012

SUM Tracking Problem
[Figure: sites 1, 2, 3, …, k; SUM]

SUM Tracking

Applications
[Figure label: input data]

State of the Art

The Challenge
Q: What are communication-efficient algorithms for the sum tracking problem with random input streams?
– Random permutation
– Random i.i.d.
– Fractional Brownian motion
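
For orientation, the sketch below shows a naive baseline for distributed sum tracking on a random i.i.d. stream of +1/-1 updates. It is not the algorithm from the talk: it assumes an absolute error target, and each site reports its local sum whenever it has drifted by at least `delta` since the last report. The classes `Site` and `Coordinator` and all parameter values are hypothetical.

```python
import random


class Site:
    """Holds a local sum and reports it once it drifts by delta or more."""

    def __init__(self, delta):
        self.delta = delta
        self.local_sum = 0
        self.reported = 0

    def update(self, x):
        """Apply update x; return the local sum if it must be reported, else None."""
        self.local_sum += x
        if abs(self.local_sum - self.reported) >= self.delta:
            self.reported = self.local_sum
            return self.local_sum
        return None


class Coordinator:
    """Keeps the last value reported by each site and counts messages."""

    def __init__(self, k):
        self.last = [0] * k
        self.messages = 0

    def receive(self, site_id, value):
        self.last[site_id] = value
        self.messages += 1

    def estimate(self):
        return sum(self.last)


k, delta, num_updates = 4, 5, 10_000
rng = random.Random(0)
sites = [Site(delta) for _ in range(k)]
coord = Coordinator(k)
true_sum, max_error = 0, 0
for _ in range(num_updates):
    i = rng.randrange(k)            # site that receives this update
    x = rng.choice((-1, 1))         # random i.i.d. update
    true_sum += x
    report = sites[i].update(x)
    if report is not None:
        coord.receive(i, report)
    max_error = max(max_error, abs(true_sum - coord.estimate()))
```

Each site's unreported drift stays below `delta`, so the coordinator's error stays below `k * delta` while using far fewer messages than one per update. The question on the slide is how much better than such baselines one can do for random inputs.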

Communication Complexity Bounds

Communication Complexity Bounds: Unknown Drift Case

Our Tracker Algorithm
[Figure: site and coordinator; labels: X_i, M_i = 1, local sums S_1, …, S_k, and S = S_1 + … + S_k]

Two Applications
– Second Frequency Moment
– Bayesian Linear Regression

App 1: Second Frequency Moment

AMS Sketch
[Figure label: {0,1} valued hash]
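
As a reminder of how an AMS-style sketch estimates the second frequency moment F2 (the sum of squared item frequencies), here is a minimal single-machine sketch. It assumes the {0,1} valued hash bit b is mapped to a sign 2b - 1, and it uses Python's pseudo-random generator as a stand-in for the 4-wise independent hash family that the analysis of the AMS sketch requires. The names `ams_f2`, `exact_f2`, and `hash_bit` are hypothetical.

```python
import random
from collections import Counter


def hash_bit(salt, item):
    """Stand-in {0,1} valued hash: deterministic per (salt, item) within a run."""
    return random.Random(hash((salt, item))).getrandbits(1)


def ams_f2(stream, num_estimators=64, seed=0):
    """Estimate F2 as the mean of squared signed counters."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_estimators)]
    counters = [0] * num_estimators
    for item in stream:
        for j, salt in enumerate(salts):
            counters[j] += 2 * hash_bit(salt, item) - 1   # map {0,1} to {-1,+1}
    return sum(c * c for c in counters) / num_estimators


def exact_f2(stream):
    """Exact second frequency moment, for comparison."""
    return sum(c * c for c in Counter(stream).values())


stream = [1, 1, 2, 3, 3, 3]
estimate = ams_f2(stream)
truth = exact_f2(stream)
```

Each counter is a signed sum of updates, which is why tracking F2 over distributed streams connects to tracking sums that can both increase and decrease.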

App 1: Second Frequency Moment (cont’d)

App 2: Bayesian Linear Regression

App 2: Bayesian Linear Regression (cont’d)

Summary for Sum Tracking
Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d., and fractional Brownian motion inputs
Proposed a novel algorithm with nearly optimal communication complexity
Details: ACM PODS 2012

Problem
Partition a graph with two objectives
– Sparsely connected components
– Balanced number of vertices per component
Applications
– Parallel processing
– Community detection

Problem (cont’d)
[Figure: partitions 1, 2, 3, …, k]
Requirements
– Streaming algorithm
– Single pass / incremental
– Efficient computing
Desired
– Approximation guarantees
– Average-case efficient
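
To illustrate the setting, the sketch below implements a generic greedy online heuristic of the kind the talk compares against, not the authors' algorithm. It assumes vertices arrive one at a time with their adjacency lists, and each vertex is placed in the non-full partition that already holds most of its neighbors. The function names and the `capacity` parameter are hypothetical, and `k * capacity` is assumed to be at least the number of vertices.

```python
def greedy_stream_partition(vertex_stream, k, capacity):
    """Assign each arriving vertex to the non-full partition with most neighbors."""
    assignment = {}
    sizes = [0] * k
    for vertex, neighbors in vertex_stream:
        scores = [0] * k
        for u in neighbors:
            if u in assignment:              # only already-placed neighbors count
                scores[assignment[u]] += 1
        candidates = [p for p in range(k) if sizes[p] < capacity]
        # Prefer more neighbors; break ties toward the smaller partition.
        best = max(candidates, key=lambda p: (scores[p], -sizes[p]))
        assignment[vertex] = best
        sizes[best] += 1
    return assignment


def edge_cut(edges, assignment):
    """Number of edges whose endpoints fall in different partitions."""
    return sum(1 for u, v in edges if assignment[u] != assignment[v])


# Two triangles joined by a single edge (illustrative graph).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = {v: [] for v in range(6)}
for u, v in edges:
    adjacency[u].append(v)
    adjacency[v].append(u)

stream = [(v, adjacency[v]) for v in range(6)]
assignment = greedy_stream_partition(stream, k=2, capacity=3)
cut = edge_cut(edges, assignment)
```

The capacity limit enforces the balance objective and the neighbor count serves the sparse-cut objective; the result depends on arrival order, which is why average-case behavior matters for streaming heuristics.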

Summary for Graph Partitioning
Designed a streaming algorithm whose average-case performance appears superior to that of previously proposed online heuristics
Provable approximation guarantees
More details available soon
