Published by Giles Lane. Modified over 8 years ago.
1
Sampling Based Range Partition for Big Data Analytics + Some Extras
Milan Vojnović
Microsoft Research Cambridge, United Kingdom
Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou
INQUEST Workshop, September 2012
2
Big Data Analytics
Our goal: innovation in algorithms for large-scale computation, to move the frontier of the computer science of big data
Some figures of scale
– Petabytes / terabytes of online services data processed daily
– 200M tweets per day (Twitter)
– 1B content pieces shared per day (Facebook)
– 8,000 exabytes of global data by 2015 (The Economist)
3
Research Agenda
– Machine learning
– Optimization
– Database queries
– Distributed computing system
4
Outline
– Range Partition, with Fei Xu and Jingren Zhou
– Count Tracking, with Zhenming Liu and Bozidar Radunovic
– Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic
5
Range Partition
Special interest: balanced range partition
[Figure: keys such as 1, 23, 24, 1024 held at sites 1, 2, ..., k are assigned to ranges 1, 2, ..., m, e.g. 1-100, 101-250, ..., 950-1024]
6
Range Partition Requirements
7
Two Approaches
Sampling based methods
– Take a sample of data items
– Compute partition boundaries using the sample
Quantile summary methods
– At each node compute a local quantile summary
– Merge at the coordinator node
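The sampling based approach listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: uniform sampling without replacement, and boundaries read off the sorted sample at equally spaced positions. The names `sample_boundaries` and `assign_range` and all parameters are hypothetical and do not come from the talk.

```python
import random
from bisect import bisect_right

def sample_boundaries(items, num_ranges, sample_size, seed=0):
    """Pick num_ranges - 1 boundaries from a uniform sample of items."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(items, min(sample_size, len(items))))
    # Boundary j is the sample element at the j/num_ranges position.
    return [sample[(j * len(sample)) // num_ranges] for j in range(1, num_ranges)]

def assign_range(key, boundaries):
    """Return the index of the range that key falls into (0-based)."""
    return bisect_right(boundaries, key)

items = list(range(1, 1025))
boundaries = sample_boundaries(items, num_ranges=4, sample_size=200)
counts = [0] * 4
for key in items:
    counts[assign_range(key, boundaries)] += 1
print(boundaries, counts)
```

A larger sample generally brings the per-range counts closer to equal, at the price of more data sent to whoever computes the boundaries.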
8
Related Work
9
Related Work (cont’d)
10
Problem
Range partition the data in a single pass over the data, with minimal communication between the coordinator and the sites
11
Sampling Based Method
[Figure: sites 1, 2, ..., k and the coordinator]
Pros
– Simplicity, scalability
Cons
– How many samples to take from each site?
Data size imbalance: the number of input data records may differ from one machine to another
12
Data Sizes Imbalance
Dataset     Records   Bytes   Sites
DataSet-1   62M       150G    262
DataSet-2   37M       25G     80
DataSet-3   13M       0.26G   1
DataSet-4   7M        1.2T    301
DataSet-5   106M      7T      5652
13
Origins of Data Sizes Imbalance
JOIN
SELECT FROM A INNER JOIN B ON A.KEY==B.KEY ORDER BY COL
Lookup Table
If the record value of column X is in the lookup table, then return the row
UNPIVOT
Input:
Col 1   Col 2
1       2, 3
2       3, 9, 8, 13
…
Output: (1,2), (1,3), (2,3), (2,9), …
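The UNPIVOT case can be reproduced with the two input rows shown on the slide. The sketch below is a plain Python illustration (the function name `unpivot` is hypothetical); it shows how rows of equal count can expand into different numbers of output records, which is one way sites end up with unequal data sizes.

```python
def unpivot(rows):
    """Expand each (key, values) row into one (key, value) record per value."""
    return [(key, value) for key, values in rows for value in values]

# The two input rows from the slide: Col 1 is the key, Col 2 a list of values.
rows = [(1, [2, 3]), (2, [3, 9, 8, 13])]
records = unpivot(rows)

# Records produced per input row: the second row yields twice as many.
per_row = [len(values) for _, values in rows]
print(records[:4], per_row)
```

If each input row were processed on a different machine, the machine holding the second row would emit four records against two for the first.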
14
Weighted Sampling Scheme
15
SAMPLE
[Figure: sites 1, 2, ..., k and the coordinator]
16
MERGE
[Figure: coordinator]
17
PARTITION
[Figure: coordinator; axis from 0 to 1 divided into ranges 1, 2, 3, 4, 5]
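The SAMPLE, MERGE and PARTITION steps are named on the slides, but their details did not survive in this transcript. The Python sketch below is therefore only one plausible reading, not the authors' scheme (see MSR-TR-2012-18 for that). It assumes that every site draws the same number of uniform samples and reports its record count, that the coordinator weights each sampled key by records-per-sample of its site, and that boundaries are placed at equal shares of the total weight. All function names are hypothetical.

```python
import random

def sample_site(records, sample_size, rng):
    """SAMPLE: a site returns a uniform sample plus its local record count."""
    size = min(sample_size, len(records))
    return rng.sample(records, size), len(records)

def merge(site_reports):
    """MERGE: the coordinator weights each sampled key by records-per-sample."""
    weighted = []
    for sample, count in site_reports:
        if sample:
            weight = count / len(sample)  # assumed weighting, not from the talk
            weighted.extend((key, weight) for key in sample)
    return sorted(weighted)

def partition(weighted, num_ranges):
    """PARTITION: cut the weighted sample into num_ranges equal-weight pieces."""
    total = sum(w for _, w in weighted)
    boundaries, running, next_cut = [], 0.0, 1
    for key, w in weighted:
        running += w
        # Emit a boundary each time cumulative weight passes next_cut/num_ranges.
        while next_cut < num_ranges and running >= next_cut * total / num_ranges:
            boundaries.append(key)
            next_cut += 1
    return boundaries

rng = random.Random(0)
sites = [list(range(0, 9000)), list(range(9000, 10000))]  # imbalanced sites
reports = [sample_site(records, 100, rng) for records in sites]
boundaries = partition(merge(reports), num_ranges=4)
print(boundaries)
```

Without the weights, the small site would contribute half of the sample while holding a tenth of the data, which would pull the boundaries toward its keys.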
18
Sufficient Sample Size
19
Constant Factor Imbalance
20
Proof Outline
21
Performance: DataSet-1
22
Performance (cont’d)
23
Summary for Range Partitioning
– Novel weighted sampling scheme
– Provable performance guarantees
– Simple and practical: code transfer to Cosmos
More info: Sampling Based Range Partition Methods for Big Data Analytics, V., Xu, Zhou, MSR-TR-2012-18, Mar 2012
24
Outline
– Range Partition, with Fei Xu and Jingren Zhou
– Count Tracking, with Zhenming Liu and Bozidar Radunovic
– Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic
25
SUM Tracking Problem
[Figure: sites 1, 2, 3, ..., k; SUM]
26
SUM Tracking
27
Applications
[Figure: input data]
28
State of the Art
29
The Challenge
Q: What are communication-efficient algorithms for the sum tracking problem with random input streams?
– Random permutation
– Random i.i.d.
– Fractional Brownian motion
30
Communication Complexity Bounds
31
Communication Complexity Bounds: Unknown Drift Case
32
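Our Tracker Algorithm
[Figure: sites 1, ..., k hold local sums S_1, ..., S_k and receive updates X_i; labels M_i = 1 and (S, S_1), ..., (S, S_k) between the sites and the coordinator; S = S_1 + … + S_k]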
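The figure gives the identity S = S_1 + … + S_k but not the rule that decides when a site talks to the coordinator. The Python sketch below only illustrates the general shape of such a tracker: sites keep exact local sums, and the coordinator's estimate is the sum of the values last reported. The class `SumTracker` and the `sync_prob` callback are hypothetical, and letting a site read the coordinator's current estimate for free is a simplifying assumption; the actual rule and its analysis are in the ACM PODS 2012 paper.

```python
import random

class SumTracker:
    """Coordinator estimate of S = S_1 + ... + S_k with occasional syncs."""

    def __init__(self, num_sites, sync_prob, seed=0):
        self.local = [0] * num_sites      # S_i, known exactly at site i
        self.reported = [0] * num_sites   # last S_i sent to the coordinator
        self.sync_prob = sync_prob        # hypothetical rule: estimate -> probability
        self.messages = 0
        self.rng = random.Random(seed)

    def estimate(self):
        """Coordinator's current view of the sum."""
        return sum(self.reported)

    def update(self, site, x):
        """Site receives update x (possibly negative) and may sync."""
        self.local[site] += x
        if self.rng.random() < self.sync_prob(self.estimate()):
            self.reported[site] = self.local[site]
            self.messages += 1

# Illustrative rule only: sync less often when the estimate is large.
tracker = SumTracker(3, lambda s: min(1.0, 100.0 / max(abs(s), 1) ** 2))
for step in range(1000):
    tracker.update(step % 3, 1 if step % 4 else -1)
print(tracker.estimate(), sum(tracker.local), tracker.messages)
```

With a rule that always returns 1.0 the estimate is exact and every update costs a message; the interesting regime is a rule that keeps the error small while sending far fewer messages.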
33
Two Applications
– Second Frequency Moment
– Bayesian Linear Regression
34
App 1: Second Frequency Moment
35
AMS Sketch
{0,1} valued hash
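For readers unfamiliar with the AMS sketch, a minimal Python version for the second frequency moment F2 (the sum of squared item frequencies) is given here. It is a textbook-style sketch under stated assumptions, not code from the talk: the {0,1} valued hash is mapped to a sign in {-1,+1}, `blake2b` stands in for the 4-wise independent hash family that the analysis of the sketch normally assumes, and the names `hash_bit`, `ams_f2_estimate` and `exact_f2` are hypothetical.

```python
import hashlib

def hash_bit(seed, item):
    """A {0,1} valued hash of item; blake2b is a stand-in assumption."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=1).digest()
    return digest[0] & 1

def ams_f2_estimate(updates, num_counters=64):
    """Estimate F2 = sum_j f_j^2 from a stream of (item, delta) updates."""
    counters = [0] * num_counters
    for item, delta in updates:
        for seed in range(num_counters):
            sign = 2 * hash_bit(seed, item) - 1  # map {0,1} to {-1,+1}
            counters[seed] += sign * delta       # each counter is a running sum
    # Average the squared counters to reduce the variance of the estimate.
    return sum(c * c for c in counters) / num_counters

def exact_f2(updates):
    """Exact F2, for comparison on small streams."""
    freq = {}
    for item, delta in updates:
        freq[item] = freq.get(item, 0) + delta
    return sum(f * f for f in freq.values())

stream = [("a", 1), ("b", 1), ("a", 1), ("c", 1), ("b", -1)]
print(ams_f2_estimate(stream), exact_f2(stream))
```

Each counter is a signed running sum that can go up or down, which is why maintaining the sketch over distributed streams reduces to a sum tracking problem.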
36
App 1: Second Frequency Moment (cont’d)
37
App 2: Bayesian Linear Regression
38
App 2: Bayesian Linear Regression (cont’d)
39
Summary for Sum Tracking
– Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d. and fractional Brownian motion inputs
– Proposed a novel algorithm with nearly optimal communication complexity
Details: ACM PODS 2012
40
Outline
– Range Partition, with Fei Xu and Jingren Zhou
– Count Tracking, with Zhenming Liu and Bozidar Radunovic
– Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic
41
Problem
Partition a graph with two objectives
– Sparsely connected components
– Balanced number of vertices per component
Applications
– Parallel processing
– Community detection
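The two objectives can be made concrete by measuring a given partition. The Python sketch below assumes an undirected graph given as an edge list and a partition given as a vertex-to-component mapping; it counts edges that cross components and reports the largest component size. The function names are hypothetical and the example graph is invented for illustration.

```python
def cut_edges(edges, part):
    """Number of edges whose endpoints lie in different components."""
    return sum(1 for u, v in edges if part[u] != part[v])

def max_load(part, k):
    """Largest number of vertices assigned to any of the k components."""
    loads = [0] * k
    for component in part.values():
        loads[component] += 1
    return max(loads)

# Two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(cut_edges(edges, part), max_load(part, k=2))
```

A good partition keeps both numbers small at once: few crossing edges, and a maximum load close to the number of vertices divided by k.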
42
Problem (cont’d)
[Figure: partitions 1, 2, 3, ..., k]
Requirements
– Streaming algorithm
– Single pass / incremental
– Efficient computing
Desired
– Approximation guarantees
– Average-case efficient
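To show what the single pass / incremental requirement means in practice, here is a generic greedy heuristic in Python. It is not the algorithm from the talk, whose details are not given here; it is only an assumed baseline in which each arriving vertex is placed once, in the non-full component that already holds most of its neighbors. The name `greedy_stream_partition` is hypothetical, and `k * capacity` must be at least the number of vertices.

```python
def greedy_stream_partition(stream, k, capacity):
    """Assign each arriving vertex once, using only already-placed neighbors."""
    part, loads = {}, [0] * k
    for vertex, neighbors in stream:
        scores = [0] * k
        for neighbor in neighbors:
            if neighbor in part:
                scores[part[neighbor]] += 1
        # Only components with spare capacity are eligible.
        open_parts = [p for p in range(k) if loads[p] < capacity]
        # Prefer many placed neighbors, then the lighter component.
        best = max(open_parts, key=lambda p: (scores[p], -loads[p]))
        part[vertex] = best
        loads[best] += 1
    return part

# Vertices arrive with their adjacency lists: two triangles joined by edge (2, 3).
stream = [(0, [1, 2]), (1, [0, 2]), (2, [0, 1, 3]),
          (3, [2, 4, 5]), (4, [3, 5]), (5, [3, 4])]
print(greedy_stream_partition(stream, k=2, capacity=3))
```

The capacity bound enforces balance, while the neighbor count favors sparse cuts; the result depends on arrival order, which is why average-case behavior is a natural thing to ask about.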
43
Summary for Graph Partitioning
– Designed a streaming algorithm whose average-case performance appears superior to that of previously proposed online heuristics
– Provable approximation guarantees
More details available soon