
1 Sampling Based Range Partition for Big Data Analytics + Some Extras
Milan Vojnović, Microsoft Research Cambridge, United Kingdom
Joint work with Charalampos Tsourakakis, Bozidar Radunovic, Zhenming Liu, Fei Xu, Jingren Zhou
INQUEST Workshop, September 2012

2 Big Data Analytics
Our goal: innovation in algorithms for large-scale computations, to move forward the frontier of the computer science of big data
Some figures of scale
– Petabytes / terabytes of online services data processed daily
– 200M tweets per day (Twitter)
– 1B content pieces shared per day (Facebook)
– 8,000 exabytes of global data by 2015 (The Economist)

3 Research Agenda
– Machine learning
– Optimization
– Database queries
– Distributed computing system

4 Outline
– Range Partition, with Fei Xu and Jingren Zhou
– Count Tracking, with Zhenming Liu and Bozidar Radunovic
– Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic

5 Range Partition
Special interest: balanced range partition
[Figure: integer keys held at sites 1, 2, ..., k are assigned to ranges 1, 2, ..., m, e.g. 1-100, 101-250, ..., 950-1024]

6 Range Partition Requirements

7 Two Approaches
Sampling based methods
– Take a sample of data items
– Compute partition boundaries using the sample
Quantile summary methods
– At each node compute a local quantile summary
– Merge at the coordinator node
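
The sampling based approach can be sketched in a few lines. This is a minimal illustration, not the method from the talk: the function names `boundaries_from_sample` and `range_of`, the uniform sample, and the synthetic keys in 1..1024 are all assumptions made for the example.

```python
import bisect
import random


def boundaries_from_sample(sample, m):
    """Return m-1 boundaries that split the sorted sample into m near-equal parts."""
    s = sorted(sample)
    return [s[(j * len(s)) // m] for j in range(1, m)]


def range_of(key, boundaries):
    """Index (0..m-1) of the range a key falls into; keys equal to a boundary go right."""
    return bisect.bisect_right(boundaries, key)


# Hypothetical data: 10,000 integer keys, from which 500 are sampled uniformly.
rng = random.Random(0)
data = [rng.randint(1, 1024) for _ in range(10_000)]
sample = rng.sample(data, 500)

bounds = boundaries_from_sample(sample, m=4)

# Count how many of the full data items land in each of the 4 ranges.
counts = [0] * 4
for key in data:
    counts[range_of(key, bounds)] += 1
```

The boundaries are computed from the sample only, while `counts` measures how balanced the resulting partition is on the full data.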

8 Related Work

9 Related Work (cont’d)

10 Problem
Range partition the data in one pass, with minimal communication between the coordinator and the sites

11 Sampling Based Method
[Figure: sites 1, 2, ..., k connected to a coordinator]
Pros
– Simplicity, scalability
Cons
– How many samples to take from each site?
– Data size imbalance: the number of input records may differ from one machine to another

12 Data Sizes Imbalance
Dataset     Records   Bytes   Sites
DataSet-1   62M       150G    262
DataSet-2   37M       25G     80
DataSet-3   13M       0.26G   1
DataSet-4   7M        1.2T    301
DataSet-5   106M      7T      5652

13 Origins of Data Sizes Imbalance
JOIN
– SELECT FROM A INNER JOIN B ON A.KEY==B.KEY ORDER BY COL
Lookup Table
– If the record value of column X is in the lookup table, then return the row
UNPIVOT
– Input (Col 1 | Col 2): 1 | 2, 3; 2 | 3, 9, 8, 13; …
– Output: (1,2), (1,3), (2,3), (2,9), …
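
The UNPIVOT case shows how an operator can change record counts unevenly. The sketch below is illustrative only: the function name `unpivot` and the list-of-tuples row format are assumptions, and the two input rows are the ones shown on the slide.

```python
def unpivot(rows):
    """Expand each (key, [values]) row into one (key, value) pair per value."""
    out = []
    for key, values in rows:
        for value in values:
            out.append((key, value))
    return out


# The two input rows from the slide: Col 1 = 1 with values 2, 3;
# Col 1 = 2 with values 3, 9, 8, 13.
rows = [(1, [2, 3]), (2, [3, 9, 8, 13])]
pairs = unpivot(rows)
# pairs == [(1, 2), (1, 3), (2, 3), (2, 9), (2, 8), (2, 13)]
```

Two input records become six output records, and the second input row alone yields four of them, so machines holding equally many input rows can end up with different numbers of output rows.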

14 Weighted Sampling Scheme

15 SAMPLE
[Figure: sites 1, 2, ..., k and the coordinator]

16 MERGE
[Figure: coordinator]

17 PARTITION
[Figure: coordinator; interval from 0 to 1 divided into ranges 1, 2, 3, 4, 5]
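
The slide bodies for SAMPLE, MERGE, and PARTITION did not survive in this transcript, so the following is only one plausible reading of the three step names, not the authors' exact scheme. It assumes that each site reports a uniform sample together with its record count, and that the coordinator weights each sampled key by site size divided by the number of samples from that site. All function names and parameters are hypothetical.

```python
import random


def site_sample(records, t, rng):
    """SAMPLE (assumed): a site draws up to t keys uniformly and reports its size."""
    t = min(t, len(records))
    return rng.sample(records, t), len(records)


def merge(reports):
    """MERGE (assumed): weight each key by site size / samples from that site."""
    weighted = []
    for keys, n in reports:
        if keys:
            w = n / len(keys)
            weighted.extend((key, w) for key in keys)
    weighted.sort(key=lambda kw: kw[0])
    return weighted


def partition(weighted, m):
    """PARTITION (assumed): choose m-1 boundaries at weighted quantiles j/m."""
    total = sum(w for _, w in weighted)
    bounds, acc, j = [], 0.0, 1
    for key, w in weighted:
        acc += w
        # A single heavy key may cross several quantile thresholds at once.
        while j < m and acc >= j * total / m:
            bounds.append(key)
            j += 1
    return bounds


# A small site (4 records, all sampled) and a large site (200 records, 2 sampled).
reports = [([1, 2, 3, 4], 4), ([10, 20], 200)]
median_boundary = partition(merge(reports), m=2)
```

In this toy input the weighted boundary is 10, because the two samples from the large site each stand for 100 records. An unweighted median of the six sampled keys would sit among the small site's keys instead.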

18 Sufficient Sample Size

19 Constant Factor Imbalance

20 Proof Outline

21 Performance
DataSet-1

22 Performance (cont’d)

23 Summary for Range Partitioning
– Novel weighted sampling scheme
– Provable performance guarantees
– Simple and practical: code transfer to Cosmos
More info: Sampling Based Range Partition Methods for Big Data Analytics, Vojnović, Xu, Zhou, MSR-TR-2012-18, Mar 2012

24 Outline
– Range Partition, with Fei Xu and Jingren Zhou
– Count Tracking, with Zhenming Liu and Bozidar Radunovic
– Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic

25 SUM Tracking Problem
[Figure: sites 1, 2, 3, ..., k; SUM]

26 SUM Tracking

27 Applications
[Figure: input data]

28 State of the Art

29 The Challenge
Q: Which algorithms solve the sum tracking problem with low communication cost when the input streams are random?
– Random permutation
– Random i.i.d.
– Fractional Brownian motion

30 Communication Complexity Bounds

31 Communication Complexity Bounds
Unknown Drift Case

32 Our Tracker Algorithm
[Figure: sites and coordinator, with labels X_i, M_i = 1, S_1, ..., S_k, and S = S_1 + … + S_k]

33 Two Applications
– Second Frequency Moment
– Bayesian Linear Regression

34 App 1: Second Frequency Moment

35 AMS Sketch
{0,1}-valued hash
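
As background for the AMS sketch, here is a minimal textbook-style version for estimating the second frequency moment F2 (the sum of squared item frequencies). It is not the construction from the talk. Two things are assumptions made for the example: mapping a {0,1}-valued hash bit b to a sign 2b - 1, and using SHA-256 as a stand-in for the 4-wise independent hash family that the analysis of the sketch normally requires. The names `sign` and `AMSSketch` are hypothetical.

```python
import hashlib


def sign(j, item):
    """Map a {0,1}-valued hash of (counter j, item) to a sign in {-1, +1}."""
    bit = hashlib.sha256(f"{j}:{item}".encode()).digest()[0] & 1
    return 2 * bit - 1


class AMSSketch:
    """Keeps num_counters signed sums; the mean of their squares estimates F2."""

    def __init__(self, num_counters):
        self.z = [0] * num_counters

    def update(self, item, delta=1):
        # delta may be negative, so deletions (non-monotonic streams) are supported.
        for j in range(len(self.z)):
            self.z[j] += sign(j, item) * delta

    def estimate_f2(self):
        return sum(z * z for z in self.z) / len(self.z)


# Frequencies a:3, b:2, c:1 give a true F2 of 9 + 4 + 1 = 14.
sketch = AMSSketch(num_counters=400)
for item, freq in [("a", 3), ("b", 2), ("c", 1)]:
    sketch.update(item, freq)
estimate = sketch.estimate_f2()
```

Each counter is a plain sum of signed updates, which is what makes the sketch a candidate for a distributed sum tracker: every counter can be maintained across sites as one tracked sum.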

36 App 1: Second Frequency Moment (cont’d)

37 App 2: Bayesian Linear Regression

38 App 2: Bayesian Linear Regression (cont’d)
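
The slide bodies for this application are not in the transcript. As background only, and as an assumption about why regression fits the sum tracking setting: with a Gaussian prior $\mathcal{N}(m_0, \Lambda_0^{-1})$ on the weights and a known noise precision $\beta$ (notation chosen here, not taken from the slides), the standard posterior after $n$ observations $(x_i, y_i)$ is Gaussian with precision $\Lambda_n$ and mean $m_n$ given by

```latex
\Lambda_n = \Lambda_0 + \beta \sum_{i=1}^{n} x_i x_i^{\top},
\qquad
m_n = \Lambda_n^{-1} \Bigl( \Lambda_0 m_0 + \beta \sum_{i=1}^{n} y_i x_i \Bigr)
```

The data enter only through the two running sums $\sum_i x_i x_i^{\top}$ and $\sum_i y_i x_i$, so a coordinator that tracks these sums across sites can maintain an approximate posterior.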

39 Summary for Sum Tracking
– Studied the sum tracking problem with non-monotonic distributed streams under random permutation, random i.i.d., and fractional Brownian motion inputs
– Proposed a novel algorithm with nearly optimal communication complexity
Details: ACM PODS 2012

40 Outline
– Range Partition, with Fei Xu and Jingren Zhou
– Count Tracking, with Zhenming Liu and Bozidar Radunovic
– Graph Partitioning (def. only), with Charalampos Tsourakakis and Bozidar Radunovic

41 Problem
Partition a graph with two objectives
– Sparsely connected components
– Balanced number of vertices per component
Applications
– Parallel processing
– Community detection
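
The two objectives can be made concrete by measuring a given partition. This is a minimal sketch: the function name `evaluate_partition` and the edge-list and dictionary input formats are assumptions chosen for illustration.

```python
def evaluate_partition(edges, assignment, k):
    """Return (number of cut edges, size of the largest component).

    edges      -- iterable of (u, v) pairs
    assignment -- dict mapping each vertex to a component index in 0..k-1
    """
    # Sparsity objective: count edges whose endpoints are in different components.
    cut = sum(1 for u, v in edges if assignment[u] != assignment[v])
    # Balance objective: count vertices per component.
    sizes = [0] * k
    for part in assignment.values():
        sizes[part] += 1
    return cut, max(sizes)


# A 4-cycle split into two halves: two edges are cut, each component holds 2 vertices.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
assignment = {0: 0, 1: 0, 2: 1, 3: 1}
cut, largest = evaluate_partition(edges, assignment, k=2)
```

A good partition keeps `cut` small (sparsely connected components) while keeping `largest` close to the number of vertices divided by k (balance).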

42 Problem (cont’d)
[Figure: components 1, 2, 3, ..., k]
Requirements
– Streaming algorithm
– Single pass / incremental
– Efficient computing
Desired
– Approximation guarantees
– Average-case efficient
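
To illustrate the single-pass, incremental requirement, here is a generic greedy heuristic: each arriving vertex goes to the component that already holds most of its neighbours and still has room. This is not the algorithm designed in the talk, whose details are not in the transcript. The function name, the stream format, and the tie-breaking rule are assumptions.

```python
def greedy_stream_partition(stream, k, capacity):
    """Assign each arriving vertex once, using only previously placed vertices.

    stream   -- iterable of (vertex, list of neighbours), in arrival order
    capacity -- maximum vertices per component (assumes k * capacity >= #vertices)
    """
    assignment = {}
    sizes = [0] * k
    for v, neighbours in stream:
        # Count already-placed neighbours in each component.
        scores = [0] * k
        for u in neighbours:
            if u in assignment:
                scores[assignment[u]] += 1
        open_parts = [p for p in range(k) if sizes[p] < capacity]
        # Prefer more neighbours, then the smaller component, then the lower index.
        best = max(open_parts, key=lambda p: (scores[p], -sizes[p], -p))
        assignment[v] = best
        sizes[best] += 1
    return assignment


# Two triangles {0,1,2} and {3,4,5} joined by the single edge (2, 3).
stream = [
    (0, [1, 2]), (1, [0, 2]), (2, [0, 1, 3]),
    (3, [2, 4, 5]), (4, [3, 5]), (5, [3, 4]),
]
result = greedy_stream_partition(stream, k=2, capacity=3)
```

On this input the heuristic places each triangle in its own component, so only the bridging edge is cut. Each vertex is assigned exactly once and never moved, which is the incremental property the slide asks for.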

43 Summary for Graph Partitioning
– Designed a streaming algorithm whose average-case performance appears superior to that of any previously proposed online heuristic
– Provable approximation guarantees
– More details available soon

