Random Sampling in Database Systems: Techniques and Applications Ke Yi Hong Kong University of Science and Technology Big Data.


1 Random Sampling in Database Systems: Techniques and Applications Ke Yi Hong Kong University of Science and Technology yike@ust.hk Big Data

2 “Big Data” in one slide
The 3 V’s:
Volume
Velocity
Variety
– Unstructured, semi-structured, graphs, images, videos, …
– Will assume well-structured data: integers and real numbers; points in a multi-dimensional space; records in a relational database
focus of this talk
Random Sampling on Big Data 2

3 Dealing with Big Data
The first approach: scale up / out the computation
Many great technical innovations:
– Distributed/parallel systems
– Simpler programming models: MapReduce, Pregel, Dremel, Spark… BSP
– Failure tolerance and recovery
– Drop certain features: ACID, CAP, noSQL
This talk is not about this approach!

4 Downsizing data
A second approach to computational scalability: scale down the data!
– A compact representation of a large data set
– Too much redundancy in big data anyway
– What we finally want is small: human readable analysis / decisions
– Necessarily gives up some accuracy: approximate answers
– Examples: samples, sketches, histograms, various transforms
See tutorial by Graham Cormode for other data summaries
Complementary to the first approach
– Can scale out computation and scale down data at the same time
– Algorithms need to work under new system architectures
Good old RAM model no longer applies

5 Outline for the talk
Simple random sampling
– Sampling from a data stream
– Sampling from distributed streams
– Sampling for range queries
Not-so-simple sampling
– Importance sampling: Frequency estimation on distributed data
– Paired sampling: Medians and quantiles
– Random walk sampling: SQL queries (joins)
Will jump back and forth between theory and practice

6 Simple Random Sampling
Sampling without replacement
– Randomly draw an element
– Don’t put it back
– Repeat s times
Sampling with replacement
– Randomly draw an element
– Put it back
– Repeat s times
Trivial in the RAM model
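The two procedures on this slide can be sketched in a few lines when the whole data set fits in memory (the RAM-model case). This is only an illustration using Python's standard `random` module; the function names and the toy population are assumptions, not part of the talk.

```python
import random

def sample_without_replacement(population, s, rng):
    """Draw s distinct elements: a drawn element is not put back."""
    return rng.sample(population, s)

def sample_with_replacement(population, s, rng):
    """Draw s elements independently: each drawn element is put back."""
    return [rng.choice(population) for _ in range(s)]

rng = random.Random(2024)  # fixed seed only so that a run is repeatable
population = list(range(1, 11))  # hypothetical toy data set
print(sample_without_replacement(population, 3, rng))
print(sample_with_replacement(population, 3, rng))
```

Both calls need random access to the full population, which is exactly what a high-speed stream with limited memory does not offer.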

7 Random Sampling from a Data Stream
A stream of elements coming in at high speed
Limited memory
Need to maintain the sample continuously
Applications
– Data stored on disk
– Network traffic

8 Reservoir Sampling [Waterman ??; Knuth’s book]
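The body of this slide was not captured in the transcript. As a reminder of the technique named in the title, here is a minimal sketch of the textbook form of reservoir sampling (often called Algorithm R): keep the first s elements, then let the i-th element replace a random reservoir slot with probability s/i. The function name and parameters are assumptions, and the talk may present a different variant.

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Maintain a uniform sample (without replacement) of size s over a stream."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            # The first s elements fill the reservoir.
            reservoir.append(item)
        else:
            # j is uniform over 0..i-1, so j < s happens with probability s/i.
            j = rng.randrange(i)
            if j < s:
                reservoir[j] = item  # evict a uniformly chosen old element
    return reservoir

print(reservoir_sample(iter(range(1_000_000)), 5, random.Random(7)))
```

The sample is valid after every arrival and uses memory proportional to s only, which matches the streaming requirements listed on the previous slide.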

10 Correctness Proof

12 Reservoir Sampling Correctness Proof [Figure: example with the elements a, b, c, d and s = 2]

13 Sampling from Distributed Streams

14 Reduction from Coin Flip Sampling

15 The Algorithm

16 Communication Cost of Algorithm [Cormode, Muthukrishnan, Yi, Zhang, PODS’10, JACM’12] [Woodruff, Tirthapura, DISC’11]

17 Random Sampling for Range Queries [Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD’15 Best Demo Award]

18 Online Range Sampling [Wang, Christensen, Li, Yi, VLDB’16]

19 Indexing Spatial Data Numerous spatial indexing structures in the literature [Figure: R-tree]

20 RS-tree

21 RS-tree: A 1D Example [Figure: RS-tree over the keys 1 to 16] Report: Active nodes 5

22 RS-tree: A 1D Example [Figure: RS-tree over the keys 1 to 16] Report: 5 Active nodes

23 RS-tree: A 1D Example [Figure: RS-tree over the keys 1 to 16] Report: 5 Active nodes 7 Pick 7 or 14 with equal prob.

24 RS-tree: A 1D Example [Figure: RS-tree over the keys 1 to 16] Report: 5 7 Active nodes Pick 3, 8, or 14 with prob. 1:1:2

25 RS-tree: A 1D Example [Figure: RS-tree over the keys 1 to 16] Report: 5 7 Active nodes

26 RS-tree: A 1D Example [Figure: RS-tree over the keys 1 to 16] Report: 5 7 Active nodes 12 Pick 3, 8, or 12 with equal prob.

27 Not-So-Simple Random Sampling When simple random sampling is not optimal/feasible

28 Frequency Estimation on Distributed Data

29 Frequency Estimation: Standard Solutions

30 Importance Sampling

31 [Huang, Yi, Liu, Chen, INFOCOM’11]

33 Median and Quantiles (order statistics)

34 Estimating Median by Random Sampling [Figure: the sets {1, 5, 6, 7, 8} and {2, 3, 4, 9, 10} combined with +, alongside {1, 3, 5, 7, 9}]
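The slide body is mostly a figure, so the following is only a sketch of the basic idea in the title: draw a simple random sample and report its median as an estimate of the median of the full data. The function name, the data, and the sample size are assumptions. A standard result is that a sample of size on the order of 1/ε² keeps the rank error within εn with constant probability.

```python
import random
import statistics

def estimate_median(data, sample_size, rng=None):
    """Estimate the median of data from a simple random sample without replacement."""
    rng = rng or random.Random()
    sample = rng.sample(data, sample_size)
    return statistics.median(sample)

data = list(range(1, 100_001))           # hypothetical data, true median 50000.5
print(estimate_median(data, 2001, random.Random(7)))
```

The paired sampling mentioned in the outline is a different, more refined technique and is not what this sketch implements.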

35 Application 1: Streaming Computation [Wang, Luo, Yi, Cormode, SIGMOD’13]

36 Application 2: Distributed Data

37 Generalization: ε-approximations

38 [Huang, Yi, FOCS’14]

39 Complex Analytical Queries (from TPC-H)
SELECT SUM(l_price * (1 - l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  AND n_name = 'CHINA'
  AND o_orderdate >= '2015-11-11'
  AND o_orderdate <= '2015-11-13'
  AND l_returnflag = 'R'
This query finds the total revenue lost due to returned orders, between 2015-11-11 and 2015-11-13 in China.

40 Online Aggregation [Hellerstein, Haas, Wang, SIGMOD’97]
Returns an estimate with a confidence interval
Confidence interval reduces over time
Query processing terminates when target accuracy is met
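The three properties on this slide can be illustrated with a small sketch for an AVG-style query over a single in-memory column. It processes rows in random order, maintains a running mean and variance, and stops once a normal-approximation confidence interval is narrow enough. The function name, the parameters, and the 1.96 multiplier (roughly 95% confidence) are assumptions made for illustration; this is not the estimator of the cited paper.

```python
import math
import random

def online_mean(values, target_half_width, z=1.96, min_samples=30, rng=None):
    """Return (estimate, half_width, rows_used) for the mean of values."""
    rng = rng or random.Random()
    order = list(values)
    rng.shuffle(order)                 # a random scan order makes every prefix a sample
    n, mean, m2 = 0, 0.0, 0.0          # running count, mean, sum of squared deviations
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)       # Welford's update
        if n >= min_samples:
            half = z * math.sqrt(m2 / (n - 1) / n)
            if half <= target_half_width:
                return mean, half, n   # target accuracy met: stop early
    if n < 2:
        return mean, float("inf"), n
    return mean, z * math.sqrt(m2 / (n - 1) / n), n

values = [i % 100 for i in range(100_000)]   # hypothetical column
print(online_mean(values, target_half_width=1.0, rng=random.Random(3)))
```

The half-width shrinks roughly as one over the square root of the number of rows seen, which is why the interval "reduces over time".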

41 Ripple Join: Simple Random Sampling [Haas, Hellerstein, SIGMOD’99]
Suppose there are 2 tables:
– Customers (CID, Nation)
– Orders (OrderID, SellerID1, BuyerID2, Revenue)
Say the query asks for the total revenue of all orders made between a buyer in China and a seller in the US
Simple random sampling:
– Take a 0.01% sample (1MB data) from Customers (10GB)
– Take a 0.01% sample (1MB data) from Orders (10GB)
– Only get 1MB * 0.01% * 0.01% = 0.01 byte of joined data! (even assuming we only sample buyers in China and sellers in the US)
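The arithmetic on this slide can be checked directly. The snippet below only reproduces the slide's own numbers with exact fractions; taking 1GB as 10^9 bytes is an assumption, and the variable names are illustrative.

```python
from fractions import Fraction

RATE = Fraction(1, 10_000)                 # 0.01% written exactly
TABLE_BYTES = 10 * 10**9                   # 10GB, assuming 1GB = 10^9 bytes

sample_bytes = TABLE_BYTES * RATE          # bytes sampled from each table
joined_bytes = sample_bytes * RATE * RATE  # the slide's 1MB * 0.01% * 0.01%

print(sample_bytes)         # 1000000
print(float(joined_bytes))  # 0.01
```

Because the sampling rates multiply, independently sampling each table leaves almost nothing to join.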

42-45 Sampling by Random Walks
Customers (shown on both sides of Orders):
Nation   CID
US       1
(blank)  2
China    3
UK       4
China    5
US       6
China    7
UK       8
Japan    9
UK       10
Orders:
BuyerID  SellerID  Revenue
4        3         $2100
3        1         $100
1        3         $300
5        3         $500
5        6         $230
5        6         $800
3        1         $300
5        3         $200
3        3         $100
7        4         $600
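The slides show the tables but not the estimator, so the following is a hedged sketch of how a random walk over the join can estimate a SUM. Each walk picks a random buyer, then a random order of that buyer, then looks up the seller. The revenue is divided by the probability of that walk, which is a standard inverse-probability (Horvitz-Thompson) weighting. The rows are read from the slides, with the assumption that fused entries such as "43$2100" mean buyer 4, seller 3. The function names are illustrative, and this is not the authors' implementation.

```python
import random
from collections import defaultdict

# Customers: CID -> Nation. The nation of CID 2 is missing in the transcript.
CUSTOMERS = {1: "US", 2: None, 3: "China", 4: "UK", 5: "China",
             6: "US", 7: "China", 8: "UK", 9: "Japan", 10: "UK"}
# Orders: (BuyerID, SellerID, Revenue); the digit splitting is an assumption.
ORDERS = [(4, 3, 2100), (3, 1, 100), (1, 3, 300), (5, 3, 500), (5, 6, 230),
          (5, 6, 800), (3, 1, 300), (5, 3, 200), (3, 3, 100), (7, 4, 600)]

CIDS = list(CUSTOMERS)
ORDERS_BY_BUYER = defaultdict(list)        # index used to take the second step
for order in ORDERS:
    ORDERS_BY_BUYER[order[0]].append(order)

def exact_sum():
    """Full join: total revenue with a buyer in China and a seller in the US."""
    return sum(rev for buyer, seller, rev in ORDERS
               if CUSTOMERS[buyer] == "China" and CUSTOMERS[seller] == "US")

def one_walk(rng):
    """One random walk buyer -> order -> seller; returns a weighted revenue."""
    buyer = rng.choice(CIDS)                       # probability 1 / len(CIDS)
    matches = ORDERS_BY_BUYER.get(buyer, [])
    if not matches:
        return 0.0                                 # the walk fails: contributes 0
    _, seller, revenue = rng.choice(matches)       # probability 1 / len(matches)
    if CUSTOMERS[buyer] != "China" or CUSTOMERS[seller] != "US":
        return 0.0                                 # selection predicates not met
    # The seller lookup is a key lookup, so it adds a factor of 1.
    walk_probability = (1 / len(CIDS)) * (1 / len(matches))
    return revenue / walk_probability

def wander_join_sum(num_walks, rng):
    """Average the weighted revenue over many independent walks."""
    return sum(one_walk(rng) for _ in range(num_walks)) / num_walks

print(exact_sum())
print(wander_join_sum(20_000, random.Random(11)))
```

Unlike independently sampling both tables, every walk that reaches an order yields a joined row, so few steps are wasted.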

46 Other Issues
Estimating confidence intervals
Choosing the optimal walk plan
Dealing with arbitrary joins
Selection predicates

47 Comparison with Existing Algorithm
Ripple join (1999, 2008)
– 1x-5x faster than full join
– Linear dependency on data size
– Standalone system prototype (supports online aggregation only)
Wander join (new)
– 10x-100x faster than full join
– Very small dependency on data size
– Seamless integration into RDBMS: PostgreSQL (finished), SparkSQL (planned)

48 SQL Integration
SELECT ONLINE SUM(l_price * (1 - l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  AND n_name = 'CHINA'
  AND o_orderdate >= '2015-11-11'
  AND o_orderdate <= '2015-11-13'
  AND l_returnflag = 'R'
WITHINTIME 20 CONFIDENCE .95 ERROR 0.01
User specifies any two of the three

49 Scalability
Hardware: Intel-i7, 32GB RAM
Software: PostgreSQL 9.4

50 Thank you!

