Published by Clifton Bradley. Modified over 8 years ago.
1
Random Sampling in Database Systems: Techniques and Applications
Ke Yi
Hong Kong University of Science and Technology
yike@ust.hk
2
“Big Data” in one slide
The 3 V’s:
Volume (focus of this talk)
Velocity (focus of this talk)
Variety
– Unstructured, semi-structured, graphs, images, videos, …
– Will assume well-structured data:
  Integers, real numbers
  Points in a multi-dimensional space
  Records in relational database
Random Sampling on Big Data
3
Dealing with Big Data
The first approach: scale up / out the computation
Many great technical innovations:
– Distributed/parallel systems
– Simpler programming models
  MapReduce, Pregel, Dremel, Spark, …
  BSP
– Failure tolerance and recovery
– Drop certain features: ACID, CAP, NoSQL
This talk is not about this approach!
4
Downsizing data
A second approach to computational scalability: scale down the data!
– A compact representation of a large data set
– Too much redundancy in big data anyway
– What we finally want is small: human readable analysis / decisions
– Necessarily gives up some accuracy: approximate answers
– Examples: samples, sketches, histograms, various transforms
  See tutorial by Graham Cormode for other data summaries
Complementary to the first approach
– Can scale out computation and scale down data at the same time
– Algorithms need to work under new system architectures
  Good old RAM model no longer applies
5
Outline for the talk
Simple random sampling
– Sampling from a data stream
– Sampling from distributed streams
– Sampling for range queries
Not-so-simple sampling
– Importance sampling: Frequency estimation on distributed data
– Paired sampling: Medians and quantiles
– Random walk sampling: SQL queries (joins)
Will jump back and forth between theory and practice
6
Simple Random Sampling
Sampling without replacement
– Randomly draw an element
– Don’t put it back
– Repeat s times
Sampling with replacement
– Randomly draw an element
– Put it back
– Repeat s times
Trivial in the RAM model
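The two procedures above can be sketched in a few lines, assuming the whole data set fits in memory (the RAM model) and s is at most the number of elements. The function names are illustrative, not from the talk; Python's standard library offers `random.sample` and `random.choices` for the same purpose.

```python
import random

def sample_without_replacement(items, s, rng=random):
    """Draw an element, don't put it back, repeat s times (requires s <= len(items))."""
    pool = list(items)
    sample = []
    for _ in range(s):
        i = rng.randrange(len(pool))
        # Swap the drawn element to the end and pop it, so each draw is O(1).
        pool[i], pool[-1] = pool[-1], pool[i]
        sample.append(pool.pop())
    return sample

def sample_with_replacement(items, s, rng=random):
    """Draw an element, put it back, repeat s times."""
    pool = list(items)
    return [pool[rng.randrange(len(pool))] for _ in range(s)]
```

Without replacement the sample never repeats a position; with replacement it may.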
7
Random Sampling from a Data Stream
A stream of elements coming in at high speed
Limited memory
Need to maintain the sample continuously
Applications
– Data stored on disk
– Network traffic
8
Reservoir Sampling
[Waterman ??; Knuth’s book]
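The body of this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the textbook reservoir algorithm (often called Algorithm R), which may differ in detail from what the slide showed. It keeps the first s elements, then lets the i-th element replace a uniformly chosen slot with probability s/i.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform sample of size s (without replacement) over a stream."""
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= s:
            # The first s elements fill the reservoir.
            reservoir.append(x)
        else:
            # j is uniform on [0, i); it lands in [0, s) with probability s/i.
            j = rng.randrange(i)
            if j < s:
                reservoir[j] = x
    return reservoir
```

Memory use is O(s) regardless of the stream length, and the reservoir is a valid sample after every element.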
10
Correctness Proof
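The proof on this slide is not in the transcript. A standard argument for the textbook reservoir algorithm, assuming the j-th element (j > s) enters with probability s/j and evicts a uniformly chosen slot, goes as follows. Here x_i is the i-th stream element and S_n is the sample after n elements; both symbols are introduced for this sketch.

```latex
% Element x_i with i > s: it must enter at time i and survive every later step.
% At step j a given resident is evicted with probability (s/j)(1/s) = 1/j.
\Pr[x_i \in S_n]
  = \frac{s}{i}\prod_{j=i+1}^{n}\left(1-\frac{1}{j}\right)
  = \frac{s}{i}\prod_{j=i+1}^{n}\frac{j-1}{j}
  = \frac{s}{i}\cdot\frac{i}{n}
  = \frac{s}{n}.

% Element x_i with i \le s: it is in the reservoir at time s with probability 1.
\Pr[x_i \in S_n]
  = \prod_{j=s+1}^{n}\frac{j-1}{j}
  = \frac{s}{n}.
```

This only establishes the marginal inclusion probabilities; that every size-s subset is equally likely follows by a similar induction on n.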
12
Reservoir Sampling Correctness Proof
[Figure: worked example with s = 2 over the elements a, b, c, d]
13
Sampling from Distributed Streams
14
Reduction from Coin Flip Sampling
15
The Algorithm
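The algorithm itself is missing from the transcript. The sketch below shows one round-based coin-flip scheme of the kind the surrounding slide titles suggest; it is an assumption, not a transcription of the slide. In round i every site forwards an arriving element with probability 2^-i. The coordinator flips one more coin per received element, and when the "kept" half reaches s elements it advances the round and discards the other half. All class and function names are hypothetical, and broadcast messages are not counted.

```python
import random

class Coordinator:
    """Holds a rate-2**-round coin-flip sample; it has >= s items once s elements have arrived."""

    def __init__(self, s, rng):
        self.s = s
        self.rng = rng
        self.round = 0      # sites forward each element with probability 2**-round
        self.kept = []      # elements that also won the coordinator's extra coin flip
        self.spare = []     # elements that lost it; dropped when the round advances
        self.messages = 0   # site-to-coordinator messages only

    def _place(self, x):
        if self.rng.random() < 0.5:
            self.kept.append(x)
        else:
            self.spare.append(x)

    def receive(self, x):
        self.messages += 1
        self._place(x)
        while len(self.kept) >= self.s:
            # Advance the round: halve the sampling rate and re-split the survivors.
            self.round += 1
            survivors, self.kept, self.spare = self.kept, [], []
            for y in survivors:
                self._place(y)

    def sample(self):
        """Uniform sample without replacement of size s from the current pool."""
        pool = self.kept + self.spare
        return self.rng.sample(pool, min(self.s, len(pool)))

def run(arrivals, s, seed=0):
    """arrivals: iterable of (site_id, element) pairs in arrival order."""
    rng = random.Random(seed)
    coord = Coordinator(s, rng)
    for _site, x in arrivals:
        # Each site flips its own coin; one shared rng stands in for all of them here.
        if rng.random() < 2.0 ** -coord.round:
            coord.receive(x)
    return coord
```

Because every element is treated identically and the round depends only on how many elements survived each level, the pool is a uniformly random subset given its size, so drawing s elements from it gives a simple random sample.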
16
Communication Cost of Algorithm
[Cormode, Muthukrishnan, Yi, Zhang, PODS’10, JACM’12]
[Woodruff, Tirthapura, DISC’11]
17
Random Sampling for Range Queries
[Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD’15 Best Demo Award]
18
Online Range Sampling
[Wang, Christensen, Li, Yi, VLDB’16]
19
Indexing Spatial Data
Numerous spatial indexing structures in the literature
[Figure: R-tree]
20
RS-tree
21
RS-tree: A 1D Example
[Figure, animated over slides 21 to 26: an RS-tree built over the keys 1 to 16; node labels shown, level by level: 1 4 5 7 9 12 14 16 / 3 8 12 14 / 7 / 5. Active nodes are highlighted in each frame.]
Frames:
– Report: 5
– Pick 7 or 14 with equal prob. Report: 5 7
– Pick 3, 8, or 14 with prob. 1:1:2
– Pick 3, 8, or 12 with equal prob. Report: 5 7 12
27
Not-So-Simple Random Sampling
When simple random sampling is not optimal/feasible
28
Frequency Estimation on Distributed Data
29
Frequency Estimation: Standard Solutions
30
Importance Sampling
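The slide body is missing, so the following is only a generic illustration of importance sampling for distributed frequency estimation, not necessarily the scheme from the cited paper. Each site reports an item with local count x with probability p = min(1, x/tau), and the coordinator adds x/p (a Horvitz-Thompson weight), which keeps the estimate unbiased. The threshold `tau` and all function names are assumptions for this sketch.

```python
import random
from collections import Counter

def site_report(local_counts, tau, rng):
    """Return [(item, weight)] for the items this site chooses to send."""
    report = []
    for item, x in local_counts.items():
        p = min(1.0, x / tau)          # larger local counts are more likely to be sent
        if rng.random() < p:
            report.append((item, x / p))   # reweight by 1/p to stay unbiased
    return report

def estimate_frequencies(sites, tau, rng):
    """Coordinator: sum the reweighted reports from all sites."""
    est = Counter()
    for local_counts in sites:
        for item, weight in site_report(local_counts, tau, rng):
            est[item] += weight
    return est
```

A larger `tau` means fewer messages and higher variance; with `tau = 1` every nonzero count is sent and the result is exact.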
31
[Huang, Yi, Liu, Chen, INFOCOM’11]
33
Median and Quantiles (order statistics)
34
Estimating Median by Random Sampling
[Figure: example with the values 1 5 6 7 8 and 2 3 4 9 10, combined (+), and the values 1 3 5 7 9]
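A minimal sketch of the idea named in the slide title, assuming plain simple random sampling without replacement: take the median of the sample as the estimate and measure how far its rank is from n/2. The helper names are illustrative; the slide's figure may depict a different (paired) scheme.

```python
import random

def estimate_median(data, s, rng=random):
    """Median of a size-s simple random sample (without replacement)."""
    sample = sorted(rng.sample(data, s))
    return sample[len(sample) // 2]

def rank_error(data, value):
    """How far value's rank is from n/2, as a fraction of n."""
    rank = sum(1 for x in data if x < value)
    return abs(rank - len(data) / 2) / len(data)
```

For a sample of size s the rank error is typically on the order of 1/sqrt(s), independent of the data size.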
35
Application 1: Streaming Computation
[Wang, Luo, Yi, Cormode, SIGMOD’13]
36
Application 2: Distributed Data
37
Generalization: ε-approximations
38
[Huang, Yi, FOCS’14]
39
Complex Analytical Queries (from TPC-H)
SELECT SUM(l_price * (1 - l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  AND n_name = 'CHINA'
  AND o_orderdate >= '2015-11-11'
  AND o_orderdate <= '2015-11-13'
  AND l_returnflag = 'R'
This query finds the total revenue lost due to returned orders in China between 2015-11-11 and 2015-11-13.
40
Online Aggregation
Returns an estimate with a confidence interval
Confidence interval reduces over time
Query processing terminates when target accuracy is met
[Hellerstein, Haas, Wang, SIGMOD’97]
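The three bullets above can be illustrated with a small sketch for a single-table SUM, assuming rows are read in random order and a normal-approximation (CLT) confidence interval is acceptable. It ignores the finite-population correction, and the function name, `batch` size and `z` value are choices made for this example rather than anything from the cited system.

```python
import math
import random

def online_sum(values, target_rel_error, z=1.96, batch=100, rng=random):
    """Yield (rows_read, estimate, half_width) until the target accuracy is met."""
    n = len(values)
    order = list(range(n))
    rng.shuffle(order)                      # read rows in random order
    k = 0
    total = 0.0
    total_sq = 0.0
    for idx in order:
        v = values[idx]
        k += 1
        total += v
        total_sq += v * v
        if k % batch == 0 or k == n:
            mean = total / k
            var = max(0.0, (total_sq - k * mean * mean) / (k - 1)) if k > 1 else 0.0
            estimate = n * mean             # scale the sample mean up to a SUM
            half_width = z * n * math.sqrt(var / k)
            yield k, estimate, half_width
            if estimate != 0 and half_width / abs(estimate) <= target_rel_error:
                return                      # target accuracy reached: stop early
```

A caller can display each yielded triple as it arrives, which is the "estimate that improves over time" behaviour described on the slide.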
41
Ripple Join: Simple Random Sampling
Suppose there are 2 tables:
– Customers (CID, Nation)
– Orders (OrderID, SellerID, BuyerID, Revenue)
Say, the query asks for the total revenue of all orders made between a buyer in China and a seller in the US
Simple random sampling:
– Take a 0.01% sample (1MB data) from Customers (10GB)
– Take a 0.01% sample (1MB data) from Orders (10GB)
– Only get 1MB * 0.01% * 0.01% = 0.01 byte of joined data!
  (even assuming we only sample buyers in China and sellers in US)
[Haas, Hellerstein, SIGMOD’99]
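The shrinkage in the last bullet can be stated in general form. Assuming the two tables are sampled independently, with each row kept at rates q_1 and q_2 (symbols introduced here, not on the slide), a joined pair survives only if both of its rows are sampled:

```latex
% R_1, R_2: the two tables; S_1, S_2: their independent samples at rates q_1, q_2.
\mathbb{E}\bigl[\,|S_1 \bowtie S_2|\,\bigr] = q_1\, q_2\, |R_1 \bowtie R_2|,
\qquad
q_1 = q_2 = 10^{-4} \;\Rightarrow\; q_1 q_2 = 10^{-8}.
```

So the join of two 0.01% samples retains only about one in 10^8 of the join results in expectation, which is the factor behind the slide's 0.01 byte.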
42
Sampling by Random Walks
Customers (shown twice on the slide):
Nation   CID
US       1
         2
China    3
UK       4
China    5
US       6
China    7
UK       8
Japan    9
UK       10
Orders:
BuyerID  SellerID  Revenue
4        3         $2100
3        1         $100
1        3         $300
5        3         $500
5        6         $230
5        6         $800
3        1         $300
5        3         $200
3        3         $100
7        4         $600
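To make the random-walk idea concrete, here is a hedged sketch over the slide's example data for the earlier query (total revenue of orders with a buyer in China and a seller in the US). The walk order (buyer, then order, then seller), the Horvitz-Thompson weighting and every function name are assumptions for illustration; the transcript does not show the slide's actual walk. The nation of CID 2 is blank in the transcript and is left as `None`.

```python
import random

# Example data as it appears in the transcript; CID 2 has no nation listed.
CUSTOMERS = {1: "US", 2: None, 3: "China", 4: "UK", 5: "China",
             6: "US", 7: "China", 8: "UK", 9: "Japan", 10: "UK"}
ORDERS = [(4, 3, 2100), (3, 1, 100), (1, 3, 300), (5, 3, 500), (5, 6, 230),
          (5, 6, 800), (3, 1, 300), (5, 3, 200), (3, 3, 100), (7, 4, 600)]

def build_buyer_index(orders):
    """Index orders by BuyerID so a walk can jump from a buyer to its orders."""
    index = {}
    for buyer, seller, revenue in orders:
        index.setdefault(buyer, []).append((seller, revenue))
    return index

def one_walk(buyers, index, rng):
    """One random walk; returns revenue / (probability of the path), or 0 on failure."""
    buyer = rng.choice(buyers)                 # probability 1 / len(buyers)
    candidates = index.get(buyer, [])
    if not candidates:
        return 0.0
    seller, revenue = rng.choice(candidates)   # probability 1 / len(candidates)
    if CUSTOMERS[seller] != "US":
        return 0.0                             # path fails the seller predicate
    return revenue * len(buyers) * len(candidates)

def estimate_revenue(num_walks, seed=0):
    rng = random.Random(seed)
    buyers = [cid for cid, nation in CUSTOMERS.items() if nation == "China"]
    index = build_buyer_index(ORDERS)
    return sum(one_walk(buyers, index, rng) for _ in range(num_walks)) / num_walks

def exact_revenue():
    return sum(r for b, s, r in ORDERS
               if CUSTOMERS[b] == "China" and CUSTOMERS[s] == "US")
```

Each walk touches only a handful of rows through the index, and dividing by the path probability keeps the average of the walks unbiased for the exact SUM.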
46
Other Issues
Estimating confidence intervals
Choosing the optimal walk plan
Dealing with arbitrary joins
Selection predicates
47
Comparison with Existing Algorithm
Ripple join (1999, 2008)
  1x-5x faster than full join
  Linear dependency on data size
  Standalone system prototype (supports online aggregation only)
Wander join (new)
  10x-100x faster than full join
  Very small dependency on data size
  Seamless integration into RDBMS
  – PostgreSQL (finished)
  – SparkSQL (planned)
48
SQL Integration
SELECT ONLINE SUM(l_price * (1 - l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  AND n_name = 'CHINA'
  AND o_orderdate >= '2015-11-11'
  AND o_orderdate <= '2015-11-13'
  AND l_returnflag = 'R'
WITHINTIME 20 CONFIDENCE .95 ERROR 0.01
User specifies any two of the three (WITHINTIME, CONFIDENCE, ERROR)
49
Scalability
Hardware: Intel-i7, 32GB RAM
Software: PostgreSQL 9.4
50
Thank you!