Random Sampling in Database Systems: Techniques and Applications
Ke Yi
Hong Kong University of Science and Technology

Big Data
“Big Data” in one slide
The 3 V’s:
– Volume
– Velocity
– Variety
  – Unstructured, semi-structured, graphs, images, videos, …
  – Will assume well-structured data:
    Integers, real numbers
    Points in a multi-dimensional space
    Records in relational database
focus of this talk
Random Sampling on Big Data 2
Dealing with Big Data
The first approach: scale up / out the computation
Many great technical innovations:
– Distributed/parallel systems
– Simpler programming models
  MapReduce, Pregel, Dremel, Spark… BSP
– Failure tolerance and recovery
– Drop certain features: ACID, CAP, noSQL
This talk is not about this approach!
Downsizing data
A second approach to computational scalability: scale down the data!
– A compact representation of a large data set
– Too much redundancy in big data anyway
– What we finally want is small: human readable analysis / decisions
– Necessarily gives up some accuracy: approximate answers
– Examples: samples, sketches, histograms, various transforms
  See tutorial by Graham Cormode for other data summaries
Complementary to the first approach
– Can scale out computation and scale down data at the same time
– Algorithms need to work under new system architectures
  Good old RAM model no longer applies
Outline for the talk
Simple random sampling
– Sampling from a data stream
– Sampling from distributed streams
– Sampling for range queries
Not-so-simple sampling
– Importance sampling: Frequency estimation on distributed data
– Paired sampling: Medians and quantiles
– Random walk sampling: SQL queries (joins)
Will jump back and forth between theory and practice
Simple Random Sampling
Sampling without replacement
– Randomly draw an element
– Don’t put it back
– Repeat s times
Sampling with replacement
– Randomly draw an element
– Put it back
– Repeat s times
Trivial in the RAM model
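When the whole data set fits in memory, both variants reduce to a single standard-library call. The sketch below assumes the data sits in a Python list; the names `data`, `s`, and `rng` are illustrative and do not come from the talk.

```python
import random

data = list(range(1, 101))  # hypothetical data set of 100 elements
s = 5                       # sample size

rng = random.Random(42)     # seeded only so the sketch is reproducible

# Sampling without replacement: s distinct positions, no element is drawn twice.
without_replacement = rng.sample(data, s)

# Sampling with replacement: s independent draws, so repeats are possible.
with_replacement = rng.choices(data, k=s)

print(without_replacement)
print(with_replacement)
```

The rest of the talk is about settings where this is no longer available: the data arrives as a stream, is spread over several sites, or must be sampled through an index.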
Random Sampling from a Data Stream
– A stream of elements coming in at high speed
– Limited memory
– Need to maintain the sample continuously
Applications
– Data stored on disk
– Network traffic
Reservoir Sampling
[Waterman ??; Knuth’s book]
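A minimal sketch of reservoir sampling in its textbook form (Algorithm R in Knuth's presentation), which maintains a without-replacement sample of size s over a stream of unknown length using O(s) memory. The function name and the `rng` parameter are illustrative choices; the slide's own formulation may differ in detail.

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Return a uniform random sample of s elements from an iterable."""
    rng = rng or random.Random()
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= s:
            # The first s elements fill the reservoir.
            reservoir.append(x)
        else:
            # Keep the i-th element with probability s/i; if kept, it
            # overwrites a uniformly chosen slot of the reservoir.
            j = rng.randrange(i)  # uniform integer in [0, i)
            if j < s:
                reservoir[j] = x
    return reservoir

sample = reservoir_sample(range(1_000), 5, random.Random(0))
print(sample)
```

The reservoir is a valid sample of the prefix seen so far at every point in time, which is what "maintain the sample continuously" asks for.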
Correctness Proof
Reservoir Sampling Correctness Proof
Example with s = 2 on the elements a, b, c, d:
a b c d
b a c d
a b c d
b c a d
b d a c
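One standard way to check the per-element inclusion probability is a direct calculation. It is offered here as a sketch and is not necessarily the argument used on the slide. It assumes the textbook reservoir rule: the i-th element is admitted with probability min(1, s/i), and an admitted element overwrites a uniformly random slot. Here n is the number of elements seen so far.

```latex
\[
\Pr[\text{element } i \text{ is in the sample after } n \text{ elements}]
  = \min\!\left(1, \frac{s}{i}\right)
    \prod_{j=\max(i,s)+1}^{n} \left(1 - \frac{s}{j}\cdot\frac{1}{s}\right)
  = \min\!\left(1, \frac{s}{i}\right)
    \prod_{j=\max(i,s)+1}^{n} \frac{j-1}{j}
  = \min\!\left(1, \frac{s}{i}\right)\cdot\frac{\max(i,s)}{n}
  = \frac{s}{n}.
\]
```

The product telescopes, so every element is retained with the same probability s/n. This establishes the marginal probabilities only; showing that all size-s subsets are equally likely needs a slightly stronger induction over subsets.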
Sampling from Distributed Streams
Reduction from Coin Flip Sampling
The Algorithm
Communication Cost of Algorithm
[Cormode, Muthukrishnan, Yi, Zhang, PODS’10, JACM’12] [Woodruff, Tirthapura, DISC’11]
Random Sampling for Range Queries
[Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD’15 Best Demo Award]
Online Range Sampling
[Wang, Christensen, Li, Yi, VLDB’16]
Indexing Spatial Data
Numerous spatial indexing structures in the literature
R-tree
RS-tree
RS-tree: A 1D Example (tree figure with its active nodes not reproduced)
– Report: 5
– Pick 7 or 14 with equal prob. Report: 5, 7
– Pick 3, 8, or 14 with prob. 1:1:2. Report: 5, 7
– Pick 3, 8, or 12 with equal prob. Report: 5, 7, 12
Not-So-Simple Random Sampling
When simple random sampling is not optimal/feasible
Frequency Estimation on Distributed Data
Frequency Estimation: Standard Solutions
Importance Sampling
[Huang, Yi, Liu, Chen, INFOCOM’11]
Median and Quantiles (order statistics)
Estimating Median by Random Sampling
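The basic idea can be sketched as follows: draw a simple random sample and report its median as the estimate for the whole data set. As a rule of thumb, a sample of size on the order of 1/ε² gives a rank error of about εn with constant probability. The function name, data, and sample size below are illustrative assumptions, not values from the talk.

```python
import random
import statistics

def estimate_median(data, s, rng=None):
    """Estimate the median of data from a simple random sample of size s."""
    rng = rng or random.Random()
    sample = rng.sample(data, s)           # sample without replacement
    return statistics.median_low(sample)   # always an actual data element

rng = random.Random(3)
data = list(range(10_001))   # hypothetical data; the true median is 5000
rng.shuffle(data)

estimate = estimate_median(data, 2_000, rng)
print(estimate)
```

Because the values here are 0..10000, the estimate's distance from 5000 is also its rank error, which makes the accuracy easy to inspect.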
Application 1: Streaming Computation
[Wang, Luo, Yi, Cormode, SIGMOD’13]
Application 2: Distributed Data
Generalization: ε-approximations
[Huang, Yi, FOCS’14]
Complex Analytical Queries (from TPC-H)
SELECT SUM(l_price * (1-l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  AND n_name = 'CHINA'
  AND o_orderdate >= ' '
  AND o_orderdate <= ' '
  AND l_returnflag = 'R'
This query finds the total revenue lost due to returned orders in China between the two order dates.
Online Aggregation
– Returns an estimate with a confidence interval
– Confidence interval reduces over time
– Query processing terminates when target accuracy is met
[Hellerstein, Haas, Wang, SIGMOD’97]
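The three properties above can be illustrated with a small sketch for a SUM over a single in-memory column. It assumes uniform draws with replacement and a normal-approximation (CLT) confidence interval; this is a simplification for illustration and not necessarily the estimator used in the cited work. All names and default values are hypothetical.

```python
import math
import random
from statistics import NormalDist

def online_sum(values, rel_error=0.01, confidence=0.95,
               min_samples=30, max_samples=1_000_000, rng=None):
    """Estimate sum(values) from random draws; stop once the CI is tight enough.

    Returns (estimate, half_width, samples_used).
    """
    rng = rng or random.Random()
    n_total = len(values)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    mean, m2, n = 0.0, 0.0, 0
    half_width = math.inf
    while n < max_samples:
        x = values[rng.randrange(n_total)]  # one uniform draw, with replacement
        n += 1
        # Welford's update of the running mean and sum of squared deviations.
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_samples:
            half_width = z * math.sqrt(m2 / (n - 1) / n)
            if half_width <= rel_error * abs(mean):
                break  # target accuracy met
    # Scale the per-row mean (and its interval) up to a total.
    return n_total * mean, n_total * half_width, n

rng = random.Random(7)
values = [rng.uniform(50, 150) for _ in range(100_000)]  # hypothetical column
estimate, half_width, used = online_sum(values, rng=rng)
print(estimate, half_width, used)
```

The half-width shrinks roughly as one over the square root of the number of draws, so the loop typically stops after reading a small fraction of the rows.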
Ripple Join: Simple Random Sampling
Suppose there are 2 tables:
– Customers (CID, Nation)
– Orders (OrderID, SellerID, BuyerID, Revenue)
Say, the query asks for the total revenue of all orders made between a buyer in China and a seller in the US
Simple random sampling:
– Take a 0.01% sample (1MB data) from Customers (10GB)
– Take a 0.01% sample (1MB data) from Orders (10GB)
– Only get 1MB * 0.01% * 0.01% = 0.01 byte of joined data! (even assuming we only sample buyers in China and sellers in US)
[Haas, Hellerstein, SIGMOD’99]
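The reason the rates multiply is that a join result survives only if both of its tuples were sampled, and the two samples are drawn independently. The snippet below simply reproduces the slide's back-of-the-envelope figure; the function name is illustrative.

```python
def expected_joined_bytes(sample_bytes, rate_left, rate_right):
    # The slide's product: sample size times both independent sampling rates.
    return sample_bytes * rate_left * rate_right

# 1 MB sample, 0.01% sampling rate on each of the two tables.
joined = expected_joined_bytes(1_000_000, 0.0001, 0.0001)
print(f"{joined:.2f} bytes")  # prints "0.01 bytes"
```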
Sampling by Random Walks

Customers (shown twice on the slide)
Nation  CID
US      1
        2
China   3
UK      4
China   5
US      6
China   7
UK      8
Japan   9
UK      10

Orders
BuyerID  SellerID  Revenue
4        3         $
                   $100
1        3         $300
5        3         $500
5        6         $230
5        6         $800
3        1         $300
5        3         $200
3        3         $100
7        4         $600
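A minimal sketch of the random-walk idea for the buyer-in-China, seller-in-US revenue query. Each walk picks a buyer uniformly, follows an index to one of that buyer's orders, and then looks up the seller. The revenue found at the end is divided by the probability of the path taken, and a failed walk counts as 0, so the average over walks is an unbiased estimate of the total. The toy tables below are hypothetical and are not the values on the slide. The walk order and helper names are assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy tables (NOT the values from the slide).
customers = {1: "US", 2: "China", 3: "China", 4: "UK", 5: "US", 6: "China"}
orders = [  # (buyer_id, seller_id, revenue)
    (2, 1, 100), (2, 4, 50), (3, 5, 300), (3, 1, 200),
    (6, 5, 400), (4, 1, 70), (1, 3, 20),
]

# Index on buyer_id so a walk can jump from a buyer to that buyer's orders.
orders_by_buyer = defaultdict(list)
for order in orders:
    orders_by_buyer[order[0]].append(order)

def one_walk(rng):
    """One walk buyer -> order -> seller; returns revenue / path probability, or 0."""
    cids = list(customers)
    buyer = rng.choice(cids)                     # step 1: uniform buyer, prob 1/N
    if customers[buyer] != "China":
        return 0.0                               # fails the buyer predicate
    candidates = orders_by_buyer.get(buyer, [])
    if not candidates:
        return 0.0                               # walk dies: buyer has no orders
    _, seller, revenue = rng.choice(candidates)  # step 2: uniform order, prob 1/d
    if customers[seller] != "US":
        return 0.0                               # fails the seller predicate
    # Path probability is (1/N) * (1/d), so scale the revenue by N * d.
    return revenue * len(cids) * len(candidates)

def estimate_total(num_walks, rng):
    """Average of many independent walks."""
    return sum(one_walk(rng) for _ in range(num_walks)) / num_walks

def exact_total():
    """Ground truth by scanning all orders, for comparison only."""
    return sum(r for b, s, r in orders
               if customers[b] == "China" and customers[s] == "US")

est = estimate_total(20_000, random.Random(1))
print(est, exact_total())
```

Unlike independent sampling of each table, every walk that survives the predicates yields a real join result, which is why far fewer samples are wasted.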
Other Issues
– Estimating confidence intervals
– Choosing the optimal walk plan
– Dealing with arbitrary joins
– Selection predicates
Comparison with Existing Algorithm
Ripple join (1999, 2008)
– 1x-5x faster than full join
– Linear dependency on data size
– Standalone system prototype (supports online aggregation only)
Wander join (new)
– 10x-100x faster than full join
– Very small dependency on data size
– Seamless integration into RDBMS
  – PostgreSQL (finished)
  – SparkSQL (planned)
SQL Integration
SELECT ONLINE SUM(l_price * (1-l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  AND n_name = 'CHINA'
  AND o_orderdate >= ' '
  AND o_orderdate <= ' '
  AND l_returnflag = 'R'
WITHINTIME 20 CONFIDENCE .95 ERROR 0.01
User specifies any two of the three
Scalability
Hardware: Intel-i7, 32GB RAM
Software: PostgreSQL 9.4
Thank you!