Random Sampling in Database Systems: Techniques and Applications
Ke Yi
Hong Kong University of Science and Technology
yike@ust.hk
Big Data

“Big Data” in one slide
The 3 V’s:
Volume
Velocity
Variety
– Unstructured, semi-structured, graphs, images, videos, …
– Will assume well-structured data:
    Integers, real numbers
    Points in a multi-dimensional space
    Records in relational database
focus of this talk
Random Sampling on Big Data 2

Dealing with Big Data
The first approach: scale up / out the computation
Many great technical innovations:
– Distributed/parallel systems
– Simpler programming models
    MapReduce, Pregel, Dremel, Spark…
    BSP
– Failure tolerance and recovery
– Drop certain features: ACID, CAP, noSQL
This talk is not about this approach!

Downsizing data
A second approach to computational scalability: scale down the data!
– A compact representation of a large data set
– Too much redundancy in big data anyway
– What we finally want is small: human readable analysis / decisions
– Necessarily gives up some accuracy: approximate answers
– Examples: samples, sketches, histograms, various transforms
    See tutorial by Graham Cormode for other data summaries
Complementary to the first approach
– Can scale out computation and scale down data at the same time
– Algorithms need to work under new system architectures
    Good old RAM model no longer applies

Outline for the talk
Simple random sampling
– Sampling from a data stream
– Sampling from distributed streams
– Sampling for range queries
Not-so-simple sampling
– Importance sampling: Frequency estimation on distributed data
– Paired sampling: Medians and quantiles
– Random walk sampling: SQL queries (joins)
Will jump back and forth between theory and practice

Simple Random Sampling
Sampling without replacement
– Randomly draw an element
– Don’t put it back
– Repeat s times
Sampling with replacement
– Randomly draw an element
– Put it back
– Repeat s times
Trivial in the RAM model
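The two procedures on this slide can be sketched in a few lines of Python when the whole data set fits in memory (the RAM-model case). The function names and the `rng` parameter are illustrative choices, not from the talk; in practice the standard library's `random.sample` and `random.choices` cover the same ground.

```python
import random


def sample_without_replacement(items, s, rng):
    """Draw s distinct elements: draw one, do not put it back, repeat."""
    pool = list(items)
    if s > len(pool):
        raise ValueError("cannot draw more elements than the pool holds")
    sample = []
    for _ in range(s):
        i = rng.randrange(len(pool))           # randomly draw an element
        pool[i], pool[-1] = pool[-1], pool[i]  # move it to the end ...
        sample.append(pool.pop())              # ... and remove it from the pool
    return sample


def sample_with_replacement(items, s, rng):
    """Draw s elements independently: draw one, put it back, repeat."""
    pool = list(items)
    return [pool[rng.randrange(len(pool))] for _ in range(s)]
```

Both functions copy the input into a list first, which is exactly the step that stops being free once the data arrives as a stream or lives on several machines.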

Random Sampling from a Data Stream
A stream of elements coming in at high speed
Limited memory
Need to maintain the sample continuously
Applications
– Data stored on disk
– Network traffic

Reservoir Sampling
[Waterman ??; Knuth’s book]
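The body of this slide did not survive in the transcript, so the following is a minimal sketch of the textbook reservoir algorithm rather than a reproduction of the slide. It assumes the classical formulation: keep the first s elements, then let the i-th element replace a uniformly chosen reservoir slot with probability s/i. The name `reservoir_sample` and the `rng` parameter are illustrative.

```python
import random


def reservoir_sample(stream, s, rng):
    """Maintain a uniform sample of size s (without replacement) over a stream."""
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(x)      # the first s elements fill the reservoir
        else:
            j = rng.randrange(i)     # uniform over 0 .. i-1
            if j < s:                # happens with probability s / i
                reservoir[j] = x     # evict the element in a uniformly random slot
    return reservoir
```

The sketch reads each element once and stores only s of them, which matches the limited-memory, continuous-maintenance setting described for data streams.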

Correctness Proof

Reservoir Sampling Correctness Proof
[Figure: orderings of the elements a, b, c, d (a b c d; b a c d; a b c d; b c a d; b d a c), with s = 2]
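The proof itself is not in the transcript, and the figure suggests the slide argues through orderings of the elements. As a complementary check, here is one standard calculation of the inclusion probability. It assumes the classical algorithm: the first s elements fill the reservoir, and the j-th element (j > s) enters with probability s/j, evicting a uniformly chosen resident. The symbols x_i (the i-th element) and S_n (the sample after n elements) are notation introduced here.

```latex
% Element x_i with i > s: it must enter at step i and survive every later step.
% At step j it is evicted with probability (s/j) * (1/s) = 1/j.
\Pr[x_i \in S_n]
  = \frac{s}{i} \prod_{j=i+1}^{n} \Bigl(1 - \frac{s}{j}\cdot\frac{1}{s}\Bigr)
  = \frac{s}{i} \prod_{j=i+1}^{n} \frac{j-1}{j}
  = \frac{s}{i} \cdot \frac{i}{n}
  = \frac{s}{n}.

% Element x_i with i \le s: it starts in the reservoir and must survive steps s+1, ..., n.
\Pr[x_i \in S_n]
  = \prod_{j=s+1}^{n} \frac{j-1}{j}
  = \frac{s}{n}.
```

This only shows that every element is kept with the same probability s/n; showing that every size-s subset is equally likely needs a further induction over subsets.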

Sampling from Distributed Streams

Reduction from Coin Flip Sampling

The Algorithm

Communication Cost of Algorithm
[Cormode, Muthukrishnan, Yi, Zhang, PODS’10, JACM’12]
[Woodruff, Tirthapura, DISC’11]

Random Sampling for Range Queries
[Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD’15 Best Demo Award]

Online Range Sampling
[Wang, Christensen, Li, Yi, VLDB’16]

Indexing Spatial Data
Numerous spatial indexing structures in the literature
R-tree

RS-tree

RS-tree: A 1D Example
[Figure, animated over six slides: an RS-tree over the values 1 to 16; node labels as extracted: 14579121416 / 38 1214 / 7 / 5. Each frame lists the samples reported so far and the current active nodes.]
– Active nodes: 5. Report: 5
– Pick 7 or 14 with equal prob. Report: 5, 7
– Pick 3, 8, or 14 with prob. 1:1:2
– Pick 3, 8, or 12 with equal prob. Report: 5, 7, 12

Not-So-Simple Random Sampling
When simple random sampling is not optimal/feasible

Frequency Estimation on Distributed Data

Frequency Estimation: Standard Solutions

Importance Sampling

[Huang, Yi, Liu, Chen, INFOCOM’11]

Median and Quantiles (order statistics)

Estimating Median by Random Sampling
[Figure: two sorted lists, 1 5 6 7 8 and 2 3 4 9 10, combined (+) into 1 3 5 7 9]
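One reading of the figure, consistent with the "paired sampling" item in the outline, is a merge-and-halve step: two sorted samples of equal size are merged, and every other element of the merged list is kept, starting from a randomly chosen position. The sketch below implements that reading; it is an assumption about what the slide shows, and the name `merge_and_halve` is illustrative.

```python
import heapq
import random


def merge_and_halve(a, b, rng):
    """Merge two sorted, equal-length samples and keep every other element."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes samples of equal size")
    merged = list(heapq.merge(a, b))  # sorted union of both samples
    offset = rng.randrange(2)         # 0: keep 1st, 3rd, ...; 1: keep 2nd, 4th, ...
    return merged[offset::2]
```

With the lists from the figure, `merge_and_halve([1, 5, 6, 7, 8], [2, 3, 4, 9, 10], rng)` returns either `[1, 3, 5, 7, 9]` or `[2, 4, 6, 8, 10]`. In both outcomes the rank of any value in the output, doubled, is within one of its rank in the merged list, which is why the median of the result stays close to the median of the union.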

Application 1: Streaming Computation
[Wang, Luo, Yi, Cormode, SIGMOD’13]

Application 2: Distributed Data

Generalization: ε-approximations

[Huang, Yi, FOCS’14]

Complex Analytical Queries (from TPC-H)
SELECT SUM(l_price * (1 - l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  AND n_name = 'CHINA'
  AND o_orderdate >= '2015-11-11'
  AND o_orderdate <= '2015-11-13'
  AND l_returnflag = 'R'
This query finds the total revenue lost due to returned orders, between 2015-11-11 and 2015-11-13 in China.

Online Aggregation
Returns an estimate with a confidence interval
Confidence interval reduces over time
Query processing terminates when target accuracy is met
[Hellerstein, Haas, Wang, SIGMOD’97]
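The three behaviours listed on this slide can be illustrated with a small sketch for a single-table SUM. It assumes uniform sampling with replacement and a normal-approximation (CLT) confidence interval; the function name, the `report_every` parameter, and the z-value of 1.96 for roughly 95% confidence are choices made here, not details from the talk or the cited paper.

```python
import math
import random


def online_sum_estimates(values, report_every, rng, z=1.96):
    """Yield (samples_used, estimate, half_width) for SUM(values), indefinitely."""
    n = len(values)
    count, mean, m2 = 0, 0.0, 0.0          # running mean and sum of squared deviations
    while True:
        x = values[rng.randrange(n)]       # one uniform sample, with replacement
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)           # Welford's update
        if count >= 2 and count % report_every == 0:
            variance = m2 / (count - 1)
            half_width = z * n * math.sqrt(variance / count)
            yield count, n * mean, half_width
```

A caller iterates over the generator and stops once the half-width is small enough, for example `if half_width <= 0.01 * estimate: break`. The half-width shrinks roughly as one over the square root of the number of samples, so halving the error costs about four times as many samples.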

Ripple Join: Simple Random Sampling
Suppose there are 2 tables:
– Customers (CID, Nation)
– Orders (OrderID, SellerID, BuyerID, Revenue)
Say, the query asks for the total revenue of all orders made between a buyer in China and a seller in the US
Simple random sampling:
– Take a 0.01% sample (1MB data) from Customers (10GB)
– Take a 0.01% sample (1MB data) from Orders (10GB)
– Only get 1MB * 0.01% * 0.01% = 0.01 byte of joined data! (even assuming we only sample buyers in China and sellers in US)
[Haas, Hellerstein, SIGMOD’99]
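The arithmetic on this slide can be checked with exact fractions. The sketch below assumes one reading of the formula: a sampled order survives the join only if both its buyer and its seller also landed in the Customers sample, which contributes the two extra 0.01% factors. It also assumes decimal units (10 GB = 10^10 bytes); the function name is illustrative.

```python
from fractions import Fraction


def expected_joined_bytes(orders_bytes, sampling_rate):
    """Expected size of the join of independent uniform samples (see assumptions above)."""
    sampled_orders = orders_bytes * sampling_rate          # 10 GB * 0.01% = 1 MB
    # Each sampled order needs its buyer AND its seller in the Customers sample.
    return sampled_orders * sampling_rate * sampling_rate


TEN_GB = 10**10                   # bytes, decimal units (assumption)
RATE = Fraction(1, 10_000)        # 0.01%
print(expected_joined_bytes(TEN_GB, RATE))
```

The result is the fraction 1/100, matching the 0.01 byte on the slide: independent samples of the base tables almost never contain rows that join with each other.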

Sampling by Random Walks

Customers
Nation   CID
US       1
         2
China    3
UK       4
China    5
US       6
China    7
UK       8
Japan    9
UK       10

Orders
BuyerID  SellerID  Revenue
4        3         $2100
3        1         $100
1        3         $300
5        3         $500
5        6         $230
5        6         $800
3        1         $300
5        3         $200
3        3         $100
7        4         $600
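The walk itself is not spelled out in the transcript, so here is a minimal sketch of one way to sample join results by random walks, using the tables from this slide and the China-buyer / US-seller revenue query from the ripple join slide. It assumes a walk that starts at a uniformly chosen buyer in China, follows an index to one of that buyer's orders, and weights the result by the inverse of the path probability (a Horvitz-Thompson style estimator). The table contents are a reconstruction of the flattened slide text, the nation of CID 2 is missing there and left as `None`, and all function names are illustrative.

```python
import random

# Reconstructed from the slide text (assumption); CID 2 has no nation in the transcript.
CUSTOMERS = {1: "US", 2: None, 3: "China", 4: "UK", 5: "China",
             6: "US", 7: "China", 8: "UK", 9: "Japan", 10: "UK"}
ORDERS = [(4, 3, 2100), (3, 1, 100), (1, 3, 300), (5, 3, 500), (5, 6, 230),
          (5, 6, 800), (3, 1, 300), (5, 3, 200), (3, 3, 100), (7, 4, 600)]
# Each order is (buyer_id, seller_id, revenue).


def build_buyer_index(orders):
    """Index orders by buyer so a walk can jump from a customer to its orders."""
    index = {}
    for order in orders:
        index.setdefault(order[0], []).append(order)
    return index


def one_walk(rng, buyers, index):
    """One random walk: buyer in China -> one of the buyer's orders -> seller check."""
    buyer = rng.choice(buyers)                       # probability 1 / len(buyers)
    candidates = index.get(buyer, [])
    if not candidates:
        return 0.0                                   # walk fails: no order to follow
    _, seller, revenue = rng.choice(candidates)      # probability 1 / len(candidates)
    if CUSTOMERS.get(seller) != "US":
        return 0.0                                   # walk fails the seller predicate
    return revenue * len(buyers) * len(candidates)   # revenue / path probability


def estimate_total(num_walks, seed=0):
    """Average many walks; each walk is an unbiased estimate of the query answer."""
    rng = random.Random(seed)
    buyers = sorted(c for c, nation in CUSTOMERS.items() if nation == "China")
    index = build_buyer_index(ORDERS)
    return sum(one_walk(rng, buyers, index) for _ in range(num_walks)) / num_walks


def exact_total():
    """Ground truth by a full scan, for comparison."""
    return sum(rev for buyer, seller, rev in ORDERS
               if CUSTOMERS.get(buyer) == "China" and CUSTOMERS.get(seller) == "US")
```

On these tables `exact_total()` is 1430. Every walk that succeeds yields an actual join result, which is the contrast with sampling the two tables independently.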

Other Issues
Estimating confidence intervals
Choosing the optimal walk plan
Dealing with arbitrary joins
Selection predicates

Comparison with Existing Algorithm
Ripple join (1999, 2008)
  1x-5x faster than full join
  Linear dependency on data size
  Standalone system prototype (supports online aggregation only)
Wander join (new)
  10x-100x faster than full join
  Very small dependency on data size
  Seamless integration into RDBMS
  – PostgreSQL (finished)
  – SparkSQL (planned)

SQL Integration
SELECT ONLINE SUM(l_price * (1 - l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  AND n_name = 'CHINA'
  AND o_orderdate >= '2015-11-11'
  AND o_orderdate <= '2015-11-13'
  AND l_returnflag = 'R'
WITHINTIME 20 CONFIDENCE .95 ERROR 0.01
User specifies any two of the three

Scalability
Hardware: Intel-i7, 32GB RAM
Software: PostgreSQL 9.4

Thank you!
