Random Sampling in Database Systems: Techniques and Applications Ke Yi Hong Kong University of Science and Technology Big Data.


“Big Data” in one slide
The 3 V’s:
Volume
Velocity
Variety
– Unstructured, semi-structured, graphs, images, videos, …
– Will assume well-structured data:
    Integers, real numbers
    Points in a multi-dimensional space
    Records in a relational database
Annotation on the slide: focus of this talk
Random Sampling on Big Data 2

Dealing with Big Data
The first approach: scale up / out the computation
Many great technical innovations:
– Distributed/parallel systems
– Simpler programming models
    MapReduce, Pregel, Dremel, Spark, …
    BSP
– Failure tolerance and recovery
– Drop certain features: ACID, CAP, NoSQL
This talk is not about this approach!

Downsizing data
A second approach to computational scalability: scale down the data!
– A compact representation of a large data set
– Too much redundancy in big data anyway
– What we finally want is small: human-readable analysis / decisions
– Necessarily gives up some accuracy: approximate answers
– Examples: samples, sketches, histograms, various transforms
    See tutorial by Graham Cormode for other data summaries
Complementary to the first approach
– Can scale out computation and scale down data at the same time
– Algorithms need to work under new system architectures
    Good old RAM model no longer applies

Outline for the talk
Simple random sampling
– Sampling from a data stream
– Sampling from distributed streams
– Sampling for range queries
Not-so-simple sampling
– Importance sampling: Frequency estimation on distributed data
– Paired sampling: Medians and quantiles
– Random walk sampling: SQL queries (joins)
Will jump back and forth between theory and practice

Simple Random Sampling
Sampling without replacement
– Randomly draw an element
– Don’t put it back
– Repeat s times
Sampling with replacement
– Randomly draw an element
– Put it back
– Repeat s times
Trivial in the RAM model
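The two procedures on this slide can be sketched directly when the whole data set fits in memory. This is a minimal illustration, not code from the talk; the function names and the `rng` parameter are assumptions made for the example.

```python
import random

def sample_without_replacement(items, s, rng=random):
    """Draw an element, don't put it back, repeat s times."""
    pool = list(items)
    picked = []
    for _ in range(s):
        i = rng.randrange(len(pool))   # randomly draw an element
        picked.append(pool.pop(i))     # don't put it back
    return picked

def sample_with_replacement(items, s, rng=random):
    """Draw an element, put it back, repeat s times."""
    pool = list(items)
    return [pool[rng.randrange(len(pool))] for _ in range(s)]
```

In everyday Python the standard library already covers both cases: `random.sample(population, k)` samples without replacement and `random.choices(population, k=k)` samples with replacement.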

Random Sampling from a Data Stream
A stream of elements coming in at high speed
Limited memory
Need to maintain the sample continuously
Applications
– Data stored on disk
– Network traffic

Reservoir Sampling [Waterman ??; Knuth’s book]
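The body of this slide did not survive in the transcript. As a reminder of the technique it names, here is a minimal sketch of the classic textbook form of reservoir sampling (often called Algorithm R), which keeps a uniform sample of size s without replacement over a stream of unknown length. It may differ in detail from what the slide showed; the function name and the `rng` parameter are assumptions.

```python
import random

def reservoir_sample(stream, s, rng=random):
    """Maintain a uniform random sample of size s over a stream."""
    reservoir = []
    for i, x in enumerate(stream, start=1):
        if i <= s:
            # The first s elements fill the reservoir.
            reservoir.append(x)
        else:
            # Keep the i-th element with probability s / i,
            # replacing a uniformly chosen slot.
            j = rng.randrange(i)   # uniform integer in [0, i)
            if j < s:
                reservoir[j] = x
    return reservoir
```

The sketch uses O(s) memory and one random number per arriving element, which matches the limited-memory, continuously maintained setting described for data streams.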

Correctness Proof

Reservoir Sampling Correctness Proof
[Figure: example with s = 2 over the elements a, b, c, d; letter sequences shown: a b c d, b a c d, a b c d, b c a d, b d a c]

Sampling from Distributed Streams

Reduction from Coin Flip Sampling

The Algorithm

Communication Cost of Algorithm [Cormode, Muthukrishnan, Yi, Zhang, PODS’10, JACM’12] [Woodruff, Tirthapura, DISC’11]

Random Sampling for Range Queries [Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD’15 Best Demo Award]

Online Range Sampling [Wang, Christensen, Li, Yi, VLDB’16]

Indexing Spatial Data
Numerous spatial indexing structures in the literature
R-tree

RS-tree

RS-tree: A 1D Example
[Figure sequence over six slides; the tree diagrams are lost and only the labels remain. Each slide marks the active nodes and lists the values reported so far.]
1. Report: 5
2. Report: 5
3. Report: 5, 7 (pick 7 or 14 with equal prob.)
4. Report: 5, 7 (pick 3, 8, or 14 with prob. 1:1:2)
5. Report: 5, 7
6. Report: 5, 7, 12 (pick 3, 8, or 12 with equal prob.)
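The "prob. 1:1:2" step in the example indicates that the next value is drawn from the candidates with unequal weights. What the weights stand for is not recoverable from this transcript; the sketch below assumes each weight is the number of data points a candidate represents, and shows only the weighted draw itself, not the RS-tree. The helper name `pick_weighted` is hypothetical.

```python
import random

def pick_weighted(candidates, weights, rng=random):
    """Pick one candidate with probability proportional to its weight."""
    # random.choices accepts relative weights; k=1 returns a one-element list.
    return rng.choices(candidates, weights=weights, k=1)[0]

# Usage mirroring the slide's labels: candidates 3, 8, 14 with weights 1:1:2.
choice = pick_weighted([3, 8, 14], [1, 1, 2])
```

With weights 1:1:2 the third candidate is chosen half of the time and each of the other two a quarter of the time.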

Not-So-Simple Random Sampling
When simple random sampling is not optimal/feasible

Frequency Estimation on Distributed Data

Frequency Estimation: Standard Solutions

Importance Sampling

[Huang, Yi, Liu, Chen, INFOCOM’11]

Median and Quantiles (order statistics)

Estimating Median by Random Sampling
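The slide body is missing from the transcript, so the following is only a sketch of the basic idea its title names: draw a simple random sample and report the sample median as an estimate of the true median. The function name, the sample size parameter `s`, and the `rng` parameter are assumptions, and nothing here reflects the specific bounds discussed in the talk.

```python
import random
import statistics

def estimate_median(data, s, rng=random):
    """Estimate the median of data from a random sample of size s."""
    sample = rng.sample(data, s)        # simple random sample, no replacement
    return statistics.median(sample)    # sample median as the estimate
```

`data` must be a sequence (for example a list) because `random.sample` needs to index into it. A larger `s` gives an estimate whose rank is closer to the middle of the data, at the cost of reading more elements.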

Application 1: Streaming Computation [Wang, Luo, Yi, Cormode, SIGMOD’13]

Application 2: Distributed Data

Generalization: ε-approximations

[Huang, Yi, FOCS’14]

Complex Analytical Queries (from TPC-H)
SELECT SUM(l_price * (1 - l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  AND n_name = 'CHINA'
  AND o_orderdate >= ' '
  AND o_orderdate <= ' '
  AND l_returnflag = 'R'
This query finds the total revenue lost due to returned orders in China between the two order-date bounds.

Online Aggregation [Hellerstein, Haas, Wang, SIGMOD’97]
Returns an estimate with a confidence interval
Confidence interval narrows over time
Query processing terminates when target accuracy is met
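The three properties on this slide can be illustrated with a running average over rows visited in random order. This is a minimal sketch under stated assumptions: it uses a normal-approximation confidence interval and a relative-error stopping rule, which need not match the estimators in the cited paper, and the names `online_avg`, `rel_error`, and `min_rows` are hypothetical.

```python
import math
import random
from statistics import NormalDist

def online_avg(values, confidence=0.95, rel_error=0.01, min_rows=30, rng=random):
    """Return (estimate, half_width, rows_read) for AVG over values."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided normal quantile
    order = list(values)
    rng.shuffle(order)                  # visit rows in random order
    n, mean, m2 = 0, 0.0, 0.0           # Welford running mean and variance
    half = math.inf
    for x in order:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n >= min_rows:
            # Half-width of the interval; it shrinks roughly as 1/sqrt(n).
            half = z * math.sqrt(m2 / (n - 1) / n)
            if half <= rel_error * abs(mean):
                break                   # target accuracy met, stop early
    return mean, half, n
```

The interval ignores the finite-population correction, so it is slightly conservative once a large share of the rows has been read.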

Ripple Join: Simple Random Sampling [Haas, Hellerstein, SIGMOD’99]
Suppose there are 2 tables:
– Customers (CID, Nation)
– Orders (OrderID, SellerID, BuyerID, Revenue)
Say the query asks for the total revenue of all orders made between a buyer in China and a seller in the US
Simple random sampling:
– Take a 0.01% sample (1MB data) from Customers (10GB)
– Take a 0.01% sample (1MB data) from Orders (10GB)
– Only get 1MB * 0.01% * 0.01% = 0.01 byte of joined data! (even assuming we only sample buyers in China and sellers in the US)
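The point of this slide is that joining two independent samples keeps only about the product of the two sampling rates of the full join result. The small simulation below illustrates that effect on hypothetical toy data (a single Customers-Orders join on the buyer key); the table sizes, rates, and function name are assumptions chosen for the example, not figures from the talk.

```python
import random

def join_sample_fraction(n_customers=2000, n_orders=20000, p=0.1, q=0.1, seed=0):
    """Fraction of the full join that survives joining two independent samples."""
    rng = random.Random(seed)
    # Hypothetical toy data: every order references one existing customer.
    order_buyer = [rng.randrange(n_customers) for _ in range(n_orders)]
    # Independent Bernoulli samples of each table at rates p and q.
    cust_sample = {c for c in range(n_customers) if rng.random() < p}
    order_sample = [b for b in order_buyer if rng.random() < q]
    # Join the two samples on the customer key.
    joined = sum(1 for b in order_sample if b in cust_sample)
    return joined / n_orders   # the full join has exactly n_orders rows
```

With p = q = 0.1 the returned fraction is close to 0.01, that is p * q. At the slide's rates of 0.01% per table the same product is 10^-8, which is why so little joined data remains.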

Sampling by Random Walks
Customers (shown twice on the slide, once on each side of Orders):
Nation     CID
US         1
(missing)  2
China      3
UK         4
China      5
US         6
China      7
UK         8
Japan      9
UK         10

Orders:
BuyerID    SellerID   Revenue
4          3          $ (amount missing)
(missing)  (missing)  $100
1          3          $300
5          3          $500
5          6          $230
5          6          $800
3          1          $300
5          3          $200
3          3          $100
7          4          $600

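The slides show the walk only as an animation over the tables, so the following is a hedged sketch of the general random-walk idea for the running example (total revenue of orders with a buyer in China and a seller in the US), not the authors' implementation. Each walk picks a uniform customer, follows an index to one of that customer's orders as buyer, then looks up the seller; a walk that reaches a qualifying order contributes its revenue divided by the probability of that path, and a failed walk contributes zero. The function name, the in-memory index, and the toy data in the usage lines are all assumptions.

```python
import random
from collections import defaultdict

def wander_join_sum(nation_of, orders, n_walks, rng=random):
    """Estimate SUM(revenue) over orders with a China buyer and a US seller."""
    cids = list(nation_of)
    by_buyer = defaultdict(list)            # index: buyer id -> its orders
    for buyer, seller, revenue in orders:
        by_buyer[buyer].append((seller, revenue))
    total = 0.0
    for _ in range(n_walks):
        c = rng.choice(cids)                # step 1: uniform customer, prob 1/n
        if nation_of[c] != "China":
            continue                        # failed walk contributes 0
        out = by_buyer.get(c)
        if not out:
            continue                        # buyer has no orders
        seller, revenue = rng.choice(out)   # step 2: uniform order, prob 1/d
        if nation_of[seller] != "US":
            continue
        # Path probability is (1/n) * (1/d); dividing by it removes the bias.
        total += revenue * len(cids) * len(out)
    return total / n_walks

# Hypothetical toy data (not the table from the slides).
toy_nations = {1: "US", 2: "UK", 3: "China", 4: "US", 5: "China"}
toy_orders = [(3, 1, 100), (3, 2, 50), (5, 4, 200), (5, 1, 300), (1, 3, 70)]
estimate = wander_join_sum(toy_nations, toy_orders, n_walks=100000)
```

For this toy data the exact answer is 600 (orders of 100, 200, and 300 qualify), and the estimate converges to it as the number of walks grows. Unlike sampling each table independently, every walk that succeeds yields a complete join result.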

Other Issues
Estimating confidence intervals
Choosing the optimal walk plan
Dealing with arbitrary joins
Selection predicates

Comparison with Existing Algorithm
Ripple join (1999, 2008)
– 1x-5x faster than full join
– Linear dependency on data size
– Standalone system prototype (supports online aggregation only)
Wander join (new)
– 10x-100x faster than full join
– Very small dependency on data size
– Seamless integration into RDBMS: PostgreSQL (finished), SparkSQL (planned)

SQL Integration
SELECT ONLINE SUM(l_price * (1 - l_discount))
FROM customer, orders, lineitem, nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND c_nationkey = n_nationkey
  AND n_name = 'CHINA'
  AND o_orderdate >= ' '
  AND o_orderdate <= ' '
  AND l_returnflag = 'R'
WITHINTIME 20 CONFIDENCE .95 ERROR 0.01
User specifies any two of the three (WITHINTIME, CONFIDENCE, ERROR)

Scalability
Hardware: Intel i7, 32GB RAM
Software: PostgreSQL 9.4

Thank you!