Wander Join: Online Aggregation via Random Walks

Presentation transcript:

Wander Join: Online Aggregation via Random Walks
Feifei Li (University of Utah)
Bin Wu, Ke Yi (Hong Kong University of Science and Technology)
Zhuoyue Zhao (Shanghai Jiao Tong University)

Database Workloads
- Transactional (OLTP)
  - Example: deduct x dollars from account A, credit x dollars to account B
  - Challenge: efficiency and correctness (ACID)
- Analytical (OLAP)
  - Large fraction of data
  - Many tables
  - Complex conditions
  - Challenge: efficiency. Correctness?

Complex Analytical Queries (TPC-H)

    SELECT SUM(l_extendedprice * (1 - l_discount))
    FROM customer, lineitem, orders, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'

This query finds the total revenue loss due to returned orders in a given region.

Online Aggregation [Haas, Hellerstein, Wang, SIGMOD’97]

    SELECT ONLINE SUM(l_extendedprice * (1 - l_discount))
    FROM customer, lineitem, orders, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
    WITHTIME 60000 CONFIDENCE 95 REPORTINTERVAL 1000

Confidence interval: $\Pr[\tilde{Y} - \varepsilon < Y < \tilde{Y} + \varepsilon] > 0.95$
[Figure: confidence band from $\tilde{Y} - \varepsilon$ to $\tilde{Y} + \varepsilon$ around the estimate $\tilde{Y}$]
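To make the confidence interval concrete, here is a minimal Python sketch of how an estimate and its half-width ε could be recomputed at each report interval. It assumes i.i.d. samples and a normal (CLT) approximation; the function names `confidence_interval` and `simulate_reports`, and the uniform synthetic data, are hypothetical and not taken from the slides.

```python
import math
import random
from statistics import NormalDist


def confidence_interval(samples, confidence=0.95):
    """Return (estimate, half_width) for the mean, using a CLT approximation."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean = sum(samples) / n
    # Unbiased sample variance.
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    # Two-sided normal quantile, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean, z * math.sqrt(variance / n)


def simulate_reports(num_reports=5, batch=1000, seed=0):
    """Draw one batch of synthetic samples per report and recompute the interval."""
    rng = random.Random(seed)
    samples, reports = [], []
    for _ in range(num_reports):
        samples.extend(rng.uniform(0.0, 100.0) for _ in range(batch))
        reports.append(confidence_interval(samples))
    return reports


reports = simulate_reports()
for i, (estimate, eps) in enumerate(reports, start=1):
    print(f"report {i}: {estimate:.2f} +/- {eps:.2f}")
```

The half-width shrinks roughly as 1/sqrt(n), which is why the reported interval tightens the longer the query runs.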

Ripple Join [Haas, Hellerstein, SIGMOD’99]
- Store the tuples of each table in random order
- In each step:
  - Read the next tuple from a table, in round-robin fashion
  - Join it with the tuples sampled so far from the other tables
- Works well for a full Cartesian product
- But most joins are sparse ...

A Running Example
[Figure: three tables, Customers (Nation, CID), Orders (BuyerID, OrderID), and Items (OrderID, ItemID, Price); customers 1 to 10 come from the US, China, UK, and Japan, and item prices range from $100 to $2100]
What’s the total revenue of all orders from customers in China?
- $N$: size of each table, e.g., $10^9$
- $n$: number of tuples taken from each table
- $s$: number of estimators, e.g., $10^3$
- $n^3 \cdot \frac{1}{N^2} = s$, so $n = N^{2/3} s^{1/3} = 10^7$
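The sample-size formula on this slide can be unpacked step by step. The reading below is an assumption rather than something the slide states: it takes the formula to describe ripple join on three tables, where the $n$ tuples drawn from each table form $n^3$ candidate triples and a random triple is a join result with probability about $1/N^2$.

```latex
% Assumption: three tables of size N; a random triple joins with probability ~ 1/N^2.
\begin{align*}
  n^3 \cdot \frac{1}{N^2} &= s
      && \text{expected number of join results among } n^3 \text{ triples} \\
  n^3 &= s\,N^2 \\
  n   &= N^{2/3}\, s^{1/3} \\
      &= (10^9)^{2/3} \cdot (10^3)^{1/3} = 10^6 \cdot 10 = 10^7
\end{align*}
```

Under these assumptions, obtaining a thousand estimators requires reading ten million tuples from each table.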

Join as a Graph
[Figure: join data graph over $R_1$, $R_2$, $R_3$]
- Conceptual only
- Never materialized

Join as a Graph
[Figure: the Customers, Orders, and Items tables of the running example, drawn as a graph]

    SELECT SUM(Price)
    FROM Customers C, Orders O, Items I
    WHERE C.Nation = 'China'
      AND C.CID = O.BuyerID
      AND O.OrderID = I.OrderID

Sampling by Random Walks
[Figure: a random walk through the Customers, Orders, and Items tables of the running example]

    SELECT SUM(Price)
    FROM Customers C, Orders O, Items I
    WHERE C.Nation = 'China'
      AND C.CID = O.BuyerID
      AND O.OrderID = I.OrderID

- $N$: size of each table, e.g., $10^9$
- $n$: number of tuples taken from each table = number of random walks
- $s$: number of estimators, e.g., $10^3$
- $n = s = 10^3$
Unbiased estimator: $\frac{\$500}{\text{sampling prob.}} = \frac{\$500}{1/3 \cdot 1/4 \cdot 1/3}$
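The estimator on this slide divides the sampled value by the probability of the walk that reached it; for the slide's walk that is $500 / (1/36) = $18,000. The following Python sketch illustrates the idea on a small, hypothetical dataset (the rows below are invented for illustration and are not the slide's tables). In-memory dictionaries stand in for the indexes, and a walk that dead-ends contributes 0. It is a sketch under those assumptions, not the authors' implementation.

```python
import random
from collections import defaultdict

# Hypothetical toy data, not the tables from the slides.
customers = [("US", 1), ("China", 2), ("China", 3), ("UK", 4)]   # (nation, cid)
orders = [(2, 10), (2, 11), (3, 12), (1, 13), (4, 14)]           # (buyer_id, order_id)
items = [(10, 100.0), (10, 300.0), (11, 500.0),
         (12, 200.0), (13, 800.0), (14, 50.0)]                   # (order_id, price)

# Dictionaries play the role of the indexes used to find join neighbours.
orders_by_buyer = defaultdict(list)
for buyer_id, order_id in orders:
    orders_by_buyer[buyer_id].append(order_id)
items_by_order = defaultdict(list)
for order_id, price in items:
    items_by_order[order_id].append(price)

china = [cid for nation, cid in customers if nation == "China"]


def one_walk(rng):
    """One random walk; returns price / sampling probability, or 0.0 on a dead end."""
    cid = rng.choice(china)
    prob = 1.0 / len(china)
    neighbours = orders_by_buyer.get(cid, [])
    if not neighbours:
        return 0.0
    order_id = rng.choice(neighbours)
    prob /= len(neighbours)
    prices = items_by_order.get(order_id, [])
    if not prices:
        return 0.0
    price = rng.choice(prices)
    prob /= len(prices)
    return price / prob


def wander_join_sum(num_walks, seed=0):
    """Average the per-walk estimators to estimate SUM(Price)."""
    rng = random.Random(seed)
    return sum(one_walk(rng) for _ in range(num_walks)) / num_walks


# Exact answer by full enumeration, for comparison only.
exact = sum(price
            for cid in china
            for order_id in orders_by_buyer.get(cid, [])
            for price in items_by_order.get(order_id, []))

print("exact:", exact, "estimate:", round(wander_join_sum(20000), 1))
```

Each walk touches one tuple per table, so the number of tuples read equals the number of estimators produced, which matches the slide's $n = s$.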

Walk Plan Optimization
[Figure: $R_1$, $R_2$, $R_3$]
- Structure of the data graph
- Selection predicates
  - Starting table: use index
  - Table in the middle: reject random walk
- Data distribution
  - Non-uniformity may not be a bad thing!
[Figure: two example data graphs over $R_1$ and $R_2$; in the first, $\mathrm{Var}(R_1 \to R_2) < \mathrm{Var}(R_2 \to R_1)$; in the second, $\mathrm{Var}(R_1 \to R_2) > \mathrm{Var}(R_2 \to R_1)$]

Walk Plan Optimizer
- Enumerate all plans
- Conduct ~100 trial random walks using each plan
- Measure the variance of each plan
- Select the best plan
- All trial runs are still useful
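The optimizer steps listed above can be sketched in Python as follows. This is an illustration on a hypothetical two-table join with skewed fanout, not the authors' code: the data, the plan names, and `choose_plan` are all assumptions, and "best" is taken to mean lowest sample variance, as the slide states.

```python
import random
from statistics import variance

# Hypothetical two-table join on `key`; the query is SUM(value) over the join.
r1 = [1, 2, 3]                                           # R1: one tuple per key
r2 = [(1, 10.0)] * 5 + [(2, 10.0), (3, 10.0)]            # R2: (key, value), skewed

r1_by_key = {}
for key in r1:
    r1_by_key.setdefault(key, []).append(key)
r2_by_key = {}
for key, value in r2:
    r2_by_key.setdefault(key, []).append(value)


def walk_r1_to_r2(rng):
    """Start from a uniform R1 tuple, then step to a random matching R2 tuple."""
    key = rng.choice(r1)
    matches = r2_by_key.get(key, [])
    if not matches:
        return 0.0
    value = rng.choice(matches)
    return value * len(r1) * len(matches)     # value / sampling probability


def walk_r2_to_r1(rng):
    """Start from a uniform R2 tuple, then step to a random matching R1 tuple."""
    key, value = rng.choice(r2)
    matches = r1_by_key.get(key, [])
    if not matches:
        return 0.0
    rng.choice(matches)
    return value * len(r2) * len(matches)     # value / sampling probability


def choose_plan(plans, trials=100, seed=0):
    """Run trial walks for every plan and return the lowest-variance plan."""
    rng = random.Random(seed)
    results = {name: [walk(rng) for _ in range(trials)]
               for name, walk in plans.items()}
    best = min(results, key=lambda name: variance(results[name]))
    return best, results


plans = {"R1->R2": walk_r1_to_r2, "R2->R1": walk_r2_to_r1}
best, trial_results = choose_plan(plans)
# The trial walks of the chosen plan are kept as the first samples of the estimate.
initial_samples = trial_results[best]
print(best, sum(initial_samples) / len(initial_samples))
```

In this toy data every walk that starts in R2 returns the same estimate, so that direction has zero variance, while starting in R1 alternates between large and small estimates.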

Convergence Comparison

Wander Join in PostgreSQL
- Logarithmic growth due to B-tree lookup to find random neighbours

Running on Insufficient Memory (4GB)
- Insufficient memory incurs a heavy, one-time penalty
- Growth is still logarithmic
- Fundamentally: random sampling is at odds with hard disks
- But does it matter?
  - Spark, in-memory DB, RAM cloud ...
  - The algorithm is embarrassingly parallel
- Turbo DBO [Dobra, Jermaine, Rusu, Xu, VLDB’09]

Accuracy Achieved in 1/10 Time of Full Join

Wander Join vs Ripple Join

| | Wander Join | Ripple Join |
| Sampling methodology | Independent but non-uniform | Uniform but non-independent |
| Index needed? | Yes | Index or random storage |
| Confidence interval computation | Easy, $O(n)$ time | Complicated, $O(n^k)$ time ($k$: number of tables) |
| Convergence time (20GB data, 3 tables) | ~3s | ~50s |
| Scalability | Logarithmic | Slightly less than linear |
| System implementation | PostgreSQL (finished), Oracle (in progress), SparkSQL (in progress) | Informix (internal project), DBO |

Online Aggregation vs Data Cube

| | Online Aggregation | Data Cube |
| Queries | Online, ad hoc | Offline, fixed |
| Latency | Seconds | Hours, then milliseconds |
| Query mode | One at a time | Batch |
| Accuracy | Small error | No error |
| Data schema | Any (relational, graph) | Multidimensional cube |
| Work with OLTP | Integrated | Separate |
| Target scenario | Online, ad hoc, interactive data analytics | Monthly report |

Thank you!

Index Ripple Join [Lipton, Naughton, Schneider, SIGMOD’90]
[Figure: the Customers (Nation, CID), Orders (BuyerID, OrderID), and Items (OrderID, ItemID, Price) tables of the running example]

Sampling from a B-tree [Olken, ’93]
[Figure: B-tree with node labels 4, 3, 2]
- Sampling from an aggregate (ranked) B-tree is easy
- But it incurs a heavy cost for transactions
- Need to modify existing B-tree implementations
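To show why sampling from a ranked B-tree is easy, here is a small Python sketch. It assumes each node stores the number of keys beneath it, so a uniformly random rank can be resolved in one root-to-leaf descent. The `Node` class, `select`, and `sample_uniform` are hypothetical names for illustration, and the leaf sizes 4, 3, 2 are an assumed reading of the slide's figure.

```python
import random
from collections import Counter


class Node:
    """Tree node that stores the number of keys in its subtree (a 'ranked' node)."""

    def __init__(self, children=None, keys=None):
        self.children = children or []
        self.keys = keys or []          # only leaves hold keys in this sketch
        if self.children:
            self.count = sum(child.count for child in self.children)
        else:
            self.count = len(self.keys)


def select(root, rank):
    """Return the key with the given 0-based rank by following subtree counts."""
    if not 0 <= rank < root.count:
        raise IndexError("rank out of range")
    node = root
    while node.children:
        for child in node.children:
            if rank < child.count:
                node = child
                break
            rank -= child.count         # skip this subtree entirely
    return node.keys[rank]


def sample_uniform(root, rng):
    """Draw a uniform random rank and look it up; every key is equally likely."""
    return select(root, rng.randrange(root.count))


# Leaves of sizes 4, 3 and 2 under a single root.
root = Node(children=[Node(keys=[1, 2, 3, 4]), Node(keys=[5, 6, 7]), Node(keys=[8, 9])])
rng = random.Random(1)
print(Counter(sample_uniform(root, rng) for _ in range(9000)))
```

The cost mentioned on the slide comes from keeping `count` up to date: every insert or delete has to adjust the counts on the whole path to the root.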

Rejection Sampling [Olken, ’93]
- Imagine each node has maximum fanout
- Reject as soon as the walk goes out of bounds

Non-Uniform Sampling
[Figure: nine walk paths with sampling probabilities $\frac{1}{3 \cdot 4}$ (four paths), $\frac{1}{3 \cdot 2}$ (two paths), and $\frac{1}{3 \cdot 3}$ (three paths)]
As long as we can compute the sampling probability, wander join still works!
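A short worked check shows why known but unequal probabilities are enough. It assumes the slide's figure shows three starting tuples with fanouts 4, 2, and 3; $\gamma$ denotes a walk path, $p(\gamma)$ its sampling probability, and $v(\gamma)$ the value it carries. These symbols are introduced here for illustration.

```latex
% Assumption: three start tuples (probability 1/3 each) with fanouts 4, 2 and 3.
\begin{align*}
  \sum_{\gamma} p(\gamma)
    &= 4 \cdot \frac{1}{3 \cdot 4} + 2 \cdot \frac{1}{3 \cdot 2} + 3 \cdot \frac{1}{3 \cdot 3}
     = \frac{1}{3} + \frac{1}{3} + \frac{1}{3} = 1 \\
  \mathbb{E}\!\left[\frac{v(\gamma)}{p(\gamma)}\right]
    &= \sum_{\gamma} p(\gamma) \cdot \frac{v(\gamma)}{p(\gamma)}
     = \sum_{\gamma} v(\gamma)
\end{align*}
```

The probabilities form a valid distribution over the nine paths, and dividing each sampled value by its own probability makes the expectation equal to the exact sum, whatever the individual probabilities are.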

Compare with BlinkDB [Agarwal, Mozafari, Panda, Milner, Madden, Stoica, ’13]

| | Wander Join | BlinkDB |
| Methodology | Query -> Sampling | Sampling -> Query |
| Sampling method | Random walks | Stratified sampling |
| Joins supported | Any | Big table joining a small table (no sampling on small table) |
| Error | Reduces over time | Fixed |
| Data schema | Any (relational, graph) | Star / snowflake |
| Work with OLTP | Integrated | Separate |
| Group-by support | Unbalanced | Balanced |