Join Synopses for Approximate Query Answering. Swarup Acharya, Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy. Presented by Vinay Hoskere.

1. Introduction
2. AQUA
3. Problem with joins
4. Join synopses
5. Allocation
6. Maintenance of join synopses
7. Experimental Evaluation

• Traditional query processing: exact answers; the goals are to minimize response time and maximize throughput.
• Data warehouses: the full precision of an exact answer is often not needed; an approximate answer obtained in less time, with a minimum number of accesses to the base data, is preferable.
• Random sampling techniques generate approximate answers.
• Foreign key joins: warehouse tables are large, and schemes for approximate join aggregates that rely on random samples of the base relations suffer from serious disadvantages.

• AQUA, an approximate query answering system: improves response time and avoids access to the original database.
• It maintains smaller statistical summaries of the data, called "synopses".
• It provides confidence bounds on its answers.
• It has three components:
  ◦ Statistics collection
  ◦ Query rewriting
  ◦ Maintenance
• It sits on top of a DBMS.
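
To make "approximate answer with confidence bounds" concrete, here is a minimal sketch of estimating a SUM from a uniform sample. The slides do not say how AQUA computes its bounds; the normal-approximation interval and the function name `estimate_sum` are assumptions made for illustration only.

```python
import math
import statistics

def estimate_sum(sample, population_size, z=1.96):
    """Scale a uniform sample up to an estimate of SUM over the full relation.

    Returns (estimate, half_width). The interval uses a normal approximation,
    which is an assumption of this sketch, not AQUA's documented method.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    estimate = population_size * mean
    # Sample standard deviation; zero when fewer than two values are available.
    stdev = statistics.stdev(sample) if n > 1 else 0.0
    half_width = z * population_size * stdev / math.sqrt(n)
    return estimate, half_width

# Four sampled values standing in for a relation of 1,000 tuples.
est, hw = estimate_sum([10.0, 12.0, 8.0, 10.0], population_size=1000)
print(f"SUM is approximately {est:.0f} +/- {hw:.0f}")
```

The query runs against the small sample only; the scaling factor (relation size over sample size) is what a rewritten query would apply to the aggregate.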

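• A natural set of synopses would be uniform random samples of each base relation. Joining such samples has two problems:
  ◦ Non-uniform result sample: the join of two uniform samples is not a uniform sample of the join result.
  ◦ Small join result sizes: the join of two samples contains far fewer tuples than a sample of the join result.
• Using samples of the base relations is therefore not feasible for join queries.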
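
The "small join result sizes" problem can be seen in a short simulation. This is a sketch on made-up data: the two-table schema, the table sizes and the 10% sampling rate are all assumptions chosen for illustration.

```python
import random

random.seed(0)

# Hypothetical schema: every lineitem row references exactly one order.
orders = list(range(1_000))                                       # order keys
lineitems = [(i, random.choice(orders)) for i in range(10_000)]   # (line id, order key)

rate = 0.10
orders_sample = set(random.sample(orders, int(rate * len(orders))))
lineitems_sample = random.sample(lineitems, int(rate * len(lineitems)))

# Join of the two base samples: a line survives only if its order was sampled too.
naive_join = [(line, key) for line, key in lineitems_sample if key in orders_sample]

full_join_size = len(lineitems)   # foreign key join: one output row per lineitem
print(len(naive_join), "rows from joining the two 10% samples")
print(int(rate * full_join_size), "rows in a 10% sample of the join result")
```

Each sampled lineitem finds its order in the orders sample only about 10% of the time, so the joined samples cover roughly 1% of the join result rather than 10%. The effect compounds with every additional relation in the join.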

• Naïve approach: execute all possible join queries and collect samples of their results.
• Join synopses: samples are taken from only a small set of distinguished joins.
• The scheme applies to foreign key joins.
• Model the database schema as a graph:
  ◦ vertex: a base relation
  ◦ directed edge (u to v): u has one or more attributes that form a foreign key referencing v
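
The schema graph can be sketched as an adjacency list. The table names below follow a TPC-D-like schema and are assumptions made for illustration; `descendants` is a hypothetical helper that collects the relations taking part in the largest foreign key join rooted at a given relation.

```python
# Illustrative foreign key graph (edge u -> v means u holds a foreign key
# referencing v). Table names are assumptions modeled on a TPC-D-like schema.
SCHEMA = {
    "lineitem": ["orders", "partsupp"],
    "orders": ["customer"],
    "partsupp": ["part", "supplier"],
    "customer": ["nation"],
    "supplier": ["nation"],
    "nation": ["region"],
    "part": [],
    "region": [],
}

def descendants(graph, u):
    """Return every relation reachable from u by following foreign keys."""
    seen = set()
    stack = list(graph[u])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph[v])
    return seen

print(sorted(descendants(SCHEMA, "orders")))
```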

• There is a 1-1 correspondence between the tuples of a relation r and the tuples in the output of any foreign key join involving r and any of its descendants in the graph.
• The subgraph of G on the k nodes in any k-way foreign key join must be a connected subgraph with a single root node (the source relation of the join).

• For each node u in G, corresponding to a relation $r_1$, define J(u) to be the output of the maximum foreign key join $r_1 \bowtie r_2 \bowtie \cdots \bowtie r_k$ with source $r_1$.
• Let $S_u$ be a uniform random sample of $r_1$.
• The join synopsis $J(S_u)$ is the output of $S_u \bowtie r_2 \bowtie \cdots \bowtie r_k$.
• $J(S_u)$ is a uniform random sample of J(u) with $|S_u|$ tuples.
• Thus we can extract from our synopses a uniform random sample of the output of any k-way foreign key join.
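
The construction above can be sketched as follows: sample only the source relation, then join each sampled tuple with the full referenced relations. The function name `build_join_synopsis`, the dictionary-based row layout and the toy tables are assumptions made for illustration.

```python
import random

def build_join_synopsis(source_rows, fk_lookups, sample_size, rng):
    """Sample the source relation, then join each sampled row with the full
    referenced relations by following its foreign keys.

    source_rows : list of dicts (the source relation)
    fk_lookups  : list of (fk_column, {key: referenced row dict}) pairs,
                  ordered so that each fk_column is present when it is needed
    """
    sample = rng.sample(source_rows, sample_size)       # plays the role of S_u
    synopsis = []
    for row in sample:
        joined = dict(row)
        for fk_column, referenced in fk_lookups:
            # Foreign key integrity guarantees exactly one match per row.
            joined.update(referenced[joined[fk_column]])
        synopsis.append(joined)
    return synopsis                                      # plays the role of J(S_u)

# Toy data: orders -> customer -> nation (all names are illustrative).
nation = {n: {"nation_name": f"N{n}"} for n in range(5)}
customer = {c: {"cust_name": f"C{c}", "nationkey": c % 5} for c in range(50)}
orders = [{"orderkey": o, "custkey": o % 50, "price": 10.0 * o} for o in range(1_000)]

synopsis = build_join_synopsis(
    orders, [("custkey", customer), ("nationkey", nation)], 100, random.Random(1)
)
print(len(synopsis), "joined tuples")
```

Because every sampled source tuple matches exactly one tuple in each referenced relation, the synopsis has exactly as many tuples as the sample, in contrast to a join of independently drawn samples.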

• Optimal allocation: when certain properties of the query workload are known, the available space can be divided optimally among the join synopses.
• Let S be a set of queries with selects, aggregates, group-bys and foreign key joins.
• For each relation $R_i$, find the fraction $F_i$ of queries in S for which $R_i$ is the source relation in a foreign key join.
• The average relative error bound over the queries is proportional to $\sum_i F_i / \sqrt{n_i}$, where $n_i$ is the number of tuples in the join synopsis for $R_i$.
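
As a worked step, the error expression can be minimized under a space budget. The slides do not state the constraint; the sketch below assumes that a tuple of the join synopsis for $R_i$ occupies $s_i$ units of space and that the total budget is $N$, so both symbols are assumptions.

```latex
\min_{n_1,\dots,n_m} \; \sum_i \frac{F_i}{\sqrt{n_i}}
\quad \text{subject to} \quad \sum_i s_i n_i = N

% Stationarity of the Lagrangian for each i:
-\frac{F_i}{2\, n_i^{3/2}} + \lambda s_i = 0
\;\;\Longrightarrow\;\;
n_i = \left(\frac{F_i}{2 \lambda s_i}\right)^{2/3}
\;\;\propto\;\; \left(\frac{F_i}{s_i}\right)^{2/3}

% Normalizing with the space constraint:
n_i = N \, \frac{(F_i / s_i)^{2/3}}{\sum_j s_j \, (F_j / s_j)^{2/3}}
```

Under these assumptions, a relation that is the source of many queries, or whose synopsis tuples are small, receives more sample tuples.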

• Heuristic allocation: used when the properties of the workload are not known.
• There are three procedures:
  ◦ EqJoin: divides the allotted space equally among the relations.
  ◦ CubeJoin: divides the space among the relations in proportion to the cube root of their join synopsis tuple sizes.
  ◦ PropJoin: divides the space among the relations in proportion to their join synopsis tuple sizes.
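
The three heuristics differ only in the exponent applied to the tuple size, which the following sketch makes explicit. The function name `allocate`, the byte sizes and the 120,000-byte budget are assumptions made for illustration.

```python
def allocate(total_space, tuple_sizes, exponent):
    """Split total_space among relations in proportion to tuple_size ** exponent.

    exponent = 0   -> EqJoin   (equal space per relation)
    exponent = 1/3 -> CubeJoin (cube root of tuple size)
    exponent = 1   -> PropJoin (tuple size)
    Returns {relation: (space, number_of_tuples)}.
    """
    weights = {r: s ** exponent for r, s in tuple_sizes.items()}
    total_weight = sum(weights.values())
    result = {}
    for r, s in tuple_sizes.items():
        space = total_space * weights[r] / total_weight
        result[r] = (space, int(space // s))
    return result

# Hypothetical join synopsis tuple sizes in bytes.
sizes = {"lineitem": 800, "orders": 400, "customer": 200}
eq = allocate(120_000, sizes, 0)
cube = allocate(120_000, sizes, 1 / 3)
prop = allocate(120_000, sizes, 1)

for name, plan in (("EqJoin", eq), ("CubeJoin", cube), ("PropJoin", prop)):
    print(name, {r: n for r, (_, n) in plan.items()})
```

EqJoin gives the relation with the widest synopsis tuples the fewest sample tuples, PropJoin gives every join synopsis the same number of tuples, and CubeJoin falls between the two.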

• The join synopses must be maintained when a base relation is updated (insert or delete).
• The maintenance algorithm does not require frequent access to the base relations.
• On insert of a new tuple T into relation u:
  ◦ Let $P_u$ be the probability with which a newly arrived tuple for u is included in the random sample $S_u$.
  ◦ Let $u \bowtie r_2 \bowtie \cdots \bowtie r_k$ be the maximum foreign key join with source u.
  ◦ Add T to $S_u$ with probability $P_u$.
  ◦ If T is added to $S_u$, add the tuple $T \bowtie r_2 \bowtie \cdots \bowtie r_k$ to $J(S_u)$.
  ◦ If T is added to $S_u$ and $S_u$ exceeds its target size, select a tuple T' uniformly at random to evict from $S_u$, and remove the tuple corresponding to T' from $J(S_u)$.
• On delete of a tuple T from u:
  ◦ If T is in $S_u$, delete it from $S_u$ and remove the corresponding tuple from $J(S_u)$.
  ◦ If the sample becomes too small because of many deletions, repopulate it by scanning relation u.
• The algorithm performs lookups on the base relations only with the (small) probability $P_u$.
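
The insert and delete rules can be sketched as a small class. This is an illustration under assumptions, not the authors' implementation: the class name `JoinSynopsis`, the `join_fn` callback that performs the base-relation lookups, and passing the inclusion probability in as a parameter are all hypothetical choices.

```python
import random

class JoinSynopsis:
    """Maintains S_u and J(S_u) for one source relation u (illustrative only)."""

    def __init__(self, target_size, join_fn, rng=None):
        self.target_size = target_size
        self.join_fn = join_fn      # joins T with r_2 ... r_k via base-relation lookups
        self.rng = rng or random.Random()
        self.sample = {}            # key -> tuple T        (S_u)
        self.joined = {}            # key -> joined tuple   (J(S_u))

    def insert(self, key, row, p_u):
        # Most inserts stop here, so base relations are touched with probability p_u.
        if self.rng.random() >= p_u:
            return
        self.sample[key] = row
        self.joined[key] = self.join_fn(row)
        if len(self.sample) > self.target_size:
            # Evict a uniformly chosen tuple from S_u and its counterpart in J(S_u).
            victim = self.rng.choice(list(self.sample))
            del self.sample[victim]
            del self.joined[victim]

    def delete(self, key):
        # Only tuples that are in the sample need any work.
        if key in self.sample:
            del self.sample[key]
            del self.joined[key]

    def needs_repopulation(self, min_size):
        # After many deletions the caller would rescan relation u.
        return len(self.sample) < min_size

# Usage with a stand-in join function that attaches a fixed customer name.
syn = JoinSynopsis(3, lambda r: {**r, "cust_name": "C0"}, random.Random(0))
for k in range(10):
    syn.insert(k, {"orderkey": k}, p_u=1.0)
print(len(syn.sample), "tuples kept")
```

The expensive step, `join_fn`, runs only for tuples that are actually admitted to the sample, which is why the base relations are accessed rarely.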

• Test bed: the TPC-D decision support benchmark, with a database of around 300 MB.
• Machine: 296 MHz UltraSPARC-II, 256 MB of memory, Solaris 5.6.
• The query used is based on Q5: an aggregate computed on the join of Lineitem, Customer, Order, Supplier, Nation and Region.
• The query used is

• Approximate query answering is becoming increasingly essential in data warehouses.
• A fundamental problem: computing approximate answers to aggregates on multi-way joins.
• Join synopses: a solution for schemas that involve foreign key joins.
• They provide better performance than schemes based on samples of the base relations.
• They can be maintained efficiently during updates.
• Approximate answers for group-by, rank and set-valued queries remain open problems.