Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization Christopher Re and Dan Suciu University of Washington 1.

Slides:



Advertisements
Similar presentations
Uncertainty in Data Integration Ai Jing
Advertisements

Limitations of the relational model 1. 2 Overview application areas for which the relational model is inadequate - reasons drawbacks of relational DBMSs.
University of Washington Database Group The Complexity of Causality and Responsibility for Query Answers and non-Answers Alexandra Meliou, Wolfgang Gatterbauer,
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Efficient Processing of Top- k Queries in Uncertain Databases Ke Yi, AT&T Labs Feifei Li, Boston University Divesh Srivastava, AT&T Labs George Kollios,
CS4432: Database Systems II
CSE544 Database Statistics Tuesday, February 15 th, 2011 Dan Suciu , Winter
Supporting top-k join queries in relational databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by Rebecca M. Atchley Thursday, April.
Efficient Processing of Top- k Queries in Uncertain Databases Ke Yi, AT&T Labs Feifei Li, Boston University Divesh Srivastava, AT&T Labs George Kollios,
Online Filtering, Smoothing & Probabilistic Modeling of Streaming Data In short, Applying probabilistic models to Streams Bhargav Kanagal & Amol Deshpande.
Query Optimization CS634 Lecture 12, Mar 12, 2014 Slides based on “Database Management Systems” 3 rd ed, Ramakrishnan and Gehrke.
Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.
LAHAR: Extracting Events from Probabilistic Streams Chris Re, Julie Letchner, Magdalena Balazinska and Dan Suciu University of Washington.
PAPER BY : CHRISTOPHER R’E NILESH DALVI DAN SUCIU International Conference on Data Engineering (ICDE), 2007 PRESENTED BY : JITENDRA GUPTA.
Chris Re, Julie Letchner, Magdalena Balazinska and Dan Suciu University of Washington.
Probabilistic Histograms for Probabilistic Data Graham Cormode AT&T Labs-Research Antonios Deligiannakis Technical University of Crete Minos Garofalakis.
Top-K Query Evaluation on Probabilistic Data Christopher Ré, Nilesh Dalvi and Dan Suciu University of Washington.
A COURSE ON PROBABILISTIC DATABASES June, 2014Probabilistic Databases - Dan Suciu 1.
Efficient Query Evaluation on Probabilistic Databases
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DATABASE SYSTEMS GROUP DEPARTMENT INSTITUTE FOR INFORMATICS Probabilistic Similarity Queries in Uncertain Databases.
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
Sensitivity Analysis & Explanations for Robust Query Evaluation in Probabilistic Databases Bhargav Kanagal, Jian Li & Amol Deshpande.
Implementing Mapping Composition Todd J. Green * University of Pennsylania with Philip A. Bernstein (Microsoft Research), Sergey Melnik (Microsoft Research),
Exploiting Correlated Attributes in Acquisitional Query Processing Amol Deshpande University of Maryland Joint work with Carlos Sam
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
MystiQ The HusQies* *Nilesh Dalvi, Brian Harris, Chris Re, Dan Suciu University of Washington.
By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental.
1 Relational Algebra and Calculus Yanlei Diao UMass Amherst Feb 1, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Presenter: Miguel Garzon Torres CrUise Lab - SITE SQL Coverage Measurement for Testing Database Applications María José Suárez-Cabal University of Oviedo.
Relational Algebra, R. Ramakrishnan and J. Gehrke (with additions by Ch. Eick) 1 Relational Algebra.
Efficient Query Evaluation over Temporally Correlated Probabilistic Streams Bhargav Kanagal, Amol Deshpande ΗΥ-562 Advanced Topics on Databases Αλέκα Σεληνιωτάκη.
Cooperative Query Answering Based on a talk by Erick Martinez.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
The Relational Model. Review Why use a DBMS? OS provides RAM and disk.
1 On Provenance of Non-Answers for Queries over Extracted Data Jiansheng Huang Ting Chen AnHai Doan Jeffrey F. Naughton.
Probabilistic Ranking of Database Query Results Surajit Chaudhuri, Microsoft Research Gautam Das, Microsoft Research Vagelis Hristidis, Florida International.
Access Path Selection in a Relational Database Management System Selinger et al.
General Database Statistics Using Maximum Entropy Raghav Kaushik 1, Christopher Ré 2, and Dan Suciu 3 1 Microsoft Research 2 University of Wisconsin--Madison.
A Survey Based Seminar: Data Cleaning & Uncertain Data Management Speaker: Shawn Yang Supervisor: Dr. Reynold Cheng Prof. David Cheung
Querying Business Processes Under Models of Uncertainty Daniel Deutch, Tova Milo Tel-Aviv University ERP HR System eComm CRM Logistics Customer Bank Supplier.
9/7/2012ISC329 Isabelle Bichindaritz1 The Relational Database Model.
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
CMU SCS Carnegie Mellon Univ. Dept. of Computer Science Database Applications Lecture#6: Relational calculus.
A COURSE ON PROBABILISTIC DATABASES Dan Suciu University of Washington June, 2014Probabilistic Databases - Dan Suciu 1.
1 Relational Algebra and Calculas Chapter 4, Part A.
ICS 321 Fall 2011 The Relational Model of Data (i) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 8/29/20111Lipyeow.
1 Relational Algebra Chapter 4, Sections 4.1 – 4.2.
Lecture 7: Foundations of Query Languages Tuesday, January 23, 2001.
Query Caching and View Selection for XML Databases Bhushan Mandhani Dan Suciu University of Washington Seattle, USA.
Differential Privacy Xintao Wu Oct 31, Sanitization approaches Input perturbation –Add noise to data –Generalize data Summary statistics –Means,
1 Scalable Probabilistic Databases with Factor Graphs and MCMC Michael Wick, Andrew McCallum, and Gerome Miklau VLDB 2010.
A Unified Approach to Ranking in Probabilistic Databases Jian Li, Barna Saha, Amol Deshpande University of Maryland, College Park, USA VLDB
Lecture 15: Query Optimization. Very Big Picture Usually, there are many possible query execution plans. The optimizer is trying to chose a good one.
Efficient Query Evaluation on Probabilistic Databases Nilesh Dalvi Dan Suciu Modified by Veeranjaneyulu Sadhanala.
Chapter 3 The Relational Model. Why Study the Relational Model? Most widely used model. Vendors: IBM, Informix, Microsoft, Oracle, Sybase, etc. “Legacy.
A Course on Probabilistic Databases
A Black-Box Approach to Query Cardinality Estimation
A paper on Join Synopses for Approximate Query Answering
Approximate Lineage for Probabilistic Databases
Relational Algebra Chapter 4, Part A
Queries with Difference on Probabilistic Databases
Data Integration with Dependent Sources
Lecture 16: Probabilistic Databases
Probabilistic Databases
Materializing Views With Minimal Size To Answer Queries
Wellington Cabrera Advisor: Carlos Ordonez
Probabilistic Databases with MarkoViews
CS249: Neural Language Model
Presentation transcript:

Materialized Views in Probabilistic Databases for Information Exchange and Query Optimization Christopher Re and Dan Suciu University of Washington 1

Motivating Example: Optimization materialized – but imprecise – view Similarity between users (8M+) 2

Single Slide Summary Renewed interest in probabilistic data ▫ Trio, MayBMS, Maryland, Purdue, UW ▫ Classical : Integration, record linkage, etc. ▫ Emerging: iLike “Similarity Scores” Today: Apply materialized views to pDBs ▫ Faster execution ▫ Better maintainability The Catch: Every view using lineage, but… ▫ Correlations cause lineage to become large When can we get the benefits of materialized views in prob DBs? 3

Overview Motivation and Background Technical Meat Experiments Conclusion 4

Probabilistic DBs Restaurant Example ▫ Block Independent Disjoint (BID) ▫ Popular: Barbara92, Trio, Mystiq, Green et al. ▫ Query Evaluation  Safe Queries  Multisimulation Possible Worlds Key Rating(Chef,Dish; Rating) Value Attributes ChefDishRateP TDCrabHigh0.8 Med0.1 Low0.1 TDLambHigh0.3 Low0.7 5

q1 p2 Restaurant Example ChefRestaurantP TDD. Lounge0.9 TDP.Kitchen0.7 RestaurantDish D. LoungeCrab P. KitchenCrab P. KitchenLamb ChefDishRateP TDCrabHigh0.8 Med0.1 Low0.1 TDLambHigh0.3 Low0.7 W(Chef,Restaurant) WorksAt S(Restaurant,Dish) Serves R(Chef,Dish,Rate) Rated V(c,r) :- W(c,r),S(r,d),R(c,d,’High’) “Chef and restaurant pairs where chef serves a highly rated dish” ChefRestaurantP TDD. Lounge0.72p1*q1 TDP.Kitchen0.602p2*(1 – (1-q1)(1-q2)) V2(c) :- W(c,r),S(r,d),R(c,d,’High’) p1 q2 6 Understand w.o. “lineage”? ChefDishRateP TDCrabHigh0.8 TDLambHigh0.3 Lineage could be large Reprocessing lineage is expensive “Chefs who serve a highly rated dish”

Views and Query Semantic Add worlds, if V is true Views: Conjunctive, Constants DB Semantics: Possible Worlds View Semantics Output of V 7

Overview Motivation and Background Technical Meat Experiments Conclusion 8

Is output of V(H) on any BID database a BID table? ▫ Represent with Schema + marginal probs. Yes, if there is s.t.  V is K-“block independent”  V is K-“disjoint in blocks” Technical Question: Representation this talk 9

K-“block Independence” All tuples from distinct “blocks” Multiply probs 10 KA 1a b 2a b p1 p2 q1 q2 p1 * q2 Intuition: Fails if tuples in different blocks depend on same tuple

Critical tuples Preliminary notion ▫ Def: t is a disjoint critical tuple for a Boolean view V() if exists W RestDish D. LCrab ChefDishRate TDCrabHigh W(Chef, Restaurant) S(Restaurant,Dish)R(Chef,Dish,Rate) V() :- W(‘TD’,’DL’),S(‘DL’,d),R(‘TD’,d,’High’) all tuples are disjoint critical 11 a world, W ChefRest TDDL

▫ property of view V on any DB ▫ Exists t1 critical for V(a) & t2 critical for V(b) ▫ t1 and t2 in same block in a prob. relation Doubly Critical tuples K a b 12 Tuples in V(H) K’AP k1a1p1 a2p2 k2…q1 BID R(K,A) Thm: A conjunctive view V is K-Block independent iff no K-doubly critical tuples t1 t2 Critical

Thm: Deciding if a view is block independent is decidable and Good News: Simple but cautious test ▫ Test: “Can a prob tuple unify with different heads?”  If so, not block independent Thm: If view has no self-joins, test is complete. K={c,r} Complexity… and a Practical test V(c,r) :- W(c,r),S(r,d),R(c,d,’High’) V(c) :- W(c,r),S(r,d),R(c,d,’High’) 13 K={c} In wild, practical test almost always works

Additional Results How to pick K in the view Dealing with disjointness ▫ “Disjoint in blocks” Partial representability. ▫ Some views not representable,  But a query on a view is still correct ▫ In general, hard, but practical test Sets of Views 14

Overview Motivation and Background Technical Meat Experiments Conclusion 15

Experiments: Wild Queries, % rep. Three Datasets ▫ iLike ▫ SQL Server  Adventure works  Northwinds 99.5% of iLike workload use representable views 16 96% partially 63% representable

17 Experiments TPC-H data Q10 Expected performance increase from materialized views

Conclusion Materialized views for probabilistic data ▫ Problem: Retain classical benefits of views Contributions ▫ A complete theoretical solution ▫ Practical solutions Verified Experimentally ▫ Views exist in practice ▫ Query processing benefits, as expected 18

19

Experiments TPC-H data Q5 unsafe query. Key ▫ PTPC: w.o prob ▫ MC: Monte Carlo ▫ LIN: w. lineage ▫ NOLIN: Our technique NB: LIN not an End-to-End running time. So needs another ~ MC additional seconds! 20

Information Exchange ChefRestaurantP TDD. Lounge0.9 TDP.Kitchen0.7 MSC.Bistro0.8 RestuarantDish D. LoungeCrab P. KitchenCrab P. KitchenLamb C. BistroFish ChefDishRateP TDCrabHigh0.8 Med0.1 Low0.1 TDLambHigh0.3 Low0.7 MSFishHigh0.6 Low0.3 W(Chef,Restaurant) WorksAt S(Restaurant,Dish) Serves R(Chef,Dish,Rate) Rated V(c,r) :- W(c,r),S(r,d),R(c,d,’High’) 21

Technical Question 2: Partially representable Question 2: Given a BID database, a view V and a query Q, can we answer the result of V(D) from Q? Show a query that is partially representable and one that correctly uses it, and one that does not. Does not define a unique probability distribution 22