PAPER BY: CHRISTOPHER RÉ, NILESH DALVI, DAN SUCIU. International Conference on Data Engineering (ICDE), 2007. PRESENTED BY: JITENDRA GUPTA.

• Introduction
• Challenges in Probabilistic Databases
• Possible Worlds
• DNF Formula Based Query Evaluation
• Monte Carlo (MC) Simulation
• Critical Region
• Multi Simulations
• Experiments & Results
• Conclusions & Future Work

• Probabilistic databases model data that contains unreliable, inconsistent and imprecise information, but SQL query evaluation on such data is difficult.
• Imprecision in the data leads to a large number of low-quality answers, and users are interested only in the answers with the highest probabilities.
• Unlike previous approaches, which restricted the class of SQL queries in order to evaluate them exactly, the algorithm in the paper efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database.
• The paper presents a new approach to query evaluation on probabilistic databases that combines top-k style queries with approximation algorithms that have provable guarantees.

• A probabilistic database is an uncertain database in which the possible worlds have associated probabilities. In the simplest definition, every tuple belongs to the database with some probability between 0 and 1.
• Queries that are #P-complete are not handled efficiently in probabilistic databases.

• The major challenge in probabilistic databases is query evaluation. Dalvi & Suciu have shown that most SQL queries are #P-complete; the algorithm described in the paper handles such queries efficiently through approximation.
• Computing the exact output probabilities is computationally hard.
• In essence, any algorithm computing the exact output probabilities needs to iterate through all possible worlds (here, all possible subsets of $TitleMatch^p$).
• Another challenge is that the number of potential answers whose probabilities must be computed is large, while the user is likely to inspect only the first few of them.

• A possible world is any subset of the tuples in the database. Its probability is the product of the probabilities $p$ of the tuples that are in it and of the complements $(1 - p)$ of the tuples that are not in it.
• Consider a probabilistic database containing two relations S and T (the slide's figure is not reproduced in this transcript).
• Formally, a probabilistic database over schema S is a pair $(W, P)$ where $W = \{W_1, \ldots, W_n\}$ is a set of database instances over S, and $P : W \to [0, 1]$ is a probability distribution (i.e. $\sum_{j=1}^{n} P(W_j) = 1$). Each instance $W_j$ for which $P(W_j) > 0$ is called a possible world.
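A minimal sketch of the possible-worlds construction, assuming independent tuples. The tuple names (s1, s2, t1) and their probabilities are hypothetical and are not taken from the paper's example.

```python
from itertools import product


def possible_worlds(prob):
    """Yield (world, probability) for every subset of independent tuples.

    prob maps a tuple name to its marginal probability (hypothetical data).
    """
    tuples = sorted(prob)
    for bits in product([False, True], repeat=len(tuples)):
        world = frozenset(t for t, present in zip(tuples, bits) if present)
        p = 1.0
        for t, present in zip(tuples, bits):
            # p for a tuple that is present, (1 - p) for one that is absent
            p *= prob[t] if present else 1.0 - prob[t]
        yield world, p


if __name__ == "__main__":
    worlds = dict(possible_worlds({"s1": 0.8, "s2": 0.5, "t1": 0.6}))
    print(len(worlds), sum(worlds.values()))
```

With n tuples there are 2^n worlds, which is why enumerating them only works for very small examples.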

• Let $J^p$ be a database instance over schema $S^p$. Then $Mod(J^p)$ is the probabilistic database $(W, P)$ over the schema S.
• The possible-worlds semantics is illustrated in the paper by a figure based on its running example (the figure is not reproduced in this transcript).

• In boolean logic, a disjunctive normal form (DNF) is a standardization of a logical formula as a disjunction of conjunctive clauses.
• Let $(W, P)$ be a probabilistic database and let $t_1, t_2, \ldots$ be all the tuples in all possible worlds. We interpret each tuple as a boolean propositional variable, and each possible world W as a truth assignment to these variables: $t_i$ = true if $t_i$ belongs to W, and $t_i$ = false otherwise.
• Consider now a DNF formula E over tuples: E is true in some worlds and false in others. Define its probability P(E) to be the sum of P(W) over all worlds W where E is true. Continuing the example, the expression $E = (t_1 \wedge t_5) \vee t_2$ is true in the possible worlds $W_3, W_7, W_{10}, W_{11}$, and its probability is thus $P(E) = P(W_3) + P(W_7) + P(W_{10}) + P(W_{11})$.
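The definition of P(E) can be checked on a small case by summing over all worlds. The sketch below assumes independent tuples and uses made-up probabilities for t1, t2 and t5; the function name and data layout are illustrative only.

```python
from itertools import product


def dnf_probability(clauses, prob):
    """Exact P(E) for a DNF over independent tuples, by enumerating worlds.

    clauses: list of clauses, each a collection of tuple names (a conjunction).
    prob:    tuple name -> marginal probability (hypothetical values).
    """
    tuples = sorted(prob)
    total = 0.0
    for bits in product([False, True], repeat=len(tuples)):
        world = dict(zip(tuples, bits))
        p_world = 1.0
        for t in tuples:
            p_world *= prob[t] if world[t] else 1.0 - prob[t]
        # E is true in this world if at least one clause is fully present
        if any(all(world[t] for t in clause) for clause in clauses):
            total += p_world
    return total


if __name__ == "__main__":
    # E = (t1 AND t5) OR t2
    print(dnf_probability([("t1", "t5"), ("t2",)], {"t1": 0.5, "t2": 0.3, "t5": 0.4}))
```

For these values the result agrees with the closed form 1 - (1 - 0.5 * 0.4)(1 - 0.3) = 0.44, which holds here only because the two clauses share no tuples.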

• Example query: qe = SELECT * FROM AMZNReviews a, AMZNReviews b, TitleMatch ax, TitleMatch by, IMDBMovie x, IMDBMovie y, IMDBDirector d WHERE ...
• Each answer returned by qe has 7 tuple variables, declared in the FROM clause: (a, b, $ax^p$, $by^p$, x, y, d), where $ax^p$ and $by^p$ are probabilistic tuples (from the TitleMatch table).
• Thus, every row t returned by qe defines a boolean expression $t.E = ax^p \wedge by^p$.

• Next we group the rows by their directors. Each group has the form $G = \{(ax^p_1, by^p_1), \ldots, (ax^p_m, by^p_m)\}$.
• DNF formula: $G.E = (ax^p_1 \wedge by^p_1) \vee \ldots \vee (ax^p_m \wedge by^p_m)$. The director's probability is given by $P(G.E)$.
• How do we calculate the director's probability?
• Brute force approach: enumerate every possible world and compute the truth value of the boolean expression in it. $p = P(G.E)$ is the total probability of the worlds in which G.E is true. This is a #P-hard problem.
• An alternative is Monte Carlo simulation, which is far better than the brute force approach.
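The grouping step can be sketched as follows, assuming each query row has already been reduced to a (director, ax, by) triple. The row layout and the identifiers m1, m2, ... are hypothetical stand-ins for the probabilistic TitleMatch tuples.

```python
from collections import defaultdict


def group_dnf(rows):
    """Group rows by director and collect one conjunctive clause per row.

    Returns director -> list of clauses; the list represents the DNF G.E.
    """
    groups = defaultdict(list)
    for director, ax, by in rows:
        clause = frozenset((ax, by))  # the conjunction ax AND by
        if clause not in groups[director]:  # drop repeated disjuncts
            groups[director].append(clause)
    return dict(groups)


if __name__ == "__main__":
    rows = [("d1", "m1", "m2"), ("d1", "m3", "m4"), ("d2", "m1", "m5")]
    for director, clauses in group_dnf(rows).items():
        print(director, [sorted(c) for c in clauses])
```

Each director then carries one DNF formula whose probability has to be estimated, which is the quantity the top-k ranking is based on.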

• An MC algorithm repeatedly chooses a possible world at random and computes the truth value of the boolean expression G.E (Eq.(3)); the probability $p = P(G.E)$ is approximated by the frequency $\tilde{p}$ with which G.E was true.
• The Karp & Luby algorithm is a variant of MC. The important property for our algorithm is that after running N steps it guarantees, with high probability, that p lies in some interval $[a^N, b^N]$.
• Algorithm:
    fix an order on the disjuncts: t1, t2, ..., tm
    C := 0
    repeat
      Choose a random disjunct ti ∈ G
      Choose a random truth assignment s.t. ti.E = true
      if forall j < i tj.E = false then C := C + 1
    until N times
    return p̃ = C/N
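A minimal sketch of the Karp-Luby style estimator described above, for a DNF of positive tuple variables under the assumption that tuples are independent. Two details are assumptions that the slide's pseudocode leaves unspecified: each disjunct is drawn with probability proportional to its own probability, and the counter ratio C/N is rescaled by the sum of the disjunct probabilities, as in the standard form of the estimator. Function and variable names are illustrative.

```python
import random
from math import prod


def karp_luby(clauses, prob, n_steps, rng):
    """Estimate P(C1 OR ... OR Cm) for conjunctive clauses over independent tuples."""
    weights = [prod(prob[t] for t in c) for c in clauses]  # P(clause is true)
    total = sum(weights)
    variables = sorted({t for c in clauses for t in c})
    count = 0
    for _ in range(n_steps):
        # Choose a disjunct with probability proportional to its weight.
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Sample a world conditioned on clause i being true.
        world = {t: (t in clauses[i]) or (rng.random() < prob[t]) for t in variables}
        # Count the sample only if no earlier disjunct is also true.
        if not any(all(world[t] for t in clauses[j]) for j in range(i)):
            count += 1
    return total * count / n_steps


if __name__ == "__main__":
    rng = random.Random(0)
    clauses = [("t1", "t5"), ("t2",)]  # E = (t1 AND t5) OR t2
    print(karp_luby(clauses, {"t1": 0.5, "t2": 0.3, "t5": 0.4}, 20_000, rng))
```

Unlike naive world sampling, every sample here satisfies at least one disjunct, so the estimate stays informative even when P(G.E) is very small.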

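• Critical region: $(c, d) = (\mathrm{top}_k(a_1, \ldots, a_n), \mathrm{top}_{k+1}(b_1, \ldots, b_n))$, where $\mathrm{top}_k$ denotes the k-th largest value in its argument list.
• Top objects: $T = \{G_i \mid d \le a_i\}$
• Bottom objects: $B = \{G_i \mid b_i \le c\}$
• [Figure: assumptions and example intervals]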
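The definitions above translate directly into code. The sketch below assumes that the intervals are given as (a_i, b_i) pairs and that there are more than k objects; the interval values in the example are invented.

```python
def critical_region(intervals, k):
    """Return (c, d): the k-th largest lower bound and the (k+1)-th largest upper bound."""
    lows = sorted((a for a, _ in intervals), reverse=True)
    highs = sorted((b for _, b in intervals), reverse=True)
    return lows[k - 1], highs[k]


def classify(intervals, k):
    """Return the critical region plus the indices of top and bottom objects."""
    c, d = critical_region(intervals, k)
    top = [i for i, (a, _) in enumerate(intervals) if d <= a]
    bottom = [i for i, (_, b) in enumerate(intervals) if b <= c]
    return (c, d), top, bottom


if __name__ == "__main__":
    intervals = [(0.8, 0.9), (0.1, 0.2), (0.4, 0.7), (0.3, 0.6), (0.0, 0.05)]
    print(classify(intervals, k=2))
```

In this example objects 2 and 3 are neither top nor bottom: their intervals still overlap the critical region, so they are the ones that need more simulation steps.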

• The idea of the algorithm is to run several Monte Carlo simulations in parallel, one for each candidate answer, and to approximate each probability only to the extent needed to compute the top-k answers correctly.
• $G = \{G_1, \ldots, G_n\}$ is a set of n objects with unknown probabilities $p_1, \ldots, p_n$, and TopK is a subset of G.
• We assume $c < d$ from now on, and call $G_i$ a crosser if $[c, d]$ is a subset of $[a_i, b_i]$.
• $G_i$ is a double crosser if $a_i < c$ and $d < b_i$.
• $G_i$ is a lower (upper) crosser if $a_i < c$ ($d < b_i$).

• Algorithm: MS_TopK(G, k)   /* G = {G1, ..., Gn} */
    Let [a1, b1] = ... = [an, bn] = [0, 1], (c, d) = (0, 1)
    while c <= d do
      Case 1: choose a double crosser to simulate
      Case 2: choose an upper and a lower crosser to simulate
      Case 3: choose a maximal crosser to simulate
      Update (c, d) using Eq.(5)
    end while
    return TopK = T = {Gi | d <= ai}
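A rough sketch of the scheduling loop, not the paper's implementation. Several parts are assumptions made for illustration: each simulation is replaced by a hypothetical BernoulliSim that samples a hidden probability and reports a Hoeffding interval (a rigorous version would also account for checking the interval repeatedly), crossers are required to contain c or d strictly inside their interval, "maximal" is read as "widest interval", and the widest interval overall is simulated when no crosser exists.

```python
import math
import random


class BernoulliSim:
    """Hypothetical stand-in for the Monte Carlo simulation of one object G_i."""

    def __init__(self, p, rng, delta=1e-6):
        self.p, self.rng, self.delta = p, rng, delta
        self.hits = self.n = 0
        self.a, self.b = 0.0, 1.0  # current interval [a_i, b_i]

    def step(self, batch=100):
        # Draw a batch of samples, then recompute a Hoeffding interval.
        self.hits += sum(self.rng.random() < self.p for _ in range(batch))
        self.n += batch
        eps = math.sqrt(math.log(2 / self.delta) / (2 * self.n))
        mean = self.hits / self.n
        self.a, self.b = max(0.0, mean - eps), min(1.0, mean + eps)


def critical_region(sims, k):
    c = sorted((s.a for s in sims), reverse=True)[k - 1]  # k-th largest a_i
    d = sorted((s.b for s in sims), reverse=True)[k]      # (k+1)-th largest b_i
    return c, d


def ms_topk(sims, k, max_rounds=10_000):
    """Return the indices of the k objects judged most probable."""
    def widest(idxs):
        return max(idxs, key=lambda i: sims[i].b - sims[i].a)

    ids = list(range(len(sims)))
    for _ in range(max_rounds):
        c, d = critical_region(sims, k)
        if c > d:  # critical region is empty: the top k are separated
            break
        lower = [i for i in ids if sims[i].a < c < sims[i].b]
        upper = [i for i in ids if sims[i].a < d < sims[i].b]
        double = [i for i in lower if i in upper]
        if double:                      # Case 1
            chosen = [widest(double)]
        elif lower and upper:           # Case 2
            chosen = [widest(lower), widest(upper)]
        else:                           # Case 3, with fallback to all objects
            chosen = [widest(lower or upper or ids)]
        for i in chosen:
            sims[i].step()
    ranked = sorted(ids, key=lambda i: sims[i].a, reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    rng = random.Random(0)
    sims = [BernoulliSim(p, rng) for p in (0.9, 0.2, 0.7, 0.4, 0.1)]
    print(ms_topk(sims, k=2), [s.n for s in sims])
```

Printing the per-object sample counts shows where the simulation effort went: objects whose intervals sit near the critical region receive more steps than objects that are clearly in or clearly out of the top k.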

• The experiments use 4 queries that illustrate different scales for the number of groups and the average size of each group, where S = Small and L = Large:
• SS
• SL
• LS
• LL

• The paper shows that the algorithm is near-optimal for answering top-k queries on probabilistic databases, with applications to imprecision in data.
• Future work needs to address data models in which the probabilities are not listed explicitly.

THANK YOU. QUESTIONS?