PAPER BY: CHRISTOPHER RÉ, NILESH DALVI, DAN SUCIU
International Conference on Data Engineering (ICDE), 2007
PRESENTED BY: JITENDRA GUPTA
Introduction
Challenges in Probabilistic Databases
Possible Worlds
DNF-Formula-Based Query Evaluation
Monte Carlo (MC) Simulation
Critical Region
Multisimulation
Experiments & Results
Conclusions & Future Work
Probabilistic databases model data containing unreliable, inconsistent, and imprecise information, but SQL query evaluation on such data is difficult: the imprecision leads to a large number of low-quality answers, while users are interested only in those with the highest probabilities. Previous approaches restricted the SQL queries so that exact query evaluation remained feasible; in contrast, the algorithm in this paper efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database. The paper's new approach to query evaluation combines top-k style queries with approximation algorithms that have provable guarantees.
A probabilistic database is an uncertain database in which the possible worlds have associated probabilities. The simplest definition is that every tuple belongs to the database with some probability (between 0 and 1). Query evaluation is #P-complete in general, and such queries are not handled efficiently by existing probabilistic database systems.
The major challenge in probabilistic databases is query evaluation. Dalvi & Suciu have shown that most SQL queries are #P-complete; the algorithm described in this paper handles such queries efficiently. Computing the exact output probabilities is computationally hard: any algorithm that computes them must iterate over all possible worlds (here, all possible subsets of TitleMatch^p). Another challenge is that the number of potential answers whose probabilities must be computed is large, while the user typically inspects only the first few of them.
A probabilistic database over schema S is a pair (W, P), where W = {W1,...,Wn} is a set of database instances over S and P : W → [0, 1] is a probability distribution (i.e., Σj=1..n P(Wj) = 1). Each instance Wj with P(Wj) > 0 is called a possible world. Under tuple-level uncertainty, a possible world is any subset of the tuples in the database, and its probability is the product of the probabilities of the tuples in that world and of (1 − p) for each tuple not in it. As a running example, the paper considers a probabilistic database containing two relations, S and T.
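To make this concrete, here is a minimal Python sketch under the tuple-independent model; the tuple names t1,...,t5 and their probabilities are made-up illustration values, not the numbers from the paper's example:

```python
from itertools import chain, combinations

# Hypothetical marginal probabilities for five independent tuples.
prob = {"t1": 0.8, "t2": 0.5, "t3": 0.6, "t4": 0.3, "t5": 0.9}

def world_probability(world, prob):
    """P(W): product of p(t) for tuples in W and (1 - p(t)) for the rest."""
    p = 1.0
    for t, pt in prob.items():
        p *= pt if t in world else 1.0 - pt
    return p

def possible_worlds(prob):
    """Yield every subset of the tuples together with its probability."""
    ts = list(prob)
    subsets = chain.from_iterable(
        combinations(ts, r) for r in range(len(ts) + 1))
    for w in subsets:
        yield set(w), world_probability(set(w), prob)

# Sanity check: the probabilities of all 2^5 = 32 worlds sum to 1.
assert abs(sum(p for _, p in possible_worlds(prob)) - 1.0) < 1e-9
```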
Let J^p be a database instance over schema S^p. Then Mod(J^p) is the probabilistic database (W, P) over the schema S whose possible worlds are the subsets of J^p. The possible-worlds semantics for the example considered in the paper is shown in the figure below:
In Boolean logic, a disjunctive normal form (DNF) is a standardization of a logical formula as a disjunction of conjunctive clauses. Let (W, P) be a probabilistic database and let t1, t2,... be all the tuples occurring in the possible worlds. We interpret each tuple as a Boolean propositional variable and each possible world W as a truth assignment to these variables: ti = true if ti belongs to W, and ti = false otherwise. Now consider a DNF formula E over tuples: clearly E is true in some worlds and false in others. Define its probability P(E) as the sum of P(W) over all worlds W in which E is true. Continuing our example, the expression E = (t1 ∧ t5) ∨ t2 is true in the possible worlds W3, W7, W10, W11, so its probability is P(E) = P(W3) + P(W7) + P(W10) + P(W11).
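A brute-force computation of P(E) straight from this definition, reusing prob and possible_worlds from the sketch above; it enumerates all 2^n worlds, which is exactly the exponential cost that motivates Monte Carlo methods later:

```python
def dnf_probability(clauses, prob):
    """P(E) for a DNF E = C1 v ... v Cm: sum P(W) over every possible
    world W in which E is true. Exponential in the number of tuples,
    so usable only on tiny examples."""
    total = 0.0
    for world, p in possible_worlds(prob):
        if any(all(t in world for t in c) for c in clauses):
            total += p
    return total

# E = (t1 ^ t5) v t2 from the running example:
e = [["t1", "t5"], ["t2"]]
print(dnf_probability(e, prob))  # 0.86 = 0.72 + 0.5 - 0.72 * 0.5
```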
Example query: qe = SELECT * FROM AMZNReviews a, AMZNReviews b, TitleMatch ax, TitleMatch by, IMDBMovie x, IMDBMovie y, IMDBDirector d WHERE... Each answer returned by qe binds the 7 tuple variables declared in the FROM clause: (a, b, ax^p, by^p, x, y, d), where ax^p and by^p are probabilistic tuples (from the TitleMatch^p table). Thus, every row returned by qe defines a Boolean expression t.E = ax^p ∧ by^p.
Next we group the rows by director; for each group G = {(ax^p_1, by^p_1),...,(ax^p_m, by^p_m)} we form the DNF formula G.E = (ax^p_1 ∧ by^p_1) ∨ ... ∨ (ax^p_m ∧ by^p_m), and the director's probability is given by P(G.E). How do we calculate this probability? Brute-force approach: enumerate every possible world and evaluate the truth value of the Boolean expression in each; p = P(G.E) is the total probability of the worlds in which G.E = true. This is a #P-hard problem. The alternative is Monte Carlo simulation, which is far more efficient than the brute-force approach.
An MC algorithm repeatedly chooses a possible world at random and computes the truth value of the Boolean expression G.E (Eq. (3) in the paper); the probability p = P(G.E) is approximated by the frequency p̃ with which G.E was true. The Luby-Karp algorithm is a variant of MC for DNF formulas; the property that matters for our algorithm is that after running N steps it guarantees, with high probability, that p lies in an interval [a^N, b^N] around p̃ whose width shrinks as N grows:
Algorithm:
fix an order on the disjuncts: t1, t2,..., tm
C := 0
repeat N times:
  Choose a random disjunct ti ∈ G
  Choose a random truth assignment such that ti.E = true
  if tj.E = false for all j < i then C := C + 1
return p̃ = C/N
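The following Python sketch implements the standard Karp-Luby estimator that this step is based on. Note one detail the pseudocode above elides: the counter is scaled by s = P(C1) + ... + P(Cm), the sum of the disjunct probabilities, which makes the estimate unbiased. Names and the example values are illustrative, not taken from the paper:

```python
import math
import random

def karp_luby(clauses, prob, n_steps):
    """Karp-Luby estimator for P(E), E = C1 v ... v Cm a DNF over
    independent tuples; linear in n_steps instead of enumerating
    all 2^n possible worlds."""
    # P(Ci) = product of the probabilities of the tuples in Ci
    weights = [math.prod(prob[t] for t in c) for c in clauses]
    s = sum(weights)
    if s == 0.0:
        return 0.0
    count = 0
    for _ in range(n_steps):
        # choose disjunct Ci with probability P(Ci) / s
        i = random.choices(range(len(clauses)), weights=weights)[0]
        # choose a random world in which Ci is true: its tuples are
        # forced in, every other tuple is sampled independently
        world = {t: True for t in clauses[i]}
        for t in prob:
            if t not in world:
                world[t] = random.random() < prob[t]
        # count only if no earlier disjunct is also true, so every
        # satisfying world is credited to exactly one disjunct
        if not any(all(world[t] for t in clauses[j]) for j in range(i)):
            count += 1
    return s * count / n_steps

prob = {"t1": 0.8, "t2": 0.5, "t5": 0.9}
e = [["t1", "t5"], ["t2"]]
print(karp_luby(e, prob, 100_000))  # converges to 0.86
```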
Critical region: (c, d) = (topk(a1,...,an), topk+1(b1,...,bn)), where topk(x1,...,xn) denotes the k-th largest of the values x1,...,xn
Top objects: T = {Gi | d <= ai}
Bottom objects: B = {Gi | bi <= c}
Assumption: each object Gi comes with an interval [ai, bi] guaranteed to contain its unknown probability pi, and the intervals shrink as the Monte Carlo simulations run. These definitions are sketched in code below.
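A small Python sketch of these definitions, continuing the hypothetical setting from the earlier snippets; the helper names are mine, and it assumes there are more than k objects:

```python
def critical_region(intervals, k):
    """(c, d) = (topk(a1,...,an), topk+1(b1,...,bn)), where topk takes
    the k-th largest value. intervals maps each object Gi to its
    current bounds (ai, bi) with pi in [ai, bi]; assumes n > k."""
    a_desc = sorted((a for a, _ in intervals.values()), reverse=True)
    b_desc = sorted((b for _, b in intervals.values()), reverse=True)
    return a_desc[k - 1], b_desc[k]

def top_and_bottom(intervals, k):
    """T = objects certainly in the top k, B = objects certainly out."""
    c, d = critical_region(intervals, k)
    top = {g for g, (a, _) in intervals.items() if d <= a}
    bottom = {g for g, (_, b) in intervals.items() if b <= c}
    return top, bottom
```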
The idea of the algorithm is to run several Monte Carlo simulations in parallel, one for each candidate answer, and to approximate each probability only to the extent needed to compute the top-k answers correctly. Let G = {G1,...,Gn} be a set of n objects with unknown probabilities p1,...,pn; the goal is to find TopK, the subset of G containing the k objects of highest probability. We assume c < d from now on, and call Gi a crosser if [c, d] ⊆ [ai, bi]; Gi is a double crosser if ai < c and d < bi, and a lower (upper) crosser if ai < c (d < bi).
Algorithm MS_TopK(G, k): /* G = {G1,...,Gn} */
Let [a1, b1] = ... = [an, bn] = [0, 1] and (c, d) = (0, 1)
while c < d do
  Case 1: if there is a double crosser, simulate it
  Case 2: else if there are both an upper and a lower crosser, simulate both
  Case 3: else simulate a maximal crosser
  Update the intervals and recompute (c, d) = (topk(a1,...,an), topk+1(b1,...,bn)) (Eq. (5) in the paper)
end while
return TopK = T = {Gi | d <= ai}
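Putting the definitions together, here is a sketch of the multisimulation loop in the same hypothetical Python setting. simulate(g) stands in for running a few more Luby-Karp steps on group g's DNF and returning its refined interval, and critical_region is the helper sketched earlier; Case 3's "maximal" crosser is approximated by the widest interval. This illustrates the case analysis rather than reproducing the paper's exact code:

```python
def ms_topk(objects, k, simulate):
    """Sketch of the multisimulation loop (MS_TopK). simulate(g) runs
    more Monte Carlo steps for object g and returns its refined (a, b);
    intervals must keep shrinking toward the true probability.
    Uses critical_region() from the earlier sketch; assumes n > k."""
    intervals = {g: (0.0, 1.0) for g in objects}
    while True:
        c, d = critical_region(intervals, k)
        if c >= d:
            break  # critical region is empty: top k are separated
        # crossers: intervals containing the whole critical region
        crossers = [g for g, (a, b) in intervals.items()
                    if a <= c and d <= b]
        if not crossers:
            break  # cannot happen when c < d, per the paper
        double = [g for g in crossers
                  if intervals[g][0] < c and d < intervals[g][1]]
        lower = [g for g in crossers if intervals[g][0] < c]
        upper = [g for g in crossers if d < intervals[g][1]]
        if double:              # Case 1: simulate a double crosser
            chosen = double[:1]
        elif lower and upper:   # Case 2: a lower and an upper crosser
            chosen = [lower[0], upper[0]]
        else:                   # Case 3: a maximal crosser
            chosen = [max(crossers,
                          key=lambda g: intervals[g][1] - intervals[g][0])]
        for g in chosen:
            intervals[g] = simulate(g)
    return {g for g, (a, _) in intervals.items() if d <= a}
```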
The experiments use four queries that cover different scales for the number of groups and for the average size of each group, where S = small and L = large: SS, SL, LS, LL.
The paper shows that the algorithm finds the top-k answers to queries on probabilistic databases while running a provably near-optimal number of simulation steps, with applications to imprecision in data. Future work includes handling data models in which the probabilities are not listed explicitly.
THANK YOU! QUESTIONS?