2  Introduction  Challenges in Probabilistic Databases  Possible Worlds  DNF Formula Based Query Evaluation  Monte Carlo (MC) Simulation  Critical Region  Multi Simulations  Experiments & Results  Conclusions & Future Work

3  Probabilistic databases model data that contain unreliable, inconsistent, and imprecise information, but SQL query evaluation on such data is difficult.  The imprecision in the data leads to a large number of low-quality answers, and users are interested only in the answers with the highest probabilities.  Previous approaches restricted the SQL queries so that they could be evaluated exactly; in contrast, the algorithm in the paper efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database.  The paper presents a new approach to query evaluation on probabilistic databases that combines top-k style queries with approximation algorithms that have provable guarantees.

4  A probabilistic database is an uncertain database in which the possible worlds have associated probabilities.  In the simplest definition, every tuple belongs to the database with some probability between 0 and 1.  Probabilistic databases do not handle #P-complete queries efficiently.

5  The major challenge in probabilistic databases is query evaluation.  Dalvi & Suciu have shown that most SQL queries are #P-complete; the algorithm described in the paper handles such queries efficiently.  Computing the exact output probabilities is computationally hard.  Any algorithm computing the output probabilities needs to iterate through all possible worlds (here, all possible subsets of TitleMatch^p).  Another challenge is that the number of potential answers for which a probability must be computed is large, while the user is likely to inspect only the first few of them.

6  For independent tuples, a possible world is any subset of the tuples in the database, and its probability is the product of the probabilities of the tuples in it and of the complementary probabilities $(1 - p)$ of the tuples that are not in it.  Consider the following probabilistic database containing two relations S and T:  A probabilistic database over schema S is a pair (W, P), where $W = \{W_1, \ldots, W_n\}$ is a set of database instances over S and $P : W \to [0, 1]$ is a probability distribution (i.e. $\sum_{j=1}^{n} P(W_j) = 1$).  Each instance $W_j$ for which $P(W_j) > 0$ is called a possible world.
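The product rule for possible worlds can be checked on a small case. The sketch below is illustrative only: it assumes independent tuples, and the tuple names and probabilities (`s1`, `s2`, `t1`) are hypothetical, not the relations S and T from the slide.

```python
from itertools import product

def possible_worlds(tuple_probs):
    """Enumerate (world, probability) pairs, assuming independent tuples."""
    names = list(tuple_probs)
    worlds = []
    for bits in product([False, True], repeat=len(names)):
        # The world contains exactly the tuples whose bit is True.
        world = frozenset(n for n, b in zip(names, bits) if b)
        prob = 1.0
        for n, b in zip(names, bits):
            # Present tuples contribute p, absent tuples contribute 1 - p.
            prob *= tuple_probs[n] if b else 1.0 - tuple_probs[n]
        worlds.append((world, prob))
    return worlds

# Hypothetical tuple probabilities, for illustration only.
probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}
worlds = possible_worlds(probs)
world_prob = dict(worlds)
```

With n tuples there are 2^n worlds, which is why enumerating them stops being practical very quickly.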

7  Let $J^p$ be a database instance over schema $S^p$. Then $Mod(J^p)$ is the probabilistic database (W, P) over the schema S.  The possible-world semantics are shown in the figure below, taken from the example considered in the paper:

8  In Boolean logic, a disjunctive normal form (DNF) is a standardized form of a logical formula: a disjunction of conjunctive clauses.  Let (W, P) be a probabilistic database and let $t_1, t_2, \ldots$ be all the tuples in all possible worlds.  We interpret each tuple as a Boolean propositional variable, and each possible world W as a truth assignment to these variables: $t_i$ = true if $t_i$ belongs to W, and $t_i$ = false if it does not.  Consider now a DNF formula E over tuples: E is true in some worlds and false in others.  Define its probability P(E) to be the sum of P(W) over all worlds W where E is true.  Continuing our example, the expression $E = (t_1 \wedge t_5) \vee t_2$ is true in the possible worlds $W_3, W_7, W_{10}, W_{11}$, and its probability is thus $P(E) = P(W_3) + P(W_7) + P(W_{10}) + P(W_{11})$.
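As a rough illustration of this definition, the sketch below sums the probabilities of all worlds in which a DNF formula holds. It assumes independent tuples and uses made-up probabilities for `t1`, `t2`, `t5`, so the worlds it enumerates are not the W1...W11 of the slide's example; only the shape of the formula (t1 AND t5) OR t2 is taken from it.

```python
from itertools import product

def dnf_probability(clauses, tuple_probs):
    """P(E) for a DNF E, by summing P(W) over worlds W where E is true."""
    names = sorted(tuple_probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = dict(zip(names, bits))  # truth assignment for this world
        weight = 1.0
        for n in names:
            weight *= tuple_probs[n] if world[n] else 1.0 - tuple_probs[n]
        # E is true if at least one clause has all of its tuples present.
        if any(all(world[t] for t in clause) for clause in clauses):
            total += weight
    return total

# Hypothetical probabilities; E mirrors (t1 AND t5) OR t2.
probs = {"t1": 0.5, "t2": 0.3, "t5": 0.4}
E = [("t1", "t5"), ("t2",)]
p = dnf_probability(E, probs)
```

For independent tuples this value can be cross-checked by hand: 1 - (1 - 0.5 * 0.4) * (1 - 0.3) = 0.44.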

9  Example query: qe = SELECT * FROM AMZNReviews a, AMZNReviews b, TitleMatch ax, TitleMatch by, IMDBMovie x, IMDBMovie y, IMDBDirector d WHERE...  Each answer returned by qe has 7 tuple variables, declared in the FROM clause: $(a, b, ax^p, by^p, x, y, d)$, where $ax^p$ and $by^p$ are probabilistic tuples (from the TitleMatch table).  Thus, every row t returned by qe defines a Boolean expression $t.E = ax^p \wedge by^p$.

10  Next we group the rows by their directors; each group is $G = \{(ax^p_1, by^p_1), \ldots, (ax^p_m, by^p_m)\}$.  DNF formula: $G.E = (ax^p_1 \wedge by^p_1) \vee \ldots \vee (ax^p_m \wedge by^p_m)$.  The director's probability is given by P(G.E).  How do we calculate the director's probability?  Brute-force approach: enumerate every possible world and compute the truth value of the Boolean expression in each; p = P(G.E) is the total probability of the worlds in which G.E = true. This is a #P-hard problem.  The alternative is Monte Carlo simulation, which is far more efficient than the brute-force approach.
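The grouping step can be sketched as follows. This is only an illustration of how rows might be collected into one DNF per director; the row layout `(director, ax, by)`, the function name `group_rows`, and the sample values are all hypothetical.

```python
from collections import defaultdict

def group_rows(rows):
    """Group (director, ax, by) rows into one DNF (list of clauses) per director."""
    groups = defaultdict(list)
    for director, ax, by in rows:
        clause = (ax, by)  # one conjunction ax AND by per row
        if clause not in groups[director]:  # a repeated row adds nothing to the DNF
            groups[director].append(clause)
    return dict(groups)

# Hypothetical query output: each row names its two probabilistic tuples.
rows = [
    ("Director A", "ax1", "by1"),
    ("Director A", "ax2", "by2"),
    ("Director B", "ax3", "by3"),
    ("Director A", "ax1", "by1"),
]
groups = group_rows(rows)
```

Each value in `groups` is the list of disjuncts of G.E for that director.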

11  An MC algorithm repeatedly chooses a possible world at random and computes the truth value of the Boolean expression G.E (Eq. (3)); the probability p = P(G.E) is approximated by the frequency p̃ with which G.E was true.  The Luby & Karp algorithm is a variant of MC. The important property for our algorithm is that after running N steps it guarantees, with high probability, that p lies in some interval $[a^N, b^N]$.
Algorithm:
  fix an order on the disjuncts: t_1, t_2, ..., t_m
  C := 0
  repeat
    choose a random disjunct t_i ∈ G
    choose a random truth assignment such that t_i.E = true
    if for all j < i, t_j.E = false then C := C + 1
  until N times
  return p̃ = C/N
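A minimal sketch of an estimator in this style is given below, assuming independent tuples. Two choices here are assumptions rather than something the slide states: disjuncts are sampled with probability proportional to their own probability, and the final frequency C/N is multiplied by the sum of the disjunct probabilities, which is how this kind of estimator is usually written. The function name and the sample probabilities are hypothetical.

```python
import random

def karp_luby(clauses, tuple_probs, n_steps, rng):
    """Estimate P(E) for a DNF E over independent tuples (sketch, see assumptions)."""
    names = sorted({t for c in clauses for t in c})
    # Probability of each disjunct: product of its tuples' probabilities.
    weights = []
    for c in clauses:
        w = 1.0
        for t in set(c):
            w *= tuple_probs[t]
        weights.append(w)
    total = sum(weights)
    if total == 0.0:
        return 0.0
    hits = 0
    for _ in range(n_steps):
        # Choose a disjunct, here proportionally to its probability (assumption).
        i = rng.choices(range(len(clauses)), weights=weights)[0]
        # Choose a random truth assignment in which disjunct i is true.
        world = {t: rng.random() < tuple_probs[t] for t in names}
        for t in clauses[i]:
            world[t] = True
        # Count the sample only if no earlier disjunct is also true.
        if not any(all(world[t] for t in clauses[j]) for j in range(i)):
            hits += 1
    # Scale the frequency C/N by the sum of disjunct probabilities (assumption).
    return total * hits / n_steps

probs = {"t1": 0.5, "t2": 0.3, "t5": 0.4}  # hypothetical
E = [("t1", "t5"), ("t2",)]
estimate = karp_luby(E, probs, 20000, random.Random(0))
```

For these hypothetical inputs the exact value is 0.44, so the estimate should land close to it; the interval width after N steps is not computed here.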

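12  Critical region: $(c, d) = (\mathrm{top}_k(a_1, \ldots, a_n), \mathrm{top}_{k+1}(b_1, \ldots, b_n))$, where $[a_i, b_i]$ is the current interval for $G_i$ and $\mathrm{top}_k$ denotes the k-th largest value.  Top objects: $T = \{G_i \mid d \le a_i\}$  Bottom objects: $B = \{G_i \mid b_i \le c\}$  Assumptions:  Intervals: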
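The three definitions translate directly into a few lines of code. The sketch below reads top_k as "the k-th largest value" and uses hypothetical intervals; objects are identified by their index, and `critical_region` is an illustrative name.

```python
def critical_region(intervals, k):
    """Return (c, d, T, B) for a list of (a_i, b_i) intervals; requires 1 <= k < n."""
    lows = sorted((a for a, _ in intervals), reverse=True)
    highs = sorted((b for _, b in intervals), reverse=True)
    c = lows[k - 1]   # k-th largest lower bound
    d = highs[k]      # (k+1)-th largest upper bound
    top = [i for i, (a, _) in enumerate(intervals) if d <= a]
    bottom = [i for i, (_, b) in enumerate(intervals) if b <= c]
    return c, d, top, bottom

# Hypothetical intervals: still ambiguous for k = 2 (c < d).
open_case = critical_region([(0.4, 0.9), (0.6, 0.8), (0.1, 0.7), (0.2, 0.3)], 2)

# Hypothetical intervals: top-2 already separated (c > d).
done_case = critical_region([(0.7, 0.9), (0.6, 0.8), (0.1, 0.5), (0.2, 0.3)], 2)
```

In the first case only object 3 can be ruled out; in the second the critical region is empty and T already holds the two top objects.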

13  The idea of the algorithm is to run several Monte Carlo simulations in parallel, one for each candidate answer, and to approximate each probability only to the extent needed to compute the top-k answers correctly.  $G = \{G_1, \ldots, G_n\}$ is a set of n objects with unknown probabilities $p_1, \ldots, p_n$, and TopK is the subset of G containing the k objects with the highest probabilities.  We assume c < d from now on, and call $G_i$ a crosser if $[c, d] \subseteq [a_i, b_i]$.  $G_i$ is a double crosser if $a_i < c$ and $d < b_i$.  $G_i$ is a lower (upper) crosser if $a_i < c$ ($d < b_i$).

14  Algorithm MS_TopK(G, k):   /* G = {G_1, ..., G_n} */
  let [a_1, b_1] = ... = [a_n, b_n] = [0, 1], (c, d) = (0, 1)
  while c <= d do
    Case 1: choose a double crosser to simulate
    Case 2: choose an upper and a lower crosser to simulate
    Case 3: choose a maximal crosser to simulate
    update (c, d) using Eq. (5)
  end while
  return TopK = T = {G_i | d <= a_i}
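The scheduling loop can be sketched as below. This is a simplified reading, not the paper's implementation: the `refine` callback stands in for running more Monte Carlo steps on one object, "maximal crosser" is interpreted as the crosser with the widest interval, ties are broken by index, and `max_rounds` is an added safety cap. All of these are assumptions, as are the probabilities in the toy example.

```python
def critical_region(intervals, k):
    lows = sorted((a for a, _ in intervals), reverse=True)
    highs = sorted((b for _, b in intervals), reverse=True)
    return lows[k - 1], highs[k]  # (k-th largest a, (k+1)-th largest b)

def ms_topk(n, k, refine, max_rounds=10_000):
    """Multisimulation sketch; refine(i) returns a tighter (a_i, b_i) for object i."""
    intervals = [(0.0, 1.0)] * n
    c, d = 0.0, 1.0
    width = lambda i: intervals[i][1] - intervals[i][0]
    rounds = 0
    while c <= d and rounds < max_rounds:
        rounds += 1
        # Crossers: intervals that contain the whole critical region.
        crossers = [i for i, (a, b) in enumerate(intervals) if a <= c and d <= b]
        if not crossers:
            break
        lower = [i for i in crossers if intervals[i][0] < c]
        upper = [i for i in crossers if d < intervals[i][1]]
        double = [i for i in lower if i in upper]
        if double:                      # Case 1
            chosen = [max(double, key=width)]
        elif lower and upper:           # Case 2
            chosen = [max(lower, key=width), max(upper, key=width)]
        else:                           # Case 3 (widest crosser: assumption)
            chosen = [max(crossers, key=width)]
        for i in chosen:
            intervals[i] = refine(i)
        c, d = critical_region(intervals, k)
    return [i for i, (a, _) in enumerate(intervals) if d <= a]

# Toy stand-in for simulation: each call halves the half-width around a
# hypothetical true probability.
true_p = [0.9, 0.2, 0.6, 0.4]
steps = [0] * len(true_p)

def refine(i):
    steps[i] += 1
    w = 0.5 ** steps[i]
    return max(0.0, true_p[i] - w), min(1.0, true_p[i] + w)

top2 = ms_topk(len(true_p), 2, refine)
```

In this toy run the objects far from the decision boundary are refined less often than those near it, which is the behaviour the multisimulation idea aims for.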

15  The experiments use 4 queries that illustrate different scales for the number of groups and the average size of each group, where S = small and L = large:  SS  SL  LS  LL


17  The paper shows that, using the algorithm, we can get near-optimal answers for top-k queries on probabilistic databases, with applications to imprecision in data.  Future work should also handle data models in which the probabilities are not listed explicitly.


