EFFICIENT TOP-K QUERY EVALUATION ON PROBABILISTIC DATA. Paper by Christopher Ré, Nilesh Dalvi, Dan Suciu. Presented by Chandrashekar Vijayarenu.

Presentation transcript:

EFFICIENT TOP-K QUERY EVALUATION ON PROBABILISTIC DATA
Paper by Christopher Ré, Nilesh Dalvi, Dan Suciu
Presented by Chandrashekar Vijayarenu, Anirban Maiti

AGENDA
Overview
DNF Evaluation
Query Evaluation
Problem Definition
Monte Carlo (MC) Simulation
Multisimulation (MS)
Experiments

PROBABILISTIC DATABASES
A probabilistic database is an uncertain database in which the possible worlds have associated probabilities. In the simplest definition, every tuple belongs to the database with some probability (between 0 and 1).
Example:

SELECT QUERY ON PROBABILISTIC DATABASE
Find directors with a highly rated drama and a low-rated comedy:
    SELECT DISTINCT d.dirName AS Director
    FROM AMZNReviews a, AMZNReviews b,
         TitleMatch ax, TitleMatch by,
         IMDBMovie x, IMDBMovie y, IMDBDirector d
    WHERE a.asin = ax.asin and b.asin = by.asin
      and ax.mid = x.mid and by.mid = y.mid
      and x.did = y.did and y.did = d.did
      and x.genre = 'comedy' and y.genre = 'drama'
      and abs(x.year - y.year) <= 5
      and a.rating <= 2 and b.rating >= 4
A major challenge in probabilistic databases is query evaluation. Dalvi and Suciu have shown that most SQL queries are #P-complete.
This paper proposes a new approach to query evaluation on probabilistic databases, based on top-k style queries: the user specifies a SQL query and a number k, and the system returns the k highest-ranked answers.

QUERY PROCESSING CHALLENGES
Computing exact output probabilities is computationally hard: any algorithm computing the probabilities needs to iterate through all possible subsets of TitleMatch.
The number of potential answers for which we need to calculate a probability is large, yet the user is likely to inspect only the first few of them.
This paper introduces the multisimulation algorithm, which enables efficient processing of probabilistic queries with some error guarantees.

OVERVIEW

POSSIBLE WORLDS
A possible world is any subset of the tuples in the database. Its probability is the product of the probabilities p of the tuples in it and of the complements (1 - p) of the tuples that are not in that world.
Consider the following probabilistic database containing two relations S and T:
A probabilistic database over schema S is a pair (W, P) where W = {W1, ..., Wn} is a set of database instances over S, and P : W -> [0, 1] is a probability distribution (i.e. Σ_{j=1..n} P(Wj) = 1). Each instance Wj for which P(Wj) > 0 is called a possible world.
Let Jp be a database instance over schema Sp. Then Mod(Jp) is the probabilistic database (W, P) over the schema S obtained as described below.
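The possible-worlds semantics can be illustrated with a small sketch. It assumes independent tuples, and the tuple names and probabilities are made up for illustration (the slide's example relations S and T are not reproduced in this transcript). It enumerates every subset of the tuples and computes each world's probability, so the probabilities should sum to 1.

```python
from itertools import product

# Hypothetical tuple probabilities; not the values from the slide's example.
tuple_probs = {"s1": 0.8, "s2": 0.5, "t1": 0.6}

def possible_worlds(probs):
    """Yield (world, probability) for every subset of independent tuples."""
    names = list(probs)
    for included in product([True, False], repeat=len(names)):
        world = frozenset(n for n, keep in zip(names, included) if keep)
        p = 1.0
        for n, keep in zip(names, included):
            # Present tuples contribute p, absent tuples contribute (1 - p).
            p *= probs[n] if keep else 1.0 - probs[n]
        yield world, p

worlds = dict(possible_worlds(tuple_probs))
total = sum(worlds.values())
print(len(worlds), round(total, 9))
```

With three tuples there are 2^3 = 8 worlds, which already hints at why enumerating worlds does not scale to realistic tables.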

DNF FORMULA
In Boolean logic, a disjunctive normal form (DNF) is a standardization of a logical formula as a disjunction of conjunctive clauses.
In our example, E = (t1 ∧ t5) ∨ t2 is true in the possible worlds W3, W7, W10, W11, and its probability is thus P(E) = P(W3) + P(W7) + P(W10) + P(W11).
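A minimal sketch of this computation, assuming independent tuples and hypothetical probabilities for t1, t2 and t5 (the slide's actual values are not in the transcript): it sums the probabilities of the worlds in which E = (t1 ∧ t5) ∨ t2 holds. For this small formula the result can be checked against the closed form 1 - (1 - p1·p5)(1 - p2), which is valid here because the two clauses share no tuples.

```python
from itertools import product

# Hypothetical probabilities for the tuples appearing in E = (t1 AND t5) OR t2.
probs = {"t1": 0.4, "t2": 0.3, "t5": 0.5}
# A DNF formula as a list of clauses; each clause is the set of tuples
# that must all be present for the clause to be true.
dnf = [{"t1", "t5"}, {"t2"}]

def dnf_probability(dnf, probs):
    """Sum the probabilities of all worlds in which some clause is true."""
    names = sorted(probs)
    total = 0.0
    for bits in product([False, True], repeat=len(names)):
        world = {n for n, b in zip(names, bits) if b}
        weight = 1.0
        for n, b in zip(names, bits):
            weight *= probs[n] if b else 1.0 - probs[n]
        if any(clause <= world for clause in dnf):
            total += weight
    return total

p_e = dnf_probability(dnf, probs)
closed_form = 1 - (1 - probs["t1"] * probs["t5"]) * (1 - probs["t2"])
print(round(p_e, 6), round(closed_form, 6))
```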

QUERY EVALUATION → DNF EVALUATION
qe = SELECT * FROM AMZNReviews a, AMZNReviews b, TitleMatch ax, TitleMatch by, IMDBMovie x, IMDBMovie y, IMDBDirector d WHERE ...
Each answer returned by qe has the 7 tuple variables defined in the FROM clause: (a, b, ax^p, by^p, x, y, d).
ax^p and by^p are probabilistic tuples (from the TitleMatch table).
Thus, every row t returned by qe defines a Boolean expression t.E = ax^p ∧ by^p.

INTRODUCE GROUP BY
Next we group the rows by their directors. Each group G = {(ax^p_1, by^p_1), ..., (ax^p_m, by^p_m)} defines the DNF formula:
G.E = (ax^p_1 ∧ by^p_1) ∨ ... ∨ (ax^p_m ∧ by^p_m)
The director's probability is given by P(G.E).
How do we calculate the director's probability?
Brute-force approach: enumerate every possible world and calculate the truth value of the Boolean expression. p = P(G.E) is the total probability of the worlds in which G.E is true. This is a #P-hard problem.
Monte Carlo simulation: choose possible worlds at random and calculate the truth value in each.
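The grouping step can be sketched as follows. The row layout (director, ax tuple id, by tuple id) and all values are hypothetical stand-ins for the output of qe; each group ends up as a list of two-tuple conjunctions, which is the DNF formula G.E for that director.

```python
from collections import defaultdict

# Hypothetical rows returned by qe: (director, ax tuple id, by tuple id).
rows = [
    ("Director A", "m1", "m2"),
    ("Director A", "m3", "m4"),
    ("Director B", "m1", "m5"),
]

def group_lineage(rows):
    """Group rows by director; each row contributes one clause ax AND by."""
    groups = defaultdict(list)
    for director, ax, by in rows:
        groups[director].append(frozenset({ax, by}))
    return dict(groups)

lineage = group_lineage(rows)
for director, clauses in lineage.items():
    print(director, [sorted(c) for c in clauses])
```

Note that the same TitleMatch tuple (here "m1") can appear in the formulas of several directors, so the candidate answers are not independent of each other.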

PROBLEM DEFINITION
We are given a set G = {G1, ..., Gn} of n objects with unknown probabilities p1, ..., pn, and a number k <= n. Our goal is to find a set of k objects with the highest probabilities, denoted TopK ⊆ G.
Solution: the idea of the algorithm is to run several Monte Carlo simulations in parallel, one for each candidate answer, and to approximate each probability only to the extent needed to compute the top-k answers correctly.

MONTE CARLO (MC) SIMULATION
An MC algorithm repeatedly chooses a possible world at random and computes the truth value of the Boolean expression G.E (Eq. (3)); the probability p = P(G.E) is approximated by the frequency p̃ with which G.E was true.
Luby and Karp have described the variant shown in the following algorithm:
    fix an order on the disjuncts: t1, t2, ..., tm
    C := 0
    repeat
        choose a random disjunct ti ∈ G
        choose a random truth assignment such that ti.E = true
        if for all j < i, tj.E = false then C := C + 1
    until N times
    return p̃ = C/N
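A minimal Python sketch of the Karp-Luby style estimator follows. It assumes independent tuples and hypothetical probabilities, and it fills in two details of the standard estimator that the slide's pseudocode leaves out: the disjunct is chosen with probability proportional to its own probability, and the hit frequency C/N is scaled by the sum of the disjunct probabilities.

```python
import random

def karp_luby(dnf, probs, n_samples, rng):
    """Estimate P(DNF) for independent tuples; dnf is a list of sets of tuple ids."""
    # Probability of each clause: product of its tuples' probabilities.
    clause_p = []
    for clause in dnf:
        p = 1.0
        for t in clause:
            p *= probs[t]
        clause_p.append(p)
    total = sum(clause_p)
    if total == 0.0:
        return 0.0
    hits = 0
    for _ in range(n_samples):
        # Pick a clause with probability proportional to its weight.
        i = rng.choices(range(len(dnf)), weights=clause_p)[0]
        # Sample a world conditioned on clause i being true:
        # its tuples are forced in, all others are drawn independently.
        world = {t for t in probs if t in dnf[i] or rng.random() < probs[t]}
        # Count the sample only if no earlier clause is also true.
        if not any(dnf[j] <= world for j in range(i)):
            hits += 1
    return total * hits / n_samples

# Hypothetical probabilities for E = (t1 AND t5) OR t2; exact answer is 0.44.
probs = {"t1": 0.4, "t2": 0.3, "t5": 0.5}
rng = random.Random(0)
estimate = karp_luby([{"t1", "t5"}, {"t2"}], probs, 20000, rng)
print(estimate)
```

Each call draws N independent samples; an incremental version that can be advanced one step at a time is what a multisimulation scheduler would need, but that is beyond this sketch.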

MULTISIMULATION (MS)
Assumptions:
Intervals:

MULTISIMULATION (MS)
Critical region: the critical region, top objects, and bottom objects are:
    (c, d) = (top_k(a1, ..., an), top_{k+1}(b1, ..., bn))    Eq. (5)
    T = {Gi | d <= ai}
    B = {Gi | bi <= c}
Algorithm:
    MS_TopK(G, k):  /* G = {G1, ..., Gn} */
    let [a1, b1] = ... = [an, bn] = [0, 1], (c, d) = (0, 1)
    while c <= d do
        Case 1: choose a double crosser to simulate
        Case 2: choose an upper and a lower crosser to simulate
        Case 3: choose a maximal crosser to simulate
        update (c, d) using Eq. (5)
    end while
    return TopK = T = {Gi | d <= ai}
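The bookkeeping in Eq. (5) can be sketched as below. The interval values are invented for illustration, the function names are hypothetical, and the sketch assumes k < n so that the (k+1)-th largest upper bound exists. It computes only the critical region and the sets T and B; the choice of which crosser to simulate next is not covered.

```python
def critical_region(intervals, k):
    """Return (c, d) as in Eq. (5): c is the k-th largest lower bound,
    d is the (k+1)-th largest upper bound. Assumes k < len(intervals)."""
    lows = sorted((a for a, _ in intervals.values()), reverse=True)
    highs = sorted((b for _, b in intervals.values()), reverse=True)
    return lows[k - 1], highs[k]

def top_and_bottom(intervals, k):
    """Return the critical region plus the top set T and bottom set B."""
    c, d = critical_region(intervals, k)
    top = {g for g, (a, _) in intervals.items() if d <= a}
    bottom = {g for g, (_, b) in intervals.items() if b <= c}
    return (c, d), top, bottom

# Hypothetical confidence intervals [a_i, b_i] after some simulation steps.
intervals = {
    "G1": (0.8, 0.9),
    "G2": (0.6, 0.7),
    "G3": (0.3, 0.5),
    "G4": (0.1, 0.4),
    "G5": (0.0, 0.2),
}
region, top, bottom = top_and_bottom(intervals, k=2)
start = critical_region({g: (0.0, 1.0) for g in intervals}, k=2)
print(region, sorted(top), sorted(bottom), start)
```

In this example c > d, so the loop condition `c <= d` fails and T already holds the two top objects. With every interval at its initial value [0, 1], the same function returns (0, 1), which matches the algorithm's starting state.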

EXAMPLE: LET US SELECT TOP
[Figure: probability intervals of G1, G2, G3, G4, G5 with the critical region (c, d)]

EXPERIMENTS

EXPERIMENTS RESULTS

THANK YOU