Querying Probabilistic XML Databases Asma Souihli Oct. 24 th 2012 Network and Computer Science Department.

Slides:



Advertisements
Similar presentations
NP-Hard Nattee Niparnan.
Advertisements

Exact Inference. Inference Basic task for inference: – Compute a posterior distribution for some query variables given some observed evidence – Sum out.
Variational Methods for Graphical Models Micheal I. Jordan Zoubin Ghahramani Tommi S. Jaakkola Lawrence K. Saul Presented by: Afsaneh Shirazi.
Jianxin Li, Chengfei Liu, Rui Zhou Swinburne University of Technology, Australia Wei Wang University of New South Wales, Australia Top-k Keyword Search.
Propositional and First Order Reasoning. Terminology Propositional variable: boolean variable (p) Literal: propositional variable or its negation p 
CSCI 3160 Design and Analysis of Algorithms Tutorial 4
Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.
An Efficient Membership-Query Algorithm for Learning DNF with Respect to the Uniform Distribution Jeffrey C. Jackson Presented By: Eitan Yaakobi Tamar.
PAPER BY : CHRISTOPHER R’E NILESH DALVI DAN SUCIU International Conference on Data Engineering (ICDE), 2007 PRESENTED BY : JITENDRA GUPTA.
Fast Algorithms For Hierarchical Range Histogram Constructions
The Theory of NP-Completeness
Properties of SLUR Formulae Ondřej Čepek, Petr Kučera, Václav Vlček Charles University in Prague SOFSEM 2012 January 23, 2012.
1 The Monte Carlo method. 2 (0,0) (1,1) (-1,-1) (-1,1) (1,-1) 1 Z= 1 If  X 2 +Y 2  1 0 o/w (X,Y) is a point chosen uniformly at random in a 2  2 square.
Efficient Query Evaluation on Probabilistic Databases
1 Efficient Subgraph Search over Large Uncertain Graphs Ye Yuan 1, Guoren Wang 1, Haixun Wang 2, Lei Chen 3 1. Northeastern University, China 2. Microsoft.
Frequent Subgraph Pattern Mining on Uncertain Graph Data
1 Polynomial Church-Turing thesis A decision problem can be solved in polynomial time by using a reasonable sequential model of computation if and only.
1 Optimization problems such as MAXSAT, MIN NODE COVER, MAX INDEPENDENT SET, MAX CLIQUE, MIN SET COVER, TSP, KNAPSACK, BINPACKING do not have a polynomial.
Algorithms in Exponential Time. Outline Backtracking Local Search Randomization: Reducing to a Polynomial-Time Case Randomization: Permuting the Evaluation.
The Theory of NP-Completeness
Complexity 19-1 Complexity Andrei Bulatov More Probabilistic Algorithms.
1 Discrete Structures CS 280 Example application of probability: MAX 3-SAT.
5/25/2005EE562 EE562 ARTIFICIAL INTELLIGENCE FOR ENGINEERS Lecture 16, 6/1/2005 University of Washington, Department of Electrical Engineering Spring 2005.
CS 188: Artificial Intelligence Spring 2007 Lecture 14: Bayes Nets III 3/1/2007 Srini Narayanan – ICSI and UC Berkeley.
CS 188: Artificial Intelligence Fall 2006 Lecture 17: Bayes Nets III 10/26/2006 Dan Klein – UC Berkeley.
Complexity Classes Kang Yu 1. NP NP : nondeterministic polynomial time NP-complete : 1.In NP (can be verified in polynomial time) 2.Every problem in NP.
Xpath Query Evaluation. Goal Evaluating an Xpath query against a given document – To find all matches We will also consider the use of types Complexity.
1 Distributed Monitoring of Peer-to-Peer Systems By Serge Abiteboul, Bogdan Marinoiu Docflow meeting, Bordeaux.
1 The Theory of NP-Completeness 2012/11/6 P: the class of problems which can be solved by a deterministic polynomial algorithm. NP : the class of decision.
Nattee Niparnan. Easy & Hard Problem What is “difficulty” of problem? Difficult for computer scientist to derive algorithm for the problem? Difficult.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Trust-Aware Optimal Crowdsourcing With Budget Constraint Xiangyang Liu 1, He He 2, and John S. Baras 1 1 Institute for Systems Research and Department.
Pricing Combinatorial Markets for Tournaments Presented by Rory Kulz.
A Survey Based Seminar: Data Cleaning & Uncertain Data Management Speaker: Shawn Yang Supervisor: Dr. Reynold Cheng Prof. David Cheung
Inference Complexity As Learning Bias Daniel Lowd Dept. of Computer and Information Science University of Oregon Joint work with Pedro Domingos.
Randomized Algorithms for Bayesian Hierarchical Clustering
1 The Theory of NP-Completeness 2 Cook ’ s Theorem (1971) Prof. Cook Toronto U. Receiving Turing Award (1982) Discussing difficult problems: worst case.
Propositional Calculus CS 270: Mathematical Foundations of Computer Science Jeremy Johnson.
NP-Complete Problems. Running Time v.s. Input Size Concern with problems whose complexity may be described by exponential functions. Tractable problems.
The famous “sprinkler” example (J. Pearl, Probabilistic Reasoning in Intelligent Systems, 1988)
Linear Program Set Cover. Given a universe U of n elements, a collection of subsets of U, S = {S 1,…, S k }, and a cost function c: S → Q +. Find a minimum.
Bayesian networks and their application in circuit reliability estimation Erin Taylor.
The Effect of Database Size Distribution on Resource Selection Algorithms Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University.
G. Cowan Lectures on Statistical Data Analysis Lecture 9 page 1 Statistical Data Analysis: Lecture 9 1Probability, Bayes’ theorem 2Random variables and.
Scrubbing Query Results from Probabilistic Databases Jianwen Chen, Ling Feng, Wenwei Xue.
NP-complete Languages
Non-LP-Based Approximation Algorithms Fabrizio Grandoni IDSIA
1 The Theory of NP-Completeness 2 Review: Finding lower bound by problem transformation Problem X reduces to problem Y (X  Y ) iff X can be solved by.
Indian Institute of Technology, Delhi Randomized Algorithms (2-SAT & MAX-3-SAT) (14/02/2008) Sandhya S. Pillai(2007MCS3120) Sunita Sharma(2007MCS2927)
Theory of Computational Complexity Probability and Computing Chapter Hikaru Inada Iwama and Ito lab M1.
Richard Anderson Lecture 26 NP-Completeness
Computability and Complexity
Time Complexity Costas Busch - LSU.
Approximate Inference
Computing Full Disjunctions
Probabilistic Data Management
Approximate Lineage for Probabilistic Databases
Spatial Online Sampling and Aggregation
Queries with Difference on Probabilistic Databases
Propositional Calculus: Boolean Algebra and Simplification
Complexity 6-1 The Class P Complexity Andrei Bulatov.
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 12
Instructors: Fei Fang (This Lecture) and Dave Touretzky
NP-Complete Problems.
المشرف د.يــــاســـــــــر فـــــــؤاد By: ahmed badrealldeen
CSE 6408 Advanced Algorithms.
Switching Lemmas and Proof Complexity
Probabilistic Databases with MarkoViews
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 12
Presentation transcript:

Querying Probabilistic XML Databases Asma Souihli Oct. 24 th 2012 Network and Computer Science Department

XML for semi-structured data (tree-like structure) 2

Probabilistic Data - PrXML Jung-Hee Yun and Chin-Wan Chung,

Context Uncertainty 4

Context  In many of these tasks, information is described in a semi-structured manner  Especially when the source (e.g., XML or HTML) is already in this form  Representation by means of a hierarchy of nodes is natural 5

Outline 1.PrXML Models Local Dependency Long-distance Dependency 2.Querying P-documents Types of Queries Probabilistic Lineage Complexity of Queries 3.The ProApproX System Computation Algorithms Lineage Decomposition Techniques Evaluation Plans Experiments 4.Conclusions 6

PrXML Models – Local Dependency Local dependency (mux and ind nodes) 7

Long-distance dependency (Conjunction of independent events  cie) Local dependency (mux and ind nodes) Parent node Child Ancestor node Child e2e2 e 3 Λ e Parent node e2e2 Child e1e1 ¬e1¬e1 Tractable translation PrXML {ind,mux} PrXML {cie} S. Abiteboul, B. Kimelfeld, Y. Sagiv, and P. Senellart (With e 1 = 0.3) Child PrXML Models – Long-distance Dependency 8

Repository t1t1 t2t2 Employee Name Details work Place Telecom Paristech Contact Phoneaddress +33(0) Paris 13 Asma Souihli... e2e2 e3e3 e 4 Λ¬e 5 address Paris 15 e5e5 e1e1 e1e1 e6e6 e7e7 e8e8 Example Pr (e 1 ) =.9 Pr (e 2 ) =.8 Pr (e 3 ) =.4 Pr (e 4 ) =.1 Pr (e 5 ) =.6 Pr (e 6 ) =.3 Pr (e 7 ) =.2 Pr (e 8 ) =.8 9 K&K World

Outline 1.PrXML Models Local Dependency Long-distance Dependency 2.Querying P-documents Types of Queries Probabilistic Lineage Complexity of Queries 3.The ProApproX System Computation Algorithms Lineage Decomposition Techniques Evaluation Plans Experiments 4.Conclusions 10

Querying P-documents – Types of Queries o Tree Pattern Queries (TPQ) o Tree Pattern Queries with joins (TPQJ) 11 ABCABC A B C D A B C D

Example  Q1: / Employee [Name= "Asma Souihli"] // / text() enst.fr: e 2 Λ e 8 Λ e 1 C1 gmail.com: e 2 Λ e 8 Λ e 6 C2 senellart.com: e 2 Λ e 9 Λ e 10 C3 gmail.com : e 2 Λ e 9 Λ e 6 C4 Repository t1t1 Employee Name Details Contact Asma Souihli e2e2 e1e1 e6e6 e8e8 t2t2 e 10 e6e6 Contact e9e9 e 1 =.9 e 2 =.8 e 9 =.6 e 10 =.7 e 6 =.3 e 8 =.8 12

Querying PrXML – Probabilistic Lineage  Probability to find an Pr(Q1) = Pr( C1 V C2 V C3 V C4 )  Possible results: Pr( ) = Pr( C2 V C4 ) Pr( ) = Pr( C1 ) Pr( ) = Pr( C3 ) Probabilistic lineage (DNF shape) 13

 When is a linear computation possible? o if C 1 and C 2 are independent, then: Pr(C 1 ∧ C 2 ) = Pr(C 1 ) × Pr(C 2 ) Pr(C 1 ∨ C 2 ) = 1 − ( (1 − Pr(C 1 ) ) × (1 − Pr(C 2 )) ) o if C 1 and C 2 are inconsistent (mutually exclusive), then: Pr(C 1 ∨ C 2 ) = Pr(C 1 ) + Pr(C 2 ) 14 V + Λ Querying PrXML – Probabilistic Lineage

Back to the Example ) = Pr( C1 ) = Pr(e 2 Λ e 8 Λ e 1 ) =.8 x.8 x.9 = ) = Pr( C3 ) = ) = Pr( C2 V C4 ) = (e 2 Λ e 8 Λ e 6 ) V ( e 2 Λ e 9 Λ e 6 ) Factorization: Pr ) = (e 2 Λ e 6 ) Λ (e 8 V e 9 ) =.8 x.3 x (1 -(1-.8)(1-.6)) = e 1 =.9 e 2 =.8 e 9 =.6 e 10 =.7 e 6 =.3 e 8 =.8 15

Pr(Q1) = Pr( C1 V C2 V C3 V C4 ) = Pr [ (e 2 Λ e 8 Λ e 1 ) V (e 2 Λ e 8 Λ e 6 ) V (e 2 Λ e 9 Λ e 10 ) V (e 2 Λ e 9 Λ e 6 ) ] Factorization: = Pr [e 2 Λ ( (e 8 Λ (e 1 V e 6 ) ) V (e 9 Λ (e 10 V e 6 ) ) ) ] Difficult to evaluate ! e 1 =.9 e 2 =.8 e 9 =.6 e 10 =.7 e 6 =.3 e 8 =.8 Querying PrXML – Probabilistic Lineage 16

Solutions..  One possible (naïve) way, is to find the truth value assignments that satisfy the propositional formula (probabilistic lineage) (out of 2 #literals possible assignments/worlds !)  And sum the probabilities of these satisfying assignments to get the answer e1e1 e2e2 e6e6 e8e8 e9e9 e 10 Probability C1 V C2 V C3 V C4 false false true0.3345false truefalse0.87false …………………… 17

 Probabilities of the satisfying assignments for the DNF (lineage formula) : #P-Hard problem o No polynomial time algorithm for the exact solution if P≠NP o #P problems ask "how many" rather than "are there any“ 18 How many graph coloring using k colors are there for a particular graph G? Querying PrXML – Complexity of Queries

 A union of sets (clauses) problem: #P-Hard problem 19 Querying PrXML – Complexity of Queries

Outline 1.PrXML Models Local Dependency Long-distance Dependency 2.Querying P-documents Types of Queries Probabilistic Lineage Complexity of Queries 3.The ProApproX System Computation Algorithms Lineage Decomposition Techniques Evaluation Plans Experiments 4.Conclusions 20

 Translates into a probabilistic database with only cie nodes  Translates the user query into a lineage query 21 Query translation BaseX (querying) PrXML database ProApproX (Processing) User input : XPath Query Q 5 Result Pr(Q) Answer User Interface The ProApproX System [CIKM 2012, SIGMOD 2011]

Back to the Example Q1: / Employee [Name= "Asma Souihli"] // / text()  To get the lineage for the boolean projection : for $x1 in /employee for $x2 in $x1/name[.="Asma Souihli"] for $x3 in $x1// /text() let $leaves:=($x2,$x3) let $atts:=(for $i in $leaves return $i/ancestor-or-self::*/attribute(event)) return text{distinct-values(for $att in $atts return string($att))}  To get lineages of answers: for $val in distinct-values(/employee [name="Asma Souihli"]// /text()) order by $val return {$val}{ for $x1 in /employee for $x2 in $x1/name[.="Asma Souihli "] for $x3 in $x1// /text() let $leaves:=($x2,$x3) let $atts:=(for $i in $leaves return $i/ancestor-or-self::*/attribute(event)) where $x3=$val return {distinct-values(for $att in $atts return string($att))} } 22

 Translates into a probabilistic database with only cie nodes  Translates the user query into a lineage query  Is built on top of a native XML DBMS  Processes the lineage formula to get the probability of the query (and of each matching answer) 23 Query translation BaseX (querying) PrXML database ProApproX (Processing) User input : XPath Query Q 5 Result Pr(Q) Answer User Interface The ProApproX System and 2011

 Translates into a probabilistic database with only cie nodes  Translates the user query into a lineage query  Is built on top of a native XML DBMS  Processes the lineage formula to get the probability of the query (and of each matching answer) 24 The ProApproX System Query translation BaseX (querying) PrXML database ProApproX (Processing) User input : XPath Query Q 5 Result Pr(Q) Answer User Interface [CIKM 2012, SIGMOD 2011] and

 Additive approximation: o For a fixed error ε and a DNF F, A(F) is an additive ε-approximation of Pr(F) with a probability of at least δ (a fixed reliability factor) if: Pr(F) - ε ≤ A(F) ≤ Pr(F) + ε  Multiplicative Approximation o For a fixed error ε, a DNF F, A(F) is an multiplicative ε-approximation of Pr(F) with a probability of at least δ if: (1 - ε) Pr(F) ≤ A(F) ≤ (1 + ε) Pr(F) 25 The ProApproX System – Computation Algorithms

DEMO 1 [SIGMOD 2011] 26

27 The ProApproX System – Computation Algorithms  Exact Computations: o The naïve algorithm – Possible worlds Finding the satisfying assignments out of 2 #variables possible truth value assignments (2 n ) o The sieve algorithm – The inclusion-exclusion principle Exponential in the number of clauses m (2 m )

 Approximations: o Naïve Monte Carlo sampling for additive app. : Linear but could take exponentially many samples to converge to a good approximation for low probabilities o Biased Monte Carlo sampling for multiplicative app. : Running time grows in ( 3 ln ) in the number of clauses o Self-Adjusting Coverage Algorithm for the DNF probability problem: Linear in the length of F times ln(1/) / 2 28 M. Karp, M. Luby, and N. Madras Kimelfeld, Kosharovsky, and Sagiv The ProApproX System – Computation Algorithms

29  Possibility to derive a multiplicative approximation from an additive approximation (and vice versa)  Cost models and cost constants:

30 e 3 Λ (e 4 V e 5 ) Factorization Exact /naïve Algo. OR Approximation Pr(F) + V Λ V Λ The ProApproX System – Lineage Decomposition Techniques (e 6 Λ e 8 )(e 1 Λ e 2 )(e 3 Λ e 4 )(e 3 Λ e 5 )(¬e3)(¬e3) (e 6 Λ e 7 ) (e8)(e8) VVVVV F =F = V VVVVVV F =F =

DEMO 2 [CIKM 2012] 31

 Propagation of (and ) :  Many possible values for 1 and 2 can be found  Best assignments are not always obvious 32 The ProApproX System – Evaluation Plans Proposition1. Let = 1 2, and assume p̃ 1 and p̃ 2 are additive approximations of Pr( 1 ) and Pr( 2 ), to a factor of 1 and 2, respectively. Then 1-(1- p̃ 1 )(1- p̃ 2 ) is an additive approximation of Pr() to a factor of if: ≤ V

The ProApproX System – Possible Evaluation Plans Deterministic exploration: cost 1 =1 cost=200 cost 2 =35 cost 3 =8 cost 4 =6cost 7 =1 cost 8 =15 cost 5 =3 cost 6 =2 cost 9 =8 cost 10 =12 cost 11 =10cost 12 =9

Running time of the different algorithms on the MondialDB dataset The ProApproX System – Experiments 34

Proportion of time (MondialDB - Best Tree ) Relative error on the probabilities computed by the algorithm on the MondialDB over each non join query with respect to the exact probability values ( = 0.1, = 95%) The ProApproX System – Experiments 35

Running time of the different algorithms on a given query of the movie dataset. (times greater than 5s are not shown) The ProApproX System – Experiments 36

Running time of the different algorithms on the synthetic dataset The ProApproX System – Experiments 37

Outline 1.PrXML Models Local Dependency Long-distance Dependency 2.Querying P-documents Types of Queries Probabilistic Lineage Complexity of Queries 3.The ProApproX System Lineage Decomposition Techniques Computation Algorithms Demo Evaluation Plans Experiments 4.Conclusions 38

Contributions  We have introduced an original optimizer-like approach to evaluating query results over probabilistic XML  Over a more expressive PrXML model  Positive tree-pattern queries, possibly with joins 39

Contributions  Main observation - optimal probability evaluation algorithm to use depends on the characteristics of the formula: o Few variables naïve algorithm o Few clauses sieve algorithm o Monte-Carlo is very good at approximating high probabilities o Sometimes the structure of a query makes the probability of a query easy to evaluate (EvalDP) o Refined approximation methods best when everything else fails (coverage) 40

Contributions  Different algorithms can be used to evaluate the probability of different parts of a formula: o Formula decomposition technique o Exploration of possible evaluation trees (cost model)  The system is built on top of a native DBMS 41

 Exploiting the structure of the query to obtain factorized lineage  Most evaluation algorithms scale effortlessly (with the exception of the self-adjusting coverage algorithm, which requires synchronization) o distribute the probability computation over multi-core or distributed architectures  Processing DNFs, but the technique could probably be extended to arbitrary formulas  Define the range of negated TPQ queries having a DNF lineage Perspectives 42

Thank you.