A Temporal Probabilistic Database Model for Information Extraction

Slides:



Advertisements
Similar presentations
Artificial Intelligence
Advertisements

Answering Approximate Queries over Autonomous Web Databases Xiangfu Meng, Z. M. Ma, and Li Yan College of Information Science and Engineering, Northeastern.
First Order Logic Logic is a mathematical attempt to formalize the way we think. First-order predicate calculus was created in an attempt to mechanize.
Computer Science CPSC 322 Lecture 25 Top Down Proof Procedure (Ch 5.2.2)
Relational Calculus and Datalog
Biointelligence Lab School of Computer Sci. & Eng.
Queries with Difference on Probabilistic Databases Sanjeev Khanna Sudeepa Roy Val Tannen University of Pennsylvania 1.
PAPER BY : CHRISTOPHER R’E NILESH DALVI DAN SUCIU International Conference on Data Engineering (ICDE), 2007 PRESENTED BY : JITENDRA GUPTA.
Answer Set Programming Overview Dr. Rogelio Dávila Pérez Profesor-Investigador División de Posgrado Universidad Autónoma de Guadalajara
Logic.
2001/12/181/50 Discovering Robust Knowledge from Databases that Change Author: Chun-Nan Hsu, Craig A. Knoblock Advisor: Dr. Hsu Graduate: Yu-Wei Su.
1 Relational Algebra & Calculus. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.  Relational.
Efficient Query Evaluation on Probabilistic Databases
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 52 Database Systems I Relational Algebra.
Constraint Logic Programming Ryan Kinworthy. Overview Introduction Logic Programming LP as a constraint programming language Constraint Logic Programming.
Relational Data Mining in Finance Haonan Zhang CFWin /04/2003.
Catriel Beeri Pls/Winter 2004/5 type reconstruction 1 Type Reconstruction & Parametric Polymorphism  Introduction  Unification and type reconstruction.
Institut für Scientific Computing – Universität WienP.Brezany Fragmentation Univ.-Prof. Dr. Peter Brezany Institut für Scientific Computing Universität.
Proof System HY-566. Proof layer Next layer of SW is logic and proof layers. – allow the user to state any logical principles, – computer can to infer.
Let remember from the previous lesson what is Knowledge representation
1 Relational Algebra and Calculus Yanlei Diao UMass Amherst Feb 1, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Belief Revision Lecture 1: AGM April 1, 2004 Gregory Wheeler
Notes for Chapter 12 Logic Programming The AI War Basic Concepts of Logic Programming Prolog Review questions.
Chapter 12 Review of Calculus and Probability
DECIDABILITY OF PRESBURGER ARITHMETIC USING FINITE AUTOMATA Presented by : Shubha Jain Reference : Paper by Alexandre Boudet and Hubert Comon.
Discrete Mathematical Structures (Counting Principles)
1  Special Cases:  Query Semantics: (“Marginal Probabilities”)  Run query Q against each instance D i ; for each answer tuple t, sum up the probabilities.
Christopher Re and Dan Suciu University of Washington Efficient Evaluation of HAVING Queries on a Probabilistic Database.
Pattern-directed inference systems
HAWKES LEARNING Students Count. Success Matters. Copyright © 2015 by Hawkes Learning/Quant Systems, Inc. All rights reserved. Section 1.1 Thinking Mathematically.
1 Discovering Robust Knowledge from Databases that Change Chun-Nan HsuCraig A. Knoblock Arizona State UniversityUniversity of Southern California Journal.
1 Relational Algebra & Calculus Chapter 4, Part A (Relational Algebra)
1 Relational Algebra and Calculas Chapter 4, Part A.
Formal Specification of Intrusion Signatures and Detection Rules By Jean-Philippe Pouzol and Mireille Ducassé 15 th IEEE Computer Security Foundations.
Propositional Calculus CS 270: Mathematical Foundations of Computer Science Jeremy Johnson.
Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. L06-1 September 26, 2006http:// Type Inference September.
LAC group, 16/06/2011. So far...  Directed graphical models  Bayesian Networks Useful because both the structure and the parameters provide a natural.
Copyright © Cengage Learning. All rights reserved.
LDK R Logics for Data and Knowledge Representation Propositional Logic: Reasoning First version by Alessandro Agostini and Fausto Giunchiglia Second version.
Logical Agents Chapter 7. Outline Knowledge-based agents Logic in general Propositional (Boolean) logic Equivalence, validity, satisfiability.
Conditional Probability Mass Function. Introduction P[A|B] is the probability of an event A, giving that we know that some other event B has occurred.
Database System Concepts, 6 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Module A: Formal Relational.
© Copyright 2008 STI INNSBRUCK Intelligent Systems Propositional Logic.
DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.
272: Software Engineering Fall 2012 Instructor: Tevfik Bultan Lecture 9: Test Generation from Models.
1 Reasoning with Infinite stable models Piero A. Bonatti presented by Axel Polleres (IJCAI 2001,
Daniel Kroening and Ofer Strichman 1 Decision Procedures An Algorithmic Point of View Basic Concepts and Background.
Logical Agents. Outline Knowledge-based agents Logic in general - models and entailment Propositional (Boolean) logic Equivalence, validity, satisfiability.
Boolean Algebra & Logic Gates
Probabilistic Inference Modulo Theories
Chapter 7. Classification and Prediction
CS589 Principles of DB Systems Spring 2014 Unit 2: Recursive Query Processing Lecture 2-1 – Naïve algorithm for recursive queries Lois Delcambre (slides.
Introduction to Logic for Artificial Intelligence Lecture 2
Propositional Calculus: Boolean Functions and Expressions
Chapter 6: Formal Relational Query Languages
Mathematical Induction Recursion
Relational Algebra 461 The slides for this text are organized into chapters. This lecture covers relational algebra, from Chapter 4. The relational calculus.
Propositional Calculus: Boolean Algebra and Simplification
Non-Standard-Datenbanken
Electrical and Computer Engineering Department
Logic: Top-down proof procedure and Datalog
MA/CSSE 474 More Math Review Theory of Computation
Chapter 2: Intro to Relational Model
Logic: Domain Modeling /Proofs + Computer Science cpsc322, Lecture 22
Probabilistic Databases
Chapter 6: Formal Relational Query Languages
Non-Standard-Datenbanken
This Lecture Substitution model
Probabilistic Databases with MarkoViews
Logical Agents Prof. Dr. Widodo Budiharto 2018
Presentation transcript:

A Temporal Probabilistic Database Model for Information Extraction Martin Theobald University of Antwerp Antwerp, Belgium Iris Miliaraki Max Planck Institute for Informatics Saarbr¨ucken, Germany Maximilian Dylla Max Planck Institute for Informatics Saarbr¨ucken, Germany

Motivation For cleaning uncertain temporal facts obtained from information extraction methods To develop a query engine which is capable of scaling to very large temporal knowledge bases with nearly interactive query response times over millions of uncertain facts and hundreds of thousands of grounded rules To develop a temporal-probabilistic database model that supports both time and probability as first class citizens

Problem Statement As information extraction techniques inherently deliver uncertain data, they have often been considered as one of the main motivating applications for PDBs(Probabilistic DataBases) TDBs(Temporal DataBases) have been an active field of research for many years, but so far there exists no TDB system that also supports probabilistic data or vice versa Not aware of any unified temporal and probabilistic database approach that natively supports the kinds of data and constraints they aim to apply to their setting

Preliminaries Temporal Database: It can be defined as a triple (S, F, ΩT ) consisting of a database schema S, a finite set of base tuples F, and the time domain ΩT Base tuples F(set of base facts) captures in all the input instances RiT over S Time Domain(ΩT) is the finite sequence of time points Database Schema S is the collection of all relation schemas (R1T,R2T ,..........,RNT) A temporal relation RT is represented as RT (A1, . . . ,Am )[Tb, Te), where RT denotes the relation name, A1, . . . ,Am is a finite set of attributes of RT with domains Ωi, and Tb, Te are attributes denoting the begin and end time points of the time intervals that are attached to each tuple temporal relations employs tuple timestamping, i.e., each tuple is annotated by a single time interval that specifies its valid-time

Probabilistic Database: Informally, a probabilistic database can be defined as a discrete probability distribution over a finite number of database instances (the “possible worlds”) Formally, we thus define a tuple-independent probabilistic database DBp = (S,F, p) as a triple consisting of a database schema S, a finite set of base facts F, and a probability measure p : F → (0, 1], which assigns a probability value p(f) to each uncertain base fact f ∈ F. A probabilistic relation Rp is represented as Rp(A1, . . . ,Am, p), where Rp denotes the relation name, A1, . . . ,Am denote a finite set of attributes of Rp with domains Ωi, p is an attribute which denotes the probability value that is attached to each fact in a relation instance Rp of Rp p(f) denotes the confidence in the existence of the fact in the database

Conditioning a PDB P(ϕ) := ΣW|=ϕ,W⊆FP(W) P(ϕ | ψ) :=P(ϕ ∧ ψ)/P(ψ) TDB uses time interval and PDB uses confidence for the validity of tuples Now, the third concept is CONDITIONING which allows for additionally incorporating consistency constraints into a possible-worlds-based representation formalism Formally, we can compute the marginal probability of any Boolean formula ϕ over variables in F as the sum of the probabilities of all the possible worlds W ⊆ F that entail ϕ, which is denoted as W |= ϕ. P(ϕ) := ΣW|=ϕ,W⊆FP(W) Conditioning a formula ϕ on another formula ψ, which also ranges over variables in F, then simply denotes the process of computing the following conditional probability: P(ϕ | ψ) :=P(ϕ ∧ ψ)/P(ψ) Intuitively, conditioning removes the possible worlds from a PDB, in which the constraints are not satisfied, and thus re-weighs the remaining worlds to again form a probability distribution

TEMPORAL-PROBABILISTIC DATABASE MODEL This model is the combination and extension of the TDB and PDB, into a unified temporal-probabilistic database model, in order to facilitate both rule-based and probability inference over uncertain temporal knowledge Extension employs two key concepts: Temporal deduction rules: It allows to represent all of the core operations known from relational algebra over these temporal and probabilistic relations; Temporal consistency constraints: It allows to constrain the possible worlds at which query answers are considered to be valid.

Temporal Deduction Rules: Temporal deduction rule over a database schema S is a first-order logical rule of the form RT0(X̅0)[Tb0,Te0 ) ← (∧i =1,...,n RTi ( X̅i)[Tbi,Tei) ∧ ∧j=1,...,m ¬RTj ( X̅j)[Tbj,Tej) ∧ Φ(X̅)) Where: RT0 denotes the head literal intensional temporal relation in S, while RTi , RTj refer to extensional or intensional temporal relations in S; n ≥ 1, m ≥ 0, thus requiring at least one positive relational literal; Φ( X̅) is a conjunction of literals over the arithmetic predicates =, ≠ and ≤T , whose arguments are in X̅ ; X̅0, X̅i, X̅j denote tuples of non-temporal variables and constants, such that Var(X̅0) ⊆ ∪i Var( X̅i) and Var(X̅j)⊆ ∪i Var( X̅i); X̅ denotes a tuple of both temporal and non-temporal variables and constants, such that Var( X̅ ) ⊆ ∪i Var( X̅i)∪ ∪i{Tbi , Tei}; Tb0 , Te0 are head's temporal arguments, such that Tb0,Te0 ∈ ∪i{Tbi , Tei} ∪ {tmin, tmax }.

Example 1 Rule 1: AreMarried(X, Y )[Tb1,tmax )←(Wedding(X, Y )[Tb1,Te1 ) ∧ ¬ Divorce(X, Y )[Tb2,Te2 )) couple stays married from the begin time point of their wedding until to the last possible time point we consider (tmax) unless there is a Divorce fact at any other time interval. Rule 2: AreMarried(X, Y )[Tb1,Te2 )←Wedding(X, Y)[Tb1,Te1 ) ∧ Divorce(X, Y )[Tb2,Te2 ) ∧ (Te1≤TTb2)) couple stays married from the begin time point of their wedding till the end time point of their divorce Rule 3: ¬ (BornIn(X, Y )[Tb1,Te1) ∧ BornIn(X,Z)[Tb2,Te2) ∧ Y ≠ Z) This constraint tells that if two facts are mutually exclusive(such as two birth facts of same person), at least one of the two BornIn facts has to be set to false in all the possible instances Rule 4: ¬ (BornIn(X, Y )[Tb1,Te1) ∧ AreMarried(X,Z)[Tb2,Te2 )∧ Tb2≤TTe1) This consistency constraint uses a temporal condition to express that a marriage should begin after the respective person was born

Grounding: Definition 2. Let Ψ be a conjunctive rst-order formula and F a set of facts. Then, the grounding function G(Ψ,F) instantiates Ψ over F into a set of propositional lineage formulas {ψ0, . . . , ψn}. Example 2. Applying Grounding(G) to the body of Rule (2) and the set of facts {f3, f4, f5} yields the lineage formulas {f3∧f5, f4∧f5}. By further instantiating Rule (2)'s head with the arguments of f4∧f5, we obtain the new fact f′= AreMarried(DeNiro,Abbott)[1976-07-29,1988-12-01). Regarding the lineage, we have ϕ(f′) = f4∧f5. All facts which can be deduced from Rules (1) and (2) are shown together with their lineages and time intervals in the middle part of Fig. 1.

Deduplicating Facts Definition 4. Let a temporal relation RT , non-temporal constants ¯a, a time point t ∈ ΩT , and a set of facts F be given. Then, L is dened as the set of lineages of fact RT (¯a) that are valid at time point t: L(RT, a̅, t,F) := {ϕ(f) | f = RT ( a̅)[tb,te) ∈ F, tb ≤ t < te} We create duplicate free facts f′= RT (¯a)[tb,te), such that for any pair of time points t0, t1 ∈ [tb, te) it holds that: L(RT, a̅ t0,F) = L(RT, a̅, t1,F) Furthermore, we dene the new facts' lineage to be: ϕ(f′) :=∨ψi∈L(RT,a,tb,F)ψi Example 3. Applying Definition 4 to the facts in the middle of Fig. 1 yields the facts shown at the top of the figure. For instance, if we inspect the time points 1976- 07-28 and 1976-07-29, we notice that {f3∧f5, f3∧¬f5} ≠ {f3∧f5, f3∧¬f5, f4∧f5, f4∧¬f5}, so two differing facts f6 and f7 have to be kept in the relation. Thus, the resulting duplicate-free facts are the following. Id Fact T p f6 AreMarried(DeNiro,Abbott) [1936-11-01, 1976-07-29) 0.3 f7 AreMarried(DeNiro,Abbott) [1976-07-29, 1988-12-01) 0.79 f8 AreMarried(DeNiro,Abbott) [1988-12-01, tmax ) 0.158 Following Eq. (10), their lineages are as follows. ϕ(f6) = (f3∧f5) ∨ (f3 ∧ ¬f5) ϕ(f7) = (f3∧f5) ∨ (f3∧¬f5) ∨ (f4∧f5) ∨ (f4 ∧ ¬f5) ϕ(f8) = (f3∧¬f5) ∨ (f4∧¬f5)

Temporal Consistency Constraints: A temporal consistency constraint over a database schema S is a rst-order logical rule of the form ㄱ(⋀i RTi ㄱ(X̅i)[Tbi,Tei) ∧ Φ(X̅) ) where: RTi denote extensional or intensional temporal relations in S; Φ( X̅) is a conjunction of literals relating to the arithmetic predicates =, ≠, and ≤T with arguments in X̅ ; The X̅i denote tuples of non-temporal variables and constants, and Tbi , Tei are temporal variables and constants, such that Var( X̅) ⊆ ∪i(Var( X̅i) ∪ {Tbi , Tei} Rule 3: ¬ (BornIn(X, Y )[Tb1,Te1) ∧ BornIn(X,Z)[Tb2,Te2) ∧ Y ≠ Z) This constraint tells that if two facts are mutually exclusive(such as two birth facts of same person), at least one of the two BornIn facts has to be set to false in all the possible instances Rule 4: ¬ (BornIn(X, Y )[Tb1,Te1) ∧ AreMarried(X,Z)[Tb2,Te2 )∧ Tb2≤TTe1) This consistency constraint uses a temporal condition to express that a marriage should begin after the respective person was born

Conditioning: C := ⋀ψi∈G(Ci,F+),Ci∈Cψi Let C = {C0, . . . , Cn} be a set of temporal consistency constraints, and let F+ be a set containing both base facts F and facts deducible from a set of temporal deduction rules D. Then, we dene C to be the conjunction of all grounded constraints over F+. C := ⋀ψi∈G(Ci,F+),Ci∈Cψi The confidence of a fact f ∈ F+ with lineage ϕ(f) and with respect to the grounded constraints C is then calculated as: P(ϕ(f) | C) := P(ϕ(f) ∧ C) / P(C) Example 4: Grounding the Constraint (3) against the facts of Example 1 yields ¬(f1 ∧ f2), where the arithmetic literal is omitted from the grounded formula, since it evaluates to true. Altogether, grounding Constraints (3) and (4) results in C = ¬(f1 ∧ f2) ∧ ¬(f1 ∧ ϕ(f6)) ∧ ¬(f2 ∧ ϕ(f6)) ∧ ¬(f2 ∧ ϕ(f7)), where all arithmetic predicates have already been evaluated correspondingly.

Temporal-Probabilistic Database: Definition: A temporal-probabilistic database DBTp =(S,F,D, C,ΩT , p) is a six-tuple consisting of a database schema S, a set of base facts F, a set of temporal deduction rules D, a set of temporal consistency constraints C, a time domain ΩT , and a probability measure p. Queries over TPDP Conjunction of literals Similar conditions as for the deduction rules regarding safeness hold. That is, every non-temporal variable that occurs in a negated relational literal or in an arithmetic literal must also occur in at least one of the positive relational literals. Eg: ¬BornIn(X, Y )[Tb1,Te1) should have a positive relational literal BornIn(X,Y)[Tb1,Te1) Every temporal variable that occurs in an arithmetic literal must also occur in at least one of the relational literals ¬ (BornIn(X, Y )[Tb1,Te1) ∧ AreMarried(X,Z)[Tb2,Te2 )∧ Tb2≤TTe1)

Algorithms The basic concept of the algorithm for the proposed TPDB is: Ground the deduction rules and constraints Deduplicate the facts with overlapping time intervals Lineage Decompositions to reduce shannon’s expansion Final Confidence Computation

Grounding: The process of grounding of deduction rules and constraints is divided into two phase: First phase: Ground all the deduction rules in a bottom-up manner while tracing the lineage of derived facts Second phase: instantiates constraints against all facts (i.e., both the base and the deduced facts). Example 6. We initialize Algorithm 1 using F := {f1 ,. . . , f5 }, D comprising Rules (1) and (2), and C containing Constraints (3) and (4) of Example 1. First, we initialize Plan with [ AreMarried→ {(1), (2)}], and we set F+ = {f1 , . . . , f5 }. The loop in Line 3 performs only one iteration where RT = AreMarried . Assuming that the loop in Line 5 selects (2) first, we instantiate the head to two new facts AreMarried (DeNiro,Abbott) [1936−11−01,1988−12−01) and Marriage (DeNiro, Abbott)[1976−07−29,1988−12−01) with lineages f3∧f5 and f4∧f5 ,respectively. Their lineage and that of the two facts deduced by Rule (1) are shown in Example 3. Fig. 1’s top part corresponds to the output of Deduplicate (Line 8). When we reach Line 9, F+ = {f1 , . . ,f8}. Next, the constraints are grounded as shown already in Example 4

2. Deduplicating Facts: Example 7. We apply Algorithm 2 to the facts deduced by Rules (1) and (2), hence constructing the lineage for the facts presented in Example 3. First, we initialize Limits as {1936-11-01, 1976-07-29, 1988-12-01, t max }, set Begin to [1936-11-01 → {f3∧f5 , f3∧¬f5 }, 1976-07-29 → {f4∧f5 , f4∧¬f5 }], and assign [1988-12-01 → {f3∧f5 , f4∧f5 }, tmax →{f3∧¬f5 , f4∧¬f5 }] to End. In the first iteration of the loop, Active is empty. In the next iteration, we add f 6 = AreMarried (DeNiro, Abbott) [1936-11-01,1976-07-29) to Result and set its lineage to φ(f 6 ) = (f3∧f5 ) ∨ (f3∧¬f5 ). The last two iterations produce f 7 , f 8 as shown in Example 3

3. Confidence Computation Example 8. In Example 3 the lineage of f 6 was defined as (f 3 ∧ f 5 ) ∨ (f 3 ∧ ¬f 5 ). Now, let us compute its confidence: P ((f3∧f5) ∨ (f3∧¬f5 )) = P (f3∧f5 ) + P(f3∧¬f5 ) = P(f3 ) · P(f5 ) + P(f3) · (1 − P(f5 )) = 0.3·0.8 + 0.3·0.2 = 0.3

4. Lineage Decomposition Definition: Let φ ≡ φ 0 op . . . op φ n be a lineage formula with operator op ∈ {∧, ∨}, and let φ 0 to φ n be sub-formulas of φ. Then, we define the function S to return the number of necessary Shannon expansions as follows: S(φ) = |{f | ∃i≠j : f ∈ F (φi ), f ∈ F (φj )}| Example 10. As an example, we apply S to φ(f 7 ) (see Example 3). That is op = ∨ and S((f 3 ∧ f 5 ) ∨ (f 3 ∧ ¬f 5 ) ∨ (f 4 ∧ f 5 ) ∨ (f 4 ∧ ¬f 5 )) = |{f 3 , f 4 , f 5 }| = 3. Thus, a naive computation of P (φ(f 7 )) takes 23 iterations. Example 11. We decompose Example 3’s φ(f 7 ) = (f3∧f5)∨(f3∧¬f5) ∨ (f4 ∧f5) ∨ (f4 ∧¬f5 ) by Algorithm 3 with θ = 0. As discussed in Example 10, S(φ(f 7 )) = 3, so we set φ ′ = φ(f 7 ). In Line 4, we receive m = 0 because f 5 occurs in all lineage subformulas. Hence, in Line 9, we choose f 5 and rewrite φ(f 7 ) to (f 5 ∧ (f 3 ∨ f 4 )) ∨ (¬f 5 ∧ (f 3 ∨ f 4 )) in Line 10. Now, all Shannon expansions are gone, yielding P (φ(f 7 )) = (0.8 · (1 − (1 − 0.3) · (1 − 0.7))) + (0.2 · (1 − (1 −0.3) · (1 − 0.7))) = 0.8 which can be computed via Eq. (13)

Experimental Results

Conclusion They developed a temporal-probabilistic database model that supports both time and probability as first class citizens They introduced an expressive class of temporal deduction rules and temporal consistency constraints allowing for a closed and complete representation formalism for temporal uncertain data, and analyzed the properties of their data model from a theoretical perspective Also, we proposed efficient algorithms, which we evaluated on various information-extraction tasks As future work, They aim to investigate the support for periodical temporal data and background knowledge (e.g., presidential elections) And to extend their data model to non-independent (i.e., correlated) base facts