Lois Delcambre (slides from Dave Maier) Recursive Query Processing Naïve and semi-naïve query processing Lecture 2-2 Lois Delcambre (slides from Dave Maier)
Dependency Graphs for Datalog Programs
Datalog program no recursion with negation: dependency graph Given a DB with two relations: Topics(Topic) and Interests(Person,Topic) What is this query computing? Diff(a) :- Prod(a,b), Interests(a,b). Prod(a,b) :- Interests(a,c), Topics(b). Create one node for each body predicate. Draw an arrow from body predicate to corresponding head predicate. Acyclic dependency graph ≡ not recursive. It’s useful to label arrows if predicate is negative. Topics Interests Prod Diff
Datalog program no recursion with negation: dependency graph Given a DB with two relations: Topics(Topic) and Interests(Person,Topic) What is this query computing? Diff(a) :- Prod(a,b), Interests(a,b). Prod(a,b) :- Interests(a,c), Topics(b). Evaluate predicates completely – according to the order of operations in the dependency graph. Topics Interests Prod Diff
Datalog with recursion without negation: dependency graph Anc(x,y) :- Parent-of(x,y). Anc(x,z) :- Anc(x,y), Parent-of(y,z). Recursive Datalog program has a cycle in the dependency graph. Parent-of Anc
Datalog with recursion without negation Needed(D, M, C) :- MajAll(M, C), Offered(D, M). Needed(D, M, C) :- DegAll(D, C), Offered(D, M). MajAll(M, C) :- MajReq(M, C). MajAll(M, C) :- MajAll(M, C1), Prereq(C, C1). DegAll(D, C) :- DegReq(D, C). DegAll(D, C) :- DegAll(D, C1), Prereq(C, C1). Needed MajAll DegAll Offered MajReq Prereq DegReq
Here too … you can evaluate in order of dependency graph Compute DegAll from DegReq and Prereq and DegAll Compute MajorAll from MajReq and Prereq Compute Needed from MajAll, DegAll, Offered Needed MajAll DegAll Offered MajReq Prereq DegReq
Naïve Evaluation of Recursive Datalog (without negation)
Naïve Evaluation of P A “bottom-up” or “right-to-left” method. Have a current set of facts f (a partial model). Use clauses in P in a right-to-left manner to derive additional facts that must also be in a minimum model for P.
Immediate Consequence Operator If P is a Datalog program, the immediate consequence operator IP(f) computes all facts derivable from the set of facts f by a single application of the clauses in P. Note: If f is a subset of the minimum model of P, then so is IP(f), IP(IP(f)), IP(IP(IP(f))), … So start the process with f = , because we know we have a subset of the minimum model. Inflationary semantics of program P: Start with empty set of facts, apply IP until no change.
Naïve Evaluation Naïve(P, q) fold = f = IP(fold) while f fold do fold = f f = IP(f) return match(q, f) Naïve always halts on programs that don’t use computed predicates, with f as the minimum model.
Working through naïve evaluation (1) f0 = Local(Id, Loc, 1) :- Local0(Id, Loc). Local(Id, Loc, K) :- Op1(Id, In, Loc), Local(In, Loc, M), Sum(M, 1, K). Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1).
Working through naïve evaluation (2) f1 = IP(f0) = Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1). Local(Id, Loc, 1) :- Local0(Id, Loc). Local(Id, Loc, K) :- Op1(Id, In, Loc), Local(In, Loc, M), Sum(M, 1, K). Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1).
Working through naïve evaluation (3) f2 = IP(f1) = Local(r, node1, 1). Local(s, node1, 1). Local(q, node1, 1). Local(u, node2, 1). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1). Local(Id, Loc, 1) :- Local0(Id, Loc). Local(Id, Loc, K) :- Op1(Id, In, Loc), Local(In, Loc, M), Sum(M, 1, K). Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1).
Working through naïve evaluation (4) f3 = IP(f2) = Local(r, node1, 1). Local(s, node1, 1). Local(q, node1, 1). Local(u, node2, 1). Local(s1, node1, 2). Local(u1, node1, 3). Local(s2, node2, 2). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1). Local(Id, Loc, 1) :- Local0(Id, Loc). Local(Id, Loc, K) :- Op1(Id, In, Loc), Local(In, Loc, M), Sum(M, 1, K). Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1).
Working through naïve evaluation (5) f4 = IP(f3) = Local(r, node1, 1). Local(s, node1, 1). Local(q, node1, 1). Local(u, node2, 1). Local(s1, node1, 2). Local(u1, node1, 3). Local(j1, node1, 6). Local(s2, node2, 2). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1). Local(Id, Loc, 1) :- Local0(Id, Loc). Local(Id, Loc, K) :- Op1(Id, In, Loc), Local(In, Loc, M), Sum(M, 1, K). Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1).
Working through naïve evaluation (6) f5 = IP(f4) = f4 Local(r, node1, 1). Local(s, node1, 1). Local(q, node1, 1). Local(u, node2, 1). Local(s1, node1, 2). Local(u1, node1, 3). Local(j1, node1, 6). Local(s2, node2, 2). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1). Local(Id, Loc, 1) :- Local0(Id, Loc). Local(Id, Loc, K) :- Op1(Id, In, Loc), Local(In, Loc, M), Sum(M, 1, K). Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1).
Semi-Naïve Evaluation of Datalog
Semi-Naïve Evaluation Naïve evaluation discovers the same fact over and over. E.g., derive local(r, node1, 1) repeatedly Idea: When evaluating a rule, make sure we use at least one newly discovered fact. Otherwise, we derive a fact we’ve already seen. Notice: after f1, only new facts are local facts. (Only need to work on recursive intensional predicates.)
Example: rediscovering facts vs. discovering new facts (using Naïve) Consider the rule Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). When computing f4 = IP(f3). If we use facts Op2(u1, s, q, node1), Local(s, node1, 1), Local(q, node1, 1), we derive Local(u1, node1, 3), which we already had If we use Op2(j1, s1, u1, node1), Local(s1, node1, 2), Local(u1, node1, 3), we get Local (j1, node2, 6), which is new
Implementing Semi-Naïve: use deltas (new facts from this round) Suppose at each step in Naïve we determine NewLocal(Id, Loc, K) if Local(Id, Loc, K) in f – fold. Replace this rule: Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). With these rules: Local(Id, Loc, K) :- Op1(Id, In, Loc), NewLocal(In, Loc, M), Sum(M, 1, K). Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), NewLocal(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), NewLocal(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). Local(id, Loc, K) :- Op2(Id, In1, In2, Loc), NewLocal(In1, Loc, M), NewLocal(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K).
One Other Modification for Semi-Naive If P’ is the modified program, then instead of f = IP(f) we need f = IP’(f) f Note: Semi-Naïve doesn’t guarantee we won’t generate a fact at multiple steps, if there are distinct “proofs” of it, i.e., two ways to generate it. MajReq(eng, wr310). MajReq(eng, wr408). Prereq(wr309, wr310). Prereq(wr401, wr408). Prereq(wr309, wr401).
Evaluating Datalog with Recursion and Negation
Now consider Datalog with recursion & negation Person(Dan). (one tuple in base relation) Student(x) :- Person(x), Employee(x). Employee(x) :- Person(x), Student(x). What is the query answer? Either Student(Dan) or Employee(Dan), depending on which rule fires first. Two different query answer!
We have a problem – for any data. Student(x) :- Person(x), Employee(x). Employee(x) :- Person(x), Student(x). Imagine we have some number of persons in the database. They will ALL end up at Students or as Employees, depending on which rule fires first. Two different query answers!
What does the dependency graph look like? Person(Dan). (one tuple in base relation) Student(x) :- Person(x), Employee(x). Employee(x) :- Person(x), Student(x). This is a recursive program. Employee Person Student
Now consider recursion & negation Person(Dan). (one tuple in base relation) Student(x) :- Person(x), Employee(x). Employee(x) :- Person(x), Student(x). We have a problem when a cycle in a dependency graph includes (at least) one negative edge. Employee Person Student
Cycle with one (or more) negative edges Person(Dan). (one tuple in base relation) Student(x) :- Person(x), Employee(x). Employee(x) :- Person(x), Student(x). Recursion (for Student, say) is attempting to add tuples to answer. But it is subtracting Employee tuples (to find the Students). And … Employees are being computed recursively at the same time. Employee Person Student
Another example with negation & recursion Node(a). Node(b). Node(c). Node(d). Arc(a,b). Arc(c,d). Start(a). Reachable(z) :- Start(z). Reachable(y) :- Reachable(x), Arc(x,y). Unreachable(x) :- Node(x), Reachable(x). These are facts from the database. Draw the dependency graph; label the negative edges.
Dependency graph for this program; notice the negative edge Node(a). Node(b). Node(c). Node(d). Arc(a,b). Arc(c,d). Start(a). Reachable(z) :- Start(z). Reachable(y) :- Reachable(x), Arc(x,y). Unreachable(x) :- Node(x), Reachable(x). Start Arc Reachable Node Unreachable ¬
Stratification – for Datalog with negation and recursion A stratification of a Datalog program P is a partition of P into strata or layers where, for every rule, H :- A1, …, Am, B1, …, Bq The rules that define the predicate H are all in the same strata. For all positive predicates in this rule, their strata is less than or equal to the strata of this rule. Strata(Ai) ≤ Strata(H). For all negative predicates in this rule, their strata is strictly less than the strata of this rule. Strata(Bi) < Strata(H).
Intuition behind stratification In any strata, compute a predicate – in its entirety. If you EVER use a predicate negatively, then make sure it is ALREADY, COMPLETELY computed. So … stratification prevents you from using recursion at the same time you’re computing a predicate that is used negatively.
Exercise: Define a stratification for this datalog program. Node(a). Node(b). Node(c). Node(d). Arc(a,b). Arc(c,d). Start(a). Reachable(z) :- Start(a). Reachable(y) :- Reachable(x), Arc(x,y). Unreachable(x) :- Node(x), Reachable(x). Start Arc Reachable Node Unreachable ¬
Exercise: Define a stratification for this datalog program. Node(a). Node(b). Node(c). Node(d). Arc(a,b). Arc(c,d). Start(a). Reachable(z) :- Start(a). Reachable(y) :- Reachable(x), Arc(x,y). Unreachable(x) :- Node(x), Reachable(x). Start Arc Reachable Node Unreachable ¬ Compute all Reachable facts (positively) in Strata 1. Then, use Reachable negatively in Strata 2. Strata 1 Strata 2
Can you define a stratification for this program? Person(Dan). (one tuple in base relation) Student(x) :- Person(x), Employee(x). Employee(x) :- Person(x), Student(x). Employee Person Student
Compare these two dependency graphs Person(Dan). (one tuple in base relation) Student(x) :- Person(x), Employee(x). Employee(x) :- Person(x), Student(x). The one on left has a cycle with negation; the one on right doesn’t. The on left can’t be stratified. Employee Person Student Start Arc Reachable Node Unreachable ¬
Comments Stratified Datalog programs always have a unique answer. Note that Stratification forces the choice of one of various possible answers (models). Not all Datalog programs can be stratified. Some programs can be stratified in more than one way. If so, then all stratifications will produce the result.
Models and Interpretations
How do we use relational databases for relational calculus and for Datalog? An interpretation I = (U, C, P, F) where U is a nonempty set of elements (the domain) C is a function that maps the constants of the theory into elements of U P maps each n-ary predicate (from the theory) into an n-place relation over U F maps each n-ary function (from the theory) into an function from Un U Variables range over the set U An interpretation is a model (of the theory) iff all of the axioms are true. Union of all domains. No constant symbols needed in the theory. Function-free
How do we use relational databases for relational calculus and for Datalog? An interpretation I = (U, C, P, F) where U is a nonempty set of elements (the domain) C is a function that maps the constants of the theory into elements of U P maps each n-ary predicate (from the theory) into an n-place relation over U F maps each n-ary function (from the theory) into an function from Un U Variables range over the set U An interpretation is a model (of the theory) iff all of the axioms are true. Union of all domains. In order to have an interpretation, we need to describe (we need to write down) all of the tuples where a given predicate is true. No constant symbols needed in the theory. Function-free Think about a relational database. What does it store for us? It (exactly) stores the tuples that belong in a particular table. A relational database is (exactly) an interpretation of a very simple 1st order theory with no functions and no constant symbols.
Relational DB as an interpretation “In [domain calculus] a relational database is considered to be an interpretation of a first-order theory. The domain of the interpretation is a superset of the active domain of the database and the relations in the database are the extensions of the relation symbols of the database schema. A query in the domain relational calculus is essentially checking [i.e., finding] whether [i.e., where] the database is a model of the first-order formula in the query.” P. 108, textbook
Model of a Datalog Program (repeated) A database (aka interpretation) M for a program P is a set of facts over the predicates in P. A database M is a model for program P if It assigns each computable predicate its natural interpretation For any ground instance of a clause H :- L1, L2, …, Ln. if {L1, L2, …, Ln} M then H M. Note that all facts of P must be in M.
A Program Can Have Many Models (extra facts … but still a model) Local(r, node1, 1). Local(r, node2, 1). Local(s, node1, 1). Local(q, node1, 1). Local(u, node2, 1). Local(s1, node1, 2). Local(u1, node1, 3). Local(j1, node1, 6). Local(s2, node2, 2). Local0(r, node1). Local0(r, node2). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1). Program: Local(Id, Loc, 1) :- Local0(Id, Loc) Local(Id, Loc, K) :- Op1(Id, In, Loc), Local(In, Loc, M), Sum(M, 1, K). Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1).
A Second Model Model: Local(r, node1, 1). Local(s, node1, 1). Local(q, node1, 1). Local(u, node2, 1). Local(s1, node1, 2). Local(u1, node1, 3). Local(j1, node1, 6). Local(s2, node2, 2). Local(j2, node2, 9). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1). Program: Local(Id, Loc, 1) :- Local0(Id, Loc) Local(Id, Loc, K) :- Op1(Id, In, Loc), Local(In, Loc, M), Sum(M, 1, K). Local(Id, Loc, K) :- Op2(Id, In1, In2, Loc), Local(In1, Loc, M), Local(In2, Loc, N), Sum(M, N, J), Sum(J, 1, K). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1).
Closure Under Intersection Lemma: If M1 and M2 are both models for Datalog program P, then so is M1 M2. Proof: Let M = M1 M2. Both M1 and M2 give any computed predicate its natural interpretation, so part 1. of model definition is covered. How could M fail to be a model of P? Must have a ground instance of a clause: H :- L1, L2, …, Ln. Where {L1, L2, …, Ln} M but H M. So H must be missing from M1 or M2 (or both).
Example Intersection M1 M2 Local(r, node1, 1). Local(r, node1, 1). Local(s, node1, 1). Local(q, node1, 1). Local(u, node2, 1). Local(s1, node1, 2). Local(u1, node1, 3). Local(j1, node1, 6). Local(s2, node2, 2). Local0(r, node1). Local0(r, node2). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1). Local(r, node1, 1). Local(s, node1, 1). Local(q, node1, 1). Local(u, node2, 1). Local(s1, node1, 2). Local(u1, node1, 3). Local(j1, node1, 6). Local(s2, node2, 2). Local(j2, node2, 9). Local0(r, node1). Local0(s, node1). Local0(q, node1). Local0(u, node2). Op1(s1, r, node1). Op1(s2, u, node2). Op2(j1, s1, u1, node1). Op2(u1, s, q, node1). Op2(j2, j1, s2, node1). M1 M2
Minimal Model A model M for program P is minimal if no proper subset of M is also a model. Every Datalog program P (without negation) has a unique minimal model. Consider two different minimal models N1 and N2 for P. Then N = N1 N2 is also a model, and is a proper subset of at least one of them. Minimum model of P: Unique minimal model.
Minimum Model and Queries When we answer queries against Datalog program P, we want to answer them against the minimal model. (Closed-World Assumption) Here’s a query: :- Local(j1, L, C). We want all Local facts in the minimum model where the first component is j1.
Immediate Consequence Operator If P is a Datalog program, the immediate consequence operator IP(f) computes all facts derivable from the set of facts f by a single application of the clauses in P. Note: If f is a subset of the minimum model of P, then so is IP(f), IP(IP(f)), IP(IP(IP(f))), … So start the process with f = , because we know we have a subset of the minimum model. Inflationary semantics of program P: Start with empty set of facts, apply IP until no change.
Comment on Immediate Consequence Algorithm 3.4 (Meaning) This algorithm give the meaning for unsafe, recursive Datalog programs. The algorithm uses what is called “inflationary” semantics. This means that if you ever derive a fact in the answer, then it stays there. Temp-21(x) :- Person(x, 21, y). Temp-male (x) :- Person(x, z, “male”). Answer(x) :- Temp-21(x), Temp-male(x). Algorithm 3.4 tells us that if we fire a rule (no matter what order), then the new tuples are added to Result. If the first rule fires and then the third, we’ll have some folks in Answer who are not male. (And this is not regarded as a problem.) A program may have more than one fixed point.
Comment on Immediate Consequence Algorithm 3.5 (New Meaning) This algorithm is for safe, non-recursive Datalog programs. Since the program is not recursive, we can use the dependency graph to induce an ordering on the relations. Basically, compute a relation before it’s used in the body of other rules. Algorithm 3.5, computes the relations according to that order. Temp-21(x) :- Person(x, 21, y). Temp-male (x) :- Person(x, z, “male”). Answer(x) :- Temp-21(x), Temp-male(x). Order the relations: Either Temp-21 or Temp-male is first. Then the other one is second. Answer is computed third. This follows from the dependency graph. Person Temp-21 Temp-male Answer
Comments on Datalog without recursion with negation For safe Datalog without recursion, with negation, if we compute the relations one by one, according to the dependency graph, then there is a unique fixed point. And it is the same as the minimal model. Note that for this language, all rules only need to be fired once.
Comments on Datalog with recursion NO negation For Datalog with recursion, but NO negation, then: The minimal model is unique. The minimal model is always the intersection of all the models. The minimal model is the same as the fixed point of the immediate consequence operator. This language is monotonic (rules only add facts)
Semantics of a Datalog Program with recursion and negation (slide repeated) Stratified Datalog programs always have a unique answer. Note that Stratification forces the choice of one of various possible answers. Not all Datalog programs can be stratified. Some programs can be stratified in more than one way. If so, then all stratifications will produce the result.