1 Querying Infinite Databases
Safety of Datalog Queries over Infinite Databases (Sagiv and Vardi ’90)
Queries and Computation on the Web (Abiteboul and Vianu ’97)
Itay Maman
Student Symposium, 5 July 2006
2/19 Simple Technion Queries…
(Domain: the Technion’s students database)
Q1: Which courses did Gidi attend?
  SELECT course FROM students WHERE name='Gidi'
Q2: Which students took course 234218?
  SELECT name FROM students WHERE course='234218'
Table courses (columns: course, name): (…, Gidi), (…, Gidi), (…, Dina), …
3/19 Simple Web Queries…
Q3: Which pages does my home page link to?
  SELECT target FROM links WHERE source='…'
Q4: Which pages link to my home page?
  SELECT source FROM links WHERE target='…'
Q4 is challenging: no matter how long my web crawler runs, I can never find all the incoming links of a page!
This is an infinite query: the more you crawl, the more answers you get
(In Q3 the size of the result set is bounded)
Table links (columns: source, target): …
4/19 Leading questions
What does an infinite DB look like?
Can we evaluate a query over an infinite DB?
Can we determine the finiteness of a query?
But first, some Datalog…
5/19 Datalog
Why Datalog?
  Supports recursion/transitive closure (unlike SQL without recursive extensions)
  Recursion is essential in large data sets
  Terminates if the DB is finite
  Very simple
program = a collection of rules
rule = a sequence of terms
In our program:
  Three rules
  Two queries (AKA IDB): g(X), small(X,Y)
  One table (AKA EDB): before(X,Y)
  A goal predicate from which execution starts; we choose g(X) as the goal
    g(X) :- small(X,2).
    small(X,Y) :- before(X,Y).
    small(X,Y) :- small(X,Z), before(Z,Y).
6/19 Finiteness
A DB is finite if every table is a finite set
  before(X,Y) = { (0,1), (1,2), (2,3) }
Possible evaluation schemes: brute force, bottom-up, optimizations
The requirement: finiteness of tables
The guarantee: termination of the Datalog program
7/19 Infinity
Here is another definition for our table:
  before(X,Y) = { (X,X+1) | X ≥ 0 }
We now have an infinite DB
The problem: we cannot iterate over the tuples in the set
The solution: a top-down algorithm
Such tables are quite common:
  The internet links relation: links(X,Y) = { (X,Y) | page X links to page Y }
  Java’s subclassing relation: extends(X,Y) = { (X,Y) | class X extends Y }
Leading question: What does an infinite DB look like?
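
One way to picture such a table is as a function that answers lookups rather than a stored set of tuples. The sketch below is only an illustration under that assumption: the name `before` mirrors the slide's table, while the keyword-argument lookup interface is hypothetical and not part of the talk.

```python
from typing import Optional, Set, Tuple

Pair = Tuple[int, int]


def before(x: Optional[int] = None, y: Optional[int] = None) -> Set[Pair]:
    """Return the tuples of before = {(X, X+1) | X >= 0} that match the bound arguments.

    Hypothetical interface: at least one argument must be bound, because
    the full relation is infinite and cannot be enumerated.
    """
    if x is None and y is None:
        raise ValueError("before(X,Y) with both arguments free is infinite")
    if x is None:
        x = y - 1          # the only candidate predecessor of y
    if y is None:
        y = x + 1          # the only candidate successor of x
    return {(x, y)} if x >= 0 and y == x + 1 else set()
```

With one argument bound, every lookup returns a finite set, for example `before(y=2)` yields `{(1, 2)}` and `before(y=0)` yields the empty set.
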
8/19 Example: Top-down evaluation
The program:
  g(W) :- small(W,2).
  small(A,B) :- before(A,B).
  small(X,Y) :- small(X,Z), before(Z,Y).
  before(X,Y) = { (X,X+1) | X ≥ 0 }
Notation: b : before, s : small, ⋈ : join (⋈ binds tighter than ∪)
  s(X,Y) = b(X,Y) ∪ s(X,Z) ⋈ b(Z,Y)
The evaluation:
  g(W) = s(W,2)
       = b(W,2) ∪ s(W,Z) ⋈ b(Z,2)
       = {(1,2)} ∪ s(W,1) ⋈ {(1,2)}
       = {(1,2)} ∪ [b(W,1) ∪ s(W,Z) ⋈ b(Z,1)] ⋈ {(1,2)}
       = {(1,2)} ∪ [{(0,1)} ∪ s(W,0) ⋈ {(0,1)}] ⋈ {(1,2)}
       = {(1,2)} ∪ [{(0,1)} ∪ [b(W,0) ∪ s(W,Z) ⋈ b(Z,0)] ⋈ {(0,1)}] ⋈ {(1,2)}
       = {(1,2)} ∪ [{(0,1)} ∪ [∅ ∪ s(W,Z) ⋈ ∅] ⋈ {(0,1)}] ⋈ {(1,2)}
       = {(1,2)} ∪ [{(0,1)} ∪ ∅ ⋈ {(0,1)}] ⋈ {(1,2)}
       = {(1,2)} ∪ {(0,1)} ⋈ {(1,2)}
       = {(1,2)} ∪ {(0,2)}
       = {(1,2), (0,2)}
9/19 Top-down evaluation
The top-down algorithm
Init: r ← body of the goal
Loop:
  (Intelligently) pick a term, t, from r
  If t is a query term: replace it with the union of the rules indicated by t
  If t is a table term: replace it with the set generated by the table
  Replace ∅ ⋈ s expressions (in r) with ∅
  Replace ∅ ∪ s expressions (in r) with s
  Evaluate relational algebra expressions (if both sides are known)
  Stop if no further replacements can be made
Leading question: Can we evaluate a query over an infinite DB? Yes
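
The loop above can be sketched for the small(W,2) example as a recursive function. This is a minimal sketch, not the general algorithm: it hard-codes the two small rules, always evaluates the table term before the recursive query term (one possible "intelligent" choice), and the helper names `before_to`, `small_to` and `goal` are assumptions introduced here.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]


def before_to(y: int) -> Set[Pair]:
    """b(X, y): tuples of before = {(X, X+1) | X >= 0} whose second field is y."""
    return {(y - 1, y)} if y >= 1 else set()


def small_to(y: int) -> Set[Pair]:
    """s(X, y) = b(X, y) U (s(X, Z) join b(Z, y)), with y bound."""
    b = before_to(y)              # table term: replaced by the set it generates
    result = set(b)               # rule 1: small(X,Y) :- before(X,Y).
    for (z, _) in b:              # rule 2: small(X,Y) :- small(X,Z), before(Z,Y).
        for (x, _) in small_to(z):
            result.add((x, y))
    return result                 # an empty b makes the join empty, so recursion stops


def goal() -> Set[int]:
    """g(W) :- small(W,2)."""
    return {w for (w, _) in small_to(2)}
```

Here `small_to(2)` returns `{(1, 2), (0, 2)}`, matching the hand evaluation of the example, and the recursion ends at `small_to(0)` because `before_to(0)` is empty.
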
10/19 Infinite Queries
Can the top-down algorithm run forever? Yes
Case 1: A table that returns an infinite result
  evenProduct(X,Y) = { (X,Y) | X*Y mod 2 = 0 }
  divides(X,Y) = { (X,Y) | X mod Y = 0 }
  links(X,Y) = { (X,Y) | page X links to page Y }
Weak safety: all intermediate results are finite
Result #1 (Sagiv and Vardi ’90): Weak safety is decidable given the F/C (finiteness constraints) of the tables
  X => Y means that each value of X is paired with only finitely many values of Y
  F/C of evenProduct: none
  F/C of divides: X => Y
  F/C of links: X => Y
Algorithm: tracking the flow of values from assigned variables
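
The idea of tracking the flow of values can be illustrated on a single rule body. The sketch below is a simplified illustration only, not the Sagiv and Vardi decision procedure: it ignores recursion between rules, and the encoding of a finiteness constraint as a pair of argument positions, together with the names `FC` and `finitely_bounded`, is an assumption made here.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical encoding: (i, j) on a table means
# "argument i is paired with only finitely many values of argument j".
Constraints = Dict[str, List[Tuple[int, int]]]
Atom = Tuple[str, Tuple[str, ...]]   # (table name, argument variables)

FC: Constraints = {
    "evenProduct": [],        # no finiteness constraint
    "divides": [(0, 1)],      # X => Y
    "links": [(0, 1)],        # X => Y
}


def finitely_bounded(body: List[Atom], fc: Constraints, bound: Set[str]) -> Set[str]:
    """Propagate finiteness from the initially bound variables through the body."""
    known = set(bound)
    changed = True
    while changed:
        changed = False
        for table, args in body:
            for i, j in fc.get(table, []):
                if args[i] in known and args[j] not in known:
                    known.add(args[j])   # finitely many values flow from args[i]
                    changed = True
    return known


# Outgoing links from a bound page P, two hops: every variable is bounded.
two_hops = finitely_bounded([("links", ("P", "Q")), ("links", ("Q", "R"))], FC, {"P"})
# Incoming links to a bound page P: Q stays unbounded.
incoming = finitely_bounded([("links", ("Q", "P"))], FC, {"P"})
```

In this sketch `two_hops` contains P, Q and R, while `incoming` contains only P, which mirrors the difference between Q3 and Q4.
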
11/19 Infinite Queries (cont.)
Can the top-down algorithm run forever? Yes
Case 2: The algorithm’s recursion never stops
  A query/table is used in its “unbounded” direction
The program:
  g(W) :- small(2,W).
  small(A,B) :- before(A,B).
  small(X,Y) :- small(X,Z), before(Z,Y).
  before(X,Y) = { (X,X+1) | X ≥ 0 }
  s(X,Y) = b(X,Y) ∪ s(X,Z) ⋈ b(Z,Y)
The evaluation:
  g(W) = s(2,W)
       = b(2,W) ∪ s(2,Z) ⋈ b(Z,W)
       = {(2,3)} ∪ s(2,Z) ⋈ b(Z,W)
       = {(2,3)} ∪ [b(2,Z) ∪ s(2,Z’) ⋈ b(Z’,Z)] ⋈ b(Z,W)
       …
Results #2-3 (Sagiv and Vardi ’90):
  Termination is undecidable in the general case
  Termination is decidable if all queries are unary
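
The unbounded direction can be made concrete by enumerating the answers of small(2,W) lazily. The sketch below is an assumption-laden illustration: it walks forward through before instead of following the slide's expansion order, and the names `before_from` and `small_from` are introduced here. The point it shows is that the answer set itself never runs out.

```python
from itertools import islice
from typing import Iterator, Set, Tuple

Pair = Tuple[int, int]


def before_from(x: int) -> Set[Pair]:
    """b(x, W): tuples of before = {(X, X+1) | X >= 0} whose first field is x."""
    return {(x, x + 1)} if x >= 0 else set()


def small_from(x: int) -> Iterator[int]:
    """Lazily yield every W with small(x, W); this generator never finishes."""
    frontier = [w for (_, w) in before_from(x)]
    while frontier:
        w = frontier.pop()
        yield w
        # Each answer leads to a further answer, so the frontier is never empty.
        frontier.extend(w2 for (_, w2) in before_from(w))


# Only a bounded prefix can ever be inspected.
first_five = list(islice(small_from(2), 5))
```

Taking a longer prefix with `islice` always produces more answers, much like the crawler in Q4.
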
12/19 Infinite Queries (summ.)
We can automatically determine weak safety
We cannot (automatically) determine termination
But one can analytically prove that a given query over a given DB is finite
  E.g., our small(W,2) program
Leading question: Can we determine the finiteness of a query? No
13/19 The Web as a DB
The web data model (WDM): a schema of a DB that can represent the web graph
Just three tables:
  urls = { u | u is a url of a web page }
  links = { (u1,u2) | u1 links to u2; u1, u2 ∈ urls }
  words = { (u,w) | w appears in page u; u ∈ urls }
Result #4 (Abiteboul and Vianu ’97): If a Datalog program with no literals halts over an infinite DB, its result is ∅
=> A non-trivial query (over an infinite DB) must have a literal
14/19 Web Machines
Browsing machine
  A weakly safe Datalog program (over WDM)
  At least one URL literal
Searching/browsing machine
  An unsafe Datalog program (over WDM)
  Evaluates queries in parallel
  Allowed literal types: URLs, words
Claims #1-2 (Abiteboul and Vianu ’97):
  Browsing machine: represents a user following static links from a page
  Searching/browsing machine: also allows the user to access a search engine
15/19 Discussion: Finite approximation
Relational database servers are very popular
  Such DBs are finite
Also, computing a table on demand may be slow
  Better performance at batch processing
The challenge: build a finite replacement for an infinite DB
Formally: given a finite query, q, over an infinite DB (finiteness of q proved analytically),
  build a finite database such that q over the finite database yields the same result as q over the infinite DB
16/19 Discussion: Finite approximation
Example: our small(W,2) program
  A finite, sound table: before(X,Y) = { (0,1), (1,2) }
  A finite, unsound table: before(X,Y) = { (0,1) }
The process:
  Compute the transitive closure of the before relation
  Start from the literal ‘2’ at the right-hand side position
Condition: the table graph must end with a sink
  In before the sink is the vertex ‘0’ => we can build a finite DB
  Sadly, in the web graph no such sink exists
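
The process can be sketched as a backward walk from the literal until the sink is reached. This is a minimal sketch under stated assumptions: `before_to` is a hypothetical lookup for the infinite before table, and `max_steps` is an invented safety budget for graphs, such as the web graph, where no sink exists.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]


def before_to(y: int) -> Set[Pair]:
    """b(X, y): tuples of before = {(X, X+1) | X >= 0} whose second field is y."""
    return {(y - 1, y)} if y >= 1 else set()


def finite_before(literal: int, max_steps: int = 10_000) -> Set[Pair]:
    """Collect the tuples of before reachable backwards from the literal."""
    table: Set[Pair] = set()
    seen = {literal}
    frontier = [literal]
    steps = 0
    while frontier:
        steps += 1
        if steps > max_steps:
            raise RuntimeError("no sink reached within the step budget")
        y = frontier.pop()
        for (x, _) in before_to(y):   # a sink has no incoming tuples
            table.add((x, y))
            if x not in seen:
                seen.add(x)
                frontier.append(x)
    return table
```

For the literal 2 the walk stops at the sink 0 and returns `{(0, 1), (1, 2)}`, the finite, sound table of the example.
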
17/19 Discussion: Temporality
Crawling takes time
The subject may change while crawling
The DB is a snapshot which never happened
(Open question): Can we decide whether a result was really “true” at some point?
18/19 More issues
Relational algebra over large relations
  BDD
Negation
  Stratified Datalog
19/19 - Questions ? -
21/19 Datalog
Semantics: ???
Straightforward mapping to relational algebra??
  g(X) :- small(X,2).
  small(X,Y) :- before(X,Y).
  small(X,Y) :- small(X,Z), before(Z,Y).
22/19 Example: Bottom-up evaluation
Initialization: translate the EDBs into relations
[Table: before(X,Y)]
23/19 Example: Bottom-up evaluation
Apply small(X,Y) :- before(X,Y).
[Tables: before(X,Y), small(X,Y)]
24/19 Example: Bottom-up evaluation
Apply small(X,Y) :- small(X,Z), before(Z,Y).
[Figure: small(X,Z) joined with before(Z,Y) yields new small(X,Y) tuples]
25/19 Example: Bottom-up evaluation
Apply g(X) :- small(X,2).
[Tables: small(X,Y), g(X) with rows 1, 0]
26/19 Finiteness
  before(X,Y) = { (0,1), (1,2), (2,3) }
The bottom-up algorithm:
Init:
  For each EDB, p, assign r(p) ← relation of all tuples satisfying p
  For each IDB, p, assign r(p) ← ∅
Loop:
  Choose a rule p(…) :- t1(…), t2(…), … tn(…)
  t ← join of all r(t_i), where 1 ≤ i ≤ n
  r(p) ← r(p) ∪ t
  Continue until a fix-point is reached
Requires: finiteness of EDBs
Ensures: termination
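
The bottom-up loop can be sketched for the three-rule example program. This is a naive fix-point sketch, not an optimized evaluator: the rules are hard-coded rather than parsed, all rules are applied in every round, and the function name `bottom_up` is an assumption introduced here.

```python
from typing import Set, Tuple

Pair = Tuple[int, int]


def bottom_up(before: Set[Pair]) -> Tuple[Set[Pair], Set[int]]:
    """Evaluate the example program over a finite before table."""
    small: Set[Pair] = set()   # IDB relations start empty
    g: Set[int] = set()
    while True:
        # small(X,Y) :- before(X,Y).
        # small(X,Y) :- small(X,Z), before(Z,Y).
        joined = {(x, y) for (x, z) in small for (z2, y) in before if z == z2}
        new_small = small | before | joined
        # g(X) :- small(X,2).
        new_g = g | {x for (x, y) in new_small if y == 2}
        if new_small == small and new_g == g:
            return small, g    # fix-point: no rule adds a new tuple
        small, g = new_small, new_g


small, g = bottom_up({(0, 1), (1, 2), (2, 3)})
```

Over the finite table of this slide the loop stops after a few rounds with `g == {0, 1}`, the same answer the top-down example produces. Because every round iterates over all of `before`, the sketch only makes sense when the EDB is finite.
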