CSE 544: Lecture 11 Theory Monday, May 3, 2004.

Slides:

Advertisements

Similar presentations

1 Datalog: Logic Instead of Algebra. 2 Datalog: Logic instead of Algebra Each relational-algebra operator can be mimicked by one or several Database Logic.

Advertisements

1 Decidable Containment of Recursive Queries Diego Calvanese, Giuseppe De Giacomo, Moshe Y. Vardi presented by Axel Polleres

CSE 636 Data Integration Conjunctive Queries Containment Mappings / Canonical Databases Slides by Jeffrey D. Ullman.

2005conjunctive-ii1 Query languages II: equivalence & containment (Motivation: rewriting queries using views)  conjunctive queries – CQ’s  Extensions.

Lecture 24 MAS 714 Hartmut Klauck

Lecture 11: Datalog Tuesday, February 6, Outline Datalog syntax Examples Semantics: –Minimal model –Least fixpoint –They are equivalent Naive evaluation.

CPSC 504: Data Management Discussion on Chandra&Merlin 1977 Laks V.S. Lakshmanan Dept. of CS UBC.

Complexity 11-1 Complexity Andrei Bulatov Space Complexity.

CS151 Complexity Theory Lecture 12 May 6, CS151 Lecture 122 Outline The Polynomial-Time Hierarachy (PH) Complete problems for classes in PH, PSPACE.

CSE 636 Data Integration Datalog Rules / Programs / Negation Slides by Jeffrey D. Ullman.

Abstract State Machines and Computationally Complete Query Languages Andreas Blass,U Michigan Yuri Gurevich,Microsoft Research & U Michigan Jan Van den.

CSE 544 Theory of Query Languages Tuesday, February 22 nd, 2011 Dan Suciu , Winter

CS848: Topics in Databases: Foundations of Query Optimization Topics Covered  Databases  QL  Query containment  More on QL.

Theory of Computing Lecture 15 MAS 714 Hartmut Klauck.

Week 10Complexity of Algorithms1 Hard Computational Problems Some computational problems are hard Despite a numerous attempts we do not know any efficient.

CSE 544 Relational Calculus Lecture #2 January 11 th, Dan Suciu , Winter 2011.

Datalog Inspired by the impedance mismatch in relational databases. Main expressive advantage: recursive queries. More convenient for analysis: papers.

CSE 636 Data Integration Conjunctive Queries Containment Mappings / Canonical Databases Slides by Jeffrey D. Ullman Fall 2006.

Chapter 5 Notes. P. 189: Sets, Bags, and Lists To understand the distinction between sets, bags, and lists, remember that a set has unordered elements,

Row Types in SQL-3 Row types define types for tuples, and they can be nested. CREATE ROW TYPE AddressType{ street CHAR(50), city CHAR(25), zipcode CHAR(10)

Lecture 7: Foundations of Query Languages Tuesday, January 23, 2001.

1 CSE544 Monday April 26, Announcements Project Milestone –Due today Next paper: On the Unusual Effectiveness of Logic in Computer Science –Need.

1 CSE 326: Data Structures: Graphs Lecture 24: Friday, March 7 th, 2003.

CS589 Principles of DB Systems Fall 2008 Lecture 4d: Recursive Datalog with Negation – What is the query answer defined to be? Lois Delcambre

Extensions of Datalog Wednesday, February 13, 2001.

Lecture 9: Query Complexity Tuesday, January 30, 2001.

1 Finite Model Theory Lecture 9 Logics and Complexity Classes (cont’d)

CS589 Principles of DB Systems Fall 2008 Lecture 4c: Query Language Equivalence Lois Delcambre

CS589 Principles of DB Systems Spring 2014 Unit 2: Recursive Query Processing Lecture 2-1 – Naïve algorithm for recursive queries Lois Delcambre (slides.

Computational problems, algorithms, runtime, hardness

Datalog Rules / Programs / Negation Slides by Jeffrey D. Ullman

Finite Model Theory Lecture 8

Great Theoretical Ideas in Computer Science

Goal for this lecture Demonstrate how we can prove that one query language is more expressive than (i.e., “contained in” as described in the book) another.

Umans Complexity Theory Lectures

Containment Mappings Canonical Databases Sariaya’s Algorithm

Turing Machines Acceptors; Enumerators

Relational Algebra 461 The slides for this text are organized into chapters. This lecture covers relational algebra, from Chapter 4. The relational calculus.

NP-Completeness Yin Tat Lee

Intro to Theory of Computation

Intro to Theory of Computation

January 19th – Subqueries 2 and relational algebra

Cse 344 January 29th – Datalog.

Relational Algebra Chapter 4, Sections 4.1 – 4.2

Chapter Nine: Advanced Topics in Regular Languages

Finite Model Theory Lecture 15

Linear Programming Duality, Reductions, and Bipartite Matching

CSE544: Lecture 12 Query Answering Using Views

Finite Model Theory Lecture 6

Lecture 10: Query Complexity

CPS 173 Computational problems, algorithms, runtime, hardness

Local-as-View Mediators

Logic Based Query Languages

CSE 544: Lecture 5-6 Theory Wednesday, April 10/12, 2006.

NP-Completeness Yin Tat Lee

CSE 544: Lecture 8 Theory.

Brief Introduction to Computational Logic

Datalog Inspired by the impedance mismatch in relational databases.

Conjunctive Queries, Views, Datalog Monday, 4/29/2002

Materializing Views With Minimal Size To Answer Queries

CSE 589 Applied Algorithms Spring 1999

Finite Model Theory Lecture 7

Instructor: Aaron Roth

Instructor: Aaron Roth

Advanced Data Modeling Minimal Models

CSE544 Wednesday, March 29, 2006.

CS589 Principles of DB Systems Fall 2008 Lecture 4d: Recursive Datalog with Negation – What is the query answer defined to be? Lois Delcambre

Intro to Theory of Computation

Unit III – Chapter 3 Path Testing.

Presentation transcript:

CSE 544: Lecture 11 Theory Monday, May 3, 2004

Query Minimization Definition A conjunctive query q is minimal if for every other conjunctive query q’ s.t. q  q’, q’ has at least as many predicates (‘subgoals’) as q Are these queries minimal ? q(x) :- R(x,y), R(y,z), R(x,x) q(x) :- R(x,y), R(y,z), R(x,’Alice’)

Query Minimization Query minimization algorithm Choose a subgoal g of q Remove g: let q’ be the new query We already know q  q’ (why ?) If q’  q then permanently remove g Notice: the order in which we inspect subgoals doesn’t matter

Query Minimization In Practice No database system today performs minimization !!! Reason: It’s hard (NP-complete) Users don’t write non-minimal queries However, non-minimal queries arise when using views intensively

Query Minimization for Views CREATE VIEW HappyBoaters SELECT DISTINCT E1.name, E1.manager FROM Employee E1, Employee E2 WHERE E1.manager = E2.name and E1.boater=‘YES’ and E2.boater=‘YES’ This query is minimal

Query Minimization for Views Now compute the Very-Happy-Boaters SELECT DISTINCT H1.name FROM HappyBoaters H1, HappyBoaters H2 WHERE H1.manager = H2.name This query is also minimal What happens in SQL when we run a query on a view ?

Query Minimization for Views View Expansion SELECT DISTINCT E1.name FROM Employee E1, Employee E2, Employee E3, Empolyee E4 WHERE E1.manager = E2.name and E1.boater = ‘YES’ and E2.boater = ‘YES’ and E3.manager = E4.name and E3.boater = ‘YES’ and E4.boater = ‘YES’ and E1.manager = E3.name This query is no longer minimal ! E4 E3 E1 E2 E2 is redundant

Monotone Queries Definition A query q is monotone if: For every two databases D, D’ if D  D’ then q(D)  q(D’) Which queries below are monotone ?   x.R(x,x)   x.y.z.u.(R(x,y)  R(y,z)  R(z,u))   x.y.R(x,y)

Monotone Queries Theorem. Every conjunctive query is monotone Stronger: every UCQ query is monotone

How To Impress Your Students Or Your Boss Find all drinkers that like some beer that is not served by the bar “Black Cat” Can you write as a simple SELECT-FROM-WHERE (I.e. without a subquery) ? SELECT L.drinker FROM Likes L WHERE L.beer not in (SELECT S.beer FROM Serves S WHERE S.bar = ‘Black Cat’)

Expressive Power of FO The following queries cannot be expressed in FO: Transitive closure: x.y. there exists x1, ..., xn s.t. R(x,x1)  R(x1,x2)  ...  R(xn-1,xn)  R(xn,y) Parity: the number of edges in R is even

Datalog Adds recursion, so we can compute transitive closure A datalog program (query) consists of several datalog rules: P1(t1) :- body1 P2(t2) :- body2 .. . . Pn(tn) :- bodyn

Datalog Terminology: EDB = extensional database predicates The database predicates IDB = intentional database predicates The new predicates constructed by the program

Datalog Employee(x), ManagedBy(x,y), Manager(y) EDBs All higher level managers that are employees: HMngr(x) :- Manager(x), ManagedBy(y,x), ManagedBy(z,y) Answer(x) :- HMngr(x), Employee(x) IDBs

Datalog Employee(x), ManagedBy(x,y), Manager(y) All persons: Person(x) :- Manager(x) Person(x) :- Employee(x) Manger  Employee

Unfolding non-recursive rules Graph: R(x,y) P(x,y) :- R(x,u), R(u,v), R(v,y) A(x,y) :- P(x,u), P(u,y) Can “unfold” it into: A(x,y) :- R(x,u), R(u,v), R(v,w), R(w,m), R(m,n), R(n,y)

Unfolding non-recursive rules Graph: R(x,y) P(x,y) :- R(x,y) P(x,y) :- R(x,u), R(u,y) A(x,y) :- P(x,y) Now the unfolding has a union: A(x,y) :- R(x,y)  u(R(x,u)  R(u,y))

Recursion in Datalog Graph: R(x,y) Transitive closure: P(x,y) :- R(x,y) P(x,y) :- P(x,u), R(u,y) Transitive closure: P(x,y) :- R(x,y) P(x,y) :- P(x,u), P(u,y)

Recursion in Datalog Boolean trees: Leaf0(x), Leaf1(x), AND(x, y1, y2), OR(x, y1, y2), Root(x) Write a program that computes: Answer() :- true if the root node is 1

Recursion in Datalog One(x) :- Leaf1(x) One(x) :- AND(x, y1, y2), One(y1), One(y2) One(x) :- OR(x, y1, y2), One(y1) One(x) :- OR(x, y1, y2), One(y2) Answer() :- Root(x), One(x)

Exercise Boolean trees: Leaf0(x), Leaf1(x), AND(x, y1, y2), OR(x, y1, y2), Not(x,y), Root(x) Hint: compute both One(x) and Zero(x) here you need to use Leaf0

Variants of Datalog without recursion with recursion without  Non-recursive Datalog = UCQ (why ?) Datalog with  Non-recursive Datalog = FO Datalog

Non-recursive Datalog Union of Conjunctive Queries = UCQ Containment is decidable, and NP-complete Non-recursive Datalog Is equivalent to UCQ Hence containment is decidable here too Is it still NP-complete ?

Non-recursive Datalog A non-recursive datalog: Its unfolding as a CQ: How big is this query ? T1(x,y) :- R(x,u), R(u,y) T2(x,y) :- T1(x,u), T1(u,y) . . . Tn(x,y) :- Tn-1 (x,u), Tn-1 (u,y) Answer(x,y) :- Tn(x,y) Anser(x,y) :- R(x,u1), R(u1, u2), R(u2, u3), . . . R(um, y)

Query Complexity Given a query  in FO And given a model D = (D, R1D, …, RkD) What is the complexity of computing the answer (D)

Query Complexity Which is the most important in databases ? Vardi’s classification: Data Complexity: Fix . Compute (D) as a function of |D| Query Complexity: Fix D. Compute (D) as a function of || Combined Complexity: Compute (D) as a function of |D| and || Which is the most important in databases ?

Example (x)  u.(R(u,x)  y.(v.S(y,v)  R(x,y))) S = R = 3 8 7 5 09 6 9 89 98 4 43 4 5 58 8 6 9 79 67 7 S = R = How do we proceed ?

General Evaluation Algorithm for every subexpression i of , (i = 1, …, m) compute the answer to i as a table Ti(x1, …, xn) return Tm Theorem. If  has k variables then one can compute (D) in time O(||*|D|k) Data Complexity = O(|D|k) = in PTIME Query Complexity = O(||*ck) = in EXPTIME

General Evaluation Algorithm (x)  u.(R(u,x)  y.(v.S(y,v)  R(x,y))) Example: 1(u,x)  R(u,x) 2(y,v)  S(y,v) 3(x,y)  R(x,y) 4(y)  v.2(y,v) 5(x,y)  4(y)  3(x,y) 6(x)  y. 5(x,y) 7(u,x)  1(u,x)  6(x) 8(x)  u. 7(u,x)  (x)

Complexity Theorem. If  has k variables then one can compute (D) in time O(||*|D|k) Remark. The number of variables matters !

Paying Attention to Variables Compute all chains of length m We used m+1 variables Can you rewrite it with fewer variables ? Chainm(x,y) :- R(x,u1), R(u1, u2), R(u2, u3), . . . R(um-1, y)

Counting Variables FOk = FO restricted to variables x1,…, xk Write Chainm in FO3: Chainm(x,y) :- u.R(x,u)(x.R(u, x)(u.R(x,u)…(u. R(u, y)…))

Query Complexity Note: it suffices to investigate boolean queries only If non-boolean, do this: for a1 in D, …, ak in D if (a1, …, ak) in (D) /* this is a boolean query */ then output (a1, …, ak)

Computational Complexity Classes Recall computational complexity classes: AC0 LOGSPACE NLOGSPACE PTIME NP PSPACE EXPTIME EXPSPACE (Kalmar) Elementary Functions Turing Computable functions

Data Complexity of Query Languages Paper: On the Unusual Effectiveness of Logic in Computer Science PSPACE FO(PFP) = datalog,* PTIME FO(LFP) = datalog AC0 FO = non-rec datalog Important: the more complex a QL, the harder it is to optimize

Views Employee(x), ManagedBy(x,y), Manager(y) Views L(x,y) :- ManagedBy(x,u), ManagedBy(u,y) E(x,y) :- ManagedBy(x,y), Employee(y) Query Q(x,y) :- ManagedBy(x,u), ManagedBy(u,v), ManagedBy(v,w), ManagedBy(w,y), Employee(y) How can we answer Q if we only have L and E ?

Views Query rewriting using views (when possible): Query answering: Sometimes we cannot express it in CQ or FO, but we can still answer it Q(x,y) :- L(x,u), L(u,y), E(v,y)

Views Applications: Using advanced indexes Using replicated data Data integration [Ullman’99]