Optimal Query Processing Meets Information Theory

Dan Suciu, University of Washington
Joint work with Mahmoud Abo-Khamis and Hung Ngo (RelationalAI Inc.)

Basic Question
What is the optimal runtime to compute a query Q on a database D?
Data complexity: Q is fixed, runtime = f(D)
Applications:
- CSP: the nodes of Q are the variables of the CSP problem, its hyperedges are the constraints, and D represents the valid assignments
- Databases: …

Example
select *   -- natural join
from R, S, T
where R.Y = S.Y and S.Z = T.Z and T.X = R.X

Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X)
Will prove later: |Q(D)| ≤ N^{3/2}

Instance (figure): R has columns X, Y over the values 1, 2, 3, ..., N/2; S(Y,Z) and T(Z,X) are the same as R.
Query plan: (R(X,Y) ⨝ S(Y,Z)) ⨝ T(Z,X), with inputs of size N
- R ⨝ S produces Θ(N^2) tuples
- The final join with T produces 0 tuples
Every plan costs Θ(N^2)!
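The blow-up in the intermediate join can be reproduced on a small instance. The sketch below is an illustration only: the "star" instance it builds (tuples (i, 0) and (0, i)) is an assumption chosen to show the same behavior, not necessarily the instance drawn on the slide, and the helper names are hypothetical.

```python
def star_relation(n):
    """Binary relation with 2n tuples: (i, 0) and (0, i) for i = 1..n."""
    spokes = range(1, n + 1)
    return {(i, 0) for i in spokes} | {(0, i) for i in spokes}


def join_on_middle(left, right):
    """Join left(A, B) with right(B, C) on B; returns a list of (a, b, c)."""
    index = {}
    for b, c in right:
        index.setdefault(b, []).append(c)
    return [(a, b, c) for a, b in left for c in index.get(b, [])]


def binary_plan(r, s, t):
    """Evaluate (R join S) join T; return (intermediate size, output)."""
    intermediate = join_on_middle(r, s)  # tuples (x, y, z)
    output = [(x, y, z) for x, y, z in intermediate if (z, x) in t]
    return len(intermediate), output


n = 50  # each relation has N = 2n = 100 tuples
r = s = t = star_relation(n)
intermediate_size, output = binary_plan(r, s, t)
print(len(r), intermediate_size, len(output))
```

On this instance the intermediate result has n^2 + n tuples while the final output is empty, and by symmetry the same holds for the other two join orders.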

Outline
- Motivation
- Full CQ
- Boolean CQ
- Conclusions

Maximum Output Size
max |Q(D)|, over all databases D that satisfy the given statistics
E.g. R(X,Y) ∧ S(Y,Z), with |R|, |S| ≤ N
- No other info: |Q(D)| ≤ N^2
- S.Y is a key: |Q(D)| ≤ N
- S.Y has degree ≤ d: |Q(D)| ≤ d×N
E.g. R(X,Y) ∧ S(Y,Z) ∧ T(Z,X)
- No other info: |Q(D)| ≤ N^{3/2}

Main Principle
1. Find an information-theoretic proof of the upper bound, or of the submodular width
2. Convert the proof to an algorithm

Proof of Upper Bound
Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X)
Database D → entropic function H
(Figure: an example database D with relations R(X,Y), S(Y,Z), T(Z,X) and its output Q(D). The uniform distribution on Q(D) assigns probability 1/5 to each output tuple; its marginals on XY, YZ and ZX assign probabilities such as 2/5 and 1/5 to tuples of R, S and T.)
H(XYZ) = log |Q(D)|
H(XY) ≤ log N_R
H(YZ) ≤ log N_S
H(Z|Y) ≤ log deg_S(z|y)
H(XZ) ≤ log N_T
The constraints come from cardinalities, functional dependencies, and max degrees
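The passage from a database to an entropic function can be checked numerically. The following is a minimal sketch on a made-up database (the relations below are hypothetical, not the ones in the figure): it takes the uniform distribution on the output and compares the marginal entropies with the log-cardinalities of the inputs.

```python
from collections import Counter
from math import isclose, log2


def triangle_output(r, s, t):
    """All (x, y, z) with R(x,y), S(y,z), T(z,x), by brute force."""
    return [(x, y, z) for x, y in r for y2, z in s if y2 == y and (z, x) in t]


def marginal_entropy(tuples, positions):
    """Entropy in bits of the uniform distribution on `tuples`,
    projected onto the given column positions."""
    counts = Counter(tuple(row[i] for i in positions) for row in tuples)
    total = len(tuples)
    return -sum(c / total * log2(c / total) for c in counts.values())


# Hypothetical example database.
r = {("a", 1), ("a", 2), ("b", 2)}
s = {(1, "m"), (2, "m"), (2, "q")}
t = {("m", "a"), ("m", "b"), ("q", "b")}

out = triangle_output(r, s, t)
h_xyz = marginal_entropy(out, (0, 1, 2))
h_xy = marginal_entropy(out, (0, 1))
h_yz = marginal_entropy(out, (1, 2))
h_zx = marginal_entropy(out, (2, 0))

print(isclose(h_xyz, log2(len(out))))          # H(XYZ) = log |Q(D)|
print(h_xy <= log2(len(r)), h_yz <= log2(len(s)), h_zx <= log2(len(t)))
```

The marginal on XY is supported on R, so its entropy cannot exceed log |R|; the same argument applies to YZ and ZX.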

Proof of Upper Bound
Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X)
|R|,|S|,|T| ≤ N ⇒ |Q(D)| ≤ N^{3/2}
3 log N ≥ h(XY) + h(YZ) + h(XZ)
        ≥ h(XYZ) + h(Y) + h(XZ)      (submodularity)
        ≥ h(XYZ) + h(XYZ) + h(∅)     (submodularity)
        = 2 h(XYZ) = 2 log |Q(D)|
Shearer’s inequality: h(XY) + h(YZ) + h(XZ) ≥ 2 h(XYZ)
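For readers who want the two submodularity steps spelled out, here is the same chain with each instance of h(A) + h(B) ≥ h(A ∪ B) + h(A ∩ B) made explicit. The set names A and B are introduced here only for illustration.

```latex
\begin{align*}
h(XY) + h(YZ) &\ge h(XYZ) + h(Y)
  && \text{submodularity with } A = XY,\ B = YZ \\
h(Y) + h(XZ) &\ge h(XYZ) + h(\emptyset)
  && \text{submodularity with } A = Y,\ B = XZ \\
h(\emptyset) &= 0 \\
\log |Q(D)| = h(XYZ) &\le \tfrac{1}{2}\bigl(h(XY) + h(YZ) + h(XZ)\bigr)
  \le \tfrac{3}{2} \log N
\end{align*}
```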

Proof to Algorithm
Q(X,Y,Z) = R(X,Y) ∧ S(Y,Z) ∧ T(Z,X)
h(XY) + h(YZ) + h(XZ) ≥ 2 h(XYZ)
Proof: h(XY) + h(YZ) + h(XZ) → h(XYZ) + [h(Y) + h(XZ)] → h(XYZ) + h(XYZ)
Algorithm: R(X,Y) ∧ S(Y,Z) ∧ T(Z,X) → [R_light(X,Y) ∧ S(Y,Z)] ∪ [R_heavy(Y) ∧ T(Z,X)]
R_light or R_heavy: degree(Y) ≤ or > N^{1/2}
Each of the two joins has size at most N^{3/2}
Runtime Õ(N^{3/2})
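The light/heavy split can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors' implementation: relations are Python sets of pairs, the threshold is the integer square root of the largest input size, and the function name is hypothetical.

```python
from collections import defaultdict
from math import isqrt


def triangles_heavy_light(r, s, t):
    """List all (x, y, z) with R(x,y), S(y,z), T(z,x), splitting R by the
    degree of each Y value."""
    r_set, s_set, t_set = set(r), set(s), set(t)
    n = max(len(r_set), len(s_set), len(t_set))
    threshold = isqrt(n)  # roughly N^{1/2}

    xs_by_y = defaultdict(list)  # y -> all x with R(x, y)
    for x, y in r_set:
        xs_by_y[y].append(x)
    zs_by_y = defaultdict(list)  # y -> all z with S(y, z)
    for y, z in s_set:
        zs_by_y[y].append(z)

    output = set()
    for y, xs in xs_by_y.items():
        if len(xs) <= threshold:
            # Light y: R_light join S, at most `threshold` x per S tuple.
            for z in zs_by_y.get(y, ()):
                for x in xs:
                    if (z, x) in t_set:
                        output.add((x, y, z))
        else:
            # Heavy y: fewer than about N^{1/2} such values; pair each with T.
            for z, x in t_set:
                if (x, y) in r_set and (y, z) in s_set:
                    output.add((x, y, z))
    return output


r = {(1, 2), (1, 3), (4, 2)}
s = {(2, 5), (3, 5)}
t = {(5, 1), (5, 4)}
print(sorted(triangles_heavy_light(r, s, t)))
```

Hash lookups stand in for the index structures a real engine would use; both branches do work proportional to at most about N^{3/2} candidate tuples.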

Enumeration Problem: Discussion
Cardinalities: [Atserias, Grohe, Marx ’08; Ngo, Re, Rudra ’13]
- Entropic bound = polymatroid bound
- Algorithm for Q(D) has a single log factor
Cardinalities + FDs + max degrees:
- Entropic bound ≨ polymatroid bound
- Algorithm for Q(D) has a polylog factor

Background: Tree Decomposition
Informally: TD = a tree where each node t represents an enumeration problem
Fractional hypertree width [Grohe, Marx ’14]: min_tree max_{node t} max_D
Submodular width [Marx 2013]: max_D min_tree max_{node t}

Fractional hypertree width: min_tree max_{node t} max_D
Q() = ∃x∃y∃z∃u R(x,y) ∧ S(y,z) ∧ T(z,u) ∧ K(u,x)
|R|,|S|,|T|,|K| ≤ N
O(N^{3/2}) algorithm [Alon, Yuster, Zwick ’97]
Tree decompositions:
- {R(x,y), S(y,z)} with {T(z,u), K(u,x)}
- {S(y,z), T(z,u)} with {K(u,x), R(x,y)}
Runtime Õ(N^2) (suboptimal)

Submodular width: max_D min_tree max_{node t}
Q() = ∃x∃y∃z∃u R(x,y) ∧ S(y,z) ∧ T(z,u) ∧ K(u,x)
|R|,|S|,|T|,|K| ≤ N
O(N^{3/2}) algorithm [Alon, Yuster, Zwick ’97]
Tree decompositions:
- T1: {R(x,y), S(y,z)} with {T(z,u), K(u,x)}
- T2: {S(y,z), T(z,u)} with {K(u,x), R(x,y)}
min( max(h(xyz),h(zux)), max(h(yzu),h(uxy)) )
= max( min(h(xyz),h(yzu)), min(h(xyz),h(uxy)), min(h(zux),h(yzu)), min(h(zux),h(uxy)) )
≤ 3/2 log N
Bound for the first term (the other three terms are bounded the same way):
3 log N ≥ h(xy) + h(yz) + h(zu)
        ≥ h(xyz) + h(y) + h(zu)
        ≥ h(xyz) + h(yzu) + h(∅)
        ≥ 2 min(h(xyz),h(yzu))
subw(Q) = 3/2 log N
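The rewriting of the min over two trees into a max over bag selections is an instance of min distributing over max. A quick numeric sanity check, with hypothetical function names and with a, b standing for the two bag values of T1 and c, d for those of T2:

```python
import random


def min_over_trees(a, b, c, d):
    """min over the two decompositions of the max over each one's two bags."""
    return min(max(a, b), max(c, d))


def max_over_selections(a, b, c, d):
    """max over choices of one bag per decomposition of the min of the choice."""
    return max(min(a, c), min(a, d), min(b, c), min(b, d))


random.seed(1)
samples = [[random.random() for _ in range(4)] for _ in range(1000)]
all_equal = all(min_over_trees(*v) == max_over_selections(*v) for v in samples)
print(all_equal)
```

The identity holds for every fixed h; what separates the two width notions is only whether the maximum over databases is taken inside or outside the minimum over trees.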

Proof to Algorithm
Use the proof of:
h(xyz) + h(yzu) ≤ h(xy) + h(yz) + h(zu)
to compute the disjunctive datalog rule:
A(x,y,z) ∨ B(y,z,u) ← R(x,y) ∧ S(y,z) ∧ T(z,u)
(details omitted)
Runtime Õ(N^{3/2})
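Since the slide omits the details, the following is only one possible way to satisfy such a rule, assuming the same light/heavy split on the degree of y in R that was used for the triangle query. It is a sketch, not the authors' algorithm; the function name and the threshold choice are assumptions.

```python
from collections import defaultdict
from math import isqrt


def disjunctive_rule(r, s, t):
    """Return (A, B) such that every (x, y, z, u) with R(x,y), S(y,z), T(z,u)
    has (x, y, z) in A or (y, z, u) in B."""
    r_set, s_set, t_set = set(r), set(s), set(t)
    n = max(len(r_set), len(s_set), len(t_set))
    threshold = isqrt(n)  # roughly N^{1/2}

    xs_by_y = defaultdict(list)  # y -> all x with R(x, y)
    for x, y in r_set:
        xs_by_y[y].append(x)

    a, b = set(), set()
    # Light y: materialize R_light join S into A.
    for y, z in s_set:
        xs = xs_by_y.get(y, ())
        if 0 < len(xs) <= threshold:
            a.update((x, y, z) for x in xs)
    # Heavy y: pair each heavy y with the tuples of T, keeping those in S.
    heavy_ys = [y for y, xs in xs_by_y.items() if len(xs) > threshold]
    for y in heavy_ys:
        for z, u in t_set:
            if (y, z) in s_set:
                b.add((y, z, u))
    return a, b


r = {(1, 7), (2, 7), (3, 7), (4, 8)}
s = {(7, 5), (8, 5)}
t = {(5, 9), (5, 6)}
a, b = disjunctive_rule(r, s, t)
print(sorted(a), sorted(b))
```

Both output relations stay within about N^{3/2} tuples: A holds at most N^{1/2} tuples per tuple of S, and B pairs fewer than N^{1/2} heavy values with at most N tuples of T.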

Conclusions
Query evaluation summary:
- Information theory → Proof
- Proof → Algorithm
Open problems:
- Better “Proof → Algorithm”
- Fine-grained lower bounds

Thank You! Questions?
