Propagating Functional Dependencies with Conditions

Presentation transcript:

1 Propagating Functional Dependencies with Conditions
Wenfei Fan, University of Edinburgh & Bell Laboratories
Shuai Ma, University of Edinburgh
Yanli Hu, National University of Defense Technology
Jie Liu, Chinese Academy of Sciences
Yinghui Wu, University of Edinburgh

2 Dependency propagation: The problem
Given: a set Σ of functional dependencies (FDs) that hold on some of the sources.
Questions:
- Do these dependencies hold on the target?
- How can the set of view dependencies be computed?
[Figure: data integration. The sources, carrying the dependencies Σ, are mapped through a view to the target.]

3 Dependency propagation: An example
Sources R_S: customers in the UK, USA and Netherlands
  R_S(AC: int, phn: int, name: string, street: string, city: string, zip: string)
Source dependencies:
- An FD on R_UK, for UK customers
    φ1: R_UK(zip → street)
- FDs on R_UK and R_NL, for the UK and Netherlands sources
    φ2: R_UK(AC → city)
    φ3: R_NL(AC → city)
View definition: V = Q1 ∪ Q2 ∪ Q3, where
  Q1: select AC, phn, name, street, city, zip, '44' as CC from R_UK
  Q2: select AC, phn, name, street, city, zip, '01' as CC from R_USA
  Q3: select AC, phn, name, street, city, zip, '31' as CC from R_NL
Question: Do any of these source FDs hold on the view?

4 Source FDs may NOT hold on the target
View V = Q1 ∪ Q2 ∪ Q3, where
  Q1: select AC, phn, name, street, city, zip, '44' as CC from R_UK
  Q2: select AC, phn, name, street, city, zip, '01' as CC from R_USA
  Q3: select AC, phn, name, street, city, zip, '31' as CC from R_NL
Source instances: D_UK = {t1, t2}, D_USA = {t3, t4}, D_NL = {t5, t6}
View instance:
       AC    phn   name   street     city        zip       CC
  t1:  ...   ...   Mike   Portland   LDN         W1B 1JL   44
  t2:  ...   ...   Rick   Portland   LDN         W1B 1JL   44
  t3:  ...   ...   Joe    Copley     Darby       ...       01
  t4:  ...   ...   Mary   Walnut     Darby       ...       01
  t5:  ...   ...   Marx   Kruise     Amsterdam   ...       31
  t6:  ...   ...   Bart   Grote      Almere      ...       31
Source dependencies:
  φ1: R_UK(zip → street)
  φ2: R_UK(AC → city)
  φ3: R_NL(AC → city)
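To make the failure concrete, here is a minimal Python sketch of a plain FD check over rows stored as dictionaries. The helper names `holds_fd` and `with_cc` are illustrative, and the area-code values are hypothetical stand-ins chosen so that one area code occurs in both the UK and the NL source; they are not taken from the slide.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds on rows (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # The first tuple seen for a key fixes the expected RHS value.
        if seen.setdefault(key, val) != val:
            return False
    return True


def with_cc(rows, cc):
    """Mimic 'select ..., cc as CC from R' by attaching a constant CC column."""
    return [dict(row, CC=cc) for row in rows]


# Hypothetical source instances: area code 20 is used in both countries.
d_uk = [{"AC": 20, "city": "LDN"}, {"AC": 20, "city": "LDN"}]
d_nl = [{"AC": 20, "city": "Amsterdam"}, {"AC": 36, "city": "Almere"}]

# The view is the union of the sources, each tagged with its country code.
view = with_cc(d_uk, "44") + with_cc(d_nl, "31")

print(holds_fd(d_uk, ["AC"], ["city"]))   # True: holds on the UK source
print(holds_fd(d_nl, ["AC"], ["city"]))   # True: holds on the NL source
print(holds_fd(view, ["AC"], ["city"]))   # False: fails on the union view
```

Each source satisfies AC → city on its own, but the union mixes tuples that agree on AC and disagree on city, so the FD is lost on the view.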

5 The FDs indeed hold, but under conditions
Source dependencies:
  φ1: R_UK(zip → street)
  φ2: R_UK(AC → city)
  φ3: R_NL(AC → city)
View dependencies:
  ψ1: R([CC = '44', zip] → [street])
  ψ2: R([CC = '44', AC] → [city])
  ψ3: R([CC = '31', AC] → [city])
View instance: tuples t1 to t6, as on the previous slide.
FDs are propagated, but as CFDs rather than FDs!

6 Dependency Propagation
Dependency propagation: Σ |=_V φ
- Input: a view V, a set Σ of source dependencies (FDs or CFDs), and a single CFD φ on the view
- Question: is φ propagated from Σ via V? That is, for any source instance D, if D |= Σ then the view V(D) |= φ
Implication problem: Σ |= φ
- For any database D, if D |= Σ then the same database D |= φ
- A special case of the dependency propagation problem, in which the views are identity mappings
Example, with source dependencies
  φ1: R_UK(zip → street)
  φ2: R_UK(AC → city)
  φ3: R_NL(AC → city)
  Σ = {φ1, φ2, φ3}
  Σ |≠_V φ1, φ2, φ3
  Σ |= φ1, φ2, φ3
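The two problems on this slide can be written side by side as a short LaTeX sketch. The symbol \mathrm{id} for the identity view is notation introduced here for illustration and does not appear on the slide.

```latex
% Propagation: every source instance satisfying \Sigma yields a view satisfying \varphi.
\Sigma \models_V \varphi
  \;\iff\;
  \forall D \,\bigl( D \models \Sigma \;\Rightarrow\; V(D) \models \varphi \bigr)

% Implication is the special case where V is the identity mapping, so V(D) = D.
\Sigma \models \varphi
  \;\iff\;
  \Sigma \models_{\mathrm{id}} \varphi
  \;\iff\;
  \forall D \,\bigl( D \models \Sigma \;\Rightarrow\; D \models \varphi \bigr)
```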

7 Why bother?
- Data exchange: views derived from TGDs from the source to the target, source dependencies, and target dependencies. Is a target dependency guaranteed to hold (that is, is it propagated)?
- Data integration:
  - Constraint checking: do certain constraints hold on the integrated data? How can this be checked on a virtual view?
  - Update management: an insertion of (CC = 44, AC = 20, city = EDI, …) can be rejected without checking the data
  - Query optimization: rewriting queries on the view by making use of the derived target dependencies
- Data quality: no need to check, e.g., zip → street on target data taken from the UK source
- ...

8 Conditional functional dependencies (CFDs): review
CFD: R(X → Y, t_p), where
- X → Y: a traditional functional dependency (FD) on R
- Pattern tuple t_p:
  - Attributes: X ∪ Y
  - For each A in X (or Y), t_p[A] is either a constant or a wildcard (unnamed variable) _
Example:
  ψ1: R([CC, zip] → [street], (44, _ || _))
  ψ3: R([CC, AC] → [city], (31, _ || _))
  φ1: R_UK(zip → street, (_ || _)), an FD as a special case of CFDs
View CFDs of a special form: R(A → B, (x || x)), where A and B are attributes of R and x is a special variable
- Used to express domain constraints (A = B)
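The CFD semantics reviewed above can be sketched as a small Python checker. This is an illustrative sketch, not code from the talk: the function names and the dictionary encoding of pattern tuples are assumptions, the data values are hypothetical, and the special (x || x) form is not handled.

```python
WILDCARD = "_"


def matches(value, pattern):
    """A value matches a pattern entry if the entry is '_' or the same constant."""
    return pattern == WILDCARD or value == pattern


def holds_cfd(rows, tp_lhs, tp_rhs):
    """Check the CFD (X -> Y, tp) on rows.

    tp_lhs maps each attribute of X to a constant or '_';
    tp_rhs does the same for each attribute of Y.
    """
    lhs, rhs = list(tp_lhs), list(tp_rhs)
    seen = {}
    for row in rows:
        # Only tuples matching the LHS pattern are constrained.
        if not all(matches(row[a], tp_lhs[a]) for a in lhs):
            continue
        # A constrained tuple must match the RHS pattern ...
        if not all(matches(row[b], tp_rhs[b]) for b in rhs):
            return False
        # ... and agree on Y with every constrained tuple sharing its X values.
        key = tuple(row[a] for a in lhs)
        val = tuple(row[b] for b in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True


# Hypothetical view tuples: area code 20 occurs under two country codes.
view = [
    {"CC": "44", "AC": 20, "city": "LDN"},
    {"CC": "44", "AC": 20, "city": "LDN"},
    {"CC": "31", "AC": 20, "city": "Amsterdam"},
    {"CC": "31", "AC": 36, "city": "Almere"},
]

# R([CC, AC] -> [city], (44, _ || _)) holds on the view.
print(holds_cfd(view, {"CC": "44", "AC": "_"}, {"city": "_"}))   # True
# The plain FD AC -> city, written as a CFD with only wildcards, does not.
print(holds_cfd(view, {"AC": "_"}, {"city": "_"}))               # False
```

Keeping separate pattern dictionaries for the two sides allows an attribute to appear on both sides with different pattern entries.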

9 View definitions: A brief overview
A relational schema ℛ = {S1, …, Sn}
SPC query Q = ∏_Y(R_c × E_s), where
- R_c = {(A1: a1, …, Am: am)}
- E_s = σ_F(R1 × … × Rn)
  - F is a conjunction of equality atoms of the form A = B and A = 'a', for a constant 'a' in dom(A)
  - R_j is ρ(S) for some S in ℛ
SPCU query Q = V1 ∪ … ∪ Vn, where each V_i is an SPC query
Example:
  Q1 = {(CC: 44)} × R_UK, Q2 = {(CC: 01)} × R_USA, Q3 = {(CC: 31)} × R_NL
  R = Q1 ∪ Q2 ∪ Q3
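As a rough illustration of the normal form above, the following Python sketch evaluates an SPC query over relations stored as lists of dictionaries, and an SPCU query as a duplicate-free union. The function names, the argument layout and the sample rows are assumptions made for this example; renaming (ρ) is taken to have been applied already, so attribute names are pairwise disjoint.

```python
from itertools import product


def spc_query(constants, relations, attr_eq, const_eq, projection):
    """Evaluate pi_Y(R_c x sigma_F(R_1 x ... x R_n)).

    constants:  dict, the single tuple of the constant relation R_c
    relations:  list of relations (lists of dicts) with disjoint attribute names
    attr_eq:    list of (A, B) pairs for equality atoms A = B in F
    const_eq:   dict {A: a} for equality atoms A = 'a' in F
    projection: list of attributes Y
    """
    result = []
    for combo in product(*relations):          # Cartesian product R_1 x ... x R_n
        row = dict(constants)
        for part in combo:
            row.update(part)
        selected = all(row[a] == row[b] for a, b in attr_eq) and all(
            row[a] == c for a, c in const_eq.items()
        )
        if selected:
            projected = {a: row[a] for a in projection}
            if projected not in result:        # set semantics
                result.append(projected)
    return result


def spcu_query(branches):
    """Union of SPC query results, without duplicates."""
    out = []
    for rows in branches:
        for row in rows:
            if row not in out:
                out.append(row)
    return out


# Hypothetical source rows.
r_uk = [{"AC": 20, "city": "LDN"}]
r_nl = [{"AC": 20, "city": "Amsterdam"}]

q1 = spc_query({"CC": "44"}, [r_uk], [], {}, ["AC", "city", "CC"])
q3 = spc_query({"CC": "31"}, [r_nl], [], {}, ["AC", "city", "CC"])
view = spcu_query([q1, q3])
print(view)
```

Here Q1 and Q3 use no selection atoms, matching the example on the slide, where each branch only attaches a country-code constant to its source.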

10 Dependency Propagation from FDs to FDs
It is believed that the propagation problem from FDs to FDs is
- in PTIME for SPCU views
- undecidable for views defined in relational algebra
This PTIME result holds only if all attributes have an infinite domain.
When we define a schema, we specify the domains of attributes:
  R_S(AC: int, phn: int, name: string, street: string, city: string, zip: string)
In practice, it is common to find attributes with a finite domain: Boolean, Date, etc.
The general setting: finite-domain attributes may be present.
Theorem. The propagation problem from source FDs to view FDs is coNP-complete for SC views in the general setting.

11 Dependency Propagation from FDs to FDs
  View language          SP      SC             PC      SPC            SPCU           RA
  General setting        PTIME   coNP-complete  PTIME   coNP-complete  coNP-complete  Undecidable
  Infinite domain only   PTIME   PTIME          PTIME   PTIME          PTIME          Undecidable
There is interaction between domain constraints and dependency propagation.

12 Dependency Propagation from FDs to CFDs
  View language          SP      SC             PC      SPC            SPCU           RA
  General setting        PTIME   coNP-complete  PTIME   coNP-complete  coNP-complete  Undecidable
  Infinite domain only   PTIME   PTIME          PTIME   PTIME          PTIME          Undecidable
View CFDs alone do not make our lives harder: the complexity is the same as for the counterpart from FDs to FDs.

13 Dependency Propagation from CFDs to CFDs
  View language          S              P              C              SPC            SPCU           RA
  General setting        coNP-complete  coNP-complete  coNP-complete  coNP-complete  coNP-complete  Undecidable
  Infinite domain only   PTIME          PTIME          PTIME          PTIME          PTIME          Undecidable
Source CFDs complicate the propagation analysis.

14 Propagation Cover Problem
Problem statement
- Input: a view V and a set Σ of source dependencies (CFDs)
- Output: a propagation cover Σ_c, that is, a cover of all view CFDs propagated from Σ via V
[Figure: data integration. The sources carry Σ, the view maps them to the target, and the target carries Σ_c.]

15 Finding Propagation Cover: Nontrivial even for FDs
Example
- R(A1, B1, C1, …, An, Bn, Cn, D)
- Σ: Ai → Ci and Bi → Ci for i ∈ [1, n], and C1, …, Cn → D
- V = ∏_{A1, B1, …, An, Bn, D}(R), dropping the Ci attributes
The propagation cover Σ_c contains all FDs of the form η1, …, ηn → D, where ηi is either Ai or Bi for i ∈ [1, n]
- at least 2^n FDs, whereas the size of the input is O(n)
In contrast
- The implication problem for FDs is in linear time
- The dependency propagation problem is in PTIME for projection views
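The blow-up in this example can be reproduced with a short Python sketch based on attribute closure. The helper names are illustrative. The code enumerates every choice of Ai or Bi, confirms through the closure under Σ that the choice determines D, and counts the resulting view FDs.

```python
from itertools import product


def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs_set, rhs_set) pairs."""
    closed = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= closed and not rhs <= closed:
                closed |= rhs
                changed = True
    return closed


def source_fds(n):
    """Ai -> Ci and Bi -> Ci for i in [1, n], plus C1, ..., Cn -> D."""
    fds = []
    for i in range(1, n + 1):
        fds.append(({f"A{i}"}, {f"C{i}"}))
        fds.append(({f"B{i}"}, {f"C{i}"}))
    fds.append(({f"C{i}" for i in range(1, n + 1)}, {"D"}))
    return fds


def propagated_fds(n):
    """Left-hand sides {eta_1, ..., eta_n} over view attributes that determine D."""
    fds = source_fds(n)
    found = []
    for choice in product("AB", repeat=n):
        lhs = {f"{c}{i}" for i, c in enumerate(choice, start=1)}
        if "D" in closure(lhs, fds):
            found.append(lhs)
    return found


for n in (1, 2, 3, 4):
    print(n, len(propagated_fds(n)))   # prints 1 2, 2 4, 3 8, 4 16
```

Each of these left-hand sides is also minimal: removing any ηi leaves Ci underivable, so none of the 2^n FDs can be dropped in favour of a shorter one.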

16 Propagation Cover Problem: Harder for CFDs
Already hard for FDs and P views; more intricate for CFDs and SPC views:
- Possibly infinitely many CFDs, while there are at most exponentially many FDs
  - φ: R(A → B, t_p), where t_p[A] draws values from an infinite dom(A)
- Trivial FDs, but nontrivial CFDs
  - e.g., AX → A is trivial as an FD, but φ: R(AX → A, t_p) with t_p = (_, d_X || a) is not
- Transitivity involves pattern tuples
  - For FDs, A → B and B → C yield A → C
  - For CFDs, pattern tableaux have to be matched: if (X → Y, tp), (Y → Z, tp') and tp[Y] ≤ tp'[Y], then (X → Z, tp[X] || tp'[Z])
- Interaction between domain constraints and CFDs
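The pattern-matching condition in the transitivity rule can be sketched in Python as follows. The tuple encoding of a CFD as (lhs, rhs, tp_lhs, tp_rhs) and the names `leq` and `compose` are assumptions for illustration, and the attribute names and constants are hypothetical.

```python
WILDCARD = "_"


def leq(c1, c2):
    """c1 <= c2 holds if c2 is the wildcard or both are the same constant."""
    return c2 == WILDCARD or c1 == c2


def compose(cfd1, cfd2):
    """Chain (X -> Y, tp) with (Y -> Z, tp') into (X -> Z, tp[X] || tp'[Z]).

    Each CFD is encoded as (lhs, rhs, tp_lhs, tp_rhs). Returns None when the
    attribute lists differ or tp[Y] <= tp'[Y] fails.
    """
    x, y, tp_x, tp_y = cfd1
    y2, z, tp_y2, tp_z = cfd2
    if list(y) != list(y2):
        return None
    if not all(leq(a, b) for a, b in zip(tp_y, tp_y2)):
        return None
    return (x, z, tp_x, tp_z)


c1 = (["A"], ["B"], ("a",), ("b",))        # (A -> B, (a || b))
c2 = (["B"], ["C"], ("_",), ("c",))        # (B -> C, (_ || c))
c3 = (["B"], ["C"], ("b2",), ("c",))       # (B -> C, (b2 || c))

print(compose(c1, c2))   # (['A'], ['C'], ('a',), ('c',))
print(compose(c1, c3))   # None: the constant b does not match b2
```

Unlike plain FD transitivity, the second CFD is only applicable when its left-hand pattern is at least as general as what the first CFD guarantees on Y.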

17 Algorithm for Computing Minimal Cover of View CFDs
Input: source CFDs Σ and an SPC view V
Output: a minimal cover of the view CFDs propagated from Σ via V
- No redundant CFDs: no proper subset is a cover
- No redundant attributes or patterns: all CFDs are left-reduced
PropCFD_SPC: key idea
- An extension of the Reduction by Resolution (RBR) algorithm
  - First proposed by G. Gottlob (PODS 1987)
  - Computes a propagation cover of FDs over projection views
  - Runs in polynomial time in many practical cases
- Domain constraints are also represented as CFDs
PropCFD_SPC has the same complexity as RBR
- RBR is for FDs and P views
- PropCFD_SPC is for CFDs and SPC views

18 Algorithm PropCFD_SPC
Input
- V = ∏_Y(σ_F(R1 × R2 × R3)), where
  - R1(A, B), R2(C, D, E), R3(K, G, H, J)
  - Y = {A, B, C, D, H, J}
  - F = {A = H, D = G, E = K}
- Σ = {φ1, φ2}, where
  - φ1 = R2(CD → E, (_, c || a))
  - φ2 = R3(KGH → J, (_, c, b || _))
Step 1: Σ := MinCover(Σ)
Step 2:
(a) EQ = ComputeEQ(σ_F(R1 × R2 × R3), Σ)
(b) choose a representative rep(eq) for each equivalence class eq
Equivalence classes: {A, H}, {D, G}, {E, K}, {B}, {C}, {J}
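Step 2 can be approximated with a union-find structure over the attributes. The sketch below is a simplification under stated assumptions: it derives classes only from the equality atoms in F, whereas ComputeEQ as named on the slide also takes Σ as input, and the function name `compute_eq` and the rule "the first attribute of a merged pair becomes the representative" are choices made here for illustration.

```python
def compute_eq(attributes, equalities):
    """Group attributes into equivalence classes induced by atoms A = B.

    Returns (classes, rep): the classes as lists of attributes, and a map
    from every attribute to the representative of its class.
    """
    parent = {a: a for a in attributes}

    def find(a):
        # Follow parent links to the root, halving the path along the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b in equalities:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a      # the root of the first attribute wins

    classes = {}
    for a in attributes:
        classes.setdefault(find(a), []).append(a)
    rep = {a: find(a) for a in attributes}
    return list(classes.values()), rep


# Attributes of R1(A, B), R2(C, D, E), R3(K, G, H, J) and F = {A = H, D = G, E = K}.
attributes = ["A", "B", "C", "D", "E", "K", "G", "H", "J"]
classes, rep = compute_eq(attributes, [("A", "H"), ("D", "G"), ("E", "K")])

print(classes)                        # [['A', 'H'], ['B'], ['C'], ['D', 'G'], ['E', 'K'], ['J']]
print([rep[a] for a in "KGH"])        # ['E', 'D', 'A']
```

With these representatives, the left-hand side KGH of the second source CFD maps to EDA.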

19 Algorithm PropCFD_SPC
Step 3:
(a) Substitute each A ∈ eq with rep(eq) in the CFDs
  - φ1 = R2(CD → E, (_, c || a)) becomes φ1' = (CD → E, (_, c || a))
  - φ2 = R3(KGH → J, (_, c, b || _)) becomes φ2' = (EDA → J, (_, c, b || _))
  - Σ_v = {φ1', φ2'}
(b) Remove attributes not in Y = {A, B, C, D, H, J} from EQ
  - before: {A, H}, {D, G}, {E, K}, {B}, {C}, {J}
  - after: {A, H}, {D}, {B}, {C}, {J}
Step 4: Σ_c = RBR(Σ_v, EGK)
  - Ф1 = (CDA → J, (_, c, b || _))
Step 5: Σ_d = EQ2CFD(EQ)
  - Ф2 = (A → H, (x || x))
Output: MinCover(Σ_c ∪ Σ_d) = {Ф1, Ф2}
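The substitution in Step 3 and a single resolution step of the kind used in Step 4 can be sketched in Python. This is not the full RBR algorithm: it shows one resolution on attribute E only. The CFD encoding (a dictionary of left-hand patterns, a right-hand attribute and its pattern), the helper names, and the hard-coded representative map are assumptions made for this example.

```python
WILDCARD = "_"


def leq(c1, c2):
    """c1 <= c2 holds if c2 is the wildcard or both are the same constant."""
    return c2 == WILDCARD or c1 == c2


def meet(p1, p2):
    """Most specific pattern entry compatible with both, or None on a clash."""
    if p1 == WILDCARD:
        return p2
    if p2 == WILDCARD or p1 == p2:
        return p1
    return None


def substitute(cfd, rep):
    """Replace every attribute by the representative of its equivalence class."""
    lhs, rhs_attr, rhs_pat = cfd
    return ({rep.get(a, a): p for a, p in lhs.items()}, rep.get(rhs_attr, rhs_attr), rhs_pat)


def resolve(c1, c2, attr):
    """Resolve (X -> attr) with (attr Z -> B) on attr, giving (X Z -> B)."""
    lhs1, rhs1, pat1 = c1
    lhs2, rhs2, pat2 = c2
    if rhs1 != attr or attr not in lhs2 or not leq(pat1, lhs2[attr]):
        return None
    lhs = dict(lhs1)
    for a, p in lhs2.items():
        if a == attr:
            continue                      # attr is eliminated by the resolution
        merged = meet(lhs.get(a, WILDCARD), p)
        if merged is None:
            return None                   # clashing constants on a shared attribute
        lhs[a] = merged
    return (lhs, rhs2, pat2)


# Representatives assumed from Step 2: K -> E, G -> D, H -> A.
rep = {"K": "E", "G": "D", "H": "A"}
phi1 = ({"C": "_", "D": "c"}, "E", "a")              # R2(CD -> E, (_, c || a))
phi2 = ({"K": "_", "G": "c", "H": "b"}, "J", "_")    # R3(KGH -> J, (_, c, b || _))

phi1_sub = substitute(phi1, rep)
phi2_sub = substitute(phi2, rep)
print(phi2_sub)                           # ({'E': '_', 'D': 'c', 'A': 'b'}, 'J', '_')
print(resolve(phi1_sub, phi2_sub, "E"))   # ({'C': '_', 'D': 'c', 'A': 'b'}, 'J', '_')
```

The resolvent corresponds to CDA → J with pattern (_, c, b || _), which no longer mentions the dropped attribute E.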

20 Experimental Study
Investigate the impact of
- the source CFDs and the complexity of SPC views
CFD generator
- Input: schema ℛ, m, n, LHS, var%
- Output: a set Σ consisting of source CFDs
SPC view generator
- Input: schema ℛ, |Y|, |F|, |E_c|
- Output: an SPC view ∏_Y(σ_F(E_c))
Experimental settings
- Number of relations: at least 10, each with 10 to 20 attributes
- Number of CFDs ∈ [200, 2000], LHS ∈ [3, 9], var% ∈ [40%, 50%]
- SPC view: |Y| ∈ [5, 50], |F| ∈ [1, 10], |E_c| ∈ [2, 11]
- 1 PC with a 3.00GHz Intel(R) Pentium(R) D processor and 1GB of memory
- Results are an average of 5 tests on each dataset

21 Varying CFDs on the Source (|Y| = 25, |F| = 10, |E_c| = 4)
- Scales well w.r.t. |Σ|
- The cardinality of the minimal cover of propagated CFDs is smaller than |Σ|

22 Varying Projection Attributes (|Σ| = 2000, |F| = 10, |E_c| = 4)
- Runtime is sensitive to |Y|
- The larger |Y| is, the more view CFDs there are

23 Varying Selection Condition (|Σ| = 2000, |Y| = 25, |E_c| = 4)
- The larger |F| is, the smaller the runtime
- The cardinality of the minimal cover of propagated CFDs goes up and down

24 Varying Number of Relations (|Σ| = 2000, |F| = 10, |Y| = 25)
- The larger |E_c| is, the smaller the runtime
- The cardinality of the minimal cover of propagated CFDs goes down

25 Summary
- A complete picture of complexity bounds on dependency propagation from source FDs/CFDs to view FDs/CFDs, via views in various fragments of relational algebra
- The first complexity results on dependency propagation in the general setting, namely, in the presence of finite-domain attributes
- A practical algorithm for computing a minimal propagation cover for CFDs via SPC views, without incurring extra complexity: the same complexity as its counterpart for FDs via P views
- Open research issues:
  - adding union: SPCU views
  - adding finite-domain attributes
A useful tool for analyzing constraints in data exchange and data integration