1 Conditional Dependencies Wenfei Fan University of Edinburgh and Bell Laboratories.



2 Outline of Part III
Conditional functional dependencies (CFDs)
–Motivation for extending FDs with conditions: data cleaning
–Syntax and semantics
–Static analysis: satisfiability, implication, axiomatizability
Conditional inclusion dependencies (CINDs)
–Motivation: data cleaning and schema matching
–Syntax and semantics
–Static analysis: satisfiability, implication, axiomatizability
Algorithms and open research issues
–SQL techniques for inconsistency detection
–Heuristic for satisfiability and implication checking
–Repair

3 Conditional functional dependencies (CFDs)

4 Data in real-life is often dirty
Errors, conflicts and inconsistencies:
–Australia: 500,000 dead people retain active Medicare cards.
–US: the Pentagon asked 275 dead or wounded officers to re-enlist.
–UK: there are 81 million National Insurance numbers but only 60 million eligible citizens.
–It is estimated that in a 500,000-customer database, 120,000 customer records become invalid within a year, due to deaths, divorces, marriages and moves.
Typical data error rate in industry: 1% - 5%, up to 30%...

5 Dirty data is costly
–Poor data costs US companies $600 billion annually.
–Wrong price data in retail databases costs US customers $2.5 billion each year.
–AAA improves data quality by 20%, and saves $150,000… in postage stamps alone.
–Data cleaning takes 30%-80% of the development time in a data integration project.
–And don’t forget CIA intelligence on WMD in Iraq!
The need for (semi-)automated methods to clean data!

6 Characterizing the consistency of data
One of the central technical problems is how to tell whether the data is dirty or clean.
Specify consistency using integrity constraints:
–Inconsistencies emerge as violations of constraints
Constraints considered so far: traditional
–functional dependencies
–inclusion dependencies
–denial constraints (a special case of full dependencies)
–...
Question: are these traditional dependencies sufficient?

7 Example: customer relation
Schema: Cust(country, area-code, phone, street, city, zip)
Instance:
country  area-code  phone  street        city  zip
44       131        …      Mayfield      NYC   EH4 8LE
44       131        …      Crichton      NYC   EH4 8LE
01       908        …      Mountain Ave  NYC   07974
Functional dependencies (FDs):
–cust[country, area-code, phone] → cust[street, city, zip]
–cust[country, area-code] → cust[city]
The database satisfies the FDs. Is the data consistent?

8 Capturing inconsistencies in the data
cust([country = 44, zip] → [street])
–In the UK, zip code uniquely determines the street
–The constraint may not hold for other countries
It expresses a fundamental part of the semantics of the data.
It can NOT be expressed as a traditional FD:
–It does not hold on the entire relation; instead, it holds on tuples representing UK customers only
country  area-code  phone  street        city  zip
44       131        …      Mayfield      NYC   EH4 8LE
44       131        …      Crichton      NYC   EH4 8LE
01       908        …      Mountain Ave  NYC   07974

9 Two more constraints
cust([country = 44, area-code = 131, phone] → [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] → [street, zip, city = MH])
–In the UK, if the area code is 131, then the city has to be EDI
–In the US, if the area code is 908, then the city has to be MH
t1, t2 and t3 violate these constraints, which are obtained by
–refining cust([country, area-code, phone] → [street, city, zip])
–combining constants and variables
id  country  area-code  phone  street        city  zip
t1  44       131        …      Mayfield      NYC   EH4 8LE
t2  44       131        …      Crichton      NYC   EH4 8LE
t3  01       908        …      Mountain Ave  NYC   07974

10 The need for new constraints
cust([country = 44, zip] → [street])
cust([country = 44, area-code = 131, phone] → [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] → [street, zip, city = MH])
They capture inconsistencies that traditional FDs cannot detect.
Traditional constraints were developed for schema design, not for data cleaning!
Data integration in real-life: source constraints
–hold on a subset of sources
–hold conditionally on the integrated data
They are NOT expressible as traditional FDs:
–they do not hold on the entire relation
–they contain constant data values, besides logical variables

11 Conditional Functional Dependencies (CFDs)
An extension of traditional FDs: (R: X → Y, Tp)
–X → Y: embedded traditional FD on R
–Tp: a pattern tableau
–attributes of Tp: X ∪ Y
–each tuple in Tp consists of constants and the unnamed variable _
Example: cust([country = 44, zip] → [street]) is written as (cust(country, zip → street), Tp), with pattern tableau Tp:
country  zip  street
44       _    _

12 Example CFDs
Represent
–cust([country = 44, area-code = 131, phone] → [street, zip, city = EDI])
–cust([country = 01, area-code = 908, phone] → [street, zip, city = MH])
–cust([country, area-code, phone] → [street, city, zip])
as a SINGLE CFD: (cust(country, area-code, phone → street, city, zip), Tp)
Pattern tableau Tp: one tuple for each constraint
country  area-code  phone  street  city  zip
44       131        _      _       EDI   _
01       908        _      _       MH    _
_        _          _      _       _     _

13 Traditional FDs as a special case
Express cust[country, area-code] → cust[city] as a CFD: (cust(country, area-code → city), Tp)
Pattern tableau Tp: a single tuple consisting of _ only
country  area-code  city
_        _          _
CFDs subsume traditional FDs.

14 Semantics of CFDs
a ≍ b (a matches b) if
–either a or b is _
–both a and b are constants and a = b
Tuple t1 matches t2: t1 ≍ t2. For example, (a, b) ≍ (a, _), but (a, b) does not match (a, c).
DB satisfies (R: X → Y, Tp) iff for any tuple tp in the pattern tableau Tp and for any tuples t1, t2 in DB, if t1[X] = t2[X] ≍ tp[X], then t1[Y] = t2[Y] ≍ tp[Y]
–tp[X]: identifying the set of tuples on which the constraint tp applies, i.e., { t | t[X] ≍ tp[X] }
–t1[Y] = t2[Y] ≍ tp[Y]: enforcing the embedded FD, and the pattern of tp
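The semantics above can be sketched in a few lines of Python. This is only an illustrative reading of the definition, not code from the slides: the function names `matches` and `cfd_violations`, the dictionary representation of tuples, and the attribute name `area_code` are assumptions made for the example. The customer rows omit the phone column, which plays no role in the two CFDs checked here.

```python
WILDCARD = "_"  # the unnamed variable of a pattern tuple

def matches(a, b):
    """a ≍ b: either side is the wildcard, or both are the same constant."""
    return a == WILDCARD or b == WILDCARD or a == b

def cfd_violations(rows, lhs, rhs, tableau):
    """Indices of rows involved in a violation of (R: lhs -> rhs, tableau)."""
    bad = set()
    for tp in tableau:
        # Rows whose lhs values match the pattern: the scope of this constraint.
        scope = [i for i, t in enumerate(rows)
                 if all(matches(t[a], tp[a]) for a in lhs)]
        for i in scope:
            # Case t1 = t2: the rhs values must match the pattern.
            if not all(matches(rows[i][a], tp[a]) for a in rhs):
                bad.add(i)
            # Case t1 != t2: equal lhs values force equal rhs values.
            for j in scope:
                if (j > i
                        and all(rows[i][a] == rows[j][a] for a in lhs)
                        and any(rows[i][a] != rows[j][a] for a in rhs)):
                    bad.update((i, j))
    return bad

cust = [
    {"country": "44", "area_code": "131", "street": "Mayfield",
     "city": "NYC", "zip": "EH4 8LE"},
    {"country": "44", "area_code": "131", "street": "Crichton",
     "city": "NYC", "zip": "EH4 8LE"},
    {"country": "01", "area_code": "908", "street": "Mountain Ave",
     "city": "NYC", "zip": "07974"},
]

# cust([country = 44, zip] -> [street])
v_zip = cfd_violations(cust, ["country", "zip"], ["street"],
                       [{"country": "44", "zip": "_", "street": "_"}])

# (cust(country, area-code -> city), Tp) with three pattern tuples
v_city = cfd_violations(cust, ["country", "area_code"], ["city"], [
    {"country": "44", "area_code": "131", "city": "EDI"},
    {"country": "01", "area_code": "908", "city": "MH"},
    {"country": "_", "area_code": "_", "city": "_"},
])
```

Here `v_zip` contains the first two rows (same UK zip, different streets), while `v_city` contains all three rows, each of which violates a constant pattern on its own.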

15 Example: violation of CFDs
cust([country = 44, zip] → [street]), with pattern tableau
country  zip  street
44       _    _
Tuples t1 and t2 violate the CFD:
–t1[country, zip] = t2[country, zip] ≍ tp[country, zip]
–t1[street] ≠ t2[street]
The CFD applies to t1 and t2 since they match tp[country, zip].
id  country  area-code  phone  street        city  zip
t1  44       131        …      Mayfield      NYC   EH4 8LE
t2  44       131        …      Crichton      NYC   EH4 8LE
t3  01       908        …      Mountain Ave  NYC   07974

16 Violation of CFDs by a single tuple
(cust(country, area-code → city), Tp), with pattern tableau Tp:
id   country  area-code  city
tp1  44       131        EDI
tp2  01       908        MH
tp3  _        _          _
Tuple t1 does not satisfy the CFD:
–t1[country, area-code] = t1[country, area-code] ≍ tp1[country, area-code]
–t1[city] = t1[city]; however, t1[city] does not match tp1[city]
In contrast to traditional FDs, a single tuple may violate a CFD.
id  country  area-code  phone  street        city  zip
t1  44       131        …      Mayfield      NYC   EH4 8LE
t2  44       131        …      Crichton      NYC   EH4 8LE
t3  01       908        …      Mountain Ave  NYC   07974

17 CFDs vs. conditional tables
Conditional tables, Codd tables and variable tables have been studied for incomplete information.
–Conditional tables: representing infinitely many relation instances, one for each instantiation of variables
–Pattern tableau in a CFD: each pattern tuple is a constraint, and all constraints apply to the same relation instance
Relational tables, traditional dependencies and CFDs:
–One end of the spectrum: relations consisting of data only
–The other end of the spectrum: traditional dependencies defined in terms of logic variables
–CFDs: in between, with both data values and logic variables
CFDs: enforcing binding of semantically related data values

18 “Dirty” constraints?
A set of CFDs may be inconsistent!
Inconsistent: (R(A → B), Tp), with pattern tableau Tp:
id   A  B
tp1  _  b
tp2  _  c
In any nonempty database DB and for any tuple t in DB,
–tp1: t[B] must be b
–tp2: t[B] must be c
–Inconsistent if b and c are different
Inconsistent: Σ = {φ1, φ2}, φ1 = (R(A → B), Tp1), φ2 = (R(B → A), Tp2). Why?
Tp1:
id   A      B
tp1  true   b
tp2  false  c
Tp2:
id   B  A
tp3  b  false
tp4  c  true

19 The satisfiability problem
The satisfiability problem for CFDs is to determine, given a set Σ of CFDs, whether or not there exists a nonempty database DB that satisfies Σ, i.e., for any φ in Σ, DB satisfies φ.
–In other words: whether or not Σ makes sense
For traditional FDs, it is not an issue: one can specify any FDs without worrying about their consistency.
In contrast, a set of CFDs may be inconsistent!

20 The complexity of the satisfiability analysis
Theorem. The satisfiability problem for CFDs is NP-complete.
Nontrivial: contrast this with the trivial consistency analysis of FDs!
Proof idea:
–Upper bound: the small model property: if Σ is satisfiable, then there is a DB that satisfies Σ and consists of a single tuple!
–Lower bound: reduction from the non-tautology problem
Good news: PTIME special cases
Theorem. Given a set Σ of CFDs on a relation schema R, the satisfiability of Σ can be determined in O(|Σ|²) time if either
–the schema R is predefined (fixed), or
–no attributes in Σ have a finite domain
Proof idea: an extension of chase for CFDs
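The small model property suggests a direct, if exponential, satisfiability test: enumerate candidate single tuples and check each against every pattern tuple. The sketch below is an assumption-laden illustration rather than the chase-based procedure mentioned on the slide. The names `satisfiable`, `single_tuple_ok` and `FRESH` are invented here; for an attribute with an infinite domain, the code assumes it is enough to try the constants mentioned in Σ plus one fresh value, since all unmentioned values match exactly the same patterns.

```python
from itertools import product

WILDCARD = "_"
FRESH = object()  # stands for "some value not mentioned in any pattern"

def matches(a, b):
    return a == WILDCARD or b == WILDCARD or a == b

def single_tuple_ok(t, cfds):
    """Does the one-tuple database {t} satisfy every CFD (lhs, rhs, tableau)?"""
    for lhs, rhs, tableau in cfds:
        for tp in tableau:
            if (all(matches(t[a], tp[a]) for a in lhs)
                    and not all(matches(t[a], tp[a]) for a in rhs)):
                return False
    return True

def satisfiable(attrs, cfds, finite_domains):
    """Brute-force search for a satisfying single tuple (exponential time)."""
    candidates = {}
    for a in attrs:
        if a in finite_domains:
            candidates[a] = list(finite_domains[a])
        else:
            consts = {tp[a] for _, _, tableau in cfds for tp in tableau
                      if a in tp and tp[a] != WILDCARD}
            candidates[a] = sorted(consts) + [FRESH]
    return any(single_tuple_ok(dict(zip(attrs, values)), cfds)
               for values in product(*(candidates[a] for a in attrs)))

# (R(A -> B), Tp) with pattern tuples (_, b) and (_, c)
sigma1 = [(["A"], ["B"], [{"A": "_", "B": "b"}, {"A": "_", "B": "c"}])]

# phi1 = (R(A -> B), Tp1), phi2 = (R(B -> A), Tp2)
sigma2 = [
    (["A"], ["B"], [{"A": True, "B": "b"}, {"A": False, "B": "c"}]),
    (["B"], ["A"], [{"B": "b", "A": False}, {"B": "c", "A": True}]),
]

r1 = satisfiable(["A", "B"], sigma1, {})
r2 = satisfiable(["A", "B"], sigma2, {"A": [True, False]})  # A is Boolean
r3 = satisfiable(["A", "B"], sigma2, {})                    # A unrestricted
```

Under these assumptions `r1` and `r2` come out false and `r3` true: the second set of CFDs is inconsistent only because A ranges over a finite (Boolean) domain, which is the kind of interaction that makes the general problem hard.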

21 The implication problem
The implication problem for CFDs is to determine, given a set Σ of CFDs and a single CFD φ, whether Σ implies φ, denoted by Σ |= φ, i.e., for any database DB, if DB satisfies Σ, then DB satisfies φ.
Example: Σ = {φ1, φ2}, φ1 = (R(A → B), Tp1), φ2 = (R(B → C), Tp2), φ = (R(A → C), Tp). Then Σ |= φ. Why?
Tp1:
id   A  B
tp1  _  b
Tp2:
id   B  C
tp1  _  c
Tp:
id  A  C
tp  a  c

22 The complexity of the implication problem
For traditional FDs, the implication problem is in linear time.
In contrast, the implication problem for CFDs is intractable.
Theorem. The implication problem for CFDs is coNP-complete.
Tractable special cases
Theorem. Given a set Σ of CFDs and a single CFD φ on a relation schema R, whether Σ |= φ can be determined in O((|Σ| + |φ|)²) time if either
–the schema R is predefined, or
–no attributes in Σ and φ have a finite domain
Proof idea: an extension of chase for CFDs

23 Finite axiomatizability: Flashback
Armstrong’s axioms can be found in every database textbook:
–Reflexivity: If Y ⊆ X, then X → Y
–Augmentation: If X → Y, then XZ → YZ
–Transitivity: If X → Y and Y → Z, then X → Z
Sound and complete for FD implication, i.e., Σ |= φ iff φ can be inferred from Σ using reflexivity, augmentation and transitivity.
Question: is there a sound and complete inference system for the implication analysis of CFDs?
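For traditional FDs, implication is usually decided not by searching for a derivation with the axioms but by computing an attribute closure. The sketch below shows the textbook closure test; the function names `closure` and `implies` are chosen for this example, and the simple fixpoint loop is quadratic rather than the linear-time variant.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds.

    fds is a list of (lhs, rhs) pairs, each a collection of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Fire an FD whose lhs is already covered and whose rhs adds something.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """Sigma |= lhs -> rhs iff rhs is contained in the closure of lhs."""
    return set(rhs) <= closure(lhs, fds)

fds = [(["A"], ["B"]), (["B"], ["C"])]
a_to_c = implies(fds, ["A"], ["C"])   # follows by transitivity
c_to_a = implies(fds, ["C"], ["A"])   # does not follow
```

The closure test agrees with Armstrong’s axioms because they are sound and complete: `a_to_c` is true via transitivity, while `c_to_a` is false.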

24 Finite axiomatizability of CFDs
Theorem. There is a sound and complete inference system I for implication analysis of CFDs.
–Sound: if Σ |- φ, i.e., φ can be proved from Σ using I, then Σ |= φ
–Complete: if Σ |= φ, then Σ |- φ using I
The inference system is more involved than its counterpart for traditional FDs, namely, Armstrong’s axioms. There are 5 axioms.
A normal form of CFDs: (R: X → A, tp), where tp is a single pattern tuple.

25 Axioms for CFDs: extension of Armstrong’s axioms
Reflexivity: If A ∈ X, then (R: X → A, tp), where tp is either
A1 … Ak A | A
_  … _  _ | _
or
A1 … Ak A | A
_  … _  a | a
Augmentation: If (X → A, tp) and B ∈ attr(R), then (BX → A, t’p), where
tp:
A1     … Ak     | A
tp[A1] … tp[Ak] | tp[A]
t’p:
A1     … Ak     B | A
tp[A1] … tp[Ak] _ | tp[A]

26 Axioms for CFDs: transitivity
Transitivity: if ([A1,…,Ak] → [B1,…,Bm], tp) and ([B1,…,Bm] → [C1,…,Cn], t’p), and tp[B1,…,Bm] matches t’p[B1,…,Bm], then ([A1,…,Ak] → [C1,…,Cn], t’’p), where
tp:
A1     … Ak     | B1     … Bm
tp[A1] … tp[Ak] | tp[B1] … tp[Bm]
t’p:
B1      … Bm      | C1      … Cn
t’p[B1] … t’p[Bm] | t’p[C1] … t’p[Cn]
t’’p:
A1     … Ak     | C1      … Cn
tp[A1] … tp[Ak] | t’p[C1] … t’p[Cn]

27 Axioms for CFDs: reduction
Reduction: if ([B, X] → A, tp), tp[B] = _, and tp[A] = a, with tp:
A1     … Ak     B | A
tp[A1] … tp[Ak] _ | a
then (X → A, t’p), with t’p:
A1     … Ak     | A
tp[A1] … tp[Ak] | a

28 Axioms for CFDs: finite domain upgrade
Upgrade: if the only consistent values for B are b1, b2, ..., bn, where dom(B) = { b1, …, bn, …, bm }, and (R: [A1,...,Ak, B] → A, tp) holds for each of the pattern tuples
A1     … Ak     B  | A
tp[A1] … tp[Ak] b1 | tp[A]
tp[A1] … tp[Ak] …  | tp[A]
tp[A1] … tp[Ak] bn | tp[A]
then (R: [A1,...,Ak, B] → A, tp) holds for the pattern tuple
A1     … Ak     B | A
tp[A1] … tp[Ak] _ | tp[A]

29 Static analyses: CFD vs. FD
General setting:
      satisfiability  implication    finite axiom’ty
CFD   NP-complete     coNP-complete  yes
FD    O(1)            O(n)           yes
In the absence of finite-domain attributes:
      satisfiability  implication    finite axiom’ty
CFD   O(n²)           O(n²)          yes
FD    O(1)            O(n)           yes
Theorem: In the absence of finite-domain attributes, Reflexivity, Augmentation, Transitivity and Reduction are sound and complete for CFD implication.
Complications: finite-domain attributes, interaction between satisfiability and implication analyses

30 Conditional inclusion dependencies (CINDs)

31 Example: Amazon database
Schema:
–order(asin, title, type, price, country, county) -- source
–book(asin, isbn, title, price, format) -- target
–CD(asin, title, price, genre) -- target
asin: Amazon standard identification number
Instances:
order:
asin  title      type  price  country  county
a23   H. Porter  book  17.99  US       DL
a12   J. Denver  CD    7.94   UK       Reyden
book:
asin  isbn  title         price
a23   b32   Harry Porter  17.99
a56   b65   Snow white    7.94
CD:
asin  title       price  genre
a12   J. Denver   17.99  country
a56   Snow White  7.94   a-book

32 Schema matching
Inclusion dependencies from source to target (e.g., Clio), with source order(asin, title, type, price, country, county) and targets book(asin, isbn, title, price) and CD(asin, title, price, genre).
Traditional inclusion dependencies:
–order[asin, title, price] ⊆ book[asin, title, price]
–order[asin, title, price] ⊆ CD[asin, title, price]
These inclusion dependencies do not make sense!

33 Schema matching: dependencies with conditions
Source order(asin, title, type, price, country, county); targets book(asin, isbn, title, price) and CD(asin, title, price, genre).
–order[asin, title, price] ⊆ book[asin, title, price] holds only if type = book
–order[asin, title, price] ⊆ CD[asin, title, price] holds only if type = CD
Conditional inclusion dependencies:
–order[asin, title, price; type = book] ⊆ book[asin, title, price]
–order[asin, title, price; type = CD] ⊆ CD[asin, title, price]
The constraints do not hold on the entire order table.

34 Data cleaning with conditional dependencies
CIND1: order[asin, title, price; type = book] ⊆ book[asin, title, price]
CIND2: order[asin, title, price; type = CD] ⊆ CD[asin, title, price]
Tuple t1 violates CIND1. Tuple t2 violates CIND2.
order:
id  asin  title      type  price  country  county
t1  a23   H. Porter  book  17.99  US       DL
t2  a12   J. Denver  CD    7.94   UK       Reyden
book:
asin  isbn  title         price
a23   b32   Harry Porter  17.99
a56   b65   Snow white    7.94
CD:
asin  title       price  genre
a12   J. Denver   17.99  country
a56   Snow White  7.94   a-book

35 More on data cleaning
CD[asin, title, price; genre = ‘a-book’] ⊆ book[asin, title, price; format = ‘audio’]
–The inclusion relation CD[asin, title, price] ⊆ book[asin, title, price] holds only if genre = ‘a-book’, i.e., when the CD is an audio book
–In addition, the format of the corresponding book must be audio: a pattern for the referenced tuple
book:
asin  isbn  title         price  format
a23   b32   Harry Porter  17.99  Hard cover
a56   b65   Snow White    17.94  audio
CD:
asin  title       price  genre
a12   J. Denver   17.99  country
a56   Snow White  7.94   a-book

36 Conditional Inclusion Dependencies (CINDs)
(R1[X; Xp] ⊆ R2[Y; Yp], Tp)
–R1[X] ⊆ R2[Y]: embedded traditional IND from R1 to R2
–Tp: a pattern tableau
–attributes of Tp: X ∪ Xp ∪ Y ∪ Yp
–tuples in Tp consist of constants and the unnamed variable _
Example: express CIND1: order[asin, title, price; type = book] ⊆ book[asin, title, price] as
(order[asin, title, price; type] ⊆ book[asin, title, price; nil], Tp), where nil is the empty list, with pattern tableau Tp:
asin  title  price  type | asin  title  price
_     _      _      book | _     _      _

37 Example CINDs
CIND2: order[asin, title, price; type = CD] ⊆ CD[asin, title, price]
(order[asin, title, price; type] ⊆ CD[asin, title, price; nil], Tp), with Tp:
asin  title  price  type | asin  title  price
_     _      _      CD   | _     _      _
CIND3: CD[asin, title, price; genre = ‘a-book’] ⊆ book[asin, title, price; format = ‘audio’]
(CD[asin, title, price; genre] ⊆ book[asin, title, price; format], Tp), with Tp:
asin  title  price  genre  | asin  title  price  format
_     _      _      a-book | _     _      _      audio

38 Traditional INDs as a special case
R1[X] ⊆ R2[Y], where X: [A1, …, An] and Y: [B1, …, Bn]
As a CIND: (R1[X; nil] ⊆ R2[Y; nil], Tp)
Pattern tableau Tp: a single tuple consisting of _ only
A1 … An | B1 … Bn
_  … _  | _  … _
CINDs subsume traditional INDs.

39 Semantics of CINDs
DB = (DB1, DB2), where DBj is an instance of Rj, j = 1, 2.
DB satisfies (R1[X; Xp] ⊆ R2[Y; Yp], Tp) iff for any tuple t1 in DB1 and any tuple tp in the pattern tableau Tp, if t1[X, Xp] ≍ tp[X, Xp], then there exists t2 in DB2 such that
–t1[X] = t2[Y] (traditional IND semantics)
–t2[Y, Yp] ≍ tp[Y, Yp] (matching the pattern tuple on Y, Yp)
Patterns:
–t1[X, Xp] ≍ tp[X, Xp]: identifying the set of R1 tuples on which tp applies: { t1 | t1[X, Xp] ≍ tp[X, Xp] }
–t2[Y, Yp] ≍ tp[Y, Yp]: enforcing the embedded IND and the constraint specified by patterns Y, Yp
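The CIND semantics can be checked with a direct nested scan. The following Python sketch is illustrative only: `cind_violations` is a name invented here, tuples are plain dictionaries, prices are kept as strings, and each pattern tuple is split into a left part (over X and Xp) and a right part (over Y and Yp), which is a representation chosen for the example.

```python
WILDCARD = "_"

def matches(a, b):
    return a == WILDCARD or b == WILDCARD or a == b

def cind_violations(r1, r2, x, xp, y, yp, tableau):
    """Indices of r1 tuples that violate (R1[x; xp] ⊆ R2[y; yp], tableau).

    Each pattern tuple is a pair (left, right) of dicts over x + xp and y + yp.
    """
    bad = set()
    for left, right in tableau:
        for i, t1 in enumerate(r1):
            # The pattern applies only to tuples matching it on X and Xp.
            if not all(matches(t1[a], left[a]) for a in x + xp):
                continue
            # Look for a witness t2: IND condition plus the pattern on Y, Yp.
            found = any(
                all(t1[a] == t2[b] for a, b in zip(x, y))
                and all(matches(t2[b], right[b]) for b in y + yp)
                for t2 in r2
            )
            if not found:
                bad.add(i)
    return bad

order = [
    {"asin": "a23", "title": "H. Porter", "type": "book", "price": "17.99"},
    {"asin": "a12", "title": "J. Denver", "type": "CD", "price": "7.94"},
]
book = [
    {"asin": "a23", "isbn": "b32", "title": "Harry Porter", "price": "17.99"},
    {"asin": "a56", "isbn": "b65", "title": "Snow white", "price": "7.94"},
]
cd = [
    {"asin": "a12", "title": "J. Denver", "price": "17.99", "genre": "country"},
    {"asin": "a56", "title": "Snow White", "price": "7.94", "genre": "a-book"},
]

key = ["asin", "title", "price"]
any_key = {a: "_" for a in key}

# CIND1: order[asin, title, price; type = book] ⊆ book[asin, title, price]
v1 = cind_violations(order, book, key, ["type"], key, [],
                     [({**any_key, "type": "book"}, any_key)])
# CIND2: order[asin, title, price; type = CD] ⊆ CD[asin, title, price]
v2 = cind_violations(order, cd, key, ["type"], key, [],
                     [({**any_key, "type": "CD"}, any_key)])
```

On this instance `v1` flags the first order tuple (its title differs from the book table) and `v2` flags the second (its price differs from the CD table).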

40 Example
(CD[asin, title, price; genre] ⊆ book[asin, title, price; format], Tp), with Tp:
asin  title  price  genre  | asin  title  price  format
_     _      _      a-book | _     _      _      audio
The following DB satisfies the CIND:
book:
asin  isbn  title         price  format
a23   b32   Harry Porter  17.99  Hard cover
a56   b65   Snow White    7.94   audio
CD:
asin  title       price  genre
a12   J. Denver   17.99  country
a56   Snow White  7.94   a-book

41 More examples
CIND1: (order[asin, title, price; type] ⊆ book[asin, title, price; nil], Tp), with Tp:
asin  title  price  type | asin  title  price
_     _      _      book | _     _      _
The following DB violates CIND1. Why?
order:
id  asin  title      type  price  country  county
t1  a23   H. Porter  book  17.99  US       DL
t2  a12   J. Denver  CD    7.94   UK       Reyden
book:
asin  isbn  title         price
a23   b32   Harry Porter  17.99
a56   b65   Snow white    7.94
CD:
asin  title      price  genre
a12   J. Denver  17.99  country
a56   S. White   7.94   a-book

42 The satisfiability problem for CINDs
The satisfiability problem for CINDs is to determine, given a set Σ of CINDs, whether or not there exists a nonempty database DB that satisfies Σ, i.e., for any ψ in Σ, DB satisfies ψ.
Recall:
–Any set of traditional INDs is always satisfiable!
–For CFDs, the satisfiability problem is intractable.
In contrast:
Theorem. Any set of CINDs is always satisfiable!
Despite the increased expressive power, the complexity of the satisfiability analysis does not go up.

43 The implication problem for CINDs
The implication problem for CINDs is to decide, given a set Σ of CINDs and a single CIND ψ, whether Σ implies ψ (Σ |= ψ).
For traditional INDs, the implication problem is PSPACE-complete.
For CINDs, the complexity does not hike up, to an extent:
Theorem. For CINDs containing no finite-domain attributes, the implication problem is PSPACE-complete.
In the general setting, however, we have to pay a price:
Theorem. The implication problem for CINDs is EXPTIME-complete.
Proof idea:
–Lower bound: reduction from the two-player tiling game
–Upper bound: an extension of the chase for CINDs

44 Finite axiomatizability of CINDs
Rules for inferring IND implication:
–Reflexivity: R[X] ⊆ R[X]
–Projection and Permutation: If R1[A1, …, Ak] ⊆ R2[B1, …, Bk], then R1[Ai1, …, Aik] ⊆ R2[Bi1, …, Bik]
–Transitivity: If R1[X] ⊆ R2[Y] and R2[Y] ⊆ R3[Z], then R1[X] ⊆ R3[Z]
Sound and complete for IND implication.
CINDs retain the finite axiomatizability:
Theorem. There is a sound and complete inference system for implication analysis of CINDs.
There are 8 axioms.

45 Inference rules for CINDs
Normal form of CINDs: (R1[X; Xp] ⊆ R2[Y; Yp], tp)
–tp is a single pattern tuple
–tp[A] is a constant iff A is in Xp or Yp (tp[B] = _ if B is in X or Y)
Inference rules:
Reflexivity: (R[X; nil] ⊆ R[X; nil], tp), where tp is
A1 … Ak | A1 … Ak
_  … _  | _  … _
Projection and permutation: If (R1[X; Xp] ⊆ R2[Y; Yp], tp), then (R1[X’; X’p] ⊆ R2[Y’; Y’p], t’p), for any permutation of X, Xp, where
tp:
X  Xp     | Y  Yp
_  tp[Xp] | _  tp[Yp]
t’p:
X’  X’p     | Y’  Y’p
_   tp[X’p] | _   tp[Y’p]

46 Axioms for CINDs: transitivity
Transitivity: if (R1[X; Xp] ⊆ R2[Y; Yp], tp) and (R2[Y; Yp] ⊆ R3[Z; Zp], t’p), where tp and t’p are equal on Yp, then (R1[X; Xp] ⊆ R3[Z; Zp], t”p), where
tp:
X  Xp     | Y  Yp
_  tp[Xp] | _  tp[Yp]
t’p:
Y  Yp     | Z  Zp
_  tp[Yp] | _  t’p[Zp]
t”p:
X  Xp     | Z  Zp
_  tp[Xp] | _  t’p[Zp]

47 Axioms for CINDs: downgrading
Downgrading: if (R1[X, A; Xp] ⊆ R2[Y, B; Yp], tp), with tp:
X  A  Xp     | Y  B  Yp
_  _  tp[Xp] | _  _  tp[Yp]
then (R1[X; Xp, A] ⊆ R2[Y; Yp, B], t’p), with t’p:
X  Xp     A | Y  Yp     B
_  tp[Xp] a | _  tp[Yp] a

48 Axioms for CINDs: augmentation
Augmentation: if (R1[X; Xp] ⊆ R2[Y; Yp], tp) and A ∈ attr(R1), with tp:
X  Xp     | Y  Yp
_  tp[Xp] | _  tp[Yp]
then (R1[X; Xp, A] ⊆ R2[Y; Yp], t’p), with t’p:
X  Xp     A | Y  Yp
_  tp[Xp] a | _  tp[Yp]

49 Axioms for CINDs: reduction
Reduction: if (R1[X; Xp] ⊆ R2[Y; Yp, B], tp), with tp:
X  Xp     | Y  Yp     B
_  tp[Xp] | _  tp[Yp] tp[B]
then (R1[X; Xp] ⊆ R2[Y; Yp], t’p), with t’p:
X  Xp     | Y  Yp
_  tp[Xp] | _  tp[Yp]

50 Axioms for CINDs: finite domain reduction
F-reduction: if dom(A) = { a1, …, an } and (R1[X; Xp, A] ⊆ R2[Y; Yp], tp) holds for each of the pattern tuples
X  Xp     A  | Y  Yp
_  tp[Xp] a1 | _  tp[Yp]
_  tp[Xp] …  | _  tp[Yp]
_  tp[Xp] an | _  tp[Yp]
then (R1[X; Xp] ⊆ R2[Y; Yp], tp), with tp:
X  Xp     | Y  Yp
_  tp[Xp] | _  tp[Yp]

51 Axioms for CINDs: finite domain upgrade
Upgrade: if dom(A) = { a1, …, an } and (R1[X; Xp, A] ⊆ R2[Y; Yp, B], tp) holds for each of the pattern tuples
X  Xp     A  | Y  Yp     B
_  tp[Xp] a1 | _  tp[Yp] a1
_  tp[Xp] …  | _  tp[Yp] …
_  tp[Xp] an | _  tp[Yp] an
then (R1[X, A; Xp] ⊆ R2[Y, B; Yp], tp), with tp:
X  A  Xp     | Y  B  Yp
_  _  tp[Xp] | _  _  tp[Yp]

52 Static analyses: CIND vs. IND
General setting:
      satisfiability  implication       finite axiom’ty
CIND  O(1)            EXPTIME-complete  yes
IND   O(1)            PSPACE-complete   yes
In the absence of finite-domain attributes:
      satisfiability  implication      finite axiom’ty
CIND  O(1)            PSPACE-complete  yes
IND   O(1)            PSPACE-complete  yes
Theorem: In the absence of finite-domain attributes, Reflexivity, Projection and Permutation, Transitivity, Augmentation, Downgrading and Reduction are sound and complete for CIND implication.
CINDs retain most complexity bounds of their traditional counterpart.

53 CFDs and CINDs taken together
We need both CFDs and CINDs for
–data cleaning
–schema matching
Theorem. The implication problem for CFDs and CINDs is undecidable.
–Not surprising: the implication problem for traditional FDs and INDs is already undecidable
Theorem. The consistency problem for CFDs and CINDs is undecidable.
–In contrast, any set of traditional FDs and INDs is consistent!
–Proof idea: reduction from the implication problem for FDs and INDs

54 Static analyses: CFD + CIND vs. FD + IND
            satisfiability  implication  finite axiom’ty
CFD + CIND  undecidable     undecidable  no
FD + IND    O(1)            undecidable  no
CINDs and CFDs properly subsume FDs and INDs.
Both the satisfiability analysis and the implication analysis are beyond reach in practice.
This calls for effective heuristic methods.

55 Algorithms and open research issues

56 Detecting CFD Violations
CFD: (cust(country, area-code, phone → street, city, zip), Tp), with pattern tableau Tp:
country  area-code  phone  street  city  zip
44       131        _      _       EDI   _
01       908        _      _       MH    _
_        _          _      _       _     _
Detection runs the CFD against the data:
country  area-code  phone  street        city  zip
44       131        …      Mayfield      NYC   EH4 8LE
44       131        …      Crichton      NYC   EH4 8LE
01       908        …      Mountain Ave  NYC   07974

57 Detecting CFD violations
Input: a set Σ of CFDs and a database DB
Output: the set of tuples in DB that violate at least one CFD in Σ
Approach: automatically generate SQL queries to find violations
Complication 1: consider (R: X → Y, Tp); the pattern tableau may be large (recall that each tuple in Tp is in fact a constraint)
–Goal: the size of the SQL queries is independent of Tp
–Trick: treat Tp as a data table
CINDs can be checked along the same lines.

58 Single CFD: step 1
(cust(country, area-code, phone → street, city, zip), Tp)
A pair of SQL queries, treating Tp as a data table:
–Single-tuple violation (pattern matching)
–Multi-tuple violations (traditional FDs)
Single-tuple violation: Qc
  select *
  from R t, Tp tp
  where t[country] ≍ tp[country] AND t[area-code] ≍ tp[area-code] AND t[phone] ≍ tp[phone]
    AND (t[street] <> tp[street] OR t[city] <> tp[city] OR t[zip] <> tp[zip])
–<>: not matching
–t[A1] ≍ tp[A1]: (t[A1] = tp[A1] OR tp[A1] = _)

59 Single CFD: step 2
(cust(country, area-code, phone → street, city, zip), Tp)
Multi-tuple violations (the semantics of traditional FDs): Qv
  select distinct t.country, t.area-code, t.phone
  from R t, Tp tp
  where t[country] ≍ tp[country] AND t[area-code] ≍ tp[area-code] AND t[phone] ≍ tp[phone]
  group by t.country, t.area-code, t.phone
  having count(distinct street, city, zip) > 1
Tp is treated as a data table.
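The pair of detection queries can be tried out with Python’s built-in sqlite3 module. This is a sketch under several assumptions, not the generated SQL from the slides: the column name `area_code`, the phone numbers, and the last two customer rows are hypothetical values added so that a multi-tuple violation exists; "not matching" on the right-hand side is spelled out as "the pattern is a constant and the value differs"; and because SQLite accepts only a single expression inside COUNT(DISTINCT ...), the three right-hand-side columns are concatenated with a separator.

```python
import sqlite3

X = ["country", "area_code", "phone"]   # lhs attributes of the embedded FD
Y = ["street", "city", "zip"]           # rhs attributes
cols = ", ".join(X + Y)
marks = ", ".join("?" * len(X + Y))

con = sqlite3.connect(":memory:")
con.execute(f"CREATE TABLE cust ({cols})")
con.execute(f"CREATE TABLE tp ({cols})")  # the pattern tableau, stored as data

con.executemany(f"INSERT INTO cust VALUES ({marks})", [
    ("44", "131", "1234567", "Mayfield", "NYC", "EH4 8LE"),
    ("44", "131", "3456789", "Crichton", "NYC", "EH4 8LE"),
    ("01", "908", "3456789", "Mountain Ave", "NYC", "07974"),
    # Hypothetical rows: same lhs values, different street.
    ("01", "212", "5550000", "Elm Str.", "NYC", "10001"),
    ("01", "212", "5550000", "Oak Str.", "NYC", "10001"),
])
con.executemany(f"INSERT INTO tp VALUES ({marks})", [
    ("44", "131", "_", "_", "EDI", "_"),
    ("01", "908", "_", "_", "MH", "_"),
    ("_", "_", "_", "_", "_", "_"),
])

# The query text depends only on the attribute lists, not on the size of tp.
lhs_match = " AND ".join(f"(t.{a} = tp.{a} OR tp.{a} = '_')" for a in X)
rhs_mismatch = " OR ".join(f"(tp.{a} <> '_' AND t.{a} <> tp.{a})" for a in Y)
y_key = " || '|' || ".join(f"t.{a}" for a in Y)

qc = (f"SELECT DISTINCT t.rowid FROM cust AS t, tp "
      f"WHERE {lhs_match} AND ({rhs_mismatch}) ORDER BY t.rowid")
qv = (f"SELECT DISTINCT t.country, t.area_code, t.phone FROM cust AS t, tp "
      f"WHERE {lhs_match} "
      f"GROUP BY t.country, t.area_code, t.phone "
      f"HAVING COUNT(DISTINCT {y_key}) > 1")

single = [row[0] for row in con.execute(qc)]   # rowids of single-tuple violations
multi = con.execute(qv).fetchall()             # lhs values with conflicting rhs
```

With this data `single` lists the first three rows, each of which clashes with a constant city pattern, and `multi` reports the one lhs combination shared by the two hypothetical rows. Adding pattern tuples to `tp` changes the data, not the query text, which is the point of treating the tableau as a table.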

60 Multiple CFDs
Complication 2: if the set Σ has n CFDs, do we use 2n SQL queries, and thus 2n passes of the database DB?
Goal:
–2 SQL queries no matter how many CFDs are in Σ
–the size of the SQL queries is independent of Tp
Trick: merge multiple CFDs into one
–Given (R: X1 → Y1, Tp1), (R: X2 → Y2, Tp2)
–Create a single pattern table Tm over X1 ∪ X2 ∪ Y1 ∪ Y2
–Introduce a don’t-care variable to populate attributes of pattern tuples in X1 – X2, etc. (tp[A] …
–Modify the pair of SQL queries by using Tm

61 Handling multiple CFDs
CFD1: (area → state, T1)
area | state
_ | _
212 | NY
CFD2: (zip → state, T2)
zip | state
07974 | NJ
90291 | CA
01202 | _
CFD3: (area, zip → state, T3)
area | zip | state
… | … | CA
… | … | CA
CFDM: (area, zip → state, TM), where TM (area, zip, state) collects the pattern tuples of CFD2, CFD1 and CFD3
Qc: select * from R t, TM tp where t[area] ≍ tp[area] AND t[zip] ≍ tp[zip] AND t[state] <> tp[state]
Qv: select distinct area, zip from Macro group by area, zip having count(distinct state) > 1
Macro: select (case tp[area] when … then … else t[area] end) as area, ... from R t, TM tp where t[area] ≍ tp[area] AND t[zip] ≍ tp[zip] AND tp[state] = _
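The merged pair of queries can be sketched as follows with Python's sqlite3 module. This is an illustration under stated assumptions: the don't-care symbol did not survive on the slides, so it is written '@' here; the rows of R are hypothetical; and only the pattern tuples of CFD1 and CFD2 are merged, because the values of T3 are not recoverable. A don't-care cell matches any value, like a wildcard, but is masked out in the grouping step so that each CFD groups only on its own left-hand side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (area, zip, state)")
conn.execute("CREATE TABLE tm (area, zip, state)")   # merged tableau TM

conn.executemany("INSERT INTO tm VALUES (?,?,?)", [
    ("@", "07974", "NJ"), ("@", "90291", "CA"), ("@", "01202", "_"),  # from CFD2
    ("_", "@", "_"), ("212", "@", "NY"),                               # from CFD1
])
conn.executemany("INSERT INTO r VALUES (?,?,?)", [   # hypothetical data
    ("908", "07974", "NJ"),
    ("908", "07974", "NY"),   # wrong state for zip 07974; area 908 now has two states
    ("212", "01202", "NY"),
    ("310", "90291", "CA"),
])

# t[A] matches tp[A] if equal, or if tp[A] is a wildcard or a don't-care.
MATCH = """(t.area = tp.area OR tp.area IN ('_', '@'))
       AND (t.zip = tp.zip OR tp.zip IN ('_', '@'))"""

QC = f"""
SELECT DISTINCT t.area, t.zip, t.state FROM r t, tm tp
WHERE {MATCH} AND t.state <> tp.state AND tp.state <> '_'
"""

# Macro masks don't-care attributes so grouping ignores them.
QV = f"""
SELECT DISTINCT area, zip FROM (
    SELECT CASE tp.area WHEN '@' THEN '@' ELSE t.area END AS area,
           CASE tp.zip  WHEN '@' THEN '@' ELSE t.zip  END AS zip,
           t.state AS state
    FROM r t, tm tp
    WHERE {MATCH} AND tp.state = '_') AS macro
GROUP BY area, zip
HAVING COUNT(DISTINCT state) > 1
"""
qc_rows = conn.execute(QC).fetchall()
qv_rows = conn.execute(QV).fetchall()
print(qc_rows, qv_rows)
```

Two queries cover both CFDs in one pass each; adding a third CFD over the same attributes would only add rows to `tm`.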

62 Keeping things tidy…
Updating the database
CFD: (area → state, T)
area | state
_ | _
212 | NY
tuple | cc | area | phone | addr | city | zip | state
t1 | … | … | … | Elm Str. | MH | 07974 | NJ
t2 | … | … | … | Pine Str. | MH | 07974 | NY
t3 | … | … | … | Oak Str. | NYC | 01202 | NY
t4 | … | … | … | Main Str. | LA | 90291 | CA
t5 | … | … | … | Rice Str. | LA | 90291 | CT
Tuple deletions: a tuple deletion might remove violations. Tuples that were dirty before the deletion might become clean!
Tuple insertions: a tuple insertion might introduce violations. Tuples that were clean before the insertion might become dirty!

63 Incremental inconsistency detection
Input: a set Σ of CFDs, a database DB, the set V of tuples in DB that violate Σ, and changes ∆DB to DB
Output: the set V’ of tuples in DB + ∆DB that violate CFDs in Σ
∆DB: a set of tuples to be inserted into DB or deleted from DB
Approaches:
Batch approach:
– Compute DB’ = DB + ∆DB
– Apply the SQL detection queries to DB’
Incremental approach: compute the change ∆V such that the new violations V’ = the old violations V + ∆V
Why?

64 The need for incremental inconsistency detection
Small changes ∆DB to DB tend to incur only small changes ∆V to V
– more efficient to compute ∆V than to compute V’ from scratch
Minimize unnecessary recomputation and traversal of DB
Goal: generate new SQL queries that perform incremental detection
– modify the SQL detection queries by leveraging ∆DB and V
– design and maintain auxiliary structures (indexing, mark)

65 Logging violations
CFD: (area → state, T)
area | state
_ | _
212 | NY
tuple | cc | area | phone | addr | city | zip | state | B_C | B_V
t1 | … | … | … | Elm Str. | MH | 07974 | NJ | 0 | 1
t2 | … | … | … | Pine Str. | MH | 07974 | NY | 0 | 1
t3 | … | … | … | Oak Str. | NYC | 01202 | NJ | 1 | 0
t4 | … | … | … | Main Str. | LA | 90291 | CA | 0 | 0
Use one pair of columns for each CFD
Attribute B_C records whether tuple t violates query Qc of the CFD
Attribute B_V records whether tuple t violates query Qv of the CFD
Initialization:
update R t set t[B_C] = 1 where t in (Qc)
update R t set t[B_V] = 1 where t in (Qv)
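The initialization of the two flag columns can be sketched with Python's sqlite3 module. The relation is cut down to the attributes of the CFD, and the area codes 908, 212 and 310 are assumptions chosen so that the resulting flags agree with the table on the slide; the original area values did not survive. Column names `bc` and `bv` stand for B_C and B_V.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (id TEXT, area TEXT, state TEXT, "
             "bc INTEGER DEFAULT 0, bv INTEGER DEFAULT 0)")
conn.execute("CREATE TABLE tp (area TEXT, state TEXT)")
conn.executemany("INSERT INTO tp VALUES (?,?)", [("_", "_"), ("212", "NY")])

# Hypothetical area codes; states follow the slide.
conn.executemany("INSERT INTO r (id, area, state) VALUES (?,?,?)", [
    ("t1", "908", "NJ"), ("t2", "908", "NY"),
    ("t3", "212", "NJ"), ("t4", "310", "CA"),
])

# B_C: tuples returned by Qc (they clash with a constant pattern).
conn.execute("""
UPDATE r SET bc = 1 WHERE id IN (
    SELECT t.id FROM r t, tp
    WHERE (t.area = tp.area OR tp.area = '_')
      AND t.state <> tp.state AND tp.state <> '_')
""")

# B_V: tuples whose area is returned by Qv (same area, several states).
conn.execute("""
UPDATE r SET bv = 1 WHERE area IN (
    SELECT t.area FROM r t, tp
    WHERE (t.area = tp.area OR tp.area = '_') AND tp.state = '_'
    GROUP BY t.area
    HAVING COUNT(DISTINCT t.state) > 1)
""")

flags = conn.execute("SELECT id, bc, bv FROM r ORDER BY id").fetchall()
print(flags)
```

Keeping the flags in the relation itself lets later updates consult and adjust them without rerunning both detection queries over the whole table.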

66 Handling deletions
CFD: (area → state, T)
area | state
_ | _
212 | NY
tuple | cc | area | phone | addr | city | zip | state | B_C | B_V
t1 | … | … | … | Elm Str. | MH | 07974 | NY | 0 | 1
t2 | … | … | … | Pine Str. | MH | 07974 | NJ | 0 | 1
t3 | … | … | … | Oak Str. | NYC | 01202 | NJ | 1 | 0
t4 | … | … | … | Main Str. | LA | 90291 | CA | 0 | 0
Let t_del be the tuple we want to delete from R
Step 1: delete from R t where t = t_del
Step 2: update R t set t[B_V] = 0 where t[B_V] = 1 AND t[area] = t_del[area] AND 1 = (select count(distinct state) from R t’ where t’[area] = t_del[area])
How about batch deletions?
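The two deletion steps can be sketched as a small function over a flagged table, using Python's sqlite3 module. The helper name `delete_tuple`, the reduced schema and the area codes are assumptions made for the illustration; the flags are preloaded to match the table on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (id TEXT, area TEXT, state TEXT, "
             "bc INTEGER, bv INTEGER)")
conn.executemany("INSERT INTO r VALUES (?,?,?,?,?)", [
    ("t1", "908", "NY", 0, 1), ("t2", "908", "NJ", 0, 1),
    ("t3", "212", "NJ", 1, 0), ("t4", "310", "CA", 0, 0),
])

def delete_tuple(conn, tid):
    """Delete one tuple and clear B_V flags that no longer hold."""
    row = conn.execute("SELECT area FROM r WHERE id = ?", (tid,)).fetchone()
    if row is None:
        return
    area = row[0]
    # Step 1: remove the tuple.
    conn.execute("DELETE FROM r WHERE id = ?", (tid,))
    # Step 2: if the remaining tuples of that area agree on a single
    # state, they stop being multi-tuple violations.
    conn.execute("""
        UPDATE r SET bv = 0
        WHERE bv = 1 AND area = :area
          AND 1 = (SELECT COUNT(DISTINCT state) FROM r t2
                   WHERE t2.area = :area)
    """, {"area": area})

delete_tuple(conn, "t2")
flags = conn.execute("SELECT id, bc, bv FROM r ORDER BY id").fetchall()
print(flags)
```

Only tuples sharing the deleted tuple's area are touched. The B_C flags need no maintenance on deletion because a single-tuple violation does not depend on other tuples.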

67 Handling insertions
CFD: (area → state, T)
area | state
_ | _
212 | NY
tuple | cc | area | phone | addr | city | zip | state | B_C | B_V
t3 | … | … | … | Oak Str. | NYC | 01202 | NJ | 1 | 0
t4 | … | … | … | Main Str. | LA | 90291 | CA | 0 | 0
t_ins | … | … | … | Rice Str. | LA | 90291 | CT | |
Let t_ins be the tuple we want to insert into R
Step 1: insert into R values t_ins
Step 2: update R t set t[B_C] = 1 where t = t_ins AND exists (select * from T tp where t[area] ≍ tp[area] AND t[state] <> tp[state])
Step 3: update R t set t[B_V] = 1, t_ins[B_V] = 1 where t[area] = t_ins[area] AND t[state] ≠ t_ins[state] AND exists (select * from T tp where t_ins[area] ≍ tp[area] AND tp[state] = _)
How about batch insertions?
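The three insertion steps can be sketched with Python's sqlite3 module. The helper name `insert_tuple`, the reduced schema and the area codes are assumptions for this illustration. Step 3 on the slide sets the flag of both the existing tuples and t_ins in one statement, which plain SQL cannot express directly; here a single update covers the tuples that disagree with t_ins together with t_ins itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE r (id TEXT, area TEXT, state TEXT, "
             "bc INTEGER DEFAULT 0, bv INTEGER DEFAULT 0)")
conn.execute("CREATE TABLE tp (area TEXT, state TEXT)")
conn.executemany("INSERT INTO tp VALUES (?,?)", [("_", "_"), ("212", "NY")])
conn.executemany("INSERT INTO r VALUES (?,?,?,?,?)", [
    ("t3", "212", "NJ", 1, 0), ("t4", "310", "CA", 0, 0),
])

def insert_tuple(conn, tid, area, state):
    p = {"id": tid, "area": area, "state": state}
    # Step 1: insert the tuple with both flags cleared.
    conn.execute("INSERT INTO r (id, area, state) VALUES (:id, :area, :state)", p)
    # Step 2: flag it if it clashes with a constant pattern (Qc).
    conn.execute("""
        UPDATE r SET bc = 1
        WHERE id = :id AND EXISTS (
            SELECT 1 FROM tp
            WHERE (r.area = tp.area OR tp.area = '_')
              AND r.state <> tp.state AND tp.state <> '_')
    """, p)
    # Step 3: flag the tuples of the same area with a different state,
    # and the new tuple itself, if a wildcard pattern applies (Qv).
    conn.execute("""
        UPDATE r SET bv = 1
        WHERE area = :area
          AND (state <> :state OR id = :id)
          AND EXISTS (SELECT 1 FROM r o
                      WHERE o.area = :area AND o.state <> :state)
          AND EXISTS (SELECT 1 FROM tp
                      WHERE (:area = tp.area OR tp.area = '_')
                        AND tp.state = '_')
    """, p)

insert_tuple(conn, "t5", "310", "CT")
flags = conn.execute("SELECT id, bc, bv FROM r ORDER BY id").fetchall()
print(flags)
```

In this toy run the inserted tuple shares its area with t4 but names a different state, so both end up with B_V = 1 while t3 is left untouched.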

68 Handling batch insertions
Let R_ins be the tuples we want to insert into R
Step 1: Find tuples in R_ins which violate Qc
Step 2: Find clean tuples in R that become dirty, due to some tuple(s) in R_ins
Step 3: Find tuples in R_ins that become dirty due to some dirty tuple(s) in R
Step 4: Find clean tuples in R_ins that violate the CFD
Step 5: Insert R_ins in R

69 The source code
Let R_ins be the tuples we want to insert into R
Step 1: update R_ins t_ins set t_ins[B_C] = 1 where t_ins in (Qc)
Step 2: update R t set t[B_V] = 1 where t[B_C] = 0 AND t[B_V] = 0 AND exists (select * from T tp where t[area] ≍ tp[area] AND tp[state] = _) AND exists (select * from R_ins t_ins where t_ins[area] = t[area] AND t_ins[state] ≠ t[state])
Step 3: update R_ins t_ins set t_ins[B_V] = 1 where exists (select * from R t where t_ins[area] = t[area] AND t[B_V] = 1)
Step 4: update R_ins t_ins set t_ins[B_V] = 1 where t_ins[B_C] = 0 AND t_ins[B_V] = 0 AND t_ins[area] IN (select area from R_ins t’_ins, T tp where t’_ins[B_C] = 0 AND t’_ins[B_V] = 0 AND t’_ins[area] ≍ tp[area] AND tp[state] = _ group by area having count(distinct state) > 1)
Step 5: insert into R (select * from R_ins)
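The five batch steps can be run end to end on a toy instance with Python's sqlite3 module. This is a sketch that follows the slide's statements as closely as SQLite allows; the table name `r_ins`, the reduced schema and all rows are assumptions made for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
schema = "(id TEXT, area TEXT, state TEXT, bc INTEGER, bv INTEGER)"
conn.execute(f"CREATE TABLE r {schema}")
conn.execute(f"CREATE TABLE r_ins {schema}")
conn.execute("CREATE TABLE tp (area TEXT, state TEXT)")
conn.executemany("INSERT INTO tp VALUES (?,?)", [("_", "_"), ("212", "NY")])
conn.executemany("INSERT INTO r VALUES (?,?,?,?,?)", [
    ("t3", "212", "NJ", 1, 0), ("t4", "310", "CA", 0, 0),
])
conn.executemany("INSERT INTO r_ins VALUES (?,?,?,0,0)", [
    ("t5", "310", "CT"), ("t6", "415", "CA"),
    ("t7", "415", "WA"), ("t8", "212", "NJ"),
])

STEPS = [
    # Step 1: new tuples that clash with a constant pattern (Qc).
    """UPDATE r_ins SET bc = 1 WHERE id IN (
         SELECT i.id FROM r_ins i, tp
         WHERE (i.area = tp.area OR tp.area = '_')
           AND i.state <> tp.state AND tp.state <> '_')""",
    # Step 2: clean tuples of R made dirty by some new tuple.
    """UPDATE r SET bv = 1
       WHERE bc = 0 AND bv = 0
         AND EXISTS (SELECT 1 FROM tp
                     WHERE (r.area = tp.area OR tp.area = '_')
                       AND tp.state = '_')
         AND EXISTS (SELECT 1 FROM r_ins i
                     WHERE i.area = r.area AND i.state <> r.state)""",
    # Step 3: new tuples that join an already dirty group of R.
    """UPDATE r_ins SET bv = 1
       WHERE EXISTS (SELECT 1 FROM r
                     WHERE r.area = r_ins.area AND r.bv = 1)""",
    # Step 4: still-clean new tuples that conflict with each other.
    """UPDATE r_ins SET bv = 1
       WHERE bc = 0 AND bv = 0 AND area IN (
         SELECT i.area FROM r_ins i, tp
         WHERE i.bc = 0 AND i.bv = 0
           AND (i.area = tp.area OR tp.area = '_') AND tp.state = '_'
         GROUP BY i.area HAVING COUNT(DISTINCT i.state) > 1)""",
    # Step 5: move the flagged batch into R.
    "INSERT INTO r SELECT * FROM r_ins",
]
for stmt in STEPS:
    conn.execute(stmt)

flags = conn.execute("SELECT id, bc, bv FROM r ORDER BY id").fetchall()
print(flags)
```

Each step scans the batch or the groups it touches, so the work grows with the size of the batch and the affected areas rather than with a full re-detection over R.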

70 Scalability in Instance Size

71 Scalability in NumConsts

72 Scalability in Noise

73 Merging CFDs

74 Incremental Deletions

75 Incremental Batch Deletions

76 Incremental Insertions

77 Incremental Batch Insertions

78 Checking the satisfiability of CFDs
Input: a set Σ of CFDs
MAXSC: find a maximum subset of Σ that is consistent
Complexity: the MAXSC problem for CFDs is NP-complete
Theorem: there is an ε-approximation algorithm for MAXSC
– there exists a constant ε such that the subset Σm found by the algorithm satisfies card(Σm) > ε · card(OPT(Σ))
Proof idea: approximation-factor-preserving reduction to MAXGSAT
Open questions: effective heuristic algorithms
– for checking the satisfiability of CFDs + CINDs (undecidable)
– for determining implication of CFDs, CINDs, and CFDs + CINDs
– for finding minimum cover of CFDs, CINDs, and CFDs + CINDs

79 Automated methods for finding a repair
Input: a relational database DB, and a set Σ of CFDs
Output: a repair DB’ of DB such that cost(DB’, DB) is minimal
– repair: DB’ satisfies Σ
– “good”: cost(DB’, DB)
– DB’ is “close” to the original data in DB
– minimizing changes to “accurate” attributes
Complexity: finding an optimal repair is
– NP-complete (data complexity) for traditional FDs, for a fixed set of FDs (or INDs) and fixed schema
– PSPACE-complete for CFDs + CINDs (combined complexity)
Open questions: effective heuristics for repairing databases based on CFDs, CINDs, and CFDs + CINDs

80 Incremental repair
Input: a clean database DB, changes ∆DB to DB, and a set Σ of CFDs
Output: a repair DB’ of DB + ∆DB
Complexity: the local data cleaning problem is NP-hard, even if ∆DB consists of a single tuple
Open questions: find effective heuristic algorithms for incrementally repairing databases based on
– CFDs
– CINDs
– CFDs + CINDs

81 Discovering CFDs and CINDs
Input: sample databases of a schema R
Output: CFDs and CINDs that hold on all (or most) database instances of R
Difficulty: a naïve approach may find non-representative CFDs and CINDs as large as the sample data
Open questions: find effective methods for discovering
– CFDs
– CINDs
– CFDs + CINDs

82 Summary
Conditional functional dependencies
– for data cleaning rather than schema design
– complexity bounds of satisfiability and implication analyses
– a sound and complete inference system
Conditional inclusion dependencies
– for data cleaning and schema matching in practice
– complexity bounds of satisfiability and implication analyses
– a sound and complete inference system
Complexity bounds for CFDs and CINDs taken together
SQL techniques for automatic detection of CFD violations
– a pair of SQL queries for validating multiple CFDs
– incremental techniques for validating CFDs
A practical method for data cleaning and schema matching

83 References
Conditional Functional Dependencies for Data Cleaning. Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis. The 23rd International Conference on Data Engineering (ICDE).
Extending Dependencies with Conditions. Loreto Bravo, Wenfei Fan, Shuai Ma.
Improving Data Quality: Consistency and Accuracy. Gao Cong, Wenfei Fan, Floris Geerts, Xibei Jia, Shuai Ma.
Conditional Functional Dependencies for Capturing Data Inconsistencies. Wenfei Fan, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis.