Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Conditional Dependencies Wenfei Fan University of Edinburgh and Bell Laboratories.

Similar presentations


Presentation on theme: "1 Conditional Dependencies Wenfei Fan University of Edinburgh and Bell Laboratories."— Presentation transcript:

1 1 Conditional Dependencies Wenfei Fan University of Edinburgh and Bell Laboratories

2 2 Conditional functional dependencies (CFDs) –Motivation for extending FDs with conditions: data cleaning –Syntax and semantics –Static analysis: satisfiability, implication, axiomatizability Conditional inclusion dependencies (CINDs) –Motivation: data cleaning and schema matching –Syntax and semantics –Static analysis: satisfiability, implication, axiomatizability Algorithms and open research issues –SQL techniques for inconsistency detection –Heuristic for satisfiability and implication checking –Repair Outline of Part III

3 3 Conditional functional dependencies (CFDs) –Motivation for extending FDs with conditions: data cleaning –Syntax and semantics –Static analysis: satisfiability, implication, axiomatizability Conditional inclusion dependencies (CINDs) –Motivation: data cleaning and schema matching –Syntax and semantics –Static analysis: consistency, implication, axiomatizability Algorithms and open research issues –SQL techniques for inconsistency detection –Heuristic for satisfiability and implication checking –Repair Conditional functional dependencies (CFDs)

4 4 Data in real-life is often dirty Errors, conflicts and inconsistencies Australia: 500,000 dead people retain active Medicare cards US: Pentagon asked 275 dead/wounded officers to re-enlist. UK: there are 81 million National Insurance numbers but only 60 million eligible citizens. It is estimated that in a 500,000 customer database, 120,000 customer records become invalid within a year, due to deaths, divorces, marriages, moves. typical data error rate in industry: 1% - 5%, up to 30%...

5 5 Dirty data is costly Poor data costs US companies $600 billion annually Wrong price data in retail databases costs US customers $2.5 billion each year AAA improves data quality by 20%, and saves $150,000… in postage stamps alone 30%-80% of the development time for data cleaning in a data integration project and don’t forget CIA intelligence on WMD in Iraq! The need for (semi-)automated methods to clean data!

6 6 Characterizing the consistency of data One of the central technical problems is how to tell whether the data is dirty or clean Specify consistency using integrity constraints Inconsistencies emerge as violations of constraints Constraints considered so far: traditional –functional dependencies –inclusion dependencies –denial constraints (a special case of full dependencies) –... Question: are these traditional dependencies sufficient?

7 7 Example: customer relation Schema: Cust(country, area-code, phone, street, city, zip) Instance: countryarea-codephonestreetcityzip 441311234567MayfieldNYCEH4 8LE 441313456789CrichtonNYCEH4 8LE 019083456789Mountain AveNYC07974 functional dependencies (FDs): cust[country, area-code, phone]  cust[street, city, zip] cust[country, area-code]  cust[city] The database satisfies the FDs. Is the data consistent?

8 8 Capturing inconsistencies in the data cust ([country = 44, zip]  [street]) In the UK, zip code uniquely determines the street The constraint may not hold for other countries It expresses a fundamental part of the semantics of the data It can NOT be expressed as a traditional FD –It does not hold on the entire relation; instead, it holds on tuples representing UK customers only countryarea-codephonestreetcityzip 441311234567MayfieldNYCEH4 8LE 441313456789CrichtonNYCEH4 8LE 019083456789Mountain AveNYC07974

9 9 Two more constraints cust([country = 44, area-code = 131, phone]  [street, zip, city = EDI]) cust([country = 01, area-code = 908, phone]  [street, zip, city = MH]) –In the UK, if the area code is 131, then the city has to be EDI –In the US, if the area code is 908, then the city has to be MH t1, t2 and t3 violate these constraints –refining cust([country, area-code, phno]  [street, city, zip]) –combining constants and variables idcountryArea-codephonestreetcityzip t1441311234567MayfieldNYCEH4 8LE t2441313456789CrichtonNYCEH4 8LE t3019083456789Mountain AveNYC07974

10 10 The need for new constraints cust([country = 44, zip]  [street]) cust([country = 44, area-code = 131, phone]  [street, zip, city = EDI]) cust([country = 01, area-code = 908, phone]  [street, zip, city = MH]) They capture inconsistencies that traditional FDs cannot detect Traditional constraints were developed for schema design, not for data cleaning! Data integration in real-life: source constraints –hold on a subset of sources –hold conditionally on the integrated data They are NOT expressible as traditional FDs –do not hold on the entire relation –contain constant data values, besides logical variables

11 11 An extension of traditional FDs: (R: X  Y, Tp) X  Y: embedded traditional FD on R Tp: a pattern tableau –attributes: X  Y –each tuple in Tp consists of constants and unnamed variable _ Example: cust([country = 44, zip]  [street]) (cust (country, zip  street), Tp) pattern tableau Tp Conditional Functional Dependencies (CFDs) countryzipstreet 44__

12 12 Represent cust([country = 44, area-code = 131, phone]  [street, zip, city = EDI]) cust([country = 01, area-code = 908, phone]  [street, zip, city = MH]) cust([country, area-code, phone]  [street, city, zip]) as a SINGLE CFD: (cust(country, area-code, phone  street, city, zip ), Tp) pattern tableau Tp: one tuple for each constraint Example CFDs countryarea-codephonestreetcityzip 44131__Edi_ 01908__MH_ ______

13 13 Express cust[country, area-code]  cust[city] as a CFD: (cust(country, area-code,  city ), Tp) pattern tableau Tp: a single tuple consisting of _ only CFDs subsume traditional FDs Traditional FDs as a special case countryarea-codecity ___

14 14 a  b (a matches b) if –either a or b is _ –both a and b are constants and a = b tuple t1 matches t2: t1  t2 (a, b)  (a, _), but (a, b) does not match (a, c) DB satisfies (R: X  Y, Tp) iff for any tuple tp in the pattern tableau Tp and for any tuples t1, t2 in DB, if t1[X] = t2[X]  tp[X], then t1[Y] = t2[Y]  tp[Y] –tp[X]: identifying the set of tuples on which the constraint tp applies, ie, { t | t[X]  tp[X]} –t1[Y] = t2[Y]  tp[Y]: enforcing the embedded FD, and the pattern of tp Semantics of CFDs

15 15 cust([country = 44, zip]  [street]) Tuples t1 and t2 violate the CFD t1[country, zip] = t2[country, zip]  tp[country, zip] t1[street]  t2[street] The CFD applies to t1 and t2 since they match tp[country, zip] Example: violation of CFDs idcountryarea-codephonestreetcityzip t1441311234567MayfieldNYCEH4 8LE t2441313456789CrichtonNYCEH8 8LE t3019083456789Mountain AveNYC07974 countryzipstreet 44__

16 16 (cust(country, area-code  city ), Tp) Tuple t1 does not satisfy the CFD t1[country, area-code] = t1[country, area-code]  tp1[country, area-code] t1[city] = t1[city]; however, t1[city] does not match tp1[city] In contrast to traditional FDs, a single tuple may violate a CFD Violation of CFDs by a single tuple idcountryarea-codecity tp144131Edi tp201908MH tp3___ idcountryarea-codephonestreetcityzip t1441311234567MayfieldNYCEH4 8LE t2441313456789CrichtonNYCEH8 8LE t3019083456789Mountain AveNYC07974

17 17 Conditional tables, Codd tables and variable tables have been studied for incomplete information Conditional tables: representing infinitely many relation instances, one for each instantiation of variables Pattern tableau in a CFD: each pattern tuple is a constraint, and all constraints applying to the same relation instance Relational table, traditional dependencies and CFDs One end of the spectrum: relations consisting of data only The other end of the spectrum: traditional dependencies defined in terms of logic variables CFD: in the between, both data values and logic variables CFDs: enforcing binding of semantically related data values CFDs vs. conditional tables

18 18 “Dirty” constraints? A set of CFDs may be inconsistent! Inconsistent: (R(A  B), Tp) In any nonempty database DB and for any tuple t in DB, –tp1: t[B] must be b –tp2: t[B] must be c –Inconsistent if b and c are different inconsistent  = {  1,  2 },  1 = (R(A  B ), Tp1),  2 = (R(B  A ), Tp2) Why? idAB tp1_b tp2_c Tp idAB tp1trueb tp2falsec idBA tp3bfalse tp4ctrue

19 19 The satisfiability problem The satisfiability problem for CFDs is to determine, given a set  of CFDs, whether or not there exists a nonempty database DB that satisfies , i.e., for any  in , DB satisfies . Whether or not  makes sense For traditional FDs, it is not an issue: one can specify any FDs without worrying about their consistency In contrast, a set of CFDs may be inconsistent!

20 20 The complexity of the satisfiability analysis Theorem. The satisfiability problem for CFDs is NP-complete. Nontrivial: contrast this with the trivial consistency analysis of FDs! Proof idea: Upper bound: the small model property: if  is satisfiable, then there is DB that satisfies  and consists of a single tuple! Lower bound: reduction from the non-tautology problem Good news: PTIME special cases Theorem. Given a set  of CFDs on a relation schema R, the satisfiability of  can be determined in O(|  | 2 ) time if either the schema R is predefined (fixed), or no attributes in  have a finite domain Proof idea: an extension of chase for CFDs

21 21 The implication problem The implication problem for CFDs is to determine, given a set  of CFDs and a single CFD , whether  implies , denoted by  |= , i.e., for any database DB, if DB satisfies , then DB satisfies . Example:  = {  1,  2 },  1 = (R(A  B ), Tp1),  2 = (R(B  C ), Tp2)  = (R(A  C ), Tp)  |= . Why? idAB tp1_b Tp1 Tp2 idBC tp1_c idAC tpac

22 22 The complexity of the implication problem For traditional FDs, the implication problem is in linear time In contrast, the implication problem for CFDs is intractable Theorem. The implication problem for CFDs is coNP-complete. Tractable special cases Theorem. Given a set  of CFDs and a single CFD  on a relation schema R, whether  |=  can be determined in O((|  | + |  |) 2 ) time if either the schema R is predefined, or no attributes in  and  have a finite domain Proof idea: an extension of chase for CFDs

23 23 Finite axiomatizability: Flashback Armstrong’s axioms can be found in every database textbook: Reflexivity: If Y  X, then X  Y Augmentation: If X  Y, then XZ  YZ Transitivity: If X  Y and Y  Z, then X  Z Sound and complete for FD implication, i.e,  |=  iff  can be inferred  from using reflexivity, augmentation, transitivity. Question: is there a sound and complete inference system for the implication analysis of CFDs?

24 24 Finite axiomatizability of CFDs Theorem. There is a sound and complete inference system I for implication analysis of CFDs Sound: if  |- , i.e.,  can be proved from  using I, then  |=  Complete: if  |= , then  |-  using I The inference system is more involved than its counterpart for traditional FDs, namely, Armstrong’s axioms. There are 5 axioms. A normal form of CFDs: (R: X  A, tp), tp is a single pattern tuple.

25 25 Axioms for CFDs: extension of Armstrong’s axioms Reflexivity: If A  X, then (R : X  A, tp), where A1…AkAA _…___ or A1…AkAA _…_aa Augmentation: If (X  A, tp) and B  attr(R), then (BX  A, t’p) A1…AkA tp[A1]…tp[Ak]tp[A] A1…AkBA tp[A1]…tp[Ak]_tp[A] tp t’p

26 26 Axioms for CFDs: transitivity Transitivity: if ([A1,…,Ak]  [B1,…,Bm], tp) and ([B1,…,Bm]  [C1,…,Cn], t’p) A1…AkB1…Bm tp[A1]…tp[Ak]tp[B1]tp[Bm] A1…AkC1…Cn tp[A1]…tp[Ak]t’p[C1]t’p[Cn] B1…BmC1…Cn tp’[B1]…t’p[Bm]t’p[C1]t’p[Cm] ([A1,…,Ak]  [C1,…,Cn], t’p) match

27 27 Axioms for CFDs: reduction reduction: if ([B, X]  A, tp), tp[B] = _, and tp[A] = a A1…AkBA tp[A1]…tp[Ak]_a then (X  A, t’p) A1…AkA tp[A1]…tp[Ak]a

28 28 Axioms for CFDs: finite domain upgrade upgrade: if only consistent values for B are b1, b2,..., bn, dom(B) = { b1, …, bn, …, bm}, and (R : [A1,...,Ak, B]  A, tp) A1…AkBA tp[A1]…tp[Ak]b1tp[A] tp[A1]…tp[Ak]…tp[A] tp[A1]…tp[Ak]bntp[A] then (R : [A1,...,Ak, B]  A, tp) A1…AkBA tp[A1]…tp[Ak]_tp[A]

29 29 Static analyses: CFD vs. FD satisfiabilityimplicationfinite axiom’ty CFDNP-completecoNP-completeyes FDO(1)O(n)yes General setting: satisfiabilityimplicationfinite axiom’ty CFDO(n 2 ) yes FDO(1)O(n)yes in the absence of finite-domain attributes: Theorem: In the absence of finite-domain attributes, Reflexivity, Augmentation, Transitivity and Reduction are sound and complete for CFD implication complications: finite-domain attributes, interaction between satisfiability and implication analyses

30 30 Conditional functional dependencies (CFDs) –Motivation for extending FDs with conditions: data cleaning –Syntax and semantics –Static analysis: satisfiability, implication, axiomatizability Conditional inclusion dependencies (CINDs) –Motivation: data cleaning and schema matching –Syntax and semantics –Static analysis: consistency, implication, axiomatizability Algorithms and open research issues –SQL techniques for inconsistency detection –Heuristic for satisfiability and implication checking –Repair Conditional Inclusion dependencies (CINDs)

31 31 Example: Amazon database Schema: order(asin, title, type, price, country, county) -- source book(asin, isbn, title, price, format) -- target CD(asin, title, price, genre) asin: Amazon standard identification number Instances: asintitletypepricecountrycounty a23H. Porterbook17.99USDL a12J. DenverCD7.94UKReyden asinisbntitleprice a23b32Harry Porter17.99 a56b65Snow white7.94 asintitlepricegenre a12J. Denver17.99country a56Snow White7.94a-book order book CD

32 32 Schema matching Traditional inclusion dependencies: order[asin, title, price]  book[asin, title, price] order[asin, title, price]  CD[asin, title, price] These inclusion dependencies do not make sense! Inclusion dependencies from source to target (e.g., Clio) asintitletypepricecountrycounty asinisbntitlepriceasintitlepricegenre

33 33 Schema matching: dependencies with conditions asintitletypepricecountrycounty asinisbntitlepriceasintitlepricegenre Conditional inclusion dependencies: order[asin, title, price; type = book]  book[asin, title, price] order[asin, title, price; type = CD]  CD[asin, title, price] order[asin, title, price]  book[asin, title, price] holds only if type = book order[asin, title, price]  CD[asin, title, price] holds only if type = CD The constraints do not hold on the entire order table

34 34 Date cleaning with conditional dependencies CIND1: order[asin, title, price; type = book]  book[asin, title, price] CIND2: order[asin, title, price; type = CD]  CD[asin, title, price] Tuple t1 violates CIND1 Tuple t2 violates CIND2 idasintitletypepricecountrycounty t1a23H. Porterbook17.99USDL t2a12J. DenverCD7.94UKReyden asinisbntitleprice a23b32Harry Porter17.99 a56b65Snow white7.94 asintitlepricegenre a12J. Denver17.99country a56Snow White7.94a-book order book CD

35 35 More on data cleaning CD[asin, title, price; genre = ‘a-book’]  book[asin, title, price; format = ‘audio’] Inclusion relation CD[asin, title, price]  book[asin, title, price] holds only if genre = ‘a-book’, i.e., when the CD is an audio book In addition, the format of the corresponding book must be audio – a pattern for the referenced tuple asinisbntitlepriceformat a23b32Harry Porter17.99Hard cover a56b65Snow White17.94audio asintitlepricegenre a12J. Denver17.99country a56Snow White7.94a-book book CD

36 36 (R1[X; Xp]  R2[Y; Yp], Tp) R1[X]  R2[Y]: embedded traditional IND from R1 to R2 Tp: a pattern tableau –attributes: X  Xp  Y  Yp –tuples in Tp consist of constants and unnamed variable _ Example: express CIND1: order[asin, title, price; type = book]  book[asin, title, price] ( order[asin, title, price; type]  book[asin, title, price; nil], Tp) nil: empty list pattern tableau Tp Conditional Inclusion Dependencies (CINDs) asintitlepricetypeasintitleprice ___book___

37 37 CIND2: order[asin, title, price; type = CD]  CD[asin, title, price] CIND3: CD[asin, title, price; genre = ‘a-book’]  book[asin, title, price; format = ‘audio’] Examples CINDs asintitlepricetypeasintitleprice ___CD___ asintitlepricegenreasintitlepriceformat ___a-book___audio ( order[asin, title, price; type]  CD[asin, title, price; nil], Tp) ( CD[asin, title, price; genre]  book[asin, title, price; format], Tp)

38 38 R1[X]  R2[Y] X: [A1, …, An] Y : [B1, …, Bn] As a CIND: (R1[X; nil]  R2[Y; nil], Tp) pattern tableau Tp: a single tuple consisting of _ only CINDs subsume traditional INDs Traditional CINDs as a special case A1…AnB1…Bn ______

39 39 DB = (DB1, DB2), where DBj is an instance of Rj, j = 1, 2. DB satisfies (R1[X; Xp]  R2[Y; Yp], Tp) iff for any tuples t1 in DB1, and any tuple tp in the pattern tableau Tp, if t1[X, Xp]  tp[X, Xp], then there exists t2 in DB2 such that t1[Y] = t2[Y] (traditional IND semantics) t2[Y, Yp]  tp[Y, Yp] (matching the pattern tuple on Y, Yp) Patterns: t1[X, Xp]  tp[X, Xp]: identifying the set of R1 tuples on which tp applies: { t1 | t1[X, Xp]  tp[X, Xp] } t2[Y, Yp]  tp[Y, Yp]: enforcing the embedded IND and the constraint specified by patterns Y, Yp Semantics of CINDs

40 40 ( CD[asin, title, price; genre]  book[asin, title, price; format], Tp) The following DB satisfies the CIND Example asinisbntitlepriceformat a23b32Harry Porter17.99Hard cover a56b65Snow white7.94audio asintitlepricegenre a12J. Denver17.99country a56Snow White7.94a-book book CD asintitlepricegenreasintitlepriceformat ___a-book___audio

41 41 More examples CIND1: ( order[asin, title, price; type]  book[asin, title, price; nil], Tp) The following DB violates CIND1. Why? idasintitletypepricecountrycounty t1a23H. Porterbook17.99USDL t2a12J. DenverCD7.94UKReyden asinisbntitleprice a23b32Harry Porter17.99 a56b65Snow white7.94 asintitlepricegenre a12J. Denver17.99country a56S. White7.94a-book order bookCD asintitlepricetypeasintitleprice ___book___

42 42 The satisfiability problem for CINDs The satisfiability problem for CINDs is to determine, given a set  of CINDs, whether or not there exists a nonempty database DB that satisfies , i.e., for any  in , DB satisfies . Recall Any set of traditional INDs is always satisfiable! For CFDs, the satisfiability problem is intractable. In contrast. Theorem. Any set of CINDs is always satisfiable! Despite the increased expressive power, the complexity of the satisfiability analysis does not go up.

43 43 The implication problem for CINDs The implication problem for CINDs is to decide, given a set  of CINDs and a single CIND , whether  implies  (  |=  ). For traditional INDs, the implication problem is PS PACE -complete For CINDs, the complexity does not hike up, to an extent: Theorem. For CINDs containing no finite-domain attributes, the implication problem is PSPACE -complete In the general setting, however, we have to pay a price: Theorem. The implication problem for CINDs is EXPTIME -complete Proof idea: Lower bound: reduction from two-player tiling game Upper bound: an extension of the chase for CINDs

44 44 Finite axiomatizability of CINDs Rules for inferring IND implication: –Reflexivity: If R[X]  R[X] –Projection and Permutation: If R1[A1, …, Ak]  R2[B1, …, Bk], then R1[Ai1, …, Aik]  R2[Bi1, …, Bik], –Transitivity: If R1[X]  R2[Y] and R2[Y]  R3[Z], then R1[X]  R3[Z] Sound and complete for IND implication CINDs retain the finite axiomatizability Theorem. There is a sound and complete inference system for implication analysis of CINDs There are 8 axioms.

45 45 Inference rules for CINDs Normal form of CINDs: (R1[X; Xp]  R2[Y; Yp], tp), tp is a single pattern tuple tp[A] is a constant iff A is in Xp or Yp (tp[B] = _ if B is in X or Y) Inference rules Reflexivity: (R[X; nil]  R[X; nil], tp), where A1…AkA1…Ak _…__…_ Projection and permutation: If (R1[X; Xp]  R2[Y; Yp], tp), then (R1[X’; X’p]  R2[Y’; Y’p], t’p), for any permutation of X, Xp XXpYYp _tp[Xp]_tp[Yp] tp t’p X’X’pY’Y’p _tp[X’p]_tp[Y’p]

46 46 Axioms for CINDs: transitivity Transitivity: if (R1[X; Xp]  R2[Y; Yp], tp), and (R2[Y; Yp]  R3[Z; Zp], t’p), XXpYYp _tp[Xp]_tp[Yp] XXpZZp _tp[Xp]_t’p[Zp] YYpZZp _tp[Yp]_t’p[Zp] (R1[X; Xp]  R3[Z; Zp], t”p) equal

47 47 downgrading: if (R1[X, A; Xp]  R2[Y, B; Yp], tp), XAXpYBYp __tp[Xp]__tp[Yp] (R1[X; Xp, A]  R2[Y; Yp, B], t’p) XXpAYYpB _tp[Xp]a_tp[Yp]a Axioms for CINDs: downgrading

48 48 Axioms for CINDs: augmentation augmentation: if (R1[X; Xp]  R2[Y; Yp], tp), A  attr(R1), XXpYYp _tp[Xp]_tp[Yp] XXpAYYp _tp[Xp]a_tp[Yp] (R1[X; Xp, A]  R2[Y; Yp], t’p)

49 49 Axioms for CINDs: reduction reduction: if (R1[X; Xp]  R2[Y; Yp, B], tp), XXpYYpB _tp[Xp]_tp[Yp]tp[B] then (R1[X; Xp]  R2[Y; Yp], t’p), XXpYYp _tp[Xp]_tp[Yp]

50 50 Axioms for CFDs: finite domain reduction F-reduction: if (R1[X; Xp, A]  R2[Y; Yp], tp), dom(A) = { a1,…, an} XXpAYYp _tp[Xp]a1_tp[Yp] _tp[Xp]…_tp[Yp] _tp[Xp]an_tp[Yp] then (R1[X; Xp]  R2[Y; Yp], tp), XXpYYp _tp[Xp]_tp[Yp]

51 51 Axioms for CFDs: finite domain upgrade upgrade: if (R1[X; Xp, A]  R2[Y, B; Yp], tp), dom(A) = { a1,…, an} XXpAYYpB _tp[Xp]a1_tp[Yp]a1 _tp[Xp]…_tp[Yp]… _tp[Xp]an_tp[Yp]an then (R1[X,A; Xp]  R2[Y, B; Yp], tp), XAXpYBYp __tp[Xp]__tp[Yp]

52 52 Static analyses: CIND vs. IND satisfiabilityimplicationfinite axiom’ty CINDO(1)EXPTIME-completeyes INDO(1)PSPACE-completeyes General setting: satisfiabilityimplicationfinite axiom’ty CINDO(1)PSPACE-completeyes INDO(1)PSPACE-completeyes in the absence of finite-domain attributes: Theorem: In the absence of finite-domain attributes, Reflexivity, Projection and Permutation, Transitivity, Augmentation, Downgrading and Reduction are sound and complete for CIND implication CINDs retain most complexity bounds of their traditional counterpart

53 53 CFDs and CINDs taken together We need both CFDs and CINDs for data cleaning schema matching Theorem. The implication problem for CFDs and CINDs is undecidable Not surprising: The implication problem for traditional FDs and INDs is already undecidable Theorem. The consistency problem for CFDs and CINDs is undecidable In contrast, any set of traditional FDs and INDs is consistent! Proof idea: induction from the implication problem for FDs and INDs

54 54 Static analyses: CFD + CIND vs. FD + IND satisfiabilityimplicationfinite axiom’ty CFD + CINDundecidable No FD + INDO(1)undecidableNo CINDs and CFDs properly subsume FDs and INDs Both the satisfiability analysis and implication analysis are beyond reach in practice This calls for effective heuristic methods

55 55 Conditional functional dependencies (CFDs) –Motivation for extending FDs with conditions: data cleaning –Syntax and semantics –Static analysis: satisfiability, implication, axiomatizability Conditional inclusion dependencies (CINDs) –Motivation: data cleaning and schema matching –Syntax and semantics –Static analysis: consistency, implication, axiomatizability Algorithms and open research issues –SQL techniques for inconsistency detection –Heuristic for satisfiability and implication checking –Repair Algorithms and open research issues

56 56 Detecting CFD Violations countryarea-codephonestreetcityzip 44131__Edi_ 01908__MH_ ______ countryarea-codephonestreetcityzip 441311234567MayfieldNYCEH4 8LE 441313456789CrichtonNYCEH4 8LE 019083456789Mountain AveNYC07974 CFD: (cust(country, area-code, phone  street, city, zip), Tp) detection

57 57 Detecting CFD violations Input: a set  of CFDs and a database DB Output: the set of tuples in DB that violate at least one CFD in  Approach: automatically generate SQL queries to find violations Complication 1: consider (R: X  Y, Tp), the pattern tableau may be large (recall that each tuple in Tp is in fact a constraint) Goal: the size of the SQL queries is independent of Tp Trick: treat Tp as a data table CINDs can be checked along the same lines

58 58 Single CFD: step 1 A pair of SQL queries, treating Tp as a data table –Single-tuple violation (pattern matching) –Multi-tuple violations (traditional FDs) (cust(country, area-code, phone  street, city, zip), Tp) Single-tuple violation: Qc select * from R t, Tp tp where t[country]  tp[country] AND t[area-code]  tp[area-code] AND t[phone]  tp[phone] (t[street] <> tp[street] OR t[city] <> tp[city] OR t[zip] <> tp[zip])) – <>: not matching; –t[A1]  tp[A1]: (t[A1] = tp[A1] OR tp[A1] = _)

59 59 Single CFD: step 2 Multi-tuple violations (the semantics of traditional FDs): Qv select distinct t.country, t.area-code, t.phone from R t, Tp tp where t[country]  tp[country] AND t[area-code]  tp[area-code] AND t[phone]  tp[phone] group by t.country, t.area-code, t.phone having count(distinct street, city, zip) > 1 Tp is treated as a data table (cust(country, area-code, phone  street, city, zip), Tp)

60 60 Multiple CFDs Complication 2: if the set  has n CFDs, do we use 2n SQL queries, and thus 2n passes of the database DB? Goal: 2 SQL queries no matter how many CFDs are in  the size of the SQL queries is independent of Tp Trick: merge multiple CFDs into one Given (R: X1  Y1, Tp1), (R: X2  Y2, Tp2) Create a single pattern table: Tm = X1  X2  Y1  Y2, Introduce @, a don’t-care variable, to populate attributes of pattern tuples in X1 – X2, etc (tp[A] = @) Modify the pair of SQL queries by using Tm

61 61 Handling multiple CFDs zipstate 07974NJ 90291CA 01202_ CFD 2 : (zip  state, T 2 ) areastate __ 212NY CFD 1 : (area  state, T 1 ) areazipstate 48095120CA 31090995CA CFD 3 : (area,zip  state, T 3 ) areazipstate CFD 2 :@07974NJ CFD 2 :@90291CA CFD 2 :@01202_ CFD 1 :_@_ 212@NY CFD 3 :48095120CA CFD 3 :31090995CA CFD M :(area,zip  state, T M ) Qc: select * from R t, T M t p where t[area] ≍ t p [area] AND t[zip] ≍ t p [zip] AND t[state] <> t p [state] Qv: select distinct area, zip from Macro group by area, zip having count(distinct state) > 1 Macro: select (case t p [area] when “@” then “@” else t[area] end) as area... from R t, T M t p where t[area] ≍ t p [area] AND t[zip] ≍ t p [zip] AND t p [state] =_

62 62 Keeping things tidy… ccareaphoneaddrcityzipstate t1:t1:+1908111-1111Elm Str.MH07974NJ t2:t2:+1908222-2222Pine Str.MH07974NY t3:t3:+1212333-3333Oak Str.NYC01202NY t4:t4:+1310444-4444Main Str.LA90291CA areastate __ 212NY CFD: (area  state, T) t5:t5:+1310555-5555Rice Str.LA90291CT Tuple deletions: A tuple deletion might remove violations. Tuples that were dirty, before the deletion, might become clean! Tuple insertions: A tuple insertion might introduce violations. Tuples that were clean, before the insertion, might become dirty! Updating database

63 63 Incremental inconsistency detection Incremental approach: compute change  V such that the new violations V’ = the old violations V +  V Why? Input: a set  of CFDs, a database DB, the set V of tuples in DB that violate , and changes  DB to DB Output: the set V’ of tuples in DB +  DB that violate CFDs in   DB: a set of tuples to be inserted into DB or deleted from DB Approaches: Batch approach: –Compute DB’ = DB +  DB –Apply the SQL detection queries to DB’

64 64 The need for incremental inconsistency detection Small  DB to DB tends to incur only small changes  V to V –more efficient to compute  V then computing V’ starting from scratch Minimize unnecessary recomputation and traversal of DB Goal: generate new SQL queries that perform incremental detection modify the SQL detection queries by leveraging  DB and V design and maintain auxiliary structures (indexing, mark)

65 65 Logging violations areastate __ 212NY CFD: (area  state, T) ccareaphoneaddrcityzipstateBCBC BVBV t1:t1:+1908111-1111Elm Str.MH07974NJ01 t2:t2:+1908222-2222Pine Str.MH07974NY01 t3:t3:+1212333-3333Oak Str.NYC01202NJ10 t4:t4:+1310444-4444Main Str.LA90291CA00 Use one pair of columns, for each CFD Attribute B C records whether tuple t violates query Qc of the CFD Attribute B V records whether tuple t violates query Qv of the CFD Initialization: update R t set t[B C ] = 1 where t in (Q C ) update R t set t[B V ] = 1 where t in (Q V )

66 66 Handling deletions Let t del be the tuple we want to delete from R Step 1:delete from R t where t = t del Step 2:update R t set t[B V ] = 0 where t[B V ] = 1 AND t[area] = t del [area] AND 1 = (select count(distinct state) from R t’ where t’[area] = t del [area]) ccareaphoneaddrcityzipstateBCBC BVBV t1:t1:+1908111-1111Elm Str.MH07974NY01 t2:t2:+1908222-2222Pine Str.MH07974NJ01 t3:t3:+1212333-3333Oak Str.NYC01202NJ10 t4:t4:+1310444-4444Main Str.LA90291CA00 areastate __ 212NY CFD: (area  state, T) How about batch deletions? How about batch deletions?

67 67 Handling insertions ccareaphoneaddrcityzipstateBCBC BVBV t3:t3:+1212333-3333Oak Str.NYC01202NJ10 t4:t4:+1310444-4444Main Str.LA90291CA00 t ins :+1310555-555Rice Str.LA90291CT areastate __ 212NY CFD: (area  state, T) Let t ins be the tuple we want to insert into R Step 1:insert into R values t ins Step 2:update R t set t[B C ] = 1 where t = t ins AND exists ( select * from T t p where t[area] ≍ t p [area] AND t[state] <>t p [state]) Step 3:update R t set t[B V ] = 1, t ins [B V ] = 1 where t[area] = t ins [area] AND t[state] ≠ t ins [state] AND exists ( select * from T t p where t ins [area] ≍ t p [area] AND t p [state] = _) Step 2Step 3 01 / 1 Step 1 How about batch insertions? How about batch insertions?

68 68 Handling batch insertions Step 1 Let R ins be the tuples we want to insert into R Step 1:Find tuples in R ins which violate Qc Step 2:Find clean tuples in R that become dirty, due to some tuple(s) in R ins Step 3:Find tuples in R ins that become dirty due to some dirty tuple(s) in R Step 4:Find clean tuples in R ins that violate the CFD Step 5:Insert R ins in R Step 2 Step 3 Step 4 R ins : R:

69 69 The source code Let R ins be the tuples we want to insert into R Step 1: update R ins t ins set t ins [B C ] = 1 where t ins in (Q C ) Step 2: update R t set t[B V ] = 1 where t[B C ] = 0 AND t[B V ] = 0 AND exists (select * from T where t[area] ≍ t p [area] AND t p [state] = _) AND exists (select * from R ins t ins where t ins [area] = t[area] AND t ins [state] ≠ t[state]) Step 3: update R ins t ins set t ins [B V ] = 1 where exists (select * from R t where t ins [area] = t[area] AND t[B V ] = 1) Step 4: update R ins t ins set t ins [B V ] = 1 where t ins [B C ] = 0 AND t ins [B V ] = 0 AND t ins [area] IN (select area from R ins t’ ins, T p t p where t’ ins [B C ] = 0 AND t’ ins [B V ] = 0 AND t’ ins [area] ≍ t p [area] AND t p [state] = _ group by area having count(distinct state) > 1) Step 5: insert into R values (select * from R ins )

70 70 Scalability in Instance Size

71 71 Scalability in NumConsts

72 72 Scalability in Noise

73 73 Merging CFDs

74 74 Incremental Deletions

75 75 Incremental Batch Deletions

76 76 Incremental Insertions

77 77 Incremental Batch Insertions

78 78 Checking the satisfiability of CFDs Input: a set  of CFDs MAXSC: find a maximum subset of  that is consistent Complexity: the MAXSC problem for CFDs is NP-complete Theorem: there is an  -approximation algorithm for MAXSC there exist constant  such that for the subset  m found by the algorithm has a bound: card(  m ) >  card(OPT(  )) Proof idea: approximation factor preserving reduction to MAXGSAT Open questions: effective heuristic algorithms for checking the satisfiability of CFDs + CINDs (undecidable) for determining implication of CFDs, CINDs, and CFDs + CINDs for finding minimum cover of CFDs, CINDs, and CFDs + CINDs

79 79 Automated methods for finding a repair Input: a relational database DB, and a set  of CFDs Output: a repair DB’ of DB such that cost(DB’, DB) is minimal repair: DB’ satisfies  “good”: cost(DB’, DB) –DB’ is “close” to the original data in DB –Minimizing changes to “accurate” attributes Complexity: Finding an optimal repair is NP-complete (data complexity) for traditional FDs, for a fixed set of FDs (or INDs) and fixed schema PSPACE-complete for CFDs + CINDs (combined complexity) Open questions: effective heuristic for repairing databases based on CFDs, CINDs, and CFDs + CINDs

80 80 Incremental repair Input: a clean database DB, changes  DB to DB, and a set  of CFDs Output: a repair DB’ of DB +  DB Complexity. The local data cleaning problem is NP-hard, even if  DB consists of a single tuple. Open questions: find effective heuristic algorithms for incrementally repairing databases based on CFDs CINDs CFDs + CINDs

81 81 Discovering CFDs and CINDs Input: Sample databases of a schema R Output: CFDs and CINDs that hold on all (or most) database instances of R Difficulty. A naïve approach may find non-representative CFDs and CINDs as large as the sample data Open questions: find effective method for discovering CFDs CINDs CFDs + CINDs

82 82 Summary Conditional functional dependencies –for data cleaning rather than schema design –complexity bounds of satisfiability and implication analyses –a sound and complete inference system Conditional inclusion dependencies –for data cleaning and schema matching in practice –complexity bounds of satisfiability and implication analyses –a sound and complete inference system Complexity bounds for CFDs and CINDs taken together SQL techniques for automatic detection of CFD violations –a pair of SQL queries for validating multiple CFDs –incremental techniques for validating CFDs A practical method for data cleaning and schema matching

83 83 References Conditional Functional Dependencies for Data Cleaning The 23rd International Conference on Database Engineering (ICDE), 2007. Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis Extending Dependencies with Conditions Loreto Bravo, Wenfei Fan, Shuai Ma Improving Data Quality: Consistency and Accuracy Gao Cong, Wenfei Fan, Floris Geerts, Xibei Jia, Shuai Ma Conditional Functional Dependencies for Capturing Data Inconsistencies Wenfei Fan, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis


Download ppt "1 Conditional Dependencies Wenfei Fan University of Edinburgh and Bell Laboratories."

Similar presentations


Ads by Google