1 Conditional Dependencies Wenfei Fan University of Edinburgh and Bell Laboratories
2 Outline of Part III
Conditional functional dependencies (CFDs)
– Motivation for extending FDs with conditions: data cleaning
– Syntax and semantics
– Static analysis: satisfiability, implication, axiomatizability
Conditional inclusion dependencies (CINDs)
– Motivation: data cleaning and schema matching
– Syntax and semantics
– Static analysis: satisfiability, implication, axiomatizability
Algorithms and open research issues
– SQL techniques for inconsistency detection
– Heuristics for satisfiability and implication checking
– Repair
3 Conditional functional dependencies (CFDs)
4 Data in real-life is often dirty Errors, conflicts and inconsistencies Australia: 500,000 dead people retain active Medicare cards US: Pentagon asked 275 dead/wounded officers to re-enlist. UK: there are 81 million National Insurance numbers but only 60 million eligible citizens. It is estimated that in a 500,000 customer database, 120,000 customer records become invalid within a year, due to deaths, divorces, marriages, moves. typical data error rate in industry: 1% - 5%, up to 30%...
5 Dirty data is costly Poor data costs US companies $600 billion annually Wrong price data in retail databases costs US customers $2.5 billion each year AAA improves data quality by 20%, and saves $150,000… in postage stamps alone 30%-80% of the development time for data cleaning in a data integration project and don’t forget CIA intelligence on WMD in Iraq! The need for (semi-)automated methods to clean data!
6 Characterizing the consistency of data One of the central technical problems is how to tell whether the data is dirty or clean Specify consistency using integrity constraints Inconsistencies emerge as violations of constraints Constraints considered so far: traditional –functional dependencies –inclusion dependencies –denial constraints (a special case of full dependencies) –... Question: are these traditional dependencies sufficient?
7 Example: customer relation
Schema: cust(country, area-code, phone, street, city, zip)
Instance:
country | area-code | phone | street | city | zip
… | … | … | Mayfield | NYC | EH4 8LE
… | … | … | Crichton | NYC | EH4 8LE
… | … | … | Mountain Ave | NYC | 07974
Functional dependencies (FDs):
– cust[country, area-code, phone] → cust[street, city, zip]
– cust[country, area-code] → cust[city]
The database satisfies the FDs. Is the data consistent?
8 Capturing inconsistencies in the data
cust([country = 44, zip] → [street])
– In the UK, the zip code uniquely determines the street
– The constraint may not hold for other countries
– It expresses a fundamental part of the semantics of the data
– It can NOT be expressed as a traditional FD: it does not hold on the entire relation; instead, it holds only on tuples representing UK customers
country | area-code | phone | street | city | zip
… | … | … | Mayfield | NYC | EH4 8LE
… | … | … | Crichton | NYC | EH4 8LE
… | … | … | Mountain Ave | NYC | 07974
9 Two more constraints
cust([country = 44, area-code = 131, phone] → [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] → [street, zip, city = MH])
– In the UK, if the area code is 131, then the city has to be EDI
– In the US, if the area code is 908, then the city has to be MH
t1, t2 and t3 violate these constraints
– refining cust([country, area-code, phone] → [street, city, zip])
– combining constants and variables
id | country | area-code | phone | street | city | zip
t1 | … | … | … | Mayfield | NYC | EH4 8LE
t2 | … | … | … | Crichton | NYC | EH4 8LE
t3 | … | … | … | Mountain Ave | NYC | 07974
10 The need for new constraints
cust([country = 44, zip] → [street])
cust([country = 44, area-code = 131, phone] → [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] → [street, zip, city = MH])
They capture inconsistencies that traditional FDs cannot detect
Traditional constraints were developed for schema design, not for data cleaning!
Data integration in real life: source constraints
– hold on a subset of sources
– hold conditionally on the integrated data
They are NOT expressible as traditional FDs
– they do not hold on the entire relation
– they contain constant data values, besides logical variables
11 Conditional Functional Dependencies (CFDs)
An extension of traditional FDs: (R: X → Y, Tp)
– X → Y: embedded traditional FD on R
– Tp: a pattern tableau
– attributes of Tp: X ∪ Y
– each tuple in Tp consists of constants and the unnamed variable _
Example: cust([country = 44, zip] → [street]) is written as (cust(country, zip → street), Tp) with pattern tableau Tp:
country | zip | street
44 | _ | _
12 Example CFDs
Represent
cust([country = 44, area-code = 131, phone] → [street, zip, city = EDI])
cust([country = 01, area-code = 908, phone] → [street, zip, city = MH])
cust([country, area-code, phone] → [street, city, zip])
as a SINGLE CFD: (cust(country, area-code, phone → street, city, zip), Tp)
Pattern tableau Tp: one tuple for each constraint
country | area-code | phone | street | city | zip
44 | 131 | _ | _ | Edi | _
01 | 908 | _ | _ | MH | _
_ | _ | _ | _ | _ | _
13 Traditional FDs as a special case
Express cust[country, area-code] → cust[city] as a CFD: (cust(country, area-code → city), Tp)
Pattern tableau Tp: a single tuple consisting of _ only
country | area-code | city
_ | _ | _
CFDs subsume traditional FDs
14 Semantics of CFDs
a ≍ b (a matches b) if
– either a or b is _
– both a and b are constants and a = b
Tuple t1 matches t2: t1 ≍ t2. For example, (a, b) ≍ (a, _), but (a, b) does not match (a, c)
DB satisfies (R: X → Y, Tp) iff for any tuple tp in the pattern tableau Tp and for any tuples t1, t2 in DB, if t1[X] = t2[X] ≍ tp[X], then t1[Y] = t2[Y] ≍ tp[Y]
– tp[X]: identifying the set of tuples on which the constraint tp applies, i.e., { t | t[X] ≍ tp[X] }
– t1[Y] = t2[Y] ≍ tp[Y]: enforcing the embedded FD and the pattern of tp
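The definition above can be turned into a direct check. The following is a minimal sketch, not an optimized detector: it assumes tuples and pattern tuples are Python dicts keyed by attribute name, and the country, area-code and phone values in the sample data are illustrative assumptions (this copy of the slides does not show them). The function name `cfd_violations` is hypothetical.

```python
WILDCARD = "_"  # the unnamed variable of the pattern tableau


def matches(a, b):
    """a matches b: either side is the unnamed variable, or both are equal constants."""
    return a == WILDCARD or b == WILDCARD or a == b


def cfd_violations(db, lhs, rhs, tableau):
    """Indices of tuples in db involved in a violation of (R: lhs -> rhs, tableau)."""
    bad = set()
    for tp in tableau:
        for i, t1 in enumerate(db):
            # tp applies only to tuples whose LHS matches tp[LHS]
            if not all(matches(t1[a], tp[a]) for a in lhs):
                continue
            for j in range(i, len(db)):  # j == i covers single-tuple violations
                t2 = db[j]
                if any(t1[a] != t2[a] for a in lhs):
                    continue
                agree = all(t1[a] == t2[a] for a in rhs)       # embedded FD
                fits = all(matches(t1[a], tp[a]) for a in rhs)  # pattern on the RHS
                if not (agree and fits):
                    bad.update((i, j))
    return bad


# Illustrative data (country / area-code / phone values are assumptions).
cust = [
    {"country": "44", "area-code": "131", "phone": "1111111",
     "street": "Mayfield", "city": "NYC", "zip": "EH4 8LE"},
    {"country": "44", "area-code": "131", "phone": "2222222",
     "street": "Crichton", "city": "NYC", "zip": "EH4 8LE"},
    {"country": "01", "area-code": "908", "phone": "2222222",
     "street": "Mountain Ave", "city": "NYC", "zip": "07974"},
]

# cust([country = 44, zip] -> [street])
uk_zip = cfd_violations(cust, ["country", "zip"], ["street"],
                        [{"country": "44", "zip": "_", "street": "_"}])

# (cust(country, area-code -> city), Tp) with three pattern tuples
area_city = cfd_violations(cust, ["country", "area-code"], ["city"], [
    {"country": "44", "area-code": "131", "city": "Edi"},
    {"country": "01", "area-code": "908", "city": "MH"},
    {"country": "_", "area-code": "_", "city": "_"},
])
```

The pair check with `j == i` is what lets a single tuple violate a CFD, which cannot happen with a traditional FD.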
15 Example: violation of CFDs
cust([country = 44, zip] → [street]), with pattern tableau
country | zip | street
44 | _ | _
Tuples t1 and t2 violate the CFD:
– t1[country, zip] = t2[country, zip] ≍ tp[country, zip]
– t1[street] ≠ t2[street]
The CFD applies to t1 and t2 since they match tp[country, zip]
id | country | area-code | phone | street | city | zip
t1 | … | … | … | Mayfield | NYC | EH4 8LE
t2 | … | … | … | Crichton | NYC | EH4 8LE
t3 | … | … | … | Mountain Ave | NYC | 07974
16 Violation of CFDs by a single tuple
(cust(country, area-code → city), Tp), with pattern tableau Tp
id | country | area-code | city
tp1 | 44 | 131 | Edi
tp2 | 01 | 908 | MH
tp3 | _ | _ | _
Tuple t1 does not satisfy the CFD:
– t1[country, area-code] = t1[country, area-code] ≍ tp1[country, area-code]
– t1[city] = t1[city]; however, t1[city] does not match tp1[city]
In contrast to traditional FDs, a single tuple may violate a CFD
id | country | area-code | phone | street | city | zip
t1 | … | … | … | Mayfield | NYC | EH4 8LE
t2 | … | … | … | Crichton | NYC | EH4 8LE
t3 | … | … | … | Mountain Ave | NYC | 07974
17 CFDs vs. conditional tables
Conditional tables, Codd tables and variable tables have been studied for incomplete information
– Conditional tables: represent infinitely many relation instances, one for each instantiation of variables
– Pattern tableau in a CFD: each pattern tuple is a constraint, and all constraints apply to the same relation instance
Relational tables, traditional dependencies and CFDs
– One end of the spectrum: relations consisting of data only
– The other end of the spectrum: traditional dependencies defined in terms of logic variables
– CFDs: in between, with both data values and logic variables
CFDs enforce bindings of semantically related data values
18 “Dirty” constraints?
A set of CFDs may be inconsistent!
Inconsistent: (R(A → B), Tp), with Tp
id | A | B
tp1 | _ | b
tp2 | _ | c
In any nonempty database DB and for any tuple t in DB,
– tp1: t[B] must be b
– tp2: t[B] must be c
– inconsistent if b and c are different
Inconsistent: Σ = { φ1, φ2 }, φ1 = (R(A → B), Tp1), φ2 = (R(B → A), Tp2). Why?
Tp1:
id | A | B
tp1 | true | b
tp2 | false | c
Tp2:
id | B | A
tp3 | b | false
tp4 | c | true
19 The satisfiability problem
The satisfiability problem for CFDs is to determine, given a set Σ of CFDs, whether or not there exists a nonempty database DB that satisfies Σ, i.e., for any φ in Σ, DB satisfies φ.
– It asks whether or not Σ makes sense
– For traditional FDs, it is not an issue: one can specify any FDs without worrying about their consistency
– In contrast, a set of CFDs may be inconsistent!
20 The complexity of the satisfiability analysis
Theorem. The satisfiability problem for CFDs is NP-complete.
Nontrivial: contrast this with the trivial consistency analysis of FDs!
Proof idea:
– Upper bound: the small model property: if Σ is satisfiable, then there is a DB that satisfies Σ and consists of a single tuple!
– Lower bound: reduction from the non-tautology problem
Good news: PTIME special cases
Theorem. Given a set Σ of CFDs on a relation schema R, the satisfiability of Σ can be determined in O(|Σ|²) time if either the schema R is predefined (fixed), or no attributes in Σ have a finite domain
Proof idea: an extension of the chase for CFDs
21 The implication problem
The implication problem for CFDs is to determine, given a set Σ of CFDs and a single CFD φ, whether Σ implies φ, denoted by Σ |= φ, i.e., for any database DB, if DB satisfies Σ, then DB satisfies φ.
Example: Σ = { φ1, φ2 }, φ1 = (R(A → B), Tp1), φ2 = (R(B → C), Tp2), φ = (R(A → C), Tp). Σ |= φ. Why?
Tp1:
id | A | B
tp1 | _ | b
Tp2:
id | B | C
tp1 | _ | c
Tp:
id | A | C
tp | a | c
22 The complexity of the implication problem
For traditional FDs, the implication problem is in linear time
In contrast, the implication problem for CFDs is intractable
Theorem. The implication problem for CFDs is coNP-complete.
Tractable special cases
Theorem. Given a set Σ of CFDs and a single CFD φ on a relation schema R, whether Σ |= φ can be determined in O((|Σ| + |φ|)²) time if either the schema R is predefined, or no attributes in Σ and φ have a finite domain
Proof idea: an extension of the chase for CFDs
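The small model property suggests a simple (exponential-time) way to test satisfiability: search for a single tuple that satisfies every CFD. The sketch below is a brute-force illustration under stated assumptions, not the chase-based algorithm mentioned on the slide: CFDs are in the form (lhs attributes, one rhs attribute, pattern dict), finite domains are passed explicitly, and for every other attribute it suffices to try the constants mentioned in the patterns plus one fresh value. The names `satisfiable` and `FRESH` are hypothetical.

```python
from itertools import product

WILDCARD = "_"
FRESH = "<fresh>"  # stands for any value not mentioned in the patterns


def matches(a, b):
    return a == WILDCARD or b == WILDCARD or a == b


def satisfiable(attrs, cfds, finite_domains=None):
    """Return a single witness tuple satisfying all CFDs, or None if there is none.

    cfds: list of (lhs, rhs_attr, tp), where tp maps lhs + [rhs_attr] to patterns.
    """
    finite_domains = finite_domains or {}
    candidates = []
    for a in attrs:
        if a in finite_domains:
            candidates.append(list(finite_domains[a]))
        else:
            consts = {tp[a] for _, _, tp in cfds if a in tp and tp[a] != WILDCARD}
            candidates.append(sorted(consts, key=repr) + [FRESH])
    for values in product(*candidates):
        t = dict(zip(attrs, values))
        # A single tuple satisfies (X -> A, tp) unless it matches tp[X] but not tp[A].
        if all(not all(matches(t[x], tp[x]) for x in lhs) or matches(t[rhs], tp[rhs])
               for lhs, rhs, tp in cfds):
            return t
    return None


# First example: every tuple must have B = b and B = c at the same time.
sigma1 = [
    (["A"], "B", {"A": "_", "B": "b"}),
    (["A"], "B", {"A": "_", "B": "c"}),
]

# Second example: A is Boolean, so every tuple matches one of the A patterns.
sigma2 = [
    (["A"], "B", {"A": "true", "B": "b"}),
    (["A"], "B", {"A": "false", "B": "c"}),
    (["B"], "A", {"B": "b", "A": "false"}),
    (["B"], "A", {"B": "c", "A": "true"}),
]

unsat1 = satisfiable(["A", "B"], sigma1)
unsat2 = satisfiable(["A", "B"], sigma2, finite_domains={"A": ["true", "false"]})
sat2 = satisfiable(["A", "B"], sigma2)  # A treated as having an infinite domain
```

The last call shows why finite domains are the source of difficulty: once A may take a value other than true or false, the same set of CFDs has a one-tuple model.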
23 Finite axiomatizability: Flashback
Armstrong’s axioms can be found in every database textbook:
– Reflexivity: If Y ⊆ X, then X → Y
– Augmentation: If X → Y, then XZ → YZ
– Transitivity: If X → Y and Y → Z, then X → Z
Sound and complete for FD implication, i.e., Σ |= φ iff φ can be inferred from Σ using reflexivity, augmentation and transitivity.
Question: is there a sound and complete inference system for the implication analysis of CFDs?
24 Finite axiomatizability of CFDs
Theorem. There is a sound and complete inference system I for implication analysis of CFDs
– Sound: if Σ |- φ, i.e., φ can be proved from Σ using I, then Σ |= φ
– Complete: if Σ |= φ, then Σ |- φ using I
The inference system is more involved than its counterpart for traditional FDs, namely, Armstrong’s axioms. There are 5 axioms.
A normal form of CFDs: (R: X → A, tp), where tp is a single pattern tuple.
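For traditional FDs, implication is usually decided with the attribute-closure computation, which is equivalent to applying Armstrong’s axioms exhaustively. The sketch below uses the simple fixpoint loop (quadratic in the worst case, not the linear-time variant); the function names `closure` and `implies` are hypothetical, and the FDs are the two from the customer example.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, given as (lhs, rhs) pairs of attribute lists."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the closure already contains lhs, it must also contain rhs.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """Does the set of FDs imply lhs -> rhs?"""
    return set(rhs) <= closure(lhs, fds)


cust_fds = [
    (["country", "area-code", "phone"], ["street", "city", "zip"]),
    (["country", "area-code"], ["city"]),
]

phone_determines_city = implies(cust_fds, ["country", "area-code", "phone"], ["city"])
area_determines_zip = implies(cust_fds, ["country", "area-code"], ["zip"])
```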
25 Axioms for CFDs: extension of Armstrong’s axioms
Reflexivity: if A ∈ X, then (R: X → A, tp), where tp is either
A1 | … | Ak | A | A
_ | … | _ | _ | _
or
A1 | … | Ak | A | A
_ | … | _ | a | a
Augmentation: if (X → A, tp) and B ∈ attr(R), then (BX → A, t’p), where
tp:
A1 | … | Ak | A
tp[A1] | … | tp[Ak] | tp[A]
t’p:
A1 | … | Ak | B | A
tp[A1] | … | tp[Ak] | _ | tp[A]
26 Axioms for CFDs: transitivity
Transitivity: if ([A1, …, Ak] → [B1, …, Bm], tp) and ([B1, …, Bm] → [C1, …, Cn], t’p), and tp[B1, …, Bm] matches t’p[B1, …, Bm], then ([A1, …, Ak] → [C1, …, Cn], t”p), where
tp:
A1 | … | Ak | B1 | … | Bm
tp[A1] | … | tp[Ak] | tp[B1] | … | tp[Bm]
t’p:
B1 | … | Bm | C1 | … | Cn
t’p[B1] | … | t’p[Bm] | t’p[C1] | … | t’p[Cn]
t”p:
A1 | … | Ak | C1 | … | Cn
tp[A1] | … | tp[Ak] | t’p[C1] | … | t’p[Cn]
27 Axioms for CFDs: reduction
Reduction: if ([B, X] → A, tp), tp[B] = _, and tp[A] = a,
A1 | … | Ak | B | A
tp[A1] | … | tp[Ak] | _ | a
then (X → A, t’p),
A1 | … | Ak | A
tp[A1] | … | tp[Ak] | a
28 Axioms for CFDs: finite domain upgrade
Upgrade: if the only consistent values for B are b1, b2, ..., bn, dom(B) = { b1, …, bn, …, bm }, and (R: [A1, ..., Ak, B] → A, tp) holds for each of the pattern tuples
A1 | … | Ak | B | A
tp[A1] | … | tp[Ak] | b1 | tp[A]
tp[A1] | … | tp[Ak] | … | tp[A]
tp[A1] | … | tp[Ak] | bn | tp[A]
then (R: [A1, ..., Ak, B] → A, tp) with
A1 | … | Ak | B | A
tp[A1] | … | tp[Ak] | _ | tp[A]
29 Static analyses: CFD vs. FD
General setting:
| satisfiability | implication | finite axiom’ty
CFD | NP-complete | coNP-complete | yes
FD | O(1) | O(n) | yes
In the absence of finite-domain attributes:
| satisfiability | implication | finite axiom’ty
CFD | O(n²) | O(n²) | yes
FD | O(1) | O(n) | yes
Theorem: In the absence of finite-domain attributes, Reflexivity, Augmentation, Transitivity and Reduction are sound and complete for CFD implication
Complications: finite-domain attributes, interaction between satisfiability and implication analyses
30 Conditional inclusion dependencies (CINDs)
31 Example: Amazon database
Schema:
– source: order(asin, title, type, price, country, county)
– target: book(asin, isbn, title, price, format), CD(asin, title, price, genre)
– asin: Amazon standard identification number
Instances:
order:
asin | title | type | price | country | county
a23 | H. Porter | book | 17.99 | US | DL
a12 | J. Denver | CD | 7.94 | UK | Reyden
book:
asin | isbn | title | price
a23 | b32 | Harry Porter | 17.99
a56 | b65 | Snow white | 7.94
CD:
asin | title | price | genre
a12 | J. Denver | 17.99 | country
a56 | Snow White | 7.94 | a-book
32 Schema matching
Inclusion dependencies from source to target (e.g., Clio)
order(asin, title, type, price, country, county)
book(asin, isbn, title, price), CD(asin, title, price, genre)
Traditional inclusion dependencies:
– order[asin, title, price] ⊆ book[asin, title, price]
– order[asin, title, price] ⊆ CD[asin, title, price]
These inclusion dependencies do not make sense!
33 Schema matching: dependencies with conditions
order(asin, title, type, price, country, county)
book(asin, isbn, title, price), CD(asin, title, price, genre)
Conditional inclusion dependencies:
– order[asin, title, price; type = book] ⊆ book[asin, title, price]
– order[asin, title, price; type = CD] ⊆ CD[asin, title, price]
order[asin, title, price] ⊆ book[asin, title, price] holds only if type = book
order[asin, title, price] ⊆ CD[asin, title, price] holds only if type = CD
The constraints do not hold on the entire order table
34 Data cleaning with conditional dependencies
CIND1: order[asin, title, price; type = book] ⊆ book[asin, title, price]
CIND2: order[asin, title, price; type = CD] ⊆ CD[asin, title, price]
Tuple t1 violates CIND1
Tuple t2 violates CIND2
order:
id | asin | title | type | price | country | county
t1 | a23 | H. Porter | book | 17.99 | US | DL
t2 | a12 | J. Denver | CD | 7.94 | UK | Reyden
book:
asin | isbn | title | price
a23 | b32 | Harry Porter | 17.99
a56 | b65 | Snow white | 7.94
CD:
asin | title | price | genre
a12 | J. Denver | 17.99 | country
a56 | Snow White | 7.94 | a-book
35 More on data cleaning
CD[asin, title, price; genre = ‘a-book’] ⊆ book[asin, title, price; format = ‘audio’]
– The inclusion relation CD[asin, title, price] ⊆ book[asin, title, price] holds only if genre = ‘a-book’, i.e., when the CD is an audio book
– In addition, the format of the corresponding book must be audio: a pattern for the referenced tuple
book:
asin | isbn | title | price | format
a23 | b32 | Harry Porter | 17.99 | Hard cover
a56 | b65 | Snow White | 17.94 | audio
CD:
asin | title | price | genre
a12 | J. Denver | 17.99 | country
a56 | Snow White | 7.94 | a-book
36 Conditional Inclusion Dependencies (CINDs)
(R1[X; Xp] ⊆ R2[Y; Yp], Tp)
– R1[X] ⊆ R2[Y]: embedded traditional IND from R1 to R2
– Tp: a pattern tableau
– attributes of Tp: X ∪ Xp ∪ Y ∪ Yp
– tuples in Tp consist of constants and the unnamed variable _
Example: express CIND1: order[asin, title, price; type = book] ⊆ book[asin, title, price] as
(order[asin, title, price; type] ⊆ book[asin, title, price; nil], Tp), where nil is the empty list, with pattern tableau Tp:
asin | title | price | type | asin | title | price
_ | _ | _ | book | _ | _ | _
37 Example CINDs
CIND2: order[asin, title, price; type = CD] ⊆ CD[asin, title, price]
(order[asin, title, price; type] ⊆ CD[asin, title, price; nil], Tp), with Tp:
asin | title | price | type | asin | title | price
_ | _ | _ | CD | _ | _ | _
CIND3: CD[asin, title, price; genre = ‘a-book’] ⊆ book[asin, title, price; format = ‘audio’]
(CD[asin, title, price; genre] ⊆ book[asin, title, price; format], Tp), with Tp:
asin | title | price | genre | asin | title | price | format
_ | _ | _ | a-book | _ | _ | _ | audio
38 Traditional INDs as a special case
R1[X] ⊆ R2[Y], with X: [A1, …, An] and Y: [B1, …, Bn]
As a CIND: (R1[X; nil] ⊆ R2[Y; nil], Tp)
Pattern tableau Tp: a single tuple consisting of _ only
A1 | … | An | B1 | … | Bn
_ | … | _ | _ | … | _
CINDs subsume traditional INDs
39 Semantics of CINDs
DB = (DB1, DB2), where DBj is an instance of Rj, j = 1, 2.
DB satisfies (R1[X; Xp] ⊆ R2[Y; Yp], Tp) iff for any tuple t1 in DB1 and any tuple tp in the pattern tableau Tp, if t1[X, Xp] ≍ tp[X, Xp], then there exists t2 in DB2 such that
– t1[X] = t2[Y] (traditional IND semantics)
– t2[Y, Yp] ≍ tp[Y, Yp] (matching the pattern tuple on Y, Yp)
Patterns:
– t1[X, Xp] ≍ tp[X, Xp]: identifying the set of R1 tuples on which tp applies: { t1 | t1[X, Xp] ≍ tp[X, Xp] }
– t2[Y, Yp] ≍ tp[Y, Yp]: enforcing the embedded IND and the constraint specified by the patterns on Y, Yp
40 Example
(CD[asin, title, price; genre] ⊆ book[asin, title, price; format], Tp), with Tp:
asin | title | price | genre | asin | title | price | format
_ | _ | _ | a-book | _ | _ | _ | audio
The following DB satisfies the CIND
book:
asin | isbn | title | price | format
a23 | b32 | Harry Porter | 17.99 | Hard cover
a56 | b65 | Snow White | 7.94 | audio
CD:
asin | title | price | genre
a12 | J. Denver | 17.99 | country
a56 | Snow White | 7.94 | a-book
41 More examples
CIND1: (order[asin, title, price; type] ⊆ book[asin, title, price; nil], Tp), with Tp:
asin | title | price | type | asin | title | price
_ | _ | _ | book | _ | _ | _
The following DB violates CIND1. Why?
order:
id | asin | title | type | price | country | county
t1 | a23 | H. Porter | book | 17.99 | US | DL
t2 | a12 | J. Denver | CD | 7.94 | UK | Reyden
book:
asin | isbn | title | price
a23 | b32 | Harry Porter | 17.99
a56 | b65 | Snow white | 7.94
CD:
asin | title | price | genre
a12 | J. Denver | 17.99 | country
a56 | S. White | 7.94 | a-book
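The CIND semantics can be checked tuple by tuple. The sketch below is illustrative only: relations are lists of dicts, and each pattern tuple is represented as a pair (p1, p2) of dicts, p1 over X and Xp of R1 and p2 over Y and Yp of R2 (a representation chosen here because the two relations share attribute names). The function name `cind_violations` is hypothetical; the data follows the Amazon example, with the audio book title spelled "Snow White" in both relations.

```python
WILDCARD = "_"


def matches(a, b):
    return a == WILDCARD or b == WILDCARD or a == b


def cind_violations(db1, db2, x, xp, y, yp, tableau):
    """Tuples of db1 that violate (R1[x; xp] ⊆ R2[y; yp], tableau)."""
    bad = []
    for t1 in db1:
        for p1, p2 in tableau:
            # The pattern applies only if t1 matches it on X and Xp.
            if not all(matches(t1[a], p1[a]) for a in x + xp):
                continue
            witness = any(
                all(t1[a] == t2[b] for a, b in zip(x, y))         # embedded IND
                and all(matches(t2[b], p2[b]) for b in y + yp)    # pattern on Y, Yp
                for t2 in db2
            )
            if not witness:
                bad.append(t1)
                break
    return bad


order = [
    {"id": "t1", "asin": "a23", "title": "H. Porter", "type": "book", "price": "17.99"},
    {"id": "t2", "asin": "a12", "title": "J. Denver", "type": "CD", "price": "7.94"},
]
book = [
    {"asin": "a23", "title": "Harry Porter", "price": "17.99", "format": "Hard cover"},
    {"asin": "a56", "title": "Snow White", "price": "7.94", "format": "audio"},
]
cd = [
    {"asin": "a12", "title": "J. Denver", "price": "17.99", "genre": "country"},
    {"asin": "a56", "title": "Snow White", "price": "7.94", "genre": "a-book"},
]

key = ["asin", "title", "price"]
blank = {a: "_" for a in key}

# CIND1: order[asin, title, price; type = book] ⊆ book[asin, title, price]
cind1_bad = cind_violations(order, book, key, ["type"], key, [],
                            [({**blank, "type": "book"}, blank)])

# CIND3: CD[...; genre = a-book] ⊆ book[...; format = audio]
cind3_bad = cind_violations(cd, book, key, ["genre"], key, ["format"],
                            [({**blank, "genre": "a-book"}, {**blank, "format": "audio"})])
```

Order tuple t1 is reported for CIND1 because its title "H. Porter" has no exact counterpart in book, while t2 is skipped since its type does not match the pattern.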
42 The satisfiability problem for CINDs
The satisfiability problem for CINDs is to determine, given a set Σ of CINDs, whether or not there exists a nonempty database DB that satisfies Σ, i.e., for any ψ in Σ, DB satisfies ψ.
Recall:
– Any set of traditional INDs is always satisfiable!
– For CFDs, the satisfiability problem is intractable.
In contrast:
Theorem. Any set of CINDs is always satisfiable!
Despite the increased expressive power, the complexity of the satisfiability analysis does not go up.
43 The implication problem for CINDs
The implication problem for CINDs is to decide, given a set Σ of CINDs and a single CIND ψ, whether Σ implies ψ (Σ |= ψ).
For traditional INDs, the implication problem is PSPACE-complete
For CINDs, the complexity does not go up, to an extent:
Theorem. For CINDs containing no finite-domain attributes, the implication problem is PSPACE-complete
In the general setting, however, we have to pay a price:
Theorem. The implication problem for CINDs is EXPTIME-complete
Proof idea:
– Lower bound: reduction from the two-player tiling game
– Upper bound: an extension of the chase for CINDs
44 Finite axiomatizability of CINDs
Rules for inferring IND implication:
– Reflexivity: R[X] ⊆ R[X]
– Projection and Permutation: If R1[A1, …, Ak] ⊆ R2[B1, …, Bk], then R1[Ai1, …, Aim] ⊆ R2[Bi1, …, Bim], for any sequence i1, …, im of distinct indices in {1, …, k}
– Transitivity: If R1[X] ⊆ R2[Y] and R2[Y] ⊆ R3[Z], then R1[X] ⊆ R3[Z]
Sound and complete for IND implication
CINDs retain the finite axiomatizability
Theorem. There is a sound and complete inference system for implication analysis of CINDs
There are 8 axioms.
45 Inference rules for CINDs
Normal form of CINDs: (R1[X; Xp] ⊆ R2[Y; Yp], tp), where tp is a single pattern tuple and tp[A] is a constant iff A is in Xp or Yp (tp[B] = _ if B is in X or Y)
Inference rules
Reflexivity: (R[X; nil] ⊆ R[X; nil], tp), where tp is
A1 | … | Ak | A1 | … | Ak
_ | … | _ | _ | … | _
Projection and permutation: if (R1[X; Xp] ⊆ R2[Y; Yp], tp), then (R1[X’; X’p] ⊆ R2[Y’; Y’p], t’p), for any permutation of X, Xp, where
tp:
X | Xp | Y | Yp
_ | tp[Xp] | _ | tp[Yp]
t’p:
X’ | X’p | Y’ | Y’p
_ | tp[X’p] | _ | tp[Y’p]
46 Axioms for CINDs: transitivity
Transitivity: if (R1[X; Xp] ⊆ R2[Y; Yp], tp) and (R2[Y; Yp] ⊆ R3[Z; Zp], t’p), where tp[Yp] and t’p[Yp] are equal, then (R1[X; Xp] ⊆ R3[Z; Zp], t”p), where
tp:
X | Xp | Y | Yp
_ | tp[Xp] | _ | tp[Yp]
t’p:
Y | Yp | Z | Zp
_ | tp[Yp] | _ | t’p[Zp]
t”p:
X | Xp | Z | Zp
_ | tp[Xp] | _ | t’p[Zp]
47 Axioms for CINDs: downgrading
Downgrading: if (R1[X, A; Xp] ⊆ R2[Y, B; Yp], tp),
X | A | Xp | Y | B | Yp
_ | _ | tp[Xp] | _ | _ | tp[Yp]
then (R1[X; Xp, A] ⊆ R2[Y; Yp, B], t’p),
X | Xp | A | Y | Yp | B
_ | tp[Xp] | a | _ | tp[Yp] | a
48 Axioms for CINDs: augmentation
Augmentation: if (R1[X; Xp] ⊆ R2[Y; Yp], tp) and A ∈ attr(R1),
X | Xp | Y | Yp
_ | tp[Xp] | _ | tp[Yp]
then (R1[X; Xp, A] ⊆ R2[Y; Yp], t’p),
X | Xp | A | Y | Yp
_ | tp[Xp] | a | _ | tp[Yp]
49 Axioms for CINDs: reduction
Reduction: if (R1[X; Xp] ⊆ R2[Y; Yp, B], tp),
X | Xp | Y | Yp | B
_ | tp[Xp] | _ | tp[Yp] | tp[B]
then (R1[X; Xp] ⊆ R2[Y; Yp], t’p),
X | Xp | Y | Yp
_ | tp[Xp] | _ | tp[Yp]
50 Axioms for CINDs: finite domain reduction
F-reduction: if (R1[X; Xp, A] ⊆ R2[Y; Yp], tp) holds for each of the following pattern tuples, where dom(A) = { a1, …, an },
X | Xp | A | Y | Yp
_ | tp[Xp] | a1 | _ | tp[Yp]
_ | tp[Xp] | … | _ | tp[Yp]
_ | tp[Xp] | an | _ | tp[Yp]
then (R1[X; Xp] ⊆ R2[Y; Yp], tp),
X | Xp | Y | Yp
_ | tp[Xp] | _ | tp[Yp]
51 Axioms for CINDs: finite domain upgrade
Upgrade: if (R1[X; Xp, A] ⊆ R2[Y; Yp, B], tp) holds for each of the following pattern tuples, where dom(A) = { a1, …, an },
X | Xp | A | Y | Yp | B
_ | tp[Xp] | a1 | _ | tp[Yp] | a1
_ | tp[Xp] | … | _ | tp[Yp] | …
_ | tp[Xp] | an | _ | tp[Yp] | an
then (R1[X, A; Xp] ⊆ R2[Y, B; Yp], tp),
X | A | Xp | Y | B | Yp
_ | _ | tp[Xp] | _ | _ | tp[Yp]
52 Static analyses: CIND vs. IND
General setting:
| satisfiability | implication | finite axiom’ty
CIND | O(1) | EXPTIME-complete | yes
IND | O(1) | PSPACE-complete | yes
In the absence of finite-domain attributes:
| satisfiability | implication | finite axiom’ty
CIND | O(1) | PSPACE-complete | yes
IND | O(1) | PSPACE-complete | yes
Theorem: In the absence of finite-domain attributes, Reflexivity, Projection and Permutation, Transitivity, Augmentation, Downgrading and Reduction are sound and complete for CIND implication
CINDs retain most complexity bounds of their traditional counterparts
53 CFDs and CINDs taken together
We need both CFDs and CINDs for data cleaning and schema matching
Theorem. The implication problem for CFDs and CINDs is undecidable
– Not surprising: the implication problem for traditional FDs and INDs is already undecidable
Theorem. The consistency problem for CFDs and CINDs is undecidable
– In contrast, any set of traditional FDs and INDs is consistent!
– Proof idea: reduction from the implication problem for FDs and INDs
54 Static analyses: CFD + CIND vs. FD + IND
| satisfiability | implication | finite axiom’ty
CFD + CIND | undecidable | undecidable | no
FD + IND | O(1) | undecidable | no
CINDs and CFDs properly subsume FDs and INDs
Both the satisfiability analysis and the implication analysis are beyond reach in practice
This calls for effective heuristic methods
55 Algorithms and open research issues
56 Detecting CFD Violations
CFD: (cust(country, area-code, phone → street, city, zip), Tp), with pattern tableau Tp
country | area-code | phone | street | city | zip
44 | 131 | _ | _ | Edi | _
01 | 908 | _ | _ | MH | _
_ | _ | _ | _ | _ | _
Detection is run against the cust instance
country | area-code | phone | street | city | zip
… | … | … | Mayfield | NYC | EH4 8LE
… | … | … | Crichton | NYC | EH4 8LE
… | … | … | Mountain Ave | NYC | 07974
57 Detecting CFD violations
Input: a set Σ of CFDs and a database DB
Output: the set of tuples in DB that violate at least one CFD in Σ
Approach: automatically generate SQL queries to find violations
Complication 1: consider (R: X → Y, Tp); the pattern tableau may be large (recall that each tuple in Tp is in fact a constraint)
Goal: the size of the SQL queries is independent of Tp
Trick: treat Tp as a data table
CINDs can be checked along the same lines
58 Single CFD: step 1
(cust(country, area-code, phone → street, city, zip), Tp)
A pair of SQL queries, treating Tp as a data table:
– Single-tuple violations (pattern matching)
– Multi-tuple violations (traditional FDs)
Single-tuple violations: Qc
select *
from R t, Tp tp
where t[country] ≍ tp[country] AND t[area-code] ≍ tp[area-code] AND t[phone] ≍ tp[phone]
  AND (t[street] <> tp[street] OR t[city] <> tp[city] OR t[zip] <> tp[zip])
– t[A] <> tp[A]: not matching, i.e., (t[A] ≠ tp[A] AND tp[A] ≠ _)
– t[A] ≍ tp[A]: (t[A] = tp[A] OR tp[A] = _)
59 Single CFD: step 2
(cust(country, area-code, phone → street, city, zip), Tp)
Multi-tuple violations (the semantics of traditional FDs): Qv
select distinct t.country, t.area-code, t.phone
from R t, Tp tp
where t[country] ≍ tp[country] AND t[area-code] ≍ tp[area-code] AND t[phone] ≍ tp[phone]
group by t.country, t.area-code, t.phone
having count(distinct street, city, zip) > 1
Tp is treated as a data table
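The pair of queries Qc and Qv can be tried out with Python's built-in sqlite3 module. The sketch below is one possible rendering, not the authors' generated SQL: the table and column names (cust, tp, area), the tuple ids, the phone numbers and the two extra tuples t4 and t5 are assumptions added so that both kinds of violation occur. SQLite's COUNT(DISTINCT …) takes a single expression, so the RHS attributes are concatenated with a '|' separator, which assumes that character does not occur in the data.

```python
import sqlite3

LHS = ["country", "area", "phone"]
RHS = ["street", "city", "zip"]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust (id TEXT, country TEXT, area TEXT, phone TEXT,
                   street TEXT, city TEXT, zip TEXT);
CREATE TABLE tp (country TEXT, area TEXT, phone TEXT,
                 street TEXT, city TEXT, zip TEXT);
""")
conn.executemany("INSERT INTO cust VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("t1", "44", "131", "1111111", "Mayfield", "NYC", "EH4 8LE"),
    ("t2", "44", "131", "2222222", "Crichton", "NYC", "EH4 8LE"),
    ("t3", "01", "908", "2222222", "Mountain Ave", "NYC", "07974"),
    ("t4", "01", "212", "3333333", "Elm Str", "NYC", "10001"),   # assumed tuple
    ("t5", "01", "212", "3333333", "Oak Str", "NYC", "10001"),   # assumed tuple
])
conn.executemany("INSERT INTO tp VALUES (?, ?, ?, ?, ?, ?)", [
    ("44", "131", "_", "_", "Edi", "_"),
    ("01", "908", "_", "_", "MH", "_"),
    ("_", "_", "_", "_", "_", "_"),
])

# The query text depends only on the embedded FD, not on the size of tp.
match_lhs = " AND ".join(f"(t.{a} = p.{a} OR p.{a} = '_')" for a in LHS)
mismatch_rhs = " OR ".join(f"(p.{a} <> '_' AND t.{a} <> p.{a})" for a in RHS)
group_cols = ", ".join(f"t.{a}" for a in LHS)
rhs_concat = " || '|' || ".join(f"t.{a}" for a in RHS)

# Qc: single-tuple violations (a constant in the RHS pattern is not matched).
qc = (f"SELECT DISTINCT t.id FROM cust t, tp p "
      f"WHERE {match_lhs} AND ({mismatch_rhs}) ORDER BY t.id")

# Qv: multi-tuple violations (same LHS values, more than one RHS combination).
qv = (f"SELECT {group_cols} FROM cust t, tp p WHERE {match_lhs} "
      f"GROUP BY {group_cols} HAVING COUNT(DISTINCT {rhs_concat}) > 1")

qc_rows = conn.execute(qc).fetchall()
qv_rows = conn.execute(qv).fetchall()
```

Adding more pattern tuples only adds rows to `tp`; the two query strings stay the same, which is the point of treating the tableau as a data table.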
60 Multiple CFDs
Complication 2: if the set Σ has n CFDs, do we use 2n SQL queries, and thus 2n passes over the database DB?
Goal:
– 2 SQL queries no matter how many CFDs are in Σ
– the size of the SQL queries is independent of Tp
Trick: merge multiple CFDs into one
– Given (R: X1 → Y1, Tp1) and (R: X2 → Y2, Tp2)
– Create a single pattern table Tm over X1 ∪ X2 ∪ Y1 ∪ Y2
– Introduce @, a don’t-care variable, to populate attributes of pattern tuples in X1 – X2, etc. (tp[A] …
– Modify the pair of SQL queries by using Tm
61 Handling multiple CFDs
CFD1: (area → state, T1)
area | state
_ | _
212 | NY
CFD2: (zip → state, T2)
zip | state
07974 | NJ
90291 | CA
01202 | _
CFD3: (area, zip → state, T3)
area | zip | state
… | … | CA
… | … | CA
CFDM: (area, zip → state, TM), where TM collects the pattern tuples of CFD1, CFD2 and CFD3 over (area, zip, state)
Qc:
select * from R t, TM tp
where t[area] ≍ tp[area] AND t[zip] ≍ tp[zip] AND t[state] <> tp[state]
Qv:
select distinct area, zip from Macro
group by area, zip having count(distinct state) > 1
Macro:
select (case tp[area] when @ then @ else t[area] end) as area, ...
from R t, TM tp
where t[area] ≍ tp[area] AND t[zip] ≍ tp[zip] AND tp[state] = _
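The merging step can be sketched as follows. This is a simplified illustration under assumptions: all merged CFDs share the same right-hand side (as CFD1 and CFD2 do with state), the don't-care symbol is written "@", a don't-care cell matches any value, and the area values in the sample tuples are made up. The helper names `merge_tableaux` and `single_tuple_violations` are hypothetical.

```python
WILDCARD = "_"
DONT_CARE = "@"  # fills attributes that a CFD does not mention


def merge_tableaux(cfds):
    """Merge CFDs given as (lhs, rhs, tableau) into one tableau over all attributes."""
    attrs = []
    for lhs, rhs, _ in cfds:
        for a in lhs + rhs:
            if a not in attrs:
                attrs.append(a)
    merged = [{a: tp.get(a, DONT_CARE) for a in attrs}
              for _, _, tableau in cfds for tp in tableau]
    return attrs, merged


def cell_matches(value, pattern):
    return pattern in (WILDCARD, DONT_CARE) or value == pattern


def single_tuple_violations(rows, lhs, rhs, merged):
    """Counterpart of Qc over the merged tableau."""
    return [t["id"] for t in rows
            if any(all(cell_matches(t[a], tp[a]) for a in lhs)
                   and not all(cell_matches(t[a], tp[a]) for a in rhs)
                   for tp in merged)]


cfd1 = (["area"], ["state"], [{"area": "_", "state": "_"},
                              {"area": "212", "state": "NY"}])
cfd2 = (["zip"], ["state"], [{"zip": "07974", "state": "NJ"},
                             {"zip": "90291", "state": "CA"},
                             {"zip": "01202", "state": "_"}])

attrs, tm = merge_tableaux([cfd1, cfd2])

rows = [  # area values are illustrative assumptions
    {"id": "t1", "area": "908", "zip": "07974", "state": "NJ"},
    {"id": "t2", "area": "908", "zip": "07974", "state": "NY"},
    {"id": "t3", "area": "212", "zip": "01202", "state": "NY"},
    {"id": "t4", "area": "310", "zip": "90291", "state": "CA"},
    {"id": "t5", "area": "310", "zip": "90291", "state": "CT"},
]
dirty = single_tuple_violations(rows, ["area", "zip"], ["state"], tm)
```

One pass over `rows` against `tm` now covers both CFDs, instead of one pass per CFD.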
62 Keeping things tidy… Updating the database
CFD: (area → state, T), with T
area | state
_ | _
212 | NY
id | cc | area | phone | addr | city | zip | state
t1 | … | … | … | Elm Str. | MH | 07974 | NJ
t2 | … | … | … | Pine Str. | MH | 07974 | NY
t3 | … | … | … | Oak Str. | NYC | 01202 | NY
t4 | … | … | … | Main Str. | LA | 90291 | CA
t5 | … | … | … | Rice Str. | LA | 90291 | CT
Tuple deletions: a tuple deletion might remove violations. Tuples that were dirty before the deletion might become clean!
Tuple insertions: a tuple insertion might introduce violations. Tuples that were clean before the insertion might become dirty!
63 Incremental inconsistency detection
Input: a set Σ of CFDs, a database DB, the set V of tuples in DB that violate Σ, and changes ΔDB to DB
Output: the set V’ of tuples in DB + ΔDB that violate CFDs in Σ
ΔDB: a set of tuples to be inserted into DB or deleted from DB
Approaches:
– Batch approach: compute DB’ = DB + ΔDB, then apply the SQL detection queries to DB’
– Incremental approach: compute the change ΔV such that the new violations V’ = the old violations V + ΔV. Why?
64 The need for incremental inconsistency detection
Small changes ΔDB to DB tend to incur only small changes ΔV to V
– it is more efficient to compute ΔV than to compute V’ starting from scratch
Minimize unnecessary recomputation and traversal of DB
Goal: generate new SQL queries that perform incremental detection
– modify the SQL detection queries by leveraging ΔDB and V
– design and maintain auxiliary structures (indexing, marks)
65 Logging violations
CFD: (area → state, T), with T
area | state
_ | _
212 | NY
id | cc | area | phone | addr | city | zip | state | BC | BV
t1 | … | … | … | Elm Str. | MH | 07974 | NJ | 0 | 1
t2 | … | … | … | Pine Str. | MH | 07974 | NY | 0 | 1
t3 | … | … | … | Oak Str. | NYC | 01202 | NJ | 1 | 0
t4 | … | … | … | Main Str. | LA | 90291 | CA | 0 | 0
Use one pair of columns for each CFD
– Attribute BC records whether tuple t is returned by query Qc of the CFD
– Attribute BV records whether tuple t is returned by query Qv of the CFD
Initialization:
update R t set t[BC] = 1 where t in (Qc)
update R t set t[BV] = 1 where t in (Qv)
66 Handling deletions
CFD: (area → state, T), with T
area | state
_ | _
212 | NY
id | cc | area | phone | addr | city | zip | state | BC | BV
t1 | … | … | … | Elm Str. | MH | 07974 | NY | 0 | 1
t2 | … | … | … | Pine Str. | MH | 07974 | NJ | 0 | 1
t3 | … | … | … | Oak Str. | NYC | 01202 | NJ | 1 | 0
t4 | … | … | … | Main Str. | LA | 90291 | CA | 0 | 0
Let tdel be the tuple we want to delete from R
Step 1: delete from R t where t = tdel
Step 2: update R t set t[BV] = 0
where t[BV] = 1 AND t[area] = tdel[area]
AND 1 = (select count(distinct state) from R t’ where t’[area] = tdel[area])
How about batch deletions?
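The flag columns and the two deletion steps can be exercised with sqlite3. The following is a sketch under assumptions rather than the authors' code: the relation is reduced to (id, area, state, bc, bv), the area values 908, 212 and 310 are invented so that t1 and t2 share an area, and the function name `delete_tuple` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE R (id TEXT PRIMARY KEY, area TEXT, state TEXT,
                bc INTEGER DEFAULT 0, bv INTEGER DEFAULT 0);
CREATE TABLE T (area TEXT, state TEXT);
INSERT INTO T VALUES ('_', '_'), ('212', 'NY');
INSERT INTO R (id, area, state) VALUES
    ('t1', '908', 'NY'), ('t2', '908', 'NJ'),
    ('t3', '212', 'NJ'), ('t4', '310', 'CA');

-- Initialization: bc marks single-tuple violations (Qc), bv multi-tuple ones (Qv).
UPDATE R SET bc = 1 WHERE id IN (
    SELECT t.id FROM R t, T p
    WHERE (t.area = p.area OR p.area = '_')
      AND p.state <> '_' AND t.state <> p.state);
UPDATE R SET bv = 1 WHERE area IN (
    SELECT t.area FROM R t, T p
    WHERE (t.area = p.area OR p.area = '_') AND p.state = '_'
    GROUP BY t.area HAVING COUNT(DISTINCT t.state) > 1);
""")

flags_before = conn.execute("SELECT id, bc, bv FROM R ORDER BY id").fetchall()


def delete_tuple(conn, tid):
    """Step 1: delete the tuple. Step 2: clear bv where the conflict disappeared."""
    area = conn.execute("SELECT area FROM R WHERE id = ?", (tid,)).fetchone()[0]
    conn.execute("DELETE FROM R WHERE id = ?", (tid,))
    conn.execute(
        """UPDATE R SET bv = 0
           WHERE bv = 1 AND area = ?
             AND 1 = (SELECT COUNT(DISTINCT state) FROM R r2 WHERE r2.area = ?)""",
        (area, area))


delete_tuple(conn, "t2")
flags_after = conn.execute("SELECT id, bc, bv FROM R ORDER BY id").fetchall()
```

Only tuples sharing the deleted tuple's area are touched in step 2, so the rest of R is never rescanned for this CFD.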
67 Handling insertions
CFD: (area → state, T), with T
area | state
_ | _
212 | NY
id | cc | area | phone | addr | city | zip | state | BC | BV
t3 | … | … | … | Oak Str. | NYC | 01202 | NJ | 1 | 0
t4 | … | … | … | Main Str. | LA | 90291 | CA | 0 | 0
tins | … | … | … | Rice Str. | LA | 90291 | CT | |
Let tins be the tuple we want to insert into R
Step 1: insert into R values tins
Step 2: update R t set t[BC] = 1
where t = tins AND exists (select * from T tp where t[area] ≍ tp[area] AND t[state] <> tp[state])
Step 3: update R t set t[BV] = 1, tins[BV] = 1
where t[area] = tins[area] AND t[state] ≠ tins[state]
AND exists (select * from T tp where tins[area] ≍ tp[area] AND tp[state] = _)
How about batch insertions?
68 Handling batch insertions
Let Rins be the set of tuples we want to insert into R
Step 1: Find tuples in Rins that violate Qc
Step 2: Find clean tuples in R that become dirty due to some tuple(s) in Rins
Step 3: Find tuples in Rins that become dirty due to some dirty tuple(s) in R
Step 4: Find clean tuples in Rins that violate the CFD
Step 5: Insert Rins into R
69 The source code
Let Rins be the set of tuples we want to insert into R
Step 1: update Rins tins set tins[BC] = 1 where tins in (Qc)
Step 2: update R t set t[BV] = 1
where t[BC] = 0 AND t[BV] = 0
AND exists (select * from T tp where t[area] ≍ tp[area] AND tp[state] = _)
AND exists (select * from Rins tins where tins[area] = t[area] AND tins[state] ≠ t[state])
Step 3: update Rins tins set tins[BV] = 1
where exists (select * from R t where tins[area] = t[area] AND t[BV] = 1)
Step 4: update Rins tins set tins[BV] = 1
where tins[BC] = 0 AND tins[BV] = 0
AND tins[area] IN (select area from Rins t’ins, T tp
where t’ins[BC] = 0 AND t’ins[BV] = 0 AND t’ins[area] ≍ tp[area] AND tp[state] = _
group by area having count(distinct state) > 1)
Step 5: insert into R values (select * from Rins)
70 Scalability in Instance Size
71 Scalability in NumConsts
72 Scalability in Noise
73 Merging CFDs
74 Incremental Deletions
75 Incremental Batch Deletions
76 Incremental Insertions
77 Incremental Batch Insertions
78 Checking the satisfiability of CFDs
Input: a set Σ of CFDs
MAXSC: find a maximum subset of Σ that is consistent
Complexity: the MAXSC problem for CFDs is NP-complete
Theorem: there is an ε-approximation algorithm for MAXSC
– there exists a constant ε such that the subset Σm found by the algorithm satisfies the bound card(Σm) > ε · card(OPT(Σ))
Proof idea: approximation-factor-preserving reduction to MAXGSAT
Open questions: effective heuristic algorithms
– for checking the satisfiability of CFDs + CINDs (undecidable)
– for determining implication of CFDs, CINDs, and CFDs + CINDs
– for finding a minimum cover of CFDs, CINDs, and CFDs + CINDs
79 Automated methods for finding a repair
Input: a relational database DB, and a set Σ of CFDs
Output: a repair DB’ of DB such that cost(DB’, DB) is minimal
– repair: DB’ satisfies Σ
– “good”: cost(DB’, DB) is small, i.e., DB’ is “close” to the original data in DB, minimizing changes to “accurate” attributes
Complexity: finding an optimal repair is
– NP-complete (data complexity) for traditional FDs, for a fixed set of FDs (or INDs) and a fixed schema
– PSPACE-complete for CFDs + CINDs (combined complexity)
Open questions: effective heuristics for repairing databases based on CFDs, CINDs, and CFDs + CINDs
80 Incremental repair
Input: a clean database DB, changes ΔDB to DB, and a set Σ of CFDs
Output: a repair DB’ of DB + ΔDB
Complexity: the local data cleaning problem is NP-hard, even if ΔDB consists of a single tuple.
Open questions: find effective heuristic algorithms for incrementally repairing databases based on
– CFDs
– CINDs
– CFDs + CINDs
81 Discovering CFDs and CINDs
Input: sample databases of a schema R
Output: CFDs and CINDs that hold on all (or most) database instances of R
Difficulty: a naïve approach may find non-representative CFDs and CINDs as large as the sample data
Open questions: find effective methods for discovering
– CFDs
– CINDs
– CFDs + CINDs
82 Summary
Conditional functional dependencies
– for data cleaning rather than schema design
– complexity bounds of satisfiability and implication analyses
– a sound and complete inference system
Conditional inclusion dependencies
– for data cleaning and schema matching in practice
– complexity bounds of satisfiability and implication analyses
– a sound and complete inference system
Complexity bounds for CFDs and CINDs taken together
SQL techniques for automatic detection of CFD violations
– a pair of SQL queries for validating multiple CFDs
– incremental techniques for validating CFDs
A practical method for data cleaning and schema matching
83 References
– Philip Bohannon, Wenfei Fan, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis. Conditional Functional Dependencies for Data Cleaning. The 23rd International Conference on Data Engineering (ICDE).
– Loreto Bravo, Wenfei Fan, Shuai Ma. Extending Dependencies with Conditions.
– Gao Cong, Wenfei Fan, Floris Geerts, Xibei Jia, Shuai Ma. Improving Data Quality: Consistency and Accuracy.
– Wenfei Fan, Floris Geerts, Xibei Jia, Anastasios Kementsietsidis. Conditional Functional Dependencies for Capturing Data Inconsistencies.