On Concise Set of Relative Candidate Keys Shaoxu Song (Tsinghua), Lei Chen (HKUST), Hong Cheng (CUHK)


1 On Concise Set of Relative Candidate Keys
Shaoxu Song (Tsinghua), Lei Chen (HKUST), Hong Cheng (CUHK)

2 Example of Matching Keys
For identifying the same real-world entities, matching keys specify
– what attributes to compare, and
– how to compare them
ψ2 : (name,address,department ∥ [0,4],[0,2],[0,0]) states that for any tuples ti, tj in a relation
– if their distance on attribute name is in [0,4], i.e., ≥ 0 and ≤ 4, their distance on address is in [0,2], and they have the same department, i.e., distance in [0,0],
– then their ssn values must be identified

     SSN     Name         Address    Department
t1   234***  Jason Smith  Mark Road  Social Science
t2   2****3  J Smith      Mark Rd    Social Science

[Fan et al., VLDB09]

3 Relative Candidate Keys (RCKs)
A matching key ψ2 is said to be redundant w.r.t. a relation if
– all the tuple pairs that can be identified by ψ2
– can also be identified by another key ψ1
ψ1 : (name,address ∥ [0,4],[0,3])
ψ2 : (name,address,department ∥ [0,4],[0,3],[0,0])
RCKs, a special group of matching keys
– the number of compared attributes is minimized
– analogous to candidate keys w.r.t. functional dependencies

     SSN     Name           Address      Department
t1   234***  Jason Smith    Mark Road    Social Science
t2   2****3  J Smith        Mark Rd      Social Science
t3   862***  Wixom J Smith  Park St      Social Science
t4   862***  W J Smith      Park Street  Social Science

ψ1 is an RCK
[Fan et al., VLDB09]

4 Minimal Matching Keys
Redundancy issues exist
– not only w.r.t. “what attributes to compare”
– but also in “how to compare them”
ψ1 : (name,address ∥ [0,4],[0,2])
ψ3 : (name,address ∥ [0,0],[0,2])
Redundancy among matching keys on the same attributes
– any tuple pair that agrees on ψ3, with name distance in [0,0], always satisfies the restriction [0,4] of ψ1

     SSN     Name          Address      Department
t1   234***  Jason Smith   Mark Road    Social Science
t2   2****3  Smith         Mark Rd      Social Science
t3   862***  Smith         Park St      Social Science
t4   862***  Will J Smith  Park Street  Social Science
t5   0****5  C Green       Mark Road    Computing
t6   0****5  C Green       Mark Rd      Computing

ψ3 is an RCK but not minimal

5 Reliable Matching Keys
Consider a training data instance
– the same real-world entities in attribute Y are pre-identified
– e.g., the matching tuple pairs (t1,t2), (t3,t4), (t5,t6) on ssn
Support
– the number of tuple pairs that can be covered by ψ, reported relative to all tuple pairs (15 pairs for the 6 tuples below)
Confidence
– the proportion of covered tuple pairs that correspond to true identifications on Y
ψ5 : (name,address ∥ [0,4],[0,4])
supp(ψ5) = 4/15
conf(ψ5) = 3/4

     SSN     Name          Address      Department
t1   234863  Jason Smith   Mark Road    Social Science
t2   234863  J Smith       Mark Rd      Social Science
t3   862731  W J Smith     Park St      Social Science
t4   862731  Will J Smith  Park Street  Social Science
t5   068335  C Green       Mark Road    Computing
t6   068335  C Green       Mark Rd      Computing
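
The support and confidence figures for ψ5 can be reproduced with a small sketch. It assumes the distance function is plain Levenshtein edit distance on the raw strings; the slides do not name the metric, so that choice, as well as the helper names edit_distance, covers and supp_conf, are illustrative assumptions rather than the authors' implementation.

```python
from fractions import Fraction
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance, computed row by row (assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]

# Training instance from the slide: (ssn, name, address, department)
ROWS = [
    ("234863", "Jason Smith", "Mark Road", "Social Science"),
    ("234863", "J Smith", "Mark Rd", "Social Science"),
    ("862731", "W J Smith", "Park St", "Social Science"),
    ("862731", "Will J Smith", "Park Street", "Social Science"),
    ("068335", "C Green", "Mark Road", "Computing"),
    ("068335", "C Green", "Mark Rd", "Computing"),
]
COLS = {"ssn": 0, "name": 1, "address": 2, "department": 3}

def covers(key, s, t):
    """True if the pair (s, t) satisfies every distance restriction of the key."""
    return all(lo <= edit_distance(s[COLS[a]], t[COLS[a]]) <= hi
               for a, (lo, hi) in key.items())

def supp_conf(key, rows, target="ssn"):
    """Support over all tuple pairs; confidence over the covered pairs."""
    pairs = list(combinations(rows, 2))
    covered = [(s, t) for s, t in pairs if covers(key, s, t)]
    true = [(s, t) for s, t in covered if s[COLS[target]] == t[COLS[target]]]
    supp = Fraction(len(covered), len(pairs))
    conf = Fraction(len(true), len(covered)) if covered else Fraction(0)
    return supp, conf

psi5 = {"name": (0, 4), "address": (0, 4)}
supp, conf = supp_conf(psi5, ROWS)
print(supp, conf)  # 4/15 3/4
```

Under this assumed metric the fourth covered pair is (t2,t3), which is not a true match on ssn; that is why the confidence is 3/4 rather than 1.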

6 Matching Key Set
Consider a set Φ of matching keys relative to the same Y
– ψ1 : (name,address ∥ [0,4],[0,2])
– ψ6 : (name,department ∥ [0,0],[0,0])
A tuple pair may agree on (be covered by) several keys ψ ∈ Φ
To avoid duplicate counting, consider the distinct tuple pairs that are covered by a set of matching keys

     SSN     Name         Address      Department
t1   234863  Jason Smith  Mark Road    Social Science
t2   234863  J Smith      Mark Rd      Social Science
t3   862731  W J Smith    Park St      Business
t4   862731  W J Smith    Park Street  Business
t5   068335  C Green      Mark Road    Computing
t6   068335  C Green      Mark Rd      Computing

supp(ψ1) = 2/15
supp(ψ6) = 2/15
supp(Φ) = 3/15
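
The distinct-pair counting for Φ = {ψ1, ψ6} can be sketched as a set union over covered pairs. As an assumption, distances are again Levenshtein edit distances; covered_pairs and the index-pair representation are hypothetical names chosen for illustration.

```python
from itertools import combinations

def edit_distance(a, b):
    """Levenshtein distance (assumed metric; the slides do not fix one)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

# Instance from the slide: (name, address, department); ssn is not needed here
ROWS = [
    ("Jason Smith", "Mark Road", "Social Science"),
    ("J Smith", "Mark Rd", "Social Science"),
    ("W J Smith", "Park St", "Business"),
    ("W J Smith", "Park Street", "Business"),
    ("C Green", "Mark Road", "Computing"),
    ("C Green", "Mark Rd", "Computing"),
]
COLS = {"name": 0, "address": 1, "department": 2}

def covered_pairs(key, rows):
    """Set of index pairs (i, j), i < j, that satisfy every restriction of the key."""
    return {(i, j)
            for (i, s), (j, t) in combinations(enumerate(rows), 2)
            if all(lo <= edit_distance(s[COLS[a]], t[COLS[a]]) <= hi
                   for a, (lo, hi) in key.items())}

psi1 = {"name": (0, 4), "address": (0, 2)}
psi6 = {"name": (0, 0), "department": (0, 0)}
c1, c6 = covered_pairs(psi1, ROWS), covered_pairs(psi6, ROWS)
total = len(ROWS) * (len(ROWS) - 1) // 2

# Counts out of 15 pairs: psi1, psi6, and the union for the key set
print(len(c1), len(c6), len(c1 | c6), total)  # 2 2 3 15
```

The pair (t5,t6) is covered by both keys, so the union has 3 pairs rather than 4.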

7 Hardness and Solutions
Given a relation instance r of R, a Y over R, a constant k, and
– the minimum requirements of support ηs and confidence ηc,
find a set Φ of matching keys such that
– supp(Φ) ≥ ηs, conf(Φ) ≥ ηc, and
– the size of the set |Φ| is minimized
The problem is NP-hard
Greedy solution
– select a ψ with the maximum support in each iteration
– continue until the minimum support ηs is satisfied
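
The greedy loop can be sketched in the style of greedy set cover. This is a minimal illustration under stated assumptions, not the authors' algorithm: "maximum support" is read here as the largest number of newly covered pairs, candidates are given as precomputed sets of covered pairs, and the confidence requirement ηc is assumed to be enforced before this step. The function name greedy_key_set is hypothetical.

```python
from fractions import Fraction

def greedy_key_set(coverage, total_pairs, eta_s):
    """coverage maps a key name to the set of tuple pairs it covers.

    Returns the chosen key names in selection order, or None if eta_s
    cannot be reached with the given candidates.
    """
    chosen, covered = [], set()
    while Fraction(len(covered), total_pairs) < eta_s:
        remaining = [k for k in coverage if k not in chosen]
        # Assumed reading of "maximum support": most pairs not yet covered
        best = max(remaining, key=lambda k: len(coverage[k] - covered),
                   default=None)
        if best is None or not coverage[best] - covered:
            return None  # no candidate contributes new pairs
        chosen.append(best)
        covered |= coverage[best]
    return chosen

# Coverage sets as index pairs, taken from the Matching Key Set example
coverage = {"psi1": {(0, 1), (4, 5)}, "psi6": {(2, 3), (4, 5)}}
print(greedy_key_set(coverage, 15, Fraction(3, 15)))  # ['psi1', 'psi6']
print(greedy_key_set(coverage, 15, Fraction(2, 15)))  # ['psi1']
```

Ties are broken by insertion order in this sketch; the slides do not say how ties are resolved.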

8 Redundancy Free Results
Subsume: on distance restrictions,
– [0,4] subsumes [0,2]
Dominate: ψ1 ≺ ψ2,
– if all distance restrictions in ψ1 subsume those of ψ2
Minimal: a ψ is minimal
– if there does not exist any ψ′ such that
– ψ′ ≺ ψ (and conf(ψ′) ≥ ηc)
Minimal matching keys are always RCKs
– the “minimal” definition is stricter than the RCK definition
Greedy algorithm returns minimal results
– for any ψ1 ≺ ψ2, we have supp(ψ1) ≥ supp(ψ2)
ψ1 : (name,address ∥ [0,4],[0,2])
ψ2 : (name,address,department ∥ [0,4],[0,2],[0,0])
ψ3 : (name,address ∥ [0,0],[0,2])
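
The subsume and dominate relations on ψ1, ψ2, ψ3 can be checked with a short sketch. Two assumptions go beyond the slide: an attribute that a key does not mention is treated as unrestricted (so a key on fewer attributes can dominate a key on more), and ≺ is treated as strict, so a key does not dominate itself. The confidence condition conf(ψ′) ≥ ηc is left out of this illustration.

```python
def subsumes(outer, inner):
    """[lo, hi] restriction `outer` subsumes `inner` if it contains it."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def dominates(k1, k2):
    """k1 ≺ k2 under the assumptions above.

    Every attribute restricted by k1 must also be restricted by k2, and
    k1's interval on that attribute must subsume k2's interval.
    """
    return k1 != k2 and all(a in k2 and subsumes(r, k2[a])
                            for a, r in k1.items())

def minimal_keys(keys):
    """Names of keys that no other key in `keys` dominates."""
    return [n for n, k in keys.items()
            if not any(dominates(o, k) for o in keys.values())]

KEYS = {
    "psi1": {"name": (0, 4), "address": (0, 2)},
    "psi2": {"name": (0, 4), "address": (0, 2), "department": (0, 0)},
    "psi3": {"name": (0, 0), "address": (0, 2)},
}
print(dominates(KEYS["psi1"], KEYS["psi2"]))  # True
print(dominates(KEYS["psi1"], KEYS["psi3"]))  # True
print(minimal_keys(KEYS))                     # ['psi1']
```

Because any pair covered by a dominated key is also covered by the dominating key, supp(ψ1) ≥ supp(ψ2) follows directly from ψ1 ≺ ψ2.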

9 Pruning Idea
If ψ1 is selected into the result set Φ
– any ψ2 with ψ1 ≺ ψ2 makes no further contribution to supp(Φ), i.e., supp({ψ1}) = supp({ψ1, ψ2})
– ψ2 can be directly ignored
Example: suppose that ψ1 is selected into Φ
– supp({ψ1}) = supp({ψ1, ψ2}) = supp({ψ1, ψ3}) = 2/15
– ψ2, ψ3 can be pruned in the following computation

     SSN     Name          Address      Department
t1   234863  Jason Smith   Mark Road    Social Science
t2   234863  J Smith       Mark Rd      Social Science
t3   862731  W J Smith     Park St      Social Science
t4   862731  Will J Smith  Park Street  Social Science
t5   068335  C Green       Mark Road    Computing
t6   068335  C Green       Mark Rd      Computing

ψ1 : (name,address ∥ [0,4],[0,2])
ψ2 : (name,address,department ∥ [0,4],[0,2],[0,0])
ψ3 : (name,address ∥ [0,0],[0,2])

10 Experiments
The returned set size is affected by ηs and ηc
– higher ηs and ηc lead to a larger set size
– when both ηs and ηc are too high, a valid matching key set may not exist
The pruning technique significantly reduces the time cost

11 Experiments
Concise RCK sets with support commitment ηs
– higher accuracy
Compared with considering all RCKs
– recall is high when all RCKs are used
– but precision is low: many irrational keys with low support probably overfit the data

12 Conclusion
Relative candidate keys (RCKs) clear up redundant semantics
– w.r.t. “what attributes to compare”
– minimal on the number of compared attributes
Minimal matching keys, a concise set of RCKs
– address redundancy among RCKs on the same attributes
– about “how to compare them”
Introduce a greedy discovery algorithm
The returned results are guaranteed to be
– RCKs (minimal w.r.t. attributes), and also
– minimal w.r.t. distance restrictions, i.e., redundancy free w.r.t. “how to compare the attributes”

13 Thanks

