Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David.

Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David

Data Cleaning  Real-world data is dirty  Examples: Syntactical Errors: e.g., Micrsoft Heterogeneous Formats: e.g., Phone number formats Missing Values (Incomplete data) Violation of Integrity Constraints: e.g., FDs/INDs Duplicate Records  Data Cleaning is the process of improving data quality by removing errors, inconsistencies and anomalies of data VLDB 2009George Beskales2

Duplicate Elimination Process IDnameZIPIncome P1Green5151930k P2Green5151832k P3Peter3052840k P4Peter3052840k P5Gree5151955k P6Chuck5151930k VLDB 2009George Beskales3 IDnameZIPIncome C1Green5151939k C2Peter3052840k C3Chuck5151930k Compute Pairwise Similarity P1P2 P3 P4 P5 P6 0.30.5 0.9 1.0 Cluster Records P1P2 P3P4 P5 P6Merge Clusters C1 C3 C2 Unclean Relation Clean Relation

Probabilistic Data Cleaning VLDB 2009George Beskales4 (a) One-Shot Cleaning Unclean Database Deterministic Database Deterministic Duplicate Elimination RDBMS QueriesResults Single Clean Instance Queries (b) Probabilistic Cleaning Unclean Database Uncertain Database Probabilistic Duplicate Elimination Uncertainty and Cleaning-aware RDBMS Probabilistic Results Single Clean Instance Single Clean Instance Multiple Possible Clean Instances

Motivation Probabilistic Data Cleaning: 1.Avoid deterministic resolution of conflicts during data cleaning 2.Enrich query results by considering all possible cleaning instances 3.Allows specifying query-time cleaning requirements VLDB 2009George Beskales5

VLDB 2009George Beskales6 Probabilistic Duplicate Elimination Probabilistic Duplicate Elimination Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Uncertain Database Uncertain Database Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Queries Unclean Database Uncertainty and Cleaning-aware RDBMS Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Probabilistic Results Outline  Generating and Storing the Possible Clean Instances  Querying the Clean Instances  Experimental Evaluation  Generating and Storing the Possible Clean Instances  Querying the Clean Instances  Experimental Evaluation  Generating and Storing the Possible Clean Instances  Querying the Clean Instances  Experimental Evaluation

Uncertainty in Duplicate Detection VLDB 2009George Beskales7 Person Uncertain Clustering X1X1 {P1} {P2} {P3,P4} {P5} {P6} X2X2 {P1,P2} {P3,P4} {P5} {P6} X3X3 {P1,P2,P5} {P3,P4} {P6} Possible Repairs  Uncertain Duplicates: Determining what records should be clustered is uncertain due to noisy similarity measurements  A possible repair of a relation is a clustering (partitioning) of the unclean relation IDNameZIPIncome P1Green5151930k P2Green5151832k P3Peter3052840k P4Peter3052840k P5Gree5151955k P6Chuck5151930k

Challenges 1.The space of all possible clusterings (repairs) is exponentially large 2.How to efficiently generate, store and query the possible repairs? VLDB 2009George Beskales8

The Space of Possible Repairs VLDB 2009George Beskales9 All possible repairs 1 Possible repairs given by any fixed parameterized clustering algorithm 2 Possible repairs given by any fixed hierarchical clustering algorithm Link-based algorithms Hierarchical cut clustering algorithm … 3

Hierarchical Clustering Algorithms  Records are clustered in a form of a hierarchy: all singletons are at leaves, and one cluster is at root  Example: Linkage-based Clustering Algorithm VLDB 2009George Beskales10 r1r1 r2r2 r4r4 r5r5 r3r3 Pair-wise Distance Distance Threshold (  ) {r 1,r 2 },{r 3,r 4,r 5 }

Uncertain Hierarchical Clustering  Hierarchical clustering algorithms can be modified to accept uncertain parameters  The number of generated possible repairs is linear VLDB 2009George Beskales11 r1r1 r2r2 r4r4 r5r5 r3r3 Possible Thresholds [  l,  u ] {r 1 },{r 2 },{r 3 },{r 4 },{r 5 } {r 1 },{r 2 },{r 3 },{r 4,r 5 } {r 1,r 2 },{r 3 },{r 4,r 5 } {r 1,r 2 },{r 3,r 4,r 5 }

Probabilities of Repairs VLDB 2009George Beskales12 {P1,P2} {P3,P4} {P5} {P6} {P1,P2,P5} {P3,P4} {P6} {P1} {P2} {P3,P4} {P5} {P6} Clustering 1Clustering 2Clustering 3 1    3 Pr = 0.2 3    10 Pr = 0.7 0    1 Pr = 0.1 IDNameZIPIncome P1Green5151930k P2Green5151832k P3Peter3052840k P4Peter3052840k P5Gree5151955k P6Chuck5151930k  ~U(0,10)

Representing the Possible Repairs VLDB 2009George Beskales13 ID…IncomeCP CP1…31k{P1,P2}[1,3) CP2…40k{P3,P4}[0,10) CP3…55k{P5}[0,3) CP4…30k{P6}[0,10) CP5…39k{P1,P2,P5}[3,10) CP6…30k{P1}[0,1) CP7…32k{P2}[0,1) U-clean Relation Person C {P1,P2} {P3,P4} {P5} {P6} {P1,P2,P5} {P3,P4} {P6} {P1} {P2} {P3,P4} {P5} {P6} Clustering 1Clustering 2Clustering 3 1    3 Pr = 0.2 3    10 Pr = 0.7 0    1 Pr = 0.1

VLDB 2009George Beskales14 Probabilistic Duplicate Elimination Single Clean Instance Single Clean Instance Multiple Possible Clean Instances Uncertain Database Queries Unclean Database Uncertainty and Cleaning-aware RDBMS Probabilistic Results Queries Uncertainty and Cleaning-aware RDBMS Probabilistic Results Outline  Generating and Storing the Possible Clean Instances  Querying the Clean Instances  Experimental Evaluation  Generating and Storing the Possible Clean Instances  Querying the Clean Instances  Experimental Evaluation  Generating and Storing the Possible Clean Instances  Querying the Clean Instances  Experimental Evaluation

Queries over U-Clean Relations  We adopt the possible worlds semantics to define queries on U-clean relations VLDB 2009George Beskales15 U-Clean Relation(s) R C Possible Clean Instances X 1,X 2,…,X n Possible Clean Instances Q(X 1 ), Q(X 2 ),…,Q(X n ) U-Clean Relation Q(R C ) Definition Implementation Instances Representation

Example: Selection Query VLDB 2009George Beskales16 SELECT ID, Income FROM Person c WHERE Income>35k IDIncomeCP CP240k{P3,P4}[0,10) CP355k{P5}[0,3) CP539k{P1,P2,P5}[3,10) ID… Income CP CP1…31k{P1,P2}[1,3) CP2…40k{P3,P4}[0,10) CP3…55k{P5}[0,3) CP4…30k{P6}[0,10) CP5…39k{P1,P2,P5}[3,10) CP6…30k{P1}[0,1) CP7…32k{P2}[0,1) Person C

Example: Projection Query VLDB 2009George Beskales17 IncomeCP 30k {P1} v {P6}[0,1) v [3,10) 31k {P1,P2}[1,3) 32k{P2}[0,1) 40k {P3,P4} v {P1,P2,P5} [0,1) v [3,10) 55k {P5}[0,3) SELECT DISTINCT Income FROM Person c ID… Income CP CP1…31k{P1,P2}[1,3) CP2…40k{P3,P4}[0,1) CP3…55k{P5}[0,3) CP4…30k{P6}[3,10) CP5…40k{P1,P2,P5}[3,10) CP6…30k{P1}[0,1) CP7…32k{P2}[0,1) Person C

Example: Join Query VLDB 2009George Beskales18 SELECT Income, Price FROM Person c, Vehicle c WHERE Income/10 >= Price IncomePriceCP 40k4k {P3,P4} ^ {V4}  1 : [0, 10) ^  2 : [3,5)... ID…IncomeCP CP1…31k{P1,P2}  1 :[1,3) CP2…40k{P3,P4}  1 :[0,10) …………… IDPriceCP CV54k{V4}  2 : [3,5) CV66k{V3,V4}  2 : [5,10) ……. Person c Vehicle c

Aggregation Queries VLDB 2009George Beskales19 CP2 CP3 CP4 CP6 CP7 CP1 CP2 CP3 CP4 CP2 CP4 CP5 Pr = 0.1 Pr = 0.2 Pr = 0.7 X1X1 X2X2 X3X3 SELECT Sum(Income) FROM Person c Sum=187k Sum=156k Sum=109k ID… Income CP CP1…31k{P1,P2}[1,3) CP2…40k{P3,P4}[0,10) CP3…55k{P5}[0,3) CP4…30k{P6}[0,10) CP5…39k{P1,P2,P5}[3,10) CP6…30k{P1}[0,1) CP7…32k{P2}[0,1) Person c

Other Meta-Queries 1.Obtaining the most probable clean instance 2.Obtaining the -certain clusters 3.Obtaining a clean instance corresponding to a specific parameters of the clustering algorithms 4.Obtaining the probability of clustering a set of records together VLDB 2009George Beskales20

Outline VLDB 2009George Beskales21  Generating and Storing the Possible Clean Instances  Querying the Clean Instances  Experimental Evaluation

Experimental Evaluation  A prototype as an extension of PostgreSQL  Synthetic data generator provided by Febrl (freely extensible biomedical record linkage)  Two hierarchical algorithms: Single-Linkage (S.L.) Nearest-neighbor based clustering algorithm (N.N.) [Chaudhuri et al., ICDE’05] VLDB 2009George Beskales22

Experimental Evaluation VLDB 2009George Beskales23

Conclusion  We allow representing and querying multiple possible clean instances  We modified hierarchical clustering algorithms to allow generating multiple repairs  We compactly store the possible repairs by keeping the lineage information of clusters in special attributes  New (probabilistic) query types can be issued against the population of possible repairs VLDB 2009George Beskales26

Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David.

Similar presentations

Presentation on theme: "Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David.

Similar presentations

Presentation on theme: "Modeling and Querying Possible Repairs in Duplicate Detection George Beskales Mohamed A. Soliman Ihab F. Ilyas Shai Ben-David."— Presentation transcript:

Similar presentations

About project

Feedback