Constraint-Based Entity Matching
Warren Shen, Xin Li, AnHai Doan
Database & AI Groups, University of Illinois, Urbana
2 Entity Matching
Decide if mentions refer to the same real-world entity
Key problem in numerous applications
  – Information integration
  – Natural language understanding
  – Semantic Web
Example citations:
  Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001
  Chen Li, Doug Chan. “Ensemble Learning”
  C. Li, D. Chan. “Ensemble Learning”. ICML 2003
3 State of the Art
Numerous solutions in the AI, Database, and Web communities
  – Cohen, Ravikumar, & Fienberg 2003
  – Li, Morie, & Roth 2004
  – Bhattacharya & Getoor 2004
  – McCallum, Nigam, & Ungar 2000
  – Pasula et al.
  – Wellner et al.
Most solutions largely exploit only syntactic similarity
  – “Jeff Smith” ≈ “J. Smith”
  – “(217) ” ≈ “ ”
4 Semantic Constraints
[Figure: citations from DBLP, Chris Li’s homepage, and Chen Li’s homepage, annotated with incompatible, subsumption, and layout constraints]
Citations shown:
  C. Li. “User Interfaces”. SIGCHI 2000
  C. Li, J. Smith. “Numerical Analysis”. SIAM 2001
  Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001
  “Numerical Analysis”, SIAM 2001 with J. Smith.
  Chen Li, Doug Chan. “Ensemble Learning”. ICML 2003
  C. Li. “Data Mining”. KDD 2000
5 Numerous Semantic Constraint Types
Type           Example
Aggregate      No researcher has chaired more than 3 conferences in a year
Subsumption    If a citation X from DBLP matches a citation Y in a homepage, then each author in Y matches some author in X
Neighborhood   If authors X and Y share similar names and some coauthors, they are likely to match
Incompatible   No researcher exists who has published in both HCI and numerical analysis
Layout         If two mentions in the same document share similar names, they are likely to match
Uniqueness     Mentions in the PC listing of a conference refer to different researchers
Ordering       If two citations match, then their authors will be matched in order
Individual     The researcher named “Mayssam Saria” has fewer than five mentions in DBLP (e.g. being a new graduate student with fewer than five papers)
6 Our Contributions
Develop a solution to exploit semantic constraints
  – Models constraints in a uniform probabilistic manner
  – Clusters mentions using a generative model
  – Uses relaxation labeling to handle constraints
  – Adds a pairwise layer to further improve accuracy
Experimental results on two real-world domains
  – Researchers, IMDB
  – Improved accuracy over state of the art by 3-12% F-1
7 Probabilistic Modeling of Constraints
Modeled as the effect on the probability that a mention refers to a real-world entity
Example: “If two mentions in the same document share similar names, they are likely to match”
  m1: Chen Li, m2: C. Li
  P(m2 = e1 | m1 = e1) = 0.8
Constraint probabilities have a natural interpretation
Can be learned or manually specified by a domain expert
8 The Entity Matching Problem
Documents: d1, d2, containing mentions m1: Chen Li, m2: C. Li, m3: Chris Lee
Constraints: c1 = layout constraint, p(c1) = 0.8
Matching pairs: m1 = m2
Solution
  1. Model document generation
  2. Cluster mentions using this model
9 Modeling Document Generation
Generate mentions for each document
  – Select entities
  – Generate and “sprinkle” mentions
Check constraints for each mention
  – Decide whether to enforce constraint c
  – If enforced, check if mention violates c
  – If yes, discard documents and repeat process
(Extension of model in Li, Morie & Roth 2004)
[Figure: entity set E with e1: Chen Li and e2: Chris Lee; documents d1 and d2 with mentions m1: Chen Li, m2: C. Li, m3: Chris Lee; c1: layout constraint, p(c1) = 0.8]
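The generate-then-check loop on this slide can be read as rejection sampling. The sketch below is a simplified illustration under stated assumptions, not the authors' model: the helper names (`surname`, `violates_layout`, `generate_documents`), the last-token notion of "similar names", the one-mention-per-entity rule, and the second entity "Chris Li" are all hypothetical, and the constraint is checked per document rather than per mention.

```python
import random

def surname(mention):
    # Crude name-similarity proxy (an assumption): compare the last token.
    return mention.split()[-1]

def violates_layout(doc):
    """True if two same-surname mentions in one document name different entities."""
    return any(
        surname(m1) == surname(m2) and e1 != e2
        for i, (m1, e1) in enumerate(doc)
        for (m2, e2) in doc[i + 1:]
    )

def generate_documents(entities, n_docs, p_enforce, rng, max_tries=10_000):
    """Sample documents; restart whenever an enforced constraint is violated."""
    ids = sorted(entities)
    for _ in range(max_tries):
        docs = []
        for _ in range(n_docs):
            # Select entities for this document, then emit one mention per entity.
            chosen = rng.sample(ids, k=rng.randint(1, len(ids)))
            docs.append([(rng.choice(entities[e]), e) for e in chosen])
        # Enforce the constraint with probability p_enforce; reject on violation.
        rejected = any(
            rng.random() < p_enforce and violates_layout(doc) for doc in docs
        )
        if not rejected:
            return docs
    raise RuntimeError("no constraint-satisfying sample found")

# Hypothetical entities with overlapping name variants.
entities = {"e1": ["Chen Li", "C. Li"], "e2": ["Chris Li", "C. Li"]}
docs = generate_documents(entities, n_docs=2, p_enforce=1.0, rng=random.Random(0))
```

With `p_enforce=1.0` the constraint is always enforced, so no returned document mixes the two same-surname entities; with the slide's value of 0.8, some violating documents would survive.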
10 Clustering with the Generative Model
Find mention assignments F and model parameters θ to maximize P(D, F | θ)
Difficult to compute exactly, so use a variant of EM...
11 Incorporating Constraints
Extend the step that assigns mentions
  – Basic mention assignment:
  – Extension: Use constraints to improve mention assignments
12 Enforcing Constraints on Clusters
Apply constraints at each iteration
Use relaxation labeling to apply constraints to mention assignments
Iteration loop: Assign mentions → Apply constraints → Compute parameters
13 Relaxation Labeling
Start with an initial labeling of mentions with entities
Iteratively improve mention labels, given constraints
Can be extended to probabilistic constraints
Scalable
Initial labeling:
  Chris Lee = e2
  Jane Smith = e4
  Chen Li = e1
  C. Li = e2
  Y. Lee = e3
  C. Lee = e2
  Smith, J = e4
Constraints: c1 = layout constraint, p(c1) = 0.8
14 Relaxation Labeling
Start with an initial labeling of mentions with entities
Iteratively improve mention labels, given constraints
Can be extended to probabilistic constraints
Scalable
Labeling after applying constraints:
  Chris Lee = e2
  Jane Smith = e4
  Chen Li = e1
  C. Li = e2 → e1
  Y. Lee = e3
  C. Lee = e2
  Smith, J = e4
Constraints: c1 = layout constraint, p(c1) = 0.8
15 Handling Probabilistic Constraints
Relaxation labeling can combine multiple probabilistic constraints
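The slides do not preserve the combination formula, so the following is only a simplified hard-label sketch in the spirit of relaxation labeling. It assumes each constraint is a triple (mention a, mention b, probability that they share an entity) and that constraints combine by multiplying independent factors with a prior. The function name `relax_labels`, the independence assumption, and the prior values 0.4 and 0.6 are all hypothetical.

```python
def relax_labels(labels, candidates, prior, constraints, max_iters=10):
    """Iteratively relabel each mention with its best-scoring candidate entity.

    labels:      mention -> entity (initial labeling)
    candidates:  mention -> list of candidate entities
    prior:       (mention, entity) -> base probability from the clustering step
    constraints: list of (mention_a, mention_b, p), where p is the probability
                 that the two mentions refer to the same entity
    """
    labels = dict(labels)  # work on a copy
    for _ in range(max_iters):
        changed = False
        for m in labels:
            def score(e):
                s = prior[(m, e)]
                for a, b, p in constraints:
                    if m == a or m == b:
                        other = b if m == a else a
                        # A satisfied constraint contributes p, a violated one 1 - p.
                        s *= p if labels[other] == e else 1.0 - p
                return s
            best = max(candidates[m], key=score)
            if best != labels[m]:
                labels[m], changed = best, True
        if not changed:
            break  # labels are stable
    return labels

# Example from the slides: "C. Li" starts as e2, and the layout constraint
# with "Chen Li" (p = 0.8) pulls it to e1. Prior values are made up.
initial = {"Chen Li": "e1", "C. Li": "e2"}
candidates = {"Chen Li": ["e1"], "C. Li": ["e1", "e2"]}
prior = {("Chen Li", "e1"): 1.0, ("C. Li", "e1"): 0.4, ("C. Li", "e2"): 0.6}
constraints = [("Chen Li", "C. Li", 0.8)]
final = relax_labels(initial, candidates, prior, constraints)
```

Here "C. Li" scores 0.4 × 0.8 = 0.32 for e1 against 0.6 × 0.2 = 0.12 for e2, so it is relabeled to e1 and the loop stops on the next pass. Further constraints would simply add more factors to the same product.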
16 Pairwise Layer
So far, we have applied constraints to clusters
It may be unclear how to enforce constraints on clusters
  – Example cluster: Chen Li; Li, C.; Li, Chen; C. Li
  – Constraint: C. Li ≠ Li, C.
  – Remove C. Li or Li, C.?
Add a pairwise layer
  – Convert clusters into predicted matching pairs
  – Remove only the pairs to which negative pairwise hard constraints apply
Iteration loop: Assign mentions → Apply constraints → Compute parameters
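The two pairwise-layer steps can be sketched as follows. This is a minimal illustration that assumes clusters are lists of mention strings and negative hard constraints are given as explicit mention pairs; the function names are hypothetical.

```python
from itertools import combinations

def clusters_to_pairs(clusters):
    """Expand each cluster into the set of unordered mention pairs it implies."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add(frozenset((a, b)))
    return pairs

def apply_negative_constraints(pairs, must_not_match):
    """Drop only the pairs that a negative hard constraint forbids."""
    banned = {frozenset(p) for p in must_not_match}
    return pairs - banned

# Cluster and constraint from the slide.
cluster = ["Chen Li", "Li, C.", "Li, Chen", "C. Li"]
predicted = clusters_to_pairs([cluster])
kept = apply_negative_constraints(predicted, [("C. Li", "Li, C.")])
```

The four-mention cluster yields six pairs. Only the forbidden pair is removed, so both "C. Li" and "Li, C." keep their remaining matches and neither mention has to be evicted from the cluster.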
17 Empirical Evaluation
Two real-world domains
  – Researchers, IMDB
For each domain
  – Collected documents
      – Researchers: homepages from DBLP and the web
      – IMDB: text and structured records from IMDB
  – Marked up mentions and their attributes
      – 4,991 researcher mentions
      – 3,889 movie titles from IMDB
  – Manually identified all correct matching pairs
Evaluation metric
  Precision = # true positives / # predicted pairs
  Recall = # true positives / # correct pairs
  F1 = (2 * P * R) / (P + R)
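The evaluation metric can be computed directly from two sets of matching pairs. The sketch below assumes pairs are unordered, so (a, b) and (b, a) count once, and returns 0.0 when a denominator is zero. The function name and the zero-division convention are assumptions, not something the slides specify.

```python
def pairwise_scores(predicted, correct):
    """Precision, recall and F1 over unordered mention pairs."""
    # Normalise each pair so that (a, b) and (b, a) are the same pair.
    pred = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in correct}
    true_positives = len(pred & gold)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Two predicted pairs, one of which is correct.
p, r, f1 = pairwise_scores(
    predicted=[("m1", "m2"), ("m1", "m3")],
    correct=[("m2", "m1")],
)
```

In this toy case one of two predicted pairs is correct and the single correct pair is found, giving P = 0.5, R = 1.0, and F1 = 2/3.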
18 Using Constraints Improves Accuracy
Relaxation labeler improves F-1 by 3-12%
Relaxation labeling is very fast
F1 (P / R)                    Researchers     Movies
Baseline                      .66 (.67/.65)   .69 (.61/.79)
Baseline + Relax              .78 (.78/.78)   .72 (.63/.83)
Baseline + Relax + Pairwise   .79 (.80/.79)   .73 (.64/.83)
19 Using Constraints Individually
Each constraint makes a contribution
Researchers      F1 (P / R)
Baseline         .66 (.67/.65)
+ Rare Value     .66 (.67/.66)
+ Subsumption    .67 (.68/.65)
+ Neighborhood   .70 (.68/.72)
+ Individual     .70 (.77/.64)
+ Layout         .71 (.68/.74)
Movies           F1 (P / R)
Baseline         .69 (.61/.79)
+ Incompatible   .70 (.62/.79)
+ Neighborhood   .70 (.62/.81)
+ Individual     .71 (.62/.82)
20 Related Work
Much work in entity matching
  – Cohen, Ravikumar, & Fienberg 2003
  – Li, Morie, & Roth 2004
  – Bhattacharya & Getoor 2004
  – McCallum, Nigam, & Ungar 2000
  – Pasula et al.
  – Wellner et al.
Recent work has looked at exploiting semantic constraints
  – Personal Information Management (Dong et al. 2004)
  – Profiler-based entity matching (Doan et al. 2003)
Semantic constraints successfully exploited in other applications
  – Clustering algorithms (Bilenko et al. 2004), ontology matching (Doan et al. 2002)
21 Summary and Future Work
Exploit semantic constraints in entity matching
  – Models constraints in a uniform probabilistic manner
  – Uses a generative model and relaxation labeling to handle constraints in a scalable way
  – Experimental results on two real-world domains show effectiveness
Future work: Learning constraints effectively from current or external data