Warren Shen, Xin Li, AnHai Doan Database & AI Groups University of Illinois, Urbana Constraint-Based Entity Matching.



2 Entity Matching
Decide if mentions refer to the same real-world entity
Key problem in numerous applications
–Information integration
–Natural language understanding
–Semantic Web
Example mentions:
–Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001
–Chen Li, Doug Chan. “Ensemble Learning”
–C. Li, D. Chan. “Ensemble Learning”. ICML 2003

3 State of the Art
Numerous solutions in the AI, Database, and Web communities
–Cohen, Ravikumar, & Fienberg 2003
–Li, Morie, & Roth 2004
–Bhattacharya & Getoor 2004
–McCallum, Nigam, & Ungar 2000
–Pasula et al.
–Wellner et al.
Most solutions largely exploit only syntactic similarity
–“Jeff Smith” ≈ “J. Smith”
–“(217) ” ≈ “ ”

4 Semantic Constraints
Constraint types illustrated: Incompatible, Subsumption, Layout
Sources shown: DBLP, Chris Li’s Homepage, Chen Li’s Homepage
Citations shown:
–C. Li. “User Interfaces”. SIGCHI 2000
–C. Li, J. Smith. “Numerical Analysis”. SIAM 2001
–Chris Li, Jane Smith. “Numerical Analysis”. SIAM 2001
–“Numerical Analysis”, SIAM 2001 with J. Smith.
–Chen Li, Doug Chan. “Ensemble Learning”. ICML 2003
–C. Li. “Data Mining”. KDD 2000

5 Numerous Semantic Constraint Types
Type           Example
Aggregate      No researcher has chaired more than 3 conferences in a year
Subsumption    If a citation X from DBLP matches a citation Y in a homepage, then each author in Y matches some author in X
Neighborhood   If authors X and Y share similar names and some coauthors, they are likely to match
Incompatible   No researcher exists who has published in both HCI and numerical analysis
Layout         If two mentions in the same document share similar names, they are likely to match
Uniqueness     Mentions in the PC listing of a conference refer to different researchers
Ordering       If two citations match, then their authors will be matched in order
Individual     The researcher named “Mayssam Saria” has fewer than five mentions in DBLP (e.g. being a new graduate student with fewer than five papers)
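To make two of the constraint types above concrete, here is a minimal sketch of how the Aggregate and Uniqueness constraints could be checked against a candidate assignment of mentions to entities. The function names, the tuple layout, and the sample data are all hypothetical; the slides do not show how constraints are encoded.

```python
from collections import Counter


def violates_aggregate(chair_mentions, assignment, limit=3):
    """True if some entity chairs more than `limit` distinct conferences in one year.

    chair_mentions: iterable of (mention_id, conference, year) tuples (assumed layout).
    assignment: dict mapping mention_id -> entity_id (a candidate matching).
    """
    # Deduplicate so repeated mentions of the same chairing role count once.
    distinct = {(assignment[m], year, conf) for m, conf, year in chair_mentions}
    per_entity_year = Counter((ent, year) for ent, year, _ in distinct)
    return any(n > limit for n in per_entity_year.values())


def violates_uniqueness(pc_listing, assignment):
    """True if two mentions in one PC listing map to the same entity."""
    entities = [assignment[m] for m in pc_listing]
    return len(entities) != len(set(entities))


# Hypothetical data: four chair mentions in 2003.
chairs = [("m1", "ICML", 2003), ("m2", "KDD", 2003),
          ("m3", "SIGMOD", 2003), ("m4", "VLDB", 2003)]
all_same = {"m1": "e1", "m2": "e1", "m3": "e1", "m4": "e1"}
split = {"m1": "e1", "m2": "e1", "m3": "e1", "m4": "e2"}

print(violates_aggregate(chairs, all_same))          # True: e1 chairs 4 > 3
print(violates_aggregate(chairs, split))             # False: e1 chairs exactly 3
print(violates_uniqueness(["m1", "m4"], all_same))   # True: both map to e1
```

Both checks are hard (yes/no) tests; in the approach described in the following slides each constraint additionally carries a probability.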

6 Our Contributions
Develop a solution to exploit semantic constraints
–Models constraints in a uniform probabilistic manner
–Clusters mentions using a generative model
–Uses relaxation labeling to handle constraints
–Adds a pairwise layer to further improve accuracy
Experimental results on two real-world domains
–Researchers, IMDB
–Improved accuracy over state of the art by 3-12% F-1

7 Probabilistic Modeling of Constraints
Modeled as the effect on the probability that a mention refers to a real-world entity
Example: “If two mentions in the same document share similar names, they are likely to match”:
–$m_1$: Chen Li → $e_1$
–$m_2$: C. Li
–$P(m_2 = e_1 \mid m_1 = e_1) = 0.8$
Constraint probabilities have a natural interpretation
Can be learned or manually specified by a domain expert

8 The Entity Matching Problem
Documents: $d_1$, $d_2$ with mentions $m_1$: Chen Li, $m_2$: C. Li, $m_3$: Chris Lee
Constraints: $c_1$ = layout constraint, $p(c_1) = 0.8$
Matching Pairs: $m_1 = m_2$
Solution
1. Model document generation
2. Cluster mentions using this model

9 Modeling Document Generation
Generate mentions for each document
–Select entities
–Generate and “sprinkle” mentions
Check constraints for each mention
–Decide whether to enforce constraint $c$
–If enforced, check if mention violates $c$
–If yes, discard documents and repeat process
(Extension of model in Li, Morie & Roth 2004)
Example:
–Entities $E$: $e_1$ Chen Li, $e_2$ Chris Lee
–Documents $d_1$, $d_2$ with mentions $m_1$: Chen Li, $m_2$: C. Li, $m_3$: Chris Lee
–$c_1$: layout constraint, $p(c_1) = 0.8$
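The generate-then-check loop described on this slide can be read as rejection sampling. The sketch below illustrates that reading for a single document and a single layout constraint; it is not the authors' model. The entity table, the toy name-similarity test, and the uniform sampling choices are assumptions made for illustration, and only the value 0.8 for the constraint probability comes from the slide.

```python
import random

# Hypothetical entities and the name variants each may be rendered as.
ENTITIES = {
    "e1": ["Chen Li", "C. Li"],
    "e2": ["Chris Lee", "C. Lee"],
    "e3": ["Chris Li", "C. Li"],
}


def similar(a, b):
    """Toy name similarity: same first initial and same last name."""
    return a[0] == b[0] and a.split()[-1] == b.split()[-1]


def violates_layout(doc, i):
    """Mention i violates the layout constraint if a similar-named mention
    in the same document refers to a different entity."""
    ent_i, name_i = doc[i]
    return any(
        j != i and similar(name_i, name_j) and ent_j != ent_i
        for j, (ent_j, name_j) in enumerate(doc)
    )


def generate_document(rng, p_layout=0.8, n_entities=2, n_mentions=3):
    """Return a list of (entity, mention_name) pairs for one document."""
    while True:  # repeat until a draw survives the constraint check
        # Select entities, then generate mentions of them.
        chosen = rng.sample(sorted(ENTITIES), n_entities)
        doc = []
        for _ in range(n_mentions):
            ent = rng.choice(chosen)
            doc.append((ent, rng.choice(ENTITIES[ent])))
        # For each mention, decide whether to enforce the constraint;
        # if enforced and violated, discard the document and start over.
        rejected = any(
            rng.random() < p_layout and violates_layout(doc, i)
            for i in range(len(doc))
        )
        if not rejected:
            return doc


print(generate_document(random.Random(0)))
```

With `p_layout=1.0` every surviving document satisfies the constraint; with lower values some violating documents get through, which is what makes the constraint soft.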

10 Clustering with the Generative Model
Find mention assignments $F$ and model parameters $\theta$ to maximize $P(D, F \mid \theta)$
Difficult to compute exactly, so use a variant of EM...

11 Incorporating Constraints
Extend the step that assigns mentions
–Basic mention assignment:
–Extension: Use constraints to improve mention assignments

12 Enforcing Constraints on Clusters
Apply constraints at each iteration
Use relaxation labeling to apply constraints to mention assignments
Iteration loop: Assign mentions → Apply constraints → Compute parameters

13 Relaxation Labeling
Start with an initial labeling of mentions with entities
Iteratively improve mention labels, given constraints
Can be extended to probabilistic constraints
Scalable
Example labeling:
–Chris Lee = $e_2$
–Jane Smith = $e_4$
–Chen Li = $e_1$
–C. Li = $e_2$
–Y. Lee = $e_3$
–C. Lee = $e_2$
–Smith, J = $e_4$
Constraints: $c_1$ = layout constraint, $p(c_1) = 0.8$

14 Relaxation Labeling
Start with an initial labeling of mentions with entities
Iteratively improve mention labels, given constraints
Can be extended to probabilistic constraints
Scalable
Example labeling after applying the constraint:
–Chris Lee = $e_2$
–Jane Smith = $e_4$
–Chen Li = $e_1$
–C. Li = $e_2$ → $e_1$
–Y. Lee = $e_3$
–C. Lee = $e_2$
–Smith, J = $e_4$
Constraints: $c_1$ = layout constraint, $p(c_1) = 0.8$
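The relabeling of C. Li from e2 to e1 can be reproduced with a small sketch. This is a simplified, hard-label variant of relaxation labeling (the classical method keeps a probability distribution per mention), not the authors' implementation. The document grouping, the prior scores, and the similarity test are hypothetical; only the layout-constraint probability 0.8 comes from the slide.

```python
P_LAYOUT = 0.8  # p(c1) from the slide


def similar(a, b):
    """Toy name similarity: same first initial and same last name."""
    return a[0] == b[0] and a.split()[-1] == b.split()[-1]


def relax(mentions, priors, labels, max_iters=10):
    """Iteratively relabel mentions until no label changes.

    mentions: mention_id -> (name, document_id)
    priors:   mention_id -> {entity_id: score from the base model}
    labels:   mention_id -> entity_id (initial labeling)
    """
    labels = dict(labels)
    for _ in range(max_iters):
        changed = False
        for m, (name, doc) in mentions.items():
            # Layout constraint: similar-named mentions in the same document.
            neighbors = [
                n for n, (n_name, n_doc) in mentions.items()
                if n != m and n_doc == doc and similar(name, n_name)
            ]

            def score(label):
                s = priors[m][label]
                for n in neighbors:
                    # Agreeing with a neighbor is rewarded, disagreeing penalized.
                    s *= P_LAYOUT if labels[n] == label else 1 - P_LAYOUT
                return s

            best = max(priors[m], key=score)
            if best != labels[m]:
                labels[m] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical setup: Chen Li and C. Li appear in the same document d1.
mentions = {"m1": ("Chen Li", "d1"), "m2": ("C. Li", "d1"), "m3": ("Chris Lee", "d2")}
priors = {
    "m1": {"e1": 0.90, "e2": 0.10},
    "m2": {"e1": 0.45, "e2": 0.55},
    "m3": {"e1": 0.10, "e2": 0.90},
}
initial = {"m1": "e1", "m2": "e2", "m3": "e2"}

print(relax(mentions, priors, initial))
# {'m1': 'e1', 'm2': 'e1', 'm3': 'e2'}
```

For m2 the constraint turns a slight prior preference for e2 (0.55 vs 0.45) into a clear preference for e1 (0.45 × 0.8 = 0.36 vs 0.55 × 0.2 = 0.11), while m3, which has no similar-named neighbor, keeps its label.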

15 Handling Probabilistic Constraints
Relaxation labeling can combine multiple probabilistic constraints

16 Pairwise Layer
So far, we have applied constraints to clusters
It may be unclear how to enforce constraints on clusters
–Example cluster: Chen Li; Li, C.; Li, Chen; C. Li
–Constraint: C. Li ≠ Li, C.
–Remove C. Li or Li, C.?
Add a pairwise layer
–Convert clusters into predicted matching pairs
–Remove only the pairs to which negative pairwise hard constraints apply
Iteration loop: Assign mentions → Apply constraints → Compute parameters
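The two pairwise-layer steps can be sketched as follows, using the cluster and the constraint from this slide. The function names and the representation of a pair as a `frozenset` are illustrative choices, not taken from the slides.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Convert clusters into the set of predicted matching pairs."""
    return {frozenset(pair) for cluster in clusters for pair in combinations(cluster, 2)}


def apply_negative_constraints(pairs, cannot_match):
    """Drop only the pairs that a negative hard constraint forbids."""
    banned = {frozenset(pair) for pair in cannot_match}
    return pairs - banned


cluster = ["Chen Li", "Li, C.", "Li, Chen", "C. Li"]
pairs = cluster_pairs([cluster])
kept = apply_negative_constraints(pairs, [("C. Li", "Li, C.")])

print(len(pairs), len(kept))  # 6 5
```

Neither mention has to leave the cluster: both C. Li and Li, C. still match Chen Li and Li, Chen, and only the single forbidden pair is removed.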

17 Empirical Evaluation
Two real-world domains
–Researchers, IMDB
For each domain
–Collected documents
  Researchers: homepages from DBLP and the web
  IMDB: text and structured records from IMDB
–Marked up mentions and their attributes
  4,991 researcher mentions
  3,889 movie titles from IMDB
–Manually identified all correct matching pairs
Evaluation Metric:
–Precision = # true positives / # predicted pairs
–Recall = # true positives / # correct pairs
–F1 = (2 * P * R) / (P + R)
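The metric definitions above translate directly into code over sets of matching pairs. The sketch below is a straightforward reading of those formulas; the sample pairs are invented for illustration, and pairs are treated as unordered.

```python
def pair_set(pairs):
    """Normalize pairs so that (a, b) and (b, a) count as the same pair."""
    return {frozenset(p) for p in pairs}


def precision_recall_f1(predicted, correct):
    predicted, correct = pair_set(predicted), pair_set(correct)
    true_positives = len(predicted & correct)
    p = true_positives / len(predicted) if predicted else 0.0
    r = true_positives / len(correct) if correct else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Hypothetical pairs: 3 of 4 predictions are right, 3 of 5 true pairs are found.
predicted = [("m1", "m2"), ("m2", "m3"), ("m4", "m5"), ("m6", "m7")]
correct = [("m2", "m1"), ("m3", "m2"), ("m4", "m5"), ("m8", "m9"), ("m1", "m3")]

p, r, f1 = precision_recall_f1(predicted, correct)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.75 R=0.60 F1=0.67
```

The same F1 formula applied to P = .67 and R = .65 gives about .66.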

18 Using Constraints Improves Accuracy
Relaxation labeler improves F-1 by 3-12%
Relaxation labeling very fast
F1 (P / R)                    Researchers      Movies
Baseline                      .66 (.67/.65)    .69 (.61/.79)
Baseline + Relax              .78 (.78/.78)    .72 (.63/.83)
Baseline + Relax + Pairwise   .79 (.80/.79)    .73 (.64/.83)

19 Using Constraints Individually
Each constraint makes a contribution
Researchers      F1 (P / R)
Baseline         .66 (.67/.65)
+ Rare Value     .66 (.67/.66)
+ Subsumption    .67 (.68/.65)
+ Neighborhood   .70 (.68/.72)
+ Individual     .70 (.77/.64)
+ Layout         .71 (.68/.74)
Movies           F1 (P / R)
Baseline         .69 (.61/.79)
+ Incompatible   .70 (.62/.79)
+ Neighborhood   .70 (.62/.81)
+ Individual     .71 (.62/.82)

20 Related Work
Much work in entity matching
–Cohen, Ravikumar, & Fienberg 2003
–Li, Morie, & Roth 2004
–Bhattacharya & Getoor 2004
–McCallum, Nigam, & Ungar 2000
–Pasula et al.
–Wellner et al.
Recent work has looked at exploiting semantic constraints
–Personal Information Management (Dong et al. 2004)
–Profiler based entity matching (Doan et al. 2003)
Semantic constraints successfully exploited in other applications
–Clustering algorithms (Bilenko et al. 2004), ontology matching (Doan et al. 2002)

21 Summary and Future Work
Exploit semantic constraints in entity matching
–Models constraints in a uniform probabilistic manner
–Uses a generative model and relaxation labeling to handle constraints in a scalable way
–Experimental results on two real-world domains show effectiveness
Future work: Learning constraints effectively from current or external data