NetSci07 May 24, 2007 Entity Resolution in Network Data Lise Getoor University of Maryland, College Park.

Presentation transcript:

NetSci07 May 24, 2007 Entity Resolution in Network Data Lise Getoor University of Maryland, College Park

Entity Resolution The Problem The Algorithms Graph-based Clustering (GBC) Probabilistic Model (LDA-ER) The Tool The Big Picture

The Entity Resolution Problem. Entities: John Smith, Jonathan Smith, James Smith. References: “Jonthan Smith”, “Jon Smith”, “Jim Smith”, “John Smith”, “James Smith”, “J Smith”. Issues: 1. Identification 2. Disambiguation

before after InfoVis Co-Author Network Fragment

Entity Resolution in Networks. References are not observed independently. Links between references indicate relations between the entities: co-author relations for bibliographic data; To, cc: lists for email. Use relations to improve identification and disambiguation.

Relational Identification Very similar names. Added evidence from shared co-authors

Relational Disambiguation Very similar names but no shared collaborators

Relational Constraints Co-authors are typically distinct

Collective Entity Resolution One resolution provides evidence for another => joint resolution

Entity Resolution The Problem The Algorithms Relational Clustering (RC-ER) Bhattacharya and Getoor, DMKD’04, Wiley’06, TKDD’07 Probabilistic Model (LDA-ER) Experimental Evaluation The Tool The Big Picture

Objective Function. Minimize, over pairs of clusters c_i and c_j: [weight for attributes] × [similarity of attributes] + [weight for relations] × [1 iff a relational edge exists between c_i and c_j]. Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function; the reduction for a pair combines similarity of attributes with the common cluster neighborhood.
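The formula on this slide did not survive in the transcript; only its annotation labels did. Read together, the labels suggest an objective of roughly the following shape. The symbols w_A, w_R, sim_A, sim_R, N(·) and Δ are assumed notation chosen for this sketch, not taken from the slide.

```latex
% Assumed notation: w_A, w_R are the attribute and relation weights;
% sim_A is attribute similarity; sim_R(c_i, c_j) = 1 iff a relational
% edge exists between clusters c_i and c_j; N(c) is the cluster
% neighborhood of c.
\min_{\{c_1,\dots,c_k\}} \; \sum_{i} \sum_{j \neq i}
  \Big[ w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, \mathrm{sim}_R(c_i, c_j) \Big]

% Greedy step: merge the pair with the largest reduction, sketched as
\Delta(c_i, c_j) = w_A \, \mathrm{sim}_A(c_i, c_j)
  + w_R \, \big| N(c_i) \cap N(c_j) \big|
```

Under this reading, merging two clusters removes their cross-cluster similarity from the sum, so the pair that is most alike in attributes and shares the most neighboring clusters gives the largest reduction.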

Relational Clustering Algorithm
1. Find similar references using ‘blocking’
2. Bootstrap clusters using attributes and relations
3. Compute similarities for cluster pairs and insert into priority queue
4. Repeat until priority queue is empty
   5. Find ‘closest’ cluster pair
   6. Stop if similarity below threshold
   7. Merge to create new cluster
   8. Update similarity for ‘related’ clusters
O(n k log n) algorithm with efficient implementation
CODE AND DATA AND DATA GENERATOR AVAILABLE HERE:
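Steps 3 to 8 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the RC-ER implementation: blocking and bootstrapping (steps 1 and 2) are skipped, `name_sim` is a hypothetical toy attribute measure, cluster attribute similarity takes the best-matching pair of names, and the relational term is the Jaccard overlap of neighboring clusters, standing in for the "common cluster neighborhood".

```python
import heapq
from itertools import combinations


def name_sim(a, b):
    """Toy attribute similarity (assumption): 1.0 for identical names,
    0.5 when last name and first initial agree, otherwise 0.0."""
    if a == b:
        return 1.0
    first_a, last_a = a.split()[0], a.split()[-1]
    first_b, last_b = b.split()[0], b.split()[-1]
    return 0.5 if last_a == last_b and first_a[0] == first_b[0] else 0.0


def resolve(names, cooccurs, w_attr=0.5, w_rel=0.5, threshold=0.5):
    """names: {ref_id: name}; cooccurs: sets of ref_ids seen together."""
    members = {r: {r} for r in names}      # cluster id -> references
    cluster_of = {r: r for r in names}     # reference -> cluster id
    ref_nbrs = {r: set() for r in names}   # reference -> co-occurring refs
    for group in cooccurs:
        for r in group:
            ref_nbrs[r] |= set(group) - {r}

    def nbr_clusters(c):
        return {cluster_of[n] for r in members[c] for n in ref_nbrs[r]}

    def sim(c1, c2):
        attr = max(name_sim(names[a], names[b])
                   for a in members[c1] for b in members[c2])
        n1, n2 = nbr_clusters(c1), nbr_clusters(c2)
        rel = len(n1 & n2) / len(n1 | n2) if n1 | n2 else 0.0
        return w_attr * attr + w_rel * rel

    version = {c: 0 for c in members}      # bumped when a score goes stale
    heap = []

    def push(c1, c2):
        heapq.heappush(heap, (-sim(c1, c2), c1, c2, version[c1], version[c2]))

    for c1, c2 in combinations(list(members), 2):        # step 3
        push(c1, c2)

    while heap:                                          # step 4
        neg, c1, c2, v1, v2 = heapq.heappop(heap)        # step 5
        if version.get(c1) != v1 or version.get(c2) != v2:
            continue                                     # outdated entry
        if -neg < threshold:                             # step 6
            break
        members[c1] |= members.pop(c2)                   # step 7
        for r in members[c1]:
            cluster_of[r] = c1
        del version[c2]
        touched = {c1} | nbr_clusters(c1)                # step 8
        for c in touched:
            version[c] += 1
        for c in touched:
            for other in members:
                if other != c:
                    push(c, other)
    return sorted(sorted(m) for m in members.values())


names = {1: "J. Smith", 2: "John Smith", 3: "Alice Jones",
         4: "Alice Jones", 5: "James Smith", 6: "B. Miller"}
cooccurs = [{1, 3}, {2, 4}, {5, 6}]
clusters = resolve(names, cooccurs)
```

In this toy data the two "Alice Jones" references merge first on attributes alone; that merge gives "J. Smith" and "John Smith" a shared neighboring cluster, which lifts their score over the threshold. "James Smith" has an equally similar name but no shared collaborator, so it stays separate. Setting `w_attr=1.0, w_rel=0.0` merges all three Smith references.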

Entity Resolution The Problem Relational Entity Resolution Algorithms Relational Clustering (RC-ER) Probabilistic Model (LDA-ER) SIAM SDM’06, Best Paper Award Experimental Evaluation Query-time Entity Resolution

Probabilistic Generative Model for Collective Entity Resolution. Model how references co-occur in data: 1. Generation of references from entities 2. Relationships between underlying entities. Groups of entities instead of pair-wise relations.

LDA-ER Model [Plate diagram; labels: P, R, r, θ, z, a, T, Φ, A, V, α, β]. Entity label a and group label z for each reference r. θ: ‘mixture’ of groups for each co-occurrence. Φ_z: multinomial for choosing entity a for each group z. V_a: multinomial for choosing reference r from entity a. Dirichlet priors with α and β.
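The generative story on this slide can be sketched as a small sampler. This is an illustration under assumptions, not the authors' model code: α is taken as the prior on θ and β as the prior on Φ_z (the usual LDA convention; the slide does not say which is which), V_a is simplified to a uniform choice among hand-listed name variants, and `dirichlet`, `generate`, and `variants` are hypothetical names.

```python
import random


def dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate(variants, num_groups, num_cooccurrences, refs_per_cooccurrence,
             alpha=0.5, beta=0.5, seed=0):
    """variants[a] lists the name strings that entity a may appear as."""
    rng = random.Random(seed)
    num_entities = len(variants)
    # Phi_z: one multinomial over entities per group z (assumed prior: beta).
    phi = [dirichlet([beta] * num_entities, rng) for _ in range(num_groups)]
    data = []
    for _ in range(num_cooccurrences):
        # theta: mixture of groups for this co-occurrence (assumed prior: alpha).
        theta = dirichlet([alpha] * num_groups, rng)
        refs = []
        for _ in range(refs_per_cooccurrence):
            z = rng.choices(range(num_groups), weights=theta)[0]      # group label
            a = rng.choices(range(num_entities), weights=phi[z])[0]   # entity label
            r = rng.choice(variants[a])   # V_a simplified to uniform (assumption)
            refs.append((z, a, r))
        data.append(refs)
    return data


variants = [
    ["John Smith", "J. Smith", "Jon Smith"],
    ["James Smith", "J. Smith", "Jim Smith"],
    ["Alice Jones", "A. Jones"],
]
sample = generate(variants, num_groups=2, num_cooccurrences=4,
                  refs_per_cooccurrence=3)
```

Each generated reference carries its hidden labels (z, a) alongside the observed string r. Inference runs the other way: only r and the co-occurrence structure are observed, and the labels must be recovered. Note that "J. Smith" can be emitted by two different entities, which is the ambiguity the model has to resolve.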

Approx. Inference Using Gibbs Sampling. Compute the conditional distribution over labels for each reference. Sample the next labels from the conditional distribution. Repeat over all references until convergence. Converges to most likely number of entities.

Faster Inference: Split-Merge Sampling. Naïve strategy reassigns references individually. Alternative: allow entities to merge or split. For entity a_i, find the conditional distribution for: 1. Merging with existing entity a_j 2. Splitting back to last merged entities 3. Remaining unchanged. Sample next state for a_i from the distribution. O(n g + e) time per iteration compared to O(n g + n e).

Entity Resolution The Problem Relational Entity Resolution Algorithms Relational Clustering (RC-ER) Probabilistic Model (LDA-ER) Experimental Evaluation Query-time Entity Resolution ER User Interface

Evaluation Datasets
CiteSeer: 1,504 citations to machine learning papers (Lawrence et al.); 2,892 references to 1,165 author entities.
arXiv: 29,555 publications from High Energy Physics (KDD Cup ’03); 58,515 refs to 9,200 authors.
Elsevier BioBase: 156,156 Biology papers (IBM KDD Challenge ’05); 831,991 author refs; keywords, topic classifications, language, country and affiliation of corresponding author, etc.

Baselines
A: Pair-wise duplicate decisions with attributes only. Names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler. Other textual attributes: TF-IDF.
A*: Transitive closure over A.
A+N: Add attribute similarity of co-occurring refs.
A+N*: Transitive closure over A+N.
Evaluate pair-wise decisions over references: F1-measure (harmonic mean of precision and recall).
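The difference between a pair-wise baseline such as A and its transitive closure A*, and the pair-wise F1 used to score them, can be shown on a toy example. The sketch below is illustrative only; the function names and the four-reference data are hypothetical, and the closure is computed with a plain union-find.

```python
from itertools import combinations


def transitive_closure(refs, matched_pairs):
    """Return every pair implied by the matched pairs (union-find)."""
    parent = {r: r for r in refs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)
    return {(a, b) for a, b in combinations(sorted(refs), 2)
            if find(a) == find(b)}


def pairwise_f1(predicted, truth):
    """F1 over unordered reference pairs: harmonic mean of precision and recall."""
    predicted = {tuple(sorted(p)) for p in predicted}
    truth = {tuple(sorted(p)) for p in truth}
    true_positives = len(predicted & truth)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


refs = [1, 2, 3, 4]
truth = {(1, 2), (1, 3), (2, 3)}        # refs 1, 2, 3 are one entity; 4 is another
decisions = {(1, 2), (2, 3)}            # pair-wise matcher missed (1, 3)
closed = transitive_closure(refs, decisions)

f1_pairwise = pairwise_f1(decisions, truth)
f1_closed = pairwise_f1(closed, truth)
```

Here the closure recovers the missed pair (1, 3), so recall rises from 2/3 to 1 and F1 from 0.8 to 1.0. Closure is not always a gain: a single false match would also be propagated to every reference in both groups.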

ER Evaluation. RC-ER & LDA-ER outperform baselines in all datasets. Collective resolution better than naïve relational resolution. CiteSeer: near perfect resolution; 22% error reduction. arXiv: 6,500 additional correct resolutions; 20% error reduction. BioBase: biggest improvement over baselines. [Results table: rows A, A*, A+N, A+N*, RC-ER, LDA-ER; columns CiteSeer, arXiv, BioBase; values not preserved]

ER over Entire Dataset. RC-ER and baselines require threshold as parameter; reported: best achievable performance over all thresholds. Best RC-ER performance better than LDA-ER. LDA-ER does not require similarity threshold.

Performance for Specific Names arXiv Significantly larger improvements for ‘ambiguous names’

Trends in Synthetic Data. Bigger improvement with: bigger % of ambiguous refs; more refs per co-occurrence; more neighbors per entity.

Entity Resolution The Problem Relational Entity Resolution The Algorithms The Tool H. Kang, M. Bilgic, L. Licamele, B. Shneiderman VAST06, IV07 The Big Picture

D-Dupe: An Interactive Tool for Entity Resolution Novel combination of network visualization and statistical relational models well-suited to the visual analytic task at hand

Entity Resolution The Problem Relational Entity Resolution The Algorithms The Tool The Big Picture

Putting Everything together….

Summary. In reality, want to be able to flexibly combine node, edge and graph-based inferences: Entity Resolution + Link Prediction + Collective Classification = Graph Identification. While there are important pitfalls to take into account (confidence and privacy), there are many potential benefits and payoffs.

Thanks! Work sponsored by the National Science Foundation, Google, KDD program and National Geospatial Agency