Link Mining & Entity Resolution Lise Getoor University of Maryland, College Park.

Slides:

Advertisements

Similar presentations

Recommender System A Brief Survey.

Advertisements

+ Multi-label Classification using Adaptive Neighborhoods Tanwistha Saha, Huzefa Rangwala and Carlotta Domeniconi Department of Computer Science George.

Albert Gatt Corpora and Statistical Methods Lecture 13.

Multi-label Relational Neighbor Classification using Social Context Features Xi Wang and Gita Sukthankar Department of EECS University of Central Florida.

Relational Clustering for Entity Resolution Queries Indrajit Bhattacharya, Louis Licamele and Lise Getoor University of Maryland, College Park.

Leveraging Data and Structure in Ontology Integration Octavian Udrea 1 Lise Getoor 1 Renée J. Miller 2 1 University of Maryland College Park 2 University.

Social Media Mining Chapter 5 1 Chapter 5, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool, September, 2010.

NetSci07 May 24, 2007 Entity Resolution in Network Data Lise Getoor University of Maryland, College Park.

Christine Preisach, Steffen Rendle and Lars Schmidt- Thieme Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim Germany Relational.

GENERATING AUTOMATIC SEMANTIC ANNOTATIONS FOR RESEARCH DATASETS AYUSH SINGHAL AND JAIDEEP SRIVASTAVA CS DEPT., UNIVERSITY OF MINNESOTA, MN, USA.

SFU, CMPT 741, Fall 2009, Martin Ester 418 Outlook Outline Trends in KDD research Graph mining and social network analysis Recommender systems Information.

Search Engines and Information Retrieval

Communities in Heterogeneous Networks Chapter 4 1 Chapter 4, Community Detection and Mining in Social Media. Lei Tang and Huan Liu, Morgan & Claypool,

Using Structure Indices for Efficient Approximation of Network Properties Matthew J. Rattigan, Marc Maier, and David Jensen University of Massachusetts.

An Overview of Text Mining Rebecca Hwa 4/25/2002 References M. Hearst, “Untangling Text Data Mining,” in the Proceedings of the 37 th Annual Meeting of.

Statistical Relational Learning for Link Prediction Alexandrin Popescul and Lyle H. Unger Presented by Ron Bjarnason 11 November 2003.

Ranking by Odds Ratio A Probability Model Approach let be a Boolean random variable: document d is relevant to query q otherwise Consider document d as.

Algorithms for Data Mining and Querying with Graphs Investigators: Padhraic Smyth, Sharad Mehrotra University of California, Irvine Students: Joshua O’

Prefetching for Visual Data Exploration Punit R. Doshi, Elke A. Rundensteiner, Matthew O. Ward Computer Science Department Worcester Polytechnic Institute.

Online Stacked Graphical Learning Zhenzhen Kou +, Vitor R. Carvalho *, and William W. Cohen + Machine Learning Department + / Language Technologies Institute.

Chapter 5: Information Retrieval and Web Search

LÊ QU Ố C HUY ID: QLU OUTLINE  What is data mining ?  Major issues in data mining 2.

Models of Influence in Online Social Networks

Graph-based consensus clustering for class discovery from gene expression data Zhiwen Yum, Hau-San Wong and Hongqiang Wang Bioinformatics, 2007.

Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.

Attention and Event Detection Identifying, attributing and describing spatial bursts Early online identification of attention items in social media Louis.

Modeling, Searching, and Explaining Abnormal Instances in Multi-Relational Networks Chapter 1. Introduction Speaker: Cheng-Te Li

Dongyeop Kang1, Youngja Park2, Suresh Chari2

Copyright R. Weber Machine Learning, Data Mining ISYS370 Dr. R. Weber.

Search Engines and Information Retrieval Chapter 1.

C OLLECTIVE ANNOTATION OF WIKIPEDIA ENTITIES IN WEB TEXT - Presented by Avinash S Bharadwaj ( )

Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.

Iterative Readability Computation for Domain-Specific Resources By Jin Zhao and Min-Yen Kan 11/06/2010.

Entity Resolution for Big Data Lise Getoor University of Maryland College Park, MD Ashwin Machanavajjhala Duke University Durham, NC

Copyright 2002 Prentice-Hall, Inc. Modern Systems Analysis and Design Third Edition Jeffrey A. Hoffer Joey F. George Joseph S. Valacich Chapter 20 Object-Oriented.

Bayesian networks Classification, segmentation, time series prediction and more. Website: Twitter:

Latent Semantic Analysis Hongning Wang Recap: vector space model Represent both doc and query by concept vectors – Each concept defines one dimension.

Computing & Information Sciences Kansas State University Laboratory for Knowledge Discovery in Databases PhD Research Proficiency Exam Jing.

Chapter 6: Information Retrieval and Web Search

On Node Classification in Dynamic Content-based Networks.

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems -Ramu Bandaru Zhaoqi Chen Dmitri V.kalashnikov Sharad Mehrotra.

Collocations and Information Management Applications Gregor Erbach Saarland University Saarbrücken.

Fast Algorithms for Mining Association Rules Rakesh Agrawal and Ramakrishnan Srikant VLDB '94 presented by kurt partridge cse 590db oct 4, 1999.

Probabilistic Models for Discovering E-Communities Ding Zhou, Eren Manavoglu, Jia Li, C. Lee Giles, Hongyuan Zha The Pennsylvania State University WWW.

Lise Getoor University of Maryland, College Park Brigham Young University September 18, 2008 Graph Identification.

Network Community Behavior to Infer Human Activities.

Total Recall: Automatic Query Expansion with a Generative Feature Model for Object Retrieval O. Chum, et al. Presented by Brandon Smith Computer Vision.

DeepDive Model Dongfang Xu Ph.D student, School of Information, University of Arizona Dec 13, 2015.

Contribution and Proposed Solution Sequence-Based Features Collective Classification with Reports Results of Classification Using Reports Collective Spammer.

Data Mining and Decision Support

Challenge Problem: Link Mining Lise Getoor University of Maryland, College Park.

KAIST TS & IS Lab. CS710 Know your Neighbors: Web Spam Detection using the Web Topology SIGIR 2007, Carlos Castillo et al., Yahoo! 이 승 민.

Toward Entity Retrieval over Structured and Text Data Mayssam Sayyadian, Azadeh Shakery, AnHai Doan, ChengXiang Zhai Department of Computer Science University.

Enhanced hypertext categorization using hyperlinks Soumen Chakrabarti (IBM Almaden) Byron Dom (IBM Almaden) Piotr Indyk (Stanford)

Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,

CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Link mining ( based on slides.

GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011

Meta-Path-Based Ranking with Pseudo Relevance Feedback on Heterogeneous Graph for Citation Recommendation By: Xiaozhong Liu, Yingying Yu, Chun Guo, Yizhou.

Copyright © 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 28 Data Mining Concepts.

Feature Generation and Selection in SRL Alexandrin Popescul & Lyle H. Ungar Presented By Stef Schoenmackers.

Warren Shen, Xin Li, AnHai Doan Database & AI Groups University of Illinois, Urbana Constraint-Based Entity Matching.

Multi-Modal Bayesian Embeddings for Learning Social Knowledge Graphs Zhilin Yang 12, Jie Tang 1, William W. Cohen 2 1 Tsinghua University 2 Carnegie Mellon.

Optimizing Parallel Algorithms for All Pairs Similarity Search

Collective Network Linkage across Heterogeneous Social Platforms

Data Mining: Concepts and Techniques Course Outline

Data Warehousing and Data Mining

I don’t need a title slide for a lecture

Disambiguation Algorithm for People Search on the Web

Classification and Prediction

Data Pre-processing Lecture Notes for Chapter 2

Presentation transcript:

Link Mining & Entity Resolution Lise Getoor University of Maryland, College Park

Learning in Structured Domains Traditional machine learning and data mining approaches assume: –A random sample of homogeneous objects from single relation Real-world datasets: –Multi-relational, heterogeneous and semi-structured represented as a graph or network Statistical Relational Learning: –newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, natural language processing, graph mining, relational learning and ILP. Sample Domains: –web data, bibliographic data, epidemiological data, communication data, customer networks, collaborative filtering, trust networks, biological data –Nodes are objects »May have different kinds of objects »Objects have attributes »Objects may have labels or classes –Edges are links »May have different kinds of links »Links may have attributes »Links may be directed, are not required to be binary

Link MiningTasks & Challenges Challenges –Modeling Logical vs. Statistical dependencies –Feature construction –Instances vs. Classes –Collective Classification –Collective Consolidation –Effective Use of Labeled & Unlabeled Data –Link Prediction –Closed vs. Open World Object-Related Tasks Link-based Classification Link-based Ranking Group Detection Entity Resolution Link-Related Tasks Link Type Prediction Predicting Link Existence Link Cardinality Estimation Predicate Invention Graph-Related Tasks Subgraph Discovery Graph Classification Generative Models Meta-data Discovery Reference: SIGKDD Explorations Special Issue on Link Mining, December 2005, edited with Chris Diehl from Johns Hopkins Applied Physics Lab

LINQs UMD Members –myself, Indrajit Bhattacharya, Mustafa Bilgic, Rezarta Islamaj, Louis Licamele, Galileo Namata, John Park, Prithivaraj Sen, Vivek Senghal Projects –Link-based Classification –Entity Resolution (ER) Algorithms Query-time ER User Interface –Predictive Models for Social Network Analysis Affiliation Networks Social Capital in Friendship Event Networks –Temporal Analysis of Traffic Networks –Feature Generation for Sequences (biological data)

Entity Resolution The Problem Relational Entity Resolution Algorithms –Graph-based Clustering (GBC) –Probabilistic Model (LDA-ER) Query-time Entity Resolution ER User Interface

“Jonthan Smith” John Smith Jonathan Smith James Smith “Jon Smith” “Jim Smith” “John Smith” The Entity Resolution Problem “James Smith” Issues: 1.Identification 2.Disambiguation “J Smith”

“Jonthan Smith” John Smith James Smith “Jon Smith” “Jim Smith” “John Smith” “J Smith” The Entity Resolution Problem “James Smith” Unsupervised clustering approach –Number of clusters/entities unknown apriori Jonathan Smith “J Smith”

Pair-wise classification ? “Jonthan Smith” “Jon Smith” “Jim Smith” “John Smith” “J Smith” “James Smith” 0.8 ? Attribute-based Entity Resolution 1.Inability to disambiguate 2.Choosing threshold: precision/recall tradeoff 3.Perform transitive closure?

Relational Entity Resolution References not always observed independently –Links between references indicate relations between the entities –Co-author relations for bibliographic data Use relations to improve disambiguation and identification

Relational Identification Very similar names. Added evidence from shared co-authors

Relational Disambiguation Very similar names but no shared collaborators

Collective Entity Resolution Using Relations One resolutions provides evidence for another => joint resolution

Relational Constraints For Resolution Co-authors are typically distinct

Entity Resolution The Problem Relational Entity Resolution Algorithms –Graph-based Clustering (GBC-ER) –Probabilistic Model (LDA-ER) Query-time Entity Resolution ER User Interface

Example: Bibliographic Entity Resolution Resolve author, paper, venue, publisher entities from citation strings –R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB-94, –Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

Exploiting Bibliographic Links Resolve author, paper, venue, publisher entities from citation strings –R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB-94, –Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.

Exploiting Bibliographic Links R. Agrawal R. Srikant Fast algorithms for mining association rules in large databases VLDB-94, 1994 Rakesh Agrawal Ramakrishnan Srikant Fast Algorithms for Mining Association Rules Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994 co-author writes published-in

Exploiting Bibliographic Links R. Agrawal R. Srikant Fast algorithms for mining association rules in large databases VLDB-94, 1994 Rakesh Agrawal Ramakrishnan Srikant Fast Algorithms for Mining Association Rules Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994

entity 1 entity 2 entity 3 entity 4 Exploiting Bibliographic Links R. Agrawal R. Srikant Fast algorithms for mining association rules in large databases VLDB-94, 1994 Rakesh Agrawal Ramakrishnan Srikant Fast Algorithms for Mining Association Rules Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994

entity 1 entity 2 entity 3 entity 4 Exploiting Bibliographic Links R. Agrawal R. Srikant Fast algorithms for mining association rules in large databases VLDB-94, 1994 Rakesh Agrawal Ramakrishnan Srikant Fast Algorithms for Mining Association Rules Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994

Approach 1: ER using Relational Clustering (RC-ER) Iteratively cluster ‘similar’ references into entities R. Agrawal R. Srikant Fast algorithms for mining association rules in large databases VLDB-94, 1994 Rakesh Agrawal Ramakrishnan Srikant Fast Algorithms for Mining Association Rules Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994 c1c2 c3c4 c5c6 c7c8

c9 c1c2 Approach 1: ER using Relational Clustering (RC-ER) Iteratively cluster ‘similar’ references into entities R. Agrawal R. Srikant Fast algorithms for mining association rules in large databases VLDB-94, 1994 Rakesh Agrawal Ramakrishnan Srikant Fast Algorithms for Mining Association Rules Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994 c5c6 c7c8

Approach 1: ER using Relational Clustering (RC-ER) Iteratively cluster ‘similar’ references into entities R. Agrawal R. Srikant Fast algorithms for mining association rules in large databases VLDB-94, 1994 Rakesh Agrawal Ramakrishnan Srikant Fast Algorithms for Mining Association Rules Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994 c10 c9 c5c6 c7c8

Approach 1: ER using Relational Clustering (RC-ER) Iteratively cluster ‘similar’ references into entities R. Agrawal R. Srikant Fast algorithms for mining association rules in large databases VLDB-94, 1994 Rakesh Agrawal Ramakrishnan Srikant Fast Algorithms for Mining Association Rules Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994 c10 c9 c11 c7c8

Approach 1: ER using Relational Clustering (RC-ER) Iteratively cluster ‘similar’ references into entities R. Agrawal R. Srikant Fast algorithms for mining association rules in large databases VLDB-94, 1994 Rakesh Agrawal Ramakrishnan Srikant Fast Algorithms for Mining Association Rules Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994 c10 c9 c11 c12

Similarity Measure For Clustering sim(c i, c j ) = (1-  )*sim attr (c i, c j ) +  *sim rel (c i, c j ) Attribute similarity between clusters Relational similarity between clusters  Attribute Similarity: Compare attributes of individual references in the two clusters 1.Name: Single Valued Attribute −Cluster Similarity Metric / Representative Attribute −Jaro / Jaro-Winkler / Levenstein similarity with TF- IDF weights 2.Multi Valued Attributes −Countries, Addresses, Keywords, Classifications −Vector with TF-IDF weights; Cosine Similarity

Similarity Measure For Clustering sim(c i, c j ) = (1-  )*sim attr (c i, c j ) +  *sim rel (c i, c j ) Attribute similarity between clusters Relational similarity between clusters Relational Similarity: Use set similarity (eg Jaccard) to find shared clusters (resolutions) between links 1.Edge Detail Similarity Compare individual links of two clusters `Set of sets’ similarity Expensive 2.Neighborhood Similarity Compare neighborhoods of two clusters Reduce set of sets to multiset Cheaper approximation

Edge Detail Similarity Similarity of two links depends on their references –Consider resolution decisions on the references c9 c1c2 R. Agrawal R. Srikant Rakesh Agrawal Ramakrishnan Srikant Both links connect to cluster 9

Edge Detail Similarity Similarity of two links depends on their references –Consider resolution decisions on the references Label set E h (i) of i th link –set of cluster labels of its reference sim h (i,j) = Jaccard(E h (i), E h (j)) Edge Detail Similarity of two clusters –Sim rel (c, c’) = min(sim h (i), sim h (j)), i  H(c), j  H(c’)

Neighborhood Similarity Edge detail similarity is expensive –Ignore explicit link structure –Consider only set of neighborhood clusters Clusters c1, c2 still similar in terms of relationships c1c3c4c5c2 c5 c3 c4 link 1 link 2 link 3 link 4

Neighborhood Similarity Edge detail similarity is expensive –Ignore explicit link structure –Consider only set of neighborhood clusters N(c) : multiset of cluster labels covered by links in H(c) Neighborhood similarity of two clusters –Sim rel (c,c’) = Jaccard(N(c),N(c’))

Approach #1: Algorithm (GBC-ER) Iteratively merge the most similar cluster pairs Similarities are dynamic: Update related similarities after each merge Indexed priority queue for fast update and extraction Relational bootstrapping for improvements in performance and efficiency

Baseline Pairwise duplicate decisions using Soft-TFIDF (ATTR) –Secondary string similarity: Scaled Levenstein(SL), Jaro(JA), Jaro-Winkler(JW) Transitive Closure over pairwise decisions (ATTR*) Precision, Recall and F1 over pairwise decisions Requires similarity threshold –Report best performance over all thresholds

Evaluation Datasets CiteSeer –Machine Learning Citations –Originally created by Lawrence et al. –2,892 references to 1,165 true authors –1,504 links arXiv HEP –Papers from High Energy Physics –Used for KDD-Cup ‘03 Data Cleaning Challenge –58,515 references to 9,200 true authors –29,555 links BioBase –Biology papers on immunology and infectious diseases –IBM KDD Challenge dataset constructed at Cornell –156,156 publications, 831,991 author references –Ground truth for only ~1060 references

GBC Results: Best F1 Relational measures improve performance over attribute baseline in terms of precision, recall and F1 Neighbor similarity performs almost as well as edge detail or better Neighborhood similarity much faster than edge detail CiteSeerHEPBioBase Attr Attr* GBC-Nbr GBC-Edge

Structural Difference between Data Sets Percentage of Ambiguous References –0.5 % for Citeseer –9% for HEP –32% for BioBase Average number of collaborators per author –2.15 for Citeseer –4.5 for HEP Average number of references per author –2.5 for Citeseer –6.4 for HEP –106 for BioBase

Synthetic Data Generator Data generator mimics real collaborations Create collaboration graph in Stage 1 Create documents from this graph in Stage 2 Can control –Number of entities and documents –Average number of collaborators per author –Average number of references per entity –Average number of references per document –Percentage of ambiguous references –…

Trends in Synthetic Data Improvement increases sharply with higher ambiguity in references

Trends in Synthetic Data Improvement increases with more references per author

Trends in Synthetic Data Improvement increases with more references per document

Approach #2: Latent Dirichlet Model for ER Probabilistic model of entity collaboration groups –Entities (authors) belong to groups –Entities (authors) in a link (document) depend on the groups that are involved Latent group variable for each reference Group labels and entity labels unobserved

LDA for Entity Resolution (LDA-ER) Author entities not directly observed Generate entity a as before Entities have attributes v Generate attribute v i ’ for i th reference from entity attribute v a using noise process α θ z a Φ β T D RdRd v’ v A

LDA-ER Contributions Group labels capture relationships among entities Group label and entity label for each reference rather than a variable for each pair Unsupervised learning of labels Number of entities not assumed to be known – Gibbs sampling to infer number of entities

LDA-ER Performance CiteSeer –Improves precision –22% reduction in error arXiv –Improves recall as well as precision –20% reduction in error

ER Algorithm Comparison Two approaches to relational entity resolution 1.Graph-Based Clustering  Efficient  Customizable attribute similarity measure  Performs slightly better than probabilistic model  Unsupervised -- needs threshold to determine duplicates 2.Probabilistic Generative Model  Notion of optimal solution  Group label for references  Can generalize for unseen data  Able to handle noise

Entity Resolution The Problem Relational Entity Resolution Algorithms –Graph-based Clustering (GBC-ER) –Probabilistic Model (LDA-ER) Query-time Entity Resolution ER User Interface

Query-time Entity Resolution Goal: Allow users to query an unresolved or partially resolved database Adaptive strategy which constructs set of relevant references and performs collective resolution Define canonical queries: –Disambiguation query –Entity Resolution query

Preliminary Results: F1 arXivBiobase Attr Attr* Naïve Rel Naïve Rel* Collective ER – depth Collective ER – depth Adaptive Strategy 200 times faster and just as accurate

IBM KDD Entity Resolution Challenge Recent bake-off among researchers in KDD program Our algorithms performed among the top; especially impressive since our algorithms are unsupervised Focused our efforts on scalability, query specific entity resolution, caching, etc.

D-Dupe: An Interactive Tool for ER Tool Integrates –entity resolution algorithms –simple visual interface optimized for ER Case studies on bibliographic datasets –on two clean datasets we quickly were able to find many duplicates –on one dataset w/o author keys, we were able to easily clean dataset to construct keys Currently –adapting tool for database integration geospatial data academic genealogy archives

ER References Bibliographic Data –Author resolution using co-author links 1.Graph-based Clustering (GBC-ER) (DMKD ‘04, LinkKDD ‘04, Book Chapter, Tech Report) 2.LDA based Group model (LDA-ER) (SDM ’06,best paper awqard) 3.Query-based Entity Resolution (QB-ER) Participants in IBM KDD Entity Resolution Challenge Archives Name reference resolution using traffic network 1.Using a variety of temporal social network models (SDM ’06) Natural Language –Sense resolution using translation links in parallel corpora (ACL ‘04) 1.Sense Model: Senses in different languages depend directly on each other 2.Concept Model: Semantic sense groups or Concepts relate senses from different languages

Thanks!!