Relational Clustering for Entity Resolution Queries Indrajit Bhattacharya, Louis Licamele and Lise Getoor University of Maryland, College Park.



The Entity Resolution Problem
• Discover the domain entities
• Map each reference to an entity

References (papers):
P1: “A mouse immunity model”, W. Wang, C. Chen, A. Ansari
P2: “A better mouse immunity model”, W. Wang, A. Ansari
P3: “Measuring protein-bound fluxetine”, L. Li, C. Chen, W. Wang
P4: “Autoimmunity in biliary cirrhosis”, W. W. Wang, A. Ansari

Entities: Abdulla Ansari, WeiWei Wang, Chih Chen, Wenyi Wang, Liyuan Li, Chien-Te Chen

Query-time ER: Motivation
• Most publicly available databases do not have resolved entities
  ◦ PubMed and CiteSeer have many unresolved authors
• Millions of queries every day require resolved entities, directly or indirectly
  ◦ “I am looking for all papers by Stuart Russell”
• How do we address this problem?
  1. Leave the burden on the user to do the resolution
  2. Ask owners to ‘clean’ their databases
  3. Develop techniques for query-time resolution

Entity Resolution Queries
• Disambiguation Query
  ◦ Among all papers with ‘W Wang’ as author, find those written by WeiWei Wang
• Resolution Query
  ◦ Do disambiguation
  ◦ Also retrieve papers by WeiWei Wang that appear under a different author name, e.g. ‘W W Wang’

Papers by WeiWei Wang (resolution query result):
P1: “A mouse immunity model”, W. Wang, C. Chen, A. Ansari
P2: “A better mouse immunity model”, W. Wang, A. Ansari
P4: “Autoimmunity in biliary cirrhosis”, W. W. Wang, A. Ansari

Papers with ‘W Wang’ as author (candidates for the disambiguation query):
P1: “A mouse immunity model”, W. Wang, C. Chen, A. Ansari
P2: “A better mouse immunity model”, W. Wang, A. Ansari
P3: “Measuring protein-bound fluxetine”, L. Li, C. Chen, W. Wang
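The difference between the two query types can be shown on the slide's four papers. The following is a minimal sketch, assuming the entity behind each 'Wang' reference is already known; the `entity` column, `REFS`, and both function names are hypothetical and only encode the ground truth implied by the example (P3's 'W Wang' is taken to be Wenyi Wang). A real system would have to infer these labels.

```python
# Each reference: (paper id, author name as written, resolved entity).
# The entity labels are the ground truth implied by the slide example,
# not something a query-time system is given.
REFS = [
    ("P1", "W Wang", "WeiWei Wang"),
    ("P2", "W Wang", "WeiWei Wang"),
    ("P3", "W Wang", "Wenyi Wang"),
    ("P4", "W W Wang", "WeiWei Wang"),
]


def disambiguation_query(refs, name, entity):
    """Papers that carry exactly `name` and belong to `entity`."""
    return sorted({p for p, n, e in refs if n == name and e == entity})


def resolution_query(refs, entity):
    """All papers of `entity`, whatever name variant they appear under."""
    return sorted({p for p, _, e in refs if e == entity})


print(disambiguation_query(REFS, "W Wang", "WeiWei Wang"))
print(resolution_query(REFS, "WeiWei Wang"))
```

The disambiguation query only filters the records that share the query name, while the resolution query also has to find name variants such as 'W W Wang'.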

Query-time ER using Relations
1. Simple approach for resolving queries
  ◦ Use attributes
  ◦ Quick but not accurate
2. Use the best techniques available
  ◦ Collective resolution using relationships
  ◦ How can we localize collective resolution?
• Two-phase collective resolution for a query
  ◦ Extract a minimal set of relevant records
  ◦ Perform collective resolution on the extracted records

Cut-based Evaluation of Relational Clustering
• Vertices embedded in attribute space
• Additional (hyper)edges represent relationships

[Figure: two alternative clusterings of the same vertices into clusters C1 to C4]
• First clustering: good separation of attributes, but many cluster-cluster relationships (C1-C3, C1-C4, C2-C4)
• Second clustering: worse in terms of attributes, but fewer cluster-cluster relationships (C1-C3, C2-C4)

A Cut-based Objective Function

Terms labeled in the objective function:
• weight for attributes
• weight for relations
• similarity of attributes
• 1 iff a relational edge exists between c_i and c_j
• compatibility of c_i and c_j

• Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function
• Common cluster neighborhood: Jaccard works better than intersection
• Similarity of attributes: Jaro, Levenshtein; TF-IDF
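The greedy step can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the pair score is a weighted sum of attribute similarity and the Jaccard overlap of cluster neighborhoods, used here as a stand-in for "maximum reduction in the objective"; `difflib.SequenceMatcher` replaces Jaro or Levenshtein so the example needs only the standard library; the weights, the 0.6 threshold, and the starting state (the three 'A Ansari' references already merged) are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def attr_sim(c1, c2, names):
    # Best pairwise string similarity between the names in two clusters
    # (SequenceMatcher is a stand-in for Jaro / Levenshtein).
    return max(SequenceMatcher(None, names[a], names[b]).ratio()
               for a in c1 for b in c2)


def neighborhood(idx, clusters, hyperedges):
    # Indices of the other clusters that share a hyperedge with cluster idx.
    linked = set()
    for edge in hyperedges:
        if edge & clusters[idx]:
            linked |= edge
    return {k for k, c in enumerate(clusters) if k != idx and c & linked}


def combined_sim(i, j, clusters, names, hyperedges, w_attr, w_rel):
    ni = neighborhood(i, clusters, hyperedges)
    nj = neighborhood(j, clusters, hyperedges)
    union = ni | nj
    jaccard = len(ni & nj) / len(union) if union else 0.0
    return w_attr * attr_sim(clusters[i], clusters[j], names) + w_rel * jaccard


def greedy_cluster(start, names, hyperedges, w_attr=0.5, w_rel=0.5, threshold=0.6):
    clusters = [set(c) for c in start]
    while True:
        best_score, best_pair = threshold, None
        for i, j in combinations(range(len(clusters)), 2):
            score = combined_sim(i, j, clusters, names, hyperedges, w_attr, w_rel)
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_pair is None:          # no pair scores above the threshold
            return clusters
        i, j = best_pair               # merge the best pair and repeat
        clusters[i] |= clusters[j]
        del clusters[j]


# Reference ids are "<paper>.<position>"; one hyperedge per paper.
NAMES = {"P1.1": "W Wang", "P1.2": "C Chen", "P1.3": "A Ansari",
         "P2.1": "W Wang", "P2.2": "A Ansari",
         "P3.1": "L Li", "P3.2": "C Chen", "P3.3": "W Wang",
         "P4.1": "W W Wang", "P4.2": "A Ansari"}
HYPEREDGES = [{"P1.1", "P1.2", "P1.3"}, {"P2.1", "P2.2"},
              {"P3.1", "P3.2", "P3.3"}, {"P4.1", "P4.2"}]
ANSARI = {"P1.3", "P2.2", "P4.2"}      # assumed already merged
START = [ANSARI] + [{r} for r in sorted(NAMES) if r not in ANSARI]

for cluster in greedy_cluster(START, NAMES, HYPEREDGES):
    print(sorted(cluster))
```

On this toy input the 'Wang' references of P1, P2 and P4 end up together because they share the Ansari neighbor, while P3's 'W Wang' stays separate: its name matches, but its neighborhood does not.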

Extracting Relevant Records
• Start with the query name or record
• Alternate between
  1. Name expansion: for any relevant record, include other records with that name
  2. Hyper-edge expansion: for any relevant record, include other related records
• Terminate at some depth k

[Figure: expansion for the query ‘W Wang’]
• Query: W Wang
• Level 0 (name expansion): P1: W Wang, P2: W Wang, P3: W Wang, P4: W W Wang
• Level 1 (hyper-edge expansion): P1: A Ansari, P1: C Chen, P2: A Ansari, P3: C Chen, P3: L Li, P4: A Ansari
• Level 2 (name expansion): P: A Ansari, P: C Chen, P: L Li
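The alternating expansion can be sketched as follows. This is a minimal illustration, assuming that two names match when they share a first initial and a last name (the hypothetical `name_key`), and that each paper is one hyper-edge. Paper P5 and the author 'J Doe' are invented here only to make level 2 visible.

```python
def name_key(name):
    # Hypothetical matching rule: first initial plus last name,
    # so 'W Wang' and 'W W Wang' are treated as the same name.
    parts = name.split()
    return (parts[0][0], parts[-1])


def extract_relevant(query_name, refs, depth):
    """Return indices of relevant references after `depth` expansion levels."""
    # Level 0: name expansion from the query itself.
    relevant = {i for i, (_, n) in enumerate(refs)
                if name_key(n) == name_key(query_name)}
    frontier = set(relevant)
    for level in range(1, depth + 1):
        if level % 2 == 1:
            # Hyper-edge expansion: other references on the same papers.
            papers = {refs[i][0] for i in frontier}
            found = {i for i, (p, _) in enumerate(refs) if p in papers}
        else:
            # Name expansion: other references with the same names.
            keys = {name_key(refs[i][1]) for i in frontier}
            found = {i for i, (_, n) in enumerate(refs) if name_key(n) in keys}
        frontier = found - relevant
        relevant |= frontier
    return relevant


REFS = [("P1", "W Wang"), ("P1", "C Chen"), ("P1", "A Ansari"),
        ("P2", "W Wang"), ("P2", "A Ansari"),
        ("P3", "L Li"), ("P3", "C Chen"), ("P3", "W Wang"),
        ("P4", "W W Wang"), ("P4", "A Ansari"),
        ("P5", "A Ansari"), ("P5", "J Doe")]   # P5 is hypothetical

for k in range(3):
    print(k, sorted(extract_relevant("W Wang", REFS, k)))
```

Each extra level pulls in every record reachable from the previous frontier, which is why the relevant set can grow quickly with depth.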

Adaptive Expansion for a Query
• Unconstrained expansion yields too many records
  ◦ Adaptively select records based on ‘ambiguity’
  ◦ ‘Chen’ is more ambiguous than ‘Ansari’
• Adaptive name expansion
  ◦ Expand the more ambiguous records
    - They need extra evidence
• Adaptive hyper-edge expansion
  ◦ Add fewer ambiguous records
    - They lead to imprecision
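One way to read the two adaptive rules is as a pair of filters driven by a per-name ambiguity score. The sketch below assumes hard thresholds, which the slides do not specify; the function names, the thresholds, and the scores in `AMBIGUITY` are all hypothetical.

```python
def adaptive_name_expansion(names, ambiguity, min_ambiguity):
    # Only expand names that are ambiguous enough to need extra evidence.
    return {n for n in names if ambiguity.get(n, 0.0) >= min_ambiguity}


def adaptive_hyperedge_expansion(candidates, ambiguity, max_ambiguity):
    # Only add related records whose names are not too ambiguous,
    # since highly ambiguous neighbors lead to imprecision.
    return [(p, n) for p, n in candidates if ambiguity.get(n, 0.0) <= max_ambiguity]


# Hypothetical ambiguity scores in [0, 1].
AMBIGUITY = {"C Chen": 0.8, "L Li": 0.6, "A Ansari": 0.1}

to_expand = adaptive_name_expansion({"C Chen", "A Ansari"}, AMBIGUITY, 0.5)
to_add = adaptive_hyperedge_expansion(
    [("P1", "C Chen"), ("P1", "A Ansari"), ("P3", "L Li")], AMBIGUITY, 0.7)
print(sorted(to_expand))
print(to_add)
```

The two filters pull in opposite directions on purpose: ambiguity is a reason to look for more evidence about a record already in the set, and a reason to hesitate before adding a new one.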

Unsupervised Estimation of Ambiguity
• Ambiguity: probability of multiple entities sharing an attribute value
• Estimate the ambiguity of one single-valued attribute (A1 = a) using another (A2)
  ◦ Count the number of different values of A2 observed for records having A1 = a
  ◦ e.g. the number of different first initials for the last name ‘Smith’
• The estimate improves with more independent attributes
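The counting step can be sketched directly. This assumes records are dictionaries and returns the raw count of distinct A2 values per A1 value; how that count is turned into a probability is not given in the slides, so it is left out. The field names and sample records are hypothetical.

```python
from collections import defaultdict


def estimate_ambiguity(records, a1, a2):
    """For each value of attribute a1, count the distinct values of a2 seen with it."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[a1]].add(rec[a2])
    return {value: len(others) for value, others in seen.items()}


# Hypothetical author references split into last name and first initial.
RECORDS = [
    {"last": "Smith", "initial": "J"},
    {"last": "Smith", "initial": "A"},
    {"last": "Smith", "initial": "J"},
    {"last": "Smith", "initial": "R"},
    {"last": "Ansari", "initial": "A"},
    {"last": "Ansari", "initial": "A"},
]

print(estimate_ambiguity(RECORDS, "last", "initial"))
```

A last name seen with many different first initials is likely shared by several people, so it gets a higher count than one that always appears with the same initial.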

Evaluation Datasets
• arXiv High Energy Physics
  ◦ 29,555 publications, 58,515 references to 9,200 authors
  ◦ Queries: all ambiguous names (75 in total)
    - True authors per name: 2 to 11 (avg. 2.4)
• Elsevier BioBase
  ◦ 156,156 publications, 831,991 author references
  ◦ Keywords, topic classifications, language, country and affiliation of corresponding author, etc.
  ◦ Queries: 100 most frequent names
    - True authors per name: 1 to 100 (avg. 32)

Growth Rate of Relevant Records and Query Processing Time
• The number of relevant references grows rapidly with expansion depth
• RC-ER is fast, but not fast enough for query-time resolution

Query-time ER Results
• Unconstrained expansion
  ◦ Collective resolution is more accurate
  ◦ Accuracy improves beyond depth 1
• Adaptive expansion
  ◦ Minimal loss in accuracy
  ◦ Dramatic reduction in query processing time

Legend:
• A: pairwise attribute similarity
• A+N: also uses neighbors’ attributes
• *: transitive closure
• AX-2: adaptive expansion at depths 2 and beyond
• AX-1: adaptive expansion even at depth 1
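The starred baselines apply transitive closure to pairwise match decisions. A minimal sketch of that step is shown below, using a union-find structure; the function name and the sample pairs are hypothetical, and the pairwise decisions are assumed to come from some attribute-similarity threshold that is not shown.

```python
def transitive_closure(items, matched_pairs):
    """Group items so that any chain of pairwise matches ends up in one cluster."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for x in items:
        groups.setdefault(find(x), set()).add(x)
    return sorted(groups.values(), key=sorted)


# r1~r2 and r2~r3 were matched pairwise; r1~r3 follows by closure.
clusters = transitive_closure(["r1", "r2", "r3", "r4"],
                              [("r1", "r2"), ("r2", "r3")])
print(clusters)
```

Closure makes the pairwise decisions consistent, but it also propagates any false match through the whole chain, which is one reason it is treated as a baseline rather than as collective resolution.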

Conclusions
• Query-centric entity resolution
• Cut-based evaluation of relational clustering
• Adaptive selection of relevant references for a query
• Resolution at query time with minimal loss in accuracy

Future Directions
• Spectral algorithm for relational clustering
• Stronger coupling between extraction and resolution
• Localized resolution for incoming records

References
• Indrajit Bhattacharya, Louis Licamele and Lise Getoor, "Query-Time Entity Resolution", ACM SIGKDD, 2006.
• Indrajit Bhattacharya and Lise Getoor, "A Latent Dirichlet Model for Unsupervised Entity Resolution", SIAM Data Mining, 2006.
• Indrajit Bhattacharya and Lise Getoor, "Entity Resolution in Graphs", chapter in Mining Graph Data, Lawrence B. Holder and Diane J. Cook, editors, Wiley, 2006 (to appear).
• Indrajit Bhattacharya and Lise Getoor, "Relational Clustering for Multi-type Entity Resolution", SIGKDD Workshop on Multi Relational Data Mining (MRDM), 2005.