Adaptive Graphical Approach to Entity Resolution Dmitri V. Kalashnikov Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra Computer Science Department.

Presentation transcript:

Adaptive Graphical Approach to Entity Resolution
Dmitri V. Kalashnikov
Stella Chen, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine
Additional information is available at
Copyright © by Dmitri V. Kalashnikov, 2007
ACM IEEE Joint Conference on Digital Libraries 2007

2 Structure of the Talk
Motivation (current section)
Generic Disambiguation Framework
–High-level
Entity Resolution Approach
–Part of the Framework
Experiments

3 Entity Resolution & Data Cleaning
Analysis on bad data leads to wrong conclusions!
–Uncertainty
–Errors
–Missing data

4 Why do we need “Entity Resolution”?
Jane Smith – Fresh Ph.D.
Tom – Recruiter
“Hi, I’m Jane Smith. I’d like to apply for a faculty position.”
“Wow! I am sure we will accept a strong candidate like that!”
“OK, let me check something quickly …”
???
Publications: 1.…… 2.…… 3.……
Publications: 1.…… 2.…… 3.……
CiteSeer Rank

5 Suspicious entries
–Let’s go to the DBLP website
–which stores bibliographic entries of many CS authors
–Let’s check two people
–“A. Gupta”
–“L. Zhang”
What is the problem?
CiteSeer: the top-k most cited authors
DBLP

6 Comparing raw and cleaned CiteSeer
Raw CiteSeer’s Top-K Most Cited Authors
Cleaned CiteSeer’s Top-K Most Cited Authors

Rank           Author     Location
1 (100.00%)    douglas
2 (100.00%)    rakesh
3 (100.00%)    hector
4 (100.00%)    sally
5 (100.00%)    jennifer
6 (100.00%)    david
6 (100.00%)    thomas
7 (100.00%)    rajeev
8 (100.00%)    willy
9 (100.00%)    van
10 (100.00%)   rajeev
11 (100.00%)   john
12 (100.00%)   joseph
13 (100.00%)   andrew
14 (100.00%)   peter
15 (100.00%)   serge

7 What is the lesson?
“Garbage in, garbage out” principle: making decisions based on bad data can lead to wrong results.
–Data should be cleaned first
–E.g., determine the (unique) real authors of publications
–Solving such challenges is not always “easy”
–This explains the large body of work on Entity Resolution

8 Typical Data Processing Flow

9 Two most common types of Entity Resolution
Fuzzy lookup
–match references to objects
–list of all objects is given
–[SDM’05], [TODS’06]
Fuzzy grouping
–group references that co-refer
–[IQIS’05], [JCDL’07]
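
To make the distinction between the two task types concrete, here is a minimal sketch. It is only an illustration, not the method from the talk: the string similarity from Python's `difflib`, the `0.6` threshold, and the function names `fuzzy_lookup` and `fuzzy_grouping` are all assumptions introduced for this example.

```python
from difflib import SequenceMatcher


def sim(a, b):
    # Assumed similarity: character-based ratio in [0, 1] from difflib.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_lookup(reference, objects, threshold=0.6):
    # Lookup: the list of all objects is given; map the reference to one of them.
    best = max(objects, key=lambda o: sim(reference, o))
    return best if sim(reference, best) >= threshold else None


def fuzzy_grouping(references, threshold=0.6):
    # Grouping: no object list is given; group the references that seem to co-refer.
    groups = []
    for ref in references:
        for group in groups:
            if any(sim(ref, member) >= threshold for member in group):
                group.append(ref)
                break
        else:
            groups.append([ref])
    return groups


if __name__ == "__main__":
    print(fuzzy_lookup("J. Smith", ["John Smith", "Alice Wong"]))
    print(fuzzy_grouping(["J. Smith", "John Smith", "A. Wong", "Alice Wong"]))
```

Lookup returns a known object (or nothing), while grouping returns a partition of the references themselves.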

10 Structure of the Talk
Motivation
Generic Framework (current section)
–High-level
Approach
–Part of the Framework
Experiments

11 Traditional Approach to Entity Resolution
s(X,Y) = f(X,Y)
Similarity = Similarity of Features

12 Key Observation: More Info is Available

13 Solution: Main Idea
s(X,Y) = c(X,Y) + γ f(X,Y)
Similarity = Similarity of Features + “Connection Strength”
New Paradigm
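
The combined score s(X,Y) = c(X,Y) + γ f(X,Y) can be sketched as follows. The slide does not define f or c, so both are stand-ins chosen for illustration: f is assumed to be Jaccard overlap of attribute values, and c is assumed to sum `decay ** length` over simple paths of at most `max_len` edges between the two nodes. The graph layout and all function names are hypothetical.

```python
def feature_similarity(x_attrs, y_attrs):
    # Assumed f(X,Y): Jaccard overlap of the two attribute-value sets.
    x, y = set(x_attrs), set(y_attrs)
    return len(x & y) / len(x | y) if x | y else 0.0


def connection_strength(graph, x, y, max_len=3, decay=0.5):
    # Assumed c(X,Y): sum of decay**length over simple paths x..y
    # with at most max_len edges. graph maps node -> list of neighbors.
    total = 0.0
    stack = [(x, {x}, 0)]
    while stack:
        node, visited, length = stack.pop()
        if length == max_len:
            continue
        for nbr in graph.get(node, ()):
            if nbr == y:
                total += decay ** (length + 1)
            elif nbr not in visited:
                stack.append((nbr, visited | {nbr}, length + 1))
    return total


def similarity(graph, attrs, x, y, gamma=1.0):
    # s(X,Y) = c(X,Y) + gamma * f(X,Y)
    return connection_strength(graph, x, y) + gamma * feature_similarity(attrs[x], attrs[y])


if __name__ == "__main__":
    # Two references r1 and r2 connected through a shared co-author node "a".
    graph = {"r1": ["a"], "a": ["r1", "r2"], "r2": ["a"]}
    attrs = {"r1": {"j", "smith"}, "r2": {"john", "smith"}}
    print(similarity(graph, attrs, "r1", "r2", gamma=0.75))
```

With γ = 0 the score depends only on connections; a large γ makes it behave like the traditional feature-only approach.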

14 Illustrative Example
“Indirect connections”
–Suppose your co-worker’s name is “John White”
–Suppose you see on the Web, on my homepage:
–My name: “Dmitri …”
–Somebody named: “John White”
–Who is this “John White”?
–From data you might establish a connection:
–“Dmitri” might be connected to more “John White”s…

15 Key Features of the Framework
Our goal was to create a framework with the following properties:
–solid theoretical foundation
–lookup
–domain-independent framework
–self-tuning
–scales to large datasets
–robust under uncertainty
–high disambiguation quality

16 Structure of the Talk
Motivation
Generic Framework
–High-level
Approach (current section)
–Part of the Framework
Experiments

17 Approach
Graph Creation
–Entity-Relationship Graph
Consolidation Algorithm
–Bottom-up clustering
Adaptiveness to data
–That is, self-tuning
–Supervised learning
External Data
–To improve the quality further
–A theoretic possibility
–Not tested yet

18 ER Graph Creation

19 Virtual Connected Subgraph (VCS)
VCS
–Similarity edges form VCSs
–Subgraphs in the ER graph
1. “Virtual”
–Contains only similarity edges
2. “Connected”
–A path between any 2 nodes
3. Completeness
–Adding more nodes/edges would violate (1) and (2)
Logically, the Goal is
–Partition each VCS properly
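
Read literally, the three properties say that VCSs are the connected components of the subgraph formed by similarity edges alone. A minimal sketch of that reading uses union-find. The edge representation `(u, v, kind)` and the function name are assumptions made for this example, not taken from the talk.

```python
def virtual_connected_subgraphs(edges):
    # edges: iterable of (u, v, kind), kind is "similarity" or "relationship".
    parent = {}

    def find(n):
        # Union-find root lookup with path halving.
        parent.setdefault(n, n)
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # "Virtual": only similarity edges are allowed to join nodes.
    for u, v, kind in edges:
        if kind == "similarity":
            parent[find(u)] = find(v)

    # "Connected" and complete: each union-find class is one maximal component.
    groups = {}
    for n in list(parent):
        groups.setdefault(find(n), set()).add(n)
    return sorted(sorted(g) for g in groups.values())


if __name__ == "__main__":
    edges = [
        ("r1", "r2", "similarity"),
        ("r2", "r3", "similarity"),
        ("r4", "r5", "similarity"),
        ("r1", "p1", "relationship"),  # ignored when forming VCSs
    ]
    print(virtual_connected_subgraphs(edges))
```

Each returned group is then a candidate set that the consolidation step would partition into real entities.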

20 Consolidation Algorithm: Merging
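
The slide body is not preserved in this transcript, so the following is only a generic sketch of bottom-up (agglomerative) merging, not the authors' algorithm. It assumes a caller-supplied pairwise similarity `pair_sim`, average linkage between clusters, and a fixed merge `threshold`; all three are choices made for this illustration.

```python
from itertools import combinations


def bottom_up_merge(nodes, pair_sim, threshold):
    # Start with one singleton cluster per reference.
    clusters = [frozenset([n]) for n in nodes]

    def cluster_sim(a, b):
        # Assumed linkage: average pairwise similarity between two clusters.
        scores = [pair_sim(u, v) for u in a for v in b]
        return sum(scores) / len(scores)

    while len(clusters) > 1:
        # Pick the most similar pair of clusters.
        a, b = max(combinations(clusters, 2), key=lambda p: cluster_sim(*p))
        if cluster_sim(a, b) < threshold:
            break  # nothing left that is similar enough to merge
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


if __name__ == "__main__":
    sims = {
        frozenset(("r1", "r2")): 0.9,
        frozenset(("r1", "r3")): 0.2,
        frozenset(("r2", "r3")): 0.3,
    }
    result = bottom_up_merge(["r1", "r2", "r3"], lambda u, v: sims[frozenset((u, v))], 0.5)
    print(sorted(sorted(c) for c in result))
```

In the setting of the talk, `nodes` would be the references of one VCS and `pair_sim` a combined feature and connection score.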

21 Self-tuning via Supervised Learning

22 Self-tuning (2)

23 External Knowledge to Improve Quality

24 Structure of the Talk
Motivation
Generic Framework
–High-level
Approach
–Part of the Framework
Experiments (current section)

25 Quality
“Context” is proposed in [Bhattacharya et al., DMKD’04].
The two algorithms are proposed in [Dong et al., SIGMOD’05].

26 Scalability & Efficiency

27 Impact of Random Relationships

28 Contact Information
Info about our disambiguation project –
Overall design
–Dmitri V. Kalashnikov
–dvk [at] domain
Implementation details in JCDL’07
–Zhaoqi (Stella) Chen
–chenz [at] domain
–domain = ics.uci.edu