Published by Philippa Hunter. Modified over 5 years ago.
1
Self-tuning in Graph-Based Reference Disambiguation
Rabia Nuray-Turan, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine
2
DASFAA 2007, Bangkok, Thailand
Overview
Intro to Data Cleaning
  Entity resolution
RelDC Framework
  Past work
Adapting to data
  The new part
  Reduction to an optimization problem
  Linear programming
Experiments
11 January 2019 DASFAA 2007, Bangkok, Thailand
3
Data Cleaning
Analysis on bad data leads to wrong conclusions.
4
Example of the problem: CiteSeer top-K
CiteSeer: the top-k most cited authors
Suspicious entries: “A. Gupta” and “L. Zhang”
Let's go to the DBLP website, which stores bibliographic entries of many CS authors, and check these two people.
They are in the top 20 because there are many different authors with each of these names.
5
Two Most Common Entity-Resolution Challenges
Fuzzy lookup (reference disambiguation)
  Match references to objects
  The list of all objects is given
Fuzzy grouping
  Group together object representations that correspond to the same object
6
Standard Approach to Entity Resolution
7
Overview
Intro to Data Cleaning
RelDC Framework
  Past work
Adapting to data
  The new part
  Reduction to an optimization problem
  Linear programming
Experiments
8
RelDC Framework
9
RelDC Framework
Past work: SDM’05, TODS’06
Domain-independent framework
  Views the dataset as an entity-relationship graph
  Analyzes paths in this graph
Solid theoretical foundation
  Optimization problem
Scales to large datasets
Robust under uncertainty
High disambiguation quality
No self-tuning
  This paper solves this challenge
10
Entity-Relationship Graph
Choice node
  For uncertain references
  Encodes the options/possibilities yr1, …, yrN
Among the options yr1, …, yrN, pick the most strongly connected one
  CAP (Context Attraction Principle)
  Analyze paths in G that exist between xr and yrj, for all j
  Use a model to measure connection strength
“Connection strength” model
  c(u,v), for nodes u and v in G: how strongly u and v are connected in G
  RandomWalk-based
  Fixed, based on intuition
This paper, instead, learns such a model from data.
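The selection step at a choice node can be sketched as follows. This is a minimal illustration, not the RelDC implementation: `resolve_reference`, the `strength` callback, and the toy values are all hypothetical names and numbers introduced here. The only thing taken from the slide is the rule itself: among the options yr1, …, yrN, pick the one with the largest c(xr, yrj).

```python
from typing import Callable, Hashable, Sequence


def resolve_reference(
    x_r: Hashable,
    options: Sequence[Hashable],
    strength: Callable[[Hashable, Hashable], float],
) -> Hashable:
    """Return the option y_rj with the largest connection strength c(x_r, y_rj)."""
    if not options:
        raise ValueError("a choice node needs at least one option")
    return max(options, key=lambda y: strength(x_r, y))


# Hypothetical connection strengths for one uncertain reference (made-up values).
TOY_STRENGTH = {
    ("paper1", "A. Gupta #1"): 0.2,
    ("paper1", "A. Gupta #2"): 0.7,
}

chosen = resolve_reference(
    "paper1",
    ["A. Gupta #1", "A. Gupta #2"],
    lambda u, v: TOY_STRENGTH[(u, v)],
)
```

Passing the connection-strength model in as a callback keeps the selection rule separate from how c(u,v) is computed, so a fixed model and a learned one can be swapped without touching this step.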
11
Overview
Intro to Data Cleaning
RelDC Framework
  Past work
Adapting to data
  The new part
  Reduction to an optimization problem
  Linear programming
Experiments
12
Adaptive Solution
Classify the paths found in the graph into a finite set of path types ST = {T1, T2, …, TN}.
If paths p1 and p2 are of the same type, they are treated as identical.
The connection between nodes u and v can then be represented by a path-type count vector Tuv = {c1, c2, …, cN}, where ci is the number of paths of type Ti between u and v.
If each path type Ti can be associated with a weight wi, the connection strength is the weighted sum c(u,v) = c1·w1 + c2·w2 + … + cN·wN.
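A small sketch of the path-type count vector and the resulting connection strength, under the assumption that c(u,v) is the linear combination of the counts ci and the weights wi. The function names, the three path types, and the weights are hypothetical and serve only as illustration.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def count_vector(path_types_found: Iterable[str], all_types: Sequence[str]) -> List[int]:
    """Build T_uv: how many of the found paths fall into each path type."""
    counts = Counter(path_types_found)
    return [counts[t] for t in all_types]


def connection_strength(counts: Sequence[int], weights: Sequence[float]) -> float:
    """Assumed model: c(u, v) = sum_i c_i * w_i."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights must have the same length")
    return sum(c * w for c, w in zip(counts, weights))


# Hypothetical example: three path types, three u-v paths found in the graph.
ALL_TYPES = ["T1", "T2", "T3"]
t_uv = count_vector(["T1", "T3", "T1"], ALL_TYPES)   # two T1 paths, one T3 path
c_uv = connection_strength(t_uv, [0.5, 0.25, 1.0])   # 2*0.5 + 0*0.25 + 1*1.0
```

Because paths of the same type are treated as identical, only the counts matter; the individual paths can be discarded once the vector is built.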
13
Problems to Answer
How will we classify the paths?
How will we associate each path type with a weight?
14
Classifying Paths
Path Type Model (PTM):
  Views each path as a sequence of edges <e1, e2, e3, …, en>
  Each edge ei has a type Ei associated with it
  Thus, each path p can be associated with a string <E1, E2, E3, …, En>
  Different strings correspond to different path types
  Associate a weight with each string
Different models are also possible
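The Path Type Model described above can be sketched in a few lines: a path is mapped to the sequence of its edge types, and two paths share a type exactly when those sequences are equal. The edge representation, the edge-type labels ("writes", "affiliated"), and the toy paths are assumptions made for this illustration.

```python
from typing import Dict, Sequence, Tuple

Edge = Tuple[str, str]  # (one endpoint, other endpoint); a hypothetical representation


def path_type(path: Sequence[Edge], edge_type: Dict[Edge, str]) -> Tuple[str, ...]:
    """Map a path <e1, ..., en> to its type string <E1, ..., En>."""
    return tuple(edge_type[e] for e in path)


# Hypothetical edge types for a tiny publications graph.
EDGE_TYPE = {
    ("paper1", "author1"): "writes",
    ("paper2", "author1"): "writes",
    ("author1", "org1"): "affiliated",
}

p1 = [("paper1", "author1"), ("author1", "org1")]
p2 = [("paper2", "author1"), ("author1", "org1")]

# Different edges, same sequence of edge types, hence the same path type.
same_type = path_type(p1, EDGE_TYPE) == path_type(p2, EDGE_TYPE)
```

Using a tuple as the type string makes it hashable, so it can serve directly as a dictionary key when a weight is attached to each path type.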
15
Learning Path Weights: Optimization Problem
The CAP principle states that the right option will be better connected.
Linear programming
Learn the path-type weights wi.
16
Final Solution
The value of c(xr,yrj) − c(xr,yrl) should be maximized for all r and all l ≠ j, where yrj is the right option for reference r.
Then the final solution:
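One way to picture the optimization is the sketch below. It is a simplification under stated assumptions, not necessarily the formulation used in the paper: it assumes c(u,v) is linear in the weights, that the objective is the sum of all differences c(xr,yrj) − c(xr,yrl), and that the only constraints are 0 ≤ wi ≤ 1. Under those assumptions the linear program separates per weight and can be solved by inspecting the sign of each objective coefficient. The training vectors are made up.

```python
from typing import List, Optional, Sequence, Tuple

# One training item: (count vector of the right option, count vectors of the wrong options).
TrainingItem = Tuple[Sequence[int], Sequence[Sequence[int]]]


def objective_coefficients(training: Sequence[TrainingItem], n_types: int) -> List[int]:
    """Sum (T_right - T_wrong) over all references r and all wrong options l != j."""
    d = [0] * n_types
    for right, wrongs in training:
        for wrong in wrongs:
            for i in range(n_types):
                d[i] += right[i] - wrong[i]
    return d


def solve_box_lp(d: Sequence[int]) -> List[Optional[float]]:
    """Maximize sum(d[i] * w[i]) subject to 0 <= w[i] <= 1.

    Returns 1.0 where d[i] > 0, 0.0 where d[i] < 0, and None where d[i] == 0,
    meaning any value in [0, 1] is optimal for that weight.
    """
    return [1.0 if di > 0 else 0.0 if di < 0 else None for di in d]


# Hypothetical training data with four path types.
TRAINING = [
    ([2, 1, 0, 0], [[1, 1, 1, 0]]),
    ([1, 0, 0, 0], [[0, 0, 0, 2]]),
]
d = objective_coefficients(TRAINING, 4)
weights = solve_box_lp(d)
```

With additional constraints the problem no longer separates per weight, and a general-purpose LP solver would be needed instead of the sign rule.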
17
Example: Graph
P1 = e1-e3-e
P2 = e1-e1-e3
P3 = e1-e2-e2-e
P4 = e1-e2-e3-e2-e3
18
Example: Solution
w1 = 1
w3 = w4 = 0
w2 can be anything between 0 and 1.
19
Overview
Intro to Data Cleaning
RelDC Framework
  Past work
Adapting to data
  The new part
  Reduction to an optimization problem
  Linear programming
Experiments
20
Experimental Setup
Parameters
  When looking for L-short simple paths, L = 5
  L is the path-length limit
RealMov:
  movies (12K)
  people (22K): actors, directors, producers
  studios (1K): producing, distributing
  ground truth is known
SynPub datasets:
  many datasets of five different types
  emulation of RealPub
  publications (5K)
  authors (1K)
  organizations (25K)
  departments (125K)
  ground truth is known
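The path-length limit L can be illustrated with a depth-first enumeration of simple paths. This sketch assumes that an "L-short simple path" is a path with at most L edges that visits no node twice; the adjacency-list format, the function name, and the toy graph are hypothetical.

```python
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def l_short_simple_paths(
    graph: Graph, source: Hashable, target: Hashable, max_len: int
) -> List[List[Hashable]]:
    """Enumerate all simple paths from source to target with at most max_len edges."""
    results: List[List[Hashable]] = []

    def extend(path: List[Hashable]) -> None:
        node = path[-1]
        if node == target:
            results.append(list(path))
            return
        if len(path) - 1 == max_len:  # edge budget used up
            return
        for nxt in graph.get(node, []):
            if nxt not in path:  # keep the path simple (no repeated nodes)
                path.append(nxt)
                extend(path)
                path.pop()

    extend([source])
    return results


# Hypothetical undirected toy graph, stored as adjacency lists.
TOY = {
    "x": ["a", "b"],
    "a": ["x", "y", "b"],
    "b": ["x", "a", "y"],
    "y": ["a", "b"],
}
paths = l_short_simple_paths(TOY, "x", "y", 2)
```

Raising the limit from 2 to 3 on this toy graph adds the two longer detours through both intermediate nodes, which shows how quickly the number of paths grows with L.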
21
Experimental Results on Movies
Parameters:
  Fraction: the fraction of uncertain references in the dataset
  Each reference has 2 choices
22
Experimental Results on Movies - II
Number of options based on a PMF distribution
23
Experimental Results on SynPub
RandomWalk, PTM, and the Hybrid Model have the same accuracy.
Is RandomWalk the optimal model for the Publications domain?
Hybrid Model:
24
Effect of Random Relationships in the Publications Domain
25
Summary
Main contribution
  An adaptive solution for connection strength
  The model learns the weights of different path types
Ongoing work
  Using different models to learn the importance of paths in the connection strength
  Using standard machine learning techniques, such as decision trees, for learning
  Different ways to classify paths
26
Contact Information
RelDC project (RESCUE)
Rabia Nuray-Turan (contact author)
Dmitri V. Kalashnikov
Sharad Mehrotra
27
Thank you!