Self-tuning in Graph-Based Reference Disambiguation
Rabia Nuray-Turan, Dmitri V. Kalashnikov, Sharad Mehrotra
Computer Science Department, University of California, Irvine
DASFAA 2007, Bangkok, Thailand
11 January 2019

Overview
  Intro to Data Cleaning
    Entity resolution
  RelDC Framework (past work)
  Adapting to data (the new part)
    Reduction to an optimization problem
    Linear programming
  Experiments
Data Cleaning
  Analysis of bad data leads to wrong conclusions.
Example of the Problem: CiteSeer Top-K
  CiteSeer: the top-k most cited authors
  Suspicious entries
    Let's go to the DBLP website, which stores bibliographic entries of many CS authors.
    Let's check two people: “A. Gupta” and “L. Zhang”.
    They are in the top 20 because many different authors share each of these names.
  [Figures: CiteSeer top-k most cited authors; DBLP]
Two Most Common Entity-Resolution Challenges
  Fuzzy lookup (reference disambiguation)
    Match references to objects
    The list of all objects is given
  Fuzzy grouping
    Group together object representations that correspond to the same object
Standard Approach to Entity Resolution
RelDC Framework
RelDC Framework
  Past work: SDM’05, TODS’06
  Domain-independent framework
    Views the dataset as an entity-relationship graph
    Analyzes paths in this graph
  Solid theoretical foundation
    Optimization problem
  Scales to large datasets
  Robust under uncertainty
  High disambiguation quality
  No self-tuning
    This paper addresses this limitation
Entity-Relationship Graph
  Choice node
    For uncertain references
    Encodes the options/possibilities yr1, …, yrN
  Among the options yr1, …, yrN
    Pick the most strongly connected one (CAP principle)
    Analyze the paths in G that exist between xr and yrj, for all j
    Use a model to measure connection strength
  “Connection strength” model
    c(u,v), for nodes u and v in G: how strongly u and v are connected in G
    In past work: RandomWalk-based, fixed, and based on intuition
    This paper, instead, learns such a model from data
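The option-selection step above can be sketched in a few lines. This is only an illustration under stated assumptions, not the RelDC implementation: the function name `pick_option`, the option labels, and the toy strength values are hypothetical, and the connection-strength model c(u,v) is passed in as a black box.

```python
from typing import Callable, Hashable, Sequence


def pick_option(
    x_r: Hashable,
    options: Sequence[Hashable],
    strength: Callable[[Hashable, Hashable], float],
) -> Hashable:
    """Return the option y_rj with the largest connection strength c(x_r, y_rj)."""
    if not options:
        raise ValueError("a choice node needs at least one option")
    return max(options, key=lambda y: strength(x_r, y))


# Hypothetical strengths for one uncertain reference x1 with three options.
toy_strengths = {("x1", "y11"): 0.2, ("x1", "y12"): 0.7, ("x1", "y13"): 0.1}
best = pick_option("x1", ["y11", "y12", "y13"], lambda u, v: toy_strengths[(u, v)])
print(best)
```

Keeping the strength model as a parameter means a fixed model and a learned one can be swapped without touching the selection logic.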
Adaptive Solution
  Classify the paths found in the graph into a finite set of path types ST = {T1, T2, …, TN}.
  If paths p1 and p2 are of the same type, they are treated as identical.
  The connection between nodes u and v can then be described by a path-type count vector Tuv = {c1, c2, …, cN}, where ci is the number of paths of type Ti between u and v.
  If each path type Ti can be associated with a weight wi, then the connection strength will be: c(u,v) = c1·w1 + c2·w2 + … + cN·wN
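As a small worked example, the sketch below computes a connection strength from a path-type count vector and a weight vector. It assumes the connection strength is the weighted sum of the path-type counts; the function name `connection_strength` and all numbers are hypothetical.

```python
from typing import Sequence


def connection_strength(counts: Sequence[int], weights: Sequence[float]) -> float:
    """Weighted sum of path-type counts: c(u, v) = sum_i c_i * w_i (assumed form)."""
    if len(counts) != len(weights):
        raise ValueError("counts and weights must cover the same path types")
    return sum(c * w for c, w in zip(counts, weights))


# Hypothetical example with N = 3 path types.
t_uv = [2, 0, 1]        # two paths of type T1, none of type T2, one of type T3
w = [0.5, 0.25, 1.0]    # hypothetical weights w1, w2, w3
strength = connection_strength(t_uv, w)
print(strength)
```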
Problems to Answer
  How will we classify the paths?
  How will we associate each path type with a weight?
Classifying Paths
  Path Type Model (PTM)
    Views each path as a sequence of edges <e1, e2, e3, …, en>
    Each edge ei has a type Ei associated with it
    Thus, each path p can be associated with a string <E1, E2, E3, …, En>
    Different strings correspond to different path types
    Each string is associated with a weight
  Different models are also possible
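The path-typing idea above can be illustrated as follows. This is a sketch, not the authors' code: the edge identifiers, the edge-type names ("writes", "affiliated"), and the helper names `path_type` and `count_path_types` are all hypothetical. Each path is mapped to the ordered tuple of its edge types, and paths with the same tuple are counted together.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple

PathType = Tuple[str, ...]


def path_type(path: Sequence[str], edge_type: Dict[str, str]) -> PathType:
    """Map a path <e1, ..., en> to its type string <E1, ..., En>."""
    return tuple(edge_type[e] for e in path)


def count_path_types(
    paths: Sequence[Sequence[str]], edge_type: Dict[str, str]
) -> Counter:
    """Count how many of the given paths fall into each path type."""
    return Counter(path_type(p, edge_type) for p in paths)


# Hypothetical edges and their types.
edge_type = {"a1": "writes", "a2": "writes", "b1": "affiliated"}
paths = [["a1", "b1"], ["a2", "b1"], ["b1", "a1"]]
type_counts = count_path_types(paths, edge_type)
print(type_counts)
```

Note that this sketch treats the type string as ordered, so a path and its reversal get different types; whether that is desirable depends on the model.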
Learning Path Weights: Optimization Problem
  The CAP principle states that the right option will be better connected.
  Linear programming
  Learn the path-type weights w.
Final Solution
  The value of c(xr,yrj) - c(xr,yrl) should be maximized for every reference r and every l ≠ j, where yrj is the right option for r.
  Then the final solution: [equation not preserved in this transcript]
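One simplified reading of this objective can be sketched in code. The assumptions are mine, not the paper's: c(u,v) is taken to be the weighted sum of path-type counts, every weight is bounded to [0, 1], and there are no slack variables or further constraints, so the actual linear program may differ. Under these assumptions the objective is a linear function of the weights with box bounds only, so it can be maximized one weight at a time without an LP solver. The helper names and the training data are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

# One training reference: (count vector of the right option,
#                          count vectors of the wrong options).
Example = Tuple[Sequence[int], Sequence[Sequence[int]]]


def objective_coefficients(examples: Sequence[Example], n_types: int) -> List[int]:
    """Coefficient of each w_i in sum_r sum_{l != j} [c(x_r, y_rj) - c(x_r, y_rl)]."""
    coeffs = [0] * n_types
    for right, wrongs in examples:
        for wrong in wrongs:
            for i in range(n_types):
                coeffs[i] += right[i] - wrong[i]
    return coeffs


def solve_box_lp(coeffs: Sequence[int]) -> List[Optional[float]]:
    """Maximize sum_i coeffs[i] * w_i subject to 0 <= w_i <= 1.

    A positive coefficient pushes w_i to 1, a negative one to 0. None marks
    a zero coefficient, where any value in [0, 1] is optimal.
    """
    return [1.0 if a > 0 else 0.0 if a < 0 else None for a in coeffs]


# Hypothetical training data with N = 4 path types.
examples = [
    ([2, 1, 0, 0], [[0, 1, 1, 0]]),  # reference 1: right option vs. one wrong option
    ([1, 0, 0, 0], [[0, 0, 0, 2]]),  # reference 2
]
coeffs = objective_coefficients(examples, n_types=4)
weights = solve_box_lp(coeffs)
print(coeffs, weights)
```

With this made-up data, path types that occur more often on paths to the right option get weight 1, those that occur more often on paths to wrong options get weight 0, and a type that does not discriminate is left free.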
Example: Graph
  P1 = e1-e3-e1
  P2 = e1-e1-e3
  P3 = e1-e2-e2-e3
  P4 = e1-e2-e3-e2-e3
Example: Solution
  w1 = 1
  w3 = w4 = 0
  w2 can be anything between 0 and 1.
Experimental Setup
  Parameters
    When looking for L-short simple paths, L = 5
    L is the path-length limit
  RealMov
    movies (12K)
    people (22K): actors, directors, producers
    studios (1K): producing, distributing
    ground truth is known
  SynPub datasets
    many datasets of five different types
    emulation of RealPub
    publications (5K), authors (1K), organizations (25K), departments (125K)
    ground truth is known
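To make the "L-short simple paths" parameter concrete, here is a minimal depth-first enumeration of simple paths between two nodes with a length limit. It assumes L counts edges and that the graph is given as undirected adjacency lists; the function name `l_short_simple_paths` and the toy graph are hypothetical, and the search used in the experiments may be organized differently.

```python
from typing import Dict, Hashable, Iterator, List

Graph = Dict[Hashable, List[Hashable]]  # adjacency lists of an undirected graph


def l_short_simple_paths(
    graph: Graph, source: Hashable, target: Hashable, max_len: int
) -> Iterator[List[Hashable]]:
    """Yield every simple path from source to target with at most max_len edges."""
    path = [source]
    on_path = {source}

    def extend(node: Hashable) -> Iterator[List[Hashable]]:
        if node == target:
            yield list(path)
            return
        if len(path) - 1 == max_len:  # length limit reached, stop extending
            return
        for nxt in graph.get(node, []):
            if nxt not in on_path:    # simple path: never revisit a node
                path.append(nxt)
                on_path.add(nxt)
                yield from extend(nxt)
                path.pop()
                on_path.remove(nxt)

    yield from extend(source)


# Hypothetical toy graph with two routes from "a" to "d".
graph = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["a", "e"],
    "e": ["c", "d"],
    "d": ["b", "e"],
}
short_paths = list(l_short_simple_paths(graph, "a", "d", max_len=2))
all_paths = list(l_short_simple_paths(graph, "a", "d", max_len=5))
print(short_paths, all_paths)
```

The limit matters because the number of simple paths can grow very quickly with path length, so a small L keeps the search tractable.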
Experimental Results on Movies
  Parameters
    Fraction: the fraction of uncertain references in the dataset
    Each reference has 2 choices
Experimental Results on Movies II
  The number of options is based on a PMF distribution
Experimental Results on SynPub
  RandomWalk, PTM, and the Hybrid Model have the same accuracy.
  Is RandomWalk the optimal model for the Publications domain?
  Hybrid Model: [definition not preserved in this transcript]
Effect of Random Relationships in the Publications Domain
Summary
  Main contribution
    An adaptive solution for connection strength
    The model learns the weights of different path types
  Ongoing work
    Using different models to learn the importance of paths in the connection strength
    Using standard machine learning techniques, such as decision trees, for learning
    Different ways to classify paths
Contact Information
  RelDC project
    www.ics.uci.edu/~dvk/RelDC
    www.itr-rescue.org (RESCUE)
  Rabia Nuray-Turan (contact author)
    www.ics.uci.edu/~rnuray
  Dmitri V. Kalashnikov
    www.ics.uci.edu/~dvk
  Sharad Mehrotra
    www.ics.uci.edu/~sharad
Thank you!