Progressive Approach to Relational Entity Resolution
Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra

Presentation transcript:

Progressive Approach to Relational Entity Resolution
Yasser Altowim, Dmitri Kalashnikov, Sharad Mehrotra

Progressive ER

Relational Dataset

Paper
Id  Title                                     Authors     Venue
p1  Transaction Support in Read Optimized …   {a1, a2}    u1
p2  Read Optimized File System Designs: …     {a1}        u2
p3  Transaction Support in Read Optimized …   {a3, a4}    u3
p4  Berkeley DB: A Retrospective..            {a3}        u4

Author
Id  Name                  Papers
a1  Marge Seltzer         {p1, p2}
a2  Michael Stonebraker   {p1}
a3  Margo I. Seltzer      {p3, p4}
a4  M. Stonebraker        {p3}

Venue
Id  Name                  Papers
u1  Very Large Data Bases {p1}
u2  ICDE Conference       {p2}
u3  VLDB                  {p3}
u4  IEEE Data Eng. Bull   {p4}

Graph Representation [figure: graph with nodes for the pairs (p1, p3) and (u1, u3); labels: Resolve, duplicate]

Problem Definition
• Given a relational dataset D and a cost budget BG,
• our goal is to develop a progressive approach that produces a high-quality result using BG units of cost.

ER Graph [figure: blocks R1, R2, S1, S2, T1, T2]

ER Graph [figure: blocks R1, R2, S1, S2, T1, T2; nodes v1 to v12]

Partially Constructed Graph [figure: blocks R1, R2, S1, S2, T1, T2; nodes v1 to v12]

Resolution Windows
Windows: Window 1, Window 2, …, Window n (budget BG)
Each window:
1. Plan Generation.
2. Plan Execution.
Resolution Plan:
• Set of blocks to be instantiated.
• Set of nodes to be resolved.
Lazy Resolution Strategy
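
The window structure on this slide can be sketched as a simple loop. This is only an illustration under stated assumptions, not the authors' implementation: the function names `run_progressive`, `generate_plan`, and `execute_plan`, the fixed per-window allowance `window_cost`, and the integer cost units are all hypothetical. The slide only states that the budget BG is consumed by a sequence of windows, each with a plan generation and a plan execution phase.

```python
from typing import Callable, List, Tuple

# A plan is (blocks to instantiate, nodes to resolve), as on the slide.
Plan = Tuple[List[str], List[str]]


def run_progressive(
    budget: int,
    window_cost: int,
    generate_plan: Callable[[int], Plan],
    execute_plan: Callable[[Plan], int],
) -> int:
    """Run resolution windows until the budget is spent.

    Returns the number of windows that were executed.
    All names and the fixed window allowance are assumptions.
    """
    spent = 0
    windows = 0
    while spent < budget:
        # Never give a window more than what is left of the budget.
        allowance = min(window_cost, budget - spent)
        plan = generate_plan(allowance)      # phase 1: plan generation
        blocks, nodes = plan
        if not blocks and not nodes:
            break                            # nothing left to do
        used = execute_plan(plan)            # phase 2: plan execution
        if used <= 0:
            break                            # guard against a stalled plan
        spent += used
        windows += 1
    return windows


# Toy usage: every node costs one unit, and a plan fills its allowance.
def demo_generate(allowance: int) -> Plan:
    return ([], ["v"] * allowance)


def demo_execute(plan: Plan) -> int:
    return len(plan[1])


demo_windows = run_progressive(10, 4, demo_generate, demo_execute)
```

With a budget of 10 and an allowance of 4 per window, the toy run uses three windows (4, 4, and 2 units).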

Plan Cost and Benefit

Node Benefit [figure: nodes v1 to v6 with their state; labels: Direct Benefit, Indirect Benefit]

Plan Generation Phase
1. Benefit-vs-Cost Analysis:
• Each node and block has an updated cost and benefit.
2. Generate a plan such that:
• […]
• […] is maximized.
NP-hard (Oregon-Trail Knapsack)
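
Because the slide labels plan generation as NP-hard, a heuristic is needed in practice. The sketch below uses the standard greedy benefit-per-cost heuristic for the plain knapsack problem. It is a swapped-in technique for illustration, not the plan generation algorithm from the talk, and it ignores the dependency between nodes and the blocks that must be instantiated first. The `Candidate` type, `greedy_plan`, and the example costs and benefits are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    """A node or block with its current cost and benefit estimate."""
    name: str
    cost: float
    benefit: float


def greedy_plan(candidates: List[Candidate], window_budget: float) -> List[Candidate]:
    """Pick candidates by descending benefit/cost ratio within the budget.

    Assumption: a plain greedy knapsack heuristic, not the authors' method.
    Candidates with non-positive cost are skipped to avoid division by zero.
    """
    usable = [c for c in candidates if c.cost > 0]
    ranked = sorted(usable, key=lambda c: c.benefit / c.cost, reverse=True)
    plan: List[Candidate] = []
    remaining = window_budget
    for c in ranked:
        if c.cost <= remaining:
            plan.append(c)
            remaining -= c.cost
    return plan


# Toy usage with made-up numbers.
demo_candidates = [
    Candidate("v3", cost=4, benefit=4),   # ratio 1
    Candidate("v1", cost=2, benefit=10),  # ratio 5
    Candidate("v2", cost=3, benefit=9),   # ratio 3
]
demo_plan = [c.name for c in greedy_plan(demo_candidates, window_budget=5)]
```

Greedy selection by ratio is fast but not optimal in general, which is one reason a dedicated algorithm is presented on the following slides.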

Plan Generation Algorithm [figure: Step #1 and Step #2; Uninstantiated Blocks: R1, R2, R4, R5, R6, R8, R9; Instantiated Unresolved Nodes: v1, v2, v4, v6, v7, v10, v13, v15, v16, v21; second node list: v1, v2, v6, v10, v16]

Plan Generation Algorithm [figure: Step #3, with an "If … > … else … return … and …" comparison whose operands are missing; blocks: R1, R8, R6, R2, …; node lists: v1, v2, v6, v10, v16 and v1, v2, v10, v30, v32, v34, v36, v38, v40, v42, v45, v47, v48]

Experimental Evaluation

CiteSeerX Dataset
1. Papers (P) = ( Title, Abstract, Keywords, Authors, Venue ).
2. Authors (A) = ( Name, […], Affiliation, Address, Paper ).
3. Venues (U) = ( Name, Year, Pages, Papers ).

     Number of Entities   Blocking Functions   Similarity Functions   Resolve Function
P    30,000               2                    3                      Naïve Bayes
A    83,152               1                    4                      Naïve Bayes
U    30,000               1                    3                      Naïve Bayes

Experimental Evaluation
Algorithms:
1. DepGraph: X. Dong et al. Reference reconciliation in complex information spaces. SIGMOD.
2. Static: S. E. Whang et al. Joint entity resolution. ICDE.
3. Full: No lazy resolution strategy.
4. Random: Lazy resolution strategy but with random order.
[figure: block orderings for R (R1, R4, R5, …), T (T6, T1, T3, …), and S (S2, S6, S5, …)]

Time vs. Recall
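
For readers unfamiliar with the metric on this slide, pairwise recall at a given time is the fraction of true duplicate pairs identified so far. The sketch below computes a recall-over-time curve from a log of resolved pairs. The function name `recall_curve`, the event log format, and the timestamps are hypothetical. The two ground-truth pairs are the ones marked as duplicates in the Graph Representation slide.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def recall_curve(
    events: Iterable[Tuple[float, Pair]],
    truth: Iterable[Pair],
) -> List[Tuple[float, float]]:
    """Return (time, recall) after each pair is declared a duplicate.

    Pairs are treated as unordered, so (p1, p3) equals (p3, p1).
    """
    truth_set: Set[FrozenSet[str]] = {frozenset(p) for p in truth}
    if not truth_set:
        raise ValueError("ground truth must contain at least one pair")
    found: Set[FrozenSet[str]] = set()
    curve: List[Tuple[float, float]] = []
    for t, pair in sorted(events, key=lambda e: e[0]):
        key = frozenset(pair)
        if key in truth_set:
            found.add(key)  # only correct matches raise recall
        curve.append((t, len(found) / len(truth_set)))
    return curve


# Toy usage: made-up timestamps, one incorrect match at t = 2.0.
demo_truth = [("p1", "p3"), ("u1", "u3")]
demo_events = [
    (1.0, ("p3", "p1")),
    (2.0, ("p1", "p2")),
    (3.0, ("u1", "u3")),
]
demo_curve = recall_curve(demo_events, demo_truth)
```

A progressive approach aims to make this curve rise as early as possible, rather than only reaching high recall at the end of the run.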

Lazy Resolution with Workflow

Execution Time (sec)   Our Approach   Random    Full
Plan Generation        4.76%          3.81%     2.58%
Plan Execution         95.11%         96.17%    97.40%

Execution Time (sec)   Our Approach   Random    Full
Plan Generation        4.76%          3.81%     2.58%
Reading Blocks         4.70%          3.75%     2.90%
Graph Creation         8.40%          6.25%     4.72%
Node Resolution        82.01%         86.17%    89.78%

Plan execution steps:
• Reading Blocks.
• Creating Nodes.
• Resolving Nodes.

Conclusion
• Progressive Approach to Relational ER.
• Cost and benefit model for generating a resolution plan.
• Lazy resolution strategy to resolve nodes with the least amount of cost.
• Experiments on publication and synthetic datasets to demonstrate the efficiency of our approach.

Questions