Entity Matching: How Similar Is Similar?


Presentation transcript:

Entity Matching: How Similar Is Similar?
Jiannan Wang (Tsinghua, China), Guoliang Li (Tsinghua, China), Jeffrey Xu Yu (CUHK, HK, China), Jianhua Feng (Tsinghua, China)

Entity Matching
Find records referring to the same entity.
2018/11/14 SiFi @ VLDB2011

Rule-based Method
An example of a rule: similar name and the same tel ==> the same entity

Rule-based Method: Advantages
Explainable, programmable, efficient.

Rule-based Method: Problems
Generating record-matching rules: expert's experience; reasoning about record-matching rules (PVLDB’09).
Supporting approximate-matching conditions: similarity joins, e.g. SS-Join (ICDE’06).
How similar is similar? What does "similar name" mean in the rules below?
① similar name and the same tel ==> the same entity
② the same address ==> the same tel
③ similar name and the same address ==> the same entity

How Similar Is Similar?
similar name iff sim(name1, name2) ≥ θ, where sim is a similarity function and θ is a threshold.
Example: S1 = “Jeffrey Yi”, S2 = “Jeffery Yi”
sim = Jaccard, θ = 0.7: Jaccard(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2| = 1/3 < 0.7, so the names are not similar.
sim = ES (edit similarity), θ = 0.7: ES(S1, S2) = 1 - ED(S1, S2) / max(|S1|, |S2|) = 0.8 ≥ 0.7, so the names are similar. ED denotes the edit distance.
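
The two numbers in the example can be checked with a minimal sketch. It assumes that Jaccard is computed over whitespace-separated tokens and that ED is the Levenshtein distance over characters; the function names are illustrative and not taken from the slides.

```python
def jaccard(s1: str, s2: str) -> float:
    """Token-based Jaccard similarity: |A ∩ B| / |A ∪ B| (assumes whitespace tokens)."""
    a, b = set(s1.split()), set(s2.split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (c1 != c2)))  # substitution or match
        prev = cur
    return prev[-1]


def edit_similarity(s1: str, s2: str) -> float:
    """ES(s1, s2) = 1 - ED(s1, s2) / max(|s1|, |s2|)."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0
    return 1 - edit_distance(s1, s2) / longest


s1, s2 = "Jeffrey Yi", "Jeffery Yi"
print(jaccard(s1, s2) >= 0.7)          # False: 1/3 is below the threshold
print(edit_similarity(s1, s2) >= 0.7)  # True: 0.8 meets the threshold
```

With the same threshold, the two functions disagree on this pair, which is the motivation for choosing the function and the threshold together.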

How Similar Is Similar? (Cont’d)
Challenges: for each record-matching rule (e.g. "similar address and the same email", "similar name and the same tel"), a similarity function (e.g. edit similarity, Jaccard similarity) and a threshold (e.g. 0.64, 0.72) must be chosen.

Outline
SiFi Problem Formulation
From Infinite Threshold to Finite Threshold
Eliminating Redundancy
Algorithms for SiFi Problem
Experiment
Conclusion

Attribute-matching Rule (AR)
Explicit attribute-matching rule (eAR) λe: (a, f, θ), where a is an attribute, f is a similarity function, and θ is a threshold.
Records r, r’ satisfy λe iff f(r[a], r’[a]) ≥ θ.
Example: λe: (name, Jacc, 0.8) over a table with columns RID, name, …, containing r1 = "Jeffery Yi", r2 = "Jeffrey Yi", and r3. Record pairs are labeled as satisfying or not satisfying λe.

Attribute-matching Rule (AR)
Explicit attribute-matching rule (eAR) λe: (a, f, θ), where a is an attribute, f is a similarity function, and θ is a threshold. Values r[a], r’[a] satisfy λe iff f(r[a], r’[a]) ≥ θ.
Implicit attribute-matching rule (iAR) λi: (a, F, Θ), where F is a set of similarity functions and Θ is a range of thresholds. λe is an instance of λi iff f ∈ F and θ ∈ Θ.
Example: λe: (name, Jacc, 0.8) is an instance of λi: (name, {Jacc, ES}, [0,1]).
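
The two rule types can be written down as small data structures. This is a sketch under assumptions: the class names `EAR` and `IAR`, the registry `SIM_FUNCS`, and the token-based Jaccard function are hypothetical and only illustrate the definitions above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Tuple


def token_jaccard(s1: str, s2: str) -> float:
    """Jaccard similarity over whitespace tokens (an assumed tokenization)."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b) if (a or b) else 1.0


# Hypothetical registry from function name to implementation.
# Only Jacc is implemented here; ES would be registered the same way.
SIM_FUNCS: Dict[str, Callable[[str, str], float]] = {"Jacc": token_jaccard}


@dataclass(frozen=True)
class EAR:
    """Explicit attribute-matching rule (a, f, θ)."""
    attr: str
    fn_name: str
    theta: float


@dataclass(frozen=True)
class IAR:
    """Implicit attribute-matching rule (a, F, Θ), with Θ as a closed range."""
    attr: str
    fn_names: FrozenSet[str]
    theta_range: Tuple[float, float]

    def has_instance(self, e: EAR) -> bool:
        """λe is an instance of λi iff same attribute, f ∈ F and θ ∈ Θ."""
        lo, hi = self.theta_range
        return (e.attr == self.attr
                and e.fn_name in self.fn_names
                and lo <= e.theta <= hi)


def satisfies(r: dict, r2: dict, e: EAR, funcs=SIM_FUNCS) -> bool:
    """Records r, r2 satisfy λe iff f(r[a], r2[a]) ≥ θ."""
    return funcs[e.fn_name](r[e.attr], r2[e.attr]) >= e.theta


rule = EAR("name", "Jacc", 0.8)
r1, r2 = {"name": "Jeffery Yi"}, {"name": "Jeffrey Yi"}
print(satisfies(r1, r2, rule))  # False: token Jaccard is 1/3
print(IAR("name", frozenset({"Jacc", "ES"}), (0.0, 1.0)).has_instance(rule))  # True
```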

Record-matching Rule (RR)
A conjunction of ARs, φ: λ1 ∧ λ2 ∧ … ∧ λk
Rule set Φ: φ1 = λ1i ∧ λ2e, φ2 = λ3e ∧ λ4i, φ3 = λ1i ∧ λ4i ∧ λ5e
An instance Ψ of Φ: ψ1 = λ1e ∧ λ2e, ψ2 = λ3e ∧ λ4e, ψ3 = λ1e ∧ λ4e ∧ λ5e, where λ1e: (name, fe, 0.7) and λ4e: (addr, fj, 0.8)

Evaluate the Quality of Ψ
General function: F(Ψ, M, D), where Ψ is an instance of Φ, M is a set of positive examples, and D is a set of negative examples.
Property: let MΨ denote the record pairs that satisfy Ψ. The larger MΨ ∩ M, the larger F(Ψ, M, D); the smaller MΨ ∩ D, the larger F(Ψ, M, D).
This subsumes many well-known functions, such as accuracy rate and F-measure.

SiFi Problem Formulation
SiFi: similarity function identification in implicit record-matching rules for effective entity matching.
Input: Φ, a set of RRs; M, a set of positive examples; D, a set of negative examples.
Output: Ψ, an instance of Φ that maximizes F(Ψ, M, D).
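
To make the objective concrete, the sketch below scores one instance Ψ with the standard F-measure. It rests on assumptions the transcript does not confirm: a record pair satisfies Ψ when it satisfies at least one rule ψ in it, each example pair is given as precomputed similarity values keyed by (attribute, function), and the toy data and names (`pair_matches`, `f_measure`, `"eq"`) are hypothetical.

```python
from typing import Dict, List, Tuple

Sims = Dict[Tuple[str, str], float]     # (attribute, function) -> similarity
Rule = List[Tuple[str, str, float]]     # conjunction of eARs (a, f, θ)


def pair_matches(sims: Sims, rule_set: List[Rule]) -> bool:
    """A pair satisfies Ψ if it satisfies every eAR of at least one rule (assumption)."""
    return any(all(sims[(a, f)] >= t for a, f, t in rule) for rule in rule_set)


def f_measure(rule_set: List[Rule], M: List[Sims], D: List[Sims]) -> float:
    """Standard F-measure of the pairs returned by Ψ over examples M and D."""
    tp = sum(pair_matches(s, rule_set) for s in M)  # |MΨ ∩ M|
    fp = sum(pair_matches(s, rule_set) for s in D)  # |MΨ ∩ D|
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / len(M)
    return 2 * precision * recall / (precision + recall)


# Toy examples: "eq" stands for an exact-match function returning 0 or 1.
M = [{("name", "jacc"): 0.9, ("tel", "eq"): 1.0},
     {("name", "jacc"): 0.5, ("tel", "eq"): 1.0}]
D = [{("name", "jacc"): 0.6, ("tel", "eq"): 1.0}]

strict = [[("name", "jacc", 0.7), ("tel", "eq", 1.0)]]
loose = [[("name", "jacc", 0.5), ("tel", "eq", 1.0)]]
print(f_measure(strict, M, D), f_measure(loose, M, D))
```

On this toy data the looser threshold scores higher because the recall gain outweighs the one false positive, so the best threshold depends on the examples.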

From Infinite Threshold to Finite Threshold
A range contains an infinite number of thresholds, e.g. λi: (a, f, [0,1]).
A finite number of thresholds suffices: θ is the upper bound of Θ, or θ = f(r[a], r’[a]) where (r, r’) ∈ M.
Using only this finite set of thresholds still maximizes the objective function F(Ψ, M, D).
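
The reasoning behind the finite set can be sketched as follows: raising any threshold to the next similarity value of a positive example keeps the same positive pairs and can only drop negative pairs. The helper names (`candidate_thresholds`, `snap_up`, `matched`) and the sample values are illustrative assumptions, not the authors' code.

```python
from bisect import bisect_left
from typing import List


def candidate_thresholds(positive_sims: List[float], upper_bound: float = 1.0) -> List[float]:
    """Finite candidates: similarity values of positive examples plus the upper bound of Θ."""
    return sorted(set(positive_sims) | {upper_bound})


def snap_up(theta: float, candidates: List[float]) -> float:
    """Smallest candidate ≥ theta (assumes theta ≤ upper bound, candidates sorted)."""
    k = bisect_left(candidates, theta)
    return candidates[min(k, len(candidates) - 1)]


def matched(sims: List[float], theta: float) -> List[float]:
    """Similarity values of the pairs that pass the threshold."""
    return [s for s in sims if s >= theta]


pos = [0.8, 0.9, 0.73]   # similarities of positive example pairs (toy values)
neg = [0.75, 0.1]        # similarities of negative example pairs (toy values)
cands = candidate_thresholds(pos)

theta = 0.74
snapped = snap_up(theta, cands)
# Same positives are returned, and no additional negatives.
print(matched(pos, theta) == matched(pos, snapped))
print(len(matched(neg, snapped)) <= len(matched(neg, theta)))
```

Because the objective never decreases when more positives or fewer negatives are returned, searching only the candidates loses nothing.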

Example
λi: (name, {fe, fg}, [0, 1])
RPi,j denotes the record pair (ri, rj). For instance, fe(“Jeffrey Yi”, “Jeffery Yi”) = 0.8.
Record pair   fe     fg
RP1,6         0.8    0.5
RP1,7         0.9    0.7
RP2,5         0.73   0.55
RP3,4         0.1    0.09
RP6,7         0.31 (one value missing)
Figures: a collection of records; positive examples.

Two Types of Redundancy
Instances are grouped by similarity function f (groups Gfe and Gfg).
Threshold redundancy: between instances in the same group.
Similarity-function redundancy: between instances in different groups.

Threshold Redundancy
Definition: an instance λei: (a, f, θi) is threshold redundant if ∃ λej: (a, f, θj) ∈ Gf with θi > θj such that there is no negative example in (record pairs that satisfy λej) \ (record pairs that satisfy λei).
Intuitively, if λej returns more positive examples than λei and the same negative examples as λei, then λei is redundant w.r.t. λej.

Naive Solution
Figure: time complexity and an example of the "no negative example in" check.

Our Solution
An instance with a smaller threshold can return more record pairs than one with a larger threshold.
Figure: time complexity and an example of the "no negative example in" check.
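
One way to exploit this monotonicity is sketched below. It is an assumed reading of the idea, not the authors' algorithm: since a smaller threshold returns a superset of pairs, it is enough to compare each candidate threshold with the next smaller one in the same group. The function name and the toy values are hypothetical.

```python
from bisect import bisect_left
from typing import List


def non_redundant_thresholds(candidates: List[float], negative_sims: List[float]) -> List[float]:
    """Keep θi unless the next smaller candidate θj adds no negative example.

    θi is threshold redundant when no negative pair has a similarity in [θj, θi).
    Checking only the adjacent smaller candidate suffices: if some smaller θj
    adds no negatives, neither does any candidate between θj and θi.
    """
    cands = sorted(set(candidates))
    negs = sorted(negative_sims)
    kept = []
    for k, theta in enumerate(cands):
        if k == 0:
            kept.append(theta)  # the smallest threshold has no smaller rival
            continue
        lower = cands[k - 1]
        negatives_between = bisect_left(negs, theta) - bisect_left(negs, lower)
        if negatives_between > 0:
            kept.append(theta)
    return kept


print(non_redundant_thresholds([0.73, 0.8, 0.9], [0.75, 0.1]))
```

Here 0.9 is dropped: lowering it to 0.8 admits no negative example, so 0.8 is never worse. After sorting, the scan does one binary search per candidate.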

Similarity-function Redundancy
Definition: an instance λei: (a, fi, θi) ∈ Gfi is similarity-function redundant if ∃ λej ∈ Gfj such that λej returns more positive examples and fewer negative examples.
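
A naive check of this definition can be sketched as below. The sketch interprets "more positive" and "fewer negative" as set containment (a superset of positives and a subset of negatives), which is an assumption, and all names and toy values are hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

Instance = Tuple[str, float]                # (function name, threshold)
SimTable = Dict[str, Dict[str, float]]      # function -> pair id -> similarity


def matched(sims: Dict[str, float], theta: float) -> FrozenSet[str]:
    """Ids of the example pairs whose similarity reaches the threshold."""
    return frozenset(pid for pid, s in sims.items() if s >= theta)


def is_fn_redundant(inst: Instance, others: List[Instance],
                    pos: SimTable, neg: SimTable) -> bool:
    """True if an instance of another function returns at least the same
    positives and at most the same negatives (assumed reading of the definition)."""
    fi, ti = inst
    pi, ni = matched(pos[fi], ti), matched(neg[fi], ti)
    for fj, tj in others:
        if fj == fi:
            continue  # same group: that case is threshold redundancy
        pj, nj = matched(pos[fj], tj), matched(neg[fj], tj)
        if pj >= pi and nj <= ni:
            return True
    return False


pos = {"fe": {"p1": 0.8, "p2": 0.9}, "fg": {"p1": 0.5, "p2": 0.7}}
neg = {"fe": {"n1": 0.1}, "fg": {"n1": 0.6}}
print(is_fn_redundant(("fg", 0.5), [("fe", 0.8)], pos, neg))
```

Two instances that return identical sets dominate each other under this reading, so a real implementation would need a tie-breaking rule to avoid discarding both.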

Naive Solution
Time complexity: 12

Our Solution
An equivalent redundancy condition is checked between groups Gfi and Gfj.
Figure: time complexity and a YES/NO decision.

NP-Hard Problem
The SiFi problem is NP-hard. Proof: by a reduction from the maximum-coverage problem.
One iAR: (name, {f1, f2}, [0,1]) → {(name, f1, 0.85), (name, f1, 0.7), (name, f1, 0.66), …, (name, f2, 0.95), (name, f2, 0.78), …}
Two iARs: (address, {f2, f3}, [0,1]) → {(address, f2, 0.85), (address, f2, 0.7), …, (address, f3, 0.95), (address, f3, 0.78), …}

Heuristic Algorithms
SiFi-Greedy. Idea: the greedy algorithm for the maximum-coverage problem. Algorithm: identify the instance of the best iAR each time.
SiFi-Gradient. Idea: gradient descent. Algorithm: iteratively adjust instances to their neighbors to reach a higher objective value.
SiFi-Hill. Idea: hill climbing. Algorithm: iteratively adjust the instance of a single iAR to any possible instance to reach a higher objective value.
Limitations noted on the slide: neglecting the interaction among different iARs; only considering neighbor instances.
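
The hill-climbing idea can be sketched as a simple local search. This is not the authors' SiFi-Hill implementation: it assumes one iAR per attribute, an F-measure objective, a rule set that matches a pair when any rule matches, and toy data; all names are hypothetical.

```python
from typing import Dict, List, Tuple

Instance = Tuple[str, float]              # (function name, threshold)
Sims = Dict[Tuple[str, str], float]       # (attribute, function) -> similarity


def matches(pair: Sims, rules: List[List[str]], assign: Dict[str, Instance]) -> bool:
    """A pair matches if every attribute of some rule passes its chosen instance."""
    return any(all(pair[(a, assign[a][0])] >= assign[a][1] for a in rule)
               for rule in rules)


def f_measure(rules, assign, M: List[Sims], D: List[Sims]) -> float:
    tp = sum(matches(p, rules, assign) for p in M)
    fp = sum(matches(p, rules, assign) for p in D)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / len(M)
    return 2 * precision * recall / (precision + recall)


def hill_climb(rules, candidates: Dict[str, List[Instance]], M, D):
    """Change one iAR at a time to any candidate instance while the score improves."""
    assign = {a: cands[0] for a, cands in candidates.items()}
    best = f_measure(rules, assign, M, D)
    improved = True
    while improved:  # terminates: the score strictly increases over a finite space
        improved = False
        for attr, cands in candidates.items():
            for inst in cands:
                trial = {**assign, attr: inst}
                score = f_measure(rules, trial, M, D)
                if score > best:
                    assign, best, improved = trial, score, True
    return assign, best


rules = [["name", "addr"]]  # one rule: similar name and similar address
candidates = {"name": [("jacc", 0.9), ("jacc", 0.5), ("es", 0.8)],
              "addr": [("jacc", 0.8), ("jacc", 0.4)]}
M = [{("name", "jacc"): 0.33, ("name", "es"): 0.8, ("addr", "jacc"): 0.9},
     {("name", "jacc"): 1.0, ("name", "es"): 1.0, ("addr", "jacc"): 0.5}]
D = [{("name", "jacc"): 0.6, ("name", "es"): 0.3, ("addr", "jacc"): 0.9}]

assignment, score = hill_climb(rules, candidates, M, D)
print(assignment, score)
```

Unlike a neighbor-only search, each step may jump to any candidate instance of the chosen iAR, at the cost of evaluating the objective once per candidate.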

Experiment Setup: Data Sets
Cora is a collection of citation entries. Example size: |M| = 14,358, |D| = 170,380. Attributes (9): author, title, venue, address, publisher, editor, date, volume, pages.
Restaurant is a collection of restaurant records. Example size: |M| = 87,492, |D| = 106. Attributes (5): name, address, phone, city, type.
DBGen is a random mailing-list generator. Example size: |M| = 3,071, |D| = 2,426. Attributes (10): ssn, fname, minit, lname, stnum, stadd, apmt, city, state, zip.

Experiment Setup: Record-matching Rule Sets
Figure: rule sets for Cora, DBGen, and Restaurant.

Comparison with Baseline Methods
Our algorithms: SiFi-Greedy, SiFi-Gradient, SiFi-Hill.
Baselines: SiFi-Expert-1, SiFi-Expert-2, SiFi-Expert-3, and SiFi-Equal, which changes “similar name and the same tel” to “the same name and the same tel”.

Comparison with Baseline Methods
1. Our methods perform better.
2. SiFi-Hill performs the best.
3. It is necessary to study “How similar is similar?”
4. It is important to match some attribute values approximately.

Evaluation of Eliminating Redundancy
1. Eliminating redundancy can improve the performance of SiFi-Hill.
2. Optimizing the redundancy-elimination algorithm is necessary.

Comparison with Existing Methods
OPTrees (VLDB’07): an executable operator tree with data cleaning operators, e.g. R Join with Jacc(addr) > 0.8 && ES(Name) > 0.9.
SVM (KDD’03): a record pair → an n|F|-dimensional vector; an SVM classifier with maximum margins.
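
The mapping from a record pair to an n|F|-dimensional vector can be sketched as below, assuming n is the number of attributes and F the set of similarity functions. The two functions and the sample records are placeholders; the classifier itself is not shown.

```python
from typing import Callable, Dict, List, Sequence


def exact(s1: str, s2: str) -> float:
    """1.0 if the two values are identical, else 0.0."""
    return 1.0 if s1 == s2 else 0.0


def token_jaccard(s1: str, s2: str) -> float:
    """Jaccard similarity over whitespace tokens (an assumed tokenization)."""
    a, b = set(s1.split()), set(s2.split())
    return len(a & b) / len(a | b) if (a or b) else 1.0


def pair_to_vector(r1: Dict[str, str], r2: Dict[str, str],
                   attrs: Sequence[str],
                   funcs: Sequence[Callable[[str, str], float]]) -> List[float]:
    """One feature per (attribute, function) combination: n * |F| values."""
    return [f(r1[a], r2[a]) for a in attrs for f in funcs]


r1 = {"name": "Jeffrey Yi", "addr": "1 Main St"}
r2 = {"name": "Jeffery Yi", "addr": "1 Main St"}
vec = pair_to_vector(r1, r2, ["name", "addr"], [exact, token_jaccard])
print(len(vec))  # 2 attributes * 2 functions
```

Such vectors, labeled from M and D, would be the training input of the classifier. The learned model weighs all functions at once, which is why it is harder to explain than a rule with one function and one threshold per attribute.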

Comparison with Existing Methods: Effectiveness
1. SiFi-Hill gets higher values than OpTrees.
2. SVM gets the highest value, but SiFi-Hill and OpTrees are more explainable.

Comparison with Existing Methods: Efficiency

Conclusion
We formulate the problem of “How similar is similar?” for entity matching (SiFi).
We propose efficient methods to detect and eliminate redundancy among similarity functions and thresholds.
We devise three heuristic methods to address the SiFi problem.
Our method performs better than the state-of-the-art method, and is more explainable and efficient than machine-learning-based techniques.

Thanks! Q&A