Efficient Record Linkage in Large Data Sets. Liang Jin, Chen Li, Sharad Mehrotra, University of California, Irvine. DASFAA, Kyoto, Japan, March 2003.

Presentation transcript:

1 Efficient Record Linkage in Large Data Sets Liang Jin, Chen Li, Sharad Mehrotra University of California, Irvine DASFAA, Kyoto, Japan, March 2003

2 Motivation
Correlate data from different data sources (e.g., data integration)
- Data is often dirty
- Needs to be cleansed before being used
Example:
- A hospital needs to merge patient records from different data sources
- They have different formats, typos, and abbreviations

3 Example
Table R
Name            SSN    Addr
Jack Lemmon            Maple St
Harrison Ford          Culver Blvd
Tom Hanks              Main St
…               …      …
Table S
Name            SSN    Addr
Ton Hanks              Main Street
Kevin Spacey           Frost Blvd
Jack Lemon             Maple Street
…               …      …
Find records from different datasets that could be the same entity

4 Another Example
Two citations of the same paper:
- P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): (1981)
- Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

5 Record linkage Problem statement: “Given two relations, identify the potentially matched records — Efficiently and — Effectively”

6 Challenges
How to define good similarity functions?
- Many functions proposed (edit distance, cosine similarity, …)
- Domain knowledge is critical
  - Names: “Wall Street Journal” and “LA Times”
  - Address: “Main Street” versus “Main St”
How to do matching efficiently?
- Offline join version
- Online (interactive) search
  - Nearest search
  - Range search

7 Outline Motivation of record linkage Single-attribute case: two-step approach Multi-attribute linkage Conclusion and related work

8 Single-attribute Case
Given
- two sets of strings, R and S
- a similarity function f between strings (metric space)
  - Reflexive: f(s1,s2) = 0 iff s1 = s2
  - Symmetric: f(s1,s2) = f(s2,s1)
  - Triangle inequality: f(s1,s2) + f(s2,s3) >= f(s1,s3)
- a threshold k
Find: all pairs of strings (r, s) from R and S such that f(r,s) <= k.

9 Nested-loop? Not desirable for large data sets 5 hours for 30K strings!

10 Our 2-step approach Step 1: map strings (in a metric space) to objects in a Euclidean space Step 2: do a similarity join in the Euclidean space

11 Advantages
Applicable to many metric similarity functions
- Use edit distance as an example
- Other similarity functions also tried, e.g., q-gram-based similarity
Open to existing algorithms
- Mapping techniques
- Join techniques

12 Step 1
Map strings into a high-dimensional Euclidean space
[Figure: strings in the metric space mapped to points in the Euclidean space]

13 Example: Edit Distance
A widely used metric to define string similarity
ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example:
s1: Tom Hanks
s2: Ton Hank
ed(s1,s2) = 2
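The definition above can be checked with a short dynamic-programming routine. This is a minimal sketch of the standard Levenshtein computation, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance using two rows of the DP table."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # keep the shorter string in the inner loop
    prev = list(range(len(s2) + 1))  # distance from "" to each prefix of s2
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to ""
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            curr.append(min(prev[j] + 1,          # delete c1
                            curr[j - 1] + 1,      # insert c2
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m->n, delete s
```

Each call costs time proportional to the product of the two string lengths, which is why comparing every pair of strings directly becomes expensive on large sets.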

14 Mapping: StringMap
Input: A list of strings
Output: Points in a high-dimensional Euclidean space that preserve the original distances well
A variation of FastMap
- Each step greedily picks two strings (pivots) to form an axis
- All axes are orthogonal
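The pivot-and-axis idea can be sketched with a generic FastMap-style projection. This is an illustration of the general technique, not the authors' StringMap: the pivot heuristic (farthest from the first item, then farthest from that one) and the use of an absolute value to guard against negative residuals are assumptions made here, and `fastmap` and `dist` are hypothetical names. In the setting of the talk, `dist` would be a string metric such as edit distance.

```python
import math


def fastmap(items, dist, d):
    """Map items to d-dimensional points with a FastMap-style projection."""
    n = len(items)
    if n == 0:
        return []
    coords = [[0.0] * d for _ in range(n)]

    def residual_sq(i, j, axis):
        # Squared distance left after removing the first `axis` coordinates.
        sq = dist(items[i], items[j]) ** 2
        for h in range(axis):
            sq -= (coords[i][h] - coords[j][h]) ** 2
        return abs(sq)  # assumption: absolute value guards against negatives

    for axis in range(d):
        # Greedy pivot choice: farthest from item 0, then farthest from that.
        a = max(range(n), key=lambda i: residual_sq(0, i, axis))
        b = max(range(n), key=lambda i: residual_sq(a, i, axis))
        d_ab_sq = residual_sq(a, b, axis)
        if d_ab_sq == 0:
            break  # nothing left to separate; remaining coordinates stay 0
        d_ab = math.sqrt(d_ab_sq)
        for i in range(n):
            # Projection of item i onto the line through pivots a and b.
            coords[i][axis] = (residual_sq(a, i, axis) + d_ab_sq
                               - residual_sq(b, i, axis)) / (2 * d_ab)
    return coords
```

This sketch recomputes `dist` on every call for brevity; a practical version would cache distances to the pivots.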

15 Can it preserve distances?
Data Sources:
- IMDB star names: 54,000
- German names: 132,000
[Figure: distribution of string lengths]

16 Can it preserve distances?
Use data set 1 (54K names) as an example
k=2, d=20
- Use k’=5.2 to differentiate similar and dissimilar pairs.

17 Choose Dimensionality d
Increase d?
Good:
- better to differentiate similar pairs from dissimilar ones.
Bad:
- Step 1: Efficiency ↓
- Step 2: “curse of dimensionality”

18 Choose dimensionality d using sampling
Sample 1Kx1K strings, find their similar pairs (within distance k)
Calculate the maximum of their new distances, w
Define the “Cost” of finding a similar pair:
$$\text{Cost} = \frac{\#\text{ of pairs within distance } w}{\#\text{ of similar pairs}}$$
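The sampling procedure can be sketched as below. It assumes the cost is the number of sampled pairs whose new distance is within w, divided by the number of truly similar pairs, so that a lower cost means fewer candidates to examine per similar pair. The function name `pair_cost` and the input layout are hypothetical.

```python
def pair_cost(pairs, k):
    """Return (w, cost) for a sample of pairs.

    pairs: list of (original_distance, new_distance) tuples, one per
    sampled pair of strings. k: threshold on the original distance.
    """
    similar = [new for orig, new in pairs if orig <= k]
    if not similar:
        raise ValueError("sample contains no similar pairs")
    w = max(similar)  # largest new distance among the similar pairs
    within_w = sum(1 for _, new in pairs if new <= w)
    return w, within_w / len(similar)


sample = [(1, 2.0), (2, 3.5), (5, 3.0), (8, 9.0)]
print(pair_cost(sample, k=2))  # (3.5, 1.5)
```

Repeating this for several values of d and comparing the resulting costs is one way to pick a dimensionality.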

19 Choose Dimensionality d d=15 ~ 25

20 Choose new threshold k’
Closely related to the mapping property
Ideally, if ed(r,s) <= k, the Euclidean distance between the two corresponding points is <= k’.
Choose k’ using sampling
- Sample 1Kx1K strings, find similar pairs
- Calculate their maximum new distance as k’
- Repeat multiple times, choose their maximum
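The sampling loop for k’ might look like the following sketch. The names `choose_new_threshold`, `dist`, and `embed` are hypothetical: `dist` stands for the original metric (e.g., edit distance) and `embed` for whatever function maps a string to its point. The number of rounds and the fixed seed are assumptions.

```python
import math
import random


def choose_new_threshold(R, S, dist, embed, k,
                         sample_size=1000, rounds=5, seed=0):
    """Estimate k' as the largest new distance seen among similar pairs."""
    rng = random.Random(seed)
    k_new = 0.0
    for _ in range(rounds):
        rs = rng.sample(R, min(sample_size, len(R)))
        ss = rng.sample(S, min(sample_size, len(S)))
        for r in rs:
            for s in ss:
                if dist(r, s) <= k:  # similar in the original space
                    k_new = max(k_new, math.dist(embed(r), embed(s)))
    return k_new
```

Because k’ comes from samples, a similar pair outside the samples can still map to a distance larger than k’, which is one reason recall is reported separately.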

21 New threshold k’ in step 2 d=20

22 Step 2: Similarity Join
Input: Two sets of points in Euclidean space.
Output: Pairs of points whose distance is within the new threshold k’.
Many join algorithms can be used

23 Example
Adopted an algorithm by Hjaltason and Samet.
- Building two R-Trees.
- Traverse two trees, find points whose distance is within k’.
- Pruning during traversal (e.g., using MinDist).
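The pruning test can be illustrated in isolation. This sketch is not the Hjaltason and Samet join itself; it only shows the MinDist check between two axis-aligned bounding rectangles, such as those stored in R-tree nodes. The rectangle layout `(lows, highs)` and the names `min_dist` and `can_prune` are assumptions.

```python
import math


def min_dist(rect_a, rect_b):
    """Smallest Euclidean distance between two axis-aligned rectangles.

    Each rectangle is (lows, highs): two coordinate sequences of equal length.
    """
    total = 0.0
    for lo_a, hi_a, lo_b, hi_b in zip(rect_a[0], rect_a[1],
                                      rect_b[0], rect_b[1]):
        # Gap along this axis; 0 when the two intervals overlap.
        gap = max(lo_a - hi_b, lo_b - hi_a, 0.0)
        total += gap * gap
    return math.sqrt(total)


def can_prune(rect_a, rect_b, k_new):
    """True if no point pair from the two rectangles can be within k_new."""
    return min_dist(rect_a, rect_b) > k_new


a = ((0.0, 0.0), (1.0, 1.0))
b = ((4.0, 5.0), (6.0, 7.0))
print(min_dist(a, b))        # 5.0
print(can_prune(a, b, 4.9))  # True
```

When a pair of nodes is pruned, none of their descendants needs to be compared, which is where the savings over a nested loop come from.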

24 Final processing Among the pairs produced from the similarity-join step, check their edit distance. Return those pairs satisfying the threshold k
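Putting the steps together, the filter-then-verify flow can be sketched as follows. The Euclidean join here is a brute-force stand-in for the tree-based join, used only to keep the example short; `link`, `dist`, and `embed` are hypothetical names, with `dist` standing for the original metric such as edit distance.

```python
import math


def link(R, S, dist, embed, k, k_new):
    """Return pairs (r, s) with dist(r, s) <= k, using a Euclidean filter."""
    pts_r = [(r, embed(r)) for r in R]
    pts_s = [(s, embed(s)) for s in S]
    # Step 2 stand-in: candidate pairs whose points are within k_new.
    candidates = [(r, s)
                  for r, pr in pts_r
                  for s, ps in pts_s
                  if math.dist(pr, ps) <= k_new]
    # Final processing: verify each candidate with the original metric.
    return [(r, s) for r, s in candidates if dist(r, s) <= k]
```

The verification step removes false positives introduced by the mapping; it cannot recover similar pairs that the Euclidean filter missed.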

25 Running time

26 Recall Recall: (#of found similar pairs)/(#of all similar pairs)

27 Outline Motivation of record linkage Single-attribute case: two-step approach Multi-attribute linkage Conclusion and related work

28 Multi-attribute linkage Example: title + name + year Different attributes have different similarity functions and thresholds Consider merge rules in disjunctive format:
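A disjunctive merge rule of this kind might be represented as a list of conjuncts, each pairing an attribute with a threshold. This is a sketch only: the attribute names come from the slide, but the thresholds, the example rule, and the names `matches` and `EXAMPLE_RULE` are hypothetical.

```python
def matches(rec_r, rec_s, rule, dist_fns):
    """Evaluate a rule in disjunctive form on two records.

    rule: list of conjuncts; each conjunct is a list of (attribute, threshold).
    dist_fns: maps each attribute to its own distance function.
    """
    return any(
        all(dist_fns[attr](rec_r[attr], rec_s[attr]) <= threshold
            for attr, threshold in conjunct)
        for conjunct in rule
    )


# Hypothetical rule: (title <= 4 AND year <= 0) OR (name <= 2 AND year <= 1)
EXAMPLE_RULE = [
    [("title", 4), ("year", 0)],
    [("name", 2), ("year", 1)],
]
```

Two records match as soon as one conjunct is fully satisfied, so the order in which conjuncts and attributes are evaluated affects cost but not the result.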

29 Evaluation strategies
Many ways to evaluate rules
Finding an optimal one: NP-hard
Heuristics:
- Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct.
- Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.

30 Summary
- A novel two-step approach to record linkage.
- Many existing mapping and join algorithms can be adopted
- Applicable to many distance metrics.
- Time and space efficient.
- Multi-attribute case studied

31 Related work
- Learning similarity functions: [Sarawagi and Bhamidipaty, 2003]
- Efficient merge and purge: [Hernandez and Stolfo, 1995]
- String edit-distance join using DBMS: [Gravano et al, 2001]

32 The Flamingo Project on Data Cleansing