Efficient Record Linkage in Large Data Sets

Presentation transcript:

Efficient Record Linkage in Large Data Sets Liang Jin, Chen Li, Sharad Mehrotra University of California, Irvine DASFAA, Kyoto, Japan, March 2003

Motivation
Correlate data from different data sources (e.g., data integration)
  Data is often dirty
  Needs to be cleansed before being used
Example:
  A hospital needs to merge patient records from different data sources
  They have different formats, typos, and abbreviations

Example
Find records from different datasets that could be the same entity
Table R
  Name            SSN            Addr
  Jack Lemmon     430-871-8294   Maple St
  Harrison Ford   292-918-2913   Culver Blvd
  Tom Hanks       234-762-1234   Main St
  …
Table S
  Name            SSN            Addr
  Ton Hanks       234-162-1234   Main Street
  Kevin Spacey    928-184-2813   Frost Blvd
  Jack Lemon      430-817-8294   Maple Street
  …

Another Example
The same paper cited in two different styles:
  P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40 (1981)
  Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

Record linkage
Problem statement: “Given two relations, identify the potentially matched records efficiently and effectively.”

Challenges
How to define good similarity functions?
  Many functions proposed (edit distance, cosine similarity, …)
  Domain knowledge is critical
    Names: “Wall Street Journal” and “LA Times”
    Address: “Main Street” versus “Main St”
How to do matching efficiently?
  Offline join version
  Online (interactive) search
    Nearest search
    Range search

Outline
Motivation of record linkage
Single-attribute case: two-step approach
Multi-attribute linkage
Conclusion and related work

Single-attribute Case
Given:
  two sets of strings, R and S
  a similarity function f between strings (metric space)
    Reflexive: f(s1,s2) = 0 iff s1 = s2
    Symmetric: f(s1,s2) = f(s2,s1)
    Triangle inequality: f(s1,s2) + f(s2,s3) >= f(s1,s3)
  a threshold k
Find: all pairs of strings (r, s) from R and S such that f(r,s) <= k.

Nested-loop?
Not desirable for large data sets
5 hours for 30K strings!
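
The nested-loop baseline compares every string in R with every string in S, so its work grows with |R| x |S|. A minimal sketch, assuming a generic metric f passed in as a function; the names `nested_loop_join` and `padded_hamming` are hypothetical, and the padded Hamming distance is only a cheap stand-in metric, not the function used in the talk:

```python
from itertools import zip_longest
from typing import Callable, List, Sequence, Tuple


def padded_hamming(a: str, b: str) -> int:
    """Stand-in metric (assumption): count positions where the strings differ,
    treating the shorter string as padded with a non-matching symbol."""
    return sum(x != y for x, y in zip_longest(a, b))


def nested_loop_join(
    R: Sequence[str],
    S: Sequence[str],
    f: Callable[[str, str], float],
    k: float,
) -> List[Tuple[str, str]]:
    """Return every pair (r, s) with f(r, s) <= k by testing all |R| * |S| pairs."""
    return [(r, s) for r in R for s in S if f(r, s) <= k]


R = ["Tom Hanks", "Jack Lemmon"]
S = ["Ton Hanks", "Jack Lemon", "Kevin Spacey"]
print(nested_loop_join(R, S, padded_hamming, k=1))
```

Every pair costs one distance computation, which is what makes this approach impractical once the inputs reach tens of thousands of strings.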

Our 2-step approach
Step 1: map strings (in a metric space) to objects in a Euclidean space
Step 2: do a similarity join in the Euclidean space

Advantages
Applicable to many metric similarity functions
  Use edit distance as an example
  Other similarity functions also tried, e.g., q-gram-based similarity
Open to existing algorithms
  Mapping techniques
  Join techniques

Step 1
Map strings into a high-dimensional Euclidean space
(Figure: metric space mapped to Euclidean space)

Example: Edit Distance
A widely used metric to define string similarity
ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2
Example:
  s1: Tom Hanks
  s2: Ton Hank
  ed(s1,s2) = 2
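
The definition above can be computed with the standard dynamic-programming recurrence. This is a minimal sketch of that textbook formulation, not necessarily the implementation used for the experiments; the function name `edit_distance` is an assumption:

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
    # prev[j] is the distance between the prefix of s1 processed so far and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            substitution = 0 if c1 == c2 else 1
            curr.append(min(
                prev[j] + 1,                 # delete c1
                curr[j - 1] + 1,             # insert c2
                prev[j - 1] + substitution,  # substitute c1 with c2, or keep a match
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two operations in the slide's example are the substitution of "m" by "n" and the deletion of the final "s".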

Mapping: StringMap
Input: A list of strings
Output: Points in a high-dimensional Euclidean space that preserve the original distances well
A variation of FastMap
  Each step greedily picks two strings (pivots) to form an axis
  All axes are orthogonal
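
The slides describe StringMap only as a variation of FastMap, so the sketch below follows the generic FastMap recipe rather than the authors' exact algorithm. Assumptions: the name `fastmap_embed`, the "jump to the farthest object twice" pivot heuristic, and clamping negative residual distances to zero (which can occur when the input metric is not Euclidean). The distance function is a parameter; for strings it would be an edit-distance function.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def fastmap_embed(
    objects: Sequence[T], dist: Callable[[T, T], float], d: int
) -> List[List[float]]:
    """Map each object to a point with d coordinates, one pivot pair per axis."""
    n = len(objects)
    if n == 0:
        return []
    coords = [[0.0] * d for _ in range(n)]

    def residual(i: int, j: int, axis: int) -> float:
        # Distance that remains after removing what earlier axes already explain.
        sq = dist(objects[i], objects[j]) ** 2
        for h in range(axis):
            sq -= (coords[i][h] - coords[j][h]) ** 2
        return math.sqrt(max(sq, 0.0))  # assumption: clamp negative values to zero

    for axis in range(d):
        # Greedy pivot choice: start at object 0 and jump to the farthest object twice.
        a = 0
        b = max(range(n), key=lambda j: residual(a, j, axis))
        a = max(range(n), key=lambda j: residual(b, j, axis))
        d_ab = residual(a, b, axis)
        if d_ab == 0.0:
            break  # nothing left to separate; remaining coordinates stay 0
        for i in range(n):
            d_ai = residual(a, i, axis)
            d_bi = residual(b, i, axis)
            # Projection of object i onto the line through the pivots (cosine law).
            coords[i][axis] = (d_ai ** 2 + d_ab ** 2 - d_bi ** 2) / (2 * d_ab)
    return coords


# Toy check on 2-D points, where two axes are enough to reproduce the distances.
points = [(0, 0), (3, 0), (0, 4), (3, 4), (1, 1)]
mapped = fastmap_embed(points, math.dist, d=2)
print(round(math.dist(mapped[0], mapped[3]), 6))  # 5.0
```

For string data under edit distance the mapped distances are only approximations, which is why the later slides pick a separate threshold k’ for the Euclidean space.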

Can it preserve distances?
Data Sources:
  IMDB star names: 54,000
  German names: 132,000
Distribution of string lengths:

Can it preserve distances?
Use data set 1 (54K names) as an example
k=2, d=20
Use k’=5.2 to differentiate similar and dissimilar pairs.

Choose Dimensionality d
Increase d?
Good: better to differentiate similar pairs from dissimilar ones.
Bad:
  Step 1: Efficiency ↓
  Step 2: “curse of dimensionality”

Choose dimensionality d using sampling
Sample 1Kx1K strings, find their similar pairs (within distance k)
Calculate the maximum of their new distances, w
Define “Cost” of finding a similar pair:
  Cost = (# of pairs within distance w) / (# of similar pairs)
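
A minimal sketch of the sampling computation, assuming the cost is read as the number of sampled pairs within distance w per similar pair. The function name `similar_pair_cost` and the toy numbers are hypothetical; each input pair carries its original distance and its distance after mapping.

```python
from typing import Iterable, Tuple


def similar_pair_cost(
    pairs: Iterable[Tuple[float, float]], k: float
) -> Tuple[float, float]:
    """pairs holds (original distance, new distance) for every sampled pair.

    Returns (w, cost): w is the largest new distance among similar pairs,
    and cost is (# of pairs within new distance w) / (# of similar pairs).
    """
    pairs = list(pairs)
    similar_new = [new for orig, new in pairs if orig <= k]
    if not similar_new:
        raise ValueError("the sample contains no similar pair")
    w = max(similar_new)
    within_w = sum(1 for _, new in pairs if new <= w)
    return w, within_w / len(similar_new)


# Toy sample (made-up values): two similar pairs, one dissimilar pair that lands within w.
sample = [(1, 0.8), (2, 2.5), (6, 2.1), (9, 7.4), (7, 5.0)]
print(similar_pair_cost(sample, k=2))  # (2.5, 1.5)
```

Under this reading, a cost close to 1 means few dissimilar pairs fall within w; repeating the computation for several values of d gives a curve from which a dimensionality can be chosen.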

Choose Dimensionality d

Choose new threshold k’
Closely related to the mapping property
Ideally, if ed(r,s) <= k, the Euclidean distance between the two corresponding points <= k’.
Choose k’ using sampling
  Sample 1Kx1K strings, find similar pairs
  Calculate their maximum new distance as k’
  Repeat multiple times, choose their maximum
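
The sampling procedure for k’ can be sketched as follows. Assumptions: the name `choose_new_threshold`, its parameters, and the representation of each input as an (object, mapped point) pair. The toy objects are numbers so the example stays self-contained; in the setting of the slides the objects would be strings, `dist` would be edit distance, and the points would come from the mapping step.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")
Mapped = Tuple[T, Sequence[float]]  # (original object, its point after mapping)


def choose_new_threshold(
    R: List[Mapped],
    S: List[Mapped],
    dist: Callable[[T, T], float],
    k: float,
    sample_size: int = 1000,
    rounds: int = 5,
    seed: int = 0,
) -> float:
    """Largest mapped distance seen among sampled pairs whose original distance is <= k."""
    rng = random.Random(seed)
    k_new = 0.0
    for _ in range(rounds):
        sample_r = rng.sample(R, min(sample_size, len(R)))
        sample_s = rng.sample(S, min(sample_size, len(S)))
        for r, point_r in sample_r:
            for s, point_s in sample_s:
                if dist(r, s) <= k:
                    k_new = max(k_new, math.dist(point_r, point_s))
    return k_new


# Toy data (made up): numbers under absolute difference, "mapped" by halving.
R = [(x, (0.5 * x,)) for x in (0, 3, 10)]
S = [(x, (0.5 * x,)) for x in (1, 4, 20)]
print(choose_new_threshold(R, S, lambda a, b: abs(a - b), k=1))  # 0.5
```

Because k’ comes from samples, a similar pair outside the samples may still map farther apart than k’, which is one way recall can drop below 100%.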

New threshold k’ in step 2

Step 2: Similarity Join
Input: Two sets of points in Euclidean space.
Output: Pairs of points whose distance is within the new threshold k’.
Many join algorithms can be used

Example
Adopted an algorithm by Hjaltason and Samet.
  Build two R-Trees.
  Traverse the two trees, find points whose distance is within k’.
  Prune during traversal (e.g., using MinDist).
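
The MinDist pruning idea can be illustrated without a full R-tree. The sketch below is a simplification, not the Hjaltason and Samet algorithm: it groups each point set into fixed-size blocks (a stand-in for R-tree leaf nodes), computes a bounding box per block, and skips any pair of blocks whose boxes are farther apart than k’. All names (`bounding_box`, `min_dist`, `blocks`, `block_join`) and the block size are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]
Box = Tuple[List[float], List[float]]  # (lower corner, upper corner)


def bounding_box(points: Sequence[Point]) -> Box:
    """Smallest axis-aligned box containing all the given points."""
    return [min(c) for c in zip(*points)], [max(c) for c in zip(*points)]


def min_dist(a: Box, b: Box) -> float:
    """Lower bound on the distance between any point in box a and any point in box b."""
    gaps = (
        max(b_lo - a_hi, a_lo - b_hi, 0.0)
        for a_lo, a_hi, b_lo, b_hi in zip(a[0], a[1], b[0], b[1])
    )
    return math.sqrt(sum(g * g for g in gaps))


def blocks(points: Sequence[Point], size: int) -> List[List[int]]:
    """Group point indices into blocks of nearby points (sorted by first coordinate)."""
    order = sorted(range(len(points)), key=lambda i: points[i][0])
    return [order[s:s + size] for s in range(0, len(order), size)]


def block_join(
    P: Sequence[Point], Q: Sequence[Point], k_new: float, size: int = 2
) -> List[Tuple[int, int]]:
    """Index pairs (i, j) with distance(P[i], Q[j]) <= k_new."""
    p_blocks = [(b, bounding_box([P[i] for i in b])) for b in blocks(P, size)]
    q_blocks = [(b, bounding_box([Q[j] for j in b])) for b in blocks(Q, size)]
    result = []
    for p_ids, p_box in p_blocks:
        for q_ids, q_box in q_blocks:
            if min_dist(p_box, q_box) > k_new:
                continue  # pruned: no pair from these two blocks can qualify
            result.extend(
                (i, j) for i in p_ids for j in q_ids
                if math.dist(P[i], Q[j]) <= k_new
            )
    return sorted(result)


P = [(0, 0), (1, 0), (10, 10), (11, 10)]
Q = [(0.5, 0), (10, 11), (30, 30)]
print(block_join(P, Q, k_new=1.5))  # [(0, 0), (1, 0), (2, 1), (3, 1)]
```

An R-tree applies the same test hierarchically, so whole subtrees are discarded at once instead of individual blocks.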

Final processing
Among the pairs produced from the similarity-join step, check their edit distance.
Return those pairs satisfying the threshold k.

Running time

Recall
Recall: (# of found similar pairs) / (# of all similar pairs)
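
As a small illustration of the recall definition, assuming the found pairs and the full set of similar pairs are available as sets (the name `recall` and the toy pairs are hypothetical):

```python
from typing import Set, Tuple

Pair = Tuple[str, str]


def recall(found: Set[Pair], all_similar: Set[Pair]) -> float:
    """(# of found similar pairs) / (# of all similar pairs)."""
    if not all_similar:
        raise ValueError("recall is undefined when there are no similar pairs")
    return len(found & all_similar) / len(all_similar)


all_similar = {("a1", "b1"), ("a2", "b2"), ("a3", "b3"), ("a4", "b4")}
found = {("a1", "b1"), ("a2", "b2"), ("a4", "b4")}
print(recall(found, all_similar))  # 0.75
```

Since the final step verifies every candidate against the original threshold k, the approach returns no false matches; the only loss is similar pairs whose mapped distance exceeds k’.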

Outline
Motivation of record linkage
Single-attribute case: two-step approach
Multi-attribute linkage
Conclusion and related work

Multi-attribute linkage
Example: title + name + year
Different attributes have different similarity functions and thresholds
Consider merge rules in disjunctive format:

Evaluation strategies
Many ways to evaluate rules
Finding an optimal one: NP-hard
Heuristics:
  Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct.
  Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.

Summary
A novel two-step approach to record linkage.
Many existing mapping and join algorithms can be adopted.
Applicable to many distance metrics.
Time and space efficient.
Multi-attribute case studied.

Related work
Learning similarity functions: [Sarawagi and Bhamidipaty, 2003]
Efficient merge and purge: [Hernandez and Stolfo, 1995]
String edit-distance join using DBMS: [Gravano et al, 2001]

The Flamingo Project on Data Cleansing