Download presentation
Presentation is loading. Please wait.
1
Efficient Record Linkage in Large Data Sets
Liang Jin, Chen Li, Sharad Mehrotra University of California, Irvine DASFAA, Kyoto, Japan, March 2003
2
Motivation Correlate data from different data sources (e.g., data integration) Data is often dirty Needs to be cleansed before being used Example: A hospital needs to merge patient records from different data sources They have different formats, typos, and abbreviations
3
Example Table R Table S Name SSN Addr Jack Lemmon Maple St Harrison Ford Culver Blvd Tom Hanks Main St … Name SSN Addr Ton Hanks Main Street Kevin Spacey Frost Blvd Jack Lemon Maple Street … Find records from different datasets that could be the same entity
4
Another Example P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40(1981) Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981
5
Record linkage Problem statement:
“Given two relations, identify the potentially matched records Efficiently and Effectively”
6
Challenges How to define good similarity functions?
Many functions proposed (edit distance, cosine similarity, …) Domain knowledge is critical Names: “Wall Street Journal” and “LA Times” Address: “Main Street” versus “Main St” How to do matching efficiently Offline join version Online (interactive) search Nearest search Range search
7
Outline Motivation of record linkage
Single-attribute case: two-step approach Multi-attribute linkage Conclusion and related work
8
Single-attribute Case
Given two sets of strings, R and S a similarity function f between strings (metric space) Reflexive: f(s1,s2) = 0 iff s1=s2 Symmetric: f(s1,s2) = d(s2, s1) Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3) a threshold k Find: all pairs of strings (r, s) from R and S, such that f(r,s) <= k. R S
9
Nested-loop? Not desirable for large data sets
5 hours for 30K strings!
10
Our 2-step approach Step 1: map strings (in a metric space) to objects in a Euclidean space Step 2: do a similarity join in the Euclidean space
11
Advantages Applicable to many metric similarity functions
Use edit distance as an example Other similarity functions also tried, e.g., q-gram-based similarity Open to existing algorithms Mapping techniques Join techniques
12
Step 1 Map strings into a high-dimensional Euclidean space
Metric Space Euclidean Space
13
Example: Edit Distance
A widely used metric to define string similarity Ed(s1,s2)= minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2
14
Mapping: StringMap Input: A list of strings
Output: Points in a high-dimensional Euclidean space that preserve the original distances well A variation of FastMap Each step greedily picks two strings (pivots) to form an axis All axes are orthogonal
15
Can it preserve distances?
Data Sources: IMDB star names: 54,000 German names: 132,000 Distribution of string lengths:
16
Can it preserve distances?
Use data set 1 (54K names) as an example k=2, d=20 Use k’=5.2 to differentiate similar and dissimilar pairs.
17
Choose Dimensionality d
Increase d? Good : better to differentiate similar pairs from dissimilar ones. Bad : Step 1: Efficiency ↓ Step 2: “curse of dimensionality”
18
Choose dimensionality d using sampling
Sample 1Kx1K strings, find their similar pairs (within distance k) Calculate maximum of their new distances w Define “Cost” of finding a similar pair: # of similar pairs # of pairs within distance w Cost=
19
Choose Dimensionality d
20
Choose new threshold k’
Closely related to the mapping property Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’. Choose k’ using sampling Sample 1Kx1K strings, find similar pairs Calculate their maximum new distance as k’ repeat multiple times, choose their maximum
21
New threshold k’ in step 2
22
Step 2: Similarity Join Input: Two sets of points in Euclidean space.
Output: Pairs of two points whose distance is less than new threshold k’. Many join algorithms can be used
23
Example Adopted an algorithm by Hjaltason and Samet.
Building two R-Trees. Traverse two trees, find points whose distance is within k’. Pruning during traversal (e.g., using MinDist).
24
Final processing Among the pairs produced from the similarity-join step, check their edit distance. Return those pairs satisfying the threshold k
25
Running time
26
Recall Recall: (#of found similar pairs)/(#of all similar pairs)
27
Outline Motivation of record linkage
Single-attribute case: two-step approach Multi-attribute linkage Conclusion and related work
28
Multi-attribute linkage
Example: title + name + year Different attributes have different similarity functions and thresholds Consider merge rules in disjunctive format:
29
Evaluation strategies
Many ways to evaluate rules Finding an optimal one: NP-hard Heuristics: Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct. Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.
30
Summary A novel two-step approach to record linkage.
Many existing mapping and join algorithms can be adopted Applicable to many distance metrics. Time and space efficient. Multi-attribute case studied
31
Related work Learning similarity functions: [Sarawagi and Bhamidipaty, 2003] Efficient merge and purge: [Hernandez and Stolfo, 1995] String edit-distance join using DBMS: [Gravano et al, 2001]
32
The Flamingo Project on Data Cleansing
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.