1 Efficient Record Linkage in Large Data Sets Liang Jin, Chen Li, Sharad Mehrotra University of California, Irvine DASFAA, Kyoto, Japan, March 2003
2 Motivation Correlate data from different data sources (e.g., data integration) — Data is often dirty — Needs to be cleansed before being used Example: — A hospital needs to merge patient records from different data sources — They have different formats, typos, and abbreviations
3 Example Table R (columns: Name, SSN, Addr; SSN values omitted): Jack Lemmon, Maple St; Harrison Ford, Culver Blvd; Tom Hanks, Main St; … Table S (columns: Name, SSN, Addr; SSN values omitted): Ton Hanks, Main Street; Kevin Spacey, Frost Blvd; Jack Lemon, Maple Street; … Find records from different datasets that could be the same entity
4 Another Example P. Bernstein, D. Chiu: Using Semi-Joins to Solve Relational Queries. JACM 28(1): (1981) versus Philip A. Bernstein, Dah-Ming W. Chiu, Using Semi-Joins to Solve Relational Queries, Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981
5 Record linkage Problem statement: “Given two relations, identify the potentially matched records — Efficiently and — Effectively”
6 Challenges How to define good similarity functions? — Many functions proposed (edit distance, cosine similarity, …) — Domain knowledge is critical Names: “Wall Street Journal” and “LA Times” Address: “Main Street” versus “Main St” How to do matching efficiently — Offline join version — Online (interactive) search Nearest search Range search
7 Outline Motivation of record linkage Single-attribute case: two-step approach Multi-attribute linkage Conclusion and related work
8 Single-attribute Case Given — two sets of strings, R and S — a similarity function f between strings (metric space) Reflexive: f(s1,s2) = 0 iff s1=s2 Symmetric: f(s1,s2) = f(s2,s1) Triangle inequality: f(s1,s2)+f(s2,s3) >= f(s1,s3) — a threshold k Find: all pairs of strings (r, s), with r in R and s in S, such that f(r,s) <= k.
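The three metric properties on this slide can be spot-checked on a sample before a function is used with the approach. The sketch below is illustrative only: `violates_metric` is a hypothetical helper, and passing the check on a sample does not prove that a function is a metric.

```python
from itertools import product


def violates_metric(f, sample):
    """Return the first metric property that f breaks on `sample`, or None."""
    for a, b in product(sample, repeat=2):
        if (f(a, b) == 0) != (a == b):
            return "reflexive"
        if f(a, b) != f(b, a):
            return "symmetric"
    for a, b, c in product(sample, repeat=3):
        if f(a, b) + f(b, c) < f(a, c):
            return "triangle inequality"
    return None


# Absolute difference on numbers satisfies all three properties.
print(violates_metric(lambda x, y: abs(x - y), [0, 1, 2, 5]))
# Squared difference breaks the triangle inequality (0, 1, 2).
print(violates_metric(lambda x, y: (x - y) ** 2, [0, 1, 2, 5]))
# Length difference gives distance 0 to distinct strings of equal length.
print(violates_metric(lambda a, b: abs(len(a) - len(b)), ["ab", "cd"]))
```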
9 Nested-loop? Not desirable for large data sets 5 hours for 30K strings!
10 Our 2-step approach Step 1: map strings (in a metric space) to objects in a Euclidean space Step 2: do a similarity join in the Euclidean space
11 Advantages Applicable to many metric similarity functions — Use edit distance as an example — Other similarity functions also tried, e.g., q-gram-based similarity Open to existing algorithms — Mapping techniques — Join techniques
12 Step 1 Map strings into a high-dimensional Euclidean space Metric Space Euclidean Space
13 Example: Edit Distance A widely used metric to define string similarity ed(s1,s2) = minimum # of operations (insertion, deletion, substitution) to change s1 to s2 Example: s1: Tom Hanks s2: Ton Hank ed(s1,s2) = 2
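As an illustration of the definition, edit distance can be computed with the standard dynamic program over prefixes. This is a minimal sketch of the textbook algorithm; the function name `edit_distance` is chosen here and is not taken from the talk.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning s1 into s2."""
    # prev[j] holds the distance between the current prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # distance from s1[:i] to the empty string
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2: substitute m -> n, delete s
```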
14 Mapping: StringMap Input: A list of strings Output: Points in a high-dimensional Euclidean space that preserve the original distances well A variation of FastMap — Each step greedily picks two strings (pivots) to form an axis — All axes are orthogonal
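The pivot-and-axis idea can be sketched with a plain FastMap-style projection. This is not the authors' StringMap, whose exact pivot selection is not given on the slide; the function name `fastmap` and the deterministic pivot heuristic (farthest from the first object, then farthest from that one) are assumptions made for illustration.

```python
import math


def fastmap(objects, dist, d):
    """Map objects to d-dimensional points using a FastMap-style projection."""
    n = len(objects)
    coords = [[0.0] * d for _ in range(n)]

    def residual_sq(i, j, axis):
        # Squared original distance minus the part already captured by the
        # first `axis` coordinates, clamped at 0 for non-Euclidean input.
        total = dist(objects[i], objects[j]) ** 2
        for h in range(axis):
            total -= (coords[i][h] - coords[j][h]) ** 2
        return max(total, 0.0)

    for axis in range(d):
        # Greedy pivot choice: the object farthest from object 0, then the
        # object farthest from that pivot.
        a = max(range(n), key=lambda i: residual_sq(0, i, axis))
        b = max(range(n), key=lambda i: residual_sq(a, i, axis))
        dab_sq = residual_sq(a, b, axis)
        if dab_sq == 0.0:
            break  # nothing left to separate; remaining coordinates stay 0
        dab = math.sqrt(dab_sq)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][axis] = (residual_sq(a, i, axis) + dab_sq
                               - residual_sq(b, i, axis)) / (2 * dab)
    return coords


# For data that is already Euclidean, two axes recover 2-D distances.
pts = [(0, 0), (3, 0), (0, 4), (3, 4), (1, 1)]
mapped = fastmap(pts, math.dist, 2)
print(math.dist(mapped[0], mapped[3]), math.dist(pts[0], pts[3]))
```

For strings, `dist` would be a string metric such as edit distance, and the mapped distances only approximate the original ones, which is why a new threshold k' is needed in step 2.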
15 Can it preserve distances? Data Sources: — IMDB star names: 54,000 — German names: 132,000 Distribution of string lengths:
16 Can it preserve distances? Use data set 1 (54K names) as an example: k=2, d=20 — Use k’=5.2 to differentiate similar and dissimilar pairs.
17 Choose Dimensionality d Increase d? Good: — easier to differentiate similar pairs from dissimilar ones. Bad: — Step 1: Efficiency ↓ — Step 2: “curse of dimensionality”
18 Choose dimensionality d using sampling Sample 1Kx1K strings, find their similar pairs (within distance k) Calculate the maximum of their new distances, w Define “Cost” of finding a similar pair: Cost = (# of pairs within distance w) / (# of similar pairs)
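The cost measure can be computed from a sample as sketched below. The sketch assumes the cost is read as the number of sampled pairs within mapped distance w per truly similar pair, and that the sample is already available as (original distance, mapped distance) tuples; `mapping_cost` is a hypothetical name.

```python
def mapping_cost(pairs, k):
    """Return (w, cost) for sampled (original_distance, mapped_distance) pairs."""
    pairs = list(pairs)
    # Mapped distances of the pairs that are similar in the original space.
    similar = [mapped for original, mapped in pairs if original <= k]
    if not similar:
        raise ValueError("sample contains no similar pairs")
    w = max(similar)
    # Every pair within w must be examined to retrieve all similar pairs.
    within_w = sum(1 for _, mapped in pairs if mapped <= w)
    return w, within_w / len(similar)


sample = [(1, 0.8), (2, 1.9), (5, 1.5), (7, 4.0), (9, 6.2)]
print(mapping_cost(sample, k=2))  # (1.9, 1.5)
```

Repeating this for several candidate values of d and keeping a d with low cost is one way to apply the sampling procedure; a cost of 1.0 would mean the mapping separates similar from dissimilar pairs perfectly on the sample.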
19 Choose Dimensionality d d=15 ~ 25
20 Choose new threshold k’ Closely related to the mapping property Ideally, if ed(r,s) <= k, the Euclidean distance between two corresponding points <= k’. Choose k’ using sampling — Sample 1Kx1K strings, find similar pairs — Calculate their maximum new distance as k’ — repeat multiple times, choose their maximum
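The sampling procedure for k' might look like the following sketch. The names `choose_new_threshold` and `point_of` (a lookup from an object to its mapped coordinates) are assumptions, as are the default number of rounds and the fixed seed.

```python
import math
import random


def choose_new_threshold(R, S, dist, point_of, k, sample_size=1000, rounds=3, seed=0):
    """Estimate k' as the largest mapped distance seen among sampled similar pairs."""
    rng = random.Random(seed)
    k_new = 0.0
    for _ in range(rounds):
        sample_r = rng.sample(R, min(sample_size, len(R)))
        sample_s = rng.sample(S, min(sample_size, len(S)))
        for r in sample_r:
            for s in sample_s:
                if dist(r, s) <= k:  # similar in the original metric space
                    k_new = max(k_new, math.dist(point_of(r), point_of(s)))
    return k_new


# Toy usage: numbers under absolute difference, "mapped" by halving.
R, S = [0, 1, 2, 10], [0, 3, 11]
print(choose_new_threshold(R, S, lambda a, b: abs(a - b), lambda x: (0.5 * x,), k=2))
```

Because k' comes from a sample, a similar pair outside the sample can map farther apart than k', which is one reason recall can fall below 100%.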
21 New threshold k’ in step 2 d=20
22 Step 2: Similarity Join Input: Two sets of points in Euclidean space. Output: Pairs of points whose distance is within the new threshold k’. Many join algorithms can be used
23 Example Adopted an algorithm by Hjaltason and Samet. — Building two R-Trees. — Traverse two trees, find points whose distance is within k’. — Pruning during traversal (e.g., using MinDist).
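The MinDist pruning idea can be illustrated without a full R-tree. The sketch below swaps in a much simpler technique: each point set is cut into fixed-size blocks with bounding boxes, and a pair of blocks is skipped when the minimum distance between their boxes exceeds k'. It is not the Hjaltason and Samet algorithm; the names `bounding_box`, `min_dist` and `blocked_join` are assumptions.

```python
import math


def bounding_box(points):
    """Axis-aligned bounding box as (lower corner, upper corner)."""
    dims = range(len(points[0]))
    return ([min(p[i] for p in points) for i in dims],
            [max(p[i] for p in points) for i in dims])


def min_dist(box_a, box_b):
    """Smallest possible distance between a point in box_a and one in box_b."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    gaps = (max(0.0, la - hb, lb - ha)
            for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b))
    return math.sqrt(sum(g * g for g in gaps))


def blocked_join(P, Q, k_new, block=4):
    """Index pairs (i, j) with distance(P[i], Q[j]) <= k_new."""
    def blocks(points):
        # Sort lexicographically so that each block covers nearby points.
        ordered = sorted(range(len(points)), key=lambda i: points[i])
        groups = [ordered[s:s + block] for s in range(0, len(ordered), block)]
        return [(g, bounding_box([points[i] for i in g])) for g in groups]

    result = []
    blocks_q = blocks(Q)
    for ids_p, box_p in blocks(P):
        for ids_q, box_q in blocks_q:
            if min_dist(box_p, box_q) > k_new:
                continue  # no pair across these two blocks can qualify
            result.extend((i, j) for i in ids_p for j in ids_q
                          if math.dist(P[i], Q[j]) <= k_new)
    return sorted(result)


P = [(0, 0), (1, 1), (10, 10), (11, 10)]
Q = [(0, 1), (10, 11), (50, 50)]
print(blocked_join(P, Q, k_new=1.5, block=2))  # [(0, 0), (1, 0), (2, 1), (3, 1)]
```

An R-tree applies the same test hierarchically, so whole subtrees are pruned at once rather than single blocks.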
24 Final processing Among the pairs produced from the similarity-join step, check their edit distance. Return those pairs satisfying the threshold k
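Since the final check only needs to know whether a pair is within k, it can stop early. The sketch below is one possible verification routine, assuming edit distance as the metric: it abandons the dynamic program as soon as every entry in a row exceeds k, which is safe because the final distance is never smaller than the minimum of any row. The names `within_edit_distance` and `verify` are hypothetical.

```python
def within_edit_distance(a: str, b: str, k: int) -> bool:
    """True if the edit distance between a and b is at most k."""
    if abs(len(a) - len(b)) > k:
        return False  # length difference alone already needs more than k edits
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        if min(curr) > k:
            return False  # no alignment can come back under the threshold
        prev = curr
    return prev[-1] <= k


def verify(candidates, k):
    """Keep only the candidate pairs that satisfy the original threshold k."""
    return [(r, s) for r, s in candidates if within_edit_distance(r, s, k)]


candidates = [("Tom Hanks", "Ton Hanks"),
              ("Tom Hanks", "Kevin Spacey"),
              ("Jack Lemmon", "Jack Lemon")]
print(verify(candidates, k=1))
```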
25 Running time
26 Recall Recall: (# of found similar pairs) / (# of all similar pairs)
27 Outline Motivation of record linkage Single-attribute case: two-step approach Multi-attribute linkage Conclusion and related work
28 Multi-attribute linkage Example: title + name + year Different attributes have different similarity functions and thresholds Consider merge rules in disjunctive format:
29 Evaluation strategies Many ways to evaluate rules Finding an optimal one: NP-hard Heuristics: — Treat different conjuncts independently. Pick the “most efficient” attribute in each conjunct. — Choose the largest threshold for each attribute. Then choose the “most efficient” attribute among these thresholds.
30 Summary A novel two-step approach to record linkage. Many existing mapping and join algorithms can be adopted Applicable to many distance metrics. Time and space efficient. Multi-attribute case studied
31 Related work Learning similarity functions: [Sarawagi and Bhamidipaty, 2003] Efficient merge and purge: [Hernandez and Stolfo, 1995] String edit-distance join using DBMS: [Gravano et al, 2001]
32 The Flamingo Project on Data Cleansing