1
Record Linkage: A 10-Year Retrospective
Chen Li and Sharad Mehrotra, UC Irvine
2
Efficient Record Linkage in Large Data Sets
Liang Jin, Chen Li, Sharad Mehrotra
University of California, Irvine
DASFAA, Kyoto, Japan, March 2003
4
How was the paper written?
Two faculty members working in different areas, plus a first-year PhD student
5
Chen’s Story: 2001 …
6
Data Integration Problems?
Talking to medical doctors…
7
Example

Table R
Name            SSN            Addr
Jack Lemmon     430-871-8294   Maple St
Harrison Ford   292-918-2913   Culver Blvd
Tom Hanks       234-762-1234   Main St
…               …              …

Table S
Name            SSN            Addr
Ton Hanks       234-162-1234   Main Street
Kevin Spacey    928-184-2813   Frost Blvd
Jack Lemon      430-817-8294   Maple Street
…               …              …

Q: Find records from different datasets that could be the same entity
8
Sharad’s research
9
Liang’s story
First-year PhD student at UC Irvine
10
Challenges
- How to define good similarity functions?
- How to do matching efficiently?
11
Nested-loop?
Not desirable for large data sets: 5 hours for 30K strings!
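To make the cost concrete, here is a minimal sketch of the nested-loop baseline, assuming edit distance as the similarity function and a threshold k on it. The function names and the choice k=2 are illustrative, not taken from the paper; the names come from Tables R and S in the example slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def nested_loop_join(r, s, k):
    """Compare every string in r with every string in s: |r| * |s| distance computations."""
    return [(x, y) for x in r for y in s if edit_distance(x, y) <= k]


# Names from Tables R and S in the example slide.
names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]

matches = nested_loop_join(names_r, names_s, k=2)
print(matches)
```

Every pair is compared, and each comparison is itself quadratic in the string length, which is why this approach does not scale to tens of thousands of strings.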
12
Our 2-step approach
Step 1: map strings (in a metric space) to objects in a Euclidean space
Step 2: do a similarity join in the Euclidean space
13
Advantages
- Applicable to many metric similarity functions (e.g., edit distance)
- Open to existing algorithms
  - Mapping techniques
  - Join techniques
14
Step 1
Map strings into a high-dimensional Euclidean space
(Figure: Metric Space mapped to Euclidean Space)
15
Can it preserve distances?
Use data set 1 (54K names) as an example
k=2, d=20
- Use k’=5.2 to differentiate similar and dissimilar pairs.
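The two-step idea can be sketched as follows. This is not the mapping used in the paper: it swaps in a simple landmark (pivot) embedding, where coordinate i of a string is its edit distance to pivot i, because that embedding has an easily verified guarantee. By the triangle inequality each coordinate differs by at most the true edit distance, so two strings within edit distance k lie within k·sqrt(d) of each other in the d-dimensional space. That bound plays the role of the new threshold k’ here; the slide's k’=5.2 for k=2, d=20 was chosen for the paper's own mapping and is not reproduced by this sketch. The pivot choice and all function names are assumptions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def embed(s, pivots):
    """Step 1: map a string to a point; coordinate i is its edit distance to pivot i."""
    return [edit_distance(s, p) for p in pivots]


def squared_distance(u, v):
    """Squared Euclidean distance; kept in integers so the threshold test is exact."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def similarity_join(r, s, pivots, k):
    """Step 2: filter pairs in the Euclidean space, then verify survivors with edit distance."""
    # Euclidean distance <= k * sqrt(d) is equivalent to squared distance <= k*k*d.
    limit = k * k * len(pivots)
    points_r = [(x, embed(x, pivots)) for x in r]
    points_s = [(y, embed(y, pivots)) for y in s]
    result = []
    for x, px in points_r:
        for y, py in points_s:
            if squared_distance(px, py) <= limit and edit_distance(x, y) <= k:
                result.append((x, y))
    return result


names_r = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
names_s = ["Ton Hanks", "Kevin Spacey", "Jack Lemon"]
pivots = ["Jack Lemmon", "Tom Hanks", "Kevin Spacey"]  # hypothetical pivot choice

pairs = similarity_join(names_r, names_s, pivots, k=1)
print(pairs)
```

The filter loop above is still quadratic for brevity. The point of moving to a Euclidean space is that this step can be handed to an existing multidimensional join technique, so that the expensive edit-distance check runs only on the surviving candidate pairs.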
16
Multi-attribute linkage
Example: title + name + year
Different attributes have different similarity functions and thresholds
Consider merge rules in disjunctive format:
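The rule itself is not preserved in this transcript. As a rough illustration of what a merge rule in disjunctive format could look like, the sketch below treats a rule as an OR over conjuncts, where each conjunct is an AND of per-attribute conditions of the form "distance on this attribute is at most this threshold". The specific rule, the thresholds, the `year_gap` function, and the sample records are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def year_gap(a: int, b: int) -> int:
    """A different distance function for a numeric attribute."""
    return abs(a - b)


# Hypothetical rule in disjunctive form:
#   (title within 3 edits AND year gap 0) OR (title within 1 edit AND name within 2 edits)
RULE = [
    [("title", edit_distance, 3), ("year", year_gap, 0)],
    [("title", edit_distance, 1), ("name", edit_distance, 2)],
]


def records_match(rec_a, rec_b, rule):
    """True if at least one conjunct has all of its attribute conditions satisfied."""
    return any(
        all(dist(rec_a[attr], rec_b[attr]) <= limit for attr, dist, limit in conjunct)
        for conjunct in rule
    )


# Hypothetical records for illustration only.
a = {"title": "Efficient Record Linkage in Large Data Sets", "name": "Liang Jin", "year": 2003}
b = {"title": "Efficient Record Linkage in Large Datasets", "name": "L. Jin", "year": 2003}
c = {"title": "A Different Paper", "name": "Someone Else", "year": 1999}

print(records_match(a, b, RULE), records_match(a, c, RULE))
```

Keeping each attribute's distance function and threshold inside the rule makes it easy to mix string and numeric attributes in one rule.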
17
Secret of the paper …
18
19
Work since then …
- Chen: efficiency
- Sharad: quality
20
Chen’s Work on Efficiency
- Gram-based algorithms
  - Indexing
  - Selection algorithms
  - Join algorithms
  - Variable-length grams
  - Selectivity estimation
- Trie-based algorithms
  - Instant search
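As a rough illustration of the gram-based idea (a generic textbook sketch, not code from the Flamingo package), the example below builds an inverted index from 2-grams to strings and answers an approximate selection query with a count filter: two strings within edit distance k must share at least max(|s|, |t|) - q + 1 - k·q of their q-grams, so strings sharing fewer grams can be skipped without computing edit distance. All names and parameters are assumptions.

```python
from collections import Counter, defaultdict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def qgrams(s, q=2):
    """Multiset of overlapping substrings of length q."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def build_index(strings, q=2):
    """Inverted index: gram -> set of ids of the strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for gram in qgrams(s, q):
            index[gram].add(sid)
    return index


def search(query, strings, index, k, q=2):
    """Return the strings within edit distance k of query, using the count filter."""
    query_grams = qgrams(query, q)
    if len(query) - q + 1 - k * q <= 0:
        # The bound is vacuous: a match may share no gram at all, so scan everything.
        candidates = set(range(len(strings)))
    else:
        candidates = set()
        for gram in query_grams:
            candidates |= index.get(gram, set())
    results = []
    for sid in sorted(candidates):
        s = strings[sid]
        needed = max(len(query), len(s)) - q + 1 - k * q
        shared = sum((query_grams & qgrams(s, q)).values())  # multiset intersection size
        if shared >= needed and edit_distance(query, s) <= k:
            results.append(s)
    return results


names = ["Jack Lemmon", "Harrison Ford", "Tom Hanks"]
index = build_index(names)
print(search("Ton Hanks", names, index, k=1))
print(search("Jack Lemon", names, index, k=1))
```

The expensive edit-distance check runs only for strings that pass the gram-count test. A production index would typically merge the inverted lists directly rather than recomputing grams per candidate.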
21
The Flamingo Package http://flamingo.ics.uci.edu/
22
Follow-up work in the community
Significant amount of work on approximate string queries
- Selection
- Join
23
Make an impact?
24
UCI People Search
25
Psearch (2008): 2 stories
26
Fuzzy search
27
www.omniplaces.com
Location-based search
28
Research commercialization
29
Lesson learned: hands-on experience is important!