Record Linkage and Disclosure Limitation
  William W. Cohen, CALD
  Steve Fienberg, Statistics, CALD & C3S
  Pradeep Ravikumar, CALD
Research Goals
  Understand the current “state of the art” in record linkage
  Understand the interplay between record linkage and disclosure limitation problems
    –More generally, understand the interplay between record linkage and analysis of linked data
Initial research question: What’s the state of the art in record linkage?
  Same/related problems are studied (in statistics, databases, artificial intelligence) variously as:
    –Merge-purge, duplicate detection, de-duping, database hardening, field-matching, object identity problem, object identification, object consolidation, identity uncertainty, reference resolution, co-reference resolution, reference matching, name matching, …
  Very few comparative studies across areas
  Very few studies on multiple datasets
    –Importance of problem-specific tuning unclear
Initial research question: What’s the state of the art in record linkage?
  Test suite of 14 (small) linkage problems
  “SecondString”: open-source Java toolkit implementing:
    –Edit distance: Levenshtein, Needleman-Wunsch, Smith-Waterman, “Monge-Elkan”
    –Jaro-like: Jaro measure, Jaro-Winkler
    –Token-based: Jaccard, TFIDF, Jensen-Shannon (smoothed w/ Dirichlet, Jelinek-Mercer)
    –Hybrid: Monge-Elkan’s “Level 2”, SoftTFIDF (TFIDF-Jaro hybrid)
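As a rough illustration of three of the metric families listed above, the following Python sketch implements Levenshtein edit distance, the Jaro and Jaro-Winkler similarities, and token-level Jaccard similarity. It is not SecondString code (the toolkit is Java). The function names, the whitespace tokenizer, and the Jaro-Winkler settings (prefix scale 0.1, prefix capped at 4 characters, the commonly cited Winkler defaults) are assumptions made for this example.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete from s
                            curr[j - 1] + 1,      # insert into s
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def jaro(s, t):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    s_seq = [c for c, m in zip(s, s_matched) if m]
    t_seq = [c for c, m in zip(t, t_matched) if m]
    transpositions = sum(a != b for a, b in zip(s_seq, t_seq)) // 2
    return (matches / len(s) + matches / len(t)
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s, t, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


def jaccard(s, t):
    """Jaccard similarity over lowercased whitespace tokens."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
    print(round(jaccard("William Cohen", "William W. Cohen"), 4))
```

Note that Levenshtein returns a distance (lower is closer) while the other three return similarities (higher is closer), so they need to be put on a common scale before being compared or combined.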
Initial research question: What’s the state of the art in record linkage?
  “SecondString” supports:
    –Comparing methods on multiple datasets
       Methodology from information retrieval
       11-pt interpolated precision
    –Easily implementing novel hybrid methods
    –Combining methods (via learned SVM)
       Labeled data; proxy for hand-tuning on task
       Different distance metrics for the same field: 2.6*TFIDF(x,y) + 0.4*Levenshtein(x,y) + 1.2*Jaro(x,y)
       Same method on different fields: 1.3*dist(x-addr,y-addr) + 2.7*dist(x-lname,y-lname)
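The 11-pt interpolated precision mentioned above can be sketched as follows, using the standard information-retrieval definition: candidate pairs are ranked by similarity score, and precision is interpolated at recall levels 0.0, 0.1, …, 1.0 and then averaged. This is an assumption about the evaluation, not SecondString's own code; the function name and input format are hypothetical.

```python
def eleven_point_interpolated_precision(ranked_labels):
    """Average interpolated precision at recall 0.0, 0.1, ..., 1.0.

    ranked_labels: list of booleans, one per candidate pair, ordered from
    the highest-scored pair to the lowest; True marks a correct match.
    Assumes every true match appears somewhere in the ranking.
    """
    total_matches = sum(ranked_labels)
    if total_matches == 0:
        return 0.0

    # (recall, precision) observed after each rank position.
    points = []
    hits = 0
    for rank, is_match in enumerate(ranked_labels, start=1):
        if is_match:
            hits += 1
        points.append((hits / total_matches, hits / rank))

    interpolated = []
    for k in range(11):
        level = k / 10
        # Interpolated precision: best precision at any recall >= level.
        reachable = [p for r, p in points if r >= level - 1e-12]
        interpolated.append(max(reachable) if reachable else 0.0)
    return sum(interpolated) / 11


if __name__ == "__main__":
    # Pairs ranked by some distance metric; True = pair is a real match.
    print(eleven_point_interpolated_precision([True, False, True]))
```

Because the measure depends only on the ranking, it lets distance metrics with incomparable score ranges be compared on the same dataset.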
Comparison: 7 methods vs 11 datasets. SoftTFIDF is best on average.
Comparison: 5 edit-distance-like metrics on 11 datasets. Monge-Elkan is best on average.
Comparison: 5 metrics, 11 datasets. Monge-Elkan may not be the best choice on a particular dataset.
Levenshtein vs SoftTFIDF
  Compares the best average performer (SoftTFIDF) with one of the worst (Levenshtein)
  SoftTFIDF is not strictly better!
  Solution: look at learning the best (combination of) methods
    –Training data is a proxy for hand-tuning to a problem
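A minimal sketch of learning a weighted combination of similarity scores from labeled pairs is given below. The slides use a learned SVM; this example substitutes a plain perceptron as a dependency-free stand-in, since both produce a linear scoring function of the form w1*sim1(x,y) + w2*sim2(x,y) + b. The feature values, labels, and function names are hypothetical toy data, not results from the experiments.

```python
def train_perceptron(examples, max_epochs=1000):
    """Learn weights and a bias for a linear combination of similarity scores.

    examples: list of (feature_vector, label), label +1 for a true match
    and -1 for a non-match. Stops once an epoch makes no mistakes, which
    is guaranteed to happen when the data are linearly separable.
    """
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in examples:
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified (or on the boundary)
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:
            break
    return weights, bias


def combined_score(weights, bias, x):
    """Weighted sum of the individual similarity scores; > 0 predicts a match."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


# Hypothetical toy data: each vector is (token-based score, edit-based score)
# for one candidate pair of records.
TRAIN = [
    ((0.90, 0.80), +1), ((0.80, 0.90), +1), ((0.95, 0.70), +1),
    ((0.20, 0.30), -1), ((0.10, 0.40), -1), ((0.30, 0.10), -1),
]

if __name__ == "__main__":
    w, b = train_perceptron(TRAIN)
    print("weights:", w, "bias:", b)
    print("new pair score:", combined_score(w, b, (0.85, 0.75)))
```

The same pattern covers both cases on the earlier slide: the feature vector can hold several metrics applied to one field, or one metric applied to several fields.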
Research Goals
  Understand the current “state of the art” in record linkage
  Understand the interplay between record linkage and disclosure limitation problems
    –More generally, understand the interplay between record linkage and analysis of linked data
Initial Research Goals
  SecondString & experiments
    –Used by researchers at U Washington, elsewhere
    –Additional code release coming
    –Still need to implement/evaluate some advanced models
  (Cohen, Ravikumar, Fienberg, 2003a) A Comparison of String Distance Metrics for Name-Matching Tasks (IIWeb workshop at IJCAI-03)
  (Cohen, Ravikumar, Fienberg, 2003b) A Comparison of String Distance Metrics for Matching Names and Records (Data Cleaning workshop at KDD-03)
  (Bilenko, Mooney, Cohen, Ravikumar, Fienberg, 2003) Adaptive name-matching in information integration (IEEE Intelligent Systems, to appear)
  (Ravikumar, Cohen, Fienberg, 2004?) More extensive survey paper, in preparation…
Current Research Goals
  Understand the interplay between record linkage and disclosure limitation problems (more generally, analysis of linked data)
  Draft paper formalizing:
    –Disclosure control for data A: A → A’, so only Pr(A|A’) is available
    –Disclosure policy (attack) as preventing (attempting) inference of Pr(PRIVATE | A’, OutsideInfo)
    –Linkage attack as using A’, B, and the joint Pr(A,B)
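To make the three quantities above concrete, here is a toy numeric sketch, not the draft paper's formalization. It assumes a single binary attribute A that is itself the private value, a release A’ produced by randomized response (A is flipped with a fixed probability), and outside information B related to A through a known joint Pr(A,B). It further assumes A’ is independent of B given A. All names and numbers are hypothetical.

```python
def posterior_from_release(released, joint, flip_prob):
    """Pr(A = 1 | A' = released): what the masked release alone reveals."""
    prior = {a: joint[(a, 0)] + joint[(a, 1)] for a in (0, 1)}
    unnorm = {}
    for a in (0, 1):
        channel = 1.0 - flip_prob if released == a else flip_prob  # Pr(A'|A)
        unnorm[a] = channel * prior[a]
    return unnorm[1] / (unnorm[0] + unnorm[1])


def posterior_with_linkage(released, outside, joint, flip_prob):
    """Pr(A = 1 | A' = released, B = outside): the linkage attacker's view.

    Uses Bayes' rule with Pr(A' | A) * Pr(A, B), which relies on the
    assumption that A' is independent of B given A.
    """
    unnorm = {}
    for a in (0, 1):
        channel = 1.0 - flip_prob if released == a else flip_prob
        unnorm[a] = channel * joint[(a, outside)]
    return unnorm[1] / (unnorm[0] + unnorm[1])


# Hypothetical joint distribution Pr(A, B), keyed by (a, b).
JOINT = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
FLIP_PROB = 0.25  # hypothetical masking noise

if __name__ == "__main__":
    print(posterior_from_release(1, JOINT, FLIP_PROB))
    print(posterior_with_linkage(1, 1, JOINT, FLIP_PROB))
```

In this toy setting the release alone gives the attacker a posterior of 0.75 on A = 1, while linking in B = 1 raises it to about 0.92, which is the kind of gap a disclosure policy would have to bound.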
Current Research Goals
  Understand the interplay between record linkage and disclosure limitation
  Draft paper
  Data selected for initial analysis (NLTCS)
  Linkage and analysis:
    –Analytic linkage: given (X,Y) and (X’,Z) where X and X’ can be linked, find links from X to X’ and Pr(Y,Z) using a sort of bootstrap procedure
       Pr(Y,Z) constrains possible links
    –How to modify this if Pr(Y,Z) is the important output? What if we only care about some property of Pr(Y,Z), e.g. estimating z = f(y)?