Record Linkage and Disclosure Limitation William W. Cohen, CALD Steve Fienberg, Statistics, CALD & C3S Pradeep Ravikumar, CALD.



Research Goals
Understand the current “state of the art” in record linkage
Understand the interplay between record linkage and disclosure limitation problems
–More generally, understand the interplay between record linkage and analysis of linked data

Initial research question: What’s the state of the art in record linkage?
Same/related problems studied (in statistics, database, artificial intelligence) variously as:
–Merge-purge, duplicate detection, de-duping, database hardening, field-matching, object identity problem, object identification, object consolidation, identity uncertainty, reference resolution, co-reference resolution, reference matching, name matching, …
Very few comparative studies across areas
Very few studies on multiple datasets
–Importance of problem-specific tuning unclear

Initial research question: What’s the state of the art in record linkage?
Test suite of 14 (small) linkage problems
“SecondString”: open-source Java toolkit implementing:
–Edit distance: Levenshtein, Needleman-Wunsch, Smith-Waterman, “Monge-Elkan”
–Jaro-like: Jaro measure, Jaro-Winkler
–Token-based: Jaccard, TFIDF, Jensen-Shannon (smoothed w/ Dirichlet, Jelinek-Mercer)
–Hybrid: Monge-Elkan’s “Level 2”, SoftTFIDF (TFIDF-Jaro hybrid)
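To make the difference between the edit-distance and token-based families concrete, here is a minimal Python sketch of one metric from each: unit-cost Levenshtein distance and token-set Jaccard similarity. It is an illustration only, not SecondString's Java API; the function names and the whitespace tokenizer are assumptions made for this example.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def jaccard(s: str, t: str) -> float:
    """Jaccard similarity of the case-folded whitespace-token sets (assumed tokenizer)."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0  # two empty strings are treated as identical
    return len(a & b) / len(a | b)
```

Character-level edit distance penalizes reordered tokens ("William W. Cohen" vs "Cohen William"), while the token-based Jaccard score ignores order entirely; hybrids such as SoftTFIDF sit between the two.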

Initial research question: What’s the state of the art in record linkage?
“SecondString” supports:
–Comparing methods on multiple datasets
  Methodology from information retrieval
  11-pt interpolated precision
–Easily implementing novel hybrid methods
–Combining methods (via learned SVM)
  Labeled data; proxy for hand-tuning on task
  Different distance metrics for the same field
    2.6*TFIDF(x,y) + 0.4*Levenshtein(x,y) + 1.2*Jaro(x,y)
  Same method on different fields
    1.3*dist(x-addr,y-addr) + 2.7*dist(x-lname,y-lname)
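For readers unfamiliar with the evaluation measure named above, the following Python sketch computes 11-point interpolated precision using the standard information-retrieval definition: candidate pairs are ranked by score, and at each recall level 0.0, 0.1, …, 1.0 the interpolated precision is the highest precision observed at that recall or beyond. Whether SecondString computes it in exactly this way is an assumption; the function name and input format are hypothetical.

```python
def eleven_point_interpolated_precision(ranked_labels, total_relevant=None):
    """ranked_labels: booleans ordered best score first; True marks a correct match.

    Returns the interpolated precision at recall 0.0, 0.1, ..., 1.0.
    """
    if total_relevant is None:
        total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return [0.0] * 11

    # (recall, precision) measured just after each correct match in the ranking.
    points = []
    hits = 0
    for rank, is_match in enumerate(ranked_labels, start=1):
        if is_match:
            hits += 1
            points.append((hits / total_relevant, hits / rank))

    result = []
    for k in range(11):
        level = k / 10
        # Interpolation: best precision at any recall >= this level
        # (small tolerance guards against floating-point rounding).
        reachable = [p for r, p in points if r >= level - 1e-12]
        result.append(max(reachable) if reachable else 0.0)
    return result
```

Averaging the eleven values gives a single number per method and dataset, which is one common way to summarize such a comparison.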

Comparison: 7 methods vs 11 datasets
SoftTFIDF is best on average

Comparison: 5 edit-distance-like metrics on 11 datasets
Monge-Elkan is best on average

Comparison: 5 metrics, 11 datasets
Monge-Elkan may not be the best choice on a particular dataset

Levenshtein vs SoftTFIDF
Compare the best average performer with one of the worst
The best average performer is not strictly better!
Solution: look at learning the best (combination of) methods.
Training data serves as a proxy for hand-tuning to a problem
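The idea of learning a combination of methods from labeled pairs can be sketched as follows. The slides use a learned SVM; to keep this example dependency-free, a plain perceptron is swapped in as a stand-in linear learner. The two features (a token-based and a character-based similarity), the toy data, and all names are hypothetical.

```python
def train_perceptron(examples, epochs=100, lr=0.1):
    """examples: list of (features, label), label +1 (match) or -1 (non-match).

    Returns weights w and bias b of a linear scoring function.
    """
    w = [0.0] * len(examples[0][0])
    b = 0.0
    for _ in range(epochs):
        for x, y in examples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: move the boundary toward the label
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def combined_score(w, b, x):
    """Weighted sum of per-metric similarities; positive means 'predict match'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Hypothetical labeled pairs: [token_similarity, char_similarity] -> match?
examples = [
    ([0.9, 0.8], +1), ([0.8, 0.3], +1), ([0.7, 0.9], +1),
    ([0.1, 0.2], -1), ([0.2, 0.6], -1), ([0.3, 0.1], -1),
]
w, b = train_perceptron(examples)
```

The learned weights play the same role as the coefficients in a hand-written combination such as 2.6*TFIDF + 0.4*Levenshtein + 1.2*Jaro, except that they are fitted to the labeled data rather than tuned by hand.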

Research Goals
Understand the current “state of the art” in record linkage
Understand the interplay between record linkage and disclosure limitation problems
–More generally, understand the interplay between record linkage and analysis of linked data

Initial Research Goals
SecondString & experiments
–Used by researchers at U Washington, elsewhere
–Additional code release coming
–Still need to implement/evaluate some advanced models
(Cohen, Ravikumar, Fienberg, 2003a) A Comparison of String Distance Metrics for Name-Matching Tasks (IIWeb workshop at IJCAI-03)
(Cohen, Ravikumar, Fienberg, 2003b) A Comparison of String Distance Metrics for Matching Names and Records (Data Cleaning workshop at KDD-03)
(Bilenko, Mooney, Cohen, Ravikumar, Fienberg, 2003) Adaptive name-matching in information integration (IEEE Intelligent Systems, to appear)
(Ravikumar, Cohen, Fienberg, 2004?) More extensive survey paper, in preparation…

Current Research Goals
Understand the interplay between record linkage and disclosure limitation problems (more generally, analysis of linked data)
Draft paper formalizing
–Disclosure control for data A: A → A’ so only Pr(A|A’) is available
–Disclosure policy (attack) as preventing (attempting) inference of: Pr(PRIVATE | A’, OutsideInfo)
–Linkage attack as using A’, B, joint Pr(A,B)

Current Research Goals
Understand the interplay between record linkage and disclosure limitation
Draft paper
Data selected for initial analysis (NLTCS)
Linkage and analysis:
–Analytic linkage: given (X,Y) and (X’,Z) where X and X’ can be linked, find links from X → X’ and Pr(Y,Z) using a sort of bootstrap procedure
  Pr(Y,Z) constrains possible links
–How to modify this if Pr(Y,Z) is the important output? What if we only care about some property of Pr(Y,Z), e.g. estimating z = f(y)?