1
A Comparison of Record Linkage Techniques
Lowell G. Mason Economist Employment Research and Program Development Staff Bureau of Labor Statistics August 2, 2018
2
Outline Overview of the record linkage process
Description of the record linkage problem: Merging BEA employer-level data with BLS establishment-level data Evaluation of several record linkage classification techniques: Probabilistic (Fellegi-Sunter), Supervised (Logistic, SVM, Random Forests, Gradient Boosting) Conclusion
3
What is Record Linkage Record linkage is the process of joining the observations in k datasets, k = 1, 2, …, n, in the absence of reliable unique identifiers. For simplicity, we assume k = 2 with datasets A and B of size n and m, respectively. Also referred to as data matching, entity resolution, co-reference resolution, or deduplication.
4
Record Linkage Goals
Accurately and reliably model linkages that fit the data while accounting for uncertainty, and do this efficiently.
5
Why Record Linkage is Challenging
Contradictory goals: accuracy and reliability suggest examining all possible pairs, but pairwise comparison is O(nm). Hard to evaluate: without knowing the true linkage status, it is difficult to assess linkage quality and completeness.
6
Why Record Linkage is Challenging, Cont.
Obtaining labeled data to evaluate linkage quality and completeness is expensive: “Gold-standard” labels are double-blind coded, with disagreements adjudicated by a third coder.
7
Overview of Record Linkage Process
[Flowchart, from Christen (2012): Databases A and B each pass through 1. Data pre-processing; 2. Indexing, 3. Candidate pair comparison, and 4. Classification then split candidate pairs into matches, non-matches, and potential matches (the latter sent to clerical review); 5. Evaluation closes the process.]
8
1. Data Pre-processing Concerned with data quality and consistency between sources: Imputation Standardization Pre-processing for names and addresses: Removing unwanted characters and tokens Standardization and tokenization Parsing into multiple output fields
9
Data Pre-processing Example
             Original name                 Standardized name            Parsed name
Database A   COMTRECK SOLUTIONS, INC       comtreck solutions, inc      Name: comtreck solutions; Type: inc; Location: NULL
Database B   Comtrek Solutions (US) Inc.   comtrek solutions (us) inc   Name: comtrek solutions; Location: us
Similarity score*: 0.67 (standardized), 0.84 (parsed)
*Normalized cosine similarity using character 3-grams
10
2. Indexing Concerned with reducing the search space for the remaining record linkage steps: Particularly the record pair classification step. Also called blocking.
11
Deterministic Indexing Methods
Partition the search space into blocks by requiring subsets of the features of each dataset to match according to some function, for example: Require the first four letters of company name pairs to be equal, or require the normalized cosine similarity of company name pairs to be greater than 0.75. May still require pairwise comparisons: in the second example, for instance.
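As a sketch, deterministic blocking on the first four letters of a name might look like the following (the records and field names here are hypothetical toy data, not QCEW/BEA fields):

```python
from collections import defaultdict

# Hypothetical toy records; real QCEW/BEA records carry many more fields.
database_a = [{"id": 1, "name": "comtreck solutions"}, {"id": 2, "name": "acme widgets"}]
database_b = [{"id": 10, "name": "comtrek solutions"}, {"id": 11, "name": "acme widget co"}]

def block_by_prefix(records, k=4):
    """Partition records into blocks keyed on the first k letters of the name."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["name"][:k]].append(rec)
    return blocks

def candidate_pairs(db_a, db_b, k=4):
    """Compare only records whose blocking keys agree."""
    blocks_b = block_by_prefix(db_b, k)
    for rec_a in db_a:
        for rec_b in blocks_b.get(rec_a["name"][:k], []):
            yield rec_a, rec_b

pairs = list(candidate_pairs(database_a, database_b))
# Only within-block pairs survive; cross-block pairs are never compared.
```

Here only the "comt" and "acme" blocks generate comparisons, so the 2 × 2 search space shrinks to two candidate pairs.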
12
Deterministic Indexing Example
Databases A and B have n and m observations, respectively. The search space is nm pairs, (ai,bj), before indexing. After indexing, the space is only 16% the size of the original (assuming a one-to-one correspondence between area and the number of observations). [Diagram: the full n × m grid of pairs before indexing versus the retained blocks after indexing.]
13
Probabilistic Indexing Methods
Locality Sensitive Hashing*: Compresses the search space such that similar pairs are mapped to the same sub-space with high probability: For a fixed record in A, the pairs, (a1,bj), in which a1 and bj map to the same sub-space for all j are the candidate pairs for the record a1 in A. Determining set membership is linear. *Steorts et al. (2014)
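A minimal, self-contained sketch of the idea, using MinHash signatures over character 3-grams with banding (a common LSH scheme for Jaccard similarity; the parameters and helper names are illustrative, not those used in the study):

```python
import zlib
from collections import defaultdict

def char_ngrams(s, n=3):
    """Character n-gram set for a lowercased name."""
    s = s.lower()
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def minhash_signature(tokens, n_hashes=24):
    """One min-hash per seeded hash function; similar sets tend to agree."""
    return [min(zlib.crc32(f"{seed}:{t}".encode()) for t in tokens)
            for seed in range(n_hashes)]

def lsh_candidates(names_a, names_b, n_hashes=24, bands=12):
    """Candidate pairs: records whose signatures agree on at least one band."""
    rows = n_hashes // bands
    buckets = defaultdict(list)
    for j, name in enumerate(names_b):
        sig = minhash_signature(char_ngrams(name), n_hashes)
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].append(j)
    candidates = set()
    for i, name in enumerate(names_a):
        sig = minhash_signature(char_ngrams(name), n_hashes)
        for b in range(bands):
            candidates.update((i, j) for j in buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))])
    return candidates
```

Bucket lookups are linear in the signature length, so candidate generation avoids the full pairwise scan.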
14
3. Candidate Pair Comparison
Concerned with comparing like features between the candidate pairs. Even after pre-processing, it is unlikely that the features in the datasets are directly comparable.
15
Candidate Pair Comparison, Cont.
             Original name                 Standardized name            Parsed name
Database A   COMTRECK SOLUTIONS, INC       comtreck solutions, inc      Name: comtreck solutions; Type: inc; Location: NULL
Database B   Comtrek Solutions (US) Inc.   comtrek solutions (us) inc   Name: comtrek solutions; Location: us
Similarity score*: 0.67, 0.84, 1.0 at successive processing stages
*Normalized cosine similarity using character 3-grams
16
Candidate Pair Comparison, Cont.
Compare the candidate pair features using similarity measures: For pair (ai,bj) with feature f, s = sim(aif,bjf): s = 1 if the feature f is exactly the same for both records; s = 0 if the feature f is completely different; 0 < s < 1 if the feature is approximately equal. Relationship to distance metrics: for a normalized distance metric d, the corresponding similarity measure is s = 1 − d.
17
Similarity Measures Similarity measures include:
Set-based: Hamming similarity, Jaccard similarity
Metric-based: 1 − normalized Euclidean distance, cosine similarity
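For instance, Jaccard and cosine similarity over character 3-grams can be computed directly. This is a sketch; the study's exact tokenization and weighting (e.g., TF-IDF) may differ:

```python
import math
from collections import Counter

def char_ngrams(s, n=3):
    """Character n-grams of a lowercased string."""
    s = s.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def jaccard(a, b):
    """Set-based: |intersection| / |union| of the n-gram sets."""
    sa, sb = set(char_ngrams(a)), set(char_ngrams(b))
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Metric-based: cosine of the angle between n-gram count vectors."""
    ca, cb = Counter(char_ngrams(a)), Counter(char_ngrams(b))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm

# Both measures are 1 for identical strings and 0 for strings sharing no n-grams.
```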
18
4. Classification Concerned with predicting the linkage type for the candidate pairs, (ai,bj). Depending on the methodology used, linkage type can contain 2 or 3 class labels: Match Non-match Potential match
19
Classification Methods
Probabilistic: Fellegi-Sunter (1969) Supervised Learning: Logistic Support Vector Machines (SVM) Random Forests Unsupervised: Clustering Bayesian: Steorts (2015)
20
5. Evaluation Concerned with evaluating the performance of the classification step. For this step, we will assume labeled data are available.
21
Confusion Matrix

                       Actual match          Actual non-match
Predicted match        True Positive (TP)    False Positive (FP)
Predicted non-match    False Negative (FN)   True Negative (TN)

TPR = TP/(TP+FN)   FPR = FP/(FP+TN)   TNR = TN/(TN+FP)   FNR = FN/(TP+FN)
22
Evaluation Measures Based on Confusion Matrix
Accuracy = (TP+TN)/(TP+TN+FP+FN)
Recall = TP/(TP+FN)
Precision = TP/(TP+FP)
F1 = 2 · (precision · recall)/(precision + recall)
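These four measures follow directly from the confusion-matrix counts; a minimal sketch:

```python
def confusion_metrics(tp, fp, fn, tn):
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)        # share of actual matches that were found
    precision = tp / (tp + fp)     # share of predicted matches that are correct
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}
```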
23
Evaluation Measures Based on Confusion Matrix, Cont.
A Receiver Operating Characteristic (ROC) curve plots the True Positive Rate against the False Positive Rate. AUC = area under the ROC curve.
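Equivalently, AUC is the probability that a randomly chosen match receives a higher classifier score than a randomly chosen non-match, with ties counting one half; a sketch of that rank-based computation:

```python
def auc(match_scores, nonmatch_scores):
    """Rank-based AUC: P(score of a random match > score of a random non-match)."""
    wins = 0.0
    for p in match_scores:
        for q in nonmatch_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5    # ties count one half
    return wins / (len(match_scores) * len(nonmatch_scores))
```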
24
Additional Evaluation Measures
Measures involving time: Processing time Time to implement Parameter tuning Measures for indexing: Reduction rate: the relative reduction in the search space from the indexing step In the deterministic indexing example, the reduction rate was 0.84.
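The reduction rate is one minus the retained fraction of the full n × m search space; retaining 16% of the space, as in the deterministic indexing example, gives 0.84:

```python
def reduction_rate(num_candidate_pairs, n, m):
    """Relative reduction in the search space achieved by the indexing step."""
    return 1 - num_candidate_pairs / (n * m)
```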
25
Record Linkage Type Record Linkage may be:
one-to-one, one-to-many, or many-to-many. Ideally, in the case of one-to-many and many-to-many linkages, the classification step would be collective; alternatively, a post-classification step is needed.
26
QCEW Overview Quarterly Census of Employment and Wages (QCEW) program:
Conducted by the Bureau of Labor Statistics Quarterly census of more than 95 percent of U.S. jobs 2017Q4: 9.96 million establishments
27
Establishments in the QCEW
The basic unit of measurement in the QCEW: An establishment is a single physical location where one predominant industrial activity occurs. Data are collected quarterly, including names, addresses, ownership and industry classification, and employment and compensation totals. [Diagram: establishments a through e.]
28
UI Employer Groups Establishments are identified from State Unemployment Insurance (UI) filings. These can be used to aggregate establishments into groups of employers by state. [Diagram: UI employer groups 1–4 containing establishments 1.a, 2.b, 2.c, 3.d, and 4.e.]
29
EIN Employer Groups UI employer groups report the Employer Identification Number (EIN) as part of their State UI filings. EINs are assigned by the IRS and are not unique to any one state. These can be used to aggregate establishments into EIN employer groups. This is the highest level of aggregation possible in the QCEW.
30
EIN Employer groups, cont.
[Diagram: EIN employer groups A, B, and C; UI employer groups A.1, A.2, A.3, and C.4; establishments A.1.a, A.2.b, A.2.c, B.3.d, and C.4.e.]
31
Single and Multi-Unit Establishment Employers
Establishments can be classified by whether the aggregated employer group (UI or EIN) contains a single establishment or multiple establishments.
32
Single and Multi-Unit Establishments, Cont.
[Diagram: EIN employer groups A, B, and C; UI employer groups A.1, A.2, A.3, and C.4; establishments A.1.a, A.2.b, A.2.c, B.3.d, and C.4.e.]
33
Firm Structure in the QCEW
Firms may report under one or more EINs. Firm structure is unknown within the QCEW.
34
Firm Structure in the QCEW, Cont.
[Diagram: Firms I and II; EIN employer groups I.A, I.B, and II.C; UI employer groups I.A.1, I.A.2, I.B.3, and II.C.4; establishments I.A.1.a, I.A.2.b, I.A.2.c, I.B.3.d, and II.C.4.e.]
35
Inward FDI Overview Inward Foreign Direct Investment (FDI):
Conducted by the Bureau of Economic Analysis: Benchmark surveys are conducted every 5 years, with annual updates for a subset of the data otherwise. The data used in this study are from the 2012 benchmark survey; QCEW data are from the same period.
36
U.S. Affiliates of Foreign Owned Firms
The unit of measurement for BEA Inward FDI data is an affiliate: A single U.S.-based employer group that has at least 10% foreign ownership. Data include balance sheets, income statements, goods and services supplied, and employment and compensation.
37
Structure of U.S. Affiliates of Foreign Owned Firms
Unlike the QCEW, structure in BEA Inward FDI data is defined from the top down: the unit of measurement is the affiliate, the equivalent of a U.S. firm. [Diagram: Foreign Firms I, II, and III with U.S. Affiliates I.A, II.B, and III.C.]
38
Structure of U.S. Affiliates of Foreign Owned Firms, Cont.
For large affiliates, subsidiary data are collected: Subsidiary name Subsidiary EIN For all affiliates, employment is broken out by state.
39
Structure of U.S. Affiliates of Foreign Owned Firms, Cont.
[Diagram: Foreign Firm I with Affiliate I.A and Subsidiaries I.A.1 (EIN 1) and I.A.2 (EIN 2); Foreign Firm II with Affiliate II.B and Subsidiary II.B.3 (EIN 3); Foreign Firm III with Affiliate III.C. These map to EIN employer groups 1, 2, and 3.]
40
Inward FDI to QCEW Linkage
Unique identifiers are available: EINs. However, they are very noisy and are not used in this study. The type of linkage is one-to-many: one BEA FDI affiliate may link to many QCEW establishments. This can be managed to a degree by aggregating the QCEW to the EIN employer group level.
41
Inward FDI to QCEW Linkage, Cont.
It is still a one-to-many linkage, but at least we are comparing similar things: the subsidiaries of affiliates to QCEW EIN employer groups. Features of the two datasets that are comparable are: Employer names and addresses Contact info (contact name and contact phone) Distribution of employment across states Distribution of employment across industrial sectors
42
Inward FDI to QCEW Linkage, Cont.
Aggregation necessitates reducing the one-to-many record pair comparisons to a one-to-one comparison: For example, compare the BEA affiliate name to all establishments names for an EIN, calculating similarity measures for each pair. Use the maximum value over all the pairs as the measure of similarity between the BEA affiliate and the QCEW EIN employer group.
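A sketch of this reduction, with a simple word-level Jaccard similarity standing in for the study's actual similarity measures (the function names and toy data are illustrative):

```python
def word_jaccard(a, b):
    """Toy similarity: Jaccard overlap of lowercased word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def max_group_similarity(affiliate_name, establishment_names, sim=word_jaccard):
    """One-to-one score for a one-to-many comparison: best match within the EIN group."""
    return max(sim(affiliate_name, name) for name in establishment_names)
```

Taking the maximum over the group yields a single affiliate-to-EIN score per feature, so the classifier sees one-to-one comparison vectors.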
43
Why Merge BLS QCEW and BEA FDI Data?
Leverages existing data, reducing respondent burden BLS establishment data augment BEA enterprise data
44
Overview of Inward FDI to QCEW Record Linkage Process
Pre-processing: Aggregating QCEW data to the EIN level. Breaking out BEA affiliate subsidiary data to the EIN level. Storing the data in such a way that it could be processed efficiently.
45
Overview of Inward FDI to QCEW Record Linkage Process, Cont.
Indexing: Locality Sensitive Hashing of company names (using Cosine similarity and TF-IDF) for different partitions of the underlying data: Single establishment versus multi-unit establishment For single establishments, by state
46
Overview of Inward FDI to QCEW Record Linkage Process, Cont.
Candidate pair comparisons: Max similarity for comparable features. For supervised learning methods, features particular to BEA affiliates and QCEW EIN aggregations.
47
Overview of Inward FDI to QCEW Record Linkage Process, Cont.
Classification: Probabilistic: Fellegi-Sunter Supervised Learning: Logistic, SVM, Random Forests, Gradient Boosting Evaluation
48
Comparison of Results

Measure      Fellegi-Sunter   Logistic   SVM      Random Forest   Gradient Boosting
Accuracy     0.7564           0.8018     0.8641   0.9087          0.8752
Recall       0.8661           0.8463     0.8875   0.9437          0.8980
Precision    0.6080           0.7407     0.8360   0.8706          0.8483
F1           0.7145           0.7900     0.8610   0.9057          0.8724
ROC-AUC      0.7672           0.8794     0.9326   0.9638          0.9432
49
Comparison of Results, Cont.
50
Comparison of Results, Cont.
Measure                        Fellegi-Sunter   Logistic   SVM        Random Forest   Gradient Boosting
Processing time (in seconds)   1.52             41.84      1,360.13   310.82          16.32
Hyper-parameters to tune       1 (1)            2-3        6-8 (2)    14 (5)          16
Implementation time: Low / Medium / High (per-method values not recoverable)
Sensitivity to hyper-parameters: [values not recoverable]
51
Concluding Remarks Supervised learning techniques are more involved than probabilistic techniques, but have several advantages: Better performance The ability to account for features particular to each data source BEA affiliate characteristics EIN characteristics
52
Concluding Remarks, Cont.
However, neither class of techniques classifies linkages collectively. The technique introduced by Steorts (2015) may be able to do this.
53
References Christen, P. (2012). Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer. Steorts, R. C., Ventura, S., Sadinle, M., and Fienberg, S. (2014). "A Comparison of Blocking Methods for Record Linkage." In Privacy in Statistical Databases, Springer. Steorts, R. C. (2015). "Entity Resolution with Empirically Motivated Priors." Bayesian Analysis, 10(4).
54
Lowell G. Mason Economist OEUS/ERPDS