Using Reported Data as Matching Variables in Record Linkage
By Bill Iwig, Kara Daniel, Tom Pordugal, and Stan Hoge
National Agricultural Statistics Service

NASS Use of Record Linkage
- Match new list sources to the Farm Register
- Identify duplication within the Farm Register
- Match Area Frame records to the Farm Register for measuring coverage

Record Linkage Procedures
- Matching variables are divided into components
- Matching components are assigned agreement and disagreement weights
- Records are only compared within blocks
- Sum of agreement and disagreement weights is compared to thresholds

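The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the NASS system: the component names, the weight values, the two thresholds, and the blocking field `county` are all hypothetical, chosen only to show how blocking, per-component weights, and threshold comparison fit together.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical (agreement, disagreement) weights per matching component.
WEIGHTS = {
    "surname": (4.0, -2.0),
    "first_name": (2.0, -1.0),
    "zip": (3.0, -3.0),
}
# Hypothetical thresholds on the summed weight.
UPPER, LOWER = 6.0, 2.0


def score(a, b):
    """Sum the agreement or disagreement weight of every component."""
    return sum(
        agree if a[comp] == b[comp] else disagree
        for comp, (agree, disagree) in WEIGHTS.items()
    )


def classify(total):
    """Compare the summed weight to the two thresholds."""
    if total >= UPPER:
        return "match"
    if total >= LOWER:
        return "possible match"
    return "non-match"


def link(records, block_key):
    """Compare records only with other records in the same block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec[block_key]].append(rec)
    results = []
    for members in blocks.values():
        for a, b in combinations(members, 2):
            results.append((a["id"], b["id"], classify(score(a, b))))
    return results


records = [
    {"id": 1, "surname": "Smith", "first_name": "John", "zip": "50010", "county": "A"},
    {"id": 2, "surname": "Smith", "first_name": "Jon", "zip": "50010", "county": "A"},
    {"id": 3, "surname": "Smith", "first_name": "John", "zip": "50010", "county": "B"},
    {"id": 4, "surname": "Jones", "first_name": "Mary", "zip": "50010", "county": "A"},
]
pairs = link(records, "county")
```

Record 3 agrees with record 1 on every component but is never compared to it, because it falls in a different block. That is the trade-off blocking makes: far fewer comparisons, at the cost of missing matches that disagree on the blocking field.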
Record Linkage System Enhancement
- Use data items as matching variables
- Provided through SuperMatch software feature
- Parameters allow “close” values to match and be assigned a reduced agreement weight

Identifying Duplication on 2002 Census of Agriculture Data File
- 2.85 million records on the Census Mail List
- Positive data for 1.1 million at the time of record linkage
- Numerous steps to eliminate duplication prior to data capture
- Duplication still exists!

Using Census Reported Data as Matching Variables to Identify Duplication
- 40 data items used
- “0” values not considered for matching
- Fewer than 10 positive values for most records

Initial Record Linkage Parameters
- Agreement weight = 1
- Disagreement weight = 0
- “Non-tolerable” percentage difference = 11
- Sum of weights threshold = 5

Pro-rated Agreement Weight Examples
- A = 100, B = 95, Wt = .52
- A = 20, B = 19, Wt = .52
- A = 20, B = 18, Wt = 0

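The slides do not state how SuperMatch computes the pro-rated weight. One formula that reproduces all three examples is a linear reduction of the agreement weight, with the percentage difference taken relative to the smaller of the two values. The sketch below assumes that formula; `prorated_weight`, `match_score`, and the rule that an item is skipped when either record reports 0 are illustrative assumptions, not the documented SuperMatch behavior.

```python
AGREE_WT = 1.0            # agreement weight (from the slides)
DISAGREE_WT = 0.0         # disagreement weight (from the slides)
NON_TOLERABLE_PCT = 11.0  # "non-tolerable" percentage difference (from the slides)
THRESHOLD = 5.0           # sum-of-weights threshold (from the slides)


def prorated_weight(a, b):
    """Assumed rule: linear pro-rating, difference relative to the smaller value."""
    pct_diff = abs(a - b) / min(a, b) * 100.0
    if pct_diff >= NON_TOLERABLE_PCT:
        return DISAGREE_WT
    return AGREE_WT * (1.0 - pct_diff / NON_TOLERABLE_PCT)


def match_score(rec_a, rec_b):
    """Sum weights over data items, skipping any item with a 0 on either record."""
    total = 0.0
    for item, a in rec_a.items():
        b = rec_b.get(item, 0)
        if a > 0 and b > 0:
            total += prorated_weight(a, b)
    return total


def is_potential_duplicate(rec_a, rec_b):
    return match_score(rec_a, rec_b) >= THRESHOLD


for a, b in [(100, 95), (20, 19), (20, 18)]:
    print(a, b, round(prorated_weight(a, b), 2))
# prints:
# 100 95 0.52
# 20 19 0.52
# 20 18 0.0
```

Under this reading, 20 versus 18 gets no credit because 2/18 is about 11.1 percent, just past the non-tolerable limit, while 100 versus 95 and 20 versus 19 both differ by about 5.3 percent and earn the same partial weight.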
Results
- Approximately 1500 potential duplicates identified
- Actual number of duplicates less than 500

Recommendations for Effective Application of Data Matching Feature
- Evaluate distribution of response differences for true duplicates
- Evaluate handling of “0” values
- Highly correlated variables
- Edited and imputed variables
- Threshold value for matching