Election Donor Records Linkage Using R, SQL Server 2016 and C5 Jeff Winchell
Fundraisers need donor history. Manual lookup is tedious. Example: Federal Election Commission’s Contributor Search Form: FEC.Gov/finance/disclosure/norindsea.shtml
Record Linkage (Data Matching) Applications: Master Data Management; ETL; Internet Search; Historical Research / Family History; Medical Research (Person-Oriented Health Statistics); Criminal Search
Accuracy Measures
Performance Considerations: 30-second response time for a match request; shorter is better. Databases with ~26 million records: comparing every possible pair means about 26M × 26M / 2 unique pairs, which is over 300,000,000,000,000 comparisons of 500-byte records, or about 35 times the size of all webpages on the Internet.
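The pair count above can be checked with a few lines of arithmetic. This is a minimal sketch in Python (chosen here for illustration; the deck itself uses T-SQL and R). The record count and record size come from the slide; the variable names and the assumption that each comparison touches one 500-byte record's worth of data are illustrative only.

```python
# Back-of-the-envelope cost of comparing every pair of records.
n_records = 26_000_000   # database size quoted on the slide
record_bytes = 500       # record size quoted on the slide

ordered_pairs = n_records * n_records             # 26M * 26M, each pair counted twice
unique_pairs = n_records * (n_records - 1) // 2   # each pair counted once

# Assumption: one comparison touches 500 bytes of record data.
bytes_touched = unique_pairs * record_bytes

print(f"ordered pairs: {ordered_pairs:.3e}")
print(f"unique pairs:  {unique_pairs:.3e}")
print(f"data volume:   {bytes_touched / 1e15:.0f} PB")
```

The "over 300 trillion" figure corresponds to the unique-pair count; counting ordered pairs doubles it.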
Blocking and Query Optimization: Generate likely record pair candidates. Take advantage of decades of database performance optimization.
SELECT Id, IsNickName
FROM (
    SELECT Person_Name.Id, Person_Name.[First], Person_Name.[Last],
           IIF(Nick.[First] IS NOT NULL AND Person_Name.[First] <> @First, 1, 0) AS IsNickName
    FROM dbo.Person_Name
    LEFT OUTER JOIN @Nick Nick ON Nick.[First] = Person_Name.[First]
    WHERE ((Person_Name.FirstInit = LEFT(@First, 1)
            AND Person_Name.LenFirstName BETWEEN LEN(@First) - 1 AND LEN(@First) + 1)
           OR Nick.[First] IS NOT NULL)
      AND Person_Name.LastInit = LEFT(@Last, 1)
      AND Person_Name.LenLastName BETWEEN LEN(@Last) - 1 AND LEN(@Last) + 1
) Pot_Names1
WHERE dbo.Edit_Distance_Opt([Last], @Last, 1) <= 1
  AND (dbo.Edit_Distance_Opt([First], @First, 1) <= 1
       OR [First] IN (SELECT [First] FROM @Nick))
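The two-stage shape of the query (cheap blocking tests on initials and lengths, then an edit-distance test on the survivors) can be sketched outside the database. The Python below is an illustration only, not the author's implementation: `within_one_edit` is a hypothetical stand-in for `dbo.Edit_Distance_Opt(..., 1) <= 1`, `candidates` is a hypothetical helper, and names are assumed to be already normalized to lower case.

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a and b differ by at most one insertion, deletion, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1                           # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # at most one substitution
    return a[i:] == b[i + 1:]            # at most one extra character in b


def candidates(people, first, last, nicknames):
    """people: iterable of (id, first, last); nicknames: known variants of `first`."""
    out = []
    for pid, p_first, p_last in people:
        is_nick = p_first in nicknames
        # Stage 1: cheap blocking tests on initials and lengths.
        first_block = (p_first[:1] == first[:1]
                       and abs(len(p_first) - len(first)) <= 1) or is_nick
        last_block = (p_last[:1] == last[:1]
                      and abs(len(p_last) - len(last)) <= 1)
        if not (first_block and last_block):
            continue
        # Stage 2: edit-distance test, run only on rows that survived blocking.
        if within_one_edit(p_last, last) and (within_one_edit(p_first, first) or is_nick):
            out.append((pid, int(is_nick and p_first != first)))
    return out


people = [(1, "robert", "smith"), (2, "bob", "smyth"),
          (3, "roberta", "jones"), (4, "rupert", "smith")]
matches = candidates(people, "robert", "smith", {"bob", "rob", "bobby"})
print(matches)
```

In the database version the stage 1 predicates can use indexes on the precomputed initial and length columns, so the expensive function runs on far fewer rows.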
Performance Optimizations: In-Memory Tables; Column-Store Indexes; Indexed Views (Materialized Views); Indexes on user-defined functions; Bit-Mapped Index Optimization; B-Tree, Hash, Clustered, Unique, Filtered, Spatial, XML, and Full-Text indexes; Decades of Query Optimizer Work; 200+ Queries/sec on 10 TB Data Warehouse (TPC-H)
Supervised Machine Learning: Labeled Data Challenge. Getting representative labeled data. So many possibilities. Steering a model wrong (decision trees help address this).
Connecting the pairs: SQL is not the best solution for this (sets vs. hierarchies vs. networks). Connected subgraphs: R package (igraph) and its decompose method.
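Grouping matched pairs into donors is a connected-components problem: if record 1 matches 2 and 2 matches 3, all three belong to one person. The talk uses igraph's decompose in R for this; the sketch below is a swapped-in union-find version in Python, shown only to illustrate the idea. The function name `connected_groups` and the sample record ids are hypothetical.

```python
def connected_groups(pairs):
    """Group record ids into connected components given matched (id, id) pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a          # merge the two components

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Hypothetical matched pairs produced by the pairwise matching step.
matched_pairs = [(1, 2), (2, 3), (7, 8), (10, 11), (11, 7)]
donor_groups = connected_groups(matched_pairs)
print(donor_groups)
```

Each returned group is the set of records presumed to belong to one donor.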
C5 Decision Trees: C4.5, the predecessor of C5, ranked #1 in a 2006 IEEE survey of top data mining algorithms. ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners created a list of 18 nominees, and IEEE and ACM award winners voted on the nominees. Decision trees are intuitive to understand.
1. C4.5
2. The k-Means algorithm
3. Support Vector Machines
4. The Apriori algorithm
5. Expectation-Maximization
6. PageRank
7. AdaBoost
8. k-Nearest Neighbor Classification
9. Naive Bayes
10. CART (Classification and Regression Trees)
80/20 Rule Practical Data Science Jeff Winchell’s addiction to forecasting led to his becoming a data scientist long before that term became fashionable. Besides his consulting, he currently feeds his lifelong learning habit by working on a master’s in the field from Harvard. He has employed a very wide range of data-related technologies. He has a wide range of interests outside of technology, but making his two adorable children smile is his favorite. Jeff_Winchell@g.Harvard.edu – (425) 628-0551