Election Donor Records Linkage


1 Election Donor Records Linkage
Using R, SQL Server 2016 and C5
Jeff Winchell

2 Fundraisers need donor history
Manual lookup is tedious
Example: Federal Election Commission’s Contributor Search Form: FEC.Gov/finance/disclosure/norindsea.shtml

3 Record Linkage (Data Matching) Applications
Master Data Management
ETL
Internet Search
Historical Research / Family History
Medical Research (Person-Oriented Health Statistics)
Criminal Search

4 Accuracy Measures

5 Performance Considerations
30-second response time for a match request; shorter is better.
Databases with ~26 million records: comparing every possible pair takes about 26M * 26M / 2 comparisons (each unordered pair counted once), which is over 300,000,000,000,000 comparisons of 500-byte records, about 35 times the size of all webpages on the Internet.
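The pair count on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope check, assuming each unordered pair of records is compared once; the function name `all_pairs` is made up for illustration.

```python
# Back-of-envelope check of the all-pairs comparison count.
# Assumption: each unordered pair of records is compared exactly once.

def all_pairs(n: int) -> int:
    """Number of unordered record pairs among n records (n choose 2)."""
    return n * (n - 1) // 2

N_RECORDS = 26_000_000  # approximate database size quoted on the slide

pairs = all_pairs(N_RECORDS)
print(f"{pairs:,} candidate pairs without blocking")
```

Even at an optimistic comparison rate, a count of this size cannot fit in a 30-second response window, which is the motivation for reducing the candidate set before any detailed comparison.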

6 Blocking and Query Optimization
Generate likely record pair candidates
Take advantage of decades of database performance optimization
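As a rough illustration of blocking, the sketch below groups records by a cheap key and only pairs records that share it. It is a Python stand-in, not the deck's SQL Server implementation. The key (last-name initial plus exact last-name length) is a simplified assumption loosely modeled on the LastInit and LenLastName columns in the query on the next slide. The sample donors are invented.

```python
from collections import defaultdict
from itertools import combinations

def blocking_key(record):
    """Hypothetical blocking key: last-name initial plus last-name length."""
    last = record["last"].strip().upper()
    return (last[:1], len(last))

def candidate_pairs(records):
    """Yield id pairs only for records that fall in the same block."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)

# Invented sample data, for illustration only.
donors = [
    {"id": 1, "first": "William", "last": "Smith"},
    {"id": 2, "first": "Bill", "last": "Smyth"},
    {"id": 3, "first": "Ann", "last": "Jones"},
    {"id": 4, "first": "Anne", "last": "Jonas"},
    {"id": 5, "first": "Carl", "last": "Sanderson"},
]

pairs = sorted(candidate_pairs(donors))
print(pairs)  # far fewer than the 10 pairs an all-pairs comparison would need
```

An exact-length key is stricter than the query's Between ranges, so a real blocking scheme would likely widen the key to avoid missing true matches.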

7 SELECT Id,IsNickName FROM (SELECT Person_Name.Id, Person_Name
SELECT Id,IsNickName
FROM (SELECT Person_Name.Id, Person_Name.[First], Person_Name.[Last],
             IIF(Nick.[First] Is Not Null AND As IsNickName
      FROM dbo.Person_Name
      LEFT OUTER JOIN @Nick Nick ON Nick.[First]=Person_Name.[First]
      WHERE AND Person_Name.LenFirstName Between And OR Nick.[First] Is NOT Null )
        AND Person_Name.LastInit
        AND Person_Name.LenLastName Between And
     ) Pot_Names1
WHERE AND OR [First] In (SELECT [First]

8 Performance Optimizations
In-Memory Tables
Column-Store Indexes
Indexed Views (Materialized Views)
Indexes on user-defined functions
Bit-Mapped Index Optimization
B-Tree, Hash, Clustered, Unique, Filtered, Spatial, XML, Full-Text
Decades of Query Optimizer Work
200+ Queries/sec on 10 TB Data Warehouse (TPC-H)

9 Supervised Machine Learning Labeled Data Challenge
Getting representative labeled data
So many possibilities
Steering a model wrong (decision trees help address this)

10 Connecting the pairs

11 Connecting the pairs
SQL is not the best solution for this: sets vs. hierarchies vs. networks
Connected subgraphs
R package (igraph) and its decompose method
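The idea behind connecting the pairs is that matched pairs form a graph, and each connected subgraph is one person. The deck uses R's igraph for this. The sketch below is a Python stand-in that uses a union-find structure instead, so it shows the same grouping idea rather than the author's code. The function name and the sample pairs are invented.

```python
def connected_groups(pairs):
    """Group record ids into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the two components of every matched pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect members under their component root.
    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())

# Invented matched pairs: 1-2 and 2-3 chain together into one group.
matched_pairs = [(1, 2), (2, 3), (4, 5), (7, 6)]
groups = connected_groups(matched_pairs)
print(groups)
```

Note that records 1 and 3 end up in the same group even though they were never directly matched, which is exactly the transitive step that is awkward to express in set-based SQL.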

12 C5 Decision Trees
#1 in 2006 IEEE survey of top data mining algorithms
ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners created list of 18 nominees
IEEE and ACM award winners voted on the nominees
Decision trees are intuitive to understand
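Part of why decision trees are easy to follow is that each node is a single threshold test chosen to separate the classes as cleanly as possible. The sketch below is not C5 itself. It only shows the entropy-based information gain for one split on a hypothetical name-similarity score, with invented labeled pairs. C4.5-family trees refine this criterion with gain ratio and add pruning.

```python
from math import log2

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    result = 0.0
    for cls in set(labels):
        p = labels.count(cls) / total
        result -= p * log2(p)
    return result

def information_gain(scores, labels, threshold):
    """Entropy reduction from splitting the pairs at score <= threshold."""
    left = [y for s, y in zip(scores, labels) if s <= threshold]
    right = [y for s, y in zip(scores, labels) if s > threshold]
    n = len(labels)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - weighted

def best_threshold(scores, labels):
    """Try midpoints between sorted distinct scores and keep the best gain."""
    distinct = sorted(set(scores))
    midpoints = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(midpoints, key=lambda t: information_gain(scores, labels, t))

# Invented labeled pairs: similarity score and whether the pair is a true match.
name_similarity = [0.20, 0.35, 0.80, 0.95]
is_match = [False, False, True, True]

split = best_threshold(name_similarity, is_match)
gain = information_gain(name_similarity, is_match, split)
print(f"split at {split:.3f}, gain {gain:.3f} bits")
```

The resulting rule reads as a plain sentence (similarity above the threshold means match), which is the intuitive quality the slide points to.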

13
1. C4.5
2. The k-Means algorithm
3. Support Vector Machines
4. The Apriori algorithm
5. Expectation-Maximization
6. PageRank
7. AdaBoost
8. k-Nearest Neighbor Classification
9. Naive Bayes
10. CART (Classification and Regression Trees)

14 80/20 Rule Practical Data Science
Jeff Winchell’s addiction to forecasting led to his becoming a data scientist long before that term became fashionable. Besides consulting, he currently feeds his lifelong learning habit by working on a master’s in the field from Harvard. He has employed a very wide range of data-related technologies. He has a wide range of interests outside of technology, but making his two adorable children smile is his favorite.

