Election Donor Records Linkage


Election Donor Records Linkage Using R, SQL Server 2016 and C5 Jeff Winchell

Fundraisers need donor history. Manual lookup is tedious. Example: the Federal Election Commission’s Contributor Search Form: FEC.Gov/finance/disclosure/norindsea.shtml

Record Linkage (Data Matching) Applications: Master Data Management; ETL; Internet Search; Historical Research / Family History; Medical Research (Person-Oriented Health Statistics); Criminal Search

Accuracy Measures

Performance Considerations: 30-second response time for a match request; shorter is better. Databases with ~26 million records: comparing every possible pair once is 26M * 26M / 2, which is over 300,000,000,000,000 comparisons of 500-byte records, about 35 times the size of all webpages on the Internet.
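The arithmetic behind that estimate can be checked with a short sketch. It assumes each unordered pair is compared exactly once and that a comparison reads both 500-byte records; the names below are illustrative and not from the slides.

```python
# Back-of-the-envelope cost of comparing every record with every other record.
N_RECORDS = 26_000_000   # approximate database size quoted on the slide
RECORD_BYTES = 500       # record size quoted on the slide


def unordered_pairs(n: int) -> int:
    """Number of distinct record pairs when each pair is compared once."""
    return n * (n - 1) // 2


pair_count = unordered_pairs(N_RECORDS)
# Assumption: every comparison reads both records of the pair.
bytes_read = pair_count * 2 * RECORD_BYTES

print(f"{pair_count:,} comparisons")
print(f"{bytes_read / 1e15:,.0f} PB read")
```

Even at a million comparisons per second, that many pairs would take years to check, which is why the exhaustive approach cannot meet a 30-second target.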

Blocking and Query Optimization: Generate likely record pair candidates. Take advantage of decades of database performance optimization.
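A minimal sketch of blocking, written in Python rather than SQL so it runs stand-alone. The blocking key (last-name initial plus first-name initial) and the sample records are hypothetical; they only show how grouping by a cheap key shrinks the set of candidate pairs.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(first: str, last: str) -> tuple:
    """Hypothetical blocking key: last-name initial plus first-name initial."""
    return (last[:1].upper(), first[:1].upper())


def candidate_pairs(records):
    """Yield pairs of record ids that share a blocking key."""
    blocks = defaultdict(list)
    for rec_id, first, last in records:
        blocks[blocking_key(first, last)].append(rec_id)
    for ids in blocks.values():
        # Only records inside the same block are ever compared.
        yield from combinations(ids, 2)


# Made-up sample data for illustration only.
records = [
    (1, "Jeff", "Winchell"),
    (2, "Jef", "Winchel"),
    (3, "Maria", "Lopez"),
    (4, "Mario", "Lopes"),
    (5, "Ann", "Smith"),
]
candidates = list(candidate_pairs(records))
all_pairs = len(records) * (len(records) - 1) // 2
print(candidates, "out of", all_pairs, "possible pairs")
```

In a database the same grouping falls out of an index on the key columns, so the optimizer does the work that the dictionary does here.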

SELECT Id, IsNickName
FROM (
    -- Blocking step: cheap filters on initials, name lengths and known nicknames
    SELECT Person_Name.Id,
           Person_Name.[First],
           Person_Name.[Last],
           IIF(Nick.[First] IS NOT NULL AND Person_Name.[First] <> @First, 1, 0) AS IsNickName
    FROM dbo.Person_Name
    LEFT OUTER JOIN @Nick Nick ON Nick.[First] = Person_Name.[First]
    WHERE ((Person_Name.FirstInit = LEFT(@First, 1)
            AND Person_Name.LenFirstName BETWEEN LEN(@First) - 1 AND LEN(@First) + 1)
           OR Nick.[First] IS NOT NULL)
      AND Person_Name.LastInit = LEFT(@Last, 1)
      AND Person_Name.LenLastName BETWEEN LEN(@Last) - 1 AND LEN(@Last) + 1
) Pot_Names1
-- Matching step: edit-distance check on the surviving candidates only
WHERE dbo.Edit_Distance_Opt([Last], @Last, 1) <= 1
  AND (dbo.Edit_Distance_Opt([First], @First, 1) <= 1
       OR [First] IN (SELECT [First] FROM @Nick))
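The source of dbo.Edit_Distance_Opt is not shown in the slides. The sketch below is one plausible shape for such a function, assuming its third argument is a maximum distance that allows an early exit; the Python name bounded_edit_distance is hypothetical.

```python
def bounded_edit_distance(a: str, b: str, max_dist: int) -> int:
    """Levenshtein distance between a and b, capped at max_dist + 1.

    Returns max_dist + 1 as soon as the true distance is known to exceed
    max_dist, so callers can test `result <= max_dist` cheaply.
    """
    if abs(len(a) - len(b)) > max_dist:
        return max_dist + 1              # length gap alone rules the pair out
    prev = list(range(len(b) + 1))       # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        if min(curr) > max_dist:
            return max_dist + 1          # no alignment can get back under the cap
        prev = curr
    return min(prev[-1], max_dist + 1)


print(bounded_edit_distance("Winchell", "Winchel", 1))
print(bounded_edit_distance("Lopez", "Smith", 1))
```

The early exits matter because most candidate names are far apart; the cap lets the function reject them after a row or two instead of filling the whole table.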

Performance Optimizations: In-Memory Tables; Column-Store Indexes; Indexed Views (Materialized Views); Indexes on user-defined functions; Bit-Mapped Index Optimization; B-Tree, Hash, Clustered, Unique, Filtered, Spatial, XML and Full-Text indexes; Decades of Query Optimizer Work; 200+ Queries/sec on 10 TB Data Warehouse (TPC-H)

Supervised Machine Learning: Labeled Data Challenge. Getting representative labeled data; so many possibilities; steering a model wrong (decision trees help address this).

Connecting the pairs

Connecting the pairs: SQL is not the best solution for this (sets vs hierarchies vs networks). Connected subgraphs: the R package igraph and its decompose method.
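The slides use igraph's decompose in R for this step. As a stand-alone illustration of the same idea, the Python sketch below swaps in a small union-find structure to group matched pairs into connected components; the function name and sample pairs are hypothetical.

```python
def connected_groups(pairs):
    """Group record ids into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a          # merge the two components

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


# Made-up matched pairs: 1-2-7 form one person, 3-4-9 another.
matched_pairs = [(1, 2), (2, 7), (3, 4), (9, 3)]
print(connected_groups(matched_pairs))
```

Records 1 and 7 are never matched directly, yet they end up in the same group through record 2, which is the transitive step that is awkward to express as a plain set-based SQL query.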

C5 Decision Trees: C4.5, the predecessor of C5, ranked #1 in the 2006 IEEE survey of top data mining algorithms. ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners created a list of 18 nominees. IEEE and ACM award winners voted on the nominees. Decision trees are intuitive to understand.
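Part of why these trees are easy to read is that each node is a single test chosen by a simple score. C4.5 scores candidate splits by gain ratio; the sketch below computes it on a tiny, made-up set of labeled record pairs. The feature names (last_close, same_zip) are hypothetical, and the sketch only assumes C5 selects splits in a broadly similar way.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, feature, label="match"):
    """Information gain of splitting on `feature`, divided by split information."""
    labels = [r[label] for r in rows]
    partitions = {}
    for r in rows:
        partitions.setdefault(r[feature], []).append(r[label])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy([r[feature] for r in rows])
    return gain / split_info if split_info > 0 else 0.0


# Made-up labeled pairs: is the pair the same donor?
rows = [
    {"last_close": True,  "same_zip": True,  "match": True},
    {"last_close": True,  "same_zip": False, "match": True},
    {"last_close": False, "same_zip": True,  "match": False},
    {"last_close": False, "same_zip": False, "match": False},
]
best = max(["last_close", "same_zip"], key=lambda f: gain_ratio(rows, f))
print(best)
```

On this toy data the last-name test separates matches from non-matches perfectly, so it becomes the root of the tree, and the resulting rule can be read directly by a person reviewing the model.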

1. C4.5
2. The k-Means algorithm
3. Support Vector Machines
4. The Apriori algorithm
5. Expectation-Maximization
6. PageRank
7. AdaBoost
8. k-Nearest Neighbor Classification
9. Naive Bayes
10. CART (Classification and Regression Trees)

80/20 Rule: Practical Data Science. Jeff Winchell’s addiction to forecasting led to his becoming a data scientist long before that term became fashionable. Besides his consulting, he currently feeds his lifelong learning habit by working on a master’s in the field from Harvard. He has employed a very wide range of data-related technologies. He has a wide range of interests outside of technology, but making his two adorable children smile is his favorite. Jeff_Winchell@g.Harvard.edu – (425) 628-0551