1 Record Linkage & Fuzzy Matching (More on "Blocking" for Performance Improvement) Joseph Vertido Melissa Data 800-800-6245 Fuzzy.

Presentation transcript:

1 Record Linkage & Fuzzy Matching (More on "Blocking" for Performance Improvement) Joseph Vertido Melissa Data Fuzzy Matching

2 Advanced Fuzzy Matching. Agenda: overview (matching in terms of data quality); the problem; walk-through of the methodology; real implementation example in Microsoft SSIS; code and samples available.

Data Quality “Data integration and data quality are fundamental prerequisites for the successful implementation of enterprise applications, such as CRM and ERP.” Gartner

Data Quality: 3 Common Issues with Data Quality
1) Inaccurate and inconsistent data
2) Missing data
3) Duplicates

5 Data Quality as defined by Gartner (6/11/2016)

Scenario: The database is open to duplicates and to inconsistencies in data. Ideally, you want all records to be unique.

Fuzzy Matching: Why do we need fuzzy? Incoming records may not be identical to existing records. We need to detect existing data and eliminate duplicates, and to handle keyboard typing errors, misspellings, and similar names.
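To illustrate why an exact comparison is not enough, here is a minimal sketch in Python. It uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity measure; the presentation does not say which algorithm is used at this point, and the sample names are made up.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Made-up sample values: the incoming record has a keyboard transposition.
incoming = "Jonh Smith"
existing = "John Smith"

# An exact comparison reports "different" and lets the duplicate through.
print(incoming == existing)

# A fuzzy comparison reports a high similarity instead.
print(round(similarity(incoming, existing), 2))
```

Deciding how high the ratio must be before two records count as duplicates is a separate choice that depends on the data.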

Scenario: Records from the source are compared (fuzzy) against existing records. Do these records already exist? Yes: duplicate. No: unique.

The Problem Every Record will be compared to every other record

10 The primary problem in string matching using fuzzy algorithms: 100,000 x 100,000 = 10,000,000,000, that is, 10 billion record pairs to match.

11 Consider this: You are watching a children's school concert and several dozen children are up on stage. Now, pick out the twins. You would probably start with looking for groups based on hair color, hair length, etc., long before you start comparing faces. This is, in essence, grouping or blocking. So, you line the blonds on the left and the brunettes on the right. You now have two blocks.

Data Blue Brown Grouping (Blocking Index)

Grouping (Blocking Index): Without blocking, 100 x 100 = 10,000 comparisons. With 4 blocks of 25 records each, 25 x 25 x 4 = 2,500 comparisons.
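The arithmetic behind blocking can be checked with a few lines of Python. This is only a sketch of the counting argument, using the same n x n convention as the slides; counting each unordered pair once would give n(n-1)/2 instead, but the relative saving is similar. The function names are illustrative.

```python
from typing import Iterable


def comparisons_without_blocking(n: int) -> int:
    """Every record compared against every record (n x n, as on the slides)."""
    return n * n


def comparisons_with_blocking(block_sizes: Iterable[int]) -> int:
    """Records are compared only with records in their own block."""
    return sum(size * size for size in block_sizes)


print(comparisons_without_blocking(100))            # 10000
print(comparisons_with_blocking([25, 25, 25, 25]))  # 2500
print(comparisons_without_blocking(100_000))        # 10000000000
```

The saving grows with the number of blocks, as long as the blocks stay roughly equal in size; one very large block dominates the total.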

14 Proven Blocking Index Strategies
This is good news. For additional background on fast "blocking index strategies" for name and address, William (Bill) Winkler of the US Census Bureau and others have documented their research results. I posted a Melissa Data blog earlier this year detailing the recommended "Blocking Indexes" with samples in SSIS: Record Linkage & Fuzzy Matching Part 2a (More on "Blocking" for Performance Improvement). They are as follows; 1, 3, 11, 9 and 8 are the top 5, per Bill.
1. ZIP, 1st char surname
2. 1st char surname, 1st char first name, date-of-birth
3. Phone (10 digits)
4. 1st three char surname, 1st three char phone, house number
5. 1st three char first name, 1st three char ZIP, house number
6. 1st three char last name, 1st three char ZIP, 1st three char phone
7. 1st char last name = 1st char first name (2-way switch), 1st three char ZIP, 1st three char phone
8. 1st three char ZIP, day-of-birth, month-of-birth
9. ZIP, house number
10. 1st three char last name, 1st three char first name, month-of-birth
11. 1st three char last name, 1st three char first name
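As a rough illustration of how such blocking indexes could be built, the Python sketch below derives keys for strategies 1, 9 and 11 and groups records by key. The record layout (`first`, `last`, `zip`, `house`) and the sample values are assumptions made for this example; the presentation's own samples are built in SSIS.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Record = Dict[str, str]


def key_zip_surname_initial(rec: Record) -> str:
    """Strategy 1: ZIP plus first character of the surname."""
    return f"{rec['zip']}|{rec['last'][:1].upper()}"


def key_zip_house_number(rec: Record) -> str:
    """Strategy 9: ZIP plus house number."""
    return f"{rec['zip']}|{rec['house']}"


def key_name_prefixes(rec: Record) -> str:
    """Strategy 11: first three chars of last name and of first name."""
    return f"{rec['last'][:3].upper()}|{rec['first'][:3].upper()}"


def build_blocks(records: List[Record],
                 key_func: Callable[[Record], str]) -> Dict[str, List[Record]]:
    """Group records that share the same blocking key."""
    blocks: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        blocks[key_func(rec)].append(rec)
    return dict(blocks)


# Made-up sample records (hypothetical field names and values).
records = [
    {"first": "John", "last": "Smith", "zip": "92688", "house": "12"},
    {"first": "Jon", "last": "Smith", "zip": "92688", "house": "12"},
    {"first": "Mary", "last": "Jones", "zip": "10001", "house": "5"},
]

for key, group in build_blocks(records, key_zip_surname_initial).items():
    print(key, [r["first"] for r in group])
```

Note that strategy 11 would put "John" and "Jon" in different blocks ("SMI|JOH" versus "SMI|JON"), which is one reason to run several blocking indexes rather than rely on a single one.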

15 Record Linkage Approach

16 Cleansing and Standardization: Before source records are compared, clean up special characters, syntax, and formatting, and standardize values.
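A cleansing step along these lines might look as follows in Python. This is a sketch only: the abbreviation table is a tiny hypothetical stand-in for real reference data, and the rules shown (strip accents, lowercase, drop special characters, collapse whitespace, expand abbreviations) are assumptions about what "standardization" covers here.

```python
import re
import unicodedata

# Hypothetical reference data; a real table would be much larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def cleanse(value: str) -> str:
    """Normalize a free-text value so that equivalent inputs compare equal."""
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Lowercase and replace special characters with spaces.
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    # Collapse whitespace and expand known abbreviations token by token.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)


print(cleanse("123  Main St."))
print(cleanse("123 MAIN STREET"))
```

Both calls produce the same string, so the two addresses compare equal before any fuzzy algorithm runs. A token rule this naive will also expand "St" in "St Louis", which is why reference data and patterns matter in practice.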

Fuzzy Matching Algorithms: There is no single correct algorithm that accommodates all data types and situations! Data types: addresses, numbers, company names, people names, dates. Situations: call center phonetic input, mistyped form inputs, nicknames, abbreviations.

18 Fuzzy Matching Algorithms: Some algorithms make more sense to use for certain situations and certain data types.
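To make this concrete, the Python sketch below contrasts a phonetic algorithm (American Soundex, implemented here from its commonly published rules) with a character-based similarity (`difflib.SequenceMatcher` from the standard library). The choice of these two algorithms and the sample names are assumptions for illustration; the presentation does not prescribe them.

```python
from difflib import SequenceMatcher

# Soundex digit for each coded consonant; vowels, h, w and y carry no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code of a name."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code
    return (result + "000")[:4]


def edit_similarity(a: str, b: str) -> float:
    """Character-based similarity ratio between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Heard over the phone: spelled differently, sounds the same.
print(soundex("Smith"), soundex("Smyth"))

# Mistyped first letter: the phonetic codes differ, the characters barely do.
print(soundex("Smith"), soundex("Dmith"), edit_similarity("Smith", "Dmith"))
```

The phonetic code catches "Smith" versus "Smyth" but is thrown off by a typo in the first letter, while the character-based ratio handles the typo well. Neither is the right answer for every field.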

19
1. Cleansing and Standardization: Normalize data using rules, patterns and reference data
2. Group records: Divide the data into logical groupings (Blocking Index)
3. Split records: Create separate data streams to support parallel match processing
4. Compare records and determine scores: Fuzzy Matching will give you a match score of how close two compared records are
5. Split into separate match categories: Match, no match and possible matches
6. Analyze results of matches: Possible matches need to be reviewed for accuracy; this can be done with tools or, in some cases, manually
7. Evaluate using match tools to determine if the best algorithms have been combined: Possible matches need to be evaluated and analyzed iteratively to determine whether additional cleansing or different matching algorithms could be used more effectively
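Steps 1, 2, 4 and 5 above can be sketched end to end in a few lines of Python. Everything specific here is an assumption for illustration: the record fields, the blocking key (ZIP plus surname initial), the use of `difflib.SequenceMatcher` as the scoring algorithm, and the two thresholds. Step 3 (parallel streams) is omitted; each block could be handed to a separate worker.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical cut-offs; real values come from analyzing match results.
MATCH_THRESHOLD = 0.90
POSSIBLE_THRESHOLD = 0.75


def cleanse(value: str) -> str:
    """Step 1: minimal normalization (lowercase, collapse whitespace)."""
    return " ".join(value.lower().split())


def blocking_key(rec: dict) -> str:
    """Step 2: group by ZIP plus first character of the surname."""
    return f"{rec['zip']}|{cleanse(rec['last'])[:1]}"


def score(a: dict, b: dict) -> float:
    """Step 4: similarity of the full names, between 0 and 1."""
    name_a = cleanse(f"{a['first']} {a['last']}")
    name_b = cleanse(f"{b['first']} {b['last']}")
    return SequenceMatcher(None, name_a, name_b).ratio()


def classify(value: float) -> str:
    """Step 5: split scores into match categories."""
    if value >= MATCH_THRESHOLD:
        return "match"
    if value >= POSSIBLE_THRESHOLD:
        return "possible"
    return "no match"


def link(records: list) -> dict:
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec)
    results = {"match": [], "possible": [], "no match": []}
    for group in blocks.values():
        # Compare only records that share a block.
        for a, b in combinations(group, 2):
            results[classify(score(a, b))].append((a["id"], b["id"]))
    return results


# Made-up sample records.
records = [
    {"id": 1, "first": "John", "last": "Smith", "zip": "92688"},
    {"id": 2, "first": "Jon", "last": "Smith", "zip": "92688"},
    {"id": 3, "first": "Jane", "last": "Sanders", "zip": "92688"},
    {"id": 4, "first": "John", "last": "Smith", "zip": "10001"},
]
print(link(records))
```

Record 4 is never compared with record 1 because the two fall into different blocks. That is the trade-off blocking makes for speed, and a reason to run more than one blocking index.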

20 A specific example in Microsoft SSIS of using Blocking Index

21 "At the core of SSIS is the data transformation pipeline. This pipeline has a buffer-oriented architecture that is extremely fast at manipulating row sets of data once they have been loaded into memory. The approach is to perform all data transformation steps of the ETL process in a single operation without staging data, although specific transformation or operational requirements, or indeed hardware may be a hindrance. Nevertheless, for maximum performance, the architecture avoids staging. Even copying the data in memory is avoided as far as possible. This is in contrast to traditional ETL tools, which often require staging at almost every step of the warehousing and integration process. The ability to manipulate data without staging extends beyond traditional relational and flat file data and beyond traditional ETL transformation capabilities. With SSIS, all types of data (structured, unstructured, XML, etc.) are converted to a tabular (columns and rows) structure before being loaded into its buffers. Any data operation that can be applied to tabular data can be applied to the data at any step in the data-flow pipeline. This means that a single data-flow pipeline can integrate diverse sources of data and perform arbitrarily complex operations on these data without having to stage the data. It should also be noted though, that if staging is required for business or operational reasons, SSIS has good support for these implementations as well. This architecture allows SSIS to be used in a variety of data integration scenarios, ranging from traditional DW-oriented ETL to nontraditional information integration technologies." Pipeline architecture as defined by Microsoft

22 Basic Fuzzy Matching in SSIS

23 Microsoft Stock Fuzzy Grouping

24 Basic Fuzzy Matching in SSIS with “Blocking Index”

25 Splitting or Grouping Records “blockingindex”

26 Summary
1. Fuzzy Matching: Important factor in Data Quality
2. Blocking Index and Parallel Processing: Further optimizing performance
3. Cleansing: Crucial step prior to Fuzzy Matching
4. Algorithms: Different situations and data types require different algorithms
5. Fuzzy Matching in SSIS: High-level implementation

27 Available code samples going deep into the “weeds”

28 Some Available Components for SSIS: Microsoft Fuzzy Lookup*; Microsoft Fuzzy Grouping*; Melissa Data Fuzzy Matching Component (Free Community Edition). * Available only for SQL Server Developer or Enterprise editions.

29 Roll Your Own Fuzzy Match / Grouping (T-SQL): Lots of discussion activity, plus a C# CLR version, etc.

30 Roll Your Own SSIS Fuzzy Match / Grouping: Jaro-Winkler
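For readers who want to see what a hand-rolled Jaro-Winkler comparison involves, here is a compact Python sketch written from the commonly published definition. It is not the SSIS sample the slide refers to. It uses the usual prefix scale of 0.1 and a maximum prefix of 4, and it applies the prefix bonus unconditionally, whereas some implementations apply it only above a Jaro score of 0.7.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between 0 and 1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler("DWAYNE", "DUANE"), 4))
```

The prefix bonus reflects the observation that typing errors are less common at the start of a name, which is why this measure is popular for people names.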

32 Thank You