1 Record Linkage & Fuzzy Matching (More on "Blocking" for Performance Improvement) Joseph Vertido Melissa Data Fuzzy Matching
Advanced Fuzzy Matching Agenda Overview (Matching in terms of Data Quality) The Problem Walk thru methodology Real implementation example in Microsoft SSIS Code and Samples Available 2
Data Quality “Data integration and data quality are fundamental prerequisites for the successful implementation of enterprise applications, such as CRM and ERP.” Gartner
Data Quality 1) Inaccurate and Inconsistent Data 2) Missing Data 3) Duplicates 3 Common Issues with Data Quality
6/11/2016 Data Quality as defined by Gartner
Scenario Database Open to Duplicates Inconsistencies in Data Ideally, you want all records to be unique
Fuzzy Matching Why do we need Fuzzy? Incoming records may not be identical to existing records Detect existing data and eliminate duplicates Handle Keyboard typing errors Misspellings Similar Names
Scenario Source Do these records already exist? Compare FUZZY YES NO Unique Duplicate
The Problem Every Record will be compared to every other record
The primary problem in string matching using fuzzy algorithms: 100,000 x 100,000 = 10,000,000,000 (10 billion) record comparisons
11 Consider this: You are watching a children's school concert and several dozen children are up on stage. Now, pick out the twins. You would probably start with looking for groups based on hair color, hair length, etc., long before you start comparing faces. This is, in essence, grouping or blocking. So, you line the blonds on the left and the brunettes on the right. You now have two blocks.
Data Blue Brown Grouping (Blocking Index)
Grouping (Blocking Index) Without blocking: 100 x 100 = 10,000 comparisons. With 4 blocks: 25 x 25 x 4 = 2,500 comparisons
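The arithmetic behind blocking can be checked with a short Python sketch. The function names are made up for illustration, and the example assumes 100 records split evenly into 4 blocks of 25, which is what the 2,500 figure implies. Like the slide, it counts every record in a block against every record in the same block.

```python
def comparisons_without_blocking(n_records: int) -> int:
    """Every record is compared to every record (n x n)."""
    return n_records * n_records


def comparisons_with_blocking(block_sizes: list[int]) -> int:
    """Records are compared only inside their own block."""
    return sum(size * size for size in block_sizes)


full = comparisons_without_blocking(100)               # 100 x 100
blocked = comparisons_with_blocking([25, 25, 25, 25])  # 25 x 25 x 4
print(full, blocked)
```

The saving grows with the number of blocks, as long as true duplicates still land in the same block.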
Proven Blocking Index Strategies
This is good news. For additional background on fast "blocking index strategies" for name and address data, William (Bill) Winkler of the US Census and others have documented their research results. I posted a Melissa Data blog earlier this year detailing the recommended "Blocking Indexes" with samples in SSIS: Record Linkage & Fuzzy Matching Part 2a (More on "Blocking" for Performance Improvement).
The indexes are as follows; 1, 3, 11, 9 and 8 are the top 5, per Bill.
1. ZIP, 1st char surname
2. 1st char surname, 1st char first name, date-of-birth
3. Phone (10 digits)
4. 1st three char surname, 1st three char phone, house number
5. 1st three char first name, 1st three char ZIP, house number
6. 1st three char last name, 1st three char ZIP, 1st three char phone
7. 1st char last name = 1st char first name (2-way switch), 1st three char ZIP, 1st three char phone
8. 1st three char ZIP, day-of-birth, month-of-birth
9. ZIP, house number
10. 1st three char last name, 1st three char first name, month-of-birth
11. 1st three char last name, 1st three char first name
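A rough Python sketch of how three of the listed indexes (1, 9 and 11) could be built as key strings. The field names, the `|` separator and the two sample records are hypothetical; this is not the SSIS implementation from the blog post.

```python
def blocking_keys(rec: dict) -> dict:
    """Build blocking keys 1, 9 and 11 from one record.

    Assumed (hypothetical) fields: zip, surname, first_name, house_number.
    """
    zip_code = rec["zip"].strip()
    surname = rec["surname"].strip().upper()
    first = rec["first_name"].strip().upper()
    house = rec["house_number"].strip()
    return {
        "idx1": f"{zip_code}|{surname[:1]}",     # 1. ZIP, 1st char surname
        "idx9": f"{zip_code}|{house}",           # 9. ZIP, house number
        "idx11": f"{surname[:3]}|{first[:3]}",   # 11. 1st three char last/first name
    }


# Made-up records that differ only by a surname typo and a shortened first name.
rec_a = {"zip": "92688", "surname": "Smith", "first_name": "Jonathan", "house_number": "22382"}
rec_b = {"zip": "92688", "surname": "Smyth", "first_name": "Jon", "house_number": "22382"}

keys_a = blocking_keys(rec_a)
keys_b = blocking_keys(rec_b)
shared = [name for name in keys_a if keys_a[name] == keys_b[name]]
print(shared)
```

Here the two records share a block under indexes 1 and 9 but not under index 11, which suggests why more than one index is worth running: a pair missed by one key can still be caught by another.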
15 Record Linkage Approach
Cleansing and Standardization 16 Compare Source Special Characters Syntax Formatting Standardization
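A minimal Python sketch of the cleansing step, covering special characters, formatting and a small standardization table. The `standardize` function and the `ABBREVIATIONS` table are assumptions for illustration; a real table would come from reference data.

```python
import re
import unicodedata

# Hypothetical abbreviation table; real rules would come from reference data.
ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}


def standardize(value: str) -> str:
    """Normalize a value so that formatting differences do not block a match."""
    # Split accented letters into base letter + accent, then drop the accents.
    text = unicodedata.normalize("NFKD", value)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.upper()
    # Replace special characters and punctuation with spaces.
    text = re.sub(r"[^A-Z0-9 ]+", " ", text)
    # Collapse whitespace and apply the standard abbreviations.
    tokens = [ABBREVIATIONS.get(token, token) for token in text.split()]
    return " ".join(tokens)


print(standardize("123  Main Street, Apt. #4"))
print(standardize("123 main st apt 4"))
```

Both inputs reduce to the same string, so an exact or fuzzy comparison afterwards sees them as identical.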
Fuzzy Matching Algorithms There is no single correct algorithm that accommodates all data types and situations! Data Types Addresses Numbers Company Names People Names Dates Situations Call Center Phonetic Input Mistyped Form Inputs Nicknames Abbreviations
Fuzzy Matching Algorithms Some algorithms make more sense to use for certain situations and certain data types.
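To illustrate the point, here is a sketch of one common algorithm, Levenshtein edit distance, in Python. It is chosen only as an example and is not named in the slides; the `similarity` helper and its length-based scaling are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,          # deletion
                current[j - 1] + 1,       # insertion
                previous[j - 1] + cost,   # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to 0..1, where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("SMITH", "SMYTH"), similarity("SMITH", "SMYTH"))
print(levenshtein("WILLIAM", "BILL"), similarity("WILLIAM", "BILL"))
```

A one-letter typo such as SMITH/SMYTH scores high, while a nickname pair such as WILLIAM/BILL scores low even though it may be the same person. Edit distance suits mistyped input; nicknames and phonetic input call for a different technique.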
1. Cleansing and Standardization: Normalize data using rules, patterns and reference data
2. Group records: Divide the data into logical groupings (Blocking Index)
3. Split records: Create separate data streams to support parallel match processing
4. Compare records and determine scores: Fuzzy Matching will give you a match score of how close two compared records are
5. Split into separate match categories: Match, no match and possible matches
6. Analyze results of matches: Possible matches need to be reviewed for accuracy; this can be done with tools or in some cases manually
7. Evaluate using match tools to determine if the best algorithms have been combined: Possible matches need to be evaluated and analyzed iteratively to determine if additional cleansing or different matching algorithms could be utilized more effectively
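Steps 1 to 5 above can be sketched end to end in Python. Everything named here is an assumption for illustration: the record layout, the two thresholds, the blocking key (ZIP plus 1st char of surname) and the use of `difflib.SequenceMatcher` from the standard library as a stand-in scoring algorithm.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

MATCH_THRESHOLD = 0.90     # assumed cut-off for "match"
POSSIBLE_THRESHOLD = 0.75  # assumed cut-off for "possible match"


def clean(value: str) -> str:
    """Step 1: minimal cleansing (upper-case, collapse whitespace)."""
    return " ".join(value.upper().split())


def classify(score: float) -> str:
    """Step 5: split scores into match categories."""
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= POSSIBLE_THRESHOLD:
        return "possible match"
    return "no match"


def link(records):
    """records: iterable of (record_id, first_name, surname, zip_code)."""
    # Step 2: group records by a blocking index (ZIP + 1st char of surname).
    blocks = defaultdict(list)
    for rec_id, first, last, zip_code in records:
        name = clean(f"{first} {last}")
        blocks[(zip_code, clean(last)[:1])].append((rec_id, name))

    results = []
    # Step 3: each block is independent, so blocks could run in parallel.
    for block in blocks.values():
        # Step 4: compare records only within the same block.
        for (id_a, name_a), (id_b, name_b) in combinations(block, 2):
            score = SequenceMatcher(None, name_a, name_b).ratio()
            results.append((id_a, id_b, classify(score)))
    return results


# Made-up sample records.
records = [
    (1, "John", "Smith", "92688"),
    (2, "john", " smith", "92688"),
    (3, "Jon", "Smyth", "92688"),
    (4, "Jane", "Doe", "10001"),
]
results = link(records)
for row in results:
    print(row)
```

Record 4 sits alone in its block and is never compared. Pairs that land in the "possible match" category are the ones steps 6 and 7 would send to review.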
20 A specific example in Microsoft SSIS of using Blocking Index
"At the core of SSIS is the data transformation pipeline. This pipeline has a buffer-oriented architecture that is extremely fast at manipulating row sets of data once they have been loaded into memory. The approach is to perform all data transformation steps of the ETL process in a single operation without staging data, although specific transformation or operational requirements, or indeed hardware may be a hindrance. Nevertheless, for maximum performance, the architecture avoids staging. Even copying the data in memory is avoided as far as possible. This is in contrast to traditional ETL tools, which often require staging at almost every step of the warehousing and integration process. The ability to manipulate data without staging extends beyond traditional relational and flat file data and beyond traditional ETL transformation capabilities. With SSIS, all types of data (structured, unstructured, XML, etc.) are converted to a tabular (columns and rows) structure before being loaded into its buffers. Any data operation that can be applied to tabular data can be applied to the data at any step in the data-flow pipeline. This means that a single data-flow pipeline can integrate diverse sources of data and perform arbitrarily complex operations on these data without having to stage the data. It should also be noted though, that if staging is required for business or operational reasons, SSIS has good support for these implementations as well. This architecture allows SSIS to be used in a variety of data integration scenarios, ranging from traditional DW-oriented ETL to nontraditional information integration technologies." Pipeline architecture as defined by Microsoft
22 Basic Fuzzy Matching in SSIS
23 Microsoft Stock Fuzzy Grouping
24 Basic Fuzzy Matching in SSIS with “Blocking Index”
25 Splitting or Grouping Records “blockingindex”
Summary
1. Fuzzy Matching: Important factor in Data Quality
2. Blocking Index and Parallel Processing: Further optimizing performance
3. Cleansing: Crucial step prior to Fuzzy Matching
4. Algorithms: Different situations and data types require different algorithms
5. Fuzzy Matching in SSIS: High-level implementation
27 Available code samples going deep into the “weeds”
Some Available Components for SSIS
Microsoft Fuzzy Lookup *
Microsoft Fuzzy Grouping *
Melissa Data Fuzzy Matching Component (Free Community Edition)
* Available only for SQL Server Developer or Enterprise editions.
29 Roll Your Own Fuzzy Match / Grouping – T-SQL Lots of discussion activity plus C# CLR Version, etc....
30 Roll Your Own SSIS Fuzzy Match / Grouping Jaro Winkler
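For reference, a compact Python sketch of the Jaro-Winkler similarity. It follows the commonly published definition (prefix scale 0.1, prefix capped at 4 characters) and is not the SSIS sample code mentioned on the slide; the function and parameter names are assumptions.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in the range 0..1."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1.0 - score)


print(round(jaro_winkler("MARTHA", "MARHTA"), 3))
print(round(jaro_winkler("DWAYNE", "DUANE"), 3))
```

The prefix boost rewards names that agree on their first few characters, so the measure tends to suit short strings such as personal names.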
32 Thank You