1
Advanced Fuzzy Matching
Ira Warren Whiteside, Melissa Data BI Architect
Record Linkage & Fuzzy Matching Part 2a (More on "Blocking" for Performance Improvement)
2
Advanced Fuzzy Matching Agenda
Overview (Matching in terms of Data Quality)
The Problem
Walk through the methodology
Real implementation example
Live Demo in Microsoft SSIS
Code and Samples Available
3
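10 Billion Record Pairs to Match
The primary problem in string matching using fuzzy algorithms is the number of comparisons. Matching 100,000 records against 100,000 records, with every record compared to every other, gives 100,000 x 100,000 = 10,000,000,000 comparisons: 10 billion record pairs to match.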
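The arithmetic above, and the effect of blocking on it, can be sketched in a few lines of Python. This is only an illustration: the figure of 1,000 blocks and the assumption that records spread evenly across blocks are hypothetical and do not come from the presentation.

```python
def cross_comparisons(n_left, n_right):
    """Every record on the left compared with every record on the right."""
    return n_left * n_right


def blocked_comparisons(n_left, n_right, n_blocks):
    """Comparisons when records are only compared within the same block.

    Assumes (hypothetically) that both inputs spread evenly across blocks.
    """
    per_block_left = n_left // n_blocks
    per_block_right = n_right // n_blocks
    return n_blocks * per_block_left * per_block_right


full = cross_comparisons(100_000, 100_000)
blocked = blocked_comparisons(100_000, 100_000, n_blocks=1_000)

print(f"without blocking: {full:,}")
print(f"with 1,000 even blocks: {blocked:,}")
```

Real blocks are never perfectly even, so the actual saving depends on how well the chosen blocking value spreads the data.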
4
Record Linkage Approach
5
Recommended Academic Papers
(See Melissa Data’s Data Quality Authority Blog)
Over at the LinkedIn group for Data Matching run by Henrik Liliendahl Sorensen, Bill Winkler, principal researcher at the U.S. Census Bureau, has shared several reference papers on "blocking." They are excellent and I wanted to share them with you.
Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. (2003), "Robust and Efficient Fuzzy Match for On-Line Data Cleaning," ACM SIGMOD '03.
Baxter, R., Christen, P., and Churches, T. (2003), "A Comparison of Fast Blocking Methods for Record Linkage," Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington, DC, August.
Winkler, W. E. (2004c), "Approximate String Comparator Search Strategies for Very Large Administrative Lists," Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM (also report 2005/06 at
6
Cleansing and Standardization
The steps are as follows:
1. Cleansing and Standardization
a. Create common formats and patterns for data values
b. Preferably use data-driven rules that can be shared and reused
2. Group records
a. Choose single or multiple values
b. Create a concatenated value free of spaces and special characters
3. Split records
a. Create separate data streams to support parallel match processing
4. Compare records and determine scores
a. Based on the type of value (name, product), select the appropriate algorithm
b. We will discuss various algorithms in a future post
5. Split into separate match categories
a. Match, no match, and possible matches
6. Analyze results of matches
a. Matches need to be reviewed for accuracy; this can be done with tools or, in some cases, manually
7. Evaluate using match tools to determine if the best algorithms have been combined
a. Possible matches need to be evaluated and analyzed iteratively to determine whether additional cleansing or different matching algorithms could be used more effectively
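Steps 1 through 5 above can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the SSIS implementation from the demo: the field names, the grouping value (ZIP plus first letter of surname), the thresholds 0.90 and 0.75, and the use of the standard library's `difflib.SequenceMatcher` as a stand-in comparator are all hypothetical choices.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def standardize(value):
    """Step 1: upper-case and drop everything except letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", value.upper())


def group_key(record):
    """Step 2: concatenated grouping value (assumed: ZIP + 1st char of surname)."""
    return standardize(record["zip"]) + standardize(record["last"])[:1]


def score(a, b):
    """Step 4: similarity score from a stand-in comparator (difflib)."""
    name_a = standardize(a["first"] + a["last"])
    name_b = standardize(b["first"] + b["last"])
    return SequenceMatcher(None, name_a, name_b).ratio()


def categorize(s, match_at=0.90, possible_at=0.75):
    """Step 5: split scores into match / possible / no match (assumed cutoffs)."""
    if s >= match_at:
        return "match"
    if s >= possible_at:
        return "possible"
    return "no match"


def link(records):
    groups = defaultdict(list)
    for record in records:
        groups[group_key(record)].append(record)
    results = []
    # Step 3: each group is independent, so groups could run as parallel streams.
    for members in groups.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                s = score(members[i], members[j])
                results.append((members[i]["id"], members[j]["id"], categorize(s)))
    return results


records = [
    {"id": 1, "first": "Jon", "last": "Smith", "zip": "92688"},
    {"id": 2, "first": "John", "last": "Smith", "zip": "92688"},
    {"id": 3, "first": "Jane", "last": "Stone", "zip": "92688"},
    {"id": 4, "first": "John", "last": "Smith", "zip": "10001"},
]
results = link(records)
for row in results:
    print(row)
```

Record 4 is never compared with records 1 and 2 because its ZIP puts it in a different group. That is the trade-off of a single grouping value, and it is why steps 6 and 7 review the results and revisit the choice of grouping and algorithm.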
7
Consider this: You are watching a children's school concert and several dozen children are up on stage. Now, pick out the twins. You would probably start with looking for groups based on hair color, hair length, etc., long before you start comparing faces. This is, in essence, grouping or blocking. So, you line the blonds on the left and the brunettes on the right. You now have two blocks.
8
So, given that, we agree you need to leverage grouping or blocking
So, given that, we agree you need to leverage grouping or blocking. The next step in identifying the twins is to repeat the process within each group you created, using a new grouping attribute each time, until you have found the twins. Compare all blonds, then all brunettes, and so on. Then move on to short hair, long hair, and so on. Finally, move on to similar face shapes (Ahhh, FUZZY). Hair is blond or brunette, long or short, but a face is a collection of features that together form a pattern. Our brains instinctively look for faces that are similar and then compare them more closely. The obvious point here is to begin comparing faces only once we have narrowed the group of children down to a few.
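The twins analogy can be written down as a small Python sketch: cheap exact attributes narrow the groups first, and the expensive fuzzy comparison runs only inside the small groups that remain. The children, their attributes, the face descriptions, and the 0.9 threshold are all made up for illustration, and `difflib.SequenceMatcher` merely stands in for a real fuzzy comparator.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# Hypothetical data: "face" is a free-text description, so it needs fuzzy matching.
children = [
    {"name": "Ann", "hair": "blond", "length": "long", "face": "round freckles blue eyes"},
    {"name": "Amy", "hair": "blond", "length": "long", "face": "round freckels blue eyes"},
    {"name": "Ben", "hair": "blond", "length": "short", "face": "square dimples green eyes"},
    {"name": "Cara", "hair": "brunette", "length": "long", "face": "oval freckles brown eyes"},
    {"name": "Dana", "hair": "brunette", "length": "long", "face": "long narrow grey eyes"},
]


def narrow(items, attribute):
    """Split one group into smaller groups that share the same attribute value."""
    blocks = defaultdict(list)
    for item in items:
        blocks[item[attribute]].append(item)
    return list(blocks.values())


def find_twins(items, threshold=0.9):
    # Narrow by exact attributes first: hair color, then hair length.
    blocks = [items]
    for attribute in ("hair", "length"):
        blocks = [sub for block in blocks for sub in narrow(block, attribute)]
    # Only now compare faces, and only within each small block.
    twins = []
    for block in blocks:
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                if SequenceMatcher(None, a["face"], b["face"]).ratio() >= threshold:
                    twins.append((a["name"], b["name"]))
    return twins


twins = find_twins(children)
print(twins)
```

With five children there are ten possible face comparisons; after narrowing, only two are actually made (Ann with Amy, Cara with Dana).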
9
A specific example in Microsoft SSIS
10
Pipeline architecture as defined by Microsoft
"At the core of SSIS is the data transformation pipeline. This pipeline has a buffer-oriented architecture that is extremely fast at manipulating row sets of data once they have been loaded into memory. The approach is to perform all data transformation steps of the ETL process in a single operation without staging data, although specific transformation or operational requirements, or indeed hardware may be a hindrance. Nevertheless, for maximum performance, the architecture avoids staging. Even copying the data in memory is avoided as far as possible. This is in contrast to traditional ETL tools, which often require staging at almost every step of the warehousing and integration process. The ability to manipulate data without staging extends beyond traditional relational and flat file data and beyond traditional ETL transformation capabilities. With SSIS, all types of data (structured, unstructured, XML, etc.) are converted to a tabular (columns and rows) structure before being loaded into its buffers. Any data operation that can be applied to tabular data can be applied to the data at any step in the data-flow pipeline. This means that a single data-flow pipeline can integrate diverse sources of data and perform arbitrarily complex operations on these data without having to stage the data. It should also be noted though, that if staging is required for business or operational reasons, SSIS has good support for these implementations as well. This architecture allows SSIS to be used in a variety of data integration scenarios, ranging from traditional DW-oriented ETL to nontraditional information integration technologies."
11
Basic Fuzzy Matching in SSIS
12
Proven Blocking Indexes
This is good news. For additional background on the fast "blocking index strategies" for name and address matching, William (Bill) Winkler of the U.S. Census Bureau and others have documented their research results. I posted a Melissa Data blog entry earlier this year detailing the recommended "Blocking Indexes" with samples in SSIS: Record Linkage & Fuzzy Matching Part 2a (More on "Blocking" for Performance Improvement). The indexes are listed below; per Bill, numbers 1, 3, 11, 9 and 8 are the top 5.
1. ZIP, 1st char surname
2. 1st char surname, 1st char first name, date-of-birth
3. Phone (10 digits)
4. 1st three char surname, 1st three char phone, house number
5. 1st three char first name, 1st three char ZIP, house number
6. 1st three char last name, 1st three char ZIP, 1st three char phone
7. 1st char last name = 1st char first name (2-way switch), 1st three char ZIP, 1st three char phone
8. 1st three char ZIP, day-of-birth, month-of-birth
9. ZIP, house number
10. 1st three char last name, 1st three char first name, month-of-birth
11. 1st three char last name, 1st three char first name
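As a rough illustration of how such blocking indexes are used, the Python sketch below builds four of the listed keys (1, 3, 9 and 11) and collects every pair of records that shares a key in at least one pass. The record layout, the field names, and the `|` separator are assumptions for the example; this is not the SSIS package from the blog post.

```python
import re
from collections import defaultdict
from itertools import combinations


def digits(value):
    """Keep only the digits of a phone number."""
    return re.sub(r"\D", "", value)


def zip_surname_initial(r):          # index 1
    return r["zip"] + "|" + r["last"][:1].upper()


def phone_10_digits(r):              # index 3
    d = digits(r["phone"])
    return d if len(d) == 10 else ""  # empty key: leave record out of this pass


def zip_house_number(r):             # index 9
    return r["zip"] + "|" + r["house_number"]


def last3_first3(r):                 # index 11
    return (r["last"][:3] + "|" + r["first"][:3]).upper()


BLOCKING_INDEXES = {1: zip_surname_initial, 3: phone_10_digits,
                    9: zip_house_number, 11: last3_first3}


def candidate_pairs(records, index_ids=(1, 3, 9, 11)):
    """Union of record-id pairs that share a key in at least one blocking pass."""
    pairs = set()
    for index_id in index_ids:
        blocks = defaultdict(list)
        for r in records:
            key = BLOCKING_INDEXES[index_id](r)
            if key:
                blocks[key].append(r["id"])
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"id": 1, "first": "John", "last": "Smith", "zip": "92688",
     "phone": "(949) 555-0100", "house_number": "22"},
    {"id": 2, "first": "Jon", "last": "Smyth", "zip": "92688",
     "phone": "949-555-0100", "house_number": "22"},
    {"id": 3, "first": "Mary", "last": "Jones", "zip": "10001",
     "phone": "212-555-0199", "house_number": "5"},
    {"id": 4, "first": "Maria", "last": "Jones", "zip": "10002",
     "phone": "212 555 0199", "house_number": "5"},
]
pairs = candidate_pairs(records)
print(sorted(pairs))
```

Running several passes matters because each key misses something: records 3 and 4 differ in ZIP, so index 1 alone would never pair them, but the phone and name-prefix passes do. Only the surviving candidate pairs go on to the fuzzy comparison.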
13
Basic Fuzzy Matching in SSIS with “Blocking Index”
14
Splitting or Grouping Records
“blockingindex”
15
Microsoft Stock Fuzzy Grouping
16
Live demo and available code samples
going deep into the “weeds”
17
Roll Your Own Fuzzy Match / Grouping – T-SQL
Lots of discussion activity plus C# CLR Version, etc
18
Roll Your Own SSIS Fuzzy Match / Grouping Jaro Winkler
22
Melissa Data Matching Tools
Discrete SSIS Matching Transforms
  JaroWinkler – Names
  n-Gram – Generic Strings
  n-Gram and JaroWinkler
Comprehensive Matching Application (Available Standalone)
  MatchUP – Utilizes prebuilt Matchcodes and a separate user interface for maintenance
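For readers unfamiliar with the two comparators named above, here is a plain Python sketch of the textbook Jaro-Winkler similarity and a simple bigram (n-gram) similarity. It follows the commonly published definitions and is not Melissa Data's implementation; the prefix scale of 0.1, the 4-character prefix cap, and the use of a Dice coefficient over distinct n-grams are the usual defaults, assumed here.

```python
def jaro(s1, s2):
    """Jaro similarity between two strings, in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as
    # half a transposition each.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)


def ngram_similarity(s1, s2, n=2):
    """Dice coefficient over the sets of distinct character n-grams."""
    grams1 = {s1[i:i + n] for i in range(len(s1) - n + 1)}
    grams2 = {s2[i:i + n] for i in range(len(s2) - n + 1)}
    if not grams1 or not grams2:
        return 0.0
    return 2 * len(grams1 & grams2) / (len(grams1) + len(grams2))


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(ngram_similarity("night", "nacht"))
```

Jaro-Winkler rewards agreement at the start of the string, which suits personal names; n-gram similarity is insensitive to where the differences fall, which makes it a reasonable default for generic strings.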
23
Data Integration Data Quality MDM SQL Server 2005/2008 Data Quality
[Diagram labels: Gartner; Data Integration; Data Quality; MDM; SQL Server 2005/2008]
24
Total Data Quality in SSIS
4/28/2017
25
Thank You