Advanced Fuzzy Matching


Ira Warren Whiteside
Melissa Data BI Architect
ira@melissadata.com 414.702.3024
Record Linkage & Fuzzy Matching Part 2a (More on "Blocking" for Performance Improvement)

Agenda
Overview (matching in terms of data quality)
The problem
Walk-through of the methodology
Real implementation example
Live demo in Microsoft SSIS
Code and samples available

10 Billion Records to Match
The primary problem in string matching using fuzzy algorithms: comparing every record in a list of 100,000 against every record in another list of 100,000 requires 100,000 x 100,000 = 10,000,000,000 pairwise comparisons.

Record Linkage Approach

Recommended Academic Papers (see Melissa Data’s Data Quality Authority blog: http://blog.melissadata.com/data-quality-authority/)
Over at the LinkedIn group for Data Matching run by Henrik Liliendahl Sorensen, Bill Winkler, principal researcher at the U.S. Census Bureau, has shared several reference papers on "blocking." They are excellent and I wanted to share them with you.
Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. (2003), "Robust and Efficient Fuzzy Match for Online Data Cleaning," ACM SIGMOD '03, 313-324, http://datamining.anu.edu.au/publications/2003/kdd03-6pages.pdf
Baxter, R., Christen, P. and Churches, T. (2003), "A Comparison of Fast Blocking Methods for Record Linkage," Proceedings of the ACM Workshop on Data Cleaning, Record Linkage and Object Identification, Washington, DC, August 2003.
Winkler, W. E. (2004c), "Approximate String Comparator Search Strategies for Very Large Administrative Lists," Proceedings of the Section on Survey Research Methods, American Statistical Association, CD-ROM (also report 2005/06 at http://www.census.gov/srd/papers/pdf/rrs2005-02.pdf).

Cleansing and Standardization
The steps are as follows:
1. Cleansing and Standardization
   a. Create common formats and patterns for data values
   b. Preferably use data-driven rules that can be shared and reused
2. Group records
   a. Choose single or multiple values
   b. Create a concatenated value free of spaces or special characters
3. Split records
   a. Create separate data streams to support parallel match processing
4. Compare records and determine scores
   a. Based on the type of value (name, product, etc.), select the appropriate algorithm
   b. We will discuss various algorithms in a future post
5. Split into separate match categories
   a. Match, no match and possible matches
6. Analyze results of matches
   a. Matches need to be reviewed for accuracy; this can be done with tools or, in some cases, manually
7. Evaluate using match tools to determine if the best algorithms have been combined
   a. Possible matches need to be evaluated and analyzed iteratively to determine if additional cleansing or different matching algorithms could be utilized more effectively
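Steps 1 through 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SSIS implementation from the demo: the record fields (`id`, `first`, `surname`, `zip`), the thresholds 0.90 and 0.75, and the use of the standard library's `difflib.SequenceMatcher` as a stand-in scoring algorithm are all hypothetical choices made for the example.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def standardize(value: str) -> str:
    """Step 1: uppercase, replace special characters, collapse whitespace."""
    cleaned = re.sub(r"[^A-Za-z0-9 ]", " ", value).upper()
    return " ".join(cleaned.split())


def blocking_key(record: dict) -> str:
    """Step 2: concatenated value free of spaces and special characters."""
    surname = standardize(record["surname"]).replace(" ", "")
    zip_code = standardize(record["zip"]).replace(" ", "")
    return zip_code + surname[:1]


def split_into_blocks(records):
    """Step 3: one group per key; each group could feed a parallel stream."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks


def score(a: dict, b: dict) -> float:
    """Step 4: stand-in similarity in [0, 1] (not Jaro-Winkler or n-gram)."""
    name_a = standardize(f"{a['first']} {a['surname']}")
    name_b = standardize(f"{b['first']} {b['surname']}")
    return SequenceMatcher(None, name_a, name_b).ratio()


def categorize(similarity: float, match_at: float = 0.90,
               possible_at: float = 0.75) -> str:
    """Step 5: match, possible match, or no match (assumed thresholds)."""
    if similarity >= match_at:
        return "match"
    if similarity >= possible_at:
        return "possible"
    return "no match"


def link(records):
    """Compare only the records that share a blocking key."""
    results = []
    for block in split_into_blocks(records).values():
        for a, b in combinations(block, 2):
            results.append((a["id"], b["id"], categorize(score(a, b))))
    return results


records = [
    {"id": 1, "first": "Jon", "surname": "Smith", "zip": "53202"},
    {"id": 2, "first": "John", "surname": "Smith", "zip": "53202"},
    {"id": 3, "first": "Mary", "surname": "Stone", "zip": "53202"},
    {"id": 4, "first": "John", "surname": "Smith", "zip": "60601"},
]
for left_id, right_id, category in link(records):
    print(left_id, right_id, category)
```

Record 4 is never compared with the others because its ZIP puts it in a different block, which is exactly the trade-off blocking makes: fewer comparisons at the risk of missing matches that a single key separates.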

Consider this: You are watching a children's school concert and several dozen children are up on stage. Now, pick out the twins. You would probably start with looking for groups based on hair color, hair length, etc., long before you start comparing faces. This is, in essence, grouping or blocking. So, you line the blonds on the left and the brunettes on the right. You now have two blocks.

So, given that, we agree you need to leverage grouping or blocking. The next step in identifying the twins is to repeat the process within each group you created, using a new grouping attribute each time, until you have found the twins. Compare all blonds, then brunettes, and so on. Then, move on to short hair, long hair, and so on. Finally, move on to similar face shapes (Ahhh, FUZZY). Hair is blond or brunette, long or short, but faces are a collection of features that form a pattern, an image. Our brains will instinctively look for faces that are similar, and then compare more closely. The obvious point here is to begin comparing faces only once we have narrowed down the group of children to a few.
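The effect of narrowing down before comparing can be put into numbers. The sketch below counts comparisons with and without blocking; the even spread of 100,000 records over 1,000 blocking keys is a made-up assumption for illustration, and real data will be less uniform.

```python
from collections import Counter


def naive_comparisons(n_left: int, n_right: int) -> int:
    """Every record on the left is compared with every record on the right."""
    return n_left * n_right


def blocked_comparisons(left_keys, right_keys) -> int:
    """Only records that share a blocking key are compared."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    return sum(count * right_counts[key] for key, count in left_counts.items())


print(naive_comparisons(100_000, 100_000))  # 10000000000

# Hypothetical data: both lists spread evenly over 1,000 blocking keys,
# so each block pairs 100 records with 100 records.
keys = [i % 1000 for i in range(100_000)]
print(blocked_comparisons(keys, keys))  # 10000000
```

Under that assumption, blocking reduces the work by a factor of 1,000, from 10 billion comparisons to 10 million.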

A specific example in Microsoft SSIS

Pipeline architecture as defined by Microsoft
"At the core of SSIS is the data transformation pipeline. This pipeline has a buffer-oriented architecture that is extremely fast at manipulating row sets of data once they have been loaded into memory. The approach is to perform all data transformation steps of the ETL process in a single operation without staging data, although specific transformation or operational requirements, or indeed hardware may be a hindrance. Nevertheless, for maximum performance, the architecture avoids staging. Even copying the data in memory is avoided as far as possible. This is in contrast to traditional ETL tools, which often require staging at almost every step of the warehousing and integration process. The ability to manipulate data without staging extends beyond traditional relational and flat file data and beyond traditional ETL transformation capabilities. With SSIS, all types of data (structured, unstructured, XML, etc.) are converted to a tabular (columns and rows) structure before being loaded into its buffers. Any data operation that can be applied to tabular data can be applied to the data at any step in the data-flow pipeline. This means that a single data-flow pipeline can integrate diverse sources of data and perform arbitrarily complex operations on these data without having to stage the data. It should also be noted though, that if staging is required for business or operational reasons, SSIS has good support for these implementations as well. This architecture allows SSIS to be used in a variety of data integration scenarios, ranging from traditional DW-oriented ETL to nontraditional information integration technologies."

Basic Fuzzy Matching in SSIS

Proven Blocking Indexes
This is good news: for additional background on fast "blocking index strategies" for name and address data, William (Bill) Winkler of the U.S. Census Bureau and others have documented their research results. I posted a Melissa Data blog entry earlier this year detailing the recommended "Blocking Indexes" with samples in SSIS: Record Linkage & Fuzzy Matching Part 2a (More on "Blocking" for Performance Improvement). The indexes are listed below; 1, 3, 11, 9 and 8 are the top 5, per Bill.
1. ZIP, 1st char surname
2. 1st char surname, 1st char first name, date-of-birth
3. Phone (10 digits)
4. 1st three char surname, 1st three char phone, house number
5. 1st three char first name, 1st three char ZIP, house number
6. 1st three char last name, 1st three char ZIP, 1st three char phone
7. 1st char last name = 1st char first name (2-way switch), 1st three char ZIP, 1st three char phone
8. 1st three char ZIP, day-of-birth, month-of-birth
9. ZIP, house number
10. 1st three char last name, 1st three char first name, month-of-birth
11. 1st three char last name, 1st three char first name
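As a rough sketch, the top five indexes (1, 3, 11, 9 and 8) can be written as key functions and applied in several passes, keeping every pair that shares a key in at least one pass. The field names (`surname`, `first`, `zip`, `phone`, `house`, `dob`) and the `YYYY-MM-DD` date format are assumptions for this example, not part of the published index list or of the SSIS samples.

```python
import re
from collections import defaultdict
from itertools import combinations


def digits(value: str) -> str:
    """Keep digits only, e.g. '(414) 555-0101' -> '4145550101'."""
    return re.sub(r"\D", "", value)


def letters(value: str) -> str:
    """Keep uppercase letters only, e.g. "O'Neil" -> 'ONEIL'."""
    return re.sub(r"[^A-Z]", "", value.upper())


# Keyed by the index numbers from the list above.
# Assumed record layout: dob is a 'YYYY-MM-DD' string.
BLOCKING_INDEXES = {
    1: lambda r: (digits(r["zip"]), letters(r["surname"])[:1]),
    3: lambda r: (digits(r["phone"])[-10:],),
    8: lambda r: (digits(r["zip"])[:3], r["dob"][8:10], r["dob"][5:7]),
    9: lambda r: (digits(r["zip"]), digits(r["house"])),
    11: lambda r: (letters(r["surname"])[:3], letters(r["first"])[:3]),
}


def candidate_pairs(records, index_ids=(1, 3, 11, 9, 8)):
    """Union of record-id pairs that share a key in at least one pass."""
    pairs = set()
    for index_id in index_ids:
        make_key = BLOCKING_INDEXES[index_id]
        blocks = defaultdict(list)
        for record in records:
            key = make_key(record)
            if all(key):  # skip records with an empty key component
                blocks[key].append(record["id"])
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


people = [
    {"id": 1, "surname": "Smith", "first": "John", "zip": "53202",
     "phone": "(414) 555-0101", "house": "12", "dob": "1970-03-15"},
    {"id": 2, "surname": "Smyth", "first": "Jon", "zip": "53202",
     "phone": "414-555-0101", "house": "12", "dob": "1970-03-15"},
    {"id": 3, "surname": "Jones", "first": "Ann", "zip": "60601",
     "phone": "312-555-0199", "house": "7", "dob": "1985-11-02"},
]
print(candidate_pairs(people))  # {(1, 2)}
```

Running several passes is what makes the approach tolerant of errors: records 1 and 2 differ on index 11 (SMI/JOH versus SMY/JON) but still become candidates through ZIP, phone, house number, and date of birth.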

Basic Fuzzy Matching in SSIS with “Blocking Index”

Splitting or Grouping Records “blockingindex”

Microsoft Stock Fuzzy Grouping

Live demo and available code samples going deep into the “weeds”

Roll Your Own Fuzzy Match / Grouping – T-SQL
Lots of discussion activity, plus a C# CLR version, etc.

Roll Your Own SSIS Fuzzy Match / Grouping: Jaro-Winkler
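For readers who want to see what a hand-rolled comparison computes, here is a minimal Python sketch of the standard Jaro and Jaro-Winkler similarity. It is not the SSIS, T-SQL, or C# code from the demo; the parameter names `prefix_scale` and `max_prefix` are chosen for this example, with the commonly used values 0.1 and 4 as defaults.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match only if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1,
                 max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings sharing a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
print(round(jaro_winkler("DWAYNE", "DUANE"), 4))   # 0.84
```

The prefix boost is the reason Jaro-Winkler is usually favored for personal names, where typing errors are less likely at the start of the string.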

Melissa Data Matching Tools
Discrete SSIS Matching Transforms
JaroWinkler – Names
n-Gram – Generic Strings
n-Gram and JaroWinkler
Comprehensive Matching Application (Available Standalone)
MatchUP – Utilizes prebuilt Matchcodes and a separate user interface for maintenance
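To show why an n-gram comparison suits generic strings, the sketch below scores two strings by their shared character bigrams using the Dice coefficient. This is one common n-gram variant chosen for illustration; it is an assumption, not a description of how the Melissa Data n-Gram transform is implemented.

```python
from collections import Counter


def ngrams(text: str, n: int = 2) -> Counter:
    """Multiset of overlapping character n-grams, case-insensitive."""
    upper = text.upper()
    return Counter(upper[i:i + n] for i in range(len(upper) - n + 1))


def ngram_similarity(a: str, b: str, n: int = 2) -> float:
    """Dice coefficient: 2 * shared n-grams / total n-grams, in [0, 1]."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        # Both strings are shorter than n; fall back to exact comparison.
        return 1.0 if a.upper() == b.upper() else 0.0
    shared = sum((grams_a & grams_b).values())
    return 2 * shared / total


print(ngram_similarity("NIGHT", "NACHT"))    # 0.25
print(ngram_similarity("widget", "WIDGET"))  # 1.0
```

Unlike Jaro-Winkler, this score gives no extra weight to the start of the string, so it treats a product description or address fragment the same wherever the difference occurs.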

[Diagram: Data Integration, Data Quality, and MDM with SQL Server 2005/2008 (Gartner)]

Total Data Quality in SSIS 4/28/2017

Thank You ira@melissadata.com iwhiteside@iwhiteside.com