Jared Kuehn – Skyline Technologies
When Low-Quality Data Strikes: Fuzzy Tools Provide Clarity in Matching and Deduplication
About me
- Likes BLTs
- Male pattern baldness for a theater production
- Weird Al is my hero
- I like hats
- My daughter is adorable
- My dog is fuzzy
This tangent is too divergent
Let’s get to our topic!
Today’s agenda
- What is fuzzy logic?
- What are the typical matching approaches?
- Let’s see it in action!
  - Demo, demo, demo!
What is fuzzy logic?
Stock photo I found online that clearly displays my point… kind of
- Taking two pieces of information and identifying a match based on how similar they are.
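To make "how similar they are" concrete, here is a minimal sketch that scores two strings between 0.0 and 1.0. It assumes Python's standard-library `difflib` as the scoring method, which is not one of the tools covered in this talk; the helper name `similarity` and the sample names are only illustrative.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity score, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


# A typo-ridden version of a name scores much higher than an unrelated name.
close = similarity("Jared Kuehn", "Jarod Kuhn")
far = similarity("Jared Kuehn", "Weird Al")
print(f"close={close:.2f} far={far:.2f}")
```

A fuzzy match is then just a decision about how high the score has to be before two values count as the same thing.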
Case study!!!
- Two datasets of people for your data warehouse
  - Both contain names and demographic information
- One comes from your company’s main application
  - Already in the warehouse. High-quality, managed well
- The other comes from a new application
  - Data has been identified as low-quality
  - Typos, blank fields, varied formatting
- A person can exist in both lists
- Goal is to merge the two lists into one master person dataset for your warehouse
  - Minimize the number of duplicates without finding bad matches
  - Here’s a second bullet point because I couldn’t think of a second point and I learned in high school that having only one sub bullet point is frowned upon
Approaches to matching
- Exact Match
- Fuzzy Match
- Manual Match
- Match Game
Exact Match
- Define columns that you want to compare
- Data in columns must match exactly to find matching records
- Strict rules result in more confidence in matches
- Can define multiple rules
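The exact-match idea can be sketched in a few lines. This is an illustration only, not the demo script from the talk: the function name `exact_match`, the column names, and the sample people are all hypothetical. Each rule is a tuple of columns that must agree exactly, and blank values are never treated as a match.

```python
def exact_match(master, incoming, rules):
    """Pair records whose values agree exactly on every column of at least one rule."""
    matches = []
    for rule in rules:
        # Index the master records by the rule's columns, skipping blank keys.
        index = {}
        for m in master:
            key = tuple(m[col] for col in rule)
            if all(key):
                index.setdefault(key, []).append(m["id"])
        # Look up each incoming record under the same rule.
        for rec in incoming:
            key = tuple(rec[col] for col in rule)
            for master_id in index.get(key, []):
                pair = (master_id, rec["id"])
                if pair not in matches:
                    matches.append(pair)
    return matches


# Hypothetical sample data: M* is the clean list, N* is the low-quality list.
master = [
    {"id": "M1", "first": "Ann", "last": "Lee", "dob": "1980-02-03", "email": "ann.lee@example.com"},
    {"id": "M2", "first": "Bob", "last": "Ray", "dob": "1975-06-07", "email": ""},
]
incoming = [
    {"id": "N1", "first": "Anne", "last": "Lee", "dob": "", "email": "ann.lee@example.com"},
    {"id": "N2", "first": "Bob", "last": "Ray", "dob": "1975-06-07", "email": ""},
    {"id": "N3", "first": "Bbo", "last": "Ray", "dob": "1975-06-07", "email": ""},
]
rules = [("email",), ("first", "last", "dob")]
pairs = exact_match(master, incoming, rules)
print(pairs)
```

Note that N3 is left unmatched because of a single typo in the first name, which is the kind of gap the fuzzy approach is meant to close.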
Fuzzy Match
- Define which columns you want to compare
- Find matches based on similarity
- Faster to set up for complex, low-quality scenarios
- Better at handling low-quality data
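A rough sketch of the same idea with similarity instead of equality follows. It assumes Python's `difflib` for scoring and a made-up threshold of 0.8; SSIS Fuzzy Lookup and the other tools in this talk use their own algorithms and settings, so treat the names `similarity` and `fuzzy_match`, the threshold, and the sample records as hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity score, ignoring case and surrounding spaces."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def fuzzy_match(master, incoming, columns, threshold=0.8):
    """For each incoming record, keep the best-scoring master record at or above the threshold."""
    results = []
    for rec in incoming:
        best, best_score = None, 0.0
        for m in master:
            # Average the per-column similarity across the compared columns.
            score = sum(similarity(m[c], rec[c]) for c in columns) / len(columns)
            if score > best_score:
                best, best_score = m, score
        if best is not None and best_score >= threshold:
            results.append((best["id"], rec["id"], round(best_score, 2)))
    return results


# Hypothetical sample data with typos and stray whitespace in the incoming list.
master = [
    {"id": "M1", "first": "Jared", "last": "Kuehn"},
    {"id": "M2", "first": "Alfred", "last": "Yankovic"},
]
incoming = [
    {"id": "N1", "first": "jarod", "last": "Kuhn "},
    {"id": "N2", "first": "Mary", "last": "Smith"},
]
found = fuzzy_match(master, incoming, columns=("first", "last"))
print(found)
```

Comparing every incoming record against every master record is fine for a demo but grows quickly with list size, so a real job would likely narrow the candidates first.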
Manual Match
- Trust the human brain to find accurate matches
- Can account for any number of variances in data
- Most accurate form of matching
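One possible way to combine the approaches is to let a similarity score decide which pairs a person actually needs to look at. The sketch below is an assumption, not something shown in the demo: the function name `triage` and both cutoffs are hypothetical and would need tuning against real data.

```python
def triage(score: float, accept_at: float = 0.9, review_at: float = 0.7) -> str:
    """Route a candidate pair by similarity score (cutoffs are illustrative only)."""
    if score >= accept_at:
        return "auto-match"
    if score >= review_at:
        return "manual review"
    return "no match"


# Only the borderline pair ends up in front of a human reviewer.
for score in (0.95, 0.78, 0.40):
    print(score, triage(score))
```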
Still there? Good, cause it’s Demo time!!!!
Which approach or tool do I pick?
- How much time do you want to invest in finding accurate matches?
- What resources are available for you to use?
  - Business users?
  - Yet another second bullet point with no information. I really need to be better about this.
    - Oh no, I did it again…
Fuzzy tool options I know of
- SQL Server Integration Services (SSIS)
  - Versions 2005 and later
  - Fuzzy Lookup and Fuzzy Grouping
- SQL Server Full Text Search
  - Analyzes character patterns and linguistics
  - Restricted to only text data
  - Allows configuration for specific languages
- CLR functions
- Data Quality Services (DQS) and Master Data Services (MDS)
  - DQS: versions 2012 and later
  - MDS: versions 2008 R2 and later
  - Engaging business users
  - Business user friendly?
- Fuzzy Lookup for Excel Add-In (https://www.microsoft.com/en-us/download/details.aspx?id=15011)
Final thoughts
- Fuzzy logic is another tool that you can use.
  - But it's still a tool
  - Don't hammer a nail with a screwdriver
  - Also, I need to improve my use of sub bullet points
- If you want to try it, plan some time to experiment with it
- Useful information to follow up on
  - My email: jkuehn@skylinetechnologies.com
  - Skyline blogs (https://www.skylinetechnologies.com/Blog)
  - Fuzzy Lookup Excel Add-In (https://www.microsoft.com/en-us/download/details.aspx?id=15011)
  - Check SQL Saturday website for script/SSIS packages
When your memory is fuzzy, stay fuzzy!