Data Student to Data Master

Tom Lovell & James Cotton, Information Builders

Types of data in the organization
Unstructured: Found in white papers, magazine articles, corporate intranet portals, product specifications, marketing collateral, and PDF files.
Transactional: Related to sales, deliveries, invoices, trouble tickets, claims, and other monetary and non-monetary interactions.
Metadata: Data about other data, including report definitions, column descriptions in a database, log files, connections, and configuration files.
Hierarchical: Stores the relationships between other data, such as company organisational structures or product lines.
Master: The critical nouns of a business, which generally fall into the groupings people, places, and things.

Understanding Master Data
Think of nouns and verbs.

Bob Smith buys a widget (SKU #A1234) and ships it to his home address.

The master data elements are the nouns: people, things, and places. The transactional data elements are the verbs that describe what happens to those people, places, and things.

Master data elements in this example:
- Bob Smith
- widget (SKU #A1234)
- home address

This may not seem like a big deal. After all, don’t all transactions contain this type of information? Why all this fuss about calling the customer name a “master data element”? What’s different is that these master data elements appear over and over again in many different information systems in the corporation. The customer name in the transaction record should also be the customer name in the marketing department’s mailing list. If the customer name is different in each database, the result can be errors, waste, and other business problems.

Master data management (MDM) is about making sure that these master data elements are the same across all systems that need them. Businesses need to manage master data to keep it consistent and clean across multiple databases and systems.

Systems sharing the master data: CRM, Marketing, ERP, WMS, Financial

Creating a “Golden Record”
Name: Bob Smith | Tel: | DOB: 23/10/71 | Gender: M
Name: Bob Smith | Tel: | DOB: | Gender: M
Name: B Smith | Tel: | DOB: 23/10/71 | Gender: M
Name: Bob Smith | Tel: | DOB: 23/10/71 | Gender:
Name: Bob Smith | Tel: | DOB: | Gender: Male
Name: B Smith | Tel: (0) | DOB: 23-Oct-71 | Gender: M
Name: Smith, Bob | Tel: (01283)56982 | DOB: 23/10/1971 | Gender:
Source systems: CRM, Marketing, ERP, WMS, Financial

Conceptual MDM Architecture
The highlighted area of the architecture is the subject of today's discussion.

MDM Processes
Source data layer
- Data as it appears in source systems
- Canonical format (“common” format for all systems)
- No other transformations
Cleansed data layer
- Cleansed and standardized data
- Data quality metadata
- Uses “common language”
Matching data layer
- Identifies which instance records correspond to a single “thing”
- Based on cleansed data and the associated DQ metadata
Master data layer
- Consolidated data from all records in the matching group

First the Cleansing
Parsing
- Data parsed into components (pattern based), e.g. Jim Smith -> Jim + Smith
Validation of data quality
- Validation against business rules
- Validation against reference tables
Enrichment
- Adding data
Standardization
- Transformation into standard format (Jim Smith -> James Smith)
- Standard and nonstandard abbreviations (Str. -> Street)
- Language-specific replacements
Domain-oriented algorithms, for example:
- Name
- Address
- Credit card number
- Bank account number
Extension by custom validation steps
- Using complex functions and rules, including Levenshtein distance and Soundex
- Industry standard functions
Cleansing steps: Parsing, Validation, Enrichment, Standardisation
Explain why cleansing is important. Explain that cleansing logic is reusable and that we leverage the existing investment in data quality.
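The parsing and standardization steps can be illustrated with a minimal Python sketch. It is not the product's cleansing engine: the function names and the small lookup tables (TITLES, NICKNAMES, ABBREVIATIONS) are hypothetical, and the parser assumes the name is written in "First [Middle] Last" order.

```python
# Hypothetical lookup tables; a real deployment would use reference tables.
TITLES = {"dr", "mr", "mrs", "ms"}
NICKNAMES = {"jim": "James"}
ABBREVIATIONS = {"str": "Street", "ave": "Avenue"}


def parse_name(raw):
    """Parsing: split a 'First [Middle] Last' string into components."""
    tokens = [t.strip(".,") for t in raw.split()]
    # Drop empty tokens and known titles such as "Dr."
    tokens = [t for t in tokens if t and t.lower() not in TITLES]
    if len(tokens) < 2:
        return {"first": None, "last": tokens[0] if tokens else None}
    return {"first": tokens[0], "last": tokens[-1]}


def standardize_first_name(first):
    """Standardization: map a known nickname to its formal form."""
    return NICKNAMES.get(first.lower(), first)


def standardize_address(raw):
    """Standardization: expand known street-type abbreviations."""
    words = [ABBREVIATIONS.get(w.rstrip(".").lower(), w) for w in raw.split()]
    return " ".join(words)


print(parse_name("Dr. John Smith"))
print(standardize_first_name("Jim"))
print(standardize_address("25 Linden Str."))
```

A name written last-name-first (such as "Smith W. John") would need an extra pattern; the sketch deliberately leaves that out.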

Then Scoring
Scoring is heavily used during this process. Scores are just numbers, but they help keep track of which changes were made to which piece of data by which process along the way. We can score an entire record or parts of a record separately. This very granular approach lets us make educated choices at a later stage about which parts of records to combine into our golden record.
Cleansing steps: Parsing, Validation, Enrichment, Standardisation
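One way to picture field-level scoring is the Python sketch below. The ScoredField class, the penalty values, and the convention that a higher score means more intervention are all assumptions made for illustration; the slides do not specify the actual scoring scheme.

```python
from dataclasses import dataclass, field


@dataclass
class ScoredField:
    value: str
    score: int = 0  # assumed convention: 0 = untouched, higher = more changes
    changes: list = field(default_factory=list)

    def apply(self, new_value, step, penalty):
        """Record a change made by a cleansing step and add its penalty."""
        if new_value != self.value:
            self.changes.append((step, self.value, new_value))
            self.value = new_value
            self.score += penalty


def record_score(fields):
    """Score of a whole record: here simply the sum of its field scores."""
    return sum(f.score for f in fields.values())


record = {"first": ScoredField("Jim"), "last": ScoredField("Smith")}
record["first"].apply("James", step="standardization", penalty=10)
print(record["first"].changes, record_score(record))
```

Keeping the change history next to the score is what later allows the merge step to prefer a field that needed no correction over one that was heavily repaired.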

Original data – before cleansing
Source data
Columns: Name, G, SIN, Birth Date, Address
Dr. John Smith F 12/16/1978 Ave Surrey V3R 2A9
Smith W. John M Surrey Ave
John William Smith SIN 781612 25 Linden Str Toronto M4X 1V5
Dr. J.W. Smith 11/16/78
John Smith 8500 Leslie L3T 7M8 Toronto
Smith John 8500 Leslie street Marham
John Smiht
Jane Watson 1982 Leslie str. Toronto L3T 7M8
Watson Jane 8500 Leslei street Toronto L3T 7M8
Jane Smith SIN
J. Smith

Cleaning the data
We find the following…

Cleaning the data: Name
Before (Name): Dr. John Smith
After (First, M, Last): John Smith

Cleaning the data: Gender
Data (First, M, Last, G): John Smith F

Cleaning the data: Social Insurance Number
Before/after comparison of the SIN column

Cleaning the data: Birth Date
Before (Birth Date): 12/16/1978
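Birth dates in the source data appear in more than one layout (12/16/1978 and 11/16/78). A minimal Python sketch of the standardization, assuming month-first input and ISO 8601 (YYYY-MM-DD) as the target format; both choices are assumptions, since the slides do not show the cleansed value:

```python
from datetime import datetime

# Assumed input layouts, tried in order: month/day/4-digit year, then 2-digit year.
FORMATS = ("%m/%d/%Y", "%m/%d/%y")


def normalize_birth_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known layout matches."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next layout
    return None


print(normalize_birth_date("12/16/1978"))
print(normalize_birth_date("11/16/78"))
```

Returning None rather than raising keeps invalid values visible to a later validation or scoring step.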

Then Matching
Goal: Identify groups of records that in reality represent a single client or entity.
This may not be so simple:
- Data comes from different sources
- Must handle data that is missing, wrong or conflicting (that’s why we cleanse)
- There’s no single ‘correct’ solution
Match & Merge

Prepared data (after cleansing)
Cleansed data
Columns: First, Last, G, SIN, Birth Date, Address
Names: John Smith (M); Smiht; Jane Watson (F); J.
Addresses:
- V3R 2A9;BC;Surrey; Avenue
- M4X 1V5;ON;Toronto;25 Linden Street
- L3T 7M8;ON;Markham;8500 Leslie Str.

Match
Cleansed data
Columns: First, Last, G, SIN, Birth Date, Address
Names: John Smith (M); Smiht; Jane Watson (F); J.
Addresses:
- V3R 2A9;BC;Surrey; Avenue
- M4X 1V5;ON;Toronto;25 Linden Street
- L3T 7M8;ON;Markham;8500 Leslie Str.

Step 1 Candidate grouping
Candidate groups separate the records that certainly don’t match: they divide the data into subsets. Without this step, too many records would have to be compared with each other. Records in separate candidate groups will never be matched. Additional candidate group fields are required for faster processing of large datasets.

Candidate grouping: Simple Key strategy
- Uses a single key
- Most effective when a strong primary key is present
- Easy to configure, high speed strategy
The Simple Key strategy is used when one key is strong enough to specify a group. Unification with this strategy is thus very strict. Note that a single key does not necessarily mean a single column (composite keys are allowed): src_name+src_surname is as good a key as, for example, src_social_security_nbr.

Candidate grouping: Simple Key = Social Insurance Number (SIN)
First name | Last name | SIN
James | Brown | 921213/1943
John | Smith | 821213/0943
(null) | (null) | 921213/1943
John | Smith | (null)
(null) | Smith | 821213/0943
Cindy | Becker | (null)
(null) | Smith | (null)
(null) | (null) | 821213/0943

Candidate grouping: Simple Key = Firstname + Lastname
First name | Last name | SIN
James | Brown | 921213/1943
John | Smith | 821213/0943
(null) | (null) | 921213/1943
John | Smith | (null)
(null) | Smith | 821213/0943
Cindy | Becker | (null)
(null) | Smith | (null)
(null) | (null) | 821213/0943
Note that the records with NULL keys are grouped together; this could be dangerous, as none of these records actually match.
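The Simple Key strategy amounts to grouping records by the value of one (possibly composite) key. The Python sketch below uses the eight sample records from the two slides above; the function name and dictionary field names are hypothetical. It deliberately reproduces the naive behaviour in which records with a NULL key end up in a shared group.

```python
from collections import defaultdict


def simple_key_groups(records, key_fields):
    """Group records by a single key, which may span several columns."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec.get(f) for f in key_fields)
        groups[key].append(rec)  # NULL (None) keys share one group here
    return dict(groups)


ROWS = [
    ("James", "Brown", "921213/1943"),
    ("John", "Smith", "821213/0943"),
    (None, None, "921213/1943"),
    ("John", "Smith", None),
    (None, "Smith", "821213/0943"),
    ("Cindy", "Becker", None),
    (None, "Smith", None),
    (None, None, "821213/0943"),
]
records = [{"first": f, "last": l, "sin": s} for f, l, s in ROWS]

by_sin = simple_key_groups(records, ("sin",))
by_name = simple_key_groups(records, ("first", "last"))
for key, members in by_name.items():
    print(key, len(members))
```

In practice a NULL key would usually be treated as "no key", so that each such record forms its own group instead of joining the shared NULL group.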

Candidate grouping: Hierarchical strategy
- Extends the Simple Key strategy
- Uses a secondary key to give records a ‘second chance’ to join an existing group
- Most effective when the data has holes
- Medium speed strategy
Two keys are defined (both can be composite), called primary and secondary. Records belong to one group when:
- They have the same primary key, or
- They have no primary key, they have the same secondary key as another record in the group, and no record with that secondary key exists in any other group.
When several “primary-formed” groups share a secondary key, all records with this secondary key and no primary key form their own group.
Wider groups than Simple Key (records have a “second chance” to join a group). Medium speed strategy. Hard to configure. Sufficient in almost any case.
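The hierarchical rules above can be sketched in Python as two passes over the data. This is an interpretation of the slide text, not the product's implementation: the function names, the sample records, and the decisions to treat a partly NULL composite key as missing and to keep records with neither key as singletons are all assumptions.

```python
from collections import defaultdict


def key_of(rec, fields):
    """Return the (possibly composite) key, or None if any part is missing."""
    values = tuple(rec.get(f) for f in fields)
    return None if any(v is None for v in values) else values


def hierarchical_groups(records, primary, secondary):
    groups = defaultdict(list)
    owners = defaultdict(set)  # secondary key -> primary groups containing it

    # Pass 1: records with a primary key form the primary groups.
    for rec in records:
        p = key_of(rec, primary)
        if p is not None:
            groups[("primary", p)].append(rec)
            s = key_of(rec, secondary)
            if s is not None:
                owners[s].add(("primary", p))

    # Pass 2: records without a primary key get a second chance.
    for i, rec in enumerate(records):
        if key_of(rec, primary) is not None:
            continue
        s = key_of(rec, secondary)
        if s is None:
            gid = ("unkeyed", i)          # no usable key: singleton group
        elif len(owners.get(s, ())) == 1:
            gid = next(iter(owners[s]))   # exactly one primary group: join it
        else:
            gid = ("secondary", s)        # none or several: own group
        groups[gid].append(rec)
    return dict(groups)


# Hypothetical sample data for illustration only.
sample = [
    {"sin": "12345", "first": "John", "last": "Smith"},
    {"sin": None, "first": "John", "last": "Smith"},
    {"sin": "09876", "first": "Jane", "last": "Watson"},
    {"sin": "11111", "first": "Jane", "last": "Watson"},
    {"sin": None, "first": "Jane", "last": "Watson"},
    {"sin": None, "first": None, "last": "Smith"},
]
result = hierarchical_groups(sample, ("sin",), ("first", "last"))
for gid, members in result.items():
    print(gid, len(members))
```

The second John Smith joins the group keyed by SIN 12345, while the keyless Jane Watson cannot choose between two primary groups and therefore forms her own.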

Candidate grouping: Hierarchical (extends Simple Key)
Distribute records having no key according to the “secondary” key.
Columns: SSN, First Name, Last Name, Candidate ID
Values: 12345 John Smith 1 Smiht 09876 Jane 2 Jay Mith Jack 3

Candidate grouping: Hierarchical (SIN & Name)
Diagram stages: Records, Primary groups, Secondary groups, Candidate groups
Records shown: John Smith, null, Jane Smith, Jane Watson, J Smith, Janette Smith
Copyright 2007, Information Builders.

Candidate grouping: Union strategy
- Multi-key strategy where for every record at least one additional record exists that matches on the same key
- Creates the largest candidate groups
- Most effective when no primary key can be defined
- Low speed strategy if badly configured

Candidate grouping: Union (SIN, AddressID, Lastname)
Records (Lastname, SIN):
- Fox 821213/0943
- Hunter 431109/0099
- (null) 821213/0943
- (null) (null)
- Douglas 765213/1123
- Walker 450102/4449
- Smith 765213/1123
Annotations:
- These two are probably husband and wife.
- These two probably represent the same person (same SIN).
- This is probably a married woman who changed her name (and she moved to live with her husband).
The Union strategy creates the widest candidate groups. It is defined by a set of keys. A candidate group consists of records where, for every record, at least one other record exists with at least one identical key. This relation is applied transitively (hence such large groups). For math people: the Union strategy returns equivalence classes, where the equivalence relation is the transitive closure of “two records match in at least one key”. Null values are not treated as ordinary values but as “no key” (so two nulls do not necessarily belong to the same group). If badly configured, it can be VERY slow.

Candidate grouping: Union (SIN, AddressID, Lastname)
The same records, annotated step by step:
- Three people living in the same house.
- This is probably a married woman who changed her name (and she moved to live with her husband).
- These two records have the same AddressId.
- These two records have the same SIN.

Records (Lastname, SIN):
- Fox 821213/0943
- Hunter 431109/0099
- (null) 821213/0943
- (null) (null)
- Douglas 765213/1123
- Douglas 450102/4449
- Smith 765213/1123
Annotations:
- The group even contains records that do not match on ANY key. This is because of the transitivity.
- Those two records have the same name.
- Those three records have the same AddressId.
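The transitive behaviour of the Union strategy can be sketched with a union-find (disjoint-set) structure in Python. The choice of union-find is an illustration, not a statement about how the product works; the function name is hypothetical, and the address IDs (A1, A2, A3) are invented because the slide's AddressId values are not available.

```python
def union_groups(records, key_fields):
    """Group records that are linked, directly or transitively, by any key."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for i, rec in enumerate(records):
        for f in key_fields:
            v = rec.get(f)
            if v is None:
                continue  # null means "no key", so it never links records
            union(i, first_seen.setdefault((f, v), i))

    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(find(i), []).append(rec)
    return list(groups.values())


# Address IDs are hypothetical placeholders.
people = [
    {"last": "Fox", "sin": "821213/0943", "addr": "A1"},
    {"last": "Hunter", "sin": "431109/0099", "addr": "A1"},
    {"last": None, "sin": "821213/0943", "addr": None},
    {"last": "Douglas", "sin": "765213/1123", "addr": "A2"},
    {"last": "Smith", "sin": "765213/1123", "addr": "A3"},
    {"last": "Walker", "sin": "450102/4449", "addr": "A3"},
    {"last": None, "sin": None, "addr": None},
]
candidate_groups = union_groups(people, ("sin", "addr", "last"))
for group in candidate_groups:
    print([r["last"] for r in group])
```

Douglas and Walker share no key at all, yet they land in the same candidate group because each is linked to Smith. The record with no keys stays on its own, since nulls are treated as "no key".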

Candidate grouping: Hierarchical Union strategy
Primary: Simple Key
Secondary: Union
Don’t worry; I can’t draw graphics this complicated.

Step 2 Client grouping
Client groups are defined by selecting individual records in a candidate group. The records are sorted, and the selected records are designated ‘center’ records. For each record in a candidate group, the ‘distance’ to the ‘center’ is measured. This distance determines whether a record is actually a match or duplicate.

Client grouping
We have (with some grouping strategy) created candidate groups, but remember that they are just a step towards our final goal: client groups. A single candidate group can contain anywhere from one client group up to as many client groups as it has records.
- First we select a central master record in each candidate group
- Then we take the next record
- Measure the distance between this record and the central master
- If OK, add the record to the client group
- If not, make the record a new iterative master
- All other records will be compared with:
  - the central master
  - all iterative masters
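The steps above can be sketched in Python as follows. Several things here are assumptions: the candidate group is taken to be already sorted so that its first record is the central master, the distance is plain Levenshtein edit distance on the full name (one of the functions named on the cleansing slide), and the threshold of 2 is arbitrary.

```python
def levenshtein(a, b):
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def client_groups(candidate_group, distance, threshold):
    """Split one candidate group into client groups around master records."""
    masters = []  # masters[0] is the central master, the rest are iterative
    groups = []   # groups[k] holds the records attached to masters[k]
    for rec in candidate_group:
        for k, master in enumerate(masters):
            if distance(rec, master) <= threshold:
                groups[k].append(rec)
                break
        else:
            # Too far from every master so far: becomes a new master.
            masters.append(rec)
            groups.append([rec])
    return groups


names = ["John Smith", "John Smiht", "Jon Smith", "Jane Watson", "Jane Watsen"]
clients = client_groups(names, levenshtein, threshold=2)
print(clients)
```

A real matching rule would normally combine distances over several cleansed fields (name, birth date, address) rather than a single string.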

Client grouping: Measuring ‘distance’
Diagram labels: Records, Candidate groups, Client groups; Center (M); records 1, 2, 3, 4, 5

Finally Merging
Creating the Golden Record
- Can cherry-pick the best fields or even the best record
- Using rules to determine the best field/record, for example:
  - The one from the ‘reference system’
  - The newest one
  - The one of highest quality
- Aggregation functions
  - SQL-like: count, sum, minimum, maximum, average
  - Mode (most frequent value), concatenate
Match & Merge

Merge
Cleansed data (columns: First, Last, G, SIN, Birth Date, Address)
- John Smith M
- Addresses: V3R 2A9;BC;Surrey; Avenue and M4X 1V5;ON;Toronto;25 Linden Street
Golden record (columns: First, Last, G, SIN, Birth Date, Address)
- John Smith M
- Addresses: V3R 2A9;BC;Surrey; Avenue and M4X 1V5;ON;Toronto;25 Linden Street
Address rules shown: Newest Address, Most Frequent Address
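The two address rules on this slide (newest and most frequent) can be sketched in Python. The "updated" timestamp field, its values, and the function names are hypothetical; the slides do not say how recency is tracked.

```python
from collections import Counter


def most_frequent(values):
    """Aggregation rule: the mode of the non-null values."""
    present = [v for v in values if v is not None]
    return Counter(present).most_common(1)[0][0] if present else None


def newest(records, field):
    """Survivorship rule: the value from the most recently updated record."""
    dated = [r for r in records if r.get(field) is not None]
    return max(dated, key=lambda r: r["updated"])[field] if dated else None


def merge(records):
    """Build a golden record by choosing each field with its own rule."""
    return {
        "first": most_frequent(r.get("first") for r in records),
        "last": most_frequent(r.get("last") for r in records),
        "address_newest": newest(records, "address"),
        "address_most_frequent": most_frequent(r.get("address") for r in records),
    }


# The 'updated' dates are invented for the example (ISO strings sort by date).
matched = [
    {"first": "John", "last": "Smith",
     "address": "V3R 2A9;BC;Surrey; Avenue", "updated": "2006-03-01"},
    {"first": "John", "last": "Smith",
     "address": "M4X 1V5;ON;Toronto;25 Linden Street", "updated": "2007-01-15"},
    {"first": "J.", "last": "Smiht",
     "address": "V3R 2A9;BC;Surrey; Avenue", "updated": "2005-11-20"},
]
golden = merge(matched)
print(golden)
```

The two rules give different addresses for the same client group, which is why the choice of rule per field matters.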

Matching & Merging really depends on your style…
Consolidated
- Master is Single Version of Truth
- Data Quality at Master
- Updates occur at Sources
- Updates propagated to Master
Coexistence
- Master is Single Version of Truth
- Data Quality is on-going
- Updates occur at Sources or Master
- Updates propagated to other Sources
Registry
- Multiple Versions of Truth
- Data Quality is on-going
- Updates occur at Sources
- Keys & Metadata in Registry
- Updates optionally propagated to other Sources
Centralized
- Master is Single Version of Truth
- Data Quality at Master
- Updates occur at Master
- Updates propagated to Sources

Mastering is one step & Information Builders can help with it all
Master Data Management Lifecycle

Thank You!

Appendix
