Data Quality David Loshin Knowledge Integrity Inc. www.knowledge-integrity.com Knowledge Integrity Incorporated
Course Structure Overview of Data Quality Dimensions of Data Quality Data Ownership and Data Roles Cost Analysis of Poor Data Quality Dimensions of Data Quality Data models, Data values, Presentation Data Analysis Techniques Data Analysis Tools
Course Structure (2) Data Quality Improvement Metadata and Enterprise Reference Data Domains and Mappings Data Quality Rules Definition Data Quality Rule Discovery
Course Structure (3) Data Profiling Using Data Quality Rules Data Transformation Data Cleansing Ongoing Validation
Course Structure (4) Data Correction Data Cleansing Scalability Issues Data Parsing Standardization Linkage Duplicate Elimination Approximate Searching Scalability Issues
Assignments 4 Assignments “Handy Tools” for data analysis Domain Analysis Data Parsing Data Linkage Assignments to be programmed using Perl or Java
Some Examples Frequent Flyer Miles and Long-Distance Service Corporate Credit Card Direct Marketing Event CD Club Scam
What is Data? Working definitions: Data: arbitrary values (with their own representation) Information: data within a context Knowledge: Understanding of information within its context Metadata: data about data
Data Contexts Static flat file data Static databases Dynamic data flows Message passing
Who Owns Data? Important question, because the answers indicate where responsibility for data quality lies Data quality can be difficult to effect because of complicating notions Data Processing as an “Information Factory” Actors in the information factory and their roles
Actors and Their Roles Supplier Acquirer Creator Processor Packager Delivery Agent Consumer Middle Manager Senior Manager Decision-maker
Ownership Responsibilities Definition of data Authorization and Security User support Data packaging and delivery Maintenance Data quality Management of business rules Management of metadata Standards management Supplier management
Ownership Paradigms Creator Consumer Compiler Enterprise Funder Decoder Packager Reader Subject Purchaser Everyone
Complicating Notions Ownership is affected by: The value of data Privacy Turf Fear Bureaucracy
The Data Ownership Policy Order of enforcement Identify stakeholders Identify data sets Allocation of ownership Ownership roles and responsibilities Dispute Resolution
The Data Ownership Policy (2) Maintain a metadata database for data ownership Parties table Data set table Roles and responsibilities Policies (e.g., dispute resolution, communication, etc.)
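One way the ownership metadata database could be laid out is sketched below with Python's built-in sqlite3 module. The slide only names the tables; every column name, the `ownership_roles` and `policies` table names, and the sample parties and data set are assumptions made for illustration, not part of the course material.

```python
import sqlite3

def build_ownership_db(conn):
    """Create a hypothetical schema for the ownership metadata database."""
    conn.executescript("""
        CREATE TABLE parties (
            party_id INTEGER PRIMARY KEY,
            name     TEXT NOT NULL
        );
        CREATE TABLE data_sets (
            data_set_id INTEGER PRIMARY KEY,
            name        TEXT NOT NULL
        );
        -- Which party plays which role, with which responsibility, per data set.
        CREATE TABLE ownership_roles (
            party_id       INTEGER NOT NULL REFERENCES parties(party_id),
            data_set_id    INTEGER NOT NULL REFERENCES data_sets(data_set_id),
            role           TEXT NOT NULL,
            responsibility TEXT NOT NULL
        );
        -- Policies such as dispute resolution or communication.
        CREATE TABLE policies (
            policy_id   INTEGER PRIMARY KEY,
            kind        TEXT NOT NULL,
            description TEXT NOT NULL
        );
    """)

def owners_of(conn, data_set_name):
    """Return (party name, role) pairs recorded for one data set."""
    return conn.execute("""
        SELECT p.name, r.role
        FROM ownership_roles r
        JOIN parties p   ON p.party_id = r.party_id
        JOIN data_sets d ON d.data_set_id = r.data_set_id
        WHERE d.name = ?
        ORDER BY p.name, r.role
    """, (data_set_name,)).fetchall()

# Hypothetical sample data, for illustration only.
conn = sqlite3.connect(":memory:")
build_ownership_db(conn)
conn.execute("INSERT INTO parties VALUES (1, 'Billing')")
conn.execute("INSERT INTO parties VALUES (2, 'Marketing')")
conn.execute("INSERT INTO data_sets VALUES (1, 'customer_master')")
conn.execute("INSERT INTO ownership_roles VALUES (1, 1, 'Steward', 'Data quality')")
conn.execute("INSERT INTO ownership_roles VALUES (2, 1, 'Data Consumer', 'Report defects')")

owners = owners_of(conn, "customer_master")
```

Keeping roles in their own table lets one party hold different roles on different data sets, which matches the idea that ownership is allocated per data set.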
Ownership Roles CIO CKO Trustee Policy Manager Registrar Steward Custodian Data Administrator Security Administrator Information Flow Information Processing Application development Data Provider Data Consumer
Map the Flow of Information Data processing can be likened to an “information factory” Data sets from multiple sources are used as “raw input” Final products are created in the form of business processes, information products, strategic reports, etc.
Stages in the Information Map Data Supply Data Acquisition Data Creation Data Processing Data Packaging Decision Making Decision Implementation Data Delivery Data Consumption
Directed Information Channels Indicates the flow of information from one processing stage to another Example: supplier data is delivered to an acquisition stage through an information channel Directed indicates the direction in which data flows This effectively maps all points at which a data fault or nonconformance may appear
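An information map with directed channels can be modeled as a directed graph: stages are nodes and channels are edges. The sketch below is a minimal illustration under that assumption. The class name `InformationMap`, its methods, and the particular chain of stages wired together are hypothetical; the slides do not prescribe a data structure.

```python
from collections import defaultdict

class InformationMap:
    """Directed graph of processing stages joined by information channels."""

    def __init__(self):
        self.channels = defaultdict(list)  # source stage -> list of target stages

    def add_channel(self, source, target):
        """Record a directed channel carrying data from source to target."""
        self.channels[source].append(target)

    def inspection_points(self):
        """Every channel is a point where a fault or nonconformance may appear."""
        return [(s, t) for s, targets in self.channels.items() for t in targets]

    def downstream(self, stage):
        """Stages reachable from `stage`, i.e. those a fault there could affect."""
        visited = {stage}
        reached = []
        stack = [stage]
        while stack:
            current = stack.pop()
            for nxt in self.channels.get(current, []):
                if nxt not in visited:
                    visited.add(nxt)
                    reached.append(nxt)
                    stack.append(nxt)
        return reached

# Hypothetical linear flow built from the stage names in the information map.
flow = InformationMap()
stages = ["Data Supply", "Data Acquisition", "Data Processing",
          "Data Packaging", "Data Delivery", "Data Consumption"]
for src, dst in zip(stages, stages[1:]):
    flow.add_channel(src, dst)

affected = flow.downstream("Data Processing")
points = flow.inspection_points()
```

Listing the channels gives the candidate measurement points, and the downstream walk shows which later stages inherit a fault introduced at a given stage.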
Example: Credit Approval
Example: Hotel Reservation Process
Example: Catalog Sales
What is Data Quality? “Fitness for Use” Different rules for different data sets Includes: Data profiling Domain and cross-attribute analysis Discovery of business rules Data cleansing Standardization Deduplication and Merge-purge
Lather, Rinse, Repeat Data quality is a process: (1) Assess the current state of the quality of data (2) Determine the area that needs most improvement (3) Determine success criteria (4) Implement the improvement (5) Measure against success threshold (6) If successful: goto 2
Data Quality is Hard to Do No one wants to admit mistakes Denial of responsibility Lack of understanding “Dirty work” Lack of recognition
Steps to Data Quality Training Data ownership policy Economic model of data quality Current state assessment and requirements analysis Project selection and implementation
Simple Tools Goal: To look for simple patterns that indicate a problem that needs to be addressed Grouping and Linking Frequency Analysis Pattern Analysis
Grouping Try to make similar items gravitate together Joining data instances based on business rules Simple methods: Attribute selection Sorting Hashing
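The grouping methods named above (attribute selection, sorting, hashing) can be sketched as follows. This is an illustrative sketch, not the course's own tool: the function names, the sample records, and the normalization step (trimming whitespace and lowercasing) are assumptions.

```python
from collections import defaultdict
from itertools import groupby

def make_key(record, key_attrs):
    """Attribute selection: build a grouping key from the chosen attributes.
    Trimming and lowercasing is an assumed normalization, not from the slides."""
    return tuple(record[a].strip().lower() for a in key_attrs)

def group_by_hashing(records, key_attrs):
    """Hashing: records with equal keys land in the same bucket."""
    groups = defaultdict(list)
    for rec in records:
        groups[make_key(rec, key_attrs)].append(rec)
    return dict(groups)

def group_by_sorting(records, key_attrs):
    """Sorting: equal keys become adjacent, then runs are collected."""
    keyfn = lambda rec: make_key(rec, key_attrs)
    ordered = sorted(records, key=keyfn)
    return {k: list(run) for k, run in groupby(ordered, key=keyfn)}

# Hypothetical sample records.
records = [
    {"name": "Smith",  "zip": "10001"},
    {"name": "smith ", "zip": "10001"},
    {"name": "Jones",  "zip": "20002"},
]

hashed = group_by_hashing(records, ["name", "zip"])
sorted_groups = group_by_sorting(records, ["name", "zip"])
```

Both approaches produce the same groups here. Hashing makes a single pass over the data, while sorting also leaves similar keys next to each other, which can help when reviewing near-matches by eye.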
Frequency Analysis Look for insights in numbers Simple methods: Counting Hashing
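Frequency analysis by counting can be sketched with a hash-based counter, as below. The function names, the `max_count` threshold, and the sample column of state codes are hypothetical choices for illustration.

```python
from collections import Counter

def value_frequencies(values):
    """Count how often each distinct value occurs (Counter is hash-based)."""
    return Counter(values)

def rare_values(values, max_count=1):
    """Values seen at most `max_count` times; often typos or odd encodings.
    The default threshold of 1 is an assumption, tune it per data set."""
    counts = Counter(values)
    return sorted(v for v, c in counts.items() if c <= max_count)

# Hypothetical column of state codes.
states = ["NY", "NY", "NJ", "NY", "N.Y.", "NJ", "CA", "NY"]

freq = value_frequencies(states)
suspects = rare_values(states)
```

A rare value is not necessarily wrong ("CA" may be legitimate), so the output is a list of candidates for review rather than a list of errors.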
Pattern Analysis Looking to distinguish between what is expected and what is not expected Attempt to find outliers and nonconformities
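One simple form of pattern analysis reduces each value to a character-class "shape" and flags values whose shape differs from the expected one. The sketch below assumes the most frequent shape is the expected pattern, which is a heuristic of this example and not a rule from the course; the function names and the sample phone numbers are also hypothetical.

```python
from collections import Counter

def shape(value):
    """Map digits to '9' and letters to 'A'; keep other characters as they are."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A")
        else:
            out.append(ch)
    return "".join(out)

def nonconforming(values):
    """Return (expected shape, values that do not match it).
    Assumes the most common shape is the expected one."""
    if not values:
        return None, []
    shapes = Counter(shape(v) for v in values)
    expected, _ = shapes.most_common(1)[0]
    return expected, [v for v in values if shape(v) != expected]

# Hypothetical phone number column.
phones = ["212-555-0101", "646-555-0199", "2125550123", "212-555-01AB"]

expected_shape, outliers = nonconforming(phones)
```

The unpunctuated number may simply need standardization, while the one containing letters is more likely a genuine error, so the flagged values still need review.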
Example