1
Data Quality David Loshin Knowledge Integrity Inc.
Knowledge Integrity Incorporated
2
Course Structure
Overview of Data Quality
Data Ownership and Data Roles
Cost Analysis of Poor Data Quality
Dimensions of Data Quality: Data models, Data values, Presentation
Data Analysis Techniques
Data Analysis Tools
3
Course Structure (2) Data Quality Improvement
Metadata and Enterprise Reference Data Domains and Mappings Data Quality Rules Definition Data Quality Rule Discovery
4
Course Structure (3) Data Profiling Using Data Quality Rules
Data Transformation Data Cleansing Ongoing Validation
5
Course Structure (4)
Data Correction
Data Cleansing
Data Parsing
Standardization
Linkage
Duplicate Elimination
Approximate Searching
Scalability Issues
6
Assignments
4 assignments:
“Handy Tools” for data analysis
Domain Analysis
Data Parsing
Data Linkage
Assignments to be programmed using Perl or Java
7
Some Examples Frequent Flyer Miles and Long-Distance Service
Corporate Credit Card Direct Marketing Event CD Club Scam
8
What is Data?
Working definitions:
Data: arbitrary values (with their own representation)
Information: data within a context
Knowledge: understanding of information within its context
Metadata: data about data
9
Data Contexts
Static flat file data
Static databases
Dynamic data flows
Message passing
10
Who Owns Data?
Important question, because the answers indicate where responsibility for data quality lies
Data quality can be difficult to effect because of complicating notions
Data Processing as an “Information Factory”
Actors in the information factory and their roles
11
Actors and Their Roles
Supplier
Acquirer
Creator
Processor
Packager
Delivery Agent
Consumer
Middle Manager
Senior Manager
Decision-maker
12
Ownership Responsibilities
Definition of data
Authorization and Security
User support
Data packaging and delivery
Maintenance
Data quality
Management of business rules
Management of metadata
Standards management
Supplier management
13
Ownership Paradigms
Creator
Consumer
Compiler
Enterprise
Funder
Decoder
Packager
Reader
Subject
Purchaser
Everyone
14
Complicating Notions
Ownership is affected by:
The value of data
Privacy
Turf
Fear
Bureaucracy
15
The Data Ownership Policy
Order of enforcement
Identify stakeholders
Identify data sets
Allocation of ownership
Ownership roles and responsibilities
Dispute Resolution
16
The Data Ownership Policy (2)
Maintain a metadata database for data ownership:
Parties table
Data set table
Roles and responsibilities
Policies (e.g., dispute resolution, communication)
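One way to picture the ownership metadata database is as a handful of related tables. The sketch below uses Python's built-in sqlite3 module. Every table name, column name, and sample row is a hypothetical choice made for illustration; the slides name only the kinds of tables, not their layout.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative assumptions,
# not taken from the course material.
SCHEMA = """
CREATE TABLE party (
    party_id INTEGER PRIMARY KEY,
    name     TEXT NOT NULL
);
CREATE TABLE data_set (
    data_set_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE ownership_role (
    party_id       INTEGER NOT NULL REFERENCES party(party_id),
    data_set_id    INTEGER NOT NULL REFERENCES data_set(data_set_id),
    role           TEXT NOT NULL,
    responsibility TEXT NOT NULL,
    PRIMARY KEY (party_id, data_set_id, role)
);
CREATE TABLE policy (
    policy_id   INTEGER PRIMARY KEY,
    kind        TEXT NOT NULL,
    description TEXT NOT NULL
);
"""


def build_ownership_db():
    """Create an in-memory ownership database with one made-up sample row per table used."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO party VALUES (1, 'Customer Service')")
    conn.execute("INSERT INTO data_set VALUES (1, 'customer_master')")
    conn.execute(
        "INSERT INTO ownership_role VALUES (1, 1, 'Steward', 'Data quality')"
    )
    conn.commit()
    return conn


def roles_for(conn, data_set_name):
    """Return (party, role, responsibility) rows recorded for one data set."""
    query = """
        SELECT p.name, r.role, r.responsibility
        FROM ownership_role r
        JOIN party p ON p.party_id = r.party_id
        JOIN data_set d ON d.data_set_id = r.data_set_id
        WHERE d.name = ?
        ORDER BY p.name, r.role
    """
    return conn.execute(query, (data_set_name,)).fetchall()
```

With such a layout, the question "who is responsible for this data set?" becomes a single query, e.g. `roles_for(build_ownership_db(), "customer_master")`.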
17
Ownership Roles
CIO
CKO
Trustee
Policy Manager
Registrar
Steward
Custodian
Data Administrator
Security Administrator
Information Flow
Information Processing
Application development
Data Provider
Data Consumer
18
Map the Flow of Information
Data processing can be likened to an “information factory”
Data sets from multiple sources are used as “raw input”
Final products are created in the form of business processes, information products, strategic reports, etc.
19
Stages in the Information Map
Data Supply
Data Acquisition
Data Creation
Data Processing
Data Packaging
Decision Making
Decision Implementation
Data Delivery
Data Consumption
20
Directed Information Channels
Each channel indicates the flow of information from one processing stage to another
Example: supplier data is delivered to an acquisition stage through an information channel
“Directed” indicates the direction in which data flows
This effectively maps all points at which a data fault or nonconformance may appear
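An information map with directed channels can be modeled as a directed graph. The following is a minimal sketch under stated assumptions: the stage names come from the slides, but the specific wiring in `CHANNELS` and the helper names are invented for the example.

```python
from collections import defaultdict, deque

# Illustrative channels only: stage names come from the slides, but this
# particular wiring is an assumption made for the example.
CHANNELS = [
    ("Data Supply", "Data Acquisition"),
    ("Data Acquisition", "Data Processing"),
    ("Data Processing", "Data Packaging"),
    ("Data Packaging", "Data Delivery"),
    ("Data Delivery", "Data Consumption"),
]


def build_map(channels):
    """Build an adjacency list: stage -> stages it sends data to."""
    graph = defaultdict(list)
    for source, target in channels:
        graph[source].append(target)
    return graph


def downstream(graph, stage):
    """Stages reachable from `stage`, in breadth-first order."""
    seen, order, queue = {stage}, [], deque([stage])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


def inspection_points(channels):
    """Every stage and every channel is a place where a fault may appear."""
    stages = sorted({stage for channel in channels for stage in channel})
    return stages, list(channels)
```

Because the channels are directed, `downstream(build_map(CHANNELS), "Data Processing")` lists the stages that a fault introduced during processing could affect.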
21
Example: Credit Approval
22
Example: Hotel Reservation Process
23
Example: Catalog Sales
24
What is Data Quality?
“Fitness for Use”
Different rules for different data sets
Includes:
Data profiling
Domain and cross-attribute analysis
Discovery of business rules
Data cleansing
Standardization
Deduplication and Merge-purge
25
Lather, Rinse, Repeat
Data quality is a process:
1. Assess the current state of the quality of data
2. Determine the area that needs most improvement
3. Determine success criteria
4. Implement the improvement
5. Measure against success threshold
6. If successful: go to step 2
26
Data Quality is Hard to Do
No one wants to admit mistakes
Denial of responsibility
Lack of understanding
“Dirty work”
Lack of recognition
27
Steps to Data Quality
Training
Data ownership policy
Economic model of data quality
Current state assessment and requirements analysis
Project selection and implementation
28
Simple Tools
Goal: To look for simple patterns that indicate a problem that needs to be addressed
Grouping and Linking
Frequency Analysis
Pattern Analysis
29
Grouping
Try to make similar items gravitate together
Joining data instances based on business rules
Simple methods: Attribute selection, Sorting, Hashing
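The three simple methods can be combined in a few lines. This is a sketch, not the course's own tool: the record layout, the normalization rule (trim and lowercase), and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import groupby

# Made-up sample records for illustration.
RECORDS = [
    {"name": "Smith", "zip": "10001"},
    {"name": " SMITH", "zip": "10001"},
    {"name": "Jones", "zip": "20002"},
]


def group_key(record, attributes):
    """Attribute selection: build a normalized key from the chosen fields."""
    return tuple(record[a].strip().lower() for a in attributes)


def group_by_hashing(records, attributes):
    """Hashing: records with the same key land in the same bucket."""
    buckets = defaultdict(list)
    for record in records:
        buckets[group_key(record, attributes)].append(record)
    return dict(buckets)


def group_by_sorting(records, attributes):
    """Sorting: equal keys become adjacent, then are collected in runs."""
    def keyfunc(record):
        return group_key(record, attributes)

    ordered = sorted(records, key=keyfunc)
    return {key: list(run) for key, run in groupby(ordered, key=keyfunc)}
```

Both functions produce the same groups; hashing avoids the sort, while sorting leaves the groups in key order, which can be convenient for manual review.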
30
Frequency Analysis
Look for insights in numbers
Simple methods: Counting, Hashing
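Counting with a hash table is nearly a one-liner in most languages. The sketch below uses Python's `collections.Counter` (a hash-based counter); the function names, the `max_count` threshold, and the sample values are assumptions chosen for illustration.

```python
from collections import Counter

# Made-up sample column of state codes.
STATES = ["NY", "NY", "NJ", "NY", "N.Y.", "NJ"]


def value_frequencies(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


def rare_values(values, max_count=1):
    """Values seen at most `max_count` times; often typos or variants."""
    counts = Counter(values)
    return sorted(v for v, c in counts.items() if c <= max_count)
```

A value that appears only once in a column that otherwise repeats heavily, such as `"N.Y."` here, is a candidate for a closer look; it is not automatically an error.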
31
Pattern Analysis
Looking to distinguish between what is expected and what is not expected
Attempt to find outliers and nonconformities
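One common way to do this for string data is to reduce each value to a character-class pattern and compare patterns. The sketch below is an illustration under assumptions: it treats the most frequent pattern as the "expected" one, which is a simplification, and the function names and sample phone numbers are made up.

```python
from collections import Counter

# Made-up sample values; the last one contains the letter O instead of zero.
PHONES = ["212-555-0101", "212-555-0188", "2125550147", "212-555-01O9"]


def shape(value):
    """Reduce a string to its pattern: digits -> 9, letters -> A."""
    chars = []
    for ch in value:
        if ch.isdigit():
            chars.append("9")
        elif ch.isalpha():
            chars.append("A")
        else:
            chars.append(ch)  # keep punctuation and spaces as-is
    return "".join(chars)


def find_nonconforming(values):
    """Return (expected_pattern, values whose pattern differs from it).

    Assumption: the most frequent pattern is the expected one.
    """
    shapes = [shape(v) for v in values]
    if not shapes:
        return None, []
    expected, _count = Counter(shapes).most_common(1)[0]
    outliers = [v for v, s in zip(values, shapes) if s != expected]
    return expected, outliers
```

In practice the expected pattern would more likely come from a documented business rule than from a majority vote, so the outliers reported this way are candidates for review rather than confirmed errors.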
32
Example