Chapter 1 Introduction to Data Quality
Data Quality Characteristics
Data quality affects several attributes associated with data:
– Accuracy: Is it realistic or believable?
– Integrity: Is it structured and managed?
– Consistency: Is it consistently defined and maintained?
– Validity: Is the data valid, based on business or industry rules and standards?
What Causes Poor Data Quality?
These factors can contribute to poor data quality:
– Business rules do not exist, or there are no standards for data capture.
– Standards exist but are not enforced at the point of data capture.
– Data entry is inconsistent (incorrect spelling, use of nicknames, middle names, or aliases).
– Data entry mistakes occur (character transposition, misspellings, and so on).
– Data is integrated from systems with different data standards.
– Data quality issues are perceived as time-consuming and expensive to fix.
Primary Sources of Data Quality Problems
Source: The Data Warehousing Institute, Data Quality and the Bottom Line, 2002
How Is Clean Data Achieved?
Clean data is the result of a combination of efforts:
– making sure that data entered into the system is clean
– cleaning up problems after the data is accepted.
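The two efforts can be sketched in code. This is a minimal illustration only: the field names (`name`, `state`) and the two-letter state rule are assumptions made for the example, not rules taken from this chapter.

```python
import re

# Hypothetical capture rule: a state is a two-letter uppercase code.
STATE_PATTERN = re.compile(r"^[A-Z]{2}$")

def validate_at_capture(record):
    """Return a list of problems; an empty list means the record may be accepted."""
    problems = []
    if not record.get("name", "").strip():
        problems.append("name is required")
    if not STATE_PATTERN.match(record.get("state", "")):
        problems.append("state must be a two-letter code")
    return problems

def clean_after_acceptance(record):
    """Repair what can be repaired in a record that is already stored."""
    cleaned = dict(record)
    # Collapse runs of whitespace in the name and normalize the state code.
    cleaned["name"] = " ".join(record.get("name", "").split())
    cleaned["state"] = record.get("state", "").strip().upper()
    return cleaned
```

The first function rejects bad data before it enters the system; the second repairs data that was accepted anyway. In practice both are usually needed.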
Typical Data Quality Issues
The most common processes in a data quality initiative are
Data Analysis and Standardization
  – consistency analysis
  – standardization schemes
  – gender analysis
  – entity analysis
  – data parsing and casing
Matching and Merging
  – de-duplication
  – householding
Address Verification
  – against a CASS-certified database
Geocoding
  – data enrichment using third-party data elements.
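As one illustration of the processes listed above, householding can be sketched as grouping records that share a surname and an address. The record layout (`name`, `address`) and the normalization rules are assumptions made for this example; production tools use far richer matching logic.

```python
from collections import defaultdict

def household_key(record):
    """Normalize surname and address so trivial differences do not split a household."""
    surname = record["name"].split()[-1].lower()
    # Lowercase, drop periods, and collapse whitespace in the address.
    address = " ".join(record["address"].lower().replace(".", "").split())
    return (surname, address)

def households(records):
    """Group record names by their normalized (surname, address) key."""
    groups = defaultdict(list)
    for rec in records:
        groups[household_key(rec)].append(rec["name"])
    return dict(groups)

# Made-up records, for illustration only.
people = [
    {"name": "Ann Lee", "address": "12 Oak St."},
    {"name": "Bob Lee", "address": "12  oak st"},
    {"name": "Cy Ray", "address": "9 Elm Rd"},
]
print(households(people))
```

Here the two Lee records fall into one household even though their addresses differ in case, punctuation, and spacing.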
Analysis and Standardization Example
Who is the biggest supplier?

Supplier               Amount
Anderson Construction  $ 2,
Briggs, Inc            $ 8,
Brigs Inc.             $12,
Casper Corp.           $27,
Caspar Corp            $ 6,
Solomon Industries     $43,
The Casper Corp        $11,500.00
Standardization Scheme

Original value     Standardized value
Briggs, Inc        Briggs Inc.
Brigs Inc.         Briggs Inc.
Casper Corp.       Casper Corp.
Caspar Corp        Casper Corp.
The Casper Corp    Casper Corp.
Supplier Spending
[Bar chart: $ spent (axis from 0 to 50,000) by supplier: Casper Corp., Solomon Ind., Briggs Inc., Anderson Cons.]
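The supplier example can be sketched as a lookup table applied before totals are computed. This is a minimal illustration: the mapping follows the standardization scheme shown above, while the function names and the amounts in the usage lines are made up for the example and are not the figures from the slides.

```python
from collections import defaultdict

# Variant spellings mapped to their standard form.
SCHEME = {
    "Briggs, Inc": "Briggs Inc.",
    "Brigs Inc.": "Briggs Inc.",
    "Casper Corp.": "Casper Corp.",
    "Caspar Corp": "Casper Corp.",
    "The Casper Corp": "Casper Corp.",
}

def standardize(name, scheme=SCHEME):
    """Return the standard form of a name, or the trimmed name if it is unmapped."""
    name = name.strip()
    return scheme.get(name, name)

def total_by_supplier(rows, scheme=SCHEME):
    """Sum amounts per standardized supplier name."""
    totals = defaultdict(float)
    for name, amount in rows:
        totals[standardize(name, scheme)] += amount
    return dict(totals)

# Illustrative amounts only; they are NOT the amounts from the example.
rows = [
    ("Briggs, Inc", 100.0),
    ("Brigs Inc.", 50.0),
    ("Caspar Corp", 25.0),
    ("Solomon Industries", 10.0),
]
print(total_by_supplier(rows))
```

Without the scheme, each spelling would be totaled separately, and the answer to "Who is the biggest supplier?" could be wrong.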
Data Matching Example
Operational System of Records:
– Mark Carver, SAS, SAS Campus Drive, Cary, N.C.
– Mark W. Craver
– Mark Craver, Systems Engineer, SAS
Data Warehouse:
– 01 Mark Carver, SAS, SAS Campus Drive, Cary, N.C.
– 02 Mark W. Craver
– 03 Mark Craver, Systems Engineer, SAS
Data Quality Process
Operational System of Records -> DQ -> Data Warehouse
Operational System of Records:
– Mark Carver, SAS, SAS Campus Drive, Cary, N.C.
– Mark W. Craver
– Mark Craver, Systems Engineer, SAS
Data Warehouse:
– Mark Craver, Systems Engineer, SAS, SAS Campus Drive, Cary, N.C.
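The matching and merging step can be sketched with Python's standard `difflib` module. This is an illustration under stated assumptions, not how any particular data quality product works: the similarity threshold of 0.8, the field names, and the survivorship rule (first non-empty value per field, most frequent name spelling) are all choices made for this example.

```python
from collections import Counter
from difflib import SequenceMatcher

def name_key(name):
    """Lowercase a name and drop single-letter initials ('Mark W. Craver' -> 'mark craver')."""
    parts = [p.strip(".").lower() for p in name.split()]
    return " ".join(p for p in parts if len(p) > 1)

def is_match(a, b, threshold=0.8):
    """Treat two names as one entity when their similarity reaches the assumed threshold."""
    return SequenceMatcher(None, name_key(a), name_key(b)).ratio() >= threshold

def cluster(records, threshold=0.8):
    """Greedily group records whose name matches the first record of a group."""
    clusters = []
    for rec in records:
        for group in clusters:
            if is_match(rec["name"], group[0]["name"], threshold):
                group.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

def merge(group):
    """Build one surviving record from a group of matched records."""
    merged = {}
    for rec in group:
        for field, value in rec.items():
            if value and field not in merged:
                merged[field] = value  # keep the first non-empty value per field
    # Assumed rule: the most frequent normalized spelling wins.
    best_key, _ = Counter(name_key(r["name"]) for r in group).most_common(1)[0]
    merged["name"] = best_key.title()
    return merged

records = [
    {"name": "Mark Carver", "company": "SAS", "address": "SAS Campus Drive, Cary, N.C."},
    {"name": "Mark W. Craver"},
    {"name": "Mark Craver", "title": "Systems Engineer", "company": "SAS"},
]
for group in cluster(records):
    print(merge(group))
```

With these three records, the sketch produces a single merged record that combines the name, title, company, and address, as in the slide above.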