Understanding Data Quality Issues: Finding Data Inaccuracies Art DeMaio Evoke Software VP Technical Sales Support
Agenda Why is Understanding Data Important? Methodology for Assessing Data –Defining –Weighting –Profiling –Revisiting –Finding –Addressing –Maintaining What is Profiling? Benefits of the Assessment
What the Experts say… “Information quality is not an esoteric notion; it directly affects the effectiveness and efficiency of business processes. Information quality also plays a major role in customer satisfaction.” - Larry P. English
What the Experts say… “Poor data quality is costly. It lowers customer satisfaction, adds expense, and makes it more difficult to run a business and pursue tactical improvements such as data warehouses and re-engineering.” - Thomas C. Redman
What’s in Your DATA… “…three-quarters (of participating companies) reported significant problems as a result of defective data, with a third failing to bill or collect receivables as a result.” - In a PricewaterhouseCoopers survey of 600 CIOs, IT directors or similar executives
What is Data Quality? Accuracy of Content Structure Completeness Timeliness Presentation
Assessing Your Data (cycle around the Source Data) 1-Define Issues 2-Weight/Impact 3-Profile Data 4-Revisit Definitions, Weights 5-Findings 6-Address 7-Maintain
Defining Issues Standard list Key requirements Content Structure Completeness Update list by project or source
Defining Issues-sample
Weight/Impact After the issues are initially identified: Some issues are more critical than others Weights are not priorities Assign a weighting factor (1-5) Weighting factors SHOULD change by project
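As an illustration of weighting factors that change by project, here is a minimal Python sketch. The issue names, the `Issue` class, the `assign_weights` helper, and the default weight of 3 are all hypothetical; the slides only specify that each issue gets a factor from 1 to 5 and that the factors should differ by project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    name: str
    category: str  # one of the slide's categories: Content, Structure, Completeness


# Hypothetical standard issue list; a real list comes from the Define Issues step.
STANDARD_ISSUES = [
    Issue("Invalid values", "Content"),
    Issue("Wrong data type", "Structure"),
    Issue("Missing values", "Completeness"),
]


def assign_weights(issues, project_weights, default=3):
    """Return {issue name: weight}, checking each weight is in the 1-5 range.

    Issues the project does not mention get the (assumed) default weight.
    """
    weights = {}
    for issue in issues:
        weight = project_weights.get(issue.name, default)
        if not 1 <= weight <= 5:
            raise ValueError(f"weight for {issue.name!r} must be 1-5, got {weight}")
        weights[issue.name] = weight
    return weights


# The same issue list, weighted differently for two hypothetical projects.
warehouse_weights = assign_weights(
    STANDARD_ISSUES, {"Missing values": 5, "Wrong data type": 2}
)
crm_weights = assign_weights(STANDARD_ISSUES, {"Invalid values": 5})
```

Keeping the issue list separate from the per-project weights reflects the point that a weight expresses impact for a given project, not a fixed priority.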
Profile Data What does Data Profiling mean?
What is Data Profiling? The use of analytical techniques on data for the purpose of developing a thorough knowledge of its content, structure and quality. A process of developing information about data instead of information from data.
What is Data Profiling? Information ABOUT Data (Data Profiling): 30% of entries in SUPPLIER_ID are blank; the range of values in UNIT_PRICE is 5.99 to …; there are 14 ORDER_HEADER rows with no ORDER_DETAIL rows. Information FROM Data (not Data Profiling): Texas auto buyers buy more Cadillacs per capita than buyers in any other state; the average mortgage amount increased last year by 6%; 10% of last year's customers did not buy anything this year.
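The three "information about data" examples can be sketched in Python as follows. This is an illustration under assumptions, not a description of any profiling product: the helper names, the `ORDER_ID` join key, and the sample rows are made up; only the column and table names SUPPLIER_ID, UNIT_PRICE, ORDER_HEADER, and ORDER_DETAIL come from the slide.

```python
def blank_rate(rows, column):
    """Fraction of rows whose value in `column` is missing or blank."""
    if not rows:
        return 0.0
    blanks = sum(
        1 for row in rows
        if row.get(column) is None or str(row.get(column)).strip() == ""
    )
    return blanks / len(rows)


def value_range(rows, column):
    """(min, max) of the non-missing values in `column`, or None if there are none."""
    values = [row[column] for row in rows if row.get(column) is not None]
    return (min(values), max(values)) if values else None


def parents_without_children(parents, children, key):
    """Parent rows whose key never appears in the child rows."""
    child_keys = {row[key] for row in children}
    return [row for row in parents if row[key] not in child_keys]


# Made-up sample data; ORDER_ID as the join key is an assumption.
order_header = [{"ORDER_ID": 1}, {"ORDER_ID": 2}, {"ORDER_ID": 3}]
order_detail = [
    {"ORDER_ID": 1, "SUPPLIER_ID": "S10", "UNIT_PRICE": 5.99},
    {"ORDER_ID": 1, "SUPPLIER_ID": "", "UNIT_PRICE": 12.50},
    {"ORDER_ID": 2, "SUPPLIER_ID": None, "UNIT_PRICE": 7.25},
    {"ORDER_ID": 2, "SUPPLIER_ID": "S11", "UNIT_PRICE": 20.00},
]

supplier_blank_rate = blank_rate(order_detail, "SUPPLIER_ID")
unit_price_range = value_range(order_detail, "UNIT_PRICE")
empty_orders = parents_without_children(order_header, order_detail, "ORDER_ID")
```

Each result describes the data itself (how complete, how spread, how consistent) rather than a business conclusion drawn from it.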
Profile Data This is a multi-step process Collect documentation Review the DATA itself Compare data to documentation Identify and detail specific issues
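The "compare data to documentation" step might look like the following Python sketch. The documented column rules, the column names, and the `compare_to_documentation` helper are hypothetical; in practice the rules would be taken from whatever documentation was collected for the source.

```python
# Hypothetical documented rules for two columns.
DOCUMENTED = {
    "CUSTOMER_ID": {"type": int, "nullable": False},
    "STATUS": {"type": str, "nullable": False, "allowed": {"A", "I"}},
}


def compare_to_documentation(rows, documented):
    """Return (row index, column, problem) for every value that breaks a documented rule."""
    findings = []
    for index, row in enumerate(rows):
        for column, spec in documented.items():
            value = row.get(column)
            if value is None:
                # Only a finding when the documentation says the column is required.
                if not spec.get("nullable", True):
                    findings.append((index, column, "missing value"))
                continue
            if not isinstance(value, spec["type"]):
                findings.append((index, column, "unexpected type"))
            elif "allowed" in spec and value not in spec["allowed"]:
                findings.append((index, column, "undocumented value"))
    return findings


# Made-up rows: the second and third disagree with the documentation.
rows = [
    {"CUSTOMER_ID": 1, "STATUS": "A"},
    {"CUSTOMER_ID": "2", "STATUS": "X"},
    {"CUSTOMER_ID": None, "STATUS": "I"},
]
findings = compare_to_documentation(rows, DOCUMENTED)
```

Recording the row and column with each finding supports the last step on the slide, identifying and detailing specific issues.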
Revisit Review the issues and weights Should there be more or fewer issues? What are they? Is the relative importance of each issue different?
Findings Your findings tell others about the data Documented reports and/or charts Results database Quality Assessment Score
Findings-Chart (chart of Weighted Issue Rate % and Weighted Assessment Score %)
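The slides name a Weighted Issue Rate % and a Weighted Assessment Score % but do not give their formulas. The Python sketch below uses one plausible definition, stated here as an assumption: the issue rate is the weight-averaged fraction of records affected by each issue, and the assessment score is 100 minus that rate. The issue names, rates, and weights are made up.

```python
def weighted_issue_rate(issue_rates, weights):
    """Weight-averaged issue rate as a percentage (assumed formula).

    issue_rates: {issue name: fraction of records affected, 0.0-1.0}
    weights:     {issue name: weighting factor, 1-5}
    """
    total_weight = sum(weights[name] for name in issue_rates)
    if total_weight == 0:
        raise ValueError("at least one issue must have a non-zero weight")
    weighted_sum = sum(weights[name] * rate for name, rate in issue_rates.items())
    return 100.0 * weighted_sum / total_weight


def weighted_assessment_score(issue_rates, weights):
    """Assumed definition: 100 minus the weighted issue rate."""
    return 100.0 - weighted_issue_rate(issue_rates, weights)


# Made-up profiling results and project weights.
rates = {"Missing values": 0.30, "Invalid values": 0.10, "Wrong data type": 0.00}
weights = {"Missing values": 5, "Invalid values": 3, "Wrong data type": 2}

issue_rate = weighted_issue_rate(rates, weights)    # about 18 percent
score = weighted_assessment_score(rates, weights)   # about 82 percent
```

Under this definition, changing the weights for a different project changes the score even when the profiling results stay the same.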
Address the Issues Addressing your findings Actual vs. Potential Subject Matter Expertise Cleansing Requirements
Maintain Vigilance Maintain Complete the cycle Periodic review Document score changes
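Documenting score changes across periodic reviews could be as simple as the Python sketch below. The review dates, scores, and helper names are made up for illustration; the slides do not prescribe a review interval or a storage format.

```python
from datetime import date


def score_changes(history):
    """Given (review date, score) pairs, return (date, score, change since previous review).

    The first review has no previous score, so its change is None.
    """
    changes = []
    previous = None
    for review_date, score in sorted(history):
        delta = None if previous is None else score - previous
        changes.append((review_date, score, delta))
        previous = score
    return changes


def regressions(history):
    """Reviews where the score dropped compared with the previous review."""
    return [entry for entry in score_changes(history) if entry[2] is not None and entry[2] < 0]


# Hypothetical quarterly assessment scores for one source system.
history = [
    (date(2003, 4, 15), 85.5),
    (date(2003, 1, 15), 82.0),
    (date(2003, 7, 15), 79.0),
]
changes = score_changes(history)
drops = regressions(history)
```

A drop between reviews is a prompt to restart the cycle for that source rather than wait for the next scheduled review.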
Why Do The Assessment? Quantify the quality issues Isolate true problems Proactive review –reduces the cost of resolving issues –reduces the risk of customer dissatisfaction Define the scope of issues Determine the resources required to address issues
Why Do The Assessment? Chart: Cost to Address an Issue and Project Costs, by when in the Project Timeline you find an issue
Why Should It Be Done? Pay me now or pay me later (chart axis: Time)
When Should It Be Done? Every IT data project –Warehousing –CRM –ERP –EAI –M&A Ongoing based on –Criticality of the system –Current status (score) –Need to re-purpose data
Bibliography Larry P. English, Improving Data Warehouse and Business Information Quality, John Wiley & Sons, 1999. Jack Olson, Data Profiling: The Accuracy Dimension, Morgan Kaufmann, 2002. Thomas C. Redman, Data Quality for the Information Age, Artech House, 1996. PricewaterhouseCoopers, “Global Data Management Survey”, 2001.