1
Profiling: The First Step to Understanding Your Data
Chris Bevilacqua iWay Solutions Architect Fall 2010
2
What are the Typical Measures of Data Quality?
What is Data Quality?
- Accuracy
- Completeness
- Consistency
- Uniqueness
- Timeliness
- Validity

Within enterprise business systems, insufficient data quality is notorious for compromising timely and accurate customer communication. Internal systems often contribute to the problem by implementing incomplete or conflicting business rules, different scoring and validation methods, or data requirements that cannot adapt to specific challenges such as those posed by foreign-person attributes. A reliable identification process must be adaptable, and it can be simple or complex.

Accuracy. Accuracy is the degree to which data correctly reflects the real-world object or event being described. Examples:
- The address of a customer in the customer database is the real address.
- The temperature recorded by the thermometer is the real temperature.
- The bank balance in the customer's account is the real value the customer is owed by the bank.

Completeness. Completeness is the extent to which the expected attributes of the data are provided. For example, customer data is considered complete if all customer addresses, contact details, and other information are available, and data for all customers is available. Completeness is defined as 'expected completeness': data may be unavailable but still considered complete if it meets the expectations of the user. Every data requirement has mandatory and optional aspects. For example, the customer's mailing address is mandatory and must be available; the customer's office address is optional, so it is acceptable if it is missing. Data can be complete but inaccurate:
- All customer addresses are available, but many of them are not correct.
- The health records of all patients have a 'last visit' date, but some of them contain future dates.

Consistency. Consistency means that data across the enterprise should be in sync. Examples of inconsistency:
- An agent is inactive, but his disbursement account is still active.
- A credit card is cancelled and inactive, but the card billing status shows 'due'.
Data can be accurate (it represents what happened in the real world) but still inconsistent: an airline promotion campaign closes on January 31, yet a passenger ticket is booked under the campaign on February 2. Data is also inconsistent when it is in sync within one narrow domain of the organization but not across the organization. For example, the collection management system shows a cheque as 'cleared', but the accounting system does not yet show the money credited to the bank account, because the system interfaces are synchronized only during the end-of-day batch run. Data can be complete but inconsistent: data for all packages dispatched from New York to Chicago is available, but some of the packages also show an 'under bar-coding' status.

Timeliness. 'Data delayed' is 'data denied.' The timeliness of data is extremely important. This is reflected in:
- Companies are required to publish their quarterly results within a given timeframe.
- Customer service needs up-to-date information for customers.
- A credit system checks activity on the credit card account.
Timeliness depends on user expectation: online availability of data may be required for a room-allocation system in hospitality, while overnight data is fine for a billing system. Examples of data that is not timely:
- A courier package has been delivered, but the status is updated in the system only in the night batch run, so the online status is not available.
- The financial statements of a company are published one month after year-end.
- Census data is available two years after the census is done.

Auditability. Auditability means that any transaction, report, accounting entry, bank statement, etc. can be tracked back to its originating transaction. This requires a common identifier that stays with a transaction as it undergoes transformation, aggregation, and reporting. Examples of non-auditable data:
- A car chassis number cannot be linked to the part number supplied by an ancillary.
- A surgery report cannot be linked to the doctor ID of the preliminary diagnosis or to the pathologist ID.
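As a rough illustration, several of these dimensions can be checked with a few lines of code. The following is a minimal pandas sketch; the customer table, column names, and as-of date are made up for illustration and are not part of the presentation.

```python
# Minimal sketch: illustrative checks for a few quality dimensions on a
# hypothetical customer table. Columns and values are assumptions.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [101, 102, 102, 104],
    "mailing_address": ["12 Main St", None, "9 Elm Ave", "9 Elm Ave"],
    "status": ["active", "cancelled", "active", "active"],
    "billing_status": ["due", "due", "paid", "paid"],
    "last_visit": pd.to_datetime(["2010-01-15", "2010-03-02",
                                  "2011-12-01", "2010-02-20"]),
})

# Completeness: a mandatory field (mailing address) should not be null.
completeness = customers["mailing_address"].notna().mean()

# Uniqueness: the key column should not contain duplicates.
duplicate_keys = customers["customer_id"].duplicated().sum()

# Consistency: a cancelled account should not still show an amount due.
inconsistent = customers[(customers["status"] == "cancelled") &
                         (customers["billing_status"] == "due")]

# Timeliness/validity: 'last_visit' should not lie in the future
# (the run date below is an assumed as-of date).
as_of = pd.Timestamp("2010-11-01")
future_visits = customers[customers["last_visit"] > as_of]

print(f"mailing_address completeness: {completeness:.0%}")
print(f"duplicate customer_id rows:   {duplicate_keys}")
print(f"inconsistent cancelled/due:   {len(inconsistent)}")
print(f"future last_visit dates:      {len(future_visits)}")
```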
3
The Foundation of Accurate BI
Data Quality is… Pretty crazy when you think about it… we have sold BI and integration for years, and NOW the industry thinks, "Hmmm, maybe doing all this with clean data might make a difference." Funny and sad at the same time. So far, companies have spent a significant amount of their IT budget integrating disparate applications and building data warehouses in order to get better business intelligence. However, many companies overlook the fact that, at the end of the day, it is the underlying data that matters. All of the pretty screens and reports in the world will not make a difference if the data that resides in the system is full of errors, inconsistent, and redundant. To achieve successful business intelligence, companies need to tackle the data quality problem first.
4
BI on Bad Data is a Problem!
Stated Another Way: BI on Bad Data is a Problem! I am not using the Race Car graphic, as that would be too much marketing spin for a UG.
5
The Cost of Bad Data
- Poor quality customer data costs U.S. businesses $611 billion a year in postage, printing, and staff overhead. [1]
- Poor data quality costs the typical company at least 10% of revenue; 20% is probably a better estimate. [2]
- Gartner estimates that more than 25% of critical data within large businesses is somehow inaccurate or incomplete. [3]

[1] Wayne W. Eckerson, "Data Quality and the Bottom Line," TDWI Report Series, 2002.
[2] Thomas C. Redman, "Data: An Unfolding Quality Disaster," DM Review, August 2004.
[3] Rick Whiting, "Hamstrung By Defective Data," InformationWeek, May 2006.
6
Here are a Few Approaches to Solving the Data Quality Problem….
Data-wise, what do you do when…
- You need to add a new data stream or source to the report (or HOLD file, data warehouse/mart)?
- Users complain that the data in the reports just doesn't seem right?
- The data ages / changes / moves?
- Fill in the blank…
7
We’ll get there…someday…
Whether or not you believe in climate change, the business will change…
8
We have One System…..To Rule Them All!
But when (really) will ALL your data be in the "One System"?
9
We’ll build it ourselves… what we need, when we need it.
What? It works! (but ends up being pretty messy… and did we mention change?)
10
A Holistic Approach to Data Quality
What we have learned about data governance is that it takes a cooperative effort between IT staff and their business partners to make it work. Let's walk through the process of creating roles, responsibilities, and rules.

The first step is to gain an understanding of the data itself. Business users should have the knowledge to gain that understanding. We do that through data profiling exercises designed to identify the data elements that are incorrect or inconsistent. Business people can also analyze the impact of bad data on their organization and provide suggestions (rules) as to what the data needs to look like.

These rules are then passed to IT professionals so that they can apply technology to cleanse the data based on what the business professionals suggest. IT professionals create the quality plans and the content (or rule) based cleansing to improve the integrity of the asset. They can then take the cleansed data and enhance it by applying data standardization rules, de-duplicating the data where necessary, and enriching it with any additional information before it goes to the sourcing (or targeted) system; a sketch of these steps follows below.

With business professionals constantly monitoring and reporting the results of the various applications of rules and patterns, the roles, responsibilities, and rules are fully implemented. What is critical is that any data governance initiative will abjectly fail if this cooperation between business and IT does not exist. It is the responsibility of everyone in the organization to make sure the information assets are at their highest integrity.
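As a rough illustration of the cleanse, standardize, and de-duplicate steps described above, here is a minimal pandas sketch. The business rules, column names, and match key are hypothetical; iWay's actual quality plans and rule syntax are not shown here.

```python
# Minimal sketch of rule-based cleansing, standardization, and de-duplication.
# The rules and columns are hypothetical examples, not iWay's rule language.
import pandas as pd

raw = pd.DataFrame({
    "name":  ["ACME Corp.", "Acme Corporation", "Globex  Inc"],
    "state": ["ny", "NY", "Illinois"],
    "phone": ["(212) 555-0100", "212-555-0100", "312 555 0199"],
})

# Standardization rules supplied by the business and implemented by IT.
STATE_MAP = {"new york": "NY", "ny": "NY", "illinois": "IL"}

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Normalize state codes against the business-supplied mapping.
    out["state"] = (out["state"].str.strip().str.lower()
                                .map(STATE_MAP).fillna(out["state"]))
    # Keep only digits in phone numbers.
    out["phone"] = out["phone"].str.replace(r"\D", "", regex=True)
    # Build a match key: lowercase, strip punctuation and legal suffixes.
    out["name_key"] = (out["name"].str.lower()
                       .str.replace(r"[^a-z0-9 ]", "", regex=True)
                       .str.replace(r"\b(corp|corporation|inc)\b", "", regex=True)
                       .str.strip())
    return out

cleansed = standardize(raw)

# De-duplicate on the standardized match key plus phone number.
deduped = cleansed.drop_duplicates(subset=["name_key", "phone"])
print(deduped)
```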
11
Data Profiling Demonstration
12
iWay Data Profiler – web-based profile sharing
13
iWay Data Profiler – uses WF to render shareable profiles
14
iWay Data Profiler – compare profiles over time
15
iWay Data Profiler
16
Enable Continuous Data Quality Improvement
Compare and trend analysis of information quality over time. Ideally, you will have your processes and people in place to maximize availability, improve the integrity, and assign accountability for your information assets. You will begin to see the important, or master, information in your organization more clearly and drive your business to those assets. But the process is a cycle, and there is always room for improvement. Monitoring the information assets over time will give you a clear picture of how your initiatives are performing and will provide you with a way to graphically depict both successes and failures in the process. With the processes already in place, correcting the failures can be done very quickly. At this point, ask if anyone has any questions.
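As a rough illustration of comparing profiles over time, the sketch below computes a per-column null rate for two snapshots and trends the difference. The snapshot data and column names are made up; a real profiler tracks many more metrics (distinct counts, patterns, value ranges, and so on).

```python
# Minimal sketch of trend comparison between two profiling runs.
# The "profile" here is just a per-column null rate; data is hypothetical.
import pandas as pd

def profile(df: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column."""
    return df.isna().mean()

january = pd.DataFrame({"email": ["a@x.com", None, None],
                        "zip":   ["10001", "60601", None]})
february = pd.DataFrame({"email": ["a@x.com", "b@x.com", None],
                         "zip":   ["10001", "60601", "94105"]})

trend = pd.DataFrame({
    "jan_null_rate": profile(january),
    "feb_null_rate": profile(february),
})
trend["change"] = trend["feb_null_rate"] - trend["jan_null_rate"]
print(trend)  # a negative change means the column's completeness improved
```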
17
Data Quality Challenge
Your Data
Data Quality Profile
No COST, no kidding!
18
Thank You