Review: Simulation (see handout)
Teradata & SQL accounts
LMS
GIGO
Introduction to Data Quality & Data Cleaning
Data and information are not static; they flow through a process of collection and usage:
Data gathering
Data delivery
Data storage
Data integration
Then what?
Data Quality Characteristics
Data quality affects several attributes associated with data:
Accuracy – Is it realistic or believable?
Integrity – Is it structured and managed?
Consistency – Is it consistently defined and maintained?
Validity – Is the data valid, based on business or industry rules and standards?
What Causes Poor Data Quality?
These factors can contribute to poor data quality:

Business rules or standards for data capture do not exist. The point of data capture is the cheapest place to deal with data quality: a problem handled there never has to be fixed "downstream" during the data warehousing or decision support processes. (A minimal sketch of enforcing capture standards in SQL follows this list.)

Standards exist but are not enforced at the point of data capture. This is extremely prevalent where divisions and departments all run their own systems and none can agree on what standards should be used when capturing data.

Inconsistent data entry occurs (incorrect spelling; use of nicknames, middle names, or aliases), across different users, different departments, or even multiple operational systems.

Common data entry mistakes happen (character transposition, misspellings, and so on).

Data is integrated from multiple systems, each with different data standards. This is an issue when pulling in historical data entered via an old legacy system, and with mergers and acquisitions it has become more and more common.

Data quality issues are misperceived as too time-consuming and expensive to fix.
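To make the point-of-capture idea concrete, here is a minimal sketch of enforcing standards with declarative database constraints. The table and column names are hypothetical, not from the exercises:

```sql
-- Hypothetical table: constraints reject bad values at the point of capture,
-- so they never become a downstream cleansing task.
CREATE TABLE customers (
    customer_id INTEGER     NOT NULL PRIMARY KEY,
    full_name   VARCHAR(80) NOT NULL,                     -- no missing names
    gender      CHAR(1)     CHECK (gender IN ('M', 'F')), -- enforce the code set
    birth_date  DATE        CHECK (birth_date > DATE '1900-01-01'),
    state_code  CHAR(2)     NOT NULL                      -- standard 2-letter codes
);
```

An INSERT that violates one of these constraints is rejected immediately, which is exactly the "cheapest place" to catch the problem.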
Primary Sources of Data Quality Problems
The original slide's chart showed the percentage of data quality problems attributed to each of these sources. Why do these problems happen?

Typos and misspellings when data is entered by employees. Note that even valid phone numbers and addresses are not necessarily correct!

Changes in source systems, which cause problems when the data is propagated downstream in the warehousing process.

Data migration and conversion of data from multiple operational sources: the data was entered under different standards, and therefore does not necessarily merge together cleanly.

Differing levels of expectations and standards among users: what is good enough for one user may not be good enough for another. This is especially common where many different groups within an organization use the same data, each with its own business issues and concerns (e.g., how do you define a customer, supplier, or product?).

Source: The Data Warehousing Institute, Data Quality and the Bottom Line, 2002
How Is Clean Data Achieved?
Clean data is the result of a combination of efforts:
making sure that data entered into the system is clean (avoiding GIGO: garbage in, garbage out)
cleaning up problems after the data is accepted
Some errors are easier to fix than others…
If we have data for Male and Female ('M', 'F'), is it a reasonable assumption that 'f' is supposed to be 'F'?
How about age: is 135 valid? Can we fix it? What if we had DOB?
What about subject codes: BI is ____; what if we find ____? Can we fix it? This is a common data entry problem called transposition. (A sketch of such checks in SQL follows.)
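As a minimal sketch of these fixes, assuming a hypothetical students table with gender, age, and date_of_birth columns (none of these names come from the exercises):

```sql
-- Case inconsistency: 'f' can safely be standardized to 'F'
UPDATE students
SET gender = UPPER(gender)
WHERE gender IN ('m', 'f');

-- An age of 135 fails a plausibility check; flag it for review
SELECT student_id, age
FROM students
WHERE age < 0 OR age > 120;

-- With a date of birth on hand, a suspect age can be recomputed
UPDATE students
SET age = EXTRACT(YEAR FROM CURRENT_DATE) - EXTRACT(YEAR FROM date_of_birth)
WHERE age < 0 OR age > 120;   -- approximate: ignores whether the birthday has passed
```

A transposed subject code is harder: without a reference list of valid codes there is no way to tell which digits were swapped, so the best that can be done automatically is to flag codes that fail a lookup.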
Analysis and Standardization Example
Who is the biggest supplier?

Anderson Construction   $ 2,333.50
Briggs, Inc             $ 8,200.10
Brigs Inc.              $12,900.79
Casper Corp.            $27,191.05
Caspar Corp             $ 6,000.00
Solomon Industries      $43,150.00
The Casper Corp         $11,500.00

The example shows that the vendor Briggs Incorporated has two incomplete expenditure summations and Casper Corporation has three (with one entry listed near the end of the report as The Casper Corp). This is a sample from 27,000 entries.
Standardization Scheme
Briggs, Inc → Briggs Inc.
Brigs Inc. → Briggs Inc.
Casper Corp. → Casper Corp.
Caspar Corp → Casper Corp.
The Casper Corp → Casper Corp.

Using dfPower Studio to identify the different permutations of each company name, you can build a scheme and use it to standardize the company name so that each vendor has one unique representation within the database.
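dfPower Studio builds such schemes interactively; the same idea can be sketched in plain SQL with a lookup table, assuming hypothetical vendors and vendor_scheme tables:

```sql
-- Hypothetical scheme table mapping each observed variant to one standard name
CREATE TABLE vendor_scheme (
    raw_name      VARCHAR(60),
    standard_name VARCHAR(60)
);

INSERT INTO vendor_scheme VALUES ('Briggs, Inc',     'Briggs Inc.');
INSERT INTO vendor_scheme VALUES ('Brigs Inc.',      'Briggs Inc.');
INSERT INTO vendor_scheme VALUES ('Casper Corp.',    'Casper Corp.');
INSERT INTO vendor_scheme VALUES ('Caspar Corp',     'Casper Corp.');
INSERT INTO vendor_scheme VALUES ('The Casper Corp', 'Casper Corp.');

-- Apply the scheme, falling back to the original name when no mapping exists
SELECT COALESCE(s.standard_name, v.vendor_name) AS vendor,
       v.amount
FROM vendors v
LEFT JOIN vendor_scheme s
       ON v.vendor_name = s.raw_name;
```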
[Chart: Supplier Spending, $ spent (10,000 to 50,000) by vendor: Casper Corp., Solomon Ind., Briggs Inc., Anderson Cons.]

The Cost by Vendor report will be more organized, useful, and accurate, because vendor expenditure data is now consolidated. A section of the final report appears in the chart.
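Once names are standardized, the consolidated totals behind such a report reduce to a simple aggregation (continuing the hypothetical tables from the previous sketch):

```sql
-- Consolidated spend per standardized vendor name
SELECT COALESCE(s.standard_name, v.vendor_name) AS vendor,
       SUM(v.amount)                            AS total_spent
FROM vendors v
LEFT JOIN vendor_scheme s
       ON v.vendor_name = s.raw_name
GROUP BY COALESCE(s.standard_name, v.vendor_name)
ORDER BY total_spent DESC;
```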
Data Matching Example: Operational Systems of Record → Data Warehouse
The operational systems hold three records for the same employee:

01  Mark Carver      SAS, SAS Campus Drive, Cary, N.C.
02  Mark W. Craver
03  M Craver         Systems Engineer, SAS

Consider the situation where you have records representing an employee stored across three different operational systems. The goal is to merge the records into one master record for the employee. The problem is that the employee's name is stored slightly differently in each of the data sources, so an attempt to merge the records based on employee name does not yield the desired results.
Data Quality Process: Operational Systems of Record → Data Warehouse
After the DQ step, the three source records merge into one master record:

01  Mark Craver   Systems Engineer   SAS, SAS Campus Drive, Cary, N.C. 27513

If, however, we add a data cleansing process that applies fuzzy logic to the records so that they can be merged, we can attain the desired results. This is the type of functionality the algorithms in the SAS Data Quality Solution make possible.
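The SAS Data Quality algorithms are far more sophisticated, but one crude illustration of fuzzy matching is a phonetic key: many SQL dialects provide a SOUNDEX function, under which 'Carver' and 'Craver' both encode to C616. Assuming hypothetical staging tables system_a and system_b with a parsed last_name column:

```sql
-- Exact joins on raw names fail ('Mark Carver' vs. 'Mark W. Craver').
-- Joining on a phonetic key clusters the likely duplicates instead:
-- SOUNDEX('Carver') = SOUNDEX('Craver') = 'C616'.
SELECT a.full_name AS name_in_a,
       b.full_name AS name_in_b
FROM system_a a
JOIN system_b b
  ON SOUNDEX(a.last_name) = SOUNDEX(b.last_name);
```

Real match codes typically weigh additional fields (address, employer, job title) before declaring two records the same entity.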
ETL
What is ETL? Extract, Transform, Load.
How important is it in BI?
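As a rough sketch of the Transform and Load steps (the Extract step would have copied raw rows into a staging table beforehand), using hypothetical staging_customers and dw_customers tables:

```sql
-- Transform and Load: standardize staged rows, then insert into the warehouse
INSERT INTO dw_customers (customer_id, customer_name, state_code)
SELECT src_id,
       TRIM(UPPER(cust_name)),                       -- standardize case/whitespace
       CASE WHEN state IN ('N.C.', 'North Carolina')
            THEN 'NC' ELSE state END                 -- map variants to one code
FROM staging_customers
WHERE src_id IS NOT NULL;                            -- basic validity screen
```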
Exercise one: data cleaning with SAS and MS Access
Run SAS 9.1 and follow the demo; answer the questions as you go.
Exercise two: OLAP
When you are finished you can try eTrainer.
Exercise two: OLAP
Answer the following on the back of the handout.
Query 1:
How many Sales Districts are there?
Which district is consistently the top performer?
Which quarter had the best revenue (use totals)?
What's the total revenue for 2004?
Which region would you be concerned about, and why?
What's the total revenue for 2003?

Query 2:
How many Sales Regions are there?
How many Sales Districts in the USA?
How many Sales Reps in NE USA?
How many Sales Reps in the USA?
Who are the best and worst Sales Reps?

Query 3:
Graph the results (as shown on the slide): what's missing? How can you fix this? What happened in Q1 2005?
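The OLAP tool generates grouped SQL behind the scenes (Query 4 below asks you to capture the real thing). As a hedged sketch against a hypothetical sales_fact table, ROLLUP, supported by Teradata among other dialects, adds the subtotal and grand-total rows these questions ask for:

```sql
-- Revenue by district and year, plus ROLLUP subtotals and a grand total
SELECT sales_district,
       sales_year,
       SUM(revenue) AS total_revenue
FROM sales_fact
GROUP BY ROLLUP (sales_district, sales_year)
ORDER BY sales_district, sales_year;
```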
[Screenshots: Query 1 results (not reproduced)]
Query 4: Create this report (Sales Reps by Revenue).
Look for the SQL that is created and run for you; copy and save it in a text file.