Advanced Database Systems F24DS2 / F29AT2
Data Quality and Data Cleaning 2
David Corne, room EM G.39, x 3410; any questions, feel free to contact me.
These slides are at:

Acknowledgements

I adapted this material from various sources, most notably:
– a presentation called `Data Quality and Data Cleaning: An Overview' by Tamraparni Dasu and Theodore Johnson, AT&T Labs;
– a paper called `Data Cleaning: Problems and Current Approaches' by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany.
My thanks to these researchers for making their materials freely available online.

What Keeps DBs Dirty?

A good DBMS will have built-in tools for:
– consistency in data types
– consistency in field values
– constraints and checks that deal with Null values, outliers and duplication
– automatic timestamps
– a powerful query language (which makes retrieval logic errors less likely)
… so why are you refused a loan, have mail delivered to the wrong address, and get charged too much for your mobile calls?
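A minimal sketch of this kind of built-in checking, using Python's bundled sqlite3 module. The student table, its columns and the 2.5 m height bound are invented for illustration, not taken from the course:

```python
# Illustrative only: a typed column, a NOT NULL constraint, and a CHECK
# constraint that rejects out-of-range values, as a good DBMS provides.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        name   TEXT NOT NULL,
        height REAL CHECK (height > 0 AND height <= 2.5)
    )
""")

conn.execute("INSERT INTO student VALUES ('Jo Smith', 1.62)")   # accepted

try:
    # A metres-vs-centimetres mix-up: the DBMS refuses it outright.
    conn.execute("INSERT INTO student VALUES ('Ed Brown', 162.0)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Only the first row survives; the bad row never enters the table, which is exactly the kind of error the DBMS can catch on its own.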

… despite all this:
– Consistency constraints are often not applied, or are applied wrongly – e.g. suppose height is not allowed to go over 2 metres in a school student DB; my postcode problem.
– The data are just too numerous, complex and ill-understood; `cleaning it' would cost too much!
– Undetectable problems: incorrect values, missing entries.
– Metadata not maintained properly.

Single Source vs Multiple Source; Schema Level vs Instance Level

One useful way to categorise problems, independent of how we did so in the last lecture, is according to whether they are the sort that can arise with just one source of data, or whether they arise directly from trying to combine data from multiple sources. Problems can also be at the schema level or the instance level.

Single Source / Schema Level examples

Scope       | Problem                          | Unclean                                                          | Notes
attribute   | Illegal values                   | DoB =                                                            | Values out of range
record      | Violated attribute dependencies  | Car-owner = No, Make = Toyota                                    | Make should clearly have a Null value here
record type | Uniqueness violations            | Name = Jo Smith, NUS no. = 3067; Name = Ed Brown, NUS no. = 3067 | NUS no.s should be unique
source      | Referential integrity violation  | Name = D Corne, Office = EM G.92                                 | Where is G.92?
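Checks for two of these schema-level problems can be sketched in a few lines of Python. The records, field names and the valid year range below are invented for illustration:

```python
# Illustrative sketch: detect out-of-range attribute values and
# uniqueness violations on a key field in a list of records.
from collections import Counter

records = [
    {"name": "Jo Smith", "nus_no": 3067, "dob_year": 1962},
    {"name": "Ed Brown", "nus_no": 3067, "dob_year": 1985},  # duplicate key
    {"name": "Ann Lee",  "nus_no": 4101, "dob_year": 2099},  # illegal value
]

# Illegal values: a DoB year outside a plausible range.
illegal = [r["name"] for r in records if not (1900 <= r["dob_year"] <= 2025)]

# Uniqueness violations: the same NUS number appearing more than once.
counts = Counter(r["nus_no"] for r in records)
dupes = [k for k, n in counts.items() if n > 1]

print(illegal)  # ['Ann Lee']
print(dupes)    # [3067]
```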

Single Source / Instance Level examples

Scope       | Problem                                                                  | Unclean                                                                                          | Notes
attribute   | Missing values, mis-spellings, abbreviations, misfields, embedded values | Top speed = 0 mph; Title = Dark Side of the Moan; FullName = J. Smith; Colour = 160mph; Phone = "Dan Jones " | Dummy entries, values unavailable at entry time, human error
record      | Violated attribute dependencies                                          | City = Edinburgh, Postcode = EX6                                                                 |
record type | Word transposition, duplicates, contradictions                           | Name = Jo Smith vs Name = Carr, Jim; Name = J. Smith vs Name = Joe Smith; Name = Jo Smith, DoB = 17/12/62 vs Name = Jo Smith, DoB = 17/11/62 |
source      | Wrong references                                                         | Name = D Corne, Office = EM G.46                                                                 | EM G.46 exists, but is not my office
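The contradictions row above (the same person recorded with two different DoBs) can be flagged by grouping on the candidate key and checking that the dependent field takes only one value per group. The sample data is invented:

```python
# Illustrative sketch: flag records where one name maps to more
# than one date of birth.
from collections import defaultdict

rows = [
    ("Jo Smith", "17/12/62"),
    ("Jo Smith", "17/11/62"),   # contradicts the entry above
    ("Jim Carr", "01/05/70"),
]

dob_by_name = defaultdict(set)
for name, dob in rows:
    dob_by_name[name].add(dob)

contradictions = {n: sorted(d) for n, d in dob_by_name.items() if len(d) > 1}
print(contradictions)  # {'Jo Smith': ['17/11/62', '17/12/62']}
```

Which of the two DoBs is correct, of course, the program cannot say: that needs a human or an external reference source.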

Multiple Source Problems / Instance and Schema Level examples

The Smiths buy books and music online from company A:

Customer ID | Name       | Street       | City           | Sex
102         | Luke Smith | 5 Chewie Rd  | Dawlish, Devon |
            | Leia Smith | Chewie St, 5 | Dawlish        | 1

They also buy books and music online from company B:

Client ID | LastName | Other names  | Phone   | Gender
23        | Smith    | Luke Michael |         | Male
35        | Smith    | Leia S.      | +44(0)  | F

When companies A and B merge, various problems arise when they merge their DBs:
– Combining customer fields and client fields: are they really the same things?
– How do we ensure that company A's customer 37 and company B's client 37 get separate entries in the new DB?
– Are Luke Smith and Luke Michael Smith the same person?
– Do Luke and Leia live at the same address?
– Etc …
A forced `fast resolution' to these problems will usually lead to errors in the new `integrated' DB.
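One hedged way to approach the "same person?" question is approximate string matching on normalised names. The sketch below uses only the standard library's difflib; real record-linkage tools combine several fields (name, address, DoB) with carefully tuned thresholds, and a high score only nominates a candidate match for review:

```python
# Illustrative sketch: similarity score between two name strings.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Ratio in [0, 1] between two lower-cased, stripped name strings."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

score = similarity("Luke Smith", "Luke Michael Smith")
print(score)   # a high score: candidate match, needs human review
assert score > 0.7
```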

A Special but Common Type of Problem: Semantic Complexity

Semantic complexity (SC) is the state of play where different users of a DB have different conceptions of what the data represent. E.g. a local police DB keeps a record of all crimes in an area, where the key is the victim's name. When someone who was a victim moves to a different area, they remove all records relating to that person. The local council use this DB to produce a report of the total amount of crime every month. Why does it give figures that are too low?

Semantic Complexity: Missing/Default Values

One source of semantic complexity is the different meanings that missing values can have. E.g. suppose we plot a histogram of the value types in the mobile phone no. field, with bars for NULL, all-zero, all-nine and apparently genuine entries.

What does NULL mean?
A. This record is of someone who does not have a mobile phone?
B. This record is of someone who has a mobile phone, but chose not to supply the number?
C. This record is of someone who has a mobile phone, but who forgot to supply the number, or it was hard to decipher and was recorded as NULL?
Maybe some are of type A, some of type B and some of type C. For some applications/analyses, we may wish to know the breakdown into types. What about the all-zero and all-nine entries? Precisely the same can be said of them. Or perhaps the protocols for recording the entries indicated NULL for type A, all-zero for type B and all-nine for type C. The above relate to a quite simple form of semantic complexity, but what if someone uses this DB to estimate the proportion of people who have never had a mobile phone?
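A first step towards that breakdown is simply classifying and counting the value types in the phone field. The classification rules and sample values below are invented for illustration:

```python
# Illustrative sketch: break a phone-number column down into the value
# types discussed above (NULL, all-zero, all-nine, apparently genuine).
from collections import Counter

def value_type(v):
    if v is None or v == "":
        return "NULL"
    digits = v.replace(" ", "")
    if set(digits) == {"0"}:
        return "all zero"
    if set(digits) == {"9"}:
        return "all nine"
    return "other"

phones = ["07700 900123", None, "0000000", "9999999", "", "07700 900456"]
breakdown = Counter(value_type(p) for p in phones)
print(breakdown)  # Counter({'NULL': 2, 'other': 2, 'all zero': 1, 'all nine': 1})
```

Note that the counting itself cannot resolve the semantic question: it tells us how many NULLs there are, not which of meanings A, B or C each one carries.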

Data Cleaning: Phases

Phases in DC:
Analysis: detecting errors and inconsistencies in the DB needs detailed analysis, involving both manual inspection and automated analysis programs. This reveals where (most of) the problems are.
Defining transformation and mapping rules: having found the problems, this phase is concerned with defining the way you are going to automate solutions to clean the data.

Data Cleaning: phases continued

Verification: in this phase we test and evaluate the transformation plans we made in stage 2; without this, we may end up making the data dirtier rather than cleaner.
Transformation: do the transformation, now that you're sure it will be done correctly.
Backflow of cleaned data: do what we can to ensure that cleaned data percolates to the various repositories that may still harbour errors.

Phases in DC: Data Analysis

Data profiling: examine the instances to see how the attributes vary, e.g. automatically generate a histogram of values for an attribute. How does the histogram help us in finding problems in this case?
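Profiling in miniature can be sketched as follows: build a value histogram for one attribute and surface rare values, which are often errors. The colour values here are invented (including the `bule' mis-spelling used later in the lecture):

```python
# Illustrative sketch: histogram of one attribute's values, plus a list
# of values seen only once, which merit a manual look.
from collections import Counter

colours = ["blue", "blue", "red", "bule", "red", "blue", "green", "blue"]
hist = Counter(colours)
print(hist.most_common())  # [('blue', 4), ('red', 2), ('bule', 1), ('green', 1)]

rare = [v for v, n in hist.items() if n == 1]
print(rare)  # ['bule', 'green']
```

Here the histogram immediately suggests that `bule' is a mis-spelling of `blue', while `green' may simply be a genuinely rare value: the profile nominates suspects, a human decides.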

What problems does this analysis alert us to?

Phases in DC: Data Mining

Data Mining is simply about more advanced forms of data analysis. We talk about that next week.

Phases in DC: Defining Data Transformation Rules

As a result of the analysis phase, you will find various problems, and these translate to a list of actions, such as:
– Remove all entries for J. Smith (duplicates of John Smith).
– Find entries with `bule' in the colour field and change these to `blue'.
– Output a list of all records where the phone number field does not match the pattern (NNNNN NNNNNN) (further steps are then required to cleanse these data).
– Find all entries where the Name field contains a potential DoB string and the DoB field is NULL, and then repair these entries.
– Etc …
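Two of the rules above can be sketched as small functions run over each record: one silently fixes, the other only reports for later human attention. The field names and sample records are invented for illustration:

```python
# Illustrative sketch of two transformation rules: a silent fix
# ('bule' -> 'blue') and a report-only rule (phone pattern mismatch).
import re

PHONE_RE = re.compile(r"^\d{5} \d{6}$")   # the pattern (NNNNN NNNNNN)

def fix_colour(rec):
    if rec.get("colour") == "bule":
        rec["colour"] = "blue"
    return rec

def report_bad_phone(rec, report):
    if not PHONE_RE.match(rec.get("phone", "")):
        report.append(rec["id"])
    return rec

records = [
    {"id": 1, "colour": "bule", "phone": "01314 495123"},
    {"id": 2, "colour": "red",  "phone": "not known"},
]

bad_phones = []
for rec in records:
    rec = fix_colour(rec)
    rec = report_bad_phone(rec, bad_phones)

print(records[0]["colour"])  # 'blue'
print(bad_phones)            # [2]
```

The design choice matters: a mis-spelling with an unambiguous correction can be fixed automatically, but an unparseable phone number is reported rather than guessed at.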

Phases in DC: Verification

This speaks for itself! Transformation is the main step that actually changes the data itself, so you need to be sure you will do it correctly. Test and examine the transformation plans very carefully: it is easy to mess the data up even more if you have a faulty transformation plan.
– I have a very thick C++ book where it says `strict' in all the places where it should say `struct'.
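Verification in miniature might look like this: run the planned transformation on a copy of the data, assert some invariants, and only then apply it for real. The transformation and the checks are invented for illustration:

```python
# Illustrative sketch: trial-run a cleaning step on a copy and check
# invariants before touching the live data.
import copy

def transform(recs):
    """Planned cleaning step: strip stray whitespace from every name."""
    for r in recs:
        r["name"] = r["name"].strip()
    return recs

live = [{"name": " Jo Smith "}, {"name": "Ed Brown"}]

trial = transform(copy.deepcopy(live))     # live data untouched

# Verify: same record count, no empty names, no leading/trailing blanks.
assert len(trial) == len(live)
assert all(r["name"] and r["name"] == r["name"].strip() for r in trial)

live = transform(live)                     # only now transform for real
print(live[0]["name"])  # 'Jo Smith'
```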

Phases in DC: Transformation

Go ahead and do it. For large DBs, this task is supported by a growing variety of tools (as is data analysis, often in the same tool). E.g. DATACLEANSER is a specialist tool for identifying and eliminating duplicates; TRILLIUM focuses on cleaning name/address data. Such tools use a huge built-in library of rules for dealing with the common problems. Alternatively or additionally, you can write your own code for specialised bits of cleaning (and then verify it!).

Phases in DC: Backflow

Once the `master' source of data (perhaps a newly integrated DB) is `cleaned', there is the opportunity to fix errors that may have spread beyond the DB before it was cleaned. This will be a very different and varied process in every case, and the results of the first analysis stage should start to provide clues about what could be done here. Examples of such backflow can vary between:
– refunding one customer 12p because he was mischarged for postage owing to a faulty postcode entry;
– removing £1,000,000,000's worth of a brand of olive oil from supermarket shelves across Europe, because a DB (and hence the label) did not correctly indicate that it contains something dangerous to those with nut allergies.

What this lecture was about

– Why DBs are almost always not `clean'
– A single-source/multi-source and instance-level/schema-level classification of errors
– Semantic complexity
– Five phases in a corporate data cleaning process