David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources:


Data Mining (and machine learning)
DM Lecture 2: Data Cleaning

Overview of My Lectures
All at:
25/9: Overview of DM (and of these 8 lectures)
02/10: Data Cleaning - usually a necessary first step for large amounts of data
09/10: Basic Statistics for Data Miners - essential knowledge, and very useful
16/10: Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used much in industry
NO THURSDAY LECTURE OCTOBER 23rd
30/10: Cluster Analysis and Clustering - simple algs that tell you much about the data
NO THURSDAY LECTURE November 6th
13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data
20/11: Regression - the simplest algorithm for predicting data/class values
27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future

Acknowledgements
I adapted this material from various sources, most notably:
– A ppt presentation called 'Data Quality and Data Cleaning: An Overview' by Tamraparni Dasu and Theodore Johnson, at AT&T Labs.
– A paper called 'Data Cleaning: Problems and Current Approaches', by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany.
My thanks to these researchers for making their materials freely available online.

On Data Quality
Suppose you have a database sitting in front of you, and I ask "Is it a good quality database?" What is your answer? What does quality depend on?
Note: this is about the data themselves, not the system in use to access them.

A Conventional Definition of Data Quality
Good quality data are: Accurate, Complete, Unique, Up-to-date, and Consistent; meaning …

A Conventional Definition of Data Quality, continued …
Accurate: This refers to how the data were recorded in the first place. What might be the inaccurately recorded datum in the following table?

Barratt   John      22   Maths     BSc   Male
Burns     Robert    24   CS        BSc   Male
Carter    Laura     20   Physics   MSc   Female
Davies    Michael   12   CS        BSc   Male
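One way to spot such a datum automatically is a simple range check. The sketch below is only illustrative: it assumes the third column is the student's age, and the bounds `MIN_AGE` and `MAX_AGE` are hypothetical values chosen for this example, not something the slides specify.

```python
# Hypothetical plausible age range for university students (an assumption).
MIN_AGE, MAX_AGE = 16, 100

# Rows copied from the table above: surname, first name, age, subject, degree, sex.
students = [
    ("Barratt", "John", 22, "Maths", "BSc", "Male"),
    ("Burns", "Robert", 24, "CS", "BSc", "Male"),
    ("Carter", "Laura", 20, "Physics", "MSc", "Female"),
    ("Davies", "Michael", 12, "CS", "BSc", "Male"),
]

def suspicious_ages(rows, lo=MIN_AGE, hi=MAX_AGE):
    """Return the rows whose age (third field) falls outside [lo, hi]."""
    return [row for row in rows if not lo <= row[2] <= hi]

flagged = suspicious_ages(students)
for row in flagged:
    print("Check this record:", row)
```

A range check only flags implausible values; an age of 21 mistyped as 22 would pass unnoticed, which is why accuracy is so hard to verify after the fact.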

A Conventional Definition of Data Quality, continued …
Complete: This refers to whether or not the database really contains everything it is supposed to contain.
E.g. a patient's medical records should contain references to all medication prescribed to date for that patient.
The BBC TV Licensing DB should contain an entry for every address in the country. Does it?

A Conventional Definition of Data Quality, continued …
Unique: Every separate datum appears only once. How many 'Data Quality errors' can you find in the following table, and what types are they?

Surname   Firstname   DoB        Driving test passed
Smith     J.          17/12/85   17/12/05
Smith     Jack        17/12/85   17/12/2005
Smith     Jock        17/12/95   17/12/2005
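A minimal sketch of how candidate duplicates in a table like this might be flagged: normalise the two date formats, then group records that share a surname, a first initial and a date of birth. The grouping key is an assumption made for illustration; real duplicate detection needs more careful matching. Note that Python's `%y` follows the POSIX convention, reading 69 to 99 as 1969 to 1999 and 00 to 68 as 2000 to 2068.

```python
from collections import defaultdict
from datetime import datetime

# Rows copied from the table above.
records = [
    ("Smith", "J.", "17/12/85", "17/12/05"),
    ("Smith", "Jack", "17/12/85", "17/12/2005"),
    ("Smith", "Jock", "17/12/95", "17/12/2005"),
]

def parse_date(text):
    """Parse dd/mm/yy or dd/mm/yyyy into a date object."""
    for fmt in ("%d/%m/%y", "%d/%m/%Y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")

def candidate_duplicates(rows):
    """Group rows by (surname, first initial, DoB); return groups of size > 1."""
    groups = defaultdict(list)
    for surname, first, dob, passed in rows:
        key = (surname.lower(), first[0].lower(), parse_date(dob))
        groups[key].append((surname, first, dob, passed))
    return [group for group in groups.values() if len(group) > 1]

candidates = candidate_duplicates(records)
for group in candidates:
    print("Possible duplicates:", group)
```

This flags 'J.' and 'Jack' as a possible duplicate pair, but it cannot tell whether 'Jock' with a 1995 birth date is a third person or the same person with two typing errors. That judgement still needs a human or extra evidence.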

A Conventional Definition of Data Quality, continued …
Up-to-date: The data are kept up to date.
The post office recently changed my postcode from EX6 8RA to EX6 8NU. Why does this make it difficult for me to get a sensible quote for home insurance or car insurance?
Can you think of a DB where it doesn't matter whether or not the data are kept up to date?

A Conventional Definition of Data Quality, continued …
Consistent: The data contain no logical errors or impossibilities. They make sense in and of themselves. Why is the following mini DB inconsistent?

Date       Sales     Returns   Net income
23rd Nov   £25,609   £1,003    £24,
th Nov     £26,202   £1,601    £24,
th Nov     £28,936   £1,178    £25,758
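A consistency rule of this kind can be checked mechanically once it is written down. The sketch below assumes the intended rule is net income = sales - returns (the slide does not state it explicitly) and applies it to the last row of the table, the only one whose three figures are all fully legible here.

```python
def parse_pounds(text):
    """Convert a string such as '£28,936' to an integer number of pounds."""
    return int(text.replace("£", "").replace(",", ""))

def net_income_consistent(sales, returns, net):
    """Check the assumed rule: net income equals sales minus returns."""
    return parse_pounds(sales) - parse_pounds(returns) == parse_pounds(net)

# Last row of the table above.
row = ("£28,936", "£1,178", "£25,758")
ok = net_income_consistent(*row)
expected = parse_pounds(row[0]) - parse_pounds(row[1])
print("consistent:", ok, "| sales - returns =", expected)
```

The check reports that the row is inconsistent, but not which of the three figures is wrong; any one of them could be the mistyped value.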

Note: This definition of data quality is not much use, since there is no way to measure DQ sensibly
– Completeness: How will we know?
– Uniqueness: It is hard to tell whether two entries are similar, or duplicates!
– Up-to-date-ness: How do we know?
– Consistency: consistency errors can be very hard to find, especially in a very large DB
The database research 'world' is actively engaged in finding ways to measure data quality sensibly. In the meantime, we just use common sense to avoid dirty data at all points of the DQ continuum.

The Data Quality Continuum
It's rare that a datum is entered once into a DB and then left alone. Usually, a datum has a long and varied life, in which errors can arise at each and every stage. The continuum is:
–Data gathering
–Data delivery
–Data storage
–Data integration
–Data retrieval
–Data analysis
So, if we want to monitor DQ, we need to monitor it at each of these stages.

DQ Continuum: Example
This is an example I am familiar with, helping to illustrate the DQ continuum. The International Seismological Centre (ISC) is in Thatcham, in Berkshire. It's a charity funded by various governments. Their role is to be the repository for recording all earthquake events on the planet.

DQ Continuum: ISC example: gathering
ISC data gathering centres

See 2006 earthquake data via my dmml page

DQ Continuum: ISC example: delivery
Raw seismograph data from local collection points to DG centres. or ftp to ISC; some centres provide raw data, some provide interpreted data (e.g. maybe won't send some data if they believe it in error in the first place)

DQ Continuum: ISC example: integration
The ISC's role is actually to figure out where and when the Earth tremors were (there are hundreds per month) based on reverse engineering from seismograph readings. They integrate the raw data and attempt to do this, largely by hand and brain, and record their findings on archival CDs.

DQ Continuum: ISC example: retrieval/analysis
You can get a CD from ISC anytime, for the earth tremor activity on any particular day. I'm not sure whether you can get the raw data from them. Naturally, you can analyse the data and see if you can find inconsistencies or errors.

The ISC DQ Continuum
Where might errors occur, of: Accuracy? Completeness? Uniqueness? Timeliness? Consistency?
What else is important in this case?

Where DQ problems occur (gathering)
– Manual data entry (how can we improve this?)
– Lack of uniform standards for format and content.
– Duplicates arising from parallel entry.
– Approximations, alternatives, entries altered in order to cope with s/w and/or h/w constraints.
– Measurement errors.

Where DQ problems occur (delivery)
– Multiple hops from source to DB: problems can happen anywhere.
– Inappropriate pre-processing (e.g. removing some 'small' seismograph readings before sending on to ISC; rounding up or down, when the destination needs more accurate data).
– Transmission problems: buffer overflows, checks (did all files arrive, and all correctly?)

Where DQ problems occur (storage)
– Poor, out of date or inappropriate metadata
– Missing timestamps
– Conversion to storage format (e.g. to Excel files, to higher/lower precision)

Where DQ problems occur (integration)
This is the business of combining datasets, e.g. from different parts of a company, from (previously) different companies following an acquisition, from different government agencies, etc.
– Different keys, different fields, different formats
– Different definitions ('customer', 'income', …)
– Sociological factors: reluctance to share!

Where DQ problems occur (retrieval/analysis)
The problem here is usually the quality of DBs that store the retrieved data, or the use of the retrieved data in general. Problems arise because:
– The source DB is not properly understood!
– Straightforward mistakes in the queries that retrieve the relevant data.
E.g. A database of genes contains entries that indicate whether or not each gene has a known or suspected link with cancer. A retrieval/analysis task leads to publishing a list of genes that are not relevant to cancer. What is the problem here?

What Keeps DBs Dirty
A good DBMS will have built-in tools for:
– Consistency in data types
– Consistency in field values
– Constraints and checks that deal with Null values, Outliers, Duplication
– Automatic timestamps
– Powerful query language (makes retrieval logic errors less likely)
… so, why are you refused a loan, have mail delivered to the wrong address, and get charged too much for your mobile calls?

… all this:
– Consistency constraints are often not applied, or are applied!
  –suppose height is not allowed to go over 2 metres in a school student DB
  –My postcode problem
– The data are just too numerous, complex and ill-understood. 'Cleaning it' would cost too much!
– Undetectable problems: incorrect values, missing entries
– Metadata not maintained properly

Single Source vs Multiple Source; Schema Level vs Instance Level
One useful way to categorize problems, independent of how we did so in the last lecture, is according to whether the problems are the sort we can get if we have just one source of data, or whether the problem arises directly from trying to combine data from multiple sources.
Problems can also be schema level or instance level.

Single Source / Schema level examples

Scope | Problem | Unclean | Notes
attribute | Illegal values | DoB= | Values out of range
record | Violated attribute dependencies | Car-owner = No, make = Toyota | Make should clearly have a Null value here.
Record type | Uniqueness violations | Name= Jo Smith, NUS no. = 3067; Name= Ed Brown, NUS no. = 2124 | NUS no.s should be unique
Source | Referential integrity violation | Name= D Corne, Office = EM G.92 | Where is G.92?
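Schema-level problems like these are the kind a DBMS can reject at entry time if the constraints are declared. The following is a rough sketch using Python's built-in `sqlite3` module; the table and column names are hypothetical, the shared NUS number is an illustrative value chosen to trigger the uniqueness check, and the office table assumes that room EM G.46 exists while EM G.92 does not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE office (room TEXT PRIMARY KEY);
INSERT INTO office VALUES ('EM G.46');
CREATE TABLE student (
    name      TEXT NOT NULL,
    nus_no    INTEGER UNIQUE,                         -- uniqueness constraint
    car_owner INTEGER NOT NULL CHECK (car_owner IN (0, 1)),
    make      TEXT,
    CHECK (car_owner = 1 OR make IS NULL)             -- attribute dependency
);
CREATE TABLE staff (
    name   TEXT NOT NULL,
    office TEXT REFERENCES office(room)               -- referential integrity
);
""")

def rejected(sql, params=()):
    """Return True if the DBMS refuses the statement on integrity grounds."""
    try:
        conn.execute(sql, params)
        return False
    except sqlite3.IntegrityError:
        return True

add_student = "INSERT INTO student VALUES (?, ?, ?, ?)"
add_staff = "INSERT INTO staff VALUES (?, ?)"

results = {}
results["dependency"] = rejected(add_student, ("Jo Smith", 3067, 0, "Toyota"))
results["valid row"] = rejected(add_student, ("Jo Smith", 3067, 0, None))
results["uniqueness"] = rejected(add_student, ("Ed Brown", 3067, 1, "Toyota"))
results["referential"] = rejected(add_staff, ("D Corne", "EM G.92"))
print(results)
```

Constraints of this sort catch only violations of rules someone thought to declare, and an over-strict rule can itself force bad data into the DB.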

Single Source / Instance level examples

Scope | Problem | Unclean | Notes
attribute | Missing values, mis-spellings, abbreviations, misfields, embedded vals | Top speed = 0 mph; Title = Dark Side of the Moan; FullName = J. Smith; Colour = 160mph; Phone = "Dan Jones " | Dummy entries - values unavailable at entry time, human error
record | Violated attribute dependencies | City = Edinburgh, Postcode = EX6 |
Record type | Word transposition, duplicates, contradictions | Name= Jo Smith, Name = Carr, Jim; Name= J. Smith, Name = Joe Smith; Name = Jo Smith, DoB = 17/12/62, Name = Jo Smith, DoB = 17/11/62 |
Source | Wrong references | Name= D Corne, Office = EM G.46 | EM G.46 exists, but is not my office.

Multiple Source Problems / Instance and Schema level examples
The Smiths buy books and music online from company A:

Customer ID | Name | Street | City | Sex
102 | Luke Smith | 5 Chewie Rd | Dawlish, Devon |
 | Leia Smith | Chewie St, 5 | Dawlish | 1

They also buy books and music online from company B:

Client ID | LastName | Other names | Phone | Gender
23 | Smith | Luke Michael | | Male
35 | Smith | Leia S. | +44(0) | F

When Companies A and B merge, various problems arise when they merge their DBs
– Combining customer fields and client fields: are they really the same things?
– How to ensure that Company A's customer 37 and Company B's client 37 get separate entries in the new DB.
– Are Luke Smith and Luke Michael Smith the same person?
– Do Luke and Leia live at the same address? Etc …
A forced 'fast resolution' to these problems will usually lead to errors in the new 'integrated' DB.
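The key-collision point can be illustrated with a small sketch. One common approach, assumed here rather than taken from the slides, is to qualify every local ID with its source so that A's customer 37 and B's client 37 cannot overwrite each other. The function name and the records for ID 37 are hypothetical placeholders.

```python
def merge_sources(**sources):
    """Merge per-source tables into one, keyed by (source name, local ID)."""
    merged = {}
    for source_name, table in sources.items():
        for local_id, record in table.items():
            merged[(source_name, local_id)] = record
    return merged

# IDs 102 and 23 come from the tables above; the records for 37 are placeholders.
customers_a = {37: {"name": "hypothetical customer"}, 102: {"name": "Luke Smith"}}
clients_b = {37: {"name": "hypothetical client"}, 23: {"name": "Luke Michael Smith"}}

merged = merge_sources(A=customers_a, B=clients_b)
for key, record in sorted(merged.items()):
    print(key, record)
```

This keeps the two 37s apart, but it deliberately leaves the harder question open: Luke Smith under ("A", 102) and Luke Michael Smith under ("B", 23) remain separate entries until someone decides whether they are the same person.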

A Special but Common type of Problem: Semantic Complexity
Semantic Complexity (SC) is the state of play where different users of a DB have different conceptions of what the data represent.
E.g. A local police DB keeps a record of all crimes in an area, where the key is the victim's name. When someone who was a victim moves to a different area, they remove all records relating to that person. The local council use this DB to produce a report of the total amount of crime every month. Why does it give figures that are too low?

Semantic Complexity: Missing/Default Values
One source of semantic complexity is the different meanings that missing values can have. E.g. Suppose the histogram of value types in mobile phone no. field is:

What does NULL mean?
A. This record is of someone who does not have a mobile phone?
B. This record is of someone who has a mobile phone, but chose not to supply the number?
C. This record is of someone who has a mobile phone, but who forgot to supply the number, or it was hard to decipher and recorded as NULL?
Maybe some are of type A and some are of type B and some are of type C. For some applications/analyses, we may wish to know the breakdown into types.
What about the All zero and All nine entries? Precisely the same can be said of them. Or, perhaps the protocols for recording the entries indicated NULL for type A, for type B and for type C.
The above relate to a quite simple form of semantic complexity, but what if someone uses this DB to estimate the proportion of people who have never had a mobile phone?

Cleaning via basic data analysis
Data Profiling: examine the instances to see how the attributes vary. E.g. Automatically generate a histogram of values for that attribute. How does the histogram help us in finding problems in this case?
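As a rough illustration of data profiling, the sketch below builds a histogram of value types for a mobile phone number field, using the NULL, all-zero and all-nine categories mentioned earlier. The sample values are invented for the example, and the classification rules are one possible assumption about how such a field might be bucketed.

```python
from collections import Counter

def value_type(value):
    """Classify one phone-number entry into a coarse value type."""
    if value is None or value.strip() == "":
        return "NULL"
    digits = [ch for ch in value if ch.isdigit()]
    if digits and set(digits) == {"0"}:
        return "all zero"
    if digits and set(digits) == {"9"}:
        return "all nine"
    return "other"

def profile(values):
    """Return a histogram (Counter) of value types for a column."""
    return Counter(value_type(v) for v in values)

# Invented sample column; not data from the lecture.
sample = [None, "0000000000", "9999999999", "07700 900123",
          "", "07700 900456", "0000000000"]
hist = profile(sample)
for kind, count in hist.most_common():
    print(f"{kind:10s} {'#' * count} ({count})")
```

A spike in 'all zero' or 'all nine' entries suggests that a default or dummy value is standing in for missing data, which is exactly the kind of problem a plain count of NULLs would miss.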

What problems does this analysis alert us to?

Which brings us to "basic statistics for data miners", next week …