David Corne, and Nick Taylor, Heriot-Watt University - These slides and related resources:


1 David Corne, and Nick Taylor, Heriot-Watt University - dwcorne@gmail.com These slides and related resources: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html Data Mining (and machine learning) DM Lecture 2: Data Cleaning

2 Overview of My Lectures
All at: http://www.macs.hw.ac.uk/~dwcorne/Teaching/dmml.html
25/9: Overview of DM (and of these 8 lectures)
02/10: Data Cleaning - usually a necessary first step for large amounts of data
09/10: Basic Statistics for Data Miners - essential knowledge, and very useful
16/10: Basket Data/Association Rules (A Priori algorithm) - a classic algorithm, used much in industry
NO THURSDAY LECTURE OCTOBER 23rd
30/10: Cluster Analysis and Clustering - simple algs that tell you much about the data
NO THURSDAY LECTURE November 6th
13/11: Similarity and Correlation Measures - making sure you do clustering appropriately for the given data
20/11: Regression - the simplest algorithm for predicting data/class values
27/11: A Tour of Other Methods and their Essential Details - every important method you may learn about in future

3 Acknowledgements
I adapted this material from various sources, most notably:
- A ppt presentation called 'Data Quality and Data Cleaning: An Overview' by Tamraparni Dasu and Theodore Johnson, at AT&T Labs
- A paper called 'Data Cleaning: Problems and Current Approaches', by Erhard Rahm and Hong Hai Do, University of Leipzig, Germany.
My thanks to these researchers for making their materials freely available online.

4 On Data Quality
Suppose you have a database sitting in front of you, and I ask "Is it a good quality database?" What is your answer? What does quality depend on?
Note: this is about the data themselves, not the system in use to access them.

5 A Conventional Definition of Data Quality
Good quality data are: Accurate, Complete, Unique, Up-to-date, and Consistent; meaning …

6 A Conventional Definition of Data Quality, continued …
Accurate: This refers to how the data were recorded in the first place. What might be the inaccurately recorded datum in the following table?
Barratt | John | 22 | Maths | BSc | Male
Burns | Robert | 24 | CS | BSc | Male
Carter | Laura | 20 | Physics | MSc | Female
Davies | Michael | 12 | CS | BSc | Male
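
One way to catch an entry like the age of 12 in this table is a simple plausibility range check. The sketch below is only illustrative: the field meanings (third field is age) and the 16 to 100 bounds are assumptions, not something stated on the slide.

```python
# Each record: (surname, first name, age, subject, degree, sex).
# The field meanings are assumed from the table on this slide.
students = [
    ("Barratt", "John", 22, "Maths", "BSc", "Male"),
    ("Burns", "Robert", 24, "CS", "BSc", "Male"),
    ("Carter", "Laura", 20, "Physics", "MSc", "Female"),
    ("Davies", "Michael", 12, "CS", "BSc", "Male"),
]

# Assumed plausible age range for a university student.
MIN_AGE, MAX_AGE = 16, 100


def implausible_ages(records, lo=MIN_AGE, hi=MAX_AGE):
    """Return the records whose age falls outside [lo, hi]."""
    return [r for r in records if not lo <= r[2] <= hi]


suspects = implausible_ages(students)
for surname, first, age, *_ in suspects:
    print(f"Check {first} {surname}: age {age} is outside {MIN_AGE}-{MAX_AGE}")
```

A range check only flags a value as suspicious; it cannot say what the true value was.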

7 A Conventional Definition of Data Quality, continued …
Complete: This refers to whether or not the database really contains everything it is supposed to contain. E.g. a patient's medical records should contain references to all medication prescribed to date for that patient. The BBC TV Licensing DB should contain an entry for every address in the country. Does it?

8 A Conventional Definition of Data Quality, continued …
Unique: Every separate datum appears only once. How many 'Data Quality errors' can you find in the following table, and what types are they?
Surname | Firstname | DoB | Driving test passed
Smith | J. | 17/12/85 | 17/12/05
Smith | Jack | 17/12/85 | 17/12/2005
Smith | Jock | 17/12/95 | 17/12/2005
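
As a rough illustration of how such a table might be screened, the sketch below normalises the two date formats and then groups rows on a crude key (surname, first initial, both dates). The matching key and the minimum driving age of 17 are assumptions chosen for this example. Python's `%y` maps two-digit years 69-99 to 1969-1999 and 00-68 to 2000-2068, which happens to suit these values.

```python
from datetime import datetime

rows = [
    ("Smith", "J.", "17/12/85", "17/12/05"),
    ("Smith", "Jack", "17/12/85", "17/12/2005"),
    ("Smith", "Jock", "17/12/95", "17/12/2005"),
]


def parse_date(text):
    """Parse dd/mm/yyyy or dd/mm/yy into a date."""
    for fmt in ("%d/%m/%Y", "%d/%m/%y"):
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognised date: {text!r}")


def match_key(row):
    """Crude matching key: surname, first initial, normalised dates."""
    surname, first, dob, passed = row
    return (surname.lower(), first[0].lower(), parse_date(dob), parse_date(passed))


groups = {}
for row in rows:
    groups.setdefault(match_key(row), []).append(row)

# Groups with more than one row are candidate duplicates for a human to review.
possible_duplicates = [g for g in groups.values() if len(g) > 1]

# Consistency check: age in whole years at the test (assumed minimum: 17).
# A plain year difference is enough here because day and month coincide.
MIN_DRIVING_AGE = 17
too_young = [
    r for r in rows
    if parse_date(r[3]).year - parse_date(r[2]).year < MIN_DRIVING_AGE
]
```

The first two rows end up in the same group, while the third is flagged because the test date is only ten years after the recorded date of birth.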

9 A Conventional Definition of Data Quality, continued …
Up-to-date: The data are kept up to date. The post office recently changed my postcode from EX6 8RA to EX6 8NU. Why does this make it difficult for me to get a sensible quote for home insurance or car insurance? Can you think of a DB where it doesn't matter whether or not the data are kept up to date?

10 A Conventional Definition of Data Quality, continued …
Consistent: The data contain no logical errors or impossibilities. They make sense in and of themselves. Why is the following mini DB inconsistent?
Date | Sales | Returns | Net income
23rd Nov | £25,609 | £1,003 | £24,506
24th Nov | £26,202 | £1,601 | £24,601
25th Nov | £28,936 | £1,178 | £25,758
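
A consistency rule like this can be checked mechanically. The sketch below assumes the intended rule is net income = sales - returns (the slide does not state the rule explicitly) and reports the rows that break it.

```python
# (date, sales, returns, net income) in pounds, taken from the table above.
ledger = [
    ("23rd Nov", 25609, 1003, 24506),
    ("24th Nov", 26202, 1601, 24601),
    ("25th Nov", 28936, 1178, 25758),
]


def inconsistent_rows(rows):
    """Return (date, expected net, recorded net) where sales - returns != net."""
    problems = []
    for date, sales, returns, net in rows:
        expected = sales - returns  # assumed business rule
        if expected != net:
            problems.append((date, expected, net))
    return problems


for date, expected, recorded in inconsistent_rows(ledger):
    print(f"{date}: expected net {expected}, recorded {recorded}")
```

Note that the check can only say that the three figures disagree; it cannot say which of them was entered wrongly.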

11 Note: This definition of data quality is not much use, since there is no way to measure DQ sensibly
- Completeness: How will we know?
- Uniqueness: It is hard to tell whether two entries are similar, or duplicates!
- Up-to-date-ness: How do we know?
- Consistency: consistency errors can be very hard to find, especially in a very large DB
The database research 'world' is actively engaged in finding ways to measure data quality sensibly. In the meantime, we just use common sense to avoid dirty data at all points of the DQ continuum.

12 The Data Quality Continuum
It's rare that a datum is entered once into a DB and then left alone. Usually, a datum has a long and varied life, during which errors can arise at each and every stage. The continuum is:
- Data gathering
- Data delivery
- Data storage
- Data integration
- Data retrieval
- Data analysis
So, if we want to monitor DQ, we need to monitor it at each of these stages.

13 DQ Continuum: Example
This is an example I am familiar with, helping to illustrate the DQ continuum. The International Seismological Centre (ISC) is in Thatcham, in Berkshire. It's a charity funded by various governments. Their role is to be the repository for recording all earthquake events on the planet.

14 DQ Continuum: ISC example: gathering
(Figure: ISC data gathering centres)

15 See 2006 earthquake data via my dmml page

16 DQ Continuum: ISC example: delivery
Raw seismograph data go from local collection points to data gathering (DG) centres, and then by email or ftp to ISC. Some centres provide raw data, some provide interpreted data (e.g. they may not send some data if they believe it to be in error in the first place).

17 DQ Continuum: ISC example: integration
The ISC's role is actually to figure out where and when the Earth tremors were (there are hundreds per month), based on reverse engineering from seismograph readings. They integrate the raw data and attempt to do this, largely by hand and brain, and record their findings on archival CDs.

18 DQ Continuum: ISC example: retrieval/analysis
You can get a CD from ISC anytime, for the earth tremor activity on any particular day. I'm not sure whether you can get the raw data from them. Naturally, you can analyse the data and see if you can find inconsistencies or errors.

19 The ISC DQ Continuum
Where might errors occur, of:
- Accuracy?
- Completeness?
- Uniqueness?
- Timeliness?
- Consistency?
What else is important in this case?

20 Where DQ problems occur (gathering)
- Manual data entry (how can we improve this?)
- Lack of uniform standards for format and content.
- Duplicates arising from parallel entry.
- Approximations, alternatives, entries altered in order to cope with s/w and/or h/w constraints.
- Measurement errors.

21 Where DQ problems occur (delivery)
- Multiple hops from source to DB - problems can happen anywhere
- Inappropriate pre-processing (e.g. removing some 'small' seismograph readings before sending on to ISC; rounding up or down, when the destination needs more accurate data).
- Transmission problems: buffer overflows, checks (did all files arrive, and all correctly?)
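
The question "did all files arrive, and all correctly?" is commonly answered with checksums. The following is a minimal sketch, assuming the sender supplies a manifest of SHA-256 digests alongside the files; the file names and contents are hypothetical and are not taken from the ISC workflow.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def verify_delivery(manifest, received):
    """Compare received files against a sender-supplied manifest.

    manifest: dict of file name -> expected SHA-256 hex digest
    received: dict of file name -> bytes actually received
    Returns (missing, corrupted) as sorted lists of file names.
    """
    missing = sorted(name for name in manifest if name not in received)
    corrupted = sorted(
        name
        for name, digest in manifest.items()
        if name in received and sha256_hex(received[name]) != digest
    )
    return missing, corrupted


# Hypothetical example: three files sent, one lost, one truncated in transit.
sent = {
    "station_a.dat": b"0.12 0.15 0.11",
    "station_b.dat": b"1.02 0.98 1.10",
    "station_c.dat": b"0.00 0.01 0.00",
}
manifest = {name: sha256_hex(data) for name, data in sent.items()}
received = {
    "station_a.dat": sent["station_a.dat"],
    "station_b.dat": sent["station_b.dat"][:8],  # truncated copy
}

missing, corrupted = verify_delivery(manifest, received)
print("missing:", missing)
print("corrupted:", corrupted)
```

A check like this detects loss and corruption in transit, but not inappropriate pre-processing done before the manifest was created.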

22 Where DQ problems occur (storage)
- Poor, out of date or inappropriate metadata
- Missing timestamps
- Conversion to storage format (e.g. to Excel files, to higher/lower precision)

23 Where DQ problems occur (integration)
This is the business of combining datasets - e.g. from different parts of a company, from (previously) different companies following an acquisition, from different government agencies, etc.
- Different keys, different fields, different formats
- Different definitions ('customer', 'income', …)
- Sociological factors: reluctance to share!

24 Where DQ problems occur (retrieval/analysis)
The problem here is usually the quality of DBs that store the retrieved data, or the use of the retrieved data in general. Problems arise because:
- The source DB is not properly understood!
- Straightforward mistakes in the queries that retrieve the relevant data.
E.g. A database of genes contains entries that indicate whether or not each gene has a known or suspected link with cancer. A retrieval/analysis task leads to publishing a list of genes that are not relevant to cancer. What is the problem here?

25 What Keeps DBs Dirty
A good DBMS will have built-in tools for:
- Consistency in data types
- Consistency in field values
- Constraints and checks that deal with Null values, Outliers, Duplication.
- Automatic timestamps
- Powerful query language (makes retrieval logic errors less likely)
… so, why are you refused a loan, have mail delivered to the wrong address, and get charged too much for your mobile calls?

26 … all this:
- Consistency constraints are often not applied, or are applied!
  - suppose height is not allowed to go over 2 metres in a school student DB
  - my postcode problem
- The data are just too numerous, complex and ill-understood. 'Cleaning it' would cost too much!
- Undetectable problems: incorrect values, missing entries
- Metadata not maintained properly
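
The two sides of this point, constraints that help and constraints that hurt, can be seen in a small sketch using Python's built-in sqlite3 module. The table layout, the names and the 2.0 m limit are all hypothetical and chosen to mirror the school-student example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE student (
        student_id INTEGER PRIMARY KEY,          -- guards against duplicates
        name       TEXT NOT NULL,                -- guards against NULLs
        height_m   REAL CHECK (height_m > 0 AND height_m <= 2.0),
        created_at TEXT DEFAULT CURRENT_TIMESTAMP -- automatic timestamp
    )
    """
)

# Hypothetical rows to insert.
candidates = [
    (1, "A. Typo", 17.5),       # data-entry error: 17.5 m is impossible
    (2, "B. Tall", 2.03),       # unusual but genuine height
    (3, "C. Average", 1.62),
    (3, "D. Duplicate", 1.70),  # re-uses an existing key
]

accepted, rejected = [], []
for row in candidates:
    try:
        conn.execute(
            "INSERT INTO student (student_id, name, height_m) VALUES (?, ?, ?)",
            row,
        )
        accepted.append(row[1])
    except sqlite3.IntegrityError as err:
        rejected.append((row[1], str(err)))

for name, reason in rejected:
    print(f"rejected {name}: {reason}")
```

The CHECK constraint catches the 17.5 m typo, but it also turns away the genuinely tall student, which is the kind of pressure that leads people to enter a wrong but acceptable value instead.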

27 Single Source vs Multiple Source; Schema Level vs Instance Level
One useful way to categorize problems, independent of how we did so in the last lecture, is according to whether they are the sort we can get with just one source of data, or whether they arise directly from trying to combine data from multiple sources.
Problems can also be schema level or instance level.

28 Single Source / Schema level examples
Scope | Problem | Unclean | Notes
attribute | Illegal values | DoB=30.02.88 | Values out of range
record | Violated attribute dependencies | Car-owner = No, make = Toyota | Make should clearly have a Null value here.
Record type | Uniqueness violations | Name= Jo Smith, NUS no. = 3067; Name= Ed Brown, NUS no. = 2124 | NUS no.s should be unique
Source | Referential integrity violation | Name= D Corne, Office = EM G.92 | Where is G.92?
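
Two of these schema-level checks are easy to sketch in code. The example below assumes the date of birth is stored as dd.mm.yy and that records are simple dictionaries with hypothetical field names `car_owner` and `make`.

```python
from datetime import datetime


def valid_dob(text):
    """True if text is a real calendar date in dd.mm.yy form."""
    try:
        datetime.strptime(text, "%d.%m.%y")
        return True
    except ValueError:
        return False


def dependency_ok(record):
    """A non-owner should have no car make recorded (make must be None)."""
    return record["car_owner"] or record["make"] is None


print(valid_dob("30.02.88"))  # 30 February does not exist
print(dependency_ok({"car_owner": False, "make": "Toyota"}))
```

Both calls print False for the 'unclean' examples in the table. Uniqueness and referential integrity need the whole table (or a second table) rather than a single record, so they are not shown here.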

29 Single Source / Instance level examples
Scope | Problem | Unclean | Notes
attribute | Missing values, mis-spellings, abbreviations, misfields, embedded vals | Top speed = 0 mph; Title = Dark Side of the Moan; FullName = J. Smith; Colour = 160mph; Phone = "Dan Jones 0207 308653" | Dummy entries - values unavailable at entry time, human error
record | Violated attribute dependencies | City = Edinburgh, Postcode = EX6 |
Record type | Word transposition, duplicates, contradictions | Name = Jo Smith, Name = Carr, Jim; Name = J. Smith, Name = Joe Smith; Name = Jo Smith, DoB = 17/12/62 and Name = Jo Smith, DoB = 17/11/62 |
Source | Wrong references | Name = D Corne, Office = EM G.46 | EM G.46 exists, but is not my office.
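
Word transpositions and near-duplicate names of the kind listed above are often screened with string normalisation followed by an approximate match. This is a minimal sketch using the standard library's difflib; the normalisation rules and the 0.8 threshold are assumptions picked for illustration, not a recommended setting.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalise(name):
    """Lower-case, undo 'Surname, First' transposition, drop punctuation."""
    name = name.strip().lower()
    if "," in name:
        surname, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {surname}"
    name = re.sub(r"[^a-z ]", "", name)
    return re.sub(r"\s+", " ", name).strip()


def similar_pairs(names, threshold=0.8):
    """Return pairs of names whose normalised forms look alike."""
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, normalise(a), normalise(b)).ratio()
        if score >= threshold:
            pairs.append((a, b))
    return pairs


names = ["Jo Smith", "Carr, Jim", "J. Smith", "Joe Smith"]
pairs = similar_pairs(names)
for a, b in pairs:
    print(f"possible duplicate: {a!r} / {b!r}")
```

Pairs flagged this way are only candidates: 'Jo Smith' and 'Joe Smith' may well be different people, so a human or a further rule (e.g. comparing DoB) still has to decide.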

30 Multiple Source Problems / Instance and Schema level examples
The Smiths buy books and music online from company A:
Customer ID | Name | Street | City | Sex
102 | Luke Smith | 5 Chewie Rd | Dawlish, Devon | 0
175 | Leia Smith | Chewie St, 5 | Dawlish | 1
They also buy books and music online from company B:
Client ID | LastName | Other names | Phone | Gender
23 | Smith | Luke Michael | 378988 | Male
35 | Smith | Leia S. | +44(0)1626 378988 | F

31 When Companies A and B merge, various problems arise when they merge their DBs
- Combining customer fields and client fields - are they really the same things?
- How to ensure that Company A's customer 37 and Company B's client 37 get separate entries in the new DB.
- Are Luke Smith and Luke Michael Smith the same person?
- Do Luke and Leia live at the same address?
- Etc …
A forced 'fast resolution' to these problems will usually lead to errors in the new 'integrated' DB.
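
Two of these issues, clashing IDs and differently coded fields, can be illustrated with a small sketch. Everything here is an assumption made for the example: the dictionary layout, the composite (source, id) key, the rows with ID 37, and in particular the reading of Company A's 0/1 codes as male/female, which would have to be confirmed with Company A.

```python
company_a = [
    {"customer_id": 102, "name": "Luke Smith", "sex": 0},
    {"customer_id": 175, "name": "Leia Smith", "sex": 1},
    {"customer_id": 37, "name": "Customer X", "sex": 1},  # hypothetical row
]
company_b = [
    {"client_id": 23, "last_name": "Smith", "other_names": "Luke Michael", "gender": "Male"},
    {"client_id": 35, "last_name": "Smith", "other_names": "Leia S.", "gender": "F"},
    {"client_id": 37, "last_name": "Y", "other_names": "Client", "gender": "M"},  # hypothetical row
]

# Assumed code mappings onto a single 'M'/'F' coding.
SEX_CODES = {0: "M", 1: "F", "male": "M", "m": "M", "female": "F", "f": "F"}


def unify_sex(value):
    """Map either company's coding to 'M'/'F'; None means 'needs review'."""
    key = value.lower() if isinstance(value, str) else value
    return SEX_CODES.get(key)


def merge(a_rows, b_rows):
    """Merge both sources, keyed by (source, original id) to avoid ID clashes."""
    merged = {}
    for r in a_rows:
        merged[("A", r["customer_id"])] = {
            "name": r["name"],
            "sex": unify_sex(r["sex"]),
        }
    for r in b_rows:
        merged[("B", r["client_id"])] = {
            "name": f"{r['other_names']} {r['last_name']}",
            "sex": unify_sex(r["gender"]),
        }
    return merged


merged = merge(company_a, company_b)
```

Keeping the source in the key means customer 37 and client 37 stay separate. It deliberately does not decide whether Luke Smith and Luke Michael Smith are the same person; that is a later, harder step.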

32 A Special but Common type of Problem: Semantic Complexity
Semantic Complexity (SC) is the state of play where different users of a DB have different conceptions of what the data represent.
E.g. A local police DB keeps a record of all crimes in an area, where the key is the victim's name. When someone who was a victim moves to a different area, the police remove all records relating to that person. The local council use this DB to produce a report of the total amount of crime every month. Why does it give figures that are too low?

33 Semantic Complexity: Missing/Default Values
One source of semantic complexity is the different meanings that missing values can have. E.g. Suppose the histogram of value types in mobile phone no. field is:

34 What does NULL mean?
A. This record is of someone who does not have a mobile phone?
B. This record is of someone who has a mobile phone, but chose not to supply the number?
C. This record is of someone who has a mobile phone, but who forgot to supply the number, or it was hard to decipher and recorded as NULL?
Maybe some are of type A and some are of type B and some are of type C. For some applications/analyses, we may wish to know the breakdown into types.
What about the All zero and All nine entries? Precisely the same can be said of them. Or, perhaps the protocols for recording the entries indicated NULL for type A, 0000000 for type B and 9999999 for type C.
The above relate to a quite simple form of semantic complexity - but what if someone uses this DB to estimate the proportion of people who have never had a mobile phone?
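
Before deciding what the placeholder values mean, it helps to count how often each kind occurs. The sketch below is illustrative only: the category names and the sample values are hypothetical (the phone numbers use a range reserved for fiction), and nothing in the code can tell types A, B and C apart.

```python
from collections import Counter


def classify_phone(value):
    """Bucket a phone field into NULL, sentinel patterns, or a plausible number."""
    if value is None:
        return "NULL"
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    if digits and set(digits) == {"0"}:
        return "all zeros"
    if digits and set(digits) == {"9"}:
        return "all nines"
    return "looks like a number" if digits else "other"


# Hypothetical column of mobile phone entries.
phones = [None, "0000000", "9999999", "07700900123", None, "0000000", "07700900456"]
summary = Counter(classify_phone(p) for p in phones)
for kind, count in summary.most_common():
    print(f"{kind:>20}: {count}")
```

The counts say how much of the column is placeholder data; what each placeholder means still has to come from the recording protocol or the people who entered the data.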

35 Cleaning via basic data analysis
Data Profiling: examine the instances to see how the attributes vary. E.g. Automatically generate a histogram of values for that attribute. How does the histogram help us in finding problems in this case?
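
A value histogram for one attribute takes only a few lines to produce. The sketch below uses a hypothetical 'gender' column and an assumed 5% cut-off for calling a value rare; it is meant to show the idea of profiling, not any particular tool.

```python
from collections import Counter


def profile(values, rare_fraction=0.05):
    """Return (histogram, rare values) for one attribute."""
    counts = Counter(values)
    total = sum(counts.values())
    rare = sorted(str(v) for v, n in counts.items() if n / total < rare_fraction)
    return counts, rare


def text_histogram(counts, width=40):
    """Render the histogram as text, most frequent value first."""
    top = max(counts.values())
    lines = []
    for value, n in counts.most_common():
        bar = "#" * max(1, round(width * n / top))
        lines.append(f"{str(value):>8} {n:5d} {bar}")
    return "\n".join(lines)


# Hypothetical column with mixed codings and a few odd entries.
column = ["M"] * 480 + ["F"] * 495 + ["Male"] * 12 + ["f"] * 9 + [None] * 3 + ["0"] * 1
counts, rare = profile(column)
print(text_histogram(counts))
print("rare values worth a look:", rare)
```

Rare values in such a profile are often alternative codings, typos or placeholders, which is why the histogram is a cheap first step before any deeper cleaning.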

36 What problems does this analysis alert us to?

37 Which brings us to "basic statistics for data miners", next week …

