Published by Doris O’Connor. Modified over 9 years ago.
1
Why clean data? Data quality is important.
2
Cleaning data
Makes the data fit for purpose/plausible
Reduces the negative impact of errors
Improves the data quality
Improves the quality of the outputs
3
What to look for
Non-response: item non-response, e.g. missing data
Erroneous data: can negatively affect the data and the quality of the results
Suspicious data
4
Data gathering problems
Manual entry (input error): for example, switching numbers around, missing numbers, inputting two responses into one field
Duplicates: for example, the submit button is hit more than once
Measurement errors: for example, using inches instead of cm, reading scales incorrectly
Non-uniform standards for content and format: for example, people using different units, some giving index finger lengths in cm and some in mm
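Two of the problems above, duplicates and mixed units, can be handled mechanically. The following is a minimal sketch, not a definitive method: the record layout, the respondent ids, and the sample values are all hypothetical, and it assumes a duplicate means an exact repeat of a whole record.

```python
# Hypothetical survey records: (respondent_id, finger_length, unit).
records = [
    ("r1", 7.2, "cm"),
    ("r2", 68.0, "mm"),
    ("r2", 68.0, "mm"),  # duplicate: submit button hit more than once
    ("r3", 2.8, "in"),   # inches instead of cm
]

# Conversion factors from each accepted unit to centimetres.
TO_CM = {"cm": 1.0, "mm": 0.1, "in": 2.54}


def clean(records):
    """Drop exact duplicate records (keeping the first) and convert lengths to cm."""
    seen = set()
    cleaned = []
    for rec in records:
        if rec in seen:
            continue  # skip a record we have already kept
        seen.add(rec)
        respondent_id, length, unit = rec
        cleaned.append((respondent_id, round(length * TO_CM[unit], 2)))
    return cleaned


cleaned = clean(records)
```

An unknown unit raises a `KeyError` here, which is deliberate in this sketch: a record with an unrecognised unit needs to be looked at, not silently converted.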
5
Process of cleaning data
Detect
Resolve
Treat
6
Detect
Identify erroneous or suspicious data
Graph or sort the data and look at outliers
Example: a student throws ten dice and records the number of sixes. They recorded: (2, 0, 3, 12, 2, 0, 1, 1, 3, 1, 4). What is wrong? What do you think is the cause of it?
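One way to detect a problem in the dice example is a range check: with ten dice, the number of sixes can only be between 0 and 10. The sketch below flags any value outside that range. The function name `impossible_values` is an assumption made for illustration, and the check only flags a value; deciding on the cause is left to the resolve step.

```python
# Recorded number of sixes, as given in the example.
rolls = [2, 0, 3, 12, 2, 0, 1, 1, 3, 1, 4]
NUM_DICE = 10  # ten dice are thrown each time


def impossible_values(values, low, high):
    """Return (position, value) pairs for values outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


flagged = impossible_values(rolls, 0, NUM_DICE)
print(flagged)  # [(3, 12)]
```

The value 12 at position 3 is flagged because ten dice cannot show twelve sixes.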
7
Detect
Consider the data points 3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92
“92” is suspicious - an outlier
Outliers:
are potentially legitimate (correct)
can be data or model glitches
can be a data miner’s dream, for example, a highly profitable customer
Outlier - “departure from the expected”
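A common rule of thumb for "departure from the expected" is to flag values more than 1.5 interquartile ranges outside the quartiles. The slides do not name a rule, so the choice of the IQR rule and the multiplier `k=1.5` are assumptions in this sketch; it uses `statistics.quantiles` from the Python standard library with its default quartile method.

```python
import statistics

# Data points from the example.
data = [3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92]


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


outliers = iqr_outliers(data)
print(outliers)  # [92]
```

Here Q1 is 4 and Q3 is 8, so anything above 14 is flagged. A flagged value is only suspicious: as noted above, it may still be legitimate.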
8
Resolve
Deciding if erroneous or suspicious data should be corrected or amended
Deciding on the action to “treat” the data
9
Treat
Leave as is
Change
Impute: determine a replacement value. The replacement value is obtained from a similar record among the “clean” respondents in the data at hand
Remove
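Imputation from a similar clean record can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: the records, the field names `height_cm` and `finger_cm`, and the notion of "similar" (closest value on one matching field) are all hypothetical.

```python
# Hypothetical records; r4 has a missing finger length (item non-response).
records = [
    {"id": "r1", "height_cm": 150, "finger_cm": 6.5},
    {"id": "r2", "height_cm": 162, "finger_cm": 7.1},
    {"id": "r3", "height_cm": 175, "finger_cm": 7.8},
    {"id": "r4", "height_cm": 160, "finger_cm": None},
]


def impute_from_donor(records, target, match_on):
    """Fill missing `target` values from the clean record closest on `match_on`."""
    donors = [r for r in records if r[target] is not None]
    result = []
    for r in records:
        if r[target] is None:
            donor = min(donors, key=lambda d: abs(d[match_on] - r[match_on]))
            # Copy the record so the original data is left untouched,
            # and mark the value as imputed so it can be traced later.
            r = {**r, target: donor[target], "imputed": True}
        result.append(r)
    return result


imputed = impute_from_donor(records, "finger_cm", "height_cm")
```

The closest clean record to r4 by height is r2, so r4 receives r2's finger length of 7.1. Keeping an `imputed` flag lets later analysis distinguish replaced values from observed ones.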