Presentation transcript: Why clean data?

1 Why clean data? Data quality is important.

2 Cleaning data
- Makes the data fit for purpose/plausible
- Reduces the negative impact of errors
- Improves the data quality
- Improves the quality of the outputs

3 What to look for
- Non-response: item non-response, e.g. missing data
- Erroneous data: can negatively affect the data and the resulting quality
- Suspicious data

4 Data gathering problems
- Manual entry (input error): for example, switching numbers around, missing numbers, inputting two responses into one field
- Duplicates: for example, submit button hit more than once
- Measurement errors: for example, using inches instead of cm, reading scales incorrectly
- Non-uniform standards for content and format: for example, people using different units, some giving index finger lengths in cm, some in mm
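Two of the problems above, duplicate submissions and mixed units, can be handled mechanically. The following is a minimal sketch, not a method from the slides: the rows, the field layout (respondent id, finger length, unit) and the helper names `drop_duplicates` and `to_cm` are all hypothetical.

```python
# Hypothetical survey rows: (respondent_id, index_finger_length, unit).
rows = [
    (1, 7.2, "cm"),
    (2, 68.0, "mm"),
    (2, 68.0, "mm"),  # same form submitted twice
    (3, 7.5, "cm"),
]

# Conversion factors to centimetres (assumed set of units).
TO_CM = {"cm": 1.0, "mm": 0.1}


def drop_duplicates(records):
    """Keep the first copy of each identical record, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


def to_cm(records):
    """Express every length in centimetres, rounded to 2 decimal places."""
    return [
        (rid, round(value * TO_CM[unit], 2), "cm")
        for rid, value, unit in records
    ]


cleaned = to_cm(drop_duplicates(rows))
print(cleaned)
```

Exact-match de-duplication only catches verbatim repeats; near-duplicates and manual entry errors still need a human look.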

5 Process of cleaning data
1. Detect
2. Resolve
3. Treat

6 Detect
Identify erroneous or suspicious data:
- Graph or sort the data and look at outliers.
I have a student who throws ten dice and records the number of sixes. They recorded: (2, 0, 3, 12, 2, 0, 1, 1, 3, 1, 4).
- What is wrong?
- What do you think is the cause of it?
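One way to detect this kind of problem is a plausibility (range) check: with ten dice, the number of sixes must lie between 0 and 10. The sketch below assumes that rule; the function name `find_impossible` is hypothetical.

```python
# Counts of sixes recorded by the student, ten dice per throw.
dice_per_throw = 10
recorded_sixes = [2, 0, 3, 12, 2, 0, 1, 1, 3, 1, 4]


def find_impossible(counts, max_count):
    """Return (position, value) pairs that fall outside 0..max_count."""
    return [
        (i, c) for i, c in enumerate(counts) if not 0 <= c <= max_count
    ]


flagged = find_impossible(recorded_sixes, dice_per_throw)
print(flagged)  # [(3, 12)]
```

The check only says that a value cannot be right; deciding on the likely cause is still a judgement call.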

7 Detect
Consider the data points 3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92.
- "92" is suspicious: an outlier.
Outliers:
- are potentially legitimate (correct)
- can be data or model glitches
- can be a data miner's dream, for example, a highly profitable customer
Outlier: a "departure from the expected"
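The slides do not prescribe a detection rule. One common convention, assumed here for illustration, is to flag values more than 1.5 interquartile ranges outside the quartiles. A minimal sketch using only the standard library (the name `iqr_outliers` is hypothetical):

```python
import statistics

data = [3, 4, 7, 4, 8, 3, 9, 5, 7, 6, 92]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


suspicious = iqr_outliers(data)
print(suspicious)  # [92]
```

A flagged value is only suspicious, not necessarily wrong; as noted above, it may be legitimate.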

8 Resolve
- Decide whether erroneous or suspicious data should be corrected or amended.
- Decide on the action to "treat" the data.

9 Treat
- Leave as is
- Change
  - Impute: determine a replacement value. The replacement value is obtained from a similar record among the "clean" respondents in the data at hand.
- Remove
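Imputing from a similar clean record can be sketched as follows. This is an illustration under assumptions, not the slides' procedure: the records, the field names (`age`, `height_cm`) and the notion of "similar" (closest age) are all hypothetical.

```python
# Hypothetical respondents; record 3 has a missing height.
records = [
    {"id": 1, "age": 12, "height_cm": 150.0},
    {"id": 2, "age": 13, "height_cm": 156.0},
    {"id": 3, "age": 12, "height_cm": None},
    {"id": 4, "age": 16, "height_cm": 171.0},
]


def impute_from_donor(rows, target, match_on):
    """Fill missing `target` values from the clean record whose
    `match_on` value is closest (assumed similarity measure)."""
    donors = [r for r in rows if r[target] is not None]
    result = []
    for r in rows:
        if r[target] is None and donors:
            donor = min(
                donors, key=lambda d: abs(d[match_on] - r[match_on])
            )
            # Copy the record and mark it so imputed values stay traceable.
            r = {**r, target: donor[target], "imputed": True}
        result.append(r)
    return result


treated = impute_from_donor(records, "height_cm", "age")
print(treated[2])
```

Marking imputed values keeps the change visible, so a later analysis can still choose to leave the record as is or remove it.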

