Data entry and preparation for analysis (data cleaning)


1 Data entry and preparation for analysis (data cleaning)
Stats Club 2: Dec 2016 Marnie Brennan (and Natalie Robinson)

2 References for today
Petrie and Sabin. Medical Statistics at a Glance: Chapters 2 and 3
Van den Broeck et al. (2005) Data cleaning: Detecting, diagnosing and editing data abnormalities. PLoS Med 2(10): e267
Thrusfield, M. (2007) Veterinary Epidemiology, Third Edition: Chapter 9
Dohoo et al. (2010) Veterinary Epidemiologic Research: Chapter 30

3 Terminology
Data coding
  Thinking about how you are going to represent variables in your dataset
  E.g. Sex (M/F) coded as 1 (M) and 2 (F)
Data entry/input
  Manually entering data into a database
  Checks to make sure it is correct
Data cleaning/verification/processing
  Checking that your data are ‘right’ and represent the information correctly

4 Data coding
How are you going to ‘represent’ your data?
Need to work this out first, before anything else happens
Write it down
  Preferably in a lab notebook, a research diary, a version of your questionnaire etc.
  Use a different coloured pen
Are you going to use numbers or letters?
  E.g. if you are coding neuter status, are you going to use 1, 2, 3 and 4 OR MN, ME, FN, FE?
  Some statistical packages don’t like letters
If you are using data collected by someone else, how has it been ‘represented’?
  Do you know the codes and what all the columns and rows mean?
Never assume anything!
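One way to "write it down" is to keep the codebook next to the data as well as in the notebook. The sketch below is only an illustration: the slide gives the codes 1 to 4 and MN, ME, FN, FE but does not say which number pairs with which letters, so the pairing here is an assumption, and `decode` is a hypothetical helper name.

```python
# Hypothetical codebook for neuter status. The pairing of numbers to
# letter codes is assumed for illustration; the slide does not specify it.
NEUTER_CODEBOOK = {
    1: "MN",  # male neutered
    2: "ME",  # male entire
    3: "FN",  # female neutered
    4: "FE",  # female entire
}


def decode(values, codebook):
    """Translate numeric codes to labels, failing loudly on unknown codes."""
    unknown = sorted({v for v in values if v not in codebook})
    if unknown:
        raise ValueError(f"codes not in codebook: {unknown}")
    return [codebook[v] for v in values]


labels = decode([1, 3, 3, 4], NEUTER_CODEBOOK)
print(labels)
```

Raising an error on an unknown code, rather than guessing, follows the "never assume anything" advice on the slide.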

5 Data coding

6 Data coding
How might you code this?
From all the journal sources listed below, please indicate those that you read (Please mark all that apply)
  Cattle Practice
  Equine Veterinary Journal
  In Practice
  The Veterinary Record
  Pig Journal
  British Poultry Science Journal

7 JourCP JourEVJ JourIP JourVR JourPJ JourBPJ
1
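The column headings on this slide suggest one column per journal for the "mark all that apply" question. A minimal sketch of that coding follows; the column names come from the slide, while the 1 = ticked / 0 = not ticked convention and the helper name `code_response` are assumptions.

```python
# Column names as shown on the slide; 1 = ticked, 0 = not ticked is assumed.
JOURNAL_COLUMNS = {
    "Cattle Practice": "JourCP",
    "Equine Veterinary Journal": "JourEVJ",
    "In Practice": "JourIP",
    "The Veterinary Record": "JourVR",
    "Pig Journal": "JourPJ",
    "British Poultry Science Journal": "JourBPJ",
}


def code_response(ticked):
    """Return one data row with a 0/1 value for every journal column."""
    unknown = set(ticked) - set(JOURNAL_COLUMNS)
    if unknown:
        raise ValueError(f"not an option on the questionnaire: {sorted(unknown)}")
    return {col: int(name in ticked) for name, col in JOURNAL_COLUMNS.items()}


# One respondent who ticked two of the six journals (made-up example).
row = code_response({"In Practice", "The Veterinary Record"})
print(row)
```

Each respondent becomes one row, and each journal can then be analysed as its own yes/no variable.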

8 Guide to data collected by someone else

9 Data entry
As per last month, think about how you are going to analyse your data before you input it into a spreadsheet or statistical package
What are you going to do about missing values?
  Usually written as 999 or variations on that theme (use something that will never come up in your actual dataset)
  Other options?
  Try not to use 0 if you can: 0 can be an answer too (i.e. they didn’t tick that variable)
  Some statistical packages don’t like blanks
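A missing-value code such as 999 has to be turned back into "missing" before any calculation, otherwise it is treated as a real measurement. The sketch below shows the effect on a mean; the ages are made up and `to_missing` is a hypothetical helper.

```python
from statistics import mean

MISSING = 999  # missing-value code from the slide; must never be a real value


def to_missing(values, sentinel=MISSING):
    """Replace the missing-value code with None so it cannot be analysed."""
    return [None if v == sentinel else v for v in values]


ages = [3, 12, 999, 7, 999]  # made-up ages in years, two of them missing
cleaned = to_missing(ages)
observed = [v for v in cleaned if v is not None]

print("mean with 999 left in:", mean(ages))
print("mean of observed ages:", mean(observed))
print("number missing:", cleaned.count(None))
```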

10 Data entry
Numerical variables: enter them with the same precision as they are measured, and use a consistent unit of measurement
  If you are measuring in kilograms
    E.g. record 5.3 kg, not just 5 kg
    Stick with kilograms, and convert pounds to kilograms
If you have to use more than one table, make sure you have the same unique identifying number in each table, or make sure they are linked
Large quantities of multilevel data: make sure you use hierarchical database software, or separate files for data at each level (e.g. herd file, cow file) and merge later
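The two points above (one unit of measurement, and a shared identifier linking tables) can be sketched together. Everything below is illustrative: the herd and cow records, the field names and the helper names are made up, and the only fixed fact is the pound-to-kilogram factor.

```python
LB_TO_KG = 0.45359237  # kilograms per pound (international avoirdupois pound)


def to_kg(weight, unit):
    """Convert a weight to kilograms so the whole column shares one unit."""
    if unit == "kg":
        return weight
    if unit == "lb":
        return weight * LB_TO_KG
    raise ValueError(f"unknown unit: {unit!r}")


# Hypothetical herd-level and cow-level files, linked by herd_id.
herds = {"H1": {"region": "North"}, "H2": {"region": "South"}}
cows = [
    {"cow_id": "C1", "herd_id": "H1", "weight": 650.0, "unit": "kg"},
    {"cow_id": "C2", "herd_id": "H2", "weight": 1400.0, "unit": "lb"},
    {"cow_id": "C3", "herd_id": "H9", "weight": 600.0, "unit": "kg"},
]


def merge_cows_with_herds(cows, herds):
    """Attach herd data to each cow; report cows whose herd_id has no match."""
    merged, unmatched = [], []
    for cow in cows:
        herd = herds.get(cow["herd_id"])
        if herd is None:
            unmatched.append(cow["cow_id"])
            continue
        merged.append({
            "cow_id": cow["cow_id"],
            "herd_id": cow["herd_id"],
            "weight_kg": to_kg(cow["weight"], cow["unit"]),
            **herd,
        })
    return merged, unmatched


merged, unmatched = merge_cows_with_herds(cows, herds)
print(merged)
print("cows with no matching herd:", unmatched)
```

Listing the unmatched cows rather than silently dropping them makes a broken link between the files visible at merge time.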

11

12 Data entry
How do you avoid making mistakes?
4 main types of mistake
  Insertion: extra characters
  Deletion: missing characters
  Substitution: wrong characters
  Transposition: characters in the wrong order
  First two generally easy to pick up with data cleaning
Ways to avoid:
  Double entry and comparison using computer programs
  Checking a proportion and looking at percentage error; if it is large, go through all entries
    E.g. checked 10% of records; if error rate high, do it again!
  Use automated specialised software/capture system/form
    E.g. Survey Monkey, EpiInfo, Teleform, EpiData
    Can still get errors though!
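Double entry and comparison can be sketched in a few lines: the same forms are keyed in twice and every field that differs is listed for checking against the original. The records below are made up (one insertion, one transposition) and `compare_entries` is a hypothetical helper, not a feature of any of the packages named above.

```python
def compare_entries(first, second):
    """Return (record_id, field, first_value, second_value) for each mismatch."""
    mismatches = []
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                mismatches.append((record_id, field, a.get(field), b.get(field)))
    return mismatches


# The same two forms entered twice (made-up data).
entry_1 = {101: {"age": "9", "sex": "FN"}, 102: {"age": "12", "sex": "MN"}}
entry_2 = {101: {"age": "90", "sex": "FN"}, 102: {"age": "21", "sex": "MN"}}

mismatches = compare_entries(entry_1, entry_2)
fields_checked = sum(len(record) for record in entry_1.values())
error_rate = 100 * len(mismatches) / fields_checked

for mismatch in mismatches:
    print(mismatch)
print(f"{error_rate:.0f}% of fields disagree")
```

The comparison only says that the two entries disagree; which one is right still has to be settled from the original form.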

13 Taken from Petrie and Sabin – Medical Statistics at a Glance

14 Data cleaning
“Garbage in = garbage out”
Also referred to as data verification
You should have a plan for this before you start your analysis: cleaning often takes longer than the analysis!
Prioritise fields which are:
  Important (ones you will use for comparison with others, key population indicators etc.)
  Prone to error
Errors of sex, age, date etc. are often important

15 Important steps
Keep a copy of raw data
Check the original when an error is found
Save a new version with each change made
Keep a record of all versions/changes
Make sure you can retrace your steps if necessary!
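These steps can be partly automated. The sketch below writes every change to a new numbered file and appends a line to a change log, so the raw data are never overwritten. The file names, the record and the corrected value are all made up, and `save_version` is a hypothetical helper; a lab notebook or a spreadsheet of changes serves the same purpose.

```python
import csv
import tempfile
from pathlib import Path


def save_version(rows, folder, version, note):
    """Write rows to a new numbered CSV and log the change; never overwrite."""
    path = Path(folder) / f"cats_v{version}.csv"
    if path.exists():
        raise FileExistsError(f"{path.name} already exists; use a new version number")
    with path.open("w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    with (Path(folder) / "changes.log").open("a") as log:
        log.write(f"v{version}: {note}\n")
    return path


folder = tempfile.mkdtemp()  # stand-in for the project folder

raw = [{"id": "101", "age": "90"}]
raw_path = save_version(raw, folder, 0, "raw data as entered")

# Made-up correction, applied to a copy so the raw rows stay untouched.
corrected = [{**raw[0], "age": "9"}]
new_path = save_version(corrected, folder, 1, "id 101: age 90 -> 9 after checking original form")

print((Path(folder) / "changes.log").read_text())
```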

16 Data cleaning tips
For continuous variables:
  Identify missing values by using sorting functions
  Check the minimum/maximum values (histogram, scatter plot)
  Prepare a histogram to check the distribution
For categorical variables:
  Calculate frequencies to see if the counts look reasonable for each category (pivot tables in Excel)
  Check for any unexpected categories
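The same checks can be scripted. The sketch below looks at minimum/maximum and missing values for a continuous variable, and at frequencies and unexpected categories for a categorical one. The records are made up, and the plausible age range of 0 to 30 years is an assumption chosen for illustration only.

```python
from collections import Counter

# Made-up records; None marks a missing age.
records = [
    {"id": 1, "age": 4, "sex": "FN"},
    {"id": 2, "age": 90, "sex": "MN"},
    {"id": 3, "age": None, "sex": "FM"},
    {"id": 4, "age": 11, "sex": "ME"},
]

AGE_RANGE = (0, 30)  # assumed plausible range in years, for illustration
SEX_CODES = {"MN", "ME", "FN", "FE"}  # expected categories

# Continuous variable: missing values, minimum/maximum, out-of-range values.
ages = [r["age"] for r in records if r["age"] is not None]
missing_age = [r["id"] for r in records if r["age"] is None]
out_of_range = [
    r["id"]
    for r in records
    if r["age"] is not None and not AGE_RANGE[0] <= r["age"] <= AGE_RANGE[1]
]

# Categorical variable: frequencies and unexpected categories.
sex_counts = Counter(r["sex"] for r in records)
unexpected = sorted(set(sex_counts) - SEX_CODES)

print("age min/max:", min(ages), max(ages))
print("ids with missing age:", missing_age)
print("ids with age out of range:", out_of_range)
print("sex frequencies:", dict(sex_counts))
print("unexpected sex codes:", unexpected)
```

A flagged value is a prompt to check the original form, not an automatic correction.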

17 Data cleaning tips (cont.)
When writing a manuscript:
  Describe data cleaning in methods
  Report error rates and types
Purpose (remember from before): trying to detect
  Outliers/impossible values
  Missing data
  Inconsistencies
  Transcription errors

18 Outliers/impossible values

19 Outliers/impossible values
90 and 87 year old cats!
Check original form: entered incorrectly
  Change to correct value
  Save as a new version
  Record the change made and name the new version
If correctly entered, leave as is, or remove (dependent on analysis)

20 Missing data

21 Missing data
Missing data for breed/age from records
Check original form
Not recorded: true missing data
Can now code as missing data, e.g. 999

22 Inconsistencies

23 Inconsistencies
Sex listed as FN according to records, MN according to owner
Check original form
Consistent with data in Access: left as is
Or could remove the information
Which way is the error?
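Inconsistencies like this one can be found by comparing the two sources field by field. A minimal sketch, with made-up rows and hypothetical field names (`sex_record`, `sex_owner`):

```python
def find_inconsistencies(rows, field_a, field_b):
    """Return ids where both sources have a value and the values disagree."""
    return [
        row["id"]
        for row in rows
        if row[field_a] is not None
        and row[field_b] is not None
        and row[field_a] != row[field_b]
    ]


# Made-up rows: sex according to the practice record and according to the owner.
rows = [
    {"id": 1, "sex_record": "FN", "sex_owner": "MN"},
    {"id": 2, "sex_record": "ME", "sex_owner": "ME"},
    {"id": 3, "sex_record": "FE", "sex_owner": None},
]

to_check = find_inconsistencies(rows, "sex_record", "sex_owner")
print("check original forms for ids:", to_check)
```

The script cannot say which way the error is; it only produces the list of forms to pull out and check.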

24 Transcription errors
Covered already in data entry:
  Insertion
  Deletion
  Substitution
  Transposition

25 Data cleaning Taken from Petrie and Sabin – Medical Statistics at a Glance

26 Reporting of data cleaning
Include what you did in your methods!
Van den Broeck et al. (2005) talk about including your approach to data entry and cleaning in your methods
  A brilliant idea!
Transparency is the key
If you have taken the time to explain your ‘thoroughness’, it improves readers’ perceptions as to whether they can trust your results or not!

27 Next month Basic Excel skills

