Data entry and preparation for analysis (data cleaning)


Data entry and preparation for analysis (data cleaning)
Stats Club 2: Dec 2016
Marnie Brennan (and Natalie Robinson)

References for today
- Petrie and Sabin. Medical Statistics at a Glance. Chapters 2 and 3.
- Van den Broeck et al. (2005). Data cleaning: Detecting, diagnosing and editing data abnormalities. PLoS Med 2(10): e267.
- Thrusfield, M. (2007). Veterinary Epidemiology, Third Edition. Chapter 9.
- Dohoo et al. (2010). Veterinary Epidemiologic Research. Chapter 30.

Terminology
- Data coding
  - Thinking about how you are going to represent variables in your dataset
  - E.g. Sex (M/F), coded as 1 (M) and 2 (F)
- Data entry/input
  - Manually entering data into a database
  - Checks to make sure it is correct
- Data cleaning/verification/processing
  - Checking to see that your data is ‘right’ and represents the information correctly

Data coding
- How are you going to ‘represent’ your data?
  - Need to work this out first, before anything else happens
- Write it down
  - Preferably in a lab notebook, a research diary, a version of your questionnaire etc.
  - Use a different coloured pen
- Are you going to use numbers or letters?
  - E.g. if you are coding neuter status, are you going to use 1, 2, 3 and 4 OR MN, ME, FN, FE?
  - Some statistical packages don’t like letters
- If you are using data collected by someone else, how has it been ‘represented’?
  - Do you know the codes and what all the columns and rows mean?
  - Never assume anything!
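Writing the coding scheme down can also mean keeping it next to the data as a small codebook. The sketch below is only an illustration: the slide names the options (1 to 4, or MN/ME/FN/FE) but does not say which number goes with which letters, so the mapping here is an assumption, as are the function name and the choice of 999 for blanks.

```python
# Hypothetical codebook: which number pairs with which letters is NOT given
# in the slides; this mapping is an assumption for illustration only.
NEUTER_STATUS_CODES = {"MN": 1, "ME": 2, "FN": 3, "FE": 4}
MISSING = 999  # assumed missing-value code


def encode_neuter_status(raw):
    """Convert a letter code to its numeric code.

    Blank entries become MISSING; anything not in the codebook raises an
    error so that an unknown code is never silently assumed to mean something.
    """
    key = raw.strip().upper()
    if key == "":
        return MISSING
    if key not in NEUTER_STATUS_CODES:
        raise ValueError(f"Unknown neuter status code: {raw!r}")
    return NEUTER_STATUS_CODES[key]


print(encode_neuter_status("FN"))
print(encode_neuter_status(""))
```

Raising an error on an unrecognised code, rather than guessing, follows the "never assume anything" advice above.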

Data coding

Data coding
How might you code this?
“From all the journal sources listed below, please indicate those that you read (Please mark all that apply)”
- Cattle Practice
- Equine Veterinary Journal
- In Practice
- The Veterinary Record
- Pig Journal
- British Poultry Science Journal

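[Table: one column per journal (JourCP, JourEVJ, JourIP, JourVR, JourPJ, JourBPJ), with a 1 entered where applicable; the remaining cell values were not preserved in the transcript]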
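One way to code a "mark all that apply" question is one column per option, which is what the column names JourCP to JourBPJ suggest. The sketch below assumes 1 means ticked and 0 means not ticked; the 0 and the function name are assumptions, not something the slide states.

```python
# Column names are taken from the slide; the 1/0 scheme is an assumption.
JOURNAL_COLUMNS = {
    "Cattle Practice": "JourCP",
    "Equine Veterinary Journal": "JourEVJ",
    "In Practice": "JourIP",
    "The Veterinary Record": "JourVR",
    "Pig Journal": "JourPJ",
    "British Poultry Science Journal": "JourBPJ",
}


def code_journals(ticked):
    """Return one 1/0 indicator per journal column for a single respondent."""
    unknown = set(ticked) - set(JOURNAL_COLUMNS)
    if unknown:
        raise ValueError(f"Journals not on the questionnaire: {sorted(unknown)}")
    return {
        column: (1 if journal in ticked else 0)
        for journal, column in JOURNAL_COLUMNS.items()
    }


row = code_journals(["In Practice", "The Veterinary Record"])
print(row)
```

With this layout each respondent is one row, and the number of readers of a journal is simply the sum of its column.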

Guide to data collected by someone else

Data entry
- As per last month, think about how you are going to analyse your data before you input it into a spreadsheet or statistical package
- What are you going to do about missing values?
  - Usually written as 999 or variations on that theme (use something that will never come up in your actual dataset)
  - Other options?
  - Try not to use 0 if you can: 0 can be an answer too (i.e. they didn’t tick that variable)
  - Some statistical packages don’t like blanks
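A missing-value code such as 999 only works if it is excluded again before any calculation. The minimal sketch below uses made-up ages and a hypothetical helper name to show the idea; it is not tied to any particular statistical package.

```python
from statistics import mean

MISSING = 999  # missing-value code; must never occur as a real value

# Made-up example data: ages in years, two of them not recorded.
ages = [3, 7, MISSING, 12, MISSING, 5]


def valid_values(values, missing=MISSING):
    """Return only the values that are not the missing-value code."""
    return [v for v in values if v != missing]


observed = valid_values(ages)
n_missing = len(ages) - len(observed)

print("missing:", n_missing)
print("mean of observed ages:", mean(observed))
print("mean if 999 is left in by mistake:", mean(ages))
```

The last line shows why the code has to be something that cannot be mistaken for data: left in, it distorts every summary statistic.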

Data entry
- Numerical variables: enter them with the same precision as they are measured, and use a consistent unit of measurement
  - E.g. if you are measuring in kilograms, record 5.3kg, not just 5kg
  - Stick with kilograms, and convert pounds to kilograms
- If you have to use more than one table, make sure you have the same unique identifying number in each table, or make sure they are linked
- Large quantities of multilevel data: use hierarchical database software, or separate files for data at each level (e.g. herd file, cow file) and merge later
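Merging separate herd and cow files later depends on both carrying the same identifier. The sketch below assumes a shared `herd_id` field; the field names and values are invented for illustration.

```python
# Made-up herd-level and cow-level records linked by a shared herd_id.
herds = [
    {"herd_id": 1, "region": "North"},
    {"herd_id": 2, "region": "South"},
]
cows = [
    {"cow_id": 101, "herd_id": 1, "weight_kg": 612.5},
    {"cow_id": 102, "herd_id": 1, "weight_kg": 587.0},
    {"cow_id": 201, "herd_id": 2, "weight_kg": 640.3},
]


def merge_cows_with_herds(cows, herds):
    """Attach herd-level fields to each cow record using herd_id."""
    herd_by_id = {herd["herd_id"]: herd for herd in herds}
    if len(herd_by_id) != len(herds):
        raise ValueError("herd_id is not unique in the herd file")
    merged = []
    for cow in cows:
        herd = herd_by_id.get(cow["herd_id"])
        if herd is None:
            raise KeyError(f"cow {cow['cow_id']} refers to unknown herd {cow['herd_id']}")
        merged.append({**herd, **cow})
    return merged


for record in merge_cows_with_herds(cows, herds):
    print(record)
```

The two checks (unique herd identifiers, no cow pointing at a missing herd) are the kind of linkage problems that are far easier to catch at merge time than during analysis.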

Data entry
- How do you avoid making mistakes?
- 4 main types of mistake
  - Insertion: extra characters
  - Deletion: missing characters
  - Substitution: wrong characters
  - Transposition: characters in the wrong order
  - The first two are generally easy to pick up with data cleaning
- Ways to avoid:
  - Double entry and comparison using computer programs
  - Checking a proportion and looking at percentage error; if it is large, go through all entries
    - E.g. checked 10% of records: if the error rate is high, do it again!
  - Use automated specialised software/capture system/form
    - E.g. Survey Monkey, EpiInfo, Teleform, EpiData
    - Can still get errors though!
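Double entry and comparison can be sketched in a few lines: the same forms are keyed in twice and every field that differs is listed for checking against the original. The record layout, values and function name below are assumptions for illustration.

```python
def compare_entries(first, second):
    """Return (record_id, field, first_value, second_value) for every mismatch.

    `first` and `second` map a record id to a dict of field values, one
    mapping per data-entry pass.
    """
    mismatches = []
    for record_id in sorted(first.keys() | second.keys()):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field in sorted(a.keys() | b.keys()):
            if a.get(field) != b.get(field):
                mismatches.append((record_id, field, a.get(field), b.get(field)))
    return mismatches


# Made-up example: record 1 has a transposition error (12 vs 21) in one pass.
entry_1 = {1: {"age": "12", "sex": "FN"}, 2: {"age": "7", "sex": "MN"}}
entry_2 = {1: {"age": "21", "sex": "FN"}, 2: {"age": "7", "sex": "MN"}}

mismatches = compare_entries(entry_1, entry_2)
fields_checked = sum(len(fields) for fields in entry_1.values())
error_rate = len(mismatches) / fields_checked

print(mismatches)
print(f"{error_rate:.0%} of fields disagree")
```

A comparison like this only says that the two passes disagree, not which one is right; the original form still has to be checked.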

Taken from Petrie and Sabin – Medical Statistics at a Glance

Data cleaning
- “Garbage in = garbage out”
- Also referred to as data verification
- You should have a plan for this before you start your analysis: cleaning often takes longer than the analysis!
- Prioritise fields which are:
  - Important (ones you will use for comparison with others, key population indicators etc.)
  - Prone to error
- Errors of sex, age, date etc. are often important

Important steps
- Keep a copy of raw data
- Check the original when an error is found
- Save a new version with each change made
- Keep a record of all versions/changes
- Make sure you can retrace your steps if necessary!
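Saving a new version with each change and keeping a record of it can be automated. The sketch below writes each cleaned dataset to a new numbered CSV file and appends a line to a change log; the file naming scheme, the log format and the function name are all assumptions, not something the slides prescribe.

```python
import csv
from datetime import date
from pathlib import Path


def save_new_version(rows, fieldnames, folder, version, note):
    """Write rows to a new numbered CSV file and log what changed.

    Refuses to overwrite an existing version, so earlier versions
    (including the raw data) stay untouched.
    """
    folder = Path(folder)
    out_path = folder / f"data_v{version:02d}.csv"
    if out_path.exists():
        raise FileExistsError(f"{out_path} already exists; use a new version number")
    with out_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    # Append one tab-separated line per version: date, file name, description.
    with (folder / "changelog.txt").open("a") as log:
        log.write(f"{date.today().isoformat()}\t{out_path.name}\t{note}\n")
    return out_path
```

A call such as `save_new_version(rows, ["id", "age"], "cleaning", 2, "Corrected age for id 14 after checking original form")` would then leave both the data and the reason for the change on record (the folder name and note here are hypothetical).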

Data cleaning tips
- For continuous variables:
  - Identify missing values by using sorting functions
  - Check the minimum/maximum values (histogram, scatter plot)
  - Prepare a histogram to check the distribution
- For categorical variables:
  - Calculate frequencies to see if the counts look reasonable for each category (pivot tables in Excel)
  - Check for any unexpected categories
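The same checks can be run outside a spreadsheet. The sketch below assumes missing values are coded 999 and uses made-up data; the helper names and the list of expected categories are illustrative only.

```python
from collections import Counter

MISSING = 999  # assumed missing-value code


def summarise_continuous(values, missing=MISSING):
    """Count missing values and report the minimum and maximum of the rest."""
    observed = [v for v in values if v != missing]
    return {
        "n": len(values),
        "n_missing": len(values) - len(observed),
        "min": min(observed, default=None),
        "max": max(observed, default=None),
    }


def unexpected_categories(values, expected):
    """Return the frequency of every category that is not in `expected`."""
    counts = Counter(values)
    return {category: n for category, n in counts.items() if category not in expected}


# Made-up example data.
weights_kg = [4.2, 3.8, MISSING, 5.1, 41.0]
neuter_status = ["FN", "MN", "FN", "ME", "fn", "FE"]

print(summarise_continuous(weights_kg))
print(Counter(neuter_status))
print(unexpected_categories(neuter_status, {"MN", "ME", "FN", "FE"}))
```

Here the maximum of 41.0 and the stray lower-case `fn` are the kind of entries that would be checked against the original forms.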

Data cleaning tips (cont.)
- When writing a manuscript:
  - Describe data cleaning in methods
  - Report error rates and types
- Purpose (remember from before): trying to detect
  - Outliers/impossible values
  - Missing data
  - Inconsistencies
  - Transcription errors

Outliers/impossible values

Outliers/impossible values
- 90 and 87 year old cats!
- Check original form: entered incorrectly
  - Change to correct value
  - Save as a new version
  - Record the change made and name the new version
- If correctly entered, leave as is, or remove (dependent on analysis)
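Values like these can be flagged automatically with a simple range check before anyone goes back to the original forms. In the sketch below the plausible range of 0 to 30 years for a cat is an assumption chosen for illustration, as are the field names and the 999 missing-value code.

```python
def flag_out_of_range(records, field, low, high, missing=999):
    """Return the records whose value for `field` lies outside [low, high].

    Values equal to the missing-value code are skipped, not flagged.
    """
    return [
        r for r in records
        if r[field] != missing and not (low <= r[field] <= high)
    ]


# Made-up records; 0-30 years is an assumed plausible range for cats.
cats = [
    {"id": 1, "age_years": 9},
    {"id": 2, "age_years": 90},
    {"id": 3, "age_years": 87},
    {"id": 4, "age_years": 999},
]

suspect = flag_out_of_range(cats, "age_years", 0, 30)
print([r["id"] for r in suspect])
```

The function only flags records; whether a flagged value is corrected, left as is, or removed is still decided by checking the original form.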

Missing data

Missing data
- Missing data for breed/age from records
- Check original form
  - Not recorded: true missing data
- Can now code as missing data, e.g. 999

Inconsistencies

Inconsistencies
- Sex listed as FN according to records, MN according to owner
- Check original form
  - Consistent with data in Access: left as is
  - Or could remove the information
- Which way is the error?

Transcription errors
- Covered already in data entry:
  - Insertion
  - Deletion
  - Substitution
  - Transposition

Data cleaning Taken from Petrie and Sabin – Medical Statistics at a Glance

Reporting of data cleaning
- Include what you did in your methods!
- Van den Broeck et al. (2005) talk about including your approach to data entry and cleaning in your methods
  - A brilliant idea!
- Transparency is the key
  - If you have taken the time to explain your ‘thoroughness’, readers are more likely to trust your results

Next month: Basic Excel skills