Best practices in R scripting


Best practices in R scripting R Bootcamp 2017 Michael Hallquist

Overview

An important goal of this bootcamp is to disentangle you from visual data management (e.g., the spreadsheet view in SPSS or Excel). Visual data management has critical flaws:
- Temptation to paste new variables next to existing variables, rather than performing a formal match merge
- The fallacy that one has conducted full QA by looking through the dataset
- Doesn't scale well to large datasets
- Doesn't promote a "zoom and filter" strategy (Shneiderman), whereby one focuses on a specific data subset (e.g., all records for a given family) and filters out irrelevant observations to allow for closer inspection
- Manual edits typically degrade fidelity and source memory for data (people rarely keep detailed records of their edits)

Storage

Use an open format that collaborators can understand and that is not software dependent. Suggestion: .csv.gz.
- CSV (comma-separated values) files can be read by virtually any data program.
- Adding gzip compression makes file sizes much smaller, on par with proprietary formats (e.g., .sav or .sas7bdat).
- Bonus: base R functions read and write it: write.csv(iris, gzfile("iris.csv.gz"))
For more complex datasets that don't make sense as a 2-D data.frame-style object, keep them in an .RData file. This maintains the original form so it can be reshaped for accessibility later, and it is a non-proprietary format (though not well supported by other software).
Always keep a completely unadulterated copy of the dataset. Back it up to cloud storage that will preserve the original even if you change your working copy.
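The round trip above can be sketched in base R alone; this is a minimal example using the built-in iris dataset, with iris2.RData as a placeholder file name:

```r
# Write a data frame to a gzip-compressed CSV (open format, small on disk)
write.csv(iris, gzfile("iris.csv.gz"), row.names = FALSE)

# read.csv decompresses .gz files transparently, so collaborators need no
# extra steps to load it
iris2 <- read.csv("iris.csv.gz")
stopifnot(nrow(iris2) == nrow(iris))

# For objects that don't fit a 2-D data.frame, save to .RData instead
save(iris2, file = "iris2.RData")
```

Note that a round trip through CSV loses R-specific attributes (e.g., factor levels become plain character columns), which is another reason to keep an unadulterated master copy.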

Storage

Make effective use of "readme" files (a kind of data blog):
- Data dictionary: file names, variable names, descriptions, missing codes
- Summary of directory and file structure
- Answers to any data detective work you or someone else might puzzle over later (e.g., why is ID 9005 missing the BDI when that was filled out in the same session as all the other self-reports?)
- Any changes to raw data that lead to a divergence from the upstream provider/source
Name variables something intuitive, not opaque.
Write a script that parses, merges, manipulates, and edits the dataset into a format that you consider minimally usable.
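A data dictionary can itself live as a plain CSV next to the data, so it is readable by any tool. A minimal sketch, where the variable names and missing codes are hypothetical examples, not from a real study:

```r
# Hypothetical data dictionary: one row per variable, kept under version
# control alongside the readme
dictionary <- data.frame(
  variable     = c("subj_id", "bdi_total", "narc_sx_wave1"),
  description  = c("Subject identifier",
                   "Beck Depression Inventory total score",
                   "Narcissism symptom count, wave 1"),
  missing_code = c(NA, -999, -999)
)

write.csv(dictionary, "data_dictionary.csv", row.names = FALSE)
```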

Filesystem structure

Use a reasonable directory structure for a project. Although there is not a one-size-fits-all solution, it pays to have your own conventions across projects. Example:

Project name/
    Code/
        Wrangling.R
        Visualization.R
        Analysis.R
    Data/
        Raw/
            BDI.csv
            Demograph.csv
            EMA_3Weeks.csv.gz
        Proc/
            SelfReport_EMA_Merged_23Aug2016.RData
    Figures/
        ToPublish/
            01 - BPD Group Boxplots.pdf
            01 - Snapshot.RData
            02 - BPD AI Corrs.pdf
        RandomFig.png
    Output/
        SelfReport QA 18Aug2016.txt
        EMA Missing Report 29Aug2016.txt

Be careful about leaving temporary files behind that you won't recall later (e.g., Rplots.pdf or Fig1.png).
File naming:
- Use descriptive, intuitive names
- Use numbering to help keep files sorted
- When possible, date important files
- If something changes (e.g., you get 10 more subjects), keep (and date) old outputs
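Since the conventions repeat across projects, the skeleton can be scripted rather than clicked together. A minimal sketch, where "myproject" is a placeholder name:

```r
# Create the standard project skeleton in one pass; showWarnings = FALSE
# makes the script safe to re-run on an existing project
dirs <- file.path("myproject",
                  c("Code", "Data/Raw", "Data/Proc",
                    "Figures/ToPublish", "Output"))
invisible(lapply(dirs, dir.create, recursive = TRUE, showWarnings = FALSE))
```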

Versioning

Your data management, visualization, and analysis code will change over time. You will make errors, question why something used to work, or your collaborator will accidentally delete something. Version control tracks changes in your project over time.
- It supports good reproducibility practices (public code, snapshots, etc.)
- It allows you to go on a wild goose chase, then undo your changes when you break things
- It allows multiple people to work on a codebase with less concern about collisions
Popular version-control service: GitHub (https://github.com).
One can also version data or an entire project. Example: Open Science Framework (https://osf.io), which is increasing in popularity for pre-registration.

Coding sidebar

- Comment your code so you and others understand it!
- Use descriptive variable names (e.g., narc_sx_wave1)
- Format your code in a readable way. Suggestions: http://adv-r.had.co.nz/Style.html. Use auto-formatting in RStudio (Code > Reformat Code).
- Write your script to be self-standing and to run from top to bottom: set the working directory, load required libraries, load datasets, transform and merge data, run analyses, etc.
- Make use of functions for repeated/reusable code, and split subparts of the data management and analysis stream into separate files.
- Be pragmatic and don't over-engineer things (unless you are fairly confident that you'll be repeating the process for several projects).
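The top-to-bottom structure above can be sketched as a script skeleton; all file and variable names here are placeholders, not from a real project:

```r
## 1. Set the working directory and load libraries
# setwd("~/Projects/myproject")
# library(dplyr)

## 2. Load datasets
# bdi <- read.csv("Data/Raw/BDI.csv")

## 3. Put repeated/reusable logic in functions rather than copy-pasted blocks
score_scale <- function(df, items) rowSums(df[, items], na.rm = TRUE)

## 4. Transform, merge, and analyze
demo <- data.frame(item1 = c(1, 2), item2 = c(3, NA))
demo$total <- score_scale(demo, c("item1", "item2"))
demo$total  # 4 for row 1; 2 for row 2 (the NA is ignored)
```

Because the script has no hidden state from interactive editing, a collaborator (or future you) can reproduce the whole pipeline by sourcing it once from the top.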