Best practices in R scripting

Best practices in R scripting
R Bootcamp 2017 Michael Hallquist

Overview An important goal (of this bootcamp) is to disentangle you from visual data management (e.g., spreadsheet format in SPSS or Excel) Visual data management has critical flaws: Temptation to paste new variables next to existing variables, rather than a formal match merge Fallacy that one has conducted full QA by looking through dataset Doesn't scale well to large datasets Doesn't promote a "zoom and filter" strategy (Shneiderman) whereby one focuses on a specific data subset (e.g., all records for a given family) and filters out irrelevant observations to allow for closer inspection. Manual edits typically degrade fidelity and source memory for data (people rarely keep detailed records of their edits).

Storage Use an open format that collaborators can understand and is not software dependent. Suggestion: .csv.gz. CSV (comma-separated values) files can be read by virtually any data program. Adding the gzip compression makes file sizes much smaller and on par with proprietary formats (e.g., .sav or .sas7bdat). Bonus: Base R functions read and write it: write.csv(iris, gzfile("iris.csv.gz")) For more complex datasets that don't make sense as a 2-D data.frame-style object, I would suggest keeping them in an .RData object. This maintains the original form so it can be reshaped for accessibility later, and it is a non- proprietary format (though not well supported by other software). Always keep a completely unadulterated copy of the dataset. Back it up to cloud storage that will keep the original even if you change it.

Storage Make effective use of "readme" files (kind of like a data blog) Data dictionary: file names, variable names, descriptions, missing codes Summary of directory and file structure Answers to any data detective work you or someone else might puzzle over later (e.g., why is ID 9005 missing the BDI when that was filled out in the same session as all the other self-reports?) Any changes to raw data that lead to a divergence from the upstream provider/source Name variables something intuitive, not opaque Write a script that parses, merges, manipulates, and edits the dataset into a format that you consider minimally usable.

Project name/ Code/ Wrangling.R Visualization.R Analysis.R Data/ Raw/ BDI.csv Demograph.csv EMA_3Weeks.csv.gz Proc/ SelfReport_EMA_Merged_23Aug2016.RData Figures/ ToPublish/ 01 - BPD Group Boxplots.pdf 01 - Snapshot.RData 02 - BPD AI Corrs.pdf RandomFig.png Output/ SelfReport QA 18Aug2016.txt EMA Missing Report 29Aug2016.txt Filesystem structure Use a reasonable directory structure for a project. Although there is not a one- size-fits-all solution, it does pay to have your own conventions across projects. Be careful about leaving temporary files behind that you won't recall later (e.g., Rplots.pdf, or Fig1.png) File naming: Descriptive, intuitive names Use numbering to help keep files sorted When possible, date important files If something changes (e.g., you get 10 more subjects), keep (and date) old outputs.

Versioning Your data management, visualization, and analysis code will change over time. You will make errors, question why something used to work, or your collaborator will accidentally delete something. Version control tracks changes in your project over time. It supports good reproducibility practices (public code, snapshots, etc.). It allows you to go on a wild goose chase, then undo your changes when you break things. It allows multiple people to work on a codebase with less concern about collision. Popular version control service: Github ( One can also version data or an entire project. Example: Open science framework ( Increasing in popularity for pre-registration.

Coding sidebar Comment your code so you and others understand it!
Use descriptive variable names (e.g., narc_sx_wave1) Format your code in a readable way. Suggestions: r.had.co.nz/Style.html. Use auto-formatting in Rstudio (Code > Reformat Code). Write your script to be self-standing and run from top to bottom. Set the working directory Load required libraries Load datasets Transform and merge data Analysis etc. Make use of functions for repeated/reusable code, and split out subparts of data management and analysis stream to separate files. Be pragmatic and don't over-engineer things (unless you are fairly confident that you'll be repeating the process for several projects)

Best practices in R scripting

Similar presentations

Presentation on theme: "Best practices in R scripting"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Best practices in R scripting

Similar presentations

Presentation on theme: "Best practices in R scripting"— Presentation transcript:

Similar presentations

About project

Feedback