R for Epi Workshop Module 1: Learning Base R

Sara Levintow, MSPH PhD Candidate in Epidemiology UNC Gillings School of Global Public Health

Outline
1. Background on R
2. Installation of R and RStudio
3. Tour of RStudio programming environment
4. Intro to R code syntax
5. Intro to R objects
6. Intro to functions

Code Along
Type out or copy/paste code from the slides into R.
Run the code and check your work against the output included in the slides.
Raise your hand if you have questions or get stuck!
Important note about copying and pasting quotation marks:
PowerPoint likes to make quotation marks curly: “ ”
R needs quotation marks to be straight: " "
Retype quotation marks in R to change them from curly to straight!

1. Background on R

Introducing R
Open source programming language and software environment for statistical computing and graphics
Created by Ross Ihaka and Robert Gentleman (University of Auckland, New Zealand)
Currently supported by the R Foundation for Statistical Computing (Vienna, Austria)
More info on the history of R at

Introducing R
Freely available software, no cost to users
Wide variety of software facilities for manipulating and analyzing data
Crowd-sourcing results in state-of-the-art functionality
Highly extensible (via packages)
Easy to generate well-designed, publication-quality plots
“Environment” = fully planned and coherent system, not an accumulation of specific, inflexible tools

Rising Popularity of R From “The Popularity of Data Science Software”
By Robert A. Muenchen

For SAS Users: Key Differences
SAS “gives you everything” vs. R “builds from the bottom”
R is considered a modern computer science language based on functions, objects, and abstraction. SAS is an older, more traditional card language.
R is constantly adding new technologies and statistical methods via packages. SAS takes longer to incorporate them and requires a new version roll-out.
Graphical data exploration and visualization are more powerful in R, but there is a learning curve.

Base R and Packages
Module 1 is an introduction to base R code, without packages.
Packages, developed by the community, extend the capabilities of base R. They contain additional functions, documentation, and sample data.
Packages are available from the Comprehensive R Archive Network (CRAN).
Modules 2 and 3 will introduce you to a set of packages known as the “tidyverse”: popular tools for data manipulation and visualization.

2. Installation of R and RStudio

Installing R Go to https://cloud.r-project.org/
Click on download link for your operating system: Follow instructions for download and installation. I recommend installing the most recent version: R “Great Truth” released on 3/11/2019. You must install a version of R at least as recent as R (which dates back to 2013) in order to use RStudio.

Introducing RStudio
Integrated development environment (IDE) for R
A layer that goes over the top of R, changing its appearance
Makes R easier to use, looks more like other interfaces you’ve used (SAS, Stata, etc.)
RStudio Desktop is free, open-source
Built-in tools: syntax highlighting, code completion, smart indentation, quick access to function definitions and R help, interactive debugger, extensive package development tools, versioning tools, many more…
More info here:

Installing RStudio Go to Click on download link for RStudio Desktop: Click on the installer link for your operating system: Follow instructions for download and installation.

3. Tour of RStudio

Tour of RStudio
The RStudio window has four panes: the script editor, the console, workspace management, and tools for plotting, packages, and help.

Creating your first script in RStudio
Using icons at top left of Editor:
Create new R script
Open existing R script
Save current R script

Add a header at the top of your script:
    ##################################################
    ## Code from IPH Workshop: R for Epi
    ## Name
    ## Date
Save the script as “workshop_code.R” in the folder on your computer you are using for this workshop (e.g., where you saved the births data). This folder is your working directory (we’ll come back to this).

RStudio IDE Cheat Sheet
Excellent resource for getting started in the RStudio IDE. Includes comprehensive list of keyboard shortcuts (for both Windows/Linux and Mac). See here: IDE-cheatsheet.pdf


4. Intro to R Code Syntax

Syntax Basics
Use <- or = as assignment operator (reads as “gets”):
    new_variable <- code
Use # for comments (ignored when executing your code)
Case-sensitive
No command terminator required (e.g., semicolon in SAS)
In general, executing code is not sensitive to indentation, single vs. double quotes, or spacing (some exceptions). However, consistent formatting is important for readability.
Style guides are available from Google and the tidyverse.

Syntax Basics
Type the following into your script:
    # my first R code
    x <- rnorm(1000, mean = 1.2, sd = 3)
    summary(x)
    hist(x)

Running Code
To run a line, put your cursor on the line with the code, and click Run or press Control+Enter (Command+Enter on Mac).
To run several lines, highlight the lines of code, and click Run or press Control+Enter (Command+Enter on Mac).
To run the entire script, click Source or Source with Echo.

Running Code
Run the following:
    # my first R code
    x <- rnorm(1000, mean = 1.2, sd = 3)
    summary(x)
    hist(x)
What do you see in the Console?
Commands (after the prompt > )
Results
What about in the Environment and Plots windows?

[Screenshot of RStudio with the Editor, Environment, Plots, and Console panes labeled]

Getting Help
Click on the Help tab, run ?function, or use the keyboard shortcut (F1).
    ?rnorm
    ?summary
    ?hist

Reading in and Saving Files
Set your working directory: the folder on your computer where you are currently working.
Read in data stored in this location.
Save files (data, plots, etc.) to this location.
    getwd() # find the current working directory
    setwd("~/Documents/R Workshop/")

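Read in the births dataset (from your working directory) using the read.csv() function. Spell out TRUE and FALSE rather than T and F, because T and F are ordinary variables that can be overwritten:
    births <- read.csv("births2012.csv", stringsAsFactors = FALSE, header = TRUE)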
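To see what read.csv() returns without needing the workshop file, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back. The column names and values are invented for illustration and are not taken from the births data.

```r
# Write a small, made-up CSV to a temporary file (hypothetical columns)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("X,WKSGEST,METHOD",
             "1,39,vaginal",
             "2,99,cesarean"),
           csv_path)

# header = TRUE: first line holds the variable names
# stringsAsFactors = FALSE: keep text columns as character, not factor
toy <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

str(toy)                          # one line per column with its type
names(toy) <- tolower(names(toy)) # same lowercase step used for births
names(toy)
```

Running str() right after an import is a quick way to confirm that each column came in with the type you expected.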

There are many base R functions and package-specific functions for reading in data from other programs:
Base R: read.csv(), read.table(), read.delim()...
foreign package (installed with R): read.spss(), read.dta()...
gdata package: read.xls()
readr package: read_csv(), read_delim(), read_fwf(), read_tsv()...
haven package: read_sas(), read_dta(), read_sav()...

You can also use the GUI for R to generate the import code for you:

Save datasets to your working directory using similar code with “write” instead of “read”:
Base R: write.csv(), write.table()...
readr package: write_csv(), write_delim()...
haven package: write_sas(), write_dta(), write_sav()...

Read in and save R datasets as RDS files (R-specific file type that is efficient at compressing the rectangular datasets we typically use in public health):
Base R: readRDS(), saveRDS()
readr package: read_rds(), write_rds()
    births <- readRDS("births.Rds") # if already saved as RDS file
    # ...
    # data cleaning and recoding
    saveRDS(births, file = "births_final.Rds") # save new version
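As a minimal sketch of the RDS round trip, the example below saves a small made-up data frame to a temporary file and reads it back. The toy data frame and the temporary path are illustrative assumptions, not part of the births data.

```r
# Toy data frame standing in for a cleaned dataset (hypothetical values)
toy <- data.frame(id = 1:3, wksgest = c(39, NA, 35))

# Use a temporary file so nothing in your working directory is touched
rds_path <- tempfile(fileext = ".Rds")
saveRDS(toy, file = rds_path)

# Read it back and compare with the original object
toy_copy <- readRDS(rds_path)
identical(toy, toy_copy)
str(toy_copy)
```

Unlike a CSV, an RDS file stores the R object itself, so column types and NA values do not need to be re-specified on import.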

View Data
View entire dataset in a separate window by clicking on its name in the Environment tab, or by running:
    View(births)
View selected observations in the console:
    head(births, n = 10) # print first 10 rows
    tail(births, n = 5)  # print last 5 rows
Dimensions of dataset:
    nrow(births) # number of rows (observations)
    ncol(births) # number of columns (variables)

View Data
View variable names in the console:
    names(births)
Change variable names from uppercase to lowercase:
    names(births) <- tolower(names(births))
To reference a specific variable in a dataset, use the dollar sign ($):
    bir            # start typing 'births'... what happens?
    births$        # what happens when you type this code?
    births$wksgest # see this variable's values printed to console

Variable Example: Weeks of Gestation
Get summary statistics on weeks of gestation (wksgest) across all observations:
    summary(births$wksgest)

Based on the data dictionary, this variable is coded corresponding to weeks of gestation, with 99 for missing. To omit 99s from any numeric calculations, we need to recode as R’s missing value: NA

Prior to recoding, you can save the originally coded version of wksgest as a new variable in births, wksgest_99:
    births$wksgest_99 <- births$wksgest
Then, for all observations where wksgest is equal to 99, recode as NA:
    births$wksgest[births$wksgest == 99] <- NA
This is our first example of indexing [using the brackets], which is very powerful in R. Rows (observations) and columns (variables) in a dataset can be referenced using the brackets to perform functions on specific rows or columns of interest.
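The recoding step can be broken into its two parts on a small made-up vector: first build the logical index, then assign through it. The values below are hypothetical and the name is_missing_code is introduced only for this sketch.

```r
# Hypothetical gestational ages, with 99 standing in for "missing"
wksgest <- c(38, 99, 40, 36, 99)

# Step 1: a logical vector marking which elements equal 99
is_missing_code <- wksgest == 99
is_missing_code

# Step 2: overwrite only the marked elements
wksgest[is_missing_code] <- NA

wksgest
sum(is.na(wksgest))  # how many values are now missing
```

Writing births$wksgest[births$wksgest == 99] <- NA does both steps in one line.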

The distribution of weeks of gestation now has a maximum of 45, with missing values for observations previously coded as 99:

Use the table function to get a contingency table of counts for a single variable or to cross-classify two or more variables.
    # simple example: frequency table for values of recoded wksgest
    table(births$wksgest, useNA = "always")
In the output, the top row lists the variable values, the row beneath gives the count for each value, and the final <NA> column counts the missing observations.

Use the table function to compare the original variable (wksgest_99) to the recoded variable (wksgest).
    # adding arguments to table function
    table(births$wksgest, births$wksgest_99, useNA = "always", dnn = c("recoded", "original"))

Recode 99 for other variables
Run summary() on the variables mage, visits, and mdif to check for 99s. Just as you did for wksgest, recode 99s to NA for these variables. For each variable, check your recoding by running summary().

Recoding answers
    births$mage[births$mage == 99] <- NA
    births$visits[births$visits == 99] <- NA
    births$mdif[births$mdif == 99] <- NA

Summary output

Add a New Variable: Preterm Birth
Preterm birth is defined as gestational age < 37 weeks. Term birth is defined as gestational age ≥ 37 weeks.
    # coding preterm as 1, term as 0
    births$preterm <- ifelse(births$wksgest < 37, 1, 0)
    # check our coding: preterm should correspond to all births where wksgest < 37
    table(births$preterm, births$wksgest, useNA = "always")
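One detail worth checking is what ifelse() does with missing gestational ages. The sketch below uses four made-up values to show that an NA input stays NA rather than being coded as term.

```r
# Hypothetical gestational ages, one of them missing
wksgest <- c(35, 37, NA, 41)

# Comparison with NA yields NA, and ifelse() passes that NA through
preterm <- ifelse(wksgest < 37, 1, 0)
preterm

# useNA = "always" makes the missing category visible in the table
table(preterm, useNA = "always")
```

This is why the check table in the workshop code is run with useNA = "always": births with unknown gestational age should appear as missing, not as 0.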

Logical Operators Slide credit: EPID 700 (Xiaojuan Li & Jordan Cates)

Subset Data There are several ways to subset data in base R. We will return to indexing and also introduce the subset function.

Subset Data using Indexing
    # Index individual variable: syntax is [which rows?]
    births$wksgest[1:10]
    # Indexing dataset: syntax is [which rows? , which columns?]
    births[1:3, 1:5]

Create a new dataset (named preterm_data) consisting only of preterm births:
    # Select preterm birth observations, all variables
    preterm_data <- births[births$preterm == 1, ]
    # Select preterm birth observations, only study ID and preterm variables
    preterm_data <- births[births$preterm == 1, c("x", "preterm")]

Subset Data using Subset Function
Create a new dataset (named preterm_data) consisting only of preterm births:
    # First argument = data to subset; second argument = logical expression
    preterm_data <- subset(births, preterm == 1)
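Bracket indexing and subset() treat missing values differently, which matters here because preterm is NA wherever wksgest is NA. The sketch below uses a made-up four-row data frame to show the difference; wrapping the condition in which() is one common way to make bracket indexing drop the NA rows.

```r
# Toy data with one missing preterm value (hypothetical)
toy <- data.frame(id = 1:4, preterm = c(1, 0, NA, 1))

# Logical indexing keeps an all-NA row where the condition is NA
by_index <- toy[toy$preterm == 1, ]

# which() returns only the positions where the condition is TRUE
by_which <- toy[which(toy$preterm == 1), ]

# subset() drops rows where the condition is NA
by_subset <- subset(toy, preterm == 1)

nrow(by_index)
nrow(by_which)
nrow(by_subset)
```

If the row counts from the two approaches differ on your own data, missing values in the condition are the likely cause.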

Basic Summary Statistics
We have already seen the summary() function and will introduce other functions for descriptive statistics throughout today’s workshop.
    # summary(): min, 25%, median, mean, 75%, max
    summary(births$mage)
    summary(births$mage[births$preterm == 1])
    summary(births$mage[births$preterm == 0])

    # what happens when you run the mean and standard deviation functions?
    mean(births$mage)
    sd(births$mage)

Many functions in R force you to be explicit about missing data. They will return NA if there are any missing values. Specify na.rm=TRUE to first remove the missing observations before calculating the statistic of interest.
    mean(births$mage, na.rm = TRUE)
    sd(births$mage, na.rm = TRUE)
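A small made-up vector shows the behavior directly. The ages below are hypothetical.

```r
# Hypothetical maternal ages with one missing value
mage <- c(24, 31, NA, 28, 35)

mean(mage)                # NA: a single missing value propagates
mean(mage, na.rm = TRUE)  # mean of the four observed values
sd(mage, na.rm = TRUE)    # same idea for the standard deviation

sum(is.na(mage))          # how many values were dropped
```

Reporting the number of dropped observations alongside the statistic keeps the denominator clear.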

5. Intro to R Objects

Data Frames and other objects in R
R is object-based: any time you use the assignment operator (<- or =), you are creating an object in R’s memory that can be subsequently referenced in your code.
    births <- readRDS("births.Rds") # births is a data frame object
Data frames are the fundamental data structure in R (what you think of as a dataset): a 2-dimensional data structure, where 1 row corresponds to 1 observation, and 1 column corresponds to 1 variable.

In addition to data frames, other types of objects are values, vectors, lists, and matrices.
    n <- nrow(births) # storing value of births sample size
    congen_anom <- c("anen", "mnsb", "cchd", "cdh", "omph", "gast",
                     "limb", "cl", "cp", "dowt", "cdit", "hypo") # character vector we'll use later!
    apgar_scores <- seq(0, 10, by = 1) # numeric vector
    birth_dates <- list(min(births$dob), births[50001:50005, "dob"],
                        max(births$dob), Sys.Date()) # list holds different types of objects

Tips on Objects in R
Object names:
Cannot have spaces and are case-sensitive
Must start with a letter (or a period not followed by a digit)
Can use a period or underscore to form part of the name of an object
To list the objects created in an R session:
    ls()
If you save the workspace image when quitting R, all objects in the global environment will be saved and restored the next time you start R in that directory.

Understand Data Structure
    # Identify type of object
    class(births)
    class(births$cores)
    # Describe structure of object
    str(births)
    str(births$cores)

Common Data Types
Logical: Boolean values (TRUE or FALSE)
    as.logical(births$preterm)
Numeric
    as.numeric(births$preterm)
Character
    as.character(births$preterm)
Factor
    as.factor(births$preterm)

Always check data type! Functions often expect a certain type (e.g., numeric vs. character). In particular, when reading in data from a different program (e.g., SAS to R), make sure that variable type is preserved.

Always check variable type!
    # hypothetical situation where number of prenatal care visits was read in as character instead of numeric
    births$visits_oops <- as.character(births$visits)
    mean(births$visits_oops) # returns NA with a warning
    # recoding character version of visits to numeric
    births$visits_num <- as.numeric(births$visits_oops)
Warnings: Alert messages that came up while your code ran.
Errors: Stop execution of your code.
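A related pitfall is a character column that contains text which cannot be converted. The sketch below uses invented values to show that as.numeric() turns such entries into NA and issues a warning, so it is worth counting the NAs created by the conversion.

```r
# Hypothetical visit counts read in as text, one entry not a number
visits_chr <- c("12", "8", "unknown", "15")
class(visits_chr)

# as.numeric() converts what it can; "unknown" becomes NA (with a warning,
# silenced here only so the example runs quietly)
visits_num <- suppressWarnings(as.numeric(visits_chr))
visits_num

# Count the values lost in conversion before using the variable
sum(is.na(visits_num))
mean(visits_num, na.rm = TRUE)
```

In your own code, leave the warning visible: it is the signal that some values did not convert.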

Recode Numeric Variable to be Factor
Create factor variable: 0 visits, 1-9 visits, 10-15 visits, >15 visits
    # intro to cut function
    births$visit_cat <- cut(births$visits, breaks = c(0, 1, 10, 16, 99), right = FALSE)
    # table function to check your coding
    table(births$visits, births$visit_cat, useNA = "always")
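To see exactly where cut() places boundary values, the sketch below applies the same breaks to a handful of made-up visit counts chosen to sit on the edges of each category. The labels in the second call are an illustrative choice, not part of the workshop code.

```r
# Hypothetical visit counts at the edges of each category
visits <- c(0, 1, 9, 10, 15, 16, 30, NA)

# right = FALSE makes intervals closed on the left: [0,1), [1,10), ...
visit_cat <- cut(visits, breaks = c(0, 1, 10, 16, 99), right = FALSE)
levels(visit_cat)

# Cross-tabulate raw values against categories to check the coding
table(visits, visit_cat, useNA = "always")

# Optional: supply readable labels (hypothetical wording)
visit_lab <- cut(visits, breaks = c(0, 1, 10, 16, 99), right = FALSE,
                 labels = c("0", "1-9", "10-15", "16+"))
levels(visit_lab)
```

With right = FALSE the upper break (99) is itself excluded, which is harmless here because the 99s were already recoded to NA.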

Factor Variables Efficient way to encode both numeric and character information in a categorical variable. A factor is stored as a vector of numeric values with a corresponding set of character values to use when the factor is displayed. Factors represent ≥2 categories: you can specify the order of categories and the reference category. Useful for plotting and regression models.

Creating factor variable for preterm
Use the function factor(), specifying the variable to recode into a factor and the corresponding levels (numeric) and labels (character).
    # already created numeric preterm variable
    births$preterm <- ifelse(births$wksgest < 37, 1, 0)
    # create preterm_f variable (factor) from preterm (numeric)
    births$preterm_f <- factor(births$preterm, levels = c(1, 0), labels = c("Preterm", "Term"))
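The order given in levels determines both the display order and which category is treated as the reference. The sketch below uses a made-up vector to show the stored integer codes, and uses relevel() as one way to change the reference category afterwards.

```r
# Hypothetical numeric preterm indicator with one missing value
preterm <- c(1, 0, 0, NA, 1)

# levels lists the codes in the desired order; labels names them
preterm_f <- factor(preterm, levels = c(1, 0), labels = c("Preterm", "Term"))
levels(preterm_f)      # order of categories
as.integer(preterm_f)  # underlying integer codes (position in levels)

# Make "Term" the first (reference) level without recreating the factor
preterm_ref <- relevel(preterm_f, ref = "Term")
levels(preterm_ref)
```

Note that the integer codes are positions in the level list, not the original 1/0 values, so convert with care.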

Frequency Table
    # easy way to check coding
    table(births$wksgest, births$preterm_f, useNA = "always")
    # bivariable distributions
    table(births$preterm_f, births$visit_cat, useNA = "always")

    # you can save any table as a table object
    counts <- table(births$preterm_f, births$visit_cat) # only non-missing
    # then can use the table as input
    prop.table(counts, margin = 1) # get row proportions
    prop.table(counts, margin = 2) # get column proportions
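The margin argument can be checked on a small made-up cross-tabulation: with margin = 1 each row sums to 1, and with margin = 2 each column does. The group and outcome vectors below are invented for illustration.

```r
# Hypothetical two-way data: 3 observations in group A, 5 in group B
group   <- c("A", "A", "A", "B", "B", "B", "B", "B")
outcome <- c("yes", "no", "no", "yes", "yes", "yes", "no", "no")

counts <- table(group, outcome)
counts

row_props <- prop.table(counts, margin = 1)  # proportions within each group
col_props <- prop.table(counts, margin = 2)  # proportions within each outcome

rowSums(row_props)  # each row sums to 1
colSums(col_props)  # each column sums to 1
```

Which margin to use depends on the question: row proportions answer "among group A, what share had the outcome?".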

6. Intro to Functions

Functions in R Functions are R objects (just like everything else!)
Nearly everything in R is done through functions. There are many built-in functions: e.g., mean() A strength of R is the user’s ability to write their own functions.

Built-in functions
Numeric functions
Examples: log(x), log10(x), exp(x), round(x, digits=n)
Character functions
Examples: toupper(x), tolower(x), paste(..., sep=""), grep(pattern, x)
Try running this code:
    paste("Today is", date())
Statistical functions
Examples: mean(x), sd(x), quantile(x, probs), min(x), max(x), sum(x)
Specify na.rm=TRUE as an additional argument if there are missing data
Probability functions
Examples: rnorm(n, mean, sd), rbinom(n, size, prob), runif(n, min, max)

Table-making/Summarizing functions
prop.table()
summary()
describe() from the Hmisc package
CreateTableOne() from the tableone package

Writing your own functions
Syntax for writing functions:
    myfunction <- function(arg1, arg2, ...) {
      statements
      return(object)
    }
Try writing your own logit function:
    logit <- function(p) {
      log(p / (1 - p))
    }
Specify a probability of 0.5: what is the logit?
    logit(p = 0.5)
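As a further sketch, the logit function can be paired with its inverse to check that the two undo each other. The name expit is an assumption made for this example and is not defined in the workshop materials.

```r
# Log-odds of a probability
logit <- function(p) {
  log(p / (1 - p))
}

# Inverse of the logit (hypothetical helper for this example):
# converts log-odds back to a probability
expit <- function(x) {
  1 / (1 + exp(-x))
}

logit(p = 0.5)                  # log-odds when the probability is one half
expit(logit(0.2))               # round trip back to the starting probability
logit(c(0.1, 0.5, 0.9))         # functions built from vectorized pieces accept vectors
```

The last expression of a function body is returned automatically, so an explicit return() is optional in short functions like these.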

Intro to the apply() family
Functions in base R to manipulate slices of data in a repetitive way without explicit use of loop constructs.
Also known as “functional functions” (functions taking a function as an argument).
Typically require very few lines of code.
Family members: apply(), lapply(), sapply(), vapply(), mapply(), rapply(), tapply(), eapply()
Q: How can I use a loop to […insert task here…]?
A: Don’t. Use one of the apply functions.

General Idea
“For each of these things… (columns or rows in a data frame, elements of a vector, etc.)
…do this… (apply a function)
…put it together… (compile the results of applying that function to each thing)
…and give me the results.” (create a new object or print results to the console)

Our motivation: Congenital anomalies
There are 12 variables corresponding to congenital anomalies. We would like to create a no_anomalies variable:
Coded 1 if all 12 variables = N
Coded 0 if ≥1 variable != N

Getting ready for apply()
    # Character vector corresponding to 12 variable names
    congen_anom <- c("anen", "mnsb", "cchd", "cdh", "omph", "gast",
                     "limb", "cl", "cp", "dowt", "cdit", "hypo")
    # Reference those 12 variables in births, without much typing
    births[, congen_anom]

Run apply() function
    births$no_anomalies <- apply(X = , MARGIN = , FUN = )
births$no_anomalies: create new variable in births
X: data to perform function on
MARGIN: apply function on rows (1) or columns (2)
FUN: any R function (built-in or user-defined)

Run apply() function
    births$no_anomalies <- apply(X = births[, congen_anom],
                                 MARGIN = 1,
                                 FUN = function(x) { as.numeric(all(x == "N")) })

Check your work
    # check out rows coded as 0
    births[births$no_anomalies == 0, c(congen_anom, "no_anomalies")]
    # check out rows coded as 1
    births[births$no_anomalies == 1, c(congen_anom, "no_anomalies")]
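The same row-wise logic can be traced on a made-up data frame with three anomaly columns instead of twelve. The toy values and the name anom_vars are assumptions for this sketch; the rowSums() line is an alternative vectorized approach, not the workshop's method, included only as a cross-check.

```r
# Toy data: three hypothetical anomaly indicators coded "N" / "Y"
toy <- data.frame(
  anen = c("N", "N", "Y"),
  cchd = c("N", "Y", "N"),
  cl   = c("N", "N", "N"),
  stringsAsFactors = FALSE
)
anom_vars <- c("anen", "cchd", "cl")

# MARGIN = 1: the function receives one row at a time as a vector x
toy$no_anomalies <- apply(X = toy[, anom_vars],
                          MARGIN = 1,
                          FUN = function(x) as.numeric(all(x == "N")))
toy

# Cross-check: count non-"N" entries per row; zero means no anomalies
alt <- as.numeric(rowSums(toy[, anom_vars] != "N") == 0)
all(toy$no_anomalies == alt)
```

Only the first toy row has "N" in every column, so it is the only row coded 1.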

Module 1 Code Summary for Births Data
    ## read in raw data
    births <- read.csv("births2012.csv", stringsAsFactors = FALSE, header = TRUE)
    ## lowercase variable names
    names(births) <- tolower(names(births))
    ## recode 99s as NA
    births$wksgest[births$wksgest == 99] <- NA
    births$mage[births$mage == 99] <- NA
    births$visits[births$visits == 99] <- NA
    births$mdif[births$mdif == 99] <- NA
    ## categorical visit variable
    births$visit_cat <- cut(births$visits, breaks = c(0, 1, 10, 16, 99), right = FALSE)
    ## preterm variable
    births$preterm <- ifelse(births$wksgest < 37, 1, 0)
    births$preterm_f <- factor(births$preterm, levels = c(1, 0), labels = c("Preterm", "Term"))
    ## congenital anomalies
    congen_anom <- c("anen", "mnsb", "cchd", "cdh", "omph", "gast",
                     "limb", "cl", "cp", "dowt", "cdit", "hypo")
    births$no_anomalies <- apply(X = births[, congen_anom], MARGIN = 1,
                                 FUN = function(x) { as.numeric(all(x == "N")) })

End of Module 1 – Preparing for Module 2
During the break, leave this code running in your RStudio session to install the ‘tidyverse’ packages (takes ~5-10 min): install.packages('tidyverse')
