R for Epi Workshop
Module 2: Data Manipulation & Summary Statistics
Sara Levintow, MSPH
PhD Candidate in Epidemiology
UNC Gillings School of Global Public Health
Outline
1. Intro to packages
2. Intro to the tidyverse
3. dplyr code structure: key functions & the pipe
4. Data manipulation examples
5. Summary statistics examples
6. Other useful data wrangling
1. Intro to packages
Getting started with packages
Packages are extensions to base R. They contain additional functions, documentation for using them, and sample data.
Packages are available from the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/available_packages_by_name.html
The “tidyverse” is a set of packages for data manipulation, exploration, and visualization. They share a common design and work in harmony. https://www.tidyverse.org/
# Install and load the package 'tidyverse'
install.packages('tidyverse')  # only need to run once
library(tidyverse)             # run at start of every R session to use
Syntax for packages
Developers of different packages may use the same function name. It is good coding practice to write package::function() so that the source of each function is explicit:
# Instead of:
filter()
# Specify the package for that function:
dplyr::filter()
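As a small illustration of why the prefix matters: base R's stats package also has a function called filter() (a linear filter for numeric series), and loading dplyr masks it. The sketch below uses a made-up toy data frame, not the workshop data.

```r
library(dplyr)

# Hypothetical toy data (invented for this illustration)
toy <- data.frame(id = 1:4, value = c(10, 20, 30, 40))

# dplyr::filter() keeps the rows that satisfy a condition
kept <- dplyr::filter(toy, value > 15)

# stats::filter() is a different function with the same name:
# here it applies a 2-point moving average to a numeric vector
smoothed <- stats::filter(toy$value, rep(1 / 2, 2))

nrow(kept)       # rows with value > 15
class(smoothed)  # a time-series object, not a data frame
```

Writing the prefix means the code does the same thing regardless of which packages happen to be loaded, and in which order.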
2. Intro to the tidyverse
The Tidyverse A collection of packages for data manipulation, exploration, and visualization. Share a common philosophy of R programming and work in harmony. Core tidyverse packages: readr dplyr tidyr ggplot2 tibble purrr
The Tidyverse
The core packages (readr, dplyr, tidyr, ggplot2, tibble, purrr) are used across Modules 1 to 4 of this workshop (readr::read_csv, etc.).
Data Manipulation with dplyr
Very useful package for exploring and managing your data.
“A grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.”
Resources for getting started:
http://dplyr.tidyverse.org/
https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
http://r4ds.had.co.nz/transform.html
3. dplyr code structure
dplyr Key Functions select() filter() arrange() summarise() mutate() group_by() More helpful info here: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
dplyr Key Functions select() Picks variables (columns) based on their names.
dplyr Key Functions filter() Picks observations (rows) based on their values.
dplyr Key Functions arrange() Changes the ordering of the rows based on their values.
dplyr Key Functions summarise() Reduces multiple values down to a single summary value.
dplyr Key Functions mutate() Adds new variables that are functions of existing variables.
dplyr Key Functions group_by() Performs data operations on groups that are defined by variables.
Key Functions
select() Picks variables (columns) based on their names
filter() Picks observations (rows) based on their values
arrange() Changes the ordering of the rows
summarise() Reduces multiple values down to a single summary value
mutate() Adds new variables that are functions of existing variables
group_by() Performs data operations on groups that are defined by variables
Key Operator: The Pipe %>%
Passes the object on the left-hand side as the first argument of the function on the right-hand side. The goal is to make our code more efficient and easier to read.
x %>% f(y)            # is the same as f(x, y)
x %>% f(y) %>% g(z)   # is the same as g(f(x, y), z)
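A minimal runnable sketch of that equivalence, using base R functions in place of the placeholder functions f and g (the numbers are arbitrary):

```r
library(dplyr)  # attaches the %>% operator

x <- c(4, 9, 16)

# x %>% f() %>% g() reads left to right ...
piped <- x %>% sqrt() %>% sum()

# ... and is equivalent to the nested call g(f(x))
nested <- sum(sqrt(x))

# Extra arguments follow the piped object: x %>% f(y) is f(x, y)
rounded_piped  <- 3.14159 %>% round(2)
rounded_nested <- round(3.14159, 2)

identical(piped, nested)                  # TRUE
identical(rounded_piped, rounded_nested)  # TRUE
```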
Basic Structure
Use the key functions and the pipe to chain together multiple simple steps to achieve a more complicated result.
Dataset %>%
  Filter rows or select columns %>%
  Arrange or group the data %>%
  Calculate statistics or new variables of interest
Basic Structure
# Prints output to the console:
Dataset %>%
  Select rows or columns %>%
  Arrange or group the data %>%
  Calculate statistics or new variables
# Creates a new R object:
new_obj <- Dataset %>%
  Select rows or columns %>%
  Arrange or group the data %>%
  Calculate statistics or new variables
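The template can be made concrete with a small sketch. The data frame `toy` and its columns are invented stand-ins for "Dataset"; the point is only the difference between printing a result and saving it.

```r
library(dplyr)

# Hypothetical toy data standing in for "Dataset"
toy <- data.frame(id = 1:5, age = c(31, 24, 28, 40, 19))

# Prints output to the console (nothing is saved):
toy %>%
  dplyr::filter(age >= 20) %>%
  dplyr::arrange(age) %>%
  dplyr::mutate(age_decades = age / 10)

# Creates a new R object (nothing is printed):
new_obj <- toy %>%
  dplyr::filter(age >= 20) %>%
  dplyr::arrange(age) %>%
  dplyr::mutate(age_decades = age / 10)

nrow(new_obj)  # 4 rows remain after dropping age < 20
```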
Births Data Example You are interested in exploring the relationship between early prenatal care (exposure) and preterm birth (outcome). Let’s get started by preparing our data for analysis and exploring the distribution of key variables: Data manipulation: Create an analytic dataset that is a subset of the observations and variables in the original births data, specific to our research question. Summary statistics: Explore the variables corresponding to early prenatal care, preterm birth, and maternal age.
4. Data manipulation examples
Births Data: Filter Example Only include singleton births with non-missing gestational age and no congenital anomalies. Pseudo-code: use the births dataset, then filter to observations with singleton births, non-missing wksgest, and no congenital anomalies.
Births Data: Filter Example
# my code (prints the filtered data to the console)
births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1)
Births Data: Filter Example
# my code (saves the filtered data as a new object)
births_sample <- births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1)
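To see what the filter keeps and drops, here is a sketch on a five-row stand-in for the births data. The column names match the slides, but the values are invented for illustration.

```r
library(dplyr)

# Hypothetical mini version of the births data (values invented)
toy_births <- data.frame(
  wksgest      = c(39, NA, 36, 40, 38),
  plur         = c(1, 1, 2, 1, 1),
  no_anomalies = c(1, 1, 1, 0, 1)
)

# Same condition as on the slide, joined with &
with_and <- toy_births %>%
  dplyr::filter(!is.na(wksgest) & plur == 1 & no_anomalies == 1)

# Conditions separated by commas are also combined with "and"
with_commas <- toy_births %>%
  dplyr::filter(!is.na(wksgest), plur == 1, no_anomalies == 1)

nrow(with_and)                    # 2: only rows 1 and 5 pass all three conditions
all.equal(with_and, with_commas)  # TRUE
```

Comparing nrow() before and after a filter is a quick way to confirm that the number of excluded observations is plausible.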
Births Data: Select Example Can pipe to select() to also include only variables of interest for analysis. Pseudo-code: use the births dataset, then filter to observations with singleton births, non-missing wksgest, and no congenital anomalies, then select variables for analysis
Births Data: Select Example
Can pipe to select() to also include only variables of interest for analysis.
# my code
births_sample <- births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1) %>%
  dplyr::select(x, cores, dob, sex, mrace, methnic, meduc, mage, marital, visits,
                mdif, cigdur, wksgest, visit_cat, preterm, preterm_f)
Births Data: Mutate Example Can pipe to mutate() to create a new variable for analysis: pnc5. Corresponds to receipt of early prenatal care in the first 5 months of pregnancy. Pseudo-code: use the births dataset, then filter to observations with singleton births, non-missing wksgest, and no congenital anomalies, then select variables for analysis, then mutate to add new variables
Births Data: Mutate Example
Can pipe to mutate() to create new variables for analysis: pnc5 (numeric) and pnc5_f (factor). Key exposure: receipt of early prenatal care in the first 5 months of pregnancy.
# my code
births_sample <- births %>%
  dplyr::filter(…) %>%
  dplyr::select(…) %>%
  dplyr::mutate(pnc5 = if_else(mdif<=5, true = 1, false = 0),
                pnc5_f = factor(pnc5, levels = c(0,1),
                                labels = c("No Early PNC", "Early PNC")))
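A sketch of how the two new variables behave, including what happens when mdif is missing. The mdif values below are invented, and the sketch assumes NA is how missing values are stored.

```r
library(dplyr)

# Hypothetical mdif values; NA stands for a missing value
toy <- data.frame(mdif = c(2, 5, 7, NA))

toy_pnc <- toy %>%
  dplyr::mutate(
    # if_else() returns NA when the condition itself is NA
    pnc5 = dplyr::if_else(mdif <= 5, true = 1, false = 0),
    # factor() attaches readable labels to the 0/1 codes
    pnc5_f = factor(pnc5, levels = c(0, 1),
                    labels = c("No Early PNC", "Early PNC"))
  )

toy_pnc$pnc5                  # 1 1 0 NA
as.character(toy_pnc$pnc5_f)  # "Early PNC" "Early PNC" "No Early PNC" NA
```

Missing exposure values stay missing in both variables, which is why later summaries filter on !is.na(pnc5_f).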
Beauty of the Pipe
The pipe chains steps together without naming intermediate objects. Without it, each step needs its own named object:
filtered_sample <- dplyr::filter(births, …)
selected_sample <- dplyr::select(filtered_sample, …)
sample_with_pnc <- dplyr::mutate(selected_sample, …)
For this example, ordering is flexible as long as select() includes the variables needed for later operations:
filter() %>% mutate() %>% select()   # select() must include the new pnc5 variables
select() %>% mutate() %>% filter()   # select() must include plur, no_anomalies
Check your code
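One possible way to check a newly created variable is to cross-tabulate it against the variable it was derived from. The sketch below does this with dplyr::count() on invented mdif values; it is one option among several and not necessarily the check shown in the workshop.

```r
library(dplyr)

# Hypothetical mdif values (invented), with pnc5 built as in the mutate example
toy_pnc <- data.frame(mdif = c(1, 3, 3, 5, 6, 9, NA)) %>%
  dplyr::mutate(
    pnc5 = dplyr::if_else(mdif <= 5, true = 1, false = 0),
    pnc5_f = factor(pnc5, levels = c(0, 1),
                    labels = c("No Early PNC", "Early PNC"))
  )

# One row per observed combination, with its frequency in column n
check <- toy_pnc %>%
  dplyr::count(mdif, pnc5, pnc5_f)

check
```

Each mdif value should map to exactly one pnc5 value, and the cutoff at 5 should be visible in the table.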
5. Summary statistics examples
Births Data: Summarize Prenatal Care Now that you’ve prepared your analytic dataset, you are interested in the numbers of births in each prenatal care group. Pseudo-code: use the sample dataset, then group by prenatal care, then summarize numbers of observations within the group
Births Data: Summarize Prenatal Care
# my code
births_sample %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n())
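A sketch of the same group-and-count pattern on invented data. It also shows dplyr::count(), a shorthand that is not used in the slides but gives the same counts.

```r
library(dplyr)

# Hypothetical exposure factor with one missing value (invented)
toy <- data.frame(
  pnc5_f = factor(c("Early PNC", "Early PNC", "No Early PNC", NA, "Early PNC"),
                  levels = c("No Early PNC", "Early PNC"))
)

# group_by() + summarise(), naming the count column n
by_group <- toy %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n = n())

# count() is a shorthand for the same operation
by_count <- toy %>%
  dplyr::count(pnc5_f)

by_group$n  # 1 3 1: groups follow the factor levels, with NA as its own group last
```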
Functions for summarise() See help page: ?dplyr::summarise
Births Data: Summarize Maternal Age
You would also like to know the average maternal age in each pnc5 group.
Pseudo-code: use the sample dataset, then group by prenatal care, then summarize numbers of observations and mean age by group
# my code
births_sample %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n(), mean(mage, na.rm=TRUE))
# can filter out missing PNC and name the summary columns
births_sample %>%
  dplyr::filter(!is.na(pnc5_f)) %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n = n(), mean_age = mean(mage, na.rm=TRUE))
Births Data: Summarize Preterm Birth Now let’s explore the risk of preterm birth by prenatal care group. Pseudo-code: use the sample dataset, then group by prenatal care, then summarize numbers of observations, mean age, preterm
# my code
births_sample %>%
  dplyr::filter(!is.na(pnc5_f)) %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n = n(),
                   mean_age = mean(mage, na.rm=TRUE),
                   prop_preterm = mean(preterm, na.rm=TRUE))
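The mean of a 0/1 variable is the proportion of 1s, which is why mean(preterm) estimates risk. The sketch below uses invented values and adds a hypothetical n_nonmissing column to show that, with na.rm=TRUE, the denominator is the number of non-missing outcomes rather than n.

```r
library(dplyr)

# Hypothetical data (invented): preterm coded 0/1 with one missing outcome
toy <- data.frame(
  pnc5_f  = c("Early PNC", "Early PNC", "Early PNC", "Early PNC",
              "No Early PNC", "No Early PNC"),
  preterm = c(0, 1, 0, NA, 1, 0)
)

summary_tbl <- toy %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(
    n            = n(),                        # all rows in the group
    n_nonmissing = sum(!is.na(preterm)),       # rows that contribute to the mean
    prop_preterm = mean(preterm, na.rm = TRUE) # 1s divided by n_nonmissing
  )

summary_tbl
```

In the "Early PNC" group the proportion is 1/3, not 1/4, because the missing outcome is dropped from the denominator.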
Births Data: Summarize by Age Alternatively, we could look at the proportions of early prenatal care and preterm birth by age. Pseudo-code: use the sample dataset, then group by age, then summarize proportions with prenatal care and preterm
# my code
births_sample %>%
  dplyr::group_by(mage) %>%
  dplyr::summarise(n = n(),
                   prop_pnc5 = mean(pnc5, na.rm=TRUE),
                   prop_preterm = mean(preterm, na.rm=TRUE)) %>%
  dplyr::filter(n>30)  # extra step: exclude ages with little data
# save output as an object
age_summary <- births_sample %>%
  dplyr::group_by(mage) %>%
  dplyr::summarise(n = n(),
                   prop_pnc5 = mean(pnc5, na.rm=TRUE),
                   prop_preterm = mean(preterm, na.rm=TRUE)) %>%
  dplyr::filter(n>30)
View(age_summary)  # Or, click on it in the Environment tab
Preview: Pipe dplyr code to ggplot!
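A minimal sketch of what piping into ggplot can look like, assuming the ggplot2 package is installed. The summary table and the choice of a scatter plot are invented for illustration; the plot in the workshop may differ.

```r
library(dplyr)
library(ggplot2)

# Hypothetical age summary (invented numbers), shaped like age_summary
toy_age_summary <- data.frame(
  mage         = c(20, 25, 30, 35),
  n            = c(120, 200, 180, 90),
  prop_preterm = c(0.11, 0.09, 0.08, 0.10)
)

# dplyr steps are chained with %>%; ggplot layers are added with +
p <- toy_age_summary %>%
  dplyr::filter(n > 30) %>%
  ggplot2::ggplot(ggplot2::aes(x = mage, y = prop_preterm)) +
  ggplot2::geom_point() +
  ggplot2::labs(x = "Maternal age", y = "Proportion preterm")

p
```

Note the switch from %>% to + once the data reach ggplot().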
6. Other useful data wrangling
Merging Data Key data manipulation task. Base R function: merge() Introducing the dplyr verbs for joining: inner_join() semi_join() left_join() anti_join() right_join() full_join()
Merging Data
Mutating joins (add columns from the second dataframe): inner_join(), left_join(), right_join(), full_join()
Filtering joins (keep or drop rows of the first dataframe based on matches in the second): semi_join(), anti_join()
AWESOME (animated!) resource: https://github.com/gadenbuie/tidy-animated-verbs#readme
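The difference between the join verbs is easiest to see on two tiny tables. The data frames `x` and `y` below are invented for illustration.

```r
library(dplyr)

# Hypothetical tables sharing the key column "id"
x <- data.frame(id = c(1, 2, 3), x_val = c("a", "b", "c"))
y <- data.frame(id = c(2, 3, 4), y_val = c("B", "C", "D"))

# Mutating joins add y's columns to x
left  <- dplyr::left_join(x, y, by = "id")   # all rows of x; y_val is NA for id 1
inner <- dplyr::inner_join(x, y, by = "id")  # only ids found in both (2 and 3)
full  <- dplyr::full_join(x, y, by = "id")   # all ids from either table (1 to 4)

# Filtering joins keep only x's columns and use y to decide which rows stay
semi <- dplyr::semi_join(x, y, by = "id")    # rows of x with a match in y
anti <- dplyr::anti_join(x, y, by = "id")    # rows of x without a match in y

c(left = nrow(left), inner = nrow(inner), full = nrow(full),
  semi = nrow(semi), anti = nrow(anti))
```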
Merging Example The variable cores is the mother’s NC county of residence. This is a numeric code, and you would like to add a column for the county name corresponding to that code. Example of a mutating join.
Read in County Names
Let’s read in a spreadsheet of NC county names with the corresponding numeric codes:
nc_counties <- readr::read_csv("http://bit.ly/nc_county_names")
Our two dataframes to be merged: births_sample and nc_counties
Merge in County Names
births_counties <- dplyr::left_join(births_sample, nc_counties, by = "cores")
# All rows from x, and all columns from x and y
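After a merge it is worth confirming that no rows were gained or lost and that every code found a match. A sketch on invented tables (the county names and the code 99 are made up; only the column names cores and county_name follow the slides):

```r
library(dplyr)

# Hypothetical stand-ins for births_sample and nc_counties
toy_births   <- data.frame(birth_id = 1:4, cores = c(1, 2, 2, 99))
toy_counties <- data.frame(cores = c(1, 2, 3),
                           county_name = c("County A", "County B", "County C"))

joined <- dplyr::left_join(toy_births, toy_counties, by = "cores")

# A left join on a unique key should not change the number of rows
nrow(joined) == nrow(toy_births)

# Codes without a match get NA for the new column
sum(is.na(joined$county_name))

# anti_join() lists the rows whose code was not found
unmatched <- dplyr::anti_join(toy_births, toy_counties, by = "cores")
unmatched
```

If the key were duplicated in the lookup table, the left join would return more rows than the original data, so the row-count check catches that problem too.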
births_counties %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise_at(c("pnc5", "preterm"), mean, na.rm=TRUE)
Advanced topic: scoped variants of dplyr verbs. If interested, see the help page ?dplyr::summarise_at. Note that we could have just used summarise() with more typing.
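For comparison, the sketch below computes the same summary three ways on invented data: with summarise_at(), with plain summarise(), and with across(), which dplyr 1.0.0 and later recommend in place of the scoped variants. The across() form is an addition here and is not shown in the slides.

```r
library(dplyr)

# Hypothetical data (invented values)
toy <- data.frame(
  county_name = c("A", "A", "B", "B"),
  pnc5        = c(1, 0, 1, 1),
  preterm     = c(0, 1, NA, 0)
)

# Scoped variant, as on the slide
scoped <- toy %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise_at(c("pnc5", "preterm"), mean, na.rm = TRUE)

# Plain summarise(): more typing, same result
explicit <- toy %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise(pnc5 = mean(pnc5, na.rm = TRUE),
                   preterm = mean(preterm, na.rm = TRUE))

# across(): requires dplyr >= 1.0.0
with_across <- toy %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise(dplyr::across(c(pnc5, preterm), ~ mean(.x, na.rm = TRUE)))

all.equal(as.data.frame(scoped), as.data.frame(with_across))  # TRUE
```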
Case_when for complicated conditional logic
We would like to create a factor variable for race/ethnicity categories to be used in analysis:
White non-Hispanic (“WnH”)
White Hispanic (“WH”)
African American (“AA”)
American Indian or Alaska Native (“AI/AN”)
Other (“Other”)
From data dictionary: [table of mrace and methnic codes]
Motivation: don’t do this!
ifelse(mrace==1 & methnic=="N", "WnH",
  ifelse(mrace==1 & methnic=="Y", "WH",
    ifelse(mrace==2, "AA",
      ifelse(mrace==3, "AI/AN",
        ifelse(mrace==4 | (mrace==1 & methnic=="U"), "Other", NA)))))
Case_when example: Race/ethnicity
dplyr::case_when(
  mrace==1 & methnic=="N" ~ "WnH",
  mrace==1 & methnic=="Y" ~ "WH",
  mrace==2 ~ "AA",
  mrace==3 ~ "AI/AN",
  mrace==4 | (mrace==1 & methnic=="U") ~ "Other")
Case_when example: Race/ethnicity
births_final <- births_counties %>%
  dplyr::mutate(
    raceeth_f = factor(
      dplyr::case_when(
        mrace==1 & methnic=="N" ~ "WnH",
        mrace==1 & methnic=="Y" ~ "WH",
        mrace==2 ~ "AA",
        mrace==3 ~ "AI/AN",
        mrace==4 | (mrace==1 & methnic=="U") ~ "Other")))
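A sketch of how case_when() resolves each row, on invented combinations of mrace and methnic. The explicit levels argument is an optional addition, not part of the slide code: without it, factor() orders the levels alphabetically.

```r
library(dplyr)

# Hypothetical combinations of the two source variables (invented)
toy <- data.frame(
  mrace   = c(1, 1, 1, 2, 3, 4, NA),
  methnic = c("N", "Y", "U", "N", "N", "N", "N")
)

toy_final <- toy %>%
  dplyr::mutate(
    raceeth_f = factor(
      dplyr::case_when(
        mrace == 1 & methnic == "N" ~ "WnH",
        mrace == 1 & methnic == "Y" ~ "WH",
        mrace == 2 ~ "AA",
        mrace == 3 ~ "AI/AN",
        mrace == 4 | (mrace == 1 & methnic == "U") ~ "Other"),
      # Optional: fix the level order instead of the alphabetical default
      levels = c("WnH", "WH", "AA", "AI/AN", "Other"))
  )

# Conditions are checked top to bottom; rows matching none become NA
as.character(toy_final$raceeth_f)
```

The row with missing mrace matches no condition and is left as NA rather than being assigned to "Other".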
Bonus Material for Longitudinal Data
Other dplyr favorites: lead() and lag()
Find the next value (at time+1) or the previous value (at time-1)
More info here: https://dplyr.tidyverse.org/reference/lead-lag.html
tidyr package to gather() and spread()
Make “wide” data longer or make “long” data wider
More info here: https://tidyr.tidyverse.org/
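A short sketch of both ideas on an invented long-format dataset, assuming the tidyr package is installed. It uses pivot_wider() and pivot_longer(), which tidyr 1.0.0 and later recommend in place of spread() and gather(); this substitution is an addition to the slide content.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: 2 people measured at 3 time points
long <- data.frame(
  id    = c(1, 1, 1, 2, 2, 2),
  time  = c(1, 2, 3, 1, 2, 3),
  value = c(5, 7, 6, 2, 4, 8)
)

# lag()/lead() within each person, after sorting by time
with_lags <- long %>%
  dplyr::group_by(id) %>%
  dplyr::arrange(time, .by_group = TRUE) %>%
  dplyr::mutate(prev_value = dplyr::lag(value),
                next_value = dplyr::lead(value)) %>%
  dplyr::ungroup()

# Long to wide: one row per person, one column per time point
wide <- long %>%
  tidyr::pivot_wider(names_from = time, values_from = value, names_prefix = "t")

# Wide back to long
back_to_long <- wide %>%
  tidyr::pivot_longer(cols = starts_with("t"), names_to = "time", values_to = "value")

with_lags$prev_value  # NA 5 7 NA 2 4
```

Grouping before lag() matters: without it, the first observation of person 2 would borrow the last value of person 1.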
Last step: Save dataframe as RDS
saveRDS(births_final, file = "births_final.rds")
# version of the data you’ll start from in module 3
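A sketch of the save-and-reload round trip, using a temporary file and an invented data frame so that nothing in the working directory is overwritten. readRDS() is the base R counterpart for loading the file again.

```r
# Hypothetical data frame with a factor column (invented)
toy <- data.frame(id = 1:3, raceeth_f = factor(c("AA", "WH", "WnH")))

# Write to a temporary location instead of the working directory
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path)

# Read it back in, as you would at the start of a later session
restored <- readRDS(path)

identical(toy, restored)       # TRUE
is.factor(restored$raceeth_f)  # TRUE: column types survive the round trip
```

Because RDS stores the R object itself, factor variables and their level order are preserved, which a plain CSV export would not guarantee.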
Module 2 Code Summary for Births Data
## start with births data from module 1
## subset
births_sample <- births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1) %>%
  dplyr::select(x, cores, dob, sex, mrace, methnic, meduc, mage, marital, visits,
                mdif, cigdur, wksgest, visit_cat, preterm, preterm_f) %>%
  dplyr::mutate(pnc5 = if_else(mdif<=5, true = 1, false = 0),
                pnc5_f = factor(pnc5, levels = c(0,1),
                                labels = c("No Early PNC", "Early PNC")))
## NC counties
nc_counties <- readr::read_csv("http://bit.ly/nc_county_names")
births_counties <- dplyr::left_join(births_sample, nc_counties, by = "cores")
## race/ethnicity
births_final <- births_counties %>%
  dplyr::mutate(raceeth_f = factor(dplyr::case_when(
    mrace==1 & methnic=="N" ~ "WnH",
    mrace==1 & methnic=="Y" ~ "WH",
    mrace==2 ~ "AA",
    mrace==3 ~ "AI/AN",
    mrace==4 | (mrace==1 & methnic=="U") ~ "Other")))
## save data for module 3
saveRDS(births_final, file = "births_final.rds")