R for Epi Workshop Module 2: Data Manipulation & Summary Statistics

Sara Levintow, MSPH PhD Candidate in Epidemiology UNC Gillings School of Global Public Health

Outline
1. Intro to packages
2. Intro to the tidyverse
3. dplyr code structure: key functions & the pipe
4. Data manipulation examples
5. Summary statistics examples
6. Other useful data wrangling

1. Intro to packages

Getting started with packages
Packages are extensions to base R. They contain additional functions, documentation for using them, and sample data. Packages are available from the Comprehensive R Archive Network (CRAN). The “tidyverse” is a set of packages for data manipulation, exploration, and visualization. They share a common design and work in harmony.

# Install and load the package 'tidyverse'
install.packages('tidyverse')  # only need to run once
library(tidyverse)             # run at start of every R session to use

Syntax for packages
Developers of different packages may use the same function name. Good coding practice to specify package::function() to be explicit:

# Instead of:
filter()
# Specify the package for that function:
dplyr::filter()
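A minimal sketch of why the package:: prefix matters, assuming dplyr is installed. The data frame `toy` and its values are hypothetical. Base R ships its own stats::filter(), which does something entirely different from dplyr::filter(); once dplyr is loaded, an unqualified filter() refers to the dplyr version.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(id = 1:4, plur = c(1, 2, 1, 1))

# dplyr::filter() keeps the rows of a data frame that satisfy a condition
singletons <- dplyr::filter(toy, plur == 1)

# stats::filter() is an unrelated base R function: it applies a linear
# filter (here, a 3-point moving average) to a numeric series
smoothed <- stats::filter(1:5, rep(1/3, 3))

nrow(singletons)   # number of rows kept by dplyr::filter()
class(smoothed)    # stats::filter() returns a time series object
```

Writing the prefix every time costs a few keystrokes but removes any doubt about which function is being called.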

2. Intro to the tidyverse

The Tidyverse
A collection of packages for data manipulation, exploration, and visualization. Share a common philosophy of R programming and work in harmony.
Core tidyverse packages: readr, dplyr, tidyr, ggplot2, tibble, purrr
Covered across workshop modules: Module 1, Module 2, Module 3, Module 4 (readr::read_csv, etc.)

Data Manipulation with dplyr
Very useful package for exploring and managing your data. “A grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.” Resources for getting started: cheatsheet.pdf

3. dplyr code structure

dplyr Key Functions
select() filter() arrange() summarise() mutate() group_by()
More helpful info here:

dplyr Key Functions select()
Picks variables (columns) based on their names.

dplyr Key Functions filter()
Picks observations (rows) based on their values.

dplyr Key Functions arrange()
Changes the ordering of the rows based on their values.

dplyr Key Functions summarise()
Reduces multiple values down to a single summary value.

dplyr Key Functions mutate()
Adds new variables that are functions of existing variables.

dplyr Key Functions group_by()
Performs data operations on groups that are defined by variables.

Key Functions
select(): Picks variables (columns) based on their names
filter(): Picks observations (rows) based on their values
arrange(): Changes the ordering of the rows
summarise(): Reduces multiple values down to a single summary value
mutate(): Adds new variables that are functions of existing variables
group_by(): Performs data operations on groups that are defined by variables
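The six verbs can be tried one at a time on a small data frame. This is a sketch on hypothetical toy data (the column names mimic the births data, but the values and the `county` column are made up), assuming dplyr is installed.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  id      = 1:6,
  mage    = c(22, 35, 28, 19, 41, 30),
  preterm = c(0, 1, 0, 0, 1, 0),
  county  = c("A", "A", "B", "B", "B", "A")
)

dplyr::select(toy, id, mage)                   # keep two columns
dplyr::filter(toy, mage >= 30)                 # keep rows where mage is at least 30
dplyr::arrange(toy, dplyr::desc(mage))         # sort rows, oldest first
dplyr::mutate(toy, mage_sq = mage^2)           # add a new column
dplyr::summarise(toy, mean_age = mean(mage))   # collapse to a single row

# group_by() makes summarise() return one row per group
by_county <- dplyr::summarise(dplyr::group_by(toy, county), n = dplyr::n())
by_county
```

None of these calls modifies `toy`; each returns a new data frame, which is what makes the verbs easy to chain.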

Key Operator: The Pipe %>%
Enables you to pass the object on the left-hand side as the first argument of the function on the right-hand side. The goal is to make our code more efficient and easier to read.

x %>% f(y)            # is the same as f(x, y)
x %>% f(y) %>% g(z)   # is the same as g(f(x, y), z)
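The equivalence between piped and nested calls can be checked directly. A small sketch, assuming dplyr is installed (loading it makes %>% available); the vector `x` is made up.

```r
library(dplyr)  # makes the %>% operator available

x <- c(1.234, 5.678, NA)

# Nested calls read inside-out
nested <- round(mean(x, na.rm = TRUE), 1)

# Piped calls read left to right, one step at a time
piped <- x %>% mean(na.rm = TRUE) %>% round(1)

identical(nested, piped)
```

Both versions compute the same value; the piped form simply lists the steps in the order they happen.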

Basic Structure
Use the key functions and pipe to chain together multiple simple steps to achieve a more complicated result.

Dataset %>%
  Select columns or filter rows %>%
  Arrange or group the data %>%
  Calculate statistics or new variables of interest

Basic Structure

# Prints output to the console:
Dataset %>%
  Select rows or columns %>%
  Arrange or group the data %>%
  Calculate statistics or new variables

# Creates a new R object:
new_obj <- Dataset %>%
  Select rows or columns %>%
  Arrange or group the data %>%
  Calculate statistics or new variables

Births Data Example
You are interested in exploring the relationship between early prenatal care (exposure) and preterm birth (outcome). Let’s get started by preparing our data for analysis and exploring the distribution of key variables:
Data manipulation: Create an analytic dataset that is a subset of the observations and variables in the original births data, specific to our research question.
Summary statistics: Explore the variables corresponding to early prenatal care, preterm birth, and maternal age.

4. Data manipulation examples

Births Data: Filter Example
Only include singleton births with non-missing gestational age and no congenital anomalies.
Pseudo-code: use the births dataset, then filter to observations with singleton births, non-missing wksgest, and no congenital anomalies.

# my code (prints the filtered data to the console)
births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1)

# my code (saves the filtered data as a new object)
births_sample <- births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1)
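One detail worth seeing on toy data: filter() keeps a row only when the condition is TRUE, so a row where the condition evaluates to NA is dropped as well. The sketch below uses hypothetical values, assumes dplyr is installed, and also shows that separating conditions with commas behaves like joining them with &.

```r
library(dplyr)

# Hypothetical toy data; row 5 has a missing plur value
toy <- tibble(
  id           = 1:5,
  wksgest      = c(39, NA, 36, 40, 38),
  plur         = c(1, 1, 2, 1, NA),
  no_anomalies = c(1, 1, 1, 0, 1)
)

# Conditions joined with &
with_amp <- toy %>%
  dplyr::filter(!is.na(wksgest) & plur == 1 & no_anomalies == 1)

# The same conditions separated by commas
with_comma <- toy %>%
  dplyr::filter(!is.na(wksgest), plur == 1, no_anomalies == 1)

with_amp$id     # only row 1 passes every condition
```

Row 5 is excluded even though nothing about it is known to violate the criteria, because `plur == 1` is NA there. It is worth counting rows before and after a filter to see how many were lost to missingness.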

Births Data: Select Example
Can pipe to select() to also include only variables of interest for analysis.
Pseudo-code: use the births dataset, then filter to observations with singleton births, non-missing wksgest, and no congenital anomalies, then select variables for analysis.

# my code
births_sample <- births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1) %>%
  dplyr::select(x, cores, dob, sex, mrace, methnic, meduc, mage, marital,
                visits, mdif, cigdur, wksgest, visit_cat, preterm, preterm_f)

Births Data: Mutate Example
Can pipe to mutate() to create new variables for analysis: pnc5 (numeric) and pnc5_f (factor).
Key exposure: receipt of early prenatal care in the first 5 months of pregnancy.
Pseudo-code: use the births dataset, then filter to observations with singleton births, non-missing wksgest, and no congenital anomalies, then select variables for analysis, then mutate to add new variables.

# my code
births_sample <- births %>%
  dplyr::filter(…) %>%
  dplyr::select(…) %>%
  dplyr::mutate(pnc5 = if_else(mdif<=5, true = 1, false = 0),
                pnc5_f = factor(pnc5, levels = c(0,1),
                                labels = c("No Early PNC", "Early PNC")))
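A sketch of how the two new variables behave, on hypothetical mdif values and assuming dplyr is installed. The point to notice is that a missing mdif stays missing in both pnc5 and pnc5_f rather than being coded as "No Early PNC".

```r
library(dplyr)

# Hypothetical toy data; the last mdif value is missing
toy <- tibble(id = 1:4, mdif = c(2, 5, 7, NA))

toy_pnc <- toy %>%
  dplyr::mutate(
    # 1 if care began in month 5 or earlier, 0 otherwise, NA if mdif is NA
    pnc5   = if_else(mdif <= 5, true = 1, false = 0),
    # Labelled factor version of the same indicator
    pnc5_f = factor(pnc5, levels = c(0, 1),
                    labels = c("No Early PNC", "Early PNC"))
  )

toy_pnc
table(toy_pnc$pnc5_f, useNA = "ifany")  # tabulate, showing the NA too
```

Keeping both a numeric and a factor version is convenient: the numeric one can be averaged to get a proportion, and the factor one gives readable labels in tables and plots.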

Beauty of the Pipe
The pipe chains steps together without naming intermediate objects. Without it, each step needs its own object:

filtered_sample <- dplyr::filter(births, …)
selected_sample <- dplyr::select(filtered_sample, …)
sample_with_pnc <- dplyr::mutate(selected_sample, …)

For this example, ordering is flexible as long as select() includes variables needed for future operations:

filter() %>% mutate() %>% select()  # select() must include the new pnc5 vars
select() %>% mutate() %>% filter()  # select() must include plur, no_anomalies

Check your code

5. Summary statistics examples

Births Data: Summarize Prenatal Care
Now that you’ve prepared your analytic dataset, you are interested in the numbers of births in each prenatal care group.
Pseudo-code: use the sample dataset, then group by prenatal care, then summarize numbers of observations within the group.

# my code
births_sample %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n())
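The group-then-count pattern can be sketched on hypothetical toy data, assuming dplyr is installed. dplyr also provides count(), which gives the same counts in one step; it is shown here only for comparison and is not part of the original slides.

```r
library(dplyr)

# Hypothetical toy data with one missing exposure value
toy <- tibble(
  pnc5_f = factor(
    c("Early PNC", "Early PNC", "No Early PNC", NA, "Early PNC"),
    levels = c("No Early PNC", "Early PNC")
  )
)

# group_by() + summarise(), naming the count column n
by_group <- toy %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n = n())

# count() is a shortcut for the same operation
counted <- toy %>% dplyr::count(pnc5_f)

by_group
```

Missing values form their own group in the output, which makes it easy to spot how many observations lack exposure data.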

Functions for summarise()
See help page: ?dplyr::summarise

Births Data: Summarize Maternal Age
You would also like to know the average maternal age in each pnc5 group.
Pseudo-code: use the sample dataset, then group by prenatal care, then summarize numbers of observations and mean age by group.

# my code
births_sample %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n(), mean(mage, na.rm=T))

# can filter out missing PNC and name the summary columns
births_sample %>%
  dplyr::filter(!is.na(pnc5_f)) %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n = n(), mean_age = mean(mage, na.rm=T))

Births Data: Summarize Preterm Birth
Now let’s explore the risk of preterm birth by prenatal care group.
Pseudo-code: use the sample dataset, then filter out missing prenatal care, then group by prenatal care, then summarize numbers of observations, mean age, and proportion preterm.

# my code
births_sample %>%
  dplyr::filter(!is.na(pnc5_f)) %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n = n(),
                   mean_age = mean(mage, na.rm=T),
                   prop_preterm = mean(preterm, na.rm=T))
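The prop_preterm column works because the mean of a 0/1 indicator equals the proportion of 1s. A base R sketch on made-up values shows this, and shows that na.rm = TRUE removes missing values from the denominator as well as the numerator.

```r
# Hypothetical 0/1 outcome with one missing value
preterm <- c(0, 1, 0, NA, 1)

# Mean of the indicator, ignoring the missing value
prop_by_mean <- mean(preterm, na.rm = TRUE)

# The same quantity computed by hand: events / non-missing observations
prop_by_hand <- sum(preterm == 1, na.rm = TRUE) / sum(!is.na(preterm))

c(prop_by_mean, prop_by_hand)

# Without na.rm = TRUE, a single NA makes the whole mean NA
mean(preterm)
```

This is a complete-case proportion: the 2 events are divided by the 4 births with a known outcome, not by all 5.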

Births Data: Summarize by Age
Alternatively, we could look at the proportions of early prenatal care and preterm birth by age. Pseudo-code: use the sample dataset, then group by age, then summarize proportions with prenatal care and preterm

# my code
births_sample %>%
  dplyr::group_by(mage) %>%
  dplyr::summarise(n = n(),
                   prop_pnc5 = mean(pnc5, na.rm=T),
                   prop_preterm = mean(preterm, na.rm=T)) %>%
  dplyr::filter(n>30)  # extra step: exclude ages with little data

# save output as an object
age_summary <- births_sample %>%
  dplyr::group_by(mage) %>%
  dplyr::summarise(n = n(),
                   prop_pnc5 = mean(pnc5, na.rm=T),
                   prop_preterm = mean(preterm, na.rm=T)) %>%
  dplyr::filter(n>30)

View(age_summary)  # Or, click on it in the Environment tab

Preview: Pipe dplyr code to ggplot!

6. Other useful data wrangling

Merging Data
Key data manipulation task. Base R function: merge()
Introducing the dplyr verbs for joining:
Mutating joins: inner_join() left_join() right_join() full_join()
Filtering joins: semi_join() anti_join()
AWESOME (animated!) resource:

Merging Example The variable cores is the mother’s NC county of residence. This is a numeric code, and you would like to add a column for the county name corresponding to that code. Example of a mutating join.

Read in County Names
Let’s read in a spreadsheet of NC county names with the corresponding numeric codes:
nc_counties <- readr::read_csv("

Our two dataframes to be merged:

Merge in County Names
births_counties <- dplyr::left_join(births_sample, nc_counties, by = "cores")
# All rows from x, and all columns from x and y
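A sketch of what left_join() does with unmatched keys, on hypothetical toy data (the codes and county names below are made up and are not real NC county codes), assuming dplyr is installed. anti_join(), one of the filtering joins, is a convenient check for codes that found no match.

```r
library(dplyr)

# Hypothetical toy data: code 99 has no entry in the lookup table
toy_births   <- tibble(id = 1:3, cores = c(1, 2, 99))
toy_counties <- tibble(cores = c(1, 2, 3),
                       county_name = c("County A", "County B", "County C"))

# Mutating join: keeps every row of toy_births, adds county_name
joined <- dplyr::left_join(toy_births, toy_counties, by = "cores")
joined

# Filtering join: rows of toy_births with no match in toy_counties
unmatched <- dplyr::anti_join(toy_births, toy_counties, by = "cores")
unmatched
```

The number of rows is unchanged by the left join here, and the unmatched birth gets NA for county_name. Comparing row counts before and after a join is a quick safeguard against duplicated keys in the lookup table.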

births_counties %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise_at(c("pnc5", "preterm"), mean, na.rm=T)

Advanced topic: scoped variants of dplyr verbs. If interested, see more here.
Note we could have just used summarise() with more typing.
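For comparison, here is the "more typing" version next to an across() version. across() is not used in the original slides; it was introduced in dplyr 1.0.0, and the dplyr documentation now describes the scoped verbs such as summarise_at() as superseded by it. The sketch uses hypothetical toy data and assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble(
  county_name = c("A", "A", "B", "B"),
  pnc5        = c(1, 0, 1, NA),
  preterm     = c(0, 1, 0, 0)
)

# Plain summarise(): one line per summary column
explicit <- toy %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise(pnc5    = mean(pnc5, na.rm = TRUE),
                   preterm = mean(preterm, na.rm = TRUE))

# across(): apply the same function to several columns at once
with_across <- toy %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise(across(c(pnc5, preterm), ~ mean(.x, na.rm = TRUE)))

with_across
```

Both approaches return the same table; across() pays off as the number of columns to summarise grows.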

Case_when for complicated conditional logic
We would like to create a factor variable for race/ethnicity categories to be used in analysis:
White non-Hispanic (“WnH”)
White Hispanic (“WH”)
African American (“AA”)
American Indian or Alaska Native (“AI/AN”)
Other (“Other”)
From data dictionary:

Motivation – don’t do this!
ifelse(mrace==1 & methnic=="N", "WnH",
  ifelse(mrace==1 & methnic=="Y", "WH",
    ifelse(mrace==2, "AA",
      ifelse(mrace==3, "AI/AN",
        ifelse(mrace==4 | (mrace==1 & methnic=="U"), "Other", NA)))))

Case_when example: Race/ethnicity
dplyr::case_when(
  mrace==1 & methnic=="N" ~ "WnH",
  mrace==1 & methnic=="Y" ~ "WH",
  mrace==2 ~ "AA",
  mrace==3 ~ "AI/AN",
  mrace==4 | (mrace==1 & methnic=="U") ~ "Other")

Case_when example: Race/ethnicity
births_final <- births_counties %>%
  dplyr::mutate(
    raceeth_f = factor(
      dplyr::case_when(
        mrace==1 & methnic=="N" ~ "WnH",
        mrace==1 & methnic=="Y" ~ "WH",
        mrace==2 ~ "AA",
        mrace==3 ~ "AI/AN",
        mrace==4 | (mrace==1 & methnic=="U") ~ "Other")
    )
  )
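A sketch of the same case_when() logic on hypothetical toy data, assuming dplyr is installed. Two behaviors are worth seeing: each row takes the value of the first condition that is TRUE, and a row matching no condition becomes NA. The code 9 below is made up to trigger that case.

```r
library(dplyr)

# Hypothetical toy data; mrace == 9 matches none of the conditions
toy <- tibble(
  mrace   = c(1, 1, 2, 3, 4, 1, 9),
  methnic = c("N", "Y", "N", "N", "U", "U", "N")
)

toy_race <- toy %>%
  dplyr::mutate(
    raceeth = dplyr::case_when(
      mrace == 1 & methnic == "N" ~ "WnH",
      mrace == 1 & methnic == "Y" ~ "WH",
      mrace == 2 ~ "AA",
      mrace == 3 ~ "AI/AN",
      mrace == 4 | (mrace == 1 & methnic == "U") ~ "Other"
    ),
    raceeth_f = factor(raceeth)  # levels are sorted alphabetically by default
  )

toy_race
table(toy_race$raceeth_f, useNA = "ifany")
```

Tabulating the new variable with useNA = "ifany" is a quick way to confirm that every expected category appears and to see how many rows fell through all conditions.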

Bonus Material for Longitudinal Data
Other dplyr favorites: lead() and lag()
Find the next value (at time+1) or the last value (at time-1)
More info here:
tidyr package to gather() and spread()
Make “wide” data longer or make “long” data wider
More info here:
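A sketch of these functions on hypothetical longitudinal data (the `weight` measurements are made up), assuming dplyr and tidyr are installed. Grouping by id first keeps lag() and lead() from reaching across individuals. Current tidyr documentation recommends pivot_longer() and pivot_wider() for new code, but gather() and spread() are still available and are used here to match the slides.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data: 2 people, 3 time points each, sorted by time
long <- tibble(
  id     = c(1, 1, 1, 2, 2, 2),
  time   = c(1, 2, 3, 1, 2, 3),
  weight = c(60, 61, 63, 70, 69, 68)
)

# Previous and next value within each person
with_lags <- long %>%
  dplyr::group_by(id) %>%
  dplyr::mutate(prev_weight = lag(weight),
                next_weight = lead(weight),
                change      = weight - prev_weight) %>%
  dplyr::ungroup()

# Long to wide: one row per person, one column per time point
wide <- tidyr::spread(long, key = time, value = weight)

# Wide back to long: gather every column except id
back_to_long <- tidyr::gather(wide, key = "time", value = "weight", -id)

with_lags
wide
```

The first observation for each person has no previous value, so prev_weight and change are NA there. After gather(), the time column is character because it was rebuilt from column names.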

Last step: Save dataframe as RDS
saveRDS(births_final, file = "births_final.rds")
# version of the data you’ll start from in module 3
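A base R sketch of the save-and-reload round trip, using a hypothetical toy data frame and a temporary file so that nothing is written to the working directory. It illustrates why RDS is convenient: factor levels and column types come back exactly as they were saved.

```r
# Hypothetical toy data with a factor whose level order matters
toy <- data.frame(
  id     = 1:3,
  pnc5_f = factor(c("Early PNC", "No Early PNC", "Early PNC"),
                  levels = c("No Early PNC", "Early PNC"))
)

# Write to a temporary .rds file, then read it back
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path)
toy_back <- readRDS(path)

identical(toy, toy_back)   # the reloaded object matches the original
levels(toy_back$pnc5_f)    # level order is preserved
```

readRDS() returns the object rather than restoring it under its old name, so the result can be assigned to any name in the new session.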

Module 2 Code Summary for Births Data
## start with births data from module 1

## subset
births_sample <- births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1) %>%
  dplyr::select(x, cores, dob, sex, mrace, methnic, meduc, mage, marital,
                visits, mdif, cigdur, wksgest, visit_cat, preterm, preterm_f) %>%
  dplyr::mutate(pnc5 = if_else(mdif<=5, true = 1, false = 0),
                pnc5_f = factor(pnc5, levels = c(0,1),
                                labels = c("No Early PNC", "Early PNC")))

## NC counties
nc_counties <- readr::read_csv("
births_counties <- dplyr::left_join(births_sample, nc_counties, by = "cores")

## race/ethnicity
births_final <- births_counties %>%
  dplyr::mutate(raceeth_f = factor(dplyr::case_when(
    mrace==1 & methnic=="N" ~ "WnH",
    mrace==1 & methnic=="Y" ~ "WH",
    mrace==2 ~ "AA",
    mrace==3 ~ "AI/AN",
    mrace==4 | (mrace==1 & methnic=="U") ~ "Other")))

## save data for module 3
saveRDS(births_final, file = "births_final.rds")
