Data manipulation in R: dplyr

Data manipulation in R: dplyr
EPID 799C Wednesday, Sept. 27, 2017

Today’s Outline Key functions of dplyr
Review of coding for key variables in births dataset Practice dplyr coding with births dataset

Key Functions select() filter() arrange() summarise() mutate()
For more, see this great resource: Data Wrangling Cheat Sheet select() Picks variables (columns) based on their names. filter() Picks observations (rows) based on their values. arrange() Changes the ordering of the rows based on their values. summarise() Reduces multiple values down to a single summary value. mutate() Adds new variables that are functions of existing variables. group_by() Performs data operations on groups that are defined by variables.

Key Functions select() filter() arrange() summarise() mutate()
For more, see this great resource: Data Wrangling Cheat Sheet select() Picks variables (columns) based on their names filter() Picks observations (rows) based on their values arrange() Changes the ordering of the rows summarise() Reduces multiple values down to a single summary value mutate() Adds new variables that are functions of existing variables group_by() Performs data operations on groups that are defined by variables The pipe operator %>% enables you to pass the output from one function to the input of the next function

Basic structure of dplyr code
Dataset %>% Select rows or columns to manipulate %>% Arrange or group the data %>% Calculate statistics or new variables of interest

Basic structure of dplyr code
summary <- Dataset %>% Select rows or columns to manipulate %>% Arrange or group the data %>% Calculate statistics, new variables Creates a new object named summary that stores the output Otherwise, output is just printed in the console

Manipulating the births dataset with dplyr
Key variables of interest in today’s examples Preterm birth Early prenatal care Maternal age Smoking during pregnancy Race/ethnicity County of residence Review of coding for each variable

Preterm Birth births$wksgest[births$wksgest==99] <- NA births$preterm <- ifelse(births$wksgest<37,1,0) births$preterm_f <- factor(births$preterm, levels = c(1,0), labels = c("preterm", "term"))

Early Prenatal Care births$mdif[births$mdif==99] <- NA births$pnc5 <- ifelse(births$mdif<=5,1,0) births$pnc5_f <- factor(births$pnc5, levels = c(1,0), labels = c("Early prenatal care", "No early care"))

Maternal Age births$mage[births$mage==99] <- NA

Maternal Smoking during Pregnancy
births$cigdur[births$cigdur=="U"] <- NA births$smoke <- ifelse(births$cigdur=="Y",1,0) births$smoke_f <- factor(births$smoke, levels = c(1,0), labels = c("Smoker", "Nonsmoker"))

Format Helper CSV file with labels for the levels of the race/ethnicity and county variables to save you from typing them out Save file to your computer and then read into R Studio formatter <- read.csv(“birth-format-helper-2012.csv”, stringsAsFactors = F)

Maternal Race/Ethnicity
births$race_f <- factor(births$mrace, levels = 1:4, labels = formatter[formatter$variable=="mrace",]$recode, ordered = T) births$raceeth <- ifelse(births$mrace == 1 & births$methnic == "N", "WnH", ifelse(births$mrace == 1 & births$methnic == "Y", "WH", ifelse(births$mrace==2, "AA", ifelse(births$mrace==3, "AI/AN", "Other")))) births$raceeth_f <- factor(births$raceeth, levels=c("WnH", "AA", "WH", "AI/AN", "Other"))

County of Residence in NC
births$county <- factor(births$cores, levels = formatter[formatter$variable=="cores",]$code, labels = formatter[formatter$variable=="cores",]$recode, ordered = T)

Practice Problems using dplyr
1) Calculate the numbers of births by early prenatal care (received early care vs. no early care). Exclude the births with missing values for prenatal care or preterm. Pseudo-code: use the births dataset %>% exclude births with missing pnc5_f or preterm %>% group births by pnc5_f %>% summarize number of births in each group Syntax for summary statistics is to name the new variable and set equal to the function of interest: summarise(number = n()) Name you choose The function n() counts the number of observations (no arguments)

2) Calculate the numbers of births and average age of mothers in those same groups (received early care vs. no early care). Pseudo-code: use the births dataset %>% exclude births with missing pnc5_f or preterm %>% group births by pnc5_f %>% summarize number of births, average age, in each group Are there mothers with missing age? Within the mean function, include an argument to remove missing values prior to calculating the average age.

3) In addition to the numbers of births and average age of mothers by early care, calculate the number and percentage of preterm births in these two groups. Pseudo-code: use the births dataset %>% exclude births with missing pnc5_f or preterm %>% group births by pnc5_f %>% summarize total births, average age, number of preterm, % preterm Different ways to calculate % preterm: Within summarise(): using the function mean(preterm) OR sum(preterm)/n() Within mutate(): as a function of the variables you created in summarise() for numbers of preterm and total births

4) Continuing to build on your code you’ve already written, calculate the percentage of smokers in the same two groups (early care vs. no early care). Pseudo-code: use the births dataset %>% group births by pnc5_f %>% summarize total births, average age, # preterm, # smokers, % preterm, % smokers You could try out different methods for calculating % smokers and % preterm: within summarise (using built-in functions) within mutate (using the new variables you’ve created) **Note: should you use the numeric variables (smoke, preterm) or factor variables (smoke_f, preterm_f) for calculating these summary statistics?**

5) Onto a new example: Calculate the total births in each maternal race/ethnicity group (WnH, AA, WH, AI/AN, Other). For each group, also calculate the prevalence of early care and prevalence of preterm birth. Pseudo-code: use the births dataset %>% group births by raceeth_f %>% summarize the numbers of total births, births with early care, and preterm births in each group %>% calculate percentage of early care, percentage of preterm in each group

6) Final example: Calculate the prevalence of early prenatal care and prevalence of preterm birth by NC county of residence Pseudo-code: use the births dataset %>% group births by county %>% summarize the number of births with early care, number of preterm births in each county %>% calculate prevalence of early care and preterm in each county This output is large – how would you store it instead of just printing to the console?

Data manipulation in R: dplyr

Similar presentations

Presentation on theme: "Data manipulation in R: dplyr"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Data manipulation in R: dplyr

Similar presentations

Presentation on theme: "Data manipulation in R: dplyr"— Presentation transcript:

Similar presentations

About project

Feedback