R for Epi Workshop Module 2: Data Manipulation & Summary Statistics


Sara Levintow, MSPH
PhD Candidate in Epidemiology
UNC Gillings School of Global Public Health

Outline
1. Intro to packages
2. Intro to the tidyverse
3. dplyr code structure: key functions & the pipe
4. Data manipulation examples
5. Summary statistics examples
6. Other useful data wrangling

1. Intro to packages

Getting started with packages
Packages are extensions to base R. They contain additional functions, documentation for using them, and sample data.
Packages are available from the Comprehensive R Archive Network (CRAN): https://cran.r-project.org/web/packages/available_packages_by_name.html
The “tidyverse” is a set of packages for data manipulation, exploration, and visualization. They share a common design and work in harmony. https://www.tidyverse.org/
#Install and load the package 'tidyverse'
install.packages('tidyverse') #only need to run once
library(tidyverse) #run at start of every R session to use

Syntax for packages
Developers of different packages may use the same function name. It is good coding practice to write package::function() to be explicit:
# Instead of:
filter()
# Specify the package for that function:
dplyr::filter()
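As a small illustration of why the explicit form helps: base R's stats package and dplyr both define a function called filter(), and they do unrelated things. The sketch below uses a made-up data frame (toy is hypothetical and not part of the workshop data).

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- data.frame(id = 1:4, plur = c(1, 2, 1, 1))

# dplyr::filter() keeps the rows that satisfy a condition
singletons <- dplyr::filter(toy, plur == 1)

# stats::filter() is a different function: it applies a linear filter
# (here a 2-point moving average) to a numeric series
smoothed <- stats::filter(c(1, 2, 3, 4), rep(1/2, 2))

nrow(singletons)
class(smoothed)
```

Writing the package name makes it clear to a reader (and to R) which of the two is meant, regardless of the order in which packages were loaded.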

2. Intro to the tidyverse

The Tidyverse
A collection of packages for data manipulation, exploration, and visualization. They share a common philosophy of R programming and work in harmony.
Core tidyverse packages: readr, dplyr, tidyr, ggplot2, tibble, purrr
Workshop coverage (as labeled on the slide): Module 1, Module 2, Module 3, Module 4 (readr::read_csv, etc.)

Data Manipulation with dplyr
A very useful package for exploring and managing your data: “A grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.”
Resources for getting started:
http://dplyr.tidyverse.org/
https://cran.r-project.org/web/packages/dplyr/vignettes/dplyr.html
https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
http://r4ds.had.co.nz/transform.html

3. dplyr code structure

dplyr Key Functions select() filter() arrange() summarise() mutate() group_by() More helpful info here: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

dplyr Key Functions select() Picks variables (columns) based on their names.

dplyr Key Functions filter() Picks observations (rows) based on their values.

dplyr Key Functions arrange() Changes the ordering of the rows based on their values.

dplyr Key Functions summarise() Reduces multiple values down to a single summary value.

dplyr Key Functions mutate() Adds new variables that are functions of existing variables.

dplyr Key Functions group_by() Performs data operations on groups that are defined by variables.

Key Functions
select(): Picks variables (columns) based on their names
filter(): Picks observations (rows) based on their values
arrange(): Changes the ordering of the rows
summarise(): Reduces multiple values down to a single summary value
mutate(): Adds new variables that are functions of existing variables
group_by(): Performs data operations on groups that are defined by variables
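To make the six verbs concrete, here is a minimal sketch that applies each one to a small made-up tibble. The data and the column name grp are hypothetical; only the verbs themselves come from the slides.

```r
library(dplyr)

# Hypothetical toy data, for illustration only
toy <- tibble::tibble(
  id   = 1:6,
  mage = c(22, 35, 28, 41, 19, 30),
  grp  = c("a", "b", "a", "b", "a", "b")
)

cols     <- dplyr::select(toy, id, mage)                   # keep two columns
older    <- dplyr::filter(toy, mage >= 30)                 # keep rows with mage >= 30
sorted   <- dplyr::arrange(toy, mage)                      # order rows by mage
overall  <- dplyr::summarise(toy, mean_age = mean(mage))   # collapse to one row
with_new <- dplyr::mutate(toy, mage_sq = mage^2)           # add a derived column

# group_by() changes how later verbs behave: summarise() now returns one row per group
by_grp <- dplyr::summarise(dplyr::group_by(toy, grp), mean_age = mean(mage))
```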

Key Operator: The Pipe %>%
Passes the object on the left-hand side as the first argument of the function on the right-hand side. The goal is to make code easier to write and easier to read.
x %>% f(y) # is the same as f(x, y)
x %>% f(y) %>% g(z) # is the same as g(f(x, y), z)
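A short sketch of the equivalence, using only base R functions and a made-up vector x. Loading dplyr is one way to make %>% available.

```r
library(dplyr)  # makes the %>% pipe available

x <- c(4, 9, 16)

# Nested calls are read from the inside out
nested <- round(mean(sqrt(x)), 1)

# The same computation with the pipe is read left to right
piped <- x %>% sqrt() %>% mean() %>% round(1)

identical(nested, piped)
```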

Basic Structure
Use the key functions and the pipe to chain together multiple simple steps to achieve a more complicated result.
Dataset %>%
  Select columns or filter rows %>%
  Arrange or group the data %>%
  Calculate statistics or new variables of interest

Basic Structure
#Prints output to the console:
Dataset %>%
  Select rows or columns %>%
  Arrange or group the data %>%
  Calculate statistics or new variables
#Creates a new R object:
new_obj <- Dataset %>%
  Select rows or columns %>%
  Arrange or group the data %>%
  Calculate statistics or new variables

Births Data Example You are interested in exploring the relationship between early prenatal care (exposure) and preterm birth (outcome). Let’s get started by preparing our data for analysis and exploring the distribution of key variables: Data manipulation: Create an analytic dataset that is a subset of the observations and variables in the original births data, specific to our research question. Summary statistics: Explore the variables corresponding to early prenatal care, preterm birth, and maternal age.

4. Data manipulation examples

Births Data: Filter Example
Only include singleton births with non-missing gestational age and no congenital anomalies.
Pseudo-code: use the births dataset, then filter to observations with singleton births, non-missing wksgest, and no congenital anomalies.
# my code (prints the result to the console)
births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1)
# my code (saves the result as a new object)
births_sample <- births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1)
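The filter step can be tried on a tiny stand-in dataset to see exactly which rows survive. The sketch below assumes a made-up tibble (toy_births) with the same three column names; the values are invented. It also shows that comma-separated conditions inside filter() are combined with AND, which is equivalent to joining them with &.

```r
library(dplyr)

# Hypothetical stand-in for the births data; values are made up
toy_births <- tibble::tibble(
  wksgest      = c(39, NA, 36, 40, 38),
  plur         = c(1, 1, 2, 1, 1),
  no_anomalies = c(1, 1, 1, 0, 1)
)

# Each comma-separated condition must be TRUE for a row to be kept
toy_sample <- toy_births %>%
  dplyr::filter(!is.na(wksgest), plur == 1, no_anomalies == 1)

nrow(toy_births)  # rows before filtering
nrow(toy_sample)  # rows after filtering
```

Comparing row counts before and after each filter is a quick way to confirm that the exclusions did what was intended.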

Births Data: Select Example
Can pipe to select() to also include only variables of interest for analysis.
Pseudo-code: use the births dataset, then filter to observations with singleton births, non-missing wksgest, and no congenital anomalies, then select variables for analysis.
# my code
births_sample <- births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1) %>%
  dplyr::select(x, cores, dob, sex, mrace, methnic, meduc, mage, marital, visits,
                mdif, cigdur, wksgest, visit_cat, preterm, preterm_f)

Births Data: Mutate Example
Can pipe to mutate() to create new variables for analysis: pnc5 (numeric) and pnc5_f (factor).
Key exposure: receipt of early prenatal care in the first 5 months of pregnancy.
Pseudo-code: use the births dataset, then filter to observations with singleton births, non-missing wksgest, and no congenital anomalies, then select variables for analysis, then mutate to add new variables.
# my code
births_sample <- births %>%
  dplyr::filter(…) %>%
  dplyr::select(…) %>%
  dplyr::mutate(pnc5 = if_else(mdif<=5, true = 1, false = 0),
                pnc5_f = factor(pnc5, levels = c(0,1), labels = c("No Early PNC", "Early PNC")))
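A minimal sketch of the recode on made-up values, including a missing one. It assumes mdif holds the month in which prenatal care began (an assumption; the slides only show the mdif<=5 cutoff), and the toy tibble is hypothetical. Note that if_else() returns NA when the condition is NA, so births with missing mdif end up with missing pnc5 and pnc5_f.

```r
library(dplyr)

# Hypothetical values of mdif; made up for illustration
toy <- tibble::tibble(mdif = c(2, 5, 7, NA))

toy <- toy %>%
  dplyr::mutate(
    pnc5   = dplyr::if_else(mdif <= 5, true = 1, false = 0),
    pnc5_f = factor(pnc5, levels = c(0, 1),
                    labels = c("No Early PNC", "Early PNC"))
  )

# Cross-tabulate the numeric and factor versions to confirm the recode
table(toy$pnc5, toy$pnc5_f, useNA = "ifany")
```

A cross-tabulation like the last line is a simple check that each numeric code maps to the intended label.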

Beauty of the Pipe
The pipe chains steps together without naming intermediate objects. Without it, the same work would look like this:
filtered_sample <- dplyr::filter(births, …)
selected_sample <- dplyr::select(filtered_sample, …)
sample_with_pnc <- dplyr::mutate(selected_sample, …)
For this example, the ordering is flexible as long as select() keeps the variables needed for later operations:
filter() %>% mutate() %>% select() # select() must include the new pnc5 variables
select() %>% mutate() %>% filter() # select() must include plur and no_anomalies

Check your code

5. Summary statistics examples

Births Data: Summarize Prenatal Care
Now that you’ve prepared your analytic dataset, you are interested in the number of births in each prenatal care group.
Pseudo-code: use the sample dataset, then group by prenatal care, then summarize numbers of observations within the group.
# my code
births_sample %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n())

Functions for summarise() See help page: ?dplyr::summarise

Births Data: Summarize Maternal Age
You would also like to know the average maternal age in each pnc5 group.
Pseudo-code: use the sample dataset, then group by prenatal care, then summarize numbers of observations and mean age by group.

# my code
births_sample %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n(), mean(mage, na.rm=TRUE))

# can filter out missing PNC and name the summary columns
births_sample %>%
  dplyr::filter(!is.na(pnc5_f)) %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n = n(), mean_age = mean(mage, na.rm=TRUE))

Births Data: Summarize Preterm Birth Now let’s explore the risk of preterm birth by prenatal care group. Pseudo-code: use the sample dataset, then group by prenatal care, then summarize numbers of observations, mean age, preterm

# my code
births_sample %>%
  dplyr::filter(!is.na(pnc5_f)) %>%
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(n = n(),
                   mean_age = mean(mage, na.rm=TRUE),
                   prop_preterm = mean(preterm, na.rm=TRUE))
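The grouped summary can be checked by hand on a few made-up rows. The sketch below assumes preterm is coded 0/1 (as its use with mean() suggests), so that its mean is the proportion preterm; the toy tibble and its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; values are made up
toy <- tibble::tibble(
  pnc5_f  = factor(c("Early PNC", "Early PNC", "No Early PNC", "No Early PNC", NA)),
  mage    = c(30, 34, 22, NA, 27),
  preterm = c(0, 1, 1, 1, 0)
)

toy_summary <- toy %>%
  dplyr::filter(!is.na(pnc5_f)) %>%     # drop births with unknown prenatal care
  dplyr::group_by(pnc5_f) %>%
  dplyr::summarise(
    n            = n(),                          # births per group
    mean_age     = mean(mage, na.rm = TRUE),     # ignores missing ages
    prop_preterm = mean(preterm, na.rm = TRUE)   # mean of a 0/1 variable is a proportion
  )

toy_summary
```

Keep in mind that n counts all rows in the group, including those with missing mage, so it can be larger than the number of values that went into mean_age.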

Births Data: Summarize by Age Alternatively, we could look at the proportions of early prenatal care and preterm birth by age. Pseudo-code: use the sample dataset, then group by age, then summarize proportions with prenatal care and preterm

# my code
births_sample %>%
  dplyr::group_by(mage) %>%
  dplyr::summarise(n = n(),
                   prop_pnc5 = mean(pnc5, na.rm=TRUE),
                   prop_preterm = mean(preterm, na.rm=TRUE)) %>%
  dplyr::filter(n>30) # extra step: exclude ages with little data

# save output as an object
age_summary <- births_sample %>%
  dplyr::group_by(mage) %>%
  dplyr::summarise(n = n(),
                   prop_pnc5 = mean(pnc5, na.rm=TRUE),
                   prop_preterm = mean(preterm, na.rm=TRUE)) %>%
  dplyr::filter(n>30)
View(age_summary) #Or, click on it in the Environment tab

Preview: Pipe dplyr code to ggplot!

6. Other useful data wrangling

Merging Data
Key data manipulation task.
Base R function: merge()
Introducing the dplyr verbs for joining:
Mutating joins: inner_join(), left_join(), right_join(), full_join()
Filtering joins: semi_join(), anti_join()
AWESOME (animated!) resource: https://github.com/gadenbuie/tidy-animated-verbs#readme
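A minimal sketch of the difference between the two families, using two made-up tables x and y that share an id column (all names and values here are hypothetical). Mutating joins add columns from y; filtering joins only keep or drop rows of x.

```r
library(dplyr)

# Hypothetical toy tables; values are made up
x <- tibble::tibble(id = c(1, 2, 3), a = c("x1", "x2", "x3"))
y <- tibble::tibble(id = c(2, 3, 4), b = c("y2", "y3", "y4"))

# Mutating joins: the result has columns from both tables
inner <- dplyr::inner_join(x, y, by = "id")  # ids present in both: 2, 3
left  <- dplyr::left_join(x, y, by = "id")   # all ids in x: 1, 2, 3 (b is NA for id 1)
full  <- dplyr::full_join(x, y, by = "id")   # ids in either table: 1, 2, 3, 4

# Filtering joins: the result has only the columns of x
semi  <- dplyr::semi_join(x, y, by = "id")   # rows of x with a match in y: 2, 3
anti  <- dplyr::anti_join(x, y, by = "id")   # rows of x without a match in y: 1
```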

Merging Example The variable cores is the mother’s NC county of residence. This is a numeric code, and you would like to add a column for the county name corresponding to that code. Example of a mutating join.

Read in County Names
Let’s read in a spreadsheet of NC county names with the corresponding numeric codes:
nc_counties <- readr::read_csv("http://bit.ly/nc_county_names")

Our two dataframes to be merged:

Merge in County Names
births_counties <- dplyr::left_join(births_sample, nc_counties, by = "cores")
# left_join(x, y): all rows from x, and all columns from x and y
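After a join like this, two quick checks are often worthwhile: that the row count did not change, and that every code found a match. The sketch below uses made-up tables (toy_births, toy_counties) with invented codes and county names; only the column names cores and county_name follow the slides.

```r
library(dplyr)

# Hypothetical toy tables; codes and names are made up
toy_births   <- tibble::tibble(cores = c(1, 1, 2, 9))
toy_counties <- tibble::tibble(cores = c(1, 2, 3),
                               county_name = c("County A", "County B", "County C"))

joined <- dplyr::left_join(toy_births, toy_counties, by = "cores")

# A left join onto a lookup table with unique keys should keep the row count unchanged
nrow(joined) == nrow(toy_births)

# Codes in the births data that have no match in the lookup table
unmatched <- dplyr::anti_join(toy_births, toy_counties, by = "cores")
unmatched
```

If the lookup table had duplicate codes, the joined data would gain rows, which is why the row-count comparison is a useful safeguard.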

births_counties %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise_at(c("pnc5", "preterm"), mean, na.rm=TRUE)
Advanced topic: scoped variants of dplyr verbs. If interested, see more here. Note that we could have used summarise() instead, with more typing.
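For comparison, here is a sketch of the "more typing" version next to the scoped variant, run on a made-up table (toy and its values are hypothetical). A third form using across() is included as an option available in dplyr 1.0.0 and later; it is not part of the original slides.

```r
library(dplyr)

# Hypothetical toy data; values are made up
toy <- tibble::tibble(
  county_name = c("County A", "County A", "County B"),
  pnc5        = c(1, NA, 1),
  preterm     = c(0, 1, 1)
)

# Scoped variant, as on the slide
scoped <- toy %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise_at(c("pnc5", "preterm"), mean, na.rm = TRUE)

# Plain summarise(): one expression per column
plain <- toy %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise(pnc5    = mean(pnc5, na.rm = TRUE),
                   preterm = mean(preterm, na.rm = TRUE))

# across() inside summarise() (dplyr >= 1.0.0)
with_across <- toy %>%
  dplyr::group_by(county_name) %>%
  dplyr::summarise(dplyr::across(c(pnc5, preterm), ~ mean(.x, na.rm = TRUE)))
```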

Case_when for complicated conditional logic
We would like to create a factor variable for race/ethnicity categories to be used in analysis:
White non-Hispanic (“WnH”)
White Hispanic (“WH”)
African American (“AA”)
American Indian or Alaska Native (“AI/AN”)
Other (“Other”)
From data dictionary: [image of the data dictionary entries not included in the transcript]

Motivation – don’t do this!
ifelse(mrace==1 & methnic=="N", "WnH",
  ifelse(mrace==1 & methnic=="Y", "WH",
    ifelse(mrace==2, "AA",
      ifelse(mrace==3, "AI/AN",
        ifelse(mrace==4 | (mrace==1 & methnic=="U"), "Other", NA)))))

Case_when example: Race/ethnicity
dplyr::case_when(
  mrace==1 & methnic=="N" ~ "WnH",
  mrace==1 & methnic=="Y" ~ "WH",
  mrace==2 ~ "AA",
  mrace==3 ~ "AI/AN",
  mrace==4 | (mrace==1 & methnic=="U") ~ "Other")

Case_when example: Race/ethnicity
births_final <- births_counties %>%
  dplyr::mutate(
    raceeth_f = factor(
      dplyr::case_when(
        mrace==1 & methnic=="N" ~ "WnH",
        mrace==1 & methnic=="Y" ~ "WH",
        mrace==2 ~ "AA",
        mrace==3 ~ "AI/AN",
        mrace==4 | (mrace==1 & methnic=="U") ~ "Other")
    )
  )
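To see how case_when() treats each branch, including rows that match no condition, the recode can be run on a handful of made-up rows. The toy tibble below is hypothetical and was chosen to hit every branch once, plus one row with missing methnic. Rows that match none of the conditions get NA.

```r
library(dplyr)

# Hypothetical toy values; made up to cover each branch
toy <- tibble::tibble(
  mrace   = c(1, 1, 1, 2, 3, 4, 1),
  methnic = c("N", "Y", "U", "N", "N", "N", NA)
)

toy <- toy %>%
  dplyr::mutate(
    raceeth = dplyr::case_when(
      mrace == 1 & methnic == "N" ~ "WnH",
      mrace == 1 & methnic == "Y" ~ "WH",
      mrace == 2 ~ "AA",
      mrace == 3 ~ "AI/AN",
      mrace == 4 | (mrace == 1 & methnic == "U") ~ "Other"
    ),
    raceeth_f = factor(raceeth)
  )

# List every input combination next to its recoded value
dplyr::count(toy, mrace, methnic, raceeth_f)
```

Conditions are evaluated in order and the first match wins, so more specific conditions belong above more general ones.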

Bonus Material for Longitudinal Data
Other dplyr favorites: lead() and lag()
Find the next value (at time+1) or the last value (at time-1).
More info here: https://dplyr.tidyverse.org/reference/lead-lag.html
tidyr package to gather() and spread()
Make “wide” data longer or make “long” data wider.
More info here: https://tidyr.tidyverse.org/
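A minimal sketch of lead() and lag() on made-up longitudinal data: two people with three visits each (the tibble and its column names are hypothetical). Grouping by person before calling lag() or lead() keeps values from leaking across people.

```r
library(dplyr)

# Hypothetical longitudinal toy data: two people, three visits each
toy <- tibble::tibble(
  id    = c(1, 1, 1, 2, 2, 2),
  time  = c(1, 2, 3, 1, 2, 3),
  value = c(10, 12, 15, 7, 7, 9)
)

toy_lagged <- toy %>%
  dplyr::arrange(id, time) %>%          # lead/lag depend on row order
  dplyr::group_by(id) %>%               # compute within each person
  dplyr::mutate(
    prev_value = dplyr::lag(value),     # value at time - 1 (NA at the first visit)
    next_value = dplyr::lead(value),    # value at time + 1 (NA at the last visit)
    change     = value - prev_value     # change since the previous visit
  ) %>%
  dplyr::ungroup()

toy_lagged
```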

Last step: Save dataframe as RDS
saveRDS(births_final, file = "births_final.rds")
# version of the data you’ll start from in module 3
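For reference, an RDS file is read back with readRDS(), and column types such as factors are preserved. The sketch below round-trips a small made-up data frame through a temporary file (toy and path are hypothetical names), so it leaves nothing behind in the working directory.

```r
# Hypothetical toy data frame with a factor column
toy <- data.frame(id = 1:3, grp = factor(c("a", "b", "a")))

# Write to a temporary file rather than the working directory
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path)

# readRDS() returns the object, so it can be assigned to any name
toy_reloaded <- readRDS(path)

all.equal(toy, toy_reloaded)
class(toy_reloaded$grp)
```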

Module 2 Code Summary for Births Data
## start with births data from module 1
## subset
births_sample <- births %>%
  dplyr::filter(!is.na(wksgest) & plur==1 & no_anomalies==1) %>%
  dplyr::select(x, cores, dob, sex, mrace, methnic, meduc, mage, marital, visits,
                mdif, cigdur, wksgest, visit_cat, preterm, preterm_f) %>%
  dplyr::mutate(pnc5 = if_else(mdif<=5, true = 1, false = 0),
                pnc5_f = factor(pnc5, levels = c(0,1), labels = c("No Early PNC", "Early PNC")))
## NC counties
nc_counties <- readr::read_csv("http://bit.ly/nc_county_names")
births_counties <- dplyr::left_join(births_sample, nc_counties, by = "cores")
## race/ethnicity
births_final <- births_counties %>%
  dplyr::mutate(raceeth_f = factor(dplyr::case_when(
    mrace==1 & methnic=="N" ~ "WnH",
    mrace==1 & methnic=="Y" ~ "WH",
    mrace==2 ~ "AA",
    mrace==3 ~ "AI/AN",
    mrace==4 | (mrace==1 & methnic=="U") ~ "Other")))
## save data for module 3
saveRDS(births_final, file = "births_final.rds")