Data manipulation in R: dplyr

Data manipulation in R: dplyr EPID 799C Wednesday, Sept. 27, 2017

Today’s Outline
- Key functions of dplyr
- Review of coding for key variables in births dataset
- Practice dplyr coding with births dataset

Key Functions
- select(): picks variables (columns) based on their names.
- filter(): picks observations (rows) based on their values.
- arrange(): changes the ordering of the rows based on their values.
- summarise(): reduces multiple values down to a single summary value.
- mutate(): adds new variables that are functions of existing variables.
- group_by(): performs data operations on groups that are defined by variables.
For more, see this great resource: Data Wrangling Cheat Sheet

Key Functions: the pipe operator
The pipe operator %>% passes the output of one function to the input of the next function.
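As a rough illustration of the verbs and the pipe described above, the sketch below applies each one to a small made-up data frame. The object `toy` and its columns `id`, `group`, and `value` are hypothetical and are not part of the births dataset.

```r
library(dplyr)

# Hypothetical toy data; the column names are made up for illustration
toy <- data.frame(
  id    = 1:6,
  group = c("a", "a", "b", "b", "b", "c"),
  value = c(10, 20, 5, 15, 25, 30)
)

picked  <- toy %>% select(group, value)                  # keep two columns
kept    <- toy %>% filter(value >= 15)                   # keep rows meeting a condition
sorted  <- toy %>% arrange(desc(value))                  # reorder rows, largest value first
doubled <- toy %>% mutate(value2 = value * 2)            # add a column computed from another
overall <- toy %>% summarise(mean_value = mean(value))   # collapse to a one-row summary

# group_by() followed by summarise() gives one summary row per group
by_group <- toy %>%
  group_by(group) %>%
  summarise(n = n(), mean_value = mean(value))
```

Each line reads left to right: the data frame on the left of %>% becomes the first argument of the function on the right.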

Basic structure of dplyr code
    Dataset %>%
      Select rows or columns to manipulate %>%
      Arrange or group the data %>%
      Calculate statistics or new variables of interest

Basic structure of dplyr code
    summary <- Dataset %>%
      Select rows or columns to manipulate %>%
      Arrange or group the data %>%
      Calculate statistics, new variables
Assigning with <- creates a new object named summary that stores the output. Otherwise, the output is just printed in the console.
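The structure above might look like the following in runnable form. This is only a sketch on made-up data: `toy`, `region`, `year`, and `count` are hypothetical names, and the stored object is called `summary_tbl` here (rather than `summary`) purely to avoid confusion with base R's summary() function.

```r
library(dplyr)

# Hypothetical toy data standing in for a real dataset
toy <- data.frame(
  region = c("east", "east", "west", "west", "west"),
  year   = c(2011, 2012, 2011, 2012, 2012),
  count  = c(4, 6, 3, 5, NA)
)

# Without assignment, the result is only printed to the console
toy %>%
  filter(!is.na(count)) %>%
  group_by(region) %>%
  summarise(total = sum(count))

# With assignment, the result is stored in an object for later use
summary_tbl <- toy %>%
  filter(!is.na(count)) %>%       # select rows to manipulate
  group_by(region) %>%            # group the data
  summarise(total = sum(count))   # calculate statistics
```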

Manipulating the births dataset with dplyr
Key variables of interest in today’s examples:
- Preterm birth
- Early prenatal care
- Maternal age
- Smoking during pregnancy
- Race/ethnicity
- County of residence
Review of coding for each variable

Preterm Birth
    # Recode the value 99 as missing
    births$wksgest[births$wksgest == 99] <- NA
    # 1 = fewer than 37 weeks of gestation, 0 = otherwise
    births$preterm <- ifelse(births$wksgest < 37, 1, 0)
    births$preterm_f <- factor(births$preterm, levels = c(1, 0), labels = c("preterm", "term"))

Early Prenatal Care
    # Recode the value 99 as missing
    births$mdif[births$mdif == 99] <- NA
    # 1 = mdif of 5 or less, 0 = otherwise
    births$pnc5 <- ifelse(births$mdif <= 5, 1, 0)
    births$pnc5_f <- factor(births$pnc5, levels = c(1, 0), labels = c("Early prenatal care", "No early care"))

Maternal Age
    births$mage[births$mage == 99] <- NA

Maternal Smoking during Pregnancy
    # Recode the value "U" as missing
    births$cigdur[births$cigdur == "U"] <- NA
    # 1 = "Y", 0 = otherwise
    births$smoke <- ifelse(births$cigdur == "Y", 1, 0)
    births$smoke_f <- factor(births$smoke, levels = c(1, 0), labels = c("Smoker", "Nonsmoker"))
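The base R recodes above could also be written as a single mutate() call. The sketch below is one possible version, run on a made-up six-row data frame (`births_toy`) that borrows only the column names and the 99 / "U" missing codes from the slides. It uses dplyr's na_if() to convert a sentinel value to NA.

```r
library(dplyr)

# Made-up rows; only the column names and the 99 / "U" codes follow the slides
births_toy <- data.frame(
  wksgest = c(39, 35, 99, 40, 36, 38),
  mdif    = c(2, 7, 3, 99, 5, 4),
  mage    = c(24, 31, 99, 28, 19, 35),
  cigdur  = c("N", "Y", "N", "U", "Y", "N"),
  stringsAsFactors = FALSE
)

births_toy <- births_toy %>%
  mutate(
    # Step 1: turn sentinel codes into NA
    wksgest = na_if(wksgest, 99),
    mdif    = na_if(mdif, 99),
    mage    = na_if(mage, 99),
    cigdur  = na_if(cigdur, "U"),
    # Step 2: numeric 0/1 indicators (NA stays NA); mutate() evaluates in
    # order, so these lines see the cleaned columns from step 1
    preterm = ifelse(wksgest < 37, 1, 0),
    pnc5    = ifelse(mdif <= 5, 1, 0),
    smoke   = ifelse(cigdur == "Y", 1, 0),
    # Step 3: labelled factor versions
    preterm_f = factor(preterm, levels = c(1, 0), labels = c("preterm", "term")),
    pnc5_f    = factor(pnc5, levels = c(1, 0),
                       labels = c("Early prenatal care", "No early care")),
    smoke_f   = factor(smoke, levels = c(1, 0), labels = c("Smoker", "Nonsmoker"))
  )
```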

Format Helper
- CSV file with labels for the levels of the race/ethnicity and county variables, to save you from typing them out
- Save the file to your computer and then read it into RStudio:
    formatter <- read.csv("birth-format-helper-2012.csv", stringsAsFactors = FALSE)

Maternal Race/Ethnicity
    births$race_f <- factor(births$mrace, levels = 1:4,
                            labels = formatter[formatter$variable == "mrace", ]$recode,
                            ordered = TRUE)
    births$raceeth <- ifelse(births$mrace == 1 & births$methnic == "N", "WnH",
                      ifelse(births$mrace == 1 & births$methnic == "Y", "WH",
                      ifelse(births$mrace == 2, "AA",
                      ifelse(births$mrace == 3, "AI/AN", "Other"))))
    births$raceeth_f <- factor(births$raceeth, levels = c("WnH", "AA", "WH", "AI/AN", "Other"))

County of Residence in NC
    births$county <- factor(births$cores,
                            levels = formatter[formatter$variable == "cores", ]$code,
                            labels = formatter[formatter$variable == "cores", ]$recode,
                            ordered = TRUE)

Practice Problems using dplyr
1) Calculate the numbers of births by early prenatal care (received early care vs. no early care). Exclude the births with missing values for prenatal care or preterm.
Pseudo-code:
    use the births dataset %>%
      exclude births with missing pnc5_f or preterm %>%
      group births by pnc5_f %>%
      summarize number of births in each group
The syntax for a summary statistic is to name the new variable and set it equal to the function of interest: summarise(number = n())
- number is a name you choose.
- The function n() counts the number of observations (it takes no arguments).
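One possible translation of the pseudo-code for problem 1 is sketched below. It runs on a made-up eight-row data frame (`births_toy`) that only mimics the recoded columns `pnc5`, `pnc5_f`, and `preterm`. The values are invented, so the counts say nothing about the real births data.

```r
library(dplyr)

# Made-up rows that mimic the recoded births columns (not real data)
births_toy <- data.frame(
  pnc5    = c(1, 1, 0, 0, NA, 1, 0, 1),
  preterm = c(0, 1, 1, 0, 0, NA, 1, 0)
)
births_toy$pnc5_f <- factor(births_toy$pnc5, levels = c(1, 0),
                            labels = c("Early prenatal care", "No early care"))

res <- births_toy %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%  # exclude births with missing values
  group_by(pnc5_f) %>%                         # one group per level of pnc5_f
  summarise(number = n())                      # count births in each group

res
```

filter() keeps rows where every condition is TRUE, so listing both !is.na() checks drops a birth if either variable is missing.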

Practice Problems using dplyr
2) Calculate the numbers of births and average age of mothers in those same groups (received early care vs. no early care).
Pseudo-code:
    use the births dataset %>%
      exclude births with missing pnc5_f or preterm %>%
      group births by pnc5_f %>%
      summarize number of births, average age, in each group
Are there mothers with missing age? Within the mean function, include an argument to remove missing values prior to calculating the average age.
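A sketch of problem 2 under the same kind of assumptions: `births_toy` is a made-up data frame with invented values, and `mean_age` is just a chosen name. The argument that removes missing values inside mean() is na.rm = TRUE.

```r
library(dplyr)

# Made-up rows that mimic the recoded births columns (not real data)
births_toy <- data.frame(
  pnc5    = c(1, 1, 0, 0, NA, 1, 0, 1),
  preterm = c(0, 1, 1, 0, 0, NA, 1, 0),
  mage    = c(25, 30, 22, NA, 28, 33, 19, 27)
)
births_toy$pnc5_f <- factor(births_toy$pnc5, levels = c(1, 0),
                            labels = c("Early prenatal care", "No early care"))

res <- births_toy %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%
  group_by(pnc5_f) %>%
  summarise(
    number   = n(),
    mean_age = mean(mage, na.rm = TRUE)  # drop missing ages before averaging
  )

res
```

Without na.rm = TRUE, any group containing a mother with missing age would get NA as its average.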

Practice Problems using dplyr
3) In addition to the numbers of births and average age of mothers by early care, calculate the number and percentage of preterm births in these two groups.
Pseudo-code:
    use the births dataset %>%
      exclude births with missing pnc5_f or preterm %>%
      group births by pnc5_f %>%
      summarize total births, average age, number of preterm, % preterm
Different ways to calculate % preterm:
- Within summarise(): using the function mean(preterm) OR sum(preterm)/n()
- Within mutate(): as a function of the variables you created in summarise() for numbers of preterm and total births
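The two routes to % preterm listed above can be compared side by side. The sketch below assumes a made-up `births_toy` data frame with invented values. `n_births`, `n_preterm`, `pct_preterm`, and `pct_preterm2` are names chosen for illustration.

```r
library(dplyr)

# Made-up rows that mimic the recoded births columns (not real data)
births_toy <- data.frame(
  pnc5    = c(1, 1, 0, 0, NA, 1, 0, 1),
  preterm = c(0, 1, 1, 0, 0, NA, 1, 0),
  mage    = c(25, 30, 22, NA, 28, 33, 19, 27)
)
births_toy$pnc5_f <- factor(births_toy$pnc5, levels = c(1, 0),
                            labels = c("Early prenatal care", "No early care"))

res <- births_toy %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%
  group_by(pnc5_f) %>%
  summarise(
    n_births    = n(),
    mean_age    = mean(mage, na.rm = TRUE),
    n_preterm   = sum(preterm),         # preterm is 0/1, so the sum is a count
    pct_preterm = 100 * mean(preterm)   # route 1: inside summarise()
  ) %>%
  mutate(pct_preterm2 = 100 * n_preterm / n_births)  # route 2: inside mutate()

res
```

Both routes agree here because births with missing preterm were filtered out first, so the mean and the ratio share the same denominator.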

Practice Problems using dplyr
4) Continuing to build on the code you’ve already written, calculate the percentage of smokers in the same two groups (early care vs. no early care).
Pseudo-code:
    use the births dataset %>%
      group births by pnc5_f %>%
      summarize total births, average age, # preterm, # smokers, % preterm, % smokers
You could try out different methods for calculating % smokers and % preterm:
- within summarise (using built-in functions)
- within mutate (using the new variables you’ve created)
**Note: should you use the numeric variables (smoke, preterm) or factor variables (smoke_f, preterm_f) for calculating these summary statistics?**
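A sketch of problem 4 on made-up data follows. It assumes the same exclusion of missing pnc5_f and preterm as in the earlier problems, and it uses the numeric 0/1 variables because sum() fails on a factor and mean() on a factor returns NA with a warning. All values in `births_toy` are invented.

```r
library(dplyr)

# Made-up rows that mimic the recoded births columns (not real data)
births_toy <- data.frame(
  pnc5    = c(1, 1, 0, 0, NA, 1, 0, 1),
  preterm = c(0, 1, 1, 0, 0, NA, 1, 0),
  mage    = c(25, 30, 22, NA, 28, 33, 19, 27),
  smoke   = c(0, 1, 1, 0, 0, 0, NA, 0)
)
births_toy$pnc5_f <- factor(births_toy$pnc5, levels = c(1, 0),
                            labels = c("Early prenatal care", "No early care"))

res <- births_toy %>%
  filter(!is.na(pnc5_f), !is.na(preterm)) %>%   # assumption: same exclusions as before
  group_by(pnc5_f) %>%
  summarise(
    n_births    = n(),
    mean_age    = mean(mage, na.rm = TRUE),
    n_preterm   = sum(preterm),
    n_smokers   = sum(smoke, na.rm = TRUE),
    pct_preterm = 100 * mean(preterm),
    # Denominator here is births with known smoking status, not all births
    pct_smokers = 100 * mean(smoke, na.rm = TRUE)
  )

res
```

Note that 100 * n_smokers / n_births inside mutate() would use all births as the denominator, so the two methods can differ whenever smoke has missing values.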

Practice Problems using dplyr
5) On to a new example: calculate the total births in each maternal race/ethnicity group (WnH, AA, WH, AI/AN, Other). For each group, also calculate the prevalence of early care and prevalence of preterm birth.
Pseudo-code:
    use the births dataset %>%
      group births by raceeth_f %>%
      summarize the numbers of total births, births with early care, and preterm births in each group %>%
      calculate percentage of early care, percentage of preterm in each group
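The summarise-then-mutate pattern in this pseudo-code might be sketched as follows. The data frame `births_toy` is made up and contains no missing values. With the real data, missing pnc5 or preterm would need handling as in the earlier problems.

```r
library(dplyr)

# Made-up rows with no missing values (not real data)
births_toy <- data.frame(
  raceeth = c("WnH", "WnH", "AA", "AA", "WH", "AI/AN", "Other", "WnH"),
  pnc5    = c(1, 1, 0, 1, 0, 1, 0, 0),
  preterm = c(0, 0, 1, 0, 0, 1, 0, 1),
  stringsAsFactors = FALSE
)
births_toy$raceeth_f <- factor(births_toy$raceeth,
                               levels = c("WnH", "AA", "WH", "AI/AN", "Other"))

res <- births_toy %>%
  group_by(raceeth_f) %>%
  summarise(
    n_births  = n(),
    n_early   = sum(pnc5),      # births with early care
    n_preterm = sum(preterm)    # preterm births
  ) %>%
  mutate(
    pct_early   = 100 * n_early / n_births,
    pct_preterm = 100 * n_preterm / n_births
  )

res
```

The rows of the result follow the order of the factor levels, which is why raceeth_f is used for grouping instead of the character column.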

Practice Problems using dplyr
6) Final example: calculate the prevalence of early prenatal care and prevalence of preterm birth by NC county of residence.
Pseudo-code:
    use the births dataset %>%
      group births by county %>%
      summarize the number of births with early care, number of preterm births in each county %>%
      calculate prevalence of early care and preterm in each county
This output is large. How would you store it instead of just printing it to the console?
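One way to store a large result is to assign it to an object and, if needed, write it to a file. The sketch below does both on a made-up data frame. The county labels are used only as example values, `county_summary` and `out_file` are hypothetical names, and the file goes to a temporary path so nothing is written to your working directory.

```r
library(dplyr)

# Made-up rows with no missing values (not real data)
births_toy <- data.frame(
  county  = factor(c("Durham", "Orange", "Durham", "Wake", "Wake", "Wake")),
  pnc5    = c(1, 0, 1, 1, 1, 0),
  preterm = c(0, 1, 1, 0, 0, 0)
)

# Assigning the pipeline stores the result instead of printing it
county_summary <- births_toy %>%
  group_by(county) %>%
  summarise(
    n_births  = n(),
    n_early   = sum(pnc5),
    n_preterm = sum(preterm)
  ) %>%
  mutate(
    prev_early   = n_early / n_births,
    prev_preterm = n_preterm / n_births
  )

head(county_summary)   # look at the first rows only

# Optionally save the full table to a CSV file (temporary path in this sketch)
out_file <- tempfile(fileext = ".csv")
write.csv(county_summary, out_file, row.names = FALSE)
```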