Dplyr I EPID 799C Mon Sep 24 2017
Today’s Overview
    Pipes
    dplyr verbs: filter, summarise, group_by, windows
    Demos
    Homework 1: thoughts
    Homework 2: due Wed
    Homework 3: coming soon!
dplyr theory, verbs
Dplyr big-picture
    Standard grammar of data manipulation: standard “words” and “phrases.” More abstraction for us humans.
    Dataset abstracted: base R largely operates on vectors; dplyr is oriented toward operating on whole data sets at once, and its functions aim at returning data sets.
    Smart & efficient: e.g., use dplyr on a database connection, and dplyr translates to SQL for you.
Dplyr big-picture
    One-table verbs: filter, select, arrange, summarize, mutate, group_by
    Linking phrases: pipe %>% (think “…then…”)
    Multi-table verbs: mutating & filtering table joins, set operations, binding
    Concepts: tidy data, wide & long data
Whiteboard Overview Use the words in a sentence!
Sidenote: Star Wars One of a few datasets included in tidyr/dplyr http://dplyr.tidyverse.org/reference/starwars.html#examples
filter()
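filter()
    Base R already has a way to do this: []
    filter(starwars, homeworld == "Tatooine")
    Almost the same as: starwars[starwars$homeworld == "Tatooine", ]
    (the difference: filter() drops rows where the condition is NA, while [] returns them as all-NA rows)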
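The "almost" matters when the filtered column has missing values. A minimal sketch of the difference, using a small made-up table (the `people` data frame is hypothetical, not part of the course data):

```r
library(dplyr)

# Toy data (hypothetical) with one missing homeworld
people <- data.frame(
  name      = c("Luke", "Leia", "Yoda"),
  homeworld = c("Tatooine", "Alderaan", NA),
  stringsAsFactors = FALSE
)

# dplyr: rows where the condition evaluates to NA are dropped
via_filter <- filter(people, homeworld == "Tatooine")

# base R: an NA in the logical index produces an all-NA row
via_bracket <- people[people$homeworld == "Tatooine", ]

nrow(via_filter)   # 1
nrow(via_bracket)  # 2 (Luke plus one all-NA row)
```

If the base R behavior is a surprise, wrapping the condition in `which()` is one common way to drop the NA rows there as well.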
select()
select() select(starwars, name, height, mass)
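Beyond naming columns one by one, select() also accepts ranges, helper functions, and negation. The sketch below is an illustration on the bundled starwars data; the object names are made up for this example.

```r
library(dplyr)

# A contiguous range of columns, by name
basics <- select(starwars, name:mass)

# A helper that matches on column-name endings
colors <- select(starwars, name, ends_with("color"))

# Negation: everything except the listed columns
no_lists <- select(starwars, -films, -vehicles, -starships)

names(basics)  # "name" "height" "mass"
names(colors)  # "name" "hair_color" "skin_color" "eye_color"
```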
arrange()
    arrange(starwars, name)
    arrange(starwars, desc(homeworld))
mutate() Row-by-row actions
mutate()
    mutate(starwars, is_tatooine_native = homeworld == "Tatooine")
    transmute(starwars, is_tatooine_native = homeworld == "Tatooine")
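The two calls differ only in which columns come back. A quick check on the starwars data (the result names `with_flag` and `only_flag` are chosen for this illustration):

```r
library(dplyr)

# mutate() keeps every existing column and appends the new one
with_flag <- mutate(starwars, is_tatooine_native = homeworld == "Tatooine")

# transmute() returns only the columns it creates
only_flag <- transmute(starwars, is_tatooine_native = homeworld == "Tatooine")

ncol(with_flag) - ncol(starwars)  # 1
names(only_flag)                  # "is_tatooine_native"
```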
mutate() Window functions Others (rolling & recycled aggregates) are beyond the scope of this introduction
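A window function returns one value per row but looks at other rows to compute it. The sketch below shows three common kinds (ranking, offset, cumulative) on a small invented table; the `heights` data frame and its values are hypothetical.

```r
library(dplyr)

# Toy data (hypothetical)
heights <- data.frame(
  name   = c("A", "B", "C", "D"),
  height = c(172, 96, 202, 150),
  stringsAsFactors = FALSE
)

ranked <- mutate(
  heights,
  height_rank = min_rank(desc(height)),  # ranking: 1 = tallest
  prev_height = lag(height),             # offset: value from the previous row
  running_max = cummax(height)           # cumulative: tallest seen so far
)

ranked$height_rank  # 2 4 1 3
ranked$prev_height  # NA 172 96 202
ranked$running_max  # 172 172 202 202
```

Combined with group_by(), the same functions operate within each group rather than across the whole table.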
summarise() Many to one operations
summarise()
    summarise(starwars, avg_height = mean(height, na.rm = TRUE), avg_mass = mean(mass, na.rm = TRUE))
    summarise_at(starwars, c("height", "mass"), mean, na.rm = TRUE)
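"Many to one" means the whole table collapses to a single row when no grouping is in place. A short sketch on starwars, adding the counting helpers n() and n_distinct() (the column names in the result are made up for this example):

```r
library(dplyr)

overall <- summarise(
  starwars,
  n_characters = n(),                   # number of rows
  n_homeworlds = n_distinct(homeworld), # distinct values (NA counts as one)
  avg_height   = mean(height, na.rm = TRUE)
)

nrow(overall)  # 1: every input row contributes to the same output row
```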
group_by() Groups the rows of a data.frame* by one or more variables so that summarizing (or windowed*) actions are performed within each group.
group_by()
    group_by(starwars, homeworld)
    summarise_at(group_by(starwars, homeworld), c("height", "mass"), mean, na.rm = TRUE)
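On its own, group_by() changes nothing visible in the data; it only records the grouping so that later verbs work per group. A small sketch of that behavior on starwars (object names are illustrative):

```r
library(dplyr)

by_world <- group_by(starwars, homeworld)

# Same rows and columns as before, plus grouping metadata
group_vars(by_world)  # "homeworld"

# summarise() now returns one row per group (NA homeworld is its own group)
world_summary <- summarise(
  by_world,
  n          = n(),
  avg_height = mean(height, na.rm = TRUE)
)

# ungroup() removes the grouping when it is no longer wanted
plain_again <- ungroup(by_world)
```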
Multi-Table Operations
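The multi-table verbs named earlier (mutating joins, filtering joins, binding) can be sketched on two tiny invented tables. Both `characters` and `planets`, and all values in them, are hypothetical.

```r
library(dplyr)

# Toy tables (hypothetical)
characters <- data.frame(
  name      = c("Luke", "Leia", "Yoda"),
  homeworld = c("Tatooine", "Alderaan", NA),
  stringsAsFactors = FALSE
)
planets <- data.frame(
  homeworld = c("Tatooine", "Naboo"),
  climate   = c("arid", "temperate"),
  stringsAsFactors = FALSE
)

# Mutating join: keeps all rows of the left table, adds columns from the right
joined <- left_join(characters, planets, by = "homeworld")

# Filtering joins: keep or drop left rows by whether a match exists; no new columns
matched   <- semi_join(characters, planets, by = "homeworld")
unmatched <- anti_join(characters, planets, by = "homeworld")

# Binding: stack tables with the same columns
more <- data.frame(name = "Han", homeworld = "Corellia", stringsAsFactors = FALSE)
stacked <- bind_rows(characters, more)
```

Here `joined` has three rows with climate filled in only for Luke, `matched` holds Luke, and `unmatched` holds Leia and Yoda.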
Tibble sidenote
Tibbles
    A layer built on data.frames.
    Tibbles largely work the same (if not, as.data.frame() it), but support retaining groups, prettier printing, etc.
    class(starwars)
    str(starwars)
    Note a slick move with films, vehicles, starships… (these are list-columns: each cell holds a whole vector)
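The points on this slide can be checked directly at the console. A brief sketch using only the starwars data that ships with dplyr:

```r
library(dplyr)

# A tibble is still a data.frame, with extra classes layered on top
class(starwars)  # "tbl_df" "tbl" "data.frame"

# The "slick move": some columns are lists, one vector per character
list_cols <- names(which(sapply(starwars, is.list)))
list_cols        # "films" "vehicles" "starships"

# Fall back to a plain data.frame when a function dislikes tibbles
sw_df <- as.data.frame(starwars)
class(sw_df)     # "data.frame"
```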
Pipes
The Pipe
    What? The simplest pipe (%>%) takes what’s on the left and makes it the new first argument of what’s on the right.
    a %>% b(arg1=1, arg2=2) becomes b(a, arg1=1, arg2=2)
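The rewriting rule can be verified on plain vectors, with no data sets involved. A minimal sketch (the variable names are made up; dplyr re-exports %>% from magrittr, so loading dplyr is enough):

```r
library(dplyr)

# Piped: read left to right as "take the vector, then sqrt, then sum"
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

# Nested: the same computation, read inside out
nested <- sum(sqrt(c(1, 4, 9)))

identical(piped, nested)  # TRUE, both are 6

# Extra arguments stay where they are; the left side is slotted in first
rounded <- 3.14159 %>% round(digits = 2)  # same as round(3.14159, digits = 2)
```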
The Pipe
The Pipe
    Why? Chaining is easier than nesting calls or creating multiple temporary datasets. And we often think or operate in chains, doing something new to the thing we just worked on: “Take this thing, do this to it, do this other thing, then another, then group that, summarize that, and plot it.”
    The pipe helps reorder R constructs to match human language. Dplyr (with pipes) creates a “grammar” of data manipulation, which helps translate concepts into “sentences.”
The Pipe
    a1 <- group_by(flights, year, month, day)
    a2 <- select(a1, arr_delay, dep_delay)
    a3 <- summarise(a2,
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE))
    a4 <- filter(a3, arr > 30 | dep > 30)
The Pipe
    filter(
      summarise(
        select(
          group_by(flights, year, month, day),
          arr_delay, dep_delay
        ),
        arr = mean(arr_delay, na.rm = TRUE),
        dep = mean(dep_delay, na.rm = TRUE)
      ),
      arr > 30 | dep > 30
    )
The Pipe
    flights %>%
      group_by(year, month, day) %>%
      select(arr_delay, dep_delay) %>%
      summarise(
        arr = mean(arr_delay, na.rm = TRUE),
        dep = mean(dep_delay, na.rm = TRUE)
      ) %>%
      filter(arr > 30 | dep > 30)
The Pipe
    births$sex %>% hist()
    starwars %>% filter(mass > 100)
    # films is a list-column, so test inside each character's vector of films
    starwars %>% filter(sapply(films, function(f) "Revenge of the Sith" %in% f))
    # How new sf in GIS works....
The Pipe
    planet_bmi = starwars %>%
      group_by(homeworld) %>%
      summarise_at(c("height", "mass"), mean, na.rm = TRUE) %>%
      mutate(bmi = mass / (height/100)^2)
The Pipe More complex piping here: https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html
Tidy Data wide? long?
tidyr
tidyr
tidyr Most common: gather() spread() Less common: separate() unite()
    a = starwars %>% gather("num", "val", height, mass, birth_year)
    b = a %>% spread(num, val)
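The round trip is easier to follow on a table small enough to read in full. The sketch below uses an invented two-row table (`wide` and its values are hypothetical). Newer tidyr releases also offer pivot_longer() and pivot_wider() for the same reshaping, but gather() and spread() are used here to match the slides.

```r
library(dplyr)
library(tidyr)

# Toy wide table (hypothetical): one row per character, one column per measure
wide <- data.frame(
  name   = c("Luke", "Leia"),
  height = c(172, 150),
  mass   = c(77, 49),
  stringsAsFactors = FALSE
)

# Wide to long: column names go into "measure", cell values into "value"
long <- wide %>% gather("measure", "value", height, mass)

# Long to wide: the inverse, one column per distinct value of measure
wide_again <- long %>% spread(measure, value)

nrow(long)         # 4: two characters times two measures
names(wide_again)  # name, height, mass
```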
Advanced Concepts Things we’re not covering, but you should know exist http://dplyr.tidyverse.org/reference/index.html
Working with Databases
    Package dbplyr: translates your dplyr into SQL code to send to a connection.
    Try it out if you have access to a server!
    https://github.com/tidyverse/dbplyr
    https://github.com/tidyverse/dbplyr/blob/master/vignettes/dbplyr.Rmd
Non-Standard Evaluation
    It’s why unquoted column names work.
    It gets really hairy.
    Use case: what if you want to “program” with dplyr?
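The translation can be inspected without a server. The sketch below assumes the dbplyr package is installed and uses its lazy_frame() and simulate_sqlite() helpers to stand in for a real connection; the table contents and the name `flights_db` are made up, and the exact SQL text printed may differ between dbplyr versions.

```r
library(dplyr)
library(dbplyr)

# A stand-in for a remote table: no database is contacted.
# simulate_sqlite() tells dbplyr to translate using SQLite conventions.
flights_db <- lazy_frame(carrier = "AA", arr_delay = 12, con = simulate_sqlite())

# Ordinary dplyr verbs; nothing is computed yet
query <- flights_db %>%
  filter(arr_delay > 30) %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE))

# Print the SQL that would be sent to the connection
show_query(query)

# The same SQL as a character string
sql_text <- as.character(sql_render(query))
```

With a real connection, the same pipeline would end in collect() to pull the results into R.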
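One way to "program" with dplyr is to wrap a pipeline in a function and forward the column arguments. The sketch below assumes a recent dplyr/rlang, which provide the `{{ }}` operator and the `.data` pronoun (both newer than these slides); the helper names `mean_by` and `mean_of` are hypothetical.

```r
library(dplyr)

# Unquoted column names passed through a function: wrap them in {{ }}
mean_by <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(avg = mean({{ value_col }}, na.rm = TRUE))
}

species_height <- mean_by(starwars, species, height)

# Column name held in a string: look it up with the .data pronoun
mean_of <- function(data, col_name) {
  summarise(data, avg = mean(.data[[col_name]], na.rm = TRUE))
}

overall_mass <- mean_of(starwars, "mass")
```

Without `{{ }}`, group_by(group_col) would look for a column literally named group_col, which is the "hairy" part the slide refers to.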
Integration w/ other packages %>% passes objects (often data) around into first argument. What have we seen recently that starts with data?
What does this do?
    # Instaggplot
    starwars %>%
      group_by(homeworld) %>%
      summarise_at(c("height", "mass"), mean, na.rm = TRUE) %>%
      mutate(bmi = mass / (height/100)^2) %>%
      ggplot(aes(homeworld, bmi, fill = homeworld)) +
      geom_col(show.legend = FALSE) +
      coord_flip()
Putting it all together Back to births
Let’s Try
    What are the mean and SD of weeks of gestation by race-ethnicity group?
    Construct a dplyr “sentence” to look at county-specific effects on preterm and pnc5. (HW3!)
Answers
    births %>%
      left_join(data.frame(mrace = 1:4, race_f = c("W", "B", "AI/AN", "O"))) %>%
      group_by(race_f, methnic) %>%
      summarise(avg_gest = mean(wksgest, na.rm = TRUE),
                gest_sd = sd(wksgest, na.rm = TRUE),
                n = n()) %>%
      # Note: these bounds are an SD-based range, not a standard-error confidence interval of the mean
      mutate(ci_low = avg_gest - 0.5*1.96*gest_sd,
             ci_high = avg_gest + 0.5*1.96*gest_sd) %>%
      arrange(avg_gest) %>%
      filter(methnic != "U" & race_f != "O") %>%
      unite(raceeth, race_f, methnic, sep = ".") %>%
      ggplot(aes(raceeth, avg_gest, fill = avg_gest)) +
      geom_col() +
      geom_linerange(aes(x = raceeth, ymin = ci_low, ymax = ci_high), color = "grey") +
      geom_text(aes(label = round(avg_gest, 1)), nudge_y = 1)
    #Q2:
See you on Wednesday! Homework 3!