1
Dplyr I EPID 799C Mon Sep
2
Today’s Overview: Pipes; dplyr verbs (filter, summarise, group_by); Windows demos; Homework 1: thoughts; Homework 2: due Wed; Homework 3: coming soon!
3
dplyr theory, verbs
4
Dplyr big-picture
Standard grammar of data manipulation: standard “words” and “phrases.” More abstraction for us humans.
Dataset abstracted: base R largely operates on vectors; dplyr is oriented toward operating on data sets all at once. Functions aim at returning datasets.
Smart & efficient: e.g. use dplyr on a database connection, and dplyr translates to SQL for you.
5
Dplyr big-picture
One-table verbs: filter, select, arrange, summarize, mutate, group_by
Linking phrases: pipe %>% (think “…then…”)
Multi-table verbs: mutating & filtering table joins, set operations, binding
Concepts: tidy data, wide & long data
6
Whiteboard Overview Use the words in a sentence!
7
Sidenote: Star Wars One of a few example datasets included in tidyr/dplyr; starwars itself ships with dplyr.
8
filter()
9
filter() We already have a way to do this in base R: [] indexing. filter(starwars, homeworld=="Tatooine") is almost the same as: starwars[starwars$homeworld == "Tatooine",]
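One place where the two forms differ is missing values. The sketch below uses a small made-up data frame (`people` is hypothetical, not part of the course data) to show that filter() keeps only rows where the condition is TRUE, while bracket indexing also returns an all-NA row when the comparison is NA.

```r
library(dplyr)

# Small hypothetical data frame with one missing homeworld
people <- data.frame(
  name = c("Luke", "Leia", "Yoda"),
  homeworld = c("Tatooine", "Alderaan", NA),
  stringsAsFactors = FALSE
)

# filter() keeps only rows where the condition evaluates to TRUE
via_filter <- filter(people, homeworld == "Tatooine")

# Bracket indexing also returns a row of NAs for the NA comparison
via_bracket <- people[people$homeworld == "Tatooine", ]

nrow(via_filter)   # 1
nrow(via_bracket)  # 2
```

With base R, wrapping the condition in which() would drop the NA row as well.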
10
select()
11
select() select(starwars, name, height, mass)
12
arrange() arrange(starwars, name) arrange(starwars, desc(homeworld))
13
mutate() Row-by-row actions
14
mutate() mutate(starwars, is_tatooine_native = homeworld=="Tatooine") transmute(starwars, is_tatooine_native = homeworld=="Tatooine")
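The two calls above differ only in which columns come back. A quick check, assuming the starwars data that ships with dplyr:

```r
library(dplyr)

# mutate() keeps every existing column and appends the new one
with_flag <- mutate(starwars, is_tatooine_native = homeworld == "Tatooine")

# transmute() returns only the columns created in the call
only_flag <- transmute(starwars, is_tatooine_native = homeworld == "Tatooine")

ncol(with_flag) == ncol(starwars) + 1  # TRUE
names(only_flag)                        # "is_tatooine_native"
```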
15
mutate() Window functions Others (rolling & recycled aggregates) are beyond the scope of this introduction
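As a rough illustration of what a window function does inside mutate(), the sketch below ranks, accumulates, and offsets values within groups. The `scores` data frame is made up for this example; min_rank(), lag(), and cumsum() are existing dplyr and base R functions.

```r
library(dplyr)

# Hypothetical toy data: points scored by members of two teams
scores <- data.frame(
  team = c("a", "a", "a", "b", "b"),
  pts  = c(10, 30, 20, 5, 15)
)

ranked <- scores %>%
  group_by(team) %>%
  mutate(
    rank_in_team = min_rank(desc(pts)),  # ranking: 1 = highest pts in the team
    running_pts  = cumsum(pts),          # cumulative sum within the team
    prev_pts     = lag(pts)              # previous row's value within the team
  ) %>%
  ungroup()

ranked
```

Unlike summarise(), every input row is still present in the output; each new value depends on the row's group rather than on the row alone.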
16
summarise() Many to one operations
17
summarise() summarise(starwars, avg_height = mean(height, na.rm=T), avg_mass = mean(mass, na.rm=T)) summarise_at(starwars, c("height", "mass"), mean, na.rm=T)
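In dplyr 1.0.0 and later, summarise_at() is superseded by across(). A minimal sketch of the equivalent call, assuming a dplyr version that provides across():

```r
library(dplyr)

# Older scoped-verb style, as on the slide
old_style <- summarise_at(starwars, c("height", "mass"), mean, na.rm = TRUE)

# across() style: pick columns, then give the function to apply to each
new_style <- summarise(
  starwars,
  across(c(height, mass), ~ mean(.x, na.rm = TRUE))
)

# Both return a one-row table with columns height and mass
all.equal(as.data.frame(old_style), as.data.frame(new_style))
```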
18
group_by() Groups the rows of a data.frame* by one or more variables so that multiple summarizing (or windowed*) actions can be performed per group.
19
group_by() group_by(starwars, homeworld) summarise_at( group_by(starwars, homeworld), c("height", "mass"), mean, na.rm=T)
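It can help to see that group_by() on its own does not change the data, only how later verbs treat it. A short sketch using starwars:

```r
library(dplyr)

by_world <- group_by(starwars, homeworld)

# Same rows and columns as before; only grouping metadata was added
nrow(by_world) == nrow(starwars)  # TRUE
group_vars(by_world)              # "homeworld"

# summarise() now returns one row per group (NA homeworld is its own group)
per_world <- summarise(
  by_world,
  n = n(),
  avg_height = mean(height, na.rm = TRUE)
)

nrow(per_world) == n_distinct(starwars$homeworld)  # TRUE
```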
20
Multi-Table Operations
21
Tibble sidenote
22
Tibbles
A layer built on data.frames. They largely work the same (if not, as.data.frame() it), but support retaining groups, prettier printing, etc.
class(starwars)
str(starwars)
Note a slick move with films, vehicles, starships…
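A few quick checks make the differences concrete. In starwars, the films, vehicles, and starships columns are list-columns, where each cell holds a whole vector. The `sw_df` name below is just for illustration.

```r
library(dplyr)

class(starwars)        # "tbl_df" "tbl" "data.frame"
class(starwars$films)  # "list": each cell holds a character vector of films

# Fall back to a plain data.frame when something does not accept a tibble
sw_df <- as.data.frame(starwars)
class(sw_df)           # "data.frame"

# Selecting one column with [ , ]: a tibble stays a table,
# a plain data.frame drops to a bare vector
is.data.frame(starwars[, "name"])  # TRUE
is.character(sw_df[, "name"])      # TRUE
```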
23
Pipes
24
The Pipe What? Simplest pipe (%>%) takes what’s on the left and makes it the new first argument of what’s on the right a %>% b(arg1=1, arg2=2) becomes b(a, arg1=1, arg2=2)
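The rewrite rule can be checked directly with any function. The sketch below uses base R's paste(); the second half shows magrittr's `.` placeholder, which goes slightly beyond the "simplest pipe" described here.

```r
library(magrittr)

# Left-hand side becomes the first argument of the call on the right
piped  <- "x" %>% paste("y", sep = "-")
direct <- paste("x", "y", sep = "-")
identical(piped, direct)  # TRUE, both are "x-y"

# With a dot as an argument, the left-hand side is placed there instead
placed <- "y" %>% paste("x", ., sep = "-")
identical(placed, direct)  # TRUE
```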
25
The Pipe
26
The Pipe Why? Easier to chain than nesting or multiple temporary datasets. And we often think or operate in chains, doing something new to the thing we just worked on: “Take this thing, do this to it, do this other thing, then another, then group that, summarize that, and plot it.” Helps reorder R constructs to human language. Dplyr (with pipes) creates a “grammar” of data manipulation, which helps translate concepts into “sentences.”
27
The Pipe
a1 <- group_by(flights, year, month, day)
a2 <- select(a1, arr_delay, dep_delay)
a3 <- summarise(a2, arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE))
a4 <- filter(a3, arr > 30 | dep > 30)
28
The Pipe
filter(
  summarise(
    select(
      group_by(flights, year, month, day),
      arr_delay, dep_delay
    ),
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ),
  arr > 30 | dep > 30
)
29
The Pipe
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
30
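The Pipe
births$sex %>% hist()
starwars %>% filter(mass > 100)
# films is a list column, so test each character's vector of films separately
starwars %>% filter(sapply(films, function(f) "Revenge of the Sith" %in% f))
# How new sf in GIS works....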
31
The Pipe
planet_bmi = starwars %>%
  group_by(homeworld) %>%
  summarise_at(c("height", "mass"), mean, na.rm=T) %>%
  mutate(bmi = mass / (height/100)^2)
32
The Pipe More complex piping here: project.org/web/packages/magrittr/vignettes/magrittr.html
33
Tidy Data wide? long?
34
tidyr
35
tidyr
36
tidyr Most common: gather() spread() Less common: separate() unite()
37
a = starwars %>% gather("num", "val", height, mass, birth_year)
b = a %>% spread(num, val)
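Since tidyr 1.0.0, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same round trip with the newer verbs, assuming a tidyr version that provides them; `sw` is a trimmed copy of starwars made for this example, without the list-columns.

```r
library(dplyr)
library(tidyr)

# Keep an identifier plus the three measurement columns
sw <- starwars %>% select(name, height, mass, birth_year)

# Wide to long: one row per character per measurement
long <- sw %>%
  pivot_longer(c(height, mass, birth_year), names_to = "num", values_to = "val")

# Long back to wide: one column per measurement again
wide <- long %>%
  pivot_wider(names_from = num, values_from = val)

nrow(long) == 3 * nrow(sw)  # TRUE
nrow(wide) == nrow(sw)      # TRUE
```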
38
Advanced Concepts Things we’re not covering, but you should know exist
39
Working with Databases
Package dbplyr: translates your dplyr into SQL code to send to a connection. Try it out if you have access to a server!
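Without a server, the translation can still be previewed. The sketch below assumes a recent dbplyr that provides lazy_frame() and simulate_postgres(); `remote_births` is a stand-in table with two columns named after the births variables used later in this deck, and no database is contacted.

```r
library(dplyr)
library(dbplyr)

# Stand-in for a remote table; values are placeholders, nothing is queried
remote_births <- lazy_frame(wksgest = 39, mrace = 1, con = simulate_postgres())

query <- remote_births %>%
  filter(wksgest < 37) %>%
  group_by(mrace) %>%
  summarise(n = n())

# Print the SQL that dplyr would send to the connection
show_query(query)

# The same SQL as a character string
sql_text <- as.character(sql_render(query))
```

With a real connection, the same pipeline runs on the database, and collect() brings the result back into R.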
40
Non-Standard Evaluation
It’s why not quoting things works. It gets really hairy. Use case: What if you want to “program” with dplyr?
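One common way to "program" with dplyr is the embrace operator `{{ }}`, which passes unquoted column names through a function. A minimal sketch, assuming dplyr 1.0 or later; `mean_by()` is a hypothetical helper written for this example, not part of dplyr.

```r
library(dplyr)

# Hypothetical helper: group by any column and average any other column
mean_by <- function(data, group_col, value_col) {
  data %>%
    group_by({{ group_col }}) %>%
    summarise(avg = mean({{ value_col }}, na.rm = TRUE), .groups = "drop")
}

# Column names are passed unquoted, just like in the dplyr verbs themselves
by_species <- mean_by(starwars, species, height)
by_species
```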
41
Integration w/ other packages
%>% passes objects (often data) around into first argument. What have we seen recently that starts with data?
42
What does this do?
# Instaggplot
starwars %>%
  group_by(homeworld) %>%
  summarise_at(c("height", "mass"), mean, na.rm=T) %>%
  mutate(bmi = mass / (height/100)^2) %>%
  ggplot(aes(homeworld, bmi, fill=homeworld)) +
  geom_col(show.legend = F) +
  coord_flip()
43
Putting it all together
Back to births
44
Let’s Try What are the mean and SD of weeks of gestation by race-ethnicity group? Construct a dplyr “sentence” to look at county-specific effects on preterm and pnc5. (HW3!)
45
Answers
births %>%
  left_join(data.frame(mrace=1:4, race_f=c("W", "B", "AI/AN", "O"))) %>%
  group_by(race_f, methnic) %>%
  summarise(avg_gest = mean(wksgest, na.rm = T), gest_sd = sd(wksgest, na.rm=T), n=n()) %>%
  mutate(ci_low = avg_gest-0.5*1.96*gest_sd, ci_high = avg_gest+0.5*1.96*gest_sd) %>%
  arrange(avg_gest) %>%
  filter(methnic != "U" & race_f != "O") %>%
  unite(raceeth, race_f, methnic, sep=".") %>%
  ggplot(aes(raceeth, avg_gest, fill=avg_gest)) +
  geom_col() +
  geom_linerange(aes(x=raceeth, ymin=ci_low, ymax=ci_high), color="grey") +
  geom_text(aes(label=round(avg_gest, 1)), nudge_y = 1)
#Q2: See you on Wednesday! Homework 3!