Dplyr I EPID 799C Mon Sep 24 2017.


1 Dplyr I EPID 799C Mon Sep

2 Today’s Overview Pipes dplyr verbs: filter summarise group_by Windows demos Homework 1: thoughts Homework 2: due Wed Homework 3: coming soon!

3 dplyr theory, verbs

4 Dplyr big-picture Standard grammar of data manipulation: Standard “words” and “phrases.” More abstraction for us humans. Dataset abstracted. Base R largely operates on vectors. Dplyr is oriented toward operating on whole datasets at once. Functions aim at returning datasets. Smart & efficient. E.g. use dplyr on a database connection, and dplyr translates to SQL for you.

5 Dplyr big-picture One Table Verbs filter, select, arrange, summarize, mutate, group_by Linking Phrases Pipe %>% (think “…then…”) Multi-Table Verbs mutating & filtering table joins, set operations, binding Concepts / tidy data wide & long data

6 Whiteboard Overview Use the words in a sentence!

7 Sidenote: Star Wars One of a few datasets included in dplyr (starwars)

8 filter()

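9 filter() Base R already has a way to do this with [] filter(starwars, homeworld=="Tatooine") Almost the same as: starwars[starwars$homeworld == "Tatooine",] (the [] version also returns all-NA rows where homeworld is missing; filter() drops them)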
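
The "almost" on this slide can be seen on a small example. This is a minimal sketch on a made-up data frame (`people` and its values are hypothetical, not part of the course data), showing how the two approaches treat a missing homeworld:

```r
library(dplyr)

# Hypothetical toy data: one homeworld is missing (NA)
people <- data.frame(
  name = c("Luke", "Leia", "Rey"),
  homeworld = c("Tatooine", "Alderaan", NA),
  stringsAsFactors = FALSE
)

# filter() keeps only rows where the condition is TRUE, so the NA row is dropped
via_filter <- filter(people, homeworld == "Tatooine")

# Bracket indexing with an NA in the logical index yields an extra all-NA row
via_bracket <- people[people$homeworld == "Tatooine", ]

nrow(via_filter)
nrow(via_bracket)
```

If the bracket form is needed, wrapping the condition in `which()` is one common way to get the same rows as `filter()`.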

10 select()

11 select() select(starwars, name, height, mass)

12 arrange() arrange(starwars, name) arrange(starwars, desc(homeworld))

13 mutate() Row-by-row actions

14 mutate() mutate(starwars, is_tatooine_native = homeworld=="Tatooine") transmute(starwars, is_tatooine_native = homeworld=="Tatooine")
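
The difference between the two calls is only in which columns come back. A small sketch on a hypothetical two-row data frame (not the real starwars data):

```r
library(dplyr)

# Hypothetical toy data standing in for starwars
people <- data.frame(
  name = c("Luke", "Leia"),
  homeworld = c("Tatooine", "Alderaan"),
  stringsAsFactors = FALSE
)

# mutate() adds the new column and keeps the existing ones
with_flag <- mutate(people, is_tatooine_native = homeworld == "Tatooine")

# transmute() keeps only the columns it creates
only_flag <- transmute(people, is_tatooine_native = homeworld == "Tatooine")

names(with_flag)
names(only_flag)
```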

15 mutate() Window functions Others (rolling & recycled aggregates) are beyond the scope of this introduction
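
As a rough illustration of what a window function looks like inside mutate(), here is a sketch using `min_rank()` and `lag()` within groups. The data frame `heights` is invented for the example:

```r
library(dplyr)

# Hypothetical toy data: two homeworlds, two people each
heights <- data.frame(
  homeworld = c("A", "A", "B", "B"),
  name = c("w", "x", "y", "z"),
  height = c(150, 180, 170, 160),
  stringsAsFactors = FALSE
)

ranked <- heights %>%
  group_by(homeworld) %>%
  mutate(
    height_rank = min_rank(desc(height)),  # 1 = tallest within the homeworld
    diff_from_prev = height - lag(height)  # change from the previous row in the group
  ) %>%
  ungroup()

ranked
```

Unlike summarise(), a window function returns one value per input row, so the number of rows does not change.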

16 summarise() Many to one operations

17 summarise() summarise(starwars, avg_height = mean(height, na.rm=T), avg_mass = mean(mass, na.rm=T)) summarise_at(starwars, c("height", "mass"), mean, na.rm=T)

18 group_by() Groups the rows of a data.frame* by one or more variables, so that later summarizing (or windowed*) actions are performed within each group.

19 group_by() group_by(starwars, homeworld) summarise_at( group_by(starwars, homeworld), c("height", "mass"), mean, na.rm=T)
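
In dplyr versions from 1.0 onward (an assumption about your installed version), the same grouped summary is usually written with `across()`, while `summarise_at()` still works. A sketch comparing the two on invented data:

```r
library(dplyr)

# Hypothetical toy data with one missing height
heights <- data.frame(
  homeworld = c("A", "A", "B"),
  height = c(150, NA, 170),
  mass = c(50, 70, 80),
  stringsAsFactors = FALSE
)

# Style used on the slide
old_style <- heights %>%
  group_by(homeworld) %>%
  summarise_at(c("height", "mass"), mean, na.rm = TRUE)

# Equivalent with across(); assumes dplyr >= 1.0.0
new_style <- heights %>%
  group_by(homeworld) %>%
  summarise(across(c(height, mass), ~ mean(.x, na.rm = TRUE)))

old_style
new_style
```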

20 Multi-Table Operations

21 Tibble sidenote

22 Tibbles A layer built on data.frames Largely work the same (if not, as.data.frame() it), but support retaining groups, prettier printing, etc. class(starwars) str(starwars) Note a slick move with films, vehicles, starships…

23 Pipes

24 The Pipe What? Simplest pipe (%>%) takes what’s on the left and makes it the new first argument of what’s on the right a %>% b(arg1=1, arg2=2) becomes b(a, arg1=1, arg2=2)
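
The rewrite rule above can be checked directly. A minimal sketch using base functions only (the numbers are arbitrary):

```r
library(dplyr)  # dplyr re-exports the magrittr pipe %>%

# Piped: take the vector, then sqrt(), then sum()
piped <- c(1, 4, 9) %>% sqrt() %>% sum()

# Nested: the same calls written inside out
nested <- sum(sqrt(c(1, 4, 9)))

identical(piped, nested)

# Extra arguments stay where they are; the left side becomes the first argument
rounded <- 3.14159 %>% round(digits = 2)  # same as round(3.14159, digits = 2)
```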

25 The Pipe

26 The Pipe Why? Easier to chain than nesting or multiple temporary datasets. And we often think or operate in chains, doing something new to the thing we just worked on. “Take this thing, do this to it, do this other thing, then another, then group that, summarize that, and plot it.” Helps reorder R constructs to human language. Dplyr (with pipes) creates a “grammar” of data manipulation, which helps translate concepts into “sentences.”

27 The Pipe a1 <- group_by(flights, year, month, day) a2 <- select(a1, arr_delay, dep_delay) a3 <- summarise(a2, arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE)) a4 <- filter(a3, arr > 30 | dep > 30)

28 The Pipe filter( summarise( select( group_by(flights, year, month, day), arr_delay, dep_delay ), arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ), arr > 30 | dep > 30 )

29 The Pipe flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30)

30 The Pipe births$sex %>% hist() starwars %>% filter(mass > 100) starwars %>% filter(films %in% "Revenge of the Sith") # How new sf in GIS works....
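
A note on the `films` example: `films` is a list column (each row holds a vector of film titles), so `%in%` compares each list element as a whole and is likely to miss characters who appear in several films. One possible per-row check, sketched on a hypothetical three-row tibble rather than the real starwars data:

```r
library(dplyr)

# Hypothetical toy tibble with a list column, mimicking starwars$films
characters <- tibble::tibble(
  name = c("A", "B", "C"),
  films = list(
    c("Revenge of the Sith", "A New Hope"),
    "Revenge of the Sith",
    "A New Hope"
  )
)

# Form from the slide: each list element is matched as a single unit
naive <- characters %>%
  filter(films %in% "Revenge of the Sith")

# Per-row check: is the title anywhere in this row's vector of films?
per_row <- characters %>%
  filter(vapply(films, function(f) "Revenge of the Sith" %in% f, logical(1)))

per_row$name
```

`purrr::map_lgl()` would be an equivalent alternative to `vapply()` here.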

31 The Pipe planet_bmi = starwars %>% group_by(homeworld) %>% summarise_at(c("height", "mass"), mean, na.rm=T) %>% mutate(bmi = mass / (height/100)^2)

32 The Pipe More complex piping here: project.org/web/packages/magrittr/vignettes/magrittr.html

33 Tidy Data wide? long?

34 tidyr

35 tidyr

36 tidyr Most common: gather() spread() Less common: separate() unite()

37 a = starwars %>% gather("num", "val", height, mass, birth_year) b = a %>% spread(num, val)
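
To see what the wide-to-long round trip does, here is a sketch on a tiny invented data frame. The last step uses `pivot_longer()`, which newer tidyr releases (1.0.0 and later, an assumption about your installed version) offer as the successor to gather():

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one row per person, one column per measurement
wide <- data.frame(
  name = c("Luke", "Leia"),
  height = c(172, 150),
  mass = c(77, 49),
  stringsAsFactors = FALSE
)

# Wide -> long: one row per person-measurement pair
long <- wide %>% gather("num", "val", height, mass)

# Long -> wide again
back <- long %>% spread(num, val)

# Same reshaping with the newer interface; assumes tidyr >= 1.0.0
long2 <- wide %>%
  pivot_longer(c(height, mass), names_to = "num", values_to = "val")

long
back
```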

38 Advanced Concepts Things we’re not covering, but you should know exist

39 Working with Databases
Package dbplyr Translates your dplyr into SQL code to send to a connection Try it out if you have access to a server!
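
Without a server, the translation can still be previewed. This sketch assumes the dbplyr package is installed and uses its `lazy_frame()` with a simulated SQLite connection, so nothing is actually sent to a database; the column names and values are made up:

```r
library(dplyr)
library(dbplyr)

# A lazy table backed by a simulated connection (no real database involved)
remote_like <- lazy_frame(
  homeworld = c("Tatooine", "Naboo"),
  height = c(172, 165),
  con = simulate_sqlite()
)

# Ordinary dplyr verbs; nothing is computed yet
query <- remote_like %>%
  filter(homeworld == "Tatooine") %>%
  summarise(avg_height = mean(height, na.rm = TRUE))

# Render the SQL that dbplyr would send to the connection
sql_text <- as.character(sql_render(query))
cat(sql_text)
```

The exact SQL text depends on the dbplyr version and the backend being simulated.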

40 Non-Standard Evaluation
It’s why not quoting things works It gets really hairy Use case: What if you want to “program” with dplyr?
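
One way to "program" with dplyr is the embrace operator `{{ }}`, which assumes a reasonably recent dplyr/rlang (rlang 0.4.0 or later). The function `mean_by()` and the data below are hypothetical, written only to illustrate the idea:

```r
library(dplyr)

# Hypothetical helper: group by one column and average another.
# {{ }} forwards the unquoted column names the caller typed.
mean_by <- function(data, group_var, value_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(avg = mean({{ value_var }}, na.rm = TRUE))
}

# Invented toy data
toy <- data.frame(
  homeworld = c("A", "A", "B"),
  height = c(150, 170, 180),
  stringsAsFactors = FALSE
)

result <- mean_by(toy, homeworld, height)
result
```

Without `{{ }}`, `group_by(group_var)` would look for a column literally named `group_var`, which is where things start to get hairy.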

41 Integration w/ other packages
%>% passes objects (often data) around into first argument. What have we seen recently that starts with data?

42 What does this do? # Instaggplot starwars %>% group_by(homeworld) %>% summarise_at(c("height", "mass"), mean, na.rm=T) %>% mutate(bmi = mass / (height/100)^2) %>% ggplot(aes(homeworld, bmi, fill=homeworld)) + geom_col(show.legend = F)+ coord_flip()

43 Putting it all together
Back to births

44 Let’s Try What are the mean and SD of weeks of gestation by race-ethnicity group? Construct a dplyr “sentence” to look at county-specific effects on preterm and pnc5. (HW3!)

45 Answers births %>% left_join(data.frame(mrace=1:4, race_f=c("W", "B", "AI/AN", "O"))) %>% group_by(race_f, methnic) %>% summarise(avg_gest = mean(wksgest, na.rm = T), gest_sd = sd(wksgest, na.rm=T), n=n()) %>% mutate(ci_low = avg_gest-0.5*1.96*gest_sd, ci_high = avg_gest+0.5*1.96*gest_sd) %>% arrange(avg_gest) %>% filter(methnic != "U" & race_f != "O") %>% unite(raceeth, race_f, methnic, sep=".") %>% ggplot(aes(raceeth, avg_gest, fill=avg_gest))+ geom_col()+ geom_linerange(aes(x=raceeth, ymin=ci_low, ymax=ci_high), color="grey")+ geom_text(aes(label=round(avg_gest, 1)), nudge_y = 1) #Q2: See you on Wednesday! Homework 3!

