Dplyr I EPID 799C Mon Sep 24 2017.

Today’s Overview
Pipes
dplyr verbs: filter, summarise, group_by
Windows
Demos
Homework 1: thoughts
Homework 2: due Wed
Homework 3: coming soon!

dplyr theory, verbs

dplyr big picture
A standard grammar of data manipulation: standard “words” and “phrases.”
More abstraction for us humans.
The dataset is abstracted: base R largely operates on vectors, while dplyr is oriented toward operating on whole datasets at once, and its functions aim to return datasets.
Smart and efficient: e.g., use dplyr on a database connection and dplyr translates your code to SQL for you.

dplyr big picture
One-table verbs: filter, select, arrange, summarize, mutate, group_by
Linking phrases: the pipe %>% (think “…then…”)
Multi-table verbs: mutating and filtering table joins, set operations, binding
Concepts: tidy data, wide and long data

Whiteboard Overview Use the words in a sentence!

Sidenote: Star Wars One of a few datasets included in tidyr/dplyr http://dplyr.tidyverse.org/reference/starwars.html#examples

filter()

filter()
We already have a way to do this with [ ]:
    filter(starwars, homeworld == "Tatooine")
Almost the same as:
    starwars[starwars$homeworld == "Tatooine", ]
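The "almost" matters when the column has missing values. A minimal sketch, assuming dplyr is installed; the object names (sw, by_filter, by_bracket, by_which) are illustrative, and a plain data.frame copy is used so the bracket behavior is base R's. filter() drops rows where the condition is NA, while bracket subsetting keeps one all-NA row for each of them.

```r
library(dplyr)

# Plain data.frame copy of two columns (illustrative subset)
sw <- as.data.frame(starwars[, c("name", "homeworld")])

by_filter  <- filter(sw, homeworld == "Tatooine")
by_bracket <- sw[sw$homeworld == "Tatooine", ]

# The bracket version carries one extra all-NA row per missing homeworld
n_missing <- sum(is.na(sw$homeworld))
nrow(by_bracket) == nrow(by_filter) + n_missing

# which() drops the NA positions, so this matches filter()
by_which <- sw[which(sw$homeworld == "Tatooine"), ]
nrow(by_which) == nrow(by_filter)
```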

select()

select()
    select(starwars, name, height, mass)

arrange()
    arrange(starwars, name)
    arrange(starwars, desc(homeworld))

mutate() Row-by-row actions

mutate()
    mutate(starwars, is_tatooine_native = homeworld == "Tatooine")
    transmute(starwars, is_tatooine_native = homeworld == "Tatooine")
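The two calls differ only in what they keep. A small sketch, assuming dplyr is installed (the names added and only_new are illustrative): mutate() returns every original column plus the new one, while transmute() returns only the columns it creates.

```r
library(dplyr)

added    <- mutate(starwars, is_tatooine_native = homeworld == "Tatooine")
only_new <- transmute(starwars, is_tatooine_native = homeworld == "Tatooine")

ncol(added) == ncol(starwars) + 1   # all original columns plus one
names(only_new)                     # just the newly created column
```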

mutate()
Window functions
Others (rolling and recycled aggregates) are beyond the scope of this introduction.
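For a feel of what a window function does inside mutate(), here is a minimal sketch, assuming dplyr is installed; the column name height_rank is made up for illustration. A ranking function such as min_rank() returns one value per row, computed within each group, so the number of rows does not change.

```r
library(dplyr)

ranked <- starwars %>%
  select(name, homeworld, height) %>%
  group_by(homeworld) %>%
  # rank characters by height within their homeworld, tallest first
  mutate(height_rank = min_rank(desc(height))) %>%
  ungroup()

nrow(ranked) == nrow(starwars)   # still one row per character
```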

summarise() Many to one operations

summarise()
    summarise(starwars, avg_height = mean(height, na.rm = TRUE), avg_mass = mean(mass, na.rm = TRUE))
    summarise_at(starwars, c("height", "mass"), mean, na.rm = TRUE)

group_by()
Groups the rows of a data.frame* by one or more variables, so that later summarizing (or windowed*) actions are performed within each group.

group_by()
    group_by(starwars, homeworld)
    summarise_at(group_by(starwars, homeworld), c("height", "mass"), mean, na.rm = TRUE)
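In dplyr 1.0.0 and later, summarise_at() is superseded by across(). A sketch of the same grouped summary in both styles, assuming a dplyr version that provides across(); old_style and new_style are illustrative names.

```r
library(dplyr)

# Style used on the slide
old_style <- summarise_at(group_by(starwars, homeworld),
                          c("height", "mass"), mean, na.rm = TRUE)

# across() equivalent (requires dplyr >= 1.0.0)
new_style <- starwars %>%
  group_by(homeworld) %>%
  summarise(across(c(height, mass), ~ mean(.x, na.rm = TRUE)))

names(new_style)   # homeworld, height, mass
```

Both return one row per homeworld with the group-wise means of height and mass.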

Multi-Table Operations

Tibble sidenote

Tibbles
A layer built on data.frames.
They largely work the same (if not, as.data.frame() it), but support retaining groups, prettier printing, etc.
    class(starwars)
    str(starwars)
Note a slick move with films, vehicles, starships…

Pipes

The Pipe
What? The simplest pipe (%>%) takes what’s on the left and makes it the new first argument of what’s on the right:
    a %>% b(arg1 = 1, arg2 = 2)
becomes
    b(a, arg1 = 1, arg2 = 2)
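The rewrite rule can be checked on a toy vector. A minimal sketch, assuming dplyr (which re-exports %>% from magrittr) is installed; x, piped, nested and rounded are illustrative names.

```r
library(dplyr)

x <- c(1, 4, 9)

piped  <- x %>% sqrt() %>% sum()   # "take x, then sqrt, then sum"
nested <- sum(sqrt(x))             # the same calls, written inside-out

identical(piped, nested)

# Extra arguments stay where they are; the left side is slotted in first
rounded <- 3.14159 %>% round(digits = 2)   # same as round(3.14159, digits = 2)
```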

The Pipe

The Pipe
Why? Chaining is easier than nesting calls or creating multiple temporary datasets. And we often think or operate in chains, doing something new to the thing we just worked on: “Take this thing, do this to it, do this other thing, then another, then group that, summarize that, and plot it.” The pipe helps reorder R constructs to match human language. Dplyr (with pipes) creates a “grammar” of data manipulation, which helps translate concepts into “sentences.”

The Pipe
    a1 <- group_by(flights, year, month, day)
    a2 <- select(a1, arr_delay, dep_delay)
    a3 <- summarise(a2,
      arr = mean(arr_delay, na.rm = TRUE),
      dep = mean(dep_delay, na.rm = TRUE))
    a4 <- filter(a3, arr > 30 | dep > 30)

The Pipe
    filter(
      summarise(
        select(
          group_by(flights, year, month, day),
          arr_delay, dep_delay
        ),
        arr = mean(arr_delay, na.rm = TRUE),
        dep = mean(dep_delay, na.rm = TRUE)
      ),
      arr > 30 | dep > 30
    )

The Pipe
    flights %>%
      group_by(year, month, day) %>%
      select(arr_delay, dep_delay) %>%
      summarise(
        arr = mean(arr_delay, na.rm = TRUE),
        dep = mean(dep_delay, na.rm = TRUE)
      ) %>%
      filter(arr > 30 | dep > 30)

The Pipe
    births$sex %>% hist()
    starwars %>% filter(mass > 100)
    starwars %>% filter(films %in% "Revenge of the Sith")  # How new sf in GIS works....
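A caution on the last line: films is a list column, with one character vector of film titles per row. Applied directly to the column, %in% compares each whole list element to the title rather than looking inside it, so it is unlikely to return every character who appears in that film. One way to test membership row by row, sketched with base R's vapply() (the name in_sith is illustrative):

```r
library(dplyr)

in_sith <- starwars %>%
  # for each row, ask whether the title is among that character's films
  filter(vapply(films, function(f) "Revenge of the Sith" %in% f, logical(1)))

nrow(in_sith)
```

purrr::map_lgl() would do the same job as vapply() here, if purrr is available.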

The Pipe
    planet_bmi = starwars %>%
      group_by(homeworld) %>%
      summarise_at(c("height", "mass"), mean, na.rm = TRUE) %>%
      mutate(bmi = mass / (height / 100)^2)

The Pipe
More complex piping here: https://cran.r-project.org/web/packages/magrittr/vignettes/magrittr.html

Tidy Data wide? long?

tidyr

tidyr

tidyr
Most common: gather(), spread()
Less common: separate(), unite()

    a = starwars %>% gather("num", "val", height, mass, birth_year)
    b = a %>% spread(num, val)
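In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). A sketch of the same round trip with the newer functions, assuming a tidyr version that provides them; it keeps only a few atomic columns so the list columns stay out of the way, and the names long and wide are illustrative.

```r
library(dplyr)
library(tidyr)

# wide -> long: one row per character per measurement
long <- starwars %>%
  select(name, height, mass, birth_year) %>%
  pivot_longer(c(height, mass, birth_year),
               names_to = "num", values_to = "val")

# long -> wide: back to one column per measurement
wide <- long %>%
  pivot_wider(names_from = num, values_from = val)

nrow(long) == 3 * nrow(starwars)
```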

Advanced Concepts Things we’re not covering, but you should know exist http://dplyr.tidyverse.org/reference/index.html

Working with Databases
Package dbplyr translates your dplyr code into SQL to send to a connection.
Try it out if you have access to a server!
https://github.com/tidyverse/dbplyr
https://github.com/tidyverse/dbplyr/blob/master/vignettes/dbplyr.Rmd
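No server is needed to see the translation. A minimal sketch using an in-memory SQLite database, assuming the DBI, RSQLite and dbplyr packages are installed; the table name "sw" and the object names are made up for illustration.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Copy a few atomic columns into the database (list columns cannot be stored)
sw <- starwars %>% select(name, height, mass, homeworld)
copy_to(con, sw, name = "sw")

query <- tbl(con, "sw") %>%
  group_by(homeworld) %>%
  summarise(avg_height = mean(height, na.rm = TRUE))

show_query(query)          # prints the SQL that dplyr generated
result <- collect(query)   # runs it and brings the rows back into R

DBI::dbDisconnect(con)
```

Nothing is computed until collect() is called; before that, the query object is only a description of the SQL to run.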

Non-Standard Evaluation
It’s why not quoting things works.
It gets really hairy.
Use case: what if you want to “program” with dplyr?
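One common way to "program" with dplyr is the embrace operator {{ }}, available with rlang 0.4.0 and later. A sketch under that assumption; mean_by() is a hypothetical helper written for illustration, not a dplyr function.

```r
library(dplyr)

# Hypothetical helper: group by one column and average another.
# {{ }} forwards the unquoted column names the caller supplies.
mean_by <- function(data, group_var, value_var) {
  data %>%
    group_by({{ group_var }}) %>%
    summarise(avg = mean({{ value_var }}, na.rm = TRUE))
}

height_by_world <- mean_by(starwars, homeworld, height)
mass_by_species <- mean_by(starwars, species, mass)
```

Without the braces, group_by(group_var) would look for a column literally named group_var.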

Integration w/ other packages %>% passes objects (often data) around into first argument. What have we seen recently that starts with data?

What does this do?
    # Instaggplot
    starwars %>%
      group_by(homeworld) %>%
      summarise_at(c("height", "mass"), mean, na.rm = TRUE) %>%
      mutate(bmi = mass / (height / 100)^2) %>%
      ggplot(aes(homeworld, bmi, fill = homeworld)) +
      geom_col(show.legend = FALSE) +
      coord_flip()

Putting it all together Back to births

Let’s Try
What are the mean and SD of weeks of gestation by race-ethnicity group?
Construct a dplyr “sentence” to look at county-specific effects on preterm and pnc5. (HW3!)

Answers
    births %>%
      left_join(data.frame(mrace = 1:4, race_f = c("W", "B", "AI/AN", "O"))) %>%
      group_by(race_f, methnic) %>%
      summarise(avg_gest = mean(wksgest, na.rm = TRUE),
                gest_sd = sd(wksgest, na.rm = TRUE),
                n = n()) %>%
      mutate(ci_low = avg_gest - 0.5 * 1.96 * gest_sd,
             ci_high = avg_gest + 0.5 * 1.96 * gest_sd) %>%
      arrange(avg_gest) %>%
      filter(methnic != "U" & race_f != "O") %>%
      unite(raceeth, race_f, methnic, sep = ".") %>%
      ggplot(aes(raceeth, avg_gest, fill = avg_gest)) +
      geom_col() +
      geom_linerange(aes(x = raceeth, ymin = ci_low, ymax = ci_high), color = "grey") +
      geom_text(aes(label = round(avg_gest, 1)), nudge_y = 1)
    # Q2:
See you on Wednesday! Homework 3!
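The interval in the answer is the mean ± 0.5 × 1.96 × SD. If a 95% confidence interval for the mean is what is wanted, the conventional normal-approximation form uses the standard error, SD / sqrt(n), instead. A sketch of that variant on starwars (the births data is not available here); the column names avg, se, ci_low and ci_high are illustrative, and this is one common choice rather than the course's intended answer.

```r
library(dplyr)

height_ci <- starwars %>%
  filter(!is.na(height), !is.na(species)) %>%
  group_by(species) %>%
  summarise(avg = mean(height), sd = sd(height), n = n()) %>%
  filter(n > 1) %>%                      # sd is undefined for a single row
  mutate(se      = sd / sqrt(n),         # standard error of the mean
         ci_low  = avg - 1.96 * se,
         ci_high = avg + 1.96 * se)
```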