Presentation is loading. Please wait.

Presentation is loading. Please wait.

R for Epi Workshop Module 3: Data visualization with ggplot

Similar presentations


Presentation on theme: "R for Epi Workshop Module 3: Data visualization with ggplot"— Presentation transcript:

1 R for Epi Workshop Module 3: Data visualization with ggplot
Mike Dolan Fliss, MSW, MPS PhD Candidate in Epidemiology UNC Gillings School of Global Public Health

2 Module Outline The grammar of graphics ggplot syntax Data-aes-geoms
Facets & Themes Preparing data to Plot Scales Colors Labels & legends Stats ggplot extension packages Bit of theory to start, then we’ll dig back into our births data. Broadly divided into required and not-required parts of ggplot.

3 Same MO as before: Code along!
Recommend typing it out Tab-complete is your speed and typo friend Copy and paste always available Mostly follow-along, with a little bit of free-work at the end of the module. A little bit of live coding. Minimum to start: library(tidyverse) setwd("D:/<YOUR DIR HERE>/R Workshop/") births_sm = readRDS("births_final.rds") Note about readRDS and save/load.

4 To start: New script, load data, headers
# # IPH Workshop: R for Epi # Examples for modules 3 and 4 # Mike Dolan Fliss library(tidyverse) setwd("D:/<YOUR FILE PATH HERE!>/R Workshop/") # ^ Above is Mike’s – make your own! births_sm = readRDS("births_final.rds") # Module 3 #### # _ Slide: ggplot components ####

5 To start: New script, load data, headers

6 1. Grammar of Graphics Underlying theory

7 Grammar of Graphics A consistent, theoretical framework for describing data visualizations. Based on a book (1999) by Leland Wilkinson Helps think about and plan graphics outside of R… but implemented deeply in R’s in ggplot2 package. May also be familiar if you’ve worked in Tableau – Wilkinson now works for Tableau

8 Grammar of graphics components

9 Grammar of graphics components
data aesthetic mapping geometric object statistical transformations scales coordinate system position adjustments faceting

10 Anatomy of a ggplot data aesthetic mapping geometric object
statistical transformations scales coordinate system position adjustments faceting

11 2. ggplot2 syntax An implementation of gg!

12 ggplot components (are the same!)
data aesthetic mapping geometric object statistical transformations scales coordinate system position adjustments faceting

13 ggplot components

14 ggplot components Or minimally, ggplot(data=<DATA>)+ <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>)) e.g., using mpg dataset ggplot(mpg)+ geom_point(aes(displ, hwy, color=class)) * Read the “+” here as “and” … And a note on overloaded operators!

15 Versus base / qplot shorthand
hist(mpg$cyl) ggplot(mpg)+geom_histogram(aes(x=cyl)) ggplot(mpg, aes(cyl))+geom_histogram() qplot(mpg$cyl, geom="histogram") Simple vs. Easy

16 3. Data, aes, geoms The flexible power of aesthetic mappings

17 data ggplot likes “long”, well structured data.frames ggplot “stats” can make quick transformations dplyr will help with complicated transformations tidyr will help go from wide to long Extensions allow ggplot to understand other kinds of data (e.g. maps, network data) Stats – don’t do this.

18 geoms …and many more in sister packages. Like other key packages, a cheat sheet is built into R.

19 Aesthetic Mappings Data Aesthetics Geometry
Numbers & Factors (characters coerced) meduc mage cigdur wksgest preterm_f pnc5_f county_name raceeth_f …and more x y color (name, rgb) fill size linetype (int or name) alpha height width shape angle ….and more Minimum for geom_point(), with rest are defaulted

20 Cheatsheets Cheatsheet: content/uploads/2015/03/ggplot2-cheatsheet.pdf ... Or right in RStudio! (One more benefit of using current RStudio IDE version)

21 Sampling for prototyping speed
Graphs can take a few moments, and for prototyping we want speed. Let’s use some dplyr to generate a sample to play with, and transfer to a dplyr tibble for the dataset viewing benefits. set.seed(1) # for replicability table(complete.cases(births_sm)) # test how it works births_10k = births_sm %>% filter(complete.cases(.)) %>% # dot syntax as_tibble() %>% sample_n(10000) births_10k %>% head(2) # helpful when doing EDA

22 Basic geometries and mappings
Let’s create a scatterplot of wksgest and mage using ggplot and geom_point. ggplot(births_10k, aes(mage, ksgest))+geom_point()

23 Basic geometries and mappings
D’oh! Overplotting! Use the geom_jitter() geometry instead. ggplot(births_10k, aes(mage, ksgest))+geom_jitter()

24 Basic geometries and mappings
Let’s try colors. Map cigdur to color. That’s it! ggplot(births_10k, aes(mage, wksgest, color=cigdur))+ geom_jitter()

25 Basic geometries and mappings
Fancier: change color to preterm_f, change the point character to a “.” and alpha to 0.5. Note global aes! ggplot(births_10k, aes(mage, wksgest, color=preterm_f))+ geom_jitter(pch=“.”, alpha=0.1) # ^ Typical chained spacing in last two examples, like dplyr

26 Aesthetic inheritance
Subsequent geometric layers will inherit the aesthetic mappings from the original ggplot() call unless they’re overwritten. Meaning these are equivalent: ggplot(births_10k, aes(mage, wksgest))+ geom_jitter() ggplot(births_10k)+ geom_jitter(aes(mage, wksgest)) #^ equivalent

27 Aesthetic inheritance 2
…but there’s good reason to be intentional about this – like when we want multiple geometries to use the same mappings ggplot(births_10k, aes(mage, wksgest, color=cigdur))+ geom_jitter(alpha=0.1) + geom_smooth(method="lm") We’d like geom_jitter and geom_smooth to BOTH inherit mappings. They can be overridden in a specific geometry.

28 Connecting dplyr to ggplot
While we’re at it, we may want to send dplyr output (without saving) directly into ggplot. Example: ggplot(births_10k, aes(mage, wksgest, color=cigdur))+ geom_jitter(alpha=0.1) + geom_smooth() # smooth is unhappy. Why? births_10k %>% filter(cigdur %in% c("Y", "N")) %>% # Note new logical operator, %in%! ggplot(aes(mage, wksgest, color=cigdur))+ geom_smooth() # Sends the dplyr'ed tibble / data.frame into # the first param of ggplot... the data param!

29 Common Visual EDA* workflow:
Exploratory data analysis Pick some variables: mage, wksgest Consider a geometry: e.g. geom_bin2d Check the help file (F1, or type it in) Peruse, maybe run the example code (can run right within help window) Write our own!

30 Common Visual EDA workflow:
Pick some variables: mage, wksgest Consider a geometry: e.g. geom_bin2d Check the help file (F1, or type it in) Peruse, maybe run the example code (can run righ within help window). Write our own!

31 Common Visual EDA workflow:
Pick some variables: e.g. mage, wksgest Consider a geometry: e.g. geom_bin2d Check the help file (F1, or type it in) Peruse, maybe run the example code (can run righ within help window). Write our own! ggplot(births_10k, aes(mage, wksgest))+ geom_bin2d() Note: Some blanks seem to be a function of our unassigned (so, default) bin width. We could do something like bins = max(births_10k$wksgest)-min(births_10k$wksgest) to fix this!

32 4. Facets & Themes Small multiples and overall look and feel

33 Facets (wrap/grid):small multiples
Facets take an R formula object (e.g . ~ x, y ~ x) and split your graph into small multiples based on that. Can also “free the scales” so they aren’t shared across plots with scales = “free_x” and/or “free_y” ggplot(births_10k, aes(mage, wksgest))+ geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+ geom_smooth()+ facet_wrap( ~ raceeth_f)

34 Facets (grid/wrap):small multiples
Facets take an R formula object (e.g . ~ x, y ~ x) and split your graph into small multiples based on that. Can also “free the scales” so they aren’t shared across plots with scales = “free_x” and/or “free_y” ggplot(births_10k, aes(mage, wksgest))+ geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+ geom_smooth()+ facet_grid( ~ raceeth_f)

35 Facets (grid/wrap):small multiples
Facets take an R formula object (e.g . ~ x, y ~ x) and split your graph into small multiples based on that. Can also “free the scales” so they aren’t shared across plots with scales = “free_x” and/or “free_y” ggplot(births_10k, aes(mage, wksgest))+ geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+ geom_smooth()+ facet_grid(methnic ~ mrace)

36 Themes Change the theme with theme_NAME(), e.g. theme_minimal().
Can define your own themes, or tweak existing ones. See project.org/web/packages/ggthemes/vignettes/ggthemes.html for more themes. More on ggplot extensions later! ggplot(births_10k, aes(mage, wksgest))+ geom_jitter(aes(color=preterm_f), pch=".", alpha=0.5)+ geom_smooth()+ facet_grid( ~ raceeth_f)+ theme_minimal()

37 5. Preparing data to plot Thinking ahead

38 Back to data Sometimes we need to transform our data to get at the question we have in mind. Saw with cigdur earlier. e.g. What is the association of preterm birth and maternal age?* * Again, we’re not approaching this from a stricter causal inference epidemiology frame in this workshop… Before, we did a minor transform with a filter for cigdur

39 Preparing to plot: preterm/mage
Might first try this as a plot of the relationship of preterm birth and maternal age using: ggplot(births)+ geom_histogram(aes(x=mage, fill=preterm_f), position="fill") Maybe suggestive of some meaning… but YUCK!

40 Preparing to plot: preterm/mage
Let’s try again, but first create a new dataset to plot with. Use dplyr, group the original dataset by (mage), and summarize it into two variables: pct_preterm and n. Then create a plot of the relationship of preterm birth and maternal age. magepreterm_df = births_sm %>% group_by(mage) %>% summarise(pct_preterm = mean(preterm_f == "Preterm", na.rm=T)*100, n=n()) ggplot(magepreterm_df, aes(mage, pct_preterm))+ geom_point(aes(size = n))+ geom_smooth(aes(weight=n)) # ^ Note the weight & size aesthetics!

41 Preparing to plot: preterm/mage
Could do this all in one pass, if we don’t need to keep working with the temporary dataset. Readable! births_sample %>% group_by(mage) %>% summarise(pct_preterm = mean(preterm_f == "Preterm", na.rm=T)*100, n=n()) %>% ggplot(aes(mage, pct_preterm))+ geom_point(aes(size = n))+ geom_smooth(aes(weight=n))

42 Other plots to explore Starting to speak to some of the reality of the data. How might we interpret these? What do you see?

43 Plots & Interpretation
ggplot(births_10k, aes(wksgest))+ geom_histogram(bins=25) * Note we don’t (yet) know how to clean up axes, labels, legends, etc. Just aiming for the functional look of the plot this time (see upper right.) But this is typical of iterative development – aim for visual communication, then hone details.

44 Plots & Interpretation
births_10k %>% filter(cigdur %in% c("N", "Y")) %>% ggplot(aes(wksgest, fill=cigdur))+ geom_density(adjust=2, alpha=.5)+theme_minimal()

45 Plots & Interpretation
births_10k %>% filter(cigdur %in% c("N", "Y")) %>% ggplot(aes(cigdur, wksgest, color=cigdur, fill=cigdur))+ geom_boxplot(alpha=.5)+ facet_wrap(~pnc5_f)+ theme_minimal()

46 Plots & Interpretation
county_stats = births_sm %>% group_by(county_name, cores) %>% summarise(pct_preterm = mean(preterm, na.rm=T)*100, pct_earlyPNC = mean(pnc5, na.rm=T)*100, n=n()) ggplot(county_stats, aes(pct_earlyPNC, pct_preterm, size=n))+ geom_point()+ geom_smooth()

47 Sidenote (other tools later)

48 Plots & Interpretation
births_sm %>% group_by(raceeth_f) %>% summarise(preterm = mean(preterm, na.rm=T), pnc5 = mean(pnc5, na.rm=T)) %>% gather(measure, percent, -raceeth_f) %>% mutate(percent = percent * 100) %>% ggplot(aes(measure, percent, fill=raceeth_f, group=raceeth_f, label=round(percent, 1))) + geom_bar(stat="identity", position="dodge")+ geom_text(position=position_dodge(.9), aes(y=percent+3), hjust=0.5)

49 Plots & Interpretation
births_10k %>% filter(cigdur %in% c("N", "Y")) %>% ggplot(aes(wksgest, mage, color=cigdur))+ geom_density_2d(alpha=.5)+theme_minimal()

50 Worth a skim: http://r-statistics
Worth a skim: MasterList-R-Code.html

51 Pause for a brief breather and questions!
Next: Colors, themes, ggplot extension packages

52 Module Outline The grammar of graphics ggplot syntax Data-aes-geoms
Facets & Themes Preparing data to Plot Scales: Limits, colors, positions Labels & legends Stats ggplot extension packages

53 6. Scales Limits, colors, positions

54 ggplot components

55

56 Scales: axes Some form of: scale_x_continuous(limits=c(a,b), breaks=a:b) …is pretty common A note on clipping: Most common

57 Scales: axes births_sm %>% group_by(mage) %>% summarise(pct_preterm = mean(preterm_f == "Preterm", na.rm=T)*100, n=n()) %>% ggplot(aes(mage, pct_preterm))+ geom_point(aes(size = n))+ geom_smooth(aes(weight=n))+ scale_y_continuous(limits=c(0,30)) # coord_cartesian(ylim = c(0,30)) #change the "window"

58 Scales : color Typically using brewer or manual Will focus on discrete scales, but continuous works very similarly (and is typically easier)

59 Scales: Colors births_10k %>% ggplot(aes(mage, wksgest, color=preterm_f))+ geom_jitter()+ scale_color_discrete() # default scale_color_manual(values = c("Blue", "Red")) scale_color_manual(values = c("Preterm" = "Red", "Term" = "Blue"))

60 Scales: color brewer http://colorbrewer2.org
RColorBrewer::display.brewer.all()

61 Scales: viridis Use the color scales in this package to make plots that are pretty, better represent your data, easier to read by those with colorblindness, and print well in grey scale. Compare, for instance, with rainbow / spectral perceptual “jumps.”

62 Scales: Colors births_10k %>% ggplot(aes(mage, wksgest, color=preterm_f))+ geom_jitter()+ scale_color_brewer(palette = "Paired") scale_color_viridis_d(begin=0, end=.4) # (0->1)

63 positions

64 positions dodge fill jitter nudge stack ggplot(mpg, aes(fl, fill=drv))+ geom_bar(position=“<POSITION>”) Where <POSITION> is one of: dodge fill jitter nudge stack Can be tweaked further with the function equivalents position = position_dodge(width=1) Or a few more detailed versions …+geom_label(aes(label=mylabs), nudge_y=1)

65 coordinate systems Rarely used, except for coord_flip() Other solutions in helper packages. Check ggstance in particular. (Also used for radial geometries, maps, etc.)

66 coordinate systems # Horizontal with coord_flip() ggplot(mpg, aes(class, hwy, fill = factor(cyl))) + geom_boxplot() + coord_flip() # In ggstance, you supply aesthetics in their natural order: # Horizontal with ggstance ggplot(mpg, aes(hwy, class, geom_boxploth()

67 7. Labels & Legends Limits, colors, positions

68 Labels labs() covers almost every label: x y title subtitle caption And can also do legend titles… …or do in the scale_*() function Use g+annotate() to just stick something (anything!) right where you want it.

69 Labels

70 Labels ggplot(county_stats, aes(pct_earlyPNC, pct_preterm, size=n, color=county_name, label=county_name))+ geom_point()+ geom_text(nudge_y = .5, size=3)+ # ^ behind the scenes, this is a position adjustment! geom_smooth(aes(weight=n), color="black", show.legend = F, se=F)+ scale_color_discrete(guide="none")+ #^ turn color guide off another way labs(x="% early prenatal care", y="% preterm", title="% preterm vs. % early prenatal care County", subtitle="Roughly inverse relationship, but very rough!")+ annotate(geom="rect", xmin=75, xmax=80, ymin=0, ymax=10, alpha=.2, fill="black")+ annotate(geom="text", x=77, y=5, angle=90, label="nothing here!", vjust=0.5)

71 7. Stats More advanced! Put on your thinking caps / this is just an introduction.

72 Layer=data+stats+geoms
Hadley’s (package author) underlying theory from grammar of graphics: layer=data+stats+geoms. Often stat or geom imply the other, so each has a default parameter of the other. Excellent Stack Overflow Review difference-between-geoms-and-stats-in-ggplot2 Default stats for geoms: reference/ggplot2/geom

73 Let’s Try

74 Stat variables geoms, behind the scenes, often calculate a new dataframe to actually plot on screen. stat_<thing> calculates a new dataframe explicitly That dataframe has some “secret” (documented) variable names, accessible by special inline variables ..count.. ..ncount.. ..density.. ..ndensity.. ..count.. ..prop.. Etc. What if I want to refer to those new variables elsewhere in the ggplot call? (rare use – dplyr instead!)

75 Example g = ggplot(births_10k, aes(cigdur, fill=preterm_f))+ geom_bar(position="dodge") ggplot_build(g)$data

76 Example ggplot(births_10k, aes(cigdur, fill=preterm_f))+ geom_bar(position="dodge", aes(y=..count../sum(..count..))) ggplot(births_10k, aes(cigdur, fill=preterm_f, group=preterm_f))+ geom_bar(position="dodge", aes(y=..prop..)) # ^ less than ideal. Really best to do yourself… Requires careful group-aesthetic assignment.

77 8. ggplot extensions Extending the language

78 Overview Extensions to ggplot2 add geometries, statistics, themes… all within the language you now know. Dedicated website for them: And many in development elsewhere. Let’s look at some greatest hits

79 So,so many themes https://github.com/cttobin/ggthemr ggthemes, etc.
library(ggthemes) ggplot(county_stats, aes(pct_preterm, pct_earlyPNC, size=n, label=cores))+ geom_jitter()+ theme_tufte()

80 ggsci : ggsci offers a collection of ggplot2 color palettes inspired by scientific journals, data visualization libraries, science fiction movies, and TV shows

81 plotly Make your plots mouse-interactive. Great for exploring data. Minimal use is: ggplotly() …which makes your last ggplot interactive. Can also send plots to your account online, integrate with shiny interactive websites (renderPlotly and outputPlotly), etc. Will demo more in Module 4.

82 ggrepel Already have geom_text() and geom_label (label draws a text box, text is just text) Now have geom_text_repel() and geom_label_repel() to use a “force” to spread out data and labels. library(ggrepel) ggplot(county_stats %>% filter(n>1000), aes(pct_preterm, pct_earlyPNC, size=n, label=county_name))+ geom_jitter(alpha=0.3)+ geom_text_repel(force=0.01, show.legend = F)+ theme_minimal()

83 ggmosaic library(ggmosaic) ggplot(births_10k)+ geom_mosaic(aes(x=product(preterm_f, raceeth_f), fill=preterm_f))

84 GGally ggpairs

85 GGally high level EDA tools. ggcorr()

86 ggdag For the causal epidemiologists in the room!
Works alongside dagitty tools and website. Can download a saved dag from the web as an object, plot and adjust it for publication.

87 Special mention: broom
Broom… tidies data! Nice to get ready for ggplot. Currently broom provides tidying methods for many S3 objects from the built-in stats package, including lm glm htest anova nls kmeans manova TukeyHSD arima It also provides methods for S3 objects in popular third-party packages, including Lme4 glmnet boot gam survival Lfe zoo multcomp sp maps Will demo just a little in Module 4.

88 Survminer (also GGAlly::ggsurv())
s <- survival::survfit(Surv(wksgest, preterm) ~ pnc5_f, data=births_sm, type="kaplan-meier") ggplot2::autoplot(s)+ labs(title="Rough survival plot, preterm birth as outcome") # autoplot: have ggplot guess what to do by the data type

89 Thinking big: packages off CRAN
For example, treemapify Install dev versions of things with devtools, and install_github, etc.

90 Thinking big: Sankey / river / alluvial diagrams
Good example of how many ways to cut it: y-diagrams-in-r

91 10. Putting it together

92

93 births_sm %>% group_by(mage) %>% summarise(pct_preterm = mean(preterm_f == "Preterm", na.rm=T)*100, n=n()) %>% ggplot(aes(mage, pct_preterm))+ geom_point(aes(size = n), alpha=0.5)+ geom_smooth(aes(weight=n), se=F, lty="dashed", color="red")+ scale_x_continuous(limits = c(0,NA))+ scale_y_continuous(limits = c(0,30))+ labs(x="Maternal age", y="% Preterm", size="# of births", title="Relationship between maternal age and % preterm birth", subtitle=paste0("Data from NC SCHS, including ", prettyNum(nrow(births_sm), big.mark = ","), " births."), caption = "NOTES: The relationship of maternal age and preterm birth \r is roughly quadratic, as modeled with a LOESS curve.")+ theme_minimal() ggsave("maternal_age_and_preterm.png", width=6, height=4)

94 Saving a ggplot The usual GUI / Rstudio way

95 Saving a ggplot Programmatically - super easy. Save my last ggplot:
ggsave(“filename.png”) ggsave(“filename.pdf”, width=10, height=8) Or can specifically save the ggplot to an object, then send to ggsave() by object name. my_ggplot = ggplot()… etc. ggsave(g, “my_new_ggplot.png”)

96 END of MODULE 3!


Download ppt "R for Epi Workshop Module 3: Data visualization with ggplot"

Similar presentations


Ads by Google