Using the R Tidyverse packages

Using the R Tidyverse packages
Simon Andrews v

What's wrong with R? Stupid legacy design choices Inconsistent API
Difficult to follow code logical flow Multiple ways to model data

Stupid Design Choices data.frame( gene=c("ABC10","DEF10","GHI10"),
"gene name"=c("ABC10 gene","DEF10 gene","GHI10 Gene"), value=c(10,12,14), row.names=1 ) -> some.data

Stupid Design Choices class(some.data$name) [1] "factor"
some.data["ABC1",] name value ABC10 ABC10 gene 10

Stupid Design Choices small.data
interesting.samples <- c("sample1","sample2") small.data[,interesting.samples] sample1 sample2 ABC DEF GHI interesting.samples <- c("sample3") [1]

Inconsistent API barplot(1:3, horiz=TRUE)
stripchart(1:3,vertical = FALSE) min(DATA) log(DATA,[other options]) substr(DATA,[other options]) grep(pattern,DATA,[other options])

Difficult to follow logical flow
6 5 4 2 1 3 sqrt(min(log(abs(some.data)+1,base=2))) Which function does this belong to?

Tidyverse Collection of R packages
Collection of R packages Aims to fix many of core R's structural problems Common design and data philosophy Designed to work together, but integrate seamlessly with other parts of R

Tidyverse Packages Tibble - data storage
ReadR - reading data from files TidyR - Model data correctly DplyR - Manipulate and filter data Ggplot2 - Draw figures and graphs

Tidyverse Philosophies
All data stored in a tibble (like a data frame) Make all functions perform a single, simple operation Function names should be descriptive - normally a verb All functions take a tibble as their first argument, allowing 'pipes' to be constructed All (well most) functions return a tibble Use functional programming (not OO) Don't modify existing data - create a modified copy

Using Tidyverse

Things to cover today Installing Tidyverse
Reading files and using tibbles Restructuring data into ‘tidy’ format Transforming tidy data Subsetting and filtering Grouping and Summarising Extending and Joining Final exercise

Installation and calling
install.packages("tidyverse") library("tidyverse") -- Attaching packages tidyverse v ggplot v purrr v tibble v dplyr v tidyr v stringr 1.3.1 v readr v forcats 0.3.0

Reading Files with readr
Provides functions which mirror R's read.table read_csv("file.csv") read_tsv("file.tsv") read_delim("file.txt") read_fwf("file.txt",col_positions=c(1,3,6))

Reading files with readr
Output is always a tibble No name translation Guesses appropriate formats Says what formats were guessed (can get it wrong though!) No string to factor conversion

Reading files with readr
> read_tsv("trumpton.txt") -> trumpton Parsed with column specification: cols( LastName = col_character(), FirstName = col_character(), Age = col_double(), Weight = col_double(), Height = col_double() ) > trumpton # A tibble: 7 x 5 LastName FirstName Age Weight Height <chr> <chr> <dbl> <dbl> <dbl> 1 Hugh Chris 2 Pew Adam 3 Barney Daniel 4 McGrew Chris 5 Cuthbert Carl 6 Dibble Liam 7 Grub Doug

Fixing guessed columns
> read_csv("Child_Variants.csv") -> child_tbl Parsed with column specification: cols( CHR = col_double(), POS = col_double(), dbSNP = col_character(), REF = col_character(), ALT = col_character(), QUAL = col_double(), GENE = col_character(), ENST = col_character(), MutantReads = col_double(), COVERAGE = col_double(), MutantReadPercent = col_double() ) Warning: 477 parsing failures. row col expected actual file 25346 CHR a double MT 'Child_Variants.csv' 25347 CHR a double MT 'Child_Variants.csv' 25348 CHR a double MT 'Child_Variants.csv' 25349 CHR a double MT 'Child_Variants.csv' 25350 CHR a double MT 'Child_Variants.csv' See problems(...) for more details. Types are guessed on first 1000 lines Warnings for later mismatches Values converted to NA

Fixing guessed columns
> read_csv( "Child_Variants.csv", col_types=cols(CHR=col_character()) ) -> child_tbl # A tibble: 25,822 x 11 CHR POS dbSNP REF ALT QUAL GENE ENST MutantReads COVERAGE <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> A G OR4F5 ENST~ rs75~ A G OR4F5 ENST~ A T OR4F5 ENST~ rs75~ T C OR4F5 ENST~ rs66~ T C SAMD~ ENST~ rs22~ G A NOC2L ENST~ rs38~ A G NOC2L ENST~ rs37~ T C NOC2L ENST~ rs37~ T C NOC2L ENST~ rs13~ G C NOC2L ENST~ # ... with 25,812 more rows, and 1 more variable: MutantReadPercent <dbl>

Tibbles vs Data Frames Most operations are compatible my.tibble$column
Any function taking data frame as argument

Tibbles vs Data Frames Tibbles print more nicely
Tell you the column types Tell you the overall dimensions Omit columns which don't fit Truncate values which don’t fit Show fewer rows by default (10 vs 1000) Tibbles don't support row names Causes issues with some packages, especially BioConductor

Printing tibbles Tells you the column types
# A tibble: 25,822 x 11 CHR POS dbSNP REF ALT QUAL GENE ENST MutantReads COVERAGE <chr> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> A G OR4F5 ENST~ rs75~ A G OR4F5 ENST~ A T OR4F5 ENST~ rs75~ T C OR4F5 ENST~ rs66~ T C SAMD~ ENST~ rs22~ G A NOC2L ENST~ rs38~ A G NOC2L ENST~ rs37~ T C NOC2L ENST~ rs37~ T C NOC2L ENST~ rs13~ G C NOC2L ENST~ # ... with 25,812 more rows, and 1 more variable: MutantReadPercent <dbl> Tells you the column types Tells you the overall dimensions Omits columns which don't fit Truncates values which don’t fit Shows fewer (10 vs 1000) rows by default

Creating tibbles Any read method from readr
Using as_tibble() on a data frame Using tibble() from vectors

Exercise 1 Reading Data into Tibbles

"Tidy Data" Many ways to store the same data
# A tibble: 3 x 8 Gene Chr Start End Sample1_WT Sample2_WT Sample3_KO Sample4_KO <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Gnai 2 Pbsn 3 Cdc Many ways to store the same data Different structures need different code to process them Good to have a common design structure

Wide vs Long Wide format Rows are measures Samples are columns
Implied (often incorrectly)linkage between measures on the same row for different samples Works OK when measures are paired and equal Tricky to alter as you need to select columns first

Wide vs Long Long format is the standard "Tidy" format
Every row is a single observation Every column should be a different type of annotation or measure Should never have the same measure for difference samples in multiple columns Slightly more complex for paired studies More flexible and consistent generally.

Converting to "Tidy" format
# A tibble: 3 x 8 Gene Chr Start End Sample1_WT Sample2_WT Sample3_KO Sample4_KO <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Gnai 2 Pbsn 3 Cdc Put all measures into a single column Add a 'sample' and 'genotype' column Duplicate the gene information as required Or separate it into a different table

# A tibble: 3 x 8 Gene Chr Start End Sample1_WT Sample2_WT Sample3_KO Sample4_KO <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Gnai 2 Pbsn 3 Cdc non.normalised %>% gather(sample, value, -Gene,-Chr,-Start,-End) %>% separate(sample,into=c("sample","genotype"),sep="_")

# A tibble: 12 x 7 Gene Chr Start End sample genotype value <chr> <dbl> <dbl> <dbl> <chr> <chr> <dbl> 1 Gnai Sample1 WT 2 Pbsn Sample1 WT 3 Cdc Sample1 WT 4 Gnai Sample2 WT 5 Pbsn Sample2 WT 6 Cdc Sample2 WT 7 Gnai Sample3 KO 8 Pbsn Sample3 KO 9 Cdc Sample3 KO 10 Gnai Sample4 KO 11 Pbsn Sample4 KO 12 Cdc Sample4 KO

Can clean up duplicated information # A tibble: 12 x 4 Gene sample genotype value <chr> <chr> <chr> <dbl> 1 Gnai3 Sample1 WT 2 Pbsn Sample1 WT 3 Cdc45 Sample1 WT 4 Gnai3 Sample2 WT 5 Pbsn Sample2 WT 6 Cdc45 Sample2 WT 7 Gnai3 Sample3 KO 8 Pbsn Sample3 KO 9 Cdc45 Sample3 KO 10 Gnai3 Sample4 KO 11 Pbsn Sample4 KO 12 Cdc45 Sample4 KO # A tibble: 3 x 4 Gene Chr Start End <chr> <dbl> <dbl> <dbl> 1 Gnai 2 Pbsn 3 Cdc These can be recombined later on as needed.

Tidying operations gather spread separate unite
Takes multiple columns of the same type and puts them into a pair of key-value columns spread Takes a key-value column pair and spreads them out to multiple columns of the same type separate Splits a delimited column into multiple columns unite Combines multiple columns into one

Using gather gather(gather.data,sample,value,-genes)
gather(data,key_name,value_name, columns) gather(gather.data,sample,value,-genes) # A tibble: 12 x 3 genes sample value <chr> <chr> <dbl> 1 Nanog sample 2 Sfi1 sample 3 GAPDH sample 4 Nanog sample 5 Sfi1 sample 6 GAPDH sample 7 Nanog sample 8 Sfi1 sample 9 GAPDH sample 10 Nanog sample 11 Sfi1 sample 12 GAPDH sample > gather.data # A tibble: 3 x 5 genes sample1 sample2 sample3 sample4 <chr> <dbl> <dbl> <dbl> <dbl> 1 Nanog 2 Sfi 3 GAPDH Column selections can be positive (sample1,sample2,sample3) or negative (-genes)

Using spread spread(spread.data,dimension,measures)
spread(data,key_col,value_col) spread(spread.data,dimension,measures) samples depth height width <chr> <int> <int> <int> 1 sample 2 sample > spread.data # A tibble: 6 x 3 samples dimension measures <chr> <chr> <int> 1 sample1 height 2 sample1 width 3 sample1 depth 4 sample2 height 5 sample2 width 6 sample2 depth

Using separate separate(data,col,into_vector,sep)
separate(sep.data,size,into=c("h","w"),sep="x",convert = TRUE) # A tibble: 4 x 3 standard h w <chr> <int> <int> 1 SVGA 2 XGA 3 WXGA 4 FHD > sep.data # A tibble: 4 x 2 standard size <chr> <chr> 1 SVGA x600 2 XGA x768 3 WXGA x720 4 FHD x1080 convert=TRUE means that the type for the new columns will be re-guessed since it might well change.

Using unite unite(data,cols,colname,sep)
unite(unite.data,chr,start,strand,col="locus",sep=":") # A tibble: 3 x 2 locus value <chr> <dbl> 1 1:12876: 2 2: : 3 X: :- 72.1 > unite.data # A tibble: 3 x 4 chr start strand value <chr> <dbl> <chr> <dbl> 3 X

Multiple operations gather the sample columns into sample and value
# A tibble: 3 x 8 Gene Chr Start End Sample1_WT Sample2_WT Sample3_KO Sample4_KO <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Gnai 2 Pbsn 3 Cdc gather the sample columns into sample and value Split the sample column into sample and genotype with separate

Multiple operations gather(non.norm,sample,value,-Gene,-Chr,-Start,-End) -> n1 separate(n1, sample, into=c("sample","geno"),sep="_") -> n2 separate( gather( non.normalised, sample, value, -Gene,-Chr,-Start,-End ), into=c("sample","genotype"), sep="_" ) -> norm2

The %>% operator (pipe)
All* tidyverse functions take a tibble as the first argument and return a modified tibble Often need to 'chain' several functions together. The %>% operator takes the argument on its left and makes it the first argument to the function on its right Correct logical flow to the code Easier to read and interpret Cleaner (fewer intermediate variables, less typing) *Well nearly all, anyway.

The %>% operator (pipe)
data %>% head() head(data) tibble(a=1:10,b=21:30) %>% nrow() *Well nearly all, anyway.

Multiple operations non.norm %>%
# A tibble: 3 x 8 Gene Chr Start End Sample1_WT Sample2_WT Sample3_KO Sample4_KO <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Gnai 2 Pbsn 3 Cdc non.norm %>% gather(sample,value,-Gene,-Chr,-Start,-End) %>% separate(sample,into=c("sample","geno"),sep="_")

Exercise 2 Restructuring data into ‘tidy’ format

Transforming data with dplyr

Functions in dplyr Replacement for core selecting, sorting and filtering Operations can be split into groups Subsetting or filtering a tibble Grouping and summarising variables Extending tibbles by adding data

Subsetting and Filtering
slice select rows by position select select columns by name/position arrange sort rows distinct deduplication filter functional selection of rows

Trumpton # A tibble: 7 x 5 LastName FirstName Age Weight Height
<chr> <chr> <dbl> <dbl> <dbl> 1 Hugh Chris 2 Pew Adam 3 Barney Daniel 4 McGrew Chris 5 Cuthbert Carl 6 Dibble Liam 7 Grub Doug

Using slice # A tibble: 3 x 5 LastName FirstName Age Weight Height
slice(data,rows) trumpton %>% slice(1,4,7) # A tibble: 3 x 5 LastName FirstName Age Weight Height <chr> <chr> <dbl> <dbl> <dbl> 1 Hugh Chris 2 McGrew Chris 3 Grub Doug Note, rows can be separate (as above) or combined into one vector

Using select # A tibble: 7 x 3 LastName Age Height
select(data,cols) trumpton %>% select(LastName,Age,Height) # A tibble: 7 x 3 LastName Age Height <chr> <dbl> <dbl> 1 Hugh 2 Pew 3 Barney 4 McGrew 5 Cuthbert 6 Dibble 7 Grub

Using select # A tibble: 7 x 4 FirstName Age Weight Height
select(data, -unwanted_cols) trumpton %>% select(-LastName) # A tibble: 7 x 4 FirstName Age Weight Height <chr> <dbl> <dbl> <dbl> 1 Chris 2 Adam 3 Daniel 4 Chris 5 Carl 6 Liam 7 Doug

Defining Selected Columns
Common rules used throughout tidyverse. Single definitions (name, position or function) Positive weight, height, length, 1, 2, 3, last_col() Negative -chromosome, -start, -end, -1, -2, -3 Range selections 3:5 -(3:5) height:length -(height:length) Functional selections (positive or negative) starts_with() -starts_with() ends_with() -ends_with() contains() -contains() matches() -matches

Using select helpers > child.variants
CHR POS dbSNP REF ALT QUAL GENE ENST MutantReads COVERAGE MutantReadPercent > child.variants %>% select(REF,COVERAGE) REF COVERAGE > child.variants %>% select(-CHR, -ENST) POS dbSNP REF ALT QUAL GENE MutantReads COVERAGE MutantReadPercent > child.variants %>% select(5:last_col()) ALT QUAL GENE ENST MutantReads COVERAGE MutantReadPercent > child.variants %>% select(POS:GENE) POS dbSNP REF ALT QUAL GENE > child.variants %>% select(-(POS:GENE)) CHR ENST MutantReads COVERAGE MutantReadPercent > child.variants %>% select(starts_with("Mut")) MutantReads MutantReadPercent > child.variants %>% select(-ends_with("t",ignore.case = TRUE)) CHR POS dbSNP REF QUAL GENE MutantReads COVERAGE > child.variants %>% select(contains("Read")) MutantReads MutantReadPercent

Using arrange # A tibble: 7 x 5 LastName FirstName Age Weight Height
arrange(data,cols) trumpton %>% arrange(Age) # A tibble: 7 x 5 LastName FirstName Age Weight Height <chr> <chr> <dbl> <dbl> <dbl> 1 Barney Daniel 2 Hugh Chris 3 Cuthbert Carl 4 Grub Doug 5 Pew Adam 6 Dibble Liam 7 McGrew Chris

Using arrange # A tibble: 7 x 5 LastName FirstName Age Weight Height
arrange(data,cols) trumpton %>% arrange(desc(Age)) # A tibble: 7 x 5 LastName FirstName Age Weight Height <chr> <chr> <dbl> <dbl> <dbl> 1 McGrew Chris 2 Dibble Liam 3 Pew Adam 4 Grub Doug 5 Cuthbert Carl 6 Hugh Chris 7 Barney Daniel

Using distinct # A tibble: 6 x 5 LastName FirstName Age Weight Height
distinct(data,cols, .keep_all=TRUE) trumpton %>% distinct(FirstName,.keep_all=TRUE) # A tibble: 6 x 5 LastName FirstName Age Weight Height <chr> <chr> <dbl> <dbl> <dbl> 1 Hugh Chris 2 Pew Adam 3 Barney Daniel 4 Cuthbert Carl 5 Dibble Liam 6 Grub Doug Note, “keep_all” has a dot before it

Using filter # A tibble: 6 x 5 LastName FirstName Age Weight Height
filter(data,conditions) trumpton %>% filter(Height>170 & Age < 30) # A tibble: 6 x 5 LastName FirstName Age Weight Height <chr> <chr> <dbl> <dbl> <dbl> 1 Hugh Chris 2 Cuthbert Carl

Clashes with other packages
Many other packages define a function called select (eg MASS) In R, the last package loaded, wins. Tidyverse usually loses since it’s loaded at the start! In these cases you need to specify the package as well as the function: dplyr::select() Can affect other functions too

Clashes with other packages
> library(MASS) Attaching package: ‘MASS’ The following object is masked from ‘package:dplyr’: select > select function (obj) UseMethod("select") <bytecode: 0x > <environment: namespace:MASS> > dplyr::select function (.data, ...) { } <bytecode: 0x b90> <environment: namespace:dplyr>

Exercise 3 Filtering and selecting with dplyr

Grouping and Summarising
group_by sets groups for summarisation ungroup removes grouping information summarise collapse grouped variables count count grouped variables

Grouping and Summarising Workflow
Load a tibble with repeated values in one or more columns Use group_by to select all of the categorical columns you want to combine to define your groups Run summarise saying how you want to combine the quantitative values Run ungroup to remove any remaining group information

> group.data # A tibble: 8 x 5 Sample Genotype Sex Height Length <dbl> <chr> <chr> <dbl> <dbl> WT M WT F WT F WT M KO M KO F KO F KO M Want to get average Height and Length for each sex, but still split by genotype

Discard Group Mean Median Sample Genotype Sex Height Length <dbl> <chr> <chr> <dbl> <dbl> Categorical Quantitative Want to get average Height and Length for each combination of sex and genotype

group.data %>% group_by(Genotype,Sex) # A tibble: 8 x 5 # Groups: Genotype, Sex [4] Sample Genotype Sex Height Length <dbl> <chr> <chr> <dbl> <dbl> WT M WT F WT F WT M KO M KO F KO F KO M

group.data %>% group_by(Genotype,Sex) %>% summarise(Height2=mean(Height),Length=median(Length)) # A tibble: 4 x 4 # Groups: Genotype [2] Genotype Sex Height2 Length <chr> <chr> <dbl> <dbl> 1 KO F 2 KO M 3 WT F 4 WT M

Ungrouping A summarise operation removes the the last level of grouping (“sex” in our worked example) Other levels of grouping (“Genotype”) remain annotated on the data, so you could do an additional summarisation if needed If you’re not going to use them it’s a good idea to use ungroup to remove remaining groups so they don’t interfere with other operations

Exercise 4 Grouping and Summarising

Extending tibbles mutate compute a new column
add_row adds an additional row bind_rows join two tibbles by row add_column adds an additional column bind_cols join two tibbles by column rename renames an existing column

Using mutate mutate(data,newcol=oldcol1+oldcol2)
trumpton %>% mutate(bmi=Weight/(Height/100)^2) # A tibble: 7 x 6 LastName FirstName Age Weight Height bmi <chr> <chr> <dbl> <dbl> <dbl> <dbl> 1 Hugh Chris 2 Pew Adam 3 Barney Daniel 4 McGrew Chris 5 Cuthbert Carl 6 Dibble Liam 7 Grub Doug

Using add_row add_row(data,key1=value1, key2=value2) trumpton %>%
LastName="Andrews",FirstName="Simon", Age=39,Weight=80,Height=190 ) # A tibble: 8 x 5 LastName FirstName Age Weight Height <chr> <chr> <dbl> <dbl> <dbl> 1 Hugh Chris 2 Pew Adam 3 Barney Daniel 4 McGrew Chris 5 Cuthbert Carl 6 Dibble Liam 7 Grub Doug 8 Andrews Simon

Using bind_rows # A tibble: 6 x 2 gene value <chr> <dbl>
bind_rows(data1,data2) bind_rows(some.data,some.more.data) > some.data # A tibble: 3 x 2 gene value <chr> <dbl> 1 ABC 2 DEF 3 GHI > some.more.data 1 JKL 2 MNO 3 PQR # A tibble: 6 x 2 gene value <chr> <dbl> 1 ABC 2 DEF 3 GHI 4 JKL 5 MNO 6 PQR

Using add_column # A tibble: 3 x 3 gene value value2
add_column(data,newcol=new_vector) add_column(some.data,value2=c(16,18,20)) # A tibble: 3 x 3 gene value value2 <chr> <dbl> <dbl> 1 ABC 2 DEF 3 GHI

Using bind_cols # A tibble: 3 x 4 gene value gene1 value1
bind_cols(tibble1,tibble2) bind_cols(some.data,some.more.data) > some.data # A tibble: 3 x 2 gene value <chr> <dbl> 1 ABC 2 DEF 3 GHI > some.more.data 1 JKL 2 MNO 3 PQR # A tibble: 3 x 4 gene value gene1 value1 <chr> <dbl> <chr> <dbl> 1 ABC JKL 2 DEF MNO 3 GHI PQR

Using rename # A tibble: 3 x 2 gene_name value <chr> <dbl>
rename(data,new_name=name) rename(some.data, gene_name = gene) > some.data # A tibble: 3 x 2 gene value <chr> <dbl> 1 ABC 2 DEF 3 GHI # A tibble: 3 x 2 gene_name value <chr> <dbl> 1 ABC 2 DEF 3 GHI

Joining tibbles x and y left_join join matching values from y into x
right_join join matching values of x into y inner_join join x and y keeping only rows in both full_join join x and y keeping all values in both

Join types left_join right_join inner_join full_join > join1
name count percentage 1 Simon 2 Steven NA 3 Felix > join1 name count 1 Simon 2 Steven 6 3 Felix > join2 name percentage 1 Felix 2 Anne 3 Simon right_join name count percentage 1 Felix 2 Anne NA 3 Simon inner_join name count percentage 1 Simon 2 Felix full_join name count percentage 1 Simon 2 Steven NA 3 Felix 4 Anne NA

Joining options by specify the columns to join on
Simple name if it’s the same between both by=“gene” Paired names if they differ between x and y by=c(“gene” = “gene_name”) suffix the text suffix for duplicated column names

Exercise 5 Extending and Joining

Using the R Tidyverse packages

Similar presentations

Presentation on theme: "Using the R Tidyverse packages"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Using the R Tidyverse packages

Similar presentations

Presentation on theme: "Using the R Tidyverse packages"— Presentation transcript:

Similar presentations

About project

Feedback