1
n=54
2
What School / Dept?
3
What is the primary research question you work on?
4
Why you want to participate?
5
School of Biotechnology
Data for life: An Introduction to R
Dr. Tim Downing, School of Biotechnology
See Resources and References
6
Code not spreadsheets! data in formulas! =(3.6946*10^-6)/'Old snails'!J26 ref: Jenny Bryan
7
Code not spreadsheets! ref: Jenny Bryan
8
Mistakes matter!
Papers: "Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak"; "Mutation rate and genotype variation of Ebola virus from Mali case sequences" DOI: /science.aaa5646
Wait, WTF? Comment on “Mutation rate and genotype variation of Ebola virus from Mali case sequences” DOI: /science.aaf3823
Oops OMG rly sry: Response to Comment on “Mutation rate and genotype variation of Ebola virus from Mali case sequences” DOI: /science.aaf4561
9
Fig. 2 Maximum likelihood tree of the 106 sequences analyzed by Hoenen et al. (left side) initially (1) with lines linking to the correctly labeled sequences after the erratum (3) (right side). Lines of the same color represent multiple samples taken from the same patient, which in most cases have identical sequences. These correctly group together on the right but do not in many cases on the left. Andrew Rambaut et al. Science 2016;353:658
10
Code not spreadsheets! If you do something once, you usually don’t need a script. Do it hundreds or thousands of times, and you will want something to help you. If you want to share what you did, providing a script is usually a good way. Sometimes, though, scripts are too complicated and don’t capture all that is needed to do an experiment. For example: the version of a tool you used! ref: BF Francis Ouellette
11
ref: Jenny Bryan
13
Code not spreadsheets! ref: Jenny Bryan
14
Code not spreadsheets!
1. For Every Result, Keep Track of How It Was Produced
2. Avoid Manual Data Manipulation Steps
3. Archive Exact Versions of All External Programs Used
4. Version Control All Custom Scripts
5. Record All Intermediate Results in Standard Formats
6. For Analyses That Include Randomness, Note Underlying Random Seeds
7. Always Store Raw Data behind Plots
8. Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
9. Connect Textual Statements to Underlying Results
10. Provide Public Access to Scripts, Runs, and Results
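Two of these rules (noting random seeds and recording exact versions) are easy to follow from inside R itself. The lines below are a minimal sketch rather than a prescribed workflow; the seed value and the variable names are arbitrary choices made for illustration.

```r
# Fix the random seed so the "random" draws can be reproduced exactly.
set.seed(2024)              # the seed value itself is arbitrary
draws_first <- rnorm(5)

set.seed(2024)              # same seed again ...
draws_second <- rnorm(5)    # ... gives the same five numbers

print(identical(draws_first, draws_second))

# Record the R version alongside the results.
r_version <- R.version.string
print(r_version)

# sessionInfo() also lists the attached packages and their versions.
print(sessionInfo())
```

Pasting the output of sessionInfo() at the end of an Rmd file is one simple way to keep the version record next to the analysis.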
15
R shiny eg
16
Introduction to R
R is a command-line programming language & statistics package (GNU project based on S)
Great for statistical analysis of biological data
Excellent for visualising experimental results
"R" or "RStudio"
1. Go to "Start"
2. "All programs"
3. Hit "RStudio" (64-bit)
17
Using your own laptop
• Download R from the R Project website, install it, and verify that it starts.
• Download RStudio from the RStudio website, install it, and verify that it starts.
• Download ‘A very short introduction to R’ from CRAN, read it, and work through it.
18
What is R?
- An environment for statistical analysis and graphics
- Supports almost every statistical method
- The environment of choice for developing new methods
It's for anyone who has to analyse data - from A to Z
- agriculture, astrophysics, climatology,
- ecology and environmental science, econometrics,
- electrical engineering, finance, genetics,
- genomics, geography, psychology, public health,
- social sciences, zoology!
19
R Course Aims
- Introduce you to R
- Get you to use R yourself on your own data
- Work in groups to talk about using R in your projects
- Show you how to learn what you need to know about R
21
Running commands
Click anywhere in the line with your cursor
Press Ctrl + Return to run it, or click on the Run button
22
R Course Learning Outcomes
At the end of this session you will be able to:
- Start RStudio and create an Rmd (R Markdown) file
- Understand what an Rmd file is
- Load a simple data file
- Understand how R uses dataframes and vectors
- Use the R help system
- Examine a dataframe
- Prepare simple tables
- Locate and perform some stats on your data
- (Produce good quality graphs, and save these)
- (Prepare a simple function)
23
R Course Resources Go to
24
R Course Data Download these
25
R code terms
Rmd file – useful text file of your code and comments
vector – an ordered set of values of one type, eg a column of numbers
variable – something we define, eg x
function – R code that does something specific, eg mean(x)
arguments – data required by a function, eg x above
library – package of R functions that does something specific, loaded using the library() function
environment – the space you are working in, and the set of things that can be found there (eg the objects listed by ls())
CRAN – website with libraries to meet your needs
26
Managing projects in R
Many options: I prefer Rmd files (R Markdown)
Record your commands in your notebook (I use Evernote) and in your Rmd file
Using a GUI (eg Excel) or websites is asking for trouble
FAIR etc
27
Murphy's Law / Reviewer 3
- How exactly did you define your key explanatory variable?
- You had 14,327 records at the start, but the analysis is based on 14,225. What happened to the other 102?
- Why did you exclude the measurements from Sensor D, but only b/w 2pm and 4pm on the eighteenth day?
- Figure 4 is great, but I want you to make the axis labels a little bigger and move the key to the top left hand corner, please
- You do realize that you've calculated the SCQ scores wrong, the average ought to be about 4, not about 24
28
Managing projects in R
- *If* you have kept the code you ran to produce the data, run the analyses, and make the graphics
- *Then* all of these are trivial questions to answer
- If you haven't, then you have to redo the analysis from scratch
- very painful, very slow, very boring
29
Managing projects in R
- Keep your work organised
- Keep your original data
- Show how you got from there to wherever you ended up
- Record all your analyses, both the ones you eventually published, and the ones you didn't
- Start with the original data file, once data entry and checking have been finished
- Never overwrite the original data, disk space = cheap
- I read it in, and do data cleaning
30
Cleaning data / Data QC
90% of the work
- Sometimes I find errors in the data, which I check, if I can, with the original record
- Fix these in software
- Sometimes I have to omit observations, and again this is done, and fully documented, in software
- Create new variables, and recode old ones, and all of this is done live in R
- Output a dataset ready for further analysis
31
Actually doing the thing you wanted to do in the first place after all that checking
10% of the work
- Prepare tables and graphs from the data
- Carry out statistical tests as needed
- Start doing regression analyses, mostly glms, beginning with very simple models, and working up to more complex ones
- Prepare tables, graphs and text for output, using R, knitr, and markdown directly
32
34
Introduction to R
After this section you should be able to:
- Read data in and output data from R
- Manipulate data in R
- Use some basic R functions (eg statistics and plotting)
You may be able to write basic R computer programs
Use it or lose it
There is a lot to R beyond this course – much more available in books, tutorials, online forums, R demos, help pages etc.
35
Introduction to R
help()
> help(topic)
help() is an R function
R functions take arguments (info passed to the function: they go in between brackets, separated by commas)
‘help’ function => displays info from the R documentation files
36
Introduction to R
Type in "help()" and then use Enter to move down lines or the Spacebar to scroll down pages
Most R commands are constructed with brackets to denote the data that they are acting on, along with a series of input parameters
For example, see "Usage:" for help here:
help(topic, verbose = getOption("verbose"))
help(x) gives information on the topic x, with many adjustable parameters like the verbosity
Press q to exit help()
See for example help(file).
37
Introduction to R
R as calculator
R will evaluate basic calculations which you type into the console (input window)
Type in “R”
> 10+1
[1] 11
> 10*2
[1] 20
> (10*2)+1 # use brackets to clarify precedence
[1] 21
38
Introduction to R
Assigning variables
> x <- 1.5
> y <- 2.6
> x # check what x and y were assigned as
[1] 1.5
> y
[1] 2.6
39
Introduction to R
Assigning variables
> x <- 1.5
> y <- 2.6
R uses standard operators + - / *
a^b (or a**b): a to the power of b
a %% b: a modulo b (get the remainder)
> (x/(y + 20/15))**18 # R can do complex operations (the result is about 2.9e-08)
40
Introduction to R
You can generally obtain the number (or other value) stored in any variable by typing the name of the variable, followed by Enter; or by using print(variable) or show(variable)
Wait, what did I assign x as again … ?
> x
[1] 1.5
> print(x)
[1] 1.5
> show(x)
[1] 1.5
41
Introduction to R
Vectors
Vectors = a single column/row of numbers in a spreadsheet
Create a vector using the c() function (concatenate): x <- c()
eg x <- c(1,2,4,8) creates a column of the numbers 1,2,4,8
42
Introduction to R
For simple operations (+ - * /) on vectors x and y:
If x and y have the same number of entries: normal operations on the numbers in the vectors, entry by entry
Else x and y have different numbers of entries: R recycles the entries of the vector with the fewest entries
x <- c(1,2,3) # 3 elements in x
y <- c(1,2,3,4) # 4 elements in y
z <- x*y # R does not stop here: it reuses x as 1,2,3,1 and warns that
# the longer length is not a multiple of the shorter length
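It is worth checking what R actually does with vectors of unequal length: it does not stop with an error but recycles the shorter vector and issues a warning. The sketch below reuses the x and y from this slide; warning_text is just an illustrative name for the captured warning message.

```r
x <- c(1, 2, 3)      # 3 elements
y <- c(1, 2, 3, 4)   # 4 elements

# R reuses x as 1, 2, 3, 1 to match the length of y.
# suppressWarnings() hides the warning so only the result is shown.
z <- suppressWarnings(x * y)
print(z)

# Capture the warning text instead of the result, to see what R complains about.
warning_text <- tryCatch(x * y, warning = function(w) conditionMessage(w))
print(warning_text)

# When the lengths match, the operation is simply entry by entry.
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)
print(a * b)
```

Silent recycling is a common source of bugs, so comparing length(x) and length(y) before combining vectors is a cheap safeguard.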
43
Introduction to R
Vectors
> x <- c(1,2,3,4)
> y <- c(55,66,77,88)
> z <- c(10,11,12)
> v <- x*y; v # multiple commands on one line
[1] 55 132 231 352
> v <- x/y; v
[1] 0.01818182 0.03030303 0.03896104 0.04545455
44
Introduction to R
We can do standard operations on vectors
> v + x
> v + x*y
> v <- c(x,y) # create a longer vector v
> v
[1] 1 2 3 4 55 66 77 88
45
Introduction to R
Some more useful ways of creating columns of numbers (vectors)
The seq function
seq(1,10,1) = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
seq(1,4,0.5) = 1, 1.5, 2, 2.5, 3, 3.5, 4
x:y
1:10 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
The rep function
rep(2,4) = 2, 2, 2, 2
46
Introduction to R
> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(1,21,2.5)
[1] 1.0 3.5 6.0 8.5 11.0 13.5 16.0 18.5 21.0
> 1:9
[1] 1 2 3 4 5 6 7 8 9
> rep(2,3)
[1] 2 2 2
> rep(3,5)
[1] 3 3 3 3 3
47
Introduction to R
Get specific parts of a vector:
Example: if x = vector 1, 2, 5 then x[1] = 1, x[2] = 2, x[3] = 5
> x <- c(1,2,5)
> x
[1] 1 2 5
> x[1]
[1] 1
> x[2]
[1] 2
> x[3]
[1] 5
48
Introduction to R
Use square brackets to refer to 2+ cells of a vector
Examples
x <- c(1,2,5)
y <- c(1,5)
c(y, y) = 1, 5, 1, 5
x[1:2] = 1, 2
x[2:3] = 2, 5
x[c(1,3)] = 1, 5
49
Introduction to R
More useful stuff...
ls() # lists all variables/objects you made
rm(list = ls()) # removes ALL objects!
A vector index can be a condition as well as a number
Example
x <- c(1, 0, 8, 12, 0)
Get positive elements of x
x[x>0]
[1] 1 8 12
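Conditional indexing works because the condition is itself a vector of TRUE/FALSE values, one per entry. A short sketch, reusing the x from this slide (is_positive and positive are illustrative names):

```r
x <- c(1, 0, 8, 12, 0)

# The condition gives one TRUE/FALSE per element of x.
is_positive <- x > 0
print(is_positive)

# Indexing with that logical vector keeps only the TRUE positions.
positive <- x[is_positive]
print(positive)

# which() gives the positions, sum() counts how many are TRUE.
positions <- which(x > 0)
how_many <- sum(x > 0)
print(positions)
print(how_many)
```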
50
Introduction to R
Vectors can consist of strings of letters
x <- c("purple", "green", "blue")
A vector never consists of numbers AND strings
If a vector contains numbers and strings, R considers every entry in the vector to be a string (ie not viable for maths operations etc)
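The conversion of mixed vectors to strings can be seen directly with class(). This is a small sketch with made-up values; mixed and numbers_as_text are illustrative names.

```r
# Mixing numbers and strings: every entry becomes a string.
mixed <- c(1, "green", 3)
print(mixed)
print(class(mixed))

# Numbers stored as text must be converted back before doing maths.
numbers_as_text <- c("1", "2", "5")
doubled <- as.numeric(numbers_as_text) * 2
print(doubled)
```

This often explains why a column read from a file refuses to be averaged: a single stray text entry turns the whole column into strings.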
51
Introduction to R Now try the questions!
52
Set your working directory
On the top menu, go to "Session"
Then "Set Working Directory"
Then "Choose Directory" - select the location you want
or
getwd()
setwd("C:/Users/DowningT/myfolder/")
Now: Save your commands (eg a Notepad text file, Evernote, RStudio, etc)
53
Introduction to R
Data frames (and the function data.frame())
Ordered list (entries correspond to one another)
Each element of the list has the same length
h <- c(150, 170, 168, 179, 130)
w <- c(65, 70, 72, 80, 51)
patient_data <- data.frame(weights=w, heights=h)
54
Introduction to R
Accessing a particular column, cell or row of a data frame
Accessing a cell: patient_data[i,j] (the cell in row i, column j)
Accessing a row: patient_data[i,] (row i)
Accessing a column: patient_data[,j] (column j)
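In the square brackets the row index always comes first and the column index second. The sketch below rebuilds the patient_data example from the data frames slide and pulls out one cell, one row and one column; the variable names on the left of each assignment are illustrative.

```r
h <- c(150, 170, 168, 179, 130)
w <- c(65, 70, 72, 80, 51)
patient_data <- data.frame(weights = w, heights = h)

# [row, column]: row 2, column 1 is the second patient's weight.
one_cell <- patient_data[2, 1]
print(one_cell)

# Leaving the column blank returns the whole row (still a data frame).
one_row <- patient_data[2, ]
print(one_row)

# Leaving the row blank returns the whole column (as a vector).
one_column <- patient_data[, 2]
print(one_column)

# Columns can also be picked by name with $.
print(patient_data$heights)
```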
55
Introduction to R
Factors = data type recording which group each observation belongs to
heights <- c(1.7,1.95,1.63,1.54,1.29)
Make new vector fac_heights to record nationalities:
fac_heights <- factor(c("GB", "IR", "GB", "GB", "IR"))
Useful for statistical tests b/w groups
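Once the grouping is stored as a factor, per-group summaries take one line. A minimal sketch using the heights and nationalities from this slide; group_means and group_counts are illustrative names, and tapply() is only one of several ways to do this.

```r
heights <- c(1.7, 1.95, 1.63, 1.54, 1.29)
fac_heights <- factor(c("GB", "IR", "GB", "GB", "IR"))

# The distinct groups stored in the factor.
print(levels(fac_heights))

# How many observations fall in each group.
group_counts <- table(fac_heights)
print(group_counts)

# Mean height within each group.
group_means <- tapply(heights, fac_heights, mean)
print(group_means)
```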
56
Introduction to R
Finding out about a data object
mode(): tells you the storage mode of an object
class(): info on the object’s class, which often determines how the object is handled by a function
You can also set an object’s mode, attributes or class using the above functions.
mode(x) <- "numeric"
mode(y) <- "character"
57
Data input and output
R comes with several pre-packaged datasets
You can access these datasets with the data function
data() gets you a list of all the datasets
data(Titanic) loads a dataset about passengers on the Titanic (others, eg data(tea), need the add-on package that supplies them to be installed first)
summary(Titanic) provides some summary information about the dataset Titanic
attributes(Titanic) provides some more information
Typing the dataset name on its own (followed by Enter) will display the data
58
Data input and output
Can use the function read.table() to read in a dataframe:
q <- read.table(file.choose(), sep="\t", header=TRUE)
59
CSV data
First line = names of variables, separated by commas
Variables = proper numbers or plain text
- no spaces, funny characters or punctuation
Data = mix of numbers and grouping variables
= time sequence is ok
= dates and times are not
Missing data = represented by the two letters NA and nothing else
- no dashes, no 999, 77, 88 or anything else
60
CSV data
61
Data input and output
Writing data to a file (the write() and write.table() functions)
Write a vector x out to the current directory:
write(x, file = "filename", ncolumns = 2)
for a vector, ncolumns specifies #columns in output
For a data frame, use write.table() (many optional arguments ... )
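A quick way to check that a file follows the CSV rules above is to write a small data frame out and read it straight back. The sketch below uses a temporary file and made-up measurements (measurements, csv_path and read_back are illustrative names), and it uses write.csv() and read.csv(), the comma-separated relatives of write.table() and read.table().

```r
# A small made-up data frame with one missing value.
measurements <- data.frame(
  sample = c("a", "b", "c"),
  weight = c(65, NA, 72)
)

# Write to a temporary file; row.names = FALSE avoids an extra unnamed column.
csv_path <- tempfile(fileext = ".csv")
write.csv(measurements, file = csv_path, row.names = FALSE)

# Read it back: the first line becomes the variable names.
read_back <- read.csv(csv_path, header = TRUE)
print(read_back)

# The missing value is stored as NA and is recognised again on reading.
print(is.na(read_back$weight))

# Most summary functions need na.rm = TRUE to skip missing values.
mean_weight <- mean(read_back$weight, na.rm = TRUE)
print(mean_weight)
```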
62
For loops
Writing your own functions in R
Syntax: let x be a vector
for (i in x) command
eg try
for (i in 1:10) { print(i) }
63
For loops
Example: add the elements of a vector
x <- c(1, 2, 8)
total <- 0 # named total so that the built-in sum() is not hidden
for (i in x) total <- total + i
total # 11
sum(x) does this for you
64
Conditional statements
If statement execution
Syntax: if (condition) { command }
eg
x <- 11
if (x > 2) { print(x) }
65
Conditional execution
>, <, <=, >=, !=, == are all used to compare numbers
= means assign; == tests whether two values are equal
These can also be used in conditional indexing (see earlier)
You can combine conditions with the & (AND) or | (OR) sign
66
Combined loop & conditional execution
Example: add positive elements of vector
x <- c(1, -2, 8)
pos_sum <- 0
for (i in x) if (i > 0) pos_sum <- pos_sum + i
pos_sum
This is for illustration – there are quicker ways of doing this
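One of the quicker ways is to combine conditional indexing with sum(), which avoids the loop altogether. A minimal sketch comparing both routes on the same x (pos_sum_loop and pos_sum_vec are illustrative names):

```r
x <- c(1, -2, 8)

# Loop version, as on the slide.
pos_sum_loop <- 0
for (i in x) {
  if (i > 0) pos_sum_loop <- pos_sum_loop + i
}

# Vectorised version: keep the positive entries, then add them up.
pos_sum_vec <- sum(x[x > 0])

print(pos_sum_loop)
print(pos_sum_vec)
```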
67
Blocks
Sometimes you need to execute several commands in a loop or after testing a condition
R will group together all commands within { }
x <- 11
if (x > 2) { print(x) }
68
Blocks
Add up all the numbers from 1 to 10 and multiply by the sum of the numbers above 5 (6 to 10)
m <- 0 # running total of 1 to 10
k <- 0 # running total of 6 to 10
for (i in 1:10) {
  m <- m + i # m runs 1, 3, 6, …
  if (i > 5) { k <- k + i } # k runs 6, 13, …
}
s <- m * k # 55 * 40 = 2200
69
Writing your own functions in R
Making your own function is possible:
my_function <- function( ... ) {}
fix(my_function)
Note: comment your code (anything after the # symbol is ignored by R – this is a place to put ‘notes to self’)
m <- 0 # start both running totals at zero
k <- 0
for (i in 1:10) { # for each number 1 to 10
  m <- m + i # m is assigned as itself + i
  if (i > 5) { k <- k + i } # if i>5, k is assigned as itself + i
} # end for
s <- m * k # s is m by k
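The loop above can be wrapped in a function so it can be reused with other limits. This is a sketch only: the function name sum_product and its arguments upper and threshold are made up for illustration and are not part of the course material.

```r
# Sum 1..upper, sum the numbers above threshold, and multiply the two sums.
sum_product <- function(upper = 10, threshold = 5) {
  m <- 0                       # running total of all numbers
  k <- 0                       # running total of numbers above the threshold
  for (i in 1:upper) {
    m <- m + i
    if (i > threshold) {
      k <- k + i
    }
  }
  m * k                        # the last value is what the function returns
}

result <- sum_product()
print(result)

# Cross-check against the vectorised calculation.
check <- sum(1:10) * sum(6:10)
print(check)
```

Variables created inside the function (m and k here) exist only while it runs, so they do not clash with objects of the same name in your workspace.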
70
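Data input and output
R comes with several pre-packaged datasets, and add-on packages supply more
You can access these datasets with the data function
eg Davis, from the carData package: 1990 Davis PMID "Body image and weight preoccupation: A comparison between exercising and non-exercising women"
library(carData) # supplies the Davis dataset
View(Davis)
head(Davis)
str(Davis)
glimpse(Davis) # glimpse() comes from the dplyr package
summary(Davis)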
71
Scientific method in biostatistics
1. Defining problems: What type of data is being tested?
2. Assumptions of each model: Is this valid for this data? Why has this particular model been used?
3. Testing hypothesis predictions: If X is true, then what follows?
4. Boundary conditions: When does it fail or not work?
72
Getting your data into R
Let's learn:
* Start RStudio and set up a project
* Prepare and run R script files
* What an R package is and how to install and load one
* Load a simple data file
* Understand how (& why) R uses dataframes & vectors
* Prepare simple tables
* Produce good quality graphs, and save these
* Locate and perform some statistical tests on your data
* Prepare a simple function
73
Getting your data into R
Download -> CSV -> data.frame -> data cleaning
At the end of this session, you will be able to:
- Upload, download, create or import data into R
- Manipulate large datasets as tables in R
- Explore datasets using multiple approaches
- Test for missing, partial or inconsistent data
- Summarise and compare datasets with R
74
Data input and output
Davis
table(Davis$weight)
table(Davis$height)
data.frame(Davis$height,Davis$weight) # create a dataframe from heights and weights only
data.frame(Davis$height,Davis$weight)[1:5,] # look at dataframe for heights and weights - samples 1-5
75
Dplyr -> Data cleaning + EDA
=> manipulating data:
library(dplyr) # provides %>%, filter(), arrange(), select(), mutate(), group_by(), summarise()
# each %>% must end its line so that R knows the command continues
Davis %>%
  filter(height < quantile(height, 0.5)) %>% # subset rows of smallest 50%
  arrange(desc(height)) %>% # sort rows of smallest 50%
  select(sex, weight, repwt) %>% # select columns we want
  mutate(weight_diff = weight - repwt) %>% # create new variables, eg "weight_diff"
  group_by(sex) %>%
  summarise(mean = mean(weight)) # summarise details of interest
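To see what each step of such a pipeline does, it can help to run the same sequence in base R on a tiny table. The sketch below is not dplyr and does not use the real Davis data: toy is a made-up data frame that only borrows the column names, and base R subsetting, order() and aggregate() stand in for filter(), arrange(), mutate() and summarise().

```r
# Made-up data with the same column names as Davis (values are invented).
toy <- data.frame(
  sex    = c("F", "M", "F", "M", "F", "M"),
  weight = c(58, 77, 53, 68, 59, 76),
  height = c(161, 182, 165, 160, 157, 170),
  repwt  = c(57, 77, 51, 70, 59, 76)
)

# filter(): keep rows below the median height.
shorter <- toy[toy$height < quantile(toy$height, 0.5), ]

# arrange(desc(height)): sort the kept rows, tallest first.
shorter <- shorter[order(shorter$height, decreasing = TRUE), ]

# mutate(): add a new column from existing ones.
shorter$weight_diff <- shorter$weight - shorter$repwt

# group_by() + summarise(): mean weight within each sex.
mean_weight <- aggregate(weight ~ sex, data = shorter, FUN = mean)
print(shorter)
print(mean_weight)
```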
76
Dplyr -> Data cleaning + EDA
=> manipulating data:
library(ggplot2)
# assign Davis heights vs weights using ggplot2
x <- ggplot(Davis, aes(x=weight, y=height, colour=sex)) +
  geom_jitter() + geom_line() +
  stat_smooth(span=0.5) + # Higher spans = smoother
  ggtitle('heights v weights') +
  xlab('weight') + ylab('height') # labels match the x and y set in aes()
# save it (this way avoids errors)
ggsave(filename='heights-weights-plot1.png', plot=x, dpi=1200)
78
Dplyr -> Data cleaning + EDA
=> manipulating data:
# assign Davis heights vs weights using ggplot2
x2 <- ggplot(Davis, aes(x=weight, y=height, colour=sex)) +
  geom_boxplot() + coord_flip() + facet_wrap(~sex) +
  ggtitle('heights v weights')
# save it (NB not all plot types are sensible)
ggsave(filename='heights-weights-plot2.png', plot=x2, dpi=1200)
79
Dplyr -> Data cleaning + EDA
=> manipulating data:
# create function to scale heights (in cm) as nautical miles
scale_h <- function(height) { return(height/185200) }
# assign a new variable with nautical mile heights
Davis$height_nm <- scale_h(Davis$height)
# always check the output
Davis$height_nm[1:5]
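Checking the output is easiest with a value whose answer is known in advance. The sketch below repeats the scale_h function on a few made-up heights instead of the Davis data, and relies on one nautical mile being 1852 m, ie 185200 cm; heights_cm and heights_nm are illustrative names.

```r
# Convert heights in centimetres to nautical miles (1 nautical mile = 185200 cm).
scale_h <- function(height) {
  return(height / 185200)
}

# Made-up heights in cm, standing in for Davis$height.
heights_cm <- c(182, 161, 170)
heights_nm <- scale_h(heights_cm)
print(heights_nm)

# Sanity check with a known answer: 185200 cm is exactly 1 nautical mile.
one_mile <- scale_h(185200)
print(one_mile)

# Converting back should recover the original heights.
back_to_cm <- heights_nm * 185200
print(back_to_cm)
```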