Published by Paul Hardy. Modified over 7 years ago.
2
What School / Dept? n=55
3
What is the primary research question you work on?
4
Why you want to participate?
5
Data for life: An Introduction to R
Dr. Tim Downing, School of Biotechnology
See Resources and References
#dataforlife
7
Social Networking Sites
metabolite data ion channels + prostate cancer blockchain technology Cybersecurity and parenting gene copies and visualization cycling ability marketing + market development NLP for morphologically rich languages operational efficiencies + data analytics e-commerce wearable fitness technology proteomics and phosphoproteomic analysis how to predict future Pancreatic Cancer Research drugs interaction with biological membrane Intelligent Power Systems violent online political extremism literacy attainment for deaf and hard of hearing children Trends in combat sports biomanufacturability of CHO cells Collaboration and interorganisational relationships capturing carbon dioxide and using it as a raw material Predicting the sustainability and resale value of a car
8
Who owns the data? Example: NHS
100,000 Genomes Project -> 32k
Genetics Expert Network for Enterprises (GENE): 13 private companies
AbbVie, Alexion Pharmaceuticals, AstraZeneca, Biogen, Dimension Therapeutics, GSK, Helomics, Roche, Takeda, UCB*
March 2018: sequence genomes
Jan 2019: NHS own genomes as database
9
Example: NHS Google? 99.7% "Deriving genomic diagnoses without revealing patient genomes" Report at
11
Code not spreadsheets! data in formulas! =(3.6946*10^-6)/'Old snails'!J26 ref: Jenny Bryan
12
ref: Jenny Bryan
14
Code not spreadsheets!
15
Code not spreadsheets!
- Do something once, and you may not need a script
- Do it hundreds or thousands of times, and you need one (or several)
- Share what you did by sharing the script
- This may be complicated by tool version numbers
ref: BF Francis Ouellette
16
Mistakes matter! Paper: "Genomic surveillance elucidates Ebola virus origin and transmission during the 2014 outbreak mutation rate and genotype variation of Ebola virus from Mali case sequences" DOI: /science.aaa5646 Wait, WTF? Comment on “Mutation rate and genotype variation of Ebola virus from Mali case sequences” DOI: /science.aaf3823 Oops OMG rly sry Response to Comment on “Mutation rate and genotype variation of Ebola virus from Mali case sequences” DOI: /science.aaf4561
17
Fig. 2 Maximum likelihood tree of the 106 sequences analyzed by Hoenen et al. (left side) initially (1) with lines linking to the correctly labeled sequences after the erratum (3) (right side). Lines of the same color represent multiple samples taken from the same patient, which in most cases have identical sequences. These correctly group together on the right but do not in many cases on the left. Rambaut et al. Science 2016
18
Murphy's Law / Reviewer 3
- How exactly did you define your variables?
- You had 14,327 records at the start, but the analysis is based on 14,225. What happened to the other 102?
- Why did you exclude the measurements from Sensor D, but only between 2pm and 4pm on the 18th day?
- You realise that you've normalised the Loop grades incorrectly: the average is 62%, not 153%
- Figure 4 is great, but can you make the axis labels one louder and move the legend to the top left-hand corner beside the dancing leprechaun?
19
Code not spreadsheets!
- For Every Result, Track How It Was Produced
- Avoid Manual Data Manipulation
- Archive Exact Versions of All Programs Used
- Version Control All Scripts
- Record All Intermediate Results in Standard Formats
- Note Random Seeds/Numbers
- Always Store Raw Data behind Plots
- Allow Layers of Increasing Detail to Be Inspected
- Connect Textual Statements to Underlying Results
- Provide Public Access to Scripts, Runs and Results
20
What is R / RStudio?
- An environment for statistical analysis and graphics
- Supports almost every statistical method
- The environment of choice for developing new methods
It's for anyone who has to analyse data, from A to Z: agriculture, astrophysics, climatology, ecology and environmental science, econometrics, electrical engineering, finance, genetics, genomics, geography, psychology, public health, social sciences, zoology!
21
R Course Aims
- Introduce you to R
- Get you to use R yourself on your own data
- Work in groups to talk about using R in your projects
- Show you how to learn what you need to know about R
22
R Course Learning Outcomes
At the end of this session you will be able to:
- Start RStudio and create an Rmd (R Markdown) file
- Understand what an Rmd file is
- Load a simple data file
- Understand how R uses dataframes and vectors
- Use the R help system
- Examine a dataframe
- Prepare simple tables
- Locate and perform some stats on your data
- (Produce good quality graphs, and save these)
- (Prepare a simple function)
23
R code terms
Rmd file – a useful text file of your code and comments
vector – a list of numbers
variable – something we define, eg x
function – R code that does something specific, eg mean(x)
arguments – data required by a function, eg x above
library – a package of R functions that does something specific, loaded using the library() function
environment – the space you are working in, and the set of things that can be found there (eg Windows, Mac, etc)
CRAN – the website from which you can get libraries to meet your needs
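The terms above can be seen together in a few lines of R. This is only a minimal sketch: the values in x are made up for illustration, and the stats package is used as the library example simply because it ships with R.

```r
# variable: a name we define and assign a value to
x <- c(2, 4, 6, 8)   # x holds a vector, built with the function c()

# function and argument: mean() is the function, x is its argument
m <- mean(x)
print(m)

# library: load a package of functions (stats is part of every R installation)
library(stats)
s <- sd(x)           # sd() is provided by the stats package
print(s)

# environment: list the objects currently defined in your workspace
print(ls())
```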
24
Managing projects in R
- Many options: I prefer Rmd files (R Markdown)
- Record your commands in your notebook (I use Evernote) and in your Rmd file
- Using a GUI (eg Excel) or websites is asking for trouble
- FAIR etc
25
Managing projects in R
- Keep your work organised
- Keep your original data
- How did you get from there to here?
- Record all analyses, both the ones you published and the ones you didn't
- Start with the original data and re-trace your steps
- Never overwrite the original data: disk space is cheap
26
Cleaning data / Data QC
- 90% of the work
- Errors in the data
- Fix these
- Or omit these, and document why
- Create new variables and recode old ones, all done live in R
- Output a dataset ready for further analysis
27
Actually doing the thing you wanted to do in the first place after all that annoying checking that took forever omfg
- 10% of the work
- Prepare tables and graphs from the data
- Carry out statistical tests
- Start with simple models, and move to complex ones if needed
All with R (Perl, Python, Java, Jython, C, C++, Ruby, Fortran, ...)
28
29
Introduction to R
After this section you should be able to:
- Read data in and output data from R
- Manipulate data in R
- Use some basic R functions (eg statistics and plotting)
You may be able to write basic R computer programs
Use it or lose it
There is a lot to R beyond this course – more is available in books, tutorials, online forums, R demos, help pages etc
30
Using your own laptop
- Download R from the R Project website, install it and start it
- Download RStudio from the RStudio website, install it and start it
- Explore "A very short introduction to R" from CRAN
31
R Course Data Download these 31
32
Oh no. I can't immediately install R & RStudio
Oh no! I can't immediately install R & RStudio! Try the online versions: (Good) (the first part is free, which is enough)
33
What RStudio looks like
Environment Scripting Plots Console
34
Running commands
Click anywhere in the line with your cursor
Press Ctrl + Return to run it, or click on the Run button
35
Introduction to R
help( )
> help(topic)
help() is an R function
R functions take arguments: the information passed to the function, which goes between the brackets, separated by commas
The 'help' function displays the R documentation file for a topic
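A short sketch of how help() and its arguments can be used. The choice of mean() as the topic is just an example; the two commented lines are meant to be typed in an interactive session, where they open the help page.

```r
# In an interactive session, either of these opens the help page for mean():
#   help(mean)
#   ?mean

# help() is itself a function, so its arguments can be named explicitly.
# Storing the result (rather than printing it) only locates the help file.
h <- help(topic = "mean", package = "base")
print(class(h))

# args() shows which arguments a function accepts
print(args(mean))
```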
36
Introduction to R
R as calculator
R will evaluate basic calculations which you type into the console (input window)
Type into R:
> 10+1
[1] 11
> 10*2
[1] 20
> (10*2)+1    # use brackets to clarify precedence
[1] 21
37
Introduction to R
Assigning variables
> x <- 1.5
> y <- 2.6
> x    # check what x and y were assigned as
[1] 1.5
> y
[1] 2.6
38
Introduction to R
Assigning variables
> x <- 1.5
> y <- 2.6
R uses standard operators:
+ - * /
a**b (also written a^b)    a to the power of b
a %% b    a modulo b (get the remainder)
> (x/(y + 20/15))**18    # R can do complex operations; the result is about 2.9e-08
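The operators listed above can be tried on two small numbers. The values of a and b are arbitrary; %/% (integer division) is not on the slide and is added here only as a companion to %%.

```r
a <- 7
b <- 3

print(a + b)    # addition
print(a - b)    # subtraction
print(a * b)    # multiplication
print(a / b)    # division
print(a ^ b)    # a to the power of b; a ** b is accepted as the same thing
print(a %% b)   # modulo: the remainder after dividing a by b
print(a %/% b)  # integer division: how many whole times b fits into a
```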
39
Introduction to R
You can generally obtain the number (or other value) stored in any variable by typing the name of the variable followed by Enter, or by using print(variable) or show(variable)
Wait, what did I assign x as again ... ?
> x
[1] 1.5
> print(x)
[1] 1.5
> show(x)
[1] 1.5
40
Introduction to R Now try the questions!
41
Introduction to R
Vectors
Vectors = a single column/row of numbers in a spreadsheet
Create a vector using the c() function (concatenate): x <- c()
eg x <- c(1,2,4,8) creates a column of the numbers 1,2,4,8
42
Introduction to R
For simple operations (+ - * /) on vectors x and y:
If x and y have the same number of entries: normal operations on the numbers in the vectors, entry by entry
Else, if x and y have different numbers of entries: R recycles (repeats) the entries of the vector with the fewest entries
x <- c(1,2,3)    # 3 elements in x
y <- c(1,2,3,4)  # 4 elements in y
z <- x*y         # R does not stop here: it recycles x and gives a warning, because 4 is not a multiple of 3
# to avoid surprises, combine vectors that contain the same number of elements
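What R does with vectors of different lengths can be checked directly. The sketch below uses made-up vectors; tryCatch() is used only to capture the warning as text so that it can be printed.

```r
x <- c(1, 2)
y <- c(10, 20, 30, 40)

# Same length: entry-by-entry arithmetic
same <- y + y
print(same)

# Length 2 against length 4: x is recycled as 1, 2, 1, 2 with no warning
recycled <- x * y
print(recycled)

# Length 3 against length 4: R warns, because 4 is not a multiple of 3.
# Here the warning is caught and its message is kept as a string.
msg <- tryCatch(c(1, 2, 3) * y, warning = function(w) conditionMessage(w))
print(msg)
```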
43
Introduction to R
Vectors
> x <- c(1,2,3,4)
> y <- c(55,66,77,88)
> z <- c(10,11,12)
> v <- x*y; v    # multiple commands on one line, separated by ;
[1]  55 132 231 352
> v <- x/y; v
[1] 0.01818182 0.03030303 0.03896104 0.04545455
44
Introduction to R
We can do standard operations on vectors
> v + x
> v + x*y
> v <- c(x,y)    # create a longer vector v
> v
[1]  1  2  3  4 55 66 77 88
45
Introduction to R
Some more useful ways of creating columns of numbers (vectors)
The seq function
seq(1,10,1) = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
seq(1,4,0.5) = 1, 1.5, 2, 2.5, 3, 3.5, 4
x:y
1:10 = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
The rep function
rep(2,4) = 2, 2, 2, 2
46
Introduction to R
> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(1,21,2.5)
[1]  1.0  3.5  6.0  8.5 11.0 13.5 16.0 18.5 21.0
> 1:9
[1] 1 2 3 4 5 6 7 8 9
> rep(2,3)
[1] 2 2 2
> rep(3,5)
[1] 3 3 3 3 3
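The same sequence functions, written with named arguments so that it is clear what each number means. The length.out form is not on the slides; it is included as one more commonly used option of seq().

```r
a <- seq(1, 10, by = 1)          # start, end, step
b <- seq(1, 4, by = 0.5)
c_short <- 1:10                  # shorthand for a step of 1
d <- rep(2, times = 4)           # repeat the value 2 four times
e <- seq(0, 1, length.out = 5)   # ask for a number of points instead of a step

print(a)
print(b)
print(c_short)
print(d)
print(e)
```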
47
Introduction to R
Get specific parts of a vector:
Example: if x = vector 1, 2, 5 then x[1] = 1, x[2] = 2, x[3] = 5
> x <- c(1,2,5)
> x
[1] 1 2 5
> x[1]
[1] 1
> x[2]
[1] 2
> x[3]
[1] 5
48
Introduction to R
Use square brackets to refer to 2+ cells of a vector
Examples
x <- c(1,2,5)
y <- c(1,5)
c(y, y)      1, 5, 1, 5
x[1:2]       1, 2
x[2:3]       2, 5
x[c(1,3)]    1, 5
49
Introduction to R
More useful stuff...
ls()             # lists all variables/objects you made
rm(list = ls())  # removes ALL objects!
A vector index can be a condition as well as a number
Example
x <- c(1, 0, 8, 12, 0)
Get positive elements of x
> x[x>0]
[1]  1  8 12
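Conditional indexing works in two steps, which the sketch below separates out. The variable names keep, positives and where are chosen here for illustration; which() is an extra base R function that returns positions instead of values.

```r
x <- c(1, 0, 8, 12, 0)

keep <- x > 0          # a logical vector: TRUE where the condition holds
print(keep)

positives <- x[keep]   # same as x[x > 0]
print(positives)

where <- which(x > 0)  # the positions, rather than the values
print(where)
```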
50
Introduction to R
Vectors can consist of strings of letters
x <- c("purple", "green", "blue")
A vector never consists of numbers AND strings
If a vector is given both numbers and strings, R treats every entry in the vector as a string (ie not usable for maths operations etc)
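This conversion to strings is easy to see for yourself. A minimal sketch with a made-up vector; suppressWarnings() is used only to silence the warning R gives when "three" cannot be turned back into a number.

```r
mixed <- c(1, 2, "three")
print(mixed)          # every entry is now a string
print(class(mixed))

# Converting back gives NA for anything that is not a number
nums <- suppressWarnings(as.numeric(mixed))
print(nums)
```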
51
Set your working directory
On the top menu, go to "Session", then "Set Working Directory", then "Choose Directory", and select the location you want
or
getwd()
setwd("C:/Users/DowningT/myfolder/")
Now: Save your commands (eg a Notepad text file, Evernote, RStudio, etc)
52
Introduction to R
Data frames (and the function data.frame( ) )
Ordered list (entries correspond to one another)
Each element of the list has the same length
h <- c(150, 170, 168, 179, 130)
w <- c(65, 70, 72, 80, 51)
patient_data <- data.frame(weights=w, heights=h)
53
Introduction to R
Accessing a particular column, cell or row of a data frame
Accessing a cell: patient_data[i,j] (the cell in row i, column j)
Accessing a row: patient_data[i,] (row i)
Accessing a column: patient_data[,i] (column i)
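The indexing rules can be tried on the patient_data example from the previous slide. This is a sketch; the $ form for picking a column by name is an addition that is not on the slide.

```r
h <- c(150, 170, 168, 179, 130)
w <- c(65, 70, 72, 80, 51)
patient_data <- data.frame(weights = w, heights = h)

cell <- patient_data[2, 1]        # row 2, column 1
row3 <- patient_data[3, ]         # the whole third row (a one-row data frame)
col2 <- patient_data[, 2]         # the whole second column (a vector)
by_name <- patient_data$heights   # the same column, picked by name

print(cell)
print(row3)
print(col2)
```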
54
Introduction to R
Factors = data type with info about the group each observation belongs to
heights <- c(1.7,1.95,1.63,1.54,1.29)
Make new vector fac_heights to record nationalities:
fac_heights <- factor(c("GB", "IR", "GB", "GB", "IR"))
Useful for statistical tests between groups
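One way to see why factors help with group comparisons is to summarise the heights by group. The sketch below uses tapply(), which is one of several base R options and is not named on the slide.

```r
heights <- c(1.7, 1.95, 1.63, 1.54, 1.29)
fac_heights <- factor(c("GB", "IR", "GB", "GB", "IR"))

print(levels(fac_heights))   # the distinct groups
print(table(fac_heights))    # how many observations fall in each group

# Apply mean() to the heights within each group
group_means <- tapply(heights, fac_heights, mean)
print(group_means)
```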
55
Introduction to R
Finding out about a data object
mode(): tells you the storage mode of an object
class(): gives the object's class, which often determines how the object is handled by a function
You can also set an object's mode, attributes or class using the above functions.
mode(x) <- "numeric"
mode(y) <- "character"
56
Data input and output
R comes with several pre-packaged datasets
You can access these datasets with the data function
data() gets you a list of all the datasets
data(Titanic) loads a dataset about passengers on the Titanic (for example, others include data(tea) etc)
summary(Titanic) provides some summary information about the dataset Titanic
attributes(Titanic) provides some more information
Typing the dataset name on its own (followed by Enter) will display the data
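A short sketch of exploring the built-in Titanic dataset. Titanic is stored as a four-way table of counts rather than a data frame, so the example sums over its dimensions with apply(); that choice is just one convenient way to look at it.

```r
data(Titanic)

print(class(Titanic))            # a table, not a data frame
print(dim(Titanic))              # one dimension each for Class, Sex, Age, Survived
print(dimnames(Titanic)$Class)   # the categories along the first dimension

total <- sum(Titanic)            # total number of people counted
print(total)

by_survival <- apply(Titanic, 4, sum)   # margin 4 is the "Survived" dimension
print(by_survival)
```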
57
Data input and output
R comes with several pre-packaged datasets
You can access these datasets with the data function
eg 1990 Davis PMID "Body image and weight preoccupation: A comparison between exercising and non-exercising women"
View(Davis)
head(Davis)
str(Davis)
glimpse(Davis)
summary(Davis)
58
Data input and output
Davis
table(Davis$weight)
table(Davis$height)
data.frame(Davis$height,Davis$weight)        # create a dataframe from heights and weights only
data.frame(Davis$height,Davis$weight)[1:5,]  # look at dataframe for heights and weights - samples 1-5
59
Data input and output
Can use the function read.table() to read in a dataframe:
q <- read.table(file.choose(), sep="\t", header=TRUE)
60
CSV data
First line = names of variables, separated by commas
Variables = proper numbers or plain text - no spaces, funny characters or punctuation
Data = mix of numbers and grouping variables
= time sequence is ok
= dates and times are not
Missing data = represented by the two letters NA and nothing else - no dashes, no 999, 77, 88 or anything else
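A file that follows these rules reads into R without any extra options. The sketch below writes a tiny made-up CSV to a temporary file and reads it back; the column names and values are invented for illustration.

```r
# A small CSV that follows the rules: header line, plain names, NA for missing
csv_text <- c(
  "id,group,weight",
  "1,control,65.2",
  "2,treated,NA",
  "3,treated,70.1"
)
path <- tempfile(fileext = ".csv")
writeLines(csv_text, path)

d <- read.csv(path, header = TRUE)
str(d)

n_missing <- sum(is.na(d$weight))            # NA is recognised as missing
mean_weight <- mean(d$weight, na.rm = TRUE)  # skip missing values in the mean
print(n_missing)
print(mean_weight)
```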
61
CSV data
62
Data input and output
Writing data to a file (the write() and write.table() functions)
Write a vector out to the current directory:
write(x, file = "filename", ncolumns = 2)    for a vector, ncolumns specifies the number of columns in the output
For a data frame (many optional arguments ...):
write.table(q, file = "filename")
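A sketch of both writing functions, using temporary files so that nothing is left behind in your working directory. The data frame q and the vector v are made up, and the options passed to write.table() are one reasonable choice among many.

```r
# Data frame: write as tab-separated text, then read it back to check
q <- data.frame(id = 1:3, weight = c(65.2, NA, 70.1))
out <- tempfile(fileext = ".tsv")
write.table(q, file = out, sep = "\t", row.names = FALSE, quote = FALSE)
back <- read.table(out, sep = "\t", header = TRUE)
print(back)

# Vector: write() lays the values out in ncolumns columns
v <- 1:6
vec_file <- tempfile(fileext = ".txt")
write(v, file = vec_file, ncolumns = 2)
lines_written <- readLines(vec_file)
print(lines_written)
```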
63
For loops
Writing your own functions in R
Syntax: let x be a vector
for ( i in x ) command
eg try
for(i in 1:10) { print(i) }
64
For loops
Example: add the elements of a vector
x <- c(1, 2, 8)
sum <- 0
for (i in x) sum <- sum + i
sum is now 11
The built-in function sum(x) does this for you
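The loop and the built-in function can be compared side by side. In this sketch the running total is called total rather than sum, an assumption made here so that the name of the built-in sum() function is not reused.

```r
x <- c(1, 2, 8)

total <- 0
for (i in x) {
  total <- total + i   # add each element to the running total
}
print(total)

print(sum(x))          # the built-in function gives the same answer
```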
65
Conditional statements
If statement execution
Syntax: if (condition){ command }
eg
x <- 11
if(x>2){ print(x) }
66
Conditional execution
>, <, <=, >=, !=, == are all used to compare numbers
= means assign (== tests whether two values are equal)
These can also be used in conditional indexing (see earlier)
You can combine conditions with the & (AND) or | (OR) sign
67
Combined loop & conditional execution
Example: add positive elements of vector
x <- c(1, -2, 8)
pos_sum <- 0
for (i in x) if (i > 0) pos_sum <- pos_sum + i
pos_sum
This is for illustration – there are quicker ways of doing this
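One of the quicker ways is to combine conditional indexing with sum(). The sketch below runs both versions on the same vector; quick_sum and in_range are names chosen here, and the last line shows two conditions joined with &.

```r
x <- c(1, -2, 8)

# Loop version
pos_sum <- 0
for (i in x) {
  if (i > 0) pos_sum <- pos_sum + i
}

# Vectorised version: pick out the positive elements, then add them
quick_sum <- sum(x[x > 0])

# Two conditions combined with & (AND)
in_range <- x[x > 0 & x < 5]

print(pos_sum)
print(quick_sum)
print(in_range)
```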
68
Blocks
Sometimes you need to execute several commands in a loop or after testing a condition
R will group together all commands within { }
x <- 11
if(x>2){ print(x) }
69
Blocks
Add up all the numbers between 1 and 10 and multiply by the sum of the numbers above 5 (the code tests i > 5, so 6 to 10)
m <- 0    # start both totals at zero
k <- 0
for ( i in 1:10) {
m <- m + i                    # running total: 1, 3, 6, ...
if ( i > 5 ) { k <- k + i }   # running total: 6, 13, ...
}
s <- m * k
70
Writing your own functions in R
Making your own function is possible:
my_function <- function ( ... ) {}
fix ( my_function )
Note: comment your code (anything after the # symbol is ignored by R – this is a place to put 'notes to self')
m <- 0                          # start both totals at zero
k <- 0
for ( i in 1:10) {              # for each number 1 to 10
m <- m + i                      # m is assigned as itself + i
if ( i > 5 ) { k <- k + i }     # if i>5, k is assigned as itself + i
}                               # end for
s <- m * k                      # s is m by k
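The commented loop can be wrapped up as a function so that it can be reused with other limits. This is a sketch: the function name sum_times_upper_sum and its arguments n and cutoff are hypothetical, chosen here for illustration.

```r
# Hypothetical function: sum of 1..n multiplied by the sum of the numbers above cutoff
sum_times_upper_sum <- function(n = 10, cutoff = 5) {
  m <- 0                 # running total of every number
  k <- 0                 # running total of the numbers above the cutoff
  for (i in 1:n) {
    m <- m + i
    if (i > cutoff) {
      k <- k + i
    }
  }
  m * k                  # the last expression is the value the function returns
}

s <- sum_times_upper_sum()   # uses the defaults: 1 to 10, cutoff 5
print(s)
```

Calling the function with other arguments, for example sum_times_upper_sum(3, 1), reuses the same code without copying the loop.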
71
Scientific method in biostatistics
1. Defining problems: What type of data is being tested?
2. Assumptions of each model: Is this valid for this data? Why has this particular model been used?
3. Testing hypothesis predictions: If X is true, then what follows?
4. Boundary conditions: When does it fail or not work?
72
Getting your data into R
Download -> CSV -> data.frame -> data cleaning
At the end of this session, you will be able to:
- Upload, download, create or import data into R
- Manipulate large datasets as tables in R
- Explore datasets using multiple approaches
- Test for missing, partial or inconsistent data
- Summarise and compare datasets with R
73
Dplyr -> Data cleaning + EDA
=> manipulating data:
Davis %>%
filter(height<quantile(height,0.5)) %>%   # subset rows: the shortest 50%
arrange(desc(height)) %>%                 # sort those rows by height, tallest first
select(sex,weight,repwt) %>%              # select columns we want
mutate(weight_diff=(weight-repwt)) %>%    # create new variables, eg "weight_diff"
group_by(sex) %>%
summarise(mean=mean(weight))              # summarise details of interest
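For comparison, the same sequence of steps can be sketched in base R, which needs no extra packages. This is not the dplyr code itself but a base R equivalent, and it runs on a small made-up data frame (called toy here) that borrows the Davis column names; the values are invented.

```r
# Made-up stand-in for Davis, with the same column names
toy <- data.frame(
  sex    = c("F", "M", "F", "M", "F", "M"),
  weight = c(55, 80, 60, 75, 52, 90),
  height = c(160, 168, 165, 175, 172, 190),
  repwt  = c(56, 79, 60, 76, 51, 91)
)

# filter(): keep rows below the median height
short <- toy[toy$height < quantile(toy$height, 0.5), ]

# arrange(desc()): sort by height, tallest first
short <- short[order(-short$height), ]

# select(): keep only the columns we want
short <- short[, c("sex", "weight", "repwt")]

# mutate(): add a new column
short$weight_diff <- short$weight - short$repwt

# group_by() + summarise(): mean weight for each sex
result <- aggregate(weight ~ sex, data = short, FUN = mean)
names(result)[2] <- "mean"
print(result)
```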