n=54

What School / Dept?

What is the primary research question you work on?

Why you want to participate?

Data for life: An Introduction to R. Dr. Tim Downing, School of Biotechnology. http://dataforlife.wikispaces.com See Resources and References

Code not spreadsheets! data in formulas! =(3.6946*10^-6)/'Old snails'!J26 ref: Jenny Bryan

Code not spreadsheets! ref: Jenny Bryan

Mistakes matter!
Paper: "Mutation rate and genotype variation of Ebola virus from Mali case sequences" DOI: 10.1126/science.aaa5646
Wait, WTF? Comment on “Mutation rate and genotype variation of Ebola virus from Mali case sequences” DOI: 10.1126/science.aaf3823
Oops OMG rly sry: Response to Comment on “Mutation rate and genotype variation of Ebola virus from Mali case sequences” DOI: 10.1126/science.aaf4561

Fig. 2 Maximum likelihood tree of the 106 sequences analyzed by Hoenen et al. (left side) initially (1) with lines linking to the correctly labeled sequences after the erratum (3) (right side). Lines of the same color represent multiple samples taken from the same patient, which in most cases have identical sequences. These correctly group together on the right but do not in many cases on the left. Andrew Rambaut et al. Science 2016;353:658

Code not spreadsheets! If you do something once, you usually don’t need a script. If you do it hundreds or thousands of times, you will want something to help you. If you want to share what you did, providing a script is usually a good way. Sometimes, though, scripts are too complicated, and don’t capture all that is needed to repeat an experiment. For example: the version of a tool you used! ref: BF Francis Ouellette

Code not spreadsheets!
- For Every Result, Keep Track of How It Was Produced
- Avoid Manual Data Manipulation Steps
- Archive Exact Versions of All External Programs Used
- Version Control All Custom Scripts
- Record All Intermediate Results in Standard Formats
- For Analyses That Include Randomness, Note Underlying Random Seeds
- Always Store Raw Data behind Plots
- Generate Hierarchical Analysis Output, Allowing Layers of Increasing Detail to Be Inspected
- Connect Textual Statements to Underlying Results
- Provide Public Access to Scripts, Runs, and Results
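
Two of these rules (noting random seeds, and recording exact program versions) can be sketched in a few lines of base R. This is only an illustration: the seed value and the variable names are arbitrary choices, not part of the original slides.

```r
# Fix the random seed so that "random" results can be reproduced exactly
set.seed(2024)                 # the seed value itself is arbitrary
first_run <- rnorm(5)          # five random numbers

set.seed(2024)                 # same seed again
second_run <- rnorm(5)

identical(first_run, second_run)   # TRUE: same seed, same numbers

# Record the exact R version used for the analysis
r_version <- R.version.string
r_version
```

Keeping lines like these at the top of a script means the seed and version travel with the analysis.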

R Shiny: see, for example, www.rstudio.com/shiny/showcase/

Introduction to R
R is a command-line programming language & statistics package (GNU project based on S)
Great for statistical analysis of biological data
Excellent for visualising experimental results
To open "R" or "RStudio": 1. Go to "Start" 2. "All programs" 3. Hit "RStudio" (64-bit)

Using your own laptop
• Download R from the R Project website http://r-project.org, install it, and verify that it starts.
• Download RStudio from the RStudio website www.rstudio.com/home/, install it, and verify that it starts.
• Download ‘A very short introduction to R’ from CRAN https://cran.r-project.org/doc/contrib/Torfs+Brauer-Short-R-Intro.pdf, read it, and work through it.

What is R?
- An environment for statistical analysis and graphics
- Supports almost every statistical method
- The environment of choice for developing new methods
It's for anyone who has to analyse data - from A to Z - agriculture, astrophysics, climatology, ecology and environmental science, econometrics, electrical engineering, finance, genetics, genomics, geography, psychology, public health, social sciences, zoology!

R Course Aims
- Introduce you to R
- Get you to use R yourself on your own data
- Work in groups to talk about using R in your projects
- Show you how to learn what you need to know about R

Running commands
Click anywhere in the line with your cursor, then press Ctrl + Return to run it, or click on the Run button

R Course Learning Outcomes
At the end of this session you will be able to:
- Start RStudio and create an Rmd (R Markdown) file
- Understand what an Rmd file is
- Load a simple data file
- Understand how R uses dataframes and vectors
- Use the R help system
- Examine a dataframe
- Prepare simple tables
- Locate and perform some stats on your data
- (Produce good quality graphs, and save these)
- (Prepare a simple function)

R Course Resources Go to http://dataforlife.wikispaces.com/MON+AM+-+Intro+to+R

R Course Data Download these

R code terms
Rmd file – useful text file of your code and comments
vector – a list of numbers
variable – something we define, eg x
function – R code that does something specific, eg mean(x)
arguments – data required by a function, eg x above
library – package of R functions that does something specific, loaded using the library() function
environment – the space you are working in, and the set of things (variables, functions) that can be found there
CRAN – website from which you can download libraries to meet your needs

Managing projects in R
Many options: I prefer Rmd files (R Markdown)
Record your commands in your notebook (I use Evernote) and in your Rmd file
Using a GUI (eg Excel) or websites is asking for trouble
FAIR (findable, accessible, interoperable, reusable) etc

Murphy's Law / Reviewer 3
- How exactly did you define your key explanatory variable?
- You had 14,327 records at the start, but the analysis is based on 14,225. What happened to the other 102?
- Why did you exclude the measurements from Sensor D, but only b/w 2pm and 4pm on the eighteenth day?
- Figure 4 is great, but I want you to make the axis labels a little bigger and move the key to the top left hand corner, please
- You do realize that you've calculated the SCQ scores wrong, the average ought to be about 4, not about 24

Managing projects in R
- *If* you have kept the code you ran to produce the data, run the analyses, and make the graphics
- *Then* all of these are trivial questions to answer
- If you haven't, then you have to redo the analysis from scratch - very painful, very slow, very boring

Managing projects in R
- Keep your work organised
- Keep your original data
- Show how you got from there to wherever you ended up
- Record all your analyses, both the ones you eventually published, and the ones you didn't
- Start with the original data file, once data entry and checking have been finished
- Never overwrite the original data, disk space = cheap
- I read it in, and do data cleaning

Cleaning data / Data QC: 90% of the work
- Sometimes I find errors in the data, which I check, if I can, with the original record
- Fix these in software
- Sometimes I have to omit observations, and again this is done, and fully documented, in software
- Create new variables, and recode old ones, and all of this is done live in R
- Output a dataset ready for further analysis

Actually doing the thing you wanted to do in the first place after all that checking: 10% of the work
- Prepare tables and graphs from the data
- Carry out statistical tests as needed
- Start doing regression analyses, mostly glms, beginning with very simple models, and working up to more complex ones
- Prepare tables, graphs and text for output, using R, knitr, and markdown directly

Introduction to R After this section you should be able to: - Read data in and output data from R - Manipulate data in R - Use some basic R functions (eg statistics and plotting) You may be able to write basic R computer programs Use it or lose it There is a lot to R beyond this course – much more available in books, tutorials, online forums, R demos, help pages etc.

Introduction to R
Help( )
> help(topic)
help() is an R function
R functions take arguments (the info given to the function: they go in between the brackets, separated by commas)
‘help’ function => displays info from the R documentation files

Introduction to R type in "help()" and then use Enter to move down lines or the Spacebar to scroll down pages Most R commands are constructed with brackets to denote the data that they are acting on, along with a series of input parameters For example, see "Usage:" for help here help(topic, verbose = getOption("verbose")) help(x) gives information on the topic x, with many adjustable parameters like the verbosity Press q to exit help() See for example help(file).
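
As a rough sketch of how arguments work, the lines below list the parameters that help() accepts and then call a function with a named argument. The choice of round() and the variable names are illustrative assumptions, not taken from the slides.

```r
# help(mean) or ?mean opens the documentation page when typed at the console.
# The parameters a function accepts can also be listed directly:
help_args <- names(formals(help))   # parameter names of help() itself
help_args                           # includes "topic" and "verbose"

# Arguments go inside the brackets, separated by commas;
# naming them (digits = 2) makes the call easier to read
rounded <- round(3.14159, digits = 2)
rounded
```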

Introduction to R
R as calculator
R will evaluate basic calculations which you type into the console (input window). Type into R:
> 10+1
[1] 11
> 10*2
[1] 20
> (10*2)+1   # use brackets to clarify precedence
[1] 21

Introduction to R
Assigning variables
> x <- 1.5
> y <- 2.6
> x   # check what x and y were assigned as
[1] 1.5
> y
[1] 2.6

Introduction to R
Assigning variables
> x <- 1.5
> y <- 2.6
R uses standard operators: + - / *
a**b     a to the power of b (more usually written a^b)
a %% b   a modulo b (get the remainder)
> (x/(y + 20/15))**18   # R can do complex operations
[1] 2.910376e-08
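
The operators above can be tried out as follows. This is a small sketch that reuses x and y from the slide; %/% (integer division) and the name result are additions for illustration.

```r
x <- 1.5
y <- 2.6

2^3        # power: ^ is the usual operator
2**3       # ** is accepted as a synonym for ^
7 %% 3     # modulo: the remainder after dividing 7 by 3
7 %/% 3    # integer division: how many whole times 3 goes into 7

# Brackets control the order of evaluation in longer expressions
result <- (x / (y + 20/15))^18
result
```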

Introduction to R
Generally you can obtain the number (or other value) stored in any variable by typing the name of the variable, followed by Enter; or by using print(variable) or show(variable)
Wait, what did I assign x as again … ?
> x
[1] 1.5
> print(x)
[1] 1.5
> show(x)
[1] 1.5

Introduction to R
Vectors
Vectors = a single column/row of numbers in a spreadsheet
Create a vector using the c() function (concatenate): x <- c()
eg x <- c(1,2,4,8) creates a column of the numbers 1,2,4,8

Introduction to R
For simple operations (+ - * /) on vectors x and y:
If x and y have the same number of entries: normal operations on the numbers in the vectors, entry by entry
Else, if x and y have different numbers of entries: R recycles the vector with fewest entries until it matches the longer one
x <- c(1,2,3) # 3 elements in x
y <- c(1,2,3,4) # 4 elements in y
z <- x*y # R does not stop here: it recycles x as 1,2,3,1 and gives a warning,
# because 4 is not a multiple of 3; keep vectors the same length to avoid surprises
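
What R actually does with vectors of different lengths can be checked directly. In this sketch suppressWarnings() is used only to keep the output quiet; without it R prints a warning that the longer length is not a multiple of the shorter one. The names z and a are illustrative.

```r
x <- c(1, 2, 3)        # 3 elements
y <- c(1, 2, 3, 4)     # 4 elements

# x is recycled to 1, 2, 3, 1 before multiplying
z <- suppressWarnings(x * y)
z                      # 1 4 9 4

# When the longer length is a multiple of the shorter one,
# recycling happens silently, with no warning at all
a <- c(1, 2) * y       # c(1, 2) is recycled to 1, 2, 1, 2
a                      # 1 4 3 8
```

Silent recycling is a common source of mistakes, so it is worth checking length(x) and length(y) before combining vectors.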

Introduction to R
Vectors
> x <- c(1,2,3,4)
> y <- c(55,66,77,88)
> z <- c(10,11,12)
> v <- x*y; v   # multiple commands on one line, separated by ;
[1] 55 132 231 352
> v <- x/y; v
[1] 0.01818182 0.03030303 0.03896104 0.04545455

Introduction to R
We can do standard operations on vectors
> y/x
[1] 55.00000 33.00000 25.66667 22.00000
> v + x
[1] 1.018182 2.030303 3.038961 4.045455
> v + x*y
[1] 55.01818 132.03030 231.03896 352.04545
> v <- c(x,y)   # create a longer vector v
> v
[1] 1 2 3 4 55 66 77 88

Introduction to R
Some more useful ways of creating columns of numbers (vectors)
The seq function: seq(1,10,1) = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 and seq(1,4,0.5) = 1, 1.5, 2, 2.5, 3, 3.5, 4
x:y eg 1:10 = 1,2,3,4,5,6,7,8,9,10
The rep function: rep(2,4) = 2, 2, 2, 2

Introduction to R
> seq(1,10)
[1] 1 2 3 4 5 6 7 8 9 10
> seq(1,21,2.5)
[1] 1.0 3.5 6.0 8.5 11.0 13.5 16.0 18.5 21.0
> 1:9
[1] 1 2 3 4 5 6 7 8 9
> rep(2,3)
[1] 2 2 2
> rep(3,5)
[1] 3 3 3 3 3
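
seq() and rep() also take named arguments, which can make the intent clearer. The following sketch shows a few variants that go slightly beyond the slide; the variable names are illustrative.

```r
halves    <- seq(1, 4, by = 0.5)         # step size given explicitly
quarters  <- seq(0, 1, length.out = 5)   # ask for 5 evenly spaced values
countdown <- 10:1                        # the colon operator also counts down

repeated   <- rep(c(1, 2), times = 3)    # repeat the whole vector: 1 2 1 2 1 2
each_thrice <- rep(c(1, 2), each = 3)    # repeat each element:     1 1 1 2 2 2

halves
quarters
```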

Introduction to R
Get specific parts of a vector:
Example: if x = vector 1, 2, 5 then x[1] = 1, x[2] = 2, x[3] = 5
> x <- c(1,2,5)
> x
[1] 1 2 5
> x[1]
[1] 1
> x[2]
[1] 2
> x[3]
[1] 5

Introduction to R
Use square brackets to refer to 2+ cells of a vector
Examples, with x <- c(1,2,5) and y <- c(1,5):
c(y, y)     gives 1, 5, 1, 5
x[1:2]      gives 1, 2
x[2:3]      gives 2, 5
x[c(1,3)]   gives 1, 5

Introduction to R
More useful stuff...
ls() # lists all variables/objects you made
rm(list = ls()) # removes ALL objects!
A vector index can be a condition as well as a number
Example: get the positive elements of x
> x <- c(1, 0, 8, 12, 0)
> x[x>0]
[1] 1 8 12
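
The different ways of indexing a vector can be compared side by side. This sketch reuses the x from the slide; negative indices and which() are additions for illustration.

```r
x <- c(1, 0, 8, 12, 0)

x[2:3]                  # positions 2 to 3
x[c(1, 4)]              # positions 1 and 4
without_first <- x[-1]  # a negative index drops that position

x > 0                   # a condition gives TRUE/FALSE for each element
positives <- x[x > 0]   # keep only the elements where the condition is TRUE
zero_positions <- which(x == 0)   # positions (not values) where x is 0

positives
zero_positions
```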

Introduction to R
Vectors can consist of strings of letters
x <- c("purple", "green", "blue")
A vector never consists of numbers AND strings
If a vector contains both numbers and strings, R considers every entry in the vector to be a string (ie not viable for maths operations etc)
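
The conversion of mixed vectors to strings can be seen with class(). A short sketch, in which the vector names are illustrative:

```r
colours <- c("purple", "green", "blue")
class(colours)          # "character"

mixed <- c(1, "green", 3)
class(mixed)            # "character": the numbers were converted to strings
mixed                   # "1" "green" "3"

# Strings that look like numbers can be converted back explicitly
numbers_again <- as.numeric(c("1", "3"))
numbers_again + 1
```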

Introduction to R Now try the questions!

Set your working directory
On the top menu, go to "Session", then "Set Working Directory", then "Choose Directory" - select the location you want
or use:
getwd()
setwd("C:/Users/DowningT/myfolder/")
Now: save your commands (eg a Notepad text file, Evernote, RStudio, etc)

Introduction to R
Data frames (and the function data.frame( ) )
Ordered list (entries correspond to one another)
Each element of the list has the same length
h <- c(150, 170, 168, 179, 130)
w <- c(65, 70, 72, 80, 51)
patient_data <- data.frame(weights=w, heights=h)

Introduction to R
Accessing a particular column, cell or row of a data frame
Accessing a cell: patient_data[i,j] (the cell in row i and column j)
Accessing a row: patient_data[i,] (row i)
Accessing a column: patient_data[,j] (column j)
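
Using the patient_data example from the previous slide, row and column indexing looks like this. The $ operator is an addition not mentioned on the slide.

```r
h <- c(150, 170, 168, 179, 130)
w <- c(65, 70, 72, 80, 51)
patient_data <- data.frame(weights = w, heights = h)

patient_data[2, 1]       # row 2, column 1: the weight of patient 2
patient_data[2, ]        # all of row 2
patient_data[, 2]        # all of column 2 (the heights)
patient_data$heights     # the same column, selected by name
dim(patient_data)        # number of rows, then number of columns
```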

Introduction to R
Factors = data type with info about the group each observation belongs to
heights <- c(1.7,1.95,1.63,1.54,1.29)
Make new vector fac_heights to record nationalities:
fac_heights <- factor(c("GB", "IR", "GB", "GB", "IR"))
Useful for statistical tests b/w groups
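
One way to see why factors are useful is to summarise a measurement by group. This sketch uses the heights and fac_heights vectors from the slide; using tapply() and the name group_means are illustrative choices.

```r
heights <- c(1.7, 1.95, 1.63, 1.54, 1.29)
fac_heights <- factor(c("GB", "IR", "GB", "GB", "IR"))

levels(fac_heights)                 # the distinct groups: "GB" "IR"
group_counts <- table(fac_heights)  # how many observations in each group
group_counts

# Mean height within each group
group_means <- tapply(heights, fac_heights, mean)
group_means
```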

Introduction to R
Finding out about a data object
mode(): tells you the storage mode of the object
class(): info on the object’s class, which often determines how the object is handled by a function
You can also set an object’s mode, attributes or class using the above functions.
mode(x) <- "numeric"
mode(y) <- "character"

Data input and output
R comes with several pre-packaged datasets
You can access these datasets with the data function
data() gets you a list of all the datasets
data(Titanic) loads a dataset about passengers on the Titanic (for example; others include data(iris) etc)
summary(Titanic) provides some summary information about the dataset Titanic
attributes(Titanic) provides some more information
Typing the dataset name on its own (followed by Enter) will display the data

Data input and output
You can use the function read.table() to read in a dataframe:
q <- read.table(file.choose(), sep="\t", header=TRUE)

CSV data
- First line = names of variables, separated by commas
- Variables = proper numbers or plain text - no spaces, funny characters or punctuation
- Data = mix of numbers and grouping variables; time sequence is ok; dates and times are not
- Missing data = represented by the two letters NA and nothing else - no dashes, no 999, 77, 88 or anything else
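
A file that follows these rules can be read with read.csv(). The sketch below writes a tiny made-up CSV to a temporary file first so that it runs anywhere; the column names and values are invented for illustration.

```r
# Create a small example CSV file (made-up data) in a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,group,score",
             "1,a,4.2",
             "2,b,NA",      # missing value written as NA
             "3,a,5.1"),
           csv_path)

d <- read.csv(csv_path, header = TRUE)
str(d)                         # check the column types
is.na(d$score)                 # which scores are missing
mean_score <- mean(d$score, na.rm = TRUE)   # ignore NA when averaging
mean_score
```

Because the missing value is written as NA, the score column is still read as numeric rather than text.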

Data input and output
Writing data to a file (the write() and write.table() functions)
Write a vector out to the current directory:
write(q, file = "filename", ncolumns = 2)   # for a vector, ncolumns specifies the number of columns in the output
For a data frame, use write.table() (many optional arguments ... )
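
A minimal round trip, writing a data frame and a vector to temporary files and reading them back, might look like the following. The file names, the example data and the chosen write.table() options are assumptions for illustration.

```r
# A small made-up data frame
q <- data.frame(id = 1:3, score = c(4.2, 5.1, 3.3))

# Write it as a tab-separated file, then read it back in
out_path <- file.path(tempdir(), "scores.tsv")
write.table(q, file = out_path, sep = "\t", row.names = FALSE, quote = FALSE)
q_back <- read.table(out_path, sep = "\t", header = TRUE)
all.equal(q, q_back)            # TRUE if nothing was lost on the way

# write() is for vectors: ncolumns sets how many values go on each line
v <- 1:6
vec_path <- file.path(tempdir(), "vector.txt")
write(v, file = vec_path, ncolumns = 2)
vec_lines <- readLines(vec_path)
vec_lines
```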

For loops
Writing your own functions in R
Syntax: let x be a vector
for (i in x) command
eg try
for(i in 1:10) { print(i) }

For loops
Example: add the elements of a vector
x <- c(1, 2, 8)
total <- 0
for (i in x) total <- total + i
total   # 11
sum(x) does this for you

Conditional statements
If statement execution
Syntax: if (condition) { command }
For example:
x <- 11
if (x > 2) { print(x) }

Conditional execution
>, <, <=, >=, !=, == are all used to compare numbers
= means assign (so use == to test whether two values are equal)
These can also be used in conditional indexing (see earlier)
You can combine conditions with the & (AND) or | (OR) signs
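
Combining conditions works both for indexing vectors and inside if statements. In this sketch the vector and variable names are invented; && is the single-value form of & that is normally used inside if().

```r
x <- c(1, -2, 8, 12, 0)

both     <- x[x > 0 & x < 10]   # AND: positive and below 10
either   <- x[x < 0 | x > 10]   # OR: negative or above 10
not_zero <- x[x != 0]           # != means "not equal to"

value <- 11
if (value > 2 && value != 5) {
  verdict <- "passes both tests"
} else {
  verdict <- "fails at least one test"
}
verdict
```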

Combined loop & conditional execution
Example: add positive elements of vector
x <- c(1, -2, 8)
pos_sum <- 0
for (i in x) if (i > 0) pos_sum <- pos_sum + i
pos_sum
This is for illustration – there are quicker ways of doing this
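
One of the quicker ways is to combine conditional indexing with sum(). The sketch below runs the loop and the one-line version on the same vector so the results can be compared; quick_sum is an illustrative name.

```r
x <- c(1, -2, 8)

# Loop version, as on the slide
pos_sum <- 0
for (i in x) {
  if (i > 0) pos_sum <- pos_sum + i
}

# Vectorised version: select the positive elements, then add them up
quick_sum <- sum(x[x > 0])

pos_sum == quick_sum    # both approaches give the same answer
```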

Blocks
Sometimes you need to execute several commands in a loop or after testing a condition
R will group together all commands within { }
x <- 11
if (x > 2) { print(x) }

Blocks
Add up all the numbers between 1 and 10 and multiply by the sum of the numbers above 5 (6 to 10)
m <- 0   # running total of 1 to 10
k <- 0   # running total of 6 to 10
for (i in 1:10) {
  m <- m + i   # 1, 3, 6, …
  if (i > 5) { k <- k + i }   # 6, 13, …
}
s <- m * k   # 55 * 40 = 2200

Writing your own functions in R
Making your own function is possible:
my_function <- function ( ... ) {}
fix(my_function)   # opens the function in an editor
Note: comment your code (anything after the # symbol is ignored by R – this is a place to put ‘notes to self’)
m <- 0; k <- 0 # start both totals at zero
for (i in 1:10) { # for each number 1 to 10
  m <- m + i # m is assigned as itself + i
  if (i > 5) { k <- k + i } # if i>5, k is assigned as itself + i
} # end for
s <- m * k # s is m by k
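
As a sketch of what a complete function might look like, the loop above can be wrapped up so that the limits become arguments. The name sum_product and its parameters n and cutoff are hypothetical, chosen only for this example.

```r
# sum_product() is a hypothetical example function:
# it adds up 1..n, adds up the numbers above cutoff, and multiplies the two totals
sum_product <- function(n = 10, cutoff = 5) {
  m <- 0                         # total of all numbers 1..n
  k <- 0                         # total of the numbers above cutoff
  for (i in 1:n) {
    m <- m + i
    if (i > cutoff) {
      k <- k + i
    }
  }
  m * k                          # the last value is what the function returns
}

default_result <- sum_product()       # 55 * 40
small_result   <- sum_product(4, 2)   # (1+2+3+4) * (3+4)
default_result
small_result
```

Giving the arguments default values (n = 10, cutoff = 5) means the function reproduces the slide's example when called with no arguments.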

Data input and output
R comes with several pre-packaged datasets
You can access these datasets with the data function
eg Davis (1990, PMID 2241138) "Body image and weight preoccupation: A comparison between exercising and non-exercising women"
(the Davis data is supplied by the carData package; glimpse() comes from dplyr)
View(Davis)
head(Davis)
str(Davis)
glimpse(Davis)
summary(Davis)

Scientific method in biostatistics 1. Defining problems: What type of data is being tested? 2. Assumptions of each model: Is this valid for this data? Why has this particular model been used? 3. Testing hypothesis predictions: If X is true, then what follows? 4. Boundary conditions: When does it fail or not work?

Getting your data into R
Let's learn how to:
* Start RStudio and set up a project
* Prepare and run R script files
* Understand what an R package is and how to install and load one
* Load a simple data file
* Understand how (& why) R uses dataframes & vectors
* Prepare simple tables
* Produce good quality graphs, and save these
* Locate and perform some statistical tests on your data
* Prepare a simple function

Getting your data into R Download -> CSV -> data.frame -> data cleaning At the end of this session, you will be able to: Upload, download, create or import data into R Manipulate large datasets as tables in R Explore datasets using multiple approaches Test for missing, partial or inconsistent data Summarise and compare datasets with R

Data input and output
Davis
table(Davis$weight)
table(Davis$height)
data.frame(Davis$height,Davis$weight) # create a dataframe from heights and weights only
data.frame(Davis$height,Davis$weight)[1:5,] # look at dataframe for heights and weights - samples 1-5

Dplyr -> Data cleaning + EDA => manipulating data:
library(dplyr) # provides %>%, filter, arrange, select, mutate, group_by, summarise
Davis %>%
  filter(height<quantile(height,0.5)) %>% # subset rows of smallest 50%
  arrange(desc(height)) %>% # sort rows of smallest 50%
  select(sex,weight,repwt) %>% # select columns we want
  mutate(weight_diff=(weight-repwt)) %>% # create new variables, eg "weight_diff"
  group_by(sex) %>%
  summarise(mean=mean(weight)) # summarise details of interest

Dplyr -> Data cleaning + EDA => manipulating data:
# assign a plot of Davis heights vs weights to x, using ggplot2
library(ggplot2)
x <- ggplot(Davis,aes(x=weight, y=height, colour=sex)) +
  geom_jitter() + geom_line() +
  stat_smooth(span=0.5) + # higher spans = smoother
  ggtitle('heights v weights') + xlab('weight') + ylab('height') # labels match the aes() mapping
# save the plot to a file (this way avoids errors)
ggsave(filename='heights-weights-plot1.png', plot=x, dpi=1200)

Dplyr -> Data cleaning + EDA => manipulating data:
# assign a boxplot of Davis heights vs weights to x2, using ggplot2
x2 <- ggplot(Davis,aes(x=weight, y=height, colour=sex)) +
  geom_boxplot() + coord_flip() + facet_wrap(~sex) +
  ggtitle('heights v weights')
# save the plot to a file (NB not all plot types are sensible)
ggsave(filename = 'heights-weights-plot2.png', plot=x2, dpi=1200)

Dplyr -> Data cleaning + EDA => manipulating data:
scale_h <- function(height) { return(height/185200) } # create function to scale heights (in cm) as nautical miles
Davis$height_nm <- scale_h(Davis$height) # assign a new variable with nautical mile heights
Davis$height_nm[1:5] # always check the output