Introduction to R Workshop By Stephen O. Opiyo MCIC-Computational Biology Laboratory Day 1
Outline R workshop outline ( tps.osu.edu/Routline.pdf) Introduction to R (Day 1) Graphics (data visualization) in R (Day 2) Basic Statistics using R (Day 2)
Getting Started Assume that R for Windows and Macs already installed on your laptop. (Instructions for installations given at the webpage below) Open R on your Windows or Mac
R on Windows
R on MACs
History of R Idea of R came from S developed at Bell Labs since 1976 S intended to support research and data analysis projects S to S-Plus licensed to Insightful (“S-Plus”) R: Open source platform similar to S developed by Robert Gentleman and Ross Ihaka (U of Auckland, NZ) during the 1990s. Since 1997: international “R-core” developing team Updated versions available every two months
What is R? Data handling and storage: numeric, textual Matrix algebra Tables and regular expressions Graphics Data analysis
R is not –a database –a collection of “black boxes” –a spreadsheet software package –commercially supported
Useful reading materials R for Beginners An Introduction to R” by Longhow Lam Practical Regression and Anova using R An R companion to ‘Experimental Design The R Guide R for Biologists
Useful reading materials Multilevel Modeling in R R reference cards R reference card data mining RStudio - Documentation
Useful books Learning Rstudio for R Statistical Computing by Mark Van Der Loo, Edwin De Jonge Paperback, 126 pages Published December 25th 2012 by Packt Publishing ISBN (ISBN13: ) Getting Started with RStudio By: John Verzani Publisher: O'Reilly Media, Inc. Pub. Date: September 22, 2011 Print ISBN-13: R Graphics Cookbook by Winston Chang (Jan 3, 2013) R For Dummies by Meys, Joris, de Vries The R Book by Michael J. Crawley
RStudio
RStudio is a free open source integrated development environment for R ( Assume RStudio for Windows and Macs already installed on your laptop. (Instructions for installations given at the webpage below)
Using RStudio Script editor View help, plots & files; manage packages View variables in workspace and history file R console
Open RStudio and set up a working directory
Datasets and R scripts Save all the files to R_workshop directory - Day 1 R files (D1_Example_1.R to D1_Example_12.R ) = 12 files - Day 1 data files (D1_Data_1.csv or txt to D1_Data_2.csv or txt) = 4 files - Day 2 R files (D2_Example_1.R to D2_Example_3.R ) = 3 files - Day 2 data files (D1_Data_1.csv to D1_Data_3.csv) = 3 files
R: Session management
R: session management You can enter a command at the command prompt in a console (>). To quit R, use >q(). Result is stored in an object using the assignment operator: (<-) or the equal character (=). Test<-2 and Test=2 Test is an object with a value of 2 To print (show) the object just enter the name of the object Test
Naming object in R Object names cannot contain `strange' symbols like !, +, -, #. A dot (.) and an underscore (_) are allowed, also a name starting with a dot. Object names can contain a number but cannot start with a number.(E.g., Example_1, not 1Example_1) R is case sensitive, X and x are two different objects, as well as temp.1 and temP.1
R_Example_1 Example1<-c(1,2,3,4,5) Example1 1Example<-c(1,2,3,4,5) Example2<-c("Second", "Example") Example2 2Example<-c("Second", "Example") Temp.1<-c("Second", "Example") Temp.1
R: session management Your R objects are stored in a workspace Workspace is not saved on disk unless you tell R to do so. Objects are lost when you close R and not save the objects. R session are saved in a file.RData.
R: session management Just checking what the current working directory type getwd() To list the objects in your workspace: ls() To remove objects you no longer need: rm() To remove ALL objects in your workspace: rm(list=ls()) or use Remove all objects To save your workspace to a file, you may type save.image() or use Save Workspace… in the File menu The default workspace file is called.RData
Basic data types
Basic data types: Logical, numeric, and character, etc.
Basic data type (Example 2) Logical (True or False) Numerical (relating to numbers) Characters (letters, numerical digits, common punctuation marks (such as "." or "-”), etc.
Vectors and Matrices A vector –Ordered collection of data of the same type –Example: last names of all students in this workshop –In R, single number is a vector of length 1 A matrix –Rectangular table of data of the same type –Example: Mean intensities of all genes measured during a microarray experiment
Vectors (Example 3) Vector: Ordered collection of data of the same data type > x <- c(5.2, 1.7, 6.3) > log(x) [1] > y <- 1:5 > z <- seq(1, 1.4, by = 0.1) > yz<-y + z [1] > length(y) [1] 5 > M<-mean(y + z) [1] 4.2
Operation on vector elements (Example 4) Mydata <- c(2,3.5,-0.2) Vector (c=“concatenate”) > Mydata [1] > x4 0 [1] TRUE TRUE FALSE > x5 0] [1] > x6<-Mydata[-c(1,3)] [1] 3.5 Test on the elements Extract the positive elements Remove elements 1 and 3
Operation on vector elements (Example 4) > Colors <- c("Red","Green","Red") Character vector > Colors[2] [1] "Green" > x1 <- 25:30 > x1 [1] Number sequences > x2<-x1[3:5] [1] Various elements 3 to 5 >x3<-x1[c(2,6)]Elements 2 and 6
> x <- c(5,-2,3,-7) > y <- c(1,2,3,4)*10Operation on all the elements > y [1] > sort(x)Sorting a vector in ascending order [1] > sort(x, decreasing=TRUE) Sort in descending order [1] > order(x) [1] Element order for sorting > y[order(x)] [1] Operation on all the components > rev(x)Reverse a vector [1] Operation on vector elements (Example 5)
Matrices (Example 6) Matrix: Rectangular table of data of the same type > m <- matrix(1:12) Create matrix using the “matrix function” m [,1] [1,] 1 [2,] 2 [3,] 3 [4,] 4 [5,] 5 [6,] 6 [7,] 7 [8,] 8 [9,] 9 [10,] 10 [11,] 11 [12,] 12 V<-1:12 V1<-c(1,2,3,4,5,6,7,8,9,10,11,12)
Matrices (Example 6) Matrix: Rectangular table of data of the same type > mr <- matrix(1:12, 4) four rows mr [,1] [,2] [,3] [1,] [2,] [3,] [4,]
Matrices (Example 6) Matrix: Rectangular table of data of the same type > mm <- matrix(1:12, 4, byrow = T); m By row creation [,1] [,2] [,3] [1,] [2,] [3,] [4,] > tmm<-t(mm) t is transpose [,1] [,2] [,3] [,4] [1,] [2,] [3,] > dim(mm) dim is dimension [1] 4 3four rows and three columns > dim(t(mm)) [1] 3 4 three rows and four columns
Matrices (Example 6) Matrix: Rectangular table of data of the same type > x <- c(3,-1,2,0,-3,6) > x.mat <- matrix(x,ncol=2) Matrix with 2 cols > x.mat [,1] [,2] [1,] 3 0 [2,] [3,] 2 6 x.matr <- matrix(x,ncol=2, byrow=T)By row creation > x.matr [,1] [,2] [1,] 3 -1 [2,] 2 0 [3,] -3 6
0peration on matrices (Example 6) > x.matr[,2] 2 nd col [1] > x.matr[c(1,3),]1 st and 3 rd lines [,1] [,2] [1,] 3 -1 [2,] -3 6 > x.mat[-2,]remove second row (No 2 nd line) [,1] [,2] [1,] 3 -1 [2,] -3 6
R as a calculator (Example 7) In R, plus (+), minus (-), division(/), and multiplication (*) > 5 + (6 + 7) * pi^2 [1] > log(exp(1)) [1] 1 > sin(pi/3)^2 + cos(pi/3)^2 [1] 1 > Sin(pi/3)^2 + cos(pi/3)^2 Error: couldn't find function "Sin” Take few minutes to do any calculations
Lists and data frames
Lists A list is an ordered collection of data of arbitrary types. Lists don’t need to be of the same mode or type and they can be a numeric vector, a logical value and a function and so on. A component of a list can be referred as aa[[I]] or aa$x, where aa is the name of the list and x is a name of a component of aa.
Lists (Example 8) aa[[1]] is the first component of aa, while aa[1] is the sublist consisting of the first component of aa only. list: an ordered collection of data of arbitrary types. > doe = list(name="john", age=28, married=F) doe[[1]] [1] "john" > doe[1] $name [1] "john” > doe$name [1] "john“ > doe$age [1] 28 Typically, vector elements are accessed by their index (an integer), list elements by their name (a character string). But both types support both access methods.
Lists are very flexible (Example 8) > my.list <- list(c(5,4,-1),c("X1","X2","X3")) > my.list [[1]]: [1] [[2]]: [1] "X1" "X2" "X3" > my.list[[1]] [1] > my.list1 <- list(c1=c(5,4,-1),c2=c("X1","X2","X3")) > my.list1$c2[2:3] [1] "X2" "X3"
Lists: Session (Example 8) Empl <- list(employee=“Anna”, spouse=“Fred”, children=3, child.ages=c(4,7,9)) Empl[[4]] [1] Empl$child.ages [1] Empl[4] # a sublist consisting of the 4 th component of Empl $child.ages [1] #letters is a function for a to z #toupper(letters) for capital A to Z #LETTERS for capital A to Z names(Empl) <- letters[1:4] names(Empl) [1] "a" "b" "c" "d" Empl1 <- c(Empl, service=8) $a [1] "Anna" $b [1] "Fred" $c [1] 3 $d [1] $service [1] 8 unl<-unlist(Empl) # converts it to a vector. Mixed types will be converted to character, giving a character vector. unl a b c d1 d2 d3 service "Anna" "Fred" "3" "4" "7" "9" "8"
More lists (Example 8) > x <- c(3,-1,2,0,-3,6) > x.mat <- matrix(x,ncol=2) Matrix with 2 cols > x.mat [,1] [,2] [1,] 3 -1 [2,] 2 0 [3,] -3 6 > dimnames(x.mat) <- list(c("L1","L2","L3"), c("R1","R2")) > x.mat R1 R2 L L2 2 0 L3 -3 6
Data frame Data frame: Rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types. Example of a data frame with 10 rows and 3 columns
Importing and exporting data in R
Importing Data (Exercise 9) The easiest way to enter data in R is to work with a text file, in which the columns are separated by tabs; or comma-separated values (csv) files. Example of importing data are provided below. mydata <- read.table("D1_Data_1.csv", header=TRUE) mydata_correct <- read.table("D1_Data_1.csv", sep=”, ", header=TRUE) mydata <- read.csv("D1_Data_1.csv", header=TRUE) mydatab<- read.table("D1_Data_1.txt", header=TRUE) Mydatab_correct<- read.table("D1_Data_1.txt",, sep=”,\t", header=TRUE) mydatab<- read.delim("D1_Data_1.txt", header=TRUE) Import from Web URL Data_URL<-read.table(file= header=TRUE, sep=“,”) Importing data in Rstudio using (Import Dataset) on the Workspace Importing data from Excel, SAS, SPSS see the link below
Keyboard Input (Exercise 10) # create a data frame from scratch age <- c(25, 30, 56, 49, 12, 16, 60, 34, 45, 22) gender <- c("male", "female", "male","male", "female", "male","male", "female", "male","male") weight <- c(160, 110, 220, 100, 65, 120, 179, 134, 165, 153) mydata <- data.frame(age,gender,weight)
Viewing Data (Exercise 10) There are a number of functions for listing the contents of an object or dataset. # list the variables in mydata names(mydata) # list the structure of mydata str(mydata) # dimensions of an object dim(mydata)
Viewing Data (Example 10) # class of an object (numeric, matrix, dataframe, etc) class(mydata) # print mydata mydata # print first 6 rows of mydata head(mydata) # print first 2 rows of mydata head(mydata, n=2) print last 6 rows of mydata tail(mydata) # print last 2 rows of mydata tail(mydata, n=2)
Missing Data (Example 11) In R, missing values are represented by the symbol NA (not available). Impossible values (e.g., dividing by zero) are represented by the symbol NaN (not a number). Testing for Missing Values is.na(x) # returns TRUE of x is missing y <- c(1,2,3,NA) is.na(y) # returns a vector (F F F T) Excluding missing values from analyses Arithmetic functions on missing values yield missing values. x <- c(1,2,NA,3) mean(x) # returns NA mean(x, na.rm=TRUE) # returns 2
Exporting Data (Example 12) To A csv File write.table(mydata, "c: mydata.csv", sep=", ") To To A Tab Delimited Text File write.table(mydata, "c: mydata.csv”,, sep=“\t ") Exporting R objects into other formats. For SPSS, SAS and Stata. you will need to load the foreign packages. foreign For Excel, you will need the xlsReadWrite package.xlsReadWrite
Exercise
Exercise 1 1.Download/import data D1_Data_2.csv and name it Data_1. 2.Get the names of the variables in the Data_1. 3.2)Create a new data from the first four column of Data_1 and named it Data_2. 4.Get the names of the variable of Data_2. 5.What is the length of Data_2? 6.Remove the second column from Data_2 and create a new data called Data_3. 7.List the first six rows (lines) of Data_3. 8.List the last six rows of Data_3. 9.List the first 4 rows of Data_3. 10.Sort the third column of Data_3 and called it Data_4.