Published by Joy Tamsyn Kelly. Modified over 8 years ago.
1
1 Introduction to Perl / R 26 – 27 April, 2010 By Fabian Glaser and Michael Shmoish Bioinformatics Knowledge Unit, The Lorry I. Lokey Interdisciplinary Center for Life Sciences and Engineering Technion - Israel Institute of Technology
2
2 Introduction to R: A Language and Environment for Statistical Computing, Graphics & Bioinformatics. Michael Shmoish (mshmoish@cs.technion.ac.il), Bioinformatics Knowledge Unit, The Lorry I. Lokey Interdisciplinary Center for Life Sciences and Engineering, Technion - IIT
3
3 We'll learn 1. What is R? 2. Data input/output, basic syntax 3. Data types: vectors, matrices/data frames, lists 4. How to manipulate data; vectorized calculations 5. Graphics in R: plot, barplot, heatmap 6. Statistics in R: regression, histograms, some classical tests 7. Packages in R
4
4 What is R? Software for Statistical Data Analysis and Visualization. Based on S (Bell Labs, since 1976). Programming Environment. Interpreted Language. Object Oriented. Data Storage, Analysis, Visualization. Free and Open Source Software.
5
5 R vs. Perl Interactive Graphics Statistics Higher level language Bioinformatics-oriented Manipulating data Easier to grasp Very general
6
6 R vs. Matlab
R: Data frames, lists, vectors; Strong statistics; Bioinformatics-oriented; Free open source; Great community support
Matlab: "Everything is matrix"; Great graphics; Very general; Expensive; Professional support
7
7 What R does and does not
Does: data handling and storage (numeric, textual, regular expressions); matrix algebra; high-level data analytic and statistical functions; classes ("OO"); graphics; programming language (loops, branching, subroutines)
Does not: is not a database, but connects to DBMSs; has no graphical user interfaces, but connects to Java, TclTk; the language interpreter can be very slow, but allows calling your own C/C++ code
8
8 Why R ? Strengths Free and Open Source Strong User Community Highly flexible Implementation of cutting-edge statistical/bioinformatics methods Good graphics and intelligent defaults Continuous improvement Weakness Steep learning curve Slow for large datasets – improving!
9
9 Example: seqinr R-package >proteinInFastaFormat MPRLFSYLLGVWLLLSQLPREIPGQSTNDFIKACGRELVRLWVEICGSVSWGRTAL SLEEPQLETGPPAETMPSSITKDAEILKMMLEFVPNLPQELKATLSERQPSLREL QQSASKDSNLNFEEFKKIILNRQNEAEDKSLLELKNLGLDKHSRKKRLFRMTLS EKCCQVGCIRKDIARLC
10
10 Example: seqinr R-package (cont.)
> seqAA <- read.fasta(file = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype = "AA")
The function AAstat of package 'seqinr' returns simple protein sequence information, including the number of residues, the percentage of physico-chemical classes and the theoretical isoelectric point:
> AAstat(seqAA[[1]])
$Compo
 A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
 8  6  6 18  6  8  1  9 14 29  5  7 10  9 13 16  7  6  3  1
...
$Prop$Aliphatic
[1] 0.2404372
$Prop$Aromatic
[1] 0.06010929
...
$Pi
[1] 8.534902
11
11 Example: seqinr R-package(cont.)
12
12 R Project home page
13
13 Basic R resources
Getting R: http://cran.r-project.org/ In the "Download and Install R" section choose "Windows" and on the next screen choose "base", or go directly to http://cran.r-project.org/bin/windows/base/ and click on "Download R 2.11.0 for Windows". Then a regular installation procedure should apply (sometimes it is required to choose a mirror site).
Editors: Tinn-R - http://www.sciviews.org/Tinn-R/ Eclipse - http://www.eclipse.org/
14
14
15
15 Installation
For your home/lab installation:
Choose the "Save" option and download the file to a directory in which you can write data in this room
Go to this directory, double-click the downloaded exe file and proceed with the wizard
After clicking on "Finish" you should get a shortcut to the Rgui.exe file on your desktop
Right-click on this shortcut, choose 'Properties', then in the 'Target' field add the --internet2 flag after the closing quote to get something like: "C:\Program Files\R\R-2.6.1\bin\Rgui.exe" --internet2 and click OK (this should enable internet download functions; if it doesn't work a proxy needs to be set up, see 'R for Windows FAQ')
16
16 Opening R In Windows, double-click on the desktop icon or on an .RData file
17
17 Opening R – the Console
18
18 R Working Area This is the area where all commands are issued, and non-graphical outputs observed when run interactively
19
19 RGui Under File Change dir Load Workspace Save Workspace Load History Save History Under Edit GUI Preferences Under Help
20
20 Simple R-session
'Change dir' to the right one (to be able to write)
> a <- c(1.1, 2.3, 5.4)
> length(a)
> b = seq(1,11,2); d = 1:10
> rT = rep(1:5, times = 3); rEach <- rep(1:5, each = 3)
> max(a); min(a)
> ls() # list objects in your working environment
> mean(d)
> summary(b)
> q(save = "yes")
21
21 In an R Session usually we: First, read data from other sources Use packages, libraries, and functions Write functions wherever necessary Conduct statistical data analysis and produce plots Save outputs to files, write tables Save R workspace if necessary
22
22 R documentation/manuals/books
http://www.r-project.org/ (under Documentation on the left; Wiki?)
http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
"R Programming" by Stephen Eglen
"Bioinformatics and Computational Biology Solutions Using R and Bioconductor", by Robert Gentleman, Vincent J. Carey, et al.
23
23 Getting help in R
> example("function.name")
> apropos("keyword")
> help.start()
> ?mean, ?":" (short-cut for help(":"))
> demo(graphics)
Task Views on the CRAN page
Package 'sos'
24
24 Data creation and input/output
Create
  Vectors: c(1,2,3)
  Matrices: matrix(c(1,2,3,4,5,6), 2, 3, byrow=T) (2 = number of rows)
  Lists: myList = list(num = 1:5, let = letters[1:5])
Read data
  from file: read.delim, read.csv, readLines, scan
  from web (later)
Write/save data
  write.csv, write.table, save
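A minimal sketch of the create / write / read cycle listed on this slide. The object names `df`, `df2` and `tmp` and the use of a temporary file are assumptions made for illustration; they are not part of the original slides.

```r
# Create a vector, a matrix filled row by row, and a list
v <- c(1, 2, 3)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, ncol = 3, byrow = TRUE)
myList <- list(num = 1:5, let = letters[1:5])

# Build a small data frame from the list (illustrative only)
df <- data.frame(num = myList$num, let = myList$let,
                 stringsAsFactors = FALSE)

# Write it to a temporary CSV file and read it back
tmp <- tempfile(fileext = ".csv")        # hypothetical file location
write.csv(df, tmp, row.names = FALSE)    # row.names = FALSE avoids an extra column
df2 <- read.csv(tmp, stringsAsFactors = FALSE)
print(df2)
unlink(tmp)                              # remove the temporary file
```

Writing with `row.names = FALSE` keeps the file identical in shape to the data frame, which makes the round trip easy to compare.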
25
25 Variables
> a = 49    # numeric
> sqrt(a)
[1] 7
> sentence <- "The dog ate my homework"    # character string
> sub("dog","cat", sentence)
[1] "The cat ate my homework"
> sentence
?
> b = (1 + 1 == 3)    # logical
> b
[1] FALSE
> flag <- (1:5 == 3)
> flag
[1] FALSE FALSE TRUE FALSE FALSE
26
26 Variables
> a = 49    # numeric
> sqrt(a)
[1] 7
> sentence <- "The dog ate my homework"    # character string
> sub("dog","cat", sentence)
[1] "The cat ate my homework"
> sentence
[1] "The dog ate my homework"
> b <- (1 + 1 == 3)    # logical
> b
[1] FALSE
> flag = (1:5 == 3)
> flag
[1] FALSE FALSE TRUE FALSE FALSE
27
27 Missing values
Variables of each data type (numeric, character, logical) can also take the value NA: not available. NA is not the same as 0, "" or FALSE. Operations (calculations, comparisons) that involve NA may or may not produce NA:
> NA==1
[1] NA
> 1+NA
[1] NA
> max(c(NA, 4, 7))
[1] NA
> max(c(NA, 4, 7), na.rm=T)
[1] 7
> NA | TRUE
[1] TRUE
> NA & TRUE
[1] NA
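Because a comparison with NA returns NA rather than TRUE or FALSE, missing values are usually detected with `is.na()`. A small sketch, with a made-up vector `x`:

```r
x <- c(NA, 4, 7)                 # illustrative data with one missing value

missing   <- is.na(x)            # TRUE FALSE FALSE
compared  <- (x == NA)           # NA NA NA: '==' cannot be used to find NA
n_missing <- sum(is.na(x))       # number of missing values
avg       <- mean(x, na.rm = TRUE)  # mean of the non-missing values
present   <- x[!is.na(x)]        # keep only the non-missing values

print(missing)
print(compared)
print(n_missing)
print(avg)
print(present)
```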
28
28 Other special values > pi / 0 [1] ? Inf, -Inf > 0/0 [1] NaN Inf and -Inf are positive and negative infinity whereas NaN means ‘Not a Number’
29
29 Functions
Functions do things with data
"Input": function arguments (0 for no arguments, or 1, or 2, etc.)
"Output": function result (exactly one)
There are many basic R functions (which make R a very proficient calculator), but anybody can easily add new ones. Example:
memutza = function(a,b) {
  result = (a+b)/2
  return(result)
}
> memutza(4,8)
[1] 6
30
30 Functions and operators
Functions do things with data
"Input": function arguments (0 for no arguments, or 1, or 2, etc.)
"Output": function result (exactly one)
Exceptions to the rule:
Functions may also use data that sits around in other places, not just in their argument list: "scoping rules"
Functions may also do other things than returning a result, e.g. plot something on the screen: "side effects"
Operators: short-cuts for frequently used functions of one or two arguments: + - * / %% (arithmetic); ! & | (logical); <- (assignment)
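The two exceptions can be sketched as follows. The names `offset`, `add_offset` and `noisy_mean` are hypothetical and chosen only for this illustration.

```r
offset <- 10                       # defined outside any function

# Scoping: 'offset' is not an argument, so R looks it up in the
# environment where the function was defined
add_offset <- function(x) {
  x + offset
}

# Side effect: the function prints a message in addition to
# returning a value
noisy_mean <- function(x) {
  cat("computing the mean of", length(x), "values\n")
  mean(x)                          # value of the last expression is returned
}

shifted <- add_offset(1:3)
m <- noisy_mean(c(4, 8))
print(shifted)
print(m)
```

Relying on variables from outside the argument list is convenient in interactive work, but passing them as arguments usually makes a function easier to reuse.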
31
31 Some built-in R-functions
'load', 'save'
'ls()' vs. 'dir()'
'c', 'rep', 'seq', 'rm'
'log10', 'log2', 'sin', etc.
'sum', 'mean', 'var'
'print', 'plot', 'summary'
'readLines', 'read.table', 'read.csv', 'read.delim' and read.delim("clipboard")
'write.table', 'write.csv'
q(save = "yes")
Google "RQuickReference" and "R-refcard" (PDF) for many more functions
32
32 Exercise 1 a) Open an Excel sheet and fill in: Name, Faculty, Lab, Phone (last 4 digits), Age (when available) b) Save the file as .csv c) Read it into R using read.csv d) Print the first two rows to the screen e) Print the first 3 columns only f) Write your R object to a .csv file and compare it with the original one
33
33 R as a calculator > log2(32) [1] 5 > 5^5 [1] 3125 > 2^(5:1) [1] 32 16 8 4 2 # simple example of vectorized # calculations > sqrt(2) [1] 1.414214 > sum(1:10) [1] 55 > plot(sin(seq(0, 2*pi, length=100)))
34
34 When things go wrong
Syntax errors (typing mistakes)
Logical errors (harder to find!)
Common problems:
1) A missing closing bracket leads to a continuation line:
> x <- (1 + (2 * 3)
+
Hit Ctrl-C (see below) or keep typing!
2) Too many brackets: 2 + (2*3))
When things seem to take too long, try C-c [Ctrl and C, together] or the red 'STOP' button
35
35 Exercise 2 a) Print the first 10 powers of 3 b) Predict the result of pi:6. And of 6:pi? c) Calculate the sum 1*1+2*2+…+20*20 d) Predict the result of log10(c(1, 10, 10^(1:10))) e) Plot the first 100 integers
36
36 Data structures in R
                 Linear     Rectangular
All same type    VECTORS    MATRIX
Mixed            LIST       DATA FRAME
37
37 Vectors
Vector: an ordered collection of data of the same type
> a = c(1,2,3,4,5) # same as 1:5, same as seq(1,5)
> a*2
[1] 2 4 6 8 10
Example: the mean spot intensities of all 15488 spots on a chip: a vector of 15488 numbers
In R, a single number is the special case of a vector with 1 element.
Besides numbers, other vector types: character strings, logical
> sentence <- "The dog ate my homework"
> nchar(sentence)
[1] 23
> length(sentence)
[1] ?
38
38 Variables
> a = 49    # numeric
> sqrt(a)
[1] 7
> sentence <- "The dog ate my homework"    # character string
> sub("dog","cat", sentence)
[1] "The cat ate my homework"
> sentence
?
> b = (1 + 1 == 3)    # logical
> b
[1] FALSE
> flag <- (1:5 == 3)    # logical vector
> flag
[1] FALSE FALSE TRUE FALSE FALSE
39
39 Vectors and matrices
Matrix: a rectangular table of data of the same type
Example: the expression values for 10,000 genes for 30 tissue biopsies: a matrix with 10,000 rows and 30 columns
> mdat <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol=3, byrow=TRUE)
> mdat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13
> mdat2 <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol=3)
> mdat2
?
40
40 Vectors and matrices
Matrix: a rectangular table of data of the same type
Example: the expression values for 10,000 genes for 30 tissue biopsies: a matrix with 10,000 rows and 30 columns
How to make a matrix from a vector?
> mdat <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol=3, byrow=TRUE)
> mdat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13
> mdat2 <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol=3)
> mdat2
     [,1] [,2] [,3]
[1,]    1    3   12
[2,]    2   11   13
41
41 Lists
Vector: an ordered collection of data of the same type.
> a = c(7,5,1)
> a[2]
[1] 5
List: an ordered collection of data of arbitrary types.
> Smith = list(name="John", married=TRUE, kids = c("Ann", "Pete"))
> Smith$name
[1] "John"
> Smith$kids[2]
[1] "Pete"
Typically, vector elements are accessed by their index (an integer), list elements by their name (a character string). But both types support both access methods.
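A short sketch of the last point: vectors can be accessed by name and lists by position. The element names `first`, `second` and `third` are invented for the example.

```r
# A named vector: elements can be reached by index or by name
a <- c(first = 7, second = 5, third = 1)
by_index <- a[2]
by_name  <- a["second"]

# A list: elements can be reached by name or by position
Smith <- list(name = "John", married = TRUE, kids = c("Ann", "Pete"))
kids_dollar   <- Smith$kids        # by name with $
kids_string   <- Smith[["kids"]]   # by name given as a string
kids_position <- Smith[[3]]        # by position

# Single brackets on a list return a (shorter) list, not the element itself
sub_list <- Smith[3]

print(by_name)
print(kids_position)
print(class(sub_list))
```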
42
42 Data frames
Data frame: is supposed to represent the typical data table that researchers come up with, like a spreadsheet. It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.
Example:
> a
      localization tumor.size progress
ID348     proximal        6.3        0
ID234       distal        8.0        1
ID987     proximal       10.0        0
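The slides show the data frame `a` but not how it was created. One possible way to build the same table by hand is sketched below; in practice such a table would more likely be read from a file.

```r
# Assumption: the table is typed in directly; the column types are
# chosen to match the printed example (text, number, number)
a <- data.frame(
  localization = c("proximal", "distal", "proximal"),
  tumor.size   = c(6.3, 8.0, 10.0),
  progress     = c(0, 1, 0),
  row.names    = c("ID348", "ID234", "ID987"),
  stringsAsFactors = FALSE
)

print(a)
str(a)                 # compact overview: one line per column
col_types <- sapply(a, class)
print(col_types)       # each column has its own type
```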
43
43 Data structures in R
                 Linear     Rectangular
All same type    VECTOR     MATRIX
Mixed            LIST       DATA FRAME
44
44 Accessing data
how many elements? length(x)
i-th element: x[2] (i = 2)
all but the i-th element: x[-2] (i = 2)
first k elements: x[1:5] (k = 5)
last k elements: x[(length(x)-4):length(x)] (k = 5)
specific elements: x[c(1,3,5)] (first, 3rd and 5th)
all greater than some value: x[x>3] (the value is 3)
bigger than or less than some values: x[x < -2 | x > 2]
which indices are largest: which(x == max(x))
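The access patterns on this slide can be tried on a small vector. The data in `x` and the value `k = 3` are made up for the example; note that the last k elements run from index `length(x) - k + 1` to `length(x)`.

```r
x <- c(5, -3, 8, 1, 0, 9, -7, 2)   # illustrative data
k <- 3

n        <- length(x)              # how many elements
second   <- x[2]                   # i-th element (i = 2)
no2      <- x[-2]                  # all but the i-th element
first_k  <- x[1:k]                 # first k elements
last_k   <- x[(length(x) - k + 1):length(x)]   # last k elements
last_k2  <- tail(x, k)             # same result with a built-in helper
picked   <- x[c(1, 3, 5)]          # first, 3rd and 5th
big      <- x[x > 3]               # all greater than 3
extreme  <- x[x < -2 | x > 2]      # less than -2 or greater than 2
where    <- which(x == max(x))     # index of the largest value

print(last_k)
print(extreme)
print(where)
```

The space in `x < -2` matters: without it, `x<-2` would be read as an assignment.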
45
45 Accessing Variables Subscripts x[1] identifies first element in vector x y[1,] identifies first row in matrix or dataframe y y[,1] identifies first column in matrix or dataframe y $ sign for lists and data frames myframe$age gets age variable of myframe a$tumor.size Smith$kids
46
46 Data frames
Data frame: is supposed to represent the typical data table that researchers come up with, like a spreadsheet. It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types.
Example:
> a
      localization tumor.size progress
ID348     proximal        6.3        0
ID234       distal        8.0        1
ID987     proximal       10.0        0
47
47 Subsetting
> a[c(1,3),]    # subset rows by a vector of indices
      localization tumor.size progress
ID348     proximal        6.3        0
ID987     proximal       10.0        0
> a[c(T,F,T),]    # subset rows by a logical vector
      localization tumor.size progress
ID348     proximal        6.3        0
ID987     proximal       10.0        0
> a$localization    # subset a column
[1] "proximal" "distal" "proximal"
> a$localization=="proximal"    # comparison resulting in a logical vector
[1] TRUE FALSE TRUE
> a[ a$localization=="proximal", ]    # subset the selected rows
      localization tumor.size progress
ID348     proximal        6.3        0
ID987     proximal       10.0        0
48
48 Subsetting (cont.)
Using the subset function: subset() returns a subset of a data frame
Subscripting from data frames: myframe[,1] gives the first column of myframe
Specifying a vector: myframe[1:5,] gives the first 5 rows of data
Using logical expressions: myframe[myframe[,1] < 5,] gets all rows whose value in the first column is less than 5
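A sketch comparing bracket subsetting with `subset()`. The data frame `myframe` and its columns `score` and `group` are invented for the example.

```r
myframe <- data.frame(
  score = c(2, 7, 4, 9, 1, 6),
  group = c("a", "b", "a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

first_col  <- myframe[, 1]                 # first column, as a vector
first_rows <- myframe[1:5, ]               # first 5 rows (note the comma)
low_br     <- myframe[myframe[, 1] < 5, ]  # rows with first column < 5

# subset() expresses the same row filter using column names directly
low_sub <- subset(myframe, score < 5)

# It can also restrict the columns that are returned
low_a <- subset(myframe, score < 5 & group == "a", select = score)

print(low_sub)
print(low_a)
```

Without the comma, `myframe[1:5]` would select columns rather than rows, which for this two-column frame is an error.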
49
49 Specific Tasks
To see which directories and data are loaded, type: search()
To see which objects are stored, type: ls()
To include a dataset in the search path for analysis, type: attach(NameOfTheDataset)
To detach a dataset from the search path after analysis, type: detach(NameOfTheDataset)
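A small sketch of the attach/detach cycle, using a made-up data frame `d`. The `with()` call at the end is an alternative not mentioned on the slide; it gives the same access to columns without changing the search path.

```r
d <- data.frame(age = c(31, 45, 27))   # illustrative dataset

attach(d)                  # columns of d are now visible by name
on_path <- "d" %in% search()
m_attached <- mean(age)    # 'age' is found through the search path
detach(d)                  # remove d from the search path again
off_path <- !("d" %in% search())

# Alternative (an addition to the slide): evaluate an expression
# inside the data frame without attaching it
m_with <- with(d, mean(age))

print(m_attached)
print(m_with)
```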
50
50 Reading data into R
R is not well suited for data preprocessing: preprocess data elsewhere (SPSS, etc.)
Easiest form of data to input: text file
Spreadsheet-like data:
  Small/medium size: use read.table(), read.delim(), read.csv()
  Large data: use readLines(), scan(), read.fasta() (of the 'seqinr' package)
Read from other systems: use the library "foreign": library(foreign)
  Can import from SAS, SPSS, Epi Info
  Can export to STATA
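To contrast the reading functions named on this slide, the sketch below writes a tiny tab-delimited file and reads it three ways. The file contents and the names `tab`, `lines` and `vals` are assumptions for illustration only.

```r
# Create a small tab-delimited file to read (hypothetical content)
tmp <- tempfile(fileext = ".txt")
writeLines(c("gene\tvalue", "g1\t1.5", "g2\t2.5"), tmp)

# read.delim: returns a data frame, header line used for column names
tab <- read.delim(tmp, stringsAsFactors = FALSE)

# readLines: returns the raw lines as a character vector
lines <- readLines(tmp)

# scan: returns a list of columns; 'what' describes the column types
vals <- scan(tmp, what = list(gene = "", value = 0),
             sep = "\t", skip = 1, quiet = TRUE)

unlink(tmp)                # remove the temporary file

print(tab)
print(lines)
print(vals$value)
```

`read.delim()` does the most work for you; `readLines()` and `scan()` leave more of the parsing to the user, which is why they tend to be used when files are large or irregular.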