
1 1 Introduction to Perl / R 26 – 27 April, 2010 By Fabian Glaser and Michael Shmoish Bioinformatics Knowledge Unit, The Lorry I. Lokey Interdisciplinary Center for Life Sciences and Engineering Technion - Israel Institute of Technology

2 2 Introduction to R: A Language and Environment for Statistical Computing, Graphics & Bioinformatics
Michael Shmoish (mshmoish@cs.technion.ac.il)
Bioinformatics Knowledge Unit, The Lorry I. Lokey Interdisciplinary Center for Life Sciences and Engineering, Technion - IIT

3 3 We’ll learn
1. What is R?
2. Data input/output, basic syntax
3. Data types: vectors, matrices/data frames, lists
4. How to manipulate data; vectorized calculations
5. Graphics in R: plot, barplot, heatmap
6. Statistics in R: regression, histograms, some classical tests
7. Packages in R

4 4 What is R
• Software for Statistical Data Analysis and Visualization
• Based on S (Bell Labs, since 1976)
• Programming Environment
• Interpreted Language. Object Oriented
• Data Storage, Analysis, Visualization
• Free and Open Source Software

5 5 R vs. Perl
• Interactive
• Graphics
• Statistics
• Higher level language
• Bioinformatics-oriented
• Manipulating data
• Easier to grasp
• Very general

6 6 R vs. Matlab
• Data frames, lists, vectors
• Strong statistics
• Bioinformatics-oriented
• Free open source
• Great community support
• “Everything is matrix”
• Great graphics
• Very general
• Expensive
• Professional support

7 7 What R does and does not do
• data handling and storage: numeric, textual, regular expressions
• matrix algebra
• high-level data analytic and statistical functions
• classes (“OO”)
• graphics
• programming language: loops, branching, subroutines
• is not a database, but connects to DBMSs
• has no graphical user interface of its own, but connects to Java, Tcl/Tk
• the language interpreter can be very slow, but it allows calling your own C/C++ code

8 8 Why R?
Strengths
• Free and Open Source
• Strong User Community
• Highly flexible
• Implementation of cutting-edge statistical/bioinformatics methods
• Good graphics and intelligent defaults
• Continuous improvement
Weaknesses
• Steep learning curve
• Slow for large datasets – improving!

9 9 Example: seqinr R-package >proteinInFastaFormat MPRLFSYLLGVWLLLSQLPREIPGQSTNDFIKACGRELVRLWVEICGSVSWGRTAL SLEEPQLETGPPAETMPSSITKDAEILKMMLEFVPNLPQELKATLSERQPSLREL QQSASKDSNLNFEEFKKIILNRQNEAEDKSLLELKNLGLDKHSRKKRLFRMTLS EKCCQVGCIRKDIARLC

10 10 Example: seqinr R-package (cont.)
> seqAA <- read.fasta(file = system.file("sequences/seqAA.fasta", package = "seqinr"), seqtype = "AA")
The function AAstat of the package ‘seqinr’ returns simple protein sequence information, including the number of residues, the percentage of physico-chemical classes and the theoretical isoelectric point:
> AAstat(seqAA[[1]])
$Compo
 A  C  D  E  F  G  H  I  K  L  M  N  P  Q  R  S  T  V  W  Y
 8  6  6 18  6  8  1  9 14 29  5  7 10  9 13 16  7  6  3  1
…
$Prop$Aliphatic
[1] 0.2404372
$Prop$Aromatic
[1] 0.06010929
...
$Pi
[1] 8.534902
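To make the idea of a residue composition concrete, here is a minimal sketch in base R that does not need the seqinr package. It uses only the first 12 residues of the sequence from the previous slide, and it assumes that the aliphatic class consists of I, L and V, as described in the seqinr documentation. The variable names are illustrative and not part of the original slides.

```r
# First 12 residues of the example protein (a shortened, illustrative input)
peptide <- "MPRLFSYLLGVW"

# Split the string into single-letter residues
residues <- strsplit(peptide, "")[[1]]

# Count each residue, similar in spirit to the $Compo element of AAstat()
compo <- table(residues)
print(compo)

# Fraction of aliphatic residues (assumed here to be I, L, V)
aliphatic_fraction <- mean(residues %in% c("I", "L", "V"))
print(aliphatic_fraction)
```

For real analyses AAstat() is the better choice, since it also reports the other physico-chemical classes and the isoelectric point.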

11 11 Example: seqinr R-package(cont.)

12 12 R Project home page

13 13 Basic R resources
• Getting R: http://cran.r-project.org/ In the “Download and Install R” section choose “Windows”, and on the next screen choose “base”; or go directly to http://cran.r-project.org/bin/windows/base/ and click on “Download R 2.11.0 for Windows”. Then a regular installation procedure should apply (sometimes it is required to choose a mirror site).
• Editors: Tinn-R - http://www.sciviews.org/Tinn-R/ ; Eclipse - http://www.eclipse.org/

15 15 Installation
For your home/lab installation:
• Choose a "save" option and download the file to a directory where you have write permission
• Go to this directory, double-click the downloaded exe file and proceed with the wizard
• After clicking on "Finish" you should get a shortcut to the Rgui.exe file on your desktop
• Right-click on this shortcut, choose 'Properties', then in the 'Target' field add the --internet2 flag to get something like: "C:\Program Files\R\R-2.6.1\bin\Rgui.exe" --internet2 and click OK (this should enable internet download functions; if it doesn't work, a proxy needs to be set up, see 'R for Windows FAQ')

16 16 Opening R In Windows, double-click on the desktop icon or on a .RData file

17 17 Opening R – the Console

18 18 R Working Area This is the area where all commands are issued, and non-graphical outputs observed when run interactively

19 19 RGui
Under File
• Change dir
• Load Workspace
• Save Workspace
• Load History
• Save History
Under Edit
• GUI Preferences
Under Help

20 20 Simple R-session
• ‘Change dir’ to the right one (to be able to write)
> a <- c(1.1, 2.3, 5.4)
> length(a)
> b = seq(1,11,2); d = 1:10
> rT = rep(1:5, times = 3); rEach <- rep(1:5, each = 3)
> max(a); min(a)
> ls() # list objects in your working environment
> mean(d)
> summary(b)
> q(save = "yes")

21 21 In an R Session usually we:  First, read data from other sources  Use packages, libraries, and functions  Write functions wherever necessary  Conduct statistical data analysis and produce plots  Save outputs to files, write tables  Save R workspace if necessary

22 22 R documentation/manuals/books
• http://www.r-project.org/ Under documentation on the left (Wiki?)
• http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf
• “R Programming” by Stephen Eglen
• “Bioinformatics and Computational Biology Solutions Using R and Bioconductor”, by Robert Gentleman, Vincent J. Carey, et al.

23 23 Getting help in R
> example('function.name')
> apropos('keyword')
> help.start()
> ?mean
> ?":" # short-cut for help(":")
> demo(graphics)
• Task Views on the CRAN page
• the ‘sos’ package

24 24 Data creation and input/output
Create
• Vectors: c(1,2,3)
• Matrices: matrix(c(1,2,3,4,5,6),2,3,byrow=T) (2 = number of rows)
• Lists: myList = list(num = 1:5, let = letters[1:5])
Read data
• from file: read.delim, read.csv, readLines, scan
• from web (later)
Write/save data
• write.csv, write.table
• save
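As a small sketch of the write/read cycle listed above, the following writes a data frame to a temporary CSV file, reads it back, and then saves and restores it in R's binary format. The object and column names are made up for illustration, and tempfile() is used only so that the example does not depend on any particular working directory.

```r
# Build a small data frame (column names are illustrative only)
people <- data.frame(name = c("Ann", "Pete"), age = c(31, NA),
                     stringsAsFactors = FALSE)

# Write it to a temporary CSV file, without row names
csv_file <- tempfile(fileext = ".csv")
write.csv(people, file = csv_file, row.names = FALSE)

# Read it back; the NA written by write.csv comes back as a missing value
people2 <- read.csv(csv_file, stringsAsFactors = FALSE)
print(people2)

# save() stores R objects in binary form; load() restores them under the same name
rdata_file <- tempfile(fileext = ".RData")
save(people, file = rdata_file)
rm(people)
load(rdata_file)
print(people)
```

Text formats such as CSV are convenient for exchanging data with spreadsheets, while save() and load() keep R objects exactly as they were.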

25 25 Variables
> a = 49 # numeric
> sqrt(a)
[1] 7
> sentence <- "The dog ate my homework" # character string
> sub("dog","cat", sentence)
[1] "The cat ate my homework"
> sentence
[1] ?
> b = (1 + 1 == 3) # logical
> b
[1] FALSE
> flag <- (1:5 == 3)
> flag
[1] FALSE FALSE TRUE FALSE FALSE

26 26 Variables
> a = 49 # numeric
> sqrt(a)
[1] 7
> sentence <- "The dog ate my homework" # character string
> sub("dog","cat", sentence)
[1] "The cat ate my homework"
> sentence
[1] "The dog ate my homework"
> b <- (1 + 1 == 3) # logical
> b
[1] FALSE
> flag = (1:5 == 3)
> flag
[1] FALSE FALSE TRUE FALSE FALSE

27 27 Missing values
Variables of each data type (numeric, character, logical) can also take the value NA: not available. NA is not the same as 0 or as "" or as FALSE. Operations (calculations, comparisons) that involve NA may or may not produce NA:
> NA==1
[1] NA
> 1+NA
[1] NA
> max(c(NA, 4, 7))
[1] NA
> max(c(NA, 4, 7), na.rm=T)
[1] 7
> NA | TRUE
[1] TRUE
> NA & TRUE
[1] NA
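Because comparisons with NA return NA, missing values are normally detected with is.na() rather than with ==. The following is a minimal sketch of the usual idioms; the variable names are illustrative.

```r
x <- c(NA, 4, 7)

# x == NA is NA everywhere, so it cannot be used to find missing values
eq_result <- x == NA

# is.na() returns a logical vector marking the missing entries
missing_flags <- is.na(x)
n_missing <- sum(missing_flags)

# Many summary functions accept na.rm = TRUE to drop missing values first
mean_value <- mean(x, na.rm = TRUE)

# Keep only the non-missing values
complete_values <- x[!is.na(x)]

print(missing_flags)
print(mean_value)
print(complete_values)
```

NA | TRUE is TRUE because the result is TRUE whatever the unknown value would be, whereas NA & TRUE depends on the unknown value and so stays NA.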

28 28 Other special values
> pi / 0
[1] ? (hint: Inf, -Inf)
> 0/0
[1] NaN
Inf and -Inf are positive and negative infinity, whereas NaN means ‘Not a Number’.

29 29 Functions
Functions do things with data.
“Input”: function arguments (0 – no arguments, or 1, or 2, etc.)
“Output”: function result (exactly one)
There are many basic R functions (which make R a very proficient calculator), but anyone can easily add new ones. Example:
memutza = function(a,b) {
  result = (a+b)/2
  return(result)
}
> memutza(4,8)
[1] 6

30 30 Functions and operators
• Functions do things with data
• “Input”: function arguments (0 – no arguments, or 1, or 2, etc.)
• “Output”: function result (exactly one)
• Exceptions to the rule:
  • Functions may also use data that sits around in other places, not just in their argument list: “scoping rules”
  • Functions may also do other things than returning a result, e.g., plot something on the screen: “side effects”
• Operators: short-cuts for frequently used functions of one or two arguments: + - * / %% (arithmetic); ! & | (logical); <- (assignment)
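The two exceptions and the remark about operators can be illustrated with a short sketch. All names below (scale_factor, scaled_mean, report) are hypothetical and chosen only for this example.

```r
# Scoping: scale_factor is not an argument, so R looks it up in the
# environment where the function was defined (here, the global one)
scale_factor <- 10
scaled_mean <- function(a, b) {
  (a + b) / 2 * scale_factor
}
res <- scaled_mean(4, 8)
print(res)

# Side effect: the function prints a message in addition to returning a value
report <- function(x) {
  cat("The value is", x, "\n")
  invisible(x)  # return x without printing it a second time
}
shown <- report(res)

# Operators are ordinary functions and can be called as such
plus_result <- `+`(2, 3)
remainder <- 7 %% 3    # modulo
quotient <- 7 %/% 3    # integer division
```

Relying on variables outside the argument list is convenient in interactive work, but passing everything as arguments usually makes functions easier to test and reuse.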

31 31 Some built-in R-functions
• ‘load’, ‘save’
• ‘ls()’ vs. ‘dir()’
• ‘c’, ‘rep’, ‘seq’, ‘rm’
• ‘log10’, ‘log2’, ‘sin’, etc.
• ‘sum’, ‘mean’, ‘var’
• ‘print’
• ‘plot’
• ‘summary’
• ‘readLines’
• ‘read.table’, ‘read.csv’
• ‘read.delim’ and read.delim("clipboard")
• ‘write.table’, ‘write.csv’
• q(save = "yes")
Google “RQuickReference” and “R-refcard” (PDF) for many more functions

32 32 Exercise 1
a) Open an Excel sheet and fill in: Name, Faculty, Lab, Phone (last 4 digits), Age (when available)
b) Save the file as .csv
c) Read it into R using read.csv
d) Print the first two rows to the screen
e) Print the first 3 columns only
f) Write your R object to a .csv file and compare it with the original one

33 33 R as a calculator
> log2(32)
[1] 5
> 5^5
[1] 3125
> 2^(5:1) # simple example of vectorized calculations
[1] 32 16 8 4 2
> sqrt(2)
[1] 1.414214
> sum(1:10)
[1] 55
> plot(sin(seq(0, 2*pi, length=100)))

34 34 When things go wrong
• Syntax errors (typing mistakes)
• Logical errors (harder to find!)
• Common problems:
1) a missing close bracket leads to a continuation line:
> x <- (1 + (2 * 3)
+
Hit Ctrl C (below) or keep typing!
2) too many brackets: 2 + (2*3))
When things seem to take too long, try C-c [Ctrl and C, together] or the red ‘STOP’ button

35 35 Exercise 2
a) Print the first 10 powers of 3
b) Predict the results of pi:6 and of 6:pi
c) Calculate the sum 1*1+2*2+…+20*20
d) Predict the result of log10(c(1, 10, 10^(1:10)))
e) Plot the first 100 integers

36 36 Data structures in R
                Linear     Rectangular
All Same Type   VECTORS    MATRIX
Mixed           LIST       DATA FRAME

37 37 Vectors
Vector: an ordered collection of data of the same type
> a = c(1,2,3,4,5) # same as 1:5, same as seq(1,5)
> a*2
[1] 2 4 6 8 10
Example: the mean spot intensities of all 15488 spots on a chip: a vector of 15488 numbers
In R, a single number is the special case of a vector with 1 element.
Besides numbers, other vector types: character strings, logical
> sentence <- "The dog ate my homework"
> nchar(sentence)
[1] 23
> length(sentence)
[1] ?

38 38 Variables
> a = 49 # numeric
> sqrt(a)
[1] 7
> sentence <- "The dog ate my homework" # character string
> sub("dog","cat", sentence)
[1] "The cat ate my homework"
> sentence
[1] ?
> b = (1 + 1 == 3) # logical
> b
[1] FALSE
> flag <- (1:5 == 3) # logical vector
> flag
[1] FALSE FALSE TRUE FALSE FALSE

39 39 Vectors and matrices
Matrix: a rectangular table of data of the same type
Example: the expression values for 10,000 genes for 30 tissue biopsies: a matrix with 10,000 rows and 30 columns
> mdat <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol=3, byrow=TRUE)
> mdat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13
> mdat2 <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol=3)
> mdat2
?

40 40 Vectors and matrices
Matrix: a rectangular table of data of the same type
Example: the expression values for 10,000 genes for 30 tissue biopsies: a matrix with 10,000 rows and 30 columns
How to make a matrix from a vector?
> mdat <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol=3, byrow=TRUE)
> mdat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]   11   12   13
> mdat2 <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol=3)
> mdat2
     [,1] [,2] [,3]
[1,]    1    3   12
[2,]    2   11   13
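The difference between the two results comes from the fill order: by default R fills a matrix column by column, and byrow=TRUE switches to row by row. The sketch below shows this together with a few basic ways of indexing a matrix; the variable names are illustrative.

```r
v <- c(1, 2, 3, 11, 12, 13)

# Fill row by row: the first three values form the first row
m <- matrix(v, nrow = 2, byrow = TRUE)
m_dim <- dim(m)          # number of rows and columns

# Index by [row, column]; leaving one index empty selects everything
one_cell <- m[2, 3]      # element in row 2, column 3
second_col <- m[, 2]     # whole second column
m_transposed <- t(m)     # 3 x 2 transpose

# Assigning dim() reshapes a vector in place, filling column by column,
# which reproduces the default behaviour of matrix()
dim(v) <- c(2, 3)
print(m)
print(v)
```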

41 41 Lists
Vector: an ordered collection of data of the same type.
> a = c(7,5,1)
> a[2]
[1] 5
List: an ordered collection of data of arbitrary types.
> Smith = list(name="John", married=TRUE, kids = c("Ann", "Pete"))
> Smith$name
[1] "John"
> Smith$kids[2]
[1] "Pete"
Typically, vector elements are accessed by their index (an integer), list elements by their name (a character string). But both types support both access methods.
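The last remark, that both types support both access methods, can be sketched as follows. The named vector is an illustrative variation of the vector a from the slide.

```r
Smith <- list(name = "John", married = TRUE, kids = c("Ann", "Pete"))

# List elements by position: [[ ]] extracts the element itself
first_element <- Smith[[1]]
# List elements by name, equivalent to Smith$kids
kids <- Smith[["kids"]]
# Single brackets keep the list wrapper: the result is a list of length 1
name_sublist <- Smith["name"]

# Vector elements can carry names and be accessed by them
a <- c(first = 7, second = 5, third = 1)
by_name <- a["second"]
by_index <- a[2]

print(first_element)
print(name_sublist)
print(by_name)
```

The distinction between [ ] and [[ ]] matters mostly for lists: the former returns a sub-list, the latter the stored object.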

42 42 Data frames
Data frame: is supposed to represent the typical data table that researchers come up with – like a spreadsheet. It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types. Example:
> a
      localization tumor.size progress
ID348     proximal        6.3        0
ID234       distal        8.0        1
ID987     proximal       10.0        0

43 43 Data structures in R
                Linear     Rectangular
All Same Type   VECTOR     MATRIX
Mixed           LIST       DATA FRAME

44 44 Accessing data
• how many elements? length(x)
• i-th element: x[2] (i = 2)
• all but i-th element: x[-2] (i = 2)
• first k elements: x[1:5] (k = 5)
• last k elements: x[(length(x)-4):length(x)] (k = 5)
• specific elements: x[c(1,3,5)] (first, 3rd and 5th)
• all greater than some value: x[x>3] (the value is 3)
• bigger than or less than some values: x[x < -2 | x > 2]
• which indices are largest: which(x == max(x))
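A worked example may help to check these indexing idioms by hand. The vector below is made up for illustration, and head() and tail() are shown as a less error-prone alternative to computing index ranges from length(x).

```r
x <- c(5, -3, 8, 1, 8, -7, 2)

n <- length(x)                    # how many elements
without_second <- x[-2]           # all but the 2nd element
first_three <- head(x, 3)         # same as x[1:3]
last_three <- tail(x, 3)          # same as x[(length(x)-2):length(x)]
picked <- x[c(1, 3, 5)]           # 1st, 3rd and 5th elements
above_three <- x[x > 3]           # all values greater than 3
outside <- x[x < -2 | x > 2]      # values below -2 or above 2
largest_at <- which(x == max(x))  # indices of the maximum (ties included)

print(last_three)
print(outside)
print(largest_at)
```

Note that which(x == max(x)) returns every position of the maximum, whereas which.max(x) would return only the first one.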

45 45 Accessing Variables
Subscripts
• x[1] identifies the first element in vector x
• y[1,] identifies the first row in matrix or data frame y
• y[,1] identifies the first column in matrix or data frame y
$ sign for lists and data frames
• myframe$age gets the age variable of myframe
• a$tumor.size
• Smith$kids

46 46 Data frames
Data frame: is supposed to represent the typical data table that researchers come up with – like a spreadsheet. It is a rectangular table with rows and columns; data within each column has the same type (e.g. number, text, logical), but different columns may have different types. Example:
> a
      localization tumor.size progress
ID348     proximal        6.3        0
ID234       distal        8.0        1
ID987     proximal       10.0        0

47 47 Subsetting
> a[c(1,3),] # subset rows by a vector of indices
      localization tumor.size progress
ID348     proximal        6.3        0
ID987     proximal       10.0        0
> a[c(T,F,T),] # subset rows by a logical vector
      localization tumor.size progress
ID348     proximal        6.3        0
ID987     proximal       10.0        0
> a$localization # subset a column
[1] "proximal" "distal" "proximal"
> a$localization=="proximal" # comparison resulting in a logical vector
[1] TRUE FALSE TRUE
> a[ a$localization=="proximal", ] # subset the selected rows
      localization tumor.size progress
ID348     proximal        6.3        0
ID987     proximal       10.0        0

48 48 Subsetting (cont.)
• Using the subset function: subset() will subset the data frame
• Subscripting from data frames: myframe[,1] gives the first column of myframe
• Specifying a vector: myframe[1:5,] gives the first 5 rows of data (without the comma, myframe[1:5] gives the first 5 columns)
• Using logical expressions: myframe[myframe[,1] < 5,] gets all rows in which the first column contains a value less than 5
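The subsetting variants can be tried on a small data frame. The sketch below rebuilds the tumor table from the earlier slides with data.frame(); the construction call is an assumption, since the slides show only the printed table.

```r
# Rebuild the example table (the slides show only its printed form)
a <- data.frame(localization = c("proximal", "distal", "proximal"),
                tumor.size = c(6.3, 8.0, 10.0),
                progress = c(0, 1, 0),
                row.names = c("ID348", "ID234", "ID987"),
                stringsAsFactors = FALSE)

# Rows selected by a logical condition on one column
proximal_rows <- a[a$localization == "proximal", ]

# subset() does the same and can also pick columns by name
large_tumors <- subset(a, tumor.size > 7, select = c(localization, tumor.size))

# With a comma the index selects rows; without it, columns
first_two_rows <- a[1:2, ]
first_two_cols <- a[1:2]

print(proximal_rows)
print(large_tumors)
```

Inside subset() the column names can be used directly, which keeps conditions shorter than the a$column form.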

49 49 Specific Tasks
• To see which packages and datasets are attached (the search path), type: search()
• To see which objects are stored, type: ls()
• To include a dataset in the search path for analysis, type: attach(NameOfTheDataset)
• To detach a dataset from the search path after analysis, type: detach(NameOfTheDataset)
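A minimal sketch of the attach/detach cycle follows, using a made-up data frame named tumors. It also shows with(), a base R function not mentioned on the slide, which evaluates an expression inside a dataset without changing the search path.

```r
# A made-up dataset for illustration
tumors <- data.frame(tumor.size = c(6.3, 8.0, 10.0), progress = c(0, 1, 0))

# After attach(), columns can be referred to by name alone
attach(tumors)
attached_now <- "tumors" %in% search()
m1 <- mean(tumor.size)
detach(tumors)
attached_after <- "tumors" %in% search()

# with() gives the same convenience for a single expression
m2 <- with(tumors, mean(tumor.size))

print(m1)
print(m2)
```

Forgetting to call detach() can leave stale copies of columns on the search path, which is one reason with() is often preferred in scripts.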

50 50 Reading data into R
• R is not well suited for data preprocessing
• Preprocess data elsewhere (SPSS, etc.)
• Easiest form of data to input: text file
• Spreadsheet-like data:
  • Small/medium size: use read.table(), read.delim(), read.csv()
  • Large data: use readLines(), scan(), read.fasta() (of the ‘seqinr’ package)
• Read from other systems:
  • Use the library “foreign”: library(foreign)
  • Can import from SAS, SPSS, Epi Info
  • Can export to STATA
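To show what reading spreadsheet-like text looks like without needing a file on disk, the sketch below parses a small tab-delimited table held in a character string. The text argument of read.delim() and the textConnection() function are standard base R; the data and names are invented for the example.

```r
# A small tab-delimited table held in a string (stands in for a text file)
raw_text <- "id\tlocalization\ttumor.size
ID348\tproximal\t6.3
ID234\tdistal\t8.0"

# read.delim() expects a header line and tab separators by default
d <- read.delim(text = raw_text, stringsAsFactors = FALSE)
print(d)
str(d)

# readLines() returns the raw lines instead, one string per line,
# which is the usual starting point for large or irregular files
con <- textConnection(raw_text)
raw_lines <- readLines(con)
close(con)
print(length(raw_lines))
```

With a real file, the same calls take a file name instead, for example read.delim("mydata.txt"), where the file name is hypothetical.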

