To err is human – to R is divine R from step 1 for the experimental biologist with an eye on the tomoRRow! Schraga Schwartz, Bioinformatic Workshop, February 2010

Outline Why R How R iRis Down syndRome WheRe R

Why ?

R from step 1 for the experimental biologist with an eye on the tomoRRow! R programming language is a lot like magic... except instead of spells you have functions.

= muggle SPSS and Excel users are like muggles. They are limited in their ability to change their environment. The way they approach a problem is constrained by how programmers employed by SPSS or Microsoft thought it should be approached. And they have to pay money to use this constraining software.

= wizard R users are like wizards. They can rely on functions (spells) that have been developed for them by statistical researchers, but they can also create their own. They don’t have to pay for the use of them, and once experienced enough (like Dumbledore), they are almost unlimited in their ability to change their environment.

R’s strengths Data management & manipulation Statistics Graphics Programming language Active user community Free!

R’s weaknesses Not user friendly at start. Minimal GUI. No commercial support. Substantially slower than other programming languages (e.g. perl, java, C++).

R graphics: the sky's the limit!

How R?

R as a calculator
Operators and functions: +, -, /, *, ^, log(), exp(), sqrt(), …
(17*0.35)^(1/3)
log(10)
exp(1)
3^-1

Variables in R
Variables are assigned using either "=" or "<-"
x=12.6
x
[1] 12.6

Numeric vectors
A vector composed of numbers. Such a vector may be created:
1. Using the c() (short for concatenate) function:
y=c(3,7,9,11)
> y
[1]  3  7  9 11
2. Using the rep(what,how_many_times) function:
y=rep(3,30)
3. Using the ":" operator, signifying "a series of integers between":
y=1:30

Boolean vectors
A boolean variable can be either TRUE or FALSE.
b=c(TRUE,FALSE,TRUE,FALSE,TRUE,TRUE)
sum(b) #number of "TRUE" elements

Vector manipulation
n=c(1,4,5,6,7,2,3,4,5,6) #creates a vector with the numbers in the brackets, stores it in n
length(n) #number of elements
n[3] #extract the 3rd element of n
n[-2] #extract all of n but the 2nd element
n[1:3] #extract the first three elements of n
n[c(1,3,4)] #extract the first, third, and fourth elements of n

Vector manipulation…
n+1 #add 1 to all elements in n
n*2 #multiply all elements in n by two
sum(n)
mean(n)
median(n)
var(n)
min(n)
max(n)
log(n) #natural log of every element in n

More advanced manipulation
n<4 #returns a boolean vector of the same length as n, with TRUE for each value smaller than 4 and FALSE for all other values
n[n<4] #extract all elements in n smaller than 4
n[n<4 & n!=1] #extract elements smaller than 4 AND different from 1
n[n<4 | n!=1] #extract elements smaller than 4 OR different from 1
sum(n[n<4]) #sum of elements in n with values smaller than 4
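A minimal sketch of the logical indexing described on this slide, using the same vector n. The variable names small and either are only illustrative.

```r
n <- c(1, 4, 5, 6, 7, 2, 3, 4, 5, 6)   # same vector as on the slides
small <- n < 4                          # logical vector: TRUE where the element is below 4
n[small]                                # elements below 4: 1 2 3
n[n < 4 & n != 1]                       # below 4 AND not equal to 1: 2 3
either <- n[n < 4 | n != 1]             # below 4 OR not equal to 1
length(either)                          # 10: every element satisfies at least one condition
sum(small)                              # TRUE counts as 1, so this is the number of matches: 3
sum(n[small])                           # sum of the matching values: 6
```

Note that the OR version returns the whole vector here, because the only element equal to 1 is itself smaller than 4.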

Functions (spells…) in R
- Functions are bits of code which receive something as input (termed: arguments), and produce something as output (termed: return value).
- A function can be recognized by the round brackets "()" following the function name.
- The argument of the "mean" function is a vector of numbers; the return value is their average.
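Writing your own spell looks like this. The sketch below defines a hypothetical helper, std_error (not part of the original slides), next to the built-in mean function.

```r
# Hypothetical helper, not from the slides: standard error of the mean
std_error <- function(x) {
  sd(x) / sqrt(length(x))   # the value of the last expression is the return value
}

n <- c(1, 4, 5, 6, 7, 2, 3, 4, 5, 6)
mean(n)        # built-in function; argument: a numeric vector; returns 4.3
std_error(n)   # our own function, called exactly like a built-in one
```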

Basic visualization of numbers barplot(n) plot(n) hist(n) boxplot(n) pie(n)

barplot(n,col="red")

plot(n,col="red")

hist(n,col="red")

boxplot(n,col="red")

pie(n[1:3])

Help in R
Type ? + function_name, e.g. ?barplot
Help pages contain the following components:
-function_name(package) – if the package is not installed, this is the time to install it and load it (using "library")
-Description: brief overview
-Usage
-Description of arguments (input)
-Details: more information
-Value: value returned by the function (output)
-See also: great way to learn new stuff you didn't even know you wanted to do!
-Examples: Can be copy-pasted as is! Highly informative!

Other vectors
Character vectors:
nms=c("miriam","schragi","chaim","jochanan","ephraim","avraham","yemima","shakked","ayala","adi")
names(n)=nms #giving names to each value in numeric vector n
n["shakked"]
Class Exercise: Redraw some of the previous plots with modified n!

The paste() function
Concatenates several character strings into a single string, separated by the value of the sep argument (default: sep=" ")
paste("To","err","is human.","To R is","divine!",sep="_")
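A short sketch of how paste() behaves with and without sep, and with a vector argument. The variable names are illustrative only.

```r
spaced <- paste("To", "err", "is human.")             # default sep = " "
joined <- paste("To", "R", "is", "divine!", sep = "_")
labels <- paste("chr", 1:3, sep = "")                 # vectorized: one string per number
one    <- paste(c("a", "b", "c"), collapse = "+")     # collapse joins a vector into one string

spaced   # "To err is human."
joined   # "To_R_is_divine!"
labels   # "chr1" "chr2" "chr3"
one      # "a+b+c"
```

The vectorized form is what the later slides rely on to build chromosome names such as "chr1" to "chr22".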

Factor vectors (We love factors!)
f=as.factor(c("stupid","stupid","smart","stupid","imbecile","smart","smart","imbecile"))
levels(f) #possible values an element of f can have
summary(f) #provides the number of times each level occurs
Class Exercise: Compare summary(n), summary(b), and summary(f) – note difference in output!
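What a factor stores can be sketched as follows, using the same values as the slide: a set of levels plus one integer code per element.

```r
f <- factor(c("stupid", "stupid", "smart", "stupid",
              "imbecile", "smart", "smart", "imbecile"))
levels(f)       # "imbecile" "smart" "stupid" (sorted alphabetically by default)
summary(f)      # counts per level: imbecile 2, smart 3, stupid 3
as.integer(f)   # the underlying codes: 3 3 2 3 1 2 2 1
```

A stray space inside a string (for example "smart " instead of "smart") would silently create an extra level, so it is worth checking levels(f) after building a factor.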

The data.frame Class (We also love data.frames!)
A data.frame is simply a table.
Each column may be of a different class (i.e. one column may be numeric, another may be a character, a third may be boolean and a fourth may be a factor).
All rows in a given column must be of the same class.
The number of rows in each column must be identical.
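A toy data frame built by hand illustrates these rules. All names and values below are made up for illustration.

```r
# Made-up table: four columns of four different classes, three rows each
df <- data.frame(
  name    = c("miriam", "chaim", "ayala"),
  height  = c(1.62, 1.80, 1.71),
  student = c(TRUE, FALSE, TRUE),
  group   = factor(c("a", "b", "a"))
)
dim(df)             # 3 rows, 4 columns
sapply(df, class)   # the class of each column
```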

Iris database Petal (Hebrew, literally "corolla leaf") Sepal (Hebrew, literally "calyx leaf")

The iris dataset

The fascinating questions
What are typical lengths and widths of sepals and petals?
Do these change from one species of iris to another?
Do longer petals tend to be wider?
Do longer petals tend to correlate with longer (or wider) sepals?
Do such correlations change from one species of iris to another?

Playing with data frames - I
1. Set the work directory to the directory you're working in:
setwd("F:/presentations/R presentation")
(Note: getwd() tells you which directory you're in)
2. Load the table you want to work with (make sure you saved it as a tab-delimited file!):
ir=read.table(file="iris_dataset.txt",sep="\t",header=T) #loads iris_dataset.txt into variable "ir". Assumes that the file is tab delimited, and that the first line is a header.

Playing with data frames II
class(ir) #shows the class of ir
dim(ir) #returns the number of rows and columns in ir
ir[1,2] #first row, second column in ir
ir[1,] #all columns in the first row of ir
ir[,1] #all rows in the first column of ir
ir$seplen #same as above
ir[,"seplen"] #same as above
ir[,c("seplen","sepwid")] OR ir[,1:2] #first two columns of ir
summary(ir) #each of the columns is summarized according to its class

Playing with data frames - III
ir$seplen>6 #returns a boolean vector with TRUE and FALSE values depending on whether seplen is greater than 6
ir[ir$seplen>6,] #returns a subset of ir containing all columns of all rows in which seplen is greater than 6
ir[ir$seplen>6,c("seplen","sepwid")] #returns same rows as above, but only "seplen" and "sepwid" columns
ir[ir$seplen>6 & ir$sepwid>3,c("seplen","sepwid")] #returns same columns as above, but only rows in which seplen is greater than 6 and sepwid is greater than 3
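The subsetting above can be tried without the presenter's file. The sketch below assumes that R's built-in iris data set is an acceptable stand-in for iris_dataset.txt, and renames its columns to the names used on these slides; that renaming is an assumption, not something taken from the original file.

```r
ir <- iris   # built-in data set (150 rows); stands in for iris_dataset.txt
# Assumed mapping of the built-in column names onto the names used on the slides
names(ir) <- c("seplen", "sepwid", "petlen", "petwid", "species")

is_long   <- ir$seplen > 6                      # one TRUE/FALSE per row
long      <- ir[is_long, ]                      # all columns, only the matching rows
long_wide <- ir[ir$seplen > 6 & ir$sepwid > 3,  # both conditions must hold
                c("seplen", "sepwid")]          # keep only two columns
nrow(long)
nrow(long_wide)
```

The comma inside the square brackets matters: what comes before it selects rows, what comes after it selects columns.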

Visualization hist(ir$seplen) #histogram of seplen

Visualization - II hist(ir$seplen,30) #histogram of seplen with approximately 30 bins

Visualization - III
mean_seplen=mean(ir$seplen)
hist(ir$seplen,20,col="light blue", main="Distribution of sepal lengths", xlab="Length of sepal (cm)", sub=paste("Mean sepal length is",mean_seplen))

The tapply() function
Suppose you want to obtain the average age of patients (a numeric variable) as a function of their gender (a factor variable). And suppose the data is stored in the data frame data. The magic spell is:
tapply(data$age,data$gender,mean)
The tapply function receives three arguments:
-A numeric vector
-A factor variable, dividing the numeric vector into groups
-A function to apply to each group (mean,min,max,sd,sum)

Visualization - IV
mean_per_species=tapply(ir$seplen,ir$species,mean) #calculates the mean value of ir$seplen after dividing it into three groups based on ir$species
barplot(mean_per_species,col="red")
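A runnable sketch of the tapply() call, assuming R's built-in iris data set as a stand-in for the presenter's file, with its columns renamed to the slide names (an assumption).

```r
ir <- iris   # built-in data set, used here instead of iris_dataset.txt
names(ir) <- c("seplen", "sepwid", "petlen", "petwid", "species")  # assumed slide names

# numeric vector, grouping factor, function to apply to each group
mean_per_species <- tapply(ir$seplen, ir$species, mean)
sd_per_species   <- tapply(ir$seplen, ir$species, sd)

mean_per_species   # setosa 5.006, versicolor 5.936, virginica 6.588
```

The result is a named vector with one value per level of the factor, which is why it can be passed straight to barplot().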

Adding packages

Select mirror

Select library

Class exercise Install the following three packages: gplots, lattice, car. These packages will be used in subsequent examples.

Visualization - V
sd_per_species=tapply(ir$seplen,ir$species,sd) #calculate standard deviation
library(gplots) #loads all functions in gplots into workspace (including the barplot2 function)
barplot2(mean_per_species, plot.ci = T, ci.l = mean_per_species-sd_per_species, ci.u = mean_per_species+sd_per_species,col="red",ylab="Mean sepal length")

Visualization - VI
library(gplots)
plotmeans(ir$seplen~ir$species,xlab="species",ylab="Sepal length")

Looking at correlations plot(ir$petlen,ir$petwid) #plotting one set of numbers as a function of another

Arguments of the plot function
Some parameters of the plot() function (get more by typing "?plot.default"):
x – x values (defaults to 1:number of points)
y – y values (the distribution)
type – the plot type: can be either "l" (line), "p" (points) or more
pch – type of bullets (e.g. values from 19-25 give filled symbols)
col – color (either numbers or names of colors) – can receive multiple colors
lwd – line width
lty – line type
xlab,ylab – X and Y labels
main, sub – main title (top of chart) and subtitle (beneath the X label)

More sophisticated plotting
plot(ir$petlen,ir$petwid,col=as.numeric(ir$species),pch=19,xlab="Petal length",ylab="Petal width")

And more sophisticated plot, with legend and P values
stat=cor.test(ir$petlen,ir$petwid)
rval=stat$estimate
pval=stat$p.value
plot(ir$petlen,ir$petwid,col=as.numeric(ir$species),pch=19,xlab="Petal length",ylab="Petal width",main=paste("R=",rval," ; P=",pval,sep=""))
legend(x="topleft",legend=levels(ir$species),col=1:3,lty=1,lwd=2) #adding a legend
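The statistics behind the plot title can be inspected without drawing anything. This sketch assumes the built-in iris data set as a stand-in for the presenter's file; rounding with signif() is an optional addition, not part of the original slide, that keeps the title short.

```r
ir <- iris   # built-in data set, used here instead of iris_dataset.txt
names(ir) <- c("seplen", "sepwid", "petlen", "petwid", "species")  # assumed slide names

stat <- cor.test(ir$petlen, ir$petwid)   # Pearson correlation by default
rval <- stat$estimate                    # correlation coefficient
pval <- stat$p.value                     # P value of the test

# signif() keeps three significant digits so the title stays readable
title_text <- paste("R=", signif(rval, 3), " ; P=", signif(pval, 3), sep = "")
title_text
```

The object returned by cor.test() is a list; names(stat) shows everything else it contains, such as the confidence interval.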

Plotting correlations as a function of a third factor variable library("lattice") xyplot(ir$seplen ~ ir$sepwid | ir$species)

Looking at everything as a function of everything else
pairs(ir[,1:4])
pairs(ir[,1:4],col=ir$species,upper.panel=NULL)

Even more sophisticated…
library(car)
scatterplot.matrix(ir[,1:4],groups=ir$species,ellipse=T,levels=0.95,upper.panel=NULL,smooth=F) #in newer versions of car this function is called scatterplotMatrix() and its arguments may differ

And more (for the highly motivated or extremely bored…)
upperpanel.cor <- function(x, y, method="pearson", digits=2, ...) {
  points(x,y,type="n")
  usr <- par("usr"); on.exit(par(usr=usr)) #restore the coordinate system on exit
  par(usr = c(0, 1, 0, 1))
  correl <- cor.test(x, y, method=method)
  r=correl$estimate
  pval=correl$p.value
  color="black"
  if (pval<0.05) color="blue" #highlight significant correlations
  txt <- format(r,digits=digits)
  pval <- format(pval,digits=digits)
  txt <- paste("r=", txt, "\npval=",pval,sep="")
  text(0.5, 0.5, txt,col=color)
}
scatterplot.matrix(ir[,1:4],groups=ir$species,ellipse=T,levels=0.95,upper.panel=upperpanel.cor,cex=0.3,smooth=F,main="This is cool!!!")

Final output

Saving Graphics to Files Before running the visualizing function, redirect all plots to a file of a certain type. Possibilities: –jpeg(filename) –png(filename) –pdf(filename) –postscript(filename) After running the visualization function, close graphic device using dev.off()

Saving graphics
Example:
pdf("F:/test.pdf")
barplot(1:10,col="red")
dev.off()
Note: The different graphics functions can also receive arguments regarding the width and height of the canvas. Use "?" + function name (e.g. ?jpeg) to see the arguments.
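The same open, plot, close sequence as a sketch that runs on any machine. The output location (a temporary directory) and the canvas size are illustrative choices, not taken from the slides.

```r
out_file <- file.path(tempdir(), "test.pdf")   # hypothetical output location
pdf(out_file, width = 6, height = 4)           # open the file; size is given in inches
barplot(1:10, col = "red")                     # drawn into the file, not on screen
dev.off()                                      # close the device so the file is complete
file.exists(out_file)                          # TRUE once the device has been closed
```

Forgetting dev.off() is the usual reason for an empty or unreadable output file.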

Statistics
t.test #Student t test
wilcox.test #Mann-Whitney test
kruskal.test #Kruskal-Wallis rank sum test
chisq.test #chi squared test
cor.test #Pearson/Spearman correlations
lm(),glm() #linear and generalized linear models
p.adjust #adjustment of P values for multiple testing using FDR, Bonferroni, or whatnot…

Down Syndrome

The fascinating research question Do genes from any particular chromosome alter their expression levels in Down syndrome?

GEO database: A paradise of numbers

Getting the data!

Loading the data (look at it first!)
setwd("F:/Presentations/R presentation/") #sets the work directory
a=read.table(file="GSE5390_series_matrix.txt",sep="\t",header=T,comment.char="!") #loads the gene expression values and stores them in a
names(a)=c("id","down1","down2","down3","down4","down5","down6","down7","healthy1","healthy2","healthy3","healthy4","healthy5","healthy6","healthy7","healthy8") #gives informative names to the columns in a

The merge() function
Given two tables, a and b, that share a column called "name":
merge(a,b,by="name") OR merge(a,b,by.x="name",by.y="name")

Merging data
convert=read.table(file="convert_affyprobes_2_chromosome_location_from_UCSC.txt",sep="\t",header=T)
b=merge(a,convert,by="id") #merges a and convert by the column given in the by argument. In other words, the column "id" in "a" is compared to the column "id" in "convert". Only rows in which the two values are identical are retained, yielding a new data frame that combines the information from both tables.
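What "only matching rows are retained" means can be seen on two tiny made-up tables. The probe ids, values and chromosomes below are invented for illustration; all.x is a standard merge() argument that the slides do not use.

```r
# Made-up stand-ins for the expression table and the probe-to-chromosome table
a       <- data.frame(id = c("p1", "p2", "p3"), expr = c(7.1, 8.4, 6.9))
convert <- data.frame(id = c("p2", "p3", "p4"), chr = c("chr21", "chr1", "chrX"))

b     <- merge(a, convert, by = "id")                 # only ids present in both: p2, p3
b_all <- merge(a, convert, by = "id", all.x = TRUE)   # keep every row of a; chr is NA for p1
b
b_all
```

Comparing nrow(a) with nrow(b) after a merge is a quick way to notice how many probes had no chromosome annotation.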

Assign informative names
downcols=2:8
healthycols=9:16
allarraycols=c(downcols,healthycols)

Calculate Fold Change between disease and healthy
Step 1: calculate mean expression values for all patients with Down syndrome
b$meandown=apply(b[,downcols],1,mean)
Step 2: calculate mean expression values for all healthy subjects
b$meanhealthy=apply(b[,healthycols],1,mean)
Step 3: calculate the difference between the two (since the data is log transformed, a difference corresponds to a ratio)
b$dif=b$meandown-b$meanhealthy
Step 4: anti-log the difference to obtain the fold change
b$foldchange=2^b$dif
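The four steps can be checked by hand on a made-up table of log2 expression values (three genes, two arrays per group). The sketch assumes the data is log2 transformed, as the 2^ in step 4 implies, and uses rowMeans() as a shorter equivalent of apply(..., 1, mean).

```r
# Made-up log2 expression values, for illustration only
b <- data.frame(id = c("g1", "g2", "g3"),
                down1 = c(8, 5, 6),    down2 = c(8, 5, 7),
                healthy1 = c(7, 5, 8), healthy2 = c(7, 5, 7))
downcols    <- 2:3
healthycols <- 4:5

b$meandown    <- rowMeans(b[, downcols])      # same result as apply(b[, downcols], 1, mean)
b$meanhealthy <- rowMeans(b[, healthycols])
b$dif         <- b$meandown - b$meanhealthy   # log2(x) - log2(y) equals log2(x / y)
b$foldchange  <- 2^b$dif                      # back from log2 ratio to a plain ratio
b$foldchange                                  # 2.0 1.0 0.5
```

A fold change of 2 means twice as high in the disease group, 0.5 means half as high, and 1 means no change.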

Calculate P values
Step 1: Create a function which receives a row as input, knows how to break it up into disease and control groups, and yields a p value
GetPval=function(line) {
ttest=t.test(line[downcols-1],line[healthycols-1]) #-1 because the id column is not passed to the function
ttest$p.value
}
Step 2: Apply this function to all rows of the data frame
b$pval=apply(b[,allarraycols],1,GetPval)
Step 3: Adjust P values for multiple testing
b$adjustedPval=p.adjust(b$pval,method="fdr")
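The row-by-row t test can be sketched on simulated data. Everything below is made up: five genes, fifteen arrays, and one gene deliberately shifted upward in the first seven arrays. The names get_pval, down_idx and healthy_idx are illustrative.

```r
set.seed(1)                                          # reproducible made-up data
expr <- matrix(rnorm(5 * 15, mean = 8), nrow = 5)    # 5 genes x 15 arrays
expr[1, 1:7] <- expr[1, 1:7] + 5                     # gene 1 is higher in the first 7 arrays

down_idx    <- 1:7    # positions inside the 15-column block, not inside the full table
healthy_idx <- 8:15

get_pval <- function(row) {
  t.test(row[down_idx], row[healthy_idx])$p.value    # one P value per gene
}
pval     <- apply(expr, 1, get_pval)                 # 1 = apply over rows
adjusted <- p.adjust(pval, method = "fdr")           # Benjamini-Hochberg adjustment
round(cbind(pval, adjusted), 4)
```

The adjusted values are never smaller than the raw ones, which is the price of testing many genes at once.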

Saving data frames to a file
write.table(b,file="DownWithPvals.txt",sep="\t",row.names=F,col.names=T) #generates a tab-delimited file with column names and without row names, containing the data in the data frame b

Finding significant events
sigs=b[b$foldchange>1.75 & b$adjustedPval<0.01,] #finding events with a large fold change and significant P values
sigs=sigs[order(sigs$adjustedPval,decreasing=T),] #sorting table by adjusted P value, largest first
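How order() drives the sorting can be seen on a made-up three-row table; the gene ids and P values are invented for illustration.

```r
# Made-up results table
sigs <- data.frame(id = c("g1", "g2", "g3"),
                   adjustedPval = c(0.004, 0.0001, 0.009))

order(sigs$adjustedPval)   # row positions from smallest to largest P: 2 1 3
sigs_desc <- sigs[order(sigs$adjustedPval, decreasing = TRUE), ]   # largest P first, as on the slide
sigs_asc  <- sigs[order(sigs$adjustedPval), ]                      # most significant first
sigs_asc
```

order() returns row positions rather than sorted values, which is why it is used inside the square brackets.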

Finding and plotting % significantly over/under expressed genes per chromosome
percentages=summary(sigs$chr)*100/summary(b$chr) #divides the number of times each chromosome appears in "sigs" by the number of times it appears in the original data
barplot(percentages,las=3,col="light blue",ylab="% significant genes",main="To R is divine!") #barplot depicting the percentage of genes from each chromosome within sigs

Even better plot…
validchrs=c(paste("chr",1:22,sep=""),"chrX","chrY")
percentages=percentages[validchrs]
barplot(percentages,las=3,col="light blue",ylab="% significant genes",main="R - for a better tomoRRow!")
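The per-chromosome percentages depend on the chr column being a factor, so that both counts cover the same chromosomes in the same order. Since R 4.0, read.table() reads text columns as character by default, so a sketch that sets the levels explicitly may be safer. The chromosome labels below are made up for illustration.

```r
# Made-up chromosome labels: all genes, and the significant subset
all_chr    <- c("chr1", "chr1", "chr1", "chr1", "chr21", "chr21", "chrX", "chrX")
sig_chr    <- c("chr1", "chr21", "chr21")
chr_levels <- c("chr1", "chr21", "chrX")      # the chromosomes to report, in plotting order

n_all <- table(factor(all_chr, levels = chr_levels))
n_sig <- table(factor(sig_chr, levels = chr_levels))   # chrX gets an explicit 0
percentages <- 100 * n_sig / n_all
percentages   # chr1 25, chr21 100, chrX 0
```

Fixing the levels up front also replaces the later filtering step, because only the listed chromosomes are counted.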

Results…

Volcano plots: P values as a function of fold change
plot(log2(b$foldchange),-log2(b$pval),col=(b$chr=="chr21")+1,pch=19,xlab="log foldchange",ylab="-log P value")
legend(x="topleft",legend=c("non chr 21","chr 21"),lty=1,col=1:2,lwd=3)
abline(h=-log2(0.001),col="blue",lty=3)
abline(v=c(log2(1.75),-log2(1.75)),col="blue",lty=3)
text(2,17,"Significantly\nOver-represented",col="blue")
text(-1.4,17,"Significantly\nUnder-represented",col="blue")
abline() function: adds horizontal or vertical line/s (among other, more sophisticated things), depending on whether the "h" or "v" arguments are populated
text() function: receives x,y coordinates on the plot, as well as the text to plot

Volcano plot

A particular R strength: genetics Bioconductor is a suite of additional functions and some 200 packages dedicated to analysis, visualization, and management of genetic data Much more functionality than software released by Affy or Illumina

Where R?

R homepage: project.org/

Choose server…

Click on “Windows”

Click “base”

Click on “Download” link and follow installation guidelines…

There you R!

Installing Tinn-R Go to: Scroll to bottom of page

Loading R from within Tinn-R

Configuring Tinn-R hotkeys

Write text in Tinn-R; send to R

Final Tips
Use http:// & google for finding help on what you want.
Know your objects’ classes: class(x)
Know your functions' arguments. Use "?function_name" to learn what arguments a function receives & what its return values are.
Each help file provides examples, which can be copy-pasted into R as is. Extremely useful!
MOST IMPORTANT - the more time you spend using R, the more comfortable you become with it.
DESPAIR NOT – and you will never look back!

Final Words of Warning “Using R is a bit akin to smoking. The beginning is difficult, one may get headaches and even gag the first few times. But in the long run, it becomes pleasurable and even addictive. Yet, deep down, for those willing to be honest, there is something not fully healthy in it.” --Francois Pinard

Thank you! May the R be with you!

Quick hands-on
Generate a numeric vector called a containing the numbers 1,3,4,5,9.
Calculate the square root (sqrt) of the values in a.
Create a barplot displaying a.
Show a as a regular plot, showing the values in red.
Label the x-axis of the plot "R is gReat", and the y-axis "I love R".

Hands-On - II Based on the down syndrome microarrays: -Find the 10 genes showing the highest differences between healthy and sick. -Create bar plots showing the average values in sick, and in healthy for those ten genes. -For true geeks: Add error bars to the graph.

Todo multiple panels lists, loops, lapply, sapply regular expressions