Introduction to R Aedín Culhane


1 Introduction to R Aedín Culhane aedin@jimmy.harvard.edu http://bcb.dfci.harvard.edu/~aedin http://www.hsph.harvard.edu/research/aedin-culhane/

2 Jan 2009: Data Analysts Captivated by R’s Power
“R is really important to the point that it’s hard to overvalue it,” said Daryl Pregibon, a research scientist at Google, which uses the software widely. “It allows statisticians to do very intricate and complicated analyses without knowing the blood and guts of computing systems.”
Nov 10 2010: Names You Need to Know in 2011: R Data Analysis Software
“R is rapidly augmenting or replacing other statistical analysis packages at universities”

3
▫Open source development: flexible, extensible
▫Large number of statistical and numerical methods
▫High quality visualization and graphical tools
▫Extended by a very large collection of rapidly developing packages

4 R: Why is it called R?
▫The name is partly based on the (first) names of the first two R authors and partly a play on the name of the Bell Labs language ‘S’
▫Initially written by Robert Gentleman and Ross Ihaka, Dept of Statistics, University of Auckland, New Zealand (1996)

5 Short R History
1991: Ross Ihaka and Robert Gentleman begin work on a project that will become R
1993: The first announcement of R
1995: R available by ftp
1996: A mailing list is started and maintained by Martin Maechler at ETH
1997: The R core group is formed
2000: R 1.0.0 is released

6 Short R History Continued
2001: Bioconductor, for the analysis and comprehension of genomic data using R
2008: The Omegahat project, to enable connectivity between R and other languages
2010: Former co-founder and employees of SPSS found Revolution Analytics, a company which offers a commercial package around R
2011: The RStudio project provides a free, open source integrated development environment (IDE) for R

7 R
R project (v2.15, April 2012)
Before v2.15: biannual release (April, October)
From v2.15: annual release (April)
Download core and contributed packages from CRAN
Link: R Task Views

8 R Interface
Default R interface
RStudio
▫www.rstudio.org
▫Cross platform, Windows/Mac/Linux
Others
▫Notepad++, Tinn-R, Rcmdr, etc.

9 RStudio
4 windows: Editor, Console, History, Files/Plots
Code completion
Easy access to help (F1)
One-step Sweave PDF generation
Searchable history
Keyboard shortcuts
▫http://www.rstudio.org/docs/using/keyboard_shortcuts

10 Starting with R
The R environment is controlled by hidden files in the startup directory: .RData, .Rhistory and .Rprofile (optional)
These are very useful.
.Rhistory means you can automatically save all the commands you type
.RData saves everything in memory (can be large, so be careful)
Best to rename these using
▫save.image(file="S01_GeneProjectMay2012.RData")
▫save(myVec, file="S01_GeneProjectMay2012.RData")
▫savehistory(file="S01_GeneProjectMay2012.Rhistory")
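As a minimal sketch of the save/restore cycle described on this slide, the following saves one object to a named .RData file and loads it back. The object `myVec` and the file name are illustrative assumptions; `savehistory()` is left out because it is only meaningful in an interactive console.

```r
# Hypothetical object and file name, used only for illustration.
myVec <- c(2.5, 3.1, 4.8)
rdata_file <- file.path(tempdir(), "S01_example.RData")

# Save a single named object rather than the whole workspace.
save(myVec, file = rdata_file)

# Remove it from memory, then restore it from disk.
rm(myVec)
loaded <- load(rdata_file)   # load() returns the names of the restored objects
print(loaded)
print(myVec)
```

Saving named objects with `save()` keeps the file small; `save.image()` writes everything currently in the workspace.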

11 Tips for project management
Save commands to a script, myscript.R
## In R
source("myscript.R")
## Or from the command line
R CMD BATCH myscript.R
Save scripts as S01_xxxDate.R, S02_xxxDate.R, etc., where xxx is the project name
Use folders, or Projects in RStudio
getwd()
setwd()
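The script workflow above can be sketched as follows. The script name and its contents are hypothetical, and the script is written to a temporary directory so the example does not touch a real project folder.

```r
# Write a tiny script to a temporary folder (hypothetical name and contents).
script_file <- file.path(tempdir(), "S01_exampleMay2012.R")
writeLines(c("x <- 1:10", "x_mean <- mean(x)"), script_file)

# Run every command in the script, as if typed at the console.
source(script_file)
print(x_mean)

# Check and change the working directory, then restore it.
old_wd <- getwd()
setwd(tempdir())
print(file.exists(basename(script_file)))
setwd(old_wd)
```

Restoring the original working directory at the end avoids surprising later code that relies on relative paths.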

12 Overview of Bioconductor Aedín Culhane aedin@jimmy.harvard.edu http://bcb.dfci.harvard.edu/~aedin http://www.hsph.harvard.edu/research/aedin-culhane

13 Bioconductor
Release coincides with R release. Current: Bioconductor 2.10 (released to coincide with R 2.15)
To install, use the script on the Bioconductor website:
source("http://www.bioconductor.org/biocLite.R")
biocLite()
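The biocLite script reflects the Bioconductor 2.10 era of these slides. Later Bioconductor releases install through the BiocManager package from CRAN instead. A hedged sketch of that route follows; the helper name `install_bioc` is an assumption for illustration, and calling it needs network access.

```r
# Hypothetical wrapper around the BiocManager installation route.
install_bioc <- function(pkgs = character()) {
  # Install BiocManager from CRAN if it is not already available.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Install the requested Bioconductor packages.
  BiocManager::install(pkgs)
}

# Usage (commented out because it downloads packages):
# install_bioc("affy")
```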

14 What Packages do I need?
Specific to your data and analysis pipeline, but for examples see:
Bioconductor Workshops
Bioconductor Workflows

15 Packages Overview
Bioconductor web site
BiocViews task view
Software
Annotation Data
Experimental Data

16 Main types of Annotation Packages
Gene centric AnnotationDbi packages:
▫Organism: org.Mm.eg.db
▫Technology/Platform: hgu133plus2.db
▫Gene sets and pathways (biology level): GO.db or KEGG.db
▫.db packages can be queried with SQL or accessed using the annotation package functions (toTable, get, mget)
Genome centric GenomicFeatures packages:
▫Transcriptome level: TxDb.Hsapiens.UCSC.hg19.knownGene
▫Generic features: can be generated via GenomicFeatures
biomaRt:
▫Query web-based ‘biomart’ resources for genes, sequence, SNPs, etc.
See http://www.bioconductor.org/help/course-materials/2011/BioC2011/LabStuff/AnnotationSlidesBioc2011.pdf
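As an illustration of querying a gene centric .db package, the sketch below looks up gene symbols and names for a few Entrez IDs in org.Mm.eg.db. It uses the `keys()`/`select()` interface of AnnotationDbi, which is not the toTable/get/mget route named on the slide, and the helper name `lookup_mouse_genes` is hypothetical. It assumes org.Mm.eg.db is installed.

```r
# Hypothetical helper: annotate the first n Entrez IDs in the mouse package.
lookup_mouse_genes <- function(n = 3) {
  if (!requireNamespace("org.Mm.eg.db", quietly = TRUE)) {
    stop("org.Mm.eg.db is not installed")
  }
  db <- org.Mm.eg.db::org.Mm.eg.db
  # Take a few Entrez Gene IDs known to the package.
  ids <- head(AnnotationDbi::keys(db, keytype = "ENTREZID"), n)
  # Return one row per ID with the requested annotation columns.
  AnnotationDbi::select(db, keys = ids,
                        columns = c("SYMBOL", "GENENAME"),
                        keytype = "ENTREZID")
}

# Usage (requires the annotation package):
# lookup_mouse_genes(3)
```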

17 Bioconductor resources
Mailing list (sign up for daily digest)
Documentation, workshop/course material online
▫Slides from talks, PDFs of tutorials, R code
Help available for each software package
▫Each package MUST contain a vignette (how-to)
Other resources: www.Rseek.org, www.r-bloggers.com

18 Vignette
Tutorials that provide a worked example of the package
Required in Bioconductor packages
Written in Sweave (Leisch, 2002)
▫LaTeX dynamic reports in which R code is embedded and executable
▫All R code in a vignette is checked (and executed) by R CMD check
▫http://www.bioconductor.org/docs/vignettes.html
library("Biobase")
library("GOstats") # Load package of interest
openVignette()
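Besides `openVignette()` from Biobase, base R can list and open vignettes itself. The sketch below lists the vignettes of the grid package, chosen here only because it ships with R; substitute any installed package name.

```r
# List the vignettes available for one installed package.
vig <- vignette(package = "grid")

# The result holds a matrix with one row per vignette.
print(head(vig$results[, c("Item", "Title"), drop = FALSE]))

# In an interactive session, a browsable index can be opened with:
# browseVignettes("grid")
```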

19

20 Getting Data into R & Bioconductor Aedín Culhane aedin@jimmy.harvard.edu http://www.hsph.harvard.edu/research/aedin-culhane/

21 Simple Excel spreadsheet data
Simple table
▫read.table()
▫read.csv()
▫scan()
However, most data types have more specialized import functions. See Technologies on BiocViews.
▫http://www.bioconductor.org/packages/release/BiocViews.html
Large data files: also see http://www.revolutionanalytics.com
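A small sketch of the three import functions listed above, using a made-up expression table written to temporary files so that it runs without any external data. The gene and sample names are hypothetical.

```r
# Hypothetical data: three genes measured in two samples.
expr <- data.frame(gene = c("geneA", "geneB", "geneC"),
                   sample1 = c(7.1, 5.4, 9.8),
                   sample2 = c(6.9, 5.9, 10.2))

# Write the table as CSV and as tab-delimited text.
csv_file <- tempfile(fileext = ".csv")
tab_file <- tempfile(fileext = ".txt")
write.csv(expr, csv_file, row.names = FALSE)
write.table(expr, tab_file, sep = "\t", quote = FALSE, row.names = FALSE)

# read.csv() assumes a header row and comma separators.
from_csv <- read.csv(csv_file, stringsAsFactors = FALSE)

# read.table() needs the header and separator stated explicitly.
from_tab <- read.table(tab_file, header = TRUE, sep = "\t",
                       stringsAsFactors = FALSE)

# scan() returns a flat vector; here it reads only the header line.
header <- scan(tab_file, what = character(), nlines = 1, sep = "\t",
               quiet = TRUE)

str(from_csv)
```

`read.csv()` is a wrapper around `read.table()` with different defaults, so both calls return the same data frame here.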

22 Some common data types
Microarray
SNP
NGS
May 2011

23 A Microarray Overview

24 Reading Affymetrix Data
library(affy)
require(affy) # Alternative
affybatch <- ReadAffy(celfile.path="[Location of your data]")
eSet <- justRMA()
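The calls above can be wrapped in a small function that checks for CEL files before normalizing. This is a sketch under stated assumptions: the helper name `normalize_cel_dir` is hypothetical, the affy package must be installed, and `justRMA()` is pointed at the directory through its `celfile.path` argument rather than relying on the working directory.

```r
# Hypothetical helper: RMA-normalize all CEL files found in one directory.
normalize_cel_dir <- function(cel_dir) {
  # Fail early if the directory holds no CEL files.
  cel_files <- list.files(cel_dir, pattern = "\\.cel$", ignore.case = TRUE)
  if (length(cel_files) == 0) {
    stop("No CEL files found in: ", cel_dir)
  }
  if (!requireNamespace("affy", quietly = TRUE)) {
    stop("The affy package is not installed")
  }
  # justRMA() reads and normalizes in one step, returning an ExpressionSet.
  affy::justRMA(celfile.path = cel_dir)
}

# Usage (needs real CEL files):
# eSet <- normalize_cel_dir("[Location of your data]")
```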

25 Sample R code

26 Other Arrays
Illumina
▫lumi package
2-color spotted arrays
▫limma package
Other arrays
▫http://www.bioconductor.org/help/workflows/oligo-arrays/

27 Next Generation Sequencing Data

28 Public Microarray Data
ArrayExpress: 21,997 studies (622,617 profiles)
GEO: 22,735 studies (558,074 profiles)
Statistics as of May 2011

29 R Code

30 More on GEOquery
require(GEOquery)
Let's try to load the GDS810 dataset, which contains data on Alzheimer's disease at various stages of severity.
GDS810 <- getGEO("GDS810")
The getGEO function returns an object of class GEOData. You can get a description of this class like this:
help("GEOData-class")
Meta(GDS810)
Columns(GDS810)
head(Table(GDS810))
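A common next step, sketched here as an assumption rather than as part of the original slides, is to convert the downloaded GDS object into an ExpressionSet with GEOquery's `GDS2eSet()`. The wrapper name `gds_to_eset` is hypothetical, and calling it needs the GEOquery package and network access.

```r
# Hypothetical helper: download a GEO DataSet and convert it to an ExpressionSet.
gds_to_eset <- function(gds_id = "GDS810") {
  if (!requireNamespace("GEOquery", quietly = TRUE)) {
    stop("The GEOquery package is not installed")
  }
  # Download (or read from cache) the GDS record.
  gds <- GEOquery::getGEO(gds_id)
  # Convert the GDS object to an ExpressionSet, keeping default options.
  GEOquery::GDS2eSet(gds)
}

# Usage (downloads data from GEO):
# eset <- gds_to_eset("GDS810")
# dim(Biobase::exprs(eset))
```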

31 Assessing Data Quality

32 ExpressionSet Class in R

33 R basics: Getting help
To get help
▫?mean
▫help(mean)
help.search("mean")
apropos("mean")
example(mean)
http://www.bioconductor.org/help/
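A short sketch of what some of these lookups return when run from a script rather than typed at the console. Only base R functions are used; the choice of `mean` follows the slide.

```r
# apropos() returns the names of all visible objects matching a pattern.
hits <- apropos("mean")
print(head(hits))

# args() shows a function's argument list without opening the help page.
print(args(mean))

# find() reports which attached package provides an object.
print(find("mean"))
```

`?mean`, `help(mean)` and `help.search("mean")` open help pages or a results viewer, so they are most useful in an interactive session.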

