Introduction to R and Bioconductor BMI 731 Winter 2005


Introduction to R and Bioconductor
BMI 731, Winter 2005
Catalin Barbacioru
Department of Biomedical Informatics, Ohio State University

References
R Project (www.r-project.org): open-source language and environment for statistical computing and graphics.
Comprehensive R Archive Network, CRAN (cran.r-project.org): source code and precompiled binary distributions for Linux, Windows, and MacOS; base and contributed packages.
Bioconductor Project (www.bioconductor.org): open-source software for the analysis of biomedical and genomic data, mainly R packages.

R Project
R is a language and environment for statistical computing and graphics. It is an open-source project similar to the S language and environment, which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered a different implementation of S.
R provides a wide variety of statistical techniques (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible.
The S language is often the vehicle of choice for research in statistical methodology, and R provides an open-source route to participation in that activity.

R Project
R can be extended (easily) via packages. An R package is a structured collection of code (R, C, or other), documentation, and/or data for performing specific types of analyses.
Packages need to be installed only once, but they must be loaded in each new R session.
Loading uses the R function library, e.g., library(Biobase).
Various functions are available to obtain information on a package. For example, packageDescription returns the content of the DESCRIPTION file and .find.package returns the directory where the package was installed.
> packageDescription("hgu95av2")
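
The install-once, load-per-session pattern described on this slide can be sketched as follows. This is a minimal sketch that uses the `stats` package shipped with R so that it runs without Bioconductor. Two points are assumptions that go beyond the slide: the commented-out install line reflects a modern setup (BiocManager did not exist in 2005), and `find.package` is used because recent R releases provide it in place of the older `.find.package`.

```r
# Packages are installed once (commented out: needs network access, and
# BiocManager is a modern installer, not part of the 2005 course setup).
# install.packages("BiocManager"); BiocManager::install("Biobase")

# ...but they must be loaded in every new session.
library(stats)

# Read the DESCRIPTION file of an installed package.
desc <- packageDescription("stats")
pkg_version <- desc$Version

# Directory where the package is installed
# (find.package is the current name for the older .find.package).
pkg_dir <- find.package("stats")

# Check whether an optional package is available before relying on it.
has_biobase <- requireNamespace("Biobase", quietly = TRUE)

cat("stats version:", pkg_version, "\n")
cat("installed in:", pkg_dir, "\n")
cat("Biobase available:", has_biobase, "\n")
```

Checking with `requireNamespace` first lets a script fall back gracefully when a Bioconductor package has not been installed yet.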

R Packages
Analysis packages: implementation of statistical and graphical methods. E.g. cluster, glm, graph, hexbin, lattice, rpart.
Data packages: biological metadata packages consisting of environment objects for mappings between different gene identifiers (e.g., Affymetrix ID, GO ID, LocusLink ID, PubMed ID), plus CDF and probe sequence information for Affymetrix chips. E.g. GO, hgu95av2, humanLLMappings, KEGG.
Specialized/custom packages: code, data, documentation, and exercises for a particular project, article, or course. E.g. EMBO03: Bioconductor course package; golubEsets: Golub et al. (2000) ALL/AML dataset; yeastCC: Spellman et al. (1998) yeast cell cycle dataset.
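
To make the idea of "environment objects for mappings between identifiers" concrete, here is a small base-R sketch of an environment used as a lookup table. The environment name `toy_symbol` and the gene symbols are hypothetical placeholders; real metadata packages such as hgu95av2 supply their own environments, and their contents are not reproduced here.

```r
# Toy annotation environment: probe-set ID -> gene symbol.
# The symbols are illustrative placeholders, not real annotation.
toy_symbol <- new.env(hash = TRUE)
assign("1000_at", "GENE_A", envir = toy_symbol)
assign("1001_at", "GENE_B", envir = toy_symbol)
assign("1002_f_at", NA_character_, envir = toy_symbol)  # unmapped probe set

# Look up a single identifier.
one <- get("1000_at", envir = toy_symbol)

# Look up several identifiers at once; mget returns a named list.
ids <- c("1001_at", "1002_f_at")
several <- unlist(mget(ids, envir = toy_symbol))

# List every key stored in the environment.
all_ids <- sort(ls(toy_symbol))

print(one)
print(several)
print(all_ids)
```

An environment behaves like a hash table keyed by character strings, which is why it suits large identifier-to-identifier maps.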

R Packages
Base packages (CRAN). E.g. base, graphics, methods, stats.
Contributed packages (CRAN). E.g. ellipse, XML.
Bioconductor packages. E.g. annotate, affy, marray, multtest, hgu95av2, ALL, EMBO03.

Bioconductor Project
Bioconductor is an open-source and open-development software project for the analysis of biomedical and genomic data. The project was started in the fall of 2001 and includes 25 core developers in the US, Europe, and Australia.
Its goals are to:
provide access to powerful statistical and graphical methods for the analysis of biomedical and genomic data;
facilitate the integration of biological metadata from the WWW (e.g. GenBank, GO, LocusLink, PubMed) in the analysis of experimental data;
provide training in computational and statistical methods.

Bioconductor Packages
Statistical methods: cluster analysis, estimation and (multiple) testing for linear and non-linear models (with possibly censored continuous and polychotomous outcomes), resampling, visualization, etc.
Biological assays: cell-based assays, DNA microarrays (transcript levels, DNA copy number from CGH), proteomics, SAGE, SELDI-TOF, SNP, etc.
Biological metadata from WWW: GenBank, GO, KEGG, PubMed, etc.
Interfaces with other languages: C, Java, Perl, Python, XML, etc. (Omega Project, www.omegahat.org).
Interactions with other projects: BGL, GeneSpring, Graphviz, MAGE-ML, Resourcerer, etc.

Bioconductor Packages
Analysis packages: e.g., annotate, affy, marray, multtest.
Data packages:
• Biological metadata: mappings between different gene identifiers (e.g., AffyID, GO ID, LocusID, PMID), CDF and probe sequence information for Affymetrix chips. E.g. hgu95av2, GO, KEGG.
• Experimental data: code, data, and documentation for specific experiments or projects. ALL: Chiaretti et al. (2004) ALL dataset. golubEsets: Golub et al. (2000) ALL/AML dataset. yeastCC: Spellman et al. (1998) yeast cell cycle dataset.

Bioconductor Packages
General infrastructure: Biobase, Biostrings, DynDoc, reposTools, rhdf5, ruuid, tkWidgets, widgetTools.
Annotation: annotate, AnnBuilder + metadata packages.
Graphics: geneplotter, hexbin.
Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, affylmGUI, affyPLM, annaffy, gcrma, makecdfenv, vsn.
Pre-processing two-color spotted DNA microarray data: arrayMagic, arrayQuality, limma, limmaGUI, marray, vsn.
Other assays: aCGH, DNAcopy, prada, PROcess, RSNPer, SAGElyzer.
Differential gene expression: EBarrays, edd, factDesign, genefilter, limma, limmaGUI, multtest, ROC.
Graphs and networks: graph, RBGL, Rgraphviz.
Gene Ontology: GOstats, goTools.

Microarray data analysis
Pre-processing of
– spotted array data with marray packages;
– Affymetrix chip data with affy packages.
List of differentially expressed genes from genefilter, limma, or multtest packages.
Prediction of tumor class using randomForest package.
Clustering of genes using cluster or hopach packages.
Use of annotate package
– to retrieve and search PubMed abstracts;
– to generate an HTML report with links to LocusLink and PubMed for each gene.
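
As a rough illustration of the "list of differentially expressed genes" step, the sketch below uses only base R on simulated data. It is a stand-in, not the method implemented in genefilter, limma, or multtest: it runs a plain Welch t-test per gene and adjusts the p-values with `p.adjust`. The data, group labels, and effect size are all made up.

```r
set.seed(1)

# Simulated log2 expression: 200 genes x 6 samples (3 per group).
n_genes <- 200
group <- factor(rep(c("A", "B"), each = 3))
expr <- matrix(rnorm(n_genes * 6, mean = 8, sd = 0.3), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Spike a true difference into the first 10 genes of group B.
expr[1:10, group == "B"] <- expr[1:10, group == "B"] + 2

# Gene-by-gene two-sample Welch t-test.
pvals <- apply(expr, 1, function(x) {
  t.test(x[group == "A"], x[group == "B"])$p.value
})

# Adjust for multiple testing (Benjamini-Hochberg false discovery rate).
adj <- p.adjust(pvals, method = "BH")

# Genes called differentially expressed at an FDR of 5%.
hits <- names(adj)[adj < 0.05]
print(hits)
```

With thousands of genes tested at once, the adjustment step matters as much as the test itself, which is why the slide lists packages that handle multiple testing.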

affy Package
To load the necessary packages,
> library(affy)
> library(affydata)
One of the main functions for reading in Affymetrix data is ReadAffy. It reads in data from CEL files and creates objects of class AffyBatch.
In this lab we will work mainly with the Dilution dataset, which is included in the affydata package. To load the dataset, type
> data(Dilution)
For a description of Dilution, type
> ?Dilution

affy classes and methods
One of the main classes in affy is the AffyBatch class.
> class(Dilution)
[1] "AffyBatch"
> slotNames(Dilution)
[1] "cdfName"     "nrow"        "ncol"        "exprs"       "se.exprs"    "phenoData"
[7] "description" "annotation"  "notes"
> Dilution
AffyBatch object
size of arrays=640x640 features (12805 kb)
cdf=HG_U95Av2 (12625 affyids)
number of samples=4
number of genes=12625
annotation=hgu95av2

affy classes and methods
The exprs slot contains a matrix with columns corresponding to chips and rows to individual probes on the chip. To obtain the matrix of intensities for all four chips,
> e <- exprs(Dilution)
Probe-level PM (perfect match) and MM (mismatch) intensities can be accessed using the pm and mm methods.
> PM <- pm(Dilution)

affy classes and methods
> PM[1:5, ]
       20A   20B   10A   10B
[1,] 468.8 282.3 433.0 198.0
[2,] 430.0 265.0 308.5 192.8
[3,] 182.3 115.0 138.0  86.3
[4,] 930.0 588.0 752.8 392.5
[5,] 171.0 128.0 152.3  97.8

affy classes and methods
To get the probe-set names (Affy IDs),
> gnames <- geneNames(Dilution)
> length(gnames)
[1] 12625
> gnames[1:5]
[1] "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"
[5] "1004_at"

affy classes and methods
To produce boxplots of log base 2 probe intensities,
> boxplot(Dilution, col = c(2, 2, 3, 3))
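
For readers without the affy package at hand, the same kind of plot can be approximated in base R. The sketch below simulates raw intensities (the values are made up; only the column labels follow the Dilution sample names), moves to the log base 2 scale, and draws one box per chip with the colour scheme from the slide. Shifting the "B" chips downward is an assumption, added to mimic a scanner effect.

```r
set.seed(2)

# Simulated raw intensities for 4 chips. Column names follow the Dilution
# sample labels, but the values are made up; the "B" chips get a lower
# overall scale to mimic a scanner effect.
raw <- cbind(`20A` = rlnorm(1000, 7.0, 1), `20B` = rlnorm(1000, 6.5, 1),
             `10A` = rlnorm(1000, 7.0, 1), `10B` = rlnorm(1000, 6.5, 1))

# Work on the log base 2 scale.
log_int <- log2(raw)

# Per-chip medians summarise the centre line of each box.
chip_medians <- apply(log_int, 2, median)
print(round(chip_medians, 2))

# One box per chip, coloured by concentration group as on the slide.
boxplot(log_int, col = c(2, 2, 3, 3), ylab = "log2 intensity")
```

Boxes that sit at different heights for chips expected to be comparable are the visual cue that normalization is needed.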

affy classes and methods
The boxplots show that the Dilution data need normalization. As described in the dataset help file and in the phenoData slot (pData(Dilution)), two concentrations of mRNA were used and, for each concentration, two scanners were used.
From the plots, we note that scanner effects seem stronger than concentration effects (the two concentrations are drawn in different colors). In other words, chips that should be the same are different; chips that should be different are similar.
Because different mRNA concentrations were used, we perform normalization within concentration groups. The default procedure implemented in the normalize method is probe-level quantile normalization.

affy classes and methods
> Dil20 <- normalize(Dilution[, 1:2])
> Dil10 <- normalize(Dilution[, 3:4])
> normDil <- merge(Dil20, Dil10)
> boxplot(normDil, col = c(2, 2, 3, 3))
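
The idea behind probe-level quantile normalization can be sketched in a few lines of base R. This is a simplified illustration, not the affy implementation: the helper name `quantile_normalize` is hypothetical, ties are broken by order of appearance, and the toy intensities are made up.

```r
# Quantile normalization: force every column (chip) to share the same
# empirical distribution, namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  ranks <- apply(x, 2, rank, ties.method = "first")  # rank within each chip
  sorted <- apply(x, 2, sort)                         # sort each chip
  target <- rowMeans(sorted)                          # reference distribution
  out <- apply(ranks, 2, function(r) target[r])       # map ranks to reference
  dimnames(out) <- dimnames(x)
  out
}

# Toy intensities: 5 probes on 2 chips (values are made up).
x <- cbind(chipA = c(5, 2, 3, 4, 1) * 100,
           chipB = c(4, 1, 4.5, 2, 3) * 50)

xn <- quantile_normalize(x)
print(xn)
```

After normalization every chip has the same set of values, and only the order of probes within a chip differs. That is why boxplots of normalized chips from the same group look identical.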

affy classes and methods
We view the process of going from probe-level intensities to gene-level expression measures as a three-step procedure consisting of: (i) background adjustment; (ii) normalization; (iii) summarization.
The affy package provides implementations of a number of methods for each of these steps:
(i) background correction: e.g., none, MAS 5.0, convolution;
(ii) normalization: e.g., probe-level quantile, cyclic loess, contrast loess;
(iii) summarization: e.g., MAS 4.0, MAS 5.0, MBEI (Li & Wong, 2001), median polish for additive linear model (Irizarry et al., 2003).
The Robust Multichip Average (RMA) method refers to the sequence: convolution background adjustment, probe-level quantile normalization, and median polish summarization for gene-specific additive models with probe and chip effects.
> rmaDil <- rma(Dilution)
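
The summarization step of RMA fits an additive model with probe and chip effects by median polish. The sketch below shows that step alone with base R's `medpolish`. It skips background adjustment and normalization, and it treats the five PM rows printed on an earlier slide as if they belonged to a single probe set, which is an assumption made purely for illustration.

```r
# PM intensities: rows = probes, columns = chips. These are the five rows
# printed earlier in the slides, treated here as one probe set (an
# assumption for illustration only).
pm <- matrix(c(468.8, 282.3, 433.0, 198.0,
               430.0, 265.0, 308.5, 192.8,
               182.3, 115.0, 138.0,  86.3,
               930.0, 588.0, 752.8, 392.5,
               171.0, 128.0, 152.3,  97.8),
             nrow = 5, byrow = TRUE,
             dimnames = list(NULL, c("20A", "20B", "10A", "10B")))
y <- log2(pm)

# Fit y[i, j] = overall + probe[i] + chip[j] + residual by median polish.
fit <- medpolish(y, trace.iter = FALSE)

# Expression summary per chip: overall level plus the chip (column) effect.
expr_summary <- fit$overall + fit$col
print(round(expr_summary, 2))
```

Median polish uses medians rather than means, so a single outlying probe has little influence on the per-chip summary.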

affy classes and methods
CDF data packages
Data packages providing CDF information can be downloaded from www.bioconductor.org. These packages contain environment objects which provide mappings between AffyIDs and matrices of probe locations, with rows corresponding to probe-pairs and columns to PM and MM cells.
The CDF environment for the HGU95Av2 chip is already in the package. For information on the environment object, type
> ?hgu95av2cdf
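
The structure described above (an environment mapping each AffyID to a matrix of probe locations with PM and MM columns) can be mimicked with a toy example. Everything in this sketch is hypothetical: the environment `toy_cdf`, the cell indices, and the intensities are made up and do not come from hgu95av2cdf.

```r
# Toy CDF-like environment: probe-set ID -> matrix of cell indices with
# one row per probe pair and columns "pm" and "mm". All indices are made up.
toy_cdf <- new.env()
assign("1000_at",
       cbind(pm = c(3, 7, 11), mm = c(4, 8, 12)),
       envir = toy_cdf)

# Toy intensity vector for one chip, indexed by cell position.
set.seed(3)
intensity <- round(runif(12, 50, 1000), 1)

# Retrieve the locations for one probe set, then pull out PM and MM values.
loc <- get("1000_at", envir = toy_cdf)
pm_values <- intensity[loc[, "pm"]]
mm_values <- intensity[loc[, "mm"]]

print(loc)
print(cbind(pm = pm_values, mm = mm_values))
```

Keeping locations separate from intensities means the same chip layout can be reused for every array of that chip type.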