An Introduction to Bioconductor Bethany Wolf Statistical Computing I April 4, 2013.

Overview
• Background on Bioconductor project
• Installation and Packages in Bioconductor
• An example: working with microarray meta-data

Bioconductor
• Biological experiments continually generate more data and larger datasets
• Analysis of large datasets is nearly impossible without statistics and bioinformatics
• Research groups often re-write the same software for slightly different purposes
• Bioconductor provides a set of open-source, open-development tools that can be applied across a broad range of biomedical research areas

Bioconductor Project
• The Bioconductor project started in 2001
• Goal: make it easier to conduct reproducible, consistent analysis of data from new high-throughput biological technologies
• Core maintainers of the Bioconductor website are located at Fred Hutchinson Cancer Research Center
• An updated version is released twice a year, coinciding with the release of R
• Like R, there are contributed software packages

Goals of the Bioconductor Project
• Provide access to statistical and graphical tools for analysis of high-dimensional biological data
  • micro-array analysis
  • analysis of high-throughput
• Include comprehensive documentation describing and providing examples for packages
  • Website provides sample workflows for different types of analysis
  • Packages have associated vignettes that provide examples of how to use functions
• Have additional tools to work with publicly available databases and other meta-data

[Figure: analysis workflow. Stages shown: Biological Question; Experimental Design; Experiment (e.g. Microarray); Image analysis; Pre-processing; Normalization; Estimation, Testing, Clustering, Prediction, ...; Biological verification and interpretation]

Bioconductor website
Let's take a look at the website...

Installing Bioconductor
• All packages available in Bioconductor are run using R
• Bioconductor must be installed within the R environment prior to installing and using Bioconductor packages
> source("
> biocLite()
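The biocLite() installer was the mechanism in use when these slides were given. In Bioconductor 3.8 and later it has been replaced by the BiocManager package from CRAN. A minimal sketch of that newer route follows; the helper name install_bioc is hypothetical and is only there so that nothing is downloaded until it is called.

```r
# Hypothetical helper (not part of any package): makes sure BiocManager
# is available, then asks it to install the requested packages.
install_bioc <- function(pkgs = character()) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # pkgs is a character vector of Bioconductor (or CRAN) package names.
  BiocManager::install(pkgs)
}

# Example call (needs network access, so it is left commented out):
# install_bioc(c("Biobase", "limma"))
```

Wrapping the calls in a function keeps the example safe to source in a script.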

Bioconductor Packages
• 610 packages total (for now…)
• Biobase is the base package installed when you install Bioconductor
• The default installation also brings in several key packages (e.g. affy and limma), and Biobase includes several sample datasets
> biocLite("Biobase")
> library(Biobase)

Basic Classes of Packages
• General infrastructure
  • Biobase, DynDoc, reposTools, rhdf5, ruuid, tkWidgets, widgetTools
• Annotation
  • annotate, AnnBuilder, data packages
• Graphics
  • geneplotter, hexbin
• Pre-processing (affy and 2-channel arrays)
  • affy, affycomp, affydata, makecdfenv, limma, marrayClasses, marrayInput, marrayNorm, marrayPlots, marrayTools, vsn
• Differential gene expression
  • edd, genefilter, limma, multtest, ROC, siggenes
• Graphs and Networks
  • graph, RBGL, Rgraphviz
• Flow Cytometry
  • prada, flowCore, flowViz, flowUtils
• Protein Interactions
  • ppiData, ppiStats, ScISI, Rintact
• And so on…

Help Files for Bioconductor Packages
• Like R, there are help files available for Bioconductor packages.
• They can be accessed in several ways.
> help(Biobase)
> library(help="Biobase")
> browseVignettes(package="Biobase")
OR use the Vignettes pull-down menu in R
• Note that vignettes often contain more information than a traditional R help page.

Package Nuances
• Bioconductor packages are similar to R packages and are loaded into and used in R
• However, Bioconductor makes more use of the S4 class system from R
• R packages typically use the S3 class system. The difference...
• S4 is more formal and rigorous (which makes it somewhat more complicated than S3)
• If you really want to know more about the S4 class system you can check out
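To make the S3/S4 contrast concrete, here is a small sketch using only base R and the methods package. The class names gene_s3 and GeneS4 and their fields are invented for illustration and do not come from Bioconductor.

```r
library(methods)

# S3: the class is just a label attached to an ordinary list.
# Nothing checks that the contents make sense.
s3_obj <- structure(list(name = "geneA", expr = "not a number"),
                    class = "gene_s3")

# S4: slots and their types are declared when the class is defined.
setClass("GeneS4", representation(name = "character", expr = "numeric"))
s4_obj <- new("GeneS4", name = "geneA", expr = c(7.1, 8.3))

# Creating an S4 object with the wrong slot type is refused.
bad <- tryCatch(new("GeneS4", name = "geneA", expr = "not a number"),
                error = function(e) "rejected")

class(s3_obj)
is(s4_obj, "GeneS4")
bad
```

The S3 object accepts a character string where a number was intended, while the equivalent S4 construction stops with an error. That extra checking is what "more formal and rigorous" refers to.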

Example Use: Microarray Experiments
• Microarrays are collections of microscopic DNA spots attached to a solid surface
• Spots contain probes, i.e. short DNA segments from sections of genes
• Probes hybridize with cDNA or cRNA in the sample (targets)
• Fluorescent probes are used to quantify the relative abundance of targets
• Can be used to measure expression level, change in expression, SNPs,...

Gene Detection
• 1-Channel arrays: hybridize cDNA from a single sample to the array and measure intensity
  • label the sample with a single fluorophore
  • compare relative intensity to a reference sample run on a separate chip
• 2-Channel arrays: hybridize cDNA from two samples (e.g. diseased vs. healthy tissue)
  • label each with one of two different fluorophores
  • mix the two samples and apply to a single microarray
  • look at fluorescence at the 2 wavelengths corresponding to each fluorophore
  • measure the ratio of intensity for each fluorophore
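The intensity ratio on a two-channel array is usually examined on the log2 scale. The sketch below uses made-up intensities for five spots. The variable names red and green, and the conventional labels M (log-ratio) and A (average log-intensity), are assumptions for illustration rather than anything taken from the slides.

```r
# Made-up intensities for five spots on a single two-channel array.
red   <- c(1200, 450, 3000,  800, 150)  # e.g. diseased sample
green <- c( 600, 450, 1500, 1600, 150)  # e.g. healthy sample

# Log-ratio: 0 means equal signal, +1 a two-fold excess of red,
# -1 a two-fold excess of green.
M <- log2(red / green)

# Average log-intensity of the spot across both channels.
A <- (log2(red) + log2(green)) / 2

data.frame(red, green, M, A)
```

Taking logs makes two-fold increases and two-fold decreases symmetric around zero, which is easier to interpret than a raw ratio.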

Microarray Analysis
• Microarrays are large datasets that often have poor precision
• Statistical challenges…
  • Account for the effect of background noise
  • Data normalization (remove non-biological variability)
  • Detecting/removing poor-quality features (flagging)
  • Multiple comparisons and clustering analysis (e.g. FDR, hierarchical clustering)
  • Network analysis (e.g. Gene Ontology)
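The multiple-comparisons problem listed above can be illustrated with base R alone. This is a sketch on simulated data, not a recommended analysis pipeline. The matrix size, the number of truly changed genes, and the size of the shift are all arbitrary assumptions.

```r
set.seed(1)
n_genes <- 1000

# Simulated expression matrix: genes in rows, 5 controls then 5 cases.
x <- matrix(rnorm(n_genes * 10), nrow = n_genes)
x[1:50, 6:10] <- x[1:50, 6:10] + 3   # only the first 50 genes truly differ
group <- factor(rep(c("control", "case"), each = 5))

# One two-sample t-test per gene.
p_raw <- apply(x, 1, function(g) t.test(g ~ group)$p.value)

# Benjamini-Hochberg adjustment, which controls the false discovery rate.
p_bh <- p.adjust(p_raw, method = "BH")

sum(p_raw < 0.05)  # genes called significant without correction
sum(p_bh  < 0.05)  # genes called significant after FDR control
```

With a thousand tests, an uncorrected 0.05 threshold is expected to flag some unchanged genes by chance alone. The adjusted p-values are never smaller than the raw ones, so the second count cannot exceed the first.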

Meta-Data
• Meta-data are data about the data
• Datasets in Bioconductor often have meta-data so you know something about the dataset
• sample.ExpressionSet is an example microarray dataset with meta-data, provided in Biobase
• It is of class ExpressionSet (an example of an S4 class). This class includes data describing the lab, the experiment, and an abstract that are all accessible in R.
> data(sample.ExpressionSet)
> sample.ExpressionSet

Exploring sample.ExpressionSet
• What information exists in the meta-data of sample.ExpressionSet?
  • Number of samples
  • Number of “features”
  • Protocol for data collection
  • Sample names
  • Annotation type
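Each of the questions above can also be answered with an accessor function rather than by reading the printed summary. The following sketch assumes Biobase is installed and uses only accessors that Biobase provides for ExpressionSet objects.

```r
library(Biobase)
data(sample.ExpressionSet)

dim(sample.ExpressionSet)                 # number of features and samples
sampleNames(sample.ExpressionSet)         # sample names
head(featureNames(sample.ExpressionSet))  # first few feature (probe) IDs
annotation(sample.ExpressionSet)          # annotation (chip) type
experimentData(sample.ExpressionSet)      # lab, experimenter, title, abstract
protocolData(sample.ExpressionSet)        # protocol information, if any was recorded
head(pData(sample.ExpressionSet))         # per-sample covariates
```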

Difference from S3 class object
• So how different is this from an S3 class object? Linear models fit using lm are S3 class objects, for example.
> x<-rnorm(100); y<-rnorm(100)
> fit<-lm(y~x)
> class(fit)
> names(fit)
> fit$coefficients
• What happens if we use some familiar R functions to look at sample.ExpressionSet?
> class(sample.ExpressionSet)
> names(sample.ExpressionSet)

S4 Commands
• There are sometimes slightly different commands and nuances for looking at an S4 class object in R
• Use “slotNames” rather than “names”
> slotNames(sample.ExpressionSet)
• Also use “@” rather than “$” to look at things within an S4 class object
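A short sketch of these S4 commands applied to sample.ExpressionSet, assuming Biobase is installed. The annotation slot is used as the example because it holds a simple character string.

```r
library(Biobase)
data(sample.ExpressionSet)

isS4(sample.ExpressionSet)        # TRUE for S4 objects
slotNames(sample.ExpressionSet)   # the S4 counterpart of names()

# Direct slot access with "@" ...
sample.ExpressionSet@annotation

# ... returns the same value as the accessor function for that slot.
identical(sample.ExpressionSet@annotation,
          annotation(sample.ExpressionSet))
```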

Accessing an ExpressionSet
• Accessing data and parts of the data using the @ symbol can be dangerous
• R does not provide a mechanism for protecting data (i.e. we can overwrite our data by accident)
• A better idea is to subset the parts of the data you want to handle
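The danger can be shown with a toy S4 class in base R. The class name Intensities and its validity rule are invented for this sketch and are not part of Biobase.

```r
library(methods)

# Toy class: a vector of intensities that must never be negative.
setClass("Intensities",
         representation(values = "numeric"),
         validity = function(object) {
           if (any(object@values < 0)) "intensities must be non-negative" else TRUE
         })

x <- new("Intensities", values = c(10, 250, 3000))

# Direct slot assignment checks the type of the new value,
# but it does not re-run the validity function.
x@values <- c(-1, -2, -3)

# Asking explicitly shows that the object is now invalid.
is_valid <- tryCatch({ validObject(x); TRUE }, error = function(e) FALSE)
is_valid
```

The assignment succeeds without complaint even though it breaks the rule the class was defined with. This is why working on a subset or copy, or going through accessor functions, is the safer habit.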

Exploring sample.ExpressionSet
• Although slotNames tells us what attributes sample.ExpressionSet has, we are interested in accessing the microarray data itself.
> abstract(sample.ExpressionSet)
> # Variable names of the data
> varMetadata(sample.ExpressionSet)
> # Names of the genes
> featureNames(sample.ExpressionSet)
> # Expression values for the genes
> exprs(sample.ExpressionSet)

Visualizing the Data
• Let’s look at the distribution of gene expression values for all of the arrays.
> dim(sample.ExpressionSet)
> # Density of expression values for the first array
> plot(density(exprs(sample.ExpressionSet)[,1]), xlim=c(0,6000), ylim=c(0, 0.006), main="Sample densities")
> # Overlay one density curve for each remaining array
> for (i in 2:ncol(sample.ExpressionSet)){ lines(density(exprs(sample.ExpressionSet)[,i]), col=i) }

Subsetting the data
• We can subset our microarray object just like a matrix.
• In gene array datasets, samples are columns and features are rows.
• Thus if we want a subset of samples (e.g. cases or controls), we subset on columns.
• However, if we are interested in particular probes, we subset on rows.
> sample.ExpressionSet$sex
> subESet<-sample.ExpressionSet[1:10,]
> exprs(sample.ExpressionSet)[1:10,]
> exprs(subESet)

Subsetting the data
• What if we only want to consider females?
> f.ids<-which(sample.ExpressionSet$sex=="Female")
> femalesESet<-sample.ExpressionSet[,f.ids]
• What if we want only the AFFX genes? We can use the command grep in this case...
> AFFX.ids<-grep("AFFX", featureNames(sample.ExpressionSet))
> AFFX.ESet<-sample.ExpressionSet[AFFX.ids,]
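Row and column subsetting can be combined in a single step. The sketch below assumes Biobase is installed and that the sample covariates include a type column with the value "Case" alongside sex. The object name femaleCaseAFFX is made up for the example.

```r
library(Biobase)
data(sample.ExpressionSet)

# Logical index over samples (columns): female cases only.
keep_samples <- sample.ExpressionSet$sex == "Female" &
                sample.ExpressionSet$type == "Case"

# Logical index over features (rows): IDs that start with "AFFX".
keep_features <- grepl("^AFFX", featureNames(sample.ExpressionSet))

# Rows are features and columns are samples, as with a matrix.
femaleCaseAFFX <- sample.ExpressionSet[keep_features, keep_samples]

dim(femaleCaseAFFX)
table(femaleCaseAFFX$sex, femaleCaseAFFX$type)
```

Because the result is still an ExpressionSet, the sample covariates and feature names are subset along with the expression values.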

Next Steps?
• Now that we are familiar with the data, we could go to the next step and do the analysis...
  • Pre-processing: assess the quality of the data and remove any probes we know to be non-informative
  • Look for differential expression using a machine learning technique
  • Annotation
  • Gene set enrichment ...
• Fortunately, Bioconductor provides workflows for many common analyses to help you get started.

• Although R has many statistical packages, packages in Bioconductor are designed for bioinformatics-type problems
• We have only touched on one small part of what is available
• For further help using Bioconductor
  • The Bioconductor website has workshops from previous years
  • There is also an annual users' group meeting
  • Package vignettes and help files also often contain examples with “real” data so you can work through an example