An Introduction to Bioconductor Bethany Wolf Statistical Computing I April 9, 2014.

Slides:

Advertisements

Similar presentations

AFGC Damares C Monte Carnegie Institution of Washington,

Advertisements

“BioMart is a query-oriented data management system developed jointly by the Ontario Institute for Cancer Research (OICR) and the.

Overview of Bioconductor

An Introduction to Bioconductor Bethany Wolf Statistical Computing I April 4, 2013.

Introduction to BioConductor Friday 23th nov 2007 Ståle Nygård Statistical methods and bioinformatics for the analysis of microarray.

Bioinformatics at WSU Matt Settles Bioinformatics Core Washington State University Wednesday, April 23, 2008 WSU Linux User Group (LUG)‏

1 MicroArray -- Data Analysis Cecilia Hansen & Dirk Repsilber Bioinformatics - 10p, October 2001.

Microarrays Dr Peter Smooker,

Microarray Data Preprocessing and Clustering Analysis

Kate Milova MolGen retreat March 24, Microarray experiments: Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.

‘Gene Shaving’ as a method for identifying distinct sets of genes with similar expression patterns Tim Randolph & Garth Tan Presentation for Stat 593E.

1 Microarray Cancer Data Visualization Analysis in Relation to Pharmacogenomics By Ngozi Nwana.

Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.

Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.

Introduction to R and Bioconductor BMI 731 Winter 2005

ViaLogy Lien Chung Jim Breaux, Ph.D. SoCalBSI 2004 “ Improvements to Microarray Analytical Methods and Development of Differential Expression Toolkit ”

Inferring the nature of the gene network connectivity Dynamic modeling of gene expression data Neal S. Holter, Amos Maritan, Marek Cieplak, Nina V. Fedoroff,

Microarray Data Analysis - A Brief Overview R Group Rongkun Shen

Why microarrays in a bioinformatics class? Design of chips Quantitation of signals Integration of the data Extraction of groups of genes with linked expression.

Kate Milova MolGen retreat March 24, Microarray experiments. Database and Analysis Tools. Kate Milova cDNA Microarray Facility March 24, 2005.

B IOINFORMATICS Dr. Aladdin HamwiehKhalid Al-shamaa Abdulqader Jighly Lecture 8 Analyzing Microarray Data Aleppo University Faculty of technical.

Microarray Preprocessing

with an emphasis on DNA microarrays

CS Machine Learning. What is Machine Learning? Adapt to / learn from data  To optimize a performance function Can be used to:  Extract knowledge.

CDNA Microarrays Neil Lawrence. Schedule Today: Introduction and Background 18 th AprilIntroduction and Background 25 th AprilcDNA Mircoarrays 2 nd MayNo.

Copyright 2000, Media Cybernetics, L.P. Array-Pro ® Analyzer Software.

Using Bayesian Networks to Analyze Expression Data N. Friedman, M. Linial, I. Nachman, D. Hebrew University.

Introduction to DNA Microarray Technology Steen Knudsen Uma Chandran.

STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Lab 4 R and Bioconductor II Feb 15, 2012 Alejandro Quiroz and Daniel Fernandez

CDNA Microarrays MB206.

ArrayCluster: an analytic tool for clustering, data visualization and module ﬁnder on gene expression proﬁles 組員：李祥豪謝紹陽江建霖.

Data Type 1: Microarrays

Panu Somervuo, March 19, cDNA microarrays.

Applying statistical tests to microarray data. Introduction to filtering Recall- Filtering is the process of deciding which genes in a microarray experiment.

Introduction to BioConductor 許家維許文馨游崇善陳彥如. Bioconductor BioConductor 起初是由 Fred Hutchinson 癌症研究中心發起的計畫，之後有許多來自不同國家的研究人員參與，這個計畫是一個為了分析理解基因體資料的開放源碼計劃。

Agenda Introduction to microarrays

Introduction to R / sma / Bioconductor Statistics for Microarray Data Analysis The Fields Institute for Research in Mathematical Sciences May 25, 2002.

Microarray - Leukemia vs. normal GeneChip System.

1 maxdLoad The maxd website: © 2002 Norman Morrison for Manchester Bioinformatics.

Bioconductor Course in Practical Microarray Analysis Heidelberg, 8 Oct 2003 Slides ©2002 Sandrine Dudoit, Robert Gentleman. Adapted by Wolfgang Huber.

ARK-Genomics: Centre for Comparative and Functional Genomics in Farm Animals Richard Talbot Roslin Institute and R(D)SVS University of Edinburgh Microarrays.

A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.

A Short Overview of Microarrays Tex Thompson Spring 2005.

Lawrence Hunter, Ph.D. Director, Computational Bioscience Program University of Colorado School of Medicine

Introduction to Statistical Analysis of Gene Expression Data Feng Hong Beespace meeting April 20, 2005.

1 Global expression analysis Monday 10/1: Intro* 1 page Project Overview Due Intro to R lab Wednesday 10/3: Stats & FDR - * read the paper! Monday 10/8:

Extracting quantitative information from proteomic 2-D gels Lecture in the bioinformatics course ”Gene expression and cell models” April 20, 2005 John.

Computational Biology and Bioinformatics Lab. Songhwan Hwang Functional Genomics DNA Microarray Technology.

Analysis of GEO datasets using GEO2R Parthav Jailwala CCR Collaborative Bioinformatics Resource CCR/NCI/NIH.

Statistics for Differential Expression Naomi Altman Oct. 06.

Nuria Lopez-Bigas Methods and tools in functional genomics (microarrays) BCO17.

Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be more direct, but is currently.

While gene expression data is widely available describing mRNA levels in different cancer cells lines, the molecular regulatory mechanisms responsible.

Microarray Technology. Introduction Introduction –Microarrays are extremely powerful ways to analyze gene expression. –Using a microarray, it is possible.

Gene expression & Clustering. Determining gene function Sequence comparison tells us if a gene is similar to another gene, e.g., in a new species –Dynamic.

ANALYSIS OF GENE EXPRESSION DATA. Gene expression data is a high-throughput data type (like DNA and protein sequences) that requires bioinformatic pattern.

Microarray Data Analysis The Bioinformatics side of the bench.

Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.

Statistical Analysis for Expression Experiments Heather Adams BeeSpace Doctoral Forum Thursday May 21, 2009.

Microarray: An Introduction

1 Survey of Biodata Analysis from a Data Mining Perspective Peter Bajcsy Jiawei Han Lei Liu Jiong Yang.

Microarray Technology and Data Analysis Roy Williams PhD Sanford | Burnham Medical Research Institute.

MESA A Simple Microarray Data Management Server. General MESA is a prototype web-based database solution for the massive amounts of initial data generated.

基于 R/Bioconductor 进行生物芯片数据分析曹宗富博奥生物有限公司

Detecting DNA with DNA probes arrays. DNA sequences can be detected by DNA probes and arrays (= collection of microscopic DNA spots attached to a solid.

Lab 5 Unsupervised and supervised clustering Feb 22 th 2012 Daniel Fernandez Alejandro Quiroz.

Microarray - Leukemia vs. normal GeneChip System.

Microarrays 1/31/2018.

Gene Expression Analysis

Course: Statistics in Bioinformatics Date: 指導教授: 陳光琦學生: 吳昱賢

Presentation transcript:

An Introduction to Bioconductor Bethany Wolf Statistical Computing I April 9, 2014

Overview  Background on Bioconductor project  Installation and Packages in Bioconductor  An example: working with microarray meta-data

Bioconductor  Biological experiments continually generate more data and larger datasets  Analysis of large datasets is nearly impossible without statistics and bioinformatics  Research groups often re-write the same software with slightly different purposes  Bioconductor includes a set of open-source/open- development tools that are employable in a broad number of biomedical research areas

Bioconductor Project  The Bioconductor project started in 2001  Goal: make it easier to conduct reproducible consistent analysis of data from new high- throughput biological technologies  Core maintainers of the Bioconductor website located at Fred Hutchinson Cancer Research Center  Updated version released biannually coinciding with the release of R  Like R, there are contributed software packages

Goals of the Bioconductor Project  Provide access to statistical and graphical tools for analysis of high-dimensional biological data  micro-array analysis  analysis of high-throughput  Include comprehensive documentation describing and providing examples for packages  Website provides sample workflows for different types of analysis  Packages have associated vignettes that provide examples of how to use functions  Have additional tools to work with publically available databases and other meta-data

Biological Question Experiment (e.g. Microarray) Image analysis Pre-processing Experimental Design Normalization Biological verification and interpretation Estimation Testing Clustering Prediction …..

Bioconductor website Lets take a look at the website...

Installing Bioconductor  All packages available in Bioconductor are run using R  Bioconductor must be installed within the R environment prior to installing and using Bioconductor packages > source(" > biocLite()

Bioconductor Packages  749 packages total (for now… there were 610 this time last year)  Biobase is the base package installed when you install Bioconductor  It includes several key packages (e.g. affy and limma) as well as several sample datasets > biocLite(“Biobase”) > library(Biobase)

Basic Classes of Packages  General infrastructure  Biobase, DynDoc, reposTools, rhdf5, ruuid, tkWidgets, widgetTools  Annotation  annotate, AnnBuilder  data packages  Graphics  geneplotter, hexbin  Pre-processing (affy and 2-channel arrays)  affy, affycomp, affydata, makecdfenv, limma, marrayClasses, marrayInpout, marrayNorm, marrayPlots, marrayTools, vsn  Differential gene expression  edd, genefilter, limma, multtest, ROC, siggenes  Graphs and Networks  graph, RBGL, Rgraphviz  Flow Cytometry  prada, flowCore, flowViz, flowUtils  Protein Interactions  ppiData, ppiStats, ScISI, Rintact  An so on…

Help Files for Bioconductor Packages  Like R, there are help files available for Bioconductor packages.  They can be accessed in several ways. > help(Biobase) > library(help=”Biobase”) > browseVignettes(package=”Biobase”) OR Use the Vignettes pull down menu in R  Note vignettes often contains more information than a traditional R help page.

Package Nuances  Similar to R packages and are loaded into and used in R  However, Bioconductor makes more use of the S4 class system from R  R packages typically use the S3 class system. The difference...  S4 more formal and rigorous (makes it somewhat more complicated than R)  If you really want to know more about the S4 class system you can check out

Example Use: Microarray Experiments  Microarrays are collections of microscopic DNA spots attached to solid surface  Spots contain probes, i.e. short segments of DNA gene sections  Probes hybridize with cDNA or cRNA in sample (targets)  Fluorescent probes used to quantify relative abundance of targets  Can be used to measure expression level, change in expression, SNPs,...

Gene Detection  1-Channel array: hybridized cDNA from single sample to array and measure intensity  label sample with a single fluorophore  compare relative intensity to a reference sample done on a separate chip  2-Channel arrays: hybridized cDNA for two samples (e.g. diseased vs. healthy tissue)  label each with one of two different fluorophores  mix two samples and apply to single microarray  look at fluorescence at 2 wavelengths corresponding to each fluorophore  measure ratio of intensity for each fluorophore

Microarray Analysis  Microarrays are large datasets that often have poor precision  Statistical challenges…  Account for effect of background noise  Data normalization (remove non-biological variability)  Detecting/removing poor quality or low quality feature (flagging)  Multiple comparisons and clustering analysis (e.g. FDR, hierarchical clustering)  Network analysis (e.g. Gene Ontology)

Meta-Data  Meta-data are data about the data  Datasets in Bioconductor often have meta-data so you know something about the dataset  sample.ExpressionSet is an example of microarray meta-data provided in Biobase  It is of class ExpressionSet (example of an S4 class). This class includes data describing the lab, the experiment, and an abstract that are all accessible in R. > data(sample.ExpressionSet) > sample.ExpressionSet

Exploring sample.ExpressionSet  What information exists in the meta-data sample.ExpressionSet  Number of sample  Number of “features”  Protocol for data collection  Sample names  Annotation type

Difference from S3 class object  So how different is this from a S3 class object? Linear models fit using lm are S3 class objects for example. > x<-rnorm(100); y<-rnorm(100) > fit<-lm(y~x) > class(fit) > names(fit) > fit$coefficients  What happens if we use some familiar R functions to look at sample.ExpressionSet? > class(sample.ExpressionSet) > names(sample.ExpressionSet)

S4 Commands  There are sometimes slightly different commands and nuances to look at an S4 class object in R  Use “slotNames” rather than “names” >slotNames(sample.ExpressionSet)  Also use rather than “$” to look things within an S4 class object

Accessing and Expression Set  Accessing data and parts of the data using the symbol can be dangerous  R does not provide a mechanism for protecting data (i.e. we can overwrite our data by accident)  A better idea is to subset the parts of the data you want to handle

Exploring sample.ExpressionSet  Although slotNames tells us what attributes sample.ExpressionSet has,  we are interested in accessing the microarray data itself. > abstract(sample.ExpressionSet) [1] "An example object of expression set (ExpressionSet) class" > varMetadata(sample.ExpressionSet) labelDescription sex Female/Male type Case/Control score Testing Score

Exploring sample.ExpressionSet  Accessing the microarray data itself. > #Names of the genes > featureNames(sample.ExpressionSet) [1] "AFFX-MurIL2_at" "AFFX-MurIL10_at" [3] "AFFX-MurIL4_at" "AFFX-MurFAS_at" … > exprs(sample.ExpressionSet)[1:5,1:5] A B C D E AFFX-MurIL2_at AFFX-MurIL10_at AFFX-MurIL4_at AFFX-MurFAS_at AFFX-BioB-5_at

Visualizing the Data  Let’s look at the distribution of gene expression values for all of the arrays. > dim(sample.ExpressionSet) Features Samples > plot(density(exprs(sample.ExpressionSet)[,1]), xlim=c(0,6000), ylim=c(0, 0.006), main="Sample densities")

Visualizing the Data

 What about the distribution of several gene expression values for all of the arrays. >plot(density(exprs(sample.ExpressionSet)[,1]), xlim=c(0,6000), ylim=c(0, 0.006), main="Sample densities") >for (i in 2:25){ lines(density(exprs(sample.ExpressionSet)[,i]), col=i) }

Visualizing the Data

Subsetting the data  We can subset our microarray object just like a matrix.  In gene array datasets, samples are columns and features are rows.  Thus if we want to subset of samples (i.e. things like cases or controls) we want columns.  However if we are interested in particular probes, we subset on rows. > sample.ExpressionSet$sex > subESet<-sample.ExpressionSet[1:10,] > exprs(sample.ExpressionSet)[1:10,] > exprs(subESet)

Subsetting the data  What if we only want to consider females? > f.ids<-which(sample.ExpressionSet$sex==”Female”) > femalesESet<-sample.ExpressionSet[,f.ids]  What if we only want to only AFFX genes? We can use the command grep in this case... > AFFX.ids<-grep(“AFFX”, featureNames(sample.ExpressionSet)) > AFFX.ESet<-sample.ExpressionSet[AFFX.ids,]

Next Steps?  Now we are familiar with the data, we could go the next step and do the analysis...  Pre-processing: assess quality of the data, remove any probes we know to be non-informative  Look for differential expression using a machine learning technique  Annotation  Gene set enrichment ...  Fortunately, Bioconductor provides workflows for many common analyses to help you get started.

 Although R has many statistical packages, packages in Bioconductor are designed for bioinformatics type problems  We have only touched on one small part of what is available  For further help using Bioconductor  The Bioconductor website has workshops from previous years  There is also an annual User’s group meeting  Package vignettes and help files also often contain examples with “real” data so you can work through and example