STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Lab 4 R and Bioconductor II Feb 15, 2012 Alejandro Quiroz and Daniel Fernandez

Contact: dfernan@gmail.com, alejandritoqz@gmail.com

www.bioconductor.org
–Provides tools for the analysis of high-throughput genomic data
  Software, data, documentation
  Training materials
  Mailing list
–Based on R
  Open, so we can conduct our own analysis

What can Bioconductor do?
–Microarrays: handle data from diverse platforms (Affymetrix, Illumina, etc.); perform expression, exon, copy number, SNP and other analyses
–Sequence data: import FASTA/FASTQ, Bowtie, BAM and other sequence formats; perform quality assessment, ChIP-seq, etc.
–Annotation: access GO, KEGG, NCBI and other sources of annotation
–High throughput assays: analyze flow cytometry, mass spec, cell-based and other assays

Outline
Installation
Packages
Microarray data analysis
–Affymetrix files
–Low level analysis
–High level analysis

Installation
There are two types of installation:
–Core packages
  > source("http://bioconductor.org/biocLite.R")
  > biocLite()
–Other packages
  > source("http://bioconductor.org/biocLite.R")
  > biocLite(c("pkg1", "pkg2", …, "pkgN"))
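The biocLite() script was the installer when this lab was given (2012). Current Bioconductor releases install packages through the BiocManager package from CRAN instead. The helper below is a minimal sketch of that route; the name install_bioc is hypothetical and not part of the lab, and the function is only defined here, not called, so nothing is downloaded until you run it yourself.

```r
# Hypothetical helper (not part of the original lab): install Bioconductor
# packages through BiocManager, the installer used by current releases.
install_bioc <- function(pkgs) {
  # BiocManager itself lives on CRAN, so install it first if it is missing.
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  # Install the requested Bioconductor packages.
  BiocManager::install(pkgs)
}
```

For the packages used later in this lab, a call would look like install_bioc(c("affy", "affyPLM")).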

Bioconductor Packages
View the installed packages:
–rownames(installed.packages())
General infrastructure: Biobase, DynDoc, reposTools, ruuid, tkWidgets, widgetTools, Biostrings, multtest
Annotation: annotate, AnnBuilder, data packages
Graphics: geneplotter, hexbin
Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn, gcrma
Pre-processing two-color spotted DNA microarray data: marray, vsn, arrayMagic, arrayQuality
Differential gene expression: edd, genefilter, limma, ROC, siggenes, EBarrays, factDesign
Graphs and networks: graph, RBGL, Rgraphviz
Other data: SAGElyzer, DNAcopy, PROcess, aCGH

Microarray data analysis

Affymetrix data
Each gene (or portion of a gene) is represented by 11 to 20 oligonucleotides of 25 base-pairs.
Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer.

Affymetrix data
Perfect match (PM): a 25-mer complementary to a reference sequence of interest (e.g., part of a gene).
Mismatch (MM): same as PM but with a single homomeric base change at the middle (13th) base (transversion purine ↔ pyrimidine, G ↔ C, A ↔ T).
–The purpose of the MM probe design is to measure non-specific binding and background noise.
Probe-pair: a (PM, MM) pair.
Probe-pair set: a collection of probe-pairs (11 to 20) related to a common gene or fraction of a gene.
Affy ID: an identifier for a probe-pair set.

Affy Microarray data
DAT file
–Raw (TIFF) optical image of the hybridized chip
CEL file
–Cell intensity file; stores the results of the intensity calculations on the pixel values of the DAT file
CDF (Chip Description File)
–Provided by Affy; describes the probe array design, characteristics, probe utilization and content, and scanning and analysis parameters. These files are unique for each probe array type.

Affymetrix Data Flow
[Diagram] Hybridized GeneChip → scan chip → DAT file → process image → CEL file; CEL file + CDF file → MAS4, MAS5, RMA, Quantile → High Level Analysis

Microarray analysis
Download the data set (GEO series accession):
–GSE10940
The R script has to be in the same folder as the .CEL files
The data set contains 12 .CEL files
  > library(affy)
  > data.affy=ReadAffy()
What is the name of the CDF file?
How many genes are considered on the arrays?
What is the annotation version?

The data set
Secretory and transmembrane proteins traverse the endoplasmic reticulum (ER) and Golgi compartments for final maturation prior to reaching their functional destinations.
Members of the p24 protein family function in trafficking some secretory proteins in yeast and higher eukaryotes.
Yeast p24 mutants have minor secretory defects and induce an ER stress response that likely results from accumulation of proteins in the ER due to disrupted trafficking.
Test the hypothesis that loss of Drosophila melanogaster p24 protein function causes a transcriptional response characteristic of ER stress activation.

Looking at RAW data
Low-level analysis
MA plot
  > MAplot(data.affy, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")
Image of an array
  > image(data.affy)
Density of the log intensities of the arrays
  > hist(data.affy)
Boxplot of the data
  > boxplot(data.affy, col=seq(2,7,by=1))
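For readers new to MA plots: for a pair of arrays, M is the difference of the log2 intensities and A is their average, so well-behaved arrays give a cloud centred on M = 0. The sketch below computes both on simulated intensities using base R only. The data are made up for illustration and are not from GSE10940.

```r
# Simulated raw intensities for two arrays (illustration only).
set.seed(42)
x1 <- 2^rnorm(1000, mean = 8, sd = 1.5)        # array 1
x2 <- x1 * 2^rnorm(1000, mean = 0, sd = 0.2)   # array 2 = array 1 plus noise

# M: log-ratio between the two arrays; A: average log-intensity.
M <- log2(x1) - log2(x2)
A <- (log2(x1) + log2(x2)) / 2
ma <- data.frame(A = A, M = M)

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(ma$A, ma$M, pch = ".", xlab = "A", ylab = "M")
  abline(h = 0, col = "red")
}
```

A systematic shift or curvature of the cloud away from the M = 0 line is a sign that the two arrays need normalization.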

Normalization
  > data.rma=rma(data.affy)
Install the package affyPLM (along with its dependencies) to view the MA plot after normalization
  > MAplot(data.rma, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")
  > expr.rma=exprs(data.rma) # Puts data in a table
  > boxplot(data.frame(expr.rma), col=seq(2,7,by=1))
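RMA includes a quantile normalization step, which forces every array to share the same intensity distribution. The function below is a toy base-R sketch of that idea only. It is not the implementation used by rma(): ties are broken by order of appearance, and the function name and simulated data are assumptions made for illustration.

```r
# Toy quantile normalization (illustration only, not the code behind rma()).
quantile_normalize <- function(x) {
  x <- as.matrix(x)
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each value within its array
  sorted <- apply(x, 2, sort)                          # each array sorted in increasing order
  ref    <- rowMeans(sorted)                           # reference: mean of the k-th smallest values
  out    <- apply(ranks, 2, function(r) ref[r])        # replace each value by the reference value of its rank
  dimnames(out) <- dimnames(x)
  out
}

# Three simulated arrays with different location and spread.
set.seed(7)
raw <- cbind(a1 = rnorm(500, 7, 1),
             a2 = rnorm(500, 8, 1.5),
             a3 = rnorm(500, 7.5, 0.8))
normalized <- quantile_normalize(raw)
```

After this step the sorted values of every column are identical, which is why the boxplots of normalized arrays line up.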

Before moving forward… affy probeset names
  > rownames(expr.rma)[1:100]
Suffixes are meaningful, for example:
–_at: hybridizes to unique antisense transcript for this chip
–_s_at: all probes cross hybridize to a specified set of sequences
–_a_at: all probes cross hybridize to a specified gene family
–_x_at: at least some probes cross hybridize with other target sequences for this chip
–_r_at: rules dropped
–and many more…
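To see which suffix types occur on an array, the probe-set names can be split with a regular expression. This is a small base-R sketch; the IDs are illustrative examples rather than names taken from the lab's array, and the pattern only covers single-letter suffixes such as those listed above.

```r
# Illustrative probe-set IDs (not taken from the array used in this lab).
ids <- c("1053_at", "1007_s_at", "1552256_a_at", "1552309_x_at", "AFFX-BioB-5_at")

# Extract the trailing suffix: an optional "_<letter>" followed by "_at".
suffix <- regmatches(ids, regexpr("(_[a-z])?_at$", ids))

# Count how many probe sets carry each suffix.
suffix.counts <- table(suffix)

# Strip the suffix to keep only the base identifier.
base.id <- sub("(_[a-z])?_at$", "", ids)
```

On real data, ids would be rownames(expr.rma). Anchoring the pattern with $ ensures that only a trailing suffix is removed.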

Custom CDF files
The most popular platform for genome-wide expression profiling is the Affymetrix GeneChip.
However, its selection of probes relied on earlier genome and transcriptome annotation, which is significantly different from current knowledge.
The resultant informatics problems have a profound impact on analysis and interpretation of the data.

Custom CDF files
One solution: Dai, M. et al. (2005)
They reorganized probes on more than a dozen popular GeneChips
Comparing analysis results between the original and the redefined probe sets
–Reveals ~30–50% discrepancy in the genes previously identified as differentially expressed, regardless of analysis method.

Custom CDF files
Go to:
–http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/13.0.0/refseq.asp
–Download the Drosophila melanogaster RefSeq CDF annotation corresponding to the Affy array analyzed
–Install and load it in R
  R CMD INSTALL…
  > data.affy@cdfName="drosophila2dmrefseqcdf"
  > data.rma.refseq=rma(data.affy)
  > expr.rma.refseq=exprs(data.rma.refseq)

High-level analysis
Perform a comparison between the control group and the experimental group
–Objective: obtain the most significant genes with an FDR of 5% and with a fold change of 1
–Information provided in "SamplePhenotype.csv" to obtain control and mutant ids
  > sample.ids=read.csv("SamplePhenotype.csv",header=F)
  > control=grep("Control",sample.ids[,2])
  > mutants=grep("Logjam",sample.ids[,2])
–Obtain just the RefSeq ids
  > genes_t=matrix(rownames(expr.rma.refseq))
  > genes.refseq=apply(genes_t,1,function(x) sub("_at","",x))

Calculating the fold change for every gene (RMA values are on the log2 scale, so this difference of means is a log2 fold change; the RefSeq-based matrix is used so that rows match genes.refseq)
  > foldchange=apply(expr.rma.refseq, 1, function(x) mean( x[mutants] ) - mean( x[control] ) )
Perform a t-test and obtain the p-values
  > T.p.value=apply(expr.rma.refseq, 1, function(x) t.test( x[mutants], x[control], var.equal=T )$p.value )
Calculating the FDR
  > fdr=p.adjust(T.p.value, method="fdr")
THE GENES
  > genes.up=genes.refseq[ which( fdr < 0.05 & foldchange > 0 ) ]
  > genes.down=genes.refseq[ which( fdr < 0.05 & foldchange < 0 ) ]
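The same three steps (fold change, per-gene t-test, FDR adjustment) can be tried without the CEL files on a simulated log2 expression matrix. In this sketch the matrix, the gene names and the spiked-in effects are all made up for illustration; only the column layout (6 controls followed by 6 mutants) mirrors the lab.

```r
set.seed(1)
n.genes <- 200
control <- 1:6     # columns holding control arrays
mutants <- 7:12    # columns holding mutant arrays

# Simulated log2 expression values (illustration only).
expr <- matrix(rnorm(n.genes * 12, mean = 8, sd = 0.3), nrow = n.genes,
               dimnames = list(paste0("gene", seq_len(n.genes)), NULL))
expr[1:10,  mutants] <- expr[1:10,  mutants] + 3   # genes spiked up in mutants
expr[11:20, mutants] <- expr[11:20, mutants] - 3   # genes spiked down in mutants

# log2 fold change: difference of group means on the log2 scale.
foldchange <- rowMeans(expr[, mutants]) - rowMeans(expr[, control])

# Equal-variance two-sample t-test per gene.
p.value <- apply(expr, 1, function(x)
  t.test(x[mutants], x[control], var.equal = TRUE)$p.value)

# Benjamini-Hochberg adjustment ("fdr" is an alias for "BH" in p.adjust).
fdr <- p.adjust(p.value, method = "fdr")

genes.up   <- rownames(expr)[fdr < 0.05 & foldchange > 0]
genes.down <- rownames(expr)[fdr < 0.05 & foldchange < 0]
```

Because the spiked effects are large relative to the noise, the spiked genes should appear in genes.up and genes.down; a few unspiked genes may also pass, which is what an FDR threshold permits.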

Results
Provide a .csv file with the list of significant genes with an FDR of 5% and with a fold change of 1
Provide a heatmap with the significant genes
  > genes.ids=c(which( fdr < 0.05 & foldchange > 0 ),which( fdr < 0.05 & foldchange < 0 ))
  > colnames(expr.rma.refseq)=c(rep("Control",6),rep("Mutant",6))
  > heatmap(expr.rma.refseq[genes.ids,],margins=c(5,10))
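One way to produce the requested .csv file is to collect the gene IDs, fold changes and FDR values in a data frame and write out the significant rows. The sketch below uses a small made-up results table and writes to a temporary directory; the gene names, numbers and file name are placeholders, not results from the lab data.

```r
# Hypothetical results table (placeholder values, not real output).
results <- data.frame(
  gene       = c("geneA", "geneB", "geneC"),
  foldchange = c(1.8, -2.1, 0.2),
  fdr        = c(0.01, 0.03, 0.40),
  stringsAsFactors = FALSE
)

# Keep genes passing the 5% FDR threshold and label their direction.
significant <- results[results$fdr < 0.05, ]
significant$direction <- ifelse(significant$foldchange > 0, "up", "down")

# Write the list to a .csv file (temporary directory used here).
out.file <- file.path(tempdir(), "significant_genes.csv")
write.csv(significant, out.file, row.names = FALSE)

# Read it back to confirm the file content.
check <- read.csv(out.file)
```

In the lab itself the table would be built from genes.refseq, foldchange and fdr, and the file would be written to the working directory.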

Beyond the gene list paradigm
http://david.abcc.ncifcrf.gov
