STAT115 STAT225 BIST512 BIO298 - Intro to Computational Biology Lab 4 R and Bioconductor II Feb 15, 2012 Alejandro Quiroz and Daniel Fernandez


Bioconductor
–Provides tools for the analysis of high-throughput genomic data
   Software, data, documentation
   Training materials
   Mailing list
–Based on R
   Open, so we can conduct our own analysis

What can Bioconductor do?
Microarrays
–Handle data from diverse platforms: Affymetrix, Illumina, etc.
–Perform expression, exon, copy number, SNP and other analyses
Sequence data
–Import FASTQ, Bowtie, BAM and other sequence formats
–Perform quality assessment, ChIP-seq analysis, etc.
Annotation
–Access GO, KEGG, NCBI and other sources of annotation
High-throughput assays
–Analyze flow cytometric, mass spec, cell-based and other assays

Outline
Installation
Packages
Microarray data analysis
–Affymetrix files
–Low-level analysis
–High-level analysis

Installation
There are two types of installation
–Core packages
> source(“
> biocLite()
–Other packages
> source(“
> biocLite(c("pkg1", "pkg2", …, "pkgN"))
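The biocLite() workflow dates from 2012; current Bioconductor releases install packages through the BiocManager package instead. The following is a minimal sketch of that newer route, assuming a recent R installation. The helper name missing_packages and the package list are illustrative choices, not part of the lab, and the install calls are left commented out because they need network access.

```r
# Return the packages from `pkgs` that are not yet installed.
# (missing_packages is a hypothetical helper, not a Bioconductor function.)
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

wanted <- c("affy", "affyPLM", "limma")   # example package list (assumption)
to_install <- missing_packages(wanted)

# Installation needs network access, so it is left commented out here.
# if (!requireNamespace("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# if (length(to_install) > 0)
#   BiocManager::install(to_install)

print(to_install)
```

Checking first which packages are missing avoids reinstalling everything each time the lab script is run.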

Bioconductor Packages
View the installed packages:
> rownames(installed.packages())
General infrastructure: Biobase, DynDoc, reposTools, ruuid, tkWidgets, widgetTools, Biostrings, multtest
Annotation: annotate, AnnBuilder → data packages
Graphics: geneplotter, hexbin
Pre-processing Affymetrix oligonucleotide chip data: affy, affycomp, affydata, makecdfenv, vsn, gcrma
Pre-processing two-color spotted DNA microarray data: marray, vsn, arrayMagic, arrayQuality
Differential gene expression: edd, genefilter, limma, ROC, siggenes, EBarrays, factDesign
Graphs and networks: graph, RBGL, Rgraphviz
Other data: SAGElyzer, DNAcopy, PROcess, aCGH

Microarray data analysis

Affymetrix data
Each gene (or portion of a gene) is represented by 11 to 20 oligonucleotides of 25 base-pairs.
Probe: an oligonucleotide of 25 base-pairs, i.e., a 25-mer.

Affymetrix data
Perfect match (PM): a 25-mer complementary to a reference sequence of interest (e.g., part of a gene).
Mismatch (MM): same as PM but with a single homomeric base change for the middle (13th) base (transversion purine <-> pyrimidine, G <-> C, A <-> T).
–The purpose of the MM probe design is to measure non-specific binding and background noise.
Probe-pair: a (PM, MM) pair.
Probe-pair set: a collection of probe-pairs (11 to 20) related to a common gene or fraction of a gene.
Affy ID: an identifier for a probe-pair set.
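The PM/MM relationship described above can be sketched in a few lines of base R. The function name make_mm and the example sequence are made up for illustration; this is not how Affymetrix generates its probe designs, only a demonstration of the rule that the 13th base is swapped for its complement.

```r
# Complement lookup for DNA bases (G <-> C, A <-> T).
complement <- c(A = "T", T = "A", G = "C", C = "G")

# Build the mismatch (MM) probe for a 25-mer perfect-match (PM) probe
# by replacing the middle (13th) base with its complement.
# (make_mm is a hypothetical helper name.)
make_mm <- function(pm) {
  stopifnot(nchar(pm) == 25)
  bases <- strsplit(pm, "")[[1]]
  bases[13] <- complement[[bases[13]]]
  paste(bases, collapse = "")
}

pm <- "ACGTACGTACGTACGTACGTACGTA"  # made-up 25-mer, not a real probe
mm <- make_mm(pm)
print(c(PM = pm, MM = mm))
```

The two sequences differ only at position 13, which is what lets the MM probe act as a control for non-specific binding.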

Affy Microarray data
DAT file
–Raw (TIFF) optical image of the hybridized chip
CEL file
–Cell intensity file; stores the results of the intensity calculations on the pixel values of the DAT file
CDF (Chip Description File)
–Provided by Affy; describes the probe array design, characteristics, probe utilization and content, and scanning and analysis parameters. These files are unique for each probe array type.

Affymetrix Data Flow
Hybridized GeneChip → [scan chip] → DAT file → [process image] → CEL file
CEL file + CDF file → MAS4, MAS5, RMA, quantile → High-level analysis

Microarray analysis
Download the data set (GEO series accession):
–GSE10940
The R script has to be in the same folder as the .CEL files
The data set contains 12 .CEL files
> library(affy)
> data.affy=ReadAffy()
What is the name of the CDF file?
How many genes are considered on the arrays?
What is the annotation version?
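One possible way to inspect the loaded object when answering these questions is sketched below. It assumes the affy package is installed and the .CEL files sit in the working directory; the block is guarded so that it does nothing otherwise. Note that annotation() reports the annotation name attached to the object, which may not by itself give a version number.

```r
# Guarded so the block is a no-op when affy or the .CEL files are absent.
have_data <- requireNamespace("affy", quietly = TRUE) &&
  length(affy::list.celfiles()) > 0

if (have_data) {
  library(affy)
  data.affy <- ReadAffy()                 # reads every .CEL file in the working directory
  print(cdfName(data.affy))               # name of the CDF associated with the arrays
  print(annotation(data.affy))            # annotation name stored in the object
  print(length(featureNames(data.affy)))  # number of probe sets defined by the CDF
  print(sampleNames(data.affy))           # one entry per .CEL file
}
```

The count returned by featureNames() is a number of probe sets, so it is only an approximation of the number of genes on the array.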

The data set
Secretory and transmembrane proteins traverse the endoplasmic reticulum (ER) and Golgi compartments for final maturation prior to reaching their functional destinations.
Members of the p24 protein family function in trafficking some secretory proteins in yeast and higher eukaryotes.
Yeast p24 mutants have minor secretory defects and induce an ER stress response that likely results from accumulation of proteins in the ER due to disrupted trafficking.
Test the hypothesis that loss of Drosophila melanogaster p24 protein function causes a transcriptional response characteristic of ER stress activation.

Looking at raw data
Low-level analysis
MA plot
> MAplot(data.affy, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")
Image of an array
> image(data.affy)
Density of the log intensities of the arrays
> hist(data.affy)
Boxplot of the data
> boxplot(data.affy, col=seq(2,7,by=1))
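To make clear what an MA plot displays, here is a minimal base-R sketch on simulated intensities for two arrays. The data are random numbers, not GSE10940, and plot() is used in place of the affy MAplot() function: M is the log2 ratio of the two arrays and A is their average log2 intensity.

```r
set.seed(1)

# Simulated raw intensities for two arrays (not real data).
x1 <- 2^rnorm(1000, mean = 8, sd = 1.5)
x2 <- x1 * 2^rnorm(1000, mean = 0.3, sd = 0.4)  # array 2 is systematically brighter

M <- log2(x1) - log2(x2)         # log ratio between the two arrays
A <- (log2(x1) + log2(x2)) / 2   # average log intensity

plot(A, M, pch = ".", xlab = "A (mean log2 intensity)", ylab = "M (log2 ratio)")
abline(h = 0, col = "red")       # M = 0 means no difference between arrays
print(median(M))
```

A cloud of points that is shifted away from the M = 0 line, as in this simulation, indicates a systematic intensity difference between arrays that normalization is meant to remove.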

Normalization
> data.rma=rma(data.affy)
Install the package affyPLM (along with its dependencies) to view the MA plot after normalization
> library(affyPLM)
> MAplot(data.rma, pairs = TRUE, which=c(1,2,3,4), plot.method = "smoothScatter")
> expr.rma=exprs(data.rma) # Extracts the expression values as a matrix
> boxplot(data.frame(expr.rma), col=seq(2,7,by=1))
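rma() combines background correction, quantile normalization and median-polish summarization. The sketch below illustrates only the quantile normalization step in base R on simulated data; quantile_normalize is a hypothetical helper written for this example and is not the implementation used by the affy package.

```r
# Force every column (array) to share the same empirical distribution.
# (quantile_normalize is an illustrative helper, not an affy function.)
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank of each value within its array
  sorted <- apply(x, 2, sort)                         # each array sorted independently
  target <- rowMeans(sorted)                          # reference distribution: mean of the k-th smallest values
  out <- apply(ranks, 2, function(r) target[r])       # map ranks back onto the reference
  dimnames(out) <- dimnames(x)
  out
}

set.seed(2)
# Three simulated arrays with different location and spread (not real data).
x <- cbind(a = rnorm(200, 7, 1), b = rnorm(200, 8, 2), c = rnorm(200, 6.5, 0.5))
x.qn <- quantile_normalize(x)

print(apply(x.qn, 2, quantile))  # the columns now share identical quantiles
```

After this step the boxplots of all arrays coincide, which is the behaviour expected from the boxplot of the RMA-processed data.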

Before moving forward… affy probeset names
> rownames(expr.rma)[1:100]
Suffixes are meaningful, for example:
–_at: hybridizes to unique antisense transcript for this chip
–_s_at: all probes cross hybridize to a specified set of sequences
–_a_at: all probes cross hybridize to a specified gene family
–_x_at: at least some probes cross hybridize with other target sequences for this chip
–_r_at: rules dropped
–and many more…
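A quick way to see which suffixes occur on a chip is to extract them with a regular expression and tabulate them. The sketch below uses a short vector of example IDs written in the Affymetrix naming style (an assumption for illustration); on real data the vector would be rownames(expr.rma).

```r
# Example probe set IDs in the Affymetrix naming style (illustrative only).
ids <- c("1007_s_at", "1053_at", "117_at", "1552256_a_at",
         "1552263_x_at", "AFFX-BioB-5_at", "1294_r_at")

# Keep the trailing suffix: an optional single-letter tag followed by "_at".
suffix <- sub("^.*?((_[a-z])?_at)$", "\\1", ids, perl = TRUE)

print(table(suffix))
```

Tabulating the suffixes gives a rough idea of how many probe sets may cross hybridize before any gene-level interpretation is attempted.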

Custom CDF files
The most popular platform for genome-wide expression profiling is the Affymetrix GeneChip.
However, its selection of probes relied on earlier genome and transcriptome annotation, which is significantly different from current knowledge.
The resulting informatics problems have a profound impact on the analysis and interpretation of the data.

Custom CDF files
One solution: Dai, M. et al. (2005)
They reorganized probes on more than a dozen popular GeneChips
Comparing analysis results between the original and the redefined probe sets
–Reveals ~30–50% discrepancy in the genes previously identified as differentially expressed, regardless of analysis method.

Custom CDF files
Go to:
–http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/13.0.0/refseq.asp
–Download the Drosophila melanogaster RefSeq CDF annotation corresponding to the Affy array analyzed
–Install and load it in R
 R CMD INSTALL…
> data.rma.refseq=rma(data.affy)
> expr.rma.refseq=exprs(data.rma.refseq)

High-level analysis
Perform a comparison between the control group and the experimental group
–Objective: obtain the most significant genes with an FDR of 5% and with a fold change of 1
–Information provided in "SamplePhenotype.csv" to obtain control and mutant ids
> sample.ids=read.csv("SamplePhenotype.csv",header=F)
> control=grep("Control",sample.ids[,2])
> mutants=grep("Logjam",sample.ids[,2])
–Obtain just the RefSeq ids
> genes_t=matrix(rownames(expr.rma.refseq))
> genes.refseq=apply(genes_t,1,function(x) sub("_at","",x))

Calculating the fold change for every gene (a difference of means, since RMA values are on the log2 scale)
> foldchange=apply(expr.rma.refseq, 1, function(x) mean( x[mutants] ) - mean( x[control] ) )
Perform a t-test and obtain the p-values
> T.p.value=apply(expr.rma.refseq, 1, function(x) t.test( x[mutants], x[control], var.equal=T )$p.value )
Calculating the FDR
> fdr=p.adjust(T.p.value, method="fdr")
The genes
> genes.up=genes.refseq[ which( fdr < 0.05 & foldchange > 0 ) ]
> genes.down=genes.refseq[ which( fdr < 0.05 & foldchange < 0 ) ]
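The same fold change, t-test and FDR steps can be tried on simulated data before running them on the real arrays. The sketch below assumes a 500-gene matrix with 6 control and 6 mutant columns and spikes in 50 differentially expressed genes; all values and gene names are invented. Unlike the commands above, it also applies the fold-change cutoff of 1 (on the log2 scale) mentioned in the objective.

```r
set.seed(3)
n.genes <- 500
control <- 1:6    # column indices of the control arrays (assumption)
mutants <- 7:12   # column indices of the mutant arrays (assumption)

# Simulated log2 expression matrix (not real data).
expr <- matrix(rnorm(n.genes * 12, mean = 8, sd = 0.3), nrow = n.genes,
               dimnames = list(paste0("gene", 1:n.genes), NULL))

# Spike in 25 up- and 25 down-regulated genes (2 log2 units each).
expr[1:25, mutants]  <- expr[1:25, mutants] + 2
expr[26:50, mutants] <- expr[26:50, mutants] - 2

# Log2 fold change: difference of group means on the log2 scale.
foldchange <- rowMeans(expr[, mutants]) - rowMeans(expr[, control])

# Per-gene two-sample t-test with equal variances.
p.value <- apply(expr, 1, function(x)
  t.test(x[mutants], x[control], var.equal = TRUE)$p.value)

# Benjamini-Hochberg adjustment ("fdr" is an alias for "BH" in p.adjust).
fdr <- p.adjust(p.value, method = "fdr")

genes.up   <- rownames(expr)[fdr < 0.05 & foldchange > 1]
genes.down <- rownames(expr)[fdr < 0.05 & foldchange < -1]
print(c(up = length(genes.up), down = length(genes.down)))
```

Using rowMeans() instead of apply() for the fold change gives the same numbers and is noticeably faster on full-size arrays.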

Results
Provide a .csv file with the list of significant genes with an FDR of 5% and with a fold change of 1
Provide a heatmap with the significant genes
> genes.ids=c(which( fdr < 0.05 & foldchange > 0 ),which( fdr < 0.05 & foldchange < 0 ))
> colnames(expr.rma.refseq)=c(rep("Control",6),rep("Mutant",6))
> heatmap(expr.rma.refseq[genes.ids,],margins=c(5,10))
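Writing the gene list to a .csv file and drawing the heatmap could look like the sketch below. It runs on a small simulated matrix so that it is self-contained; the output file name, the temporary directory and the simple fold-change filter are assumptions made for the example, not requirements of the lab.

```r
set.seed(4)

# Simulated log2 expression for 20 genes, 6 controls and 6 mutants (not real data).
expr <- matrix(rnorm(20 * 12, mean = 8), nrow = 20,
               dimnames = list(paste0("gene", 1:20),
                               c(rep("Control", 6), rep("Mutant", 6))))
expr[1:10, 7:12] <- expr[1:10, 7:12] + 3   # first 10 genes are up in mutants

results <- data.frame(
  gene = rownames(expr),
  foldchange = rowMeans(expr[, 7:12]) - rowMeans(expr[, 1:6])
)

# Keep genes passing a log2 fold-change cutoff of 1 (the FDR filter is omitted here).
selected <- results[abs(results$foldchange) > 1, ]

# Hypothetical output location; a real analysis would use its own path.
out.file <- file.path(tempdir(), "significant_genes.csv")
write.csv(selected, out.file, row.names = FALSE)

heatmap(expr[selected$gene, , drop = FALSE], margins = c(5, 10))
```

Setting row.names = FALSE keeps the gene identifiers in a proper column rather than as unnamed row labels, which makes the file easier to reuse.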

Beyond the gene list paradigm