SPH 247 Statistical Analysis of Laboratory Data
April 16, 2013

Basic Design of Expression Arrays
For each gene that is a target for the array, we have a known DNA sequence.
mRNA is reverse transcribed to DNA, and if a complementary sequence is on the chip, the DNA will be more likely to stick.
The DNA is labeled with a dye that will fluoresce and generate a signal that is monotonic in the amount in the sample.

TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACG
ATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGCTATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATGC
[Figure labels: Exon, Intron, Probe Sequence]
cDNA arrays use variable-length probes derived from expressed sequence tags (ESTs)
–Spotted and almost always used with two-color methods
–Can be used in species with an unsequenced genome
Long oligo arrays use 60-70mers
–Agilent two-color arrays
–Illumina Bead Arrays
–Usually use computationally derived probes but can use probes from sequenced ESTs

Affymetrix GeneChips use multiple 25-mers
For each gene, one or more sets of 8-20 distinct probes
–May overlap
–May cover more than one exon
Affymetrix chips also use mismatch (MM) probes that have the same sequence as perfect match (PM) probes except for the middle base, which is changed to inhibit binding.
This is supposed to act as a control, but the MM probe often binds to another mRNA species instead, so many analysts do not use them.
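The PM/MM relationship can be sketched in a few lines of base R. This is an illustration only: the function name make_mismatch and the example 25-mer are made up for this sketch, and it assumes the middle (13th) base is replaced by its complement.

```r
# Build the mismatch (MM) partner of a 25-mer perfect match (PM) probe.
# Assumption for this sketch: the middle (13th) base is swapped for its complement.
make_mismatch <- function(pm) {
  stopifnot(nchar(pm) == 25)
  bases <- strsplit(pm, "")[[1]]
  bases[13] <- chartr("ACGT", "TGCA", bases[13])  # A<->T, C<->G
  paste(bases, collapse = "")
}

pm <- "TAAATCGATACGCATTAGTTCGACC"  # arbitrary illustrative 25-mer, not a real probe
mm <- make_mismatch(pm)

# Positions at which PM and MM differ (only the middle base should differ)
diff_pos <- which(strsplit(pm, "")[[1]] != strsplit(mm, "")[[1]])
```

Because the two probes differ at a single base, an MM probe can still hybridize appreciably, which is one reason it makes an imperfect control.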

Illumina Bead Arrays
Beads are coated with many copies of a 50-mer gene-specific probe and a 29-mer address sequence.
Multiple beads per probe; the number is random, but around 20.
Each Ref-8 chip contains 8 arrays with ~25,000 targets, plus controls.
Each WG-6 chip contains 6 arrays with ~50,000 targets, plus controls.
Each HT-12 chip contains 12 arrays with ~50,000 targets, plus controls.

Probe Design
A good probe sequence should match the chosen gene or exon from a gene and should not match any other gene in the genome.
Melting temperature depends on the GC content and should be similar on all probes on an array, since the hybridization must be conducted at a single temperature.
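As a rough illustration of the GC-content point, the fraction of G and C bases in a candidate probe can be computed directly. The helper name gc_content and both example sequences are hypothetical; real probe design also checks uniqueness against the genome and uses a proper melting-temperature model.

```r
# Fraction of bases in a probe sequence that are G or C.
gc_content <- function(probe) {
  bases <- strsplit(toupper(probe), "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Two arbitrary illustrative 25-mers (not real array probes)
probes <- c(p1 = "TAAATCGATACGCATTAGTTCGACC",
            p2 = "GGGTTGTGCCTAAGCTATGCAATTA")
gc <- sapply(probes, gc_content)

# Probes whose GC fraction strays far from the others would hybridize
# differently at the single temperature used for the whole array.
gc_spread <- max(gc) - min(gc)
```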

The affinity of a given piece of DNA for the probe sequence can depend on many things, including secondary and tertiary structure as well as GC content.
This means that the relationship between the concentration of the RNA species in the original sample and the brightness of the spot on the array can be very different for different probes for the same gene.
Thus only comparisons of intensity within the same probe across arrays make sense.
A higher signal for one gene than another on the same array does not mean that the copy number is higher.
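A toy calculation may make this concrete. It assumes, purely for illustration, that intensity equals a probe-specific affinity times concentration; the affinities and concentrations are invented numbers, and real probes are noisier and not exactly linear.

```r
# Two probes with very different (made-up) affinities measure two RNA species
# present at the same concentration on each array.
conc     <- c(arrayA = 100, arrayB = 200)   # concentration doubles from A to B
affinity <- c(probe1 = 0.5, probe2 = 4)     # hypothetical probe affinities

# Toy model: intensity = affinity * concentration (rows = probes, cols = arrays)
intensity <- outer(affinity, conc)

# Within a probe, the ratio across arrays recovers the true two-fold change
within_probe <- intensity[, "arrayB"] / intensity[, "arrayA"]

# Across probes on one array, the ratio reflects affinity, not abundance
between_probe <- intensity["probe2", "arrayA"] / intensity["probe1", "arrayA"]
```

Here both within-probe ratios equal 2, while the between-probe ratio is 8 even though the two species are equally abundant.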

Affymetrix GeneChips
For each probe set, there are 8-20 perfect match (PM) probes, which may or may not overlap and which target the same gene.
There are also mismatch (MM) probes, which are supposed to serve as a control but do so rather badly.
Most of us ignore the MM probes.

Expression Indices
A key issue with Affymetrix chips is how to summarize the multiple data values on a chip for each probe set (aka gene).
There have been a large number of suggested methods.
Generally, the worst ones are those from Affymetrix, by a long way; "worse" here means less able to detect real differences.
Summary of Illumina beads is simpler, but there are still issues.

Usable Methods
Li and Wong’s dChip and follow-on work is demonstrably better than MAS 4.0 and MAS 5.0, but not as good as RMA and GLA.
The RMA method of Irizarry et al. is available in Bioconductor.
The GLA method (Durbin, Rocke, Zhou) is also available in Bioconductor/CRAN as part of the LMGene R package.

Bioconductor Documentation
> library(affy)
Loading required package: Biobase
Loading required package: tools
Welcome to Bioconductor
Vignettes contain introductory material. To view, type 'openVignette()'. To cite Bioconductor, see 'citation("Biobase")' and for packages 'citation(pkgname)'.
Loading required package: affyio
Loading required package: preprocessCore

Bioconductor Documentation
> openVignette()
Please select a vignette:
1: affy - 1. Primer
2: affy - 2. Built-in Processing Methods
3: affy - 3. Custom Processing Methods
4: affy - 4. Import Methods
5: affy - 5. Automatic downloading of CDF packages
6: Biobase - An introduction to Biobase and ExpressionSets
7: Biobase - Bioconductor Overview
8: Biobase - esApply Introduction
9: Biobase - Notes for eSet developers
10: Biobase - Notes for writing introductory 'how to' documents
11: Biobase - quick views of eSet instances
Selection:

Reading Affy Data into R
The CEL files contain the data from an array. We will look at data from an older type of array, the U95A, which contains 12,625 probe sets and 409,600 probes.
The CDF file contains information relating probe pair sets to locations on the array. These are built into the affy package for standard types.

Example Data Set
Data from Robert Rice’s lab on twelve keratinocyte cell lines, at six different stages.
Affymetrix HG U95A GeneChips.
For each “gene”, we will run a one-way ANOVA with two observations per cell.
For this illustration, we will use RMA.
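The per-gene one-way ANOVA can be sketched in base R as follows. The data are simulated stand-ins (random numbers with no real stage effect), not the keratinocyte data, and the object names are invented for the sketch; with real data the matrix would come from exprs() applied to the RMA result.

```r
set.seed(1)

# Six stages with two arrays each, in the order LN0A, LN0B, ..., LN5A, LN5B
stage <- factor(rep(0:5, each = 2))

# Simulated log2 expression matrix: 20 "genes" (rows) by 12 arrays (columns)
expr <- matrix(rnorm(20 * 12, mean = 8, sd = 0.3), nrow = 20)

# One-way ANOVA for a single gene: 5 df for stage, 6 df for error
tab <- anova(lm(expr[1, ] ~ stage))

# The same test applied to every row, keeping the F-test p-value for stage
p_values <- apply(expr, 1, function(y) anova(lm(y ~ stage))[["Pr(>F)"]][1])
```

With 12,625 probe sets the same apply() call yields one p-value per probe set, and those p-values then need adjustment for multiple testing.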

Files for the Analysis
.CDF file has the U95A chip definition (which probe is where on the chip). Built in to the affy package.
.CEL files contain the raw data after pixel-level analysis, one number for each spot. Files are called LN0A.CEL, LN0B.CEL…LN5B.CEL and are on the web site.
409,600 probe values in 12,625 probe sets.

The ReadAffy function
ReadAffy() reads all of the CEL files in the current working directory into an object of class AffyBatch, which extends the Biobase eSet class, as ExpressionSet does.
ReadAffy(widget=T) does so in a GUI that allows entry of other characteristics of the dataset.
You can also specify filenames, phenotype or experimental data, and MIAME information.

> rrdata <- ReadAffy()
> class(rrdata)
[1] "AffyBatch"
attr(,"package")
[1] "affy"
> dim(exprs(rrdata))
[1]
> colnames(exprs(rrdata))
[1] "LN0A.CEL" "LN0B.CEL" "LN1A.CEL" "LN1B.CEL" "LN2A.CEL" "LN2B.CEL"
[7] "LN3A.CEL" "LN3B.CEL" "LN4A.CEL" "LN4B.CEL" "LN5A.CEL" "LN5B.CEL"
> length(probeNames(rrdata))
[1]
> length(unique(probeNames(rrdata)))
[1]
> length((featureNames(rrdata)))
[1]
> featureNames(rrdata)[1:5]
[1] "100_g_at" "1000_at" "1001_at" "1002_f_at" "1003_s_at"

The ExpressionSet class
An object of class ExpressionSet has several slots, the most important of which is an assayData object containing one or more matrices.
The best way to extract parts of this is to use the appropriate methods.
exprs() extracts an expression matrix.
featureNames() extracts the names of the probe sets.

Expression Indices
The 409,600 rows of the expression matrix in the AffyBatch object each correspond to a probe (25-mer).
Ordinarily, to use this we need to combine the probe-level data for each probe set into a single expression number.
This has conceptually several steps.

Steps in Expression Index Construction
Background correction is the process of adjusting the signals so that the zero point is similar on all parts of all arrays.
We like to manage this so that zero signal after background correction corresponds approximately to zero amount of the mRNA species that is the target of the probe set.

Data transformation is the process of changing the scale of the data so that it is more comparable from high to low.
Common transformations are the logarithm and the generalized logarithm.
Normalization is the process of adjusting for systematic differences from one array to another.
Normalization may be done before or after transformation, and before or after probe set summarization.
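The two transformations can be compared in a short sketch. The form of the generalized logarithm used here, log(x + sqrt(x^2 + lambda)), is one common parameterization; the value of lambda is an arbitrary choice for illustration and would normally be estimated from the data.

```r
# Generalized logarithm: close to log(2x) for large x, but finite and
# smooth at zero and for negative background-corrected values.
glog <- function(x, lambda) log(x + sqrt(x^2 + lambda))

x <- c(-50, 0, 10, 100, 1000, 1e6)   # illustrative intensities
lambda <- 100                        # hypothetical tuning constant

glog_x <- glog(x, lambda)

# The ordinary log is undefined (NaN, with a warning) below zero and -Inf at zero
log_x <- suppressWarnings(log(x))

# At high intensity, glog(x) - log(x) approaches the constant log(2)
high_gap <- glog(1e6, lambda) - log(1e6)
```

The practical difference is at the low end: background-corrected intensities near or below zero are kept rather than dropped or turned into extreme values.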

One may use only the perfect match (PM) probes, or may subtract or otherwise use the mismatch (MM) probes.
There are many ways to summarize 20 PM probes and 20 MM probes on 10 arrays (200 PM and 200 MM values) into 10 expression index numbers.
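To show what such a summary looks like in practice, here are three simple candidates computed on simulated data. None of these is presented as the exact algorithm of any named method; the intensities are random log-normal values chosen only to give plausible magnitudes.

```r
set.seed(2)
n_probes <- 20
n_arrays <- 10

# Simulated intensities for one probe set: rows = probes, columns = arrays
pm <- matrix(rlnorm(n_probes * n_arrays, meanlog = 6, sdlog = 0.5), nrow = n_probes)
mm <- matrix(rlnorm(n_probes * n_arrays, meanlog = 5, sdlog = 0.5), nrow = n_probes)
colnames(pm) <- colnames(mm) <- paste0("array", 1:n_arrays)

# Three simple summaries, each giving one number per array
avg_diff      <- colMeans(pm - mm)             # average PM - MM difference
mean_log_pm   <- colMeans(log2(pm))            # mean of log2 PM, ignoring MM
median_log_pm <- apply(log2(pm), 2, median)    # more resistant to outlying probes
```

Each summary reduces the probe-by-array values to a vector of length 10; the choices differ in whether MM is used, on what scale the averaging happens, and how robust the summary is to a few aberrant probes.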

[Figure: Probe intensities for LASP1 in a radiation dose-response experiment. Labels: individual probes, Mean, Expression Index.]

[Figure: Log probe intensities for LASP1 in a radiation dose-response experiment. Labels: individual probes, Mean, Expression Index.]

The RMA Method
Background correction that does not make 0 signal correspond to 0 amount
Quantile normalization
Log2 transform
Median polish summary of PM probes
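The last three steps can be sketched in base R on simulated PM values for a single probe set. This is a simplified illustration, not the affy implementation: background correction is omitted, the helper quantile_normalize is written for this sketch, and real RMA normalizes over all probes on the arrays rather than within one probe set.

```r
# Quantile normalization: force every column (array) to share the same
# distribution, namely the mean of the sorted columns.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")
  target <- rowMeans(apply(x, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(x)
  out
}

set.seed(3)
# Simulated PM intensities: 11 probes (rows) on 6 arrays (columns)
pm <- matrix(rexp(11 * 6, rate = 1 / 500) + 50, nrow = 11,
             dimnames = list(paste0("probe", 1:11), paste0("array", 1:6)))

norm_pm <- quantile_normalize(pm)   # step 2: quantile normalization
log_pm  <- log2(norm_pm)            # step 3: log2 transform

# Step 4: median polish fits log_pm ~ overall + probe effect + array effect
mp <- medpolish(log_pm, trace.iter = FALSE)
expr_index <- mp$overall + mp$col   # one expression value per array
```

The probe (row) effects absorb the affinity differences between probes, so the array (column) effects carry the expression signal.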

> eset <- rma(rrdata)
trying URL '
Content type 'application/zip' length bytes (1.3 Mb)
opened URL
downloaded 1.3 Mb
package 'hgu95av2cdf' successfully unpacked and MD5 sums checked
The downloaded packages are in C:\Documents and Settings\dmrocke\Local Settings…
updating HTML package descriptions
Background correcting
Normalizing
Calculating Expression
> class(eset)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
> dim(exprs(eset))
[1]
> featureNames(eset)[1:5]
[1] "100_g_at" "1000_at" "1001_at" "1002_f_at" "1003_s_at"

> exprs(eset)[1:5,]
[Output: expression values for the probe sets 100_g_at, 1000_at, 1001_at, 1002_f_at, and 1003_s_at (rows) on the arrays LN0A.CEL through LN5B.CEL (columns).]

> summary(exprs(eset))
[Output: Min., 1st Qu., Median, Mean, 3rd Qu., and Max. for each of the arrays LN0A.CEL through LN5B.CEL. The maximum for LN5B.CEL is 11.952.]

Probe Sets not Genes
It is unavoidable to refer to a probe set as measuring a “gene”, but it can nevertheless be deceptive.
The annotation of a probe set may be based on homology with a gene of possibly known function in a different organism.
Only relatively few probe sets correspond to genes with known function and known structure in the organism being studied.