SPH 247 Statistical Analysis of Laboratory Data
April 16, 2013
Basic Design of Expression Arrays
For each gene that is a target for the array, we have a known DNA sequence.
mRNA is reverse transcribed to DNA, and if a complementary sequence is on the chip, the DNA will be more likely to stick.
The DNA is labeled with a dye that fluoresces and generates a signal that is monotonic in the amount in the sample.
[Figure: double-stranded DNA sequence with regions labeled Exon, Intron, and Probe Sequence]
TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACG
ATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGCTATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATGC
cDNA arrays use variable-length probes derived from expressed sequence tags (ESTs)
– Spotted and almost always used with two-color methods
– Can be used in species with an unsequenced genome
Long oligo arrays use 60-70mers
– Agilent two-color arrays
– Illumina Bead Arrays
– Usually use computationally derived probes but can use probes from sequenced ESTs
Affymetrix GeneChips use multiple 25-mers
For each gene, there are one or more sets of 8-20 distinct probes.
Probes may overlap.
Probes may cover more than one exon.
Affymetrix chips also use mismatch (MM) probes that have the same sequence as the perfect match (PM) probes except for the middle base, which is changed to inhibit binding.
The MM probe is supposed to act as a control, but it often binds to another mRNA species instead, so many analysts do not use the MM probes.
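The relationship between a PM probe and its MM partner can be sketched in base R. This is only an illustration: the 25-mer below is a made-up probe, the function name make_mm is hypothetical, and the sketch assumes that the middle (13th) base is changed to its complement.

```r
# Sketch: build a mismatch (MM) probe from a perfect match (PM) 25-mer
# by complementing the middle (13th) base. Assumption: "changed" means
# "replaced by the complementary base".
make_mm <- function(pm) {
  stopifnot(nchar(pm) == 25)
  mid <- 13
  # chartr() maps each base to its complement: A<->T, C<->G
  substr(pm, mid, mid) <- chartr("ACGT", "TGCA", substr(pm, mid, mid))
  pm
}

pm <- "TAAATCGATACGCATTAGTTCGACC"  # hypothetical 25-mer, not a real probe
mm <- make_mm(pm)

# Show the two probes and the single position where they differ
print(rbind(PM = pm, MM = mm))
print(which(strsplit(pm, "")[[1]] != strsplit(mm, "")[[1]]))
```

Because the two probes differ at a single base, an MM probe can still hybridize to the intended target or to a different transcript, which is one reason the MM signal is a poor control.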
Illumina Bead Arrays
Beads are coated with many copies of a 50-mer gene-specific probe and a 29-mer address sequence.
There are multiple beads per probe; the number is random, but around 20.
Each Ref-8 chip contains 8 arrays with ~25,000 targets, plus controls.
Each WG-6 chip contains 6 arrays with ~50,000 targets, plus controls.
Each HT-12 chip contains 12 arrays with ~50,000 targets, plus controls.
Probe Design
A good probe sequence should match the chosen gene or exon from a gene and should not match any other gene in the genome.
Melting temperature depends on the GC content and should be similar on all probes on an array, since the hybridization must be conducted at a single temperature.
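As a rough illustration of the GC-content criterion, the following base R sketch computes the GC fraction of candidate probes. The two sequences and the function name gc_content are hypothetical, and GC fraction is used here only as a crude proxy for melting temperature.

```r
# Sketch: GC fraction of each candidate probe sequence.
gc_content <- function(seqs) {
  vapply(strsplit(toupper(seqs), ""),
         function(bases) mean(bases %in% c("G", "C")),
         numeric(1))
}

# Hypothetical 25-mer candidates (not real array probes)
probes <- c(p1 = "TAAATCGATACGCATTAGTTCGACC",
            p2 = "GCGCAACCCTAACGTCCATGTATCT")
gc <- gc_content(probes)
print(gc)

# Candidates whose GC fraction differs a lot would hybridize best at
# different temperatures, so a designer would prefer a narrow range.
print(diff(range(gc)))
```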
The affinity of a given piece of DNA for the probe sequence can depend on many things, including secondary and tertiary structure as well as GC content.
This means that the relationship between the concentration of the RNA species in the original sample and the brightness of the spot on the array can be very different for different probes for the same gene.
Thus only comparisons of intensity within the same probe across arrays make sense.
A higher signal for one gene than another on the same array does not mean that the copy number is higher.
Affymetrix GeneChips
For each probe set, there are 8-20 perfect match (PM) probes, which may or may not overlap and which target the same gene.
There are also mismatch (MM) probes, which are supposed to serve as a control but do so rather badly.
Most of us ignore the MM probes.
Expression Indices
A key issue with Affymetrix chips is how to summarize the multiple data values on a chip for each probe set (aka gene).
There have been a large number of suggested methods.
Generally, the worst ones are those from Affy, by a long way; "worse" means less able to detect real differences.
Summarizing Illumina beads is simpler, but there are still issues.
Usable Methods
Li and Wong's dChip and follow-on work is demonstrably better than MAS 4.0 and MAS 5.0, but not as good as RMA and GLA.
The RMA method of Irizarry et al. is available in Bioconductor.
The GLA method (Durbin, Rocke, Zhou) is also available in Bioconductor/CRAN as part of the LMGene R package.
Bioconductor Documentation
> library(affy)
Loading required package: Biobase
Loading required package: tools

Welcome to Bioconductor

  Vignettes contain introductory material. To view, type
  'openVignette()'. To cite Bioconductor, see
  'citation("Biobase")' and for packages 'citation(pkgname)'.

Loading required package: affyio
Loading required package: preprocessCore
Bioconductor Documentation
> openVignette()
Please select a vignette:

 1: affy - 1. Primer
 2: affy - 2. Built-in Processing Methods
 3: affy - 3. Custom Processing Methods
 4: affy - 4. Import Methods
 5: affy - 5. Automatic downloading of CDF packages
 6: Biobase - An introduction to Biobase and ExpressionSets
 7: Biobase - Bioconductor Overview
 8: Biobase - esApply Introduction
 9: Biobase - Notes for eSet developers
10: Biobase - Notes for writing introductory 'how to' documents
11: Biobase - quick views of eSet instances

Selection:
Reading Affy Data into R
The CEL files contain the data from an array.
We will look at data from an older type of array, the U95A, which contains 12,625 probe sets and 409,600 probes.
The CDF file contains information relating probe pair sets to locations on the array. These are built into the affy package for standard types.
Example Data Set
Data from Robert Rice's lab on twelve keratinocyte cell lines, at six different stages.
Affymetrix HG U95A GeneChips.
For each "gene", we will run a one-way ANOVA with two observations per cell.
For this illustration, we will use RMA.
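The per-gene one-way ANOVA described above might look roughly like the following base R sketch. The expression matrix here is simulated, not the Rice lab data; the probe set names (ps1, ps2, ...) and the helper anova_p are hypothetical. Only the layout (six stages, two arrays per stage) follows the example data set.

```r
set.seed(1)

# Six stages with two arrays each, in the order LN0A, LN0B, ..., LN5B
stage  <- factor(rep(0:5, each = 2))
arrays <- paste0("LN", rep(0:5, each = 2), c("A", "B"))

# Simulated log2 expression values for 100 hypothetical probe sets
x <- matrix(rnorm(100 * 12, mean = 8), nrow = 100,
            dimnames = list(paste0("ps", 1:100), arrays))

# Give the first probe set a strong (artificial) stage effect
x[1, ] <- x[1, ] + 3 * rep(0:5, each = 2)

# One-way ANOVA for a single probe set; returns the p-value for stage
anova_p <- function(y) anova(lm(y ~ stage))[["Pr(>F)"]][1]

# Apply the same ANOVA to every row (probe set) of the matrix
pvals <- apply(x, 1, anova_p)
print(head(sort(pvals)))
```

With real data, x would be replaced by the matrix of expression indices, one row per probe set and one column per array. With thousands of probe sets, the p-values would also need a multiple-testing adjustment, for example with p.adjust().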
Files for the Analysis
The .CDF file has the U95A chip definition (which probe is where on the chip). It is built in to the affy package.
The .CEL files contain the raw data after pixel-level analysis, one number for each spot.
The files are called LN0A.CEL, LN0B.CEL, …, LN5B.CEL and are on the web site.
There are 409,600 probe values in 12,625 probe sets.
The ReadAffy function
The ReadAffy() function reads all of the CEL files in the current working directory into an object of class AffyBatch, which extends the eSet class (the same base class that ExpressionSet extends).
ReadAffy(widget=TRUE) does so in a GUI that allows entry of other characteristics of the dataset.
You can also specify filenames, phenotype or experimental data, and MIAME information.
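As a small sketch of specifying filenames explicitly, the CEL file names used in this example can be generated rather than typed. The call to ReadAffy() is guarded so that the sketch still runs when the affy package or the CEL files are not present; the variable name cel_files is an assumption made for illustration.

```r
# Build the twelve file names LN0A.CEL, LN0B.CEL, ..., LN5B.CEL
cel_files <- paste0("LN", rep(0:5, each = 2), c("A", "B"), ".CEL")
print(cel_files)

# Read only these files, and only if affy is installed and the files
# exist in the current working directory
if (requireNamespace("affy", quietly = TRUE) && all(file.exists(cel_files))) {
  rrdata <- affy::ReadAffy(filenames = cel_files)
}
```

Listing the files explicitly fixes the column order of the resulting object, which makes it easier to line the arrays up with a stage factor later.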
> rrdata <- ReadAffy()
> class(rrdata)
[1] "AffyBatch"
attr(,"package")
[1] "affy"
> dim(exprs(rrdata))
[1]
> colnames(exprs(rrdata))
 [1] "LN0A.CEL" "LN0B.CEL" "LN1A.CEL" "LN1B.CEL" "LN2A.CEL" "LN2B.CEL"
 [7] "LN3A.CEL" "LN3B.CEL" "LN4A.CEL" "LN4B.CEL" "LN5A.CEL" "LN5B.CEL"
> length(probeNames(rrdata))
[1]
> length(unique(probeNames(rrdata)))
[1]
> length((featureNames(rrdata)))
[1]
> featureNames(rrdata)[1:5]
[1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"
The ExpressionSet class
An object of class ExpressionSet has several slots, the most important of which is an assayData object containing one or more matrices.
The best way to extract parts of this is to use the appropriate methods.
exprs() extracts an expression matrix.
featureNames() extracts the names of the probe sets.
Expression Indices
The 409,600 rows of the expression matrix in the AffyBatch object each correspond to a probe (25-mer).
Ordinarily, to use this we need to combine the probe-level data for each probe set into a single expression number.
Conceptually, this has several steps.
Steps in Expression Index Construction
Background correction is the process of adjusting the signals so that the zero point is similar on all parts of all arrays.
We like to manage this so that zero signal after background correction corresponds approximately to zero amount of the mRNA species that is the target of the probe set.
Data transformation is the process of changing the scale of the data so that it is more comparable from high to low.
Common transformations are the logarithm and the generalized logarithm.
Normalization is the process of adjusting for systematic differences from one array to another.
Normalization may be done before or after transformation, and before or after probe set summarization.
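To make the two transformations concrete, here is a base R sketch comparing log2 with one common form of the generalized logarithm. The function name glog2, the base-2 rescaling, and the value of lambda are assumptions chosen for illustration; in practice lambda would be estimated from the data.

```r
# Generalized log, rescaled so that it approaches log2(y) for large y:
#   glog2(y) = log2((y + sqrt(y^2 + lambda)) / 2)
glog2 <- function(y, lambda) log2((y + sqrt(y^2 + lambda)) / 2)

# Background-corrected intensities can be zero or negative
y <- c(-50, 0, 50, 1000, 50000)
lambda <- 100^2  # hypothetical value, not estimated from any data

# log2 is undefined for y <= 0; the generalized log is finite everywhere
comparison <- cbind(y     = y,
                    log2  = suppressWarnings(log2(y)),
                    glog2 = glog2(y, lambda))
print(comparison)
```

The two transformations agree closely at high intensities and differ mainly near zero, where the ordinary logarithm is undefined or highly variable.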
One may use only the perfect match (PM) probes, or may subtract or otherwise use the mismatch (MM) probes.
There are many ways to summarize 20 PM probes and 20 MM probes on 10 arrays (200 probe pairs, 400 numbers in total) into 10 expression index numbers.
Probe intensities for LASP1 in a radiation dose-response experiment
[Figure: intensities of the individual probes (_at) and their mean; axis label: Expression Index]
Log probe intensities for LASP1 in a radiation dose-response experiment
[Figure: log intensities of the individual probes (_at) and their mean; axis label: Expression Index]
The RMA Method
Background correction that does not make 0 signal correspond to 0 amount
Quantile normalization
Log 2 transform
Median polish summary of PM probes
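The last three steps can be sketched in base R on a toy matrix. This is not the rma() implementation: the intensities are simulated, the helper quantile_normalize is hypothetical, and the background correction step is skipped. In real use, quantile normalization is applied to all probes on the arrays at once, not to a single probe set as in this small example.

```r
set.seed(2)

# Simulated PM intensities for one probe set: 8 probes (rows) x 4 arrays
pm <- matrix(rexp(8 * 4, rate = 1 / 500) + 50, nrow = 8,
             dimnames = list(paste0("probe", 1:8), paste0("array", 1:4)))

# Quantile normalization: force every array to have the same distribution
quantile_normalize <- function(m) {
  ranks  <- apply(m, 2, rank, ties.method = "first")
  # Mean of the k-th smallest value across arrays, for k = 1..nrow(m)
  target <- rowMeans(apply(m, 2, sort))
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(m)
  out
}

qn <- quantile_normalize(pm)   # step 1: quantile normalization
lg <- log2(qn)                 # step 2: log2 transform

# Step 3: median polish fits log2(PM) = overall + probe effect (row)
# + array effect (column) + residual, using medians instead of means
fit <- medpolish(lg, trace.iter = FALSE)

# Expression index for each array: overall level plus array effect
expr_index <- fit$overall + fit$col
print(expr_index)
```

Median polish is resistant to a few aberrant probes, which is why it is preferred here over a simple average of the log intensities.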
> eset <- rma(rrdata)
trying URL '
Content type 'application/zip' length bytes (1.3 Mb)
opened URL
downloaded 1.3 Mb

package 'hgu95av2cdf' successfully unpacked and MD5 sums checked

The downloaded packages are in
        C:\Documents and Settings\dmrocke\Local Settings…
updating HTML package descriptions
Background correcting
Normalizing
Calculating Expression
> class(eset)
[1] "ExpressionSet"
attr(,"package")
[1] "Biobase"
> dim(exprs(eset))
[1]
> featureNames(eset)[1:5]
[1] "100_g_at"  "1000_at"   "1001_at"   "1002_f_at" "1003_s_at"
> exprs(eset)[1:5,]
[Output table: expression values for the first five probe sets in the twelve columns LN0A.CEL through LN5B.CEL; the numeric values are not recoverable]
> summary(exprs(eset))
[Output table: Min., 1st Qu., Median, Mean, 3rd Qu., and Max. for each of the twelve columns LN0A.CEL through LN5B.CEL; the only surviving value is the final entry, 11.952]
Probe Sets not Genes
It is unavoidable to refer to a probe set as measuring a "gene", but nevertheless it can be deceptive.
The annotation of a probe set may be based on homology with a gene of possibly known function in a different organism.
Only relatively few probe sets correspond to genes with known function and known structure in the organism being studied.