1
Analysis of Affymetrix GeneChip Data
11/28/2018 Analysis of Affymetrix GeneChip Data EPP 245 Statistical Analysis of Laboratory Data
2
Basic Design of Expression Arrays
For each gene that is a target for the array, we have a known DNA sequence.
mRNA is reverse transcribed to DNA, and if a complementary sequence is on the chip, the DNA will be more likely to stick.
The DNA is labeled with a dye that will fluoresce and generate a signal that is monotonic in the amount in the sample.
November 15, 2007 EPP 245 Statistical Analysis of Laboratory Data
3
[Figure labels: Exon, Intron, Probe Sequence]
TAAATCGATACGCATTAGTTCGACCTATCGAAGACCCAACACGGATTCGATACGTTAATATGACTACCTGCGCAACCCTAACGTCCATGTATCTAATACG
ATTTAGCTATGCGTAATCAAGCTGGATAGCTTCTGGGTTGTGCCTAAGCTATGCAATTATACTGATGGACGCGTTGGGATTGCAGGTACATAGATTATGC
cDNA arrays use variable-length probes derived from expressed sequence tags (ESTs).
    Spotted, and almost always used with two-color methods.
    Can be used in species with an unsequenced genome.
Long oligo arrays use 60-70mers.
    Agilent two-color arrays.
    Spotted arrays from UC Davis or elsewhere.
    Usually use computationally derived probes, but can use probes from sequenced ESTs.
4
Affymetrix GeneChips use multiple 25-mers.
For each gene, there are one or more sets of 8-20 distinct probes.
    May overlap.
    May cover more than one exon.
Affymetrix chips also use mismatch (MM) probes that have the same sequence as the perfect match (PM) probes except for the middle base, which is changed to inhibit binding.
This is supposed to act as a control, but the MM probe often binds to another mRNA species instead, so many analysts do not use MM probes.
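The relationship between a PM probe and its MM partner can be sketched in a few lines of base R. This is only an illustration: the helper name `make_mismatch` is hypothetical, the probe is simply the first 25 bases of the example sequence on slide 3, and the code assumes the middle (13th) base is replaced by its complement.

```r
# Build the mismatch (MM) partner of a 25-mer perfect match (PM) probe.
# Assumption: the middle (13th) base is swapped for its complement.
make_mismatch <- function(pm) {
  stopifnot(nchar(pm) == 25)
  mid <- 13
  middle_base <- substr(pm, mid, mid)
  # chartr maps A<->T and C<->G for the single middle base
  substr(pm, mid, mid) <- chartr("ACGT", "TGCA", middle_base)
  pm
}

# Illustrative probe only: first 25 bases of the slide 3 example sequence
pm_probe <- "TAAATCGATACGCATTAGTTCGACC"
mm_probe <- make_mismatch(pm_probe)

print(pm_probe)
print(mm_probe)
```

The two strings differ only at position 13, which is the whole point of the design: the MM probe is meant to capture nonspecific binding while being otherwise identical.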
5
Probe Design
A good probe sequence should match the chosen gene or exon from a gene and should not match any other gene in the genome.
Melting temperature depends on the GC content and should be similar on all probes on an array, since the hybridization must be conducted at a single temperature.
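As a rough illustration of the GC-content point, the following base R sketch computes the GC fraction of two 25-mers. The function name `gc_content` is hypothetical, and both probes are just substrings of the example sequence on slide 3, not real array probes.

```r
# Fraction of G and C bases in a probe sequence
gc_content <- function(probe) {
  bases <- strsplit(toupper(probe), "")[[1]]
  mean(bases %in% c("G", "C"))
}

# Illustrative 25-mers cut from the slide 3 example sequence
probes <- c(
  probe_a = "TAAATCGATACGCATTAGTTCGACC",
  probe_b = "GCGCAACCCTAACGTCCATGTATCT"
)

gc <- vapply(probes, gc_content, numeric(1))
print(round(gc, 2))
```

Probes with very different GC fractions would be expected to have different melting temperatures, which is why a designer would prefer candidates whose GC content falls in a narrow band.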
6
The affinity of a given piece of DNA for the probe sequence can depend on many things, including secondary and tertiary structure as well as GC content.
This means that the relationship between the concentration of the RNA species in the original sample and the brightness of the spot on the array can be very different for different probes for the same gene.
Thus only comparisons of intensity within the same probe across arrays make sense.
7
Affymetrix GeneChips
For each probe set, there are 8-20 perfect match (PM) probes, which may or may not overlap and which target the same gene.
There are also mismatch (MM) probes, which are supposed to serve as a control but do so rather badly.
Most of us ignore the MM probes.
8
Expression Indices
A key issue with Affymetrix chips is how to summarize the multiple data values on a chip for each probe set (aka gene).
A large number of methods have been suggested.
Generally, the worst ones are those from Affymetrix, by a long way; "worse" here means less able to detect real differences.
9
Usable Methods
Li and Wong's dCHIP and follow-on work is demonstrably better than MAS 4.0 and MAS 5.0, but not as good as RMA and GLA.
The RMA method of Irizarry et al. is available in Bioconductor.
The GLA method (Durbin, Rocke, Zhou) is also available in Bioconductor.
10
Bioconductor Documentation
    > library(affy)
    Loading required package: Biobase
    Loading required package: tools
    Welcome to Bioconductor
      Vignettes contain introductory material. To view, type 'openVignette()'.
      To cite Bioconductor, see 'citation("Biobase")' and for packages 'citation(pkgname)'.
    Loading required package: affyio
    Loading required package: preprocessCore
11
Bioconductor Documentation
    > openVignette()
    Please select a vignette:
     1: affy - 1. Primer
     2: affy - 2. Built-in Processing Methods
     3: affy - 3. Custom Processing Methods
     4: affy - 4. Import Methods
     5: affy - 5. Automatic downloading of CDF packages
     6: Biobase - An introduction to Biobase and ExpressionSets
     7: Biobase - Bioconductor Overview
     8: Biobase - esApply Introduction
     9: Biobase - Notes for eSet developers
    10: Biobase - Notes for writing introductory 'how to' documents
    11: Biobase - quick views of eSet instances
    Selection:
12
Reading Affy Data into R
The CEL files contain the data from an array.
We will look at data from an older type of array, the U95A, which contains 12,625 probe sets and 409,600 probes.
The CDF file contains information relating probe pair sets to locations on the array. These are built into the affy package for standard types.
13
Example Data Set
Data from Robert Rice's lab on twelve keratinocyte cell lines, at six different stages.
Affymetrix HG U95A GeneChips.
For each "gene", we will run a one-way ANOVA with two observations per cell.
For this illustration, we will use RMA.
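The per-gene one-way ANOVA described above can be sketched in base R as follows. The expression matrix here is simulated purely for illustration (it is not the Rice lab data), and the names `expr`, `stage`, and `probeset_1` to `probeset_5` are hypothetical; the layout of six stages with two arrays each follows the slide.

```r
set.seed(245)

# Six stages with two arrays per stage, matching the design on the slide
stage <- factor(rep(0:5, each = 2))
array_names <- paste0("LN", rep(0:5, each = 2), c("A", "B"), ".CEL")

# Simulated log-scale expression values (NOT real data): 5 probe sets x 12 arrays
expr <- matrix(rnorm(5 * 12, mean = 7, sd = 0.3), nrow = 5,
               dimnames = list(paste0("probeset_", 1:5), array_names))

# Give the first probe set a strong stage effect so the test has something to find
expr[1, ] <- expr[1, ] + 0.8 * as.numeric(stage)

# One-way ANOVA per row: F-test p-value for the stage factor
p_values <- apply(expr, 1, function(y) anova(lm(y ~ stage))[["Pr(>F)"]][1])
print(signif(p_values, 3))
```

With 12,625 probe sets the same `apply()` call works unchanged on `exprs(eset)`, although the resulting p-values would then need a multiple-testing adjustment, for example with `p.adjust()`.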
14
Files for the Analysis
The .CDF file has the U95A chip definition (which probe is where on the chip). Built in.
The .CEL files contain the raw data after pixel-level analysis, one number for each spot.
Files are called LN0A.CEL, LN0B.CEL…LN5B.CEL and are on the web site.
409,600 probe values in 12,625 probe sets.
15
The ReadAffy function
The ReadAffy() function reads all of the CEL files in the current working directory into an object of class AffyBatch, which extends the eSet class (the same virtual base class that ExpressionSet extends).
ReadAffy(widget=TRUE) does so in a GUI that allows entry of other characteristics of the dataset.
You can also specify filenames, phenotype or experimental data, and MIAME information.
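A guarded sketch of the non-GUI route is shown here. It assumes the affy package is installed and that the CEL files sit in the current working directory; if either assumption fails, the code only prints a message. The variable name `cel_files` is hypothetical.

```r
# Collect CEL files in the working directory (may be empty)
cel_files <- list.files(pattern = "\\.CEL$", ignore.case = TRUE)

if (requireNamespace("affy", quietly = TRUE) && length(cel_files) > 0) {
  # Passing filenames explicitly instead of relying on the directory default
  rrdata <- affy::ReadAffy(filenames = cel_files)
  print(class(rrdata))
  print(dim(Biobase::exprs(rrdata)))   # probes x arrays
} else {
  message("affy is not installed or no CEL files were found; nothing was read")
}
```

Listing the files first makes it easy to check which arrays will be read, and in what order, before the relatively slow import step.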
16
    > rrdata <- ReadAffy()
    > class(rrdata)
    [1] "AffyBatch"
    attr(,"package")
    [1] "affy"
    > dim(exprs(rrdata))
    [1]
    > colnames(exprs(rrdata))
     [1] "LN0A.CEL" "LN0B.CEL" "LN1A.CEL" "LN1B.CEL" "LN2A.CEL" "LN2B.CEL"
     [7] "LN3A.CEL" "LN3B.CEL" "LN4A.CEL" "LN4B.CEL" "LN5A.CEL" "LN5B.CEL"
    > length(probeNames(rrdata))
    [1]
    > length(unique(probeNames(rrdata)))
    [1] 12625
    > length((featureNames(rrdata)))
    > featureNames(rrdata)[1:5]
    [1] "100_g_at" "1000_at" "1001_at" "1002_f_at" "1003_s_at"
17
The ExpressionSet class
An object of class ExpressionSet has several slots, the most important of which is an assayData object containing one or more matrices.
The best way to extract parts of this is using appropriate methods.
    exprs() extracts an expression matrix.
    featureNames() extracts the names of the probe sets.
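The accessor methods can be tried on a toy object without any CEL files. The sketch below assumes the Biobase package is installed and skips the demonstration otherwise; the matrix values and the names `m`, `eset_toy`, `probeset_a`, and `array1` are made up for illustration.

```r
# A tiny made-up expression matrix: 2 probe sets x 2 arrays
m <- matrix(c(7.1, 6.8,
              5.2, 5.4),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("probeset_a", "probeset_b"),
                            c("array1", "array2")))

if (requireNamespace("Biobase", quietly = TRUE)) {
  eset_toy <- Biobase::ExpressionSet(assayData = m)
  print(Biobase::exprs(eset_toy))          # the expression matrix
  print(Biobase::featureNames(eset_toy))   # row (probe set) names
} else {
  message("Biobase is not installed; skipping the ExpressionSet example")
}
```

Using the accessors rather than reaching into slots directly keeps code working even if the internal representation of the class changes.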
18
Expression Indices
The 409,600 rows of the expression matrix in the AffyBatch object each correspond to a probe (25-mer).
Ordinarily, to use this we need to combine the probe-level data for each probe set into a single expression number.
Conceptually, this has several steps.
19
Steps in Expression Index Construction
Background correction is the process of adjusting the signals so that the zero point is similar on all parts of all arrays.
We like to manage this so that zero signal after background correction corresponds approximately to zero amount of the mRNA species that is the target of the probe set.
20
Data transformation is the process of changing the scale of the data so that it is more comparable from high to low.
Common transformations are the logarithm and the generalized logarithm.
Normalization is the process of adjusting for systematic differences from one array to another.
Normalization may be done before or after transformation, and before or after probe set summarization.
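Both ideas can be sketched in base R. The `glog` function below uses one common form of the generalized logarithm, log(x + sqrt(x^2 + lambda)), with `lambda` as a tuning constant that is an assumption here, not a value from the slides. The `quantile_normalize` function is a simplified illustration of quantile normalization (ties are broken by order of appearance), not the implementation used by Bioconductor. The toy matrix is made up.

```r
# Generalized log: behaves like log(2x) for large x but stays finite at x = 0
glog <- function(x, lambda) log(x + sqrt(x^2 + lambda))

# Simplified quantile normalization: force every column (array)
# to share the same empirical distribution
quantile_normalize <- function(x) {
  sorted <- apply(x, 2, sort)              # sort each array separately
  target <- rowMeans(sorted)               # mean of the k-th smallest values
  ranks <- apply(x, 2, rank, ties.method = "first")
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Made-up intensities: 4 probes x 3 arrays
toy <- cbind(array1 = c(5, 2, 3, 4),
             array2 = c(4, 1, 4, 2),
             array3 = c(3, 4, 6, 8))

qn <- quantile_normalize(toy)
print(qn)
print(glog(c(0, 10, 1000), lambda = 100))
```

After normalization every column contains the same set of values, just in a different order, which removes array-wide shifts in scale while preserving the rank of each probe within its array.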
21
One may use only the perfect match (PM) probes, or may subtract or otherwise use the mismatch (MM) probes.
There are many ways to summarize 20 PM probes and 20 MM probes on 10 arrays (200 PM and 200 MM values, 400 numbers in total) into 10 expression index numbers.
22
Probe intensities for LASP1 in a radiation dose-response experiment
1 10 100 Mean
200618_at1 360 216 158 198 233.0
200618_at2 313 402 106 103 231.0
200618_at3 130 182 79 91 120.5
200618_at4 351 370 195 136 263.0
200618_at5 164 98 107 124.8
200618_at6 223 219 196 200.5
200618_at7 437 529 329.8
200618_at8 509 554 274 128 366.3
200618_at9 522 720 285 431.3
200618_at10 668 715 247 260 472.5
200618_at11 306 286 144 159 223.8
Expression Index 362.1 393.0 176.8 157.6
23
Log probe intensities for LASP1 in a radiation dose-response experiment
1 10 100 Mean
200618_at1 2.56 2.33 2.20 2.30 2.35
200618_at2 2.50 2.60 2.03 2.01 2.28
200618_at3 2.11 2.26 1.90 1.96 2.06
200618_at4 2.55 2.57 2.29 2.13 2.38
200618_at5 2.21 1.99 2.09
200618_at6 2.34
200618_at7 2.64 2.72 2.46
200618_at8 2.71 2.74 2.44
200618_at9 2.86 2.45 2.58
200618_at10 2.82 2.85 2.39 2.41 2.62
200618_at11 2.49 2.16
Expression Index 2.51 2.53 2.18
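The link between the raw and log tables can be checked directly: the log values are base-10 logarithms of the raw intensities. The sketch below uses only the first four probes, whose rows are complete in both tables; the object names `raw` and `log_raw` are hypothetical.

```r
# Raw intensities for the first four LASP1 probes (values from the raw table)
raw <- rbind(
  "200618_at1" = c(360, 216, 158, 198),
  "200618_at2" = c(313, 402, 106, 103),
  "200618_at3" = c(130, 182,  79,  91),
  "200618_at4" = c(351, 370, 195, 136)
)

# Base-10 log of each intensity, with the per-probe mean of the logs
log_raw <- log10(raw)
print(round(cbind(log_raw, Mean = rowMeans(log_raw)), 2))

# Per-probe mean on the raw scale, for comparison with the raw table
print(rowMeans(raw))
```

Note that the mean of the logs is not the log of the mean: for the first probe the raw mean is 233.0, whose base-10 log is about 2.37, while the mean of the logged values is 2.35.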
24
The RMA Method
Background correction that does not make 0 signal correspond to 0 amount.
Quantile normalization.
Log2 transform.
Median polish summary of PM probes.
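The last step, median polish, can be illustrated with base R's `medpolish()` on the LASP1 probes whose rows are complete in the raw intensity table. This is only a sketch of the summarization step: no background correction or quantile normalization has been applied, so the numbers are not RMA values, and the column labels `array1` to `array4` are placeholders.

```r
# Complete rows of the LASP1 raw intensity table (7 probes x 4 arrays)
pm <- rbind(
  "200618_at1"  = c(360, 216, 158, 198),
  "200618_at2"  = c(313, 402, 106, 103),
  "200618_at3"  = c(130, 182,  79,  91),
  "200618_at4"  = c(351, 370, 195, 136),
  "200618_at8"  = c(509, 554, 274, 128),
  "200618_at10" = c(668, 715, 247, 260),
  "200618_at11" = c(306, 286, 144, 159)
)
colnames(pm) <- paste0("array", 1:4)   # placeholder array labels

# Fit log2(intensity) = overall + probe effect + array effect + residual
fit <- medpolish(log2(pm), trace.iter = FALSE)

# The per-array summary is the overall level plus the array (column) effect
expression_index <- fit$overall + fit$col
print(round(expression_index, 2))
print(round(fit$row, 2))               # probe affinity effects
```

The row effects absorb the large probe-to-probe differences in affinity, so the column effects compare arrays within each probe, and medians make the summary resistant to a few outlying probes.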
25
    > eset <- rma(rrdata)
    trying URL '
    Content type 'application/zip' length bytes (1.3 Mb)
    opened URL
    downloaded 1.3 Mb
    package 'hgu95av2cdf' successfully unpacked and MD5 sums checked
    The downloaded packages are in
            C:\Documents and Settings\dmrocke\Local Settings…
    updating HTML package descriptions
    Background correcting
    Normalizing
    Calculating Expression
    > class(eset)
    [1] "ExpressionSet"
    attr(,"package")
    [1] "Biobase"
    > dim(exprs(eset))
    [1]
    > featureNames(eset)[1:5]
    [1] "100_g_at" "1000_at" "1001_at" "1002_f_at" "1003_s_at"
26
    > exprs(eset)[1:5,]
              LN0A.CEL LN0B.CEL LN1A.CEL LN1B.CEL LN2A.CEL LN2B.CEL LN3A.CEL
    100_g_at
    1000_at
    1001_at
    1002_f_at
    1003_s_at
              LN3B.CEL LN4A.CEL LN4B.CEL LN5A.CEL LN5B.CEL
    100_g_at
    1000_at
    1001_at
    1002_f_at
    1003_s_at
27
    > summary(exprs(eset))
        LN0A.CEL      LN0B.CEL      LN1A.CEL      LN1B.CEL
     Min.   :      Min.   :      Min.   :      Min.   : 2.636
     1st Qu.:      1st Qu.:      1st Qu.:      1st Qu.: 4.477
     Median :      Median :      Median :      Median : 6.078
     Mean   :      Mean   :      Mean   :      Mean   : 6.128
     3rd Qu.:      3rd Qu.:      3rd Qu.:      3rd Qu.: 7.467
     Max.   :      Max.   :      Max.   :      Max.   :11.889
        LN2A.CEL      LN2B.CEL      LN3A.CEL      LN3B.CEL
     Min.   :      Min.   :      Min.   :      Min.   : 2.622
     1st Qu.:      1st Qu.:      1st Qu.:      1st Qu.: 4.428
     Median :      Median :      Median :      Median : 6.028
     Mean   :      Mean   :      Mean   :      Mean   : 6.117
     3rd Qu.:      3rd Qu.:      3rd Qu.:      3rd Qu.: 7.459
     Max.   :      Max.   :      Max.   :      Max.   :13.138
        LN4A.CEL      LN4B.CEL      LN5A.CEL      LN5B.CEL
     Min.   :      Min.   :      Min.   :      Min.   : 2.590
     1st Qu.:      1st Qu.:      1st Qu.:      1st Qu.: 4.487
     Median :      Median :      Median :      Median : 6.068
     Mean   :      Mean   :      Mean   :      Mean   : 6.123
     3rd Qu.:      3rd Qu.:      3rd Qu.:      3rd Qu.: 7.457
     Max.   :      Max.   :      Max.   :      Max.   :11.952
28
Probe Sets not Genes
It is unavoidable to refer to a probe set as measuring a "gene", but it can nevertheless be deceptive.
The annotation of a probe set may be based on homology with a gene of possibly known function in a different organism.
Only relatively few probe sets correspond to genes with known function and known structure in the organism being studied.
29
Exercise
Download the ten arrays from the web site.
Load the arrays into R using ReadAffy() and construct the RMA expression indices.