Published by Reynard Skinner. Modified over 9 years ago.
1
Introduction to microarray technology and analysis
Carol Bult, Associate Professor, The Jackson Laboratory
2
Measuring Gene Expression
Idea: measure the amount of mRNA to see which genes are being expressed in (used by) the cell. Measuring protein might be more direct, but is currently harder.
3
Central Assumption of Gene Expression Microarrays
The level of a given mRNA is positively correlated with the expression of the associated protein: higher mRNA levels mean higher protein expression, and lower mRNA levels mean lower protein expression.
Other factors: protein degradation, mRNA degradation, polyadenylation, codon preference, translation rates, alternative splicing, translation lag…
4
Principal Uses of Microarrays
Genome-scale gene expression analysis
Differential gene expression between two (or more) sample types
Responses to environmental factors
Disease processes (e.g. cancer)
Effects of drugs
Identification of genes associated with clinical outcomes (e.g. survival)
5
Microarray example: Biomarker identification - lung cancer
[Figure: gene expression patterns; axes: Samples, Genes]
Gene expression patterns segregate the four major morphological lung tumor subtypes.
Patterns of gene expression promise to refine traditional morphologic classification of lung cancer.
Can we distinguish subsets of genes that would allow molecular-level classification of tumor subtypes?
Garber, Troyanskaya et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS 2001, 98(24).
6
Data partitioning clinically important: Patient survival for lung cancer subgroups
[Figure: cumulative survival vs. time (months) for Groups 1, 2 and 3; p = 0.002 for Group 1 vs. Group 3]
Can we identify individual genes that can predict patient survival for adenocarcinoma lung cancer?
Garber, Troyanskaya et al. Diversity of gene expression in adenocarcinoma of the lung. PNAS 2001, 98(24).
7
Biological question (differentially expressed genes, sample class prediction, etc.) → Experimental design → Microarray experiment → Image analysis → Normalization → Estimation, Testing, Clustering, Discrimination → Biological verification and interpretation
8
Technology basics
Microarrays are composed of short, specific DNA sequences attached to a glass or silicon slide at high density.
A microarray works by exploiting the ability of an mRNA molecule to bind specifically to, or hybridize with, the DNA template from which it originated.
RNA or DNA from the sample of interest is fluorescently labeled so that relative or absolute abundances can be quantitatively measured.
(Chris Workman)
9
Two color vs single color
Bakel and Holstege
10
Other applications of microarray technology (besides measuring gene expression)
DNA copy number analysis
SNP analysis
ChIP-chip (interaction data)
Competitive growth assays
…
11
Major technologies
cDNA probes (> 200 nt), usually produced by PCR, attached to either nylon or glass supports
Oligonucleotides (25-80 nt) attached to glass support
Oligonucleotides (25-30 nt) synthesized in situ on silica wafers (Affymetrix)
Probes attached to tagged beads
12
cDNA Microarray Design
Probe selection
  Non-redundant set of probes
  Includes genes of interest to project
  Corresponds to physically available clones
Chip layout
  Grouping of probes by function
  Correspondence between wells in microtiter plates and spots on the chip
13
Building the chip
[Figure: Ngai Lab arrayer, UC Berkeley; print-tip head]
15
Example dual channel cDNA array results
(Chris Workman)
16
Affymetrix GeneChips
Probes are oligos synthesized in situ using a photolithographic approach.
There are at least 5 oligos per cDNA, plus an equal number of negative controls.
The apparatus requires a fluidics station for hybridization and a special scanner.
Only a single fluorochrome is used per hybridization.
18
Affy
There may be 5,000-100,000 probe sets per chip.
A probe set = PM, MM pairs
20
Interpreting Affymetrix Output Perfect Match/Mismatch Strategy
For each probe designed to be perfectly complementary to a target sequence, a partner probe is generated that is identical except for a single base mismatch in its center. These probe pairs, called the Perfect Match probe (PM) and the Mismatch probe (MM), allow the quantitation and subtraction of signals caused by non-specific cross-hybridization. The difference in hybridization signals between the partners serves as an indicator of specific target abundance.
(Moustafa Ghanem)
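The PM/MM subtraction described above can be sketched in a few lines. This is a simplified average-difference summary, assumed here for illustration only; it is not the algorithm of any particular Affymetrix software release, and the function name and intensity values are hypothetical.

```python
def probe_set_signal(pm, mm):
    """Average PM - MM difference over the probe pairs of one probe set.

    Illustrative simplification: each MM value is treated as an estimate
    of non-specific hybridization for its PM partner.
    """
    if len(pm) != len(mm) or not pm:
        raise ValueError("need equal-length, non-empty PM and MM lists")
    diffs = [p - m for p, m in zip(pm, mm)]
    return sum(diffs) / len(diffs)


# Hypothetical intensities for a probe set with four probe pairs.
pm = [1200.0, 950.0, 1430.0, 1100.0]
mm = [300.0, 410.0, 380.0, 290.0]
signal = probe_set_signal(pm, mm)
print(signal)
```

A plain difference can go negative when MM exceeds PM, which is one reason some methods (RMA, for example) ignore the MM probes altogether.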
22
Experimental Design Bakel and Holstege
23
Microarray Analysis: Controlling for the Known Knowns and Unknown Unknowns
- Donald Rumsfeld, former Secretary of Defense
29
Selected references
Best advice? Consult a statistician before you start!
30
Statistical Power
The probability that a test will reject the null hypothesis when it is false.
Type I and Type II errors
Type I: rejecting the null hypothesis when it is true (false positive). We say there is a difference in gene expression between gene A and gene B when there really isn’t.
Type II: failing to reject the null hypothesis when it is false (false negative). We say there is no difference in gene expression between gene A and gene B when there actually is!
31
Power in Perspective
What are the 4 main components that determine what conclusions are drawn from a study?
Sample size: number of units
Effect size: signal to noise
Alpha level: significance level
Power: likelihood of detecting a treatment effect if it is there
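How the four components interact can be seen in a small calculation. The sketch below uses a normal approximation for a two-sided, two-sample comparison with equal group sizes and a standardized effect size; that choice of test and the function name are assumptions made for illustration, not something the slides prescribe.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test.

    effect_size is the difference in means divided by the common
    standard deviation (signal to noise).
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)          # rejection threshold
    shift = effect_size * sqrt(n_per_group / 2)  # expected test statistic
    # Probability of landing in either rejection tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for n in (3, 6, 12, 24):
    print(n, round(two_sample_power(1.0, n), 3))
```

With no true effect the function returns alpha, the Type I error rate; power rises as sample size or effect size grows.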
32
Check out this pithy description of Statistical Power and Hypothesis Testing
33
MicroArray Image Analysis
Based on slides from Robin Liechti
34
Microarray analysis
Array construction, hybridisation, scanning
Quantitation of fluorescence signals
Data visualisation
Meta-analysis (clustering)
More visualisation
35
Technical
[Figure labels: probe (on chip), sample (labelled), pseudo-colour image; image from Jeremy Buhler]
36
Experimental design
Track what’s on the chip
  which spot corresponds to which gene
Duplicate experimental spots
  reproducibility
Controls
  DNAs spotted on glass
    positive probe (induced or repressed)
    negative probe (bacterial genes on human chip)
  oligos on glass or synthesised on chip (Affymetrix)
    point mutants (hybridisation plus/minus)
37
Images from scanner
Resolution
  standard 10 µm [currently, max 5 µm]
  100 µm spot on chip = 10 pixels in diameter
Image format
  TIFF (tagged image file format), 16 bit (65,536 levels of grey)
  1 cm x 1 cm image at 16 bit = 2 MB (uncompressed)
  other formats exist, e.g. SCN (used at Stanford University)
Separate image for each fluorescent sample
  channel 1, channel 2, etc.
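The image-size figures above follow from simple arithmetic, worked through here as a check. The function name is illustrative.

```python
def scan_size(side_cm, pixel_um, bits_per_pixel):
    """Return (pixels per side, uncompressed size in bytes) for a square scan."""
    pixels_per_side = round(side_cm * 10_000 / pixel_um)  # 1 cm = 10,000 um
    n_pixels = pixels_per_side ** 2
    n_bytes = n_pixels * bits_per_pixel // 8
    return pixels_per_side, n_bytes


side, size = scan_size(side_cm=1, pixel_um=10, bits_per_pixel=16)
spot_diameter_px = 100 // 10  # 100 um spot at 10 um per pixel
print(side, size, spot_diameter_px)
```

At 10 µm resolution a 1 cm x 1 cm scan is 1000 x 1000 pixels at 2 bytes each, about 2 MB; halving the pixel size to 5 µm quadruples that.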
38
Images in analysis software
The two 16-bit images (cy3, cy5) are compressed into 8-bit images.
Goal: display fluorescence intensities for both wavelengths using a 24-bit RGB overlay image.
RGB image:
  Blue values (B) are set to 0
  Red values (R) are used for cy5 intensities
  Green values (G) are used for cy3 intensities
Qualitative representation of results
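A minimal sketch of building one overlay pixel follows. The slides do not say how the 16-bit values are compressed to 8 bits; linear scaling by dropping the low 8 bits is assumed here (real software may use a square-root or other transform), and the function names are illustrative.

```python
def to_8bit(value16):
    """Compress a 16-bit intensity (0..65535) to 8 bits (0..255).

    Assumption: linear scaling by discarding the low 8 bits.
    """
    return value16 >> 8


def overlay_pixel(cy5, cy3):
    """Return an (R, G, B) tuple: cy5 in red, cy3 in green, blue set to 0."""
    return (to_8bit(cy5), to_8bit(cy3), 0)


print(overlay_pixel(65535, 65535))  # equal strong signals: yellow
print(overlay_pixel(65535, 0))      # cy5 only: red
print(overlay_pixel(0, 65535))      # cy3 only: green
```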
39
Images: examples
Pseudo-color overlay of cy3 and cy5
Spot color / Signal strength / Gene expression
yellow / Control = perturbed / unchanged
red / Control < perturbed / induced
green / Control > perturbed / repressed
40
Processing of images
Addressing or gridding
  Assigning coordinates to each of the spots
Segmentation
  Classification of pixels either as foreground or as background
Intensity extraction (for each spot)
  Foreground fluorescence intensity pairs (R, G)
  Background intensities
  Quality measures
43
Addressing (I)
The basic structure of the images is known (determined by the arrayer).
Parameters to address the spot positions (ScanAlyze):
  Separation between rows and columns of grids
  Individual translation of grids
  Separation between rows and columns of spots within each grid
  Small individual translation of spots
  Overall position of the array in the image
44
Addressing (II)
The measurement process depends on the addressing procedure.
Addressing efficiency can be enhanced by allowing user intervention (slow!).
Most software systems now provide for both manual and automatic gridding procedures.
45
Segmentation (I)
Classification of pixels as foreground or background -> fluorescence intensities are calculated for each spot as a measure of transcript abundance.
Production of a spot mask: the set of foreground pixels for each spot.
46
Segmentation (II)
Segmentation methods and software that uses them:
Fixed circle segmentation: ScanAlyze, GenePix, QuantArray
Adaptive circle segmentation: GenePix, Dapple
Adaptive shape segmentation: Spot, region growing and watershed
Histogram segmentation: ImaGene, QuantArray, DeArray and adaptive thresholding
47
Fixed circle segmentation
Fits a circle with a constant diameter to all spots in the image.
Easy to implement.
The spots need to be of the same shape and size.
[Figure: bad example]
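Fixed circle segmentation is simple enough to sketch directly. The example below builds a spot mask of constant radius around a known spot center and averages the pixels inside it; the image is a toy list of lists and the function names are illustrative, not taken from ScanAlyze or any other package.

```python
def circle_mask(center, radius, shape):
    """Set of (row, col) pixels within `radius` of `center`."""
    cr, cc = center
    n_rows, n_cols = shape
    return {(r, c)
            for r in range(n_rows)
            for c in range(n_cols)
            if (r - cr) ** 2 + (c - cc) ** 2 <= radius ** 2}


def spot_mean(image, mask):
    """Mean pixel intensity over the spot mask."""
    return sum(image[r][c] for r, c in mask) / len(mask)


# Toy 5 x 5 image: background 10, a small bright spot centered at (2, 2).
image = [[10] * 5 for _ in range(5)]
for r, c in [(2, 2), (1, 2), (3, 2), (2, 1), (2, 3)]:
    image[r][c] = 200

mask = circle_mask(center=(2, 2), radius=1, shape=(5, 5))
foreground = spot_mean(image, mask)
print(len(mask), foreground)
```

The same radius is reused for every spot, which is why the method breaks down when spots vary in size or shape.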
48
Adaptive circle segmentation
Dapple finds spots by detecting edges of spots (second derivative).
The circle diameter is estimated separately for each spot.
Problematic if spots are oval.
49
Adaptive shape segmentation
Specification of starting points or seeds.
Regions grow outwards from the seed points preferentially according to the difference between a pixel’s value and the running mean of values in an adjoining region.
50
Histogram segmentation
Uses a target mask chosen to be larger than any other spot.
Foreground and background intensity are determined from the histogram of pixel values for pixels within the masked area.
Example: QuantArray
  Background: mean between 5th and 20th percentile
  Foreground: mean between 80th and 95th percentile
Unstable when a large target mask is set to compensate for variation in spot size.
[Figure: histogram of pixel values with background and foreground regions marked]
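The percentile rule quoted for QuantArray can be sketched as follows. The exact percentile convention used by that software is not given in the slides; simple index-based cut points on the sorted pixel values are assumed here, and the function name is illustrative.

```python
def percentile_band_mean(pixels, low_pct, high_pct):
    """Mean of the sorted pixel values between two percentiles.

    Assumption: percentiles are taken as index cut points in the sorted
    list (lower bound included, upper bound excluded). Very small masks
    can leave the band empty, in which case this raises ZeroDivisionError.
    """
    ordered = sorted(pixels)
    lo = int(len(ordered) * low_pct / 100)
    hi = int(len(ordered) * high_pct / 100)
    band = ordered[lo:hi]
    return sum(band) / len(band)


# Toy target mask with 100 pixel values, 0..99.
pixels = list(range(100))
background = percentile_band_mean(pixels, 5, 20)
foreground = percentile_band_mean(pixels, 80, 95)
print(background, foreground)
```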
51
Information extraction
52
Spot intensity
The total amount of hybridization for a spot is proportional to the total fluorescence at the spot.
Spot intensity = sum of pixel intensities within the spot mask.
Since later calculations are based on ratios between cy5 and cy3, we compute the average* pixel value over the spot mask.
*Alternative: use ratios of medians instead of means.
53
Background intensity
Motivation: a spot’s measured intensity includes a contribution of non-specific hybridization and other chemicals on the glass.
Fluorescence from regions not occupied by DNA should be different from regions occupied by DNA -> it could be interesting to use local negative controls (spotted DNA that should not hybridize).
Different background methods: local background, morphological opening, constant background, no adjustment.
54
Local background
Focuses on small regions surrounding the spot mask.
Background = median of pixel values in this region.
Most software packages implement such an approach: ScanAlyze, ImaGene, Spot, GenePix.
By not considering the pixels immediately surrounding the spots, the background estimate is less sensitive to the performance of the segmentation procedure.
55
Morphological opening (spot)
Applied to the original images R and G.
Use a square structuring element with side length at least twice as large as the spot separation distance.
Remove all the spots and generate an image that is an estimate of the background for the entire slide.
For individual spots, the background is estimated by sampling this background image at the nominal center of the spot.
Lower background estimate and less variable.
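Morphological opening is an erosion (local minimum) followed by a dilation (local maximum) with the same structuring element. The sketch below is a minimal pure-Python version on a toy image; it only illustrates the idea, it is not the implementation used by Spot, and the window is simply clipped at the image border. In the toy example the window is merely larger than the single spot, whereas the slide's guideline is a side length at least twice the spot separation.

```python
def _window_filter(image, half, pick):
    """Apply `pick` (min or max) over a square window of side 2*half + 1."""
    n_rows, n_cols = len(image), len(image[0])
    out = []
    for r in range(n_rows):
        row = []
        for c in range(n_cols):
            window = [image[i][j]
                      for i in range(max(0, r - half), min(n_rows, r + half + 1))
                      for j in range(max(0, c - half), min(n_cols, c + half + 1))]
            row.append(pick(window))
        out.append(row)
    return out


def morphological_opening(image, half):
    """Erosion followed by dilation with a square structuring element."""
    return _window_filter(_window_filter(image, half, min), half, max)


# Toy 8 x 8 image: flat background of 10 with one 2 x 2 spot of 100.
image = [[10] * 8 for _ in range(8)]
for r in (3, 4):
    for c in (3, 4):
        image[r][c] = 100

background_image = morphological_opening(image, half=2)
spot_background = background_image[3][3]  # sample at the nominal spot center
print(spot_background)
```

Because the window is larger than the spot, the erosion wipes the spot out and the opened image is an estimate of the background alone.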
56
Constant background
Global method which subtracts a constant background for all spots.
Some findings suggest that the binding of fluorescent dyes to ‘negative control spots’ is lower than the binding to the glass slide -> it is more meaningful to estimate background based on a set of negative control spots.
If there are no negative control spots: approximate the average background by the third percentile of all the spot foreground values.
57
No adjustment Do not consider the background
58
Quality measures (-> Flag)
How good are foreground and background measurements?
  Variability measures in pixel values within each spot mask
  Spot size
  Circularity measure
  Relative signal to background intensity
  b-value: fraction of background intensities less than the median foreground intensity
  p-score: extent to which the position of a spot deviates from a rigid rectangular grid
Based on these measurements, one can flag a spot.
59
Summary
Spot, GenePix, ScanAlyze
M = log2(R/G), A = log2 √(R·G)
The choice of background correction method has a larger impact on the log-intensity ratios than the segmentation method used.
The morphological opening method provides a better estimate of background than other methods: low within- and between-slide variability of the log2 R/G.
Background adjustment has a larger impact on low intensity spots.
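As a worked example of the two quantities defined above, the sketch below computes M and A for one spot from its red and green intensities. The intensities are made-up values, assumed to be background-corrected and positive.

```python
from math import log2, sqrt


def ma_values(r, g):
    """Return (M, A) for one spot: M = log2(R/G), A = log2(sqrt(R*G))."""
    m = log2(r / g)
    a = log2(sqrt(r * g))
    return m, a


m, a = ma_values(1024.0, 256.0)
print(m, a)
```

M measures the fold change between channels on a log2 scale, while A is the average log2 intensity of the spot, so A equals (log2 R + log2 G) / 2.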
60
Selected references
Yang, Y. H., Buckley, M. J., Dudoit, S. and Speed, T. P. (2001), ‘Comparisons of methods for image analysis on cDNA microarray data’. Technical report #584, Department of Statistics, University of California, Berkeley.
Yang, Y. H., Buckley, M. J. and Speed, T. P. (2001), ‘Analysis of cDNA microarray images’. Briefings in Bioinformatics, 2 (4). Excellent review in concise format!
61
Download the limma package and work through the Swirl zebrafish example.
63
64
Normalization - two problems
How do we detect biases?
  Which genes should we use for estimating biases among chips/channels?
How do we remove the biases?
65
Why normalize?
Microarray data have significant systematic variation both within arrays and between arrays that is not true biological variation.
Accurate comparison of genes’ relative expression within and across conditions requires normalization of effects.
Sources of variation:
  Spatial location on the array
  Dye biases which vary with spot intensity
  Plate origin
  Printing/spotting quality
  Experimenter
66
Why is normalization important?
Experiment: Comparison of gene expression response in mouse heart and kidney in response to drug Most biological effects are swamped by systematic effects! Source:
67
Other Sources of Systematic Bias
Individual factors: Print (20% - 30%), Experimenter (20% - 30%), Organism (3% - 10%), Date (5%), Software (2%), Number of tips (3%)
Interactions: Print - Experimenter (40%), Print - Date (40%), Experimenter - Date (40%)
(based on ~4,600 experiments in Stanford Microarray Database analyzed by ANOVA)
(slide from Catherine Ball)
68
Clearly visible plate effects
KO #8 Probes: ~6,000 cDNAs, including 200 related to lipid metabolism. Arranged in a 4x4 array of 19x21 sub-arrays.
69
Spatial Biases
Solution: spatial background estimation/subtraction
(Gavin Sherlock)
70
Spatial plots: background from two slides
71
Highlighting extreme log ratios
Top (black) and bottom (green) 5% of log ratios
72
Pin group (sub-array) effects
Lowess lines through points from pin groups Boxplots of log ratios by pin group
73
Boxplots and highlighting pin group effects
[Figure: boxplots of log-ratios by print-tip group]
Clear example of spatial bias
74
Time of printing effects
[Figure: green channel intensities (log2 G) vs. spot number]
Printing over 4.5 days. The previous slide depicts a slide from this print run.
75
Normalization in a nutshell
Goal is to measure the ratios of gene expression levels, (ratio)i = Ri/Gi, where Ri and Gi are, respectively, the measured red and green intensities for the ith spot.
In a self-hybridization, we would expect all ratios to be equal to one: Ri/Gi = 1 for all i. But they probably won’t be… Why?
  noise (systematic bias)
  signal (true differences)
Normalization brings appropriate ratios closer to 1.
76
77
The Starting Point: The Ratio (2-color arrays)
(Gavin Sherlock)
78
Log ratios treat up- and down-regulated genes equally
(two-color arrays)
log2(1) = 0
log2(2) = 1
log2(1/2) = -1
(Gavin Sherlock)
79
A note about Affymetrix (1-color) pre-processing
[Figure: typical Affymetrix probe intensity distribution, before and after log transform]
80
Normalization methods
81
Which Genes to use for bias detection?
All genes on the chip
Assumption: most of the genes are equally expressed in the compared samples; the proportion of differential genes is low (<20%).
Limits:
  Not appropriate when comparing highly heterogeneous samples (different tissues)
  Not appropriate for analysis of ‘dedicated chips’ (apoptosis chips, inflammation chips, etc.)
82
Which Genes to use for bias detection?
Housekeeping genes
Assumption: based on prior knowledge, a set of genes can be regarded as equally expressed in the compared samples.
  Affy novel chips: ‘normalization set’ of 100 genes
  NHGRI’s cDNA microarrays: 70 "house-keeping" genes set
Limits:
  The validity of the assumption is questionable
  Housekeeping genes are usually expressed at high levels, so they are not informative for the low intensity range
83
Which Genes to use for bias detection?
Spiked-in controls from another organism, over a range of concentrations
Limits:
  Low number of controls, so less robust
  Can’t detect biases due to differences in RNA extraction protocols
“Invariant set”
Trying to identify genes that are expressed at similar levels in the compared samples without relying on any prior knowledge:
  Rank the genes in each chip according to their expression level
  Find genes with small change in ranks
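The two invariant-set steps above (rank, then keep genes whose rank barely changes) can be sketched as below. This is a single-pass simplification with a fixed rank threshold chosen for illustration; tools such as dChip refine the set iteratively, and the threshold, function names and intensities here are assumptions.

```python
def ranks(values):
    """Rank of each value (0 = lowest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0] * len(values)
    for position, index in enumerate(order):
        rank[index] = position
    return rank


def invariant_set(chip_a, chip_b, max_rank_change):
    """Indices of genes whose rank differs by at most max_rank_change."""
    ra, rb = ranks(chip_a), ranks(chip_b)
    return [i for i in range(len(chip_a))
            if abs(ra[i] - rb[i]) <= max_rank_change]


# Hypothetical intensities for five genes on two chips.
chip_a = [100, 200, 300, 400, 500]
chip_b = [120, 600, 310, 390, 450]
stable = invariant_set(chip_a, chip_b, max_rank_change=1)
print(stable)
```

Gene 1 jumps from second-lowest to highest and is excluded; the remaining genes would then be used to estimate the bias between the chips.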
84
1. Global normalization (Scaling)
A single normalization factor (k) is computed for balancing chips/channels:
X_i^norm = k * X_i, or log2 R/G → log2 R/G – c (2-color)
Multiplying intensities by this factor equalizes the mean (median) intensity among compared chips.
Assumption: the total RNA (mass) used is the same for both samples. So, averaged across thousands of genes, total hybridization should be the same for both samples.
85
Global Normalization (1-color, e.g. Affymetrix)
[Figure: intensity distributions before and after scaling]
X_i^norm = k * X_i
86
Global Normalization (2-color)
[Figure: frequency histograms of log-ratios, un-normalized and normalized (Gavin Sherlock)]
log2 R/G → log2 R/G – c, where c = log2 (∑Ri / ∑Gi)
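The two-color global normalization formula above translates directly into code. The sketch below uses made-up intensities in which the red channel is uniformly twice as bright as the green one; the function name is illustrative.

```python
from math import log2


def global_normalize(red, green):
    """Subtract c = log2(sum(R) / sum(G)) from every log2(R/G)."""
    c = log2(sum(red) / sum(green))
    return [log2(r / g) - c for r, g in zip(red, green)]


# Hypothetical spot intensities with a uniform two-fold dye bias.
red = [200.0, 400.0, 800.0]
green = [100.0, 200.0, 400.0]
normalized = global_normalize(red, green)
print(normalized)
```

Every log ratio is shifted by the same constant, so a bias that changes with intensity is left untouched.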
87
2. Intensity-dependent normalization (Yang, Speed)
(Lowess – local linear fit) Compensate for intensity-dependent biases
88
Detect Intensity-dependent Biases: M vs A plots (also called R-I plot)
X axis: A, average intensity: A = 0.5*log(Cy3*Cy5)
Y axis: M, log ratio: M = log(Cy3/Cy5)
89
Intensity-dependent bias
[Figure: M = log(Cy3/Cy5) plotted against A]
High intensities: M>0, Cy3>Cy5
Low intensities: M<0, Cy3<Cy5
* Global normalization cannot remove intensity-dependent biases
90
We expect the M vs A plot to look like:
[Figure: M = log(Cy3/Cy5) plotted against A]
91
LOWESS (Locally Weighted Scatterplot Smoothing)
Local linear regression model
Tri-cube weight function
Least squares
Estimated values of log2(Cy5/Cy3) as function of log10(Cy3*Cy5)
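The three ingredients listed above (local linear model, tri-cube weights, least squares) can be combined into a bare-bones smoother. This is a sketch without the robustness iterations of the full LOWESS procedure; the `frac` parameter, function name and test data are assumptions for illustration. In normalization, x would be the intensity measure of each spot, y its log ratio, and the fitted curve would be subtracted from y.

```python
def lowess_fit(x, y, frac=0.5):
    """Locally weighted linear fit evaluated at every x (no robustness steps)."""
    n = len(x)
    k = max(3, int(frac * n + 0.5))  # number of neighbours in each local fit
    fitted = []
    for x0 in x:
        dists = sorted(abs(xi - x0) for xi in x)
        h = dists[k - 1]             # bandwidth: distance to k-th nearest point
        if h == 0:
            h = 1e-12
        # Tri-cube weights: 1 at x0, falling to 0 at distance h.
        w = [(1 - min(abs(xi - x0) / h, 1.0) ** 3) ** 3 for xi in x]
        sw = sum(w)
        mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
        my = sum(wi * yi for wi, yi in zip(w, y)) / sw
        # Weighted least-squares slope of the local line.
        sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# A purely linear trend is reproduced by the local linear fit.
x = [float(i) for i in range(10)]
y = [2 * xi + 1 for xi in x]
trend = lowess_fit(x, y)
residuals = [yi - ti for yi, ti in zip(y, trend)]
print(max(abs(r) for r in residuals))
```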
92
A note about Affymetrix (1-color) pre-processing
Pre-processing steps: within-chip and cross-chip; sequence specific background correction; within-probe set aggregation of intensity values (Johannes Freudenberg)
Two “standard” methods:
  MAS 5.0 (now GCOS/GDAS) by Affymetrix (compares PM and MM probes)
  RMA by Speed group (UC Berkeley) (ignores MM probes)
93
Normalization – Thoughts
There are many different ways to normalize data:
  Global median, LOWESS, LOESS, etc.
  By print tip, spatial, etc.
Choose one wisely.
BUT: don’t expect it to fix bad data!
  Won’t make up for lack of replicates
  Won’t make up for horrible slides
94
For next time...
Read the Quackenbush paper on normalization.
Look up the paper on Robust Multichip Averaging (RMA) out of Terry Speed’s lab.
What is meant by least squares?
Visit the Gene Expression Omnibus (GEO) resource at NCBI and explore what is there.
If you aren’t familiar with the statistical computing environment R, look it up on the web.
Look up MeV (multi-experiment viewer) on the web.
97
Analysis
98
Microarray Data Flow
[Figure: data-flow diagram with the stages Microarray experiment, Image Analysis, Database, Normalization & Centering, Data Matrix, Data Selection & Missing value estimation, Unsupervised Analysis (clustering), Supervised Analysis, Decomposition techniques, Networks & Data Integration]
100
Interpretation
101
Microarray data on the Web
Several initiatives to create “unified” databases:
  EBI: ArrayExpress
  NCBI: Gene Expression Omnibus
102
Normalization - tools
Normalization is typically provided in microarray vendor’s software/core facilities, but you should always understand the data you’re working with:
  How has your data been processed?
  Are there any lingering effects?
Bioconductor (both Affymetrix and cDNA): packages in the R language
dChip (Affymetrix): Quantile, Invariant set
MAANOVA: Microarray ANOVA analysis