1
Previous Lecture: Proteomics Informatics
Example data – MALDI-TOF Peptide intensity vs m/z
2
Gene Expression Analysis (I)
This Lecture: Gene Expression Analysis (I)
3
Learning Objectives
• Microarray experimental details
• Microarray data formats
• QC analysis and data exploration
• Normalization
• Differential expression
• Functional enrichment
• Databases
4
The Central Dogma of Molecular Biology
DNA is transcribed into RNA, which is then translated into protein.
[Figure: DNA (replication) → transcription → RNA → translation → protein. The RNA level is what is measured by a microarray.]
5
What is a Microarray?
A simple concept: Dot Blot + Northern
• Reverse the hybridization: put the probes on the filter and label the bulk RNA
• Make probes for lots of genes: a massively parallel experiment
• Make it tiny so you don't need so much RNA from your experimental cells
• Make quantitative measurements
6
Microarrays are Popular
At NYU Med Center we are now collecting about 3 GB of microarray data per week (60 chips, 6-10 different experiments).
PubMed search "microarray" = 13,948 papers
Year   Papers
2005   4406
2004   3509
2003   2421
2002   1557
2001   834
2000   294
7
A Filter Array
8
DNA Chip Microarrays
• Put a large number (~100K) of cDNA sequences or synthetic DNA oligomers onto a glass slide (or other substrate) in known locations on a grid
• Label an RNA sample and hybridize
• Measure the amount of RNA bound to each square in the grid
• Make comparisons: cancerous vs. normal tissue, treated vs. untreated, time course
• Many applications in both basic and clinical research
9
cDNA Microarray Technologies
• Spot cloned cDNAs onto a glass microscope slide (usually PCR-amplified segments of plasmids)
• Label 2 RNA samples with 2 different colors of fluorescent dye: control vs. experimental
• Mix the two labeled RNAs and hybridize to the chip
• Make two scans, one for each color
• Combine the images to calculate the ratio of the amounts of each RNA that bind to each spot
10
Spot your own Chip (plans available for free from Pat Brown’s website)
Robot spotter Ordinary glass microscope slide
12
Combine scans for Red & Green
False color image is made from digitized fluorescence data, not by superimposing scanned images
13
cDNA Spotted Microarrays
15
Data Acquisition
• Scan the arrays
• Quantitate each spot
• Subtract background
• Normalize
• Export a table of fluorescent intensities for each gene in the array
16
Affymetrix “Gene chip” system
• Uses 25-base oligos synthesized in place on a chip (20 pairs of oligos for each gene)
• RNA is labeled and scanned in a single "color": one sample per chip
• Can have as many as 20,000 genes on a chip
• Arrays get smaller every year (more genes)
• Chips are expensive
• Proprietary system: "black box" software, can only use their chips
17
Affymetrix Gene Chip
20
Affymetrix Technology
23
Affymetrix Software
• The Affymetrix system is totally automated
• Computes a single value for each gene from 40 probes (using surprisingly kludgy math)
• Highly reproducible (a re-scan of the same chip, or hybridization of duplicate chips with the same labeled sample, gives very similar results)
• Incorporates false results due to image artefacts: dust, bubbles, and pixel spillover from a bright spot to neighboring dark spots
24
Affymetrix Pivot Table
25
Plot of raw data (PM probes)
26
Plot of log2 data (PM probes)
27
MA plot: log of fold change (M) vs log of Intensity (A)
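M = log2(x/y)
A = ½ log2(x·y) = ½ (log2(x) + log2(y))
Here x and y are the intensities of the same probe on the two arrays being compared (the slide writes them as A and B, which clashes with the A of the plot axis).
[Figure: MA plots of Hypox1 vs. Hypox2, Hypox3, Norm1, Norm2, Norm3]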
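The two MA-plot coordinates can be sketched in a few lines. This is a minimal illustration, not the lecture's code: the function name `ma_values` and the intensities are made up for the example.

```python
import math

def ma_values(x, y):
    """Return (M, A) for one probe, given its intensity on two arrays."""
    m = math.log2(x / y)                      # M: log2 fold change
    a = 0.5 * (math.log2(x) + math.log2(y))   # A: mean log2 intensity
    return m, a

# Hypothetical intensities for three probes on two arrays.
array_1 = [1024.0, 200.0, 64.0]
array_2 = [256.0, 200.0, 256.0]

for x, y in zip(array_1, array_2):
    m, a = ma_values(x, y)
    print(f"M = {m:+.2f}, A = {a:.2f}")
```

A probe with equal intensity on both arrays lands on M = 0, so a well-normalized pair of replicate arrays should give a cloud centered on that line across the whole range of A.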
28
Goals of a Microarray Experiment
• Find the genes that change expression between experimental and control samples
• Classify samples based on a gene expression profile
• Find patterns: groups of biologically related genes that change expression together across samples/treatments
29
Basic Data Analysis
• Fold change (relative increase or decrease in intensity for each gene)
• Set a cutoff filter for low values (background + noise)
• Cluster genes by similar changes (only really meaningful across multiple treatments or time points)
• Cluster samples by similar gene expression profiles
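The first two steps (fold change plus a low-value cutoff) might look like the sketch below. The function name, the floor of 50 intensity units and the 2-fold threshold are illustrative assumptions, not values from the lecture.

```python
def fold_changes(control, treated, floor=50.0, min_fold=2.0):
    """Return {gene: signed fold change} for genes that pass both filters."""
    hits = {}
    for gene in control:
        c, t = control[gene], treated[gene]
        if max(c, t) < floor:
            continue                          # both values are within background + noise
        c, t = max(c, floor), max(t, floor)   # clamp low values so ratios stay stable
        ratio = t / c
        fold = ratio if ratio >= 1 else -1 / ratio   # report decreases as negative folds
        if abs(fold) >= min_fold:
            hits[gene] = fold
    return hits

# Hypothetical intensities for four genes.
control = {"g1": 100.0, "g2": 20.0, "g3": 400.0, "g4": 300.0}
treated = {"g1": 400.0, "g2": 30.0, "g3": 100.0, "g4": 330.0}
print(fold_changes(control, treated))
```

Here g2 is dropped because both values sit below the floor, and g4 is dropped because its change is only 1.1-fold.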
30
Streamlined Affy Analysis
Raw data
Normalize (RMA)
Filter
• Present/Absent
• Minimum value
• Fold change
Significance
• t-test
• SAM
• Rank Product
Classification
• PAM
• Machine learning
Clustering
Gene lists
Function (Gene Ontology)
31
Scatter plot of all genes in a simple comparison of two controls (A) and two treatments (B: high vs. low glucose), showing changes in expression greater than 2.2-fold and 3-fold.
32
Thomas Hudson, Montreal Genome Center
33
Normalization
• Can control for many of the experimental sources of variability (systematic, not random or gene specific)
• Bring each image to the same average brightness
• Can use simple math or fancy:
  • divide by the mean (whole chip or by sectors)
  • LOESS (locally weighted regression)
• No sure biological standards
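The "simple math" option, dividing by the mean so that every chip has the same average brightness, can be sketched as follows. The function name and the toy intensities are hypothetical, and the sketch scales whole chips only (no sectors, no LOESS).

```python
from statistics import mean

def scale_to_common_mean(arrays):
    """Rescale each array so its mean intensity equals the mean across arrays."""
    target = mean(mean(a) for a in arrays)
    scaled = []
    for a in arrays:
        factor = target / mean(a)          # one scaling factor per chip
        scaled.append([v * factor for v in a])
    return scaled

# Two hypothetical chips; the second was scanned twice as bright overall.
chips = [[100.0, 200.0, 300.0], [200.0, 400.0, 600.0]]
normalized = scale_to_common_mean(chips)
print(normalized)
```

A single factor per chip removes an overall brightness difference but cannot correct intensity-dependent bias, which is the case LOESS is meant for.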
34
RMA: Robust Multichip Average
log2(PM_ij − BG) = µ_i + α_j + e_ij   for array i, probe j
The model is fitted by median polish (medpol): µ_i is the log-scale expression value for array i, α_j is the effect of probe j, and e_ij is the residual.
Bolstad, B.M., Irizarry, R.A., Astrand, M., and Speed, T.P. (2003). A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Bias and Variance. Bioinformatics 19(2).
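The median polish step of the model above can be sketched on a small probes-by-arrays table of background-corrected log2 PM values. This is an illustration of Tukey's median polish only. It leaves out the background correction and between-array normalization that come first in RMA, and the function name and data are made up.

```python
from statistics import median

def median_polish(table, n_iter=10):
    """Fit table[j][i] ~ overall + probe[j] + array[i] + residual by median polish."""
    resid = [row[:] for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    overall, probe_eff, array_eff = 0.0, [0.0] * n_rows, [0.0] * n_cols
    for _ in range(n_iter):
        # Sweep out row (probe) medians.
        for j in range(n_rows):
            m = median(resid[j])
            probe_eff[j] += m
            resid[j] = [v - m for v in resid[j]]
        m = median(array_eff)
        overall += m
        array_eff = [c - m for c in array_eff]
        # Sweep out column (array) medians.
        for i in range(n_cols):
            m = median(resid[j][i] for j in range(n_rows))
            array_eff[i] += m
            for j in range(n_rows):
                resid[j][i] -= m
        m = median(probe_eff)
        overall += m
        probe_eff = [r - m for r in probe_eff]
    return overall, probe_eff, array_eff, resid

# Hypothetical log2 PM values: 3 probes (rows) of one probe set on 2 arrays (columns).
log_pm = [[7.0, 9.0], [8.0, 10.0], [9.0, 11.0]]
overall, probe_eff, array_eff, resid = median_polish(log_pm)
expression = [overall + c for c in array_eff]   # one summary value per array
print(expression, probe_eff)
```

Because medians are used instead of means, a single outlying probe has little influence on the per-array summary value.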
35
Are the Treatments Different?
• Analysis of microarray data has tended to focus on making lists of genes that are up- or down-regulated between treatments
• Before making these lists, ask the question: "Are the treatments different?"
• Use PCA/MDS or cluster the samples
• If the treatment is responsible for differences, then use statistical methods to find the genes most responsible
• If there are no significant overall differences, then lists of genes with large fold changes may only reflect random variability
36
Statistics
• When you have variability in measurements, you need replication and statistics to find real differences
• It's not just the genes with a 2-fold increase, but those with a significant p-value across replicates
• Non-parametric (e.g. rank or permutation) or paired-value statistics may be more appropriate (low number of samples, high standard deviation)
37
Multiple Comparisons
• In a microarray experiment, each gene (each probe or probe set) is really a separate experiment
• Yet if you treat each gene as an independent comparison, you will always find some with significant differences (the tails of a normal distribution)
• Different genes are NOT independent
38
False Discovery
• Statisticians call a false positive a "type 1 error" or a "false discovery"
• The number of false discoveries must be smaller than the number of real differences that you find, which in turn depends on the size of the differences and the variability of the measured expression values
• You can't know the true false discovery rate (FDR) for your data, but it can be estimated in a number of different ways
• In biology we tend to be comfortable with an estimated FDR of 5-10%
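As a rough illustration of the scale of the problem: if 20,000 genes were each tested at p < 0.05 and none truly changed, about 1,000 would still pass by chance. One common way to estimate and control the FDR is the Benjamini-Hochberg adjustment. It is not named on the slides and is used here only as an example of "one of a number of ways"; the p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda k: p_values[k])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        k = order[rank - 1]
        running_min = min(running_min, p_values[k] * n / rank)
        adjusted[k] = running_min
    return adjusted

# Hypothetical per-gene p-values.
p_values = [0.001, 0.008, 0.039, 0.041, 0.6]
q_values = benjamini_hochberg(p_values)
selected = [k for k, q in enumerate(q_values) if q <= 0.10]   # estimated FDR of 10%
print(q_values, selected)
```

Choosing genes by adjusted value at 0.10 corresponds to accepting that roughly one in ten genes on the resulting list may be a false discovery.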
39
SAM: Significance Analysis of Microarrays
Tusher, Tibshirani and Chu (2001): Significance analysis of microarrays applied to the ionizing radiation response. PNAS (Apr 24).
• R package, Excel plugin
• Free
• Permutation based
• Most published method of microarray data analysis
40
SAM- procedure overview
1. Sample genes expression scale
2. Define and calculate a statistic, d(i)
3. Generate permuted samples
4. Estimate attributes of d(i)'s distribution
5. Identify potentially significant genes
6. Choose Δ
7. Estimate FDR
41
• Calculate the "relative difference": a value that incorporates the change in expression between conditions and the variation of measurements in each condition
• Calculate the "expected relative difference": derived from controls generated by permutations of the data
• Plot them against each other and set a cutoff to identify deviating genes
• Calculate the FDR for the chosen cutoff from the control permutations
42
Relative Difference
d(i) = (x̄_I(i) − x̄_U(i)) / (s(i) + s0)
• x̄_I(i), x̄_U(i): mean expression of gene i in conditions I and U
• s(i): gene-specific scatter
• s0: constant to reduce variation of low expressed genes
43
SAM Two-Class Unpaired
Permutation tests
i) For each gene, compute the d-value (similar to a t-statistic). This is the observed d-value (di) for that gene.
ii) Randomly shuffle the expression values between groups A and B. Compute the d-value for each randomized set.
iii) Take the average of the randomized d-values for each gene. This is the 'expected relative difference' (dE) of that gene. The difference between di and dE is used to measure significance.
iv) Plot d(i) vs. dE(i).
v) Calculate FDR = average number of genes that exceed the cutoff in the permuted data, relative to the number of genes called significant in the original data.

Original grouping (Gene 1): Group A = Exp 1, Exp 2, Exp 3; Group B = Exp 4, Exp 5, Exp 6
Randomized grouping (Gene 1): Group A = Exp 1, Exp 4, Exp 5; Group B = Exp 2, Exp 3, Exp 6
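The observed and expected d-values can be sketched as below. This is a simplified illustration under stated assumptions, not the SAM package itself. The value s0 = 0.1, the cutoff delta = 1.0, the gene names and the expression values are all hypothetical. Following the published procedure, the permuted d-values are sorted within each permutation and averaged by rank, and the FDR estimate is left out.

```python
import math
import random
from statistics import mean

def d_statistic(values, labels, s0=0.1):
    """Relative difference for one gene: (mean B - mean A) / (s + s0)."""
    a = [v for v, g in zip(values, labels) if g == "A"]
    b = [v for v, g in zip(values, labels) if g == "B"]
    mean_a, mean_b = mean(a), mean(b)
    ss = sum((v - mean_a) ** 2 for v in a) + sum((v - mean_b) ** 2 for v in b)
    s = math.sqrt((1 / len(a) + 1 / len(b)) * ss / (len(a) + len(b) - 2))
    return (mean_b - mean_a) / (s + s0)

def observed_and_expected(data, labels, n_perm=200, seed=0):
    """Return genes sorted by observed d, their d-values, and expected d by rank."""
    rng = random.Random(seed)
    genes = sorted(data, key=lambda g: d_statistic(data[g], labels))
    observed = [d_statistic(data[g], labels) for g in genes]
    totals = [0.0] * len(genes)
    for _ in range(n_perm):
        shuffled = list(labels)
        rng.shuffle(shuffled)              # same relabelling for every gene
        perm_d = sorted(d_statistic(v, shuffled) for v in data.values())
        totals = [t + d for t, d in zip(totals, perm_d)]
    expected = [t / n_perm for t in totals]
    return genes, observed, expected

labels = ["A", "A", "A", "B", "B", "B"]    # Exp 1-3 vs. Exp 4-6
data = {
    "up_gene":   [5.0, 5.2, 4.9, 8.0, 8.1, 7.9],
    "flat_gene": [6.0, 6.3, 5.8, 6.1, 5.9, 6.2],
    "down_gene": [9.0, 9.1, 8.8, 6.0, 6.2, 5.9],
}
genes, observed, expected = observed_and_expected(data, labels)
delta = 1.0                                # hypothetical cutoff
for g, d_obs, d_exp in zip(genes, observed, expected):
    flag = "*" if abs(d_obs - d_exp) > delta else " "
    print(f"{flag} {g}: observed {d_obs:+.2f}, expected {d_exp:+.2f}")
```

With only three genes and six samples the permutation distribution is very coarse, so this shows the mechanics only. Real data sets have thousands of genes, which is what makes the expected d-values smooth.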
44
SAM Two-Class Unpaired
Plot d(i) vs. dE(i)
• "Observed d = expected d" line: for most of the genes, d(i) ≈ dE(i)
• The more a gene deviates from the "observed = expected" line, the more likely it is to be significant
• Any gene beyond the first gene in the +ve or -ve direction on the x-axis (including the first gene), whose observed d exceeds the expected d by at least delta, is considered significant
• Significant positive genes: mean expression of group B > mean expression of group A
• Significant negative genes: mean expression of group A > mean expression of group B
45
Higher Level Microarray data analysis
• Clustering and pattern detection
• Data mining and visualization
• Controls and normalization of results
• Statistical validation
• Linkage between gene expression data and gene sequence/function/metabolic pathway databases
• Discovery of common sequences in co-regulated genes
• Meta-studies using data from multiple experiments
46
Types of Clustering
• Hierarchical: link similar genes, build up to a tree of all
• Self Organizing Maps (SOM): split all genes into similar sub-groups; finds its own groups (machine learning)
• Principal Component: every gene is a dimension (vector); find a single dimension that best represents the differences in the data
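The hierarchical idea (link the most similar genes first, then keep merging) can be sketched with average-linkage clustering on a correlation distance. The distance measure, the linkage rule, the function names and the toy profiles are all choices made for this illustration. The slides do not specify them.

```python
from statistics import mean

def correlation_distance(x, y):
    """1 - Pearson correlation between two expression profiles."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return 1 - cov / (var_x * var_y) ** 0.5

def hierarchical_clusters(profiles, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        # Average distance over all gene pairs drawn from the two clusters.
        return mean(correlation_distance(profiles[a], profiles[b])
                    for a in c1 for b in c2)

    while len(clusters) > n_clusters:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical expression profiles across four time points.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.0, 4.5, 2.0],
}
print(hierarchical_clusters(profiles, n_clusters=2))
```

Recording the order of the merges, instead of stopping at a fixed number of clusters, gives the tree that is usually drawn next to a heat map.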
47
Cluster by fold change
48
GeneSpring
50
SOM Clusters
51
Classification
How to sort samples into two classes based on gene expression data:
• Cancer vs. normal
• Cancer sub-types (benign vs. malignant)
• Responds well to drug vs. poor response (e.g. tamoxifen for breast cancer)
52
PAM: Prediction Analysis for Microarrays
Class Prediction and Survival Analysis for Genomic Expression Data Mining
Performs sample classification from gene expression data, via the "nearest shrunken centroid method" of Tibshirani, Hastie, Narasimhan and Chu (2002): "Diagnosis of multiple cancer types by shrunken centroids of gene expression". PNAS (May 14).
53
BioConductor
All of these normalization, statistical, and clustering methods are available in a free software package called BioConductor, which is built on the R statistical environment (command line interface).
> library(affy)   # assumed here: the Bioconductor package that provides pm() and mm()
> data(SpikeIn)   # example spike-in probe set
> pms <- pm(SpikeIn)   # perfect-match probe intensities
> mms <- mm(SpikeIn)   # mismatch probe intensities
> par(mfrow = c(1, 2))   # two plots side by side
> concentrations <- matrix(as.numeric(sampleNames(SpikeIn)), 20,
+     12, byrow = TRUE)
> # The upper y limit is missing on the slide; 20000 is an assumed value.
> matplot(concentrations, pms, log = "xy", main = "PM", ylim = c(30, 20000))
> lines(concentrations[1, ], apply(pms, 2, mean), lwd = 3)
> matplot(concentrations, mms, log = "xy", main = "MM", ylim = c(30, 20000))
> lines(concentrations[1, ], apply(mms, 2, mean), lwd = 3)
54
Functional Genomics
• Take a list of "interesting" genes and find their biological relationships
• Gene lists may come from significance/classification analysis of microarrays, proteomics, or other high-throughput methods
• Requires a reference set of "biological knowledge"
55
Gene Ontology: how to organize biological knowledge?
• Biologists work on a variety of different research organisms: yeast, fruit fly, mouse, … human
• The same gene can have very different functions (antennapedia) and very different names (sonic hedgehog…)
56
GO
Biologists got together and developed a sensible system called Gene Ontology (GO)
• 3 hierarchical sets of terminology:
  • Biological Process
  • Cellular Component (location within cell)
  • Molecular Function
• About 1000 categories of functions
58
• List (and convert) gene identifiers from many genomic resources, including NCBI, PIR and Uniprot/SwissProt, as well as Illumina and Affymetrix gene IDs
• Gene IDs are matched to GO function annotations (for human)
• Test for enrichment of GO categories (or KEGG pathways, disease associations, etc.) in the list
• Groups significant categories into clusters
59
DAVID enrichment score: EASE
DAVID uses a modified Fisher's exact test to get p-values for enrichment. Basic idea: is the frequency of this category in the gene list greater than the frequency of the category in the genome?
A hypothetical example: in the human genome background (20,000 genes in total), 40 genes are involved in the p53 signaling pathway. In a given gene list, 3 out of 300 genes belong to the p53 signaling pathway. We then ask whether 3/300 is more than expected by random chance, compared to the human background of 40/20000.
The standard Fisher exact p-value uses the observed count of 3. The EASE score is more conservative: it uses 3-1 instead of 3, giving EASE score = 0.06. Since this p-value is > 0.01, the gene list is associated with (enriched in) the p53 signaling pathway no more than expected by random chance.
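The difference between a plain one-sided test and the "count minus one" EASE variant can be sketched with a hypergeometric tail probability. This sketch assumes the gene list is a random draw from the genome, which is one common way to set up the test. DAVID's own contingency table may be laid out differently, so the numbers printed here need not reproduce the 0.06 quoted above. The function name is hypothetical.

```python
from math import comb

def hypergeom_upper_tail(hits, list_size, pathway_size, genome_size):
    """P(X >= hits) when list_size genes are drawn at random from the genome."""
    total = comb(genome_size, list_size)
    upper = min(list_size, pathway_size)
    favorable = sum(
        comb(pathway_size, k) * comb(genome_size - pathway_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return favorable / total

# Numbers from the example: 3 of 300 list genes, 40 of 20,000 genome genes.
fisher_p = hypergeom_upper_tail(3, 300, 40, 20000)
ease_p = hypergeom_upper_tail(3 - 1, 300, 40, 20000)   # EASE-style: one hit removed
print(f"one-sided p = {fisher_p:.4f}, EASE-style p = {ease_p:.4f}")
```

Removing one hit before testing always raises the p-value, and it penalizes categories supported by only a handful of genes the most, which is the sense in which the EASE score is more conservative.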
60
Microarray Databases
• Large experiments may have hundreds of individual array hybridizations
• Core lab at an institution, or multiple investigators using one machine: archive data and validate across experiments
• Data mining: look for similar patterns of gene expression across different experiments
61
Public Databases
• Gene expression data is an essential aspect of annotating the genome
• Publication and data exchange for microarray experiments
• Data mining/meta-studies
• Common data format: XML
• MIAME (Minimal Information About a Microarray Experiment)
62
Array Express at EMBL
64
GEO at the NCBI
69
Summary
• Microarray experimental details
• Microarray data formats
• QC analysis and data exploration
• Normalization
• Differential expression
• Functional enrichment
• Databases
70
Next Lecture: Next Generation Sequencing Informatics