Microarray Gene Expression Analysis

23/03/2009 Daniele Merico PhD, Molecular and Cellular Biology Bader Lab & Emili Lab

Gene expression analysis: general workflow
1. Define the experimental design
2. Collect the biological samples
3. Generate the expression data
4. Identify the Differential Genes
5. Identify the Functional Groups

Identify the Functional Groups
Different Strategies
- GENE SETS (e.g. Spindle: Gene.1, Gene.2, Gene.3; P53 signaling: Gene.2, Gene.4, Gene.5): score the set depending on the gene expression of its member genes
- PATHWAYS: just visual, or score the pathways exploiting gene expression and topology
- NETWORKS: identify sub-networks (i.e. modules) satisfying some joint gene expression and topology requirement

A brief history of microarrays
About 5 min.

Microarray Chronology
Number of PubMed publications by year, using a query containing keywords such as microarray, transcriptomics, etc.

First Microarray Publication [1]
45 Arabidopsis genes
[1] Schena M, Shalon D, Davis RW, Brown PO. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science Oct 20;270(5235):

Full Yeast Genome on microarray [2]
[2] Lashkari DA, DeRisi JL, McCusker JH, Namath AF, Gentile C, Hwang SY, Brown PO, Davis RW. Yeast microarrays for genome wide parallel genetic and gene expression analysis. Proc Natl Acad Sci U S A Nov 25;94(24):

Gene Ontology Consortium.
Hierarchical Clustering and heat-maps [3]
[3] Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A 95 (1998), pp –14868.

Gene Ontology enrichment (hypergeometric)

Gene expression profiling on interaction networks [4]
[4] Ideker T, Ozier O, Schwikowski B, Siegel AF. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics. 2002;18 Suppl 1:S

Full Human Genome on microarray
Affymetrix HGU-133 plus 2.0

GSEA Enrichment [5]
Gene Ontology, Pathways, other gene-sets
[5] Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A Oct 25;102(43):

Gene expression analysis: general workflow
1. Define the experimental design
2. Generate the expression signals
3. Explorative Analysis and Pre-processing
4. Select Diff. Genes
5. Group into Clusters
6. Identify the Functional Groups

The Experimental Design
About 5 min.

Experimental Design: Tissue Specificity
Expression Matrix (each value is an Expression Signal)

Class       Neuron    Gland    Bone              Blood
Affy ID     Neuron    Gland    Bone Osteoblast   Blood White Cell
98063_at    1138.4    127.1    54.3              26.0
100080_at   17.0      592.5    27.2              372.8
103012_at   672.4     792.9    510.9             850.3
…

Experimental Design: Disease
Experimental Design Matrix: number of replicates for each combination of disease state (Wild Type, Heart Disease) and time (08 w, 16 w, 24 w), e.g. 3

Important Points
- Clearly define your biological question(s)
- Replicate experiments
  - Biological variability must be factored in through replication: repeat the experiment using different biological samples
- Use clear and balanced designs
  - Use the same number of replicates in every class
- Minimize experimental variability
  - Experimental variability arises from different platforms, different protocols, different experimenters, different days, etc.
  - Minimize all these factors
- Check the assumptions of your design
  - Many studies on human patients assume two-class designs; however, the patients may exhibit heterogeneous phenotypes (e.g. different cancer stages) and hence different transcriptomes
  - The Explorative Analysis might reveal a different picture than you expected

Generating the Expression Signals
About 5 min.

Oligonucleotide Microarray

Oligonucleotide Microarray Technology
A transcript is recognized by probe pairs 25 nucleotides long
[Figure: raw fluorescence image]

Oligonucleotide Microarray: Primary Signals
- After essential image processing, we have signals for every probe
- We need to integrate those signals into transcript/gene signals
- Different techniques are available; two of the most popular ones are:
  - MAS-5 detection p-value
    - p-value of a test for presence/absence of the transcript
    - relies on perfect match (PM) and mismatch (MM) probe signals
    - used for tissue-specificity or for pre-filtering
  - RMA signal
    - continuous signal
    - relies only on perfect match (PM) probe signals
    - pre-normalized (no further normalization required)
    - used for differential expression (i.e. comparing two or more classes)

Enter the Matrix…

The Explorative Analysis
About 15 min.

Aims of the Explorative Analysis
- Quality Control
  - Are the samples directly comparable, or are they affected by systematic biases?
  - → explore the signal distributions of the samples
- Experimental Design (and beyond)
  - Are the samples grouped according to the classes entailed by the experimental design?
  - Are replicated experiments similar enough?
  - → use dimensionality reduction techniques (e.g. clustering, PCA, MDS) to explore global patterns

Explore the Distributions
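- Distributions can be explored using boxplots
- Boxplots make it possible to compare many distributions visually at once
- Elements of a boxplot, from top to bottom:
  - outliers
  - upper whisker: max point satisfying (x - Q3) < 1.5 * IQR, where IQR = Q3 - Q1
  - 3rd quartile (Q3)
  - median
  - 1st quartile (Q1)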
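The whisker rule can be made concrete with a small sketch. The function name `boxplot_stats`, the choice of the "inclusive" quartile method, and the symmetric rule for the lower whisker are assumptions made for illustration; plotting packages differ slightly in how they compute quartiles.

```python
import statistics

def boxplot_stats(values, k=1.5):
    """Summarize one sample the way a boxplot does (illustrative sketch)."""
    xs = sorted(values)
    # Quartiles; the "inclusive" method is an assumption, other conventions exist.
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    # Points within k * IQR of the box are covered by the whiskers.
    inside = [x for x in xs if (x - q3) < k * iqr and (q1 - x) < k * iqr]
    outliers = [x for x in xs if x not in inside]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "outliers": outliers,
    }

# Hypothetical expression signals for one sample, with one extreme value.
summary = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(summary)
```

Computing the same summary for every sample and comparing the medians and IQRs side by side gives the numerical counterpart of the visual comparison.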

An example with real data

What should we do if we see differences in the distributions?
- Moderate differences can be corrected by normalization (addressed in the pre-processing section)
- Very large differences may be a sign of quality problems
  - use other diagnostics (e.g. look at the raw image files)
  - repeat single experiments
  - discard certain samples

Hierarchical Clustering
- Hierarchical clustering summarizes the (dis)similarity structure of the samples
  - which samples are most similar and can be grouped together
  - what the similarity relations between such groups are
- The distance is proportional to dissimilarity

Hierarchical clustering can reveal sample anomalies
Somebody had fun the night before the experiment…

Hierarchical clustering can reveal poor separation between classes

Hierarchical Clustering Technical Notes
- Choose the dissimilarity score carefully
  - 1 - Pearson Correlation
  - Euclidean
- Make sure you have normalized samples
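The two dissimilarity scores behave quite differently, which a short sketch can show. The function names and the toy sample profiles below are hypothetical; the formulas are the standard definitions of Pearson correlation and Euclidean distance.

```python
import math

def pearson_dissimilarity(a, b):
    """1 - Pearson correlation between two sample profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

def euclidean_distance(a, b):
    """Straight-line distance between two sample profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical profiles over four genes.
s1 = [1, 2, 3, 4]
s2 = [10, 20, 30, 40]   # same pattern as s1, different scale
s3 = [4, 3, 2, 1]       # opposite pattern, same scale as s1

print(pearson_dissimilarity(s1, s2), euclidean_distance(s1, s2))
print(pearson_dissimilarity(s1, s3), euclidean_distance(s1, s3))
```

The correlation-based score treats `s1` and `s2` as identical because it ignores scale, while the Euclidean distance puts them far apart. This is one reason normalized samples matter when Euclidean distance is used.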

Heat Map
- It is common to use a heat-map in combination with hierarchical clustering
- Because of visualization limits, it is common not to use all the genes, but only the most differential ones
- Caveat: the most differential genes may have a sample-clustering pattern different from the global one
- Heat-maps can also be used to display the patterns of gene-sets

Principal Component Analysis (PCA)
- PCA projects the data into a new data-space
- the new dimensions are ranked by the amount of variance “explained”
- the top-ranked dimensions can be picked for visual exploration

Dimensionality Reduction by Projection
[Figure: the objects in a 3D space, and their reduction to a 2D space]

In microarray explorative analysis
- Samples are treated as objects
- Genes/transcripts are treated as dimensions
[Figure: the samples × genes matrix is projected onto a samples × Principal Components matrix]

- It is common to visualize the first two components in a bi-plot
- unfortunately, the number of components that can be visualized together is limited
- empirical approaches can be used to evaluate the number of “informative” principal components
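A minimal sketch of the projection idea follows, with samples as objects and genes as dimensions. It extracts only the first principal component, using power iteration on the gene covariance matrix. The function name, the toy data, and the use of power iteration are choices made for this illustration; real analyses would normally rely on a numerical library.

```python
import math
import random

def first_principal_component(samples, iterations=200, seed=0):
    """samples: list of rows (one per sample), each a list of gene signals.
    Returns (direction, per-sample scores, fraction of variance explained)."""
    n, p = len(samples), len(samples[0])
    # Center every gene (dimension) on its mean across samples.
    means = [sum(row[j] for row in samples) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in samples]
    # Gene-by-gene covariance matrix.
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration converges to the direction of largest variance.
    rng = random.Random(seed)
    v = [rng.random() + 0.1 for _ in range(p)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each sample onto the component.
    scores = [sum(r[j] * v[j] for j in range(p)) for r in centered]
    explained = (sum(s * s for s in scores) / (n - 1)) / sum(cov[j][j] for j in range(p))
    return v, scores, explained

# Hypothetical data: 4 samples x 3 genes, two classes of two replicates each.
data = [[10, 1, 5], [11, 2, 5], [1, 10, 5], [2, 11, 5]]
direction, scores, explained = first_principal_component(data)
print(scores, explained)
```

In this toy data the replicates of each class receive scores of the same sign, and almost all variance is captured by the first component, which is the pattern one hopes to see when the design classes dominate the data.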

Hierarchical Clustering vs PCA
Hierarchical clustering (HC) of samples and PCA can display partially different pictures
- Cons of HC
  - More sensitive to noise
  - Assumes binary aggregations
  - Not suitable for time-course designs
- Cons of PCA
  - Only 2-3 dimensions can be displayed simultaneously

Pre-processing
About 10 min.

Sample Normalization
- Sample normalization can be used to correct global biases in gene signals
- “Sample” because after normalization sample distributions will look more similar
- Normalization, like all data transformations, must be used thoughtfully
- Different levels
  - Equalization of the Central Value (Mean or Median)
  - Equalization of the Central Value and Spread (Standard Dev. or IQR)
  - Equalization of the Distribution Shape: Quantile Normalization

Sample Normalization
- Equalization of the Central Value
- Equalization of the Central Value and Spread
Note: the mean (μ) can be replaced by the median, and the standard deviation (sd) can be replaced by the Inter-Quartile Range (IQR)
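The slide formulas are not preserved in this transcript. The sketch below assumes the usual forms: subtract the central value, then optionally divide by the spread. The function names and the `robust` flag are hypothetical.

```python
import statistics

def equalize_center(sample, use_median=False):
    """Shift one sample so that its central value becomes 0."""
    center = statistics.median(sample) if use_median else statistics.mean(sample)
    return [x - center for x in sample]

def equalize_center_and_spread(sample, robust=False):
    """Shift and rescale one sample.
    robust=False uses mean and standard deviation;
    robust=True uses median and inter-quartile range (IQR)."""
    if robust:
        center = statistics.median(sample)
        q1, _, q3 = statistics.quantiles(sample, n=4, method="inclusive")
        spread = q3 - q1
    else:
        center = statistics.mean(sample)
        spread = statistics.stdev(sample)
    return [(x - center) / spread for x in sample]

# Hypothetical signals of one sample.
sample = [2, 4, 6, 8]
print(equalize_center(sample))
print(equalize_center_and_spread(sample))
print(equalize_center_and_spread(sample, robust=True))
```

Each sample is transformed independently, so after this step all samples share the same central value (and spread), while the shape of each distribution is left unchanged.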

Quantile Normalization
1. Sort the distributions
2. Replace values
[Figure: two samples with sorted values 97, 72, 50 and 81, 45, 41; after replacement both samples hold 97, 72, 50]

Quantile Normalization
After quantile normalization, the distributions look exactly the same
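The sort-and-replace procedure can be sketched as follows. This version replaces each value with the mean of the values sharing its rank across samples, which is a common choice and an assumption here; the slide figure appears to use one distribution as the reference instead. Ties are not handled specially, and the function name and input values are illustrative.

```python
def quantile_normalize(samples):
    """samples: list of equal-length lists, one list of gene signals per sample.
    Returns normalized samples in the original gene order."""
    n_genes = len(samples[0])
    # Step 1: sort each sample's distribution.
    sorted_samples = [sorted(s) for s in samples]
    # Reference distribution: mean across samples at each rank (assumed choice).
    reference = [sum(s[rank] for s in sorted_samples) / len(samples)
                 for rank in range(n_genes)]
    # Step 2: replace each value by the reference value of the same rank.
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda gene: s[gene])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical signals for three genes in two samples.
norm = quantile_normalize([[72, 97, 50], [41, 81, 45]])
print(norm)
```

After the transformation both samples contain exactly the same set of values; only the assignment of values to genes differs between samples.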

A Real-world Example Which is the normalized data-set?
What normalization did I use? Are the distributions identical after normalization?

Gene Signal Standardization
- Standardization can be used to make the gene signal scales (i.e. ranges) comparable
- It is a transformation commonly used:
  - Before PCA (often done automatically by the software routine)
  - Before gene clustering
  - Before mapping to the heatmap color-scale

Differential Gene Expression
About 15 min.

Differential Gene Expression
- The majority of experimental designs are two-class comparisons, or can be broken down into two-class comparisons
  - E.g. treated vs. untreated, transgenic vs. wild-type
- For such designs it is interesting to identify genes displaying different signals in the two classes (differential gene expression)

Differential Expression Scores
- Oriented to Pure Strength:
  - Absolute Difference
  - Ratio of classes
  - Fold-change
- Oriented to Statistical Significance:
  - t-test
  - Signal-to-noise
  - SAM (Significance Analysis of Microarrays) (recommended)
- These scores can be used to:
  - Select gene-sets
  - Prioritize gene lists
  - Provide input for the identification of differential functional groups

Differential Expression Statistics
- Oriented to Pure Strength
  - these statistics focus only on the magnitude of the change, not on its consistency across replicated experiments
- Oriented to Statistical Significance
  - these scores take into account the consistency of change across replicates
  - as a drawback, genes/transcripts with small but consistent changes can receive relatively high scores
  - nevertheless, these scores are usually preferable

Differential Expression Statistics
- Absolute Difference
- t statistic
- Signal-to-noise
- SAM statistic (includes a stabilizing constant)
[SAM] Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A Apr 24;98(9): (PMID: )
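The slide formulas are not preserved in this transcript, so the sketch below uses commonly cited forms as assumptions: a pooled-variance (Student) t statistic, signal-to-noise as the mean difference over the sum of the class standard deviations, and the SAM statistic as the t statistic with a stabilizing constant `s0` added to the denominator. In SAM, `s0` is estimated from the whole data set; the fixed value used here is purely illustrative.

```python
import math
import statistics

def differential_scores(class_a, class_b, s0=0.1):
    """Differential expression scores for one gene (illustrative sketch).
    class_a, class_b: signals of the gene in the replicates of each class."""
    n_a, n_b = len(class_a), len(class_b)
    mean_a, mean_b = statistics.mean(class_a), statistics.mean(class_b)
    sd_a, sd_b = statistics.stdev(class_a), statistics.stdev(class_b)
    diff = mean_a - mean_b
    # Pooled variance and standard error of the mean difference.
    pooled_var = ((n_a - 1) * sd_a ** 2 + (n_b - 1) * sd_b ** 2) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return {
        "difference": diff,                        # pure strength
        "t": diff / std_err,                       # Student t statistic
        "signal_to_noise": diff / (sd_a + sd_b),   # assumed form
        "sam": diff / (std_err + s0),              # s0 stabilizes small denominators
    }

# Hypothetical signals of one gene, three replicates per class.
print(differential_scores([10, 11, 12], [4, 5, 6]))
```

Because `s0` is added to the denominator, the SAM score is always smaller in magnitude than the t statistic, and the reduction is strongest for genes with very low variability.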

From Statistics to Statistical Significance
- Statistical significance is often expressed in the form of a p-value
- After computing a statistic (e.g. t statistic, SAM statistic), we have to convert it into a p-value
- p-values can be compared across different experiments, and p-value thresholds can be directly related to false positive incidence
- For the t-test, we use the distribution of the t-statistic, which is known a priori
- For the SAM statistic, we have to use a permutation approach

SAM Permutation Approach
[Figure: class labels (Class A, Class B) of the samples in the real data and in a permuted (random) assignment]
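The permutation idea can be sketched for a single gene: shuffle the class labels many times, recompute the statistic, and count how often the permuted value is at least as extreme as the real one. This is a simplified per-gene illustration with hypothetical function names, not the full SAM procedure, which works with the statistics of all genes at once.

```python
import random
import statistics

def mean_difference(class_a, class_b):
    """Example statistic; any differential expression score could be used."""
    return statistics.mean(class_a) - statistics.mean(class_b)

def permutation_p_value(class_a, class_b, statistic, n_permutations=1000, seed=0):
    """Two-sided permutation p-value for one gene."""
    rng = random.Random(seed)
    observed = abs(statistic(class_a, class_b))
    pooled = list(class_a) + list(class_b)
    n_a = len(class_a)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign the samples to the two classes.
        rng.shuffle(pooled)
        if abs(statistic(pooled[:n_a], pooled[n_a:])) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical gene with a clear class difference.
print(permutation_p_value([10, 11, 12, 13, 14], [1, 2, 3, 4, 5], mean_difference))
```

With few replicates per class the number of distinct label assignments is small, which limits how small the permutation p-value can get.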

Functional Groups
About 30 min.

Identify the Functional Groups
Different Strategies
- GENE SETS (e.g. Spindle: Gene.1, Gene.2, Gene.3; P53 signaling: Gene.2, Gene.4, Gene.5): score the set depending on the gene expression of its member genes
- PATHWAYS: just visual, or score the pathways exploiting gene expression and topology
- NETWORKS: identify sub-networks (i.e. modules) satisfying some joint gene expression and topology requirement

Gene-set Enrichment: Competitive vs. Self-contained
Two different strategies for enrichment:
- Self-contained
  - A differentiality statistic is computed for the gene-set
  - The statistical significance is evaluated by shuffling the columns of the gene expression matrix, and re-computing the differentiality statistic
- Competitive
  - The enrichment of the gene-set is evaluated in comparison to the entire data-set, or random samples of genes (of the same size)

Testing Gene-sets: Fisher’s Exact Test / Hypergeometric Test
- Inputs: a gene list from a Two-Class comparison (e.g. UP genes) or from Clusters, and a Gene-set Collection
- Is the intersection larger than expected by random sampling?
- Threshold-dependent!!
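The question "is the intersection larger than expected by random sampling?" corresponds to the upper tail of the hypergeometric distribution. The sketch below computes it directly; the function and parameter names are hypothetical.

```python
from math import comb

def hypergeometric_p_value(n_universe, n_gene_set, n_selected, n_overlap):
    """Probability of an overlap of at least n_overlap genes when n_selected
    genes are drawn at random from a universe of n_universe genes, of which
    n_gene_set belong to the gene-set (one-sided Fisher's exact test)."""
    total = comb(n_universe, n_selected)
    max_overlap = min(n_gene_set, n_selected)
    tail = sum(
        comb(n_gene_set, k) * comb(n_universe - n_gene_set, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total

# Hypothetical numbers: 10,000 genes on the array, 200 selected as UP,
# a gene-set of 50 genes, 8 of which are among the UP genes.
print(hypergeometric_p_value(10000, 50, 200, 8))
```

The dependence on a threshold enters through `n_selected`: changing the cut-off that defines the differential genes changes both the list size and the overlap, and hence the p-value.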

Testing Gene-sets: GSEA (Gene-Set Enrichment Analysis)
- Statistics based on the cumulative sum-of-ranks
  - ES_Set = max(ES)
  - Weighting options
- P-value and FDR estimated using permutations
  - Randomly sample gene sets
  - Shuffled phenotype labels

Competitive vs. Self-contained
How would you consider the Fisher’s Test and GSEA?

Testing Gene-sets: GSEA (Gene-Set Enrichment Analysis)
- Statistics based on the cumulative sum-of-ranks
  - ES_Set = max(ES)
  - Weighting options
- P-value and FDR estimated using permutations
  - Randomly sample gene sets (competitive)
  - Shuffled phenotype labels (hybrid)
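The running-sum statistic can be sketched as follows: walk down the ranked gene list, step up at genes in the set and down at all other genes, and keep the maximum deviation from zero. This follows the general scheme described in [5], but the function name, the `weight` parameter, and the toy data are illustrative assumptions, and significance estimation by permutation is left out.

```python
def enrichment_score(ranked_genes, scores, gene_set, weight=1.0):
    """ranked_genes: gene identifiers sorted by decreasing differential score.
    scores: the corresponding differential scores.
    gene_set: set of gene identifiers.
    weight: 0 gives equal steps for all hits; 1 weights hits by |score|."""
    in_set = [gene in gene_set for gene in ranked_genes]
    n_miss = len(ranked_genes) - sum(in_set)
    hit_weights = [abs(s) ** weight if hit else 0.0
                   for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("gene_set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        # Step up at genes in the set, down at every other gene.
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best

# Hypothetical ranked list of six genes.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
scores = [3, 2, 1, -1, -2, -3]
print(enrichment_score(genes, scores, {"g1", "g2"}))   # set at the top of the list
print(enrichment_score(genes, scores, {"g5", "g6"}))   # set at the bottom of the list
```

A gene-set concentrated at the top of the ranking gets a positive score and one concentrated at the bottom a negative score, without any threshold on the gene list.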

Enrichment Maps
About 10 min.

[Table of enriched Gene Ontology terms; columns: GO.id, GO.name, p.value, cover, cover.rat, Deg.mdn, Deg.iqr. GO.name entries:]
GO: taxis
GO: chemotaxis
GO: adaptive immune response based on somatic recombination
GO: adaptive immune response
GO: leukocyte mediated immunity
GO: B cell mediated immunity
GO: myeloid cell differentiation
GO: immune effector process
GO: regulation of phagocytosis
GO: positive regulation of phagocytosis
GO: lymphocyte mediated immunity
GO: growth factor binding
GO: protein polymerization
GO: endoplasmic reticulum membrane
GO: immunoglobulin mediated immune response
GO: heart development
GO: response to bacterium
GO: regulation of endocytosis
GO: acute inflammatory response
GO: positive regulation of endocytosis
GO: myeloid leukocyte activation
GO: amino acid biosynthetic process
GO: regulation of inflammatory response
GO: activation of immune response
GO: positive regulation of immune system process
GO: positive regulation of immune response
GO: antigen processing and presentation
GO: regulation of immune system process
GO: regulation of immune response
GO: negative regulation of enzyme activity
GO: phagocytosis
GO: myeloid leukocyte differentiation
GO: humoral immune response
GO: lymphocyte activation
GO: leukocyte chemotaxis
GO: negative regulation of protein kinase activity
GO: negative regulation of transferase activity
GO: transforming growth factor beta receptor signaling pathw
GO: insulin-like growth factor binding
GO: T cell activation
GO: humoral immune response mediated by circulating immunogl
GO: cytosolic ribosome (sensu Eukaryota)
GO: protein amino acid N-linked glycosylation
GO: positive regulation of multicellular organismal process
GO: chemokine receptor binding
GO: chemokine activity
GO: Wnt receptor signaling pathway

Re-organizing the Gene Ontology
- Gene Ontology is hierarchical, and terms are highly redundant / inter-related / inter-dependent
- Enrichment Maps are not hierarchical, yet they neatly group redundant / inter-related / inter-dependent terms

Gene-set Overlap Measures
Jaccard Coefficient Overlap Coefficient
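Both measures compare the size of the intersection of two gene-sets with a reference size: the union for the Jaccard coefficient, the smaller set for the overlap coefficient. A minimal sketch with hypothetical gene-sets:

```python
def jaccard_coefficient(set_a, set_b):
    """|A intersect B| / |A union B|"""
    a, b = set(set_a), set(set_b)
    return len(a & b) / len(a | b)

def overlap_coefficient(set_a, set_b):
    """|A intersect B| / min(|A|, |B|)"""
    a, b = set(set_a), set(set_b)
    return len(a & b) / min(len(a), len(b))

# Hypothetical gene-sets: the second is fully contained in the first,
# as happens with a parent and a child Gene Ontology term.
parent = {"Gene.1", "Gene.2", "Gene.3", "Gene.4"}
child = {"Gene.3", "Gene.4"}
print(jaccard_coefficient(parent, child))   # 0.5
print(overlap_coefficient(parent, child))   # 1.0
```

The overlap coefficient reaches 1 whenever one set is contained in the other, so it links parent and child terms strongly, whereas the Jaccard coefficient penalizes the difference in size.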

Enrichment Map Visual Style
- UP: Correlation to HD phenotype
- DOWN: Anti-correlation to HD phenotype

[Figure: enrichment map with labeled clusters. Labels:] Cell Differentiation Immune Cell Proliferation AcCoA Metabolism / Krebs Cycle Carbohydrate Metabolism / Glycosylation Endomembrane System Immune Response Amino Acid Metabolism NFkB Phagocytosis Coagulation Oxidative Metabolism / Mitochondrion Fatty Acid Metabolism Peroxisome Cell Motility Antigen Recognition Vacuole / Lysosome Mitochondrial Ribosome Metabolism Heart Contraction / Blood Pressure Regulation Protein Folding Adherens Junctions Ubq-dependent Protein Degradation Growth Factor Extracellular Matrix Embryonic Development Apoptosis Adhesion / Matrix / Tissue Remodeling RNA Processing / Translation Bone / Cartilage Development Protease Inhibitor Angiogenesis Tyr Kinase / Phosphatase Phospho-inositide Ruffle Actin Cytoskeleton Remodeling Cytoskeleton / Cell Cycle Miscellanea Microtubule Cytoskeleton Mitotic Cell Cycle Ras/Rho Signaling

Further Reading
- Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet Jan;7(1): Review. PMID:
- D'haeseleer P. How does gene expression clustering work? Nat Biotechnol Dec;23(12): PMID:
- Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform May;9(3): PMID:
- Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y. Gene-set analysis and reduction. Brief Bioinform Jan;10(1):24-34. PMID:
- Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, et al. (2007) Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2: PMID:

Contact and Links Email daniele.merico@gmail.com Web-site
