Microarray Gene Expression Analysis


Microarray Gene Expression Analysis 23/03/2009 Daniele Merico PhD, Molecular and Cellular Biology PDF @ Bader Lab & Emili Lab

Gene expression analysis: general workflow. Define the experimental design; collect the biological samples; generate the expression data; identify the differential genes; identify the functional groups.

Identify the Functional Groups: Different Strategies. GENE SETS (e.g. Spindle: Gene.1, Gene.2, Gene.3; P53 signaling: Gene.2, Gene.4, Gene.5): score the set depending on the gene expression of its member genes. PATHWAYS: just visual, or score the pathways exploiting gene expression and topology. NETWORKS: identify sub-networks (i.e. modules) satisfying some joint gene expression and topology requirement.

A brief history of microarrays About 5 min.

Microarray Chronology. Number of PubMed publications by year, using a query containing keywords such as microarray, transcriptomics, etc.

Microarray Chronology First Microarray Publication [1] 45 Arabidopsis genes [1] Schena M, Shalon D, Davis RW, Brown PO.; Quantitative monitoring of gene expression patterns with a complementary DNA microarray.; Science. 1995 Oct 20;270(5235):467-70.

Microarray Chronology Full Yeast Genome on microarray [2] [2] Lashkari DA, DeRisi JL, McCusker JH, Namath AF, Gentile C, Hwang SY, Brown PO, Davis RW. Yeast microarrays for genome wide parallel genetic and gene expression analysis. Proc Natl Acad Sci U S A. 1997 Nov 25;94(24):13057-62

Microarray Chronology Gene Ontology Consortium. Hierarchical Clustering and heat-maps [3] [3] M. B. Eisen, P. T. Spellman, P. O. Brown and D. Botstein, Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95 (1998), pp. 14863–14868.

Microarray Chronology Gene Ontology enrichment (hypergeometric)

Microarray Chronology Gene expression profiling on interaction networks [4] [4] Discovering regulatory and signalling circuits in molecular interaction networks. Ideker T, Ozier O, Schwikowski B, Siegel AF. Bioinformatics. 2002;18 Suppl 1:S233-40.

Microarray Chronology Full Human Genome on microarray Affymetrix HGU-133 plus 2.0

Microarray Chronology GSEA Enrichment [5] Gene Ontology, Pathways, other gene-sets [5] Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43):15545-50

Gene expression analysis: general workflow. Define the experimental design; generate the expression signals; explorative analysis and pre-processing; select differential genes; group into clusters; identify the functional groups.

The Experimental Design About 5 min.

Experimental Design: Tissue Specificity
Expression Matrix (each cell is an expression signal)

Class | Neuron | Gland | Bone | Blood
Affy ID | Neuron | Gland | Bone Osteoblast | Blood White Cell
98063_at | 1138.4 | 127.1 | 54.3 | 26.0
100080_at | 17.0 | 592.5 | 27.2 | 372.8
103012_at | 672.4 | 792.9 | 510.9 | 850.3
…

Experimental Design: Disease Disease state Wild Type Heart Disease Time 08 w 16 w 24 w Heart Disease 3 Wild Type Experimental Design Matrix Number of replicates

Important Points. (1) Clearly define your biological question(s). (2) Replicate experiments: biological variability must be factored in through replication, i.e. repeat the experiment using different biological samples. (3) Use clear and balanced designs: use the same number of replicates in every class. (4) Minimize experimental variability: it arises from different platforms, different protocols, different experimenters, different days, etc.; minimize all these factors. (5) Control the assumptions of your design: many studies on human patients assume two-class designs; however, the patients may exhibit heterogeneous phenotypes (e.g. different cancer stages) and hence different transcriptomes. The Explorative Analysis might reveal a different picture than you expected.

Generating the Expression Signals About 5 min.

Oligonucleotide Microarray

Oligonucleotide Microarray Technology. A transcript is recognized by 11-20 probe pairs; each probe is 25 nucleotides long. Figure: raw fluorescence image.

Oligonucleotide Microarray: Primary Signals. After essential image processing, we have signals for every probe. We need to integrate those signals into transcript/gene signals. Different techniques are available; two of the most popular are: (1) MAS-5 detection p-value: a p-value from a test of presence/absence of the transcript; relies on perfect match (PM) and mismatch (MM) probe signals; used for tissue specificity or for pre-filtering. (2) RMA signal: a continuous signal; relies only on perfect match probe signals; pre-normalized (no further normalization required); used for differential expression (i.e. comparing two or more classes).

Enter the Matrix…

The Explorative Analysis About 15 min.

Aims of the Explorative Analysis. Quality Control: are the samples directly comparable, or are they affected by systematic biases? Action: explore the signal distributions of the samples. Experimental Design (and beyond): are the samples grouped according to the classes entailed by the experimental design? Are replicated experiments similar enough? Action: use dimensionality reduction techniques (e.g. clustering, PCA, MDS) to explore global patterns.

Explore the Distributions. Distributions can be explored using boxplots. Boxplots make it possible to compare many distributions visually at once. Figure labels (top to bottom): outliers; upper whisker, i.e. the max point satisfying (x - Q3) < 1.5 * IQR; 3rd quartile; median; 1st quartile.
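The boxplot landmarks listed above can be computed directly. The following is a minimal sketch using only the Python standard library; the function name `boxplot_stats`, the choice of the "inclusive" quartile method, the symmetric lower whisker rule, and the signal values are all assumptions made for illustration, not part of the original slides.

```python
import statistics

def boxplot_stats(values):
    """Return quartiles, whisker ends and outliers for one sample."""
    xs = sorted(values)
    # Quartiles of the sample (the 'inclusive' method is an assumption).
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    inside, outliers = [], []
    for x in xs:
        # Whisker rule from the slide, applied symmetrically to both tails.
        if (x - q3) < 1.5 * iqr and (q1 - x) < 1.5 * iqr:
            inside.append(x)
        else:
            outliers.append(x)
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical log-scale signals for one array; the last value is suspicious.
signals = [5.1, 5.8, 6.0, 6.3, 6.9, 7.2, 7.4, 8.0, 14.5]
stats = boxplot_stats(signals)
print(stats)
```

Running the same function on every sample and comparing the medians and IQRs side by side gives a numeric counterpart to the visual comparison.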

Explore the Distributions An example with real data

Explore the Distributions. What should we do if we see differences in the distributions? Moderate differences can be corrected by normalization (addressed in the pre-processing section). Very large differences may be a sign of quality problems: use other diagnostics (e.g. look at the raw image files), repeat single experiments, or discard certain samples.

Hierarchical Clustering. Hierarchical clustering summarizes the (dis)similarity structure of the samples: which samples are most similar and can be grouped together, and what the similarity relations between such groups are. The distance is proportional to dissimilarity.

Hierarchical Clustering Hierarchical clustering can reveal sample anomalies Somebody had fun the night before the experiment…

Hierarchical Clustering Hierarchical clustering can reveal poor separation between classes

Hierarchical Clustering: Technical Notes. Choose the dissimilarity score carefully (e.g. 1 - Pearson correlation, or Euclidean distance). Make sure you have normalized the samples.
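To see why the choice of dissimilarity score matters, the two scores named above can be compared on toy profiles. This is a small sketch in plain Python; the function names and the three profiles are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    """1 - Pearson correlation: 0 for identical patterns, 2 for opposite ones."""
    return 1.0 - pearson(x, y)

def euclidean_distance(x, y):
    """Straight-line distance: sensitive to the absolute signal scale."""
    return math.dist(x, y)

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]  # same pattern as a, ten times the scale
c = [4.0, 3.0, 2.0, 1.0]      # same scale as a, opposite pattern

print(correlation_distance(a, b), euclidean_distance(a, b))
print(correlation_distance(a, c), euclidean_distance(a, c))
```

With correlation distance, `a` is closest to `b` (same pattern); with Euclidean distance, `a` is closest to `c` (same scale). This is also why un-normalized samples can dominate a Euclidean clustering.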

Heat Map. It is common to use a heat-map in combination with hierarchical clustering. Because of visualization limits, it is common not to use all the genes, but only the most differential ones. Caveat: the most differential genes may have a sample-clustering pattern different from the global one. Heat-maps can also be used to display the patterns of gene-sets.

Principal Component Analysis (PCA). PCA projects the data into a new data-space; the new dimensions are ranked by the amount of variance "explained"; the top-ranked dimensions can be picked for visual exploration.
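As an illustration of "dimensions ranked by explained variance", here is a minimal sketch that finds the first principal component of a tiny matrix by power iteration. It uses plain Python only; the helper names and the four-sample, three-gene matrix are hypothetical, and real data sets with thousands of genes would normally go through a library routine rather than this loop.

```python
import math

def center_columns(rows):
    """Subtract each column (gene) mean so PCA describes variation, not level."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Gene-by-gene covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iters=200):
    """Leading eigenvalue and eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / (i + 1) for i in range(d)]  # arbitrary non-zero start vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return eigenvalue, v

# Rows are samples (objects), columns are genes (dimensions).
samples = [
    [10.0, 2.0, 5.0],   # class 1
    [11.0, 2.5, 5.1],   # class 1
    [2.0, 9.0, 5.0],    # class 2
    [1.5, 10.0, 4.9],   # class 2
]
centered = center_columns(samples)
cov = covariance(centered)
eigenvalue, pc1 = top_component(cov)
explained = eigenvalue / sum(cov[i][i] for i in range(len(cov)))
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]
print(round(explained, 3), [round(s, 2) for s in scores])
```

In this toy case the first component carries almost all of the variance, and the sample scores on it separate the two classes.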

Dimensionality Reduction by Projection The objects in a 3D space Reduction to 2D space

Principal Component Analysis (PCA). In microarray explorative analysis, samples are treated as objects and genes/transcripts are treated as dimensions. Figure: the samples x genes matrix is projected onto a samples x principal components matrix.

Principal Component Analysis (PCA). It is common to visualize the first two components in a bi-plot. Unfortunately, the number of components that can be visualized at once is limited. Empirical approaches can be used to evaluate the number of "informative" principal components.

Hierarchical Clustering vs PCA. Hierarchical clustering (HC) of samples and PCA can display partially different pictures. Cons of HC: more sensitive to noise; assumes binary aggregations; not suitable for time-course designs. Cons of PCA: only 2-3 dimensions can be displayed simultaneously.

Pre-processing About 10 min.

Sample Normalization. Sample normalization can be used to correct global biases in gene signals. It is called "sample" normalization because afterwards the sample distributions will look more similar. Normalization, like all data transformations, must be used thoughtfully. Different levels: (1) equalization of the central value (mean or median); (2) equalization of the central value and spread (standard deviation or IQR); (3) equalization of the distribution shape (quantile normalization).

Sample Normalization Equalization of the Central Value Equalization of the Central Value and Spread Note: the mean (μ) can be replaced by the median, the standard deviation (sd) can be replaced by the Inter-Quartile Range (IQR)
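The formulas for these two levels did not survive in this transcript. A common reading, assumed here, is x' = x - μ for the central value and x' = (x - μ) / sd for central value and spread, applied to each sample separately. The sketch below follows that assumption; the function names and signal values are hypothetical.

```python
import statistics

def equalize_center(sample, use_median=False):
    """Shift one sample so its central value (mean or median) becomes 0."""
    center = statistics.median(sample) if use_median else statistics.fmean(sample)
    return [x - center for x in sample]

def equalize_center_and_spread(sample):
    """Shift and rescale one sample to mean 0 and standard deviation 1."""
    mu = statistics.fmean(sample)
    sd = statistics.stdev(sample)
    return [(x - mu) / sd for x in sample]

# Two hypothetical samples: the second is brighter and more spread out.
sample_1 = [6.0, 7.0, 8.0, 9.0]
sample_2 = [9.0, 11.0, 13.0, 15.0]

centered = [equalize_center(s) for s in (sample_1, sample_2)]
scaled = [equalize_center_and_spread(s) for s in (sample_1, sample_2)]
print(centered)
print(scaled)
```

After centering, the two samples share a mean but still differ in spread; after centering and scaling, they coincide.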

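Quantile Normalization
1. Sort the distributions: first sample (genes E, G, A) = 97, 72, 50; second sample (genes B, A, F) = 81, 45, 41
2. Replace values: first sample (genes E, G, A) = 97, 72, 50; second sample (genes B, A, F) = 97, 72, 50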

Quantile Normalization After quantile normalization, the distributions look exactly the same
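The sort-and-replace procedure can be sketched as follows. This version assumes the widely used variant in which the replacement value for each rank is the mean of the values holding that rank across all samples; the function name and the gene ordering of the toy data are assumptions, and ties are not treated specially.

```python
def quantile_normalize(samples):
    """Quantile-normalize equal-length samples (one list of signals per sample)."""
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    # Target distribution: mean of the values that share the same rank.
    target = [sum(s[rank] for s in sorted_samples) / len(samples)
              for rank in range(n_genes)]
    normalized = []
    for s in samples:
        # Gene indices ordered by increasing signal within this sample.
        order = sorted(range(n_genes), key=lambda g: s[g])
        out = [0.0] * n_genes
        for rank, gene in enumerate(order):
            out[gene] = target[rank]  # replace the value, keep the gene's rank
        normalized.append(out)
    return normalized

# Same gene order in both samples (hypothetical).
sample_1 = [97.0, 72.0, 50.0]
sample_2 = [41.0, 81.0, 45.0]
norm_1, norm_2 = quantile_normalize([sample_1, sample_2])
print(norm_1, norm_2)
```

Each gene keeps its rank within its own sample, while the set of values becomes identical across samples.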

A Real-world Example Which is the normalized data-set? What normalization did I use? Are the distributions identical after normalization?

Gene Signal Standardization. Standardization can be used to make the gene signal scales (i.e. ranges) comparable. It is a transformation commonly used: before PCA (often done automatically by the software routine); before gene clustering; before mapping to the heat-map color scale.

Differential Gene Expression About 15 min.

Differential Gene Expression. The majority of experimental designs are two-class comparisons, or can be broken down into two-class comparisons (e.g. treated vs. untreated, transgenic vs. wild-type). For such designs it is interesting to identify genes displaying different signals in the two classes (differential gene expression).

Differential Expression Scores. Oriented to pure strength: absolute difference; ratio of classes; fold-change. Oriented to statistical significance: t-test; signal-to-noise; SAM (Significance Analysis of Microarrays), recommended. These scores can be used to: select gene-sets; prioritize gene lists; provide input for the identification of differential functional groups.

Differential Expression Statistics. Oriented to pure strength: these statistics focus only on the magnitude of the change, not on its consistency across replicated experiments. Oriented to statistical significance: these scores take into account the consistency of change across replicates; as a side effect, genes/transcripts with small but consistent changes can receive relatively high scores. Even so, these scores are usually preferable.

Differential Expression Statistics. Formulas shown for: absolute difference; t statistic; signal-to-noise; SAM statistic (with its stabilizing constant).
[SAM] Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001 Apr 24;98(9):5116-21. (PMID: 11309499)
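The formulas themselves are not preserved in this transcript. The sketch below uses the standard textbook forms for a single gene measured in two classes: the Welch form of the t statistic, the signal-to-noise ratio as the difference of means over the sum of standard deviations, and the SAM statistic as the difference of means over the pooled standard error plus a stabilizing constant `s0`. The function names, the value of `s0`, and the data are assumptions; in SAM itself `s0` is estimated from all genes rather than fixed by hand.

```python
import math
import statistics

def abs_difference(a, b):
    """Pure-strength score: absolute difference of class means."""
    return abs(statistics.fmean(a) - statistics.fmean(b))

def t_statistic(a, b):
    """Welch t statistic (class variances not assumed equal)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.fmean(a) - statistics.fmean(b)) / se

def signal_to_noise(a, b):
    """Difference of means divided by the sum of class standard deviations."""
    return (statistics.fmean(a) - statistics.fmean(b)) / (
        statistics.stdev(a) + statistics.stdev(b))

def sam_statistic(a, b, s0):
    """Difference of means over (pooled standard error + stabilizing constant s0)."""
    na, nb = len(a), len(b)
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    s = math.sqrt((1 / na + 1 / nb) * ss / (na + nb - 2))
    return (ma - mb) / (s + s0)

# Hypothetical log-scale signals of one gene in two classes.
treated = [8.1, 8.4, 7.9]
control = [6.0, 6.3, 5.8]

print(abs_difference(treated, control))
print(t_statistic(treated, control))
print(signal_to_noise(treated, control))
print(sam_statistic(treated, control, s0=0.5))
```

The stabilizing constant keeps genes with a very small standard error from receiving extreme scores on the basis of a tiny difference.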

From Statistics to Statistical Significance. Statistical significance is often expressed in the form of a p-value. When we compute a statistic (e.g. t statistic, SAM statistic), we then have to compute a p-value. P-values can be compared across different experiments, and p-value thresholds can be directly related to false positive incidence. For the t-test, we use the a priori known distribution of the t statistic. For the SAM statistic, we have to use a permutation approach.

SAM Permutation Approach. Figure: the Class A / Class B labels of the real data compared with the permuted (randomized) labels.
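The idea of comparing the real class labels against permuted ones can be sketched for a single gene as follows. This is a simplified illustration, not the full SAM procedure (which works with the statistics of all genes together); the difference of class means stands in for the SAM statistic, and the function names, number of permutations, seed, and data are assumptions.

```python
import random
import statistics

def class_mean_difference(values, labels):
    """Difference of class means (stand-in for any differential statistic)."""
    a = [v for v, lab in zip(values, labels) if lab == "A"]
    b = [v for v, lab in zip(values, labels) if lab == "B"]
    return statistics.fmean(a) - statistics.fmean(b)

def permutation_p_value(values, labels, n_perm=2000, seed=0):
    """Two-sided p-value: how often do shuffled labels score at least as high?"""
    rng = random.Random(seed)
    observed = abs(class_mean_difference(values, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permuted (random) class assignment
        # Small tolerance so float rounding does not hide exact ties.
        if abs(class_mean_difference(values, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)

labels = ["A"] * 4 + ["B"] * 4
differential_gene = [8.1, 8.4, 7.9, 8.3, 6.0, 6.3, 5.8, 6.1]
flat_gene = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]

p_diff = permutation_p_value(differential_gene, labels)
p_flat = permutation_p_value(flat_gene, labels)
print(p_diff, p_flat)
```

With only four replicates per class there are just 70 distinct label assignments, so the smallest p-value that can be reached is limited by the design, not by the number of permutations drawn.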

Functional Groups About 30 min.

Identify the Functional Groups: Different Strategies. GENE SETS (e.g. Spindle: Gene.1, Gene.2, Gene.3; P53 signaling: Gene.2, Gene.4, Gene.5): score the set depending on the gene expression of its member genes. PATHWAYS: just visual, or score the pathways exploiting gene expression and topology. NETWORKS: identify sub-networks (i.e. modules) satisfying some joint gene expression and topology requirement.

Gene-set Enrichment: Competitive vs. Self-contained. Two different strategies for enrichment. Self-contained: a differentiality statistic is computed for the gene-set; the statistical significance is evaluated by shuffling the columns of the gene expression matrix (i.e. the samples) and re-computing the differentiality statistic. Competitive: the enrichment of the gene-set is evaluated in comparison to the entire data-set, or to random samples of genes (of the same size).

Testing Gene-sets: Fisher's Exact Test / Hypergeometric Test. Input: a gene list from a two-class comparison (e.g. the UP genes) or from clusters, tested against a gene-set collection. Question: is the intersection larger than expected by random sampling? Caveat: threshold-dependent!
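The question "is the intersection larger than expected by random sampling?" corresponds to the upper tail of a hypergeometric distribution. Below is a minimal sketch using only `math.comb`; the function name and all the counts in the example are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_set, n_selected, n_overlap):
    """P(overlap >= n_overlap) when n_selected genes are drawn at random.

    n_universe: genes on the array; n_set: genes in the gene-set;
    n_selected: genes in the differential list; n_overlap: observed intersection.
    """
    upper = min(n_set, n_selected)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / comb(n_universe, n_selected)

# Hypothetical case: 10,000 genes on the array, a 100-gene set,
# 200 genes called UP, 10 of which fall in the set (2 expected by chance).
p = hypergeom_enrichment_p(10_000, 100, 200, 10)
print(p)
```

Because `n_selected` depends on the cut-off used to call genes differential, the resulting p-value changes with that cut-off, which is the threshold dependence noted on the slide.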

Testing Gene-sets: GSEA (Gene-Set Enrichment Analysis). The statistic is based on the cumulative sum-of-ranks: ES_Set = max(ES). Weighting options are available. P-value and FDR are estimated using permutations: randomly sampled gene sets, or shuffled phenotype labels.

Competitive vs. Self-contained How would you consider the Fisher’s Test and GSEA?

Testing Gene-sets: GSEA (Gene-Set Enrichment Analysis). The statistic is based on the cumulative sum-of-ranks: ES_Set = max(ES). Weighting options are available. P-value and FDR are estimated using permutations: randomly sampled gene sets (competitive), or shuffled phenotype labels (hybrid).

Enrichment Maps About 10 min.

GO.id | GO.name | p.value | cover | cover.rat | Deg.mdn | Deg.iqr
GO:0042330 | taxis | 2.18E-06 | 23 | 0.056930693 | 54.94499375 | 9.139238998
GO:0006935 | chemotaxis | 2.18E-06 | 23 | 0.060209424 | 54.94499375 | 9.139238998
GO:0002460 | adaptive immune response based on somatic recombination | 7.10E-05 | 25 | 0.111111111 | 57.32306955 | 16.97054864
GO:0002250 | adaptive immune response | 7.10E-05 | 25 | 0.111111111 | 57.32306955 | 16.97054864
GO:0002443 | leukocyte mediated immunity | 0.000419328 | 23 | 0.097046414 | 58.27890582 | 15.58333739
GO:0019724 | B cell mediated immunity | 0.000683758 | 20 | 0.114285714 | 57.84161096 | 15.03496347
GO:0030099 | myeloid cell differentiation | 0.000691589 | 24 | 0.089219331 | 62.22171598 | 10.35284833
GO:0002252 | immune effector process | 0.000775626 | 31 | 0.090116279 | 58.27890582 | 23.86214773
GO:0050764 | regulation of phagocytosis | 0.000792138 | 8 | 0.2 | 53.54786293 | 5.742849971
GO:0050766 | positive regulation of phagocytosis | 0.000792138 | 8 | 0.216216216 | 53.54786293 | 5.742849971
GO:0002449 | lymphocyte mediated immunity | 0.00087216 | 22 | 0.101851852 | 57.84161096 | 16.13171132
GO:0019838 | growth factor binding | 0.000913285 | 15 | 0.068181818 | 83.0405088 | 10.58734852
GO:0051258 | protein polymerization | 0.00108876 | 17 | 0.080952381 | 57.97543252 | 17.31639968
GO:0005789 | endoplasmic reticulum membrane | 0.001178198 | 18 | 0.036072144 | 64.02284752 | 12.05209158
GO:0016064 | immunoglobulin mediated immune response | 0.001444464 | 19 | 0.113095238 | 58.27890582 | 15.58333739
GO:0007507 | heart development | 0.001991562 | 26 | 0.052313883 | 84.02538284 | 18.60761304
GO:0009617 | response to bacterium | 0.002552999 | 10 | 0.027173913 | 52.75249873 | 23.23104637
GO:0030100 | regulation of endocytosis | 0.002658555 | 11 | 0.099099099 | 56.38041132 | 16.02486889
GO:0002526 | acute inflammatory response | 0.002660742 | 24 | 0.103004292 | 57.80098769 | 24.94311116
GO:0045807 | positive regulation of endocytosis | 0.002903401 | 9 | 0.147540984 | 54.94499375 | 6.769909171
GO:0002274 | myeloid leukocyte activation | 0.002969661 | 7 | 0.077777778 | 54.94499375 | 16.07042339
GO:0008652 | amino acid biosynthetic process | 0.003502921 | 7 | 0.017241379 | 45.19797271 | 31.18248579
GO:0050727 | regulation of inflammatory response | 0.004999055 | 7 | 0.084337349 | 54.94499375 | 7.737346076
GO:0002253 | activation of immune response | 0.00500146 | 23 | 0.116161616 | 60.29679989 | 18.41103376
GO:0002684 | positive regulation of immune system process | 0.006581245 | 27 | 0.111570248 | 60.29679989 | 22.05051447
GO:0050778 | positive regulation of immune response | 0.006581245 | 27 | 0.113924051 | 60.29679989 | 22.05051447
GO:0019882 | antigen processing and presentation | 0.007244488 | 7 | 0.029661017 | 54.94499375 | 16.58797889
GO:0002682 | regulation of immune system process | 0.007252134 | 29 | 0.099656357 | 61.05645008 | 22.65935206
GO:0050776 | regulation of immune response | 0.007252134 | 29 | 0.102112676 | 61.05645008 | 22.65935206
GO:0043086 | negative regulation of enzyme activity | 0.008017022 | 9 | 0.040723982 | 53.28031076 | 17.48904224
GO:0006909 | phagocytosis | 0.008106069 | 10 | 0.080645161 | 55.66270253 | 12.47536747
GO:0002573 | myeloid leukocyte differentiation | 0.008174948 | 10 | 0.092592593 | 62.86577216 | 9.401887596
GO:0006959 | humoral immune response | 0.008396095 | 16 | 0.044568245 | 55.05654091 | 18.94209565
GO:0046649 | lymphocyte activation | 0.009044401 | 29 | 0.059917355 | 61.92213317 | 21.03553355
GO:0030595 | leukocyte chemotaxis | 0.009707319 | 7 | 0.101449275 | 56.33116709 | 6.945510559
GO:0006469 | negative regulation of protein kinase activity | 0.010782155 | 7 | 0.046357616 | 52.22863516 | 12.58524145
GO:0051348 | negative regulation of transferase activity | 0.010782155 | 7 | 0.04516129 | 52.22863516 | 12.58524145
GO:0007179 | transforming growth factor beta receptor signaling pathw | 0.012630825 | 13 | 0.071038251 | 83.49440788 | 12.63256309
GO:0005520 | insulin-like growth factor binding | 0.012950071 | 9 | 0.097826087 | 81.41963394 | 7.528247832
GO:0042110 | T cell activation | 0.013410548 | 20 | 0.064516129 | 59.77891783 | 26.06174863
GO:0002455 | humoral immune response mediated by circulating immunogl | 0.016780163 | 10 | 0.125 | 54.70766244 | 14.2572143
GO:0005830 | cytosolic ribosome (sensu Eukaryota) | 0.016907351 | 8 | 0.01843318 | 61.68933284 | 7.814673781
GO:0006487 | protein amino acid N-linked glycosylation | 0.01791078 | 7 | 0.044585987 | 56.50635337 | 6.780726553
GO:0051240 | positive regulation of multicellular organismal process | 0.017931228 | 31 | 0.096573209 | 62.2953212 | 23.86214773
GO:0042379 | chemokine receptor binding | 0.018849666 | 12 | 0.095238095 | 55.13915015 | 19.08254406
GO:0008009 | chemokine activity | 0.018849666 | 12 | 0.096774194 | 55.13915015 | 19.08254406
GO:0016055 | Wnt receptor signaling pathway | 0.020088086 | 18 | 0.04400978 | 85.47935979 | 20.92435897

Re-organizing the Gene Ontology. Gene Ontology is hierarchical, and terms are highly redundant / inter-related / inter-dependent. Enrichment Maps are not hierarchical, yet they neatly group redundant / inter-related / inter-dependent terms.

Gene-set Overlap Measures: Jaccard Coefficient; Overlap Coefficient.
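The formulas for the two overlap measures are not preserved in this transcript; their standard definitions are intersection size over union size (Jaccard) and intersection size over the size of the smaller set (overlap coefficient). A minimal sketch follows, with hypothetical gene-sets named after two terms that appear in the results table.

```python
def jaccard_coefficient(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

def overlap_coefficient(a, b):
    """Size of the intersection divided by the size of the smaller set."""
    return len(a & b) / min(len(a), len(b))

# Hypothetical gene-sets: a child term fully contained in its parent term.
chemotaxis = {"Gene.1", "Gene.2", "Gene.3", "Gene.4", "Gene.5", "Gene.6"}
leukocyte_chemotaxis = {"Gene.1", "Gene.2", "Gene.3"}

print(jaccard_coefficient(chemotaxis, leukocyte_chemotaxis))
print(overlap_coefficient(chemotaxis, leukocyte_chemotaxis))
```

The overlap coefficient reaches 1 whenever one set is contained in the other, which suits parent/child Gene Ontology terms, whereas the Jaccard coefficient also penalizes the difference in size.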

Enrichment Map Visual Style. UP: correlation to HD phenotype. DOWN: anti-correlation to HD phenotype.

Immune Cell Proliferation AcCoA Metabolism / Krebs Cycle Cell Differentiation Immune Cell Proliferation AcCoA Metabolism / Krebs Cycle Carbohydrate Metabolism / Glycosylation Endomembrane System Immune Response Aminoacid Metabolism NFkB Phagocytosis Coagulation Oxidative Metabolism / Mitochondrion Fatty Acid Metabolism Peroxisome Cell Motility Antigen Recognition Vacuole / Lysosome Mitochondrial Ribosome Metabolism Heart Contraction / Blood Pressure Regulation Protein Folding Adherens Junctions Ubq-dependent Protein Degradation Growth Factor Extracelluar Matrix Embryonic Development Apoptosis Adhesion / Matrix / Tissue Remodeling RNA Processing / Translation Bone / Cartilage Development Protease Inhibitor Angiogenesis Tyr Kinase / Phosphatase Phospho-inositide Ruffle Actin Cytoskeleton Remodeling Cytoskeleton / Cell Cycle Miscellanea Microtubule Cytoskeleton Mitotic Cell Cycle Ras/Rho Signaling

Further Reading
Allison DB, Cui X, Page GP, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet. 2006 Jan;7(1):55-65. Review. PMID: 16369572
D'haeseleer P. How does gene expression clustering work? Nat Biotechnol. 2005 Dec;23(12):1499-501. PMID: 16333293
Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform. 2008 May;9(3):189-97. PMID: 18202032
Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y. Gene-set analysis and reduction. Brief Bioinform. 2009 Jan;10(1):24-34. PMID: 18836208
Cline MS, Smoot M, Cerami E, Kuchinsky A, Landys N, et al. Integration of biological networks and gene expression data using Cytoscape. Nat Protoc. 2007;2:2366-2382. PMID: 17947979

Contact and Links
Email: daniele.merico@gmail.com
Web-site: http://baderlab.org/DanieleMerico