Microarray Data Preprocessing and Clustering Analysis


Spotted Microarray Workshop
Liangjiang (LJ) Wang, ljwang@ksu.edu
KSU Bioinformatics Center, Biology Division
June, 2005

Outline
- Overview of microarray data analysis.
- Microarray data preprocessing.
- Statistical inference of significant genes.
- Clustering analysis and visualization.
- Microarray databases and standards.

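Spotted Microarray
[Figure: workflow of a spotted microarray experiment. mRNA is extracted from reference cells and experimental cells, cDNA is made and labeled, and the labeled cDNA is hybridized to the probes on the array. The resulting array image data are converted into a gene expression matrix of ratios (genes by samples).]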

Overview of Microarray Data Analysis
[Flowchart, in the order shown on the slide:]
- Microarray experiment
- Image analysis and data normalization
- Sample classification
- Statistical inference of significant genes
- Clustering analysis of co-expressed genes
- List of significant or co-expressed genes
- Promoter analysis, gene function prediction, and pathway analysis

Microarray Image Analysis
- Spot finding: place a grid to identify spot locations.
- Segmentation: separate each spot (foreground) from the background.
- Spot intensity extraction: often uses the mean or median intensity of all the pixels within a spot.
- Background subtraction: may subtract a local background or a globally estimated background.
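The last two steps can be illustrated with a minimal Python sketch. It assumes the foreground and local background pixels of one spot have already been identified by segmentation; the function name `spot_signal`, the pixel values, and the choice to clamp negative results to zero are illustrative assumptions, not part of the slides.

```python
from statistics import median

def spot_signal(foreground_pixels, background_pixels):
    """Background-corrected intensity of one spot, using medians.

    The median is less sensitive than the mean to a few outlier pixels
    (dust, saturation). Clamping at zero is an assumption made here.
    """
    fg = median(foreground_pixels)
    bg = median(background_pixels)
    return max(fg - bg, 0.0)

# Hypothetical pixel intensities for a single spot and its local background.
fg = [1200, 1150, 1300, 1250, 5000]  # 5000 mimics one outlier pixel
bg = [200, 210, 190, 205]
signal = spot_signal(fg, bg)
print(signal)
```

Using the median here keeps the single outlier pixel from inflating the spot intensity, which is one reason the median is a common choice.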

Microarray Data Normalization
Normalization removes systematic bias from the data so that meaningful biological comparisons can be made. Sources of bias include:
- Unequal quantities of starting RNA.
- Differences in labeling (e.g., Cy3 versus Cy5).
- Different detection efficiencies between the dyes.
- Differences in hybridization and washing.
- Other experimental variations.
Normalization is based on some assumptions:
- A subset of genes (housekeeping genes) is assumed to be constant.
- The total intensity or the overall intensity distributions of the two channels are comparable.

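Global Normalization
Total intensity normalization: a normalization factor is calculated by summing the measured intensities in each of the two channels and then taking the ratio:
$$ N_{total} = \frac{\sum_{i} R_i}{\sum_{i} G_i} $$
All the intensities in one channel are then multiplied by the normalization factor, for example:
$$ G_i' = N_{total} \, G_i, \qquad R_i' = R_i $$
Here $R_i$ and $G_i$ are the red (Cy5) and green (Cy3) intensities of spot $i$. A subset of genes (housekeeping genes) may also be used for the global normalization.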
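As a rough sketch of total intensity normalization, the Python below assumes the factor is the summed red intensity divided by the summed green intensity and that the green channel is the one rescaled. The slide only says "one channel", so that choice, the function name, and the intensity values are assumptions.

```python
def total_intensity_normalize(red, green):
    """Scale the green channel so both channels have equal total intensity.

    Assumption: factor = sum(red) / sum(green), applied to the green channel.
    """
    factor = sum(red) / sum(green)
    return factor, [g * factor for g in green]

# Hypothetical spot intensities where the green channel is uniformly dimmer.
red = [100.0, 400.0, 1000.0]
green = [50.0, 200.0, 500.0]
factor, green_norm = total_intensity_normalize(red, green)
print(factor, green_norm)
```

After scaling, the two channel totals match, which is exactly the assumption this method relies on.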

Scatter Plot of Cy3 vs Cy5 Intensities
[Figure: intensities from a "self-self" hybridization, shown before normalization and after normalization (Quackenbush, 2001).]

Lowess Normalization
- Probably the most widely used approach for spotted microarray normalization.
- A locally weighted linear regression is used to estimate the systematic bias in the data.
[Figure: Ratio-Intensity (R-I) plots, also called MA plots, of the log ratio log2(R / G) for the raw data and after lowess normalization (Quackenbush, 2001).]

Why Log Transformation?
log2(R / G) treats up-regulated and down-regulated genes in a similar fashion:
- If R / G = 4, log2(R / G) = 2.
- If R / G = 1/4 = 0.25, log2(R / G) = -2.
Log transformation also makes the distribution of expression ratios more symmetric and closer to normal.
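The symmetry of the log ratio can be checked with a few lines of Python. The second function computes the two coordinates of an MA plot; the convention A = 0.5 * log2(R * G) is the usual one but is an assumption here, since the slides do not define the intensity axis.

```python
import math

def log_ratio(r, g):
    """log2 ratio of red to green intensity."""
    return math.log2(r / g)

def ma_values(r, g):
    """M (log ratio) and A (average log intensity) for one spot.

    Assumption: A = 0.5 * log2(R * G), the common MA-plot convention.
    """
    m = math.log2(r / g)
    a = 0.5 * math.log2(r * g)
    return m, a

print(log_ratio(4, 1))   # four-fold up-regulation
print(log_ratio(1, 4))   # four-fold down-regulation
print(ma_values(400.0, 100.0))
```

A four-fold increase and a four-fold decrease land at equal distances on either side of zero, whereas the raw ratios 4 and 0.25 do not.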

Finding Significant Genes
- Fold change: uses a single fold change threshold to select genes; does not take into account the biological and experimental variability.
- Statistical tests: t test, SAM and ANOVA; require a number of replicates for each condition.
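A fold change filter is simple enough to sketch directly. The example below assumes a two-fold threshold applied symmetrically on the log2 scale; the gene names, ratios, and threshold are made up for illustration.

```python
import math

def fold_change_filter(ratios, threshold=2.0):
    """Return genes whose ratio is at least `threshold`-fold up or down."""
    cutoff = math.log2(threshold)
    return [gene for gene, ratio in ratios.items()
            if abs(math.log2(ratio)) >= cutoff]

# Hypothetical expression ratios (experimental / reference).
ratios = {"g1": 4.0, "g2": 0.25, "g3": 1.5, "g4": 0.9}
selected = fold_change_filter(ratios)
print(selected)
```

Nothing in this filter looks at replicate variability, which is the weakness noted above.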

Volcano Plot
[Figure: volcano plot with an axis labeled "Statistical significance" increasing toward "high" (Wolfinger et al., 2001).]
Larger fold changes do not necessarily mean higher significance levels.

Student's t Test
To test whether there is a significant difference in gene expression measurements between two conditions (A and B):
- H0: no difference in gene expression ($\mu_A = \mu_B$).
- H1: the gene is differentially expressed ($\mu_A \neq \mu_B$).
- Test statistic:
$$ t = \frac{\bar{x}_A - \bar{x}_B}{s_p \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}}, \qquad s_p^2 = \frac{(n_A - 1) s_A^2 + (n_B - 1) s_B^2}{n_A + n_B - 2} $$
- Calculate the probability (p value) of the t statistic with degrees of freedom df = nA + nB - 2.
- Use a 5% significance level (i.e., a 5% false positive rate). If p ≤ 0.05, reject the null hypothesis.
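The test statistic can be sketched in Python as follows. The sketch assumes the equal-variance (pooled) form of the t test, which is what df = nA + nB - 2 implies, and stops at the t statistic and degrees of freedom: the standard library has no t distribution, so the p value would have to come from a statistics package or a table. The replicate values are illustrative.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(a, b):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # statistics.variance is the sample variance (divides by n - 1).
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df

# Hypothetical replicate measurements of one gene under conditions A and B.
condition_a = [0.9, 1.1, 1.4]
condition_b = [1.9, 2.1, 2.5]
t_stat, df = pooled_t_statistic(condition_a, condition_b)
print(round(t_stat, 3), df)
```

With only three replicates per condition the degrees of freedom are small, so a fairly large absolute t value is needed before the p value drops below 0.05.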

Problem of Multiple Testing
Suppose that you have 5,000 genes on your microarray, and you select the genes with p ≤ 0.05 (i.e., a 5% false positive rate). Because the t test has been applied 5,000 times, about 5,000 x 0.05 = 250 genes are expected to pass by chance alone, even if no gene is truly differentially expressed.

Correction for Multiple Testing
Bonferroni correction: set the significance cutoff p' = α / N, where α is the desired false positive rate and N is the number of genes. For example, if you have 5,000 genes on your microarray and you accept a 5% false positive rate, the significance cutoff is p' = 0.05 / 5000 = 1.0E-5.
False Discovery Rate (FDR): rank all the genes by significance (p value) so that the top gene has the most significant p value. Start from the top of the list, and accept the genes if
$$ p_i \le \frac{i}{N} \, q $$
where
- i: the rank of the gene in the list.
- N: the number of genes in the array.
- q: the desired FDR.
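Both corrections fit in a short Python sketch. For the FDR part it assumes the Benjamini-Hochberg step-up rule, which accepts every gene up to the largest rank i with p_i ≤ (i / N) q; the slide does not name the procedure, so that reading, the function names, and the p values are assumptions.

```python
def bonferroni_cutoff(alpha, n_genes):
    """Per-gene significance cutoff under Bonferroni correction."""
    return alpha / n_genes

def fdr_select(p_values, q):
    """Indices of genes accepted at FDR q (Benjamini-Hochberg step-up)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    last = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * q:
            last = rank  # remember the largest rank that passes
    return sorted(order[:last])

print(bonferroni_cutoff(0.05, 5000))

# Hypothetical p values for eight genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
accepted = fdr_select(p_values, q=0.05)
print(accepted)
```

Five of these p values fall below 0.05, but only the first two genes are accepted at an FDR of 5%; the Bonferroni cutoff for eight genes (0.00625) would accept only the first.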

SAM: Significance Analysis of Microarrays
- SAM (http://www-stat.stanford.edu/~tibs/SAM/) is a modified t test.
- The observed d statistic is computed from the data, and the expected d statistic is assessed by permutation.
- With a user-defined FDR, SAM derives the significance cutoffs for selecting up- and down-regulated genes.
[Figure: SAM plot of the observed d statistic against the expected d statistic, showing the line observed d = expected d, the significance cutoffs, and the up-regulated and down-regulated genes.]

ANOVA
ANalysis Of VAriance (ANOVA) is used to find significant genes in more than two conditions:
- For each gene, compute the F statistic.
- Calculate the p value for the F statistic.
- Adjust the significance cutoff for multiple testing.

Gene   Disease A         Disease B         Disease C
       A1   A2   A3      B1   B2   B3      C1   C2   C3
g1     0.9  1.1  1.4     1.9  2.1  2.5     3.1  2.9  2.6
g2     4.2  3.9  3.5     5.1  4.6  4.3     1.8  2.4  1.5

Rows g3 and g4 are incomplete in the transcript (g3: 0.7 1.2 0.6 0.8; g4: 2.0 1.7 4.0 3.2 2.8 6.3 5.7), and further genes follow.
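The F statistic for a single gene can be sketched in Python as a one-way ANOVA, using the g1 row of the table as input. The p value is not computed because the standard library has no F distribution; the function name is illustrative.

```python
from statistics import mean

def one_way_f(groups):
    """One-way ANOVA F statistic with its two degrees of freedom."""
    k = len(groups)
    all_values = [x for g in groups for x in g]
    n = len(all_values)
    grand_mean = mean(all_values)
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the replicates around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Gene g1 from the table: three replicates for each of diseases A, B and C.
g1 = [[0.9, 1.1, 1.4], [1.9, 2.1, 2.5], [3.1, 2.9, 2.6]]
f_stat, df1, df2 = one_way_f(g1)
print(round(f_stat, 2), df1, df2)
```

A large F means the differences between the disease means are large relative to the scatter among replicates, which is the pattern visible in the g1 row.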

Clustering Analysis
Clustering analysis partitions a dataset into a few groups (clusters) such that:
- Homogeneity: objects in the same cluster are similar to each other.
- Separation: dissimilar objects are placed in different clusters.
In microarray data analysis, this means finding groups of genes (or samples) with similar gene expression patterns.
Two key questions:
- How to measure similarity of gene expression?
- How to find these gene clusters?

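Distance Metrics
Expression vector: each gene can be represented as a vector in the N-dimensional hyperspace, where N is the number of samples. For two genes $A = (a_1, \dots, a_N)$ and $B = (b_1, \dots, b_N)$:
- Euclidean distance:
$$ d(A, B) = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2} $$
- Vector angle:
$$ \cos \alpha = \frac{\sum_{i=1}^{N} a_i b_i}{\sqrt{\sum_{i=1}^{N} a_i^2} \, \sqrt{\sum_{i=1}^{N} b_i^2}} $$
- Pearson correlation coefficient:
$$ r = \frac{\sum_{i=1}^{N} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{N} (a_i - \bar{a})^2} \, \sqrt{\sum_{i=1}^{N} (b_i - \bar{b})^2}} $$
[Figure: genes A = (a1, a2) and B = (b1, b2) plotted against Sample 1 and Sample 2, with the distance d and the angle α between the two vectors.]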
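The three measures can be compared on a small example. The Python below is a minimal sketch with made-up expression vectors; it uses the standard definitions of Euclidean distance, the cosine of the vector angle, and the Pearson correlation coefficient.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two expression vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a],
                             [y - mean_b for y in b])

# Hypothetical genes: B has the same pattern as A at twice the magnitude,
# C has the opposite pattern.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]
gene_c = [3.0, 2.0, 1.0]
print(euclidean(gene_a, gene_b))
print(cosine_similarity(gene_a, gene_b))
print(pearson(gene_a, gene_b), pearson(gene_a, gene_c))
```

Genes A and B are far apart by Euclidean distance yet perfectly correlated, so the choice of metric decides whether they end up in the same cluster.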

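Z Transformation
If Euclidean distance is used for clustering analysis, z transformation of the gene expression matrix may be necessary. For each gene, calculate the z scores of the expression values:
$$ z_i = \frac{x_i - \bar{x}}{s} $$
where $x_i$ is the expression value of the gene in sample $i$, and $\bar{x}$ and $s$ are the mean and standard deviation of that gene's values across all samples.
[Figure: profiles of Gene A and Gene B across samples. On the log(ratio) scale, dAB = 3.58; on the z score scale, dAB = 0.36.]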
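A z transformation of one gene's profile might look like the Python below. It assumes the sample standard deviation (dividing by n - 1); the slides do not say which form is used, and the expression values are made up.

```python
from statistics import mean, stdev

def z_transform(values):
    """Z scores of one gene's expression values across samples.

    Assumption: the sample standard deviation (n - 1) is used.
    """
    m = mean(values)
    s = stdev(values)
    return [(v - m) / s for v in values]

# Two hypothetical genes with the same shape but different magnitude.
gene_a = [1.0, 2.0, 3.0]
gene_b = [10.0, 20.0, 30.0]
z_a = z_transform(gene_a)
z_b = z_transform(gene_b)
print(z_a, z_b)
```

After the transformation the two profiles coincide, so their Euclidean distance shrinks to zero, which mirrors the drop in dAB described for the figure.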

Hierarchical Clustering
Agglomerative approach:
- Initialization: each object is a cluster.
- Iteration: merge the two clusters that are most similar to each other.
- Stop when all objects are merged into a single cluster.
[Figure: five objects (a, b, c, d, e) merged over Steps 0 to 4, from single-object clusters to one cluster containing a, b, c, d and e.]

Hierarchical Clustering (Cont'd)
Calculating distances between clusters:
- Single linkage (SL): takes the shortest distance between two clusters.
- Complete linkage (CL): uses the largest distance between two clusters.
- Average linkage (AL): uses the average distance between two clusters.
The clustering results are visualized using a tree (called a dendrogram) with color-coded gene expression levels. Hierarchical clustering can be applied to genes, samples, or both.
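The agglomerative procedure and the three linkage rules can be sketched in plain Python. This is a simple, unoptimized illustration that assumes Euclidean distance between objects; the function names and the one-dimensional data are made up.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices)."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":
        return min(dists)
    if linkage == "complete":
        return max(dists)
    return sum(dists) / len(dists)  # average linkage

def agglomerate(points, linkage="average"):
    """Merge the closest pair of clusters until one cluster remains."""
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j], points, linkage)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Five hypothetical objects on a line.
points = [[1.0], [2.0], [6.0], [7.0], [20.0]]
single = agglomerate(points, "single")
complete = agglomerate(points, "complete")
for left, right, dist in single:
    print(left, right, dist)
```

The list of merges, read in order with their distances, is the information a dendrogram draws. With these points the merge order is the same under single and complete linkage, but the merge heights differ.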

Sample Clustering
[Figure from: Alizadeh, et al., 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503-511.]

k-Means Clustering
Initialization:
- The user defines k (the number of clusters).
- Randomly place k vectors (called centroids) in the data space.
Iteration:
- Each object is assigned to its closest centroid.
- Re-compute each centroid by taking the mean of the data vectors currently assigned to the cluster.
- Repeat until the cluster centroids no longer change.
[Figure: iterations 0, 1, 2 and 3 of k-means with k = 2.]
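The loop above can be written as a short Python sketch. To keep the example reproducible it takes the starting centroids as an argument instead of placing them at random, which is a deliberate departure from the slide; the data points and starting positions are made up.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means; k is the number of initial centroids supplied."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    assignment = None
    for _ in range(max_iter):
        # Assign each object to its closest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments, and therefore centroids, no longer change
        assignment = new_assignment
        # Re-compute each centroid as the mean of its assigned vectors.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical expression vectors forming two groups.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
          [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
# Fixed starting centroids; random.sample(points, 2) would be one random choice.
assignment, centroids = kmeans(points, [[0.0, 0.0], [6.0, 6.0]])
print(assignment, centroids)
```

The result of k-means depends on where the centroids start, so with random initialization it is common to run the algorithm several times and keep the most compact solution.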

Self-Organizing Map (SOM)
- The user defines an initial geometry of nodes (reference vectors) for the partitions, such as a 3 x 2 rectangular grid.
- During the iterative "training" process, the nodes migrate to fit the gene expression data.
- The genes are mapped to the most similar reference vector.
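A very small SOM can be sketched in Python to make the training process concrete. Every detail below is an assumption made for illustration: nodes start at randomly chosen data points, the neighborhood is Gaussian on the grid, and both the learning rate and the neighborhood radius decay linearly. The data and parameter values are made up.

```python
import math
import random

def nearest_node(x, nodes):
    """Index of the reference vector most similar to x (squared Euclidean)."""
    return min(range(len(nodes)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(x, nodes[k])))

def train_som(data, rows=3, cols=2, n_iter=500, lr0=0.5, radius0=1.5, seed=0):
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    nodes = [list(rng.choice(data)) for _ in grid]  # start on data points
    for t in range(n_iter):
        frac = 1.0 - t / n_iter
        lr = lr0 * frac
        radius = max(radius0 * frac, 0.01)
        x = rng.choice(data)
        bmu = nearest_node(x, nodes)
        for k, (r, c) in enumerate(grid):
            # Nodes close to the best-matching node on the grid move more.
            grid_d2 = (r - grid[bmu][0]) ** 2 + (c - grid[bmu][1]) ** 2
            h = math.exp(-grid_d2 / (2 * radius ** 2))
            nodes[k] = [w + lr * h * (xi - w) for w, xi in zip(nodes[k], x)]
    return nodes

# Hypothetical expression vectors.
data = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
        [5.0, 5.0], [5.1, 4.9], [4.8, 5.2]]
nodes = train_som(data)
mapping = [nearest_node(x, nodes) for x in data]
print(mapping)
```

Each gene ends up mapped to one of the six nodes, and genes mapped to the same node form one partition, as described above.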

Clustering analysis of a yeast cell cycle time-series dataset
[Figure: clusters obtained with k-means and with SOM; the cluster sizes shown are 237 genes and 194 genes.]

Tools for Microarray Data Analysis
- GenePix (http://www.axon.com/GN_GenePixSoftware.html): commercial software for microarray image analysis.
- GeneSpring (http://www.silicongenetics.com/cgi/SiG.cgi/Products/GeneSpring/index.smf): commercial software for microarray data analysis.
- TIGR MeV (http://www.tm4.org/mev.html): free software for clustering, visualization, classification and statistical analysis of microarray data.
- Bioconductor (http://www.bioconductor.org/): open source, free software for the analysis of genomic data. For microarray data analysis, most of the statistical methods are implemented in R.

Microarray Databases
- Gene Expression Omnibus (GEO) at NCBI (http://www.ncbi.nlm.nih.gov/geo/): a public repository for high throughput gene expression data.
- ArrayExpress at EBI (http://www.ebi.ac.uk/arrayexpress/): a public repository for microarray gene expression data; MIAME compliant.
- Stanford Microarray Database (SMD at http://genome-www5.stanford.edu/): stores raw and normalized microarray data; provides data retrieval and online data processing.

The MIAME Standard
- MIAME (Minimum Information About a Microarray Experiment) is a microarray data standard proposed by the Microarray Gene Expression Database group (MGED, http://www.mged.org/).
- MIAME (http://www.mged.org/Workgroups/MIAME/) specifies the information needed to interpret the results of a microarray experiment and potentially to reproduce the experiment.
- The MIAME checklist helps authors, reviewers and editors of scientific journals to meet the MIAME requirements and to make microarray data available to the community in a useful way.

Summary
- Image analysis and data normalization are important preprocessing steps for microarray data analysis.
- Statistical methods are available for selecting significantly up- or down-regulated genes.
- Clustering analysis is widely used to explore and visualize microarray data.
- The resulting significant or co-expressed genes can be further investigated using Gene Ontology annotation and promoter analysis.