Microarray data analysis

Presentation transcript:

Gene expression: Microarray data analysis

Outline: microarray data analysis
• Gene expression
• Microarrays
• Preprocessing: normalization, scatter plots
• Inferential statistics: t-test, ANOVA
• Exploratory (descriptive) statistics: distances, clustering, principal components analysis (PCA)

Compare gene expression in this cell type…
• …after viral infection
• …relative to a knockout
• …in samples from patients
• …after drug treatment
• …at a later developmental time
• …in a different body region

Gene expression is context-dependent, and is regulated in several basic ways:
• by region (e.g. brain versus kidney)
• in development (e.g. fetal versus adult tissue)
• in dynamic response to environmental signals (e.g. immediate-early response genes)
• in disease states
• by gene activity
Page 157

UniGene: unique genes via ESTs
• Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
• UniGene clusters contain many ESTs.
• UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.
Page 164

Microarrays: tools for gene expression A microarray is a solid support (such as a membrane or glass microscope slide) on which DNA of known sequence is deposited in a grid-like array. Page 173

Microarrays: tools for gene expression The most common form of microarray is used to measure gene expression. RNA is isolated from matched samples of interest. The RNA is typically converted to cDNA, labeled with fluorescence (or radioactivity), then hybridized to microarrays in order to measure the expression levels of thousands of genes. Page 173

Advantages of microarray experiments
• Fast: data on >20,000 transcripts in ~2 weeks
• Comprehensive: entire yeast or mouse genome on a chip
• Flexible: custom arrays can be made to represent genes of interest
• Easy: submit RNA samples to a core facility
• Cheap?: chip representing 20,000 genes for $300
Table 6-4 Page 175

Disadvantages of microarray experiments
Cost
■ Some researchers can’t afford to do appropriate numbers of controls, replicates
RNA significance
■ The final product of gene expression is protein
■ “Pervasive transcription” of the genome is poorly understood (ENCODE project)
■ There are many noncoding RNAs not yet represented on microarrays
Quality control
■ Impossible to assess elements on array surface
■ Artifacts with image analysis
■ Artifacts with data analysis
■ Not enough attention to experimental design
■ Not enough collaboration with statisticians
Table 6-5 Page 176

Sample acquisition → Data acquisition → Data analysis → Data confirmation → Biological insight
Fig. 6.16 Page 176

Stage 1: Experimental design
Stage 2: RNA and probe preparation
Stage 3: Hybridization to DNA arrays
Stage 4: Image analysis
Stage 5: Microarray data analysis
Stage 6: Biological confirmation
Stage 7: Microarray databases
Fig. 6.16 Page 176

Stage 1: Experimental design
[1] Biological samples: technical and biological replicates. Determine the data analysis approach at the outset.
[2] RNA extraction, conversion, labeling, hybridization: except for RNA isolation, routinely performed at core facilities.
[3] Arrangement of array elements on a surface: randomization can reduce spatially-based artifacts.
Page 177

One sample per array (e.g. Affymetrix or radioactivity-based platforms): Sample 1, Sample 2, Sample 3. Fig. 6.17 Page 177

Two samples per array (competitive hybridization): Sample 1 versus pool, Sample 2 versus pool; switch dyes. Fig. 6.17 Page 177

Stage 2: RNA preparation
• For Affymetrix chips, total RNA is needed (about 5 µg).
• Confirm purity by running an agarose gel.
• Measure A260/A280 to confirm purity and quantity.
One of the greatest sources of error in microarray experiments is artifacts associated with RNA isolation; be sure to create an appropriately balanced, randomized experimental design. Page 178

Stage 3: Hybridization to DNA arrays
• The array consists of cDNA or oligonucleotides.
• Oligonucleotides can be deposited by photolithography.
• The sample is converted to cRNA or cDNA.
(Note that the terms “probe” and “target” may refer to the element immobilized on the surface of the microarray, or to the labeled biological sample; for clarity, it may be simplest to avoid both terms.) Page 178-179

Microarrays: array surface Fig. 6.18 Page 179 Southern et al. (1999) Nature Genetics, microarray supplement

Stage 4: Image analysis
• RNA transcript levels are quantitated.
• Fluorescence intensity is measured with a scanner, or radioactivity with a phosphorimager.
Page 180

Differential gene expression on a cDNA microarray: αB-crystallin is over-expressed in Rett syndrome (figure panels: Control, Rett). Fig. 6.19 Page 180

Fig. 6.20 Page 181

Stage 5: Microarray data analysis
Hypothesis testing
• How can arrays be compared?
• Which RNA transcripts (genes) are regulated?
• Are differences authentic?
• What are the criteria for statistical significance?
Clustering
• Are there meaningful patterns in the data (e.g. groups)?
Classification
• Do RNA transcripts predict predefined groups, such as disease subtypes?
Page 180

Stage 6: Biological confirmation
Microarray experiments can be thought of as “hypothesis-generating” experiments. The differential up- or down-regulation of specific RNA transcripts can be measured using independent assays such as
-- Northern blots
-- reverse transcription polymerase chain reaction (RT-PCR)
-- in situ hybridization
Page 182

Stage 7: Microarray databases
There are two main repositories:
• Gene Expression Omnibus (GEO) at NCBI
• ArrayExpress at the European Bioinformatics Institute (EBI)
Page 182

MIAME
In an effort to standardize microarray data presentation and analysis, Alvis Brazma and colleagues at 17 institutions introduced Minimum Information About a Microarray Experiment (MIAME). The MIAME framework standardizes six areas of information:
► experimental design
► microarray design
► sample preparation
► hybridization procedures
► image analysis
► controls for normalization
Visit http://www.mged.org Page 182

Microarray data analysis
• Begin with a data matrix (gene expression values, i.e. RNA transcript levels for each gene, versus samples).
• Typically, there are many genes (>> 20,000) and few samples (~10).
• The analysis then proceeds through preprocessing, inferential statistics, and descriptive statistics.
Fig. 7.1 Page 190

Microarray data analysis: preprocessing
Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:
• different labeling efficiencies of Cy3, Cy5
• uneven spotting of DNA onto an array surface
• variations in RNA purity or quantity
• variations in washing efficiency
• variations in scanning efficiency
Page 191

Microarray data analysis: preprocessing The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription. A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment. Page 191

Data analysis: global normalization
Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope. Page 192

Data analysis: global normalization
Global normalization is used to correct two or more data sets.
Example: suppose the total fluorescence is 4 million units in the Cy3 channel and 2 million units in the Cy5 channel. Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation, although it only reflects the 2-fold difference in overall channel intensity. Page 192

Data analysis: global normalization
Global normalization procedure
Step 1: subtract background intensity values (use a blank region of the array)
Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)
Page 192
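The two steps above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any particular software package: the function name `global_normalize`, the toy intensities, and the choice to equalize total channel intensity (one simple way to bring the overall ratio to 1) are all assumptions made for this example. It also assumes every signal exceeds its background.

```python
def global_normalize(cy3, cy5, background_cy3=0.0, background_cy5=0.0):
    """Return per-gene Cy3/Cy5 ratios after global normalization (illustrative sketch)."""
    # Step 1: subtract the background intensity from every spot.
    cy3_corrected = [value - background_cy3 for value in cy3]
    cy5_corrected = [value - background_cy5 for value in cy5]
    # Step 2: rescale Cy5 so that both channels have the same total intensity
    # (assumption: equal totals stand in for "average ratio = 1").
    scale = sum(cy3_corrected) / sum(cy5_corrected)
    return [a / (b * scale) for a, b in zip(cy3_corrected, cy5_corrected)]


# Hypothetical intensities for four genes; Cy3 is about twice as bright overall.
cy3_signal = [2000.0, 2000.0, 2000.0, 2000.0]
cy5_signal = [1000.0, 1000.0, 1000.0, 500.0]

raw_ratios = [a / b for a, b in zip(cy3_signal, cy5_signal)]
normalized_ratios = global_normalize(cy3_signal, cy5_signal)
print(raw_ratios)
print(normalized_ratios)
```

Before normalization the first three genes look 2-fold regulated; afterwards their ratios sit near 1, and only the fourth gene still stands out by a factor of two relative to the others.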

Scatter plots
• Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)
• Each dot corresponds to a gene expression value
• Most dots fall along a line
• Outliers represent up-regulated or down-regulated genes
Page 193
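The idea that outliers on a scatter plot correspond to regulated genes can be illustrated numerically. The sketch below flags genes whose experimental/control ratio departs from the diagonal by at least a chosen fold change. The 2-fold cutoff, the function name `flag_outliers`, and the gene names and values are assumptions for illustration only; the slides do not prescribe a threshold.

```python
import math


def flag_outliers(control, experimental, fold=2.0):
    """Return {gene: log2 ratio} for genes at least `fold` away from the diagonal."""
    cutoff = math.log2(fold)
    flagged = {}
    for gene, control_value in control.items():
        log_ratio = math.log2(experimental[gene] / control_value)
        # Positive log ratio: up-regulated; negative: down-regulated.
        if abs(log_ratio) >= cutoff:
            flagged[gene] = log_ratio
    return flagged


# Hypothetical expression values for three genes in two experiments.
control_values = {"geneA": 100.0, "geneB": 200.0, "geneC": 400.0}
experimental_values = {"geneA": 105.0, "geneB": 800.0, "geneC": 100.0}

outliers = flag_outliers(control_values, experimental_values)
print(outliers)
```

Working with log2 ratios makes up- and down-regulation symmetric: a 4-fold increase and a 4-fold decrease are equally far from zero.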

Inferential statistics
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to reject the null hypothesis or fail to reject it. For many applications, we set the significance level α to 0.05 (that is, we require p < 0.05). Page 199

Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$

Questions
• Is the sample size (n) adequate?
• Are the data normally distributed?
• Is the variance of the data known?
• Is the variance the same in the two groups?
• Is it appropriate to set the significance level to p < 0.05?
Page 199

Inferential statistics
A t-test is a commonly used test statistic to assess the difference in mean values between two groups.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{SE} = \frac{\text{difference between mean values}}{\text{variability (standard error of the difference)}} $$

Notes
• t is a ratio (it thus has no units)
• We assume the two populations are Gaussian
• The two groups may be of different sizes
• Obtain a P value from t using a table
• For a two-sample t test, the degrees of freedom is N - 2
• For any value of t, P gets smaller as df gets larger
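As a worked illustration of the ratio above, the following sketch computes a two-sample t statistic with a pooled variance, which gives N - 2 degrees of freedom as stated in the notes. The equal-variance (pooled) form is an assumption; the slides do not say which variant is intended. The function name and the expression values are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(group1, group2):
    """Return (t, df) for a two-sample t test assuming equal variances."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2  # N - 2, where N is the total number of observations
    # Pooled estimate of the common variance, weighted by each group's df.
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    # Standard error of the difference between the two means.
    standard_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(group1) - mean(group2)) / standard_error, df


# Hypothetical expression values for one gene in four samples per group.
normal_samples = [10.0, 12.0, 11.0, 13.0]
disease_samples = [14.0, 16.0, 15.0, 17.0]

t_value, degrees_of_freedom = pooled_t_statistic(disease_samples, normal_samples)
print(t_value, degrees_of_freedom)
```

The P value would then be looked up in a t table (or computed with a statistics library) using the returned degrees of freedom.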

t-test to determine statistical significance (disease vs normal)

$$ t\ \text{statistic} = \frac{\text{difference between mean of disease and normal}}{\text{variation due to error}} $$

ANOVA partitions total data variability
Figure panels: before partitioning (disease vs normal; error) and after partitioning (subject; disease vs normal; tissue type; error).

$$ F\ \text{ratio} = \frac{\text{variation between DS and normal}}{\text{variation due to error}} $$
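The F ratio can be sketched for the simplest case, a one-way ANOVA with a single grouping factor. The additional factors shown in the figure (subject, tissue type) are not modeled here; that simplification, the function name `f_ratio`, and the data values are assumptions for illustration.

```python
from statistics import mean


def f_ratio(groups):
    """Return the one-way ANOVA F ratio: between-group over within-group variation."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation explained by group membership (e.g. disease vs normal).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Residual variation within groups (the "error" term).
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    mean_square_between = ss_between / (k - 1)
    mean_square_within = ss_within / (n - k)
    return mean_square_between / mean_square_within


# Hypothetical expression values for one gene in two groups.
normal_group = [10.0, 12.0, 11.0, 13.0]
disease_group = [14.0, 16.0, 15.0, 17.0]

f_value = f_ratio([normal_group, disease_group])
print(f_value)
```

With only two groups, the F ratio equals the square of the pooled two-sample t statistic, which is one way to see that ANOVA generalizes the t-test to three or more groups.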

Inferential statistics

Paradigm                       Parametric test     Nonparametric
Compare two unpaired groups    Unpaired t-test     Mann-Whitney test
Compare two paired groups      Paired t-test       Wilcoxon test
Compare 3 or more groups       ANOVA

Table 7-2 Page 198-200

Inferential statistics
Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes p < 0.05/10,000, or p < 5 × 10^-6. The Bonferroni correction is generally considered to be too conservative. Page 199
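The arithmetic in this slide is easy to reproduce. The short sketch below uses the slide's own numbers (10,000 genes, a significance level of 0.05); the function and variable names are chosen for this example only.

```python
def bonferroni_threshold(alpha, n_tests):
    """Return the per-test significance level after a Bonferroni correction."""
    return alpha / n_tests


n_genes = 10_000
alpha = 0.05

# Genes expected to pass p < 0.05 by chance alone if no gene is truly regulated.
expected_by_chance = alpha * n_genes

# Corrected criterion: each gene must reach p < alpha / n_genes.
corrected_threshold = bonferroni_threshold(alpha, n_genes)

print(expected_by_chance)
print(corrected_threshold)
```

The corrected threshold is four orders of magnitude stricter than 0.05, which is why the correction is often described as too conservative for microarray data.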

Inferential statistics: false discovery rate
The false discovery rate (FDR) is a popular multiple comparisons correction. A false positive (also called a type I error) is sometimes called a false discovery. The expected number of false positives equals the p value threshold of the t-test times the number of genes measured (e.g. for 10,000 genes and a p value of 0.01, there are 100 expected false positives). The FDR is the expected proportion of false discoveries among the transcripts called regulated. You can adjust the false discovery rate. For example:

FDR     # regulated transcripts    # false discoveries
0.1     100                        10
0.05    45                         3
0.01    20                         1

Would you report 100 regulated transcripts of which 10 are likely to be false positives, or 20 transcripts of which one is likely to be a false positive?
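One widely used way to control the FDR is the Benjamini-Hochberg step-up procedure. The slides do not name a specific method, so this is offered as an illustrative choice rather than as the method behind the table above. The function name and the p values are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of tests called significant at the given FDR."""
    m = len(p_values)
    # Rank the tests from smallest to largest p value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        # Keep the largest rank whose p value is within its step-up threshold.
        if p_values[index] <= fdr * rank / m:
            cutoff_rank = rank
    # Every test ranked at or below the cutoff is called significant.
    return sorted(order[:cutoff_rank])


# Hypothetical p values for ten genes.
example_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                    0.060, 0.074, 0.205, 0.212, 0.216]

significant = benjamini_hochberg(example_p_values, fdr=0.05)
print(significant)
```

With these values two genes pass at an FDR of 0.05, whereas five would pass an uncorrected p < 0.05 cutoff and only one would pass the Bonferroni criterion of 0.05/10 = 0.005.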