Microarray data analysis David A. McClellan, Ph.D. Introduction to Bioinformatics Brigham Young University Dept. Integrative Biology.

Presentation transcript:

Microarray data analysis David A. McClellan, Ph.D. Introduction to Bioinformatics Brigham Young University Dept. Integrative Biology 25 January 2006

Inferential statistics
Inferential statistics are used to make inferences about a population from a sample. Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as: “There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference. We use a test statistic to decide whether to reject the null hypothesis. For many applications, we set the significance level α to p < 0.05. (Page 199)

Inferential statistics
A t-test is a commonly used statistical test for assessing the difference in mean values between two groups.

$$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sigma} = \frac{\text{difference between mean values}}{\text{variability (noise)}} $$

Questions:
- Is the sample size (n) adequate?
- Are the data normally distributed?
- Is the variance of the data known?
- Is the variance the same in the two groups?
- Is it appropriate to set the significance level to p < 0.05?
(Page 199)
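As a rough illustration of the "difference over noise" idea, the sketch below computes a two-sample t statistic in plain Python. It assumes Welch's form of the denominator (unequal variances), which is one common choice and not necessarily the one the slide intends; the function name `welch_t` and the intensity values are made up for the example.

```python
import math
from statistics import mean, variance

def welch_t(sample1, sample2):
    """Return the t statistic for two independent samples (Welch's form, an assumption)."""
    n1, n2 = len(sample1), len(sample2)
    # Numerator: difference between the two group means.
    diff = mean(sample1) - mean(sample2)
    # Denominator: standard error of that difference (the "noise").
    noise = math.sqrt(variance(sample1) / n1 + variance(sample2) / n2)
    return diff / noise

# Hypothetical log2 signal intensities for one gene.
normal = [7.1, 6.8, 7.4, 7.0]
diseased = [8.9, 9.3, 8.7, 9.1]

t_stat = welch_t(diseased, normal)
print(f"t = {t_stat:.2f}")
```

A large absolute value of t means the difference between the group means is large relative to the scatter within the groups. Turning t into a p-value additionally requires the t distribution, which a statistics package would normally supply.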

Inferential statistics

Paradigm                      Parametric test    Nonparametric
Compare two unpaired groups   Unpaired t-test    Mann-Whitney test
Compare two paired groups     Paired t-test      Wilcoxon test
Compare 3 or more groups      ANOVA

ANOVA (ANalysis Of VAriance)
ANOVA calculates the probability that several conditions all come from the same distribution.
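A minimal sketch of the quantity behind a one-way ANOVA, assuming the usual F ratio of between-group to within-group variance. The function name `one_way_f` and the numbers are illustrative only; converting F into a probability needs the F distribution, which is left out here.

```python
from statistics import mean

def one_way_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of measurements."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group sum of squares: how far each group mean lies from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: scatter of values around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Each sum of squares is divided by its degrees of freedom before taking the ratio.
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Three hypothetical conditions, three replicates each.
conditions = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat = one_way_f(conditions)
print(f"F = {f_stat:.2f}")
```

An F value near 1 is what would be expected if all conditions came from the same distribution; larger values point to at least one condition differing from the others.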

Parametric vs. Nonparametric
Parametric tests (t-tests and ANOVAs) are applied to data sets that are sampled from a normal distribution. Nonparametric tests do not make assumptions about the population distribution; they rank the outcome variable from low to high and analyze the ranks.

Mann-Whitney test (a two-sample rank test)
Actual measurements are not employed; the ranks of the measurements are used instead. $n_1$ and $n_2$ are the numbers of observations in samples 1 and 2, and $R_1$ is the sum of the ranks of the observations in sample 1.
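The formula itself did not survive in the transcript. A common textbook form that uses exactly these symbols is U = n1·n2 + n1(n1 + 1)/2 − R1, and the sketch below assumes that form. The helper names are hypothetical, values are ranked from low to high, and tied values share their average rank.

```python
def average_ranks(values):
    """Rank values from low to high; tied values share the average of their ranks."""
    sorted_vals = sorted(values)
    # First 1-based position of a value plus half its tie-run length gives the average rank.
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]

def mann_whitney_u(sample1, sample2):
    """Return (U1, U2) for two independent samples, using the rank sum of sample 1."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])                      # R1: sum of ranks in sample 1
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1     # assumed textbook formula
    u2 = n1 * n2 - u1                         # the two U values always sum to n1 * n2
    return u1, u2

# Hypothetical, completely separated samples.
u1, u2 = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u1, u2)
```

The statistic compared against a critical-value table is usually the larger or the smaller of U1 and U2, depending on the convention of the table in use.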

Mann-Whitney example

Mann-Whitney table

Wilcoxon paired-sample test
A nonparametric analogue to the paired-sample t-test, just as the Mann-Whitney test is a nonparametric procedure analogous to the unpaired-sample t-test.
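A minimal sketch of the paired-sample procedure, assuming the standard signed-rank recipe: take the paired differences, drop zeros, rank the absolute differences, and sum the ranks separately for positive and negative differences. The function names and the data are illustrative.

```python
def average_ranks(values):
    """Rank values from low to high; tied values share the average of their ranks."""
    sorted_vals = sorted(values)
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]

def wilcoxon_signed_rank(before, after):
    """Return (T_plus, T_minus), the rank sums of positive and negative differences."""
    # Pairs with no change carry no information about direction and are dropped.
    diffs = [a - b for b, a in zip(before, after) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    t_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    t_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return t_plus, t_minus

# Hypothetical paired measurements on four subjects.
before = [10, 12, 9, 15]
after = [12, 11, 13, 15]
t_plus, t_minus = wilcoxon_signed_rank(before, after)
print(t_plus, t_minus)
```

The smaller of the two sums is typically the value looked up in a Wilcoxon table.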

Wilcoxon example

Wilcoxon table

Inferential statistics
Is it appropriate to set the significance level to p < 0.05? If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05. You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) appear regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction. The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes p < (0.05)/10,000, or p < 5 × 10^-6. (Page 199)
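The arithmetic above can be checked in a few lines. This is only a sketch: the helper `bonferroni_significant` and the sample p-values are made up for illustration.

```python
n_genes = 10_000
alpha = 0.05

# Genes expected to pass p < 0.05 by chance alone if nothing is truly regulated.
expected_false_positives = n_genes * alpha

# Bonferroni: divide the significance level by the number of measurements.
bonferroni_threshold = alpha / n_genes

def bonferroni_significant(p_values, alpha=0.05):
    """Flag the p-values that stay significant after a Bonferroni correction."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

print(expected_false_positives, bonferroni_threshold)
print(bonferroni_significant([0.001, 0.02, 0.2]))
```

The correction is conservative: with thousands of genes, only very small p-values survive, which is one reason methods that control the false discovery rate are also used.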

Significance analysis of microarrays (SAM) (Page 200)
- SAM is an Excel plug-in
- URL: www-stat.stanford.edu/~tibs/SAM
- modified t-test
- adjustable false discovery rate

Page 202

[Figure: plot with axes labeled "expected" and "observed"; up-regulated and down-regulated genes are marked. Page 202]

Descriptive statistics
Microarray data are high-dimensional: there are many thousands of measurements made from a small number of samples. Descriptive (exploratory) statistics help you to find meaningful patterns in the data. A first step is to arrange the data in a matrix. Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:
- Euclidean distance
- Pearson coefficient of correlation
(Page 203)

Euclidean Distance
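The slide's formula is not in the transcript. The sketch below uses the standard definition, the square root of the summed squared differences between two expression profiles; the function name and values are illustrative.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length expression profiles."""
    if len(x) != len(y):
        raise ValueError("profiles must have the same number of samples")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Two hypothetical genes measured in the same two samples.
print(euclidean_distance([0, 0], [3, 4]))
```

Euclidean distance is sensitive to absolute expression level, so two genes with the same pattern but different magnitudes can still be far apart.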

Pearson Correlation Coefficient
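The slide's formula is not in the transcript. The sketch below uses the standard definition of Pearson's r, the covariance of two profiles divided by the product of their standard deviations; the function name and values are illustrative.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length expression profiles."""
    mx, my = mean(x), mean(y)
    # Sum of cross-products of deviations, and the two sums of squared deviations.
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cross / (sx * sy)

# Same shape at different magnitudes, then opposite shapes.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```

Because r depends only on the shape of the profiles, it is often turned into a distance as 1 − r when the goal is to group genes that rise and fall together regardless of absolute level.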

Descriptive statistics: clustering Clustering algorithms offer useful visual descriptions of microarray data. Genes may be clustered, or samples, or both. We will next describe hierarchical clustering. This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first). In each case, we end up with a tree having branches and nodes. Page 204
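The agglomerative procedure can be sketched in a few lines of plain Python. This is a toy under stated assumptions: objects are single numbers, distance is the absolute difference, and clusters are joined by single linkage (closest members). The values assigned to a through e are invented so that the merge order is easy to follow; real tools such as Cluster work on full expression profiles and offer several linkage rules.

```python
from itertools import combinations

def agglomerate(points):
    """Single-linkage agglomerative clustering of named 1-D points.

    Returns the merged clusters in the order in which they were formed.
    """
    def linkage(pair):
        # Distance between two clusters = distance between their closest members.
        left, right = pair
        return min(abs(points[i] - points[j]) for i in left for j in right)

    clusters = [(name,) for name in points]   # every object starts as its own cluster
    merges = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=linkage)
        clusters.remove(left)
        clusters.remove(right)
        merged = tuple(sorted(left + right))
        clusters.append(merged)
        merges.append(merged)
    return merges

# Hypothetical expression values for five objects.
expression = {"a": 1.0, "b": 1.5, "c": 5.0, "d": 7.0, "e": 7.6}
merge_order = agglomerate(expression)
for cluster in merge_order:
    print(",".join(cluster))
```

Each merge corresponds to one node of the tree; a divisive method would arrive at a tree from the opposite direction, starting with all objects in one cluster and splitting.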

Agglomerative clustering: a, b, c, d, e → a,b (Page 206)

Agglomerative clustering: a, b, c, d, e → a,b → d,e (Page 206)

Agglomerative clustering: a, b, c, d, e → a,b → d,e → c,d,e (Page 206)

Agglomerative clustering: a, b, c, d, e → a,b → d,e → c,d,e → a,b,c,d,e …tree is constructed (Page 206)

Divisive clustering: a,b,c,d,e (Page 206)

Divisive clustering: a,b,c,d,e → c,d,e (Page 206)

Divisive clustering: a,b,c,d,e → c,d,e → d,e (Page 206)

Divisive clustering: a,b,c,d,e → c,d,e → d,e → a,b (Page 206)

Divisive clustering: a,b,c,d,e → c,d,e → d,e → a,b → a, b, c, d, e …tree is constructed (Page 206)

Agglomerative and divisive clustering traverse the same tree in opposite directions: a, b, c, d, e; a,b; d,e; c,d,e; a,b,c,d,e (Page 206)

Page 207

Cluster and TreeView Page 208

Cluster and TreeView: clustering, PCA, SOM, K-means (Page 208)

Cluster and TreeView Page 208

Cluster and TreeView Page 208

Page 209 Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000)

Self-Organizing Maps (SOM) To download GeneCluster:

Page 211 SOMs are unsupervised neural net algorithms that identify coregulated genes

Two pre-processing steps essential to apply SOMs
1. Variation filtering: Data are passed through a variation filter to eliminate those genes showing no significant change in expression across the k samples. This step is needed to prevent nodes from being attracted to large sets of invariant genes.
2. Normalization: The expression level of each gene is normalized across experiments. This focuses attention on the 'shape' of expression patterns rather than absolute levels of expression.
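The two steps might look like the sketch below. The slide does not say how variation is measured or how genes are normalized, so both choices here are assumptions: the filter keeps genes whose range across samples reaches a hypothetical cutoff `min_range`, and normalization rescales each gene to mean 0 and standard deviation 1. Gene names and values are invented.

```python
from statistics import mean, pstdev

def variation_filter(genes, min_range=1.0):
    """Keep genes whose expression range across samples is at least min_range."""
    return {name: values for name, values in genes.items()
            if max(values) - min(values) >= min_range}

def normalize(values):
    """Rescale one gene's profile to mean 0 and standard deviation 1."""
    mu, sd = mean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical expression of three genes across k = 3 samples.
genes = {
    "geneA": [1.0, 2.0, 3.0],
    "geneB": [10.0, 20.0, 30.0],
    "flat": [5.0, 5.1, 5.0],      # nearly invariant, removed by the filter
}

kept = variation_filter(genes)
profiles = {name: normalize(values) for name, values in kept.items()}
for name, profile in profiles.items():
    print(name, [round(v, 3) for v in profile])
```

After normalization geneA and geneB have the same profile even though their absolute levels differ tenfold, which is the "shape" the slide refers to. Filtering first also guarantees a nonzero standard deviation for every gene that reaches the normalization step.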

Principal components analysis (PCA)
- An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D.
- For a matrix of m genes x n samples, create a new covariance matrix of size n x n.
- Thus transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).
(Page 211)
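A minimal sketch of these steps in plain Python, assuming the principal components are taken as eigenvectors of the n x n covariance matrix. Only the first component is found, by power iteration, which is a simple stand-in for the full eigendecomposition that a numerical library would perform. The function names and the 4 x 3 data matrix are invented for illustration.

```python
import math

def covariance_matrix(data):
    """n x n covariance matrix of the columns (samples) of an m x n matrix."""
    m, n = len(data), len(data[0])
    col_means = [sum(row[j] for row in data) / m for j in range(n)]
    centered = [[row[j] - col_means[j] for j in range(n)] for row in data]
    return [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
            for i in range(n)]

def first_principal_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, found by power iteration.

    Assumes the starting vector is not orthogonal to the leading eigenvector.
    """
    n = len(cov)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(n)) for i in range(n))
    return v, eigenvalue

# Hypothetical matrix: m = 4 genes (rows) x n = 3 samples (columns).
data = [
    [1.0, 2.0, 1.1],
    [2.0, 4.1, 1.9],
    [3.0, 6.0, 3.2],
    [4.0, 7.9, 4.0],
]

cov = covariance_matrix(data)
pc1, variance_pc1 = first_principal_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = variance_pc1 / total_variance
print(f"PC1 explains {explained:.1%} of the variance")
```

The fraction of variance explained is the kind of percentage usually printed on the axes of a PCA plot. Because the three samples in this toy matrix are almost perfectly correlated, nearly all of the variance falls on the first component.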

[Figure: PCA plot. Axes: principal component axis #1 (87%), principal component axis #2 (10%); PC#3: 1%. Legend: Lead (P), Sodium (N), Control (C). Plotted samples: C1, C2, C3, C4, N2, N3, N4, P1, P2, P3, P4.]
Principal components analysis (PCA), an exploratory technique that reduces data dimensionality, distinguishes lead-exposed from control cell lines.

Principal components analysis (PCA): objectives
- to reduce dimensionality
- to determine the linear combination of variables
- to choose the most useful variables (features)
- to visualize multidimensional data
- to identify groups of objects (e.g. genes/samples)
- to identify outliers
(Page 211)

Page 212

Page 212

Page 212

Use of PCA to demonstrate increased levels of gene expression from Down syndrome (trisomy 21) brain [figure label: Chr 21]