SPH 247 Statistical Analysis of Laboratory Data. Cystic Fibrosis Data Set The 'cystfibr' data frame has 25 rows and 10 columns. It contains lung function.

Slides:



Advertisements
Similar presentations
Original Figures for "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring"
Advertisements

BIOL 582 Lecture Set 22 One-Way MANOVA, Part II Post-hoc exercises Discriminant Analysis.
Exercise 1 In the ISwR data set alkfos, do a PCA of the placebo and Tamoxifen groups separately, then together. Plot the first two principal components.
Logistic Regression Psy 524 Ainsworth.
SPH 247 Statistical Analysis of Laboratory Data 1April 2, 2013SPH 247 Statistical Analysis of Laboratory Data.
Logistic Regression.
Lecture 7: Principal component analysis (PCA)
LISA Short Course Series Multivariate Analysis in R Liang (Sally) Shan March 3, 2015 LISA: Multivariate Analysis in RMar. 3, 2015.
1 Logistic Regression EPP 245 Statistical Analysis of Laboratory Data.
EPIDEMIOLOGY AND BIOSTATISTICS DEPT Esimating Population Value with Hypothesis Testing.
SPH 247 Statistical Analysis of Laboratory Data 1April 23, 2010SPH 247 Statistical Analysis of Laboratory Data.
Discrim Continued Psy 524 Andrew Ainsworth. Types of Discriminant Function Analysis They are the same as the types of multiple regression Direct Discrim.
1 Multiple Regression EPP 245/298 Statistical Analysis of Laboratory Data.
Exploring Microarray data Javier Cabrera. Outline 1.Exploratory Analysis Steps. 2.Microarray Data as Multivariate Data. 3.Dimension Reduction 4.Correlation.
Inference.ppt - © Aki Taanila1 Sampling Probability sample Non probability sample Statistical inference Sampling error.
1 Cluster Analysis EPP 245 Statistical Analysis of Laboratory Data.
Topic 3: Regression.
1 Multi-Class Cancer Classification Noam Lerner. 2 Our world – very generally Genes. Gene samples. Our goal: classifying the samples. Example: We want.
1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data.
1 Multivariate Analysis and Discrimination EPP 245 Statistical Analysis of Laboratory Data.
Copyright ©2006 Brooks/Cole, a division of Thomson Learning, Inc. More About Regression Chapter 14.
Linear Regression and Correlation Explanatory and Response Variables are Numeric Relationship between the mean of the response variable and the level of.
Foundation of High-Dimensional Data Visualization
1 Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data Presented by: Tun-Hsiang Yang.
SPH 247 Statistical Analysis of Laboratory Data April 9, 2013SPH 247 Statistical Analysis of Laboratory Data1.
Practical statistics for Neuroscience miniprojects Steven Kiddle Slides & data :
SPH 247 Statistical Analysis of Laboratory Data May 26, 2015SPH 247 Statistical Analysis of Laboratory Data1.
Tree-Based Methods (V&R 9.1) Demeke Kasaw, Andreas Nguyen, Mariana Alvaro STAT 6601 Project.
Chapter 2 Dimensionality Reduction. Linear Methods
Principal Components Analysis BMTRY 726 3/27/14. Uses Goal: Explain the variability of a set of variables using a “small” set of linear combinations of.
1 The Receiver Operating Characteristic (ROC) Curve EPP 245 Statistical Analysis of Laboratory Data.
PCA Example Air pollution in 41 cities in the USA.
Classification (Supervised Clustering) Naomi Altman Nov '06.
Multivariate Analysis of Variance (MANOVA). Outline Purpose and logic : page 3 Purpose and logic : page 3 Hypothesis testing : page 6 Hypothesis testing.
Discriminant Function Analysis Basics Psy524 Andrew Ainsworth.
Introduction To Biological Research. Step-by-step analysis of biological data The statistical analysis of a biological experiment may be broken down into.
Xuhua Xia Slide 1 MANOVA All statistical methods we have learned so far have only one continuous DV and one or more IVs which may be continuous or categorical.
Copyright © 2015 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.
A Brief Introduction to R Programming Darren J. Fitzpatrick, PhD The Bioinformatics Support Team 27/08/2015.
PCB 3043L - General Ecology Data Analysis. OUTLINE Organizing an ecological study Basic sampling terminology Statistical analysis of data –Why use statistics?
Chapter 14 Repeated Measures and Two Factor Analysis of Variance
Copyright ©2011 Brooks/Cole, Cengage Learning Inference about Simple Regression Chapter 14 1.
More About Clustering Naomi Altman Nov '06. Assessing Clusters Some things we might like to do: 1.Understand the within cluster similarity and between.
Linear Discriminant Analysis and Its Variations Abu Minhajuddin CSE 8331 Department of Statistical Science Southern Methodist University April 27, 2002.
Principal Components Analysis. Principal Components Analysis (PCA) A multivariate technique with the central aim of reducing the dimensionality of a multivariate.
28. Multiple regression The Practice of Statistics in the Life Sciences Second Edition.
Chapter 13 Repeated-Measures and Two-Factor Analysis of Variance
Analyzing Expression Data: Clustering and Stats Chapter 16.
PCB 3043L - General Ecology Data Analysis.
MACHINE LEARNING 7. Dimensionality Reduction. Dimensionality of input Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
 Seeks to determine group membership from predictor variables ◦ Given group membership, how many people can we correctly classify?
D/RS 1013 Discriminant Analysis. Discriminant Analysis Overview n multivariate extension of the one-way ANOVA n looks at differences between 2 or more.
Nonparametric Statistics
Université d’Ottawa / University of Ottawa 2001 Bio 8100s Applied Multivariate Biostatistics L9.1 Lecture 9: Discriminant function analysis (DFA) l Rationale.
1 Statistics & R, TiP, 2011/12 Multivariate Methods  Multivariate data  Data display  Principal component analysis Unsupervised learning technique 
Chapter 14 EXPLORATORY FACTOR ANALYSIS. Exploratory Factor Analysis  Statistical technique for dealing with multiple variables  Many variables are reduced.
Lecture 2 Survey Data Analysis Principal Component Analysis Factor Analysis Exemplified by SPSS Taylan Mavruk.
Stats Methods at IC Lecture 3: Regression.
Nonparametric Statistics
Factor and Principle Component Analysis
Exploring Microarray data
Discriminant Analysis
Nonparametric Statistics
DataMining, Morgan Kaufmann, p Mining Lab. 김완섭 2004년 10월 27일
15.1 The Role of Statistics in the Research Process
Parametric Methods Berlin Chen, 2005 References:
EPP 245 Statistical Analysis of Laboratory Data
8/22/2019 Exercise 1 In the ISwR data set alkfos, do a PCA of the placebo and Tamoxifen groups separately, then together. Plot the first two principal.
Unsupervised Learning
Presentation transcript:

SPH 247 Statistical Analysis of Laboratory Data

Cystic Fibrosis Data Set The 'cystfibr' data frame has 25 rows and 10 columns. It contains lung function data for cystic fibrosis patients (7-23 years old) We will examine the relationships among the various measures of lung function April 30, 2010SPH 247 Statistical Analysis of Laboratory Data2

age: a numeric vector. Age in years. sex: a numeric vector code. 0: male, 1:female. height: a numeric vector. Height (cm). weight: a numeric vector. Weight (kg). bmp: a numeric vector. Body mass (% of normal). fev1: a numeric vector. Forced expiratory volume. rv: a numeric vector. Residual volume. frc: a numeric vector. Functional residual capacity. tlc: a numeric vector. Total lung capacity. pemax: a numeric vector. Maximum expiratory pressure. April 30, 2010SPH 247 Statistical Analysis of Laboratory Data3

Scatterplot matrices We have five variables and may wish to study the relationships among them We could separately plot the (5)(4)/2 = 10 pairwise scatterplots In R we can use the pairs() function, or the splom() function in the lattice package. In Stata, we can use graph matrix April 30, 2010SPH 247 Statistical Analysis of Laboratory Data4

Scatterplot matrices > pairs(lungcap) > library(lattice) > splom(lungcap).graph matrix fev1 rv frc tlc pemax April 30, 2010SPH 247 Statistical Analysis of Laboratory Data5

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data6

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data7

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data8

Principal Components Analysis The idea of PCA is to create new variables that are combinations of the original ones. If x 1, x 2, …, x p are the original variables, then a component is a 1 x 1 + a 2 x 2 +…+ a p x p We pick the first PC as the linear combination that has the largest variance The second PC is that linear combination orthogonal to the first one that has the largest variance, and so on HSAUR2 Chapter 16 April 30, 2010SPH 247 Statistical Analysis of Laboratory Data9

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data10

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data11 > lungcap.pca <- prcomp(lungcap,scale=T) > plot(lungcap.pca) > names(lungcap.pca) [1] "sdev" "rotation" "center" "scale" "x" > lungcap.pca$sdev [1] > lungcap.pca$center fev1 rv frc tlc pemax > lungcap.pca$scale fev1 rv frc tlc pemax > plot(lungcap.pca$x[,1:2]) Always use scaling before PCA unless all variables are on the Same scale. This is equivalent to PCA on the correlation matrix instead of the covariance matrix

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data12 Scree Plot

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data13

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data14. pca fev1 rv frc tlc pemax Principal components/correlation Number of obs = 25 Number of comp. = 5 Trace = 5 Rotation: (unrotated = principal) Rho = Component | Eigenvalue Difference Proportion Cumulative Comp1 | Comp2 | Comp3 | Comp4 | Comp5 | Principal components (eigenvectors) Variable | Comp1 Comp2 Comp3 Comp4 Comp5 | Unexplained fev1 | | 0 rv | | 0 frc | | 0 tlc | | 0 pemax | |

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data15

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data16

Fisher’s Iris Data This famous (Fisher's or Anderson's) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are _Iris setosa_, _versicolor_, and _virginica_. April 30, 2010SPH 247 Statistical Analysis of Laboratory Data17

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data18 > data(iris) > help(iris) > names(iris) [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species" > attach(iris) > iris.dat <- iris[,1:4] > splom(iris.dat) > splom(iris.dat,groups=Species) > splom(~ iris.dat | Species) > summary(iris) Sepal.Length Sepal.Width Petal.Length Petal.Width Species Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100 setosa :50 1st Qu.: st Qu.: st Qu.: st Qu.:0.300 versicolor:50 Median :5.800 Median :3.000 Median :4.350 Median :1.300 virginica :50 Mean :5.843 Mean :3.057 Mean :3.758 Mean : rd Qu.: rd Qu.: rd Qu.: rd Qu.:1.800 Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data19

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data20

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data21

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data22 > data(iris) > iris.pc <- prcomp(iris[,1:4],scale=T) > plot(iris.pc$x[,1:2],col=rep(1:3,each=50)) > names(iris.pc) [1] "sdev" "rotation" "center" "scale" "x" > plot(iris.pc) > iris.pc$sdev [1] > iris.pc$rotation PC1 PC2 PC3 PC4 Sepal.Length Sepal.Width Petal.Length Petal.Width

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data23

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data24

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data25. summarize Variable | Obs Mean Std. Dev. Min Max index | sepallength | sepalwidth | petallength | petalwidth | species | 0. graph matrix sepallength sepalwidth petallength petalwidth

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data26

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data27. pca sepallength sepalwidth petallength petalwidth Principal components/correlation Number of obs = 150 Number of comp. = 4 Trace = 4 Rotation: (unrotated = principal) Rho = Component | Eigenvalue Difference Proportion Cumulative Comp1 | Comp2 | Comp3 | Comp4 | Principal components (eigenvectors) Variable | Comp1 Comp2 Comp3 Comp4 | Unexplained sepallength | | 0 sepalwidth | | 0 petallength | | 0 petalwidth | |

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data28

Discriminant Analysis An alternative to logistic regression for classification is discrimininant analysis This comes in two flavors, (Fisher’s) Linear Discriminant Analysis or LDA and (Fisher’s) Quadratic Discriminant Analysis or QDA In each case we model the shape of the groups and provide a dividing line/curve April 30, 2010SPH 247 Statistical Analysis of Laboratory Data29

One way to describe the way LDA and QDA work is to think of the data as having for each group an elliptical distribution We allocate new cases to the group for which they have the highest likelihoods This provides a linear cut-point if the ellipses are assumed to have the same shape and a quadratic one if they may be different April 30, 2010SPH 247 Statistical Analysis of Laboratory Data30

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data31 > library(MASS) > iris.lda <- lda(iris[,1:4],iris[,5]) > iris.lda Call: lda(iris[, 1:4], iris[, 5]) Prior probabilities of groups: setosa versicolor virginica Group means: Sepal.Length Sepal.Width Petal.Length Petal.Width setosa versicolor virginica Coefficients of linear discriminants: LD1 LD2 Sepal.Length Sepal.Width Petal.Length Petal.Width Proportion of trace: LD1 LD

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data32 > plot(iris.lda,col=rep(1:3,each=50)) > iris.pred <- predict(iris.lda) > names(iris.pred) [1] "class" "posterior" "x" > iris.pred$class[71:80] [1] virginica versicolor versicolor versicolor versicolor versicolor versicolor [8] versicolor versicolor versicolor Levels: setosa versicolor virginica > iris.pred$posterior[71:80,] setosa versicolor virginica e e e e e e e e e e e e e e e e e e e e-08 > sum(iris.pred$class != iris$Species) [1] 3

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data33

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data34 iris.cv <- function(ncv,ntrials) { nwrong <- 0 n <- 0 for (i in 1:ntrials) { test <- sample(150,ncv) test.ir <- data.frame(iris[test,1:4]) train.ir <- data.frame(iris[-test,1:4]) lda.ir <- lda(train.ir,iris[-test,5]) lda.pred <- predict(lda.ir,test.ir) nwrong <- nwrong + sum(lda.pred$class != iris[test,5]) n <- n + ncv } print(paste("total number classified = ",n,sep="")) print(paste("total number wrong = ",nwrong,sep="")) print(paste("percent wrong = ",100*nwrong/n,"%",sep="")) } > iris.cv(10,1000) [1] "total number classified = 10000" [1] "total number wrong = 213" [1] "percent wrong = 2.13%"

Lymphoma Data Set Alizadeh et al. Nature (2000) “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling” We will analyze a subset of the data consisting of 61 arrays on patients with 45 Diffuse large B-cell lymphoma (DLBCL/DL) 10 Chronic lymphocytic leukaemia (CLL/CL) 6 Follicular leukaemia (FL) April 30, 2010SPH 247 Statistical Analysis of Laboratory Data35

Data Available The original Nature paper The expression data in the form of unbackground corrected log ratios. A common reference was always on Cy3 with the sample on Cy5 (array.data.txt and array.data.zip) by 61 A file with array codes and disease status for each of the 61 arrays, ArrayID.txt April 30, 2010SPH 247 Statistical Analysis of Laboratory Data36

Identify Differentially Expressed Genes We will assume that the log ratios are on a reasonable enough scale that we can use them as is For each gene, we can run a one-way ANOVA and find the p-value, obtaining 9,216 of them. Adjust p-values with p.adjust Identify genes with small adjusted p-values April 30, 2010SPH 247 Statistical Analysis of Laboratory Data37

Develop Classifier Reduce dimension with ANOVA gene selection or with PCA Use logistic regression or LDA Evaluate the four possibilities and their sub- possibilities with cross validation. With 61 arrays one could reasonable omit 10% or 6 at random April 30, 2010SPH 247 Statistical Analysis of Laboratory Data38

Differential Expression We can locate genes that are differentially expressed; that is, genes whose expression differs systematically by the type of lymphoma. To do this, we could use lymphoma type to predict expression, and see when this is statistically significant. For one gene at a time, this is ANOVA. April 30, 2010SPH 247 Statistical Analysis of Laboratory Data39

It is almost equivalent to locate genes whose expression can be used to predict lymphoma type, this being the reverse process. If there is significant association in one direction there should logically be significant association in the other This will not be true exactly, but is true approximately We can also easily do the latter analysis using the expression of more than one gene using logistic regression, LDA, and QDA April 30, 2010SPH 247 Statistical Analysis of Laboratory Data40

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data41

Significant Genes There are 3845 out of 9261 genes that have significant p-values from the ANOVA of less than 0.05, compared to 463 expected by chance There are 2733 genes with FDR adjusted p-values less than 0.05 There are only 184 genes with Bonferroni adjusted p- values less than 0.05 April 30, 2010SPH 247 Statistical Analysis of Laboratory Data42

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data43

Logistic Regression We will use logistic regression to distinguish DLBCL from CLL and DLBCL from FL We will do this first by choosing the variables with the smallest overall p-values in the ANOVA We will then evaluate the results within sample and by cross validation April 30, 2010SPH 247 Statistical Analysis of Laboratory Data44

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data45

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data46 Number of Variables DL/CL Errors DL/FL Errors Within Sample Errors

Evaluation of performance Within sample evaluation of performance like this is unreliable This is especially true if we are selecting predictors from a very large set One useful yardstick is the performance of random classifiers April 30, 2010SPH 247 Statistical Analysis of Laboratory Data47

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data48 Number of Variables Percent Errors 124.0% 224.7% 323.3% 424.7% 528.7% 624.7% 726.3% 823.6% 925.7% % Left is CV performance of best k variables Random = 25.4%

Conclusion Logistic regression on the variables with the smallest p-values does not work very well This cannot be identified by looking at the within sample statistics Cross validation is a requirement to assess performance of classifiers April 30, 2010SPH 247 Statistical Analysis of Laboratory Data49

April 30, 2010SPH 247 Statistical Analysis of Laboratory Data50 alkfos {ISwR}R Documentation Alkaline phosphatase data Repeated measurements of alkaline phosphatase in a randomized trial of Tamoxifen treatment of breast cancer patients. Format A data frame with 43 observations on the following 8 variables. grp a numeric vector, group code (1=placebo, 2=Tamoxifen). c0 a numeric vector, concentration at baseline. c3 a numeric vector, concentration after 3 months. c6 a numeric vector, concentration after 6 months. c9 a numeric vector, concentration after 9 months. c12 a numeric vector, concentration after 12 months. c18 a numeric vector, concentration after 18 months. c24 a numeric vector, concentration after 24 months.

Exercises In the ISwR data set alkfos, do a PCA of the placebo and Tamoxifen groups separately, then together. Plot the first two principal components of the whole group with color coding for the treatment and control subjects. Conduct a linear discriminant analysis of the two groups using the 7 variables. How well can you predict the treatment? Is this the usual kind of analysis you would see? April 30, 2010SPH 247 Statistical Analysis of Laboratory Data51