Exercise 1 In the ISwR data set alkfos, do a PCA of the placebo and Tamoxifen groups separately, then together. Plot the first two principal components.


Exercise 1

In the ISwR data set alkfos, do a PCA of the placebo and Tamoxifen groups separately, then together. Plot the first two principal components of the whole group with color coding for the treatment and control subjects. For this and other parts of this assignment, omit the patients with missing data. Conduct a linear discriminant analysis of the two groups using the 7 variables. How well can you predict the treatment? Is this the usual kind of analysis you would see? Use logistic regression to predict the group based on the measurements. Compare the in-sample error rates. Use cross-validation with repeated training subsamples of 30/35 and test sets of size 5/35. What can you now conclude about the two methods?

May 28, 2013, SPH 247 Statistical Analysis of Laboratory Data

Exercise 2

In the ISwR data set alkfos, cluster the data based on the 7 measurements using hclust(), kmeans(), and Mclust(). Compare the 2-group clustering with the placebo/Tamoxifen classification.

> alkfos2 <- na.omit(alkfos)   # omits patients with missing values
> pc1 <- prcomp(alkfos2[alkfos2[,1]==1, 2:8], scale=T)
> pc2 <- prcomp(alkfos2[alkfos2[,1]==2, 2:8], scale=T)
> pc.all <- prcomp(alkfos2[,2:8], scale=T)

Standard deviations:
(numeric values lost in transcription)

Rotation:
     PC1 PC2 PC3 PC4 PC5 PC6 PC7
(loadings for c0, c3, c6, c9, c12, c18, c24 lost in transcription)

> plot(pc.all)
> plot(pc.all$x, col=alkfos2[,1])
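Since the numeric output above was lost in the transcript, here is a self-contained sketch of the same prcomp() workflow on simulated data; the group structure, sizes, and column names are invented for illustration (the real exercise uses the ISwR alkfos measurements).

```r
# Sketch of the PCA workflow on simulated data standing in for alkfos:
# 40 simulated subjects, 7 measurement columns, two groups with a mean shift.
set.seed(42)
grp <- rep(1:2, each = 20)                  # hypothetical group labels
X <- matrix(rnorm(40 * 7), ncol = 7) + grp  # group 2 rows shifted by +1
colnames(X) <- paste0("c", c(0, 3, 6, 9, 12, 18, 24))
pc <- prcomp(X, scale. = TRUE)
summary(pc)                    # variance explained by each component
plot(pc$x[, 1:2], col = grp)   # first two PCs, color-coded by group
```

The same pattern applies to alkfos2: pc$x holds the scores, so plotting its first two columns colored by the grouping column reproduces the color-coded PC plot.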


> library(MASS)
> alkfos.lda <- lda(alkfos2[,2:8], grouping=alkfos2[,1])
> alkfos.lda
Call:
lda(alkfos2[, 2:8], grouping = alkfos2[, 1])

Prior probabilities of groups:
(numeric values lost in transcription)

Group means:
     c0 c3 c6 c9 c12 c18 c24
(numeric values lost in transcription)

Coefficients of linear discriminants:
     LD1
(coefficients for c0, c3, c6, c9, c12, c18, c24 lost in transcription)

> plot(alkfos.lda)
> alkfos.pred <- predict(alkfos.lda)
> table(alkfos2$grp, alkfos.pred$class)
(table lost in transcription)

34 in 35 correct.


> alkfos.glm <- glm(as.factor(grp) ~ 1, data=alkfos2, family=binomial)
> step(alkfos.glm, scope=formula(~ c0+c3+c6+c9+c12+c18+c24), steps=2)
Start:  AIC=49.11
as.factor(grp) ~ 1

     Df Deviance AIC
(candidate rows for c0, c3, c6, c9, c12, c18, c24 lost in transcription)

Step:  AIC=42.47
as.factor(grp) ~ c6

> alkfos.glm <- glm(as.factor(grp) ~ 1, data=alkfos2, family=binomial)
> step(alkfos.glm, scope=formula(~ c0+c3+c6+c9+c12+c18+c24), steps=2)

Step:  AIC=42.47
as.factor(grp) ~ c6

     Df Deviance AIC
(candidate rows for the remaining variables lost in transcription)

Step:  AIC=30.28
as.factor(grp) ~ c6 + c0

We used step() limited to two steps to avoid a model with undetermined coefficients. Once the predictions are perfect (with three or more variables in this case), nothing can be distinguished.
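The perfect-separation point can be illustrated with a small synthetic example (invented data, not the alkfos measurements): when a predictor completely separates the two groups, glm() warns that fitted probabilities of 0 or 1 occurred, the slope estimate diverges, and deviance comparisons between such models become uninformative.

```r
# Synthetic illustration of complete separation (invented data).
# The groups do not overlap in x, so the logistic MLE does not
# exist and the slope estimate is driven toward infinity.
x <- c(1:5, 11:15)
y <- rep(c(0, 1), each = 5)
fit <- glm(y ~ x, family = binomial)  # warns: fitted probabilities 0 or 1
coef(fit)     # very large slope; standard errors are meaningless
fitted(fit)   # essentially 0 for the first group and 1 for the second
```

This is why the stepwise search above was capped at two steps: with three or more alkfos variables the in-sample predictions become perfect and the same degeneracy appears.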

alkfos.lda.cv <- function(ncv, ntrials) {
  require(MASS)
  data(alkfos)
  alkfos2 <- na.omit(alkfos)
  n1 <- dim(alkfos2)[1]
  nwrong <- 0
  npred <- 0
  for (i in 1:ntrials) {
    test <- sample(n1, ncv)
    test.set <- data.frame(alkfos2[test, 2:8])
    train.set <- data.frame(alkfos2[-test, 2:8])
    lda.ap <- lda(train.set, alkfos2[-test, 1])
    lda.pred <- predict(lda.ap, test.set)
    nwrong <- nwrong + sum(lda.pred$class != alkfos2[test, 1])
    npred <- npred + ncv
  }
  print(paste("total number classified = ", npred, sep=""))
  print(paste("total number wrong = ", nwrong, sep=""))
  print(paste("percent wrong = ", 100*nwrong/npred, "%", sep=""))
}

alkfos.glm.cv <- function(ncv, ntrials) {
  require(MASS)
  data(alkfos)
  alkfos2 <- na.omit(alkfos)
  alkfos2$grp <- as.factor(alkfos2$grp)
  n1 <- dim(alkfos2)[1]
  nwrong <- 0
  npred <- 0
  for (i in 1:ntrials) {
    test <- sample(n1, ncv)
    test.set <- alkfos2[test, ]
    train.set <- alkfos2[-test, ]
    glm.ap <- glm(grp ~ 1, data=train.set, family=binomial)
    glmstep.ap <- step(glm.ap, scope=formula(~ c0+c3+c6+c9+c12+c18+c24),
                       steps=2, trace=0)
    glm.pred <- predict(glmstep.ap, newdata=test.set, type="response")
    grp.pred <- (glm.pred > 0.5) + 1
    nwrong <- nwrong + sum(grp.pred != test.set$grp)
    npred <- npred + ncv
  }
  print(paste("total number classified = ", npred, sep=""))
  print(paste("total number wrong = ", nwrong, sep=""))
  print(paste("percent wrong = ", 100*nwrong/npred, "%", sep=""))
}

Results of Cross-Validation

LDA has 1 error in 35 in sample (2.9%). Cross-validated seven-fold, this is 720/10000 = 7.2%.

Stepwise logistic regression with two variables has 3 errors in 35 in sample (8.6%). Cross-validated seven-fold, this is 1830/10000 = 18.3%.

> ap.hc <- hclust(dist(alkfos2[,2:8]))
> plot(ap.hc)
> cutree(ap.hc, 2)
> table(cutree(ap.hc, 2), alkfos2$grp)
> table(kmeans(alkfos2[,2:8], 2)$cluster, alkfos2$grp)
> library(mclust)
> Mclust(alkfos2[,2:8])
'Mclust' model object:
 best model: ellipsoidal, equal shape (VEV) with 6 components
> table(Mclust(alkfos2[,2:8])$class, alkfos2$grp)
> table(Mclust(alkfos2[,2:8], G=2)$class, alkfos2$grp)
