LISA Short Course Series: Multivariate Analysis in R. Liang (Sally) Shan, March 3, 2015. LISA: Multivariate Analysis in R, Mar. 3, 2015.

Associate Collaborator for LISA Department of Statistics, VT

Presentation transcript:

LISA Short Course Series
Multivariate Analysis in R
Liang (Sally) Shan
March 3, 2015

Laboratory for Interdisciplinary Statistical Analysis
LISA helps VT researchers benefit from the use of Statistics.
Collaboration: Visit our website to request personalized statistical advice and assistance with:
– Designing Experiments
– Analyzing Data
– Interpreting Results
– Grant Proposals
– Software (R, SAS, JMP, Minitab...)
LISA statistical collaborators aim to explain concepts in ways useful for your research. Great advice right now: Meet with LISA before collecting your data. All services are FREE for VT researchers. We assist with research, not class projects or homework.
LISA also offers:
– Educational Short Courses: Designed to help graduate students apply statistics in their research.
– Walk-In Consulting: Available Monday-Friday from 1-3 PM in the Old Security Building (OSB) for questions <30 mins. See our website for additional times and locations.

Outline
1. What is multivariate analysis?
2. Summarizing and plotting multivariate data in R
3. Dimension reduction vs. clustering
4. Principal component analysis (PCA) (in R)
5. Factor analysis (in R)
6. Relationship between PCA and factor analysis

Data: Fisher’s Iris Data
Table columns: Sepal length | Sepal width | Petal length | Petal width | Species (rows: Iris setosa, Iris setosa, ..., Iris virginica)
– 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor).
– 4 features for each sample: the length of the sepal, the length of the petal, the width of the sepal, and the width of the petal, in centimeters.

What is Multivariate Analysis?
Univariate analysis is used when one variable is measured for each observation.
– Possible approaches: histogram; bar chart; descriptive statistics
Multivariate analysis is used when more than one outcome variable is measured for each observation, e.g., the Iris data.
– Possible approaches: principal component analysis, factor analysis, classification, clustering

Summarizing Multivariate Data in R
To get a first idea of the data, we start by calculating summary statistics such as the mean and standard deviation of each variable.
The R function sapply() applies a function to each column of a data frame, e.g., sapply(mydataframe, sd).
A good reference for apply functions – ction_r.htm#sapply
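As a minimal sketch of the sapply() approach, the block below uses R's built-in iris data frame, which is assumed here to stand in for the course's copy of Fisher's Iris data. The names measurements and summary_table are illustrative only.

```r
# Built-in iris data: four numeric measurements plus a Species column
measurements <- iris[, 1:4]

# Mean and standard deviation of each column
col_means <- sapply(measurements, mean)
col_sds   <- sapply(measurements, sd)

# Combine both into one small table for easier reading
summary_table <- rbind(mean = col_means, sd = col_sds)
print(round(summary_table, 3))
```

Only the numeric columns are passed to sapply(), since mean() and sd() are not meaningful for the Species factor.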

Plotting Multivariate Data in R
Since multiple variables are measured simultaneously, we expect some degree of correlation among them.
A scatterplot matrix is a good way to visualize their pairwise relationships. Install the “car” package in R, and then use the R function scatterplotMatrix().
Pairwise Pearson correlation coefficients can be calculated by calling the R function cor() on the data frame.
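A short sketch of both steps on the built-in iris data (an assumed stand-in for the course data). The fallback to base R's pairs() when the car package is not installed is an addition for convenience and is not part of the slides.

```r
measurements <- iris[, 1:4]

# Pairwise Pearson correlations (Pearson is the default method of cor())
cor_matrix <- cor(measurements, method = "pearson")
print(round(cor_matrix, 2))

# Scatterplot matrix: use car::scatterplotMatrix() if the package is
# installed, otherwise fall back to base R pairs()
if (requireNamespace("car", quietly = TRUE)) {
  car::scatterplotMatrix(measurements)
} else {
  pairs(measurements)
}
```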

Dimension Reduction vs. Clustering
Dimension Reduction:
– to transform a larger number of variables into a much smaller set of variables
– manipulation on variables (columns)
Clustering:
– to place observations into groups
– manipulation on observations (rows)
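The column-versus-row distinction can be seen in a small sketch. The slides do not name a clustering method, so kmeans() with three centers is an assumption chosen only for illustration, as is the use of prcomp() for the reduction.

```r
measurements <- iris[, 1:4]

# Dimension reduction acts on columns: 4 variables become 2 derived ones
pca <- prcomp(measurements)
reduced <- pca$x[, 1:2]
print(dim(reduced))        # rows are unchanged, columns are reduced

# Clustering acts on rows: every observation receives a group label
set.seed(1)                # k-means starts from random centers
clusters <- kmeans(measurements, centers = 3)$cluster
print(table(clusters))     # number of observations placed in each group
```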

Principal Component Analysis (PCA)
PCA is a data reduction technique that transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components.
The principal components retain as much information from the original variables as possible.

Principal Component Analysis (PCA)
[Figure: two scatterplots of the same data, before and after rotation.]
– Scatterplot in the original axes (x: Length, y: Width): x and y are highly correlated.
– Rotate the data: the spatial relationship does not change.
– Scatterplot in the new axes (1st axis: size, 2nd axis: shape): the 1st and 2nd axes are uncorrelated, with the largest variation on the 1st axis and the second largest variation on the 2nd axis.

Principal Component Analysis (PCA)
– PCA produces linear combinations of the original variables to generate the new variables (axes), known as principal components (PCs).
– The variances of the PCs are in descending order, i.e., the first PC accounts for the greatest possible variance, the second PC accounts for the second largest variance, etc.
– The PCs are uncorrelated with (perpendicular to) each other.

Principal Component Analysis (PCA)
Main idea:
– The first PC: $Y_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p$, with the constraint $a_{11}^2 + a_{12}^2 + \dots + a_{1p}^2 = 1$.
– The second PC: $Y_2 = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p$, with a similar constraint.
– Continue until p PCs are calculated, such that the sum of the variances of all the PCs is equal to that of all the original variables.
In summary, we need to find the matrix A, where $a_{ij}$ is the element in the ith row and jth column.

Principal Component Analysis (PCA)
How to get A:
– The rows of matrix A are the eigenvectors of matrix $S_x$, the variance-covariance matrix of the original data.
– The elements of an eigenvector are the weights $a_{ij}$, known as loadings.
– The diagonal elements of matrix $S_y$, the variance-covariance matrix of the principal components, are the corresponding eigenvalues.
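The construction of A can be checked numerically. The sketch below uses the built-in iris measurements (an assumed stand-in for the course data) and base R's eigen(). The variables are centered before computing the PCs, which is an assumption that matches what prcomp() does by default and does not change the variances.

```r
X <- as.matrix(iris[, 1:4])

S_x <- cov(X)            # variance-covariance matrix of the original data
eig <- eigen(S_x)        # eigenvalues are returned in decreasing order

A <- t(eig$vectors)      # rows of A are the eigenvectors (loadings)
colnames(A) <- colnames(X)

# Each row of A satisfies the unit-length constraint
print(rowSums(A^2))

# Principal components of the centered data and their covariance matrix S_y
Y   <- scale(X, center = TRUE, scale = FALSE) %*% t(A)
S_y <- cov(Y)

# Diagonal of S_y equals the eigenvalues; off-diagonal entries are ~0
print(round(S_y, 6))
print(eig$values)

# The total variance is the same before and after the transformation
print(c(original = sum(diag(S_x)), pcs = sum(diag(S_y))))
```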

Principal Component Analysis (PCA)
Other related terms:
– Score: the position of each observation in the new coordinate system of PCs. For instance, the score for the rth sample on the kth PC is $Y_{kr} = a_{k1}x_{1r} + a_{k2}x_{2r} + \dots + a_{kp}x_{pr}$.
– Scree plot: a graphical display of the variance of each PC, used to determine how many PCs should be selected in order to retain a high percentage of the variation in the data. The plot shows the variance of the first component and then, for each subsequent component, the additional variance that component contributes.
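A brief sketch of scores and a scree plot using prcomp() on the built-in iris data (assumed as the example data). Note that prcomp() centers each variable first, so the manual score below is computed from centered measurements; the indices r and k are arbitrary illustrative choices.

```r
pca <- prcomp(iris[, 1:4])   # centers the variables by default

# Scores: one row per observation, one column per PC
scores <- pca$x
print(head(scores, 3))

# Score of observation r on PC k as a weighted sum of its centered
# measurements, with weights taken from column k of pca$rotation
r <- 1
k <- 1
centered_r   <- unlist(iris[r, 1:4]) - pca$center
manual_score <- sum(pca$rotation[, k] * centered_r)
print(c(manual = manual_score, prcomp = scores[r, k]))

# Scree plot: variance of each PC, in decreasing order
screeplot(pca, type = "lines", main = "Scree plot")
```

In prcomp() the loadings for each PC are stored as columns of pca$rotation, i.e., the transpose of the matrix A described on the slides.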

Principal Component Analysis (PCA)
How to determine how many PCs should be retained:
– Criterion 1: Include all PCs up to a predetermined total percent variance explained, such as 80% or 90%.
– Criterion 2: Ignore PCs from the point where the next PC offers little increase in the total variance explained.
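Both criteria can be read off the PC variances. This is a sketch on the built-in iris data; the 90% threshold is one of the example cutoffs mentioned above, and the names prop_var, cum_var and n_keep are illustrative.

```r
pca <- prcomp(iris[, 1:4])

pc_var   <- pca$sdev^2             # variance of each PC
prop_var <- pc_var / sum(pc_var)   # proportion of the total variance
cum_var  <- cumsum(prop_var)       # cumulative proportion

print(round(rbind(proportion = prop_var, cumulative = cum_var), 3))

# Criterion 1: smallest number of PCs that reaches the chosen threshold
threshold <- 0.90
n_keep <- which(cum_var >= threshold)[1]
print(n_keep)

# Criterion 2: gain in cumulative variance from each additional PC
print(round(diff(cum_var), 3))
```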

Principal Component Analysis (PCA)
A rule of thumb when we do PCA:
– If you want to compare variables that have different units or very different variances, it is a good idea to first standardize the variables so that they all have mean 0 and variance 1.
– This allows us to find the PCs that provide the best low-dimensional representation of the variation in the original data, without being overly biased by those variables that show the most variance in the original data.
– Variables can be standardized in R using the function scale().
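A small sketch of the effect of standardizing, again on the built-in iris data. Passing scale. = TRUE to prcomp() is used here as an equivalent shortcut for calling scale() first; that equivalence is a property of prcomp(), not something stated on the slides.

```r
X <- iris[, 1:4]

# Standardize: each column gets mean 0 and variance 1
X_std <- scale(X)
print(round(colMeans(X_std), 10))
print(apply(X_std, 2, var))

# PCA on the raw and on the standardized variables
pca_raw <- prcomp(X)
pca_std <- prcomp(X, scale. = TRUE)   # same fit as prcomp(X_std)

# Compare PC1 loadings: on the raw scale, the variable with the largest
# variance tends to carry the largest weight
print(round(cbind(raw          = pca_raw$rotation[, 1],
                  standardized = pca_std$rotation[, 1]), 3))
```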

Principal Component Analysis (PCA) Steps (in R)
1. Summarize and plot the data.
2. Decide how many PCs to keep.
3. Extract the PCs, i.e., find the loadings matrix A.
4. Rotate the PCs.
5. Interpret the results.
6. Compute PC scores.
Refer to the R code for details.
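The course's own R code is not part of this transcript. The steps might look roughly like the following base R sketch, which assumes the built-in iris data, standardized variables, two retained PCs (a hypothetical choice), and varimax() as the rotation; other packages and rotations are equally possible.

```r
X <- iris[, 1:4]

# 1. Summarize the data
print(sapply(X, sd))

# 3. Extract the PCs on standardized variables
pca <- prcomp(X, scale. = TRUE)

# 2. Decide how many PCs to keep, based on cumulative variance explained
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
print(round(cum_var, 3))
n_keep <- 2                          # hypothetical choice for illustration

# Loadings of the retained PCs (prcomp stores one PC per column)
loadings_kept <- pca$rotation[, 1:n_keep]

# 4. Rotate the retained loadings with varimax
rotated <- varimax(loadings_kept)

# 5. Interpret: look at which variables load on which rotated axis
print(rotated$loadings)

# 6. Compute scores on the unrotated and on the rotated axes
scores         <- pca$x[, 1:n_keep]
rotated_scores <- scores %*% rotated$rotmat
print(head(rotated_scores, 3))
```

Because the varimax rotation matrix is orthogonal, the rotated axes explain the same total variance as the retained PCs; only its split between the axes changes.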

Factor Analysis
Factor analysis aims to uncover the latent structure in a given set of variables.
It looks for a smaller set of latent variables (called factors) that can explain the relationships among the observed variables.
Correlated factors are common, but not required, in the factor analysis model.

Factor Analysis
The model can be written as $X_i = b_1F_1 + b_2F_2 + \dots + b_pF_p + e_i$.
$X_i$ is the ith observed variable (i = 1, ..., k), $F_j$ are the factors (j = 1, ..., p), and p < k.
$e_i$ is the portion of variable $X_i$ unique to that variable.
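To make the model concrete, the sketch below simulates data from it and then fits it with base R's factanal(). Everything about the simulation is an assumption for illustration: two factors, six observed variables, and the loading values in b are hypothetical and do not come from the course.

```r
set.seed(42)
n <- 500

# Two latent factors (p = 2), drawn as independent standard normals
F1 <- rnorm(n)
F2 <- rnorm(n)

# Hypothetical loadings: rows are observed variables (k = 6), columns factors
b <- rbind(c(0.8, 0.0),
           c(0.7, 0.0),
           c(0.6, 0.0),
           c(0.0, 0.8),
           c(0.0, 0.7),
           c(0.0, 0.6))

# X_i = b_1 F_1 + b_2 F_2 + e_i, where e_i is unique to variable i
common    <- cbind(F1, F2) %*% t(b)
unique_sd <- sqrt(1 - rowSums(b^2))
e <- sapply(unique_sd, function(s) rnorm(n, sd = s))
X <- common + e
colnames(X) <- paste0("X", 1:6)

# Fit a two-factor model by maximum likelihood
fit <- factanal(X, factors = 2)
print(fit$loadings, cutoff = 0.3)
print(round(fit$uniquenesses, 2))
```

The estimated uniquenesses correspond to the variance of each $e_i$ relative to the total variance of $X_i$.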

Factor Analysis
There are many methods of extracting common factors, including maximum likelihood (ml), iterated principal axis (pa), weighted least squares (wls), generalized weighted least squares (gls), and minimum residual (minres).
The method to use can be specified in the R code.
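The abbreviations above match the values accepted by the fm argument of fa() in the psych package, so that is assumed to be the function the course had in mind. The sketch below uses the built-in ability.cov covariance data as a stand-in example and only calls psych if it is installed; base R's factanal() is limited to maximum likelihood.

```r
# Base R: factanal() always uses maximum likelihood
fit_ml <- factanal(factors = 2, covmat = ability.cov)
print(fit_ml$loadings)

# psych::fa() selects the extraction method through its fm argument
if (requireNamespace("psych", quietly = TRUE)) {
  R <- cov2cor(ability.cov$cov)       # correlation matrix of the six tests
  for (method in c("ml", "pa", "wls", "gls", "minres")) {
    fit <- psych::fa(R, nfactors = 2, n.obs = ability.cov$n.obs,
                     fm = method, rotate = "none")
    cat("\nExtraction method:", method, "\n")
    print(round(unclass(fit$loadings), 2))
  }
}
```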

Factor Analysis Steps (in R)
1. Summarize and plot the data.
2. Decide how many factors to keep.
3. Extract the factors, i.e., find the loadings matrix A.
4. Rotate the factors.
5. Interpret the results.
6. Compute factor scores if needed.
Refer to the R code for details.
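Since the course's R code is not included here, the following is only a sketch of how the steps could be carried out with base R's factanal(). The data are simulated under assumed loadings, the "eigenvalues greater than 1" rule for choosing the number of factors is an assumption not taken from the slides, and varimax rotation with regression scores is one option among several.

```r
# Simulated data with a hypothetical two-factor structure
set.seed(7)
n <- 300
b <- rbind(c(0.8, 0), c(0.7, 0), c(0.6, 0),
           c(0, 0.8), c(0, 0.7), c(0, 0.6))
latent <- matrix(rnorm(n * 2), n, 2)
noise  <- matrix(rnorm(n * 6), n, 6) %*% diag(sqrt(1 - rowSums(b^2)))
X <- latent %*% t(b) + noise
colnames(X) <- paste0("X", 1:6)

# 1. Summarize the data
print(round(cor(X), 2))

# 2. Decide how many factors to keep (assumed rule: eigenvalues > 1)
ev <- eigen(cor(X))$values
n_factors <- sum(ev > 1)
print(n_factors)

# 3. Extract the factors without rotation
fit_unrotated <- factanal(X, factors = n_factors, rotation = "none")

# 4. Rotate the factors and request scores in the same call
fit_rotated <- factanal(X, factors = n_factors,
                        rotation = "varimax", scores = "regression")

# 5. Interpret the rotated loadings
print(fit_rotated$loadings, cutoff = 0.3)

# 6. Factor scores, one row per observation
print(head(fit_rotated$scores, 3))
```

Rotation changes the loadings but not the fit: the uniquenesses are the same for the unrotated and rotated solutions.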

Relationship between PCA and Factor Analysis
[Figure A: Principal Component Analysis Model, with observed variables x1 to x5 and components PC1 and PC2.]
[Figure B: Factor Analysis Model, with factors F1 and F2, observed variables X1 to X5, and unique terms e1 to e5.]
Source: R in Action: Data Analysis and Graphics with R, Robert I. Kabacoff

References
– R in Action: Data Analysis and Graphics with R, Robert I. Kabacoff
– aTutorial.pdf

Please don’t forget to fill out the sign-in sheet and to complete the survey that will be sent to you by email. Thank you!