Published by Allan Fitzgerald. Modified over 9 years ago.
1
LISA Short Course Series: Multivariate Analysis in R
Liang (Sally) Shan
March 3, 2015
LISA: Multivariate Analysis in R, Mar. 3, 2015
2
Laboratory for Interdisciplinary Statistical Analysis
LISA helps VT researchers benefit from the use of Statistics.
www.lisa.stat.vt.edu
Collaboration: Visit our website to request personalized statistical advice and assistance with:
– Designing Experiments
– Analyzing Data
– Interpreting Results
– Grant Proposals
– Software (R, SAS, JMP, Minitab...)
LISA statistical collaborators aim to explain concepts in ways useful for your research.
Great advice right now: Meet with LISA before collecting your data.
All services are FREE for VT researchers. We assist with research, not class projects or homework.
LISA also offers:
– Educational Short Courses: Designed to help graduate students apply statistics in their research.
– Walk-In Consulting: Available Monday-Friday from 1-3 PM in the Old Security Building (OSB) for questions that take less than 30 minutes. See our website for additional times and locations.
3
Outline
1. What is multivariate analysis?
2. Summarizing and plotting multivariate data in R
3. Dimension reduction vs. clustering
4. Principal component analysis (PCA) (in R)
5. Factor analysis (in R)
6. Relationship between PCA and factor analysis
4
Data: Fisher’s Iris Data
Sepal length  Sepal width  Petal length  Petal width  Species
5.1           3.5          1.4           0.2          Iris setosa
4.9           3.0          1.4           0.2          Iris setosa
…             …            …             …            …
5.9           3.0          5.1           1.8          Iris virginica
50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor), 150 samples in total.
4 features for each sample, all measured in centimeters: sepal length, sepal width, petal length, and petal width.
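The table above can be inspected directly in R. This is a minimal sketch, assuming the copy of the iris data that ships with base R (the `iris` data frame in the datasets package), whose column names are `Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width` and `Species`.

```r
# Load the built-in copy of Fisher's iris data (datasets package, part of base R).
data(iris)

# Structure: 150 observations of 4 numeric measurements plus the species factor.
str(iris)

# First three rows, matching the layout of the table on the slide.
head(iris, 3)

# Number of samples per species.
table(iris$Species)
```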
5
What is Multivariate Analysis?
Univariate analysis is used when one variable is measured for each observation.
– Possible approaches: histogram; bar chart; descriptive statistics
Multivariate analysis is used when more than one outcome variable is measured for each observation, e.g., the Iris data.
– Possible approaches: principal component analysis, factor analysis, classification, clustering
6
Summarizing Multivariate Data in R
To get a first idea of the data, we start by calculating summary statistics such as the mean and standard deviation of each variable.
The R function sapply() applies a function to each column of a data frame, e.g., sapply(mydataframe, sd).
A good reference for the apply functions:
– http://www.ats.ucla.edu/stat/r/library/advanced_function_r.htm#sapply
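As an illustration of the sapply() idiom on the slide, the following sketch computes the mean and standard deviation of each numeric column. It assumes the built-in `iris` data frame; the object names `X`, `means` and `sds` are chosen here for illustration only.

```r
# Keep only the four numeric measurement columns.
X <- iris[, 1:4]

# Apply mean() and sd() to every column of the data frame.
means <- sapply(X, mean)
sds   <- sapply(X, sd)

# Show both statistics side by side, one column per variable.
round(rbind(mean = means, sd = sds), 2)
```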
7
Plotting Multivariate Data in R
Since multiple variables are measured on the same observations, we expect some degree of correlation among them.
A scatterplot matrix is a good way to visualize their pairwise relationships: install the “car” package in R, and then use the function scatterplotMatrix().
Pairwise Pearson correlation coefficients can be calculated by calling the R function cor() on the data frame.
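A short sketch of both steps on the built-in `iris` data follows. Because the "car" package may not be installed, the code falls back to base R's pairs() in that case; the fallback is an addition here and is not part of the slide.

```r
# The four numeric measurement columns of the iris data.
X <- iris[, 1:4]

# Pairwise Pearson correlations (Pearson is the default method of cor()).
R <- cor(X)
round(R, 2)

# Scatterplot matrix: use car::scatterplotMatrix() when the package is
# available, otherwise fall back to the base R function pairs().
if (requireNamespace("car", quietly = TRUE)) {
  car::scatterplotMatrix(X)
} else {
  pairs(X)
}
```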
8
Dimension Reduction vs. Clustering
Dimension Reduction:
– to transform a larger number of variables into a much smaller set of variables
– manipulation on variables (columns)
Clustering:
– to place observations into groups
– manipulation on observations (rows)
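To make the column-versus-row distinction concrete, here is a hedged sketch on the built-in `iris` data. PCA via prcomp() stands in for dimension reduction and k-means via kmeans() stands in for clustering; the slide names neither function, so both choices are assumptions made for illustration.

```r
# Standardize the four numeric measurement columns.
X <- scale(iris[, 1:4])

# Dimension reduction acts on columns: 4 variables become 2 new variables.
pca <- prcomp(X)
reduced <- pca$x[, 1:2]
dim(reduced)   # still one row per observation, but fewer columns

# Clustering acts on rows: each of the observations gets a group label.
set.seed(42)   # k-means starts from random centers
km <- kmeans(X, centers = 3, nstart = 20)
length(km$cluster)   # one label per observation

# Cross-tabulate the cluster labels against the known species.
table(cluster = km$cluster, species = iris$Species)
```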
9
Principal Component Analysis (PCA)
PCA is a data reduction technique that transforms a larger number of correlated variables into a much smaller set of uncorrelated variables called principal components.
The principal components retain as much information from the original variables as possible.
10
Principal Component Analysis (PCA)
[Figure, left panel: scatterplot in the original axes (x: Length, y: Width); x and y are highly correlated.]
Rotate the data; the spatial relationship between the points does not change.
[Figure, right panel: scatterplot in the new axes (1st axis: size, 2nd axis: shape); the two axes are uncorrelated, with the largest variation along the 1st axis and the second largest along the 2nd axis.]
11
Principal Component Analysis (PCA)
PCA produces linear combinations of the original variables to generate new variables (axes), known as principal components (PCs).
The variances of the PCs are in descending order, i.e., the first PC accounts for the greatest possible variance, the second PC accounts for the second largest variance, etc.
The PCs are uncorrelated with each other, and their axes are perpendicular.
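Both properties can be checked numerically. The sketch below assumes the built-in `iris` data and base R's prcomp(), which is one of several PCA functions in R and not necessarily the one used in the course code.

```r
# PCA on the four numeric iris measurements (centered, not standardized).
pca <- prcomp(iris[, 1:4])

# Variance of each PC: prcomp() stores standard deviations in sdev.
pc_var <- pca$sdev^2
pc_var   # listed in descending order

# Correlation matrix of the PC scores: off-diagonal entries are
# zero up to rounding, i.e. the PCs are uncorrelated.
round(cor(pca$x), 10)
```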
12
Principal Component Analysis (PCA)
Main idea:
– The first PC is $Y_1 = a_{11}X_1 + a_{12}X_2 + \dots + a_{1p}X_p$, chosen to have the largest possible variance subject to the constraint $a_{11}^2 + a_{12}^2 + \dots + a_{1p}^2 = 1$.
– The second PC is $Y_2 = a_{21}X_1 + a_{22}X_2 + \dots + a_{2p}X_p$, with the analogous constraint $a_{21}^2 + a_{22}^2 + \dots + a_{2p}^2 = 1$ and uncorrelated with $Y_1$.
– Continue until p PCs are calculated; the sum of the variances of all the PCs then equals the sum of the variances of all the original variables.
In summary, we need to find the matrix A, where $a_{ij}$ is the element in the ith row and jth column.
13
Principal Component Analysis (PCA)
How to get A:
– The rows of matrix A are the eigenvectors of matrix $S_x$, the variance-covariance matrix of the original data.
– The elements of an eigenvector are the weights $a_{ij}$, known as loadings.
– The elements on the diagonal of matrix $S_y$, the variance-covariance matrix of the principal components, are the corresponding eigenvalues.
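The eigen-decomposition route described above can be reproduced by hand and compared with prcomp(). This is a sketch on the built-in `iris` data; the names `Sx`, `A` and `Sy` mirror the slide's notation, and the comparison uses absolute values because the sign of an eigenvector is arbitrary.

```r
X  <- iris[, 1:4]
Sx <- cov(X)                 # variance-covariance matrix of the original data

eig <- eigen(Sx)
A   <- t(eig$vectors)        # rows of A are the eigenvectors of Sx

# Each row of A has unit length (the constraint on the weights).
rowSums(A^2)

# Covariance matrix of the PCs: diagonal, with the eigenvalues on the diagonal.
Sy <- A %*% Sx %*% t(A)
round(Sy, 6)

# The same quantities from prcomp(): variances and loadings (up to sign).
pca <- prcomp(X)
all.equal(eig$values, pca$sdev^2)
all.equal(abs(eig$vectors), abs(unname(pca$rotation)))

# Total variance is preserved: sum of original variances = sum of eigenvalues.
c(original = sum(diag(Sx)), components = sum(eig$values))
```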
14
Principal Component Analysis (PCA)
Other related terms:
– Score: the position of each observation in the new coordinate system of PCs. For instance, the score for the rth sample on the kth PC is $Y_{kr} = a_{k1}x_{1r} + a_{k2}x_{2r} + \dots + a_{kp}x_{pr}$.
– Scree plot: a graphical display of the variance of each PC, used to determine how many PCs should be selected in order to retain a high percentage of the variation in the data. The plot shows the variance of the first component, followed by the additional variance contributed by each subsequent component.
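A sketch of both terms on the built-in `iris` data follows. It assumes prcomp(), which centers each variable before computing scores, so the score formula is applied here to column-centered values; that centering step is a property of prcomp() and is not stated on the slide.

```r
X   <- as.matrix(iris[, 1:4])
pca <- prcomp(X)

# Scores by hand: center each column, then apply the loadings.
Xc     <- sweep(X, 2, colMeans(X))
scores <- Xc %*% pca$rotation

# These match the scores that prcomp() stores in pca$x.
all.equal(unname(scores), unname(pca$x))
head(scores, 3)

# Scree plot: variance of each PC, in order.
screeplot(pca, type = "lines", main = "Scree plot")
```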
15
Principal Component Analysis (PCA)
How to determine how many PCs should be retained:
– Criterion 1: Include all PCs up to a predetermined total percentage of variance explained, such as 80% or 90%.
– Criterion 2: Stop at the point where the next PC offers little increase in the total variance explained.
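Both criteria can be read off the proportion of variance explained. The sketch below assumes standardized `iris` variables and an 80% threshold; the threshold and the name `k80` are illustrative choices.

```r
# PCA on standardized variables.
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Proportion of total variance explained by each PC, and the running total.
prop <- pca$sdev^2 / sum(pca$sdev^2)
cum  <- cumsum(prop)
round(rbind(proportion = prop, cumulative = cum), 3)

# Criterion 1: smallest number of PCs reaching at least 80% of the variance.
k80 <- which(cum >= 0.80)[1]
k80

# Criterion 2: inspect the proportions above (or a scree plot) and stop
# where the additional variance contributed by the next PC becomes small.
```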
16
Principal Component Analysis (PCA)
A rule of thumb when we do PCA:
– If the variables have different units or very different variances, it is a good idea to first standardize them so that they all have mean 0 and variance 1.
– This allows us to find the PCs that provide the best low-dimensional representation of the variation in the original data, without the result being dominated by the variables that show the most variance in the original data.
– Variables can be standardized in R using the function scale().
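The effect of standardizing can be seen by comparing a PCA on raw and on standardized variables. This sketch assumes the built-in `iris` data and prcomp(); its `scale. = TRUE` argument is equivalent to calling scale() first, which in turn corresponds to a PCA of the correlation matrix.

```r
X <- iris[, 1:4]

# Standardize: every column gets mean 0 and standard deviation 1.
Z <- scale(X)
round(colMeans(Z), 10)
apply(Z, 2, sd)

# PCA on raw versus standardized variables.
pca_raw <- prcomp(X)
pca_std <- prcomp(X, scale. = TRUE)

# Loadings of the first PC: the raw version weights the high-variance
# variable more heavily than the standardized version does.
round(cbind(raw = pca_raw$rotation[, 1], standardized = pca_std$rotation[, 1]), 2)

# Standardized PCA variances are the eigenvalues of the correlation matrix.
all.equal(pca_std$sdev^2, eigen(cor(X))$values)
```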
17
Principal Component Analysis (PCA) Steps (in R)
1. Summarize and plot the data.
2. Decide how many PCs to keep.
3. Extract the PCs, i.e., find the loadings matrix A.
4. Rotate the PCs.
5. Interpret the results.
6. Compute PC scores.
Refer to the R code for details.
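The course's own R code is not reproduced in this transcript, so the following is only a sketch of how the six steps might look with base R functions. The use of prcomp() and varimax(), the choice of two components, and the scaling of loadings by the component standard deviations are all assumptions made for illustration; the original code may have used a different package.

```r
X <- iris[, 1:4]

# 1. Summarize the data.
sapply(X, sd)
round(cor(X), 2)

# 2. Decide how many PCs to keep (variance explained by each component).
pca <- prcomp(X, scale. = TRUE)
summary(pca)
k <- 2   # illustrative choice

# 3. Extract the PCs: loadings scaled by each component's standard deviation.
loadings_k <- pca$rotation[, 1:k] %*% diag(pca$sdev[1:k])

# 4. Rotate the retained components (orthogonal varimax rotation).
rot <- varimax(loadings_k)

# 5. Interpret the results: which variables load on which component.
print(rot$loadings)

# 6. Compute PC scores (unrotated here) for the retained components.
scores <- pca$x[, 1:k]
head(scores, 3)
```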
18
Factor Analysis
Factor analysis aims to uncover the latent structure in a given set of variables.
It looks for a smaller set of latent variables (called factors) that can explain the relationships among the observed variables.
Correlated factors are common, but not required, in the factor analysis model.
19
Factor Analysis
The model can be written as
$X_i = b_1 F_1 + b_2 F_2 + \dots + b_p F_p + e_i$
where $X_i$ is the ith observed variable ($i = 1, \dots, k$), $F_j$ are the factors ($j = 1, \dots, p$), and $p < k$.
The coefficients $b_1, \dots, b_p$ are the loadings of $X_i$ on the factors and take different values for each observed variable.
$e_i$ is the portion of variable $X_i$ unique to that variable.
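One way to get a feel for this model is to simulate data from it. The sketch below generates k = 6 observed variables from p = 2 uncorrelated factors; the loading values, sample size and error standard deviation are made-up numbers used only for illustration.

```r
set.seed(1)
n <- 500   # number of observations (hypothetical)
k <- 6     # observed variables
p <- 2     # latent factors, p < k

# Loadings: row i holds the coefficients b_1, ..., b_p for variable X_i.
# Variables 1-3 load on factor 1 only, variables 4-6 on factor 2 only.
B <- matrix(c(0.8, 0.7, 0.6, 0,   0,   0,
              0,   0,   0,   0.8, 0.7, 0.6), nrow = k, ncol = p)

Fmat <- matrix(rnorm(n * p), n, p)            # factor values F_1, F_2
E    <- matrix(rnorm(n * k, sd = 0.5), n, k)  # unique parts e_i

# Each observed variable is a weighted sum of the factors plus its unique part.
X <- Fmat %*% t(B) + E

# Variables sharing a factor are correlated; variables on different
# factors are nearly uncorrelated.
round(cor(X), 2)
```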
20
Factor Analysis
There are many methods of extracting common factors, including maximum likelihood (ml), iterated principal axis (pa), weighted least squares (wls), generalized weighted least squares (gls), and minimum residual (minres).
The extraction method to use can be specified in the R code.
21
Factor Analysis Steps (in R)
1. Summarize and plot the data.
2. Decide how many factors to keep.
3. Extract the factors, i.e., find the loadings matrix A.
4. Rotate the factors.
5. Interpret the results.
6. Compute factor scores if needed.
Refer to the R code for details.
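Since the course code is not part of this transcript, here is a hedged sketch of the steps using base R's factanal(), which fits the model by maximum likelihood. It assumes the built-in `ability.cov` dataset (a covariance matrix of six ability tests) rather than the course data, and two factors with a varimax rotation as illustrative choices.

```r
# 1-2. Summarize the correlations and look at their eigenvalues to help
#      decide how many factors to keep.
R  <- cov2cor(ability.cov$cov)
round(R, 2)
ev <- eigen(R)$values
ev

# 3-4. Extract two factors by maximum likelihood and rotate them.
fa2 <- factanal(factors = 2, covmat = ability.cov, rotation = "varimax")

# 5. Interpret: loadings (small values hidden) and the uniquenesses,
#    i.e. the share of each variable's variance not explained by the factors.
print(loadings(fa2), cutoff = 0.3)
round(fa2$uniquenesses, 3)

# 6. Factor scores need the raw data rather than a covariance matrix,
#    e.g. factanal(x, factors = 2, scores = "regression") for a data frame x.
```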
22
Relationship between PCA and Factor Analysis
[Figure A: Principal Component Analysis Model, relating the observed variables x1–x5 to the components PC1 and PC2.]
[Figure B: Factor Analysis Model, relating the factors F1 and F2 to the observed variables X1–X5, each with its own error term e1–e5.]
Source: R in Action: Data Analysis and Graphics with R, Robert I. Kabacoff
23
References
– R in Action: Data Analysis and Graphics with R, Robert I. Kabacoff
– http://strata.uga.edu/software/pdf/pcaTutorial.pdf
24
Please don’t forget to fill out the sign-in sheet and to complete the survey that will be sent to you by email.
Thank you!