Factor Analysis BMTRY 726 7/19/2018.



Uses
Goal: Similar to PCA, describe the covariance of a large set of measured traits using a few linear combinations of underlying latent traits.
Why: again, for reasons similar to PCA:
(1) Dimension reduction (use k of p components)
(2) Remove redundancy/duplication from a set of correlated variables
(3) Represent correlated variables with a smaller set of "derived" variables
(4) Create "new" factor variables that are independent

For Example
Say we want to define "frailty" in a population of cancer patients. We have a concept of what "frailty" is but no direct way to measure it. We believe an individual's frailty has to do with their weight, strength, speed, agility, balance, etc. We therefore want to define frailty as some composite measure of all of these factors.

Key Concepts
- F_j is a latent underlying variable (j = 1, 2, …, m)
- The X's are observed variables related to what we think F_j might be
- ε_i is the measurement error for X_i, i = 1, 2, …, p
- l_ij is the factor "loading" of X_i on F_j

Orthogonal Factor Model Consider data with p observed variables:
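Written out, the orthogonal factor model is usually stated as below. This is the standard textbook form rather than a copy of the slide; the mean μ_i of X_i is an assumed symbol, while l_ij are the loadings, F_j the latent factors, and ε_i the measurement error for X_i.

```latex
\begin{aligned}
X_1 - \mu_1 &= l_{11}F_1 + l_{12}F_2 + \cdots + l_{1m}F_m + \varepsilon_1 \\
X_2 - \mu_2 &= l_{21}F_1 + l_{22}F_2 + \cdots + l_{2m}F_m + \varepsilon_2 \\
            &\;\;\vdots \\
X_p - \mu_p &= l_{p1}F_1 + l_{p2}F_2 + \cdots + l_{pm}F_m + \varepsilon_p
\end{aligned}
\qquad\Longleftrightarrow\qquad
\underset{(p\times 1)}{\mathbf{X} - \boldsymbol{\mu}}
  = \underset{(p\times m)}{\mathbf{L}}\;\underset{(m\times 1)}{\mathbf{F}}
  + \underset{(p\times 1)}{\boldsymbol{\varepsilon}}
```

When X is centered at its mean, this reduces to the shorter form X = LF + ε.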

Model Assumptions
We must make some serious assumptions. Note that these are very strong assumptions, which implies the model applies only narrowly. These models also work best when p >> m.
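In the usual textbook statement of the orthogonal factor model, the assumptions are the moment conditions below. They are given here as a reference sketch; ψ_i denotes the specific variance of X_i.

```latex
\begin{aligned}
E(\mathbf{F}) &= \mathbf{0}, &
\operatorname{Cov}(\mathbf{F}) &= E(\mathbf{F}\mathbf{F}') = \mathbf{I}_m, \\
E(\boldsymbol{\varepsilon}) &= \mathbf{0}, &
\operatorname{Cov}(\boldsymbol{\varepsilon}) &= E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}')
  = \boldsymbol{\Psi} = \operatorname{diag}(\psi_1, \psi_2, \ldots, \psi_p), \\
& & \operatorname{Cov}(\boldsymbol{\varepsilon}, \mathbf{F}) &= E(\boldsymbol{\varepsilon}\mathbf{F}') = \mathbf{0}.
\end{aligned}
```

In words: the factors are uncorrelated with unit variance, the errors are uncorrelated with each other, and the errors are uncorrelated with the factors.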

Model Assumptions Our assumptions can be related back to the variability of our original X’s
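A short derivation shows how the standard assumptions (uncorrelated unit-variance factors, uncorrelated errors with diagonal covariance Ψ, and errors uncorrelated with factors) determine the covariance structure of X. It is a sketch in standard notation, with X centered at its mean.

```latex
\begin{aligned}
\boldsymbol{\Sigma} = \operatorname{Cov}(\mathbf{X})
  &= E\big[(\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})(\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})'\big] \\
  &= \mathbf{L}\,E(\mathbf{F}\mathbf{F}')\,\mathbf{L}'
     + E(\boldsymbol{\varepsilon}\mathbf{F}')\,\mathbf{L}'
     + \mathbf{L}\,E(\mathbf{F}\boldsymbol{\varepsilon}')
     + E(\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}') \\
  &= \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}, \\[4pt]
\operatorname{Cov}(\mathbf{X}, \mathbf{F})
  &= E\big[(\mathbf{L}\mathbf{F} + \boldsymbol{\varepsilon})\mathbf{F}'\big]
   = \mathbf{L}.
\end{aligned}
```

The second line shows that each loading l_ij is the covariance between X_i and F_j.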

Model Terms
Decomposition of the variance of X_i: the proportion of the variance of the ith measurement X_i contributed by the m factors F_1, F_2, …, F_m is called the ith communality.

Model Terms
Decomposition of the variance of X_i: the remaining proportion of the variance of the ith measurement, associated with ε_i, is called the uniqueness or specific variance.
Note that we are assuming the variances and covariances of X can be reconstructed from our pm factor loadings l_ij and the p specific variances.
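Element by element, the decomposition can be written as follows. This is the standard form implied by Cov(X) = LL' + Ψ; σ_ii, σ_ik and h_i² are conventional symbols for the variance of X_i, the covariance of X_i and X_k, and the ith communality.

```latex
\begin{aligned}
\underbrace{\sigma_{ii}}_{\operatorname{Var}(X_i)}
  &= \underbrace{l_{i1}^2 + l_{i2}^2 + \cdots + l_{im}^2}_{\text{communality } h_i^2}
   + \underbrace{\psi_i}_{\text{specific variance}}, \\[4pt]
\sigma_{ik} = \operatorname{Cov}(X_i, X_k)
  &= l_{i1}l_{k1} + l_{i2}l_{k2} + \cdots + l_{im}l_{km}, \qquad i \neq k.
\end{aligned}
```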

Limitations
Linearity
- We assume the X's are linear combinations of the factors
- The factors are unobserved, so we cannot verify this assumption
- If the relationship is non-linear, linear combinations may provide a good approximation for only a small range of values
The elements of Σ are described by mp factor loadings in L and p specific variances {ψ_i}
- The model is most useful for small m, but often mp + p parameters are not sufficient and Σ is not close to LL' + Ψ
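To make the parameter count concrete, the sketch below compares the number of distinct variances and covariances in Σ with the mp + p parameters of an m-factor model. The function names are illustrative, and p = 8 is used only because the phthalate example has 8 metabolites.

```python
def covariance_elements(p):
    """Number of distinct variances and covariances in a p x p covariance matrix."""
    return p * (p + 1) // 2


def factor_model_parameters(p, m):
    """Loadings (p * m) plus specific variances (p) in an m-factor model."""
    return p * m + p


p = 8  # illustrative: the number of metabolites in the phthalate example
for m in (1, 2, 3):
    # Print m, the number of model parameters, and the number of elements to reproduce
    print(m, factor_model_parameters(p, m), covariance_elements(p))
```

The smaller m is, the fewer parameters must reproduce the same p(p + 1)/2 elements, which is why a small-m model is attractive but may fit poorly.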

Limitations
Even when m < p and we find an L such that Σ = LL' + Ψ, L is not unique.

Limitations
Non-unique L: for any m × m orthogonal matrix T we can define new loadings L* = LT and new factors F* = T'F.
- So what happens to our moments if we use these new factors and factor loadings?
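A small numerical check illustrates the answer: multiplying the loadings by an orthogonal matrix T changes the loadings but leaves LL' (and therefore the implied covariance LL' + Ψ) unchanged, because (LT)(LT)' = LTT'L' = LL'. The loadings and the 30-degree rotation below are hypothetical values chosen for illustration, not taken from the slides.

```python
import math


def rotate_loadings(loadings, theta):
    """Return L* = L T for a p x 2 loading matrix and the rotation T(theta)."""
    c, s = math.cos(theta), math.sin(theta)
    # T = [[c, s], [-s, c]] is orthogonal: T T' = I
    return [[l1 * c - l2 * s, l1 * s + l2 * c] for l1, l2 in loadings]


def common_covariance(loadings):
    """Return LL', the part of Cov(X) reproduced by the common factors."""
    return [[sum(a * b for a, b in zip(row_i, row_k)) for row_k in loadings]
            for row_i in loadings]


# Hypothetical loadings for p = 3 variables on m = 2 factors
L = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.7]]
L_star = rotate_loadings(L, math.radians(30))

max_diff = max(abs(a - b)
               for row_a, row_b in zip(common_covariance(L), common_covariance(L_star))
               for a, b in zip(row_a, row_b))
print(L_star)    # the loadings themselves change
print(max_diff)  # LL' does not, up to floating-point error
```

Since the data cannot distinguish L from L*, any interpretation of the factors depends on which rotation is chosen.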

Potential Pitfall
The problem is that most covariance matrices cannot be factored in the manner we have defined for a factor model:
X = LF + ε
Cov(X) = Σ = LL' + Ψ
For example, consider data with 3 variables that we are trying to describe with 1 factor.

Potential Pitfall We can write our factor model as follows: Using our factor model representation of variance, LL’ + y, we can define the following six equations

Potential Pitfall Use these equations to find the factor loadings and specific variances:

Potential Pitfall However, this results in the following problems:
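The slide's numeric example did not survive in this transcript, so the sketch below uses a hypothetical 3 × 3 correlation matrix to show the kind of problem that can arise. With one factor, the three covariance equations σ_12 = l_11 l_21, σ_13 = l_11 l_31 and σ_23 = l_21 l_31 determine the loadings, and the three variance equations then give the specific variances.

```python
import math

# Hypothetical correlation matrix for 3 variables (illustrative values only)
sigma = [[1.0, 0.9, 0.7],
         [0.9, 1.0, 0.4],
         [0.7, 0.4, 1.0]]
s12, s13, s23 = sigma[0][1], sigma[0][2], sigma[1][2]

# Solve the off-diagonal equations for the loadings on the single factor
l11 = math.sqrt(s12 * s13 / s23)  # from (l11*l21)(l11*l31)/(l21*l31) = l11^2
l21 = s12 / l11
l31 = s13 / l11
loadings = [l11, l21, l31]

# Solve the diagonal equations sigma_ii = l_i1^2 + psi_i for the specific variances
psi = [sigma[i][i] - loadings[i] ** 2 for i in range(3)]

print(loadings)  # l11 is a correlation between X1 and F1, yet it exceeds 1
print(psi)       # psi[0] is a variance, yet it is negative
```

For this matrix the equations have a solution, but it is not a proper one: a loading that should be a correlation exceeds 1 and a specific variance is negative.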

Methods of Estimation
We need to estimate the factor loadings L and the specific variances Ψ of the latent-factor model.
- We have a random sample of n subjects from a population
- We measure p attributes for each of the n subjects

Methods of Estimation
We could also standardize our variables and work with the sample correlation matrix R.
Methods of estimation:
1. Principal components method
2. Principal factor method
3. Maximum likelihood method

Principal Component Method
Given Σ (or S if we have a sample from the population), consider its spectral decomposition in terms of its eigenvalues λ_1 ≥ λ_2 ≥ … ≥ λ_p and the corresponding eigenvectors.
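The decomposition in question is usually written as follows, where (λ_j, e_j) are the eigenvalue/eigenvector pairs of Σ. This is the standard spectral decomposition, shown as a reference sketch.

```latex
\boldsymbol{\Sigma}
 = \lambda_1 \mathbf{e}_1\mathbf{e}_1' + \lambda_2 \mathbf{e}_2\mathbf{e}_2' + \cdots + \lambda_p \mathbf{e}_p\mathbf{e}_p'
 = \begin{bmatrix} \sqrt{\lambda_1}\,\mathbf{e}_1 & \sqrt{\lambda_2}\,\mathbf{e}_2 & \cdots & \sqrt{\lambda_p}\,\mathbf{e}_p \end{bmatrix}
   \begin{bmatrix} \sqrt{\lambda_1}\,\mathbf{e}_1' \\ \sqrt{\lambda_2}\,\mathbf{e}_2' \\ \vdots \\ \sqrt{\lambda_p}\,\mathbf{e}_p' \end{bmatrix}
 = \mathbf{L}\mathbf{L}'
```

This is an exact factor representation with as many factors as variables (m = p) and all specific variances equal to zero.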

Principal Component Method
The problem here is that m = p, so we want to drop some factors. We drop the factors whose eigenvalues λ are small (i.e., stop at λ_m).

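Principal Component Method
Estimate L and Ψ by substituting the estimated eigenvalues/eigenvectors of S or R: column j of the estimated L is the jth eigenvector scaled by the square root of the jth eigenvalue, j = 1, …, m.
To make the diagonal elements of the estimated LL' + Ψ equal the diagonal elements of S (or R), we let ψ_i = s_ii - (l_i1² + … + l_im²).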

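Principal Component Method
The optimality of using the estimated LL' + Ψ to approximate S is due to the following bound: the sum of squared entries of S - (LL' + Ψ) is at most λ_(m+1)² + … + λ_p².
Note that the sum of squared elements of this residual matrix is an approximation of the sum of squared error.
We can also estimate the proportion of the total sample variance due to the jth factor: λ_j / (s_11 + … + s_pp) when factoring S, or λ_j / p when factoring R.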
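The principal component solution can be sketched in a few lines of NumPy. The function name `pc_factor` and the small 3 × 3 correlation matrix are illustrative assumptions, not taken from the slides; the code only uses `numpy.linalg.eigh` on a symmetric input matrix.

```python
import numpy as np


def pc_factor(S, m):
    """Principal component solution of an m-factor model for a covariance or correlation matrix S."""
    S = np.asarray(S, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]           # reorder so the largest comes first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    L = eigvecs[:, :m] * np.sqrt(eigvals[:m])   # column j is sqrt(lambda_j) * e_j
    psi = np.diag(S) - np.sum(L ** 2, axis=1)   # specific variances: s_ii minus communality
    residual = S - (L @ L.T + np.diag(psi))     # what the m-factor model fails to reproduce
    prop_var = eigvals[:m] / np.trace(S)        # proportion of total variance per factor
    return L, psi, residual, prop_var, eigvals


# Hypothetical correlation matrix (illustrative values only)
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
L, psi, residual, prop_var, eigvals = pc_factor(R, m=1)
print(L.ravel(), psi, prop_var)
print(np.sum(residual ** 2), np.sum(eigvals[1:] ** 2))  # residual sum of squares vs. its bound
```

The residual matrix has zeros on its diagonal by construction, so only the off-diagonal elements indicate lack of fit.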

Phthalate and Birth Outcomes
Recall the phthalate study. The PI is interested in examining the impact of phthalate exposure on infant birth outcomes and, ideally, in identifying underlying factors (i.e., patterns in the phthalate exposure). Factor analysis could be used to identify these underlying factors and then see whether they are associated with birth outcomes.
Phthalates (urine metabolites): MBP, MBZP, MEHHP, MEHP, MEOHP, MIBP, MEP, MMP

Urine Phthalate Metabolites
Phthalate data include n = 297 mothers with urine levels of p = 8 variables/metabolites. Data were standardized and factor analysis was performed on the sample correlation matrix R (upper triangle shown):

        MBP    MBZP   MEHHP  MEHP   MEOHP  MIBP   MEP    MMP
MBP     1      0.802  0.713  0.581  0.679  0.848  0.584  0.638
MBZP           1      0.623  0.531  0.591  0.720  0.520  0.561
MEHHP                 1      0.821  0.984  0.661  0.444  0.537
MEHP                         1      0.834  0.523  0.369  0.446
MEOHP                               1      0.635  0.420  0.510
MIBP                                       1      0.574  0.600
MEP                                               1      0.419
MMP                                                      1

Urine Phthalate Metabolites Given the eigenvalues/vectors of R, what is our factor model for two factors?

Urine Phthalate Metabolites Given a 2-factor solution, we can find the communalities and specific variances based on our loadings and R.

Urine Phthalate Metabolites What is the cumulative proportion of variance accounted for by factor 1? What about both factors?

Urine Phthalate Metabolites How well does our model check out?

Urine Phthalate Metabolites
Two-factor (m = 2) solution; data standardized and FA performed using the correlation matrix R. How might we interpret these factors?

Variable             Factor 1  Factor 2  Specific variance  h2
lMBP                 0.883      0.26     0.151              0.849
lMBzP                0.803      0.29     0.268              0.732
lMEHHP               0.880     -0.42     0.047              0.953
lMEHP                0.779     -0.49
lMEOHP               0.858     -0.46     0.048              0.952
lMiBP                0.826      0.32     0.216              0.784
lMEP                 0.617      0.42     0.436              0.564
lMMP                 0.729      0.21     0.422              0.578
SS loading           5.14       1.12
Proportion variance  0.642      0.141
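The table's summary rows can be checked from the loadings themselves: each communality h2 is a row sum of squared loadings, each SS loading is a column sum of squared loadings, and the proportion of variance is the SS loading divided by p = 8 for standardized data. The sketch below uses the loadings as reported on the slide; small discrepancies are expected because the slide values are rounded, and the variable names are illustrative.

```python
# Two-factor loadings (Factor 1, Factor 2) as reported on the slide
loadings = {
    "lMBP":   (0.883,  0.26),
    "lMBzP":  (0.803,  0.29),
    "lMEHHP": (0.880, -0.42),
    "lMEHP":  (0.779, -0.49),
    "lMEOHP": (0.858, -0.46),
    "lMiBP":  (0.826,  0.32),
    "lMEP":   (0.617,  0.42),
    "lMMP":   (0.729,  0.21),
}
p = len(loadings)

# Communality: sum of squared loadings across the factors for each variable
communality = {name: l1 ** 2 + l2 ** 2 for name, (l1, l2) in loadings.items()}
# Specific variance for standardized variables: 1 minus the communality
specific = {name: 1.0 - h2 for name, h2 in communality.items()}

# SS loading: sum of squared loadings down each factor's column
ss_loading = [sum(row[j] ** 2 for row in loadings.values()) for j in range(2)]
proportion = [ss / p for ss in ss_loading]

for name in loadings:
    print(name, round(communality[name], 3), round(specific[name], 3))
print([round(ss, 2) for ss in ss_loading], [round(pr, 3) for pr in proportion])
```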

Principal Component Method
- Estimated loadings on a given factor do not change as the number of factors increases
- Diagonal elements of S (or R) exactly equal the diagonal elements of the estimated LL' + Ψ, but the sample covariances may not be exactly reproduced
- Select the number of factors m to make the off-diagonal elements of the residual matrix S - (LL' + Ψ) small
- The contribution of the kth factor to the total variance is the sum of squared loadings in column k of L, which equals the kth eigenvalue λ_k

Principal Factor Method
Consider the model R = LL' + Ψ. Suppose initial estimates are available for the communalities or specific variances.

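Principal Factor Method
Then the reduced correlation matrix Rr = R - Ψ, which has the communality estimates on its diagonal in place of the 1s, is approximated by LL'.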

Principal Factor Method
Apply the procedure iteratively:
1. Start with initial estimates of the communalities (or specific variances)
2. Compute the factor loadings from the eigenvalues/vectors of Rr
3. Compute new values of the communalities and update the diagonal of Rr
4. Repeat steps 2 and 3 until the algorithm converges
Problems:
- Some eigenvalues of Rr can be negative
- Choice of m (if m is too large, some communalities exceed 1 and the iteration terminates)
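The iteration can be sketched as follows. The function name `principal_factor`, the squared-multiple-correlation starting values, the tolerance, and the example matrix are all assumptions made for illustration; the slides do not say which starting values were used.

```python
import numpy as np


def principal_factor(R, m, max_iter=500, tol=1e-8):
    """Iterated principal factor solution for a correlation matrix R with m factors."""
    R = np.asarray(R, dtype=float)
    # Step 1 (assumed choice): start from squared multiple correlations
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(max_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, h2)                 # reduced correlation matrix
        eigvals, eigvecs = np.linalg.eigh(Rr)    # Step 2: eigen-decompose Rr
        order = np.argsort(eigvals)[::-1][:m]    # indices of the m largest eigenvalues
        lam = eigvals[order]
        if np.any(lam < 0):
            raise ValueError("negative eigenvalue among the first m; try a smaller m")
        L = eigvecs[:, order] * np.sqrt(lam)
        new_h2 = np.sum(L ** 2, axis=1)          # Step 3: updated communalities
        if np.any(new_h2 > 1):
            raise ValueError("communality exceeds 1; try a smaller m")
        converged = np.max(np.abs(new_h2 - h2)) < tol
        h2 = new_h2
        if converged:                            # Step 4: stop when communalities settle
            break
    return L, 1.0 - h2


# Hypothetical correlation matrix built from a one-factor structure
L_true = np.array([[0.9], [0.8], [0.7], [0.6]])
R = L_true @ L_true.T
np.fill_diagonal(R, 1.0)
L_hat, psi_hat = principal_factor(R, m=1)
print(L_hat.ravel(), psi_hat)
```

The two `ValueError` branches correspond to the two problems listed above: negative eigenvalues of Rr and communalities above 1.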

Urine Phthalate Metabolites
Principal factor method, m = 2 factor solution:

Variable        Factor 1  Factor 2  ψ      h2
lMBP            0.880      0.334    0.115  0.885
lMBzP           0.766      0.295    0.326  0.673
lMEHHP          0.902     -0.406    0.021  0.979
lMEHP           0.754     -0.360    0.302  0.698
lMEOHP          0.882     -0.459    0.012  0.988
lMiBP           0.801               0.247  0.753
lMEP            0.550      0.251    0.634  0.366
lMMP            0.668      0.176    0.522  0.478
SS loading      4.91       0.91
Proportion var  0.614      0.114

Urine Phthalate Metabolites Check how closely the model estimated R

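Maximum Likelihood Method
A likelihood function is needed, so additional distributional assumptions are made. An additional restriction is imposed to specify a unique solution, and the MLEs are obtained by maximizing the likelihood subject to that restriction.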
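In the usual textbook treatment, the additional assumptions are normality of the factors and errors, and the uniqueness restriction requires a certain matrix to be diagonal. The block below is a reference sketch in standard notation (x_j is the observation vector for subject j and Δ is a diagonal matrix); it is not copied from the slide.

```latex
\begin{aligned}
&\mathbf{F} \sim N_m(\mathbf{0}, \mathbf{I}), \qquad
 \boldsymbol{\varepsilon} \sim N_p(\mathbf{0}, \boldsymbol{\Psi}), \qquad
 \mathbf{F} \text{ independent of } \boldsymbol{\varepsilon}
 \;\Longrightarrow\;
 \mathbf{X}_j \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma}),
 \quad \boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}, \\[6pt]
&L(\boldsymbol{\mu}, \boldsymbol{\Sigma})
 = \prod_{j=1}^{n} (2\pi)^{-p/2}\, |\boldsymbol{\Sigma}|^{-1/2}
   \exp\!\Big\{ -\tfrac{1}{2} (\mathbf{x}_j - \boldsymbol{\mu})' \boldsymbol{\Sigma}^{-1} (\mathbf{x}_j - \boldsymbol{\mu}) \Big\}, \\[6pt]
&\text{uniqueness restriction:}\quad
 \mathbf{L}' \boldsymbol{\Psi}^{-1} \mathbf{L} = \boldsymbol{\Delta} \quad (\text{diagonal}).
\end{aligned}
```

The maximization over L and Ψ has no closed form and is carried out numerically; the MLE of μ is the sample mean vector.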

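Maximum Likelihood Method
For m factors:
- Estimated communalities: the sum of squared estimated loadings in each row of L
- Proportion of the total sample variance due to the kth factor: the sum of squared estimated loadings in column k divided by s_11 + … + s_pp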

Urine Phthalate Metabolites
Maximum likelihood method, m = 2 factor solution:

Variable        Factor 1  Factor 2  ψ
lMBP            0.675      0.669    0.097
lMBzP           0.579      0.592    0.315
lMEHHP          0.989     -0.010    0.021
lMEHP           0.834     -0.034    0.303
lMEOHP          0.993     -0.071    0.010
lMiBP           0.609      0.607    0.261
lMEP            0.399      0.439    0.648
lMMP            0.542      0.417    0.532
SS loading      4.27       1.54
Proportion var  0.534      0.192

Urine Phthalate Metabolites Check how closely the model estimated R
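One way to carry out this check is to rebuild the fitted correlation matrix LL' + Ψ from the reported maximum likelihood loadings and subtract it from R. The sketch below assumes the correlation values listed on the earlier correlation-matrix slide form the upper triangle of R read row by row; the variable names are illustrative.

```python
variables = ["MBP", "MBzP", "MEHHP", "MEHP", "MEOHP", "MiBP", "MEP", "MMP"]

# ML loadings (Factor 1, Factor 2) and specific variances as reported on the slide
loadings = [(0.675, 0.669), (0.579, 0.592), (0.989, -0.010), (0.834, -0.034),
            (0.993, -0.071), (0.609, 0.607), (0.399, 0.439), (0.542, 0.417)]
psi = [0.097, 0.315, 0.021, 0.303, 0.010, 0.261, 0.648, 0.532]

# Upper triangle of the sample correlation matrix R (assumed row-by-row reading)
upper = [
    [0.802, 0.713, 0.581, 0.679, 0.848, 0.584, 0.638],
    [0.623, 0.531, 0.591, 0.720, 0.520, 0.561],
    [0.821, 0.984, 0.661, 0.444, 0.537],
    [0.834, 0.523, 0.369, 0.446],
    [0.635, 0.420, 0.510],
    [0.574, 0.600],
    [0.419],
]
p = len(variables)
R = [[1.0] * p for _ in range(p)]
for i, row in enumerate(upper):
    for offset, r in enumerate(row, start=1):
        R[i][i + offset] = R[i + offset][i] = r  # fill both triangles symmetrically

# Fitted correlation matrix LL' + Psi and the residual matrix R - (LL' + Psi)
fitted = [[sum(a * b for a, b in zip(loadings[i], loadings[k])) + (psi[i] if i == k else 0.0)
           for k in range(p)] for i in range(p)]
residual = [[R[i][k] - fitted[i][k] for k in range(p)] for i in range(p)]

max_abs_residual = max(abs(residual[i][k]) for i in range(p) for k in range(p) if i != k)
print(round(max_abs_residual, 3))
```

Small off-diagonal residuals indicate that two factors reproduce the observed correlations well; large ones point to pairs of metabolites the model does not capture.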

Large Sample Test for number of factors
We want to be able to decide whether the number of common factors m we have chosen is sufficient. So if n is large, we can do hypothesis testing. We can use estimates in our hypothesis statement…

Large Sample Test for number of factors From this we develop a likelihood ratio test:
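The standard textbook form of this test, with Bartlett's correction, is sketched below. Here the hatted L and Ψ are the maximum likelihood estimates, S_n = ((n - 1)/n) S is the ML estimate of the unrestricted covariance matrix, and α is the significance level; these symbols are assumptions of the sketch rather than notation recovered from the slide.

```latex
\begin{aligned}
&H_0:\ \boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}
 \quad\text{vs.}\quad
 H_1:\ \boldsymbol{\Sigma} \text{ is any other positive definite matrix}, \\[6pt]
&-2\ln\Lambda = n \ln\!\left( \frac{|\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}|}{|\mathbf{S}_n|} \right), \\[6pt]
&\text{reject } H_0 \text{ at level } \alpha \text{ if}\quad
 \Big( n - 1 - \tfrac{2p + 4m + 5}{6} \Big)
 \ln\!\left( \frac{|\hat{\mathbf{L}}\hat{\mathbf{L}}' + \hat{\boldsymbol{\Psi}}|}{|\mathbf{S}_n|} \right)
 > \chi^2_{\left[(p-m)^2 - p - m\right]/2}(\alpha).
\end{aligned}
```

For p = 8 variables, the degrees of freedom [(p - m)² - p - m]/2 are 20, 13 and 7 for m = 1, 2 and 3 respectively.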

Test Results
What does it mean if we reject the null hypothesis?
- Not an adequate number of factors
Test results for our phthalate metabolite data for each approach (from R):

Number of factors  PC approach   PF approach  ML approach
1                  5.8 x 10^-53
2                  1.4 x 10^-9   0.22         0.36
3                  4.6 x 10^-9   0.84         0.88

Test Results
Problems with the test:
- If n is large and m is small compared to p, this test will very often reject the null
- The result is that we tend to want to keep more factors
- This can defeat the purpose of factor analysis
- Exercise caution when using this test