Factor Analysis. Liz Garrett-Mayer, PhD. Dept of PHS, Division of Biostatistics & Bioinformatics, Biostatistics Shared Resource, Hollings Cancer Center. Cancer Control Journal Club, March 3, 2016.

Motivating Example

Goals of paper
1. See if a previously defined measurement model of hopelessness in advanced cancer fits this sample (confirmatory factor analysis).
2. Describe the factor structure in two subpopulations, curative and palliative (exploratory factor analysis).
3. Evaluate the ‘stability’ of the factor structure: does it stay the same after 12 months? (confirmatory factor analysis)

(Exploratory) Factor Analysis
- Data reduction tool
- Removes redundancy or duplication from a set of correlated variables
- Represents correlated variables with a smaller set of “derived” variables (aka “factors”)
- Factors are formed so that they are relatively independent of one another
- Two types of “variables”: latent variables (the factors) and observed variables (aka manifest variables, or items)

Examples: diet, air pollution, personality, customer satisfaction, depression, quality of life.

Some Applications of Factor Analysis
1. Identification of underlying factors: clusters variables into homogeneous sets; creates new variables (i.e., factors); allows us to gain insight into categories.
2. Screening of variables: identifies groupings that let us select one variable to represent many; useful in regression (recall collinearity).
3. Summary: allows us to describe many variables using a few factors.
4. Clustering of objects: helps us put objects (people) into categories depending on their factor scores.

“Perhaps the most widely used (and misused) multivariate [technique] is factor analysis. Few statisticians are neutral about this technique. Proponents feel that factor analysis is the greatest invention since the double bed, while its detractors feel it is a useless procedure that can be used to support nearly any desired interpretation of the data. The truth, as is usually the case, lies somewhere in between. Used properly, factor analysis can yield much useful information; when applied blindly, without regard for its limitations, it is about as useful and informative as Tarot cards. In particular, factor analysis can be used to explore the data for patterns, confirm our hypotheses, or reduce the many variables to a more manageable number.”
-- Norman & Streiner, PDQ Statistics

Exploratory Factor Analysis
Takes a set of variables thought to measure an underlying latent variable and:
- Determines which ones “hang together”
- Identifies how many ‘dimensions’ there are to the latent variable of interest
- Determines which are the strongest variables (which ones contain most of the information), and which might be removed, either because they are redundant or because they don’t “hang” with the other variables
One of the primary goals of factor analysis is often to identify a measurement model for a latent variable. This includes:
- Identifying the items to include in the model
- Identifying how many ‘factors’ there are in the latent variable (i.e., the dimensionality of the latent variable)
- Identifying which items are “associated” with which factors

In our example: “Hopelessness” as measured by the Beck Hopelessness Scale (BHS)

Beck hopelessness scale

Goals of this EFA
- What ‘structure’ is there to hopelessness in these patient populations? Can we come up with ‘components’ of hopelessness? If so, what do they look like?
- How do we do this? Factor analysis is PURELY based on the correlation matrix (or covariance matrix) of your items. It searches for ‘commonality’ among the items based on the correlations.
- Think about it: if the items are all measuring the same ‘traits’, the correlation patterns can be used to see which items cluster together, which don’t, and how much of each item is simply ‘noise’.
- Note about Likert scale variables: they are very noisy! Correlation matrices are better suited to truly continuous variables.
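Since EFA works entirely from the item correlation matrix, the starting point in software is just that matrix. A minimal sketch in Python; the file name and DataFrame layout here are assumptions for illustration, not from the paper:

```python
import pandas as pd

# Hypothetical layout: one row per patient, one column per BHS item
# ("item1" ... "item20"); the file name is made up for illustration.
items = pd.read_csv("bhs_items.csv")

# Factor analysis is driven entirely by this 20 x 20 correlation matrix.
corr = items.corr()
print(corr.round(2))  # look for blocks of items that correlate highly
```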

Graphically, a two-factor EFA (with 7 items in the scale): [path diagram showing factors F1 and F2 with arrows to items y1–y7]

Mathematically, the two-factor EFA (with 20 items in the scale):

y_{ji} = λ_{j1} F_{1i} + λ_{j2} F_{2i} + e_{ji},   j = 1, …, 20

Interpretations: F_1 and F_2 are the latent variables (e.g., F_{1i} is the value of the i-th person’s F_1). The λ’s are called “loadings”, and each one represents, in a loose sense, the correlation between an item from the scale and the factor.
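To make the model concrete, here is a small simulation of that two-factor model: data are generated as y = ΛF + e with independent standard-normal factors. The loading values, noise level, and which items load where are illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 315, 20                      # patients, items (sizes from the example)

# Made-up loadings: first 10 items load on F1, last 10 on F2.
Lam = np.zeros((p, 2))
Lam[:10, 0] = 0.7
Lam[10:, 1] = 0.6

F = rng.standard_normal((n, 2))        # factors: mean 0, variance 1, independent
e = rng.standard_normal((n, p)) * 0.5  # item-specific noise

Y = F @ Lam.T + e                      # y_ji = lam_j1*F_1i + lam_j2*F_2i + e_ji
print(np.corrcoef(Y, rowvar=False).round(2))  # block structure in the correlations
```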

Graphically, a two-factor EFA (with 7 items in the scale): [the same path diagram, now labeled with loadings λ_{11} through λ_{72} on the factor-to-item arrows and error terms e_1 through e_7 attached to each item]

Some statistical stuff
Without additional assumptions, the model would be ‘unidentifiable’ and also hard to interpret. Assumptions:
- F_1 and F_2 are statistically independent (uncorrelated) in most implementations
- F_1 and F_2 are each normally distributed with mean 0 and variance 1
- Conditional on the latent variables, the error terms are independent

Spangenberg Paper
- Participants were recruited as part of a prospective observational study investigating meaning-focused coping and mental health in cancer patients.
- 732 eligible adult cancer patients, receiving treatment with curative or palliative intention in inpatient and outpatient cancer care facilities in Northern Germany, were asked to participate.
- At baseline, 315 patients participated; of these, 158 could be reassessed at the 12-month follow-up.

Results from Beck, curative treatment group (n=145)

Factor 1 comprises items reflecting mainly pessimistic/resigned beliefs (e.g., Item 12), whereas Factor 2 especially contains items reflecting positive beliefs explicitly referring to the future (e.g., Item 5). It is noteworthy that Factor 2 solely includes positively worded items, whereas Factor 1 includes negatively worded items only. Factor 1 includes Items 2, 9, 12, 14, 16, 17, and 20 (curative sample, Cronbach’s alpha = 0.88; palliative sample, alpha = 0.85). Factor 2 includes Items 5, 6, 8, 10, 15, and 19 (alpha = 0.73 in both samples). The two factors are moderately correlated (r = 0.45).

Let’s reverse: why two factors? How do you figure out how many factors there are? That is, what is the “dimensionality” of the latent variable? For a dataset with, let’s say, 20 variables, fit a principal components analysis (aka PCA, which is, loosely, a kind of factor analysis).
a) The PCA creates a matrix of weights from the data, from which you can generate composite variables.
b) The weights are chosen so that the 1st component (i.e., a weighted average of the variables) explains the maximum amount of variance in the variables.
c) The weights for the 2nd component are chosen to maximize the variance remaining in the data AFTER having already computed the 1st PC.
d) And so on for the remaining 18 (20 − 2 = 18) components.
What? This isn’t as abstract as it sounds. There are scales out there that use this kind of ‘component’ or ‘composite’ variable approach. Example: z = 0.5x_1 + 0.5x_2 + x_3. This approach just finds the optimal weights to maximize the variability explained.
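A sketch of this in Python, via an eigendecomposition of the correlation matrix: the eigenvectors are exactly the PCA weight vectors. `items` and `corr` are the assumed DataFrame and its correlation matrix from the earlier sketch:

```python
import numpy as np

# Eigenvectors of the correlation matrix are the PCA weight vectors.
eigvals, eigvecs = np.linalg.eigh(corr.to_numpy())
order = np.argsort(eigvals)[::-1]       # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 1st component: the weighted combination of standardized items that
# explains the maximum possible variance; the 2nd maximizes what's left.
z = (items - items.mean()) / items.std()
pc1 = z.to_numpy() @ eigvecs[:, 0]
pc2 = z.to_numpy() @ eigvecs[:, 1]
```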

Eigenvalues
Each of the principal components (aka eigenvectors) has a corresponding value called an ‘eigenvalue’, which represents the amount of variability explained by the component. When working from the correlation matrix, the sum of the eigenvalues is equal to the number of variables in your analysis (e.g., for the Spangenberg paper, the sum of the eigenvalues is 20). There are several rules of thumb for using the eigenvalues to help determine the number of ‘components’ or ‘factors’ to keep:
1. Keep as many components as have eigenvalues greater than 1.
2. Use a scree plot to determine how many components to keep.
3. Preset a threshold for percent variance explained and keep enough components to explain a sufficient amount of variance.
Think about what it means to have an eigenvalue of 1 or greater.
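Applied to the `eigvals` from the PCA sketch above, the first two rules of thumb look like this:

```python
import matplotlib.pyplot as plt

# Rule 1 (Kaiser criterion): keep components with eigenvalue > 1,
# i.e., components that explain more variance than any single item does.
print("Kaiser rule keeps", int((eigvals > 1).sum()), "components")

# Rule 2: scree plot -- look for the "elbow" where eigenvalues level off.
plt.plot(range(1, len(eigvals) + 1), eigvals, "o-")
plt.axhline(1, linestyle="--")
plt.xlabel("Component"); plt.ylabel("Eigenvalue")
plt.show()
```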

Screeplot examples

From Spangenberg: Initially, five eigenvalues were >1 in the curative sample (7.42, 2.04, 1.53, 1.25, and 1.01). In the palliative sample, four eigenvalues were >1 (7.13, 1.96, 1.43, and 1.14). The scree plot suggested a two-factor structure in both patient groups. Note: THESE SCREE PLOTS ARE INCOMPLETE!

In terms of percent variance explained: divide an eigenvalue by the number of variables in the model and you get the incremental variance explained by that component. You can also calculate the cumulative variance explained by, for example, the first 3 components.
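Continuing the sketch with the same `eigvals`:

```python
# Each eigenvalue divided by the number of variables gives the
# incremental proportion of variance explained by that component.
pct = eigvals / len(eigvals)
print((100 * pct[:3]).round(1))          # % for each of the first 3 PCs
print(round(100 * pct[:2].sum(), 1), "% for the first two together")
```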

Two-factor solution
- Explains about 50% of the variance in the data. Thus, 50% of the information in the data is ‘discarded’ when only two components are retained.
- However, there were TWENTY variables that you started with. And it’s quite possible that there is a lot of noise (Likert scale variables). How much is noise? How much is ‘signal’?
- Glass half-full: with only TWO components, you can explain as much as almost TEN variables (on average).

Pick number of factors Based on the examination of eigenvalues, determine the number of ‘factors’ you want to retain in a factor analysis. Fit a factor analysis where you pre-specify the number of factors.
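One way to fit such a pre-specified model in Python is with the third-party factor_analyzer package; this is only a sketch (the paper does not report its software), reusing the assumed `items` DataFrame:

```python
from factor_analyzer import FactorAnalyzer

# Pre-specify two factors, as the scree plot suggested; no rotation yet.
fa = FactorAnalyzer(n_factors=2, rotation=None)
fa.fit(items)
print(fa.loadings_.round(2))   # 20 x 2 matrix of unrotated loadings
```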

Interpretable? The problem with PCA and an unrotated factor analysis is that the factors are hard to interpret. The first component or factor usually has about equal loadings (+/− depending on the direction of the item) for all items. The second component may have some high and some low loadings, but is usually not very interpretable. Solution? Factor rotation.

Rotation: More statistical stuff
In principal components, the first factor describes most of the variability. After choosing the number of factors to retain, we want to spread the variability more evenly among the factors. To do this we “rotate” the factors:
– redefine factors such that loadings on the various factors tend to be very high (−1 or 1) or very low (0)
– intuitively, this makes sharper distinctions in the meanings of the factors
We use “factor analysis” for rotation, NOT principal components!

How can we do this? Doesn’t it change our ‘answer’? Statistically, it doesn’t: the percent variance explained is the same, etc. For a factor analysis solution to be calculated, there have to be constraints, or assumptions. Some of them are:
- Factors are normally distributed
- Factors have mean 0 and variance 1
- In the initial solution, another constraint is that the first factor explains the most variance, the 2nd factor explains the ‘next most’ conditional on the first factor, etc.
What if we change this last assumption and instead create a different constraint to make the model identifiable? That is, focus on ‘shrinking’ loadings?

Rotation types
- “orthogonal”: maintains independent factors (i.e., uncorrelated factors)
- “oblique”: allows some dependence. Usually not terribly different from orthogonal, but loadings are often ‘shrunk’ more towards 0 or 1 (or −1).
Spangenberg et al. assumed that the factors would be likely to be correlated, so they used an oblique rotation.
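In the factor_analyzer sketch from earlier, the two rotation families are just different `rotation` arguments; which items load where is then read off the rotated loading matrices:

```python
from factor_analyzer import FactorAnalyzer

# Orthogonal rotation: factors are kept uncorrelated.
fa_varimax = FactorAnalyzer(n_factors=2, rotation="varimax")
fa_varimax.fit(items)

# Oblique rotation: factors are allowed to correlate, as Spangenberg
# et al. assumed (they reported r = 0.45 between the two factors).
fa_oblimin = FactorAnalyzer(n_factors=2, rotation="oblimin")
fa_oblimin.fit(items)

print(fa_varimax.loadings_.round(2))
print(fa_oblimin.loadings_.round(2))
```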

Aside: principal factors vs. principal components. The defining characteristic that distinguishes the two factor-analytic models is that in principal components analysis we assume that all variability in an item should be used in the analysis, while in principal factors analysis we only use the variability in an item that it has in common with the other items. In most cases, the two methods yield very similar results. However, principal components analysis is often preferred as a method for data reduction, while principal factors analysis is often preferred when the goal of the analysis is to detect structure.

A few more details: keeping vs. dropping items
- Look at the fitted model (before or after rotation) to determine each variable’s “uniqueness.”
- Communality = 1 − Uniqueness
- Communality is a measure of how much is ‘shared’ between the item and the latent variable structure; uniqueness is what is left over (i.e., noise).
- Uniqueness DOES depend on the number of factors retained.
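For an orthogonal solution these quantities are easy to compute from the loadings; continuing the factor_analyzer sketch with `fa_varimax`:

```python
# Communality: the variance an item shares with the retained factors
# (sum of squared loadings across factors, for an orthogonal solution).
communality = (fa_varimax.loadings_ ** 2).sum(axis=1)
uniqueness = 1 - communality              # Communality = 1 - Uniqueness
print(uniqueness.round(2))                # large values flag items to drop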

Example of uniquenesses

A few more details Estimation

Next steps
- You can use your model to calculate the estimated factor scores for each subject in your dataset.
- Confirmatory factor analysis: a restrictive approach which forces some of the arrows (i.e., loadings) to be zero.
- EFA: a descriptive approach to determine structure. CFA: a test of a particular structure.
- Applications?
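Continuing the sketch, factor scores for each subject come straight from the fitted model:

```python
# One estimated score per subject per factor (here: one row per patient).
scores = fa_oblimin.transform(items)
print(scores[:5].round(2))
```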

CFA: compare structure to a fixed model. [Path diagram: items y1–y3 load only on F1 (loadings λ_{11}, λ_{21}, λ_{31}) and items y4–y7 load only on F2 (λ_{42}, λ_{52}, λ_{62}, λ_{72}); all other loadings are fixed to zero, and each item keeps its own error term e_1–e_7.]

Spangenberg: Used CFA CFA was used to determine if the patients in this study demonstrated the same factor structure for BHS as advanced cancer patients. They found that the structure in this sample of patients differed from previous studies.

CFA
- CFA is often thought of as a ‘special case’ of structural equation models.
- You make assumptions about how the variables are associated, using ‘arrows’ to join variables (both latent and observed).
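A hedged sketch of such a CFA in Python using the third-party semopy package (not the paper’s software), with the factor-item assignments taken from the two-factor EFA solution reported above; the column names like `item2` are hypothetical:

```python
import semopy

# Each item is forced to load on exactly one factor; all other
# loadings (arrows) are fixed to zero.
desc = """
Pessimism =~ item2 + item9 + item12 + item14 + item16 + item17 + item20
Positive  =~ item5 + item6 + item8 + item10 + item15 + item19
"""
model = semopy.Model(desc)
model.fit(items)
print(model.inspect())           # estimated loadings and (co)variances
print(semopy.calc_stats(model))  # fit indices such as CFI and RMSEA
```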

“Stability”
Spangenberg et al. also evaluated the structure over time. Did the factors stay the same in their structure (i.e., loadings, dimensionality)? Some measures said stability was acceptable; others said it wasn’t. Problem with this paper: they provide the test statistics, but do not show us the descriptive statistics (i.e., what were the loadings in the fit to the 12-month data?).

More details on the process And, a lot more math, statistics and matrices!