From PCA to Confirmatory FA (from using Stata to using Mx and other SEM software)
References: Chapter 8 of Hamilton; Chapter 10 of Lattin et al.
Data sets: College.txt, Govern.sav, Adoption.txt
Class 1
Principal Components
Exploratory Factor Model
Confirmatory Factor Model
Principal Components
Basic principles and the use of the method, with an example.
Chapter 8 of Hamilton, pp.
data=read.table("G:/Albert/COURSES/RMMSS/Schools1.txt", header=T) names(data) [1] "School" "SchoolT" "SAT" "Accept" "CostSt" "Top10" "PhD" [8] "Grad" attach(data) pairs(data[,3:8]) lCost=log(CostSt) cdata=cbind(data[,3:4], lCost, data[,6:8]) pairs(cdata)
Principal Components Analysis (PCA)
Y_j = a_{j1} PC_1 + a_{j2} PC_2 + E_j,  j = 1, 2, ..., p
the Y_j are manifest variables
E_j = a_{j3} PC_3 + ... + a_{jp} PC_p
the PC_k are called principal components
Let R_j^2 be the R^2 of the (linear) regression of Y_j on PC_1 and PC_2.
In PCA, the a's are chosen so as to maximize sum_j R_j^2.
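A quick sketch of this property (not from the original notes), reusing cdata defined above: regress each standardized variable on the first two component scores and check that the R_j^2 add up to the variance explained by PC_1 and PC_2.

pc <- princomp(cdata, cor=TRUE, scores=TRUE)
Z  <- scale(cdata)                          # standardized manifest variables
R2 <- sapply(1:ncol(Z), function(j)
        summary(lm(Z[, j] ~ pc$scores[, 1] + pc$scores[, 2]))$r.squared)
round(R2, 3)                                # one R^2 per manifest variable
c(sum(R2), sum(pc$sdev[1:2]^2))             # both equal lambda_1 + lambda_2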
plot(lCost, PhD)
identify(lCost, PhD)
[1] 40
data[40,1]
[1] JohnsHopkins
> round(cor(cdata),3)
  (6 x 6 correlation matrix of SAT, Accept, lCost, Top10, PhD, Grad; values not shown)
> plot(lCost, PhD)
> identify(lCost, PhD)
[1] 40
> data[40,1]
[1] JohnsHopkins
50 Levels: Amherst Barnard Bates Berkeley Bowdoin Brown BrynMawr ... Yale
> round(cor(cdata[-40,]),3)
  (same correlation matrix recomputed without observation 40, JohnsHopkins; values not shown)
use "G:\Albert\COURSES\RMMSS\school1.dta", clear. edit - preserve. summarize sat accept costst top10 phd grad Variable | Obs Mean Std. Dev. Min Max sat | accept | costst | top10 | phd | grad |
. gen lcost = log(costst)
. pca sat accept lcost top10 phd grad, factors(2)
(obs=50)
(principal components; 2 components retained)
  (table of Component, Eigenvalue, Difference, Proportion, Cumulative; values not shown)
  (eigenvectors for sat, accept, lcost, top10, phd, grad; values not shown)
. greigen
. score f1 f2
(based on unrotated principal components)
  (scoring coefficients for the six variables; values not shown)
. summarize f1 f2
  (f1 and f2 have mean zero; normalized principal components)
. graph f2 f1, s([_n])
. cor sat accept lcost top10 phd grad f1 f2
(obs=50)
  (correlations of the six variables with the component scores f1 and f2; values not shown)
library(mva)
help('factanal')
help('princomp')
pca=princomp(cdata, cor=T, scores=T)
biplot(pca)
> summary(pca)
Importance of components:
  (standard deviation, proportion of variance and cumulative proportion of each component; values not shown)
round(cov(pca$scores[,1:2]),3)
  (2 x 2 covariance matrix of the first two component scores; values not shown)
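As a sketch (not in the original notes), the R analogue of the Stata correlation table above: the correlation of each variable with a component equals the loading times the component's standard deviation.

round(cor(cdata, pca$scores[, 1:2]), 3)                 # variables vs. first two components
round(pca$loadings[, 1:2] %*% diag(pca$sdev[1:2]), 3)   # same values: a_jk * sqrt(lambda_k)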
> data[,1] [1] Amherst Swarthmore Williams Bowdoin Wellesley [6] Pomona Wesleyan Middlebury Smith Davidson [11] Vassar Carleton ClarMcKenna Oberlin WashingtonLee [16] Grinnell MountHolyoke Colby Hamilton Bates [21] Haverford Colgate BrynMawr Occidental Barnard [26] Harvard Stanford Yale Princeton CalTech [31] MIT Duke Dartmouth Cornell Columbia [36] UofChicago Brown UPenn Berkeley JohnsHopkins [41] Rice UCLA UVa. Georgetown UNC [46] UMichican CarnegieMellon Northwestern WashingtonU UofRochester
DD=dist(pca$scores[,1:2], method="euclidean", diag=FALSE)
clust=hclust(DD, method="complete", members=NULL)
plot(clust, labels=data[,1], cex=.8, col="blue", main="clustering of education")
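A small follow-up sketch (not in the notes): cut the dendrogram into a fixed number of groups and list the schools in each.

groups <- cutree(clust, k=3)              # 3 clusters; other values of k can be tried
split(as.character(data[,1]), groups)     # school names by cluster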
(Exploratory) Factor Analysis
Y_j = a_{j1} F_1 + a_{j2} F_2 + E_j,  j = 1, 2, ..., p
the E_j are uncorrelated across j!
The a's are chosen by the principal factor method, ML, ...
There is no unique solution (the model is not identified).
Rotation methods maximize interpretability (e.g., Varimax).
Chapter 8 of Hamilton, pp.
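A sketch (not from the notes) of the rotation indeterminacy: rotating the loadings changes their interpretation but not the fitted correlation matrix, so the data cannot distinguish the two solutions; this is why a criterion such as varimax is used to pick one.

fa      <- factanal(cdata, factors=2, rotation="none")
A       <- loadings(fa)                        # unrotated loadings
Arot    <- varimax(A)$loadings                 # varimax-rotated loadings
fitted1 <- A %*% t(A) + diag(fa$uniquenesses)
fitted2 <- Arot %*% t(Arot) + diag(fa$uniquenesses)
all.equal(fitted1, fitted2)                    # TRUE: same implied correlation matrix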
Exploratory Factor Analysis
. factor sat accept lcost top10 phd grad, factors(3) ipf
(obs=50)
(iterated principal factors; 3 factors retained)
  (table of Factor, Eigenvalue, Difference, Proportion, Cumulative; values not shown)
  (factor loadings and uniquenesses for the six variables; values not shown)
Exploratory Factor Analysis
> fac=factanal(cdata, factors=2, scores="regression")
> fac
Call: factanal(x = cdata, factors = 2, scores = "regression")
Uniquenesses:
  (SAT, Accept, lCost, Top10, PhD, Grad; values not shown)
Loadings:
  (Factor1 and Factor2 loadings for the six variables; values not shown)
  (SS loadings, Proportion Var, Cumulative Var for Factor1 and Factor2; values not shown)
Test of the hypothesis that 2 factors are sufficient.
The chi square statistic is ... on 4 degrees of freedom. The p-value is ...
> summary(fac)
             Length Class    Mode
converged     1     -none-   logical
loadings     12     loadings numeric
uniquenesses  6     -none-   numeric
correlation  36     -none-   numeric
criteria      3     -none-   numeric
factors       1     -none-   numeric
dof           1     -none-   numeric
method        1     -none-   character
scores      100     -none-   numeric
STATISTIC     1     -none-   numeric
PVAL          1     -none-   numeric
n.obs         1     -none-   numeric
call          4     -none-   call
> plot(fac$scores, type="n")
> text(fac$scores[,1], fac$scores[,2], 1:50, cex=.8)
(Confirmatory) Factor Analysis
Y_j = a_{j1} F_1 + a_{j2} F_2 + E_j,  j = 1, 2, ..., p
the E_j are uncorrelated across j!
Some of the a's are free, others are restricted a priori (to 0s, 1s, or by equality constraints among them); the estimation method is ML, GLS, ...
The solution is unique (an identified model).
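As an illustration only (not the Mx or EQS code used later), this kind of a-priori restriction can be written in R with the lavaan package; the variable names Y1-Y6 and the data frame mydata are placeholders.

library(lavaan)
model <- '
  F1 =~ 1*Y1 + a*Y2 + a*Y3     # loading of Y1 fixed to 1; Y2 and Y3 constrained equal
  F2 =~ NA*Y4 + Y5 + Y6        # free the first loading of F2 ...
  F2 ~~ 1*F2                   # ... and identify F2 by fixing its variance to 1
  F1 ~~ F2                     # factor covariance left free
'
fit <- cfa(model, data = mydata)       # ML estimation by default
summary(fit, fit.measures = TRUE)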
Lattin and Roberts data on the adoption of new technologies, p. 366 of Lattin et al. See the data file adoption.txt in RMMSS.
Analysis of Adoption data
data=read.table("E:/Albert/COURSES/RMMSS/Mx/ADOPTION.txt", header=T) names(data) [1] "ADOPt1" "ADOPt2" "VALUE1" "VALUE2" "VALUE3" "USAGE1" "USAGE2" "USAGE3" attach(data) round(cov(data, use="complete.obs"),2) ADOPt1 ADOPt2 VALUE1 VALUE2 VALUE3 USAGE1 USAGE2 USAGE3 ADOPt ADOPt VALUE VALUE VALUE USAGE USAGE USAGE dim(data) [1] 188 8
Data Ninput=8 Nobservations=188
CMatrix File=Adoption.dat
Labels ADOPt1 ADOPt2 VALUE1 VALUE2 VALUE3 USAGE1 USAGE2 USAGE3
> data=read.table("E:/Albert/COURSES/RMMSS/Mx/ADOPTION.txt", header=T) > names(data) [1] "ADOPt1" "ADOPt2" "VALUE1" "VALUE2" "VALUE3" "USAGE1" "USAGE2" "USAGE3" attach(data) factanal(cbind(VALUE1, VALUE2,VALUE3,USAGE1, USAGE2,USAGE3), factors=2, rotation="varimax") Call: factanal(x = cbind(VALUE1, VALUE2, VALUE3, USAGE1, USAGE2, USAGE3), factors = 2) Uniquenesses: VALUE1 VALUE2 VALUE3 USAGE1 USAGE2 USAGE Loadings: Factor1 Factor2 VALUE VALUE VALUE USAGE USAGE USAGE Factor1 Factor2 SS loadings Proportion Var Cumulative Var Exploratory Factor Analysis, ML method Test of the hypothesis that 2 factors are sufficient. The chi square statistic is 1.82 on 4 degrees of freedom. The p-value is 0.768
One factor model for Value
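A hedged sketch of this model in R with the lavaan package (the class used Mx; lavaan is an alternative), reusing the adoption data read above:

library(lavaan)
m.value <- 'Value =~ VALUE1 + VALUE2 + VALUE3'   # one factor, three indicators
fit.value <- cfa(m.value, data = data)
summary(fit.value, standardized = TRUE)          # just-identified: 0 degrees of freedom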
Two factor model
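Similarly, a sketch of the two-factor (Value, Usage) model in lavaan; the factor covariance is free by default:

m2 <- '
  Value =~ VALUE1 + VALUE2 + VALUE3
  Usage =~ USAGE1 + USAGE2 + USAGE3
'
fit2 <- cfa(m2, data = data)
summary(fit2, fit.measures = TRUE, standardized = TRUE)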
Factor Analysis
Charles Spearman, 1904
According to the two-factor theory of intelligence, the performance of any intellectual act requires some combination of "g", which is available to the same individual to the same degree for all intellectual acts, and of "specific factors" or "s", which are specific to that act and which vary in strength from one act to another. If one knows how a person performs on one task that is highly saturated with "g", one can safely predict a similar level of performance on another highly "g"-saturated task. Predictions of performance on tasks with high "s" factors are less accurate. Nevertheless, since "g" pervades all tasks, prediction will be significantly better than chance. Thus, the most important information to have about a person's intellectual ability is an estimate of their "g".
Spearman, 1904
Variables: CLASSIC = V1, FRENCH = V2, ENGLISH = V3, MATH = V4, DISCRIM = V5, MUSIC = V6
Correlation matrix, cases = 23  (values not shown)
Single-Factor Model (path diagram: one common factor F1 with free loadings * on V1-V6 and unique factors E1-E6)
EQS code for a factor model
NT analysis
RESIDUAL COVARIANCE MATRIX (S-SIGMA):
  (6 x 6 residual matrix for CLASSIC V1, FRENCH V2, ENGLISH V3, MATH V4, DISCRIM V5, MUSIC V6; values not shown)
CHI-SQUARE = ... BASED ON 9 DEGREES OF FREEDOM
PROBABILITY VALUE FOR THE CHI-SQUARE STATISTIC IS ...
THE NORMAL THEORY RLS CHI-SQUARE FOR THIS ML SOLUTION IS ...
Loadings' estimates, s.e. and z-test statistics
CLASSIC  = V1 = .960*F1 + E1
FRENCH   = V2 = .866*F1 + E2
ENGLISH  = V3 = .807*F1 + E3
MATH     = V4 = .736*F1 + E4
DISCRIM  = V5 = .688*F1 + E5
MUSIC    = V6 = .653*F1 + E6
  (standard errors and z statistics not shown)
Estimates of the unique-factor variances (standard errors in parentheses)
E1 - CLASSIC   .078*  (.064)
E2 - FRENCH    .251*  (.093)
E3 - ENGLISH   .349*  (.118)
E4 - MATH      .459*  (.148)
E5 - DISCRIM   .527*  (.167)
E6 - MUSIC     .574*  (.180)
STANDARDIZED SOLUTION:
CLASSIC  = V1 = .960*F1 + E1
FRENCH   = V2 = .866*F1 + E2
ENGLISH  = V3 = .807*F1 + E3
MATH     = V4 = .736*F1 + E4
DISCRIM  = V5 = .688*F1 + E5
MUSIC    = V6 = .653*F1 + E6
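For comparison, a sketch of the same single-factor model in R with lavaan, assuming Spearman's 6 x 6 correlation matrix is stored as spearman.cor (the matrix itself is not reproduced in these notes):

library(lavaan)
m.g <- 'g =~ CLASSIC + FRENCH + ENGLISH + MATH + DISCRIM + MUSIC'
fit.g <- cfa(m.g, sample.cov = spearman.cor, sample.nobs = 23,
             std.lv = TRUE)            # var(g) fixed to 1, all loadings free, as in EQS
summary(fit.g, standardized = TRUE)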
Data of Lawley and Maxwell

M0, single-factor model (EQS input):
/TITLE
 Lawley and Maxwell data
/SPECIFICATIONS
 CAS=220; VAR=6; ME=ML;
/LABEL
 v1 = Gaelic; v2 = English; v3 = Histo; v4 = aritm; v5 = Algebra; v6 = Geometry;
/EQUATIONS
 V1 = *F1 + E1;
 V2 = *F1 + E2;
 V3 = *F1 + E3;
 V4 = *F1 + E4;
 V5 = *F1 + E5;
 V6 = *F1 + E6;
/VARIANCES
 F1 = 1;
 E1 TO E6 = *;
/COVARIANCES
/MATRIX
 (correlation matrix not shown)
/END

M1, two-factor model with correlated factors: same input except
/EQUATIONS
 V1 = *F1 + E1;
 V2 = *F1 + E2;
 V3 = *F1 + E3;
 V4 = *F2 + E4;
 V5 = *F2 + E5;
 V6 = *F2 + E6;
/VARIANCES
 F1 = 1; F2 = 1;
 E1 TO E6 = *;
/COVARIANCES
 F1, F2 = *;

M1 loadings (s.e. not shown):
GAELIC   = V1 = .687*F1 + E1
ENGLISH  = V2 = .672*F1 + E2
HISTO    = V3 = .533*F1 + E3
ARITM    = V4 = .766*F2 + E4
ALGEBRA  = V5 = .768*F2 + E5
GEOMETRY = V6 = .616*F2 + E6

COVARIANCES AMONG INDEPENDENT VARIABLES
F1, F2 = .597*  (s.e. .072)

M0, single-factor model: CHI-SQUARE = ... on 9 df, P-value less than ...
M1, two-factor model with correlated factors: CHI-SQUARE = 7.953 on 8 df, P-value = ...
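As a sketch (not part of the EQS runs above), the M0 vs. M1 comparison could be reproduced in R with lavaan, assuming the Lawley and Maxwell correlation matrix is available as lawley.cor with the six labels below:

library(lavaan)
m0 <- 'F1 =~ Gaelic + English + Histo + aritm + Algebra + Geometry'
m1 <- '
  F1 =~ Gaelic + English + Histo
  F2 =~ aritm + Algebra + Geometry
'                                             # F1 and F2 correlate by default
fit0 <- cfa(m0, sample.cov = lawley.cor, sample.nobs = 220, std.lv = TRUE)
fit1 <- cfa(m1, sample.cov = lawley.cor, sample.nobs = 220, std.lv = TRUE)
anova(fit1, fit0)                             # chi-square difference test, 1 df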