Sparse Principal Component Analysis Hui Zou, Trevor Hastie, and Robert Tibshirani 2005 Berlene Shipes
Abbreviations: Principal Component Analysis (PCA); Singular Value Decomposition (SVD); Sparse Principal Component Analysis (SPCA); Principal Component (PC)
Model Specifications: n = number of observations; p = number of predictors; response vector; predictors indexed by j = 1,…,p
Principal Component Analysis: Used for data processing and dimension reduction. Computed using the singular value decomposition of the data matrix.
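For reference, the SVD-based computation can be written out as below. The symbols U, D, and V match the notation used in Theorem 1 later in these slides; the assumption that the columns of X have been centered is ours and is not stated on the slide.

```latex
X = U D V^{T}, \qquad Z = U D, \qquad Z_i = U_i D_{ii}
```

Here the columns of Z are the principal components and the columns of V are the corresponding loadings.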
PCA Optimal Properties: Principal components sequentially capture the maximum variability among the columns of X, which guarantees minimal information loss. Principal components are uncorrelated, so one principal component can be discussed without referring to the others. Suboptimal Properties: Each PC is a linear combination of all p variables. The loadings are typically nonzero.
Previous Solutions: To aid interpretation of PCs, Jolliffe (1995) suggested rotation techniques, and Vines (2000) considered simple principal components, whose loadings take values from a small set of integers. For dimensionality reduction, Cadima and Jolliffe (1995) artificially set the loadings with absolute values smaller than some threshold to zero, McCabe (1984) found a subset of principal variables, and Jolliffe, Trendafilov, and Uddin (2003) introduced SCoTLASS to get modified PCs with possible zero loadings.
Lasso: Tibshirani (1996) introduced the lasso as a variable selection technique. It aims for models that are both accurate and sparse. It is a penalized least squares method with a constraint on the L1 norm of the regression coefficients. The penalty parameter λ is non-negative.
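The criterion that this slide describes did not survive extraction. A reconstruction in its standard penalized form is given below; the symbols Y (response vector) and X_j (jth predictor) are assumed notation.

```latex
\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \Big\| Y - \sum_{j=1}^{p} X_j \beta_j \Big\|^2 + \lambda \sum_{j=1}^{p} |\beta_j|
```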
Lasso Continued: Continuously shrinks the coefficients toward zero. Gains prediction accuracy via the bias-variance trade-off. Estimated using the LARS algorithm. Limitation: the number of variables selected by the lasso is limited by the number of observations, so it can select at most n predictors.
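To make the shrinkage behavior concrete, here is a minimal sketch of a lasso solver. It uses cyclic coordinate descent, not the LARS algorithm named on the slide, and the function names are hypothetical. It assumes the penalized form ||y - Xβ||^2 + λ Σ|β_j|, so each coordinate update soft-thresholds at λ/2.

```python
def soft_threshold(z, gamma):
    """Shrink z toward zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimize ||y - X beta||^2 + lam * sum(|beta_j|) by cyclic coordinate descent.

    X is a list of n rows (each a list of p floats); y is a list of n floats.
    This is an illustrative stand-in for LARS, not the LARS algorithm itself.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot enter the model
            # Inner product of column j with the residual that excludes column j.
            rho = 0.0
            for i in range(n):
                fit_without_j = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - fit_without_j)
            beta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return beta


# Toy orthonormal design: the solution is the soft-thresholded X^T y.
X_demo = [[1.0, 0.0], [0.0, 1.0]]
y_demo = [3.0, 0.5]
beta_demo = lasso_coordinate_descent(X_demo, y_demo, lam=2.0)
```

In this toy example the first coefficient is shrunk from 3 to 2 and the second is set exactly to zero, which is the continuous shrinkage with selection that the slide refers to.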
Elastic Net: Zou and Hastie (2005) proposed the elastic net as a generalization of the lasso. Its penalty is a convex combination of the ridge and lasso penalties. λ1 and λ2 are non-negative. Estimated using the LARS-EN algorithm.
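The elastic net criterion itself is missing from the extracted slide. A reconstruction consistent with the lasso slide is sketched below, with Y and X_j again as assumed notation.

```latex
\hat{\beta}_{\text{en}} = (1 + \lambda_2) \, \arg\min_{\beta} \Big\{ \Big\| Y - \sum_{j=1}^{p} X_j \beta_j \Big\|^2 + \lambda_2 \sum_{j=1}^{p} |\beta_j|^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| \Big\}
```

The minimizer inside the braces is the naïve elastic net; the factor (1 + λ2) rescales it. Setting λ2 = 0 recovers the lasso.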
Elastic Net Continued: When p > n, choose λ2 > 0. This removes the limitation on the number of variables that can be included in the fitted model.
SCoTLASS: Obtains sparse loadings by directly imposing an L1 constraint on PCA. A sufficiently small t yields some exactly zero loadings. The procedure is as follows:
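The optimization problem for this slide was lost in extraction. A reconstruction in the usual SCoTLASS form is given below, where a_k denotes the kth loading vector (assumed notation) and t is the tuning parameter from the slide.

```latex
\max_{a_k} \; a_k^{T} (X^{T} X) \, a_k
\quad \text{subject to} \quad
a_k^{T} a_k = 1, \qquad
a_h^{T} a_k = 0 \;\; (h < k), \qquad
\sum_{j=1}^{p} |a_{kj}| \le t
```

The orthogonality constraint applies for k ≥ 2. Without the last constraint this is ordinary PCA.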
SCoTLASS Continued: Limitations: no guidance on choosing t; high computational cost; loadings are not sparse enough when a high percentage of explained variance is required.
Simple Regression Approach: Theorem 1: For each i, denote by Zi = UiDii the ith principal component. Consider a positive λ and the ridge estimates given by
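The ridge estimate and the conclusion of Theorem 1 are missing from the extracted slide. A reconstruction, using the SVD notation X = UDV^T, is:

```latex
\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \| Z_i - X\beta \|^2 + \lambda \|\beta\|^2,
\qquad
\hat{v} = \frac{\hat{\beta}_{\text{ridge}}}{\|\hat{\beta}_{\text{ridge}}\|}
\;\Longrightarrow\;
\hat{v} = V_i
```

A one-line check: since X^T X = V D^2 V^T and X^T Z_i = V_i D_{ii}^2, the ridge solution is

```latex
\hat{\beta}_{\text{ridge}} = (X^{T} X + \lambda I)^{-1} X^{T} Z_i = V_i \, \frac{D_{ii}^2}{D_{ii}^2 + \lambda}
```

so normalizing it returns the ith loading vector V_i for any positive λ.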
Theorem 1 Implications: Theorem 1 connects PCA with a regression method. PCA always gives a unique solution in all situations. Adding an L1 penalty, which turns the ridge problem into a naïve elastic net, allows us to flexibly choose a sparse approximation to the ith principal component.
SPCA: Connecting PCA and regression, and using the lasso approach to produce sparse loadings, gives the following criterion to be optimized:
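The SPCA criterion did not survive extraction. A reconstruction consistent with the algorithm slide that follows (A = [α1, …, αk], B = [β1, …, βk]) is given below; x_i denotes the ith row of X, which is assumed notation.

```latex
(\hat{A}, \hat{B}) = \arg\min_{A, B} \; \sum_{i=1}^{n} \| x_i - A B^{T} x_i \|^2
+ \lambda \sum_{j=1}^{k} \|\beta_j\|^2
+ \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1
\quad \text{subject to} \quad A^{T} A = I_{k \times k}
```

The same λ is used for all k components, while each component has its own L1 penalty λ1,j.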
General SPCA Algorithm
1. Let A start at V[,1:k], the loadings of the first k ordinary principal components.
2. Given a fixed A = [α1, …, αk], solve the following elastic net problem for j = 1,2,…,k:
3. For a fixed B = [β1, …, βk], compute the SVD of X^T X B = U D V^T, then update A = U V^T.
4. Repeat Steps 2-3 until convergence.
5. Normalization:
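The formulas for Steps 2 and 5 were lost in extraction. Reconstructions consistent with the SPCA criterion and the notation of the algorithm are:

```latex
\begin{aligned}
\text{Step 2:} \quad & \beta_j = \arg\min_{\beta} \; (\alpha_j - \beta)^{T} X^{T} X \, (\alpha_j - \beta) + \lambda \|\beta\|^2 + \lambda_{1,j} \|\beta\|_1 \\
\text{Step 5:} \quad & \hat{V}_j = \frac{\beta_j}{\|\beta_j\|}, \qquad j = 1, \ldots, k
\end{aligned}
```

Step 2 depends on the data only through X^T X, so the algorithm can be run from a covariance or correlation matrix alone.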
Remarks about General SPCA Algorithm: The output does not change much as λ varies. If n > p, λ defaults to zero. A small positive λ overcomes potential collinearity problems in X. The algorithm converges quickly. Multiple combinations of {λ1,j} can be tried; choose values that give an acceptable compromise between variance and sparsity, giving priority to variance.
Adjusted Total Variance: Takes into account the correlations among the modified PCs, using the formula below:
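The formula is missing from the extracted slide. In its usual form, the modified PCs Ẑ are decomposed as Ẑ = QR (Q orthonormal, R upper triangular) and the adjusted total variance is the sum of the squared diagonal entries of R. The sketch below computes that sum with modified Gram-Schmidt; the function name is hypothetical and any 1/n scaling of the variance is omitted.

```python
import math


def adjusted_total_variance(z_columns, tol=1e-12):
    """Sum of R_jj^2 from the QR decomposition of the matrix whose columns
    are the modified principal components (given as a list of columns).

    Each column contributes only the variance left after removing its
    projection onto the earlier columns.
    """
    basis = []   # orthonormal directions found so far (columns of Q)
    total = 0.0
    for z in z_columns:
        residual = list(z)
        for q in basis:
            coef = sum(qi * ri for qi, ri in zip(q, residual))
            residual = [ri - coef * qi for ri, qi in zip(residual, q)]
        norm_sq = sum(ri * ri for ri in residual)
        total += norm_sq  # this is R_jj^2 for the current column
        if norm_sq > tol:
            norm = math.sqrt(norm_sq)
            basis.append([ri / norm for ri in residual])
    return total


# Two correlated "components": the naive sum of squared norms is 1 + 2 = 3,
# but the second column shares one unit of variance with the first.
adjusted_demo = adjusted_total_variance([[1.0, 0.0], [1.0, 1.0]])
```

For uncorrelated components the adjusted value equals the ordinary sum of variances; for correlated ones it is smaller, which is the point of the adjustment.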
Computation Complexity: When n > p and p ≥ k, the total computation cost is at most np^2 + mO(p^3), where m is the number of iterations before convergence and O(p^3) bounds the number of operations for each elastic net solution. SPCA is therefore efficient for huge n and small p (say, p < 100). When p >> n, the total computation cost is of order mkO(pJn + J^3) for a positive finite λ, where J is the number of nonzero coefficients. This is expensive for large J and p. The elastic net step is the most costly, which motivates a special algorithm for this type of data.
SPCA for p>>n Theorem 5.
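The statement of Theorem 5 is missing from the extracted slide. A reconstruction, using the notation of the SPCA criterion, is sketched below; it should be checked against the paper before being relied on.

```latex
(\hat{A}, \hat{B}) = \arg\min_{A, B} \; -2 \, \mathrm{tr}\!\left( A^{T} X^{T} X B \right)
+ \sum_{j=1}^{k} \|\beta_j\|^2
+ \sum_{j=1}^{k} \lambda_{1,j} \|\beta_j\|_1
\quad \text{subject to} \quad A^{T} A = I_{k \times k}
```

```latex
\hat{V}_j(\lambda) \;\to\; \frac{\hat{\beta}_j}{\|\hat{\beta}_j\|} \quad \text{as } \lambda \to \infty
```

Here V̂_j(λ) are the loadings obtained from the SPCA criterion with ridge parameter λ, so the limiting problem gives the λ = ∞ version of SPCA.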
SPCA for p>>n: Using Theorem 5, replace Step 2 in the general SPCA algorithm with soft-thresholding. Step 2: for j = 1,2,…,k
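The update formula for the replaced Step 2 is missing from the slide. In its usual form it is β_j = (|α_j^T X^T X| - λ1,j/2)_+ Sign(α_j^T X^T X), applied elementwise. The sketch below implements that update in plain Python; the function name and the choice to pass X^T X as a nested list are assumptions made for illustration.

```python
def spca_soft_threshold_step(gram, alpha, lam1):
    """One soft-thresholding update for a single component.

    gram  : X^T X as a p x p list of lists
    alpha : current column alpha_j of A (length p)
    lam1  : L1 penalty lambda_{1,j} for this component
    Returns beta_j = (|alpha^T X^T X| - lam1/2)_+ * sign(alpha^T X^T X).
    """
    p = len(alpha)
    # Row vector a = alpha^T (X^T X).
    a = [sum(alpha[m] * gram[m][col] for m in range(p)) for col in range(p)]
    threshold = lam1 / 2.0
    beta = []
    for value in a:
        magnitude = max(abs(value) - threshold, 0.0)
        if magnitude == 0.0:
            beta.append(0.0)          # small entries are set exactly to zero
        elif value > 0:
            beta.append(magnitude)
        else:
            beta.append(-magnitude)
    return beta


gram_demo = [[4.0, 1.0], [1.0, 3.0]]
beta_step_demo = spca_soft_threshold_step(gram_demo, [1.0, 0.0], lam1=3.0)
```

When p is very large, forming X^T X explicitly is wasteful; computing X^T (X α_j) gives the same vector with two matrix-vector products.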
Pitprops Data: 180 observations with 13 measured variables. A classic example showing the difficulty of interpreting PCs. Set λ = 0 and λ1 = (0.06, 0.16, 0.1, 0.5, 0.5, 0.5), chosen so that each sparse approximation explains almost the same amount of variance as the ordinary PC.
Pitprops Data: The PCs from SPCA account for 75.8% of the variance, while those from SCoTLASS account for 69.3%. SPCA is also sparser. SPCA was completed in seconds. In terms of explained variance, SCoTLASS, simple thresholding, and SPCA are increasingly better, in that order.
Synthetic Data: Three hidden factors with 10 observable variables. The exact covariance matrix was used to perform PCA, SPCA, and simple thresholding. There should be a "correct" sparse representation because of the way the data were generated. SPCA and SCoTLASS produce the ideal sparse PCs; both use the lasso penalty. Simple thresholding misidentified the most important variables, and its explained variance is lower than that of SPCA.
Ramaswamy Data: p = 16,063 genes and n = 144 samples. The goal is to find the set of genes that are biologically relevant to the outcome. PCA has been popular for this analysis. If a sparse principal component can explain a large part of the total variance of gene expression levels, then the subset of genes representing that principal component is considered important. SPCA is applied with λ = ∞.
Ramaswamy Data: SCoTLASS cannot be used to find sparse PCs for data of this size. Simple thresholding always explains slightly higher variance than SPCA does for the same number of genes. About 2% of the selected genes differ between the two methods, and this difference is consistent.
Discussion: A good method for achieving sparseness should possess three properties: (1) without any sparsity constraint, the method should reduce to PCA; (2) it should be computationally efficient for both small p and big p data; (3) it should avoid misidentifying the important variables. The simple thresholding approach is not criterion based and has properties 1 and 2. It serves as a benchmark for any potentially better method.
Discussion Continued: SCoTLASS: derives sparse loadings; not computationally efficient; lacks an adequate rule for choosing a tuning parameter; cannot be applied to gene expression arrays. SPCA: computationally efficient; high explained variance; identifies important variables.
Questions?