Principal Manifolds and Probabilistic Subspaces for Visual Recognition
Baback Moghaddam, TPAMI, June
John Galeotti, Advanced Perception, February 12, 2004
It’s all about subspaces
  Traditional subspaces
    PCA
    ICA
    Kernel PCA (& neural network NLPCA)
  Probabilistic subspaces
Linear PCA
  We already know this
  Main properties
    Approximate reconstruction: x ≈ Φy
    Orthonormality of the basis Φ: Φ^T Φ = I
    Decorrelated principal components: E{y_i y_j} = 0 for i ≠ j
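The three PCA properties can be checked numerically. The sketch below is only an illustration on made-up toy data; the names `Phi`, `Y`, and `X_hat`, the dimension `M = 3`, and the random data are assumptions, not anything taken from the paper.

```python
import numpy as np

# Toy correlated data: rows are samples (assumed, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)                      # centre the data

# Eigendecomposition of the sample covariance, sorted by decreasing eigenvalue
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

M = 3                                        # number of principal components kept
Phi = eigvecs[:, :M]                         # N x M basis
Y = Xc @ Phi                                 # principal components, y = Phi^T x
X_hat = Y @ Phi.T                            # approximate reconstruction, x ≈ Phi y

# Orthonormal basis: Phi^T Phi = I
orthonormal = np.allclose(Phi.T @ Phi, np.eye(M))

# Decorrelated components: covariance of y is diagonal
decorrelated = np.allclose(np.cov(Y, rowvar=False), np.diag(eigvals[:M]))

# Mean squared reconstruction error is governed by the discarded eigenvalues
residual = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))
print(orthonormal, decorrelated, residual)
```

The residual is what the later slides call the PCA reconstruction error; here it equals the sum of the discarded eigenvalues up to the (n-1)/n factor in the sample covariance.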
Linear ICA
  Like PCA, but the components’ distribution is designed to be sub/super Gaussian, for statistical independence
  Main properties
    Approximate reconstruction: x ≈ Ay
    Nonorthogonality of the basis A: A^T A ≠ I
    Near factorization of the joint distribution P(y): P(y) ≈ ∏ p(y_i)
Nonlinear PCA (NLPCA)
  AKA principal curves
  Essentially nonlinear regression
  Finds a curved subspace passing “through the middle of the data”
Nonlinear PCA (NLPCA)
  Main properties
    Nonlinear projection: y = f(x)
    Approximate reconstruction: x ≈ g(y)
    No prior knowledge regarding joint distribution of the components (typical): P(y) = ?
  Two main methods
    Neural network encoder
    Kernel PCA (KPCA)
NLPCA neural network encoder
  Trained to match the output to the input
  Uses a “bottleneck” layer to force a lower-dimensional representation
KPCA
  Similar to kernel-based nonlinear SVM
  Maps data to a higher dimensional space in which linear PCA is applied
    Nonlinear input mapping Ψ(x): R^N → R^L, N < L
    Covariance is computed with dot-products
    For economy, make Ψ(x) implicit: k(x_i, x_j) = (Ψ(x_i) · Ψ(x_j))
KPCA
  Does not require nonlinear optimization
  Is not subject to overfitting
  Requires no prior knowledge of network architecture or number of dimensions
  Requires the (unprincipled) selection of an “optimal” kernel and its parameters
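A minimal sketch of the kernel trick behind KPCA, assuming a Gaussian kernel: the feature-space mapping is never formed, only the kernel matrix of dot-products. The function names, the width `sigma`, and the toy data are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all row pairs of A and B."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2 * sigma ** 2))

def kpca_fit(X, n_components, sigma):
    """Return the nonlinear principal components of the training rows of X."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)

    # Centre the data in feature space using only the kernel matrix
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one

    # Linear PCA in feature space becomes an eigenproblem on the centred kernel matrix
    w, v = np.linalg.eigh(Kc)
    idx = np.argsort(w)[::-1][:n_components]
    w, v = w[idx], v[:, idx]

    # Scale coefficients so the implicit feature-space eigenvectors have unit norm
    alphas = v / np.sqrt(w)
    return Kc @ alphas

# Toy data (assumed): 60 samples in 4 dimensions
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
Y = kpca_fit(X, n_components=3, sigma=2.0)
print(Y.shape)
```

Only an eigendecomposition is needed, which matches the "no nonlinear optimization" point, while `sigma` is exactly the kind of kernel parameter that has to be picked by hand.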
Nearest-neighbor recognition
  Find labeled image most similar to N-dim input vector using a suitable M-dim subspace
  Similarity ex: S(I_1, I_2) ∝ ||∆||^-1, ∆ = I_1 - I_2
  Observation: Two types of image variation
    Critical: Images of different objects
    Incidental: Images of same object under different lighting, surroundings, etc.
  Problem: Preceding subspace projections do not help distinguish variation type when calculating similarity
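As a rough sketch of the baseline matcher, the function below projects the query and the labeled images into a subspace and returns the label with the smallest difference norm. The function name, the optional `basis` argument, and the tiny gallery are hypothetical, chosen only for illustration.

```python
import numpy as np

def nearest_neighbor(query, gallery, labels, basis=None):
    """Return (label, similarity) of the gallery row closest to the query.

    basis is an optional N x M matrix; when given, matching is done on the
    M-dimensional projections instead of the raw N-dimensional vectors.
    """
    if basis is not None:
        query, gallery = query @ basis, gallery @ basis
    dists = np.linalg.norm(gallery - query, axis=1)   # ||delta|| for each labeled image
    best = int(np.argmin(dists))
    similarity = 1.0 / max(dists[best], 1e-12)        # S proportional to 1 / ||delta||
    return labels[best], similarity

# Toy gallery (assumed): two labeled "images" with two pixels each
gallery = np.array([[0.0, 0.0], [10.0, 10.0]])
labels = ["a", "b"]
label, sim = nearest_neighbor(np.array([9.0, 9.5]), gallery, labels)
print(label, sim)
```

This distance treats every direction of ∆ the same way, so it cannot tell a lighting change from a change of identity.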
Probabilistic similarity
  Similarity based on probability that ∆ is characteristic of incidental variations
    ∆ = image-difference vector (N-dim)
    Ω_I = incidental (intrapersonal) variations
    Ω_E = critical (extrapersonal) variations
Probabilistic similarity
  Likelihoods P(∆|Ω) estimated using subspace density estimation
  Priors P(Ω) are set to reflect specific operating conditions (often uniform)
  Two images are of the same object if P(Ω_I|∆) > P(Ω_E|∆), i.e. S(∆) > 0.5, where S(∆) = P(Ω_I|∆)
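With two classes, the posterior follows from Bayes' rule applied to the two likelihoods and the priors. The sketch below is just that arithmetic; the function name and the example likelihood values are made up for illustration.

```python
def posterior_similarity(lik_intra, lik_extra, prior_intra=0.5):
    """S = P(Omega_I | delta) from the two class likelihoods and the priors.

    lik_intra = P(delta | Omega_I), lik_extra = P(delta | Omega_E).
    The two priors are assumed to sum to one.
    """
    prior_extra = 1.0 - prior_intra
    numerator = lik_intra * prior_intra
    return numerator / (numerator + lik_extra * prior_extra)

# Hypothetical likelihood values with uniform priors
s = posterior_similarity(lik_intra=0.03, lik_extra=0.01)
same_object = s > 0.5          # MAP decision: intrapersonal is the more probable class
print(s, same_object)
```

Because the two posteriors sum to one, the threshold of 0.5 is the same test as comparing them directly.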
Subspace density estimation
  Necessary for each P(∆|Ω), Ω ∈ {Ω_I, Ω_E}
  Perform PCA on training-sets of ∆ for each Ω
    The covariance matrix (∑) will define a Gaussian
  Two subspaces:
    F = M-dimensional principal subspace of ∑
    F̄ = non-principal subspace orthogonal to F
  y_i = ∆ projected onto principal eigenvectors
  λ_i = ranked eigenvalues
    Non-principal eigenvalues are typically unknown and are estimated by fitting a function of the form f^-n to the known eigenvalues
Subspace density estimation
  ε^2(∆) = PCA residual (reconstruction error)
  ρ = density in non-principal subspace
    ≈ average of (estimated) F̄ eigenvalues
  P(∆|Ω) is marginalized into each subspace
    Marginal density is exact in F
    Marginal density is approximate in F̄
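One way to sketch the two-part density estimate: a full Gaussian over the M principal coefficients, times an isotropic Gaussian with a single variance ρ over the residual. The function names and toy data are assumptions. The sketch also assumes zero-mean difference vectors and enough samples that every eigenvalue is available, so ρ is a plain average rather than the fitted estimate the slides describe.

```python
import numpy as np

def fit_subspace_gaussian(deltas, M):
    """PCA on difference vectors (rows of deltas, assumed zero-mean).

    Returns the principal eigenvectors V (N x M), their eigenvalues lam,
    and rho, the average of the remaining (non-principal) eigenvalues.
    """
    cov = deltas.T @ deltas / deltas.shape[0]
    w, vecs = np.linalg.eigh(cov)
    order = np.argsort(w)[::-1]
    w, vecs = w[order], vecs[:, order]
    return vecs[:, :M], w[:M], w[M:].mean()

def log_density(delta, V, lam, rho):
    """log of the estimate of P(delta | Omega): exact in F, approximate in F-bar."""
    N, M = V.shape
    y = V.T @ delta                                   # principal coefficients y_i
    eps2 = max(delta @ delta - y @ y, 0.0)            # PCA residual eps^2(delta)
    log_p_F = (-0.5 * np.sum(y ** 2 / lam)
               - 0.5 * M * np.log(2 * np.pi)
               - 0.5 * np.sum(np.log(lam)))
    log_p_Fbar = -eps2 / (2 * rho) - 0.5 * (N - M) * np.log(2 * np.pi * rho)
    return log_p_F + log_p_Fbar

# Toy difference vectors (assumed): 6-dim, two dominant directions
rng = np.random.default_rng(0)
deltas = rng.normal(size=(2000, 6)) * np.array([3.0, 2.0, 0.5, 0.5, 0.5, 0.5])
V_demo, lam_demo, rho_demo = fit_subspace_gaussian(deltas, M=2)
print(log_density(deltas[0], V_demo, lam_demo, rho_demo))
```

Working with the log of the density avoids underflow when N is large. When the non-principal eigenvalues really are all equal to ρ, the estimate coincides with the full N-dimensional Gaussian.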
Efficient similarity computation
  After doing PCA, use a whitening transform to preprocess the labeled images into a single set of coefficients for each of the principal subspaces:
    y = Λ^(-1/2) V^T x
    where Λ and V are matrices of the principal eigenvalues and eigenvectors of either ∑_I or ∑_E
  At run time, apply the same whitening transform to the input image
Efficient similarity computation
  The whitening transform reduces the marginal Gaussian calculations in the principal subspaces F to simple Euclidean distances
  The denominators are easy to precompute
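The claim that whitening turns the Gaussian exponent into a Euclidean distance can be checked on toy numbers. In this sketch `W` stands for the whitening matrix built from the principal eigenvalues and eigenvectors; the data, dimensions, and variable names are assumptions for illustration.

```python
import numpy as np

# Toy difference vectors (assumed) used only to obtain eigenvalues and eigenvectors
rng = np.random.default_rng(1)
N, M = 8, 3
deltas = rng.normal(size=(400, N)) * np.linspace(3.0, 0.5, N)
cov = deltas.T @ deltas / len(deltas)
w, vecs = np.linalg.eigh(cov)
order = np.argsort(w)[::-1][:M]
lam, V = w[order], vecs[:, order]            # principal eigenvalues / eigenvectors

# Whitening transform: scale each principal coefficient by 1 / sqrt(eigenvalue)
W = (lam ** -0.5)[:, None] * V.T             # M x N

# Two "images": one preprocessed offline, one whitened at run time
x1, x2 = rng.normal(size=N), rng.normal(size=N)
c1, c2 = W @ x1, W @ x2

# Squared Euclidean distance between whitened coefficients ...
euclid = np.sum((c1 - c2) ** 2)

# ... equals the Gaussian exponent sum(y_i^2 / lambda_i) for delta = x1 - x2
y = V.T @ (x1 - x2)
mahalanobis = np.sum(y ** 2 / lam)
print(euclid, mahalanobis)
```

Since the transform is linear, whitening each image separately and then subtracting gives the same result as whitening the difference, which is what allows the labeled images to be preprocessed once.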
Efficient similarity computation
  Further speedup can be gained by using a maximum likelihood (ML) rule instead of a maximum a posteriori (MAP) rule: rank matches by the intrapersonal likelihood P(∆|Ω_I) alone
  Typically, ML is only a few percent less accurate than MAP, but ML is twice as fast
    In general, Ω_E seems less important than Ω_I
Similarity Comparison
  [Figure: Eigenface (PCA) Similarity vs. Probabilistic Similarity]
Experiments
  21x12 low-res faces, aligned and normalized
  5-fold cross validation
    ~140 unique individuals per subset
    No overlap of individuals between subsets to test generalization performance
    80% of the data only determines subspace(s)
    20% of the data is divided into labeled images and query images for nearest-neighbor testing
  Subspace dimensions = d = 20
    Chosen so PCA ~ 80% accurate
Experiments
  KPCA
    Empirically tweaked Gaussian, polynomial, and sigmoidal kernels
    Gaussian kernel performed the best, so it is used in the comparison
  MAP
    Even split of the 20 subspace dimensions
    M_E = M_I = d/2 = 10 so that M_E + M_I = 20
Results
  [Table: Recognition accuracy (percent), including N-Dimensional Nearest Neighbor (no subspace)]
Results
  [Figure: Recognition accuracy vs subspace dimensionality]
  Note: data split 50/50 for training/testing rather than using CV
Conclusions
  Bayesian matching outperforms all other tested methods and even achieves ≈ 90% accuracy with only 4 projections (2 for each class of variation)
  Bayesian matching is an order of magnitude faster to train than KPCA
  Bayesian superiority with higher resolution images verified in independent US Army FERET tests
  Wow!
  You should use this
My results
  50% Accuracy
  Why so bad?
    I implemented all suggested approximations
    Poor data (hand registered)
    Too little data
  Note: data split 50/50 for training/testing rather than using CV
My results
  [Figure: My data vs. His data]