Effective Dimension Reduction with Prior Knowledge
Haesun Park
Division of Computational Science and Eng., College of Computing
Georgia Institute of Technology, Atlanta, GA
Joint work with Barry Drake, Peg Howland, Hyunsoo Kim, and Cheonghee Park
DIMACS, May 2007
Dimension Reduction
Dimension reduction for clustered data:
  - Linear Discriminant Analysis (LDA)
  - Generalized LDA (LDA/GSVD, regularized LDA)
  - Orthogonal Centroid Method (OCM)
Dimension reduction for nonnegative data:
  - Nonnegative Matrix Factorization (NMF)
Applications: text classification, face recognition, fingerprint classification, gene clustering in microarray analysis, ...
2D Representation: Utilize Cluster Structure if Known
[Figure: 2D representation of 150x1000 data with 7 clusters, LDA vs. SVD]
Dimension Reduction for Clustered Data: Measure for Cluster Quality
A = [a_1, ..., a_n] : m x n, clustered data
N_i = items in class i, |N_i| = n_i, total r classes
c_i = centroid of class i, c = global centroid
S_b = sum_{1<=i<=r} sum_{j in N_i} (c_i - c)(c_i - c)^T
S_w = sum_{1<=i<=r} sum_{j in N_i} (a_j - c_i)(a_j - c_i)^T
S_t = sum_{1<=i<=n} (a_i - c)(a_i - c)^T
S_w + S_b = S_t
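A minimal numpy sketch of these definitions (the function name and interface are illustrative, not from the talk); it builds S_b, S_w, and S_t from column-wise data and class labels:

```python
import numpy as np

def scatter_matrices(A, labels):
    """Illustrative sketch: between-class (Sb), within-class (Sw), and
    total (St) scatter matrices for data A (m x n, one item per column)."""
    m, n = A.shape
    c = A.mean(axis=1, keepdims=True)            # global centroid
    Sb = np.zeros((m, m))
    Sw = np.zeros((m, m))
    for i in np.unique(labels):
        Ai = A[:, labels == i]                   # the n_i items in class i
        ci = Ai.mean(axis=1, keepdims=True)      # class centroid c_i
        Sb += Ai.shape[1] * (ci - c) @ (ci - c).T
        Sw += (Ai - ci) @ (Ai - ci).T
    St = (A - c) @ (A - c).T                     # satisfies Sw + Sb == St
    return Sb, Sw, St
```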
Optimal Dimension Reducing Transformation
y : m x 1  ->  G^T y : q x 1, q << m, with G^T : q x m, G^T G = I
High-quality clusters have small trace(S_w) and large trace(S_b)
Want G s.t. trace(G^T S_w G) is minimized and trace(G^T S_b G) is maximized:
  max trace((G^T S_w G)^{-1} (G^T S_b G))   LDA (Fisher '36, Rao '48)
  max trace(G^T S_b G)                      Orthogonal Centroid (Park et al. '03)
  max trace(G^T (S_w + S_b) G)              PCA (Pearson 1901, Hotelling '33)
  max trace(G^T A A^T G)                    LSI (Deerwester et al. '90)
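As a quick illustration of these criteria (a hypothetical helper, assuming the scatter matrices from the previous slide), the two traces can be evaluated for any candidate G with orthonormal columns:

```python
import numpy as np

def trace_criteria(G, Sb, Sw):
    """Evaluate the cluster-quality traces after reduction by G^T:
    a good G gives large trace(G^T Sb G) and small trace(G^T Sw G)."""
    return np.trace(G.T @ Sb @ G), np.trace(G.T @ Sw @ G)
```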
Classical LDA (Fisher '36, Rao '48)
max trace((G^T S_w G)^{-1} (G^T S_b G))
G : leading (r-1) eigenvectors of S_w^{-1} S_b
S_b = H_b H_b^T,  H_b = [sqrt(n_1)(c_1 - c), ..., sqrt(n_r)(c_r - c)] : m x r
S_w = H_w H_w^T,  H_w = [a_1 - c_1, a_2 - c_1, ..., a_n - c_r] : m x n
Fails when m > n (undersampled problem), since S_w is singular
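When S_w is nonsingular, the classical solution is a symmetric-definite generalized eigenproblem; a scipy sketch (illustrative, not the talk's code):

```python
import numpy as np
from scipy.linalg import eigh

def classical_lda(Sb, Sw, r):
    """Classical LDA: G = leading (r-1) eigenvectors of Sw^{-1} Sb,
    via the generalized eigenproblem Sb x = lambda Sw x.
    Requires Sw nonsingular, i.e., fails in the undersampled case."""
    evals, evecs = eigh(Sb, Sw)        # eigenvalues in ascending order
    return evecs[:, -(r - 1):]         # eigenvectors of the r-1 largest
```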
LDA based on GSVD (LDA/GSVD)
(Howland, Jeon, Park, SIMAX '03; Howland and Park, IEEE TPAMI '04)
S_w^{-1} S_b x = λx  <=>  S_b x = λ S_w x  <=>  β² H_b H_b^T x = α² H_w H_w^T x
GSVD of the pair (H_b^T, H_w^T):
  U^T H_b^T X = (Σ_b 0),  V^T H_w^T X = (Σ_w 0)
  X^T S_b X = diag(I, D_b, 0, 0),  X^T S_w X = diag(0, D_w, I, 0), with D_b² + D_w² = I
so X simultaneously diagonalizes both: X^T H_b H_b^T X = X^T S_b X and X^T H_w H_w^T X = X^T S_w X
Classical LDA is a special case of LDA/GSVD
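A simplified numpy reading of the LDA/GSVD construction (one SVD of the stacked matrix K = [H_b^T; H_w^T], then an SVD of a submatrix of P); this is a sketch of the algorithm in the SIMAX '03 paper, not the authors' code, and it omits the careful rank-tolerance handling a production version needs:

```python
import numpy as np

def lda_gsvd(Hb, Hw, r):
    """Sketch of LDA/GSVD: from Hb (m x r) and Hw (m x n), compute X
    from the GSVD of (Hb^T, Hw^T) and return its leading r-1 columns."""
    K = np.vstack([Hb.T, Hw.T])                     # (r + n) x m
    P, s, Qt = np.linalg.svd(K)                     # K = P diag(s) Qt
    t = int((s > s[0] * max(K.shape) * np.finfo(float).eps).sum())  # rank(K)
    U, sb, Wt = np.linalg.svd(P[:r, :t])            # SVD of the top block of P
    X1 = Qt[:t, :].T @ np.diag(1.0 / s[:t]) @ Wt.T  # first t columns of X
    return X1[:, :r - 1]                            # G : m x (r - 1)
```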
Generalization of LDA for Undersampled Problems
Regularized LDA (Friedman '89, Zhao et al. '99, ...) -- see the sketch after this slide
LDA/GSVD: solution G = [X_1 X_2] (Howland, Jeon, Park '03)
Solutions based on null(S_w) and range(S_b) (Chen et al. '00, Yu & Yang '01, Park & Park '03, ...)
Two-stage methods:
  Face recognition: PCA + LDA (Swets & Weng '96, Zhao et al. '99)
  Information retrieval: LSI + LDA (Torkkola '01)
Mathematical equivalence (Howland and Park '03):
  PCA + LDA/GSVD = LDA/GSVD
  LSI + LDA/GSVD = LDA/GSVD
  More efficient: QRD + LDA/GSVD
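Among these generalizations, regularized LDA is the simplest to sketch: shift S_w by λI so the eigenproblem is well posed even when S_w is singular (the default λ below is a hypothetical choice, normally tuned by cross-validation):

```python
import numpy as np
from scipy.linalg import eigh

def regularized_lda(Sb, Sw, r, lam=1e-3):
    """Regularized LDA sketch: replace the (possibly singular) Sw with
    Sw + lam*I, then solve the generalized eigenproblem as in classical LDA."""
    evals, evecs = eigh(Sb, Sw + lam * np.eye(Sw.shape[0]))
    return evecs[:, -(r - 1):]
```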
QRD Preprocessing in Dim. Reduction (Distance-Preserving Dim. Reduction)
For undersampled data A : m x n, m >> n:
  A = [Q_1 Q_2] (R over 0) = Q_1 R,   Q_1 : orthonormal basis for span(A)
Dimension reduction of A by Q_1^T:  Q_1^T A = R : n x n
Q_1^T preserves the L_2 distance:
  ||a_i||_2 = ||Q_1^T a_i||_2,  ||a_i - a_j||_2 = ||Q_1^T (a_i - a_j)||_2
and the cosine distance:  cos(a_i, a_j) = cos(Q_1^T a_i, Q_1^T a_j)
Applicable to PCA, LDA, LDA/GSVD, Isomap, LTSA, LLE, ...
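A small numpy check of the distance-preservation claim (random data standing in for a real term-document matrix):

```python
import numpy as np

A = np.random.rand(10000, 150)            # undersampled: m >> n
Q1, R = np.linalg.qr(A, mode='reduced')   # A = Q1 R, Q1 : m x n orthonormal
# R = Q1^T A is the reduced (n x n) data; column i of R is Q1^T a_i.
i, j = 0, 1
assert np.isclose(np.linalg.norm(A[:, i] - A[:, j]),
                  np.linalg.norm(R[:, i] - R[:, j]))   # L2 distance preserved
cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(cos(A[:, i], A[:, j]), cos(R[:, i], R[:, j]))  # cosine too
```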
Speed Up with QRD Preprocessing (computation time)
[Table: computation times of LDA/GSVD, regularized LDA, QR + LDA/GSVD, and QR + regularized LDA (columns: data set, dimension, # classes r) on the Text (5896 x ...), Yale, AT&T, Feret (3000 x ...), OptDigit (64 x ...), and Isolet (617 x ...) data sets]
Text Classification with Dim. Reduction
Classification accuracy (%); similarity measures: L_2 norm and cosine (Kim, Howland, Park, JMLR '03)
Medline data (1250 items, 5 clusters) and Reuters data (9579 items, 90 clusters)
[Table: accuracy of centroid (L_2), centroid (cosine), nn (L_2), nn (cosine), and SVM classifiers in the full space vs. after OCM and LDA/GSVD dimension reduction]
Face Recognition on Yale Data (C. Park and H. Park, ICDM '04)
Yale Face Database: 243 x 320 pixels = 77760 full dimension, 11 images/person x 15 people = 165 images
After preprocessing (avg 3x3): 8586 x 165
Prediction accuracy in %, leave-one-out (and average of 100 random splits)
[Table: kNN (k = 1, 5, 9) accuracy for Full Space, LDA/GSVD (90), Regularized LDA (85), Proj. to null(S_w) (84) (Chen et al. '00), and Transf. to range(S_b) (82) (Yu & Yang '01)]
Fingerprint Classification (C. Park and H. Park, Pattern Recognition, 2005)
Results on the NIST fingerprint database: fingerprint images of size 512 x 512
KDA/GSVD: nonlinear extension of LDA/GSVD based on kernel functions
By KDA/GSVD, dimension reduced from 105 x 105 to 4
[Table: accuracy vs. rejection rate (%) for KDA/GSVD (kNN & NN), Jain et al., and Yao et al. (SVM)]
Nonnegativity Preserving Dim. Reduction
Nonnegative Matrix Factorization (Paatero & Tapper '94, Lee & Seung, Nature '99, Pauca et al., SIAM DM '04, Hoyer '04, Lin '05, Berry '06, Kim and Park '06, ...)
Given A : m x n with A >= 0 and k << min(m, n), find W : m x k and H : k x n with W >= 0 and H >= 0 s.t. A ≈ WH:  min ||A - WH||_F
NMF/ANLS: two-block coordinate descent method in bound-constrained optimization (Kim and Park, Bioinformatics, to appear)
Iterate the following ANLS:
  fixing W, solve min_{H >= 0} ||WH - A||_F
  fixing H, solve min_{W >= 0} ||H^T W^T - A^T||_F
Any limit point is a stationary point (Grippo and Sciandrone '00)
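A minimal sketch of the ANLS iteration using scipy's NNLS solver on each column/row subproblem (names and defaults here are illustrative; the block methods in the references above are much faster than this column-by-column loop):

```python
import numpy as np
from scipy.optimize import nnls

def nmf_anls(A, k, n_iter=50, seed=0):
    """NMF by alternating nonnegative least squares (two-block
    coordinate descent): A (m x n, A >= 0) ~ W (m x k) @ H (k x n)."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = np.zeros((k, n))
    for _ in range(n_iter):
        for j in range(n):                 # fix W: min_{H>=0} ||W H - A||_F
            H[:, j], _ = nnls(W, A[:, j])
        for i in range(m):                 # fix H: min_{W>=0} ||H^T W^T - A^T||_F
            W[i, :], _ = nnls(H.T, A[i, :])
    return W, H
```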
Nonnegativity Constraints?
Better approximation vs. better representation/interpretation
Given A : m x n and k < min(m, n):
  SVD: best approximation. min ||A - WH||_F with A = U Σ V^T, A ≈ U_k Σ_k V_k^T
  NMF: better representation/interpretation? min ||A - WH||_F, W >= 0, H >= 0
Nonnegativity constraints are physically meaningful: pixels in digital images, molecule concentrations in bioinformatics, signal intensities, visualization, ...
Interpretation of analysis results: nonsubtractive combinations of nonnegative basis vectors
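The contrast is easy to see numerically: the truncated SVD is the best rank-k approximation in the Frobenius norm, but its factors have mixed signs, while an NMF keeps every factor nonnegative (random nonnegative data used purely for illustration):

```python
import numpy as np

A = np.random.rand(200, 50)                      # nonnegative data
k = 5
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] * s[:k] @ Vt[:k, :]               # best rank-k approximation
print(np.linalg.norm(A - A_k, 'fro'))            # minimal over all rank-k B
print((U[:, :k] < 0).any(), (Vt[:k] < 0).any())  # SVD factors mix signs
```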
Performance of NMF Algorithms
[Figure: relative residuals vs. number of iterations for NMF/ANLS, NMF/MUR, and NMF/ALS on a zero-residual artificial problem, A : 200 x 50]
Recovery of Factors by SVD and NMF
[Figure: recovery of the factors W and H by SVD and NMF/ANLS; A : 2500 x 28, W : 2500 x 3, H : 3 x 28, where A = W*H]
Summary
Effective algorithms for dimension reduction and matrix decompositions that exploit prior knowledge
Design of new algorithms, e.g., for undersampled data
Take advantage of prior knowledge for physically more meaningful modeling
Storage and efficiency issues for massive scale data
Adaptive algorithms
Applicable to a wide range of problems (text classification, face recognition, fingerprint classification, gene class discovery in microarray data, protein secondary structure prediction, ...)
Thank you!