Variational Graph Embedding for Globally and Locally Consistent Feature Extraction
Shuang Hong Yang, Hongyuan Zha (College of Computing, Georgia Tech, USA); S. Kevin Zhou (Siemens Corporate Research Inc., USA); Bao-Gang Hu (Chinese Academy of Sciences, China). Presented by Yang on 09/08/2009 at ECML PKDD 2009.
Motivation No. 1: Graph-Based Learning
Learning by exploiting relationships in data: kernel / (dis-)similarity learning; metric learning; dimensionality reduction, manifold learning; spectral clustering; semi-supervised classification / clustering; relational learning; collaborative filtering / co-clustering; ...
Data graph: G = (V, E, W). V: node set, where each node v corresponds to a data point x; E: edge set; W: edge weights, w_ij = w(x_i, x_j).
Where does the graph come from? Existing graph construction is either prior-driven (prior knowledge, side information, etc.) or naive (Gaussian similarity, ε-ball, kNN, b-matching, etc.; a minimal construction sketch follows below).
GBL is very sensitive to the graph structure and edge weights. Q1: How can we construct a reliable and effective graph?
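As a concrete reference point for the "naive" constructions listed above, the following is a minimal sketch (not the paper's method) of a kNN graph with Gaussian edge weights; the function name and parameters are illustrative.

```python
import numpy as np

def knn_gaussian_graph(X, k=5, sigma=1.0):
    """Build a symmetric kNN graph with Gaussian (heat-kernel) edge weights.

    X     : (n, d) array of data points.
    k     : number of nearest neighbours per node.
    sigma : bandwidth of the Gaussian similarity.
    Returns the (n, n) weight matrix W with
    w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for connected pairs, 0 otherwise.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, np.inf)          # exclude self-loops

    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[:k]      # k nearest neighbours of node i
        W[i, nbrs] = np.exp(-D2[i, nbrs] / (2.0 * sigma**2))
    return np.maximum(W, W.T)             # symmetrise the graph
```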
Motivation No. 2: Feature Extraction
Goal: find a mapping such that the high-dimensional data x is compressed to a low-dimensional representation z.
Existing FE methods fall into one of two classes (one representative of each is sketched below):
- Global/statistical: optimize globally defined statistical measures (variance, entropy, correlation, Fisher information, etc.). E.g., PCA, FDA and other classical methods. Good at preserving global properties, but suboptimal when the underlying assumptions are violated (e.g., multi-modality).
- Local/geometric: preserve the local geometric (submanifold) structure from which the data are sampled. E.g., ISOMAP, LLE, Laplacian Eigenmap, LPP. Good at preserving local/geometric structure, but neglect the overall properties of the data, e.g., relevance to the label, variance, redundancy.
Q2: Is it possible to combine the pros of the two classes into one framework?
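To ground the two families above, here is a minimal sketch (plain NumPy/SciPy, not the paper's code) of one representative of each: PCA as a global/statistical method that maximizes variance, and LPP as a local/geometric method that preserves a neighborhood graph W, e.g. the kNN graph built in the earlier sketch.

```python
import numpy as np
from scipy.linalg import eigh

def pca(X, dim):
    """Global/statistical: directions of maximal variance, i.e. the top
    eigenvectors of the sample covariance matrix."""
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / (len(X) - 1)
    vals, vecs = eigh(C)                     # eigenvalues in ascending order
    return vecs[:, ::-1][:, :dim]            # top-`dim` principal directions

def lpp(X, W, dim):
    """Local/geometric: Locality Preserving Projections. Given a graph weight
    matrix W, find projections a minimising a^T X^T L X a subject to
    a^T X^T D X a = 1, i.e. the smallest generalised eigenvectors."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                     # graph Laplacian
    A = X.T @ L @ X
    B = X.T @ D @ X + 1e-9 * np.eye(X.shape[1])   # regularise for numerical stability
    vals, vecs = eigh(A, B)                       # generalised eigenproblem
    return vecs[:, :dim]                          # smallest eigenvalues preserve locality
```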
This Paper
- A well-defined graph can be constructed from theoretically justified learning measures.
- Taking feature extraction as an example graph-based learning task, and Mutual Information (MI) and Bayes Error Rate (BER) as example measures, we show that both the global/statistical and the local/geometric structure of the data can be captured with a single objective, leading to a high-quality feature learner with various appealing properties.
- Algorithm: Variational Graph Embedding, which iterates between graph learning and spectral graph embedding.
Variational Graph Embedding for FE
Outline
- Motivation
- Variational Graph Embedding for FE
  - Theoretically Optimal Measures: MI and BER
  - Variational Graph Embedding (MI)
  - Variational Graph Embedding (BER)
- Experiments
Optimal Feature Extraction
Theoretically optimal measures for feature learning:
- Mutual Information (MI): accounts for high-order statistics, i.e., the complete dependency between y and z.
- Bayes Error Rate (BER): directly maximizes the discrimination/generalization ability.
Practically prohibitive, however:
- Estimation: both involve unknown distributions and numerical integration over high-dimensional data.
- Optimization: coupled variables, complicated objective.
The standard definitions of both measures are recalled below.
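For reference, the two measures have the standard definitions below, written here for a continuous extracted feature z and a discrete class label y (the notation is mine, not taken from the slides):

```latex
% Mutual information between the extracted feature z and the label y
I(z; y) \;=\; \sum_{y} \int p(z, y)\,
      \log \frac{p(z, y)}{p(z)\,p(y)} \, dz

% Bayes error rate of predicting y from z
\mathrm{BER}(z) \;=\; 1 - \mathbb{E}_{p(z)}\!\left[ \max_{y} P(y \mid z) \right]
              \;=\; 1 - \int \max_{y}\, p(z, y)\, dz
```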
Our Solution: Two Steps That Make Things Easy
- Nonparametric estimation: eliminates the numerical integration over unknown high-dimensional distributions and reduces the task to a kernel-based optimization.
- Variational kernel approximation: turns the complex optimization into variational graph embedding.
The resulting algorithm is an EM-style iteration between learning a graph (optimizing the variational parameters) and embedding the learned graph (optimizing the learner parameters); a skeleton of this loop is sketched below.
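The following is a high-level skeleton of that alternation, written to illustrate the control flow only; the E-step and M-step bodies are illustrative placeholders (a plain Gaussian affinity and an LPP-style spectral step), not the paper's update equations. In particular, the paper's E-step also uses the class labels.

```python
import numpy as np
from scipy.linalg import eigh

def variational_graph_embedding(X, dim, n_iter=20, sigma=1.0):
    """EM-style alternation between graph learning and graph embedding.

    E-step: re-estimate the variational parameters, which act as edge weights
            of a data graph, from the current embedding.
    M-step: re-embed the data by spectral analysis of the learned graph.
    Both step bodies below are placeholders, not the paper's exact updates.
    """
    A = np.linalg.qr(np.random.randn(X.shape[1], dim))[0]    # random initial projection
    for _ in range(n_iter):
        Z = X @ A                                            # current low-dimensional embedding
        # ---- E-step: learn a weighted graph from the embedding ----
        d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / (2.0 * sigma ** 2))                 # Gaussian affinities as placeholder weights
        # ---- M-step: spectral (LPP-style) embedding of the learned graph ----
        D = np.diag(W.sum(axis=1))
        L = D - W                                            # graph Laplacian
        _, vecs = eigh(X.T @ L @ X,
                       X.T @ D @ X + 1e-9 * np.eye(X.shape[1]))
        A = vecs[:, :dim]                                    # smallest generalised eigenvectors
    return A
```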
Variational Graph Embedding: MI
1. Max nonparametric quadratic MI:
- 1.1 Replace the Shannon entropy with Rényi's quadratic entropy.
- 1.2 Kernel density estimation with an isotropic Gaussian kernel.
This leaves a big sum of nonconvex kernel terms to optimize!
2. Max variational nonparametric QMI:
- Replace each kernel term with a variational kernel term (Jaakkola-Jordan variational lower bound for the exp function).
Variational Graph Embedding (MI):
- E-step: optimize the variational parameters (equivalent to learning a weighted graph).
- M-step: linearly embed the graph (can be solved by spectral analysis in linear or kernel space).
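To unpack steps 1 and 2, the standard ingredients are Rényi's quadratic entropy, a Gaussian KDE plug-in (which turns the entropy integral into a double sum of Gaussian kernel terms), and a variational lower bound that makes each kernel term linear in the pairwise distance. The tangent bound shown last is one standard such bound; the exact parameterization used in the paper may differ, so treat this as an illustration.

```latex
% Renyi's quadratic entropy, replacing the Shannon entropy
H_2(z) \;=\; -\log \int p(z)^2 \, dz

% Gaussian KDE plug-in: the entropy integral becomes a double sum of kernel terms
\hat p(z) \;=\; \frac{1}{n} \sum_{i=1}^{n}
      \mathcal{N}\!\left(z \,\middle|\, z_i, \sigma^2 I\right)
\quad\Longrightarrow\quad
\int \hat p(z)^2 \, dz \;=\; \frac{1}{n^2} \sum_{i,j}
      \mathcal{N}\!\left(z_i \,\middle|\, z_j, 2\sigma^2 I\right)

% Tangent (first-order) lower bound on the exponential at a variational point xi:
% each Gaussian kernel term becomes linear in the squared distance,
% with the slope exp(xi_ij) acting as an edge weight
e^{x} \;\ge\; e^{\xi}\,(1 + x - \xi), \qquad
e^{-\|z_i - z_j\|^2 / (4\sigma^2)} \;\ge\;
e^{\xi_{ij}} \left( 1 - \tfrac{\|z_i - z_j\|^2}{4\sigma^2} - \xi_{ij} \right)
```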
Variational Graph Embedding: MI
Initial graph: approximate each kernel term by its first-order Taylor expansion.
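As an illustration of what such a first-order expansion looks like (the expansion point d_ij^0, e.g. the input-space pairwise distance, is my assumption; the slides do not specify it):

```latex
% First-order Taylor expansion of the kernel term e^{-d} around d = d_{ij}^{0}:
e^{-d_{ij}} \;\approx\; e^{-d_{ij}^{0}} \left( 1 - (d_{ij} - d_{ij}^{0}) \right),
\qquad d_{ij} = \frac{\|z_i - z_j\|^2}{4\sigma^2}
% Each kernel term is then linear in the pairwise distance, with the constant
% e^{-d_{ij}^{0}} playing the role of an initial edge weight w_{ij}^{0}.
```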
Variational Graph Embedding: MI
Justification:
- Max-relevance-min-redundancy, with a natural tradeoff
- Max discrimination
- Locality preserving
- Connection with LPP and FDA
Variational Graph Embedding: BER
The same recipe applies to the Bayes Error Rate: nonparametric estimation, then variational kernel approximation, yielding a variational graph embedding.
Variational Graph Embedding: BER
As in the MI case, the algorithm alternates between an E-step, optimizing the variational parameters (equivalent to learning a weighted graph), and an M-step, linearly embedding the graph (which can be solved by spectral analysis in linear or kernel space).
Experiments: Face Recognition
- Compared with both global/statistical methods (PCA/KPCA, LDA/KDA) and local/geometric methods (LPP/KLPP, MFA/KMFA), covering both supervised (LDA, MFA) and unsupervised (PCA, LPP) baselines.
- Three benchmark facial image sets (Yale, ORL and CMU PIE) are used. The images were taken in different environments, at different times, and with different poses, facial expressions and details. All raw images are normalized to 32x32.
- For each data set, we randomly select v images of each person as training data and leave the others for testing. Only the training data are used to learn features.
- To evaluate the different methods, the classification accuracy of a k-NN classifier on the testing data is used as the evaluation metric; a sketch of this protocol follows below.
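A sketch of this protocol, with a hypothetical fit_extractor callable standing in for the feature learner (illustrative only, not the authors' code):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def evaluate_feature_extractor(X, y, fit_extractor, v=5, k=1, seed=0):
    """Randomly keep `v` images per person for training, learn the feature
    mapping on the training split only, then report k-NN accuracy on the
    held-out images.

    fit_extractor(X_train, y_train) -> transform is a hypothetical callable
    that returns a function mapping raw images to extracted features.
    """
    rng = np.random.default_rng(seed)
    train_idx = []
    for person in np.unique(y):
        idx = np.flatnonzero(y == person)
        train_idx.extend(rng.choice(idx, size=v, replace=False))
    train_idx = np.array(train_idx)
    test_idx = np.setdiff1d(np.arange(len(y)), train_idx)

    transform = fit_extractor(X[train_idx], y[train_idx])    # features learned on training data only
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(transform(X[train_idx]), y[train_idx])
    return clf.score(transform(X[test_idx]), y[test_idx])    # classification accuracy on the test split
```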
On average, MIE improves over PCA by 36%, over FDA by 8%, and over MFA by 4%; BERE improves over PCA by 39%, over FDA by 11%, and over MFA by 6%. The improvements are even more significant in the kernel case: kMIE improves by (31%, 9%, 7%) and kBERE by (33%, 10%, 8%) over (KPCA, KDA, KMFA), respectively.
Discussion
1. It is possible to capture both the global/statistical and the local/geometric structure of the data in one well-defined objective.
2. Graph construction: what makes a good graph?
- Predictive: relevant to the target concept we are inferring.
- Locally and globally consistent: accounts for both the local and the global information revealed by the data, and strikes a natural tradeoff between them.
- Computationally convenient: easy and inexpensive to learn and to test.
A feasible graph-construction approach: nonparametric measure + variational kernel approximation. The same recipe could carry over to the other graph-based learning tasks listed in the motivation: kernel / (dis-)similarity learning; metric learning; dimensionality reduction; spectral clustering; semi-supervised classification / clustering; relational learning; collaborative filtering / co-clustering; ...
Thanks!