Variational Graph Embedding for Globally and Locally Consistent Feature Extraction

Shuang Hong Yang, Hongyuan Zha (College of Computing, Georgia Tech, USA)
S. Kevin Zhou (Siemens Corporate Research Inc., USA)
Bao-Gang Hu (Chinese Academy of Sciences, China)

Presented by Yang (shy@gatech.edu) on 09/08/2009 at ECML PKDD 2009

Motivation No. 1: Graph-Based Learning

Learning by exploiting relationships in data:
- Kernel / (dis-)similarity learning
- Metric learning
- Dimensionality reduction, manifold learning
- Spectral clustering
- Semi-supervised classification / clustering
- Relational learning
- Collaborative filtering / co-clustering
- ...

Data graph: G = (V, E, W)
- V: node set, where each node v corresponds to a data point x
- E: edge set
- W: edge weights, w_ij = w(x_i, x_j)

Where does the graph come from? Existing graph construction:
- Prior: prior knowledge, side information, etc.
- Naive: Gaussian similarity, ε-ball, kNN, b-matching, etc. (see the sketch below)

Graph-based learning is very sensitive to the graph structure and edge weights.
Q1: How can we construct a reliable and effective graph?
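As a concrete illustration of the "naive" constructions listed above, here is a minimal sketch (not the paper's method) of a kNN graph with Gaussian heat-kernel weights; the function name, bandwidth sigma, and neighborhood size k are my own illustrative choices.

```python
import numpy as np
from scipy.spatial.distance import cdist


def knn_gaussian_graph(X, k=5, sigma=1.0):
    """Symmetric kNN graph with Gaussian (heat-kernel) edge weights.

    X: (n, d) data matrix. Returns an (n, n) weight matrix W. This is the
    'naive' construction mentioned on the slide, not the learned graph.
    """
    n = X.shape[0]
    D2 = cdist(X, X, "sqeuclidean")              # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[1:k + 1]        # k nearest neighbors, excluding x_i itself
        W[i, nbrs] = np.exp(-D2[i, nbrs] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)                    # keep an edge if either endpoint chose it


if __name__ == "__main__":
    X = np.random.RandomState(0).randn(100, 10)
    W = knn_gaussian_graph(X, k=5, sigma=1.0)
    print(W.shape, int((W > 0).sum()))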

Motivation No. 2: Feature Extraction

Goal: find a mapping such that the high-dimensional data x is compressed to a low-dimensional representation z.

Existing FE methods fall into one of two classes:

Global / statistical:
- Optimize globally defined statistical measures (variance, entropy, correlation, Fisher information, etc.)
- E.g., PCA, FDA, and other classical methods
- Good at preserving global properties, but suboptimal when the underlying assumptions are violated (e.g., multi-modality)

Local / geometric:
- Preserve the local geometric (submanifold) structure from which the data are sampled
- E.g., ISOMAP, LLE, Laplacian Eigenmap, LPP
- Good at preserving local/geometric structure, but neglect overall properties of the data such as relevance to the label, variance, and redundancy

Q2: Is it possible to combine the advantages of both classes in one framework?
(A toy contrast between the two families is sketched below.)
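To make the contrast concrete, here is a small toy sketch (my illustration, not from the paper) of the two kinds of objectives: PCA keeps directions of maximum global variance, while a Laplacian-eigenmap-style embedding minimizes the weighted sum of squared distances over a neighborhood graph.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph

rng = np.random.RandomState(0)
X = rng.randn(200, 5)
X = X - X.mean(axis=0)                        # center the data

# Global / statistical: PCA keeps the directions of maximum variance.
C = X.T @ X / len(X)                          # sample covariance
_, evecs = eigh(C)                            # eigenvalues in ascending order
Z_pca = X @ evecs[:, -2:]                     # project onto the top-2 directions

# Local / geometric: Laplacian-eigenmap-style embedding of a kNN graph.
W = kneighbors_graph(X, n_neighbors=8).toarray()
W = np.maximum(W, W.T)                        # symmetrize the neighborhood graph
D = np.diag(W.sum(axis=1))
L = D - W                                     # unnormalized graph Laplacian
_, vecs = eigh(L, D)                          # generalized problem: L z = lambda D z
Z_le = vecs[:, 1:3]                           # skip the trivial constant eigenvector

print(Z_pca.shape, Z_le.shape)
```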

This Paper

Claim: a well-defined graph can be established from theoretically justified learning measures.

Taking feature extraction as an example graph-based learning task, with Mutual Information (MI) and Bayes Error Rate (BER) as example measures, we show that:
- Both the global/statistical and the local/geometric structure of the data can be captured by a single objective, leading to a high-quality feature learner with various appealing properties.
- Algorithm: Variational Graph Embedding, which iterates between graph learning and spectral graph embedding.

Outline

- Motivation
- Variational Graph Embedding for FE
  - Theoretically optimal measures: MI and BER
  - Variational Graph Embedding (MI)
  - Variational Graph Embedding (BER)
- Experiments

Optimal Feature Extraction

Theoretically optimal measures for feature learning:
- Mutual Information (MI): accounts for high-order statistics, i.e., the complete dependency between y and z.
- Bayes Error Rate (BER): directly optimizes the discrimination/generalization ability.

Practical prohibitions:
- Estimation: both involve unknown distributions and numerical integration over high-dimensional data.
- Optimization: coupled variables and a complicated objective.

Our Solution

Two steps make things easy:
1. Nonparametric estimation: eliminates the numerical integration over unknown high-dimensional distributions and reduces the task to a kernel-based optimization.
2. Variational kernel approximation: turns the complex optimization into variational graph embedding.

The resulting algorithm is an EM-style iteration between learning a graph (optimizing the variational parameters) and embedding the learned graph (optimizing the learner parameters); a schematic of this loop is sketched below.
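A schematic sketch of that alternation, with stand-in steps of my own: the E-step below uses simple class-masked heat-kernel weights and the M-step an LPP-style generalized eigenproblem, purely as placeholders for the paper's actual variational updates.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.spatial.distance import cdist


def learn_graph_weights(Z, y, sigma=1.0):
    """Stand-in E-step: heat-kernel weights between same-class points only.

    The paper's variational E-step is different; this placeholder just
    produces some label-aware weighted graph from the current features.
    """
    W = np.exp(-cdist(Z, Z, "sqeuclidean") / (2.0 * sigma ** 2))
    same_class = (y[:, None] == y[None, :]).astype(float)
    np.fill_diagonal(same_class, 0.0)           # no self-loops
    return W * same_class


def embed_graph(X, W, n_components, reg=1e-6):
    """Stand-in M-step: LPP-style linear spectral embedding of the graph,
    i.e. the smallest generalized eigenvectors of X^T L X a = lam X^T D X a."""
    D = np.diag(W.sum(axis=1))
    L = D - W                                   # graph Laplacian
    A_mat = X.T @ L @ X
    B_mat = X.T @ D @ X + reg * np.eye(X.shape[1])
    vals, vecs = eigh(A_mat, B_mat)             # ascending eigenvalues
    A = vecs[:, :n_components]
    return A, -vals[:n_components].sum()        # larger = less local distortion


def variational_graph_embedding(X, y, n_components=2, n_iter=10, tol=1e-6):
    """Schematic EM-style alternation between graph learning and embedding."""
    A = np.eye(X.shape[1])[:, :n_components]    # start from a coordinate projection
    prev_obj = -np.inf
    for _ in range(n_iter):
        Z = X @ A                               # current low-dimensional features
        W = learn_graph_weights(Z, y)           # "E-step": learn a weighted graph
        A, obj = embed_graph(X, W, n_components)  # "M-step": embed the graph
        if abs(obj - prev_obj) < tol:           # illustrative convergence check
            break
        prev_obj = obj
    return A


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 6) + 2.0, rng.randn(50, 6) - 2.0])
    y = np.array([0] * 50 + [1] * 50)
    print(variational_graph_embedding(X, y, n_components=2).shape)
```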

Variational Graph Embedding: MI

Step 1: Maximize nonparametric quadratic MI
- 1.1 Shannon entropy -> Renyi's quadratic entropy
- 1.2 Kernel density estimation with an isotropic Gaussian kernel
- Result: optimizing a big sum of nonconvex functions!

Step 2: Maximize variational nonparametric QMI
- Each kernel term -> a variational kernel term (Jaakkola-Jordan variational lower bound for the exponential function)

Variational Graph Embedding (MI):
- E-step: optimize the variational parameters (equivalent to learning a weighted graph)
- M-step: linearly embed the graph (solvable by spectral analysis in linear or kernel space)

A numerical illustration of the step-1 estimator follows.
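To unpack step 1 numerically, here is a minimal sketch of the standard information-theoretic-learning estimator (not necessarily the paper's exact objective): under an isotropic Gaussian KDE with bandwidth sigma, the "information potential", i.e. the integral of p_hat(z)^2, has the closed form (1/N^2) * sum_ij G(z_i - z_j; 2*sigma^2), so Renyi's quadratic entropy and the class-conditional terms entering a quadratic MI reduce to pairwise Gaussian kernel sums.

```python
import numpy as np
from scipy.spatial.distance import cdist


def information_potential(Z, sigma=1.0):
    """Estimate int p_hat(z)^2 dz under an isotropic Gaussian KDE.

    Convolving two Gaussians gives the closed form
    (1/N^2) * sum_ij N(z_i - z_j; 0, 2*sigma^2 * I).
    """
    n, d = Z.shape
    s2 = 2.0 * sigma ** 2                                   # variance of the convolved kernel
    norm = (2.0 * np.pi * s2) ** (-d / 2.0)
    K = norm * np.exp(-cdist(Z, Z, "sqeuclidean") / (2.0 * s2))
    return K.sum() / n ** 2


def renyi_quadratic_entropy(Z, sigma=1.0):
    """H_2(Z) = -log int p(z)^2 dz, estimated nonparametrically."""
    return -np.log(information_potential(Z, sigma))


if __name__ == "__main__":
    rng = np.random.RandomState(0)
    Z_tight = 0.1 * rng.randn(300, 2)
    Z_spread = 3.0 * rng.randn(300, 2)
    # A more spread-out sample should have higher quadratic entropy.
    print(renyi_quadratic_entropy(Z_tight), renyi_quadratic_entropy(Z_spread))
```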

Variational Graph Embedding: MI (Initial Graph)

Approximate each kernel term by its first-order Taylor expansion; an illustrative expansion is written out below.
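For concreteness, here is what such an expansion looks like in generic notation of mine (write a Gaussian kernel term as exp(-u_ij) with u_ij = ||z_i - z_j||^2 / (2 sigma^2)); the paper's exact derivation may differ:

$$
\exp(-u_{ij}) \approx \exp\bigl(-u_{ij}^{(0)}\bigr)\Bigl(1 - \bigl(u_{ij} - u_{ij}^{(0)}\bigr)\Bigr)
= \mathrm{const} - \exp\bigl(-u_{ij}^{(0)}\bigr)\,\frac{\|z_i - z_j\|^2}{2\sigma^2}
$$

Each kernel term thus becomes linear in ||z_i - z_j||^2 with a fixed coefficient exp(-u_ij^(0)), which plays the role of an initial edge weight w_ij for the graph.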

Variational Graph Embedding: MI (Justification)

- Max-relevance, min-redundancy, with a natural tradeoff
- Maximum discrimination
- Locality preserving
- Connection with LPP and FDA

Variational Graph Embedding: BER

The same recipe applies to the Bayes Error Rate:
Bayes Error Rate -> nonparametric estimation -> variational kernel approximation -> variational graph embedding.

Variational Graph Embedding: BER

The resulting algorithm has the same structure as in the MI case:
- E-step: optimize the variational parameters (equivalent to learning a weighted graph)
- M-step: linearly embed the graph (solvable by spectral analysis in linear or kernel space)

Experiments: Face Recognition

- Compared with both global/statistical methods (PCA/KPCA, LDA/KDA) and local/geometric methods (LPP/KLPP, MFA/KMFA), i.e., with both supervised (LDA, MFA) and unsupervised (PCA, LPP) baselines.
- Three benchmark facial image sets are used: Yale, ORL, and CMU PIE. The images were taken in different environments, at different times, and with different poses, facial expressions, and details. All raw images are normalized to 32x32.
- For each data set, we randomly select v images of each person as training data and leave the rest for testing. Only the training data are used to learn features.
- Evaluation metric: classification accuracy of a k-NN classifier on the test data (a sketch of this protocol follows).
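A sketch of this evaluation protocol; the face-dataset loading is replaced by a synthetic stand-in, PCA stands in for whichever feature extractor is being compared, and v, k, and the number of components are illustrative choices.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier


def evaluate_split(X, y, v=5, n_components=20, k=1, seed=0):
    """Pick v images per person for training, learn features on the training
    split only, and report k-NN accuracy on the held-out images.

    PCA stands in for whichever feature extractor is being compared; any
    transformer with fit/transform could be swapped in."""
    rng = np.random.RandomState(seed)
    train_idx, test_idx = [], []
    for c in np.unique(y):
        idx = rng.permutation(np.where(y == c)[0])
        train_idx.extend(idx[:v])
        test_idx.extend(idx[v:])
    train_idx, test_idx = np.array(train_idx), np.array(test_idx)

    fe = PCA(n_components=n_components).fit(X[train_idx])       # training data only
    Z_train, Z_test = fe.transform(X[train_idx]), fe.transform(X[test_idx])

    clf = KNeighborsClassifier(n_neighbors=k).fit(Z_train, y[train_idx])
    return clf.score(Z_test, y[test_idx])


if __name__ == "__main__":
    # Synthetic stand-in: 10 "people" x 12 images each, 32x32 = 1024 pixels.
    # A real run would load Yale / ORL / CMU PIE instead (loader not shown).
    rng = np.random.RandomState(0)
    y = np.repeat(np.arange(10), 12)
    X = rng.randn(120, 1024) + 5.0 * rng.randn(10, 1024)[y]
    print(evaluate_split(X, y, v=5, n_components=20, k=1))
```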

Experiments: Face Recognition (results)

Experiments: Face Recognition

On average, MIE improves over PCA by 36%, over FDA by 8%, and over MFA by 4%; BERE improves over PCA by 39%, over FDA by 11%, and over MFA by 6%. The improvements are even larger in the kernel case: kMIE gains (31%, 9%, 7%) and kBERE gains (33%, 10%, 8%) over (KPCA, KDA, KMFA), respectively.

Discussion

1. It is possible to capture both the global/statistical and the local/geometric structure of the data in one well-defined objective.

2. Graph construction: what makes a good graph?
- Predictive: relevant to the target concept we are inferring.
- Locally and globally consistent: accounts for both the local and the global information revealed by the data, striving for a natural tradeoff.
- Computationally convenient: easy and inexpensive for learning and testing.

A feasible graph construction approach: nonparametric measure + variational kernel approximation. Potential applications include kernel / (dis-)similarity learning, metric learning, dimensionality reduction, spectral clustering, semi-supervised classification / clustering, relational learning, collaborative filtering / co-clustering, and more.

Thanks!