Similarity-based Classifiers: Problems and Solutions.

Classifying based on similarities: Van Gogh or Monet? (The slide shows example paintings by each painter to be attributed to one of the two.)

The Similarity-based Classification Problem: training samples (paintings) come with class labels (the painter), and only pairwise similarities between samples are given, not feature vectors; a new sample must be classified from its similarities to the training samples.
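A sketch of the setup in symbols (notation mine, following the style of the Chen et al. JMLR 2009 paper cited at the end of the talk): with n training samples x_1, …, x_n, labels y_1, …, y_n, and a similarity function ψ, the learner is given
\[
S \in \mathbb{R}^{n \times n}, \qquad S_{ij} = \psi(x_i, x_j),
\]
and must classify a test sample x from its similarities to the training samples,
\[
s(x) = \big(\psi(x, x_1), \dots, \psi(x, x_n)\big)^{\top}.
\]
Note that ψ need not be a metric, symmetric, or a positive semidefinite kernel.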

Examples of Similarity Functions
Computational Biology
– Smith-Waterman algorithm (Smith & Waterman, 1981)
– FASTA algorithm (Lipman & Pearson, 1985)
– BLAST algorithm (Altschul et al., 1990)
Computer Vision
– Tangent distance (Duda et al., 2001)
– Earth mover’s distance (Rubner et al., 2000)
– Shape matching distance (Belongie et al., 2002)
– Pyramid match kernel (Grauman & Darrell, 2007)
Information Retrieval
– Levenshtein distance (Levenshtein, 1966)
– Cosine similarity between tf-idf vectors (Manning & Schütze, 1999)
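As a concrete illustration of the last entry, here is a minimal NumPy sketch of cosine similarity between tf-idf vectors (the tf-idf weighting shown is one simple variant; the function names are mine):

    import numpy as np

    def tfidf(counts):
        # counts: (n_docs, n_terms) matrix of raw term counts
        tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
        df = (counts > 0).sum(axis=0)
        idf = np.log(counts.shape[0] / np.maximum(df, 1))
        return tf * idf

    def cosine_similarity(a, b):
        # similarity in [-1, 1]; larger means more similar documents
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)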

Approaches to Similarity-based Classification (overview diagram): treat the similarities as kernels (SVM); treat the similarities as features; weighted k-NN; generative models (SDA); MDS embeddings; together with supporting theory.

Can we treat similarities as kernels? Not in general: a kernel matrix must be symmetric and positive semidefinite (PSD), while many practical similarity matrices are asymmetric or have negative eigenvalues.
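A minimal NumPy sketch of the check (the function name is mine):

    import numpy as np

    def is_valid_kernel_matrix(S, tol=1e-8):
        # A kernel matrix must be symmetric ...
        if not np.allclose(S, S.T, atol=tol):
            return False
        # ... and positive semidefinite (no eigenvalue below -tol).
        eigenvalues = np.linalg.eigvalsh((S + S.T) / 2)
        return bool(eigenvalues.min() >= -tol)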

Example: Amazon similarity between books. (Figures: the eigenvalue spectrum of a 96-book Amazon similarity matrix, plotted as eigenvalue versus rank.)

Well, let’s just make S be a kernel matrix: modify its eigenvalue spectrum so that it becomes PSD. Flip, Clip or Shift the spectrum? Best bet is Clip.
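A minimal NumPy sketch of the three spectrum modifications in their standard forms (the function name is mine; at test time the same transformation must also be applied to the test similarities, as discussed near the end of the talk):

    import numpy as np

    def make_psd(S, method="clip"):
        # Eigendecompose the symmetric similarity matrix: S = U diag(lam) U^T.
        lam, U = np.linalg.eigh(S)
        if method == "clip":
            lam = np.maximum(lam, 0.0)       # zero out the negative eigenvalues
        elif method == "flip":
            lam = np.abs(lam)                # flip the sign of the negative eigenvalues
        elif method == "shift":
            lam = lam - min(lam.min(), 0.0)  # shift so the smallest eigenvalue is 0
        else:
            raise ValueError(f"unknown method: {method}")
        return (U * lam) @ U.T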

Well, let’s just make S be a kernel matrix: learn the best kernel matrix for the SVM (Luss, NIPS 2007; Chen et al., ICML 2009).

Approaches to Similarity-based Classification (transition slide): next, treating the similarities as features.

Let the similarities to the training samples be features:
– SVM (Graepel et al., 1998; Liao & Noble, 2003)
– Linear programming (LP) machine (Graepel et al., 1999)
– Linear discriminant analysis (LDA) (Pekalska et al., 2001)
– Quadratic discriminant analysis (QDA) (Pekalska & Duin, 2002)
– Potential support vector machine (P-SVM) (Hochreiter & Obermayer, 2006; Knebel et al., 2008)
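A minimal scikit-learn sketch of the similarities-as-features idea, with a synthetic stand-in similarity matrix (all variable names and the toy data are mine):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train, X_test = rng.normal(size=(20, 5)), rng.normal(size=(5, 5))
    y_train = np.arange(20) % 2              # two classes
    S_train = X_train @ X_train.T            # training-to-training similarities, (n, n)
    S_test = X_test @ X_train.T              # test-to-training similarities, (m, n)

    # Similarities-as-features: each row of similarities is an ordinary feature vector.
    clf = SVC(kernel="linear", C=1.0).fit(S_train, y_train)
    y_pred = clf.predict(S_test)

By contrast, the similarities-as-kernel approach of the previous slides would pass a PSD-modified S to SVC(kernel="precomputed").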

(Results table: classification performance on six benchmark similarity datasets: Amazon-47 (47 classes, n = 204), Aural Sonar (2 classes, n = 100), Caltech-101 (101 classes, n = 8677), Face Rec (139 classes, n = 945), Mirex (10 classes, n = 3090), and Voting/VDM (2 classes, n = 435). Methods compared: SVM (clip), SVM sim-as-feature (linear), SVM sim-as-feature (RBF), P-SVM, and SVM-KNN (clip) (Zhang et al., 2006); the numerical entries appear on the original slides.)

Approaches to Similarity-based Classification (transition slide): next, weighted k-NN.

Weighted Nearest Neighbors: take a weighted vote of the k nearest neighbors. This is the algorithmic parallel of the exemplar model of human learning.
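A minimal sketch of the weighted vote for one test sample (names are mine; weight_fn stands in for the affinity, KRI, or KRR weights discussed next):

    import numpy as np

    def weighted_knn_predict(s, y_train, k, weight_fn=lambda sims: sims):
        # s: similarities of the test sample to all n training samples, shape (n,).
        # y_train: numpy array of the n training labels.
        idx = np.argsort(s)[-k:]          # indices of the k most similar training samples
        weights = weight_fn(s[idx])
        votes = {}
        for label, w in zip(y_train[idx], weights):
            votes[label] = votes.get(label, 0.0) + float(w)
        return max(votes, key=votes.get)  # label with the largest total weight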

Design Goals for the Weights (Chen et al., JMLR 2009):
Design Goal 1 (Affinity): w_i should be an increasing function of ψ(x, x_i), the similarity between the test sample and the i-th neighbor.
Design Goal 2 (Diversity): w_i should be a decreasing function of ψ(x_i, x_j), the similarities between the i-th neighbor and the other neighbors, so that redundant neighbors receive less weight.

Linear Interpolation Weights: linear interpolation weights meet these goals. Adding a maximum-entropy term gives the linear interpolation with maximum entropy (LIME) weights (Gupta et al., IEEE PAMI 2006).
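A sketch of the usual statement of these objectives (the entropy term is written here in one common form, as an assumption about the exact variant used): the linear interpolation weights solve
\[
\min_{w}\ \Big\| x - \sum_{i=1}^{k} w_i x_i \Big\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{k} w_i = 1,\ \ w_i \ge 0,
\]
and the LIME weights additionally favor high-entropy (near-uniform) weight vectors, e.g. by adding a penalty $\lambda \sum_i w_i \log w_i$ to the objective.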

Kernelize Linear Interpolation (Chen et al., JMLR 2009): add a ridge term that regularizes the variance of the weights. The resulting objective needs only inner products, which can therefore be replaced with kernel values, or directly with similarities.
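Writing K for the k × k matrix of similarities among the k neighbors and k(x) for the vector of their similarities to the test sample x, a plausible reconstruction of the kernelized, ridge-regularized objective is
\[
\min_{w}\ w^\top K\, w - 2\, w^\top k(x) + \lambda \Big\| w - \tfrac{1}{k}\mathbf{1} \Big\|^2 \quad \text{s.t.} \quad \mathbf{1}^\top w = 1,\ \ w \ge 0,
\]
in which the data enter only through inner products, so K and k(x) can be filled with kernel values or raw similarities.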

KRI Weights Satisfy Design Goals: the kernel ridge interpolation (KRI) weights can be shown to satisfy both the affinity and the diversity goals. Removing the constraints on the weights gives a solution that can be shown equivalent to local ridge regression: the KRR weights.
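Under the objective sketched above, dropping the constraints and setting the gradient to zero gives (still a sketch under that assumed formulation)
\[
(K + \lambda I)\, w = k(x) + \tfrac{\lambda}{k}\,\mathbf{1}
\quad\Longrightarrow\quad
w = (K + \lambda I)^{-1}\Big( k(x) + \tfrac{\lambda}{k}\,\mathbf{1} \Big),
\]
a local ridge-regression solution (computed with a pseudo-inverse when the matrix is singular, cf. the “KRR k-NN (pinv)” entries in the results table).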

Weighted k-NN, Examples 1–3: figures comparing the KRI weights with the KRR weights on three illustrative test configurations.

(Results table, repeated over four slides with different rows highlighted: performance on Amazon-47, Aural Sonar, Caltech-101, Face Rec, Mirex, and Voting, with # samples and # classes listed per dataset. Local methods: k-NN, affinity k-NN, KRI k-NN (clip), KRR k-NN (pinv), SVM-KNN (clip). Global methods: SVM sim-as-kernel (clip), SVM sim-as-feature (linear), SVM sim-as-feature (RBF), P-SVM. The numerical entries appear on the original slides.)

Approaches to Similarity-based Classification (transition slide): next, generative models (SDA).

Generative Classifiers: model each class with a class-conditional distribution and classify with the resulting class posteriors.
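In the similarity-based setting this amounts to a rule of roughly the following form, where T(x) denotes similarity statistics computed for x (a sketch; the specific statistics are those defined in the SDA papers cited on the next slide):
\[
\hat{y}(x) = \arg\max_{c}\ P(c)\, p\big(T(x) \mid c\big).
\]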

Similarity Discriminant Analysis (SDA) (Cazzanti and Gupta, ICML 2007, 2008, 2009). Performance of regularized local SDA: competitive.

Some Conclusions
– Performance depends heavily on the oddities of each dataset.
– Weighted k-NN with affinity/diversity weights works well.
– Preliminary: regularized local SDA works well; probabilities are useful.
– Local models are useful: they do less approximating (modeling the entire space, or an underlying manifold, is hard) and are always feasible.

Lots of Open Questions
– Making S PSD
– Fast k-NN search for similarities
– Similarity-based regression
– Relationship with learning on graphs
– Trying it out on real data
– Fusion with Euclidean features (see our FUSION 2009 papers)
– Open theoretical questions (Chen et al., JMLR 2009; Balcan et al., ML 2008)

Code/Data/Papers: idl.ee.washington.edu/similaritylearning
See “Similarity-based Classification” by Chen et al., JMLR 2009.

Training and Test Consistency: for a test sample x, given its similarities s(x) = (ψ(x, x_1), …, ψ(x, x_n)) to the training samples, shall we classify x by plugging the unmodified s(x) into the classifier trained on the modified (e.g. clipped) S? No! If a training sample were re-submitted as a test sample, its predicted class could change.
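One consistent treatment, sketched here for the clip case (the projection P and the notation are mine): apply to the test similarities the same linear transformation that maps S to its clipped version,
\[
S_{\mathrm{clip}} = U \Lambda_{\mathrm{clip}} U^\top = P\,S, \qquad
P = U\,\mathrm{diag}\big(\mathbb{1}[\lambda_i > 0]\big)\,U^\top, \qquad
\tilde{s}(x) = P\, s(x).
\]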

Data Sets: figures plotting eigenvalue versus rank for the similarity matrices of Amazon, Aural Sonar, Protein, Voting, Yeast-5-7, and Yeast-5-12.

SVM Review: empirical risk minimization (ERM) with regularization, instantiated with the hinge loss, gives the SVM primal.
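The standard statements of these three pieces (notation mine):
\[
\min_{f}\ \sum_{i=1}^{n} L\big(y_i, f(x_i)\big) + \eta\,\|f\|^2,
\qquad
L\big(y, f(x)\big) = \max\big(0,\ 1 - y\,f(x)\big),
\]
\[
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\big(w^\top x_i + b\big) \ge 1 - \xi_i,\ \ \xi_i \ge 0.
\]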

Learning the Kernel Matrix: find the kernel matrix K that is best for classification while being regularized toward S; that is, an SVM that learns the full kernel matrix.
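A sketch of such an objective in the spirit of Luss (2007) and Chen et al. (2009), up to constants and scaling (the exact constraint set may differ):
\[
\min_{K \succeq 0}\ \max_{\alpha \in \mathcal{A}}\ \mathbf{1}^\top \alpha - \tfrac{1}{2}\,(\alpha \circ y)^\top K\,(\alpha \circ y) + \gamma\,\|K - S\|_F^2,
\qquad
\mathcal{A} = \{\alpha : 0 \le \alpha \le C,\ y^\top \alpha = 0\}.
\]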

Related Work: the robust SVM (Luss & d’Aspremont, 2007), shown alongside the standard SVM dual: “This can be interpreted as a worst-case robust classification problem with bounded uncertainty on the kernel matrix K.”

Related Work (continued): rewrite the robust SVM in max-min form. Theorem (Sion, 1958): let M and N be convex spaces, one of which is compact, and let f(μ, ν) be a function on M × N that is quasiconcave in μ, quasiconvex in ν, upper semi-continuous in μ for each ν ∈ N, and lower semi-continuous in ν for each μ ∈ M; then
\[
\sup_{\mu \in M}\ \inf_{\nu \in N} f(\mu, \nu) \;=\; \inf_{\nu \in N}\ \sup_{\mu \in M} f(\mu, \nu).
\]

Related Work (continued): by Sion’s minimax theorem, the robust SVM (max-min) is equivalent to the min-max kernel-learning problem above; comparing the two shows there is zero duality gap.

Learning the Kernel Matrix: it is not trivial to solve the min-max problem directly. Lemma (Generalized Schur Complement): let K be a symmetric matrix, z a vector, and c a scalar; then
\[
\begin{bmatrix} K & z \\ z^\top & c \end{bmatrix} \succeq 0
\quad \Longleftrightarrow \quad
K \succeq 0,\ \ z \in \mathrm{range}(K),\ \ c - z^\top K^{\dagger} z \ge 0.
\]
This lemma is what allows the problem to be rewritten in conic form on the next slide.

Learning the Kernel Matrix: it is not trivial to solve directly; however, it can be expressed as a convex conic program, from whose solution the optimum can then be recovered.

Learning the Spectrum Modification. Concerns about learning the full kernel matrix:
– Though the problem is convex, the number of variables is O(n²).
– The flexibility of the model may lead to overfitting.
This motivates learning only a modification of the eigenvalue spectrum of S rather than the full matrix.
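One way this can be parameterized (a sketch consistent with the slide title, not necessarily the exact formulation used in the talk): keep the eigenvectors of S = U diag(λ) Uᵀ fixed and learn only nonnegative eigenvalues,
\[
K = U\,\mathrm{diag}(\mu)\,U^\top, \qquad \mu \ge 0,
\]
which reduces the number of variables from O(n²) to n.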