Review Rong Jin
Comparison of Different Classification Models: the goal of all classifiers is to predict the class label y for an input x, i.e., to estimate p(y|x).
K Nearest Neighbor (kNN) Approach: examples with k=1 and k=4. Probability interpretation: estimate p(y|x) as the fraction of the k nearest neighbors of x that carry label y.
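The probability interpretation can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a list of (features, label) pairs; the function name `knn_posterior` and the toy data are hypothetical and not taken from the slides.

```python
import math
from collections import Counter

def knn_posterior(train, x, k):
    """Estimate p(y|x) as the fraction of the k nearest training points with label y.

    train: list of (feature_tuple, label) pairs (assumed data layout).
    """
    # Sort training points by Euclidean distance to the query x and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / k for label, n in counts.items()}

# Toy data: three points of class "a" near the origin, two of class "b" near (1, 1).
train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((1.0, 1.0), "b"),
         ((0.9, 1.1), "b"), ((0.2, 0.1), "a")]
probs = knn_posterior(train, (0.0, 0.1), k=4)
```

With k=4 the neighborhood of the query contains all three "a" points and one "b" point, so the estimate mixes both classes instead of committing to the single nearest label.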
K Nearest Neighbor (kNN) Approach: What is the appropriate size for the neighborhood N(x)? Use a leave-one-out approach. Weighted k nearest neighbor: the neighborhood is defined through a weight function, which is then used to estimate p(y|x). How to estimate the appropriate value for σ², the width parameter of the weight function?
Weighted K Nearest Neighbor: leave-one-out + maximum likelihood. Estimate the leave-one-out probability of each training example, form the leave-one-out likelihood of the training data, and search for the optimal σ² by maximizing the leave-one-out likelihood.
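The leave-one-out search can be sketched as follows. The slide's weight function did not survive extraction, so this sketch assumes a Gaussian weight w(x, x_i) = exp(-||x - x_i||² / (2σ²)) and a simple grid search; the grid values, function name, and toy data are hypothetical.

```python
import math

def loo_log_likelihood(train, sigma2):
    """Leave-one-out log-likelihood of the training labels for a given sigma^2."""
    total = 0.0
    for i, (xi, yi) in enumerate(train):
        same = everything = 0.0
        for j, (xj, yj) in enumerate(train):
            if i == j:
                continue  # leave example i out of its own estimate
            # Assumed Gaussian weight function of the squared distance.
            w = math.exp(-math.dist(xi, xj) ** 2 / (2.0 * sigma2))
            everything += w
            if yj == yi:
                same += w
        # Weighted estimate of p(y_i | x_i) from all other examples.
        p = same / everything if everything > 0 else 0.0
        total += math.log(max(p, 1e-300))  # floor avoids log(0)
    return total

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
         ((1.0, 1.0), "b"), ((0.9, 1.1), "b"), ((1.1, 0.9), "b")]
grid = [0.01, 0.1, 1.0, 10.0]  # hypothetical candidate values for sigma^2
best_sigma2 = max(grid, key=lambda s: loo_log_likelihood(train, s))
```

A grid search is only one way to maximize the likelihood; a one-dimensional numerical optimizer over σ² would serve the same purpose.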
Gaussian Generative Model: p(y|x) ∝ p(x|y) p(y), i.e., posterior ∝ likelihood × prior. Estimate p(x|y) and p(y). Allocate a separate set of parameters for each class: θ → {θ_1, θ_2, …, θ_c}, so that p(x|y; θ) → p(x; θ_y). Fit the parameters by maximum likelihood estimation.
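A minimal sketch of this recipe for one-dimensional inputs follows, assuming each class gets its own Gaussian mean and variance and the prior is the class frequency. The function names and toy data are hypothetical, and the sketch assumes every class has nonzero variance.

```python
import math
from collections import defaultdict

def fit_gaussian_classes(xs, ys):
    """Maximum likelihood estimates (prior, mean, variance) for each class."""
    groups = defaultdict(list)
    for x, y in zip(xs, ys):
        groups[y].append(x)
    params = {}
    for y, vals in groups.items():
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)  # MLE divides by n
        params[y] = (len(vals) / len(xs), mean, var)
    return params

def posterior(params, x):
    """p(y|x) proportional to p(x|y) p(y), normalized over the classes."""
    joint = {
        y: prior * math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        for y, (prior, mean, var) in params.items()
    }
    z = sum(joint.values())
    return {y: v / z for y, v in joint.items()}

xs = [1.0, 1.2, 0.8, 3.0, 3.3, 2.7]
ys = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_classes(xs, ys)
post = posterior(params, 1.1)
```

Each class's parameters are computed from that class's data alone, which is exactly the property the later slides criticize as a lack of discriminative power.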
Gaussian Generative Model: it is difficult to estimate p(x|y) if x is of high dimensionality. Naïve Bayes: essentially a linear model. How to make a Gaussian generative model discriminative? The parameters (μ_m, σ_m) of each class are based only on the data belonging to that class, hence the lack of discriminative power.
Gaussian Generative Model: maximum likelihood estimation. How to optimize this objective function?
Gaussian Generative Model: bound optimization algorithm.
Gaussian Generative Model: we have decomposed the interaction of parameters between different classes. Question: how to handle x with multiple features?
Logistic Regression Model: a linear decision boundary defined by w·x + b. A probabilistic model for p(y|x). Maximum likelihood approach for estimating the weights w and the threshold b.
Logistic Regression Model: the overfitting issue. Example: in text classification, a word that appears in only one document will be assigned an infinitely large weight. Solution: regularization, i.e., adding a regularization term to the objective.
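The effect of the regularization term can be illustrated with a small sketch. It assumes labels in {-1, +1}, the model p(y|x) = 1 / (1 + exp(-y(w·x + b))), an L2 penalty (λ/2)||w||², and plain gradient ascent; the slide's exact regularizer and optimizer are not recoverable, so these choices and all names are assumptions.

```python
import math

def train_logistic(data, lam=0.1, lr=0.1, steps=500):
    """Gradient ascent on the L2-regularized log-likelihood.

    data: list of (features, y) with y in {-1, +1} (assumed label encoding).
    """
    dim = len(data[0][0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        # Start from the gradient of the penalty -(lam/2) * ||w||^2.
        gw, gb = [-lam * wi for wi in w], 0.0
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            coef = y / (1.0 + math.exp(margin))  # equals y * (1 - p(y|x))
            gw = [g + coef * xi for g, xi in zip(gw, x)]
            gb += coef
        w = [wi + lr * g for wi, g in zip(w, gw)]
        b += lr * gb
    return w, b

# Linearly separable toy data: without a penalty the weight would grow without bound.
data = [((-2.0,), -1), ((-1.0,), -1), ((1.0,), 1), ((2.0,), 1)]
w_weak, b_weak = train_logistic(data, lam=0.01)
w_strong, b_strong = train_logistic(data, lam=1.0)
```

On separable data like this, a stronger penalty yields a smaller weight while the predicted labels stay the same, which is the behavior the regularization term is meant to enforce.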
Non-linear Logistic Regression Model: kernelize the logistic regression model.
Hierarchical Mixture Expert Model: group linear classifiers into a tree structure. [Figure: a tree with r(x) at the root, a group layer (Group 1 with g_1(x), Group 2 with g_2(x)), and an expert layer (m_{1,1}(x), m_{1,2}(x), m_{2,1}(x), m_{2,2}(x)).] The products generate nonlinearity in the prediction function.
Non-linear Logistic Regression Model: assuming that all data points can be fitted by a single linear model may be too rough, but it is usually appropriate to assume a local linear model. kNN can be viewed as a localized model without any parameters. Can we extend the kNN approach by introducing a localized linear model?
Localized Logistic Regression Model: similar to weighted kNN. Weigh each training example by a weight function of its distance to the query point, then build a logistic regression model using the weighted examples.
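One way to read this recipe is sketched below. It assumes the same Gaussian weight function as in weighted kNN, labels in {-1, +1}, an L2 penalty on w, and gradient ascent on the weighted log-likelihood; none of these details survive in the slide, so the function `local_logistic_predict` and its parameters are hypothetical.

```python
import math

def local_logistic_predict(data, query, sigma2=0.5, lam=0.1, lr=0.1, steps=300):
    """Fit a logistic model around `query` and return p(y=+1 | query)."""
    # Assumed Gaussian weight of each training example relative to the query.
    weights = [math.exp(-math.dist(x, query) ** 2 / (2.0 * sigma2)) for x, _ in data]
    w, b = [0.0] * len(query), 0.0
    for _ in range(steps):
        gw, gb = [-lam * wi for wi in w], 0.0  # gradient of the L2 penalty
        for (x, y), a in zip(data, weights):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            coef = a * y / (1.0 + math.exp(y * z))  # example weight times gradient term
            gw = [g + coef * xi for g, xi in zip(gw, x)]
            gb += coef
        w = [wi + lr * g for wi, g in zip(w, gw)]
        b += lr * gb
    z = sum(wi * xi for wi, xi in zip(w, query)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Positive class sits between two negative groups: no single linear model fits globally.
data = [((-3.0,), -1), ((-2.0,), -1), ((-0.5,), 1), ((0.5,), 1), ((2.0,), -1), ((3.0,), -1)]
p_inside = local_logistic_predict(data, (0.4,))
p_outside = local_logistic_predict(data, (2.5,))
```

Because a separate model is fitted per query, the cost is paid at prediction time, which is the same trade-off kNN makes.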
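Conditional Exponential Model: an extension of the logistic regression model to the multi-class case. A different set of weights w_y and threshold b_y for each class y. Translation invariance: shifting every w_y by the same vector and every b_y by the same constant leaves p(y|x) unchanged.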
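The model and its translation invariance can be checked numerically. This sketch assumes the usual form p(y|x) = exp(w_y·x + b_y) / Σ_y' exp(w_y'·x + b_y'); the function name and the example numbers are hypothetical.

```python
import math

def conditional_exponential(weights, biases, x):
    """Return [p(y|x)] for each class y under the conditional exponential model."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) + b for w, b in zip(weights, biases)]
    top = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

weights = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]  # one weight vector per class
biases = [0.0, 0.5, -0.5]
x = (0.3, -0.2)
p = conditional_exponential(weights, biases, x)

# Shift every class by the same vector and constant: the probabilities do not change.
shift_w, shift_b = (2.0, -3.0), 1.5
shifted_weights = [tuple(wi + si for wi, si in zip(w, shift_w)) for w in weights]
shifted_biases = [b + shift_b for b in biases]
p_shifted = conditional_exponential(shifted_weights, shifted_biases, x)
```

The invariance means the parameters are not uniquely determined by the data, so one class can be fixed as a reference without changing the model.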
Iterative scaling methods for optimization. Maximum Entropy Model: find the simplest model that matches the data. Maximize entropy: prefer the uniform distribution. Constraints: force the model to be consistent with the observed data.
Support Vector Machine: classification margin. Maximum margin principle: separate the data far away from the decision boundary. Two objectives: minimize the classification error over the training data, and maximize the classification margin. Support vectors: only the support vectors have an impact on the location of the decision boundary. [Figure: two classes of points, denoted +1 and -1, with the classification margin and the support vectors marked.]
Support Vector Machine: separable case and noisy case. Both are quadratic programming problems!
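The slide's own equations were lost in extraction. For reference, the textbook formulations of the two cases are given below, assuming labels y_i in {-1, +1}, slack variables ξ_i, and a trade-off constant C; the slide's notation may have differed.

```latex
% Separable case (hard margin)
\min_{w,\,b} \ \tfrac{1}{2}\|w\|^2
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1,\dots,n

% Noisy case (soft margin), with slack variables \xi_i and trade-off constant C
\min_{w,\,b,\,\xi} \ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0
```

Both problems have a quadratic objective and linear constraints, which is why they are quadratic programs.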
Logistic Regression Model vs. Support Vector Machine: the two objectives share identical terms but use different loss functions for punishing mistakes.
Logistic Regression Model vs. Support Vector Machine: logistic regression differs from the support vector machine only in the loss function.
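The difference between the two loss functions is easy to see numerically. This sketch assumes the standard forms written in terms of the margin m = y(w·x + b): the logistic loss log(1 + exp(-m)) and the hinge loss max(0, 1 - m). The function names are illustrative.

```python
import math

def logistic_loss(margin):
    """Loss used by logistic regression: log(1 + exp(-margin))."""
    return math.log1p(math.exp(-margin))

def hinge_loss(margin):
    """Loss used by the support vector machine: max(0, 1 - margin)."""
    return max(0.0, 1.0 - margin)

# Compare the two losses on a few margins, from badly wrong to confidently right.
margins = [-2.0, 0.0, 1.0, 3.0]
table = [(m, logistic_loss(m), hinge_loss(m)) for m in margins]
```

The hinge loss is exactly zero once the margin reaches 1, so well-classified points stop influencing the solution, whereas the logistic loss stays positive for every finite margin.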
Kernel Tricks: introducing nonlinearity into the discriminative models. Diffusion kernel: start from a graph Laplacian L that encodes local similarity; the diffusion kernel then propagates the local similarity information into a global one.
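The propagation from local to global similarity can be sketched on a tiny graph. This assumes the common definition K = exp(-βL) with L = D - A (degree matrix minus adjacency matrix) and approximates the matrix exponential by a truncated Taylor series, which is adequate only for small graphs and modest β. The helper names, β, and the number of terms are assumptions.

```python
def matmul(a, b):
    """Plain-Python matrix product for small dense matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def diffusion_kernel(adjacency, beta=0.5, terms=30):
    """Approximate K = exp(-beta * L) with L = D - A by a truncated Taylor series."""
    n = len(adjacency)
    lap = [[(sum(adjacency[i]) if i == j else 0.0) - adjacency[i][j] for j in range(n)]
           for i in range(n)]
    step = [[-beta * v for v in row] for row in lap]
    kernel = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in kernel]
    for k in range(1, terms):
        # term becomes (-beta * L)^k / k!
        term = [[v / k for v in row] for row in matmul(term, step)]
        kernel = [[kernel[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return kernel

# Path graph 0 - 1 - 2: nodes 0 and 2 are not directly connected.
adjacency = [[0.0, 1.0, 0.0],
             [1.0, 0.0, 1.0],
             [0.0, 1.0, 0.0]]
K = diffusion_kernel(adjacency)
```

Nodes 0 and 2 share no edge, yet K[0][2] is positive because similarity diffuses through node 1; it remains smaller than the similarity between direct neighbors.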
Fisher Kernel: derive a kernel function from a generative model. Key idea: map a point x in the original input space into the model space; the similarity of two data points is then measured in the model space. [Figure: mapping from the original input space to the model space.]
Kernel Methods in Generative Model: usually, kernels can be introduced to a generative model through a Gaussian process. Define a “kernelized” covariance matrix, which must be positive semi-definite, similar to Mercer’s condition.
Multi-class SVM: SVMs can only handle two-class outputs. One-against-all: learn N SVMs. SVM 1 learns “Output==1” vs “Output != 1”; SVM 2 learns “Output==2” vs “Output != 2”; …; SVM N learns “Output==N” vs “Output != N”.
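The one-against-all scheme is independent of the binary learner, so it can be sketched with any scorer. To stay self-contained, the sketch below uses a simple centroid-based scorer as a stand-in; it is NOT an SVM, and `centroid_scorer`, the toy data, and the argmax decision rule are assumptions made for illustration.

```python
import math

def train_one_vs_all(data, classes, train_binary):
    """Train one binary scorer per class: class c versus everything else."""
    scorers = {}
    for c in classes:
        relabeled = [(x, 1 if y == c else -1) for x, y in data]
        scorers[c] = train_binary(relabeled)
    return scorers

def predict_one_vs_all(scorers, x):
    """Pick the class whose scorer is most confident (assumed decision rule)."""
    return max(scorers, key=lambda c: scorers[c](x))

def centroid_scorer(relabeled):
    """Hypothetical stand-in for an SVM trainer (NOT an SVM)."""
    def centroid(points):
        return tuple(sum(col) / len(points) for col in zip(*points))
    pos = centroid([x for x, y in relabeled if y == 1])
    neg = centroid([x for x, y in relabeled if y == -1])
    # Higher score means closer to the positive centroid than to the negative one.
    return lambda x: math.dist(x, neg) - math.dist(x, pos)

data = [((0.0, 0.0), 1), ((0.2, 0.1), 1), ((5.0, 0.0), 2),
        ((4.8, 0.3), 2), ((0.0, 5.0), 3), ((0.1, 4.9), 3)]
scorers = train_one_vs_all(data, [1, 2, 3], centroid_scorer)
label = predict_one_vs_all(scorers, (4.5, 0.5))
```

Replacing `centroid_scorer` with a function that trains a real SVM and returns its decision function would give the scheme described on the slide.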
Error-Correcting Output Code (ECOC): encode each class into a bit vector. [Figure: an input x and the bit functions S_1, S_2, S_3, S_4.]
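Decoding can be sketched as a nearest-codeword lookup. The 7-bit codebook below is a hypothetical example with pairwise Hamming distance 4, not the code from the slide, and the sketch assumes one binary classifier has already produced each predicted bit.

```python
def hamming(a, b):
    """Number of positions where two bit vectors differ."""
    return sum(u != v for u, v in zip(a, b))

def ecoc_decode(codebook, predicted_bits):
    """Return the class whose codeword is closest to the predicted bit vector."""
    return min(codebook, key=lambda c: hamming(codebook[c], predicted_bits))

# Hypothetical codebook: each class is encoded as a 7-bit vector.
codebook = {
    "class1": (1, 1, 1, 1, 1, 1, 1),
    "class2": (0, 0, 0, 0, 1, 1, 1),
    "class3": (0, 0, 1, 1, 0, 0, 1),
    "class4": (0, 1, 0, 1, 0, 1, 0),
}

# The bit classifiers output class3's codeword with the first bit flipped.
noisy_bits = (1, 0, 1, 1, 0, 0, 1)
decoded = ecoc_decode(codebook, noisy_bits)
```

Because every pair of codewords in this example differs in four positions, a single wrong bit classifier still leaves the correct codeword as the unique nearest one.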
Ordinal Regression: a special class of multi-class classification problems in which there is a natural ordinal relationship between the classes. Maximum margin principle: the computation of the margin involves multiple classes. [Figure: classes ‘good’, ‘OK’, and ‘bad’ ordered along a weight vector w.]
Decision Tree From slides of Andrew Moore
Decision Tree: a greedy approach for generating a decision tree. 1. Choose the most informative feature, using the mutual information measurement. 2. Split the data set according to the values of the selected feature. 3. Recurse until each data item is classified correctly. Attributes with real values: quantize the real value into a discrete one.
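Step 1 of the greedy procedure can be sketched as follows. It assumes discrete features stored in dictionaries and measures informativeness by the mutual information I(feature; label) = H(label) - H(label | feature); the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Mutual information between a discrete feature and the label."""
    n = len(rows)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [y for row, y in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / n * entropy(subset)  # weighted conditional entropy
    return entropy(labels) - remainder

def most_informative_feature(rows, labels):
    """Greedy choice for the next split."""
    return max(rows[0].keys(), key=lambda f: information_gain(rows, labels, f))

rows = [{"outlook": "sunny", "windy": True}, {"outlook": "sunny", "windy": False},
        {"outlook": "rain", "windy": True}, {"outlook": "rain", "windy": False}]
labels = ["yes", "yes", "no", "no"]
best = most_informative_feature(rows, labels)
```

A full tree builder would split the rows on `best` and call the same selection recursively on each subset until every subset is pure.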
Decision Tree: the overfitting problem. Tree pruning: reduced-error pruning and rule post-pruning.
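Generalize Decision Tree: a decision tree with simple data partition splits on one attribute at a time (Attribute 1, Attribute 2); a decision tree using classifiers for data partition makes each node a linear classifier. [Figure: the two kinds of partition drawn over Attribute 1 and Attribute 2.]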