Maximum Entropy Discrimination
Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)

Classification
· inputs x, class y = +1, −1
· data D = { (x_1, y_1), …, (x_T, y_T) }
· learn f_opt(x), a discriminant function from F = {f}, a family of discriminants
· classify: y = sign f_opt(x)

Model averaging
· many f with near-optimal performance
· instead of choosing f_opt, average over all f in F
· Q(f) = weight of f
· y(x) = sign ∫_F Q(f) f(x) df = sign ⟨f(x)⟩_Q
· to specify: F = { f }, a family of discriminant functions
· to learn: Q(f), a distribution over F
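Below is a minimal sketch (not from the slides) of this averaging rule for a finite candidate family, with the weights Q(f) simply assumed to be given; the function names and example weights are hypothetical.

```python
import numpy as np

def averaged_classifier(discriminants, weights):
    """Predict y(x) = sign( sum_f Q(f) * f(x) ) for a finite family F."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize Q(f)

    def predict(x):
        votes = np.array([f(x) for f in discriminants])
        return np.sign(weights @ votes)            # sign of the Q-average
    return predict

# Example: three hypothetical linear discriminants on 2-D inputs.
F = [lambda x: x[0] - x[1],
     lambda x: 2 * x[0] + x[1] - 1,
     lambda x: -x[0] + 3 * x[1]]
classify = averaged_classifier(F, weights=[0.5, 0.3, 0.2])
print(classify(np.array([1.0, 0.2])))              # prints +1.0 or -1.0
```

The remaining slides are about how the weights Q(f) are chosen.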

Goal of this work
· define a discriminative criterion for averaging over models
Advantages
· can incorporate a prior
· can use generative models
· computationally feasible
· generalizes to other discrimination tasks

Maximum Entropy Discrimination
· given data set D = { (x_1, y_1), …, (x_T, y_T) }
· find Q_ME = argmax_Q H(Q)
  s.t. y_t ∫ Q(f) f(x_t) df ≥ γ for all t = 1, …, T   (C)
  for some margin γ > 0
· the solution Q_ME correctly classifies D
· among all admissible Q, Q_ME has maximum entropy
· maximum entropy = least specific about f

Solution: Q_ME as a projection
· convex problem: Q_ME is unique
· solution: Q_ME(f) ∝ exp{ Σ_{t=1}^{T} λ_t y_t f(x_t) }
· λ_t ≥ 0 are Lagrange multipliers
· finding Q_ME: start with λ = 0 and follow the gradient of the unsatisfied constraints
[figure: Q_ME as the projection of the uniform Q_0 (λ = 0) onto the set of admissible Q]

Finding the solution
· need λ_t, t = 1, …, T
· obtained by solving the dual problem
  max_λ J(λ) = max_λ [ −log Z_+(λ) − log Z_−(λ) − γ Σ_t λ_t ]
  s.t. λ_t ≥ 0 for t = 1, …, T
Algorithm
· start with λ_t = 0 (uniform distribution)
· iterative ascent on J(λ) until convergence
· derivative: ∂J/∂λ_t = y_t ⟨ log P_+(x_t)/P_−(x_t) ⟩_Q − γ
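As a toy numerical illustration (not the authors' implementation), the sketch below assumes a finite candidate family, so Q is just a vector, a fixed margin gamma, and a simple clipped-gradient update that raises λ_t while constraint t is still violated under the current Q; all names and constants are made up.

```python
import numpy as np

def maxent_discrimination(Fx, y, gamma=0.1, lr=0.05, iters=5000):
    """Fx[j, t] = f_j(x_t) for a finite family {f_j}; y[t] in {+1, -1}.
    Returns the MaxEnt weights Q over the family and the multipliers lam."""
    n_models, T = Fx.shape
    lam = np.zeros(T)                              # start at lambda = 0 (uniform Q)
    for _ in range(iters):
        logits = Fx @ (lam * y)                    # log Q_j up to a constant:
        logits -= logits.max()                     #   Q_j ~ exp(sum_t lam_t y_t f_j(x_t))
        Q = np.exp(logits)
        Q /= Q.sum()
        slack = y * (Q @ Fx) - gamma               # constraint values y_t <f(x_t)>_Q - gamma
        lam = np.maximum(lam - lr * slack, 0.0)    # raise lam_t where the constraint is violated
    return Q, lam

# Toy usage: 4 candidate discriminants evaluated on 6 labelled points.
rng = np.random.default_rng(0)
Fx = rng.normal(size=(4, 6))
y = np.array([1, -1, 1, 1, -1, -1])
Q, lam = maxent_discrimination(Fx, y)
print("Q =", np.round(Q, 3), " support points:", np.where(lam > 1e-6)[0])
```

At convergence λ_t is typically nonzero only for the few constraints that are tight, which is the sparsity discussed on the next slide.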

Q_ME as a sparse solution
· classification rule: y(x) = sign ∫ Q_ME(f) f(x) df
· γ is the classification margin
· λ_t > 0 only for points with y_t ∫ Q(f) f(x_t) df = γ, i.e. x_t on the margin (support vector!)

Q_ME as regularization
· uniform distribution Q_0: λ = 0
· "smoothness" of Q = H(Q)
· Q_ME is the smoothest admissible distribution
[figure: Q(f) plotted over f, comparing the uniform Q_0, Q_ME, and the point mass at f_opt]

Goal of this work
· define a discriminative criterion for averaging over models ✓
Extensions
· incorporate a prior
· relationship to support vectors
· use generative models
· generalize to other discrimination tasks

Priors
· prior Q_0(f)
· Minimum Relative Entropy Discrimination
  Q_MRE = argmin_Q KL(Q || Q_0)
  s.t. y_t ∫ Q(f) f(x_t) df ≥ γ for all t = 1, …, T   (C)
· a prior on γ → learn Q_MRE(f, γ): soft margin
[figure: Q_MRE as the KL(Q || Q_0) projection of the prior Q_0 onto the set of admissible Q]
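For reference, the KL projection keeps the same exponential form as before, now tilted away from the prior. A compact statement for the fixed-margin case, written to match the earlier slides (a sketch, not a quoted theorem):

```latex
% Minimum Relative Entropy solution, fixed-margin case (sketch):
Q_{\mathrm{MRE}}(f) \;=\; \frac{1}{Z(\lambda)}\, Q_0(f)\,
    \exp\Big\{ \sum_{t=1}^{T} \lambda_t \,\big[\, y_t f(x_t) - \gamma \,\big] \Big\},
\qquad \lambda_t \ge 0 .
```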

Soft margins
· average also over the margin γ
· define Q_0(f, γ) = Q_0(f) Q_0(γ)
· constraints: ∫ Q(f, γ) [ y_t f(x_t) − γ ] df dγ ≥ 0
· learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ)
· Q_0(γ) = c exp[ c(γ − 1) ] for γ ≤ 1
[figure: the resulting margin potential as a function of γ]
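A quick check of where this particular prior leads (assuming the margin enters the exponent as −λ_t γ_t and that Q_0(γ) is supported on γ ≤ 1): integrating γ out produces exactly the λ_t + log(1 − λ_t/c) penalty that appears in J(λ) on the next slide.

```latex
% Integrating out one margin under Q_0(\gamma) = c\,e^{c(\gamma-1)},\ \gamma \le 1:
Z_\gamma(\lambda_t)
  = \int_{-\infty}^{1} c\, e^{c(\gamma-1)}\, e^{-\lambda_t \gamma}\, d\gamma
  = c\, e^{-c}\, \frac{e^{\,c-\lambda_t}}{c-\lambda_t}
  = \frac{e^{-\lambda_t}}{1-\lambda_t/c},
  \qquad \lambda_t < c,

-\log Z_\gamma(\lambda_t) = \lambda_t + \log\bigl(1-\lambda_t/c\bigr).
```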

Examples: support vector machines
· Theorem: for f(x) = θ·x + b, Q_0(θ) = Normal(0, I), and Q_0(b) a non-informative prior, the Lagrange multipliers are obtained by maximizing J(λ) subject to 0 ≤ λ_t and Σ_t λ_t y_t = 0, where
  J(λ) = Σ_t [ λ_t + log(1 − λ_t/c) ] − 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s
· separable D: SVM recovered exactly
· inseparable D: SVM recovered with a different misclassification penalty
· adaptive kernel SVM …
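As a rough illustration (not the authors' code), the theorem's dual can be handed to an off-the-shelf constrained optimizer; the data below are synthetic, the kernel is the plain dot product, and the bound λ_t < c is imposed explicitly so the log term stays defined.

```python
import numpy as np
from scipy.optimize import minimize

def med_svm_dual(X, y, c=5.0):
    """Maximize J(lam) = sum_t [lam_t + log(1 - lam_t/c)] - 1/2 sum_{t,s} lam_t lam_s y_t y_s x_t.x_s
    subject to 0 <= lam_t < c and sum_t lam_t y_t = 0 (a sketch, via SLSQP)."""
    T = len(y)
    K = X @ X.T                                        # linear kernel x_t . x_s
    Yk = (y[:, None] * y[None, :]) * K

    def neg_J(lam):
        return -(np.sum(lam + np.log(1.0 - lam / c)) - 0.5 * lam @ Yk @ lam)

    res = minimize(neg_J, x0=np.full(T, 1e-3), method="SLSQP",
                   bounds=[(0.0, c * (1 - 1e-6))] * T,
                   constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
    return res.x

# Synthetic usage: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, (20, 2)), rng.normal(-1, 1, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
lam = med_svm_dual(X, y)
print("support vectors:", np.where(lam > 1e-4)[0])
```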

SVM extensions
· example: Leptograpsus crabs (5 inputs, T_train = 80, T_test = 120)
· f(x) = log [ P_+(x) / P_−(x) ] + b with P_±(x) = Normal(x; m_±, V_±): a quadratic classifier
· Q(V_+, V_−) = distribution over the kernel width
[figure: decision boundaries of the MRE Gaussian, linear SVM, and maximum-likelihood Gaussian classifiers]

Using generative models
· generative models P_+(x), P_−(x) for y = +1, −1
· f(x) = log [ P_+(x) / P_−(x) ] + b
· learn Q_MRE(P_+, P_−, b, γ)
· if Q_0(P_+, P_−, b, γ) = Q_0(P_+) Q_0(P_−) Q_0(b) Q_0(γ)
  then Q_MRE(P_+, P_−, b, γ) = Q_MRE(P_+) Q_MRE(P_−) Q_MRE(b) Q_MRE(γ)
  (factored prior → factored posterior)
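A minimal sketch of this discriminant with Gaussian class models; the parameters here are plain point estimates (MED would instead maintain a distribution over them), so it only illustrates the functional form f(x) = log P_+(x)/P_−(x) + b.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(X):
    """Point-estimate a class-conditional Gaussian (MED would average over these)."""
    return multivariate_normal(mean=X.mean(axis=0),
                               cov=np.cov(X.T) + 1e-6 * np.eye(X.shape[1]))

def log_ratio_discriminant(X_pos, X_neg, b=0.0):
    P_pos, P_neg = fit_gaussian(X_pos), fit_gaussian(X_neg)
    return lambda x: P_pos.logpdf(x) - P_neg.logpdf(x) + b   # f(x) = log P+(x)/P-(x) + b

# Usage on synthetic 2-D data:
rng = np.random.default_rng(2)
X_pos = rng.normal(+1.0, 1.0, (50, 2))
X_neg = rng.normal(-1.0, 1.0, (50, 2))
f = log_ratio_discriminant(X_pos, X_neg)
print(np.sign(f(np.array([0.8, 0.5]))))   # predicted label +1/-1
```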

Examples: other distributions
· multinomial (1 discrete variable) ✓
· graphical model (fixed structure, no hidden variables) ✓
· tree graphical model (Q over structures and parameters) ✓

Tree graphical models
· P(x | E, θ) = P_0(x) Π_{uv ∈ E} P_uv(x_u, x_v | θ_uv)
· prior Q_0(P) = Q_0(E) Q_0(θ | E): conjugate prior over E and θ
· Q_0(E) ∝ Π_{uv ∈ E} β_uv
· Q_0(θ | E) = conjugate prior
· Q_MRE(P) ∝ W_0 Π_{uv ∈ E} W_uv, and can be integrated analytically
[figure: example tree edge sets E]
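The analytic integration relies on sums over all spanning trees having a closed form; the sketch below shows only that ingredient, the matrix-tree theorem applied to a hypothetical symmetric weight matrix W, not the full Q_MRE computation.

```python
import numpy as np

def sum_over_spanning_trees(W):
    """Sum of prod_{uv in E} W_uv over all spanning trees E of a complete graph,
    via the matrix-tree theorem: any cofactor of the weighted Laplacian."""
    W = np.asarray(W, dtype=float)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W          # weighted graph Laplacian
    return np.linalg.det(L[1:, 1:])         # principal minor = sum over all trees

# Hypothetical symmetric edge weights W_uv for 4 variables:
W = np.array([[0, 2, 1, 1],
              [2, 0, 3, 1],
              [1, 3, 0, 2],
              [1, 1, 2, 0]], dtype=float)
print(sum_over_spanning_trees(W))
```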

Trees: experiments
· splice junction classification task: 25 inputs, 400 training examples
· compared with maximum-likelihood trees
· ML: err = 14%; MaxEnt: err = 12.3%

Trees: experiments (contd)
[figure: weights of the tree edges]

Discrimination tasks
· classification
· classification with partially labeled data
· anomaly detection
[figure: illustrative scatter plots for the three tasks]

Partially labeled data
· Problem: given F, a family of discriminants, and data set D = { (x_1, y_1), …, (x_T, y_T), x_{T+1}, …, x_N }
  find Q(f, γ, y) = argmin_Q KL(Q || Q_0)
  s.t. ⟨ y_t f(x_t) − γ ⟩_Q ≥ 0 for all t = 1, …, T   (C)

Partially labeled data: experiment
· splice junction classification, 25 inputs, T_total = 1000
[figure: test performance with complete data, with 10% labeled + 90% unlabeled, and with 10% labeled only]

Anomaly detection
· Problem: given P = { P }, a family of generative models, and data set D = { x_1, …, x_T }
  find Q(P, γ) = argmin_Q KL(Q || Q_0)
  s.t. ⟨ log P(x_t) − γ ⟩_Q ≥ 0 for all t = 1, …, T   (C)

Anomaly detection: experiments
[figure: estimated densities, MaxEnt vs. maximum likelihood]

Anomaly detection: experiments (contd)
[figure: estimated densities, MaxEnt vs. maximum likelihood]

Conclusions
· a new framework for classification
· based on regularization in the space of distributions
· enables the use of generative models
· enables the use of priors
· generalizes to other discrimination tasks