Maximum Entropy Discrimination
Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)

Classification
· inputs x, class y = +1, −1
· data D = { (x_1, y_1), …, (x_T, y_T) }
· learn f_opt(x), a discriminant function from F = {f}, a family of discriminants
· classify: y = sign f_opt(x)

Model averaging
· many f with near-optimal performance
· instead of choosing f_opt, average over all f in F
· Q(f) = weight of f
· y(x) = sign ∫_F Q(f) f(x) df = sign ⟨f(x)⟩_Q
· to specify: F = { f }, a family of discriminant functions
· to learn: Q(f), a distribution over F
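Below is a minimal sketch (not from the slides) of this averaging rule for a finite candidate family, with the weights Q(f) simply assumed to be given; the function names and example weights are hypothetical.

```python
import numpy as np

def averaged_classifier(discriminants, weights):
    """Predict y(x) = sign( sum_f Q(f) * f(x) ) for a finite family F."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()              # normalize Q(f)

    def predict(x):
        votes = np.array([f(x) for f in discriminants])
        return np.sign(weights @ votes)            # sign of the Q-average
    return predict

# Example: three hypothetical linear discriminants on 2-D inputs.
F = [lambda x: x[0] - x[1],
     lambda x: 2 * x[0] + x[1] - 1,
     lambda x: -x[0] + 3 * x[1]]
classify = averaged_classifier(F, weights=[0.5, 0.3, 0.2])
print(classify(np.array([1.0, 0.2])))              # prints +1.0 or -1.0
```

The remaining slides are about how the weights Q(f) are chosen.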

Goal of this work
· define a discriminative criterion for averaging over models
Advantages
· can incorporate a prior
· can use generative models
· computationally feasible
· generalizes to other discrimination tasks

Maximum Entropy Discrimination
· given data set D = { (x_1, y_1), …, (x_T, y_T) }
· find Q_ME = argmax_Q H(Q)
  s.t. y_t ∫ Q(f) f(x_t) df ≥ γ for all t = 1, …, T   (C)
  for some margin γ > 0
· the solution Q_ME correctly classifies D
· among all admissible Q, Q_ME has maximum entropy
· maximum entropy = least specific about f

Solution: Q_ME as a projection
· convex problem: Q_ME is unique
· solution: Q_ME(f) ∝ exp{ Σ_{t=1}^{T} λ_t y_t f(x_t) }
· λ_t ≥ 0 are Lagrange multipliers
· finding Q_ME: start with λ = 0 and follow the gradient of the unsatisfied constraints
[figure: Q_ME as the projection of the uniform Q_0 (λ = 0) onto the set of admissible Q]

Finding the solution
· need λ_t, t = 1, …, T
· obtained by solving the dual problem
  max_λ J(λ) = max_λ [ −log Z_+(λ) − log Z_−(λ) − γ Σ_t λ_t ]
  s.t. λ_t ≥ 0 for t = 1, …, T
Algorithm
· start with λ_t = 0 (uniform distribution)
· iterative ascent on J(λ) until convergence
· derivative: ∂J/∂λ_t = y_t ⟨ log P_+(x_t)/P_−(x_t) ⟩_Q − γ
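As a toy numerical illustration (not the authors' implementation), the sketch below assumes a finite candidate family, so Q is just a vector, a fixed margin gamma, and a simple clipped-gradient update that raises λ_t while constraint t is still violated under the current Q; all names and constants are made up.

```python
import numpy as np

def maxent_discrimination(Fx, y, gamma=0.1, lr=0.05, iters=5000):
    """Fx[j, t] = f_j(x_t) for a finite family {f_j}; y[t] in {+1, -1}.
    Returns the MaxEnt weights Q over the family and the multipliers lam."""
    n_models, T = Fx.shape
    lam = np.zeros(T)                              # start at lambda = 0 (uniform Q)
    for _ in range(iters):
        logits = Fx @ (lam * y)                    # log Q_j up to a constant:
        logits -= logits.max()                     #   Q_j ~ exp(sum_t lam_t y_t f_j(x_t))
        Q = np.exp(logits)
        Q /= Q.sum()
        slack = y * (Q @ Fx) - gamma               # constraint values y_t <f(x_t)>_Q - gamma
        lam = np.maximum(lam - lr * slack, 0.0)    # raise lam_t where the constraint is violated
    return Q, lam

# Toy usage: 4 candidate discriminants evaluated on 6 labelled points.
rng = np.random.default_rng(0)
Fx = rng.normal(size=(4, 6))
y = np.array([1, -1, 1, 1, -1, -1])
Q, lam = maxent_discrimination(Fx, y)
print("Q =", np.round(Q, 3), " support points:", np.where(lam > 1e-6)[0])
```

At convergence λ_t is typically nonzero only for the few constraints that are tight, which is the sparsity discussed on the next slide.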

Q_ME as a sparse solution
· classification rule: y(x) = sign ∫ Q_ME(f) f(x) df
· γ is the classification margin
· λ_t > 0 only for points with y_t ∫ Q(f) f(x_t) df = γ, i.e. x_t on the margin (support vector!)

Q_ME as regularization
· uniform distribution Q_0: λ = 0
· "smoothness" of Q = H(Q)
· Q_ME is the smoothest admissible distribution
[figure: Q(f) plotted over f, comparing the uniform Q_0, Q_ME, and the point mass at f_opt]

Goal of this work
· define a discriminative criterion for averaging over models ✓
Extensions
· incorporate a prior
· relationship to support vectors
· use generative models
· generalize to other discrimination tasks

Priors
· prior Q_0(f)
· Minimum Relative Entropy Discrimination
  Q_MRE = argmin_Q KL(Q || Q_0)
  s.t. y_t ∫ Q(f) f(x_t) df ≥ γ for all t = 1, …, T   (C)
· a prior on γ → learn Q_MRE(f, γ): soft margin
[figure: Q_MRE as the KL(Q || Q_0) projection of the prior Q_0 onto the set of admissible Q]
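For reference, the KL projection keeps the same exponential form as before, now tilted away from the prior. A compact statement for the fixed-margin case, written to match the earlier slides (a sketch, not a quoted theorem):

```latex
% Minimum Relative Entropy solution, fixed-margin case (sketch):
Q_{\mathrm{MRE}}(f) \;=\; \frac{1}{Z(\lambda)}\, Q_0(f)\,
    \exp\Big\{ \sum_{t=1}^{T} \lambda_t \,\big[\, y_t f(x_t) - \gamma \,\big] \Big\},
\qquad \lambda_t \ge 0 .
```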

Soft margins
· average also over the margin γ
· define Q_0(f, γ) = Q_0(f) Q_0(γ)
· constraints: ∫ Q(f, γ) [ y_t f(x_t) − γ ] df dγ ≥ 0
· learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ)
· Q_0(γ) = c exp[ c(γ − 1) ] for γ ≤ 1
[figure: the resulting margin potential as a function of γ]
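A quick check of where this particular prior leads (assuming the margin enters the exponent as −λ_t γ_t and that Q_0(γ) is supported on γ ≤ 1): integrating γ out produces exactly the λ_t + log(1 − λ_t/c) penalty that appears in J(λ) on the next slide.

```latex
% Integrating out one margin under Q_0(\gamma) = c\,e^{c(\gamma-1)},\ \gamma \le 1:
Z_\gamma(\lambda_t)
  = \int_{-\infty}^{1} c\, e^{c(\gamma-1)}\, e^{-\lambda_t \gamma}\, d\gamma
  = c\, e^{-c}\, \frac{e^{\,c-\lambda_t}}{c-\lambda_t}
  = \frac{e^{-\lambda_t}}{1-\lambda_t/c},
  \qquad \lambda_t < c,

-\log Z_\gamma(\lambda_t) = \lambda_t + \log\bigl(1-\lambda_t/c\bigr).
```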

Examples: support vector machines
· Theorem: for f(x) = θ·x + b, Q_0(θ) = Normal(0, I), and Q_0(b) a non-informative prior, the Lagrange multipliers are obtained by maximizing J(λ) subject to 0 ≤ λ_t and Σ_t λ_t y_t = 0, where
  J(λ) = Σ_t [ λ_t + log(1 − λ_t/c) ] − 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s
· separable D: SVM recovered exactly
· inseparable D: SVM recovered with a different misclassification penalty
· adaptive kernel SVM …
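As a rough illustration (not the authors' code), the theorem's dual can be handed to an off-the-shelf constrained optimizer; the data below are synthetic, the kernel is the plain dot product, and the bound λ_t < c is imposed explicitly so the log term stays defined.

```python
import numpy as np
from scipy.optimize import minimize

def med_svm_dual(X, y, c=5.0):
    """Maximize J(lam) = sum_t [lam_t + log(1 - lam_t/c)] - 1/2 sum_{t,s} lam_t lam_s y_t y_s x_t.x_s
    subject to 0 <= lam_t < c and sum_t lam_t y_t = 0 (a sketch, via SLSQP)."""
    T = len(y)
    K = X @ X.T                                        # linear kernel x_t . x_s
    Yk = (y[:, None] * y[None, :]) * K

    def neg_J(lam):
        return -(np.sum(lam + np.log(1.0 - lam / c)) - 0.5 * lam @ Yk @ lam)

    res = minimize(neg_J, x0=np.full(T, 1e-3), method="SLSQP",
                   bounds=[(0.0, c * (1 - 1e-6))] * T,
                   constraints=[{"type": "eq", "fun": lambda lam: lam @ y}])
    return res.x

# Synthetic usage: two Gaussian blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+1, 1, (20, 2)), rng.normal(-1, 1, (20, 2))])
y = np.hstack([np.ones(20), -np.ones(20)])
lam = med_svm_dual(X, y)
print("support vectors:", np.where(lam > 1e-4)[0])
```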

SVM extensions
· example: Leptograpsus crabs (5 inputs, T_train = 80, T_test = 120)
· f(x) = log [ P_+(x) / P_−(x) ] + b with P_±(x) = Normal(x; m_±, V_±): a quadratic classifier
· Q(V_+, V_−) = distribution over the kernel width
[figure: decision boundaries of the MRE Gaussian, linear SVM, and maximum-likelihood Gaussian classifiers]

Using generative models
· generative models P_+(x), P_−(x) for y = +1, −1
· f(x) = log [ P_+(x) / P_−(x) ] + b
· learn Q_MRE(P_+, P_−, b, γ)
· if Q_0(P_+, P_−, b, γ) = Q_0(P_+) Q_0(P_−) Q_0(b) Q_0(γ)
  then Q_MRE(P_+, P_−, b, γ) = Q_MRE(P_+) Q_MRE(P_−) Q_MRE(b) Q_MRE(γ)
  (factored prior → factored posterior)
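A minimal sketch of this discriminant with Gaussian class models; the parameters here are plain point estimates (MED would instead maintain a distribution over them), so it only illustrates the functional form f(x) = log P_+(x)/P_−(x) + b.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian(X):
    """Point-estimate a class-conditional Gaussian (MED would average over these)."""
    return multivariate_normal(mean=X.mean(axis=0),
                               cov=np.cov(X.T) + 1e-6 * np.eye(X.shape[1]))

def log_ratio_discriminant(X_pos, X_neg, b=0.0):
    P_pos, P_neg = fit_gaussian(X_pos), fit_gaussian(X_neg)
    return lambda x: P_pos.logpdf(x) - P_neg.logpdf(x) + b   # f(x) = log P+(x)/P-(x) + b

# Usage on synthetic 2-D data:
rng = np.random.default_rng(2)
X_pos = rng.normal(+1.0, 1.0, (50, 2))
X_neg = rng.normal(-1.0, 1.0, (50, 2))
f = log_ratio_discriminant(X_pos, X_neg)
print(np.sign(f(np.array([0.8, 0.5]))))   # predicted label +1/-1
```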

Examples: other distributions
· multinomial (1 discrete variable) ✓
· graphical model (fixed structure, no hidden variables) ✓
· tree graphical model (Q over structures and parameters) ✓

Tree graphical models
· P(x | E, θ) = P_0(x) Π_{uv ∈ E} P_uv(x_u, x_v | θ_uv)
· prior Q_0(P) = Q_0(E) Q_0(θ | E): conjugate prior over E and θ
· Q_0(E) ∝ Π_{uv ∈ E} β_uv
· Q_0(θ | E) = conjugate prior
· Q_MRE(P) ∝ W_0 Π_{uv ∈ E} W_uv, and can be integrated analytically
[figure: example tree edge sets E]
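The analytic integration relies on sums over all spanning trees having a closed form; the sketch below shows only that ingredient, the matrix-tree theorem applied to a hypothetical symmetric weight matrix W, not the full Q_MRE computation.

```python
import numpy as np

def sum_over_spanning_trees(W):
    """Sum of prod_{uv in E} W_uv over all spanning trees E of a complete graph,
    via the matrix-tree theorem: any cofactor of the weighted Laplacian."""
    W = np.asarray(W, dtype=float)
    np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(axis=1)) - W          # weighted graph Laplacian
    return np.linalg.det(L[1:, 1:])         # principal minor = sum over all trees

# Hypothetical symmetric edge weights W_uv for 4 variables:
W = np.array([[0, 2, 1, 1],
              [2, 0, 3, 1],
              [1, 3, 0, 2],
              [1, 1, 2, 0]], dtype=float)
print(sum_over_spanning_trees(W))
```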

Trees: experiments
· splice junction classification task: 25 inputs, 400 training examples
· compared with maximum-likelihood trees
· ML: err = 14%; MaxEnt: err = 12.3%

Trees: experiments (contd)
[figure: weights of the tree edges]

Discrimination tasks
· classification
· classification with partially labeled data
· anomaly detection
[figure: illustrative scatter plots for the three tasks]

Partially labeled data
· Problem: given F, a family of discriminants, and data set D = { (x_1, y_1), …, (x_T, y_T), x_{T+1}, …, x_N }
  find Q(f, γ, y) = argmin_Q KL(Q || Q_0)
  s.t. ⟨ y_t f(x_t) − γ ⟩_Q ≥ 0 for all t = 1, …, T   (C)

Partially labeled data: experiment
· splice junction classification, 25 inputs, T_total = 1000
[figure: test performance with complete data, with 10% labeled + 90% unlabeled, and with 10% labeled only]

Anomaly detection
· Problem: given P = { P }, a family of generative models, and data set D = { x_1, …, x_T }
  find Q(P, γ) = argmin_Q KL(Q || Q_0)
  s.t. ⟨ log P(x_t) − γ ⟩_Q ≥ 0 for all t = 1, …, T   (C)

Anomaly detection: experiments
[figure: estimated densities, MaxEnt vs. maximum likelihood]

Anomaly detection: experiments (contd)
[figure: estimated densities, MaxEnt vs. maximum likelihood]

Conclusions
· a new framework for classification
· based on regularization in the space of distributions
· enables the use of generative models
· enables the use of priors
· generalizes to other discrimination tasks