1
Maximum Entropy Discrimination
Tommi Jaakkola (MIT), Marina Meila (CMU), Tony Jebara (MIT)
2
Classification
· inputs x, class labels y = +1, -1
· data D = { (x_1, y_1), ..., (x_T, y_T) }
· learn a discriminant function f_opt(x) from F = {f}, a family of discriminants
· classify: y = sign f_opt(x)
3
Model averaging
· many f in F have near-optimal performance
· instead of choosing f_opt, average over all f in F (see the sketch below)
· Q(f) = weight of f
· y(x) = sign ∫_F Q(f) f(x) df = sign ⟨ f(x) ⟩_Q
· to specify: F = { f }, a family of discriminant functions
· to learn: Q(f), a distribution over F
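A minimal sketch (not from the slides) of what averaging over F under Q means computationally, with a finite sample of linear discriminants standing in for the continuous family; the discriminants, weights, and test point below are illustrative assumptions.

```python
import numpy as np

# Model averaging over a *finite* set of discriminants as a stand-in for the
# continuous family F: f_k(x) = w_k . x with weight q[k] playing the role of Q(f).

rng = np.random.default_rng(0)
d, K = 5, 100
W = rng.normal(size=(K, d))          # K candidate linear discriminants
q = np.ones(K) / K                   # Q(f): uniform weights as a starting point

def predict(x, W, q):
    """y(x) = sign of the Q-weighted average discriminant value."""
    avg = q @ (W @ x)                # sum_k q[k] * f_k(x)
    return np.sign(avg)

x = rng.normal(size=d)
print(predict(x, W, q))
```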
4
Goal of this work
· define a discriminative criterion for averaging over models
Advantages
· can incorporate a prior
· can use generative models
· computationally feasible
· generalizes to other discrimination tasks
5
Maximum Entropy Discrimination
given a data set D = { (x_1, y_1), ..., (x_T, y_T) }, find
  Q_ME = argmax_Q H(Q)
  s.t. ⟨ y_t f(x_t) ⟩_Q ≥ γ for all t = 1, ..., T   (C)
for some margin γ > 0
· the solution Q_ME correctly classifies D
· among all admissible Q, Q_ME has maximum entropy
· maximum entropy = least specific about f
6
· convex problem: Q_ME is unique
· solution: Q_ME(f) ∝ exp{ Σ_{t=1..T} λ_t y_t f(x_t) }
· λ_t ≥ 0 are Lagrange multipliers
· finding Q_ME: start with λ = 0 and follow the gradient of the unsatisfied constraints
Solution: Q_ME as a projection
[figure: Q_ME as the projection of the uniform prior Q_0 (λ = 0) onto the set of admissible Q]
7
Finding the solution
· need λ_t, t = 1, ..., T
· obtained by solving the dual problem
  max_λ J(λ) = max_λ [ − log Z_+ − log Z_− + γ Σ_t λ_t ]   s.t. λ_t ≥ 0 for t = 1, ..., T
Algorithm
· start with λ_t = 0 (the uniform distribution)
· iterative ascent on J(λ) until convergence (sketched below)
· derivative: ∂J/∂λ_t = γ − y_t ⟨ log [ P_+(x_t) / P_−(x_t) ] ⟩_{Q(P)}
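A minimal sketch of the iterative ascent above, instantiated (as an assumption, for concreteness) with the Gaussian-prior linear discriminant f(x) = θ·x covered by the SVM theorem later on, so that the required expectation has the closed form ⟨ y_t f(x_t) ⟩_Q = y_t x_t·⟨θ⟩ with ⟨θ⟩ = Σ_s λ_s y_s x_s. The toy data, margin value, and step size are illustrative.

```python
import numpy as np

# Projected gradient ascent on the dual J(lambda), starting from lambda = 0 and
# following the gradient of the unsatisfied constraints, for a separable toy set.

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2)) + np.array([2.0, 0.0]) * rng.choice([-1, 1], 20)[:, None]
y = np.sign(X[:, 0])                        # labels: separable by the first coordinate
gamma = 1.0                                 # fixed margin (assumed)
lr = 1.0 / np.sum(X ** 2)                   # conservative step size

lam = np.zeros(len(y))                      # lambda = 0 <=> uniform (prior) Q
for _ in range(5000):
    mean_theta = (lam * y) @ X              # <theta>_Q under the Gaussian prior
    grad = gamma - y * (X @ mean_theta)     # dJ/dlambda_t = gamma - <y_t f(x_t)>_Q
    lam = np.maximum(0.0, lam + lr * grad)  # ascend, then project onto lambda_t >= 0

print("active constraints (margin points):", np.flatnonzero(lam > 1e-6))
```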
8
Q_ME as a sparse solution
· classification rule: y(x) = sign ⟨ f(x) ⟩_{Q_ME}
· γ is the classification margin
· λ_t > 0 only for ⟨ y_t f(x_t) ⟩_Q = γ, i.e. x_t lies on the margin (a support vector!)
9
Q_ME as regularization
· uniform distribution Q_0 ⟺ λ = 0
· "smoothness" of Q = H(Q)
· Q_ME is the smoothest admissible distribution
[figure: Q(f) as a function of f, showing the flat prior Q_0, the solution Q_ME, and the single point f_opt]
10
Goal of this work
· define a discriminative criterion for averaging over models ✓
Extensions
· incorporate a prior
· relationship to support vectors
· use generative models
· generalize to other discrimination tasks
11
Priors
· prior Q_0(f)
· Minimum Relative Entropy Discrimination:
  Q_MRE = argmin_Q KL( Q || Q_0 )
  s.t. ⟨ y_t f(x_t) ⟩_Q ≥ γ for all t = 1, ..., T   (C)
· a prior on γ → learn Q_MRE(f, γ) → soft margin
[figure: Q_MRE as the KL-projection of the prior Q_0 onto the set of admissible Q]
12
Soft margins
· average also over the margin γ
· define Q_0(f, γ) = Q_0(f) Q_0(γ)
· constraints: ⟨ y_t f(x_t) − γ_t ⟩_{Q(f,γ)} ≥ 0
· learn Q_MRE(f, γ) = Q_MRE(f) Q_MRE(γ)
· margin prior: Q_0(γ) = c exp[ c(γ − 1) ] for γ ≤ 1 (see the computation below)
[figure: the induced potential as a function of γ]
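For context, a short computation (not on the slide, but a standard step) showing what this margin prior contributes to the dual: each multiplier λ_t picks up the factor

  Z_γ(λ_t) = ∫_{−∞}^{1} c e^{c(γ−1)} e^{−λ_t γ} dγ = c e^{−λ_t} / (c − λ_t),   valid for λ_t < c,

so that

  −log Z_γ(λ_t) = λ_t + log(1 − λ_t/c),

which is exactly the per-example penalty term that appears in J(λ) on the next slide.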
13
Examples: support vector machines
· Theorem: for f(x) = θ·x + b, Q_0(θ) = Normal(0, I), and Q_0(b) a non-informative prior, the Lagrange multipliers are obtained by maximizing J(λ) subject to 0 ≤ λ_t ≤ c and Σ_t λ_t y_t = 0, where
  J(λ) = Σ_t [ λ_t + log(1 − λ_t/c) ] − 1/2 Σ_{t,s} λ_t λ_s y_t y_s x_t·x_s
  (a sketch of this optimization follows below)
· separable D → SVM recovered exactly
· inseparable D → SVM recovered with a different misclassification penalty
· adaptive kernel SVM ...
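A hedged sketch of the theorem above: it hands J(λ), the box constraints 0 ≤ λ_t ≤ c, and the equality constraint Σ_t λ_t y_t = 0 to a generic constrained optimizer. The toy data, the value of c, and the choice of SLSQP are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import minimize

# Maximize J(lam) = sum_t [lam_t + log(1 - lam_t/c)] - 1/2 lam' K lam
# with K_{ts} = y_t y_s x_t . x_s, subject to 0 <= lam_t < c and sum_t lam_t y_t = 0.

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=+1.5, size=(15, 2)), rng.normal(loc=-1.5, size=(15, 2))])
y = np.hstack([np.ones(15), -np.ones(15)])
c = 5.0                                              # assumed penalty level
K = (y[:, None] * X) @ (y[:, None] * X).T            # y_t y_s x_t . x_s

def neg_J(lam):
    return -(np.sum(lam + np.log(1.0 - lam / c)) - 0.5 * lam @ K @ lam)

res = minimize(
    neg_J,
    x0=np.full(len(y), 1e-3),                        # feasible interior start
    method="SLSQP",
    bounds=[(0.0, c * (1.0 - 1e-9))] * len(y),       # keep log(1 - lam/c) finite
    constraints=[{"type": "eq", "fun": lambda lam: lam @ y}],
)
lam = res.x
theta = (lam * y) @ X                                # posterior mean <theta>_Q
print("posterior mean weights:", theta)
print("largest multipliers:", np.sort(lam)[-5:])
```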
14
SVM extensions
· example: Leptograpsus crabs (5 inputs, T_train = 80, T_test = 120)
· f(x) = log [ P_+(x) / P_−(x) ] + b with P_±(x) = Normal(x; m_±, V_±) → quadratic classifier
· Q(V_+, V_−) = distribution over the "kernel width"
[figure: test performance of the MRE Gaussian classifier, a linear SVM, and the maximum-likelihood Gaussian classifier]
15
Using generative models
· generative models P_+(x), P_−(x) for y = +1, −1
· f(x) = log [ P_+(x) / P_−(x) ] + b (see the sketch below)
· learn Q_MRE(P_+, P_−, b, γ)
· if the prior factors, Q_0(P_+, P_−, b, γ) = Q_0(P_+) Q_0(P_−) Q_0(b) Q_0(γ), then so does the posterior:
  Q_MRE(P_+, P_−, b, γ) = Q_ME(P_+) Q_ME(P_−) Q_MRE(b) Q_MRE(γ)
(factored prior → factored posterior)
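A minimal sketch of the discriminant f(x) = log[ P_+(x) / P_−(x) ] + b with Gaussian class-conditional models. For brevity the models here are fit by maximum likelihood (point estimates); the MRE approach above instead places a distribution Q over P_+, P_−, b and averages f. The toy data and b = 0 are assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)
Xpos = rng.normal(loc=+1.0, size=(50, 2))            # class +1 samples
Xneg = rng.normal(loc=-1.0, size=(50, 2))            # class -1 samples

def fit_gaussian(X):
    return X.mean(axis=0), np.cov(X, rowvar=False)

m_pos, V_pos = fit_gaussian(Xpos)
m_neg, V_neg = fit_gaussian(Xneg)

def f(x, b=0.0):
    # log-likelihood ratio; quadratic in x because the two covariances differ
    return (multivariate_normal.logpdf(x, m_pos, V_pos)
            - multivariate_normal.logpdf(x, m_neg, V_neg) + b)

x = np.array([0.3, -0.2])
print("predicted class:", int(np.sign(f(x))))
```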
16
Examples: other distributions
· Multinomial (1 discrete variable) ✓
· Graphical model (fixed structure, no hidden variables) ✓
· Tree graphical model (Q over structures and parameters) ✓
17
Tree graphical models
· P(x | E, θ) = P_0(x) Π_{uv ∈ E} P_uv(x_u, x_v | θ_uv)
· prior Q_0(P) = Q_0(E) Q_0(θ | E), a conjugate prior over both E and θ
  · Q_0(E) = Π_{uv ∈ E} w_uv
  · Q_0(θ | E) = conjugate prior
· Q_MRE(P) ∝ W_0 Π_{uv ∈ E} W_uv and can be integrated analytically (see the sketch below)
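For context, a sketch of the kind of analytic step involved: in related work on Bayesian averaging over tree structures, the sum of Π_{uv ∈ E} W_uv over all spanning trees E is computed with the weighted matrix-tree theorem. The weight matrix below is a hypothetical stand-in for the per-edge factors W_uv on this slide.

```python
import numpy as np

# Weighted matrix-tree theorem: the sum over all spanning trees E of
# prod_{uv in E} W_uv equals any cofactor of the Laplacian L = diag(row sums) - W.

def sum_over_spanning_trees(W):
    L = np.diag(W.sum(axis=1)) - W
    return np.linalg.det(L[1:, 1:])      # delete one row and column, take the determinant

W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])
# Three spanning trees on 3 nodes: {01,02}, {01,12}, {02,12} => 2*1 + 2*3 + 1*3 = 11
print(sum_over_spanning_trees(W))        # 11.0 (up to floating point)
```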
18
Trees: experiments
· splice junction classification task: 25 inputs, 400 training examples
· compared with maximum-likelihood trees
[figure: test error, maximum likelihood 14% vs. MaxEnt 12.3%]
19
Trees: experiments (cont'd)
[figure: weights of the tree edges]
20
Discrimination tasks
· classification
· classification with partially labeled data
· anomaly detection
[figure: illustrative scatter plots of the three tasks, with labeled (+/−) and unlabeled (x) points]
21
Partially labeled data
· problem: given F, a family of discriminants, and a data set D = { (x_1, y_1), ..., (x_T, y_T), x_{T+1}, ..., x_N }, find
  Q(f, γ, y) = argmin_Q KL( Q || Q_0 )
  s.t. ⟨ y_t f(x_t) − γ_t ⟩_Q ≥ 0 for all t = 1, ..., N   (C)
  where the unobserved labels y_{T+1}, ..., y_N are averaged over under Q
22
Partially labeled data: experiment
· splice junction classification: 25 inputs, T_total = 1000
[figure: classification performance with complete data, with 10% labeled + 90% unlabeled, and with 10% labeled only]
23
Anomaly detection
· problem: given P = { P }, a family of generative models, and a data set D = { x_1, ..., x_T }, find Q(P, γ) such that
  Q(P, γ) = argmin_Q KL( Q || Q_0 )
  s.t. ⟨ log P(x_t) − γ_t ⟩_Q ≥ 0 for all t = 1, ..., T   (C)
(a sketch of the constrained quantity follows below)
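A minimal sketch of the quantity the constraint controls: the log-likelihood log P(x_t), which is required (in expectation under Q) to exceed the margin γ_t. Here a single Gaussian is fit by maximum likelihood, the baseline in the experiments that follow; the MED solution averages over Q(P, γ) instead. The data and the threshold choice are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))                        # nominal training data
m, V = X.mean(axis=0), np.cov(X, rowvar=False)       # maximum-likelihood Gaussian

# choose gamma so that 95% of the training points satisfy log P(x_t) >= gamma
gamma = np.quantile(multivariate_normal.logpdf(X, m, V), 0.05)

def is_anomaly(x):
    return multivariate_normal.logpdf(x, m, V) < gamma

print(is_anomaly(np.array([0.1, -0.3])), is_anomaly(np.array([6.0, 6.0])))
```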
24
Anomaly detection: experiments
[figure: MaxEnt vs. maximum-likelihood results]
25
Anomaly detection: experiments (cont'd)
[figure: MaxEnt vs. maximum-likelihood results]
26
Conclusions
· a new framework for classification
· based on regularization in the space of distributions
· enables the use of generative models
· enables the use of priors
· generalizes to other discrimination tasks