(recap) Kernel Perceptron

Slides:



Advertisements
Similar presentations
Introduction to Support Vector Machines (SVM)
Advertisements

Why does it work? We have not addressed the question of why does this classifier performs well, given that the assumptions are unlikely to be satisfied.
Lecture 9 Support Vector Machines
SVM - Support Vector Machines A new classification method for both linear and nonlinear data It uses a nonlinear mapping to transform the original training.
A KTEC Center of Excellence 1 Pattern Analysis using Convex Optimization: Part 2 of Chapter 7 Discussion Presenter: Brian Quanz.
SVM—Support Vector Machines
Support vector machine
CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.
Separating Hyperplanes
Support Vector Machines (and Kernel Methods in general)
University of Texas at Austin Machine Learning Group Department of Computer Sciences University of Texas at Austin Support Vector Machines.
Prénom Nom Document Analysis: Linear Discrimination Prof. Rolf Ingold, University of Fribourg Master course, spring semester 2008.
Constrained Optimization Rong Jin. Outline  Equality constraints  Inequality constraints  Linear Programming  Quadratic Programming.
Sketched Derivation of error bound using VC-dimension (1) Bound our usual PAC expression by the probability that an algorithm has 0 error on the training.
Binary Classification Problem Learn a Classifier from the Training Set
Support Vector Machines
Unconstrained Optimization Problem
1 Computational Learning Theory and Kernel Methods Tianyi Jiang March 8, 2004.
Lecture 10: Support Vector Machines
Greg GrudicIntro AI1 Support Vector Machine (SVM) Classification Greg Grudic.
Support Vector Machines
Support Vector Machines Piyush Kumar. Perceptrons revisited Class 1 : (+1) Class 2 : (-1) Is this unique?
Machine Learning Week 4 Lecture 1. Hand In Data Is coming online later today. I keep test set with approx test images That will be your real test.
SVM by Sequential Minimal Optimization (SMO)
Support Vector Machine & Image Classification Applications
CS 8751 ML & KDDSupport Vector Machines1 Support Vector Machines (SVMs) Learning mechanism based on linear programming Chooses a separating plane based.
计算机学院 计算感知 Support Vector Machines. 2 University of Texas at Austin Machine Learning Group 计算感知 计算机学院 Perceptron Revisited: Linear Separators Binary classification.
Stochastic Subgradient Approach for Solving Linear Support Vector Machines Jan Rupnik Jozef Stefan Institute.
CS Statistical Machine learning Lecture 18 Yuan (Alan) Qi Purdue CS Oct
1 Chapter 6. Classification and Prediction Overview Classification algorithms and methods Decision tree induction Bayesian classification Lazy learning.
Computational Intelligence: Methods and Applications Lecture 23 Logistic discrimination and support vectors Włodzisław Duch Dept. of Informatics, UMK Google:
Machine Learning Weak 4 Lecture 2. Hand in Data It is online Only around 6000 images!!! Deadline is one week. Next Thursday lecture will be only one hour.
START OF DAY 5 Reading: Chap. 8. Support Vector Machine.
CISC667, F05, Lec22, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) Support Vector Machines I.
CS 478 – Tools for Machine Learning and Data Mining SVM.
Kernel Methods: Support Vector Machines Maximum Margin Classifiers and Support Vector Machines.
Sparse Kernel Methods 1 Sparse Kernel Methods for Classification and Regression October 17, 2007 Kyungchul Park SKKU.
1  The Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.
Support Vector Machines
1  Problem: Consider a two class task with ω 1, ω 2   LINEAR CLASSIFIERS.
Support Vector Machines Tao Department of computer science University of Illinois.
Support Vector Machines. Notation Assume a binary classification problem. –Instances are represented by vector x   n. –Training examples: x = (x 1,
Final Exam Review CS479/679 Pattern Recognition Dr. George Bebis 1.
Lecture notes for Stat 231: Pattern Recognition and Machine Learning 1. Stat 231. A.L. Yuille. Fall 2004 Linear Separation and Margins. Non-Separable and.
Support Vector Machines Reading: Ben-Hur and Weston, “A User’s Guide to Support Vector Machines” (linked from class web page)
Greg GrudicIntro AI1 Support Vector Machine (SVM) Classification Greg Grudic.
Kernel Methods: Support Vector Machines Maximum Margin Classifiers and Support Vector Machines.
Support Vector Machine: An Introduction. (C) by Yu Hen Hu 2 Linear Hyper-plane Classifier For x in the side of o : w T x + b  0; d = +1; For.
Approximation Algorithms based on linear programming.
Page 1 CS 546 Machine Learning in NLP Review 2: Loss minimization, SVM and Logistic Regression Dan Roth Department of Computer Science University of Illinois.
Support Vector Machines Reading: Textbook, Chapter 5 Ben-Hur and Weston, A User’s Guide to Support Vector Machines (linked from class web page)
1 Support Vector Machines: Maximum Margin Classifiers Machine Learning and Pattern Recognition: September 23, 2010 Piotr Mirowski Based on slides by Sumit.
Why does it work? We have not addressed the question of why does this classifier performs well, given that the assumptions are unlikely to be satisfied.
Administration HW4 is due on Saturday 3/11
Support vector machines
PREDICT 422: Practical Machine Learning
Support Vector Machine
LINEAR CLASSIFIERS The Problem: Consider a two class task with ω1, ω2.
Large Margin classifiers
Dan Roth Department of Computer and Information Science
Support Vector Machines
Support Vector Machines Introduction to Data Mining, 2nd Edition by
Statistical Learning Dong Liu Dept. EEIS, USTC.
Large Scale Support Vector Machines
Support vector machines
Support Vector Machines
CIS 519/419 Applied Machine Learning
Support vector machines
Support vector machines
Constraints.
Presentation transcript:

(recap) Kernel Perceptron If n’ is large, we cannot represent w explicitly. However, the weight vector w can be written as a linear combination of examples: Where 𝛼 𝑗 is the number of mistakes made on 𝑥 (𝑗) Then we can compute f(x) based on {𝑥 (𝑗) } and 𝜶

(recap) Kernel Perceptron In the training phase, we initialize 𝜶 to be an all-zeros vector. For training sample (𝑥 (𝑘) , 𝑦 (𝑘) ), instead of using the original Perceptron update rule in the 𝑅 𝑛 ′ space we maintain 𝜶 by based on the relationship between w and 𝜶 :

Data Dependent VC dimension Consider the class of linear functions that separate the data S with margin °. Note that although both classifiers (w’s) separate the data, they do it with a different margin. Intuitively, we can agree that: Large Margin  Small VC dimension

Generalization

Margin of a Separating Hyperplane A separating hyperplane: wT x+b = 0 Assumption: data is linear separable Distance between wT x+b = +1 and -1 is 2 / ||w|| Idea: Consider all possible w with different angles Scale w such that the constraints are tight Pick the one with largest margin wT xi +b¸ 1 if yi = 1 wT xi +b· -1 if yi = -1 => 𝑦 𝑖 ( 𝑤 𝑇 𝑥 𝑖 +𝑏)≥1 wT x+b = 1 wT x+b = 0 wT x+b = -1

Maximal Margin min 𝑤,𝑏 1 2 𝑤 𝑇 𝑤 The margin of a linear separator wT x+b = 0 is 2 / ||w|| max 2 / ||w|| = min ||w|| = min ½ wTw min 𝑤,𝑏 1 2 𝑤 𝑇 𝑤 s.t y i (w T x i +𝑏)≥1, ∀ 𝑥 𝑖 , 𝑦 𝑖 ∈𝑆

Margin and VC dimension Theorem (Vapnik): If H° is the space of all linear classifiers in <n that separate the training data with margin at least °, then VC(H°) · R2/°2 where R is the radius of the smallest sphere (in <n) that contains the data. This is the first observation that will lead to an algorithmic approach. The second observation is that: Small ||w||  Large Margin Consequently: the algorithm will be: from among all those w’s that agree with the data, find the one with the minimal size ||w||

Hard SVM Optimization min 𝑤 1 2 𝑤 𝑇 𝑤 We have shown that the sought after weight vector w is the solution of the following optimization problem: SVM Optimization: (***) This is an optimization problem in (n+1) variables, with |S|=m inequality constraints. min 𝑤 1 2 𝑤 𝑇 𝑤 s.t y i ( w T x i +𝑏)≥1, ∀ 𝑥 𝑖 , 𝑦 𝑖 ∈𝑆

Support Vector Machines The name “Support Vector Machine” stems from the fact that w* is supported by (i.e. is the linear span of) the examples that are exactly at a distance 1/||w*|| from the separating hyperplane. These vectors are therefore called support vectors. Theorem: Let w* be the minimizer of the SVM optimization problem (***) for S = {(xi, yi)}. Let I= {i: yi (w*Txi + b) = 1}. Then there exists coefficients ®i > 0 such that: w* = i 2 I ®i yi xi This representation should ring a bell…

Recap: Perceptron in Dual Rep. If n’ is large, we cannot represent w explicitly. However, the weight vector w can be written as a linear combination of examples: Where 𝛼 𝑗 is the number of mistakes made on 𝑥 (𝑗) Then we can compute f(x) based on {𝑥 (𝑗) } and 𝜶

Duality This, and other properties of Support Vector Machines are shown by moving to the dual problem. Theorem: Let w* be the minimizer of the SVM optimization problem (***) for S = {(xi, yi)}. Let I= {i: yi (w*Txi +b)= 1}. Then there exists coefficients ®i >0 such that: w* = i 2 I ®i yi xi

Footnote about the threshold Similar to Perceptron, we can augment vectors to handle the bias term 𝑥 ⇐ 𝑥 , 1 ; 𝑤 ⇐ 𝑤 , 𝑏 so that 𝑤 𝑇 𝑥 = 𝑤 𝑇 𝑥+𝑏 Then consider the following formulation min 𝑤 1 2 𝑤 𝑇 𝑤 s.t y i 𝑤 T 𝑥 i ≥1, ∀ 𝑥 𝑖 , 𝑦 𝑖 ∈S However, this formulation is slightly different from (***), because it is equivalent to min 𝑤,𝑏 1 2 𝑤 𝑇 𝑤+ 1 2 𝑏 2 s.t y i ( 𝑤 T x i +𝑏)≥1, ∀ 𝑥 𝑖 , 𝑦 𝑖 ∈S The bias term is included in the regularization. This usually doesn’t matter For simplicity, we ignore the bias term

Key Issues Computational Issues Is it really optimal? Training of an SVM used to be is very time consuming – solving quadratic program. Modern methods are based on Stochastic Gradient Descent and Coordinate Descent. Is it really optimal? Is the objective function we are optimizing the “right” one?

Real Data 17,000 dimensional context sensitive spelling Histogram of distance of points from the hyperplane In practice, even in the separable case, we may not want to depend on the points closest to the hyperplane but rather on the distribution of the distance. If only a few are close, maybe we can dismiss them. This applies both to generalization bounds and to the algorithm.

Soft SVM min 𝑤, 𝜉 𝑖 1 2 𝑤 𝑇 𝑤+𝐶 𝑖 𝜉 𝑖 Notice that the relaxation of the constraint: y i w T x i ≥1 Can be done by introducing a slack variable 𝜉 𝑖 (per example) and requiring: y i w T x i ≥1− 𝜉 𝑖 ; 𝜉 𝑖 ≥0 Now, we want to solve: min 𝑤, 𝜉 𝑖 1 2 𝑤 𝑇 𝑤+𝐶 𝑖 𝜉 𝑖 s.t y i w T x i ≥1− 𝜉 𝑖 ; 𝜉 𝑖 ≥0 ∀𝑖 What is the value of slack variable when the data point is classified correctly?

Soft SVM (2) min 𝑤, 𝜉 𝑖 1 2 𝑤 𝑇 𝑤+𝐶 𝑖 𝜉 𝑖 Now, we want to solve: Which can be written as: min 𝑤 1 2 𝑤 𝑇 𝑤+𝐶 𝑖 max (0, 1− 𝑦 𝑖 𝑤 𝑇 𝑥 𝑖 ) . What is the interpretation of this? min 𝑤, 𝜉 𝑖 1 2 𝑤 𝑇 𝑤+𝐶 𝑖 𝜉 𝑖 s.t 𝜉 𝑖 ≥1− y i w T x i ; 𝜉 𝑖 ≥0 ∀𝑖 min 𝑤, 𝜉 𝑖 1 2 𝑤 𝑇 𝑤+𝐶 𝑖 𝜉 𝑖 s.t y i w T x i ≥1− 𝜉 𝑖 ; 𝜉 𝑖 ≥0 ∀𝑖 In optimum, ξ i = max (0, 1− y i w T x i )

Minw ½ ||w||2 + C (x,y) 2 S max(0, 1 – y wTx) Soft SVM (3) The hard SVM formulation assumes linearly separable data. A natural relaxation: maximize the margin while minimizing the # of examples that violate the margin (separability) constraints. However, this leads to non-convex problem that is hard to solve. Instead, move to a surrogate loss function that is convex. SVM relies on the hinge loss function (note that the dual formulation can give some intuition for that too). Minw ½ ||w||2 + C (x,y) 2 S max(0, 1 – y wTx) where the parameter C controls the tradeoff between large margin (small ||w||) and small hinge-loss.

SVM Objective Function A more general framework SVM: Minw ½ ||w||2 + C (x,y) 2 S max(0, 1 – y wTx) General Form of a learning algorithm: Minimize empirical loss, and Regularize (to avoid over fitting) Think about the implication of large C and small C Theoretically motivated improvement over the original algorithm we’ve see at the beginning of the semester. Regularization term Empirical loss Can be replaced by other regularization functions Can be replaced by other loss functions

Balance between regularization and empirical loss

Balance between regularization and empirical loss (DEMO)

Underfitting and Overfitting Model complexity Expected Error Underfitting Overfitting Variance Bias Simple models: High bias and low variance Complex models: High variance and low bias Smaller C Larger C

What Do We Optimize?

What Do We Optimize(2)? We get an unconstrained problem. We can use the gradient descent algorithm! However, it is quite slow. Many other methods Iterative scaling; non-linear conjugate gradient; quasi-Newton methods; truncated Newton methods; trust-region newton method. All methods are iterative methods, that generate a sequence wk that converges to the optimal solution of the optimization problem above. Currently: Limited memory BFGS is very popular

Optimization: How to Solve 1. Earlier methods used Quadratic Programming. Very slow. 2. The soft SVM problem is an unconstrained optimization problems. It is possible to use the gradient descent algorithm! Still, it is quite slow. Many options within this category: Iterative scaling; non-linear conjugate gradient; quasi-Newton methods; truncated Newton methods; trust-region newton method. All methods are iterative methods, that generate a sequence wk that converges to the optimal solution of the optimization problem above. Currently: Limited memory BFGS is very popular 3. 3rd generation algorithms are based on Stochastic Gradient Decent The runtime does not depend on n=#(examples); advantage when n is very large. Stopping criteria is a problem: method tends to be too aggressive at the beginning and reaches a moderate accuracy quite fast, but it’s convergence becomes slow if we are interested in more accurate solutions. 4. Dual Coordinated Descent (& Stochastic Version)

SGD for SVM Initialize 𝑤=0∈ 𝑅 𝑛 2. For every example x i , y i ∈𝐷 Goal: min 𝑤 𝑓 𝑤 ≡ 1 2 𝑤 𝑇 𝑤+ 𝐶 𝑚 𝑖 max 0, 1− 𝑦 𝑖 𝑤 𝑇 𝑥 𝑖 . m: data size Compute sub-gradient of 𝑓 𝑤 : 𝛻𝑓 𝑤 =𝑤−𝐶 𝑦 𝑖 𝑥 𝑖 if 1− 𝑦 𝑖 𝑤 𝑇 𝑥 𝑖 ≥0 ; otherwise 𝛻𝑓 𝑤 =𝑤 m is here for mathematical correctness, it doesn’t matter in the view of modeling. Initialize 𝑤=0∈ 𝑅 𝑛 2. For every example x i , y i ∈𝐷 If 𝑦 𝑖 𝑤 𝑇 𝑥 𝑖 ≤1 update the weight vector to 𝑤← 1−𝛾 𝑤+𝛾 𝐶𝑦 𝑖 𝑥 𝑖 (𝛾 - learning rate) Otherwise 𝑤←(1−𝛾)𝑤 Continue until convergence is achieved Convergence can be proved for a slightly complicated version of SGD (e.g, Pegasos) This algorithm should ring a bell…

Nonlinear SVM Primal: Dual: min 𝑤, 𝜉 𝑖 1 2 𝑤 𝑇 𝑤+𝐶 𝑖 𝜉 𝑖 We can map data to a high dimensional space: x→𝜙 𝑥 (DEMO) Then use Kernel trick: 𝐾 𝑥 𝑖 , 𝑥 𝑗 =𝜙 𝑥 𝑖 𝑇 𝜙 𝑥 𝑗 (DEMO2) Primal: min 𝑤, 𝜉 𝑖 1 2 𝑤 𝑇 𝑤+𝐶 𝑖 𝜉 𝑖 s.t y i w T 𝜙 𝑥 𝑖 ≥1− 𝜉 𝑖 𝜉 𝑖 ≥0 ∀𝑖 Dual: min 𝛼 1 2 𝛼 𝑇 Q𝛼− 𝑒 𝑇 𝛼 s.t 0≤𝛼≤𝐶 ∀𝑖 Q 𝑖𝑗 = 𝑦 𝑖 𝑦 𝑗 𝐾 𝑥 𝑖 , 𝑥 𝑗 Theorem: Let w* be the minimizer of the primal problem, 𝛼 ∗ be the minimizer of the dual problem. Then w ∗ = 𝑖 𝛼 ∗ y i x i

Nonlinear SVM Tradeoff between training time and accuracy Complex model v.s. simple model From: http://www.csie.ntu.edu.tw/~cjlin/papers/lowpoly_journal.pdf

At this point these work only for linear kernels PEGASOS At = S Subgradient method |At| = 1 Stochastic gradient LibSVM has a version that is called LibLinear that implements similar ideas in the dual space. At this point these work only for linear kernels Not a real problem. Subgradient Projection CS446-Spring 10

Support Vector Machines – Short Summary SVM = Linear Classifier + Regularization + [Kernel Trick]. Typically we are looking for the best w: wopt = argminw 1m L(f(x,w,b),y) Where L is a loss function that penalizes f when it makes a mistake. Typically, L is the 0-1 loss (hinge-loss): L(y’, y)= ½ |y – sgn y’| (y, y’ 2 {-1,1}) Just optimizing with respect to the training data may lead to overfitting. We want to achieve good performance on the training while maintaining some notion of “simple” hypothesis. We therefore change our goal to: wopt = argminw 1m L(f(x,w,b),y) +  R(w) Where L is a loss function, and R is a regularization function, which penalizes w that results in a model that is too complex (high VC dimension).  determines the tradeoff between the empirical error and the regularization.

Support Vector Machines – Short Version SVM = Linear Classifier + Regularization + [Kernel Trick]. This leads to an algorithm: from among all those w’s that agree with the data, find the one with the minimal size ||w|| Minimize ½ ||w||2 Subject to: y(w ¢ x + b) ¸ 1, for all x 2 S This is an optimization problem that can be solved using techniques from optimization theory. By introducing Lagrange multipliers  we can rewrite the dual formulation of this optimization problems as: w = i i yi xi Where the ’s are such that the following functional is maximized: L() = -1/2 i j 1 j xi xj yi yj + i i The optimum setting of the ’s turns out to be: i yi (w ¢ xi +b -1 ) = 0 8 i

Support Vector Machines – Short Version Dependence on the dot product leads to the ability to introduce kernels (just like in perceptron) What if the data is not linearly separable? What is the difference from regularized perceptron/Winnow? SVM = Linear Classifier + Regularization + [Kernel Trick]. Minimize ½ ||w||2 Subject to: y(w ¢ x + b) ¸ 1, for all x 2 S The dual formulation of this optimization problems gives: w = i i yi xi Optimum setting of ’s : i yi (w ¢ xi +b -1 ) = 0 8 i That is, i :=0 only when w ¢ xi +b -1=0 Those are the points sitting on the margin, call support vectors We get: f(x,w,b) = w ¢ x +b = i i yi xi ¢ x +b The value of the function depends on the support vectors, and only on their dot product with the point of interest x.

SVM Notes Chp. 5 of “Support Vector Machine” [Cristianini and Shawe-Taylor] provides an introduction to the theory of Optimization required in order to understand SVMs.

Optimization We are interested in finding the minimum or a maximum of a function, typically subject to some constraints. The general problem is: Minimize f(w), w 2 Rn Subject to gi(w) · 0, i=1,…k hi(w) = 0, i=1,…m An Inequality constraint gi(w) · 0 is called active if the solution w* satisfies gi(w*) = 0 An inequality constraint can be transformed into an equality constraint by introducing a slack variable: optimization gi(w) · 0 ´ gi(w) + i = 0,  ¸ 0 CS446-Spring 10

Optimization Problems An optimization problem in which the objective function and the constraints are linear is called a linear program. If the objective function is quadratic and the constraints linear it is called a quadratic program. If the objective function, the constraints and the domain of the functions are convex, that problem is a convex optimization problem. CS446-Spring 10

w u f(w) f(u) Convexity A real valued function f(w) is called convex for w 2 Rn if, 8 w,u 2 Rn, and for any a 2 (0,1): f(aw + (1-a)u) · af(w) + (1-a) f(u) Claim: If a function is convex, any local minimum w* of the unconstrained optimization problem with objective function f is also a global minimum. Proof: Take any u><w*. By the definition of the local minimum there exists a sufficiently close to 1 that f(w*) · f(aw*+ (1-a)u) · af(w*) + (1-a) f(u) Where the first inequality is a result of the assumption of local minimum and the second is from the convexity assumption. But that implies (algebra) that f(w*) · f(u) This is important. It renders optimization problems tractable when the functions involved are convex. CS446-Spring 10

Solving Optimization Problems (1) Theorem 1. A necessary condition for w* to be a minimum of f 2 C, is df(w*)/d(w) =0. This condition, together with convexity, is also a sufficient solution. CS446-Spring 10

Constrained Optimization Problems Solved by converting constrained problems to unconstrained problems. When the problems are constrained, there is a need to define the Lagrangian function, which incorporates information about both the objective function and the constraints, and which can be used to detect solutions. The Lagrangian is defined as the objective function plus a linear combination of the constraints, where the coefficients of the combination are call the Lagrange multipliers. Given an optimization problem with objective function f(w), and equality constraints hi(w)=0, i=1,…,m, we define the Lagrangian function as: L(w,) = f(w) + i=1m i hi(w) where the coefficients i are called the Lagrange multipliers. CS446-Spring 10

Constrained Optimization Problems (2) If w* is a local minimum for an optimization problem with only equality constraints, it is possible that df(w*)/d(w) >< 0 (so, w/o constraints it would not be a candidate for extreme point) but that the direction in which we could move to reduce f would violate one or more of the constraints. In order to respect the equality constraints hi we must move in directions that are perpendicular to dhi(w*)/d(w), that is they do not change the constraints. Therefore, in order to respect all the constraints we must move perpendicular to the subspace V spanned by: dhi(w*)/dw, (i =1,m) So, we make the constrained problem be an unconstrained one in which we require that there exists bi such that df(w*)/dw + i=1m i dhi(w*)/dw = 0 CS446-Spring 10

Constrained Optimization Problems (3) Theorem 2 (Lagrange): A necessary condition for w* to be a minimum for an optimization problem with objective f(w) and equality constraints hi (1,…,m) is: df(w*)/dw + i=1m i dhi(w*)/dw = 0 For some values *. These conditions are also sufficient provided that L(w,*) is also a convex function of w. Another way to write it is L(w*,)/dw = L(w*,)/dw =0 The first condition results in a new system of equations. The second, returns the equality constraints. By solving jointly the two systems, one obtains the solution. CS446-Spring 10

Lagrange Theorem: Example Consider the standard problem of finding the largest volume box with a given surface area. Let the sides of the box be u,v,w. Optimization problem Minimize: - uvw Subject to: wu+uv+vw=c The Lagrangian: L = -uvw + (uv+wu+vw-c) Necessary conditions: dL/dw= -uv+(u+v); dL/du= -wv+(w+v); dL/dv= -uw+(u+w) Solving the equations gives  v(w-u) =  w (u-v) = 0, implying that the only non trivial solution is w=v=u=(c/3)1/2 CS446-Spring 10

Example (Maximum Entropy Distribution) Suppose we wish to find the discrete probability distribution p= (p1,p2,…pn) with maximal entropy H(p) = -i=1n pi log{pi}. The optimization problem: minimize -H(p), subject to i=1n pi =1 L(p,) = i=1n pi log{pi} + (i=1n pi -1 ) Differentiating, we get: log{e pi) +  = 0 (i=1,…n) Which implies that all the pi s are the same p=(1/n…1/n). It is not hard to show that the function H(p) is convex, so the uniform distribution is the global solution. CS446-Spring 10

General Optimization Problems We now move to the general case, where the optimization problem contains both equality and inequality constraints. Define the generalize Lagrangian: L(w,, ) = f(w) + i=1k i gi(w) + i=1m i hi(w) Theorem3(Kuhn-Tucker):Given an optimization problem on Rn: Minimize f(w), w 2 Rn Subject to gi(w) · 0, i=1,…k hi(w) = 0, i=1,…m Necessary and sufficient conditions for w* to be an optimum are the existence of *, * such that: dL(w*, *, * )/dw = dL(w*, *, * )/d = 0 And i* gi(w*) = 0 i=1,…,k gi(w*) · 0 i=1,…,k i* ¸ 0 i=1,…,k CS446-Spring 10

General Optimization Problems (2) The last conditions are called KKT: Karush-Kuhn-Tucker conditions. i* gi(w*) = 0 i=1,…,k gi(w*) · 0 i=1,…,k i* ¸ 0 i=1,…,k It implies that solutions can be in one of two positions with respect to the inequality constraints. Either the solution is in the interior of the feasible region, so that the constraint is inactive (gi(w*) · 0) , or on the boundary defined by that constraints, which is active. In the first case, the i s need to be zero. In the second case, the constraint is active (gi(w*) = 0), and i can be non-zero. This condition will give rise to the notion of support vectors. CS446-Spring 10