CISC 841 Bioinformatics (Fall 2007) Kernel Based Methods (I)


1 CISC 841 Bioinformatics (Fall 2007) Kernel Based Methods (I)

2 Terminologies
An object x is represented by a set of m attributes xi, 1 ≤ i ≤ m.
A set of n training examples S = {(x1, y1), …, (xn, yn)}, where yi is the classification (or label) of instance xi. For binary classification, yi ∈ {−1, +1}; for k-class classification, yi ∈ {1, 2, …, k}. Without loss of generality, we focus on binary classification.
The task is to learn the mapping xi → yi.
A machine is a learned function/mapping/hypothesis h: xi → h(xi, θ), where θ stands for the parameters to be fixed during training.
Performance is measured as E = (1/2n) Σ i=1 to n |yi − h(xi, θ)|
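The error measure above is simply the fraction of misclassified examples, since each mistake contributes |yi − h(xi, θ)| = 2. A minimal Python sketch of it, using a placeholder hypothesis of my own rather than a trained machine:

import numpy as np

def empirical_error(h, X, y):
    # E = (1/2n) * sum_i |y_i - h(x_i)| for a +/-1-valued hypothesis h
    predictions = np.array([h(x) for x in X])
    return np.abs(y - predictions).sum() / (2 * len(y))

h = lambda x: 1 if x.sum() > 0 else -1          # placeholder for h(x, theta)
X = np.array([[1.0, 2.0], [-1.0, -3.0], [0.5, -1.0]])
y = np.array([+1, -1, +1])
print(empirical_error(h, X, y))                 # 1/3: one of the three examples is misclassified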

3 Linear SVMs: find a hyperplane (specified by its normal vector w and bias b, which sets the perpendicular distance |b|/||w|| from the origin) that separates the positive and negative examples with the largest margin γ:
    w · xi + b > 0 if yi = +1
    w · xi + b < 0 if yi = −1
An unknown x is classified as sign(w · x + b).
[Figure: the separating hyperplane (w, b), the margin γ, and the origin.]

4 Rosenblatt’s Algorithm (1956)
// η is the learning rate
w0 = 0; b0 = 0; k = 0
R = max 1 ≤ i ≤ n ||xi||
error = 1;                          // flag for misclassification/mistake
while (error) {                     // as long as a modification is made in the for-loop
    error = 0;
    for (i = 1 to n) {
        if (yi (⟨wk · xi⟩ + bk) ≤ 0) {   // misclassification
            wk+1 = wk + η yi xi          // update the weight
            bk+1 = bk + η yi R²          // update the bias (R², as in the dual form below)
            k = k + 1
            error = 1;
        }
    }
}
return (wk, bk)   // a hyperplane that separates the data, where k is the number of modifications
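A runnable Python transcription of this pseudocode (a sketch: the R² factor in the bias update follows the dual form on a later slide, and max_epochs is my own safety cap):

import numpy as np

def perceptron(X, y, eta=1.0, max_epochs=1000):
    # Primal perceptron: X is an (n, m) array of examples, y a vector of +/-1 labels.
    n, m = X.shape
    w, b, k = np.zeros(m), 0.0, 0
    R = np.max(np.linalg.norm(X, axis=1))       # radius of the data
    for _ in range(max_epochs):                 # stands in for while(error)
        error = False
        for i in range(n):
            if y[i] * (w @ X[i] + b) <= 0:      # x_i is misclassified
                w = w + eta * y[i] * X[i]       # update the weight
                b = b + eta * y[i] * R**2       # update the bias
                k += 1
                error = True
        if not error:                           # a full pass without mistakes: converged
            break
    return w, b, k                              # k is the number of modifications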

5 Questions w.r.t. Rosenblatt’s algorithm
Is the algorithm guaranteed to converge? How quickly does it converge?
Novikoff Theorem: Let S be a training set of size n and R = max 1 ≤ i ≤ n ||xi||. If there exists a vector w* such that ||w*|| = 1 and yi (w* · xi) ≥ γ for 1 ≤ i ≤ n, then the number of modifications before convergence is at most (R/γ)².

6 Proof of Novikoff's Theorem
Proof:
1. wt · w* = wt−1 · w* + η yi (xi · w*) ≥ wt−1 · w* + η γ   ⟹   wt · w* ≥ t η γ
2. ||wt||² = ||wt−1||² + 2 η yi (xi · wt−1) + η² ||xi||² ≤ ||wt−1||² + η² ||xi||² ≤ ||wt−1||² + η² R²   ⟹   ||wt||² ≤ t η² R²
3. √t η R ≥ ||wt|| ≥ wt · w* ≥ t η γ (since ||w*|| = 1)   ⟹   t ≤ (R/γ)²
Notes:
Without loss of generality, the separating hyperplane is assumed to pass through the origin, i.e., no bias b is necessary.
The learning rate η has no bearing on this upper bound.
What if the training data are not linearly separable, i.e., w* does not exist?

7 Dual form
The final hypothesis w is a linear combination of the training points:
    w = Σ i=1 to n αi yi xi
where the αi are non-negative values proportional to the number of times misclassification of xi has caused the weight to be updated. The vector α can be considered an alternative representation of the hypothesis; αi can be regarded as an indication of the information content of the example xi.
The decision function can be rewritten as
    h(x) = sign(w · x + b) = sign((Σ j=1 to n αj yj xj) · x + b) = sign(Σ j=1 to n αj yj (xj · x) + b)

8 Rosenblatt’s Algorithm in dual form
α = 0; b = 0
R = max 1 ≤ i ≤ n ||xi||
error = 1;                          // flag for misclassification
while (error) {                     // as long as a modification is made in the for-loop
    error = 0;
    for (i = 1 to n) {
        if (yi (Σ j=1 to n αj yj (xj · xi) + b) ≤ 0) {   // xi is misclassified
            αi = αi + 1              // update the weight
            b = b + yi R²            // update the bias
            error = 1;
        }
    }
}
return (α, b)   // a hyperplane that separates the data
Notes:
The training examples enter the algorithm only through the dot products (xj · xi).
αi is a measure of information content; the xi with non-zero information content (αi > 0) are called support vectors, as they are located on the boundaries.
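The same algorithm in Python, again as a sketch; note that the data enter only through the precomputed Gram matrix of dot products, which is what later allows the kernel trick:

import numpy as np

def dual_perceptron(X, y, max_epochs=1000):
    n = X.shape[0]
    alpha, b = np.zeros(n), 0.0
    gram = X @ X.T                                            # gram[j, i] = x_j . x_i
    R = np.max(np.linalg.norm(X, axis=1))
    for _ in range(max_epochs):                               # stands in for while(error)
        error = False
        for i in range(n):
            if y[i] * (np.sum(alpha * y * gram[:, i]) + b) <= 0:   # x_i is misclassified
                alpha[i] += 1                                 # one more mistake on x_i
                b += y[i] * R**2                              # update the bias
                error = True
        if not error:
            break
    return alpha, b                                           # alpha_i > 0 marks the support vectors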

9 [Figure: the separating hyperplane (w, b), the margin γ, and the origin; training points on the margin boundaries have α > 0 (the support vectors), while the remaining points have α = 0.]

10 A larger margin is preferred: the algorithm converges more quickly, and the resulting classifier generalizes better.

11 Maximizing the margin
For the closest positive example x+ and the closest negative example x− (taken along the direction of w):
    w · x+ + b = +1
    w · x− + b = −1
Subtracting,
    2 = (x+ · w) − (x− · w) = (x+ − x−) · w = ||x+ − x−|| ||w||
Therefore, maximizing the geometric margin ||x+ − x−|| is equivalent to minimizing ½ ||w||², under the linear constraints yi (w · xi + b) ≥ 1 for i = 1, …, n:
    min w,b ½ ⟨w · w⟩   subject to   yi (⟨w · xi⟩ + b) ≥ 1 for i = 1, …, n

12 Optimization with constraints
    min w,b ½ ⟨w · w⟩   subject to   yi (⟨w · xi⟩ + b) ≥ 1 for i = 1, …, n
Lagrangian theory: introducing a Lagrange multiplier αi ≥ 0 for each constraint,
    L(w, b, α) = ½ ||w||² − Σi αi (yi (w · xi + b) − 1),
and then setting
    ∂L/∂w = 0,   ∂L/∂b = 0,
together with the KKT conditions on α.
This optimization problem can be solved by quadratic programming, and it is guaranteed to converge to the global minimum because the problem is convex.
Note: this is an advantage over artificial neural networks, whose training can get trapped in local minima.

13 L() =  i  ½  i j yi yj xi · xj
The optimal w* and b* can be found by solving the dual problem for  to maximize: L() =  i  ½  i j yi yj xi · xj under the constraints: i  0, and  i yi = 0. Once  is solved, w* =  i yi xi b* = ½ (max y =-1 w*· xi + min y=+1 w*· xi ) And an unknown x is classified as sign(w* · x + b*) = sign( i yi xi · x + b*) Notes: Only the dot product for vectors is needed. Many i are equal to zero, and those that are not zero correspond to xi on the boundaries – support vectors! In practice, instead of sign function, the actual value of w* · x + b* is used when its absolute value is less than or equal to one. Such a value is called a discriminant. CISC841, F07, Lec3, Liao
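This dual is a standard quadratic program. A sketch of solving it with an off-the-shelf QP solver; the use of cvxopt below is my own choice, not something the slides prescribe (X and y are assumed to be float/int numpy arrays with separable data):

import numpy as np
from cvxopt import matrix, solvers

def hard_margin_svm(X, y):
    # Maximize sum_i a_i - 1/2 sum_ij a_i a_j y_i y_j (x_i . x_j)
    # s.t. a_i >= 0 and sum_i a_i y_i = 0, written in cvxopt's form
    # minimize 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b.
    n = X.shape[0]
    K = X @ X.T
    P = matrix(np.outer(y, y) * K)
    q = matrix(-np.ones(n))
    G = matrix(-np.eye(n))                     # -a_i <= 0, i.e. a_i >= 0
    h = matrix(np.zeros(n))
    A = matrix(y.reshape(1, -1).astype(float)) # sum_i a_i y_i = 0
    b = matrix(0.0)
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                        # w* = sum_i a_i y_i x_i
    b_star = -0.5 * ((X[y == -1] @ w).max() + (X[y == +1] @ w).min())
    return w, b_star, alpha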

14 Non-linear mapping to a feature space
Φ( ) L() =  i  ½  i j yi yj Φ (xi )·Φ (xj ) xi Φ(xi) Φ(xj) xj CISC841, F07, Lec3, Liao

15 Nonlinear SVMs
[Figure: an input vector X = (x1, x2) in the input space is mapped to a higher-dimensional feature space where the two classes become linearly separable.]

16 Kernel function for mapping
For input X = (x1, x2), define the map Φ(X) = (x1x1, √2 x1x2, x2x2) and define the kernel function as K(X, Y) = (X · Y)². Then K(X, Y) = Φ(X) · Φ(Y): we can compute the dot product in the feature space without computing Φ.
K(X, Y) = Φ(X) · Φ(Y)
        = (x1x1, √2 x1x2, x2x2) · (y1y1, √2 y1y2, y2y2)
        = x1x1y1y1 + 2 x1x2y1y2 + x2x2y2y2
        = (x1y1 + x2y2)(x1y1 + x2y2)
        = ((x1, x2) · (y1, y2))²
        = (X · Y)²
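A quick numerical check of this identity (illustrative only, not from the slides):

import numpy as np

def phi(v):
    # Explicit feature map Phi(X) = (x1*x1, sqrt(2)*x1*x2, x2*x2)
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

X = np.array([1.0, 2.0])
Y = np.array([3.0, -1.0])
print(phi(X) @ phi(Y))     # dot product in the 3-d feature space: 1.0
print((X @ Y) ** 2)        # kernel evaluated in the 2-d input space: 1.0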

17 Kernels
Given a mapping Φ( ) from the space of input vectors to some higher-dimensional feature space, the kernel K of two vectors xi, xj is the inner product of their images in the feature space, namely K(xi, xj) = Φ(xi) · Φ(xj).
Since we only need the inner products of vectors in the feature space to find the maximal-margin separating hyperplane, we can use the kernel in place of the mapping Φ( ).
Because the inner product determines lengths and angles (and hence distances) in the feature space, a kernel function actually defines the geometry of the feature space and implicitly provides a similarity measure for the objects to be classified.

18 Properties of kernels
Symmetry: K(x, y) = K(y, x).
Cauchy–Schwarz: K(x, z)² ≤ K(x, x) K(z, z).
    Proof: K(x, z)² = [Φ(x) · Φ(z)]² ≤ ||Φ(x)||² ||Φ(z)||² = K(x, x) K(z, z).
A function K(x, z) is a kernel if and only if the matrix K = [K(xi, xj)] i,j=1 to n is positive semi-definite (has non-negative eigenvalues) for every finite set of points x1, …, xn.
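A small sketch of checking this condition numerically on a sample of points (the choice of kernels, sample, and tolerance is mine):

import numpy as np

def is_psd_gram(X, kernel, tol=1e-10):
    # Build the Gram matrix K_ij = kernel(x_i, x_j) and test its eigenvalues.
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
print(is_psd_gram(X, lambda a, b: (1 + a @ b) ** 2))                  # polynomial kernel: True
print(is_psd_gram(X, lambda a, b: np.exp(-np.sum((a - b) ** 2))))     # Gaussian kernel: True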

19 Mercer's condition
Since kernel functions play such an important role, it is important to know whether a given kernel computes dot products in some higher-dimensional space. For a kernel K(x, y): if for every g(x) such that ∫ g(x)² dx is finite we have
    ∫∫ K(x, y) g(x) g(y) dx dy ≥ 0,
then there exists a mapping Φ such that K(x, y) = Φ(x) · Φ(y).
Notes:
Mercer's condition does not tell us how to actually find Φ.
Mercer's condition may be hard to check, since it must hold for every such g(x).

20 Making kernels from kernels
If k1 and k2 are kernels and a > 0, then the following are also kernels:
    K(x, z) = k1(x, z) + k2(x, z)
    K(x, z) = a k1(x, z)
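An empirical illustration of these closure rules (my own sketch): sums and positive scalings of kernel Gram matrices remain positive semi-definite.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 4))
K1 = (1 + X @ X.T) ** 2                                       # polynomial kernel Gram matrix
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K2 = np.exp(-sq / 2.0)                                        # Gaussian kernel Gram matrix

for K in (K1 + K2, 3.0 * K1):
    print(np.linalg.eigvalsh(K).min() >= -1e-10)              # True: still a valid kernel matrix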

21 More kernel functions
Some commonly used generic kernel functions:
    Polynomial kernel: K(x, y) = (1 + x · y)^p
    Radial (or Gaussian) kernel: K(x, y) = exp(−||x − y||² / 2σ²)
Questions: By introducing extra dimensions (sometimes infinitely many), we can find a linearly separating hyperplane. But how can we be sure that such a mapping to a higher-dimensional space will generalize well to unseen data? Since the mapping introduces flexibility for fitting the training examples, how do we avoid overfitting?
Answer: use the maximum-margin hyperplane (Vapnik's theory).
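A minimal sketch of these two kernels in Python (the parameter values p and sigma are placeholders to be tuned, e.g., by cross-validation):

import numpy as np

def polynomial_kernel(x, y, p=3):
    return (1.0 + x @ y) ** p

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 0.5])
y = np.array([0.0, 2.0])
print(polynomial_kernel(x, y), gaussian_kernel(x, y))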

22 L() =  i  ½  i j yi yj K(xi , xj)
The optimal w* and b* can be found by solving the dual problem for  to maximize: L() =  i  ½  i j yi yj K(xi , xj) under the constraints: i  0, and  i yi = 0. Once  is solved, w* =  i yi xi b* = ½ max y =-1 [K(w*, xi) + min y=+1 (w*· xi )] And an unknown x is classified as sign(w* · x + b*) = sign( i yi K(xi , x) + b*) CISC841, F07, Lec3, Liao
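In practice this kernelized QP is usually handed to a library; the sketch below uses scikit-learn's SVC, which is my own choice of tool rather than anything the slides prescribe (a very large C approximates the hard-margin formulation on separable data):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(20, 2)),
               rng.normal(loc=-2.0, size=(20, 2))])
y = np.array([+1] * 20 + [-1] * 20)

clf = SVC(kernel="rbf", gamma=0.5, C=1e6).fit(X, y)    # gamma = 1/(2*sigma^2)
print(clf.n_support_)                                  # number of support vectors per class
print(clf.decision_function(X[:3]))                    # discriminant values
print(clf.predict([[1.5, 1.8], [-2.0, -1.0]]))         # sign of the discriminant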

23 References and resources
N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines, Cambridge University Press, 2000.
SVMlight (software package)
C. Burges, a tutorial on support vector machines
J.-P. Vert, a 3-day tutorial
W. S. Noble, "Support vector machine applications in computational biology," in Kernel Methods in Computational Biology, B. Schölkopf, K. Tsuda, and J.-P. Vert, eds., MIT Press, 2004.

