1
Nonlinear Knowledge in Kernel Approximation
Olvi Mangasarian, UW Madison & UCSD La Jolla
Edward Wild, UW Madison
2
Objectives
- Primary objective: incorporate prior knowledge over completely arbitrary sets into function approximation, without transforming (kernelizing) the knowledge
- Secondary objective: achieve transparency of the prior knowledge for practical applications
3
Outline
- Use kernels for function approximation
- Incorporate prior knowledge
  - Previous approaches require transformation of the knowledge
  - New approach requires no transformation: knowledge may be given over completely arbitrary sets
- Experimental results
  - Two synthetic examples and one real-world example related to breast cancer prognosis
  - Approximations with prior knowledge are more accurate than approximations without it
4
Function Approximation
- Given a set of m points in n-dimensional real space R^n and corresponding function values in R
  - The points are the rows of a matrix A ∈ R^{m×n}
  - Exact or approximate function values for the points are given by a corresponding vector y ∈ R^m
- Find a function f: R^n → R based on the given data such that f(A_i′) ≈ y_i
5
Function Approximation
- Problem: using only the given data may result in a poor approximation
  - The points may be noisy
  - Sampling may be costly
- Solution: use prior knowledge to improve the approximation
6
Adding Prior Knowledge
- Standard approach: fit a function to the given data points without knowledge
- Constrained approach: satisfy inequalities at given points
- 2004 MSW paper: satisfy linear inequalities over polyhedral regions
- Proposed new approach: satisfy nonlinear inequalities over arbitrary regions
7
Kernel Approximation
- Approximate f by a nonlinear kernel function K:
  f(x) ≈ K(x′, A′)α + b,  where  K(x′, A′)α = Σ_{i=1}^{m} α_i exp(−μ‖x − A_i′‖²)  (Gaussian kernel)
- Error in the kernel approximation of the given data:
  −s ≤ K(A, A′)α + be − y ≤ s
  where e is a vector of ones in R^m
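The Gaussian kernel above can be sketched numerically as follows. This is an illustrative sketch rather than the authors' code; the kernel-width parameter `mu` and the sample points are hypothetical choices.

```python
import numpy as np

def gaussian_kernel(X, A, mu=0.5):
    """K(X', A')_{ij} = exp(-mu * ||X_i - A_j||^2), for rows X_i of X and A_j of A."""
    sq_dists = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq_dists)

# rows of A are the given points in R^n (here n = 2, m = 3)
A = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
K = gaussian_kernel(A, A)  # m x m kernel matrix; diagonal entries are exp(0) = 1
```

Evaluating `gaussian_kernel(x.reshape(1, -1), A)` at a new point x gives the row vector K(x′, A′) that is combined with α and b to produce f(x).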
8
Kernel Approximation
- Trade off solution complexity (‖α‖₁) against data fitting (‖s‖₁):
  min_{α, b, s, a}  ν e′s + e′a  subject to  −s ≤ K(A, A′)α + be − y ≤ s,  −a ≤ α ≤ a
- Convert to a linear program; at the solution, e′a = ‖α‖₁ and e′s = ‖s‖₁
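The linear program above can be sketched with `scipy.optimize.linprog`. This is a minimal sketch under my own choices: the variable ordering [α, b, s, a], the trade-off weight `nu`, and the toy 1-D data are all assumptions, not taken from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def fit_kernel_lp(K, y, nu=100.0):
    """min nu*e's + e'a  s.t.  -s <= K@alpha + b*e - y <= s,  -a <= alpha <= a.
    Stacked LP variable: x = [alpha (m), b (1), s (m), a (m)]."""
    m = K.shape[0]
    c = np.concatenate([np.zeros(m + 1), nu * np.ones(m), np.ones(m)])
    I, Z, e, z1 = np.eye(m), np.zeros((m, m)), np.ones((m, 1)), np.zeros((m, 1))
    A_ub = np.vstack([
        np.hstack([ K,  e, -I, Z]),   #  K@alpha + b*e - s <= y
        np.hstack([-K, -e, -I, Z]),   # -K@alpha - b*e - s <= -y
        np.hstack([ I, z1,  Z, -I]),  #  alpha - a <= 0
        np.hstack([-I, z1,  Z, -I]),  # -alpha - a <= 0
    ])
    b_ub = np.concatenate([y, -y, np.zeros(2 * m)])
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)  # s >= 0, a >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:m], res.x[m]  # alpha, b

# toy 1-D usage: fit y = x^2 at five points with a Gaussian kernel
pts = np.arange(5.0).reshape(-1, 1)
y = (pts ** 2).ravel()
K = np.exp(-0.5 * (pts - pts.T) ** 2)
alpha, b = fit_kernel_lp(K, y)
```

With a large `nu` the slack term dominates, so the fit passes essentially through the given values; a small `nu` trades fit for a sparser, smaller α.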
9
Incorporating Nonlinear Prior Knowledge (MSW 2004)
- Linear knowledge implication: Bx ≤ d ⇒ α′Ax + b ≥ h′x + β
- Need to "kernelize" the knowledge from input space to the feature space of the kernel
- Requires the change of variable x = A′t:
  BA′t ≤ d ⇒ α′AA′t + b ≥ h′A′t + β
  K(B, A′)t ≤ d ⇒ α′K(A, A′)t + b ≥ h′A′t + β
- Motzkin's theorem of the alternative gives an equivalent linear system of inequalities, which is added to the linear program
- Achieves good numerical results, but the kernelized knowledge is not readily interpretable in the original input space
10
Incorporating Nonlinear Prior Knowledge: New Approach
- Start with an arbitrary nonlinear knowledge implication:
  g(x) ≤ 0 ⇒ K(x′, A′)α + b ≥ h(x),  ∀ x ∈ Γ ⊆ R^n
  where g: Γ → R^k and h: Γ → R are arbitrary functions
- Problem: need to add this knowledge to the optimization problem
- Logically equivalent statement: the system
  g(x) ≤ 0,  K(x′, A′)α + b − h(x) < 0
  has no solution x ∈ Γ
11
Prior Knowledge as a System of Linear Inequalities
- Use a theorem of the alternative for convex functions. Assume that g(x), K(x′, A′)α + b, and −h(x) are convex functions of x, that Γ is convex, and that ∃ x ∈ Γ: g(x) < 0. Then either:
  I. g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has a solution x ∈ Γ, or
  II. ∃ v ∈ R^k, v ≥ 0: K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀ x ∈ Γ
  but never both.
- If we can find v ≥ 0 such that K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀ x ∈ Γ, then by the above theorem g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has no solution x ∈ Γ, or equivalently:
  g(x) ≤ 0 ⇒ K(x′, A′)α + b ≥ h(x),  ∀ x ∈ Γ
12
Proof
- (¬I ⇒ II): Follows from OLM 1969, Corollary 4.2.2, and the existence of an x ∈ Γ such that g(x) < 0.
- (II ⇒ ¬I): Suppose not. That is, there exist x ∈ Γ and v ∈ R^k, v ≥ 0, such that
  g(x) ≤ 0,  K(x′, A′)α + b − h(x) < 0,  (I)
  v ≥ 0,  v′g(x) + K(x′, A′)α + b − h(x) ≥ 0, ∀ x ∈ Γ.  (II)
  Since v ≥ 0 and g(x) ≤ 0 imply v′g(x) ≤ 0, we have the contradiction:
  0 > v′g(x) + K(x′, A′)α + b − h(x) ≥ 0
- This direction requires no assumptions on g, h, K, or Γ whatsoever.
13
Example
- g(x) = 1250 − x³,  f(x) = x⁴,  h(x) = x² + 5000
- I: g(x) ≤ 0 ⇒ f(x) ≥ h(x), i.e., x³ ≥ 1250 ⇒ x⁴ ≥ x² + 5000
- II: ∃ v ≥ 0: v·g(x) + f(x) − h(x) ≥ 0, ∀ x
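Both sides of this example can be checked numerically on a grid. The multiplier v = 5 below is a value I found by inspection; the slides do not state one, so it is an assumption.

```python
import numpy as np

g = lambda x: 1250 - x**3   # knowledge region: g(x) <= 0, i.e. x^3 >= 1250
f = lambda x: x**4          # function being constrained
h = lambda x: x**2 + 5000   # lower-bounding function

xs = np.linspace(-20.0, 20.0, 40001)

# implication I: wherever g(x) <= 0 on the grid, f(x) >= h(x) indeed holds
region = xs[g(xs) <= 0]
implication_holds = bool(np.all(f(region) >= h(region)))

# certificate II: the nonnegative multiplier v = 5 (my choice) makes
# v*g(x) + f(x) - h(x) >= 0 everywhere on the grid
v = 5.0
certificate_min = float(np.min(v * g(xs) + f(xs) - h(xs)))
```

Outside the plotted range the x⁴ term dominates, so the certificate continues to hold for all x.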
14
Incorporating Prior Knowledge
- Linear semi-infinite program: finitely many variables, an infinite number of constraints (one for each x ∈ Γ)
- Discretize Γ to obtain a finite linear program
- Slacks allow the knowledge to be satisfied inexactly
- Add a term to the objective function to drive the slacks to zero
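After discretizing Γ at points x₁, …, x_p, each knowledge constraint K(x_j′, A′)α + b + v′g(x_j) + z_j ≥ h(x_j) is linear in the unknowns (α, b, v, z). A sketch of how such rows could be assembled for a solver that expects A_ub x ≤ b_ub; the variable ordering and the tiny example data are my assumptions:

```python
import numpy as np

def knowledge_rows(Kd, G, h_vals):
    """Rows enforcing  Kd@alpha + b*e + G@v + z >= h_vals  as  A_ub@x <= b_ub,
    for the stacked variable x = [alpha (m), b (1), v (k), z (p)].
    Kd: p x m kernel values at the discretization points; G: p x k values of g;
    h_vals: p values of h.  Nonnegativity of v and z is left to variable bounds."""
    p, m = Kd.shape
    k = G.shape[1]
    A_ub = np.hstack([-Kd, -np.ones((p, 1)), -G, -np.eye(p)])
    b_ub = -np.asarray(h_vals, dtype=float)
    return A_ub, b_ub

# tiny example: p = 2 discretization points, m = 2 kernel columns, k = 1
Kd = np.array([[1.0, 0.2], [0.2, 1.0]])
G = np.array([[-1.0], [-0.5]])
A_ub, b_ub = knowledge_rows(Kd, G, [0.5, 0.5])

x = np.array([1.0, 0.0,  0.0,  0.0,  0.0, 0.0])  # [alpha, b, v, z]
# Kd@alpha + b + G@v + z = [1.0, 0.2]: row 1 meets the 0.5 bound, row 2 does not
feasible = A_ub @ x <= b_ub
```

Appending these rows (and slack penalties on z) to the data-fitting LP of the previous slide gives the full knowledge-based approximation problem.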
15
Numerical Experience
- Evaluate on three datasets
  - Two synthetic datasets
  - The Wisconsin Prognostic Breast Cancer database
- Compare the approximation with prior knowledge to one without prior knowledge
  - Prior knowledge leads to an improved approximation
- The prior knowledge used cannot be handled exactly by previous work
- No kernelization needed on the knowledge set
16
Two-Dimensional Hyperboloid
f(x₁, x₂) = x₁x₂
17
Two-Dimensional Hyperboloid
- Given exact values only at 11 points along the line x₁ = x₂, at x₁ ∈ {−5, …, 5}
18
Two-Dimensional Hyperboloid Approximation without Prior Knowledge
19
Two-Dimensional Hyperboloid
- Add prior knowledge: x₁x₂ ≥ 1 ⇒ f(x₁, x₂) ≥ x₁x₂
- The nonlinear term x₁x₂ cannot be handled exactly by any previous approach
- Discretization used only 11 points along the line x₁ = −x₂, x₁ ∈ {−5, −4, …, 4, 5}
20
Two-Dimensional Hyperboloid Approximation with Prior Knowledge
21
Two-Dimensional Tower Function
22
Two-Dimensional Tower Function Data
- Given 400 points on the grid [−4, 4] × [−4, 4]
- Values are min{g(x), 2}, where g(x) is the exact tower function
23
Two-Dimensional Tower Function Approximation without Prior Knowledge
24
Two-Dimensional Tower Function Prior Knowledge
- Add prior knowledge: (x₁, x₂) ∈ [−4, 4] × [−4, 4] ⇒ f(x) = g(x)
- The prior knowledge is the exact function value, enforced at 2500 points on the grid [−4, 4] × [−4, 4] through the above implication
- The principal objective of the prior knowledge here is to overcome the poor given data
25
Two-Dimensional Tower Function Approximation with Prior Knowledge
26
Predicting Lymph Node Metastasis
- The number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence
  - Determined by surgery, in addition to the removal of the tumor
- Wisconsin Prognostic Breast Cancer (WPBC) data
  - Lymph node metastasis for 194 patients
  - 30 cytological features from a fine-needle aspirate
  - Tumor size, obtained during surgery
- Task: predict the number of metastasized lymph nodes given tumor size alone
27
Predicting Lymph Node Metastasis
- Split the data into two portions
  - Past data: 20%, used to find prior knowledge
  - Present data: 80%, used to evaluate performance
- Simulates acquiring prior knowledge from an expert's experience
28
Prior Knowledge for Lymph Node Metastasis
- Use kernel approximation without knowledge on the past data:
  f₁(x) = K(x′, A₁′)α₁ + b₁
  where A₁ is the matrix of the past data points
- Use density estimation to decide where to enforce the knowledge; p(x) is the empirical density of the past data
- Knowledge: the number of metastasized lymph nodes is at least the value predicted on the past data, with a tolerance of 0.01:
  p(x) ≥ 0.1 ⇒ f(x) ≥ f₁(x) − 0.01
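The density gate p(x) ≥ 0.1 can be sketched with a kernel density estimate. The synthetic "past" tumor sizes and the use of scipy's `gaussian_kde` are my stand-ins for the actual WPBC data and whatever density estimator the authors used.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
past_sizes = rng.normal(loc=3.0, scale=1.0, size=40)  # hypothetical past tumor sizes

p = gaussian_kde(past_sizes)         # empirical density p(x) of the past data
grid = np.linspace(0.0, 8.0, 161)
enforce_at = grid[p(grid) >= 0.1]    # points where f(x) >= f1(x) - 0.01 is enforced
```

The knowledge implication is then discretized only on `enforce_at`, i.e. where the past data is dense enough for the past-data approximation f₁ to be trusted.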
29
Predicting Lymph Node Metastasis: Results
- The table shows the root-mean-squared error (RMSE) on the present data (80%) of the approximation f₁(x) built from the past data (20%)
- Leave-one-out (LOO) RMSE is reported for the approximations with and without knowledge
- Improvement due to knowledge: 14.8%

  Approximation           RMSE / LOO RMSE
  f₁(x) from past data    6.12
  Without knowledge       5.92
  With knowledge          5.04
30
Conclusion
- Added general nonlinear prior knowledge to kernel approximation
  - Implemented as linear inequalities in a linear programming problem
  - Knowledge incorporated transparently
- Demonstrated effectiveness
  - Two synthetic examples
  - A real-world problem from breast cancer prognosis
- Future work
  - More general prior knowledge, with inequalities replaced by more general functions
  - Application to classification problems
31
Questions
Websites linking to papers and talks:
http://www.cs.wisc.edu/~olvi/
http://www.cs.wisc.edu/~wildt/