1
Nonlinear Knowledge in Kernel Approximation
Olvi Mangasarian, UW Madison & UCSD La Jolla
Edward Wild, UW Madison
2
Objectives
- Primary objective: incorporate prior knowledge over completely arbitrary sets into function approximation, without transforming (kernelizing) the knowledge
- Secondary objective: achieve transparency of the prior knowledge for practical applications
3
Outline
- Use kernels for function approximation
- Incorporate prior knowledge
  - Previous approaches require transformation of the knowledge
  - New approach requires no transformation: knowledge may be given over completely arbitrary sets
- Experimental results
  - Two synthetic examples and one real-world example related to breast cancer prognosis
  - Approximations with prior knowledge are more accurate than approximations without it
4
Function Approximation
- Given a set of m points in n-dimensional real space R^n and corresponding function values in R
  - The points are the rows of a matrix A ∈ R^{m×n}
  - Exact or approximate function values for the points are given by a corresponding vector y ∈ R^m
- Find a function f: R^n → R based on the given data such that f(A_i′) ≈ y_i
5
Function Approximation
- Problem: using only the given data may result in a poor approximation
  - The points may be noisy
  - Sampling may be costly
- Solution: use prior knowledge to improve the approximation
6
Adding Prior Knowledge
- Standard approach: fit a function to the given data points without knowledge
- Constrained approach: satisfy inequalities at given points
- 2004 MSW paper: satisfy linear inequalities over polyhedral regions
- Proposed new approach: satisfy nonlinear inequalities over arbitrary regions
7
Kernel Approximation
- Approximate f by a nonlinear kernel function K:
  f(x) ≈ K(x′, A′)α + b,  where  K(x′, A′)α = Σ_{i=1}^{m} α_i exp(−μ‖x − A_i′‖²)  (Gaussian kernel)
- Error in the kernel approximation of the given data:
  −s ≤ K(A, A′)α + be − y ≤ s
  where e is a vector of ones in R^m
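The Gaussian kernel above can be sketched numerically as follows. This is an illustrative sketch rather than the authors' code; the kernel-width parameter `mu` and the sample points are hypothetical choices.

```python
import numpy as np

def gaussian_kernel(X, A, mu=0.5):
    """K(X', A')_{ij} = exp(-mu * ||X_i - A_j||^2), for rows X_i of X and A_j of A."""
    sq_dists = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * sq_dists)

# rows of A are the given points in R^n (here n = 2, m = 3)
A = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
K = gaussian_kernel(A, A)  # m x m kernel matrix; diagonal entries are exp(0) = 1
```

Evaluating `gaussian_kernel(x.reshape(1, -1), A)` at a new point x gives the row vector K(x′, A′) that is combined with α and b to produce f(x).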
8
Kernel Approximation
- Trade off solution complexity (‖α‖₁) against data fitting (‖s‖₁):
  min_{α, b, s, a}  ν e′s + e′a  subject to  −s ≤ K(A, A′)α + be − y ≤ s,  −a ≤ α ≤ a
- Convert to a linear program; at the solution, e′a = ‖α‖₁ and e′s = ‖s‖₁
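The linear program above can be sketched with `scipy.optimize.linprog`. This is a minimal sketch under my own choices: the variable ordering [α, b, s, a], the trade-off weight `nu`, and the toy 1-D data are all assumptions, not taken from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def fit_kernel_lp(K, y, nu=100.0):
    """min nu*e's + e'a  s.t.  -s <= K@alpha + b*e - y <= s,  -a <= alpha <= a.
    Stacked LP variable: x = [alpha (m), b (1), s (m), a (m)]."""
    m = K.shape[0]
    c = np.concatenate([np.zeros(m + 1), nu * np.ones(m), np.ones(m)])
    I, Z, e, z1 = np.eye(m), np.zeros((m, m)), np.ones((m, 1)), np.zeros((m, 1))
    A_ub = np.vstack([
        np.hstack([ K,  e, -I, Z]),   #  K@alpha + b*e - s <= y
        np.hstack([-K, -e, -I, Z]),   # -K@alpha - b*e - s <= -y
        np.hstack([ I, z1,  Z, -I]),  #  alpha - a <= 0
        np.hstack([-I, z1,  Z, -I]),  # -alpha - a <= 0
    ])
    b_ub = np.concatenate([y, -y, np.zeros(2 * m)])
    bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)  # s >= 0, a >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:m], res.x[m]  # alpha, b

# toy 1-D usage: fit y = x^2 at five points with a Gaussian kernel
pts = np.arange(5.0).reshape(-1, 1)
y = (pts ** 2).ravel()
K = np.exp(-0.5 * (pts - pts.T) ** 2)
alpha, b = fit_kernel_lp(K, y)
```

With a large `nu` the slack term dominates, so the fit passes essentially through the given values; a small `nu` trades fit for a sparser, smaller α.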
9
Incorporating Nonlinear Prior Knowledge (MSW 2004)
- Linear knowledge implication: Bx ≤ d ⇒ α′Ax + b ≥ h′x + β
- Need to "kernelize" the knowledge from input space to the feature space of the kernel
- Requires the change of variable x = A′t:
  BA′t ≤ d ⇒ α′AA′t + b ≥ h′A′t + β
  K(B, A′)t ≤ d ⇒ α′K(A, A′)t + b ≥ h′A′t + β
- Motzkin's theorem of the alternative gives an equivalent linear system of inequalities, which is added to the linear program
- Achieves good numerical results, but the kernelized knowledge is not readily interpretable in the original input space
10
Incorporating Nonlinear Prior Knowledge: New Approach
- Start with an arbitrary nonlinear knowledge implication:
  g(x) ≤ 0 ⇒ K(x′, A′)α + b ≥ h(x),  ∀ x ∈ Γ ⊆ R^n
  where g: Γ → R^k and h: Γ → R are arbitrary functions
- Problem: need to add this knowledge to the optimization problem
- Logically equivalent statement: the system
  g(x) ≤ 0,  K(x′, A′)α + b − h(x) < 0
  has no solution x ∈ Γ
11
Prior Knowledge as a System of Linear Inequalities
- Use a theorem of the alternative for convex functions. Assume that g(x), K(x′, A′)α + b, and −h(x) are convex functions of x, that Γ is convex, and that ∃ x ∈ Γ: g(x) < 0. Then either:
  I. g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has a solution x ∈ Γ, or
  II. ∃ v ∈ R^k, v ≥ 0: K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀ x ∈ Γ
  but never both.
- If we can find v ≥ 0 such that K(x′, A′)α + b − h(x) + v′g(x) ≥ 0, ∀ x ∈ Γ, then by the above theorem g(x) ≤ 0, K(x′, A′)α + b − h(x) < 0 has no solution x ∈ Γ, or equivalently:
  g(x) ≤ 0 ⇒ K(x′, A′)α + b ≥ h(x),  ∀ x ∈ Γ
12
Proof
- (¬I ⇒ II): Follows from OLM 1969, Corollary 4.2.2, and the existence of an x ∈ Γ such that g(x) < 0.
- (II ⇒ ¬I): Suppose not. That is, there exist x ∈ Γ and v ∈ R^k, v ≥ 0, such that
  g(x) ≤ 0,  K(x′, A′)α + b − h(x) < 0,  (I)
  v ≥ 0,  v′g(x) + K(x′, A′)α + b − h(x) ≥ 0, ∀ x ∈ Γ.  (II)
  Since v ≥ 0 and g(x) ≤ 0 imply v′g(x) ≤ 0, we have the contradiction:
  0 > v′g(x) + K(x′, A′)α + b − h(x) ≥ 0
- This direction requires no assumptions on g, h, K, or Γ whatsoever.
13
Example
- g(x) = 1250 − x³,  f(x) = x⁴,  h(x) = x² + 5000
- I: g(x) ≤ 0 ⇒ f(x) ≥ h(x), i.e., x³ ≥ 1250 ⇒ x⁴ ≥ x² + 5000
- II: ∃ v ≥ 0: v·g(x) + f(x) − h(x) ≥ 0, ∀ x
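Both sides of this example can be checked numerically on a grid. The multiplier v = 5 below is a value I found by inspection; the slides do not state one, so it is an assumption.

```python
import numpy as np

g = lambda x: 1250 - x**3   # knowledge region: g(x) <= 0, i.e. x^3 >= 1250
f = lambda x: x**4          # function being constrained
h = lambda x: x**2 + 5000   # lower-bounding function

xs = np.linspace(-20.0, 20.0, 40001)

# implication I: wherever g(x) <= 0 on the grid, f(x) >= h(x) indeed holds
region = xs[g(xs) <= 0]
implication_holds = bool(np.all(f(region) >= h(region)))

# certificate II: the nonnegative multiplier v = 5 (my choice) makes
# v*g(x) + f(x) - h(x) >= 0 everywhere on the grid
v = 5.0
certificate_min = float(np.min(v * g(xs) + f(xs) - h(xs)))
```

Outside the plotted range the x⁴ term dominates, so the certificate continues to hold for all x.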
14
Incorporating Prior Knowledge
- Linear semi-infinite program: finitely many variables, an infinite number of constraints (one for each x ∈ Γ)
- Discretize Γ to obtain a finite linear program
- Slacks allow the knowledge to be satisfied inexactly
- Add a term to the objective function to drive the slacks to zero
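After discretizing Γ at points x₁, …, x_p, each knowledge constraint K(x_j′, A′)α + b + v′g(x_j) + z_j ≥ h(x_j) is linear in the unknowns (α, b, v, z). A sketch of how such rows could be assembled for a solver that expects A_ub x ≤ b_ub; the variable ordering and the tiny example data are my assumptions:

```python
import numpy as np

def knowledge_rows(Kd, G, h_vals):
    """Rows enforcing  Kd@alpha + b*e + G@v + z >= h_vals  as  A_ub@x <= b_ub,
    for the stacked variable x = [alpha (m), b (1), v (k), z (p)].
    Kd: p x m kernel values at the discretization points; G: p x k values of g;
    h_vals: p values of h.  Nonnegativity of v and z is left to variable bounds."""
    p, m = Kd.shape
    k = G.shape[1]
    A_ub = np.hstack([-Kd, -np.ones((p, 1)), -G, -np.eye(p)])
    b_ub = -np.asarray(h_vals, dtype=float)
    return A_ub, b_ub

# tiny example: p = 2 discretization points, m = 2 kernel columns, k = 1
Kd = np.array([[1.0, 0.2], [0.2, 1.0]])
G = np.array([[-1.0], [-0.5]])
A_ub, b_ub = knowledge_rows(Kd, G, [0.5, 0.5])

x = np.array([1.0, 0.0,  0.0,  0.0,  0.0, 0.0])  # [alpha, b, v, z]
# Kd@alpha + b + G@v + z = [1.0, 0.2]: row 1 meets the 0.5 bound, row 2 does not
feasible = A_ub @ x <= b_ub
```

Appending these rows (and slack penalties on z) to the data-fitting LP of the previous slide gives the full knowledge-based approximation problem.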
15
Numerical Experience
- Evaluate on three datasets
  - Two synthetic datasets
  - The Wisconsin Prognostic Breast Cancer database
- Compare the approximation with prior knowledge to one without prior knowledge
  - Prior knowledge leads to an improved approximation
- The prior knowledge used cannot be handled exactly by previous work
- No kernelization needed on the knowledge set
16
Two-Dimensional Hyperboloid
f(x₁, x₂) = x₁x₂
17
Two-Dimensional Hyperboloid
- Given exact values only at 11 points along the line x₁ = x₂, at x₁ ∈ {−5, …, 5}
18
Two-Dimensional Hyperboloid Approximation without Prior Knowledge
19
Two-Dimensional Hyperboloid
- Add prior knowledge: x₁x₂ ≥ 1 ⇒ f(x₁, x₂) ≥ x₁x₂
- The nonlinear term x₁x₂ cannot be handled exactly by any previous approach
- Discretization used only 11 points along the line x₁ = −x₂, x₁ ∈ {−5, −4, …, 4, 5}
20
Two-Dimensional Hyperboloid Approximation with Prior Knowledge
21
Two-Dimensional Tower Function
22
Two-Dimensional Tower Function Data
- Given 400 points on the grid [−4, 4] × [−4, 4]
- Values are min{g(x), 2}, where g(x) is the exact tower function
23
Two-Dimensional Tower Function Approximation without Prior Knowledge
24
Two-Dimensional Tower Function Prior Knowledge
- Add prior knowledge: (x₁, x₂) ∈ [−4, 4] × [−4, 4] ⇒ f(x) = g(x)
- The prior knowledge is the exact function value, enforced at 2500 points on the grid [−4, 4] × [−4, 4] through the above implication
- The principal objective of the prior knowledge here is to overcome the poor given data
25
Two-Dimensional Tower Function Approximation with Prior Knowledge
26
Predicting Lymph Node Metastasis
- The number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence
  - Determined by surgery, in addition to the removal of the tumor
- Wisconsin Prognostic Breast Cancer (WPBC) data
  - Lymph node metastasis for 194 patients
  - 30 cytological features from a fine-needle aspirate
  - Tumor size, obtained during surgery
- Task: predict the number of metastasized lymph nodes given tumor size alone
27
Predicting Lymph Node Metastasis
- Split the data into two portions
  - Past data: 20%, used to find prior knowledge
  - Present data: 80%, used to evaluate performance
- Simulates acquiring prior knowledge from an expert's experience
28
Prior Knowledge for Lymph Node Metastasis
- Use kernel approximation without knowledge on the past data:
  f₁(x) = K(x′, A₁′)α₁ + b₁
  where A₁ is the matrix of the past data points
- Use density estimation to decide where to enforce the knowledge; p(x) is the empirical density of the past data
- Knowledge: the number of metastasized lymph nodes is at least the value predicted on the past data, with a tolerance of 0.01:
  p(x) ≥ 0.1 ⇒ f(x) ≥ f₁(x) − 0.01
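The density gate p(x) ≥ 0.1 can be sketched with a kernel density estimate. The synthetic "past" tumor sizes and the use of scipy's `gaussian_kde` are my stand-ins for the actual WPBC data and whatever density estimator the authors used.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
past_sizes = rng.normal(loc=3.0, scale=1.0, size=40)  # hypothetical past tumor sizes

p = gaussian_kde(past_sizes)         # empirical density p(x) of the past data
grid = np.linspace(0.0, 8.0, 161)
enforce_at = grid[p(grid) >= 0.1]    # points where f(x) >= f1(x) - 0.01 is enforced
```

The knowledge implication is then discretized only on `enforce_at`, i.e. where the past data is dense enough for the past-data approximation f₁ to be trusted.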
29
Predicting Lymph Node Metastasis: Results
- The table shows the root-mean-squared error (RMSE) on the present data (80%) of the approximation f₁(x) built from the past data (20%)
- Leave-one-out (LOO) RMSE is reported for the approximations with and without knowledge
- Improvement due to knowledge: 14.8%

  Approximation           RMSE / LOO RMSE
  f₁(x) from past data    6.12
  Without knowledge       5.92
  With knowledge          5.04
30
Conclusion
- Added general nonlinear prior knowledge to kernel approximation
  - Implemented as linear inequalities in a linear programming problem
  - Knowledge incorporated transparently
- Demonstrated effectiveness
  - Two synthetic examples
  - A real-world problem from breast cancer prognosis
- Future work
  - More general prior knowledge, with inequalities replaced by more general functions
  - Application to classification problems
31
Questions
Websites linking to papers and talks:
http://www.cs.wisc.edu/~olvi/
http://www.cs.wisc.edu/~wildt/