Machine Learning Seminar: Support Vector Regression Presented by: Heng Ji 10/08/03
Outline: Regression Background; Linear ε-Insensitive Loss Algorithm (Primal Formulation, Dual Formulation, Kernel Formulation); Quadratic ε-Insensitive Loss Algorithm; Kernel Ridge Regression & Gaussian Process
Regression = find a function that fits the observations Observations: (1949,100) (1950,117)... (1996,1462) (1997,1469) (1998,1467) (1999,1474) (x,y) pairs
Linear fit... Not so good...
Better linear fit... Take logarithm of y and fit a straight line
Transform back to original So so...
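The log-and-fit idea from the last two slides can be sketched in a few lines of Python. This is only an illustration: it uses the six observation pairs that are legible on the observations slide (the elided points are not filled in), and the helper names `fit_line` and `fit_log_linear` are made up for this sketch.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares straight line y = a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return a, b

def fit_log_linear(xs, ys):
    """Fit a straight line to log(y), then transform back with exp."""
    a, b = fit_line(xs, [math.log(y) for y in ys])
    def predict(x):
        return math.exp(a + b * x)
    return predict

# Only the (x, y) pairs visible on the slide; the rest are elided there.
years = [1949, 1950, 1996, 1997, 1998, 1999]
values = [100, 117, 1462, 1469, 1467, 1474]
model = fit_log_linear(years, values)
print(model(1975))  # estimate on the original scale
```

Fitting in log space makes the errors multiplicative on the original scale, which is one reason the back-transformed curve can still look only "so so".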
So what is regression about? Construct a model of a process, using examples of the process. Input: x (possibly a vector). Output: y (generated by the process). Examples: pairs of input and output {y, x}. Our model: the function f(x) is our estimate of the true function g(x).
Assumption about the process: the “fixed regressor model” $y(n) = g[x(n)] + e(n)$, where x(n) is the observed input, y(n) the observed output, g[x(n)] the true underlying function, and e(n) an i.i.d. noise process with zero mean. Data set: $\{(x(n), y(n)) : n = 1, \dots, N\}$
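A minimal sketch of how a data set arises under the fixed regressor model. The choice of g is borrowed from the model-set slide, and Gaussian noise is an assumption made here for illustration (the slide only requires i.i.d. zero-mean noise); `make_dataset` is a hypothetical helper name.

```python
import random

def g(x):
    # Example "true" function, borrowed from the model-set slide.
    return x + x**2 + 6 * x**3

def make_dataset(xs, noise_std, seed=0):
    """Return pairs (x(n), y(n)) with y(n) = g[x(n)] + noise."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Assumption: the zero-mean i.i.d. noise is Gaussian.
    return [(x, g(x) + rng.gauss(0.0, noise_std)) for x in xs]

xs = [i / 10 for i in range(-10, 11)]  # fixed (non-random) inputs
data = make_dataset(xs, noise_std=0.5)
```

The inputs are chosen by the experimenter and only the outputs are random, which is what "fixed regressor" refers to.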
Example 2
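Model Sets (examples): true function $g(x) = x + x^2 + 6x^3$. $\mathcal{F}_1 = \{a + bx\}$ (linear); $\mathcal{F}_2 = \{a + bx + cx^2\}$ (quadratic); $\mathcal{F}_3 = \{a + bx + cx^2 + dx^3\}$ (cubic); $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \mathcal{F}_3$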
Idealized regression [Figure: the true function g(x), the model set (our hypothesis set), the best model f_opt(x) in the set, and the error between them] Find an appropriate model family and find the f(x) with minimum “distance” to g(x) (the “error”)
How do we measure “distance”? Q: What is the distance (difference) between functions f and g?
Margin Slack Variable: for an example $(x_i, y_i)$ and a function $f$, the margin slack variable is $\xi_i = \max(0, |y_i - f(x_i)| - (\theta - \gamma))$, where θ is the target accuracy in test and γ is the difference between target accuracy and margin in training
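ε-Insensitive Loss Function: let $\varepsilon = \theta - \gamma$, so the margin slack variable is $\xi_i = \max(0, |y_i - f(x_i)| - \varepsilon)$. Linear ε-insensitive loss: $|y - f(x)|_\varepsilon = \max(0, |y - f(x)| - \varepsilon)$. Quadratic ε-insensitive loss: $|y - f(x)|_\varepsilon^2$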
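The two losses are simple to state in code. A small sketch, with function names chosen here for illustration:

```python
def linear_eps_loss(y, fx, eps):
    """Linear eps-insensitive loss: zero inside the band, linear outside."""
    return max(0.0, abs(y - fx) - eps)

def quadratic_eps_loss(y, fx, eps):
    """Quadratic eps-insensitive loss: the square of the linear one."""
    return linear_eps_loss(y, fx, eps) ** 2

# A residual of 0.3 is ignored when eps = 0.5; a residual of 1.0 costs 0.5.
inside = linear_eps_loss(1.0, 1.3, eps=0.5)
outside = linear_eps_loss(1.0, 2.0, eps=0.5)
```

Setting `eps` to zero recovers the absolute error and the squared error, respectively.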
Linear ε-Insensitive Loss: a Linear SV Machine [Figure: the ε-insensitive loss for a linear SV machine, with the slack ξ marked]
Basic Idea of SV Regression. Starting point: we have input data $X = \{(x_1, y_1), \dots, (x_N, y_N)\}$. Goal: we want to find a robust function f(x) that has at most ε deviation from the targets y, while at the same time being as flat as possible. Idea: simple regression problem + optimization + kernel trick
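Thus setting $f(x) = \langle w, x \rangle + b$, where "flat" means a small $\|w\|$. Primal Regression Problem: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $|y_i - \langle w, x_i \rangle - b| \le \varepsilon$, $i = 1, \dots, N$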
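Linear ε-Insensitive Loss Regression:
$$\min_{w, b, \xi, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*)$$
subject to $y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i$, $\langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*$, $\xi_i, \xi_i^* \ge 0$, $i = 1, \dots, N$.
ε determines the insensitive zone. C is a trade-off between error and ||w||. ε and C must be tuned simultaneously. Is regression more difficult than classification?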
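To make the roles of ε and C concrete, here is a sketch that evaluates the objective for a given linear model. It assumes the standard form of the problem, $\frac{1}{2}\|w\|^2 + C\sum_i(\xi_i + \xi_i^*)$, with $\xi_i$ the excess of a target above the band and $\xi_i^*$ the excess below it; the function name is hypothetical, and this only scores a candidate (w, b), it does not solve the optimization.

```python
def svr_primal_objective(w, b, xs, ys, eps, C):
    """Return (objective, xi, xi_star) for f(x) = <w, x> + b."""
    xi, xi_star = [], []
    for x, y in zip(xs, ys):
        f = sum(wj * xj for wj, xj in zip(w, x)) + b
        xi.append(max(0.0, y - f - eps))       # target lies above the band
        xi_star.append(max(0.0, f - y - eps))  # target lies below the band
    flatness = 0.5 * sum(wj * wj for wj in w)
    objective = flatness + C * (sum(xi) + sum(xi_star))
    return objective, xi, xi_star

xs = [(0.0,), (1.0,), (2.0,)]
ys = [0.0, 1.0, 3.0]
obj, xi, xi_star = svr_primal_objective((1.0,), 0.0, xs, ys, eps=0.5, C=2.0)
```

Here only the third point leaves the band (by 0.5), so a larger C penalizes that one deviation more heavily while a larger ε would remove the penalty entirely.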
Parameters used in SV Regression
Dual Formulation: the Lagrangian function will help us formulate the dual problem. ε: width of the insensitive zone; $\beta_i^*$: Lagrange multiplier; $\xi_i$: slack (deviation) for points above the ε-band; $\xi_i^*$: slack (deviation) for points below the ε-band. Optimality Conditions
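For reference, a sketch of the Lagrangian in its standard form, assuming the primal minimizes $\frac{1}{2}\|w\|^2 + C\sum_i(\xi_i + \xi_i^*)$ under the two band constraints and $\xi_i, \xi_i^* \ge 0$. Pairing $\beta_i, \beta_i^*$ with the band constraints, and introducing $\eta_i, \eta_i^*$ for the nonnegativity constraints, are naming assumptions made here:

```latex
\[
\begin{aligned}
L ={}& \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{N}(\xi_i + \xi_i^*)
      - \sum_{i=1}^{N}(\eta_i \xi_i + \eta_i^* \xi_i^*) \\
     &- \sum_{i=1}^{N}\beta_i\,\bigl(\varepsilon + \xi_i - y_i + \langle w, x_i\rangle + b\bigr)
      - \sum_{i=1}^{N}\beta_i^*\,\bigl(\varepsilon + \xi_i^* + y_i - \langle w, x_i\rangle - b\bigr)
\end{aligned}
\]
% Optimality conditions: the derivatives with respect to the primal variables vanish.
\[
\begin{aligned}
\partial L/\partial b &= \sum_{i=1}^{N}(\beta_i^* - \beta_i) = 0, \\
\partial L/\partial w &= w - \sum_{i=1}^{N}(\beta_i - \beta_i^*)\,x_i = 0, \\
\partial L/\partial \xi_i &= C - \beta_i - \eta_i = 0, \qquad
\partial L/\partial \xi_i^* = C - \beta_i^* - \eta_i^* = 0 .
\end{aligned}
\]
```

The second condition already shows that w is a linear combination of the training inputs.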
Dual Formulation (cont'd): Dual Problem; Solving
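A sketch of the dual problem that results from substituting the stationarity conditions into the Lagrangian, under the same assumptions as the standard derivation (objective $\frac{1}{2}\|w\|^2 + C\sum_i(\xi_i + \xi_i^*)$, with $\beta_i, \beta_i^*$ as the multipliers of the band constraints; the symbol choice is an assumption):

```latex
\[
\begin{aligned}
\max_{\beta,\,\beta^*}\quad
  & -\tfrac{1}{2}\sum_{i,j=1}^{N}(\beta_i - \beta_i^*)(\beta_j - \beta_j^*)\,\langle x_i, x_j\rangle
    - \varepsilon\sum_{i=1}^{N}(\beta_i + \beta_i^*)
    + \sum_{i=1}^{N} y_i\,(\beta_i - \beta_i^*) \\
\text{subject to}\quad
  & \sum_{i=1}^{N}(\beta_i - \beta_i^*) = 0, \qquad 0 \le \beta_i,\ \beta_i^* \le C .
\end{aligned}
\]
% Solution in terms of the multipliers:
\[
w = \sum_{i=1}^{N}(\beta_i - \beta_i^*)\,x_i, \qquad
f(x) = \sum_{i=1}^{N}(\beta_i - \beta_i^*)\,\langle x_i, x\rangle + b .
\]
```

This is a quadratic program with box constraints, so any standard QP solver can be used to solve it.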
KKT Optimality Conditions and b: b can be computed from the KKT optimality conditions. These conditions mean that the Lagrange multipliers can be non-zero only for points on or outside the ε-band. Thus these points are the support vectors
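A sketch of the KKT complementarity conditions and the resulting expressions for b, assuming the standard linear ε-insensitive formulation with $\beta_i, \beta_i^*$ as the band multipliers (the symbol choice is an assumption):

```latex
\[
\begin{aligned}
\beta_i\,\bigl(\varepsilon + \xi_i - y_i + \langle w, x_i\rangle + b\bigr) &= 0, &
\beta_i^*\,\bigl(\varepsilon + \xi_i^* + y_i - \langle w, x_i\rangle - b\bigr) &= 0, \\
(C - \beta_i)\,\xi_i &= 0, &
(C - \beta_i^*)\,\xi_i^* &= 0 .
\end{aligned}
\]
% For a multiplier strictly between 0 and C the slack is zero and the
% band constraint is active, which pins down b:
\[
b = y_i - \langle w, x_i\rangle - \varepsilon \quad \text{for } 0 < \beta_i < C,
\qquad
b = y_i - \langle w, x_i\rangle + \varepsilon \quad \text{for } 0 < \beta_i^* < C .
\]
```

For a point strictly inside the band both bracketed terms in the first row are positive, so both of its multipliers must vanish; that is where the sparseness of the solution comes from.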
The Idea of SVM [Figure: mapping from input space to feature space]
Kernel Version: why can we use a kernel? The complexity of a function’s representation depends only on the number of SVs, and the complete algorithm can be described in terms of inner products. This gives an implicit mapping to the feature space: mapping via kernel
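Because only inner products are needed, the learned function can be evaluated as a kernel expansion over the support vectors. A minimal sketch, assuming the expansion $f(x) = \sum_i c_i K(x_i, x) + b$ where each coefficient $c_i$ is the difference of the two dual multipliers for point i; the Gaussian kernel and all function names are illustrative choices, not taken from the slides.

```python
import math

def linear_kernel(u, v):
    """Plain inner product: no feature mapping."""
    return sum(a * b for a, b in zip(u, v))

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: an inner product in an implicit feature space."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def svr_predict(x, support_vectors, coeffs, b, kernel):
    """Evaluate f(x) = sum_i coeffs[i] * K(x_i, x) + b."""
    return sum(c * kernel(sv, x) for sv, c in zip(support_vectors, coeffs)) + b

svs = [(1.0, 0.0), (0.0, 2.0)]
coeffs = [0.5, -1.0]  # hypothetical dual coefficients for illustration
y_lin = svr_predict((2.0, 1.0), svs, coeffs, 0.25, linear_kernel)
y_rbf = svr_predict((2.0, 1.0), svs, coeffs, 0.25, rbf_kernel)
```

Swapping `linear_kernel` for `rbf_kernel` changes the hypothesis space without touching the prediction code, and the cost of a prediction grows only with the number of support vectors.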
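Quadratic ε-Insensitive Loss Regression. Problem:
$$\min_{w, b, \xi, \xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} (\xi_i^2 + \xi_i^{*2})$$
subject to $y_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i$, $\langle w, x_i \rangle + b - y_i \le \varepsilon + \xi_i^*$, $i = 1, \dots, N$.
Kernel Formulation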
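A sketch of the kernel (dual) formulation for the quadratic loss, derived under the assumption that the primal is $\frac{1}{2}\|w\|^2 + C\sum_i(\xi_i^2 + \xi_i^{*2})$ with the two band constraints; the multiplier names $\beta_i, \beta_i^*$ are an assumption, and a different scaling of the primal changes the constant in front of $\delta_{ij}$:

```latex
\[
\begin{aligned}
\max_{\beta,\,\beta^*}\quad
  & \sum_{i=1}^{N} y_i\,(\beta_i - \beta_i^*)
    - \varepsilon\sum_{i=1}^{N}(\beta_i + \beta_i^*)
    - \tfrac{1}{2}\sum_{i,j=1}^{N}(\beta_i - \beta_i^*)(\beta_j - \beta_j^*)
      \Bigl(K(x_i, x_j) + \tfrac{1}{2C}\,\delta_{ij}\Bigr) \\
\text{subject to}\quad
  & \sum_{i=1}^{N}(\beta_i - \beta_i^*) = 0, \qquad \beta_i,\ \beta_i^* \ge 0 .
\end{aligned}
\]
% delta_ij is the Kronecker delta; the form above uses beta_i * beta_i^* = 0
% at the optimum, so beta_i^2 + beta_i^{*2} = (beta_i - beta_i^*)^2.
```

Compared with the linear loss, the upper bound C on the multipliers disappears and is replaced by an extra term on the diagonal of the kernel matrix.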
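Kernel Ridge Regression & Gaussian Processes: with ε = 0 we get least squares linear regression. The weight decay factor is controlled by C (λ ~ 1/C).
$$\min_{w} \ \lambda\|w\|^2 + \sum_{i=1}^{N} \xi_i^2$$
subject to $y_i - \langle w, x_i \rangle = \xi_i$, $i = 1, \dots, N$.
Kernel Formulation: $f(x) = y^\top (\mathbf{K} + \lambda I)^{-1} \mathbf{k}$ (I: identity matrix), where $\mathbf{K}_{ij} = K(x_i, x_j)$ and $\mathbf{k}_i = K(x_i, x)$. f(x) is also the mean of a Gaussian distribution (the Gaussian process prediction)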
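Kernel ridge regression has a closed-form solution, which makes it easy to sketch without a QP solver. The code below assumes the standard result $\alpha = (\mathbf{K} + \lambda I)^{-1} y$ and $f(x) = \sum_i \alpha_i K(x_i, x)$; the scalar Gaussian kernel, the small hand-written linear solver, and all names are illustrative choices.

```python
import math

def rbf(u, v, gamma=1.0):
    """Gaussian kernel on scalars (an illustrative choice)."""
    return math.exp(-gamma * (u - v) ** 2)

def solve_linear_system(A, y):
    """Solve A z = y by Gaussian elimination with partial pivoting."""
    n = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z

def kernel_ridge_fit(xs, ys, kernel, lam):
    """Return alpha = (K + lam * I)^{-1} y."""
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    return solve_linear_system(K, ys)

def kernel_ridge_predict(x, xs, alpha, kernel):
    """Evaluate f(x) = sum_i alpha_i * K(x_i, x)."""
    return sum(a * kernel(xi, x) for a, xi in zip(alpha, xs))

xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 2.0]
alpha = kernel_ridge_fit(xs, ys, rbf, lam=0.1)
print(kernel_ridge_predict(1.5, xs, alpha, rbf))
```

Unlike the ε-insensitive machines, every training point generally gets a non-zero coefficient here, so the solution is not sparse.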
Architecture of SV Regression Machine: similar to regression in a three-layered neural network!? [Figure: architecture diagram; label: b]
Conclusion: SVM is a useful alternative to neural networks. Two key concepts of SVM: optimization and the kernel trick. Advantages of SV regression: represents the solution by a small subset of training points; ensures the existence of a global minimum; ensures the optimization of a reliable generalization bound
Discussion 1: Influence of an insensitivity band on regression quality. 17 measured training data points are used. Left: ε = 0.1, 15 SVs are chosen. Right: ε = 0.5, the 6 chosen SVs produced a much better regression function
Discussion 2: ε-Insensitive Loss. Enables sparseness in the SVs, but does it guarantee sparseness? Robust (to small changes in data/model). Less sensitive to outliers