Aug. 27, 2003, IFAC-SYSID 2003. Functional Analytic Framework for Model Selection. Masashi Sugiyama, Tokyo Institute of Technology, Tokyo, Japan / Fraunhofer FIRST-IDA, Berlin, Germany

2 Regression Problem: From training examples $\{(x_i, y_i)\}_{i=1}^{n}$ with $y_i = f(x_i) + \varepsilon_i$, obtain a good approximation $\hat{f}$ to $f$. ($f$: learning target function; $\hat{f}$: learned function; $\{(x_i, y_i)\}_{i=1}^{n}$: training examples; $\varepsilon_i$: noise)

3 Model Selection. [Figure: target function and learned function under three models: too simple, appropriate, too complex.] The choice of the model is extremely important for obtaining a good learned function! (Here "model" refers to, e.g., the regularization parameter.)

4 Aims of Our Research. The model is chosen such that a generalization error estimator is minimized. Therefore, model selection research essentially amounts to pursuing an accurate estimator of the generalization error. We are interested in: (a) obtaining a novel method within a different framework; (b) estimating the generalization error from small (finite) samples.

5 Formulating the Regression Problem as a Function Approximation Problem. Let $H$ be a functional Hilbert space; we assume $f \in H$. We measure the "goodness" of the learned function (i.e., the generalization error) by $E_\varepsilon \|\hat{f} - f\|_H^2$, where $\|\cdot\|_H$ is the norm in $H$ and $E_\varepsilon$ denotes the expectation over the noise.

6 Function Spaces for Learning. In learning problems, we sample values of the target function at sample points (e.g., $y_i = f(x_i) + \varepsilon_i$). Therefore, the values of the target function at the sample points must be specified. This means that the usual $L_2$-space is not suitable for learning problems: two functions that differ only at the sample points $\{x_i\}$ take different values there, yet they are treated as the same function in $L_2$.

7 Reproducing Kernel Hilbert Spaces. In a reproducing kernel Hilbert space (RKHS), the value of a function at any input point is always specified. Indeed, an RKHS $H$ has the reproducing kernel $K(\cdot,\cdot)$ with the reproducing property $f(x) = \langle f, K(\cdot, x)\rangle_H$ for all $f \in H$, where $\langle\cdot,\cdot\rangle_H$ is the inner product in $H$.

8 Sampling Operator. For any RKHS $H$, there exists a linear operator $A$ from $H$ to $\mathbb{R}^n$ such that $Af = (f(x_1), \ldots, f(x_n))^\top$. Indeed, $A = \sum_{i=1}^{n} e_i \otimes K(\cdot, x_i)$, where $\otimes$ is the Neumann–Schatten product and $e_i$ is the $i$-th standard basis vector in $\mathbb{R}^n$. (For vectors/functions, $(a \otimes b)\,c = \langle c, b\rangle\, a$.)
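A minimal sketch (not from the talk; it assumes one-dimensional inputs and NumPy) of what the sampling operator computes: it just evaluates a function at the sample points.

```python
# Sketch: the sampling operator A maps a function f in the RKHS to its
# vector of sample values (f(x_1), ..., f(x_n)) in R^n.
import numpy as np

def sampling_operator(f, sample_points):
    """Apply A to f: return the vector of function values at the sample points."""
    return np.array([f(x) for x in sample_points])

# Example: sample the sinc target used later in the talk at 5 points.
x = np.linspace(-np.pi, np.pi, 5)
print(sampling_operator(np.sinc, x))
```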

9 Our Framework. [Diagram] The learning target function $f$ lives in the RKHS $H$; the sampling operator $A$ (always linear) maps it to the sample value space $\mathbb{R}^n$, where noise is added: $y = Af + \varepsilon$. The learning operator $X$ (generally non-linear) maps $y$ back to the learned function $\hat{f} = Xy$ in $H$. Generalization error: $E_\varepsilon\|\hat{f} - f\|_H^2$ ($E_\varepsilon$: expectation over the noise).

10 Tricks for Estimating the Generalization Error. We want to estimate $E_\varepsilon\|\hat{f} - f\|_H^2$, but it includes the unknown $f$, so this is not straightforward. To cope with this problem, we split the error into a constant part and an essential part, $E_\varepsilon\|\hat f - f\|_H^2 = E_\varepsilon\big[\|\hat f\|_H^2 - 2\langle \hat f, f\rangle_H\big] + \|f\|_H^2$, and estimate only the essential part; the constant $\|f\|_H^2$ does not depend on the model. We focus on the kernel regression model $\hat{f}(x) = \sum_{i=1}^{n} \alpha_i K(x, x_i)$, where $K$ is the reproducing kernel of $H$.
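A minimal sketch of the kernel regression model, assuming a Gaussian reproducing kernel and one-dimensional inputs (the kernel choice and function names are mine; the slides fix a Gaussian RKHS only in the simulation).

```python
# Sketch: evaluate f_hat(x) = sum_i alpha_i K(x, x_i) with a Gaussian kernel.
import numpy as np

def gauss_kernel(a, b, width=1.0):
    """K(a, b) = exp(-(a - b)^2 / (2 * width^2)) for 1-D inputs."""
    a = np.atleast_1d(a)
    b = np.atleast_1d(b)
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * width ** 2))

def kernel_model(alpha, x_train, x, width=1.0):
    """Evaluate the kernel regression model at the points x."""
    return gauss_kernel(x, x_train, width) @ alpha
```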

11 A Key Lemma. For the kernel regression model, the unknown target function can be erased: the essential generalization error is expressed entirely in terms of the coefficient vector $\alpha$, the kernel Gram matrix and its generalized inverse, and the noiseless sample values $Af = y - \varepsilon$ ($\dagger$: generalized inverse; $E_\varepsilon$: expectation over the noise).

12 Estimating the Essential Part. The resulting quantity is an unbiased estimator of the essential generalization error; however, it still contains the unknown noise vector $\varepsilon$. Let us therefore define an estimator in which the dependence on $\varepsilon$ is isolated in a single noise term; clearly, it is still unbiased. We would like to handle this noise term well.

13 How to Deal with the Noise Term. Depending on the type of the learning operator $X$, we consider the following three cases. A) $X$ is linear. B) $X$ is non-linear but twice almost differentiable. C) $X$ is general non-linear.

14 A) Examples of Linear Learning Operators. Kernel ridge regression: $\min_\alpha \big[\sum_{i=1}^{n} (y_i - \hat f(x_i))^2 + \lambda\,\langle K\alpha, \alpha\rangle\big]$ ($\alpha$: parameters to be learned; $\lambda$: ridge parameter). A particular Gaussian process regression and the least-squares support vector machine are further examples.
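A minimal sketch of kernel ridge regression as a linear learning operator, $\alpha = (K + \lambda I)^{-1} y$, assuming a Gaussian Gram matrix; the function and parameter names are mine.

```python
# Sketch: kernel ridge regression is linear in y, alpha = (K + ridge*I)^{-1} y.
import numpy as np

def gauss_gram(x, width=1.0):
    """Gram matrix K_ij = exp(-(x_i - x_j)^2 / (2 * width^2)) for 1-D inputs."""
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2 * width ** 2))

def kernel_ridge(x_train, y_train, ridge=0.1, width=1.0):
    """Return the coefficient vector alpha of the kernel regression model."""
    K = gauss_gram(x_train, width)
    return np.linalg.solve(K + ridge * np.eye(len(x_train)), y_train)
```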

15 A) Linear Learning. When the learning operator $X$ is linear, the noise term can be evaluated analytically using the adjoint $X^*$ of $X$. This induces the subspace information criterion (SIC), which is unbiased with finite samples: its noise expectation equals the essential generalization error. [M. Sugiyama & H. Ogawa (Neural Computation, 2001); M. Sugiyama & K.-R. Müller (JMLR, 2002)]
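A minimal sketch of an unbiased estimator of the essential generalization error in the linear case, derived from the decomposition on slide 10 under the assumption of a known noise variance $\sigma^2$; the exact SIC expression in the referenced papers may be written in a different but equivalent form.

```python
# Sketch for linear learning alpha = R y with R = (K + ridge*I)^{-1}:
# the essential error is E_eps[ <K a, a> - 2 <a, A f> ], and since
# E_eps[a' (A f)] = E_eps[a' y] - sigma^2 * trace(R), the quantity
#   a' K a - 2 a' y + 2 sigma^2 * trace(R)
# is an unbiased estimate of the essential generalization error.
import numpy as np

def sic_linear(K, y, ridge, sigma2):
    R = np.linalg.inv(K + ridge * np.eye(len(y)))   # linear learning operator
    alpha = R @ y                                   # learned coefficients
    return alpha @ K @ alpha - 2 * alpha @ y + 2 * sigma2 * np.trace(R)
```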

16 How to Deal with the Noise Term. Depending on the type of the learning operator $X$, we consider the following three cases. A) $X$ is linear. B) $X$ is non-linear but twice almost differentiable. C) $X$ is general non-linear.

17 B) Example of a Twice Almost Differentiable Learning Operator. Support vector regression with Huber's loss: $\min_\alpha \big[\sum_{i=1}^{n} \rho_\tau\big(y_i - \hat f(x_i)\big) + \lambda\,\langle K\alpha, \alpha\rangle\big]$ ($\lambda$: ridge parameter; $\tau$: threshold of Huber's loss).
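A minimal sketch of Huber's loss with threshold $\tau$ (standard definition; the variable names are mine): quadratic for small residuals, linear for large ones, which is what makes the resulting learning operator twice almost differentiable.

```python
# Sketch: Huber's loss rho_tau(r) = r^2/2 if |r| <= tau, else tau*|r| - tau^2/2.
import numpy as np

def huber_loss(residual, tau=1.0):
    r = np.abs(residual)
    return np.where(r <= tau, 0.5 * r ** 2, tau * r - 0.5 * tau ** 2)
```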

18 B) Twice Almost Differentiable Learning. For Gaussian noise, the expectation of the noise term can be expressed through the derivatives of the vector-valued learning function. This yields SIC for twice almost differentiable learning. It reduces to the original SIC if the learning operator is linear, and it is still unbiased with finite samples.
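One way to make this concrete (my assumption about the mechanism, based only on the slide's Gaussian-noise remark, not on the paper's formula) is to evaluate the noise term through the divergence of the learning map, approximated here by finite differences.

```python
# Sketch (assumption): for Gaussian noise with variance sigma2, the noise term
# E_eps[<g(y), eps>] can be estimated by sigma2 * sum_i d g_i / d y_i,
# where g maps the outputs y to the learned sample values or coefficients.
import numpy as np

def divergence_term(g, y, sigma2, h=1e-5):
    """Estimate sigma2 * sum_i dg_i/dy_i by central finite differences."""
    y = np.asarray(y, dtype=float)
    div = 0.0
    for i in range(len(y)):
        e = np.zeros_like(y)
        e[i] = h
        div += (g(y + e)[i] - g(y - e)[i]) / (2 * h)
    return sigma2 * div
```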

19 How to Deal with the Noise Term. Depending on the type of the learning operator $X$, we consider the following three cases. A) $X$ is linear. B) $X$ is non-linear but twice almost differentiable. C) $X$ is general non-linear.

20 C) Examples of General Non-Linear Learning Operators. Kernel sparse regression, and support vector regression with Vapnik's loss.
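A minimal sketch of Vapnik's $\epsilon$-insensitive loss (standard definition; names are mine); its non-differentiability is what places SVR in the general non-linear case.

```python
# Sketch: Vapnik's epsilon-insensitive loss, max(|r| - eps, 0).
import numpy as np

def vapnik_loss(residual, eps=0.1):
    return np.maximum(np.abs(residual) - eps, 0.0)
```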

21 C) General Non-Linear Learning. The noise term is approximated by the bootstrap, which gives the bootstrap approximation of SIC (BASIC). BASIC is almost unbiased with finite samples ($E^*$: expectation over bootstrap replications).
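A minimal sketch of the bootstrap idea only, not the paper's exact BASIC formula: resample residuals, re-run the (possibly non-linear) learner, and average the resulting noise terms. The residual-resampling scheme and helper names are my assumptions.

```python
# Sketch: approximate the intractable noise term E*[<alpha*, eps*>] by a
# residual bootstrap around the fitted kernel model.
import numpy as np

def bootstrap_noise_term(learn, K, y, n_boot=100, seed=None):
    """learn(y) -> alpha. Estimate the noise term by bootstrap resampling."""
    rng = np.random.default_rng(seed)
    alpha = learn(y)
    residuals = y - K @ alpha                      # crude noise estimates
    terms = []
    for _ in range(n_boot):
        eps_star = rng.choice(residuals, size=len(y), replace=True)
        alpha_star = learn(K @ alpha + eps_star)   # relearn on resampled data
        terms.append(alpha_star @ eps_star)
    return float(np.mean(terms))
```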

22 Simulation: Learning the Sinc Function. Setting: Gaussian RKHS, kernel ridge regression; the model to be selected is the ridge parameter $\lambda$.
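A minimal, self-contained sketch of such an experiment, reusing the linear-case estimator sketched above; the sample size, noise level, kernel width, and candidate ridge parameters are my own choices, not the values used in the talk.

```python
# Sketch: noisy sinc data, kernel ridge fits over a grid of ridge parameters,
# selection by the linear-case estimator a'Ka - 2 a'y + 2 sigma2 tr(R).
import numpy as np

def gauss_gram(x, width=1.0):
    d = x[:, None] - x[None, :]
    return np.exp(-d ** 2 / (2 * width ** 2))

rng = np.random.default_rng(0)
n, sigma2 = 50, 0.05
x = rng.uniform(-np.pi, np.pi, n)
y = np.sinc(x) + rng.normal(0.0, np.sqrt(sigma2), n)
K = gauss_gram(x, width=0.5)

scores = {}
for ridge in [1e-3, 1e-2, 1e-1, 1.0]:
    R = np.linalg.inv(K + ridge * np.eye(n))
    alpha = R @ y
    scores[ridge] = alpha @ K @ alpha - 2 * alpha @ y + 2 * sigma2 * np.trace(R)

print("selected ridge parameter:", min(scores, key=scores.get))
```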

23 Simulation: DELVE Data Sets. [Table: normalized test errors; red indicates the best method or methods comparable to it (95% t-test).]

24 Conclusions. We provided a functional analytic framework for regression in which the generalization error is measured using the RKHS norm: $E_\varepsilon\|\hat f - f\|_H^2$. Within this framework, we derived a generalization error estimator called SIC. A) Linear learning (kernel ridge regression, GPR, LS-SVM): SIC is exactly unbiased with finite samples. B) Twice almost differentiable learning (SVR with Huber's loss): SIC is exactly unbiased with finite samples. C) General non-linear learning (kernel sparse regression, SVR with Vapnik's loss): BASIC is almost unbiased.