Slide 1: Functional Analytic Approach to Model Selection
Masashi Sugiyama (sugi@cs.titech.ac.jp)
Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan
April 3, 2003, Max Planck Institute for Biological Cybernetics
Slide 2: Regression Problem
From training examples (pairs of an input point and an output value corrupted by noise), obtain a learned function that is as close to the learning target function as possible.
Slide 3: Typical Method of Learning
Linear regression model: the learned function is a linear combination of basis functions, with the combination coefficients as the parameters to be learned.
Ridge regression: the parameters are obtained by minimizing the training error plus the squared parameter norm multiplied by the ridge parameter.
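To make the slide concrete, here is a minimal sketch of ridge regression for a linear-in-parameters model. The Gaussian basis functions, the sinc-like target, and all numerical values are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def ridge_fit(Phi, y, lam):
    """Ridge regression: solve (Phi^T Phi + lam I) alpha = Phi^T y."""
    p = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T @ y)

rng = np.random.default_rng(0)
n, lam = 50, 0.1                                   # sample size and ridge parameter (assumed)
x = rng.uniform(-3, 3, n)
y = np.sinc(x) + 0.1 * rng.standard_normal(n)      # noisy outputs from an assumed sinc-like target

centers = x.copy()                                 # Gaussian basis functions centred on the inputs
def design(xq):
    return np.exp(-(xq[:, None] - centers[None, :]) ** 2 / 2.0)   # Phi[i, j] = phi_j(x_i)

alpha = ridge_fit(design(x), y, lam)               # parameters to be learned
f_hat = lambda xq: design(np.atleast_1d(xq)) @ alpha   # the learned function
print(f_hat(np.array([0.0, 1.0])))
```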
Slide 4: Model Selection
(Figure: target function and learned function for a too simple, an appropriate, and a too complex model.)
The choice of the model heavily affects the learned function. (Here "model" refers to, e.g., the ridge parameter.)
Slide 5: Ideal Model Selection
Determine the model such that a certain generalization error, measuring the badness of the learned function, is minimized.
Slide 6: Practical Model Selection
However, the generalization error cannot be directly calculated since it includes the unknown learning target function. Instead, determine the model such that an estimator of the generalization error is minimized. We want an accurate estimator of the generalization error (this is not the aim of Bayesian model selection using the evidence).
Slide 7: Two Approaches to Estimating the Generalization Error (1)
Try to obtain unbiased estimators: C_P (Mallows, 1973), cross-validation, the Akaike Information Criterion (Akaike, 1974), etc. These methods are interested in typical-case performance.
Slide 8: Two Approaches to Estimating the Generalization Error (2)
Try to obtain upper bounds that hold with high probability: the VC bound (Vapnik & Chervonenkis, 1974), the span bound (Chapelle & Vapnik, 2000), the concentration bound (Bousquet & Elisseeff, 2001), etc. These methods are interested in worst-case performance.
Slide 9: Popular Choices of Generalization Measure
The risk (expected loss); e.g., the Kullback-Leibler divergence between the target density and the learned density.
Slide 10: Concerns in Existing Methods
The approximation used often requires a large (infinite) number of training examples for its justification (asymptotic approximation), so these methods do not work with small samples.
The generalization measure must be integrated over the input distribution from which the training examples are drawn, so these methods cannot be used for transduction (estimating the error at a point of interest).
Slide 11: Our Interests
We are interested in:
- Estimating the generalization error with accuracy guaranteed for small (finite) samples
- Estimating the transduction error (the error at a point of interest)
- Investigating the role of unlabeled samples (samples without output values)
Slide 12: Our Generalization Measure
We assume that the learning target function and the learned function belong to a functional Hilbert space, and we measure the generalization error by the norm of their difference in that function space.
Slide 13: Generalization Measure in a Functional Hilbert Space
A functional Hilbert space is specified by (a) a set of functions which span the space and (b) an inner product (and norm). Given a set of functions, we can design the inner product (and therefore the generalization measure) as desired.
Slide 14: Examples of the Norm
- Weighted distance in the input domain (with a weight function)
- Weighted distance in the Fourier domain
- Sobolev norm (involving derivatives of the error function up to some order)
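Written out, the three norms might take the following typical forms; this is only a sketch, since the particular weight functions and derivative order used on the original slide are not recoverable from the transcript. Here g stands for the difference between the learned and target functions.

```latex
% g = \hat{f} - f,  w: weight function,  \tilde{g}: Fourier transform of g
\|g\|_{\mathrm{input}}^2   = \int |g(x)|^2 \, w(x) \, dx                         % weighted distance in the input domain
\|g\|_{\mathrm{Fourier}}^2 = \int |\tilde{g}(\omega)|^2 \, w(\omega) \, d\omega  % weighted distance in the Fourier domain
\|g\|_{\mathrm{Sobolev}}^2 = \sum_{j=0}^{k} \int |g^{(j)}(x)|^2 \, dx            % Sobolev norm up to the k-th derivative
```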
Slide 15: Interesting Features
- When the weight function is the input density from which the inputs are drawn, unlabeled samples can be used for estimating the generalization error (the prediction error).
- For transductive inference (given a test point of interest), the weight can be concentrated at that point.
- For interpolation or extrapolation, the weight can be put on the desired region.
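The first point is easy to see numerically: when the weight function equals the input density, the weighted squared norm is an expectation over that density and can therefore be approximated by an average over unlabeled inputs. A minimal sketch; the function g and the input density below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Weighted squared norm  ||g||_w^2 = integral of g(x)^2 w(x) dx  with  w = input density p.
# Since this equals E_{x ~ p}[g(x)^2], unlabeled inputs drawn from p are enough to estimate it.
g = lambda x: np.sinc(x) - np.exp(-x ** 2)     # some fixed function (illustrative)
x_unlabeled = rng.standard_normal(100_000)     # unlabeled inputs; here p is the standard normal

norm_estimate = np.mean(g(x_unlabeled) ** 2)   # Monte Carlo estimate of ||g||_w^2
print(norm_estimate)
```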
Slide 16: Goal of My Talk
I suppose that you like the generalization measure defined in the functional Hilbert space. The goal of my talk is to give a method for estimating this generalization error, i.e., the error measured by the norm in the function space.
Slide 17: Function Spaces for Learning
For further discussion, we have to specify the class of function spaces, and we want the class to be as unrestrictive as possible. A general function space such as L2 is not suitable for learning problems, because the value of a function at a point is not specified there: two functions that differ only at a single point take different values at that point, but they are treated as the same function in L2.
Slide 18: Reproducing Kernel Hilbert Space
A function space that is rather general and in which the value of a function at a point is specified is the reproducing kernel Hilbert space (RKHS). An RKHS has a reproducing kernel K(x, x') with the following properties:
- For any fixed x', K(x, x') is a function of x in the RKHS.
- For any function f in the RKHS and any x', the inner product of f with K(., x') in the function space equals f(x').
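In symbols, the two defining properties read as follows; these are standard RKHS facts, and the Gaussian kernel on the last line is just a common example of a reproducing kernel (the Gaussian RKHS reappears later in the talk).

```latex
K(\cdot, x') \in \mathcal{H} \quad \text{for every fixed } x'
\langle f, K(\cdot, x') \rangle_{\mathcal{H}} = f(x') \quad \text{for all } f \in \mathcal{H} \text{ and all } x'
K(x, x') = \exp\!\left(-\frac{\|x - x'\|^2}{2\sigma_K^2}\right) \quad \text{(e.g., the Gaussian kernel)}
```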
Slide 19: Formulation of the Learning Problem
- The learning target function belongs to a specified RKHS.
- The training input points are fixed; the noise has mean 0 and variance sigma^2.
- Learning is a linear estimation: the learned function is obtained from the sample values by a linear operator (e.g., ridge regression for a linear model whose basis functions lie in the RKHS).
Slide 20: Sampling Operator
For any RKHS, there exists a linear operator from the RKHS to the sample value space that maps a function to its values at the training input points. Indeed, it can be written as a sum of Neumann-Schatten products of the standard basis vectors of the sample value space with the kernel functions centered at the training points. (For vectors u and v, the Neumann-Schatten product of u and v is the rank-one operator mapping g to the inner product of g with v, times u.)
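For a finite-dimensional RKHS the sampling operator is simply a matrix: if a function is written as a linear combination of basis functions, the matrix whose i-th row contains the basis functions evaluated at the i-th training input maps the coefficient vector to the vector of sample values. A small numerical sketch under that assumption (the trigonometric basis is an arbitrary illustrative choice):

```python
import numpy as np

def phi(x, p=5):
    """Evaluate p basis functions at the points x (illustrative trigonometric choice)."""
    k = np.arange(p)
    return np.cos(np.outer(x, k))            # shape (len(x), p)

x_train = np.linspace(-1.0, 1.0, 4)
X = phi(x_train)                             # sampling operator as a 4 x p matrix

c = np.array([1.0, 0.5, 0.0, -0.2, 0.1])     # coefficients of some function f in the span
f = lambda x: phi(x) @ c

# Applying the sampling operator to the coefficient vector reproduces (f(x_1), ..., f(x_n)).
print(np.allclose(X @ c, f(x_train)))        # True
```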
Slide 21: Formulation
(Diagram.) The learning target function in the RKHS is mapped by the sampling operator (always linear) to the sample value space, where noise is added. The learning operator (assumed linear) maps the noisy sample values back to the learned function in the RKHS, and the generalization error is measured in the RKHS.
Slide 22: Expected Generalization Error
We are interested in typical performance, so we estimate the generalization error in expectation over the noise. We do not take the expectation over the input points, so the criterion is data-dependent, and we do not assume that the input points are drawn randomly, which is advantageous in active learning.
Slide 23: Bias/Variance Decomposition
The expected generalization error decomposes into a bias term and a variance term. The variance term can be computed from the noise variance and the adjoint of the learning operator; what remains is the bias, which we want to estimate.
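For a linear learner obtained as f-hat = A y from the noisy sample values y, with zero-mean noise of variance sigma^2, the decomposition reads as follows (a standard identity, written out here for reference; A* denotes the adjoint of A):

```latex
\mathbb{E}_{\varepsilon} \|\hat{f} - f\|^2
  = \underbrace{\|\mathbb{E}_{\varepsilon}\hat{f} - f\|^2}_{\text{bias}}
  + \underbrace{\mathbb{E}_{\varepsilon} \|\hat{f} - \mathbb{E}_{\varepsilon}\hat{f}\|^2}_{\text{variance}},
\qquad
\mathbb{E}_{\varepsilon} \|\hat{f} - \mathbb{E}_{\varepsilon}\hat{f}\|^2
  = \sigma^2 \, \mathrm{tr}(A A^{*}) \quad \text{for } \hat{f} = A y .
```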
Slide 24: Trick for Estimating the Bias
Suppose we have a linear operator that gives an unbiased estimate (in expectation over the noise) of the learning target function. We use this rough unbiased estimate for estimating the bias of the learned function. [Sugiyama & Ogawa (Neural Computation, 2001)]
Slide 25: Unbiased Estimator of the Bias
(Diagram in the RKHS: the bias of the learned function is estimated by comparing it with the rough, unbiased estimate of the learning target.)
Slide 26: Subspace Information Criterion (SIC)
SIC is the sum of the estimate of the bias and the variance term, and it is an unbiased estimator of the (expected) generalization error with finite samples.
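Putting the last three slides together, a sketch of the criterion consistent with this construction is given below, writing A y for the learned function and X_u y for the rough unbiased estimate of the target (the notation is mine; see Sugiyama & Ogawa, Neural Computation 2001, for the exact statement).

```latex
\widehat{\mathrm{bias}} = \|A y - X_u y\|^2
  \;-\; \sigma^2 \, \mathrm{tr}\bigl((A - X_u)(A - X_u)^{*}\bigr)
  \qquad \text{(unbiased estimate of } \|\mathbb{E}_{\varepsilon}\hat{f} - f\|^2\text{)}
\mathrm{SIC} = \widehat{\mathrm{bias}} \;+\; \sigma^2 \, \mathrm{tr}(A A^{*})
```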
Slide 27: Obtaining an Unbiased Estimate
We need an operator that gives an unbiased estimate of the learning target. It exists if and only if the kernel functions centered at the training points span the entire space; when this is satisfied, it is given by a generalized inverse of the sampling operator. We can then enjoy all the features (unlabeled samples, transductive inference, etc.).
Slide 28: Example of Using SIC: Standard Linear Regression
The learning target function is a linear combination of fixed basis functions with unknown coefficients. The regression model has the same form, and its parameters are estimated linearly (e.g., by ridge regression).
Slide 29: Example (cont.)
The generalization measure is a weighted norm with a chosen weight function. If the design matrix has rank p (p: number of basis functions), then the best linear unbiased estimator (BLUE) always exists. In this case, SIC provides an unbiased estimate of the above generalization error.
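This example can be turned into a short numerical sketch: when the design matrix has full column rank, ordinary least squares is the BLUE and serves as the unbiased operator, and the SIC value as sketched after slide 26 can be evaluated for each candidate ridge parameter. All concrete choices below (trigonometric design, known noise variance, and the parameter-space Euclidean norm as the generalization measure) are illustrative assumptions, not the talk's experiment.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, sigma2 = 100, 10, 0.05                     # sample size, basis count, known noise variance (assumed)
x = np.linspace(-np.pi, np.pi, n)
Phi = np.column_stack([np.cos(k * x) for k in range(p)])     # design matrix with rank p
theta_true = rng.standard_normal(p) * np.exp(-0.5 * np.arange(p))
y = Phi @ theta_true + np.sqrt(sigma2) * rng.standard_normal(n)

X_u = np.linalg.pinv(Phi)                        # ordinary least squares = BLUE, hence unbiased

def sic(lam):
    """SIC for ridge regression with ridge parameter lam (sketch, Euclidean parameter norm)."""
    A = np.linalg.solve(Phi.T @ Phi + lam * np.eye(p), Phi.T)    # ridge learning matrix
    D = A - X_u
    bias_hat = np.sum((D @ y) ** 2) - sigma2 * np.trace(D @ D.T) # unbiased estimate of the bias
    variance = sigma2 * np.trace(A @ A.T)
    return bias_hat + variance

candidates = np.logspace(-4, 2, 30)
print(min(candidates, key=sic))                  # ridge parameter chosen by (this sketch of) SIC
```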
Slide 30: Applicability of SIC
However, the design matrix has rank p only if p does not exceed the number n of training examples. Therefore, the target function should be included in a rather small model, and the range of application of SIC is rather limited.
Slide 31: When an Unbiased Estimate Does Not Exist
An unbiased estimate exists if and only if the kernel functions centered at the training points span the whole space. When this condition is not fulfilled, let us restrict ourselves to finding a learning result function from a subspace, not from the entire RKHS. [Sugiyama & Müller (JMLR, 2002)]
Slide 32: Essential Generalization Error
The generalization error splits into an essential part and an irrelevant part that is constant with respect to the model. Essentially, we are estimating the error in which the learning target is simply replaced by its projection onto the subspace.
Slide 33: Unbiased Estimate of the Projection
If a linear operator that gives an unbiased estimate of the projection of the learning target is available, then SIC is an unbiased estimator of the essential generalization error. Such an operator exists if and only if the subspace is included in the span of the kernel functions centered at the training points, as is the case for, e.g., the kernel regression model.
Slide 34: Restriction
However, another restriction arises: if the generalization measure is designed as desired, we have to use the kernel function induced by that generalization measure.
Slide 35: Restriction (cont.)
On the other hand, if a desired kernel function is used, then we have to use the generalization measure induced by that kernel. For example, the generalization measure in a Gaussian RKHS heavily penalizes high-frequency components.
Slide 36: Summary of Usage of SIC
SIC essentially has two modes:
- For rather restricted linear regression, SIC has several interesting properties: unlabeled samples can be utilized for estimating the prediction error (expected test error), and any weighted error measure can be used, e.g., for interpolation, extrapolation, or transductive inference.
- For kernel regression, SIC can always be applied; however, the kernel-induced generalization measure should be employed.
Slide 37: Simulation (1): Setting
Function space: a trigonometric polynomial RKHS, with its span and generalization measure specified accordingly. Learning target function: a sinc-like function in this RKHS.
Slide 38: Simulation (1): Setting (cont.)
Training examples are generated with additive noise, and ridge regression is used for learning. (The number of training examples, the noise variance, and the number of trials are specified on the original slide.)
Slide 39: Simulation (1-a): Using Unlabeled Samples
We estimate the prediction error using 1000 unlabeled samples.
Slide 40: Results: Unlabeled Samples
(Figure: results plotted against the ridge parameter.) Values can be negative since some constants are ignored.
Slide 41: Results: Unlabeled Samples (cont.)
(Figure: results plotted against the ridge parameter.)
Slide 42: Simulation (1-b): Transduction
We estimate the test error at a single test point.
Slide 43: Results: Transduction
(Figure: results plotted against the ridge parameter.)
Slide 44: Results: Transduction (cont.)
(Figure: results plotted against the ridge parameter.)
Slide 45: Simulation (2): Infinite-Dimensional RKHS
Function space: a Gaussian RKHS. Learning target function: the sinc function. Training examples: noisy samples of the target. We estimate the generalization error in this RKHS.
Slide 46: Results: Gaussian RKHS
(Figure: results plotted against the ridge parameter.)
Slide 47: Results: Gaussian RKHS (cont.)
(Figure: results plotted against the ridge parameter.)
Slide 48: Simulation (3): DELVE Data Sets
Function space: a Gaussian RKHS. We choose the ridge parameter by SIC, by leave-one-out cross-validation, and by an empirical Bayesian method (marginal likelihood maximization; Akaike, 1980). Performance is compared by the test error.
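For reference, the leave-one-out baseline in this comparison has a well-known closed form for kernel ridge regression via the hat matrix, so no explicit retraining loop is needed. The sketch below uses a Gaussian kernel and synthetic data; the DELVE data sets and the exact models of the experiment are not reproduced here.

```python
import numpy as np

def gauss_kernel(X1, X2, width=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * width ** 2))

def loo_error(K, y, lam):
    """Closed-form leave-one-out squared error for kernel ridge regression."""
    n = len(y)
    H = K @ np.linalg.inv(K + lam * np.eye(n))        # hat matrix mapping y to fitted values
    residual = y - H @ y
    return np.mean((residual / (1.0 - np.diag(H))) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))                  # synthetic inputs (illustrative)
y = np.sinc(X[:, 0]) + 0.1 * rng.standard_normal(80)

K = gauss_kernel(X, X)
candidates = np.logspace(-4, 1, 20)
print(min(candidates, key=lambda lam: loo_error(K, y, lam)))   # ridge parameter chosen by LOO-CV
```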
Slide 49: Normalized Test Errors
(Table of normalized test errors; red marks the best method or those comparable to it under a 95% t-test.)
Slide 50: Image Restoration
A restoration filter (e.g., a Gaussian filter or a regularization filter) is applied to a degraded image, and we would like to determine its parameter values appropriately: a parameter that is too large or too small gives a poor restoration. [Sugiyama et al. (IEICE Trans., 2001); Sugiyama & Ogawa (Signal Processing, 2002)]
Slide 51: Formulation
(Diagram.) The original image in a Hilbert space passes through a degradation filter and is corrupted by noise, yielding the observed image; a restoration filter maps the observed image to the restored image.
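As one concrete instance of such a restoration filter, a Tikhonov-type regularization filter can be written in matrix form: the restored signal solves a least-squares problem penalized by a parameter that still has to be selected (which is where SIC enters). The 1-D moving-average blur and all numbers below are illustrative assumptions, not the filters or images of the papers.

```python
import numpy as np

def blur_matrix(n, width=3):
    """A simple 1-D moving-average degradation operator (illustrative)."""
    A = np.zeros((n, n))
    for i in range(n):
        lo, hi = max(0, i - width // 2), min(n, i + width // 2 + 1)
        A[i, lo:hi] = 1.0 / (hi - lo)
    return A

def regularization_filter(A, g, gamma):
    """Tikhonov-type restoration: minimize ||A f - g||^2 + gamma ||f||^2."""
    n = A.shape[1]
    return np.linalg.solve(A.T @ A + gamma * np.eye(n), A.T @ g)

rng = np.random.default_rng(0)
f = np.sin(np.linspace(0, 4 * np.pi, 128))             # original signal (one image row, say)
A = blur_matrix(128)
g = A @ f + 0.05 * rng.standard_normal(128)            # observed: degraded and noisy
f_hat = regularization_filter(A, g, gamma=0.01)        # gamma is the parameter to be selected
print(np.mean((f_hat - f) ** 2))
```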
Slide 52: Results with the Regularization Filter
(Figure: original images, degraded images, and images restored using SIC.)
Slide 53: Precipitation Estimation
Estimating future precipitation from past precipitation and weather radar data. Our method with SIC won the 1st prize in estimation accuracy in the IEICE Precipitation Estimation Contest 2001 (the precipitation and weather radar data are from that contest). [Moro & Sugiyama (IEICE General Conf., 2001)]
Results: 1st, TokyoTech, MSE = 0.71; 2nd, KyuTech, MSE = 0.75; 3rd, Chiba Univ, MSE = 0.93; 4th, MSE = 1.18.
Slide 54: References (Fundamentals of SIC)
Proposing the concept of SIC:
Sugiyama, M. & Ogawa, H. Subspace information criterion for model selection. Neural Computation, vol.13, no.8, pp.1863-1889, 2001.
Performance evaluation of SIC:
Sugiyama, M. & Ogawa, H. Theoretical and experimental evaluation of the subspace information criterion. Machine Learning, vol.48, no.1/2/3, pp.25-50, 2002.
Slide 55: References (SIC for Particular Learning Methods)
SIC for regularization learning:
Sugiyama, M. & Ogawa, H. Optimal design of regularization term and regularization parameter by subspace information criterion. Neural Networks, vol.15, no.3, pp.349-361, 2002.
SIC for sparse regressors:
Tsuda, K., Sugiyama, M., & Müller, K.-R. Subspace information criterion for non-quadratic regularizers --- Model selection for sparse regressors. IEEE Transactions on Neural Networks, vol.13, no.1, pp.70-80, 2002.
Slide 56: References (Applications of SIC)
Applying SIC to image restoration:
Sugiyama, M., Imaizumi, D., & Ogawa, H. Subspace information criterion for image restoration --- Optimizing parameters in linear filters. IEICE Transactions on Information and Systems, vol.E84-D, no.9, pp.1249-1256, Sep. 2001.
Sugiyama, M. & Ogawa, H. A unified method for optimizing linear image restoration filters. Signal Processing, vol.82, no.11, pp.1773-1787, 2002.
Applying SIC to precipitation estimation:
Moro, S. & Sugiyama, M. Estimation of precipitation from meteorological radar data. In Proceedings of the 2001 IEICE General Conference, SD-1-10, pp.264-265, Shiga, Japan, Mar. 26-29, 2001.
Slide 57: References (Extensions of SIC)
Extending the range of application of SIC:
Sugiyama, M. & Müller, K.-R. The subspace information criterion for infinite dimensional hypothesis spaces. Journal of Machine Learning Research, vol.3 (Nov), pp.323-359, 2002.
Further improving SIC:
Sugiyama, M. Improving precision of the subspace information criterion. IEICE Transactions on Fundamentals (to appear).
Sugiyama, M., Kawanabe, M., & Müller, K.-R. Trading variance reduction with unbiasedness --- The regularized subspace information criterion for robust model selection (submitted).
Slide 58: Conclusions
We formulated the regression problem from a functional analytic point of view. Within this framework, we gave a generalization error estimator called the subspace information criterion (SIC). The unbiasedness of SIC is guaranteed even with finite samples. We did not take the expectation over the training sample points, so SIC may be more data-dependent.
Slide 59: Conclusions (cont.)
SIC essentially has two modes:
- For rather restrictive linear regression, SIC has several interesting properties: unlabeled samples can be utilized for estimating the prediction error, and any weighted error measure can be used, e.g., for interpolation, extrapolation, or transductive inference.
- For kernel regression, SIC can always be applied; however, the kernel-induced generalization measure should be employed.