Introduction to Predictive Learning LECTURE SET 4 Statistical Learning Theory Electrical and Computer Engineering
OUTLINE of Set 4 Objectives and Overview Inductive Learning Problem Setting Keep-It-Direct Principle Analysis of ERM VC-dimension Generalization Bounds Structural Risk Minimization (SRM) Summary and Discussion
Objectives Problems with philosophical approaches - lack quantitative description/ characterization of ideas; - no real predictive power (as in Natural Sciences) - no agreement on basic definitions/ concepts (as in Natural Sciences) Goal: to introduce Predictive Learning as a scientific discipline
Characteristics of Scientific Theory Problem setting Solution approach Math proofs (technical analysis) Constructive methods Applications Note: Problem Setting and Solution Approach are independent (of each other)
History and Overview SLT aka VC-theory (Vapnik-Chervonenkis) Theory for estimating dependencies from finite samples (predictive learning setting) Based on the risk minimization approach All main results originally developed in 1970’s for classification (pattern recognition) – why? but remained largely unknown Recent renewed interest due to practical success of Support Vector Machines (SVM)
History and Overview(cont’d) MAIN CONCEPTUAL CONTRIBUTIONS Distinction between problem setting, inductive principle and learning algorithms Direct approach to estimation with finite data (KID principle) Math analysis of ERM (standard inductive setting) Two factors responsible for generalization: - empirical risk (fitting error) - complexity(capacity) of approximating functions
Importance of VC-theory Math results addressing the main question: - under what general conditions the ERM approach leads to (good) generalization? New approach to induction: Predictive vs generative modeling (in classical statistics) Connection to philosophy of science - VC-theory developed for binary classification (pattern recognition) ~ the simplest generalization problem - natural sciences: from observations to scientific law VC-theoretical results can be interpreted using general philosophical principles of induction, and vice versa.
Inductive Learning Setting The learning machine observes samples (x ,y), and returns an estimated response Two modes of inference: identification vs imitation Risk
The Problem of Inductive Learning Given: finite training samples Z={(xi, yi),i=1,2,…n} choose from a given set of functions f(x, w) the one that approximates best the true output. (in the sense of risk minimization) Concepts and Terminology approximating functions f(x, w) (non-negative) loss function L(f(x, w),y) expected risk functional R(Z,w) Goal: find the function f(x, wo) minimizing R(Z,w) when the joint distribution P(x,y) is unknown.
Empirical Risk Minimization ERM principle in model-based learning Model parameterization: f(x, w) Loss function: L(f(x, w),y) Estimate risk from data: Choose w* that minimizes Remp Statistical Learning Theory developed from the theoretical analysis of ERM principle under finite sample settings
Probabilistic Modeling vs ERM
Probabilistic Modeling vs ERM: Example Known class distribution optimal decision boundary
Probabilistic Approach Estimate parameters of Gaussian class distributions , and plug them into quadratic decision boundary
ERM Approach Quadratic and linear decision boundary estimated via minimization of squared loss
Estimation of multivariate functions Is it possible to estimate a function from finite data? Simplified problem: estimation of unknown continuous function from noise-free samples Many results from function approximation theory: To estimate accurately a d-dimensional function one needs O(n^^d) data points For example, if 3 points are needed to estimate 2-nd order polynomial for d=1, then 3^^10 points are needed to estimate 2-nd order polynomial in 10-dimensional space. Similar results in signal processing Never enough data points to estimate multivariate functions in most practical applications (image recognition, genomics etc.) For multivariate function estimation, the number of free parameters increases exponentially with problem dimensionality (the Curse of Dimensionality)
Properties of high-dimensional data Sparse data looks like a porcupine: the volume of a unit sphere inscribed in a d-dimensional cube gets smaller even as the volume of d-cube gets exponentially larger! A point is closer to an edge than to another point Pairwise distances between points are the same Intuition behind kernel (local) methods no longer holds. How generalization is possible, in spite of the curse of dimensionality?
OUTLINE of Set 4 Objectives and Overview Inductive Learning Problem Setting Keep-It-Direct Principle Analysis of ERM VC-dimension Generalization Bounds Structural Risk Minimization (SRM) Summary and Discussion
Keep-It-Direct Principle The goal of learning is generalization rather than estimation of true function (system identification) Keep-It-Direct Principle (Vapnik, 1995) Do not solve an estimation problem of interest by solving a more general (harder) problem as an intermediate step Good predictive model reflects some properties of unknown distribution P(x,y) Since model estimation with finite data is ill-posed, one should never try to solve a more general problem than required by given application Importance of formalizing application requirements as a learning problem.
Learning vs System Identification Consider regression problem where unknown target function Goal 1: Prediction Goal 2: Function Approximation (system identification) or Admissible models: algebraic polynomials Purpose of comparison: contrast goals (1) and (2) NOTE: most applications assume Goal 2, i.e. Noisy Data ~ true signal + noise
Empirical Comparison Target function: sine-squared Input distribution: non-uniform Gaussian pdf Additive gaussian noise with st. deviation = 0.1
Empirical Comparison (cont’d) Model selection: use separate data sets - training : for parameter estimation - validation: for selecting polynomial degree - test: for estimating prediction risk (MSE) Validation set generated differently to contrast (1)&(2) Predictive Learning (1) ~ Gaussian Funct. Approximation (2) ~ uniform fixed sampling Training + test data ~ Gaussian Training set size: 30 Validation set size : 30
Regression estimates (2 typical realizations of data): Dotted line ~ estimate obtained using predictive learning Dashed line ~ estimate via function approximation setting
Conclusion The goal of prediction (1) is different (less demanding) than the goal of estimating the true target function (2) everywhere in the input space. The curse of dimensionality applies to system identification setting (2), but may not hold under predictive setting (1). Both settings coincide if the input distribution is uniform (i.e., in signal and image denoising applications)
Philosophical Interpretation of KID Interpretation of predictive models Realism ~ objective truth (hidden in Nature) Instrumentalism ~ creation of human mind (imposed on the data) – favored by KID Objective Evaluation still possible (via prediction risk reflecting application needs) Natural Science Methodological implications Importance of good learning formulations (asking the ‘right question’) Accounts for 80% of success in applications
OUTLINE of Set 4 Objectives and Overview Inductive Learning Problem Setting Keep-It-Direct Principle Analysis of ERM VC-dimension Generalization Bounds Structural Risk Minimization (SRM) Summary and Discussion
VC-theory has 4 parts: Analysis of consistency/convergence of ERM Generalization bounds Inductive principles (for finite samples) Constructive methods (learning algorithms) for implementing (3) NOTE: (1)(2)(3)(4)
Consistency/Convergence of ERM Empirical Risk known but Expected Risk unknown Asymptotic consistency requirement: under what (general) conditions models providing min Empirical Risk will also provide min Prediction Risk, when the number of samples grows large? Why asymptotic analysis is needed? - helps to develop useful concepts - necessary and sufficient conditions ensure that VC-theory is general and can not be improved
Consistency of ERM Convergence of empirical risk to expected risk does not imply consistency of ERM Models estimated via ERM (w*) are always biased estimates of the functions minimizing true risk:
Conditions for Consistency of ERM Main insight: consistency is not possible without restricting the set of possible models Example: 1-nearest neighbor classification method. - is it consistent ? Consider binary decision functions (classification) How to measure their flexibility, or ability to ‘explain’ the training data (for binary classification)? This complexity index for indicator functions: - is independent of unknown data distribution; - measures the capacity of a set of possible models, rather than characteristics of the ‘true model’
OUTLINE of Set 4 Objectives and Overview Inductive Learning Problem Setting Keep-It-Direct Principle Analysis of ERM VC-dimension Generalization Bounds Structural Risk Minimization (SRM) Summary and Discussion
SHATTERING Linear indicator functions: can split 3 data points in 2D in all 2^^3 = 8 possible binary partitions If a set of n samples can be separated by a set of functions in all 2^^n possible ways, this sample is said to be shattered (by the set of functions) Shattering ~ a set of models can explain a given sample of size n (for all possible labelings)
VC DIMENSION Definition: A set of functions has VC-dimension h is there exist h samples that can be shattered by this set of functions, but there are no h+1 samples that can be shattered VC-dimension h=3 ( h=d+1 for linear functions ) VC-dim. is a positive integer (combinatorial index) What is VC-dim. of 1-nearest neighbor classifier ?
VC-dimension and Consistency of ERM VC-dimension is infinite if a sample of size n can be split in all 2^^n possible ways (in this case, no valid generalization is possible) Finite VC-dimension gives necessary and sufficient conditions for: (1) consistency of ERM-based learning (2) fast rate of convergence (these conditions are distribution-independent) Interpretation of the VC-dimension via falsifiability: functions with small VC-dim can be easily falsified
VC-dimension and Falsifiability A set of functions has VC-dimension h if It can explain (shatter) a set of h samples ~ there exists h samples that cannot falsify it and (b) It can not shatter h+1 samples ~ any h+1 samples falsify this set Finiteness of VC-dim is necessary and sufficient condition for generalization (for any learning method based on ERM)
Recall Occam’s Razor: Main problem in predictive learning - Complexity control (model selection) - How to measure complexity? Interpretation of Occam’s razor (in Statistics): Entities ~ model parameters Complexity ~ degrees-of-freedom Necessity ~ explaining (fitting) available data Model complexity = number of parameters (DoF) Consistent with classical statistical view: learning = function approx. / density estimation
Philosophical Principle of VC-falsifiability Occam’s Razor: Select the model that explains available data and has the smallest number of free parameters (entities) VC theory: Select the model that explains available data and has low VC-dimension (i.e. can be easily falsified) New principle of VC-falsifiability
Calculating the VC-dimension How to estimate the VC-dimension (for a given set of functions)? Apply definition (via shattering) to derive analytic estimates – works for ‘simple’ sets of functions Generally, such analytic estimates are not possible for complex nonlinear parameterizations (i.e., for practical machine learning and statistical methods)
Example: VC-dimension of spherical indicator functions. Consider spherical decision surfaces in a d-dimensional x-space, parameterized by center c and radius r parameters: In a 2-dim space (d=2) there exists 3 points that can be shattered, but 4 points cannot be shattered h=3
Example: VC-dimension of a linear combination of fixed basis functions (i.e. polynomials, Fourier expansion etc.) Assuming that basis functions are linearly independent, the VC-dim equals the number of basis functions (free parameters). Example: single parameter but infinite VC-dimension
Example: Wide linear decision boundaries Consider linear functions such that the distance btwn D(x) and the closest data sample is larger than given value Then VC-dimension depends on the width parameter, rather than d (as in linear models):
Linear combination of fixed basis functions is equivalent to linear functions in m-dimensional space VC-dimension = m + 1 (this assumes linear independence of basis functions) In general, analytic estimation of VC-dimension is hard VC-dimension can be - equal to DoF - larger than DoF - smaller than DoF
VC-dimension vs number of parameters VC-dimension can be equal to DoF (number of parameters) Example: linear estimators VC-dimension can be smaller than DoF Example: penalized estimators VC-dimension can be larger than DoF Example: feature selection sin (wx)
VC-dimension for Regression Problems VC-dimension was defined for indicator functions Can be extended to real-valued functions, i.e. third-order polynomial for univariate regression: linear parameterization VC-dim = 4 Qualitatively, the VC-dimension ~ the ability to fit (or explain) finite training data for regression. This can be also related to falsifiability
Example: what is VC-dim of kNN Regression? Ten training samples from Using k-nn regression with k=1 and k=4:
OUTLINE of Set 4 Objectives and Overview Inductive Learning Problem Setting Keep-It-Direct Principle Analysis of ERM VC-dimension Generalization Bounds Structural Risk Minimization (SRM) Summary and Discussion
Recall consistency of ERM Two Types of VC-bounds: How close is the empirical risk to the true risk (2) How close is the empirical risk to the minimal possible risk ?
Generalization Bounds Bounds for learning machines (implementing ERM) evaluate the difference btwn (unknown) risk and known empirical risk, as a function of sample size n and general properties of admissible models (their VC-dimension) Classification: the following bound holds with probability of for all approximating functions where is called the confidence interval Regression: the following bound holds with probability of for all approximating functions where
Practical VC Bound for regression Practical regression bound can be obtained by setting the confidence level and theoretical constants: can be used for model selection (examples given later) Compare to analytic bounds (SC, FPE) in Lecture Set 2 Analysis (of denominator) shows that h < 0.8 n for any estimator In practice: h < 0.5 n for any estimator
VC Regression Bound for model selection VC-bound can be used for analytic model selection (if the VC-dimension is known) Example: polynomial regression for estimating Sine_Squared target function from 25 noisy samples Optimal model found: 6-th degree polynomial (no resampling needed)
Modeling pure noise with x in [0,1] via poly regression sample size n=30, noise Comparison of different model selection methods: - prediction risk (MSE) - selected DoF (~ h)
OUTLINE of Set 4 Objectives and Overview Inductive Learning Problem Setting Keep-It-Direct Principle Analysis of ERM VC-dimension Generalization Bounds Structural Risk Minimization (SRM) Summary and Discussion
Structural Risk Minimization Analysis of generalization bounds suggests that when n/h is large, the term is small This leads to parametric modeling approach (ERM) When n/h is not large (say, less than 20), both terms in the right-hand side of VC- bound need to be minimized make the VC-dimension a controlling variable SRM = formal mechanism for controlling model complexity Set of admissible models has a nested structure such that structure formally defines complexity ordering
Structural Risk Minimization An upper bound on the true risk and the empirical risk, as a function of VC-dimension h (for fixed sample size n)
SRM vs ERM modeling
SRM Approach Two general strategies for implementing SRM: Use VC-dimension as a controlling parameter for minimizing VC bound: Two general strategies for implementing SRM: 1. Keep fixed and minimize (most statistical and neural network methods) 2. Keep fixed and minimize (Support Vector Machines)
Common SRM structures Dictionary structure A set of algebraic polynomials is a structure since More generally where is a set of basis functions (dictionary). The number of terms (basis functions) m specifies an element of a structure. For fixed basis fcts, VC-dim ~ number of parameters
Common SRM structures Feature selection (aka subset selection) Consider sparse polynomials of degree m: for m=1: for m=2: etc. Each monomial is a feature. The goal is to select a set of m features providing min. empirical risk (MSE) This is a structure since More generally, where m basis fcts are selected from a (large) set of M fcts Note: nonlinear optimization, VC-dimension is unknown
Common SRM structures Penalization Consider algebraic polynomial of fixed degree where For each (positive) value c this set of functions specifies an element of a structure Minimization of empirical risk (MSE) on each element of a structure is a constrained minimization problem This optimization problem can be equivalently stated as minimization of the penalized empirical risk functional: where the choice of Note: VC-dimension is unknown
Example: SRM structures for regression Regression data set x-values~ uniformly sampled in [0,1] y-values ~ target fct additive Gaussian noise with st. dev 0.05 Experimental set-up training set ~ 40 samples validation set ~ 40 samples (for model selection) SRM structures defined on algebraic polynomials - dictionary (polynomial degrees 1 to 10) - penalization (fixed degree-10 polynomial) - sparse polynomials (degree 1 to 5)
Estimated models using different SRM structures: - dictionary - penalization lambda=1.013e-005 - sparse polynomial Visual results: target fct~ red line, feature selection~ black solid, dictionary ~ green, penalization ~ yellow line 0.2 0.4 0.6 0.8 1 -1.5 -1 -0.5 0.5 x y
SRM Summary SRM structure ~ complexity ordering on a set of admissible models (approximating functions) Many different structures on the same set of approximating functions (possible models) How to choose the ‘best’ structure? - depends on application data - VC theory cannot provide answer SRM = mechanism for complexity control - selecting optimal complexity for a given data set - new measure of complexity: VC-dimension - model selection using analytic VC-bounds
OUTLINE of Set 4 Objectives and Overview Inductive Learning Problem Setting Keep-It-Direct Principle Analysis of ERM VC-dimension Generalization Bounds Structural Risk Minimization (SRM) Summary and Discussion
Summary and Discussion: VC-theory Methodology - learning problem setting (KID principle) - concepts (risk minimization, VC-dimension, structure) Interpretation/ evaluation of existing methods Model selection using VC-bounds New types of inference (TBD later) What theory can not do: - provide formalization (for a given application) - select ‘good’ structure - always a gap between theory and applications