Introduction to Predictive Learning

Presentation transcript:

Introduction to Predictive Learning
LECTURE SET 4: Statistical Learning Theory
Electrical and Computer Engineering

OUTLINE of Set 4
- Objectives and Overview
- Inductive Learning Problem Setting
- Keep-It-Direct Principle
- Analysis of ERM
- VC-dimension
- Generalization Bounds
- Structural Risk Minimization (SRM)
- Summary and Discussion

Objectives
Problems with philosophical approaches:
- they lack a quantitative description/characterization of ideas
- no real predictive power (as in the natural sciences)
- no agreement on basic definitions/concepts (as in the natural sciences)
Goal: to introduce Predictive Learning as a scientific discipline

Characteristics of Scientific Theory
- Problem setting
- Solution approach
- Math proofs (technical analysis)
- Constructive methods
- Applications
Note: Problem Setting and Solution Approach are independent of each other

History and Overview
- SLT, aka VC-theory (Vapnik-Chervonenkis)
- Theory for estimating dependencies from finite samples (predictive learning setting)
- Based on the risk minimization approach
- All main results were originally developed in the 1970s for classification (pattern recognition) (why?), but remained largely unknown
- Recent renewed interest due to the practical success of Support Vector Machines (SVM)

History and Overview (cont’d)
MAIN CONCEPTUAL CONTRIBUTIONS
- Distinction between problem setting, inductive principle and learning algorithms
- Direct approach to estimation with finite data (KID principle)
- Math analysis of ERM (standard inductive setting)
- Two factors responsible for generalization:
  - empirical risk (fitting error)
  - complexity (capacity) of approximating functions

Importance of VC-theory
- Math results addressing the main question: under what general conditions does the ERM approach lead to (good) generalization?
- New approach to induction: predictive vs generative modeling (in classical statistics)
- Connection to philosophy of science:
  - VC-theory was developed for binary classification (pattern recognition) ~ the simplest generalization problem
  - natural sciences: from observations to scientific law
→ VC-theoretical results can be interpreted using general philosophical principles of induction, and vice versa.

Inductive Learning Setting
- The learning machine observes samples (x, y) and returns an estimated response
- Two modes of inference: identification vs imitation
- Risk: the expected loss $\int L(f(x, w), y)\, dP(x, y) \to \min$

The Problem of Inductive Learning
Given: finite training samples Z = {(x_i, y_i), i = 1, 2, …, n}, choose from a given set of functions f(x, w) the one that best approximates the true output (in the sense of risk minimization).
Concepts and Terminology
- approximating functions f(x, w)
- (non-negative) loss function L(f(x, w), y)
- expected risk functional R(Z, w)
Goal: find the function f(x, w_0) minimizing R(Z, w) when the joint distribution P(x, y) is unknown.

Empirical Risk Minimization
- ERM principle in model-based learning
- Model parameterization: f(x, w)
- Loss function: L(f(x, w), y)
- Estimate risk from data: $R_{emp}(w) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i, w), y_i)$
- Choose w* that minimizes R_emp
Statistical Learning Theory developed from the theoretical analysis of the ERM principle under finite sample settings
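
The ERM recipe on this slide can be sketched in a few lines. This is a minimal illustration, assuming a one-parameter model f(x, w) = w * x with squared loss; the model, the toy data set Z and the function names are assumptions made for the example and do not come from the lecture.

```python
def empirical_risk(w, samples):
    """R_emp(w): average squared loss of f(x, w) = w * x over the training samples."""
    return sum((y - w * x) ** 2 for x, y in samples) / len(samples)


def erm_fit(samples):
    """Return w* minimizing R_emp for f(x, w) = w * x (closed-form least squares)."""
    sxy = sum(x * y for x, y in samples)
    sxx = sum(x * x for x, _ in samples)
    return sxy / sxx


# Hypothetical training set Z = {(x_i, y_i)}: roughly y = 2x with small perturbations.
Z = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8)]
w_star = erm_fit(Z)
print(w_star, empirical_risk(w_star, Z))
```

The chosen w* depends only on the finite sample Z; nothing in the procedure refers to the unknown distribution P(x, y), which is why the relation between empirical and expected risk needs separate analysis.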

Probabilistic Modeling vs ERM

Probabilistic Modeling vs ERM: Example
Known class distribution → optimal decision boundary

Probabilistic Approach
Estimate parameters of Gaussian class distributions, and plug them into the quadratic decision boundary

ERM Approach
Quadratic and linear decision boundary estimated via minimization of squared loss

Estimation of multivariate functions
- Is it possible to estimate a function from finite data?
- Simplified problem: estimation of an unknown continuous function from noise-free samples
- Many results from function approximation theory:
  - To estimate a d-dimensional function accurately one needs O(n^d) data points
  - For example, if 3 points are needed to estimate a 2nd-order polynomial for d=1, then 3^10 points are needed to estimate a 2nd-order polynomial in 10-dimensional space
  - Similar results in signal processing
- Never enough data points to estimate multivariate functions in most practical applications (image recognition, genomics, etc.)
- For multivariate function estimation, the number of free parameters increases exponentially with problem dimensionality (the Curse of Dimensionality)
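
The arithmetic behind the 3^10 example can be checked directly. The sketch below simply assumes, as the slide does, that the same number of points is needed along every input dimension; the function name is hypothetical.

```python
def points_needed(points_per_axis, d):
    """Samples required if each of d input dimensions needs `points_per_axis` points."""
    return points_per_axis ** d


# Growth of the required sample size with dimensionality d, for 3 points per axis.
for d in (1, 2, 5, 10):
    print(d, points_needed(3, d))
```

For d = 10 this gives 59049 points, compared with 3 points for d = 1.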

Properties of high-dimensional data
- Sparse data looks like a porcupine: the volume of a unit sphere inscribed in a d-dimensional cube gets smaller even as the volume of the d-cube gets exponentially larger!
- A point is closer to an edge than to another point
- Pairwise distances between points are nearly the same
- Intuition behind kernel (local) methods no longer holds
How is generalization possible, in spite of the curse of dimensionality?
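
The "porcupine" picture can be made concrete by computing the share of the cube's volume that lies inside the inscribed ball. This sketch assumes a ball of radius 1 inside the cube [-1, 1]^d and uses the standard closed form for the volume of a d-dimensional ball; the function name is illustrative.

```python
import math


def ball_to_cube_ratio(d):
    """Fraction of the cube [-1, 1]^d occupied by the inscribed unit-radius ball."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # volume of the unit ball
    cube = 2.0 ** d                                    # volume of the enclosing cube
    return ball / cube


for d in (2, 3, 5, 10, 20):
    print(d, ball_to_cube_ratio(d))
```

The ratio shrinks rapidly with d, so almost all of the cube's volume ends up in its corners, far from the center.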

Keep-It-Direct Principle
- The goal of learning is generalization rather than estimation of the true function (system identification)
- Keep-It-Direct Principle (Vapnik, 1995): do not solve an estimation problem of interest by solving a more general (harder) problem as an intermediate step
- A good predictive model reflects some properties of the unknown distribution P(x, y)
- Since model estimation with finite data is ill-posed, one should never try to solve a more general problem than required by the given application
→ Importance of formalizing application requirements as a learning problem.

Learning vs System Identification
- Consider a regression problem with an unknown target function
- Goal 1: Prediction
- Goal 2: Function Approximation (system identification)
- Admissible models: algebraic polynomials
- Purpose of comparison: contrast goals (1) and (2)
NOTE: most applications assume Goal 2, i.e. Noisy Data ~ true signal + noise

Empirical Comparison
- Target function: sine-squared
- Input distribution: non-uniform Gaussian pdf
- Additive Gaussian noise with standard deviation = 0.1

Empirical Comparison (cont’d)
- Model selection: use separate data sets
  - training: for parameter estimation
  - validation: for selecting polynomial degree
  - test: for estimating prediction risk (MSE)
- Validation set generated differently to contrast (1) & (2):
  - Predictive Learning (1) ~ Gaussian
  - Function Approximation (2) ~ uniform fixed sampling
- Training + test data ~ Gaussian
- Training set size: 30; validation set size: 30

Regression estimates (2 typical realizations of data):
- Dotted line ~ estimate obtained using predictive learning
- Dashed line ~ estimate via function approximation setting

Conclusion
- The goal of prediction (1) is different from (and less demanding than) the goal of estimating the true target function (2) everywhere in the input space.
- The curse of dimensionality applies to the system identification setting (2), but may not hold under the predictive setting (1).
- Both settings coincide if the input distribution is uniform (e.g., in signal and image denoising applications)

Philosophical Interpretation of KID
- Interpretation of predictive models:
  - Realism ~ objective truth (hidden in Nature)
  - Instrumentalism ~ creation of human mind (imposed on the data), favored by KID
- Objective evaluation is still possible (via prediction risk reflecting application needs) → Natural Science
- Methodological implications:
  - Importance of good learning formulations (asking the ‘right question’)
  - Accounts for 80% of success in applications

VC-theory has 4 parts:
1. Analysis of consistency/convergence of ERM
2. Generalization bounds
3. Inductive principles (for finite samples)
4. Constructive methods (learning algorithms) for implementing (3)
NOTE: (1) → (2) → (3) → (4)

Consistency/Convergence of ERM
- Empirical Risk is known but Expected Risk is unknown
- Asymptotic consistency requirement: under what (general) conditions will models providing min Empirical Risk also provide min Prediction Risk, when the number of samples grows large?
- Why is asymptotic analysis needed?
  - it helps to develop useful concepts
  - necessary and sufficient conditions ensure that VC-theory is general and cannot be improved

Consistency of ERM
- Convergence of empirical risk to expected risk does not imply consistency of ERM
- Models estimated via ERM (w*) are always biased estimates of the functions minimizing true risk

Conditions for Consistency of ERM
- Main insight: consistency is not possible without restricting the set of possible models
- Example: 1-nearest neighbor classification method. Is it consistent?
- Consider binary decision functions (classification)
- How to measure their flexibility, or ability to ‘explain’ the training data (for binary classification)?
- This complexity index for indicator functions:
  - is independent of the unknown data distribution
  - measures the capacity of a set of possible models, rather than characteristics of the ‘true model’

SHATTERING
- Linear indicator functions can split 3 data points in 2D in all 2^3 = 8 possible binary partitions
- If a set of n samples can be separated by a set of functions in all 2^n possible ways, this sample is said to be shattered (by the set of functions)
- Shattering ~ a set of models can explain a given sample of size n (for all possible labelings)
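
The shattering claim for linear indicator functions in 2D can be checked by brute force over all labelings. The sketch below is an illustration, not the lecture's method: it assumes a plain perceptron as the separability test and treats failure to converge within `max_epochs` as "not separable", which is a heuristic cutoff. The point sets and function names are hypothetical.

```python
from itertools import product


def linearly_separable(points, labels, max_epochs=1000):
    """Perceptron check: True if a separating line w.x + b is found.

    Reaching max_epochs without an error-free pass is treated as
    'not separable' (a heuristic cutoff, adequate for these tiny examples).
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for (x1, x2), y in zip(points, labels):
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified or on the line
                w[0] += y * x1
                w[1] += y * x2
                b += y
                errors += 1
        if errors == 0:
            return True
    return False


def is_shattered(points):
    """True if every one of the 2^n labelings of the points is linearly separable."""
    return all(linearly_separable(points, labels)
               for labels in product((-1, 1), repeat=len(points)))


three = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]            # non-collinear points
four = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]  # corners of a square
print(is_shattered(three))  # True: all 8 labelings are separable
print(is_shattered(four))   # False: the XOR labeling is not separable
```

The second result only shows that this particular set of 4 points is not shattered; the statement that no 4 points in 2D can be shattered by lines requires a separate argument.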

VC DIMENSION
- Definition: a set of functions has VC-dimension h if there exist h samples that can be shattered by this set of functions, but there are no h+1 samples that can be shattered
- For linear indicator functions in 2D, VC-dimension h = 3 (h = d+1 for linear functions in d dimensions)
- VC-dim. is a positive integer (combinatorial index)
- What is the VC-dim. of the 1-nearest neighbor classifier?

VC-dimension and Consistency of ERM
- VC-dimension is infinite if, for any n, a sample of size n can be split in all 2^n possible ways (in this case, no valid generalization is possible)
- Finite VC-dimension gives necessary and sufficient conditions for:
  (1) consistency of ERM-based learning
  (2) fast rate of convergence
  (these conditions are distribution-independent)
- Interpretation of the VC-dimension via falsifiability: functions with small VC-dim can be easily falsified

VC-dimension and Falsifiability
A set of functions has VC-dimension h if
(a) it can explain (shatter) a set of h samples ~ there exist h samples that cannot falsify it, and
(b) it cannot shatter h+1 samples ~ any h+1 samples falsify this set
Finiteness of VC-dim is a necessary and sufficient condition for generalization (for any learning method based on ERM)

Recall Occam’s Razor
- Main problem in predictive learning:
  - complexity control (model selection)
  - how to measure complexity?
- Interpretation of Occam’s razor (in statistics):
  - Entities ~ model parameters
  - Complexity ~ degrees-of-freedom
  - Necessity ~ explaining (fitting) available data
  → Model complexity = number of parameters (DoF)
- Consistent with the classical statistical view: learning = function approximation / density estimation

Philosophical Principle of VC-falsifiability
- Occam’s Razor: select the model that explains available data and has the smallest number of free parameters (entities)
- VC theory: select the model that explains available data and has low VC-dimension (i.e. can be easily falsified)
→ New principle of VC-falsifiability

Calculating the VC-dimension
- How to estimate the VC-dimension (for a given set of functions)?
- Apply the definition (via shattering) to derive analytic estimates; this works for ‘simple’ sets of functions
- Generally, such analytic estimates are not possible for complex nonlinear parameterizations (i.e., for practical machine learning and statistical methods)

Example: VC-dimension of spherical indicator functions
- Consider spherical decision surfaces in a d-dimensional x-space, parameterized by center c and radius r: $f(x, c, r) = I(\|x - c\|^2 \le r^2)$
- In a 2-dim space (d=2) there exist 3 points that can be shattered, but 4 points cannot be shattered → h = 3

Example: VC-dimension of a linear combination of fixed basis functions (e.g. polynomials, Fourier expansion, etc.)
- Assuming that the basis functions are linearly independent, the VC-dim equals the number of basis functions (free parameters).
Example: single parameter but infinite VC-dimension: the indicator function I(sin(wx) > 0)
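
The single-parameter example can be illustrated numerically. The sketch assumes the classical construction for the indicator of sin(wx) > 0: points x_i = 10^(-i) and a frequency w chosen from the desired labels. That construction is a standard textbook one rather than something shown on the slide, and the function names are hypothetical. Only n = 4 points are checked here; the same construction works for any n in exact arithmetic.

```python
import math
from itertools import product


def sine_indicator(x, w):
    """Indicator function f(x, w) = +1 if sin(w * x) > 0, else -1."""
    return 1 if math.sin(w * x) > 0 else -1


def frequency_for(labels):
    """Choose w so that the points x_i = 10**(-i) receive the given +/-1 labels."""
    return math.pi * (1 + sum((1 - y) // 2 * 10 ** i
                              for i, y in enumerate(labels, start=1)))


n = 4
xs = [10.0 ** (-i) for i in range(1, n + 1)]

# Try every one of the 2^n labelings and check that one value of w reproduces it.
shattered = all(
    [sine_indicator(x, frequency_for(labels)) for x in xs] == list(labels)
    for labels in product((-1, 1), repeat=n)
)
print(shattered)  # True
```

A single free parameter is enough to reproduce every labeling, so counting parameters does not measure the capacity of this set of functions.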

Example: Wide linear decision boundaries
- Consider linear functions such that the distance between D(x) and the closest data sample is larger than a given value
- Then the VC-dimension depends on the width parameter, rather than on d (as in linear models)

- A linear combination of fixed basis functions is equivalent to linear functions in m-dimensional space
- VC-dimension = m + 1 (this assumes linear independence of the basis functions)
- In general, analytic estimation of VC-dimension is hard
- VC-dimension can be equal to DoF, larger than DoF, or smaller than DoF

VC-dimension vs number of parameters
- VC-dimension can be equal to DoF (number of parameters). Example: linear estimators
- VC-dimension can be smaller than DoF. Example: penalized estimators
- VC-dimension can be larger than DoF. Examples: feature selection, sin(wx)

VC-dimension for Regression Problems
- VC-dimension was defined for indicator functions
- It can be extended to real-valued functions, e.g. a third-order polynomial for univariate regression: linear parameterization → VC-dim = 4
- Qualitatively, the VC-dimension ~ the ability to fit (or explain) finite training data for regression. This can also be related to falsifiability

Example: what is the VC-dim of k-NN regression?
Ten training samples, fitted using k-NN regression with k=1 and k=4:
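
The question can be explored with a small experiment. The sketch below assumes univariate inputs and ten hypothetical training samples (the data are made up for illustration). It shows that k=1 reproduces any training responses exactly, while k=4 does not, which illustrates the difference in the ability to fit finite data but does not by itself compute a VC-dimension.

```python
def knn_predict(x0, samples, k):
    """k-NN regression: average the responses of the k training points nearest to x0."""
    nearest = sorted(samples, key=lambda s: abs(s[0] - x0))[:k]
    return sum(y for _, y in nearest) / k


def training_mse(samples, k):
    """Empirical risk (MSE) of k-NN regression evaluated on its own training set."""
    return sum((y - knn_predict(x, samples, k)) ** 2 for x, y in samples) / len(samples)


# Ten hypothetical training samples with distinct x-values and arbitrary responses.
xs = [0.05 + 0.1 * i for i in range(10)]
ys = [0.12, -0.30, 0.25, 0.05, -0.18, 0.40, -0.07, 0.22, -0.35, 0.10]
samples = list(zip(xs, ys))

print(training_mse(samples, 1))  # 0.0: each point is its own nearest neighbor
print(training_mse(samples, 4))
```

With k=1 the empirical risk is zero for any choice of responses, so the training data can never falsify the model; averaging over k=4 neighbors removes that ability.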

Recall consistency of ERM
Two types of VC-bounds:
(1) How close is the empirical risk to the true risk?
(2) How close is the empirical risk to the minimal possible risk?

Generalization Bounds
- Bounds for learning machines (implementing ERM) evaluate the difference between the (unknown) risk and the known empirical risk, as a function of sample size n and general properties of admissible models (their VC-dimension)
- Classification: with probability of at least $1 - \eta$, the bound $R(w) \le R_{emp}(w) + \Phi$ holds for all approximating functions, where $\Phi$ is called the confidence interval
- Regression: with probability of at least $1 - \eta$, the bound $R(w) \le R_{emp}(w) / (1 - c\sqrt{\varepsilon})_+$ holds for all approximating functions, where $\varepsilon$ depends on $n$, $h$ and $\eta$

Practical VC Bound for regression
- A practical regression bound can be obtained by setting the confidence level and theoretical constants; it can be used for model selection (examples given later)
- Compare to analytic bounds (SC, FPE) in Lecture Set 2
- Analysis (of the denominator) shows that h < 0.8 n for any estimator
- In practice: h < 0.5 n for any estimator
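
A minimal sketch of how such a bound could be evaluated is given below. It assumes the practical form found in the Cherkassky and Mulier textbook, R <= R_emp / (1 - sqrt(p - p ln p + ln n / (2n)))_+ with p = h/n. This formula is an assumption here, since the slide text does not spell it out, and the function names are hypothetical.

```python
import math


def vc_penalization_factor(h, n):
    """Multiplier applied to the empirical risk (assumed practical VC regression bound).

    Returns math.inf when the denominator is not positive, i.e. when the bound
    gives no guarantee for this combination of VC-dimension h and sample size n.
    """
    p = h / n
    inside = p - p * math.log(p) + math.log(n) / (2 * n)
    denom = 1.0 - math.sqrt(inside)
    return 1.0 / denom if denom > 0 else math.inf


def vc_risk_bound(r_emp, h, n):
    """Upper bound on prediction risk: empirical risk times the penalization factor."""
    return r_emp * vc_penalization_factor(h, n)


n = 100
for h in (5, 10, 50, 80):
    print(h, vc_penalization_factor(h, n))
```

Under this assumed form the factor grows with h and becomes infinite at h = 0.8 n for n = 100, in line with the slide's remark about the denominator. For model selection one would compute `vc_risk_bound` for each candidate model and keep the smallest value.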

VC Regression Bound for model selection
- VC-bound can be used for analytic model selection (if the VC-dimension is known)
- Example: polynomial regression for estimating the sine-squared target function from 25 noisy samples
- Optimal model found: 6th-degree polynomial (no resampling needed)

Modeling pure noise with x in [0,1] via polynomial regression
- Sample size n = 30
- Comparison of different model selection methods:
  - prediction risk (MSE)
  - selected DoF (~ h)

Structural Risk Minimization
- Analysis of generalization bounds suggests that when n/h is large, the confidence interval term is small → this leads to the parametric modeling approach (ERM)
- When n/h is not large (say, less than 20), both terms in the right-hand side of the VC-bound need to be minimized → make the VC-dimension a controlling variable
- SRM = formal mechanism for controlling model complexity
- The set of admissible models has a nested structure $S_1 \subset S_2 \subset \dots \subset S_k \subset \dots$ such that $h_1 \le h_2 \le \dots \le h_k \le \dots$ → the structure formally defines a complexity ordering

Structural Risk Minimization
An upper bound on the true risk and the empirical risk, as a function of VC-dimension h (for fixed sample size n)

SRM vs ERM modeling

SRM Approach
Use the VC-dimension as a controlling parameter for minimizing the VC bound.
Two general strategies for implementing SRM:
1. Keep the confidence interval fixed and minimize the empirical risk (most statistical and neural network methods)
2. Keep the empirical risk fixed and minimize the confidence interval (Support Vector Machines)

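Common SRM structures
Dictionary structure
- A set of algebraic polynomials $f_m(x, w) = \sum_{i=0}^{m} w_i x^i$ is a structure since $f_1 \subset f_2 \subset \dots \subset f_k \subset \dots$
- More generally, $f_m(x, w) = \sum_{i=0}^{m} w_i g_i(x)$, where $\{g_i(x)\}$ is a set of basis functions (dictionary)
- The number of terms (basis functions) m specifies an element of a structure
- For fixed basis functions, VC-dim ~ number of parameters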
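
The nesting of a dictionary structure shows up directly in the training error: each larger element contains the smaller ones, so the minimal empirical risk cannot increase with m. The sketch below assumes NumPy is available and uses a hypothetical data set (sine-squared target, 40 samples, noise standard deviation 0.05) chosen only for illustration.

```python
import numpy as np

# Hypothetical training data: 40 samples, x uniform in [0, 1], noisy sine-squared target.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=40)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0.0, 0.05, size=40)


def training_mse(degree):
    """Empirical risk (MSE) of the least-squares polynomial of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


# One ERM solution per element of the structure (polynomial degrees 1 to 6).
mse_by_degree = [training_mse(m) for m in range(1, 7)]
for m, mse in enumerate(mse_by_degree, start=1):
    print(m, mse)
```

Because the empirical risk only decreases along the structure, it cannot be used on its own to choose m; that choice needs a validation set or an analytic bound.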

Common SRM structures
Feature selection (aka subset selection)
- Consider sparse polynomials of degree m (m = 1, 2, etc.); each monomial is a feature
- The goal is to select a set of m features providing min. empirical risk (MSE)
- This is a structure since $f_1 \subset f_2 \subset \dots \subset f_k \subset \dots$
- More generally, m basis functions are selected from a (large) set of M functions
Note: nonlinear optimization, VC-dimension is unknown

Common SRM structures
Penalization
- Consider an algebraic polynomial f(x, w) of fixed degree, where $\|w\|^2 \le c$
- For each (positive) value c this set of functions specifies an element of a structure
- Minimization of empirical risk (MSE) on each element of a structure is a constrained minimization problem
- This optimization problem can be equivalently stated as minimization of the penalized empirical risk functional $R_{pen}(w, \lambda) = R_{emp}(w) + \lambda \|w\|^2$, where the choice of $\lambda$ corresponds to the choice of c
Note: VC-dimension is unknown
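
A penalized fit of this kind can be sketched with a ridge-style closed form. The example assumes NumPy, a degree-10 polynomial, a squared-norm penalty applied to all coefficients (including the constant term), and a hypothetical data set; none of these specifics are taken from the slide.

```python
import numpy as np


def fit_penalized(x, y, degree, lam):
    """Minimize (1/n) * ||y - X w||^2 + lam * ||w||^2 for a polynomial of fixed degree."""
    X = np.vander(x, degree + 1, increasing=True)  # columns 1, x, x^2, ..., x^degree
    n = len(x)
    A = X.T @ X / n + lam * np.eye(degree + 1)
    return np.linalg.solve(A, X.T @ y / n)


# Hypothetical training data: 40 samples, x uniform in [0, 1], noisy sine-squared target.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=40)
y = np.sin(2 * np.pi * x) ** 2 + rng.normal(0.0, 0.05, size=40)

# Larger penalties shrink the coefficient vector, i.e. select a smaller element of the structure.
lambdas = [1e-6, 1e-3, 1e-1]
norms = [float(np.linalg.norm(fit_penalized(x, y, 10, lam))) for lam in lambdas]
for lam, norm in zip(lambdas, norms):
    print(lam, norm)
```

The coefficient norm decreases as the penalty grows, which is the sense in which each penalty value plays the role of a constraint level c.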

Example: SRM structures for regression
Regression data set:
- x-values ~ uniformly sampled in [0,1]
- y-values ~ target function + additive Gaussian noise with standard deviation 0.05
Experimental set-up:
- training set ~ 40 samples
- validation set ~ 40 samples (for model selection)
SRM structures defined on algebraic polynomials:
- dictionary (polynomial degrees 1 to 10)
- penalization (fixed degree-10 polynomial)
- sparse polynomials (degree 1 to 5)

Estimated models using different SRM structures:
- dictionary
- penalization (lambda = 1.013e-005)
- sparse polynomial
Visual results (plot of y vs x): target function ~ red line, feature selection ~ black solid line, dictionary ~ green line, penalization ~ yellow line

SRM Summary
- SRM structure ~ complexity ordering on a set of admissible models (approximating functions)
- Many different structures are possible on the same set of approximating functions (possible models)
- How to choose the ‘best’ structure?
  - depends on application data
  - VC theory cannot provide an answer
- SRM = mechanism for complexity control:
  - selecting optimal complexity for a given data set
  - new measure of complexity: VC-dimension
  - model selection using analytic VC-bounds

Summary and Discussion: VC-theory
- Methodology:
  - learning problem setting (KID principle)
  - concepts (risk minimization, VC-dimension, structure)
- Interpretation/evaluation of existing methods
- Model selection using VC-bounds
- New types of inference (TBD later)
- What theory cannot do:
  - provide formalization (for a given application)
  - select a ‘good’ structure
  - there is always a gap between theory and applications