Instructor : Dr. Saeed Shiry

Presentation transcript:

Regularization. Instructor: Dr. Saeed Shiry

Hypothesis Space: The hypothesis space H is the space of functions that we allow our algorithm to provide; it is the space the algorithm is allowed to search. It is often important to choose the hypothesis space as a function of the amount of data available.

Learning As Function Approximation From Samples: Regression and Classification. The basic goal of supervised learning is to use the training set S to “learn” a function f_S that, for a new x value, predicts the associated value of y. Regression: y is a real-valued random variable. Pattern classification: y takes values from an unordered finite set. In two-class pattern classification problems, we assign one class a y value of 1 and the other class a y value of −1.

Loss Functions: In order to measure the goodness of our function, we need a loss function V. In general, we let V(f, z) = V(f(x), y) denote the price we pay when we see x and guess that the associated y value is f(x) when it is actually y.

Common Loss Functions For Regression: The most common loss function is the square loss or L2 loss: V(f(x), y) = (f(x) − y)^2. L1 loss: V(f(x), y) = |f(x) − y|. Vapnik’s more general ε-insensitive loss: V(f(x), y) = (|f(x) − y| − ε)_+, where (t)_+ = max(t, 0).
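The three regression losses can be written down in a few lines. This is a minimal sketch, assuming NumPy and the usual form of the ε-insensitive loss, max(|f(x) − y| − ε, 0); the function names and the default ε are illustrative choices, not taken from the slides.

```python
import numpy as np

def square_loss(fx, y):
    """L2 loss: (f(x) - y)^2."""
    return (fx - y) ** 2

def l1_loss(fx, y):
    """L1 loss: |f(x) - y|."""
    return np.abs(fx - y)

def eps_insensitive_loss(fx, y, eps=0.1):
    """Assumed form of the eps-insensitive loss: errors smaller than eps cost nothing."""
    return np.maximum(np.abs(fx - y) - eps, 0.0)

# Hypothetical predictions and targets, only to show the call pattern.
fx = np.array([1.0, 2.0, 3.0])
y = np.array([1.05, 2.5, 1.0])
print(square_loss(fx, y), l1_loss(fx, y), eps_insensitive_loss(fx, y))
```

Note how the ε-insensitive loss ignores the first example entirely, since its error is below ε, while the square loss penalizes the large third error most heavily.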

Problem of risk minimization: In order to choose the best available approximation to the supervisor's response, one measures the loss or discrepancy L(y, f(x, α)) between the response y of the supervisor to a given input x and the response f(x, α) provided by the learning machine. Consider the expected value of the loss, given by the risk functional R(α) = ∫ L(y, f(x, α)) dP(x, y) (1.2). The goal is to find the function f(x, α_0) which minimizes the risk functional R(α) over the class of functions f(x, α), α ∈ A, in the situation where the joint probability distribution P(x, y) is unknown and the only available information is contained in the training set.

Three Main Learning Problems: Pattern Recognition. Let the supervisor's output y take only two values, y ∈ {0, 1}, and let f(x, α), α ∈ A, be a set of indicator functions (functions which take only two values: zero and one). Consider the following loss function: L(y, f(x, α)) = 0 if y = f(x, α), and 1 if y ≠ f(x, α). For this loss function, the functional (1.2) determines the probability of different answers given by the supervisor and by the indicator function f(x, α). We call the case of different answers a classification error. The problem, therefore, is to find a function that minimizes the probability of classification error when the probability measure P(x, y) is unknown, but the data are given.

Three Main Learning Problems: Regression Estimation. Let the supervisor's answer y be a real value, and let f(x, α), α ∈ A, be a set of real functions that contains the regression function f(x, α_0) = ∫ y dP(y|x). It is known that the regression function is the one that minimizes the functional (1.2) with the following loss function: L(y, f(x, α)) = (y − f(x, α))^2. Thus the problem of regression estimation is the problem of minimizing the risk functional (1.2) with the above loss function in the situation where the probability measure P(x, y) is unknown but the data are given.

Three Main Learning Problems: Density Estimation (Fisher-Wald Setting). Finally, consider the problem of density estimation from the set of densities p(x, α), α ∈ A. For this problem we consider the following loss function: L(p(x, α)) = −log p(x, α). It is known that the desired density minimizes the risk functional (1.2) with the above loss function. Thus, again, to estimate the density from the data one has to minimize the risk functional under the condition that the corresponding probability measure P(x) is unknown, but i.i.d. data are given.

Expected error, empirical error: Given a function f, a loss function V, and a probability distribution μ over Z, the expected or true error of f is I[f] = E_z V(f, z) = ∫_Z V(f, z) dμ(z), which is the expected loss on a new example drawn at random from μ. We would like to make I[f] small, but in general we do not know μ. Given a function f, a loss function V, and a training set S consisting of n data points z_1, ..., z_n, the empirical error of f is I_S[f] = (1/n) Σ_{i=1}^{n} V(f, z_i).
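The empirical error is simply the average loss over the training set. A minimal sketch follows, assuming each data point is a pair (x, y) and the loss is called as V(f(x), y); the helper name `empirical_error` and the toy data are hypothetical.

```python
import numpy as np

def empirical_error(f, V, X, Y):
    """Average loss of f over the n training points (x_i, y_i)."""
    return float(np.mean([V(f(x), y) for x, y in zip(X, Y)]))

# Toy example (assumed data): f(x) = 2x under the square loss.
f = lambda x: 2.0 * x
V = lambda fx, y: (fx - y) ** 2
X = [0.0, 1.0, 2.0]
Y = [0.0, 2.0, 5.0]
print(empirical_error(f, V, X, Y))
```

The expected error cannot be computed this way in practice, because it averages over the unknown distribution μ rather than over the sample.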

A reminder: convergence in probability. Let {X_n} be a sequence of bounded random variables. We say that X_n converges to X in probability if, for every ε > 0, P{|X_n − X| ≥ ε} → 0 as n → ∞.

Generalization

A learning algorithm should be well-posed, e.g., stable: In addition to the key property of generalization, a “good” learning algorithm should also be stable: f_S should depend continuously on the training set S. In particular, changing one of the training points should affect the solution less and less as n goes to infinity.

General definition of Well-Posed and Ill-Posed problems: A problem is well-posed if its solution exists, is unique, and depends continuously on the data (i.e., it is stable). A problem is ill-posed if it is not well-posed. Here, well-posedness is mainly used to mean stability of the solution.

Theory of Solving Ill-Posed Problems: In the early 1900s Hadamard observed that under some (very general) circumstances the problem of solving (linear) operator equations Af = F (finding f ∈ ℱ that satisfies the equality) is ill-posed: even if there exists a unique solution to this equation, a small deviation on the right-hand side of this equation (F_δ instead of F, where ||F − F_δ|| < δ is arbitrarily small) can cause large deviations in the solutions (it can happen that ||f_δ − f|| is large). In this case, if the right-hand side F of the equation is not exact (e.g., it equals F_δ, where F_δ differs from F by some level δ of noise), the functions f_δ that minimize the functional R(f) = ||Af − F_δ||^2 do not guarantee a good approximation to the desired solution even if δ tends to zero.
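A finite-dimensional analogue makes the instability concrete. The sketch below is only an illustration under stated assumptions: the operator A is taken to be an 8×8 Hilbert matrix (a standard ill-conditioned example, not something from the slides), the true solution is a vector of ones, and the noise level δ = 1e-8 is arbitrary.

```python
import numpy as np

# Assumed operator: the 8x8 Hilbert matrix, A[i, j] = 1 / (i + j + 1).
n = 8
i, j = np.indices((n, n))
A = 1.0 / (i + j + 1)

f_true = np.ones(n)   # assumed exact solution
F = A @ f_true        # exact right-hand side

# Perturb the right-hand side by a vector of norm delta.
rng = np.random.default_rng(0)
noise = rng.standard_normal(n)
delta = 1e-8
F_delta = F + delta * noise / np.linalg.norm(noise)

# Solve the perturbed equation A f = F_delta.
f_delta = np.linalg.solve(A, F_delta)

data_err = np.linalg.norm(F_delta - F)      # ||F_delta - F||
sol_err = np.linalg.norm(f_delta - f_true)  # ||f_delta - f||
print(f"data error {data_err:.2e}, solution error {sol_err:.2e}")
```

A tiny change in the data produces a change in the solution that is many orders of magnitude larger, which is the discrete counterpart of the behaviour Hadamard described.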

Real-life problems were found to be ill-posed: Hadamard thought that ill-posed problems are a pure mathematical phenomenon and that all real-life problems are “well-posed.” However, in the second half of the century a number of very important real-life problems were found to be ill-posed. Importantly, one of the main problems of statistics, estimating the density function from the data, is ill-posed.

Regularization theory: Regularization theory was one of the first signs of the existence of intelligent inference. In the middle of the 1960s it was discovered that if, instead of the functional R(f), one minimizes another so-called regularized functional R*(f) = ||Af − F_δ||^2 + γ(δ) Ω(f), where Ω(f) is some functional (that belongs to a special type of functionals) and γ(δ) is an appropriately chosen constant (depending on the level of noise), then one obtains a sequence of solutions that converges to the desired one as δ tends to zero.

ERM: Given a training set S and a function space H, empirical risk minimization (Vapnik introduced the term) is the class of algorithms that look at S and select f_S as f_S = argmin_{f ∈ H} I_S[f]. For example, linear regression is ERM when V(z) = (f(x) − y)^2 and H is the space of linear functions f = ax.
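For the linear regression example, the ERM solution has a closed form: setting the derivative of the average square loss with respect to a to zero gives a = Σ x_i y_i / Σ x_i^2. The sketch below assumes scalar inputs; the function names and the data are hypothetical.

```python
import numpy as np

def empirical_risk(a, X, Y):
    """Average square loss of the linear function f(x) = a * x on the sample."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    return float(np.mean((a * X - Y) ** 2))

def erm_linear(X, Y):
    """ERM over H = {f(x) = a * x}: the minimizer is a = sum(x*y) / sum(x*x)."""
    X, Y = np.asarray(X, dtype=float), np.asarray(Y, dtype=float)
    return float(X @ Y / (X @ X))

# Assumed toy training set, roughly following y = 2x.
X = [1.0, 2.0, 3.0, 4.0]
Y = [2.1, 3.9, 6.2, 7.8]
a_hat = erm_linear(X, Y)
print(a_hat, empirical_risk(a_hat, X, Y))
```

Any other slope gives a larger empirical risk on this sample, which is what it means for `a_hat` to be the ERM solution in this hypothesis space.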

THE EMPIRICAL RISK MINIMIZATION (ERM) INDUCTIVE PRINCIPLE: In order to minimize the risk functional R(α) = ∫ Q(z, α) dP(z) for an unknown probability measure P(z), the following induction principle is usually employed. The expected risk functional R(α) is replaced by the empirical risk functional R_emp(α) = (1/l) Σ_{i=1}^{l} Q(z_i, α) (1.8), constructed on the basis of the training set z_1, ..., z_l. The principle is to approximate the function Q(z, α_0) which minimizes the risk by the function Q(z, α_l) which minimizes the empirical risk (1.8). This principle is called the Empirical Risk Minimization induction principle (ERM principle).

Generalization and Well-posedness of Empirical Risk Minimization: For ERM to represent a “good” class of learning algorithms, the solution should generalize and should exist, be unique and, especially, be stable (well-posedness).

ERM and generalization: given a certain number of samples...

...suppose this is the “true” solution...

... but suppose ERM gives this solution.

Under which conditions does the ERM solution converge to the true solution as the number of examples increases? In other words, what are the conditions for generalization of ERM?

ERM and stability: given 10 samples...

...we can find the smoothest interpolating polynomial (which degree?).

But if we perturb the points slightly...

...the solution changes a lot!

If we restrict ourselves to degree two polynomials...

...the solution varies only a small amount under a small perturbation.
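The contrast between the interpolating polynomial and the degree-two fit can be reproduced numerically. The sketch below is an illustration under assumptions that are not in the slides: the ten points are equally spaced on [−1, 1], the target values come from sin(πx), and the perturbation is a deterministic ±0.01 with alternating signs. With 10 points, the interpolating polynomial has degree 9.

```python
import numpy as np

# Ten sample points; the target sin(pi * x) is an illustrative assumption.
x = np.linspace(-1.0, 1.0, 10)
y = np.sin(np.pi * x)

# A small deterministic perturbation with alternating signs (hypothetical choice).
eps = 0.01
y_perturbed = y + eps * (-1.0) ** np.arange(10)

grid = np.linspace(-1.0, 1.0, 401)

def max_change(degree):
    """Largest change of the fitted polynomial on the grid caused by the perturbation."""
    before = np.polyval(np.polyfit(x, y, degree), grid)
    after = np.polyval(np.polyfit(x, y_perturbed, degree), grid)
    return float(np.max(np.abs(after - before)))

change_deg9 = max_change(9)  # interpolating polynomial through all 10 points
change_deg2 = max_change(2)  # least-squares fit restricted to degree two
print(f"degree 9: {change_deg9:.4f}, degree 2: {change_deg2:.4f}")
```

Restricting the hypothesis space to degree-two polynomials keeps the change in the solution below the size of the perturbation, whereas the degree-nine interpolant has to follow every perturbed point and moves at least as much as the data did.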

ERM: conditions for well-posedness (stability) and predictivity (generalization). Since Tikhonov, it is well known that a generally ill-posed problem such as ERM can be guaranteed to be well-posed, and therefore stable, by an appropriate choice of H. For example, compactness of H guarantees stability. It seems intriguing that the classical conditions for consistency of ERM, which is quite a different property, also consist of appropriately restricting H.

ERM: conditions for well-posedness (stability) and predictivity (generalization). We would like to have a hypothesis space that yields generalization. Loosely speaking, this would be an H for which the solution of ERM, say f_S, is such that |I_S[f_S] − I[f_S]| converges to zero in probability as n increases. Note that the above requirement is NOT the law of large numbers; the requirement that, for a fixed f, |I_S[f] − I[f]| converges to zero in probability as n increases is the law of large numbers.
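The law-of-large-numbers statement for a fixed f is easy to check by simulation. The sketch below assumes a synthetic distribution chosen so that the expected error is known exactly: x is uniform on [0, 1], y = 2x plus Gaussian noise of standard deviation 0.1, and the fixed function is f(x) = 2x, so its expected square loss is 0.01. All of these choices are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.1
f = lambda x: 2.0 * x          # a fixed function, chosen before seeing any data
expected_error = sigma ** 2    # I[f] under the assumed distribution and square loss

def empirical_error(n):
    """I_S[f] on a fresh sample of size n from the assumed distribution."""
    x = rng.uniform(0.0, 1.0, n)
    y = 2.0 * x + sigma * rng.standard_normal(n)
    return float(np.mean((f(x) - y) ** 2))

gaps = {n: abs(empirical_error(n) - expected_error) for n in (10, 1000, 100000)}
for n, gap in gaps.items():
    print(n, gap)
```

This only covers a function fixed in advance. Generalization of ERM is the harder statement, because f_S is chosen using the same sample S on which I_S is computed.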

ERM: conditions for well-posedness (stability) and predictivity (generalization)

ERM: conditions for well-posedness (stability) and predictivity (generalization). The theorem says that a proper choice of the hypothesis space H ensures generalization of ERM (and consistency, since for ERM generalization is necessary and sufficient for consistency and vice versa). A separate theorem also guarantees stability (defined in a specific way) of ERM. Thus, with the appropriate definition of stability, stability and generalization are equivalent for ERM. Other results characterize uGC classes in terms of measures of complexity or capacity of H (such as VC dimension). Thus the two desirable conditions for a learning algorithm, generalization and stability, are equivalent (and they correspond to the same constraints on H).

Regularization: Regularization is a method of improving the stability of solutions of ill-conditioned inverse problems. The basic idea in the treatment of ill-conditioned problems is to use some a priori knowledge about solutions to disqualify meaningless ones. Such knowledge can be: some regularity condition on the solution, expressed as the existence of derivatives up to a certain order with bounds on the magnitudes of these derivatives; or some localization condition, such as a bound on the support of the solution or its behavior at infinity. Tikhonov’s regularization penalizes undesired solutions by adding a term called a stabilizer.

Regularization: Generally speaking, any regularization method tries to analyze a related well-posed problem whose solution approximates the solution of the original ill-posed problem. The well-posedness is achieved by implementing one or more of the following basic ideas: restriction of the data; change of the space and/or topologies; modification of the operator itself; the concept of regularization operators; and well-posed stochastic extensions of ill-posed problems.

Regularization: Regularized cost function = empirical cost function + regularization parameter × regularizer function.
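One common instance of this recipe is ridge regression. The sketch below is an assumed example rather than the method of the slides: the empirical cost is the mean square error of a linear model, the regularizer is the squared norm of the weights, and `lam` plays the role of the regularization parameter. Setting the gradient of the regularized cost to zero gives the linear system (XᵀX + nλI) w = Xᵀy solved in `ridge_fit`.

```python
import numpy as np

def regularized_cost(w, X, y, lam):
    """Empirical cost (mean square error) plus lam times the regularizer ||w||^2."""
    residual = X @ w - y
    return float(residual @ residual / len(y) + lam * (w @ w))

def ridge_fit(X, y, lam):
    """Minimizer of regularized_cost: solve (X^T X + n * lam * I) w = X^T y."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(d), X.T @ y)

# Synthetic data (assumed): 20 points, 3 features, known weights plus noise.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(20)

w_small = ridge_fit(X, y, 1e-6)  # almost unregularized
w_large = ridge_fit(X, y, 1.0)   # strongly regularized
print(np.linalg.norm(w_small), np.linalg.norm(w_large))
```

Increasing the regularization parameter shrinks the solution towards zero, trading a larger empirical cost for a more stable solution.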

Image restoration – An ill-posed problem: In the degradation model, the operator H is ill-conditioned, which makes the image restoration problem ill-posed. The solution is not stable.

Tikhonov’s Regularization Theory: Proposed by Tikhonov in 1963. It proposes the use of prior knowledge to regularize mappings. The most common application is to exploit the smoothness property: “Similar inputs produce similar outputs for an input-output mapping to be smooth.”

Ivanov and Tikhonov Regularization

Tikhonov Regularization: As we will see in future classes, Tikhonov regularization ensures well-posedness, i.e., existence, uniqueness and especially stability (in a very strong form) of the solution. Tikhonov regularization ensures generalization. Tikhonov regularization is closely related to, but different from, Ivanov regularization, i.e., ERM on a hypothesis space H which is a ball in an RKHS.
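The two schemes can be written side by side. This is a sketch under the usual assumptions, not a transcription of the slide: H is an RKHS with norm ‖·‖_K, V is the loss function, and the radius τ and the regularization parameter λ are symbols introduced here for illustration.

```latex
% Ivanov regularization: ERM restricted to a ball of radius \sqrt{\tau} in the RKHS
\min_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i)
\quad \text{subject to} \quad \|f\|_K^2 \le \tau

% Tikhonov regularization: the norm constraint becomes a penalty term
\min_{f \in H} \; \frac{1}{n} \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda \, \|f\|_K^2
```

In the Ivanov form the norm bound is a hard constraint on the hypothesis space, while in the Tikhonov form it enters the objective as a stabilizer weighted by λ.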