In the Name of God
Statistical Learning Theory
Bounds on the Rate of Convergence of Learning Processes (Chapter 3)
Author: Vladimir N. Vapnik
Supervisor: Dr. Shiry
Presented by: L. Pour Mohammad Bagher

Introduction
In this chapter we consider upper bounds on the rate of uniform convergence. (Lower bounds are not as important for controlling learning processes as the upper bounds.)

Introduction
Two kinds of bounds on the rate of convergence:
- Distribution-dependent bounds (based on the annealed entropy function)
- Distribution-independent bounds (based on the growth function)
Both of these bounds are nonconstructive. Using the VC dimension of the set of functions (a scalar value that can be evaluated for any set of functions), they can be turned into constructive distribution-independent bounds.

THE BASIC INEQUALITIES
Let $Q(z,\alpha),\ \alpha\in\Lambda$, be a set of indicator functions, with
$$H^{\Lambda}(\ell)=E\ln N^{\Lambda}(z_1,\ldots,z_\ell)$$
the corresponding VC entropy,
$$H_{ann}^{\Lambda}(\ell)=\ln E\,N^{\Lambda}(z_1,\ldots,z_\ell)$$
the annealed entropy, and
$$G^{\Lambda}(\ell)=\ln\sup_{z_1,\ldots,z_\ell}N^{\Lambda}(z_1,\ldots,z_\ell)$$
the growth function.
Theorem 3.1 (the basic inequality in the theory of bounds):
$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)\right|>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{ann}^{\Lambda}(2\ell)}{\ell}-\varepsilon^{2}\right)\ell\right\}.$$

THE BASIC INEQUALITIES
The bounds are nontrivial if
$$\lim_{\ell\to\infty}\frac{H_{ann}^{\Lambda}(\ell)}{\ell}=0.$$
(In Chapter 2 we called this condition the second milestone of learning theory.)

THE BASIC INEQUALITIES
Theorem 3.1 estimates the rate of uniform convergence with respect to the norm of the deviation between probability and frequency. The maximal difference occurs for the events with maximal variance. In this (Bernoulli) case the variance is $\sigma^{2}=p(1-p)$, so the maximum of the variance is achieved for the events with probability $p=1/2$. Hence the largest deviations are associated with functions that possess large risk, not with the small-risk functions we actually care about.
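A one-line check of the variance claim (a standard calculus step, not on the original slide):
$$\frac{d}{dp}\,p(1-p)=1-2p=0\ \Longrightarrow\ p=\frac{1}{2},\qquad \max_{0\le p\le 1}p(1-p)=\frac{1}{4}.$$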

THE BASIC INEQUALITIES
Theorem 3.2 considers relative uniform convergence (we will obtain a bound on the risk where the confidence interval is determined by the rate of uniform convergence):
$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}}>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{ann}^{\Lambda}(2\ell)}{\ell}-\frac{\varepsilon^{2}}{4}\right)\ell\right\}.$$
For small risks, the upper bound on the risk obtained from Theorem 3.2 is much better than the upper bound obtained on the basis of Theorem 3.1.

THE BASIC INEQUALITIES
The bounds obtained in Theorems 3.1 and 3.2 are distribution-dependent. To construct distribution-independent bounds it is sufficient to note that for any distribution function F(z) the growth function is not less than the annealed entropy:
$$H^{\Lambda}(\ell)\le H_{ann}^{\Lambda}(\ell)\le G^{\Lambda}(\ell).$$
Replacing $H_{ann}^{\Lambda}(2\ell)$ by $G^{\Lambda}(2\ell)$ in Theorems 3.1 and 3.2 therefore yields bounds valid for any F(z).
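Why the chain of inequalities holds (a one-line justification added for clarity): the first step is Jensen's inequality applied to the concave logarithm, the second bounds an expectation by a supremum:
$$E\ln N^{\Lambda}(z_1,\ldots,z_\ell)\ \le\ \ln E\,N^{\Lambda}(z_1,\ldots,z_\ell)\ \le\ \ln\sup_{z_1,\ldots,z_\ell}N^{\Lambda}(z_1,\ldots,z_\ell).$$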

THE BASIC INEQUALITIES
These inequalities are nontrivial if
$$\lim_{\ell\to\infty}\frac{G^{\Lambda}(\ell)}{\ell}=0.\qquad (3.5)$$
This is a necessary and sufficient condition for distribution-free uniform convergence: if condition (3.5) is violated, then there exist probability measures F(z) on Z for which uniform convergence does not take place.

Generalization for the set of real functions
Let $Q(z,\alpha),\ \alpha\in\Lambda$, be a set of real functions with $A\le Q(z,\alpha)\le B$. Let us construct the set of indicator functions
$$I(z,\alpha,\beta)=\theta\{Q(z,\alpha)-\beta\},\qquad \beta\in(A,B),$$
where $\theta(u)$ is the step function. When the $Q(z,\alpha)$ are themselves indicator functions, the set of indicators coincides with this set of functions.

Generalization
In the generalization we distinguish three cases:
- Totally bounded functions
- Totally bounded nonnegative functions
- Nonnegative (not necessarily bounded) functions
The following bounds are nontrivial if
$$\lim_{\ell\to\infty}\frac{H_{ann}^{\Lambda,B}(\ell)}{\ell}=0.$$

Generalization
Totally bounded functions ($A\le Q(z,\alpha)\le B$):
$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)\right|>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{ann}^{\Lambda,B}(2\ell)}{\ell}-\frac{\varepsilon^{2}}{(B-A)^{2}}\right)\ell\right\}.$$
Totally bounded nonnegative functions ($0\le Q(z,\alpha)\le B$):
$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}}>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{ann}^{\Lambda,B}(2\ell)}{\ell}-\frac{\varepsilon^{2}}{4B}\right)\ell\right\}.$$

Generalization
Nonnegative functions: Let $Q(z,\alpha),\ \alpha\in\Lambda$, be a set of nonnegative functions such that for some $p>2$ the $p$th moments of the random variables $\xi_\alpha=Q(z,\alpha)$ exist:
$$m_p(\alpha)=\left(\int Q^{p}(z,\alpha)\,dF(z)\right)^{1/p}<\infty.$$
Then
$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)}{m_p(\alpha)}>a(p)\,\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{ann}^{\Lambda}(2\ell)}{\ell}-\frac{\varepsilon^{2}}{4}\right)\ell\right\},$$
where
$$a(p)=\sqrt[p-1]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}}.$$

Distribution-independent bounds
The above bounds were distribution-dependent. To obtain distribution-independent bounds one replaces the annealed entropy with the growth function. The following inequalities are nontrivial if
$$\lim_{\ell\to\infty}\frac{G^{\Lambda}(\ell)}{\ell}=0.$$

Distribution-independent bounds
For the set of totally bounded functions, for the set of nonnegative totally bounded functions, and for the set of nonnegative real functions whose $p$th moment exists for some $p>2$, the bounds keep the same form as above with the annealed entropy $H_{ann}^{\Lambda}(2\ell)$ replaced by the growth function $G^{\Lambda}(2\ell)$; for example, for the totally bounded case,
$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z)-\frac{1}{\ell}\sum_{i=1}^{\ell}Q(z_i,\alpha)\right|>\varepsilon\right\}\le 4\exp\left\{\left(\frac{G^{\Lambda}(2\ell)}{\ell}-\frac{\varepsilon^{2}}{(B-A)^{2}}\right)\ell\right\}.$$

Bounds on the generalization ability of learning machines
Two questions:
- What actual risk $R(\alpha_\ell)$ is provided by the function $Q(z,\alpha_\ell)$ that achieves the minimal empirical risk $R_{emp}(\alpha_\ell)$?
- How close is this risk to the minimal possible $\inf_{\alpha\in\Lambda}R(\alpha)$ for the given set of functions?
We use the notation
$$\mathcal{E}=4\,\frac{G^{\Lambda}(2\ell)-\ln(\eta/4)}{\ell}$$
(for distribution-dependent bounds, $G^{\Lambda}(2\ell)$ is replaced by $H_{ann}^{\Lambda}(2\ell)$); the bounds below are nontrivial when $\mathcal{E}<1$.

Distribution-independent bounds (another form)
For the set of totally bounded functions $A\le Q(z,\alpha)\le B$: with probability at least $1-\eta$, simultaneously for all functions of the set,
$$R(\alpha)\le R_{emp}(\alpha)+\frac{B-A}{2}\sqrt{\mathcal{E}};$$
with probability at least $1-2\eta$, for the function that minimizes the empirical risk,
$$R(\alpha_\ell)-\inf_{\alpha\in\Lambda}R(\alpha)\le (B-A)\sqrt{\frac{-\ln\eta}{2\ell}}+\frac{B-A}{2}\sqrt{\mathcal{E}}.$$

Distribution-independent bounds (another form)
For the set of totally bounded nonnegative functions $0\le Q(z,\alpha)\le B$: with probability at least $1-\eta$, simultaneously for all functions,
$$R(\alpha)\le R_{emp}(\alpha)+\frac{B\mathcal{E}}{2}\left(1+\sqrt{1+\frac{4R_{emp}(\alpha)}{B\mathcal{E}}}\right);$$
with probability at least $1-2\eta$, for the function that minimizes the empirical risk,
$$R(\alpha_\ell)-\inf_{\alpha\in\Lambda}R(\alpha)\le B\sqrt{\frac{-\ln\eta}{2\ell}}+\frac{B\mathcal{E}}{2}\left(1+\sqrt{1+\frac{4R_{emp}(\alpha_\ell)}{B\mathcal{E}}}\right).$$

Distribution-independent bounds (another form)
For the set of unbounded nonnegative functions, we are given a pair $(p,\tau)$ such that
$$\sup_{\alpha\in\Lambda}\frac{\left(\int Q^{p}(z,\alpha)\,dF(z)\right)^{1/p}}{\int Q(z,\alpha)\,dF(z)}\le\tau,\qquad p>2.$$
With probability at least $1-\eta$, simultaneously for all functions,
$$R(\alpha)\le\frac{R_{emp}(\alpha)}{\left(1-a(p)\,\tau\sqrt{\mathcal{E}}\right)_{+}},\qquad a(p)=\sqrt[p-1]{\frac{1}{2}\left(\frac{p-1}{p-2}\right)^{p-1}},$$
where $(u)_+=\max(u,0)$. A corresponding bound, holding with probability at least $1-2\eta$, exists for the deviation $R(\alpha_\ell)-\inf_{\alpha\in\Lambda}R(\alpha)$ of the function that minimizes the empirical risk.
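The coefficient a(p) is easy to evaluate numerically. A minimal sketch (the function name is mine, and it evaluates the a(p) formula as reconstructed above):

```python
import math

def a(p: float) -> float:
    """a(p) = ((1/2) * ((p-1)/(p-2))**(p-1)) ** (1/(p-1)), defined for p > 2."""
    assert p > 2, "the pth-moment condition requires p > 2"
    return (0.5 * ((p - 1) / (p - 2)) ** (p - 1)) ** (1 / (p - 1))

for p in (2.5, 3.0, 4.0, 10.0):
    # a(p) decreases toward 1 as p grows, so a stronger moment
    # assumption (larger p) gives a tighter multiplier in the bound.
    print(p, a(p))
```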

The structure of the growth function
To make the above bounds constructive one has to find a way to evaluate the annealed entropy and/or the growth function for the given set of functions. We will find constructive bounds by using the concept of the VC dimension of the set of functions. There is a remarkable connection between the concept of VC dimension and the growth function.

The structure of the growth function
Theorem: Any growth function either satisfies the equality
$$G^{\Lambda}(\ell)=\ell\ln 2$$
or is bounded by the inequality
$$G^{\Lambda}(\ell)\le h\left(\ln\frac{\ell}{h}+1\right),\qquad \ell>h,$$
where h is the largest integer for which $G^{\Lambda}(h)=h\ln 2$.
Definition: We will say that the VC dimension of the set of indicator functions is infinite if the growth function for this set of functions is linear. Otherwise the VC dimension is the finite value h, and the corresponding growth function is bounded by a logarithmic function with coefficient h.
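The dichotomy in the theorem is easy to see numerically. A minimal sketch (the function name is mine), using $\ell\ln 2$ up to $\ell=h$ and the bound $h(\ln(\ell/h)+1)$ beyond:

```python
import math

def growth_bound(l: int, h: int) -> float:
    """Upper bound on the growth function G(l) for a class of VC dimension h:
    l*ln(2) while l <= h (all dichotomies realizable), then h*(ln(l/h) + 1)."""
    if l <= h:
        return l * math.log(2)
    return h * (math.log(l / h) + 1)

h = 10
for l in (10, 100, 1000, 10000):
    # Past l = h the bound grows only logarithmically, which is what
    # makes the uniform-convergence bounds nontrivial (G(l)/l -> 0).
    print(l, round(growth_bound(l, h), 2), round(l * math.log(2), 2))
```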

VC dimension
The finiteness of the VC dimension of the set of indicator functions is a sufficient condition for consistency of the ERM method independent of the probability measure, and it implies a fast rate of convergence. It is also a necessary and sufficient condition for distribution-independent consistency of ERM learning machines.
The VC dimension of a set of indicator functions is the maximum number h of vectors $z_1,\ldots,z_h$ that can be separated into two classes in all possible ways using functions of the set (i.e., shattered). If for any n there exists a set of n vectors that can be shattered by the set of functions, then the VC dimension is equal to infinity.
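The definition can be made concrete with a brute-force shattering check. A sketch (all names are mine) for the 1-D threshold indicators $\theta(z-b)$, which shatter any single point but no pair of points:

```python
def shatters(points, fns):
    """True if the function class realizes all 2^n labelings of `points`."""
    realized = {tuple(f(z) for z in points) for f in fns}
    return len(realized) == 2 ** len(points)

def thresholds(points):
    """Finitely many representatives of theta(z - b): one b per gap."""
    cuts = sorted(points)
    bs = [cuts[0] - 1.0] + [(u + v) / 2 for u, v in zip(cuts, cuts[1:])] + [cuts[-1] + 1.0]
    return [lambda z, b=b: int(z >= b) for b in bs]

print(shatters([0.3], thresholds([0.3])))            # True: one point is shattered
print(shatters([0.3, 0.7], thresholds([0.3, 0.7])))  # False: labeling (1, 0) unreachable
# Hence the VC dimension of {theta(z - b)} on the line is h = 1.
```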

VC dimension
The VC dimension of a set of real functions: Let $Q(z,\alpha),\ \alpha\in\Lambda$, be a set of real functions bounded by constants A and B ($A\le Q(z,\alpha)\le B$). Consider the set of indicators
$$I(z,\alpha,\beta)=\theta\{Q(z,\alpha)-\beta\},\qquad \beta\in(A,B),$$
where $\theta(u)$ is the step function. The VC dimension of the set of real functions is defined to be the VC dimension of this set of corresponding indicators with parameters $\alpha\in\Lambda$ and $\beta\in(A,B)$.

VC dimension - Example
The VC dimension of the set of linear indicator functions
$$Q(z,\alpha)=\theta\left\{\sum_{p=1}^{n}\alpha_p z_p+\alpha_0\right\}$$
in n-dimensional coordinate space is h = n + 1, since by using functions of this set one can shatter at most n + 1 vectors. The VC dimension of the set of linear (real) functions $\sum_{p=1}^{n}\alpha_p z_p+\alpha_0$ in n-dimensional coordinate space is also h = n + 1, because the VC dimension of the corresponding linear indicator functions equals n + 1.

VC dimension - Example
The VC dimension of the set of functions $Q(z,\alpha)=\theta\{\sin\alpha z\},\ \alpha\in\mathbb{R}$, is infinite. The points on the line
$$z_i=10^{-i},\qquad i=1,\ldots,\ell,$$
can be shattered by functions from this set: to separate these data into two classes determined by any sequence of labels $y_1,\ldots,y_\ell\in\{0,1\}$, it is sufficient to choose the value of the parameter $\alpha$ to be
$$\alpha=\pi\left(\sum_{i=1}^{\ell}(1-y_i)\,10^{i}+1\right).$$
Thus, by choosing an appropriate coefficient $\alpha$, one can, for any number of appropriately chosen points, approximate the values of any function bounded by (-1, +1) using $\sin\alpha z$.
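A quick numeric check of this construction (a sketch; the helper names are mine):

```python
import math
from itertools import product

def shatter_alpha(labels):
    """Coefficient realizing labels y_1..y_m on the points z_i = 10**-i."""
    return math.pi * (sum((1 - y) * 10 ** (i + 1) for i, y in enumerate(labels)) + 1)

def predict(alpha, z):
    return int(math.sin(alpha * z) > 0)   # theta(sin(alpha * z))

m = 4
zs = [10 ** -(i + 1) for i in range(m)]
ok = all(
    tuple(predict(shatter_alpha(labels), z) for z in zs) == labels
    for labels in product((0, 1), repeat=m)   # all 2**m labelings of the m points
)
print("all", 2 ** m, "labelings realized:", ok)   # expected: True
```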

VC dimension
The VC dimension of a set of functions does not coincide with the number of parameters: it can be either larger or smaller than the number of parameters. In the following we present the bounds on the risk functional that, in Chapter 4, we use for constructing methods for controlling the generalization ability of learning machines.

Constructive distribution-independent bounds
Consider sets of functions that possess a finite VC dimension h. In this case
$$G^{\Lambda}(\ell)\le h\left(\ln\frac{\ell}{h}+1\right),$$
therefore in all inequalities of the above section the following constructive expression can be used:
$$\mathcal{E}=4\,\frac{h\left(\ln\frac{2\ell}{h}+1\right)-\ln(\eta/4)}{\ell}.$$
We will also consider the case where the set of loss functions contains a finite number of elements.
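A sketch of this constructive expression in code (function and variable names are mine):

```python
import math

def vc_epsilon(l: int, h: int, eta: float) -> float:
    """Constructive confidence term E for sample size l, VC dimension h,
    and confidence level 1 - eta, using G(2l) <= h*(ln(2l/h) + 1)."""
    return 4 * (h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l

for l in (100, 1000, 10000, 100000):
    # E shrinks roughly like (h/l)*ln(l/h): more data or a smaller
    # VC dimension tightens the resulting risk bounds.
    print(l, round(vc_epsilon(l, h=10, eta=0.05), 4))
```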

Constructive distribution-independent bounds
For the set of totally bounded functions $A\le Q(z,\alpha)\le B$: with probability at least $1-\eta$, simultaneously for all functions,
$$R(\alpha)\le R_{emp}(\alpha)+\frac{B-A}{2}\sqrt{\mathcal{E}};$$
with probability at least $1-2\eta$, for the function that minimizes the empirical risk,
$$R(\alpha_\ell)-\inf_{\alpha\in\Lambda}R(\alpha)\le (B-A)\sqrt{\frac{-\ln\eta}{2\ell}}+\frac{B-A}{2}\sqrt{\mathcal{E}}.$$

Constructive distribution-independent bounds
For the set of totally bounded nonnegative functions $0\le Q(z,\alpha)\le B$: with probability at least $1-\eta$, simultaneously for all functions,
$$R(\alpha)\le R_{emp}(\alpha)+\frac{B\mathcal{E}}{2}\left(1+\sqrt{1+\frac{4R_{emp}(\alpha)}{B\mathcal{E}}}\right);$$
with probability at least $1-2\eta$, for the function that minimizes the empirical risk,
$$R(\alpha_\ell)-\inf_{\alpha\in\Lambda}R(\alpha)\le B\sqrt{\frac{-\ln\eta}{2\ell}}+\frac{B\mathcal{E}}{2}\left(1+\sqrt{1+\frac{4R_{emp}(\alpha_\ell)}{B\mathcal{E}}}\right).$$
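For the common special case of the 0/1 classification loss (B = 1), the first inequality is directly computable. A sketch under that assumption (names are mine):

```python
import math

def vc_epsilon(l: int, h: int, eta: float) -> float:
    return 4 * (h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l

def risk_bound(emp_risk: float, l: int, h: int, eta: float) -> float:
    """Upper bound on the true risk for 0/1 loss (B = 1), per the inequality above."""
    e = vc_epsilon(l, h, eta)
    return emp_risk + (e / 2) * (1 + math.sqrt(1 + 4 * emp_risk / e))

# E.g., 5% training error, 10,000 samples, VC dimension 50, confidence 95%:
print(risk_bound(0.05, l=10_000, h=50, eta=0.05))   # roughly 0.23
```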

Constructive distribution-independent bounds
For the set of unbounded nonnegative functions satisfying the $(p,\tau)$ condition: with probability at least $1-\eta$, simultaneously for all functions,
$$R(\alpha)\le\frac{R_{emp}(\alpha)}{\left(1-a(p)\,\tau\sqrt{\mathcal{E}}\right)_{+}};$$
a corresponding bound, holding with probability at least $1-2\eta$, exists for the function that minimizes the empirical risk.

References
Vapnik, Vladimir N., The Nature of Statistical Learning Theory, Springer, 2000.