5.3 Algorithmic Stability Bounds. Summarized by: Sang Kyun Lee.

Robustness of a learning algorithm
- Instead of compression and reconstruction functions, we now consider the robustness of a learning algorithm A.
- Robustness: a measure of the influence of an additional training example (x, y) ∈ Z on the learned hypothesis A(z) ∈ H, quantified in terms of the loss achieved at any test object x ∈ X.
- A robust learning algorithm guarantees |expected risk - empirical risk| < M even if we replace one training example by its worst counterpart.
- This fact is of great help when using McDiarmid's inequality (A.119), a large deviation result perfectly suited for the current purpose.

McDiarmid's Inequality (A.119)
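For reference, the bounded-differences form of McDiarmid's inequality; the notation below is the standard statement of the result the slide displayed, not necessarily the book's exact typesetting:

```latex
% McDiarmid's bounded-differences inequality.
% Let Z_1,\dots,Z_m be independent random variables taking values in \mathcal{Z}, and let
% g:\mathcal{Z}^m \to \mathbb{R} satisfy, for every i and all z_1,\dots,z_m,\tilde z_i \in \mathcal{Z},
%   | g(z_1,\dots,z_i,\dots,z_m) - g(z_1,\dots,\tilde z_i,\dots,z_m) | \le c_i .
% Then, for every \varepsilon > 0,
\Pr\Bigl( g(Z_1,\dots,Z_m) - \mathbf{E}\bigl[ g(Z_1,\dots,Z_m) \bigr] \ge \varepsilon \Bigr)
\;\le\; \exp\!\left( -\,\frac{2\varepsilon^{2}}{\sum_{i=1}^{m} c_i^{2}} \right).
```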

Algorithmic Stability for Regression: Framework
- Training sample: z ∈ Z^m, drawn iid from an unknown distribution.
- Hypothesis: a real-valued function f.
- Loss function: l : ℝ × ℝ → ℝ, a function of the predicted value and the observed value t.
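Two risk functionals recur throughout these slides; the deck showed them as equations, and in the notation used later (R, R_emp, f_z) they are the standard definitions:

```latex
% Expected risk of a real-valued hypothesis f under the unknown distribution P_Z:
R[f] \;=\; \mathbf{E}_{(X,T) \sim P_Z}\bigl[\, l\bigl( f(X),\, T \bigr) \,\bigr],
% Empirical risk on the training sample z = ((x_1,t_1),\dots,(x_m,t_m)):
R_{\mathrm{emp}}[f, z] \;=\; \frac{1}{m} \sum_{i=1}^{m} l\bigl( f(x_i),\, t_i \bigr).
```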

Notations
- Given a training sample z ∈ Z^m and a learning algorithm A, define the modified samples and the learned hypothesis used below (see the display that follows).
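The quantities these slides rely on are, in notation standard for stability analyses (the exact symbols are an assumption of this summary; their roles match how they are used on the following slides):

```latex
% For z = (z_1,\dots,z_m) \in Z^m with z_i = (x_i, t_i):
z^{\setminus i} = (z_1,\dots,z_{i-1},\, z_{i+1},\dots,z_m),          % the i-th example removed
\qquad
z^{\,i \leftrightarrow \tilde z} = (z_1,\dots,z_{i-1},\, \tilde z,\, z_{i+1},\dots,z_m),  % the i-th example replaced by \tilde z
\qquad
f_z = \mathcal{A}(z).                                                % the hypothesis learned from z
```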

β_m-stability (1/2)
- The definition of β_m-stability of a learning algorithm (one common formulation is reproduced below).
- This implies robustness in the more usual sense of measuring the influence of an extra training example, which is formally expressed in the following theorem (Theorem 5.27).
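One common formulation of β_m-stability, in the style of Bousquet and Elisseeff's uniform stability; the book's own definition may be stated with respect to adding or replacing an example and may differ in constants, so treat this as the working assumption behind the later sketches:

```latex
% A learning algorithm \mathcal{A} is \beta_m-stable if, for all training samples z \in Z^m,
% all indices i \in \{1,\dots,m\}, and all test examples (x, t) \in Z,
\bigl|\, l\bigl( f_z(x),\, t \bigr) \;-\; l\bigl( f_{z^{\setminus i}}(x),\, t \bigr) \,\bigr| \;\le\; \beta_m .
```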

β_m-stability (2/2)
- Proof of Theorem 5.27.

Lipschitz Loss Function (1/3)
- Thus, given a Lipschitz continuous loss function l, the difference of two hypotheses can be used to bound the difference of the losses they incur at any test object x (see the display below).
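In symbols, with the Lipschitz constant written C_l (the letter is this summary's choice; the book may use a different symbol):

```latex
% Lipschitz continuity of the loss in its first (prediction) argument:
\bigl| l(\hat t_1, t) - l(\hat t_2, t) \bigr| \;\le\; C_l \,\bigl| \hat t_1 - \hat t_2 \bigr|
\qquad \text{for all } \hat t_1, \hat t_2, t \in \mathbb{R},
% hence, for any two hypotheses f and g and any test object x,
\bigl| l\bigl(f(x), t\bigr) - l\bigl(g(x), t\bigr) \bigr| \;\le\; C_l \,\bigl| f(x) - g(x) \bigr| .
```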

Lipschitz Loss Function (2/3)
- Examples of Lipschitz continuous loss functions (illustrative instances are listed below).
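Which losses the slide listed is not recoverable from the transcript; standard examples consistent with the losses used elsewhere in the book, each Lipschitz with constant 1 in the prediction, are the absolute loss, the ε-insensitive loss, and the linear soft margin (hinge) loss. The squared loss, by contrast, is Lipschitz only on a bounded range.

```latex
l_{\mathrm{abs}}(\hat t, t) = |\hat t - t|, \qquad
l_{\varepsilon}(\hat t, t) = \max\bigl( 0,\; |\hat t - t| - \varepsilon \bigr), \qquad
l_{\mathrm{lin}}(\hat t, y) = \max\bigl( 0,\; 1 - y \hat t \bigr), \quad y \in \{-1, +1\}.
```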

Lipschitz Loss Function (3/3)
- Using the concept of Lipschitz continuous loss functions, we can upper bound the value of β_m for a large class of learning algorithms via the following theorem (proof in Appendix C9.1); a sketch of the resulting bound is given below.
- Using this, we are able to cast most of the learning algorithms presented in Part I of this book into this framework.
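For regularized kernel algorithms, i.e. those minimizing the empirical risk plus λ‖f‖² over an RKHS with k(x, x) ≤ κ² and a C_l-Lipschitz loss, the Bousquet and Elisseeff style result gives a bound of the following form; the exact constant in the book's theorem may differ, so this is a sketch rather than a quotation:

```latex
\beta_m \;\le\; \frac{C_l^{2}\, \kappa^{2}}{2\, \lambda\, m}.
```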

Algorithmic Stability Bound for Regression Estimation
Now, in order to obtain generalization error bounds for β_m-stable learning algorithms A, we proceed as follows:
1. To use McDiarmid's inequality, define a random variable g(Z) which measures |R[f_z] - R_emp[f_z, z]| or |R[f_z] - R_loo[A, z]|; for example, g(Z) = R[f_Z] - R_emp[f_Z, Z].
2. Upper bound E[g] over the random draw of training samples z ∈ Z^m. This is needed because we are interested in the probability that g(Z) is larger than some prespecified ε, whereas McDiarmid's inequality only controls the deviation of g(Z) from its expectation.
3. Upper bound the change of g(Z) under replacement of a single training example, preferably by a quantity that does not depend on i ∈ {1, ..., m}.
(Both bounds are sketched below.)
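For the choice g(Z) = R[f_Z] - R_emp[f_Z, Z], the two quantities required in steps 2 and 3 are bounded in the Bousquet and Elisseeff analysis roughly as follows, with M an upper bound on the loss; the book's constants may differ slightly:

```latex
\mathbf{E}_{Z}\bigl[ g(Z) \bigr] \;\le\; 2\,\beta_m,
\qquad
\sup_{z \in Z^m,\ \tilde z \in Z,\ i \in \{1,\dots,m\}}
\bigl| g(z) - g(z^{\,i \leftrightarrow \tilde z}) \bigr|
\;\le\; 4\,\beta_m + \frac{M}{m}.
```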

Algorithmic Stability Bound for Regression Estimation (C9.2 – 1/8)
Quick proof: expectation over the random draw of training samples z ∈ Z^m.

Algorithmic Stability Bound for Regression Estimation (C9.2 – 2/8)
Quick proof (continued).

Algorithmic Stability Bound for Regression Estimation (C9.2 – 3/8)

Algorithmic Stability Bound for Regression Estimation (C9.2 – 4/8)
Proof by Lemma C.21.

Algorithmic Stability Bound for Regression Estimation (C9.2 – 5/8)
Summary:
- The two bounds (for the empirical risk and for the leave-one-out error) are essentially the same:
  - the additive correction is ≈ β_m;
  - the decay of the probability is O(exp(-ε²/(m·β_m²))).
- This result is slightly surprising, because:
  - VC theory indicates that the training error R_emp is only a good indicator of the generalization error when the hypothesis space has a small VC dimension (Theorem 4.7);
  - in contrast, the leave-one-out error disregards the VC dimension and is an almost unbiased estimator of the expected generalization error of an algorithm (Theorem 2.36).
(The empirical-risk version of the bound is sketched below.)
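Written out, the empirical-risk version of the bound being summarized has the following shape in the Bousquet and Elisseeff formulation, with M an upper bound on the loss; the constants in the book's theorem may differ: with probability at least 1-δ over the random draw of z,

```latex
R[f_z] \;\le\; R_{\mathrm{emp}}[f_z, z] \;+\; 2\,\beta_m
\;+\; \bigl( 4 m \beta_m + M \bigr) \sqrt{ \frac{\ln(1/\delta)}{2m} } .
```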

Algorithmic Stability Bound for Regression Estimation (C9.2 – 6/8)
However, recall that:
- VC theory is used for empirical risk minimization algorithms, which consider only the training error as the cost function to be minimized;
- in contrast, in the current formulation we have to guarantee a certain stability of the learning algorithm: in the case λ → 0 (the learning algorithm minimizes the empirical risk only), we can no longer guarantee finite stability.

Algorithmic Stability Bound for Regression Estimation (C9.2 – 7/8)
- Consider a β_m-stable algorithm A such that β_m ≤ η/m.
- Then, from Theorem 5.32, the bound below holds with probability at least 1-δ.
- This is an amazingly tight generalization error bound whenever η is small compared with √m, because the expression is dominated by the second term.
- Moreover, this provides practical guidance on the possible values of the trade-off parameter: by (5.19), it applies regardless of the empirical term R_emp[A(z), z].
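Substituting β_m ≤ η/m into the bound of the previous slide gives the form this slide displayed as an equation; the constants again follow the Bousquet and Elisseeff version and are an assumption: with probability at least 1-δ,

```latex
R[f_z] \;\le\; R_{\mathrm{emp}}[f_z, z] \;+\; \frac{2\eta}{m}
\;+\; \bigl( 4\eta + M \bigr) \sqrt{ \frac{\ln(1/\delta)}{2m} } .
```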


Algorithmic Stability for Classification: Framework
- Training sample: z ∈ Z^m, now with labels y ∈ {-1, +1}.
- Hypothesis: a classifier h.
- Loss function: we confine ourselves to the zero-one loss (written out below), although the following also applies to any loss that takes a finite set of values.
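The zero-one loss referred to here is the usual indicator loss:

```latex
l_{0\text{-}1}\bigl( h(x),\, y \bigr) \;=\; \mathbf{I}_{h(x) \neq y} \;=\;
\begin{cases}
0 & \text{if } h(x) = y, \\
1 & \text{otherwise.}
\end{cases}
```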

β_m-stability for Classification
- For a given classification algorithm, we now have β_m ∈ {0, 1} only.
- β_m = 0 occurs only if the learned hypotheses incur identical zero-one losses for all training samples z ∈ Z^m and all test examples (x, y) ∈ Z, which is only possible if H contains only one hypothesis.
- If we exclude this trivial case, then Theorem 5.32 gives only a trivial result.

Refined Loss Function (1/2)
- In order to circumvent this problem, we consider the real-valued output f(x) and classifiers of the form h(·) = sign(f(·)).
- As our ultimate interest is the generalization error of sign(f), we consider a refined loss function that is an upper bound of the zero-one loss.
- Advantage of this loss function setting: a bound on the expected refined loss immediately implies a bound on the generalization error, since the refined loss upper bounds the zero-one loss.

Refined Loss Function (2/2)
- Another useful requirement on the refined loss function l_τ is Lipschitz continuity with a small Lipschitz constant.
- This can be achieved by adjusting the linear soft margin loss max(0, 1 - y·f(x)), where y ∈ {-1, +1}:
  1. Modify the function so that it outputs 0 whenever y·f(x) is at least τ on the correct side.
  2. The loss function has to pass through 1 for f(x) = 0; thus the steepness of the function is 1/τ, and therefore the Lipschitz constant is also 1/τ.
  3. The function should stay in the interval [0, 1], because the zero-one loss never exceeds 1.
- A piecewise definition meeting these requirements is sketched below.
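A piecewise-linear loss satisfying all three requirements (τ is the name used in this summary for the margin parameter, whose symbol did not survive the transcript; the construction follows the three steps above):

```latex
l_{\tau}\bigl( f(x),\, y \bigr) \;=\;
\begin{cases}
1 & \text{if } y f(x) \le 0, \\[3pt]
1 - \dfrac{y f(x)}{\tau} & \text{if } 0 < y f(x) < \tau, \\[3pt]
0 & \text{if } y f(x) \ge \tau,
\end{cases}
\qquad \text{i.e.} \qquad
l_{\tau} = \min\!\Bigl( 1,\; \max\bigl( 0,\; 1 - \tfrac{y f(x)}{\tau} \bigr) \Bigr).
```

This loss is 1/τ-Lipschitz in f(x), lies in [0, 1], and upper bounds the zero-one loss of sign(f).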


Algorithmic Stability for Classification (1/3)
- For τ → ∞, the first term is provably non-increasing, whereas the second term is always decreasing (a sketch of the bound's form follows).
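The theorem stated on this slide combines the regression bound with the 1/τ Lipschitz constant of the refined loss. One plausible form, using the same hedged constants as before (the book's statement may differ in its constants): with probability at least 1-δ,

```latex
R\bigl[ \operatorname{sign}(f_z) \bigr]
\;\le\; R_{\mathrm{emp}}^{\tau}[f_z, z]
\;+\; \frac{2\,\beta_m}{\tau}
\;+\; \Bigl( \frac{4 m \beta_m}{\tau} + 1 \Bigr) \sqrt{ \frac{\ln(1/\delta)}{2m} },
```

where R_emp^τ denotes the empirical risk measured with the refined loss l_τ.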

Algorithmic Stability for Classification (2/3)
- Consider this theorem for the special case of the linear soft margin SVM for classification (see Subsection 2.4.2).
- Without loss of generality, assume τ = 1.
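For the linear soft margin SVM the hinge loss is 1-Lipschitz, so the Lipschitz theorem sketched earlier specializes (assuming k(x, x) ≤ κ² and regularization parameter λ; the exact constant is again an assumption) to:

```latex
\beta_m \;\le\; \frac{\kappa^{2}}{2\, \lambda\, m}.
```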

Algorithmic Stability for Classification (3/3)
- This bound provides an interesting model selection criterion by which we can select the value of λ (the assumed noise level); a sketch of such a criterion follows this slide.
- In contrast to the result of Subsection 4.4.3, this bound holds only for the linear soft margin SVM.
- The results in this section are so recent that no empirical studies have yet been carried out.
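A minimal sketch of how such a criterion could be used in practice, assuming the hedged form of the bound sketched above; the constants, the function names (stability_bound, emp_risk_for), and the grid of λ values are all illustrative, not the book's prescription:

```python
import numpy as np

def stability_bound(emp_margin_risk, m, lam, kappa=1.0, tau=1.0, delta=0.05):
    """Evaluate the (hedged) algorithmic stability bound for the linear soft
    margin SVM: empirical refined-loss risk plus the stability correction.

    Assumes beta_m <= kappa**2 / (2 * lam * m) as the stability coefficient.
    """
    beta_m = kappa ** 2 / (2.0 * lam * m)
    deviation = np.sqrt(np.log(1.0 / delta) / (2.0 * m))
    return emp_margin_risk + 2.0 * beta_m / tau + (4.0 * m * beta_m / tau + 1.0) * deviation

def emp_risk_for(lam):
    # Placeholder for the empirical refined-loss risk of the SVM trained with
    # trade-off parameter lam; in practice this would come from training.
    return 0.1 + 0.05 * np.log10(lam + 1e-3)  # illustrative curve, not real data

# Model selection: pick the lambda whose bound value is smallest.
m = 1000
lambdas = np.logspace(-3, 1, 20)
bounds = [stability_bound(emp_risk_for(lam), m, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(bounds))]
print(f"lambda minimizing the bound: {best_lam:.4g}")
```

The point of the sketch is only the shape of the criterion: the bound trades the empirical term against a stability correction that shrinks as λ grows, so minimizing it over a grid of λ values yields a data-independent model selection rule.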

Algorithmic Stability for Classification (4/4)