Chapter 6: The Structural Risk Minimization Principle
Junping Zhang, Intelligent Information Processing Laboratory, Fudan University
March 23, 2004

Objectives

Structural risk minimization

Two other induction principles

The Scheme of the SRM induction principle

Real-Valued functions

Principle of SRM

SRM

Minimum Description Length and SRM inductive principles
- The idea about the nature of random phenomena
- The Minimum Description Length principle for the pattern recognition problem
- Bounds for the MDL
- SRM for the simplest model and MDL
- The shortcoming of the MDL

The idea about the nature of random phenomena
Probability theory (1930s, Kolmogorov) provides formal inference from axioms. The axiomatization does not consider the nature of randomness: the axioms simply take the probability measures as given.

The idea about the nature of random phenomena
A model of randomness: Solomonoff (1965), Kolmogorov (1965), Chaitin (1966).
Algorithmic (descriptive) complexity: the length of the shortest binary computer program that describes the object. Up to an additive constant it does not depend on the type of computer, so it is a universal characteristic of the object.

A relatively long string describing an object is random if the algorithmic complexity of the object is high, that is, if the given description of the object cannot be compressed significantly.
MML (Wallace and Boulton, 1968) and MDL (Rissanen, 1978) take algorithmic complexity as the main tool of inductive inference for learning machines.
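Kolmogorov complexity itself is not computable, but an ordinary compressor gives a crude working proxy for the idea that a random string is one whose description cannot be compressed significantly. The short Python sketch below is an illustration added here, not something from Vapnik's text: it compares the compression ratio of a highly regular byte string with that of a (pseudo-)random one.

```python
import os
import zlib

def compression_ratio(data: bytes) -> float:
    """Crude proxy for descriptive complexity: compressed size / raw size."""
    return len(zlib.compress(data, 9)) / len(data)

regular = b"01" * 5000            # highly structured byte string
noisy = os.urandom(10000)         # (pseudo-)random bytes

print(f"regular string: ratio {compression_ratio(regular):.3f}")  # compresses to a tiny fraction
print(f"random string:  ratio {compression_ratio(noisy):.3f}")    # stays close to 1.0
```

The regular string compresses to a small fraction of its length while the random one does not; in the algorithmic-complexity view, the second string is the "random" object.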

Minimum Description Length principle for the pattern recognition problem
Given l pairs containing the vector x and the binary value ω,
(ω_1, x_1), ..., (ω_l, x_l),
consider two strings: the binary string
ω_1, ..., ω_l   (146)
and the string of vectors
x_1, ..., x_l.   (147)

Question
Q: Given (147), is the string (146) a random object?
A: To answer this, analyze the complexity of the string (146) in the spirit of the Solomonoff-Kolmogorov-Chaitin ideas.

Compress its description
Since the ω_i, i = 1, ..., l, are binary values, the string (146) is described by l bits. Since the training pairs were drawn randomly and independently, the value ω_i depends on the vector x_i but not on the other vectors x_j, j ≠ i.

Model

General case: the code book does not contain a perfect table (no table reproduces the string (146) exactly).

Randomness

Bounds for the MDL
Q: Does the compression coefficient K(T) determine the probability of test error when classifying (decoding) the vectors x using the table T?
A: Yes.
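For orientation, the bound has roughly the following form (with probability at least 1 − η); the exact expression and constants should be taken from the theorem in Vapnik's text, since this display is reconstructed rather than quoted:

\[
R(T) \;<\; 2\Bigl(K(T)\ln 2 \;-\; \frac{\ln\eta}{l}\Bigr),
\]

where R(T) is the probability of error of the table T and l is the number of training pairs.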

Comparison between the MDL and ERM in the simplest model

SRM for the simplest Model and MDL

The power of the compression coefficient
To obtain a bound on the probability of error, only the value of the compression coefficient needs to be known.

The power of the compression coefficient
Apart from the coefficient itself, no other details are needed, such as:
- how many examples we used;
- how the structure of the code books was organized;
- which code book was used, and how many tables were in this code book;
- how many errors were made by the table from the code book we used.

MDL principle
To minimize the probability of error, one has to minimize the compression coefficient.
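A minimal sketch of this selection rule under simplified assumptions: each candidate table is charged log2 N bits to name it inside its code book of N tables and, if it makes d errors on the l training labels, roughly log2 C(l, d) further bits to encode which labels must be corrected; the MDL choice is the table with the smallest resulting compression coefficient K(T). The error-encoding cost and the helper names are illustrative, not Vapnik's exact construction.

```python
import math

def compression_coefficient(l: int, n_tables: int, d_errors: int) -> float:
    """Illustrative compression coefficient K(T) for a table drawn from a code
    book with n_tables entries that makes d_errors mistakes on l examples.

    Simplified description length in bits:
      log2(n_tables)        -- which table was used
      log2(C(l, d_errors))  -- which training labels it gets wrong
    The uncompressed label string costs l bits, so K(T) = description / l."""
    description_bits = math.log2(n_tables) + math.log2(math.comb(l, d_errors))
    return description_bits / l

l = 200
candidates = [
    {"name": "small code book, 3 training errors", "n_tables": 2 ** 10, "d_errors": 3},
    {"name": "large code book, 0 training errors", "n_tables": 2 ** 40, "d_errors": 0},
]
for c in candidates:
    c["K"] = compression_coefficient(l, c["n_tables"], c["d_errors"])
    print(f'{c["name"]}: K(T) = {c["K"]:.3f}')

best = min(candidates, key=lambda c: c["K"])
print("MDL choice:", best["name"])   # the smaller K(T) wins, training errors notwithstanding
```

Note how the large, error-free code book can lose to a small code book with a few training errors: a longer "name" for the table is a price just like correcting mistakes.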

The shortcoming of the MDL
MDL uses code books with a finite number of tables. When the set of functions depends continuously on parameters, one has to first quantize that set in order to construct the tables.

Quantization
How do we make a 'smart' quantization for a given number of observations? For a given set of functions, how can we construct a code book with a small number of tables but with good approximation ability?
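A toy illustration of such a quantization (the family, grid, and names are my own choices, not a construction from the text): take the one-dimensional thresholds f_θ(x) = sign(x − θ), which depend continuously on θ, quantize θ onto a uniform grid, and let each grid point define one 'table', i.e., one fixed labelling of the training vectors.

```python
import numpy as np

def build_code_book(x_train, n_thresholds):
    """Quantize the continuous family f_theta(x) = sign(x - theta) into a finite
    code book: each quantized theta yields one table of predicted labels (+1/-1)."""
    thetas = np.linspace(x_train.min(), x_train.max(), n_thresholds)
    tables = np.sign(x_train[None, :] - thetas[:, None])   # shape (n_thresholds, l)
    tables[tables == 0] = 1
    return thetas, tables

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = np.where(x > 0.2, 1, -1)                 # labels generated by a threshold at 0.2

thetas, tables = build_code_book(x, n_thresholds=16)
errors = (tables != y).sum(axis=1)           # training errors of every table
best = errors.argmin()
print(f"code book: {len(thetas)} tables")
print(f"best quantized threshold: {thetas[best]:.2f} ({errors[best]} training errors)")
```

A coarser grid gives a smaller code book (a smaller log2 N term) but a worse best table; a finer grid does the opposite, which is exactly the trade-off the questions above are about.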

The shortcoming of the MDL
Finding a good quantization is extremely difficult, and this is the main shortcoming of the MDL principle. The MDL principle works well when the problem of constructing reasonable code books has a good solution.

Consistency of the SRM principle and asymptotic bounds on the rate of convergence
Q: Is the SRM principle consistent? What is the bound on the (asymptotic) rate of convergence?

Consistency of the SRM principle.

Simplified version

Remark
To avoid choosing the minimum of functional (156) over an infinite number of elements of the structure, add a constraint: choose the minimum over the first l elements of the structure, where l equals the number of observations.
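A minimal sketch of this selection rule with assumed ingredients: a structure S_1 ⊂ S_2 ⊂ ... of polynomial models of growing degree, empirical risk measured by mean squared error, and a VC-style confidence term of the familiar form sqrt((h(ln(2l/h) + 1) − ln(η/4)) / l). The concrete functional, the capacity proxy h, and the restriction to at most ten elements are illustrative, not formula (156) itself.

```python
import numpy as np

def confidence_term(h, l, eta=0.05):
    """VC-style confidence term (assumed form) for an element of capacity h."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

def srm_select(x, y, eta=0.05):
    """Pick the degree minimizing empirical risk + confidence term, searching
    only the first few elements of the structure (at most l of them)."""
    l = len(x)
    best = None
    for degree in range(1, min(l, 10) + 1):        # elements S_1, S_2, ... of the structure
        coeffs = np.polyfit(x, y, degree)
        emp_risk = np.mean((np.polyval(coeffs, x) - y) ** 2)
        h = degree + 1                             # rough capacity measure (assumption)
        score = emp_risk + confidence_term(h, l, eta)
        if best is None or score < best[0]:
            best = (score, degree, emp_risk)
    return best

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(-1, 1, 60))
y = np.sin(3 * x) + 0.1 * rng.standard_normal(60)
score, degree, emp_risk = srm_select(x, y)
print(f"selected degree {degree}: empirical risk {emp_risk:.4f}, penalized risk {score:.4f}")
```

The empirical risk alone always favors the largest element; adding the confidence term makes the choice stop at a moderate degree, which is the point of the principle.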

Discussions and Example

The rate of convergence is determined by two contradictory requirements on the rule n = n(l).
The first summand (the approximation error): the larger n = n(l), the smaller the deviation.
The second summand (the confidence term): the larger n = n(l), the larger the deviation.
For structures with a known bound on the rate of approximation, one selects the rule n = n(l) that assures the largest rate of convergence.
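Schematically, and only as a reconstruction of the shape of the bound rather than the exact expression from the text, the guaranteed risk of the element chosen by the rule n = n(l) behaves like

\[
V(l) \;\lesssim\; r_{n(l)} \;+\; \sqrt{\frac{h_{n(l)}\,\ln l}{l}},
\]

where r_n is the rate of approximation of the n-th element and h_n its VC dimension: the first summand falls and the second grows as n = n(l) increases, and the best rule balances the two.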

Bounds for the regression estimation problem

The model of regression estimation by series expansion
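A small sketch of this model under illustrative assumptions (the cosine basis, the penalty, and the selection rule are mine, not the book's): approximate the regression by the first n terms of a series expansion and choose n, which indexes the elements of the structure, by a penalized empirical risk.

```python
import numpy as np

def design_matrix(x, n_terms):
    """First n_terms elements of a cosine series on [0, 1] (illustrative basis)."""
    return np.column_stack([np.cos(np.pi * k * x) for k in range(n_terms)])

def fit_series(x, y, n_terms):
    """Least-squares fit of the truncated series; returns the empirical risk."""
    Phi = design_matrix(x, n_terms)
    coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return np.mean((Phi @ coeffs - y) ** 2)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 80))
y = np.cos(2 * np.pi * x) + 0.1 * rng.standard_normal(80)

l = len(x)
scores = []
for n in range(1, 15):                          # n indexes the elements of the structure
    penalty = np.sqrt(n * np.log(l) / l)        # rough capacity penalty (assumption)
    scores.append((fit_series(x, y, n) + penalty, n))
best_score, best_n = min(scores)
print(f"selected n = {best_n} series terms, penalized risk {best_score:.4f}")
```

The balance between residual fit and penalty determines how many terms are retained; with more observations the penalty shrinks and more terms of the expansion can be afforded.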

Example

The problem of approximating functions

To get a high asymptotic rate of approximation, the only constraint is that the kernel should be a bounded function that can be described as a family of functions possessing finite VC dimension.

Problem of local risk minimization

Local Risk Minimization Model

Note
Using local risk minimization methods, one probably does not need rich sets of approximating functions, whereas the classical semi-local methods are based on using a set of constant functions.

Note
For local estimation of functions in the one-dimensional case, it is probably enough to consider elements S_k, k = 0, 1, 2, 3, containing the polynomials of degree 0, 1, 2, 3.
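A compact sketch of the one-dimensional case under illustrative assumptions (the Gaussian locality kernel, the bandwidth, and the helper name are mine): estimate the function at a point x0 by minimizing a kernel-weighted squared loss over polynomials of degree k in {0, 1, 2, 3}.

```python
import numpy as np

def local_poly_estimate(x0, x, y, degree=1, bandwidth=0.2):
    """Local risk minimization at x0: kernel-weighted least-squares fit of a
    low-degree polynomial; degree 0 recovers the classical semi-local,
    constant-function (Nadaraya-Watson type) estimate."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)        # Gaussian locality weights
    coeffs = np.polyfit(x - x0, y, degree, w=np.sqrt(w))  # weighted least squares
    return coeffs[-1]                                     # fitted value at x0

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1, 1, 120))
y = np.sin(3 * x) + 0.1 * rng.standard_normal(120)

x0 = 0.3
for k in range(4):                                        # elements S_0, ..., S_3
    est = local_poly_estimate(x0, x, y, degree=k)
    print(f"degree {k}: estimate at x0 = {est:.3f} (true value {np.sin(3 * x0):.3f})")
```

Even these very small sets of approximating functions give reasonable local estimates, which is the point of the note above.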

Summary
- MDL
- SRM
- Local risk functional