The Nature of Statistical Learning Theory by V. Vapnik


Presentation transcript:

Statistical Learning Theory & Classifications Based on Support Vector Machines, based on The Nature of Statistical Learning Theory by V. Vapnik. Presented 2014 by Anders Melen; 2015 by Rachel Temple. Turn off all the lights. Introduce myself and my major. Introduce the topic.

Table of Contents Empirical Data Modeling What is Statistical Learning Theory Model of Supervised Learning Risk Minimization Vapnik-Chervonenkis Dimensions Structural Risk Minimization (SRM) Support Vector Machines (SVM) Exam Questions Q & A Session We are going to start with empirical data modeling, which we've talked about many times in this course, so you should feel very familiar with it; if you're unsure, you should pick up the idea pretty quickly in a few minutes. Then a quick, simplified explanation of what statistical learning theory is and what it's all about, an example of what supervised learning is, risk minimization, and so on. There is a lot to talk about here. Every time I reach a new section I'm going to jump back to the table of contents so you know we're moving on to a new topic. I generally talk really fast during presentations, so if I do, please feel free to yell at me so you can keep up. I also tried to strike a good balance between covering high-level concepts and some important low-level details.

Let's start off with something we should all be familiar with: empirical data modeling.

Empirical Data Modeling Observations of a system are collected. Induction on those observations is used to build up a model of the system. The model is then used to deduce responses of the system in unobserved situations. Sampling is typically non-uniform, and high-dimensional problems form a sparse distribution in the input space. This basically refers to any kind of computer modeling that uses empirical observations rather than mathematical relationships, empirical data being data that has been gathered by observation.

Modeling Error Approximation error is the consequence of the hypothesis space not fitting the target space (diagram: globally optimal model, best reachable model, selected model). The underlying function may lie outside the hypothesis space, and a poor choice of the model space will result in a large approximation error (model mismatch).

Modeling Error Approximation error is the consequence of the hypothesis space not fitting the target space. Goal: choose a model from the hypothesis space which is closest (with respect to some error measure) to the function in the target space.

Estimation error is the error between the best model in our hypothesis space and the model within our hypothesis space that we selected. Approximation error (globally optimal model to best reachable model) and estimation error (best reachable model to selected model) together span from the globally optimal model to the selected model: this forms the generalization error.

The globally optimal model and the selected model together give the generalization error, which measures how well our data model adapts to new, unobserved data.


Statistical Learning Theory Definition: “Consider the learning problem as a problem of finding a desired dependence using a limited number of observations.” (Vapnik 17) Learning itself falls into several categories, most importantly the first two: unsupervised learning, supervised learning, online learning, and reinforcement learning.


Model of Supervised Learning Training: The supervisor takes each generated x value and returns an output value y. Each (x, y) pair is part of the training set (x1, y1), (x2, y2), …, (xl, yl), drawn from the joint distribution F(x, y) = F(x) F(y|x). Many of the algorithms we have already discussed in this course are based on supervised learning. The factored form above is the expanded conditional probability decomposition most of us are familiar with; F(y|x) reads “the probability of y given x.” What this diagram is essentially saying is that we pass a training set to both the supervisor S and the learning machine LM, where each training row has a solution y. Thus we supervise the building of a model that can accurately predict future values outside of our training set.
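A minimal sketch of this generator/supervisor setup in Python (the distribution and the noisy labeling rule below are hypothetical, chosen only to make the model concrete):

```python
import random

random.seed(0)

# Generator G: draws x values from a fixed (here uniform) distribution F(x).
generate = lambda: random.uniform(0.0, 1.0)

# Supervisor S: returns an output y for each x according to F(y|x)
# (here a noisy threshold rule, so the same x need not always get the same y).
def supervise(x):
    return 1 if x + random.gauss(0.0, 0.1) > 0.5 else 0

# The learning machine LM only ever sees the finite set of (x, y) pairs.
training_set = [(x, supervise(x)) for x in (generate() for _ in range(5))]
print(training_set)
```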


Risk Minimization To find the best function, we need to measure loss. L(y, f(x, 𝛂)) is the discrepancy between the y generated by the supervisor and the ŷ = f(x, 𝛂) generated by the estimation function. We want the predictor f such that the expected loss, the risk functional R(𝛂) = ∫ L(y, f(x, 𝛂)) dF(x, y), is minimized. This is a very common setup in machine learning: we look at training data drawn from an unknown distribution, and from there determine a predictor f such that the expected loss is minimized, giving us better predictive accuracy.

Risk Minimization Pattern Recognition With pattern recognition, the supervisor's output y can only take on two values, y ∈ {0, 1}, and the loss is the indicator of a mistake: L(y, f(x, 𝛂)) = 0 if y = f(x, 𝛂), and 1 otherwise. So the risk functional determines the probability that the supervisor and the estimation function give different answers. The main difference in a pattern recognition approach to risk minimization is that the output y is a boolean value.

Some Simplifications From Here On Training Set {(x1, y1), … , (xl, yl)} → {z1, … , zl} Loss Function L(y, f(x, 𝛂)) → Q(z, 𝛂) From here on I want to declare a few notational simplifications, namely for the training set and the loss function. Keep these in mind, as they are used throughout the rest of the presentation.

Empirical Risk Minimization (ERM) We want to measure the risk over the training set rather than over the entire (unknown) distribution, using the empirical risk Remp(𝛂) = (1/l) Σi Q(zi, 𝛂).
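The empirical risk for the 0-1 loss can be sketched in a few lines of Python (the function names and the toy threshold classifier are illustrative, not from the book):

```python
def zero_one_loss(y, y_hat):
    """0-1 loss: 1 on a misclassification, 0 otherwise."""
    return 0 if y == y_hat else 1

def empirical_risk(predict, training_set):
    """Average loss of `predict` over the training set:
    Remp = (1/l) * sum_i Q(z_i, alpha)."""
    l = len(training_set)
    return sum(zero_one_loss(y, predict(x)) for x, y in training_set) / l

# A trivial threshold classifier on one-dimensional inputs.
predict = lambda x: 1 if x > 0.5 else 0
data = [(0.1, 0), (0.3, 0), (0.6, 1), (0.9, 1), (0.4, 1)]
print(empirical_risk(predict, data))  # the last pair is misclassified -> 0.2
```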

Empirical Risk Minimization (ERM) The empirical risk must converge to the actual risk over the set of loss functions: as l → ∞, Remp(𝛂) → R(𝛂). This is one really important fact, and the limit denotes the required convergence criterion.
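A quick simulation of this convergence, assuming a known toy distribution so the actual risk is computable (all names and numbers here are illustrative): a fixed classifier disagrees with the true labels exactly when 0.4 < x ≤ 0.5, so for x uniform on [0, 1] its actual risk is 0.1, and the empirical risk should approach that value as l grows.

```python
import random

random.seed(1)

# True labeling rule and a fixed, slightly misaligned classifier.
label = lambda x: 1 if x > 0.4 else 0
predict = lambda x: 1 if x > 0.5 else 0
# For x uniform on [0, 1], the actual risk R is P(0.4 < x <= 0.5) = 0.1.

def empirical_risk(l):
    """Fraction of l fresh samples on which the classifier is wrong."""
    xs = [random.random() for _ in range(l)]
    return sum(predict(x) != label(x) for x in xs) / l

for l in (10, 100, 10000):
    print(l, empirical_risk(l))  # tends toward 0.1 as l grows
```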

Empirical Risk Minimization (ERM) In both directions! The convergence must indeed hold in both directions; again, this limit on the minima denotes the required convergence criterion.


Vapnik-Chervonenkis Dimensions Let's just call them VC dimensions. Developed by Alexey Yakovlevich Chervonenkis and Vladimir Vapnik. The VC dimension is a scalar value that measures the capacity of a set of functions.

Vapnik-Chervonenkis Dimensions The VC dimension is a property of a set of functions, and it is responsible for the generalization ability of learning machines. The VC dimension of a set of indicator functions Q(z, 𝛂), 𝛂 ∈ 𝞚, is the maximum number h of vectors z1, …, zh that can be separated into two classes in all 2^h possible ways using functions of the set (i.e., shattered).
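Shattering can be checked by brute force for small cases. This hypothetical sketch takes two-sided threshold functions on the line (predict 1 iff x > t, or iff x < t) and verifies that they realize all 2^h labelings of two points but not of three, so their VC dimension is 2:

```python
def can_shatter(points, classifiers):
    """True if the classifier set realizes all 2^n labelings of `points`."""
    realized = {tuple(c(x) for x in points) for c in classifiers}
    return len(realized) == 2 ** len(points)

# Two-sided threshold functions on the line: predict 1 iff s*(x - t) > 0.
thresholds = [i / 10 for i in range(-30, 31)]
classifiers = [(lambda x, t=t, s=s: int(s * (x - t) > 0))
               for t in thresholds for s in (1, -1)]

print(can_shatter([0.0, 1.0], classifiers))       # True -> VC dim >= 2
print(can_shatter([0.0, 1.0, 2.0], classifiers))  # False: (1, 0, 1) unreachable
```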

Upper Bound For Risk It can be shown that, with probability 1 − η, R(𝛂) ≤ Remp(𝛂) + Φ, where Φ is the confidence interval and h is the VC dimension on which Φ depends.

Upper Bound For Risk ERM only minimizes Remp(𝛂), while Φ, the confidence interval, is fixed by the VC dimension of the set of functions chosen a priori. To avoid overfitting and underfitting, the confidence interval must also be tuned to the problem, which ERM alone cannot do.
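The confidence interval has a closed form; the version below is the commonly quoted bound for indicator functions, holding with probability 1 − η (exact constants vary slightly between statements of the theorem, so treat this as a sketch):

```python
import math

def vc_confidence(l, h, eta=0.05):
    """Confidence interval term Phi in R(a) <= Remp(a) + Phi,
    for sample size l, VC dimension h, and confidence level 1 - eta."""
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

# Phi grows with capacity h and shrinks with more data l.
print(vc_confidence(l=1000, h=10))
print(vc_confidence(l=1000, h=100))
print(vc_confidence(l=100000, h=10))
```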


Structural Risk Minimization (SRM) SRM attempts to minimize the right-hand side of the inequality, Remp(𝛂) + Φ, over both terms simultaneously.

Structural Risk Minimization (SRM) The Remp term depends on a specific function's error, while the Φ term depends on the VC dimension of the space that the function lives in. The VC dimension is the controlling variable.

Structural Risk Minimization (SRM) We define the hypothesis space S to be the set of functions Q(z, 𝛂), 𝛂 ∈ 𝞚. We say that Sk = {Q(z, 𝛂)}, 𝛂 ∈ 𝞚k, is the subset of the hypothesis space with VC dimension hk, such that S1 ⊂ S2 ⊂ … ⊂ Sn and h1 ≤ h2 ≤ … ≤ hn. For a set of observations {z1, …, zl}, SRM chooses the loss function Q(z, 𝛂) minimizing the empirical risk in the subset Sk for which the guaranteed risk is minimal. There are a few takeaway messages for SRM. SRM defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function. As the VC dimension increases, the minima of the empirical risks decrease, but the confidence interval increases. SRM is more general than ERM alone because it uses the subset Sk for which minimizing Remp(𝛂) (the empirical risk) yields the best bound on R(𝛂) (the actual risk).
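The SRM selection rule can be sketched directly. The nested subsets and their empirical risks below are hypothetical numbers, and the confidence-interval formula is the commonly quoted one: for each subset we compute the guaranteed risk Remp + Φ and pick the subset where it is smallest.

```python
import math

def guaranteed_risk(r_emp, l, h, eta=0.05):
    """Empirical risk plus the VC confidence interval (bound on actual risk)."""
    phi = math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)
    return r_emp + phi

# Hypothetical nested structure S1 ⊂ S2 ⊂ S3 ⊂ S4: richer subsets fit the
# training data better (lower Remp) but have larger VC dimension h.
l = 200
subsets = [(0.45, 2), (0.15, 8), (0.05, 32), (0.01, 128)]

best = min(range(len(subsets)),
           key=lambda k: guaranteed_risk(subsets[k][0], l, subsets[k][1]))
print("SRM picks subset", best + 1)  # neither the simplest nor the richest
```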


Support Vector Machines (SVM) Map input vectors x into a high-dimensional feature space z(x); inner products in that space are computed by a kernel function, (zi · z) = K(xi, x). In this feature space the optimal separating hyperplane is constructed.
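To make the "kernel computes an inner product" claim concrete, here is a small check for the degree-2 polynomial kernel on scalars, whose feature map is known in closed form (this specific φ is a standard textbook identity, not notation from the slides):

```python
import math

# For scalars, K(x, y) = (x*y + 1)^2 equals an ordinary inner product
# in the feature space phi(x) = (x^2, sqrt(2)*x, 1).
phi = lambda x: (x * x, math.sqrt(2) * x, 1.0)
K = lambda x, y: (x * y + 1) ** 2
dot = lambda u, v: sum(a * b for a, b in zip(u, v))

x, y = 3.0, -2.0
print(K(x, y), dot(phi(x), phi(y)))  # equal (up to floating point): 25
```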

Support Vector Machines (SVM) Feature space… optimal hyperplane… what am I talking about? I really can't stand this diagram, so I found a little animation that better illustrates the concept of an SVM.

Support Vector Machines (SVM) As you watch, notice there exists no linear solution to this classification problem, so what are we going to do? We are going to use a polynomial kernel to project the data into a higher-dimensional space, where we can find a hyperplane that correctly classifies the data points.

Support Vector Machines (SVM) Let's start off with a really basic one-dimensional example! Can we find a plane that accurately classifies the data points?

Support Vector Machines (SVM) Aw snap, that was easy!

Support Vector Machines (SVM) OK, what about a harder one-dimensional example? Can anyone figure out what we need to do?

Support Vector Machines (SVM) Project the lower-dimensional data into a higher-dimensional space, just like in the animation! All we need to do is use a little math to project our lower-dimensional data into a higher-dimensional space.
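Here is the harder one-dimensional example worked out as a tiny sketch (the points, labels, and quadratic feature map are illustrative): no single threshold on x separates the classes, but after the map x → (x, x²) a horizontal line in the lifted plane does.

```python
# Inner pair is class +1, outer pair is class -1: not threshold-separable on x.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [-1, 1, 1, -1]

# Quadratic feature map lifting the line into a plane.
phi = lambda x: (x, x * x)

# In the lifted space the line x2 = 2.5 separates the classes perfectly.
predict = lambda x: 1 if phi(x)[1] < 2.5 else -1
print(all(predict(x) == y for x, y in zip(xs, ys)))  # True
```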

Support Vector Machines (SVM) There are several ways to implement an SVM, all using different kernel functions: Polynomial Learning Machines (like the animation), Radial Basis Function Machines, and Two-Layer Neural Networks. We already saw a quick polynomial learning example in the animation; let's take a quick look at the two-layer neural network implementation.
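The three implementations differ only in the kernel. Minimal versions of the three standard kernel forms (the parameter values here are illustrative defaults, not the book's):

```python
import math

def poly_kernel(x, y, d=2):
    """Polynomial learning machine: K(x, y) = (x·y + 1)^d."""
    return (sum(a * b for a, b in zip(x, y)) + 1) ** d

def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function machine: K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def sigmoid_kernel(x, y, v=1.0, c=-1.0):
    """Two-layer neural network: K(x, y) = tanh(v * (x·y) + c)."""
    return math.tanh(v * sum(a * b for a, b in zip(x, y)) + c)

x, y = (1.0, 2.0), (2.0, 0.0)
print(poly_kernel(x, y), rbf_kernel(x, y), sigmoid_kernel(x, y))
```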

Neural Networks are computer science models inspired by nature! Simple Neural Network The brain is a massive natural neural network consisting of neurons and synapses, and neural networks can be modeled as graphs. If anyone has taken Josh Bongard's Evolutionary Robotics course, you should be well versed in neural networks; for those of you who haven't, I'll explain quickly so you'll have some context for the following section.

Neurons → Nodes Synapses → Edges Simple Neural Network As a computer scientist you should be familiar with graphs as a data structure. To model a natural neural network like a brain, neurons can be thought of as nodes and synapses as edges. Each synapse has a weight associated with it. When values are passed into an input node, they propagate through the network, being scaled by the synapse weights, until they reach the output nodes.

Two-Layer Neural Network The kernel is a sigmoid function, K(x, xi) = tanh(v (x · xi) + c), implementing the rules of the neurons.

Two-Layer Neural Network Using this technique, the following are found automatically: the architecture of the two-layer machine; the number N of units in the first layer (the number of support vectors); the vectors of the weights wi = xi in the first layer; and the vector of weights for the second layer (the values of 𝛂).
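Once the support vectors, their weights α, and the bias are known, the machine's output is a simple kernel expansion. A hypothetical sketch of the decision rule f(x) = sign(Σi αi yi K(xi, x) + b), with hand-picked values rather than values found by training:

```python
def svm_decision(x, svs, labels, alphas, b, kernel):
    """SVM decision rule: sign of sum_i alpha_i * y_i * K(x_i, x) + b."""
    s = sum(a * y * kernel(sv, x) for sv, y, a in zip(svs, labels, alphas)) + b
    return 1 if s >= 0 else -1

# Toy linear kernel and two hand-picked support vectors.
dot = lambda u, v: sum(p * q for p, q in zip(u, v))
svs = [(1.0, 1.0), (-1.0, -1.0)]
labels = [1, -1]
alphas = [0.5, 0.5]

print(svm_decision((2.0, 2.0), svs, labels, alphas, 0.0, dot))    # 1
print(svm_decision((-2.0, -1.0), svs, labels, alphas, 0.0, dot))  # -1
```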

Conclusion The quality of a learning machine is characterized by three main components: How rich and universal is the set of functions that the LM can approximate? How well can the machine generalize? How fast does the learning process for this machine converge?


Exam Question #1 What is the main difference between polynomial learning machines, radial basis function machines, and neural network learning machines? The kernel function.

Exam Question #2 What is empirical data modeling? Give a summary of the main concept and its components. Empirical data modeling is the induction of a model from observations; the model is then used to deduce responses of an unobserved system.

Exam Question #3 What must Remp(𝛂) do over the set of loss functions? It must converge to the actual risk R(𝛂) over the set of loss functions.

Table of Contents Empirical Data Modeling What is Statistical Learning Theory Model of Supervised Learning Risk Minimization Vapnik-Chervonenkis Dimensions Structural Risk Minimization (SRM) Support Vector Classification Optimal Separating Hyperplane & Quadratic Programming Support Vector Machines (SVM) Exam Questions Q & A Session

End Any questions?