Kernel Classifiers from a Machine Learning Perspective (sec. 2.1-2.2)
Jin-San Yang, Biointelligence Laboratory, School of Computer Science and Engineering, Seoul National University

The Basic Setting

Definition 2.1 (Learning problem): find the unknown (functional) relationship h between objects x and targets y based on a sample z of size m. For a given object x, evaluate the conditional distribution P_{Y|X=x} and decide on the class y with the largest probability, y = argmax_{y∈Y} P_{Y|X=x}(y).

Problems with estimating P_Z from the given sample alone:
- It cannot predict the class of a previously unseen object.
- We need to constrain the set of possible mappings from objects to classes.

Definition 2.2 (Features and feature space): each object x is represented by a feature vector x = φ(x) in a feature space K via a feature map φ: X → K.

Definition 2.4 (Linear function and linear classifier): f_w(x) = ⟨φ(x), w⟩, with the induced classifier h_w(x) = sign(f_w(x)).
- Similar objects are mapped to similar classes via linearity.
- A linear classifier is unaffected by the scale of the weight vector, and hence the weight is assumed to be of unit length.
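To make Definitions 2.2 and 2.4 concrete, here is a minimal sketch (not from the slides; the feature map `phi` and all names are hypothetical) of a linear classifier h_w(x) = sign(⟨φ(x), w⟩) with a unit-length weight vector:

```python
import numpy as np

def phi(x):
    """Hypothetical feature map phi: X -> K (here: degree-2 monomials of a 2-D input)."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

def linear_classifier(w):
    """Return h_w(x) = sign(<phi(x), w>); rescaling w does not change the decision."""
    w = w / np.linalg.norm(w)        # unit-length weight, as assumed in Definition 2.4
    return lambda x: int(np.sign(phi(x) @ w))

h = linear_classifier(np.array([1.0, -2.0, 0.5, 0.5, 0.0]))
print(h((0.3, -0.7)))                # predicted class in {-1, +1}
```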

F is isomorphic to W, so the task of learning reduces to finding the best classifier in the hypothesis space F.

Properties required of a measure of the goodness of a classifier:
- It depends on the unknown P_Z.
- It should make the maximization task computationally easier.
- It is pointwise w.r.t. the object-class pairs (due to the independence of the samplings).

Expected risk: R[f] = E_{XY}[ l(f(X), Y) ], the expected loss of f over draws from P_Z.

Example 2.7 (Cost matrices): in classifying handwritten digits the 0-1 loss function is inappropriate, because there are approximately 10 times more pictures that are not of a "1" than pictures of a "1"; misclassification costs should therefore be weighted by a cost matrix.
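As a hedged illustration of Example 2.7 (my own toy example, not part of the slides), the sketch below contrasts the 0-1 loss with a loss defined through an assumed 2x2 cost matrix that charges ten times more for missing the rare class:

```python
import numpy as np

# 0-1 loss: every mistake costs the same.
def zero_one_loss(y_true, y_pred):
    return float(y_true != y_pred)

# Hypothetical cost matrix for the digit example: rows = true class, cols = prediction.
# Missing a genuine "1" (the rare class) is made ten times as costly as a false alarm.
COST = np.array([[0.0, 1.0],     # true class 0: "not a picture of a 1"
                 [10.0, 0.0]])   # true class 1: "a picture of a 1"

def cost_matrix_loss(y_true, y_pred):
    return COST[y_true, y_pred]

print(zero_one_loss(1, 0), cost_matrix_loss(1, 0))   # 1.0 vs 10.0
```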

Remark 2.8 (Geometrical picture): linear classifiers, parameterized by the weight vector w, are hyperplanes passing through the origin in the feature space K.

(Figure: two panels showing the hypothesis space and the feature space.)

Learning by Risk Minimization

Definition 2.9 (Learning algorithm): a mapping A: ∪_{m≥1} Z^m → F that assigns a classifier to every training sample (where X is the object space, Y the output space, and F the hypothesis space).
- We have no knowledge of the function (or of P_Z) to be optimized.

Definition 2.10 (Generalization error)

Definition 2.11 (Empirical risk): the empirical risk functional over F, or training error of f, is defined as R_emp[f, z] = (1/m) Σ_{i=1}^m l(f(x_i), y_i).
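A minimal sketch of Definition 2.11 (helper names are mine): the empirical risk is simply the average loss of f over the m training pairs, and with the 0-1 loss it coincides with the training error rate.

```python
import numpy as np

def empirical_risk(f, sample, loss):
    """R_emp[f, z] = (1/m) * sum_i loss(f(x_i), y_i) over the training sample z."""
    return np.mean([loss(f(x), y) for x, y in sample])

# Usage with the 0-1 loss: the empirical risk is the training error rate.
zero_one = lambda y_pred, y_true: float(y_pred != y_true)
sample = [(np.array([1.0, 2.0]), 1), (np.array([-1.0, 0.5]), -1)]
f = lambda x: int(np.sign(x @ np.array([0.7, -0.2])))
print(empirical_risk(f, sample, zero_one))
```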

The (Primal) Perceptron Algorithm

When a training example (x_i, y_i) is misclassified by the current linear classifier, i.e. y_i ⟨x_i, w⟩ ≤ 0, the update step amounts to changing w into w + y_i x_i; the misclassified example thus attracts the hyperplane toward it.
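The following is a small sketch of the primal perceptron update described above (variable names and the stopping rule are my own choices): whenever y_i ⟨x_i, w⟩ ≤ 0, the example is added to the weight vector, pulling the hyperplane toward it.

```python
import numpy as np

def primal_perceptron(X, y, max_epochs=100):
    """Primal perceptron: X is an (m, n) array of feature vectors, y in {-1, +1}^m."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (x_i @ w) <= 0:          # x_i is misclassified (or on the hyperplane)
                w = w + y_i * x_i             # update step: w <- w + y_i x_i
                mistakes += 1
        if mistakes == 0:                     # classifier is consistent with the sample
            break
    return w

# Usage on a linearly separable toy sample.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(primal_perceptron(X, y))
```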

Definition 2.12 (Version space): the set of all classifiers consistent with the training sample. Given the training sample z and a hypothesis space F, V(z) = {f ∈ F : ∀ (x_i, y_i) ∈ z, f(x_i) = y_i}.
- For linear classifiers, the set of consistent weight vectors is called the version space: V(z) = {w ∈ W : ∀ (x_i, y_i) ∈ z, y_i ⟨x_i, w⟩ > 0}.
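A short sketch of Definition 2.12 for zero-threshold linear classifiers (the helper below is mine, not from the slides): a weight vector lies in the version space iff it classifies every training example strictly correctly.

```python
import numpy as np

def in_version_space(w, X, y):
    """True iff w is in V(z) = {w : y_i * <x_i, w> > 0 for all i}."""
    return bool(np.all(y * (X @ w) > 0))

X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, -1])
print(in_version_space(np.array([1.0, 1.0]), X, y))   # True: consistent with z
print(in_version_space(np.array([-1.0, 0.0]), X, y))  # False: misclassifies the first example
```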

Regularized Risk Functional

Drawback of minimizing the empirical risk:
- ERM makes the learning task an ill-posed one: a slight variation of the training sample can cause a large deviation in expected risk (overfitting).

Regularization is one way to overcome this problem:
- Introduce a regularizer Ω[f] a priori and minimize the regularized risk functional R_reg[f, z] = R_emp[f, z] + λ Ω[f].
- The regularizer restricts the space of solutions to compact subsets of the (originally overly large) space F. This can be achieved by requiring the set {f ∈ F : Ω[f] ≤ ε} to be compact for each positive number ε.
- If we decrease λ for increasing sample size in the right way, it can be shown that the regularization method leads to the minimizer of the expected risk as m → ∞.
- λ = 0 minimizes only the empirical risk; λ → ∞ minimizes only the regularizer.
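A hedged sketch of the regularized risk functional, taking the squared weight norm as the regularizer Ω (one common choice; the slides do not commit to a particular regularizer, and all names here are mine):

```python
import numpy as np

def regularized_risk(w, X, y, lam, loss):
    """R_reg[w, z] = R_emp[w, z] + lam * Omega[w], with Omega[w] = ||w||^2 here."""
    preds = np.sign(X @ w)
    r_emp = np.mean([loss(p, t) for p, t in zip(preds, y)])
    return r_emp + lam * float(w @ w)

# lam -> 0 recovers pure empirical risk minimization;
# a very large lam is dominated by the regularizer alone.
zero_one = lambda y_pred, y_true: float(y_pred != y_true)
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1, -1])
print(regularized_risk(np.array([1.0, 1.0]), X, y, 0.1, zero_one))
```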

Structural risk minimization (SRM), by Vapnik:
- Define a structuring of the hypothesis space F into nested subsets F_1 ⊆ F_2 ⊆ … of increasing complexity.
- In each subset, empirical risk minimization is performed.
- SRM returns the classifier with the smallest guaranteed risk and can be used with different complexity values for the subsets.

Maximum-a-posteriori (MAP) estimation: view the empirical risk as the negative log-probability of the training sample z given a classifier f.
- The MAP estimate is the mode of the posterior density.
- The choice of regularizer is comparable to the choice of the prior probability in the Bayesian framework and reflects prior knowledge.
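The sketch below (entirely illustrative, with made-up helper names and a toy "ERM" stand-in) shows the SRM scheme in its simplest form: nested subsets indexed by a complexity value, empirical minimization inside each subset, and selection by a guaranteed risk that adds a complexity penalty to the training error.

```python
import numpy as np

def srm_select(fit_in_subset, complexities, X, y, penalty):
    """Structural risk minimization over nested subsets F_1 ⊆ F_2 ⊆ ...

    fit_in_subset(c, X, y) -> (classifier, training_error) performs ERM in the
    subset with complexity value c; penalty(c, m) is an illustrative bound term.
    """
    m = len(y)
    best = None
    for c in complexities:
        f, train_err = fit_in_subset(c, X, y)
        guaranteed = train_err + penalty(c, m)     # guaranteed (bounded) risk
        if best is None or guaranteed < best[0]:
            best = (guaranteed, f, c)
    return best

# Toy usage: "subsets" indexed by how many leading features the classifier may use.
def fit_in_subset(c, X, y):
    w = np.zeros(X.shape[1])
    w[:c] = np.sign(X[:, :c].T @ y)                # crude ERM stand-in for illustration
    err = np.mean(np.sign(X @ w) != y)
    return w, err

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(srm_select(fit_in_subset, [1, 2], X, y, lambda c, m: 0.1 * c / np.sqrt(m)))
```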
