Fundamentals of machine learning 1: types of machine learning; in-sample and out-of-sample errors; version space; VC dimension.

Rise and fall of supervised machine learning techniques (Jensen and Bateman, Bioinformatics, 2011): the predominance of artificial neural networks (ANNs) has diminished.

Why were ANNs so popular? They can be used as black boxes, they make nonlinear modeling easy, and they were believed to have a biological basis. In fact, ANNs do not mimic brain function; they belong to the class of non-parametric, statistical, machine-learning techniques. This class discusses ANNs in the context of other machine-learning techniques.

Unsupervised learning: input only, no labels. Coins in a vending machine cluster by size and weight. How many clusters are there? Would different attributes make the clusters more distinct?
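
As a sketch of what clustering coins by size and weight might look like in code (the coin measurements, the choice of k-means, and the use of scikit-learn are illustrative assumptions, not part of the slide):

```python
# Cluster coins by (diameter_mm, weight_g) with no labels; the measurements
# below are invented for illustration.
import numpy as np
from sklearn.cluster import KMeans

coins = np.array([
    [19.0, 2.5], [19.1, 2.4], [18.9, 2.6],    # small, light coins
    [24.3, 5.7], [24.2, 5.6], [24.4, 5.8],    # medium coins
    [30.6, 11.3], [30.5, 11.4],               # large, heavy coins
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coins)
print(kmeans.labels_)           # which cluster each coin was assigned to
print(kmeans.cluster_centers_)  # one (size, weight) center per cluster
```

Trying different values of n_clusters, or different attributes, is exactly the question the slide raises.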

Supervised learning: every example has a label. Labels have enabled a model based on linear discriminants that lets the vending machine guess a coin's value without recognizing the image on its face.
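
A hedged sketch of the same example once labels are available: a linear discriminant fit to (size, weight) can predict a coin's value directly (the data and the scikit-learn classifier are assumptions for illustration):

```python
# Supervised version of the coin example: each coin's value is its label.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[19.0, 2.5], [19.1, 2.4], [18.9, 2.6],      # 10-cent coins
              [24.3, 5.7], [24.2, 5.6], [24.4, 5.8],      # 25-cent coins
              [30.6, 11.3], [30.5, 11.4], [30.7, 11.2]])  # 1-dollar coins
y = np.array([10, 10, 10, 25, 25, 25, 100, 100, 100])     # value in cents

clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.predict([[24.0, 5.5]]))  # guess the value of an unseen coin -> [25]
```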

Reinforcement learning: there is no single correct output. The data consist of inputs and graded outputs; the goal is to find the relationship between inputs and high-grade outputs.

In-sample error, E_in: how well do the boundaries match the training data? Out-of-sample error, E_out: how often will the system fail if implemented in the field?
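
In symbols (a sketch using common learning-theory notation; the exact notation is an assumption, not something stated on the slide): for a hypothesis h, a training set of N labeled examples (x_n, y_n), and the distribution P that generates new inputs,

E_{\mathrm{in}}(h) = \frac{1}{N}\sum_{n=1}^{N} \mathbf{1}\!\left[h(x_n) \neq y_n\right],
\qquad
E_{\mathrm{out}}(h) = \Pr_{x \sim P}\!\left[h(x) \neq y\right]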

The quality of the data largely determines the success of machine learning. How many data points are there? How much uncertainty do they carry? We assume each datum is labeled correctly; the uncertainty is in the values of the attributes.

Choosing the right model: a good model has small in-sample error and generalizes well. Often a tradeoff between these characteristics is required.

A type of model defines a hypothesis set. A particular member of the set is selected by minimizing some in-sample error. The definition of the error varies with the problem but is usually local, i.e., accumulated from the error at each data point. Example: linear discriminants.
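
As a formula (the symbol g for the selected hypothesis is my notation, not the slide's):

g = \arg\min_{h \in \mathcal{H}} E_{\mathrm{in}}(h)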

Supervised learning is the focus of this course. Example (from Alpaydın, Introduction to Machine Learning, 2e, MIT Press, 2010): a dichotomy based on 2 attributes, using examples of family cars. Family car is a product line, so there is no uncertainty in the label; the issue is how well price and engine power distinguish a family car.

These data suggest that a family car (class C) is uniquely defined by a range of price and engine power. Assume this is true and that the blue rectangle shows the true range of these attributes.

Hypothesis class H: axis-aligned rectangles. The yellow rectangle h is a particular member of H. The in-sample error of h is defined by counting misclassifications.
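
A small sketch of such a rectangle hypothesis and its in-sample error; the price/power numbers and the rectangle corners are invented for illustration:

```python
# h classifies a car as "family car" iff (price, power) lies inside an
# axis-aligned rectangle; E_in is the count of misclassified examples.
def h(price, power, p1, p2, e1, e2):
    return p1 <= price <= p2 and e1 <= power <= e2

def in_sample_error(data, p1, p2, e1, e2):
    # data: list of ((price, power), label) with label True for a family car
    return sum(h(price, power, p1, p2, e1, e2) != label
               for (price, power), label in data)

data = [((15_000, 150), True), ((18_000, 180), True),   # family cars
        ((40_000, 300), False), ((9_000, 90), False)]   # not family cars
print(in_sample_error(data, 12_000, 25_000, 120, 220))  # 0 for this h
```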

Hypothesis class H: axis-aligned rectangles. For the dataset shown, the in-sample error of h (the yellow rectangle, a particular member of H) is zero, but we expect the out-of-sample error to be nonzero: h leaves room for false positives and false negatives.

Should we expect the negative examples to cluster? (Figure: the family-car example.)

S, G, and the version space: S is the most specific hypothesis with zero E_in; G is the most general hypothesis with zero E_in. Any h ∈ H between S and G is consistent (makes no error on the training data), and together these hypotheses make up the version space.
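
For the rectangle class, S can be computed directly as the tightest axis-aligned rectangle around the positive examples; a minimal sketch (the helper name and the data are assumptions):

```python
# S = most specific consistent hypothesis = tightest axis-aligned rectangle
# enclosing the positive examples.  (G would be the largest rectangle that
# still excludes every negative example.)
def most_specific(positives):
    prices = [p for p, _ in positives]
    powers = [e for _, e in positives]
    return (min(prices), max(prices), min(powers), max(powers))

family_cars = [(15_000, 150), (18_000, 180), (22_000, 170)]  # invented data
print(most_specific(family_cars))   # (15000, 22000, 150, 180)
```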

(Figure: the S and G rectangles.) Suppose I have access to a database that associates product line with price p and engine power e via the VIN number. How does the version space change with each of the following new data points: (1) a family car with (p, e) inside S; (2) a family car with (p, e) in the version space; (3) a family car with (p, e) outside G; (4) a non-family car with (p, e) inside S; (5) a non-family car with (p, e) in the version space; (6) a non-family car with (p, e) outside G?

Margin: the distance between the boundary of a hypothesis and the closest instance in a specified class. The S and G hypotheses have narrow margins and are not expected to "generalize" well: even though E_in is zero, we expect E_out to be large. (Figure: the S and G rectangles.)
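
For a linear boundary w·x + b = 0 (the form used by the support vector machines on the next slide; writing the margin this way is an assumption of notation rather than something stated on the slide), the margin of a hypothesis on a dataset is

\text{margin}(h) = \min_{n} \frac{\lvert w^{\top} x_n + b \rvert}{\lVert w \rVert}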

Choose the h in the version space with the largest margin to maximize generalization. The data points that determine S and G are shaded in the figure; they "support" the h with the largest margin. This is the logic behind "support vector machines": a hypothesis with E_in = 0 and a wide margin.
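
A minimal sketch of this idea with scikit-learn's linear SVM on invented 2D data; the only point is that the fitted model exposes the margin-determining ("support") points:

```python
# Largest-margin linear classifier; the points that determine the margin
# are available as support vectors after fitting.
import numpy as np
from sklearn.svm import SVC

X = np.array([[1, 1], [2, 1], [1, 2],     # class -1
              [4, 4], [5, 4], [4, 5]])    # class +1
y = np.array([-1, -1, -1, 1, 1, 1])

svm = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin
print(svm.support_vectors_)                  # the points that "support" h
```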

Vapnik–Chervonenkis dimension, d_VC. Let H be a hypothesis set for two-way classification (a dichotomizer), and let H(X) be the set of dichotomies created by applying H to a dataset X with N examples (points in attribute space); |H(X)| is the number of dichotomies that H can generate on X. N points can be labeled ±1 in 2^N ways, so |H(X)| ≤ 2^N. Let m be the largest number of points in X that H can shatter, i.e., for which every possible labeling is reproduced by some member of H; m ≤ N. Then d_VC(H(X)) = m is the "capacity" of H on X, H "shatters" m points in X, and k = m + 1 is the "break point" of H on X.

VC dimension of 2D linear dichotomizers. A 2D dataset can be represented as dots in the attribute plane; a collinear dataset has n > 2 examples that lie on a single line. A 2D linear dichotomizer can be represented by a line. Any 3 non-collinear data points are linearly separable regardless of their class labels.

Break point of the 2D linear dichotomizer: every set of 4 non-collinear points has at least 2 labelings that are not linearly separable, so k = 4 is the break point of the 2D linear dichotomizer and d_VC = 3. For a d-dimensional linear dichotomizer, d_VC = d + 1.
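
A brute-force check of these two claims for the 2D linear dichotomizer (the helper that tests linear separability by fitting a nearly hard-margin SVM is an illustrative assumption):

```python
# Check: 3 non-collinear points can be shattered by a line, but the XOR
# labeling of 4 points cannot be linearly separated.
from itertools import product
import numpy as np
from sklearn.svm import SVC

def linearly_separable(X, y):
    if len(set(y)) < 2:
        return True                      # one class only: trivially separable
    clf = SVC(kernel="linear", C=1e9).fit(X, y)
    return (clf.predict(X) == y).all()   # zero in-sample error => separable

def shattered(X):
    # True if every one of the 2^N labelings is linearly separable
    return all(linearly_separable(X, np.array(labels))
               for labels in product([-1, 1], repeat=len(X)))

three = np.array([[0, 0], [1, 0], [0, 1]])          # non-collinear
four  = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])  # square: XOR labeling fails
print(shattered(three))  # True  -> d_VC >= 3
print(shattered(four))   # False -> break point k = 4, so d_VC = 3
```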

The VC dimension is conservative: it is based on all possible ways to label the examples and ignores the probability distribution from which the dataset was drawn. In the real world, examples with small differences in their attributes usually belong to the same class. This is the basis of "similarity" classification methods; k-nearest neighbors (KNN) is this type of classifier. (Figure: the family-car example.)
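
A sketch of a similarity-based classifier of this type, using scikit-learn's k-nearest neighbors on the invented family-car data from the earlier examples:

```python
# KNN labels a new point by the majority class of its nearest neighbors,
# relying on the assumption that nearby points share a class.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[15_000, 150], [18_000, 180], [22_000, 170],    # family cars
              [40_000, 300], [9_000, 90],   [55_000, 350]])   # not family cars
y = np.array([1, 1, 1, 0, 0, 0])

knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(knn.predict([[17_000, 160]]))   # nearest examples are family cars -> [1]
```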

Margin: defined as the distance between the boundary and the closest instance. The S and G hypotheses have narrow margins and are not expected to "generalize" well: even though E_in is zero, we expect E_out to be large. Why?

(Figure: the S and G rectangles.)

What is the VC dimension of the hypothesis class defined by the union of all axis-aligned rectangles?

(Figure: the S and G rectangles.) Any new data point that falls in the version space reduces its size: a positive example enlarges S; a negative example shrinks G.