Machine Learning CSE 681 CH2 - Supervised Learning.

Learning a Class from Examples
- Let us say we want to learn the class, C, of a “family car.”
- We have a set of examples of cars, and we have a group of people that we survey, to whom we show these cars. The people look at the cars and label them as “family car” or “not family car.”
- A car may have many features. Examples of features: year, make, model, color, seating capacity, price, engine power, type of transmission, miles/gallon, etc.
- Based on expert knowledge or some other technique, we decide that the most important (relevant) features (attributes) separating a family car from other cars are the price and engine power.
- This is called dimensionality reduction. There are many algorithms for dimensionality reduction (Principal components analysis, Factor analysis, Vector quantization, Mutual information, etc.); a rough sketch of one of them follows below.
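
The two features in the family-car example were picked by expert knowledge, but dimensionality reduction can also be done automatically. The snippet below is only a rough sketch, assuming hypothetical car data in a NumPy array, and uses scikit-learn's PCA (one of the algorithms named above) to project the features down to two dimensions.

# A minimal sketch (hypothetical data, not from the slides): reduce many car
# features to two dimensions with PCA.
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical feature matrix: one row per car; columns could be year, price,
# engine power, miles/gallon, ...
X_full = np.array([
    [2010, 15000.0, 1600.0, 32.0],
    [2018, 42000.0, 3500.0, 18.0],
    [2015, 22000.0, 2000.0, 28.0],
])

pca = PCA(n_components=2)             # keep the two directions of largest variance
X_reduced = pca.fit_transform(X_full)
print(X_reduced.shape)                # (3, 2): each car is now described by two numbers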

Class Learning
- Class learning is finding a description (model) that is shared by all positive examples and none of the negative examples (or, in the case of multiple classes, one such description per class).
- After finding a model, we can make a prediction: given a car that we have not seen before, by checking it against the learned model we will be able to say whether it is a family car or not.

Input representation
- Let us denote price as the first input attribute x1 (e.g., in U.S. dollars) and engine power as the second attribute x2 (e.g., engine volume in cubic centimeters). Thus we represent each car by two numeric values, x = [x1, x2], together with its label r (r = 1 if the car is a family car, r = 0 otherwise).
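
A minimal sketch of this representation (the numbers are hypothetical, not taken from the slides): each car becomes a two-dimensional vector x = [x1, x2] together with its label r.

import numpy as np

# One row per car: x1 = price in USD, x2 = engine power in cc (hypothetical values).
X = np.array([
    [27000.0, 1800.0],   # labelled "family car"      -> r = 1
    [12000.0, 1000.0],   # labelled "not family car"  -> r = 0
    [65000.0, 4500.0],   # labelled "not family car"  -> r = 0
    [31000.0, 2000.0],   # labelled "family car"      -> r = 1
])
r = np.array([1, 0, 0, 1])   # labels collected from the people we surveyed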

Training set X
(Figure: the training examples plotted in the price–engine power plane, labeled as positive or negative.)

Learning a Class from Examples
- After further discussions with the expert and analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should each lie in a certain range: (p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2).
- This equation (function) assumes class C to be a rectangle in the price–engine power space; the sketch below writes it as an indicator function.
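
The rectangle assumption can be coded directly as an indicator function. This is only a sketch; the parameter values used below are placeholders, not numbers from the slides.

def in_rectangle(price, power, p1, p2, e1, e2):
    """Return 1 if (p1 <= price <= p2) AND (e1 <= power <= e2), else 0."""
    return int(p1 <= price <= p2 and e1 <= power <= e2)

# Hypothetical ranges for a family car:
print(in_rectangle(27000, 1800, p1=20000, p2=40000, e1=1200, e2=2500))  # 1
print(in_rectangle(65000, 4500, p1=20000, p2=40000, e1=1200, e2=2500))  # 0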

Class C
(Figure: class C shown as an axis-aligned rectangle in the price–engine power plane.)

Hypothesis class H
- Formally, the learner should choose in advance a set of predictors (functions). This set is called the hypothesis class and is denoted by H.
- The hypotheses can be a well-known type of function: hyperplanes (straight lines in 2-D), circles, ellipses, rectangles, donut shapes, etc.
- In our example, we assume that the hypothesis class is a set of rectangles.
- The hypothesis class is also called the inductive bias.
- The learning algorithm then finds the particular hypothesis, h ∈ H, to approximate C as closely as possible.

What's the right hypothesis class H?
(Figure)

Linearly separable data
(Figure)

Not linearly separable
(Figure; Source: CS540)

Quadratically separable
(Figure; Source: CS540)

Hypothesis class H: Function Fitting (Curve Fitting)
(Figure; Source: CS540)

Hypothesis h ∈ H
- Each hypothesis h ∈ H is a function mapping from x to r. After deciding on H, the learner samples a training set X and uses a minimization rule to choose a predictor out of the hypothesis class.
- The learner tries to choose a hypothesis h ∈ H that minimizes the error over the training set. By restricting the learner to choose a predictor from H, we bias it toward a particular set of predictors.
- This preference is often called an inductive bias. Since H is chosen in advance, we refer to it as prior knowledge about the problem.
- Though the expert defines this hypothesis class, the values of the parameters are not known; that is, though we choose H, we do not know which particular h ∈ H is equal, or closest, to the real class C.

Hypothesis for the example
- Depending on the values of p1, p2, e1, and e2, there are many rectangles (h ∈ H); together they make up the hypothesis class H.
- Given the hypothesis class (rectangles in the example), the learning problem is just to find the four parameters that define h.
- The aim is to find the h ∈ H that is as similar as possible to C. Let us say the hypothesis h makes a prediction for an instance x such that h(x) = 1 if h classifies x as a positive example (family car) and h(x) = 0 if h classifies x as a negative example.

Hypothesis class H
(Figure: the error of h — the regions where hypothesis h and class C disagree.)

Empirical Error
- In real life we do not know C(x), so we cannot evaluate how well h(x) matches C(x). What we have is the training set X, which is a small subset of the set of all possible x.
- The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X. The error of hypothesis h given the training set X is

  E(h | X) = (1/N) Σ_{t=1}^{N} 1(h(x^t) ≠ r^t),

  where the indicator 1(a ≠ b) is 1 if a ≠ b and 0 if a = b.
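
A direct translation of this definition into code (a sketch; the rectangle hypothesis and the data X, r are the hypothetical ones from the earlier snippets):

import numpy as np

def h_predict(x, p1, p2, e1, e2):
    # 1 if x = (price, engine power) falls inside the rectangle, else 0
    return int(p1 <= x[0] <= p2 and e1 <= x[1] <= e2)

def empirical_error(h_params, X, r):
    """Proportion of training instances on which the hypothesis disagrees with the label."""
    preds = np.array([h_predict(x, *h_params) for x in X])
    return np.mean(preds != r)        # (1/N) * sum over t of 1(h(x^t) != r^t)

# With the hypothetical X and r defined earlier:
# empirical_error((20000, 40000, 1200, 2500), X, r)  -> 0.0 for a consistent h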

Generalization
- In our example, each rectangle with values (p1, p2, e1, e2) defines one hypothesis, h, from H.
- We need to choose the best one; in other words, we need to find the values of these four parameters, given the training set, so as to include all the positive examples and none of the negative examples.
- We can find infinitely many rectangles that are consistent with the training examples, i.e., for which the error (loss) E is 0.
- However, different hypotheses that are consistent with the training examples may behave differently on future examples that are not part of the training set.
- Generalization is the problem of how well the learned classifier will classify future unseen examples. A good learned hypothesis will make fewer mistakes in the future.

Most Specific Hypothesis S
- The most specific hypothesis, S, is the tightest rectangle that includes all the positive examples and none of the negative examples.
- The most general hypothesis, G, is the largest axis-aligned rectangle we can draw that includes all the positive examples and no negative examples.
- Any hypothesis h ∈ H between S and G is a valid hypothesis with no errors, and is thus consistent with the training set.
- All such hypotheses h make up the version space (see the sketch below for computing S).
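
The most specific hypothesis S is simple to compute: it is the bounding box of the positive examples. The sketch below finds S for the hypothetical data from the earlier snippets; finding the most general hypothesis G would additionally require growing the rectangle until it touches a negative example, which is not shown here.

import numpy as np

def most_specific_hypothesis(X, r):
    """Tightest axis-aligned rectangle (p1, p2, e1, e2) containing all positive examples."""
    positives = X[r == 1]
    p1, e1 = positives.min(axis=0)   # smallest price and engine power among positives
    p2, e2 = positives.max(axis=0)   # largest price and engine power among positives
    return p1, p2, e1, e2

# With the hypothetical X and r defined earlier:
# most_specific_hypothesis(X, r)  -> (27000.0, 31000.0, 1800.0, 2000.0)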

S, G, and the Version Space
(Figure: the most specific hypothesis S and the most general hypothesis G; any h ∈ H between S and G is consistent with the training set, and together these hypotheses make up the version space (Mitchell, 1997).)

How to choose h
- It seems intuitive to choose h halfway between S and G, with the maximum margin.
- For the error function to have a minimum value at the h with the maximum margin, we should use an error (loss) function that not only checks whether an instance is on the correct side of the boundary but also takes into account how far away it is.

Margin
(Figure: choose the h with the largest margin, i.e., the largest distance between the decision boundary and the closest instances.)
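
One simple way to realize "halfway between S and G" is to average the corresponding boundaries of the two rectangles. This is only a sketch of that idea; the S and G values below are hypothetical, and maximizing the margin exactly would require the distances to the nearest instances.

def halfway_hypothesis(S, G):
    """Rectangle whose boundaries lie midway between those of S and G."""
    return tuple((s + g) / 2.0 for s, g in zip(S, G))

# Hypothetical S and G rectangles, each given as (p1, p2, e1, e2):
S = (27000.0, 31000.0, 1800.0, 2000.0)
G = (18000.0, 45000.0, 1100.0, 2800.0)
h = halfway_hypothesis(S, G)
print(h)   # (22500.0, 38000.0, 1450.0, 2400.0)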

Supervised Learning Process
- In the supervised learning problem, our goal is to learn a function h : x → r so that h(x) is a “good” predictor for the corresponding value of r.
- For historical reasons, this function h is called a hypothesis.
- Formally, we are given a training set X = { (x^t, r^t) : t = 1, ..., N }.

Supervised Learning Process
(Diagram: the training set is fed to the learning algorithm, which outputs the hypothesis h; a new input x is then given to h, which outputs the predicted r.)

Supervised Learning
- When r can take on only a small number of discrete values (such as “family car” or “not family car” in our example), we call the problem a classification problem.
- When the target variable r that we’re trying to predict is continuous, such as temperature in weather prediction, we call the learning problem a regression problem (or a prediction problem in some data mining books).
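
To make the distinction concrete, the sketch below fits a classifier for a discrete label and a regressor for a continuous target with scikit-learn. The data are toy, made-up values; the feature columns are assumed to be price in thousands of dollars and engine displacement in liters.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Toy features: price (thousands of USD), engine displacement (liters).
X = np.array([[12.0, 1.0], [27.0, 1.8], [31.0, 2.0], [65.0, 4.5]])

# Classification: r is a discrete class label (1 = family car, 0 = not).
r_class = np.array([0, 1, 1, 0])
clf = LogisticRegression().fit(X, r_class)
print(clf.predict([[29.0, 1.9]]))     # predicted class label for a new car

# Regression: r is a continuous target, e.g. miles per gallon.
r_mpg = np.array([34.0, 28.0, 26.0, 16.0])
reg = LinearRegression().fit(X, r_mpg)
print(reg.predict([[29.0, 1.9]]))     # predicted real-valued output for a new car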