The basic notions related to machine learning

Feature extraction

Feature extraction is a vital step before the actual learning: we have to create the input feature vector. Obviously, the optimal feature set is task-dependent. Ideally, the features are recommended by an expert of the given domain; in practice, however, we (engineers) have to solve it ourselves. A good feature set contains relevant and few features, but in many practical tasks it is not clear which features are relevant. E.g., for influenza: fever is relevant, eye color is irrelevant, age is questionable. When we are unsure, we might simply include the feature. But it is not that simple: including irrelevant features makes learning more difficult for two reasons:
- the curse of dimensionality
- it introduces noise into the data, which many algorithms have difficulty handling

Curse of dimensionality

Too many features make learning more difficult. The number of features equals the dimension of the feature space, and learning becomes harder in higher-dimensional spaces. Example: consider the following simple algorithm.
Learning: we divide the feature space into little hypercubes and count the examples falling into each of them. We label each cube by the class that has the most examples in it.
Classification: a new test case is always labeled by the label of the cube it falls into.
The number of cubes increases exponentially with the number of dimensions! With a fixed number of examples, more and more cubes remain empty, so more and more examples are required to reach a certain density of examples. Real learning algorithms are more clever, but the problem is the same: more features → we need many more training examples.
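The hypercube algorithm above can be sketched in a few lines of NumPy. This is a minimal illustrative implementation (the function and parameter names are our own, not from the slides); the closing comment makes the exponential blow-up concrete.

```python
import numpy as np

def fit_histogram_classifier(X, y, bins_per_dim=4):
    """Divide the feature space into hypercubes and label each cube
    by the majority class of the training examples falling into it."""
    d = X.shape[1]
    # Cube boundaries along each dimension.
    edges = [np.linspace(X[:, j].min(), X[:, j].max(), bins_per_dim + 1)
             for j in range(d)]

    def cube_index(x):
        # Index of the cube the point x falls into (clamped to the last bin).
        return tuple(min(np.searchsorted(edges[j], x[j], side='right') - 1,
                         bins_per_dim - 1)
                     for j in range(d))

    cells = {}
    for xi, yi in zip(X, y):
        cells.setdefault(cube_index(xi), []).append(yi)
    # Majority vote inside each non-empty cube.
    model = {c: max(set(v), key=v.count) for c, v in cells.items()}
    return model, cube_index

# The curse of dimensionality in numbers: the cube count is bins_per_dim ** d,
# so with 4 bins per dimension, 10 features already give 4**10 ≈ one million
# cubes, and a fixed-size training set leaves almost all of them empty.
```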

The effect of irrelevant features

Irrelevant features may make learning algorithms less efficient. Example: the nearest neighbor method.
Learning: we simply store the training examples.
Classification: we label a new example by the label of its nearest neighbor.
With good features, the points of the same class fall close to each other. But if we include a noise-like feature, the points are randomly scattered along the new dimension and the distance relations fall apart. Most learning algorithms are more clever than this, but their operation is also disturbed by an irrelevant (noise-like) feature.
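The nearest-neighbor example can be reproduced with a small synthetic setup (the data and the noise scale below are our own illustrative choices): one informative feature separates the classes cleanly, and appending a high-variance random feature lets noise dominate the Euclidean distances.

```python
import numpy as np

rng = np.random.default_rng(0)

def nearest_neighbor_predict(X_train, y_train, x):
    """Classify x by the label of its nearest training example
    (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argmin(d)]

# One informative feature: class 0 clusters around 0, class 1 around 5.
X = np.concatenate([rng.normal(0, 0.5, (50, 1)),
                    rng.normal(5, 0.5, (50, 1))])
y = np.array([0] * 50 + [1] * 50)

# Append an irrelevant feature with a much larger random spread:
# distances are now dominated by the noise dimension, and nearest-neighbor
# accuracy drops toward chance.
noise = rng.normal(0, 50, (100, 1))
X_noisy = np.hstack([X, noise])
```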

Optimizing the feature space

We usually try to pick the best features manually, but of course there are also automatic methods for this.
Feature selection algorithms: they retain M < N features from the original set of N features.
We can reduce the feature space not only by discarding less relevant features, but also by transforming the feature space.
Feature space transformation methods: the new features are obtained as some combination of the old features. We usually also reduce the number of dimensions at the same time (the new feature space has fewer dimensions than the old one).
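The two routes above can be sketched with plain NumPy. The relevance score and function names below are our own illustrative choices (the slides do not name a specific selection criterion); the transformation shown is a PCA-style projection, a standard instance of combining old features into fewer new ones.

```python
import numpy as np

def select_top_features(X, y, M):
    """Feature selection: keep the M features whose two class means differ
    most (a crude relevance score) and discard the rest."""
    score = np.abs(X[y == 0].mean(axis=0) - X[y == 1].mean(axis=0))
    keep = np.argsort(score)[::-1][:M]
    return X[:, keep], keep

def pca_transform(X, M):
    """Feature space transformation: project onto the M directions of
    largest variance; each new feature is a linear combination of all
    old features, and the dimension drops from N to M."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:M].T
```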

Evaluating the trained model

Based on the training examples, the algorithm constructs a model (hypothesis) of the function (x1,…,xN) → c. This model can guess the value of the function for any (x1,…,xN). Our main goal is not to perfectly learn the labels of the training samples, but to generalize to examples not seen during training. How can we estimate the generalization ability? We leave out a subset of the training examples during training → this is the test set.
Evaluation: we evaluate the model on the test set → estimated class labels, and compare the estimated labels with the correct ones.

Evaluating the trained model 2

How can we quantify the error of the estimate for a regression task? Example: the algorithm outputs a straight line, and the error is shown by the yellow arrows (the distances between the fitted line and the data points). The error indicated by the yellow arrows is summarized as the mean squared error or the root-mean-squared error.
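The two error summaries named above, written out (for arrays of true targets and model estimates):

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared per-sample errors (the yellow arrows, squared)."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, expressed in the units of the target."""
    return np.sqrt(mean_squared_error(y_true, y_pred))
```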

Evaluating the trained model 3

Quantifying the error for a classification task. The simplest solution is the classification error rate: the number of incorrectly classified test samples divided by the number of all test samples. A more detailed error analysis is possible with the help of the confusion matrix: it helps us understand which classes are missed by the algorithm. It also allows defining an error function that weights different mistakes differently; for this we can define a weight matrix over the cells. The "0-1 loss" weights the elements of the main diagonal by 0 and all other cells by 1; this is the same as the classification error rate.
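A minimal sketch of the confusion matrix and the cell-weighted error described above (class labels are assumed to be integers 0…K−1; the function names are ours):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows: true class, columns: predicted class."""
    C = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        C[t, p] += 1
    return C

def weighted_error(C, W):
    """Average loss when cell (i, j) of the confusion matrix costs W[i, j]."""
    return (C * W).sum() / C.sum()

def zero_one_weights(n_classes):
    """The 0-1 loss: 0 on the main diagonal, 1 elsewhere -- with these
    weights, weighted_error reduces exactly to the classification error rate."""
    return 1 - np.eye(n_classes, dtype=int)
```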

Evaluating the trained model 4

We can also weight the different mistakes differently. This is most common when we have only two classes. Example: diagnosing an illness; here the cost matrix has size 2x2.
Error 1: false negative: the patient is ill, but the machine said no.
Error 2: false positive: the machine said yes, but the patient is not ill.
These have different costs! Metrics: see the figure. The metrics preferred by doctors:
Sensitivity: tp/(tp+fn)
Specificity: tn/(tn+fp)
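The two metrics computed directly from the four confusion-matrix cells (positive = "ill"; the example counts in the test are illustrative):

```python
def sensitivity(tp, fn):
    """Fraction of truly ill patients the machine correctly flags
    (true positives over all actual positives)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Fraction of healthy patients the machine correctly clears
    (true negatives over all actual negatives)."""
    return tn / (tn + fp)
```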

"No Free Lunch" theorem

There exists no universal learning algorithm that would outperform all other algorithms on all possible tasks. The optimal learning algorithm is always task-dependent: for every learning algorithm one can find tasks on which it performs well and tasks on which it performs poorly.
Demonstration: the hypotheses of method 1 and method 2 on the same examples. Which hypothesis is correct? It depends on the real distribution.

"No Free Lunch" theorem 2

Put another way: the average performance (over "all possible tasks") of all training algorithms is the same. OK, but then what is the sense in constructing machine learning algorithms? We should concentrate on one type of task rather than trying to solve all tasks with a single algorithm! It makes sense to look for a good algorithm for, e.g., speech recognition or face recognition. You should be very careful when making claims like "algorithm A is better than algorithm B". Machine learning databases exist for the purpose of objectively evaluating machine learning algorithms over a broad range of tasks, e.g. the UCI Machine Learning Repository.

Generalization vs. overfitting

No Free Lunch theorem: we can never be sure that the trained model generalizes correctly to cases not seen during training. But then how should we choose from the possible hypotheses? Experience shows that increasing the complexity of the model increases its flexibility, so it becomes more and more accurate on the training examples. At some point, however, its performance starts dropping on the test examples! This phenomenon is called overfitting: after learning the general properties, the model starts to learn the peculiarities of the given finite training set.

The "Occam's razor" heuristic

Experience shows that the simpler model usually generalizes better, but of course a model that is too simple is not good either. Einstein: "Everything should be made as simple as possible, but no simpler." This is practically the same as the Occam's razor heuristic. The optimal model complexity is different for each task. How can we find the optimum point shown in the figure? A theoretical approach is to formalize the complexity of a hypothesis.
Minimum Description Length principle: we seek the hypothesis h for which K(h,D) = K(h) + K(D|h) is minimal, where
K(h): the complexity of hypothesis h
K(D|h): the complexity of representing the data set D given the hypothesis h
K(): the Kolmogorov complexity

Bias and variance

Another formalism for a model being "too simple" or "too complex", here for the case of regression. Example: we fit the red polynomial to the blue points; green is the optimal solution.
A polynomial of too low a degree cannot fit the examples → bias.
Too high a degree fits the examples but oscillates between them → variance.
Formally: let's select a random training set D with n elements and run the training on it. Repeat this many times, and analyze the expectation of the squared error between the approximation g(x,D) and the original function F(x) at a given point x.
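The expectation described above can be written out explicitly. With $\mathbb{E}_D$ denoting the expectation over the random training sets $D$, the standard decomposition of the squared error at a fixed point $x$ is:

```latex
\mathbb{E}_D\!\left[\big(g(x;D)-F(x)\big)^2\right]
  = \underbrace{\big(\mathbb{E}_D[g(x;D)]-F(x)\big)^2}_{\text{bias}^2}
  \;+\;
  \underbrace{\mathbb{E}_D\!\left[\big(g(x;D)-\mathbb{E}_D[g(x;D)]\big)^2\right]}_{\text{variance}}
```

The first term measures the systematic error of the average estimate, the second measures how much the estimate fluctuates with the choice of D.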

Bias-variance trade-off

Bias: the difference between the average of the estimates and F(x). If it is not 0, the model is biased: it has a tendency to over- or under-estimate F(x). Increasing the model complexity (in our example, the degree of the polynomial) decreases the bias.
Variance: the variance of the estimates (their average squared difference from the average estimate). A large variance is not good: we get quite different estimates depending on the choice of D. Increasing the model complexity increases the variance.
The optimum is somewhere in between.

Finding the optimal complexity - a practical approach

(Almost) all machine learning algorithms have meta-parameters, which allow us to tune the complexity of the model. E.g. in polynomial fitting: the degree of the polynomial. They are called meta-parameters (or hyperparameters) to distinguish them from the real parameters (e.g. the coefficients of the polynomial). Different meta-parameter values result in slightly different models. How can we find the optimal meta-parameters?
We separate a small validation (also called development) set from the training set; overall, our data is thus divided into train-dev-test sets.
We repeat the training on the train set several times with different meta-parameter settings.
We evaluate the resulting models on the dev set (to estimate the red curve of the figure).
Finally, the model that performed best on the dev set is evaluated on the test set.
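The train-dev-test procedure above can be sketched end to end, reusing the polynomial-degree example; the data, noise level, and split sizes below are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 150)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 150)

# Split the data: 100 train, 25 dev (validation), 25 test.
x_tr, x_dev, x_te = x[:100], x[100:125], x[125:]
y_tr, y_dev, y_te = y[:100], y[100:125], y[125:]

def rmse(coeffs, xs, ys):
    return np.sqrt(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

# Train on the train set with several meta-parameter values
# (here: the polynomial degree) ...
models = {d: np.polyfit(x_tr, y_tr, d) for d in range(1, 12)}

# ... pick the meta-parameter that performs best on the dev set ...
best_degree = min(models, key=lambda d: rmse(models[d], x_dev, y_dev))

# ... and evaluate only the selected model on the test set.
test_error = rmse(models[best_degree], x_te, y_te)
```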