Goodfellow: Chap 5 Machine Learning Basics


Goodfellow: Chap 5 Machine Learning Basics
Dr. Charles Tappert
The information here, although greatly condensed, comes almost entirely from the chapter content.

Chapter Sections
Introduction
1 Learning Algorithms
2 Capacity, Overfitting and Underfitting
3 Hyperparameters and Validation Sets
4 Estimators, Bias and Variance
5 Maximum Likelihood Estimation
6 Bayesian Statistics (not covered in this course)
7 Supervised Learning Algorithms
8 Unsupervised Learning Algorithms (later in course)
9 Stochastic Gradient Descent
10 Building a Machine Learning Algorithm
11 Challenges Motivating Deep Learning

Introduction
Machine learning: a form of applied statistics with extensive use of computers to estimate complicated functions
Two central approaches: frequentist estimators and Bayesian inference
Two categories of machine learning: supervised learning and unsupervised learning
Most machine learning algorithms are based on stochastic gradient descent optimization
Focus on building machine learning algorithms

1 Learning Algorithms
A machine learning algorithm learns from data
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance on tasks in T, as measured by P, improves with experience E

1.1 The Task, T
Machine learning tasks are usually too difficult to solve with fixed programs designed by humans
Machine learning tasks are usually described in terms of how the system should process an example
An example is a collection of features measured from some object or event
Many kinds of tasks: classification, regression, transcription, anomaly detection, imputation of missing values, etc.

1.2 The Performance Measure, P
The performance measure P is usually specific to the task T
For classification tasks, P is usually accuracy: the proportion of examples correctly classified
Here we are interested in the accuracy on data not seen before – a test set

1.3 The Experience, E
Most learning algorithms in this book are allowed to experience an entire dataset
Unsupervised learning algorithms: learn useful structural properties of the dataset
Supervised learning algorithms: experience a dataset containing features, with each example associated with a label or target (for example, a target provided by a teacher)
The line between supervised and unsupervised learning is often blurred

1.4 Example: Linear Regression
Linear regression takes a vector x as input and predicts the value of a scalar y as its output
Task: predict scalar y from vector x
Experience: design matrix X of inputs and vector of targets y
Performance: mean squared error, minimized via the normal equations
The solution of the normal equations is w = (XᵀX)⁻¹ Xᵀ y

Pseudoinverse Method (Intro to Algorithms by Cormen, et al.)
General linear regression: assume polynomial basis functions
Minimizing the mean squared error leads to the normal equations
The solution of the normal equations is w = (XᵀX)⁻¹ Xᵀ y, where (XᵀX)⁻¹ Xᵀ is the Moore–Penrose pseudoinverse of the design matrix X

Pseudoinverse Method (Intro to Algorithms by Cormen, et al.)

Example: Pseudoinverse Method
Simple linear regression homework example: find the best-fit line given a set of points

Example: Pseudoinverse Method
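The worked example on the slide itself is not transcribed, but a minimal NumPy sketch of the pseudoinverse solution for simple linear regression looks like this (the data points below are made up for illustration, not the homework values):

```python
import numpy as np

# Hypothetical data points (x, y) -- illustrative only.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 1.9, 3.2, 3.8])

# Design matrix with a bias column, so the model is y ~ w0 + w1 * x.
X = np.column_stack([np.ones_like(x), x])

# Normal-equations solution w = (X^T X)^-1 X^T y via the Moore-Penrose pseudoinverse.
w = np.linalg.pinv(X) @ y
print("intercept, slope:", w)

# Mean squared error of the fit on the training points.
mse = np.mean((X @ w - y) ** 2)
print("MSE(train):", mse)
```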

Linear Regression – Figure 5.1: a linear regression example showing, on the left, the fitted line y versus x1 and, on the right, MSE(train) as a function of w1, with the optimization of w minimizing the training MSE (Goodfellow 2016)

2 Capacity, Overfitting and Underfitting
The central challenge in machine learning is to perform well on new, previously unseen inputs, not just on the training inputs
We want the generalization error (test error) to be low as well as the training error
This is what separates machine learning from optimization

2 Capacity, Overfitting and Underfitting
How can we affect performance on the test set when we get to observe only the training set?
Statistical learning theory provides answers
We assume the train and test sets are generated by the same distribution, with examples drawn independent and identically distributed (i.i.d.)
Under these assumptions, the expected training error of a randomly selected model equals its expected test error

2 Capacity, Overfitting and Underfitting
Because the model is developed from the training set, the expected test error is usually greater than the expected training error
The factors determining how well a machine learning algorithm performs are:
  Make the training error small
  Make the gap between training and test error small
These two factors correspond to the two central challenges in machine learning: underfitting and overfitting

2 Capacity, Overfitting and Underfitting
Underfitting and overfitting can be controlled somewhat by altering the model's capacity
Capacity = a model's ability to fit a wide variety of functions
Low-capacity models struggle to fit the training data
High-capacity models can overfit the training data
One way to control the capacity of a model is by choosing its hypothesis space
For example, simple linear vs quadratic regression

Underfitting and Overfitting in Polynomial Estimation – Figure 5.2: three polynomial fits of y versus x showing underfitting, appropriate capacity, and overfitting (Goodfellow 2016)
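A small sketch in the spirit of Figure 5.2, fitting polynomials of different degrees to noisy data drawn from a quadratic function; the data-generating function, noise level, and degrees are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from an (assumed) underlying quadratic function.
x = np.linspace(-1, 1, 20)
y = 1.5 * x**2 - 0.5 * x + 0.1 * rng.standard_normal(x.shape)

for degree in (1, 2, 9):                   # underfit, appropriate capacity, overfit
    coeffs = np.polyfit(x, y, degree)      # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.4f}")

# Training error keeps dropping as the degree grows, but the degree-9 fit
# will generalize poorly to new points drawn from the same quadratic.
```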

2 Capacity, Overfitting and Underfitting
Model capacity can be changed in many ways – above we changed the number of parameters
We can also change the family of functions the model can select – called the model's representational capacity
Statistical learning theory provides various means of quantifying model capacity
The most well-known is the Vapnik–Chervonenkis dimension
The older, non-statistical principle of Occam's razor says to prefer the simplest of competing hypotheses

2 Capacity, Overfitting and Underfitting
While simpler functions are more likely to generalize (have a small gap between training and test error), we still want low training error
Training error usually decreases as capacity increases, until it asymptotes to the minimum possible error
Generalization error usually has a U-shaped curve as a function of model capacity

Generalization and Capacity – Figure 5.3: training error and generalization error as a function of capacity, showing the underfitting zone, the overfitting zone, the generalization gap, and the optimal capacity (Goodfellow 2016)

2 Capacity, Overfitting and Underfitting
Non-parametric models can reach arbitrarily high capacity
For example, the capacity of the nearest neighbor algorithm is a function of the training set size
Training and generalization error also vary as the size of the training set varies
Test error decreases with increased training set size

Training Set Size – Figure 5.4: as the number of training examples grows, training and test error (MSE) for a quadratic model and for a model of optimal capacity approach the Bayes error, and the optimal capacity (polynomial degree) increases (Goodfellow 2016)

2.1 No Free Lunch Theorem
Averaged over all possible data-generating distributions, every classification algorithm has the same error rate when classifying previously unobserved points
In other words, no machine learning algorithm is universally better than any other
However, by making good assumptions about the probability distributions we encounter, we can design high-performing learning algorithms
This is the goal of machine learning research

2.2 Regularization
The no free lunch theorem implies that we must design our learning algorithm to perform well on a specific task
Thus, we build preferences into the algorithm
Regularization is a way to do this, and weight decay is one method of regularization

2.2 Regularization
Applying weight decay to linear regression means minimizing J(w) = MSE_train + λ wᵀw
This expresses a preference for smaller weights
λ controls the strength of the preference, with λ = 0 imposing no preference
The next figure shows a 9th-degree polynomial trained with various values of λ; the true function is quadratic

Weight Decay: fits of a 9th-degree polynomial showing underfitting (excessive λ), appropriate weight decay (medium λ), and overfitting (λ → 0) (Goodfellow 2016)
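A minimal sketch of the weight-decay fit described above, using the closed-form ridge solution; the data, polynomial degree, and λ values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: noisy samples of a quadratic, fit with a degree-9 polynomial.
x = np.linspace(-1, 1, 20)
y = x**2 + 0.1 * rng.standard_normal(x.shape)
X = np.vander(x, N=10, increasing=True)       # columns 1, x, x^2, ..., x^9

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ||Xw - y||^2 + lam * w^T w.

    Rescaling lam by 1/m (m = number of examples) gives the
    MSE_train + lam * w^T w form from the slide."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

for lam in (0.0, 1e-2, 10.0):                 # overfitting, appropriate, underfitting
    w = ridge_fit(X, y, lam)
    print(f"lambda={lam:g}  ||w|| = {np.linalg.norm(w):.3f}")
# Larger lambda shrinks the weight norm, trading training error for smoothness.
```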

2.2 Regularization
More generally, we can add a penalty called a regularizer to the cost function
Expressing preferences for one function over another is a more general way of controlling a model's capacity than including or excluding members from the hypothesis space
Regularization is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error

3 Hyperparameters and Validation Sets
Most machine learning algorithms have several settings, called hyperparameters, that control the behavior of the algorithm
Examples: in polynomial regression, the degree of the polynomial is a capacity hyperparameter; in weight decay, λ is a hyperparameter

3 Hyperparameters and Validation Sets
To avoid overfitting when choosing hyperparameters, the training data is often split into two sets, called the training set and the validation set
Typically, 80% of the training data is used for training and 20% for validation
Cross-validation: for small datasets, k-fold cross-validation partitions the data into k non-overlapping subsets
k trials are run, each training on (k-1)/k of the data and testing on the remaining 1/k
Test error is estimated by averaging the test error across the k trials
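A minimal sketch of the k-fold procedure described above, using plain NumPy; the model (ordinary least squares via the pseudoinverse), the synthetic data, and k = 5 are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative regression dataset.
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

def k_fold_mse(X, y, k=5):
    """Average held-out MSE over k non-overlapping folds."""
    indices = rng.permutation(len(y))
    folds = np.array_split(indices, k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        w = np.linalg.pinv(X[train_idx]) @ y[train_idx]                 # train on (k-1)/k
        errors.append(np.mean((X[test_idx] @ w - y[test_idx]) ** 2))    # test on 1/k
    return np.mean(errors)

print("estimated test MSE:", k_fold_mse(X, y))
```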

4 Estimators, Bias and Variance
4.1 Point Estimation: prediction of a quantity of interest – e.g., by maximum likelihood
4.2 Bias
4.3 Variance and Standard Error

4 Estimators, Bias and Variance
4.4 Trading off Bias and Variance to Minimize Mean Squared Error
The most common way to negotiate this trade-off is to use cross-validation
We can also compare the mean squared error (MSE) of the estimates
The relationship between bias and variance is tightly linked to capacity, underfitting, and overfitting
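For reference, the chapter's definitions of bias and standard error, and the bias–variance decomposition of the MSE, can be written as follows (standard notation; θ̂_m denotes a point estimator built from m samples):

```latex
\mathrm{bias}(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m] - \theta,
\qquad
\mathrm{SE}(\hat{\mu}_m) = \sqrt{\operatorname{Var}\!\Big[\tfrac{1}{m}\sum_{i=1}^{m} x^{(i)}\Big]} = \frac{\sigma}{\sqrt{m}},
\qquad
\mathrm{MSE} = \mathbb{E}\!\big[(\hat{\theta}_m - \theta)^2\big]
             = \mathrm{bias}(\hat{\theta}_m)^2 + \operatorname{Var}(\hat{\theta}_m).
```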

Bias and Variance: as capacity increases, bias decreases and variance increases, giving a U-shaped generalization error with an underfitting zone, an overfitting zone, and an optimal capacity in between (Goodfellow 2016)

5 Maximum Likelihood Estimation Covered in Duda textbook

6 Bayesian Statistics Not covered in this course

7 Supervised Learning Algorithms
7.1 Probabilistic Supervised Learning: Bayes decision theory
7.2 Support Vector Machines: also covered by Duda – maybe later in the semester
7.3 Other Simple Supervised Learning Algorithms: the k-nearest neighbor algorithm, and decision trees, which break the input space into regions

Decision Trees – Figure 5.7: a decision tree splits the input space into nested axis-aligned regions, with each node identified by a binary string (e.g., node 1 has children 10 and 11, node 11 has children 110 and 111) (Goodfellow 2016)
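A minimal NumPy sketch of the k-nearest-neighbor classifier listed in Section 7.3 above; the toy dataset and k = 3 are illustrative assumptions:

```python
import numpy as np

def knn_predict(X_train, y_train, X_query, k=3):
    """Classify each query point by majority vote among its k nearest training points."""
    preds = []
    for q in X_query:
        dists = np.linalg.norm(X_train - q, axis=1)   # Euclidean distance to every training point
        nearest = np.argsort(dists)[:k]               # indices of the k closest points
        votes = np.bincount(y_train[nearest])         # count labels among the neighbors
        preds.append(np.argmax(votes))
    return np.array(preds)

# Tiny illustrative dataset: two clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_train, y_train, np.array([[0.15, 0.1], [1.05, 0.95]])))  # -> [0 1]
```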

8 Unsupervised Learning Algorithms
Algorithms with unlabeled examples
8.1 Principal Component Analysis (PCA): covered by Duda – later in the semester
8.2 k-means Clustering

Principal Components Analysis – Figure 5.8: PCA learns a linear projection that aligns the direction of greatest variance in the original space (x1, x2) with the axes of the new space (z1, z2) (Goodfellow 2016)
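The slide above also lists k-means clustering; here is a minimal NumPy sketch of the standard assign-then-update loop (the synthetic data, k = 2, and the iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic 2-D data: two Gaussian blobs.
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)),
               rng.normal(2.0, 0.3, (50, 2))])

def kmeans(X, k=2, n_iters=20):
    """Alternate between assigning points to the nearest centroid and re-estimating centroids.

    Assumes no cluster ends up empty (fine for these well-separated blobs)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iters):
        # Assignment step: index of the nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

centroids, labels = kmeans(X)
print("centroids:\n", centroids)
```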

9 Stochastic Gradient Descent
Nearly all of deep learning is powered by stochastic gradient descent (SGD)
SGD is an extension of the gradient descent algorithm introduced in Section 4.3
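A minimal sketch of a stochastic (minibatch) gradient descent update for linear regression with an MSE cost; the learning rate, batch size, and synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic linear-regression data.
X = rng.standard_normal((1000, 5))
true_w = np.array([0.5, -1.0, 2.0, 0.0, 1.5])
y = X @ true_w + 0.1 * rng.standard_normal(1000)

w = np.zeros(5)                       # model parameters
lr, batch_size = 0.1, 32              # assumed hyperparameters

for step in range(500):
    idx = rng.choice(len(y), size=batch_size, replace=False)    # sample a minibatch
    Xb, yb = X[idx], y[idx]
    grad = 2.0 / batch_size * Xb.T @ (Xb @ w - yb)   # gradient of the minibatch MSE
    w -= lr * grad                                    # SGD parameter update

print("estimated w:", np.round(w, 2))
```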

10 Building a Machine Learning Algorithm
Nearly all deep learning algorithms can be described by a simple recipe: combine a specification of a dataset, a cost function, an optimization procedure, and a model
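To illustrate the recipe with different ingredients from the SGD sketch above, here is a sketch combining a synthetic dataset, a logistic-regression model, a negative log-likelihood cost, and minibatch SGD as the optimizer; all specific values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)

# Ingredient 1 - dataset: synthetic binary classification data.
X = rng.standard_normal((500, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Ingredient 2 - model: logistic regression, p(y=1|x) = sigmoid(x . w).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Ingredient 3 - cost: average negative log-likelihood of the labels.
def nll(w, X, y):
    p = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)   # clip for numerical stability
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Ingredient 4 - optimizer: minibatch SGD on the NLL gradient.
w = np.zeros(2)
for step in range(1000):
    idx = rng.choice(len(y), size=32, replace=False)
    grad = X[idx].T @ (sigmoid(X[idx] @ w) - y[idx]) / len(idx)
    w -= 0.5 * grad

print("final NLL:", round(nll(w, X, y), 3), "weights:", np.round(w, 2))
```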

11 Challenges Motivating Deep Learning
The simple machine learning algorithms described here work well on many important problems
But they don't solve the central problems of AI, such as recognizing objects
Deep learning was motivated by this failure

11.1 Curse of Dimensionality
Machine learning problems get difficult when the number of dimensions in the data is high
This phenomenon is known as the curse of dimensionality
The next figure shows 10 regions of interest in 1D, 100 regions in 2D, and 1000 regions in 3D

Curse of Dimensionality Figure 5.9 (Goodfellow 2016)
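The exponential growth illustrated in Figure 5.9 is just bins raised to the number of dimensions; a tiny check (the choice of 10 bins per axis matches the slide):

```python
# Number of equal-size cells needed to cover a d-dimensional grid with 10 bins per axis.
for d in (1, 2, 3):
    print(f"{d}D: {10 ** d} regions")   # 10, 100, 1000
```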

11.2 Local Constancy and Smoothness Regularization
To generalize well, machine learning algorithms need to be guided by prior beliefs
One of the most widely used priors is the smoothness prior, or local constancy prior: the learned function should change little within a small region
To distinguish O(k) regions in input space, these priors alone require O(k) examples
For the nearest neighbor algorithm, each training example defines at most one region

Nearest Neighbor Figure 5.10 (Goodfellow 2016)

11.3 Manifold Learning
A manifold is a connected region
Manifold learning algorithms reduce the space of interest: the variation across n-dimensional Euclidean space is reduced by assuming that most of the space consists of invalid inputs
Probability distributions over images, text strings, and sounds that occur in real life are highly concentrated
Figure 5.11 shows data near a manifold, Figure 5.12 shows uniformly sampled (static-noise) images, and Figure 5.13 shows the manifold structure of a dataset of faces

Manifold Learning – Figure 5.11: data sampled in two dimensions but concentrated near a one-dimensional manifold (Goodfellow 2016)

Uniformly Sampled Images Figure 5.12 (Goodfellow 2016)

QMUL Dataset Figure 5.13 (Goodfellow 2016)