Outline 1-D regression Least-squares Regression Non-iterative Least-squares Regression Basis Functions Overfitting Validation 2.

Presentation transcript:

Outline 1-D regression Least-squares Regression Non-iterative Least-squares Regression Basis Functions Overfitting Validation 2

A simple example: 1-D regression 3

Example: Boston Housing data Concerns housing values in suburbs of Boston. Features: CRIM, the per capita crime rate by town; RM, the average number of rooms per dwelling. Use these to predict house prices in other neighborhoods. 4

Represent the Data 5

Noise A simple model typically does not fit the data exactly; the lack of fit can be considered noise. Sources of noise: imprecision in data attributes (input noise); errors in data targets (mislabeling); additional attributes, not captured by the data attributes, that affect target values (latent variables); a model that is too simple to account for the data targets. 6

Least-squares Regression 7

Optimizing the Objective 8

Optimizing Across Training Set A system of 2 linear equations 9
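The system of two linear equations mentioned on this slide can be written out as follows. The slide's own formulas did not survive in the transcript, so the notation is an assumption: a straight-line model y(x) = w_0 + w_1 x, targets t_n, and a sum-of-squares error over N training cases.

```latex
E(w_0, w_1) = \sum_{n=1}^{N} \bigl(t_n - w_0 - w_1 x_n\bigr)^2

\frac{\partial E}{\partial w_0} = -2 \sum_{n=1}^{N} \bigl(t_n - w_0 - w_1 x_n\bigr) = 0
\quad\Longrightarrow\quad
N w_0 + w_1 \sum_{n} x_n = \sum_{n} t_n

\frac{\partial E}{\partial w_1} = -2 \sum_{n=1}^{N} \bigl(t_n - w_0 - w_1 x_n\bigr)\, x_n = 0
\quad\Longrightarrow\quad
w_0 \sum_{n} x_n + w_1 \sum_{n} x_n^2 = \sum_{n} x_n t_n
```

Both equations are linear in w_0 and w_1, so they can be solved directly without iteration.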

Non-iterative Least-squares Regression An alternative optimization approach is non-iterative: take derivatives, set to zero, and solve for parameters. 10
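A minimal sketch of the non-iterative approach for a straight line, assuming a model y = w0 + w1 * x and a squared-error objective. The function name `fit_line` and the toy data are illustrative and not taken from the slides.

```python
import numpy as np

def fit_line(x, t):
    """Closed-form least-squares fit of y = w0 + w1 * x."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    x_mean, t_mean = x.mean(), t.mean()
    # Slope: covariance of x and t divided by the variance of x.
    w1 = np.sum((x - x_mean) * (t - t_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    w0 = t_mean - w1 * x_mean
    return w0, w1

# Toy, noise-free data generated from a known line (assumption for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
t = 1.0 + 2.0 * x
w0, w1 = fit_line(x, t)
```

These two expressions are what you get by setting both partial derivatives of the squared error to zero and solving the resulting pair of linear equations.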

Multi-dimensional inputs 11
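With several input features the same least-squares problem can be solved in matrix form. The sketch below assumes a model with one weight per feature plus a bias, and uses synthetic two-feature data as a stand-in; it does not load the real Boston Housing data, and `fit_linear` is a hypothetical helper name.

```python
import numpy as np

def fit_linear(X, t):
    """Least-squares weights for y = w[0] + X @ w[1:]."""
    X = np.asarray(X, dtype=float)
    # Prepend a column of ones so the bias is learned like any other weight.
    A = np.hstack([np.ones((X.shape[0], 1)), X])
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

# Synthetic stand-in data with two input features (assumption, noise-free).
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(100, 2))
true_w = np.array([3.0, -0.5, 2.0])
t = true_w[0] + X @ true_w[1:]
w = fit_linear(X, t)
```

`np.linalg.lstsq` is used here rather than inverting a matrix explicitly, which tends to behave better when features are nearly collinear.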

Linear Regression It is mathematically easy to fit linear models to data. There are many ways to make linear models more powerful while retaining their nice mathematical properties: 1. By using non-linear, non-adaptive basis functions, we can get generalized linear models that learn non-linear mappings from input to output but are linear in their parameters – only the linear part of the model learns. 12

Linear Regression 2. By using kernel methods we can handle expansions of the raw data that use a huge number of non-linear, non-adaptive basis functions. 3. By using large margin kernel methods we can avoid overfitting even when we use huge numbers of basis functions. But linear methods will not solve most AI problems. They have fundamental limitations. 13

Some types of basis functions in 1-D: sigmoids, Gaussians, polynomials. Sigmoid and Gaussian basis functions can also be used in multilayer neural networks, but neural networks learn the parameters of the basis functions. This is much more powerful but also much harder and much messier. 14

Two types of linear model that are equivalent with respect to learning The first model has the same number of adaptive coefficients as the dimensionality of the data +1. The second model has the same number of adaptive coefficients as the number of basis functions +1. Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick). So we’ll just focus on the first model. 15
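The equivalence described here can be sketched in a few lines: once the inputs are replaced by basis-function outputs, the fit is ordinary linear least squares. The example assumes fixed (non-adaptive) Gaussian basis functions; the centers, the width, and the sine-shaped toy data are arbitrary choices for illustration.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """Map 1-D inputs to fixed Gaussian bumps, plus a constant bias column."""
    x = np.asarray(x, dtype=float)[:, None]
    phi = np.exp(-0.5 * ((x - centers[None, :]) / width) ** 2)
    return np.hstack([np.ones((x.shape[0], 1)), phi])

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)

centers = np.linspace(0.0, 1.0, 9)       # fixed in advance, not learned
Phi = gaussian_basis(x, centers, width=0.1)

# From here on this is the same problem as fitting a plain linear model:
# one adaptive weight per basis function, plus one for the bias.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w
mse = float(np.mean((y - t) ** 2))
```

Only `w` is learned; the basis functions themselves stay fixed, which is what keeps the model linear in its parameters.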

Fitting a polynomial Now we use one of these basis functions: an Mth-order polynomial function. We can use the same approaches to optimize the values of the weights on each coefficient: analytic and iterative. 16
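A minimal sketch of the analytic route for a polynomial, assuming the basis functions are the powers 1, x, ..., x^M. The helper name `fit_polynomial` and the cubic toy data are illustrative.

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares weights w[0..M] for y = sum_j w[j] * x**j."""
    # Each column of Phi is one basis function evaluated on all inputs.
    Phi = np.vander(np.asarray(x, dtype=float), M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w

# Toy data from a known cubic (assumption for illustration, no noise).
x = np.linspace(-1.0, 1.0, 20)
t = 1.0 - 2.0 * x + 0.5 * x ** 3
w = fit_polynomial(x, t, M=3)
```

The model is non-linear in x but linear in `w`, so the same closed-form solver used for a straight line applies unchanged.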

Minimizing squared error 17

Minimizing squared error 18

Online Least mean squares: An alternative approach for really big datasets This is called “online” learning. It can be more efficient if the dataset is very redundant, and it is simple to implement in hardware. It is also called stochastic gradient descent if the training cases are picked at random. Care must be taken with the learning rate to prevent divergent oscillations, and the rate must decrease at the end to get a good fit. 19
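A hedged sketch of the online update for a straight-line model: weights are adjusted after every training case, cases are visited in random order, and the learning rate shrinks over time. The particular decay schedule `lr0 / (1 + decay * step)` and all parameter values are assumptions chosen for this toy problem, not something the slides specify.

```python
import numpy as np

def lms_fit(x, t, epochs=200, lr0=0.1, decay=0.01, seed=0):
    """Online least mean squares for y = w[0] + w[1] * x."""
    rng = np.random.default_rng(seed)
    w = np.zeros(2)
    step = 0
    for _ in range(epochs):
        # Visiting cases in random order makes this stochastic gradient descent.
        for n in rng.permutation(len(x)):
            phi = np.array([1.0, x[n]])
            error = t[n] - w @ phi
            lr = lr0 / (1.0 + decay * step)   # learning rate decreases over time
            w += lr * error * phi             # move weights to reduce this case's error
            step += 1
    return w

# Noise-free toy data from a known line (assumption for illustration).
x = np.linspace(-1.0, 1.0, 40)
t = 0.5 + 2.0 * x
w = lms_fit(x, t)
```

If `lr0` is set too large the updates overshoot and the weights oscillate or diverge, which is the failure mode the slide warns about.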

1-D regression illustrates key concepts Data fits: is a linear model best (model selection)? The simplest models do not capture all the important variations (signal) in the data: they underfit. A more complex model may overfit the training data (fit not only the signal but also the noise in the data), especially if there is not enough data to constrain the model. One method of assessing fit is to test generalization, the model’s ability to predict held-out data. Optimization is essential: stochastic and batch iterative approaches; analytic when available. 20

Some fits to the data: which is best? 21

Overfitting 22

Overfitting Causes of over-fitting: model complexity (e.g., a model with a large number of parameters, or degrees of freedom); a low number of training data (small data size compared to the complexity of the model). 23

Model complexity Polynomials with larger M become increasingly tuned to the random noise on the target values. 24
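One way to see this effect is to fit polynomials of increasing order to a small noisy sample and compare the error on the training points with the error on fresh points. This is a sketch under assumed settings (a sine-shaped target, 10 training points, Gaussian noise); the exact numbers depend on the random seed.

```python
import numpy as np

def poly_design(x, M):
    """Design matrix with columns 1, x, ..., x**M."""
    return np.vander(x, M + 1, increasing=True)

def rms(Phi, w, t):
    """Root-mean-square error of predictions Phi @ w against targets t."""
    return float(np.sqrt(np.mean((Phi @ w - t) ** 2)))

rng = np.random.default_rng(0)

def make_data(n):
    # Assumed toy target: a sine curve plus Gaussian noise.
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
    return x, t

x_train, t_train = make_data(10)     # small training set
x_test, t_test = make_data(100)      # fresh data from the same process

train_rms, test_rms = [], []
for M in range(10):
    w, *_ = np.linalg.lstsq(poly_design(x_train, M), t_train, rcond=None)
    train_rms.append(rms(poly_design(x_train, M), w, t_train))
    test_rms.append(rms(poly_design(x_test, M), w, t_test))

for M, (tr, te) in enumerate(zip(train_rms, test_rms)):
    print(f"M={M}  train RMS={tr:.3f}  held-out RMS={te:.3f}")
```

The training error can only go down as M grows, because each model contains the previous one as a special case; the held-out column is the one that reveals whether the extra flexibility is fitting signal or noise.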

Number of training data & overfitting The over-fitting problem becomes less severe as the size of the training data increases. 25

Avoiding Over-fitting Determine a suitable value for model complexity. Simple method: hold some data out of the training set (the validation set) and use the held-out data to optimize model complexity. Regularization: express an explicit preference for simple models by penalizing model complexity in the objective function. 26
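As one concrete instance of regularization, the sketch below adds a squared-weight penalty to the least-squares objective (often called ridge regression or weight decay). The slides do not say which penalty they use, so this choice is an assumption; for simplicity the bias weight is penalized along with the others.

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Minimise ||Phi @ w - t||^2 + lam * ||w||^2 in closed form."""
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ t)

# Assumed toy data: noisy sine curve, fitted with a 9th-order polynomial.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 15)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)
Phi = np.vander(x, 10, increasing=True)

w_weak = fit_ridge(Phi, t, lam=1e-3)    # light penalty
w_strong = fit_ridge(Phi, t, lam=1.0)   # heavy penalty, smaller weights
```

The penalty strength `lam` is itself a free parameter, and it is typically chosen on held-out data in the same way as the polynomial order.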

Validation Almost invariably, all the pattern recognition techniques that we have introduced have one or more free parameters. Two issues arise at this point. Model selection: how do we select the “optimal” parameter(s) for a given classification problem? Validation: once we have chosen a model, how do we estimate its true error rate? The true error rate is the classifier’s error rate when tested on the ENTIRE POPULATION. 27

Validation In real applications only a finite set of examples is available. This number is usually smaller than we would hope for, because data collection is a very expensive process. One may be tempted to use the entire training data to select the “optimal” classifier and then estimate the error rate on that same data. This naïve approach has two fundamental problems: the final model will normally overfit the training data, and the error rate estimate will be overly optimistic (lower than the true error rate). In fact, it is not uncommon to achieve 100% correct classification on training data. 28

Validation We must make the best use of our (limited) data for training, model selection, and performance estimation. Methods: holdout, cross-validation. 29

The holdout method Split the dataset into two or three groups. Training set: used to train the classifier. Validation set: used to select the “optimal” parameter(s) for a given classification problem. Test set: used to estimate the error rate of the trained classifier. 30
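A minimal sketch of the three-way split, assuming a random shuffle and a 60/20/20 division. The fractions and the helper name `holdout_split` are illustrative choices, not values given in the slides.

```python
import numpy as np

def holdout_split(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Return disjoint index arrays for training, validation and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)              # shuffle so the split is random
    n_val = int(round(val_frac * n))
    n_test = int(round(test_frac * n))
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = holdout_split(100)
```

Returning indices rather than copies of the data makes it easy to apply the same split to inputs and targets.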

Simple hold-out: model selection 31

Training, validation, test sets 32

Validation The holdout method has two basic drawbacks. First, in problems where we have a sparse dataset, we may not be able to afford the “luxury” of setting aside a portion of the dataset for testing. Second, since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an “unfortunate” split. These limitations of the holdout can be overcome, at the expense of higher computational cost, with resampling methods. Cross-validation: random subsampling, K-fold cross-validation, leave-one-out cross-validation. Bootstrap. 33

Random subsampling 34

Cross-validation 35

Leave-One-Out Cross Validation 36

Cross Validation In practice, the choice of K depends on the size of the dataset. For large datasets, even 3-fold cross-validation will be quite accurate. For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible. A common choice is K=10. 37
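K-fold cross-validation can be sketched as follows, assuming a polynomial model scored by mean squared error. Setting K equal to the number of examples gives leave-one-out. The helper names, the toy data, and the choice M=3 are assumptions for illustration.

```python
import numpy as np

def kfold_indices(n, K, seed=0):
    """Shuffle 0..n-1 and cut it into K nearly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n), K)

def cv_error(x, t, M, K):
    """Average held-out squared error of an order-M polynomial over K folds."""
    folds = kfold_indices(len(x), K)
    errors = []
    for k in range(K):
        test_idx = folds[k]
        # Train on every fold except the k-th.
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        Phi_train = np.vander(x[train_idx], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_train, t[train_idx], rcond=None)
        Phi_test = np.vander(x[test_idx], M + 1, increasing=True)
        errors.append(np.mean((Phi_test @ w - t[test_idx]) ** 2))
    return float(np.mean(errors))

# Assumed toy data: noisy sine curve.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 50)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(50)

err_10fold = cv_error(x, t, M=3, K=10)
err_loo = cv_error(x, t, M=3, K=len(x))   # leave-one-out: one example per fold
```

Every example is used for testing exactly once and for training K-1 times, at the cost of fitting the model K times.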

Bootstrap The bootstrap is a resampling technique with replacement. From a dataset with N examples, randomly select (with replacement) N examples and use this set for training. The remaining examples that were not selected for training are used for testing; their number is likely to change from fold to fold. Repeat this process for a specified number of folds (K). 38
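A sketch of the bootstrap procedure described here, assuming N draws with replacement from N examples and testing on whatever was never drawn. The polynomial model, the toy data, and the helper names are illustrative assumptions.

```python
import numpy as np

def bootstrap_split(n, rng):
    """Draw n training indices with replacement; test on the rest."""
    train_idx = rng.integers(0, n, size=n)                     # duplicates allowed
    test_idx = np.setdiff1d(np.arange(n), train_idx)           # never selected
    return train_idx, test_idx

def bootstrap_error(x, t, M, K, seed=0):
    """Average held-out squared error of an order-M polynomial over K resamples."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(K):
        train_idx, test_idx = bootstrap_split(len(x), rng)
        if test_idx.size == 0:        # extremely unlikely, but nothing to test on
            continue
        Phi_train = np.vander(x[train_idx], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(Phi_train, t[train_idx], rcond=None)
        Phi_test = np.vander(x[test_idx], M + 1, increasing=True)
        errors.append(np.mean((Phi_test @ w - t[test_idx]) ** 2))
    return float(np.mean(errors))

# Assumed toy data: noisy sine curve.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, 40)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(40)
err = bootstrap_error(x, t, M=3, K=50)
```

Because sampling is with replacement, the size of the test set differs from one resample to the next, which matches the remark in the slide.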

Bootstrap 39

Procedure outline in training 1. Divide the available data into training, validation and test sets 2. Select architecture and training parameters 3. Train the model using the training set 4. Evaluate the model using the validation set 5. Repeat steps 2 through 4 using different architectures and training parameters 6. Select the best model and train it using data from the training and validation sets 7. Assess this final model using the test set 40
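The seven steps above can be sketched end to end. In this hedged example the "architecture" is simply the polynomial order, the data are synthetic, and the split sizes are arbitrary; all names are illustrative.

```python
import numpy as np

def design(x, M):
    """Design matrix with columns 1, x, ..., x**M."""
    return np.vander(x, M + 1, increasing=True)

def fit(x, t, M):
    """Least-squares weights of an order-M polynomial."""
    w, *_ = np.linalg.lstsq(design(x, M), t, rcond=None)
    return w

def rmse(x, t, w):
    """Root-mean-square error; the polynomial order is inferred from len(w)."""
    return float(np.sqrt(np.mean((design(x, len(w) - 1) @ w - t) ** 2)))

# Assumed toy data: noisy sine curve.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 90)
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(90)

# Step 1: divide the data into training, validation and test sets.
idx = rng.permutation(90)
tr, va, te = idx[:50], idx[50:70], idx[70:]

# Steps 2-5: train each candidate on the training set, score it on validation.
candidates = range(0, 8)
val_scores = {M: rmse(x[va], t[va], fit(x[tr], t[tr], M)) for M in candidates}

# Step 6: pick the best candidate and retrain on training + validation data.
best_M = min(val_scores, key=val_scores.get)
trva = np.concatenate([tr, va])
w_final = fit(x[trva], t[trva], best_M)

# Step 7: assess the final model once, on the untouched test set.
test_rmse = rmse(x[te], t[te], w_final)
```

The test set is consulted only in the last line; using it earlier to choose `best_M` would make the reported error optimistic.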
