

2 Outline: 1-D regression; least-squares regression; non-iterative least-squares regression; basis functions; overfitting; validation

3 A simple example: 1-D regression

4 Example: Boston Housing data. Concerns housing values in suburbs of Boston. Features include CRIM (per capita crime rate by town) and RM (average number of rooms per dwelling). Use these to predict house prices in other neighborhoods.
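The presentation's plots are not reproduced here, but the 1-D regression idea can be sketched in code. A minimal NumPy example, using synthetic data as a stand-in for the RM feature (all numbers below are invented for illustration, not taken from the Boston dataset):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the "RM" feature (average rooms per dwelling):
# price rises roughly linearly with room count, plus noise.
rooms = rng.uniform(4.0, 8.0, size=100)
price = 9.0 * rooms - 30.0 + rng.normal(0.0, 2.0, size=100)

# Fit price = w * rooms + b by least squares (degree-1 polyfit).
w, b = np.polyfit(rooms, price, deg=1)

# Predict the price of a 6-room dwelling in a new neighborhood.
predicted = w * 6.0 + b
```

The fitted slope and intercept should land near the generating values (9 and -30), with the gap shrinking as the noise shrinks or the sample grows.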

5 Represent the Data

6 Noise. A simple model typically does not exactly fit the data; the lack of fit can be considered noise. Sources of noise: imprecision in the data attributes (input noise); errors in the data targets (mislabeling); additional attributes, not captured by the data attributes, that affect the target values (latent variables); and a model too simple to account for the data targets.

7 Least-squares Regression

8 Optimizing the Objective

9 Optimizing across the training set: a system of 2 linear equations
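The slide's equations are not reproduced in this transcript, but the standard system for a 1-D linear fit can be sketched directly (synthetic data, invented for illustration): setting the derivatives of the summed squared error to zero gives two linear equations in the slope w and intercept b.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)
t = 3.0 * x + 0.5 + rng.normal(0.0, 0.1, size=50)

n = len(x)
# Zeroing the derivatives of sum_i (w*x_i + b - t_i)^2 gives:
#   w * sum(x^2) + b * sum(x) = sum(x * t)
#   w * sum(x)   + b * n      = sum(t)
A = np.array([[np.sum(x * x), np.sum(x)],
              [np.sum(x),     n        ]])
rhs = np.array([np.sum(x * t), np.sum(t)])
w, b = np.linalg.solve(A, rhs)
```

Solving the 2-by-2 system recovers the generating slope (3.0) and intercept (0.5) up to noise.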

10 Non-iterative Least-squares Regression. An alternative optimization approach is non-iterative: take derivatives, set them to zero, and solve for the parameters.

11 Multi-dimensional inputs
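Assuming the usual design-matrix formulation (the slide's own equations are not in this transcript), the non-iterative take-derivatives-and-solve recipe extends to multi-dimensional inputs via the normal equations; a sketch on synthetic two-feature data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))          # two input features per example
true_w = np.array([1.5, -2.0])
t = X @ true_w + 0.75 + rng.normal(0.0, 0.1, size=200)

# Append a column of ones so the bias is just another weight.
Xb = np.hstack([X, np.ones((200, 1))])

# Non-iterative solution of the normal equations; lstsq is the numerically
# stable way to compute w = (X^T X)^{-1} X^T t.
w, *_ = np.linalg.lstsq(Xb, t, rcond=None)
```

The last entry of `w` is the recovered bias; the others are the feature weights.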

12 Linear Regression. It is mathematically easy to fit linear models to data. There are many ways to make linear models more powerful while retaining their nice mathematical properties: 1. By using non-linear, non-adaptive basis functions, we can get generalized linear models that learn non-linear mappings from input to output but are linear in their parameters; only the linear part of the model learns.

13 Linear Regression. 2. By using kernel methods, we can handle expansions of the raw data that use a huge number of non-linear, non-adaptive basis functions. 3. By using large-margin kernel methods, we can avoid overfitting even when we use huge numbers of basis functions. But linear methods will not solve most AI problems; they have fundamental limitations.

14 Some types of basis functions in 1-D: sigmoids, Gaussians, and polynomials. Sigmoid and Gaussian basis functions can also be used in multilayer neural networks, but neural networks learn the parameters of the basis functions. This is much more powerful but also much harder and much messier.
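The three basis-function families named on the slide can be sketched as follows (the widths, centers, and slopes below are arbitrary choices, not values from the presentation):

```python
import numpy as np

def polynomial_basis(x, M):
    """Columns 1, x, x^2, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

def gaussian_basis(x, centers, width=0.2):
    """One localized bump per center."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

def sigmoid_basis(x, centers, slope=10.0):
    """One smooth step per center."""
    return 1.0 / (1.0 + np.exp(-slope * (x[:, None] - centers[None, :])))

x = np.linspace(0.0, 1.0, 5)
centers = np.array([0.25, 0.5, 0.75])
Phi = gaussian_basis(x, centers)   # design matrix: one row per input point
```

Each function maps n input points to an n-by-(number of basis functions) design matrix, which then feeds the same linear least-squares machinery as before.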

15 Two types of linear model that are equivalent with respect to learning. The first model has the same number of adaptive coefficients as the dimensionality of the data + 1. The second model has the same number of adaptive coefficients as the number of basis functions + 1. Once we have replaced the data by the outputs of the basis functions, fitting the second model is exactly the same problem as fitting the first model (unless we use the kernel trick), so we'll just focus on the first model.

16 Fitting a polynomial. Now we use one of these basis functions: an Mth-order polynomial. We can use the same approaches, analytic and iterative, to optimize the coefficient values.
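A minimal sketch of the analytic approach for the polynomial case. The data and the choice M = 3 are invented for illustration; the point is that the polynomial is linear in its coefficients, so the fit is the same least-squares problem as before:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=30)

M = 3                                       # polynomial order
Phi = np.vander(x, M + 1, increasing=True)  # columns 1, x, ..., x^M

# Analytic fit: ordinary linear least squares on the expanded inputs.
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
pred = Phi @ w
rmse = np.sqrt(np.mean((pred - t) ** 2))
```

The same `Phi` could instead be fed to an iterative (gradient-based) optimizer; both approaches minimize the same squared error.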

17 Minimizing squared error

18 Minimizing squared error

19 Online least mean squares: an alternative approach for really big datasets. This is called "online" learning; it can be more efficient if the dataset is very redundant, and it is simple to implement in hardware. It is also called stochastic gradient descent if the training cases are picked at random. Care must be taken with the learning rate to prevent divergent oscillations, and the rate must decrease at the end to get a good fit.
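A minimal sketch of the online/LMS update described above, on synthetic 1-D data. The learning-rate schedule (start at 0.1, decay by 0.8 per epoch) is an arbitrary choice, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-1.0, 1.0, size=1000)
t = 2.0 * x + 1.0 + rng.normal(0.0, 0.05, size=1000)

w, b = 0.0, 0.0
lr = 0.1
for epoch in range(20):
    for i in rng.permutation(len(x)):   # visit training cases in random order
        err = (w * x[i] + b) - t[i]
        w -= lr * err * x[i]            # gradient of 0.5*err^2 w.r.t. w
        b -= lr * err                   # gradient of 0.5*err^2 w.r.t. b
    lr *= 0.8                           # decay the rate so the fit settles
```

Each update touches one training case, so memory cost is independent of the dataset size; the decaying rate is what prevents the final estimate from oscillating.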

20 1-D regression illustrates key concepts. Data fits: is a linear model best (model selection)? The simplest models do not capture all the important variations (signal) in the data: they underfit. A more complex model may overfit the training data (fit not only the signal but also the noise), especially if there is not enough data to constrain the model. One method of assessing fit: test generalization, the model's ability to predict held-out data. Optimization is essential: stochastic and batch iterative approaches, and analytic solutions when available.

21 Some fits to the data: which is best?

22 Overfitting

23 Overfitting. Causes of over-fitting: model complexity, e.g., a model with a large number of parameters (degrees of freedom); and too little training data, i.e., a small data size compared to the complexity of the model.

24 Model complexity. Polynomials of larger order M become increasingly tuned to the random noise on the target values.

25 Number of training data & overfitting. The over-fitting problem becomes less severe as the size of the training data increases.

26 Avoiding over-fitting. Determine a suitable value for model complexity. Simple method: hold some data out of the training set (a validation set) and use the held-out data to optimize model complexity. Regularization: an explicit preference towards simple models, penalizing model complexity in the objective function.

27 Validation. Almost invariably, the pattern recognition techniques that we have introduced have one or more free parameters. Two issues arise at this point. Model selection: how do we select the "optimal" parameter(s) for a given classification problem? Validation: once we have chosen a model, how do we estimate its true error rate? The true error rate is the classifier's error rate when tested on the ENTIRE POPULATION.

28 Validation. In real applications, only a finite set of examples is available, and this number is usually smaller than we would hope for! Why? Data collection is a very expensive process. One may be tempted to use the entire training data to select the "optimal" classifier and then estimate the error rate, but this naïve approach has two fundamental problems: the final model will normally overfit the training data, and the error rate estimate will be overly optimistic (lower than the true error rate). In fact, it is not uncommon to achieve 100% correct classification on training data.

29 Validation. We must make the best use of our (limited) data for training, model selection, and performance estimation. Methods: holdout and cross-validation.

30 The holdout method. Split the dataset into two or three groups. Training set: used to train the classifier. Validation set: used to select the "optimal" parameter(s) for a given classification problem. Test set: used to estimate the error rate of the trained classifier.
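A minimal sketch of such a three-way split (the 60/20/20 proportions and the synthetic data are arbitrary choices, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
X = rng.normal(size=(n, 3))
t = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(0.0, 0.1, size=n)

# Shuffle once, then carve off 60% train / 20% validation / 20% test.
idx = rng.permutation(n)
train_idx, val_idx, test_idx = idx[:60], idx[60:80], idx[80:]

X_train, t_train = X[train_idx], t[train_idx]
X_val, t_val = X[val_idx], t[val_idx]
X_test, t_test = X[test_idx], t[test_idx]
```

The shuffle matters: without it, any ordering in the raw data (by time, by neighborhood, by class) leaks into the split.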

31 Simple hold-out: model selection

32 Training, validation, test sets

33 Validation. The holdout method has two basic drawbacks. In problems where we have a sparse dataset, we may not be able to afford the "luxury" of setting aside a portion of the dataset for testing. And since it is a single train-and-test experiment, the holdout estimate of the error rate will be misleading if we happen to get an "unfortunate" split. These limitations can be overcome, at the expense of higher computational cost, by cross-validation (random subsampling, K-fold cross-validation, leave-one-out cross-validation) and the bootstrap.

34 Random subsampling

35 Cross-validation

36 Leave-One-Out Cross Validation

37 Cross Validation. In practice, the choice of K depends on the size of the dataset. For large datasets, even 3-fold cross-validation will be quite accurate. For very sparse datasets, we may have to use leave-one-out in order to train on as many examples as possible. A common choice is K = 10.
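A from-scratch sketch of K-fold cross-validation with the common choice K = 10, on synthetic 1-D data (the data and model are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(-1.0, 1.0, size=90)
t = 2.0 * x + rng.normal(0.0, 0.1, size=90)

K = 10
# Shuffle the indices once, then cut them into K equal folds.
folds = np.array_split(rng.permutation(len(x)), K)

errors = []
for k in range(K):
    test_idx = folds[k]
    train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
    # Fit a line on the other K-1 folds, score on the held-out fold.
    w, b = np.polyfit(x[train_idx], t[train_idx], deg=1)
    pred = w * x[test_idx] + b
    errors.append(np.mean((pred - t[test_idx]) ** 2))

cv_error = np.mean(errors)   # average error over the K held-out folds
```

Every example is used for testing exactly once and for training K-1 times, which is why the estimate is less sensitive to an unlucky split than a single holdout.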

38 Bootstrap. The bootstrap is a resampling technique with replacement. From a dataset with N examples, randomly select (with replacement) N examples and use this set for training. The remaining examples that were not selected for training are used for testing; their number is likely to change from fold to fold. Repeat this process for a specified number of folds.
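A sketch of this procedure on synthetic data (the number of rounds, 20, is an arbitrary choice, not from the slide):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 50
x = rng.uniform(-1.0, 1.0, size=n)
t = 1.5 * x + rng.normal(0.0, 0.1, size=n)

B = 20                      # number of bootstrap rounds
test_errors = []
for _ in range(B):
    # Draw n indices WITH replacement for training...
    train_idx = rng.integers(0, n, size=n)
    # ...the examples never drawn form this round's test set
    # (its size varies round to round; on average ~36.8% of the data).
    test_idx = np.setdiff1d(np.arange(n), train_idx)
    w, b = np.polyfit(x[train_idx], t[train_idx], deg=1)
    test_errors.append(np.mean((w * x[test_idx] + b - t[test_idx]) ** 2))

boot_error = np.mean(test_errors)
```

Because sampling is with replacement, some examples appear several times in a training set while others are left out entirely, and it is exactly those left-out examples that serve as that round's test set.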

39 Bootstrap

40 Procedure outline in training: 1. Divide the available data into training, validation and test sets. 2. Select architecture and training parameters. 3. Train the model using the training set. 4. Evaluate the model using the validation set. 5. Repeat steps 2 through 4 using different architectures and training parameters. 6. Select the best model and train it using data from the training and validation sets. 7. Assess this final model using the test set.


