1
Intro to NLP and Deep Learning - 236605
Tutorial 7. Reminders: problems in supervised learning, the Bias-Variance tradeoff, Overfitting, Regularization, Dropout. Tomer Golany
2
Supervised Learning
3
The supervised learning problem - Reminder
Given a collection of labeled examples D = {(x_i, y_i)}, i = 1..n, where D is called the learning data, x_i is an input variable and y_i is the output variable (label) that matches x_i. X is the input space and Y is the output space. The goal is to learn a mapping (a function, an algorithm) which computes the right output for ALL possible inputs, not just for the examples in D but for every input in the input space. We will call the learned mapping the "predictor" or "prediction function".
4
Prediction Vs. Approximation
Let us sharpen the difference between the approximation problem and the prediction problem. In the approximation problem we are interested in finding a function that describes the connection between the input and the output over the given group of points. In the prediction problem we are interested in finding a matching output y for a new input x, that is, for a point x that does not exist in the learning set D. From the learning data D we need to infer over the entire input space (inductive inference: inference from the particular to the general).
5
Predictor Types (the hypothesis class, the model, the functional family)
The first step in learning is to define the group of predictors from which we will eventually choose the desired prediction function. This group is called the "hypothesis class", "the model", or the "family of functions". Examples:
Linear predictors: come from linear and logistic regression.
Support vector machines (SVM)
Decision trees
K-NN
Neural networks
6
Parametric models
Most of the models we are interested in are parametric models, of the form f_w(x), where w is a vector of real parameters (weights) of a given dimension, w in R^d. That is, the prediction functions are determined by the weight vector, and learning is based on tuning the weights w. The simplest example of a parametric model is the linear regression function f_w(x) = w^T x + w_0. K-NN is an example of a non-parametric model.
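As a rough illustration of the parametric view (a sketch, not from the slides; the weight and input values are made up), the one-dimensional linear predictor above is simply a function of a weight vector:

```python
import numpy as np

# Hypothetical 1-D linear predictor f_w(x) = w1 * x + w0,
# parameterized by the weight vector w = (w0, w1).
def linear_predictor(w, x):
    w0, w1 = w
    return w1 * x + w0

w = np.array([0.5, 2.0])           # illustrative weight values
print(linear_predictor(w, 3.0))    # 6.5
```

Learning then amounts to choosing w so that the predictor's outputs match the labeled examples well.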
7
Quality measurements and the statistical learning model
Now we want to answer the question of how good our predictor is, and how we measure that. For this we define a general model that describes the "world" we are in, from which the learning examples are drawn and from which new inputs (which we need to classify) are taken. The model consists of:
An input space X
An output space Y
A probability distribution p_X on the input space X. This is the "real distribution" of the input.
An ideal prediction function f_0 which defines the true label for each possible input x. This is the target function we want to learn.
The probability distribution and the ideal prediction function are not known to the learning algorithm.
8
Quality measurements and the statistical learning model
Learning is based only on the learning examples D = {(x_i, y_i)}, i = 1..n, where:
The input sequence x_1, ..., x_n is given by independent samples from the distribution p_X.
The labels are true: y_i = f_0(x_i).
In order to define the performance measure, we define a loss function L(y', y) on the output space, which gives the cost of a prediction error, where L: Y x Y -> [0, infinity).
9
Quality measurements and the statistical learning model
For a prediction function f we define two types of performance measures. The Risk (the real price):
R(f) = E_{x ~ p_X} [ L(f(x), f_0(x)) ]
The expectation is taken over the distribution p_X ("the real distribution") of the input x. For a continuous variable, R(f) is the integral of L(f(x), f_0(x)) p_X(x) dx. This is the real quantity we want to minimize. That is, ideally we want to find the best predictor in the hypothesis class F:
f* = argmin_{f in F} R(f)
Recall that if, for example, our hypothesis class contains only linear functions, then f* is not necessarily equal to the ideal function f_0, which may be non-linear. Of course, this minimization cannot be carried out exactly, since p_X and f_0 are not known.
10
Quality measurements and the statistical learning model
The Empirical Risk: this price is calculated only from the given data examples. For a prediction function f the empirical risk is defined as:
R_D(f) = (1/n) * sum_{i=1..n} L(f(x_i), y_i)
In contrast to the real risk, we can easily calculate this measure for every prediction function f, and we can view it as an approximation of the real risk. But this approximation is not always a good criterion for choosing f, because of the overfitting problem.
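A minimal sketch (not from the slides; the toy data, the predictor and the squared-error loss are assumptions) of how the empirical risk is computed:

```python
import numpy as np

# Empirical risk under squared-error loss: R_D(f) = (1/n) * sum L(f(x_i), y_i).
def empirical_risk(predict, xs, ys):
    preds = np.array([predict(x) for x in xs])
    return np.mean((preds - np.asarray(ys)) ** 2)

# Toy learning data and a hypothetical linear predictor.
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 2.1, 3.9, 6.2])
print(empirical_risk(lambda x: 2.0 * x, xs, ys))
```

The real risk would require the unknown distribution p_X and the ideal function f_0, which is exactly why only the empirical version can be evaluated in practice.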
11
Learning by minimizing the empirical risk
We can now define the naive approach to learning a predictor from the example set D, relative to a hypothesis class F of prediction functions: find the predictor in the class F that minimizes the empirical risk on the examples D:
f_D = argmin_{f in F} R_D(f)
By definition this predictor gives the best match to the learning examples, but it is not necessarily the best predictor for a new input x.
12
General problems in Supervised Learning
13
General problems
We will now describe a set of basic problems in supervised learning (learning from a finite set of samples):
Choosing the learning model
Choosing the number of parameters in the model
Overfitting
14
Choosing the Model
Choosing the type of model (linear, neural network, non-parametric model, ...) that fits a certain problem is a problem without a single solution. The choice of model is based on:
Past experience with similar problems
Comparing the performance of different models
Computational cost
Personal preferences
15
Choosing the size of the model
The size of the model (number of parameters) generally determines the size of the set of functions included in the model. As the size grows, we get a larger set with more varied and complex functions. The basic dilemma in choosing the size of the model (illustrated by the sketch below):
A model that is too simple will not allow an accurate description of the "real" connection between the input and the output. For example, a linear model will give a bad fit for a quadratic function.
A model that is too complex might need a huge number of examples (and a long learning time) in order to generalize reasonably. For example, to fit a quadratic model to a quadratic function we need 3 examples; a good fit for a polynomial model of degree 700 will need far more.
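A rough illustration (a sketch, not part of the original slides; the noisy quadratic data and the chosen degrees are made up) of both sides of the dilemma, fitting polynomials of different degrees:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy samples from a quadratic "true" function.
x = np.linspace(-1, 1, 15)
y = 2 * x**2 - x + 0.1 * rng.standard_normal(x.size)

# A dense grid for measuring how well each fit generalizes.
x_test = np.linspace(-1, 1, 200)
y_true = 2 * x_test**2 - x_test

for degree in (1, 2, 12):            # too simple, about right, too complex
    coeffs = np.polyfit(x, y, degree)
    y_pred = np.polyval(coeffs, x_test)
    print(f"degree {degree:2d}: test MSE = {np.mean((y_pred - y_true) ** 2):.4f}")
```

The degree-1 fit misses the curvature, while the high-degree fit chases the noise in the 15 samples; the intermediate degree generalizes best.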
16
Bias-Variance Tradeoff
The tension between these two is called the Bias-Variance Tradeoff. To understand it better, let us take a closer look at the components of the real risk of the learned predictor:
R(f_D) = R(f*) + [ R(f_D) - R(f*) ]
The first term, R(f*) = min over f in F of R(f), is called the approximation error, also known as the Bias. It describes the minimal risk we can get over all the prediction functions in F. This quantity is deterministic and does not depend on the learning set D.
17
Bias-Variance Tradeoff
The second term, R(f_D) - R(f*), is the estimation (generalization) error. It is the gap between the risk of the optimal function in the model and the risk of the function our learning algorithm produced from the learning set D. The Variance is defined as the mean square of this generalization error over learning sets of n samples, for a given learning algorithm; together with the bias it determines how the risk depends on the model complexity.
18
Bias-Variance Tradeoff
(The same risk decomposition as on the previous slide; see the explanation given on the board.)
19
Bias-Variance Tradeoff
The optimal complexity of the model is at the minimum point of the risk. The estimation error generally decreases as the number of examples in the learning set rises, while the approximation error does not depend on that number. Therefore, as the number of examples rises, we expect the optimum point to move to the right, i.e. toward a more complex model. An important method for decreasing the effective complexity of the model is Regularization.
20
Overfitting
21
Overfitting
Overfitting means that we fit our predictor "too well" to the example set D. This may lead to bad results outside the learning set. In the terms defined before, minimizing the empirical risk R_D(f) will not necessarily bring the real risk R(f) (which is not known) to its minimum. The reason for overfitting is a high estimation error, which occurs for models with high complexity.
22
Overfitting
23
Overfitting
24
Regularization
25
Regularization
In general, regularization is a technique applied to the loss (objective) functions of ill-posed problems formulated as optimization problems. Regularization adds a penalty on the parameters of the model in order to reduce the model's freedom. Hence the model will be less likely to fit the noise of the training data, which improves its generalization ability. (A computational problem is called ill-posed when it has no unique solution, or when the solution is highly sensitive to the problem's data.)
26
L1 Regularization
L1 regularization adds a penalty equal to the sum of the absolute values of the coefficients, lambda * sum |w_i|. L1 regularization will shrink some parameters exactly to zero, so some variables will not play any role in the model; L1 regression can therefore be seen as a way to select features in a model.
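A minimal sketch (not from the slides; the synthetic data and alpha values are made up, with scikit-learn's alpha playing the role of lambda) of how L1 regularization cuts coefficients to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
# Only the first two features actually matter.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

for alpha in (0.01, 0.1, 1.0):        # growing L1 penalty
    model = Lasso(alpha=alpha).fit(X, y)
    n_zero = int(np.sum(model.coef_ == 0))
    print(f"alpha={alpha}: {n_zero} of 10 coefficients cut to zero")
```

As the penalty grows, more coefficients are driven exactly to zero, which is the feature-selection behaviour described above.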
27
L1 Regularization Example
We want to predict house prices from the "House Sales in King County, USA" dataset on Kaggle. We train on 0.5% of the dataset; we take such a small training set to make sure that the model will overfit. Testing is done on the other 99.5% of the data. As lambda grows bigger, more coefficients are cut. The accompanying plot (omitted here) shows the evolution of the values of the different coefficients as lambda grows.
28
L1 Regularization
As expected, coefficients are cut one by one until no variables remain. Let us see how the test error evolves: at first, cutting coefficients reduces the overfitting and improves the generalization ability of the model, so the test error decreases. However, as we cut more and more coefficients, the test error starts increasing: the model is not able to learn complex patterns with so few variables.
29
L2 Regularization
L2 regularization adds a penalty equal to the sum of the squared values of the coefficients, lambda * sum w_i^2. L2 regularization forces the parameters to be relatively small: the bigger the penalization, the smaller (and the more robust) the coefficients are. Compared with L1 regularization, the coefficients decrease progressively and are not cut to zero; they slowly shrink toward zero.
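A matching sketch for L2 (again not from the slides; the data and alpha values are illustrative), showing that Ridge shrinks coefficients smoothly rather than cutting them:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Coefficients get smaller as the penalty grows, but stay nonzero.
for alpha in (0.1, 10.0, 1000.0):
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha}: max |coef| = {np.abs(model.coef_).max():.3f}, "
          f"zero coefs = {int(np.sum(model.coef_ == 0))}")
```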
30
Example
Given a dataset, the solution of linear regression (a prediction function of the form y = W1*x + W0) with mean square error and different regularizations is described; the resulting fits are shown in the graphs on the slide.
31
Example
The match between the graphs and the equations: 1 – c, 2 – b, 3 – a.
32
Avoiding overfitting in neural networks
33
Regularization in Deep Learning
The most common regularization techniques used in industry nowadays:
Dataset augmentation
Early stopping
Dropout layer
Weight penalty (L1 and L2)
34
Dataset augmentation
An overfitting model (a neural network or any other type of model) can perform better if the learning algorithm processes more training data. While an existing dataset might be limited, for some machine learning problems there are relatively easy ways of creating synthetic data. For images, common techniques include translating the picture by a few pixels, rotation, and scaling. For classification problems it is usually feasible to inject random negatives, e.g. unrelated pictures. There is no general recipe for how the synthetic data should be generated, and it varies a lot from problem to problem. The general principle is to expand the dataset by applying operations that reflect real-world variations as closely as possible. In practice, a better dataset significantly helps the quality of the models, independent of the architecture.
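A minimal sketch (not from the slides; the shift, rotation and zoom ranges are arbitrary illustrative choices) of image augmentation with Keras' ImageDataGenerator:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random translations, rotations, zooms and flips that mimic real-world variation.
augmenter = ImageDataGenerator(
    rotation_range=15,        # degrees
    width_shift_range=0.1,    # fraction of the image width
    height_shift_range=0.1,   # fraction of the image height
    zoom_range=0.1,
    horizontal_flip=True,
)

# Dummy batch of 32 RGB images of size 32x32 (CIFAR-10-sized), just for the demo.
x_batch = np.random.rand(32, 32, 32, 3).astype("float32")
y_batch = np.random.randint(0, 10, size=32)

# flow() yields endlessly augmented copies of the batch during training.
augmented_x, augmented_y = next(augmenter.flow(x_batch, y_batch, batch_size=32))
print(augmented_x.shape)  # (32, 32, 32, 3)
```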
35
Early stopping
Early stopping combats overfitting by interrupting the training procedure once the model's performance on a validation set gets worse. A validation set is a set of examples that we never use for gradient descent, but which is also not part of the test set. The validation examples are considered representative of future test examples. Early stopping effectively tunes the hyper-parameter "number of epochs/steps". Intuitively, as the model sees more data and learns patterns and correlations, both training and test error go down. After enough passes over the training data the model might start overfitting and learning noise in the given training set. In that case the training error keeps going down while the test error (how well we generalize) gets worse. Early stopping is all about finding this right moment of minimum test error.
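A minimal sketch (not from the slides; the tiny synthetic regression task, the network shape and the patience value are all assumptions) of early stopping with a Keras callback:

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping

# Tiny synthetic regression problem, just to exercise the callback.
x = np.random.rand(200, 8).astype("float32")
y = x.sum(axis=1, keepdims=True)

model = models.Sequential([
    layers.Dense(16, activation="relu", input_shape=(8,)),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Stop once validation loss has not improved for 3 epochs in a row,
# and restore the weights from the best epoch seen so far.
early_stop = EarlyStopping(monitor="val_loss", patience=3,
                           restore_best_weights=True)

history = model.fit(x, y, validation_split=0.2, epochs=100,
                    callbacks=[early_stop], verbose=0)
print("stopped after", len(history.history["loss"]), "epochs")
```

The epochs=100 value is only an upper bound; the callback decides the actual stopping point from the validation curve.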
36
Dropout layer
37
Dropout layer
What is Dropout in neural networks? The term "dropout" refers to dropping out units (both hidden and visible) in a neural network: ignoring a randomly chosen set of units (i.e. neurons) during the training phase. At each training stage, individual neurons are either dropped out of the net with probability 1-p or kept with probability p. A reduced network is left; the incoming and outgoing edges of a dropped-out node are also removed.
38
Why do we need Dropout? To prevent overfitting. A fully connected layer occupies most of the parameters, and hence neurons develop co-dependency amongst each other during training, which reduces the individual power of each neuron and leads to overfitting of the training data.
39
Dropout layer – Technical Details
Dropout is an approach to regularization in neural networks which helps reduce interdependent learning amongst the neurons. Training phase: for each hidden layer, for each training sample, for each iteration, ignore (zero out) a random fraction p of the nodes (and the corresponding activations). A small sketch of this masking step follows below.
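A rough numpy sketch (not the slides' code; the layer size and drop fraction are arbitrary) of the training-time masking just described:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, p_drop):
    """Training phase: zero out a random fraction p_drop of the activations."""
    mask = rng.random(activations.shape) >= p_drop   # True = keep the unit
    return activations * mask

hidden = rng.standard_normal(10)          # activations of one hidden layer
print(dropout_train(hidden, p_drop=0.5))  # roughly half of them zeroed
```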
40
Dropout layer – Technical Details
41
Dropout layer – Technical Details
Testing phase: use all activations, but scale them down by the factor p (the keep probability), to account for the activations that were missing during training. Some observations:
Dropout forces a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
Dropout roughly doubles the number of iterations required to converge; however, the training time for each epoch is shorter.
With H hidden units, each of which can be dropped, we have 2^H possible models. In the testing phase the entire network is used, and each activation is scaled by the factor p, as in the sketch below.
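Continuing the numpy sketch above (an illustration under the same assumptions, not the slides' code), test time uses every unit and simply rescales the activations:

```python
import numpy as np

def dropout_test(activations, p_keep):
    """Testing phase: keep every unit, scale activations by the keep probability."""
    return activations * p_keep

hidden = np.array([0.8, -1.2, 0.3, 2.0])
print(dropout_test(hidden, p_keep=0.5))   # each activation scaled by 0.5
```

Many frameworks instead use "inverted dropout", scaling by 1/p_keep at training time so that test time needs no change; the effect is equivalent in expectation.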
42
Dropout layer – Experiments
We build a deep network in Keras and validate it on the CIFAR-10 dataset. The network is built with three convolution layers of size 64, 128 and 256, followed by two fully connected layers of size 512 and an output layer of size 10.
We take ReLU as the activation function for the hidden layers and sigmoid for the output layer.
We use the standard categorical cross-entropy loss.
Finally, we use dropout in all layers and increase the dropout fraction from 0.0 (no dropout at all) to 0.9 with a step size of 0.1, running each configuration for 20 epochs. A sketch of such a network follows below.
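A minimal Keras sketch of a network along the lines described (an approximation: the kernel sizes, pooling layers and the single shared dropout rate are assumptions, not the exact experimental code):

```python
from tensorflow.keras import layers, models

def build_model(drop_rate=0.2):
    """CIFAR-10-style convolutional net with dropout after every layer (sketch)."""
    model = models.Sequential([
        layers.Conv2D(64, (3, 3), activation="relu", padding="same",
                      input_shape=(32, 32, 3)),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(drop_rate),
        layers.Conv2D(128, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(drop_rate),
        layers.Conv2D(256, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(drop_rate),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(drop_rate),
        layers.Dense(512, activation="relu"),
        layers.Dropout(drop_rate),
        layers.Dense(10, activation="sigmoid"),   # sigmoid output, as in the slides
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model(drop_rate=0.2)
model.summary()
```

In the experiment described above, drop_rate would be swept from 0.0 to 0.9 in steps of 0.1, training each model for 20 epochs.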
43
Dropout layer – Experiments
Results: plots of validation accuracy and loss for each dropout fraction (figures omitted here).
44
Dropout layer – Experiments
Conclusions: as the dropout fraction increases, there is initially some increase in validation accuracy and decrease in loss, before the trend starts to go down. There could be two reasons for the trend reversing after a dropout fraction of 0.2:
0.2 is the actual optimum for this dataset, network and the chosen parameters.
More epochs are needed to train the networks.
45
References
Material was taken from course # – Machine Learning.