Bias and Variance (Machine Learning 101)


Bias and Variance (Machine Learning 101). Mike Mozer, Department of Computer Science and Institute of Cognitive Science, University of Colorado at Boulder.

Learning Is Impossible: What's my rule?

1 2 3 ⇒ satisfies rule
4 5 6 ⇒ satisfies rule
6 7 8 ⇒ satisfies rule
9 2 31 ⇒ does not satisfy rule

Possible rules consistent with these observations: 3 consecutive single digits; 3 consecutive integers; 3 numbers in ascending order; 3 numbers whose sum is less than 25; 3 numbers < 10; 1, 4, or 6 in first column; "yes" to first 3 sequences, "no" to all others.
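
To see why the data underdetermine the rule, here is a minimal sketch (the rule functions below are straightforward encodings of the candidates listed above) checking each candidate against the four observed sequences; every one of them is consistent with the labels.

```python
# Each candidate rule maps a 3-number sequence to True/False.
candidate_rules = {
    "3 consecutive single digits":   lambda s: all(0 <= x <= 9 for x in s)
                                               and s[1] == s[0] + 1 and s[2] == s[1] + 1,
    "3 consecutive integers":        lambda s: s[1] == s[0] + 1 and s[2] == s[1] + 1,
    "3 numbers in ascending order":  lambda s: s[0] < s[1] < s[2],
    "sum is less than 25":           lambda s: sum(s) < 25,
    "3 numbers < 10":                lambda s: all(x < 10 for x in s),
    "1, 4, or 6 in first column":    lambda s: s[0] in (1, 4, 6),
    "lookup table (yes to first 3)": lambda s: s in [(1, 2, 3), (4, 5, 6), (6, 7, 8)],
}

# The four observations from the slide: sequence -> does it satisfy the rule?
observations = {(1, 2, 3): True, (4, 5, 6): True, (6, 7, 8): True, (9, 2, 31): False}

for name, rule in candidate_rules.items():
    consistent = all(rule(seq) == label for seq, label in observations.items())
    print(f"{name:32s} consistent with data: {consistent}")   # prints True for all
```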

"What's My Rule" For Machine Learning. [Table: truth table over binary inputs x1, x2, x3 with the output y given for some rows and "?" for the rest.] 16 possible rules (models) remain consistent with the labeled rows. With $n$ binary inputs and $p$ training examples, there are $2^{2^n - p}$ possible models.
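
A small enumeration makes the count concrete. The sketch below assumes a hypothetical training set labeling p = 4 of the 2^3 = 8 input patterns (the specific labels are made up); it enumerates all 2^(2^n) Boolean functions and counts those consistent with the training examples, recovering 2^(2^n − p) = 16.

```python
from itertools import product

n = 3                                            # number of binary inputs
patterns = list(product([0, 1], repeat=n))       # all 2**n input patterns

# Hypothetical training set labeling p = 4 of the 8 patterns (labels are arbitrary).
train = {(0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (1, 1, 1): 0}

# A "model" is any assignment of a 0/1 output to every input pattern.
consistent = 0
for outputs in product([0, 1], repeat=len(patterns)):    # 2**(2**n) models in total
    table = dict(zip(patterns, outputs))
    if all(table[x] == y for x, y in train.items()):
        consistent += 1

print(consistent)                                # 16
print(2 ** (2 ** n - len(train)))                # 2**(2**n - p) = 16 as well
```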

Model Space. [Venn diagram: all possible models, containing the models consistent with the data and the correct model.] Size of model class = complexity. More data helps: in the limit of infinite data, even a look-up-table model is fine.

Model Space. [Venn diagram as before, now with a restricted model class drawn inside the set of all possible models.] Restricting the model class can help, or it can hurt; it depends on whether the restrictions are domain appropriate.

Restricting Models. Models range in their flexibility to fit arbitrary data.
Complex model: unconstrained, large capacity; this may allow it to fit quirks in the data and fail to capture regularities (low bias, high variance).
Simple model: constrained, small capacity; this may prevent it from representing all the structure in the data (high bias, low variance).
constrained : unconstrained :: parametric : nonparametric

Bias. A too-simple (e.g., linear) fit underpredicts at the ends and overpredicts at the center. Regardless of the training sample, or the size of the training sample, the model will produce consistent errors.

Variance Different samples of training data yield different model fits

Formalizing Bias and Variance. Given a data set $\mathcal{D}=\{(\mathbf{x}_1,y_1),\ldots,(\mathbf{x}_N,y_N)\}$ and a model built from that data set, $f(\mathbf{x};\mathcal{D})$, we can evaluate the effectiveness of the model using mean squared error, with $|\mathcal{D}|=N$ held constant:

$$\mathrm{MSE} = E\!\left[\big(y - f(\mathbf{x};\mathcal{D})\big)^2\right]$$

The expectation can be taken the way you're used to seeing it, over the environment of inputs and outputs, $p(\mathbf{x},y)$:

$$\mathrm{MSE} = E_{p(\mathbf{x},y)}\!\left[\big(y - f(\mathbf{x};\mathcal{D})\big)^2\right]$$

or over the environment and all possible data sets, $p(\mathbf{x},y,\mathcal{D})$:

$$\mathrm{MSE} = E_{p(\mathbf{x},y,\mathcal{D})}\!\left[\big(y - f(\mathbf{x};\mathcal{D})\big)^2\right]$$

Imagine I give each person their own data set; they each build a model, and we average MSE across all possible data sets of size N.
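
A minimal Monte-Carlo sketch of the last expectation (the sinusoidal target, noise level, and straight-line model are assumptions chosen only for illustration): draw many data sets of size N, fit a model to each, and average the squared error on fresh draws from p(x, y).

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_datasets, noise_sd = 30, 2000, 0.5

def true_fn(x):                          # assumed ground truth E[y|x]
    return np.sin(2 * np.pi * x)

def sample_dataset(size):
    x = rng.uniform(0, 1, size)
    y = true_fn(x) + rng.normal(0, noise_sd, size)
    return x, y

sq_errors = []
for _ in range(n_datasets):              # expectation over p(x, y, D)
    x_tr, y_tr = sample_dataset(N)       # a data set D of size N
    coefs = np.polyfit(x_tr, y_tr, deg=1)    # fit f(x; D): a straight line
    x_te, y_te = sample_dataset(1)           # a fresh (x, y) draw
    sq_errors.append((y_te[0] - np.polyval(coefs, x_te[0])) ** 2)

print("estimated MSE over data sets:", np.mean(sq_errors))
```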

Let's focus on a given $\mathbf{x}$ in the test set ($\mathcal{D}$ is still a random variable). The MSE at $\mathbf{x}$ is mathematically equivalent to the sum of three terms:

$$\mathrm{MSE}(\mathbf{x}) = E_{y,\mathcal{D}\mid\mathbf{x}}\!\left[\big(y - f(\mathbf{x};\mathcal{D})\big)^2\right] = \big(E_{\mathcal{D}}[f(\mathbf{x};\mathcal{D})] - E[y\mid\mathbf{x}]\big)^2 + E_{\mathcal{D}}\!\left[\big(f(\mathbf{x};\mathcal{D}) - E_{\mathcal{D}}[f(\mathbf{x};\mathcal{D})]\big)^2\right] + E\!\left[\big(y - E[y\mid\mathbf{x}]\big)^2\right]$$

Bias (the first term, squared): the difference between the average model prediction across data sets, $E_{\mathcal{D}}[f(\mathbf{x};\mathcal{D})]$, and the target $E[y\mid\mathbf{x}]$. For fits of a simple (linear) model, the bias is negative at the left and right ends and positive in the center; the accompanying graphs plot $E_{\mathcal{D}}[f(\mathbf{x};\mathcal{D})]$.
Variance (the second term): the variance of the model predictions across data sets at the given point; it is large for fits of a complex model.
Intrinsic noise (the third term): residual noise in the data set, either because we lack the input features to predict deterministically or because of true randomness.
This decomposition is a mathematical fact. Now we can return to models varying in complexity…
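
The decomposition can be checked numerically. The sketch below (again with an assumed sinusoidal target and a deliberately simple linear model) estimates bias², variance, and intrinsic noise at a single test point x0 and confirms that the three terms sum to MSE(x0).

```python
import numpy as np

rng = np.random.default_rng(1)
N, n_datasets, noise_sd = 30, 5000, 0.5
x0 = 0.9                                   # the fixed test point

def true_fn(x):
    return np.sin(2 * np.pi * x)           # E[y | x]

preds = np.empty(n_datasets)
sq_err = np.empty(n_datasets)
for d in range(n_datasets):                # average over data sets D (and y at x0)
    x = rng.uniform(0, 1, N)
    y = true_fn(x) + rng.normal(0, noise_sd, N)
    coefs = np.polyfit(x, y, deg=1)        # simple (linear) model -> high bias
    preds[d] = np.polyval(coefs, x0)
    y0 = true_fn(x0) + rng.normal(0, noise_sd)     # fresh target at x0
    sq_err[d] = (y0 - preds[d]) ** 2

bias_sq  = (preds.mean() - true_fn(x0)) ** 2       # (E_D[f] - E[y|x])^2
variance = preds.var()                             # E_D[(f - E_D[f])^2]
noise    = noise_sd ** 2                           # E[(y - E[y|x])^2]

print("MSE(x0)             :", sq_err.mean())
print("bias^2 + var + noise:", bias_sq + variance + noise)   # approximately equal
```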

Bias-Variance Trade Off. [Plot: MSE vs. model complexity.] MSE is high on the left (low complexity) because bias² is high; MSE is high on the right (high complexity) because variance is high. The optimal complexity lies in between.

[Figure: MSEtest split into bias² and variance, as a function of model complexity (polynomial order); Gigerenzer & Brighton (2009).] Here's an example with actual (simulated) data: on the left, a made-up function and 30 observations with noise; imagine fitting with polynomials of various orders. The test error from the function mean is split into bias² and variance.
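
In the same spirit, here is a simulation sketch (a made-up function and noise level, not the Gigerenzer & Brighton data) that sweeps polynomial order and estimates bias² and variance of the fits, averaged over a grid of test points.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n_datasets, noise_sd = 30, 500, 0.3
x_test = np.linspace(0.05, 0.95, 50)
f_true = np.sin(2 * np.pi * x_test)                # assumed ground-truth function

for degree in range(1, 9):                          # model complexity sweep
    preds = np.empty((n_datasets, x_test.size))
    for d in range(n_datasets):                     # many training sets of 30 noisy points
        x = rng.uniform(0, 1, N)
        y = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, N)
        coefs = np.polyfit(x, y, deg=degree)
        preds[d] = np.polyval(coefs, x_test)
    bias_sq  = np.mean((preds.mean(axis=0) - f_true) ** 2)   # avg bias^2 over test x
    variance = np.mean(preds.var(axis=0))                    # avg variance over test x
    print(f"degree {degree}: bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"bias^2+variance={bias_sq + variance:.4f}")
```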

Bias-Variance Trade Off Is Revealed Via Test Set, Not Training Set. [Plot: MSEtest and MSEtrain as a function of model complexity.] Training error keeps falling as the model grows more complex; only the test error reveals the trade-off.

Bias-Variance Trade Off Is Revealed Via Test Set, Not Training Set. [Figure: MSEtest and MSEtrain; London's mean daily temperature in 2000, fit with 3rd- and 12th-order polynomials by least squares; mean error in fitting samples of 30 observations, and mean prediction error of the same models, for varying polynomial degrees. Gigerenzer & Brighton (2009).]
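
The same point can be reproduced with simulated data (an assumed sinusoidal target rather than the London temperature series): training MSE falls monotonically as polynomial order grows, while test MSE eventually rises.

```python
import numpy as np

rng = np.random.default_rng(3)
N, noise_sd = 30, 0.3

x_tr = rng.uniform(0, 1, N)                        # one training sample of 30 observations
y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, noise_sd, N)
x_te = rng.uniform(0, 1, 1000)                     # large held-out test set
y_te = np.sin(2 * np.pi * x_te) + rng.normal(0, noise_sd, 1000)

for degree in range(1, 13):                        # includes 3rd- and 12th-order fits
    # Very high orders may print a poorly-conditioned-fit warning; they still run.
    coefs = np.polyfit(x_tr, y_tr, deg=degree)
    mse_train = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_test  = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(f"degree {degree:2d}: MSE_train={mse_train:.3f}  MSE_test={mse_test:.3f}")
```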

Back To The Venn Diagram… [Diagram: within all possible models, a low-bias/high-variance model class and a high-bias/low-variance model class, positioned relative to the correct model.] Bias is not intrinsically bad if it is suitable for the problem domain.

Current Perspective In Machine Learning. We can learn complex domains using a low-bias model (a deep net) and tons of training data. But will we have enough data (e.g., for speech recognition)? What does tons of training data do for us in terms of the bias-variance trade-off? It doesn't affect bias, but it reduces variance, which allows a more complex model to be used: a shift of the optimal complexity to the right.
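
A sketch of this claim, under the same assumed sinusoidal target and a fixed linear model class: as N grows, the variance of the prediction at a test point shrinks while the squared bias stays essentially constant.

```python
import numpy as np

rng = np.random.default_rng(4)
noise_sd, n_datasets, x0 = 0.3, 1000, 0.9
target = np.sin(2 * np.pi * x0)                    # E[y | x0]

for N in (10, 30, 100, 300, 1000):                 # growing training-set size
    preds = np.empty(n_datasets)
    for d in range(n_datasets):
        x = rng.uniform(0, 1, N)
        y = np.sin(2 * np.pi * x) + rng.normal(0, noise_sd, N)
        coefs = np.polyfit(x, y, deg=1)            # fixed (linear) model class
        preds[d] = np.polyval(coefs, x0)
    print(f"N={N:4d}: bias^2={np.square(preds.mean() - target):.4f}  "
          f"variance={preds.var():.4f}")           # variance shrinks; bias^2 does not
```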

Scaling Function. [Plot: log required data set size vs. domain complexity (also model complexity), with points for single-speaker, small-vocabulary, isolated words; multiple-speaker, small-vocabulary, isolated words; multiple-speaker, small-vocabulary, connected speech; multiple-speaker, large-vocabulary, connected speech; and intelligent chatbot / Turing test, with projected scaling beyond the observed points.] Multiple-speaker, large-vocabulary connected speech took a long time to get here; I didn't expect to see it in my lifetime.

The Challenge To AI: Will it scale?
In the 1960s, neural nets (perceptrons) created a wave of excitement, but Minsky and Papert (1969) showed challenges to scaling via computational complexity analysis: the amount of data required, the amount of time to converge, the number of hidden units required.
In the 1990s, neural nets (back propagation) created a wave of excitement; they worked great on toy problems, but there were arguments about scaling (Elman et al., 1996; Marcus, 1998).
Now in the 2010s, neural nets (deep learning) have created a wave of excitement. Researchers have clearly moved beyond toy problems, and nobody is yet complaining about scaling, but there is no assurance that the methods won't disappoint again.

Solution To Scaling Dilemma. Use domain-appropriate bias to reduce the complexity of the learning task. [Plot: log required data set size vs. domain complexity (also model complexity).] If you're Geoff Hinton you may think this is cheating. But, unfortunately for us, none of us is Geoff Hinton.

Example Of Domain-Appropriate Bias: Vision. The architecture of the primate visual system is a visual hierarchy: a transformation from simple, low-order features to complex, high-order features, and from position-specific features to position-invariant features. (Source: neuronresearch.net/vision)

Example Of Domain-Appropriate Bias: Vision. Convolutional nets build in three assumptions (sources: neuronresearch.net/vision, benanne.github.io):
Spatial locality: features at nearby locations in an image are most likely to have joint causes and consequences.
Spatial position homogeneity: features deemed significant in one region of an image are likely to be significant in others.
Spatial scale homogeneity: locality and position homogeneity should apply across a range of spatial scales.
I love the Goodfellow, Bengio, & Courville text, but it says nothing about domain-appropriate bias and how to incorporate the knowledge you have into a model. I love TensorFlow/Theano/etc., but like all simulation environments, they make it really easy to pick defaults and use generic, domain-independent methods. ML is really about crafting models to the domain, not tossing unknown data into a generic architecture. Deep learning seems to mask this issue.
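
As a concrete illustration of how these assumptions restrict a model (a plain-NumPy sketch with made-up sizes): a fully connected map from a 32×32 image to a 32×32 feature map needs over a million weights, while a 3×3 convolution, which encodes spatial locality and position homogeneity through weight sharing, needs nine.

```python
import numpy as np

H = W = 32
image = np.random.rand(H, W)

# Unrestricted model: every output unit sees every pixel.
fully_connected_params = (H * W) * (H * W)         # 1,048,576 weights

# Domain-appropriate bias: spatial locality + position homogeneity -> 3x3 shared kernel.
kernel = np.random.rand(3, 3)
conv_params = kernel.size                          # 9 weights

def conv2d_valid(x, k):
    """Naive 'valid' 2-D convolution (cross-correlation) with a shared kernel."""
    kh, kw = k.shape
    out = np.empty((x.shape[0] - kh + 1, x.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

feature_map = conv2d_valid(image, kernel)
print(fully_connected_params, conv_params, feature_map.shape)   # 1048576 9 (30, 30)
```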