
Tricks of the Trade II
Deep Learning and Neural Nets, Spring 2015

Agenda
- Review
- Discussion of homework
- Odds and ends
- The latest tricks that seem to make a difference

Cheat Sheet 1
- Perceptron: activation function, weight update
- Linear associator (a.k.a. linear regression): activation function, weight update; assumes minimizing a squared-error loss function
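The equations on this slide did not survive the transcript; as a hedged reconstruction, these are the standard forms, assuming learning rate ε, target t, output y, and inputs x_i (the slide's exact notation may differ):

```latex
% Perceptron (assumed standard form): threshold activation and perceptron learning rule
y = \begin{cases} 1 & \text{if } \sum_i w_i x_i > 0 \\ 0 & \text{otherwise} \end{cases}
\qquad \Delta w_i = \epsilon \, (t - y) \, x_i

% Linear associator (a.k.a. linear regression): linear activation and delta rule,
% i.e., gradient descent on the squared-error loss
y = \sum_i w_i x_i
\qquad \Delta w_i = \epsilon \, (t - y) \, x_i
```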

Cheat Sheet 2
- Two-layer net (a.k.a. logistic regression): activation function, weight update
- Softmax net (a.k.a. multinomial logistic regression): activation function, weight update
- Updates assume minimizing a squared-error loss function
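Again the equations are missing from the transcript; the standard forms for the two models named above are sketched here, assuming learning rate ε, target t, and net input z = Σ_i w_i x_i (the slide's exact updates may differ):

```latex
% Two-layer (logistic) net: sigmoid activation; update shown for a squared-error loss
y = \frac{1}{1 + e^{-z}}, \qquad \Delta w_i = \epsilon \, (t - y) \, y (1 - y) \, x_i

% Softmax net: normalized exponential over the K output net inputs z_k
y_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}
```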

Cheat Sheet 3
- Back propagation: activation function, weight update
- Assumes minimizing a squared-error loss function
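As a hedged stand-in for the slide's missing formulas, the standard back-propagation equations for logistic units trained on squared error (ε is the learning rate, w_{kj} the weight from unit j to unit k):

```latex
% Output unit j: error term from the squared-error loss times the logistic derivative
\delta_j = (t_j - y_j) \, y_j (1 - y_j)

% Hidden unit j: error back-propagated from the units k it feeds into
\delta_j = y_j (1 - y_j) \sum_k \delta_k \, w_{kj}

% Weight update for the connection from unit i (activity x_i) into unit j
\Delta w_{ji} = \epsilon \, \delta_j \, x_i
```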

Cheat Sheet 4
- Loss functions: squared error, cross entropy
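The two loss functions named above, in their standard per-example forms (summing over output units k; t are targets, y are outputs):

```latex
% Squared error
E = \tfrac{1}{2} \sum_k (t_k - y_k)^2

% Cross entropy (binary targets; for a softmax output it reduces to -\sum_k t_k \ln y_k)
E = -\sum_k \left[ t_k \ln y_k + (1 - t_k) \ln (1 - y_k) \right]
```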

How Many Hidden Units Do We Need To Learn Handprinted Digits?
- Two isn't enough
- Think of the hidden layer as a bottleneck conveying all information from input to output
- Sometimes networks can surprise you, e.g., the autoencoder

Autoencoder
- Self-supervised training procedure: given a set of input vectors (no target outputs), map the input back to itself via a hidden-layer bottleneck
- How to achieve the bottleneck?
  - Fewer neurons
  - Sparsity constraint
  - Information-transmission constraint (e.g., add noise to a unit, or shut units off randomly, a.k.a. dropout)

Autoencoder and 1-of-N Task
Input/output vectors:
1 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 1 0 0 0 0 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 1
How many hidden units are required to perform this task?

When To Stop Training
1. Train n epochs; lower the learning rate; train m epochs
   - bad idea: can't assume a one-size-fits-all approach
2. Error-change criterion: stop when the error isn't dropping
   - My recommendation: a criterion based on the % drop over a window of, say, 10 epochs (see the sketch below)
   - 1 epoch is too noisy; an absolute error criterion is too problem dependent
   - Karl's idea: train for a fixed number of epochs after the criterion is reached (possibly with a lower learning rate)
NOTE: these belong in practical_advice.pptx. Move after 2015.
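A minimal sketch of the windowed error-change criterion recommended above, assuming one error value per epoch and a hypothetical threshold min_pct_drop:

```python
def should_stop(errors, window=10, min_pct_drop=0.01):
    """Stop when the error has dropped by less than min_pct_drop (e.g., 1%)
    over the last `window` epochs. `errors` holds one value per epoch."""
    if len(errors) < window + 1:
        return False  # not enough history to evaluate the window yet
    old, new = errors[-window - 1], errors[-1]
    pct_drop = (old - new) / max(abs(old), 1e-12)
    return pct_drop < min_pct_drop
```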

When To Stop Training (continued)
3. Weight-change criterion: compare the weights at epochs t-10 and t and test whether they have changed appreciably (one such test is sketched below)
   - Don't base the test on the length of the overall weight-change vector
   - Possibly express the change as a percentage of the weight
   - Be cautious: small weight changes at critical points can result in a rapid drop in error
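The slide's exact test is not in the transcript; a hedged sketch in the spirit of this criterion, assuming weight snapshots stored as NumPy arrays:

```python
import numpy as np

def weights_converged(w_old, w_new, tol=1e-3):
    """Weight-change criterion sketch: look at the largest per-weight change
    between epoch t-10 (w_old) and epoch t (w_new), expressed relative to each
    weight's magnitude, rather than the norm of the overall change vector."""
    rel_change = np.abs(w_new - w_old) / (np.abs(w_old) + 1e-12)
    return rel_change.max() < tol
```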

Setting Model Hyperparameters
How do you select the appropriate model size, i.e., # of hidden units, # of layers, connectivity, etc.?
- Validation method: split the training set into two parts, T and V; train many different architectures on T; choose the architecture that minimizes error on V (see the sketch below)
- Fancy Bayesian optimization methods are starting to become popular
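A minimal sketch of the validation method, assuming hypothetical train_model and error routines and a list of candidate architectures:

```python
import random

def select_architecture(data, candidates, val_fraction=0.2):
    """Split the training data into T and V, train each candidate
    architecture on T, and keep the one with the lowest error on V."""
    data = list(data)
    random.shuffle(data)
    n_val = int(len(data) * val_fraction)
    V, T = data[:n_val], data[n_val:]
    best_arch, best_err = None, float("inf")
    for arch in candidates:
        model = train_model(arch, T)   # hypothetical training routine
        err = error(model, V)          # hypothetical error measure
        if err < best_err:
            best_arch, best_err = arch, err
    return best_arch
```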

The Danger Of Minimizing Network Size
- My sense is that local optima arise only if you use a highly constrained network: minimum number of hidden units, minimum number of layers, minimum number of connections (XOR example?)
- Having spare capacity in the net means there are many equivalent solutions to training
  - e.g., if you have 10 hidden units and need only 2, there are 45 (= 10 choose 2) equivalent solutions

Regularization Techniques
Instead of starting with the smallest net possible, use a larger network and apply various tricks to avoid using its full capacity. Seven ideas follow…

Regularization Techniques
1. Early stopping: rather than training the network until the error converges, stop training early
   - Rumelhart: hidden units all go after the same source of error initially -> redundancy
   - Hinton: weights start small and grow over training; when weights are small, the model is mostly operating in the linear regime
   - Dangerous: very dependent on the training algorithm (e.g., what would happen with random weight search?)
   - While probably not the best technique for controlling model complexity, it does suggest that you shouldn't obsess over finding a minimum-error solution.

Regularization Techniques
2. Weight penalty terms
   - L2 weight decay
   - L1 weight decay
   - weight elimination
   - See Reed (1993) for a survey of 'pruning' algorithms
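The penalty terms themselves are not in the transcript; the standard definitions, added to the data loss E_0 with strength λ (w_0 is a scale parameter in weight elimination), are likely close to what the slide showed:

```latex
% L2 weight decay
E = E_0 + \lambda \sum_i w_i^2

% L1 weight decay
E = E_0 + \lambda \sum_i |w_i|

% Weight elimination (Weigend, Rumelhart & Huberman)
E = E_0 + \lambda \sum_i \frac{w_i^2 / w_0^2}{1 + w_i^2 / w_0^2}
```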

Regularization Techniques
3. Hard constraint on weights
   - Ensure that for every unit the length (L2 norm) of the incoming weight vector stays below a limit; if the constraint is violated, rescale all of the unit's weights (see the sketch below) [see Hinton video @ minute 4:00]
   - I'm not clear why L2 normalization and not L1
4. Injecting noise [see Hinton video]
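A minimal sketch of this max-norm style constraint, assuming a NumPy weight matrix whose columns are the incoming weight vectors and a hypothetical limit c:

```python
import numpy as np

def apply_max_norm(W, c=3.0):
    """If a unit's incoming weight vector is longer than c (L2 norm),
    rescale that vector back down to length c; otherwise leave it alone.
    Columns of W are assumed to be the incoming weight vectors."""
    norms = np.linalg.norm(W, axis=0, keepdims=True)        # L2 norm per unit
    scale = np.minimum(1.0, c / np.maximum(norms, 1e-12))   # shrink only violators
    return W * scale
```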

Regularization Techniques
6. Model averaging
   - Ensemble methods
   - Bayesian methods
7. Dropout [watch Hinton video]

More On Dropout
- With H hidden units, each of which can be dropped, we have 2^H possible models
- Each of the 2^(H-1) models that include hidden unit h must share the same weights for that unit
  - serves as a form of regularization
  - makes the models cooperate
- Including all hidden units at test time with a scaling of 0.5 is equivalent to computing the geometric mean of all 2^H models (see the sketch below)
  - exact equivalence with one hidden layer
  - "pretty good approximation" according to Geoff with multiple hidden layers
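A minimal sketch of dropout with a 0.5 keep probability, assuming a hidden-activation array h (NumPy): units are zeroed at random during training, and at test time all units are kept but scaled by 0.5 as described above.

```python
import numpy as np

def dropout_hidden(h, p_keep=0.5, train=True, rng=None):
    """Dropout on a hidden-activation array `h`.
    Training: each unit is kept with probability p_keep, otherwise zeroed.
    Test: keep every unit but scale by p_keep (0.5), which for a single
    hidden layer matches the geometric mean of the 2^H dropped-out models."""
    if train:
        rng = rng or np.random.default_rng()
        mask = rng.random(h.shape) < p_keep   # boolean keep/drop mask
        return h * mask
    return h * p_keep
```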

Two Problems With Deep Networks
- Credit assignment problem
- Vanishing error gradients
  - note that for a logistic unit, y(1-y) ≤ 0.25
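Why that bound matters, assuming the note refers to the derivative of the logistic activation y = 1/(1+e^{-z}):

```latex
\frac{\partial y}{\partial z} = y(1 - y) \le \tfrac{1}{4}
\quad \text{(maximized at } y = \tfrac{1}{2}\text{)}
```

Back-propagating through many logistic layers multiplies in one such factor per layer, so the error gradient can shrink geometrically with depth.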

Unsupervised Pretraining
- Suppose you have access to a lot of unlabeled data in addition to labeled data ("semisupervised learning")
- Can we leverage the unlabeled data to initialize the network weights?
  - an alternative to small random weights
  - requires an unsupervised procedure: the autoencoder
- With good initialization, we can minimize the credit assignment problem.

Autoencoder
- Self-supervised training procedure: given a set of input vectors (no target outputs), map the input back to itself via a hidden-layer bottleneck
- How to achieve the bottleneck?
  - Fewer neurons
  - Sparsity constraint
  - Information-transmission constraint (e.g., add noise to a unit, or shut units off randomly, a.k.a. dropout)

Autoencoder Combines An Encoder And A Decoder
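A minimal NumPy sketch of this encoder/decoder structure; the choice of sigmoid layers and the squared-error reconstruction loss are assumptions for brevity, not the slide's exact architecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencode(x, W_enc, W_dec):
    """Encoder maps the input to a low-dimensional hidden code (the bottleneck);
    decoder maps the code back to a reconstruction of the input."""
    h = sigmoid(W_enc @ x)        # encoder: input -> hidden code
    x_hat = sigmoid(W_dec @ h)    # decoder: hidden code -> reconstruction
    return h, x_hat

# Training would minimize the reconstruction error, e.g.:
# E = 0.5 * np.sum((x - x_hat) ** 2)
```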

Stacked Autoencoders
[figure: autoencoders trained layer by layer, with the encoder layers copied to form a deep network]
Note that the decoders can be stacked to produce a generative model of the domain
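A sketch of the greedy layer-wise idea behind stacking, assuming hypothetical train_autoencoder and encode helpers (neither comes from the slides):

```python
def stack_autoencoders(data, layer_sizes):
    """Train one autoencoder per layer; each layer's hidden code becomes the
    training data for the next layer. The resulting encoder weights are then
    copied into a deep network for supervised fine-tuning."""
    encoders, codes = [], data
    for n_hidden in layer_sizes:
        W_enc, W_dec = train_autoencoder(codes, n_hidden)  # hypothetical trainer
        encoders.append(W_enc)
        codes = encode(codes, W_enc)                        # hypothetical forward pass
    return encoders
```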

Rectified Linear Units
- Version 1 and Version 2 (see the sketch below)
- Do we need to worry about z = 0?
- Do we need to worry about the lack of gradient for z < 0?
- Note the sparsity of the activation pattern
- Note that there is no squashing of the error derivative
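The slide's two "versions" are images that did not survive the transcript; as an assumption about what they were, two common rectified-linear formulations:

```latex
% Version 1 (assumed): hard rectifier
y = \max(0, z)

% Version 2 (assumed): softplus, a smooth approximation to the rectifier
y = \log\left(1 + e^{z}\right)

% Gradient of the hard rectifier: 1 for z > 0, 0 for z < 0; z = 0 is handled by
% convention (use 0 or 1), which is why the "z = 0" question above arises.
```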

Rectified Linear Units
Hinton argues that this is a form of model averaging

Hinton Bag Of Tricks
- Deep network
- Unsupervised pretraining if you have lots of data
- Weight initialization to prevent gradients from vanishing or exploding
- Dropout training
- Rectified linear units
- Convolutional NNs if there are spatial/temporal patterns