Deep Learning Methods For Automated Discourse CIS 700-7


Deep Learning Methods For Automated Discourse CIS 700-7 Fall 2017 http://dialog-systems-class.org/ João Sedoc with Chris Callison-Burch and Lyle Ungar joao@upenn.edu January 24th, 2017

Office Hours: Tuesdays 5-6 in Levine 512. Sign up to present HW 1.

Homework 2 due date moved to Jan 31st. Does everyone have a group? Has everyone asked for permission for an AWS p2.xlarge instance?

Optimization of Neural Networks: an exploration in the dark arts.

Minibatch
- Shuffling the data is important.
- Minibatches are processed in parallel (usually on a GPU), typically in powers of 2.
- Do minibatches act as a regularizer?
- Why can't we use batch sizes of ~10k, which would allow second-order derivatives? Newton's method, or its saddle-free augmentation, is still not practical, but maybe soon.
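As a concrete illustration of the points above, here is a minimal NumPy sketch of shuffled minibatch iteration; the batch size of 128 and the helper names in the usage comment are illustrative, not from the slides.

```python
import numpy as np

def minibatches(X, y, batch_size=128, seed=0):
    """Yield shuffled minibatches from numpy arrays X and y
    (a power-of-two batch size is a common default)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))              # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# Usage for one epoch (compute_gradients / apply_update are hypothetical helpers):
# for X_batch, y_batch in minibatches(X_train, y_train):
#     grads = compute_gradients(model, X_batch, y_batch)
#     apply_update(model, grads)
```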

Minibatch
1. I have usually seen projects using a minibatch size of 128. Is there any intuition behind this value? In general, how should we choose the minibatch size given our system configuration (GPUs, memory, etc.)?
2. If no memory or time constraints are present, do second-order optimization algorithms work better than first-order ones? I had read that in practice first-order algorithms work fairly well even for small problems. Are there certain classes of problems in which one works better than the other?
3. Datasets are growing rapidly in size compared to computing power, so not all of the training data gets used. Since the training data is chosen stochastically, is there still value in data augmentation as a tool for regularization?

Learning rate and momentum

Learning rate and momentum What's the interaction between batch size and learning rate/momentum? How would you adjust the learning rate, if at all?
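For reference, a minimal sketch of the classical SGD-with-momentum update that these questions refer to (NumPy; the default hyperparameters are illustrative):

```python
import numpy as np

def sgd_momentum_step(w, grad, velocity, lr=0.01, momentum=0.9):
    """Classical momentum: accumulate a velocity from past gradients, then step."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

# Usage (illustrative):
# w, v = np.zeros(10), np.zeros(10)
# w, v = sgd_momentum_step(w, some_gradient, v)
```

One common observation (not from the slides): larger batches give lower-variance gradient estimates, which is one reason batch size, learning rate, and momentum are usually tuned together.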

Parameter initialization
- The optimization perspective suggests that the weights should be large enough to propagate information successfully, but regularization concerns encourage making them smaller.
- SGD tends to keep the parameters near their initialization point.
- Why is gradient descent with early stopping similar to weight decay? Is this L2, by the way? (See the sketch below.)
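On the early stopping / weight decay question: under the usual quadratic approximation of the loss around the optimum (the argument in the textbook's regularization chapter), the connection can be sketched as follows; the symbols here belong to that derivation, not to the slides.

```latex
% Let \epsilon be the learning rate, \tau the number of gradient steps before
% stopping, \lambda_i the Hessian eigenvalues at the optimum, and \alpha the
% L2 (weight decay) coefficient.
% Early stopping shrinks the i-th eigendirection of the weights by
%   1 - (1 - \epsilon\lambda_i)^{\tau},
% while L2 regularization shrinks it by
%   \lambda_i / (\lambda_i + \alpha).
% For small \epsilon\lambda_i and small \alpha/\lambda_i these match when
\[
  \tau \approx \frac{1}{\epsilon\,\alpha}
  \qquad\Longleftrightarrow\qquad
  \alpha \approx \frac{1}{\tau\,\epsilon},
\]
% so stopping earlier acts like a stronger L2 penalty, which is the sense in
% which early stopping resembles L2-style weight decay.
```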

Parameter initialization
- m – number of inputs, n – number of outputs.
- Sparse initialization for deep, wide networks.
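If the m (inputs) and n (outputs) here refer to the normalized "Glorot/Xavier" scheme discussed in the textbook, a minimal sketch looks like the following; the function name and seed handling are illustrative.

```python
import numpy as np

def glorot_uniform(m, n, seed=0):
    """Normalized (Glorot/Xavier) initialization for a layer with m inputs and
    n outputs: W ~ Uniform(-sqrt(6/(m+n)), +sqrt(6/(m+n))), chosen so that
    activation and gradient variances stay roughly constant across layers."""
    limit = np.sqrt(6.0 / (m + n))
    rng = np.random.default_rng(seed)
    return rng.uniform(-limit, limit, size=(n, m))  # one row per output unit

# Biases are usually initialized to zero (or a small constant) rather than randomly.
```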

Parameter initialization
1. Why are we not randomly initializing the biases? In initializing weights, what are the heuristics for deciding the scale of the distribution?
2. When discussing normalization in weight initialization, the book comments that a certain choice of initialization is designed to achieve equal activation variance and equal gradient variance at each layer. What does this mean?
3. What does it mean to initialize parameters as an orthogonal matrix?
4. Why do we not initialize the bias randomly: would it affect the model, or is it overkill? Also, setting it to zero would not always be right, since it depends on the activation function. For example, for a sigmoid the value at x=0 is 0.5; does that mean in this case we need to initialize it with a large negative value?
5. Is sparse initialization of the weights akin to dropout at initialization time, and does it also help with regularization? I also do not fully understand how having a high bias at initialization time (which sparse initialization introduces) affects maxout units.

Optimizers What was your favorite and why?

Optimizers
- AdaGrad – the learning rate is scaled inversely proportional to the square root of the sum of all historical squared gradient values.
- RMSProp – like AdaGrad, but replaces the sum over the entire gradient history with an exponentially weighted moving average.
- Adam – a combination of RMSProp and momentum, with a few important distinctions (notably bias correction).
- Conjugate gradients
- Newton's Method – with finite memory
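To make the Adam bullet concrete, here is a minimal NumPy sketch of a single Adam update (the standard default hyperparameters are shown; variable names are illustrative):

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: an exponentially weighted first moment (momentum) and an
    RMSProp-style second moment, both bias-corrected because m and v start at zero."""
    m = beta1 * m + (1 - beta1) * grad        # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment estimate
    m_hat = m / (1 - beta1 ** t)              # bias correction (t counts from 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

The bias correction here is the "bias" asked about below: it compensates for the zero initialization of m and v, not for the network's bias parameters.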

Conjugate gradient
- There is an alternative to reaching the minimum in one step by multiplying by the inverse of the curvature matrix: use a sequence of steps, each of which finds the minimum along one direction.
- Make sure that each new direction is “conjugate” to the previous directions, so you do not mess up the minimization you already did.
- “Conjugate” means that as you go in the new direction, you do not change the gradients in the previous directions.
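In symbols (a standard statement, not from the slide): for a quadratic objective with Hessian H, directions d_i and d_j are conjugate when

```latex
% Conjugacy of search directions with respect to the Hessian H:
\[
  d_i^{\top} H\, d_j = 0 \qquad \text{for } i \neq j,
\]
% so a line minimization along d_j leaves the gradient component along every
% earlier direction d_i at zero, and previous minimizations are not undone.
```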

A picture of conjugate gradient The gradient in the direction of the first step is zero at all points on the green line. So if we move along the green line we don’t mess up the minimization we already did in the first direction.

Optimizers
1. What is the motivation behind Adam?
2. Why isn't Adam better than AdaGrad, given that it is basically RMSProp with momentum, and RMSProp is supposedly a modified AdaGrad that works better on non-convex models like those used in deep learning?
3. Could we know under which conditions which optimization algorithm might work better?
4. Where does the bias that Adam corrects for come from? Is it the bias introduced by the initial values assigned to the weights and the "bias" at initialization?

Gamma and beta are learned – beta is initialized to 0 and gamma to 1. They are different for every layer. The representations remain general.

Batch Normalization
“One of the most important methods in deep learning in the last five years.”
- Faster training and more accurate
- Works with a constant learning rate
- Enables larger networks
- Can train sigmoid networks without pre-training
- Expensive for recurrent neural networks? Is there a relationship between batch normalization and dropout?
- Layer normalization – normalizes along the units instead of the batch
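A minimal NumPy sketch of the training-time batch-norm forward pass, using the gamma/beta parameters from the previous slide; inference with running averages of the batch statistics is omitted.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization for a minibatch x of shape (batch, features):
    normalize each feature to zero mean and unit variance over the batch,
    then rescale with the learned gamma (init 1) and shift with beta (init 0)."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta
```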

Batch normalization
1. The theory and experiments show that batch normalization smooths and speeds up the learning process. Does it also increase accuracy? If so, why?
2. Since in batch normalization we compute H' on every minibatch rather than on the entire dataset, is it similar to adding random noise to the hidden layers? Would this lead to regularization as well?
3. In batch normalization, the input to each layer is whitened first (zero mean, unit standard deviation). I have read that whitening can exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant, tiny-variance dimensions that are mostly noise) to be of equal size. Won't that create a problem?

Batch normalization
1. Why are the standard deviations of the gradients and activations important?
2. Why does normalizing solve the problem of many layers and of simultaneous updates? And isn't it more complicated to train the lower layers, since we now have to take the derivative through the standard deviation and mean?
3. Why do we do it over minibatches and not over the whole dataset?
4. How does this relate to the activation function?

Next lecture: Chapter 10, Recurrent neural networks! Please read chapter 10 very carefully.

AWS – headache Where is everyone on AWS? Do you have access to p2.xlarge? Please pay attention to cost!!!!

Questions
1. Since in batch normalization we compute H' on every minibatch rather than on the entire dataset, is it similar to adding random noise to the hidden layers? Would this lead to regularization as well?
2. Is there any algorithm for how to divide features into blocks in block coordinate descent?
3. In batch normalization, the input to each layer is whitened first (zero mean, unit standard deviation). I have read that whitening can exaggerate the noise in the data, since it stretches all dimensions (including the irrelevant, tiny-variance dimensions that are mostly noise) to be of equal size. Won't that create a problem?

Questions
1. Recently, Adam has become quite popular for optimizing deep learning models. However, the book says there is no consensus on which optimization algorithm to choose. Could we know under which conditions which optimization algorithm might work better?
2. As the book states, batch normalization is one of the most exciting innovations in the deep learning community. The theory and experiments show that it smooths and speeds up the learning process. Does it also increase accuracy? If so, why?
3. The book illustrates greedy layer-wise supervised pre-training in detail, as it is one of the main algorithms that made deep networks work before 2011. Is this still important? How do we interpret this idea today?

Questions
- Why are we not randomly initializing the biases? In initializing weights, what are the heuristics for deciding the scale of the distribution?
- Why are the standard deviations of the gradients and activations important?
- Why does normalizing solve the problem of many layers and of simultaneous updates? And isn't it more complicated to train the lower layers, since we now have to take the derivative through the standard deviation and mean? Why do we do it over minibatches and not over the whole dataset?

Questions
1. Section 8.5.4 says “there is currently no consensus on this point” about choosing the right optimization algorithm. Should we try different algorithms on a small dataset and validate against the test set, since our ultimate goal is to predict well on the test set rather than overfit the training data? I have also heard that keeping the learning rate constant does not hurt much, since our main concern is not to overfit; a very good convergence in training error might mean overfitting. Is that true?
2. Why do we not initialize the bias randomly: would it affect the model, or is it overkill? Also, setting it to zero would not always be right, since it depends on the activation function. For example, for a sigmoid the value at x=0 is 0.5; does that mean in this case we need to initialize it with a large negative value?
3. I have usually seen projects using a minibatch size of 128. Is there any intuition behind this value? In general, how should we choose the minibatch size given our system configuration (GPUs, memory, etc.)?

Questions
1) Is sparse initialization of the weights akin to dropout at initialization time, and does it also help with regularization? I also do not fully understand how having a high bias at initialization time (which sparse initialization introduces) affects maxout units.
2) Where does the bias that Adam corrects for come from? Is it the bias introduced by the initial values assigned to the weights and the "bias" at initialization?
3) If no memory or time constraints are present, do second-order optimization algorithms work better than first-order ones? I had read that in practice first-order algorithms work fairly well even for small problems. Are there certain classes of problems in which one works better than the other?

Questions
- When discussing normalization in weight initialization, the book comments that a certain choice of initialization is designed to achieve equal activation variance and equal gradient variance at each layer. What does this mean?
- Why is gradient descent with early stopping similar to weight decay? Is this L2, by the way?
- What is the motivation behind Adam?

Questions
1. What does it mean to initialize parameters as an orthogonal matrix? Does this mean that all hidden layers have the same number of hidden units, and thus that the numbers of inputs and outputs need to be the same?
2. In the batch and minibatch section, the book mentions how datasets are growing rapidly in size compared to computing power, so not all of the training data is used. Since the training data is chosen stochastically, is there still value in data augmentation as a tool for regularization?
3. Why isn't Adam better than AdaGrad, given that it is basically RMSProp with momentum, and RMSProp is supposedly a modified AdaGrad that works better on non-convex models like those used in deep learning?