Deep Feedforward Networks

Deep Feedforward Networks, Sections 6 ~ 6.2
Kyunghyun Lee, 2016-10-17

Contents
- Introduction
- Example: Learning XOR
- Gradient-Based Learning
  - Cost Functions
  - Output Units

1. Introduction
Deep Feedforward Networks
- A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best approximation of the target function.
- Feedforward: information flows forward from the input x to the output f(x); there are no feedback connections.
- Network: the model is built by composing together many different functions, e.g. f(x) = f^(3)(f^(2)(f^(1)(x))).

1. Introduction
Deep Feedforward Networks
- Depth: the length of the chain of composed functions gives the depth of the model.
- The first layer is the input layer, the final layer is the output layer, and the layers in between (e.g. the second layer) are hidden layers.

1. Introduction
Deep Feedforward Networks
Hidden layers
- The training examples specify directly what the output layer must do at each input x.
- The behavior of the other layers (the hidden layers) is not directly specified by the training data.
- The learning algorithm must decide how to use these layers to best implement an approximation of the target function.
- Each hidden layer of the network is typically vector-valued; the dimensionality of these hidden layers determines the width of the model.

2. Example: Learning XOR
Learning the XOR function with a linear model
- Loss function (mean squared error): J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))^2
- Linear model: f(x; w, b) = x^T w + b
- Target function (XOR truth table):
  x1 x2 | f*(x)
   0  0 |  0
   0  1 |  1
   1  0 |  1
   1  1 |  0
- Learning result: solving the normal equations gives w = 0 and b = 1/2, so the model's output is always 0.5! A linear model cannot represent XOR.
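
A quick way to see this numerically is to solve the least-squares problem directly. The following sketch (NumPy; variable names are my own, not from the slides) fits the linear model to the four XOR points and confirms that every prediction is 0.5.

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # the four XOR inputs
y = np.array([0., 1., 1., 0.])                          # XOR targets

# Append a constant column so the bias b is learned jointly with w.
X_aug = np.hstack([X, np.ones((4, 1))])
theta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w, b = theta[:2], theta[2]

print(w, b)            # w is approximately [0, 0], b is 0.5
print(X_aug @ theta)   # every prediction equals 0.5
```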

2. Example: Learning XOR
Add a different feature space: h space
- Introduce a hidden layer h that provides a learned representation of x.
- Model: h = f^(1)(x; W, c) and y = f^(2)(h; w, b), so the complete model is f(x; W, c, w, b) = f^(2)(f^(1)(x)).

2. Example: Learning XOR
Add a different feature space: h space
- We must use a nonlinear function to describe the features; otherwise the composed model is still linear in x.
- Activation function: the rectified linear unit, or ReLU, g(z) = max{0, z}, applied element-wise, so h = g(W^T x + c).

2. Example: Learning XOR
Add a different feature space: h space
- Solution: let W = [[1, 1], [1, 1]], c = (0, −1)^T, w = (1, −2)^T, b = 0.
- Calculation: for the batch X of all four inputs, XW + c = [[0, −1], [1, 0], [1, 0], [2, 1]].
- Activation function: applying the ReLU gives h = [[0, 0], [1, 0], [1, 0], [2, 1]].
- Multiplying by w: h w + b = (0, 1, 1, 0)^T, the correct XOR outputs.
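
As a check, the hand-specified solution above can be evaluated on all four inputs in a few lines. This is a NumPy sketch of the same forward pass, not a trained model:

```python
import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])  # all four XOR inputs
W = np.array([[1., 1.], [1., 1.]])
c = np.array([0., -1.])
w = np.array([1., -2.])
b = 0.0

h = np.maximum(0.0, X @ W + c)   # ReLU hidden layer: h = g(x W + c)
f = h @ w + b                    # linear output layer
print(f)                         # [0. 1. 1. 0.]  -> exactly XOR
```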

2. Example: Learning XOR
Conclusion
- Here we simply specified the solution and then showed that it obtains zero error.
- In a real situation there might be billions of model parameters and billions of training examples, so we cannot simply guess the solution.
- Instead, a gradient-based optimization algorithm can find parameters that produce very little error.

3. Gradient-Based Learning
Neural networks vs. linear models
- Designing and training a neural network is not much different from training any other machine learning model with gradient descent.
- The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become non-convex.

3. Gradient-Based Learning
Non-convex optimization
- Unlike the convex case, gradient descent on a non-convex loss has no such convergence guarantee and is sensitive to the values of the initial parameters.
- For feedforward neural networks, it is important to initialize all weights to small random values.
- Biases may be initialized to zero or to small positive values.
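
A minimal sketch of this initialization advice for a single fully connected layer is shown below; the layer sizes and the 0.01 scale are illustrative assumptions, not values prescribed by the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 784, 128                         # example layer sizes (assumed)
W = 0.01 * rng.standard_normal((n_in, n_out))  # small random weights
b = np.zeros(n_out)                            # zero biases; a small positive
                                               # constant is another option
```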

3. Gradient-Based Learning
Gradient descent
- We already train models such as linear regression and support vector machines with gradient descent; the same machinery applies to neural networks.
- To apply gradient-based learning we must choose a cost function.
- We must also choose how to represent the output of the model.

3.1 Cost Functions
- An important aspect of the design of a deep neural network is the choice of the cost function.
- In most cases the parametric model defines a distribution p(y | x; θ) and we simply use maximum likelihood.
- Alternatively, instead of predicting a full distribution we may predict only some statistic of y conditioned on x, which calls for specialized loss functions.

3.1 Cost Functions
Learning Conditional Distributions with Maximum Likelihood
- Most modern neural networks are trained using maximum likelihood.
- The cost function is simply the negative log-likelihood, i.e. the cross-entropy between the training data and the model distribution: J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).
- This removes the burden of designing cost functions for each model: specifying the model p(y | x) automatically determines the cost function.
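
As a toy illustration of this cost, the sketch below assumes the model already outputs a probability for each observed label; the function name is my own, not from the slides.

```python
import numpy as np

def negative_log_likelihood(p_observed):
    # p_observed[i] = p_model(y_i | x_i), the probability the model assigns
    # to the label actually observed for example i.
    return -np.mean(np.log(p_observed))

# The cost is lower when the model assigns higher probability to the data.
print(negative_log_likelihood(np.array([0.9, 0.8, 0.95])))
print(negative_log_likelihood(np.array([0.5, 0.5, 0.5])))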

3.1 Cost Functions
Learning Conditional Statistics
- Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x, e.g. the mean of y.
- Two results can be derived using calculus of variations (Section 19.4.2):
- Mean squared error: solving f* = argmin_f E_{x,y∼p_data} ||y − f(x)||^2 yields f*(x) = E_{y∼p_data(y|x)}[y], i.e. the conditional mean of y for each x.
- Mean absolute error: solving f* = argmin_f E_{x,y∼p_data} ||y − f(x)||_1 yields a function that predicts the conditional median of y for each x.

3.2 Output Units
- The choice of cost function is tightly coupled with the choice of output unit.
- Suppose that the feedforward network provides a set of hidden features h = f(x; θ).
- The role of the output layer is then to provide some additional transformation from the features h to complete the task the network must perform.

3.2 Output Units
Linear Units for Gaussian Output Distributions
- Given features h, a layer of linear output units produces a vector ŷ = W^T h + b.
- Linear output layers are often used to produce the mean of a conditional Gaussian distribution, p(y | x) = N(y; ŷ, I).
- Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.
- Because linear units do not saturate, they may be used with gradient-based optimization and a wide variety of optimization algorithms.
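
A sketch of such an output layer and its cost follows, assuming identity covariance so that the negative log-likelihood reduces to mean squared error up to a constant; shapes and names are illustrative.

```python
import numpy as np

def linear_output(h, W, b):
    # y_hat = W^T h + b: the mean of the conditional Gaussian p(y | x).
    return h @ W + b

def gaussian_nll_cost(y_hat, y):
    # With identity covariance, -log N(y; y_hat, I) = 0.5 * ||y - y_hat||^2
    # plus a constant, so minimizing it minimizes the mean squared error.
    return 0.5 * np.mean(np.sum((y - y_hat) ** 2, axis=-1))

h = np.ones((4, 3)); W = np.zeros((3, 2)); b = np.zeros(2)
y = np.zeros((4, 2))
print(gaussian_nll_cost(linear_output(h, W, b), y))  # 0.0 for a perfect fit
```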

3.2 Output Units
Sigmoid Units for Bernoulli Output Distributions
- Used for predicting the value of a binary variable y, e.g. classification problems with two classes.
- The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x; the neural net needs to predict only P(y = 1 | x).
- For this number to be a valid probability, it must lie in the interval [0, 1].
- A linear unit with a threshold, P(y = 1 | x) = max{0, min{1, w^T h + b}}, satisfies the constraint, but any time w^T h + b strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be 0, so gradient descent could make no progress.

3.2 Output Units
Sigmoid Units for Bernoulli Output Distributions
- A better approach: sigmoid output units combined with maximum likelihood.
- Sigmoid output unit: ŷ = σ(z), where z = w^T h + b.
- Probability distribution: P(y) = σ((2y − 1) z) for y ∈ {0, 1}.
- Assumption behind this form: the unnormalized log probabilities are linear in y and z, i.e. log P̃(y) = y z.

3.2 Output Units
Sigmoid Units for Bernoulli Output Distributions
- Cost function with the sigmoid approach: the cost function used with maximum likelihood is the negative log-likelihood −log P(y | x).
- Loss function: J(θ) = −log σ((2y − 1) z) = ζ((1 − 2y) z), where ζ(x) = log(1 + exp(x)) is the softplus.
- This loss saturates only when the model already has the right answer, so gradient-based learning remains available.
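
The softplus form of this loss can be written directly in terms of the logit z, which is numerically more stable than computing −log σ(z) naively. A small sketch, with function names of my own:

```python
import numpy as np

def softplus(x):
    # zeta(x) = log(1 + exp(x)), computed without overflow
    return np.logaddexp(0.0, x)

def bernoulli_nll(z, y):
    # J(theta) = zeta((1 - 2y) z) = -log P(y | x) for P(y=1 | x) = sigmoid(z)
    return softplus((1 - 2 * y) * z)

# Confident correct predictions give a small loss; confident wrong ones do not.
print(bernoulli_nll(np.array([5., -5.]), np.array([1, 0])))  # small
print(bernoulli_nll(np.array([5., -5.]), np.array([0, 1])))  # large
```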

3.2 Output Units
Softmax Units for Multinoulli Output Distributions
- A generalization of the Bernoulli distribution: a probability distribution over a discrete variable with n possible values.
- Softmax function: softmax(z)_i = exp(z_i) / Σ_j exp(z_j), where z = W^T h + b and z_i = log P̃(y = i | x).
- Often used as the output of a classifier to represent the distribution over n classes.

3.2 Output Units
Softmax Units for Multinoulli Output Distributions
- The log softmax works well with maximum likelihood: log softmax(z)_i = z_i − log Σ_j exp(z_j).
- First term: the input z_i always has a direct contribution to the cost function, so this term cannot saturate.
- Second term: log Σ_j exp(z_j) is roughly approximated by max_j z_j, so the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction; for a correct answer the two terms roughly cancel.
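
A numerically stable implementation of the softmax and log-softmax, following the decomposition above; the shift by max(z) does not change the result because softmax is invariant to adding a constant to all inputs.

```python
import numpy as np

def log_softmax(z):
    z = z - np.max(z)                     # shift: softmax is unchanged, but
    return z - np.log(np.sum(np.exp(z)))  # exp can no longer overflow

def softmax(z):
    return np.exp(log_softmax(z))

z = np.array([1.0, 2.0, 5.0])
print(softmax(z), softmax(z).sum())        # probabilities that sum to 1
print(log_softmax(np.array([1000., 0.])))  # stable even for extreme inputs
```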

3.2 Output Units
Softmax Units for Multinoulli Output Distributions
Softmax saturation: when the differences between input values become extreme,
- softmax(z)_i saturates to 1 when its input is maximal and much greater than all of the other inputs,
- and saturates to 0 when its input is not maximal and the maximum is much greater.
- If the loss function is not designed to compensate for softmax saturation, saturation can cause similar difficulties for learning.

3.2 Output Units
Softmax Units for Multinoulli Output Distributions
- The softmax argument z can be produced in two different ways; the most common is simply a linear layer, z = W^T h + b.
- This approach actually overparametrizes the distribution: the constraint that the n outputs must sum to 1 means that only n − 1 parameters are necessary.
- Alternatively, impose a requirement that one element of z be fixed.
- Example: the sigmoid is exactly a softmax with a two-dimensional z where one element is fixed at 0 (see the sketch below).
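
A quick numerical check of that last claim (a sketch; function names are my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = 1.3
print(sigmoid(z))                      # 0.785...
print(softmax(np.array([z, 0.0]))[0])  # the same value
```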

3.2 Output Units
Other Output Types
- In general, we can think of the neural network as representing a function f(x; θ).
- The outputs of this function are not direct predictions of the value y; instead, they provide the parameters for a distribution over y.

3.2 Output Units
Other Output Types
Example: learning the variance
- We may wish to learn the variance of a conditional Gaussian for y, given x.
- The negative log-likelihood then provides a cost function that makes the optimization procedure incrementally learn the variance along with the mean.

3.2 Output Units
Other Output Types
Example: mixture density networks
- Neural networks with Gaussian mixtures as their output.
- The neural network must have three kinds of outputs:
  - a vector defining the mixture components p(c = i | x),
  - a matrix providing the means μ^(i)(x),
  - and a tensor providing the covariances Σ^(i)(x).

3.2 Output Units
Other Output Types
Example: mixture density networks
- Mixture components: a multinoulli distribution over the n different components; it can typically be obtained by a softmax over an n-dimensional vector.
- Means: μ^(i)(x) indicates the center associated with the i-th Gaussian component.
- Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode.

3.2 Output Units
Other Output Types
Example: mixture density networks
- Covariances: specify the covariance matrix Σ^(i)(x) for each component i.
- As with the means, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component.
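
Below is a minimal sketch of an output layer that produces these three quantities from a hidden vector h. The layer sizes, random weights, and the diagonal-covariance simplification are my assumptions for illustration, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, n_comp, d_y = 16, 3, 2   # hidden size, mixture components, dim of y (assumed)

# Randomly initialized output-layer weights, for illustration only.
W_pi = 0.1 * rng.standard_normal((d_h, n_comp))
W_mu = 0.1 * rng.standard_normal((d_h, n_comp * d_y))
W_sigma = 0.1 * rng.standard_normal((d_h, n_comp * d_y))

def mdn_outputs(h):
    logits = h @ W_pi
    pi = np.exp(logits - logits.max())
    pi = pi / pi.sum()                                # softmax -> mixture weights
    mu = (h @ W_mu).reshape(n_comp, d_y)              # one mean vector per component
    sigma = np.exp(h @ W_sigma).reshape(n_comp, d_y)  # positive per-dim std devs
    return pi, mu, sigma

pi, mu, sigma = mdn_outputs(rng.standard_normal(d_h))
print(pi.sum(), mu.shape, sigma.shape)                # 1.0 (3, 2) (3, 2)
```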