Bayesian Neural Networks


Bayesian Neural Networks Tobias Williges 3/11/10

Outline:
Neural network overview, mostly taken from: Neal, R. M. (1994) Bayesian Learning for Neural Networks, Ph.D. Thesis, Dept. of Computer Science, University of Toronto, 195 pages.
Papers:
Neal, R. M. (1990) "Learning stochastic feedforward networks", Technical Report CRG-TR-90-7, Dept. of Computer Science, University of Toronto, 34 pages.
Neal, R. M. (1992) "Bayesian training of backpropagation networks by the hybrid Monte Carlo method", Technical Report CRG-TR-92-1, Dept. of Computer Science, University of Toronto, 21 pages.
M. Crucianu, R. Boné, J.-P. Asselin de Beauville (2000) "Bayesian Learning in Neural Networks for Sequence Processing".
Some analysis
Future paths

Basic Neural network This is a basic multilayer perceptron (or backpropagation) network. Feedforward networks do not form cycles, so the outputs can be computed in a single forward pass. The inputs (x) are used to model the outputs (y). Here u_ij denotes the weight of the connection from input unit i to hidden unit j, and v_jk the weight of the connection from hidden unit j to output unit k; A and B are the biases of the hidden and output units. In this example the hidden activation is the hyperbolic tangent; other functions could be used, but if the activation is linear the hidden layer is unnecessary and one could connect the inputs directly to the outputs.
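A minimal sketch of that forward pass, assuming one hidden layer of tanh units and linear output units (the array names U, a, V, b and the layer sizes below are just illustrative, not from the slide):

```python
import numpy as np

def mlp_forward(x, U, a, V, b):
    """One forward pass: x inputs, U input-to-hidden weights, a hidden biases,
    V hidden-to-output weights, b output biases."""
    h = np.tanh(U @ x + a)   # hidden unit values (tanh activation)
    return V @ h + b         # linear output units

# Example with 5 inputs, 3 hidden units, 2 outputs.
rng = np.random.default_rng(0)
U, a = rng.normal(size=(3, 5)), rng.normal(size=3)
V, b = rng.normal(size=(2, 3)), rng.normal(size=2)
print(mlp_forward(rng.normal(size=5), U, a, V, b))
```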

Basics continued Neural networks are commonly used either for fitting data (regression) or for classification. One then minimizes the squared error, or some other error function, on the training set, or equivalently maximizes the likelihood.
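As a reminder of why these amount to the same thing (a standard identity, not taken from the slides): with Gaussian output noise of fixed variance sigma^2,

```latex
E(\theta) = \sum_{i=1}^{n} \left\| y_i - f(x_i;\theta) \right\|^2 ,
\qquad
\log p(y_{1:n} \mid x_{1:n}, \theta) = -\frac{1}{2\sigma^2}\, E(\theta) + \text{const},
```

so minimizing the squared error and maximizing the likelihood pick out the same weights.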

Hidden Nodes There is a theorem saying that with enough hidden nodes only one hidden layer is needed, but we still need to choose the number of hidden nodes. Some method is then needed to avoid overfitting from having too many hidden nodes (DIC, BIC, or some sort of weight-decay penalty on the likelihood).
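For reference, the BIC penalty mentioned here (standard formula) trades the maximized likelihood against the number of weights k and the number of training cases n:

```latex
\mathrm{BIC} = k \ln n - 2 \ln \hat{L},
```

Smaller is better, so each extra hidden node must improve the fit enough to pay for the additional k ln n it costs.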

Main Objective The main objective of Bayesian neural networks is to be able to predict new "test" points. In feedforward networks, where the distribution of the inputs is not modeled, the likelihood is the probability of the training outputs given the training inputs and the network weights.
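The equations for this slide are reconstructed here in their standard forms (the symbols theta for the weights and D for the training data are my notation, not the slide's):

```latex
L(\theta) = \prod_{i=1}^{n} p\left(y_i \mid x_i, \theta\right),
\qquad
p\left(y_{\mathrm{new}} \mid x_{\mathrm{new}}, \mathcal{D}\right)
  = \int p\left(y_{\mathrm{new}} \mid x_{\mathrm{new}}, \theta\right)\,
        p\left(\theta \mid \mathcal{D}\right)\, d\theta .
```

Prediction at a test point therefore averages the network's output over the posterior on the weights instead of using a single fitted weight vector.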

Neal, R. M. (1990) "Learning stochastic feedforward networks", Technical Report CRG-TR-90-7, Dept. of Computer Science, University of Toronto. This paper compares Boltzmann machines to neural networks. I took from this paper that Boltzmann machines were the predecessor to these networks.

What is a Boltzmann Machine? It is some fixed number of two-valued units linked by symmetric connections; the two values are either 0 and 1 or -1 and 1. Each configuration of units has an energy, and the probability of a configuration falls off exponentially with its energy, which makes low-energy states more probable. The value of each unit is picked using the weights w, which take into account the state of every other unit.
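For reference, the usual definitions behind that description (standard forms rather than a quote from the slide; s_i are the unit values, w_ij the symmetric weights, T a temperature):

```latex
E(s) = -\sum_{i<j} w_{ij}\, s_i s_j ,
\qquad
P(s) \propto e^{-E(s)/T} ,
\qquad
P(s_i = 1 \mid s_{-i}) = \sigma\!\left(\frac{1}{T}\sum_{j \neq i} w_{ij}\, s_j\right),
```

where sigma is the logistic function; the last expression is how each unit's value is picked while taking every other unit into account.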

Boltzmann Machine continued Learning has a positive and a negative phase: the positive phase increases weights and the negative phase decreases them, and one runs this procedure until the weights reach equilibrium. The positive phase follows the gradient with the visible units clamped to the data, while the negative phase uses the unconditional (free-running) distribution of the network. The negative phase makes learning slower and more sensitive to statistical errors.
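The learning rule this describes is, in its textbook form (not quoted from the slide),

```latex
\Delta w_{ij} \;\propto\; \langle s_i s_j \rangle_{\mathrm{data}} - \langle s_i s_j \rangle_{\mathrm{model}} ,
```

with the positive-phase average taken with the visible units clamped to training cases and the negative-phase average taken with the network running freely; estimating that second term by sampling is what makes learning slow and noisy.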

Results:

Neal, R. M. (1992) "Bayesian training of backpropagation networks by the hybrid Monte Carlo method", Technical Report CRG-TR-92-1, Dept. of Computer Science, University of Toronto. This paper was trying to motivate the hybrid Monte Carlo method for training neural networks. In this paper a weight-decay method is used to decide the number of hidden nodes. The paper also makes heavy use of annealing, which is a technique for exploring a space in search of good optima.

Model If one assumes everything to be Gaussian then the probabilities stay Gaussian, but this paper is trying to avoid that assumption, which means Gibbs steps are out. Usually one could use Metropolis-Hastings, but this paper wants to use the hybrid method instead. For a Monte Carlo method, the options discussed in the paper are to change either one weight at a time or all the weights simultaneously. The paper says the weights interact strongly, so changing one weight at a time does not make much sense and is expensive to compute; on the other hand, if one tries to change all the weights at once, one must take very small steps or the proposal will almost always be rejected.

Hybrid Monte Carlo method The hybrid method uses gradient information to pick candidate directions that have a higher probability of being accepted. This eliminates much of the random-walk behaviour of the Metropolis algorithm and speeds up the exploration of the space.
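A minimal sketch of one hybrid (Hamiltonian) Monte Carlo update of the weight vector, assuming we can evaluate the log posterior and its gradient; the function names, step size, and trajectory length here are illustrative choices, not the paper's settings:

```python
import numpy as np

def hmc_step(w, log_post, grad_log_post, eps=0.01, n_leapfrog=20,
             rng=np.random.default_rng()):
    """One hybrid Monte Carlo update of the weights w."""
    p = rng.normal(size=w.shape)                 # fresh momentum
    w_new, p_new = w.copy(), p.copy()
    # Leapfrog integration of the Hamiltonian dynamics (gradient guides the move).
    p_new += 0.5 * eps * grad_log_post(w_new)
    for _ in range(n_leapfrog - 1):
        w_new += eps * p_new
        p_new += eps * grad_log_post(w_new)
    w_new += eps * p_new
    p_new += 0.5 * eps * grad_log_post(w_new)
    # Metropolis accept/reject on the change in total energy.
    h_old = -log_post(w) + 0.5 * p @ p
    h_new = -log_post(w_new) + 0.5 * p_new @ p_new
    return w_new if np.log(rng.uniform()) < h_old - h_new else w
```

Because the trajectory follows the gradient, even long moves tend to land in regions of high posterior probability and get accepted, which is the advantage over random-walk Metropolis described above.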

Results: The paper said that the plain Metropolis algorithm did much worse and had not converged in the same time. The table is for the hybrid case: for each spot on the table they ran 9 tests, and the number of failures is what is recorded (so the bigger the square, the worse it did). Here v is the step size and ℰ is a variable that affects the convergence speed. They concluded that the hybrid Monte Carlo method was better and that annealing helped.

M. Crucianu, R. Boné, J.-P. Asselin de Beauville (2000) "Bayesian Learning in Neural Networks for Sequence Processing". This is work on non-feedforward neural networks, an attempt to do something along the lines of time series with neural networks. They say the model can be thought of as a feedforward network with time-delayed inputs.
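As a concrete picture of the "feedforward network with time-delayed inputs" view, here is a sketch of building lagged inputs from a series (my own illustration, not code from the paper):

```python
import numpy as np

def lagged_design(series, n_lags):
    """Turn a 1-D series into (lagged-input, target) pairs for a feedforward net."""
    X = np.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

series = np.sin(np.linspace(0, 10, 200))
X, y = lagged_design(series, n_lags=3)   # row t: (x_t, x_{t+1}, x_{t+2}) -> target x_{t+3}
```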

Model:

Prior Scheme They also give a scheme for generating a prior: they generate a synthetic data set from their prior knowledge and then use that data set to construct the prior. I believe this is their way of avoiding improper priors.

Super basic: One X_i for each Y_i, with clear separation. I used one hidden node.

Stepped up There is still only one X_i per Y_i, but now there are 5 groups. The picture below is with one hidden node. I tried it again with 2 hidden nodes, but, as one would expect, it showed almost no improvement.

Harder case For this case I made an X matrix with 5 columns for each Y, with 6 groups. Below is the data with the first 2 X columns plotted.

Harder continued For one hidden node one has to fit 18 weights and for each hidden node after that one has to fit 12 more weights. One can see that using BIC or something else is going to restrict the number of hidden nodes.
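Those counts are consistent with a network having 5 inputs, 6 output units (one per group), and biases on the hidden and output units; under that reading the number of weights for H hidden nodes is

```latex
k(H) = \underbrace{H\,(5+1)}_{\text{input weights + hidden biases}}
     + \underbrace{6\,(H+1)}_{\text{output weights + output biases}}
     = 12H + 6 ,
```

which gives 18 weights for one hidden node and 12 extra weights per additional node, matching the figures above.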

Going from here Well, I would first have to get this harder model working. The first paper needed the Bayesian machinery, which the second paper supplied; the second paper needed to handle more time-series-like problems, which the third paper addressed. From here one would want to run more test cases to see how much the time-series method gains over the non-time-series method, both when the data actually have that temporal structure and when they do not. Comparing speeds would also be good.