1
Artificial neural networks
Ricardo Ñanculef Alegría, Universidad Técnica Federico Santa María, Campus Santiago
2
Learning from Natural Systems
Bio-inspired systems: ant colonies, genetic algorithms, artificial neural networks
The power of the brain. Examples: vision, text processing
Other animals: dolphins, bats
Neural networks are traditionally described with a bio-inspired model. Bio-inspired computing systems are those based on the observation and study of natural systems, with the goal of obtaining computing and information-processing systems that imitate them and inherit some of their essential characteristics. As the name suggests, neural networks arise as a simplified model of how the brain works; the brain has many characteristics that are desirable in a computing machine and that are not present in the traditional Von Neumann model of computation. Some of these characteristics are: parallel computation (a large number of simple processors, the neurons); distributed knowledge representation (memory that is not localized but integrated with the processor); the ability to learn and generalize; adaptability (we can react quickly to new contexts); and robustness and fault tolerance, obtained thanks to redundancy and parallelism.
3
Modeling the Human Brain key functional characteristics
Learning and generalization ability
Continuous adaptation
Robustness and fault tolerance
4
Modeling the Human Brain key structural characteristics
Massive parallelism
Distributed knowledge representation: memory integrated with processing
Basic organization: networks of neurons
Diagram: receptors → neural nets → effectors
5
Modeling the Human Brain neurons
6
Human Brain in numbers
Cerebral cortex: about 10^11 neurons (more than the number of stars in the Milky Way)
Massive connectivity: 10^3 to 10^4 connections per neuron (in total, about 10^15 connections)
Time response: 10^-3 seconds. Silicon chips: 10^-9 seconds (one million times faster)
Yet, humans are more efficient than computers at computationally complex tasks. Why?
7
Artificial Neural Networks
“A neural network is a massively parallel distributed processor made up of simple processing units, which has a natural propensity for storing experiential knowledge and making it available for use. It resembles the brain in:
Knowledge is acquired by a learning process
Connection strengths between processing units are used to store the acquired knowledge.”
Simon Haykin, “Neural Networks: A Comprehensive Foundation”, 2nd Ed., Reprint 2005, Prentice Hall
8
Artificial Neural Networks
“From the perspective of pattern recognition, neural networks can be regarded as an extension of the many conventional techniques which have been developed over several decades (…) for example, discriminant functions, logit ..”
Christopher Bishop, “Neural Networks for Pattern Recognition”, Reprint, 2005, Oxford University Press
9
Artificial Neural Networks diverse applications
Pattern Classification
Clustering
Function Approximation: Regression
Time Series Forecasting
Optimization
Content-addressable Memory
10
The beginnings
McCulloch and Pitts, 1943: “A logical calculus of the ideas immanent in nervous activity”.
First neuron model, based on simplifications of brain behavior:
binary incoming signals
connection strengths: a weight for each incoming signal
binary response: active or inactive
an activation threshold or bias
Just some years earlier: Boolean algebra
11
The beginnings: the model
Figure: a neuron combining weighted inputs (connection weights) and firing according to an activation threshold (bias).
12
The beginnings These neurons can compute logical operations
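To make this concrete, here is a minimal sketch (not taken from the slides) of a McCulloch-Pitts unit: binary inputs, fixed connection weights and an activation threshold. The particular weights and thresholds below are illustrative choices that realize AND, OR and NOT.

def mcculloch_pitts(inputs, weights, threshold):
    """Fire (return 1) iff the weighted sum of the binary inputs reaches the threshold."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

# AND: both inputs must be active.
AND = lambda x1, x2: mcculloch_pitts([x1, x2], weights=[1, 1], threshold=2)
# OR: at least one input must be active.
OR = lambda x1, x2: mcculloch_pitts([x1, x2], weights=[1, 1], threshold=1)
# NOT: an inhibitory (negative) weight and a zero threshold.
NOT = lambda x: mcculloch_pitts([x], weights=[-1], threshold=0)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", AND(x1, x2), "OR:", OR(x1, x2))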
13
The beginnings
Perceptron (1958): Rosenblatt proposes the use of “layers of neurons” as a computational tool.
Proposes a training algorithm.
Emphasis on the learning capabilities of neural networks.
McCulloch and Pitts' line of work, instead, led towards automata theory.
14
The perceptron Architecture …
15
The perceptron Notation …
16
Perceptron Rule We have a set of patterns with desired responses …
If used for classification … the number of neurons in the perceptron equals the number of classes.
17
Perceptron Rule
1. Initialize the weights and the thresholds
2. Present a pattern vector
3. Update the weights according to the perceptron correction, with learning rate η (if used for classification, …)
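The following sketch implements the rule just described for the two-class case, assuming labels coded as -1/+1 and a learning rate eta; the function name and the toy data are illustrative.

import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=100):
    """Perceptron rule sketch: X is (n, d), y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1])   # connection weights, initialized at zero
    b = 0.0                    # activation threshold (bias)
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X, y):                 # present each pattern vector
            if y_i * (np.dot(w, x_i) + b) <= 0:    # misclassified pattern
                w += eta * y_i * x_i               # correct the weights
                b += eta * y_i                     # correct the bias
                errors += 1
        if errors == 0:                            # converged (separable case)
            break
    return w, b

# Usage on a toy linearly separable problem (AND-like data, labels in -1/+1):
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = perceptron_train(X, y)
print(np.sign(X @ w + b))   # should reproduce y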
18
Separating hyperplanes
19
Separating hyperplanes
Hyperplane: the set L of points x satisfying w^T x + b = 0. For any pair of points x_1, x_2 lying in L, w^T (x_1 - x_2) = 0. Hence, the normal vector to L is w / ||w||.
20
Separating hyperplanes
Signed distance of any point x to L: (w^T x + b) / ||w||.
21
Separating hyperplanes
Consider a two-class classification problem, with one class coded as +1 and the other as -1. An input is classified according to the sign of its distance to the hyperplane. How to train the classifier?
22
Separating hyperplanes
An idea: train to minimize the distance of the misclassified inputs to the hyperplane. Note that this is very different from training with the quadratic loss over all the points.
23
Gradient Descent
Suppose we have to minimize a function f(θ) on θ, where θ is a vector of parameters. For example, f can be the loss function of a model with parameters θ.
Iterate: θ ← θ − η ∇f(θ)
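As an illustration, a generic gradient-descent loop might look as follows; the quadratic objective at the end is only an example.

import numpy as np

def gradient_descent(grad_f, theta0, eta=0.1, n_iters=100):
    """Generic gradient descent: theta <- theta - eta * grad_f(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - eta * grad_f(theta)
    return theta

# Illustration on a simple quadratic f(theta) = ||theta - c||^2, minimized at c.
c = np.array([1.0, -2.0])
grad = lambda theta: 2.0 * (theta - c)
print(gradient_descent(grad, theta0=[0.0, 0.0]))   # close to [1, -2]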
24
Stochastic Gradient Descent
Suppose we have to minimize E_z[f(θ, z)] on θ, where z is a random variable and we have samples z_1, …, z_n of z.
Iterate: θ ← θ − η ∇_θ f(θ, z_i), using one sample z_i at a time.
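A corresponding sketch of stochastic gradient descent, with one sampled gradient per update; minimizing E[(θ − z)^2] (whose minimizer is the mean of z) is just an illustrative objective.

import numpy as np

def sgd(grad_sample, theta0, samples, eta=0.05, epochs=20, seed=0):
    """Stochastic gradient descent: one update per observed sample of the random variable."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(epochs):
        for z in rng.permutation(samples):          # present samples in random order
            theta = theta - eta * grad_sample(theta, z)
    return theta

# Illustration: minimize E_z[(theta - z)^2]; the minimizer is the mean of z.
samples = np.array([1.0, 2.0, 3.0, 4.0])
grad = lambda theta, z: 2.0 * (theta - z)
print(sgd(grad, theta0=0.0, samples=samples))       # close to 2.5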
25
Separating hyperplanes
If the set M of misclassified inputs is kept fixed, the criterion to minimize is D(w, b) = − Σ_{i∈M} y_i (w^T x_i + b), and stochastic gradient descent visits the misclassified examples one at a time.
26
Separating hyperplanes
For correctly classified inputs no correction of the parameters is applied. Now, note that for a misclassified example the stochastic gradient step is w ← w + η y_i x_i and b ← b + η y_i.
27
Separating hyperplanes
Perceptron rule
28
Perceptron convergence theorem
Theorem: If there exists a set of connection weights and an activation threshold able to separate the two classes, the perceptron algorithm converges to some solution in a finite number of steps, independently of the initialization of the weights and bias.
29
Perceptron
Conclusion: the perceptron rule, with two classes, is a stochastic gradient descent algorithm that aims to minimize the distances of the misclassified examples to the hyperplane. With more than two classes, the perceptron uses one neuron to model each class against the others. This is a modern perspective.
30
Delta Rule Widrow and Hoff It considers general activation functions
31
Delta Rule Update the weights according to …
32
Delta Rule Update the weights according to …
33
Delta Rule
Can the perceptron rule be obtained as a special case of this rule? The step function is not differentiable. Note that with this algorithm all the patterns are observed before the correction, while with Rosenblatt's algorithm each pattern induces a correction.
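A sketch of the batch variant described here, for a single unit with a sigmoid activation: all patterns are observed before the correction is applied. The quadratic loss, the sigmoid and the variable names are illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delta_rule_batch(X, t, eta=0.5, epochs=2000):
    """Batch delta rule for one unit with a sigmoid activation (illustrative names)."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        net = X @ w + b                 # net input for all patterns
        y = sigmoid(net)                # unit response
        delta = (t - y) * y * (1 - y)   # error times derivative of the activation
        # All patterns are observed before the correction (batch update).
        w += eta * X.T @ delta
        b += eta * delta.sum()
    return w, b

# Usage on 0/1 targets of a linearly separable toy problem:
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([0, 0, 0, 1], dtype=float)
w, b = delta_rule_batch(X, t)
print(np.round(sigmoid(X @ w + b), 2))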
34
Perceptrons Perceptrons and logistic regression
With more than 1 neuron: each neuron has the form of a logistic model of one class against the others.
35
Neural Networks Death
Minsky and Papert, 1969: “Perceptrons”.
Figure: single-layer perceptrons with inputs x1, …, xn, bias b, and outputs y = +1 / y = -1.
36
Neural Networks Death A perceptron cannot learn the XOR
37
Neural Networks renaissance
Idea: map the data to a feature space where the solution is linear
38
Neural Networks renaissance
Problem: this transformation is problem dependent
39
Neural Networks renaissance
Solution: multilayer perceptrons (feed-forward ANNs). More biologically plausible. Internal layers learn the map.
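A small hand-built example (not from the slides) of how internal layers provide such a map: two hidden threshold units compute OR and AND of the inputs, and in that feature space XOR becomes linearly separable. The weights below are set by hand just to illustrate the idea.

def step(z):
    return 1 if z >= 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 + x2 - 0.5)        # hidden unit computing OR
    h2 = step(x1 + x2 - 1.5)        # hidden unit computing AND
    return step(h1 - h2 - 0.5)      # output: OR and not AND

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_mlp(x1, x2))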
40
Architecture
41
Architecture: regression
each output corresponds to a response
42
Architecture: classification
each output corresponds to a class; training data has to be coded with 0-1 response variables (1 for the pattern's class, 0 for the others)
43
Universal approximation
Theorem: Let φ be an admissible activation function and let K be a compact subset of R^d. Then, for any continuous function f : K → R and for any ε > 0, there exist N, weights α_j, w_j and biases b_j such that the one-hidden-layer network g(x) = Σ_{j=1}^{N} α_j φ(w_j^T x + b_j) satisfies |f(x) − g(x)| < ε for all x in K.
44
Universal approximation
Admissible activation functions (e.g., sigmoidal functions).
45
Universal approximation
Extensions of the theorem: other output activation functions, other norms.
46
Fitting Neural Networks
The back-propagation algorithm: a generalization of the delta rule for multilayer perceptrons. It is a gradient descent algorithm for the quadratic loss function.
47
Back-propagation
Gradient descent generates a sequence of approximations related as w^(t+1) = w^(t) − η ∇E(w^(t)).
48
Back-propagation Equations
Why “back-propagation”? The error terms of each layer are computed from those of the layer above, i.e., errors are propagated backwards through the network. …
49
Back-propagation Algorithm
1. Initialize the weights and the thresholds
2. For each example i, compute the outputs and the error terms of every layer
3. Update the weights according to the accumulated gradient
4. Iterate 2 and 3 until convergence
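A compact sketch of batch back-propagation for a network with one hidden layer of sigmoid units and a linear output, minimizing the quadratic loss; the architecture, step size and toy data are illustrative assumptions, not the exact setting of the slides.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, y, n_hidden=5, eta=0.05, epochs=5000, seed=0):
    """Batch back-propagation for a 1-hidden-layer regression network (quadratic loss)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Small random starting values near zero (see the slides on starting values).
    W1 = rng.normal(scale=0.1, size=(d, n_hidden)); b1 = np.zeros(n_hidden)
    w2 = rng.normal(scale=0.1, size=n_hidden);      b2 = 0.0
    for _ in range(epochs):
        # Forward pass.
        H = sigmoid(X @ W1 + b1)          # hidden activations, shape (n, n_hidden)
        out = H @ w2 + b2                 # linear output, shape (n,)
        # Backward pass: propagate the output error to the hidden layer.
        err = out - y                     # derivative of 0.5*(out - y)^2 w.r.t. out
        delta_hidden = np.outer(err, w2) * H * (1 - H)
        # Gradient-descent corrections.
        w2 -= eta * H.T @ err / n
        b2 -= eta * err.mean()
        W1 -= eta * X.T @ delta_hidden / n
        b1 -= eta * delta_hidden.mean(axis=0)
    return W1, b1, w2, b2

# Usage: fit a noiseless 1-d toy function.
X = np.linspace(-1, 1, 50).reshape(-1, 1)
y = np.sin(3 * X).ravel()
W1, b1, w2, b2 = backprop_train(X, y)
pred = sigmoid(X @ W1 + b1) @ w2 + b2
print(np.round(np.mean((pred - y) ** 2), 3))   # training error after fitting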
50
Stochastic Back-propagation
1. Initialize the weights and the thresholds
2. For each example i, compute the error terms and update the weights immediately
3. Iterate 2 until convergence
51
Some Issues in Training NN
Local Minima
Architecture selection
Generalization and Overfitting
Other training functions
52
Local Minima
Back-propagation is a gradient descent procedure and hence converges to any configuration of weights such that ∇E(w) = 0. This can be a local minimum.
53
Local Minima: Starting values
Usually random values near zero.
Note that the sigmoid function is roughly linear when the weights are near zero; training can then be seen as gradually increasing the non-linearity of the model.
54
Local Minima Starting values
Stochastic back-propagation: the order of presentation of the examples matters.
Train multiple neural networks:
select the best
average the networks (their predictions)
average the weights
Ensemble models
55
Local Minima Other optimization algorithms
Back-propagation with momentum: Δw(t) = −η ∇E(w(t)) + α Δw(t−1), where α Δw(t−1) is the momentum term and α is the momentum parameter.
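A sketch of this update, written as step(t) = −η∇E + α·step(t−1); the elongated quadratic at the end is only an illustration of why the momentum term helps along flat directions.

import numpy as np

def gradient_descent_momentum(grad_f, theta0, eta=0.1, alpha=0.9, n_iters=200):
    """Gradient descent with a momentum term: step = -eta*grad + alpha*previous_step."""
    theta = np.asarray(theta0, dtype=float)
    step = np.zeros_like(theta)
    for _ in range(n_iters):
        step = -eta * grad_f(theta) + alpha * step   # momentum accumulates past steps
        theta = theta + step
    return theta

# Illustration on an elongated quadratic, where momentum helps along the flat direction.
grad = lambda th: np.array([0.02 * th[0], 2.0 * th[1]])
print(gradient_descent_momentum(grad, theta0=[10.0, 1.0]))   # approaches [0, 0]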
56
Overfitting Early stopping and validation set
Divide the available data into training and validation sets. Compute the validation error rate periodically during training. Stop training when the validation error rate "starts to go up".
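A sketch of this loop, with the trainer and the validation-error computation abstracted as generic callables (assumed interfaces, not part of the slides); training stops once the validation error has failed to improve for a few consecutive checks.

def train_with_early_stopping(train_step, validation_error, max_epochs=1000, patience=5):
    """Early stopping sketch.

    train_step():        runs one epoch of training on the training set (assumed given).
    validation_error():  returns the current error on the held-out validation set.
    Training stops once the validation error has not improved for `patience` checks.
    """
    best_error = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_step()
        err = validation_error()            # computed periodically during training
        if err < best_error:
            best_error = err
            epochs_without_improvement = 0  # validation error still going down
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # validation error "starts to go up"
    return epoch, best_error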
57
Overfitting Early stopping and validation set
58
Overfitting Regularization by weight decay
Weight decay shrinks the network towards a linear (very simple!) model.
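A minimal sketch of weight decay as a modified gradient step: the penalty λ‖w‖² adds 2λw to the gradient, which shrinks the weights towards zero, i.e., towards the simpler, more linear model; names and constants are illustrative.

import numpy as np

def weight_decay_step(w, grad_loss, eta=0.1, lam=0.01):
    """One gradient step on loss + lam * ||w||^2: the extra term shrinks the weights."""
    return w - eta * (grad_loss + 2.0 * lam * w)

# With a zero data gradient, repeated steps just shrink w towards 0 (a simpler model).
w = np.array([1.0, -2.0])
for _ in range(100):
    w = weight_decay_step(w, grad_loss=np.zeros_like(w))
print(w)   # much closer to the origin than the starting point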
59
A Closer Look at Regularization
Figure: values of the risk (A) over the space of functions (B). Convergence in A doesn't guarantee convergence in B.
60
Overfitting Regularization by weight decay
Tikhonov regularization: let us consider the problem of estimating a function f by observing a sample of data. Suppose we minimize the empirical (quadratic) risk on some space H.
61
Overfitting Regularization by weight decay
It is well known that, even for a continuous A, convergence of the values A(f_n) does not guarantee convergence of the functions f_n. Key regularization theorem: if H is compact, this property does hold!
62
Overfitting Regularization by weight decay
Compactness: a subset of a Euclidean space is compact if it is bounded and closed. Suppose we minimize on H, where the sets H_c = {f ∈ H : Ω(f) ≤ c} are compact.
63
Overfitting: Regularization by weight decay
Let Ω be a functional such that the sets {f ∈ H : Ω(f) ≤ c} are compact. Hence, minimizing the empirical risk plus λ Ω(f), under some selections of λ, amounts to minimizing over a compact set.
Example: weight decay, where Ω is the sum of the squared weights of the network.
64
A Closer Look at Weight Decay
The less complicated hypothesis has a lower error rate.
65
NN for classification Loss function: Is the quadratic loss appropriate?
66
NN for classification
67
Projection Pursuit Generalization of 2-layer regression NN
Universal approximator
Good for prediction
Not good for deriving interpretable models of the data
Basis functions (activation functions) are now “learned” from the data
Weights are viewed as projection directions that we have to “pursue”
68
Projection Pursuit
Figure: the inputs are projected onto unit vectors, passed through ridge functions, and summed to produce the output.
69
PPR: Derived Features
The dot product w^T x is the projection of the input signal x onto the direction w.
The ridge function g(w^T x) varies only in the direction of w.
70
PPR: Training
Minimize the squared error Σ_i (y_i − Σ_m g_m(w_m^T x_i))^2. Consider a single term:
Given w, we derive the features v_i = w^T x_i and smooth y against them to estimate g.
Given g, we minimize over w with a Newton-like method.
Iterate those two steps to convergence.
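A sketch of this alternation for a single term, assuming a polynomial smoother for the ridge function and a Gauss-Newton style weighted least-squares step for the direction; all names and the toy data are illustrative.

import numpy as np

def fit_ppr_single_term(X, y, degree=5, n_iters=20, seed=0):
    """Single-term projection pursuit regression sketch: y ≈ g(w^T x)."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=X.shape[1])
    w /= np.linalg.norm(w)                         # keep w a unit vector
    for _ in range(n_iters):
        v = X @ w                                  # derived feature
        coeffs = np.polyfit(v, y, degree)          # smooth y against v (polynomial smoother)
        g, g_prime = np.poly1d(coeffs), np.poly1d(coeffs).deriv()
        # Newton-like (Gauss-Newton) step: weighted least squares on the linearized model.
        gp = g_prime(v)
        target = v + (y - g(v)) / np.where(np.abs(gp) > 1e-8, gp, 1e-8)
        W = gp ** 2                                # weights of the regression
        A = X * W[:, None]
        w = np.linalg.solve(X.T @ A + 1e-8 * np.eye(X.shape[1]), A.T @ target)
        w /= np.linalg.norm(w)
    return w, np.poly1d(np.polyfit(X @ w, y, degree))

# Usage on a toy ridge function y = (w_true^T x)^2:
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, 2.0, 0.0]) / np.sqrt(5.0)
y = (X @ w_true) ** 2
w_hat, g_hat = fit_ppr_single_term(X, y)
print(np.round(w_hat, 2))   # should roughly align (up to sign) with w_true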
71
PPR: Newton’s Method Use derivatives to iteratively improve estimate
72
PPR: Newton’s Method Use derivatives to iteratively improve estimate
Weighted least squares regression to hit the target
73
PPR: Implementation Details
Suggested smoothing methods: local regression, smoothing splines.
(w_m, g_m) pairs are added in a forward stage-wise manner.
Very close to ensemble methods.
74
Conclusions
Neural networks are a very general approach to both regression and classification.
They are an effective learning tool when:
prediction is desired
formulating an explicit description of the problem's solution is not needed