START OF DAY 4 Reading: Chap. 3 & 4

Project

Topics & Teams
– Select topics/domains
– Select teams
Deliverables
– Description of the problem
– Selection of objective(s)
– Description of the methods used (data preparation, learning algorithms used)
– Description of the results
Project presentations

Perceptron

Neural Networks
Sub-symbolic approach:
– Does not use symbols to denote objects
– Views intelligence/learning as arising from the collective behavior of a large number of simple, interacting components
Motivated by biological plausibility

Natural Neuron

Artificial Neuron
Captures the essence of the natural neuron:
– (Dendrites) Input values X_i from the environment or other neurons
– (Synapses) Real-valued weights w_i associated with each input
– (Soma's chemical reaction) Function F({X_i}, {w_i}) computing activation as a function of input values and weights
– (Axon) Activation value that may serve as input to other neurons
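To make the correspondence concrete, here is a minimal sketch of an artificial neuron in Python. The weighted-sum-plus-step activation is just one simple choice of F (it anticipates the perceptron below); the function name is my own.

```python
# A single artificial neuron: real-valued inputs, real-valued weights, and an
# activation function applied to the weighted sum of the inputs.

def neuron_output(inputs, weights, activation=lambda net: 1 if net > 0 else 0):
    """Compute F({x_i}, {w_i}): weighted sum of inputs, then activation."""
    net = sum(x * w for x, w in zip(inputs, weights))
    return activation(net)

# Example: inputs (0.8, 0.3) with weights (0.4, -0.2) give net = 0.26 > 0 -> 1
print(neuron_output([0.8, 0.3], [0.4, -0.2]))
```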

Feedforward Neural Networks
Sets of (highly) interconnected artificial neurons (i.e., simple computational units)
– Layered organization
Characteristics:
– Massive parallelism
– Distributed knowledge representation (i.e., implicit in patterns of interactions)
– Graceful degradation (no single "grandmother cell" whose loss destroys a concept)
– Less susceptible to brittleness
– Noise tolerant
– Opaque (i.e., black box)
There exist other types of NNs

FFNN Topology
Pattern of interconnections among neurons: primary source of inductive bias
Characteristics:
– Number of layers
– Number of neurons per layer
– Interconnectivity (fully connected, mesh, etc.)

Perceptrons (1958)
Simplest class of neural networks
Single-layer, i.e., only one set of connection weights between inputs and outputs
Boolean activation (aka step function)
(Figure: inputs x_1, x_2, …, x_n with weights w_1, w_2, …, w_n feeding a single output unit z)

Learning for Perceptrons
Algorithm devised by Rosenblatt in 1958
Given an example (i.e., a labeled input pattern):
– Compute output
– Check output against target
– Adapt weights

Example (I)
Inputs x_1 = 0.8, x_2 = 0.3 with weights w_1 = 0.4, w_2 = -0.2:
net = 0.8*0.4 + 0.3*(-0.2) = 0.26 > 0, so z = 1
Output matches the target t

Example (II)
Inputs x_1 = 0.4, x_2 = 0.1 with the same weights w_1 = 0.4, w_2 = -0.2:
net = 0.4*0.4 + 0.1*(-0.2) = 0.14 > 0, so z = 1
Output does not match the target t

Learn-Perceptron
When should weights be changed?
– When the output does not match the target: (t_i - z_i)
How should weights be changed?
– By some fixed amount (learning rate): c(t_i - z_i)
– Proportional to the input value: c(t_i - z_i)x_i
Algorithm (see the sketch below):
– Initialize weights (typically random)
– For each new training example:
  Compute network output
  Change weights: Δw_i = c(t_i - z_i)x_i
– Repeat until no change in weights
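A minimal sketch of Learn-Perceptron in Python, assuming a step activation that outputs 1 when net > 0 and a fixed learning rate c. The bias-as-extra-weight trick from the next slide is folded in as an assumption; names are my own.

```python
# Sketch of Rosenblatt's perceptron learning rule (Learn-Perceptron).
# Assumes: step activation (1 if net > 0 else 0), and the threshold handled as
# an extra "bias" weight on a constant input of 1 (see "What About Theta?").

def step(net):
    return 1 if net > 0 else 0

def learn_perceptron(examples, n_inputs, c=1.0, max_epochs=100):
    """examples: list of (pattern, target) pairs; returns learned weights."""
    w = [0.0] * (n_inputs + 1)           # last weight is the bias weight
    for _ in range(max_epochs):
        changed = False
        for x, t in examples:
            x = list(x) + [1.0]          # append constant bias input
            z = step(sum(xi * wi for xi, wi in zip(x, w)))
            if z != t:                   # adapt only when output is wrong
                for i in range(len(w)):
                    w[i] += c * (t - z) * x[i]   # delta_w_i = c(t_i - z_i)x_i
                changed = True
        if not changed:                  # repeat until no change in weights
            return w
    return w

# Usage on a linearly separable toy task (logical AND):
weights = learn_perceptron([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)], n_inputs=2)
print(weights)
```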

What About Θ?
Augmented version: treat the threshold like any other weight, attached to a constant input of 1
– Call it a bias, since it biases the output up or down
– Since we start with random weights anyway, ignore the -Θ notation; just think of the bias as an extra available weight
– Always use a bias weight

Example
Assume a 3-input perceptron (plus bias, outputs 1 if net > 0, else 0)
Assume c = 1 and initial weights all 0
Given a training set of input patterns and targets, trace the algorithm in a table with columns Pattern | Target | Weight | Net | Output | ΔW, applying Δw_i = c(t_i - z_i)x_i after each pattern

Another Example
Assume a 2-input perceptron (plus bias, outputs 1 if net > 0, else 0)
Assume c = 1 and initial weights all 0
Training set: binary patterns (0 0, …) with their targets
Trace the same table (Pattern | Target | Weight | Net | Output | ΔW) with Δw_i = c(t_i - z_i)x_i
What is happening? Why?

Decision Surface
Assume a 2-input perceptron:
– z = 1 if w_1x_1 + w_2x_2 ≥ Θ
– z = 0 if w_1x_1 + w_2x_2 < Θ
Decision boundary: w_1x_1 + w_2x_2 = Θ
– A line with slope -w_1/w_2 and intercept Θ/w_2 (derivation below)
– No bias ⇒ the line goes through the origin
In general: a hyperplane (i.e., a linear surface)
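Solving the boundary equation for x_2 makes the slope and intercept explicit (a standard rearrangement, not from the slide):

```latex
w_1 x_1 + w_2 x_2 = \Theta
\;\Longrightarrow\;
x_2 = -\frac{w_1}{w_2}\,x_1 + \frac{\Theta}{w_2}
```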

Linear Separability Generalization: noise vs. exception Limited functionality?

The Plague of Linear Separability
The good news:
– Learn-Perceptron is guaranteed to converge to a correct assignment of weights if such an assignment exists
The bad news:
– Such an assignment exists only for linearly separable tasks
The really bad news:
– There are a large number of non-linearly separable tasks
– Let d be the number of binary inputs: of the 2^(2^d) possible Boolean functions of those inputs, only a rapidly vanishing fraction are linearly separable as d grows
– Too many tasks escape the algorithm

Are We Stuck?
So far we have used net = Σ_i w_i x_i
What if we preprocessed the inputs in a non-linear way and used net = Σ_i w_i f_i(x) instead?
To the perceptron algorithm it would look just the same, except with different inputs
For example, for a problem with two inputs x and y (plus the bias), we could also add the inputs x², y², and x·y
The perceptron would just think it is a 5-dimensional task, and it is linear in those 5 dimensions
– But what kind of decision surfaces would it allow in the 2-d input space?

Quadric Machine Example
All quadratic surfaces (2nd order): ellipsoid, parabola, etc.
For example (see the sketch below):
– A perceptron with just feature f_1 cannot separate the data
– Assume we add another feature to our perceptron: f_2 = f_1²
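A minimal sketch of the idea, assuming a toy 1-D data set (my own, not the slide's) in which the positive class sits on both sides of the negative class: no threshold on f_1 alone separates it, but adding f_2 = f_1² makes it linearly separable.

```python
# Toy illustration of the quadric-machine idea: the 1-D data below is not
# linearly separable in f1 alone, but becomes separable once we add f2 = f1**2.
# (The data set is an assumption for illustration, not taken from the slide.)

data = [(-2.0, 1), (-1.5, 1), (-0.5, 0), (0.3, 0), (1.4, 1), (2.2, 1)]  # (f1, target)

# Augment each example with the quadratic feature f2 = f1**2:
augmented = [((f1, f1 ** 2), t) for f1, t in data]

# Now the linear boundary f2 - 1 = 0 (i.e., 0*f1 + 1*f2 - 1 = 0) separates the classes:
for (f1, f2), t in augmented:
    predicted = 1 if f2 - 1.0 > 0 else 0
    print(f"f1={f1:5.2f}  f2={f2:5.2f}  target={t}  predicted={predicted}")
```

The same learn_perceptron sketch shown earlier could be run on the augmented examples to find such a boundary automatically.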

Quadric Machine
All quadratic surfaces (2nd order): ellipsoid, parabola, etc.
– That significantly increases the number of problems that can be solved, but there are still many problems that are not quadrically separable
– We could go to 3rd- and higher-order features, but the number of possible features grows exponentially
– Multi-layer neural networks will allow us to discover high-order features automatically from the input space

Backpropagation

Towards a Solution
Main problem:
– Learn-Perceptron implements a discrete model of error (because of the step function, it identifies the existence of an error and adapts to it, but not the magnitude of the error)
First thing to do:
– Allow nodes to have real-valued activations (amount of error = difference between computed and target output)
Second thing to do:
– Design a learning rule that adjusts weights based on that error
Last thing to do:
– Use the learning rule to implement a multi-layer algorithm

Real-valued Activation
Replace the threshold unit (step function) with a linear unit, where the output is the net input itself:
o = w·x = Σ_i w_i x_i
For training instance d:
o_d = w·x_d

Defining Error
We define the training error of a hypothesis, or weight vector, by:
E(w) = ½ Σ_{d∈D} (t_d - o_d)²
Goal: minimize E
– Find the direction of steepest ascent of E (aka the gradient)
– Move in the opposite direction (i.e., to decrease E)

Minimizing the Error
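A standard derivation of the gradient for the linear unit and squared error defined above; this reconstruction follows Mitchell, Chapter 4, and uses c for the learning rate.

```latex
\frac{\partial E}{\partial w_i}
  = \frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{d\in D}(t_d-o_d)^2
  = \sum_{d\in D}(t_d-o_d)\,\frac{\partial}{\partial w_i}\bigl(t_d-\mathbf{w}\cdot\mathbf{x}_d\bigr)
  = -\sum_{d\in D}(t_d-o_d)\,x_{id}

\Delta w_i = -c\,\frac{\partial E}{\partial w_i} = c\sum_{d\in D}(t_d-o_d)\,x_{id}
```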

The Delta Rule
Gradient descent on the error surface (a sketch of both versions follows below):
– Initialize weights to small random values
– Repeat until no progress:
  Initialize each Δw_i to 0
  For each training example:
    Compute output o for x
    For each weight w_i: Δw_i ← Δw_i + c(t - o)x_i
  For each weight w_i: w_i ← w_i + Δw_i
Stochastic version:
– For each training example, update each weight directly: w_i ← w_i + c(t - o)x_i
Better?
Note the change in sign, since we are minimizing E
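A minimal sketch of both variants in Python, assuming a linear unit and the squared error defined above; function and variable names are my own.

```python
import random

# Batch delta rule: accumulate the gradient over the whole training set, then update.
def delta_rule_batch(examples, n_inputs, c=0.05, epochs=100):
    w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    for _ in range(epochs):
        dw = [0.0] * n_inputs
        for x, t in examples:
            o = sum(xi * wi for xi, wi in zip(x, w))      # linear unit output
            for i in range(n_inputs):
                dw[i] += c * (t - o) * x[i]               # accumulate delta_w_i
        for i in range(n_inputs):
            w[i] += dw[i]                                 # one update per epoch
    return w

# Stochastic (incremental) version: update the weights after every example.
def delta_rule_stochastic(examples, n_inputs, c=0.05, epochs=100):
    w = [random.uniform(-0.05, 0.05) for _ in range(n_inputs)]
    for _ in range(epochs):
        for x, t in examples:
            o = sum(xi * wi for xi, wi in zip(x, w))
            for i in range(n_inputs):
                w[i] += c * (t - o) * x[i]                # immediate update
    return w

# Usage: learn the linear target t = 2*x1 - x2 from a few consistent samples;
# both runs should print weights close to (2, -1).
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
print(delta_rule_batch(data, n_inputs=2))
print(delta_rule_stochastic(data, n_inputs=2))
```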

Discussion
Gradient-descent learning (with linear units) requires more than one pass through the training set
The good news:
– Convergence is guaranteed if the problem is solvable
The bad news:
– It still produces only linear functions, even when used in a multi-layer context (a composition of linear functions is a linear function)
Needs to be further generalized!

Non-linear Activation
Introduce non-linearity with a sigmoid function:
σ(net) = 1 / (1 + e^(-net))
1. Differentiable (required for gradient descent)
2. Most unstable in the middle (its derivative is largest near net = 0)

Derivative of the Sigmoid
σ'(net) = σ(net)(1 - σ(net))
You need only compute the sigmoid; its derivative comes for free!
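A tiny sketch of this property in Python (names are my own):

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_prime(net):
    s = sigmoid(net)           # the derivative reuses the sigmoid value itself
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_prime(0.0))  # 0.5 and the maximum slope 0.25
```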

Multi-layer Feed-forward NN
(Figure: input units i feeding hidden units j, which feed output units k)

Backpropagation Learning
Repeat (see the sketch below):
– Present a training instance d
– Compute the error δ_k of the output units: δ_k = o_k(1 - o_k)(t_k - o_k)
– For each hidden layer, compute the error δ_j using the error from the next layer: δ_j = o_j(1 - o_j) Σ_k w_jk δ_k
– Update all weights: w_pq ← w_pq + Δw_pq, where Δw_pq = c δ_q x_pq
Until the stopping criterion is met
Note that BP is stochastic
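A minimal sketch of stochastic backpropagation for one hidden layer of sigmoid units, assuming squared error and the update rules above; array shapes, the XOR usage example, and all names are my own.

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def backprop_epoch(X, T, W_ih, W_ho, c=0.5):
    """One stochastic pass over the data (bias handled as an extra constant-1 input).
    X: (n_examples, n_inputs+1) with a trailing 1, T: (n_examples, n_outputs),
    W_ih: (n_inputs+1, n_hidden), W_ho: (n_hidden+1, n_outputs)."""
    for x, t in zip(X, T):
        # Forward pass
        h = sigmoid(x @ W_ih)                 # hidden activations o_j
        h_b = np.append(h, 1.0)               # append constant bias "unit"
        o = sigmoid(h_b @ W_ho)               # output activations o_k
        # Backward pass: delta_k = o_k(1-o_k)(t_k-o_k), delta_j = o_j(1-o_j) sum_k w_jk delta_k
        delta_k = o * (1 - o) * (t - o)
        delta_j = h * (1 - h) * (W_ho[:-1] @ delta_k)   # skip the bias row
        # Incremental weight updates: w_pq <- w_pq + c * delta_q * x_pq
        W_ho += c * np.outer(h_b, delta_k)
        W_ih += c * np.outer(x, delta_j)
    return W_ih, W_ho

# Usage: a 2-3-1 network on XOR (the trailing 1 in each input row is the bias input).
rng = np.random.default_rng(0)
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
T = np.array([[0.], [1.], [1.], [0.]])
W_ih = rng.uniform(-0.5, 0.5, (3, 3))
W_ho = rng.uniform(-0.5, 0.5, (4, 1))
for _ in range(5000):
    W_ih, W_ho = backprop_epoch(X, T, W_ih, W_ho)
# Outputs should approach the XOR targets 0, 1, 1, 0 (a different seed may be
# needed if training happens to stall).
print(sigmoid(np.append(sigmoid(X @ W_ih), np.ones((4, 1)), axis=1) @ W_ho))
```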

Setting Up the Derivation

Output Units

Hidden Units
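These derivation slides reduce to the standard chain-rule computation for sigmoid units and squared error; a compact reconstruction (following Mitchell, Chapter 4, with the notation used above and c as the learning rate):

```latex
% Error on instance d, with outputs indexed by k
E_d = \tfrac{1}{2}\sum_k (t_k - o_k)^2, \qquad
o = \sigma(net), \quad \sigma'(net) = o(1-o)

% Output units: net_k = \sum_j w_{jk}\, o_j
\frac{\partial E_d}{\partial w_{jk}}
  = \frac{\partial E_d}{\partial net_k}\,\frac{\partial net_k}{\partial w_{jk}}
  = -(t_k-o_k)\,o_k(1-o_k)\,o_j
\;\Rightarrow\;
\Delta w_{jk} = c\,\delta_k\,o_j, \qquad \delta_k = o_k(1-o_k)(t_k-o_k)

% Hidden units: the error flows back through every output k fed by hidden unit j
\delta_j = o_j(1-o_j)\sum_k w_{jk}\,\delta_k, \qquad
\Delta w_{ij} = c\,\delta_j\,x_i
```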

Putting it all together
(Figure: input units i, hidden units j, output units k)
The weight updates are largest where the sigmoid is most unstable: the o(1 - o) factor peaks at o = 0.5

Example (I)
Consider a simple network composed of:
– 3 inputs: x, y, z
– 1 hidden node: h
– 2 outputs: q, r
Assume c = 0.5, all weights are initialized to 0.2, and weight updates are incremental
Consider a small training set of (x, y, z) patterns with (q, r) targets, and trace a few iterations over the training set (see the sketch after Example (II))

Example (II)
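Since the worked numbers are easiest to check by machine, here is a sketch that performs one incremental backprop update for the 3-1-2 network of Example (I). The training instance (1, 0, 1) → (1, 0) is a made-up placeholder, not the slide's actual data, and no bias weights are included (the slide does not say whether its network uses them).

```python
import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

# 3 inputs (x, y, z) -> 1 hidden node h -> 2 outputs (q, r); all weights 0.2, c = 0.5.
W_ih = np.full((3, 1), 0.2)          # weights x->h, y->h, z->h
W_ho = np.full((1, 2), 0.2)          # weights h->q, h->r
c = 0.5

# Hypothetical training instance (an assumption for illustration only)
x = np.array([1.0, 0.0, 1.0])
t = np.array([1.0, 0.0])

# Forward pass
h = sigmoid(x @ W_ih)                # hidden activation
o = sigmoid(h @ W_ho)                # outputs (q, r)

# Backward pass and one incremental weight update
delta_k = o * (1 - o) * (t - o)              # output-unit errors
delta_j = h * (1 - h) * (W_ho @ delta_k)     # hidden-unit error
W_ho += c * np.outer(h, delta_k)
W_ih += c * np.outer(x, delta_j)

print("outputs:", o, "updated W_ho:", W_ho.ravel(), "updated W_ih:", W_ih.ravel())
```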

Local Minima
An FFNN can get stuck in a local minimum
– More common for small networks
– For most large networks (many weights), local minima rarely occur in practice
– With many weight dimensions, it is unlikely to be at a minimum in every dimension simultaneously; there is almost always a way down (like water running down a high-dimensional surface)
If needed, one can use momentum or train several NNs

Momentum
Simple speed-up modification: the weight update maintains momentum in the direction it has been going (see the sketch below)
– Faster in flat regions
– Could leap past minima (good or bad)
– Significant speed-up; common value α ≈ 0.9
– Effectively increases the learning rate in areas where the gradient consistently has the same sign (a common approach in adaptive learning rate methods)
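A minimal sketch of the momentum-augmented update, assuming the standard formulation Δw(t) = c·(gradient term) + α·Δw(t-1); names are my own.

```python
# Momentum-augmented weight update (standard formulation, assumed here):
#   delta_w(t) = c * gradient_term + alpha * delta_w(t-1)

def momentum_update(w, grad_term, prev_dw, c=0.5, alpha=0.9):
    """w, grad_term, prev_dw: lists of equal length; returns (new_w, new_dw)."""
    new_dw = [c * g + alpha * d for g, d in zip(grad_term, prev_dw)]
    new_w = [wi + dwi for wi, dwi in zip(w, new_dw)]
    return new_w, new_dw

# If the gradient keeps the same sign, new_dw grows toward c*g/(1 - alpha),
# i.e., an effective learning rate up to 10x larger for alpha = 0.9.
```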

Learning Parameters
Connectivity: typically fully connected between layers
Number of hidden nodes:
– Too many nodes make learning slower and could overfit
– Too few will underfit
Number of layers: 1 or 2 hidden layers are usually sufficient; with more layers, attenuation of the error signal makes learning very slow
– 1 hidden layer is most common
Momentum: commonly α ≈ 0.9 (see the previous slide)
Most common method to set parameters: a few trial-and-error runs (cross-validation)
All of these could be set automatically by the learning algorithm, and there are numerous approaches to do so

Backpropagation Summary
Most common neural network approach
– Many other styles of neural networks exist (RBF, Hopfield, etc.)
Excellent empirical results
Scaling: the pleasant surprise
– Local minima become very rare as problem and network complexity increase
User-defined parameters are usually handled by multiple experiments
Many variants, such as:
– Regression: typically linear output nodes, normal hidden nodes
– Adaptive parameters, ontogenic (growing and pruning) learning algorithms
– Many different learning algorithm approaches
– Recurrent networks
– Deep networks
Still an active research area

END OF DAY 4 Homework: Decision Tree Learning