Machine Learning Neural Networks

Slides from: Doug Gray, David Poole



Human Brain

Neurons

Input-Output Transformation: [diagram] a neuron receives input spikes (a spike is a brief pulse) that produce excitatory post-synaptic potentials; when enough accumulate, the neuron emits an output spike.

Human Learning. Number of neurons: ~10^10. Connections per neuron: ~10^4 to 10^5. Neuron switching time: ~0.001 second. Scene recognition time: ~0.1 second. That leaves room for only ~100 sequential inference steps, which doesn't seem much — the brain must rely on massive parallelism.

Machine Learning Abstraction

Artificial Neural Networks. Typically, machine-learning ANNs are very artificial, ignoring: time, space, and biological learning processes. More realistic neural models exist: Hodgkin & Huxley (1952) won a Nobel Prize for theirs (in 1963). Nonetheless, very artificial ANNs have been useful in many ML applications.

Perceptrons. The "first wave" in neural networks; big in the 1960's. McCulloch & Pitts (1943), Widrow & Hoff (1960), Rosenblatt (1962).

A single perceptron: [diagram] inputs x1, x2, x3 with weights w1, w2, w3, plus a bias input x0 = 1 (always) with weight w0.

Logical Operators: [diagram] perceptrons can implement the basic gates, e.g., AND: w0 = -0.8, w1 = w2 = 0.5; NOT: w0 = 0.0, w1 = -1.0; OR: w0 = -0.3, w1 = w2 = 0.5.
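A quick check of these weights, as a minimal sketch assuming the common convention that the unit outputs 1 when the weighted sum (including the bias) is at least 0:

```python
# Minimal perceptron (threshold unit): output 1 if w . x >= 0, else 0.
# The weight values are the ones listed on the slide.

def perceptron(weights, inputs):
    # weights[0] is the bias weight w0; the bias input x0 is always 1.
    total = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if total >= 0 else 0

AND = [-0.8, 0.5, 0.5]   # fires only when both inputs are 1
OR  = [-0.3, 0.5, 0.5]   # fires when at least one input is 1
NOT = [0.0, -1.0]        # inverts a single input

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "AND:", perceptron(AND, [x1, x2]),
              "OR:", perceptron(OR, [x1, x2]))
print("NOT:", perceptron(NOT, [0]), perceptron(NOT, [1]))
```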

Perceptron Hypothesis Space. What's the hypothesis space for a perceptron on n inputs? (All linear decision boundaries: the set of hyperplanes in R^n, one per weight vector w in R^(n+1).)

Learning Weights: the Perceptron Training Rule, or Gradient Descent (other approaches: genetic algorithms). [Diagram: a perceptron whose weights are unknown and must be learned.]

Perceptron Training Rule. Weights are modified for each example. Update rule: w_i <- w_i + Δw_i with Δw_i = η (t - o) x_i, where η is the learning rate, t the target value, o the perceptron output, and x_i the input value.
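A sketch of the rule in code; the OR training data, initial weights, and learning rate are illustrative:

```python
# Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i,
# applied once per example. The data below is the OR function.

def output(w, x):                      # x includes the bias input x[0] = 1
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0

examples = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w, eta = [0.0, 0.0, 0.0], 0.1

for epoch in range(20):
    for x, t in examples:
        o = output(w, x)
        # Move each weight in proportion to its input and the error (t - o).
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]

print(w, [output(w, x) for x, _ in examples])   # learned weights, predictions
```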

What weights make XOR? No combination of weights works Perceptrons can only represent linearly separable functions

Linear Separability: [plot] OR in the (x1, x2) plane — a single line separates the + points from the - point.

Linear Separability: [plot] AND — also linearly separable: one line isolates the single + point at (1, 1) from the three - points.

Linear Separability: [plot] XOR — the + points at (0, 1) and (1, 0) and the - points at (0, 0) and (1, 1) cannot be separated by any single line.

Perceptron Training Rule: converges to the correct classification IF the cases are linearly separable and the learning rate is small enough. Minsky and Papert's 1969 analysis of perceptrons' limitations killed widespread interest in them till the 80's.

XOR: [network diagram] a two-layer network of threshold units computes XOR; in the figure the hidden units use input weights of ±0.6 and feed the output unit with weights of 1.

What's wrong with perceptrons? You can always plug multiple perceptrons together to calculate any function. BUT who decides what the weights are? Assigning error to the upstream (parent) inputs becomes a problem: because of the threshold, the output gives no indication of who contributed the error.

Problem: assignment of error. [Diagram: perceptron with a threshold (step function).] It is hard to tell from the output who contributed what, which stymies multi-layer weight learning.

Solution: a differentiable function. [Diagram: unit with a simple linear function.] Varying any input a little creates a perceptible change in the output. This lets us propagate error to prior nodes.

Measuring error for linear units. Output function: o = w · x. Error measure: E(w) = ½ Σ_{d∈D} (t_d - o_d)², where t_d is the target value, o_d the linear unit output, and D the training data.

Gradient Descent. Training rule: Δw = -η ∇E(w). Gradient: ∇E(w) = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n].

Gradient Descent Rule. Since ∂E/∂w_i = -Σ_{d∈D} (t_d - o_d) x_{i,d}, the update rule is: Δw_i = η Σ_{d∈D} (t_d - o_d) x_{i,d}.
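A batch gradient-descent sketch for a single linear unit; the toy data (t = 0.5 + x1) and learning rate are illustrative:

```python
# Batch gradient descent for a linear unit: o = w . x,
# E(w) = 1/2 * sum_d (t_d - o_d)^2, delta_w_i = eta * sum_d (t_d - o_d) * x_id.

examples = [([1.0, 0.0], 0.5), ([1.0, 1.0], 1.5), ([1.0, 2.0], 2.5)]  # x[0]=1 is bias
w, eta = [0.0, 0.0], 0.05

for step in range(500):
    delta = [0.0, 0.0]
    for x, t in examples:
        o = sum(wi * xi for wi, xi in zip(w, x))      # linear unit output
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * xi            # accumulate over all data
    w = [wi + di for wi, di in zip(w, delta)]

print(w)   # approaches [0.5, 1.0] for this data (t = 0.5 + x1)
```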

Back to XOR: [diagram of the two-layer XOR network, redrawn with differentiable units].

Gradient Descent for Multiple Layers: [diagram] with differentiable units we can compute ∂E/∂w for every weight in the network, layer by layer.

Gradient Descent vs. Perceptrons Perceptron Rule & Threshold Units Learner converges on an answer ONLY IF data is linearly separable Can’t assign proper error to parent nodes Gradient Descent Minimizes error even if examples are not linearly separable Works for multi-layer networks But…linear units only make linear decision surfaces (can’t learn XOR even with many layers) And the step function isn’t differentiable…

A compromise function:
Perceptron: o = 1 if w · x > 0, else 0
Linear: o = w · x
Sigmoid (logistic): o = σ(w · x) = 1 / (1 + e^(-w·x))

The sigmoid (logistic) unit: has a differentiable function, allows gradient descent, and can be used to learn non-linear functions. [Diagram: unit with inputs x1, x2 and weights to be learned.]
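In code; the derivative's simple form, σ'(y) = σ(y)(1 - σ(y)), is what makes gradient descent through these units cheap:

```python
import math

# The sigmoid (logistic) unit: o = sigma(w . x), sigma(y) = 1 / (1 + e^-y).

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_deriv(y):
    s = sigmoid(y)
    return s * (1.0 - s)          # convenient closed form of the derivative

print(sigmoid(0.0), sigmoid_deriv(0.0))   # 0.5, 0.25
```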

Logistic function: [diagram] independent variables (Age = 34, Gender = 1, Stage = 4) are combined with coefficients (e.g., .5, .4, .8) and passed through a sigmoid to give the prediction: "probability of being alive" = 0.6.

Neural Network Model: [diagram] the same independent variables now feed a hidden layer of sigmoid units; input-to-hidden weights and hidden-to-output weights together produce the prediction ("probability of being alive" = 0.6).

Getting an answer from a NN: [three slides stepping through the forward pass] each hidden sigmoid unit computes a weighted sum of the inputs, and the output unit combines the hidden activations into the prediction, "probability of being alive" = 0.6.
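A sketch of that forward pass; the weights are illustrative stand-ins loosely based on the figure, and the structure (inputs → two sigmoid hidden units → sigmoid output) is what matters:

```python
import math

# Forward pass through the toy "probability of being alive" network.

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

inputs = [34.0, 1.0, 4.0]                 # Age, Gender, Stage
w_hidden = [[0.6, 0.1, 0.7],              # inputs -> hidden unit 1
            [0.5, 0.3, 0.8]]              # inputs -> hidden unit 2
w_output = [0.2, 0.2]                     # hidden units -> output

hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in w_hidden]
answer = sigmoid(sum(w * h for w, h in zip(w_output, hidden)))
print(answer)                             # ~0.6, the "probability of being alive"
```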

Minimizing the Error: [plot of the error surface over w] from the initial w, each update moves w in the direction of the negative derivative, descending from the initial error toward a local minimum and the trained w.

Differentiability is key! Sigmoid is easy to differentiate For gradient descent on multiple layers, a little dynamic programming can help: Compute errors at each output node Use these to compute errors at each hidden node Use these to compute errors at each input node

The Backpropagation Algorithm
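A minimal sketch of the algorithm for one hidden layer of sigmoid units (the stochastic/incremental version, as in Mitchell, ch. 4), trained here on XOR. Network size, learning rate, epoch count, and seed are illustrative; a different seed or more hidden units may be needed if training lands in a local minimum.

```python
import math, random

random.seed(0)

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

n_in, n_hid = 2, 2
# w_hid[j][i]: weight from input i (index 0 is the bias) to hidden unit j.
w_hid = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
w_out = [random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)]
eta = 0.5
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]   # XOR

for epoch in range(10000):
    for x, t in data:
        xb = [1.0] + list(x)                                  # add bias input
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
        hb = [1.0] + h
        o = sigmoid(sum(w * v for w, v in zip(w_out, hb)))
        # Error terms: output unit first, then propagate back to hidden units.
        d_o = o * (1 - o) * (t - o)
        d_h = [h[j] * (1 - h[j]) * w_out[j + 1] * d_o for j in range(n_hid)]
        # Weight updates: w <- w + eta * delta * input.
        w_out = [w + eta * d_o * v for w, v in zip(w_out, hb)]
        for j in range(n_hid):
            w_hid[j] = [w + eta * d_h[j] * v for w, v in zip(w_hid[j], xb)]

for x, t in data:
    xb = [1.0] + list(x)
    hb = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
    print(x, t, round(sigmoid(sum(w * v for w, v in zip(w_out, hb))), 2))
```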

Expressive Power of ANNs Universal Function Approximator: Given enough hidden units, can approximate any continuous function f Need 2+ hidden units to learn XOR Why not use millions of hidden units? Efficiency (neural network training is slow) Overfitting

Overfitting: [plot contrasting the real distribution with an overfitted model].

Combating Overfitting in Neural Nets Many techniques Two popular ones: Early Stopping Use “a lot” of hidden units Just don’t over-train Cross-validation Choose the “right” number of hidden units

Early Stopping: [plot of error vs. epochs] training-set error (b) keeps falling, but validation-set error (a) reaches a minimum and then rises as the model overfits; the stopping criterion is the minimum of the validation error.
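A skeleton of the early-stopping loop. Here train_epoch and val_error are hypothetical stand-ins (so the sketch runs); in practice they would be one gradient-descent pass and the error on the held-out validation set:

```python
import random

random.seed(0)
weights = [0.0]

def train_epoch():                 # stand-in for one gradient-descent pass
    weights[0] += random.gauss(0.1, 0.05)

def val_error():                   # stand-in for validation-set error
    return (weights[0] - 1.0) ** 2

best_err, best_w, patience, bad = float("inf"), weights[0], 5, 0
for epoch in range(200):
    train_epoch()
    err = val_error()
    if err < best_err:             # validation error still improving
        best_err, best_w, bad = err, weights[0], 0
    else:
        bad += 1
        if bad >= patience:        # it has stopped improving: stop training
            break

weights[0] = best_w                # roll back to the best-scoring epoch
print(epoch, best_w, best_err)
```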

Cross-validation Cross-validation: general-purpose technique for model selection E.g., “how many hidden units should I use?” More extensive version of validation-set approach.

Cross-validation:
  Break the training set into k sets
  For each model M:
    For i = 1…k:
      Train M on all sets but set i
      Test on set i
  Output the M with the highest average test score, trained on the full training set
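A runnable sketch of this recipe. The two candidate "models" (a mean predictor and a least-squares line) and the synthetic data are illustrative stand-ins for, say, networks with different numbers of hidden units:

```python
import random

random.seed(0)
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.2))
        for x in [random.random() for _ in range(60)]]

def fit_mean(train):                         # model A: predict the mean target
    m = sum(t for _, t in train) / len(train)
    return lambda x: m

def fit_linear(train):                       # model B: least-squares line
    n = len(train)
    mx = sum(x for x, _ in train) / n
    mt = sum(t for _, t in train) / n
    b = sum((x - mx) * (t - mt) for x, t in train) / \
        sum((x - mx) ** 2 for x, _ in train)
    return lambda x, a=mt - b * mx, b=b: a + b * x

def cv_score(fit, data, k=5):
    folds = [data[i::k] for i in range(k)]   # break the data into k sets
    total = 0.0
    for i in range(k):
        train = [d for j, f in enumerate(folds) if j != i for d in f]
        model = fit(train)                   # train on all sets but set i
        total += sum((t - model(x)) ** 2 for x, t in folds[i]) / len(folds[i])
    return total / k                         # average held-out error

print({f.__name__: cv_score(f, data) for f in (fit_mean, fit_linear)})
# pick the model with the lowest average error, then retrain on all the data
```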

Summary of Neural Networks Non-linear regression technique that is trained with gradient descent. Question: How important is the biological metaphor?

Summary of Neural Networks When are Neural Networks useful? Instances represented by attribute-value pairs Particularly when attributes are real valued The target function is Discrete-valued Real-valued Vector-valued Training examples may contain errors Fast evaluation times are necessary When not? Fast training times are necessary Understandability of the function is required

Advanced Topics in Neural Nets: Batch Mode vs. incremental; Hidden Layer Representations; Hopfield Nets; Neural Networks on Silicon; Neural Network language models.

Incremental vs. Batch Mode

Incremental vs. Batch Mode. In batch mode we minimize E_D(w) = ½ Σ_{d∈D} (t_d - o_d)². Same as computing Δw_d for each training example d, then setting w <- w + Σ_{d∈D} Δw_d (one update per pass through the data); incremental mode updates after each example.
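A side-by-side sketch of the two modes for a one-weight linear unit; the toy data (t = 2x) and learning rate are illustrative:

```python
examples = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
eta = 0.05

# Batch mode: accumulate the gradient over ALL examples, then update once.
w = 0.0
for step in range(200):
    w += sum(eta * (t - w * x) * x for x, t in examples)

# Incremental (stochastic) mode: update after EACH example.
v = 0.0
for step in range(200):
    for x, t in examples:
        v += eta * (t - v * x) * x

print(w, v)   # both approach 2.0
```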

Advanced Topics in Neural Nets: Batch Mode vs. incremental; Hidden Layer Representations; Hopfield Nets; Neural Networks on Silicon; Neural Network language models.

Hidden Layer Representations Input->Hidden Layer mapping: representation of input vectors tailored to the task Can also be exploited for dimensionality reduction Form of unsupervised learning in which we output a “more compact” representation of input vectors <x1, …,xn> -> <x’1, …,x’m> where m < n Useful for visualization, problem simplification, data compression, etc.

Dimensionality Reduction Model: an n-input, m-hidden-unit, n-output network with m < n. Function to learn: the identity f(x) = x; squeezing the input through the narrow hidden layer forces a compact encoding.
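A sketch of the classic 8-3-8 identity-network experiment in this spirit (Mitchell, ch. 4): the target equals the input, so the 3-unit hidden layer must discover a compact, roughly binary code for the 8 one-hot inputs. Learning rate, epochs, and seed are illustrative; more epochs may be needed before all eight codes separate.

```python
import math, random

random.seed(3)

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

n_in, n_hid = 8, 3
w_hid = [[random.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hid)]
w_out = [[random.uniform(-0.1, 0.1) for _ in range(n_hid + 1)] for _ in range(n_in)]
eta = 0.3
patterns = [[1.0 if i == j else 0.0 for i in range(8)] for j in range(8)]

for epoch in range(5000):
    for x in patterns:                        # the target equals the input
        xb = [1.0] + x
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
        hb = [1.0] + h
        o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_out]
        d_o = [o[k] * (1 - o[k]) * (x[k] - o[k]) for k in range(n_in)]
        d_h = [h[j] * (1 - h[j]) *
               sum(w_out[k][j + 1] * d_o[k] for k in range(n_in))
               for j in range(n_hid)]
        for k in range(n_in):
            w_out[k] = [w + eta * d_o[k] * v for w, v in zip(w_out[k], hb)]
        for j in range(n_hid):
            w_hid[j] = [w + eta * d_h[j] * v for w, v in zip(w_hid[j], xb)]

for x in patterns:                            # inspect the learned 3-bit codes
    xb = [1.0] + x
    h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_hid]
    print([round(v) for v in h])
```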

Dimensionality Reduction: Example [a sequence of figures, omitted].

Advanced Topics in Neural Nets: Batch Mode vs. incremental; Hidden Layer Representations; Hopfield Nets; Neural Networks on Silicon; Neural Network language models.

Neural Networks on Silicon. Currently we simulate neural networks (continuous device physics) on top of a digital computational model (thresholding), which itself sits on continuous device physics (voltage). Why not skip the digital layer?

Example: Silicon Retina Simulates function of biological retina Single-transistor synapses adapt to luminance, temporal contrast Modeling retina directly on chip => requires 100x less power!

Example: Silicon Retina Synapses modeled with single transistors

Luminance Adaptation

Comparison with Mammal Data: [plots of real retinal responses alongside the artificial (silicon) retina's responses].

Graphics and results taken from:

General NN learning in silicon? It seems less in vogue than in the late '90s; interest has turned somewhat to implementing Bayesian techniques in analog silicon.

Advanced Topics in Neural Nets: Batch Mode vs. incremental; Hidden Layer Representations; Hopfield Nets; Neural Networks on Silicon; Neural Network language models.

Neural Network Language Models. Statistical language modeling: predict the probability of the next word in a sequence. "I was headed to Madrid, ____": P(____ = "Spain") = 0.5, P(____ = "but") = 0.2, etc. Used in speech recognition, machine translation, and (recently) information extraction.

Formally, estimate: P(w_t | w_{t-1}, …, w_{t-n+1}) — the probability of the next word given the preceding n - 1 words.

Optimizations. Key idea – learn two things simultaneously: vector representations for each word (50-dimensional) and a predictor of the next word. Short-lists: much of the complexity is in the hidden-to-output layer, because the number of possible next words is large; so the network only predicts a short-list (a subset of words) and a standard probabilistic model handles the rest.
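A forward-pass sketch of this architecture on a toy vocabulary. All names, sizes, and (untrained) weights are illustrative assumptions; a real short-list implementation would restrict the softmax to the most frequent words and back off to an n-gram model for the rest, whereas here the softmax simply runs over the whole toy vocabulary:

```python
import math, random

random.seed(0)
vocab = ["I", "was", "headed", "to", "Madrid", ",", "Spain", "but"]
dim, n_hid, context = 4, 8, 3                 # the paper used ~50 dims; 4 here

# Learned word vectors, hidden-layer weights, and per-word output weights.
embed = {w: [random.gauss(0, 0.1) for _ in range(dim)] for w in vocab}
w_hid = [[random.gauss(0, 0.1) for _ in range(dim * context)] for _ in range(n_hid)]
w_out = {w: [random.gauss(0, 0.1) for _ in range(n_hid)] for w in vocab}

def next_word_probs(ctx):
    x = [v for w in ctx for v in embed[w]]               # concatenate vectors
    h = [math.tanh(sum(a * b for a, b in zip(row, x))) for row in w_hid]
    scores = {w: sum(a * b for a, b in zip(w_out[w], h)) for w in vocab}
    z = sum(math.exp(s) for s in scores.values())        # softmax normalizer
    return {w: math.exp(s) / z for w, s in scores.items()}

probs = next_word_probs(["to", "Madrid", ","])
print(max(probs, key=probs.get), probs)                  # untrained: ~uniform
```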

Design Decisions (1) Number of hidden units Almost no difference…

Design Decisions (2) Word representation (# of dimensions) They chose 120

Comparison vs. state of the art