INTRODUCTION TO Machine Learning 3rd Edition

Lecture Slides for INTRODUCTION TO Machine Learning 3rd Edition, ETHEM ALPAYDIN © The MIT Press, 2014. alpaydin@boun.edu.tr http://www.cmpe.boun.edu.tr/~ethem/i2ml3e Modified by Prof. Carolina Ruiz for CS539 Machine Learning at WPI

CHAPTER 11: Multilayer Perceptrons

Neural Networks - Inspired by the human brain. Networks of processing units (neurons) with connections (synapses) between them. Large number of neurons: about 10^10 (other estimates say 10^11). Large connectivity: about 10^5 connections (synapses) per neuron. Parallel processing. Distributed computation/memory. Robust to noise and failures.

Understanding the Brain Levels of analysis (Marr, 1982) Computational theory Representation and algorithm Hardware implementation Reverse engineering: From hardware to theory Parallel processing: SIMD (single instruction, multiple data machines) vs MIMD (multiple instruction, multiple data machines) Neural net: SIMD with modifiable local memory Learning: Update by training/experience

Perceptron = one “neuron” or computational unit (Rosenblatt, 1962). [Figure: a perceptron, with inputs connected through weights to a single output.]

What a Perceptron Does. Regression: y = wx + w0 (a linear fit). Classification: y = 1 if wx + w0 > 0, else 0, i.e. y = s(wx + w0) where s is a threshold (or sigmoid) function. [Figure: three panels showing the regression line, the thresholded output, and the bias handled as an extra input x0 = +1 with weight w0.]
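
A minimal sketch (my own, not from the slides, assuming NumPy and a 0/1 threshold) of what a single perceptron computes for regression and for classification:

import numpy as np

def perceptron_regression(w, w0, x):
    # Linear output: y = w . x + w0
    return np.dot(w, x) + w0

def perceptron_classification(w, w0, x):
    # Threshold output: y = 1 if w . x + w0 > 0 else 0
    return 1 if np.dot(w, x) + w0 > 0 else 0

# Example usage with made-up weights
w, w0 = np.array([0.5, -0.3]), 0.1
x = np.array([1.0, 2.0])
print(perceptron_regression(w, w0, x))      # linear response
print(perceptron_classification(w, w0, x))  # class label 0/1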

K Outputs. Regression: y_i = w_i^T x + w_i0, one linear output per output dimension. Classification: choose C_i if y_i = max_k y_k, with the y_i given by the softmax. The softmax (normalized exponential) function converts a real vector into one with values in the range (0, 1) that add up to 1: y_i = exp(o_i) / Σ_k exp(o_k).
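
A small illustrative sketch (my own, assuming NumPy) of the softmax computation described above:

import numpy as np

def softmax(o):
    # Subtract the max for numerical stability; result lies in (0, 1) and sums to 1
    e = np.exp(o - np.max(o))
    return e / e.sum()

o = np.array([2.0, 1.0, 0.1])   # raw scores for K = 3 outputs
print(softmax(o))               # roughly [0.66, 0.24, 0.10], sums to 1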

Training (i.e., learning the right weights). Online (instances seen one by one) vs batch (whole sample) learning. In online (= incremental) learning: no need to store the whole sample; the problem may change in time; there may be wear and degradation in system components. Stochastic gradient descent (= incremental gradient descent): update after a single data instance. Generic update rule (LMS rule): Δw_j = η (r^t − y^t) x_j^t, i.e. update = learning factor × (desired output − actual output) × input.
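
A hedged sketch (assumed details, not from the slides) of one online pass with the LMS update on a linear unit:

import numpy as np

def lms_epoch(X, r, w, w0, eta=0.1):
    # One online pass: update after every single instance (stochastic gradient descent)
    for x_t, r_t in zip(X, r):
        y_t = np.dot(w, x_t) + w0          # current prediction
        w += eta * (r_t - y_t) * x_t       # LMS rule: eta * (desired - actual) * input
        w0 += eta * (r_t - y_t)
    return w, w0

X = np.array([[0.0], [1.0], [2.0]])
r = np.array([1.0, 3.0, 5.0])              # roughly r = 2x + 1
w, w0 = lms_epoch(X, r, np.zeros(1), 0.0)
print(w, w0)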

Training a Perceptron: Regression (linear output). y^t = w^T x^t, error E^t(w | x^t, r^t) = ½ (r^t − y^t)², update Δw_j^t = η (r^t − y^t) x_j^t.

Classification. Single sigmoid output: y^t = sigmoid(w^T x^t), with cross-entropy error E^t = −r^t log y^t − (1 − r^t) log(1 − y^t). K > 2 softmax outputs: E^t = −Σ_i r_i^t log y_i^t. The cross-entropy cost function is used instead of the quadratic error function to speed up learning; see Chapter 3 of Michael Nielsen’s “Neural Networks and Deep Learning” online textbook for an explanation of cross-entropy.
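
An illustrative sketch (my own, assuming NumPy) of the two cross-entropy costs mentioned above:

import numpy as np

def cross_entropy_sigmoid(r, y):
    # Single sigmoid output: -r log y - (1 - r) log(1 - y)
    return -(r * np.log(y) + (1 - r) * np.log(1 - y))

def cross_entropy_softmax(r, y):
    # K softmax outputs, r is a one-hot vector: -sum_i r_i log y_i
    return -np.sum(r * np.log(y))

print(cross_entropy_sigmoid(1.0, 0.9))                       # small loss: confident and correct
print(cross_entropy_softmax(np.array([0, 1, 0]),
                            np.array([0.2, 0.7, 0.1])))      # loss = -log 0.7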

Learning Boolean AND
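
AND is linearly separable, so a single perceptron with suitable weights computes it; a minimal sketch with hand-picked weights (assumed here for illustration):

def and_perceptron(x1, x2):
    # Hand-picked weights w1 = w2 = 1, w0 = -1.5 separate (1, 1) from the other inputs
    return 1 if x1 + x2 - 1.5 > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, and_perceptron(x1, x2))   # 1 only for (1, 1)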

XOR. No w0, w1, w2 simultaneously satisfy w0 ≤ 0, w2 + w0 > 0, w1 + w0 > 0, and w1 + w2 + w0 ≤ 0, so a single perceptron cannot represent XOR (Minsky and Papert, 1969).

Multilayer Perceptrons (Rumelhart, Hinton, and Williams, 1986). [Figure: a network with an input layer, a hidden layer, and an output layer.]

x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
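
A sketch (hand-chosen weights, assumed for illustration) of an MLP with two hidden units implementing the decomposition above: one hidden unit fires for x1 AND NOT x2, the other for NOT x1 AND x2, and the output ORs them:

def step(a):
    return 1 if a > 0 else 0

def xor_mlp(x1, x2):
    h1 = step(x1 - x2 - 0.5)    # x1 AND NOT x2
    h2 = step(x2 - x1 - 0.5)    # NOT x1 AND x2
    return step(h1 + h2 - 0.5)  # h1 OR h2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_mlp(x1, x2))   # 1 for (0,1) and (1,0), else 0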

Backpropagation (Rumelhart, Hinton & Williams, 1986). The gradient of the error function with respect to the first-layer weights is obtained with the chain rule, ∂E/∂w_hj = (∂E/∂y_i)(∂y_i/∂z_h)(∂z_h/∂w_hj), propagating the error from the output back through the hidden units.

Regression. [Figure: the forward pass computes the hidden unit outputs z_h from the input x and then the output y; the backward pass propagates the error to update the second-layer weights v_h and the first-layer weights w_hj.]

Regression with Multiple Outputs. [Figure: inputs x_j connect to hidden units z_h through weights w_hj, and hidden units connect to outputs y_i through weights v_ih.]

Error Back Propagation Algorithm using Stochastic Gradient Descent. Initialize all weights to small random values. Repeat (each pass over the training set is an epoch): for each training instance, (1) forward propagation of inputs: compute the hidden unit outputs z_h (h: index over hidden units) and then the network outputs y_i (i: index over output units); (2) backward propagation of errors: compute the error terms at the outputs and propagate them back to the hidden units; (3) weights update: adjust the second-layer and first-layer weights using these error terms. Let’s talk about termination criteria (e.g., a maximum number of epochs, or the validation error no longer decreasing).
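
A compact sketch of this stochastic-gradient backpropagation loop (my own implementation, under the assumptions of one hidden layer of sigmoid units, a single linear output, and squared error; not the slides' exact notation):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, r, H=2, eta=0.1, epochs=300, seed=0):
    # One hidden layer of H sigmoid units, one linear output, squared error
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = rng.uniform(-0.01, 0.01, (H, d + 1))    # 1st-layer weights, small random values
    V = rng.uniform(-0.01, 0.01, H + 1)         # 2nd-layer weights (incl. bias)
    for _ in range(epochs):                     # one pass over the sample = one epoch
        for t in rng.permutation(N):            # stochastic: update after each instance
            x = np.append(X[t], 1.0)            # bias input x0 = +1
            z = np.append(sigmoid(W @ x), 1.0)  # forward propagation: hidden outputs z_h
            y = V @ z                           # forward propagation: output y
            delta = r[t] - y                    # output error term
            dV = eta * delta * z                # backward propagation to 2nd-layer weights
            dW = eta * delta * np.outer(V[:-1] * z[:-1] * (1 - z[:-1]), x)  # 1st layer
            V += dV                             # weights update
            W += dW
    return W, V

# Toy data in the spirit of the slides' sin(6x) illustration (data assumed here)
X = np.linspace(0, 1, 50).reshape(-1, 1)
r = np.sin(6 * X[:, 0])
W, V = train_mlp(X, r, H=2, eta=0.2, epochs=300)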

[Figure: fit of an MLP with two hidden units after 100, 200, and 300 epochs; + marks the training instances and the dotted curve is f(x) = sin(6x).]

[Figure: the hyperplanes w_h x + w_0 defined by the 1st-layer weights of the two hidden units, the hidden unit outputs z_h, and the hidden unit outputs multiplied by the 2nd-layer weights, v_h z_h. The MLP fit is formed as the sum of the weighted outputs of the hidden units.]

Two-Class Discrimination. One sigmoid output y^t estimates P(C1|x^t), and P(C2|x^t) ≡ 1 − y^t.

K > 2 Classes: K softmax outputs, with y_i^t estimating P(C_i|x^t).

Multiple Hidden Layers. An MLP (multi-layer perceptron) with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple hidden layers may lead to simpler networks.

Improving Convergence. Momentum: Δw_i^t = −η ∂E^t/∂w_i + α Δw_i^(t−1), where t is time (= epoch) and α is a constant, say between 0.5 and 1. Adaptive learning rate: if the error rate is decreasing, increase the learning rate by a constant amount; otherwise, decrease the learning rate geometrically.
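
A hedged sketch (assumed variable names, not the book's code) of these two heuristics:

import numpy as np

def momentum_step(w, grad, prev_dw, eta=0.1, alpha=0.9):
    # dw_t = -eta * grad + alpha * dw_{t-1}; returns new weights and the step taken
    dw = -eta * grad + alpha * prev_dw
    return w + dw, dw

def adapt_learning_rate(eta, err, prev_err, inc=0.01, dec=0.7):
    # Increase eta by a constant amount while the error decreases,
    # otherwise decrease it geometrically
    return eta + inc if err < prev_err else eta * dec

w, dw = np.zeros(3), np.zeros(3)
w, dw = momentum_step(w, grad=np.array([0.2, -0.1, 0.05]), prev_dw=dw)
print(w, adapt_learning_rate(0.1, err=0.32, prev_err=0.35))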

Overfitting/Overtraining. Number of weights: H(d+1) + (H+1)K. An MLP with d inputs, H hidden units, and K outputs has H×(d+1) weights in the 1st layer and K×(H+1) in the 2nd layer.
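
A quick worked check of that count (illustrative numbers, my own):

def n_weights(d, H, K):
    # H x (d + 1) first-layer weights plus K x (H + 1) second-layer weights
    return H * (d + 1) + K * (H + 1)

print(n_weights(d=8, H=10, K=3))   # 10*9 + 3*11 = 123 weights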

Fully Connected, Layered MLP Architecture. Taken from Chapter 6 of Michael Nielsen’s “Neural Networks and Deep Learning” online textbook.

Expressive Capabilities of ANNs (taken from Tom Mitchell’s “Machine Learning” textbook). Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs. Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]; any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

Other ANN Architectures: Structured MLP. Convolutional networks (deep learning) are useful on data that has “local structure” (e.g., images, sound): each unit is connected to a local group of units and intends to capture a particular aspect of the data (e.g., an object in an image; a speech phoneme in a speech signal) (Le Cun et al., 1989). See Chapter 6 of Michael Nielsen’s “Neural Networks and Deep Learning” online textbook for an explanation of convolutional networks.

Weight Sharing. [Figure: three translations over the data of the same “filter”.] Each “filter” is applied on overlapping segments of the data, and uses the same weights each time it is applied during the same epoch (“translation invariance”). These weights are updated after each epoch.
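
An illustrative sketch (my own, assuming a 1-D signal and a single filter) of weight sharing: the same filter weights are reused at every position:

import numpy as np

def shared_filter_outputs(x, w):
    # Slide the same weight vector w over overlapping segments of x
    k = len(w)
    return np.array([np.dot(w, x[i:i + k]) for i in range(len(x) - k + 1)])

x = np.array([0.0, 1.0, 2.0, 1.0, 0.0, -1.0])
w = np.array([0.5, 1.0, 0.5])              # one shared filter, reused at each translation
print(shared_filter_outputs(x, w))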

Hints (Abu-Mostafa, 1995): invariance to translation, rotation, size (the “A” remains an “A”). Virtual examples: add transformed copies of the training instances. Augmented error: E' = E + λ_h E_h, where E_h = [g(x|θ) − g(x'|θ)]² if x' and x are the “same”.
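
A hedged sketch (assumed stand-in model g and transform, purely illustrative) of the augmented error above:

import numpy as np

def augmented_error(E_data, g, x, x_prime, theta, lam_h=0.1):
    # E' = E + lambda_h * E_h, with E_h = (g(x|theta) - g(x'|theta))^2
    E_h = (g(x, theta) - g(x_prime, theta)) ** 2
    return E_data + lam_h * E_h

g = lambda x, theta: float(np.dot(theta, x))      # a stand-in linear model
x = np.array([1.0, 2.0])
x_prime = np.array([2.0, 1.0])                    # a "same" (e.g. mirrored) version of x
print(augmented_error(0.5, g, x, x_prime, theta=np.array([0.3, 0.3])))   # no penalty if invariant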

Convolutional Neural Networks in Practice. Figures taken from Chapter 6 of Michael Nielsen’s “Neural Networks and Deep Learning” online textbook and from “Understanding Convolutional Neural Networks for NLP”.

Tuning the Network Size. Destructive approach, weight decay: Δw_i = −η ∂E/∂w_i − λ w_i, equivalently penalize large weights, E' = E + (λ/2) Σ_i w_i². Constructive approach: growing networks (Ash, 1989; Fahlman and Lebiere, 1989).
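
A small sketch (my own, assuming the gradient is already computed) of a weight-decay update:

import numpy as np

def weight_decay_step(w, grad, eta=0.1, lam=0.01):
    # dw_i = -eta * dE/dw_i - lambda * w_i  (shrinks unused weights toward zero)
    return w - eta * grad - lam * w

w = np.array([0.8, -0.3, 0.05])
print(weight_decay_step(w, grad=np.array([0.1, 0.0, -0.2])))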

Bayesian Learning. Consider the weights w_i as random variables with a prior p(w_i). Weight decay, ridge regression, regularization: cost = data-misfit + λ × complexity. More about Bayesian methods in Chapter 14.

Dimensionality Reduction Autoencoder networks
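
A minimal sketch (my own, assuming a linear autoencoder trained by gradient descent) of the autoencoder idea: the network is trained to reproduce its input through a narrow hidden layer, and that hidden layer then serves as a reduced-dimensional encoding:

import numpy as np

def train_linear_autoencoder(X, H=2, eta=0.01, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W_enc = rng.normal(0, 0.1, (H, d))        # encoder: d inputs -> H hidden units
    W_dec = rng.normal(0, 0.1, (d, H))        # decoder: H hidden units -> d outputs
    for _ in range(epochs):
        for x in X:
            z = W_enc @ x                     # low-dimensional code
            x_hat = W_dec @ z                 # reconstruction
            err = x - x_hat                   # reconstruction error
            W_dec += eta * np.outer(err, z)   # gradient steps on squared reconstruction error
            W_enc += eta * np.outer(W_dec.T @ err, x)
    return W_enc, W_dec

X = np.random.default_rng(1).normal(size=(100, 5))
W_enc, W_dec = train_linear_autoencoder(X)
print(W_enc @ X[0])                           # 2-dimensional encoding of the first instance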

Autoencoder: An Example Figure from Hinton and Salakhutdinov, Science, 2006

Learning Time. Applications: sequence recognition (e.g., speech recognition), sequence reproduction (e.g., time-series prediction), sequence association. Network architectures: time-delay networks (Waibel et al., 1989), recurrent networks (Rumelhart et al., 1986).

Time-Delay Neural Networks

Recurrent Networks

Unfolding in Time
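
A hedged sketch (my own notation, assuming a simple recurrent unit) of unfolding a recurrent network in time: the same weights are reused at every step, so the unrolled computation looks like a deep feedforward pass:

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def unfolded_forward(x_seq, W_in, W_rec, W_out):
    # Reuse the same W_in, W_rec, W_out at every time step of the unrolled network
    h = np.zeros(W_rec.shape[0])
    outputs = []
    for x_t in x_seq:
        h = sigmoid(W_in @ x_t + W_rec @ h)   # hidden state carries context forward
        outputs.append(W_out @ h)
    return np.array(outputs)

rng = np.random.default_rng(0)
x_seq = rng.normal(size=(4, 3))               # a length-4 sequence of 3-dim inputs
y = unfolded_forward(x_seq, rng.normal(size=(5, 3)),
                     rng.normal(size=(5, 5)), rng.normal(size=(1, 5)))
print(y)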

Deep Networks. Layers of feature extraction units. Can have local receptive fields as in convolutional networks, or can be fully connected. Can be trained layer by layer using an autoencoder in an unsupervised manner. No need to craft the right features, the right basis functions, or the right dimensionality reduction method; the network learns multiple layers of abstraction by itself given a lot of data and a lot of computation. Applications in vision, language processing, ...