Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Friday, 15 February 2008 William.

Slides:

Advertisements

Similar presentations

Artificial Neural Networks

Advertisements

Multi-Layer Perceptron (MLP)

Slides from: Doug Gray, David Poole

Artificial Intelligence 13. Multi-Layer ANNs Course V231 Department of Computing Imperial College © Simon Colton.

Artificial Neural Networks

Support Vector Machines

SVM—Support Vector Machines

CSCI 347 / CS 4206: Data Mining Module 07: Implementations Topic 03: Linear Models.

Tuomas Sandholm Carnegie Mellon University Computer Science Department

CS 4700: Foundations of Artificial Intelligence

Machine Learning: Connectionist McCulloch-Pitts Neuron Perceptrons Multilayer Networks Support Vector Machines Feedback Networks Hopfield Networks.

Machine Learning Neural Networks

Lecture 14 – Neural Networks

INTRODUCTION TO Machine Learning ETHEM ALPAYDIN © The MIT Press, Lecture Slides for.

Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.

The back-propagation training algorithm

Back-Propagation Algorithm

Chapter 6: Multilayer Neural Networks

Artificial Neural Networks

MACHINE LEARNING 12. Multilayer Perceptrons. Neural Networks Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)

Artificial Neural Networks

LOGO Classification III Lecturer: Dr. Bo Yuan

CS 4700: Foundations of Artificial Intelligence

CS 484 – Artificial Intelligence

Neural Networks. Background - Neural Networks can be : Biological - Biological models Artificial - Artificial models - Desire to produce artificial systems.

Radial Basis Function Networks

CSC 4510 – Machine Learning Dr. Mary-Angela Papalaskari Department of Computing Sciences Villanova University Course website:

Classification III Tamara Berg CS Artificial Intelligence Many slides throughout the course adapted from Svetlana Lazebnik, Dan Klein, Stuart Russell,

Dr. Hala Moushir Ebied Faculty of Computers & Information Sciences

Artificial Neural Networks

Classification Part 3: Artificial Neural Networks

Computer Science and Engineering

1 Artificial Neural Networks Sanun Srisuk EECP0720 Expert Systems – Artificial Neural Networks.

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Wednesday, 08 February 2007.

Chapter 11 – Neural Networks COMP 540 4/17/2007 Derek Singer.

CS464 Introduction to Machine Learning1 Artificial N eural N etworks Artificial neural networks (ANNs) provide a general, practical method for learning.

Machine Learning Chapter 4. Artificial Neural Networks

11 CSE 4705 Artificial Intelligence Jinbo Bi Department of Computer Science & Engineering

Kansas State University Department of Computing and Information Sciences CIS 798: Intelligent Systems and Machine Learning Thursday, September 16, 1999.

Pattern Classification All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley.

ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 16: NEURAL NETWORKS Objectives: Feedforward.

Artificial Intelligence Methods Neural Networks Lecture 4 Rakesh K. Bissoondeeal Rakesh K. Bissoondeeal.

Computing & Information Sciences Kansas State University Monday, 20 Nov 2006CIS 490 / 730: Artificial Intelligence Lecture 37 of 42 Monday, 20 November.

Machine Learning Using Support Vector Machines (Paper Review) Presented to: Prof. Dr. Mohamed Batouche Prepared By: Asma B. Al-Saleh Amani A. Al-Ajlan.

Neural Networks and Machine Learning Applications CSC 563 Prof. Mohamed Batouche Computer Science Department CCIS – King Saud University Riyadh, Saudi.

Kansas State University Department of Computing and Information Sciences CIS 830: Advanced Topics in Artificial Intelligence Wednesday, February 9, 2000.

Neural Networks and Backpropagation Sebastian Thrun , Fall 2000.

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Friday, 15 February 2008 William.

SUPERVISED LEARNING NETWORK

Today’s Topics Read: Chapters 7, 8, and 9 on Logical Representation and Reasoning HW3 due at 11:55pm THURS (ditto for your Nannon Tourney Entry) Recipe.

Computing & Information Sciences Kansas State University Friday, 17 Nov 2006CIS 490 / 730: Artificial Intelligence Lecture 36 of 42 Friday, 17 November.

Neural Networks Presented by M. Abbasi Course lecturer: Dr.Tohidkhah.

Artificial Neural Networks (Cont.) Chapter 4 Perceptron Gradient Descent Multilayer Networks Backpropagation Algorithm 1.

Neural Networks Teacher: Elena Marchiori R4.47 Assistant: Kees Jong S2.22

Artificial Neural Network

EEE502 Pattern Recognition

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Wednesday, 14 February 2007.

Computing & Information Sciences Kansas State University Monday, 20 Nov 2006CIS 490 / 730: Artificial Intelligence Lecture 37 of 42 Monday, 20 November.

Neural networks (2) Reminder Avoiding overfitting Deep neural network Brief summary of supervised learning methods.

129 Feed-Forward Artificial Neural Networks AMIA 2003, Machine Learning Tutorial Constantin F. Aliferis & Ioannis Tsamardinos Discovery Systems Laboratory.

Data Mining: Concepts and Techniques1 Prediction Prediction vs. classification Classification predicts categorical class label Prediction predicts continuous-valued.

Learning: Neural Networks Artificial Intelligence CMSC February 3, 2005.

Pattern Recognition Lecture 20: Neural Networks 3 Dr. Richard Spillman Pacific Lutheran University.

Learning with Neural Networks Artificial Intelligence CMSC February 19, 2002.

Lecture 13 Multi-Layer Perceptrons and Backpropagation of Error

Machine Learning Supervised Learning Classification and Regression

Artificial Neural Networks

Machine Learning Today: Reading: Maria Florina Balcan

Seminar on Machine Learning Rada Mihalcea

Presentation transcript:

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Friday, 15 February 2008 William H. Hsu Department of Computing and Information Sciences, KSU Readings: Today: Section 6.7, Han & Kamber 2e – Support Vector Machines (SVM) Next week: 6.8 – Association Rules Artificial Neural Networks: Backprop (continued) and Intro to SVM Lecture 11 of 42

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Review: Derivation of Backprop Recall: Gradient of Error Function Gradient of Sigmoid Activation Function But We Know: So:

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Review: Backprop, Feedforward Intuitive Idea: Distribute Blame for Error to Previous Layers Algorithm Train-by-Backprop (D, r) –Each training example is a pair of the form, where x is the vector of input values and t(x) is the output value. r is the learning rate (e.g., 0.05) –Initialize all weights w i to (small) random values –UNTIL the termination condition is met, DO FOR each in D, DO Input the instance x to the unit and compute the output o(x) =  (net(x)) FOR each output unit k, DO FOR each hidden unit j, DO Update each w = u i,j (a = h j ) or w = v j,k (a = o k ) w start-layer, end-layer  w start-layer, end-layer +  w start-layer, end-layer  w start-layer, end-layer  r  end-layer a end-layer –RETURN final u, v x1x1 x2x2 x3x3 Input Layer u 11 h1h1 h2h2 h3h3 h4h4 Hidden Layer o1o1 o2o2 v 42 Output Layer

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Backpropagation and Local Optima Gradient Descent in Backprop –Performed over entire network weight vector –Easily generalized to arbitrary directed graphs –Property: Backprop on feedforward ANNs will find a local (not necessarily global) error minimum Backprop in Practice –Local optimization often works well (can run multiple times) –Often include weight momentum  –Minimizes error over training examples - generalization to subsequent instances? –Training often very slow: thousands of iterations over D (epochs) –Inference (applying network after training) typically very fast Classification Control

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Feedforward ANNs: Representational Power and Bias Representational (i.e., Expressive) Power –Backprop presented for feedforward ANNs with single hidden layer (2-layer) –2-layer feedforward ANN Any Boolean function (simulate a 2-layer AND-OR network) Any bounded continuous function (approximate with arbitrarily small error) [Cybenko, 1989; Hornik et al, 1989] –Sigmoid functions: set of basis functions; used to compose arbitrary functions –3-layer feedforward ANN: any function (approximate with arbitrarily small error) [Cybenko, 1988] –Functions that ANNs are good at acquiring: Network Efficiently Representable Functions (NERFs) - how to characterize? [Russell and Norvig, 1995] Inductive Bias of ANNs –n-dimensional Euclidean space (weight space) –Continuous (error function smooth with respect to weight parameters) –Preference bias: “smooth interpolation” among positive examples –Not well understood yet (known to be computationally hard)

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Hidden Units and Feature Extraction –Training procedure: hidden unit representations that minimize error E –Sometimes backprop will define new hidden features that are not explicit in the input representation x, but which capture properties of the input instances that are most relevant to learning the target function t(x) –Hidden units express newly constructed features –Change of representation to linearly separable D’ A Target Function (Sparse aka 1-of-C, Coding) –Can this be learned? (Why or why not?) Learning Hidden Layer Representations

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Training: Evolution of Error and Hidden Unit Encoding error D (o k ) h j ( ), 1  j  3

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Input-to-Hidden Unit Weights and Feature Extraction –Changes in first weight layer values correspond to changes in hidden layer encoding and consequent output squared errors –w 0 (bias weight, analogue of threshold in LTU) converges to a value near 0 –Several changes in first 1000 epochs (different encodings) Training: Weight Evolution u i1, 1  i  8

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Convergence of Backpropagation No Guarantee of Convergence to Global Optimum Solution –Compare: perceptron convergence (to best h  H, provided h  H; i.e., LS) –Gradient descent to some local error minimum (perhaps not global minimum…) –Possible improvements on backprop (BP) Momentum term (BP variant with slightly different weight update rule) Stochastic gradient descent (BP algorithm variant) Train multiple nets with different initial weights; find a good mixture –Improvements on feedforward networks Bayesian learning for ANNs (e.g., simulated annealing) - later Other global optimization methods that integrate over multiple networks Nature of Convergence –Initialize weights near zero –Therefore, initial network near-linear –Increasingly non-linear functions possible as training progresses

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Overtraining in ANNs Error versus epochs (Example 2) Recall: Definition of Overfitting –h’ worse than h on D train, better on D test Overtraining: A Type of Overfitting –Due to excessive iterations –Avoidance: stopping criterion (cross-validation: holdout, k-fold) –Avoidance: weight decay Error versus epochs (Example 1)

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Overfitting in ANNs Other Causes of Overfitting Possible –Number of hidden units sometimes set in advance –Too few hidden units (“underfitting”) ANNs with no growth Analogy: underdetermined linear system of equations (more unknowns than equations) –Too many hidden units ANNs with no pruning Analogy: fitting a quadratic polynomial with an approximator of degree >> 2 Solution Approaches –Prevention: attribute subset selection (using pre-filter or wrapper) –Avoidance Hold out cross-validation (CV) set or split k ways (when to stop?) Weight decay: decrease each weight by some factor on each epoch –Detection/recovery: random restarts, addition and deletion of weights, units

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition 90% Accurate Learning Head Pose, Recognizing 1-of-20 Faces Example: Neural Nets for Face Recognition 30 x 32 Inputs Left Straight Right Up Hidden Layer Weights after 1 Epoch Hidden Layer Weights after 25 Epochs Output Layer Weights (including w 0 =  ) after 1 Epoch

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Example: NetTalk Sejnowski and Rosenberg, 1987 Early Large-Scale Application of Backprop –Learning to convert text to speech Acquired model: a mapping from letters to phonemes and stress marks Output passed to a speech synthesizer –Good performance after training on a vocabulary of ~1000 words Very Sophisticated Input-Output Encoding –Input: 7-letter window; determines the phoneme for the center letter and context on each side; distributed (i.e., sparse) representation: 200 bits –Output: units for articulatory modifiers (e.g., “voiced”), stress, closest phoneme; distributed representation –40 hidden units; weights total Experimental Results –Vocabulary: trained on 1024 of 1463 (informal) and 1000 of (dictionary) –78% on informal, ~60% on dictionary

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Alternative Error Functions Penalize Large Weights (with Penalty Factor w p ) Train on Both Target Slopes and Values Tie Together Weights –e.g., in phoneme recognition network –See: Connectionist Speech Recognition [Bourlard and Morgan, 1994]

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Some Current Issues and Open Problems in ANN Research Hybrid Approaches –Incorporating knowledge and analytical learning into ANNs Knowledge-based neural networks [Flann and Dietterich, 1989] Explanation-based neural networks [Towell et al, 1990; Thrun, 1996] –Combining uncertain reasoning and ANN learning and inference Probabilistic ANNs Bayesian networks [Pearl, 1988; Heckerman, 1996; Hinton et al, 1997] - later Global Optimization with ANNs –Markov chain Monte Carlo (MCMC) [Neal, 1996] - e.g., simulated annealing –Relationship to genetic algorithms - later Understanding ANN Output –Knowledge extraction from ANNs Rule extraction Other decision surfaces –Decision support and KDD applications [Fayyad et al, 1996] Many, Many More Issues (Robust Reasoning, Representations, etc.)

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Some ANN Applications Diagnosis –Closest to pure concept learning and classification –Some ANNs can be post-processed to produce probabilistic diagnoses Prediction and Monitoring –aka prognosis (sometimes forecasting) –Predict a continuation of (typically numerical) data Decision Support Systems –aka recommender systems –Provide assistance to human “subject matter” experts in making decisions Design (manufacturing, engineering) Therapy (medicine) Crisis management (medical, economic, military, computer security) Control Automation –Mobile robots –Autonomic sensors and actuators Many, Many More (ANNs for Automated Reasoning, etc.)

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition NeuroSolutions Demo

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Outline: Support Vector Machines A brief history of SVM Large-margin linear classifier –Linear separable –Nonlinear separable Creating nonlinear classifiers: kernel trick A simple example Discussion on SVM Conclusion

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition History of SVM SVM is related to statistical learning theory [3] SVM was first introduced in 1992 [1] SVM becomes popular because of its success in handwritten digit recognition –1.1% test error rate for SVM. This is the same as the error rates of a carefully constructed neural network, LeNet 4. –See Section 5.11 in [2] or the discussion in [3] for details SVM is now regarded as an important example of “kernel methods”, one of the key area in machine learning –Note: the meaning of “kernel” is different from the “kernel” function for Parzen windows [1] B.E. Boser et al. A Training Algorithm for Optimal Margin Classifiers. Proceedings of the Fifth Annual Workshop on Computational Learning Theory , Pittsburgh, [2] L. Bottou et al. Comparison of classifier methods: a case study in handwritten digit recognition. Proceedings of the 12th IAPR International Conference on Pattern Recognition, vol. 2, pp [3] V. Vapnik. The Nature of Statistical Learning Theory. 2 nd edition, Springer, 1999.

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition What is a good Decision Boundary? Consider a two-class, linearly separable classification problem Many decision boundaries! –The Perceptron algorithm can be used to find such a boundary –Different algorithms have been proposed (DHS ch. 5) Are all decision boundaries equally good? Class 1 Class 2

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Examples of Bad Decision Boundaries Class 1 Class 2 Class 1 Class 2

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Large-margin Decision Boundary The decision boundary should be as far away from the data of both classes as possible –We should maximize the margin, m –Distance between the origin and the line w t x=k is k/||w|| Class 1 Class 2 m

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Summary Points Multi-Layer ANNs –Focused on feedforward MLPs –Backpropagation of error: distributes penalty (loss) function throughout network –Gradient learning: takes derivative of error surface with respect to weights Error is based on difference between desired output (t) and actual output (o) Actual output (o) is based on activation function Must take partial derivative of   choose one that is easy to differentiate Two  definitions: sigmoid (aka logistic) and hyperbolic tangent (tanh) Overfitting in ANNs –Prevention: attribute subset selection –Avoidance: cross-validation, weight decay ANN Applications: Face Recognition, Text-to-Speech Open Problems Recurrent ANNs: Can Express Temporal Depth (Non-Markovity) Next: Support Vector Machines and Kernel Functions

Kansas State University Department of Computing and Information Sciences CIS 732: Machine Learning and Pattern Recognition Terminology Multi-Layer ANNs –Focused on one species: (feedforward) multi-layer perceptrons (MLPs) –Input layer: an implicit layer containing x i –Hidden layer: a layer containing input-to-hidden unit weights and producing h j –Output layer: a layer containing hidden-to-output unit weights and producing o k –n-layer ANN: an ANN containing n - 1 hidden layers –Epoch: one training iteration –Basis function: set of functions that span H Overfitting –Overfitting: h does better than h’ on training data and worse on test data –Overtraining: overfitting due to training for too many epochs –Prevention, avoidance, and recovery techniques Prevention: attribute subset selection Avoidance: stopping (termination) criteria (CV-based), weight decay Recurrent ANNs: Temporal ANNs with Directed Cycles