Chapter 6 – Classification (Advanced) Shuaiqiang Wang ( 王帅强 ) School of Computer Science and Technology Shandong University of Finance and Economics Homepage: The ALPHA Lab:

2 Outline Optimization Perceptron Neural Networks Support Vector Machines (SVM) Logistic Regression

Notations

Gradient

Hessian
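For reference, the standard definitions that the Gradient and Hessian slides presumably state, for a function f of x = (x_1, ..., x_n) (the slides' exact notation is an assumption), are:

```latex
\nabla f(\mathbf{x}) = \left( \frac{\partial f}{\partial x_1}, \dots, \frac{\partial f}{\partial x_n} \right)^{\!\top},
\qquad
\nabla^2 f(\mathbf{x}) = \left[ \frac{\partial^2 f}{\partial x_i \, \partial x_j} \right]_{i,j=1}^{n}
```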

Example 1

Property 1

Example 2

Property 2

Proof (cont)

Example 3

Property 3

Principles

Algorithm
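The bodies of the Principles and Algorithm slides are not reproduced here; assuming this part of the Optimization section presents gradient descent (an assumption, based on the gradient defined above), a minimal sketch is:

```python
import numpy as np

def gradient_descent(grad, x0, eta=0.1, tol=1e-6, max_iter=1000):
    """Iterate x <- x - eta * grad(x) until the gradient norm is small."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # stopping criterion on the gradient norm
            break
        x = x - eta * g
    return x

# Usage: minimize f(x) = (x1 - 3)^2 + (x2 + 1)^2, whose gradient is 2(x - (3, -1)).
print(gradient_descent(lambda x: 2 * (x - np.array([3.0, -1.0])), x0=[0.0, 0.0]))
```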

Principles

Algorithm

18 Outline Optimization Perceptron Neural Networks Support Vector Machines (SVM) Logistic Regression

19 The Neuron The neuron is the basic information processing unit of a NN. It consists of: (1) a set of synapses or connecting links, each link characterized by a weight w_1, w_2, ..., w_m; (2) an adder function (linear combiner) which computes the weighted sum of the inputs; (3) an activation function (squashing function) for limiting the amplitude of the output of the neuron.

20 The Neuron (diagram): input signals x_1, ..., x_m with synaptic weights w_1, ..., w_m enter the summing function together with the bias b, producing the local field v, which the activation function maps to the output y.

21 Bias of a Neuron The bias b has the effect of applying an affine transformation to the combiner output u: v = u + b, where v is the induced local field of the neuron.

22 Bias As An Extra Input (diagram): the bias is treated as an extra input x_0 = 1 with weight w_0, so the summing function combines x_0, x_1, ..., x_m with weights w_0, w_1, ..., w_m to form the local field v, which the activation function maps to the output y.

23 Activation Function 1. Linear function f(x) = ax 2. Step function 3. Ramp function

24 Activation Function 4. Logistic function 5. Hyperbolic tangent 6. Gaussian function
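The formulas for these activation functions were not captured on the slides above; commonly used forms (the parameters a, θ, σ are illustrative, and the slides' exact expressions are assumed) are:

```latex
\text{Linear: } f(x) = a x, \qquad
\text{Step: } f(x) = \begin{cases} 1, & x \ge \theta \\ 0, & x < \theta \end{cases}, \qquad
\text{Ramp: } f(x) = \min\bigl(1, \max(0, x)\bigr)

\text{Logistic: } f(x) = \frac{1}{1 + e^{-x}}, \qquad
\text{Hyperbolic tangent: } f(x) = \tanh(x), \qquad
\text{Gaussian: } f(x) = e^{-x^2/(2\sigma^2)}
```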

25 Activation function

26 Definitions x_j denotes the j-th item in the n-dimensional input vector; w_j denotes the j-th item in the weight vector; f(x) denotes the output from the neuron when presented with input x; α is a constant with 0 < α < 1 (the learning rate). There must be a training set of input-label pairs.

27 Perceptron: Learning Rule Err = y - f(x), where y is the desired output and f(x) is the actual output. Update: w_j ← w_j + α · Err · x_j, i.e. w_j = w_j + α (y - f(x)) x_j, where α is a constant called the learning rate.

28 Least Mean Square (LMS) learning LMS learning is more general than the previous perceptron learning rule. The idea is to minimize the total error D, measured over all training patterns P, where O is the raw output computed by the linear combiner. E.g. if we have two patterns with T1 = 1, O1 = 0.8 and T2 = 0, O2 = 0.5, then D = (0.5)[(1 - 0.8)^2 + (0 - 0.5)^2] = 0.145. To minimize D, the weights are moved down the error surface E(W), from W(old) to W(new), at a rate set by the learning-rate constant C.
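As a quick check of the arithmetic in the example above, a minimal snippet (assuming the squared-error form D = 0.5 Σ_p (T_p - O_p)^2 used there):

```python
# LMS error for the two example patterns (T = target, O = raw output).
targets = [1.0, 0.0]
outputs = [0.8, 0.5]
D = 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))
print(D)  # 0.5 * (0.2**2 + 0.5**2) = 0.145
```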


31 Procedure
Initialize w
For i = 1 to N {
    For each point p {
        Compute the error e(p)
        Update w with e(p)
    }
    If the termination criterion has been satisfied, return
}
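A minimal runnable sketch of the learning rule and procedure above (an illustration, not the slide's own code; it assumes a sign activation and treats the bias as the extra weight w_0 on x_0 = 1):

```python
import numpy as np

def sgn(v):
    return 1 if v >= 0 else -1

def train_perceptron(X, y, alpha=0.5, epochs=20):
    """Perceptron rule: w_j <- w_j + alpha * (y - f(x)) * x_j."""
    Xb = np.hstack([np.ones((len(X), 1)), X])   # prepend x_0 = 1 for the bias weight
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_n, y_n in zip(Xb, y):
            err = y_n - sgn(w @ x_n)            # error for this point
            if err != 0:
                w += alpha * err * x_n          # update w with the error
                mistakes += 1
        if mistakes == 0:                       # termination criterion: no errors left
            break
    return w

# Usage on a small linearly separable set (the labels here are illustrative).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([1, 1, -1, 1])
print(train_perceptron(X, y))
```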

32 Example
Training data x(n)    Labels y(n)
(0, 0)^T              1
(0, 1)^T              1
(1, 0)^T
(1, 1)^T              1
Learning rate = 0.5, the initial value of w is (0, 1), the activation function is sgn(x), and the threshold value is -1.

33-38 (Worked iterations: the weight-update calculations for n = 0 through n = 12 were shown as equations on these slides.)

39 The Classifier
Training data x(n)    Labels y(n)
(0, 0)^T              1
(0, 1)^T              1
(1, 0)^T
(1, 1)^T              1

40 Perceptron Classifier For example, suppose there are 4 training data points (with 2 positive examples of the class and 2 negative examples). The initial random values of the weights will probably not divide these points accurately (table columns: X1, X2, Class).

41 Perceptron Classifier But during training the weight values are changed, based on the reduction of the error. Eventually a line can be found that does divide the points and solves the classification task, e.g. a line of the form w_1 X_1 + w_2 X_2 = 0.

42 Outline Optimization Perceptron Neural Networks Support Vector Machines (SVM) Logistic Regression

Perceptron (diagram): input signals x_0 = 1, x_1, ..., x_m with synaptic weights w_0, w_1, ..., w_m feed the summing function to produce the local field v; the activation function gives the output y, and the error between y and the desired output drives learning.

44 Neural Network Inputs are put through a 'hidden layer' before the output layer; all nodes are connected between layers.

45 Network Architecture With one hidden layer (diagram: inputs, one hidden layer, outputs).

46 Network Architecture With three hidden layers

47 Learning Rule Measure error Reduce that error – By appropriately adjusting each of the weights in the network

48 BP Network Details Forward Pass: – Error is calculated from outputs – Used to update output weights Backward Pass: – Error at hidden nodes is calculated by back propagating the error at the outputs through the new weights – Hidden weights updated

49 Two basic signal flows in BP Illustration of the directions of two basic signal flows in a multilayer perceptron: forward propagation of function signals and back propagation of error signals.

50 Two basic signal flows in BP Signal-flow graph highlighting the details of output neuron k connected to hidden neuron j.

51 BP Network Details The error function of the j-th neuron: (BP-1). The current mean squared error (MSE): (BP-2). For N training instances: (BP-3).
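The formulas labelled BP-1 to BP-3 are not shown above; a standard reconstruction in the usual multilayer-perceptron notation (d_j is the desired response of neuron j and y_j its actual output; the slide's own notation is assumed):

```latex
e_j(n) = d_j(n) - y_j(n) \qquad \text{(BP-1)}

E(n) = \tfrac{1}{2} \sum_j e_j^2(n) \qquad \text{(BP-2)}

E_{\mathrm{av}} = \frac{1}{N} \sum_{n=1}^{N} E(n) \qquad \text{(BP-3)}
```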

52 Batch learning vs. on-line learning In batch learning, adjustments to the synaptic weights of the multilayer perceptron are performed after the presentation of all N training samples; one complete presentation of the N samples is called an epoch of training. The cost function for batch learning is therefore the average error energy E_av. Advantages – accurate estimation of the gradient vector – easy parallelization. Disadvantage – larger storage requirements.

53 Batch learning vs. on-line learning In on-line learning, adjustments to the weights are performed on an example-by-example basis, so the function to be minimized is the total instantaneous error energy E(n). Given that the training examples are presented to the network in a random order, on-line learning makes the search in weight space stochastic in nature, so this method is also referred to as a stochastic method. Advantages – it can take advantage of redundant data – it is able to track small changes in the training data – it is simple to implement – it provides effective solutions to large-scale and difficult pattern-classification problems. Disadvantage – its stochastic nature works against parallelization.

54 BP Network Details Given a training dataset, the BP network tries to minimize E(n). In the n-th iteration, the output of the j-th neuron can be computed as follows: (BP-4), (BP-5).
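BP-4 and BP-5 are likewise not shown; the standard forms (w_ji is the weight from neuron i to neuron j and φ_j its activation function; again an assumption about the slide's notation):

```latex
v_j(n) = \sum_i w_{ji}(n)\, y_i(n) \qquad \text{(BP-4)}

y_j(n) = \varphi_j\bigl(v_j(n)\bigr) \qquad \text{(BP-5)}
```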

55 BP Network Details According to the gradient descent update rule, the weight correction follows from the chain rule together with BP-4; we then define the local gradient δ_j(n).

56 BP Network Details If the j-th neuron is an output neuron, then according to BP-1 and BP-2:

57 BP Network Details If the j-th neuron is NOT an output neuron, the errors are propagated back from the neurons k in the following layer; if the k-th neuron is an output neuron:
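The derivation equations on these three slides were not captured; the standard back-propagation derivation they correspond to, in the BP-1 to BP-5 notation (a hedged reconstruction), is:

```latex
\Delta w_{ji}(n) = -\eta \frac{\partial E(n)}{\partial w_{ji}(n)} = \eta\, \delta_j(n)\, y_i(n),
\qquad
\delta_j(n) \equiv -\frac{\partial E(n)}{\partial v_j(n)}

\delta_j(n) =
\begin{cases}
e_j(n)\, \varphi_j'\bigl(v_j(n)\bigr), & j \text{ an output neuron},\\[4pt]
\varphi_j'\bigl(v_j(n)\bigr) \sum_k \delta_k(n)\, w_{kj}(n), & j \text{ a hidden neuron}.
\end{cases}
```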

58 Activation functions The activation function must be continuous (indeed differentiable, so that the local gradients can be computed). A commonly used choice in the multilayer perceptron is the sigmoidal nonlinearity, two forms of which are: – the logistic function – the hyperbolic tangent function.

59 Signal-flow graph Signal-flow graph of a part of the adjoint system pertaining to back-propagation of error signals.

60 Two passes of computation Forward pass – the function signals are propagated forward through the network, layer by layer. Backward pass – this pass starts at the output layer, passing the error signals leftward through the network, layer by layer, and recursively computing the delta (local gradient) for each neuron.

61 BP Algorithm Learning Procedure: 1. Initialization, including weights and other parameters. 2. Present inputs from training data. 3. Forward computation. 4. Backward computation. 5. Iteration.
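A compact runnable sketch of these five steps (an illustration only: it assumes one hidden layer, logistic activations, squared-error loss and on-line updates; none of these choices are prescribed by the slide):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def train_bp(X, Y, n_hidden=4, eta=0.5, epochs=5000, rng=np.random.default_rng(0)):
    # 1. Initialization: small random weights, biases folded in as extra rows.
    W1 = rng.normal(scale=0.5, size=(X.shape[1] + 1, n_hidden))
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1, Y.shape[1]))
    for _ in range(epochs):
        for x, d in zip(X, Y):                    # 2. Present inputs one at a time.
            x1 = np.append(x, 1.0)
            h = sigmoid(x1 @ W1)                  # 3. Forward computation (hidden layer).
            h1 = np.append(h, 1.0)
            y = sigmoid(h1 @ W2)                  #    Forward computation (output layer).
            delta_out = (d - y) * y * (1.0 - y)   # 4. Backward computation of the deltas.
            delta_hid = h * (1.0 - h) * (W2[:-1] @ delta_out)
            W2 += eta * np.outer(h1, delta_out)
            W1 += eta * np.outer(x1, delta_hid)
        # 5. Iteration: repeat until the stopping criterion (here a fixed epoch count).
    return W1, W2

# Usage: learn XOR, a task a single perceptron cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)
W1, W2 = train_bp(X, Y)
```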

62 Rate of Learning – The learning rate should not be too large or too small. – In order to avoid the danger of instability, a momentum term can be introduced into the update equation. – A penalty can be added for each weight.

63 Stopping Criteria In general, the BP cannot be shown to converge, and there are no well-defined criteria for stopping its operation. However, there are some reasonable criteria that can be used to terminate the weight adjustments, e.g. – When the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold. – When the average squared error per epoch is sufficiently small. Usually, it is in the range of 0.1 to 1 percent per epoch, or as small as 0.01 percent.

64 Back propagation
Strengths of BP learning – great representation power – wide practical applicability – easy to implement – good generalization power.
Problems of BP learning – learning often takes a long time to converge – the net is essentially a black box – gradient descent only guarantees a local minimum of the error – not every function that is representable can be learned – generalization is not guaranteed even if the error is reduced to zero – there is no well-founded way to assess the quality of BP learning – network paralysis may occur (learning stops) – selection of learning parameters can only be done by trial and error – BP learning is non-incremental (to include new training samples, the network must be re-trained with all old and new samples).

65 Outline Optimization Perceptron Neural Networks Support Vector Machines (SVM) Logistic Regression

Linear Classification For a classification task – Input: training pairs (x_i, y_i) with labels y_i in {+1, -1} – Output: a linear classifier f(x) = sgn(w^T x + b).

Linear Discriminant Function (figure: two classes of points, labelled +1 and -1, in the (x1, x2) plane) How would you classify these points using a linear discriminant function in order to minimize the error rate? There are an infinite number of answers! Which one is the best?

Margin (figure: the same two classes, +1 and -1, in the (x1, x2) plane) For the data points, a scale transformation can be applied to both w and b so that the points closest to the separating hyperplane satisfy w^T x + b = ±1.

Margin (figure: the separating hyperplane w^T x + b = 0 with the margin hyperplanes w^T x + b = 1 and w^T x + b = -1; the points x+ and x- lying on them are the support vectors, and w is normal to the hyperplane) We know that w^T x+ + b = 1 and w^T x- + b = -1, so the margin width is 2 / ||w||.

SVM: Large Margin Linear Classifier The points on the margin hyperplanes are called support vectors! If the data are separable, the loss function can be written as the constrained margin-maximization problem sketched below. How to optimize it?
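A standard way to write the separable (hard-margin) objective referred to above, i.e. maximizing the margin 2/||w|| (the slide's exact formulation is assumed):

```latex
\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
\quad \text{s.t.}\quad
y_i\bigl(\mathbf{w}^{\top}\mathbf{x}_i + b\bigr) \ge 1, \qquad i = 1, \dots, N
```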

Lagrangian For a constrained optimization problem, the Lagrangian is the objective plus each constraint weighted by a Lagrange multiplier.

The Primal Problem Consider the constrained optimization problem above. The primal problem is obtained by minimizing, over the primal variables, the maximum of the Lagrangian over the multipliers.

Optimization? How can we optimize the primal problem? Note that the min-max form is equivalent to the original constrained problem, so nothing changes!

Dual Problem Consider the dual problem: it is exactly the same as our primal problem except that the order of the "max" and the "min" is exchanged.

KKT Conditions Let w*, α*, β* be the primal and dual solutions; they coincide (strong duality holds) if the KKT conditions are satisfied:
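The formulas on the Lagrangian, primal/dual and KKT slides above were not captured; the standard statements they refer to (an assumed reconstruction) are:

```latex
% Constrained problem and its Lagrangian
\min_{w} f(w)\ \ \text{s.t.}\ \ g_i(w) \le 0,\ \ h_j(w) = 0,
\qquad
\mathcal{L}(w, \alpha, \beta) = f(w) + \sum_i \alpha_i g_i(w) + \sum_j \beta_j h_j(w), \quad \alpha_i \ge 0

% Primal and dual problems (weak duality: d* <= p*)
p^* = \min_{w}\ \max_{\alpha \ge 0,\, \beta} \mathcal{L}(w, \alpha, \beta),
\qquad
d^* = \max_{\alpha \ge 0,\, \beta}\ \min_{w} \mathcal{L}(w, \alpha, \beta)

% KKT conditions, under which d* = p*
\frac{\partial \mathcal{L}}{\partial w} = 0, \quad
\alpha_i\, g_i(w) = 0, \quad
g_i(w) \le 0, \quad
h_j(w) = 0, \quad
\alpha_i \ge 0
```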

Optimization for SVM Construct the Lagrangian and the dual problem

Optimization


Thus, the optimization problem becomes the dual problem given below.
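Setting the derivatives of the SVM Lagrangian to zero and substituting back gives the standard dual (a reconstruction of the missing formulas; the slide's notation is assumed):

```latex
\mathcal{L}(\mathbf{w}, b, \boldsymbol{\alpha})
  = \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
  - \sum_i \alpha_i \bigl[ y_i(\mathbf{w}^{\top}\mathbf{x}_i + b) - 1 \bigr]

\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i,
\qquad
\frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_i \alpha_i y_i = 0

\max_{\boldsymbol{\alpha}}\ \sum_i \alpha_i
  - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^{\top}\mathbf{x}_j
\quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0
```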

Optimization

Non-Separable Case We relax the constraints to allow classification errors.
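A common way to write the relaxed (soft-margin) problem, with slack variables ξ_i and a penalty parameter C (the standard formulation; the slide's exact form is assumed):

```latex
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^2 + C \sum_i \xi_i
\quad \text{s.t.}\quad
y_i\bigl(\mathbf{w}^{\top}\mathbf{x}_i + b\bigr) \ge 1 - \xi_i,\ \ \xi_i \ge 0

% The corresponding dual differs from the separable case only in the box constraint:
\max_{\boldsymbol{\alpha}}\ \sum_i \alpha_i
  - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^{\top}\mathbf{x}_j
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0
```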

The Lagrange Dual Problem Prove yourself!

Optimization: SMO Let L(α) denote the dual objective.
Repeat till convergence {
    1. Select some pair α_i and α_j to update next (pick the two that will allow us to make the biggest progress towards the global maximum).
    2. Re-optimize L(α) with respect to α_i and α_j, while holding all the other α_k (k ≠ i, j) fixed.
}

The SMO algorithm Let α_1 and α_2 be the two selected variables. When y_1 ≠ y_2 and when y_1 = y_2, the equality constraint Σ_i α_i y_i = 0 restricts (α_1, α_2) to different line segments within the box constraints.

The SMO algorithm The resulting objective is a quadratic function of a single variable and is easy to optimize; the update step then follows in closed form.

Non-linear Classification The data cannot be separated with a linear classifier!

Observation After mapping the data into a suitable feature space, they become linearly separable!

Observation Classifier: Step 1: map the input x into the feature space. Step 2: utilize the linear SVM algorithm for optimization!

Kernels Let the mapping function be φ(x). Since the optimization process involves many inner product operations φ(x_i)^T φ(x_j), we define the kernel function K(x_i, x_j) = φ(x_i)^T φ(x_j)!

Some Kernels Polynomial kernel function: Gaussian kernel function
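A small sketch of the two kernel families named above, using the common parameterizations K(x, z) = (x·z + c)^d and K(x, z) = exp(-||x - z||^2 / (2σ^2)); the constants c, d and σ are illustrative, not taken from the slide:

```python
import numpy as np

def polynomial_kernel(x, z, c=1.0, d=2):
    """Polynomial kernel: (x . z + c)^d."""
    return (np.dot(x, z) + c) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.asarray(x, dtype=float) - np.asarray(z, dtype=float)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(polynomial_kernel(x, z), gaussian_kernel(x, z))
```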

SVM with Kernels Loss function

SVM with Kernels Optimization function Classifier:
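The kernelized objective and classifier are not reproduced above; the usual forms, obtained by replacing every inner product with the kernel (an assumption about the slide's notation), are:

```latex
\max_{\boldsymbol{\alpha}}\ \sum_i \alpha_i
  - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j y_i y_j\, K(\mathbf{x}_i, \mathbf{x}_j)
\quad \text{s.t.}\quad 0 \le \alpha_i \le C,\ \ \sum_i \alpha_i y_i = 0

f(\mathbf{x}) = \operatorname{sgn}\Bigl( \sum_i \alpha_i y_i\, K(\mathbf{x}_i, \mathbf{x}) + b \Bigr)
```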

93 Outline Optimization Perceptron Neural Networks Support Vector Machines (SVM) Logistic Regression

Logistic Function The logistic function: σ(x) = 1 / (1 + e^{-x})

Property Property 1 Property 2

Property (cont) Property 3
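The three properties on the two slides above were not captured; the properties of σ typically used in the logistic-regression derivation (it is an assumption that these are the ones the slides list) are:

```latex
1 - \sigma(x) = \sigma(-x), \qquad
\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr), \qquad
\log \frac{\sigma(x)}{1 - \sigma(x)} = x
```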

Logistic Regression (LR) For a classification task – Input: training pairs (x_i, y_i) with labels y_i in {0, 1} – Let p(y = 1 | x) = σ(w^T x + b).

Logistic Regression (LR) Combining the two class probabilities into a single likelihood expression (below), the output of training is the parameter vector w and the bias b.
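The model and likelihood formulas on these two slides are not shown; the standard logistic-regression statements they correspond to, with labels y in {0, 1} (an assumed reconstruction), are:

```latex
p(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b),
\qquad
p(y = 0 \mid \mathbf{x}) = 1 - \sigma(\mathbf{w}^{\top}\mathbf{x} + b)

p(y \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b)^{\,y}\,
\bigl(1 - \sigma(\mathbf{w}^{\top}\mathbf{x} + b)\bigr)^{1-y}
```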

Optimization for LR According to maximum likelihood estimation, we maximize the log-likelihood of the training data.

Optimization for LR (cont’) Let Then

Optimization for LR (cont’) According to gradient descent

Regularization L1 norm: ||w||_1 = Σ_j |w_j| – promotes sparsity. L2 norm: ||w||_2^2 = Σ_j w_j^2 – typically favors accuracy.

LR with Regularization Loss function: the negative log-likelihood plus the regularization term. Optimization: gradient descent.

LR vs. SVM Logistic Regression minimizes the log loss; SVM minimizes the hinge loss; otherwise the (regularized) objectives are all the same!
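With labels y in {-1, +1} and score f(x) = w^T x + b, the two losses being compared are presumably the standard ones:

```latex
\text{Log loss (LR):}\quad \ell_{\log}\bigl(y, f(\mathbf{x})\bigr) = \log\bigl(1 + e^{-y f(\mathbf{x})}\bigr),
\qquad
\text{Hinge loss (SVM):}\quad \ell_{\mathrm{hinge}}\bigl(y, f(\mathbf{x})\bigr) = \max\bigl(0,\ 1 - y f(\mathbf{x})\bigr)
```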