Published by Kristina Strickland. Modified over 8 years ago.
1
Chapter 6 – Classification (Advanced)
Shuaiqiang Wang (王帅强)
School of Computer Science and Technology, Shandong University of Finance and Economics
Homepage: http://alpha.sdufe.edu.cn/swang/
The ALPHA Lab: http://alpha.sdufe.edu.cn/
shqiang.wang@gmail.com
2
Outline: Optimization; Perceptron; Neural Networks; Support Vector Machines (SVM); Logistic Regression
3
Notations
4
Gradient
5
Hessian
6
Example 1
7
Property 1
8
Example 2
9
Property 2
10
Proof (cont)
12
Example 3
13
Property 3
14
Principles
15
Algorithm
16
Principles
17
Algorithm
18
Outline: Optimization; Perceptron; Neural Networks; Support Vector Machines (SVM); Logistic Regression
19
The Neuron
The neuron is the basic information-processing unit of a neural network (NN). It consists of:
1. A set of synapses or connecting links, each characterized by a weight: $w_1, w_2, \dots, w_m$.
2. An adder function (linear combiner), which computes the weighted sum of the inputs: $u = \sum_{j=1}^{m} w_j x_j$.
3. An activation function (squashing function), which limits the amplitude of the neuron's output.
20
The Neuron
[Figure: model of a neuron with input signals $x_1, x_2, \dots, x_m$, synaptic weights $w_1, w_2, \dots, w_m$, a summing function, bias $b$, local field $v$, an activation function, and output $y$.]
21
Bias of a Neuron
The bias $b$ has the effect of applying an affine transformation to $u$: $v = u + b$, where $v$ is the induced field of the neuron.
[Figure: $v$ plotted against $u$.]
22
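Bias As An Extra Input
[Figure: the neuron model with the bias treated as weight $w_0$ on a constant input $x_0 = 1$, alongside input signals $x_1, x_2, \dots, x_m$, synaptic weights $w_1, w_2, \dots, w_m$, the summing function, local field $v$, the activation function, and output $y$.]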
23
Activation Function
1. Linear function: $f(x) = ax$
2. Step function
3. Ramp function
24
Activation Function
4. Logistic function
5. Hyperbolic tangent
6. Gaussian function
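The formulas for the six activation functions are not preserved on these slides, apart from the linear one. As a rough sketch, the functions could be written as follows. The exact forms are assumptions based on common textbook definitions: a unit step at 0, a ramp clipped to [0, 1], and a Gaussian with width `sigma`.

```python
import math

def linear(x, a=1.0):
    # f(x) = a * x, as given on the slide.
    return a * x

def step(x):
    # Assumed form: 1 for x >= 0, otherwise 0.
    return 1.0 if x >= 0 else 0.0

def ramp(x):
    # Assumed form: identity on [0, 1], clipped outside that interval.
    return min(1.0, max(0.0, x))

def logistic(x):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic_tangent(x):
    # Squashes any real input into (-1, 1).
    return math.tanh(x)

def gaussian(x, sigma=1.0):
    # Assumed form: bell curve centred at 0 with peak value 1.
    return math.exp(-(x * x) / (2.0 * sigma * sigma))
```

The logistic and hyperbolic tangent functions are smooth, which matters later when back propagation needs their derivatives.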
25
Activation function
26
Definitions
– $x_j$ denotes the j-th component of the n-dimensional input vector.
– $w_j$ denotes the j-th component of the weight vector.
– $f(x)$ denotes the output of the neuron when presented with input $x$.
– $\alpha$ is a constant with $0 < \alpha < 1$ (the learning rate).
There must be a training set as follows:
27
Perceptron: Learning Rule
$\mathrm{Err} = y - f(x)$
– $y$ is the desired output
– $f(x)$ is the actual output
$w_j \leftarrow w_j + \alpha \cdot \mathrm{Err} \cdot x_j$
– that is, $w_j \leftarrow w_j + \alpha \, (y - f(x)) \, x_j$
– $\alpha$ is a constant called the learning rate
28
Least Mean Square learning
LMS (Least Mean Square) learning is more general than the previous perceptron learning rule. The idea is to minimize the total error, measured over all training examples $P$: $D = \frac{1}{2} \sum_{p \in P} (T_p - O_p)^2$, where $T_p$ is the target and $O_p$ is the raw output for pattern $p$.
E.g. with two patterns and $T_1 = 1$, $O_1 = 0.8$, $T_2 = 0$, $O_2 = 0.5$: $D = 0.5\,[(1 - 0.8)^2 + (0 - 0.5)^2] = 0.145$.
We want to minimize the LMS error.
[Figure: error $E$ plotted against weight $W$, with a step from W(old) to W(new); C is the learning rate.]
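The two-pattern arithmetic on this slide can be checked with a few lines of code. The `lms_update` helper is an assumption: it applies one gradient step for a purely linear unit, using the slide's symbol C for the learning rate.

```python
def lms_error(targets, outputs):
    # Half the sum of squared differences between targets T and raw outputs O.
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

def lms_update(w, x, target, c):
    # One gradient step for a linear unit O = w . x (assumed form of the rule):
    # dD/dw_j = -(T - O) * x_j, so each weight moves by c * (T - O) * x_j.
    output = sum(wj * xj for wj, xj in zip(w, x))
    return [wj + c * (target - output) * xj for wj, xj in zip(w, x)]

d = lms_error([1.0, 0.0], [0.8, 0.5])
print(round(d, 3))  # 0.145
```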
29
30
31
Procedure
Initialize w
For i = 1 to N {
    For each point p {
        Compute the error e(p)
        Update w with e(p)
    }
    If the termination criterion has been satisfied, return
}
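A minimal sketch of this procedure, combined with the perceptron learning rule, might look as follows. The data set (a logical AND with labels in {-1, +1}), the zero initial weights, and the choice to stop after a mistake-free epoch are illustrative assumptions. They are not the worked example from the slides.

```python
def sgn(v):
    # Sign activation; v == 0 is mapped to +1 by convention here.
    return 1 if v >= 0 else -1

def predict(w, x):
    # x[0] is the constant input 1, so w[0] plays the role of the bias.
    return sgn(sum(wj * xj for wj, xj in zip(w, x)))

def train_perceptron(points, labels, alpha=0.5, max_epochs=100):
    w = [0.0] * len(points[0])              # Initialize w
    for _ in range(max_epochs):             # For i = 1 to N
        mistakes = 0
        for x, y in zip(points, labels):    # For each point p
            err = y - predict(w, x)         # Compute the error e(p)
            if err != 0:
                mistakes += 1
                # Update w with e(p): w_j <- w_j + alpha * Err * x_j
                w = [wj + alpha * err * xj for wj, xj in zip(w, x)]
        if mistakes == 0:                   # Termination criterion
            break
    return w

# Hypothetical training set: each point is (x0 = 1, x1, x2).
points = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
labels = [-1, -1, -1, 1]
w = train_perceptron(points, labels)
```

Because this toy data set is linearly separable, the loop ends after a few epochs with every point classified correctly.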
32
Example
Training data x(n)    Labels y(n)
(0, 0)^T              1
(0, 1)^T              1
(1, 0)^T
(1, 1)^T              1
Learning rate = 0.5, the initial value of w is (0, 1), the activation function is sgn(x), and the threshold value is -1.
33
[Worked iterations n = 0 to n = 12 of the example; the per-step computations are not preserved.]
39
The Classifier
Training data x(n)    Labels y(n)
(0, 0)^T              1
(0, 1)^T              1
(1, 0)^T
(1, 1)^T              1
40
Perceptron Classifier
For example, suppose there are 4 training data points (2 positive examples of the class and 2 negative examples). The initial random values of the weights will probably not divide these points accurately.
X1   X2   Class
3    4    0
6    1    1
4    1    1
1    2    0
41
Perceptron Classifier
During training, however, the weight values are changed so as to reduce the error. Eventually a line can be found that does divide the points and solves the classification task, e.g. $4X_1 - 3.5X_2 = 0$.
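As a quick check, the example line can be tested against the four training points from the previous slide. The sketch assumes that the positive side of the line corresponds to class 1, which the slide does not state explicitly.

```python
# The four training points (X1, X2) and their classes from the slide.
points = [((3, 4), 0), ((6, 1), 1), ((4, 1), 1), ((1, 2), 0)]

def classify(x1, x2):
    # Assumed convention: class 1 where 4*X1 - 3.5*X2 > 0, else class 0.
    return 1 if 4 * x1 - 3.5 * x2 > 0 else 0

results = [classify(x1, x2) == label for (x1, x2), label in points]
print(all(results))  # True
```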
42
Outline: Optimization; Perceptron; Neural Networks; Support Vector Machines (SVM); Logistic Regression
43
Perceptron
[Figure: the perceptron model with input signals $x_1, x_2, \dots, x_m$, constant input $x_0 = 1$, synaptic weights $w_0, w_1, \dots, w_m$, a summing function, local field $v$, an activation function, output $y$, and an error signal.]
44
Neural Network
– Inputs are put through a ‘hidden layer’ before the output layer.
– All nodes are connected between layers.
45
Network Architecture
With one hidden layer
[Figure: inputs, one hidden layer, outputs.]
46
Network Architecture
With three hidden layers
47
Learning Rule
– Measure the error
– Reduce that error
  – By appropriately adjusting each of the weights in the network
48
BP Network Details
Forward pass:
– Error is calculated from the outputs
– Used to update the output weights
Backward pass:
– Error at the hidden nodes is calculated by back-propagating the error at the outputs through the new weights
– Hidden weights are updated
49
Two basic signal flows in BP
[Figure: the directions of the two basic signal flows in a multilayer perceptron: forward propagation of function signals and back propagation of error signals.]
50
Two basic signal flows in BP
[Figure: signal-flow graph highlighting the details of output neuron k connected to hidden neuron j.]
51
BP Network Details
The error function of the jth neuron: (BP-1)
The current mean squared error (MSE): (BP-2)
For N training instances: (BP-3)
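The formulas labelled BP-1 to BP-3 are not preserved in this transcript. A likely form, assuming the usual multilayer perceptron notation, is shown below. The symbols are assumptions: $d_j(n)$ is the desired response of neuron $j$ for the nth training instance, $y_j(n)$ is its actual output, and $C$ is the set of output neurons.

```latex
\begin{align}
e_j(n) &= d_j(n) - y_j(n) \tag{BP-1} \\
E(n) &= \frac{1}{2} \sum_{j \in C} e_j^2(n) \tag{BP-2} \\
E_{av} &= \frac{1}{N} \sum_{n=1}^{N} E(n) \tag{BP-3}
\end{align}
```

Under this reading, $E(n)$ is the instantaneous error energy for one instance and $E_{av}$ is its average over the N training instances.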
52
Batch learning vs. on-line learning
In batch learning, adjustments to the synaptic weights of the multilayer perceptron are performed after the presentation of all N training samples. One presentation of all N samples is called one epoch of training, so the cost function for batch learning is the average error energy $E_{av}$.
Advantages
– Accurate estimation of the gradient vector
– Parallelization
Disadvantage
– Greater storage requirements
53
Batch learning vs. on-line learning
In on-line learning, adjustments to the weights are performed on an example-by-example basis, so the function to be minimized is the total instantaneous error energy $E(n)$. Because the training examples are presented to the network in random order, on-line learning makes the search in weight space stochastic in nature; it is therefore referred to as a stochastic method.
Advantages
– It can take advantage of redundant data.
– It is able to track small changes in the training data.
– It is simple to implement.
– It provides effective solutions to large-scale and difficult pattern-classification problems.
Disadvantage
– It works against parallelization.
54
BP Network Details
Given a training dataset, the BP network tries to minimize $E(n)$. In the nth iteration, the output of the jth neuron can be computed as follows: (BP-4), (BP-5)
55
BP Network Details
According to the update rule of gradient descent:
Since
According to BP-4:
If we define
56
BP Network Details
If the jth neuron is an output one, according to BP-1 and BP-2:
57
BP Network Details
If the jth neuron is NOT an output one:
Since errors are transferred from the kth neurons
If the kth neuron is an output one:
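The equations for this derivation are not preserved, so the following is a sketch of how it usually goes and not a transcription of the slides. It assumes BP-4 is the induced field $v_j(n) = \sum_i w_{ji}(n)\, y_i(n)$ and BP-5 is the output $y_j(n) = \varphi(v_j(n))$. The learning rate $\eta$ and activation function $\varphi$ are assumed symbols, and the local gradient is defined as $\delta_j(n) = -\partial E(n) / \partial v_j(n)$.

```latex
\begin{align}
\Delta w_{ji}(n) &= -\eta \, \frac{\partial E(n)}{\partial w_{ji}(n)}
  = -\eta \, \frac{\partial E(n)}{\partial v_j(n)} \, \frac{\partial v_j(n)}{\partial w_{ji}(n)}
  = \eta \, \delta_j(n) \, y_i(n) \\
\text{output neuron } j: \quad \delta_j(n) &= e_j(n) \, \varphi'\!\left(v_j(n)\right) \\
\text{hidden neuron } j: \quad \delta_j(n) &= \varphi'\!\left(v_j(n)\right) \sum_k \delta_k(n) \, w_{kj}(n)
\end{align}
```

The sum over $k$ runs over the neurons in the next layer that receive input from neuron $j$, which is how the errors are transferred backwards.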
58
Activation functions
The activation function must be continuous and, for back propagation, differentiable. A commonly used choice in multilayer perceptrons is the sigmoidal nonlinearity, two forms of which are:
– Logistic function
– Hyperbolic tangent function
59
Signal-flow graph
[Figure: signal-flow graph of a part of the adjoint system pertaining to back-propagation of error signals.]
60
Two passes of computation
Forward pass
Backward pass
– This pass starts at the output layer by passing the error signals leftward through the network, layer by layer, recursively computing the delta (local gradient) for each neuron.
61
BP Algorithm
Learning procedure:
1. Initialization, including weights and other parameters.
2. Present inputs from the training data.
3. Forward computation.
4. Backward computation.
5. Iteration.
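The five steps can be sketched for a tiny network with two inputs, two hidden neurons and one output neuron. Everything specific here is an assumption made for illustration: the logistic activation, the error energy E = 0.5 * (d - y)^2, the initial weights, and the single training pair.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(W1, W2, x):
    # Forward computation. Each weight row starts with a bias weight
    # that multiplies a constant input of 1.
    x1 = [1.0] + list(x)
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x1))) for row in W1]
    h1 = [1.0] + hidden
    y = logistic(sum(w * hi for w, hi in zip(W2, h1)))
    return x1, h1, y

def error_energy(W1, W2, x, d):
    _, _, y = forward(W1, W2, x)
    return 0.5 * (d - y) ** 2

def backward(W1, W2, x, d):
    # Backward computation: local gradients (deltas), then dE/dw per weight.
    x1, h1, y = forward(W1, W2, x)
    delta_out = (d - y) * y * (1.0 - y)        # e * phi'(v) at the output
    grad_W2 = [-delta_out * hi for hi in h1]
    grad_W1 = []
    for j in range(len(W1)):
        hj = h1[j + 1]
        delta_j = hj * (1.0 - hj) * delta_out * W2[j + 1]  # error passed back
        grad_W1.append([-delta_j * xi for xi in x1])
    return grad_W1, grad_W2

def train_step(W1, W2, x, d, eta=0.5):
    # One iteration: move every weight against its gradient.
    gW1, gW2 = backward(W1, W2, x, d)
    W1 = [[w - eta * g for w, g in zip(row, grow)] for row, grow in zip(W1, gW1)]
    W2 = [w - eta * g for w, g in zip(W2, gW2)]
    return W1, W2

# Initialization with hypothetical weights, then one presented input.
W1 = [[0.1, 0.4, -0.3], [-0.2, 0.25, 0.5]]
W2 = [0.05, -0.6, 0.7]
x, d = (1.0, 0.0), 1.0
before = error_energy(W1, W2, x, d)
W1_new, W2_new = train_step(W1, W2, x, d)
after = error_energy(W1_new, W2_new, x, d)
```

Comparing `before` and `after` shows the error energy for this example dropping after a single update.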
62
Rate of Learning
– The learning rate should not be too large or too small.
– To avoid the danger of instability, a momentum term can be introduced into the update equation.
– Add a penalty for each weight.
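One common way to write an update with a momentum term is sketched below. The parameter names `eta` (learning rate), `alpha` (momentum constant) and `weight_decay` (strength of a squared-weight penalty) are assumptions, as is the toy objective used to exercise it.

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9, weight_decay=0.0):
    # delta(n) = alpha * delta(n-1) - eta * gradient.
    # An optional penalty (weight_decay / 2) * w_i^2 adds weight_decay * w_i
    # to each gradient entry.
    delta = [alpha * pd - eta * (g + weight_decay * wi)
             for wi, g, pd in zip(w, grad, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta

# Hypothetical objective E(w) = 0.5 * w^2, whose gradient is simply w.
w, delta = [5.0], [0.0]
for _ in range(200):
    w, delta = momentum_step(w, grad=list(w), prev_delta=delta)
```

The previous change is carried over into the next one, which smooths the trajectory when successive gradients disagree in sign.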
63
Stopping Criteria
In general, BP cannot be shown to converge, and there are no well-defined criteria for stopping its operation. However, some reasonable criteria can be used to terminate the weight adjustments, e.g.
– When the Euclidean norm of the gradient vector reaches a sufficiently small gradient threshold.
– When the absolute rate of change in the average squared error per epoch is sufficiently small. Usually, this means in the range of 0.1 to 1 percent per epoch, or as small as 0.01 percent.
64
Back propagation
Strengths of BP learning
– Great representation power
– Wide practical applicability
– Easy to implement
– Good generalization power
Problems of BP learning
– Learning often takes a long time to converge
– The net is essentially a black box
– The gradient descent approach only guarantees a local minimum of the error
– Not every function that is representable can be learned
– Generalization is not guaranteed even if the error is reduced to zero
– No well-founded way to assess the quality of BP learning
– Network paralysis may occur (learning stops)
– Selection of learning parameters can only be done by trial and error
– BP learning is non-incremental (to include new training samples, the network must be retrained with all old and new samples)
65
Outline: Optimization; Perceptron; Neural Networks; Support Vector Machines (SVM); Logistic Regression
66
Linear Classification
For a classification task
Input:
Output:
67
Linear Discriminant Function
How would you classify these points using a linear discriminant function in order to minimize the error rate? There is an infinite number of answers! Which one is the best?
[Figure: points labelled +1 and -1 in the $(x_1, x_2)$ plane.]
68
Margin
For data points
With a scale transformation on both w and b
[Figure: points labelled +1 and -1 in the $(x_1, x_2)$ plane.]
69
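We know that
The margin width is:
[Figure: the hyperplanes $w^T x + b = 1$, $w^T x + b = 0$ and $w^T x + b = -1$ in the $(x_1, x_2)$ plane, with normal vector $w$, the margin between the two outer hyperplanes, and the support vectors $x^+$ and $x^-$ lying on them; points are labelled +1 and -1.]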
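The margin formula itself is not preserved on this slide, but it follows from the three hyperplanes in the figure. A short derivation, taking $x^+$ and $x^-$ to be support vectors on the two outer hyperplanes:

```latex
\begin{align}
w^T x^+ + b = 1, \quad w^T x^- + b = -1
  \;&\Longrightarrow\; w^T (x^+ - x^-) = 2 \\
\text{margin width} &= \frac{w^T}{\lVert w \rVert} (x^+ - x^-) = \frac{2}{\lVert w \rVert}
\end{align}
```

Maximizing the margin is therefore equivalent to minimizing $\lVert w \rVert$.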
70
SVM: Large Margin Linear Classifier
The data points lying on the margin are called support vectors!
If separable, the loss function can be:
How to optimize it?
71
Lagrangian
For an optimization problem
The Lagrangian is
72
The Primal Problem
Consider the optimization problem:
Here
Hence
73
Optimization?
For the primal problem
How to optimize it? Nothing changes!
74
Dual Problem
Consider a dual problem
This is exactly the same as our primal problem, except that the order of the “max” and the “min” is exchanged.
75
KKT Conditions
Let
if the KKT conditions are satisfied:
76
Optimization for SVM Construct the Lagrangian and the dual problem
77
Optimization
78
79
Thus, the optimization problem becomes
80
Optimization
81
Non-Separable Case
We relax the constraint
Errors!
82
The Lagrange Dual Problem
Prove it yourself!
83
Optimization: SMO
Let
Repeat till convergence {
    1. Select some pair $\alpha_i$ and $\alpha_j$ to update next (pick the two that allow the biggest progress towards the global maximum).
    2. Reoptimize $L(\alpha)$ with respect to $\alpha_i$ and $\alpha_j$, while holding all the other $\alpha_k$ ($k \ne i, j$) fixed.
}
84
The SMO algorithm
Let $\alpha_1$ and $\alpha_2$ be the two selected variables
When
When
85
The SMO algorithm
The following quadratic equation is easy to optimize
The update step:
86
Non-linear Classification
Cannot separate it with a linear classifier!
87
Observation
Let
Linearly separable!
88
Observation
Classifier:
Step 1: Map to
Step 2: Use the linear SVM algorithm for optimization!
89
Kernels
Let the mapping function be
Then
Since the optimization process involves many inner product operations, define
Kernel function!
90
Some Kernels
Polynomial kernel function:
Gaussian kernel function:
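The kernel formulas are not preserved here, so the sketch below uses common textbook forms as assumptions: a polynomial kernel $(x \cdot z + c)^d$ and a Gaussian kernel $\exp(-\lVert x - z \rVert^2 / (2\sigma^2))$. The explicit map `phi` is a hypothetical degree-2 feature map for two-dimensional inputs, included to show that the kernel equals an inner product in the mapped space.

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, degree=2, c=0.0):
    # Assumed form: (x . z + c) ** degree.
    return (dot(x, z) + c) ** degree

def gaussian_kernel(x, z, sigma=1.0):
    # Assumed form: exp(-||x - z||^2 / (2 * sigma^2)).
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def phi(x):
    # Explicit feature map whose inner product equals (x . z) ** 2 in 2-D.
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
k_implicit = polynomial_kernel(x, z)   # computed in the input space
k_explicit = dot(phi(x), phi(z))       # computed in the mapped space
```

Both values agree up to rounding, but the kernel never builds the mapped vectors, which is what makes the kernel trick cheap when the mapped space is large.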
91
SVM with Kernels
Loss function
92
SVM with Kernels
Optimization function
Classifier:
93
Outline: Optimization; Perceptron; Neural Networks; Support Vector Machines (SVM); Logistic Regression
94
Logistic Function
Logistic function:
95
Property
Property 1
Property 2
96
Property (cont.)
Property 3
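Which three properties the slides state is not preserved. The sketch below checks three identities of the logistic function that are commonly listed. They are offered as assumptions and not as the slides' own Property 1 to 3: the symmetry $\sigma(-x) = 1 - \sigma(x)$, the derivative $\sigma'(x) = \sigma(x)(1 - \sigma(x))$, and the log-odds inverse $\ln(\sigma(x) / (1 - \sigma(x))) = x$.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def logistic_derivative(x):
    # Closed form of the derivative in terms of the function value.
    s = logistic(x)
    return s * (1.0 - s)

def logit(p):
    # Log-odds; the inverse of the logistic function on (0, 1).
    return math.log(p / (1.0 - p))

x = 1.3
symmetry_gap = abs(logistic(-x) - (1.0 - logistic(x)))
inverse_gap = abs(logit(logistic(x)) - x)
```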
97
Logistic Regression (LR)
For a classification task
Input:
Let
98
Logistic Regression (LR)
Since
Thus
Output: w, b
99
Optimization for LR
According to maximum likelihood estimation
100
Optimization for LR (cont.)
Let
Then
101
Optimization for LR (cont.)
According to gradient descent
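A minimal sketch of fitting w and b by gradient descent on the negative log-likelihood follows. It assumes labels in {0, 1}, for which the gradient with respect to w is the sum of $(\sigma(w \cdot x + b) - y)\,x$ over the training set. The data set, the step size `eta` and the number of `epochs` are hypothetical.

```python
import math

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_logistic_regression(xs, ys, eta=0.1, epochs=5000):
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = logistic(sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Gradient of the negative log-likelihood for one example.
            for j, xj in enumerate(x):
                grad_w[j] += (p - y) * xj
            grad_b += p - y
        # Batch gradient descent step.
        w = [wj - eta * gj for wj, gj in zip(w, grad_w)]
        b -= eta * grad_b
    return w, b

def predict(w, b, x):
    # Predict class 1 when the estimated probability is at least 0.5.
    return 1 if logistic(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical, linearly separable training set.
xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 2.0), (2.0, 1.0)]
ys = [0, 0, 0, 1, 1, 1]
w, b = train_logistic_regression(xs, ys)
```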
102
Regularization
L1 norm:
L2 norm:
Accuracy, Sparsity
103
LR with Regularization
Loss function:
Optimization: gradient descent
104
LR vs. SVM
Logistic Regression: (log loss)
SVM: (hinge loss)
All the same!
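The two losses can be compared numerically as functions of the margin $m = y\,(w^T x + b)$. The sketch assumes labels $y \in \{-1, +1\}$ and the standard forms $\max(0, 1 - m)$ for the hinge loss and $\ln(1 + e^{-m})$ for the log loss.

```python
import math

def hinge_loss(margin):
    # Zero once the point is on the correct side with margin at least 1.
    return max(0.0, 1.0 - margin)

def log_loss(margin):
    # Always positive, but decays towards zero as the margin grows.
    return math.log(1.0 + math.exp(-margin))

# (margin, hinge loss, log loss) for a few sample margins.
table = [(m, hinge_loss(m), log_loss(m)) for m in (-2.0, 0.0, 1.0, 3.0)]
```

Both losses penalize negative margins roughly linearly. The main difference is that the hinge loss is exactly zero beyond the margin, while the log loss never quite reaches zero.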