Chapter 6 – Classification (Advanced) Shuaiqiang Wang ( 王帅强 ) School of Computer Science and Technology Shandong University of Finance and Economics Homepage:


1 Chapter 6 – Classification (Advanced). Shuaiqiang Wang ( 王帅强 ), School of Computer Science and Technology, Shandong University of Finance and Economics. Homepage: http://alpha.sdufe.edu.cn/swang/ The ALPHA Lab: http://alpha.sdufe.edu.cn/ Email: shqiang.wang@gmail.com

2 Outline: Optimization, Perceptron, Neural Networks, Support Vector Machines (SVM), Logistic Regression

3 Notations

4 Gradient

5 Hessian

6 Example 1

7 Property 1

8 Example 2

9 Property 2

10 Proof (cont)

11

12 Example 3

13 Property 3

14 Principles

15 Algorithm

16 Principles

17 Algorithm

18 Outline: Optimization, Perceptron, Neural Networks, Support Vector Machines (SVM), Logistic Regression

19 The Neuron. The neuron is the basic information processing unit of a NN. It consists of: (1) a set of synapses or connecting links, each link characterized by a weight w1, w2, …, wm; (2) an adder function (linear combiner) that computes the weighted sum of the inputs; (3) an activation function (squashing function) that limits the amplitude of the output of the neuron.

20 The Neuron [Diagram: input signals x1, x2, …, xm with synaptic weights w1, w2, …, wm feed a summing function together with the bias b; the resulting local field v passes through the activation function to produce the output y.]

21 Bias of a Neuron. The bias b has the effect of applying an affine transformation to the adder output u: v = u + b, where v is the induced local field of the neuron.

22 Bias As An Extra Input [Diagram: the bias is absorbed as weight w0 on a fixed extra input x0 = 1; the summing function produces the local field v, and the activation function gives the output y.]

23 Activation Function. 1. Linear function f(x) = ax. 2. Step function. 3. Ramp function.

24 24 Activation Function 4. Logistic function 5. Hyperbolic tangent 6. Gaussian function
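These six activation functions can be sketched in Python; the slope a and the Gaussian width sigma are illustrative parameter names, since the slides' exact parameterizations are not preserved in this transcript.

```python
import math

def linear(x, a=1.0):
    # 1. Linear function f(x) = a*x
    return a * x

def step(x):
    # 2. Step (Heaviside) function
    return 1.0 if x >= 0 else 0.0

def ramp(x):
    # 3. Ramp function, clipped to [0, 1]
    return max(0.0, min(1.0, x))

def logistic(x):
    # 4. Logistic function
    return 1.0 / (1.0 + math.exp(-x))

def hyperbolic_tangent(x):
    # 5. Hyperbolic tangent
    return math.tanh(x)

def gaussian(x, sigma=1.0):
    # 6. Gaussian function, centered at 0
    return math.exp(-x * x / (2 * sigma ** 2))
```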

25 Activation Function

26 Definitions. x_j denotes the j-th component of the n-dimensional input vector; w_j denotes the j-th component of the weight vector; f(x) denotes the output of the neuron when presented with input x; α is a constant with 0 < α < 1 (the learning rate). A training set of input/label pairs is assumed.

27 Perceptron: Learning Rule. Err = y − f(x), where y is the desired output and f(x) is the actual output. The update is w_j ← w_j + α · Err · x_j, i.e. w_j ← w_j + α · (y − f(x)) · x_j, where α is a constant called the learning rate.
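The learning rule can be sketched as follows; the step activation and 0/1 labels here are assumptions for illustration, not fixed by the slide.

```python
def predict(w, x):
    # f(x): step activation on the weighted sum w^T x
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0

def perceptron_update(w, x, y, alpha=0.5):
    # w_j <- w_j + alpha * (y - f(x)) * x_j
    err = y - predict(w, x)
    return [wj + alpha * err * xj for wj, xj in zip(w, x)]
```

Note that when the prediction is already correct, Err = 0 and the weights are left unchanged.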

28 Least Mean Square Learning. LMS = Least Mean Square learning, more general than the perceptron learning rule above. The idea is to minimize the total error D, measured over all P training patterns, where O is the raw output of the combiner. E.g. if we have two patterns and T1 = 1, O1 = 0.8, T2 = 0, O2 = 0.5, then D = (0.5)[(1 − 0.8)² + (0 − 0.5)²] = 0.145. We minimize the LMS error by moving the weights W from W(old) to W(new) down the error surface E at learning rate C.
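The slide's numeric example can be checked directly:

```python
def lms_error(targets, outputs):
    # D = (1/2) * sum over patterns p of (T_p - O_p)^2
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

# Two patterns: T1 = 1, O1 = 0.8 and T2 = 0, O2 = 0.5
d = lms_error([1, 0], [0.8, 0.5])  # ~0.145, matching the slide
```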

29

30

31 Procedure
Initialize w
For i = 1 to N {
  For each point p {
    Compute the error e(p)
    Update w with e(p)
  }
  If the termination criterion is satisfied, return
}
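A runnable version of this procedure; the AND function is used as an illustrative dataset, with the bias folded in as an extra input x0 = 1 as on slide 22.

```python
def train_perceptron(data, w, alpha=0.5, epochs=10):
    """data: list of (x, y) pairs; each x includes the bias input x0 = 1."""
    for _ in range(epochs):
        converged = True
        for x, y in data:
            # step activation on the weighted sum
            out = 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else 0
            err = y - out
            if err != 0:
                converged = False
                w = [wj + alpha * err * xj for wj, xj in zip(w, x)]
        if converged:  # termination criterion: a full pass with no errors
            break
    return w

# AND function: linearly separable, so the procedure converges
data = [([1, 0, 0], 0), ([1, 0, 1], 0), ([1, 1, 0], 0), ([1, 1, 1], 1)]
w = train_perceptron(data, [0.0, 0.0, 0.0])
```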

32 Example
Training data x(n) | Label y(n)
(0, 0)^T | 1
(0, 1)^T | 1
(1, 0)^T |
(1, 1)^T | 1
Learning rate α = 0.5, initial weight vector w = (0, 1), activation function sgn(x), threshold −1.

33–38 [Worked iterations n = 0 through n = 12 of the perceptron learning rule, shown as per-step weight-update equations on the slides.]

39 The Classifier
Training data x(n) | Label y(n)
(0, 0)^T | 1
(0, 1)^T | 1
(1, 0)^T |
(1, 1)^T | 1

40 Perceptron Classifier. For example, suppose there are 4 training data points (2 positive examples of the class and 2 negative examples):
x1 | x2 | Class
3 | 4 | 0
6 | 1 | 1
4 | 1 | 1
1 | 2 | 0
The initial random values of the weights will probably not divide these points accurately.

41 Perceptron Classifier. During training, the weight values are changed so as to reduce the error. Eventually a line can be found that divides the points and solves the classification task, e.g. 4x1 − 3.5x2 = 0.
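A quick check that this example line does separate the four points from the previous slide:

```python
# The four training points from slide 40: ((x1, x2), class)
points = [((3, 4), 0), ((6, 1), 1), ((4, 1), 1), ((1, 2), 0)]

def classify(x1, x2):
    # points on the non-negative side of 4*x1 - 3.5*x2 = 0 get class 1
    return 1 if 4 * x1 - 3.5 * x2 >= 0 else 0

separated = all(classify(*x) == c for x, c in points)
```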

42 Outline: Optimization, Perceptron, Neural Networks, Support Vector Machines (SVM), Logistic Regression

43 Perceptron [Diagram: the perceptron as a single neuron with inputs x1, …, xm, synaptic weights w1, …, wm, and the bias as weight w0 on the extra input x0 = 1; the output y is compared with the target to form the error used for weight updates.]

44 Neural Network. The inputs are passed through a 'hidden layer' before the output layer; all nodes in adjacent layers are connected.

45 Network Architecture: with one hidden layer between the inputs and the outputs.

46 Network Architecture: with three hidden layers.

47 Learning Rule. Measure the error, then reduce it by appropriately adjusting each of the weights in the network.

48 BP Network Details. Forward pass: the error is calculated from the outputs and used to update the output weights. Backward pass: the error at the hidden nodes is calculated by back-propagating the output error through the new weights; the hidden weights are then updated.

49 Two basic signal flows in BP. Illustration of the directions of the two basic signal flows in a multilayer perceptron: forward propagation of function signals and back propagation of error signals.

50 Two basic signal flows in BP. Signal-flow graph highlighting the details of output neuron k connected to hidden neuron j.

51 BP Network Details. The error of the j-th neuron: e_j(n) = d_j(n) − y_j(n) (BP-1). The current mean squared error (MSE): E(n) = (1/2) Σ_j e_j(n)² (BP-2). For N training instances, the average error: E_av = (1/N) Σ_{n=1..N} E(n) (BP-3).

52 Batch learning vs. on-line learning. In batch learning, adjustments to the synaptic weights of the multilayer perceptron are performed after all N training samples have been presented; one full presentation of the N samples is called an epoch of training. The cost function for batch learning is therefore the average error energy E_av. Advantages: accurate estimation of the gradient vector; easy parallelization. Disadvantage: larger storage requirements.

53 Batch learning vs. on-line learning. In on-line learning, adjustments to the weights are performed example by example, so the function to be minimized is the total instantaneous error energy E(n). Given that the training examples are presented to the network in random order, on-line learning makes the search through weight space stochastic in nature; the method is therefore also called a stochastic method. Advantages: it can take advantage of redundancy in the data; it is able to track small changes in the training data; it is simple to implement; it provides effective solutions to large-scale and difficult pattern-classification problems. Disadvantage: its stochastic nature works against parallelization.

54 BP Network Details. Given a training dataset, the BP network tries to minimize E(n). In the n-th iteration, the output of the j-th neuron is computed as v_j(n) = Σ_i w_ji(n) y_i(n) (BP-4) and y_j(n) = φ(v_j(n)) (BP-5).

55 BP Network Details. By the gradient-descent update rule, Δw_ji(n) = −η ∂E(n)/∂w_ji(n). Since ∂E(n)/∂w_ji(n) = ∂E(n)/∂v_j(n) · ∂v_j(n)/∂w_ji(n) and, according to BP-4, ∂v_j(n)/∂w_ji(n) = y_i(n), if we define the local gradient δ_j(n) = −∂E(n)/∂v_j(n), the update becomes Δw_ji(n) = η δ_j(n) y_i(n).

56 BP Network Details. If the j-th neuron is an output neuron, then according to BP-1 and BP-2, δ_j(n) = e_j(n) φ′(v_j(n)).

57 BP Network Details. If the j-th neuron is NOT an output neuron, the errors are transferred back from the k-th neurons of the next layer: δ_j(n) = φ′(v_j(n)) Σ_k δ_k(n) w_kj(n). If the k-th neuron is an output neuron, δ_k(n) = e_k(n) φ′(v_k(n)).

58 Activation functions. The activation function must be continuous. Two forms of sigmoidal nonlinearity are commonly used in the multilayer perceptron: the logistic function and the hyperbolic tangent function.

59 Signal-flow graph. Signal-flow graph of a part of the adjoint system pertaining to back-propagation of error signals.

60 Two passes of computation. Forward pass. Backward pass: this pass starts at the output layer, passing the error signals leftward through the network layer by layer and recursively computing the delta (local gradient) for each neuron.

61 BP Algorithm. Learning procedure: 1. Initialization, including weights and other parameters. 2. Present inputs from the training data. 3. Forward computation. 4. Backward computation. 5. Iteration.
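The five steps can be sketched as a minimal one-hidden-layer network; the 2-2-1 architecture, XOR data, learning rate, and epoch count are illustrative choices rather than values from the slides.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

random.seed(0)
# Step 1: initialization. w_hid[j] = [bias, w_j1, w_j2] for hidden neuron j.
w_hid = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]  # [bias, v1, v2]

def forward(x):
    h = [sigmoid(wj[0] + wj[1] * x[0] + wj[2] * x[1]) for wj in w_hid]
    y = sigmoid(w_out[0] + w_out[1] * h[0] + w_out[2] * h[1])
    return h, y

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR

def total_error():
    return sum((t - forward(x)[1]) ** 2 for x, t in data)

e0 = total_error()
eta = 0.5
for _ in range(5000):                    # Step 5: iterate over epochs
    for x, t in data:                    # Step 2: present inputs
        h, y = forward(x)                # Step 3: forward computation
        # Step 4: backward computation of the local gradients (deltas)
        d_out = (t - y) * y * (1 - y)
        d_hid = [d_out * w_out[j + 1] * h[j] * (1 - h[j]) for j in range(2)]
        # weight updates: delta_w = eta * delta * input
        w_out[0] += eta * d_out
        w_out[1] += eta * d_out * h[0]
        w_out[2] += eta * d_out * h[1]
        for j in range(2):
            w_hid[j][0] += eta * d_hid[j]
            w_hid[j][1] += eta * d_hid[j] * x[0]
            w_hid[j][2] += eta * d_hid[j] * x[1]
e1 = total_error()                       # training drives the error down
```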

62 Rate of Learning. The learning rate should be neither too large nor too small. To avoid the danger of instability, a momentum term can be introduced into the update equation. A penalty can also be added for each weight.
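A sketch of such an update, combining a momentum term with an L2 weight penalty; the values of eta, mu, and lam are illustrative, not taken from the slides.

```python
eta, mu, lam = 0.1, 0.9, 1e-4  # learning rate, momentum, weight penalty

def update(w, grad, velocity):
    # velocity carries a fraction mu of the previous step (momentum);
    # lam * w is the gradient of the L2 penalty on each weight
    v = [mu * vi - eta * (gi + lam * wi)
         for wi, gi, vi in zip(w, grad, velocity)]
    return [wi + vi for wi, vi in zip(w, v)], v
```

The caller keeps the returned velocity and passes it back on the next step, so successive updates in the same direction accelerate while oscillations are damped.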

63 Stopping Criteria. In general, BP cannot be shown to converge, and there are no well-defined criteria for stopping its operation. However, some reasonable criteria can be used to terminate the weight adjustments, e.g.: when the Euclidean norm of the gradient vector falls below a sufficiently small threshold, or when the average squared error per epoch is sufficiently small (usually in the range of 0.1 to 1 percent per epoch, or as small as 0.01 percent).

64 Back-propagation
Strengths of BP learning: great representational power; wide practical applicability; easy to implement; good generalization power.
Problems of BP learning: learning often takes a long time to converge; the net is essentially a black box; the gradient-descent approach only guarantees a local minimum of the error; not every representable function can be learned; generalization is not guaranteed even if the error is reduced to zero; there is no well-founded way to assess the quality of BP learning; network paralysis may occur (learning stops); learning parameters can only be selected by trial and error; BP learning is non-incremental (to include new training samples, the network must be retrained with all old and new samples).

65 Outline: Optimization, Perceptron, Neural Networks, Support Vector Machines (SVM), Logistic Regression

66 Linear Classification For a classification task Input: Output:

67 Linear Discriminant Function [Plot: points in the (x1, x2) plane, with markers denoting the +1 and −1 classes.] How would you classify these points using a linear discriminant function in order to minimize the error rate? There are infinitely many answers; which one is the best?

68 Margin [Plot: the same labeled points in the (x1, x2) plane.] For the data points, we apply a scale transformation on both w and b.

69 Margin [Plot: the hyperplanes w^T x + b = 0, w^T x + b = 1, and w^T x + b = −1; the support vectors x⁺ and x⁻ lie on the two margin hyperplanes.] We know that w^T x⁺ + b = 1 and w^T x⁻ + b = −1, so the margin width is M = (x⁺ − x⁻) · w/||w|| = 2/||w||.

70 SVM: Large Margin Linear Classifier. The points on the margin hyperplanes are called support vectors! If the data are separable, the objective can be written as minimizing (1/2)||w||² subject to y_i(w^T x_i + b) ≥ 1 for all i. How do we optimize it?

71 Lagrangian. For an optimization problem min_w f(w) subject to g_i(w) ≤ 0 and h_i(w) = 0, the Lagrangian is L(w, α, β) = f(w) + Σ_i α_i g_i(w) + Σ_i β_i h_i(w).

72 The Primal Problem. Consider the optimization problem min_w θ_P(w), where θ_P(w) = max_{α, β : α_i ≥ 0} L(w, α, β). When the constraints are satisfied, θ_P(w) = f(w); otherwise θ_P(w) = ∞, so this primal problem is equivalent to the original constrained problem.

73 Optimization? For the primal problem How to optimize it? Nothing changes!

74 Dual Problem Consider a dual problem Exactly the same as our primal problem except that the order of the “max” and the “min” are exchanged

75 KKT Conditions. Let p* and d* be the optimal values of the primal and dual problems; d* = p* holds if the KKT conditions are satisfied.

76 Optimization for SVM Construct the Lagrangian and the dual problem

77 Optimization

78

79 Thus, the optimization problem becomes

80 Optimization

81 Non-Separable Case. We relax the constraints, allowing some errors.

82 The Lagrange Dual Problem. Prove it yourself!

83 Optimization: SMO. Repeat until convergence: 1. Select some pair α_i and α_j to update next (pick the two that allow the biggest progress towards the global maximum). 2. Re-optimize L(α) with respect to α_i and α_j, holding all the other α_k (k ≠ i, j) fixed.

84 The SMO algorithm. Let α1 and α2 be the two selected variables.

85 The SMO algorithm The following quadratic equation is easy to optimize The update step:

86 Non-linear Classification Cannot separate it with a linear classifier!

87 Observation. With a suitable mapping of the inputs, the data become linearly separable!

88 Observation. Classifier: Step 1: map the data into the new feature space. Step 2: utilize the linear SVM algorithm for optimization!

89 Kernels. Let the mapping function be φ. Since the optimization process involves many inner-product operations, we define a kernel function that computes the inner product of φ(x) and φ(z) directly!

90 Some Kernels Polynomial kernel function: Gaussian kernel function
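The two kernels can be sketched as follows; the parameter names c, d, and sigma are illustrative, since the slides' exact forms are not preserved in this transcript.

```python
import math

def poly_kernel(x, z, c=1.0, d=2):
    # Polynomial kernel: (x . z + c)^d
    return (sum(xi * zi for xi, zi in zip(x, z)) + c) ** d

def rbf_kernel(x, z, sigma=1.0):
    # Gaussian (RBF) kernel: exp(-||x - z||^2 / (2 * sigma^2))
    sq = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-sq / (2 * sigma ** 2))
```

Both compute an inner product in an implicit feature space without ever constructing φ(x) explicitly.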

91 SVM with Kernels Loss function

92 SVM with Kernels Optimization function Classifier:

93 Outline: Optimization, Perceptron, Neural Networks, Support Vector Machines (SVM), Logistic Regression

94 Logistic Function Logistic Function:

95 Property Property 1 Property 2

96 Property (cont) Property 3
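The property equations themselves are not preserved in this transcript; assuming they are the usual identities of the logistic function, they can be checked numerically:

```python
import math

def sigma(x):
    # logistic function
    return 1.0 / (1.0 + math.exp(-x))

def sigma_prime(x):
    # standard identity: sigma'(x) = sigma(x) * (1 - sigma(x))
    return sigma(x) * (1.0 - sigma(x))

# another standard identity: sigma(-x) = 1 - sigma(x)
symmetric = abs(sigma(-0.7) - (1 - sigma(0.7))) < 1e-12
```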

97 Logistic Regression (LR) For a classification task Input: Let

98 Logistic Regression (LR) Since Thus Output: w, b

99 Optimization for LR According to maximum likelihood estimation

100 Optimization for LR (cont’) Let Then

101 Optimization for LR (cont’) According to gradient descent
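A minimal sketch of LR trained by gradient steps on the log-likelihood; the toy dataset, learning rate, and epoch count below are illustrative assumptions.

```python
import math

def sigma(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_lr(data, eta=0.1, epochs=2000):
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:                 # labels y in {0, 1}
            p = sigma(w[0] * x[0] + w[1] * x[1] + b)
            err = y - p                   # gradient of the log-likelihood
            w = [wj + eta * err * xj for wj, xj in zip(w, x)]
            b += eta * err
    return w, b

# toy separable set: the class is simply "x1 >= 1"
data = [((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 1), ((1.0, 1.0), 1)]
w, b = train_lr(data)
```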

102 Regularization. L1 norm: promotes sparsity. L2 norm: promotes accuracy.

103 LR with Regularization. Loss function. Optimization: gradient descent.

104 LR vs. SVM. Logistic Regression uses the log loss; SVM uses the hinge loss. Otherwise the regularized objectives are all the same!
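The two losses, written for a label y in {−1, +1} and a raw score s = w^T x + b:

```python
import math

def hinge(y, s):
    # SVM hinge loss: zero once the margin y*s reaches 1
    return max(0.0, 1.0 - y * s)

def log_loss(y, s):
    # LR log loss: smooth, never exactly zero
    return math.log(1.0 + math.exp(-y * s))
```

Both penalize small or negative margins y·s; the hinge loss is piecewise linear while the log loss decays smoothly, which is the only real difference between the two objectives.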

105

