Linear machines 28/02/2017
Decision surface for the Bayes classifier with normal densities
Decision surfaces
We focus on the decision surfaces. Linear machines = linear decision surfaces: a non-optimal solution but a tractable model.
Decision tree and decision regions
Linear discriminant function
Two-category classifier: choose ω1 if g(x) > 0, choose ω2 if g(x) < 0. If g(x) = 0 the decision is undefined; g(x) = 0 defines the decision surface. Linear machine = linear discriminant function: g(x) = w^T x + w0, where w is the weight vector and w0 is the constant bias.
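A minimal sketch of this two-category rule, assuming NumPy; the weights w and w0 below are illustrative values, not taken from the slides:

```python
import numpy as np

def g(x, w, w0):
    """Linear discriminant: g(x) = w^T x + w0."""
    return w @ x + w0

def classify(x, w, w0):
    """Choose class 1 if g(x) > 0, class 2 if g(x) < 0; undefined on the surface."""
    value = g(x, w, w0)
    if value > 0:
        return 1
    if value < 0:
        return 2
    return None  # x lies exactly on the decision surface

# Hypothetical weights: the decision surface is the line x1 + 2*x2 - 1 = 0.
w = np.array([1.0, 2.0])
w0 = -1.0
print(classify(np.array([2.0, 1.0]), w, w0))  # g = 3 > 0 -> class 1
print(classify(np.array([0.0, 0.0]), w, w0))  # g = -1 < 0 -> class 2
```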
More than 2 categories: c linear discriminant functions. ωi is predicted if gi(x) > gj(x) for all j ≠ i; the pairwise decision surfaces define the decision regions.
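A sketch of the c-category rule, again with hypothetical weights: one discriminant per class, and the prediction is the class with the largest gi(x).

```python
import numpy as np

def predict_multiclass(x, W, w0):
    """W: (c, d) matrix of weight vectors, w0: (c,) biases.
    Predict the class i with the largest g_i(x) = W[i] @ x + w0[i]."""
    scores = W @ x + w0
    return int(np.argmax(scores))

# Three hypothetical classes in 2D.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(predict_multiclass(np.array([2.0, 0.5]), W, w0))  # class 0 has the largest score here
```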
Expressive power of linear machines
It can be proved that linear machines can only define convex decision regions, i.e. concave regions cannot be learnt. Moreover, the decision boundaries cannot be higher-order surfaces (like ellipsoids).
Homogeneous coordinates
Augment each feature vector with a constant 1, y = (1, x1, …, xd)^T, and collect the bias into the weight vector, a = (w0, w1, …, wd)^T, so that g(x) = w^T x + w0 = a^T y.
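A small sketch of this augmentation, assuming NumPy (the helper name augment is mine):

```python
import numpy as np

def augment(x):
    """Map x to homogeneous coordinates: y = (1, x1, ..., xd)."""
    return np.concatenate(([1.0], x))

# With a = (w0, w), the discriminant becomes a single dot product a^T y.
w, w0 = np.array([1.0, 2.0]), -1.0
a = np.concatenate(([w0], w))
x = np.array([2.0, 1.0])
print(a @ augment(x))   # 3.0
print(w @ x + w0)       # 3.0, the same value
```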
Training linear machines
Searching for the values of w which separate the classes. Usually a goodness function is utilised as the objective function, e.g. a criterion J(a) to be minimised.
Two categories - normalisation
If yi belongs to ω2, replace yi by -yi; then search for an a for which a^T yi > 0 for all i (normalised version). There is no unique solution.
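A sketch of this normalisation trick, assuming NumPy arrays Y (augmented samples, one per row) and labels in {1, 2}; the data and names are mine:

```python
import numpy as np

def normalise(Y, labels):
    """Negate the samples of class 2 so that a correct weight vector a
    satisfies a @ y > 0 for every (normalised) sample y."""
    Y = Y.copy()
    Y[labels == 2] *= -1
    return Y

Y = np.array([[1.0, 2.0, 1.0],     # class 1 (already augmented with a leading 1)
              [1.0, 0.5, 0.2],     # class 1
              [1.0, -1.0, -2.0]])  # class 2
labels = np.array([1, 1, 2])
Y_norm = normalise(Y, labels)
a = np.array([0.1, 1.0, 1.0])      # some candidate weight vector
print(np.all(Y_norm @ a > 0))      # True iff a separates the two classes
```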
Iterative optimisation
The solution minimises J(a). Iterative improvement of J(a): a(k+1) = a(k) + η(k) · (step direction), where η(k) is the learning rate.
Gradient descent
The step direction is the negative gradient: a(k+1) = a(k) - η(k) ∇J(a(k)). The learning rate η(k) is a function of k, i.e. it describes a cooling strategy.
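A generic gradient-descent sketch with a simple cooling schedule; the toy objective, its gradient, and the schedule η(k) = η0/(1+k) are illustrative assumptions, not taken from the slides:

```python
import numpy as np

def gradient_descent(grad_J, a0, eta0=1.0, n_steps=100):
    """a(k+1) = a(k) - eta(k) * grad J(a(k)), with a cooling learning rate."""
    a = a0.astype(float)
    for k in range(n_steps):
        eta = eta0 / (1.0 + k)      # cooling strategy: eta shrinks with k
        a = a - eta * grad_J(a)
    return a

# Toy objective J(a) = ||a - b||^2 with known minimum at b.
b = np.array([3.0, -2.0])
grad_J = lambda a: 2.0 * (a - b)
print(gradient_descent(grad_J, np.zeros(2)))   # converges to [3. -2.]
```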
Learning rate?
Perceptron rule
Y(a): the set of training samples misclassified by a. The perceptron criterion is Jp(a) = Σ over y in Y(a) of (-a^T y); if Y(a) is empty then Jp(a) = 0, else Jp(a) > 0.
Using Jp(a) in gradient descent gives the batch perceptron update: a(k+1) = a(k) + η(k) Σ over y in Y(a(k)) of y, where Y(a(k)) is the set of training samples misclassified by a(k).
Perceptron convergence theorem: if the training dataset is linearly separable, the batch perceptron algorithm finds a solution in a finite number of steps.
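A minimal batch-perceptron sketch following the update above; it assumes the samples have already been augmented and normalised as on the earlier slides (so Y_norm @ a > 0 when everything is classified correctly), and reuses the toy data from the normalisation sketch:

```python
import numpy as np

def batch_perceptron(Y_norm, eta=1.0, max_iter=1000):
    """Batch perceptron: a <- a + eta * (sum of the currently misclassified samples)."""
    a = np.zeros(Y_norm.shape[1])
    for _ in range(max_iter):
        misclassified = Y_norm[Y_norm @ a <= 0]   # Y(a): samples with a^T y <= 0
        if len(misclassified) == 0:               # Jp(a) = 0: done
            return a
        a = a + eta * misclassified.sum(axis=0)
    return a  # may not have converged if the data are not separable

Y_norm = np.array([[1.0, 2.0, 1.0],
                   [1.0, 0.5, 0.2],
                   [-1.0, 1.0, 2.0]])
a = batch_perceptron(Y_norm)
print(a, np.all(Y_norm @ a > 0))   # a separating weight vector, True
```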
Stochastic gradient descent: estimate the gradient from only a few training examples. With η(k) = 1 and single-sample updates this is online learning.
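A sketch of the single-sample (online) variant with η(k) = 1: each misclassified sample updates the weight vector immediately, so the data can be streamed.

```python
import numpy as np

def online_perceptron(Y_norm, n_epochs=100):
    """Single-sample perceptron with eta = 1: update after every mistake."""
    a = np.zeros(Y_norm.shape[1])
    for _ in range(n_epochs):
        updated = False
        for y in Y_norm:              # stream the (normalised) samples one by one
            if a @ y <= 0:            # misclassified
                a = a + y             # immediate update
                updated = True
        if not updated:               # a full pass with no mistakes: converged
            return a
    return a

Y_norm = np.array([[1.0, 2.0, 1.0],
                   [1.0, 0.5, 0.2],
                   [-1.0, 1.0, 2.0]])
print(online_perceptron(Y_norm))      # e.g. [1. 2. 1.] separates the toy data
```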
Online vs offline learning
Online learning algorithms: the model is updated by each training instance (or by a small batch). Offline learning algorithms: the training dataset is processed as a whole.
Advantages of online learning:
- The update is straightforward
- The training dataset can be streamed
- Implicit adaptation
Disadvantages of online learning:
- Its accuracy might be lower
SVM
Which one to prefer?
Margin: the gap around the decision surface. It is defined by the training instances closest to the decision surface (the support vectors).
Support Vector Machine (SVM)
SVM is a linear machine whose objective function incorporates the maximisation of the margin. This provides generalisation ability.
SVM: linearly separable case
Linear SVM: linearly separable case
Training database: {(x_t, y_t)} with labels y_t in {+1, -1}. Searching for w and b such that w^T x_t + b ≥ +1 if y_t = +1 and w^T x_t + b ≤ -1 if y_t = -1, or equivalently y_t (w^T x_t + b) ≥ 1.
Denote the size of the margin by ρ; for a separating hyperplane in the above form, ρ = 2/||w||. We prefer a unique solution: argmax ρ = argmin ||w||²/2.
This is a convex quadratic optimisation problem: minimise ½||w||² subject to y_t (w^T x_t + b) ≥ 1 for all t…
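A small sketch of solving this QP in practice, using scikit-learn's SVC as the solver (an assumption; the slides do not name a library). The tiny dataset is made up, and a very large C approximates the hard-margin case:

```python
import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable 2D dataset (illustrative only).
X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],
              [0.0, 0.0], [-1.0, 0.5], [0.5, -1.0]])
y = np.array([+1, +1, +1, -1, -1, -1])

# kernel='linear' with a huge C behaves (almost) like the hard-margin SVM.
clf = SVC(kernel='linear', C=1e6)
clf.fit(X, y)

print("w =", clf.coef_[0], "b =", clf.intercept_[0])
print("support vectors:\n", clf.support_vectors_)
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(clf.coef_[0]))
```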
The form of the solution: w = Σ_t α_t y_t x_t with α_t ≥ 0; for any t, x_t is a support vector iff α_t > 0. The weight vector is a weighted average of training instances, and only the support vectors count.
SVM: not linearly separable case
Linear SVM: not linearly separable case
A slack variable ξ_t enables incorrect classifications ("soft margin"): ξ_t = 0 if the classification is correct, otherwise it is the distance from the margin. C is a metaparameter controlling the trade-off between the margin size and the amount of incorrect classification.
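Written out, the standard soft-margin formulation that these bullets describe (a reconstruction, since the slide's formula is not in the text) is:

```latex
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w\rVert^2 + C \sum_t \xi_t
\quad \text{s.t.}\quad y_t\,(w^\top x_t + b) \ge 1 - \xi_t,\qquad \xi_t \ge 0 .
```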
SVM: non-linear case
Generalised linear discriminant functions
E.g. a quadratic decision surface. Generalised linear discriminant functions: g(x) = Σ_i a_i y_i(x), where the y_i: R^d → R are arbitrary functions. g(x) is not linear in x, but it is linear in the y_i (it is a hyperplane in the y-space).
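A sketch of the quadratic case in 2D, with an illustrative feature map of my own choosing: a quadratic surface in x-space becomes a hyperplane in y-space.

```python
import numpy as np

def quadratic_features(x):
    """Map (x1, x2) to y = (1, x1, x2, x1^2, x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

# g(x) = a^T y(x) is linear in y but quadratic in x.
# Example: a encodes the circle x1^2 + x2^2 - 1 = 0 as the decision surface.
a = np.array([-1.0, 0.0, 0.0, 1.0, 0.0, 1.0])
for x in [np.array([0.2, 0.3]), np.array([1.5, 0.0])]:
    print(x, np.sign(a @ quadratic_features(x)))  # -1 inside, +1 outside the circle
```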
Example
Non-linear SVM
Φ is a mapping into a higher-dimensional (k-dimensional) space, Φ: R^d → R^k. For any dataset there exists a mapping into a higher-dimensional space in which the dataset becomes linearly separable.
The kernel trick: g(x) = Σ_t α_t y_t K(x_t, x) + b. The explicit calculation of the mapping into the high-dimensional space can be omitted if the kernel K(x_t, x) = Φ(x_t)·Φ(x) can be computed directly.
Example: polynomial kernel
K(x, y) = (x·y)^p. With d = 256 original dimensions and p = 4, the dimensionality h of the high-dimensional space is very large; on the other hand K(x, y) is known and feasible to calculate, while the inner product in the high-dimensional space is not.
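A sketch comparing the two computations for a small case (p = 2, d = 2, where the explicit map is still easy to write out); the equality K(x, y) = (x·y)^p = Φ(x)·Φ(y) is what the kernel trick exploits:

```python
import numpy as np

def poly_kernel(x, y, p=2):
    """Polynomial kernel K(x, y) = (x . y)^p, computed in the original space."""
    return (x @ y) ** p

def phi(x):
    """Explicit feature map for p = 2, d = 2: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

x, y = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(poly_kernel(x, y))   # 1.0 -> (1*3 + 2*(-1))^2 = 1
print(phi(x) @ phi(y))     # 1.0, the same value via the explicit high-dimensional map
```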
Kernels in practice
There is no rule of thumb for selecting the appropriate kernel.
The XOR example
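The XOR dataset is the classic case that is not linearly separable. A sketch (my own illustration of the point, since the slide content is an image) using scikit-learn: a degree-2 polynomial kernel SVM separates XOR, while a linear one cannot.

```python
import numpy as np
from sklearn.svm import SVC

# XOR in {-1, +1}^2: opposite corners share a label, so no line separates the classes.
X = np.array([[-1.0, -1.0], [-1.0, 1.0], [1.0, -1.0], [1.0, 1.0]])
y = np.array([-1, +1, +1, -1])

linear = SVC(kernel='linear').fit(X, y)
poly = SVC(kernel='poly', degree=2).fit(X, y)   # its feature space contains the x1*x2 product

print("linear kernel accuracy:", (linear.predict(X) == y).mean())   # below 1.0
print("degree-2 kernel accuracy:", (poly.predict(X) == y).mean())   # 1.0
```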
Notes on SVM
Training is a global optimisation problem (exact optimisation). The performance of an SVM is highly dependent on the choice of the kernel and its parameters. Finding the appropriate kernel for a particular task is "magic".
Complexity depends on the number of support vectors but not on the dimensionality of the feature space. In practice, SVM attains good enough generalisation ability even with a small training database.
Summary
- Linear machines
- Gradient descent
- Perceptron
- SVM: linearly separable case, not separable case, non-linear SVM