
1 Linear machines (March 9)

2 Decision surfaces We now focus on the decision surfaces. Linear machines = linear decision surface. A non-optimal solution, but a tractable model.

3 Decision surface for a Bayes classifier with normal densities (the Σ_i = Σ case)

4 Decision tree and decision regions

5 Linear discriminant function Two-category classifier: choose ω_1 if g(x) > 0, else choose ω_2 if g(x) < 0. If g(x) = 0 the decision is undefined; g(x) = 0 defines the decision surface. Linear machine = linear discriminant function: g(x) = w^T x + w_0, where w is the weight vector and w_0 is the constant bias.
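As a minimal sketch of this rule (the weight values below are made up, not from the slides), the two-category linear discriminant can be written in a few lines of NumPy:

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """g(x) = w^T x + w_0."""
    return np.dot(w, x) + w0

def classify(x, w, w0):
    """Choose omega_1 if g(x) > 0, omega_2 if g(x) < 0; undefined on the surface."""
    g = linear_discriminant(x, w, w0)
    if g > 0:
        return "omega_1"
    if g < 0:
        return "omega_2"
    return "undefined"  # x lies exactly on the decision surface g(x) = 0

# Example with arbitrary, made-up parameters
w = np.array([2.0, -1.0])
w0 = 0.5
print(classify(np.array([1.0, 1.0]), w, w0))  # g = 2 - 1 + 0.5 = 1.5 > 0 -> omega_1
```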

6

7 More than 2 categories c linear discriminant functions: ω_i is predicted if g_i(x) > g_j(x) for all j ≠ i; i.e. the pairwise decision surfaces define the decision regions.
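A sketch of the c-category rule, assuming each discriminant has the linear form g_i(x) = w_i^T x + w_i0 (the weights below are illustrative):

```python
import numpy as np

def predict_class(x, W, w0):
    """W: (c, d) matrix of weight vectors, w0: (c,) bias terms.
    Returns the index i maximising g_i(x) = w_i^T x + w_i0, which is
    equivalent to g_i(x) > g_j(x) for all j != i (ties ignored here)."""
    g = W @ x + w0
    return int(np.argmax(g))

# Three made-up discriminants over 2-dimensional inputs
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.1, 0.2])
print(predict_class(np.array([0.5, 2.0]), W, w0))  # index of the largest g_i
```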

8

9 Expression power of linear machines It can be proved that linear machines can only define convex regions, i.e. concave regions cannot be learnt. Moreover, decision boundaries can in general be higher-order surfaces (such as ellipsoids)…

10 Homogeneous coordinates
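The slide itself is not legible in the transcript; presumably it shows the usual augmentation, sketched here: the bias is folded into the weight vector by prepending a constant 1 to every sample.

```python
import numpy as np

# Homogeneous (augmented) coordinates: y = [1, x], a = [w0, w],
# so that g(x) = w^T x + w_0 becomes simply a^T y.
x = np.array([3.0, -2.0])
w = np.array([2.0, 1.0])
w0 = 0.5

y = np.concatenate(([1.0], x))   # augmented sample
a = np.concatenate(([w0], w))    # augmented weight vector
assert np.isclose(a @ y, w @ x + w0)
```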

11 Training linear machines

12 Training linear machines Searching for the values of w which separate the classes. Usually a goodness function is used as the objective function, e.g.

13 Two categories - normalisation (normalised version): if y_i belongs to ω_2, replace y_i by -y_i, then search for an a for which a^T y_i > 0 for all i. There is no unique solution.
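A small sketch of this normalisation step (the array layout and names are my own):

```python
import numpy as np

def normalise(Y, labels):
    """Y: (n, d) augmented training samples, labels: 1 or 2 per sample.
    Samples of class omega_2 are negated, so a separating vector a must
    satisfy a^T y_i > 0 for every i."""
    labels = np.asarray(labels)
    signs = np.where(labels == 2, -1.0, 1.0)
    return Y * signs[:, None]
```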

14 Iterative optimisation The solution minimises J(a). Iterative improvement of J(a): a(k+1) = a(k) + η(k) · (step direction), where η(k) is the learning rate.

15 Gradient descent The learning rate is a function of k, i.e. it describes a cooling strategy.
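A generic gradient-descent loop matching the update a(k+1) = a(k) - η(k) ∇J(a(k)); the objective, its gradient, and the 1/k cooling schedule below are placeholders, not taken from the slides:

```python
import numpy as np

def gradient_descent(grad_J, a0, eta0=1.0, n_steps=100):
    """a(k+1) = a(k) - eta(k) * grad J(a(k)), with a simple 1/k cooling schedule."""
    a = a0.copy()
    for k in range(1, n_steps + 1):
        eta = eta0 / k            # learning rate as a function of k ("cooling")
        a = a - eta * grad_J(a)
    return a

# Toy example: J(a) = ||a||^2 has gradient 2a and its minimum at a = 0
print(gradient_descent(lambda a: 2 * a, np.array([4.0, -3.0])))
```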

16 Gradient descent

17 Learning rate?

18 Perceptron rule

19 Perceptron rule Y(a): the set of training samples misclassified by a. If Y(a) is empty, J_p(a) = 0; otherwise J_p(a) > 0.
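For reference, the perceptron criterion this slide refers to is usually written (in the normalised-sample convention of the previous slides) as:

```latex
J_p(a) = \sum_{y \in Y(a)} \bigl(-a^{\mathsf{T}} y\bigr),
\qquad
\nabla J_p(a) = \sum_{y \in Y(a)} (-y).
```

Each misclassified sample contributes a non-negative term, which matches the slide: J_p(a) = 0 when Y(a) is empty.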

20 Perceptron rule Using J_p(a) in the gradient descent:

21 The training samples misclassified by a(k). Perceptron convergence theorem: if the training dataset is linearly separable, the batch perceptron algorithm finds a solution in finitely many steps.
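A minimal batch perceptron sketch consistent with these slides (normalised, augmented samples; the names are my own):

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, max_iter=1000):
    """Y: (n, d) normalised, augmented samples (class omega_2 already negated).
    Repeats a <- a + eta * (sum of misclassified samples) until none remain."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_iter):
        misclassified = Y[Y @ a <= 0]      # samples with a^T y <= 0
        if len(misclassified) == 0:
            break                          # converged: a^T y > 0 for every sample
        a = a + eta * misclassified.sum(axis=0)
    return a
```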

22 η(k) = 1: online learning. Stochastic gradient descent: estimate the gradient based on a few training examples.
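The online variant with η(k) = 1 updates after every misclassified sample; a sketch under the same conventions:

```python
import numpy as np

def online_perceptron(Y, n_epochs=100):
    """Single-sample (online) perceptron with fixed learning rate eta = 1."""
    a = np.zeros(Y.shape[1])
    for _ in range(n_epochs):
        updated = False
        for y in Y:                # stream the training data one sample at a time
            if a @ y <= 0:         # misclassified under the normalised convention
                a = a + y          # eta(k) = 1
                updated = True
        if not updated:
            break                  # a full error-free pass: the data is separated
    return a
```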

23 Online vs offline learning
Online learning algorithms: the model is updated by each training instance (or by a small batch).
Offline learning algorithms: the training dataset is processed as a whole.
Advantages of online learning: the update is straightforward; the training dataset can be streamed; implicit adaptation.
Disadvantages of online learning: its accuracy might be lower.

24 Not linearly separable case Change the loss function so that it counts every training example, e.g. via the directed distance from the decision surface.

25 SVM

26 Which one to prefer?

27 Margin: the gap around the decision surface. It is defined by the training instances closest to the decision surface (the support vectors).

28

29 Support Vector Machine (SVM) The SVM is a linear machine whose objective function incorporates the maximisation of the margin! This provides its generalisation ability.

30 SVM Linearly separable case

31 Linear SVM: linearly separable case Training database: … Searching for w s.t. … or …
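The constraints elided in the transcript ("s.t. … or …") are, in the standard linearly separable formulation:

```latex
w^{\mathsf{T}} x_t + w_0 \ge +1 \quad \text{if } x_t \in \omega_1,
\qquad
w^{\mathsf{T}} x_t + w_0 \le -1 \quad \text{if } x_t \in \omega_2,
```

or, with labels y_t in {+1, -1}, compactly y_t (w^T x_t + w_0) >= 1 for all t.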

32 Linear SVM: linearly separable case Denote the size of the margin by ρ. Linearly separable case: we prefer a unique solution: argmax ρ = argmin …
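With the constraints above, the margin and the resulting objective are (standard derivation, stated here because the formula is missing from the transcript):

```latex
\rho = \frac{2}{\lVert w \rVert},
\qquad
\arg\max_{w} \rho \;=\; \arg\min_{w} \tfrac{1}{2}\lVert w \rVert^{2}.
```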

33 Linear SVM: linearly separable case A convex quadratic optimisation problem…

34 Linear SVM: linearly separable case The form of the solution: for any t, x_t is a support vector iff … (only the support vectors count). The solution is a weighted average of the training instances.
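In the usual notation (λ_t are the Lagrange multipliers, nonzero only for the support vectors), the form of the solution is:

```latex
w \;=\; \sum_{t} \lambda_t \, y_t \, x_t,
\qquad
g(x) \;=\; \sum_{t:\,\lambda_t > 0} \lambda_t \, y_t \, x_t^{\mathsf{T}} x \;+\; w_0.
```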

35

36 SVM not linearly separable case

37 Linear SVM: not linearly separable case The slack variable ξ enables incorrect classifications ("soft margin"): ξ_t = 0 if the classification is correct, otherwise it is the distance from the margin. C is a metaparameter controlling the trade-off between the margin size and the incorrect classifications.
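The soft-margin optimisation problem this slide describes is usually written as (standard form, given here for reference):

```latex
\min_{w,\, w_0,\, \xi} \;\; \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{t} \xi_t
\qquad \text{s.t.} \quad y_t\,(w^{\mathsf{T}} x_t + w_0) \ge 1 - \xi_t, \quad \xi_t \ge 0.
```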

38

39 SVM non-linear case

40 Generalised linear discriminant functions E.g. a quadratic decision surface. Generalised linear discriminant functions: the y_i : R^d → R are arbitrary functions; g(x) is not linear in x, but it is linear in the y_i (it is a hyperplane in the y-space).
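A small sketch of such a mapping: with quadratic features y_i(x), a quadratic surface in x-space becomes a hyperplane in y-space (the feature ordering is my own choice):

```python
import numpy as np

def quadratic_features(x):
    """Map x = (x_1, ..., x_d) to y = (1, x_1, ..., x_d, x_1*x_1, x_1*x_2, ..., x_d*x_d)."""
    d = len(x)
    quad = [x[i] * x[j] for i in range(d) for j in range(i, d)]
    return np.concatenate(([1.0], x, quad))

def g(x, a):
    """Generalised linear discriminant: linear in the features y(x), not in x."""
    return a @ quadratic_features(x)

# In 2 dimensions the feature vector is (1, x1, x2, x1^2, x1*x2, x2^2): 6 components
print(len(quadratic_features(np.array([1.0, 2.0]))))  # -> 6
```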

41 Example

42

43

44 Non-linear SVM

45 Non-linear SVM Φ is a mapping into a higher-dimensional (k-dimensional) space. For any dataset there exists a mapping into a higher-dimensional space in which the dataset becomes linearly separable.

46 The kernel trick g(x) = … The explicit calculation of the mapping into the high-dimensional space can be omitted if the kernel value between x_t and x can be computed.
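A sketch of the kernelised discriminant; the multipliers, labels, and the RBF kernel choice below are illustrative, not from the slides:

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def g(x, support_vectors, labels, lambdas, w0, kernel=rbf_kernel):
    """g(x) = sum_t lambda_t * y_t * K(x_t, x) + w_0 over the support vectors only;
    the mapping Phi into the high-dimensional space is never computed explicitly."""
    return sum(l * y * kernel(xt, x)
               for xt, y, l in zip(support_vectors, labels, lambdas)) + w0

# Made-up support vectors and multipliers, just to show the call
sv = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
print(g(np.array([0.5, 0.5]), sv, labels=[+1, -1], lambdas=[0.7, 0.7], w0=0.0))
```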

47 Example: polynomial kernel K(x, y) = (x·y)^p With d = 256 (original dimensions) and p = 4, h = 183,181,376 (dimensions of the high-dimensional space). On the other hand, K(x, y) is known and feasible to calculate, while the inner product in the high-dimensional space is not.
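The figure h = 183,181,376 is consistent with counting the monomials of degree exactly p = 4 over d = 256 variables (assuming that is how the slide defines the feature space):

```latex
h \;=\; \binom{d + p - 1}{p} \;=\; \binom{259}{4}
\;=\; \frac{259 \cdot 258 \cdot 257 \cdot 256}{4!}
\;=\; 183\,181\,376.
```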

48 Kernels in practice There is no rule of thumb for selecting the appropriate kernel.

49 The XOR example

50 The XOR example

51 The XOR example

52 The XOR example

53 Notes on SVM Training is a global optimisation problem (exact optimisation). The performance of the SVM is highly dependent on the choice of the kernel and its parameters. Finding the appropriate kernel for a particular task is "magic".

54 Notes on SVM Complexity depends on the number of support vectors, not on the dimensionality of the feature space. In practice, it achieves good enough generalisation ability even with a small training database.

55 Summary Linear machines; gradient descent; perceptron; SVM (linearly separable case, not separable case, non-linear SVM).

