
1 Dimension Reduction (DR), Multi-Dimensional Scaling (MDS), and Support Vector Machines (SVM)
Peter Fox and Greg Hughes
Data Analytics – 4600/6600, Group 3, Module 8, February 27, 2017

2 Dimension reduction
Principal component analysis (PCA) and metaPCA (in R); see the short sketch after this slide
Singular value decomposition (SVD)
Feature selection, feature reduction
Built into a lot of clustering methods
Why? The curse of dimensionality: some subset of the data should not be used because it only adds noise
What is it? Various methods to reach an optimal subset
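Slide 2 only names PCA; a minimal sketch using base R's prcomp() is shown below (the iris data is an illustrative choice, not from the slides).

pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)  # centre and scale the four measurements
summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # observations projected onto the first two principal components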

3 Simple example

4 More dimensions

5 Feature selection
The goodness of a feature or feature subset depends on the measure used. Various measures:
Information measures
Distance measures
Dependence measures
Consistency measures
Accuracy measures

6 Multidimensional Scaling
Visual representation, typically a 2-D plot, of patterns of proximity in a lower-dimensional space.
"Similar" to PCA/DR but uses dissimilarity as input -> dissimilarity matrix.
An MDS algorithm aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible. Each object is then assigned coordinates in each of the N dimensions.
The number of dimensions N of an MDS plot can exceed 2 and is specified a priori. Choosing N=2 optimizes the object locations for a two-dimensional scatterplot.

7 Four types of MDS
Classical multidimensional scaling
Also known as Principal Coordinates Analysis, Torgerson Scaling or Torgerson–Gower scaling. Takes an input matrix giving dissimilarities between pairs of items and outputs a coordinate matrix whose configuration minimizes a loss function called strain.
Metric multidimensional scaling
A superset of classical MDS that generalizes the optimization procedure to a variety of loss functions and to input matrices of known distances with weights, and so on. A useful loss function in this context is called stress, which is often minimized using a procedure called stress majorization.

8 Four types of MDS (ctd)
Non-metric multidimensional scaling
In contrast to metric MDS, non-metric MDS finds both a non-parametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distances between items, and the location of each item in the low-dimensional space. The relationship is typically found using isotonic regression.
Generalized multidimensional scaling
An extension of metric multidimensional scaling in which the target space is an arbitrary smooth non-Euclidean space. In cases where the dissimilarities are distances on a surface and the target space is another surface, GMDS allows finding the minimum-distortion embedding of one surface into another.

9 In R: function (library)
cmdscale() (stats)
smacofSym() (smacof)
wcmdscale() (vegan)
pco() (ecodist)
pco() (labdsv)
pcoa() (ape)
Only stats is loaded by default; the other packages are neither installed nor loaded by default (see the sketch below).
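A minimal sketch comparing two of these functions on the same distance matrix; the built-in eurodist data and the argument values are illustrative assumptions, not from the slides.

# install.packages(c("smacof", "vegan", "ecodist", "labdsv", "ape"))  # not installed by default
library(smacof)

d <- eurodist                         # built-in distances between 21 European cities
fit_classical <- cmdscale(d, k = 2)   # classical MDS (stats)
fit_stress <- smacofSym(d, ndim = 2)  # stress-majorization MDS (smacof)
head(fit_classical)
head(fit_stress$conf)                 # fitted 2-D configuration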

10 cmdscale()
cmdscale(d, k = 2, eig = FALSE, add = FALSE, x.ret = FALSE)
d - a distance structure such as that returned by dist, or a full symmetric matrix containing the dissimilarities.
k - the maximum dimension of the space in which the data are to be represented; must be in {1, 2, …, n-1}.
eig - indicates whether eigenvalues should be returned.
add - logical indicating whether an additive constant c* should be computed and added to the non-diagonal dissimilarities such that the modified dissimilarities are Euclidean.
x.ret - indicates whether the doubly centred symmetric distance matrix should be returned.

11 Distances between Australian cities
# dist.au is assumed to have been read in beforehand (e.g. from a CSV of pairwise
# distances whose first column holds the city codes); that step is not shown here
row.names(dist.au) <- dist.au[, 1]
dist.au <- dist.au[, -1]
dist.au
##    A  AS   B   D   H   M   P   S
## (8 x 8 table of pairwise distances between Adelaide, Alice Springs, Brisbane,
##  Darwin, Hobart, Melbourne, Perth and Sydney; values omitted)

12 Distances between Australian cities
fit <- cmdscale(dist.au, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
city.names <- c("Adelaide", "Alice Springs", "Brisbane", "Darwin", "Hobart", "Melbourne", "Perth", "Sydney")
text(x, y, pos = 4, labels = city.names)

13

14 R – many ways (of course)
library(igraph)
g <- graph.full(nrow(dist.au))
V(g)$label <- city.names
layout <- layout.mds(g, dist = as.matrix(dist.au))
plot(g, layout = layout, vertex.size = 3)

15

16 Support Vector Machine
Conceptual theory, formulae… see Reading!
SVM: general (nonlinear) classification, regression and outlier detection with an intuitive model representation
Hyperplanes separate the classification spaces (can be multi-dimensional)
Kernel functions can play a key role

17 Schematically

18 Schematically: support vectors; b = bias term (b = 0 means unbiased)

19 Construction
Construct an optimization objective function that is inherently subject to some constraints, like minimizing least-squares error (quadratic).
Most important: the classifier gets the points right by "at least" the margin.
Support Vectors can then be defined as those points in the dataset that have "non-zero" Lagrange multipliers*.
Make a classification on a new point by using only the support vectors – why?
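The objective and constraints on this slide appear only as an image in the deck; a standard statement of the soft-margin problem, written in the deck's x·w − b convention (slide 50) with the error vector z of slide 26, is the following sketch:

\min_{w,\,b,\,z}\ \tfrac{1}{2}\lVert w \rVert^{2} + C \sum_{i} z_{i}
\quad\text{subject to}\quad y_{i}\,(x_{i} \cdot w - b) \ge 1 - z_{i}, \qquad z_{i} \ge 0 .

The Lagrange multipliers α_i attached to the margin constraints are non-zero only for the support vectors, which is why a new point can be classified using the support vectors alone.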

20 Support vectors Support the “plane”

21 What about the "machine" part?
Ignore it – it is somewhat left over from the "machine learning" era.
It is trained and then it classifies.

22 No clear separation = no hyperplane?
Soft-margins… Non-linearity or transformation

23 Feature space
Mapping (transformation) using a function, i.e. a kernel; the goal is linear separability.

24 Kernels or "non-linearity"…
The kernel function represents a dot product of the input data points mapped into the higher-dimensional feature space by the transformation phi.
Note the presence of the "gamma" parameter.
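As a concrete instance (the RBF kernel used in the later example slides; the formula is not printed on this slide, so treat this as a sketch):

K_{\gamma}(x_i, x_j) = \exp\!\left(-\gamma \lVert x_i - x_j \rVert^{2}\right) = \phi(x_i) \cdot \phi(x_j)

where gamma controls how quickly similarity decays with distance and phi is the implicit (here infinite-dimensional) feature map.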

25 Best Linear Separator: Supporting Plane Method
Maximize the distance between two parallel supporting planes.
Distance = "Margin" = 2 / ||w|| (for the planes x·w = b − 1 and x·w = b + 1 of slide 50).

26 Soft Margin SVM Just add non-negative error vector z.

27 Method 2: Find Closest Points in Convex Hulls

28 Plane Bisects Closest Points
The separating plane bisects, and is normal to, the segment joining the closest points, labelled c and d in the figure.

29 Find using a quadratic program
There are many existing and new QP solvers; a sketch using one of them follows.
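To illustrate what such a solver does, here is a hedged sketch of solving the linear soft-margin SVM dual with the quadprog package; the toy data, the 1e-5 tolerance and the variable names are assumptions, not from the slides.

library(quadprog)

set.seed(42)
X <- rbind(matrix(rnorm(40, mean =  2), ncol = 2),
           matrix(rnorm(40, mean = -2), ncol = 2))   # 40 points in 2-D, two clusters
y <- rep(c(1, -1), each = 20)
n <- nrow(X)
C <- 1                                    # soft-margin penalty (upper bound on alpha)

K <- X %*% t(X)                           # linear-kernel Gram matrix
Q <- (y %o% y) * K + diag(1e-8, n)        # small ridge keeps Q positive definite

# solve.QP minimizes 1/2 a'Qa - d'a subject to A'a >= b0 (first meq rows are equalities)
sol <- solve.QP(Dmat = Q, dvec = rep(1, n),
                Amat = cbind(y, diag(n), -diag(n)),          # sum(a*y)=0, a >= 0, a <= C
                bvec = c(0, rep(0, n), rep(-C, n)), meq = 1)

alpha <- sol$solution
sv <- which(alpha > 1e-5)                                    # support vectors: non-zero multipliers
w  <- colSums(alpha[sv] * y[sv] * X[sv, , drop = FALSE])     # primal weight vector
on_margin <- which(alpha > 1e-5 & alpha < C - 1e-5)
b  <- mean(X[on_margin, , drop = FALSE] %*% w - y[on_margin])  # x.w - b = y on the margin
pred <- sign(X %*% w - b)                                    # classify using x.w - b

The non-zero multipliers identify the support vectors, matching the next slide's remark that the solution depends only on them.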

30 Dual of Closest Points Method is Support Plane Method
Solution only depends on support vectors:

31 One bad example? Convex Hulls Intersect! Same argument won’t work.

32 Don’t trust a single point!
Each point must depend on at least two actual data points.

33 Depend on >= two points
Each point must depend on at least two actual data points.

34 Depend on >= two points
Each point must depend on at least two actual data points.

35 Depend on >= two points
Each point must depend on at least two actual data points.

36 Depend on >= two points
Each point must depend on at least two actual data points.

37 Final Reduced/Robust Set
Each point must depend on at least two actual data points. Called Reduced Convex Hull

38 Reduced Convex Hulls Don’t Intersect
Reduce by adding upper bound D

39 Find Closest Points Then Bisect
No change except for D. D determines number of Support Vectors.

40 Dual of Closest Points Method is Soft Margin Method
Solution only depends on support vectors:

41 What will linear SVM do?

42 Linear SVM Fails

43 High Dimensional Mapping trick

44

45 Nonlinear Classification: Map to higher dimensional space
IDEA: Map each point to a higher-dimensional feature space and construct a linear discriminant in that higher-dimensional space. The Dual SVM becomes:
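The formula on this slide is an image in the deck; the standard kernelized dual, obtained by replacing each inner product x_i·x_j with K(x_i, x_j), is sketched as:

\max_{\alpha}\ \sum_{i} \alpha_{i} - \tfrac{1}{2} \sum_{i,j} \alpha_{i}\alpha_{j}\, y_{i} y_{j}\, K(x_{i}, x_{j})
\quad\text{subject to}\quad \sum_{i} \alpha_{i} y_{i} = 0, \qquad 0 \le \alpha_{i} \le C .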

46 Kernel Calculates Inner Product

47 Final Classification via Kernels
The Dual SVM becomes:

48 Generalized Inner Product
By Hilbert–Schmidt kernels (Courant and Hilbert 1953), for certain Φ and K, e.g.
There are also kernels for non-vector data such as strings, histograms, DNA, …

49 Final SVM Algorithm
Solve the Dual SVM QP
Recover the primal variable b
Classify a new x
The solution only depends on the support vectors.
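The formulas on this slide are images; in the deck's x·w − b convention, the classification rule and the recovery of b can be sketched as:

f(x) = \operatorname{sign}\!\Big( \sum_{i \in SV} \alpha_{i}\, y_{i}\, K(x_{i}, x) - b \Big),
\qquad b = \sum_{i \in SV} \alpha_{i}\, y_{i}\, K(x_{i}, x_{k}) - y_{k}

for any margin support vector x_k (one with 0 < α_k < C).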

50 S5: Recall the linear solution
A point (x, y) is misclassified if y(x·w − b) ≤ 0.
Decision boundary: x·w = b
Margin planes: x·w = b − 1 and x·w = b + 1

51 RBF results on Sample Data

52 Have to pick parameters: the effect of C

53 Effect of RBF parameter

54 General Kernel methodology
Pick a learning task
Start with a linear function and the data
Define the loss function
Define the regularization
Formulate the optimization problem in the dual / inner-product space
Construct an appropriate kernel
Solve the problem in the dual space

55 kernlab, svmpath and klaR
Work through the examples (lab): familiar datasets and sample procedures from 4 libraries (these are the most used):
kernlab
e1071
svmpath
klaR
Karatzoglou et al. 2006
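A minimal kernlab sketch, assuming the iris data and the parameter values purely for illustration (the lab's own datasets and settings may differ):

library(kernlab)

data(iris)
set.seed(1)
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test  <- iris[-idx, ]

# C-svc with an RBF (Gaussian) kernel; C and sigma are the parameters to tune
fit <- ksvm(Species ~ ., data = train, type = "C-svc",
            kernel = "rbfdot", kpar = list(sigma = 0.1),
            C = 1, prob.model = TRUE)

pred <- predict(fit, test)
table(pred, test$Species)                        # confusion matrix
head(predict(fit, test, type = "probabilities")) # class probabilities (prob.model = TRUE)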

56 Application of SVM
Classification, outlier detection, regression…
Can produce labels or probabilities (and, when used with tree partitioning, can produce decision values)
Different minimization functions subject to different constraints (Lagrange multipliers)
Observe the effect of changing the C parameter and the kernel; a small grid-search sketch follows
See Karatzoglou et al. 2006
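One way to observe the effect of C (cost) and the kernel width is a small grid search; this sketch uses e1071::tune.svm with illustrative parameter ranges, not values from the slides.

library(e1071)

data(iris)
set.seed(1)
# cross-validated grid search over the RBF gamma parameter and the cost C
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = 10^(-3:0), cost = 10^(0:2))
summary(tuned)         # error for every (gamma, cost) pair
tuned$best.parameters  # the combination with the lowest CV error
tuned$best.model       # svm fit refit with those parameters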

57 Types of SVM (names)
Classification SVM Type 1 (also known as C-SVM classification)
Classification SVM Type 2 (also known as nu-SVM classification)
Regression SVM Type 1 (also known as epsilon-SVM regression)
Regression SVM Type 2 (also known as nu-SVM regression)
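These four names map directly onto the type argument of e1071::svm(); the dataset and the cost/nu/epsilon values in this sketch are illustrative assumptions.

library(e1071)
data(iris)

m_c   <- svm(Species ~ ., data = iris, type = "C-classification",  cost = 1)        # C-SVM
m_nu  <- svm(Species ~ ., data = iris, type = "nu-classification", nu = 0.2)        # nu-SVM
m_eps <- svm(Sepal.Length ~ ., data = iris, type = "eps-regression", epsilon = 0.1) # epsilon-SVR
m_nur <- svm(Sepal.Length ~ ., data = iris, type = "nu-regression",  nu = 0.2)      # nu-SVR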

58 More kernels Karatzoglou et al. 2006

59 Timing Karatzoglou et al. 2006

60 Library capabilities Karatzoglou et al. 2006

61 Extensions: many inference tasks
Regression
One-class classification, novelty detection
Ranking
Clustering
Multi-task learning
Learning kernels
Canonical Correlation Analysis
Principal Component Analysis

62 Algorithms
Types of solvers:
General-purpose solvers: CPLEX by ILOG, the Matlab optimization toolkit
Special-purpose solvers that exploit the structure of the problem
The best linear-SVM solvers take time linear in the number of training data points.
The best kernel-SVM solvers take time quadratic in the number of training data points.
Good news: since the problem is convex, the particular algorithm doesn't really matter as long as it can solve the problem.

63 Hallelujah! BUT…
Generalization theory and practice meet
General methodology for many types of inference problems
Same program + new kernel = new method
No problems with local minima
Few model parameters; avoids overfitting
Robust optimization methods
Applicable to non-vector problems
Easy to use and tune
Successful applications
BUT…

64 Catches
Will SVMs beat my best hand-tuned method Z on problem X?
Do SVMs scale to massive datasets?
How to choose C and the kernel?
How to transform the data?
How to incorporate domain knowledge?
How to interpret the results?
Are linear methods enough?

