1 Peter Fox Data Analytics – 4600/6600 Week 9a, March 29, 2016 Dimension reduction and MD scaling, Support Vector Machines

Paid opportunity… 2pm Fri A small medical practice in Troy is considering opening a branch office in the Capital Region. They would like some GIS analysis to assist them in evaluating options for the location of the 2nd office. Considerations would include demographics (with weighting by age), location of competitors (approximately 6), and drive-time analysis. They are open to considering other layers that might be suggested by the person(s) setting up the model. 2

Dimension reduction… Principal component analysis (PCA) and metaPCA (in R), Singular Value Decomposition, feature selection/reduction (built into a lot of clustering) Why? –Curse of dimensionality – or – some subset of the data should not be used because it only adds noise What is it? –Various methods to reach an optimal subset 3
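As a quick, hedged illustration of PCA and SVD in base R (my own example on the built-in iris measurements, not one of the lab scripts):
# PCA on the four iris measurement columns, centred and scaled
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
summary(pca)        # proportion of variance explained by each component
head(pca$x[, 1:2])  # the data projected onto the first two components
# the same reduction viewed through the singular value decomposition
sv <- svd(scale(iris[, 1:4]))
sv$d                # small singular values mark directions that mostly contribute noise
Components (or singular directions) with small variance are the candidates to drop when reducing dimension.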

Simple example 4

More dimensions 5

Feature selection The goodness of a feature/feature subset is judged using one or more measures Various measures: –Information measures –Distance measures –Dependence measures –Consistency measures –Accuracy measures 6
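A minimal sketch of a dependence-measure filter in base R (my own example on the built-in mtcars data, not one of the course labs): rank predictors by their absolute correlation with the response and keep the top-ranked ones.
# score each predictor by |correlation| with the response mpg
scores <- sapply(mtcars[, -1], function(col) abs(cor(col, mtcars$mpg)))
sort(scores, decreasing = TRUE)  # higher score = stronger (linear) dependence on mpg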

On your own…
library(EDR) # effective dimension reduction
library(dr)
library(clustrd)
install.packages("edrGraphicalTools")
library(edrGraphicalTools)
demo(edr_ex1)
demo(edr_ex2)
demo(edr_ex3)
demo(edr_ex4)
7

Some examples Lab8b_dr1_2016.R Lab8b_dr2_2016.R Lab8b_dr3_2016.R Lab8b_dr4_2016.R 8

Multidimensional Scaling Visual representation ~ 2-D plot - patterns of proximity in a lower dimensional space "Similar" to PCA/DR but uses dissimilarity as input -> a dissimilarity matrix –An MDS algorithm aims to place each object in N-dimensional space such that the between-object distances are preserved as well as possible. –Each object is then assigned coordinates in each of the N dimensions. –The number of dimensions of an MDS plot N can exceed 2 and is specified a priori. –Choosing N=2 optimizes the object locations for a two-dimensional scatterplot 9

Four types of MDS Classical multidimensional scaling –Also known as Principal Coordinates Analysis, Torgerson Scaling or Torgerson–Gower scaling. Takes an input matrix giving dissimilarities between pairs of items and outputs a coordinate matrix whose configuration minimizes a loss function called strain. Metric multidimensional scaling –A superset of classical MDS that generalizes the optimization procedure to a variety of loss functions and input matrices of known distances with weights and so on. A useful loss function in this context is called stress, which is often minimized using a procedure called stress majorization. 10

Four types of MDS ctd Non-metric multidimensional scaling –In contrast to metric MDS, non-metric MDS finds both a non-parametric monotonic relationship between the dissimilarities in the item-item matrix and the Euclidean distances between items, and the location of each item in the low-dimensional space. The relationship is typically found using isotonic regression. Generalized multidimensional scaling –An extension of metric multidimensional scaling, in which the target space is an arbitrary smooth non-Euclidean space. In cases where the dissimilarities are distances on a surface and the target space is another surface, GMDS allows finding the minimum-distortion embedding of one surface into another. 11
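A small sketch contrasting the classical and non-metric variants above, using the built-in eurodist road distances (an illustrative choice, not necessarily the course data). Classical scaling is cmdscale() from stats; non-metric scaling is isoMDS() from MASS.
library(MASS)                            # provides isoMDS (non-metric MDS)
classical <- cmdscale(eurodist, k = 2)   # classical (Torgerson) scaling
nonmetric <- isoMDS(eurodist, k = 2)     # non-metric MDS via isotonic regression
plot(classical, type = "n", xlab = "", ylab = "")
text(classical, labels = labels(eurodist))
nonmetric$stress                         # Kruskal stress of the 2-D non-metric fit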

In R: function (library)
cmdscale() (stats)
smacofSym() (smacof)
wcmdscale() (vegan)
pco() (ecodist)
pco() (labdsv)
pcoa() (ape)
Only stats is loaded by default; the rest are not installed by default. 12

cmdscale() cmdscale(d, k = 2, eig = FALSE, add = FALSE, x.ret = FALSE)
d - a distance structure such as that returned by dist, or a full symmetric matrix containing the dissimilarities.
k - the maximum dimension of the space in which the data are to be represented; must be in {1, 2, …, n-1}.
eig - indicates whether eigenvalues should be returned.
add - logical indicating if an additive constant c* should be computed and added to the non-diagonal dissimilarities such that the modified dissimilarities are Euclidean.
x.ret - indicates whether the doubly centred symmetric distance matrix should be returned.
13

Distances between Australian cities
# dist.au is assumed to have already been read in, with city codes in its first column
row.names(dist.au) <- dist.au[, 1]
dist.au <- dist.au[, -1]
dist.au
## (prints the 8 x 8 distance matrix; rows/columns are A, AS, B, D, H, M, P, S)

Distances between Australian cities
fit <- cmdscale(dist.au, eig = TRUE, k = 2)
x <- fit$points[, 1]
y <- fit$points[, 2]
plot(x, y, pch = 19, xlim = range(x) + c(0, 600))
city.names <- c("Adelaide", "Alice Springs", "Brisbane", "Darwin",
                "Hobart", "Melbourne", "Perth", "Sydney")
text(x, y, pos = 4, labels = city.names)
15

16 (the 2-D MDS map of the Australian cities produced by the code above)

R – many ways (of course)
library(igraph)
g <- graph.full(nrow(dist.au))
V(g)$label <- city.names
layout <- layout.mds(g, dist = as.matrix(dist.au))
plot(g, layout = layout, vertex.size = 3)
17

18 (the city map produced by the igraph code above)

On your own (Friday) Lab8b_mds1_2016.R Lab8b_mds2_2016.R Lab8b_mds3_2016.R http://gastonsanchez.com/blog/how-to/2013/01/23/MDS-in-R.html 19

Support Vector Machine Conceptual theory, formulae… SVM - general (nonlinear) classification, regression and outlier detection with an intuitive model representation Hyperplanes separate the classification spaces (can be multi-dimensional) Kernel functions can play a key role 20
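As a concrete (hedged) preview in R, using the e1071 package that appears later in these slides; the train/test split is my own illustration:
library(e1071)
set.seed(1)
idx   <- sample(nrow(iris), 100)                 # simple train/test split
train <- iris[idx, ]
test  <- iris[-idx, ]
fit  <- svm(Species ~ ., data = train, kernel = "radial", cost = 1)
pred <- predict(fit, test)
table(predicted = pred, actual = test$Species)   # confusion matrix on the held-out data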

Schematically 21

Schematically 22 b=bias term, b=0 (unbiased) Support Vectors

Construction Construct an optimization objective function that is inherently subject to some constraints –Like minimizing least-squares error (quadratic) Most important: the classifier gets the points right by “at least” the margin Support Vectors can then be defined as those points in the dataset that have non-zero Lagrange multipliers. –make a classification on a new point by using only the support vectors – why? 23
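In standard notation (my reconstruction, not copied from the slide images), the objective and constraints being described are the soft-margin problem
\[ \min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \xi_i \quad\text{subject to}\quad y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,\;\; \xi_i \ge 0, \]
and the support vectors are exactly the training points whose Lagrange multipliers α_i are non-zero; every other point can be dropped without changing the resulting classifier.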

Support vectors Support the “plane” 24

What about the “machine” part Ignore it – somewhat leftover from the “machine learning” era –It is trained and then –Classifies 25

No clear separation = no hyperplane? 26 Soft-margins…Non-linearity or transformation

Feature space 27 Mapping (transformation) using a function, i.e. a kernel – the goal is linear separability

Kernels or “non-linearity”… 28 The kernel function represents a dot product of input data points mapped into the higher-dimensional feature space by the transformation phi; note the presence of the “gamma” parameter.
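Written out in standard notation (not taken verbatim from the slide image), the kernel is an inner product in feature space, and the RBF kernel used here is
\[ K(x_i, x_j) = \varphi(x_i)\cdot\varphi(x_j), \qquad K_{\mathrm{RBF}}(x_i, x_j) = \exp\!\left(-\gamma\,\lVert x_i - x_j\rVert^2\right), \]
where φ is the (possibly implicit) feature map and γ controls how quickly similarity decays with distance.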

Best Linear Separator: Supporting Plane Method Maximize the distance between two parallel supporting planes Distance = “Margin” = 2/||w||

Soft Margin SVM Just add non-negative error vector z.

Method 2: Find Closest Points in Convex Hulls c d

Plane Bisects Closest Points d c

Find c and d using a quadratic program Many existing and new QP solvers.

Dual of Closest Points Method is Support Plane Method Solution only depends on the support vectors: w = Σ_i α_i y_i x_i, with α_i > 0 only for support vectors

One bad example? Convex Hulls Intersect! Same argument won’t work.

Don’t trust a single point! Each point must depend on at least two actual data points.

Depend on >= two points Each point must depend on at least two actual data points.


Final Reduced/Robust Set Each point must depend on at least two actual data points. Called Reduced Convex Hull

Reduced Convex Hulls Don’t Intersect Reduce by adding upper bound D

Find Closest Points Then Bisect No change except for D. D determines number of Support Vectors.

Dual of Closest Points Method is Soft Margin Method Solution only depends on the support vectors: w = Σ_i α_i y_i x_i, with α_i > 0 only for support vectors

What will linear SVM do?

Linear SVM Fails

High Dimensional Mapping trick arma/svm

Nonlinear Classification: Map to higher dimensional space IDEA: Map each point to a higher dimensional feature space and construct a linear discriminant in that space. The Dual SVM becomes: maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j Φ(x_i)·Φ(x_j), subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Kernel Calculates Inner Product

Final Classification via Kernels The Dual SVM becomes: maximize Σ_i α_i − ½ Σ_i Σ_j α_i α_j y_i y_j K(x_i, x_j), subject to Σ_i α_i y_i = 0 and 0 ≤ α_i ≤ C

Generalized Inner Product By Hilbert-Schmidt Kernels (Courant and Hilbert 1953) for certain Φ and K, e.g. polynomial and Gaussian (RBF) kernels. Also kernels for non-vector data like strings, histograms, DNA, …

Final SVM Algorithm Solve the Dual SVM QP, recover the primal variable b, classify a new x as f(x) = sign(Σ_i α_i y_i K(x, x_i) + b) The solution only depends on the support vectors (points with α_i > 0).

SVM AMPL DUAL MODEL

S5: Recall linear solution

RBF results on Sample Data

Have to pick parameters Effect of C

Effect of RBF parameter
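A hedged sketch of how one might explore these two parameters in R with e1071::tune.svm (grid values are my own choices):
library(e1071)
set.seed(1)
tuned <- tune.svm(Species ~ ., data = iris,
                  gamma = 10^(-3:1),   # RBF width parameter
                  cost  = 10^(-1:2))   # the C parameter
tuned$best.parameters   # gamma/cost pair with the lowest cross-validated error
plot(tuned)             # error surface over the (gamma, cost) grid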

General Kernel methodology Pick a learning task Start with linear function and data Define loss function Define regularization Formulate optimization problem in dual space/inner product space Construct an appropriate kernel Solve problem in dual space

kernlab, svmpath and klaR Work through the examples –Familiar datasets and sample procedures from 4 libraries (these are the most used) –kernlab –e1071 –svmpath –klaR 61 Karatzoglou et al. 2006
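For example, a minimal kernlab fit looks roughly like this (parameter values are illustrative, not taken from Karatzoglou et al.):
library(kernlab)
k.fit <- ksvm(Species ~ ., data = iris,
              kernel = "rbfdot", kpar = list(sigma = 0.1), C = 1,
              cross = 5)   # 5-fold cross-validation
k.fit                      # prints training error, cross-validation error, number of SVs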

Application of SVM Classification, outlier, regression… Can produce labels or probabilities (and when used with tree partitioning can produce decision values) Different minimization functions subject to different constraints (Lagrange multipliers) Observe the effect of changing the C parameter and the kernel 62 See Karatzoglou et al. 2006
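For instance, e1071 can return class probabilities when the model is fitted with probability = TRUE (a small sketch, reusing iris):
library(e1071)
pfit <- svm(Species ~ ., data = iris, probability = TRUE)
pp   <- predict(pfit, iris[1:5, ], probability = TRUE)
attr(pp, "probabilities")   # per-class probability estimates for the first 5 rows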

Types of SVM (names) Classification SVM Type 1 (also known as C-SVM classification) Classification SVM Type 2 (also known as nu-SVM classification) Regression SVM Type 1 (also known as epsilon-SVM regression) Regression SVM Type 2 (also known as nu-SVM regression) 63
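As far as I know, these four variants map onto the type argument of e1071::svm() as follows (a sketch, not from the slides):
library(e1071)
svm(Species ~ ., data = iris, type = "C-classification")     # Classification SVM Type 1
svm(Species ~ ., data = iris, type = "nu-classification")    # Classification SVM Type 2
svm(Sepal.Length ~ ., data = iris, type = "eps-regression")  # Regression SVM Type 1
svm(Sepal.Length ~ ., data = iris, type = "nu-regression")   # Regression SVM Type 2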

More kernels 64 Karatzoglou et al. 2006

Timing 65 Karatzoglou et al. 2006

Library capabilities 66 Karatzoglou et al. 2006

Extensions Many Inference Tasks –Regression –One-class Classification, novelty detection –Ranking –Clustering –Multi-Task Learning –Learning Kernels –Canonical Correlation Analysis –Principal Component Analysis

Algorithms Algorithm Types: General Purpose solvers –CPLEX by ILOG –Matlab optimization toolkit Special purpose solvers exploit the structure of the problem –Best linear SVM solvers take time linear in the number of training data points. –Best kernel SVM solvers take time quadratic in the number of training data points. Good news: since the problem is convex, the choice of algorithm doesn't matter much as long as it is solvable.

Hallelujah! Generalization theory and practice meet General methodology for many types of inference problems Same Program + New Kernel = New method No problems with local minima Few model parameters. Avoids overfitting Robust optimization methods. Applicable to non-vector problems. Easy to use and tune Successful Applications BUT…

Catches Will SVMs beat my best hand-tuned method Z on problem X? Do SVMs scale to massive datasets? How to choose C and the kernel? How to transform data? How to incorporate domain knowledge? How to interpret results? Are linear methods enough?