KERNELS AND PERCEPTRONS

The perceptron

A sends B an instance x_i; B computes the prediction ŷ_i = sign(v_k · x_i) and returns it; if it is a mistake, B updates v_{k+1} = v_k + y_i x_i. Here x is a vector and y is -1 or +1.

[Figure: margin diagram with the target separator u, margin 2γ, and successive hypotheses v_1, v_2.]

The mistake bound depends on how easy the learning problem is, not on the dimension of the vectors x. Fairly intuitive: the "similarity" of v to u looks like (v·u)/|v·v|; (v·u) grows by ≥ γ after each mistake, while (v·v) grows by ≤ R².
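
Below is a minimal Python/NumPy sketch of the mistake-driven update described above. The toy data, the number of passes, and treating sign(0) as +1 are illustrative choices, not part of the original slides.

    import numpy as np

    def perceptron(X, y, passes=10):
        """Mistake-driven perceptron: v <- v + y_i * x_i on each mistake."""
        v = np.zeros(X.shape[1])
        for _ in range(passes):
            for x_i, y_i in zip(X, y):
                y_hat = 1 if np.dot(v, x_i) >= 0 else -1   # sign(v . x_i)
                if y_hat != y_i:                           # mistake
                    v = v + y_i * x_i
        return v

    # Toy linearly separable data (labels in {-1, +1})
    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    v = perceptron(X, y)
    print(v, [int(np.sign(np.dot(v, x))) for x in X])      # predictions should match y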

The kernel perceptron

A sends B an instance x_i; B computes ŷ_i = sign(v_k · x_i); if it is a mistake, B updates v_{k+1} = v_k + y_i x_i. Mathematically the same as before … but allows use of the kernel trick.

The kernel perceptron

A sends B an instance x_i; B computes ŷ_i = v_k · x_i; if it is a mistake, B updates v_{k+1} = v_k + y_i x_i. Mathematically the same as before … but allows use of the "kernel trick". Other kernel methods (SVM, Gaussian processes) aren't constrained to the limited set (+1/-1/0) of weights on the K(x,v) values.
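
As a rough illustration of why this update admits the kernel trick: v_k is a sum of y_i φ(x_i) over the mistakes, so v_k · φ(x) can be computed as a sum of y_i K(x_i, x) without ever forming φ explicitly. The sketch below is a dual-form perceptron under illustrative assumptions: the quadratic kernel, the XOR-style data, and the number of passes are all made up.

    import numpy as np

    def kernel_perceptron(X, y, K, passes=10):
        """Dual perceptron: v is never formed explicitly.
        v = sum over mistakes i of y_i * phi(x_i), so
        v . phi(x) = sum over mistakes i of y_i * K(x_i, x)."""
        mistakes = []                            # indices of examples we erred on
        for _ in range(passes):
            for t, (x_t, y_t) in enumerate(zip(X, y)):
                score = sum(y[i] * K(X[i], x_t) for i in mistakes)
                y_hat = 1 if score >= 0 else -1
                if y_hat != y_t:
                    mistakes.append(t)           # "v <- v + y_t * phi(x_t)"
        return mistakes

    # Quadratic kernel can separate XOR-like data that is not linearly separable
    K = lambda a, b: (1.0 + np.dot(a, b)) ** 2
    X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
    y = np.array([1, 1, -1, -1])
    mistakes = kernel_perceptron(X, y, K)
    score = lambda x: sum(y[i] * K(X[i], x) for i in mistakes)
    print([1 if score(x) >= 0 else -1 for x in X])   # should match y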

Some common kernels
– Linear kernel: K(x, x') = x · x'
– Polynomial kernel: K(x, x') = (1 + x · x')^d
– Gaussian kernel: K(x, x') = exp(-‖x - x'‖² / (2σ²))
More later….
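
A sketch of these kernels as plain Python functions. The formulas on the original slide did not survive extraction, so the standard textbook forms are assumed here, and the degree d, offset c, and bandwidth σ defaults are illustrative.

    import numpy as np

    def linear_kernel(x, z):
        return np.dot(x, z)

    def polynomial_kernel(x, z, d=3, c=1.0):
        # (c + x.z)^d ; c and d are illustrative defaults
        return (c + np.dot(x, z)) ** d

    def gaussian_kernel(x, z, sigma=1.0):
        # exp(-||x - z||^2 / (2 sigma^2)), also called the RBF kernel
        diff = np.asarray(x) - np.asarray(z)
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

    x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print(linear_kernel(x, z), polynomial_kernel(x, z), gaussian_kernel(x, z))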

Kernels 101
– Duality
  – and computational properties
  – Reproducing Kernel Hilbert Space (RKHS)
– Gram matrix
– Positive semi-definite
– Closure properties

Kernels 101
Duality: two ways to look at this, i.e. two different computational ways of getting the same behavior:
– Explicitly map from x to φ(x), i.e. to the point corresponding to x in the Hilbert space
– Implicitly map from x to φ(x) by changing the kernel function K

Kernels 101
– Duality
– Gram matrix K: k_ij = K(x_i, x_j)
  K(x, x') = K(x', x) ⇒ the Gram matrix is symmetric

Kernels 101
– Duality
– Gram matrix K: k_ij = K(x_i, x_j)
  K(x, x') = K(x', x) ⇒ the Gram matrix is symmetric
– K is "positive semi-definite" ⇔ z^T K z ≥ 0 for all z
– Fun fact: the Gram matrix is positive semi-definite ⇔ K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ for some φ
  Proof sketch: φ(x) uses the eigenvectors of K to represent x
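
The NumPy sketch below illustrates these facts on a small sample: the Gram matrix is symmetric, its eigenvalues are non-negative, and an explicit φ for the training points can be rebuilt from the eigendecomposition so that ⟨φ(x_i), φ(x_j)⟩ reproduces K(x_i, x_j). The kernel and data are arbitrary illustrative choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))
    K = lambda a, b: (1.0 + np.dot(a, b)) ** 2          # any valid kernel

    # Gram matrix: k_ij = K(x_i, x_j)
    G = np.array([[K(a, b) for b in X] for a in X])
    print(np.allclose(G, G.T))                          # symmetric

    # Positive semi-definite: z^T G z >= 0 for all z  <=>  eigenvalues >= 0
    eigvals, eigvecs = np.linalg.eigh(G)
    print(eigvals.min() >= -1e-9)

    # Recover an explicit phi for the training points from the eigensystem:
    # phi(x_i) = sqrt(eigenvalues) scaled i-th row of eigvecs, so that
    # <phi(x_i), phi(x_j)> reproduces the Gram matrix.
    Phi = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    print(np.allclose(Phi @ Phi.T, G))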

HASH KERNELS AND “THE HASH TRICK”

Question
– Most of the weights in a classifier are small and not important
– So we can use the “hash trick”

The hash trick as a kernel
– Usually we implicitly map from x to φ(x)
  – all computations of the learner are in terms of K(x, x') = ⟨φ(x), φ(x')⟩
  – because φ(x) is large
– In this case we explicitly map from x to φ(x)
  – φ(x) is small
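
A sketch of the explicit hashed map, assuming sparse examples represented as Python dicts; the md5-based bucket and sign hashes and the bucket count m are illustrative stand-ins, not the hash used in any particular system. The sign hash anticipates the "slightly different hash to avoid systematic bias" mentioned on the next slide.

    import hashlib
    import numpy as np

    def hashed_phi(x, m=1000):
        """Explicitly map a sparse example {feature_name: value} into R^m.
        h(f) picks a bucket; a second sign hash xi(f) in {-1, +1} reduces
        systematic bias when unrelated features collide."""
        phi = np.zeros(m)
        for f, value in x.items():
            digest = hashlib.md5(f.encode("utf8")).hexdigest()
            h = int(digest[:8], 16) % m                        # bucket hash
            xi = 1 if int(digest[8:16], 16) % 2 == 0 else -1   # sign hash
            phi[h] += xi * value
        return phi

    x = {"word=kernel": 1.0, "word=perceptron": 1.0, "bias": 1.0}
    print(np.count_nonzero(hashed_phi(x)))      # at most 3 nonzero buckets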

Some details
– A slightly different hash is used to avoid systematic bias
– m is the number of buckets you hash into (R in my discussion)

Some details
– I.e., a hashed vector is probably close to the original vector

Some details
– I.e., the inner products between x and x' are probably not changed too much by the hash function: a classifier will probably still work
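
A quick empirical check of that claim, under illustrative assumptions (fixed random maps standing in for the bucket and sign hashes, synthetic sparse vectors): the hashed inner product should land near the original one.

    import numpy as np

    def make_hashes(d, m, seed=0):
        """Per-coordinate bucket and sign 'hashes' (fixed random maps)."""
        r = np.random.default_rng(seed)
        return r.integers(0, m, size=d), r.choice([-1.0, 1.0], size=d)

    def hash_vector(x, buckets, signs, m):
        phi = np.zeros(m)
        np.add.at(phi, buckets, signs * x)      # phi_j = sum of signed x_i with h(i) = j
        return phi

    d, m = 10000, 1000
    buckets, signs = make_hashes(d, m)
    r = np.random.default_rng(1)
    x = r.normal(size=d) * (r.random(d) < 0.05)                       # sparse vector
    z = x + 0.1 * r.normal(size=d) * (r.random(d) < 0.05)             # correlated sparse vector
    print(np.dot(x, z),
          np.dot(hash_vector(x, buckets, signs, m),
                 hash_vector(z, buckets, signs, m)))                  # should be close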

The hash kernel: implementation
One problem: debugging is harder
– Features are no longer meaningful
– There’s a new way to ruin a classifier: change the hash function
– You can separately compute the set of all words that hash to h and guess what the features mean: build an inverted index h → w1, w2, …
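
A tiny sketch of that inverted index, reusing the illustrative md5 bucket hash from the earlier sketch; the vocabulary here is made up.

    import hashlib
    from collections import defaultdict

    def bucket(word, m=1000):
        return int(hashlib.md5(word.encode("utf8")).hexdigest()[:8], 16) % m

    vocabulary = ["kernel", "perceptron", "margin", "hash", "gradient"]
    inverted_index = defaultdict(list)          # h -> [w1, w2, ...]
    for w in vocabulary:
        inverted_index[bucket(w)].append(w)

    # Given a bucket with a large learned weight, look up what hashes there
    print(inverted_index[bucket("kernel")])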

ADAPTIVE GRADIENTS

Motivation
What’s the best learning rate?
– If a feature is rare but relevant, its rate should be high, else learning will be slow (does regularization make this better or worse?)
– But then you could overshoot the local minima when you train

Motivation
What’s the best learning rate?
– It depends on the typical gradient for a feature: small → fast rate, large → slow rate
– Sadly we can’t afford to ignore rare features; we could have a lot of them

Motivation
What’s the best learning rate?
– Let’s pretend our observed gradients are drawn from a zero-mean Gaussian, estimate the variances, and then scale dimension j by sd(j)^-1

Motivation
What’s the best learning rate?
– Let’s pretend our observed gradients are drawn from a zero-mean Gaussian, estimate the variances, and then scale dimension j by sd(j)^-1
– Ignore co-variances for efficiency

Adagrad
– g_τ is the gradient at time τ; the sum of the g_τ g_τ^T terms plays the role of the covariances, and only its diagonal is kept
– Scale the step for dimension j by (Σ_τ g_{τ,j}²)^-1/2
– η = 1 is usually ok
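
A minimal AdaGrad sketch for a linear model with squared loss, assuming the diagonal variant motivated above; the learning rate η = 1, the small ε smoothing term, and the toy data are illustrative additions, not taken from the slides.

    import numpy as np

    def adagrad_linear_regression(X, y, eta=1.0, epochs=50, eps=1e-8):
        """Per-coordinate AdaGrad: divide the step for dimension j by
        sqrt(sum over time of g_j^2), so rare-but-relevant features keep
        a large effective learning rate."""
        w = np.zeros(X.shape[1])
        g_sq_sum = np.zeros_like(w)              # running sum of squared gradients
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                g = (np.dot(w, x_i) - y_i) * x_i          # gradient of 1/2 (w.x - y)^2
                g_sq_sum += g ** 2
                w -= eta * g / (np.sqrt(g_sq_sum) + eps)
        return w

    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
    y = np.array([1.0, 2.0, 3.0, 4.0])           # consistent with w = [1, 2]
    print(adagrad_linear_regression(X, y))       # roughly [1., 2.]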

ALL-REDUCE

Introduction
Common pattern:
– do some learning in parallel
– aggregate local changes from each processor to shared parameters
– distribute the new shared parameters back to each processor
– and repeat….
AllReduce is implemented in MPI, and recently in VW code (John Langford) in a Hadoop-compatible scheme.
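
A toy, single-process simulation of this pattern (not VW's or MPI's actual implementation): each "worker" computes a local gradient on its data shard, an all-reduce sums the contributions, and every worker continues from the same aggregated parameters. All names, data, and step sizes here are made up for illustration.

    import numpy as np

    def allreduce(values):
        """All-reduce with a sum: every participant receives the same total.
        (MPI or VW's spanning-tree server plays this role in practice.)"""
        total = sum(values)
        return [total.copy() for _ in values]    # "broadcast" back to everyone

    # Toy setup: p workers, each holding a shard of a least-squares problem
    p, d = 4, 3
    rng = np.random.default_rng(0)
    shards = [rng.normal(size=(10, d)) for _ in range(p)]
    targets = [s @ np.array([1.0, -2.0, 0.5]) for s in shards]
    w = np.zeros(d)                              # shared parameters

    for step in range(100):
        # 1. do some learning in parallel: each worker computes a local gradient
        local_grads = [s.T @ (s @ w - t) / len(t) for s, t in zip(shards, targets)]
        # 2. aggregate local changes with all-reduce; 3. everyone gets the same result
        summed = allreduce(local_grads)
        # 4. and repeat: all workers apply the same averaged update
        w -= 0.1 * summed[0] / p
    print(w)                                     # close to [1., -2., 0.5]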

Gory details of VW Hadoop-AllReduce
Spanning-tree server:
– A separate process constructs a spanning tree of the compute nodes in the cluster and then acts as a server
Worker nodes (“fake” mappers):
– Input for each worker is locally cached
– Workers all connect to the spanning-tree server
– Workers all execute the same code, which might contain AllReduce calls: workers synchronize whenever they reach an all-reduce

Hadoop AllReduce
– Don’t wait for duplicate jobs

Second-order method - like Newton’s method

2^24 features, ~100 non-zeros/example, 2.3B examples. An example is a user/page/ad and conjunctions of these, positive if there was a click-thru on the ad.

50M examples; explicitly constructed kernel → 11.7M features, 3,300 nonzeros/example. Old method: SVM, 3 days (reporting time to get to a fixed test error).