CS 189 Brian Chu Office Hours: Cory 246, 6-7p Mon. (hackerspace lounge) brianchu.com.

Agenda
- Tips on #winning HW
- Lecture clarifications
- Worksheet
Email me for slides.

HW 1
Slow code? Vectorize everything.
- In Python, use numpy slicing; in MATLAB, use array slicing.
- Use matrix operations as much as possible. This will be much more important for neural nets.

Examples
A = np.array([1, 2, 3, 4])
B = np.array([1, 2, 3, 5])
Find # of differences between A and B: np.count_nonzero(A != B)
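
A minimal sketch of the "vectorize everything" tip, assuming NumPy: the explicit loop and the single vectorized call compute the same count, but the vectorized version avoids Python-level iteration and scales much better on large arrays.

```python
import numpy as np

A = np.array([1, 2, 3, 4])
B = np.array([1, 2, 3, 5])

# Slow: explicit Python loop over elements.
n_diff_loop = 0
for a, b in zip(A, B):
    if a != b:
        n_diff_loop += 1

# Fast: one vectorized elementwise comparison over the whole array.
n_diff_vec = np.count_nonzero(A != B)

assert n_diff_loop == n_diff_vec == 1
```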

How to Win Kaggle
Feature engineering!
- Spam: add more word frequencies
  - Other tricks: bag of words, tf-idf
- MNIST: add your own visual features
  - Histogram of oriented gradients
  - Another trick that is amazing and will guarantee you win the digits competition every time, so I won't tell you what it is.
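
A hedged illustration of the bag-of-words and tf-idf tricks mentioned above (not the course's actual spam pipeline); it assumes a recent scikit-learn, and the toy messages are made up.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy corpus standing in for spam/ham messages (made-up examples).
messages = [
    "win a free prize now",
    "meeting moved to monday",
    "free free cash prize",
]

# Bag of words: raw word counts per message.
bow = CountVectorizer()
X_counts = bow.fit_transform(messages)      # sparse matrix, shape (3, vocabulary size)

# tf-idf: reweight the counts so very common words contribute less.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(messages)

print(bow.get_feature_names_out())
print(X_tfidf.toarray().round(2))
```

Either matrix can then be appended to the provided word-frequency features before training the classifier.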

The gradient is a linear operator
$R[w] = \frac{1}{N} \sum_{k=1}^{N} L(f(x_k, w), y_k)$
True (total) gradient: $w_i \leftarrow w_i - \eta\, \partial R/\partial w_i$, i.e. $w \leftarrow w - \eta \nabla_w R$ with $\nabla_w R = [\partial R/\partial w_i]$
Stochastic gradient: $w_i \leftarrow w_i - \eta\, \partial L/\partial w_i$, i.e. $w \leftarrow w - \eta \nabla_w L$ with $\nabla_w L = [\partial L/\partial w_i]$
$\nabla_w R[w] = \frac{1}{N} \sum_{k=1}^{N} \nabla_w L(f(x_k, w), y_k)$
The total gradient is the average over the gradients of single samples, but…
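
A minimal NumPy sketch of the two update rules on this slide, using squared loss on a linear model as a stand-in for L (an assumption for illustration, not the course's exact setup).

```python
import numpy as np

def grad_single(w, x_k, y_k):
    """Gradient of L(f(x_k, w), y_k) = 0.5 * (w @ x_k - y_k)**2 with respect to w."""
    return (w @ x_k - y_k) * x_k

def total_gradient_step(w, X, y, eta):
    # True (total) gradient: grad R is the average of the per-sample gradients.
    grad_R = np.mean([grad_single(w, x_k, y_k) for x_k, y_k in zip(X, y)], axis=0)
    return w - eta * grad_R

def stochastic_gradient_step(w, X, y, eta, rng):
    # Stochastic gradient: use a single randomly chosen sample's gradient instead.
    k = rng.integers(len(X))
    return w - eta * grad_single(w, X[k], y[k])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(2000):
    w = stochastic_gradient_step(w, X, y, eta=0.05, rng=rng)
print(w)   # approaches [1.0, -2.0, 0.5]
```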

Example: the Perceptron algorithm (Rosenblatt, 1957)
$f(x) = \sum_i w_i x_i$, so $z = y\, f(x) = \sum_i w_i y\, x_i$ and $\partial z/\partial w_i = y\, x_i$
$L_{\mathrm{perceptron}} = \max(0, -z)$
$\Delta w_i = -\eta\, \partial L/\partial w_i = -\eta\, (\partial L/\partial z)(\partial z/\partial w_i) = \eta\, y\, x_i$ if $z < 0$ (misclassified example), $0$ otherwise
Like Hebb's rule, but for misclassified examples only.
[Figure: the perceptron loss $\max(0, -z)$ plotted against $z = y f(x)$; $z > 0$ means well classified, $z < 0$ misclassified, with the decision boundary at $z = 0$.]
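
A short NumPy sketch of the update just derived (stochastic perceptron; the bias term is omitted and the function name is illustrative).

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=10):
    """Labels y in {-1, +1}; update only on misclassified examples (z = y * f(x) <= 0)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_k, y_k in zip(X, y):
            z = y_k * (w @ x_k)          # margin of this example
            if z <= 0:                   # misclassified (or exactly on the boundary)
                w += eta * y_k * x_k     # Delta w_i = eta * y * x_i
    return w

# Predict with np.sign(X @ w).
```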

Concept
Regularization: any method penalizing model complexity, at the expense of more training error.
- Does not have to be (but is often) explicitly part of the loss function.
Model complexity = how complex a model your ML algorithm will be able to match:
- number / magnitude of parameters
- how insane of a kernel you use
- etc.

Concept
Assuming $x \in \mathbb{R}^d$:
- $L_p$-norm of $x$: $\|x\|_p = (|x_1|^p + |x_2|^p + \dots + |x_d|^p)^{1/p}$
- $L_0$-norm of $x$: $\|x\|_0 =$ number of non-zero components (not really a norm)
- $L_1$-norm of $x$: $\|x\|_1 = |x_1| + |x_2| + \dots + |x_d|$ (sometimes written $|x|$)
- $L_2$-norm of $x$: $\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_d^2}$
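
These definitions map directly onto np.linalg.norm; a quick sanity check (note that ord=0 counts non-zero entries, matching the "not really a norm" caveat):

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])

print(np.linalg.norm(x, ord=0))   # 2.0 -> number of non-zero components
print(np.linalg.norm(x, ord=1))   # 7.0 -> |3| + |0| + |-4|
print(np.linalg.norm(x, ord=2))   # 5.0 -> sqrt(9 + 16)
print(np.linalg.norm(x, ord=3))   # general L_p norm with p = 3
```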

SRM Example (linear model)
Rank with $\|w\|^2 = \sum_i w_i^2$: $S_k = \{ w \mid \|w\|^2 < A_k^2 \}$, with $A_1 < A_2 < \dots < A_n$
Minimization under constraint: $\min R_{\mathrm{train}}[f]$ s.t. $\|w\|^2 < A_k^2$
Lagrangian: $R_{\mathrm{reg}}[f, \lambda] = R_{\mathrm{train}}[f] + \lambda \|w\|^2$ ("L2 regularization")
[Figure: risk $R$ versus capacity over the nested structure $S_1 \subset S_2 \subset \dots \subset S_N$.]
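
A minimal sketch of the Lagrangian form above for a linear least-squares model (i.e. ridge regression); the closed-form solve is standard, and the data here is synthetic.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/N) * ||X @ w - y||^2 + lam * ||w||^2 (closed form)."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)

for lam in [0.0, 0.1, 10.0]:
    w = ridge_fit(X, y, lam)
    print(lam, round(float(np.linalg.norm(w)), 3))   # larger lam -> smaller ||w||
```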

Multiple Structures
- Shrinkage (weight decay, ridge regression, SVM): $S_k = \{ w \mid \|w\|^2 < A_k \}$, $A_1 < A_2 < \dots < A_k$; equivalently $\lambda_1 > \lambda_2 > \dots > \lambda_k$ ($\lambda$ is the ridge)
- Feature selection: $S_k = \{ w \mid \|w\|_0 < \nu_k \}$, $\nu_1 < \nu_2 < \dots < \nu_k$ ($\nu$ is the number of features)
- Kernel parameters:
  - $k(s, t) = (s \cdot t + 1)^q$, $q_1 < q_2 < \dots < q_k$ ($q$ is the polynomial degree)
  - $k(s, t) = \exp(-\|s - t\|^2 / \sigma^2)$, $\sigma_1 > \sigma_2 > \dots > \sigma_k$ ($\sigma$ is the kernel width)
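
One way to read this slide: each family of structures is a hyperparameter you can sweep. A hedged scikit-learn sketch (synthetic data; the grids are illustrative, not recommended values):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Polynomial-kernel structure: increasing degree q.
for q in [1, 2, 3, 4]:
    score = cross_val_score(SVC(kernel="poly", degree=q, C=1.0), X, y, cv=5).mean()
    print("poly degree", q, round(score, 3))

# RBF-kernel structure: varying kernel width (scikit-learn parameterizes it as gamma).
for gamma in [0.01, 0.1, 1.0]:
    score = cross_val_score(SVC(kernel="rbf", gamma=gamma), X, y, cv=5).mean()
    print("rbf gamma", gamma, round(score, 3))
```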

Equivalent formulations
[Figure: two pictures of the max-margin linear classifier in the $(x_1, x_2)$ plane: one with the weight vector normalized ($\|w\| = 1$, direction $w/\|w\|$) and level sets $f(x) = 0, \pm 1$, the other with level sets $f(x) = 0, \pm M$.]
$M_{\mathrm{opt}} = \arg\max_w \min_k (y_k f(x_k))$
$M = 1/\|w\|$, so equivalently $M_{\mathrm{opt}} = \max (1/\|w\|)$ s.t. $\min_k (y_k f(x_k)) = 1$  ($\Leftrightarrow$ the two formulations are equivalent)
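
A one-step justification of the $M = 1/\|w\|$ identity, assuming the standard affine form $f(x) = w \cdot x + b$ (the slide leaves $f$ implicit):

```latex
% Distance from a correctly classified point x_k to the hyperplane {x : f(x) = 0}:
\[
  \mathrm{dist}(x_k) \;=\; \frac{|f(x_k)|}{\|w\|} \;=\; \frac{y_k\, f(x_k)}{\|w\|}.
\]
% Under the canonical scaling  \min_k y_k f(x_k) = 1,  the closest point gives the margin
\[
  M \;=\; \min_k \frac{y_k\, f(x_k)}{\|w\|} \;=\; \frac{1}{\|w\|},
\]
% so maximizing the margin M is the same as maximizing 1/\|w\|, i.e. minimizing \|w\|.
```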

Optimum margin
[Figure: hard-margin vs. soft-margin SVM in the $(x_1, x_2)$ plane, each showing the weight vector $w$ and the level sets $f(x) = 0, \pm 1$.]
Hard margin: $M = 1/\|w\|$, $M_{\mathrm{opt}} = \max (1/\|w\|)$ s.t. $\min_k (y_k f(x_k)) = 1$
Soft margin: $\min R_{\mathrm{reg}}[f] = R_{\mathrm{train}}[f] + \lambda \|w\|^2$, using the hinge loss
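
A compact NumPy sketch of the soft-margin objective (hinge loss plus $\lambda \|w\|^2$) trained with the stochastic subgradient updates from the earlier slides; the hyperparameters and function name are placeholders, not recommended settings.

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, eta=0.1, epochs=20, seed=0):
    """Minimize (1/N) * sum_k max(0, 1 - y_k * (w @ x_k)) + lam * ||w||^2 (bias omitted)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        for k in rng.permutation(n):
            margin = y[k] * (w @ X[k])
            # Subgradient: -y_k * x_k from the hinge term when the margin is violated, else 0.
            grad = 2 * lam * w - (y[k] * X[k] if margin < 1 else 0)
            w -= eta * grad
    return w

# Predict with np.sign(X @ w).
```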