
1 CS 189 Brian Chu brian.c@berkeley.edu Office Hours: Cory 246, 6-7p Mon. (hackerspace lounge) twitter: @brrrianchu brianchu.com

2 Agenda: Tips on #winning HW, Lecture clarifications, Worksheet. Email me for slides.

3 HW 1 Slow code? Vectorize everything.
– In Python, use numpy array slicing; in MATLAB, use array slicing.
– Use matrix operations as much as possible. This will be much more important for neural nets.

4 Examples
    import numpy as np

    A = np.array([1, 2, 3, 4])
    B = np.array([1, 2, 3, 5])

    # Number of positions where A and B differ (elementwise, no loop):
    np.count_nonzero(A != B)   # -> 1
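
To further illustrate the vectorization advice from the previous slide, here is a minimal sketch (the array shapes, the names X and w, and the loop baseline are ours, not from the slides) comparing an explicit Python loop with a single numpy matrix operation:

    import numpy as np

    X = np.random.randn(1000, 50)   # 1000 samples, 50 features (made-up shapes)
    w = np.random.randn(50)

    # Slow: explicit Python loop over samples
    scores_loop = np.array([X[i].dot(w) for i in range(X.shape[0])])

    # Fast: one vectorized matrix-vector product
    scores_vec = X @ w

    assert np.allclose(scores_loop, scores_vec)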

5 How to Win Kaggle Feature engineering!
Spam: add more word frequencies
– Other tricks: bag of words, tf-idf
MNIST: add your own visual features
– Histogram of oriented gradients
– Another trick that is amazing and will guarantee you win the digits competition every time, so I won't tell you what it is.
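
As a rough sketch of the bag-of-words / tf-idf tricks mentioned above, assuming scikit-learn is available (the toy documents are made up; in practice these features would be built from the competition data):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["win cash now", "meeting at noon", "win a free prize now"]   # toy corpus

    bow = CountVectorizer().fit_transform(docs)      # bag-of-words counts (sparse matrix)
    tfidf = TfidfVectorizer().fit_transform(docs)    # tf-idf weighted features

    print(bow.shape, tfidf.shape)   # both are (3 documents, vocabulary size)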

6 The gradient is a linear operator
Empirical risk: R[w] = (1/N) Σ_{k=1..N} L(f(x_k, w), y_k)
True (total) gradient: ∇_w R = [∂R/∂w_i]; update w_i ← w_i − η ∂R/∂w_i, i.e. w ← w − η ∇_w R
Stochastic gradient: ∇_w L = [∂L/∂w_i]; update w_i ← w_i − η ∂L/∂w_i, i.e. w ← w − η ∇_w L
Because the gradient is linear, ∇_w R[w] = (1/N) Σ_{k=1..N} ∇_w L(f(x_k, w), y_k):
the total gradient is the average over the gradients of single samples, but…
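
To make the batch vs. stochastic update concrete, here is a minimal numpy sketch assuming a linear model with squared loss; the data, the step size η = 0.1, and the variable names are ours, not from the slides:

    import numpy as np

    # Linear model f(x, w) = x . w with squared loss L = (f(x, w) - y)^2 / 2
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    w = np.zeros(3)
    eta = 0.1

    # Total (batch) gradient step: average the per-sample gradients over all N samples
    grad_R = X.T @ (X @ w - y) / len(y)
    w_batch = w - eta * grad_R

    # Stochastic gradient step: use the gradient of the loss on a single sample k
    k = 0
    grad_L = (X[k] @ w - y[k]) * X[k]
    w_sgd = w - eta * grad_L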

7 (Image-only slide; no transcript text available.)

8 Example: the Perceptron algorithm (Rosenblatt, 1957)
f(x) = Σ_i w_i x_i, and let z = y f(x) = Σ_i w_i y x_i, so ∂z/∂w_i = y x_i
Perceptron loss: L_perceptron = max(0, −z)
Update: Δw_i = −η ∂L/∂w_i = −η (∂L/∂z)(∂z/∂w_i) = η y x_i if z < 0 (misclassified example), 0 otherwise
Like Hebb's rule, but for misclassified examples only.
(Figure: perceptron loss max(0, −z) plotted against z = y f(x); z > 0 means well classified, z < 0 misclassified, with the decision boundary at z = 0.)
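
A minimal numpy sketch of this perceptron update (the toy data, learning rate, and number of passes are ours, not from the slides):

    import numpy as np

    # Toy linearly separable data with labels y in {-1, +1}
    X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])

    w = np.zeros(2)
    eta = 1.0

    for _ in range(10):                    # a few passes over the data
        for x_k, y_k in zip(X, y):
            z = y_k * (w @ x_k)            # margin z = y f(x) for this sample
            if z <= 0:                     # misclassified (<= 0 so the all-zero start still updates)
                w += eta * y_k * x_k       # perceptron update: w_i += eta * y * x_i

    print(w, np.sign(X @ w))               # learned weights and resulting predictions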

9 Concept Regularization
– any method penalizing model complexity, at the expense of more training error
– does not have to be (but often is) explicitly part of the loss function
Model complexity = how complex a model your ML algorithm will be able to fit
– number / magnitude of parameters
– how insane of a kernel you use
– etc.

10 Concept Assuming x is in R^d:
– L_p-norm of x = ||x||_p = (|x_1|^p + |x_2|^p + … + |x_d|^p)^(1/p)
– L_0-norm of x = ||x||_0 = number of non-zero components (not really a norm)
– L_1-norm of x = ||x||_1 = |x_1| + |x_2| + … + |x_d|
– L_2-norm of x = ||x||_2 = sqrt(x_1² + x_2² + … + x_d²)
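
These norms can be checked directly in numpy, for example (the vector x is ours, chosen for illustration):

    import numpy as np

    x = np.array([3.0, 0.0, -4.0])

    l0 = np.count_nonzero(x)                 # "L0 norm": number of non-zero entries -> 2
    l1 = np.linalg.norm(x, ord=1)            # L1 norm: |3| + |0| + |-4| = 7
    l2 = np.linalg.norm(x)                   # L2 norm (the default): sqrt(9 + 16) = 5
    l3 = np.sum(np.abs(x) ** 3) ** (1 / 3)   # general Lp norm with p = 3

    print(l0, l1, l2, l3)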

11 SRM Example (linear model)
Rank hypotheses by ||w||² = Σ_i w_i²: S_k = {w | ||w||² < λ_k²}, with λ_1 < λ_2 < … < λ_n, giving nested sets S_1 ⊂ S_2 ⊂ … ⊂ S_N.
Minimization under constraint: min R_train[f] s.t. ||w||² < λ_k²
Lagrangian: R_reg[f, γ] = R_train[f] + γ ||w||²
This is "L2 regularization".
(Figure: risk R versus capacity across the nested structures S_1 ⊂ S_2 ⊂ … ⊂ S_N.)
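
A minimal sketch of this L2-regularized minimization for a linear least-squares fit, using the closed-form ridge solution (the data and the value of γ are ours, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))
    y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + 0.1 * rng.normal(size=50)

    gamma = 1.0   # regularization strength (the multiplier in the Lagrangian above)

    # Minimize ||Xw - y||^2 + gamma * ||w||^2, whose solution is (X^T X + gamma I)^{-1} X^T y
    w_ridge = np.linalg.solve(X.T @ X + gamma * np.eye(X.shape[1]), X.T @ y)

    print(w_ridge)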

12 Multiple Structures
Shrinkage (weight decay, ridge regression, SVM): S_k = {w | ||w||² < λ_k}, λ_1 < λ_2 < … < λ_k; equivalently γ_1 > γ_2 > γ_3 > … > γ_k (γ is the ridge).
Feature selection: S_k = {w | ||w||_0 < σ_k}, σ_1 < σ_2 < … < σ_k (σ is the number of features).
Kernel parameters:
– k(s, t) = (s · t + 1)^q, with q_1 < q_2 < … < q_k (q is the polynomial degree)
– k(s, t) = exp(−||s − t||² / σ²), with σ_1 > σ_2 > σ_3 > … > σ_k (σ is the kernel width)
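
A minimal numpy sketch of these two kernels (the inputs s and t, the degree q, and the width σ are ours, chosen for illustration):

    import numpy as np

    def poly_kernel(s, t, q=3):
        # k(s, t) = (s . t + 1)^q, polynomial kernel of degree q
        return (s @ t + 1.0) ** q

    def rbf_kernel(s, t, sigma=1.0):
        # k(s, t) = exp(-||s - t||^2 / sigma^2), Gaussian kernel of width sigma
        return np.exp(-np.sum((s - t) ** 2) / sigma ** 2)

    s = np.array([1.0, 2.0])
    t = np.array([0.5, -1.0])
    print(poly_kernel(s, t), rbf_kernel(s, t))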

13 Equivalent formulations
(Figures: two parameterizations of the same separating hyperplane f(x) = 0 and its margin in the (x_1, x_2) plane.)
Normalize the weights (||w|| = 1) and maximize the smallest margin: M_opt = argmax_w min_k y_k f(x_k)
⇔ Fix the smallest margin to 1 and maximize M = 1/||w||: M_opt = max 1/||w|| s.t. min_k y_k f(x_k) = 1
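
A small numerical check of this equivalence (the data and the direction w are ours, chosen for illustration): rescaling w so that min_k y_k f(x_k) = 1 makes the geometric margin equal 1/||w||.

    import numpy as np

    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    w = np.array([1.0, 1.0])                 # some separating direction, f(x) = w . x

    margins = y * (X @ w)                    # functional margins y_k f(x_k)
    w_scaled = w / margins.min()             # rescale so that min_k y_k f(x_k) = 1

    geometric_margin = (y * (X @ w_scaled)).min() / np.linalg.norm(w_scaled)
    print(geometric_margin, 1 / np.linalg.norm(w_scaled))   # the two values coincide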

14 Optimum margin
(Figures: hard-margin and soft-margin separating hyperplanes f(x) = 0 with margin level sets f(x) = ±1 in the (x_1, x_2) plane.)
Hard margin: M = 1/||w||, M_opt = max 1/||w|| s.t. min_k y_k f(x_k) = 1
Soft margin: min R_reg[f] = R_train[f] + γ ||w||²; use the hinge loss.
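
A minimal sketch of evaluating the soft-margin objective (mean hinge loss plus the L2 penalty) in numpy; the data, weights, and γ are ours, chosen for illustration:

    import numpy as np

    X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    w = np.array([0.2, 0.2])
    gamma = 0.1

    z = y * (X @ w)                          # margins y_k f(x_k)
    hinge = np.maximum(0.0, 1.0 - z)         # hinge loss max(0, 1 - y f(x)) per sample
    R_reg = hinge.mean() + gamma * (w @ w)   # R_train (mean hinge loss) + gamma * ||w||^2

    print(R_reg)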

