CS 189 Brian Chu Office Hours: Cory 246, 6-7p Mon. (hackerspace lounge) brianchu.com
Agenda Tips on #winning HW Lecture clarifications Worksheet me for slides.
HW 1 Slow code? Vectorize everything. – In Python, use numpy slicing. MATLAB, use array slicing – Use matrix operations as much as possible. This will be much more important for neural nets.
Examples A = np.array([1, 2, 3, 4]) B = np.array([1, 2, 3, 5]) Find # of differences between A and B: np.count_nonzero(A != B) (note: A == B would count the matching entries instead)
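To make the "vectorize everything" advice concrete, here is a minimal sketch that counts the differing entries of the two arrays from the slide twice, once with a Python loop and once with a single NumPy expression. The variable names diff_loop and diff_vec are illustrative and not from the slides.

```python
import numpy as np

A = np.array([1, 2, 3, 4])
B = np.array([1, 2, 3, 5])

# Loop version: one Python-level comparison per element (slow for large arrays).
diff_loop = 0
for a, b in zip(A, B):
    if a != b:
        diff_loop += 1

# Vectorized version: one elementwise comparison, then one reduction.
# A != B marks the positions that differ; count_nonzero counts the True entries.
diff_vec = np.count_nonzero(A != B)

print(diff_loop, diff_vec)
```

Both versions give the same count; the vectorized one pushes the loop into compiled code, which is where the speedup on real homework-sized data comes from.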
How to Win Kaggle Feature engineering! Spam: add more word frequencies – Other tricks: bag of words, tf-idf MNIST: add your own visual features – Histogram of oriented gradients – Another amazing trick that will guarantee you win the digits competition every time, so I won't tell you what it is.
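The gradient is a linear operator
$R[w] = \frac{1}{N} \sum_{k=1}^{N} L(f(x_k, w), y_k)$
True (total) gradient: $w_i \leftarrow w_i - \eta \, \partial R / \partial w_i$, i.e. $w \leftarrow w - \eta \nabla_w R$, with $\nabla_w R = [\partial R / \partial w_i]$
Stochastic gradient: $w_i \leftarrow w_i - \eta \, \partial L / \partial w_i$, i.e. $w \leftarrow w - \eta \nabla_w L$, with $\nabla_w L = [\partial L / \partial w_i]$
$\nabla_w R[w] = \frac{1}{N} \sum_{k=1}^{N} \nabla_w L(f(x_k, w), y_k)$
($\eta$ is the learning rate.)
The total gradient is the average over the gradients of single samples, but…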
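The averaging identity can be checked numerically. The sketch below assumes a linear model f(x, w) = w·x with a squared loss L = 0.5 (w·x - y)^2; that loss, the random data, and the step size eta are assumptions made for illustration and are not fixed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
X = rng.normal(size=(N, d))   # one sample per row
y = rng.normal(size=N)
w = rng.normal(size=d)

def sample_grad(x_k, y_k, w):
    # Gradient of L = 0.5 * (w.x_k - y_k)^2 with respect to w (assumed loss).
    return (x_k @ w - y_k) * x_k

# Average of the N single-sample gradients.
per_sample = np.array([sample_grad(X[k], y[k], w) for k in range(N)])
avg_of_grads = per_sample.mean(axis=0)

# Gradient of R[w] = (1/N) * sum_k L_k computed directly in matrix form.
total_grad = X.T @ (X @ w - y) / N

eta = 0.1                           # assumed learning rate
w_total = w - eta * total_grad      # one total-gradient step
w_sgd = w - eta * per_sample[0]     # one stochastic step using sample 0 only

print(np.allclose(avg_of_grads, total_grad))
```

The two update rules differ only in which gradient they plug in: the total step uses the average, the stochastic step uses a single term of it.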
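Example: the Perceptron algorithm
$f(x) = \sum_i w_i x_i$
$z = y f(x) = \sum_i w_i y x_i$, so $\partial z / \partial w_i = y x_i$
$L_{\text{perceptron}} = \max(0, -z)$
$\Delta w_i = -\eta \, \partial L / \partial w_i = -\eta \, (\partial L / \partial z)(\partial z / \partial w_i)$
$\Delta w_i = \eta \, y x_i$ if $z < 0$ (misclassified example), $0$ otherwise
Like Hebb's rule but for misclassified examples only. Rosenblatt, 1957
[Figure: perceptron loss $\max(0, -z)$ plotted against $z = y f(x)$; misclassified examples have $z < 0$, well-classified examples have $z > 0$, and the decision boundary is at $z = 0$.]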
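A minimal sketch of the update rule above on a toy dataset. The function name perceptron_train, the data, and the defaults for eta and epochs are all made up for illustration. One deliberate deviation from the slide: the code treats z = 0 as a mistake too (z <= 0), otherwise a zero-initialized w would never receive its first update.

```python
import numpy as np

def perceptron_train(X, y, eta=1.0, epochs=20):
    """Perceptron updates: w += eta * y_k * x_k on misclassified samples."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_k, y_k in zip(X, y):
            z = y_k * (w @ x_k)      # z = y f(x)
            if z <= 0:               # assumption: z == 0 also counts as a mistake
                w += eta * y_k * x_k
    return w

# Toy linearly separable data; the last column of ones acts as a bias feature.
X = np.array([[ 2.0,  1.0, 1.0],
              [ 1.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = perceptron_train(X, y)
print(w, y * (X @ w))
```

On separable data like this the loop stops changing w once every z is positive, which is the sense in which only misclassified examples drive learning.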
Concept
Regularization – any method that penalizes model complexity, at the expense of more training error
– Does not have to be (but often is) an explicit part of the loss function
Model complexity = how complex a model your ML algorithm is able to fit
– number / magnitude of parameters
– how insane a kernel you use
– Etc.
Concept
Assuming $x \in \mathbb{R}^d$:
$L_p$-norm of x = $\|x\|_p = (|x_1|^p + |x_2|^p + \dots + |x_d|^p)^{1/p}$
$L_0$-norm of x = $\|x\|_0$ = # of non-zero components (not really a norm)
$L_1$-norm of x = $\|x\|_1 = |x_1| + |x_2| + \dots + |x_d|$ = |x|
$L_2$-norm of x = $\|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_d^2}$
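A short sketch of these definitions in NumPy. The example vector and the helper name lp_norm are illustrative; np.linalg.norm is only used as a cross-check.

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])

l0 = np.count_nonzero(x)        # number of non-zero components
l1 = np.sum(np.abs(x))          # sum of absolute values
l2 = np.sqrt(np.sum(x ** 2))    # Euclidean length

def lp_norm(x, p):
    # General L_p norm for p >= 1, straight from the definition.
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

print(l0, l1, l2)
print(lp_norm(x, 3), np.linalg.norm(x, 3))
```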
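SRM Example (linear model)
Rank with $\|w\|^2 = \sum_i w_i^2$
$S_k = \{ w \mid \|w\|^2 < \omega_k^2 \}$, $\omega_1 < \omega_2 < \dots < \omega_n$
Minimization under constraint: $\min R_{\text{train}}[f]$ s.t. $\|w\|^2 < \omega_k^2$
Lagrangian: $R_{\text{reg}}[f, \gamma] = R_{\text{train}}[f] + \gamma \|w\|^2$ ("L2-regularization")
[Figure: risk R versus capacity for the nested sets $S_1, S_2, \dots, S_N$.]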
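To see how the penalty weight on ǁwǁ² controls which nested set the solution lands in, here is a sketch using ridge regression. It assumes R_train is the mean squared error of a linear model, which the slide does not specify; ridge_fit, gamma, and the synthetic data are illustrative names and values.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Minimize (1/N) * ||X w - y||^2 + gamma * ||w||^2 in closed form."""
    N, d = X.shape
    # Setting the gradient to zero gives (X^T X + N * gamma * I) w = X^T y.
    return np.linalg.solve(X.T @ X + N * gamma * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 3.0])   # made-up ground truth
y = X @ true_w + 0.1 * rng.normal(size=40)

gammas = [0.0, 0.1, 1.0, 10.0]
norms = [np.linalg.norm(ridge_fit(X, y, g)) for g in gammas]

for g, n in zip(gammas, norms):
    print(g, n)
```

A larger gamma yields a solution with smaller ǁwǁ, so sweeping gamma upward walks through progressively smaller sets in the structure.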
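Multiple Structures
Shrinkage (weight decay, ridge regression, SVM): $S_k = \{ w \mid \|w\|^2 < \omega_k \}$, $\omega_1 < \omega_2 < \dots < \omega_k$; $\gamma_1 > \gamma_2 > \gamma_3 > \dots > \gamma_k$ ($\gamma$ is the ridge)
Feature selection: $S_k = \{ w \mid \|w\|_0 < \sigma_k \}$, $\sigma_1 < \sigma_2 < \dots < \sigma_k$ ($\sigma$ is the number of features)
Kernel parameters:
– $k(s, t) = (s \cdot t + 1)^q$: $q_1 < q_2 < \dots < q_k$ ($q$ is the polynomial degree)
– $k(s, t) = \exp(-\|s - t\|^2 / \sigma^2)$: $\sigma_1 > \sigma_2 > \sigma_3 > \dots > \sigma_k$ ($\sigma$ is the kernel width)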
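Equivalent formulations
Formulation 1: fix $\|w\| = 1$ (unit normal $w / \|w\|$); the margin lines are $f(x) = \pm M$ around the boundary $f(x) = 0$:
$M_{\text{opt}} = \max_w \min_k \, y_k f(x_k)$
⇔
Formulation 2: rescale $w$ so that the margin lines are $f(x) = \pm 1$; then $M = 1 / \|w\|$ and
$M_{\text{opt}} = \max (1 / \|w\|)$ s.t. $\min_k \, y_k f(x_k) = 1$
[Figure: two panels in the $(x_1, x_2)$ plane showing the decision boundary $f(x) = 0$, the margin lines, and the normal vector $w$ for each formulation.]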
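Optimum margin
Hard margin: $M = 1 / \|w\|$, $M_{\text{opt}} = \max (1 / \|w\|)$ s.t. $\min_k \, y_k f(x_k) = 1$
Soft margin: $\min R_{\text{reg}}[f] = R_{\text{train}}[f] + \gamma \|w\|^2$
Use the hinge loss
[Figure: two panels in the $(x_1, x_2)$ plane, hard margin and soft margin, each showing $f(x) = 0$, $f(x) = 1$, $f(x) = -1$ and the normal vector $w$.]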
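A minimal sketch of the soft-margin objective with the hinge loss, max(0, 1 - z) with z = y f(x), as the training risk. The slide does not give a training procedure; plain subgradient descent is used here as one simple option, and gamma, eta, the step count, and the toy data are illustrative assumptions.

```python
import numpy as np

def hinge_loss(z):
    # Hinge loss on the margin z = y * f(x): zero once z >= 1.
    return np.maximum(0.0, 1.0 - z)

def soft_margin_objective(w, X, y, gamma):
    # R_reg = mean hinge loss + gamma * ||w||^2
    return hinge_loss(y * (X @ w)).mean() + gamma * (w @ w)

def subgradient_step(w, X, y, gamma, eta):
    z = y * (X @ w)
    active = z < 1                                   # samples inside the margin
    grad = -(y[active, None] * X[active]).sum(axis=0) / len(y)
    grad = grad + 2.0 * gamma * w                    # gradient of the penalty
    return w - eta * grad

# Toy separable data; the last column of ones acts as a bias feature.
X = np.array([[ 2.0,  1.0, 1.0],
              [ 1.0,  3.0, 1.0],
              [-1.0, -2.0, 1.0],
              [-2.0, -1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

gamma, eta = 0.01, 0.1
w = np.zeros(3)
obj_start = soft_margin_objective(w, X, y, gamma)
for _ in range(100):
    w = subgradient_step(w, X, y, gamma, eta)
obj_end = soft_margin_objective(w, X, y, gamma)

print(obj_start, obj_end)
```

Unlike the perceptron loss, the hinge loss keeps penalizing correctly classified points until they clear the margin (z >= 1), which is what ties it to the margin formulation above.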