
1 CS480/680: Intro to ML Lecture 01: Perceptron 9/11/18 Yao-Liang Yu

2 Outline
Announcements
Perceptron
Extensions
Next

3 Outline
Announcements
Perceptron
Extensions
Next

4 Announcements
Assignment 1 unleashed tonight
Projects: NIPS, ICML, COLT, AISTATS, ICLR, JMLR
Be ambitious and work hard

5 Outline
Announcements
Perceptron
Extensions
Next

6 Supervised learning

7 Spam filtering example
Training set (X = [x1, x2, …, xn], y = [y1, y2, …, yn])
xi in X = R^d: instance i with d-dimensional features
yi in Y = {-1, +1}: is instance i spam or ham?
Good feature representation is of utmost importance
[Table: word-count features "and", "viagra", "the", "of", "nigeria" and label y (+1 spam / -1 ham) for training instances x1–x6]
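To make the feature representation concrete, here is a minimal sketch (not from the slides) of the word-count features used in the example; the vocabulary and the email text are made up for illustration.

    # Each email becomes a vector of counts over a fixed vocabulary,
    # with label +1 (spam) or -1 (ham).
    vocab = ["and", "viagra", "the", "of", "nigeria"]   # illustrative vocabulary

    def featurize(email_text):
        words = email_text.lower().split()
        return [words.count(w) for w in vocab]

    x1 = featurize("get viagra and gold from nigeria")  # made-up email
    y1 = +1                                              # labeled spam
    print(x1, y1)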

8 Thought Experiment
Repeat the following game:
Observe instance xi
Predict its label ŷi (in whatever way you like)
Reveal the true label yi
Suffer a mistake if ŷi ≠ yi
How many mistakes will you make?
Predict first, reveal next: no peeking into the future!

9 Batch vs. Online
Batch learning
  Interested in performance on the test set X'
  Training set (X, y) is just a means
  Statistical assumption on X and X'
Online learning
  Data come one by one (streaming)
  Need to predict y before knowing its true value
  Interested in making as few mistakes as possible
  Compare against some baseline
Online-to-batch conversion

10 Linear threshold function
Find (w, b) such that for all i: yi = sign(<w, xi> + b)
w in R^d: weight vector of the separating hyperplane
b in R: offset (threshold, bias) of the separating hyperplane
sign: thresholding function
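A minimal sketch of the linear threshold rule in code, assuming NumPy arrays for w and x; the tie at 0 is mapped to +1 here as a convention, whereas the slides treat that case separately.

    import numpy as np

    def predict(w, b, x):
        # linear threshold function: sign(<w, x> + b)
        return 1 if np.dot(w, x) + b >= 0 else -1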

11 Perceptron [Rosenblatt’58]
[Diagram: inputs x1–x4 weighted by w1–w4, combined by linear superposition, then passed through the nonlinear activation function (threshold: >= 0?)]

12 History: 1969

13 The perceptron algorithm
Typically δ = 0, w0 = 0, b0 = 0
y(<x, w> + b) > 0 implies y = sign(<x, w> + b)
y(<x, w> + b) < 0 implies y ≠ sign(<x, w> + b)
y(<x, w> + b) = 0 implies …
Lazy update: if it ain’t broke, don’t fix it
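A minimal sketch of the mistake-driven (lazy) update in code; the pass-over-the-data loop and the stopping test are assumptions, since the slide only specifies the per-example update with threshold δ.

    import numpy as np

    def perceptron(X, y, delta=0.0, max_passes=100):
        # X: n x d array of instances, y: length-n array of labels in {-1, +1}
        n, d = X.shape
        w, b = np.zeros(d), 0.0
        for _ in range(max_passes):
            mistakes = 0
            for i in range(n):
                # lazy update: leave (w, b) alone unless the margin is too small
                if y[i] * (np.dot(w, X[i]) + b) <= delta:
                    w += y[i] * X[i]
                    b += y[i]
                    mistakes += 1
            if mistakes == 0:   # training data separated: nothing left to fix
                break
        return w, b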

14 Does it work?
[Table from slide 7: word-count features and labels for x1–x6]
w0 = [0, 0, 0, 0, 0], b0 = 0: pred -1 on x1, wrong
w1 = [1, 1, 0, 1, 1], b1 = 1: pred +1 on x2, wrong
w2 = [1, 1, -1, 0, 1], b2 = 0: pred -1 on x3, wrong
w3 = [1, 2, 0, 0, 1], b3 = 1: pred +1 on x4, wrong
w4 = [0, 2, 0, -1, 1], b4 = 0: pred +1 on x5, correct
w4 = [0, 2, 0, -1, 1], b4 = 0: pred -1 on x6, correct

15 Simplification: Linear Feasibility
Pad a constant 1 to the end of each x, and stack the unknowns into z = (w; b)
Pre-multiply each padded x by its label y; denote the result as a
Find z such that A^T z > 0, where A = [a1, a2, …, an]
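A minimal sketch of this reduction in code; the function names build_A and separates are mine, not from the slides.

    import numpy as np

    def build_A(X, y):
        # columns a_i = y_i * [x_i; 1];  X: n x d,  y: length-n in {-1, +1}
        X_padded = np.hstack([X, np.ones((X.shape[0], 1))])   # pad constant 1
        return (y[:, None] * X_padded).T                      # shape (d+1) x n

    def separates(z, A, delta=0.0):
        # z = [w; b] solves the feasibility problem if A^T z > delta componentwise
        return bool(np.all(A.T @ z > delta))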

16 Perceptron Convergence Theorem
Theorem (Block’62; Novikoff’62). Assume there exists some z such that A^T z > 0. Then the perceptron algorithm converges to some z*. If each column of A is selected indefinitely often, then A^T z* > δ.
Corollary. Let δ = 0 and z0 = 0. Then perceptron converges after at most (R/γ)^2 steps, where R and γ are defined below.
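In the standard statement of this result (a reconstruction; the slide’s own formula is not reproduced in the transcript), the two quantities are

    R = \max_i \|a_i\|_2, \qquad
    \gamma = \max_{\|z\|_2 = 1} \min_i \langle a_i, z \rangle

so R bounds the norm of the label-multiplied, padded instances, and γ is the margin achieved by the best separating direction.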

17 The Margin
∃z s.t. A^T z > 0 iff for some (hence all) s > 0, ∃z s.t. A^T z > s·1
From the proof, perceptron convergence depends on the margin γ: the larger the margin, the more (linearly) separable the data, and hence the faster perceptron learns.

18 What does the bound mean?
Corollary. Let δ = 0 and z0 = 0. Then perceptron converges after at most (R/γ)^2 steps.
Treating R and γ as constants, the number of mistakes is a constant: independent of n and d!
Otherwise (e.g., when the margin γ is exponentially small) it may need exponential time…
Can we improve the bound?

19 But
The larger the margin, the faster perceptron converges
But perceptron stops at an arbitrary linear separator…
Which one do you prefer? Support vector machines

20 What if non-separable?
Find a better feature representation
Use a deeper model
Soft margin: can replace the following in the perceptron proof

21 Perceptron Boundedness Theorem
Perceptron convergence requires the existence of a separating hyperplane
How to check this assumption in practice? What if it fails? (trust me, it will)
Theorem (Minsky and Papert’67; Block and Levin’70). The iterate z = (w; b) of the perceptron algorithm is always bounded. In particular, if there is no separating hyperplane, then perceptron cycles.
“…proof of this theorem is complicated and obscure. So are the other proofs we have since seen...” --- Minsky and Papert, 1987

22 When to stop perceptron?
Online learning: never
Batch learning:
  Maximum number of iterations reached, or out of time
  Training error stops changing
  Validation error stops decreasing
  Weights stop changing much, if using a diminishing step size ηt, i.e., w_{t+1} ← w_t + η_t y_i x_i (see the sketch below)
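A minimal sketch of the batch stopping criteria in code; the 1/t step-size schedule and the tolerance are assumptions, since the slide does not pin them down.

    import numpy as np

    def perceptron_batch(X, y, max_epochs=50, tol=1e-6):
        n, d = X.shape
        w, b, t = np.zeros(d), 0.0, 0
        for _ in range(max_epochs):                    # stop: max number of iterations reached
            w_before = w.copy()
            for i in range(n):
                t += 1
                eta = 1.0 / t                          # diminishing step size (assumed schedule)
                if y[i] * (np.dot(w, X[i]) + b) <= 0:
                    w += eta * y[i] * X[i]
                    b += eta * y[i]
            if np.linalg.norm(w - w_before) < tol:     # stop: weights stopped changing much
                break
        return w, b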

23 Outline
Announcements
Perceptron
Extensions
Next

24 Multiclass Perceptron
One vs. all
  Class c as positive, all other classes as negative
  Highest activation wins: pred = argmax_c w_c^T x (sketched below)
One vs. one
  For each pair, class c as positive and class c' as negative
  Voting decides the final prediction
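A minimal sketch of the one-vs-all scheme in code; the function names and the training loop are mine, and only the "class c positive, rest negative" reduction and the "highest activation wins" rule come from the slide.

    import numpy as np

    def one_vs_all_perceptron(X, y, num_classes, epochs=10):
        # y: length-n array of integer class labels in {0, ..., num_classes-1}
        n, d = X.shape
        W, B = np.zeros((num_classes, d)), np.zeros(num_classes)
        for c in range(num_classes):
            yc = np.where(y == c, 1, -1)          # class c positive, all others negative
            for _ in range(epochs):
                for i in range(n):
                    if yc[i] * (W[c] @ X[i] + B[c]) <= 0:
                        W[c] += yc[i] * X[i]
                        B[c] += yc[i]
        return W, B

    def predict_class(W, B, x):
        return int(np.argmax(W @ x + B))          # highest activation wins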

25 Winnow (Littlestone’88)
Multiplicative vs. additive updates
Theorem (Littlestone’88). Assume there exists some nonnegative z such that A^T z > 0. Then winnow converges to some z*. If each column of A is selected indefinitely often, then A^T z* > δ.
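For contrast with the perceptron’s additive step, here is a minimal sketch of a multiplicative (Winnow-style) update; the slide does not give the exact rule, so the exponentiated update, the learning rate eta, and the normalization are all assumptions.

    import numpy as np

    def winnow(A, delta=0.0, eta=0.5, max_steps=1000):
        # A: (d+1) x n matrix with columns a_i = y_i * [x_i; 1];
        # seek a nonnegative z with A^T z > delta
        m, n = A.shape
        z = np.ones(m) / m                        # uniform nonnegative start
        for _ in range(max_steps):
            margins = A.T @ z
            i = int(np.argmin(margins))
            if margins[i] > delta:                # every column satisfied
                break
            z = z * np.exp(eta * A[:, i])         # multiplicative update on the violated column
            z = z / z.sum()                       # renormalize; z stays nonnegative
        return z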

26 Outline
Announcements
Perceptron
Extensions
Next

27 Linear regression
Perceptron is a linear rule for classification: y lies in a finite set, e.g., {+1, -1}
What if y can be any real number? This is known as regression
Again, there is a linear rule
Regularization to avoid overfitting: the objective combines an empirical risk term and a regularization term, weighted by a hyperparameter
Cross-validation to select the hyperparameter
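A standard regularized least-squares objective matching the slide’s labels "empirical risk", "regularization", and "hyperparameter" (a reconstruction, with λ as the hyperparameter) is

    \min_{w, b} \;
    \underbrace{\frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \langle w, x_i\rangle - b\bigr)^2}_{\text{empirical risk}}
    \;+\;
    \underbrace{\lambda\,\|w\|_2^2}_{\text{regularization, with hyperparameter } \lambda \ge 0}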

28 Questions?

29 Another example: pricing
Selling a product Z to user x at price y = f(x, w)
If y is too high, the user won't buy; update w
If y is acceptable, the user buys [update w]
How do we measure our performance?

