Softmax Classifier.

Today's Class
- Softmax Classifier
- Inference / Making Predictions / Test Time
- Training a Softmax Classifier
- Stochastic Gradient Descent (SGD)

Supervised Learning - Classification
[Figure: example training images and test images, each labeled cat, dog, cat, ..., bear.]

Supervised Learning - Classification
Training data: each input $x_i = [\;\cdot\;]$ is paired with a label $y_i = [\;\cdot\;]$ encoding its class (cat, dog, cat, ..., bear).

Supervised Learning - Classification
Training data: inputs $x_1 = [x_{11}\; x_{12}\; x_{13}\; x_{14}], \ldots, x_n = [x_{n1}\; x_{n2}\; x_{n3}\; x_{n4}]$, targets / labels / ground truth $y_1, y_2, y_3, \ldots, y_n \in \{1, 2, 3\}$, and predictions $\hat{y}_i = f(x_i; \theta)$.
We need to find a function $f$ that maps any $x_i$ to its $y_i$. How do we "learn" the parameters $\theta$ of this function? We choose the ones that make the following quantity small: $\sum_{i=1}^{n} Cost(\hat{y}_i, y_i)$.

Supervised Learning – Linear Softmax
Training data: inputs $x_1 = [x_{11}\; x_{12}\; x_{13}\; x_{14}], \ldots, x_n = [x_{n1}\; x_{n2}\; x_{n3}\; x_{n4}]$ with targets / labels / ground truth $y_1, y_2, y_3, \ldots, y_n \in \{1, 2, 3\}$.

Supervised Learning – Linear Softmax
Training data: inputs $x_1, \ldots, x_n$ as before; targets / labels / ground truth now written as one-hot vectors ($[1\ 0\ 0]$, $[0\ 1\ 0]$, $[0\ 0\ 1]$); predictions as probability vectors (e.g. $[0.85\ 0.10\ 0.05]$, $[0.40\ 0.45\ 0.15]$, $[0.20\ 0.70\ 0.10]$, $[0.40\ 0.25\ 0.35]$).

Supervised Learning – Linear Softmax
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, $y_i = [1\ 0\ 0]$, $\hat{y}_i = [f_c\; f_d\; f_b]$
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
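To make the inference step concrete, here is a minimal NumPy sketch of this forward pass for one example with 4 features and 3 classes (cat, dog, bear); the names softmax_forward, W, b, and x are assumptions of the sketch, not from the slides.

```python
import numpy as np

def softmax_forward(x, W, b):
    """Linear softmax forward pass for one example.
    x: (4,) feature vector, W: (3, 4) weights, b: (3,) biases."""
    g = W @ x + b                      # raw scores g_c, g_d, g_b
    g = g - g.max()                    # subtract the max for numerical stability
    exp_g = np.exp(g)
    return exp_g / exp_g.sum()         # softmax probabilities [f_c, f_d, f_b]

# Usage: random weights just to show the shapes involved.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = np.zeros(3)
x = np.array([0.2, -1.0, 0.5, 1.3])
print(softmax_forward(x, W, b))        # three probabilities summing to 1
```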

How do we find a good w and b?
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, $y_i = [1\ 0\ 0]$, $\hat{y}_i = [f_c(w,b)\; f_d(w,b)\; f_b(w,b)]$
We need to find $w$ and $b$ that minimize the following:
$L(w,b) = \sum_{i=1}^{n} \sum_{j=1}^{3} -y_{i,j} \log(\hat{y}_{i,j}) = \sum_{i=1}^{n} -\log(\hat{y}_{i,\mathrm{label}}) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$
Why?
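The double sum collapses to a single term per example because the one-hot target $y_i$ zeroes out every class except the label. A minimal sketch of this loss, assuming integer labels in {0, 1, 2}; the function name and argument shapes are assumptions of the sketch.

```python
import numpy as np

def cross_entropy_loss(X, labels, W, b):
    """L(w, b) = sum_i -log f_{i,label}.  X: (n, 4), labels: (n,) ints in {0, 1, 2}."""
    loss = 0.0
    for x, label in zip(X, labels):
        g = W @ x + b                        # raw scores
        g = g - g.max()                      # numerical stability
        probs = np.exp(g) / np.exp(g).sum()  # softmax, as on the previous slide
        loss += -np.log(probs[label])        # only the true-class term survives the one-hot sum
    return loss
```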

Gradient Descent (GD)
$\lambda = 0.01$
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: $dL(w,b)/dw$ and $dL(w,b)/db$
    Update w: $w = w - \lambda\, dL(w,b)/dw$
    Update b: $b = b - \lambda\, dL(w,b)/db$
    Print: $L(w,b)$   // Useful to see if this is becoming smaller or not.
end
where $L(w,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$
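A runnable NumPy sketch of this loop, assuming the analytic gradients derived later in the lecture ($\partial\ell/\partial w_{i,j} = (\hat{y}_i - y_i)\, x_j$, summed over examples); the function name and shapes are assumptions of the sketch.

```python
import numpy as np

def gradient_descent(X, Y, num_epochs=100, lam=0.01, seed=0):
    """Full-batch GD for the linear softmax. X: (n, 4) inputs, Y: (n, 3) one-hot labels."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(3, 4))   # initialize w and b randomly
    b = rng.normal(scale=0.01, size=3)
    for e in range(num_epochs):
        G = X @ W.T + b                        # scores, shape (n, 3)
        G -= G.max(axis=1, keepdims=True)      # numerical stability
        P = np.exp(G) / np.exp(G).sum(axis=1, keepdims=True)  # predictions y_hat
        loss = -np.log((P * Y).sum(axis=1)).sum()
        dG = P - Y                             # dl/da_i = y_hat_i - y_i (derived later)
        dW = dG.T @ X                          # dl/dw_ij = (y_hat_i - y_i) x_j, summed over examples
        db = dG.sum(axis=0)
        W -= lam * dW                          # update w
        b -= lam * db                          # update b
        print(e, loss)                         # useful to see if L is becoming smaller or not
    return W, b
```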

Gradient Descent (GD) (idea)
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the current point (e.g. dL/dw = 6 at w = 12).
3. Recompute w as: w = w - lambda * (dL / dw).
Repeat steps 2 and 3: each update moves w downhill along L (e.g. from w = 12 to w = 10 to w = 8).

Our function L(w)
$L(w) = 3 + (1 - w)^2$

Our function L(w)
$L(w) = 3 + (1 - w)^2$ is a toy stand-in; the loss we actually care about is $L(W,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(W,b)$.

Our function L(w)
$L(w) = 3 + (1 - w)^2$, while each term of the real loss has the form $-\log \mathrm{softmax}\big(g(w_1, w_2, \ldots, w_{12};\, x_n)\big)_{\mathrm{label}_n}$.
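To see the update in action, here is a tiny sketch that runs gradient descent on the toy function $L(w) = 3 + (1-w)^2$, whose derivative is $dL/dw = -2(1-w)$; the iterates approach the minimizer $w = 1$. The starting point and step size are placeholder values.

```python
def toy_gradient_descent(w=12.0, lam=0.1, steps=50):
    """Gradient descent on L(w) = 3 + (1 - w)**2."""
    for _ in range(steps):
        dL_dw = -2.0 * (1.0 - w)   # derivative of 3 + (1 - w)^2
        w = w - lam * dL_dw        # w = w - lambda * (dL / dw)
    return w

print(toy_gradient_descent())      # approaches 1.0, where L(w) is smallest
```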

Gradient Descent (GD): expensive
$\lambda = 0.01$
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: $dL(w,b)/dw$ and $dL(w,b)/db$
    Update w: $w = w - \lambda\, dL(w,b)/dw$
    Update b: $b = b - \lambda\, dL(w,b)/db$
    Print: $L(w,b)$   // Useful to see if this is becoming smaller or not.
end
Each step is expensive because $L(w,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$ and its gradient sum over all $n$ training examples.

(mini-batch) Stochastic Gradient Descent (SGD)
$\lambda = 0.01$
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: $dl(w,b)/dw$ and $dl(w,b)/db$
        Update w: $w = w - \lambda\, dl(w,b)/dw$
        Update b: $b = b - \lambda\, dl(w,b)/db$
        Print: $l(w,b)$   // Useful to see if this is becoming smaller or not.
    end
end
where $l(w,b) = \sum_{i \in B} -\log f_{i,\mathrm{label}}(w,b)$ is computed on a mini-batch $B$ rather than on the full training set.
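A minimal NumPy sketch of the mini-batch loop, reusing the gradient expressions derived later in the lecture; the batch size and the shuffling scheme are assumptions of the sketch.

```python
import numpy as np

def sgd(X, Y, num_epochs=10, batch_size=32, lam=0.01, seed=0):
    """Mini-batch SGD for the linear softmax. X: (n, 4) inputs, Y: (n, 3) one-hot labels."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W = rng.normal(scale=0.01, size=(3, 4))
    b = np.zeros(3)
    for e in range(num_epochs):
        order = rng.permutation(n)                 # visit examples in random order
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of mini-batch B
            Xb, Yb = X[idx], Y[idx]
            G = Xb @ W.T + b
            G -= G.max(axis=1, keepdims=True)
            P = np.exp(G) / np.exp(G).sum(axis=1, keepdims=True)
            dG = P - Yb                            # y_hat - y, on the mini-batch only
            W -= lam * (dG.T @ Xb)                 # update w from the batch gradient
            b -= lam * dG.sum(axis=0)              # update b
    return W, b
```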

[Figure omitted. Source: Andrew Ng]

(mini-batch) Stochastic Gradient Descent (SGD)
Same loop as above, but with batch size $|B| = 1$: each inner iteration computes $l(w,b) = -\log f_{i,\mathrm{label}}(w,b)$ on a single training example $i$ and updates $w$ and $b$ from that example alone.

Computing Analytic Gradients
This is what we have: the per-example loss $\ell(w,b) = -\log f_{\mathrm{label}}$, where $f$ is the softmax of the scores $a_i$.

Computing Analytic Gradients
This is what we have. Reminder: $a_i = (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i$

Computing Analytic Gradients
This is what we have: $\ell(w,b) = -\log f_{\mathrm{label}}$.
This is what we need: $\partial \ell / \partial w_{i,j}$ for each $i, j$, and $\partial \ell / \partial b_i$ for each $i$.

Computing Analytic Gradients
Step 1: Chain Rule of Calculus
$\dfrac{\partial \ell}{\partial w_{i,j}} = \dfrac{\partial \ell}{\partial a_i} \dfrac{\partial a_i}{\partial w_{i,j}}$ and $\dfrac{\partial \ell}{\partial b_i} = \dfrac{\partial \ell}{\partial a_i} \dfrac{\partial a_i}{\partial b_i}$
Let's do the factors $\partial a_i / \partial w_{i,j}$ and $\partial a_i / \partial b_i$ first.

Computing Analytic Gradients
$a_i = (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i$
$\dfrac{\partial a_i}{\partial w_{i,3}} = \dfrac{\partial}{\partial w_{i,3}} \big[ (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i \big] = x_3$
In general: $\dfrac{\partial a_i}{\partial w_{i,j}} = x_j$

Computing Analytic Gradients
So far: $\dfrac{\partial a_i}{\partial w_{i,j}} = x_j$, with $a_i = (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i$
$\dfrac{\partial a_i}{\partial b_i} = \dfrac{\partial}{\partial b_i} \big[ (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i \big] = 1$

Computing Analytic Gradients
$\dfrac{\partial a_i}{\partial w_{i,j}} = x_j$, $\quad \dfrac{\partial a_i}{\partial b_i} = 1$

Computing Analytic Gradients
Step 1: Chain Rule of Calculus
$\dfrac{\partial \ell}{\partial w_{i,j}} = \dfrac{\partial \ell}{\partial a_i} \dfrac{\partial a_i}{\partial w_{i,j}}$ and $\dfrac{\partial \ell}{\partial b_i} = \dfrac{\partial \ell}{\partial a_i} \dfrac{\partial a_i}{\partial b_i}$
Now let's do the remaining factor, $\partial \ell / \partial a_i$ (it is the same for both!)

Computing Analytic Gradients
In our cat, dog, bear classification example: $i \in \{0, 1, 2\}$.
Let's say: label = 1. We need $\dfrac{\partial \ell}{\partial a_0}$, $\dfrac{\partial \ell}{\partial a_1}$, $\dfrac{\partial \ell}{\partial a_2}$.

Computing Analytic Gradients
For the entries other than the label ($i = 0$ and $i = 2$): $\dfrac{\partial \ell}{\partial a_i} = \hat{y}_i$

Remember this slide?
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, $y_i = [1\ 0\ 0]$, $\hat{y}_i = [f_c\; f_d\; f_b]$
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$, $f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$, $f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$

Computing Analytic Gradients
Back to the gradient. For $i \ne$ label ($i = 0$ and $i = 2$): $\dfrac{\partial \ell}{\partial a_i} = \hat{y}_i$

Computing Analytic Gradients
For the label entry ($i = 1$): $\dfrac{\partial \ell}{\partial a_1} = \hat{y}_1 - 1$

Computing Analytic Gradients
With label = 1:
$\dfrac{\partial \ell}{\partial a_0} = \hat{y}_0$, $\quad \dfrac{\partial \ell}{\partial a_1} = \hat{y}_1 - 1$, $\quad \dfrac{\partial \ell}{\partial a_2} = \hat{y}_2$
$\dfrac{\partial \ell}{\partial a} = \begin{bmatrix} \hat{y}_0 \\ \hat{y}_1 - 1 \\ \hat{y}_2 \end{bmatrix} = \begin{bmatrix} \hat{y}_0 \\ \hat{y}_1 \\ \hat{y}_2 \end{bmatrix} - \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = \hat{y} - y$
In general: $\dfrac{\partial \ell}{\partial a_i} = \hat{y}_i - y_i$

Computing Analytic Gradients
$\dfrac{\partial a_i}{\partial w_{i,j}} = x_j$, $\quad \dfrac{\partial a_i}{\partial b_i} = 1$, $\quad \dfrac{\partial \ell}{\partial a_i} = \hat{y}_i - y_i$
Putting it together:
$\dfrac{\partial \ell}{\partial w_{i,j}} = (\hat{y}_i - y_i)\, x_j$, $\quad \dfrac{\partial \ell}{\partial b_i} = \hat{y}_i - y_i$
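These formulas are everything the update step needs. Below is a small sketch that computes them for one example and checks the analytic $\partial \ell / \partial w_{0,0}$ against a finite-difference estimate; the helper names and test values are assumptions of the sketch.

```python
import numpy as np

def loss_and_grads(x, y_onehot, W, b):
    """Per-example loss l = -log f_label and its analytic gradients."""
    a = W @ x + b
    a -= a.max()                                 # numerical stability
    y_hat = np.exp(a) / np.exp(a).sum()          # softmax probabilities
    loss = -np.log(y_hat[y_onehot.argmax()])
    dA = y_hat - y_onehot                        # dl/da_i = y_hat_i - y_i
    dW = np.outer(dA, x)                         # dl/dw_ij = (y_hat_i - y_i) x_j
    db = dA                                      # dl/db_i = y_hat_i - y_i
    return loss, dW, db

# Numerical check of dl/dw_{0,0} with a small perturbation.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4)); b = rng.normal(size=3)
x = rng.normal(size=4); y = np.array([0.0, 1.0, 0.0])
loss, dW, db = loss_and_grads(x, y, W, b)
eps = 1e-6
W2 = W.copy(); W2[0, 0] += eps
numeric = (loss_and_grads(x, y, W2, b)[0] - loss) / eps
print(dW[0, 0], numeric)                         # the two values should match closely
```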

Supervised Learning – Softmax Classifier
Extract features: $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$
Run features through classifier:
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$, $f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$, $f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
Get predictions: $\hat{y}_i = [f_c\; f_d\; f_b]$

Overfitting
[Figure: three fits of $f$ to the same data. Credit: C. Bishop, Pattern Recognition and Machine Learning.]
$f$ is linear: $Loss(w)$ is high (underfitting, high bias).
$f$ is cubic: $Loss(w)$ is low.
$f$ is a polynomial of degree 9: $Loss(w)$ is zero! (overfitting, high variance).

More ...
- Regularization
- Momentum updates
- Hinge Loss, Least Squares Loss, Logistic Regression Loss

Assignment 2 – Linear Margin-Classifier
Training data: inputs $x_1 = [x_{11}\; x_{12}\; x_{13}\; x_{14}], \ldots, x_n = [x_{n1}\; x_{n2}\; x_{n3}\; x_{n4}]$, targets / labels / ground truth as one-hot vectors ($[1\ 0\ 0]$, $[0\ 1\ 0]$, $[0\ 0\ 1]$), and predictions as raw scores rather than probabilities (e.g. $[4.3\ {-1.3}\ 1.1]$, $[3.3\ 3.5\ 1.1]$, $[0.5\ 5.6\ {-4.2}]$, $[1.1\ {-5.3}\ {-9.4}]$).

Supervised Learning – Linear Margin-Classifier
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, $y_i = [1\ 0\ 0]$, $\hat{y}_i = [f_c\; f_d\; f_b]$
$f_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$f_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$f_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
(The outputs are raw scores; no softmax is applied.)

How do we find a good w and b?
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, $y_i = [1\ 0\ 0]$, $\hat{y}_i = [f_c(w,b)\; f_d(w,b)\; f_b(w,b)]$
We need to find $w$ and $b$ that minimize the following:
$L(w,b) = \sum_{i=1}^{n} \sum_{j \ne \mathrm{label}} \max(0,\; \hat{y}_{i,j} - \hat{y}_{i,\mathrm{label}} + \Delta)$
Why?
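As a rough illustration (not the assignment's reference code), here is a minimal sketch of this margin loss over a dataset of raw scores, with $\Delta = 1$ used as a placeholder value.

```python
import numpy as np

def hinge_loss(scores, labels, delta=1.0):
    """L = sum_i sum_{j != label} max(0, score_ij - score_i,label + delta).
    scores: (n, 3) raw linear scores, labels: (n,) ints in {0, 1, 2}."""
    total = 0.0
    for s, label in zip(scores, labels):
        margins = np.maximum(0.0, s - s[label] + delta)  # per-class margin violations
        margins[label] = 0.0                             # skip the j == label term
        total += margins.sum()
    return total
```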

Questions?