Softmax Classifier
Today’s Class
- Softmax Classifier
- Inference / Making Predictions / Test Time
- Training a Softmax Classifier
- Stochastic Gradient Descent (SGD)
Supervised Learning - Classification
[Figure: example images split into Training Data and Test Data, each labeled with its class: cat, dog, cat, ..., bear]
Supervised Learning - Classification
Training Data: each image becomes an input vector $x_1, x_2, x_3, \ldots, x_n$, paired with a label $y_1, y_2, y_3, \ldots, y_n$ (cat, dog, cat, ..., bear).
Supervised Learning - Classification
Training Data: inputs $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, targets / labels / ground truth $y_i \in \{1, 2, 3\}$, and predictions $\hat{y}_i = f(x_i; \theta)$.
We need a function $f$ that maps any input $x_i$ to its label $y_i$.
How do we "learn" the parameters $\theta$ of this function? We choose the ones that make the following quantity small:
$$\sum_{i=1}^{n} \mathrm{Cost}(\hat{y}_i, y_i)$$
Supervised Learning – Linear Softmax
Training Data: inputs $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, targets / labels / ground truth $y_i \in \{1, 2, 3\}$.
Supervised Learning – Linear Softmax
Training Data: inputs $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$; targets / labels / ground truth encoded as one-hot vectors ($[1\;0\;0]$, $[0\;1\;0]$, $[0\;0\;1]$); predictions are probability vectors such as $[0.85\; 0.10\; 0.05]$, $[0.40\; 0.45\; 0.15]$, $[0.20\; 0.70\; 0.10]$, ..., $[0.40\; 0.25\; 0.35]$.
Supervised Learning – Linear Softmax
Input $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, target $y_i = [1\; 0\; 0]$, prediction $\hat{y}_i = [f_c\; f_d\; f_b]$.
Class scores (one linear function per class):
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
Softmax (scores to probabilities):
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
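A minimal NumPy sketch of this forward pass for one input, assuming 4 features and the three classes; the max-subtraction inside softmax is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(g):
    # Softmax is invariant to adding a constant to every score,
    # so subtracting the max avoids overflow in exp().
    e = np.exp(g - np.max(g))
    return e / np.sum(e)

def predict_probs(x, W, b):
    # Scores: g_k = w_k . x + b_k for each class k (cat, dog, bear).
    g = W @ x + b            # shape (3,)
    return softmax(g)        # shape (3,), sums to 1

# Hypothetical numbers, just to show the shapes.
x = np.array([0.2, 1.5, -0.3, 0.7])                 # one input with 4 features
W = np.random.default_rng(0).normal(size=(3, 4))    # one row of weights per class
b = np.zeros(3)
print(predict_probs(x, W, b))                       # a length-3 probability vector
```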
How do we find a good w and b?
Input $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, target $y_i = [1\; 0\; 0]$, prediction $\hat{y}_i = [f_c(w,b)\; f_d(w,b)\; f_b(w,b)]$.
We need to find $w$ and $b$ that minimize the following:
$$L(w,b) = \sum_{i=1}^{n} \sum_{j=1}^{3} -y_{i,j} \log(\hat{y}_{i,j}) = \sum_{i=1}^{n} -\log(\hat{y}_{i,\mathrm{label}}) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$$
(The inner sum collapses to a single term because $y_{i,j}$ is 1 only at the true label and 0 elsewhere.)
Why?
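A minimal NumPy sketch of this loss, assuming probs holds one softmax output per row and labels holds integer class indices (0 = cat, 1 = dog, 2 = bear); the numbers are illustrative:

```python
import numpy as np

def cross_entropy_loss(probs, labels):
    # probs: (n, 3) softmax outputs, one row per example
    # labels: (n,) integer class indices
    # L = sum_i -log(probs[i, labels[i]])
    n = probs.shape[0]
    return -np.sum(np.log(probs[np.arange(n), labels]))

probs = np.array([[0.85, 0.10, 0.05],
                  [0.40, 0.45, 0.15],
                  [0.20, 0.70, 0.10]])
labels = np.array([0, 1, 0])                    # cat, dog, cat (illustrative)
print(cross_entropy_loss(probs, labels))        # -log(0.85) - log(0.45) - log(0.20)
```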
Gradient Descent (GD)
$\lambda = 0.01$
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: $dL(w,b)/dw$ and $dL(w,b)/db$
    Update w: $w = w - \lambda\, dL(w,b)/dw$
    Update b: $b = b - \lambda\, dL(w,b)/db$
    Print: $L(w,b)$   // Useful to see if this is becoming smaller or not.
end
where $L(w,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$
Gradient Descent (GD) (idea)
[Plot of L(w) versus w, with the current point marked at w = 12]
1. Start with a random value of w (e.g. w = 12)
2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6)
3. Recompute w as: w = w - lambda * (dL/dw)
Gradient Descent (GD) (idea)
[Plot of L(w) versus w; after one update the point has moved to w = 10]
2. Compute the gradient (derivative) of L(w) at the current point
3. Recompute w as: w = w - lambda * (dL/dw)
Gradient Descent (GD) (idea)
[Plot of L(w) versus w; after another update the point has moved to w = 8]
2. Compute the gradient (derivative) of L(w) at the current point
3. Recompute w as: w = w - lambda * (dL/dw)
Our function L(w)
$L(w) = 3 + (1 - w)^2$
Our function L(w)
$L(w) = 3 + (1 - w)^2$
versus the loss we actually want to minimize: $L(W,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(W,b)$
Our function L(w)
$L(w) = 3 + (1 - w)^2$
versus one term of the real loss: $-\log \big[\mathrm{softmax}\big(g(w_1, w_2, \ldots, w_{12};\, x_n)\big)\big]_{\mathrm{label}_n}$
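A short sketch of gradient descent on the toy function above; since $dL/dw = -2(1-w)$, the iterates should move from the starting value toward the minimum at $w = 1$ (the learning rate and step count are arbitrary choices):

```python
def L(w):
    return 3 + (1 - w) ** 2

def dL_dw(w):
    # Derivative of 3 + (1 - w)^2 with respect to w.
    return -2 * (1 - w)

w = 12.0          # start from a "random" value, as in the slides
lam = 0.1         # learning rate (lambda); a hypothetical choice
for step in range(25):
    w = w - lam * dL_dw(w)

print(w, L(w))    # w approaches 1, L(w) approaches 3
```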
Gradient Descent (GD)
$\lambda = 0.01$
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: $dL(w,b)/dw$ and $dL(w,b)/db$   // expensive: requires a pass over all n training examples
    Update w: $w = w - \lambda\, dL(w,b)/dw$
    Update b: $b = b - \lambda\, dL(w,b)/db$
    Print: $L(w,b)$   // Useful to see if this is becoming smaller or not.
end
where $L(w,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$
(mini-batch) Stochastic Gradient Descent (SGD)
$\lambda = 0.01$
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: $dl(w,b)/dw$ and $dl(w,b)/db$
        Update w: $w = w - \lambda\, dl(w,b)/dw$
        Update b: $b = b - \lambda\, dl(w,b)/db$
        Print: $l(w,b)$   // Useful to see if this is becoming smaller or not.
    end
end
where $l(w,b) = \sum_{i \in B} -\log f_{i,\mathrm{label}}(w,b)$ is the loss over the current mini-batch B.
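Below is a sketch of this loop in NumPy for the three-class linear softmax. The data, sizes, and hyperparameters are made up for illustration, and the gradient expressions used inside the batch loop ($\hat{y} - y$ combined with the inputs) are the ones derived in the "Computing Analytic Gradients" slides further down.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 600, 4, 3                           # examples, features, classes (illustrative)
means = rng.normal(scale=3.0, size=(k, d))    # one cluster center per class
labels = rng.integers(0, k, size=n)
X = means[labels] + rng.normal(size=(n, d))   # synthetic, roughly separable data

W, b = 0.01 * rng.normal(size=(k, d)), np.zeros(k)
lam, batch_size, num_epochs = 0.01, 32, 10

def softmax(G):
    E = np.exp(G - np.max(G, axis=1, keepdims=True))
    return E / np.sum(E, axis=1, keepdims=True)

for e in range(num_epochs):
    order = rng.permutation(n)                           # shuffle each epoch
    for start in range(0, n, batch_size):
        B = order[start:start + batch_size]              # indices of this mini-batch
        Xb, yb = X[B], labels[B]
        probs = softmax(Xb @ W.T + b)                    # forward pass, shape (|B|, k)
        Y = np.zeros_like(probs)
        Y[np.arange(len(B)), yb] = 1.0                   # one-hot targets
        err = probs - Y                                  # dl/da for every example in B
        dW, db = err.T @ Xb, err.sum(axis=0)             # gradients summed over the batch
        W, b = W - lam * dW, b - lam * db                # SGD update
    total_loss = -np.log(softmax(X @ W.T + b)[np.arange(n), labels]).sum()
    print(f"epoch {e}: loss {total_loss:.2f}")           # should (noisily) decrease
```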
[Figure omitted. Source: Andrew Ng]
(mini-batch) Stochastic Gradient Descent (SGD)
Same algorithm as above, annotated for the case |B| = 1: the mini-batch loss reduces to a single term, $l(w,b) = -\log f_{i,\mathrm{label}}(w,b)$ for one example i.
Computing Analytic Gradients
This is what we have: the per-example loss $\ell = -\log f_{\mathrm{label}}$, where $f_i = e^{a_i} / \sum_k e^{a_k}$ is the softmax of the class scores $a$.
Computing Analytic Gradients
This is what we have: the loss, the softmax, and the scores.
Reminder: $a_i = (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i$
Computing Analytic Gradients
This is what we have: $\ell$, the softmax $f$, and the scores $a$.
This is what we need: $\partial\ell/\partial w_{i,j}$ for each $w_{i,j}$, and $\partial\ell/\partial b_i$ for each $b_i$.
Computing Analytic Gradients
Step 1: Chain Rule of Calculus
$$\frac{\partial\ell}{\partial w_{i,j}} = \frac{\partial\ell}{\partial a_i} \frac{\partial a_i}{\partial w_{i,j}} \qquad \frac{\partial\ell}{\partial b_i} = \frac{\partial\ell}{\partial a_i} \frac{\partial a_i}{\partial b_i}$$
Let's do $\partial a_i / \partial w_{i,j}$ and $\partial a_i / \partial b_i$ first.
Computing Analytic Gradients
$a_i = (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i$
$$\frac{\partial a_i}{\partial w_{i,3}} = \frac{\partial}{\partial w_{i,3}}\big[(w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i\big] = x_3$$
In general: $\dfrac{\partial a_i}{\partial w_{i,j}} = x_j$
Computing Analytic Gradients
$\dfrac{\partial a_i}{\partial w_{i,j}} = x_j$
$a_i = (w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i$
$$\frac{\partial a_i}{\partial b_i} = \frac{\partial}{\partial b_i}\big[(w_{i,1} x_1 + w_{i,2} x_2 + w_{i,3} x_3 + w_{i,4} x_4) + b_i\big] = 1$$
Computing Analytic Gradients
So far: $\dfrac{\partial a_i}{\partial w_{i,j}} = x_j$ and $\dfrac{\partial a_i}{\partial b_i} = 1$
Computing Analytic Gradients
Step 1: Chain Rule of Calculus
Now let's do the remaining factor, $\partial\ell/\partial a_i$ (it is the same for both derivatives!).
Computing Analytic Gradients
In our cat, dog, bear classification example: $i \in \{0, 1, 2\}$
Computing Analytic Gradients
In our cat, dog, bear classification example: $i \in \{0, 1, 2\}$. Let's say: label = 1.
We need: $\dfrac{\partial\ell}{\partial a_0}$, $\dfrac{\partial\ell}{\partial a_1}$, $\dfrac{\partial\ell}{\partial a_2}$
Computing Analytic Gradients
For the non-label indices (here $i = 0$ and $i = 2$): $\dfrac{\partial\ell}{\partial a_0} = \hat{y}_0$ and $\dfrac{\partial\ell}{\partial a_2} = \hat{y}_2$, i.e. $\dfrac{\partial\ell}{\partial a_i} = \hat{y}_i$.
Remember this slide?
Input $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, target $y_i = [1\; 0\; 0]$, prediction $\hat{y}_i = [f_c\; f_d\; f_b]$.
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
Computing Analytic Gradients
Back to the derivatives: for $i \neq \mathrm{label}$, $\dfrac{\partial\ell}{\partial a_i} = \hat{y}_i$.
Computing Analytic Gradients
For the label index ($i = 1$): $\dfrac{\partial\ell}{\partial a_1} = \hat{y}_1 - 1$
Computing Analytic Gradients
With label = 1:
$\dfrac{\partial\ell}{\partial a_0} = \hat{y}_0$, $\quad\dfrac{\partial\ell}{\partial a_1} = \hat{y}_1 - 1$, $\quad\dfrac{\partial\ell}{\partial a_2} = \hat{y}_2$
In vector form:
$$\frac{\partial\ell}{\partial a} = \begin{bmatrix} \partial\ell/\partial a_0 \\ \partial\ell/\partial a_1 \\ \partial\ell/\partial a_2 \end{bmatrix} = \begin{bmatrix} \hat{y}_0 \\ \hat{y}_1 - 1 \\ \hat{y}_2 \end{bmatrix} = \begin{bmatrix} \hat{y}_0 \\ \hat{y}_1 \\ \hat{y}_2 \end{bmatrix} - \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} = \hat{y} - y$$
So in general: $\dfrac{\partial\ell}{\partial a_i} = \hat{y}_i - y_i$
Computing Analytic Gradients
Putting the pieces together:
$\dfrac{\partial a_i}{\partial w_{i,j}} = x_j$, $\quad\dfrac{\partial a_i}{\partial b_i} = 1$, $\quad\dfrac{\partial\ell}{\partial a_i} = \hat{y}_i - y_i$
$$\frac{\partial\ell}{\partial w_{i,j}} = (\hat{y}_i - y_i)\, x_j \qquad \frac{\partial\ell}{\partial b_i} = \hat{y}_i - y_i$$
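A small NumPy sketch of these formulas for a single example, together with a finite-difference check; the helper names and numbers are illustrative:

```python
import numpy as np

def predict_probs(x, W, b):
    # Softmax forward pass (same as the earlier sketch).
    g = W @ x + b
    e = np.exp(g - np.max(g))
    return e / np.sum(e)

def gradients(x, label, W, b):
    # dl/dW[i, j] = (y_hat_i - y_i) * x_j,   dl/db[i] = y_hat_i - y_i
    y_hat = predict_probs(x, W, b)
    y = np.zeros_like(y_hat)
    y[label] = 1.0                             # one-hot target
    err = y_hat - y                            # dl/da
    return np.outer(err, x), err               # (dl/dW, dl/db)

def loss_one(x, label, W, b):
    return -np.log(predict_probs(x, W, b)[label])

# Finite-difference check on one weight (hypothetical numbers).
rng = np.random.default_rng(0)
x, label = rng.normal(size=4), 1
W, b = rng.normal(size=(3, 4)), np.zeros(3)
dW, db = gradients(x, label, W, b)
eps = 1e-5
W_plus = W.copy()
W_plus[1, 2] += eps
numeric = (loss_one(x, label, W_plus, b) - loss_one(x, label, W, b)) / eps
print(dW[1, 2], numeric)   # the two values should closely match
```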
Supervised Learning – Softmax Classifier
Extract features: $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$
Run features through classifier:
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
Get predictions: $\hat{y}_i = [f_c\; f_d\; f_b]$
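At test time (inference) the predicted class is simply the index of the largest softmax output, which is also the index of the largest raw score. A minimal sketch, assuming W and b have already been trained and the class order is cat, dog, bear:

```python
import numpy as np

CLASSES = ["cat", "dog", "bear"]

def predict_class(x, W, b):
    # The class with the highest score also has the highest softmax
    # probability, so the exponentials are not needed for the decision.
    g = W @ x + b
    return CLASSES[int(np.argmax(g))]

# Hypothetical parameters and one feature vector, just for illustration.
W = np.random.default_rng(1).normal(size=(3, 4))
b = np.zeros(3)
print(predict_class(np.array([0.2, 1.5, -0.3, 0.7]), W, b))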
Overfitting
[Figure: the same data fit with three different models]
- $f$ is linear: $\mathrm{Loss}(w)$ is high. Underfitting (High Bias).
- $f$ is cubic: $\mathrm{Loss}(w)$ is low.
- $f$ is a polynomial of degree 9: $\mathrm{Loss}(w)$ is zero! Overfitting (High Variance).
Credit: C. Bishop, Pattern Recognition and Machine Learning.
More …
- Regularization
- Momentum updates
- Hinge Loss, Least Squares Loss, Logistic Regression Loss
Assignment 2 – Linear Margin-Classifier
Training Data: inputs $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$; targets / labels / ground truth encoded as one-hot vectors ($[1\;0\;0]$, $[0\;1\;0]$, $[0\;0\;1]$); predictions are now raw class scores rather than probabilities, e.g. $[4.3\; {-1.3}\; 1.1]$, $[3.3\; 3.5\; 1.1]$, $[0.5\; 5.6\; {-4.2}]$, ..., $[1.1\; {-5.3}\; {-9.4}]$.
Supervised Learning – Linear Softmax
Input $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, target $y_i = [1\; 0\; 0]$, prediction $\hat{y}_i = [f_c\; f_d\; f_b]$, where the scores are used directly (no softmax):
$f_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$f_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$f_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
How do we find a good w and b?
Input $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, target $y_i = [1\; 0\; 0]$, prediction $\hat{y}_i = [f_c(w,b)\; f_d(w,b)\; f_b(w,b)]$.
We need to find $w$ and $b$ that minimize the following:
$$L(w,b) = \sum_{i=1}^{n} \sum_{j \neq \mathrm{label}} \max\big(0,\; \hat{y}_{i,j} - \hat{y}_{i,\mathrm{label}} + \Delta\big)$$
Why?
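A minimal NumPy sketch of this margin loss, assuming the predictions are raw class scores as above and using $\Delta = 1$ as an illustrative margin:

```python
import numpy as np

def margin_loss(scores, labels, delta=1.0):
    # scores: (n, 3) raw class scores; labels: (n,) integer class indices
    n = scores.shape[0]
    correct = scores[np.arange(n), labels][:, None]          # score of the true class
    margins = np.maximum(0.0, scores - correct + delta)      # hinge on every class
    margins[np.arange(n), labels] = 0.0                      # keep only j != label
    return margins.sum()

# Score rows like the ones on the assignment slide; labels are illustrative.
scores = np.array([[4.3, -1.3,  1.1],
                   [3.3,  3.5,  1.1],
                   [0.5,  5.6, -4.2]])
labels = np.array([0, 1, 0])
print(margin_loss(scores, labels))
```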
Questions?