Softmax Classifier + Generalization


Softmax Classifier + Generalization. Various slides from previous courses by: D.A. Forsyth (Berkeley / UIUC), I. Kokkinos (Ecole Centrale / UCL), S. Lazebnik (UNC / UIUC), S. Seitz (MSR / Facebook), J. Hays (Brown / Georgia Tech), A. Berg (Stony Brook / UNC), D. Samaras (Stony Brook), J. M. Frahm (UNC), V. Ordonez (UVA), Steve Seitz (UW).

Last Class: Introduction to Machine Learning. Unsupervised Learning: Clustering (e.g. k-means clustering). Supervised Learning: Classification (e.g. k-nearest neighbors).

Today’s Class: Softmax Classifier (Linear Classifiers), Generalization / Overfitting / Regularization, Global Features.

Supervised Learning vs Unsupervised Learning. Supervised learning learns a mapping $x \to y$ from labeled examples (e.g. images $x$ labeled cat, dog, or bear); unsupervised learning is given only the inputs $x$, with no labels. Classification is the supervised case, clustering the unsupervised one. [Slide figures: example images of cats, dogs, and bears.]

Supervised Learning – k-Nearest Neighbors. To classify a new example with k = 3, find its 3 nearest labeled neighbors and take a majority vote: if the neighbors are cat, cat, dog, predict cat; if they are bear, dog, dog, predict dog. [Slide figures: query images with their nearest neighbors among the labeled cat / dog / bear images.]

Supervised Learning – k-Nearest Neighbors. How do we choose the right k? How do we choose the right features? How do we choose the right distance metric? Answer: just choose the combination that works best, BUT not measured on the test data. Instead, split the training data into a "training set" and a "validation set" (also called a "development set"), and pick the combination that performs best on the validation set.
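A minimal sketch (not from the slides) of this train / validation split, using scikit-learn and made-up random features and labels; because the data is random the accuracies sit near chance, but the selection procedure is the point:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X = rng.rand(300, 4)                      # 300 hypothetical examples, 4 features each
y = rng.randint(0, 3, size=300)           # hypothetical labels: 0 = cat, 1 = dog, 2 = bear

# Split the original training data into a training set and a validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_k, best_acc = None, 0.0
for k in [1, 3, 5, 7, 9]:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_val, y_val)         # accuracy measured on the validation set only
    if acc > best_acc:
        best_k, best_acc = k, acc
print(best_k, best_acc)                   # the test data is never touched in this loop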

Supervised Learning - Classification. Training data: examples with known labels (cat, dog, bear, ...). Test data: new examples whose labels must be predicted. Each training example is a feature vector $x_i$ paired with a label $y_i$: $x_1 \to y_1 = \text{cat}$, $x_2 \to y_2 = \text{dog}$, $x_3 \to y_3 = \text{cat}$, $\ldots$, $x_n \to y_n = \text{bear}$. [Slide figures: grids of labeled training images and unlabeled test images.]

Supervised Learning - Classification. Training data: inputs $x_i = [x_{i1}\;x_{i2}\;x_{i3}\;x_{i4}]$ and targets / labels / ground truth $y_i \in \{1, 2, 3\}$ (here $y_1 = 1$ for cat, $y_2 = 2$ for dog, $\ldots$, $y_n = 3$ for bear). We need to find a function that maps any input $x$ to its label $y$: predictions $\hat{y}_i = f(x_i; \theta)$. How do we "learn" the parameters $\theta$ of this function? We choose the ones that make the following quantity small: $\sum_{i=1}^{n} \mathrm{Cost}(\hat{y}_i, y_i)$.
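As a minimal illustration of that objective (not from the slides; the model, cost, and data below are all made up), the quantity being minimized is just a sum of per-example costs:

import numpy as np

def total_cost(f, theta, X, y, cost):
    """Sum of per-example costs Cost(f(x_i; theta), y_i) over the training data."""
    return sum(cost(f(x, theta), y_i) for x, y_i in zip(X, y))

# Toy example: a linear scorer with a squared-error cost (hypothetical choices).
f = lambda x, theta: x @ theta
cost = lambda y_hat, y_true: (y_hat - y_true) ** 2
X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([1.0, 0.0])
theta = np.zeros(2)
print(total_cost(f, theta, X, y, cost))   # "learning" = choosing theta to make this small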

Supervised Learning – Softmax Classifier. Training data: inputs $x_1 = [x_{11}\;x_{12}\;x_{13}\;x_{14}]$, $x_2 = [x_{21}\;x_{22}\;x_{23}\;x_{24}]$, $x_3 = [x_{31}\;x_{32}\;x_{33}\;x_{34}]$, $\ldots$, $x_n = [x_{n1}\;x_{n2}\;x_{n3}\;x_{n4}]$, with targets / labels / ground truth $y_1 = 1$, $y_2 = 2$, $\ldots$, $y_n = 3$.

Supervised Learning – Softmax Classifier. The targets are encoded as one-hot vectors, one entry per class: $[1\;0\;0]$ for cat, $[0\;1\;0]$ for dog, $[0\;0\;1]$ for bear. The classifier outputs a probability vector over the three classes for each input, e.g. predictions $[0.85\;0.10\;0.05]$, $[0.40\;0.45\;0.05]$, $[0.20\;0.70\;0.10]$, $[0.40\;0.25\;0.35]$.

Supervised Learning – Softmax Classifier. For an input $x_i = [x_{i1}\;x_{i2}\;x_{i3}\;x_{i4}]$ with target $y_i = [1\;0\;0]$, the prediction is $\hat{y}_i = [f_c\;f_d\;f_b]$, computed from one linear score per class:
$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
followed by the softmax:
$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
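A minimal numpy sketch of this forward pass, assuming 4 features and 3 classes (cat, dog, bear); the weight, bias, and input values below are made up for illustration:

import numpy as np

rng = np.random.RandomState(0)
W = 0.1 * rng.randn(3, 4)                  # one row of weights per class: w_c, w_d, w_b (hypothetical values)
b = np.zeros(3)                            # one bias per class: b_c, b_d, b_b
x_i = np.array([0.2, 0.5, 0.1, 0.9])       # x_i = [x_i1 x_i2 x_i3 x_i4] (hypothetical features)

g = W @ x_i + b                            # linear scores g_c, g_d, g_b
g = g - g.max()                            # common trick for numerical stability (does not change the softmax)
f = np.exp(g) / np.sum(np.exp(g))          # softmax: f_c, f_d, f_b, which sum to 1
print(f)                                   # the prediction y_hat_i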

How do we find a good w and b? With target $y_i = [1\;0\;0]$ and prediction $\hat{y}_i = [f_c(w,b)\;f_d(w,b)\;f_b(w,b)]$, we need to find $w$ and $b$ that minimize the following function $L$:
$L(w,b) = \sum_{i=1}^{n} \sum_{j=1}^{3} -y_{i,j} \log(\hat{y}_{i,j}) = \sum_{i=1}^{n} -\log(\hat{y}_{i,\mathrm{label}}) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$
Why? Because each $y_i$ is one-hot, only the term for the true label survives the inner sum.
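A small sketch of computing this loss, given hypothetical predicted probability vectors and class indices (0 = cat, 1 = dog, 2 = bear); the numbers are illustrative only:

import numpy as np

probs = np.array([[0.85, 0.10, 0.05],      # hypothetical softmax outputs y_hat_i,
                  [0.40, 0.45, 0.05],      # one row per training example
                  [0.20, 0.70, 0.10],
                  [0.40, 0.25, 0.35]])
labels = np.array([0, 1, 0, 2])            # true class indices: cat, dog, cat, bear

# L(w, b) = sum_i -log y_hat_{i, label_i}
L = -np.log(probs[np.arange(len(labels)), labels]).sum()
print(L)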

How do we find a good w and b? Problem statement: find $w$ and $b$ such that $L(w,b)$ is minimal. Solution from calculus: set $\frac{\partial}{\partial w} L(w,b) = 0$ and solve for $w$, and set $\frac{\partial}{\partial b} L(w,b) = 0$ and solve for $b$.

[Figure: curve sketching, from https://courses.lumenlearning.com/businesscalc1/chapter/reading-curve-sketching/]

Problems with this approach: Some functions $L(w,b)$ are very complicated compositions of many functions, so finding an analytical derivative is tedious. And even if the function is easy to differentiate, it might not be easy to solve for $w$, e.g. $\frac{\partial}{\partial w} L(w,b) = e^{w} + w = 0$. How do you solve for $w$ in that equation?

Solution, an iterative approach: Gradient Descent (GD). 1. Start with a random value of $w$ (e.g. $w = 12$). 2. Compute the gradient (derivative) of $L(w)$ at the point $w = 12$ (e.g. $dL/dw = 6$). 3. Recompute $w$ as: $w = w - \lambda \cdot (dL/dw)$. [Slide figure: the loss curve $L(w)$ with the current point at $w = 12$.]

Solution, an iterative approach: Gradient Descent (GD), continued. Repeat steps 2 and 3 from the new point (now $w = 10$): compute the gradient of $L(w)$ at $w = 10$ and update $w = w - \lambda \cdot (dL/dw)$ again. [Slide figure: the loss curve $L(w)$ with the point moved to $w = 10$.]
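A toy 1-D version of this loop; the loss $L(w) = (w - 3)^2$, the starting point $w = 12$, and the step size are all invented for illustration and are not the classifier loss:

L = lambda w: (w - 3.0) ** 2          # assumed toy loss
dL_dw = lambda w: 2.0 * (w - 3.0)     # its derivative

w, lam = 12.0, 0.1                    # start at w = 12; lam is the step size (lambda)
for step in range(50):
    w = w - lam * dL_dw(w)            # w = w - lambda * (dL / dw)
print(w, L(w))                        # w approaches 3, where L(w) is minimal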

Gradient Descent (GD)
λ = 0.01
Initialize w and b randomly
for e = 0 … num_epochs do
    Compute: dL(w,b)/dw and dL(w,b)/db
    Update w: w = w − λ dL(w,b)/dw
    Update b: b = b − λ dL(w,b)/db
    Print: L(w,b)   // useful to see if this is becoming smaller or not
end
where $L(w,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$.
Problem: expensive! Every single update requires summing (and differentiating) over all n training examples.

Solution: (mini-batch) Stochastic Gradient Descent (SGD)
λ = 0.01
Initialize w and b randomly
for e = 0 … num_epochs do
    for b = 0 … num_batches do
        Compute: dl(w,b)/dw and dl(w,b)/db
        Update w: w = w − λ dl(w,b)/dw
        Update b: b = b − λ dl(w,b)/db
        Print: l(w,b)   // useful to see if this is becoming smaller or not
    end
end
where $l(w,b) = \sum_{i \in B} -\log f_{i,\mathrm{label}}(w,b)$ and B is a small set of training examples (a mini-batch).
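A runnable numpy sketch of this loop for the softmax classifier; the data is synthetic, the hyperparameters are illustrative, and the gradient uses the standard softmax / cross-entropy result $dl/dg = f - y$ (presumably what the following slides derive):

import numpy as np

rng = np.random.RandomState(0)
n, d, C = 600, 4, 3                       # examples, features, classes (made up)
X = rng.randn(n, d)
true_W = rng.randn(C, d)                  # used only to generate learnable labels
y = (X @ true_W.T).argmax(axis=1)         # hypothetical integer labels

W = 0.01 * rng.randn(C, d)                # initialize w and b randomly
b = np.zeros(C)
lam, num_epochs, batch_size = 0.01, 20, 32

def full_loss(W, b):
    g = X @ W.T + b
    g = g - g.max(axis=1, keepdims=True)
    p = np.exp(g) / np.exp(g).sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(n), y]).sum()            # L(w, b) over all examples

for e in range(num_epochs):
    for idx in np.array_split(rng.permutation(n), n // batch_size):
        g = X[idx] @ W.T + b                             # scores for the mini-batch B
        g = g - g.max(axis=1, keepdims=True)             # numerical stability
        f = np.exp(g) / np.exp(g).sum(axis=1, keepdims=True)   # softmax outputs
        dg = f - np.eye(C)[y[idx]]                       # dl/dg = f - y (y one-hot)
        W = W - lam * (dg.T @ X[idx])                    # w = w - lambda dl/dw
        b = b - lam * dg.sum(axis=0)                     # b = b - lambda dl/db
    print(e, full_loss(W, b))                            # should become smaller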

Source: Andrew Ng

Three more things: how to compute the gradient, regularization, and momentum updates.

SGD Gradient for the Softmax Function [the derivation is worked through on the slides; the equations are not preserved in this transcript]
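The standard result of that derivation (stated here as background, since the slide equations are missing) is that for one example the gradient of the loss with respect to the scores is $\partial l / \partial g = f - y$ (softmax output minus one-hot target), which gives $\partial l / \partial W = (f - y)\,x^{\top}$ and $\partial l / \partial b = f - y$. A quick finite-difference spot check of that formula, with made-up values:

import numpy as np

rng = np.random.RandomState(1)
x = rng.randn(4)                 # one example with 4 features (hypothetical)
label = 2                        # index of the true class (hypothetical)
W = rng.randn(3, 4)
b = rng.randn(3)

def loss(W, b):
    g = W @ x + b
    g = g - g.max()              # numerical stability
    f = np.exp(g) / np.exp(g).sum()
    return -np.log(f[label])

# Analytic gradient using dl/dg = f - y:
g = W @ x + b
f = np.exp(g - g.max()) / np.exp(g - g.max()).sum()
dg = f.copy()
dg[label] -= 1.0                 # f - y, with y one-hot
dW_analytic = np.outer(dg, x)    # dl/dW
db_analytic = dg                 # dl/db

# Finite-difference check of one weight entry:
eps = 1e-5
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
dW_numeric = (loss(W_plus, b) - loss(W_minus, b)) / (2 * eps)
print(dW_analytic[0, 0], dW_numeric)   # the two numbers should agree closely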

Supervised Learning – Softmax Classifier, as a pipeline: (1) Extract features: $x_i = [x_{i1}\;x_{i2}\;x_{i3}\;x_{i4}]$. (2) Run the features through the classifier: compute the linear scores $g_c$, $g_d$, $g_b$ and the softmax outputs $f_c = e^{g_c}/(e^{g_c}+e^{g_d}+e^{g_b})$, $f_d = e^{g_d}/(e^{g_c}+e^{g_d}+e^{g_b})$, $f_b = e^{g_b}/(e^{g_c}+e^{g_d}+e^{g_b})$, as defined above. (3) Get predictions: $\hat{y}_i = [f_c\;f_d\;f_b]$.

Supervised Machine Learning Steps. Training: training images → image features, together with training labels → training → learned model. Testing: test image → image features → learned model → prediction. Slide credit: D. Hoiem

Generalization. Generalization refers to the ability to correctly classify never-before-seen examples. It can be controlled by turning "knobs" that affect the complexity of the model. Training set: labels known. Test set: labels unknown.

Overfitting. If $f$ is linear, $\mathrm{Loss}(w)$ is high: underfitting (high bias). If $f$ is cubic, $\mathrm{Loss}(w)$ is low. If $f$ is a polynomial of degree 9, $\mathrm{Loss}(w)$ is zero on the training data, but the model overfits (high variance). [Slide figure: polynomial fits of increasing degree to the same points. Credit: C. Bishop, Pattern Recognition and Machine Learning.]
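A rough sketch of that effect with synthetic data; the sine-plus-noise setup and sample size are invented, and only the polynomial degrees come from the slide:

import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.randn(10)   # 10 noisy samples (hypothetical data)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)             # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    train_loss = np.mean((y_hat - y) ** 2)
    print(degree, train_loss)                     # degree 9 drives the training loss to
                                                  # essentially zero, yet generalizes worst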

Questions?