Softmax Classifier + Generalization. Various slides from previous courses by: D.A. Forsyth (Berkeley / UIUC), I. Kokkinos (Ecole Centrale / UCL), S. Lazebnik (UNC / UIUC), S. Seitz (MSR / Facebook), J. Hays (Brown / Georgia Tech), A. Berg (Stony Brook / UNC), D. Samaras (Stony Brook), J. M. Frahm (UNC), V. Ordonez (UVA), Steve Seitz (UW).
Last Class Introduction to Machine Learning Unsupervised Learning: Clustering (e.g. k-means clustering) Supervised Learning: Classification (e.g. k-nearest neighbors)
Today’s Class Softmax Classifier (Linear Classifiers) Generalization / Overfitting / Regularization Global Features
Supervised Learning vs Unsupervised Learning [Figure: supervised learning learns a mapping x → y from images to labels (cat, dog, bear), i.e. classification; unsupervised learning sees only the inputs x and groups them, i.e. clustering.]
Supervised Learning – k-Nearest Neighbors [Figure: with k = 3, the three nearest neighbors of the first query image are cat, cat, dog, so it is classified as cat.]
Supervised Learning – k-Nearest Neighbors [Figure: with k = 3, the three nearest neighbors of the second query image are bear, dog, dog, so it is classified as dog.]
Supervised Learning – k-Nearest Neighbors How do we choose the right k? How do we choose the right features? How do we choose the right distance metric? Answer: Just choose the combination that works best! BUT not on the test data. Instead, split the training data into a "Training set" and a "Validation set" (also called a "Development set"), as in the sketch below.
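A minimal sketch of this model-selection recipe in Python, assuming the data are plain numpy arrays; the kNN classifier and the random data here are illustrative placeholders, not the course's reference implementation:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Predict the label of x as the majority label among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]               # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]              # majority vote

# Hypothetical data: 100 four-dimensional feature vectors, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = rng.integers(0, 3, size=100)

# Split the training data into a training set and a validation set.
split = 80
X_tr, y_tr = X[:split], y[:split]
X_val, y_val = X[split:], y[split:]

# Try several values of k and keep the one that works best on the validation set.
best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7):
    preds = np.array([knn_predict(X_tr, y_tr, x, k) for x in X_val])
    acc = np.mean(preds == y_val)
    if acc > best_acc:
        best_k, best_acc = k, acc
print(f"best k = {best_k} (validation accuracy {best_acc:.2f})")
```

The same loop works for choosing features or distance metrics: only the inner evaluation changes, never the held-out test set.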
Supervised Learning - Classification [Figure: Training Data: images of cats, dogs, and bears with known labels; Test Data: cat and dog images with their labels withheld.]
Supervised Learning - Classification Training Data: each training image is paired with a label: $x_1 \mapsto y_1$ (cat), $x_2 \mapsto y_2$ (dog), $x_3 \mapsto y_3$ (cat), $\ldots$, $x_n \mapsto y_n$ (bear); the feature vectors $x_i$ and the label encodings $y_i$ are filled in on the next slides.
Supervised Learning - Classification Training Data: inputs $x_1 = [x_{11}\, x_{12}\, x_{13}\, x_{14}]$, $x_2 = [x_{21}\, x_{22}\, x_{23}\, x_{24}]$, $x_3 = [x_{31}\, x_{32}\, x_{33}\, x_{34}]$, $\ldots$, $x_n = [x_{n1}\, x_{n2}\, x_{n3}\, x_{n4}]$; targets / labels / ground truth as class indices (cat = 1, dog = 2, bear = 3): $y_1 = 1$, $y_2 = 2$, $y_3 = 1$, $\ldots$, $y_n = 3$; predictions $\hat{y}_i = f(x_i; \theta)$. We need to find a function $f$ that maps any input $x_i$ to its label $y_i$. How do we "learn" the parameters $\theta$ of this function? We choose the ones that make the total cost $\sum_{i=1}^{n} \mathrm{Cost}(\hat{y}_i, y_i)$ small.
Supervised Learning – Softmax Classifier Training Data: inputs $x_i = [x_{i1}\, x_{i2}\, x_{i3}\, x_{i4}]$ for $i = 1, \ldots, n$; targets / labels / ground truth as class indices: $y_1 = 1$ (cat), $y_2 = 2$ (dog), $y_3 = 1$ (cat), $\ldots$, $y_n = 3$ (bear).
Supervised Learning – Softmax Classifier Training Data: inputs $x_i = [x_{i1}\, x_{i2}\, x_{i3}\, x_{i4}]$; targets / labels / ground truth recoded as one-hot vectors: cat = [1 0 0], dog = [0 1 0], bear = [0 0 1]; predictions are probability vectors, e.g. $\hat{y}_1 = [0.85\ 0.10\ 0.05]$, $\hat{y}_2 = [0.40\ 0.45\ 0.05]$, $\hat{y}_3 = [0.20\ 0.70\ 0.10]$, $\ldots$, $\hat{y}_n = [0.40\ 0.25\ 0.35]$.
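The one-hot targets above can be built mechanically from integer labels; a small numpy sketch (the label values are illustrative):

```python
import numpy as np

labels = np.array([0, 1, 0, 2])          # e.g. 0 = cat, 1 = dog, 2 = bear
num_classes = 3
one_hot = np.eye(num_classes)[labels]    # row i is all zeros except a 1 at position labels[i]
print(one_hot)
# [[1. 0. 0.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 0. 1.]]
```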
Supervised Learning – Softmax Classifier For an input $x_i = [x_{i1}\, x_{i2}\, x_{i3}\, x_{i4}]$ with target $y_i = [1\ 0\ 0]$, the prediction is $\hat{y}_i = [f_c\ f_d\ f_b]$, computed from one linear score per class, $g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$, $g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$, $g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$, passed through the softmax: $f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$, $f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$, $f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$.
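These equations translate directly into code; a sketch for 4 features and 3 classes, where the weights and the input are random placeholders (the max-subtraction line is a standard numerical-stability trick not shown on the slide; it does not change the result):

```python
import numpy as np

def softmax_forward(x, W, b):
    """Linear scores followed by softmax: returns class probabilities that sum to 1."""
    g = W @ x + b                  # g[j] = sum_k W[j,k] * x[k] + b[j], one score per class
    g = g - np.max(g)              # subtract the max before exponentiating for stability
    e = np.exp(g)
    return e / np.sum(e)           # f[j] = e^{g_j} / sum_{j'} e^{g_{j'}}

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))        # one row of weights per class (cat, dog, bear)
b = np.zeros(3)
x = rng.normal(size=4)             # a single 4-dimensional feature vector

f = softmax_forward(x, W, b)
print(f, f.sum())                  # three probabilities summing to 1
```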
How do we find a good w and b? For $x_i = [x_{i1}\, x_{i2}\, x_{i3}\, x_{i4}]$ with target $y_i = [1\ 0\ 0]$ and prediction $\hat{y}_i = [f_c(w,b)\ f_d(w,b)\ f_b(w,b)]$, we need to find $w$ and $b$ that minimize the following function $L$: $L(w,b) = \sum_{i=1}^{n} \sum_{j=1}^{3} -y_{i,j} \log(\hat{y}_{i,j}) = \sum_{i=1}^{n} -\log(\hat{y}_{i,\mathrm{label}}) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$. Why does the middle equality hold? Because each $y_i$ is one-hot, only the term for the true class survives the inner sum.
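A sketch of this loss in code, self-contained (it re-states the softmax forward pass from the previous block; data and weights are again random placeholders):

```python
import numpy as np

def softmax_forward(x, W, b):
    g = W @ x + b
    e = np.exp(g - np.max(g))
    return e / np.sum(e)

def loss(W, b, X, labels):
    """Cross-entropy: L(w,b) = sum_i -log f_{i,label}(w,b)."""
    total = 0.0
    for x, y in zip(X, labels):
        f = softmax_forward(x, W, b)
        total += -np.log(f[y])     # only the probability assigned to the true class matters
    return total

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))        # 5 examples, 4 features each
labels = np.array([0, 1, 0, 2, 1]) # integer class labels
W, b = rng.normal(size=(3, 4)), np.zeros(3)
print(loss(W, b, X, labels))
```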
How do we find a good w and b? Problem statement: Find $w$ and $b$ such that $L(w,b)$ is minimal. Solution from calculus: set $\frac{\partial}{\partial w} L(w,b) = 0$ and solve for $w$; set $\frac{\partial}{\partial b} L(w,b) = 0$ and solve for $b$.
[Figure: curve sketching. Source: https://courses.lumenlearning.com/businesscalc1/chapter/reading-curve-sketching/]
Problems with this approach: Some functions $L(w,b)$ are very complicated, compositions of many functions, so finding the analytical derivative is tedious. And even when the derivative is simple, it might not be easy to solve for $w$: e.g. $\frac{\partial}{\partial w} L(w,b) = e^w + w = 0$. How do you find $w$ in that equation? (See the sketch below.)
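Indeed, $e^w + w = 0$ has no closed-form solution; you can only pin $w$ down numerically, e.g. by bisection (a sketch; any iterative root-finder would make the same point):

```python
import math

# Solve e^w + w = 0 by bisection: the left-hand side is increasing in w,
# negative at w = -1 (e^{-1} - 1 < 0) and positive at w = 0 (e^0 = 1 > 0).
lo, hi = -1.0, 0.0
for _ in range(50):
    mid = (lo + hi) / 2
    if math.exp(mid) + mid < 0:
        lo = mid
    else:
        hi = mid
print(lo)   # about -0.5671
```

Even this toy equation forces an iterative method, which motivates attacking the minimization itself iteratively.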
Solution: Iterative Approach: Gradient Descent (GD) 1. Start with a random value of w (e.g. w = 12). 2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6). 3. Recompute w as: w = w − lambda * (dL/dw).
Solution: Iterative Approach: Gradient Descent (GD) Repeat from the new point w = 10: 2. Compute the gradient (derivative) of L(w) at the point w = 10. 3. Recompute w as: w = w − lambda * (dL/dw). A worked version of these steps is sketched below.
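A runnable 1-D version of these steps; the quadratic loss here is made up, chosen so that dL/dw = 6 at w = 12, and lambda = 1/3 is chosen so the first step reproduces the slide's move from 12 to 10:

```python
def L(w):
    return (w - 6) ** 2 / 2     # made-up loss with its minimum at w = 6

def dL_dw(w):
    return w - 6                # its derivative; at w = 12 this is 6, as on the slide

w = 12.0                        # 1. start from an initial value (fixed here for reproducibility)
lam = 1 / 3                     # step size (lambda)
for step in range(10):
    g = dL_dw(w)                # 2. compute the gradient at the current w
    w = w - lam * g             # 3. step downhill: w = w - lambda * (dL/dw)
    print(f"step {step}: w = {w:.3f}, L(w) = {L(w):.3f}")
# w approaches 6, the minimizer, and L(w) shrinks toward 0
```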
Gradient Descent (GD)
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
  Compute: dL(w,b)/dw and dL(w,b)/db
  Update w: w = w − λ dL(w,b)/dw
  Update b: b = b − λ dL(w,b)/db
  Print: L(w,b) // Useful to see if this is becoming smaller or not.
end
where $L(w,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$. Problem: expensive! Every single update touches all n training examples.
Solution: (mini-batch) Stochastic Gradient Descent (SGD)
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
  for t = 0, num_batches do
    Compute: dl(w,b)/dw and dl(w,b)/db
    Update w: w = w − λ dl(w,b)/dw
    Update b: b = b − λ dl(w,b)/db
    Print: l(w,b) // Useful to see if this is becoming smaller or not.
  end
end
where $l(w,b) = \sum_{i \in B} -\log f_{i,\mathrm{label}}(w,b)$ and B is a small set of training examples (a mini-batch). A runnable version is sketched below.
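A runnable sketch of this mini-batch loop for the softmax classifier on synthetic data; it uses the standard softmax cross-entropy gradient (the result the upcoming gradient slides cover), and the data, learning rate, and batch size are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                  # 100 examples, 4 features each
labels = rng.integers(0, 3, size=100)          # integer labels for 3 classes
W, b = 0.01 * rng.normal(size=(3, 4)), np.zeros(3)

lam, batch_size, num_epochs = 0.01, 10, 20
num_batches = len(X) // batch_size

for e in range(num_epochs):
    perm = rng.permutation(len(X))             # visit examples in a fresh random order
    for t in range(num_batches):
        B = perm[t * batch_size:(t + 1) * batch_size]
        xb, yb = X[B], labels[B]
        # forward: scores and softmax probabilities for the whole mini-batch
        g = xb @ W.T + b
        g -= g.max(axis=1, keepdims=True)      # numerical stability
        f = np.exp(g)
        f /= f.sum(axis=1, keepdims=True)
        l = -np.log(f[np.arange(len(B)), yb]).sum()
        # backward: dl/dg = f - one_hot(y), then chain rule to W and b
        d = f.copy()
        d[np.arange(len(B)), yb] -= 1.0
        dW, db = d.T @ xb, d.sum(axis=0)
        W -= lam * dW                          # update w
        b -= lam * db                          # update b
    print(f"epoch {e}: last mini-batch loss l(w,b) = {l:.3f}")
```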
[Figure. Source: Andrew Ng]
Three more things: how to compute the gradient, regularization, and momentum updates.
SGD Gradient for the Softmax Function
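The equations on these derivation slides did not survive text extraction; as a reconstruction, the standard result they cover, for one example with one-hot target $y$, scores $g_j$, and softmax outputs $f_j$, is:

```latex
% Softmax cross-entropy gradient for one example (standard result;
% reconstructed, since the slide equations were lost in extraction).
% With f_j = e^{g_j} / \sum_{j'} e^{g_{j'}} and l = -\sum_j y_j \log f_j :
\[
\frac{\partial l}{\partial g_j} = f_j - y_j,
\qquad
\frac{\partial l}{\partial w_{jk}} = (f_j - y_j)\, x_k,
\qquad
\frac{\partial l}{\partial b_j} = f_j - y_j .
\]
```

This is the quantity the SGD sketch above computes as `d = f - one_hot(y)` before applying the chain rule.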
Supervised Learning – Softmax Classifier Extract features: $x_i = [x_{i1}\, x_{i2}\, x_{i3}\, x_{i4}]$. Run the features through the classifier: $g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$ (and likewise $g_d$, $g_b$), then $f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$ (and likewise $f_d$, $f_b$). Get predictions: $\hat{y}_i = [f_c\ f_d\ f_b]$.
Supervised Machine Learning Steps [Diagram: Training: training images → image features, combined with training labels → training → learned model. Testing: test image → image features → learned model → prediction.] Slide credit: D. Hoiem
Generalization Generalization refers to the ability to correctly classify never-before-seen examples. It can be controlled by turning "knobs" that affect the complexity of the model. [Figure: training set (labels known) vs. test set (labels unknown).]
Overfitting [Figure, from C. Bishop, Pattern Recognition and Machine Learning: the same data fit three ways. f is linear: Loss(w) is high (underfitting, high bias). f is cubic: Loss(w) is low. f is a polynomial of degree 9: Loss(w) is zero! (overfitting, high variance).] A numerical version of this picture is sketched below.
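A quick numpy sketch that reproduces the Bishop-style picture numerically: fit the same noisy points with polynomials of degree 1, 3, and 9 and compare training losses (the sine data here is synthetic, in the spirit of Bishop's example):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=10)   # 10 noisy samples of a sine

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)          # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    train_loss = np.mean((y - y_hat) ** 2)
    print(f"degree {degree}: training loss = {train_loss:.4f}")
# degree 1: high loss (underfitting); degree 3: low loss;
# degree 9: ~zero loss, since 10 coefficients can interpolate
# all 10 points exactly (overfitting)
```

Zero training loss is exactly the warning sign: the degree-9 model has memorized the noise, so its error on never-before-seen points will be large.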
Questions?