1
Softmax Classifier + Generalization
Various slides from previous courses by: D.A. Forsyth (Berkeley / UIUC), I. Kokkinos (Ecole Centrale / UCL), S. Lazebnik (UNC / UIUC), S. Seitz (MSR / Facebook), J. Hays (Brown / Georgia Tech), A. Berg (Stony Brook / UNC), D. Samaras (Stony Brook), J. M. Frahm (UNC), V. Ordonez (UVA), Steve Seitz (UW).
2
Last Class Introduction to Machine Learning
Unsupervised Learning: Clustering (e.g. k-means clustering) Supervised Learning: Classification (e.g. k-nearest neighbors)
3
Today's Class Softmax Classifier (Linear Classifiers)
Generalization / Overfitting / Regularization Global Features
4
Supervised Learning vs Unsupervised Learning
Supervised: $x \to y$ (images paired with labels: cat, dog, bear). Unsupervised: $x$ only (images without labels).
5
Supervised Learning vs Unsupervised Learning
Supervised: $x \to y$ (images paired with labels: cat, dog, bear). Unsupervised: $x$ only (images without labels).
6
Supervised Learning vs Unsupervised Learning
Supervised: $x \to y$ (labels: cat, dog, bear) → Classification. Unsupervised: $x$ only → Clustering.
7
Supervised Learning – k-Nearest Neighbors
[Figure: feature space with points labeled cat, dog, and bear. For the query point, with k = 3 the nearest neighbors are cat, cat, dog, so the predicted label is cat.]
8
Supervised Learning – k-Nearest Neighbors
[Figure: same feature space, different query point. With k = 3 the nearest neighbors are bear, dog, dog, so the predicted label is dog.]
9
Supervised Learning – k-Nearest Neighbors
How do we choose the right K? How do we choose the right features? How do we choose the right distance metric?
10
Supervised Learning – k-Nearest Neighbors
How do we choose the right K? How do we choose the right features? How do we choose the right distance metric? Answer: Just choose the one combination that works best! BUT not on the test data. Instead split the training data into a "Training set" and a "Validation set" (also called "Development set").
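A minimal sketch of such a split (the helper name, split fraction, and NumPy usage are my own choices, not from the slides):

```python
import numpy as np

def train_val_split(X, labels, val_fraction=0.2, seed=0):
    """Hold out part of the training data as a validation ('development') set."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))           # shuffle example indices
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = order[:n_val], order[n_val:]
    return X[train_idx], labels[train_idx], X[val_idx], labels[val_idx]

# Try each combination (k, feature set, distance metric): train on the training
# split, measure accuracy on the validation split, and keep the best combination.
```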
11
Supervised Learning - Classification
[Figure: Training Data (images labeled dog, cat, bear, dog, cat, bear, cat, dog, bear) and Test Data (images).]
12
Supervised Learning - Classification
[Figure: Training Data (images labeled cat, dog, cat, …, bear) and Test Data (images).]
13
Supervised Learning - Classification
Training Data: each training image gets a feature vector and a label: $x_1 = [\ \ ]$, $y_1$ = cat; $x_2 = [\ \ ]$, $y_2$ = dog; $x_3 = [\ \ ]$, $y_3$ = cat; …; $x_n = [\ \ ]$, $y_n$ = bear.
14
Supervised Learning - Classification
Training Data: inputs $x_1 = [x_{11}\; x_{12}\; x_{13}\; x_{14}]$, $x_2 = [x_{21}\; x_{22}\; x_{23}\; x_{24}]$, $x_3 = [x_{31}\; x_{32}\; x_{33}\; x_{34}]$, …, $x_n = [x_{n1}\; x_{n2}\; x_{n3}\; x_{n4}]$; targets / labels / ground truth $y_1, y_2, y_3, \dots, y_n$; predictions $\hat{y}_1, \hat{y}_2, \hat{y}_3, \dots, \hat{y}_n$.
We need to find a function that maps $x$ to $y$ for any of them: $\hat{y}_i = f(x_i; \theta)$.
How do we "learn" the parameters of this function? We choose the ones that make the following quantity small: $\sum_{i=1}^{n} Cost(\hat{y}_i, y_i)$.
15
Supervised Learning – Softmax Classifier
Training Data: inputs $x_1 = [x_{11}\; x_{12}\; x_{13}\; x_{14}]$, $x_2 = [x_{21}\; x_{22}\; x_{23}\; x_{24}]$, $x_3 = [x_{31}\; x_{32}\; x_{33}\; x_{34}]$, …, $x_n = [x_{n1}\; x_{n2}\; x_{n3}\; x_{n4}]$; targets / labels / ground truth $y_1, y_2, y_3, \dots, y_n$.
16
Supervised Learning – Softmax Classifier
Training Data: inputs $x_1, x_2, x_3, \dots, x_n$ as above; targets / labels / ground truth $y_1, y_2, y_3, \dots, y_n$, each written as a vector $[\ \ ]$ with one entry per class; predictions $\hat{y}_1, \hat{y}_2, \hat{y}_3, \dots, \hat{y}_n$, also vectors $[\ \ ]$.
17
Supervised Learning – Softmax Classifier
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, ground truth $y_i = [\ \ ]$, prediction $\hat{y}_i = [f_c\; f_d\; f_b]$, where
$f_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$f_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$f_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
and the scores are then normalized with the softmax:
$f_c \leftarrow e^{f_c} / (e^{f_c} + e^{f_d} + e^{f_b})$
$f_d \leftarrow e^{f_d} / (e^{f_c} + e^{f_d} + e^{f_b})$
$f_b \leftarrow e^{f_b} / (e^{f_c} + e^{f_d} + e^{f_b})$
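A small NumPy sketch of this forward pass (the function and variable names are mine; the 4 features and 3 classes follow the slide):

```python
import numpy as np

def softmax_forward(x, W, b):
    """Class probabilities for one example.

    x: feature vector of shape (4,)
    W: weights of shape (3, 4), one row per class (cat, dog, bear)
    b: biases of shape (3,)
    """
    f = W @ x + b               # raw class scores f_c, f_d, f_b
    f = f - f.max()             # subtract the max score for numerical stability
    exp_f = np.exp(f)
    return exp_f / exp_f.sum()  # softmax: probabilities that sum to 1

# Example with arbitrary numbers:
x = np.array([0.2, 1.0, -0.5, 0.3])
W = np.random.randn(3, 4) * 0.01
b = np.zeros(3)
print(softmax_forward(x, W, b))
```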
18
How do we find a good w and b?
$x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$, $y_i = [\ \ ]$, $\hat{y}_i = [f_c(w,b)\; f_d(w,b)\; f_b(w,b)]$. We need to find $w$ and $b$ that minimize the following function $L$:
$L(w,b) = \sum_{i=1}^{n} \sum_{j=1}^{3} -y_{i,j} \log(\hat{y}_{i,j}) = \sum_{i=1}^{n} -\log(\hat{y}_{i,label}) = \sum_{i=1}^{n} -\log f_{i,label}(w,b)$
Why?
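A hedged sketch of this loss in NumPy, assuming integer labels (e.g. 0 = cat, 1 = dog, 2 = bear) and the `softmax_forward` helper sketched earlier:

```python
import numpy as np

# assumes softmax_forward(x, W, b) from the earlier sketch
def cross_entropy_loss(X, labels, W, b):
    """L(w, b): sum over examples of -log(probability assigned to the true class)."""
    loss = 0.0
    for x, label in zip(X, labels):
        probs = softmax_forward(x, W, b)
        loss += -np.log(probs[label])   # -log(y_hat_{i, label})
    return loss
```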
19
How do we find a good w and b?
Problem statement: Find $w$ and $b$ such that $L(w,b)$ is minimal.
Solution from calculus: set $\frac{d}{dw} L(w,b) = 0$ and solve for $w$; set $\frac{d}{db} L(w,b) = 0$ and solve for $b$.
20
https://courses.lumenlearning
21
How do we find a good w and b?
Problem statement: Find $w$ and $b$ such that $L(w,b)$ is minimal.
Solution from calculus: set $\frac{d}{dw} L(w,b) = 0$ and solve for $w$; set $\frac{d}{db} L(w,b) = 0$ and solve for $b$.
22
Problems with this approach:
Some functions L(w, b) are very complicated compositions of many functions, so finding their analytical derivative is tedious. Even if the function is simple to differentiate, it might not be easy to solve for w, e.g. $\frac{d}{dw} L(w,b) = e^{w} + w = 0$. How do you find w in that equation?
23
Solution: Iterative Approach: Gradient Descent (GD)
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w − lambda * (dL/dw).
[Plot: L(w) versus w, with the current point at w = 12.]
24
Solution: Iterative Approach: Gradient Descent (GD)
2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w − lambda * (dL/dw).
[Plot: L(w) versus w, with the updated point at w = 10; the step from w = 12 to w = 10 is consistent with lambda = 1/3.]
25
Gradient Descent (GD). Problem: expensive!
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    Compute: dL(w,b)/dw and dL(w,b)/db
    Update w: w = w − λ · dL(w,b)/dw
    Update b: b = b − λ · dL(w,b)/db
    Print: L(w,b)   // Useful to see if this is becoming smaller or not.
end
$L(w,b) = \sum_{i=1}^{n} -\log f_{i,label}(w,b)$
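A runnable NumPy version of this loop under the same assumptions as the sketches above (it reuses `softmax_forward`; the gradient used here is the standard softmax + cross-entropy gradient, covered in the "SGD Gradient for the Softmax Function" slides later in the deck):

```python
import numpy as np

# assumes softmax_forward(x, W, b) from the earlier sketch
def train_gd(X, labels, num_epochs=100, lam=0.01, num_classes=3):
    """Full-batch gradient descent for the softmax classifier."""
    n, d = X.shape
    W = np.random.randn(num_classes, d) * 0.01
    b = np.zeros(num_classes)
    for epoch in range(num_epochs):
        dW, db, loss = np.zeros_like(W), np.zeros_like(b), 0.0
        for x, label in zip(X, labels):    # expensive: touches all n examples per update
            probs = softmax_forward(x, W, b)
            loss += -np.log(probs[label])
            dscores = probs.copy()
            dscores[label] -= 1.0          # dL_i/dscores = probs - one_hot(label)
            dW += np.outer(dscores, x)     # dL_i/dW
            db += dscores                  # dL_i/db
        W -= lam * dW                      # update w
        b -= lam * db                      # update b
        print(epoch, loss)                 # check that the loss is shrinking
    return W, b
```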
26
Solution: (mini-batch) Stochastic Gradient Descent (SGD)
λ = 0.01
Initialize w and b randomly
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: dl(w,b)/dw and dl(w,b)/db
        Update w: w = w − λ · dl(w,b)/dw
        Update b: b = b − λ · dl(w,b)/db
        Print: l(w,b)   // Useful to see if this is becoming smaller or not.
    end
end
$l(w,b) = \sum_{i \in B} -\log f_{i,label}(w,b)$, where B is a small set of training examples (a mini-batch).
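The same sketch with mini-batches (again my own code, reusing the assumed helpers): the only change is that each update uses a small batch B instead of the whole training set.

```python
import numpy as np

# assumes softmax_forward(x, W, b) from the earlier sketch
def train_sgd(X, labels, num_epochs=100, lam=0.01, batch_size=32, num_classes=3):
    """Mini-batch stochastic gradient descent for the softmax classifier."""
    n, d = X.shape
    W = np.random.randn(num_classes, d) * 0.01
    b = np.zeros(num_classes)
    for epoch in range(num_epochs):
        order = np.random.permutation(n)             # shuffle once per epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]  # B: a small set of training examples
            dW, db, loss = np.zeros_like(W), np.zeros_like(b), 0.0
            for i in batch:
                probs = softmax_forward(X[i], W, b)
                loss += -np.log(probs[labels[i]])
                dscores = probs.copy()
                dscores[labels[i]] -= 1.0
                dW += np.outer(dscores, X[i])
                db += dscores
            W -= lam * dW                            # one update per mini-batch
            b -= lam * db
            print(epoch, start // batch_size, loss)  # loss on this mini-batch only
    return W, b
```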
27
Source: Andrew Ng
28
Three more things
How to compute the gradient, Regularization, Momentum updates
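The deck only names these topics here; as a hedged illustration (typical formulations, not necessarily the exact variants used later in the course), L2 regularization adds a penalty on the weights to the loss, and a momentum update smooths the gradient steps:

```python
import numpy as np

def l2_penalty(W, reg=1e-3):
    """L2 regularization term added to the loss: reg * sum of squared weights."""
    return reg * np.sum(W * W)

def momentum_step(param, grad, velocity, lam=0.01, mu=0.9):
    """Momentum update: keep a running 'velocity' of past gradients and step along it."""
    velocity = mu * velocity - lam * grad
    return param + velocity, velocity

# Inside the training loop one would, for example, add l2_penalty(W) to the loss,
# add 2 * reg * W to dW, and replace "W -= lam * dW" with:
#   W, v_W = momentum_step(W, dW, v_W)   # v_W initialized to np.zeros_like(W)
```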
29
SGD Gradient for the Softmax Function
30
SGD Gradient for the Softmax Function
31
SGD Gradient for the Softmax Function
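The derivation itself lives on the slide images; for reference, the standard result it leads to for the cross-entropy loss with softmax scores, in the notation used above, is

$\frac{\partial L_i}{\partial f_j} = \hat{y}_{i,j} - y_{i,j}, \qquad \frac{\partial L_i}{\partial w_{jk}} = (\hat{y}_{i,j} - y_{i,j})\, x_{ik}, \qquad \frac{\partial L_i}{\partial b_j} = \hat{y}_{i,j} - y_{i,j},$

where $y_{i,j}$ is 1 if $j$ is the true class of example $i$ and 0 otherwise. This is the quantity used as `dscores` in the SGD sketch above.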
32
Supervised Learning – Softmax Classifier
Extract features: $x_i = [x_{i1}\; x_{i2}\; x_{i3}\; x_{i4}]$.
Run features through classifier:
$f_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$f_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$f_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$
then normalize with the softmax: $f_c \leftarrow e^{f_c} / (e^{f_c} + e^{f_d} + e^{f_b})$, and likewise for $f_d$ and $f_b$.
Get predictions: $\hat{y}_i = [f_c\; f_d\; f_b]$.
33
Supervised Machine Learning Steps
[Diagram: Training: Training Images + Training Labels → Image Features → Training → Learned model. Testing: Test Image → Image Features → Learned model → Prediction.] Slide credit: D. Hoiem
34
Generalization
Generalization refers to the ability to correctly classify never-before-seen examples. It can be controlled by turning "knobs" that affect the complexity of the model. [Figure: training set (labels known) versus test set (labels unknown).]
35
Overfitting
[Figure: three fits to the same data. f is linear: Loss(w) is high (underfitting, high bias). f is cubic: Loss(w) is low. f is a polynomial of degree 9: Loss(w) is zero! (overfitting, high variance).] Credit: C. Bishop, Pattern Recognition and Machine Learning.
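A small NumPy illustration of the same idea (my own example, not the figure from Bishop): fitting polynomials of increasing degree to a few noisy points drives the training loss toward zero, even though the high-degree fit generalizes poorly.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)   # noisy training samples

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)                        # least-squares polynomial fit
    train_loss = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(degree, train_loss)   # training loss shrinks toward ~0 as the degree grows
```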
36
Questions?