Machine Learning I & II
About the Course CS6501: Computational Visual Recognition Instructor: Vicente Ordonez Email: vicente@virginia.edu Office Hour: 310 Rice Hall, Tuesday 3pm to 4pm. Only for Today: 5 to 6pm Website: http://www.cs.virginia.edu/~vicente/recognition Class Location: Olsson Hall, 005 Class Times: Tuesday-Thursday 11:00am – 12:15pm Piazza: http://piazza.com/virginia/fall2017/cs6501009/home
Teaching Assistants Tianlu Wang PhD Student Office Hours: Wednesdays 5 to 6pm. Location: Rice 430, desk 12 Siva Sitaraman MSc Student Office Hours: Fridays 3 to 4pm. Location: Rice 304 Piazza: http://piazza.com/virginia/fall2017/cs6501009/home
Grading Labs: 30% (4 labs [5%, 5%, 10%, 10%]) Paper presentation + Paper summaries: 10% Quiz: 20% Final project: 40% No Late Assignments / Honor Code Reminder
Objectives for Today Machine Learning Overview Supervised vs Unsupervised Learning Machine Learning as an Optimization Problem Softmax Classifier Stochastic Gradient Descent (SGD)
Machine Learning Machine learning is the subfield of computer science that gives "computers the ability to learn without being explicitly programmed" (a term coined by Arthur Samuel in 1959 while at IBM). It is the study of algorithms that can learn from data, in contrast to earlier Artificial Intelligence systems based on logic, e.g. "Expert Systems".
Supervised Learning vs Unsupervised Learning
Supervised learning: learn a mapping x → y from labeled examples (cat, dog, bear); this is classification.
Unsupervised learning: find structure in the inputs x alone, grouping similar images without labels; this is clustering.
Supervised Learning Examples
Classification (e.g. cat), face detection, language parsing, structured prediction.
In each case the output is a function of the input, e.g. cat = f(image), detected faces = f(image), parse tree = f(image).
Supervised Learning – k-Nearest Neighbors
With k = 3, a query image is labeled by a majority vote over its k closest training examples (cat, dog, or bear).
Example: nearest neighbors {cat, cat, dog} → predict cat; nearest neighbors {bear, dog, dog} → predict dog.
Supervised Learning – k-Nearest Neighbors
How do we choose the right k? How do we choose the right features? How do we choose the right distance metric?
Answer: just choose the combination that works best, but not on the test data. Instead, split the training data into a "training set" and a "validation set" (also called a "development set").
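A minimal k-NN sketch of the idea above, assuming NumPy feature vectors, Euclidean distance, and a majority vote; the data and variable names here are illustrative, not from the course labs:

```python
import numpy as np
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Label a query point by majority vote over its k nearest training points."""
    # Euclidean distance from the query to every training example
    dists = np.linalg.norm(train_X - query, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels (e.g. {cat, cat, dog} -> cat)
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Toy example with made-up 2-D features
train_X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [0.9, 1.0]])
train_y = np.array(["cat", "cat", "dog", "dog"])
print(knn_predict(train_X, train_y, np.array([0.1, 0.05]), k=3))  # prints "cat" (neighbors: cat, cat, dog)
```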
Unsupervised Learning – k-means clustering 1. Initially assign all images to a random cluster
Unsupervised Learning – k-means clustering 2. Compute the mean image (in feature space) for each cluster
Unsupervised Learning – k-means clustering 3. Reassign images to clusters based on similarity to cluster means
Unsupervised Learning – k-means clustering 4. Keep repeating this process until convergence
Unsupervised Learning – k-Means clustering
How do we choose the right k? How do we choose the right features? How do we choose the right distance metric? How sensitive is this method to the random initial assignment of clusters?
Answer: just choose the combination that works best, but not on the test data. Instead, split the training data into a "training set" and a "validation set" (also called a "development set").
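A minimal k-means sketch following steps 1-4 above, assuming NumPy feature vectors and Euclidean distance; the cluster count, iteration limit, and empty-cluster handling are illustrative choices:

```python
import numpy as np

def kmeans(X, k=3, num_iters=100, seed=0):
    """X: (n, d) feature vectors -> cluster assignments and cluster means."""
    rng = np.random.default_rng(seed)
    # 1. Initially assign every point to a random cluster
    assignments = rng.integers(0, k, size=len(X))
    for _ in range(num_iters):
        # 2. Compute the mean (in feature space) of each cluster
        means = np.array([X[assignments == c].mean(axis=0) if np.any(assignments == c)
                          else X[rng.integers(len(X))]  # reinitialize an empty cluster from a random point
                          for c in range(k)])
        # 3. Reassign each point to the cluster with the closest mean
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        # 4. Repeat until the assignments stop changing (convergence)
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
    return assignments, means
```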
Supervised Learning - Classification
Training data: images with known labels (cat, dog, bear, ...). Test data: images whose labels must be predicted.
Each training example is a feature vector x_i with a label y_i: (x_1, y_1 = cat), (x_2, y_2 = dog), (x_3, y_3 = cat), ..., (x_n, y_n = bear).
Supervised Learning - Classification
Training data: inputs x_1 = [x_11 x_12 x_13 x_14], ..., x_n = [x_n1 x_n2 x_n3 x_n4], targets / labels / ground truth y_i ∈ {1, 2, 3}, and predictions ŷ_i = f(x_i; θ).
We need to find a function f that maps each x_i to its y_i. How do we "learn" the parameters θ of this function? We choose the ones that make the following quantity small: Σ_{i=1..n} Cost(ŷ_i, y_i).
Supervised Learning – Linear Softmax
Training data: inputs x_1 = [x_11 x_12 x_13 x_14], x_2 = [x_21 x_22 x_23 x_24], x_3 = [x_31 x_32 x_33 x_34], ..., x_n = [x_n1 x_n2 x_n3 x_n4], with targets / labels / ground truth y_1, y_2, y_3, ..., y_n taking values in {1, 2, 3}.
Supervised Learning – Linear Softmax
Targets / labels / ground truth are one-hot vectors: [1 0 0] for cat, [0 1 0] for dog, [0 0 1] for bear.
Predictions ŷ_i are probability vectors over the three classes, e.g. ŷ_1 = [0.85 0.10 0.05], ŷ_2 = [0.40 0.45 0.05], ŷ_3 = [0.20 0.70 0.10], ..., ŷ_n = [0.40 0.25 0.35].
Supervised Learning – Linear Softmax
For one example x_i = [x_i1 x_i2 x_i3 x_i4] with label y_i = [1 0 0], the prediction is ŷ_i = [f_c f_d f_b], where
g_c = w_c1 x_i1 + w_c2 x_i2 + w_c3 x_i3 + w_c4 x_i4 + b_c
g_d = w_d1 x_i1 + w_d2 x_i2 + w_d3 x_i3 + w_d4 x_i4 + b_d
g_b = w_b1 x_i1 + w_b2 x_i2 + w_b3 x_i3 + w_b4 x_i4 + b_b
f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})
f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})
f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})
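A sketch of this linear softmax for one example with 4 features and 3 classes (cat, dog, bear); the values of x, W, and b below are illustrative placeholders:

```python
import numpy as np

def linear_softmax(x, W, b):
    """x: (4,) features; W: (3, 4) weights; b: (3,) biases -> class probabilities [f_c, f_d, f_b]."""
    g = W @ x + b                 # scores g_c, g_d, g_b (one linear function per class)
    g = g - g.max()               # subtract the max for numerical stability; the probabilities are unchanged
    exp_g = np.exp(g)
    return exp_g / exp_g.sum()    # softmax: f_j = e^{g_j} / sum_k e^{g_k}

x = np.array([0.2, 1.0, 0.5, -0.3])     # one input x_i
W = 0.01 * np.random.randn(3, 4)        # small random weights
b = np.zeros(3)
print(linear_softmax(x, W, b))          # roughly [0.33, 0.33, 0.33] before any training
```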
How do we find a good w and b?
For x_i = [x_i1 x_i2 x_i3 x_i4] with label y_i = [1 0 0] and prediction ŷ_i = [f_c(w,b) f_d(w,b) f_b(w,b)], we need to find w and b that minimize the following:
L(w,b) = Σ_{i=1..n} Σ_{j=1..3} −y_ij log(ŷ_ij) = Σ_{i=1..n} −log(ŷ_i,label) = Σ_{i=1..n} −log f_i,label(w,b)
Why?
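A sketch of this loss, assuming the linear_softmax helper above and integer class labels; because each y_i is one-hot, the double sum reduces to minus the log of the probability assigned to the true class:

```python
import numpy as np

def cross_entropy_loss(X, Y, W, b):
    """X: (n, 4) inputs; Y: (n,) integer labels in {0, 1, 2} -> L(w, b) = sum_i -log f_{i,label}."""
    loss = 0.0
    for x, label in zip(X, Y):
        probs = linear_softmax(x, W, b)   # [f_c, f_d, f_b] for this example
        loss += -np.log(probs[label])     # only the true class's probability contributes
    return loss
```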
Gradient Descent (GD)
λ = 0.01
Initialize w and b randomly
for e = 0 … num_epochs do
    Compute: dL(w,b)/dw and dL(w,b)/db   (expensive: uses all n training examples)
    Update w: w = w − λ dL(w,b)/dw
    Update b: b = b − λ dL(w,b)/db
    Print: L(w,b)   // useful to see if this is becoming smaller or not
end
where L(w,b) = Σ_{i=1..n} −log f_i,label(w,b)
Gradient Descent (GD) (idea)
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w − lambda * (dL/dw).
[Figure: the curve L(w) with the current point at w = 12.]
Gradient Descent (GD) (idea)
2. Compute the gradient (derivative) of L(w) at point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w − lambda * (dL/dw); the update moves the point downhill, here from w = 12 to w = 10 (e.g. with lambda = 1/3: 12 − (1/3)(6) = 10).
[Figure: the curve L(w) with the updated point at w = 10.]
(mini-batch) Stochastic Gradient Descent (SGD)
λ = 0.01
Initialize w and b randomly
for e = 0 … num_epochs do
    for b = 0 … num_batches do
        Compute: dl(w,b)/dw and dl(w,b)/db   (on a mini-batch B, not the full training set)
        Update w: w = w − λ dl(w,b)/dw
        Update b: b = b − λ dl(w,b)/db
        Print: l(w,b)   // useful to see if this is becoming smaller or not
    end
end
where l(w,b) = Σ_{i∈B} −log f_i,label(w,b)
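A runnable sketch of the mini-batch SGD loop above for the linear softmax classifier, assuming the linear_softmax helper from earlier; the gradient (f − y) xᵀ used inside is the one derived a few slides below, and names like num_epochs and batch_size are illustrative:

```python
import numpy as np

def sgd_train(X, Y, num_classes=3, num_epochs=10, batch_size=32, lam=0.01, seed=0):
    """Mini-batch SGD. X: (n, d) inputs; Y: (n,) integer labels; lam is the learning rate λ."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = 0.01 * rng.standard_normal((num_classes, d))   # initialize w randomly
    b = np.zeros(num_classes)                           # initialize b
    for epoch in range(num_epochs):
        order = rng.permutation(n)                      # shuffle the examples once per epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]
            dW, db, loss = np.zeros_like(W), np.zeros_like(b), 0.0
            for i in batch:
                probs = linear_softmax(X[i], W, b)      # forward pass (defined earlier)
                loss += -np.log(probs[Y[i]])
                err = probs.copy()
                err[Y[i]] -= 1.0                        # (f - y): gradient of -log f_label w.r.t. the scores g
                dW += np.outer(err, X[i])               # chain rule: dg/dW contributes x
                db += err
            W -= lam * dW                               # update w: w = w - λ dl/dw
            b -= lam * db                               # update b: b = b - λ dl/db
        print(f"epoch {epoch}: last batch loss {loss:.3f}")  # useful to see if this is getting smaller
    return W, b
```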
Source: Andrew Ng
Three more things
1. What is the form of the gradient?
2. Regularization
3. Momentum updates
What is the form of the gradient?
l(w,b) = Σ_{i∈B} −log f_i,label(w,b)
Let's assume |B| = 1; then we are interested in the following:
l(w,b) = −log f_i,label(w,b)
What is the form of the gradient?
l(w,b) = −log f_label(w,b)
l(w,b) = −log [ e^{g_label(w,b)} / Σ_j e^{g_j(w,b)} ]
d/dw l(w,b) = ( e^{g_i(w,b)} / Σ_j e^{g_j(w,b)} − y_i ) d/dw g_i(w,b) = (f_i − y_i) d/dw g_i(w,b)
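A small sketch checking the (f − y) form of the gradient numerically for one example, assuming the linear_softmax helper above; the example values and the finite-difference step h are just for illustration:

```python
import numpy as np

x = np.array([0.2, 1.0, 0.5, -0.3])
W = 0.01 * np.random.randn(3, 4)
b = np.zeros(3)
label = 0                                  # index of the true class
y = np.eye(3)[label]                       # one-hot y_i

# Analytic gradient: dl/dW = (f - y) x^T, where f = softmax(g) and g = W x + b
f = linear_softmax(x, W, b)
dW_analytic = np.outer(f - y, x)

# Numerical gradient of l(w, b) = -log f_label, for comparison
def loss(W_):
    return -np.log(linear_softmax(x, W_, b)[label])

h = 1e-5
dW_numeric = np.zeros_like(W)
for j in range(W.shape[0]):
    for k in range(W.shape[1]):
        Wp = W.copy(); Wp[j, k] += h
        Wm = W.copy(); Wm[j, k] -= h
        dW_numeric[j, k] = (loss(Wp) - loss(Wm)) / (2 * h)

print(np.max(np.abs(dW_analytic - dW_numeric)))   # should be tiny (on the order of 1e-9)
```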
Regularization
l(w,b) = Σ_{i∈B} l_i(w,b) + ½ ‖w‖²
The second term penalizes large weights and is added to the mini-batch loss.
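A sketch of the ½‖w‖² penalty and its gradient; the regularization strength reg is an illustrative extra knob not shown on the slide (the slide uses a fixed ½):

```python
import numpy as np

def l2_regularizer(W, reg=1.0):
    """Return the 0.5 * reg * ||w||^2 penalty and its gradient reg * W (biases are usually not regularized)."""
    penalty = 0.5 * reg * np.sum(W * W)
    grad = reg * W
    return penalty, grad
```

Inside the SGD loop sketched earlier, this would mean adding the penalty to the batch loss and reg * W to dW before the weight update.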
Momentum updates
Instead of this:  w = w − λ dl(w,b)/dw
we use this:      v = ρ v + λ dl(w,b)/dw,  w = w − v,  with ρ typically 0.8-0.9.
See also: https://github.com/karpathy/neuraltalk2/blob/master/misc/optim_updates.lua
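A sketch of the momentum update as a drop-in replacement for the plain SGD step, with ρ = 0.9 assumed from the 0.8-0.9 range above; the function and variable names are illustrative:

```python
import numpy as np

def momentum_step(W, v, dW, lam=0.01, rho=0.9):
    """One momentum update: v = rho * v + lam * dW, then w = w - v."""
    v = rho * v + lam * dW    # accumulate an exponentially decaying sum of past gradients
    W = W - v                 # step along the velocity instead of the raw gradient
    return W, v

# Usage inside the SGD loop sketched earlier, replacing  W -= lam * dW :
#   v_W starts as np.zeros_like(W), then each batch:  W, v_W = momentum_step(W, v_W, dW)
```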