Machine Learning I & II.

About the Course
CS6501: Computational Visual Recognition
Instructor: Vicente Ordonez
Email: vicente@virginia.edu
Office Hours: 310 Rice Hall, Tuesdays 3pm to 4pm (only for today: 5pm to 6pm)
Website: http://www.cs.virginia.edu/~vicente/recognition
Class Location: Olsson Hall, 005
Class Times: Tuesday-Thursday 11:00am - 12:15pm
Piazza: http://piazza.com/virginia/fall2017/cs6501009/home

Teaching Assistants
Tianlu Wang, PhD Student. Office Hours: Wednesdays 5 to 6pm. Location: Rice 430, desk 12.
Siva Sitaraman, MSc Student. Office Hours: Fridays 3 to 4pm. Location: Rice 304.
Piazza: http://piazza.com/virginia/fall2017/cs6501009/home

Grading
Labs: 30% (4 labs: 5%, 5%, 10%, 10%)
Paper presentation + paper summaries: 10%
Quiz: 20%
Final project: 40%
No late assignments / Honor Code reminder.

Objectives for Today
Machine Learning Overview
Supervised vs Unsupervised Learning
Machine Learning as an Optimization Problem
Softmax Classifier
Stochastic Gradient Descent (SGD)

Machine Learning
Machine learning is the subfield of computer science that gives "computers the ability to learn without being explicitly programmed" - a term coined by Arthur Samuel in 1959 while at IBM. It is the study of algorithms that can learn from data, in contrast to earlier Artificial Intelligence systems based on logic, e.g. "expert systems".

Supervised Learning vs Unsupervised Learning
Supervised learning: we are given inputs together with labels and learn a mapping x → y (e.g., images labeled cat, dog, or bear); the canonical task is classification.
Unsupervised learning: we are given only the inputs x, with no labels; the canonical task is clustering.

Supervised Learning Examples
Classification (e.g., labeling an image as cat), face detection, language parsing, structured prediction.

Supervised Learning Examples
In each case the output is a function of the input, e.g. cat = f(image); detections and parse trees are likewise predictions f(input).

Supervised Learning – k-Nearest Neighbors
With k = 3, the classifier finds the three training examples nearest to the query and takes the majority label: neighbors labeled cat, cat, dog yield the prediction cat; neighbors labeled bear, dog, dog yield the prediction dog.

Supervised Learning – k-Nearest Neighbors
How do we choose the right k? How do we choose the right features? How do we choose the right distance metric?
Answer: just choose the combination that works best! BUT not on the test data. Instead, split the training data into a "training set" and a "validation set" (also called a "development set").
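A minimal sketch of this train/validation split for choosing k, using scikit-learn; the random feature matrix, labels, and candidate k values below are placeholders, not data from the lecture:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Placeholder data: each row of X is a feature vector, y holds class ids
# (e.g., 0 = cat, 1 = dog, 2 = bear).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)

# Hold out part of the *training* data as a validation (development) set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

best_k, best_acc = None, 0.0
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    acc = knn.score(X_val, y_val)   # accuracy on the validation set, never on the test set
    if acc > best_acc:
        best_k, best_acc = k, acc

print("best k:", best_k, "validation accuracy:", best_acc)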

Unsupervised Learning – k-means clustering
1. Initially assign all images to a random cluster.
2. Compute the mean image (in feature space) for each cluster.
3. Reassign images to clusters based on similarity to the cluster means.
4. Keep repeating this process until convergence.
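A minimal NumPy sketch of these four steps, assuming the images have already been mapped to feature vectors; the random data and k = 3 below are placeholders:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))      # placeholder: one feature vector per image
k = 3

# 1. Initially assign every point to a random cluster.
assignments = rng.integers(0, k, size=len(X))

for _ in range(100):
    # 2. Compute the mean (in feature space) of each cluster.
    means = []
    for c in range(k):
        members = X[assignments == c]
        # Re-seed an empty cluster with a random point (a simple safeguard).
        means.append(members.mean(axis=0) if len(members) else X[rng.integers(len(X))])
    means = np.array(means)
    # 3. Reassign points to clusters based on distance to the cluster means.
    dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
    new_assignments = dists.argmin(axis=1)
    # 4. Keep repeating until convergence (assignments stop changing).
    if np.array_equal(new_assignments, assignments):
        break
    assignments = new_assignments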

Unsupervised Learning – k-means clustering
How do we choose the right k? How do we choose the right features? How do we choose the right distance metric? How sensitive is this method to the random initial assignment of clusters?
Answer: just choose the combination that works best! BUT not on the test data. Instead, split the training data into a "training set" and a "validation set" (also called a "development set").

Supervised Learning – Classification
The data are split into training data and test data; each example is an image labeled cat, dog, or bear.
Each training image is encoded as a feature vector $x_i$ with a corresponding target $y_i$, giving pairs $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$.

Supervised Learning – Classification
Training data: inputs $x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$ and targets / labels / ground truth $y_i \in \{1, 2, 3\}$, for $i = 1, \dots, n$.
We need to find a function that maps any input to a prediction: $\hat{y}_i = f(x_i; \theta)$.
How do we "learn" the parameters $\theta$ of this function? We choose the ones that make the following quantity small: $\sum_{i=1}^{n} \mathrm{Cost}(\hat{y}_i, y_i)$.

Supervised Learning – Linear Softmax
Training data: inputs $x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$ and targets / labels / ground truth $y_i \in \{1, 2, 3\}$.

Supervised Learning – Linear Softmax
The targets are encoded as one-hot vectors: $[1\ 0\ 0]$ for class 1, $[0\ 1\ 0]$ for class 2, $[0\ 0\ 1]$ for class 3. The model's predictions are probability vectors over the three classes, e.g. $[0.85\ 0.10\ 0.05]$, $[0.40\ 0.45\ 0.05]$, $[0.20\ 0.70\ 0.10]$, $[0.40\ 0.25\ 0.35]$.

Supervised Learning – Linear Softmax
For an input $x_i = [x_{i1}\ x_{i2}\ x_{i3}\ x_{i4}]$ with target $y_i = [1\ 0\ 0]$, the prediction is $\hat{y}_i = [f_c\ f_d\ f_b]$, computed from one linear score per class:

$g_c = w_{c1} x_{i1} + w_{c2} x_{i2} + w_{c3} x_{i3} + w_{c4} x_{i4} + b_c$
$g_d = w_{d1} x_{i1} + w_{d2} x_{i2} + w_{d3} x_{i3} + w_{d4} x_{i4} + b_d$
$g_b = w_{b1} x_{i1} + w_{b2} x_{i2} + w_{b3} x_{i3} + w_{b4} x_{i4} + b_b$

$f_c = e^{g_c} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_d = e^{g_d} / (e^{g_c} + e^{g_d} + e^{g_b})$
$f_b = e^{g_b} / (e^{g_c} + e^{g_d} + e^{g_b})$
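A small NumPy sketch of this forward pass for one 4-dimensional input and the three classes (cat, dog, bear); the input and weight values are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
x_i = np.array([0.2, -1.0, 0.5, 3.0])     # one input with 4 features (made up)
W = rng.normal(scale=0.1, size=(3, 4))    # one row of weights per class: c, d, b
b = np.zeros(3)                           # one bias per class

g = W @ x_i + b                           # linear scores g_c, g_d, g_b
g = g - g.max()                           # standard numerical-stability shift (not on the slide)
f = np.exp(g) / np.exp(g).sum()           # softmax: f_c, f_d, f_b
print(f, f.sum())                         # the predicted probabilities sum to 1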

How do we find a good w and b?
For input $x_i$ with target $y_i = [1\ 0\ 0]$ and prediction $\hat{y}_i = [f_c(w,b)\ f_d(w,b)\ f_b(w,b)]$, we need to find $w$ and $b$ that minimize the following:

$L(w,b) = \sum_{i=1}^{n} \sum_{j=1}^{3} -y_{i,j} \log(\hat{y}_{i,j}) = \sum_{i=1}^{n} -\log(\hat{y}_{i,\mathrm{label}}) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$

Why?
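A tiny numerical check of the equivalence above, i.e. that the full cross-entropy sum reduces to the negative log-probability of the true class; the probability values are the hypothetical ones from the earlier slide:

import numpy as np

y_i = np.array([1.0, 0.0, 0.0])              # one-hot target: this example is a cat
y_hat = np.array([0.85, 0.10, 0.05])         # predicted probabilities from the softmax

loss_full = -np.sum(y_i * np.log(y_hat))     # sum_j -y_ij * log(y_hat_ij)
loss_short = -np.log(y_hat[np.argmax(y_i)])  # -log(y_hat_i,label)
print(loss_full, loss_short)                 # both give the same number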

Gradient Descent (GD)

$L(w,b) = \sum_{i=1}^{n} -\log f_{i,\mathrm{label}}(w,b)$

Initialize w and b randomly.
$\lambda = 0.01$
for e = 0, num_epochs do
    Compute: $dL(w,b)/dw$ and $dL(w,b)/db$   (expensive: uses all n training examples)
    Update w: $w = w - \lambda\, dL(w,b)/dw$
    Update b: $b = b - \lambda\, dL(w,b)/db$
    Print: $L(w,b)$   // useful to see if this is becoming smaller or not
end

Gradient Descent (GD) (idea)
1. Start with a random value of w (e.g. w = 12).
2. Compute the gradient (derivative) of L(w) at the point w = 12 (e.g. dL/dw = 6).
3. Recompute w as: w = w - lambda * (dL/dw).

Gradient Descent (GD) (idea)
After one update w has moved (e.g. to w = 10); steps 2 and 3 are then repeated from the new point until the loss stops decreasing.
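A tiny numerical illustration of these steps, using a made-up loss L(w) = (w - 6)^2 / 2, which gives dL/dw = 6 at w = 12 as in the slide, and reaches w = 10 after one update when lambda = 1/3 (all of these numbers are only for illustration):

def L(w):
    return (w - 6) ** 2 / 2   # made-up loss with its minimum at w = 6

def dL_dw(w):
    return w - 6              # its derivative

w = 12.0                      # 1. start with a random value of w
lam = 1.0 / 3.0
for step in range(10):
    grad = dL_dw(w)           # 2. compute the gradient at the current w
    w = w - lam * grad        # 3. recompute w as w - lambda * (dL/dw)
    print(step, round(w, 3), round(L(w), 3))   # w moves 12 -> 10 -> 8.667 -> ... toward 6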

(mini-batch) Stochastic Gradient Descent (SGD)

$l(w,b) = \sum_{i \in B} -\log f_{i,\mathrm{label}}(w,b)$

Initialize w and b randomly.
$\lambda = 0.01$
for e = 0, num_epochs do
    for b = 0, num_batches do
        Compute: $dl(w,b)/dw$ and $dl(w,b)/db$   (only over the examples in the current mini-batch B)
        Update w: $w = w - \lambda\, dl(w,b)/dw$
        Update b: $b = b - \lambda\, dl(w,b)/db$
        Print: $l(w,b)$   // useful to see if this is becoming smaller or not
    end
end
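Putting the pieces together, a compact NumPy sketch of this mini-batch loop for the linear softmax classifier; the synthetic data, learning rate, batch size, and epoch count are placeholders, and the gradient formula used here (f - y, times the input) is the one derived on the "What is the form of the gradient?" slides below:

import numpy as np

rng = np.random.default_rng(0)
n, d, num_classes = 600, 4, 3
X = rng.normal(size=(n, d))                     # placeholder features
labels = rng.integers(0, num_classes, size=n)   # placeholder class ids
Y = np.eye(num_classes)[labels]                 # one-hot targets

W = 0.01 * rng.normal(size=(num_classes, d))    # initialize w and b randomly
b = np.zeros(num_classes)
lam, batch_size, num_epochs = 0.01, 32, 100

def softmax(G):
    G = G - G.max(axis=1, keepdims=True)        # numerical stability
    E = np.exp(G)
    return E / E.sum(axis=1, keepdims=True)

for epoch in range(num_epochs):
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], Y[idx]
        F = softmax(xb @ W.T + b)               # predictions for the mini-batch
        dG = F - yb                             # dl/dg = f - y (see the gradient slide)
        dW = dG.T @ xb                          # dl/dW for this mini-batch
        db = dG.sum(axis=0)                     # dl/db for this mini-batch
        W -= lam * dW                           # update w
        b -= lam * db                           # update b
    # Print the full loss occasionally to see whether it is becoming smaller.
    if epoch % 20 == 0:
        loss = -np.log(softmax(X @ W.T + b)[np.arange(n), labels]).sum()
        print(epoch, loss)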

(Figure source: Andrew Ng.)

Three more things
1. What is the form of the gradient?
2. Regularization
3. Momentum updates

What is the form of the gradient?

$l(w,b) = \sum_{i \in B} -\log f_{i,\mathrm{label}}(w,b)$

Let's assume $|B| = 1$; then we are interested in the following:

$l(w,b) = -\log f_{i,\mathrm{label}}(w,b)$

What is the form of the gradient?

$l(w,b) = -\log f_{\mathrm{label}}(w,b) = -\log \frac{e^{g_{\mathrm{label}}(w,b)}}{\sum_i e^{g_i(w,b)}}$

$\frac{d}{dw}\, l(w,b) = \left( \frac{e^{g_i(w,b)}}{\sum_i e^{g_i(w,b)}} - y_i \right) \frac{d}{dw}\, g(w,b)$

In other words, the gradient of the loss with respect to each score $g_i$ is simply the predicted probability minus the one-hot target, $f_i - y_i$; the chain rule then multiplies this by the gradient of the linear score $g$ with respect to $w$.
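A quick way to sanity-check the f - y form is to compare it against a finite-difference estimate of the gradient with respect to the scores g; a minimal sketch for a single example with made-up scores:

import numpy as np

def softmax(g):
    g = g - g.max()
    e = np.exp(g)
    return e / e.sum()

def loss(g, y):
    return -np.log(softmax(g)[np.argmax(y)])   # -log f_label

g = np.array([2.0, -1.0, 0.5])                 # hypothetical scores g_c, g_d, g_b
y = np.array([1.0, 0.0, 0.0])                  # one-hot target

analytic = softmax(g) - y                      # dl/dg = f - y
numeric = np.zeros_like(g)
eps = 1e-6
for j in range(len(g)):
    g_plus, g_minus = g.copy(), g.copy()
    g_plus[j] += eps
    g_minus[j] -= eps
    numeric[j] = (loss(g_plus, y) - loss(g_minus, y)) / (2 * eps)

print(analytic)
print(numeric)    # the two should agree to several decimal places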

Regularization
An L2 penalty on the weights is added to the loss so that large weights are discouraged:

$l_{\mathrm{reg}}(w,b) = \sum_{i \in B} l_i(w,b) + \frac{1}{2} \|w\|^2$
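In code, the penalty just adds a term to the loss and a corresponding term w to its gradient; a minimal sketch (the 1/2 factor follows the slide, and real code usually also multiplies the penalty by a regularization strength):

import numpy as np

def regularized_loss(data_loss, W):
    # l_reg = sum_i l_i(w, b) + (1/2) * ||w||^2
    return data_loss + 0.5 * np.sum(W ** 2)

def regularized_grad(data_grad_W, W):
    # The gradient of (1/2) * ||W||^2 with respect to W is simply W.
    return data_grad_W + W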

Momentum updates
Instead of the plain update

$w = w - \lambda\, \frac{dl(w,b)}{dw}$

we use

$v = \rho v + \lambda\, \frac{dl(w,b)}{dw}$
$w = w - v$

with $\rho$ typically 0.8-0.9.
See also: https://github.com/karpathy/neuraltalk2/blob/master/misc/optim_updates.lua
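As a sketch, the momentum update keeps a velocity vector v that accumulates past gradients; rho = 0.9 below sits inside the 0.8-0.9 range mentioned on the slide, and the gradient values are hypothetical:

import numpy as np

def momentum_step(w, v, grad, lam=0.01, rho=0.9):
    """One parameter update with momentum: v = rho*v + lam*grad, then w = w - v."""
    v = rho * v + lam * grad
    w = w - v
    return w, v

# Usage: keep v between iterations, starting from zeros.
w = np.zeros(4)
v = np.zeros_like(w)
grad = np.array([0.5, -0.2, 0.1, 0.0])   # hypothetical gradient dl/dw
w, v = momentum_step(w, v, grad)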