Machine Learning & Data Mining CS/CNS/EE 155 Lecture 2: Review Part 2
Recap: Basic Recipe
Training Data: $S = \{(x_i, y_i)\}_{i=1}^N$
Model Class: Linear Models, $f(x|w,b) = w^\top x - b$
Loss Function: Squared Loss, $L(y, \hat{y}) = (y - \hat{y})^2$
Learning Objective: Optimization Problem, $\mathrm{argmin}_{w,b} \sum_{i=1}^N L(y_i, f(x_i|w,b))$
Recap: Bias-Variance Trade-off
[Figure: three fits illustrating the trade-off between bias and variance]
Recap: Complete Pipeline
Training Data → Model Class(es) → Loss Function → Cross Validation & Model Selection → Profit!
Today
Beyond Basic Linear Models:
– Support Vector Machines
– Logistic Regression
– Feed-forward Neural Networks
– Different ways to interpret models
Different Evaluation Metrics
Hypothesis Testing
[Plot: 0/1 Loss and Squared Loss as a function of f(x), centered at the target y]
How do we compute the gradient of the 0/1 Loss?
Recap: 0/1 Loss is Intractable
The 0/1 Loss is flat or discontinuous everywhere, so it is VERY difficult to optimize using gradient descent.
Solution: optimize a smooth surrogate loss.
– Today: Hinge Loss (…eventually)
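To make the problem concrete, here is a minimal NumPy sketch (labels assumed in {-1, +1}, the score grid is hypothetical) contrasting the flat 0/1 loss with the hinge surrogate introduced below:

```python
import numpy as np

def zero_one_loss(y, fx):
    # 1 if the sign of the prediction disagrees with the label, else 0.
    return (np.sign(fx) != y).astype(float)

def hinge_loss(y, fx):
    # Convex surrogate: linear penalty whenever the margin y*f(x) < 1.
    return np.maximum(0.0, 1.0 - y * fx)

fx = np.linspace(-2.0, 2.0, 9)   # candidate prediction scores
y = 1.0                          # a positive example
print(zero_one_loss(y, fx))      # flat at 0 or 1 -> gradient is 0 almost everywhere
print(hinge_loss(y, fx))         # slopes toward the correct side -> useful (sub)gradient
```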
Support Vector Machines aka Max-Margin Classifiers
Which Line is the Best Classifier?
[Figure: several candidate separating lines for the same linearly separable data]
Hyperplane Distance
A line is 1D, a plane is 2D; a hyperplane can be many-D (includes lines and planes) and is defined by $(w,b)$: $\{x : w^\top x = b\}$.
Distance from x to the hyperplane: $|w^\top x - b| \,/\, \|w\|$
Signed Distance: $(w^\top x - b) \,/\, \|w\|$
The linear model $f(x|w,b) = w^\top x - b$ is an un-normalized signed distance!
Distance from the origin to the hyperplane $= b/\|w\|$
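A small sketch of these distances, using a made-up 2-D hyperplane (a line) with $w = (3, 4)$ and $b = 5$:

```python
import numpy as np

# Hypothetical hyperplane {x : w @ x = b} in 2-D.
w = np.array([3.0, 4.0])
b = 5.0

def signed_distance(x, w, b):
    # (w @ x - b) / ||w||: positive on the side w points toward.
    return (np.dot(w, x) - b) / np.linalg.norm(w)

x = np.array([2.0, 1.0])
print(signed_distance(x, w, b))   # normalized signed distance: 1.0
print(np.dot(w, x) - b)           # f(x) = w^T x - b: un-normalized version: 5.0
print(b / np.linalg.norm(w))      # distance from the origin to the hyperplane: 1.0
```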
Max Margin Classifier (Support Vector Machine)
If the data is "linearly separable", choose the separating hyperplane that maximizes the "margin" (the distance to the closest training examples).
Better generalization to unseen test examples (beyond scope of course).
[Figure: max-margin hyperplane with margin boundaries]
Soft-Margin Support Vector Machine
Trade off the size of the "margin" vs. the size of the margin violations (C controls the trade-off):
$\min_{w,b,\xi} \ \frac{1}{2}\|w\|^2 + C\sum_i \xi_i \quad \text{s.t.} \quad y_i(w^\top x_i - b) \ge 1 - \xi_i,\ \ \xi_i \ge 0$
[Figure: soft-margin SVM with margin violations]
[Plot: 0/1 Loss and Hinge Loss as a function of f(x), centered at the target y]
Hinge Loss: $L(y, f(x)) = \max(0,\ 1 - y\,f(x))$
Soft-margin SVM = Regularization + Hinge Loss:
$\min_{w,b} \ \frac{1}{2}\|w\|^2 + C\sum_i \max(0,\ 1 - y_i f(x_i|w,b))$
Hinge Loss vs Squared Loss
[Plot: 0/1 Loss, Squared Loss, and Hinge Loss as a function of f(x), centered at the target y]
Support Vector Machine: 2 Interpretations
1. Geometric: margin vs. margin violations
2. Loss minimization: model complexity vs. hinge loss
They are equivalent!
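A minimal sketch of the loss-minimization view, assuming a sub-gradient descent solver (one simple choice among many; the lecture does not prescribe a solver) and a made-up toy dataset:

```python
import numpy as np

def train_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Sub-gradient descent on the loss-minimization view of the soft-margin SVM:
       min over (w, b) of  0.5*||w||^2 + C * sum_i max(0, 1 - y_i*(w @ x_i - b))."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        active = y * (X @ w - b) < 1          # examples violating the margin
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = C * y[active].sum()          # d/db of -y*(w @ x - b) is +y
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Tiny linearly separable toy set (made up for illustration).
X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_svm(X, y)
print(np.sign(X @ w - b))   # should recover the training labels
```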
Logistic Regression aka “Log-Linear” Models
Logistic Regression ("Log-Linear" Model)
$P(y|x) = \frac{1}{1 + e^{-y f(x)}} = \sigma(y\,f(x))$
Also known as the sigmoid function: $\sigma(z) = \frac{1}{1+e^{-z}}$
[Plot: P(y|x) as a function of y·f(x)]
Maximum Likelihood Training
Training set: $S = \{(x_i, y_i)\}_{i=1}^N$
Maximum Likelihood: $\mathrm{argmax}_{w,b} \prod_{i=1}^N P(y_i|x_i)$
– Why a product? Each (x,y) in S is sampled independently!
– See recitation next Wednesday!
Why Use Logistic Regression?
SVMs are often better at classification – at least if there is a margin…
But are SVM scores calibrated probabilities? Does an increase in SVM score correspond to a similar increase in P(y=+1|x)? No – SVM scores are not well calibrated!
Logistic regression gives calibrated probabilities.
[Plot: P(y=+1) vs. f(x); the figure shown is for Boosted Decision Trees, but SVMs have a similar effect]
Log Loss
Maximizing the likelihood = minimizing the Log Loss:
$\mathrm{argmin}_{w,b} \sum_{i=1}^N \ln\!\left(1 + e^{-y_i f(x_i|w,b)}\right)$
Solve using Gradient Descent!
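A minimal gradient-descent sketch for this objective (labels assumed in {-1, +1}; the toy data is made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic(X, y, lr=0.1, epochs=1000):
    """Gradient descent on the log loss  sum_i ln(1 + exp(-y_i * f(x_i)))
       with f(x|w,b) = w @ x - b and labels y in {-1, +1}."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        f = X @ w - b
        coeff = -y * sigmoid(-y * f)          # dLoss_i/df_i, by the chain rule
        w -= lr * (X.T @ coeff) / len(y)
        b -= lr * (-coeff.sum()) / len(y)     # df/db = -1
    return w, b

X = np.array([[2.0, 2.0], [1.5, 2.5], [-1.0, -1.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_logistic(X, y)
print(sigmoid(X @ w - b))   # calibrated estimates of P(y = +1 | x)
```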
Log Loss vs Hinge Loss
[Plot: 0/1 Loss, Hinge Loss, and Log Loss as a function of f(x)]
Logistic Regression: Two Interpretations
1. Maximizing likelihood
2. Minimizing Log Loss
They are equivalent!
Feed-Forward Neural Networks aka Not Quite Deep Learning
1-Layer Neural Network
1 Neuron: takes input x, outputs y.
$f(x|w,b) = w^\top x - b = w_1 x_1 + w_2 x_2 + w_3 x_3 - b$
$y = \sigma(f(x))$, where σ can be sigmoid, tanh, or rectilinear (ReLU)
~Logistic Regression! Train with gradient descent.
[Diagram: single "neuron" mapping x to y]
2-Layer Neural Network
2 layers of neurons:
– 1st (hidden) layer takes input x – this is where the non-linearity comes from!
– 2nd layer takes the output of the 1st layer
Can approximate arbitrary functions, provided the hidden layer is large enough – a "fat" 2-layer network.
[Diagram: input x, hidden layer of neurons, output y]
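A minimal sketch of the forward pass, using the lecture's $\sigma(w^\top x - b)$ convention for each neuron (the weights and input are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_forward(x, W1, b1, w2, b2):
    """Forward pass: a hidden layer of sigmoid neurons feeding one output
       neuron, each computing sigma(w @ input - b)."""
    h = sigmoid(W1 @ x - b1)      # hidden layer -- the source of non-linearity
    return sigmoid(w2 @ h - b2)   # output neuron

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0, 0.5])
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # 4 hidden neurons
w2, b2 = rng.normal(size=4), 0.0
print(two_layer_forward(x, W1, b1, w2, b2))
```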
Aside: Deep Neural Networks
Why prefer deep over a "fat" 2-layer network?
– Compact model (the equivalent "fat" model can be exponentially large)
– Easier to train?
Training Neural Networks
Gradient Descent! Even for deep networks (*more complicated).
Parameters: $(w_{11}, b_{11}, w_{12}, b_{12}, w_2, b_2)$, with $f(x|w,b) = w^\top x - b$ and $y = \sigma(f(x))$ at each neuron.
Backpropagation = Gradient Descent (lots of chain rules)
[Diagram: 2-layer network with two hidden neurons]
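A minimal backpropagation sketch for the 2-layer network above. The squared loss is my choice for illustration (the lecture only says "lots of chain rules"); W1 stacks the slide's $(w_{11}, w_{12})$ and b1 stacks $(b_{11}, b_{12})$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, W1, b1, w2, b2):
    """Gradients of the squared loss 0.5*(yhat - y)^2 for the 2-layer network,
       computed by backpropagation: the chain rule applied layer by layer."""
    # Forward pass, caching the activations needed on the way back.
    h = sigmoid(W1 @ x - b1)
    yhat = sigmoid(w2 @ h - b2)
    # Backward pass, using sigma'(z) = sigma(z) * (1 - sigma(z)).
    delta2 = (yhat - y) * yhat * (1 - yhat)        # dLoss/d(output pre-activation)
    grad_w2, grad_b2 = delta2 * h, -delta2         # f = w @ h - b, so df/db = -1
    delta1 = (w2 * delta2) * h * (1 - h)           # propagate back through layer 2
    grad_W1, grad_b1 = np.outer(delta1, x), -delta1
    return grad_W1, grad_b1, grad_w2, grad_b2
```

Each parameter then takes a step in the negative gradient direction, exactly as in ordinary gradient descent.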
Today
Beyond Basic Linear Models:
– Support Vector Machines
– Logistic Regression
– Feed-forward Neural Networks
– Different ways to interpret models
Different Evaluation Metrics
Hypothesis Testing
Evaluation
– 0/1 Loss (Classification)
– Squared Loss (Regression)
– Anything else?
Example: Cancer Prediction

Loss Function       | Has Cancer  | Doesn't Have Cancer
Predicts Cancer     | Low         | Medium
Predicts No Cancer  | OMG Panic!  | Low
(rows: model's prediction; columns: patient's true condition)

We value positives & negatives differently – we care much more about positives.
This is a "Cost Matrix"; 0/1 Loss is the special case where every mistake costs the same.
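A minimal sketch of evaluating with a cost matrix. The numeric costs are illustrative stand-ins for the slide's "Low / Medium / OMG Panic!" entries:

```python
import numpy as np

# Rows: model's prediction (0 = cancer, 1 = no cancer);
# columns: patient's true condition. Values are made up for illustration.
cost = np.array([[0.0,   1.0],     # predict cancer:    low / medium cost
                 [100.0, 0.0]])    # predict no cancer: very high / low cost

def total_cost(pred, truth, cost):
    # Sum the cost-matrix entry for each (prediction, truth) pair.
    return cost[pred, truth].sum()

pred  = np.array([0, 1, 1])   # predictions for three patients
truth = np.array([0, 0, 1])   # the middle patient's cancer was missed
print(total_cost(pred, truth, cost))   # 0 + 100 + 0 = 100

# 0/1 Loss is the special case where every mistake costs exactly 1:
zero_one = np.array([[0.0, 1.0],
                     [1.0, 0.0]])
print(total_cost(pred, truth, zero_one))   # 1
```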
Precision & Recall
Precision = TP/(TP + FP)
Recall = TP/(TP + FN)
(TP = True Positive, TN = True Negative, FP = False Positive, FN = False Negative)

Counts              | Has Cancer | Doesn't Have Cancer
Predicts Cancer     | 20         | 30
Predicts No Cancer  | 5          | 70
(rows: model's prediction; columns: patient's true condition)

These measures care more about the positives!
F1 = 2/(1/P + 1/R) (the harmonic mean of precision and recall)
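Plugging in the counts from the table above:

```python
# Counts from the table (rows: prediction, columns: true condition).
TP, FP = 20, 30   # predicted cancer:    20 actually have it, 30 do not
FN, TN = 5, 70    # predicted no cancer:  5 actually have it, 70 do not

precision = TP / (TP + FP)           # 20/50 = 0.4
recall    = TP / (TP + FN)           # 20/25 = 0.8
f1 = 2 / (1/precision + 1/recall)    # harmonic mean ~= 0.533
print(precision, recall, f1)
```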
Example: Search Query
Rank webpages by relevance.
Ranking Measures
Predict a ranking of webpages, sorted by f(x|w,b); users only look at the top 4.
– Fraction of the top 4 that are relevant = 1/2 (Precision@4)
– Fraction of all relevant results that appear in the top 4 = 2/3 (Recall@4)
These measures focus on the top of the ranking only!
[Figure: example ranking with the top 4 highlighted]
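A minimal sketch of these two measures, with a made-up relevance vector chosen so the numbers match the slide (3 relevant pages, 2 of them in the top 4):

```python
import numpy as np

# Hypothetical relevance labels for pages sorted by f(x|w,b), best first.
ranking = np.array([1, 0, 1, 0, 1, 0, 0])   # 1 = relevant, 0 = not
k = 4

prec_at_k = ranking[:k].sum() / k               # 2/4 = 1/2: fraction of top 4 relevant
rec_at_k  = ranking[:k].sum() / ranking.sum()   # 2/3: fraction of relevant in top 4
print(prec_at_k, rec_at_k)
```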
Pairwise Preferences
[Figure: example ranking with 4 pairwise agreements and 2 pairwise disagreements between the predicted order and the true relevance]
ROC-Area & Average Precision
ROC-Area: area under the ROC curve = fraction of pairwise agreements.
Average Precision: area under the Precision-Recall curve = average of the precision measured at each positive example.
[Example figure: a ranking with ROC-Area = 0.5]
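A minimal sketch of the pairwise-agreement view of ROC-Area, on made-up scores and labels:

```python
import numpy as np

def roc_area(scores, labels):
    """ROC-Area computed directly as the fraction of (positive, negative)
       pairs that the scores order correctly; ties count as half."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    agree = (pos[:, None] > neg[None, :]).sum()
    ties  = (pos[:, None] == neg[None, :]).sum()
    return (agree + 0.5 * ties) / (len(pos) * len(neg))

scores = np.array([0.9, 0.4, 0.7, 0.2])   # made-up ranking scores
labels = np.array([1, 1, 0, 0])
print(roc_area(scores, labels))   # 3 of 4 pairs ordered correctly -> 0.75
```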
Summary: Evaluation Measures
Different evaluation measures suit different scenarios.
Often a large focus on getting the positives right:
– Large cost of mis-predicting cancer
– Relevant webpages are rare
Today
Beyond Basic Linear Models:
– Support Vector Machines
– Logistic Regression
– Feed-forward Neural Networks
– Different ways to interpret models
Different Evaluation Metrics
Hypothesis Testing
Uncertainty of Evaluation
Model 1: 0.22 Loss on Cross Validation
Model 2: 0.25 Loss on Cross Validation
Which is better? What does "better" mean? The true loss on unseen test examples.
– Model 1 might be better…
– …or we might not have enough data to distinguish them.
Uncertainty of Evaluation
Model 1: 0.22 Loss on Cross Validation
Model 2: 0.25 Loss on Cross Validation
The validation set is finite – it is sampled from the "true" distribution P(x,y) – so there is uncertainty in these estimates.
Uncertainty of Evaluation
Model 1: 0.22 Loss on Cross Validation
Model 2: 0.25 Loss on Cross Validation
[Figure: sampling distributions of the Model 1 and Model 2 loss estimates]
Gaussian Confidence Intervals
[Figure: confidence intervals for Model 1 and Model 2 loss at 50, 250, 1000, and 5000 points – the intervals shrink as the validation set grows]
See Recitation Next Wednesday!
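A minimal sketch of such an interval: treat the mean of i.i.d. per-example losses as approximately Gaussian (central limit theorem) and report mean ± 1.96 standard errors. The simulated 0/1 losses below are made up to mirror the 0.22 vs. 0.25 example:

```python
import numpy as np

def gaussian_ci(losses, z=1.96):
    """Approximate 95% confidence interval for the true expected loss."""
    mean = losses.mean()
    sem = losses.std(ddof=1) / np.sqrt(len(losses))   # standard error of the mean
    return mean - z * sem, mean + z * sem

rng = np.random.default_rng(0)
losses1 = rng.binomial(1, 0.22, size=250).astype(float)   # Model 1's 0/1 losses
losses2 = rng.binomial(1, 0.25, size=250).astype(float)   # Model 2's 0/1 losses
print(gaussian_ci(losses1))   # at 250 points the intervals are wide and overlap,
print(gaussian_ci(losses2))   # so the two models may be indistinguishable
```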
Next Week
– Regularization
– Lasso
– Recent Applications
Next Wednesday: recitation on Probability & Hypothesis Testing