Machine Learning Week 1, Lecture 2

Recap: Supervised Learning. A data set and a learning algorithm, searching a hypothesis set, produce a hypothesis h with h(x) ≈ f(x) that approximates the unknown target f. Linear models: a hyperplane with normal vector w splits the input space into the halfspace > 0 and the halfspace < 0, used for both classification and regression.

Finding the best hyperplane is NP-hard in general. Assume the data is linearly separable!!! Then the perceptron finds a separating hyperplane, and the problem is convex.

Today: Convex optimization – convex sets, convex functions. Logistic regression – maximum likelihood, gradient descent. Maximum likelihood and linear regression.

Convex Optimization. A general optimization problem is very hard (if solvable at all)!!! For convex optimization problems, theoretical (polynomial-time) and practical solutions exist (most of the time).

Convex Sets. A set is convex if the line segment from any x to any y in the set also lies entirely in the set; a non-convex set violates this.
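In symbols (standard definition; the slide's own formula did not survive the transcript):

C \subseteq \mathbb{R}^n \text{ is convex} \iff \theta x + (1-\theta)y \in C \quad \text{for all } x, y \in C \text{ and } \theta \in [0,1]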

Convex Sets. The union of convex sets may not be convex; the intersection of convex sets is always convex.

Convex Functions. A function is convex if the chord between (x, f(x)) and (y, f(y)) lies above the graph. f is concave if −f is convex. Quiz (for the plotted example): concave or convex? Both.
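The defining inequality, in standard notation (reconstructed, not verbatim from the slide):

f(\theta x + (1-\theta) y) \le \theta f(x) + (1-\theta) f(y) \quad \text{for all } x, y \text{ and } \theta \in [0,1]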

Differentiable Convex Functions. A differentiable f is convex if its graph lies above every tangent line: f(y) ≥ f(x) + f′(x)(y − x) for all x, y.
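In the multivariate case the first-order condition reads (standard form, reconstructed):

f(y) \ge f(x) + \nabla f(x)^\top (y - x) \quad \text{for all } x, y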

Twice Differentiable Convex Functions. f is convex if its Hessian is positive semidefinite for all x. A real symmetric matrix A is positive semidefinite if xᵀAx ≥ 0 for all nonzero x. In 1D this reduces to f″(x) ≥ 0.

Simple 2D Example

More Examples. Quadratic functions: convex if A is positive semidefinite. Affine functions: always convex.
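Written out (standard forms; the exact expressions on the slide were lost):

f(x) = \tfrac{1}{2}\, x^\top A x + b^\top x + c \;(A \text{ symmetric}), \quad \nabla^2 f(x) = A \succeq 0 \;\Rightarrow\; f \text{ convex}
f(x) = a^\top x + b, \quad \nabla^2 f(x) = 0 \;\Rightarrow\; f \text{ both convex and concave}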

Convexity of Linear Regression. The least-squares cost is a quadratic function, convex if A is positive semidefinite. Here A is real and symmetric, and clearly positive semidefinite.
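Concretely, a plausible reconstruction of the slide's argument in standard least-squares notation:

E(w) = \|Xw - y\|^2, \qquad \nabla^2 E(w) = 2\, X^\top X, \qquad z^\top (X^\top X)\, z = \|Xz\|^2 \ge 0 \;\text{ for all } z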

Epigraph. Connection between convex sets and convex functions: f is convex if epi(f) is a convex set.

Sublevel Sets. For a convex function f, define the α-sublevel set; it is convex.
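The definition referred to (standard notation, reconstructed):

C_\alpha = \{\, x \in \operatorname{dom} f \;:\; f(x) \le \alpha \,\}, \quad \text{convex whenever } f \text{ is convex}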

Convex Optimization. The objective f and the inequality constraints g are convex, the equality constraints h are affine. Local minima are global minima.
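The standard form of such a problem (my notation, not copied from the slide):

\min_x \; f(x) \quad \text{s.t. } g_i(x) \le 0, \; h_j(x) = 0, \qquad f, g_i \text{ convex}, \; h_j \text{ affine}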

Examples of Convex Optimization Linear Programming Quadratic Programming (P is positive semidefinite)
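In symbols (standard formulations, assumed rather than taken from the slide):

\text{LP:}\quad \min_x \; c^\top x \;\; \text{s.t. } Ax \le b \qquad \text{QP:}\quad \min_x \; \tfrac{1}{2}\, x^\top P x + q^\top x \;\; \text{s.t. } Ax \le b, \; P \succeq 0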

Summary. Rockafellar stated, in his 1993 SIAM Review survey paper: “In fact the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity.” Convex GOOD!

Estimating Probabilities. The probability of getting cancer given your situation. The probability that AGF wins against Viborg given the last 5 results. The probability that a loan is not paid back, as a function of credit worthiness. The probability of a student getting an A in Machine Learning given his grades. The data consists of actual events, not probabilities, e.g. some students who failed and some who did not…

Breast Cancer. Input features: sample code number (id number), clump thickness, uniformity of cell size, uniformity of cell shape, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli, mitoses. Target: benign or malignant. Goal: predict the probability of benign and malignant on future patients.

Maximum Likelihood. A biased coin (bias θ = probability of heads). Flip it n times independently (Bernoulli trials) and count the number of heads, k. Fix θ: what is the probability of seeing the data D? This is the likelihood of the data. Take logs. After seeing the data, what can we infer about θ?
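The likelihood and its log, in standard notation (reconstructed; the slide's formulas were lost):

P(D \mid \theta) = \theta^{k}\,(1-\theta)^{\,n-k}, \qquad \log P(D \mid \theta) = k \log \theta + (n-k)\log(1-\theta)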

Maximum Likelihood. Maximizing the likelihood is equivalent to minimizing the negative log-likelihood of the data (log is monotone). Compute the gradient and solve for zero.
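Carrying out that computation for the coin example (standard derivation, not verbatim from the slide):

\frac{d}{d\theta}\Bigl[-k\log\theta - (n-k)\log(1-\theta)\Bigr] = -\frac{k}{\theta} + \frac{n-k}{1-\theta} = 0 \;\;\Rightarrow\;\; \hat{\theta} = \frac{k}{n}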

Bayesian Perspective. Bayes' rule: the posterior we want equals likelihood × prior, divided by a normalizing factor. What we need is a prior and the likelihood.
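Bayes' rule in symbols (standard form):

P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)} \qquad \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{normalizing factor}}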

Bayesian Perspective. Compute the probability of each hypothesis. Pick the most likely one and use it for predictions (MAP = maximum a posteriori), or compute expected values (a weighted average over all hypotheses).

Logistic Regression. Assume independent data points and apply maximum likelihood (there is a Bayesian version too). The logistic function is a soft threshold, compared with the perceptron's hard threshold. Logistic regression can be, and is, used for classification: predict the most likely y.
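The model, in standard notation (reconstructed; the slide's formulas were lost):

P(y = 1 \mid x, w) = \sigma(w^\top x) = \frac{1}{1 + e^{-w^\top x}}, \qquad P(y = 0 \mid x, w) = 1 - \sigma(w^\top x)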

Maximum Likelihood for Logistic Regression. The negative log-likelihood is convex, but we cannot solve for zero analytically.
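For reference, the negative log-likelihood and its gradient in standard form (my reconstruction):

\mathrm{NLL}(w) = -\sum_{i=1}^{n}\Bigl[y_i \log \sigma(w^\top x_i) + (1 - y_i)\log\bigl(1 - \sigma(w^\top x_i)\bigr)\Bigr], \qquad \nabla_w \mathrm{NLL}(w) = \sum_{i=1}^{n}\bigl(\sigma(w^\top x_i) - y_i\bigr)\, x_i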

Descent Methods. Iteratively move toward a better solution, where f is twice continuously differentiable. (Computing the gradient numerically by small perturbations of each variable costs roughly O(dim · time(evaluate cost function)).) General scheme, with gradient descent using the negative gradient as the descent direction: pick a start point x; repeat until a stopping criterion is satisfied: compute a descent direction v; line search: compute a step size t; update x = x + t·v.

Line (Ray) Search: how to compute the step size t in the scheme above. Options: solve analytically (if possible); backtracking search: start high and decrease until an improving step is found [SL 9.2]; fix t to a small constant; use the size of the gradient scaled by a small constant; start with a constant and let it decrease slowly, or decrease it when it is too high.

Stopping Criteria. The gradient becomes very small, or the maximum number of iterations is reached.

Gradient Descent for Linear Reg.

GD For Linear Regression (Matlab style)

function theta = GD(X, y, theta)
  LR = 0.1;                                        % fixed learning rate
  for i = 1:50
    cost = (1/length(y)) * sum((X*theta - y).^2);  % mean squared error (for monitoring)
    grad = (1/length(y)) * 2 .* X' * (X*theta - y);% gradient of the cost
    theta = theta - LR * grad;                     % gradient step
  end
end

Note: we do not scale the gradient to a unit vector.
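A minimal usage sketch (the synthetic data and the intercept column are my own illustration, not from the lecture):

x = linspace(0, 1, 20)';            % inputs in [0,1]
y = 2*x + 1 + 0.1*randn(20, 1);     % noisy targets around y = 2x + 1
X = [ones(20, 1) x];                % design matrix with an intercept column
theta = GD(X, y, zeros(2, 1));      % run the 50 fixed gradient steps from the origin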

Learning Rate

Gradient Descent Jumps Around: exact line search, starting from (10, 1).

Gradient Descent Running Time. Total time = number of iterations × cost per iteration. The cost per iteration is usually not a problem. The number of iterations clearly depends on the choice of line search and stopping criterion – it is very problem- and data-specific – a lot of math is needed to give bounds – we will not cover it in this course.

Gradient Descent for Logistic Regression: Hand-in 1! Along with a multiclass extension.
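A minimal Matlab-style sketch of what such an implementation could look like (my own illustration of the technique, not the hand-in solution; the function name LogRegGD is hypothetical):

function w = LogRegGD(X, y, w)
  % X: one example per row, y: column of 0/1 labels, w: initial weights
  LR = 0.1;                                % fixed learning rate
  for i = 1:100
    p = 1 ./ (1 + exp(-X*w));              % sigmoid(X*w), predicted probabilities
    grad = (1/length(y)) * X' * (p - y);   % gradient of the average negative log-likelihood
    w = w - LR * grad;                     % gradient step
  end
end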

Stochastic Gradient Descent: pick one data point at random and use its gradient. Mini-batch gradient descent: use K points chosen at random.
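The update rules, in standard notation with step size \eta (my notation, not from the slide):

\text{SGD:}\quad w \leftarrow w - \eta\, \nabla_w \ell_i(w), \;\; i \text{ drawn uniformly at random} \qquad \text{Mini-batch:}\quad w \leftarrow w - \eta\, \frac{1}{K}\sum_{i \in B} \nabla_w \ell_i(w), \;\; |B| = K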

Linear Classification with K Classes. Use logistic regression, one vs. all: – Train K classifiers, one for each class. – The input X is the same; y is 1 for all elements from that class and 0 otherwise. – Prediction: compute the probability under all K classifiers and output the class with the highest probability. Or use softmax regression: – An extension of the logistic function to K classes, in some sense. – Covered in Hand-in 1.
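One standard way to write the softmax model (reconstructed for reference; details are in Hand-in 1):

P(y = k \mid x) = \frac{e^{\,w_k^\top x}}{\sum_{j=1}^{K} e^{\,w_j^\top x}}, \qquad k = 1, \dots, K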

Maximum Likelihood and Linear Regression (time-to-spare slide). Assume the data points are generated independently.
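The usual assumption behind this slide (a presumed reconstruction: a linear model plus Gaussian noise), under which maximizing the likelihood is the same as minimizing the squared error:

y_i = w^\top x_i + \varepsilon_i, \;\; \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \text{ i.i.d.} \qquad -\log p(D \mid w) = \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - w^\top x_i\bigr)^2 + \frac{n}{2}\log(2\pi\sigma^2)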

Today's Summary. Convex optimization – many definitions – local optima are global optima – usually theoretically and practically feasible. Maximum likelihood – used as a proxy for what we really want – assumes independent data. Gradient descent – minimizes a function – iteratively finds a better solution by local steps based on the gradient – a first-order method (uses the gradient) – other methods exist, e.g. second-order methods (use the Hessian).