Roberto Battiti, Mauro Brunato. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Feb 2014.


Roberto Battiti, Mauro Brunato. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Feb 2014. intelligent-optimization.org/LIONbook. © Roberto Battiti and Mauro Brunato, 2014, all rights reserved. Can be used and modified for classroom usage, provided that the attribution (link to book website) is kept.

Chap. 8: Specific nonlinear models. "He who would learn to fly one day must first learn to stand and walk and run and climb and dance; one cannot fly into flying." (Friedrich Nietzsche)

Logistic regression (1) Goal: predict the probability of the outcome of a categorical variable. It is a technique for classification, obtained through an estimate of this probability. Problem with linear models: the output value is not bounded, while we need it to lie between zero and one. Solution: use a logistic function to transform the output of a linear predictor into a value between zero and one, which can be interpreted as a probability.

Logistic regression (2) (logistic functions) A logistic function transforms any input value into an output value in the range (0, 1), in a smooth manner. The output of the function can be interpreted as a probability.

Logistic regression (3) (logistic functions) A logistic function, a.k.a. sigmoid function, is f(z) = 1 / (1 + e^{-z}). When applied to the output of the linear model it gives Pr(y = 1 | x) = 1 / (1 + e^{-w^T x}). The weights w of the linear model need to be learnt.
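A minimal sketch (not from the book) of how the sigmoid turns the unbounded output of a linear model into a probability; the weight vector and input below are made-up numbers:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical example: a linear model w.x turned into a probability.
w = np.array([0.8, -1.5, 0.3])   # weights (would normally be learnt)
x = np.array([1.0, 0.2, 2.0])    # one input example

linear_output = w @ x                 # unbounded real value
probability = sigmoid(linear_output)  # bounded in (0, 1)
print(f"linear output = {linear_output:.3f}, Pr(y=1|x) = {probability:.3f}")
```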

Logistic regression (4) (maximum likelihood estimation) The best values of the weights in the linear model can be determined by maximum likelihood estimation: maximize the (model) probability of obtaining the output values which were actually observed on the given labeled examples.

Logistic regression (5) (maximum likelihood estimation) Let y_i be the observed output (1 or 0) for the corresponding input x_i, and let Pr(y = 1 | x_i) be the probability obtained by the model. The probability of obtaining the measured output value y_i is Pr(y = 1 | x_i) if y_i = 1, and Pr(y = 0 | x_i) = 1 - Pr(y = 1 | x_i) if y_i = 0.

Logistic regression (6) (maximum likelihood estimation) Multiplying the probabilities of all examples (assuming independence) and applying logarithms gives the log-likelihood: LL(w) = Σ_i [ y_i log Pr(y = 1 | x_i) + (1 - y_i) log (1 - Pr(y = 1 | x_i)) ]. The log-likelihood depends on the weights of the linear model. There is no closed-form expression for the weights that maximize the likelihood function: an iterative process can be used for the maximization (for example, gradient descent/ascent).
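Since no closed form exists, the weights can be found iteratively. The sketch below uses plain gradient ascent on the log-likelihood; it is a generic illustration rather than the book's code, and the toy data, learning rate, and iteration count are arbitrary assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Maximize the log-likelihood by plain gradient ascent.

    X: (n_examples, n_features) input matrix
    y: (n_examples,) array of 0/1 labels
    Returns the learnt weight vector.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = sigmoid(X @ w)          # model probabilities Pr(y=1|x_i)
        gradient = X.T @ (y - p)    # gradient of the log-likelihood w.r.t. w
        w += lr * gradient / n      # ascend (maximize the likelihood)
    return w

# Toy usage with made-up data (not from the slides):
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = (sigmoid(X @ true_w) > rng.uniform(size=200)).astype(float)
print(fit_logistic_regression(X, y))
```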

Locally-weighted regression Motivation: similar cases are usually deemed more relevant than very distant ones in determining the output (natura non facit saltus). Idea: determine the output at an evaluation point as a weighted linear combination of the outputs of the neighbors, with bigger weights for closer neighbors.

Locally-weighted regression (2) The linear regression depends on the query point (it is local): a linear regression is fitted on the training points, each with a significance that decreases with its distance from the query point.

Locally-weighted regression (3) Given a query point q, let s_i be the significance level assigned to the i-th labelled example. The weighted version of the least-squares fit aims at minimizing the following weighted error: error(w; q) = Σ_i s_i^2 (w^T x_i - y_i)^2. (1)

Locally-weighted regression (4)

Locally-weighted regression (5) Minimization of equation (1) yields the solution w* = (X^T S^2 X)^{-1} X^T S^2 y, where S = diag(s_1, ..., s_l) is the diagonal matrix of significance levels (one per training example). A common function used to set the relationship between significance and distance is a Gaussian kernel of the distance, for example s_i = exp(-||x_i - q||^2 / K), where K is a parameter controlling the kernel width.
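A minimal sketch of LWR at a single query point, assuming the Gaussian significance function above; the data and kernel width are illustrative choices, not taken from the book:

```python
import numpy as np

def lwr_predict(X, y, q, kernel_width=1.0):
    """Locally-weighted regression prediction at a single query point q.

    Each training example gets a significance s_i = exp(-||x_i - q||^2 / K),
    and a weighted least-squares linear fit is solved for this query point.
    """
    s = np.exp(-np.sum((X - q) ** 2, axis=1) / kernel_width)
    S = np.diag(s)
    Z = S @ X            # rows scaled by significance
    v = S @ y
    # Weighted least-squares solution minimizing sum_i s_i^2 (w.x_i - y_i)^2
    w, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return q @ w         # local linear model evaluated at the query point

# Toy usage: one-dimensional data plus a constant (bias) feature.
rng = np.random.default_rng(1)
x1 = rng.uniform(-3, 3, size=100)
X = np.column_stack([x1, np.ones_like(x1)])
y = np.sin(x1) + 0.1 * rng.normal(size=100)
q = np.array([0.5, 1.0])
print("prediction at q:", lwr_predict(X, y, q))
```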

Locally-weighted regression (6) [Figure] Left: evaluation of the LWR model at a query point q; the significance of each sample point is represented by its interior shade. Right: evaluation over all query points; each point requires a different linear fit.

Bayesian Locally-weighted regression B-LWR is used if prior information about what values the coefficients should have can be specified, when there is not enough data to determine them. Advantage of Bayesian techniques: the possibility to model not only expected values but entire probability distributions (and to derive "error bars").

Bayesian Locally-weighted regression (2) Prior assumptions: w ~ N(0, Σ), with Σ = diag(σ_1, …, σ_l) and 1/σ_i ~ Gamma(k, ϑ). Let S = diag(s_1, …, s_l) be the diagonal matrix of the significance levels prescribed to each point.

Bayesian Locally-weighted regression (3) The local model for the query point q is predicted by using the posterior distribution of w: its mean is estimated from the significance-weighted examples combined with the prior, and the variance of the Gaussian noise is estimated at the same time.

Bayesian Locally-weighted regression (4) The estimated covariance matrix of the w distribution is then calculated from the same quantities. From it one obtains the predicted output response for the query point q and the variance of the mean predicted output.
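The sketch below is a generic weighted Bayesian linear regression with a Gaussian prior and a fixed, known noise variance; the exact B-LWR formulas in the book (which place a Gamma prior on the noise precision) differ in the details, so treat this only as an illustration of how the posterior mean, covariance, and "error bars" at a query point are obtained:

```python
import numpy as np

def bayesian_lwr_predict(X, y, q, prior_cov, noise_var, kernel_width=1.0):
    """Weighted Bayesian linear regression around a query point q.

    Assumes a Gaussian prior w ~ N(0, prior_cov) and known noise variance;
    each example is down-weighted by its significance s_i.
    """
    s = np.exp(-np.sum((X - q) ** 2, axis=1) / kernel_width)  # significances
    S2 = np.diag(s ** 2)
    # Posterior covariance and mean of the weight vector
    cov_w = np.linalg.inv(np.linalg.inv(prior_cov) + X.T @ S2 @ X / noise_var)
    mean_w = cov_w @ X.T @ S2 @ y / noise_var
    y_hat = q @ mean_w        # predicted output at the query point
    var_mean = q @ cov_w @ q  # variance of the mean prediction ("error bars")
    return y_hat, var_mean

# Toy usage with made-up data and hyper-parameters:
rng = np.random.default_rng(4)
x1 = rng.uniform(-3, 3, size=100)
X = np.column_stack([x1, np.ones_like(x1)])
y = np.sin(x1) + 0.1 * rng.normal(size=100)
q = np.array([0.5, 1.0])
print(bayesian_lwr_predict(X, y, q, prior_cov=np.eye(2), noise_var=0.01))
```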

LASSO to shrink and select inputs With a large number of input variables, we would like to determine a smaller subset that exhibits the strongest effects. Feature subset selection: the selected subset can be very variable (unstable). Ridge regression: more stable, but it does not reduce the number of input variables. LASSO ("least absolute shrinkage and selection operator") retains the good features of both subset selection and ridge regression: it shrinks some coefficients and sets others to zero.

LASSO to shrink and select inputs (2) LASSO minimizes the residual sum of squares subject to the sum of the absolute values of the coefficients being less than a constant. Using Lagrange multipliers, it is equivalent to the following unconstrained minimization: minimize over w the quantity Σ_i (y_i - w^T x_i)^2 + λ Σ_j |w_j|.
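One way to carry out this unconstrained minimization numerically is proximal gradient descent (ISTA), sketched below. This is a generic solver for the LASSO objective, not the algorithm from the book, and the toy data and λ value are arbitrary assumptions:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1: shrinks values toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=500):
    """Minimize sum_i (y_i - w.x_i)^2 + lam * sum_j |w_j| by proximal
    gradient descent (ISTA). A sketch, not an optimized solver."""
    n, d = X.shape
    w = np.zeros(d)
    # Step size at most the inverse Lipschitz constant of the RSS gradient
    step = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ w - y)   # gradient of the residual sum of squares
        w = soft_threshold(w - step * grad, step * lam)
    return w

# Toy usage: only two of the five inputs truly influence the output,
# so the three irrelevant coefficients are shrunk to (near) zero.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(lasso_ista(X, y, lam=5.0))
```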

LASSO to shrink and select inputs (3)

LASSO vs. Ridge regression In ridge regression, as the penalty is increased, all parameters are reduced while still remaining non-zero. In LASSO, increasing the penalty drives more and more of the parameters to zero. The inputs corresponding to weights equal to zero can be eliminated. LASSO is therefore an embedded method: it performs feature selection as part of the model construction.
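The contrast can be seen empirically. The sketch below (assuming scikit-learn is available; the data are made up) fits both models for increasing penalties and prints the coefficient vectors, showing ridge shrinking everything smoothly while LASSO zeroes out the irrelevant inputs:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Made-up data: five inputs, only the first two influence the output.
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)

for alpha in [0.01, 0.1, 1.0]:
    ridge = Ridge(alpha=alpha).fit(X, y)
    lasso = Lasso(alpha=alpha).fit(X, y)
    # Ridge shrinks all coefficients but keeps them non-zero;
    # LASSO sets more of them exactly to zero as the penalty grows.
    print(f"alpha={alpha}: ridge={np.round(ridge.coef_, 3)} "
          f"lasso={np.round(lasso.coef_, 3)}")
```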

Lagrange multipliers The method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to constraints. The problem is transformed into an unconstrained one by adding to the objective each constraint multiplied by a parameter (the Lagrange multiplier).

Lagrange multipliers (2) Two-dimensional problem: maximize f(x, y) subject to g(x, y) = c.

Lagrange multipliers (3) Suppose we walk along the contour line with g = c. While moving along this contour line the value of f can vary. Only where the contour line for g = c meets a contour line of f tangentially is f approximately constant. This is the same as saying that the gradients of f and g are parallel; thus we want points (x, y) where g(x, y) = c and ∇f(x, y) = λ ∇g(x, y). The Lagrange multiplier λ specifies by how much one gradient needs to be multiplied to obtain the other one.
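A small worked example (not from the slides): maximize f(x, y) = x + y subject to g(x, y) = x² + y² = 1.

```latex
% Gradients of objective and constraint:
\nabla f = (1, 1), \qquad \nabla g = (2x, 2y).
% Tangency condition \nabla f = \lambda \nabla g:
1 = 2\lambda x, \qquad 1 = 2\lambda y \;\Rightarrow\; x = y = \tfrac{1}{2\lambda}.
% Substituting into the constraint x^2 + y^2 = 1:
2x^2 = 1 \;\Rightarrow\; x = y = \pm\tfrac{1}{\sqrt{2}}.
% The maximum of f is attained at (1/\sqrt{2},\, 1/\sqrt{2}), with value \sqrt{2}.
```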

Gist Linear models are widely used but insufficient in many cases. Logistic regression can be used if the output needs to have a limited range of possible values (e.g., if it is a probability); locally-weighted regression can be used if a linear model needs to be localized, by giving more significance to the input points which are closer to the sample to be predicted.

Gist (2) LASSO reduces the number of non-zero weights, and therefore the number of inputs which influence the output.