Logistic Regression
Debapriyo Majumdar
Data Mining – Fall 2014
Indian Statistical Institute Kolkata
September 1, 2014

Recall: Linear Regression
[Scatter plot: Power (bhp) against Engine displacement (cc)]
- Assume: the relation is linear
- Then, for a given x (= 1800), predict the value of y
- Both the dependent and the independent variables are continuous
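For concreteness, a minimal sketch of that prediction step; the coefficients beta0 and beta1 below are made-up illustrative values, not estimates from real engine data.

```python
# Toy prediction with a fitted line y = beta0 + beta1 * x.
# The coefficient values are invented for illustration, not fitted to real data.
beta0, beta1 = 5.0, 0.06

def predict_power(displacement_cc: float) -> float:
    """Predict power (bhp) from engine displacement (cc) with the linear model."""
    return beta0 + beta1 * displacement_cc

print(predict_power(1800))  # x = 1800 cc -> 113.0 bhp under these toy coefficients
```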

Scenario: Heart disease vs. Age
[Training set plot: Heart disease (Y), No/Yes, against Age (X)]
- Age (numerical): independent variable
- Heart disease (Yes/No): dependent variable with two classes
- Task: Given a new person's age, predict whether he or she has heart disease
- The task amounts to calculating P(Y = Yes | X)

Scenario: Heart disease vs. Age
[Training set plot with a fitted curve: Heart disease (Y), No/Yes, against Age (X)]
- Age (numerical): independent variable; Heart disease (Yes/No): dependent variable with two classes
- Calculate P(Y = Yes | X) for different ranges of X
- Fit a curve that estimates the probability P(Y = Yes | X)
- Task: Given a new person's age, predict whether he or she has heart disease

The Logistic function
- Logistic function of t, taking values between 0 and 1:
  L(t) = 1 / (1 + e^(−t))
[The logistic curve: L(t) plotted against t]
- If t is a linear function of x, t = β0 + β1x, the logistic function becomes
  p(x) = 1 / (1 + e^(−(β0 + β1x)))
- p(x) models the probability of the dependent variable Y taking one value against the other
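A small sketch of the logistic function and of plugging in a linear function of x; the coefficient values used here are illustrative assumptions, not fitted parameters.

```python
import math

def logistic(t: float) -> float:
    """Logistic (sigmoid) function L(t) = 1 / (1 + e^(-t)); always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

def p_of_x(x: float, beta0: float, beta1: float) -> float:
    """Probability P(Y = 1 | x) when t is the linear function beta0 + beta1 * x."""
    return logistic(beta0 + beta1 * x)

# Illustrative (not fitted) coefficients: probability rises with age.
print(p_of_x(30, beta0=-5.0, beta1=0.1))  # ~0.12
print(p_of_x(70, beta0=-5.0, beta1=0.1))  # ~0.88
```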

The Likelihood function
- Let a discrete random variable X have a probability distribution p(x; θ) that depends on a parameter θ
- In the case of the Bernoulli distribution:
  – For x = 1, p(x; θ) = θ
  – For x = 0, p(x; θ) = 1 − θ
- Intuitively, the likelihood measures "how likely" the observed outcomes are under the parameter θ
- Given a set of data points x1, x2, …, xn, the likelihood function is defined as:
  L(θ) = p(x1; θ) × p(x2; θ) × … × p(xn; θ) = ∏ p(xi; θ)

About the Likelihood function
- The actual value has no meaning by itself; only the relative likelihood matters, since we want to estimate the parameter θ
  – Constant factors do not matter
- The likelihood is not a probability density function
  – Its sum (or integral) over θ does not add up to 1
- In practice it is often easier to work with the log-likelihood
  – It provides the same relative comparison
  – The product becomes a sum: log L(θ) = Σ log p(xi; θ)

Example
- Experiment: a coin toss; the coin is not known to be unbiased
- Random variable X takes value 1 for heads and 0 for tails
- Data: 100 outcomes, 75 heads, 25 tails
  L(θ) = θ^75 (1 − θ)^25, so log L(θ) = 75 log θ + 25 log(1 − θ)
- Relative likelihood: if L(θ1) > L(θ2), then θ1 explains the observed data better than θ2
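A quick sketch of evaluating this log-likelihood over a grid of candidate θ values; it simply confirms that the maximum lies near 0.75 (the grid resolution is an arbitrary choice).

```python
import numpy as np

# Data from the slide: 100 coin tosses, 75 heads and 25 tails.
heads, tails = 75, 25

def log_likelihood(theta):
    """Bernoulli log-likelihood: 75*log(theta) + 25*log(1 - theta)."""
    return heads * np.log(theta) + tails * np.log(1.0 - theta)

# Compare candidate values of theta on a grid; only relative values matter.
thetas = np.linspace(0.01, 0.99, 99)
best = thetas[np.argmax(log_likelihood(thetas))]
print(best)  # ~0.75, i.e. heads / (heads + tails)
```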

Maximum likelihood estimate
- Maximum likelihood estimation: estimating the set of values of the parameters (for example, θ) which maximizes the likelihood function
- Estimate: θ̂ = argmax over θ of L(θ); for the coin example, θ̂ = 75/100 = 0.75
- One method: Newton's method
  – Start with some value of θ and iteratively improve
  – Converge when the improvement is negligible
  – May not always converge

Taylor's theorem
- If f is
  – a real-valued function
  – k times differentiable at a point a, for an integer k > 0,
  then f has a polynomial approximation at a
- In other words, there exists a function hk such that
  f(x) = f(a) + f′(a)(x − a) + (f″(a)/2!)(x − a)^2 + … + (f^(k)(a)/k!)(x − a)^k + hk(x)(x − a)^k
  where the first k + 1 terms form the polynomial approximation (the k-th order Taylor polynomial) and hk(x) → 0 as x → a

Newton's method
- Goal: find the global maximum w* of a function f of one variable
- Assumptions:
  1. The function f is smooth
  2. The derivative of f at w* is 0, and the second derivative is negative
- Start with a value w = w0
- Near the maximum, approximate the function using a second-order Taylor polynomial:
  f(w) ≈ f(w0) + f′(w0)(w − w0) + ½ f″(w0)(w − w0)^2
- Iteratively improve the estimate of the maximum of f using this approximation

Newton's method
- Take the derivative of the approximation w.r.t. w and set it to zero to obtain the next point w1:
  w1 = w0 − f′(w0) / f″(w0)
- Iteratively: w(t+1) = w(t) − f′(w(t)) / f″(w(t))
- Converges very fast, if it converges at all
- In practice, one can use the optim function in R
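A minimal sketch of this one-dimensional Newton update, applied to the coin-toss log-likelihood from the earlier example; the starting value and tolerance are arbitrary choices for illustration.

```python
def newton_maximize(f_prime, f_double_prime, w0, tol=1e-8, max_iter=50):
    """Newton's method for a 1-D maximum: w <- w - f'(w) / f''(w).
    Converges very fast near the maximum, but may diverge from a poor start."""
    w = w0
    for _ in range(max_iter):
        step = f_prime(w) / f_double_prime(w)
        w -= step
        if abs(step) < tol:
            break
    return w

# Maximize l(theta) = 75*log(theta) + 25*log(1 - theta) from the coin example.
lp  = lambda t: 75.0 / t - 25.0 / (1.0 - t)          # first derivative
lpp = lambda t: -75.0 / t**2 - 25.0 / (1.0 - t)**2   # second derivative (negative: concave)
print(newton_maximize(lp, lpp, w0=0.5))  # converges to 0.75
```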

Logistic Regression: Estimating β0 and β1
- Logistic function: p(x) = 1 / (1 + e^(−(β0 + β1x)))
- Log-likelihood function
  – Say we have n data points x1, x2, …, xn
  – Outcomes y1, y2, …, yn, each either 0 or 1
  – Each yi equals 1 with probability pi = p(xi) and 0 with probability 1 − pi
  – log L(β0, β1) = Σ [ yi log p(xi) + (1 − yi) log(1 − p(xi)) ]
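The slides suggest Newton's method (or R's optim) for the maximization; as a simpler illustration, here is a gradient-ascent sketch of maximizing this log-likelihood on a tiny synthetic age/disease data set. The data, learning rate, and iteration count are invented for illustration.

```python
import numpy as np

def fit_logistic(x, y, lr=0.05, n_iter=5000):
    """Maximize log L(b0, b1) = sum_i [ y_i*log(p_i) + (1 - y_i)*log(1 - p_i) ],
    with p_i = 1 / (1 + exp(-(b0 + b1 * x_i))), by plain gradient ascent
    (the lecture uses Newton's method / optim instead)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
        b0 += lr * np.sum(y - p)         # d(log L) / d b0
        b1 += lr * np.sum((y - p) * x)   # d(log L) / d b1
    return b0, b1

# Tiny synthetic age / heart-disease data set (illustrative only, not from the lecture).
age = np.array([25, 30, 35, 40, 45, 50, 55, 60, 65, 70], dtype=float)
disease = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# Standardize age for numerical stability before fitting.
z = (age - age.mean()) / age.std()
b0, b1 = fit_logistic(z, disease)
print(b0, b1)  # b1 > 0: probability of heart disease increases with age
```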

Visualization
[Plot: Heart disease (Y), No/Yes, against Age (X), with a fitted logistic curve]
- Fit a curve with parameters β0 and β1

Visualization
[Plot: Heart disease (Y), No/Yes, against Age (X), with the fitted curve and the separation point]
- Fit a curve with parameters β0 and β1
- Iteratively adjust the curve, and hence the probability of a point being classified as one class versus the other
- For a single independent variable x, the separation is a point x = a where the probability equals 1/2

Two independent variables
- Model: p(x1, x2) = 1 / (1 + e^(−(β0 + β1x1 + β2x2)))
- The separation is a line where the probability becomes 1/2, i.e., the line β0 + β1x1 + β2x2 = 0
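A tiny sketch checking that points on the line β0 + β1x1 + β2x2 = 0 get probability exactly 0.5; the coefficient values are hypothetical, not taken from the lecture.

```python
import math

# Hypothetical coefficients for two independent variables x1 and x2.
b0, b1, b2 = -6.0, 0.08, 1.2

def p(x1: float, x2: float) -> float:
    """P(Y = 1 | x1, x2) under the two-variable logistic model."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x1 + b2 * x2)))

x1 = 30.0
x2 = -(b0 + b1 * x1) / b2   # point on the separating line b0 + b1*x1 + b2*x2 = 0
print(p(x1, x2))            # 0.5 on the boundary
```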

CLASSIFICATION: Wrapping up classification

Binary and Multi-class classification
- Binary classification:
  – Target class has two values
  – Example: heart disease Yes / No
- Multi-class classification:
  – Target class can take more than two values
  – Example: text classification into several labels (topics)
- Many classifiers are simple to use for binary classification tasks
- How can they be applied to multi-class problems?

Compound and Monolithic classifiers
- Compound models: built by combining binary submodels
  – 1-vs-all: for each class c, determine whether an observation belongs to c or to some other class (a sketch follows below)
  – 1-vs-last
- Monolithic models (a single classifier)
  – Examples: decision trees, k-NN
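A minimal one-vs-all wrapper sketch. It assumes, purely for illustration, that each binary submodel exposes a scikit-learn-style fit / predict_proba interface; this interface is an assumption, not part of the slides.

```python
import numpy as np

class OneVsAll:
    """Minimal 1-vs-all wrapper: train one binary classifier per class
    (class c vs. the rest) and predict the class whose classifier gives
    the highest positive-class score.  Assumes (hypothetically) that each
    binary model provides fit(X, y) and predict_proba(X)."""

    def __init__(self, make_binary_clf):
        self.make_binary_clf = make_binary_clf   # factory producing a fresh binary classifier
        self.models = {}

    def fit(self, X, y):
        for c in np.unique(y):
            clf = self.make_binary_clf()
            clf.fit(X, (y == c).astype(int))     # label 1 for class c, 0 for all other classes
            self.models[c] = clf
        return self

    def predict(self, X):
        classes = list(self.models)
        # Column j holds the probability of class classes[j] from its binary model.
        scores = np.column_stack([self.models[c].predict_proba(X)[:, 1] for c in classes])
        return np.array(classes)[np.argmax(scores, axis=1)]
```

For instance, make_binary_clf could return a binary logistic regression like the one fitted earlier, giving a multi-class classifier built entirely from binary submodels.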