CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 16: Linear and Logistic Regression). Pushpak Bhattacharyya, CSE Dept., IIT Bombay.


14th Feb, 2011

Least Square Method: fitting a line (following Manning and Schütze, Foundations of Statistical NLP, 1999)
Given a set of N points (x1, y1), (x2, y2), …, (xN, yN), find the line f(x) = mx + b that best fits the data.
m and b are the parameters to be found; together they form the weight vector W.
The line that best fits the data is the one that minimizes the sum of squared distances:
SS(m, b) = Σi=1..N ( yi − (m·xi + b) )²

Values of m and b
Setting the partial derivatives of SS(m, b) with respect to b and m to zero yields, respectively:
b = ȳ − m·x̄
m = ( Σi xi·yi − N·x̄·ȳ ) / ( Σi xi² − N·x̄² )
where x̄ and ȳ are the means of the xi and the yi.
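The two closed-form expressions above can be checked with a short sketch; the data points here are made up for illustration (they lie exactly on y = 2x + 1, so the fit should recover m = 2 and b = 1):

```python
# Closed-form least-squares fit of f(x) = mx + b.
def fit_line(xs, ys):
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # m = (sum(x*y) - n*x_bar*y_bar) / (sum(x^2) - n*x_bar^2)
    m = (sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar) / \
        (sum(x * x for x in xs) - n * x_bar * x_bar)
    # b = y_bar - m * x_bar
    b = y_bar - m * x_bar
    return m, b

# Illustrative points lying exactly on y = 2x + 1.
m, b = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```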

Example (Manning and Schütze, FSNLP, 1999)

Implication of the “line” fitting
[Figure: points 1, 2, 3, 4 with their projections A, B, C, D on the fitted line; O is a reference point on the line]
1, 2, 3, 4 are the points; A, B, C, D are their projections on the fitted line.
Suppose 1 and 2 form one class, and 3 and 4 another class.
Of course, it is easy to set up a hyperplane that separates 1 and 2 from 3 and 4; that is classification in 2 dimensions.
But suppose we form another attribute of these points, viz., the distance of each point’s projection on the line from O.
Then the points can be classified by a threshold on these distances.
This is effectively classification in a reduced dimension (1 dimension).
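The projection-then-threshold idea can be sketched as follows; the line, the four points, and the threshold are all illustrative, not the figure's actual values:

```python
import math

# Project a 2-D point onto the line y = mx + b and return its signed
# position along the line, measured from the reference point O = (0, b).
def projection_distance(point, m, b):
    x, y = point
    d = math.sqrt(1.0 + m * m)
    ux, uy = 1.0 / d, m / d          # unit direction vector of the line
    return x * ux + (y - b) * uy     # signed distance along the line from O

m, b = 1.0, 0.0                      # the fitted line (illustrative)
pts = [(-2.0, -1.5), (-1.0, -1.2), (1.0, 1.3), (2.0, 1.8)]  # points 1..4
ds = [projection_distance(p, m, b) for p in pts]

# Classification in 1 dimension: threshold the projected distances.
threshold = 0.0
labels = [d > threshold for d in ds]
```

Here the two classes end up on opposite sides of the threshold along the line, even though the original data was 2-dimensional.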

When the dimensionality is more than 2
Let X be the matrix of input vectors: M × N (M input vectors, each with N features).
yj = w0 + w1·xj1 + w2·xj2 + w3·xj3 + … + wN·xjN
Find the weight vector W. It can be shown that
W = (XᵀX)⁻¹ Xᵀ Y   (the normal equation)
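The normal equation can be demonstrated with a minimal numpy sketch; the data matrix is made up, with y generated from known weights (2, −1) and intercept 5, and a column of ones is prepended so that w0 plays the role of the intercept:

```python
import numpy as np

# Illustrative data: 4 input vectors with 2 features each.
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = X @ np.array([2.0, -1.0]) + 5.0      # targets from known weights

# Prepend a bias column of ones so w0 is the intercept.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal equation: solve (X^T X) W = X^T y rather than inverting explicitly.
W = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
# W recovers [5, 2, -1] (intercept, then the two feature weights).
```

Solving the linear system is numerically preferable to forming the inverse (XᵀX)⁻¹ directly, though both express the same equation.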

The multivariate data

 f1    f2    f3    f4    f5   …   fn     y
 x11   x12   x13   x14   x15  …   x1n    y1
 x21   x22   x23   x24   x25  …   x2n    y2
 x31   x32   x33   x34   x35  …   x3n    y3
 x41   x42   x43   x44   x45  …   x4n    y4
 …
 xm1   xm2   xm3   xm4   xm5  …   xmn    ym

Logistic Regression
Linear regression: predicting a real-valued outcome.
Classification: the output takes a value from a small set of discrete values.
Simplest classification: two classes (0/1 or true/false).
We want to predict the class and also give the probability of belonging to the class.

Linear to logistic regression
A first attempt: P(y=true | x) = Σi=0..n wi·fi = w·f
But this is not a legal probability value! It can take any value from −∞ to +∞.
Instead, predict the ratio of the probability of being in the class to the probability of not being in the class.
Odds Ratio: if an event has probability 0.75 of occurring and probability 0.25 of not occurring, we say the odds of it occurring are 0.75/0.25 = 3.

Odds Ratio (following Jurafsky and Martin, Speech and Language Processing, 2009)
The ratio p/(1 − p) lies between 0 and ∞, but the RHS w·f lies between −∞ and +∞. So introduce the log:
ln( p(y=true|x) / (1 − p(y=true|x)) ) = w·f
Solving for the probability then gives the expression for p(y=true|x):
p(y=true|x) = e^(w·f) / (1 + e^(w·f)) = 1 / (1 + e^(−w·f))

Logistic function for p(y=true|x)
The form p(y=true|x) = 1 / (1 + e^(−w·f)) is called the logistic function.
It maps values from −∞ to +∞ into the interval (0, 1).
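A minimal sketch of the logistic function and of the decision rule it induces; the weight and feature vectors are illustrative:

```python
import math

# The logistic (sigmoid) function: maps any real z into (0, 1).
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# sigmoid(0) = 0.5; large positive z gives values near 1,
# large negative z gives values near 0.

# Decision rule from the text: predict y = true iff w.f > 0,
# which is exactly the condition sigmoid(w.f) > 0.5.
w = [0.5, -1.0, 2.0]     # illustrative weights
f = [1.0, 0.2, 0.4]      # illustrative feature values
z = sum(wi * fi for wi, fi in zip(w, f))
prediction = z > 0       # equivalent to sigmoid(z) > 0.5
```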

Classification using logistic regression
For belonging to the true class: classify y = true when p(y=true|x) > p(y=false|x), i.e. when
p(y=true|x) / (1 − p(y=true|x)) > 1
This gives e^(w·f) > 1. In other words, w·f > 0.
This is equivalent to placing a hyperplane w·f = 0 to separate the two classes.

Learning in logistic regression
In linear regression we minimized the sum of squared errors (SSE).
In logistic regression, we use maximum likelihood estimation:
choose the weights such that the conditional probability p(y|x) of the observed labels is maximized.

Steps of learning w
For a particular observation (x, y) with y ∈ {0, 1}:
p(y|x) = p(y=true|x)^y · (1 − p(y=true|x))^(1−y)
Substituting the logistic form for the probabilities, this can be written entirely in terms of w·f.
For all pairs (x(j), y(j)) in the training data:
ŵ = argmax_w Πj p(y(j) | x(j))
Working with the log:
ŵ = argmax_w Σj [ y(j)·log p(y(j)=true | x(j)) + (1 − y(j))·log(1 − p(y(j)=true | x(j))) ]
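The maximization above has no closed form, but the log-likelihood is concave and its gradient with respect to wi is Σj (y(j) − p(j))·f(j)i, so simple batch gradient ascent works. A minimal sketch on a made-up dataset (bias feature plus one real feature; the learning rate and iteration count are arbitrary choices):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Each row: feature vector f = (1, x) (bias plus one feature), label y in {0, 1}.
data = [((1.0, -2.0), 0), ((1.0, -1.0), 0), ((1.0, 1.0), 1), ((1.0, 2.0), 1)]

w = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    grad = [0.0, 0.0]
    for f, y in data:
        p = sigmoid(sum(wi * fi for wi, fi in zip(w, f)))
        for i in range(len(w)):
            grad[i] += (y - p) * f[i]    # gradient of the log-likelihood
    w = [wi + lr * gi for wi, gi in zip(w, grad)]  # ascent step

# After training, thresholding at 0.5 separates the two classes.
preds = [sigmoid(sum(wi * fi for wi, fi in zip(w, f))) > 0.5 for f, _ in data]
```

Note that on perfectly separable data like this the weights keep growing with more iterations; in practice a regularization term is added to keep them bounded.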