Why does it work? We have not addressed the question of why this classifier performs well, given that its assumptions are unlikely to be satisfied. The linear form of the classifiers provides some hints.

Naïve Bayes: Two Classes
In the case of two classes, the naïve Bayes posterior can be written in terms of a score that is linear in the features. Writing A = P(y = 1 | x), B = P(y = 0 | x) and C for that linear score, we have A = 1 - B and log(B/A) = -C.
Then exp(-C) = B/A = (1 - A)/A = 1/A - 1, so 1 + exp(-C) = 1/A, and therefore A = 1/(1 + exp(-C)), which is simply the logistic (sigmoid) function.
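To make the algebra concrete, here is a minimal numerical sketch (not from the original slides; all parameter values are made up for illustration) that builds a tiny two-class naïve Bayes model over binary features, computes the posterior directly from Bayes rule, and checks that applying the sigmoid to the linear log-odds score C recovers the same value.

```python
import numpy as np

# Hypothetical two-class naive Bayes over 3 binary features (illustrative numbers).
prior = np.array([0.4, 0.6])                  # P(y=0), P(y=1)
p_feat = np.array([[0.2, 0.7, 0.5],           # P(x_j = 1 | y = 0)
                   [0.8, 0.3, 0.6]])          # P(x_j = 1 | y = 1)

x = np.array([1, 0, 1])                       # one example

def joint_log_prob(y, x):
    """log P(y) + sum_j log P(x_j | y), using the NB independence assumption."""
    probs = np.where(x == 1, p_feat[y], 1.0 - p_feat[y])
    return np.log(prior[y]) + np.log(probs).sum()

# Posterior A = P(y=1 | x) computed directly from Bayes rule.
log_joint = np.array([joint_log_prob(0, x), joint_log_prob(1, x)])
posterior_direct = np.exp(log_joint[1]) / np.exp(log_joint).sum()

# Posterior via the sigmoid of the linear log-odds score C = -log(B/A).
C = log_joint[1] - log_joint[0]
posterior_sigmoid = 1.0 / (1.0 + np.exp(-C))  # A = 1 / (1 + exp(-C))

print(posterior_direct, posterior_sigmoid)    # the two numbers agree
```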

Why Does it Work? Learning Theory
Expressivity (1: Methodology): Probabilistic predictions are done via linear [Statistical Queries] models [Roth’99]. The low expressivity explains the generalization and robustness.
Why is it possible to (approximately) fit the data with these models? Is there a reason to believe that these hypotheses minimize the empirical error? In general, no (unless some probabilistic assumptions happen to hold). But: if the hypothesis does not fit the training data, augment the set of features. Now you actually follow the learning-theory protocol: try to learn a hypothesis that is consistent with the data; generalization will be a function of the low expressivity.
Expressivity (2: Theory) [Garg & Roth (ECML’01)]: Product distributions are “dense” in the space of all distributions. Consequently, for most generating distributions the resulting predictor’s error is close to that of the optimal classifier (that is, the classifier given the correct distribution).

What’s Next?
(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically? Yes.
(2) If probabilistic hypotheses are actually like other linear functions, can you train them similarly (that is, discriminatively)? Yes.
Classification: logistic regression / Max Entropy.
HMM: can be learned as a linear model, e.g., with a version of Perceptron (Structured Models class).

Recall: Naïve Bayes, Two Classes
In the case of two classes, combining the linear log-odds expression with the fact that the two posteriors sum to one and doing some algebra again yields the logistic (sigmoid) function, the same function used in the neural network representation.

Conditional Probabilities
(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically? Yes.
General recipe: train a classifier f using your favorite algorithm (Perceptron, SVM, Winnow, etc.). Then use the sigmoid 1/(1 + exp{-(A wTx + B)}) to get an estimate for P(y | x). A and B can be tuned using a held-out set that was not used for training. This is done in LBJava, for example.
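As a concrete illustration of this recipe (a minimal sketch, not from the slides: the data and the simple gradient loop are hypothetical, and in practice one would use a library routine), the two scalars A and B can be fit on held-out scores by minimizing the negative log-likelihood; this is essentially Platt scaling.

```python
import numpy as np

def fit_sigmoid_calibration(scores, labels, lr=0.1, steps=2000):
    """Fit A, B so that P(y=1 | x) is approximated by 1 / (1 + exp(-(A*score + B))).

    scores: raw activations w^T x + b of an already-trained classifier
            on a held-out set; labels in {0, 1}.
    """
    A, B = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(A * scores + B)))
        # Gradient of the (mean) negative log-likelihood w.r.t. A and B.
        grad_A = np.mean((p - labels) * scores)
        grad_B = np.mean(p - labels)
        A -= lr * grad_A
        B -= lr * grad_B
    return A, B

# Hypothetical held-out scores from some classifier (Perceptron, SVM, Winnow, ...).
rng = np.random.default_rng(0)
scores = rng.normal(size=200)
labels = (scores + 0.5 * rng.normal(size=200) > 0).astype(float)

A, B = fit_sigmoid_calibration(scores, labels)
prob_estimate = 1.0 / (1.0 + np.exp(-(A * scores + B)))  # estimates of P(y=1 | x)
```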

Logistic Regression (2)
(2) If probabilistic hypotheses are actually like other linear functions, can you actually train them similarly (that is, discriminatively)?
The logistic regression model assumes: P(y = ±1 | x, w) = [1 + exp(-y(wTx + b))]^(-1)
This is the same model we derived for naïve Bayes, only that now we will not make any independence assumptions. We will directly find the best w, so training will be more difficult; however, the weight vector derived will be more expressive. It can be shown that the naïve Bayes algorithm cannot represent all linear threshold functions. On the other hand, NB converges to its performance faster. How?

Logistic Regression (2)
Given the model P(y = ±1 | x, w) = [1 + exp(-y(wTx + b))]^(-1), the goal is to find the (w, b) that maximizes the log-likelihood of the data {x1, x2, ..., xm}; equivalently, we look for the (w, b) that minimizes the negative log-likelihood:
min_{w,b} Σ_{i=1..m} -log P(y_i | x_i, w) = min_{w,b} Σ_{i=1..m} log[1 + exp(-y_i(wTx_i + b))]
This optimization problem is called logistic regression. Logistic regression is sometimes called the Maximum Entropy model in the NLP community (since the resulting distribution is the one that has the largest entropy among all those that activate the same features).
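A small sketch (illustrative, not from the slides; the toy data points are made up) of this negative log-likelihood written exactly in the form above, with labels y_i in {-1, +1}:

```python
import numpy as np

def neg_log_likelihood(w, b, X, y):
    """sum_i log(1 + exp(-y_i (w^T x_i + b))), with y_i in {-1, +1}."""
    margins = y * (X @ w + b)
    # log(1 + exp(-m)) computed stably as logaddexp(0, -m).
    return np.logaddexp(0.0, -margins).sum()

# Hypothetical toy data.
X = np.array([[1.0, 2.0], [-1.0, 0.5], [0.0, -1.5]])
y = np.array([+1, -1, -1])
print(neg_log_likelihood(np.array([0.5, -0.25]), 0.1, X, y))
```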

Logistic Regression (3)
Using the standard mapping to linear separators through the origin, we would like to minimize:
min_w Σ_{i=1..m} -log P(y_i | x_i, w) = min_w Σ_{i=1..m} log[1 + exp(-y_i wTx_i)]
To get good generalization, it is common to add a regularization term, and the regularized logistic regression problem becomes:
min_w f(w) = ½ wTw + C Σ_{i=1..m} log[1 + exp(-y_i wTx_i)]
where C is a user-selected parameter that balances the two terms: the first is the regularization term, the second is the empirical loss.
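The regularized objective and its gradient, written out as a short sketch (my own illustration, consistent with the formula above; labels in {-1, +1} and separators through the origin, so there is no bias term):

```python
import numpy as np

def objective(w, X, y, C):
    """f(w) = 0.5 * w^T w + C * sum_i log(1 + exp(-y_i w^T x_i))."""
    margins = y * (X @ w)
    return 0.5 * w @ w + C * np.logaddexp(0.0, -margins).sum()

def gradient(w, X, y, C):
    """grad f(w) = w - C * sum_i sigma(-y_i w^T x_i) * y_i * x_i."""
    margins = y * (X @ w)
    sig = 1.0 / (1.0 + np.exp(margins))   # sigma(-margin) for each example
    return w - C * (X.T @ (sig * y))
```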

Comments on Discriminative Learning
min_w f(w) = ½ wTw + C Σ_{i=1..m} log[1 + exp(-y_i wTx_i)], where C is a user-selected parameter that balances the regularization term and the empirical loss.
Since the second term is just a loss function, regularized logistic regression can be related to other learning methods, e.g., SVMs.
The L1 SVM solves the following optimization problem: min_w f1(w) = ½ wTw + C Σ_{i=1..m} max(0, 1 - y_i wTx_i)
The L2 SVM solves the following optimization problem: min_w f2(w) = ½ wTw + C Σ_{i=1..m} (max(0, 1 - y_i wTx_i))^2
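To see the relation concretely, here is a small illustrative sketch (not from the slides) that evaluates the three empirical losses, logistic, hinge (L1 SVM) and squared hinge (L2 SVM), as functions of the margin y_i wTx_i:

```python
import numpy as np

margins = np.linspace(-2, 2, 9)                        # y_i * w^T x_i

logistic_loss = np.logaddexp(0.0, -margins)            # log(1 + exp(-m))
hinge_loss = np.maximum(0.0, 1.0 - margins)            # L1-SVM loss
sq_hinge_loss = np.maximum(0.0, 1.0 - margins) ** 2    # L2-SVM loss

for m, l, h, s in zip(margins, logistic_loss, hinge_loss, sq_hinge_loss):
    print(f"margin {m:+.1f}: logistic {l:.3f}, hinge {h:.3f}, squared hinge {s:.3f}")
```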

Optimization: How to Solve
All of these methods are iterative: they generate a sequence w_k that converges to the optimal solution of the optimization problem above. There are many options within this category:
Iterative scaling: low cost per iteration, slow convergence; updates one component of w at a time.
Newton methods: high cost per iteration, faster convergence.
Non-linear conjugate gradient, quasi-Newton methods, truncated Newton methods, trust-region Newton methods; limited-memory BFGS is very popular.
Stochastic gradient descent methods: the cost of each update does not depend on n = #(examples), an advantage when n is very large. The stopping criterion is a problem: the method tends to be aggressive at the beginning and reaches moderate accuracy quite fast, but its convergence becomes slow if we are interested in more accurate solutions.
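A minimal stochastic gradient descent sketch for the regularized objective above (my own illustration, not from the slides; the decaying step size and the fixed epoch count stand in for exactly the step-size and stopping-rule choices the slide calls delicate):

```python
import numpy as np

def sgd_logistic(X, y, C=1.0, epochs=20, eta0=0.1, seed=0):
    """SGD on f(w) = 0.5 w^T w + C * sum_i log(1 + exp(-y_i w^T x_i)).

    Each update touches a single example, so the per-update cost is
    independent of the number of examples n.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = eta0 / (1.0 + eta0 * t)          # decaying step size
            margin = y[i] * (w @ X[i])
            sig = 1.0 / (1.0 + np.exp(margin))     # sigma(-margin)
            # Per-example stochastic gradient: (1/n) of the regularizer
            # plus C times this example's loss gradient.
            grad = w / n - C * sig * y[i] * X[i]
            w -= eta * grad
    return w
```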

Summary
(1) If probabilistic hypotheses are actually like other linear functions, can we interpret the outcome of other linear learning algorithms probabilistically? Yes.
(2) If probabilistic hypotheses are actually like other linear functions, can you train them similarly (that is, discriminatively)? Yes.
Classification: logistic regression / Max Entropy.
HMM: can be trained via Perceptron (Structured Learning Class: Spring 2016).

Conditional Probabilities
Data: two classes (an Open/NotOpen classifier). The plot shows a (normalized) histogram of examples as a function of the dot product act = wTx + b, together with a couple of other functions of it. In particular, we plot the positive sigmoid: P(y = +1 | x, w) = [1 + exp(-(wTx + b))]^(-1)
Is this really a probability distribution?

Conditional Probabilities
Plotting: for each example z, plot y = Prob(label = 1 | f(z) = x) (histogram: for 0.8, the number of examples with f(z) < 0.8).
Claim: yes. If Prob(label = 1 | f(z) = x) = x, then f(z) is a probability distribution. That is, yes, if the graph is linear.
Theorem: let X be a random variable with distribution F. Then F(X) is uniformly distributed in (0,1), and if U is uniform(0,1), then F^(-1)(U) is distributed as F, where F^(-1)(x) is the value y such that F(y) = x.
Alternatively: f(z) is a probability if Prob_U { z : Prob[f(z) = 1] ≤ y } = y.
Plotted for SNoW (Winnow); similarly for Perceptron. More tuning is required for SVMs.
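A sketch of the check this slide describes (illustrative only; the synthetic predictions below are generated to be perfectly calibrated): bin the predictions by their sigmoid output and compare the empirical fraction of positives in each bin with the predicted value; the output is a genuine probability estimate exactly when this plot is (approximately) the diagonal.

```python
import numpy as np

def calibration_curve(probs, labels, n_bins=10):
    """For each bin of predicted probability, return (mean prediction, empirical P(label=1))."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (probs >= lo) & (probs < hi)
        if mask.any():
            rows.append((probs[mask].mean(), labels[mask].mean()))
    return np.array(rows)

# Hypothetical predictions from a well-calibrated classifier.
rng = np.random.default_rng(1)
probs = rng.uniform(size=5000)
labels = (rng.uniform(size=5000) < probs).astype(float)  # labels drawn with P(y=1) = probs

curve = calibration_curve(probs, labels)
print(curve)   # the two columns are close to equal when f behaves like a probability
```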