WEB BAR 2004 Advanced Retrieval and Web Mining Lecture 17

Today's Topics: text classification, logistic regression, support vector machines

Intro to Logistic Regression Naïve Bayes and logistic regression are both probabilistic models. Naïve Bayes is a generative model; logistic regression is a discriminative model that more directly optimizes classification accuracy.

Intro to Logistic Regression A generative model predicts the probability that document d will be generated by a source (class) c. In the Naïve Bayes model, the parameters, i.e. the P(w|c)'s, are fit to optimally predict the generation of d.

Classify Text with a Generative Model One source model for each class c. Choose the class c with the largest value of P(c)·P(d|c). Our criterion is: how likely is it that this model generated the document? Classification accuracy is optimized only indirectly and imperfectly: accuracy and MAP (maximum a posteriori) estimation are different goals, and will in general lead to different predictions.
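The decision rule referred to here (the formula image is not reproduced in the transcript) is, in the standard Naïve Bayes setup, the maximum a posteriori rule:

```latex
\hat{c} \;=\; \arg\max_{c}\; P(c)\,P(d \mid c)
        \;=\; \arg\max_{c}\; P(c)\prod_{w \in d} P(w \mid c)
```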

Naïve Bayes For binary classification (C vs. ¬C), the decision reduces to the sign of the log odds of the two classes.
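The slide's formula is not reproduced in the transcript; for binary Naïve Bayes the log odds decompose linearly over the words, which is what makes the later comparison with logistic regression ("LR & NB: Same Parameters!") meaningful:

```latex
\log\frac{P(C \mid d)}{P(\neg C \mid d)}
  \;=\; \log\frac{P(C)}{P(\neg C)}
  \;+\; \sum_{w \in d} \log\frac{P(w \mid C)}{P(w \mid \neg C)}
```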

The Discriminative Alternative: Logistic Regression Naïve Bayes models the generation of the document from the class; logistic regression directly models the probability of the class conditional on the words w.

Logistic regression: tune the parameters β_w to optimize the conditional likelihood (the class probability predictions). It is what a statistician would probably tell you to use if you said you had a categorical decision problem (like text categorization).

The Logit-Based Model The simplest model for optimizing predictive accuracy is linear regression: p = α + βX + ε. Why don't we use linear regression? The normality assumption does not work for probabilities. We need to transform the input variables and the predicted variable to be able to apply regression. The transformation is the logit: logit(p) = ln[p/(1−p)] = α + βX + ε. Applied also to the input variables: logit(p) = a + b1·logit(F1) + b2·logit(F2) + … + bn·logit(Fn) (where p ≈ class probability and Fi ≈ word i). Model: the logit of a predicted probability is the (weighted) sum of the logits of the probabilities associated with each of the features. This is an instance of a generalized linear model, where one response is conditioned on all features.

Logit and Logistic Logit: ln[p/(1−p)] = α + βX. The slope coefficient β is interpreted as the rate of change in the log odds as X changes. A more intuitive interpretation of the logit is the odds ratio: since p/(1−p) = exp(α + βX), exp(β) is the effect of the independent variable on the odds of having a certain classification. Logistic(X) = 1/(1 + exp(−α − βX)).

Logit and logistic transforms (plots of both curves): logit(p) = ln(p/[1−p]); logistic(x) = 1/(1 + e^−x).
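A minimal sketch (not from the original slides) of the two transforms and a check that they are inverses of each other:

```python
import math

def logit(p):
    """Log odds: ln(p / (1 - p)), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))

def logistic(x):
    """Inverse of the logit: 1 / (1 + e^{-x}), maps the reals to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# logistic(logit(p)) recovers p
for p in (0.1, 0.5, 0.9):
    print(p, logistic(logit(p)))
```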

Classification Compute the vector representation X of the document. Compute z = α + β·X, the dot product of the weight vector β with the vector representation X; β defines a hyperplane, as before. P(C) = logistic(z) = 1/(1 + e^−z) is the probability that the document is in the class. If we use a good method to estimate β, then this will be a "good" probability (as opposed to Naïve Bayes, whose probability estimates are poorly calibrated).
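A sketch of this scoring step, assuming the document has already been converted to a feature vector x and that the weights beta and intercept alpha have been estimated (the names are illustrative, not from the lecture):

```python
import numpy as np

def predict_prob(x, beta, alpha):
    """P(C | x) under logistic regression: logistic(alpha + beta . x)."""
    z = alpha + np.dot(beta, x)      # score w.r.t. the hyperplane defined by beta
    return 1.0 / (1.0 + np.exp(-z))  # logistic link maps the score to (0, 1)

# Toy example: 3 features, illustrative weights
x = np.array([1.0, 0.0, 2.0])
beta = np.array([0.5, -1.2, 0.3])
alpha = -0.1
print(predict_prob(x, beta, alpha))
```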

Training a Logistic Regression Model Training consists of computing the parameters α and β by MLE (maximum likelihood estimation). MLE is a statistical method for estimating the coefficients of a model so as to maximize a likelihood. Here the likelihood function L measures the probability of observing the particular set of class labels (C, not C) that occur in the training data. Logistic regression MLE is normally done by some form of iterative fitting algorithm or a gradient procedure such as conjugate gradient (CG), which can be expensive for large models with many features.
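A minimal sketch of what such an iterative fit looks like, here plain batch gradient ascent on the conditional log likelihood rather than the conjugate-gradient method mentioned above (all names and data are illustrative):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, iters=1000):
    """X: n x d feature matrix, y: labels in {0, 1}.
    Returns (alpha, beta) maximizing the conditional log likelihood
    by batch gradient ascent (a simple stand-in for CG / iterative fitting)."""
    n, d = X.shape
    alpha, beta = 0.0, np.zeros(d)
    for _ in range(iters):
        z = alpha + X @ beta
        p = 1.0 / (1.0 + np.exp(-z))   # predicted P(C | x_i)
        err = y - p                    # gradient of the log likelihood w.r.t. z
        alpha += lr * err.mean()
        beta += lr * (X.T @ err) / n
    return alpha, beta

# Toy usage: label is 1 exactly when the first feature is 1
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])
print(fit_logistic(X, y))
```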

LR & NB: Same Parameters! - Binary or raw TF weighting - Optimized differently

Performance Early results with logistic regression were disappointing because people didn't understand how to regularize (smooth) it to cope with sparse data. Done right, logistic regression outperforms Naïve Bayes in text classification: NB optimizes its parameters to predict words, logistic regression optimizes them to predict the class. Logistic regression seems as good as SVMs (or any known text categorization method – Zhang & Oles 2001), though it is less studied and less trendy than SVMs.
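For illustration only (scikit-learn did not exist at the time of this lecture), a regularized logistic regression text classifier along these lines might look like the following; the documents and labels are toy data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# L2-regularized ("smoothed") logistic regression over sparse tf-idf features;
# C controls the inverse regularization strength.
docs = ["cheap pills buy now", "meeting agenda attached",
        "win money fast", "project status report"]
labels = [1, 0, 1, 0]   # toy spam / not-spam labels

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0))
clf.fit(docs, labels)
print(clf.predict(["buy cheap meeting"]))
```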

Support Vector Machines

Recall: Which Hyperplane? In general, there are lots of possible solutions; the Support Vector Machine (SVM) finds an optimal one.

Support Vector Machine (SVM) SVMs maximize the margin around the separating hyperplane. The decision function is fully specified by a subset of the training samples, the support vectors. Finding this hyperplane is a quadratic programming problem. SVMs are seen by many as the most successful current text classification method.

Maximum Margin: Formalization w: hyperplane normal; x_i: data point i; y_i: class of data point i (+1 or −1). Constrained optimization formalization: (1) classify every point correctly and outside the margin, y_i(w·x_i + b) ≥ 1 for all i; (2) maximize the margin, 2/||w||.
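Spelled out, the standard hard-margin primal that this formalization corresponds to is the following (maximizing the margin 2/||w|| is equivalent to minimizing ½||w||²):

```latex
\min_{w,\,b}\ \tfrac{1}{2}\|w\|^{2}
\quad\text{subject to}\quad
y_i\,(w \cdot x_i + b) \;\ge\; 1 \quad\text{for all } i
```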

Support vectors The key differentiator of SVMs is their reliance on support vectors. Conceptually: only what is close to the decision boundary should matter. Why is the margin determined by the support vectors only?

Quadratic Programming The quadratic programming setup explains two properties of SVMs: 1. classification is determined by the support vectors only; 2. the notion of a kernel.

Quadratic Programming Most α_i will be zero; the non-zero α_i correspond to the support vectors. One can show that the hyperplane normal w with maximum margin is w = Σ_i α_i y_i x_i, where α_i are the Lagrange multipliers, x_i is data point i, and y_i is the class of data point i (+1 or −1). The α_i are the solution to the dual maximization problem shown below.
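The dual objective the slide refers to (not reproduced in the transcript) is, in its standard form:

```latex
\max_{\alpha}\ \sum_i \alpha_i
  \;-\; \tfrac{1}{2}\sum_i\sum_j \alpha_i\alpha_j\, y_i y_j\,(x_i \cdot x_j)
\quad\text{subject to}\quad
\alpha_i \ge 0,\qquad \sum_i \alpha_i y_i = 0
```

Note that the training points enter only through the dot products x_i·x_j, which is what makes the kernel trick on the later slides possible.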

Non-Separable Case Now we know how to build a separator for two linearly separable classes. What about classes whose example documents are not linearly separable?

Not Linearly Separable Find a line that penalizes points on “the wrong side”.

Penalizing Bad Points Define a distance for each point with respect to the separator ax + by = c: (ax + by) − c for red points, c − (ax + by) for green points. This distance is negative for points on the wrong side (bad points).
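In the standard SVM treatment, this penalty idea becomes the soft-margin problem with slack variables ξ_i and a cost parameter C (a standard formulation, not spelled out on the slide itself):

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^{2} + C\sum_i \xi_i
\quad\text{subject to}\quad
y_i\,(w \cdot x_i + b) \;\ge\; 1 - \xi_i,\qquad \xi_i \ge 0
```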

Classification with SVMs Given a new point (x1, x2), we can score its projection onto the hyperplane normal: compute score = w·x + b; in 2 dimensions, score = w1·x1 + w2·x2 + b. Set a confidence threshold t: score > t → yes; score < −t → no; otherwise → don't know.
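A sketch of this decision rule, assuming w and b have already been obtained from training (the threshold t and the three-way output are as described on the slide; the weights are illustrative):

```python
import numpy as np

def svm_decide(x, w, b, t=1.0):
    """Three-way SVM decision: 'yes', 'no', or "don't know" within the threshold band."""
    score = np.dot(w, x) + b      # signed score relative to the hyperplane
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"

w = np.array([0.8, -0.6])        # illustrative weights in 2 dimensions
b = 0.1
print(svm_decide(np.array([2.0, 1.0]), w, b))
```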

SVMs: Predicting Generalization We want the classifier with the best generalization (best accuracy on new data). What are clues for good generalization? A large training set, low error on the training set, and low capacity/variance (number of parameters in the model, expressive power of the model). SVMs give you an explicit bound on the error on new data based on these.

Capacity/Variance: VC Dimension Theoretical risk bound: the risk R(α) (mean error rate) is bounded by the empirical risk R_emp plus a confidence term, where α is the model (defined by its parameters), l is the number of observations, and h is the VC dimension; the bound holds with probability 1 − η. VC dimension / capacity: the maximum number of points that can be shattered. A set can be shattered if the classifier can learn every possible labeling of it. VC = Vapnik–Chervonenkis dimension.
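The bound itself is not reproduced in the transcript; the standard Vapnik bound matching the symbols listed above is presumably what the slide showed:

```latex
R(\alpha) \;\le\; R_{\mathrm{emp}}(\alpha)
  \;+\; \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) \;-\; \ln\frac{\eta}{4}}{l}}
```

which holds with probability 1 − η.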

Capacity of Hyperplanes?

Exercise Suppose you have n points in d dimensions, labeled red or green. How big need n be (as a function of d) in order to create an example with the red and green points not linearly separable? E.g., for d=2, n ≥ 4.

Capacity/Variance: VC Dimension Theoretical risk bound (as above): R_emp is the empirical risk, l the number of observations, h the VC dimension; the bound holds with probability 1 − η. This is an important theoretical property, but it is not very often used in practice.

SVM Kernels Recall the dual objective we are maximizing: the data only occur in dot products x_i·x_j. We can therefore map the data into a very high-dimensional space (even infinite-dimensional!) as long as the kernel is computable. For a mapping function Φ, compute the kernel K(i, j) = Φ(x_i)·Φ(x_j). Example: a quadratic kernel such as K(x_i, x_j) = (1 + x_i·x_j)².

Kernels Why use kernels? To make a non-separable problem separable, and to map the data into a better representational space. Common kernels: linear, polynomial, radial basis function (infinite-dimensional space).
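A sketch of these kernel functions as plain Python functions (the degree, c, and gamma parameters are illustrative defaults, not values from the lecture):

```python
import numpy as np

def linear_kernel(x, z):
    return np.dot(x, z)

def polynomial_kernel(x, z, degree=2, c=1.0):
    # (c + x.z)^degree: implicit feature space of all monomials up to 'degree'
    return (c + np.dot(x, z)) ** degree

def rbf_kernel(x, z, gamma=1.0):
    # exp(-gamma * ||x - z||^2): corresponds to an infinite-dimensional feature space
    return np.exp(-gamma * np.sum((x - z) ** 2))

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, z), polynomial_kernel(x, z), rbf_kernel(x, z))
```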

Results for Kernels (Joachims)

Performance of SVM SVMs are seen as the best-performing method by many, but the statistical significance of most results is not clear. There are many methods that perform about as well as SVMs, for example regularized logistic regression (Zhang & Oles). An example of a comparison study: Yang & Liu.

Yang&Liu: SVM vs Other Methods

Yang&Liu: Statistical Significance

Summary Support vector machines (SVMs): choose the hyperplane based on the support vectors, where a support vector is a "critical" point close to the decision boundary; kernels are a powerful and elegant way to define a similarity metric; SVMs come with a bound on "risk" (expected error on the test set); possibly the best performing text classifier, and partly popular due to the availability of SVMlight, which is accurate and fast – and free (for research). Logistic regression (LR): a traditional statistical technique for classification; it does not work "out of the box" due to the high dimensionality of text, but robust/regularized versions perform about as well as SVMs; no equivalent of SVMlight is available. (Degree-1) SVMs and LR are both linear classifiers.

Resources
Manning and Schütze. Foundations of Statistical Natural Language Processing, Chapter 16. MIT Press.
Trevor Hastie, Robert Tibshirani and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer-Verlag, New York.
Christopher J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. 1998.
R. M. Tong, L. A. Appelbaum, V. N. Askman, J. F. Cunningham. Conceptual Information Retrieval using RUBRIC. Proc. ACM SIGIR, 1987.
S. T. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), Jul/Aug 1998.
S. T. Dumais, J. Platt, D. Heckerman and M. Sahami. Inductive learning algorithms and representations for text categorization. Proceedings of CIKM '98.
Yiming Yang and Xin Liu. A re-examination of text categorization methods. Proceedings of the 22nd Annual International ACM SIGIR, 1999.
Tong Zhang and Frank J. Oles. Text Categorization Based on Regularized Linear Classification Methods. Information Retrieval 4(1): 5-31, 2001.
'Classic' Reuters data set: /resources/testcollections/reuters21578/