CS347 Lecture 9 May 11, 2001 ©Prabhakar Raghavan

Today’s topic Automatic document classification –Rule-based classification –Supervised learning

Classification Given one or more topics, decide which one(s) a given document belongs to. Applications –Classification into a topic taxonomy –Intelligence analysts –Routing to help desks/customer service Choice of “topic” must be unique

Step back Manual classification –accurate when done by subject experts –consistent when done by a small team –difficult to scale –used by Yahoo!, Looksmart, about.com, ODP: hundreds of subject editors maintain thousands of topics (organized in a tree-like navigation structure)

Supervised vs. unsupervised learning Unsupervised learning: –Given corpus, infer structure implicit in the docs, without prior training. Supervised learning: –Train system to recognize docs of a certain type (e.g., docs in Italian, or docs about religion) –Decide whether or not new docs belong to the class(es) trained on

Challenges Must teach machine a model of each topic Given new doc, must measure fit to model(s) Evaluation: how well does the system perform? Threshold of pain: how confident is the system’s assessment? –Sometimes better to give up.

Teaching the system models Through an explicit query Through exemplary docs Combination

Explicit queries For the topic “Performing Arts”, the query accrues evidence from stemmed and non-stemmed words. The query is built by a subject expert. A new doc is scored against the query: –if the accrued evidence exceeds some threshold, declare the doc to belong to this topic.
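A minimal sketch of such a query scorer in Python (the terms, weights, and threshold here are invented for illustration):

```python
# A hand-built "explicit query" for a topic, scored against a doc.
# Terms, weights, and the threshold are illustrative assumptions.
PERFORMING_ARTS_QUERY = {
    "theater": 2.0, "theatre": 2.0,   # non-stemmed variants
    "ballet": 3.0, "symphony": 3.0,
    "perform": 1.0,                   # stemmed form
}
THRESHOLD = 3.0

def accrued_evidence(doc_tokens):
    """Sum the weights of query terms that occur in the doc."""
    tokens = set(doc_tokens)
    return sum(w for term, w in PERFORMING_ARTS_QUERY.items() if term in tokens)

doc = "the ballet company will perform tonight".split()
score = accrued_evidence(doc)
print(score, score >= THRESHOLD)   # 4.0 True -> assign "Performing Arts"
```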

Explicit queries Topic queries can be built up from other topic queries.

Large scale applications Document routing Customer service Profiled newsfeeds Spam/porn filtering

Typical example Dow Jones –Over 100,000 standing profiles –A profile can have >100 atomic terms –Common sub-expressions shared by different topics –Optimizing this sharing is a hard problem.

Example of sharing [Diagrams: the profile “Costa Rica Eco-Tourism Hotels” is the query tree Costa Rica AND Eco-Tourism AND (hotel OR room OR suite); the profile “Lake Tahoe area Hotels” is (Tahoe OR Donner OR Boreal) AND (hotel OR room OR suite). First drawn as two independent trees, then redrawn with the common (hotel OR room OR suite) subexpression shared between them.]

Measuring classification Figures of merit include: –Accuracy of classification (more below) –Speed of classification (docs/hour) –Effort in training system (human hours/topic)

Factors affecting measures Documents –size, length –quality/style of authorship –uniformity of vocabulary Accuracy measurement –need a definitive judgment on which topic(s) a doc belongs to (usually human)

Accuracy measurement Confusion matrix: rows are the actual topic, columns the topic assigned by the classifier. The (i, j) entry counts docs actually in topic i that the classifier put in topic j (e.g., an entry of 53 means 53 such docs).

Confusion matrix Function of classifier, topics and test docs. For a perfect classifier, all off-diagonal entries should be zero.

Confusion measures Let $c_{ij}$ denote the $(i, j)$ entry of the confusion matrix. Fraction of docs in topic i classified correctly (recall): $c_{ii} / \sum_j c_{ij}$. Fraction of docs assigned topic i that are actually about topic i (precision): $c_{ii} / \sum_j c_{ji}$. Fraction of docs classified correctly overall: $\sum_i c_{ii} / \sum_{i,j} c_{ij}$.
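A minimal sketch of these measures in Python (numpy assumed; matrix entries are toy numbers):

```python
# The three confusion measures; C[i, j] counts docs actually in
# topic i that the classifier assigned to topic j.
import numpy as np

C = np.array([[53,  2,  5],
              [ 3, 40,  7],
              [ 1,  4, 60]])

recall = C.diagonal() / C.sum(axis=1)      # correct / actually in topic i
precision = C.diagonal() / C.sum(axis=0)   # correct / assigned topic i
accuracy = C.diagonal().sum() / C.sum()    # overall fraction correct

print(recall, precision, accuracy)
```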

Classification by exemplary docs Feed the system exemplary docs on a topic (training) Positive as well as negative examples System builds its model of the topic Subsequent test docs are evaluated against the model –system decides whether each test doc is a member of the topic

More generally, set of topics Exemplary docs for each Build model for each topic –differential models Given test doc, decide which topic(s) it belongs to

Recall doc as vector Each doc j is a vector, one component for each term. Normalize to unit length. Have a vector space –terms are axes –n docs live in this space –even with stemming, the space may have a very large number of dimensions
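A minimal sketch of the vector representation, assuming scikit-learn for tf-idf vectorization:

```python
# Docs as unit-length vectors in term space, via tf-idf weighting
# (scikit-learn assumed as the vectorizing library).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["government passes budget bill",
        "new telescope observes distant galaxy"]
X = TfidfVectorizer(norm="l2").fit_transform(docs)  # rows have unit length
print(X.shape)   # (n_docs, n_terms): one axis per term, one point per doc
```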

Classification using vector spaces Each training doc a point (vector) labeled by its topic Hypothesis: docs of the same topic form a contiguous region of space Define surfaces to delineate topics in space

Topics in a vector space [Figure: training docs plotted as points, with regions labeled Government, Science, and Arts.]

Given a test doc Figure out which region it lies in Assign corresponding topic

Test doc = Government [Figure: the same space; the test doc falls in the Government region.]

Issues How do we define (and find) the separating surfaces? How do we compose separating surfaces into regions? How do we test which region a test doc is in?

Composing surfaces: issues [Figure: candidate separating surfaces whose composition leaves ambiguous regions, marked “?”.]

Separation by hyperplanes Assume linear separability for now: –in 2 dimensions, can separate by a line –in higher dimensions, need hyperplanes Can find a separating hyperplane by linear programming: –the separator can be expressed as ax + by = c

Linear programming Find a, b, c such that: ax + by ≥ c for red points ax + by ≤ c for green points.
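A minimal sketch of this feasibility problem, assuming scipy for the LP solver and invented 2-D points:

```python
# Find a separating line a*x + b*y = c by linear programming.
import numpy as np
from scipy.optimize import linprog

red = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5]])
green = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]])

# Ask for a margin of 1 so the trivial a = b = c = 0 is excluded:
#   a*x + b*y - c >=  1 for red   ->  -a*x - b*y + c <= -1
#   a*x + b*y - c <= -1 for green ->   a*x + b*y - c <= -1
A_ub = np.vstack([np.hstack([-red, np.ones((len(red), 1))]),
                  np.hstack([green, -np.ones((len(green), 1))])])
b_ub = -np.ones(len(red) + len(green))

# Any feasible point will do, so the objective is zero.
res = linprog(c=[0, 0, 0], A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * 3)
a, b, c = res.x
print(f"separator: {a:.2f}*x + {b:.2f}*y = {c:.2f}")
```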

Which hyperplane? In general, lots of possible solutions for a,b,c.

Support Vector Machine (SVM) Support vectors Maximize the margin A quadratic programming problem The decision function is fully specified by the training samples (the support vectors) that lie on two parallel hyperplanes
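A minimal sketch, assuming scikit-learn; a very large C approximates the hard-margin SVM on toy data:

```python
# Maximum-margin linear separator via scikit-learn (assumed dependency).
from sklearn.svm import SVC

X = [[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],    # topic +1
     [0.0, 0.0], [1.0, 0.5], [0.5, 1.0]]    # topic -1
y = [1, 1, 1, -1, -1, -1]

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C ~ hard margin
print("w:", clf.coef_[0], "b:", clf.intercept_[0])
print("support vectors:", clf.support_vectors_)  # points on the margin hyperplanes
```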

Building an SVM classifier Now we know how to build a separator for two linearly separable topics What about topics whose exemplary docs are not linearly separable? What about >2 topics?

Not linearly separable Find a line that penalizes points on “the wrong side”.

Exercise Suppose you have n points in d dimensions, labeled red or green. How large must n be (as a function of d) for it to be possible that the red and green points are not linearly separable? E.g., for d = 2, n ≥ 4.

Penalizing bad points Define a signed distance for each point with respect to the separator ax + by = c: (ax + by) − c for red points, c − (ax + by) for green points. The distance is negative for points on the wrong side.

Solve quadratic program Solution gives a “separator” between the two topics: a choice of a, b, c. Given a new point (x, y), can score its proximity to each class: –evaluate ax + by against c –set a confidence threshold
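A minimal sketch putting these pieces together, assuming scikit-learn; the data, the soft-margin C, and the confidence threshold are all illustrative:

```python
# Soft-margin SVM on non-separable toy data; a smaller C permits
# (but penalizes) points on the wrong side of the separator.
from sklearn.svm import SVC

X = [[0.0, 0.0], [1.0, 0.0], [2.5, 2.5],    # green (last point sits among reds)
     [2.0, 2.0], [3.0, 3.0], [3.0, 2.0]]    # red
y = [-1, -1, -1, 1, 1, 1]

clf = SVC(kernel="linear", C=0.1).fit(X, y)

score = clf.decision_function([[2.0, 1.0]])[0]  # proximity to the separator
THRESHOLD = 0.25                                # "threshold of pain"
if abs(score) < THRESHOLD:
    print("low confidence: give up")
else:
    print("red" if score > 0 else "green", score)
```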

Category: Interest Example SVM features ($w_i$, $t_i$): the largest positive weights are 0.70 prime, 0.67 rate, 0.63 interest, 0.60 rates, 0.46 discount, 0.43 bundesbank, 0.43 baker; the negatively weighted terms include dlrs, world, sees, year, group, dlr, january.

Separating multiple topics Build a separator between each topic and its complementary set (docs from all other topics). Given test doc, evaluate it for membership in each topic. Declare membership in topics above threshold.
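A minimal sketch of the one-vs-rest scheme, assuming scikit-learn and invented toy docs:

```python
# One linear separator per topic vs. its complement; LinearSVC in
# scikit-learn (assumed) trains one-vs-rest separators automatically.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_docs = ["prime interest rate raised", "bundesbank cuts discount rate",
              "eco-tourism lodge opens in costa rica", "hotel suites near lake tahoe",
              "ballet season opens at the symphony hall", "theater troupe performs tonight"]
train_topics = ["Interest", "Interest", "Travel", "Travel", "Arts", "Arts"]

vec = TfidfVectorizer()
clf = LinearSVC().fit(vec.fit_transform(train_docs), train_topics)

# One score per topic; declare membership in topics above a threshold.
scores = clf.decision_function(vec.transform(["discount rate announcement"]))
print(dict(zip(clf.classes_, scores[0])))
```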

Negative examples Formulate as above, except that explicit negative examples for a topic are added to its complementary set. [Figure: positive and negative example points for a topic.]

Recall explicit queries Can be viewed as defining a region in vector space. No longer linear separators.

Simple example Query = Costa Rica AND hotel [Figure: a plane with axes “Costa Rica” and “hotel”; the topic region is the rectangular area where both term weights are high, which no single linear separator delineates.]

Challenge Combining rule-based and machine learning based classifiers. –Nonlinear decision surfaces vs. linear. –User interface and expressibility issues.

UI issues Can specify rule-based query in the interface. Can exemplify docs. What is the representation of the combination?

Classification - closing remarks Can also use Bayesian nets to formulate classification –Compute probability doc belongs to a class, conditioned on its contents Many fancy schemes exist for term weighting in vectors, beyond simple tf × idf.
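A minimal sketch of the probabilistic alternative, assuming scikit-learn's multinomial Naive Bayes and invented toy docs:

```python
# P(topic | doc contents) via multinomial Naive Bayes
# (scikit-learn assumed as the dependency).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["interest rates rise again", "discount rate cut",
              "hotel opens in costa rica", "eco tourism in tahoe"]
train_topics = ["Interest", "Interest", "Travel", "Travel"]

vec = CountVectorizer()
nb = MultinomialNB().fit(vec.fit_transform(train_docs), train_topics)

probs = nb.predict_proba(vec.transform(["rate announcement"]))
print(dict(zip(nb.classes_, probs[0])))   # per-topic conditional probabilities
```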

Resources R.M. Tong, L.A. Appelbaum, V.N. Askman, J.F. Cunningham. Conceptual Information Retrieval using RUBRIC. Proc. ACM SIGIR, 1987. S.T. Dumais. Using SVMs for text categorization. IEEE Intelligent Systems, 13(4), Jul/Aug 1998.