1
Smart RSS Aggregator: A text classification problem. Alban Scholer & Markus Kirsten, 2005
2
Introduction ● Smart RSS aggregator ● Predicts how interesting a user finds an unread article ● Presents news articles depending on the prediction
3
Issues ● Extremely high dimensional data ● Lots of unlabeled data ● Few training examples ● Only clickthrough information ● Multiuser environment
4
Support Vector Machine ● Max-margin for generalization ● Linear, but easily extended to non-linear classification
5
Max-margin separator [figure: two classes of training points separated by the maximum-margin hyperplane]
6
SVM ● The problem of finding the optimal w can be reduced to the following QP
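The QP itself is not reproduced in this transcript. As a reference sketch (not the slide's own notation), the standard soft-margin SVM with slack variables ξ_i and regularization parameter C is:

```latex
\begin{aligned}
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\;& \tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}\\
\text{s.t. }\;& y_{i}\,(\mathbf{w}\cdot\mathbf{x}_{i} + b) \ge 1 - \xi_{i},\qquad \xi_{i}\ge 0,\quad i=1,\dots,n
\end{aligned}
```

Maximizing the margin corresponds to minimizing ||w||; the slack terms allow some training points to violate the margin.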
7
Transductive SVM (TSVM) ● Semi-supervised learning vs. supervised learning ● TSVM is well suited for problems where: – few labeled data are available – lots of unlabeled data are available ● Information lying in the unlabeled data is captured and modifies the decision surface.
8
TSVM vs. SVM
9
TSVM optimization problem ● New optimization variables: the labels y_i* of the unlabeled examples (see the formulation sketched below) ● New set of slack variables for the unlabeled examples ● New user-specified parameter: C* ● Very difficult optimization problem: – Intractable when the number of unlabeled examples is greater than 10 – Approximate solution proposed by Joachims.
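A sketch of the formulation as given in Joachims (1999): with n labeled examples (x_i, y_i) and k unlabeled examples x_j*, the TSVM optimizes over the unknown labels y_j* as well, with separate slack variables and penalties for the two sets:

```latex
\begin{aligned}
\min_{y_{1}^{*},\dots,y_{k}^{*},\;\mathbf{w},\,b,\,\boldsymbol{\xi},\,\boldsymbol{\xi}^{*}}\;&
\tfrac{1}{2}\lVert\mathbf{w}\rVert^{2} + C\sum_{i=1}^{n}\xi_{i} + C^{*}\sum_{j=1}^{k}\xi_{j}^{*}\\
\text{s.t. }\;& y_{i}\,(\mathbf{w}\cdot\mathbf{x}_{i} + b) \ge 1-\xi_{i},\qquad \xi_{i}\ge 0\\
& y_{j}^{*}\,(\mathbf{w}\cdot\mathbf{x}_{j}^{*} + b) \ge 1-\xi_{j}^{*},\qquad \xi_{j}^{*}\ge 0
\end{aligned}
```

The search over the discrete labels y_j* is what makes the problem combinatorial and motivates Joachims' approximate, label-switching solution.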
10
Text Classification ● Joachims T., "Transductive Inference for Text Classification using Support Vector Machines" ● Characteristics of the Text Classification problem ● Why are SVM and TSVM well suited for this kind of problem? ● Feature selection for text classification using SVM
11
Characteristics of the Text Classification problem ● High dimensional input space – One dimension for each word in the vocabulary (10 000 words) ● Sparse input vector – In any one text, only a tiny proportion of the full vocabulary is used
12
Why (T)SVM? ● SVM has been shown to perform well in these conditions and can outperform other classifiers. ● Transductive SVM, exploiting information in test data, can outperform SVM when few training samples but lots of test data are available.
13
Feature selection for Text Classification using SVM ● Feature selection is the main problem in many machine learning applications. ● Poor feature selection leads to poor accuracy.
14
Feature selection (cont) ● For the text classification problem: – The number of dimensions of the document vector is the number of words in the vocabulary (a huge number of dimensions!) – Each component of the document vector is the number of times the corresponding word occurs in the document (a minimal sketch follows).
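Not part of the original slides: a minimal Python sketch of this bag-of-words counting, assuming a fixed (toy) vocabulary. A dictionary of counts stands in for the huge, mostly-zero document vector.

```python
from collections import Counter

def count_vector(text, vocabulary):
    """Map a document to sparse word counts over a fixed vocabulary."""
    tokens = text.lower().split()
    return dict(Counter(t for t in tokens if t in vocabulary))

# Hypothetical toy vocabulary and document
vocab = {"text", "classification", "task", "problem"}
print(count_vector("the text classification task is a classification problem", vocab))
# -> {'text': 1, 'classification': 2, 'task': 1, 'problem': 1}
```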
15
Feature selection (cont) ● Refinement of the feature selection: – Joachims weights this document vector by the Inverse Document Frequency of each relevant word in the document (see the sketch below). – The IDF can be computed from the Document Frequency DF(w): ● IDF(w) = log(n / DF(w)) ● where n is the total number of documents
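Again not from the slides: a small Python sketch of the IDF weighting described above, using the formula IDF(w) = log(n / DF(w)). Function and variable names are illustrative.

```python
import math
from collections import Counter

def idf_weights(documents):
    """IDF(w) = log(n / DF(w)), where DF(w) is the number of documents containing w."""
    n = len(documents)
    df = Counter(w for doc in documents for w in set(doc.lower().split()))
    return {w: math.log(n / df[w]) for w in df}

def tfidf_vector(doc, idf):
    """Weight each word count in the document by its inverse document frequency."""
    tf = Counter(doc.lower().split())
    return {w: count * idf.get(w, 0.0) for w, count in tf.items()}

docs = ["text classification task", "text classification problem", "rss news aggregator"]
idf = idf_weights(docs)
print(tfidf_vector("text classification task", idf))
# 'text' and 'classification' appear in 2 of 3 documents, 'task' in only 1,
# so 'task' receives the largest weight.
```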
16
Feature selection (cont) ● Other refinements: – Stopword elimination – Word stemming
17
Feature selection (cont) ● Example: "the text classification task is characterized by a special set of characteristics. The text classification problem...." ● Transformation of the above text into a feature vector
18
Feature selection (example) – text: 2 – classification: 2 – task: 1 – charact: 2 ● The document vector is very sparse ● The words characteristics and characterized reduce to the same stem, charact (a sketch of the whole pipeline follows)
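To illustrate the whole pipeline (stopword elimination, stemming, counting) on the example text, here is a hedged Python sketch. The stopword list is a toy one and crude_stem only strips a couple of suffixes; a real system would use a proper stemmer such as Porter's.

```python
import re
from collections import Counter

STOPWORDS = {"the", "is", "by", "a", "of"}  # toy stopword list for illustration

def crude_stem(word):
    """Very crude suffix stripping; stands in for a real stemmer (e.g. Porter)."""
    for suffix in ("eristics", "erized"):  # maps characteristics/characterized -> charact
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

def features(text):
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = (crude_stem(t) for t in tokens if t not in STOPWORDS)
    return Counter(stems)

print(features("the text classification task is characterized by a special "
               "set of characteristics. The text classification problem"))
# -> text: 2, classification: 2, charact: 2, task: 1, plus the remaining content words
```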
19
Smart stuff ● WordNet ● Combinations of words ● Putting users into clusters ● Using additional features (links, dates, author, source, etc.) ● Active learning
20
Conclusion ● TSVM is well suited for text classification problems ● Feature selection is crucial ● To boost accuracy to a reasonable level, we have to combine techniques.
21
References ● Simon Haykin, Neural Networks, Second Edition, Pearson Education, 1999, chapter 6 ● Thorsten Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of ICML-99, 16th International Conference on Machine Learning, 1999
22
References (cont) ● Tom M. Mitchell, Machine Learning, McGraw-Hill International Editions, 1997, chapter 6 ● K. Nigam, A. K. McCallum, S. Thrun, T. Mitchell, Text Classification from Labeled and Unlabeled Documents using EM, Machine Learning, Kluwer Academic Publishers, Boston, 1999