
1 Text Classification using SVM-light (DSSI 2008, Jing Jiang)

2 Text Classification
Goal: to classify documents (news articles, emails, Web pages, etc.) into predefined categories.
Examples:
–To classify news articles into "business" and "sports"
–To classify Web pages into personal home pages and others
–To classify product reviews into positive reviews and negative reviews
Approach: supervised machine learning
–For each predefined category, we need a set of training documents known to belong to that category.
–From the training documents, we train a classifier.

3 Overview
Step 1: text pre-processing
–to pre-process the text and represent each document as a feature vector
Step 2: training
–to train a classifier using a classification tool (e.g., SNoW, SVM-light)
Step 3: classification
–to apply the classifier to new documents

4 Pre-processing: tokenization
Goal: to separate text into individual words
Example: "We're attending a tutorial now." → we 're attending a tutorial now
Tool:
–Word Splitter: http://l2r.cs.uiuc.edu/~cogcomp/atool.php?tkey=WS
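If you prefer not to install an external tool, a bare-bones tokenizer is easy to sketch. The code below is a minimal illustration, not the Word Splitter tool: it splits on non-alphanumeric characters and lowercases (both assumptions, since the slides do not specify case handling), so it yields "we re" rather than Word Splitter's "we 're" for the contraction.

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

// Minimal tokenizer sketch: split on any non-alphanumeric character
// and lowercase the result. Real tools (e.g., Word Splitter) handle
// contractions like "We're" -> "we 're" more carefully.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text) {
        if (std::isalnum(static_cast<unsigned char>(c))) {
            current += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}

int main() {
    for (const std::string& t : tokenize("We're attending a tutorial now."))
        std::cout << t << '\n';   // we re attending a tutorial now
}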

5 Pre-processing: stop word removal (optional)
Goal: to remove common words that are usually not useful for text classification
Example: to remove words such as "a", "the", "I", "he", "she", "is", "are", etc.
Stop word list:
–http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
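Stop word removal is just a set-membership test per token. A minimal sketch, assuming the list above has been downloaded to a local file stop_words.txt (a hypothetical filename), one word per line:

#include <fstream>
#include <string>
#include <unordered_set>
#include <vector>

// Load a stop word list (one word per line) into a hash set.
std::unordered_set<std::string> loadStopWords(const std::string& path) {
    std::unordered_set<std::string> stop;
    std::ifstream in(path);
    std::string word;
    while (in >> word) stop.insert(word);
    return stop;
}

// Keep only the tokens that do not appear in the stop word set.
std::vector<std::string> removeStopWords(const std::vector<std::string>& tokens,
                                         const std::unordered_set<std::string>& stop) {
    std::vector<std::string> kept;
    for (const std::string& t : tokens)
        if (stop.count(t) == 0) kept.push_back(t);
    return kept;
}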

6 Pre-processing: stemming (optional)
Goal: to normalize words derived from the same root
Examples:
–attending → attend
–teacher → teach
Tool:
–Porter stemmer: http://tartarus.org/~martin/PorterStemmer/
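The Porter stemmer applies a cascade of context-sensitive suffix-stripping rules; the page above provides reference implementations in many languages. Purely to illustrate the idea of suffix stripping, a toy single-pass stripper (not the Porter algorithm) might look like:

#include <string>

// Toy suffix stripper, for illustration only. The real Porter
// algorithm applies dozens of context-sensitive rules in five phases.
std::string toyStem(std::string w) {
    auto stripSuffix = [&w](const std::string& suf) {
        if (w.size() > suf.size() &&
            w.compare(w.size() - suf.size(), suf.size(), suf) == 0) {
            w.erase(w.size() - suf.size());
            return true;
        }
        return false;
    };
    // Try the longest suffix first; stop at the first match.
    if (!stripSuffix("ing") && !stripSuffix("er")) stripSuffix("s");
    return w;  // toyStem("attending") == "attend", toyStem("teacher") == "teach"
}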

7 Pre-processing: feature extraction
Unigram features: use each word as a feature
–Use TF (term frequency) as the feature value
–Or use TF*IDF as the feature value, where IDF is the inverse document frequency of term t:
–IDF(t) = log(total-number-of-documents / number-of-documents-containing-t)
Bigram features: use each pair of two consecutive words as a feature
Tool (see the sketch below):
–Write your own program/script
–Lemur API
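As a sketch of the "write your own program/script" route: given per-document term counts, TF*IDF values follow directly from the formula above. The names here (TermCounts, tfidf) are illustrative, not from the slides.

#include <cmath>
#include <map>
#include <string>
#include <vector>

using TermCounts = std::map<std::string, int>;  // term -> count in one document

// Compute TF*IDF feature values for every document in the collection,
// using IDF(t) = log(N / df(t)) as defined on the slide.
std::vector<std::map<std::string, double>>
tfidf(const std::vector<TermCounts>& docs) {
    std::map<std::string, int> df;              // document frequency of each term
    for (const TermCounts& d : docs)
        for (const auto& tc : d) df[tc.first] += 1;

    const double N = static_cast<double>(docs.size());
    std::vector<std::map<std::string, double>> features(docs.size());
    for (size_t i = 0; i < docs.size(); ++i)
        for (const auto& [term, count] : docs[i])
            features[i][term] = count * std::log(N / df[term]);
    return features;
}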

8 Using Lemur to Extract Unigram Features
// Lemur header includes and namespace setup are omitted, as on the slide.
Index *ind = IndexManager::openIndex("index-file.key");
int d1 = 1;                             // internal ID of the document to inspect
TermInfoList *tList = ind->termInfoList(d1);
tList->startIteration();
while (tList->hasMore()) {
  TermInfo *entry = tList->nextEntry();
  cout << entry->termID() << endl;      // term ID of this unigram
  cout << entry->termCount() << endl;   // its frequency (TF) in document d1
}
delete tList;
delete ind;

9 SVM (Support Vector Machines)
A learning algorithm for classification
–General: works for any classification problem (text classification is one example)
–Binary classification
–Maximizes the margin between the two classes
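To make "maximizes the margin" precise (this is the standard soft-margin formulation, not specific to these slides): given training examples $(x_i, y_i)$ with labels $y_i \in \{+1, -1\}$, the SVM solves

\min_{w,\,b,\,\xi} \;\; \tfrac{1}{2}\lVert w \rVert^2 + C \sum_i \xi_i
\quad \text{s.t.} \quad y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0.

The geometric margin between the two classes is $2/\lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes it; the constant $C$ trades margin size against training error, and is exactly what SVM-light's "-c" option (slide 12) controls.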

10 [Figure: illustration of the maximum-margin separating hyperplane; picture from http://www1.cs.columbia.edu/~kathy/cs4701/documents/jason_svm_tutorial.pdf]

11 SVM-light
SVM-light: a command-line C program that implements the SVM learning algorithm
–Supports classification, regression, and ranking
–Download at http://svmlight.joachims.org/ (documentation on the same page)
Two programs:
–svm_learn for training
–svm_classify for classification

12 SVM-light Examples
Input format (one example per line: a label, then feature:value pairs with feature numbers in increasing order):
1 1:0.5 3:1 5:0.4
-1 2:0.9 3:0.1 4:2
To train a classifier from train.data:
–svm_learn train.data train.model
To classify new documents in test.data:
–svm_classify test.data train.model test.result
Output format:
–Positive score → positive class
–Negative score → negative class
–The absolute value of the score indicates confidence
Command line options:
–-c: a tradeoff parameter between training error and margin (use cross-validation to tune)
A sketch for producing this input format from extracted features follows below.
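Tying the pipeline together: a sketch that writes per-document feature maps (e.g., the output of the earlier tfidf example, after mapping each term to a positive integer ID) into SVM-light's input format. std::map already iterates keys in the increasing order SVM-light requires; writeSvmLight is an illustrative name.

#include <fstream>
#include <map>
#include <string>
#include <vector>

// Write one SVM-light example per document: "label id:value id:value ...".
// Feature IDs must be positive integers; std::map iterates them in
// increasing order, as SVM-light's input format requires.
void writeSvmLight(const std::string& path,
                   const std::vector<int>& labels,                      // +1 or -1
                   const std::vector<std::map<int, double>>& features)  // id -> value
{
    std::ofstream out(path);
    for (size_t i = 0; i < features.size(); ++i) {
        out << labels[i];
        for (const auto& [id, value] : features[i])
            out << ' ' << id << ':' << value;
        out << '\n';
    }
}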

13 More on SVM-light
Kernels:
–Select with the "-t" option
–Polynomial kernel
–User-defined kernel
Semi-supervised learning (transductive SVM):
–Use "0" as the label for unlabeled examples in the training file
–Very slow
Example invocations follow below.
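For concreteness, some illustrative command lines (file names are placeholders; per the SVM-light documentation, -t selects the kernel type, -t 1 being polynomial with -d setting its degree):

svm_learn -c 1.0 train.data train.model       (linear kernel, the default, with the -c tradeoff set explicitly)
svm_learn -t 1 -d 2 train.data train.model    (polynomial kernel of degree 2)
svm_learn transductive.data train.model       (transductive training: the data file mixes labeled lines with lines labeled 0)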

