Text Classification using SVM-light
DSSI 2008
Jing Jiang
Text Classification
Goal: to classify documents (news articles, e-mails, Web pages, etc.) into predefined categories
Examples
–To classify news articles into “business” and “sports”
–To classify Web pages into personal home pages and others
–To classify product reviews into positive reviews and negative reviews
Approach: supervised machine learning
–For each pre-defined category, we need a set of training documents known to belong to the category.
–From the training documents, we train a classifier.
Overview
Step 1—text pre-processing
–to pre-process text and represent each document as a feature vector
Step 2—training
–to train a classifier using a classification tool (e.g. SNoW, SVM-light)
Step 3—classification
–to apply the classifier to new documents
Pre-processing: tokenization
Goal: to separate text into individual words
Example: “We’re attending a tutorial now.” → we ’re attending a tutorial now
Tool:
–Word Splitter
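The tokenization step can be sketched as follows. This is a minimal illustrative tokenizer, not the Word Splitter tool itself: it lowercases, drops punctuation, and splits off a trailing clitic such as ’re, as in the slide’s example.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>
using namespace std;

// Minimal whitespace/punctuation tokenizer (a sketch, not the Word Splitter
// tool): lowercases, strips sentence punctuation, and separates a clitic
// like "'re" from the word it is attached to.
vector<string> tokenize(const string& text) {
    vector<string> tokens;
    string cur;
    auto flush = [&]() {
        if (cur.empty()) return;
        // Split a trailing clitic such as 're, 's, 'll off the word.
        size_t apos = cur.find('\'');
        if (apos != string::npos && apos > 0) {
            tokens.push_back(cur.substr(0, apos));
            tokens.push_back(cur.substr(apos));
        } else {
            tokens.push_back(cur);
        }
        cur.clear();
    };
    for (char c : text) {
        if (isalnum((unsigned char)c) || c == '\'')
            cur += (char)tolower((unsigned char)c);
        else
            flush();
    }
    flush();
    return tokens;
}
```

Applied to the slide’s sentence, this yields the six tokens we ’re attending a tutorial now.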
Pre-processing: stop word removal (optional)
Goal: to remove common words that are usually not useful for text classification
Example: to remove words such as “a”, “the”, “I”, “he”, “she”, “is”, “are”, etc.
Stop word list:
–utils/stop_words
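Stop word removal amounts to filtering tokens against a fixed list. A sketch, using only the handful of stop words named on the slide rather than the full utils/stop_words list:

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>
using namespace std;

// Stop word filtering sketch: the set below is the slide's illustrative
// subset, not the complete utils/stop_words list. Assumes tokens are
// already lowercased by the tokenization step.
vector<string> remove_stop_words(const vector<string>& tokens) {
    static const unordered_set<string> stop = {
        "a", "the", "i", "he", "she", "is", "are"};
    vector<string> kept;
    for (const string& t : tokens)
        if (!stop.count(t)) kept.push_back(t);
    return kept;
}
```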
Pre-processing: stemming (optional)
Goal: to normalize words derived from the same root
Examples:
–attending → attend
–teacher → teach
Tool:
–Porter stemmer
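The idea can be illustrated with a crude suffix-stripping sketch. The actual Porter stemmer applies several ordered rule sets with conditions on the remaining stem; this toy version only strips a few common suffixes, enough to cover the slide’s two examples.

```cpp
#include <cassert>
#include <string>
using namespace std;

// Crude suffix-stripping sketch (NOT the Porter stemmer): strips one of a
// few common suffixes when the remaining stem is long enough. The real
// Porter algorithm has many more rules and conditions.
string stem(string w) {
    auto ends_with = [&](const string& s) {
        return w.size() > s.size() + 2 &&
               w.compare(w.size() - s.size(), s.size(), s) == 0;
    };
    for (const string& suf : {"ing", "er", "s"}) {
        if (ends_with(suf)) {
            w.erase(w.size() - suf.size());
            break;
        }
    }
    return w;
}
```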
Pre-processing: feature extraction
Unigram features: to use each word as a feature
–To use TF (term frequency) as the feature value
–To use TF*IDF as the feature value
–IDF (inverse document frequency) = log(total-number-of-documents / number-of-documents-containing-t)
Bigram features: to use two consecutive words as a feature
Tool:
–Write your own program/script
–Lemur API
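The TF*IDF weighting above can be computed directly from tokenized documents. A self-contained sketch, with the corpus passed in as a vector of tokenized documents:

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <set>
#include <string>
#include <vector>
using namespace std;

// TF*IDF sketch following the slide's formula:
//   IDF(t) = log(total-number-of-documents / number-of-documents-containing-t)
// doc is one tokenized document; corpus is all tokenized documents.
map<string, double> tfidf(const vector<string>& doc,
                          const vector<vector<string>>& corpus) {
    // Term frequency (raw count) in this document.
    map<string, double> tf;
    for (const string& t : doc) tf[t] += 1.0;

    // Document frequency of each term over the corpus.
    map<string, int> df;
    for (const auto& d : corpus) {
        set<string> seen(d.begin(), d.end());  // count each doc once
        for (const string& t : seen) df[t]++;
    }

    map<string, double> weights;
    for (const auto& [term, freq] : tf)
        weights[term] = freq * log((double)corpus.size() / df[term]);
    return weights;
}
```

Terms appearing in fewer documents get a higher IDF, so rare, discriminative words are weighted up relative to words common across the corpus.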
Using Lemur to Extract Unigram Features

Index *ind = IndexManager::openIndex("index-file.key");
int d1; // internal ID of the document to process
TermInfoList *tList = ind->termInfoList(d1);
tList->startIteration();
while (tList->hasMore()) {
  TermInfo *entry = tList->nextEntry();
  cout << entry->termID() << endl;    // term (feature) ID
  cout << entry->termCount() << endl; // term frequency in the document
}
delete tList;
delete ind;
SVM (Support Vector Machines)
A learning algorithm for classification
–General for any classification problem (text classification is one example)
Binary classification
Maximizes the margin between the two different classes
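The margin-maximization idea can be written as the standard hard-margin SVM optimization problem, for labeled training examples $(\mathbf{x}_i, y_i)$ with $y_i \in \{+1, -1\}$:

```latex
\min_{\mathbf{w},\,b} \;\; \frac{1}{2}\,\|\mathbf{w}\|^2
\quad \text{subject to} \quad
y_i\,(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, n
```

The constraints put every training example on the correct side of the separating hyperplane; minimizing $\|\mathbf{w}\|$ then maximizes the geometric margin $2/\|\mathbf{w}\|$ between the two classes.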
SVM-light
SVM-light: a command-line C program that implements the SVM learning algorithm
Supports classification, regression, and ranking
Download and documentation available on the SVM-light web page
Two programs:
–svm_learn for training
–svm_classify for classification
SVM-light Examples
Input format: one example per line: a class label followed by feature:value pairs, with feature numbers in increasing order, e.g.
1 1:0.5 3:1
To train a classifier from train.data:
–svm_learn train.data train.model
To classify new documents in test.data:
–svm_classify test.data train.model test.result
Output format:
–Positive score → positive class
–Negative score → negative class
–The absolute value of the score indicates confidence
Command line options:
–-c: a tradeoff parameter (use cross validation to tune)
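Writing a feature vector in SVM-light’s sparse input format is straightforward once features are numbered. A sketch that emits one training line; the feature values (e.g. TF*IDF weights) are illustrative:

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
using namespace std;

// Formats one example in SVM-light's sparse input format:
//   <label> <feature#>:<value> ...
// with feature numbers in increasing order (std::map iterates keys in
// ascending order, which guarantees this).
string svmlight_line(int label, const map<int, double>& features) {
    ostringstream out;
    out << label;  // +1 or -1 (0 marks unlabeled examples for
                   // transductive SVM)
    for (const auto& [id, value] : features)
        out << ' ' << id << ':' << value;
    return out.str();
}
```

One such line per document, written to train.data, is exactly what svm_learn expects.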
More on SVM-light
Kernels:
–Use the “-t” option
–Polynomial kernel
–User-defined kernel
Semi-supervised learning (transductive SVM):
–Use “0” as the label for unlabeled examples
–Very slow