Evgeniy Gabrilovich and Shaul Markovitch, American Association for Artificial Intelligence (AAAI), 2006. Prepared by Qi Li
Outline: Introduction. Feature Generation with Wikipedia: ◦ Wikipedia as a knowledge repository ◦ Feature construction ◦ Feature generator design ◦ Using the link structure. Empirical Evaluation: ◦ Implementation details ◦ Experimental methodology ◦ The effect of feature generation ◦ Classifying short documents. Conclusions and Future Work
Text categorization ◦ Deals with the automatic assignment of category labels to natural-language documents ◦ Documents are represented as bags of words (BOW) ◦ Features are derived from words ◦ Categorization is based on those features ◦ Limitation of BOW: it relies on the individual word occurrences seen in the training set; for example, “Wal-Mart supply chain goes real time” and “Wal-Mart manages its stock with RFID technology” share only the company name ◦ BOW is effective for categorization tasks of medium difficulty, but performs poorly on small categories and short documents ◦ Idea: use an encyclopedia to endow the machine with the breadth of knowledge available to humans
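The BOW limitation can be made concrete with the two Wal-Mart headlines. The following is a minimal sketch, assuming whitespace tokenization and a tiny illustrative stop-word list (neither comes from the slides):

```python
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not from the slides).
STOP_WORDS = {"its", "with", "the", "a", "of"}

def bag_of_words(text):
    """Lowercase, split on whitespace, drop stop words, count occurrences."""
    tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
    return Counter(tokens)

doc_a = bag_of_words("Wal-Mart supply chain goes real time")
doc_b = bag_of_words("Wal-Mart manages its stock with RFID technology")

# Words that both headlines have in common.
shared = set(doc_a) & set(doc_b)
print(shared)  # {'wal-mart'}
```

With only one shared word, a classifier trained on one headline has little evidence for labeling the other, which is the gap the encyclopedia-based features are meant to close.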
Auxiliary text classifier: ◦ Matches documents with the most relevant Wikipedia articles ◦ Conventional bag of words + new features. Example of the auxiliary text classifier idea: ◦ Input: “Bernanke takes charge” ◦ Generated concepts: BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, … Using Wikipedia: ◦ Use text similarity algorithms to automatically identify the encyclopedia articles relevant to each document ◦ Leverage the knowledge gained from these articles
Extend the representation of documents for text categorization with knowledge concepts relevant to the document text. Wikipedia ◦ Largest knowledge repository ◦ Large-scale hierarchies ◦ High-quality, standard written English ◦ …
Receive a text fragment and map it to the most relevant Wikipedia articles ◦ E.g., “Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge” ◦ ENCYCLOPEDIA, WIKIPEDIA, ENTERPRISE CONTENT MANAGEMENT, BOTTLENECK, PERFORMANCE PROBLEM, HERMENEUTICS. Pipeline: training documents -> features -> Wikipedia concepts -> augmented bag of words
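The mapping from a text fragment to concepts, followed by the augmentation step, might look roughly like the sketch below. It assumes a three-article toy "encyclopedia" (`ARTICLES` is hypothetical), TF-IDF weights for the concepts, raw term counts for the fragment, and cosine similarity for ranking; the `CONCEPT:` prefix and the `top_k` value are illustrative choices, not details from the slides.

```python
import math
from collections import Counter

# Hypothetical miniature "encyclopedia": concept name -> article text.
ARTICLES = {
    "FEDERAL RESERVE": "federal reserve central bank united states monetary policy chairman",
    "BEN BERNANKE": "ben bernanke economist chairman federal reserve",
    "JAGUAR": "jaguar big cat panthera onca americas",
}

def tokenize(text):
    return text.lower().split()

def tfidf_vectors(articles):
    """Return {concept: {term: tf-idf weight}} for the toy collection."""
    n = len(articles)
    counts = {name: Counter(tokenize(text)) for name, text in articles.items()}
    df = Counter(term for c in counts.values() for term in c)
    return {
        name: {t: tf * math.log(n / df[t]) for t, tf in c.items()}
        for name, c in counts.items()
    }

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def generate_features(fragment, concept_vectors, top_k=2):
    """Rank concepts by cosine similarity to the fragment; keep up to top_k with a nonzero score."""
    query = Counter(tokenize(fragment))  # raw counts for the fragment (a simplification)
    scored = [(cosine(query, vec), name) for name, vec in concept_vectors.items()]
    scored = [(s, name) for s, name in scored if s > 0]
    return [name for _s, name in sorted(scored, reverse=True)[:top_k]]

def augment(fragment, concept_vectors, top_k=2):
    """Conventional bag of words plus one feature per generated concept."""
    features = Counter(tokenize(fragment))
    for name in generate_features(fragment, concept_vectors, top_k):
        features["CONCEPT:" + name] += 1
    return features

vectors = tfidf_vectors(ARTICLES)
print(generate_features("Bernanke takes charge", vectors))  # ['BEN BERNANKE']
```

In this toy collection only one article mentions "bernanke", so a single concept is returned; with a full encyclopedia, related articles would also score above zero and supply the broader context.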
Which unit of text should features be generated from? ◦ Word, sentence, paragraph, or document? Multi-resolution approach ◦ Features are generated for individual words, sentences, paragraphs, and the entire document ◦ A polysemous word is mapped to the concepts that correspond to the sense shared by its context words
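One way to picture the multi-resolution approach is to enumerate every context (word, sentence, paragraph, document) and pool the concepts generated for each. The sketch below assumes blank-line paragraph breaks and a naive punctuation-based sentence splitter; `generate` stands for any hypothetical feature generator that maps text to a ranked list of concept names.

```python
import re

def contexts(document):
    """Yield (level, text) pairs: words, sentences, paragraphs, whole document."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    for paragraph in paragraphs:
        # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
        for sentence in sentences:
            for word in re.findall(r"\w+", sentence):
                yield "word", word
            yield "sentence", sentence
        yield "paragraph", paragraph
    yield "document", document

def multi_resolution_features(document, generate, top_k=2):
    """Union of the top_k concepts generated for every context at every resolution."""
    features = set()
    for _level, text in contexts(document):
        features.update(generate(text)[:top_k])
    return features
```

Larger contexts are what allow an ambiguous word such as "jaguar" to be resolved: the word alone matches several senses, while the sentence or paragraph around it favors one.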
For “jaguar car models”, the Wikipedia-based feature generator returns: ◦ JAGUAR (CAR) ◦ DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar) ◦ V12 (Jaguar’s engine) ◦ JAGUAR E-TYPE ◦ JAGUAR XJ. For “jaguar Panthera onca”, it returns: ◦ JAGUAR ◦ FELIDAE (feline species family) ◦ related felines such as LEOPARD, PUMA and BLACK PANTHER ◦ KINKAJOU
A set of simple heuristics for pruning the set of Wikipedia concepts: ◦ Discard articles with fewer than 100 non-stop words (too short) ◦ Discard articles with fewer than 5 incoming and outgoing links ◦ Discard disambiguation pages ◦ Each remaining concept is represented as an attribute vector whose entries are weighted using a TF-IDF scheme
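The pruning rules translate into a simple filter. This is a sketch only: the `Article` record and the stop-word list are hypothetical, and reading "fewer than 5 incoming and outgoing links" as a combined count is an assumption, since the slide does not say whether the threshold applies to the sum or to each direction.

```python
from dataclasses import dataclass

STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}  # illustrative only

@dataclass
class Article:
    """Hypothetical record for one encyclopedia article."""
    title: str
    text: str
    incoming_links: int
    outgoing_links: int
    is_disambiguation: bool = False

def keep(article, min_words=100, min_links=5):
    """Return True if the article survives the three discard rules."""
    if article.is_disambiguation:
        return False
    non_stop = [w for w in article.text.lower().split() if w not in STOP_WORDS]
    if len(non_stop) < min_words:
        return False
    # Assumption: the link threshold applies to incoming + outgoing combined.
    if article.incoming_links + article.outgoing_links < min_links:
        return False
    return True

def prune(articles):
    return [a for a in articles if keep(a)]
```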
Links and anchor text: ◦ Anchor text may be identical to the canonical name of the target article ◦ Different anchor texts can refer to the same article: alternative names, variant spellings, and related phrases ◦ Number of incoming links: indicates the significance of an article ◦ Problem: taking all articles pointed to from a concept is ill-advised, since it brings in a lot of weakly related material ◦ This direction is left for future work
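To illustrate how anchor text and incoming links could be collected, here is a small sketch over hypothetical link triples (`LINKS` is made up for illustration). It records anchor texts that differ from the target's canonical name as alternative names and counts incoming links per article.

```python
from collections import Counter, defaultdict

# Hypothetical links: (source article, anchor text, target article).
LINKS = [
    ("ECONOMY OF THE UNITED STATES", "the Fed", "FEDERAL RESERVE"),
    ("BEN BERNANKE", "Federal Reserve", "FEDERAL RESERVE"),
    ("MONETARISM", "Federal Reserve System", "FEDERAL RESERVE"),
    ("FEDERAL RESERVE", "Alan Greenspan", "ALAN GREENSPAN"),
]

def anchor_index(links):
    """Return (aliases, incoming): alternative names and incoming-link counts per target."""
    aliases = defaultdict(set)
    incoming = Counter()
    for _source, anchor, target in links:
        # Anchors that match the canonical name add no new alias.
        if anchor.lower() != target.lower():
            aliases[target].add(anchor)
        incoming[target] += 1
    return aliases, incoming

aliases, incoming = anchor_index(LINKS)
```

Here `aliases["FEDERAL RESERVE"]` holds the two variant phrasings, and `incoming` gives a rough significance signal for each article.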
Wikipedia snapshot: November 5, 2005 ◦ 1.8 GB of text in 910,989 articles ◦ After removing small and overly specific concepts, 171,332 articles remain ◦ Stop words and rare words removed ◦ Remaining words stemmed ◦ 296,157 distinct terms used to represent concepts
Datasets: 1. Reuters-21578 2. Reuters Corpus Volume I (RCV1) 3. OHSUMED 4. 20 Newsgroups (20NG) 5. Movie Reviews (Movies). Method: SVM with a linear kernel. Metrics: ◦ Precision-recall break-even point (BEP) ◦ Reuters and OHSUMED: micro- and macro-averaged BEP ◦ 20NG and Movies: 4-fold cross-validation
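The break-even point and its two averages can be sketched as follows. This assumes one common way of computing BEP: rank documents by classifier score and cut the ranking at the number of true positives, where precision and recall coincide. The macro average is the mean of per-category BEPs; the micro average pools hits and positives across categories. The function names and the toy scores are illustrative.

```python
def hits_at_break_even(scores, labels):
    """Rank by score and cut at k = number of positives; return (true positives, k)."""
    n_pos = sum(labels)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _score, label in ranked[:n_pos])
    return true_pos, n_pos

def macro_micro_bep(per_category):
    """per_category: {category: (scores, labels)} with 0/1 labels. Returns (macro, micro)."""
    beps, tp_total, pos_total = [], 0, 0
    for scores, labels in per_category.values():
        tp, n_pos = hits_at_break_even(scores, labels)
        beps.append(tp / n_pos if n_pos else 0.0)
        tp_total += tp
        pos_total += n_pos
    macro = sum(beps) / len(beps)
    micro = tp_total / pos_total if pos_total else 0.0
    return macro, micro

# Toy data: one larger category and one small category.
data = {
    "large": ([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 1, 0, 1, 1, 0]),
    "small": ([0.9, 0.2, 0.1, 0.3], [0, 1, 0, 0]),
}
macro, micro = macro_micro_bep(data)
print(macro, micro)  # 0.375 0.6
```

The macro average weights every category equally, so it is the number that moves most when small categories improve; the micro average is dominated by the large ones.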
The effect of feature generation: larger improvements; more effective in small categories
Classifying short documents: only the titles of the articles are used for classification
Feature generator: ◦ Identifies the most relevant encyclopedia articles ◦ Creates new features. Adding semantics to conventional BOW: ◦ Latent semantic indexing (LSI) ◦ LSI + SVM: did not perform well ◦ Wikipedia + SVM: improves results. Future work: information retrieval