Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology

Text classification based on multi-word with support vector machine

Presenter: Shao-Wei Cheng
Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang
KBS 2008
Outline
- Motivation
- Objective
- Methodology
- Experiments
- Conclusion
- Comments
Motivation
- The bag-of-words (BOW) vector space model represents a text with the individual words obtained from the given text data set.
- However, a text representation based on individual words is hard to interpret and comprehend.
- Moreover, how should the degree of relevance of a multi-word to a document be measured?
- Example multi-words: "last December", "U.S. agriculture department", "agriculture department", "U.S. agriculture"
Objectives
- Propose two strategies to represent documents using the extracted multi-words.
- Investigate how using multi-words for text representation affects the performance of text classification.
- Workflow: multi-word extraction → text representation (decomposition strategy / combination strategy) → IG for feature selection → SVM for classification (linear kernel / non-linear kernel) → experiments
Methodology: Multi-word extraction
- Repetition pattern extraction (an illustrative sketch follows below).
- Example sentence: "The U.S. agriculture department last December slashed its 12 month of 1987 sugar import quota from the Philippines to 143,780 short tons from 231,660 short tons in …"
- Extracted multi-words and their POS patterns: U.S. agriculture department (NNN), U.S. agriculture (NN), agriculture department (NN), last December (AN), sugar import quota (NNN), short tons (AN).
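The slide shows only the input sentence and the extracted patterns, not the extraction procedure itself. Below is a minimal sketch of repetition-pattern extraction, assuming NLTK for tokenization and part-of-speech tagging; the shorthand (N = noun, A = adjective) and the rule that a candidate consists of adjectives/nouns and ends in a noun are read off the slide's examples (AN, NN, NNN), and the function name candidate_multiwords and all other details are illustrative rather than the paper's implementation.

```python
# Minimal sketch of repetition-pattern multi-word extraction, assuming NLTK
# for tokenization and POS tagging (requires the "punkt" and
# "averaged_perceptron_tagger" NLTK data packages). The accepted POS patterns
# mirror the slide's examples (AN, NN, NNN); everything else is illustrative.
from collections import Counter
import nltk

NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}
ADJ_TAGS = {"JJ", "JJR", "JJS"}

def candidate_multiwords(text, max_len=3):
    """Collect word sequences of length 2..max_len made of adjectives/nouns
    and ending in a noun, together with their repetition counts."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            pattern = ["N" if tag in NOUN_TAGS else "A" if tag in ADJ_TAGS else "?"
                       for _, tag in window]
            # keep only all-adjective/noun sequences that end in a noun
            if "?" not in pattern and pattern[-1] == "N":
                counts[" ".join(word for word, _ in window)] += 1
    return counts

sentence = ("The U.S. agriculture department last December slashed its "
            "12 month of 1987 sugar import quota from the Philippines to "
            "143,780 short tons from 231,660 short tons")
print(candidate_multiwords(sentence).most_common())
```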
Methodology: Text representation
- Decomposition strategy: uses the document frequencies of a multi-word and of its sub-multi-words, e.g. "U.S. agriculture department", "agriculture department", "U.S. agriculture".
- Combination strategy: e.g. for the multi-word "Mickey mouse" in the sentence "Mickey is a mouse whose name is Mickey", the occurrence ratio (OR) = 1 and the minimum scope (MS) = 4 (a sketch of these measures follows below).
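The slide gives the values OR = 1 and MS = 4 for "Mickey mouse" but does not define the two measures. The sketch below uses one plausible reading chosen to reproduce those numbers: MS is the length of the shortest token window covering every component word of the multi-word, and OR is the number of co-occurrences of all components normalised by the frequency of the rarest component. Both definitions, and the helper or_and_ms, are assumptions for illustration, not the paper's exact formulas.

```python
# Assumed reading of occurrence ratio (OR) and minimum scope (MS) for a
# multi-word in a text unit, chosen to reproduce the slide's example values.
from itertools import product

def or_and_ms(text, multiword):
    """Return (occurrence ratio, minimum scope) of a multi-word in a text,
    under the assumed definitions described above."""
    tokens = text.lower().split()
    parts = multiword.lower().split()
    # Token positions of each component word of the multi-word.
    positions = [[i for i, t in enumerate(tokens) if t == p] for p in parts]
    if any(not pos for pos in positions):
        return 0.0, None  # some component word never occurs
    # Minimum scope: shortest token window covering one occurrence of every
    # component word (brute force over all position combinations).
    ms = min(max(combo) - min(combo) + 1 for combo in product(*positions))
    # Occurrence ratio (assumed): one co-occurrence of all components in this
    # text unit, normalised by the frequency of the rarest component word.
    occurrence_ratio = 1 / min(len(pos) for pos in positions)
    return occurrence_ratio, ms

print(or_and_ms("Mickey is a mouse whose name is Mickey", "Mickey mouse"))
# -> (1.0, 4), matching the slide's OR = 1 and MS = 4
```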
Experiments
- Data from the Reuters corpus; IG for feature selection; SVM for classification (a pipeline sketch follows below).
- Legend for the reported settings: M = multi-word, C = combination strategy, D = decomposition strategy, L = linear kernel, N = non-linear kernel.
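As a rough illustration of this experimental pipeline rather than the paper's actual setup, the sketch below chains count-based text features, information-gain-style feature selection, and SVMs with a linear and a non-linear (RBF) kernel. It assumes scikit-learn; information gain is approximated with mutual_info_classif, word n-grams stand in for the extracted multi-words, and the documents and labels are toy placeholders, not the Reuters data.

```python
# Rough sketch of the experimental pipeline: feature selection followed by
# SVM classification with a linear and a non-linear (RBF) kernel. Assumes
# scikit-learn; information gain is approximated by mutual information, and
# the documents/labels are toy placeholders, not the Reuters corpus.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

docs = [
    "agriculture department cuts the sugar import quota",
    "sugar import quota raised to more short tons",
    "grain export quota unchanged says agriculture department",
    "new animation features a famous cartoon mouse",
    "cartoon studio announces another mouse character film",
    "animation film about a cartoon mouse wins award",
]
labels = [0, 0, 0, 1, 1, 1]  # toy stand-in for two Reuters categories

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=2, random_state=0, stratify=labels)

for kernel in ("linear", "rbf"):  # linear vs. non-linear kernel
    pipe = Pipeline([
        ("vec", CountVectorizer(ngram_range=(1, 3))),    # words and multi-word-like n-grams
        ("ig", SelectKBest(mutual_info_classif, k=10)),  # IG-style feature selection
        ("svm", SVC(kernel=kernel)),
    ])
    pipe.fit(X_train, y_train)
    print(kernel, "macro-F1 =", f1_score(y_test, pipe.predict(X_test), average="macro"))
```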
Conclusion
- Firstly, the multi-word representation has a lower dimensionality than individual words, yet its performance is acceptable.
- Secondly, multi-words are easy to acquire from documents by corpus learning, without the support of a thesaurus, dictionary, or ontology.
- Thirdly, a multi-word carries more semantics and is a larger meaningful unit than an individual word.
Comments
- Advantage: the content of this article and the proposed method are clear.
- Drawback: …
- Application: text classification.