Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)
Advisor: Dr. Hsu
Presenter: Yu-Cheng Chen
Authors: Yu-Sheng Lai and Chung-Hsien Wu
"Meaningful Term Extraction and Discriminative Term Selection in Text Categorization via Unknown-Word Methodology"
ACM Transactions on Asian Language Information Processing, 2002, pages 34-64
Outline
─ Motivation
─ Objective
─ Introduction
─ System Overview
─ Term Extraction and Selection
─ Discriminative Term Selection
─ Indexing and Classification
─ Experimental Results
─ Conclusions
─ Personal Opinion
Motivation
─ In text categorization, terms are extracted from documents and used to estimate the textual similarity between documents; the extracted terms largely determine system performance.
─ N-grams are typically employed for textual indexing, but they:
  ─ require comparatively high storage space;
  ─ are not meaningful linguistic units, which leads to inconsistency problems.
─ Unknown words are more domain-specific than ordinary dictionary words (domain dependency).
Objective
─ Propose a method for extracting meaningful, highly domain-specific unknown words from Chinese text documents.
Introduction
─ Two main methods for detecting unknown words:
  ─ Statistical: some are restricted to particular types of unknown words.
  ─ Rule-based: relies on a dictionary, needs part-of-speech information, and handles only unknown words of limited length.
System Overview
─ (figure: n-grams with n = 1~8 extracted from document j; example terms T1 新聞 "news", T2 體育 "sports")
System Overview (cont.)
─ (system architecture diagram)
Term Extraction and Selection
─ Phrase-like Unit (PLU): a frequently occurring word sequence P such that, for a word w_i in P, the prefix w_1 w_2 … w_i is almost always followed by the remainder w_{i+1} w_{i+2} …
─ Such a P is probably an unknown word or phrase, e.g. 陳水扁 (Chen Shui-bian).
─ PLU-based likelihood ratio PLR(p); example frequencies: 陳水扁 250, 陳 1000, 水扁 200.
Term Extraction and Selection
─ A word sequence p is considered an unknown word if:
  ─ n > 1
  ─ tf(p) >= c
  ─ PLR(p) >= 1 - ε, or PLR(p) * tf(p) >= d
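The extraction criteria above can be sketched in Python. The slides state the thresholds but not the PLR formula itself, so the PLR below is only an illustrative proxy (an assumption, not the paper's definition): the minimum, over all proper prefixes of p, of tf(p)/tf(prefix), i.e. how consistently each prefix continues into the full sequence. The counts follow the 陳水扁 example; the prefix count for 陳水 is invented for illustration.

```python
def plr(p, tf):
    """Illustrative PLU-based likelihood ratio (NOT the paper's formula):
    how consistently each proper prefix of p continues into the full p."""
    full = tf[p]
    ratios = [full / tf[p[:i]] for i in range(1, len(p)) if p[:i] in tf]
    return min(ratios) if ratios else 0.0

def is_unknown_word(p, tf, c=10, eps=0.1, d=50):
    """The slide's three criteria: n > 1, tf(p) >= c, and
    PLR(p) >= 1 - eps or PLR(p) * tf(p) >= d."""
    if len(p) <= 1 or tf.get(p, 0) < c:
        return False
    r = plr(p, tf)
    return r >= 1 - eps or r * tf[p] >= d

# Counts from the slide's example; the count for the prefix 陳水 is hypothetical.
tf = {("陳", "水", "扁"): 250, ("陳",): 1000, ("陳", "水"): 260}
print(plr(("陳", "水", "扁"), tf))             # 0.25
print(is_unknown_word(("陳", "水", "扁"), tf))  # True
```

Here 陳水扁 fails the PLR >= 1 - ε branch (its prefix 陳 occurs alone far more often) but passes the PLR(p) * tf(p) >= d branch.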
Further Purification
─ Some PLUs are useless or even interfering, so:
  ─ discard stop terms;
  ─ resolve cross-included terms using a reliability degree.
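A minimal sketch of the purification step. The slides name the "reliability degree" without defining it, so the proxy below is an assumption: the share of a shorter term's occurrences not covered by longer PLUs that contain it. The stop-term list and the 0.5 threshold are likewise illustrative.

```python
STOP_TERMS = {"的", "了", "是"}   # illustrative stop terms (assumption)

def purify(plus, tf, min_reliability=0.5):
    """Discard stop terms; for cross-included terms, keep a shorter PLU only
    if its reliability degree -- here, the share of its occurrences NOT
    covered by longer PLUs containing it (an assumed proxy) -- is high."""
    kept = [p for p in plus if p not in STOP_TERMS]
    result = []
    for p in kept:
        longer = [q for q in kept if q != p and p in q]
        if not longer:
            result.append(p)
            continue
        covered = sum(tf[q] for q in longer)
        reliability = max(0.0, 1 - covered / tf[p])
        if reliability >= min_reliability:
            result.append(p)
    return result

tf = {"水扁": 260, "陳水扁": 250}
print(purify(["水扁", "陳水扁", "的"], tf))   # ['陳水扁']
```

In this toy run, 水扁 almost always appears inside 陳水扁, so it is dropped as a cross-included fragment.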
Discriminative Term Selection
─ Here "discriminative" denotes a term's utility in distinguishing categories; e.g. the term 陳水扁 (Chen Shui-bian) helps distinguish the 政治 (politics) class from the 體育 (sports) class.
─ For a term t representing category g, a discriminability W(t, g) is defined.
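The slide names W(t, g) but does not reproduce its formula, so the function below is only an illustrative stand-in (an assumption): the share of the term's total occurrences that fall in category g, which is high when the term concentrates in one category. The per-category frequencies are hypothetical.

```python
def discriminability(term, cat, freq):
    """Illustrative stand-in for W(t, g) -- NOT the paper's definition:
    the share of the term's total occurrences falling in category cat."""
    total = sum(f.get(term, 0) for f in freq.values())
    return freq[cat].get(term, 0) / total if total else 0.0

# Hypothetical per-category frequencies for 陳水扁:
freq = {"政治": {"陳水扁": 90}, "體育": {"陳水扁": 10}}
print(discriminability("陳水扁", "政治", freq))   # 0.9
```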
Indexing and Classification
─ Index machine: used for locating keywords in a text; M = (S, I, g, f, s_0, O).
─ Example: "半自動套裝遊程" (semi-automatic package tour).
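The tuple M = (S, I, g, f, s_0, O) matches the classic Aho-Corasick pattern-matching machine (goto function g, failure function f, output function O), so the slide's index machine can be sketched that way. The keyword list built from the example phrase's sub-terms is hypothetical.

```python
from collections import deque

class IndexMachine:
    """Minimal Aho-Corasick-style machine mirroring M = (S, I, g, f, s0, O):
    states S are list indices, g is `goto`, f is `fail`, O is `out`, s0 = 0."""

    def __init__(self, keywords):
        self.goto = [{}]
        self.fail = [0]
        self.out = [set()]
        for kw in keywords:                      # build the goto trie
            s = 0
            for ch in kw:
                if ch not in self.goto[s]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append(set())
                    self.goto[s][ch] = len(self.goto) - 1
                s = self.goto[s][ch]
            self.out[s].add(kw)
        queue = deque(self.goto[0].values())     # BFS to set failure links
        while queue:
            s = queue.popleft()
            for ch, t in self.goto[s].items():
                queue.append(t)
                f = self.fail[s]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                nxt = self.goto[f].get(ch, 0)
                self.fail[t] = nxt if nxt != t else 0
                self.out[t] |= self.out[self.fail[t]]

    def locate(self, text):
        """Return (start, keyword) pairs for every keyword occurrence."""
        s, hits = 0, []
        for i, ch in enumerate(text):
            while s and ch not in self.goto[s]:
                s = self.fail[s]
            s = self.goto[s].get(ch, 0)
            for kw in self.out[s]:
                hits.append((i - len(kw) + 1, kw))
        return hits

m = IndexMachine(["半自動", "套裝", "遊程", "套裝遊程"])
print(sorted(m.locate("半自動套裝遊程")))
# [(0, '半自動'), (3, '套裝'), (3, '套裝遊程'), (5, '遊程')]
```

One left-to-right scan finds all (possibly overlapping) keyword occurrences, which is the point of using an index machine rather than testing each keyword separately.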
Indexing and Classification
─ To improve performance, the vector space model (VSM) is used: each document is represented as a vector whose components are weighted indexing features.
─ Term weighting for training documents: there are K categories, with N_k documents in category k; D_{k,j} is the j-th document of the k-th category.
Indexing and Classification
─ Term weighting for training documents: S(w) is a smooth 0-1 function for avoiding the bias problem; α is a constant.
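The slide names S(w) and α without giving formulas, so the sketch below is only an assumed illustration: an exponential 0-1 squashing function standing in for S(w), combined into a tf-idf-flavoured weight. None of this is the paper's actual weighting scheme.

```python
import math

def smooth(w, alpha=1.0):
    """A smooth 0-1 function (illustrative stand-in for the slide's S(w)):
    rises from 0 toward 1 as w grows, with alpha controlling the slope."""
    return 1.0 - math.exp(-alpha * w)

def term_weight(tf, df, n_docs, alpha=1.0):
    """tf-idf-flavoured weight with a smoothed term frequency; an assumed
    sketch, not the paper's formula."""
    idf_like = math.log(1 + n_docs / (1 + df))
    return smooth(tf, alpha) * idf_like

print(smooth(0), round(smooth(10), 3))   # 0.0 1.0
```

Squashing raw counts into (0, 1) keeps a handful of very frequent terms from dominating the vector, which is the "bias problem" the slide alludes to.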
Indexing and Classification
─ Term weighting for unclassified documents: since the category of an unclassified document is not known, each unclassified document is represented as multiple description vectors, i.e. K vectors X_k, k = 1…K.
Indexing and Classification
─ Classification function: combine the vectors of each category into a mean vector; the classification function is f_Gk(X; A).
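A small sketch of the classification step. The slide does not spell out f_Gk(X; A), so cosine similarity between each description vector X_k and its category's mean vector is used here as an assumed stand-in; the training vectors are toy data.

```python
from math import sqrt

def mean_vector(vectors):
    """Elementwise mean of a list of equal-length document vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = sqrt(sum(x * x for x in a)), sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc_vecs, means):
    """doc_vecs: {k: X_k}, the K description vectors of one unclassified
    document; pick the category whose mean vector is most similar to its
    own X_k (an assumed stand-in for f_Gk(X; A))."""
    return max(means, key=lambda k: cosine(doc_vecs[k], means[k]))

means = {"政治": mean_vector([[1.0, 0.0], [0.9, 0.1]]),
         "體育": mean_vector([[0.0, 1.0], [0.1, 0.9]])}
doc = {"政治": [0.8, 0.2], "體育": [0.3, 0.7]}
print(classify(doc, means))   # 政治
```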
Experimental Results
─ Corpus: Min-Sheng Daily News (MSDN), 44,675 text documents containing over 35 million words; the 1997 data (to April 1997) was used for training, and the 1999 data (to July 1999) for testing.
─ Performance evaluation
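The evaluation formulas are not reproduced on the slides, so the sketch below simply gives the textbook per-category precision and recall, the usual measures in text categorization; it is not claimed to be the paper's exact evaluation setup.

```python
def precision_recall(tp, fp, fn):
    """Standard per-category precision and recall: tp = true positives,
    fp = false positives, fn = false negatives for one category."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

print(precision_recall(8, 2, 2))   # (0.8, 0.8)
```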
Experimental Results
─ Baseline performance: using the words defined in the dictionary.
Experimental Results
─ Parameter testing:
  ─ the number of representative terms is variable;
  ─ the number of terms selected from each category may or may not be constrained;
  ─ examine the effect of discriminability (Nor) on performance.
Experimental Results
─ Experimental results on the purification process
Experimental Results
─ Combined approach (unknown-word-based)
Experimental Results
─ Comparative performance
Experimental Results
─ Consistency between training and testing data
Conclusions
─ Two new concepts have been proposed: meaningful term extraction and discriminative term selection.
─ PLUs improve the performance of text categorization.
─ The purification process reduces the dimensionality of the feature space.
Personal Opinion
─ Advantages
  ─ Takes both meaningful and discriminative terms into account.
  ─ The purification process saves time.
  ─ Terms can be extracted automatically and systematically.
─ Applications
  ─ ICD-9 code classification, among others.
  ─ May help with patient records written in a mixture of Chinese and English.
─ Limitations
  ─ The sparse-data problem remains to be solved.