CATEGORIZATION OF NEWS ARTICLES USING NEURAL TEXT CATEGORIZER FUZZ-IEEE 2009, Korea, August 20-24, 2009 Taeho Jo Inha University Reporter:洪紹祥 Adviser:鄭淑真
OUTLINE Introduction Previous Works Framework Empirical Results Conclusions
INTRODUCTION(1/2) Text categorization is necessary for managing textual text as efficiently as possible. Text categorization is requires the two manual preliminary: The predefined of categorization The preparation of sample labeled documents.
INTRODUCTION(2/2) Traditional Text Categorization requires encoding documents into numerical vector Cause two main problems: Huge dimensionality. Sparse distribution. Solve the two main problem Encoded into String Vector Different from numerical Vector, words are given as feature value. Propose a neural network, called NTC(Neural Text Categorizer)
PREVIOUS WORKS(1/2) Popular approaches for Text categorization KNN(K Nearest Neighbor) NB(Naïve Bayes) SVM(Support Vector Machine) Neural Networks Causes two Main problems Huge dimensionality. Sparse distribution. Using String kernel in SVM Failed to improve the performance.
PREVIOUS WORKS(2/2) String kernel Receives two raw texts as inputs and computes their syntactical similarity between them Advantage Don’t need to be encoded into numerical vector. More transparent than numerical vector . Easier to trace why each document is classified. Disadvantages Cost too much time for computing the similarity.
FRAMEWORK(1/2) Bag of Words
FRAMEWORK(2/2)
EMPIRICAL RESULTS(1/3) The collection of news articles, called NewsPage. News articles 500 dimensional numerical vectors. 50 dimensional string vectors.
EMPIRICAL RESULTS(2/3) The configuration of participating approaches
EMPIRICAL RESULTS(3/3) The Results of This Set of Experiments
CONCLUSIONS The four contributions are considered as the significance of this research. According to the results of the set of experiments, this research proposes the practical approach. It solved the two main problems, the huge dimensionality the sparse distribution Created a new neural network, called NTC. Provides the potential easiness for tracing why each document is classified so.