Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology
Text Classification Improved through Multigram Models
Authors: Dou Shen, Jian-Tao Sun, Qiang Yang, and Zheng Chen (CIKM 2006)
Reporter: Wen-Cheng Tsai, 2007/07/06
Outline
─ Motivation
─ Objective
─ Method: N-gram models; N-multigram models; N-gram models + N-multigram models
─ Experiments
─ Conclusion
─ Personal Comments
Motivation
In the past, much work has been conducted to find better ways to represent documents. However, most attempts rely on extra resources such as WordNet, or they face the problem of extremely high dimensionality.
Objective
We propose a new document representation approach based on n-multigram language models. This approach can automatically discover the hidden semantic sequences in the documents under each category. We put forward two text classification algorithms based on n-multigram language models and n-gram language models.
Method ─ N-gram models
An n-gram model assumes that the probability of one word in a document depends on its previous n-1 words. Given a word sequence W = w_1 w_2 … w_T, the probability of W can be calculated as follows:

P(W) = ∏_{i=1}^{T} P(w_i | w_{i-n+1} … w_{i-1})

P(w_i | w_{i-n+1} … w_{i-1}) can be estimated from a corpus with the Maximum Likelihood criterion. That is:

P(w_i | w_{i-n+1} … w_{i-1}) = C(w_{i-n+1} … w_i) / C(w_{i-n+1} … w_{i-1})

where C(·) denotes the number of occurrences of a word sequence in the training corpus.
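The maximum-likelihood estimate above can be sketched in a few lines. This is an illustrative toy implementation (not the paper's code): it counts each n-gram and its (n-1)-word context, then divides.

```python
from collections import defaultdict

def train_ngram_mle(tokens, n=2):
    """Estimate P(w_i | w_{i-n+1}..w_{i-1}) by maximum likelihood:
    count of the full n-gram divided by count of its (n-1)-gram context."""
    ngram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        gram = tuple(tokens[i:i + n])
        ngram_counts[gram] += 1
        context_counts[gram[:-1]] += 1
    return {g: c / context_counts[g[:-1]] for g, c in ngram_counts.items()}

# toy corpus: P(cat | the) = C(the cat) / C(the) = 1/2
probs = train_ngram_mle("the cat sat on the mat".split(), n=2)
```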
Method ─ N-gram models (smoothing)
In real-world applications, P(w_i | w_{i-n+1} … w_{i-1}) is often under-estimated due to data sparseness in the training set: many grams that do not appear in the training data are assigned zero probability. In back-off models, the smoothing function is as follows:

P_bo(w_i | w_{i-n+1} … w_{i-1}) = P*(w_i | w_{i-n+1} … w_{i-1})   if C(w_{i-n+1} … w_i) > 0
P_bo(w_i | w_{i-n+1} … w_{i-1}) = α(w_{i-n+1} … w_{i-1}) · P_bo(w_i | w_{i-n+2} … w_{i-1})   otherwise

where P* is a discounted estimate and α is the back-off weight that redistributes the discounted mass to unseen grams.
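A minimal sketch of the back-off idea, using a fixed back-off weight ("stupid back-off" style) rather than the discounted weights α(·) of the formula above; the dictionaries and the constant alpha are illustrative assumptions, not the paper's exact smoothing:

```python
def backoff_prob(ngram, ngram_probs, lower_probs, alpha=0.4):
    """Back-off sketch with a fixed weight alpha (a simplification of the
    smoothing above): use the n-gram estimate when the n-gram was seen in
    training, otherwise back off to the (n-1)-gram estimate times alpha."""
    if ngram in ngram_probs:
        return ngram_probs[ngram]
    return alpha * lower_probs.get(ngram[1:], 1e-8)

bigrams = {("the", "cat"): 0.5}
unigrams = {("cat",): 0.2}
backoff_prob(("the", "cat"), bigrams, unigrams)  # seen bigram: 0.5
backoff_prob(("a", "cat"), bigrams, unigrams)    # unseen: 0.4 * 0.2 = 0.08
```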
Method ─ N-multigram models
An n-multigram model segments a word stream into variable-length sequences of at most n words and models each sequence as a unit. Likelihood calculation: the likelihood of a word sequence W is the sum, over all possible segmentations S of W into sequences s_1 … s_m (each of length at most n), of the product of the sequence probabilities:

L(W) = Σ_S ∏_{t=1}^{m} P(s_t)
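Summing over all segmentations naively is exponential, but the likelihood can be computed with a forward dynamic program. A toy sketch (the sequence-probability table here is a hypothetical stand-in for the learned model, not the paper's trained parameters):

```python
def multigram_likelihood(tokens, seq_probs, n=3):
    """Forward dynamic program for the n-multigram likelihood: sum over all
    segmentations into sequences of length <= n of the product of sequence
    probabilities. seq_probs maps sequence tuples to probabilities."""
    T = len(tokens)
    fwd = [0.0] * (T + 1)  # fwd[t] = likelihood of the first t tokens
    fwd[0] = 1.0
    for t in range(1, T + 1):
        for k in range(1, min(n, t) + 1):
            seq = tuple(tokens[t - k:t])
            fwd[t] += fwd[t - k] * seq_probs.get(seq, 0.0)
    return fwd[T]

seq_probs = {("a",): 0.5, ("b",): 0.3, ("a", "b"): 0.2}
multigram_likelihood(["a", "b"], seq_probs)  # 0.5*0.3 + 0.2 = 0.35
```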
Method ─ Classifiers
N-gram model based classifiers:
─ C_NG: Good-Turing smoothing
─ C_NW: Witten-Bell smoothing
N-multigram model based classifier
Classifier based on the combination of n-gram and n-multigram models
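All of these classifiers share the same generative decision rule: score the document under each category's language model and pick the best. A minimal sketch with a toy unigram scorer (the models and scoring function are illustrative assumptions, not the paper's smoothed models):

```python
import math

def classify(doc_tokens, class_models, score_fn):
    """Generative classification sketch: score the document under each
    class-conditional language model and return the highest-scoring label."""
    return max(class_models, key=lambda c: score_fn(doc_tokens, class_models[c]))

def unigram_score(tokens, model):
    """Toy log-likelihood scorer with a small floor for unseen words."""
    return sum(math.log(model.get(w, 1e-8)) for w in tokens)

models = {"pos": {"good": 0.9, "bad": 0.1},
          "neg": {"good": 0.1, "bad": 0.9}}
classify(["good", "good", "bad"], models, unigram_score)  # -> "pos"
```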
Experiments
To speed up the experiments, we select 10% of the documents from the top 10 categories, which results in 47,042 documents. A 3-fold cross-validation procedure is applied in the experiments.
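For reference, 3-fold cross-validation partitions the data into three folds, each serving once as the test set. A minimal sketch (the round-robin fold assignment is an assumption; the paper does not specify how folds were drawn):

```python
def three_fold_splits(items):
    """3-fold cross-validation sketch: partition items into 3 folds; each
    fold is the test set once while the other two form the training set."""
    folds = [items[i::3] for i in range(3)]
    for i in range(3):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(three_fold_splits(list(range(9))))
# each split: 6 training items and 3 test items, together covering all 9
```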
Experiments
(Result figures and tables from the original slides 10-11.)
Conclusion
We proposed a document representation approach based on sequences automatically extracted through n-multigram models. We conducted a series of experiments on a subset of RCV1; the experiments show that our proposed text classification algorithms work well.
Personal Comments
Advantages ─ Improved performance
Disadvantages ─ Many parameters
Application ─ Text classification