Text Classification Improved through Multigram Models
Intelligent Database Systems Lab, 國立雲林科技大學 National Yunlin University of Science and Technology

Slide 1: Text Classification Improved through Multigram Models
Authors: Dou Shen, Jian-Tao Sun, Qiang Yang, and Zheng Chen (CIKM 2006)
Reporter: Wen-Cheng Tsai, 2007/07/06

Slide 2: Outline
 Motivation
 Objective
 Method
─ N-gram models
─ N-multigram models
─ N-gram models + N-multigram models
 Experiments
 Conclusion
 Personal Comments

Slide 3: Motivation
 Much past work has sought better ways to represent documents. However, most attempts either rely on extra resources such as WordNet, or face the problem of extremely high dimensionality.

Slide 4: Objective
 We propose a new document representation approach based on n-multigram language models.
 This approach can automatically discover the hidden semantic sequences in the documents of each category.
 We put forward two text classification algorithms, based on n-multigram language models and n-gram language models respectively.

Slide 5: Method
 N-gram models: the probability of each word in a document is assumed to depend on its previous n-1 words. Given a word sequence W = w_1 w_2 … w_T, the probability of W is calculated as:

P(W) = ∏_{i=1}^{T} P(w_i | w_{i-n+1} … w_{i-1})

 P(w_i | w_{i-n+1} … w_{i-1}) can be estimated from a corpus by the maximum likelihood criterion:

P(w_i | w_{i-n+1} … w_{i-1}) = count(w_{i-n+1} … w_i) / count(w_{i-n+1} … w_{i-1})
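The maximum likelihood estimate above can be sketched in a few lines. This is a minimal illustration for the bigram case (n = 2) with hypothetical toy data, not the paper's implementation:

```python
from collections import defaultdict

def train_bigram_mle(corpus):
    """Estimate bigram probabilities by maximum likelihood:
    P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1}).
    Each sentence is prefixed with a <s> start marker."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sentence in corpus:
        tokens = ["<s>"] + sentence
        for prev, cur in zip(tokens, tokens[1:]):
            unigram[prev] += 1          # count of the conditioning context
            bigram[(prev, cur)] += 1    # count of the full bigram
    return {pair: c / unigram[pair[0]] for pair, c in bigram.items()}

probs = train_bigram_mle([["text", "classification"], ["text", "mining"]])
# "text" is followed by "classification" in 1 of its 2 occurrences,
# so P("classification" | "text") = 0.5
```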

Slide 6: Method
 In real-world applications, P(w_i | w_{i-n+1} … w_{i-1}) is often underestimated because of data sparseness in the training set: many grams that never appear in the training data are assigned zero probability.
 Back-off models smooth these estimates by assigning an unseen n-gram a probability derived from the lower-order (n-1)-gram model, scaled by a back-off weight.
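The slide's exact smoothing formula was an image and is lost in this transcript. As a stand-in, here is a simplified "stupid back-off" sketch in the same spirit: use the MLE estimate when the bigram was observed, otherwise fall back to a scaled unigram estimate. The constant `alpha` and the toy counts are illustrative assumptions:

```python
def backoff_prob(bigram_counts, unigram_counts, total, prev, word, alpha=0.4):
    """Bigram probability with back-off: the MLE estimate when the bigram
    was observed, otherwise the unigram probability scaled by a fixed
    weight alpha. (Katz back-off additionally discounts seen grams and
    chooses alpha per context so the probabilities sum to 1.)"""
    if (prev, word) in bigram_counts:
        return bigram_counts[(prev, word)] / unigram_counts[prev]
    return alpha * unigram_counts.get(word, 0) / total

bigrams = {("text", "mining"): 2}
unigrams = {"text": 4, "mining": 2}
p_seen = backoff_prob(bigrams, unigrams, 10, "text", "mining")    # MLE: 2/4
p_unseen = backoff_prob(bigrams, unigrams, 10, "mining", "text")  # backed off
```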

Slide 7: Method
 N-multigram models
 Likelihood calculation
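The likelihood formulas on this slide were images and did not survive the transcript. In the standard multigram model (which this work builds on), a document is segmented into variable-length word units of at most n words, and its likelihood sums, over all possible segmentations, the product of the unit probabilities. A forward dynamic-programming sketch of that computation, assuming the unit probabilities `unit_prob` have already been estimated (the model typically learns them with EM):

```python
def multigram_likelihood(words, unit_prob, n=3):
    """Likelihood of a word sequence under an n-multigram model:
    the sum over every segmentation into units of length <= n of the
    product of unit probabilities, computed by a forward recursion
    alpha[t] = sum_k alpha[t-k] * p(words[t-k:t])."""
    T = len(words)
    alpha = [0.0] * (T + 1)
    alpha[0] = 1.0  # empty prefix
    for t in range(1, T + 1):
        for k in range(1, min(n, t) + 1):
            unit = tuple(words[t - k:t])
            alpha[t] += alpha[t - k] * unit_prob.get(unit, 0.0)
    return alpha[T]

# Toy unit inventory (hypothetical probabilities):
unit_prob = {("a",): 0.5, ("b",): 0.3, ("a", "b"): 0.2}
# Two segmentations of "a b": [a][b] -> 0.15 and [a b] -> 0.2, total 0.35
```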

Slide 8: Method
 N-gram model based classifiers
─ C_NG: Good-Turing smoothing
─ C_NW: Witten-Bell smoothing
 N-multigram model based classifier
 Classifier based on the combination of n-gram and n-multigram models
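A language-model classifier of the kind listed above assigns a document to the category whose model gives it the highest likelihood. A minimal sketch with per-category unigram models, where a small floor probability stands in for the Good-Turing/Witten-Bell smoothing (the category models and floor value are illustrative assumptions):

```python
import math

def log_likelihood(model, words, floor=1e-6):
    """Unigram log-likelihood of a document; unseen words get a small
    floor probability (a stand-in for proper smoothing)."""
    return sum(math.log(model.get(w, floor)) for w in words)

def classify(doc_words, category_models):
    """Return the category whose (smoothed) language model assigns the
    document the highest log-likelihood."""
    return max(category_models,
               key=lambda c: log_likelihood(category_models[c], doc_words))

models = {"sports": {"ball": 0.5, "game": 0.5},
          "tech": {"code": 0.5, "data": 0.5}}
label = classify(["ball", "game"], models)  # -> "sports"
```

The combined classifier on the slide would score each category with both an n-gram and an n-multigram model and merge the two scores; here only the single-model case is shown.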

Slide 9: Experiments
 To speed up the experiments, 10% of the documents are selected from the top 10 categories, resulting in 47,042 documents.
 A 3-fold cross-validation procedure is applied.

Slide 10: Experiments (results figure in the original slides)

Slide 11: Experiments (results figure in the original slides)

Slide 12: Conclusion
 We proposed a document representation approach based on sequences automatically extracted through n-multigram models.
 We conducted a series of experiments on a subset of RCV1; the experiments show that the proposed text classification algorithms work well.

Slide 13: Personal Comments
 Advantages
─ Improved classification performance
 Disadvantages
─ Many parameters to estimate
 Applications
─ Text classification

