Document Classification Method with Small Training Data

1 Document Classification Method with Small Training Data
Yasunari MAEDA (Kitami Institute of Technology), Hideki YOSHIDA (Kitami Institute of Technology), Toshiyasu MATSUSHIMA (Waseda University)

2 Topics
Overview of document classification
Document classification using distance
Document classification using probabilistic model
Preparations
Our first previous research
Our second previous research
Experiments of previous methods
Proposed method
Experiments
Conclusion

3 Overview of document classification
Key words in documents are used to classify them. Example: classification of newspaper articles.
Result of classification (class: articles)
economy: art1, art3
science: art5, art6
sport: art2, art4
Key words of each class (class: key words)
economy: stocks, company, ...
science: amino acid, computer, ...
sport: baseball, swimming, ...
Example article: "A new characteristic of amino acid was found." The key word "amino acid" indicates the class "science".
In many cases, each key word belongs to more than one class.

4 Document classification using distance
A new article is classified into the class whose distance from the article is minimum.
[Figure: articles art A1 to art A5 in class A, art B1 to art B4 in class B, art C1 to art C5 in class C, and the new article lying closest to class A.]
The distance between class A and the new article is minimum, so the new article is classified into class A.
It is very easy to use in real cases. There is no theoretical guarantee on accuracy. The vector space model is a well-known example.
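
Distance-based classification can be sketched in a few lines. The example below is a minimal illustration, assuming bag-of-words vectors, cosine distance, and one summed vector per class; the slide does not specify the distance or the class representation, and the toy training texts and function names are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Bag-of-words term-frequency vector for a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine_distance(u, v):
    """1 - cosine similarity; 1.0 when either vector is empty."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def class_centroids(labelled_docs):
    """Sum the term vectors of each class (scale does not affect cosine)."""
    centroids = {}
    for label, text in labelled_docs:
        centroids.setdefault(label, Counter()).update(to_vector(text))
    return centroids


def classify_by_distance(text, centroids):
    """Return the class whose centroid is nearest to the text."""
    vec = to_vector(text)
    return min(centroids, key=lambda c: cosine_distance(vec, centroids[c]))


# Hypothetical toy training data built from the key words on slide 3.
TRAINING = [
    ("economy", "stocks company stocks market"),
    ("science", "amino acid computer experiment"),
    ("sport", "baseball swimming baseball team"),
]
centroids = class_centroids(TRAINING)
label = classify_by_distance(
    "a new characteristic of amino acid was found", centroids
)
print(label)
```

Only the "science" centroid shares any terms with the example article, so it is the nearest class in this toy setting.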

5 Document classification using probabilistic model
Key words occur according to probability distributions. A new article is classified into the class that minimizes the error rate.
[Figure: parameters which govern the probability distributions determine the occurrence of an article. The class "science" occurs, then the key word "amino acid" occurs under the condition that the class "science" occurs, giving the article "A new characteristic of amino acid was found."]
Article classification = estimation of the class.
Our previous research minimizes the error rate with respect to the Bayes criterion, but its accuracy is low with small training data.
We want to improve accuracy with small training data.

6 Preparations (1/3)
The following quantities are defined:
a class of documents, and the set of classes
a key word, and the set of key words
the probability that a class occurs, the parameter which governs this probability, and its true value, which is unknown
the probability that a key word occurs in a document of a given class, the parameter which governs this probability, and its true value, which is unknown
a new document
the class of the new document (unknown)
the string of key words in the new document (known)
the number of key words in the new document

7 Preparations (2/3)
The training data is described by the following quantities:
the number of documents in the training data
the class of the i-th document in the training data
the number of key words in the i-th document in the training data
the string of key words in the i-th document in the training data
the j-th key word in the i-th document in the training data
Eq. (1) gives the probability that the new document occurs. Eq. (2) gives the probability that the training data occurs.
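
The two probabilities named on this slide can be sketched as follows, assuming that key words occur independently given the class. All symbols are assumed notation, since none are given in the text: $\theta^*$ and $\phi^*$ are the true parameters, $d$ is the new document with class $c_d$ and key words $w_{d,1}, \dots, w_{d,n_d}$, and the $i$-th of $N$ training documents has class $c_i$ and key words $w_{i,1}, \dots, w_{i,n_i}$. The original equations may differ in form.

```latex
% Assumed form of eq. (1): probability that the new document occurs
p(c_d, w_{d,1} \cdots w_{d,n_d} \mid \theta^*, \phi^*)
  = p(c_d \mid \theta^*) \prod_{j=1}^{n_d} p(w_{d,j} \mid c_d, \phi^*)

% Assumed form of eq. (2): probability that the training data occurs
p(\text{training data} \mid \theta^*, \phi^*)
  = \prod_{i=1}^{N} p(c_i \mid \theta^*) \prod_{j=1}^{n_i} p(w_{i,j} \mid c_i, \phi^*)
```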

8 Preparations (3/3)
Document classification problem: estimating the unknown class of the new document, given the string of key words in the new document and the training data.

9 Our first previous research (1/2)
The class of the new document is estimated by eq. (3), whose terms are given by eqs. (4) and (5). These equations use the following quantities:
the parameters of the Dirichlet distributions (the prior distributions for the unknown parameters)
the number of documents of each class in the training data
the number of occurrences of each key word in the documents of each class in the training data
the number of occurrences of each key word in the string of key words of the new document
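
An estimate built from these counts and Dirichlet parameters commonly takes the form below. This is a sketch under assumed notation, not a transcription of eqs. (3) to (5): $N(c)$ is the number of training documents in class $c$, $N(w \mid c)$ the number of occurrences of key word $w$ in class $c$, $N(w \mid x)$ the number of occurrences of $w$ in the new document's key-word string $x$, and $\alpha(c)$, $\beta(w \mid c)$ the Dirichlet parameters. It plugs posterior-mean estimates into the class score, which may differ in detail from the original Bayes-optimal rule.

```latex
% Assumed form of the decision rule (cf. eq. (3))
\hat{c} = \arg\max_{c} \; \hat{p}(c) \prod_{w} \hat{p}(w \mid c)^{N(w \mid x)}

% Assumed form of the class term (cf. eq. (4))
\hat{p}(c) = \frac{N(c) + \alpha(c)}{\sum_{c'} \bigl( N(c') + \alpha(c') \bigr)}

% Assumed form of the key-word term (cf. eq. (5))
\hat{p}(w \mid c) = \frac{N(w \mid c) + \beta(w \mid c)}{\sum_{w'} \bigl( N(w' \mid c) + \beta(w' \mid c) \bigr)}
```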

10 Our first previous research (2/2)
Our first previous method is the optimal method which minimizes the error rate with respect to the Bayes criterion. However, its accuracy is low with small training data. The value 0.5 is used for each parameter of the prior distributions in order to represent no information.
With small training data, accuracy depends on the prior distributions. Our second previous research therefore aims to improve accuracy with small training data.
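
The role of the 0.5 prior parameter can be illustrated with a small classifier. This is a minimal sketch, assuming a naive independence model with posterior-mean (smoothed) probabilities; it is not claimed to reproduce the authors' exact rule, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_counts(labelled_docs):
    """Count documents per class and key-word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, words in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
    return class_counts, word_counts


def classify(words, class_counts, word_counts, classes, vocab, prior=0.5):
    """Pick the class with the highest smoothed log score.

    `prior` is the Dirichlet parameter added to every count
    (0.5 is the value the slide uses to represent no information).
    """
    n_docs = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c in classes:
        score = math.log(
            (class_counts[c] + prior) / (n_docs + prior * len(classes))
        )
        total = sum(word_counts[c].values()) + prior * len(vocab)
        for w in words:
            if w in vocab:  # ignore words outside the key-word set
                score += math.log((word_counts[c][w] + prior) / total)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Hypothetical toy data built from the key words on slide 3.
CLASSES = ["economy", "science", "sport"]
VOCAB = {"stocks", "company", "amino", "acid", "computer",
         "baseball", "swimming"}
TRAIN = [
    ("economy", ["stocks", "company"]),
    ("science", ["amino", "acid", "computer"]),
    ("sport", ["baseball", "swimming"]),
]
class_counts, word_counts = train_counts(TRAIN)
result = classify(["amino", "acid"], class_counts, word_counts,
                  CLASSES, VOCAB)
print(result)
```

With so few training documents, the added 0.5 is comparable in size to the observed counts, which is why the choice of prior matters most when training data is small.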

11 Our second previous research (1/2)
We estimate the prior distributions using estimating data, that is, data used only for estimating the prior distributions. The new documents and the training data occur from the same source. The estimating data occurs from another source.
The estimating data is given by eq. (6), with the following quantities:
the number of documents in the estimating data
the class of the i-th document
the number of key words in the i-th document
the string of key words in the i-th document
the j-th key word in the i-th document

12 Our second previous research (2/2)
The parameters in eq. (4) and eq. (5) are estimated using the estimating data, as given by eqs. (7) and (8). The remaining quantities in these equations are parameters of Dirichlet distributions.
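
One plausible way to derive prior parameters from estimating data is to add the counts observed in the estimating data to a flat base value. The sketch below assumes exactly that; it is an illustration of the idea, not a reconstruction of eqs. (7) and (8), and the names `priors_from_estimating_data`, `alpha`, `beta`, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def count(labelled_docs):
    """Count documents per class and key-word occurrences per class."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, words in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(words)
    return class_counts, word_counts


def priors_from_estimating_data(estimating, classes, vocab, base=0.5):
    """Assumed rule: prior parameter = estimating-data count + base."""
    est_class, est_word = count(estimating)
    alpha = {c: est_class[c] + base for c in classes}
    beta = {c: {w: est_word[c][w] + base for w in vocab} for c in classes}
    return alpha, beta


def classify_with_priors(words, train, alpha, beta, classes, vocab):
    """Smoothed classification using per-class, per-word prior parameters."""
    class_counts, word_counts = count(train)

    def score(c):
        # The class-term denominator is the same for every class, so it is omitted.
        s = math.log(class_counts[c] + alpha[c])
        total = sum(word_counts[c][w] + beta[c][w] for w in vocab)
        for w in words:
            if w in vocab:
                s += math.log((word_counts[c][w] + beta[c][w]) / total)
        return s

    return max(classes, key=score)


CLASSES = ["economy", "science", "sport"]
VOCAB = ["stocks", "company", "amino", "acid", "baseball"]
TRAIN = [("economy", ["stocks"])]            # very small training data
ESTIMATING = [                               # older data from another source
    ("economy", ["company"]),
    ("science", ["amino", "acid"]),
    ("sport", ["baseball"]),
]
NEW_DOC = ["amino", "acid"]

flat_alpha = {c: 0.5 for c in CLASSES}
flat_beta = {c: {w: 0.5 for w in VOCAB} for c in CLASSES}
flat_result = classify_with_priors(NEW_DOC, TRAIN, flat_alpha, flat_beta,
                                   CLASSES, VOCAB)

est_alpha, est_beta = priors_from_estimating_data(ESTIMATING, CLASSES, VOCAB)
est_result = classify_with_priors(NEW_DOC, TRAIN, est_alpha, est_beta,
                                  CLASSES, VOCAB)
print(flat_result, est_result)
```

In this toy case the flat prior lets the single training document dominate, while the estimated prior carries enough information about "amino" and "acid" to select "science".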

13 Experiments of previous methods (1/2)
Comparison between our first previous method and our second previous method.
Our first previous method (prev1): 0.5 is used as each parameter of the prior distributions.
Our second previous method (prev2): the prior distributions are estimated using estimating data.
New documents: Japanese Mainichi Newspaper 2007
Training data: Japanese Mainichi Newspaper 2007
Estimating data: Japanese Mainichi Newspaper 1994

14 Experiments of previous methods (2/2)
The accuracy of prev2 is higher than that of prev1 with small training data, but lower than that of prev1 with large training data.

15 Proposed method
The parameters in eq. (4) and eq. (5) are estimated as given by eqs. (9) and (10).

16 Experiments (1/2)
Comparison between our first previous method and the new proposed method.
Our first previous method (prev1): 0.5 is used as each parameter of the prior distributions.
Our new proposed method (pro).
New documents: Japanese Mainichi Newspaper 2007
Training data: Japanese Mainichi Newspaper 2007
Estimating data: Japanese Mainichi Newspaper 1994

17 Experiments (2/2)
The accuracy of pro is higher than that of prev1 at all points.

18 Conclusion
The accuracy of our new proposed method is higher than that of our first previous method with small training data, and the two accuracies are equal when the training data is large.
With small training data, the proposed method mainly uses the estimating data. With large training data, it mainly uses the training data.
Further work: we want to study a method for choosing the parameters of the proposed method.
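
The shift from estimating data to training data as the training set grows can be illustrated with a prior of fixed total weight. This is a hedged sketch of that general behaviour only; it is not the rule of eqs. (9) and (10), and `blended_probability`, `strength`, and the frequencies below are hypothetical.

```python
def blended_probability(train_count, train_total, est_freq, strength=10.0):
    """Posterior-mean estimate with a prior of fixed total weight.

    The prior contributes `strength` pseudo-observations distributed
    according to `est_freq`, the relative frequency in the estimating data.
    """
    return (train_count + strength * est_freq) / (train_total + strength)


est_freq = 0.30    # hypothetical key-word frequency in the estimating data
train_freq = 0.10  # hypothetical key-word frequency in the training data

# Small training data: 10 key words observed, 1 of them the key word of interest.
small = blended_probability(train_freq * 10, 10, est_freq)
# Large training data: 10000 key words observed, 1000 of them the key word.
large = blended_probability(train_freq * 10000, 10000, est_freq)
print(small, large)
```

With 10 observations the estimate sits halfway between the two frequencies, while with 10000 observations it is almost entirely determined by the training data.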

