Reporter: Shau-Shiang Hung (洪紹祥) Adviser: Shu-Chen Cheng (鄭淑真) Date: 99/06/15
Introduction
Document preprocessing
Scoring measures for feature selection
Classification, performance evaluation, and corpora description
Experiments
    Reuters
    Ohsumed
    Comparing the results
Conclusion
Machine learning (ML) automatically builds a classifier for a given category by observing the characteristics of a set of documents that have been classified manually under that category. The high dimensionality of text categorization (TC) problems makes most ML-based classification algorithms infeasible: many features can be irrelevant or noisy, and only a small percentage of the words are really meaningful. Feature selection (FS) is therefore performed to reduce the number of features and avoid overfitting.
Before performing FS, we must transform the documents to obtain a representation suitable for computational use. Additionally, we perform two kinds of feature reduction. The first removes stop words (extremely common words such as "the," "and," and "to"), which aren't useful for classification. The second is stemming, which maps words with the same meaning to one morphological form by removing suffixes.
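A minimal Python sketch of these two reduction steps, assuming NLTK's English stop-word list and Porter stemmer stand in for whatever tools were actually used:

```python
# Minimal preprocessing sketch: stop-word removal followed by stemming.
# Assumes NLTK with the 'stopwords' corpus downloaded; tokenization is a
# naive whitespace split, so treat this as illustrative only.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(document: str) -> list[str]:
    """Lowercase, drop stop words, and reduce words to their stems."""
    tokens = document.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The classifiers are classifying documents and features"))
# -> ['classifi', 'classifi', 'document', 'featur']
```

Note how "classifiers" and "classifying" collapse to the same stem, which is exactly the vocabulary reduction stemming is meant to provide.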
Information retrieval measures, such as TF-IDF, determine word relevance. Information theory measures consider a word's distribution over the different categories; information gain (IG), for example, takes into account the word's presence or absence in a category.
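The slide doesn't reproduce the formula; the standard formulation of IG for feature selection in text categorization (as in Yang and Pedersen's work) is:

```latex
IG(w) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
      + P(w)\sum_{i=1}^{m} P(c_i \mid w)\log P(c_i \mid w)
      + P(\bar{w})\sum_{i=1}^{m} P(c_i \mid \bar{w})\log P(c_i \mid \bar{w})
```

where m is the number of categories and w̄ denotes the absence of the word.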
ML measures: to define our measures, we associate to each pair (w, c) the rule w → c: if the word w appears in a document, then that document belongs to category c. We then use measures that have been applied to quantify the quality of the rules induced by an ML algorithm. In this way, we reduce quantifying the importance of a word w in a category c to quantifying the quality of the rule w → c.
Laplace measure (L): modifies the percentage of success, taking into account the number of documents in which the word appears.
Difference (D): establishes a balance between the documents of category c and the documents in other categories that also contain w.
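The slide gives no formula. For context, a common form of the Laplace estimate used to score a binary rule such as w → c is the following (an assumption on our part; the authors' exact variant may differ):

```latex
L(w \to c) = \frac{n_{w,c} + 1}{n_w + 2}
```

Here n_{w,c} is the number of documents of category c that contain w, and n_w is the total number of documents containing w. The +1 and +2 smooth the raw percentage of success n_{w,c}/n_w, so words appearing in very few documents can't reach extreme scores.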
Impurity level (IL): takes into account the number of documents in the category in which the word occurs and the distribution of the documents over the categories.
For a classifier, we chose support vector machines (SVMs) because they:
Have shown better results than other traditional text classifiers.
Perform well, since they handle examples with many features and deal well with sparse vectors.
Are binary classifiers that can determine linear or nonlinear threshold functions to separate the documents of one category from those of other categories.
Their disadvantages: they handle missing values poorly, and they don't perform well on multiclass classification.
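As an illustration of this one-binary-classifier-per-category setup, here is a minimal scikit-learn sketch; the toy documents, labels, and category are invented for the example and are not the experimental setup from the paper:

```python
# Sketch of one binary (one-vs-rest) SVM classifier for a single category.
# TF-IDF vectors are sparse and high-dimensional, which is exactly the
# setting where linear SVMs do well.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = document belongs to the category, 0 = it doesn't.
train_docs = ["oil prices rise sharply", "wheat harvest falls",
              "crude oil exports grow", "corn futures drop"]
train_labels = [1, 0, 1, 0]  # e.g., a Reuters-style "crude" category

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)
print(clf.predict(["oil supply tightens"]))  # likely [1]: "oil" occurs only in positives
```

Covering m categories means training m such classifiers, one per category, which is how binary SVMs are typically applied to multilabel corpora like Reuters-21578.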
Evaluating performance: precision (P) quantifies the percentage of documents classified as positive that actually belong to the category. Recall (R) quantifies the percentage of positive documents (those belonging to the category) that are correctly classified. F1 gives the same relevance to both precision and recall.
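In terms of true positives (TP), false positives (FP), and false negatives (FN) for a category, these measures are:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 P R}{P + R}
```

F1 is the harmonic mean of P and R, so a classifier can only score high by doing well on both.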
The corpora: we used the Reuters-21578 and Ohsumed corpora. Reuters-21578 is a set of economic news documents that Reuters published in 1987. Ohsumed is a clinically oriented MEDLINE subset consisting of 348,566 references from 270 medical journals published between 1987 and 1991.
The results show that our proposed measures are more dependent on certain statistical properties of the corpora, particularly the distribution of the words throughout the categories and of the documents over the categories. However, the ML measures exploit that dependence: for each corpus, at least one of these simple measures performs better than IG and TF-IDF.