Reporter: Shau-Shiang Hung (洪紹祥) Adviser: Shu-Chen Cheng (鄭淑真) Date: 99/06/15
Introduction
Document preprocessing
Scoring measures for feature selection
Classification, performance evaluation, and corpora description
Experiments
    Reuters
    Ohsumed
    Comparing the results
Conclusion
Machine learning (ML) automatically builds a classifier for a given category by observing the characteristics of a set of documents that have been classified manually under that category. The high dimensionality of text categorization (TC) problems makes most ML-based classification algorithms infeasible: many features can be irrelevant or noisy, and only a small percentage of the words are really meaningful. Feature selection (FS) is therefore performed to reduce the number of features and avoid overfitting.
Before performing FS, we must transform the documents to obtain a representation suitable for computational use. Additionally, we perform two kinds of feature reduction. The first removes stop words (extremely common words such as "the," "and," and "to"), which aren't useful for classification. The second is stemming, which maps words with the same meaning to one morphological form by removing suffixes.
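A minimal Python sketch of these two reduction steps, assuming NLTK's English stop-word list and Porter stemmer stand in for whatever tools were actually used:

```python
# Minimal preprocessing sketch: stop-word removal followed by stemming.
# Assumes NLTK with the 'stopwords' corpus downloaded; tokenization is a
# naive whitespace split, so treat this as illustrative only.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()

def preprocess(document: str) -> list[str]:
    """Lowercase, drop stop words, and reduce words to their stems."""
    tokens = document.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The classifiers are classifying documents and features"))
# -> ['classifi', 'classifi', 'document', 'featur']
```

Note how "classifiers" and "classifying" collapse to the same stem, which is exactly the vocabulary reduction stemming is meant to provide.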
Information retrieval measures, such as TF-IDF, determine word relevance. Information theory measures consider a word's distribution over the different categories; information gain (IG), for example, takes into account the word's presence or absence in a category.
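The slide doesn't reproduce the formula; the standard formulation of IG for feature selection in text categorization (as in Yang and Pedersen's work) is:

```latex
IG(w) = -\sum_{i=1}^{m} P(c_i)\log P(c_i)
      + P(w)\sum_{i=1}^{m} P(c_i \mid w)\log P(c_i \mid w)
      + P(\bar{w})\sum_{i=1}^{m} P(c_i \mid \bar{w})\log P(c_i \mid \bar{w})
```

where m is the number of categories and w̄ denotes the absence of the word.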
ML measures: to define our measures, we associate to each pair (w, c) the rule w → c: if the word w appears in a document, then that document belongs to category c. We then use measures that have been applied to quantify the quality of the rules induced by an ML algorithm. In this way, we reduce quantifying the importance of a word w in a category c to quantifying the quality of the rule w → c.
Laplace measure (L): modifies the percentage of success, taking into account the number of documents in which the word appears.
Difference (D): establishes a balance between the documents of category c and the documents in other categories that also contain w.
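The slide gives no formula. For context, a common form of the Laplace estimate used to score a binary rule such as w → c is the following (an assumption on our part; the authors' exact variant may differ):

```latex
L(w \to c) = \frac{n_{w,c} + 1}{n_w + 2}
```

Here n_{w,c} is the number of documents of category c that contain w, and n_w is the total number of documents containing w. The +1 and +2 smooth the raw percentage of success n_{w,c}/n_w, so words appearing in very few documents can't reach extreme scores.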
Impurity level (IL): takes into account the number of documents in the category in which the word occurs and the distribution of the documents over the categories.
For a classifier, we chose support vector machines (SVMs) because they:
Have shown better results than other traditional text classifiers.
Perform well, since they handle examples with many features and deal well with sparse vectors.
Are binary classifiers that can determine linear or nonlinear threshold functions to separate the documents of one category from those of other categories.
Their disadvantages: they handle missing values poorly, and they don't perform well on multiclass classification.
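As an illustration of this one-binary-classifier-per-category setup, here is a minimal scikit-learn sketch; the toy documents, labels, and category are invented for the example and are not the experimental setup from the paper:

```python
# Sketch of one binary (one-vs-rest) SVM classifier for a single category.
# TF-IDF vectors are sparse and high-dimensional, which is exactly the
# setting where linear SVMs do well.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = document belongs to the category, 0 = it doesn't.
train_docs = ["oil prices rise sharply", "wheat harvest falls",
              "crude oil exports grow", "corn futures drop"]
train_labels = [1, 0, 1, 0]  # e.g., a Reuters-style "crude" category

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)
print(clf.predict(["oil supply tightens"]))  # likely [1]: "oil" occurs only in positives
```

Covering m categories means training m such classifiers, one per category, which is how binary SVMs are typically applied to multilabel corpora like Reuters-21578.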
Evaluating performance: precision (P) quantifies the percentage of documents classified as positive that actually belong to the category. Recall (R) quantifies the percentage of positive documents (those belonging to the category) that are correctly classified. F1 gives the same relevance to both precision and recall.
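In terms of true positives (TP), false positives (FP), and false negatives (FN) for a category, these measures are:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 P R}{P + R}
```

F1 is the harmonic mean of P and R, so a classifier can only score high by doing well on both.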
The corpora: we used the Reuters-21578 and Ohsumed corpora. Reuters-21578 is a set of economic news documents that Reuters published in 1987. Ohsumed is a clinically oriented MEDLINE subset consisting of 348,566 references from 270 medical journals published between 1987 and 1991.
The results show that our proposed measures are more dependent on certain statistical properties of the corpora, particularly the distribution of the words throughout the categories and of the documents over the categories. However, the ML measures exploit that dependence: for each corpus, at least one of these simple measures performs better than IG and TF-IDF.