Text Categorization by Boosting Automatically Extracted Concepts Lijuan Cai and Thomas Hofmann, Department of Computer Science, Brown University. SIGIR 2003
Outline –Lexical semantics alone are not sufficiently robust –Using pLSA for automatic concept extraction –Using AdaBoost to combine weak hypotheses –Experimental results confirm the validity of the approach
Text Categorization (1) –Recent excellent results: SVMs and AdaBoost –Document representations: term frequencies (bag-of-words), tf-idf, concept-based –Rather than using general-purpose thesauri, automatically extract domain-specific concepts
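The bag-of-words and tf-idf representations mentioned above can be sketched in a few lines of Python. This is a minimal illustration with a toy corpus (the documents and tokens are hypothetical), using raw term counts for tf and idf = log(M / df):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute sparse tf-idf vectors for a list of tokenized documents.
    tf = raw term count in the document, idf = log(M / df),
    where df is the number of documents containing the term."""
    M = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({w: tf[w] * math.log(M / df[w]) for w in tf})
    return vectors

# toy corpus of three tokenized "news stories"
docs = [["wheat", "grain", "export"],
        ["grain", "price", "rise"],
        ["stock", "price", "fall"]]
vecs = tfidf(docs)
```

Terms that appear in every document get idf = 0 and drop out of the representation, which is exactly why rare, topic-specific words dominate tf-idf vectors.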
Text Categorization (2) Concepts are extracted in an unsupervised learning stage and used as additional features for supervised learning. The documents used for concept extraction need not be labeled; a smaller set of labeled documents is used for the supervised stage.
Text Categorization (3) Three stages: –Stage 1: use pLSA to automatically extract concepts –Stage 2: define weak classifiers (hypotheses) based on single terms and on the extracted concepts –Stage 3: combine term-based and semantic weak hypotheses using AdaBoost
Using pLSA Documents D = {d1, d2, ..., dM}, words W = {w1, w2, ..., wN}, latent concepts Z = {z1, z2, ..., zK}. The distribution P(wj|zk) for a fixed zk is a representation of concept zk; P(zk|di) gives the concept membership of document di. The EM algorithm is used for model fitting.
Diagrammatic representation of pLSA
ADABOOST (1) For every category we have a labeled sample S = {(x1, y1), (x2, y2), ..., (xM, yM)} with labels y ∈ {-1, +1}. Two types of experiments: –AdaBoost.MH – minimizes classification error –AdaBoost.MR – minimizes ranking loss, i.e. the number of misordered pairs with f(x+) <= f(x-)
Using semantic features in ADABOOST P(zk|wj) can identify synonyms and polysemy to some extent. Documents are represented by semantic features P(zk|di) or word features n(di, wj): indicator functions for word features, threshold hypotheses for the continuous-valued semantic features.
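The two kinds of weak hypotheses above, an indicator on word occurrence and a threshold on the continuous semantic feature P(zk|d), might look like this (the document structure and all names here are hypothetical, chosen only for illustration):

```python
def word_stump(word):
    """Indicator weak hypothesis on word features:
    predicts +1 if the word occurs in the document, else -1."""
    return lambda doc: 1 if word in doc["terms"] else -1

def concept_stump(k, theta):
    """Threshold weak hypothesis on the continuous semantic
    feature P(z_k | d): predicts +1 if it exceeds theta, else -1."""
    return lambda doc: 1 if doc["p_z"][k] > theta else -1

# toy document: its term set plus its pLSA concept memberships P(z|d)
doc = {"terms": {"wheat", "export"}, "p_z": [0.8, 0.2]}
h_word = word_stump("wheat")
h_concept = concept_stump(1, 0.5)
```

Both kinds of stumps are then thrown into the same candidate pool, so boosting can freely mix term-based and semantic features.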
Experiments Datasets: Reuters-21578 (news-story collection) and Medline (OHSUMED collection for TREC-9). Metrics: precision, recall, F1, error (classification error / false alarm), micro-average, macro-average; ranking functions: maximal F1, BEP, adjusted F1.
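Micro- and macro-averaging differ only in when the per-category contingency counts are pooled; a small sketch with toy counts (the helper names are hypothetical):

```python
def f1_score(tp, fp, fn):
    """F1 from a single contingency table."""
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def micro_macro_f1(per_category):
    """per_category: list of (tp, fp, fn) tuples, one per category.
    Macro-average: mean of per-category F1 (small categories count equally).
    Micro-average: F1 of the pooled counts (large categories dominate)."""
    macro = sum(f1_score(*c) for c in per_category) / len(per_category)
    tp, fp, fn = (sum(c[i] for c in per_category) for i in range(3))
    return f1_score(tp, fp, fn), macro

# one frequent category (8, 2, 2) and one rare category (1, 1, 3)
micro, macro = micro_macro_f1([(8, 2, 2), (1, 1, 3)])
```

Because the rare category drags the macro-average down while barely affecting the pooled counts, a gain on small categories shows up mostly in the macro-averaged metrics, which is the effect the Reuters results slide points to.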
Results (Reuters) The relative gains for the macro-averaged metrics are higher, which seems to indicate that semantic features are especially useful for categories with a small number of positive examples.
Results (2)
Results (Medline)
Results (2) In initial rounds, term-based features are chosen more often, while semantic features dominate in later rounds.
Conclusion The three-stage approach addresses shortcomings of purely term-based representations. Experiments on two standard document collections support its validity. Future work: investigate the use of additional unlabeled data to improve the concept-extraction stage, as well as the use of linguistic resources.
Appendix: ADABOOST http://kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91e8d-dc15ed23e9 AdaBoost is a boosting algorithm: it runs a given weak learner several times on slightly altered training data and combines the resulting hypotheses into one final hypothesis, in order to achieve higher accuracy than any single weak hypothesis would have. The main idea of AdaBoost is to assign each example in the training set a weight. At the beginning all weights are equal, but in every round the weak learner returns a hypothesis, and the weights of all examples misclassified by that hypothesis are increased. That way the weak learner is forced to focus on the difficult examples of the training set. The final hypothesis is a combination of the hypotheses of all rounds, namely a weighted majority vote, where hypotheses with lower classification error have higher weight.
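The weight-update scheme described above can be sketched for the binary case. This is a minimal illustration, not the paper's multi-label variant: decision stumps over a toy 1-D dataset stand in for the weak learner, and all names are hypothetical:

```python
import math

def adaboost(X, y, hypotheses, rounds):
    """Binary AdaBoost with labels in {-1, +1}.
    Each round picks the weak hypothesis with the lowest weighted error,
    then up-weights the examples it misclassified."""
    w = [1.0 / len(X)] * len(X)
    ensemble = []
    for _ in range(rounds):
        h, err = min(
            ((h, sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi))
             for h in hypotheses),
            key=lambda pair: pair[1])
        err = min(max(err, 1e-10), 1 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, h))
        # up-weight misclassified examples, down-weight correct ones
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the weak hypotheses."""
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1

# toy 1-D data and threshold stumps as the pool of weak hypotheses
X = [0.0, 1.0, 2.0, 3.0]
y = [-1, -1, 1, 1]
stumps = [lambda x, t=t: 1 if x > t else -1 for t in (0.5, 1.5, 2.5)]
ensemble = adaboost(X, y, stumps, rounds=3)
```

The `t=t` default argument pins each threshold inside its lambda; without it every stump would share the loop's final value of `t`.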