Text Categorization by Boosting Automatically Extracted Concepts. Lijuan Cai and Thomas Hofmann, Department of Computer Science, Brown University. SIGIR 2003.


Outline Lexical semantics alone are not sufficiently robust. Using pLSA to automatically extract concepts. Using AdaBoost to combine weak hypotheses. Experimental results confirm the validity of the approach.

Text Categorization (1) Best recent results: SVMs and AdaBoost. Document representation: term frequencies (bag-of-words), tf-idf, or concept-based. Rather than using general-purpose thesauri, automatically extract domain-specific concepts.

Text Categorization (2) Concepts are extracted in an unsupervised learning stage and used as additional features for supervised learning. The documents used for concept extraction need not be labeled; a (typically smaller) set of labeled documents is used for the supervised stage.

Text Categorization (3) 3 stages: –Stage 1: use pLSA to automatically extract concepts –Stage 2: weak classifiers (hypotheses) are defined based on single terms and on the extracted concepts –Stage 3: term-based and semantic weak hypotheses are combined using AdaBoost

Using pLSA Documents D={d1,d2,...,dM}, words W={w1,w2,...,wN}, latent concepts Z={z1,z2,...,zK}. The distribution P(wj|zk) for a fixed zk is the representation of concept zk; P(zk|di) gives the concept membership of document di. The model is fit with the EM algorithm.
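To make the model fitting concrete, here is a minimal numpy sketch of pLSA trained by EM (a sketch only: it assumes a dense term-frequency matrix n_dw, and the function and variable names are illustrative, not from the paper):

    import numpy as np

    def plsa(n_dw, K, iters=50, seed=0):
        # n_dw: (M, N) matrix of counts n(d, w); K: number of latent concepts.
        # Returns P(w|z) with shape (K, N) and P(z|d) with shape (M, K).
        rng = np.random.default_rng(seed)
        M, N = n_dw.shape
        p_w_z = rng.random((K, N)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = rng.random((M, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        for _ in range(iters):
            # E-step: responsibilities P(z|d,w) proportional to P(z|d) P(w|z), shape (M, K, N)
            post = p_z_d[:, :, None] * p_w_z[None, :, :]
            post /= post.sum(axis=1, keepdims=True) + 1e-12
            # M-step: reweight responsibilities by the observed counts n(d, w)
            weighted = n_dw[:, None, :] * post
            p_w_z = weighted.sum(axis=0)
            p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
            p_z_d = weighted.sum(axis=2)
            p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
        return p_w_z, p_z_d

Each row of p_z_d is the concept-membership vector P(z|di) that later serves as the semantic feature representation of document di in the boosting stage.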

Diagrammatic representation of pLSA: documents d generate latent concepts z, which in turn generate words w, i.e. P(di,wj) = P(di) Σk P(zk|di) P(wj|zk).

AdaBoost (1) For every category we have a training set S = {(x1,y1),(x2,y2),...,(xM,yM)} with labels y ∈ {−1,+1}. Two variants are used in the experiments: –AdaBoost.MH – minimizes classification error –AdaBoost.MR – minimizes ranking loss, i.e. the fraction of positive/negative pairs for which f(x+) ≤ f(x−)
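For reference, a standard way to write the two objectives (notation assumed from Schapire and Singer's boosting papers, not taken from the slides):

    \mathrm{err}(f) = \frac{1}{M} \sum_{i=1}^{M} \mathbf{1}\!\left[\operatorname{sign}(f(x_i)) \ne y_i\right]

    \mathrm{rloss}(f) = \frac{1}{|X^{+}|\,|X^{-}|} \sum_{x^{+} \in X^{+}} \sum_{x^{-} \in X^{-}} \mathbf{1}\!\left[f(x^{+}) \le f(x^{-})\right]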

Using semantic features in AdaBoost P(zk|wj) can identify synonyms and polysemy to some extent. Document representation: semantic features P(zk|di) or word features n(di,wj). Weak hypotheses: indicator functions for the binary word features; threshold hypotheses for the continuous-valued semantic features.
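A minimal sketch of the two kinds of weak hypotheses (the function names and the output values c_* are illustrative assumptions, not from the paper):

    def word_stump(n_di, j, c_present, c_absent):
        # indicator hypothesis on the binary word feature: is w_j present in d_i?
        return c_present if n_di[j] > 0 else c_absent

    def concept_stump(p_z_di, k, theta, c_above, c_below):
        # threshold hypothesis on the continuous concept feature P(z_k | d_i)
        return c_above if p_z_di[k] >= theta else c_below

In each boosting round, the learner would pick the stump (word or concept, with its parameters) that best fits the current example weights.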

Experiments Datasets: Reuters (news story collection) and Medline (OHSUMED collection from TREC-9). Metrics: precision, recall, F1, error (classification error / false alarm), micro-average and macro-average. Ranking-based evaluation: maximal F1, break-even point (BEP), adjusted F1.
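Since the results contrast micro- and macro-averaging, a brief sketch of the difference (illustrative code, assuming per-category (tp, fp, fn) counts): micro-averaging pools the counts of all categories before computing F1, while macro-averaging computes F1 per category and then averages, so small categories carry as much weight as large ones.

    def micro_macro_f1(stats):
        # stats: list of (tp, fp, fn) tuples, one per category
        def f1(tp, fp, fn):
            p = tp / (tp + fp) if tp + fp else 0.0
            r = tp / (tp + fn) if tp + fn else 0.0
            return 2 * p * r / (p + r) if p + r else 0.0
        macro = sum(f1(*s) for s in stats) / len(stats)   # average of per-category F1
        tp, fp, fn = (sum(col) for col in zip(*stats))    # pool counts across categories
        micro = f1(tp, fp, fn)
        return micro, macro

This distinction is why the macro-averaged gains on the next slide point to improvements on categories with few positive examples.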

Results (Reuters) The relative gains for the macro-averaged metrics are higher, which seems to indicate that semantic features are especially useful for categories with a small number of positive examples.

Results (2)

Results (Medline)

Results (2) In initial rounds, term-based features are chosen more often, while semantic features dominate in later rounds.

Conclusion The three-stage approach addresses the shortcomings of purely term-based representations. Experiments on two standard document collections support its validity. Future work: investigate the use of additional unlabeled data, as well as linguistic resources, to improve the concept extraction stage.

Appendix: AdaBoost (http://kiew.cs.uni-dortmund.de:8001/mlnet/instances/81d91e8d-dc15ed23e9) AdaBoost is a boosting algorithm: it runs a given weak learner several times on slightly altered training data and combines the resulting hypotheses into one final hypothesis, in order to achieve higher accuracy than any single weak hypothesis would have. The main idea of AdaBoost is to assign each example of the training set a weight. At the beginning all weights are equal, but in every round the weak learner returns a hypothesis, and the weights of all examples misclassified by that hypothesis are increased. In this way the weak learner is forced to focus on the difficult examples of the training set. The final hypothesis is a weighted majority vote of the hypotheses of all rounds, where hypotheses with lower classification error receive higher weight.
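A minimal sketch of this weighting scheme as code (discrete AdaBoost over a fixed pool of weak hypotheses; the function names and the pool-based hypothesis selection are my assumptions for the example):

    import numpy as np

    def adaboost(X, y, stumps, T=100):
        # X: list of examples, y: array of labels in {-1, +1},
        # stumps: list of callables h(x) -> -1 or +1
        y = np.asarray(y)
        D = np.full(len(y), 1.0 / len(y))                    # all example weights equal at the start
        P = np.array([[h(x) for x in X] for h in stumps])    # precomputed predictions, (S, M)
        ensemble = []
        for _ in range(T):
            errs = (D * (P != y)).sum(axis=1)                # weighted error of each hypothesis
            t = int(np.argmin(errs))
            eps = max(float(errs[t]), 1e-12)
            if eps >= 0.5:
                break                                        # no hypothesis better than chance
            alpha = 0.5 * np.log((1.0 - eps) / eps)          # lower error -> higher vote weight
            D *= np.exp(-alpha * y * P[t])                   # increase weights of misclassified examples
            D /= D.sum()
            ensemble.append((alpha, stumps[t]))
        return ensemble

    def predict(ensemble, x):
        # final hypothesis: weighted majority vote over all rounds
        return int(np.sign(sum(alpha * h(x) for alpha, h in ensemble)))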