
Concept Indexing for Automated Text Categorization
Enrique Puertas Sanz
Universidad Europea de Madrid

OUTLINE
Motivation
Concept indexing with WordNet synsets
Concept indexing in ATC
Experiments set-up
Summary of results & discussion
Updated results
Conclusions & current work

MOTIVATION
The most popular & effective model for thematic ATC:
IR-like text representation
ML-based feature selection and learning of classifiers
[Pipeline diagram: pre-classified documents → representation & learning → classifier(s); new documents → representation → instances → classification → categorized documents]

MOTIVATION
Bag-of-words representations: binary, TF, TF*IDF
Stoplist
Stemming
Feature selection
(see the sketch below)
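For concreteness, here is a minimal sketch of the three bag-of-words weightings listed above. scikit-learn is an assumption (the slides do not name a toolkit), and the toy corpus is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the car was parked outside", "a wagon and an automobile"]  # toy corpus

# Binary: 1 if the term occurs in the document, 0 otherwise
binary_vec = CountVectorizer(binary=True, stop_words="english")
X_binary = binary_vec.fit_transform(docs)

# TF: raw term frequencies
tf_vec = CountVectorizer(stop_words="english")
X_tf = tf_vec.fit_transform(docs)

# TF*IDF: term frequency weighted by inverse document frequency
tfidf_vec = TfidfVectorizer(stop_words="english")
X_tfidf = tfidf_vec.fit_transform(docs)

print(binary_vec.get_feature_names_out())  # vocabulary remaining after stop-listing
```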

MOTIVATION
Text representation requirements in thematic ATC:
Semantic characterization of text content: words convey an important part of the meaning, but we must deal with polysemy and synonymy
Must allow effective learning: thousands to tens of thousands of attributes → noise (hurting effectiveness) & lack of efficiency

CONCEPT INDEXING WITH WORDNET SYNSETS
Using vectors of synsets instead of word stems
Ambiguous words mapped to their correct senses
Synonyms mapped to the same synset
Example: "automobile" and "car" map to the noun synset {automobile, car, wagon}; "wagon" maps either to {automobile, car, wagon} or to {train wagon, wagon}, depending on its sense
(see the sketch below)
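A small sketch, using NLTK's WordNet interface (NLTK is an assumption; the slides only mention WordNet itself), of how synonymy and polysemy look at the synset level:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# "car" and "automobile" share a synset, so both index to the same concept
car = set(wn.synsets("car", pos=wn.NOUN))
automobile = set(wn.synsets("automobile", pos=wn.NOUN))
print([s.name() for s in car & automobile])  # e.g. ['car.n.01']

# "wagon" is ambiguous: each of its senses is a different synset
for s in wn.synsets("wagon", pos=wn.NOUN):
    print(s.name(), "-", s.definition())
```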

CONCEPT INDEXING WITH WORDNET SYNSETS
Considerable controversy in IR:
Assumed potential for improving text representation
Mixed experimental results, ranging from very good [Gonzalo et al. 98] to bad [Voorhees 98]; recent review in [Stokoe et al. 03]
Often attributed to the limited effectiveness of state-of-the-art WSD
But ATC is different!

CONCEPT INDEXING IN ATC
Beyond the potential above:
We have much more information about ATC categories than about IR queries
The limited effectiveness of WSD may hurt less because of term (feature) selection
But we face new problems: data sparseness & noise
Most terms are rare (Zipf's Law) → bad estimates
Categories with few documents → bad estimates, lack of information

CONCEPT INDEXING IN ATC
Concept indexing helps to address both the IR and the new ATC problems:
Text ambiguity in IR & ATC
Data sparseness & noise in ATC
Fewer indexing units of higher quality (after selection) → probably better estimates
Categories with few documents → why not enrich the representation with WordNet semantic relations (hypernymy, meronymy, etc.)? (see the sketch below)
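As a hypothetical illustration of the enrichment idea above, the sketch below expands a synset with its hypernym ancestors via NLTK; the function name and the depth limit are assumptions, not part of the original experiments:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def expand_with_hypernyms(synset, depth=2):
    """Return the synset plus its hypernym ancestors up to `depth` levels."""
    expanded = {synset}
    frontier = [synset]
    for _ in range(depth):
        frontier = [h for s in frontier for h in s.hypernyms()]
        expanded.update(frontier)
    return expanded

# A document indexed by car.n.01 would also activate motor_vehicle.n.01, etc.,
# giving categories with few documents extra, more general evidence
for s in expand_with_hypernyms(wn.synset("car.n.01")):
    print(s.name())
```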

CONCEPT INDEXING IN ATC
Literature review:
As in IR, mixed results, ranging from good [Fukumoto & Suzuki, 01] to bad [Scott, 98]
Notably, researchers use the words contained in synsets instead of the synset codes themselves
Still lacking: an evaluation of concept indexing in ATC over a representative range of selection strategies and learning algorithms

EXPERIMENTS SETUP
Primary goal: comparing terms vs. correct synsets as indexing units
Requires a perfectly disambiguated collection (SemCor)
Secondary goals:
Comparing perfect WSD with simpler methods (more scalable, less accurate)
Comparing terms with/without stemming and stop-listing
Assessing the nature of SemCor (genre + topic classification)

EXPERIMENTS SETUP
Overview of parameters
Binary classifiers vs. multi-class classifiers
Three concept indexing representations:
Correct WSD (CD)
WSD by POS tagger (CF)
WSD by corpus frequency (CA)
(see the sketch below)
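The slides identify the two approximate strategies only by their codes. One plausible reading, sketched below with NLTK (an assumption), takes the corpus-most-frequent sense for CA and restricts the candidate senses with a POS tagger for CF:

```python
import nltk  # requires the punkt, averaged_perceptron_tagger and wordnet data
from nltk.corpus import wordnet as wn

def first_sense(word, pos=None):
    """CA-style WSD: WordNet synsets are ordered by corpus frequency."""
    synsets = wn.synsets(word, pos=pos)
    return synsets[0] if synsets else None

def pos_guided_senses(sentence):
    """CF-style WSD: tag each token, then take the first sense for that POS."""
    pos_map = {"N": wn.NOUN, "V": wn.VERB, "J": wn.ADJ, "R": wn.ADV}
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [(w, first_sense(w, pos_map.get(tag[0]))) for w, tag in tagged]

print(pos_guided_senses("The car was parked near the station"))
```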

EXPERIMENTS SETUP
Overview of parameters
Four term indexing representations:
No stemming, no stoplist (BNN)
No stemming, with stoplist (BNS)
With stemming, without stoplist (BSN)
With stemming and stoplist (BSS)
(see the sketch below)
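A minimal sketch of producing the four variants, assuming NLTK's Porter stemmer and English stopword list (the slides do not say which stemmer or stoplist were actually used):

```python
from nltk.corpus import stopwords        # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # requires: nltk.download('punkt')

STOPLIST = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(text, stem=False, stoplist=False):
    """Produce BNN/BNS/BSN/BSS-style token streams depending on the flags."""
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    if stoplist:
        tokens = [t for t in tokens if t not in STOPLIST]
    if stem:
        tokens = [STEMMER.stem(t) for t in tokens]
    return tokens

text = "The cars were parked outside the station"
print(preprocess(text))                            # BNN
print(preprocess(text, stem=True, stoplist=True))  # BSS
```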

EXPERIMENTS SETUP
Levels of feature selection with Information Gain (IG):
No selection (NOS)
Top 1% (S01)
Top 10% (S10)
IG > 0 (S00)
(see the sketch below)
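A sketch of the four selection levels, assuming scikit-learn; information gain is computed here as mutual information, which coincides with IG for discrete features, and the random data merely stands in for a document-term matrix:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def select_by_ig(X, y, level="S01"):
    """Keep features according to the IG levels on the slide (NOS, S01, S10, S00)."""
    ig = mutual_info_classif(X, y, discrete_features=True)
    order = np.argsort(ig)[::-1]                   # highest IG first
    if level == "NOS":
        keep = order                               # no selection
    elif level == "S01":
        keep = order[: max(1, len(order) // 100)]  # top 1%
    elif level == "S10":
        keep = order[: max(1, len(order) // 10)]   # top 10%
    else:                                          # "S00": IG > 0
        keep = order[ig[order] > 0]
    return X[:, keep], keep

# Tiny demo on random binary term-occurrence data
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(50, 200))
y_demo = rng.integers(0, 2, size=50)
X_sel, kept = select_by_ig(X_demo, y_demo, level="S01")
print(X_sel.shape)  # (50, 2): top 1% of 200 terms
```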

EXPERIMENTS SETUP
Learning algorithms:
Naïve Bayes
kNN
C4.5
PART
SVMs
AdaBoost + Naïve Bayes
AdaBoost + C4.5
(see the sketch below)
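A rough scikit-learn sketch of this line-up (an assumption: the original experiments likely used a different toolkit, and C4.5/PART have no direct scikit-learn equivalents, so a CART decision tree stands in for both):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier

docs = ["stocks fell sharply", "the team won the match",
        "markets rallied today", "the striker scored twice"]
labels = ["finance", "sports", "finance", "sports"]
X = CountVectorizer().fit_transform(docs)

classifiers = {
    "Naive Bayes": MultinomialNB(),
    "kNN": KNeighborsClassifier(n_neighbors=1),
    "decision tree (~C4.5)": DecisionTreeClassifier(),
    "SVM": LinearSVC(),
    "AdaBoost (boosted stumps)": AdaBoostClassifier(),
}
for name, clf in classifiers.items():
    clf.fit(X, labels)
    print(name, clf.predict(X))
```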

EXPERIMENTS SETUP
Evaluation metrics:
F1 (harmonic mean of precision and recall)
Macro-average and micro-average
K-fold cross-validation (k = 10 in our experiments)
(see the sketch below)
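A minimal sketch of the evaluation wiring, assuming scikit-learn; the random matrix merely stands in for one of the document representations above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Toy data standing in for a document-term matrix and category labels
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 50))   # 100 "documents", 50 "terms"
y = rng.integers(0, 2, size=100)         # binary category

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
macro_f1 = cross_val_score(MultinomialNB(), X, y, cv=cv, scoring="f1_macro")
micro_f1 = cross_val_score(MultinomialNB(), X, y, cv=cv, scoring="f1_micro")
print("macro-F1:", macro_f1.mean(), "micro-F1:", micro_f1.mean())
```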

SUMMARY OF RESULTS & DISCUSSION
Overview of results
[Result tables for binary and multi-class classification are not included in this transcript]

SUMMARY OF RESULTS & DISCUSSION
CD > C* weakly supports that accurate WSD is required
BNN > B* does not support that stemming & stop-listing are NOT required (genre/topic orientation of SemCor)
Most importantly, CD > B* does not support that synsets are better indexing units than words (stemmed & stop-listed or not)

UPDATED RESULTS
Recent results combining synsets & words (no stemming, no stop-listing, binary problem):
NB → S00; C4.5 → S00, S01, S10; SVM → S01; ABNB → S00, S00, S10
(see the sketch below)
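One simple way to combine word and synset features, sketched with scikit-learn and SciPy (both assumptions): each view is vectorized separately and the two matrices are concatenated column-wise. The synset strings are invented for illustration; in practice they would come from one of the WSD strategies above.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

docs = ["The car was parked near the station",
        "Stocks fell sharply on Monday"]

# Word features (no stemming, no stop-listing, as on the slide)
word_vec = CountVectorizer()
X_words = word_vec.fit_transform(docs)

# Synset features: assumed to be the output of a prior WSD step
synset_docs = ["car.n.01 park.v.01 station.n.01",
               "stock.n.01 fall.v.01 monday.n.01"]
synset_vec = CountVectorizer(token_pattern=r"\S+")
X_synsets = synset_vec.fit_transform(synset_docs)

# Combined representation: words and synsets side by side
X_combined = hstack([X_words, X_synsets])
print(X_combined.shape)  # (2, n_word_terms + n_synset_terms)
```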

CONCLUSIONS & CURRENT WORK
Synsets are NOT a better representation on their own, but they IMPROVE the bag-of-words representation
We are testing semantic relations (hypernymy) on SemCor
More work on Reuters is required
We will have to address WSD, initially with the approaches described in this work

THANK YOU !