A study on automatically extracted keywords in text categorization. Authors: Anette Hulth and Beáta B. Megyesi. From: ACL 2006. Reporter: 陳永祥. Date: 2007/10/16.

Presentation transcript:

1 A study on automatically extracted keywords in text categorization. Anette Hulth and Beáta B. Megyesi, ACL 2006.

2 Introduction
- Whether and how automatically extracted keywords can be used to improve text categorization.
- The full-text representation is combined with the automatically extracted keywords.
- Performance is measured by micro-averaged F-measure on a standard text categorization collection.
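The micro-averaged F-measure pools decisions across all categories before computing precision and recall, so frequent categories dominate the score. A minimal sketch (the category names and counts below are illustrative, not the paper's results):

```python
def micro_f1(counts):
    """Micro-averaged F1: sum true/false positives and false negatives
    over all categories, then compute precision, recall and F1 once."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative per-category confusion counts.
counts = {
    "earn":  {"tp": 90, "fp": 10, "fn": 5},
    "trade": {"tp": 30, "fp": 20, "fn": 15},
}
print(round(micro_f1(counts), 3))  # 0.828
```

Macro-averaging would instead compute F1 per category and average the results, weighting rare categories equally.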

3 Definition
- Automatic text categorization: the task of assigning any of a set of predefined categories to a document. The documents must be represented in a form that is understandable to the learning algorithm.
- A keyword may consist of one or several tokens; it may well be a whole expression or phrase, such as snakes and ladders.

4 Major decisions for text categorization
To perform a text categorization task, there are two major decisions to make:
- how to represent the text: what features to select as input, and which type of value to assign to these features
- what learning algorithm to use to create the prediction model

5 Related work
A number of experiments have evaluated richer representations:
- comparing unigrams and bigrams
- adding complex nominals to a bag-of-words representation
- using automatically extracted sentences as the input to the representation

6 Focus of this research
Keywords are automatically extracted and used as input to the learning, both on their own and in combination with a full-text representation.

7 Selecting the keywords
- The method developed by Hulth (2003; 2004) was chosen because it is tuned for short texts (more specifically, for scientific journal abstracts).
- It uses supervised machine learning: prediction models were trained on manually annotated data.

8 Selecting the keywords
Candidate terms are selected from the document in three different manners:
- statistically oriented: extracts all uni-, bi-, and trigrams from a document
- linguistically oriented, utilizing the words' parts of speech (PoS): extracts all noun phrase (NP) chunks, and all terms matching any of a set of empirically defined PoS patterns
All candidate terms are stemmed.
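The statistically oriented extractor can be sketched as follows; this is a simplified illustration only (the real system also applies stemming and the linguistic filters described above):

```python
def ngram_candidates(tokens, max_n=3):
    """Extract all uni-, bi-, and trigrams as candidate terms."""
    candidates = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.append(" ".join(tokens[i:i + n]))
    return candidates

tokens = "snakes and ladders".split()
print(ngram_candidates(tokens))
# ['snakes', 'and', 'ladders', 'snakes and', 'and ladders', 'snakes and ladders']
```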

9 Selecting the keywords
Four features are calculated for each candidate term:
- term frequency (TF)
- inverse document frequency (IDF)
- relative position of the first occurrence
- the PoS tag or tags assigned to the candidate term
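Three of the four features can be sketched directly; the PoS feature would require a tagger and is omitted. The function name, the example document and the log(N/df) form of IDF are illustrative assumptions, not the authors' exact formulation:

```python
import math

def term_features(term, doc_tokens, doc_freq, n_docs):
    """Compute TF, IDF and relative first-occurrence position
    for a single-token candidate term."""
    tf = doc_tokens.count(term)
    idf = math.log(n_docs / doc_freq)  # doc_freq: number of docs containing the term
    first_pos = doc_tokens.index(term) / len(doc_tokens)  # 0.0 = document start
    return {"tf": tf, "idf": idf, "first_pos": first_pos}

doc = "trade talks resume as trade ministers meet".split()
print(term_features("trade", doc, doc_freq=50, n_docs=1000))
```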

10 Selecting the keywords
- To make the final selection of keywords, the three prediction models are combined.
- Terms that are subsumed by another keyword selected for the document are removed.
- For each selected stem, the most frequently occurring unstemmed form in the document is presented as a keyword.
- Each document is assigned at most twelve keywords, provided that the added regression value from the prediction models exceeds an empirically set threshold.
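The subsumption step can be sketched as a substring filter; this is an illustrative simplification of the removal rule, not the authors' implementation:

```python
def remove_subsumed(keywords):
    """Drop any keyword that occurs inside another keyword
    already selected for the same document."""
    kept = []
    for kw in keywords:
        if not any(kw in other and kw != other for other in keywords):
            kept.append(kw)
    return kept

print(remove_subsumed(["paradise fruit", "fruit", "trade talks"]))
# ['paradise fruit', 'trade talks']
```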

11 Text categorization experiments
Corpus: the Reuters corpus, which contains about 20,000 newswire articles in English with multiple categories: 9,603 documents for training and 3,299 documents in the fixed test set, using the 90 categories that are present in both the training and test sets.

12 Design of experiment
The texts contained in the TITLE and BODY tags were extracted and given as input to the keyword extraction algorithm.

13 Design of experiment: learning method
Only one learning algorithm was used, namely an implementation of linear support vector machines (Joachims, 1999).
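At prediction time, a linear SVM assigns a category to a document when the dot product of the learned weight vector with the document's feature vector, plus a bias, is positive. A minimal sketch of that decision rule (the weights and terms below are illustrative, not trained with SVM-light):

```python
def svm_decide(doc_vec, weights, bias):
    """Linear SVM decision rule: assign the category iff w . x + b > 0."""
    score = sum(weights.get(term, 0.0) * value for term, value in doc_vec.items())
    return score + bias > 0

weights = {"grain": 1.2, "export": 0.8, "football": -1.5}  # illustrative weights
doc = {"grain": 0.5, "export": 0.3}                        # tf*idf feature values
print(svm_decide(doc, weights, bias=-0.4))
# True: 1.2*0.5 + 0.8*0.3 - 0.4 = 0.44 > 0
```

With one such binary classifier per category, a document can be assigned multiple categories, as the Reuters data requires.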

14 Design of experiment: representations
An important step in feature selection is dimensionality reduction, i.e. reducing the number of features by:
- removing words that are rare or very common
- merging stemmed terms into a common form
Three ways of assigning feature values were compared:
- boolean representation
- term frequency
- tf*idf
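The three feature-value schemes can be sketched in one helper; the function name and the log(N/df) form of IDF are illustrative assumptions:

```python
import math

def feature_value(term, doc_tokens, scheme, doc_freq=1, n_docs=1):
    """Return the feature value for a term under one of the three
    schemes compared in the experiments."""
    tf = doc_tokens.count(term)
    if scheme == "boolean":
        return 1 if tf > 0 else 0   # presence/absence only
    if scheme == "tf":
        return tf                   # raw term frequency
    if scheme == "tfidf":
        return tf * math.log(n_docs / doc_freq)
    raise ValueError(scheme)

doc = "grain grain exports".split()
print(feature_value("grain", doc, "boolean"))  # 1
print(feature_value("grain", doc, "tf"))       # 2
print(feature_value("grain", doc, "tfidf", doc_freq=10, n_docs=100))
```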

15 Design of experiment (Sets 1, 2): keywords-only experiments
- Set 1: keywords that contained several tokens were kept intact, and no stemming was performed; for example, paradise fruit was represented as paradise fruit.
- Set 2: the keywords were split into unigrams and also stemmed.

16 Design of experiment (Set 3)
Only the content in the TITLE tags was extracted. The tokens in the headlines were stemmed and represented as unigrams.

17 Design of experiment (Set 4)
- All stemmed unigrams occurring three or more times in the training data were selected, with tf*idf feature values.
- Feature values in the full-text representation were given higher weights if the feature was identical to a keyword token, by adding the term frequency of a full-text unigram to the term frequency of an identical keyword unigram.
- This does not mean that the term frequency value was necessarily doubled, as a keyword often contains more than one token.
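This boosting scheme can be sketched as follows; a simplified illustration of adding keyword term frequencies to the matching full-text unigram frequencies (the example counts are made up):

```python
def boost_with_keywords(fulltext_tf, keyword_tf):
    """Add the term frequency of each keyword to the term frequency of
    every identical full-text unigram. A multi-token keyword contributes
    to each of its tokens, which is why a full-text value is not simply
    doubled."""
    boosted = dict(fulltext_tf)
    for kw, tf in keyword_tf.items():
        for token in kw.split():
            if token in boosted:
                boosted[token] += tf
    return boosted

fulltext = {"paradise": 3, "fruit": 5, "price": 2}
keywords = {"paradise fruit": 2}  # the keyword was extracted with frequency 2
print(boost_with_keywords(fulltext, keywords))
# {'paradise': 5, 'fruit': 7, 'price': 2}
```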

18 Results: Sets 1 and 2

19 Results: Sets 3 and 4

20 Conclusion
Higher performance is achieved when the full-text representation is combined with the automatically extracted keywords.