1 A study on automatically extracted keywords in text categorization
Authors: Anette Hulth and Beáta B. Megyesi
From: ACL 2006
Reporter: 陳永祥
Date: 2007/10/16
2 Introduction
Investigates if and how automatically extracted keywords can be used to improve text categorization.
The full-text representation is combined with the automatically extracted keywords.
Performance is measured by micro-averaged F-measure on a standard text categorization collection.
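The micro-averaged F-measure mentioned above pools the true positives, false positives, and false negatives across all categories before computing precision and recall, so frequent categories dominate the score. A minimal sketch with made-up per-category counts (not figures from the paper):

```python
# Micro-averaged F-measure: pool tp/fp/fn across all categories first,
# then compute precision, recall, and their harmonic mean once.
# The per-category (tp, fp, fn) counts below are illustrative only.

def micro_f1(counts):
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

counts = [(50, 10, 5), (20, 5, 10), (5, 2, 3)]  # three toy categories
print(round(micro_f1(counts), 3))
```

Equivalently, micro-F1 reduces to 2·tp / (2·tp + fp + fn) over the pooled counts, which is why it differs from macro-averaging (an unweighted mean of per-category F-scores).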
3 Definitions
Automatic text categorization: the task of assigning any of a set of predefined categories to a document. The documents must be represented in a form that is understandable to the learning algorithm.
A keyword may consist of one or several tokens; it may well be a whole expression or phrase, such as snakes and ladders.
4 Major decisions for text categorization
To perform a text categorization task, there are two major decisions to make:
- how to represent the text, that is, what features to select as input and which type of value to assign to these features
- what learning algorithm to use to create the prediction model
5 Related work
A number of experiments have been performed in which richer representations have been evaluated:
- comparing unigrams and bigrams
- adding complex nominals to a bag-of-words representation
- using automatically extracted sentences as the input to the representation
6 Focus of this research
Keywords are automatically extracted and used as input to the learning, both on their own and in combination with a full-text representation.
7 Selecting the Keywords
The method developed by Hulth (2003; 2004) was chosen because it is tuned for short texts (more specifically, for scientific journal abstracts).
Supervised machine learning: prediction models were trained on manually annotated data.
8 Selecting the Keywords
Candidate terms are selected from the document in three different manners:
- statistically oriented: extracts all uni-, bi-, and trigrams from a document
- linguistically oriented, utilizing the words' parts of speech (PoS): extracts all noun phrase (NP) chunks, and terms matching any of a set of empirically defined PoS patterns
All candidate terms are stemmed.
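The statistically oriented candidate selection above can be sketched as follows. The suffix-stripping stemmer here is a crude stand-in for illustration; the actual system uses a real stemmer:

```python
# Sketch of the statistically oriented selection: extract all uni-, bi-,
# and trigrams from a tokenized document, stemming every token.
# The toy stem() below is illustrative, not the stemmer used in the paper.

def stem(token):
    # crude suffix stripping, for demonstration only
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def candidate_ngrams(tokens, max_n=3):
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            candidates.add(" ".join(stem(t) for t in tokens[i : i + n]))
    return candidates

tokens = "snakes and ladders".split()
print(sorted(candidate_ngrams(tokens)))
```

On the three-token example phrase this yields three unigrams, two bigrams, and one trigram, all stemmed.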
9 Selecting the Keywords
Four features are calculated for each candidate term:
- term frequency (TF)
- inverse document frequency (IDF)
- relative position of the first occurrence
- the PoS tag or tags assigned to the candidate term
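The four features above can be computed per candidate term roughly as follows. The PoS tags come from a fixed dictionary here purely for illustration; the real system runs a PoS tagger, and the exact IDF formula is an assumption:

```python
# Sketch of the four features per candidate term: TF, IDF, relative
# position of the first occurrence, and the PoS tag(s).
# The POS dictionary and the log-based IDF are illustrative assumptions.
import math

POS = {"keyword": "NN", "extraction": "NN", "automatic": "JJ"}  # mock tagger

def features(term, doc_tokens, num_docs, doc_freq):
    tf = doc_tokens.count(term)
    idf = math.log(num_docs / doc_freq)
    first_pos = doc_tokens.index(term) / len(doc_tokens)  # relative position
    return {"tf": tf, "idf": idf, "first_pos": first_pos,
            "pos": POS.get(term, "UNK")}

doc = ["automatic", "keyword", "extraction", "helps", "keyword", "search"]
print(features("keyword", doc, num_docs=1000, doc_freq=50))
```

The relative first-occurrence position is useful for abstracts because key terms tend to appear early in the text.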
10 Selecting the Keywords
To make the final selection of keywords, the three prediction models are combined.
Terms that are subsumed by another keyword selected for the document are removed.
For each selected stem, the most frequently occurring unstemmed form in the document is presented as a keyword.
Each document is assigned at most twelve keywords, provided that the added regression value meets a threshold.
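The subsumption step above can be sketched as a simple filter: a selected term is dropped when another selected keyword for the same document contains it. Treating subsumption as substring containment is an assumption for illustration:

```python
# Sketch of the subsumption filter: remove a term when another selected
# keyword for the document subsumes it (modeled here as substring
# containment, which is an illustrative assumption).

def remove_subsumed(keywords):
    kept = []
    for kw in keywords:
        if not any(kw != other and kw in other for other in keywords):
            kept.append(kw)
    return kept

print(remove_subsumed(["support vector machines", "vector", "text categorization"]))
# → ['support vector machines', 'text categorization']
```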
11 Text Categorization Experiments
Corpus: the Reuters-21578 corpus, which contains roughly 20,000 newswire articles in English with multiple categories.
9,603 documents for training and 3,299 documents in the fixed test set, with the 90 categories that are present in both the training and test sets.
12 Design of experiment
The texts contained in the TITLE and BODY tags were extracted and then given as input to the keyword extraction algorithm.
13 Design of experiment: 1. Learning method
Only one learning algorithm was used, namely an implementation of Linear Support Vector Machines (Joachims, 1999).
14 Design of experiment: 2. Representations
An important step for the feature selection is dimensionality reduction, i.e., reducing the number of features by:
- removing words that are rare or very common
- merging stemmed terms into a common form
Feature values assigned:
- boolean representation
- term frequency
- tf*idf
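The three feature-value schemes compared (boolean, term frequency, tf*idf) can be sketched over a toy corpus; the vocabulary, documents, and the particular log-based idf formula are illustrative assumptions:

```python
# Sketch of the three feature-value schemes: boolean presence, raw term
# frequency, and tf*idf. Corpus and idf formula are toy assumptions.
import math

docs = [["trade", "deficit", "trade"], ["oil", "price"], ["trade", "oil"]]
vocab = sorted({t for d in docs for t in d})  # the reduced feature set

def vectorize(doc, scheme):
    n = len(docs)
    vec = []
    for term in vocab:
        tf = doc.count(term)
        if scheme == "boolean":
            vec.append(1 if tf else 0)
        elif scheme == "tf":
            vec.append(tf)
        else:  # tf*idf
            df = sum(1 for d in docs if term in d)
            vec.append(tf * math.log(n / df))
    return vec

print(vectorize(docs[0], "boolean"))
print(vectorize(docs[0], "tf"))
```

Each document becomes a fixed-length vector over the shared vocabulary, which is the form a linear SVM expects as input.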
15 Design of experiment (Sets 1, 2): keywords-only experiments
Set 1: Keywords that contained several tokens were kept intact; no stemming was performed in this set of experiments. For example, paradise fruit was represented as paradise fruit.
Set 2: The keywords were split up into unigrams, and also stemmed.
16 Design of experiment (Set 3)
Only the content in the TITLE tags was extracted. The tokens in the headlines were stemmed and represented as unigrams.
17 Design of experiment (Set 4)
All stemmed unigrams occurring three or more times in the training data were selected, with the feature value tf*idf.
The feature values in the full-text representation were given higher weights if the feature was identical to a keyword token, by adding the term frequency of a full-text unigram to the term frequency of an identical keyword unigram.
This does not mean that the term frequency value was necessarily doubled, as a keyword often contains more than one token.
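The boosting step for Set 4 can be sketched as follows: the term frequency of each keyword unigram is added to the term frequency of the identical full-text unigram. The helper name and toy data are illustrative, not from the paper:

```python
# Sketch of the Set 4 combination: boost a full-text unigram's term
# frequency by adding the frequency of the identical keyword unigram.
# Only features already present in the full text are boosted, and because
# multi-token keywords contribute one count per unigram, a boosted value
# is not necessarily doubled. Names and data are illustrative.
from collections import Counter

def combined_tf(fulltext_tokens, keyword_unigrams):
    tf = Counter(fulltext_tokens)
    for kw, freq in keyword_unigrams.items():
        if kw in tf:  # boost only features identical to a keyword token
            tf[kw] += freq
    return dict(tf)

fulltext = ["oil", "price", "rise", "oil"]
# unigrams split out of keywords such as "oil price" and "price rise"
keywords = Counter(["oil", "price", "rise", "oil", "price"])
print(combined_tf(fulltext, keywords))
```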
18 Results: Sets 1, 2
19 Results: Sets 3, 4
20 Conclusion
Higher performance is achieved when the full-text representation is combined with the automatically extracted keywords.