Automatic Keyphrase Extraction (Jim Nuyens)
Keywords are an everyday part of looking up topics and specific content. What are some of the ways of obtaining keywords and keyphrases by machine learning? This presentation reviews some of the work of Peter Turney from NRC. The papers are dated 1997 and 1999, so more recent developments in data mining may suggest improvements.
Automated Keyphrase Extraction
Introduction and definitions
Applications and algorithms
Learning algorithms
Empirical results
Future work
Definitions
Information extraction: text analysis that provides user-anticipated information (e.g., the names of companies in news service reports).
Index generation: an index may be created as a “back-of-the-book” listing for human use or as an exhaustive computer listing used by a search engine.
Important phrase extraction: may be used especially with scientific journals.
Keyphrase: a phrase of one to three words that captures the main topic.
Keyphrase list: usually 5 to 15 keyphrases.
Keyphrase generation: obtaining keyphrases, some of which do not appear in the body of the text document.
Keyphrase extraction: obtaining keyphrases that appear in the body of the text document.
Note: On average, about 75% of the keyphrases appear in the text.
Applications and Algorithms
Keyphrases may serve as a mini-summary
Partial indexing
Automated keyphrases can help an author with keywords or phrases he or she may have missed
Labels for text documents
Highlights for a document
Algorithms of concern: stemming of words
Word         Porter Stem   Lovins Stem
believes     believ        belief
belief       belief        belief
believable   believ        belief
Turney finds the more aggressive Lovins stemming algorithm to be more useful for keyword extraction.
Applications and Algorithms (continued)
“stone church” is not equal to “church stone”
“neural networks” = “neural network”
Sometimes the stemming algorithm does not get it right:
Word        Porter Stem   Lovins Stem
realistic   realist       real
reality     realiti       re
Both the Porter and Lovins stemming algorithms see the two words as distinct.
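The two phrase comparisons above can be sketched as an order-sensitive match on stemmed words. This is only an illustration: `toy_stem` is a hypothetical stand-in that strips a plural "s", not the Porter or Lovins algorithm, and the function names are assumptions made for this sketch.

```python
from typing import Tuple


def toy_stem(word: str) -> str:
    """Hypothetical stand-in for a real stemmer: lowercase, strip a plural 's'."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def phrase_key(phrase: str) -> Tuple[str, ...]:
    """Stem each word but keep the word order, so order still matters."""
    return tuple(toy_stem(w) for w in phrase.split())


def same_keyphrase(a: str, b: str) -> bool:
    """Two phrases match when their stemmed word sequences are identical."""
    return phrase_key(a) == phrase_key(b)


print(same_keyphrase("neural networks", "neural network"))
print(same_keyphrase("stone church", "church stone"))
```

Keeping the stems in a tuple rather than a set is what keeps “stone church” and “church stone” apart.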
Measuring performance of the algorithms
Confusion matrix for keyphrase extraction:
                                        Human classified    Human classified
                                        as a keyphrase      as NOT a keyphrase
Machine classified as a keyphrase       a                   b
Machine classified as NOT a keyphrase   c                   d

Example (for a total of 2500 stemmed words):
                                        Human classified    Human classified
                                        as a keyphrase      as NOT a keyphrase
Machine classified as a keyphrase       4                   3
Machine classified as NOT a keyphrase   2                   2500-9 = 2491
Measuring performance (continued)
Accuracy = (a+d) / (a+b+c+d) = 2495/2500
Precision = a / (a+b) = 4/7
Recall = a / (a+c) = 4/6
The F-measure is used as a balanced measure.
F-measure = (2a) / (2a+b+c) = 8/13
A journal article will typically contain 10,000 words, and these narrow down to approximately 2500 stemmed word equivalents. Out of the average of 7.5 keyphrases per article, only about 6 are available in the text for extraction. With so few positive cases among so many candidates, accuracy stays close to 1 even when precision and recall are modest, and this class imbalance presents machine learning difficulties.
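As a quick check of the arithmetic, the four measures can be evaluated on the example confusion matrix (a=4, b=3, c=2, d=2500-9). The function name is an assumption for this sketch; exact fractions are used so that no rounding hides a slip.

```python
from fractions import Fraction


def keyphrase_metrics(a: int, b: int, c: int, d: int) -> dict:
    """a, b, c, d are the confusion-matrix cells as defined on the slide."""
    return {
        "accuracy": Fraction(a + d, a + b + c + d),
        "precision": Fraction(a, a + b),
        "recall": Fraction(a, a + c),
        "f_measure": Fraction(2 * a, 2 * a + b + c),
    }


m = keyphrase_metrics(a=4, b=3, c=2, d=2500 - 9)
for name, value in m.items():
    print(f"{name}: {value} = {float(value):.4f}")

# The F-measure is the harmonic mean of precision and recall.
p, r = m["precision"], m["recall"]
harmonic_mean = 2 * p * r / (p + r)
print(harmonic_mean == m["f_measure"])
```

The last line confirms that the (2a) / (2a+b+c) form agrees with the harmonic mean of precision and recall.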
Empirical Results (Turney, 1997)
Method of Extraction   F-measure (text #1)   F-measure (text #2)   F-measure (text #3)
Microsoft Word
Brill’s Tagger
Verity’s Search
NRC’s Extractor
The above results are from journal articles. Text #2 was a very difficult scientific article. The author also obtained good results (F-measures) for extraction of e-mail and also Web-page keyphrases.
Machine Learning Results (Turney, 1999)
The next paper deals with the different possible approaches to automatic keyphrase extraction.
Part I: The use of the C4.5 software to find the keyphrases. (Features are provided for each phrase so that positive and negative cases can be told apart.)
Part II: The use of the GenEx algorithm, which combines the Genitor genetic algorithm (Whitley, 1989) with the Extractor algorithm (NRC).
Part I
The author went through 110 features before settling on:
1) stemmed_phrase
2) whole_phrase
3) num_words_phrase
4) first_occur_phrase
5) first_occur_word
6) freq_phrase
7) freq_word
8) relative_length
9) proper_noun
10) final_adjective
11) common_verb
12) class
Class 1 is an extracted keyphrase and Class 0 is NOT a keyphrase.
NRC’s Extractor
Ten steps:
1) Find single stems (stemming algorithm)
2) Score single stems
3) Select top single stems
4) Find stem phrases (phrases up to length 3)
5) Score stem phrases
6) Expand single stems
7) Drop duplicates
8) Add suffixes
9) Add capitals
10) Final output
Summary: Extractor is the NRC software that takes text as the input and produces keyphrases as the output.
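Steps 1 to 3 might look roughly like the sketch below. Everything specific here is an assumption rather than Extractor's actual behavior: stemming is done by truncating each word to `stem_length` characters (suggested only by the Stem_length parameter), the score is the raw frequency, and `STOP_WORDS` is a hypothetical list.

```python
import re
from collections import Counter

# Hypothetical stop list; the list Extractor uses is not given in the slides.
STOP_WORDS = {"the", "of", "and", "a", "an", "in", "to", "is", "for", "on"}


def find_single_stems(text, stem_length=5):
    """Step 1 (assumed): drop stop words and very short words, then truncate."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:stem_length] for w in words
            if w not in STOP_WORDS and len(w) >= 3]


def score_single_stems(stems):
    """Step 2 (assumed): the score of a stem is simply its frequency."""
    return Counter(stems)


def select_top_stems(scores, num_working=3):
    """Step 3: keep the best num_working stems; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [stem for stem, _ in ranked[:num_working]]


text = ("Neural networks learn. A neural network is trained on data. "
        "Networks generalize.")
stems = find_single_stems(text)
top = select_top_stems(score_single_stems(stems), num_working=2)
print(top)
```

Truncation is crude, but it merges “network” and “networks” into one stem without any suffix rules, which is all this illustration needs.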
NRC’s Extractor (continued)
The tests (within the algorithm):
1) The phrase should not have the capitalization of a proper noun, unless the flag suppress_proper is set to zero.
2) The phrase should not have an ending that indicates a possible adjective.
3) The phrase should be longer than min_length_low_rank.
4) If the phrase is shorter than min_length_low_rank, it may still be acceptable if its rank among the candidate phrases is better than min_rank_low_length.
5) If the phrase fails both tests 3) and 4), it may still be acceptable if its capitalization indicates that it is probably an abbreviation.
6) The phrase should not contain any words commonly used as verbs.
7) The phrase should not match any phrase in a list of stop phrases.
Lastly, a phrase must pass tests 1), 2), 6) and 7) and at least one of 3), 4) and 5).
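The final rule combines the seven test outcomes with simple boolean logic. The sketch below assumes the outcomes have already been computed and are passed in as a dictionary keyed by test number; the function name is hypothetical.

```python
def passes_final_tests(outcome: dict) -> bool:
    """outcome maps test number (1..7) to True when the phrase passes that test."""
    required = all(outcome[k] for k in (1, 2, 6, 7))   # every one must pass
    length_ok = any(outcome[k] for k in (3, 4, 5))      # at least one must pass
    return required and length_ok


# A short phrase that fails the length test 3) but is rescued by test 4).
rescued = {1: True, 2: True, 3: False, 4: True, 5: False, 6: True, 7: True}
# A phrase that passes the length tests but contains a common verb (test 6).
has_verb = {1: True, 2: True, 3: True, 4: False, 5: False, 6: False, 7: True}

print(passes_final_tests(rescued))
print(passes_final_tests(has_verb))
```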
NRC’s Extractor (continued)
Twelve parameters (used with Extractor and Genitor):
Parameter Name         Range         Number of bits
Num_phrases            [5,15]        0
Num_working            [15,75]       0
Factor_two_one         [1.0,3.0]     8
Factor_three_one       [1.0,5.0]     8
Min_length_low_rank    [0.3,3.0]     8
Min_rank_low_length    [1,20]        5
First_low_thresh       [1,1000]      10
First_high_thresh      [1,4000]      12
First_low_factor       [1.0,15.0]    8
First_high_factor      [0.01,1.0]    8
Stem_length            [1,10]        4
Suppress_proper        [0,1]         1
Total number of bits is 72 (a 72-bit binary string).
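One plausible way to turn the 72-bit string into parameter values is sketched below. The ranges and bit widths are copied from the table; the field order, the linear scaling of each field onto its range, and the rounding of integer-valued parameters are assumptions of this sketch. The two parameters with 0 bits are not encoded in the string and are left out.

```python
# (name, low, high, bits, is_integer); only parameters with at least one bit.
PARAMS = [
    ("factor_two_one", 1.0, 3.0, 8, False),
    ("factor_three_one", 1.0, 5.0, 8, False),
    ("min_length_low_rank", 0.3, 3.0, 8, False),
    ("min_rank_low_length", 1, 20, 5, True),
    ("first_low_thresh", 1, 1000, 10, True),
    ("first_high_thresh", 1, 4000, 12, True),
    ("first_low_factor", 1.0, 15.0, 8, False),
    ("first_high_factor", 0.01, 1.0, 8, False),
    ("stem_length", 1, 10, 4, True),
    ("suppress_proper", 0, 1, 1, True),
]
TOTAL_BITS = sum(p[3] for p in PARAMS)


def decode(bits: str) -> dict:
    """Map a string of '0'/'1' characters onto the parameter ranges."""
    if len(bits) != TOTAL_BITS or set(bits) - {"0", "1"}:
        raise ValueError(f"expected a {TOTAL_BITS}-character string of 0s and 1s")
    values, pos = {}, 0
    for name, low, high, n, is_integer in PARAMS:
        raw = int(bits[pos:pos + n], 2)          # 0 .. 2**n - 1
        pos += n
        value = low + (high - low) * raw / (2 ** n - 1)   # assumed linear scaling
        values[name] = round(value) if is_integer else value
    return values


lowest = decode("0" * TOTAL_BITS)
highest = decode("1" * TOTAL_BITS)
print(TOTAL_BITS, lowest["stem_length"], highest["first_high_thresh"])
```

An all-zero string decodes to the lower end of every range and an all-one string to the upper end, which makes the encoding easy to sanity-check.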
GenEx (Combines Genitor with Extractor)
Genitor is run with a population of 50 for 1050 trials (the default setting). Each trial consists of running Extractor with the parameter settings specified in the 72-bit binary string. The fitness measure is based on the average precision for the whole training set. The final output is the highest-scoring binary string.
Experiments led to adopting a penalty such that: fitness = precision * penalty
(The fitness function is modified so that the correct number of keyphrases is output. Penalties vary from 0 to 1.)
Notes on Genitor:
It is a steady-state genetic algorithm.
The initial population is usually randomly chosen.
The population changes one individual at a time: the “least fit” individual is replaced by a newly created individual.
Whitley (1989) suggests that steady-state genetic algorithms are more aggressive than generational genetic algorithms.
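The steady-state loop can be sketched as follows. This is not Genitor itself: parent selection by two-way tournament, one-point crossover and the 10% mutation rate are simplifying assumptions, and `toy_fitness` is a hypothetical stand-in for running Extractor on the training set. The sketch also assumes that evaluating the 50 initial individuals counts toward the 1050 trials.

```python
import random


def steady_state_ga(fitness, n_bits=72, pop_size=50, trials=1050, seed=0):
    """Evolve bit strings, replacing the least fit individual one at a time."""
    rng = random.Random(seed)
    population = []
    for _ in range(pop_size):                      # random initial population
        bits = [rng.randint(0, 1) for _ in range(n_bits)]
        population.append((fitness(bits), bits))
    for _ in range(trials - pop_size):             # one new individual per trial
        # Assumed selection scheme: the better of two random picks, twice.
        parent_a = max(rng.sample(population, 2))[1]
        parent_b = max(rng.sample(population, 2))[1]
        cut = rng.randrange(1, n_bits)             # one-point crossover
        child = parent_a[:cut] + parent_b[cut:]
        if rng.random() < 0.1:                     # occasional single-bit mutation
            child[rng.randrange(n_bits)] ^= 1
        worst = min(range(pop_size), key=lambda i: population[i][0])
        population[worst] = (fitness(child), child)
    return max(population)                         # (best fitness, best bit string)


calls = 0


def toy_fitness(bits):
    """Stand-in for fitness = precision * penalty (penalty fixed at 1.0 here)."""
    global calls
    calls += 1
    precision = sum(bits) / len(bits)              # hypothetical 'precision'
    penalty = 1.0
    return precision * penalty


best_score, best_bits = steady_state_ga(toy_fitness)
print(calls, round(best_score, 3))
```

In GenEx each fitness call is a full run of Extractor over the training documents, which is why the number of trials dominates the running time.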
GenEx (continued)
GenEx may take a significant time to run (750 times longer than C4.5).
GenEx was trained separately on different corpora (journals, e-mails and Web-pages) in order to increase precision.
Average Precision +/- Stand. Dev. by Training/Testing corpus, for GenEx and C4.5 (rows include Journals, NASA Web-pages and a Summary of averages).
The question still remains whether 29% precision is acceptable. How else can automatic keyphrase extraction be tested?
Human Evaluation of GenEx keyphrases
A website explaining GenEx was created where the reader was asked to “volunteer” a URL for processing. Keyphrases were extracted and then presented to the user, who judged each one as Good, Bad or No Opinion.
Web-based human evaluation of keyword extraction:
Number of voters             205
Number of documents          267
Number of keyphrases         1869
Max. documents per person    5
Good                         1159 (62.0%)
Bad                          339 (18.1%)
No Opinion                   371 (19.9%)
About 80% of the keyphrases were found to be acceptable by these voters, counting every keyphrase that was not rated Bad (Good plus No Opinion, 81.9%).
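The percentages can be recomputed from the raw vote counts. The sketch also shows two possible readings of “acceptable”, which are interpretations rather than definitions given on the slide: every keyphrase not rated Bad, or Good as a share of the keyphrases that received a Good or Bad rating.

```python
counts = {"Good": 1159, "Bad": 339, "No Opinion": 371}
total = sum(counts.values())

# Share of each rating, in percent, rounded to one decimal place.
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Reading 1: acceptable means "not rated Bad".
not_bad = round(100 * (total - counts["Bad"]) / total, 1)

# Reading 2: ignore No Opinion and look at Good among decided votes.
decided = counts["Good"] + counts["Bad"]
good_of_decided = round(100 * counts["Good"] / decided, 1)

print(total, shares, not_bad, good_of_decided)
```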
Last Notes on Keyphrase Algorithms
Frank et al. (1999) developed Kea, which is a Bayesian approach to keyphrase extraction. Eibe Frank and his co-authors acknowledge the help of Peter Turney and NRC. Kea is available through the internet. (See Weka.)
Turney believes that GenEx and Kea should give statistically similar results.
The work that was done on specialized procedural domain knowledge was the main element in the automation of keyword extraction.
Future Work:
- Would under-sampling or over-sampling help the machine learning process, given the class imbalance?
- A thesaurus of synonyms would be a welcome addition.
- For specialized journals or Web-pages (e.g., Medline), a lexicon of frequently used keyphrases could be built.
Bibliography
Extraction of Keyphrases from Text: Evaluation of Four Algorithms (Turney, 1997)
Learning Algorithms for Keyphrase Extraction (Turney, 1999)
Kea: Practical Automatic Keyphrase Extraction (Witten et al., 1999)