Automatic Keyphrase Extraction (Jim Nuyens) Keywords are an everyday part of looking up topics and specific content. What are some of the ways of obtaining keywords/keyphrases by machine learning? This presentation reviews some of the work of Peter Turney at the NRC. The papers date from 1997 and 1999, so recent developments in data mining may suggest improvements.

Automated Keyphrase Extraction
- Introduction and definitions
- Applications and algorithms
- Learning algorithms
- Empirical results
- Future work

Definitions
- Information extraction: text analysis in this domain serves to provide user-anticipated information (e.g. the names of companies in news-service reports).
- Index generation: an index may be created as a "back-of-the-book" listing for human use or as an exhaustive computer listing used by a search engine.
- Important-phrase extraction: may be used especially with scientific journals.
- Keyphrase: a phrase of one to three words that captures a main topic.
- Keyphrase list: usually 5 to 15 keyphrases.
- Keyphrase generation: obtaining keyphrases, some of which are not available in the body of the text document.
- Keyphrase extraction: obtaining keyphrases which are available in the body of the text document.
- Note: on average, about 75% of the keyphrases appear in the text.
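
To make the extraction/generation distinction (and the 75% figure) concrete, here is a minimal sketch, not taken from Turney's papers, that checks which of a document's assigned keyphrases actually occur in its text; any phrase that does not occur could only be produced by generation, never by extraction. The example document and keyphrases are hypothetical.

```python
import re

def phrase_in_text(phrase: str, text: str) -> bool:
    """Case-insensitive check that the phrase occurs as a contiguous word sequence."""
    pattern = r"\b" + r"\s+".join(map(re.escape, phrase.lower().split())) + r"\b"
    return re.search(pattern, text.lower()) is not None

# Hypothetical document and author-assigned keyphrases.
text = "Neural networks have been applied to keyphrase extraction from journal articles."
keyphrases = ["neural networks", "keyphrase extraction", "machine learning"]

present = [p for p in keyphrases if phrase_in_text(p, text)]
print(f"{len(present)} of {len(keyphrases)} keyphrases appear in the text:", present)
# Here only "machine learning" is absent, so it could not be extracted, only generated.
```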

Applications and Algorithms
- Keyphrases may serve as a mini-summary.
- Partial indexing.
- Automated keyphrases can help an author with keywords or phrases he may have missed.
- Labels for text documents.
- Providing highlights for a document.

Algorithm of concern: stemming of words.

Word         Porter stem   Lovins stem
believes     believ        belief
belief       belief        belief
believable   believ        belief

Turney finds the more aggressive Lovins stemming algorithm to be more useful for keyword extraction.

Applications and Algorithms (continued)
"stone church" is not equal to "church stone"
"neural networks" = "neural network"

Sometimes the stemming algorithm does not get it correct:

Word        Porter stem   Lovins stem
realistic   realist       real
reality     realiti       re

Both the Porter and Lovins stemming algorithms see the two words as distinct.
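
As a quick illustration of the stemming comparisons above, here is a minimal sketch using NLTK. NLTK does not ship a Lovins stemmer, so the (also aggressive) Lancaster stemmer is used here purely as a stand-in; its output will not exactly match the Lovins column shown on the slide.

```python
# Compare two stemmers on the example words from the slides.
# Assumption: NLTK is installed; Lancaster stands in for Lovins (not identical).
from nltk.stem import PorterStemmer, LancasterStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()

for word in ["believes", "belief", "believable", "realistic", "reality"]:
    print(f"{word:12s} porter={porter.stem(word):10s} lancaster={lancaster.stem(word)}")
```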

Measuring performance of the algorithms
Confusion matrix for keyphrase extraction:

                            Human: keyphrase   Human: NOT a keyphrase
Machine: keyphrase          a                  b
Machine: NOT a keyphrase    c                  d

Example, for a total of 2500 stemmed words:

                            Human: keyphrase   Human: NOT a keyphrase
Machine: keyphrase          4                  3
Machine: NOT a keyphrase    2                  2491 (2500 - 9)

Measuring performance (continued)
Accuracy  = (a + d) / (a + b + c + d) = 2495 / 2500
Precision = a / (a + b) = 4 / 7
Recall    = a / (a + c) = 4 / 6
The F-measure is used as a balanced measure:
F-measure = 2a / (2a + b + c) = 8 / 13

A journal article will typically contain 10,000 words, and these narrow down to approximately 2500 stemmed word equivalents. Out of the average 7.5 keyphrases used, only 6 keyphrases are available in the text for extraction. This class imbalance does present machine learning difficulties.
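
The arithmetic above is easy to reproduce; here is a minimal sketch that computes the four measures directly from the confusion-matrix counts in the example (a = 4, b = 3, c = 2, d = 2500 - 9).

```python
# Metrics from the confusion-matrix example above.
a, b, c, d = 4, 3, 2, 2500 - 9

accuracy  = (a + d) / (a + b + c + d)   # 2495/2500 = 0.998
precision = a / (a + b)                 # 4/7  ≈ 0.571
recall    = a / (a + c)                 # 4/6  ≈ 0.667
f_measure = 2 * a / (2 * a + b + c)     # 8/13 ≈ 0.615 (harmonic mean of P and R)

print(accuracy, precision, recall, f_measure)
```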

Empirical Results (Turney, 1997)
F-measures of keyphrase extraction were compared on three journal-article texts (text #1, #2, #3) for four methods: Microsoft Word, Brill's Tagger, Verity's Search, and NRC's Extractor (the numeric F-measures are omitted here). Text #2 was a very difficult scientific article. The author also obtained good results (F-measures) for extraction of email and Web-page keyphrases.

Machine Learning Results (Turney, 1999)
The next paper deals with the different possible approaches to automatic keyphrase extraction.
Part I: the use of C4.5 software to find the keyphrases (where features are provided for the phrases in the determination of positive and negative cases).
Part II: the use of the GenEx algorithm, which is the combination of the Genitor genetic algorithm (Whitley, 1989) and the Extractor algorithm (NRC).

Part I: the author went through 110 features before settling on:
1) stemmed_phrase  2) whole_phrase  3) num_words_phrase  4) first_occur_phrase  5) first_occur_word  6) freq_phrase  7) freq_word  8) relative_length  9) proper_noun  10) final_adjective  11) common_verb  12) class
Class 1 is an extracted keyphrase and class 0 is NOT a keyphrase.
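
As a rough illustration of what a few of these features might look like in code, here is a minimal sketch. The exact definitions below (e.g. how the first-occurrence position is normalised) are assumptions for illustration only, not Turney's implementation.

```python
def phrase_features(phrase: str, text: str) -> dict:
    """Compute a handful of illustrative features for a candidate phrase."""
    words = text.lower().split()
    p = phrase.lower().split()
    n, k = len(words), len(p)

    # Positions where the phrase occurs as a contiguous word sequence.
    hits = [i for i in range(n - k + 1) if words[i:i + k] == p]

    return {
        "num_words_phrase": k,
        "first_occur_phrase": hits[0] / n if hits else 1.0,  # earlier = smaller
        "freq_phrase": len(hits),
    }

doc = ("Keyphrase extraction selects phrases from a document. "
       "Good keyphrase extraction helps indexing.")
print(phrase_features("keyphrase extraction", doc))
# {'num_words_phrase': 2, 'first_occur_phrase': 0.0, 'freq_phrase': 2}
```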

NRC's Extractor: ten steps
1) Find single stems (stemming algorithm)
2) Score single stems
3) Select top single stems
4) Find stem phrases (phrases up to length 3)
5) Score stem phrases
6) Expand single stems
7) Drop duplicates
8) Add suffixes
9) Add capitals
10) Final output
Summary: Extractor is the NRC software that takes text as input and produces keyphrases as output.

NRC's Extractor (continued)
The tests (within the algorithm):
1) The phrase should not have the capitalization of a proper noun, unless the flag suppress_proper is set to zero.
2) The phrase should not have an ending that indicates a possible adjective.
3) The phrase should be longer than min_length_low_rank.
4) If the phrase is shorter than min_length_low_rank, it may still be acceptable.
5) If the phrase fails both tests 3) and 4), it may still be acceptable if its capitalization indicates that it is probably an abbreviation.
6) The phrase should not contain any words commonly used as verbs.
7) The phrase should not match any phrase
Lastly, a phrase must pass tests 1), 2), 6), and 7), and at least one of 3), 4), and 5).
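
The final acceptance rule lends itself to a direct boolean check. Below is a minimal sketch of that logic; the seven per-test results are passed in as precomputed booleans because the individual checks here are hypothetical stand-ins, not the real Extractor code.

```python
def accept_phrase(tests: dict) -> bool:
    """tests maps test number (1..7) to True (pass) or False (fail)."""
    must_pass = all(tests[i] for i in (1, 2, 6, 7))
    at_least_one = any(tests[i] for i in (3, 4, 5))
    return must_pass and at_least_one

# Passes 1, 2, 6, 7 and test 3 -> accepted.
print(accept_phrase({1: True, 2: True, 3: True, 4: False, 5: False, 6: True, 7: True}))
# Fails test 6 (contains a common verb) -> rejected despite passing test 3.
print(accept_phrase({1: True, 2: True, 3: True, 4: False, 5: False, 6: False, 7: True}))
```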

NRC's Extractor (continued)
Twelve parameters (used with Extractor and Genitor):

Parameter name         Range          Number of bits
Num_phrases            [5, 15]        0
Num_working            [15, 75]       0
Factor_two_one         [1.0, 3.0]     8
Factor_three_one       [1.0, 5.0]     8
Min_length_low_rank    [0.3, 3.0]     8
Min_rank_low_length    [1, 20]        5
First_low_thresh       [1, 1000]      10
First_high_thresh      [1, 4000]      12
First_low_factor       [1.0, 15.0]    8
First_high_factor      [0.01, 1.0]    8
Stem_length            [1, 10]        4
Suppress_proper        [0, 1]         1

Total number of bits: 72 (a 72-bit binary string).
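
One plausible way to turn such a bit string into parameter values is to scale each bit field linearly into its range; the sketch below does exactly that. The field ordering and the linear-scaling scheme are assumptions for illustration, the two 0-bit parameters are treated as fixed, and Turney's paper defines the actual encoding.

```python
import random

# (name, low, high, bits); the two 0-bit parameters are fixed, not encoded.
FIELDS = [
    ("Factor_two_one",      1.0,  3.0,   8),
    ("Factor_three_one",    1.0,  5.0,   8),
    ("Min_length_low_rank", 0.3,  3.0,   8),
    ("Min_rank_low_length", 1,    20,    5),
    ("First_low_thresh",    1,    1000, 10),
    ("First_high_thresh",   1,    4000, 12),
    ("First_low_factor",    1.0,  15.0,  8),
    ("First_high_factor",   0.01, 1.0,   8),
    ("Stem_length",         1,    10,    4),
    ("Suppress_proper",     0,    1,     1),
]
assert sum(bits for *_, bits in FIELDS) == 72

def decode(bitstring: str) -> dict:
    """Map a 72-bit string onto parameter values by linear scaling per field."""
    params, pos = {}, 0
    for name, low, high, bits in FIELDS:
        raw = int(bitstring[pos:pos + bits], 2)
        params[name] = low + (high - low) * raw / (2 ** bits - 1)
        pos += bits
    return params

bitstring = "".join(random.choice("01") for _ in range(72))
print(decode(bitstring))
```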

GenEx (combines Genitor with Extractor)
Genitor is run with a population of 50 for 1050 trials (the default setting). Each trial consists of running Extractor with the parameter settings specified in the 72-bit binary string. The fitness measure is based on the average precision over the whole training set. The final output is the highest-scoring binary string.
Experimental results were obtained by adopting a penalty such that fitness = precision * penalty (a modification of the fitness function so that the correct number of keyphrases is output; penalties vary from 0 to 1).
Notes on Genitor: it is a steady-state genetic algorithm. The initial population is usually randomly chosen. The population changes one individual at a time: the least-fit individual is replaced by a newly created offspring. Whitley (1989) suggests that steady-state genetic algorithms are more aggressive than generational genetic algorithms.
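
To show the shape of the search described above, here is a minimal, generic steady-state GA loop with a fitness of precision * penalty. It is not Turney's GenEx: evaluate_precision() and penalty() are hypothetical stand-ins for running Extractor over the training documents.

```python
import random

BITS, POP_SIZE, TRIALS = 72, 50, 1050

def evaluate_precision(bits: str) -> float:
    # Stand-in: GenEx would run Extractor with the decoded parameters over the
    # training set and measure average precision of the extracted keyphrases.
    return bits.count("1") / BITS

def penalty(bits: str) -> float:
    # Stand-in for the 0..1 penalty that pushes toward the desired keyphrase count.
    return 1.0

def fitness(bits: str) -> float:
    return evaluate_precision(bits) * penalty(bits)

def breed(a: str, b: str) -> str:
    point = random.randrange(1, BITS)          # one-point crossover
    child = a[:point] + b[point:]
    if random.random() < 0.05:                 # occasional one-bit mutation
        i = random.randrange(BITS)
        child = child[:i] + ("1" if child[i] == "0" else "0") + child[i + 1:]
    return child

population = ["".join(random.choice("01") for _ in range(BITS)) for _ in range(POP_SIZE)]
for _ in range(TRIALS):
    # Tournament selection biased toward fitter parents.
    p1, p2 = sorted(random.sample(population, 4), key=fitness, reverse=True)[:2]
    worst = min(range(POP_SIZE), key=lambda i: fitness(population[i]))
    population[worst] = breed(p1, p2)          # steady-state replacement
print("best fitness:", round(max(fitness(s) for s in population), 3))
```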

GenEx (continued)
GenEx may take a significant time to run (about 750 times longer than C4.5). GenEx was trained separately on different corpora (journals, emails and Web pages) in order to increase precision.
Average precision +/- standard deviation was reported for each training/testing combination (journals, emails, NASA Web pages) for both GenEx and C4.5, along with overall averages (the numeric values are omitted here).
The question still remains whether 29% precision is acceptable. How else can automatic keyphrase extraction be tested?

Human Evaluation of GenEx keyphrases
A website explaining GenEx was created where the reader was asked to "volunteer" a URL for processing. Keyphrases were extracted and then presented to the user to judge whether he/she found them Good, Bad, or had No Opinion.

Web-based human evaluation of keyphrase extraction:
Number of voters:           205
Number of documents:        267
Number of keyphrases:       1869
Max. documents per person:  5

Good:        1159 (62.1%)
Bad:          339 (18.1%)
No opinion:   371 (19.9%)

Setting aside the "No opinion" votes, about 80% of the judged keyphrases were found to be acceptable.

Last Notes on Keyphrase Algorithms
Frank et al. (1999) developed Kea, which is a Bayesian approach to keyphrase extraction. Eibe Frank and the authors acknowledge the help of Peter Turney and the NRC. Kea is available on the internet (see Weka). Turney believes that GenEx and Kea should give statistically similar results. The work done on specialized procedural domain knowledge was the main element in the automation of keyphrase extraction.
Future Work:
– Would under-sampling or over-sampling help the machine learning process? (There is a class imbalance.)
– A thesaurus of synonyms would be a welcome addition.
– For specialized journals or Web pages (e.g. Medline) a lexicon of frequently used keyphrases could be found.

Bibliography Extraction of Keyphrases from Text: Evaluation of Four Algorithms (Turney, 1997) Learning Algorithms for Keyphrase Extraction (Turney, 1999) Kea: Practical Automatic Keyphrase Extraction (Witten et al., 1999)