Extracting Keyphrases from Books using Language Modeling Approaches. Rohini U, AOL India R&D, Bangalore, India


Extracting Keyphrases from Books using Language Modeling Approaches
Rohini U, AOL India R&D, Bangalore, India
Vamshi Ambati, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA

Agenda
- Keyphrase Extraction
- Value addition to Digital Libraries
- Methods of Keyphrase Extraction
- Related Work
- Our Solution

What are Keyphrases?
- Keyphrases
  - (Give example)
- Where used?
  - Cataloguing in libraries for IR purposes
  - Quick summarization of documents

Why important to ULIB?
- Vast growth in digital content
  - More than a million books!
- Short metadata description: useful to the user while reading
- For further processing of books, such as summarization, IR, etc.

How do we extract KPs?
- Manual entry
  - Reliable, high-quality outcome
  - But time-consuming and expensive
- Automatic
  - Fast extraction but less reliable
  - No expense at all

Automatic techniques for KPE
- Rule-based methods
  - Heuristics (paragraph beginning, headline, etc.)
  - Krulwich & Burkey, etc.
  - Using linguistic tools
- Statistical techniques
  - Term counts and weighting-based methods
  - Learn model from training data
  - Turney [5], KEA [6], KPSpotter [3], etc.

Requirements for a KPE for ULIB
- Automatic identification of keyphrases from chapters of books
- Language independent
- Easily adaptable for different domains
- No training data to learn from
  - Most books in ULIB do not have keywords as part of the metadata

Solution Outline
- Language modeling based
- Given n-grams
- Measure informativeness and phraseness
- Score n-grams based on the above measures
- Pick top K phrases as keyphrases

Extracting Keyphrases from Books
Pipeline: Text -> Cleaning & Initialization -> Candidate Keyphrases Extraction -> Scoring -> Pruning -> Extracted Keyphrases

Extracting Keyphrases from Books: Cleaning & Initialization
- Input text: "Topics are also used to construct user profiles via explicit specification of interests or automatic analysis of Web pages visited"
- After cleaning: topics construct user profiles explicit specification interests automatic analysis web pages visited

Extracting Keyphrases from Books: Candidate Keyphrases Extraction
- Cleaned text: topics construct user profiles explicit specification interests automatic analysis web pages visited
- Candidate n-grams: { topics construct user, construct user profiles, user profiles explicit, profiles explicit specification, explicit specification interests, specification interests automatic, interests automatic analysis, automatic analysis web, analysis web pages, web pages visited }
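The cleaning and candidate-extraction steps shown on this slide can be sketched in a few lines. This is a minimal sketch, not the authors' implementation: the stopword list is a hypothetical one chosen only so that the slide's example sentence reduces to the tokens shown, and the names `clean` and `ngrams` are illustrative.

```python
import re

# Hypothetical stopword list (an assumption): just enough to reproduce
# the cleaned tokens shown on the slide for the example sentence.
STOPWORDS = {"are", "also", "used", "to", "via", "of", "or"}

def clean(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n=3):
    """Return every contiguous n-gram as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ("Topics are also used to construct user profiles via explicit "
            "specification of interests or automatic analysis of Web pages visited")
tokens = clean(sentence)
candidates = ngrams(tokens, n=3)
```

Trigrams are used here only because the slide's candidates are three words long; `n` can be varied to collect candidates of other lengths.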

Extracting Keyphrases from Books: Scoring
- Scored candidates (phrase : score):
  - profiles explicit specification :
  - explicit specification interests :
  - specification interests automatic :
  - user profiles explicit :
  - construct user profiles :
  - interests automatic analysis :
  - topics construct user :
  - automatic analysis web :
  - web pages visited :
  - analysis web pages :

Scoring
- Phraseness
  - Measures the degree to which a given n-gram can be considered a phrase
  - Based on co-occurrence of words
  - Example
- Informativeness
  - Measures how informative a given n-gram is
  - Uninformative examples: "there is a", "a lot of", etc.
  - Compares co-occurrence in a general corpus vs. the given text (book)
- Total score
  - Phraseness score + informativeness score

Scoring - Phraseness
- Computed by measuring the distance between the unigram model and the N-gram model
- Pointwise KL-divergence (Tomokiyo and Hurst [4]): $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$
- Phraseness measure: $\delta_w(LM_{fg}^{N} \,\|\, LM_{fg}^{1})$
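The phraseness measure can be illustrated with a small sketch. It assumes maximum-likelihood estimates from raw counts and assumes that the unigram model scores an n-gram as the product of its word probabilities, following the formulation in Tomokiyo and Hurst [4]; the function names and the toy text are illustrative, not taken from the slides.

```python
import math
from collections import Counter

def pointwise_kl(p, q):
    """Pointwise KL divergence: p * log(p / q)."""
    return p * math.log(p / q)

def phraseness(ngram, ngram_counts, unigram_counts):
    """Compare the N-gram model with the unigram model of the same text.

    Assumption: the unigram model treats words as independent, so its
    probability for the n-gram is the product of the word probabilities.
    """
    p_ngram = ngram_counts[ngram] / sum(ngram_counts.values())
    total_unigrams = sum(unigram_counts.values())
    p_indep = math.prod(unigram_counts[w] / total_unigrams for w in ngram)
    return pointwise_kl(p_ngram, p_indep)

# Toy foreground text (hypothetical, for illustration only).
tokens = "web pages visited by users of web pages".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
score = phraseness(("web", "pages"), bigrams, unigrams)
```

A positive score means the words occur together more often than independence would predict, which is the sense in which the n-gram behaves like a phrase.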

Scoring - Informativeness
- Computed by measuring the distance between the n-gram model from the given data and the n-gram model from general data
- Pointwise KL-divergence (Tomokiyo and Hurst [4]): $\delta_w(p \,\|\, q) = p(w) \log \frac{p(w)}{q(w)}$
- Informativeness measure: $\delta_w(LM_{fg}^{1} \,\|\, LM_{bg}^{1})$
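A comparable sketch for informativeness compares a foreground model (the book) with a background model (a general corpus). The slides do not say how items missing from the background corpus are handled, so the additive smoothing below is an assumption, as are the function names and the toy corpora.

```python
import math
from collections import Counter

def pointwise_kl(p, q):
    """Pointwise KL divergence: p * log(p / q)."""
    return p * math.log(p / q)

def smoothed_prob(item, counts, vocab_size, alpha=1.0):
    """Additive smoothing (an assumption) so unseen items keep a non-zero probability."""
    return (counts[item] + alpha) / (sum(counts.values()) + alpha * vocab_size)

def informativeness(item, fg_counts, bg_counts):
    """Compare the foreground (book) model with the background (general) model."""
    vocab_size = len(set(fg_counts) | set(bg_counts))
    p = smoothed_prob(item, fg_counts, vocab_size)
    q = smoothed_prob(item, bg_counts, vocab_size)
    return pointwise_kl(p, q)

# Toy corpora (hypothetical, for illustration only).
fg = Counter("user profiles describe user interests".split())
bg = Counter("there is a lot of text there is a user".split())
info_profiles = informativeness("profiles", fg, bg)
info_is = informativeness("is", fg, bg)
```

In this toy setup a term that is frequent in the book but rare in general text scores above zero, while a common function word scores below zero.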


Extracting Keyphrases from Books: Pruning
- Extracted keyphrases:
  - profiles explicit specification
  - explicit specification interests
  - specification interests automatic
  - user profiles explicit
  - construct user profiles
  - interests automatic analysis
  - topics construct user
  - automatic analysis web
  - web pages visited
  - analysis web pages
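The final two stages, combining the two scores and keeping the top K phrases, might look like the sketch below. The phrases and score values are placeholders invented for illustration; they are not results from the presentation, and `total_score` and `top_k` are hypothetical names.

```python
def total_score(phraseness, informativeness):
    """Total score as on the scoring slide: phraseness + informativeness."""
    return phraseness + informativeness

def top_k(scores, k):
    """Return the k highest-scoring phrases, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Placeholder (phraseness, informativeness) pairs; illustrative values only.
component_scores = {
    "phrase a": (0.5, 0.25),
    "phrase b": (0.125, 0.125),
    "phrase c": (1.0, -0.5),
}
totals = {phrase: total_score(p, i) for phrase, (p, i) in component_scores.items()}
keyphrases = top_k(totals, k=2)
```

Pruning is modeled here as a simple top-K cut-off, matching the "pick top K phrases" step in the solution outline.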

Conclusions and Future Work
- Discussed benefits of keyphrases in the ULIB context
- Demonstrated the building of a KPE that works for books
- Robust evaluation
  - Building a test set from books in ULIB for generic robust evaluation of KPE tools
- Are chapters really independent in a book?
  - Revisit the assumption

Thank you

References
[1] Fred J. Damerau. Generating and evaluating domain-oriented multi-word terms from texts. Information Processing and Management, 29(4).
[2] S.T. Dumais, J. Platt, D. Heckerman, and M. Sahami. Inductive learning algorithms and representations for text categorization. In Proceedings of the 7th International Conference on Information and Knowledge Management. ACM Press.
[3] Min Song, Il-Yeol Song, and Xiaohua Hu. KPSpotter: a flexible information gain-based keyphrase extraction system. In WIDM '03: Proceedings of the 5th ACM International Workshop on Web Information and Data Management, pages 50-53, New York, NY, USA. ACM Press.
[4] Takashi Tomokiyo and Matthew Hurst. A language modeling approach to keyphrase extraction. In Proceedings of the ACL 2003 Workshop on Multiword Expressions, pages 33-40, Morristown, NJ, USA. Association for Computational Linguistics.
[5] P.D. Turney. Learning algorithms for keyphrase extraction. Information Retrieval, 2(4).
[6] I.H. Witten, G.W. Paynter, E. Frank, C. Gutwin, and C.G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In E. A. Fox and N. Rowe, editors, Proceedings of Digital Libraries 99: The Fourth ACM Conference on Digital Libraries. ACM Press.
[7] Mikio Yamamoto and Kenneth W. Church. Using suffix arrays to compute term frequency and document frequency for all substrings in a corpus. Computational Linguistics, 27(1):1-30, 2001.