Extracting Key Terms From Noisy and Multi-theme Documents
Maria Grineva, Maxim Grinev and Dmitry Lizorkin
Institute for System Programming of RAS



Outline
1. Key terms extraction: traditional approaches and applications
2. Using Wikipedia as a knowledge base for Natural Language Processing
3. Main techniques of our approach:
   – Wikipedia-based semantic relatedness
   – Network analysis algorithm to detect community structure in networks
4. Our method
5. Experimental evaluation

Key Terms Extraction
A basic step for various NLP tasks:
– document classification
– document clustering
– text summarization
– inferring a more general topic of a text document
A core task of Internet content-based advertising systems, such as Google AdSense and Yahoo! Contextual Match:
– Web pages are typically noisy (sidebars, menus, comments, future announcements, etc.)
– multi-theme Web pages (portal home pages, etc.) have to be handled

Approaches to Key Terms Extraction
Based on statistical learning:
– use, for example, a frequency criterion (TFxIDF model), keyphrase frequency, or the distance between terms normalized by the number of words in the document (KEA)
– compute statistical features over the Wikipedia corpus (Wikify!)
– require a training set
Based on analyzing syntactic or semantic term relatedness within a document:
– compute semantic relatedness between terms (using, for example, Wikipedia)
– model the document as a semantic graph of terms and apply graph analysis techniques to it (TextRank)
– no training set required

Using Wikipedia as a Knowledge Base for Natural Language Processing
Wikipedia is a free, open encyclopedia:
– Today Wikipedia is the biggest encyclopedia (more than 2.7 million articles in the English Wikipedia)
– It is kept up to date by millions of editors around the world
– It has a huge network of cross-references between articles, a large number of categories, redirect pages, and disambiguation pages => a rich resource for bootstrapping NLP and IR tasks

Basic Techniques of Our Method: Semantic Relatedness of Terms
Semantic relatedness assigns a score to a pair of terms that represents the strength of the relatedness between them
We use Wikipedia to compute the semantic relatedness of terms
We use semantic relatedness to model a document as a graph of terms

Basic Techniques of Our Method: Semantic Relatedness of Terms
Wikipedia-based semantic relatedness of two terms can be computed using:
– the links found within their corresponding Wikipedia articles
– the Wikipedia category structure
– the articles’ textual content
We use the Dice measure for Wikipedia-based semantic relatedness
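As a rough illustration of the Dice measure named on this slide, the sketch below scores two terms by the overlap of the link sets of their Wikipedia articles: 2|A ∩ B| / (|A| + |B|). The slides do not say exactly which links are counted or whether they are weighted, so the unweighted set form, the function name `dice_relatedness`, and the sample link sets are all assumptions made for illustration.

```python
def dice_relatedness(links_a, links_b):
    """Dice coefficient between two collections of Wikipedia link targets.

    Returns a value in [0, 1]; 0 means no shared links, 1 means identical sets.
    """
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0  # two articles without links: treat as unrelated
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical link sets, invented for this example (not real Wikipedia data).
links = {
    "Apple Inc.": {"ITunes", "IPod", "Steve Jobs", "Macintosh"},
    "ITunes": {"Apple Inc.", "IPod", "Steve Jobs", "Podcast"},
    "Braille": {"Blindness", "Louis Braille"},
}

score_close = dice_relatedness(links["Apple Inc."], links["ITunes"])
score_far = dice_relatedness(links["Apple Inc."], links["Braille"])
```

With these invented link sets, the two Apple-related articles share two of their four links each, while the third article shares none, so the first pair scores higher.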

Basic Techniques of Our Method: Detecting Community Structure in Networks
We discover communities of terms in a document graph
Community: a densely interconnected group of nodes in a network
Girvan-Newman algorithm for detecting community structure in networks:
– betweenness: how much an edge lies “in between” different communities
– modularity: a partition is good if there are many edges within communities and only a few between them
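To make the modularity criterion concrete, here is a minimal sketch that computes the standard Newman-Girvan modularity Q of a given partition of a weighted, undirected graph. It covers only the scoring of a partition, not the edge-removal loop of the Girvan-Newman algorithm, and the edge-dictionary representation and the toy graph are assumptions made for illustration.

```python
def modularity(edges, communities):
    """Modularity Q of a partition of an undirected weighted graph.

    edges: dict mapping (u, v) to a weight, each undirected edge listed once.
    communities: list of disjoint node sets that together cover all nodes.
    """
    total = sum(edges.values())
    if total == 0:
        return 0.0
    member = {node: i for i, group in enumerate(communities) for node in group}
    inner = [0.0] * len(communities)   # weight of edges inside each community
    degree = [0.0] * len(communities)  # summed weighted degree of each community
    for (u, v), w in edges.items():
        degree[member[u]] += w
        degree[member[v]] += w
        if member[u] == member[v]:
            inner[member[u]] += w
    # Q = sum over communities of (inner fraction - expected fraction)
    return sum(i / total - (d / (2 * total)) ** 2 for i, d in zip(inner, degree))


# Toy graph: two triangles joined by a single bridge edge (c, d).
toy_edges = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
q_split = modularity(toy_edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_whole = modularity(toy_edges, [{"a", "b", "c", "d", "e", "f"}])
```

Splitting the toy graph into its two triangles gives a higher Q than keeping it as one community, which matches the intuition on the slide: many edges inside communities, few between them.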

Our Method
1. Candidate terms extraction
2. Word sense disambiguation
3. Building semantic graph
4. Discovering community structure of the semantic graph
5. Selecting valuable communities

Our Method: Candidate Terms Extraction
Goal: extract all terms from the document and, for each term, prepare a set of Wikipedia articles that can describe its meaning
Parse the input document and extract all possible n-grams
For each n-gram (and its morphological variations), provide a set of Wikipedia article titles
– “drinks”, “drinking”, “drink” => [Wikipedia:] Drink; Drinking
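The candidate extraction step can be sketched as follows: enumerate all n-grams up to a fixed length and keep those found in an index from surface forms to Wikipedia article titles. The tokenizer, the `max_n` limit, and the tiny `TITLE_INDEX` are hypothetical; in particular, morphological variations are handled here simply by listing each variant in the index, which is an assumption rather than the method described in the talk.

```python
import re


def extract_ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n, lowercased."""
    words = re.findall(r"[A-Za-z0-9']+", text.lower())
    return [
        " ".join(words[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(words) - n + 1)
    ]


# Hypothetical index: surface form -> candidate Wikipedia article titles.
TITLE_INDEX = {
    "drink": ["Drink", "Drinking"],
    "drinks": ["Drink", "Drinking"],
    "drinking": ["Drink", "Drinking"],
    "soft drink": ["Soft drink"],
}


def candidate_terms(text, index, max_n=3):
    """Map every n-gram of the text that is in the index to its candidate titles."""
    return {gram: index[gram] for gram in extract_ngrams(text, max_n) if gram in index}


found = candidate_terms("He drinks a soft drink.", TITLE_INDEX)
```

Terms such as "drinks" that map to more than one title are the ambiguous cases that the next step has to resolve.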

Our Method: Word Sense Disambiguation
Goal: choose the most appropriate Wikipedia article from the set of candidate articles for each ambiguous term extracted in the previous step
Use Wikipedia disambiguation and redirect pages to obtain candidate meanings of ambiguous terms
Denis Turdakov, Pavel Velikhov. “Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation”. SYRCoDIS, 2008

Our Method: Building Semantic Graph
Goal: build the document semantic graph using semantic relatedness between terms
Figure: semantic graph built from the news article "Apple to Make ITunes More Accessible For the Blind"
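One plausible way to build such a graph is to connect every pair of disambiguated terms whose relatedness score is positive and use the score as the edge weight. The sketch below assumes this; the `threshold` parameter, the `lookup` helper, and the scores in `SCORES` are hypothetical and not taken from the slides.

```python
from itertools import combinations


def build_semantic_graph(terms, relatedness, threshold=0.0):
    """Return a dict mapping term pairs to weights for pairs scoring above threshold.

    relatedness: any function taking two terms and returning a score.
    """
    graph = {}
    for a, b in combinations(sorted(terms), 2):
        score = relatedness(a, b)
        if score > threshold:
            graph[(a, b)] = score
    return graph


# Hypothetical relatedness scores, invented for this example.
SCORES = {
    ("Apple Inc.", "ITunes"): 0.5,
    ("Blindness", "Braille"): 0.4,
}


def lookup(a, b):
    """Symmetric lookup in SCORES; unknown pairs count as unrelated."""
    return SCORES.get((a, b), SCORES.get((b, a), 0.0))


graph = build_semantic_graph(["ITunes", "Apple Inc.", "Braille", "Blindness"], lookup)
```

The resulting edge dictionary has two disconnected groups of terms, which is the kind of structure the community detection step is meant to find.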

Our Method: Detecting Community Structure of the Semantic Graph

Our Method: Selecting Valuable Communities
Goal: rank term communities in such a way that:
– the highest-ranked communities contain key terms
– the lowest-ranked communities contain unimportant terms and possible disambiguation mistakes
Use:
– density of a community: the sum of the inner edges of the community divided by the number of vertices in the community
– informativeness: the sum of the keyphraseness measure (a Wikipedia-based TFxIDF analogue) of the community terms
Community rank: density * informativeness
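The ranking formula above can be sketched in a few lines. The code assumes that "sum of inner edges" means the sum of the weights of edges with both endpoints inside the community, and that keyphraseness values are already available per term; the function name and all numbers below are hypothetical.

```python
def community_rank(community, edges, keyphraseness):
    """Rank = density * informativeness, following the definitions on the slide.

    community: set of terms; edges: dict (u, v) -> weight;
    keyphraseness: dict term -> score (assumed to be precomputed).
    """
    inner_weight = sum(
        w for (u, v), w in edges.items() if u in community and v in community
    )
    density = inner_weight / len(community)
    informativeness = sum(keyphraseness.get(term, 0.0) for term in community)
    return density * informativeness


# Hypothetical graph and keyphraseness values, invented for this example.
edges = {
    ("Apple Inc.", "ITunes"): 0.5,
    ("Apple Inc.", "IPod"): 0.25,
    ("IPod", "ITunes"): 0.25,
    ("Login", "Menu"): 0.25,
}
keyphraseness = {
    "Apple Inc.": 0.5, "ITunes": 0.5, "IPod": 0.5, "Login": 0.125, "Menu": 0.125,
}
topic = {"Apple Inc.", "ITunes", "IPod"}
noise = {"Login", "Menu"}

rank_topic = community_rank(topic, edges, keyphraseness)
rank_noise = community_rank(noise, edges, keyphraseness)
```

In this toy setup the topical community outranks the navigation-style one because it is both denser and made of more keyphrase-like terms.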

Our Method: Selecting Valuable Communities
In 73% of Web pages, a decline in community scores separates key-term communities from unimportant ones
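The slide does not say how the decline is located. One simple heuristic, shown here purely as an assumption, is to sort the community ranks in descending order and cut at the largest gap between consecutive scores. The function name and the sample scores are hypothetical.

```python
def split_at_largest_drop(scores):
    """Given ranks sorted in descending order, return how many top items to keep.

    The cut is placed at the largest difference between neighbouring scores.
    """
    if len(scores) < 2:
        return len(scores)
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    return drops.index(max(drops)) + 1


# Hypothetical community ranks, already sorted from highest to lowest.
ranks = [0.5, 0.45, 0.05, 0.03125]
keep = split_at_largest_drop(ranks)
key_term_ranks = ranks[:keep]
```

Other cut rules (a fixed threshold, or a ratio between neighbouring scores) would fit the same slot; the slides do not indicate which one was used.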

Advantages of the Method
No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia
Noise and multi-theme stability. The method is good at filtering out noise and discovering topics in Web pages
Thematically grouped key terms. These significantly improve further inference of document topics using, for example, spreading activation over the Wikipedia category graph
High accuracy. Evaluated using human judgments (later in this presentation)

Experimental Evaluation on Noise-free Dataset
Classical: TFxIDF, Yahoo! Terms Extractor
Wikipedia-based: Wikify!, TextRank
Evaluation on a noise-free dataset (blog posts) using human judgment

Experimental Evaluation on Web Pages
Comparison to other methods
Performance of our method on different kinds of Web pages

Experimental Evaluation on Web Pages
Multi-theme stability evaluated on compound Web pages (popular news sites, portal home pages, etc.)

Thank You! Any Questions?