Presentation transcript:

Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis Evgeniy Gabrilovich and Shaul Markovitch, Technion - Israel Institute of Technology, IJCAI 2007

Introduction How related are “cat” and “mouse”? Explicit Semantic Analysis (ESA) represents the meaning of texts in a high-dimensional space of concepts derived from Wikipedia. Wikipedia is currently the largest knowledge repository on the Web; its English version is the largest of all, with 400+ million words in over one million articles.

Explicit Semantic Analysis Let T = {w1 … wn} be the input text, and let <vi> be its TF-IDF vector, where vi is the weight of word wi. Let cj ∈ {c1, . . . , cN} be the Wikipedia concepts, where N is the total number of Wikipedia concepts. Let <kj> be the inverted index entry for word wi, where kj quantifies the strength of association of word wi with Wikipedia concept cj.

Explicit Semantic Analysis The semantic interpretation vector V for text T is a vector of length N, in which the weight of each concept cj is defined as Σ wi∈T vi · kj. To compute the semantic relatedness of a pair of text fragments, we compare their vectors using the cosine metric.
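To make the two slides above concrete, here is a minimal, self-contained sketch of ESA in Python; the toy three-article corpus and all identifiers are illustrative assumptions, not the authors' implementation.

```python
# A minimal ESA sketch (toy corpus and names are illustrative
# assumptions, not the paper's system). Each Wikipedia article is one
# "concept"; words are weighted by TF-IDF within each article.
import math
from collections import defaultdict

# Toy stand-in for Wikipedia: concept title -> article text.
concepts = {
    "Cat": "cat cats feline pet hunting whiskers",
    "Mouse (rodent)": "mouse mice rodent cat prey",
    "Computer mouse": "mouse computer pointer device click",
}

# Inverted index: word -> {concept: TF-IDF strength of association k}.
tokenized = {c: text.split() for c, text in concepts.items()}
df = defaultdict(int)                        # document frequency per word
for tokens in tokenized.values():
    for w in set(tokens):
        df[w] += 1
N = len(concepts)
inverted = defaultdict(dict)
for c, tokens in tokenized.items():
    for w in set(tokens):
        tf = tokens.count(w) / len(tokens)
        inverted[w][c] = tf * math.log(N / df[w])

def interpret(text):
    """Semantic interpretation vector: weight(cj) = sum over wi of vi * kj."""
    tokens = text.lower().split()
    vec = defaultdict(float)
    for w in set(tokens):
        v_i = tokens.count(w) / len(tokens)  # simple TF weight for the input
        for c, k in inverted.get(w, {}).items():
            vec[c] += v_i * k
    return vec

def relatedness(text1, text2):
    """Cosine similarity between the two interpretation vectors."""
    a, b = interpret(text1), interpret(text2)
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(relatedness("cat", "mouse"))  # > 0: both activate "Mouse (rodent)"
```

Running it prints a positive score for “cat” vs. “mouse” because both words activate the “Mouse (rodent)” concept, which is exactly the overlap the cosine comparison exploits.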

Example 1 First ten concepts in sample interpretation vectors (tables from the original slides are not reproduced in this transcript)

Example (texts with ambiguous words) First ten concepts in sample interpretation vectors (table not reproduced in this transcript)

Empirical Evaluation: Wikipedia Parsing the Wikipedia XML dump, we obtained 2.9 GB of text in 1,187,839 articles. After removing small and overly specific concepts (those having fewer than 100 words and fewer than 5 incoming or outgoing links), 241,393 articles were left, with 389,202 distinct terms.
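As a rough illustration of the pruning criterion above, a hypothetical sketch follows; the record fields and the exact reading of “fewer than 100 words and fewer than 5 incoming or outgoing links” are assumptions, not the authors' actual pipeline.

```python
# Hypothetical sketch of the concept-pruning step (field names and the
# exact reading of the criterion are assumptions). An article is
# discarded only if it is both short and poorly linked.
def keep_concept(article):
    too_small = article["word_count"] < 100
    too_isolated = article["in_links"] + article["out_links"] < 5
    return not (too_small and too_isolated)

articles = [
    {"title": "Cat", "word_count": 5000, "in_links": 800, "out_links": 300},
    {"title": "Tiny stub", "word_count": 40, "in_links": 1, "out_links": 1},
]
concepts = [a for a in articles if keep_concept(a)]  # keeps only "Cat"
```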

Empirical Evaluation: Open Directory Project A hierarchy of over 400,000 concepts and 2,800,000 URLs. Crawling all of its URLs and taking the first 10 pages encountered at each site yielded 70 GB of textual data. After removing stop words and rare words, we obtained 20,700,000 distinct terms.

Datasets and Evaluation Procedure The WordSimilarity-353 collection contains 353 word pairs; each pair has 13–16 human judgements. The Spearman rank-order correlation coefficient was used to compare computed relatedness scores with human judgements (see http://webclass.ncu.edu.tw/~tang0/Chap8/sas8.htm for a description of the statistic).

Datasets and Evaluation Procedure 50 documents from the Australian Broadcasting Corporation’s news mail service [Lee et al., 2005]. These documents were paired in all possible ways, and each of the 1,225 pairs has 8–12 human judgements. When the human judgements are averaged for each pair, the collection of 1,225 relatedness scores has only 67 distinct values. Spearman correlation is not appropriate in this case, so we used Pearson’s linear correlation coefficient (http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient).
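Both correlation measures are standard; here is a minimal sketch using SciPy (the score lists are made-up placeholders, not data from the paper):

```python
# Sketch of the evaluation protocol with SciPy's standard correlation
# functions; the numbers below are placeholder values, not the paper's.
from scipy.stats import spearmanr, pearsonr

human = [7.35, 8.08, 1.31, 5.77]    # averaged human judgements per pair
system = [0.81, 0.90, 0.12, 0.55]   # computed relatedness scores per pair

rho, _ = spearmanr(human, system)   # rank correlation, used for WordSim-353
r, _ = pearsonr(human, system)      # linear correlation, used for Lee's docs
print(f"Spearman rho = {rho:.3f}, Pearson r = {r:.3f}")
```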

Results On word relatedness (WordSim-353), ESA reaches a Spearman correlation of 0.75, compared with 0.56 for LSA and at most 0.48 for WikiRelate!. On text relatedness (Lee’s document collection), ESA reaches a Pearson correlation of 0.72, compared with 0.60 for LSA. (Full result tables appear in the paper.)

Related Work: Lexical Databases WordNet and Roget’s Thesaurus. Differences from ESA: they are inherently limited to individual words and require word sense disambiguation, whereas ESA provides a much more sophisticated mapping of words to concepts (in WordNet the mapping amounts to a simple lookup, without any weights).

Related Work: LSA LSA is essentially a dimensionality reduction technique that identifies the most prominent dimensions in the data. Its latent dimensions are notoriously difficult to interpret: the computed concepts cannot be readily mapped into the natural concepts manipulated by humans.
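For contrast with ESA’s explicit concepts, here is a minimal LSA sketch using scikit-learn on a toy corpus (an illustrative assumption, not the paper’s setup):

```python
# Minimal LSA sketch for contrast with ESA: the latent dimensions come
# from an SVD of the term-document matrix (toy corpus, for illustration).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["cat feline pet", "mouse rodent pet", "mouse computer device"]
X = TfidfVectorizer().fit_transform(docs)   # term-document TF-IDF matrix
lsa = TruncatedSVD(n_components=2).fit(X)   # 2 latent dimensions
print(lsa.components_)                      # term loadings per dimension
```

Each row of components_ is an anonymous weighted mix of terms; unlike an ESA dimension, it carries no human-readable Wikipedia title, which is the interpretability gap the slide describes.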

Related Work: WikiRelate! [Strube and Ponzetto, 2006] Given a pair of words w1 and w2, WikiRelate! searches for Wikipedia articles p1 and p2 that respectively contain w1 and w2 in their titles; semantic relatedness is then computed using various distance measures between p1 and p2. Differences from ESA: WikiRelate! can only process words that actually occur in article titles, and it is limited to single words.

Conclusions We presented a novel approach to computing the semantic relatedness of natural language texts. Empirical evaluation confirms that using ESA leads to substantial improvements in computing word and text relatedness. The ESA model is easy to explain to human users.