Scott Wen-tau Yih (Microsoft Research) Joint work with Vahed Qazvinian (University of Michigan)


How related are the words “movie” and “popcorn”?

Semantic relatedness covers many word relations, not just similarity [Budanitsky & Hirst 06]:
- Synonymy (noon vs. midday)
- Antonymy (hot vs. cold)
- Hypernymy/Hyponymy (Is-A) (wine vs. gin)
- Meronymy (Part-Of) (finger vs. hand)
- Functional relation (pencil vs. paper)
- Other frequent association (drug vs. abuse)

Applications: text classification, paraphrase detection/generation, textual entailment, …

The physics professor designed his lectures to avoid ____ the material: his goal was to clarify difficult topics, not make them confusing.
(a) theorizing  (b) elucidating  (c) obfuscating  (d) delineating  (e) accosting

The answer word should be semantically related to some keywords in the sentence.

Distributional Hypothesis (Harris 54): words appearing in the same context tend to have similar meaning.

Basic vector space model (Pereira 93; Lin & Pantel 02):
- For each target word, create a term vector using the neighboring words in a corpus.
- The semantic relatedness of two words is measured by the cosine score of the corresponding vectors.
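A minimal sketch of this basic model; the toy corpus and window size here are illustrative, not the paper's setup:

```python
from collections import Counter
import math

def context_vectors(corpus_tokens, window=2):
    """Build a context-count vector for every word in a tokenized corpus."""
    vectors = {}
    for i, word in enumerate(corpus_tokens):
        lo, hi = max(0, i - window), min(len(corpus_tokens), i + window + 1)
        context = corpus_tokens[lo:i] + corpus_tokens[i + 1:hi]
        vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

tokens = ("the jaguar chased prey in the jungle "
          "the leopard chased prey in the forest "
          "the car sped down the road").split()
vecs = context_vectors(tokens, window=2)
print(cosine(vecs["jaguar"], vecs["leopard"]))  # shared contexts -> higher score
print(cosine(vecs["jaguar"], vecs["car"]))      # fewer shared contexts -> lower score
```

Words that occur in overlapping contexts ("jaguar"/"leopard") end up with overlapping vectors, so their cosine is higher than for words with disjoint contexts.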

Representing a multi-sense word (e.g., jaguar) with one vector can be problematic: it violates the triangle inequality.

Multi-prototype VSMs (Reisinger & Mooney 10):
- Sense-specific vectors for each word
- Senses discovered by clustering contexts

Two potential issues in practice:
- Quality depends heavily on the clustering algorithm
- The corpus may not have enough coverage
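A toy sketch of the sense-discovery step, using a simple spherical k-means stand-in for the clustering (the original work's clustering setup differs; all names here are illustrative):

```python
import math
import random
from collections import Counter

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def sense_prototypes(occurrence_contexts, k=2, iters=10, seed=0):
    """Cluster the per-occurrence context vectors (Counters) of one word;
    each resulting centroid is a sense-specific vector."""
    rng = random.Random(seed)
    centroids = rng.sample(occurrence_contexts, k)
    for _ in range(iters):
        # Assign each occurrence to its most similar centroid.
        clusters = [[] for _ in range(k)]
        for ctx in occurrence_contexts:
            best = max(range(k), key=lambda j: cosine(ctx, centroids[j]))
            clusters[best].append(ctx)
        # Recompute each centroid as the merged counts of its members.
        for j, members in enumerate(clusters):
            if members:
                merged = Counter()
                for m in members:
                    merged.update(m)
                centroids[j] = merged
    return centroids

# Contexts of "jaguar": two animal-like, two car-like occurrences.
contexts = [Counter({"cat": 1, "jungle": 1}), Counter({"cat": 1, "prey": 1}),
            Counter({"car": 1, "engine": 1}), Counter({"car": 1, "speed": 1})]
prototypes = sense_prototypes(contexts, k=2)
```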

Novel insight: vectors from different information sources are biased differently (Jaguar: Wikipedia suggests the cat, Bing the car), so heterogeneous vector space models provide complementary coverage of word sense and meaning.

Solution:
- Construct VSMs from a general corpus (Wikipedia), the Web (Bing), and thesauri (Encarta & WordNet)
- Word relatedness measure: average cosine score

Strong empirical results: outperforms existing methods on 2 benchmark datasets.
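The combination step itself is simple. A sketch of the average-cosine measure; averaging only over the VSMs that contain both words is my assumption, not something the slides specify:

```python
import math

def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def relatedness(word1, word2, vsms):
    """Average cosine over the VSMs in which both words appear.
    `vsms` maps a source name (e.g. "wikipedia", "bing") to a
    {word: sparse_vector} dictionary."""
    scores = [cosine(vsm[word1], vsm[word2])
              for vsm in vsms.values()
              if word1 in vsm and word2 in vsm]
    return sum(scores) / len(scores) if scores else 0.0

# Tiny hand-made example: the two sources only partially agree.
vsms = {
    "wikipedia": {"jaguar": {"cat": 2, "jungle": 1}, "leopard": {"cat": 2, "prey": 1}},
    "bing":      {"jaguar": {"car": 3, "engine": 1}, "leopard": {"cat": 1, "prey": 1}},
}
score = relatedness("jaguar", "leopard", vsms)
```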

- Introduction
- Construct heterogeneous vector space models
  - Corpus: Wikipedia
  - Web: Bing search snippets
  - Thesaurus: Encarta & WordNet
- Experimental evaluation
  - Task & datasets
  - Results
- Conclusion

Construction:
- Collect terms within a window of [-10, +10] centered at each occurrence of a target word
- Create a TFIDF term-vector

Refinement:
- Vocabulary trimming (stop-word removal): the top 1500 highest-DF terms are removed from the vocabulary
- Term trimming (local feature selection): keep the top 200 highest-weighted terms for each term-vector

Data: Wikipedia (Nov. 2010), 917M words
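A sketch of this pipeline on a small tokenized corpus; the exact TFIDF variant and the order of the trimming steps are assumptions, and the helper names are mine:

```python
import math
from collections import Counter

def build_corpus_vsm(docs, window=10, df_trim=1500, top_terms=200):
    """Corpus-based VSM sketch: +/-`window` context windows, TFIDF weighting,
    removal of the `df_trim` highest-DF terms, and per-vector term trimming."""
    # Document frequency over the corpus; the highest-DF terms act as stop-words.
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    stopset = {t for t, _ in df.most_common(df_trim)}

    n_docs = len(docs)
    vectors = {}
    for doc in docs:
        for i, word in enumerate(doc):
            lo, hi = max(0, i - window), min(len(doc), i + window + 1)
            ctx = [t for t in doc[lo:i] + doc[i + 1:hi] if t not in stopset]
            vectors.setdefault(word, Counter()).update(ctx)

    # TFIDF weighting, then keep only the top-weighted terms per vector.
    trimmed = {}
    for word, tf in vectors.items():
        weighted = {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}
        trimmed[word] = dict(sorted(weighted.items(),
                                    key=lambda kv: -kv[1])[:top_terms])
    return trimmed

docs = [["the", "jaguar", "stalks", "prey"],
        ["the", "leopard", "stalks", "prey"],
        ["the", "car", "needs", "fuel"]]
vsm = build_corpus_vsm(docs, window=2, df_trim=1, top_terms=3)
```

With `df_trim=1` the ubiquitous "the" is dropped from every context, so each word's vector keeps only its informative neighbors.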

Construction:
- Issue each target word as a query to Bing
- Collect terms in the top 30 snippets
- Create a TFIDF term-vector
- Vocabulary trimming: the top 1000 highest-DF terms are removed
- No term trimming

Compared to the corpus-based VSM, this reflects user preference and may be biased toward different word senses and meanings.

Construction:
- Create a TFIDF “document”-term matrix, where each “document” is a group of synonyms (synset)
- Each word is represented by the corresponding column vector, i.e., the synsets it belongs to

Data:
- WordNet: 227,446 synsets, 190,052 words
- Encarta thesaurus: 46,945 synsets, 50,184 words
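A sketch of the thesaurus-based construction; the slides only specify a TFIDF synset-term matrix, so the particular weighting used below is an assumption:

```python
import math
from collections import Counter

def build_thesaurus_vsm(synsets):
    """Each synonym group (synset) is a "document"; each word is represented
    by its column vector, with one dimension per synset containing it."""
    n = len(synsets)
    df = Counter()  # number of synsets each word belongs to
    for syn in synsets:
        df.update(set(syn))

    vectors = {}
    for j, syn in enumerate(synsets):
        for word in syn:
            # Dimension j (this synset) gets an IDF-style weight.
            vectors.setdefault(word, {})[j] = math.log(1 + n / df[word])
    return vectors

synsets = [
    {"midday", "noon", "noontide"},
    {"hot", "scorching", "spicy"},
    {"spicy", "piquant"},
]
vsm = build_thesaurus_vsm(synsets)
```

Synonyms share a dimension ("noon"/"midday" both live on synset 0), so their cosine is high, while a polysemous word like "spicy" spreads over the two synsets it belongs to.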

- Introduction
- Construct heterogeneous vector space models
  - Corpus: Wikipedia
  - Web: Bing search snippets
  - Thesaurus: Encarta & WordNet
- Experimental evaluation
  - Task & datasets
  - Results
- Conclusion

Data: a list of word pairs with human judgments

Word 1    Word 2       Human score (mean)
midday    noon         9.3
tiger     jaguar       8.0
cup       food         5.0
forest    graveyard    1.9
...       ...          ...

First benchmark dataset: assessed on a 0-10 scale by human judges.

Second benchmark dataset: assessed on a 1-5 scale by 10 Turkers.
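The transcript does not name the evaluation metric, but word-relatedness benchmarks of this kind are standardly scored by the Spearman rank correlation between the model's scores and the human means. A minimal sketch (tie handling omitted; the model scores are hypothetical):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation without tie correction:
    the Pearson correlation of the two rank sequences."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0.0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

human = [9.3, 8.0, 5.0, 1.9]
model = [0.92, 0.85, 0.40, 0.05]   # hypothetical model scores
print(spearman_rho(human, model))  # identical rankings give the maximum score
```

Because only the ranking matters, the model's scores need not be on the same scale as the human judgments, which is why the metric works for both the 0-10 and the 1-5 datasets.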

Conclusion: combining heterogeneous VSMs for measuring word relatedness
- Better coverage of word sense and meaning
- A simple yet effective strategy

Future work:
- Other combination strategies or models
- Extending to longer text segments (e.g., phrases)
- More fine-grained word relations: Polarity Inducing LSA for Synonymy and Antonymy (Yih, Zweig & Platt, EMNLP-2012)