NLP.


NLP

Text Similarity: Thesaurus-based Word Similarity Methods

Quiz
Which pair of words exhibits the greatest similarity?
1. Deer-elk
2. Deer-horse
3. Deer-mouse
4. Deer-roof

Quiz Answer
Which pair of words exhibits the greatest similarity, and why?
1. Deer-elk
2. Deer-horse
3. Deer-mouse
4. Deer-roof
Remember the WordNet tree:

Remember WordNet
A fragment of the WordNet hypernym hierarchy:
ungulate
  even-toed ungulate
    ruminant
      okapi
      deer
        elk
        wapiti
        caribou
      giraffe
  odd-toed ungulate
    equine
      mule
      horse
        pony
      zebra

Path Similarity
Version 1: sim(v,w) = -pathlength(v,w)
Version 2: sim(v,w) = -log pathlength(v,w)
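A minimal sketch of these two path-based measures over the small tree above. The parent links are hand-coded here purely for illustration (a real system would query WordNet instead), and the exact node placement follows the reconstructed tree shown earlier:

```python
import math

# Hand-coded hypernym (parent) links for the small illustrative tree above.
PARENT = {
    "even-toed ungulate": "ungulate", "odd-toed ungulate": "ungulate",
    "ruminant": "even-toed ungulate", "equine": "odd-toed ungulate",
    "okapi": "ruminant", "deer": "ruminant", "giraffe": "ruminant",
    "elk": "deer", "wapiti": "deer", "caribou": "deer",
    "mule": "equine", "horse": "equine", "zebra": "equine", "pony": "horse",
}

def hypernym_path(word):
    """Path from a node up to the root, starting with the node itself."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def path_length(v, w):
    """Number of IS-A edges on the shortest path between v and w."""
    pv, pw = hypernym_path(v), hypernym_path(w)
    shared = set(pv) & set(pw)
    # The lowest common subsumer is the shared ancestor closest to both words.
    lcs = min(shared, key=lambda a: pv.index(a) + pw.index(a))
    return pv.index(lcs) + pw.index(lcs)

def sim_v1(v, w):
    return -path_length(v, w)

def sim_v2(v, w):
    return -math.log(path_length(v, w))

print(sim_v1("deer", "elk"), sim_v1("deer", "horse"))  # -1  -6
print(sim_v2("deer", "elk"), sim_v2("deer", "horse"))  # -0.0  ~-1.79
```

As expected, deer-elk (one edge apart) scores higher than deer-horse, whose shortest path goes all the way up to ungulate and back down.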

Problems with this Approach
- There may be no tree for the specific domain or language.
- A specific word (e.g., a technical term or a proper noun) may not appear in any tree.
- IS-A (hypernym) edges do not all correspond to equal distances in similarity space.

Path Similarity Between Two Words
Version 3 (Philip Resnik): sim(v,w) = -log P(LCS(v,w))
where LCS = lowest common subsumer, e.g., ungulate for deer and horse, and deer for deer and elk.

Information Content
Version 4 (Dekang Lin): WordNet augmented with probabilities (Lin 1998)
IC(c) = -log P(c)
sim(v,w) = 2 × log P(LCS(v,w)) / (log P(v) + log P(w))

WordNet Similarity Software
WordNet::Similarity (Perl): http://www.d.umn.edu/~tpederse/similarity.html
NLTK (Python): http://www.nltk.org
>>> dog.lin_similarity(cat, brown_ic)
0.879
>>> dog.lin_similarity(elephant, brown_ic)
0.531
>>> dog.lin_similarity(elk, brown_ic)
0.475
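For context, a minimal sketch of the NLTK setup behind calls like the ones above. The synset names chosen and the Brown-corpus information-content file are the usual choices, but the exact scores depend on the WordNet and IC versions installed, so treat the printed values as indicative:

```python
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

# One-time downloads, if the data is not already present:
# nltk.download('wordnet'); nltk.download('wordnet_ic')

# Information content estimated from the Brown corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
elephant = wn.synset('elephant.n.01')
elk = wn.synset('elk.n.01')

print(dog.lin_similarity(cat, brown_ic))       # roughly 0.88
print(dog.lin_similarity(elephant, brown_ic))  # roughly 0.53
print(dog.lin_similarity(elk, brown_ic))       # roughly 0.48
```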

NLP

Text Similarity: The Vector Space Model

The Vector Space Model
[Figure: documents d1, d2, d3 plotted as vectors in a space whose axes are the terms t1, t2, t3]

Document Similarity
The vector space model is used in information retrieval to determine which document (d1 or d2) is more similar to a given query q. Note that documents and queries are represented in the same space. Often, the angle between two vectors (or, rather, the cosine of that angle) is used as a proxy for the similarity of the underlying documents.

Cosine Similarity
The cosine measure is computed as the normalized dot product of two vectors:
σ(D,Q) = (D · Q) / (‖D‖ ‖Q‖) = Σᵢ dᵢqᵢ / (√(Σᵢ dᵢ²) · √(Σᵢ qᵢ²))
A variant of the cosine measure is the Jaccard coefficient:
σ(D,Q) = |D ∩ Q| / |D ∪ Q|

Example
What is the cosine similarity between
D = "cat, dog, dog" = <1, 2, 0>
Q = "cat, dog, mouse, mouse" = <1, 1, 2>
where the vector components are the counts of cat, dog, and mouse?
Answer:
σ(D,Q) = (1×1 + 2×1 + 0×2) / (√(1²+2²+0²) · √(1²+1²+2²)) = 3 / (√5 · √6) ≈ 0.55
In comparison:
σ(D,D) = (1×1 + 2×2 + 0×0) / (√(1²+2²+0²) · √(1²+2²+0²)) = 5 / (√5 · √5) = 1
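A quick way to check these numbers, as a minimal sketch using NumPy; the vectors are the term-count vectors from the example:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: the normalized dot product of two vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

D = [1, 2, 0]  # "cat, dog, dog"          -> counts of (cat, dog, mouse)
Q = [1, 1, 2]  # "cat, dog, mouse, mouse" -> counts of (cat, dog, mouse)

print(round(cosine(D, Q), 2))  # 0.55
print(round(cosine(D, D), 2))  # 1.0
```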

Quiz
Given the three documents D1, D2, and D3, compute the cosine scores σ(D1,D2) and σ(D1,D3). What do the numbers tell you?

Answers to the Quiz
σ(D1,D2) = 1: one of the two documents is a scaled version of the other.
σ(D1,D3) = 0.6: swapping the two dimensions results in a lower similarity.

Quiz
What is the range of values that the cosine score can take?

Answer to the Quiz
In general, the cosine function has a range of [-1, 1]. However, when both vectors lie in the non-negative orthant (as they do here, since all word counts are non-negative), the range is [0, 1].

Text Similarity: The Vector Space Model Applied to Word Similarity

Distributional Similarity
Two words that appear in similar contexts are likely to be semantically related, e.g.:
- schedule a test drive and investigate Honda's financing options
- Volkswagen debuted a new version of its front-wheel-drive Golf
- the Jeep reminded me of a recent drive
- our test drive took place at the wheel of a loaded Ford EL model
"You shall know a word by the company it keeps." (J.R. Firth, 1957)

Distributional Similarity
The context can be any of the following:
- the word before the target word
- the word after the target word
- any word within n words of the target word (see the sketch below)
- any word in a specific syntactic relationship with the target word (e.g., the head of its dependency, or the subject of the sentence)
- any word within the same sentence
- any word within the same document
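A minimal sketch of the "within n words" context definition, collecting co-occurrence counts from a toy corpus. The corpus and the window size are illustrative choices, not part of the original slides:

```python
from collections import defaultdict

def context_counts(sentences, window=2):
    """Count, for each target word, the words that appear within `window`
    positions of it (a simple bag-of-words context)."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

corpus = [
    "schedule a test drive today".split(),
    "our test drive took place yesterday".split(),
]
print(dict(context_counts(corpus)["drive"]))
# {'a': 1, 'test': 2, 'today': 1, 'our': 1, 'took': 1, 'place': 1}
```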

Association Strength
Frequency matters: we want to ignore spurious word pairings. However, frequency alone is not sufficient. A common technique is to use pointwise mutual information (PMI), where w is a word and c is a feature from its context:
PMI(w,c) = log [ P(w,c) / (P(w) P(c)) ]
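A minimal sketch of PMI computed from co-occurrence counts, building on the window-based counts sketched earlier. The toy counts and the unsmoothed probability estimates are illustrative only:

```python
import math
from collections import defaultdict

def pmi_scores(counts):
    """Compute PMI(w, c) = log [ P(w,c) / (P(w) P(c)) ] from a nested dict of
    co-occurrence counts: counts[w][c] = times context feature c occurred with w."""
    total = sum(n for ctx in counts.values() for n in ctx.values())
    w_totals = {w: sum(ctx.values()) for w, ctx in counts.items()}
    c_totals = defaultdict(int)
    for ctx in counts.values():
        for c, n in ctx.items():
            c_totals[c] += n

    pmi = defaultdict(dict)
    for w, ctx in counts.items():
        for c, n in ctx.items():
            p_wc = n / total
            p_w = w_totals[w] / total
            p_c = c_totals[c] / total
            pmi[w][c] = math.log(p_wc / (p_w * p_c))
    return pmi

# Toy counts: how often each context word appeared near each target word.
counts = {
    "drive": {"test": 4, "wheel": 2, "the": 6},
    "walk":  {"park": 3, "the": 5},
}
print(pmi_scores(counts)["drive"]["test"])  # positive: strong association
print(pmi_scores(counts)["drive"]["the"])   # near zero: weak association
```

Note how the frequent but uninformative context word "the" gets a PMI near zero, while the rarer but characteristic "test" scores clearly positive, which is exactly why PMI is preferred over raw frequency.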

NLP