From Frequency to Meaning: Vector Space Models of Semantics

From Frequency to Meaning: Vector Space Models of Semantics
Peter D. Turney, Patrick Pantel

VSM Applications
- Originally developed for the SMART information retrieval system, which used VSMs to find documents related to a search query
- VSM techniques are used in modern search engines
- More recently applied to multiple-choice question answering (TOEFL, SAT)

Why VSMs?
- Extract knowledge automatically from corpora, so they don't require hand-coded knowledge or ontologies
- Word similarity tasks typically require large lexicons (like WordNet), which are much harder to come by than large corpora
- VSMs are successful in measuring similarity of meaning between words, phrases, and documents

Types of VSMs
Three types of matrices:
- Term-document
- Word-context
- Pair-pattern

Term-Document VSM
- Row vectors: terms
- Column vectors: documents

Term-Document VSM
Relies upon the bag of words hypothesis:
- The frequency of a word in a document tends to indicate the relevance of the document to a query.
The column vector in a term-document matrix tends to indicate what the document is about.

Word-Context VSM
- Rows: words
- Columns: contexts
Essentially the same as the term-document matrix, but the focus is on the row vectors.

Word-Context VSM
Relies upon the distributional hypothesis:
- Words that occur in similar contexts tend to have similar meanings.
Similarity of row vectors in a word-context matrix indicates similarity of word meanings.
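
A minimal sketch (in Python, which is an assumption; the slides prescribe no implementation) of building word-context counts with a fixed-size sliding window; the toy corpus and window size are illustrative:

```python
from collections import Counter, defaultdict

def word_context_counts(tokenized_docs, window=2):
    """For each word, count the words appearing within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in tokenized_docs:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

docs = [["the", "mason", "cuts", "the", "stone"],
        ["the", "carpenter", "works", "with", "wood"]]
print(word_context_counts(docs)["mason"])   # context counts for the row vector of "mason"
```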

Pair-Pattern VSM
- Rows: pairs of words (mason:stone, carpenter:wood)
- Columns: patterns (“X cuts Y”, “X works with Y”)

                    X cuts Y   X works with Y   X breaks Y
    mason:stone         3             2              3
    carpenter:wood      2             2              3
    mason:wood          0             0              0

Pair-Pattern VSM
Similarity of columns in a pair-pattern matrix indicates similarity of the patterns.
Extended distributional hypothesis: patterns that co-occur with similar pairs tend to have similar meanings.
The patterns “X solves Y” and “Y is solved by X” tend to co-occur with the same pairs, which indicates that they have similar meanings.

Pair-Pattern VSM
Similarity of rows in a pair-pattern matrix indicates similarity of word pairs.
Latent relation hypothesis: pairs of words that co-occur in similar patterns tend to have similar semantic relations.
This is the inverse of the previous hypothesis.

Linguistic Processing
- Tokenization
- Normalization
- Annotation

Mathematical Processing
- Generate frequencies from corpora
- Adjust weights of elements in the matrix
- Smoothing
- Measure similarity of vectors

Building Frequency Matrix
Scan through the corpus, recording events and their frequencies in a hash table or database. Use the resulting data structure to build a frequency matrix.
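
A minimal sketch of the scan-then-build step, using an in-memory Counter per document as the hash table; the toy corpus is illustrative:

```python
from collections import Counter

def build_term_document_matrix(tokenized_docs):
    """Scan the corpus once, then lay the counts out as a dense frequency matrix."""
    counts = [Counter(tokens) for tokens in tokenized_docs]   # one hash table per document
    terms = sorted({t for c in counts for t in c})            # row labels
    matrix = [[c[t] for c in counts] for t in terms]          # rows: terms, columns: documents
    return terms, matrix

docs = [["the", "mason", "cuts", "the", "stone"],
        ["the", "carpenter", "works", "with", "wood"]]
terms, F = build_term_document_matrix(docs)
```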

Weighting Elements
- Give more weight to surprising events and less weight to expected events
- Surprising events are more discriminative of semantic similarity
- In the context of 'rat' and 'mouse', 'dissect' and 'exterminate' are much more discriminative than 'have' and 'like'

Weighting Elements
- Term frequency * inverse document frequency (tf-idf): an element gets a high weight if the corresponding term occurs frequently in the corresponding document but is rare in other documents.
- Pointwise Mutual Information (PMI)
- Length normalization: long documents tend to be favored, which is corrected by length normalization.
- Term weighting: terms like 'hostage' and 'hostages' tend to be correlated but are not normalized to the same term, so their weights are reduced when they co-occur in a document.
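
A minimal sketch of one common tf-idf formulation applied to a raw frequency matrix; exact weighting formulas vary in practice, so treat this as illustrative:

```python
import math

def tfidf(F):
    """F[i][j] = raw frequency of term i in document j."""
    n_docs = len(F[0])
    weighted = []
    for row in F:
        df = sum(1 for f in row if f > 0)            # number of documents containing the term
        idf = math.log(n_docs / df) if df else 0.0   # rare terms get a larger idf
        weighted.append([f * idf for f in row])      # tf * idf
    return weighted

W = tfidf([[2, 0, 1],
           [1, 1, 1]])   # a term in every document ends up with weight 0
```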

Smoothing
- Improve performance by limiting the number of vector components.
- Vectors that do not share any non-zero coordinates are unrelated and can be ignored.
- Smoothing algorithms set elements below a certain weight to zero and keep the elements that are more relevant.
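
A minimal sketch of the thresholding step; the cutoff value is an arbitrary assumption:

```python
def threshold(matrix, cutoff=0.1):
    """Set elements below the cutoff weight to zero, keeping only the more relevant elements."""
    return [[w if w >= cutoff else 0.0 for w in row] for row in matrix]
```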

Singular Value Decomposition
Decomposes the matrix into three matrices (U, Σ, V), where two are in column-orthonormal form and the third is a diagonal matrix of singular values.
- Σ_k: diagonal matrix of the top k singular values
- U_k, V_k: the corresponding columns of U and V
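
A minimal sketch of truncated SVD using NumPy (the library choice is an assumption, not something the slides specify):

```python
import numpy as np

def truncated_svd(X, k):
    """Keep only the top-k singular values and the corresponding columns of U and V."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # U_k, Sigma_k, V_k

X = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0],
              [0.0, 2.0, 0.0]])
U_k, S_k, V_k = truncated_svd(X, k=2)
X_hat = U_k @ S_k @ V_k.T   # rank-2 approximation of X
```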

Comparing the Vectors
- The most popular way: take the cosine of the angle between two vectors.
- Vectors of frequent words tend to be long, while vectors of rare words tend to be short.
- The cosine ensures that vector length is irrelevant to similarity, since it measures only the angle between the vectors.
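
A minimal sketch of the cosine measure between two row vectors, here applied to the mason:stone and carpenter:wood rows from the earlier pair-pattern example:

```python
import math

def cosine(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||); length drops out, only direction matters."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

print(cosine([3, 2, 3], [2, 2, 3]))   # similarity of mason:stone and carpenter:wood
```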

Efficient Comparisons
- Sparse matrix multiplication: throw out vectors that don't share any non-zero coordinates
- Distributed implementation using MapReduce
- Randomized algorithms: reduce large vectors to small vectors while losing minimal information
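
A minimal sketch of random projection, one standard randomized approach; the Gaussian construction and dimensions here are illustrative assumptions:

```python
import numpy as np

def random_projection(X, k, seed=0):
    """Project the d-dimensional rows of X down to k dimensions with a random Gaussian matrix.
    Pairwise angles and distances are approximately preserved (Johnson-Lindenstrauss)."""
    rng = np.random.default_rng(seed)
    R = rng.normal(scale=1.0 / np.sqrt(k), size=(X.shape[1], k))
    return X @ R

X = np.random.default_rng(1).normal(size=(100, 1000))  # 100 long vectors
X_small = random_projection(X, k=100)                   # 100 short vectors
```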

Implementations
Term-document matrix: Lucene
- Content: webpages, PDF documents, images, video, etc.
- Fields: developer-defined parts of these content elements
- Columns: documents; rows: fields

Implementations
Word-context matrix: Semantic Vectors
- An open-source project that implements VSMs and random projection to measure word similarity
- Uses Lucene to create a term-document matrix, then uses random projection to reduce its dimensionality

Implementations
Pair-pattern matrix: Latent Relational Analysis
- Builds a pair-pattern matrix using a textual corpus as input
- Uses WordNet to mitigate sparseness: <Korea, Japan> is expanded to <South Korea, Japan>, <Republic of Korea, Japan>, <Korea, Nippon>, etc.

Applications
Term-document matrix:
- Document retrieval
- Document clustering
- Document classification
- Essay grading
- Document segmentation
- Question answering (QA)
- Call routing

Applications
Word-context matrix:
- Word similarity
- Word clustering
- Word classification
- Automatic thesaurus generation
- Word sense disambiguation (WSD)
- Context-sensitive spelling correction
- Semantic role labeling (SRL)
- Query expansion
- Textual advertising
- Information extraction / named entity recognition (NER)

Applications
Pair-pattern matrix:
- Relational similarity
- Pattern similarity
- Relational clustering
- Relational classification
- Relational search
- Automatic thesaurus generation
- Analogical mapping

The Future
VSMs typically don't account for word order (pair-pattern matrices do). Researchers (Clark and Pulman 2007; Widdows and Ferraro 2008) are working on handling word order in VSMs.