Katrin Erk Distributional models

Representing meaning through collections of words

Doc 1: Abdullah boycotting challenger commission dangerous election interest Karzai necessary runoff Sunday
Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild
Doc 3: applications documents engines information iterated library metadata precision query statistical web
Doc 4: cucumbers crops discoveries enjoyable fill garden leafless plotting Vermont you

Representing meaning through collections of words

Doc 1: Abdullah boycotting challenger commission dangerous election interest Karzai necessary runoff Sunday (Washington Post, Oct 24, 2009, on elections in Afghanistan)
Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild (Wikipedia, version Oct 24, 2009, on the movie “Where the Wild Things Are”)
Doc 3: applications documents engines information iterated library metadata precision query statistical web (Wikipedia, version Oct 24, 2009, on Information Retrieval)
Doc 4: cucumbers crops discoveries enjoyable fill garden leafless plotting Vermont you (garden.org: Planning a Vegetable Garden)

Representing meaning through a collection of words

- What parts of the meaning of a document can you capture through an unordered collection of words?
- How can you make use of such collections?

Representing meaning through a collection of words

- What parts of the meaning of a document can you capture through an unordered collection of words?
  General topic information: what is the document about? More specifically: things mentioned in the document.
- How can you make use of such collections?
  Documents on similar topics contain similar words. Use in Information Retrieval (search).

Representing collections of words through tables of counts

Doc 2: animatronics are children’s igloo intimidating Jonze kingdom smashing Wild

[Table: counts for Doc 2 over the words film, wild, max, that, things, as, [edit], him, jonze, released]

Representing collections of words through tables of counts

We can now compare documents by comparing tables of counts. What can you tell about the second document below?

[Two tables: counts for two documents over the words film, wild, max, that, things, as, [edit], him, jonze, released]

The “second document”: a more extensive list of words

the 167, and 58, of 58, to 56, a 49, in 37, as 36, is 33, victor 30, * 27, with 26, by 23, her 18, film 17, for 16, emily 15, was 15, corpse 14, bride 13, victoria 13, his 13, on 13, from 11

What movie is this?

From tables to vectors

Interpret the table as a vector, where each entry is a dimension:
- “film” is a dimension. Document’s coordinate: 24
- “wild” is a dimension. Document’s coordinate: 18
- …
Then this document is a point in 10-dimensional space.

[Table: counts over the words film, wild, max, that, things, as, [edit], him, jonze, released]
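A minimal sketch of this step, assuming a toy document string and a fixed list of dimension words (both invented for illustration, not the slide's exact data):

```python
from collections import Counter

# Illustrative dimensions; in practice these come from the corpus vocabulary.
DIMENSIONS = ["film", "wild", "max", "that", "things", "as", "him", "jonze", "released"]

def count_vector(text, dimensions=DIMENSIONS):
    """Turn a document into a vector of raw word counts, one coordinate per dimension."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in dimensions]

doc = "the wild things are wild and the film was released"
print(count_vector(doc))  # [1, 2, 0, 0, 1, 0, 0, 0, 1]
```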

Documents as points in vector space

Viewing “Where the Wild Things Are” and “Corpse Bride” as vectors/points in vector space: similarity between them as proximity in space.

[Figure: the two documents plotted as points in a 2D vector space]

“Distributional model”, “vector space model”, and “semantic space model” are used interchangeably here.

What have we gained?

- The representation of a document in vector space can be computed completely automatically: just count words.
- Similarity in vector space is a good predictor of similarity in topic: documents that contain similar words tend to be about similar things.

What do we mean by “similarity” of vectors?

Euclidean distance (a dissimilarity measure!)

[Figure: “Corpse Bride” and “Where the Wild Things Are” as points, with the distance between them]
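Written out, the Euclidean distance between two n-dimensional count vectors $\vec{x}$ and $\vec{y}$ is the standard formula:

$$d(\vec{x}, \vec{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$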

What do we mean by “similarity” of vectors?

Cosine similarity

[Figure: “Corpse Bride” and “Where the Wild Things Are” as points, with the angle between their vectors]
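Cosine similarity measures the angle between the two vectors rather than the distance between the points:

$$\cos(\vec{x}, \vec{y}) = \frac{\sum_{i} x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$$

A minimal sketch computing both measures; the two count vectors below are invented for illustration, not the slides' actual counts:

```python
import math

def euclidean_distance(x, y):
    """Dissimilarity: straight-line distance between two vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_similarity(x, y):
    """Similarity: cosine of the angle between two vectors (1 = same direction)."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi ** 2 for xi in x))
    norm_y = math.sqrt(sum(yi ** 2 for yi in y))
    return dot / (norm_x * norm_y)

wild_things = [24, 18, 12, 30, 20, 15, 5, 8, 9, 10]   # illustrative counts
corpse_bride = [17, 0, 0, 33, 2, 36, 4, 13, 0, 11]    # illustrative counts
print(euclidean_distance(wild_things, corpse_bride))
print(cosine_similarity(wild_things, corpse_bride))
```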

What have we gained?

- We can compute the similarity of documents through their Euclidean distance or through their cosine.
- We can also represent a query as a vector: just count the words in the query.
- Now we can search for documents similar to the query.

From documents to words

The same holds for words as for documents:
- Context words are a good indicator of meaning.
- Similar words tend to occur in similar contexts.

What is a context? How do we count here?
- Take all the occurrences of our target word in a large text.
- Take a context window, e.g. 10 words either side.
- Count all that occurs there.

Representing the meaning of a word through a collection of context words

Emerging from the earth is Emily, the "Corpse Bride," a beautiful undead girl in a moldy bridal gown who declares Victor her husband.

[Table: counts for target “Emily”, 10 words of context either side, over context words such as a, the, corpse, emerging, from, is, undead, beautiful, moldy, bride, in, earth, girl]

Representing the meaning of a word through a collection of context words

- Go through all occurrences of “Emily” in a large corpus.
- Count words in a 10-word window for each occurrence, and sum up.

[Table: summed counts over context words such as a, the, corpse, emerging, from, is, undead, beautiful, moldy, bride, in, earth, girl]
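A minimal sketch of this counting procedure, assuming the corpus is already tokenized into a list of lowercased words (tokenization and corpus loading are left out):

```python
from collections import Counter

def context_vector(tokens, target, window=10):
    """Sum counts of all words within `window` tokens on either side of each
    occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

tokens = "emerging from the earth is emily the corpse bride a beautiful undead girl".split()
print(context_vector(tokens, "emily"))
```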

Some co-occurrences: “letter” in “Pride and Prejudice”

jane: 12, when: 14, by: 15, which: 16, him: 16, with: 16, elizabeth: 17, but: 17, he: 17, be: 18, s: 20, on: 20, not: 21, for: 21, mr: 22, this: 23, as: 23, you: 25, from: 28, i: 28, had: 32, that: 33, in: 34, was: 34, it: 35, his: 36, she: 41, her: 50, a: 52, and: 56, of: 72, to: 75, the: 102

This is not a large text! Large = something like 100 million words at least.

From tables to vectors

Interpret the table as a vector, where each entry is a dimension:
- “admirer” is a dimension. Coordinate of “letter”: 1. Coordinate of “surprise”: 0.
- “all” is a dimension. Coordinate of “letter”: 8. Coordinate of “surprise”: 7.
- …
Then each word is a point in n-dimensional space.

Counts for “letter” and “surprise” from Pride and Prejudice.

What have we gained?

- The representation of a word in vector space can be computed completely automatically: just count co-occurring words in all contexts.
- Similarity in vector space is a good predictor of meaning similarity: words that occur in similar contexts tend to be similar in meaning.
- Synonyms are close together in vector space. Antonyms too.

Parameters of vector space models

W. Lowe (2001): “Towards a theory of semantic space”. A semantic space is defined as a tuple (A, B, S, M):
- B: base elements
- A: mapping from raw co-occurrence counts to something else, to correct for frequency effects
- S: similarity measure
- M: transformation of the whole space to different dimensions

B: base elements

We have seen: context words as base elements.

Term x document matrix:
- Represent a document as a vector of weighted terms
- Represent a term as a vector of weighted documents

B: base elements

Dimensions: not words in a context window, but dependency paths starting from the target word (Padó & Lapata 2007).

A: transforming raw counts

Problem with vectors of raw counts: distortion through the frequency of the target word.

Weight the counts: the count on the dimension “and” will not be as informative as that on the dimension “angry”. For example, use Pointwise Mutual Information between target a and context word b, as written out below.
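The standard PMI formula, where $P(a,b)$ is the probability of seeing context word $b$ in the window around target $a$, and $P(a)$, $P(b)$ are the marginal probabilities:

$$\mathrm{PMI}(a, b) = \log \frac{P(a, b)}{P(a)\,P(b)}$$

Frequent function words like “and” co-occur with almost everything, so $P(a,b) \approx P(a)P(b)$ and their PMI is close to zero, while informative context words like “angry” receive a higher weight.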

M: transforming the whole space

Dimensionality reduction:
- Principal Component Analysis (PCA)
- Singular Value Decomposition (SVD)

Latent Semantic Analysis, LSA (also called Latent Semantic Indexing, LSI): do SVD on the term x document representation to induce “latent” dimensions that correspond to topics a document can be about (Landauer & Dumais 1997).
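A minimal LSA-style sketch, assuming numpy is available; the term x document matrix is a toy example, not the slides' data:

```python
import numpy as np

# Toy term x document matrix: rows = terms, columns = documents.
counts = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [0, 4, 0, 2],
    [0, 3, 1, 2],
], dtype=float)

# SVD: counts = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(counts, full_matrices=False)

# Keep only the k largest singular values -> k "latent" dimensions.
k = 2
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T   # each row: a document in latent space
term_vectors = U[:, :k] * s[:k]             # each row: a term in latent space
print(doc_vectors)
```

Documents (and terms) can then be compared by cosine similarity in this reduced space.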

Using similarity in vector spaces

Search / information retrieval: given a query and a document collection,
- use the term x document representation: each document is a vector of weighted terms;
- also represent the query as a vector of weighted terms;
- retrieve the documents that are most similar to the query (a sketch follows below).
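A minimal retrieval sketch along these lines; it uses raw counts rather than weighted terms, and the documents and query are invented for illustration:

```python
from collections import Counter
import math

def cosine(c1, c2):
    """Cosine similarity between two Counter-based count vectors."""
    dot = sum(c1[w] * c2[w] for w in c1)
    n1 = math.sqrt(sum(v * v for v in c1.values()))
    n2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

documents = {
    "doc1": "election runoff challenger commission election",
    "doc2": "wild things film jonze children",
    "doc3": "query documents retrieval precision web",
}
query = "film for children"

query_vec = Counter(query.lower().split())
ranked = sorted(documents.items(),
                key=lambda item: cosine(query_vec, Counter(item[1].lower().split())),
                reverse=True)
print(ranked[0][0])  # the document most similar to the query (here: doc2)
```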

Using similarity in vector spaces

To find synonyms:
- Synonyms tend to have more similar vectors than non-synonyms: synonyms occur in the same contexts.
- But the same holds for antonyms: in vector spaces, “good” and “evil” are the same (more or less).
So: vector spaces can be used to build a thesaurus automatically.

Using similarity in vector spaces

In cognitive science: to predict human judgments on how similar pairs of words are (on a scale of 1-10), and to model semantic “priming”.

An automatically extracted thesaurus

Dekang Lin 1998: for each word, automatically extract similar words.
- Vector space representation based on the syntactic context of the target (dependency parses)
- Similarity measure based on mutual information (“Lin’s measure”)
A large thesaurus, used often in NLP applications.

Vectors for word senses

Up to now: one vector per word. The vector for “bank” conflates financial contexts and fishing contexts. How do we get to vectors for word senses?

Automatically inducing word senses

Schütze 1998: one vector per sentence, or per occurrence (token) of “letter”:
- She wrote an angry letter to her niece.
- He sprayed the word in big letters.
- The newspaper gets 100 letters from readers every day.

Make a token vector by adding up the vectors of all other (content) words in the sentence. Cluster the token vectors; clusters = induced word senses. A sketch follows below.
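A minimal sketch of this idea, assuming precomputed context-word vectors stored in a dict and using scikit-learn's KMeans for the clustering step; the vectors and the choice of two clusters are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative word vectors (in practice: co-occurrence vectors from a corpus).
word_vectors = {
    "angry": np.array([0.9, 0.1]), "niece": np.array([0.8, 0.2]),
    "sprayed": np.array([0.1, 0.9]), "word": np.array([0.2, 0.8]),
    "newspaper": np.array([0.85, 0.15]), "readers": np.array([0.7, 0.3]),
}

# Content words around three occurrences of "letter".
sentences = [
    ["wrote", "angry", "niece"],
    ["sprayed", "word", "big"],
    ["newspaper", "readers", "day"],
]

# Token vector = sum of the vectors of the other (content) words in the sentence.
token_vectors = np.array([
    sum(word_vectors.get(w, np.zeros(2)) for w in sent) for sent in sentences
])

# Cluster token vectors; each cluster is an induced sense of "letter".
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(token_vectors)
print(labels)  # e.g. [0, 1, 0]: occurrences 1 and 3 grouped into one sense
```

Under these toy vectors, the two correspondence uses of “letter” end up in one cluster and the letters-of-the-alphabet use in another.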

A vector for an individual occurrence of a word

Avoid having to define word senses: sometimes it is hard to divide uses into senses (words like “leave” or “paint”).

Erk & Padó 2008: modify the vector of “bank” using its syntactic context.

[Diagram: the vector for “bank” modified by its syntactic context, e.g. “bank” as object of “break” vs. “bank” with “fish on”]

Summary: vector space models

Representing meaning through counts:
- Represent a document through its content words
- Represent word meaning through context words / parse tree snippets / documents
- Context items as dimensions, the target as a vector/point in semantic space
- Proximity in semantic space ~ similarity between words

Summary: vector space models

Uses:
- Search
- Inducing ontologies
- Modeling human judgments of word similarity
- Representing word senses: cluster sentence vectors, or compute vectors for individual occurrences