The Vector Space Model LBSC 796/CMSC828o Session 3, February 9, 2004 Douglas W. Oard.

Presentation transcript:

Agenda
- Thinking about search
  – Design strategies
  – Decomposing the search component
- Boolean “free text” retrieval
  – The “bag of terms” representation
  – Proximity operators
- Ranked retrieval
  – Vector space model
  – Passage retrieval

Supporting the Search Process
[Diagram: on the user side, the search process runs Source Selection → Query Formulation (information need → Query) → Search → Selection (Ranked List) → Examination (Document) → Document Delivery, with feedback loops back to earlier stages. On the system side, Acquisition builds the Collection, Indexing builds the Index, and the IR System matches queries against that index.]

Design Strategies
- Foster human-machine synergy
  – Exploit complementary strengths
  – Accommodate shared weaknesses
- Divide-and-conquer
  – Divide the task into stages with well-defined interfaces
  – Continue dividing until the problems are easily solved
- Co-design related components
  – An iterative process of joint optimization

Human-Machine Synergy
- Machines are good at:
  – Doing simple things accurately and quickly
  – Scaling to larger collections in sublinear time
- People are better at:
  – Accurately recognizing what they are looking for
  – Evaluating intangibles such as “quality”
- Both are pretty bad at:
  – Mapping consistently between words and concepts

Divide and Conquer
- Strategy: use encapsulation to limit complexity
- Approach:
  – Define interfaces (input and output) for each component (e.g., a query interface: input terms, output a representation)
  – Define the functions performed by each component (remove common words, weight rare terms higher, …)
  – Repeat the process within components as needed
- Result: a hierarchical decomposition

Search Goal
- Choose the same documents a human would
  – Without human intervention (less work)
  – Faster than a human could (less time)
  – As accurately as possible (in practice, with less accuracy)
- Humans start with an information need; machines start with a query
- Humans match documents to information needs; machines match document and query representations

Search Component Model
[Diagram: an Information Need passes through Query Formulation (guided by Human Judgment) to become a Query; Query Processing applies a Representation Function to produce a Query Representation. In parallel, Document Processing applies a Representation Function to each Document to produce a Document Representation. A Comparison Function combines the two representations into a Retrieval Status Value, the system's stand-in for the Utility of the document to the user.]

Relevance
- Relevance relates a topic and a document
  – Duplicates are equally relevant, by definition
  – Constant over time and across users
- Pertinence relates a task and a document
  – Accounts for quality, complexity, language, …
- Utility relates a user and a document
  – Accounts for prior knowledge
- We seek utility, but relevance is what we get!

“Bag of Terms” Representation
- Bag = a “set” that can contain duplicates
  – “The quick brown fox jumped over the lazy dog’s back” → {back, brown, dog, fox, jump, lazy, over, quick, the, the}
- Vector = values recorded in any consistent order
  – {back, brown, dog, fox, jump, lazy, over, quick, the, the} → [1 1 1 1 1 1 1 1 2]
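A minimal sketch of this representation in Python; the tokenizer, the possessive stripping, and the toy stemming of “jumped” to “jump” are simplifying assumptions, not anything the slides specify:

    from collections import Counter

    def bag_of_terms(text):
        """Lowercase, strip punctuation and possessives, crudely stem."""
        terms = []
        for token in text.lower().replace("'s", "").split():
            token = token.strip(".,!?")
            if token == "jumped":      # toy stemmer: just enough for this example
                token = "jump"
            terms.append(token)
        return Counter(terms)          # a bag: terms with duplicate counts

    bag = bag_of_terms("The quick brown fox jumped over the lazy dog's back")
    vocab = sorted(bag)                # any consistent order works
    print(vocab)                       # ['back', 'brown', ..., 'quick', 'the']
    print([bag[t] for t in vocab])     # [1, 1, 1, 1, 1, 1, 1, 1, 2]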

Bag of Terms Example
Document 1: “The quick brown fox jumped over the lazy dog’s back.”
Document 2: “Now is the time for all good men to come to the aid of their party.”

  Term     Document 1   Document 2
  the          2            2       ← stopword list
  quick        1            0
  brown        1            0
  fox          1            0
  over         1            0
  lazy         1            0
  dog          1            0
  back         1            0
  now          0            1
  is           0            1       ← stopword list
  time         0            1
  for          0            1       ← stopword list
  all          0            1
  good         0            1
  men          0            1
  to           0            2       ← stopword list
  come         0            1
  jump         1            0
  aid          0            1
  of           0            1       ← stopword list
  their        0            1
  party        0            1

Boolean “Free Text” Retrieval
- Limit the bag of words to “absent” and “present”
  – “Boolean” values, represented as 0 and 1
- Represent terms as a “bag of documents”
  – Same representation, but rows rather than columns
- Combine the rows using “Boolean operators”
  – AND, OR, NOT
- Result set: every document with a 1 remaining
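Reading the term-document matrix row-wise makes each term a set of documents, and the Boolean operators become plain set operations. A minimal sketch over the two example documents (only a few hypothetical postings rows are shown):

    # Each term maps to the set of documents that contain it.
    postings = {
        "fox":  {1},
        "dog":  {1},
        "time": {2},
        "good": {2},
    }
    all_docs = {1, 2}

    dog, fox, time = postings["dog"], postings["fox"], postings["time"]
    print(dog & fox)        # AND: intersection -> {1}
    print(dog | time)       # OR: union -> {1, 2}
    print(time - fox)       # A NOT B: difference -> {2}
    print(all_docs - time)  # NOT B: complement -> {1}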

Boolean Operators
[Venn diagrams: A OR B (union of A and B), A AND B (intersection), A NOT B (A minus B, equivalent to A AND NOT B), and NOT B (everything outside B).]

Boolean Free Text Example
Terms: quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party
[Table: 0/1 term-document incidence matrix over Doc 1 through Doc 8; the individual 0/1 values did not survive extraction.]
- dog AND fox → Doc 3, Doc 5
- dog NOT fox → empty
- fox NOT dog → Doc 7
- dog OR fox → Doc 3, Doc 5, Doc 7
- good AND party → Doc 6, Doc 8
- good AND party NOT over → Doc 6

Why Boolean Retrieval Works
- Boolean operators approximate natural language
  – Find documents about a good party that is not over
- AND can discover relationships between concepts
  – good party
- OR can discover alternate terminology
  – excellent party
- NOT can discover alternate meanings
  – Democratic party

The Perfect Query Paradox
- Every information need has a perfect document set
  – If not, there would be no sense doing retrieval
- Almost every document set has a perfect query
  – AND together every word in document 1 to get a query that matches it
  – Repeat for each document in the set
  – OR together those document queries to get the set query
- But users find Boolean query formulation hard
  – They get too much, too little, useless stuff, …

Why Boolean Retrieval Fails
- Natural language is way more complex
  – “She saw the man on the hill with a telescope”
- AND “discovers” nonexistent relationships
  – Terms in different paragraphs, chapters, …
- Guessing terminology for OR is hard
  – good, nice, excellent, outstanding, awesome, …
- Guessing terms to exclude is even harder!
  – Democratic party, party to a lawsuit, …

Proximity Operators
- More precise versions of AND
  – “NEAR n” allows at most n-1 intervening terms
  – “WITH” requires terms to be adjacent and in order
- Easy to implement, but less efficient
  – Store a list of positions for each word in each document; stopwords become very important!
  – Perform normal Boolean computations, treating WITH and NEAR like AND with an extra constraint
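A sketch of proximity evaluation over a positional index. Positions count every word, stopwords included, which is why stopwords matter here; the function names are illustrative:

    def positions(text):
        """Map each word to its 1-based positions in the text."""
        index = {}
        for i, word in enumerate(text.lower().replace("'s", "").split(), start=1):
            index.setdefault(word, []).append(i)
        return index

    doc1 = positions("The quick brown fox jumped over the lazy dog's back")

    def near(index, a, b, n):
        """NEAR n: at most n-1 intervening terms, in either order."""
        return any(abs(i - j) <= n
                   for i in index.get(a, []) for j in index.get(b, []))

    def with_(index, a, b):
        """WITH: a immediately followed by b."""
        return any(j - i == 1
                   for i in index.get(a, []) for j in index.get(b, []))

    print(near(doc1, "quick", "fox", 2))  # True: positions 2 and 4
    print(with_(doc1, "quick", "fox"))    # False: "brown" intervenes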

Proximity Operator Example
- time AND come → Doc 2
- time (NEAR 2) come → empty (positions 4 and 10 are too far apart)
- quick (NEAR 2) fox → Doc 1 (positions 2 and 4)
- quick WITH fox → empty (“brown” intervenes)

  Term    Doc 1    Doc 2
  quick   1 (2)
  brown   1 (3)
  fox     1 (4)
  over    1 (6)
  lazy    1 (8)
  dog     1 (9)
  back    1 (10)
  now              1 (1)
  time             1 (4)
  all              1 (6)
  good             1 (7)
  men              1 (8)
  come             1 (10)
  jump    1 (5)
  aid              1 (13)
  their            1 (15)
  party            1 (16)

Strengths and Weaknesses
- Strong points
  – Accurate, if you know the right strategies
  – Efficient for the computer
- Weaknesses
  – Often results in too many documents, or none
  – Users must learn Boolean logic
  – Sometimes finds relationships that don’t exist
  – Words can have many meanings
  – Choosing the right words is sometimes hard

Ranked Retrieval Paradigm
- Exact-match retrieval often gives useless sets
  – No documents at all, or way too many
- Query reformulation is one “solution”
  – Manually add or delete query terms
- “Best-first” ranking can be superior
  – Select every document within reason
  – Put them in order, with the “best” ones first
  – Display them one screen at a time

Advantages of Ranked Retrieval
- Closer to the way people think
  – Some documents are better than others
- Enriches browsing behavior
  – Decide how far down the list to go as you read it
- Allows more flexible queries
  – Long and short queries can produce useful results

Ranked Retrieval Challenges
- “Best first” is easy to say but hard to do!
  – The best we can hope for is to approximate it
- Will the user understand the process?
  – It is hard to use a tool that you don’t understand
- Efficiency becomes a concern
  – Only a problem for long queries, though

Partial-Match Ranking
- Form several result sets from one long query
  – The query for the first set is the AND of all the terms
  – Then all but the 1st term, all but the 2nd, …
  – Then all but the first two terms, …
  – And so on, until each single-term query is tried
- Remove duplicates from subsequent sets
- Display the sets in the order they were made
  – Document rank within a set is arbitrary
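A sketch of this scheme (the postings dictionary is hypothetical; subsets of the same size are tried in an arbitrary but fixed order, since rank within a set is arbitrary anyway):

    from itertools import combinations

    def partial_match(query_terms, postings):
        """Rank documents by AND queries over ever-smaller term subsets."""
        ranked, seen = [], set()
        for size in range(len(query_terms), 0, -1):
            for subset in combinations(query_terms, size):
                docs = set.intersection(*(postings.get(t, set()) for t in subset))
                for doc in sorted(docs - seen):   # drop duplicates from later sets
                    ranked.append(doc)
                    seen.add(doc)
        return ranked

    postings = {"information": {1, 2, 5}, "retrieval": {1, 3}, "digital": {1, 4}}
    print(partial_match(["information", "retrieval", "digital"], postings))
    # [1, 2, 5, 3, 4]: doc 1 matches all three terms, the rest match one each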

Partial Match Example
Two-term query: information retrieval
- information AND retrieval:
  – Readings in Information Retrieval
  – Information Storage and Retrieval
  – Speech-Based Information Retrieval for Digital Libraries
  – Word Sense Disambiguation and Information Retrieval
- information NOT retrieval:
  – The State of the Art in Information Filtering
- retrieval NOT information:
  – Inference Networks for Document Retrieval
  – Content-Based Image Retrieval Systems
  – Video Parsing, Retrieval and Browsing
  – An Approach to Conceptual Text Retrieval Using the EuroWordNet …
  – Cross-Language Retrieval: English/Russian/French

Similarity-Based Queries
- Treat the query as if it were a document
  – Create a query bag-of-words
- Find the similarity of each document
  – Using the coordination measure, for example
- Rank-order the documents by similarity
  – Most similar to the query first
- Surprisingly, this works pretty well!
  – Especially for very short queries

Document Similarity
How similar are two documents? In particular, how similar are their bags of words?

1: Nuclear fallout contaminated Montana.
2: Information retrieval is interesting.
3: Information retrieval is complicated.

  Term           1  2  3
  nuclear        1  0  0
  fallout        1  0  0
  siberia        0  0  0
  contaminated   1  0  0
  interesting    0  1  0
  complicated    0  0  1
  information    0  1  1
  retrieval      0  1  1

The Coordination Measure
- Count the number of terms in common
  – Based on the Boolean bag-of-words
- Documents 2 and 3 share two common terms
  – But documents 1 and 2 share no terms at all
- Useful for “more like this” queries
  – “More like doc 2” would rank doc 3 ahead of doc 1
- Where have you seen this before?

Coordination Measure Example
Using the term-document matrix above:
- Query: complicated retrieval → Result: 3, 2
- Query: information retrieval → Result: 2, 3
- Query: interesting nuclear fallout → Result: 1, 2
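A sketch of the coordination measure over the three example documents; the tokenizer and the one-word stopword set are simplifying assumptions:

    STOPWORDS = {"is"}

    def terms(text):
        return {w.strip(".").lower() for w in text.split()} - STOPWORDS

    docs = {
        1: terms("Nuclear fallout contaminated Montana."),
        2: terms("Information retrieval is interesting."),
        3: terms("Information retrieval is complicated."),
    }

    def coordination(query):
        """Rank documents by the number of query terms they contain."""
        q = terms(query)
        scores = {d: len(q & t) for d, t in docs.items()}
        return [d for d, s in sorted(scores.items(), key=lambda x: -x[1]) if s > 0]

    print(coordination("complicated retrieval"))        # [3, 2]
    print(coordination("information retrieval"))        # [2, 3]
    print(coordination("interesting nuclear fallout"))  # [1, 2]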

Counting Terms
- Terms tell us about documents
  – If “rabbit” appears a lot, the document may be about rabbits
- Documents tell us about terms
  – “the” is in every document, so it is not discriminating
- Documents are most likely described well by rare terms that occur in them frequently
  – Higher “term frequency” is stronger evidence
  – Low “collection frequency” makes it stronger still

The Document Length Effect
- Humans look for documents with useful parts
  – But probabilities are computed for the whole document
- Document lengths vary in many collections
  – So probability calculations could be inconsistent
- Two strategies
  – Adjust probability estimates for document length
  – Divide the documents into equal “passages”

Incorporating Term Frequency
- High term frequency is evidence of meaning
  – And high IDF is evidence of term importance
- Recompute the bag-of-words
  – Compute TF * IDF for every element
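A sketch of the recomputation using one common variant (raw count for TF, log(N/DF) for IDF; the slides do not commit to a particular formula, and the three-document counts are hypothetical):

    import math

    docs = [
        {"nuclear": 1, "fallout": 1, "contaminated": 1},
        {"information": 1, "retrieval": 1, "interesting": 1},
        {"information": 1, "retrieval": 1, "complicated": 1},
    ]
    N = len(docs)

    def idf(term):
        df = sum(1 for d in docs if term in d)   # document frequency
        return math.log(N / df) if df else 0.0

    def tf_idf(doc):
        """Replace each raw term count with TF * IDF."""
        return {t: tf * idf(t) for t, tf in doc.items()}

    print(tf_idf(docs[1]))
    # information/retrieval get log(3/2) ~ 0.41; interesting gets log(3) ~ 1.10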

Weighted Matching Schemes
- Unweighted queries
  – Add up the weights for every matching term
- User-specified query term weights
  – For each term, multiply the query and document weights, then add up those values
- Automatically computed query term weights
  – Most queries lack useful TF, but IDF may be useful
  – Used just like user-specified query term weights
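All of these schemes reduce to an inner product; a minimal sketch with hypothetical weights:

    def unweighted_score(query_terms, doc_weights):
        """Unweighted query: add up the weights of every matching term."""
        return sum(doc_weights.get(t, 0.0) for t in query_terms)

    def weighted_score(query_weights, doc_weights):
        """Weighted query: inner product of query and document weights."""
        return sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())

    doc = {"contaminated": 1.1, "retrieval": 0.4}
    print(unweighted_score(["contaminated", "retrieval"], doc))      # 1.5
    print(weighted_score({"contaminated": 3, "retrieval": 1}, doc))  # 3.7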

TF*IDF Example
[Table: TF*IDF weights for nuclear, fallout, siberia, contaminated, interesting, complicated, information, and retrieval across four documents; the numeric weights did not survive extraction.]
- Unweighted query: contaminated retrieval → Result: 2, 3, 1, 4
- Weighted query: contaminated(3) retrieval(1) → Result: 1, 3, 2, 4
- IDF-weighted query: contaminated retrieval → Result: 2, 3, 1, 4

Document Length Normalization
- Long documents have an unfair advantage
  – They use a lot of terms, so they get more matches than short documents
  – And they use the same words repeatedly, so they have much higher term frequencies
- Normalization seeks to remove these effects
  – Related somehow to maximum term frequency
  – But also sensitive to the number of terms

“Cosine” Normalization
- Compute the length of each document vector
  – Multiply each weight by itself
  – Add all the resulting values
  – Take the square root of that sum
- Divide each weight by that length
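A sketch of cosine normalization and the similarity it induces; the two weight vectors are hypothetical:

    import math

    def normalize(weights):
        """Divide each weight by the Euclidean length of the vector."""
        length = math.sqrt(sum(w * w for w in weights.values()))
        return {t: w / length for t, w in weights.items()}

    def cosine(a, b):
        """Inner product of normalized vectors = cosine of the angle."""
        na, nb = normalize(a), normalize(b)
        return sum(w * nb.get(t, 0.0) for t, w in na.items())

    d1 = {"information": 2.0, "retrieval": 1.0}
    d2 = {"information": 4.0, "retrieval": 2.0}  # same direction, twice as long
    print(cosine(d1, d2))  # ~1.0: length does not affect the similarity
    print(cosine(d1, d1))  # ~1.0: every document is most similar to itself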

Cosine Normalization Example
[Table: the TF*IDF weights from the previous example together with each document's vector length and the length-normalized weights; the numeric values did not survive extraction.]
- Unweighted query: contaminated retrieval → Result: 2, 4, 1, 3 (compare to 2, 3, 1, 4 without normalization)

Why Call It “Cosine”?
[Figure: two document vectors d1 and d2 drawn from the origin, separated by the angle θ; the similarity measure is the cosine of θ.]

Interpreting the Cosine Measure
- Think of a document as a vector from the origin
- Similarity is the angle between two vectors
  – Small angle = very similar
  – Large angle = little similarity
- Passes some key sanity checks
  – Depends on the pattern of word use, but not on length
  – Every document is most similar to itself

“Okapi” Term Weights
One widely used formulation, with the TF component on the left and the IDF component on the right:

  $w_{i,j} = \frac{tf_{i,j}}{tf_{i,j} + 0.5 + 1.5\,(dl_j / avdl)} \times \log\frac{N - df_i + 0.5}{df_i + 0.5}$

where $tf_{i,j}$ is the frequency of term $i$ in document $j$, $dl_j$ is the length of document $j$, $avdl$ is the average document length, $N$ is the number of documents, and $df_i$ is the number of documents containing term $i$.
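A sketch of the weight above in Python; the argument names are illustrative:

    import math

    def okapi_weight(tf, dl, avdl, N, df):
        """Okapi term weight: length-adjusted TF times a smoothed IDF."""
        tf_component = tf / (tf + 0.5 + 1.5 * dl / avdl)
        idf_component = math.log((N - df + 0.5) / (df + 0.5))
        return tf_component * idf_component

    # A term occurring 3 times in an average-length document,
    # appearing in 100 of 10,000 documents:
    print(okapi_weight(tf=3, dl=500, avdl=500, N=10_000, df=100))  # ~2.76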

Passage Retrieval
- Another approach to the long-document problem
  – Break the document into coherent units
- Recognizing topic boundaries is hard
  – But overlapping 300-word passages work fine
- Document rank is the rank of its best passage
  – And passage information can help guide browsing
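A sketch of overlapping fixed-width passages; the 50% overlap (stride of 150) is an assumption, since the slides fix only the 300-word width:

    def passages(text, width=300, stride=150):
        """Split a document into overlapping fixed-width word windows."""
        words = text.split()
        if len(words) <= width:
            return [" ".join(words)]
        return [" ".join(words[i:i + width])
                for i in range(0, len(words) - width + stride, stride)]

    def document_score(text, passage_score):
        """Document rank is driven by the score of its best passage."""
        return max(passage_score(p) for p in passages(text))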

Summary
- Goal: find the documents most similar to the query
- Compute normalized document term weights
  – Some combination of TF, DF, and length
- Optionally, get query term weights from the user
  – An estimate of term importance
- Compute the inner product of the query and document vectors
  – Multiply corresponding elements and then add

Before You Go!
On a sheet of paper, please briefly answer the following question (no names):
What was the muddiest point in today’s lecture?